From dataset to results

In buzz, a Dataset is not meant to be used directly for gaining insights into corpus data. Instead, a Dataset is simply the contents of a loaded corpus or a search result: all the entries you want and all their attributes, even when some of these attributes are not important to you.

After you have the information you need within a Dataset, the next step is to transform it into data that you can more easily interpret, such as a table of frequencies, or keyness scores.

For this, you can use the dataset.table() method. It distills the total information of a dataset into something simpler. By default, this is a matrix of words by file:

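from buzz import Corpus

# Load the parsed corpus into memory as a Dataset
dtrt = Corpus('dtrt/do-the-right-thing-parsed').load()
# Build the default table (words by file) and render it as HTML
dtrt.table().head().to_html()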
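| file | . | , | the | .. | you | and | i | to | a | 's |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 01-we-love-radio-station-storefront | 17 | 13 | 16 | 3 | 0 | 12 | 4 | 2 | 6 | 5 |
| 02-da-mayor's-bedroom | 2 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 03-jade's-apartment | 5 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 04-jade's-bedroom | 15 | 6 | 3 | 6 | 1 | 4 | 4 | 6 | 3 | 6 |
| 05-sal's-famous-pizzeria | 16 | 12 | 14 | 15 | 4 | 4 | 6 | 7 | 7 | 6 |
| 06-mookie's-brownstone | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 07-street | 2 | 1 | 4 | 0 | 0 | 1 | 0 | 5 | 0 | 0 |
| 08-mother-sister's-stoop | 10 | 7 | 6 | 4 | 3 | 0 | 4 | 3 | 0 | 3 |
| 09-sal's-famous-pizzeria | 32 | 17 | 10 | 23 | 8 | 6 | 10 | 4 | 9 | 7 |
| 10-sal's-famous-pizzeria | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |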

This is the most basic kind of table.
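To make the shape of this output concrete, here is a minimal pandas sketch of the same idea: counting one token attribute per file. It does not call buzz. The toy `tokens` frame and its values are invented for illustration, and `pd.crosstab` is only a stand-in for whatever `table()` does internally.

```python
import pandas as pd

# Toy token-level data: one row per token, loosely mimicking a Dataset.
# The file names and words are made up for illustration.
tokens = pd.DataFrame(
    {
        "file": ["01-intro", "01-intro", "01-intro", "02-street", "02-street"],
        "w": ["the", "you", "the", "you", "and"],
    }
)

# Rows are subcorpora (files), columns are words, cells are absolute counts.
counts = pd.crosstab(tokens["file"], tokens["w"])

# Order the columns by overall frequency, most frequent first.
totals = counts.sum(axis=0).sort_values(ascending=False)
counts = counts[totals.index]
print(counts)
```

Each cell answers the question "how many times does this word occur in this file?", which is the same reading as the table above.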

For more complex tables, you can use a number of keyword arguments:

| Keyword argument | default | type | purpose |
| --- | --- | --- | --- |
| show | ['w'] | list of strings | |
| subcorpora | ['file'] | list of strings | Dataset columns/index levels to use as new index |

When show or subcorpora contains multiple items, the index of the Table is by default a pd.MultiIndex, while the columns are slash-separated and kept as a single level.

If you want multiindex columns, you can turn them on with:

multi = dtrt.table(show=['l', 'x'], multiindex_columns=True)

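If you want a slash-separated index instead, the easiest way is to pass the index attributes as show, so that they become slash-separated columns, and then flip the table with pandas' transpose method: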

multi_row = dtrt.table(show=['file', 'speaker'], subcorpora=['w']).T
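The difference between the two layouts can be sketched in plain pandas. This is an illustration with invented lemma and wordclass values, not buzz's own implementation; it only shows how slash-joined labels relate to a `pd.MultiIndex`, and how transposing moves them onto the rows.

```python
import pandas as pd

# Invented counts for two show values (lemma 'l' and wordclass 'x').
multi = pd.DataFrame(
    [[3, 1, 2], [0, 4, 1]],
    index=pd.Index(["01-intro", "02-street"], name="file"),
    columns=pd.MultiIndex.from_tuples(
        [("pizza", "NOUN"), ("love", "VERB"), ("love", "NOUN")],
        names=["l", "x"],
    ),
)

# Collapse the two column levels into single slash-separated labels.
flat = multi.copy()
flat.columns = ["/".join(levels) for levels in multi.columns]

# Transposing swaps index and columns, so the slash-separated labels
# end up on the rows instead.
flat_rows = flat.T
print(flat_rows)
```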

Relative frequencies

Absolute frequencies can be difficult to interpret, especially when your subcorpora are different sizes. To normalise the frequencies, use the relative keyword argument.

The value of the relative keyword argument can be:

| Value for relative | Interpretation |
| --- | --- |
| False | Default: just use absolute frequencies |
| True | Calculate relative frequencies using the sum of the axis |
| buzz.Dataset | Turn the dataset into a table using the same criteria used on the main table, and use the values of the result as the denominators |
| pd.Series | Use the values in the Series as denominators |
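As a rough sketch of the arithmetic behind these options, the pandas snippet below normalises a small invented table in two ways. It assumes that "the sum of the axis" means each row's total (one total per subcorpus) and expresses results as proportions; buzz's actual axis choice and scaling may differ.

```python
import pandas as pd

# Invented absolute frequencies: rows are subcorpora, columns are words.
counts = pd.DataFrame(
    {"the": [3, 15], "you": [12, 4], "and": [5, 1]},
    index=pd.Index(["01-intro", "02-street"], name="file"),
)

# Analogue of relative=True (assumed): divide each row by its own total.
row_totals = counts.sum(axis=1)
rel_true = counts.div(row_totals, axis=0)

# Analogue of passing a pd.Series: divide by externally supplied
# denominators, here a hypothetical total token count for each file.
file_sizes = pd.Series({"01-intro": 200, "02-street": 100})
rel_series = counts.div(file_sizes, axis=0)
print(rel_series)
```

With row totals as denominators, every row sums to 1, which makes subcorpora of different sizes directly comparable.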

For example, to find out the frequency of each noun in the corpus, relative to all nouns in the corpus, you can do:

dtrt.just.wordclass.NOUN.table(relative=True)

And to find out the frequency of nouns, relative to all tokens in the corpus:

dtrt.just.wordclass.NOUN.table(relative=dtrt)
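The contrast between these two denominators can be worked through on toy data. The sketch below uses plain pandas rather than buzz, with invented tokens, and assumes the denominators are per-file totals: nouns per file in the first case, all tokens per file in the second.

```python
import pandas as pd

# Invented token-level data with a wordclass column 'x'.
tokens = pd.DataFrame(
    {
        "file": ["a", "a", "a", "a", "b", "b"],
        "w": ["pizza", "is", "pizza", "street", "street", "hot"],
        "x": ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "ADJ"],
    }
)

# Keep only the nouns, then count each noun per file.
nouns = tokens[tokens["x"] == "NOUN"]
noun_counts = pd.crosstab(nouns["file"], nouns["w"])

# Relative to all nouns in each file.
rel_to_nouns = noun_counts.div(noun_counts.sum(axis=1), axis=0)

# Relative to all tokens in each file: denominators come from the full data.
tokens_per_file = tokens.groupby("file").size()
rel_to_tokens = noun_counts.div(tokens_per_file, axis=0)
print(rel_to_tokens)
```

In file a, "pizza" makes up two of three nouns but only two of four tokens, so the same count yields different relative frequencies depending on the denominator.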

The following two lines give the same output:

dtrt.see.pos.by.speaker.relative().sort('total')
dtrt.table(show=['p'], subcorpora=['speaker'], relative=True, sort='total')