From dataset to results

In buzz, a Dataset is not itself designed for gaining insights into corpus data. Instead, it is simply the contents of a loaded corpus or a search result: all the entries you want, with all their attributes, even those attributes that are not important to you.

After you have the information you need within a Dataset, the next step is to transform it into data that you can more easily interpret, such as a table of frequencies, or keyness scores.

For this, you can use the dataset.table() method. It distills the total information of a dataset into something simpler. By default, this is a matrix of words by file:

from buzz import Corpus

dtrt = Corpus('dtrt/do-the-right-thing-parsed').load()
dtrt.table().head()
w                                     .   ,  the  ..  you  and   i  to  a  's
file
01-we-love-radio-station-storefront  17  13   16   3    0   12   4   2  6   5
02-da-mayor's-bedroom                 2   3    1   1    0    0   0   0  0   1
03-jade's-apartment                   5   4    5   0    0    0   0   0  0   3
04-jade's-bedroom                    15   6    3   6    1    4   4   6  3   6
05-sal's-famous-pizzeria             16  12   14  15    4    4   6   7  7   6
06-mookie's-brownstone                1   0    0   0    0    1   0   1  0   0
07-street                             2   1    4   0    0    1   0   5  0   0
08-mother-sister's-stoop             10   7    6   4    3    0   4   3  0   3
09-sal's-famous-pizzeria             32  17   10  23    8    6  10   4  9   7
10-sal's-famous-pizzeria              2   1    1   0    0    1   0   0  0   0

This is the most basic kind of table.
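Conceptually, a Dataset holds one token per row, and table() pivots those rows into a count matrix. Here is a minimal pandas sketch of that pivot, not buzz's actual implementation; the toy data is invented, but the column names w and file mirror the output above:

```python
import pandas as pd

# A toy token-per-row frame standing in for a loaded Dataset:
# each row is one token, tagged with the file it came from.
tokens = pd.DataFrame({
    "file": ["01-storefront", "01-storefront", "02-bedroom", "02-bedroom"],
    "w":    ["the", "radio", "the", "mayor"],
})

# table() with the defaults amounts to a crosstab of files by word forms.
counts = pd.crosstab(tokens["file"], tokens["w"])
print(counts)
```

Each cell is the number of times a given word form occurred in a given file, with zeros where a word never appears.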

For more complex tables, you can use a number of keyword arguments:

| Keyword argument | Default | Type | Purpose |
|---|---|---|---|
| show | ['w'] | list of str | Token features to use as the new columns. For example, ["w", "l"] will show word/lemma |
| subcorpora | ['file'] | list of str | Dataset columns/index levels to use as the new index. Same format as show |
| preserve_case | False | bool | Keep original case; by default, results are lowercased before counting |
| sort | "total" | str | Sort output columns by "total"/"infreq" or "name"/"reverse". You can also pass "increase"/"decrease" or "static"/"turbulent", which perform linear regression and sort by slope |
| relative | False | bool/Dataset | Get relative frequencies. Pass a Dataset to use as the reference corpus, or True to use the current dataset as the reference (more info below) |
| keyness | False | bool/"pd"/"ll" | Calculate keyness (percentage difference or log-likelihood) |
| remove_above_p | False | bool/float | If sort triggered linear regression, p-values were generated; pass a float threshold, or True to use 0.05. Results with a higher p-value are dropped |
| multiindex_columns | False | bool | If len(show) > 1, make the columns a pandas MultiIndex rather than slash-separated strings |
| keep_stats | False | bool | If sort triggered linear regression, keep the associated stats in the table |
| show_entities | False | bool | Display the whole entity, rather than just the matching tokens within the entity |
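To make the keyness option more concrete, here is a rough sketch of the percentage-difference measure, given relative frequencies for a target and a reference corpus. The formula and toy numbers are illustrative only, and may not match buzz's exact implementation:

```python
def percentage_difference(target_rel, reference_rel):
    """Percentage-difference keyness: how much more (or less)
    frequent a word is in the target than in the reference."""
    return (target_rel - reference_rel) * 100.0 / reference_rel

# A word making up 2% of the target corpus but only 1% of the
# reference is 100% more frequent in the target.
print(percentage_difference(0.02, 0.01))  # -> 100.0
```

A negative score means the word is under-represented in the target; the log-likelihood option ("ll") instead scores how surprising the difference is given the corpus sizes.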

show and subcorpora

show and subcorpora can be used to customise exactly what your index and column values will be. The main features you will need can be accessed as per the following:

| Feature | Valid accessors |
|---|---|
| file | file |
| sentence | s |
| word | w |
| lemma | l |
| POS tag | p |
| wordclass | x |
| dependency label | f |
| governor index | g |
| speaker | speaker |

So, ["l", "x", "speaker"] will produce lemma/wordclass/speaker, which may look like champion/noun/tony. Any additional metadata in your corpus can also be accessed and displayed.
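To show how multiple features combine into one label, here is a small pure-Python sketch of the slash-joining itself; the token dict is invented, and buzz builds these labels internally:

```python
# One token with the features used in the example above.
token = {"l": "champion", "x": "NOUN", "speaker": "TONY"}

# Joining the requested features with "/" yields the combined label.
# Lowercasing mimics the default preserve_case=False behaviour.
show = ["l", "x", "speaker"]
label = "/".join(str(token[feature]).lower() for feature in show)
print(label)  # -> champion/noun/tony
```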

Showing surrounding tokens

In addition to the regular show and subcorpora strings, you can use a special notation to get features from surrounding tokens: +1w will get the following word form; -2l will get the lemma form of the token two places back. Numbers 1-9 are accepted.

So, you could do ["-1x", "l", "x", "+1x"] to show the preceding token's wordclass, matching lemma, matching wordclass, and following token's wordclass. This can be useful if you are interested in n-grams, group patterns, and the like.
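Under the hood, features like +1w and -2l amount to shifting a token column. A minimal pandas sketch with invented data (buzz's actual implementation may differ, e.g. in how it handles sentence boundaries):

```python
import pandas as pd

tokens = pd.DataFrame({
    "w": ["do", "the", "right", "thing"],
    "l": ["do", "the", "right", "thing"],
})

# +1w: the word form of the following token (shift the column up by one).
tokens["+1w"] = tokens["w"].shift(-1)
# -2l: the lemma form of the token two places back (shift down by two).
tokens["-2l"] = tokens["l"].shift(2)
print(tokens)
```

Tokens near the edges get missing values, since there is no preceding or following token to borrow from.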

Multiindexing

When you have multiple items in show/subcorpora, by default the index of the table is a pd.MultiIndex, but the columns are slash-separated strings kept as a single level.

If you want multiindex columns, you can turn them on with:

multi = dtrt.table(show=['l', 'x'], multiindex_columns=True)

If you want a slash-separated index instead, the easiest way is to transpose the table with pandas' .T:

multi_row = dtrt.table(show=['file', 'speaker'], subcorpora=['w']).T
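The difference between the two column styles can be seen in plain pandas. This sketch uses an invented frame; splitting on "/" stands in for what multiindex_columns=True does, and .T shows how transposing moves the MultiIndex to the rows:

```python
import pandas as pd

# Slash-joined single-level columns, as table() produces by default.
df = pd.DataFrame(
    [[3, 1], [2, 5]],
    index=pd.Index(["file-a", "file-b"], name="file"),
    columns=["run/VERB", "run/NOUN"],
)

# multiindex_columns=True corresponds to splitting each label on "/".
df.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("/")) for col in df.columns], names=["l", "x"])
print(df["run"])  # selects both wordclasses of the lemma "run"

# .T swaps rows and columns, so the MultiIndex becomes the row index.
transposed = df.T
print(list(transposed.index.names))
```

With a MultiIndex you can slice by one level at a time, which is awkward with slash-joined strings.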

Relative frequencies

Absolute frequencies can be difficult to interpret, especially when your subcorpora are different sizes. To normalise the frequencies, use the relative keyword argument.

The relative keyword argument can take the following values:

| Value for relative | Interpretation |
|---|---|
| False | Default: just use absolute frequencies |
| True | Calculate relative frequencies using the sum of the axis |
| buzz.Dataset | Turn the dataset into a table using the same criteria used on the main table, and use the values of the result as the denominators |
| pd.Series | Use the values in the Series as denominators |
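In pandas terms, relative=True amounts to dividing each row by its total, and a Series of denominators replaces those totals. A small sketch with toy counts (expressing the result as percentages is an assumption here, for readability):

```python
import pandas as pd

counts = pd.DataFrame({"the": [10, 30], "pizza": [10, 10]},
                      index=["scene-1", "scene-2"])

# relative=True: divide each row by that subcorpus's token total.
rel = counts.div(counts.sum(axis=1), axis=0) * 100
print(rel)

# relative=<pd.Series>: use externally supplied denominators instead,
# e.g. token totals taken from a larger reference corpus.
denominators = pd.Series([100, 200], index=["scene-1", "scene-2"])
rel_ref = counts.div(denominators, axis=0) * 100
print(rel_ref)
```

The first form answers "what share of this subcorpus is this word?"; the second answers the same question against a denominator you control.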

For example, to find out the frequency of each noun in the corpus, relative to all nouns in the corpus, you can do:

dtrt.just.wordclass.NOUN.table(relative=True)

And to find out the frequency of nouns, relative to all tokens in the corpus:

dtrt.just.wordclass.NOUN.table(relative=dtrt)

So, the following two lines give the same output:

dtrt.see.pos.by.speaker.relative().sort('total')
dtrt.table(show=['p'], subcorpora=['speaker'], relative=True, sort='total')