From dataset to results
In buzz, a Dataset is not designed to be interpreted directly. It is simply the contents of a loaded corpus or a search result: every entry you want, with all of its attributes, even those you don't currently care about.
After you have the information you need within a Dataset, the next step is to transform it into data that you can more easily interpret, such as a table of frequencies, or keyness scores.
For this, you can use the `dataset.table()` method. It distills the total information of a dataset into something simpler. By default, this is a matrix of words by file:
from buzz import Corpus

dtrt = Corpus('dtrt/do-the-right-thing-parsed').load()
dtrt.table().head().to_html()
| file | . | , | the | .. | you | and | i | to | a | 's |
|---|---|---|---|---|---|---|---|---|---|---|
| 01-we-love-radio-station-storefront | 17 | 13 | 16 | 3 | 0 | 12 | 4 | 2 | 6 | 5 |
| 02-da-mayor's-bedroom | 2 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 03-jade's-apartment | 5 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 04-jade's-bedroom | 15 | 6 | 3 | 6 | 1 | 4 | 4 | 6 | 3 | 6 |
| 05-sal's-famous-pizzeria | 16 | 12 | 14 | 15 | 4 | 4 | 6 | 7 | 7 | 6 |
| 06-mookie's-brownstone | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 07-street | 2 | 1 | 4 | 0 | 0 | 1 | 0 | 5 | 0 | 0 |
| 08-mother-sister's-stoop | 10 | 7 | 6 | 4 | 3 | 0 | 4 | 3 | 0 | 3 |
| 09-sal's-famous-pizzeria | 32 | 17 | 10 | 23 | 8 | 6 | 10 | 4 | 9 | 7 |
| 10-sal's-famous-pizzeria | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
This is the most basic kind of table.
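Conceptually, a words-by-file matrix like the one above is a cross-tabulation of two token-level columns. A minimal pandas sketch with toy data (this illustrates the shape of the output, not buzz's actual implementation):

```python
import pandas as pd

# Toy token-level data: one row per token, as in a loaded Dataset
tokens = pd.DataFrame({
    "file": ["01-storefront", "01-storefront", "02-bedroom", "02-bedroom"],
    "w": ["the", "you", "the", "the"],
})

# Count each word per file: files as index, words as columns
table = pd.crosstab(tokens["file"], tokens["w"])
print(table)
```

Words missing from a file simply get a count of zero in that row.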
For more complex tables, you can use a number of keyword arguments:
| Keyword argument | Default | Type | Purpose |
|---|---|---|---|
| `show` | `['w']` | list of str | Which token features to use as the new columns. For example, `["w", "l"]` will show `word/lemma` |
| `subcorpora` | `['file']` | list of str | Dataset columns/index levels to use as the new index. Same format as `show` |
| `preserve_case` | `False` | bool | If `False`, results are lowercased before counting |
| `sort` | `"total"` | str | Sort output columns by `"total"`/`"infreq"` or `"name"`/`"reverse"`. You can also pass `"increase"`/`"decrease"` or `"static"`/`"turbulent"`, which fit a linear regression and sort by slope |
| `relative` | `False` | bool/Dataset | Get relative frequencies. Pass in a dataset to use as the reference corpus, or `True` to use the current dataset as the reference (more info below) |
| `keyness` | `False` | bool/`"pd"`/`"ll"` | Calculate keyness: percentage difference (`"pd"`) or log-likelihood (`"ll"`) |
| `remove_above_p` | `False` | bool/float | If `sort` triggered linear regression, p-values were generated; pass in a float value, or `True` to use `0.05`. Results with a higher p-value will be dropped |
| `multiindex_columns` | `False` | bool | If `len(show) > 1`, make the columns a pandas MultiIndex rather than slash-separated strings |
| `keep_stats` | `False` | bool | If `sort` triggered linear regression, keep the associated stats in the table |
| `show_entities` | `False` | bool | Display the whole entity, rather than just the matching tokens within the entity |
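The `"increase"`/`"decrease"` sort options fit a line to each column's frequencies across the ordered subcorpora and sort by the slope. A toy illustration of that idea in pandas/numpy (an assumption about the general approach, not buzz's exact code):

```python
import numpy as np
import pandas as pd

# Toy frequency table: subcorpora (rows) by words (columns)
table = pd.DataFrame(
    {"rising": [1, 2, 3, 4], "falling": [4, 3, 2, 1], "flat": [2, 2, 2, 2]},
    index=["part1", "part2", "part3", "part4"],
)

# Fit a line to each column's frequencies across the ordered subcorpora
x = np.arange(len(table))
slopes = table.apply(lambda col: np.polyfit(x, col, 1)[0])

# sort="increase": steepest upward trend first
increasing = table[slopes.sort_values(ascending=False).index]
print(list(increasing.columns))
```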
show and subcorpora
`show` and `subcorpora` can be used to customise exactly what your index and column values will be. The main features you will need can be accessed as follows:
| Feature | Valid accessor |
|---|---|
| file | `file` |
| sentence # | `s` |
| word | `w` |
| lemma | `l` |
| POS tag | `p` |
| wordclass | `x` |
| dependency label | `f` |
| governor index | `g` |
| speaker | `speaker` |
So, `["l", "x", "speaker"]` will produce `lemma/wordclass/speaker`, which may look like `champion/noun/tony`. Any additional metadata in your corpus can also be accessed and displayed.
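Conceptually, these slash-separated labels are just the requested features joined with `/`. A toy pandas sketch (hypothetical data; not buzz internals):

```python
import pandas as pd

# One row per token, with the features requested in show
tokens = pd.DataFrame({
    "l": ["champion", "be"],
    "x": ["noun", "verb"],
    "speaker": ["tony", "sal"],
})

# show=["l", "x", "speaker"]: join the features with slashes
labels = tokens[["l", "x", "speaker"]].agg("/".join, axis=1)
print(labels.tolist())
```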
Showing surrounding tokens
In addition to the regular `show` and `subcorpora` strings, you can use a special notation to get features from surrounding words: `+1w` will get the following word form; `-2l` will get the lemma form of the token two places back. Numbers 1-9 are accepted. So, you could use `["-1x", "l", "x", "+1x"]` to show the preceding token's wordclass, the matching lemma, the matching wordclass, and the following token's wordclass. This can be useful if you are interested in n-grams, group patterns, and the like.
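Conceptually, `-1x`/`+1x` correspond to shifting a token-level feature column forwards or backwards by one position. A toy pandas sketch of that idea (not buzz's implementation):

```python
import pandas as pd

# One row per token, in corpus order
tokens = pd.DataFrame({
    "w": ["do", "the", "right", "thing"],
    "x": ["verb", "det", "adj", "noun"],
})

# -1x / +1x: the previous / next token's wordclass, via shifting
prev_x = tokens["x"].shift(1)   # -1x
next_x = tokens["x"].shift(-1)  # +1x
print(next_x.tolist())
```

Tokens at the edges of the corpus have no neighbour, so the shifted values there are missing (`NaN`).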
Multiindexing
When you have multiple items in `show`/`subcorpora`, by default the index of the table is a `pd.MultiIndex`, but the columns are slash-separated and kept as a single level. If you want MultiIndex columns, you can turn them on with:
multi = dtrt.table(show=['l', 'x'], multiindex_columns=True)
If you want a slash-separated index instead, the easiest way is to use pandas' transpose, `.T`:
multi_row = dtrt.table(show=['file', 'speaker'], subcorpora=['w']).T
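If you already have a table with slash-separated column labels, plain pandas can rebuild them into a MultiIndex, which is roughly what `multiindex_columns=True` does for you (a sketch, not buzz internals):

```python
import pandas as pd

# A table whose columns are slash-separated l/x strings
table = pd.DataFrame([[3, 1]], columns=["champion/noun", "be/verb"])

# Split each label on "/" to build a two-level MultiIndex
table.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("/")) for c in table.columns], names=["l", "x"]
)
print(table.columns.tolist())
```

With MultiIndex columns you can select whole levels at once, e.g. `table.xs("noun", level="x", axis=1)`.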
Relative frequencies
Absolute frequencies can be difficult to interpret, especially when your subcorpora are different sizes. To normalise the frequencies, use the `relative` keyword argument. Its value can be:
| Value for `relative` | Interpretation |
|---|---|
| `False` | Default: just use absolute frequencies |
| `True` | Calculate relative frequencies using the sum of the axis |
| `buzz.Dataset` | Turn the dataset into a table using the same criteria used on the main table, and use the values of the result as the denominators |
| `pd.Series` | Use the values in the Series as the denominators |
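With `relative=True`, the absolute counts are divided by the axis totals, so each row becomes a distribution. A toy pandas sketch of that arithmetic (assuming row-wise normalisation; not buzz's exact code):

```python
import pandas as pd

# Absolute frequencies: files (rows) by words (columns)
abs_freq = pd.DataFrame(
    {"the": [16, 1], "you": [0, 1]},
    index=["01-storefront", "02-bedroom"],
)

# relative=True: divide each row by its total, so rows sum to 1.0
rel = abs_freq.div(abs_freq.sum(axis=1), axis=0)
print(rel)
```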
For example, to find out the frequency of each noun in the corpus, relative to all nouns in the corpus, you can do:
dtrt.just.wordclass.NOUN.table(relative=True)
And to find out the frequency of nouns, relative to all tokens in the corpus:
dtrt.just.wordclass.NOUN.table(relative=dtrt)
So, the following two lines give the same output:
dtrt.see.pos.by.speaker.relative().sort('total')
dtrt.table(show=['p'], subcorpora=['speaker'], relative=True, sort='total')