From dataset to results
In buzz, a Dataset is not designed to be interpreted directly. It is simply the contents of a loaded corpus or a search result: every entry you want, with all of its attributes, even those you don't currently care about.
After you have the information you need within a Dataset, the next step is to transform it into data that you can more easily interpret, such as a table of frequencies, or keyness scores.
For this, you can use the `dataset.table()` method. It distills the total information of a dataset into something simpler. By default, this is a matrix of words by file:
from buzz import Corpus

dtrt = Corpus('dtrt/do-the-right-thing-parsed').load()
dtrt.table().head().to_html()
| file | . | , | the | .. | you | and | i | to | a | 's |
|---|---|---|---|---|---|---|---|---|---|---|
| 01-we-love-radio-station-storefront | 17 | 13 | 16 | 3 | 0 | 12 | 4 | 2 | 6 | 5 |
| 02-da-mayor's-bedroom | 2 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 03-jade's-apartment | 5 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 04-jade's-bedroom | 15 | 6 | 3 | 6 | 1 | 4 | 4 | 6 | 3 | 6 |
| 05-sal's-famous-pizzeria | 16 | 12 | 14 | 15 | 4 | 4 | 6 | 7 | 7 | 6 |
| 06-mookie's-brownstone | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 07-street | 2 | 1 | 4 | 0 | 0 | 1 | 0 | 5 | 0 | 0 |
| 08-mother-sister's-stoop | 10 | 7 | 6 | 4 | 3 | 0 | 4 | 3 | 0 | 3 |
| 09-sal's-famous-pizzeria | 32 | 17 | 10 | 23 | 8 | 6 | 10 | 4 | 9 | 7 |
| 10-sal's-famous-pizzeria | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
This is the most basic kind of table.
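Conceptually, a words-by-file matrix like the one above is a cross-tabulation of two token-level columns. A minimal pandas sketch with toy data (this illustrates the shape of the output, not buzz's actual implementation):

```python
import pandas as pd

# Toy token-level data: one row per token, as in a loaded Dataset
tokens = pd.DataFrame({
    "file": ["01-storefront", "01-storefront", "02-bedroom", "02-bedroom"],
    "w": ["the", "you", "the", "the"],
})

# Count each word per file: files as index, words as columns
table = pd.crosstab(tokens["file"], tokens["w"])
print(table)
```

Words missing from a file simply get a count of zero in that row.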
For more complex tables, you can use a number of keyword arguments:
| Keyword argument | Default | Type | Purpose |
|---|---|---|---|
| `show` | `['w']` | list of str | Which token features to use as the new columns. For example, `["w", "l"]` will show `word/lemma` |
| `subcorpora` | `['file']` | list of str | Dataset columns/index levels to use as the new index. Same format as `show` |
| `preserve_case` | `False` | bool | If `False`, results are lowercased before counting |
| `sort` | `"total"` | str | Sort output columns by `"total"`/`"infreq"` or `"name"`/`"reverse"`. You can also pass `"increase"`/`"decrease"` or `"static"`/`"turbulent"`, which fit a linear regression and sort by slope |
| `relative` | `False` | bool/Dataset | Get relative frequencies. Pass in a dataset to use as the reference corpus, or `True` to use the current dataset as the reference (more info below) |
| `keyness` | `False` | bool/`"pd"`/`"ll"` | Calculate keyness: percentage difference (`"pd"`) or log-likelihood (`"ll"`) |
| `remove_above_p` | `False` | bool/float | If `sort` triggered linear regression, p-values were generated; pass in a float value, or `True` to use `0.05`. Results with a higher p-value will be dropped |
| `multiindex_columns` | `False` | bool | If `len(show) > 1`, make the columns a pandas MultiIndex rather than slash-separated strings |
| `keep_stats` | `False` | bool | If `sort` triggered linear regression, keep the associated stats in the table |
| `show_entities` | `False` | bool | Display the whole entity, rather than just the matching tokens within the entity |
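The `"increase"`/`"decrease"` sort options fit a line to each column's frequencies across the ordered subcorpora and sort by the slope. A toy illustration of that idea in pandas/numpy (an assumption about the general approach, not buzz's exact code):

```python
import numpy as np
import pandas as pd

# Toy frequency table: subcorpora (rows) by words (columns)
table = pd.DataFrame(
    {"rising": [1, 2, 3, 4], "falling": [4, 3, 2, 1], "flat": [2, 2, 2, 2]},
    index=["part1", "part2", "part3", "part4"],
)

# Fit a line to each column's frequencies across the ordered subcorpora
x = np.arange(len(table))
slopes = table.apply(lambda col: np.polyfit(x, col, 1)[0])

# sort="increase": steepest upward trend first
increasing = table[slopes.sort_values(ascending=False).index]
print(list(increasing.columns))
```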
show and subcorpora
`show` and `subcorpora` can be used to customise exactly what your index and column values will be. The main features you will need can be accessed as follows:
| Feature | Valid accessor |
|---|---|
| file | `file` |
| sentence # | `s` |
| word | `w` |
| lemma | `l` |
| POS tag | `p` |
| wordclass | `x` |
| dependency label | `f` |
| governor index | `g` |
| speaker | `speaker` |
So, `["l", "x", "speaker"]` will produce `lemma/wordclass/speaker`, which may look like `champion/noun/tony`. Any additional metadata in your corpus can also be accessed and displayed.
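Conceptually, these slash-separated labels are just the requested features joined with `/`. A toy pandas sketch (hypothetical data; not buzz internals):

```python
import pandas as pd

# One row per token, with the features requested in show
tokens = pd.DataFrame({
    "l": ["champion", "be"],
    "x": ["noun", "verb"],
    "speaker": ["tony", "sal"],
})

# show=["l", "x", "speaker"]: join the features with slashes
labels = tokens[["l", "x", "speaker"]].agg("/".join, axis=1)
print(labels.tolist())
```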
Showing surrounding tokens
In addition to the regular `show` and `subcorpora` strings, you can use a special notation to get features from surrounding words: `+1w` will get the following word form; `-2l` will get the lemma form of the token two places back. Numbers 1-9 are accepted. So, you could use `["-1x", "l", "x", "+1x"]` to show the preceding token's wordclass, the matching lemma, the matching wordclass, and the following token's wordclass. This can be useful if you are interested in n-grams, group patterns, and the like.
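Conceptually, `-1x`/`+1x` correspond to shifting a token-level feature column forwards or backwards by one position. A toy pandas sketch of that idea (not buzz's implementation):

```python
import pandas as pd

# One row per token, in corpus order
tokens = pd.DataFrame({
    "w": ["do", "the", "right", "thing"],
    "x": ["verb", "det", "adj", "noun"],
})

# -1x / +1x: the previous / next token's wordclass, via shifting
prev_x = tokens["x"].shift(1)   # -1x
next_x = tokens["x"].shift(-1)  # +1x
print(next_x.tolist())
```

Tokens at the edges of the corpus have no neighbour, so the shifted values there are missing (`NaN`).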
Multiindexing
When you have multiple items in `show`/`subcorpora`, by default the index of the table is a `pd.MultiIndex`, but the columns are slash-separated and kept as a single level. If you want MultiIndex columns, you can turn them on with:
multi = dtrt.table(show=['l', 'x'], multiindex_columns=True)
If you want a slash-separated index instead, the easiest way is to use pandas' transpose, `.T`:
multi_row = dtrt.table(show=['file', 'speaker'], subcorpora=['w']).T
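If you already have a table with slash-separated column labels, plain pandas can rebuild them into a MultiIndex, which is roughly what `multiindex_columns=True` does for you (a sketch, not buzz internals):

```python
import pandas as pd

# A table whose columns are slash-separated l/x strings
table = pd.DataFrame([[3, 1]], columns=["champion/noun", "be/verb"])

# Split each label on "/" to build a two-level MultiIndex
table.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("/")) for c in table.columns], names=["l", "x"]
)
print(table.columns.tolist())
```

With MultiIndex columns you can select whole levels at once, e.g. `table.xs("noun", level="x", axis=1)`.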
Relative frequencies
Absolute frequencies can be difficult to interpret, especially when your subcorpora are different sizes. To normalise the frequencies, use the `relative` keyword argument. Its value can be:
| Value for `relative` | Interpretation |
|---|---|
| `False` | Default: just use absolute frequencies |
| `True` | Calculate relative frequencies using the sum of the axis |
| `buzz.Dataset` | Turn the dataset into a table using the same criteria used on the main table, and use the values of the result as the denominators |
| `pd.Series` | Use the values in the Series as the denominators |
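With `relative=True`, the absolute counts are divided by the axis totals, so each row becomes a distribution. A toy pandas sketch of that arithmetic (assuming row-wise normalisation; not buzz's exact code):

```python
import pandas as pd

# Absolute frequencies: files (rows) by words (columns)
abs_freq = pd.DataFrame(
    {"the": [16, 1], "you": [0, 1]},
    index=["01-storefront", "02-bedroom"],
)

# relative=True: divide each row by its total, so rows sum to 1.0
rel = abs_freq.div(abs_freq.sum(axis=1), axis=0)
print(rel)
```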
For example, to find out the frequency of each noun in the corpus, relative to all nouns in the corpus, you can do:
dtrt.just.wordclass.NOUN.table(relative=True)
And to find out the frequency of nouns, relative to all tokens in the corpus:
dtrt.just.wordclass.NOUN.table(relative=dtrt)
So, the following two lines give the same output:
dtrt.see.pos.by.speaker.relative().sort('total')
dtrt.table(show=['p'], subcorpora=['speaker'], relative=True, sort='total')