Working with pandas

buzz keeps everything in DataFrame-like objects, so you can use pandas syntax whenever you like. This section shows how to use some of pandas' features directly to explore your data in new ways.

Monkey-patching Datasets

Just as you can with pandas, you can write your own functions and monkey-patch them onto the Dataset class as methods:

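from buzz import Corpus, Dataset

def no_punct(df):
    """Return the dataset with punctuation tokens skipped."""
    return df.skip.wordclass.PUNCT

# attach the function to the Dataset class as a method
Dataset.no_punct = no_punct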

From that point on, you can quickly pre-process your corpus with:

preprocessed = Corpus('path/to/corpus').load().no_punct()
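
The same pattern can be tried without a parsed corpus. The following is a minimal sketch that uses a plain pandas DataFrame in place of a buzz Dataset; the toy data and the column names `w` (word) and `x` (word class) are assumptions made for illustration only.

```python
import pandas as pd

# Toy token table. The column names "w" and "x" are assumptions for this
# sketch, not a statement about how buzz stores its data.
tokens = pd.DataFrame(
    {
        "w": ["Hello", ",", "world", "!"],
        "x": ["INTJ", "PUNCT", "NOUN", "PUNCT"],
    }
)


def no_punct(df):
    """Return only the rows whose word class is not PUNCT."""
    return df[df["x"] != "PUNCT"]


# Assigning the function to the class makes it available as a method
# on every DataFrame, including ones created earlier.
pd.DataFrame.no_punct = no_punct

cleaned = tokens.no_punct()
print(cleaned["w"].tolist())
```

Because the function is attached to the class rather than to a single object, it can be chained after any other operation that returns a DataFrame.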

Recipe: finding character mentions

In Do the Right Thing, as in any film, characters refer to one another by name. It is interesting to compare who is mentioned, how often, and whether the number of mentions is proportional to the number of lines each character has. buzz makes this kind of analysis straightforward.

First, we load in our dataset, and get the set of all speaker names:

dtrt = Corpus('do-the-right-thing-parsed').load()
unique_speakers = set(dtrt.speaker)

We can use this set of names as a search query on the corpus, returning all tokens that are names:

# don't care about case, but match entire word, not just part of it
mentions = dtrt.just.word(unique_speakers, case=False, exact_match=True)
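
For comparison, a roughly equivalent filter can be sketched in plain pandas. This is an approximation of the idea, not buzz's implementation; the toy data and the column names `speaker` and `w` are assumed.

```python
import pandas as pd

# Toy data; the columns "speaker" and "w" are assumed for illustration.
tokens = pd.DataFrame(
    {
        "speaker": ["MOOKIE", "MOOKIE", "SAL", "SAL", "SAL"],
        "w": ["Sal", "pizza", "Mookie", "mookie", "salt"],
    }
)

unique_speakers = set(tokens["speaker"])

# Upper-case each word to ignore case, then keep rows whose whole word
# is a speaker name. isin() compares entire values, so "salt" does not
# match "SAL".
is_mention = tokens["w"].str.upper().isin(unique_speakers)
mentions = tokens[is_mention]
print(mentions["w"].tolist())
```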

Then, we turn this into a table, remove the non-speaker entries, and relativise the frequencies:

# show word (i.e. mentioned speaker) by mentioner
tab = mentions.table(show='w', subcorpora='speaker')
# remove things we don't care about
tab = tab.drop('stage_direction').drop('stage_direction', axis=1)
# make relative, then show the first 14 rows and 7 columns
tab = tab.relative()
tab.iloc[:14, :7].to_html()

This leaves us with a nice matrix of mentions:

speaker                  MOOKIE  SAL  JADE  BUGGIN' OUT  PINO  DA MAYOR  VITO
CROWD                         0    0     0            0     1         0     0
DA MAYOR                      1    0     1            1     0         6     0
EDDIE                         0    0     0            0     2         2     0
MISTER SEÑOR LOVE DADDY       2    0     0            0     0         0     1
MOOKIE                        1   15     3            1     6         1     7
MOTHER SISTER                 1    0     1            0     0         2     0
PINO                          4    3     2            0     0         0     3
RADIO RAHEEM                  1    1     0            0     1         0     0
SAL                          21    5     6           12     1         1     0
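
The tabulate-and-relativise steps can also be approximated in plain pandas. The sketch below uses made-up data and treats "relative" as row-wise percentages, which is an assumption; buzz's `relative()` may normalise differently.

```python
import pandas as pd

# Made-up mentions: "speaker" is the mentioner, "w" is the name mentioned.
mentions = pd.DataFrame(
    {
        "speaker": ["MOOKIE", "MOOKIE", "MOOKIE", "SAL", "SAL", "PINO"],
        "w": ["SAL", "SAL", "PINO", "MOOKIE", "MOOKIE", "SAL"],
    }
)

# Rows are mentioners, columns are the names they mention.
counts = pd.crosstab(mentions["speaker"], mentions["w"])

# Assumed normalisation: divide each row by its total so that every
# mentioner's row sums to 100.
relative = counts.div(counts.sum(axis=1), axis=0) * 100

print(counts)
print(relative.round(1))
```

Normalising by row answers "of all the names this character says, what share goes to each person?"; dividing by column totals instead would answer "of all mentions of this person, who makes them?".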