Exploring parsed corpora

By this point, you have a parsed corpus. For our example here, we'll use a metadata-rich script from Spike Lee's Do the Right Thing.

from buzz import Corpus
dtrt = Corpus('dtrt/do-the-right-thing-parsed')

If it's not too large, we'll probably want to load it into memory too:

dtrt = dtrt.load(multiprocess=True)

The commands in this section will work on both Corpus and Dataset objects (i.e. on unloaded and loaded data), but they will all be much faster on Datasets, because no file reading is performed.
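
The same query works on either object; as a minimal sketch (reusing the lemma person, which we also search for below):

# the same search runs on an unloaded Corpus or a loaded Dataset;
# the loaded version works on an in-memory DataFrame, so it is faster
unloaded = Corpus('dtrt/do-the-right-thing-parsed')
loaded = unloaded.load()
unloaded.just.lemma.person  # reads the parsed files during the search
loaded.just.lemma.person    # searches the in-memory data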

Dataset attributes

Features of the token, as determined by the parser, are all available for you to work with.

You can also use any metadata features that were included in your original texts.

Feature             Valid accessors
file                file
sentence number     s, sentence, sentences
word                w, word, words
lemma               l, lemma, lemmas, lemmata
POS tag             p, pos, partofspeech, tag, tags
wordclass           x, wordclass, wordclasses
dependency label    f, function, functions
governor index      g, governor, governors, gov, govs
speaker             speaker, speakers
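
The accessor names in each row are interchangeable, so the search methods described below accept any of them. A quick sketch:

# 'wordclass' and 'x' are aliases for the same feature,
# so these two searches should return the same rows
dtrt.just.wordclass.VERB
dtrt.just.x.VERB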

Dataset.just

Dataset.just is equivalent to a simple search of the corpus.

If we want to see the nouns in the corpus, we could do any of the below, depending on exactly what we want to capture, and how we prefer to express it:

dtrt.just.wordclass.NOUN.head().to_html()
                                              w       l       x     p    g  f
file                                  s   i
01-we-love-radio-station-storefront   1   6   teeth   tooth   NOUN  NNS  2  dobj
                                          13  lips    lip     NOUN  NNS  9  appos
                                      10  2   voice   voice   NOUN  NN   0  ROOT
                                          4   choice  choice  NOUN  NN   3  pobj
                                      11  2   world   world   NOUN  NN   8  poss

You could get similar results in slightly different ways:

# when your query needs to be a string
dtrt.just.wordclass('NOUN')
# if you wanted to use pos rather than wordclass
dtrt.just.pos('^N')
# if you want a full dependency query, use depgrep:
dtrt.just.depgrep('l/person/ <- F"ROOT"')

Using the bracket syntax, you can pass in some keyword arguments:

  • case: case sensitivity (True by default)
  • regex: True by default; turn it off if you want simple string matching
  • exact_match: False by default; whether strings must match end-to-end, rather than anywhere within the token
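
For example, a case-insensitive, non-regex, whole-token search might look like the sketch below (the keyword arguments are those listed above; the query string is purely illustrative):

# match the word 'mookie' regardless of case, as a plain string,
# and only when it spans the whole token
dtrt.just.word('mookie', case=False, regex=False, exact_match=True)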

Because Datasets are subclasses of pandas DataFrames, we could also do this using pandas syntax:

dtrt[dtrt.p.str.startswith('N')]

Because metadata attributes are treated in the same way as token features, we could get all words spoken by a specific character with:

dtrt.just.speaker.MOOKIE.to_html()
                           w     l       x      p    g  f
file               s   i
04-jade's-bedroom  12  1   Wake  wake    PROPN  NNP  0  ROOT
                       2   up!.  up!.    PROPN  NNP  1  acomp
                   15  1   It    -PRON-  PRON   PRP  3  nsubj
                       2   's    be      VERB   VBZ  3  aux
                       3   gon   go      VERB   VBG  0  ROOT

Dataset.skip

Dataset.skip is the inverse of Dataset.just, helping you quickly cut down your corpus by removing unwanted information. One common task is removing punctuation tokens:

dtrt.skip.wordclass.PUNCT

Or, a regex approach:

# skip tokens that do not start with an alphanumeric character
dtrt.skip.word('^[^A-Za-z0-9]')

Dataset.near

Dataset.near will find tokens within a certain distance of a match. To get tokens within the default distance (three tokens) of the lemma person, you can use:

dtrt.near.lemma.person

The method returns matching rows, plus an extra column, _position, which tells you the place this token occupies relative to the match (i.e. -1 is one position to the left, 1 is one position to the right).
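
Because the result is still a Dataset (and therefore a DataFrame), you could filter on _position with ordinary pandas operations. A small sketch:

# keep only the tokens immediately to the left of a match
nearby = dtrt.near.lemma.person
nearby[nearby['_position'] == -1]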

If you want to customise the distance, or use a regular expression, you can use brackets:

dtrt.near.lemma('person', distance=2)

If you need something more specific, you can pass a depgrep query:

dtrt.near.depgrep('l/person/ = X"NOUN"', distance=2)

Dataset.bigrams

Dataset.bigrams will find the tokens immediately before or after each matching token. It is equivalent to Dataset.near(query, distance=1).
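
Assuming the same accessor syntax as the other methods, the two lines below should therefore produce the same result:

# bigrams of the lemma 'person'...
dtrt.bigrams.lemma.person
# ...should be equivalent to near() with a distance of one
dtrt.near.lemma('person', distance=1)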

Chaining operations

Remember that in buzz, search results are simply subsets of your corpus. This means that it's always possible to further refine your searches, often eliminating the need to write complex queries.

For example, if we wanted to find the most common verbs used by Mookie's boss, Sal, we could do:

dtrt.just.speaker.SAL.just.wordclass.VERB
                                  w        l        x     p   g  f
file                      s   i
05-sal's-famous-pizzeria  8   3   get      get      VERB  VB  0  ROOT
                              7   sweep    sweep    VERB  VB  3  conj
                          14  6   shaddup  shaddup  VERB  VB  2  acl
                          19  1   Watch    watch    VERB  VB  0  ROOT
                          22  1   Can      can      VERB  MD  3  aux

If the value you want to search by is not a valid Python name (e.g. it contains spaces, hyphens and so on), you can use an alternative syntax:

dtrt.just.setting('FRUIT-N-VEG DELIGHT')

Dataset.see

With Dataset.see, you can pivot your data into a table of frequencies. To see the distribution of dependency function labels by part-of-speech tag, you can simply do:

dtrt.see.function.by.pos.to_html()
p        nn   .    nnp  prp  dt  in  vbz  rb
ROOT     114  0    120  9    2   11  527  18
acl      1    0    0    0    0   0   1    0
acomp    5    0    4    0    0   1   0    3
advcl    4    0    4    0    0   0   51   1
advmod   2    0    4    0    3   14  0    602
agent    0    0    0    0    0   11  0    0
amod     18   0    7    0    0   0   0    3
appos    45   0    39   3    12  0   0    1
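
Because the result is itself a DataFrame (which is why to_html() was available above), ordinary pandas operations should apply to it as well. For instance, this sketch sorts the table by the nn column:

# sort dependency functions by how often they occur with the 'nn' tag
freqs = dtrt.see.function.by.pos
freqs.sort_values('nn', ascending=False).head()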

Putting it all together

With a parsed and metadata-rich text, you can find, well, whatever you could possibly want. Let's say we're really interested in verb lemmata spoken inside Sal's Famous Pizzeria by anyone except Sal.

verbs = dtrt.just.wordclass.VERB.just.setting("SAL'S FAMOUS PIZZERIA").skip.speaker.SAL
verbs.see.speaker.by.lemma.to_html()
speaker/verb  be  do  get  go  have  look  see  want  will  can
AHMAD          0   0    1   0     0     0    0     0     0    0
BUGGIN' OUT   10   3    4   1     2     0    2     2     0    0
CROWD          0   0    1   0     0     0    0     0     0    0
DA MAYOR       7   4    1   3     1     0    0     0     2    0
JADE          12   3    2   1     0     0    1     0     0    1

These are really just the basics, however. If you want to do more advanced kinds of frequency calculations, you'll want to use the Dataset.table() method, with documentation available here.
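
As a rough, hypothetical sketch only (the argument names below are assumptions, so check the Dataset.table() documentation for the real interface), a more configurable version of the table above might look something like this:

# hypothetical: build a speaker-by-lemma frequency table with table();
# 'subcorpora' and 'show' are assumed argument names
verbs.table(subcorpora=['speaker'], show=['l'])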