# Exploring parsed corpora
By this point, you have a parsed corpus. For our example here, we'll use a metadata-rich script from Spike Lee's *Do the Right Thing*:
```python
from buzz import Corpus

dtrt = Corpus('dtrt/do-the-right-thing-parsed')
```
If it's not too large, we'll probably want to load it into memory too:
```python
dtrt = dtrt.load(multiprocess=True)
```
The commands in this section will work on both `Corpus` and `Dataset` objects (i.e. on unloaded and loaded data), but they will all be much faster on `Dataset` objects, because no file reading needs to be performed.
## Dataset attributes
Features of the token, as determined by the parser, are all available for you to work with.
You can also use any metadata features that were included in your original texts.
| Feature | Valid accessors |
|---|---|
| file | `file` |
| sentence # | `s`, `sentence`, `sentences` |
| word | `w`, `word`, `words` |
| lemma | `l`, `lemma`, `lemmas`, `lemmata` |
| POS tag | `p`, `pos`, `partofspeech`, `tag`, `tags` |
| wordclass | `x`, `wordclass`, `wordclasses` |
| dependency label | `f`, `function`, `functions` |
| governor index | `g`, `governor`, `governors`, `gov`, `govs` |
| speaker | `speaker`, `speakers` |
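As a rough mental model for these attributes (an illustrative sketch using toy data, not buzz's actual internals), a loaded Dataset can be pictured as a pandas DataFrame with one row per token, a `(file, s, i)` index, and one column per feature:

```python
import pandas as pd

# Toy stand-in for a loaded Dataset: one row per token, indexed by
# (file, sentence number, token index). The file name is invented.
tokens = pd.DataFrame(
    {
        "w": ["We", "love", "radio"],       # word
        "l": ["-PRON-", "love", "radio"],   # lemma
        "x": ["PRON", "VERB", "NOUN"],      # wordclass
        "p": ["PRP", "VBP", "NN"],          # POS tag
        "g": [2, 0, 2],                     # governor index
        "f": ["nsubj", "ROOT", "dobj"],     # dependency label
    },
    index=pd.MultiIndex.from_tuples(
        [("01-storefront", 1, 1), ("01-storefront", 1, 2), ("01-storefront", 1, 3)],
        names=["file", "s", "i"],
    ),
)

# Look up the lemma of the second token of sentence 1
print(tokens.loc[("01-storefront", 1, 2), "l"])
```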
## Dataset.just

`Dataset.just` is equivalent to a simple search of the corpus.
If we want to see the nouns in the corpus, we could do any of the below, depending on exactly what we want to capture, and how we prefer to express it:
```python
dtrt.just.wordclass.NOUN.head().to_html()
```
| file | s | i | w | l | x | p | g | f |
|---|---|---|---|---|---|---|---|---|
| 01-we-love-radio-station-storefront | 1 | 6 | teeth | tooth | NOUN | NNS | 2 | dobj |
| | | 13 | lips | lip | NOUN | NNS | 9 | appos |
| | 10 | 2 | voice | voice | NOUN | NN | 0 | ROOT |
| | | 4 | choice | choice | NOUN | NN | 3 | pobj |
| | 11 | 2 | world | world | NOUN | NN | 8 | poss |
You could get similar results in slightly different ways:
```python
# when your query needs to be a string
dtrt.just.wordclass('NOUN')

# if you want to use pos rather than wordclass
dtrt.just.pos('^N')

# if you want a full dependency query, use depgrep
dtrt.just.depgrep('l/person/ <- F"ROOT"')
```
Using the bracket syntax, you can pass in some keyword arguments:

* `case`: case sensitivity (default `True`)
* `regex`: `True` by default; turn it off if you want simple string search
* `exact_match`: `False` by default; whether or not strings must match end-to-end, rather than within
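To make the interaction of these keywords concrete, here is a sketch of the semantics described above using Python's `re` module (an approximation for illustration, not buzz's actual implementation):

```python
import re

def matches(value, query, case=True, regex=True, exact_match=False):
    """Illustrative approximation of how case / regex / exact_match
    could interact when matching a query against a token feature."""
    if not regex:
        query = re.escape(query)  # treat the query as a literal string
    flags = 0 if case else re.IGNORECASE
    # exact_match: the whole value must match, not just a substring
    fn = re.fullmatch if exact_match else re.search
    return fn(query, value, flags) is not None

print(matches("PERSON", "person"))                    # case-sensitive by default
print(matches("PERSON", "person", case=False))
print(matches("persons", "person", exact_match=True)) # must match end-to-end
```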
Because Datasets are subclasses of pandas DataFrames, we could also do this using pandas syntax:
```python
dtrt[dtrt.p.str.startswith('N')]
```
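On a toy DataFrame with the same column names (illustrative data, not the real corpus), that pandas filter looks like this:

```python
import pandas as pd

# Invented sample tokens with word and POS-tag columns
df = pd.DataFrame({
    "w": ["teeth", "the", "voice"],
    "p": ["NNS", "DT", "NN"],
})

# Keep only rows whose POS tag starts with 'N'
nouns = df[df.p.str.startswith("N")]
print(nouns.w.tolist())
```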
Because metadata attributes are treated in the same way as token features, we could get all words spoken by a specific character with:
```python
dtrt.just.speaker.MOOKIE.to_html()
```
| file | s | i | w | l | x | p | g | f |
|---|---|---|---|---|---|---|---|---|
| 04-jade's-bedroom | 12 | 1 | Wake | wake | PROPN | NNP | 0 | ROOT |
| | | 2 | up!. | up!. | PROPN | NNP | 1 | acomp |
| | 15 | 1 | It | -PRON- | PRON | PRP | 3 | nsubj |
| | | 2 | 's | be | VERB | VBZ | 3 | aux |
| | | 3 | gon | go | VERB | VBG | 0 | ROOT |
## Dataset.skip

`Dataset.skip` is the inverse of `Dataset.just`, helping you quickly cut down your corpus by removing unwanted information. One common task is to remove punctuation tokens from the corpus:
```python
dtrt.skip.wordclass.PUNCT
```
Or, a regex approach:
```python
# skip tokens that start with a non-alphanumeric character
dtrt.skip.word('^[^A-Za-z0-9]')
```
## Dataset.near

`Dataset.near` will find tokens within a certain distance of a match. To get tokens within the default three spaces of the lemma `person`, you can use:
```python
dtrt.near.lemma.person
```
The method returns matching rows, plus an extra column, `_position`, which tells you the place this token occupies relative to the match (i.e. `-1` is one position to the left, `1` is one position to the right).
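As an illustration of what `_position` means (a plain-Python sketch of the idea, not buzz's implementation), relative offsets around a match can be computed like this:

```python
# Tokens of one invented sentence; suppose the match ("person") is at index 3.
tokens = ["a", "good", "young", "person", "walked", "by", "slowly"]
match_i, distance = 3, 3

near = [
    (i - match_i, tok)                   # _position relative to the match
    for i, tok in enumerate(tokens)
    if 0 < abs(i - match_i) <= distance  # within `distance`, excluding the match itself
]
print(near)
```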
If you want to customise the distance, or use a regular expression, you can use brackets:
```python
dtrt.near.lemma('person', distance=2)
```
If you need something more specific, you can use a depgrep query:

```python
dtrt.near.depgrep('l/person/ = X"NOUN"', distance=2)
```
## Dataset.bigrams

`Dataset.bigrams` will find tokens immediately before or after the matching token. It is equivalent to `Dataset.near(query, distance=1)`.
## Chaining operations

Remember that in `buzz`, search results are simply subsets of your corpus. This means that it's always possible to further refine your searches, often eliminating the need to write complex queries.
For example, if we wanted to find the most common verbs used by Mookie's boss, Sal, we could do:
```python
dtrt.just.speaker.SAL.just.wordclass.VERB
```
| file | s | i | w | l | x | p | g | f |
|---|---|---|---|---|---|---|---|---|
| 05-sal's-famous-pizzeria | 8 | 3 | get | get | VERB | VB | 0 | ROOT |
| | | 7 | sweep | sweep | VERB | VB | 3 | conj |
| | 14 | 6 | shaddup | shaddup | VERB | VB | 2 | acl |
| | 19 | 1 | Watch | watch | VERB | VB | 0 | ROOT |
| | 22 | 1 | Can | can | VERB | MD | 3 | aux |
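Because each search result is itself a DataFrame-like subset, chaining amounts to stacking filters. In plain pandas terms (toy data, not the real corpus), the chained query above could be sketched as:

```python
import pandas as pd

# Invented sample tokens with speaker, wordclass and lemma columns
df = pd.DataFrame({
    "speaker": ["SAL", "SAL", "MOOKIE"],
    "x": ["VERB", "NOUN", "VERB"],
    "l": ["get", "pizza", "go"],
})

# .just.speaker.SAL.just.wordclass.VERB ~ two successive boolean filters
sal = df[df.speaker == "SAL"]
sal_verbs = sal[sal.x == "VERB"]
print(sal_verbs.l.tolist())
```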
If the value you want to search by is not a valid Python name (i.e. it contains spaces, hyphens, etc.), you can use an alternative syntax:
```python
dtrt.just.setting('FRUIT-N-VEG DELIGHT')
```
## Dataset.see

With `Dataset.see`, you can pivot your data into a table of frequencies. To see a distribution of dependency function labels by part-of-speech, you can simply do:
```python
dtrt.see.function.by.pos.to_html()
```
| p | nn | . | nnp | prp | dt | in | vbz | rb |
|---|---|---|---|---|---|---|---|---|
| ROOT | 114 | 0 | 120 | 9 | 2 | 11 | 527 | 18 |
| acl | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| acomp | 5 | 0 | 4 | 0 | 0 | 1 | 0 | 3 |
| advcl | 4 | 0 | 4 | 0 | 0 | 0 | 51 | 1 |
| advmod | 2 | 0 | 4 | 0 | 3 | 14 | 0 | 602 |
| agent | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 |
| amod | 18 | 0 | 7 | 0 | 0 | 0 | 0 | 3 |
| appos | 45 | 0 | 39 | 3 | 12 | 0 | 0 | 1 |
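The pivot that `see` performs is essentially a frequency cross-tabulation. In plain pandas (toy data, not buzz's implementation), the same kind of table can be built with `pd.crosstab`:

```python
import pandas as pd

# Invented sample tokens with function-label and POS-tag columns
df = pd.DataFrame({
    "f": ["ROOT", "nsubj", "ROOT", "advmod"],
    "p": ["vbz", "prp", "vbz", "rb"],
})

# Count how often each function label co-occurs with each POS tag
counts = pd.crosstab(df.f, df.p)
print(counts)
```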
## Putting it all together
With a parsed and metadata-rich text, you can find, well, whatever you could possibly want. Let's say we're really interested in verb lemmata spoken inside Sal's Famous Pizzeria by anyone except Sal.
```python
verbs = dtrt.just.wordclass.VERB.just.setting("SAL'S FAMOUS PIZZERIA").skip.speaker.SAL
verbs.see.speaker.by.lemma.to_html()
```
| speaker/verb | be | do | get | go | have | look | see | want | will | can |
|---|---|---|---|---|---|---|---|---|---|---|
| AHMAD | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BUGGIN' OUT | 10 | 3 | 4 | 1 | 2 | 0 | 2 | 2 | 0 | 0 |
| CROWD | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DA MAYOR | 7 | 4 | 1 | 3 | 1 | 0 | 0 | 0 | 2 | 0 |
| JADE | 12 | 3 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
This is really just the basics, however. If you want to do more advanced kinds of frequency calculations, you'll want to use the `Dataset.table()` method, with documentation available here.