The Corpus object
Corpus can model collections of unparsed or parsed data. It's probably the only class you will ever need to import from buzz.
All you need to get started is a directory containing plain text files, optionally within subdirectories, which will be understood as subcorpora. Once you have this, you import the
buzz.Corpus object, and point it to the directory containing your data:
from buzz import Corpus
corpus = Corpus('dtrt/do-the-right-thing')
# path of the first file in the corpus
print(corpus.files[0].path)
# `dtrt/do-the-right-thing/01-we-love-radio-station-storefront.txt`
# raw contents of the first file
corpus.files[0].read()
WE SEE only big white teeth and very Negroidal (big) lips. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" />
Waaaake up! Wake up! Wake up! Wake up! Up ya wake! Up ya wake! Up ya wake!. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="1" />
This is Mister Señor Love Daddy. Your voice of choice. The world's only twelve-hour strongman, here on WE LOVE radio, 108 FM. The last on your dial, but the first in ya hearts, and that's the truth, Ruth!. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="2" />
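Each line of the file above ends with an XML-style `<meta ... />` tag whose attributes carry the metadata for that line. As a rough illustration of the format only, the text and attributes could be separated with the standard library as below. This is not buzz's own parser, `split_meta` is a hypothetical helper, and the sample line is abbreviated from the file shown above.

```python
import re

# Matches a trailing <meta ...> or <meta ... /> tag and captures its attributes.
META_TAG = re.compile(r'<meta\s+(?P<attrs>.*?)\s*/?>\s*$')
# Matches key="value" pairs inside the tag.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def split_meta(line):
    """Return (text, metadata dict) for one line of an unparsed corpus file."""
    match = META_TAG.search(line)
    if match is None:
        # No metadata tag on this line: everything is text.
        return line.strip(), {}
    text = line[:match.start()].strip()
    metadata = dict(ATTRIBUTE.findall(match.group('attrs')))
    return text, metadata


line = ('This is Mister Señor Love Daddy. Your voice of choice. '
        '<meta time="DAY" loc="INT" speaker="MISTER SEÑOR LOVE DADDY" line="2" />')
text, metadata = split_meta(line)
print(text)
print(metadata)
```

All attribute values come back as strings here; converting `line` or `scene` to integers would be a separate step.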
Corpus objects can be manipulated using the subcorpora and files attributes. You can use these attributes to quickly select subsets of the corpus.
If your corpus contains subfolders, they will be understood as subcorpora:
# get the first three subcorpora
corpus.subcorpora[:3]
# get second file from first subcorpus
corpus.subcorpora[0].files[1]
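To make the folder convention concrete, here is a small standard-library sketch that builds a throwaway corpus directory and lists it the way the text describes: subfolders stand in for subcorpora, and the text files inside them stand in for files. The folder and file names are invented for illustration, and the alphabetical ordering is an assumption rather than documented buzz behaviour.

```python
import tempfile
from pathlib import Path

# Build a throwaway corpus: two subfolders (subcorpora), each holding text files.
root = Path(tempfile.mkdtemp()) / 'my-corpus'
layout = {
    'chapter-01': ['a-intro.txt', 'b-scene.txt'],
    'chapter-02': ['c-scene.txt'],
}
for folder, names in layout.items():
    (root / folder).mkdir(parents=True)
    for name in names:
        (root / folder / name).write_text('Some plain text.\n', encoding='utf-8')

# Subfolders play the role of subcorpora; .txt files inside them are the files.
subcorpora = sorted(p for p in root.iterdir() if p.is_dir())
first_subcorpus_files = sorted(subcorpora[0].glob('*.txt'))

print([p.name for p in subcorpora])
print(first_subcorpus_files[1].name)  # second file of the first subcorpus
```

A directory built this way is the kind of input that could then be passed to `Corpus(...)`.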
Subcorpus objects have the
files attribute. As shown above, you can get files by index. You can also pass in a regular expression to get just files whose names match a pattern:
import re
# keep only files whose names start with a, b, c or d
abcd = corpus.files[re.compile('^[abcd]')]
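For readers less familiar with regular expressions, the sketch below shows what a pattern like `^[abcd]` selects when applied to a plain list of names. The file names are made up, and the use of `re.search` is an assumption about how a name might be tested, not a statement about buzz's internals.

```python
import re

# Hypothetical file names, for illustration only.
names = ['alpha.txt', 'bravo.txt', 'echo.txt', 'delta.txt', 'zulu.txt']

pattern = re.compile('^[abcd]')
# Assumption: a name is kept when the pattern is found in it (re.search).
matching = [name for name in names if pattern.search(name)]
print(matching)
```

Because the pattern is anchored with `^`, only names whose first character is a, b, c or d are kept.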
Parsing a corpus
The parse method of Corpus objects will do a full dependency and constituency parse of the text, including lemmatisation, POS tagging, and so on. The result will be stored in a new directory with the same structure as the unparsed corpus, with each file now formatted as CoNLL-U 2.0.
parsed = corpus.parse()
print(parsed)
# <buzz.corpus.Corpus object at 0x7fb2f3af6470 (dtrt/do-the-right-thing-parsed, parsed)>
print(parsed.files[0].path)
# `dtrt/do-the-right-thing-parsed/01-we-love-radio-station-storefront.txt.conllu`
print(parsed.files[0].read())
# loc = INT
# parse = (S (NP (PRP WE)) (VP (VBP SEE) (NP (NP (RB only) (JJ big) (JJ white) (NNS teeth)) (CC and) (NP (ADJP (ADJP (RB very) (JJ Negroidal)) (PRN (-LRB- -LRB-) (JJ big) (-RRB- -RRB-))) (NNS lips)))) (. .) ( ))
# scene = 1
# sent_id = 1
# sent_len = 15
# setting = WE LOVE RADIO STATION STOREFRONT
# stage_direction = True
# text = WE SEE only big white teeth and very Negroidal (big) lips.
# time = DAY
1   WE         we         PRON   PRP    _  2   nsubj   _  _
2   SEE        see        VERB   VBP    _  0   ROOT    _  _
3   only       only       ADV    RB     _  6   advmod  _  _
4   big        big        ADJ    JJ     _  6   amod    _  _
5   white      white      ADJ    JJ     _  6   amod    _  _
6   teeth      tooth      NOUN   NNS    _  2   dobj    _  _
7   and        and        CCONJ  CC     _  6   cc      _  _
8   very       very       ADV    RB     _  9   amod    _  _
9   Negroidal  negroidal  ADJ    JJ     _  6   conj    _  _
10  (          (          PUNCT  -LRB-  _  9   punct   _  _
11  big        big        ADJ    JJ     _  13  amod    _  _
12  )          )          PUNCT  -RRB-  _  13  punct   _  _
13  lips       lip        NOUN   NNS    _  9   appos   _  _
14  .          .          PUNCT  .      _  2   punct   _  _
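The parsed output above has two kinds of lines: `# key = value` comments holding sentence-level metadata, and token rows with ten columns. The sketch below reads one such sentence using only the standard library, to show how the pieces fit together. It is an illustration of the file format, not the loader buzz uses; `read_sentence` is a hypothetical helper, the column names follow the CoNLL-U specification, and the sample is a shortened version of the sentence above with tab-separated columns.

```python
# Column names as defined by the CoNLL-U format.
COLUMNS = ['id', 'form', 'lemma', 'upos', 'xpos',
           'feats', 'head', 'deprel', 'deps', 'misc']


def read_sentence(block):
    """Split one CoNLL-U sentence block into (metadata, token rows)."""
    metadata, tokens = {}, []
    for line in block.strip().splitlines():
        if line.startswith('#'):
            # Metadata lines look like "# key = value".
            key, _, value = line.lstrip('# ').partition(' = ')
            metadata[key] = value
        else:
            # Token lines hold ten tab-separated columns.
            tokens.append(dict(zip(COLUMNS, line.split('\t'))))
    return metadata, tokens


block = '\n'.join([
    '# sent_id = 1',
    '# time = DAY',
    '\t'.join(['1', 'WE', 'we', 'PRON', 'PRP', '_', '2', 'nsubj', '_', '_']),
    '\t'.join(['2', 'SEE', 'see', 'VERB', 'VBP', '_', '0', 'ROOT', '_', '_']),
])
metadata, tokens = read_sentence(block)
print(metadata)
print(tokens[0]['lemma'], tokens[0]['deprel'])
```

In this layout the `head` column holds the `id` of the governing token, with `0` marking the root of the sentence.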
Load corpora into memory
At this point, you have a
Corpus object that models a corpus of parsed text files. This is everything you need to start exploring its contents. First, however, you need to decide whether to load the corpus into memory before searching it. The advantage is that, once the corpus is loaded, searching is much faster. The disadvantage is that your machine needs enough memory to hold the whole corpus.
Unless your corpus is huge (tens of millions of words or more), you'll probably be able to load it into memory without much trouble. To do this, use the load method of the Corpus. If this operation is slow, you can speed it up with the multiprocess argument, which can be set to either a number of processes to spawn, or True (one for each CPU). If your corpus is small, you can omit this argument, because loading won't take long.
loaded = parsed.load(multiprocess=True)
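The two accepted forms of the multiprocess argument (a number, or True for one process per CPU) can be pictured with the small sketch below. `resolve_processes` is a hypothetical helper written for this illustration, and treating False, None and 0 as "single process" is an assumption; how buzz itself interprets the value may differ.

```python
import os


def resolve_processes(multiprocess):
    """Turn a multiprocess-style argument into a process count (hypothetical helper)."""
    # Check for True first: in Python, True is also an int equal to 1.
    if multiprocess is True:
        return os.cpu_count() or 1
    if not multiprocess:
        return 1  # False, None or 0: stay in a single process
    return int(multiprocess)


print(resolve_processes(True))   # one process per CPU on this machine
print(resolve_processes(4))
print(resolve_processes(False))
```

The explicit `is True` check matters because `True == 1` in Python, so a plain integer test would silently turn `multiprocess=True` into a single process.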
We now have a buzz.Dataset object, which is like a pandas DataFrame, with one row for each token, and one column for each token feature.
Just as before, if you want to load only parts of a corpus, you can still select the needed parts from the subcorpora and files attributes, though now we are accessing the parsed dataset:
import re
# load the first 5 subcorpora
parsed.subcorpora[:5].load()
# load all files whose name matches a regex pattern:
parsed.files[re.compile('pizzeria$')].load()
All metadata attributes are available for each token in the text. Here, we look at the features of the first token in the text:
| feature | value |
|---|---|
| parse | (S (NP (PRP WE)) (VP (VBP SEE) (NP (NP (RB onl... |
| setting | WE LOVE RADIO STATION STOREFRONT |
| text | WE SEE only big white teeth and very Negroidal... |
By default, loading files will spend a fair bit of time converting columns to optimal data types (e.g. categorical datatypes for POS tags). This slows down loading quite a lot, so if you don't care about setting optimal data types, you can do data.load(set_data_types=False) to double the loading speed.
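To see why setting data types costs time up front, it can help to look at the idea behind a categorical column: each distinct label is stored once, and every row keeps only a small integer code. The sketch below shows that idea with the standard library alone. It is a conceptual illustration, not how pandas or buzz implement categoricals, and the tag column is invented.

```python
# A column of POS tags with many repeats, as in a token-per-row dataset.
tags = ['PRON', 'VERB', 'ADV', 'ADJ', 'ADJ', 'NOUN'] * 1000

# Categorical idea: store each distinct label once, plus one small integer per row.
categories = sorted(set(tags))
code_of = {label: code for code, label in enumerate(categories)}
codes = bytes(code_of[tag] for tag in tags)  # one byte per row (fewer than 256 labels)

# Decoding recovers the original column exactly.
decoded = [categories[code] for code in codes]
print(categories)
print(len(codes), decoded == tags)
```

Building the codes requires a full pass over the column, which is the kind of extra work that skipping type conversion avoids.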
Next, head here to learn how to explore your corpus data.