The Corpus object

Corpus can model collections of unparsed or parsed data. It's probably the only class you will ever need to import from buzz.

All you need to get started is a directory containing plain text files, optionally within subdirectories, which will be understood as subcorpora. Once you have this, you import the buzz.Corpus object, and point it to the directory containing your data:

from buzz import Corpus
corpus = Corpus('dtrt/do-the-right-thing')
print(corpus.files[0].path)
# `dtrt/do-the-right-thing/01-we-love-radio-station-storefront.txt`
corpus.files[0].read()
WE SEE only big white teeth and very Negroidal (big) lips. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" />
Waaaake up! Wake up! Wake up! Wake up! Up ya wake! Up ya wake! Up ya wake!. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="1" />
This is Mister Señor Love Daddy. Your voice of choice. The world's only twelve-hour strongman, here on WE LOVE radio, 108 FM. The last on your dial, but the first in ya hearts, and that's the truth, Ruth!. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="2" />
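To make the expected layout concrete, the sketch below builds a small throwaway corpus directory using only the standard library. The folder and file names are hypothetical, and buzz itself is not called; the point is only to show the shape described above, where each subdirectory would be treated as a subcorpus and each text file as a corpus file.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: two subfolders (future subcorpora), each holding .txt files.
layout = {
    "act-1": {"01-opening.txt": "Wake up!", "02-street.txt": "Hot day."},
    "act-2": {"01-pizzeria.txt": "Two slices."},
}

# Write the layout into a temporary directory.
root = Path(tempfile.mkdtemp()) / "my-corpus"
for folder, files in layout.items():
    (root / folder).mkdir(parents=True)
    for name, text in files.items():
        (root / folder / name).write_text(text, encoding="utf-8")

# List what is on disk: the subfolders, and the text files inside them.
subcorpora = sorted(p.name for p in root.iterdir() if p.is_dir())
files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))
print(subcorpora)
print(files)
```

A directory shaped like this is what the Corpus constructor is pointed at in the example above.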

Corpus objects can be manipulated using the subcorpora and files attributes. You can use these attributes to quickly select subsets of the corpus.

If your corpus contains subfolders, they will be understood as subcorpora:

# get the first three subcorpora
corpus.subcorpora[:3]
# get the second file from the first subcorpus
corpus.subcorpora[0].files[1]

Both Corpus and Subcorpus objects have the files attribute. As shown above, you can get files by index. You can also pass in a regular expression to get just files whose names match a pattern:

import re
abcd = corpus.files[re.compile('^[abcd]')]
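As a rough illustration of what such a pattern selects, the sketch below applies the same compiled expression to a plain list of names. The file names are hypothetical, and the assumption that a file is kept whenever the pattern is found in its name is mine; it is not taken from the buzz source.

```python
import re

# Hypothetical file names; a real corpus would supply its own.
names = ["a-intro", "b-street", "e-pizzeria", "d-riot", "z-credits"]

pattern = re.compile("^[abcd]")

# Assumption: a file is kept when the pattern is found in its name.
selected = [name for name in names if pattern.search(name)]
print(selected)
```

Because the pattern is anchored with `^`, only names beginning with a, b, c or d survive the filter.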

Parsing a corpus

The parse method of Corpus objects performs a full dependency and constituency parse of the text, including lemmatisation, POS tagging, and so on. The result is stored in a new directory with the same structure as the unparsed corpus, with each file formatted as CoNLL-U 2.0.

parsed = corpus.parse()
print(parsed)
# <buzz.corpus.Corpus object at 0x7fb2f3af6470 (dtrt/do-the-right-thing-parsed, parsed)>
print(parsed.files[0].path)
# `dtrt/do-the-right-thing-parsed/01-we-love-radio-station-storefront.txt.conllu`
print(parsed.files[0].read())
# loc = INT
# parse = (S (NP (PRP WE)) (VP (VBP SEE) (NP (NP (RB only) (JJ big) (JJ white) (NNS teeth)) (CC and) (NP (ADJP (ADJP (RB very) (JJ Negroidal)) (PRN (-LRB- -LRB-) (JJ big) (-RRB- -RRB-))) (NNS lips)))) (. .) (  ))
# scene = 1
# sent_id = 1
# sent_len = 15
# setting = WE LOVE RADIO STATION STOREFRONT
# stage_direction = True
# text = WE SEE only big white teeth and very Negroidal (big) lips.
# time = DAY
1   WE  we  PRON    PRP _   2   nsubj   _   _
2   SEE see VERB    VBP _   0   ROOT    _   _
3   only    only    ADV RB  _   6   advmod  _   _
4   big big ADJ JJ  _   6   amod    _   _
5   white   white   ADJ JJ  _   6   amod    _   _
6   teeth   tooth   NOUN    NNS _   2   dobj    _   _
7   and and CCONJ   CC  _   6   cc  _   _
8   very    very    ADV RB  _   9   amod    _   _
9   Negroidal   negroidal   ADJ JJ  _   6   conj    _   _
10  (   (   PUNCT   -LRB-   _   9   punct   _   _
11  big big ADJ JJ  _   13  amod    _   _
12  )   )   PUNCT   -RRB-   _   13  punct   _   _
13  lips    lip NOUN    NNS _   9   appos   _   _
14  .   .   PUNCT   .   _   2   punct   _   _
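To show how a file like this is organised, here is a minimal sketch that reads one sentence block by hand. It assumes the standard CoNLL-U conventions: metadata lines of the form `# key = value`, followed by one tab-separated row of ten fields per token. The `parse_sentence` helper and the lowercase column names are hypothetical and not part of buzz.

```python
# The ten CoNLL-U fields, in order (lowercased here for use as dict keys).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

# A shortened copy of the sentence shown above.
sample = """\
# sent_id = 1
# text = WE SEE only big white teeth and very Negroidal (big) lips.
1\tWE\twe\tPRON\tPRP\t_\t2\tnsubj\t_\t_
2\tSEE\tsee\tVERB\tVBP\t_\t0\tROOT\t_\t_
"""


def parse_sentence(block):
    """Split one sentence block into a metadata dict and a list of token dicts."""
    metadata, tokens = {}, []
    for line in block.splitlines():
        if line.startswith("#"):
            # Metadata line: "# key = value"
            key, _, value = line[1:].partition("=")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            # Token line: ten tab-separated fields
            tokens.append(dict(zip(COLUMNS, line.split("\t"))))
    return metadata, tokens


metadata, tokens = parse_sentence(sample)
print(metadata["sent_id"], [t["lemma"] for t in tokens])
```

In practice there is no need to parse these files yourself; the sketch is only meant to make the file layout easier to read.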

Load corpora into memory

At this point, you have a Corpus object that models a corpus of parsed text files. This is everything you need in order to start exploring its contents. However, you first need to decide whether to load the corpus into memory before searching it. The advantage of loading into memory is that searching becomes much faster afterwards. The disadvantage is that your machine needs enough memory to hold the whole corpus.
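One rough way to inform that decision is to count the words in the unparsed corpus. The sketch below does this with the standard library only; `count_words` and the `HUGE` cutoff are hypothetical names and values chosen for illustration, not buzz settings, and any inline metadata tags in the text would inflate the count somewhat.

```python
import tempfile
from pathlib import Path


def count_words(corpus_dir):
    """Roughly count whitespace-separated words in all .txt files under corpus_dir."""
    total = 0
    for path in Path(corpus_dir).rglob("*.txt"):
        with path.open(encoding="utf-8") as handle:
            total += sum(len(line.split()) for line in handle)
    return total


# Hypothetical cutoff for "too big to load comfortably"; adjust for your machine.
HUGE = 10_000_000

# Tiny demo corpus so the example runs anywhere.
demo = Path(tempfile.mkdtemp())
(demo / "sub").mkdir()
(demo / "sub" / "one.txt").write_text("Wake up! Wake up!\nUp ya wake!", encoding="utf-8")

words = count_words(demo)
print(words, "words;", "consider not loading" if words >= HUGE else "fine to load")
```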

Unless your corpus is huge (tens of millions of words or more), you'll probably be able to load it into memory without too much of a problem. To do this, use the load method of the Corpus. If this operation is slow, you can speed it up by using the multiprocess argument, which can be set to either a number of processes to spawn, or True (one for each CPU). If your corpus is small, don't worry about including this argument, because loading won't take long.

loaded = parsed.load(multiprocess=True)

We now have a buzz.Dataset object, which is like a pandas DataFrame, with one row for each token, and one column for each token feature:

print(loaded.iloc[0:8, 0:6].to_html())
file                                 s  i  w      l      x      p    g  f
01-we-love-radio-station-storefront  1  1  WE     we     PRON   PRP  2  nsubj
                                        2  SEE    see    VERB   VBP  0  ROOT
                                        3  only   only   ADV    RB   6  advmod
                                        4  big    big    ADJ    JJ   6  amod
                                        5  white  white  ADJ    JJ   6  amod
                                        6  teeth  tooth  NOUN   NNS  2  dobj
                                        7  and    and    CCONJ  CC   6  cc
                                        8  very   very   ADV    RB   9  amod

Just as before, if you want to load parts of a corpus, you can still select the needed parts from the subcorpora or files attributes, though now we are accessing the parsed dataset:

# load the first 5 subcorpora
parsed.subcorpora[:5].load()
# load all files whose name matches a regex pattern:
import re
parsed.files[re.compile('pizzeria$')].load()

All metadata attributes are available for each token in the text. Here, we look at the features of the first token in the text:

print(loaded.head(1).T.to_html())
file 01-we-love-radio-station-storefront
s 1
i 1
w WE
l we
x PRON
p PRP
g 2
f nsubj
e _
loc INT
parse (S (NP (PRP WE)) (VP (VBP SEE) (NP (NP (RB onl...
scene 1
sent_id 1
sent_len 15
setting WE LOVE RADIO STATION STOREFRONT
stage_direction True
text WE SEE only big white teeth and very Negroidal...
time DAY

By default, loading spends a fair amount of time converting columns to optimal data types (e.g. categorical data types for POS tags). This slows down loading quite a lot, so if you don't care about setting optimal data types, you can call data.load(set_data_types=False) to double the loading speed.
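For readers unfamiliar with categorical data types, the idea is to store each distinct value once and keep only a small integer code per row. The sketch below illustrates that encoding in plain Python; it is a conceptual illustration under that assumption, not how buzz or pandas implement it.

```python
# POS tags for a few tokens; real corpora repeat a small tag set many times.
tags = ["PRP", "VBP", "RB", "JJ", "JJ", "NNS", "CC", "RB", "JJ"]

# Store each distinct tag once, then one small integer per token.
categories = sorted(set(tags))
code_of = {tag: code for code, tag in enumerate(categories)}
codes = [code_of[tag] for tag in tags]

print(categories)
print(codes)

# Decoding restores the original column exactly.
decoded = [categories[code] for code in codes]
```

Building this mapping takes extra work at load time, which is the cost that set_data_types=False avoids.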

Next steps

Next, learn how to explore your corpus data.