The Corpus object
Corpus
can model collections of unparsed or parsed data. It's probably the only class you will ever need to import from buzz
.
All you need to get started is a directory containing plain text files, optionally within subdirectories, which will be understood as subcorpora. Once you have this, you import the buzz.Corpus
object, and point it to the directory containing your data:
from buzz import Corpus
corpus = Corpus('dtrt/do-the-right-thing')
print(corpus.files[0].path)
# `dtrt/do-the-right-thing/01-we-love-radio-station-storefront.txt`
corpus.files[0].read()
WE SEE only big white teeth and very Negroidal (big) lips. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" />
Waaaake up! Wake up! Wake up! Wake up! Up ya wake! Up ya wake! Up ya wake!. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="1"> <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="1" />
This is Mister Señor Love Daddy. Your voice of choice. The world's only twelve-hour strongman, here on WE LOVE radio, 108 FM. The last on your dial, but the first in ya hearts, and that's the truth, Ruth!. <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="2"> <meta time="DAY" loc="INT" setting="WE LOVE RADIO STATION STOREFRONT" scene="1" stage_direction="true" speaker="MISTER SEÑOR LOVE DADDY" line="2" />
Navigating the Corpus
object
Corpus
objects can be manipulated using the subcorpora
and files
attributes. You can use these attributes to quickly select subsets of the corpus.
If your corpus contains subfolders, they will be understood as subcorpora
:
# get the first three subcorpora
corpus.subcorpora[:3]
# get second file from first subcorpus;
corpus.subcorpora[0].files[1]
Both Corpus
and Subcorpus
objects have the files
attribute. As shown above, you can get files by index. You can also pass in a regular expression to get just files whose names match a pattern:
import re
abcd = corpus.files[re.compile('^[abcd]')]
Parsing a corpus
The parse
method of Corpus objects will do a full dependency (and optionally also constituency) parse of the text, including lemmatisation, POS tagging, and so on. The result will be stored in a new directory with the same structure as the unparsed corpus, with each file now being formatted as CONLL-U 2.0.
parsed = corpus.parse()
# for constituency parsing, do corpus.parse(cons_parser="benepar"/"bllip")
print(parsed)
# <buzz.corpus.Corpus object at 0x7fb2f3af6470 (dtrt/do-the-right-thing-parsed, parsed)>
print(parsed.files[0].path)
# `dtrt/do-the-right-thing-parsed/01-we-love-radio-station-storefront.txt.conllu`
print(parsed.files[0].read())
# loc = INT
# parse = (S (NP (PRP WE)) (VP (VBP SEE) (NP (NP (RB only) (JJ big) (JJ white) (NNS teeth)) (CC and) (NP (ADJP (ADJP (RB very) (JJ Negroidal)) (PRN (-LRB- -LRB-) (JJ big) (-RRB- -RRB-))) (NNS lips)))) (. .) ( ))
# scene = 1
# sent_id = 1
# sent_len = 15
# setting = WE LOVE RADIO STATION STOREFRONT
# stage_direction = True
# text = WE SEE only big white teeth and very Negroidal (big) lips.
# time = DAY
1 WE we PRON PRP _ 2 nsubj _ _
2 SEE see VERB VBP _ 0 ROOT _ _
3 only only ADV RB _ 6 advmod _ _
4 big big ADJ JJ _ 6 amod _ _
5 white white ADJ JJ _ 6 amod _ _
6 teeth tooth NOUN NNS _ 2 dobj _ _
7 and and CCONJ CC _ 6 cc _ _
8 very very ADV RB _ 9 amod _ _
9 Negroidal negroidal ADJ JJ _ 6 conj _ _
10 ( ( PUNCT -LRB- _ 9 punct _ _
11 big big ADJ JJ _ 13 amod _ _
12 ) ) PUNCT -RRB- _ 13 punct _ _
13 lips lip NOUN NNS _ 9 appos _ _
14 . . PUNCT . _ 2 punct _ _
Load corpora into memory
At this point, you have a Corpus
object that models a corpus of parsed text files. This is everything you need in order to start exploring its contents. However, you first need to make a decision as to whether or not you should load the corpus into memory before searching it. The advantage of loading into memory is after loading, searching will be really fast. The disadvantage is that your machine needs enough memory to store a given corpus.
Unless your corpus is huge (tens of millions of words or more), you'll probably be able to load it into memory without too much of a problem. To do this, use the load
method of the Corpus
. If this operation is slow, you can speed it up by using the multiprocess
argument, which can be set to either a number of processes to spawn, or True
(one for each CPU). If your corpus is small, don't worry about including this argument, because loading won't take long.
loaded = parsed.load(multiprocess=True)
We now have a buzz.Dataset
object, which is like a pandas DataFrame
, with one row for each token, and one column for each token feature:
print(loaded.iloc[0:8, 0:6].to_html())
w | l | x | p | g | f | |||
---|---|---|---|---|---|---|---|---|
file | s | i | ||||||
01-we-love-radio-station-storefront | 1 | 1 | WE | we | PRON | PRP | 2 | nsubj |
2 | SEE | see | VERB | VBP | 0 | ROOT | ||
3 | only | only | ADV | RB | 6 | advmod | ||
4 | big | big | ADJ | JJ | 6 | amod | ||
5 | white | white | ADJ | JJ | 6 | amod | ||
6 | teeth | tooth | NOUN | NNS | 2 | dobj | ||
7 | and | and | CCONJ | CC | 6 | cc | ||
8 | very | very | ADV | RB | 9 | amod |
Just as before, if you want to load parts of a corpus, you can still select the needed parts from the subcorpora
or files
attributes, though now we are accessing the parsed dataset:
# load the first 5 subcorpora
parsed.subcorpora[:5].load()
# load all files whose name matches a regex pattern:
import re
parsed.files[re.compile('pizzeria$')].load()
All metadata attributes are available for each token in the text. Here, we look at the features of the first token in the text:
print(loaded.head(1).T.to_html())
file | 01-we-love-radio-station-storefront |
---|---|
s | 1 |
i | 1 |
w | WE |
l | we |
x | PRON |
p | PRP |
g | 2 |
f | nsubj |
e | _ |
loc | INT |
parse | (S (NP (PRP WE)) (VP (VBP SEE) (NP (NP (RB onl... |
scene | 1 |
sent_id | 1 |
sent_len | 15 |
setting | WE LOVE RADIO STATION STOREFRONT |
stage_direction | True |
text | WE SEE only big white teeth and very Negroidal... |
time | DAY |
By default, loading files will spend a fair bit of time transforming data into optimal categories (e.g. categorical datatypes for POS tags). This slows down loading quite a lot, so if you don't care about setting optimial data types, you can do data.load(set_data_types=False)
to double the loading speed.
Customising the way your subcorpora are loaded into the DataFrame
If your dataset is not just a single folder full of text files, but a nested structure, where folder names are meaningful, you may want to think about exactly how you want you data loaded into memory. Note that doing things this way is not recommended. Ideally, you have a flat folder structure, plus the use of XML metadata tags only. But, if that is not possible, buzz can you still help.
If you want the path to the file in the dataset's index, you can do that using the following (default) beahviour):
mem = corpus.load(folder="index")
This way, the index will show the whole path to the filename, rather than just the filename itself. This is good for avoiding unambiguity.
Alternatively, the following will make a metadata-like column with the subfolder name:
mem = corpus.load(folder="column")
# this makes it easy to pull out a specific subcorpus
mem[mem.subcorpus == "chapter1"]
Finally, you can ignore your folder structure altogether:
mem = corpus.load(folder=None)
This will ignore your folder structure entirely. If you do it, make sure that all filenames in your corpus are unique, or there could be problems.
But, again, it's recommended that you don't use folder structures for your corpus. Instead, use a flat list of files, with each contaning metadata that can be used to dynamically produce subcorpora. Try to get into that habit, it's much more powerful, and in a future release, buzz may not continue to support handling of subfolders within corpora (since everything it can help you do is possible with metadata tagging).
Next steps
Next, head here to learn how to explore your corpus data.