Prototypicality and similarity
buzz contains tools that help judge the similarity of parsed documents using TF-IDF metrics and word embeddings, similar to how a search engine might produce an ordered set of similar documents to some search query.
The following will return a multiindexed `pandas.Series` containing each sentence and its similarity to each language model (i.e. bin):

```python
dtrt = Corpus('do-the-right-thing-parsed').load()
proto = dtrt.proto.speaker.showing.lemmata
```
`proto` begins by segmenting the corpus into bins by the feature of interest (`speaker` in the case above). It then builds a TF-IDF model of each bin using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from scikit-learn. These models are stored in memory alongside the corpus. Then, each sentence is scored against each model.
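To make the per-bin scoring concrete, here is a minimal sketch using scikit-learn directly: one `TfidfVectorizer` is fitted per bin, and a sentence is scored against a bin by cosine similarity with that bin's centroid. The bin names and sentences are invented for illustration, and this is a simplification of what buzz actually does internally.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy bins: each speaker's sentences form one "document collection"
bins = {
    "MOOKIE": ["I want my money.", "Sal, I need to get paid."],
    "SAL": ["This is my pizzeria.", "I built this place with my own hands."],
}

# One TF-IDF model per bin, fitted on that bin's sentences only
models = {name: TfidfVectorizer().fit(sents) for name, sents in bins.items()}

def score(sentence, bin_name):
    """Cosine similarity between a sentence and the centroid of a bin."""
    vec = models[bin_name]
    sent_vec = vec.transform([sentence])
    bin_matrix = vec.transform(bins[bin_name])
    centroid = np.asarray(bin_matrix.mean(axis=0))
    return float(cosine_similarity(sent_vec, centroid)[0, 0])

print(score("I want to get my money.", "MOOKIE"))
```

A sentence sharing many terms with a bin will score higher against that bin's model than an unrelated sentence would.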
To customise results further, you can use the bracketed expression, which gives you some extra options and allows you to construct the language model from combinations of word features.
- `show` is a list of features to join together when constructing the language model.
- `only_correct` can be switched off in order to see how every sentence compares to every model.
- `n_top_members` can remove infrequent members of the metadata field of interest. This can speed up the operation, remove junk, and prevent outliers from having a large effect.
- `top` can be used to quickly filter just the `n` most similar sentences per bin. So, setting it to `1` will show you just the most prototypical sentence from each bin.
```python
# model corpus as lemma/wordclass tuples, rather than words
model_format = ['l', 'x']
proto = dtrt.proto.speaker(show=model_format, only_correct=True, n_top_members=5, top=1)
# make the data print a little nicer as html:
proto.reset_index().drop('l/x', axis=1).to_html(index=False)
```
If you've seen the film, you may find yourself thinking that these sentences are quite typical of their respective speakers.
| speaker | file | sent | text | similarity |
|---|---|---|---|---|
| BUGGIN' OUT | 33-street | 18 | Not only did you knock me down, you stepped on my new white Air Jordans that I just bought and that's all you can say, "Excuse me? | 0.334317 |
| DA MAYOR | 66-street | 53 | You didn't have to hit your son; he's scared to death as it was. | 0.324264 |
| JADE | 68-sal's-famous-pizzeria | 23 | Mookie, you can hardly pay your rent and you're gonna tell me what to do. | 0.299531 |
| MISTER SEÑOR LOVE DADDY | 69-control-booth | 1 | As the evening slowly falls upon us living here in Brooklyn, New York, this is ya Love Daddy rappin' to you. | 0.287197 |
| MOOKIE | 97-sal's-famous-pizzeria | 25 | I know I wants to get my money. | 0.255000 |
| MOTHER SISTER | 96-mother-sister's-bedroom | 10 | I didn't. | 0.294582 |
| PINO | 77-storeroom | 26 | He, them, they're not to be trusted. | 0.339005 |
| SAL | 71-sal's-famous-pizzeria | 3 | Mookie, I don't know what you're talking about, plus I don't want to hear it. | 0.288566 |
| TINA | 73-tina's-apartment | 26 | You think I'm gonna let you get some, put on your clothes, then run outta here and never see you again in who knows when? | 0.324863 |
| stage_direction | 92-street | 1 | Jade and Mother Sister try to hold on to a streetlamp as a gush of water hits them; their grips loosen, the water is too powerful, and they slide away down the block and Da Mayor runs after them. | 0.256906 |
Choosing what gets modelled
Whenever you are trying to calculate the prototypicality or similarity of texts, you first need to ask yourself: in which sense should the texts be similar? One trivial kind of similarity is sentence length. For this, you would group your data into bins, find the average sentence length, and compare this average to the length of some new text. Obviously, such a method isn't exactly the bleeding edge of linguistics. Nonetheless, buzz/pandas can help you with it; see the pandas section of these docs for some recipes.
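As a sketch of that trivial length-based approach, the pandas recipe below groups sentences into bins, averages their lengths, and finds the bin whose average is closest to a new text's length. The column names and values here are invented for illustration, not buzz's own format.

```python
import pandas as pd

# Toy data: one row per sentence, with its bin and token count
df = pd.DataFrame({
    "speaker": ["SAL", "SAL", "MOOKIE", "MOOKIE"],
    "sent_length": [12, 18, 5, 7],
})

# Average sentence length per bin
averages = df.groupby("speaker")["sent_length"].mean()

# Compare a new text's sentence length to each bin's average
new_length = 6
distance = (averages - new_length).abs().sort_values()
print(distance.index[0])  # the bin with the closest average
```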
You're probably more interested in a more complex kind of text similarity, based on grammar, word choices, or both. For this kind of research question, first consider whether you should remove or normalise anything in your corpus. If you don't care about punctuation, you might like to drop it with `loaded_corpus.skip.wordclass.PUNCT`. From there, you could also remove words you think are unimportant, or normalise tokens or lemmata as you wish.
After modifying the dataset, the next step is to think about what to use as the `show` argument, as this can have a big effect on prototypicality and similarity results.
Words and wordings
Think of language as a combination of lexis (words used) and grammar (how they are ordered), with things like part-of-speech resting in the middle:
`token -> normalised word -> lemma -> wordclass -> part-of-speech -> dependency label`
If you're trying to look at how the bins differ in word choice, use `show='w'` only. If you are interested in the grammatical differences between groups, you might like to try wordclasses, so that you are modelling patterns like `DETERMINER ADJECTIVE NOUN VERB DETERMINER NOUN` (instead of *the hungry cat ate the rat*).
`show` can be a list, in which case multiple token features are used in the model. `show=['f', 'x']`, for example, would model the text by dependency labels and wordclasses: *I run* would become `nsubj/PRON root/VERB`. Among other effects, this would negate the influence of tense, showing no distinction between *I run* and *I ran*.
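The feature-joining idea can be sketched as follows: each token is represented by its selected features joined with `/`. The DataFrame layout here is a simplified stand-in for buzz's token-level format, not its real schema.

```python
import pandas as pd

# Toy token table: word, wordclass, and dependency label for "I run"
tokens = pd.DataFrame({
    "w": ["I", "run"],       # word
    "x": ["PRON", "VERB"],   # wordclass
    "f": ["nsubj", "root"],  # dependency label
})

# Join the chosen features for each token with '/'
show = ["f", "x"]
joined = tokens[show].agg("/".join, axis=1)
print(" ".join(joined))  # -> nsubj/PRON root/VERB
```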
Word and document vectors
buzz can access the word and document vectors generated by spaCy and gensim, and use these to evaluate similarity.
For `Corpus` and `Dataset` objects, simply use the `vector` property. Below, we calculate the similarity between everything said by Mookie and everything said by Sal in the *Do the Right Thing* corpus:

```python
sal = dtrt.just.speaker.SAL.vector
mookie = dtrt.just.speaker.MOOKIE.vector
sal.similarity(mookie)
```
You can also score against new strings:
```python
sal.similarity('This is a sentence to judge similarity against.')
```
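Under the hood, this kind of similarity typically means cosine similarity between document vectors, where a document vector is the mean of its word vectors. The sketch below shows the idea with toy 3-dimensional embeddings rather than spaCy's real vectors.

```python
import numpy as np

def doc_vector(word_vectors):
    """A document vector as the mean of its word vectors."""
    return np.mean(word_vectors, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors standing in for two speakers' combined speech
sal = doc_vector([[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]])
mookie = doc_vector([[0.8, 0.3, 0.0], [1.0, 0.0, 0.2]])

print(cosine(sal, mookie))
```

Similar documents produce vectors pointing in similar directions, so the score approaches 1.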