Prototypicality and similarity

buzz contains tools that help judge the similarity of parsed documents using TF-IDF metrics and word embeddings, much as a search engine ranks documents by their similarity to a search query.

TF-IDF prototypicality

The following will return a multi-indexed pandas.Series containing each sentence and its similarity to each language model (i.e. bin):

from buzz import Corpus

# load the parsed corpus into memory
dtrt = Corpus('do-the-right-thing-parsed').load()
# score each sentence against a per-speaker model built from lemmata
proto = dtrt.proto.speaker.showing.lemmata

proto begins by segmenting the corpus into bins by the feature of interest (speaker in the case above). It then builds a TF-IDF model of each bin using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from scikit-learn. These models are stored in memory alongside the corpus. Then, each sentence is scored against each model.
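
Conceptually, the procedure resembles the sketch below. Note that this is an illustration rather than buzz's actual implementation: the tiny bins are invented stand-ins for the per-speaker segments, and scoring via best-match cosine similarity is an assumption made for demonstration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical mini-bins standing in for the per-speaker segments
bins = {
    'MOOKIE': ["I want my money.", "When you gonna pay me?"],
    'SAL': ["This is a respectable business.", "Get back to work."],
}
sentence = "I know I wants to get my money."

for speaker, sents in bins.items():
    vectorizer = TfidfVectorizer()
    model = vectorizer.fit_transform(sents)       # one TF-IDF model per bin
    vec = vectorizer.transform([sentence])        # vectorise the new sentence
    score = cosine_similarity(vec, model).max()   # best match within the bin
    print(speaker, round(float(score), 6))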

To customise results further, you can call the accessor with brackets, which gives you some extra options and allows you to construct the language model from combinations of word features:

  • show is a list of features to join together when constructing the language model.
  • only_correct can be switched off, in order to see how every sentence compares to every model.
  • n_top_members can remove infrequent members of the metadata field of interest. This can speed up the operation, remove junk, and prevent outliers from having a large effect.
  • top can be used to quickly filter just the n most similar sentences per bin. So, setting it to 1 will show you just the most prototypical sentence from each bin.

model_format = ['l', 'x']  # model corpus as lemma/wordclass tuples, rather than words
proto = dtrt.proto.speaker(show=model_format, only_correct=True, n_top_members=5, top=1)
# make the data print a little nicer as html:
proto.reset_index().drop('l/x', axis=1).to_html(index=False)

If you've seen the film, you may find yourself thinking that these sentences are quite typical of their respective speakers.

| speaker | file | s | text | similarity |
|---|---|---|---|---|
| BUGGIN' OUT | 33-street | 18 | Not only did you knock me down, you stepped on my new white Air Jordans that I just bought and that's all you can say, "Excuse me? | 0.334317 |
| DA MAYOR | 66-street | 53 | You didn't have to hit your son; he's scared to death as it was. | 0.324264 |
| JADE | 68-sal's-famous-pizzeria | 23 | Mookie, you can hardly pay your rent and you're gonna tell me what to do. | 0.299531 |
| MISTER SEÑOR LOVE DADDY | 69-control-booth | 1 | As the evening slowly falls upon us living here in Brooklyn, New York, this is ya Love Daddy rappin' to you. | 0.287197 |
| MOOKIE | 97-sal's-famous-pizzeria | 25 | I know I wants to get my money. | 0.255000 |
| MOTHER SISTER | 96-mother-sister's-bedroom | 10 | I didn't. | 0.294582 |
| PINO | 77-storeroom | 26 | He, them, they're not to be trusted. | 0.339005 |
| SAL | 71-sal's-famous-pizzeria | 3 | Mookie, I don't know what you're talking about, plus I don't want to hear it. | 0.288566 |
| TINA | 73-tina's-apartment | 26 | You think I'm gonna let you get some, put on your clothes, then run outta here and never see you again in who knows when? | 0.324863 |
| stage_direction | 92-street | 1 | Jade and Mother Sister try to hold on to a streetlamp as a gush of water hits them; their grips loosens, the water is too powerful, and they slide away down the block and Da Mayor runs after them. | 0.256906 |

Choosing what gets modelled

Whenever you are trying to calculate prototypicality or similarity of text, you first need to ask yourself: in which sense should the texts be similar? One trivial kind of similarity is sentence length. For this, you would group your data into bins, find the average sentence length of each, and compare these averages to the length of some new text, as in the sketch below. Obviously such a method isn't exactly the bleeding edge of linguistics. Nonetheless, buzz/pandas can help you with this; see the pandas section of these docs for some pandas recipes.
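
As a quick illustration, here is one way to do the length comparison with plain pandas. This sketch assumes the loaded corpus is a token-per-row DataFrame with a speaker column and file/s index levels, as in the examples above; the new text and its whitespace tokenisation are stand-ins.

# tokens per sentence, then the average sentence length per speaker bin
lengths = dtrt.groupby(['speaker', 'file', 's']).size()
avg_length = lengths.groupby('speaker').mean()

# naive comparison of a new text's length against each bin's average
new_text = 'Give me my money.'
n_tokens = len(new_text.split())
closest = (avg_length - n_tokens).abs().idxmin()
print(closest)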

You're probably much more interested in a more complex kind of text similarity, based on the grammar, the word choices, or both. For this kind of research question, first consider whether you should remove or normalise anything in your corpus. If you don't care about punctuation, you can drop it with loaded_corpus.skip.wordclass.PUNCT. From there, you could also remove words you think are unimportant, or normalise tokens or lemmata as you wish.
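
For example, you might do something like the following before modelling. The skip accessor is the buzz method mentioned above; the lowercasing step uses plain pandas column assignment, which is an assumption here rather than a dedicated buzz normalisation API, as is the filtered result supporting the same proto accessor.

cleaned = dtrt.skip.wordclass.PUNCT        # drop punctuation tokens
cleaned['w'] = cleaned['w'].str.lower()    # normalise word forms to lowercase
proto = cleaned.proto.speaker.showing.lemmata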

After modifying the dataset, the next step is to think about what to use as the show argument, as this can have a big effect on prototypes and similarity.

Words and wordings

Think of language as a combination of lexis (words used) and grammar (how they are ordered), with things like part-of-speech resting in the middle:

token -> normalised word -> lemma -> wordclass -> part-of-speech -> dependency label

If you're trying to look at how the bins differ in word choice, use show='w' only. If you are interested in the grammatical differences between groups, you might like to try wordclasses, so that you are modelling patterns like DETERMINER ADJECTIVE NOUN VERB DETERMINER NOUN (instead of the hungry cat ate the rat).

show can be a list, in which case multiple token features are used in the model. show=['f', 'x'], for example, would model the text by dependency labels and wordclasses. I run would become nsubj/PRON root/VERB. Among other effects, this would negate the influence of tense, showing no distinction between run and ran.
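
Continuing with the corpus loaded earlier, the two ends of this spectrum might look like the following, reusing the bracketed call shown above:

# purely lexical model: word forms only
word_proto = dtrt.proto.speaker(show=['w'], top=1)
# purely grammatical model: dependency labels plus wordclasses
grammar_proto = dtrt.proto.speaker(show=['f', 'x'], top=1)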

Word and document vectors

buzz can access the word and document vectors generated by spaCy and gensim, and use these to evaluate similarity.

For Corpus and Dataset objects, simply use the vector property. Below, we calculate the similarity between everything said by Mookie and everything said by Sal in the Do the Right Thing corpus:

# everything said by each speaker, as a document vector
sal = dtrt.just.speaker.SAL.vector
mookie = dtrt.just.speaker.MOOKIE.vector
# compare the two document vectors
sal.similarity(mookie)

You can also score against new strings:

sal.similarity('This is a sentence to judge similarity against.')