**Keywords: **
inverse document frequency, term frequency, similar document, similarity formula, similarity score, simple formula, words, text, word, vector, idea, compute.
*Powered by TextRank.*

I've added a new feature to the end of each post that lists similar posts to it. What does similar really mean though? Is it similar because it talks about the same thing? Is it similar because it shares other attributes like tags or it was written on close dates? Or maybe the edit distance is close?

Thankfully this is a question answer long ago and there are various metrics of similarity. For finding similar documents there is a simple method that does OK. And it's the cosine similarity metric. The idea is the following:

If 2 vectors in space pointing towards the same direction they are the same. If they are pointing in opposite directions they or they are at a 90 degree angle they are not. It's easy to visualize this in 2D but angles get confusing once you have more dimensions. There is also a simple formula to calculate the angle:

$$ \cos(\alpha) = \frac{ A \cdot B}{ ||A|| ||B|| } $$

cos(alpha) = dotProduct(A,B) / magnitude(A) * magnitude(B)

The `dotProduct(A,B)`

is defined as

$$ A \cdot B = \sum_{i = 0}^{N}{A_iB_i} $$

sum(A[i] * B[i]) for i in 0...A.count

and the magnitude is

$$ \sum_{i = 0}^{N}{A_i^2} $$

sqrt(sum(A[i]*A[i])) for i in 0...A.count

Obviously the vectors needs to be of the same dimension so you may need apply padding as needed for any missing dimensions.

But the main question is how do you convert documents of text to vectors? This has been hyped a lot recently especially with the advent of LLM tech. But this is a simple static site blog so I don't really have the resources to run an expensive model. Another simpler idea that has been around forever is to look at the words in documents and try to figure out their relative importance within the given documents corpus. If the word "parser" appears in a couple of documents within corpus then we can assume it conveys more information then say the word "text" which appears in almost all of the documents.

The idea was formalized as TF-IDF which mean term frequency / inverse document frequency. This is a score that we will compute for each term in our documents to make a vector.

The number of times a term appears in a document. This basically a histogram per document of words that make it up.

For the inverse part for a second. Document frequency is the number of documents that contained this word. The inverse is `1/document frequency`

(actually there is a different formula to account for precision loss).

log( Double(1 + count(documents)) / Double( 1 + appearances) ) + 1

So basically you are scaling down the frequency of a term by its appearance across all documents. If it is a common term then its score will decrease. If it appears a lot in one document its score will increase. The formula incentives specialization and penalizes generalization.

After you compute the TF and IDF, the vector computation is just multiplying these 2 values and assigning them to the index of each token. Here is an example:

Document A: I like gas cars Document B: I like electric cars Histogram A: [I: 1, like: 1, gas: 1, cars: 1] Histogram B: [I: 1, like: 1, electric: 1, cars: 1] TF A: [cars: 1, electric: 0, gas: 1, I: 1, like: 1] TF B: [cars: 1, electric: 1, gas: 0, I: 1, like: 1] IDF A: [cars: 1/2, electric: 1/1, gas: 1/1, I: 1/2, like: 1/2] TF/IDF table cars electric gas I like Document A: 0.5 0 1 0.5 0.5 Document B: 0.5 1 0 0.5 0.5

When implementing this algorithm here are some of the pitfalls to avoid.

These are the kinds of words that are used frequently in all text and bare importance. The algorithm should weed these out but these introduce noise and lower the performance. We know beforehand that words like "then", "and" etc. don't add much to the information content of the document so we can just remove the noise first before any processing.

Text contains numbers, punctuation, emoji's etc. that need to be removed. I would also recommend removing short words, maybe anything less than 3 characters. You can also go a step further and lemmatize text to remove different variations of the word that have the same meaning. E.g "Books" and "Book" should be considered the same token. There are ready made libraries for these and the process requires data and is not straightforward to implement so I skipped it.

The most important point here is after creating the histogram for all the documents, you need to to pad the histograms of all the documents with the missing tokens from the documents. E.g

Document A: I like electric cars

tokens are: `[I, like, electric, cars]`

and

Document B: I like gas cars

tokens are: `[I, like, gas, cars]`

.

These 2 vectors have the same size as is but you cannot compare them or operate on them. You need to take the union of these and use `[I, like, electric, gas, cars]`

.

To make sure each documents vector has corresponding elements you need to sort the tokens some way that is stable across documents. I sorted them alphabetically.

Another good idea is to normalize the vectors. The stops longer documents from skewing the vectors.

After you compute the vectors for every document, you can arrange them in a matrix such that each row represents the vector of the document. Now it's easy to compute another matrix by applying the cosine similarity formula pairwise between the rows to create a new matrix that has all the similarity scores between documents. If you traverse the row you can pick the most similar documents to the query document based on some threshold score. I set this to about 0.5 but you can experiment with different values. Selecting the threshold is more of a recall/precision and parameter tuning problem.

*Powered by TF-IDF/Cosine similarity*

First published on 2024-02-24

Generated on May 29, 2024, 10:02 PM

Mobile optimized version. Desktop version.