Keyword extraction in bloggen

Keywords: multi word keyword, original input text, window step, document vector, matrix, graph, post, pagerank, idea. Powered by TextRank.



A new addition after the similar posts feature in bloggen was to explore machine learned keyword extraction. This is a complicated/math heavy topic that required more effort than I would have anticipated. The aim is the following: Given a corpus of text (in this case a blog post) extract the word that would have the most meaning in giving the reader an idea of what the post is about.

There are various methods of doing this but I thought that the TextRank approach seemed to be the coolest. The idea is the based on the following premise:

Words that are close to each other are connected. Words that have a lot of incoming connections are important and are good keyword candidates. Connected words can be represented as a graph and graph node ranking can be done with the PageRank algortihm of Google fame (hence the name TextRank). This is a pretty neat idea from the original paper .

Deeper dive

The first step is to build that graph. Words that occur within a certain window are assumed to be connected. The window size is an adjustable parameter. My experiments yielded the best results with 5. An example:

After you compute the vector for every document you can arrange them in a matrix such that each row represents the document vector

As with any NLP task you need to preprocess the input and get rid of the noise.

compute vector document arrange matrix row represents document vector

Turns out that adjective and nouns make the best keywords, so using a part-of-speech tagger to get rid of the useless tags improve the results. Implementing a PoS taggger is another blog post. This is what the text reduces to after PoS filtering:

"vector", "document", "matrix", "row", "document", "vector"

Now let's take a window size of 3 and build the graph, here's what the sliding window steps look like:

["vector", "document", "matrix"], "row", "document", "vector"
"vector", ["document", "matrix", "row"], "document" ,"vector"
"vector", "document", ["matrix", "row", "document"], "vector"
"vector", "document", "matrix", ["row", "document", "vector"]

We'll use the adjecency matrix representation for the graph as this will come in handy when computing the PageRank

         vector  document  matrix  row
vector     0        1        1      0
document   1        0        1      0
matrix     1        1        0      0
row        0        0        0      0

This is the matrix we would get after the 1st step. I'm not going to go through all the steps here because it's too much, but in end we'll end up with something like this where any value other than 0 indicates a connection between the nodes.

         vector  document  matrix  row
vector     0        4        2      2
document   4        0        2      2
matrix     2        2        0      2
row        2        2        2      0

As expected there are no self references allowed so the diagonal is all 0. From looking at this matrix is seems like document is an important node in the graph. Let's normalize the matrix

$$ \begin{bmatrix} 0 & 0.5 & 0.25 & 0.25 \\ 0.5 & 0 & 0.25 & 0.25 \\ 0.33 & 0.33 & 0 & 0.33 \\ 0.33 & 0.33 & 0.33 & 0 \end{bmatrix} $$

After we transpose, we can run PageRank on this square matrix with the starting rank vector $v = \begin{vmatrix}0.25 & 0.25 & 0.25 & 0.25\end{vmatrix}$.

$$ \begin{bmatrix} 0 & 0.5 & 0.33 & 0.33 \\ 0.5 & 0 & 0.33 & 0.33 \\ 0.25 & 0.25 & 0 & 0.33 \\ 0.25 & 0.25 & 0.33 & 0 \end{bmatrix} \times \begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix} $$

The intuition behind PageRank is that at each step we will adjust the importance of a node by the amount of incoming links. This turns out to be the operation $Av$. Repeating this operation until the values in the rank vector converge and stop changing by any significant amount is how we get the final ranks. This can be represented as $A^{n}v$ where n is the number of epoch we'll be converging for. If we reach convergence before n then we'll just exit the loop.

So, after one iteration our ranking matrix is

$$ v = \begin{bmatrix} 0.29 \\ 0.29 \\ 0.2075 \\ 0.2075 \end{bmatrix} $$

after 2200 iterations it is at

$$ v = \begin{bmatrix} 0.00267546 \\ 0.00267546 \\ 0.00201161 \\ 0.00201161 \end{bmatrix} $$

which is the convergence point. Based on this output the first 2 items of the input vector and document can be considered the most import strings in this text and can be used as keywords. We already have good data to put the keywords out there but there is one final post processing step we need and that is to scan the data for multi-word keywords.

The post-processing step is based on the fact that if N words are extracted as individual keywords yet they appear together in the original text, it's likely that they are multi-word keywords. This doesn't always perform great especially if there is puncuation involved. We create a list of all the possible arrangements of 2 to N keyword combinations from the top 15 keywords and search for them in the text. If a match is found then we can consider it a multi keyword phrase.

In our original input text, "document vector" would be a multi-word keyword as these two are the extracted keywords and appear together. If keywords make a multi-word keyword I don't include them as single word keywords but I can see the case that they should be included.


Similar posts

Powered by TF-IDF/Cosine similarity

First published on 2024-03-23 ~ 2024-05-04

Generated on May 29, 2024, 10:02 PM


Mobile optimized version. Desktop version.