Client side search engine

Keywords: search engine index, search query, search result, client side, result list, input query, document, lemmatize, enhance, site, map, root. Powered by TextRank.

Searching statically generating sites can be a bit unusual. You don't have a server side component that will execute the search so you either need to have an external search engine index your site or you need to have the client perform the search on an index that you provide for download. Having an external search engine index needs your site to provide the HTML for crawler requests and web application don't usually work well with this. You should probably have a fully rendered version of your site for SEO purposes but I'm not about SEO. So I went ahead and wrote a mini search engine for this purpose.

So when exporting the content from the source (it could be a markdown source etc.) the script also needs to generate the search index. The most basic structure for this index could a JSON file that contains the document URL mapped to it's content.

docs = {
    "file1.html": "this is some document with text. Crime never pays.",
    "file2.html": "this is another document. If crime paid things would be different."
}
docs

{'file1.html': 'this is some document with text. Crime never pays.',
 'file2.html': 'this is another document. If crime paid things would be different.'}

Just this would be enough to whip up a Javascript grep engine. But just grepping isn't a great search experience. You need to enhance the search engine with some language understanding to get better results. So for example in our 2 documents a search for "pays" would only result in document 1 if we were grepping. But document 2 actually also contains relevant information. The best way to capture this type of language understanding is to add lemmatization to the index.

First step in this process is to tokenize the input. I'm using the python nltk package for all the NLP stuff.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
words = []
for k, v in docs.items():
    words += word_tokenize(v)
words

['this',
 'is',
 'some',
 'document',
 'with',
 'text',
 '.',
 'Crime',
 'never',
 'pays',
 '.',
 'this',
 'is',
 'another',
 'document',
 '.',
 'If',
 'crime',
 'paid',
 'things',
 'would',
 'be',
 'different',
 '.']

There are duplicates, stop words and mixed case words in the result list, so let's clean it up.

words = [w.lower() for w in words if w not in stop_words]
words

['document',
 'text',
 '.',
 'crime',
 'never',
 'pays',
 '.',
 'another',
 'document',
 '.',
 'if',
 'crime',
 'paid',
 'things',
 'would',
 'different',
 '.']

That's better. Now we'll need a map to store the root (lemma) of the word and it's versions that were present in the text. On the client side we'll use this map to enhance our search.

lemma_map = {}
for w in words:
    if w in lemma_map: continue
    lemmas = list(set([lem.lemmatize(w, mode) for mode in list("avsn")]))
    if w in lemmas:
        lemmas.remove(w)
    if len(lemmas) == 1 and lemmas[0] == w: continue
    for lemma in lemmas:
        if lemma not in lemma_map:
            lemma_map[lemma] = []
        if w not in lemma_map[lemma]:
            lemma_map[lemma].append(w)
lemma_map

{'pay': ['pays', 'paid'], 'thing': ['things']}

Now we have the ability to enhance our search on the client side to include both "pays" and "paid" for the query. How is that going to work? We are going to lemmatize the input query. So if the user were to search for "paying" the root would be "pay" and we would search for "paid", "pay", "paying", "pays". Why not just lemmatize the whole body and just send that as the index file to the client? Well we want to highlight and show a snippet of the document to the user for the search results and we can't show that from the lemmatized version of the document body, so the original document needs to be in the client. There is a trade-off between the index size and the amount of processing the client has to do and the time it takes for the results to render. In my experiments I tried pushing the lemmatization process on the client using the same WordNet lemmatizer for JavaScript and the results were pretty bad. On a i9 Macbook Pro it took around 5 seconds to render the page (this doesn't not include the network time spent downloading the index).

On the client we lemmatize the search query and add the results to a list of extended search terms. Then for each of these terms we look up more terms from the lemma_map

search_term = "paying"
extended_terms = list(set([lem.lemmatize(search_term, mode) for mode in list("avsn")]))
for w in extended_terms:
    if w in lemma_map:
        extended_terms += lemma_map[w]
extended_terms

['pay', 'paying', 'pays', 'paid']

What's left is now to search through the contents of the documents for each of these terms.

for file, body in docs.items():
    for t in extended_terms:
        if t in body.lower():
            print("found ", file, "for term",t)

found  file1.html for term pay
found  file1.html for term pays
found  file2.html for term paid

There is some duplicate detection and search term highlighting that's also done on the client but it's not integral to the way the search index is built, so I won't be going into those details.


Metadata

Similar posts

Powered by TF-IDF/Cosine similarity

First published on 2021-02-15

Generated on May 29, 2024, 10:02 PM

Index

Mobile optimized version. Desktop version.