Client side search engine v2

Keywords: search engine, index size, word, client, document, context, lemmatization, match, compiler, issue, address, query, trie. Powered by TextRank.

I previously posted about how to implement a client-side search engine with lemmatization for better natural language understanding and query processing. Since then I've been thinking a bit more about this problem and realized it has 2 shortcomings:

So I wanted to address these. The first issue needed a restructuring of the way the index was represented. In the old engine there was no inverted index; the search just looked for multiple versions of the query in all the documents, so the source of every document had to be downloaded to the client. I have now changed this to a completely different approach: an inverted index is created by the compiler and is the main index for the search. I tried to use a trie structure to store this index, but the overhead of the JSON formatting resulted in a worse file size than just storing the terms in a dictionary. The trie was about 500 KB, while the dict came to 350 KB.
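To make the dictionary-versus-trie comparison concrete, here is a minimal sketch in Python. It is not the actual compiler: the function names, the `"$"` marker key, and the sample documents are all made up for illustration, and lemmatization is left out to keep it short.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def build_trie(index):
    """Store the same postings in a nested-dict trie, one level per character.

    The "$" key (an assumption of this sketch) marks the end of a term and
    holds its postings; it cannot clash with a token character because
    tokens only contain a-z and 0-9.
    """
    root = {}
    for term, ids in index.items():
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = ids
    return root


def trie_lookup(trie, term):
    """Walk the trie character by character and return the postings, if any."""
    node = trie
    for ch in term:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


def json_size(obj):
    """Length in characters of the compact JSON encoding of obj."""
    return len(json.dumps(obj, separators=(",", ":")))


if __name__ == "__main__":
    # Hypothetical documents, only here to exercise the functions.
    docs = {
        "post-1": "The compiler builds an inverted index",
        "post-2": "A trie can store the index",
    }
    index = build_inverted_index(docs)
    print("dict bytes:", json_size(index))
    print("trie bytes:", json_size(build_trie(index)))
    print(search(index, "trie index"))
```

In the flat dictionary every character of a term costs one character of JSON, while in the nested trie every node that is not shared with another term costs a quoted key, a colon, and a pair of braces. Unless a lot of prefixes are shared, that overhead adds up quickly, which is in line with the size difference described above.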

To address the second issue, the compiler now generates a mapping from the root of a word to all the inflections of it found in the document. So if you have the words "paying" and "pays" in the doc, the rootMapping variable in the index will contain pay -> [paying, pays]. This is used on the client side to do the highlighting. So if the user searches for "paid", the search code expands the search to encompass all the versions of the word that were indexed.
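The root mapping and the query expansion can be sketched roughly like this. It is written in Python for brevity, although the real expansion runs in the browser. The `TOY_LEMMAS` table is a stand-in for a real lemmatizer, and all function names are assumptions of this sketch rather than the engine's actual code.

```python
import re
from collections import defaultdict

# Toy stand-in for a real lemmatizer (an assumption of this sketch).
TOY_LEMMAS = {"paying": "pay", "pays": "pay", "paid": "pay"}


def lemmatize(word):
    """Return the root of a word, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)


def build_root_mapping(text):
    """Compiler side: map each root to the inflections that occur in the text."""
    mapping = defaultdict(set)
    for word in re.findall(r"[a-z]+", text.lower()):
        mapping[lemmatize(word)].add(word)
    return {root: sorted(forms) for root, forms in mapping.items()}


def expand_query(query, root_mapping):
    """Client side: replace each query word with the indexed forms of its root."""
    forms = []
    for word in re.findall(r"[a-z]+", query.lower()):
        forms.extend(root_mapping.get(lemmatize(word), [word]))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(forms))


def highlight(text, forms):
    """Wrap every whole-word occurrence of the given forms in <mark> tags."""
    if not forms:
        return text
    alternatives = "|".join(re.escape(form) for form in forms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(r"<mark>\1</mark>", text)


if __name__ == "__main__":
    doc = "She pays rent while he is paying bills."
    root_mapping = build_root_mapping(doc)
    forms = expand_query("paid", root_mapping)
    print(forms)
    print(highlight(doc, forms))
```

Here "paid" never appears in the document, but because it shares the root "pay" with "pays" and "paying", both of those forms end up highlighted.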

In the previous version I had also incorporated synonym expansion, but that turned out to be subpar simply because synonyms for words are actually context dependent. Without that context the expanded queries returned low-quality results, so I ditched it.



First published on 2021-10-03

Generated on May 29, 2024, 10:02 PM

