Client side search engine v2

Keywords: search engine, index size, word, client, document, context, lemmatization, match, compiler, issue, address, query, trie. Powered by TextRank.

I previously posted about implementing a client-side search engine that uses lemmatization for better natural-language understanding and query processing. Since then I've been thinking more about the problem and noticed two shortcomings:

The index needs to be much smaller than its current 1.1 MB, because it takes too long to load and parse, and the client freezes while it does.
Highlighting in the search results can fail because of the lemmatization. If a document contains the word "paying" but the search term was "paid", the document still matches, yet the match cannot be highlighted because the literal search term never appears in the text.

So I wanted to address these. The first issue required restructuring how the index is represented. The old engine had no inverted index; the search just looked for multiple versions of the query in every document, so the source of all the documents had to be downloaded to the client. I have now switched to a completely different approach: the compiler creates an inverted index, which is the main index for the search. I tried storing this index in a trie, but the overhead of the JSON formatting made the file larger than simply storing the entries in a dictionary. The trie came to about 500 KB, while the dictionary came to 350 KB.
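To make the size trade-off concrete, here is a minimal sketch of the idea in Python. It is not the actual compiler: the `LEMMAS` lookup table stands in for a real lemmatizer, and the function names, the `"$"` postings key and the sample documents are all made up for illustration. It builds an inverted index as a plain dictionary, converts the same data to a nested character trie, and compares the serialized JSON sizes.

```python
import json
import re
from collections import defaultdict

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table.
LEMMAS = {"paying": "pay", "pays": "pay", "paid": "pay", "indexes": "index"}


def lemmatize(word: str) -> str:
    """Return the root of a word, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each lemma to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[lemmatize(word)].add(doc_id)
    return {lemma: sorted(ids) for lemma, ids in index.items()}


def to_trie(index: dict[str, list[str]]) -> dict:
    """Store the same index as a nested character trie.

    Each character becomes one nested object; the "$" key (an arbitrary
    choice for this sketch) holds the postings list at the end of a word.
    """
    root: dict = {}
    for lemma, ids in index.items():
        node = root
        for ch in lemma:
            node = node.setdefault(ch, {})
        node["$"] = ids
    return root


# Made-up sample documents.
docs = {
    "post-1": "She pays the invoice.",
    "post-2": "Paying late indexes a fee.",
}

index = build_inverted_index(docs)
flat_size = len(json.dumps(index, separators=(",", ":")))
trie_size = len(json.dumps(to_trie(index), separators=(",", ":")))
print(f"dictionary: {flat_size} bytes, trie: {trie_size} bytes")
```

Every trie node costs a quoted key, a colon and a pair of braces in JSON, so unless many words share long prefixes the trie serializes larger than the flat dictionary, which is consistent with what I saw.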

To address the second issue, the compiler generates a mapping from the root of each word to all the inflections found in the document. So if the document contains "paying" and "pays", the rootMapping variable in the index will contain pay -> [paying, pays]. The client uses this mapping for highlighting: if the user searches for "paid", the search code expands the search to cover every indexed version of the word.
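The expansion step can be sketched roughly as follows. This is Python for brevity rather than the real client code, and it rests on a few assumptions: the `LEMMAS` table again stands in for a real lemmatizer, the function names are hypothetical, only forms that differ from their root are stored, and matches are wrapped in an HTML `<mark>` element.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table.
LEMMAS = {"paying": "pay", "pays": "pay", "paid": "pay"}


def lemmatize(word: str) -> str:
    """Return the root of a word, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)


def build_root_mapping(text: str) -> dict[str, list[str]]:
    """Compiler side: map each root to the inflections found in the text."""
    mapping = defaultdict(set)
    for word in re.findall(r"[a-z]+", text.lower()):
        root = lemmatize(word)
        if word != root:  # assumption: store only forms that differ from the root
            mapping[root].add(word)
    return {root: sorted(forms) for root, forms in mapping.items()}


def expand_query(term: str, root_mapping: dict[str, list[str]]) -> list[str]:
    """Client side: expand a search term to every indexed form of its root."""
    root = lemmatize(term.lower())
    forms = {root, *root_mapping.get(root, [])}
    # Longest forms first, then alphabetical, so the order is deterministic.
    return sorted(forms, key=lambda form: (-len(form), form))


def highlight(text: str, term: str, root_mapping: dict[str, list[str]]) -> str:
    """Wrap every whole-word occurrence of an expanded form in <mark> tags."""
    forms = expand_query(term, root_mapping)
    alternatives = "|".join(re.escape(form) for form in forms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(r"<mark>\1</mark>", text)


doc = "She is paying now and pays again later."
root_mapping = build_root_mapping(doc)
print(root_mapping)
print(highlight(doc, "paid", root_mapping))
```

Searching for "paid" never touches the literal string "paid". The term is reduced to its root, the root is looked up in the mapping, and the surface forms that actually occur in the document are what get highlighted.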

The previous version also incorporated synonym expansion, but that turned out to be subpar because synonyms are context dependent. Without that context the expansion produced low-quality results, so I dropped it.
