I previously posted about how to implement a client-side search engine with lemmatization for better natural language understanding and query processing. Since then I've been thinking a bit more about this problem and realized two shortcomings: the search required downloading the source of every document to the client, and a match couldn't be highlighted across all the inflected forms of a word.
So I wanted to address these. The first issue needed a restructuring of the way the index was represented. The old engine had no inverted index; the search just looked for multiple versions of the query in all the documents, so the source of every document had to be downloaded to the client. I've now changed this to a completely different approach: an inverted index is created by the compiler and serves as the main index for the search. I tried using a trie structure to store this index, but the overhead of the JSON formatting resulted in a worse file size than just storing the words in a dictionary: the trie came out to about 500 KB, while the dict was 350 KB.
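As a rough sketch of the compile-time step described above (the post doesn't show its actual code, so the tokenization and document ids here are illustrative assumptions), an inverted index stored as a flat dictionary is just a word-to-document-ids mapping that serializes compactly to JSON:

```python
import json

def build_inverted_index(docs):
    """Build a flat word -> sorted list of doc ids mapping.

    `docs` maps a document id to its raw text. The whitespace/lowercase
    tokenizer is a stand-in; a real compiler would tokenize properly.
    """
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    # Sets are not JSON-serializable; convert to sorted lists.
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    "post-1": "the compiler builds the index",
    "post-2": "the client downloads the index",
}
index = build_inverted_index(docs)
print(index["index"])                 # "index" appears in both documents
print(len(json.dumps(index)))         # flat dict, no per-node trie overhead
```

At query time the client only needs this JSON file, not the document sources, which is what removes the download-everything problem.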
To address the second issue, the compiler generates a mapping from the root of a word to all the inflections found in the documents. So if the words "paying" and "pays" appear in a doc, the rootMapping variable in the index will contain pay -> [paying, pays]. This is used on the client side to do the highlighting: if the user searches for "paid", the searching code expands the search to encompass all the versions of the word that were indexed.
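The rootMapping flow above can be sketched roughly as follows. The `lemmatize` stub here is hypothetical (a tiny irregular-forms lookup plus naive suffix stripping); the actual engine would use a proper lemmatizer:

```python
def lemmatize(word):
    """Hypothetical stand-in for a real lemmatizer."""
    irregular = {"paid": "pay"}  # tiny lookup for irregular forms
    if word in irregular:
        return irregular[word]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_root_mapping(words):
    """Compile-time: map each root to every inflection seen in the docs."""
    mapping = {}
    for word in words:
        mapping.setdefault(lemmatize(word), []).append(word)
    return mapping

def expand_query(term, root_mapping):
    """Client-side: expand a query term to all indexed inflections of the
    same root, so each variant can be matched and highlighted."""
    return root_mapping.get(lemmatize(term), [term])

root_mapping = build_root_mapping(["paying", "pays"])
print(expand_query("paid", root_mapping))
```

So a search for "paid" reduces to the root "pay" and expands to the indexed forms "paying" and "pays", which is exactly what the highlighter needs.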
In the previous version I had also incorporated synonym expansion, but that turned out to be subpar simply because synonyms are context dependent. Without that context, the expansion produced low-quality results, so I ditched it.
First published on 2021-10-03