Dear All,
I am using Solr to index job postings. Each document in my collection
is approximately 500 to 1500 words long and the collection holds ~3
million documents.
Several documents in the collection are near-duplicates of one another
(differing by only 5-10 words out of a thousand) and I wish to identify
and deduplicate them.
Think “Solr, give me a list of documents that are similar to this one
by a factor of 0.9 or more”.
The most promising solution to my use case seems to be the use of
MinHash and Locality-Sensitive Hashing (LSH), which translates to using
the MinHash filter
(https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#minhash-filter)
at indexing time in Solr, followed by the MinHash Query Parser
(https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#minhash-query-parser)
at query time. The hashes are created on the basis of the full text of
each job posting.
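For context, here is roughly how I set up the field, expressed as a minimal
sketch against the Schema API (the collection name "jobs", the source field
"description", the type name "text_minhash" and the analysis parameters below
are simplified and illustrative rather than my exact configuration):

import requests

# Illustrative sketch: register a MinHash field type, a field using it, and a
# copyField from the posting text, via Solr's Schema API.
SOLR = "http://localhost:8983/solr/jobs"  # assumed collection name

field_type = {
    "add-field-type": {
        "name": "text_minhash",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                # 5-word shingles feed the MinHash filter
                {"class": "solr.ShingleFilterFactory",
                 "minShingleSize": "5", "maxShingleSize": "5",
                 "outputUnigrams": "false", "tokenSeparator": " "},
                {"class": "org.apache.lucene.analysis.minhash.MinHashFilterFactory",
                 "bucketCount": "512", "hashSetSize": "1", "hashCount": "1"},
            ],
        },
    }
}

field = {
    "add-field": {
        "name": "description_minhash",
        "type": "text_minhash",
        "indexed": True,
        "stored": False,
    }
}

copy_field = {
    "add-copy-field": {
        "source": "description",  # assumed name of the full-text field
        "dest": "description_minhash",
    }
}

for payload in (field_type, field, copy_field):
    requests.post(f"{SOLR}/schema", json=payload).raise_for_status()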
My question for the community is: what is the canonical way to query
with the MinHash Query Parser?
Based on the example at
https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#example-with-analysed-fields,
I fed the entire text of a document (400+ words) to the MinHash Query
Parser in order to find its near-duplicates:
"q": "{!min_hash field=\"description_minhash\" sim=\"0.9\"}Create
an outstanding customer experience through exceptional service [400+
more words here]"
This worked quite well, but I am afraid I might hit limits on the
request payload size when larger documents are involved.
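For what it is worth, the request I currently send looks roughly like the
sketch below (the collection name, endpoint and extra parameters are
simplified and illustrative; the real text of course runs to 400+ words):

import requests

SOLR = "http://localhost:8983/solr/jobs"  # assumed collection name
doc_text = "Create an outstanding customer experience through exceptional service ..."

params = {
    "q": '{!min_hash field="description_minhash" sim="0.9"}' + doc_text,
    "fl": "id,score",
    "rows": 20,
}
# POST keeps the long query text out of the URL, but the full text still has
# to travel in the request body.
resp = requests.post(f"{SOLR}/select", data=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc)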
Is there a better way?
Thank you,
Corrado