Dear All,

I am using Solr to index job postings. Each document in my collection is approximately 500 to 1500 words long and the collection holds ~3 million documents.

Several documents in the collection are near-duplicates (5-10 different words out of a thousand), and I wish to identify and deduplicate them. Think “Solr, give me a list of documents that are similar to this one by a factor of 0.9 or more”.

The most promising solution for my use case seems to be MinHash with Locality-Sensitive Hashing (LSH), which translates to using the MinHash filter (https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#minhash-filter) at indexing time in Solr, followed by the MinHash Query Parser (https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#minhash-query-parser) at query time. The hashes are computed from the full text of each job posting.
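For context, the description_minhash field type is set up along the lines of the example in the indexing guide, roughly like this (the shingle and bucket parameters below are the values shown in the guide, not something I have tuned for this corpus):

    <fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- tokenize and fold case/diacritics before shingling -->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- 5-word shingles feed the MinHash computation -->
        <filter class="solr.ShingleFilterFactory" minShingleSize="5" maxShingleSize="5"
                outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
        <!-- 512 buckets, one hash per bucket -->
        <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory"
                bucketCount="512" hashSetSize="1" hashCount="1"/>
      </analyzer>
    </fieldType>

If I read the docs correctly, with these settings each document is reduced to at most 512 hash tokens regardless of its length.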

My question for the community is: what is the canonical way to query the MinHash Parser?

Based on the example at https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#example-with-analysed-fields, I fed the entire text of a document (400+ words) to the MinHash Query Parser in order to find its near-duplicates:

"q":"{!min_hash field=\"description_minhash\" sim=\"0.9\"}Create an outstanding customer experience through exceptional service [400+ more words here]"

This works quite well, but I am afraid I will run into request-size limits (URL length or payload) when larger documents are involved.
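I can send the query in a POST body via the JSON Request API instead, which should at least avoid the URL-length part of the problem; a rough sketch (the collection name and the limit are made up):

    curl -X POST -H 'Content-Type: application/json' \
      http://localhost:8983/solr/jobs/select \
      -d '{
        "query": "{!min_hash field=description_minhash sim=0.9}Create an outstanding customer experience through exceptional service [400+ more words here]",
        "limit": 20
      }'

But that only moves the limit; it still means shipping the full text of the document as the query for every lookup.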

Is there a better way?

Thank you,
Corrado
