Dear All,
I am using Solr to index job postings. Each document in my collection
is approximately 500 to 1500 words long and the collection holds ~3
million documents.
Several documents in the collection are near-duplicates of one another
(differing by only 5-10 words out of a thousand) and I wish to identify
and deduplicate them.
Think “Solr, give me a list of documents that are similar to this one
by a factor of 0.9 or more”.
The most promising solution to my use case seems to be the use of
MinHash and Locality-Sensitive Hashing (LSH), which translates to using
the MinHash filter
(https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#minhash-filter)
at indexing time in Solr, followed by the MinHash Query Parser
(https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#minhash-query-parser)
at query time. The hashes are created on the basis of the full text of
each job posting.
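For context, here is roughly how I set up the field, expressed as a minimal
sketch against the Schema API (the collection name "jobs", the source field
"description", the type name "text_minhash" and the analysis parameters below
are simplified and illustrative rather than my exact configuration):

import requests

# Illustrative sketch: register a MinHash field type, a field using it, and a
# copyField from the posting text, via Solr's Schema API.
SOLR = "http://localhost:8983/solr/jobs"  # assumed collection name

field_type = {
    "add-field-type": {
        "name": "text_minhash",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                # 5-word shingles feed the MinHash filter
                {"class": "solr.ShingleFilterFactory",
                 "minShingleSize": "5", "maxShingleSize": "5",
                 "outputUnigrams": "false", "tokenSeparator": " "},
                {"class": "org.apache.lucene.analysis.minhash.MinHashFilterFactory",
                 "bucketCount": "512", "hashSetSize": "1", "hashCount": "1"},
            ],
        },
    }
}

field = {
    "add-field": {
        "name": "description_minhash",
        "type": "text_minhash",
        "indexed": True,
        "stored": False,
    }
}

copy_field = {
    "add-copy-field": {
        "source": "description",  # assumed name of the full-text field
        "dest": "description_minhash",
    }
}

for payload in (field_type, field, copy_field):
    requests.post(f"{SOLR}/schema", json=payload).raise_for_status()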
My question for the community is: what is the canonical way to query
with the MinHash Query Parser?
Based on the example at
https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#example-with-analysed-fields,
I fed the entire text of a document (400+ words) to the MinHash Query
Parser in order to find its near-duplicates:
"q": "{!min_hash field=\"description_minhash\" sim=\"0.9\"}Create
an outstanding customer experience through exceptional service [400+
more words here]"
This worked quite well, but I am afraid I might hit limits on the
request payload size when larger documents are involved.
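For what it is worth, the request I currently send looks roughly like the
sketch below (the collection name, endpoint and extra parameters are
simplified and illustrative; the real text of course runs to 400+ words):

import requests

SOLR = "http://localhost:8983/solr/jobs"  # assumed collection name
doc_text = "Create an outstanding customer experience through exceptional service ..."

params = {
    "q": '{!min_hash field="description_minhash" sim="0.9"}' + doc_text,
    "fl": "id,score",
    "rows": 20,
}
# POST keeps the long query text out of the URL, but the full text still has
# to travel in the request body.
resp = requests.post(f"{SOLR}/select", data=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc)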
Is there a better way?
Thank you,
Corrado