Hi Corrado,

Are you sending HTTP POST requests with the query in the body? If so, there 
shouldn't be any problem with the payload size. I have never used the MinHash 
parser myself, but I have sent other query types with large payloads that way 
and had no issues.
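
For example, something along these lines should work -- just a sketch, assuming 
a local Solr at http://localhost:8983 and a collection named "jobs", so adjust 
the URL, collection and field names to your setup:

    import requests

    # Full text of the document whose near duplicates we want (400+ words).
    full_text = ("Create an outstanding customer experience through "
                 "exceptional service ...")

    # POST to Solr's JSON Request API: the min_hash query travels in the
    # request body, so URL-length limits do not apply.
    resp = requests.post(
        "http://localhost:8983/solr/jobs/select",
        json={
            "query": '{!min_hash field="description_minhash" sim="0.9"}' + full_text,
            "limit": 20,
            "fields": "id,score",
        },
        timeout=30,
    )
    resp.raise_for_status()

    for doc in resp.json()["response"]["docs"]:
        print(doc["id"], doc["score"])

If you do hit a server-side size limit that way, I believe the setting to look 
at is formdataUploadLimitInKB under <requestParsers> in solrconfig.xml.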

—ufuk

—

> On Aug 26, 2024, at 23:27, Corrado Fiore <cpfi...@gmail.com> wrote:
> 
> Dear All,
> 
> I am using Solr to index job postings.  Each document in my collection is 
> approximately 500 to 1500 words long and the collection holds ~3 million 
> documents.
> 
> Several documents in the collection are near duplicates (5-10 different words 
> out of a thousand) and I wish to identify and deduplicate them.  Think “Solr, 
> give me a list of documents that are similar to this one by a factor of 0.9 
> or more”.
> 
> The most promising solution to my use case seems to be the use of MinHash and 
> Locality-Sensitive Hashing (LSH), which translates to using the MinHash filter 
> (https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#minhash-filter)
>  at indexing time in Solr, followed by the MinHash Query Parser 
> (https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#minhash-query-parser)
>  at query time.  The hashes are created on the basis of the full text of each 
> job posting.
> 
> My question for the community is:  what is the canonical way to query the 
> MinHash Parser?
> 
> Based on the example on page 
> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#example-with-analysed-fields,
>  I fed the entire text of a document (400+ words) to the MinHash Parser in 
> order to find its near duplicates:
> 
> "q":"{!min_hash field=\"description_minhash\" sim=\"0.9\"}Create an 
> outstanding customer experience through exceptional service [400+ more words 
> here]"
> 
> This worked quite well, but I am afraid I might be hitting some limits in 
> terms of payload size when larger documents are involved.
> 
> Is there a better way?
> 
> Thank you,
> Corrado
