Hi Corrado,

Are you using HTTP POST requests with the query in the body? If so, there shouldn't be a problem with the payload size. I have never used the MinHash parser myself, but I have sent other types of queries with large payloads that way and had no issues.
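For example, with Python and the requests library it could look something like the sketch below. The core name and endpoint are just placeholders for illustration (the field name is the one from your message), and I have not tried this against the min_hash parser specifically, but the same pattern has worked for me with other large query payloads:

import requests

# Placeholder core name and endpoint -- adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/jobs/select"

def find_near_duplicates(full_text, sim=0.9, rows=10):
    # Build the same {!min_hash ...} query as in your example, but send it
    # in the POST body (form-encoded), so the URL-length limits that apply
    # to GET query strings are not an issue.
    q = '{!min_hash field="description_minhash" sim="%s"}%s' % (sim, full_text)
    resp = requests.post(
        SOLR_SELECT,
        data={"q": q, "fl": "id,score", "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

If you ever do hit a size limit with POST bodies, it is the formdataUploadLimitInKB setting in solrconfig.xml that governs it, if I remember correctly, and that can be raised.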
—ufuk

> On Aug 26, 2024, at 23:27, Corrado Fiore <cpfi...@gmail.com> wrote:
>
> Dear All,
>
> I am using Solr to index job postings. Each document in my collection is approximately 500 to 1500 words long, and the collection holds ~3 million documents.
>
> Several documents in the collection are near duplicates (5-10 different words out of a thousand) and I wish to identify and deduplicate them. Think “Solr, give me a list of documents that are similar to this one by a factor of 0.9 or more”.
>
> The most promising solution to my use case seems to be the use of MinHash and Locality-Sensitive Hashing (LSH), which translates to using the MinHash filter (https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#minhash-filter) at indexing time in Solr, followed by the MinHash Query Parser (https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#minhash-query-parser) at query time. The hashes are created on the basis of the full text of each job posting.
>
> My question for the community is: what is the canonical way to query the MinHash Parser?
>
> Based on the example on page https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#example-with-analysed-fields, I fed the entire text of a document (400+ words) to the MinHash Parser in order to find its near duplicates:
>
> "q":"{!min_hash field=\"description_minhash\" sim=\"0.9\"}Create an outstanding customer experience through exceptional service [400+ more words here]"
>
> This worked quite well, but I am afraid I might be hitting some limits in terms of payload size when larger documents are involved.
>
> Is there a better way?
>
> Thank you,
> Corrado