[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706571#comment-16706571 ]
Mayya Sharipova commented on LUCENE-6968:
-----------------------------------------

[~andyhind] Hello Andy! I have several questions about the implementation of *MinHashFilter* and was wondering whether you would be able to answer them. Thanks a lot in advance.

The implementation in the first original patch, where the minimum set is kept, is very clear to me; it follows the classic idea of constructing a MinHash signature and then running an LSH search over it. But I am having a hard time understanding the final implementation of MinHashFilter.

1) What constitutes the signature of a document? Is it all of the values stored in the hash table? Doesn't that make the signature too large? Could you please point me to the paper that describes this way of constructing MinHash signatures?

2) What is the use of the {{withRotation}} parameter? What is the advantage of using {{withRotation=true}}? In the paper you cited, [http://www.auai.org/uai2014/proceedings/individuals/225.pdf], empty bins are filled with the "value of the closest non-empty bin in the clockwise direction (circular right hand side) added *with offset C*". In the {{MinHashFilter}} implementation, values for empty buckets are just blindly copied from non-empty ones, so a lot of buckets will have the same value.

Hopefully the questions make sense. Thanks again in advance.

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Cao Manh Dat
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 6.2, 7.0
>
>         Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch,
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example.
> Given the following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> We want to find documents that have a 0.6 score in jaccard measurement with this
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
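
To make question 2 in the comment above concrete, here is a minimal sketch of the two ways of filling empty bins that the comment contrasts. It is written in Python for brevity and is not the Lucene {{MinHashFilter}} code. The function name {{fill_empty_bins}}, the parameter {{offset_c}}, and the sample sketch values are all hypothetical. It also assumes that the offset added is C multiplied by the number of steps walked to reach the donor bin; the sentence quoted from the paper only says "with offset C", so treat that detail as an assumption.

```python
from typing import List, Optional


def fill_empty_bins(bins: List[Optional[int]], offset_c: int = 0) -> List[Optional[int]]:
    """Fill each empty bin (None) from the closest non-empty bin on its circular right.

    offset_c == 0 copies the donor value unchanged (the behaviour the comment
    attributes to MinHashFilter). offset_c > 0 adds offset_c per step walked
    (an assumed reading of the paper's "with offset C").
    """
    k = len(bins)
    if all(b is None for b in bins):
        return list(bins)  # nothing to borrow from, leave the sketch untouched

    filled = list(bins)
    for j in range(k):
        if bins[j] is not None:
            continue
        # Walk clockwise over the ORIGINAL bins until a non-empty one is found.
        steps = 1
        while bins[(j + steps) % k] is None:
            steps += 1
        filled[j] = bins[(j + steps) % k] + steps * offset_c
    return filled


# Hypothetical sparse sketch: 6 bins, only two of them populated.
sketch = [None, 7, None, None, 3, None]

copied = fill_empty_bins(sketch)                 # plain copy, no offset
shifted = fill_empty_bins(sketch, offset_c=100)  # copy plus steps * C

print(copied)   # [7, 7, 3, 3, 3, 7]
print(shifted)  # [107, 7, 203, 103, 3, 207]
```

With the plain copy, the six bins hold only two distinct values, which is the duplication the comment points out. With the offset, every filled bin also records how far away its donor was, so the bins stay distinct in this example.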