[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706571#comment-16706571
 ] 

Mayya Sharipova commented on LUCENE-6968:
-----------------------------------------

[~andyhind]  Hello Andy! I have several questions about the implementation of 
the *MinHashFilter*, and was wondering if you would be able to answer them. 
Thanks a lot in advance.

The implementation from the first, original patch, where the minimum set is 
kept, is very clear to me: it follows the classic idea of constructing a 
MinHash signature and then running an LSH search over it. But I am having a 
hard time understanding the final implementation of MinHashFilter.
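
For reference, the "classic idea" mentioned above can be sketched as follows. 
This is a minimal textbook illustration, not the code from any of the attached 
patches: the class name, the seeded mixing function and the use of 
{{String.hashCode()}} are all assumptions made for the example. Each of the k 
signature slots keeps the minimum hash seen under one simulated hash function, 
and the fraction of matching slots estimates the Jaccard similarity.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
final class ClassicMinHashSketch {
    // Assumed constant used to derive k different hash functions from one mixer.
    private static final long SEED_STEP = 0x9e3779b97f4a7c15L;

    // Signature = for each of k simulated hash functions, the minimum hash over all tokens.
    static long[] signature(Iterable<String> tokens, int k) {
        long[] sig = new long[k];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String t : tokens) {
            for (int i = 0; i < k; i++) {
                long h = mix(t.hashCode() + SEED_STEP * (i + 1));
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // 64-bit finalizer (splitmix64 style) used as a stand-in hash function.
    static long mix(long x) {
        x ^= x >>> 30;
        x *= 0xbf58476d1ce4e5b9L;
        x ^= x >>> 27;
        x *= 0x94d049bb133111ebL;
        x ^= x >>> 31;
        return x;
    }

    // Fraction of slots that agree; this estimates the Jaccard similarity.
    static double estimate(long[] a, long[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (double) same / a.length;
    }
}
```

With this scheme the signature size is fixed at k values per document, 
regardless of document length or token order.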

1) What constitutes the signature of a document? Is it all of the values 
stored in the hash table? Doesn't that make the signature too large? Could you 
please point me to the paper that describes this way of constructing MinHash 
signatures?

2) What is the use of the {{withRotation}} parameter? What is the advantage of 
using {{withRotation=true}}? In the paper you cited, 
[http://www.auai.org/uai2014/proceedings/individuals/225.pdf], empty bins are 
filled with the "value of the closest non-empty bin in the clockwise direction 
(circular right hand side) added *with offset C*". In the {{MinHashFilter}} 
implementation, values for empty buckets are just blindly copied from non-empty 
ones, so a lot of buckets will have the same value.
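
To make the difference concrete, here is a minimal sketch of the two fill 
strategies. It is an illustration under assumptions, not the actual 
{{MinHashFilter}} code: the class name, the {{EMPTY}} sentinel and the choice 
to scale the offset by the number of bins travelled (my reading of the cited 
paper) are all hypothetical. Passing an offset of 0 reproduces the plain copy 
described above.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class RotationFillSketch {
    // Assumed sentinel marking a bin that received no hash value.
    static final long EMPTY = -1L;

    // Fill each empty bin from the closest non-empty bin to its right (circular).
    // The borrowed value is shifted by distance * offsetC, so bins filled from the
    // same source at different distances stay distinguishable. offsetC = 0 is a plain copy.
    static long[] fill(long[] bins, long offsetC) {
        int n = bins.length;
        long[] out = bins.clone();
        for (int i = 0; i < n; i++) {
            if (bins[i] != EMPTY) {
                continue;
            }
            for (int d = 1; d < n; d++) {
                long v = bins[(i + d) % n];
                if (v != EMPTY) {
                    out[i] = v + d * offsetC;
                    break;
                }
            }
        }
        return out;
    }
}
```

For bins {{[EMPTY, 5, EMPTY, EMPTY]}}, an offset of 100 gives 
{{[105, 5, 305, 205]}}, while an offset of 0 gives {{[5, 5, 5, 5]}}, which is 
the repeated-value situation the question is about.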

Hopefully the questions make sense. Thanks again in advance.

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Cao Manh Dat
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 6.2, 7.0
>
>         Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a 
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that score 0.6 in Jaccard similarity against this 
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
