[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028283#comment-17028283 ]
Xin-Chun Zhang edited comment on LUCENE-9004 at 2/4/20 1:34 PM:
----------------------------------------------------------------

??"Is it making life difficult to keep them separate?"?? [~sokolov] No, we can keep them separate for now. I have merged your [branch|https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2] into my personal [GitHub branch|https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136] in order to compare IVFFlat and HNSW, and I reused some of the work that [~tomoko] and you did. Code refactoring will be required when we are ready to commit.

??"Have you tried comparing them on real data?"?? [~yurymalkov], [~mikemccand] Thanks for your advice. I haven't done it yet, and will do so soon.

*Update – Feb. 4, 2020*

I have added two performance test tools (KnnIvfPerformTester/KnnIvfAndGraphPerformTester) to my personal branch, and used the sift1M dataset (1,000,000 base vectors with 128 dimensions, [http://corpus-texmex.irisa.fr/]) for the test. Top-1 recall performance of IVFFlat is as follows:

centroids=707
||nprobe||avg. search time (ms)||recall percent (%)||
|8|71.314|69.15|
|16|121.7565|82.3|
|32|155.692|92.85|
|64|159.3655|98.7|
|128|217.5205|99.9|

centroids=4000
||nprobe||avg. search time (ms)||recall percent (%)||
|8|56.3745|65.35|
|16|59.5435|78.85|
|32|71.751|89.85|
|64|90.396|96.25|
|128|135.3805|99.3|

Unfortunately, I couldn't obtain the corresponding results for HNSW due to an out-of-memory error on my PC. A special case with 2,000 base vectors demonstrates that IVFFlat is faster and more accurate there. HNSW may outperform IVFFlat on larger data sets when more memory is available, as shown in [https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
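For context on the nprobe column above: IVFFlat assigns every vector to its nearest centroid at index time, and at search time exhaustively scans only the nprobe clusters whose centroids are closest to the query. Below is a minimal, self-contained Java sketch of that search loop; all class, method, and parameter names are illustrative assumptions, not the KnnIvfPerformTester code from the branch.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Minimal sketch of IVFFlat search. Vectors were assigned to clusters at
 * index time (one list per centroid); at query time only the nprobe clusters
 * whose centroids are closest to the query are scanned exhaustively.
 * Larger nprobe => more vectors scanned => higher recall but longer search,
 * which is the trade-off visible in the tables above.
 */
public class IvfFlatSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Returns doc ids of the (approximately) k nearest indexed vectors. */
  static List<Integer> search(float[] query, float[][] centroids,
      List<List<float[]>> clusterVectors, List<List<Integer>> clusterDocIds,
      int nprobe, int k) {
    // 1. Rank all centroids by distance to the query; keep the nprobe closest.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingDouble(c -> squaredDistance(query, centroids[c])));

    // 2. Brute-force scan of the selected clusters only (the "flat" part),
    //    keeping a max-heap of the k best hits seen so far.
    PriorityQueue<float[]> top = new PriorityQueue<>((x, y) -> Float.compare(y[0], x[0]));
    for (int p = 0; p < Math.min(nprobe, order.length); p++) {
      int c = order[p];
      List<float[]> vectors = clusterVectors.get(c);
      for (int j = 0; j < vectors.size(); j++) {
        float dist = squaredDistance(query, vectors.get(j));
        // Doc id is stored in a float slot for brevity; exact for ids < 2^24.
        top.offer(new float[] {dist, clusterDocIds.get(c).get(j)});
        if (top.size() > k) top.poll(); // evict the current worst hit
      }
    }

    List<Integer> result = new ArrayList<>();
    while (!top.isEmpty()) result.add((int) top.poll()[1]); // worst first
    Collections.reverse(result);                            // best first
    return result;
  }
}
{code}

Under this reading of the numbers, nprobe=8 with 707 centroids touches roughly 1% of the million vectors on average, which is why even the lowest-recall row is far faster than a full brute-force scan, and why recall climbs toward 100% as nprobe grows.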
> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>        Attachments: hnsw_layered_graph.png
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
>
> The idea here is to explore approximate nearest-neighbor search. Researchers have found that an approach based on navigating a graph that partially encodes the nearest-neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest-neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
>
> At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest-neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small-world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, i.e. both ways) to the new document, then the graph that emerges will have the desired properties.
>
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph per segment, since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However, graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
>
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM, since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
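To make the traversal described above concrete, here is a hedged, single-layer Java sketch of the greedy best-first graph search: expand the closest unvisited candidate, keep a bounded result set, and stop once no remaining candidate can improve it. Real HNSW runs this once per layer of its hierarchy, entering each layer at the best node found in the layer above; all names and signatures here are illustrative, not the API of the patch.

{code:java}
import java.util.BitSet;
import java.util.PriorityQueue;

/**
 * Single-layer sketch of the graph search described above: starting from an
 * entry point, repeatedly expand the closest unvisited candidate, keeping a
 * bounded set of the best results seen so far.
 */
public class GraphSearchSketch {

  /** neighbors[i] holds the node ids linked to node i (linked undirectedly at index time). */
  static int[] searchLayer(float[] query, float[][] vectors, int[][] neighbors,
                           int entryPoint, int topK) {
    // candidates: min-heap on distance (closest first); results: max-heap (worst first).
    PriorityQueue<float[]> candidates = new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));
    PriorityQueue<float[]> results = new PriorityQueue<>((a, b) -> Float.compare(b[0], a[0]));
    BitSet visited = new BitSet(vectors.length);

    float d0 = squaredDistance(query, vectors[entryPoint]);
    candidates.offer(new float[] {d0, entryPoint});
    results.offer(new float[] {d0, entryPoint});
    visited.set(entryPoint);

    while (!candidates.isEmpty()) {
      float[] best = candidates.poll();
      // Stop when the closest remaining candidate is worse than the worst kept result.
      if (results.size() >= topK && best[0] > results.peek()[0]) break;
      for (int next : neighbors[(int) best[1]]) {
        if (visited.get(next)) continue;
        visited.set(next);
        float d = squaredDistance(query, vectors[next]);
        if (results.size() < topK || d < results.peek()[0]) {
          candidates.offer(new float[] {d, next});
          results.offer(new float[] {d, next});
          if (results.size() > topK) results.poll(); // evict the current worst
        }
      }
    }

    int[] ids = new int[results.size()];
    for (int i = ids.length - 1; i >= 0; i--) ids[i] = (int) results.poll()[1];
    return ids; // ids[0] is the closest found
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; sum += d * d; }
    return sum;
  }
}
{code}

As the description notes, the same routine can serve at index time: find the K nearest neighbors of a newly added document, then link them to it undirectedly, which is what gives the emerging graph the desired navigable property.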
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
>
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, the basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with the exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
>
> [1] [https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]
>
> *UPDATES:*
> * (1/12/2020) The up-to-date branch is: [https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]
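As a footnote to the proposal's point about a light {{BinaryDocValues}} wrapper with "efficient conversion from bytes to floats": one plausible fixed-width layout is 4 bytes per float32 component with the dimension carried in field metadata, decoded with java.nio bulk operations. This is an assumption about a reasonable encoding, sketched below; it is not the actual codec in the branch.

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/**
 * Illustrative round-trip between a float[] vector and the byte[] payload of
 * a binary doc-values field: 4 bytes per component, dimension known from
 * field metadata. Purely a sketch of one possible layout.
 */
public class VectorCodecSketch {

  static byte[] encode(float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.asFloatBuffer().put(vector); // bulk write, no per-component loop
    return buf.array();
  }

  static float[] decode(byte[] bytes, int dimension) {
    float[] vector = new float[dimension];
    // Bulk get avoids converting one float at a time.
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer().get(vector);
    return vector;
  }

  public static void main(String[] args) {
    float[] v = {0.25f, -1.5f, 3.0f};
    float[] roundTrip = decode(encode(v), v.length);
    System.out.println(java.util.Arrays.toString(roundTrip)); // [0.25, -1.5, 3.0]
  }
}
{code}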