[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028283#comment-17028283 ]
Xin-Chun Zhang edited comment on LUCENE-9004 at 2/4/20 1:34 PM:
----------------------------------------------------------------

??"Is it making life difficult to keep them separate?"?? [~sokolov] No, we can keep them separate for now. I have merged your [branch|https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2] into my personal [GitHub branch|https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136] in order to compare IVFFlat and HNSW, and I reused some of the work that [~tomoko] and you did. Code refactoring will be required when we are ready to commit.

??"Have you tried comparing them on real data?"?? [~yurymalkov], [~mikemccand] Thanks for your advice. I haven't done it yet, and will do so soon.

*Update – Feb. 4, 2020*

I have added two performance test tools (KnnIvfPerformTester/KnnIvfAndGraphPerformTester) to my personal branch, and used the sift1M dataset (1,000,000 base vectors with 128 dimensions, [http://corpus-texmex.irisa.fr/]) for the test. Top-1 recall performance of IVFFlat is as follows:

centroids=707
||nprobe||avg. search time (ms)||recall percent (%)||
|8|71.314|69.15|
|16|121.7565|82.3|
|32|155.692|92.85|
|64|159.3655|98.7|
|128|217.5205|99.9|

centroids=4000
||nprobe||avg. search time (ms)||recall percent (%)||
|8|56.3745|65.35|
|16|59.5435|78.85|
|32|71.751|89.85|
|64|90.396|96.25|
|128|135.3805|99.3|

Unfortunately, I couldn't obtain the corresponding results for HNSW due to an out-of-memory error on my PC. A special case with 2,000 base vectors demonstrates that IVFFlat is faster and more accurate there. HNSW may outperform IVFFlat on larger data sets when more memory is available, as shown in [https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
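For context on the nprobe column above: IVFFlat assigns every vector to its nearest centroid at index time, and at search time exhaustively scans only the nprobe clusters whose centroids are closest to the query. Below is a minimal, self-contained Java sketch of that search loop; all class, method, and parameter names are illustrative assumptions, not the KnnIvfPerformTester code from the branch.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Minimal sketch of IVFFlat search. Vectors were assigned to clusters at
 * index time (one list per centroid); at query time only the nprobe clusters
 * whose centroids are closest to the query are scanned exhaustively.
 * Larger nprobe => more vectors scanned => higher recall but longer search,
 * which is the trade-off visible in the tables above.
 */
public class IvfFlatSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Returns doc ids of the (approximately) k nearest indexed vectors. */
  static List<Integer> search(float[] query, float[][] centroids,
      List<List<float[]>> clusterVectors, List<List<Integer>> clusterDocIds,
      int nprobe, int k) {
    // 1. Rank all centroids by distance to the query; keep the nprobe closest.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingDouble(c -> squaredDistance(query, centroids[c])));

    // 2. Brute-force scan of the selected clusters only (the "flat" part),
    //    keeping a max-heap of the k best hits seen so far.
    PriorityQueue<float[]> top = new PriorityQueue<>((x, y) -> Float.compare(y[0], x[0]));
    for (int p = 0; p < Math.min(nprobe, order.length); p++) {
      int c = order[p];
      List<float[]> vectors = clusterVectors.get(c);
      for (int j = 0; j < vectors.size(); j++) {
        float dist = squaredDistance(query, vectors.get(j));
        // Doc id is stored in a float slot for brevity; exact for ids < 2^24.
        top.offer(new float[] {dist, clusterDocIds.get(c).get(j)});
        if (top.size() > k) top.poll(); // evict the current worst hit
      }
    }

    List<Integer> result = new ArrayList<>();
    while (!top.isEmpty()) result.add((int) top.poll()[1]); // worst first
    Collections.reverse(result);                            // best first
    return result;
  }
}
{code}

Under this reading of the numbers, nprobe=8 with 707 centroids touches roughly 1% of the million vectors on average, which is why even the lowest-recall row is far faster than a full brute-force scan, and why recall climbs toward 100% as nprobe grows.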
> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>        Attachments: hnsw_layered_graph.png
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
>
> The idea here is to explore approximate nearest-neighbor search. Researchers have found that an approach based on navigating a graph that partially encodes the nearest-neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest-neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
>
> At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest-neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small-world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, i.e. both ways) to the new document, then the graph that emerges will have the desired properties.
>
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph per segment, since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However, graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
>
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM, since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
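To make the traversal described above concrete, here is a hedged, single-layer Java sketch of the greedy best-first graph search: expand the closest unvisited candidate, keep a bounded result set, and stop once no remaining candidate can improve it. Real HNSW runs this once per layer of its hierarchy, entering each layer at the best node found in the layer above; all names and signatures here are illustrative, not the API of the patch.

{code:java}
import java.util.BitSet;
import java.util.PriorityQueue;

/**
 * Single-layer sketch of the graph search described above: starting from an
 * entry point, repeatedly expand the closest unvisited candidate, keeping a
 * bounded set of the best results seen so far.
 */
public class GraphSearchSketch {

  /** neighbors[i] holds the node ids linked to node i (linked undirectedly at index time). */
  static int[] searchLayer(float[] query, float[][] vectors, int[][] neighbors,
                           int entryPoint, int topK) {
    // candidates: min-heap on distance (closest first); results: max-heap (worst first).
    PriorityQueue<float[]> candidates = new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));
    PriorityQueue<float[]> results = new PriorityQueue<>((a, b) -> Float.compare(b[0], a[0]));
    BitSet visited = new BitSet(vectors.length);

    float d0 = squaredDistance(query, vectors[entryPoint]);
    candidates.offer(new float[] {d0, entryPoint});
    results.offer(new float[] {d0, entryPoint});
    visited.set(entryPoint);

    while (!candidates.isEmpty()) {
      float[] best = candidates.poll();
      // Stop when the closest remaining candidate is worse than the worst kept result.
      if (results.size() >= topK && best[0] > results.peek()[0]) break;
      for (int next : neighbors[(int) best[1]]) {
        if (visited.get(next)) continue;
        visited.set(next);
        float d = squaredDistance(query, vectors[next]);
        if (results.size() < topK || d < results.peek()[0]) {
          candidates.offer(new float[] {d, next});
          results.offer(new float[] {d, next});
          if (results.size() > topK) results.poll(); // evict the current worst
        }
      }
    }

    int[] ids = new int[results.size()];
    for (int i = ids.length - 1; i >= 0; i--) ids[i] = (int) results.poll()[1];
    return ids; // ids[0] is the closest found
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; sum += d * d; }
    return sum;
  }
}
{code}

As the description notes, the same routine can serve at index time: find the K nearest neighbors of a newly added document, then link them to it undirectedly, which is what gives the emerging graph the desired navigable property.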
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
>
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, the basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with the exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
>
> [1] [https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]
>
> *UPDATES:*
> * (1/12/2020) The up-to-date branch is: [https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]
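As a footnote to the proposal's point about a light {{BinaryDocValues}} wrapper with "efficient conversion from bytes to floats": one plausible fixed-width layout is 4 bytes per float32 component with the dimension carried in field metadata, decoded with java.nio bulk operations. This is an assumption about a reasonable encoding, sketched below; it is not the actual codec in the branch.

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/**
 * Illustrative round-trip between a float[] vector and the byte[] payload of
 * a binary doc-values field: 4 bytes per component, dimension known from
 * field metadata. Purely a sketch of one possible layout.
 */
public class VectorCodecSketch {

  static byte[] encode(float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.asFloatBuffer().put(vector); // bulk write, no per-component loop
    return buf.array();
  }

  static float[] decode(byte[] bytes, int dimension) {
    float[] vector = new float[dimension];
    // Bulk get avoids converting one float at a time.
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer().get(vector);
    return vector;
  }

  public static void main(String[] args) {
    float[] v = {0.25f, -1.5f, 3.0f};
    float[] roundTrip = decode(encode(v), v.length);
    System.out.println(java.util.Arrays.toString(roundTrip)); // [0.25, -1.5, 3.0]
  }
}
{code}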