abernardi597 commented on PR #15472:
URL: https://github.com/apache/lucene/pull/15472#issuecomment-4012846854

   Here are some of the results from my benchmarking on the Cohere V3 dataset.
   
   ## Indexing
   On a `m8g.16xlarge` EC2 instance running AL2, I indexed 5M documents using 16 indexing threads and 48 merge threads with both Lucene's HNSW codec and the candidate JVector codec over a range of compression levels. For HNSW indices, compression uses Optimized Scalar Quantization (OSQ), while JVector relies on Product Quantization (PQ). Lucene's OSQ supports 8-bit, 4-bit, and 1-bit quantization, achieving 4x, 8x, and 32x compression, respectively. PQ supports the same compression levels by setting the number of subvectors appropriately (i.e. 1-dim subvectors = 4x compression, 2-dim subvectors = 8x, etc.).
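   As a sanity check, the quoted compression ratios follow directly from per-dimension storage cost. A small sketch of the arithmetic (assuming 1024-dim float32 vectors, as with Cohere V3, and 256-centroid PQ codebooks so each subvector costs one byte):

```python
# Rough storage math behind the compression levels quoted above.
# Assumes 1024-dim float32 vectors and 1-byte PQ codes (256-centroid codebooks).

def osq_ratio(bits: int, dim: int = 1024) -> float:
    """Scalar quantization: each dimension stored in `bits` bits instead of a
    32-bit float (ignoring small per-vector correction terms)."""
    return (dim * 32) / (dim * bits)

def pq_ratio(subvector_dim: int, dim: int = 1024) -> float:
    """Product quantization: each subvector of `subvector_dim` dims is
    replaced by a single 1-byte codebook index."""
    n_subvectors = dim // subvector_dim
    return (dim * 4) / n_subvectors  # float32 bytes in vs 1 byte per code out

print(osq_ratio(8), osq_ratio(4), osq_ratio(1))  # 4.0 8.0 32.0
print(pq_ratio(1), pq_ratio(2), pq_ratio(8))     # 4.0 8.0 32.0
```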
   
   All indices were created with `maxConn=64` and `beamWidth=250` using the 
COSINE similarity metric. For JVector, I set `alpha=1.0`, 
`neighborOverflow=1.2`, and `useHierarchy=true` to generate graphs with a 
similar structure to HNSW and mitigate differences in recall due to graph 
quality.
   
   
![legend](https://github.com/user-attachments/assets/6de5050e-ab67-4e6e-94dc-094eb3682110)
   
![indexing](https://github.com/user-attachments/assets/a3c1f0d4-c6d0-409f-8b47-5ccf25e0e40f)
   
   The graph above shows the wall-clock time to index all 5M documents, where 
the slightly transparent section of the bar denotes the additional time taken 
to force-merge the segments into a single graph.
   
   Overall, indexing time is quite comparable in the case of 1-bit quantization or full-precision vectors, but Lucene shows much better performance with 4-bit and 8-bit quantization. In every case, however, JVector seems to spend an inordinate amount of time merging segments compared to HNSW. The discrepancy is likely due to a key optimization in Lucene's HNSW implementation that reuses adjacency information from the incoming graphs during a merge, reducing the search cost of inserting most nodes into the merged graph. I did not port this functionality, but I am interested in contributing it upstream.
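   To make the reused-adjacency idea concrete, here is a rough sketch (hypothetical names and data layout, not Lucene's actual merge code): the adjacency lists of the largest incoming graph are kept as-is, with ordinals remapped into the merged ordinal space, so only the nodes from the remaining segments need a costly graph-insertion search.

```python
# Toy sketch of the merge optimization described above (illustrative only).
# Each segment graph is a dict: {node_ordinal: [neighbor_ordinals]}.

def merge_graphs(segments):
    largest = max(segments, key=len)
    largest_id = segments.index(largest)

    # Assign merged ordinals segment by segment.
    ordinal_map = {}  # (segment_id, old_ordinal) -> merged ordinal
    offset = 0
    for seg_id, seg in enumerate(segments):
        for old in sorted(seg):
            ordinal_map[(seg_id, old)] = offset
            offset += 1

    # Reuse adjacency from the largest graph: no search needed for these nodes.
    merged = {}
    for old, nbrs in largest.items():
        merged[ordinal_map[(largest_id, old)]] = [
            ordinal_map[(largest_id, n)] for n in nbrs
        ]

    # Nodes from the other segments would still require a graph search to
    # insert; collected here as a stub work list.
    to_insert = [
        ordinal_map[(seg_id, old)]
        for seg_id, seg in enumerate(segments) if seg_id != largest_id
        for old in sorted(seg)
    ]
    return merged, to_insert
```

The savings come from the fact that, after a few merge rounds, the largest incoming segment typically dominates the node count, so most nodes skip the search entirely.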
   
   ## Searching
   The following search benchmarks perform 10k queries across 8 threads using Lucene's `MMapDirectory` to memory-map the index files. The `topK=100` nearest neighbors for each target vector are compared against the brute-force ground truth to compute recall. For each index, I ran the benchmarks with `overSample=1..5`, both with and without reranking. The benchmarks execute each query once as a warm-up before measuring anything.
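   For reference, the recall computation with over-sampling and an explicit rerank pass can be sketched as follows (toy random vectors, with synthetic score noise standing in for a real ANN index's quantization error; this is not the benchmark code itself):

```python
import numpy as np

rng = np.random.default_rng(0)
k, over_sample = 100, 3
docs = rng.normal(size=(5000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b))

# Ground truth: brute-force top-k by cosine similarity.
truth = set(np.argsort(-cosine(docs, query))[:k])

# Stand-in for approximate/quantized scores: exact scores plus noise.
noisy = cosine(docs, query) + rng.normal(scale=0.05, size=len(docs))
candidates = np.argsort(-noisy)[: k * over_sample]  # over-sampled fetch

# Optional rerank: rescore the over-sampled candidates at full precision,
# then keep the top k.
reranked = candidates[np.argsort(-cosine(docs[candidates], query))][:k]

recall = len(truth & set(reranked)) / k
print(f"recall@{k} = {recall:.3f}")
```

Without the rerank step, recall is just computed on the first `k` of `candidates`, which is why over-sampling alone helps less than over-sampling plus rerank.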
   
   To get an idea of the best-case performance, I first ran search benchmarks 
on the same `m8g.16xlarge` instance I used to build the index, since it can 
hold the entire index in memory and simulate a "hot" index. I used the Lucene 
preload API to ensure that the index files were loaded into memory.
   
   I also wanted to run the benchmarks in a memory-constrained environment to simulate a "warm" index, so I ran them on a `c8gd.xlarge` instance running AL2023, copying the index files onto the direct-attached SSD formatted with xfs. The benchmarks do not attempt to preload the index files in this case, since they cannot all fit in the machine's limited memory. Moreover, `MADV_RANDOM` is used for the graph index and vector files to minimize IO saturation from overzealous kernel prefetching, since disk access is mostly random.
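   For illustration, the same hint is available from Python's `mmap` module on Linux; a minimal sketch of mapping a file with random-access advice (availability of `madvise`/`MADV_RANDOM` is platform-dependent, hence the guard):

```python
import mmap

def map_random_access(path: str) -> mmap.mmap:
    """Memory-map a file read-only and hint the kernel that access will be
    random, suppressing read-ahead (the same madvise hint discussed above)."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # madvise/MADV_RANDOM are Unix-specific; skip the hint elsewhere.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        mm.madvise(mmap.MADV_RANDOM)
    return mm
```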
   
   
![legend](https://github.com/user-attachments/assets/6de5050e-ab67-4e6e-94dc-094eb3682110)
   
![latency](https://github.com/user-attachments/assets/406d476f-5e19-416b-8480-644ec0b25492)
   
   This plot of recall versus average CPU time per query illustrates how over-sampling and re-ranking affect latency and recall. Each line connects the points in order of increasing over-sampling; dashed lines connect the data points with reranking, while solid lines connect those without.
   
   From this we can see that Lucene generally outperforms JVector for "hot" indices. JVector does seem to benefit more from reranking without full-precision vectors, though this may be an artifact of PQ more than anything. It is also notable that 32x-compressed JVector has lower latencies than HNSW, but again it's hard to say whether that is a win for the JVector algorithm rather than for its implementation of PQ.
   
   For the "warm" index benchmarks, JVector's performance seems less affected by reranking. This would suggest an advantage for DiskANN-style reranking, where full-precision vectors are gathered inline with the graph search (relying on page locality to get them for "free") rather than in an explicit second phase. At the time of writing, however, it seems that JVector also [performs re-ranking in a second phase](https://github.com/datastax/jvector/blob/3ca3ce10b20d807c95f4843cc989ae0faf712a54/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L229), like Lucene HNSW. This leads me to believe the discrepancy is due to JVector explicitly loading quantized vectors onto the heap, whereas HNSW memory-maps them. With fewer disk accesses to juggle, the page cache may be more effective at keeping full-precision vectors resident (they are likely also pulled in during graph traversal, due to locality), despite having less cache space because of the increased heap usage.
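   The inline-versus-second-phase distinction can be made concrete with a toy page-fetch count (an illustrative model only, not either library's actual I/O accounting):

```python
# Toy page-fetch model contrasting the two rerank layouts discussed above.

def second_phase_fetches(visited, reranked):
    # Graph traversal touches one "page" per visited node for adjacency, then
    # a second pass touches one page per reranked candidate to read the
    # full-precision vector from a separate file.
    return len(visited) + len(reranked)

def inline_fetches(visited):
    # DiskANN-style layout: adjacency and the full-precision vector share a
    # page, so the vector comes along "for free" with each visited node.
    return len(visited)

visited = range(500)   # nodes touched during graph search
reranked = range(300)  # overSample * topK candidates rescored afterwards
print(second_phase_fetches(visited, reranked), inline_fetches(visited))
```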
   
   ---
   
   Overall, it seems like Lucene HNSW is more competitive with JVector than 
[some 
benchmarks](https://github.com/opensearch-project/opensearch-jvector/blob/main/README.md#benchmarks)
 would suggest.
   Even so, it could still be worth adding it to the sandbox as an alternative 
KNN engine, or perhaps trying to incorporate some of its ideas/features into 
Lucene.
   
   I will update this PR with a non-draft revision, pending some cleanup of my 
commits and `knnPerfTest.py` PRs in `luceneutil`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

