Re: [PR] Bypass HNSW graph building for tiny segments [lucene]

via GitHub Fri, 01 Aug 2025 14:01:01 -0700


jpountz commented on code in PR #14963:
URL: https://github.com/apache/lucene/pull/14963#discussion_r2248862786



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##########
@@ -115,6 +115,13 @@ public final class Lucene99HnswVectorsFormat extends KnnVectorsFormat {
   /** Default to use single thread merge */
   public static final int DEFAULT_NUM_MERGE_WORKER = 1;
 
+  /**
+   * Threshold below which HNSW graph building is bypassed for tiny segments. Segments with fewer
+   * vectors will use flat storage only, improving indexing performance when having frequent
+   * flushes.
+   */
+  public static final int HNSW_GRAPH_THRESHOLD = 10_000;

Review Comment:
   I think the comment should expand on how this value was chosen, to help
   future readers judge whether it is still right or should be updated.
   
   One thing we discussed on the linked issue is that the number of visited
   nodes is on the order of `log(size) * k`. So a graph only helps if
   `log(size) * k << size` <=> `size / log(size) >> k`. If we arbitrarily choose
   k = 100, then 10,000 is the first power of 10 for which `size / log(size)`
   (natural log) is one order of magnitude greater than k (10/log(10) ~= 4.3,
   100/log(100) ~= 22, 1000/log(1000) ~= 145, 10000/log(10000) ~= 1086).
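
   The arithmetic above can be checked with a short sketch. It assumes `log` is the natural logarithm and reads "one order of magnitude greater than k" as `size / log(size) >= 10 * k`. The class and method names are hypothetical and are not part of Lucene or of this PR.

```java
public class GraphThresholdSketch {

  /** Returns size / ln(size), the quantity compared against k in the comment above. */
  public static double sizeOverLog(long size) {
    return size / Math.log(size);
  }

  /**
   * Returns the smallest power of 10 whose size / ln(size) is at least factor * k.
   * Assumption: "one order of magnitude greater" corresponds to factor = 10.
   */
  public static long firstPowerOfTen(int k, double factor) {
    long size = 10;
    // size / ln(size) grows without bound for size >= 10, so this loop terminates.
    while (sizeOverLog(size) < factor * k) {
      size *= 10;
    }
    return size;
  }

  public static void main(String[] args) {
    int k = 100; // arbitrary choice of k, as in the review comment
    for (long size = 10; size <= 100_000; size *= 10) {
      System.out.printf("size=%d size/ln(size)=%.1f%n", size, sizeOverLog(size));
    }
    System.out.println("first power of 10 for k=" + k + ": " + firstPowerOfTen(k, 10.0));
  }
}
```

   Under these assumptions, `firstPowerOfTen(100, 10.0)` returns 10,000, which matches the proposed threshold. A different k or a different factor moves the result, which is the kind of reasoning the Javadoc could record.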


