jpountz commented on code in PR #14963:
URL: https://github.com/apache/lucene/pull/14963#discussion_r2248862786
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##########
@@ -115,6 +115,13 @@ public final class Lucene99HnswVectorsFormat extends KnnVectorsFormat {
/** Default to use single thread merge */
public static final int DEFAULT_NUM_MERGE_WORKER = 1;
+ /**
+ * Threshold below which HNSW graph building is bypassed for tiny segments. Segments with fewer
+ * vectors will use flat storage only, improving indexing performance when having frequent
+ * flushes.
+ */
+ public static final int HNSW_GRAPH_THRESHOLD = 10_000;
Review Comment:
I think that the comment should try to expand a bit more on this value to
help future readers think through whether it's still right or whether it should
be updated.
One thing we discussed on the linked issue is that the number of visited
nodes is on the order of `log(size) * k`. So having a graph only helps if
`log(size) * k << size` <=> `size / log(size) >> k`. If we arbitrarily choose k
= 100, then 10,000 is the first power of 10 for which `size / log(size)` (using
the natural log) is one order of magnitude greater than k (10/log(10) ~= 4.3,
100/log(100) ~= 22, 1000/log(1000) ~= 145, 10000/log(10000) ~= 1086).
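
To make the arithmetic easy to re-check if the threshold is revisited, here is a
minimal sketch of the calculation. It assumes the natural logarithm, k = 100 and
"one order of magnitude" meaning a factor of 10, as in the reasoning above. The
class and method names are hypothetical and are not part of the PR.

```java
// Illustrative only: not Lucene code, names are made up for this sketch.
class HnswThresholdSketch {

  /** Returns size / ln(size); a graph should only pay off when this is much larger than k. */
  static double sizeOverLog(long size) {
    return size / Math.log(size);
  }

  /** Returns the smallest power of 10 whose size / ln(size) is at least factor * k. */
  static long firstPowerOfTen(int k, double factor) {
    long size = 10;
    while (sizeOverLog(size) < factor * k) {
      size *= 10;
    }
    return size;
  }

  public static void main(String[] args) {
    // Print the ratio for each power of 10 up to the proposed threshold.
    for (long size = 10; size <= 10_000; size *= 10) {
      System.out.printf("%d -> %.1f%n", size, sizeOverLog(size));
    }
    // Assumed inputs: k = 100, required margin = 10x.
    System.out.println(firstPowerOfTen(100, 10.0));
  }
}
```

With a different k or margin the same loop gives a different power of 10, which
is probably the dependency the Javadoc should spell out.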
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]