dungba88 commented on code in PR #12844:
URL: https://github.com/apache/lucene/pull/12844#discussion_r1421782211
##########
lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
##########
@@ -330,15 +330,36 @@ public static int[] growExact(int[] array, int newLength) {
     return copy;
   }
+  /**
+   * Returns an array whose size is at least {@code minLength}, generally over-allocating
+   * exponentially, but never allocating more than {@code maxLength} elements.
+   */
+  public static int[] growInRange(int[] array, int minLength, int maxLength) {
+    assert minLength >= 0
+        : "length must be positive (got " + minLength + "): likely integer overflow?";
+
+    if (minLength > maxLength) {
Review Comment:
I think this can be an assert instead, as this class is marked as internal.
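For illustration, a rough sketch of what that might look like. The body below is my assumption based on the existing `oversize()` and `growExact()` helpers in ArrayUtil, not the PR's final code:
```java
// Sketch only: assumes the existing ArrayUtil.oversize() growth heuristic
// and growExact(); not the PR's final code.
public static int[] growInRange(int[] array, int minLength, int maxLength) {
  assert minLength >= 0
      : "length must be positive (got " + minLength + "): likely integer overflow?";
  // Suggested: an assert instead of throwing, since this class is internal.
  assert minLength <= maxLength
      : "minLength=" + minLength + " must not exceed maxLength=" + maxLength;

  if (array.length >= minLength) {
    // Already large enough; avoid a copy.
    return array;
  }

  // Grow exponentially via the usual heuristic, but never past maxLength.
  int potentialLength = oversize(minLength, Integer.BYTES);
  return growExact(array, Math.min(maxLength, potentialLength));
}
```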
##########
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java:
##########
@@ -757,6 +757,30 @@ public void testRamUsageEstimate() throws IOException {
long estimated = RamUsageEstimator.sizeOfObject(hnsw);
long actual = ramUsed(hnsw);
+    // The estimation assumes neighbor arrays are always max size.
+    // When this is not true, the estimate can be much larger than the actual value.
+    // In these cases, we compute how much we overestimated the neighbors arrays.
+    if (estimated > actual) {
+      long neighborsError = 0;
+      int numLevels = hnsw.numLevels();
+      for (int level = 0; level < numLevels; ++level) {
+        NodesIterator nodesOnLevel = hnsw.getNodesOnLevel(level);
+        while (nodesOnLevel.hasNext()) {
+          int node = nodesOnLevel.nextInt();
+          NeighborArray neighbors = hnsw.getNeighbors(level, node);
+          long maxNeighborsSize;
+          if (level == 0) {
Review Comment:
I think we would generally want to avoid having tests duplicate assumptions or logic from the production path. This seems to be a specific implementation decision that could change independently. I couldn't think of a better way though.
But I'm unsure about the need for this newly added code. It seems we only compute it in a single test, in order to get a better estimate? The test appears to verify that our over-estimation cannot exceed 30% of the actual size. If we provide a better estimate, maybe we can lower the tolerance threshold?
Still, it seems strange that if we truly need this better estimate, it exists only in a test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]