dungba88 commented on code in PR #12844:
URL: https://github.com/apache/lucene/pull/12844#discussion_r1421782211
##########
lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
##########
@@ -330,15 +330,36 @@ public static int[] growExact(int[] array, int newLength) {
     return copy;
   }
+  /**
+   * Returns an array whose size is at least {@code minLength}, generally over-allocating
+   * exponentially, but never allocating more than {@code maxLength} elements.
+   */
+  public static int[] growInRange(int[] array, int minLength, int maxLength) {
+    assert minLength >= 0
+        : "length must be positive (got " + minLength + "): likely integer overflow?";
+
+    if (minLength > maxLength) {
Review Comment:
I think this can be an assert instead, as this class is marked as internal.
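For illustration, a rough sketch of what that might look like. The body below is my assumption based on the existing `oversize()` and `growExact()` helpers in ArrayUtil, not the PR's final code:
```java
// Sketch only: assumes the existing ArrayUtil.oversize() growth heuristic
// and growExact(); not the PR's final code.
public static int[] growInRange(int[] array, int minLength, int maxLength) {
  assert minLength >= 0
      : "length must be positive (got " + minLength + "): likely integer overflow?";
  // Suggested: an assert instead of throwing, since this class is internal.
  assert minLength <= maxLength
      : "minLength=" + minLength + " must not exceed maxLength=" + maxLength;

  if (array.length >= minLength) {
    // Already large enough; avoid a copy.
    return array;
  }

  // Grow exponentially via the usual heuristic, but never past maxLength.
  int potentialLength = oversize(minLength, Integer.BYTES);
  return growExact(array, Math.min(maxLength, potentialLength));
}
```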
##########
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java:
##########
@@ -757,6 +757,30 @@ public void testRamUsageEstimate() throws IOException {
long estimated = RamUsageEstimator.sizeOfObject(hnsw);
long actual = ramUsed(hnsw);
+    // The estimation assumes neighbor arrays are always max size.
+    // When this is not true, the estimate can be much larger than the actual value.
+    // In these cases, we compute how much we overestimated the neighbors arrays.
+    if (estimated > actual) {
+      long neighborsError = 0;
+      int numLevels = hnsw.numLevels();
+      for (int level = 0; level < numLevels; ++level) {
+        NodesIterator nodesOnLevel = hnsw.getNodesOnLevel(level);
+        while (nodesOnLevel.hasNext()) {
+          int node = nodesOnLevel.nextInt();
+          NeighborArray neighbors = hnsw.getNeighbors(level, node);
+          long maxNeighborsSize;
+          if (level == 0) {
Review Comment:
I think we would generally want to avoid having tests duplicate assumptions or logic from the production path. This seems to be a specific implementation decision that could change independently. I couldn't think of a better way though.
But I'm unsure about the need for this newly added code. It seems we only compute it in a single test, in order to get a better estimate? The test appears to verify that our over-estimation cannot exceed 30% of the actual size. If we provide a better estimate, maybe we can lower the tolerance threshold?
Still, it seems strange that if we truly need this better estimate, it exists only in a test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]