weizijun opened a new pull request, #14527:
URL: https://github.com/apache/lucene/pull/14527

   When BBQ is used with Lucene, a single data node can hold more data.
   As a result, when many shards are merged concurrently, heap memory usage 
becomes very high.
   I found that the NeighborArray objects account for much of that memory, and 
that the number of nodes in an array does not reach maxSize. It typically uses 
only about 1/3 or 1/4 of maxSize.
   Therefore, I replaced float[]/int[] with FloatArrayList/IntArrayList, which 
significantly reduces heap memory usage.
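   
   To illustrate the idea, here is a minimal sketch of why a growable list 
saves memory when only a fraction of maxSize is used. `GrowableFloats` and its 
1.5x growth policy are hypothetical stand-ins for illustration; they are not 
the actual FloatArrayList implementation, and the numbers 129 and 40 are 
assumed example values.

   ```java
   import java.util.Arrays;

   /** Hypothetical growable float buffer; not Lucene's FloatArrayList. */
   public class GrowableFloats {
     private float[] buffer;
     private int size;

     public GrowableFloats(int initialCapacity) {
       buffer = new float[initialCapacity];
     }

     /** Appends a value, growing the backing array by about 1.5x when full. */
     public void add(float value) {
       if (size == buffer.length) {
         int newCapacity = Math.max(4, buffer.length + (buffer.length >> 1));
         buffer = Arrays.copyOf(buffer, newCapacity);
       }
       buffer[size++] = value;
     }

     public int size() {
       return size;
     }

     /** Number of slots actually allocated in the backing array. */
     public int capacity() {
       return buffer.length;
     }

     public static void main(String[] args) {
       int maxSize = 129; // assumed example value
       int used = 40;     // assumed: roughly 1/3 of maxSize

       float[] fixed = new float[maxSize]; // eager allocation of maxSize slots
       GrowableFloats growable = new GrowableFloats(0);
       for (int i = 0; i < used; i++) {
         growable.add(i);
       }
       System.out.println("fixed slots:    " + fixed.length);
       System.out.println("growable slots: " + growable.capacity());
     }
   }
   ```

   The trade-off is an occasional array copy when the list grows, in exchange 
for not paying for unused slots up front.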
   
   Here is a comparison of the `jmap -histo` results (with the parameter m = 
64):
   before:
   ```
    num     #instances         #bytes  class name (module)
   -------------------------------------------------------
      1:      11443026     6396808120  [F ([email protected])
      2:      11387631     6129931608  [I ([email protected])
      3:       3265644     1319152760  [B ([email protected])
      4:      11308339      361866848  org.apache.lucene.util.hnsw.NeighborArray ([email protected])
      5:      11134203      267240168  [Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
      6:            77       57916272  [[Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
      7:       2404231       57701544  java.lang.String ([email protected])
      8:         34911       42546120  Ljdk.internal.vm.FillerArray; ([email protected])
      9:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
     10:        113758       19111344  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame ([email protected])
     11:        545656       17460992  java.util.HashMap$Node ([email protected])
   ```
   
   after:
   ```
   num     #instances         #bytes  class name (module)
   -------------------------------------------------------
      1:       9228299     1612257464  [F ([email protected])
      2:       9246406     1402537720  [I ([email protected])
      3:       3279264     1141869192  [B ([email protected])
      4:       9124020      364960800  org.apache.lucene.util.hnsw.NeighborArray ([email protected])
      5:       9124036      218976864  org.apache.lucene.internal.hppc.FloatArrayList ([email protected])
      6:       9124036      218976864  org.apache.lucene.internal.hppc.IntArrayList ([email protected])
      7:       8983027      215608448  [Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
      8:       2492594       59822256  java.lang.String ([email protected])
      9:            56       51013776  [[Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
     10:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
     11:         68970       28703992  Ljdk.internal.vm.FillerArray; ([email protected])
   ```
   
   The average size of a float[] instance (#bytes / #instances) was 559 bytes 
before and is 174 bytes after.
   
   The average size of an int[] instance was 538 bytes before and is 151 bytes 
after.
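   
   For reference, the averages above can be reproduced by dividing the #bytes 
column by the #instances column for the `[F` and `[I` rows of the two 
histograms. The class and method names below are made up for this sketch.

   ```java
   /** Recomputes average bytes per array instance from the jmap rows above. */
   public class HistoAverages {

     /** Average bytes per instance, truncated to a whole number of bytes. */
     public static long avgBytes(long totalBytes, long instances) {
       return totalBytes / instances;
     }

     public static void main(String[] args) {
       // Values copied from the [F and [I rows of the histograms above.
       System.out.println("float[] before: " + avgBytes(6396808120L, 11443026L));
       System.out.println("float[] after:  " + avgBytes(1612257464L, 9228299L));
       System.out.println("int[] before:   " + avgBytes(6129931608L, 11387631L));
       System.out.println("int[] after:    " + avgBytes(1402537720L, 9246406L));
     }
   }
   ```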
   
   I tested several datasets, such as GIST (100K vectors, 960 dimensions) and 
LAION (100M vectors, 768 dimensions). They lead to similar conclusions.
   I haven't benchmarked performance rigorously, but this change does not 
appear to have an impact on performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

