weizijun opened a new pull request, #14527:
URL: https://github.com/apache/lucene/pull/14527
When BBQ is used with Lucene, a single data node can hold more data.
As a result, when many shards are merged concurrently, heap memory usage
becomes very high.
I found that NeighborArray objects were taking up most of that memory, and
that the number of nodes in an array never reaches maxSize. Only about 1/3
or 1/4 of maxSize is actually used.
Therefore, I replaced float[] and int[] with FloatArrayList and IntArrayList,
which significantly reduces heap memory usage.
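To illustrate why growing on demand helps when only a fraction of maxSize is
used, here is a minimal, self-contained sketch. It is not the code in this PR:
the class name `GrowableNeighbors`, the initial capacity of 4, and the 1.5x
growth policy are assumptions made for the example, and it uses plain arrays
instead of Lucene's internal FloatArrayList and IntArrayList.

```java
import java.util.Arrays;

/** Hypothetical sketch of on-demand neighbor storage; not Lucene's NeighborArray. */
final class GrowableNeighbors {
  // Assumption for this sketch: start small instead of allocating maxSize up front.
  private static final int INITIAL_CAPACITY = 4;

  private final int maxSize;
  private int[] nodes;
  private float[] scores;
  private int size;

  GrowableNeighbors(int maxSize) {
    this.maxSize = maxSize;
    int initial = Math.min(INITIAL_CAPACITY, maxSize);
    this.nodes = new int[initial];
    this.scores = new float[initial];
  }

  /** Appends a neighbor, growing both arrays by about 1.5x, capped at maxSize. */
  void add(int node, float score) {
    if (size == maxSize) {
      throw new IllegalStateException("neighbor list is full");
    }
    if (size == nodes.length) {
      int newCapacity = Math.min(maxSize, Math.max(size + 1, size + (size >> 1)));
      nodes = Arrays.copyOf(nodes, newCapacity);
      scores = Arrays.copyOf(scores, newCapacity);
    }
    nodes[size] = node;
    scores[size] = score;
    size++;
  }

  int size() {
    return size;
  }

  int capacity() {
    return nodes.length;
  }

  /** Payload bytes of one int[] plus one float[] of this capacity (array headers ignored). */
  static long payloadBytes(int capacity) {
    return (long) capacity * (Integer.BYTES + Float.BYTES);
  }
}
```

With a hypothetical maxSize of 128 and 40 neighbors added, the arrays in this
sketch stop growing well below 128 slots, whereas preallocating would pay for
all 128 slots in both arrays from the start.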
Here is a comparison of the jmap histo results (with the parameter m set to
64):
before:
```
 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:      11443026     6396808120  [F ([email protected])
   2:      11387631     6129931608  [I ([email protected])
   3:       3265644     1319152760  [B ([email protected])
   4:      11308339      361866848  org.apache.lucene.util.hnsw.NeighborArray ([email protected])
   5:      11134203      267240168  [Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
   6:            77       57916272  [[Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
   7:       2404231       57701544  java.lang.String ([email protected])
   8:         34911       42546120  Ljdk.internal.vm.FillerArray; ([email protected])
   9:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
  10:        113758       19111344  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame ([email protected])
  11:        545656       17460992  java.util.HashMap$Node ([email protected])
```
after:
```
 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       9228299     1612257464  [F ([email protected])
   2:       9246406     1402537720  [I ([email protected])
   3:       3279264     1141869192  [B ([email protected])
   4:       9124020      364960800  org.apache.lucene.util.hnsw.NeighborArray ([email protected])
   5:       9124036      218976864  org.apache.lucene.internal.hppc.FloatArrayList ([email protected])
   6:       9124036      218976864  org.apache.lucene.internal.hppc.IntArrayList ([email protected])
   7:       8983027      215608448  [Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
   8:       2492594       59822256  java.lang.String ([email protected])
   9:            56       51013776  [[Lorg.apache.lucene.util.hnsw.NeighborArray; ([email protected])
  10:        772788       30911520  org.nlpcn.commons.lang.tire.domain.SmartForest
  11:         68970       28703992  Ljdk.internal.vm.FillerArray; ([email protected])
```
The average size of a float[] instance is 559 bytes before and 174 bytes
after.
The average size of an int[] instance is 538 bytes before and 151 bytes
after.
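These averages are the #bytes column divided by the #instances column for the
[F and [I rows of the two histograms, rounded down. A small sketch of that
arithmetic follows. The class name `HistoAverages` is hypothetical and the
figures are copied from the tables above.

```java
/** Hypothetical helper: average bytes per instance from a jmap histogram row. */
final class HistoAverages {
  /** Integer division, so the result is rounded down to whole bytes. */
  static long avgBytes(long instances, long bytes) {
    return bytes / instances;
  }

  public static void main(String[] args) {
    // Rows [F and [I, taken from the "before" and "after" histograms.
    System.out.println("float[] before: " + avgBytes(11443026L, 6396808120L));
    System.out.println("float[] after:  " + avgBytes(9228299L, 1612257464L));
    System.out.println("int[] before:   " + avgBytes(11387631L, 6129931608L));
    System.out.println("int[] after:    " + avgBytes(9246406L, 1402537720L));
  }
}
```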
I tested several datasets, such as GIST (100K vectors, 960 dimensions) and
LAION (100M vectors, 768 dimensions). They lead to similar conclusions.
I haven't benchmarked performance rigorously, but so far this change does
not appear to affect it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]