vsop-479 commented on PR #12846:
URL: https://github.com/apache/lucene/pull/12846#issuecomment-1842384044
> is this new approach helping cover more cases maybe?
Yes, Previous patch only apply copy for level 0 's acc in
``Lucene99SkipWriter.bufferSkip``, Current patch apply copy for all level's
acc in ``Lucene99SkipWriter.writeSkipData`` by adding a flag to indicates
wether this is an empty acc.
> when knowing when the accumulator is empty is the responsibility of the
caller rather than CompetitiveFreqNormAccumulator.
I will move the copy/addAll selection to caller.
Before that, there are some performance data I want to share:
First of all, I measured method execution time(without assertion):
Method | ns | diff
-- | -- | --
addAll | 48291 | -
copy | 17000 | +64.7%
And add the implementation to JMH to enable JIT's optimization(auto-unroll,
auto-vectorization):
````
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.accAddAll 256 thrpt 15 17.311 ± 4.468
ops/us
VectorUtilBenchmark.accCopy 256 thrpt 15 21.417 ± 7.009
ops/us
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.accAddAll 256 thrpt 15 19.943 ± 4.831
ops/us
VectorUtilBenchmark.accCopy 256 thrpt 15 16.516 ± 1.904
ops/us
````
The speedup is not stable, so I split the implementation into array part and
treeSet part.
1. Array part:
````
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.arrayCopy 256 thrpt 15 141.356 ± 3.640
ops/us
VectorUtilBenchmark.arrayMax 256 thrpt 15 77.875 ± 1.310
ops/us
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.arrayCopy 256 thrpt 15 138.484 ± 5.038
ops/us
VectorUtilBenchmark.arrayMax 256 thrpt 15 78.250 ± 0.992
ops/us
````
ArrayMax is the baseline that iterates the array and get the max value.
ArrayCopy uses ``System.arraycopy``, which get a explicit speedup.
2. TreeSet part:
````
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.treeAddAll 256 thrpt 15 27.673 ± 19.096
ops/us
VectorUtilBenchmark.treeScalarAdd 256 thrpt 15 34.888 ± 14.321
ops/us
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.treeAddAll 256 thrpt 15 25.342 ± 4.151
ops/us
VectorUtilBenchmark.treeScalarAdd 256 thrpt 15 42.673 ± 19.396
ops/us
````
I also measured the method performance without JMH:
Method | ns | diff
-- | -- | --
treeScalarAdd | 36000 | -
treeAddAll | 26875 | +25.3%
The method performance indicates that `TreeSet.addAll` get a better
performance, but JMH benchmark result indicates that scalarAdd(baseline) get a
better performance than `TreeSet.addAll`, which is unexpected for me. I am
still working on
finding out the reason.
Finally, I modified `acc.copy` to just copy the array but use the original
iterate add for treeSet, and measured the performance:
````
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.accAddAll 256 thrpt 15 16.180 ± 3.014
ops/us
VectorUtilBenchmark.accCopy 256 thrpt 15 31.807 ± 12.811
ops/us
Benchmark (size) Mode Cnt Score Error
Units
VectorUtilBenchmark.accAddAll 256 thrpt 15 20.414 ± 3.693
ops/us
VectorUtilBenchmark.accCopy 256 thrpt 15 24.432 ± 6.862
ops/us
````
Now `acc.copy` get a speedup.
Any idea about that? @jpountz
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]