Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

Danny Thomas Sat, 07 Oct 2023 23:23:11 -0700

On Thu, 5 Oct 2023 23:36:48 GMT, Srinivas Vamsi Parasa <[email protected]> wrote:


>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |    Arrays.sort benchmark   |       Array Size      |       Baseline 
>> (us/op)        |       AVX512 Sort (us/op)     |       Speedup |
>> |    ---     |       ---     |       ---     |       ---     |       ---     
>> |
>> |    ArraysSort.doubleSort   |       10      |       0.034   |       0.035   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       25      |       0.116   |       0.089   
>> |       1.3     |
>> |    ArraysSort.doubleSort   |       50      |       0.282   |       0.291   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       75      |       0.474   |       0.358   
>> |       1.3     |
>> |    ArraysSort.doubleSort   |       100     |       0.654   |       0.623   
>> |       1.0     |
>> |    ArraysSort.doubleSort   |       1000    |       9.274   |       6.331   
>> |       1.5     |
>> |    ArraysSort.doubleSort   |       10000   |       323.339 |       71.228  
>> |       **4.5** |
>> |    ArraysSort.doubleSort   |       100000  |       4471.871        |       
>> 1002.748        |       **4.5** |
>> |    ArraysSort.doubleSort   |       1000000 |       51660.742       |       
>> 12921.295       |       **4.0** |
>> |    ArraysSort.floatSort    |       10      |       0.045   |       0.046   
>> |       1.0     |
>> |    ArraysSort.floatSort    |       25      |       0.103   |       0.084   
>> |       1.2     |
>> |    ArraysSort.floatSort    |       50      |       0.285   |       0.33    
>> |       0.9     |
>> |    ArraysSort.floatSort    |       75      |       0.492   |       0.346   
>> |       1.4     |
>> |    ArraysSort.floatSort    |       100     |       0.597   |       0.326   
>> |       1.8     |
>> |    ArraysSort.floatSort    |       1000    |       9.811   |       5.294   
>> |       1.9     |
>> |    ArraysSort.floatSort    |       10000   |       323.955 |       50.547  
>> |       **6.4** |
>> |    ArraysSort.floatSort    |       100000  |       4326.38 |       731.152 
>> |       **5.9** |
>> |    ArraysSort.floatSort    |       1000000 |       52413.88        |       
>> 8409.193        |       **6.2** |
>> |    ArraysSort.intSort      |       10      |       0.033   |       0.033   
>> |       1.0     |
>> |    ArraysSort.intSort      |       25      |       0.086   |       0.051   
>> |       1.7     |
>> |    ArraysSort.intSort      |       50      |       0.236   |       0.151   
>> |       1.6     |
>> |    ArraysSort.intSort      |       75      |       0.416   |       0.332   
>> |       1.3     |
>> |    ArraysSort.intSort      |       100     |       0.63    |       0.521   
>> |       1.2     |
>> |    ArraysSort.intSort      |       1000    |       10.518  |       4.698   
>> |       2.2     |
>> |    ArraysSort.intSort      |       10000   |       309.659 |       42.518  
>> |       **7.3** |
>> |    ArraysSort.intSort      |       100000  |       4130.917        |       
>> 573.956 |       **7.2** |
>> |    ArraysSort.intSort      |       1000000 |       49876.307       |       
>> 6712.812        |       **7.4** |
>> |    ArraysSort.longSort     |       10      |       0.036   |       0.037   
>> |       1.0     |
>> |    ArraysSort.longSort     |       25      |       0.094   |       0.08    
>> |       1.2     |
>> |    ArraysSort.longSort     |       50      |       0.218   |       0.227   
>> |       1.0     |
>> |    ArraysSort.longSort     |       75      |       0.466   |       0.402   
>> |       1.2     |
>> |    ArraysSort.longSort     |       100     |       0.76    |       0.58    
>> |       1.3     |
>> |    ArraysSort.longSort     |       1000    |       10.449  |       6....
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due 
> to a merge or a rebase. The pull request now contains 45 commits:
> 
>  - fix code style and formatting
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into avx512sort
>  - Update CompileThresholdScaling only for the sort and partition intrinsics; 
> update build script to remove nested if
>  - change variable names of indexPivot* to pivotIndex*
>  - Update DualPivotQuicksort.java
>  - Rename arraySort and arrayPartition Java methods to sort and partition. 
> Cleanup some comments
>  - Remove the unnecessary exception in single pivot partitioning fallback 
> method
>  - Move functional interfaces close to the associated methods
>  - Refactor the sort and partition intrinsics to accept method references for 
> fallback functions
>  - Refactor stub handling to use a generic function for all types
>  - ... and 35 more: https://git.openjdk.org/jdk/compare/a1c9587c...a5262d86

A [discussion on 
Reddit](https://www.reddit.com/r/java/comments/171t5sj/heads_up_openjdk_implementation_of_avx512_based/)
 raised that this had the potential to regress sort performance on AMD Zen 4. 
The poster didn't have access to Zen 4 hardware, so I took a moment to run the 
benchmark, comparing against a baseline with `UseAVX=2` on an AWS `m7a.4xlarge`:


processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 17
model name      : AMD EPYC 9R14
stepping        : 1
microcode       : 0xa10113e
cpu MHz         : 3673.537
cache size      : 1024 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb 
rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf 
tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic 
movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy 
abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd 
perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f 
avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw 
avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru 
wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq 
avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 5200.00
TLB size        : 3584 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:



$ make test TEST="micro:java.lang.ArraysSort.intSort"

Benchmark            (size)  Mode  Cnt      Score     Error  Units
ArraysSort.intSort       10  avgt    3      0.033 ?   0.001  us/op
ArraysSort.intSort       25  avgt    3      0.067 ?   0.002  us/op
ArraysSort.intSort       50  avgt    3      0.391 ?   0.007  us/op
ArraysSort.intSort       75  avgt    3      0.923 ?   0.041  us/op
ArraysSort.intSort      100  avgt    3      1.292 ?   0.009  us/op
ArraysSort.intSort     1000  avgt    3     24.268 ?   0.078  us/op
ArraysSort.intSort    10000  avgt    3    320.131 ?   0.381  us/op
ArraysSort.intSort   100000  avgt    3   4803.142 ?  63.901  us/op
ArraysSort.intSort  1000000  avgt    3  61457.708 ? 347.446  us/op



$ make test TEST="micro:java.lang.ArraysSort.intSort" 
MICRO="VM_OPTIONS=-XX:UseAVX=2"

Benchmark            (size)  Mode  Cnt      Score      Error  Units
ArraysSort.intSort       10  avgt    3      0.034 ?    0.001  us/op
ArraysSort.intSort       25  avgt    3      0.091 ?    0.008  us/op
ArraysSort.intSort       50  avgt    3      0.247 ?    0.022  us/op
ArraysSort.intSort       75  avgt    3      0.432 ?    0.005  us/op
ArraysSort.intSort      100  avgt    3      0.560 ?    0.018  us/op
ArraysSort.intSort     1000  avgt    3      8.848 ?    0.530  us/op
ArraysSort.intSort    10000  avgt    3    118.476 ?  478.914  us/op
ArraysSort.intSort   100000  avgt    3   4656.937 ?  110.323  us/op
ArraysSort.intSort  1000000  avgt    3  56598.595 ? 2749.859  us/op


This is explained in the linked comments/commits:

- https://github.com/intel/x86-simd-sort/issues/6#issuecomment-1433687405
- 
https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1751934240

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

Reply via email to