On Thu, 5 Oct 2023 23:36:48 GMT, Srinivas Vamsi Parasa <d...@openjdk.org> wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX512 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and >> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the >> performance data below. >> >> >> **Arrays.sort performance data using JMH benchmarks for arrays with random >> data** >> >> | Arrays.sort benchmark | Array Size | Baseline >> (us/op) | AVX512 Sort (us/op) | Speedup | >> | --- | --- | --- | --- | --- >> | >> | ArraysSort.doubleSort | 10 | 0.034 | 0.035 >> | 1.0 | >> | ArraysSort.doubleSort | 25 | 0.116 | 0.089 >> | 1.3 | >> | ArraysSort.doubleSort | 50 | 0.282 | 0.291 >> | 1.0 | >> | ArraysSort.doubleSort | 75 | 0.474 | 0.358 >> | 1.3 | >> | ArraysSort.doubleSort | 100 | 0.654 | 0.623 >> | 1.0 | >> | ArraysSort.doubleSort | 1000 | 9.274 | 6.331 >> | 1.5 | >> | ArraysSort.doubleSort | 10000 | 323.339 | 71.228 >> | **4.5** | >> | ArraysSort.doubleSort | 100000 | 4471.871 | >> 1002.748 | **4.5** | >> | ArraysSort.doubleSort | 1000000 | 51660.742 | >> 12921.295 | **4.0** | >> | ArraysSort.floatSort | 10 | 0.045 | 0.046 >> | 1.0 | >> | ArraysSort.floatSort | 25 | 0.103 | 0.084 >> | 1.2 | >> | ArraysSort.floatSort | 50 | 0.285 | 0.33 >> | 0.9 | >> | ArraysSort.floatSort | 75 | 0.492 | 0.346 >> | 1.4 | >> | ArraysSort.floatSort | 100 | 0.597 | 0.326 >> | 1.8 | >> | ArraysSort.floatSort | 1000 | 9.811 | 5.294 >> | 1.9 | >> | ArraysSort.floatSort | 10000 | 323.955 | 50.547 >> | **6.4** | >> | ArraysSort.floatSort | 100000 | 4326.38 | 731.152 >> | **5.9** | >> | ArraysSort.floatSort | 1000000 | 52413.88 | >> 8409.193 | **6.2** | >> | ArraysSort.intSort | 10 | 0.033 | 0.033 >> | 1.0 | >> | ArraysSort.intSort | 25 | 0.086 | 0.051 >> | 1.7 | >> | ArraysSort.intSort | 50 | 0.236 | 0.151 >> | 1.6 | >> | ArraysSort.intSort | 75 | 0.416 | 0.332 >> | 1.3 | >> | ArraysSort.intSort | 100 | 0.63 | 0.521 >> | 1.2 | >> | ArraysSort.intSort | 1000 | 10.518 | 4.698 >> | 2.2 | >> | ArraysSort.intSort | 10000 | 309.659 | 42.518 >> | **7.3** | >> | ArraysSort.intSort | 100000 | 4130.917 | >> 573.956 | **7.2** | >> | ArraysSort.intSort | 1000000 | 49876.307 | >> 6712.812 | **7.4** | >> | ArraysSort.longSort | 10 | 0.036 | 0.037 >> | 1.0 | >> | ArraysSort.longSort | 25 | 0.094 | 0.08 >> | 1.2 | >> | ArraysSort.longSort | 50 | 0.218 | 0.227 >> | 1.0 | >> | ArraysSort.longSort | 75 | 0.466 | 0.402 >> | 1.2 | >> | ArraysSort.longSort | 100 | 0.76 | 0.58 >> | 1.3 | >> | ArraysSort.longSort | 1000 | 10.449 | 6.... > > Srinivas Vamsi Parasa has updated the pull request with a new target base due > to a merge or a rebase. The pull request now contains 45 commits: > > - fix code style and formatting > - Merge branch 'master' of https://git.openjdk.java.net/jdk into avx512sort > - Update CompileThresholdScaling only for the sort and partition intrinsics; > update build script to remove nested if > - change variable names of indexPivot* to pivotIndex* > - Update DualPivotQuicksort.java > - Rename arraySort and arrayPartition Java methods to sort and partition. > Cleanup some comments > - Remove the unnecessary exception in single pivot partitioning fallback > method > - Move functional interfaces close to the associated methods > - Refactor the sort and partition intrinsics to accept method references for > fallback functions > - Refactor stub handling to use a generic function for all types > - ... and 35 more: https://git.openjdk.org/jdk/compare/a1c9587c...a5262d86 A [discussion on Reddit](https://www.reddit.com/r/java/comments/171t5sj/heads_up_openjdk_implementation_of_avx512_based/) raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with `UseAVX=2` on an AWS `m7a.4xlarge`: processor : 0 vendor_id : AuthenticAMD cpu family : 25 model : 17 model name : AMD EPYC 9R14 stepping : 1 microcode : 0xa10113e cpu MHz : 3673.537 cache size : 1024 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 8 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 16 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass bogomips : 5200.00 TLB size : 3584 4K pages clflush size : 64 cache_alignment : 64 address sizes : 48 bits physical, 48 bits virtual power management: $ make test TEST="micro:java.lang.ArraysSort.intSort" Benchmark (size) Mode Cnt Score Error Units ArraysSort.intSort 10 avgt 3 0.033 ? 0.001 us/op ArraysSort.intSort 25 avgt 3 0.067 ? 0.002 us/op ArraysSort.intSort 50 avgt 3 0.391 ? 0.007 us/op ArraysSort.intSort 75 avgt 3 0.923 ? 0.041 us/op ArraysSort.intSort 100 avgt 3 1.292 ? 0.009 us/op ArraysSort.intSort 1000 avgt 3 24.268 ? 0.078 us/op ArraysSort.intSort 10000 avgt 3 320.131 ? 0.381 us/op ArraysSort.intSort 100000 avgt 3 4803.142 ? 63.901 us/op ArraysSort.intSort 1000000 avgt 3 61457.708 ? 347.446 us/op $ make test TEST="micro:java.lang.ArraysSort.intSort" MICRO="VM_OPTIONS=-XX:UseAVX=2" Benchmark (size) Mode Cnt Score Error Units ArraysSort.intSort 10 avgt 3 0.034 ? 0.001 us/op ArraysSort.intSort 25 avgt 3 0.091 ? 0.008 us/op ArraysSort.intSort 50 avgt 3 0.247 ? 0.022 us/op ArraysSort.intSort 75 avgt 3 0.432 ? 0.005 us/op ArraysSort.intSort 100 avgt 3 0.560 ? 0.018 us/op ArraysSort.intSort 1000 avgt 3 8.848 ? 0.530 us/op ArraysSort.intSort 10000 avgt 3 118.476 ? 478.914 us/op ArraysSort.intSort 100000 avgt 3 4656.937 ? 110.323 us/op ArraysSort.intSort 1000000 avgt 3 56598.595 ? 2749.859 us/op This is explained in the linked comments/commits: - https://github.com/intel/x86-simd-sort/issues/6#issuecomment-1433687405 - https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026 ------------- PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1751934240