Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

Quan Anh Mai Fri, 13 Oct 2023 22:07:29 -0700

On Sat, 14 Oct 2023 03:21:52 GMT, himichael <d...@openjdk.org> wrote:


>>> My question is that this feature should improve performance several times, 
>>> but there doesn't appear to be much difference between OpenJDK 22.19 and 
>>> JDK 8. Is there a problem with my configuration?
>> 
>> Hello @himichael,
>> 
>> Using your code snippet, please see the output below using the latest JDK 
>> and JDK 20 (which does not have AVX512 sort):
>> 
>> JDK 20 (without AVX512 sort):
>> `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort`
>> 
>> elapse time -> **7501 ms**
>> 
>> ------------------------------
>> JDK 22 (with AVX512 sort):
>> `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort`
>> 
>> elapse time -> **1607 ms**
>> 
>> This shows a speedup of about 4.67x.
>
> 
> Hello, @vamsi-parasa 
> I used the commands you provided, but nothing seems to have changed.   
> The test procedure is as follows:   
> Using JDK 8 (without AVX512 sort):   
> 
> /data/soft/jdk1.8.0_371/bin/javac  JDKSort.java
> /data/soft/jdk1.8.0_371/bin/java  JDKSort
> 
> elapse time -> **15309 ms**   
>    
> Using OpenJDK 22.19 (with AVX512 sort):   
> 
> /data/soft/jdk-22/bin/javac JDKSort.java
> /data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
> CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100
> 
> elapse time -> **11687 ms**
>    
> Not much seems to have changed.   
> 
> My JDK info:   
> OpenJDK 22.19:
> 
> /data/soft/jdk-22/bin/java -version
> openjdk version "22-ea" 2024-03-19
> OpenJDK Runtime Environment (build 22-ea+19-1460)
> OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)
> 
> 
> JDK 8:
> 
> /data/soft/jdk1.8.0_371/bin/java -version
> java version "1.8.0_371"
> Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
> 
> 
> 
> I also tested Intel's **x86-simd-sort**; my code is as follows:
> ```c++
> #include <iostream>
> #include <vector>
> #include <algorithm>
> #include <chrono>
> #include "src/avx512-32bit-qsort.hpp"
> 
> int main() {
> 
>     // 100 million records
>     const int size = 100000000;
>     std::vector<int> random_array(size);
> 
>     for (int i = 0; i < size; ++i) {
>         random_array[i] = rand();
>     }
> 
>     auto start_time = std::chrono::steady_clock::now();
> 
>     avx512_qsort(random_array.data(), size);
> 
>     auto end_time = std::chrono::steady_clock::now();
>     auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time)....

@himichael What do you mean when you say this has nothing to do with 
benchmarking? You are executing some code to measure its execution time, which 
is benchmarking. And you are doing that on a single simple function, which 
makes your benchmark a microbenchmark.

To be more specific, this is a C2-specific optimisation, so only C2-compiled 
code benefits from it. As a result, you need to have the function compiled 
BEFORE starting the clock. Typically, this is ensured by executing the function 
repeatedly for a number of warm-up iterations (the current default value is 
20000), starting the clock, executing the function several more times, stopping 
the clock and calculating the average throughput. As this is quite complex and 
has non-trivial caveats, it is recommended to use JMH for microbenchmarks.
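
A minimal hand-rolled sketch of the warm-up-then-measure pattern described 
above, for illustration only. The class name `SortWarmupSketch`, the array 
size, the seed and the iteration counts are all assumptions chosen for this 
example; they are not taken from `JDKSort` and are not guaranteed to trigger 
C2 compilation on every machine or JDK build. It is not a substitute for JMH.

```java
import java.util.Arrays;
import java.util.Random;

public class SortWarmupSketch {
    // Hypothetical values for illustration; tune them for your own machine.
    static final int SIZE = 1_000_000;
    static final int WARMUP_ITERATIONS = 20;
    static final int MEASURED_ITERATIONS = 10;

    /** Builds a reproducible array of random ints. */
    static int[] randomInts(int size, long seed) {
        return new Random(seed).ints(size).toArray();
    }

    /** Sorts a fresh copy per iteration and returns the average sort time in nanoseconds. */
    static long averageSortNanos(int[] source, int iterations) {
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            // Copy outside the timed region so every run sorts unsorted data.
            int[] copy = Arrays.copyOf(source, source.length);
            long start = System.nanoTime();
            Arrays.sort(copy);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / iterations;
    }

    public static void main(String[] args) {
        int[] data = randomInts(SIZE, 42L);

        // Warm-up phase: gives the JIT a chance to compile the sort code
        // before the numbers we actually report are collected.
        long warmupAvg = averageSortNanos(data, WARMUP_ITERATIONS);

        // Measured phase: only these iterations count towards the result.
        long measuredAvg = averageSortNanos(data, MEASURED_ITERATIONS);

        System.out.printf("warm-up average:  %d us%n", warmupAvg / 1_000);
        System.out.printf("measured average: %d us%n", measuredAvg / 1_000);
    }
}
```

Comparing the two averages gives a rough idea of how much of a single-shot 
timing is spent in interpreted or not-yet-optimised code rather than in the 
compiled sort itself.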

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1762598167
