On Tue, 3 Sep 2024 07:37:33 GMT, Francesco Nigro <d...@openjdk.org> wrote:

>> Working on it
>
> @galderz in the benchmark did you collected the mispredicts/branches?

@franz1981 No I hadn't done so until now, but I will be tracking those more 
closely.

Context:

I have been running some reduction JMH benchmarks and saw a big drop in 
non-AVX-512 performance compared to the unpatched code. E.g.


    @Benchmark
    public long reductionSingleLongMax() {
        long result = 0;
        for (int i = 0; i < size; i++) {
            final long v = 11 * aLong[i];
            result = Math.max(result, v);
        }
        return result;
    }


This is caused by keeping the Max/Min nodes in the IR, which get translated 
into `cmpq+cmovlq` instructions (via the macro expansion). The loop gets 
unrolled, but there is still a dependency chain on the current max value. In 
the unpatched code the intrinsic does not kick in and a standard ternary 
operation is used instead, which gets translated into normal control flow. The 
CPU handles this better thanks to branch prediction. @franz1981's comment is 
precisely about this. I need to enhance the benchmark to control the 
branchiness of the test (e.g. how often it goes to one side or the other of a 
max/min call) and measure the mispredictions, branches, etc.
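To control branchiness, the input array could be generated so that a new maximum shows up with a chosen probability. A minimal sketch of what I have in mind (the `updateProbability` knob and the class/method names are made up for illustration, not part of the existing benchmark):

```java
import java.util.Random;

public class BranchyMaxData {

    // `updateProbability` is the fraction of iterations in which the loop
    // encounters a new maximum, i.e. how often the max "branch" is taken.
    static long[] data(int size, double updateProbability, long seed) {
        Random rnd = new Random(seed);
        long[] a = new long[size];
        long runningMax = 0;
        for (int i = 0; i < size; i++) {
            if (rnd.nextDouble() < updateProbability) {
                runningMax += 1 + rnd.nextInt(100);
                a[i] = runningMax;                       // new maximum: branch taken
            } else {
                a[i] = runningMax - 1 - rnd.nextInt(50); // stays below current max
            }
        }
        return a;
    }

    // Same reduction kernel shape as the benchmark above, minus the scaling.
    static long reduce(long[] a) {
        long result = 0;
        for (long v : a) {
            result = Math.max(result, v);
        }
        return result;
    }
}
```

With `updateProbability` near 1.0 the branch is almost always taken (easy to predict); with values around 0.5 on shuffled data it should approach worst-case misprediction for the branchy version, while the `cmov` version should be insensitive to it.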

FYI: A similar situation can be reproduced with reduction benchmarks that use 
integer max/min, but for the code to fall back to `cmov`, both AVX and SSE 
have to be turned off.
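For reference, something along these lines is how the flags can be combined with JMH's perf counters (the jar and benchmark names here are placeholders; `-XX:UseAVX`/`-XX:UseSSE` are the real HotSpot switches and `-prof perfnorm` is JMH's normalized perf profiler, which reports branches and branch-misses per op):

```shell
# Placeholder invocation: force the scalar cmov path for the int max/min
# reduction and collect branch/branch-miss counters per operation.
java -jar benchmarks.jar ReductionBench.reductionSingleIntMax \
     -jvmArgs "-XX:UseAVX=0 -XX:UseSSE=0" \
     -prof perfnorm
```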

I also need to see what the performance looks like on a system with AVX-512, 
and also look at how non-reduction JMH benchmarks behave on systems 
with/without AVX-512.

Finally, I'm also looking at an experiment to see what would happen if `cmovl` 
was implemented with branch+mov instead.
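At the source level the two shapes the backend can emit look roughly like this. This is only a sketch of the idea; the actual experiment would change how C2 expands the Max/Min nodes, not Java source, and C2 may still convert the explicit branch back into a `cmov`:

```java
public class CmovVsBranch {

    // With the intrinsic this is macro-expanded to cmpq + cmovq: branch-free,
    // but each loop iteration has a data dependency on the previous `result`.
    static long maxCmovStyle(long result, long v) {
        return Math.max(result, v);
    }

    // The branch+mov shape: cmpq + jcc + movq. When the branch predicts well,
    // the mov is off the critical path and iterations can overlap.
    static long maxBranchStyle(long result, long v) {
        if (v > result) {
            result = v;
        }
        return result;
    }
}
```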

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2337131179
