On Tue, 3 Sep 2024 07:37:33 GMT, Francesco Nigro <d...@openjdk.org> wrote:
>> Working on it
>
> @galderz in the benchmark did you collect the mispredicts/branches?

@franz1981 No, I hadn't done so until now, but I will be tracking those more closely.

Context: I have been running some reduction JMH benchmarks and I could see a big drop in non-AVX-512 performance compared to the unpatched code. E.g.:

```java
@Benchmark
public long reductionSingleLongMax() {
    long result = 0;
    for (int i = 0; i < size; i++) {
        final long v = 11 * aLong[i];
        result = Math.max(result, v);
    }
    return result;
}
```

This is caused by keeping the Max/Min nodes in the IR, which get translated into `cmpq+cmovlq` instructions (via the macro expansion). The code gets unrolled but retains a dependency chain on the current max value. In the unpatched code the intrinsic does not kick in and a standard ternary operation is used instead, which gets translated into normal control flow. The system handles this better thanks to branch prediction, which is precisely what @franz1981's comment is about. I need to enhance the benchmark to control the branchiness of the test (e.g. how often it takes one side or the other of a max/min call) and to measure the mispredictions, branches, etc.

FYI: A similar situation can be replicated with reduction benchmarks that use integer max/min, but for the code to fall back to `cmov`, both AVX and SSE have to be turned off.

I also need to see what the performance looks like on a system with AVX-512, and also look at how non-reduction JMH benchmarks behave on systems with/without AVX-512. Finally, I'm also looking at an experiment to see what would happen if `cmovl` were implemented with branch+mov instead.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2337131179
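For reference, the branchy ternary form that the unpatched code effectively falls back to can be sketched as standalone Java (not the JMH harness; the class name, array contents, and the surrounding `main` are illustrative only):

```java
public class MaxReductionSketch {
    // Ternary max: this typically lowers to a compare plus conditional
    // branch, which the branch predictor handles well when one side
    // dominates, unlike the cmpq+cmovlq pair produced for the intrinsic.
    static long reductionTernaryLongMax(long[] aLong) {
        long result = 0;
        for (int i = 0; i < aLong.length; i++) {
            final long v = 11 * aLong[i];
            result = result > v ? result : v; // branch instead of cmov
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input: the max element is 5, so the result is 11 * 5.
        long[] data = {3, 1, 4, 1, 5};
        System.out.println(reductionTernaryLongMax(data)); // prints 55
    }
}
```

The semantics match the `Math.max` version for non-negative inputs; the difference under discussion is purely in the generated control flow.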