On Thu, 28 Nov 2024 18:22:24 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:
> Found this while cleaning up x86_32 code for removal.
>
> In our current code there is a block added by
> [JDK-8076373](https://bugs.openjdk.org/browse/JDK-8076373):
> https://github.com/openjdk/jdk/blob/3b21a298c29d88720f6bfb2dc1f3305b6a3db307/src/hotspot/share/compiler/compileBroker.cpp#L1451-L1473
>
> Ostensibly, that block is for x86_32 handling of signalling NaNs -- x87 FPU
> has a peculiarity with them. See other funky bugs we have seen with it:
> [JDK-8285985](https://bugs.openjdk.org/browse/JDK-8285985),
> [JDK-8293991](https://bugs.openjdk.org/browse/JDK-8293991).
>
> But the way the current block is coded, it is enabled for X86 wholesale, which
> also means x86_64! In fact, it is likely even worse on x86_64, because the
> related "fast" entries are generated only for x86_32:
> https://github.com/openjdk/jdk/blob/3b21a298c29d88720f6bfb2dc1f3305b6a3db307/src/hotspot/share/interpreter/templateInterpreterGenerator.cpp#L493-L502
>
> This can be solved by checking `IA32` instead of `X86`. This block would be
> gone completely once we remove the x86_32 port. Meanwhile, we can make it right
> by x86_64, and make eventual x86_32 removal less confusing. This issue seems
> to only affect the compilation of native methods, while most of the hot code
> is riding on compiler intrinsics. I'll put performance data in comments.
>
> Additional testing:
>  - [ ] Linux x86_64 server fastdebug, `all`

As expected, none of this matters when C2 intrinsics work:

Benchmark                                     Mode  Cnt   Score   Error  Units

# Baseline
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9   0.420 ± 0.041  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9   0.413 ± 0.012  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9   0.412 ± 0.020  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9   0.413 ± 0.007  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9   0.409 ± 0.007  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9   0.414 ± 0.012  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9   0.410 ± 0.005  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9   0.412 ± 0.008  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9   0.413 ± 0.004  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9   0.412 ± 0.008  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9   0.413 ± 0.009  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9   0.421 ± 0.022  ns/op

# Patched
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9   0.425 ± 0.036  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9   0.418 ± 0.009  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9   0.416 ± 0.017  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9   0.412 ± 0.004  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9   0.412 ± 0.010  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9   0.414 ± 0.005  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9   0.410 ± 0.005  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9   0.408 ± 0.007  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9   0.413 ± 0.015  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9   0.411 ± 0.008  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9   0.409 ± 0.008  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9   0.426 ± 0.011  ns/op

It does matter a lot when the choice is to go through interpreter native entry (slow) or via compiled native adapter (fast):

# Baseline, -XX:-InlineMathNatives
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.604 ± 0.015  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9  97.382 ± 1.364  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9  97.636 ± 2.620  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9  96.162 ± 0.513  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9  98.678 ± 3.378  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9  97.374 ± 3.878  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9  96.753 ± 3.659  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9  97.173 ± 2.879  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9  96.375 ± 2.150  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9  95.868 ± 2.192  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9  97.377 ± 2.346  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9  95.947 ± 2.211  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9  97.705 ± 3.467  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9  96.052 ± 2.359  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9  98.793 ± 1.997  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9  97.201 ± 2.327  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9  97.515 ± 1.939  ns/op

# Patched, -XX:-InlineMathNatives
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.598 ± 0.025  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9   4.508 ± 0.318  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9   4.370 ± 0.003  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9   4.285 ± 0.295  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9   4.281 ± 0.331  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9   4.155 ± 0.311  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9   4.592 ± 0.362  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9   4.815 ± 0.038  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9   4.800 ± 0.019  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9   4.510 ± 0.322  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9   4.501 ± 0.332  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9   4.280 ± 0.336  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9   4.278 ± 0.320  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9   4.144 ± 0.329  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9   4.551 ± 0.329  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9   4.549 ± 0.327  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9   4.676 ± 0.328  ns/op

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22446#issuecomment-2506638455
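
A note on the `doubleToLongBits_NaN` and `floatToIntBits_NaN` rows, which stay around 0.5-0.6 ns/op even with `-XX:-InlineMathNatives`: as far as I understand, `Double.doubleToLongBits` is plain Java that checks for NaN first and returns the canonical NaN pattern without reaching the native `doubleToRawLongBits`. The following is a minimal sketch of that shape, not the actual JDK source; the class name `NaNBitsSketch` and the method `toLongBitsSketch` are hypothetical names used only for illustration.

```java
public class NaNBitsSketch {
    // Canonical NaN bit pattern documented for Double.doubleToLongBits.
    static final long CANONICAL_NAN = 0x7ff8000000000000L;

    // Hypothetical stand-in for the library method: only non-NaN inputs
    // reach the raw conversion, which is the native (or intrinsified) call.
    static long toLongBitsSketch(double value) {
        if (!Double.isNaN(value)) {
            return Double.doubleToRawLongBits(value);
        }
        // NaN inputs return a constant, so no native call is involved.
        return CANONICAL_NAN;
    }

    public static void main(String[] args) {
        double[] inputs = {0.0, 1.0, Double.NaN};
        for (double d : inputs) {
            System.out.printf("%s -> 0x%016x%n", d, toLongBitsSketch(d));
        }
    }
}
```

Under that assumption, the `_one` and `_zero` variants pay for the native call path, while the `_NaN` variants of the non-raw methods do not.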