On Mon, 6 Jan 2025 13:18:50 GMT, Shaojin Wen <s...@openjdk.org> wrote:
> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD
> within a register) instead of table lookup. Eliminating the table lookup can
> also avoid the performance degradation problem when the cache misses.

By stepping through the code of `Long.expand`, and substituting in the constants, I come up with this:

    static long expandNibbles(long i) {
        // Inlined version of Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL)
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & 0x0F0F_0F0F_0F0F_0F0FL;
    }

This looks like it might actually do better than *Method 2*. If inlining and constant folding is happening in the non-intrinsic `Long.expand` I would imagine it would perform comparably to this.

The non-intrinsified java code should be able to run as quickly as the hand-inlined one. I think I've found an issue that prevents the code from being constant-folded as expected. C2 seems to not do constant-folding of xor nodes. See https://github.com/openjdk/jdk/pull/23089 for an attempt at addressing this.

There are no XOR nodes in expandNibbles

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2584577398
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2588342173
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590840422
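
For anyone who wants to sanity-check the hand-inlined `expandNibbles` quoted above against the library method, here is a minimal sketch. It assumes JDK 19 or later (where `Long.expand` is available). The class name `ExpandNibblesCheck` and the helper `countMismatches` are made up for illustration and are not part of the PR. It only checks that the two produce the same values and says nothing about performance.

```java
import java.util.Random;

class ExpandNibblesCheck {
    static final long NIBBLE_MASK = 0x0F0F_0F0F_0F0F_0F0FL;

    // Hand-inlined version as quoted in the thread: spreads the low
    // eight nibbles of i into the low nibble of each of the eight bytes.
    static long expandNibbles(long i) {
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & NIBBLE_MASK;
    }

    // Hypothetical helper: counts random inputs on which the inlined
    // version disagrees with Long.expand(value, NIBBLE_MASK).
    static int countMismatches(long seed, int samples) {
        Random random = new Random(seed);
        int mismatches = 0;
        for (int n = 0; n < samples; n++) {
            long value = random.nextLong();
            if (expandNibbles(value) != Long.expand(value, NIBBLE_MASK)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Each hex digit of the low 32 bits lands in its own byte.
        System.out.printf("%016x%n", expandNibbles(0x12345678L)); // prints 0102030405060708
        System.out.println("mismatches: " + countMismatches(42L, 1_000_000));
    }
}
```

Only the low 32 bits of the input contribute to the result, which matches `Long.expand` with a mask that has 32 bits set. That is why the check can feed in arbitrary 64-bit values.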