On Mon, 6 Jan 2025 13:18:50 GMT, Shaojin Wen <s...@openjdk.org> wrote:

> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD 
> within a register) instead of table lookup. Eliminating the table lookup can 
> also avoid the performance degradation problem when the cache misses.
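
For readers unfamiliar with the approach quoted above, here is a minimal sketch of
the general idea: spread the nibbles into separate bytes, then turn every byte
into its hex digit with plain arithmetic instead of a lookup table. This is not
the code from the PR; the class `SwarHexSketch` and the methods
`nibblesToHexAscii` and `toHex8` are hypothetical names, and the
multiply-by-39 step is just one simple way to write the adjustment for the
digits `a` to `f`.

```java
import java.nio.charset.StandardCharsets;

final class SwarHexSketch {
    // Spreads the 8 nibbles of the low 32 bits into the low nibble of each byte.
    static long expandNibbles(long i) {
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & 0x0F0F_0F0F_0F0F_0F0FL;
    }

    // Assumes every byte of x holds a value in 0..15.
    static long nibblesToHexAscii(long x) {
        // Adding 6 carries into bit 4 of a byte exactly when its nibble is >= 10.
        long m = (x + 0x0606_0606_0606_0606L) & 0x1010_1010_1010_1010L;
        // '0' + n for every byte, plus 39 ('a' - '0' - 10) where the nibble is >= 10.
        return x + 0x3030_3030_3030_3030L + (m >>> 4) * 39;
    }

    // Formats a 32-bit value as 8 lowercase hex digits.
    static String toHex8(int v) {
        long ascii = nibblesToHexAscii(expandNibbles(v & 0xFFFF_FFFFL));
        byte[] out = new byte[8];
        for (int k = 0; k < 8; k++) {
            out[k] = (byte) (ascii >>> (56 - 8 * k)); // most significant byte first
        }
        return new String(out, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(toHex8(0x89abcdef)); // 89abcdef
    }
}
```

No byte ever exceeds 102 (`'f'`) in the additions above, so there are no
carries between lanes.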

Stepping through the code of `Long.expand` and substituting in the
constants, I came up with this:


    static long expandNibbles(long i) {
        // Inlined version of Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL)
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);

        return i & 0x0F0F_0F0F_0F0F_0F0FL;
    }


This looks like it might actually do better than *Method 2*. If inlining and
constant folding are happening in the non-intrinsic `Long.expand`, I would
imagine it performs comparably to this.
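
As a sanity check on the hand-inlined version, here is a small sketch that
compares `expandNibbles` against a bit-by-bit reference written from the
documented semantics of `Long.expand` (the low-order bits of the input are
placed, in order, at the set bits of the mask). The class
`ExpandNibblesCheck` and the helpers `expandReference` and
`agreesOnRandomInputs` are hypothetical names; on JDK 19 or later one could
compare against `Long.expand` directly instead.

```java
import java.util.Random;

final class ExpandNibblesCheck {
    static final long NIBBLE_MASK = 0x0F0F_0F0F_0F0F_0F0FL;

    // Hand-inlined version from the comment above.
    static long expandNibbles(long i) {
        long t = i << 16;
        i = (i & ~0xFFFF00000000L) | (t & 0xFFFF00000000L);
        t = i << 8;
        i = (i & ~0xFF000000FF0000L) | (t & 0xFF000000FF0000L);
        t = i << 4;
        i = (i & ~0xF000F000F000F00L) | (t & 0xF000F000F000F00L);
        return i & NIBBLE_MASK;
    }

    // Slow reference: the next low-order bit of i goes to each set bit of mask.
    static long expandReference(long i, long mask) {
        long result = 0;
        for (int bit = 0; bit < 64; bit++) {
            if (((mask >>> bit) & 1L) != 0) {
                result |= (i & 1L) << bit;
                i >>>= 1;
            }
        }
        return result;
    }

    // Returns true if both versions agree on n pseudo-random 64-bit inputs.
    static boolean agreesOnRandomInputs(int n, long seed) {
        Random random = new Random(seed);
        for (int k = 0; k < n; k++) {
            long v = random.nextLong();
            if (expandNibbles(v) != expandReference(v, NIBBLE_MASK)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesOnRandomInputs(1_000_000, 42L));
        System.out.println(Long.toHexString(expandNibbles(0x89ABCDEFL))); // 8090a0b0c0d0e0f
    }
}
```

Only the low 32 bits of the input can influence the result, since the mask has
32 set bits; the upper half is overwritten or masked away.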

The non-intrinsified Java code should be able to run as quickly as the
hand-inlined version.

I think I've found an issue that prevents the code from being constant-folded
as expected: C2 does not seem to constant-fold XOR nodes.

See https://github.com/openjdk/jdk/pull/23089 for an attempt at addressing this.
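
To illustrate where XOR comes in: as far as I can tell, the Java fallback
for `Long.expand` first derives a series of "move masks" from the mask
argument, using XOR both in a parallel-suffix helper and when compressing the
mask between steps. The sketch below is my own loop-form rendering of that
phase (the expand algorithm from Hacker's Delight), not a copy of the JDK
source; `MoveMaskSketch` and `moveMasks` are hypothetical names. For the
constant mask used here it produces exactly the three masks that show up in
`expandNibbles`, so the hand-inlined version only exists if this whole
computation folds to constants.

```java
final class MoveMaskSketch {
    // Bit k of the result is the XOR of bits 0..k of maskCount.
    static long parallelSuffix(long maskCount) {
        long maskPrefix = maskCount ^ (maskCount << 1);
        maskPrefix ^= maskPrefix << 2;
        maskPrefix ^= maskPrefix << 4;
        maskPrefix ^= maskPrefix << 8;
        maskPrefix ^= maskPrefix << 16;
        maskPrefix ^= maskPrefix << 32;
        return maskPrefix;
    }

    // moves[j] marks the positions that receive a bit shifted left by (1 << j).
    static long[] moveMasks(long mask) {
        long[] moves = new long[6];
        long maskCount = ~mask << 1; // marks the zeros to the right of each position
        for (int j = 0; j < 6; j++) {
            long maskPrefix = parallelSuffix(maskCount);
            long move = maskPrefix & mask;
            moves[j] = move;
            mask = (mask ^ move) | (move >>> (1 << j)); // compress the mask
            maskCount &= ~maskPrefix;
        }
        return moves;
    }

    public static void main(String[] args) {
        long[] moves = moveMasks(0x0F0F_0F0F_0F0F_0F0FL);
        for (int j = 0; j < moves.length; j++) {
            System.out.printf("shift %2d: 0x%016X%n", 1 << j, moves[j]);
        }
    }
}
```

The masks for shifts 1, 2 and 32 come out as zero, which is why the inlined
version needs only the 16, 8 and 4 steps.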

There are no XOR nodes in `expandNibbles`:
![image](https://github.com/user-attachments/assets/057bc8fc-62a2-4fab-8d56-8e0128dac3cd)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2584577398
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2588342173
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590840422
