On Sat, 11 Jan 2025 05:21:36 GMT, Shaojin Wen <s...@openjdk.org> wrote:
>> Improve the performance of UUID::toString by using Long.expand and SWAR
>> (SIMD within a register) instead of table lookup. Eliminating the table
>> lookup can also avoid the performance degradation problem when the cache
>> misses.
>
> The new implementation improves performance on the aarch64 architecture but
> results in a performance regression on x64.
>
> ## 1. Script
>
> git remote add wenshao g...@github.com:wenshao/jdk.git
> git fetch wenshao
>
> # baseline dfaa89162a3
> git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
> make test TEST="micro:java.util.UUIDBench.toString"
>
> # current c513087056b
> git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
> make test TEST="micro:java.util.UUIDBench.toString"
>
>
> ## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
>
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15   94.274 ± 0.452  ops/us
>
> +Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15   80.241 ± 0.894  ops/us -14.88%
>
>
> ## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)
>
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15   85.323 ± 2.044  ops/us
>
> +Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15   73.636 ± 0.590  ops/us -13.69%
>
>
> ## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
>
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15   69.286 ± 1.136  ops/us
>
> +Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15   80.475 ± 0.310  ops/us +16.14%
>
>
> ## 5. MacBook M1 Pro (aarch64)
>
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  108.254 ± 1.167  ops/us
>
> +Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15  122.313 ± 0.820  ops/us +12.98%
>
>
> ## 6. orange_pi5_aarch64 (CPU RK3588S)
>
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15   37.783 ± 1.553  ops/us
>
> +Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15   42.928 ± 2.534  ops/us +13.61%
>
>
> ## 7. orange_aipro_aarch64 (CPU TAISHANV200M)
>
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15   13.822 ± 0.203  ops/us
>
> +Benchmark (size) M...

With regard to the aarch64 vector intrinsic, I don't have access to an aarch64
machine to try it on (I'm faking it on x64 by disabling the intrinsic).

@wenshao would it be possible for you to try the Long.expand version of this
patch together with the patch from https://github.com/openjdk/jdk/pull/23089 to
see how aarch64 performs?

> ARMv8 includes Apple M1/M2, AWS Graviton 3; ARMv9.0 includes Apple M3/M4,
> Aliyun Yitian 710.

An interesting piece of trivia - while the M4 is ARMv9, it appears not to
support SVE - in particular the bdep instruction that this code would use. See
https://github.com/llvm/llvm-project/blob/14b44179cb61dd551c911dea54de57b588621005/llvm/lib/Target/AArch64/AArch64Processors.td#L923

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590911374
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2614028489
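
For readers following the thread who have not seen the technique, here is a minimal sketch of what "Long.expand plus SWAR instead of table lookup" can look like for a single 32-bit value. It is not the code from the PR; the class name SwarHex and the method hex8 are hypothetical, and it assumes JDK 19 or later, where Long.expand is available.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; SwarHex and hex8 are hypothetical names, not JDK code.
public class SwarHex {

    /** Formats the 32 bits of value as 8 lowercase hex digits without a lookup table. */
    public static String hex8(int value) {
        // Spread the 8 nibbles of the input so that each lands in the low
        // nibble of its own byte (one hex digit per byte, values 0..15).
        long x = Long.expand(value & 0xFFFFFFFFL, 0x0F0F0F0F0F0F0F0FL);

        // SWAR step: adding 6 to every byte sets bit 4 exactly when the
        // digit is 10..15. After the shift, m holds 0x01 in those bytes.
        long m = ((x + 0x0606060606060606L) & 0x1010101010101010L) >>> 4;

        // Every byte gets '0' (0x30); bytes holding 10..15 get a further
        // 39 ('a' - '0' - 10). No byte exceeds 'f', so no carries cross bytes.
        long ascii = x + 0x3030303030303030L + m * 39;

        // The most significant byte holds the first character.
        byte[] out = new byte[8];
        for (int k = 0; k < 8; k++) {
            out[k] = (byte) (ascii >>> (56 - 8 * k));
        }
        return new String(out, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(hex8(0xDEADBEEF)); // prints deadbeef
        System.out.println(hex8(0x0123ABCD)); // prints 0123abcd
    }
}
```

Whether this beats a table lookup depends on how Long.expand is compiled on the target CPU, which is the question the benchmark numbers above are probing.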