On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad <redes...@openjdk.org> wrote:
> Continuing the work initiated by @luhenry to unroll and then intrinsify
> polynomial hash loops.
>
> I've rewired the library changes to route via a single `@IntrinsicCandidate`
> method. To make this work I've harmonized how they are invoked so that
> there's less special handling and checks in the intrinsic. Mainly do the
> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>
> Having a centralized entry point means it'll be easier to parameterize the
> factor and start values which are now hard-coded (always 31, and a start
> value of either one for `Arrays` or zero for `String`). It seems somewhat
> premature to parameterize this up front.
>
> The current implementation is performance neutral on microbenchmarks on all
> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a
> few trivial method calls which increase the call stack depth, so surprises
> cannot be ruled out on complex workloads.
>
> With the most recent fixes the x64 intrinsic results on my workstation look
> like this:
>
> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>
> Baseline:
>
> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>
> I.e. no measurable overhead compared to baseline even for `size == 1`.
>
> The vectorized code now nominally works for all unsigned cases as well as
> ints, though more testing would be good.
>
> Benchmark for `Arrays.hashCode`:
>
> Benchmark              (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
>
> Baseline:
>
> Benchmark              (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
>
> As we can see the `Arrays` intrinsics are faster for small inputs, and faster
> on large inputs for `char` and `int` (the ones currently vectorized). I aim
> to fix `byte` and `short` cases before integrating, though it might be
> acceptable to hand that off as follow-up enhancements to not further delay
> integration of this enhancement.

I agree, please go ahead; I've left some comments on the x86 implementation below. Thanks.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3358:

> 3356: movl(result, is_string_hashcode ? 0 : 1);
> 3357:
> 3358: // if (cnt1 == 0) {

You may want to reorder the execution of the loops: a short array suffers relatively more from setup overhead than a big one, so the short-input paths should take as few extra hops as possible. For example, I think this could be:

    if (cnt1 >= 4) {
      if (cnt1 >= 16) {
        UNROLLED VECTOR LOOP
        SINGLE VECTOR LOOP
      }
      UNROLLED SCALAR LOOP
    }
    SINGLE SCALAR LOOP

The thresholds are arbitrary and need to be measured carefully.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3374:

> 3372:
> 3373: // int i = 0;
> 3374: movl(index, 0);

`xorl(index, index)`

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3387:

> 3385: for (int idx = 0; idx < 4; idx++) {
> 3386: // h = (31 * h) or (h << 5 - h);
> 3387: movl(tmp, result);

If you are unrolling this, maybe break the dependency chain:
`h = h * 31**4 + x[i] * 31**3 + x[i + 1] * 31**2 + x[i + 2] * 31 + x[i + 3]`
(a small Java model of this reassociation is sketched after these comments).

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3418:

> 3416: // } else { // cnt1 >= 32
> 3417: address power_of_31_backwards = pc();
> 3418: emit_int32( 2111290369);

Can this giant table be shared among compilations instead?
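To make the reassociation suggested for line 3387 concrete, here is a minimal plain-Java model (the class and method names are illustrative, not part of the patch) showing that folding the powers of 31 gives the same result as four serially dependent steps while making the element multiplications independent:

    public class Unroll4Sketch {
        // Serial form: four dependent h = 31 * h + x[i] steps per block.
        static int hash4Serial(int h, int[] x, int i) {
            h = 31 * h + x[i];
            h = 31 * h + x[i + 1];
            h = 31 * h + x[i + 2];
            h = 31 * h + x[i + 3];
            return h;
        }

        // Reassociated form: 31^4 = 923521, 31^3 = 29791, 31^2 = 961, so the
        // four element multiplications no longer depend on one another and can
        // be issued (or vectorized) in parallel.
        static int hash4Unrolled(int h, int[] x, int i) {
            return h * 923521
                 + x[i]     * 29791
                 + x[i + 1] * 961
                 + x[i + 2] * 31
                 + x[i + 3];
        }

        public static void main(String[] args) {
            int[] x = {7, -3, 42, 11};
            // Both forms produce the same value; int overflow wraps identically.
            System.out.println(hash4Serial(1, x, 0) == hash4Unrolled(1, x, 0)); // true
        }
    }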
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3484:

> 3482: decrementl(index);
> 3483: jmpb(LONG_SCALAR_LOOP_BEGIN);
> 3484: bind(LONG_SCALAR_LOOP_END);

You can share this loop with the scalar ones above.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3493:

> 3491: // vnext = IntVector.broadcast(I256, power_of_31_backwards[0]);
> 3492: movdl(vnext, InternalAddress(power_of_31_backwards + (0 * sizeof(jint))));
> 3493: vpbroadcastd(vnext, vnext, Assembler::AVX_256bit);

`vpbroadcastd` can take an `Address` argument instead.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3523:

> 3521: subl(index, 32);
> 3522: // i >= 0;
> 3523: cmpl(index, 0);

You don't need this `cmpl`, since `subl` already sets the flags according to the result.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3528:

> 3526: vpmulld(vcoef[idx], vcoef[idx], vnext, Assembler::AVX_256bit);
> 3527: }
> 3528: jmp(LONG_VECTOR_LOOP_BEGIN);

Iterating backwards forces you to recalculate the coefficients on each iteration; I think iterating forwards, as usual, would be better (a plain-Java sketch of the forward scheme follows below).

-------------

PR: https://git.openjdk.org/jdk/pull/10847
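As a companion to the comment on line 3528, here is a hypothetical plain-Java model of the forward (front-to-back) block scheme; it is scalar code with illustrative names, not the intrinsic itself. The point is that the per-block coefficients 31^7 .. 31^0 are loop-invariant, and only the accumulator is rescaled by 31^8 each iteration, whereas the backward scheme has to rescale the whole coefficient set every pass:

    import java.util.Arrays;

    public class ForwardBlockSketch {
        // Powers of 31 for one 8-element block, highest power first.
        // Loop-invariant when iterating front to back; int overflow wraps,
        // which matches how hashCode arithmetic behaves (mod 2^32).
        static final int[] COEF = {
            31 * 31 * 31 * 31 * 31 * 31 * 31, // 31^7
            31 * 31 * 31 * 31 * 31 * 31,      // 31^6
            31 * 31 * 31 * 31 * 31,           // 31^5
            31 * 31 * 31 * 31,                // 31^4
            31 * 31 * 31,                     // 31^3
            31 * 31,                          // 31^2
            31,                               // 31^1
            1                                 // 31^0
        };
        static final int POW31_8 = 31 * COEF[0]; // 31^8, accumulator scale per block

        static int hashCode(int[] a) {
            int h = 1; // same start value as Arrays.hashCode(int[])
            int i = 0;
            // Forward block loop: COEF never changes, only h is rescaled.
            for (; i + 8 <= a.length; i += 8) {
                int block = 0;
                for (int j = 0; j < 8; j++) {
                    block += a[i + j] * COEF[j]; // independent multiplies, vectorizable
                }
                h = h * POW31_8 + block;
            }
            // Scalar tail for the remaining 0..7 elements.
            for (; i < a.length; i++) {
                h = 31 * h + a[i];
            }
            return h;
        }

        public static void main(String[] args) {
            int[] a = new int[37];
            for (int k = 0; k < a.length; k++) a[k] = k * 17 - 5;
            System.out.println(hashCode(a) == Arrays.hashCode(a)); // true
        }
    }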