On Tue, 8 Nov 2022 23:48:22 GMT, Claes Redestad <redes...@openjdk.org> wrote:
>> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>>
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how the methods are invoked, so that there's less special handling and fewer checks in the intrinsic. Mainly, the null check is now done outside of the intrinsic for the `Arrays.hashCode` cases.
>>
>> Having a centralized entry point means it'll be easier to parameterize the factor and start values, which are currently hard-coded (the factor is always 31, and the start value is either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize these up front.
>>
>> The current implementation is performance-neutral on microbenchmarks on all tested platforms (x64, aarch64) when the intrinsic is not enabled. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>>
>> With the most recent fixes, the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>>
>> Baseline:
>>
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>>
>> That is, there is no measurable overhead compared to baseline, even for `size == 1`.
>>
>> The vectorized code now nominally works for all the unsigned cases as well as ints, though more testing would be good.
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
>>
>> Baseline:
>>
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
>>
>> As we can see, the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the types currently vectorized). I aim to fix the `byte` and `short` cases before integrating, though it might be acceptable to hand those off as follow-up enhancements so as not to further delay integration of this enhancement.
>
> Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 55 commits:
>
>  - Revert accidental ModuleHashes change
>  - Merge branch 'master' into 8282664-polyhash
>  - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash
>
>    Unroll + Reorder BBs
>  - fixup! Handle size=0 and size=1 in Java
>  - Handle size=0 and size=1 in Java
>  - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups
>  - Unroll loop for cnt1 < 32
>  - Merge pull request #1 from luhenry/dev/cl4es/8282664-polyhash
>
>    Switch to forward approach for vectorization
>  - Fix vector loop
>  - fix indexing
>  - ... and 45 more: https://git.openjdk.org/jdk/compare/dd5d4df5...853a7575

Overall I am excited to see these changes land, as this will be a nice boost for many string-heavy applications!
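For readers following along: the hash being intrinsified is the classic polynomial hash, `h = 31 * h + element`, with start value 1 for `Arrays.hashCode` and 0 for `String.hashCode`. Below is a minimal sketch of what a centralized entry point computes; the class and method names are hypothetical, not the ones the PR actually introduces:

    // Hypothetical sketch; the PR's real @IntrinsicCandidate method lives in
    // JDK internals and its name and signature may differ.
    class PolyHash {

        // Returns initial * 31^a.length + sum of a[i] * 31^(a.length - 1 - i).
        // initial == 1 matches the Arrays.hashCode contract; initial == 0
        // matches String.hashCode.
        static int hash(byte[] a, int initial) {
            int h = initial;
            for (byte v : a) {
                h = 31 * h + v;   // the loop the intrinsic replaces
            }
            return h;
        }

        public static void main(String[] args) {
            byte[] data = {1, 2, 3};
            // With initial == 1 this agrees with java.util.Arrays.hashCode.
            System.out.println(hash(data, 1) == java.util.Arrays.hashCode(data)); // true
        }
    }

Doing the null check in the caller, as the description above notes, keeps a method like this free of special cases that the intrinsic would otherwise have to mirror.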
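As for why unrolling and vectorizing this loop pays off: the scalar form carries a serial dependency on `h`, but several steps can be folded algebraically into independent multiplications by fixed powers of 31. A rough Java illustration of the idea, for exposition only (the actual intrinsic emits equivalent machine code in the C2 backend rather than Java):

    class PolyHashUnrolled {

        // Behaves identically to the scalar loop (Java int arithmetic wraps
        // modulo 2^32 in both forms), but the four products per iteration are
        // independent, exposing instruction-level and SIMD parallelism.
        static int hashUnrolled(int[] a) {
            int h = 1;                          // Arrays.hashCode start value
            int i = 0;
            for (; i + 4 <= a.length; i += 4) {
                h = 31 * 31 * 31 * 31 * h       // 31^4 * h
                  + 31 * 31 * 31 * a[i]         // 31^3 * a[i]
                  + 31 * 31 * a[i + 1]          // 31^2 * a[i + 1]
                  + 31 * a[i + 2]
                  + a[i + 3];
            }
            for (; i < a.length; i++) {         // scalar tail
                h = 31 * h + a[i];
            }
            return h;
        }

        public static void main(String[] args) {
            int[] data = java.util.stream.IntStream.range(0, 37).toArray();
            System.out.println(hashUnrolled(data) == java.util.Arrays.hashCode(data)); // true
        }
    }

A vectorized version takes the same idea further, multiplying a vector of elements by precomputed powers of 31 each iteration, which is presumably why the `char` and `int` numbers above scale so much better than the not-yet-vectorized `byte` and `short` cases.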
src/hotspot/share/opto/matcher.cpp line 1707:

> 1705:   if (x >= _LAST_MACH_OPER) {
> 1706:     fprintf(stderr, "x = %d, _LAST_MACH_OPER = %d\n", x, _LAST_MACH_OPER);
> 1707:     fprintf(stderr, "dump n\n");

Should this debugging output be removed before merging?

src/hotspot/share/opto/matcher.cpp line 1709:

> 1707:     fprintf(stderr, "dump n\n");
> 1708:     n->dump();
> 1709:     fprintf(stderr, "dump svec\n");

Remove this as well?

-------------

PR: https://git.openjdk.org/jdk/pull/10847