On Tue, 25 Oct 2022 16:03:28 GMT, Ludovic Henry <luhe...@openjdk.org> wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify 
>> polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
>> method. To make this work I've harmonized how they are invoked so that 
>> there's less special handling and checking in the intrinsic; mainly, the 
>> null check is now done outside the intrinsic for the `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the 
>> factor and start values which are now hard-coded (always 31, and a start 
>> value of either one for `Arrays` or zero for `String`). It seems somewhat 
>> premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all 
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
>> few trivial method calls which increase the call stack depth, so surprises 
>> cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look 
>> like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as 
>> ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see, the `Arrays` intrinsics are faster for small inputs, and 
>> faster on large inputs for `char` and `int` (the types currently vectorized). 
>> I aim to fix the `byte` and `short` cases before integrating, though it 
>> might be acceptable to hand that off as a follow-up enhancement so as not 
>> to further delay integration of this change.
>
> I did a quick write up explaining the approach at 
> https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. Also, you 
> can find details in @richardstartin's [blog 
> post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes)
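To make the quoted description concrete: the centralized method boils down to a parameterized polynomial hash. A rough sketch (not the actual JDK code; `polyHash` is a hypothetical name), shown here for the `byte[]` case:

```java
// Sketch of the shared scalar loop a single @IntrinsicCandidate
// method would implement. `factor` and `start` are the values that
// are currently hard-coded (factor 31; start 1 for Arrays, 0 for String).
static int polyHash(int start, int factor, byte[] a) {
    int h = start;
    for (byte b : a) {
        // h = factor^n * start + sum(a[i] * factor^(n-1-i))
        h = factor * h + b;
    }
    return h;
}
```

With `start == 1` and `factor == 31` this matches `Arrays.hashCode(byte[])`.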

I've restored the 2-stride, dependency-chain-breaking implementation that got 
lost in translation when @luhenry and I took turns on this. This helps keep 
things fast in the 1-31 size range, and allows for a decent speed-up in the 
`byte[]` and `short[]` cases until we can figure out how to vectorize those 
properly.
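For the record, the 2-stride idea can be sketched like this (a simplified scalar model, not the actual JDK code; `hash2Stride` is a hypothetical name). Processing two elements per iteration replaces the single serial `31 * h` chain with products that have no data dependency on each other:

```java
// Two-element stride: h*961 + a[i]*31 + a[i+1] is algebraically the
// same as applying h = 31*h + a[i] twice, but h*961 and a[i]*31 can
// be computed in parallel within an iteration.
static int hash2Stride(byte[] a) {
    int h = 1; // Arrays.hashCode start value
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        h = h * 961 + a[i] * 31 + a[i + 1]; // 961 == 31 * 31
    }
    if (i < a.length) {
        h = 31 * h + a[i]; // odd tail element
    }
    return h;
}
```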

@luhenry baseline:

Benchmark                               (size)  Mode  Cnt     Score   Error  Units
StringHashCode.Algorithm.defaultLatin1       0  avgt    5     0.786 ± 0.005  ns/op
StringHashCode.Algorithm.defaultLatin1       1  avgt    5     1.068 ± 0.005  ns/op
StringHashCode.Algorithm.defaultLatin1       2  avgt    5     2.513 ± 0.017  ns/op
StringHashCode.Algorithm.defaultLatin1      31  avgt    5    22.837 ± 0.082  ns/op
StringHashCode.Algorithm.defaultLatin1      32  avgt    5    16.622 ± 0.107  ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1193.884 ± 1.862  ns/op
StringHashCode.Algorithm.defaultUTF16        0  avgt    5     0.786 ± 0.002  ns/op
StringHashCode.Algorithm.defaultUTF16        1  avgt    5     1.884 ± 0.002  ns/op
StringHashCode.Algorithm.defaultUTF16        2  avgt    5     2.512 ± 0.011  ns/op
StringHashCode.Algorithm.defaultUTF16       31  avgt    5    23.061 ± 0.119  ns/op
StringHashCode.Algorithm.defaultUTF16       32  avgt    5    16.429 ± 0.044  ns/op
StringHashCode.Algorithm.defaultUTF16    10000  avgt    5  1191.283 ± 4.600  ns/op

Patch:

Benchmark                               (size)  Mode  Cnt     Score   Error  Units
StringHashCode.Algorithm.defaultLatin1       0  avgt    5     0.787 ± 0.004  ns/op
StringHashCode.Algorithm.defaultLatin1       1  avgt    5     1.050 ± 0.009  ns/op
StringHashCode.Algorithm.defaultLatin1       2  avgt    5     2.198 ± 0.010  ns/op
StringHashCode.Algorithm.defaultLatin1      31  avgt    5    18.413 ± 0.516  ns/op
StringHashCode.Algorithm.defaultLatin1      32  avgt    5    16.599 ± 0.074  ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1189.958 ± 8.420  ns/op
StringHashCode.Algorithm.defaultUTF16        0  avgt    5     0.785 ± 0.002  ns/op
StringHashCode.Algorithm.defaultUTF16        1  avgt    5     1.885 ± 0.006  ns/op
StringHashCode.Algorithm.defaultUTF16        2  avgt    5     2.219 ± 0.146  ns/op
StringHashCode.Algorithm.defaultUTF16       31  avgt    5    19.052 ± 1.203  ns/op
StringHashCode.Algorithm.defaultUTF16       32  avgt    5    16.558 ± 0.107  ns/op
StringHashCode.Algorithm.defaultUTF16    10000  avgt    5  1188.122 ± 9.394  ns/op


The switches @luhenry added to help the 0 and 1 cases marginally help by 
allowing the compiler to do early returns in these cases, avoiding the jumping 
around that would be necessary in the inlined intrinsic. They also allowed me 
to simplify the previous attempt at a 2-element stride routine, while ensuring 
the routine is correct even if we call it directly without the switch preamble.
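The shape of that preamble, roughly (a sketch, not the actual JDK code; `hashLoop` stands in for the intrinsified routine):

```java
// Size switch in front of the (intrinsifiable) loop: trivial sizes get
// early returns; everything else falls through to the loop, which is
// correct on its own even when called without this preamble.
static int hashCodeWithPreamble(byte[] a) {
    switch (a.length) {
        case 0:  return 1;          // just the start value
        case 1:  return 31 + a[0];  // 31 * 1 + a[0]
        default: return hashLoop(a);
    }
}

static int hashLoop(byte[] a) {
    int h = 1;
    for (byte b : a) {
        h = 31 * h + b;
    }
    return h;
}
```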

I think this is ready for a final review now.

-------------

PR: https://git.openjdk.org/jdk/pull/10847
