On Wed, 24 May 2023 09:25:06 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> This provides a solid speedup of about 3-4x over the Java implementation.
>> 
>> I have a vectorized version of this which uses a bunch of tricks to speed it 
>> up, but it's complex and can still be improved. We're getting close to ramp 
>> down, so I'm submitting this simple intrinsic so that we can get it reviewed 
>> in time.
>> 
>> Benchmarks:
>> 
>> 
>> ThunderX (2, I think):
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  14078352.014 ± 4201407.966  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   5154958.794 ± 1717146.980  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   1416563.273 ± 1311809.454  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3     94059.570 ±    2913.021  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      1441.024 ±     164.443  ops/s
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  4516486.795 ± 419624.224  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3  1228542.774 ± 202815.694  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   316051.912 ±  23066.449  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3    20649.561 ±   1094.687  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      310.564 ±     31.053  ops/s
>> 
>> Apple M1:
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score        Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  33551968.946 ± 849843.905  ops/s
>> Poly1305DigestBench.updateBytes         256              thrpt    3   9911637.214 ±  63417.224  ops/s
>> Poly1305DigestBench.updateBytes        1024              thrpt    3   2604370.740 ±  29208.265  ops/s
>> Poly1305DigestBench.updateBytes       16384              thrpt    3    165183.633 ±   1975.998  ops/s
>> Poly1305DigestBench.updateBytes     1048576              thrpt    3      2587.132 ±     40.240  ops/s
>> 
>> Benchmark                        (dataSize)  (provider)   Mode  Cnt         Score        Error  Units
>> Poly1305DigestBench.updateBytes          64              thrpt    3  12373649.589 ± 184757.721  ops/s
>> Poly1305DigestBench.upd...
>
> Andrew Haley has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Whitespace

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7097:

> 7095:       // together partial products without any risk of needing to
> 7096:       // propagate a carry out.
> 7097:       wide_mul(U_0, U_0HI, S_0, R_0);  wide_madd(U_0, U_0HI, S_1, RR_1); wide_madd(U_0, U_0HI, S_2, RR_0);

What does `r` correspond to here? The comment asserts that 'the top four bits
of each 32-bit subword of "r" are zero'. If `r` is `R_0...R_2`, that seems
broken: we pack 26-bit values into `R_0...R_2` above, which looks like it
would break this invariant?
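
To make the question concrete, here is a minimal sketch of the invariant in plain C++. It assumes `r` is the RFC 8439 clamped key and that "packing" means reassembling five 26-bit limbs into two 64-bit words. All names (`Key128`, `Limbs26`, `clamp_r`, `to_limbs`, `from_limbs`, `top_nibbles_clear`) are hypothetical and not taken from the stub. If the packing above does something else (for example, leaves raw 26-bit limbs side by side in a register), this sketch does not apply and the invariant would need a separate argument.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative names only; none come from stubGenerator_aarch64.cpp.
struct Key128  { uint64_t lo, hi; };   // r as two 64-bit words
struct Limbs26 { uint64_t l[5]; };     // r as five 26-bit limbs, least significant first

constexpr uint64_t MASK26 = (1ULL << 26) - 1;

// RFC 8439 clamp: r &= 0x0ffffffc0ffffffc0ffffffc0fffffff.
inline Key128 clamp_r(Key128 r) {
  return { r.lo & 0x0ffffffc0fffffffULL, r.hi & 0x0ffffffc0ffffffcULL };
}

// Split a 128-bit value into five 26-bit limbs.
inline Limbs26 to_limbs(Key128 r) {
  Limbs26 out;
  out.l[0] = r.lo & MASK26;
  out.l[1] = (r.lo >> 26) & MASK26;
  out.l[2] = ((r.lo >> 52) | (r.hi << 12)) & MASK26;  // straddles both words
  out.l[3] = (r.hi >> 14) & MASK26;
  out.l[4] = r.hi >> 40;                              // only 24 bits remain
  return out;
}

// Reassemble the limbs into two 64-bit words.
// Assumes the value fits in 128 bits (l[4] < 2^24), which holds for a 128-bit r.
inline Key128 from_limbs(const Limbs26& in) {
  Key128 r;
  r.lo = in.l[0] | (in.l[1] << 26) | (in.l[2] << 52);
  r.hi = (in.l[2] >> 12) | (in.l[3] << 14) | (in.l[4] << 40);
  return r;
}

// True if the top four bits of both 32-bit subwords of w are zero.
inline bool top_nibbles_clear(uint64_t w) {
  return (w & 0xf0000000f0000000ULL) == 0;
}
```

Under those assumptions the limb split is lossless, so `from_limbs(to_limbs(r))` gives back the clamped `r` and `top_nibbles_clear` holds for both words. That would only answer the question if `R_0...R_2` really hold the reassembled words.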

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1203838423
