On Fri, 2 Sep 2022 09:32:56 GMT, Andrew Haley <[email protected]> wrote:
>> This PR delivers ChaCha20 intrinsics that accelerate the core block function
>> that generates key stream from the key, counter and nonce. Intrinsics have
>> been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the
>> pending Poly1305 intrinsics to be delivered in #10582)
>>
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40   772956.829 ±  4434.965  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   230478.075 ±   660.617  ops/s
>> ChaCha20.decrypt                4096  thrpt   40    61504.367 ±   187.485  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    15671.893 ±    59.860  ops/s
>> ChaCha20.encrypt                 256  thrpt   40   793708.698 ±  3587.562  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   232413.842 ±   808.766  ops/s
>> ChaCha20.encrypt                4096  thrpt   40    61586.483 ±    94.821  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    15749.637 ±    34.497  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   219991.514 ±  2117.364  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   101672.568 ±  1921.214  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    32582.073 ±   946.061  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40     8485.793 ±    26.348  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   291605.327 ±  2893.898  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   121034.948 ±  2545.312  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    32657.343 ±   114.322  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40     8527.834 ±    33.711  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1293211.662 ±  9833.892  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   450135.559 ±  1614.303  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   123675.797 ±   576.160  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    31707.566 ±    93.988  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1338667.215 ± 12012.240  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   453682.363 ±  2559.322  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   124785.645 ±   394.535  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    31788.969 ±    90.770  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   250683.639 ±  3990.340  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   131000.144 ±  2895.410  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    45215.542 ±  1368.148  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11879.307 ±    55.006  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   355255.774 ±  5397.267  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   156057.380 ±  4294.091  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47016.845 ±  1618.779  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12113.919 ±    45.792  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1824729.604 ± 12130.198  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   746024.477 ±  3921.472  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   219662.823 ±  2128.901  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    57198.868 ±   221.973  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1893810.127 ± 21870.718  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   758024.511 ±  5414.552  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   224032.805 ±   935.309  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    58112.296 ±   498.048  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   260529.149 ±  4298.662  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   144967.984 ±  4558.697  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50047.575 ±   171.204  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13976.999 ±    72.299  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   378971.408 ±  9324.721  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   179361.248 ±  7968.109  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    55727.145 ±  2860.765  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    14205.830 ±    59.411  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1182958.956 ±  7782.532  ops/s
>> ChaCha20.decrypt                1024  thrpt   40  1003530.400 ± 10315.996  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   339428.341 ±  2376.804  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    92903.498 ±  1112.425  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1266584.736 ±  5101.597  ops/s
>> ChaCha20.encrypt                1024  thrpt   40  1059717.173 ±  9435.649  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   350520.581 ±  2787.593  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    95181.548 ±  1638.579  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   200722.479 ±  2045.896  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   124660.386 ±  3869.517  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    44059.327 ±   143.765  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    12412.936 ±    54.845  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   274528.005 ±  2945.416  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   145146.188 ±   857.254  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47045.637 ±   128.049  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12643.929 ±    55.748  ops/s
>>
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>> part : 0xd0c, revision : 1
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1301037.920 ±  1734.836  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   387115.013 ±  1122.264  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   102591.108 ±   229.456  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    25878.583 ±    89.351  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1332737.880 ±  2478.508  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   390288.663 ±  2361.851  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   101882.728 ±   744.907  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    26001.888 ±    71.907  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   351189.393 ±  2209.148  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   142960.999 ±   361.619  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    42437.822 ±    85.557  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11173.152 ±    24.969  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   444870.664 ± 12571.799  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   158481.143 ±  2149.208  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    43610.721 ±   282.795  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    11150.783 ±    27.911  ops/s
>>
>> Intrinsics enabled
>> ------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1907215.648 ±  3163.767  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   631804.007 ±   736.430  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   172280.991 ±   362.190  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    44150.254 ±    98.927  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1990050.859 ±  6380.625  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   636574.405 ±  3332.471  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   173258.615 ±   327.199  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    44191.925 ±    72.996  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   360555.774 ±  1988.467  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   162093.489 ±   413.684  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50799.888 ±   110.955  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13560.165 ±    32.208  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   458079.724 ± 13746.235  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   188228.966 ±  3498.480  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    52665.733 ±   151.740  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    13606.192 ±    52.134  ops/s
>>
>> Special thanks to the folks who have made many helpful comments while this
>> PR was in draft form.
>
> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2521:
>
>> 2519: #undef INSN3
>> 2520: #undef INSN4
>> 2521:
>
> This code to handle the AdvSIMD load/store single structure and AdvSIMD
> load/store single structure (post-indexed) is excessive.
>
> Every one of these instructions has the format
>
> `0|Q|0011010|L|R|00000|opcode|S|size|Rn|Rt`
>
> or
>
> `0|Q|0011011|L|R| Rm|opcode|S|size|Rn|Rt`
>
> Perhaps consider using a `RegSet regs` for the registers. Then the
> instruction encoding to use (1, 2, 3, or 4 consecutive registers) can be picked
> up from `regs.size()`. There only needs to be a single routine for all of the
> `ld_st` variants.

Thanks for the suggestion. I will look into this. I can see how `regs.size()`
could simplify these macros.
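
To check my understanding of the two layouts quoted above, here is a rough,
standalone sketch of a single encoder driven by the register count. It is only
an illustration: `LdStSingle` and `encodeLdStSingle` are hypothetical names,
not existing HotSpot code, and the mapping from the register count to the `R`
bit and the low `opcode` bit is my reading of the encoding, which would need
to be checked before it goes anywhere near the assembler.

```cpp
#include <cstdint>

// Hypothetical field bundle for one AdvSIMD load/store single structure
// instruction. In HotSpot, nregs would presumably come from regs.size().
struct LdStSingle {
    unsigned Q;         // bit 30
    bool     load;      // L, bit 22: true = load, false = store
    unsigned nregs;     // 1..4 consecutive registers
    unsigned opcodeHi;  // opcode<2:1>, bits 15:14
    unsigned S;         // bit 12
    unsigned size;      // bits 11:10
    unsigned Rn;        // base register, bits 9:5
    unsigned Rt;        // first transfer register, bits 4:0
};

// Packs either
//   0|Q|0011010|L|R|00000|opcode|S|size|Rn|Rt   (no offset)
//   0|Q|0011011|L|R|   Rm|opcode|S|size|Rn|Rt   (post-indexed)
inline uint32_t encodeLdStSingle(const LdStSingle& f,
                                 bool postIndexed = false,
                                 unsigned Rm = 0) {
    // Assumption: the register count n is encoded as (opcode<0>:R) = n - 1.
    const unsigned sel    = (f.nregs - 1) & 3u;
    const unsigned R      = sel & 1u;
    const unsigned opcode = ((f.opcodeHi & 3u) << 1) | (sel >> 1);
    const uint32_t fixed  = postIndexed ? 0x1Bu : 0x1Au;  // 0011011 / 0011010

    return ((f.Q & 1u) << 30)
         | (fixed << 23)
         | ((f.load ? 1u : 0u) << 22)
         | (R << 21)
         | ((postIndexed ? (Rm & 31u) : 0u) << 16)
         | (opcode << 13)
         | ((f.S & 1u) << 12)
         | ((f.size & 3u) << 10)
         | ((f.Rn & 31u) << 5)
         | (f.Rt & 31u);
}
```

With something along these lines, the one-, two-, three- and four-register
forms differ only in `nregs`, so a single routine could cover every `ld_st`
variant instead of one macro expansion per form.
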
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4068:
>
>> 4066: __ ext(c, __ T16B, c, c, cCnt); \
>> 4067: __ ext(d, __ T16B, d, d, dCnt); \
>> 4068:
>
> There's a fairly extensive use of macros here for the rounds, but I don't
> think there's any need for them to be macros. `SHIFT_LANES` and all the other
> macros here should be functions. This would reduce the size of the libjvm.so
> binary.

Thanks for the feedback. I've been wondering whether I might need something
like a macroAssembler_<arch>_chapoly.cpp file to hold these kinds of helpers,
along with future functions for Poly1305 when I start on that. I wasn't aware
of the impact on libjvm.so of taking the macro approach versus functions. I'll
pull these out into functions.
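
For illustration, this is roughly the shape I have in mind, shown against a
mock rather than the real assembler. `MockAssembler` and `shiftLanes` are
hypothetical names, and the sketch covers only the two `ext` calls quoted
above, not the whole macro.

```cpp
#include <string>
#include <vector>

// Stand-in for the real assembler: it records what would be emitted instead
// of generating machine code. Registers are plain integers here.
struct MockAssembler {
    std::vector<std::string> emitted;

    void ext(int dst, int src1, int src2, int byteIndex) {
        emitted.push_back("ext v" + std::to_string(dst) + ".16b, v" +
                          std::to_string(src1) + ".16b, v" +
                          std::to_string(src2) + ".16b, #" +
                          std::to_string(byteIndex));
    }
};

// Function form of the lane shift: the body is compiled once and called from
// each round, rather than being expanded textually at every use site.
inline void shiftLanes(MockAssembler& masm, int c, int d, int cCnt, int dCnt) {
    masm.ext(c, c, c, cCnt);
    masm.ext(d, d, d, dCnt);
}
```

Each round would then call `shiftLanes(masm, c, d, cCnt, dCnt)` where the
macro is invoked today.
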
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4141:
>
>> 4139: // rotation tbl instruction.
>> 4140: __ lea(tmpAddr, ExternalAddress(
>> 4141: StubRoutines::aarch64::chacha20_constdata()));
>
> Better to move `cc20_gen_constdata()` to the start of `cc20_gen_constdata()`,
> mark it with a `Label`, and use `adr(tmpAddr, LABEL);` .

I think I see what you're saying from looking at `generate_sha1_implCompress()`
and how it uses `adr`. I also see what looks like a similar approach in some
functions in the same file, where the constant is defined via a `static
const uint64_t[] foo = { ... };`, its address is loaded via `lea(reg,
ExternalAddress((address) foo))`, and the code proceeds from there (see
`generate_sha3_implCompress()`). To my eye that looks a bit more
straightforward, and it seems to be used more often in the file than the `adr`
approach for defining constants. What I don't know is whether one approach is
better than the other for other reasons, such as performance or memory
consumption. Do you have a preference one way or the other?

-------------
PR: https://git.openjdk.org/jdk/pull/7702