On Fri, 4 Mar 2022 16:47:54 GMT, Jamil Nimeh wrote:
> This PR delivers ChaCha20 intrinsics that accelerate the core block function
> that generates key stream from the key, counter and nonce. Intrinsics have
> been written for the following platforms and instruction sets:
>
> - x86_64: AVX, AVX2 and AVX512
> - aarch64: platforms that support the advanced SIMD instructions
>
> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the
> pending Poly1305 intrinsics to be delivered in #10582)
>
> x86_64
> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40   772956.829 ±  4434.965  ops/s
> ChaCha20.decrypt                1024  thrpt   40   230478.075 ±   660.617  ops/s
> ChaCha20.decrypt                4096  thrpt   40    61504.367 ±   187.485  ops/s
> ChaCha20.decrypt               16384  thrpt   40    15671.893 ±    59.860  ops/s
> ChaCha20.encrypt                 256  thrpt   40   793708.698 ±  3587.562  ops/s
> ChaCha20.encrypt                1024  thrpt   40   232413.842 ±   808.766  ops/s
> ChaCha20.encrypt                4096  thrpt   40    61586.483 ±    94.821  ops/s
> ChaCha20.encrypt               16384  thrpt   40    15749.637 ±    34.497  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   219991.514 ±  2117.364  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   101672.568 ±  1921.214  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    32582.073 ±   946.061  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40     8485.793 ±    26.348  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   291605.327 ±  2893.898  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   121034.948 ±  2545.312  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    32657.343 ±   114.322  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40     8527.834 ±    33.711  ops/s
>
> Intrinsics enabled (-XX:UseAVX=1)
> ---------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1293211.662 ±  9833.892  ops/s
> ChaCha20.decrypt                1024  thrpt   40   450135.559 ±  1614.303  ops/s
> ChaCha20.decrypt                4096  thrpt   40   123675.797 ±   576.160  ops/s
> ChaCha20.decrypt               16384  thrpt   40    31707.566 ±    93.988  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt                1024  thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt                4096  thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt               16384  thrpt   40    31788.969 ±    90.770  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   250683.639 ±  3990.340  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   131000.144 ±  2895.410  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    45215.542 ±  1368.148  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    11879.307 ±    55.006  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   355255.774 ±  5397.267  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   156057.380 ±  4294.091  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    47016.845 ±  1618.779  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    12113.919 ±    45.792  ops/s
>
> Intrinsics enabled (-XX:UseAVX=2)
> ---------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1824729.604 ± 12130.198  ops/s
> ChaCha20.decrypt                1024  thrpt   40   746024.477 ±  3921.472  ops/s
> ChaCha20.decrypt                4096  thrpt   40   219662.823 ±  2128.901  ops/s
> ChaCha20.decrypt               16384  thrpt   40    57198.868 ±   221.973  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt                1024  thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt                4096  thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt               16384  thrpt   40    58112.296 ±   498.048  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   260529.149 ±  4298.662  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   144967.984 ±  4558.697  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    50047.575 ±   171.204  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    13976.999 ±    72.299  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   378971.408 ±  9324.721  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   179361.248 ±  7968.109  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    55727.145 ±  2860.765  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    14205.830 ±    59.411  ops/s
>
> Intrinsics enabled (-XX:UseAVX=3)
> ---------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1182958.956 ±  7782.532  ops/s
> ChaCha20.decrypt                1024  thrpt   40  1003530.400 ± 10315.996  ops/s
> ChaCha20.decrypt                4096  thrpt   40   339428.341 ±  2376.804  ops/s
> ChaCha20.decrypt               16384  thrpt   40    92903.498 ±  1112.425  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1266584.736 ±  5101.597  ops/s
> ChaCha20.encrypt                1024  thrpt   40  1059717.173 ±  9435.649  ops/s
> ChaCha20.encrypt                4096  thrpt   40   350520.581 ±  2787.593  ops/s
> ChaCha20.encrypt               16384  thrpt   40    95181.548 ±  1638.579  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   200722.479 ±  2045.896  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   124660.386 ±  3869.517  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    44059.327 ±   143.765  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    12412.936 ±    54.845  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   274528.005 ±  2945.416  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   145146.188 ±   857.254  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    47045.637 ±   128.049  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    12643.929 ±    55.748  ops/s
>
> aarch64
> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
> part : 0xd0c, revision : 1
>
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1301037.920 ±  1734.836  ops/s
> ChaCha20.decrypt                1024  thrpt   40   387115.013 ±  1122.264  ops/s
> ChaCha20.decrypt                4096  thrpt   40   102591.108 ±   229.456  ops/s
> ChaCha20.decrypt               16384  thrpt   40    25878.583 ±    89.351  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1332737.880 ±  2478.508  ops/s
> ChaCha20.encrypt                1024  thrpt   40   390288.663 ±  2361.851  ops/s
> ChaCha20.encrypt                4096  thrpt   40   101882.728 ±   744.907  ops/s
> ChaCha20.encrypt               16384  thrpt   40    26001.888 ±    71.907  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   351189.393 ±  2209.148  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   142960.999 ±   361.619  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    42437.822 ±    85.557  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    11173.152 ±    24.969  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   444870.664 ± 12571.799  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   158481.143 ±  2149.208  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    43610.721 ±   282.795  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    11150.783 ±    27.911  ops/s
>
> Intrinsics enabled
> ------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1907215.648 ±  3163.767  ops/s
> ChaCha20.decrypt                1024  thrpt   40   631804.007 ±   736.430  ops/s
> ChaCha20.decrypt                4096  thrpt   40   172280.991 ±   362.190  ops/s
> ChaCha20.decrypt               16384  thrpt   40    44150.254 ±    98.927  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1990050.859 ±  6380.625  ops/s
> ChaCha20.encrypt                1024  thrpt   40   636574.405 ±  3332.471  ops/s
> ChaCha20.encrypt                4096  thrpt   40   173258.615 ±   327.199  ops/s
> ChaCha20.encrypt               16384  thrpt   40    44191.925 ±    72.996  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   360555.774 ±  1988.467  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   162093.489 ±   413.684  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    50799.888 ±   110.955  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    13560.165 ±    32.208  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   458079.724 ± 13746.235  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   188228.966 ±  3498.480  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    52665.733 ±   151.740  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    13606.192 ±    52.134  ops/s
>
> Special thanks to the folks who have made many helpful comments while this PR
> was in draft form.
src/hotspot/cpu/x86/assembler_x86.cpp line 4994:
> 4992: assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 4993: (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 4994: (vector_len == AVX_512bit ? VM_Version::supports_evex() :
> 0)), "");
VM_Version::supports_evex() here should be VM_Version::supports_avx512bw().
src/hotspot/cpu/x86/assembler_x86.cpp line 4996:
> 4994: (vector_len == AVX_512bit ? VM_Version::supports_evex() :
> 0)), "");
> 4995: NOT_LP64(assert(VM_Version::supports_sse2(), ""));
> 4996: InstructionAttr attributes(vector_len, /* rex_w */ false, /*
> legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
legacy_mode here should be _legacy_mode_bw.
src/hotspot/cpu/x86/assembler_x86.cpp line 5025:
> 5023: assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 5024: (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 5025: (vector_len == AVX_512bit ? VM_Version::supports_evex() :
> 0)), "");
VM_Version::supports_evex() here should be VM_Version::supports_avx512bw().
src/hotspot/cpu/x86/assembler_x86.cpp line 5027:
> 5025: (vector_len == AVX_512bit ? VM_Version::supports_evex() :
> 0)), "");
> 5026: NOT_LP64(assert(VM_Version::supports_sse2(), ""));
> 5027: InstructionAttr attributes(vector_len, /* rex_w */ false, /*
> legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
legacy_mode here should be _legacy_mode_bw.
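To make the four comments above concrete, here is a minimal standalone sketch of the feature check and encoding choice I have in mind. It is a simplified model, not HotSpot code: `CpuFeatures`, `bw_insn_supported` and `legacy_mode_bw` are hypothetical names, and it assumes the instructions in question belong to the AVX512BW subset at 512-bit vector length, so that AVX512F alone (which is what supports_evex() reports, as far as I recall) is not sufficient.

```cpp
#include <cassert>

enum VectorLen { AVX_128bit, AVX_256bit, AVX_512bit };

// Hypothetical stand-in for the CPU feature queries on VM_Version.
struct CpuFeatures {
    bool avx;
    bool avx2;
    bool avx512f;   // roughly what supports_evex() reports
    bool avx512bw;  // what supports_avx512bw() reports
};

// Feature required for a byte/word-granular instruction at each vector length.
// At 512 bits the check is AVX512BW, not merely AVX512F.
inline bool bw_insn_supported(VectorLen len, const CpuFeatures& f) {
    return len == AVX_128bit ? f.avx
         : len == AVX_256bit ? f.avx2
         : f.avx512bw;
}

// Models _legacy_mode_bw: without AVX512BW the instruction has to be emitted
// with the legacy (VEX) encoding rather than EVEX.
inline bool legacy_mode_bw(const CpuFeatures& f) {
    return !f.avx512bw;
}
```

In this model, a CPU with AVX512F but without AVX512BW rejects the 512-bit form and uses the legacy encoding for the 128-bit and 256-bit forms, which is the behaviour the assert and the InstructionAttr argument should agree on.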
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5682:
> 5680: /* Add mask for 4-block ChaCha20 Block calculations */
> 5681: address chacha20_ctradd_avx512() {
> 5682: __ align(CodeEntryAlignment);
This could be __ align64();
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5698:
> 5696: /* Scatter mask for key stream output on AVX-512 */
> 5697: address chacha20_scmask_avx512() {
> 5698: __ align(CodeEntryAlignment);
This could be __ align64();
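For illustration, the padding needed to reach a given alignment can be sketched as below. `padding_for` is a hypothetical helper, not a HotSpot function. The point is only that an address aligned to a smaller boundary is not necessarily 64-byte aligned, which matters if these tables are read with full 512-bit loads (my assumption about the intent here).

```cpp
#include <cassert>
#include <cstdint>

// Number of padding bytes needed so that `addr` becomes a multiple of
// `alignment`. `alignment` must be a power of two.
inline std::uint64_t padding_for(std::uint64_t addr, std::uint64_t alignment) {
    const std::uint64_t mask = alignment - 1;
    return (alignment - (addr & mask)) & mask;
}

// True if `addr` is already a multiple of `alignment` (a power of two).
inline bool is_aligned(std::uint64_t addr, std::uint64_t alignment) {
    return (addr & (alignment - 1)) == 0;
}
```

For example, an address such as 0x1010 is 16-byte aligned but still needs 48 bytes of padding to reach the next 64-byte boundary.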
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5728:
> 5726: const XMMRegister zmm_cVec = xmm2;
> 5727: const XMMRegister zmm_dVec = xmm3;
> 5728: const XMMRegister zmm_scratch = xmm4;
We could add five scratch registers, zmm_s1 .. zmm_s5 (mapped to xmm5 ..
xmm9), to keep values read from memory in registers.
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5738:
> 5736: __ evbroadcasti32x4(zmm_bVec, Address(state, 16),
> Assembler::AVX_512bit);
> 5737: __ evbroadcasti32x4(zmm_cVec, Address(state, 32),
> Assembler::AVX_512bit);
> 5738: __ evbroadcasti32x4(zmm_dVec, Address(state, 48),
> Assembler::AVX_512bit);
zmm_aVec through zmm_dVec could be copied into zmm_s1 through zmm_s4
respectively, thereby eliminating the broadcasts needed later. For example:
__ evmovdquq(zmm_s1, zmm_aVec, Assembler::AVX_512bit);
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5740:
> 5738: __ evbroadcasti32x4(zmm_dVec, Address(state, 48),
> Assembler::AVX_512bit);
> 5739:
> 5740: __ vpaddd(zmm_dVec, zmm_dVec,
> ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()),
> Assembler::AVX_512bit, rax);
The chacha20_counter_addmask_avx512() value could be preloaded into zmm_s5
before line 5735 as follows:
__ evmovdquq(zmm_s5,
ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()),
Assembler::AVX_512bit, rax);
This vpaddd can then use zmm_s5, and the later usage could also use zmm_s5
directly.
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5827:
> 5825: __ evbroadcasti32x4(zmm_scratch, Address(state, 48),
> Assembler::AVX_512bit);
> 5826: __ vpaddd(zmm_dVec, zmm_dVec, zmm_scratch, Assembler::AVX_512bit);
> 5827: __ vpaddd(zmm_dVec, zmm_dVec,
> ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()),
> Assembler::AVX_512bit, rax);
These could directly use the values in registers zmm_s1 to zmm_s5:
__ vpaddd(zmm_aVec, zmm_aVec, zmm_s1, Assembler::AVX_512bit);
...
__ vpaddd(zmm_dVec, zmm_dVec, zmm_s5, Assembler::AVX_512bit);
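For reference, the step being optimized here is the final "add the initial state back" of the block function. Below is a minimal scalar sketch that follows RFC 8439 rather than the stub code, where `initial` plays the role of the values kept in zmm_s1 .. zmm_s4. The helper names are mine and purely illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using State = std::array<std::uint32_t, 16>;

inline std::uint32_t rotl32(std::uint32_t v, int n) {
    return (v << n) | (v >> (32 - n));
}

// ChaCha20 quarter round on four words of the state (RFC 8439, section 2.1).
inline void quarter_round(State& s, int a, int b, int c, int d) {
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 7);
}

// One ChaCha20 block: 10 double rounds on a working copy, then the saved
// initial state is added back word by word.
inline State chacha20_block(const State& initial) {
    State x = initial;  // working copy; `initial` stays untouched
    for (int i = 0; i < 10; ++i) {
        // Column rounds
        quarter_round(x, 0, 4, 8, 12);
        quarter_round(x, 1, 5, 9, 13);
        quarter_round(x, 2, 6, 10, 14);
        quarter_round(x, 3, 7, 11, 15);
        // Diagonal rounds
        quarter_round(x, 0, 5, 10, 15);
        quarter_round(x, 1, 6, 11, 12);
        quarter_round(x, 2, 7, 8, 13);
        quarter_round(x, 3, 4, 9, 14);
    }
    // Final add: needs the initial state again, so keeping it around
    // avoids reloading it from memory.
    for (int i = 0; i < 16; ++i) {
        x[i] += initial[i];
    }
    return x;
}

// Initial state from the block function test vector in RFC 8439, section 2.3.2:
// constants, key 00..1f, block counter 1, nonce 00:00:00:09:00:00:00:4a:00:00:00:00.
inline State rfc8439_test_state() {
    State s = {0x61707865u, 0x3320646eu, 0x79622d32u, 0x6b206574u,
               0x03020100u, 0x07060504u, 0x0b0a0908u, 0x0f0e0d0cu,
               0x13121110u, 0x17161514u, 0x1b1a1918u, 0x1f1e1d1cu,
               0x00000001u, 0x09000000u, 0x4a000000u, 0x00000000u};
    return s;
}
```

The last loop is the scalar counterpart of the trailing vpaddd sequence: since the initial state is needed again after the rounds, holding it in spare registers saves the second set of broadcasts.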
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5842:
> 5840: __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 32),
> writeMask, zmm_cVec, Assembler::AVX_512bit);
> 5841: __ knotwl(writeMask, writeMask);
> 5842: __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 48),
> writeMask, zmm_dVec, Assembler::AVX_512bit);
Using vextracti32x4 instead of evpscatterdd would give better performance:
__ vextracti32x4(Address(result, 0), zmm_aVec, 0);
__ vextracti32x4(Address(result, 64), zmm_aVec, 1);
__ vextracti32x4(Address(result, 128), zmm_aVec, 2);
__ vextracti32x4(Address(result, 192), zmm_aVec, 3);
__ vextracti32x4(Address(result, 16), zmm_bVec, 0);
__ vextracti32x4(Address(result, 80), zmm_bVec, 1);
__ vextracti32x4(Address(result, 144), zmm_bVec, 2);
__ vextracti32x4(Address(result, 208), zmm_bVec, 3);
__ vextracti32x4(Address(result, 32), zmm_cVec, 0);
__ vextracti32x4(Address(result, 96), zmm_cVec, 1);
__ vextracti32x4(Address(result, 160), zmm_cVec, 2);
__ vextracti32x4(Address(result, 224), zmm_cVec, 3);
__ vextracti32x4(Address(result, 48), zmm_dVec, 0);
__ vextracti32x4(Address(result, 112), zmm_dVec, 1);
__ vextracti32x4(Address(result, 176), zmm_dVec, 2);
__ vextracti32x4(Address(result, 240), zmm_dVec, 3);
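The offsets above follow a simple pattern: each 128-bit lane holds one row of one block, so the store offset is 64 * lane + 16 * row. Here is a small sketch that checks the 16 stores tile the 256-byte output exactly once (`store_offset` and `offsets_tile_output` are hypothetical helpers for illustration only):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Byte offset in the key stream buffer for 128-bit lane `lane` (block 0..3)
// of state row `row` (a = 0, b = 1, c = 2, d = 3).
inline std::size_t store_offset(std::size_t row, std::size_t lane) {
    return 64 * lane + 16 * row;
}

// Returns true if the 16 (row, lane) stores cover every 16-byte slot of a
// 256-byte buffer exactly once.
inline bool offsets_tile_output() {
    std::array<int, 16> hits{};
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            const std::size_t off = store_offset(row, lane);
            if (off % 16 != 0 || off >= 256) {
                return false;
            }
            ++hits[off / 16];
        }
    }
    for (int h : hits) {
        if (h != 1) {
            return false;
        }
    }
    return true;
}
```

This gives the same memory layout as the four masked scatters: block i of the key stream lands at result + 64 * i, with rows a to d in order.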
-------------
PR: https://git.openjdk.org/jdk/pull/7702