Re: RFR: 8247645: ChaCha20 intrinsics

Sandhya Viswanathan Sun, 06 Nov 2022 23:40:40 -0800

On Fri, 4 Mar 2022 16:47:54 GMT, Jamil Nimeh wrote:


> This PR delivers ChaCha20 intrinsics that accelerate the core block function 
> that generates key stream from the key, counter and nonce.  Intrinsics have 
> been written for the following platforms and instruction sets:
> 
> - x86_64: AVX, AVX2 and AVX512
> - aarch64: platforms that support the advanced SIMD instructions
> 
> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the 
> pending Poly1305 intrinsics to be delivered in #10582)
> 
> x86_64
> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
> 
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                  (dataSize)     Mode  Cnt       Score      Error  Units
> ChaCha20.decrypt                  256    thrpt   40  772956.829 ± 4434.965  ops/s
> ChaCha20.decrypt                 1024    thrpt   40  230478.075 ±  660.617  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   61504.367 ±  187.485  ops/s
> ChaCha20.decrypt                16384    thrpt   40   15671.893 ±   59.860  ops/s
> ChaCha20.encrypt                  256    thrpt   40  793708.698 ± 3587.562  ops/s
> ChaCha20.encrypt                 1024    thrpt   40  232413.842 ±  808.766  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   61586.483 ±   94.821  ops/s
> ChaCha20.encrypt                16384    thrpt   40   15749.637 ±   34.497  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40  219991.514 ± 2117.364  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40  101672.568 ± 1921.214  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40   32582.073 ±  946.061  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    8485.793 ±   26.348  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40  291605.327 ± 2893.898  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40  121034.948 ± 2545.312  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40   32657.343 ±  114.322  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    8527.834 ±   33.711  ops/s
> 
> Intrinsics enabled (-XX:UseAVX=1)
> ---------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1293211.662 ±  9833.892  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   450135.559 ±  1614.303  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   123675.797 ±   576.160  ops/s
> ChaCha20.decrypt                16384    thrpt   40    31707.566 ±    93.988  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   250683.639 ±  3990.340  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   131000.144 ±  2895.410  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    45215.542 ±  1368.148  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    11879.307 ±    55.006  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   355255.774 ±  5397.267  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   156057.380 ±  4294.091  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    47016.845 ±  1618.779  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    12113.919 ±    45.792  ops/s
> 
> Intrinsics enabled (-XX:UseAVX=2)
> ---------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1824729.604 ± 12130.198  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   746024.477 ±  3921.472  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   219662.823 ±  2128.901  ops/s
> ChaCha20.decrypt                16384    thrpt   40    57198.868 ±   221.973  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   260529.149 ±  4298.662  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   144967.984 ±  4558.697  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    50047.575 ±   171.204  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    13976.999 ±    72.299  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   378971.408 ±  9324.721  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   179361.248 ±  7968.109  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    55727.145 ±  2860.765  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    14205.830 ±    59.411  ops/s
> 
> Intrinsics enabled (-XX:UseAVX=3)
> ---------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1182958.956 ±  7782.532  ops/s
> ChaCha20.decrypt                 1024    thrpt   40  1003530.400 ± 10315.996  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   339428.341 ±  2376.804  ops/s
> ChaCha20.decrypt                16384    thrpt   40    92903.498 ±  1112.425  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1266584.736 ±  5101.597  ops/s
> ChaCha20.encrypt                 1024    thrpt   40  1059717.173 ±  9435.649  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   350520.581 ±  2787.593  ops/s
> ChaCha20.encrypt                16384    thrpt   40    95181.548 ±  1638.579  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   200722.479 ±  2045.896  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   124660.386 ±  3869.517  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    44059.327 ±   143.765  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    12412.936 ±    54.845  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   274528.005 ±  2945.416  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   145146.188 ±   857.254  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    47045.637 ±   128.049  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    12643.929 ±    55.748  ops/s
> 
> aarch64
> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>   part : 0xd0c, revision : 1
> 
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1301037.920 ±  1734.836  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   387115.013 ±  1122.264  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   102591.108 ±   229.456  ops/s
> ChaCha20.decrypt                16384    thrpt   40    25878.583 ±    89.351  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1332737.880 ±  2478.508  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   390288.663 ±  2361.851  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   101882.728 ±   744.907  ops/s
> ChaCha20.encrypt                16384    thrpt   40    26001.888 ±    71.907  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   351189.393 ±  2209.148  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   142960.999 ±   361.619  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    42437.822 ±    85.557  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    11173.152 ±    24.969  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   444870.664 ± 12571.799  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   158481.143 ±  2149.208  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    43610.721 ±   282.795  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    11150.783 ±    27.911  ops/s
> 
> Intrinsics enabled
> ------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1907215.648 ±  3163.767  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   631804.007 ±   736.430  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   172280.991 ±   362.190  ops/s
> ChaCha20.decrypt                16384    thrpt   40    44150.254 ±    98.927  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1990050.859 ±  6380.625  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   636574.405 ±  3332.471  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   173258.615 ±   327.199  ops/s
> ChaCha20.encrypt                16384    thrpt   40    44191.925 ±    72.996  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   360555.774 ±  1988.467  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   162093.489 ±   413.684  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    50799.888 ±   110.955  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    13560.165 ±    32.208  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   458079.724 ± 13746.235  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   188228.966 ±  3498.480  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    52665.733 ±   151.740  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    13606.192 ±    52.134  ops/s
> 
> Special thanks to the folks who have made many helpful comments while this PR 
> was in draft form.

src/hotspot/cpu/x86/assembler_x86.cpp line 4994:

> 4992:     assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 4993:             (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 4994:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");

VM_Version::supports_evex() here should be VM_Version::supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 4996:

> 4994:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");
> 4995:     NOT_LP64(assert(VM_Version::supports_sse2(), ""));
> 4996:     InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);

legacy_mode here should be _legacy_mode_bw.

src/hotspot/cpu/x86/assembler_x86.cpp line 5025:

> 5023:     assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 5024:             (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 5025:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");

VM_Version::supports_evex() here should be VM_Version::supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 5027:

> 5025:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");
> 5026:     NOT_LP64(assert(VM_Version::supports_sse2(), ""));
> 5027:     InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);

legacy_mode here should be _legacy_mode_bw.
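
The comments on these two sites ask for the same pair of changes: gate the 512-bit form on AVX512BW rather than on EVEX support in general, and pass _legacy_mode_bw instead of a hard-coded false. The standalone model below sketches only that gating logic. CpuFeatures, vector_len_supported and legacy_mode_bw are hypothetical stand-ins, not HotSpot APIs, and the assumption that _legacy_mode_bw is simply "AVX512BW is absent" is mine.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are not HotSpot types.
enum VectorLen { AVX_128bit, AVX_256bit, AVX_512bit };

struct CpuFeatures {
    bool avx;
    bool avx2;
    bool avx512f;   // assumed to correspond to what supports_evex() reports
    bool avx512bw;  // the byte/word extension the review asks for
};

// Gating as suggested in the review: the 512-bit form requires AVX512BW,
// so a CPU with AVX512F alone must not pass the check.
inline bool vector_len_supported(VectorLen len, const CpuFeatures& f) {
    switch (len) {
        case AVX_128bit: return f.avx;
        case AVX_256bit: return f.avx2;
        case AVX_512bit: return f.avx512bw;
    }
    return false;
}

// Assumption: models _legacy_mode_bw as "AVX512BW is not available".
inline bool legacy_mode_bw(const CpuFeatures& f) {
    return !f.avx512bw;
}
```

With this model, a CPU that reports AVX512F but not AVX512BW fails the 512-bit check and reports legacy mode, while still passing the 128-bit and 256-bit checks.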

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5682:

> 5680:   /* Add mask for 4-block ChaCha20 Block calculations */
> 5681:   address chacha20_ctradd_avx512() {
> 5682:     __ align(CodeEntryAlignment);

This could be __ align64();

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5698:

> 5696:   /* Scatter mask for key stream output on AVX-512 */
> 5697:   address chacha20_scmask_avx512() {
> 5698:     __ align(CodeEntryAlignment);

This could be __ align64();
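
As a rough illustration of why a fixed 64-byte alignment suits these 64-byte constants: data the size of a zmm register that starts on a 64-byte boundary cannot straddle two 64-byte blocks, whereas a smaller alignment gives no such guarantee. The helper names are hypothetical, and the 16-byte value in the usage note is only a comparison point, not a statement about what CodeEntryAlignment is on any platform.

```cpp
#include <cassert>
#include <cstdint>

// Round addr up to the next multiple of alignment.
// alignment must be a power of two.
constexpr std::uint64_t align_up(std::uint64_t addr, std::uint64_t alignment) {
    return (addr + alignment - 1) & ~(alignment - 1);
}

// True if the byte range [addr, addr + size) crosses a 64-byte boundary.
constexpr bool crosses_64_byte_boundary(std::uint64_t addr, std::uint64_t size) {
    return (addr / 64) != ((addr + size - 1) / 64);
}
```

For example, starting from address 0x1001, aligning to 64 places a 64-byte constant at 0x1040, entirely within one 64-byte block; aligning to 16 places it at 0x1010, where it spans two.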

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5728:

> 5726:     const XMMRegister zmm_cVec = xmm2;
> 5727:     const XMMRegister zmm_dVec = xmm3;
> 5728:     const XMMRegister zmm_scratch = xmm4;

We could have 5 additional scratch registers, zmm_s1 .. zmm_s5 (mapping to xmm5 
.. xmm9), to keep values read from memory in registers.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5738:

> 5736:     __ evbroadcasti32x4(zmm_bVec, Address(state, 16), Assembler::AVX_512bit);
> 5737:     __ evbroadcasti32x4(zmm_cVec, Address(state, 32), Assembler::AVX_512bit);
> 5738:     __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);

zmm_aVec to zmm_dVec could be copied into zmm_s1 to zmm_s4 respectively, 
thereby eliminating the broadcasts needed later. For example:
    __ evmovdquq(zmm_s1, zmm_aVec, Assembler::AVX_512bit);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5740:

> 5738:     __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
> 5739: 
> 5740:     __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);

The chacha20_counter_addmask_avx512() value could be preloaded into zmm_s5 
before line 5735 as follows:
    __ evmovdquq(zmm_s5, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
This vpaddd can then use zmm_s5, and the later use of the mask can use zmm_s5 
directly as well.
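
To make the data flow concrete, here is a scalar model of what the broadcast followed by the counter add produces. It assumes the add mask contributes 0, 1, 2 and 3 to the first dword of successive 128-bit lanes, which is what a four-block counter increment would need; the actual mask contents are not shown in this thread. The function names mirror the instructions only for readability and are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Zmm = std::array<std::uint32_t, 16>;  // 16 dwords = four 128-bit lanes

// Model of evbroadcasti32x4: copy one 128-bit row into all four lanes.
inline Zmm broadcast_row(const std::uint32_t row[4]) {
    Zmm z{};
    for (int lane = 0; lane < 4; ++lane)
        for (int i = 0; i < 4; ++i)
            z[lane * 4 + i] = row[i];
    return z;
}

// Assumed add mask: the lane index in the counter dword, zero elsewhere.
inline Zmm counter_add_mask() {
    Zmm m{};
    for (std::uint32_t lane = 0; lane < 4; ++lane)
        m[lane * 4] = lane;
    return m;
}

// Model of vpaddd: element-wise 32-bit addition with wraparound.
inline Zmm vpaddd(const Zmm& a, const Zmm& b) {
    Zmm r{};
    for (int i = 0; i < 16; ++i)
        r[i] = a[i] + b[i];
    return r;
}
```

Under that assumption, each lane ends up with the same nonce words and a counter that is one higher than the lane before it, so the mask is a loop-invariant constant and a natural candidate for a register.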

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5827:

> 5825:     __ evbroadcasti32x4(zmm_scratch, Address(state, 48), Assembler::AVX_512bit);
> 5826:     __ vpaddd(zmm_dVec, zmm_dVec, zmm_scratch, Assembler::AVX_512bit);
> 5827:     __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);

These could directly use the values held in registers zmm_s1 to zmm_s5:
    __ vpaddd(zmm_aVec, zmm_aVec, zmm_s1, Assembler::AVX_512bit);
    ...
    __ vpaddd(zmm_dVec, zmm_dVec, zmm_s5, Assembler::AVX_512bit);
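
For context, these additions are the last step of the ChaCha20 block function, where the initial state is added back to the working state after the rounds. The scalar sketch below follows RFC 8439 and shows that this step needs nothing more than a saved copy of the initial state. It is not the stub's code, and the function names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using State = std::array<std::uint32_t, 16>;

inline std::uint32_t rotl32(std::uint32_t v, int n) {
    return (v << n) | (v >> (32 - n));
}

// ChaCha20 quarter round (RFC 8439, section 2.1).
inline void quarter_round(State& s, int a, int b, int c, int d) {
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 7);
}

// One block: 10 double rounds, then add the saved initial state.
inline State chacha20_block(const State& initial) {
    State x = initial;  // working copy; 'initial' plays the role of the saved registers
    for (int i = 0; i < 10; ++i) {
        // Column rounds.
        quarter_round(x, 0, 4, 8, 12);
        quarter_round(x, 1, 5, 9, 13);
        quarter_round(x, 2, 6, 10, 14);
        quarter_round(x, 3, 7, 11, 15);
        // Diagonal rounds.
        quarter_round(x, 0, 5, 10, 15);
        quarter_round(x, 1, 6, 11, 12);
        quarter_round(x, 2, 7, 8, 13);
        quarter_round(x, 3, 4, 9, 14);
    }
    // Final addition of the initial state, read from the saved copy.
    for (int i = 0; i < 16; ++i)
        x[i] += initial[i];
    return x;
}
```

In the vectorized stub the saved copy would be the registers proposed above, so the final additions become register-to-register operations with no further loads from the state array.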

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5842:

> 5840:     __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 32), writeMask, zmm_cVec, Assembler::AVX_512bit);
> 5841:     __ knotwl(writeMask, writeMask);
> 5842:     __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 48), writeMask, zmm_dVec, Assembler::AVX_512bit);

Using vextracti32x4 instead of evpscatterdd would give better performance:
    __ vextracti32x4(Address(result, 0), zmm_aVec, 0);
    __ vextracti32x4(Address(result, 64), zmm_aVec, 1);
    __ vextracti32x4(Address(result, 128), zmm_aVec, 2);
    __ vextracti32x4(Address(result, 192), zmm_aVec, 3);
    __ vextracti32x4(Address(result, 16), zmm_bVec, 0);
    __ vextracti32x4(Address(result, 80), zmm_bVec, 1);
    __ vextracti32x4(Address(result, 144), zmm_bVec, 2);
    __ vextracti32x4(Address(result, 208), zmm_bVec, 3);
    __ vextracti32x4(Address(result, 32), zmm_cVec, 0);
    __ vextracti32x4(Address(result, 96), zmm_cVec, 1);
    __ vextracti32x4(Address(result, 160), zmm_cVec, 2);
    __ vextracti32x4(Address(result, 224), zmm_cVec, 3);
    __ vextracti32x4(Address(result, 48), zmm_dVec, 0);
    __ vextracti32x4(Address(result, 112), zmm_dVec, 1);
    __ vextracti32x4(Address(result, 176), zmm_dVec, 2);
    __ vextracti32x4(Address(result, 240), zmm_dVec, 3);
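
The sixteen offsets above follow a simple rule: 128-bit lane k of each register belongs to block k, and rows a, b, c and d occupy bytes 0, 16, 32 and 48 of that block's 64-byte output. The small model below regenerates the offsets from that rule; store_offset is a hypothetical helper used only to check the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset in the result buffer for 128-bit lane `lane` (block index 0..3)
// of the register holding state row `row` (0 = a, 1 = b, 2 = c, 3 = d).
constexpr std::size_t store_offset(int row, int lane) {
    return static_cast<std::size_t>(lane) * 64 + static_cast<std::size_t>(row) * 16;
}
```

For instance, lane 2 of zmm_bVec (row 1) lands at 2 * 64 + 1 * 16 = 144, matching the list.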

-------------

PR: https://git.openjdk.org/jdk/pull/7702
