On Thu, 18 Aug 2022 14:26:47 GMT, Dmitry Chuyko <dchu...@openjdk.org> wrote:
>> This PR delivers ChaCha20 intrinsics that accelerate the core block function
>> that generates key stream from the key, counter and nonce. Intrinsics have
>> been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the
>> pending Poly1305 intrinsics to be delivered in #10582)
>>
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40   772956.829 ±  4434.965  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   230478.075 ±   660.617  ops/s
>> ChaCha20.decrypt                4096  thrpt   40    61504.367 ±   187.485  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    15671.893 ±    59.860  ops/s
>> ChaCha20.encrypt                 256  thrpt   40   793708.698 ±  3587.562  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   232413.842 ±   808.766  ops/s
>> ChaCha20.encrypt                4096  thrpt   40    61586.483 ±    94.821  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    15749.637 ±    34.497  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   219991.514 ±  2117.364  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   101672.568 ±  1921.214  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    32582.073 ±   946.061  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40     8485.793 ±    26.348  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   291605.327 ±  2893.898  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   121034.948 ±  2545.312  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    32657.343 ±   114.322  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40     8527.834 ±    33.711  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1293211.662 ±  9833.892  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   450135.559 ±  1614.303  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   123675.797 ±   576.160  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    31707.566 ±    93.988  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1338667.215 ± 12012.240  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   453682.363 ±  2559.322  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   124785.645 ±   394.535  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    31788.969 ±    90.770  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   250683.639 ±  3990.340  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   131000.144 ±  2895.410  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    45215.542 ±  1368.148  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11879.307 ±    55.006  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   355255.774 ±  5397.267  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   156057.380 ±  4294.091  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47016.845 ±  1618.779  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12113.919 ±    45.792  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1824729.604 ± 12130.198  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   746024.477 ±  3921.472  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   219662.823 ±  2128.901  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    57198.868 ±   221.973  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1893810.127 ± 21870.718  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   758024.511 ±  5414.552  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   224032.805 ±   935.309  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    58112.296 ±   498.048  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   260529.149 ±  4298.662  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   144967.984 ±  4558.697  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50047.575 ±   171.204  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13976.999 ±    72.299  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   378971.408 ±  9324.721  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   179361.248 ±  7968.109  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    55727.145 ±  2860.765  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    14205.830 ±    59.411  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1182958.956 ±  7782.532  ops/s
>> ChaCha20.decrypt                1024  thrpt   40  1003530.400 ± 10315.996  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   339428.341 ±  2376.804  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    92903.498 ±  1112.425  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1266584.736 ±  5101.597  ops/s
>> ChaCha20.encrypt                1024  thrpt   40  1059717.173 ±  9435.649  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   350520.581 ±  2787.593  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    95181.548 ±  1638.579  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   200722.479 ±  2045.896  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   124660.386 ±  3869.517  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    44059.327 ±   143.765  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    12412.936 ±    54.845  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   274528.005 ±  2945.416  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   145146.188 ±   857.254  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47045.637 ±   128.049  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12643.929 ±    55.748  ops/s
>>
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>> part : 0xd0c, revision : 1
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1301037.920 ±  1734.836  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   387115.013 ±  1122.264  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   102591.108 ±   229.456  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    25878.583 ±    89.351  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1332737.880 ±  2478.508  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   390288.663 ±  2361.851  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   101882.728 ±   744.907  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    26001.888 ±    71.907  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   351189.393 ±  2209.148  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   142960.999 ±   361.619  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    42437.822 ±    85.557  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11173.152 ±    24.969  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   444870.664 ± 12571.799  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   158481.143 ±  2149.208  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    43610.721 ±   282.795  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    11150.783 ±    27.911  ops/s
>>
>> Intrinsics enabled
>> ------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1907215.648 ±  3163.767  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   631804.007 ±   736.430  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   172280.991 ±   362.190  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    44150.254 ±    98.927  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1990050.859 ±  6380.625  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   636574.405 ±  3332.471  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   173258.615 ±   327.199  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    44191.925 ±    72.996  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   360555.774 ±  1988.467  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   162093.489 ±   413.684  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50799.888 ±   110.955  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13560.165 ±    32.208  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   458079.724 ± 13746.235  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   188228.966 ±  3498.480  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    52665.733 ±   151.740  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    13606.192 ±    52.134  ops/s
>>
>> Special thanks to the folks who have made many helpful comments while this
>> PR was in draft form.
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4156:
>
>> 4154:     // Decrement and iterate
>> 4155:     __ subs(loopCtr, loopCtr, 1);
>> 4156:     __ cmp(loopCtr, (u1)0);
>
> CMP probably can be removed or can there be just SUB and CBNZ?

See my comment on the similar note below. I will likely remove this version of the intrinsic in favor of the _blockpar version. I like that second version better because it removes the need for the two sets of lane-shifting operations on each of the 10 iterations.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4306:
>
>> 4304:     __ subs(loopCtr, loopCtr, 1);
>> 4305:     __ cmp(loopCtr, (u1)0);
>> 4306:     __ br(Assembler::NE, L_twoRounds);
>
> Same thing about subs-cmp0-bne.

Thanks for the suggestion. I actually have a version of the _blockpar cc20 block function intrinsic that uses a C++ for-loop around the cc20_quarter_round macro calls to generate that portion of the stub. I believe that effectively unrolls the loop in the resulting stub and removes the need for the subs, cmp and br on all 10 iterations.

Right now the aarch64 port has two versions of the same block function because I was testing both. I will probably end up removing the _qr (quarter-round parallel) version in favor of the _blockpar (block-parallel) version: the two are comparable in speed, but the block-parallel version seems to be a little better. I'm always open to other ways of handling the loop control, as assembly is not my strong suit, so I appreciate the suggestion!

Interesting, I had not considered that. Thanks for pointing that out. I'm honestly not sure how to evaluate the impact of the generated code on the icache. I'll look at the logic surrounding the ghash processBlocks(_wide) code to see how that decision is made. I don't have an aversion to going back to an assembly-based loop using the suggestions that @dchuyko made, and maybe that's the right choice if it means more compact code.

-------------

PR: https://git.openjdk.org/jdk/pull/7702
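
To make the loop-control trade-off in this thread concrete, here is a rough sketch. It is not HotSpot code: ToyAssembler, the register x9, the placeholder round instructions and the body size are all hypothetical stand-ins, and the emitter only records instruction text. It compares a runtime loop (with and without the cmp, which is redundant because subs already sets the Z flag that b.ne tests) against generation-time unrolling with a C++ for-loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// ChaCha20 runs 20 rounds, i.e. 10 double rounds.
constexpr int kDoubleRounds = 10;

// Hypothetical toy emitter: records instruction text instead of encoding it.
// It stands in for a real assembler only for the purpose of counting.
struct ToyAssembler {
    std::vector<std::string> code;
    void emit(const std::string& insn) { code.push_back(insn); }
};

// Placeholder for one double round; bodyInsns is an assumed size,
// not the real instruction count of the stub.
void emit_double_round(ToyAssembler& masm, std::size_t bodyInsns) {
    assert(bodyInsns > 0);
    for (std::size_t i = 0; i < bodyInsns; ++i) {
        masm.emit("<round insn>");
    }
}

// Runtime loop: the body is emitted once and executed 10 times.
// With keepCmp == false the cmp is dropped, since subs sets the flags.
std::vector<std::string> gen_runtime_loop(std::size_t bodyInsns, bool keepCmp) {
    ToyAssembler masm;
    masm.emit("mov x9, #10");        // loop counter (x9 is an arbitrary choice)
    // L_twoRounds: (a label, not an instruction)
    emit_double_round(masm, bodyInsns);
    masm.emit("subs x9, x9, #1");    // decrement and set flags
    if (keepCmp) {
        masm.emit("cmp x9, #0");     // redundant after subs
    }
    masm.emit("b.ne L_twoRounds");
    return masm.code;
}

// Generation-time unrolling: the C++ loop runs in the generator,
// so the stub contains 10 copies of the body and no loop control.
std::vector<std::string> gen_unrolled(std::size_t bodyInsns) {
    ToyAssembler masm;
    for (int i = 0; i < kDoubleRounds; ++i) {
        emit_double_round(masm, bodyInsns);
    }
    return masm.code;
}
```

With an assumed body of 8 instructions, the runtime loop occupies 12 instruction slots (11 without the cmp) while the unrolled form occupies 80. The unrolled stub executes no loop-control instructions at all, but its footprint grows tenfold, which is the icache concern raised above. A sub plus cbnz pair would be the same size as subs plus b.ne.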