On Fri, 4 Mar 2022 16:47:54 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
> This PR delivers ChaCha20 intrinsics that accelerate the core block function
> that generates key stream from the key, counter and nonce. Intrinsics have
> been written for the following platforms and instruction sets:
>
> - x86_64: AVX, AVX2 and AVX512
> - aarch64: platforms that support the advanced SIMD instructions
>
> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the
> pending Poly1305 intrinsics to be delivered in #10582)
>
> x86_64
> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40   772956.829 ±  4434.965  ops/s
> ChaCha20.decrypt                1024  thrpt   40   230478.075 ±   660.617  ops/s
> ChaCha20.decrypt                4096  thrpt   40    61504.367 ±   187.485  ops/s
> ChaCha20.decrypt               16384  thrpt   40    15671.893 ±    59.860  ops/s
> ChaCha20.encrypt                 256  thrpt   40   793708.698 ±  3587.562  ops/s
> ChaCha20.encrypt                1024  thrpt   40   232413.842 ±   808.766  ops/s
> ChaCha20.encrypt                4096  thrpt   40    61586.483 ±    94.821  ops/s
> ChaCha20.encrypt               16384  thrpt   40    15749.637 ±    34.497  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   219991.514 ±  2117.364  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   101672.568 ±  1921.214  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    32582.073 ±   946.061  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40     8485.793 ±    26.348  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   291605.327 ±  2893.898  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   121034.948 ±  2545.312  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    32657.343 ±   114.322  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40     8527.834 ±    33.711  ops/s
>
> Intrinsics enabled (-XX:UseAVX=1)
> ---------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1293211.662 ±  9833.892  ops/s
> ChaCha20.decrypt                1024  thrpt   40   450135.559 ±  1614.303  ops/s
> ChaCha20.decrypt                4096  thrpt   40   123675.797 ±   576.160  ops/s
> ChaCha20.decrypt               16384  thrpt   40    31707.566 ±    93.988  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt                1024  thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt                4096  thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt               16384  thrpt   40    31788.969 ±    90.770  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   250683.639 ±  3990.340  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   131000.144 ±  2895.410  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    45215.542 ±  1368.148  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    11879.307 ±    55.006  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   355255.774 ±  5397.267  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   156057.380 ±  4294.091  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    47016.845 ±  1618.779  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    12113.919 ±    45.792  ops/s
>
> Intrinsics enabled (-XX:UseAVX=2)
> ---------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1824729.604 ± 12130.198  ops/s
> ChaCha20.decrypt                1024  thrpt   40   746024.477 ±  3921.472  ops/s
> ChaCha20.decrypt                4096  thrpt   40   219662.823 ±  2128.901  ops/s
> ChaCha20.decrypt               16384  thrpt   40    57198.868 ±   221.973  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt                1024  thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt                4096  thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt               16384  thrpt   40    58112.296 ±   498.048  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   260529.149 ±  4298.662  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   144967.984 ±  4558.697  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    50047.575 ±   171.204  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    13976.999 ±    72.299  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   378971.408 ±  9324.721  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   179361.248 ±  7968.109  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    55727.145 ±  2860.765  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    14205.830 ±    59.411  ops/s
>
> Intrinsics enabled (-XX:UseAVX=3)
> ---------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1182958.956 ±  7782.532  ops/s
> ChaCha20.decrypt                1024  thrpt   40  1003530.400 ± 10315.996  ops/s
> ChaCha20.decrypt                4096  thrpt   40   339428.341 ±  2376.804  ops/s
> ChaCha20.decrypt               16384  thrpt   40    92903.498 ±  1112.425  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1266584.736 ±  5101.597  ops/s
> ChaCha20.encrypt                1024  thrpt   40  1059717.173 ±  9435.649  ops/s
> ChaCha20.encrypt                4096  thrpt   40   350520.581 ±  2787.593  ops/s
> ChaCha20.encrypt               16384  thrpt   40    95181.548 ±  1638.579  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   200722.479 ±  2045.896  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   124660.386 ±  3869.517  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    44059.327 ±   143.765  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    12412.936 ±    54.845  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   274528.005 ±  2945.416  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   145146.188 ±   857.254  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    47045.637 ±   128.049  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    12643.929 ±    55.748  ops/s
>
> aarch64
> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
> part : 0xd0c, revision : 1
>
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1301037.920 ±  1734.836  ops/s
> ChaCha20.decrypt                1024  thrpt   40   387115.013 ±  1122.264  ops/s
> ChaCha20.decrypt                4096  thrpt   40   102591.108 ±   229.456  ops/s
> ChaCha20.decrypt               16384  thrpt   40    25878.583 ±    89.351  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1332737.880 ±  2478.508  ops/s
> ChaCha20.encrypt                1024  thrpt   40   390288.663 ±  2361.851  ops/s
> ChaCha20.encrypt                4096  thrpt   40   101882.728 ±   744.907  ops/s
> ChaCha20.encrypt               16384  thrpt   40    26001.888 ±    71.907  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   351189.393 ±  2209.148  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   142960.999 ±   361.619  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    42437.822 ±    85.557  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    11173.152 ±    24.969  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   444870.664 ± 12571.799  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   158481.143 ±  2149.208  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    43610.721 ±   282.795  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    11150.783 ±    27.911  ops/s
>
> Intrinsics enabled
> ------------------
> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                 256  thrpt   40  1907215.648 ±  3163.767  ops/s
> ChaCha20.decrypt                1024  thrpt   40   631804.007 ±   736.430  ops/s
> ChaCha20.decrypt                4096  thrpt   40   172280.991 ±   362.190  ops/s
> ChaCha20.decrypt               16384  thrpt   40    44150.254 ±    98.927  ops/s
> ChaCha20.encrypt                 256  thrpt   40  1990050.859 ±  6380.625  ops/s
> ChaCha20.encrypt                1024  thrpt   40   636574.405 ±  3332.471  ops/s
> ChaCha20.encrypt                4096  thrpt   40   173258.615 ±   327.199  ops/s
> ChaCha20.encrypt               16384  thrpt   40    44191.925 ±    72.996  ops/s
>
> ChaCha20Poly1305.decrypt         256  thrpt   40   360555.774 ±  1988.467  ops/s
> ChaCha20Poly1305.decrypt        1024  thrpt   40   162093.489 ±   413.684  ops/s
> ChaCha20Poly1305.decrypt        4096  thrpt   40    50799.888 ±   110.955  ops/s
> ChaCha20Poly1305.decrypt       16384  thrpt   40    13560.165 ±    32.208  ops/s
> ChaCha20Poly1305.encrypt         256  thrpt   40   458079.724 ± 13746.235  ops/s
> ChaCha20Poly1305.encrypt        1024  thrpt   40   188228.966 ±  3498.480  ops/s
> ChaCha20Poly1305.encrypt        4096  thrpt   40    52665.733 ±   151.740  ops/s
> ChaCha20Poly1305.encrypt       16384  thrpt   40    13606.192 ±    52.134  ops/s
>
> Special thanks to the folks who have made many helpful comments while this PR
> was in draft form.

Work is ongoing.
I'm making a few refinements on the x86_64 side and will remove x86_32 stub generators but hopefully will open this up for formal review soon.

I've also extended the single-structure st4 to now do single structure st1/2/3/4. I just needed to do a little internal playtesting with them to make sure I was still getting the correct results. I don't plan on using st1/2/3 but since they all use the same opcode generation macros as st4 I figured it would be worth including them. That will all show up in my next commit/push.

FYI, I'm holding off on some changes that @iwanowww had suggested in order to wait for #10111 and #10124 to integrate (but more for the former). I think I may end up shifting the CC20 intrinsics into separate files like Vladimir is proposing for AES. Also it has been a while since I've merged the master branch so it could do with a refresh to get 10111 in there.

Quick update: I've run into a strange "Unschedulable graph" issue being raised at the C2 layer of things. It happens specifically with the ChaCha20Poly1305.decrypt microbenchmark and only on AVX512 (with -XX:UseAVX=3). Investigation is ongoing, but points away (right now) from the stub itself and may be a latent C2 issue that is being uncovered. I have run hundreds of thousands of AVX512 cc20-p1305 decrypts of various sizes outside the microbenchmark and never run into this. I will share more as I learn it.

Quick update on the unschedulable graph issue: It appears that we're running into an issue related to either [JDK-8252848](https://bugs.openjdk.org/browse/JDK-8252848) or [JDK-8266951](https://bugs.openjdk.org/browse/JDK-8266951). A new issue to track this has been created in [JDK-8296233](https://bugs.openjdk.org/browse/JDK-8296233).
While this has only ever been seen thus far with the ChaCha20Poly1305.decrypt microbenchmark when -XX:UseAVX=3 is employed, the nature of the issue is such that it could happen with any of the intrinsics, since it is triggered more by the library call change to C2's IR. But this has never been seen outside of the current narrow configuration to date.

Good news. It turns out that [JDK-8292780](https://bugs.openjdk.org/browse/JDK-8292780) is a fix for the underlying issue that caused the benchmark to crash. After I did a pull/merge and retested, the benchmarks are no longer failing.

src/hotspot/cpu/x86/stubGenerator_x86_32.cpp line 3636:

> 3634:     const XMMRegister zmm_cState = xmm6;
> 3635:     const XMMRegister zmm_dState = xmm7;
> 3636:     const XMMRegister zmm_addMask = xmm8;

Whoops! It looks like there may not be an xmm8 register available to 32-bit architectures. This may need a little creative restructuring in order to make it work. Or we might just add from the ExternalAddress directly in this specific case.

-------------

PR: https://git.openjdk.org/jdk/pull/7702
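
For readers who want a picture of the "core block function" that the quoted PR description says the intrinsics accelerate, here is a minimal scalar sketch in Java. It follows the ChaCha20 block function as described in RFC 8439 and is not the JDK's implementation or the intrinsic stub. The class name `ChaCha20BlockSketch`, the method names, and the choice to pass the key and nonce as little-endian 32-bit words are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only: class and method names are hypothetical,
// not taken from the JDK sources or from the PR.
final class ChaCha20BlockSketch {

    // One quarter round over four words of the 16-word state:
    // add, xor, rotate with rotation amounts 16, 12, 8 and 7.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Produces one 64-byte key stream block from a 256-bit key (8 words),
    // a 32-bit block counter and a 96-bit nonce (3 words).
    // Assumption: key and nonce are already split into little-endian words.
    static byte[] block(int[] key, int counter, int[] nonce) {
        int[] init = new int[16];
        // The four constant words ("expand 32-byte k").
        init[0] = 0x61707865; init[1] = 0x3320646e;
        init[2] = 0x79622d32; init[3] = 0x6b206574;
        System.arraycopy(key, 0, init, 4, 8);
        init[12] = counter;
        System.arraycopy(nonce, 0, init, 13, 3);

        int[] s = init.clone();
        // 20 rounds = 10 iterations of (column round + diagonal round).
        for (int i = 0; i < 10; i++) {
            quarterRound(s, 0, 4, 8, 12);
            quarterRound(s, 1, 5, 9, 13);
            quarterRound(s, 2, 6, 10, 14);
            quarterRound(s, 3, 7, 11, 15);
            quarterRound(s, 0, 5, 10, 15);
            quarterRound(s, 1, 6, 11, 12);
            quarterRound(s, 2, 7, 8, 13);
            quarterRound(s, 3, 4, 9, 14);
        }

        // Add the initial state back in and serialize as little-endian bytes.
        ByteBuffer out = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 16; i++) {
            out.putInt(s[i] + init[i]);
        }
        return out.array();
    }

    public static void main(String[] args) {
        // Quarter round inputs from the RFC 8439 test vector.
        int[] s = {0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567};
        quarterRound(s, 0, 1, 2, 3);
        // prints "ea2a92f4 cb1cf8ce 4581472e 5881c4bb"
        System.out.printf("%08x %08x %08x %08x%n", s[0], s[1], s[2], s[3]);
    }
}
```

The four quarter rounds within each column or diagonal round touch disjoint words of the state, and blocks for consecutive counter values are independent of one another. That data-level parallelism is presumably what makes the function a good fit for the AVX/AVX2/AVX512 and advanced SIMD instruction sets listed in the description.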