This PR delivers ChaCha20 intrinsics that accelerate the core block function,
which generates the key stream from the key, counter, and nonce. Intrinsics
have been written for the following platforms and instruction sets:

- x86_64: AVX, AVX2 and AVX512
- aarch64: platforms that support the advanced SIMD instructions
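
For readers unfamiliar with the block function, here is a minimal scalar
sketch in Java, following RFC 8439. It is not the code in this PR and not the
JDK's internal implementation; the class and method names are made up for
illustration. It only shows the computation that the intrinsics vectorize:
20 rounds of quarter-round mixing over a 16-word state built from the
constants, key, counter, and nonce.

```java
// Illustrative only: a scalar ChaCha20 block function per RFC 8439.
// Class and method names are hypothetical, not taken from the JDK sources.
class ChaCha20BlockSketch {

    // One quarter round on four words of the 16-word state (RFC 8439, 2.1).
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Produces one 64-byte key stream block as 16 little-endian words.
    // key: 8 words, nonce: 3 words, counter: 32-bit block counter.
    static int[] block(int[] key, int counter, int[] nonce) {
        int[] init = new int[16];
        init[0] = 0x61707865;              // "expa"
        init[1] = 0x3320646e;              // "nd 3"
        init[2] = 0x79622d32;              // "2-by"
        init[3] = 0x6b206574;              // "te k"
        System.arraycopy(key, 0, init, 4, 8);
        init[12] = counter;
        System.arraycopy(nonce, 0, init, 13, 3);

        int[] s = init.clone();
        for (int i = 0; i < 10; i++) {     // 10 double rounds = 20 rounds
            // Column rounds
            quarterRound(s, 0, 4, 8, 12);
            quarterRound(s, 1, 5, 9, 13);
            quarterRound(s, 2, 6, 10, 14);
            quarterRound(s, 3, 7, 11, 15);
            // Diagonal rounds
            quarterRound(s, 0, 5, 10, 15);
            quarterRound(s, 1, 6, 11, 12);
            quarterRound(s, 2, 7, 8, 13);
            quarterRound(s, 3, 4, 9, 14);
        }
        // Add the initial state back into the mixed state.
        for (int i = 0; i < 16; i++) {
            s[i] += init[i];
        }
        return s;
    }

    public static void main(String[] args) {
        // Inputs from the RFC 8439 section 2.3.2 test vector.
        int[] key = new int[8];
        for (int i = 0; i < 8; i++) {
            int b = 4 * i;                 // key bytes are 00 01 02 ... 1f
            key[i] = b | (b + 1) << 8 | (b + 2) << 16 | (b + 3) << 24;
        }
        int[] nonce = {0x09000000, 0x4a000000, 0x00000000};
        int[] out = block(key, 1, nonce);
        System.out.printf("first key stream word: %08x%n", out[0]);
    }
}
```

Each quarter round touches only four words, and successive blocks differ
only in the counter word, so several blocks can be computed side by side in
SIMD lanes. The "block-parallel" wording in the commit messages suggests the
intrinsics take that approach, but the PR itself is the authority on how.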

Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the 
pending Poly1305 intrinsics to be delivered in #10582)

x86_64
Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz

Java only (-XX:-UseChaCha20Intrinsics)
--------------------------------------
Benchmark                  (dataSize)     Mode  Cnt       Score      Error  Units
ChaCha20.decrypt                  256    thrpt   40  772956.829 ± 4434.965  ops/s
ChaCha20.decrypt                 1024    thrpt   40  230478.075 ±  660.617  ops/s
ChaCha20.decrypt                 4096    thrpt   40   61504.367 ±  187.485  ops/s
ChaCha20.decrypt                16384    thrpt   40   15671.893 ±   59.860  ops/s
ChaCha20.encrypt                  256    thrpt   40  793708.698 ± 3587.562  ops/s
ChaCha20.encrypt                 1024    thrpt   40  232413.842 ±  808.766  ops/s
ChaCha20.encrypt                 4096    thrpt   40   61586.483 ±   94.821  ops/s
ChaCha20.encrypt                16384    thrpt   40   15749.637 ±   34.497  ops/s

ChaCha20Poly1305.decrypt          256    thrpt   40  219991.514 ± 2117.364  ops/s
ChaCha20Poly1305.decrypt         1024    thrpt   40  101672.568 ± 1921.214  ops/s
ChaCha20Poly1305.decrypt         4096    thrpt   40   32582.073 ±  946.061  ops/s
ChaCha20Poly1305.decrypt        16384    thrpt   40    8485.793 ±   26.348  ops/s
ChaCha20Poly1305.encrypt          256    thrpt   40  291605.327 ± 2893.898  ops/s
ChaCha20Poly1305.encrypt         1024    thrpt   40  121034.948 ± 2545.312  ops/s
ChaCha20Poly1305.encrypt         4096    thrpt   40   32657.343 ±  114.322  ops/s
ChaCha20Poly1305.encrypt        16384    thrpt   40    8527.834 ±   33.711  ops/s

Intrinsics enabled (-XX:UseAVX=1)
---------------------------------
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.decrypt                  256    thrpt   40  1293211.662 ±  9833.892  ops/s
ChaCha20.decrypt                 1024    thrpt   40   450135.559 ±  1614.303  ops/s
ChaCha20.decrypt                 4096    thrpt   40   123675.797 ±   576.160  ops/s
ChaCha20.decrypt                16384    thrpt   40    31707.566 ±    93.988  ops/s
ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s

ChaCha20Poly1305.decrypt          256    thrpt   40   250683.639 ±  3990.340  ops/s
ChaCha20Poly1305.decrypt         1024    thrpt   40   131000.144 ±  2895.410  ops/s
ChaCha20Poly1305.decrypt         4096    thrpt   40    45215.542 ±  1368.148  ops/s
ChaCha20Poly1305.decrypt        16384    thrpt   40    11879.307 ±    55.006  ops/s
ChaCha20Poly1305.encrypt          256    thrpt   40   355255.774 ±  5397.267  ops/s
ChaCha20Poly1305.encrypt         1024    thrpt   40   156057.380 ±  4294.091  ops/s
ChaCha20Poly1305.encrypt         4096    thrpt   40    47016.845 ±  1618.779  ops/s
ChaCha20Poly1305.encrypt        16384    thrpt   40    12113.919 ±    45.792  ops/s

Intrinsics enabled (-XX:UseAVX=2)
---------------------------------
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.decrypt                  256    thrpt   40  1824729.604 ± 12130.198  ops/s
ChaCha20.decrypt                 1024    thrpt   40   746024.477 ±  3921.472  ops/s
ChaCha20.decrypt                 4096    thrpt   40   219662.823 ±  2128.901  ops/s
ChaCha20.decrypt                16384    thrpt   40    57198.868 ±   221.973  ops/s
ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s

ChaCha20Poly1305.decrypt          256    thrpt   40   260529.149 ±  4298.662  ops/s
ChaCha20Poly1305.decrypt         1024    thrpt   40   144967.984 ±  4558.697  ops/s
ChaCha20Poly1305.decrypt         4096    thrpt   40    50047.575 ±   171.204  ops/s
ChaCha20Poly1305.decrypt        16384    thrpt   40    13976.999 ±    72.299  ops/s
ChaCha20Poly1305.encrypt          256    thrpt   40   378971.408 ±  9324.721  ops/s
ChaCha20Poly1305.encrypt         1024    thrpt   40   179361.248 ±  7968.109  ops/s
ChaCha20Poly1305.encrypt         4096    thrpt   40    55727.145 ±  2860.765  ops/s
ChaCha20Poly1305.encrypt        16384    thrpt   40    14205.830 ±    59.411  ops/s

Intrinsics enabled (-XX:UseAVX=3)
---------------------------------
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.decrypt                  256    thrpt   40  1182958.956 ±  7782.532  ops/s
ChaCha20.decrypt                 1024    thrpt   40  1003530.400 ± 10315.996  ops/s
ChaCha20.decrypt                 4096    thrpt   40   339428.341 ±  2376.804  ops/s
ChaCha20.decrypt                16384    thrpt   40    92903.498 ±  1112.425  ops/s
ChaCha20.encrypt                  256    thrpt   40  1266584.736 ±  5101.597  ops/s
ChaCha20.encrypt                 1024    thrpt   40  1059717.173 ±  9435.649  ops/s
ChaCha20.encrypt                 4096    thrpt   40   350520.581 ±  2787.593  ops/s
ChaCha20.encrypt                16384    thrpt   40    95181.548 ±  1638.579  ops/s

ChaCha20Poly1305.decrypt          256    thrpt   40   200722.479 ±  2045.896  ops/s
ChaCha20Poly1305.decrypt         1024    thrpt   40   124660.386 ±  3869.517  ops/s
ChaCha20Poly1305.decrypt         4096    thrpt   40    44059.327 ±   143.765  ops/s
ChaCha20Poly1305.decrypt        16384    thrpt   40    12412.936 ±    54.845  ops/s
ChaCha20Poly1305.encrypt          256    thrpt   40   274528.005 ±  2945.416  ops/s
ChaCha20Poly1305.encrypt         1024    thrpt   40   145146.188 ±   857.254  ops/s
ChaCha20Poly1305.encrypt         4096    thrpt   40    47045.637 ±   128.049  ops/s
ChaCha20Poly1305.encrypt        16384    thrpt   40    12643.929 ±    55.748  ops/s

aarch64
Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3, part : 0xd0c, revision : 1

Java only (-XX:-UseChaCha20Intrinsics)
--------------------------------------
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.decrypt                  256    thrpt   40  1301037.920 ±  1734.836  ops/s
ChaCha20.decrypt                 1024    thrpt   40   387115.013 ±  1122.264  ops/s
ChaCha20.decrypt                 4096    thrpt   40   102591.108 ±   229.456  ops/s
ChaCha20.decrypt                16384    thrpt   40    25878.583 ±    89.351  ops/s
ChaCha20.encrypt                  256    thrpt   40  1332737.880 ±  2478.508  ops/s
ChaCha20.encrypt                 1024    thrpt   40   390288.663 ±  2361.851  ops/s
ChaCha20.encrypt                 4096    thrpt   40   101882.728 ±   744.907  ops/s
ChaCha20.encrypt                16384    thrpt   40    26001.888 ±    71.907  ops/s

ChaCha20Poly1305.decrypt          256    thrpt   40   351189.393 ±  2209.148  ops/s
ChaCha20Poly1305.decrypt         1024    thrpt   40   142960.999 ±   361.619  ops/s
ChaCha20Poly1305.decrypt         4096    thrpt   40    42437.822 ±    85.557  ops/s
ChaCha20Poly1305.decrypt        16384    thrpt   40    11173.152 ±    24.969  ops/s
ChaCha20Poly1305.encrypt          256    thrpt   40   444870.664 ± 12571.799  ops/s
ChaCha20Poly1305.encrypt         1024    thrpt   40   158481.143 ±  2149.208  ops/s
ChaCha20Poly1305.encrypt         4096    thrpt   40    43610.721 ±   282.795  ops/s
ChaCha20Poly1305.encrypt        16384    thrpt   40    11150.783 ±    27.911  ops/s

Intrinsics enabled
------------------
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.decrypt                  256    thrpt   40  1907215.648 ±  3163.767  ops/s
ChaCha20.decrypt                 1024    thrpt   40   631804.007 ±   736.430  ops/s
ChaCha20.decrypt                 4096    thrpt   40   172280.991 ±   362.190  ops/s
ChaCha20.decrypt                16384    thrpt   40    44150.254 ±    98.927  ops/s
ChaCha20.encrypt                  256    thrpt   40  1990050.859 ±  6380.625  ops/s
ChaCha20.encrypt                 1024    thrpt   40   636574.405 ±  3332.471  ops/s
ChaCha20.encrypt                 4096    thrpt   40   173258.615 ±   327.199  ops/s
ChaCha20.encrypt                16384    thrpt   40    44191.925 ±    72.996  ops/s

ChaCha20Poly1305.decrypt          256    thrpt   40   360555.774 ±  1988.467  ops/s
ChaCha20Poly1305.decrypt         1024    thrpt   40   162093.489 ±   413.684  ops/s
ChaCha20Poly1305.decrypt         4096    thrpt   40    50799.888 ±   110.955  ops/s
ChaCha20Poly1305.decrypt        16384    thrpt   40    13560.165 ±    32.208  ops/s
ChaCha20Poly1305.encrypt          256    thrpt   40   458079.724 ± 13746.235  ops/s
ChaCha20Poly1305.encrypt         1024    thrpt   40   188228.966 ±  3498.480  ops/s
ChaCha20Poly1305.encrypt         4096    thrpt   40    52665.733 ±   151.740  ops/s
ChaCha20Poly1305.encrypt        16384    thrpt   40    13606.192 ±    52.134  ops/s

Special thanks to the folks who have made many helpful comments while this PR 
was in draft form.

-------------

Commit messages:
 - consolidate single-structure ld_st methods
 - Add intrinsic tests that target specific SIMD instruction sets
 - add explicit int cast on counter rollover protection
 - Merge with main
 - expand input sizes for ChaCha20 and ChaCha20-Poly1305 micro benchmarks
 - rename chapoly to chacha
 - make alg-specific stub/macro files exclusive to chacha20
 - Remove stubRoutines constant generation method, replace using emit_int64/adr
 - Use block-parallel intrinsic, remove qr-parallel intrinsic, use sub/cbnz for loop control
 - Minor fixes from comments
 - ... and 30 more: https://git.openjdk.org/jdk/compare/c7b95a89...c79abe34

Changes: https://git.openjdk.org/jdk/pull/7702/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=7702&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8247645
  Stats: 1590 lines in 30 files changed: 1552 ins; 4 del; 34 mod
  Patch: https://git.openjdk.org/jdk/pull/7702.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/7702/head:pull/7702

PR: https://git.openjdk.org/jdk/pull/7702
