Hi Carter,
CTR doesn't have the same splitting of the input data to speed up
triggering of the intrinsic that GCM has. The need to split data is
such a narrow situation, as users don't typically use 1, 10, or 100 MB
data sizes.
Are you using a particular application where you are seeing the
performance drop off clearly, besides a benchmark?
thanks
Tony
On 10/26/22 8:01 AM, Carter Kozak wrote:
Continuing a conversation I had with Sean Mullan at Java One, for a
broader audience.
We tend to believe that bulk operations are good. Large bulk operations
give the system the most information at once, allowing it to make more
informed decisions. With some understanding of the hotspot compiler and
how the security components interact with it, the observed performance
degradation makes sense, but I don’t think it’s obvious or desirable to
most of those using the JDK. As the industry shifts toward shorter-lived
and horizontally scalable instances, it becomes more
important than ever to deliver cryptography performance consistently and
early.
Encryption in Java is usually fast, around 2-3 GiB/second per core using
the default OpenJDK JSSE provider on my test system. However, when
developers use larger buffers (~10 MiB, perhaps large for
networking/TLS, but reasonable for local data), I can observe throughput
drop to 60 MiB/second (between 2 and 3 percent of the expected throughput!).
Results from
https://github.com/carterkozak/java-crypto-buffer-performance:
Benchmark                    (cipher)           (numBytes)  (writeStrategy)   Mode  Cnt     Score     Error  Units
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2215.898 ± 185.661  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4     6.427 ±   0.475  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     0.620 ±   0.096  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2933.808 ±  17.538  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4    31.775 ±   1.898  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     3.174 ±   0.171  ops/s
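For context, the benchmark body looks roughly like this. This is a
minimal sketch of what the linked harness measures, not the exact code
from the repository: the CHUNKED writeStrategy and other plumbing are
omitted, and details may differ.

import java.security.SecureRandom;
import java.util.concurrent.TimeUnit;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class EncryptionBenchmark {

    @Param({"AES/GCM/NoPadding", "AES/CTR/NoPadding"})
    public String cipher;

    @Param({"1048576", "10485760", "104857600"})
    public int numBytes;

    private byte[] plaintext;
    private byte[] output;
    private SecretKey key;

    @Setup
    public void setup() throws Exception {
        plaintext = new byte[numBytes];
        new SecureRandom().nextBytes(plaintext);
        // GCM appends a 16-byte tag; size the output for either mode.
        output = new byte[numBytes + 16];
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        key = keyGen.generateKey();
    }

    @Benchmark
    public int encrypt() throws Exception {
        // A fresh Cipher (and random IV) per invocation: GCM forbids
        // reusing a key+IV pair across encryptions.
        Cipher c = Cipher.getInstance(cipher);
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plaintext, 0, plaintext.length, output, 0);
    }
}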
Using |AES/GCM/NoPadding|, large buffers result in a great deal of work
within GHASH.processBlocks
<https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/com/sun/crypto/provider/GHASH.java#L272-L286>,
which is intrinsified
<https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/hotspot/share/classfile/vmIntrinsics.hpp#L462-L466>.
However, the intrinsic isn’t used because the method is called
infrequently, so a tremendous amount of work occurs within the default
implementation. Notes from my initial investigation (with flame graphs)
are here
<https://github.com/palantir/hadoop-crypto/pull/586#issuecomment-964394587>.
When I introduce a wrapper to chunk input buffers into 16 KiB segments
(other sizes tested here
<https://github.com/palantir/hadoop-crypto/pull/586#issue-1047810949>),
we can effectively force the method to warm up and perform nearly two
orders of magnitude better (a sketch of the wrapper follows the table
below):
https://github.com/carterkozak/java-crypto-buffer-performance#jdk-17
Benchmark                    (cipher)           (numBytes)  (writeStrategy)   Mode  Cnt     Score     Error  Units
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2215.898 ± 185.661  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576          CHUNKED  thrpt    4  2516.770 ± 193.009  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4     6.427 ±   0.475  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760          CHUNKED  thrpt    4   246.956 ±  51.193  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     0.620 ±   0.096  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600          CHUNKED  thrpt    4    24.633 ±   2.784  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2933.808 ±  17.538  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576          CHUNKED  thrpt    4  3277.374 ± 569.573  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4    31.775 ±   1.898  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760          CHUNKED  thrpt    4   332.873 ±  55.589  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     3.174 ±   0.171  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600          CHUNKED  thrpt    4    33.909 ±   1.675  ops/s
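The chunking wrapper itself is conceptually simple. Here is a minimal
sketch of the approach (illustrative names, not the exact code from the
hadoop-crypto PR): feed the Cipher fixed-size segments so the hot
GHASH/counter-mode methods are invoked often enough for the JIT to
compile and intrinsify them.

import javax.crypto.Cipher;

final class ChunkingEncryptor {
    // 16 KiB performed well in my tests; see the linked PR for other sizes.
    private static final int CHUNK_SIZE = 16 * 1024;

    /** Encrypts input into output in segments, returning bytes written. */
    static int encrypt(Cipher cipher, byte[] input, byte[] output)
            throws Exception {
        int inOffset = 0;
        int outOffset = 0;
        int remaining = input.length;
        while (remaining > CHUNK_SIZE) {
            outOffset += cipher.update(
                    input, inOffset, CHUNK_SIZE, output, outOffset);
            inOffset += CHUNK_SIZE;
            remaining -= CHUNK_SIZE;
        }
        // doFinal handles the tail and, for GCM, appends the auth tag.
        outOffset += cipher.doFinal(
                input, inOffset, remaining, output, outOffset);
        return outOffset;
    }
}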
The 10 MiB full-buffer benchmark is eventually partially optimized after
~3 minutes of encryption on ~10 GiB of data; in practice this takes much
longer because the encrypted data must also be put somewhere,
potentially leading to rubber-banding over a network.
While writing this up I re-ran my investigation using JDK 19
<https://github.com/carterkozak/java-crypto-buffer-performance#jdk-19>
and found, to my surprise, that AES/GCM performed substantially better,
warming up quickly, while AES/CTR performance was largely equivalent! It
turns out that JDK-8273297
<https://bugs.openjdk.org/browse/JDK-8273297>, which aimed to improve
the performance of an intrinsic, has the side effect of allowing the
intrinsic to kick in much sooner by segmenting inputs into 1 MiB chunks
<https://github.com/openjdk/jdk/commit/13e9ea9e922030927775345b1abde1313a6ec03f#diff-a533e78f757a3ad64a8d2453bea64a0d426890ef85031452bf74070776ad8be0R575-R596>.
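Schematically, my reading of that change is that it applies the same
chunking idea inside the provider itself. The following is a simplified
illustration only, not the actual JDK source; signatures are
hypothetical, and the implGCMCrypt stub merely stands in for the real
combined AES-CTR + GHASH routine that the hotspot intrinsic replaces.

final class GcmSplitSketch {
    private static final int SPLIT_LEN = 1 << 20; // 1 MiB per segment

    // Walk a large input in bounded segments so the intrinsic candidate
    // is invoked frequently enough to be compiled, even within a single
    // large doFinal call.
    static int crypt(byte[] in, int inOfs, int inLen, byte[] out, int outOfs) {
        int processed = 0;
        while (inLen - processed > SPLIT_LEN) {
            processed += implGCMCrypt(
                    in, inOfs + processed, SPLIT_LEN, out, outOfs + processed);
        }
        return processed + implGCMCrypt(
                in, inOfs + processed, inLen - processed, out, outOfs + processed);
    }

    // Stand-in stub; the real method does the encryption and GHASH work.
    private static int implGCMCrypt(byte[] in, int inOfs, int len,
            byte[] out, int outOfs) {
        System.arraycopy(in, inOfs, out, outOfs, len);
        return len;
    }
}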
I’ve intentionally avoided suggesting specific solutions; as a layperson
I don’t feel confident making explicit recommendations. My goal is
reliably high throughput based on the amount of work done, rather than
on the size of individual operations. As a user, native implementations
like tcnative <https://tomcat.apache.org/native-doc/> and Conscrypt
<https://github.com/google/conscrypt> provide the performance
characteristics I’m looking for, but without the reliability or
flexibility of OpenJDK JSSE. Is there a solution which allows us to get
the best of both worlds?
Thanks,
Carter Kozak