Hi Carter,
CTR doesn't have the same splitting of the input data to speed up
triggering of the intrinsic that GCM has. The need to split data is
such a narrow situation, as users don't typically use 1, 10, or 100 MB
data sizes.
Are you using a particular application where you are seeing the
performance drop off clearly, besides a benchmark?
thanks
Tony
On 10/26/22 8:01 AM, Carter Kozak wrote:
Continuing a conversation I had with Sean Mullan at Java One, for a
broader audience.
We tend to believe that bulk operations are good. Large bulk operations
give the system the most information at once, allowing it to make more
informed decisions. With some understanding of the hotspot compiler and
how the security components interact with it, the observed performance
degradation makes sense, but I don’t think it’s obvious or desirable to
most of those using the JDK. As the industry shifts toward shorter-lived
and horizontally scalable instances, it becomes more
important than ever to deliver cryptography performance consistently and
early.
Encryption in Java is usually fast, around 2-3 GiB/second per core using
the default OpenJDK JSSE provider on my test system. However, when
developers use larger buffers (~10 MiB, perhaps large for
networking/TLS, but reasonable for local data), I can observe throughput
drop to 60 MiB/second (between 2 and 3 percent of the expected throughput!).
Results from
https://github.com/carterkozak/java-crypto-buffer-performance:
Benchmark                    (cipher)           (numBytes)  (writeStrategy)   Mode  Cnt     Score     Error  Units
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2215.898 ± 185.661  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4     6.427 ±   0.475  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     0.620 ±   0.096  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2933.808 ±  17.538  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4    31.775 ±   1.898  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     3.174 ±   0.171  ops/s
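For context, the benchmark body looks roughly like this. This is a
minimal sketch of what the linked harness measures, not the exact code
from the repository: the CHUNKED writeStrategy and other plumbing are
omitted, and details may differ.

import java.security.SecureRandom;
import java.util.concurrent.TimeUnit;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class EncryptionBenchmark {

    @Param({"AES/GCM/NoPadding", "AES/CTR/NoPadding"})
    public String cipher;

    @Param({"1048576", "10485760", "104857600"})
    public int numBytes;

    private byte[] plaintext;
    private byte[] output;
    private SecretKey key;

    @Setup
    public void setup() throws Exception {
        plaintext = new byte[numBytes];
        new SecureRandom().nextBytes(plaintext);
        // GCM appends a 16-byte tag; size the output for either mode.
        output = new byte[numBytes + 16];
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        key = keyGen.generateKey();
    }

    @Benchmark
    public int encrypt() throws Exception {
        // A fresh Cipher (and random IV) per invocation: GCM forbids
        // reusing a key+IV pair across encryptions.
        Cipher c = Cipher.getInstance(cipher);
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plaintext, 0, plaintext.length, output, 0);
    }
}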
Using |AES/GCM/NoPadding|, large buffers result in a great deal of work
within GHASH.processBlocks
<https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/com/sun/crypto/provider/GHASH.java#L272-L286>,
which is intrinsified
<https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/hotspot/share/classfile/vmIntrinsics.hpp#L462-L466>.
However, the intrinsic isn’t used because the method is called
infrequently, so a tremendous amount of work occurs within the default
implementation. Notes from my initial investigation (with flame graphs)
are here
<https://github.com/palantir/hadoop-crypto/pull/586#issuecomment-964394587>.
When I introduce a wrapper to chunk input buffers into 16 KiB segments
(other sizes tested here
<https://github.com/palantir/hadoop-crypto/pull/586#issue-1047810949>),
we can effectively force the method to warm up and perform nearly two
orders of magnitude better (a sketch of the wrapper follows the table
below):
https://github.com/carterkozak/java-crypto-buffer-performance#jdk-17
Benchmark                    (cipher)           (numBytes)  (writeStrategy)   Mode  Cnt     Score     Error  Units
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2215.898 ± 185.661  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576          CHUNKED  thrpt    4  2516.770 ± 193.009  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4     6.427 ±   0.475  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760          CHUNKED  thrpt    4   246.956 ±  51.193  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     0.620 ±   0.096  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600          CHUNKED  thrpt    4    24.633 ±   2.784  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2933.808 ±  17.538  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576          CHUNKED  thrpt    4  3277.374 ± 569.573  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4    31.775 ±   1.898  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760          CHUNKED  thrpt    4   332.873 ±  55.589  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     3.174 ±   0.171  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600          CHUNKED  thrpt    4    33.909 ±   1.675  ops/s
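The chunking wrapper itself is conceptually simple. Here is a minimal
sketch of the approach (illustrative names, not the exact code from the
hadoop-crypto PR): feed the Cipher fixed-size segments so the hot
GHASH/counter-mode methods are invoked often enough for the JIT to
compile and intrinsify them.

import javax.crypto.Cipher;

final class ChunkingEncryptor {
    // 16 KiB performed well in my tests; see the linked PR for other sizes.
    private static final int CHUNK_SIZE = 16 * 1024;

    /** Encrypts input into output in segments, returning bytes written. */
    static int encrypt(Cipher cipher, byte[] input, byte[] output)
            throws Exception {
        int inOffset = 0;
        int outOffset = 0;
        int remaining = input.length;
        while (remaining > CHUNK_SIZE) {
            outOffset += cipher.update(
                    input, inOffset, CHUNK_SIZE, output, outOffset);
            inOffset += CHUNK_SIZE;
            remaining -= CHUNK_SIZE;
        }
        // doFinal handles the tail and, for GCM, appends the auth tag.
        outOffset += cipher.doFinal(
                input, inOffset, remaining, output, outOffset);
        return outOffset;
    }
}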
The 10 MiB full-buffer benchmark is eventually partially optimized after
~3 minutes of encryption on ~10 GiB of data; in practice this takes much
longer because the encrypted data must also be put somewhere,
potentially leading to rubber-banding over a network.
While writing this up I re-ran my investigation using JDK 19
<https://github.com/carterkozak/java-crypto-buffer-performance#jdk-19>
and found, to my surprise, that AES/GCM performed substantially better,
warming up quickly, while AES/CTR performance was largely equivalent! It
turns out that JDK-8273297
<https://bugs.openjdk.org/browse/JDK-8273297>, which aimed to improve
the performance of an intrinsic, has the side effect of allowing the
intrinsic to kick in much sooner by segmenting inputs into 1 MiB chunks
<https://github.com/openjdk/jdk/commit/13e9ea9e922030927775345b1abde1313a6ec03f#diff-a533e78f757a3ad64a8d2453bea64a0d426890ef85031452bf74070776ad8be0R575-R596>.
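Schematically, my reading of that change is that it applies the same
chunking idea inside the provider itself. The following is a simplified
illustration only, not the actual JDK source; signatures are
hypothetical, and the implGCMCrypt stub merely stands in for the real
combined AES-CTR + GHASH routine that the hotspot intrinsic replaces.

final class GcmSplitSketch {
    private static final int SPLIT_LEN = 1 << 20; // 1 MiB per segment

    // Walk a large input in bounded segments so the intrinsic candidate
    // is invoked frequently enough to be compiled, even within a single
    // large doFinal call.
    static int crypt(byte[] in, int inOfs, int inLen, byte[] out, int outOfs) {
        int processed = 0;
        while (inLen - processed > SPLIT_LEN) {
            processed += implGCMCrypt(
                    in, inOfs + processed, SPLIT_LEN, out, outOfs + processed);
        }
        return processed + implGCMCrypt(
                in, inOfs + processed, inLen - processed, out, outOfs + processed);
    }

    // Stand-in stub; the real method does the encryption and GHASH work.
    private static int implGCMCrypt(byte[] in, int inOfs, int len,
            byte[] out, int outOfs) {
        System.arraycopy(in, inOfs, out, outOfs, len);
        return len;
    }
}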
I’ve intentionally avoided suggesting specific solutions; as a layperson
I don’t feel confident making explicit recommendations. My goal is
reliably high throughput based on the amount of work done, rather than
on the size of individual operations. As a user, native implementations
like tcnative <https://tomcat.apache.org/native-doc/> and Conscrypt
<https://github.com/google/conscrypt> provide the performance
characteristics I’m looking for, but without the reliability or
flexibility of OpenJDK JSSE. Is there a solution which allows us to get
the best of both worlds?
Thanks,
Carter Kozak