Continuing a conversation I had with Sean Mullan at Java One, for a broader 
audience.

We tend to believe that bulk operations are good. Large bulk operations give 
the system the most information at once, allowing it to make more informed 
decisions. With some understanding of the hotspot compiler and how the 
security components interact with it, the observed performance degradation 
makes sense, but I don't think it's obvious or desirable to most of those 
using the JDK. As the industry shifts toward shorter-lived, horizontally 
scalable instances, it becomes more important than ever to deliver 
cryptography performance consistently and early.

Encryption in Java is usually fast: around 2-3 GiB/second per core using the 
default OpenJDK JSSE provider on my test system. However, when developers use 
larger buffers (~10 MiB, perhaps large for networking/TLS, but reasonable for 
local data), I observe throughput drop to 60 MiB/second, between 2 and 3 
percent of the expected throughput!
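To make the observation concrete, here is a minimal sketch (not the JMH harness used for the numbers below; class and method names are my own) that times a single one-shot AES/GCM encryption. On a cold JVM, a single large `doFinal` call like this lands squarely in the uncompiled path described above:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

public class GcmThroughputDemo {

    // Encrypts numBytes of zeros with a fresh AES-256 key and random IV in a
    // single doFinal call, returning the ciphertext (plaintext + 16-byte tag).
    static byte[] encryptOnce(int numBytes) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(new byte[numBytes]);
    }

    public static void main(String[] args) throws Exception {
        // Naive wall-clock timing; use JMH for trustworthy numbers.
        for (int size : new int[] {1 << 20, 10 << 20}) {
            long start = System.nanoTime();
            encryptOnce(size);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d bytes: %.1f MiB/s%n",
                    size, size / 1048576.0 / seconds);
        }
    }
}
```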

Results from https://github.com/carterkozak/java-crypto-buffer-performance:
Benchmark                             (cipher)  (numBytes)  (writeStrategy)   Mode  Cnt     Score     Error  Units
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2215.898 ± 185.661  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4     6.427 ±   0.475  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     0.620 ±   0.096  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2933.808 ±  17.538  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4    31.775 ±   1.898  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     3.174 ±   0.171  ops/s

Using `AES/GCM/NoPadding`, large buffers result in a great deal of work within 
GHASH.processBlocks 
<https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/com/sun/crypto/provider/GHASH.java#L272-L286>, 
which is intrinsified 
<https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/hotspot/share/classfile/vmIntrinsics.hpp#L462-L466>; 
however, the intrinsic isn't used because the method is called infrequently, 
so a tremendous amount of work occurs within the default implementation. Notes 
from my initial investigation (with flame graphs) are here 
<https://github.com/palantir/hadoop-crypto/pull/586#issuecomment-964394587>. 
When I introduce a wrapper to chunk input buffers into 16 KiB segments (other 
sizes tested here 
<https://github.com/palantir/hadoop-crypto/pull/586#issue-1047810949>), we can 
effectively force the method to warm up, and it performs nearly two orders of 
magnitude better:

https://github.com/carterkozak/java-crypto-buffer-performance#jdk-17
Benchmark                             (cipher)  (numBytes)  (writeStrategy)   Mode  Cnt     Score     Error  Units
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2215.898 ± 185.661  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding     1048576          CHUNKED  thrpt    4  2516.770 ± 193.009  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4     6.427 ±   0.475  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding    10485760          CHUNKED  thrpt    4   246.956 ±  51.193  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     0.620 ±   0.096  ops/s
EncryptionBenchmark.encrypt  AES/GCM/NoPadding   104857600          CHUNKED  thrpt    4    24.633 ±   2.784  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576    ENTIRE_BUFFER  thrpt    4  2933.808 ±  17.538  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding     1048576          CHUNKED  thrpt    4  3277.374 ± 569.573  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760    ENTIRE_BUFFER  thrpt    4    31.775 ±   1.898  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding    10485760          CHUNKED  thrpt    4   332.873 ±  55.589  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600    ENTIRE_BUFFER  thrpt    4     3.174 ±   0.171  ops/s
EncryptionBenchmark.encrypt  AES/CTR/NoPadding   104857600          CHUNKED  thrpt    4    33.909 ±   1.675  ops/s
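The chunking wrapper can be sketched roughly as follows (class and method names are my own, not from the linked repository). The idea is simply to feed the cipher many small `update` calls rather than one large `doFinal`, so hotspot sees the underlying methods invoked frequently and compiles them early; because GCM and CTR are streaming modes, the resulting ciphertext is byte-for-byte identical to a one-shot call:

```java
import javax.crypto.Cipher;
import java.io.ByteArrayOutputStream;

public class ChunkedCipher {
    private static final int CHUNK_SIZE = 16 * 1024; // 16 KiB segments

    // Encrypts plaintext through many small Cipher.update calls followed by
    // a final doFinal, instead of a single doFinal over the whole buffer.
    static byte[] encryptChunked(Cipher cipher, byte[] plaintext) throws Exception {
        // Reserve room for the input plus a 16-byte GCM tag.
        ByteArrayOutputStream out = new ByteArrayOutputStream(plaintext.length + 16);
        for (int offset = 0; offset < plaintext.length; offset += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, plaintext.length - offset);
            byte[] segment = cipher.update(plaintext, offset, len);
            if (segment != null) { // update may buffer and return nothing
                out.write(segment);
            }
        }
        out.write(cipher.doFinal()); // flushes remaining data (and the tag, for GCM)
        return out.toByteArray();
    }
}
```

A production version would avoid the intermediate copies by writing into a caller-supplied output buffer, but the warm-up effect comes purely from the call frequency.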

The 10 MiB full-buffer benchmark is eventually partially optimized after ~3 
minutes of encryption (~10 GiB of data); in practice, however, warm-up takes 
much longer because the encrypted data must also be written somewhere, 
potentially leading to rubber-banding over a network.

While writing this up I re-ran my investigation using JDK-19 
<https://github.com/carterkozak/java-crypto-buffer-performance#jdk-19> and 
found, to my surprise, that AES/GCM performed substantially better, warming up 
quickly, while AES/CTR performance was largely equivalent! It turns out that 
JDK-8273297 <https://bugs.openjdk.org/browse/JDK-8273297>, which aimed to 
improve the performance of an intrinsic, has the side effect of allowing the 
intrinsic to be used much sooner by segmenting inputs into 1 MB chunks 
<https://github.com/openjdk/jdk/commit/13e9ea9e922030927775345b1abde1313a6ec03f#diff-a533e78f757a3ad64a8d2453bea64a0d426890ef85031452bf74070776ad8be0R575-R596>.

I've intentionally avoided suggesting specific solutions; as a layperson, I 
don't feel confident making explicit recommendations. My goal is reliably 
high throughput that scales with the amount of work done rather than the size 
of individual operations. As a user, native implementations like tcnative 
<https://tomcat.apache.org/native-doc/> and Conscrypt 
<https://github.com/google/conscrypt> provide the performance characteristics 
I'm looking for, but without the reliability or flexibility of the OpenJDK 
JSSE. Is there a solution which allows us to get the best of both worlds?

Thanks,
Carter Kozak
