We've started running our usual suite of performance tests against Kafka 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to run a fairly normal mixed workload of producers and consumers (each producer/consumer is just an instance of Kafka's built-in consumer/producer perf tests). We've found about a 33% performance drop in the producer when TLS is used (compared to 0.9.0.1).
We've seen a notable producer performance degradation between 0.9.0.1 and 0.10.0.0 RC. We're running as of commit 9404680 right now. Our specific test case runs Kafka on 8 EC2 machines with enhanced networking. Nothing is changed between the instances, and I've reproduced this over 4 different sets of clusters now. Please note that this doesn't match up with https://issues.apache.org/jira/browse/KAFKA-3565, because our performance tests run with compression off, and this seems to be a TLS-only issue.

Under 0.10.0-rc4, an 8 node cluster with a replication factor of 3 and 13 producers maxes out at around 1 million 100 byte messages a second. Under 0.9.0.1, the same cluster does 1.5 million messages a second, which is a drop of about 33%. Both tests were with TLS on. I've reproduced this on multiple clusters now (5 or so of each version) to account for the inherent performance variance of EC2. There's no notable performance difference without TLS on these runs, so it appears to be entirely a TLS regression.

A single producer with TLS does about 75k messages/s under 0.10 and around 120k messages/s under 0.9.0.1.

The exact producer-perf line we're using is this:

    bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" --record-size "100" --throughput "100" --producer-props acks="-1" bootstrap.servers=REDACTED ssl.keystore.location=client.jks ssl.keystore.password=REDACTED ssl.truststore.location=server.jks ssl.truststore.password=REDACTED ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL

We're using the same setup, machine type etc. for each test run. We've tried both 0.9.0.1 producers and 0.10.0.0 producers, and the TLS performance impact was there for both.
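As a quick sanity check on the figures quoted above, here is a minimal Python sketch that works out the relative drop for the cluster-wide and single-producer numbers. The helper name percent_drop is just something made up for illustration, and the inputs are the rounded throughput figures from this mail rather than raw measurements.

```python
def percent_drop(before: float, after: float) -> float:
    """Relative throughput drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


# Rounded figures quoted in this mail (messages per second, TLS on).
cluster_0_9 = 1_500_000   # 8 node cluster, 13 producers, 0.9.0.1
cluster_0_10 = 1_000_000  # same cluster, 0.10.0-rc4
single_0_9 = 120_000      # single producer, 0.9.0.1
single_0_10 = 75_000      # single producer, 0.10

print(f"cluster drop: {percent_drop(cluster_0_9, cluster_0_10):.1f}%")          # 33.3%
print(f"single producer drop: {percent_drop(single_0_9, single_0_10):.1f}%")    # 37.5%
```

So the cluster-wide numbers line up with the roughly 33% drop mentioned above, and the single-producer numbers come out a little worse, at about 37.5%.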
I've glanced over the code changes between 0.9.0.1 and 0.10.0.0 and haven't seen anything that looks like it would have this kind of impact. Indeed, the TLS code doesn't seem to have changed much between the two versions. Any thoughts? Should I file an issue and see about reproducing this with a more minimal test case? I don't think this is related to https://issues.apache.org/jira/browse/KAFKA-3565 - that one is for compression on and plaintext, and this is for TLS only.