Tom,

Maybe it was mentioned and I missed it. I am wondering whether you see performance degradation on the consumer side when TLS is used? This could help us understand whether the issue is only producer-related or affects TLS in general.

Thanks,

Jiangjie (Becket) Qin

On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayf...@heroku.com> wrote:

> Ismael,
>
> Thanks. I'm writing up an issue with some new findings since yesterday
> right now.
>
> Thanks
>
> Tom
>
> On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>
> > Hi Tom,
> >
> > That's because JIRA is in lockdown due to excessive spam. I have added
> > you as a contributor in JIRA and you should be able to file a ticket now.
> >
> > Thanks,
> > Ismael
> >
> > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tcrayf...@heroku.com>
> > wrote:
> >
> > > Ok, I don't seem to be able to file a new Jira issue at all. Can
> > > somebody check my permissions on Jira? My user is `tcrayford-heroku`
> > >
> > > Tom Crayford
> > > Heroku Kafka
> > >
> > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Tom,
> > > >
> > > > We don't have a CSV metrics reporter in the producer right now. The
> > > > metrics will be available in JMX. You can find the details in
> > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <tcrayf...@heroku.com>
> > > > wrote:
> > > >
> > > > > Yep, I can try those particular commits tomorrow. Before I try a
> > > > > bisect, I'm going to replicate with a smaller-scale perf test that
> > > > > is less intensive to iterate on.
> > > > >
> > > > > Jun, inline:
> > > > >
> > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Tom,
> > > > > >
> > > > > > Thanks for reporting this. A few quick comments.
> > > > > >
> > > > > > 1. Did you send the right command for producer-perf? The command
> > > > > > limits the throughput to 100 msgs/sec. So, not sure how a single
> > > > > > producer can get 75K msgs/sec.
> > > > >
> > > > > Ah yep, wrong commands.
> > > > > I'll get the right one tomorrow. Sorry, was interpolating
> > > > > variables into a shell script.
> > > > >
> > > > > > 2. Could you collect some stats (e.g. average batch size) in the
> > > > > > producer and see if there is any noticeable difference between
> > > > > > 0.9 and 0.10?
> > > > >
> > > > > That'd just be hooking up the CSV metrics reporter right?
> > > > >
> > > > > > 3. Is the broker-to-broker communication also on SSL? Could you do
> > > > > > another test with replication factor 1 and see if you still see
> > > > > > the degradation?
> > > > >
> > > > > Interbroker replication is always SSL in all test runs so far. I can
> > > > > try with replication factor 1 tomorrow.
> > > > >
> > > > > > Finally, email is probably not the best way to discuss performance
> > > > > > results. If you have more of them, could you create a jira and
> > > > > > attach your findings there?
> > > > >
> > > > > Yep. I only wrote the email because JIRA was in lockdown mode and I
> > > > > couldn't create new issues.
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford
> > > > > > <tcrayf...@heroku.com> wrote:
> > > > > >
> > > > > > > We've started running our usual suite of performance tests
> > > > > > > against Kafka 0.10.0.0 RC. These tests orchestrate multiple
> > > > > > > consumer/producer machines to run a fairly normal mixed workload
> > > > > > > of producers and consumers (each producer/consumer are just
> > > > > > > instances of kafka's inbuilt consumer/producer perf tests).
> > > > > > > We've found about a 33% performance drop in the producer if TLS
> > > > > > > is used (compared to 0.9.0.1).
> > > > > > >
> > > > > > > We've seen notable producer performance degradations between
> > > > > > > 0.9.0.1 and 0.10.0.0 RC. We're running as of the commit 9404680
> > > > > > > right now.
> > > > > > >
> > > > > > > Our specific test case runs Kafka on 8 EC2 machines, with
> > > > > > > enhanced networking. Nothing is changed between the instances,
> > > > > > > and I've reproduced this over 4 different sets of clusters now.
> > > > > > > We're seeing about a 33% performance drop between 0.9.0.1 and
> > > > > > > 0.10.0.0 as of commit 9404680. Please note that this doesn't
> > > > > > > match up with https://issues.apache.org/jira/browse/KAFKA-3565,
> > > > > > > because our performance tests are with compression off, and this
> > > > > > > seems to be a TLS-only issue.
> > > > > > >
> > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with replication
> > > > > > > factor of 3, and 13 producers max out at around 1 million 100
> > > > > > > byte messages a second. Under 0.9.0.1, the same cluster does 1.5
> > > > > > > million messages a second. Both tests were with TLS on. I've
> > > > > > > reproduced this on multiple clusters now (5 or so of each
> > > > > > > version) to account for the inherent performance variance of
> > > > > > > EC2. There's no notable performance difference without TLS on
> > > > > > > these runs - it appears to be a TLS regression entirely.
> > > > > > >
> > > > > > > A single producer with TLS under 0.10 does about 75k messages/s.
> > > > > > > Under 0.9.0.1 it does around 120k messages/s.
> > > > > > >
> > > > > > > The exact producer-perf line we're using is this:
> > > > > > >
> > > > > > > bin/kafka-producer-perf-test --topic "bench" --num-records
> > > > > > > "500000000" --record-size "100" --throughput "100"
> > > > > > > --producer-props acks="-1" bootstrap.servers=REDACTED
> > > > > > > ssl.keystore.location=client.jks ssl.keystore.password=REDACTED
> > > > > > > ssl.truststore.location=server.jks
> > > > > > > ssl.truststore.password=REDACTED
> > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> > > > > > > security.protocol=SSL
> > > > > > >
> > > > > > > We're using the same setup, machine type etc for each test run.
> > > > > > >
> > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers
> > > > > > > and the TLS performance impact was there for both.
> > > > > > >
> > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and
> > > > > > > haven't seen anything that seemed to have this kind of impact -
> > > > > > > indeed the TLS code doesn't seem to have changed much between
> > > > > > > 0.9.0.1 and 0.10.0.0.
> > > > > > >
> > > > > > > Any thoughts? Should I file an issue and see about reproducing a
> > > > > > > more minimal test case?
> > > > > > >
> > > > > > > I don't think this is related to
> > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565 - that is for
> > > > > > > compression on and plaintext, and this is for TLS only.