Hi,

We (Ismael, Magnus and Jun) are also trying to reproduce and figure it out
on our side. Will keep you posted.

Gwen

On Fri, May 13, 2016 at 11:32 AM, Tom Crayford <tcrayf...@heroku.com> wrote:

> I've been investigating this pretty hard since I first noticed it. Right
> now I have more avenues for investigation than I can shake a stick at, and
> am also dealing with several other things in flight/on fire. I'll respond
> when I have more information and can confirm things.
>
> On Fri, May 13, 2016 at 6:30 PM, Becket Qin <becket....@gmail.com> wrote:
>
> > Tom,
> >
> > Maybe it was mentioned and I missed it. I am wondering if you see
> > performance degradation on the consumer side when TLS is used? This
> > could help us understand whether the issue is only producer related or
> > TLS in general.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayf...@heroku.com>
> > wrote:
> >
> > > Ismael,
> > >
> > > Thanks. I'm writing up an issue with some new findings since yesterday
> > > right now.
> > >
> > > Thanks
> > >
> > > Tom
> > >
> > > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> > >
> > > > Hi Tom,
> > > >
> > > > That's because JIRA is in lockdown due to excessive spam. I have
> > > > added you as a contributor in JIRA and you should be able to file a
> > > > ticket now.
> > > >
> > > > Thanks,
> > > > Ismael
> > > >
> > > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tcrayf...@heroku.com>
> > > > wrote:
> > > >
> > > > > Ok, I don't seem to be able to file a new Jira issue at all. Can
> > > > > somebody check my permissions on Jira? My user is `tcrayford-heroku`
> > > > >
> > > > > Tom Crayford
> > > > > Heroku Kafka
> > > > >
> > > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Tom,
> > > > > >
> > > > > > We don't have a CSV metrics reporter in the producer right now.
> > > > > > The metrics will be available in JMX.
> > > > > > You can find out the details in
> > > > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford
> > > > > > <tcrayf...@heroku.com> wrote:
> > > > > >
> > > > > > > Yep, I can try those particular commits tomorrow. Before I try a
> > > > > > > bisect, I'm going to replicate with a smaller-scale perf test
> > > > > > > that is less intensive to iterate on.
> > > > > > >
> > > > > > > Jun, inline:
> > > > > > >
> > > > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Tom,
> > > > > > > >
> > > > > > > > Thanks for reporting this. A few quick comments.
> > > > > > > >
> > > > > > > > 1. Did you send the right command for producer-perf? The
> > > > > > > > command limits the throughput to 100 msgs/sec. So, not sure
> > > > > > > > how a single producer can get 75K msgs/sec.
> > > > > > >
> > > > > > > Ah yep, wrong commands. I'll get the right one tomorrow. Sorry,
> > > > > > > was interpolating variables into a shell script.
> > > > > > >
> > > > > > > > 2. Could you collect some stats (e.g. average batch size) in
> > > > > > > > the producer and see if there is any noticeable difference
> > > > > > > > between 0.9 and 0.10?
> > > > > > >
> > > > > > > That'd just be hooking up the CSV metrics reporter right?
> > > > > > >
> > > > > > > > 3. Is the broker-to-broker communication also on SSL? Could
> > > > > > > > you do another test with replication factor 1 and see if you
> > > > > > > > still see the degradation?
> > > > > > >
> > > > > > > Interbroker replication is always SSL in all test runs so far. I
> > > > > > > can try with replication factor 1 tomorrow.
> > > > > > >
> > > > > > > > Finally, email is probably not the best way to discuss
> > > > > > > > performance results. If you have more of them, could you
> > > > > > > > create a jira and attach your findings there?
> > > > > > >
> > > > > > > Yep. I only wrote the email because JIRA was in lockdown mode
> > > > > > > and I couldn't create new issues.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford
> > > > > > > > <tcrayf...@heroku.com> wrote:
> > > > > > > >
> > > > > > > > > We've started running our usual suite of performance tests
> > > > > > > > > against Kafka 0.10.0.0 RC. These tests orchestrate multiple
> > > > > > > > > consumer/producer machines to run a fairly normal mixed
> > > > > > > > > workload of producers and consumers (each producer/consumer
> > > > > > > > > is just an instance of Kafka's inbuilt consumer/producer
> > > > > > > > > perf tests). We've found about a 33% performance drop in the
> > > > > > > > > producer if TLS is used (compared to 0.9.0.1).
> > > > > > > > >
> > > > > > > > > We've seen notable producer performance degradations between
> > > > > > > > > 0.9.0.1 and 0.10.0.0 RC. We're running as of the commit
> > > > > > > > > 9404680 right now.
> > > > > > > > >
> > > > > > > > > Our specific test case runs Kafka on 8 EC2 machines, with
> > > > > > > > > enhanced networking. Nothing is changed between the
> > > > > > > > > instances, and I've reproduced this over 4 different sets of
> > > > > > > > > clusters now. We're seeing about a 33% performance drop
> > > > > > > > > between 0.9.0.1 and 0.10.0.0 as of commit 9404680.
> > > > > > > > > Please note that this doesn't match up with
> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565, because
> > > > > > > > > our performance tests are with compression off, and this
> > > > > > > > > seems to be a TLS-only issue.
> > > > > > > > >
> > > > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with replication
> > > > > > > > > factor of 3, and 13 producers max out at around 1 million
> > > > > > > > > 100 byte messages a second. Under 0.9.0.1, the same cluster
> > > > > > > > > does 1.5 million messages a second. Both tests were with TLS
> > > > > > > > > on. I've reproduced this on multiple clusters now (5 or so
> > > > > > > > > of each version) to account for the inherent performance
> > > > > > > > > variance of EC2. There's no notable performance difference
> > > > > > > > > without TLS on these runs - it appears to be a TLS
> > > > > > > > > regression entirely.
> > > > > > > > >
> > > > > > > > > A single producer with TLS under 0.10 does about 75k
> > > > > > > > > messages/s. Under 0.9.0.1 it does around 120k messages/s.
> > > > > > > > >
> > > > > > > > > The exact producer-perf line we're using is this:
> > > > > > > > >
> > > > > > > > > bin/kafka-producer-perf-test --topic "bench"
> > > > > > > > >   --num-records "500000000" --record-size "100"
> > > > > > > > >   --throughput "100" --producer-props acks="-1"
> > > > > > > > >   bootstrap.servers=REDACTED ssl.keystore.location=client.jks
> > > > > > > > >   ssl.keystore.password=REDACTED
> > > > > > > > >   ssl.truststore.location=server.jks
> > > > > > > > >   ssl.truststore.password=REDACTED
> > > > > > > > >   ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> > > > > > > > >   security.protocol=SSL
> > > > > > > > >
> > > > > > > > > We're using the same setup, machine type etc for each test run.
> > > > > > > > >
> > > > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0
> > > > > > > > > producers and the TLS performance impact was there for both.
> > > > > > > > >
> > > > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and
> > > > > > > > > haven't seen anything that seemed to have this kind of
> > > > > > > > > impact - indeed the TLS code doesn't seem to have changed
> > > > > > > > > much between 0.9.0.1 and 0.10.0.0.
> > > > > > > > >
> > > > > > > > > Any thoughts? Should I file an issue and see about
> > > > > > > > > reproducing a more minimal test case?
> > > > > > > > >
> > > > > > > > > I don't think this is related to
> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565 - that is
> > > > > > > > > for compression on and plaintext, and this is for TLS only.
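
For anyone checking the "about 33%" figure against the raw numbers quoted in the original report, here is a minimal sketch of the arithmetic. The helper name `relative_drop` is made up for illustration and is not part of any Kafka tooling; the rates are the approximate ones quoted above (TLS on), not new measurements.

```python
def relative_drop(old_rate: float, new_rate: float) -> float:
    """Fractional throughput drop going from old_rate to new_rate.

    Hypothetical helper for this thread only; not part of Kafka.
    """
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate


# Approximate figures quoted in the thread, in messages per second (TLS on).
# 8-node cluster, 13 producers: 0.9.0.1 vs 0.10.0-rc4
cluster_drop = relative_drop(1_500_000, 1_000_000)
# Single producer: 0.9.0.1 vs 0.10
single_drop = relative_drop(120_000, 75_000)

print(f"cluster: {cluster_drop:.1%}")          # cluster: 33.3%
print(f"single producer: {single_drop:.1%}")   # single producer: 37.5%
```

So the cluster-wide numbers match the 33% quoted, and the single-producer numbers suggest a slightly larger drop, bearing in mind that both sets of rates are rounded in the report.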