Hi Tom,

Great to hear that the failure testing scenario went well. :)
Your suggested improvement sounds good to me and a PR would be great. For this kind of change, you can skip the JIRA; just prefix the PR title with `MINOR:`.

Thanks,
Ismael

On Sun, May 15, 2016 at 9:17 PM, Tom Crayford <tcrayf...@heroku.com> wrote:

> How about this?

> <b>Note:</b> Due to the additional timestamp introduced in each message (8 bytes of data), producers sending small messages may see a message throughput degradation because of the increased overhead. Likewise, replication now transmits an additional 8 bytes per message. If you're running close to the network capacity of your cluster, it's possible that you'll overwhelm the network cards and see failures and performance issues due to the overload.

> When receiving compressed messages, 0.10.0 brokers avoid recompressing the messages, which in general reduces the latency and improves the throughput. In certain cases, this may reduce the batching size on the producer, which could lead to worse throughput. If this happens, users can tune linger.ms and batch.size of the producer for better throughput.

> Would you like a Jira/PR with this kind of change so we can discuss it in a more convenient format?

> Re our failure testing scenario: Kafka 0.10 RC behaves exactly the same under failure as 0.9 - the controller typically shifts the leader in around 2 seconds or so, and the benchmark sees a small drop in throughput during that, then another drop whilst the replacement broker comes back up to speed. So, overall we're extremely happy and excited for this release! Thanks to the committers and maintainers for all their hard work.

> On Sun, May 15, 2016 at 9:03 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> > Hi Tom,

> > Thanks for the update and for all the testing you have done! No worries about the chase here, I'd much rather have false positives by people who are validating the releases than false negatives because people don't validate the releases. :)

> > The upgrade note we currently have follows:

> > https://github.com/apache/kafka/blob/0.10.0/docs/upgrade.html#L67

> > Please feel free to suggest improvements.

> > Thanks,
> > Ismael

> > On Sun, May 15, 2016 at 6:39 PM, Tom Crayford <tcrayf...@heroku.com> wrote:

> > > I've been digging into this some more. It seems like this may have been an issue with the benchmarks maxing out the network card - under the 0.10.0.0 RC, the slight additional bandwidth per message seems to have pushed the broker's NIC into overload territory where it starts dropping packets (verified with ifconfig on each broker). This leads to it not being able to talk to Zookeeper properly, which leads to OfflinePartitions, which then causes issues with the benchmark's validity, as throughput drops a lot when brokers are flapping in and out of being online. Because 0.9.0.1 sends 8 bytes less per message, the broker's NIC can sustain more messages/s. There was an "alignment" issue with the benchmarks here - under 0.9 we were *just* at the barrier of the brokers' NICs sustaining the traffic, and under 0.10 we pushed over that (at 1.5 million messages/s, 8 bytes extra per message is an extra 36 MB/s with replication factor 3 [if my math is right, and that's before SSL encryption, which may add overhead], which is as much as an additional producer machine).
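
For what it's worth, the arithmetic above checks out. A quick back-of-envelope check in shell, using the 1.5 million msgs/s rate and replication factor 3 quoted in the paragraph above (the snippet is purely illustrative):

    # Extra bytes/sec added by the 8-byte timestamp at 1.5M msgs/s,
    # multiplied by replication factor 3 (assumes every replica carries the full message).
    echo $(( 1500000 * 8 * 3 ))   # prints 36000000, i.e. roughly 36 MB/s of additional traffic
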
> > > The dropped packets and the flapping weren't causing notable timeout issues in the producer, but looking at the metrics on the brokers, offline partitions were clearly being triggered, and the broker logs show ZK session timeouts. This is consistent with earlier benchmarking experience - the number of producers we were running under 0.9.0.1 was carefully selected to be just under the limit here.

> > > The other benchmark issue I reported, between two single producers, was caused by a "performance of the producer machine" problem that I wasn't properly aware of. Apologies there.

> > > I've done benchmarks now where I limit the producer throughput (via --throughput) to slightly below what the NICs can sustain, and have seen no notable performance or stability difference between 0.10 and 0.9.0.1 as long as you stay under the limits of the network interfaces. All of the clusters I have tested happily keep up a benchmark at this rate for 6 hours under both 0.9.0.1 and 0.10.0.0. I've also verified that our clusters are entirely network bound in these producer benchmarking scenarios - the disks and CPU/memory have a bunch of remaining capacity.

> > > This was pretty hard to verify fully, which is why I've taken so long to reply. All in all I think the result here is expected and not a blocker for release, but a good thing to note on upgrades - if folks are running at the limit of their network cards (which you never want to do anyway, but benchmarking scenarios often uncover those limits), they'll see issues due to increased replication and producer traffic under 0.10.0.0.

> > > Apologies for the chase here - this distinctly seemed like a real issue, and one I (and I think everybody else) would have wanted to block the release on. I'm going to move on to our "failure" testing, in which we run the same performance benchmarks whilst causing a hard kill on the node. We've seen very good results for that under 0.9 and hopefully they'll continue under 0.10.

> > > On Sat, May 14, 2016 at 1:33 AM, Gwen Shapira <g...@confluent.io> wrote:

> > > > also, perhaps sharing the broker configuration? maybe this will provide some hints...

> > > > On Fri, May 13, 2016 at 5:31 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> > > > > Thanks Tom. I just wanted to share that I have been unable to reproduce this so far. Please feel free to share whatever information you have so far when you have a chance, don't feel that you need to have all the answers.

> > > > > Ismael

> > > > > On Fri, May 13, 2016 at 7:32 PM, Tom Crayford <tcrayf...@heroku.com> wrote:

> > > > > > I've been investigating this pretty hard since I first noticed it. Right now I have more avenues for investigation than I can shake a stick at, and am also dealing with several other things in flight/on fire. I'll respond when I have more information and can confirm things.

> > > > > > On Fri, May 13, 2016 at 6:30 PM, Becket Qin <becket....@gmail.com> wrote:

> > > > > > > Tom,

> > > > > > > Maybe it is mentioned and I missed it.
> > > > > > > I am wondering if you see performance degradation on the consumer side when TLS is used? This could help us understand whether the issue is only producer-related or TLS in general.

> > > > > > > Thanks,
> > > > > > > Jiangjie (Becket) Qin

> > > > > > > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayf...@heroku.com> wrote:

> > > > > > > > Ismael,

> > > > > > > > Thanks. I'm writing up an issue with some new findings since yesterday right now.

> > > > > > > > Thanks
> > > > > > > > Tom

> > > > > > > > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> > > > > > > > > Hi Tom,

> > > > > > > > > That's because JIRA is in lockdown due to excessive spam. I have added you as a contributor in JIRA and you should be able to file a ticket now.

> > > > > > > > > Thanks,
> > > > > > > > > Ismael

> > > > > > > > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tcrayf...@heroku.com> wrote:

> > > > > > > > > > Ok, I don't seem to be able to file a new Jira issue at all. Can somebody check my permissions on Jira? My user is `tcrayford-heroku`.

> > > > > > > > > > Tom Crayford
> > > > > > > > > > Heroku Kafka

> > > > > > > > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <j...@confluent.io> wrote:

> > > > > > > > > > > Tom,

> > > > > > > > > > > We don't have a CSV metrics reporter in the producer right now. The metrics will be available in JMX. You can find out the details in http://kafka.apache.org/documentation.html#new_producer_monitoring

> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jun

> > > > > > > > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <tcrayf...@heroku.com> wrote:

> > > > > > > > > > > > Yep, I can try those particular commits tomorrow. Before I try a bisect, I'm going to replicate this with a smaller-scale, less intensive perf test that's easier to iterate on.

> > > > > > > > > > > > Jun, inline:

> > > > > > > > > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io> wrote:

> > > > > > > > > > > > > Tom,

> > > > > > > > > > > > > Thanks for reporting this. A few quick comments.

> > > > > > > > > > > > > 1. Did you send the right command for producer-perf? The command limits the throughput to 100 msgs/sec. So, not sure how a single producer can get 75K msgs/sec.

> > > > > > > > > > > > Ah yep, wrong commands. I'll get the right one tomorrow.
> > > > > > > > > > > > Sorry, I was interpolating variables into a shell script.

> > > > > > > > > > > > > 2. Could you collect some stats (e.g. average batch size) in the producer and see if there is any noticeable difference between 0.9 and 0.10?

> > > > > > > > > > > > That'd just be hooking up the CSV metrics reporter, right?

> > > > > > > > > > > > > 3. Is the broker-to-broker communication also on SSL? Could you do another test with replication factor 1 and see if you still see the degradation?

> > > > > > > > > > > > Interbroker replication is always SSL in all test runs so far. I can try with replication factor 1 tomorrow.

> > > > > > > > > > > > > Finally, email is probably not the best way to discuss performance results. If you have more of them, could you create a jira and attach your findings there?

> > > > > > > > > > > > Yep. I only wrote the email because JIRA was in lockdown mode and I couldn't create new issues.

> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Jun

> > > > > > > > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tcrayf...@heroku.com> wrote:

> > > > > > > > > > > > > > We've started running our usual suite of performance tests against the Kafka 0.10.0.0 RC. These tests orchestrate multiple consumer/producer machines to run a fairly normal mixed workload of producers and consumers (each producer/consumer is just an instance of Kafka's inbuilt consumer/producer perf tests). We've found about a 33% performance drop in the producer if TLS is used (compared to 0.9.0.1).

> > > > > > > > > > > > > > We've seen notable producer performance degradations between 0.9.0.1 and the 0.10.0.0 RC. We're running as of commit 9404680 right now.

> > > > > > > > > > > > > > Our specific test case runs Kafka on 8 EC2 machines, with enhanced networking.
> > > > > > > > > > > > > > Nothing is changed between the instances, and I've reproduced this over 4 different sets of clusters now. We're seeing about a 33% performance drop between 0.9.0.1 and 0.10.0.0 as of commit 9404680. Please note that this doesn't match up with https://issues.apache.org/jira/browse/KAFKA-3565, because our performance tests are with compression off, and this seems to be a TLS-only issue.

> > > > > > > > > > > > > > Under 0.10.0-rc4, an 8 node cluster with replication factor 3 and 13 producers maxes out at around 1 million 100 byte messages a second. Under 0.9.0.1, the same cluster does 1.5 million messages a second. Both tests were with TLS on. I've reproduced this on multiple clusters now (5 or so of each version) to account for the inherent performance variance of EC2. There's no notable performance difference without TLS on these runs - it appears to be entirely a TLS regression.

> > > > > > > > > > > > > > A single producer with TLS under 0.10 does about 75k messages/s. Under 0.9.0.1 it does around 120k messages/s.

> > > > > > > > > > > > > > The exact producer-perf line we're using is this:

> > > > > > > > > > > > > > bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" --record-size "100" --throughput "100" --producer-props acks="-1" bootstrap.servers=REDACTED ssl.keystore.location=client.jks ssl.keystore.password=REDACTED ssl.truststore.location=server.jks ssl.truststore.password=REDACTED ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL

> > > > > > > > > > > > > > We're using the same setup, machine type etc. for each test run.

> > > > > > > > > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers and the TLS performance impact was there for both.
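
As Jun points out further up the thread and Tom acknowledges there, the --throughput "100" in the command above caps the tool at roughly 100 msgs/sec, and the corrected command never made it into the thread. Purely as a sketch of what an unthrottled run would presumably look like (the -1 value, which disables the perf tool's rate limiting, is an assumption here, and the REDACTED values remain placeholders):

    # Sketch only: the same invocation as above, but with the tool's rate limit disabled.
    bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" \
      --record-size "100" --throughput "-1" \
      --producer-props acks="-1" bootstrap.servers=REDACTED \
      ssl.keystore.location=client.jks ssl.keystore.password=REDACTED \
      ssl.truststore.location=server.jks ssl.truststore.password=REDACTED \
      ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL
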
> > > > > > > > > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and haven't seen anything that seemed to have this kind of impact - indeed, the TLS code doesn't seem to have changed much between 0.9.0.1 and 0.10.0.0.

> > > > > > > > > > > > > > Any thoughts? Should I file an issue and see about reproducing a more minimal test case?

> > > > > > > > > > > > > > I don't think this is related to https://issues.apache.org/jira/browse/KAFKA-3565 - that is for compression on and plaintext, and this is for TLS only.
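
Coming back to the linger.ms / batch.size suggestion in the proposed upgrade note at the top of this thread, a minimal illustration of what that tuning might look like when driving the perf tool. The specific values are examples only, not recommendations, and the SSL properties from the command above are omitted for brevity:

    # Illustrative only: the note's batching concern applies to compressed messages, so
    # compression is enabled here, with a larger batch.size and a small linger.ms to
    # encourage producer batching. The values are examples, not recommendations.
    bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" \
      --record-size "100" --throughput "-1" \
      --producer-props acks="-1" bootstrap.servers=REDACTED \
      compression.type=snappy linger.ms=10 batch.size=65536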