I've been digging into this some more. It seems like this may have been an
issue with the benchmarks maxing out the network card - under the 0.10.0.0
RC, the slight additional bandwidth per message seems to have pushed the
brokers' NICs into overload territory where they start dropping packets
(verified with ifconfig on each broker). That stops the brokers talking to
Zookeeper properly, which leads to OfflinePartitions, which in turn
undermines the benchmark's validity, as throughput drops a lot when brokers
are flapping in and out of being online. Because 0.9.0.1 sends 8 fewer
bytes per message, the brokers' NICs can sustain more messages/s. There was
an "alignment" issue with the benchmarks here - under 0.9 we were *just* at
the limit of what the brokers' NICs could sustain, and under 0.10 we pushed
past it (at 1.5 million messages/s, 8 extra bytes per message is an extra
36 MB/s with replication factor 3 [if my math is right, and that's before
SSL encryption, which may add further overhead], which is as much as an
additional producer machine).
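
For what it's worth, here is the back-of-the-envelope version of that math
as a quick shell check (the 1.5 million msgs/s, 8 bytes/message and
replication factor 3 are the rough figures from above, so treat the output
as illustrative):

  echo "$((1500000 * 8)) bytes/s of extra producer traffic"        # ~12 MB/s
  echo "$((1500000 * 8 * 3)) bytes/s extra counting replication"   # ~36 MB/s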

The dropped packets and the flapping weren't causing notable timeout issues
in the producer, but looking at the metrics on the brokers, offline
partitions were clearly being triggered while this was happening, and the
broker logs showed ZK session timeouts. This is consistent with earlier
benchmarking experience - the number of producers we ran under 0.9.0.1 was
carefully selected to sit just under this limit.
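
For anyone trying to reproduce this, the checks I ran were roughly along
these lines (the interface name, log path and grep strings here are
illustrative, not exact):

  # dropped packets on the broker's NIC
  ifconfig eth0 | grep -i drop

  # offline partitions, via the broker's JMX metric
  # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
  # (readable with jconsole or any other JMX client)

  # ZK session expirations in the broker log
  grep -i "zookeeper" logs/server.log | grep -i expire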

The other issue I reported, the difference between two single producers,
turned out to be a performance problem on the producer machine itself that
I wasn't aware of at the time. Apologies there.

I've now run benchmarks where I limit the producer throughput (via
--throughput) to slightly below what the NICs can sustain, and I see no
notable performance or stability difference between 0.10.0.0 and 0.9.0.1
as long as the traffic stays under the limits of the network interfaces.
All of the clusters I've tested happily sustain a benchmark at this rate
for 6 hours under both 0.9.0.1 and 0.10.0.0. I've also verified that our
clusters are entirely network bound in these producer benchmarking
scenarios - the disks and CPU/memory have plenty of remaining capacity.
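
For reference, a throttled run looks something like the command quoted
earlier in this thread, just with --throughput capped; the 90000 per
producer here is purely illustrative - pick whatever keeps the aggregate
safely below your NICs' limit:

  bin/kafka-producer-perf-test --topic "bench" --num-records "500000000" \
    --record-size "100" --throughput "90000" --producer-props acks="-1" \
    bootstrap.servers=REDACTED ssl.keystore.location=client.jks \
    ssl.keystore.password=REDACTED ssl.truststore.location=server.jks \
    ssl.truststore.password=REDACTED \
    ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 security.protocol=SSL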

This was pretty hard to verify fully, which is why I've taken so long to
reply. All in all I think the result here is expected and not a blocker for
the release, but it is a good thing to note for upgrades - if folks are
running at the limit of their network cards (which you never want to do
anyway, but benchmarking scenarios often uncover those limits), they'll see
issues under 0.10.0.0 due to the increased producer and replication
traffic.

Apologies for the chase here - this distinctly seemed like a real issue,
and one I (and I think everybody else) would have wanted to block the
release on. I'm going to move on to our "failure" testing, in which we run
the same performance benchmarks whilst hard-killing a node. We've seen very
good results for that under 0.9 and hopefully they'll continue under 0.10.

On Sat, May 14, 2016 at 1:33 AM, Gwen Shapira <g...@confluent.io> wrote:

> also, perhaps sharing the broker configuration? maybe this will
> provide some hints...
>
> On Fri, May 13, 2016 at 5:31 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> > Thanks Tom. I just wanted to share that I have been unable to reproduce
> > this so far. Please feel free to share whatever information you have
> so
> > far when you have a chance, don't feel that you need to have all the
> > answers.
> >
> > Ismael
> >
> > On Fri, May 13, 2016 at 7:32 PM, Tom Crayford <tcrayf...@heroku.com>
> wrote:
> >
> >> I've been investigating this pretty hard since I first noticed it. Right
> >> now I have more avenues for investigation than I can shake a stick at,
> and
> >> am also dealing with several other things in flight/on fire. I'll
> respond
> >> when I have more information and can confirm things.
> >>
> >> On Fri, May 13, 2016 at 6:30 PM, Becket Qin <becket....@gmail.com>
> wrote:
> >>
> >> > Tom,
> >> >
> >> > Maybe it is mentioned and I missed. I am wondering if you see
> performance
> >> > degradation on the consumer side when TLS is used? This could help us
> >> > understand whether the issue is only producer related or TLS in
> general.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayf...@heroku.com>
> >> > wrote:
> >> >
> >> > > Ismael,
> >> > >
> >> > > Thanks. I'm writing up an issue with some new findings since
> yesterday
> >> > > right now.
> >> > >
> >> > > Thanks
> >> > >
> >> > > Tom
> >> > >
> >> > > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ism...@juma.me.uk>
> >> wrote:
> >> > >
> >> > > > Hi Tom,
> >> > > >
> >> > > > That's because JIRA is in lockdown due to excessive spam. I have
> >> added
> >> > > you
> >> > > > as a contributor in JIRA and you should be able to file a ticket
> now.
> >> > > >
> >> > > > Thanks,
> >> > > > Ismael
> >> > > >
> >> > > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <
> tcrayf...@heroku.com
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Ok, I don't seem to be able to file a new Jira issue at all. Can
> >> > > somebody
> >> > > > > check my permissions on Jira? My user is `tcrayford-heroku`
> >> > > > >
> >> > > > > Tom Crayford
> >> > > > > Heroku Kafka
> >> > > > >
> >> > > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <j...@confluent.io>
> >> wrote:
> >> > > > >
> >> > > > > > Tom,
> >> > > > > >
> >> > > > > > We don't have a CSV metrics reporter in the producer right
> now.
> >> The
> >> > > > > metrics
> >> > > > > > will be available in jmx. You can find out the details in
> >> > > > > >
> >> http://kafka.apache.org/documentation.html#new_producer_monitoring
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > >
> >> > > > > > Jun
> >> > > > > >
> >> > > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <
> >> > tcrayf...@heroku.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Yep, I can try those particular commits tomorrow. Before I
> try
> >> a
> >> > > > > bisect,
> >> > > > > > > I'm going to replicate with a less intensive to iterate on
> >> > smaller
> >> > > > > scale
> >> > > > > > > perf test.
> >> > > > > > >
> >> > > > > > > Jun, inline:
> >> > > > > > >
> >> > > > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io> wrote:
> >> > > > > > >
> >> > > > > > > > Tom,
> >> > > > > > > >
> >> > > > > > > > Thanks for reporting this. A few quick comments.
> >> > > > > > > >
> >> > > > > > > > 1. Did you send the right command for producer-perf? The
> >> > command
> >> > > > > limits
> >> > > > > > > the
> >> > > > > > > > throughput to 100 msgs/sec. So, not sure how a single
> >> producer
> >> > > can
> >> > > > > get
> >> > > > > > > 75K
> >> > > > > > > > msgs/sec.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Ah yep, wrong commands. I'll get the right one tomorrow.
> Sorry,
> >> > was
> >> > > > > > > interpolating variables into a shell script.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > 2. Could you collect some stats (e.g. average batch size)
> in
> >> > the
> >> > > > > > producer
> >> > > > > > > > and see if there is any noticeable difference between 0.9
> and
> >> > > 0.10?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > That'd just be hooking up the CSV metrics reporter right?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > 3. Is the broker-to-broker communication also on SSL?
> Could
> >> you
> >> > > do
> >> > > > > > > another
> >> > > > > > > > test with replication factor 1 and see if you still see
> the
> >> > > > > > degradation?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Interbroker replication is always SSL in all test runs so
> far.
> >> I
> >> > > can
> >> > > > > try
> >> > > > > > > with replication factor 1 tomorrow.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > Finally, email is probably not the best way to discuss
> >> > > performance
> >> > > > > > > results.
> >> > > > > > > > If you have more of them, could you create a jira and
> attach
> >> > your
> >> > > > > > > findings
> >> > > > > > > > there?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Yep. I only wrote the email because JIRA was in lockdown
> mode
> >> > and I
> >> > > > > > > couldn't create new issues.
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > >
> >> > > > > > > > Jun
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <
> >> > > > tcrayf...@heroku.com
> >> > > > > > > > <javascript:;>> wrote:
> >> > > > > > > >
> >> > > > > > > > > We've started running our usual suite of performance
> tests
> >> > > > against
> >> > > > > > > Kafka
> >> > > > > > > > > 0.10.0.0 RC. These tests orchestrate multiple
> >> > consumer/producer
> >> > > > > > > machines
> >> > > > > > > > to
> >> > > > > > > > > run a fairly normal mixed workload of producers and
> >> consumers
> >> > > > (each
> >> > > > > > > > > producer/consumer are just instances of kafka's inbuilt
> >> > > > > > > consumer/producer
> >> > > > > > > > > perf tests). We've found about a 33% performance drop in
> >> the
> >> > > > > producer
> >> > > > > > > if
> >> > > > > > > > > TLS is used (compared to 0.9.0.1)
> >> > > > > > > > >
> >> > > > > > > > > We've seen notable producer performance degredations
> >> between
> >> > > > > 0.9.0.1
> >> > > > > > > and
> >> > > > > > > > > 0.10.0.0 RC. We're running as of the commit 9404680
> right
> >> > now.
> >> > > > > > > > >
> >> > > > > > > > > Our specific test case runs Kafka on 8 EC2 machines,
> with
> >> > > > enhanced
> >> > > > > > > > > networking. Nothing is changed between the instances,
> and
> >> > I've
> >> > > > > > > reproduced
> >> > > > > > > > > this over 4 different sets of clusters now. We're seeing
> >> > about
> >> > > a
> >> > > > > 33%
> >> > > > > > > > > performance drop between 0.9.0.1 and 0.10.0.0 as of
> commit
> >> > > > 9404680.
> >> > > > > > > > Please
> >> > > > > > > > > to note that this doesn't match up with
> >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565,
> because
> >> > our
> >> > > > > > > > performance
> >> > > > > > > > > tests are with compression off, and this seems to be an
> TLS
> >> > > only
> >> > > > > > issue.
> >> > > > > > > > >
> >> > > > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with
> replication
> >> > > > factor
> >> > > > > of
> >> > > > > > > 3,
> >> > > > > > > > > and 13 producers max out at around 1 million 100 byte
> >> > messages
> >> > > a
> >> > > > > > > second.
> >> > > > > > > > > Under 0.9.0.1, the same cluster does 1.5 million
> messages a
> >> > > > second.
> >> > > > > > > Both
> >> > > > > > > > > tests were with TLS on. I've reproduced this on multiple
> >> > > clusters
> >> > > > > now
> >> > > > > > > (5
> >> > > > > > > > or
> >> > > > > > > > > so of each version) to account for the inherent
> performance
> >> > > > > variance
> >> > > > > > of
> >> > > > > > > > > EC2. There's no notable performance difference without
> TLS
> >> on
> >> > > > these
> >> > > > > > > runs
> >> > > > > > > > -
> >> > > > > > > > > it appears to be an TLS regression entirely.
> >> > > > > > > > >
> >> > > > > > > > > A single producer with TLS under 0.10 does about 75k
> >> > > messages/s.
> >> > > > > > Under
> >> > > > > > > > > 0.9.0.01 it does around 120k messages/s.
> >> > > > > > > > >
> >> > > > > > > > > The exact producer-perf line we're using is this:
> >> > > > > > > > >
> >> > > > > > > > > bin/kafka-producer-perf-test --topic "bench"
> --num-records
> >> > > > > > "500000000"
> >> > > > > > > > > --record-size "100" --throughput "100" --producer-props
> >> > > acks="-1"
> >> > > > > > > > > bootstrap.servers=REDACTED
> ssl.keystore.location=client.jks
> >> > > > > > > > > ssl.keystore.password=REDACTED
> >> > > ssl.truststore.location=server.jks
> >> > > > > > > > > ssl.truststore.password=REDACTED
> >> > > > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
> >> > > security.protocol=SSL
> >> > > > > > > > >
> >> > > > > > > > > We're using the same setup, machine type etc for each
> test
> >> > run.
> >> > > > > > > > >
> >> > > > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0
> >> > producers
> >> > > > and
> >> > > > > > the
> >> > > > > > > > TLS
> >> > > > > > > > > performance impact was there for both.
> >> > > > > > > > >
> >> > > > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0
> and
> >> > > > haven't
> >> > > > > > > seen
> >> > > > > > > > > anything that seemed to have this kind of impact -
> indeed
> >> the
> >> > > TLS
> >> > > > > > code
> >> > > > > > > > > doesn't seem to have changed much between 0.9.0.1 and
> >> > 0.10.0.0.
> >> > > > > > > > >
> >> > > > > > > > > Any thoughts? Should I file an issue and see about
> >> > reproducing
> >> > > a
> >> > > > > more
> >> > > > > > > > > minimal test case?
> >> > > > > > > > >
> >> > > > > > > > > I don't think this is related to
> >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565 -
> that is
> >> > for
> >> > > > > > > > compression
> >> > > > > > > > > on and plaintext, and this is for TLS only.
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
