Re: [VOTE] 0.10.0.0 RC4

Gwen Shapira Fri, 13 May 2016 11:41:18 -0700

Hi,

We (Ismael, Magnus and Jun) are also trying to reproduce and figure it
out on our side. Will keep you posted.


Gwen

On Fri, May 13, 2016 at 11:32 AM, Tom Crayford <tcrayf...@heroku.com> wrote:
> I've been investigating this pretty hard since I first noticed it. Right
> now I have more avenues for investigation than I can shake a stick at, and
> am also dealing with several other things in flight/on fire. I'll respond
> when I have more information and can confirm things.
>
> On Fri, May 13, 2016 at 6:30 PM, Becket Qin <becket....@gmail.com> wrote:
>
>> Tom,
>>
>> Maybe this was mentioned and I missed it. I am wondering if you see
>> performance degradation on the consumer side when TLS is used? This could
>> help us understand whether the issue is only producer related or TLS in
>> general.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayf...@heroku.com>
>> wrote:
>>
>> > Ismael,
>> >
>> > Thanks. I'm writing up an issue with some new findings since yesterday
>> > right now.
>> >
>> > Thanks
>> >
>> > Tom
>> >
>> > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>> >
>> > > Hi Tom,
>> > >
>> > > That's because JIRA is in lockdown due to excessive spam. I have added
>> > > you as a contributor in JIRA and you should be able to file a ticket now.
>> > >
>> > > Thanks,
>> > > Ismael
>> > >
>> > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tcrayf...@heroku.com>
>> > > wrote:
>> > >
>> > > > Ok, I don't seem to be able to file a new Jira issue at all. Can
>> > > > somebody check my permissions on Jira? My user is `tcrayford-heroku`
>> > > >
>> > > > Tom Crayford
>> > > > Heroku Kafka
>> > > >
>> > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <j...@confluent.io> wrote:
>> > > >
>> > > > > Tom,
>> > > > >
>> > > > > We don't have a CSV metrics reporter in the producer right now. The
>> > > > > metrics will be available in jmx. You can find out the details in
>> > > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <tcrayf...@heroku.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Yep, I can try those particular commits tomorrow. Before I try a
>> > > > > > bisect, I'm going to replicate this with a smaller-scale perf test
>> > > > > > that is less intensive to iterate on.
>> > > > > >
>> > > > > > Jun, inline:
>> > > > > >
>> > > > > > On Thursday, 12 May 2016, Jun Rao <j...@confluent.io> wrote:
>> > > > > >
>> > > > > > > Tom,
>> > > > > > >
>> > > > > > > Thanks for reporting this. A few quick comments.
>> > > > > > >
>> > > > > > > 1. Did you send the right command for producer-perf? The command
>> > > > > > > limits the throughput to 100 msgs/sec. So, not sure how a single
>> > > > > > > producer can get 75K msgs/sec.
>> > > > > >
>> > > > > >
>> > > > > > Ah yep, wrong commands. I'll get the right one tomorrow. Sorry, was
>> > > > > > interpolating variables into a shell script.
>> > > > > >
>> > > > > >
>> > > > > > >
>> > > > > > > 2. Could you collect some stats (e.g. average batch size) in the
>> > > > > > > producer and see if there is any noticeable difference between
>> > > > > > > 0.9 and 0.10?
>> > > > > >
>> > > > > >
>> > > > > > That'd just be hooking up the CSV metrics reporter right?
>> > > > > >
>> > > > > >
>> > > > > > >
>> > > > > > > 3. Is the broker-to-broker communication also on SSL? Could you
>> > > > > > > do another test with replication factor 1 and see if you still
>> > > > > > > see the degradation?
>> > > > > >
>> > > > > >
>> > > > > > Interbroker replication is always SSL in all test runs so far. I
>> > > > > > can try with replication factor 1 tomorrow.
>> > > > > >
>> > > > > >
>> > > > > > >
>> > > > > > > Finally, email is probably not the best way to discuss
>> > > > > > > performance results. If you have more of them, could you create
>> > > > > > > a jira and attach your findings there?
>> > > > > >
>> > > > > >
>> > > > > > Yep. I only wrote the email because JIRA was in lockdown mode and
>> > > > > > I couldn't create new issues.
>> > > > > >
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford
>> > > > > > > <tcrayf...@heroku.com> wrote:
>> > > > > > >
>> > > > > > > > We've started running our usual suite of performance tests
>> > > > > > > > against Kafka 0.10.0.0 RC. These tests orchestrate multiple
>> > > > > > > > consumer/producer machines to run a fairly normal mixed
>> > > > > > > > workload of producers and consumers (each producer/consumer is
>> > > > > > > > just an instance of Kafka's inbuilt consumer/producer perf
>> > > > > > > > tests). We've found about a 33% performance drop in the
>> > > > > > > > producer if TLS is used (compared to 0.9.0.1).
>> > > > > > > >
>> > > > > > > > We've seen notable producer performance degradations between
>> > > > > > > > 0.9.0.1 and 0.10.0.0 RC. We're running as of the commit 9404680
>> > > > > > > > right now.
>> > > > > > > >
>> > > > > > > > Our specific test case runs Kafka on 8 EC2 machines, with
>> > > > > > > > enhanced networking. Nothing is changed between the instances,
>> > > > > > > > and I've reproduced this over 4 different sets of clusters now.
>> > > > > > > > We're seeing about a 33% performance drop between 0.9.0.1 and
>> > > > > > > > 0.10.0.0 as of commit 9404680. Please note that this doesn't
>> > > > > > > > match up with
>> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565, because our
>> > > > > > > > performance tests are with compression off, and this seems to
>> > > > > > > > be a TLS-only issue.
>> > > > > > > >
>> > > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with replication
>> > > > > > > > factor of 3, and 13 producers max out at around 1 million 100
>> > > > > > > > byte messages a second. Under 0.9.0.1, the same cluster does
>> > > > > > > > 1.5 million messages a second. Both tests were with TLS on.
>> > > > > > > > I've reproduced this on multiple clusters now (5 or so of each
>> > > > > > > > version) to account for the inherent performance variance of
>> > > > > > > > EC2. There's no notable performance difference without TLS on
>> > > > > > > > these runs - it appears to be a TLS regression entirely.
>> > > > > > > >
>> > > > > > > > A single producer with TLS under 0.10 does about 75k
>> > > > > > > > messages/s. Under 0.9.0.1 it does around 120k messages/s.
>> > > > > > > >
>> > > > > > > > The exact producer-perf line we're using is this:
>> > > > > > > >
>> > > > > > > > bin/kafka-producer-perf-test --topic "bench" \
>> > > > > > > >   --num-records "500000000" --record-size "100" \
>> > > > > > > >   --throughput "100" --producer-props acks="-1" \
>> > > > > > > >   bootstrap.servers=REDACTED \
>> > > > > > > >   ssl.keystore.location=client.jks \
>> > > > > > > >   ssl.keystore.password=REDACTED \
>> > > > > > > >   ssl.truststore.location=server.jks \
>> > > > > > > >   ssl.truststore.password=REDACTED \
>> > > > > > > >   ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 \
>> > > > > > > >   security.protocol=SSL
>> > > > > > > >
>> > > > > > > > We're using the same setup, machine type etc for each test run.
>> > > > > > > >
>> > > > > > > > We've tried using both 0.9.0.1 producers and 0.10.0.0 producers
>> > > > > > > > and the TLS performance impact was there for both.
>> > > > > > > >
>> > > > > > > > I've glanced over the code between 0.9.0.1 and 0.10.0.0 and
>> > > > > > > > haven't seen anything that seemed to have this kind of impact -
>> > > > > > > > indeed the TLS code doesn't seem to have changed much between
>> > > > > > > > 0.9.0.1 and 0.10.0.0.
>> > > > > > > >
>> > > > > > > > Any thoughts? Should I file an issue and see about reproducing
>> > > > > > > > a more minimal test case?
>> > > > > > > >
>> > > > > > > > I don't think this is related to
>> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565 - that is for
>> > > > > > > > compression on and plaintext, and this is for TLS only.
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
