Hi Andrew,

Yes, I agree that this is a serious issue. Let me start a discussion thread on this to see if there is any objection to doing a 0.8.2.2 release just for this.
Thanks,
Jun

On Thu, Aug 13, 2015 at 1:10 PM, Andrew Otto <o...@wikimedia.org> wrote:
> Hey all,
>
> Just wanted to confirm, this was totally our issue. Thanks so much, Todd and Matt, our cluster is much more stable now.
>
> Apache Kafka folks: I know 0.8.3 is slated to come out soon, but this is a pretty serious bug. I would think it would merit a minor release just to get it out there, so that others don't run into this problem. 0.8.2.1 basically does not work at scale with snappy compression. I will add a comment to https://issues.apache.org/jira/browse/KAFKA-2189 noting this too.
>
> Thanks so much!
> -Andrew
>
> On Tue, Aug 11, 2015 at 3:43 PM, Matthew Bruce <mbr...@blackberry.com> wrote:
>
> > Hi Andrew,
> >
> > I work with Todd and did our 0.8.2.1 testing with him. I believe that the Kafka 0.8.x broker recompresses the messages once it receives them, in order to assign offsets to the messages (see the 'Compression in Kafka' section of http://nehanarkhede.com/2013/03/28/compression-in-kafka-gzip-or-snappy/). I expect that you will see an improvement with Snappy 1.1.1.7 (FWIW, our load generator's version of Snappy didn't change between our 0.8.1.1 and 0.8.2.1 testing, and we still saw the IO hit on the broker side, which seems to confirm this).
> >
> > Thanks,
> > Matt Bruce
> >
> > *From:* Andrew Otto [mailto:ao...@wikimedia.org]
> > *Sent:* Tuesday, August 11, 2015 3:15 PM
> > *To:* users@kafka.apache.org
> > *Cc:* Dan Andreescu <dandree...@wikimedia.org>; Joseph Allemandou <jalleman...@wikimedia.org>
> > *Subject:* Re: 0.8.2.1 upgrade causes much more IO
> >
> > Hi Todd,
> >
> > We are using snappy! And we are using version 1.1.1.6 as of our upgrade to 0.8.2.1 yesterday. However, as far as I can tell, that is only relevant for Java producers, right? Our main producers use librdkafka (the Kafka C lib) to produce, and in doing so use a built-in C version of snappy [1].
> >
> > Even so, your issue sounds very similar to mine, and I don't have a full understanding of how brokers deal with compression, so I have updated the snappy-java version to 1.1.1.7 on one of our brokers. We'll have to wait a while to see if the log sizes are actually smaller for data written to this broker.
> >
> > Thanks!
> >
> > [1] https://github.com/edenhill/librdkafka/blob/0.8.5/src/snappy.c
> >
> > On Aug 11, 2015, at 12:58, Todd Snyder <tsny...@blackberry.com> wrote:
> >
> > Hi Andrew,
> >
> > Are you using Snappy compression by chance? When we tested the 0.8.2.1 upgrade initially we saw similar results and tracked it down to a problem with Snappy version 1.1.1.6 (https://issues.apache.org/jira/browse/KAFKA-2189). We're running with Snappy 1.1.1.7 now and the performance is back to where it used to be.
> >
> > Sent from my BlackBerry 10 smartphone on the TELUS network.
> >
> > *From:* Andrew Otto
> > *Sent:* Tuesday, August 11, 2015 12:26 PM
> > *To:* users@kafka.apache.org
> > *Reply To:* users@kafka.apache.org
> > *Cc:* Dan Andreescu; Joseph Allemandou
> > *Subject:* 0.8.2.1 upgrade causes much more IO
> >
> > Hi all!
> >
> > Yesterday I did a production upgrade of our 4 broker Kafka cluster from 0.8.1.1 to 0.8.2.1.
> >
> > When we did so, we were running our (varnishkafka) producers with request.required.acks = -1. After switching to 0.8.2.1, producers saw produce response RTTs of >60 seconds. I then switched to request.required.acks = 1, and producers settled down. However, we then started seeing flapping ISRs about every 10 minutes. We run Camus every 10 minutes. If we disable Camus, then ISRs don't flap.
> >
> > All of these issues seem to be a side effect of a larger problem. The total amount of network and disk IO that Kafka brokers are doing after the upgrade to 0.8.2.1 has tripled. We were previously seeing about 20 MB/s incoming on broker interfaces; 0.8.2.1 knocks this up to around 60 MB/s. Disk writes have tripled accordingly. Disk reads have also increased by a huge amount, although I suspect this is a consequence of more data flying around somehow dirtying the disk cache.
> >
> > You can see these changes in this dashboard: http://grafana.wikimedia.org/#/dashboard/db/kafka-0821-upgrade
> >
> > The upgrade started at around 2015-08-10 14:30, and was completed on all 4 brokers within a couple of hours.
> >
> > Probably the most relevant is network rx_bytes on brokers.
> >
> > We looked at Kafka .log file sizes and noticed that file sizes are indeed much larger than they were before this upgrade:
> >
> > # 0.8.1.1
> > 2015-08-10T04   38119109383
> > 2015-08-10T05   46172089174
> > 2015-08-10T06   46172182745
> > 2015-08-10T07   53151490032
> > 2015-08-10T08   53151892928
> > 2015-08-10T09   55836248198
> > 2015-08-10T10   57984054557
> > 2015-08-10T11   63353197416
> > 2015-08-10T12   68184938548
> > 2015-08-10T13   69259218741
> > 2015-08-10T14   79567698089
> > # Upgrade to 0.8.2.1 starts here
> > 2015-08-10T15  133643184876
> > 2015-08-10T16  168515916825
> > 2015-08-10T17  181394338213
> > 2015-08-10T18  177097927553
> > 2015-08-10T19  183530782549
> > 2015-08-10T20  178706680082
> > 2015-08-10T21  178712665924
> > 2015-08-10T22  171741495606
> > 2015-08-10T23  169049665348
> > 2015-08-11T00  163682183241
> > 2015-08-11T01  165292426510
> >
> > Aside from the request.required.acks change I mentioned above, we haven't made any config changes on brokers, producers, or consumers. Our server.properties file is here: https://gist.github.com/ottomata/cdd270102287661c176a
> >
> > Has anyone seen this before? What could be the cause of more data here? Perhaps there is some compression config change that we missed that is causing this data to be sent or saved uncompressed? (Sent uncompressed is unlikely, as we would probably notice a larger network change on the producers than we do. (Unless I'm looking at that wrong right now… :)) Is there a quick way to tell if the data is compressed?
> >
> > Thanks!
> > -Andrew Otto
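To Andrew's last question (is there a quick way to tell if the data is compressed?): the broker ships kafka.tools.DumpLogSegments, which prints a compresscodec field for each message it dumps, and that is probably the quickest check. As a rougher illustration of the same idea, here is a minimal, hypothetical sketch (not an official Kafka tool, and the class name is made up) that reads the first few entries of a 0.8.x .log segment directly and reports the codec encoded in each message's attributes byte.

```java
// Hypothetical sketch: report the compression codec claimed by the first few messages
// of an on-disk 0.8.x Kafka .log segment. Assumes the 0.8.x message format:
//   [offset:8][message size:4][crc:4][magic:1][attributes:1][key...][value...]
// where the low two bits of 'attributes' are the codec (0=none, 1=gzip, 2=snappy, 3=lz4).
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class SegmentCodecPeek {
    private static final String[] CODECS = {"none", "gzip", "snappy", "lz4"};

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            for (int i = 0; i < 5 && in.available() >= 12; i++) {
                long offset = in.readLong();       // logical offset of this (wrapper) message
                int size = in.readInt();           // bytes remaining in this message entry
                in.readInt();                      // crc, not verified here
                in.readByte();                     // magic byte
                int codec = in.readByte() & 0x03;  // compression codec bits of the attributes byte
                System.out.printf("offset=%d size=%d codec=%s%n", offset, size, CODECS[codec]);
                in.skipBytes(size - 6);            // crc + magic + attributes (6 bytes) already read
            }
        }
    }
}
```

If the wrapper messages report snappy here yet the segments still tripled in size, the data is being compressed, just badly, which points back at the snappy-java 1.1.1.6 regression tracked in KAFKA-2189 rather than at a missing compression setting.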
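For anyone who lands on this thread later: the fix that worked for both clusters was running the broker with snappy-java 1.1.1.7 instead of 1.1.1.6. Below is a minimal sketch, assuming snappy-java (org.xerial.snappy) is on the classpath you run it with, that prints which native Snappy build the JVM actually loads; the class name is illustrative.

```java
// Minimal sketch: print the native Snappy version that snappy-java loads in this JVM.
// Run it against the same jars the broker uses (e.g. those in Kafka's libs/ directory).
// 1.1.1.6 is the release implicated in KAFKA-2189; 1.1.1.7 contains the fix.
import org.xerial.snappy.Snappy;

public class SnappyVersionCheck {
    public static void main(String[] args) {
        System.out.println("snappy native library version: " + Snappy.getNativeLibraryVersion());
    }
}
```

Replacing the snappy-java 1.1.1.6 jar in the broker's libs/ directory with 1.1.1.7 and restarting is typically all that is required; no broker configuration change is involved.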