Hi Andrew,

Yes, I agree that this is a serious issue. Let me start a discussion thread on this to see if there is any objection to doing a 0.8.2.2 release just for this.
Thanks,
Jun

On Thu, Aug 13, 2015 at 1:10 PM, Andrew Otto <o...@wikimedia.org> wrote:
> Hey all,
>
> Just wanted to confirm, this was totally our issue. Thanks so much, Todd and Matt, our cluster is much more stable now.
>
> Apache Kafka folks: I know 0.8.3 is slated to come out soon, but this is a pretty serious bug. I would think it would merit a minor release just to get it out there, so that others don't run into this problem. 0.8.2.1 basically does not work at scale with snappy compression. I will add a comment to https://issues.apache.org/jira/browse/KAFKA-2189 noting this too.
>
> Thanks so much!
> -Andrew
>
> On Tue, Aug 11, 2015 at 3:43 PM, Matthew Bruce <mbr...@blackberry.com> wrote:
>
> > Hi Andrew,
> >
> > I work with Todd and did our 0.8.2.1 testing with him. I believe that the Kafka 0.8.x broker recompresses the messages once it receives them, in order to assign offsets to the messages (see the 'Compression in Kafka' section of http://nehanarkhede.com/2013/03/28/compression-in-kafka-gzip-or-snappy/). I expect that you will see an improvement with Snappy 1.1.1.7 (FWIW, our load generator's version of Snappy didn't change between our 0.8.1.1 and 0.8.2.1 testing, and we still saw the IO hit on the broker side, which seems to confirm this).
> >
> > Thanks,
> > Matt Bruce
> >
> > *From:* Andrew Otto [mailto:ao...@wikimedia.org]
> > *Sent:* Tuesday, August 11, 2015 3:15 PM
> > *To:* users@kafka.apache.org
> > *Cc:* Dan Andreescu <dandree...@wikimedia.org>; Joseph Allemandou <jalleman...@wikimedia.org>
> > *Subject:* Re: 0.8.2.1 upgrade causes much more IO
> >
> > Hi Todd,
> >
> > We are using snappy! And we are using version 1.1.1.6 as of our upgrade to 0.8.2.1 yesterday. However, as far as I can tell, that is only relevant for Java producers, right? Our main producers use librdkafka (the Kafka C lib) to produce, and in doing so use a built-in C version of snappy [1].
> >
> > Even so, your issue sounds very similar to mine, and I don't have a full understanding of how brokers deal with compression, so I have updated the snappy-java version to 1.1.1.7 on one of our brokers. We'll have to wait a while to see if the log sizes are actually smaller for data written to this broker.
> >
> > Thanks!
> >
> > [1] https://github.com/edenhill/librdkafka/blob/0.8.5/src/snappy.c
> >
> > On Aug 11, 2015, at 12:58, Todd Snyder <tsny...@blackberry.com> wrote:
> >
> > Hi Andrew,
> >
> > Are you using Snappy compression by chance? When we tested the 0.8.2.1 upgrade initially we saw similar results and tracked it down to a problem with Snappy version 1.1.1.6 (https://issues.apache.org/jira/browse/KAFKA-2189). We're running with Snappy 1.1.1.7 now and the performance is back to where it used to be.
> >
> > Sent from my BlackBerry 10 smartphone on the TELUS network.
> >
> > *From:* Andrew Otto
> > *Sent:* Tuesday, August 11, 2015 12:26 PM
> > *To:* users@kafka.apache.org
> > *Reply To:* users@kafka.apache.org
> > *Cc:* Dan Andreescu; Joseph Allemandou
> > *Subject:* 0.8.2.1 upgrade causes much more IO
> >
> > Hi all!
> >
> > Yesterday I did a production upgrade of our 4 broker Kafka cluster from 0.8.1.1 to 0.8.2.1.
> >
> > When we did so, we were running our (varnishkafka) producers with request.required.acks = -1. After switching to 0.8.2.1, producers saw produce response RTTs of >60 seconds. I then switched to request.required.acks = 1, and producers settled down. However, we then started seeing flapping ISRs about every 10 minutes. We run Camus every 10 minutes. If we disable Camus, then ISRs don't flap.
> >
> > All of these issues seem to be a side effect of a larger problem. The total amount of network and disk IO that Kafka brokers are doing after the upgrade to 0.8.2.1 has tripled. We were previously seeing about 20 MB/s incoming on broker interfaces; 0.8.2.1 knocks this up to around 60 MB/s. Disk writes have tripled accordingly. Disk reads have also increased by a huge amount, although I suspect this is a consequence of more data flying around somehow dirtying the disk cache.
> >
> > You can see these changes in this dashboard: http://grafana.wikimedia.org/#/dashboard/db/kafka-0821-upgrade
> >
> > The upgrade started at around 2015-08-10 14:30, and was completed on all 4 brokers within a couple of hours.
> >
> > Probably the most relevant is network rx_bytes on brokers.
> >
> > We looked at Kafka .log file sizes and noticed that file sizes are indeed much larger than they were before this upgrade:
> >
> > # 0.8.1.1
> > 2015-08-10T04   38119109383
> > 2015-08-10T05   46172089174
> > 2015-08-10T06   46172182745
> > 2015-08-10T07   53151490032
> > 2015-08-10T08   53151892928
> > 2015-08-10T09   55836248198
> > 2015-08-10T10   57984054557
> > 2015-08-10T11   63353197416
> > 2015-08-10T12   68184938548
> > 2015-08-10T13   69259218741
> > 2015-08-10T14   79567698089
> > # Upgrade to 0.8.2.1 starts here
> > 2015-08-10T15  133643184876
> > 2015-08-10T16  168515916825
> > 2015-08-10T17  181394338213
> > 2015-08-10T18  177097927553
> > 2015-08-10T19  183530782549
> > 2015-08-10T20  178706680082
> > 2015-08-10T21  178712665924
> > 2015-08-10T22  171741495606
> > 2015-08-10T23  169049665348
> > 2015-08-11T00  163682183241
> > 2015-08-11T01  165292426510
> >
> > Aside from the request.required.acks change I mentioned above, we haven't made any config changes on brokers, producers, or consumers. Our server.properties file is here: https://gist.github.com/ottomata/cdd270102287661c176a
> >
> > Has anyone seen this before? What could be the cause of more data here? Perhaps there is some compression config change that we missed that is causing this data to be sent or saved uncompressed? (Sent uncompressed is unlikely, as we would probably notice a larger network change on the producers than we do. (Unless I'm looking at that wrong right now… :)) Is there a quick way to tell if the data is compressed?
> >
> > Thanks!
> > -Andrew Otto
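To Andrew's last question (is there a quick way to tell if the data is compressed?): the broker ships kafka.tools.DumpLogSegments, which prints a compresscodec field for each message it dumps, and that is probably the quickest check. As a rougher illustration of the same idea, here is a minimal, hypothetical sketch (not an official Kafka tool, and the class name is made up) that reads the first few entries of a 0.8.x .log segment directly and reports the codec encoded in each message's attributes byte.

```java
// Hypothetical sketch: report the compression codec claimed by the first few messages
// of an on-disk 0.8.x Kafka .log segment. Assumes the 0.8.x message format:
//   [offset:8][message size:4][crc:4][magic:1][attributes:1][key...][value...]
// where the low two bits of 'attributes' are the codec (0=none, 1=gzip, 2=snappy, 3=lz4).
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class SegmentCodecPeek {
    private static final String[] CODECS = {"none", "gzip", "snappy", "lz4"};

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            for (int i = 0; i < 5 && in.available() >= 12; i++) {
                long offset = in.readLong();       // logical offset of this (wrapper) message
                int size = in.readInt();           // bytes remaining in this message entry
                in.readInt();                      // crc, not verified here
                in.readByte();                     // magic byte
                int codec = in.readByte() & 0x03;  // compression codec bits of the attributes byte
                System.out.printf("offset=%d size=%d codec=%s%n", offset, size, CODECS[codec]);
                in.skipBytes(size - 6);            // crc + magic + attributes (6 bytes) already read
            }
        }
    }
}
```

If the wrapper messages report snappy here yet the segments still tripled in size, the data is being compressed, just badly, which points back at the snappy-java 1.1.1.6 regression tracked in KAFKA-2189 rather than at a missing compression setting.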
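For anyone who lands on this thread later: the fix that worked for both clusters was running the broker with snappy-java 1.1.1.7 instead of 1.1.1.6. Below is a minimal sketch, assuming snappy-java (org.xerial.snappy) is on the classpath you run it with, that prints which native Snappy build the JVM actually loads; the class name is illustrative.

```java
// Minimal sketch: print the native Snappy version that snappy-java loads in this JVM.
// Run it against the same jars the broker uses (e.g. those in Kafka's libs/ directory).
// 1.1.1.6 is the release implicated in KAFKA-2189; 1.1.1.7 contains the fix.
import org.xerial.snappy.Snappy;

public class SnappyVersionCheck {
    public static void main(String[] args) {
        System.out.println("snappy native library version: " + Snappy.getNativeLibraryVersion());
    }
}
```

Replacing the snappy-java 1.1.1.6 jar in the broker's libs/ directory with 1.1.1.7 and restarting is typically all that is required; no broker configuration change is involved.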