Hi Andrew,
Are you using Snappy compression by chance? When we tested the 0.8.2.1 upgrade initially we saw similar results and tracked it down to a problem with Snappy version 1.1.1.6 (https://issues.apache.org/jira/browse/KAFKA-2189). We’re running with Snappy 1.1.1.7 now and the performance is back to where it used to be. (A couple of quick checks, for the Snappy jar version on the brokers and for whether the on-disk data is actually compressed, are sketched at the end of this message, below your quoted mail.)

Sent from my BlackBerry 10 smartphone on the TELUS network.

From: Andrew Otto
Sent: Tuesday, August 11, 2015 12:26 PM
To: users@kafka.apache.org
Reply To: users@kafka.apache.org
Cc: Dan Andreescu; Joseph Allemandou
Subject: 0.8.2.1 upgrade causes much more IO

Hi all!

Yesterday I did a production upgrade of our 4 broker Kafka cluster from 0.8.1.1 to 0.8.2.1.

When we did so, we were running our (varnishkafka) producers with request.required.acks = -1. After switching to 0.8.2.1, producers saw produce response RTTs of >60 seconds. I then switched to request.required.acks = 1, and producers settled down.

However, we then started seeing flapping ISRs about every 10 minutes. We run Camus every 10 minutes. If we disable Camus, then ISRs don’t flap.

All of these issues seem to be a side effect of a larger problem. The total amount of network and disk IO that Kafka brokers are doing after the upgrade to 0.8.2.1 has tripled. We were previously seeing about 20 MB/s incoming on broker interfaces; 0.8.2.1 knocks this up to around 60 MB/s. Disk writes have tripled accordingly. Disk reads have also increased by a huge amount, although I suspect this is a consequence of more data flying around somehow dirtying the disk cache.

You can see these changes in this dashboard: http://grafana.wikimedia.org/#/dashboard/db/kafka-0821-upgrade

The upgrade started at around 2015-08-10 14:30, and was completed on all 4 brokers within a couple of hours. Probably the most relevant is network rx_bytes on brokers.

[inline image: network rx_bytes on brokers]

We looked at Kafka .log file sizes and noticed that file sizes are indeed much larger than they were before this upgrade:

# 0.8.1.1
2015-08-10T04   38119109383
2015-08-10T05   46172089174
2015-08-10T06   46172182745
2015-08-10T07   53151490032
2015-08-10T08   53151892928
2015-08-10T09   55836248198
2015-08-10T10   57984054557
2015-08-10T11   63353197416
2015-08-10T12   68184938548
2015-08-10T13   69259218741
2015-08-10T14   79567698089
# Upgrade to 0.8.2.1 starts here
2015-08-10T15  133643184876
2015-08-10T16  168515916825
2015-08-10T17  181394338213
2015-08-10T18  177097927553
2015-08-10T19  183530782549
2015-08-10T20  178706680082
2015-08-10T21  178712665924
2015-08-10T22  171741495606
2015-08-10T23  169049665348
2015-08-11T00  163682183241
2015-08-11T01  165292426510

Aside from the request.required.acks change I mentioned above, we haven’t made any config changes on brokers, producers, or consumers. Our server.properties file is here: https://gist.github.com/ottomata/cdd270102287661c176a

Has anyone seen this before? What could be the cause of more data here? Perhaps there is some compression config change that we missed that is causing this data to be sent or saved uncompressed? (Sent uncompressed is unlikely, as we would probably notice a larger network change on the producers than we do. Unless I’m looking at that wrong right now… :))

Is there a quick way to tell if the data is compressed?

Thanks!
-Andrew Otto
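
On the two checks mentioned above: to see whether the messages on disk are actually compressed, you can peek at the first few entries of a .log segment directly. The sketch below assumes the 0.8.x on-disk message format (magic byte 0), where the low two bits of the attributes byte carry the compression codec; the segment path in the comment is made up, so point it at one of your own segments. (I believe kafka.tools.DumpLogSegments with --print-data-log will also print the codec per message, if you prefer the bundled tool.)

#!/usr/bin/env python
# Rough sketch: report the compression codec of the first few entries in a
# Kafka 0.8.x .log segment. Assumes the v0 (magic=0) on-disk layout:
# offset (8 bytes), message size (4 bytes), then crc(4) magic(1) attributes(1) ...
import struct
import sys

CODECS = {0: 'none', 1: 'gzip', 2: 'snappy'}

def peek_codecs(path, max_entries=5):
    with open(path, 'rb') as f:
        for _ in range(max_entries):
            header = f.read(12)                 # offset (int64) + message size (int32)
            if len(header) < 12:
                break
            offset, size = struct.unpack('>qi', header)
            message = f.read(size)              # crc(4), magic(1), attributes(1), ...
            if len(message) < 6:
                break
            magic = message[4] if isinstance(message[4], int) else ord(message[4])
            attrs = message[5] if isinstance(message[5], int) else ord(message[5])
            codec = CODECS.get(attrs & 0x03, 'unknown')
            print('offset %d: magic=%d codec=%s' % (offset, magic, codec))

if __name__ == '__main__':
    # e.g.: python peek_codecs.py /var/kafka/data/mytopic-0/00000000000000000000.log
    # (hypothetical path; use a real segment from one of your brokers)
    peek_codecs(sys.argv[1])

If the producers are compressing with Snappy, the first entries should show codec=snappy (the wrapper message carries the codec bits); codec=none on everything would point at data being stored uncompressed. For the Snappy version itself, listing the snappy-java jar in the broker's libs directory (path varies by install) shows which version the broker will load; per KAFKA-2189 you want 1.1.1.7 rather than 1.1.1.6.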