Carl, looking at the boundary chart, it looks like you also have periods of no traffic prior to the spikes.
I also noticed from your logs that you are using AWS. What instance types are you using? Do you have any network checks in place? The logs show underReplication=true, which leads towards what Joel was theorizing as the issue. Do you track stats on the cluster (http://kafka.apache.org/documentation.html#monitoring)? I would expect changes in the Kafka stats to correlate with the boundary chart.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/

On Fri, Feb 7, 2014 at 2:47 AM, Carl Lerche <m...@carllerche.com> wrote:
> One last thing: I have collected a snippet of the network traffic
> between Kafka instances using tcpdump. However, it contains some
> customer data, and less than a minute's worth was over 1 GB, so I can't
> really post it here, but I could possibly share it offline if it can
> help debug the issue.
>
> On Thu, Feb 6, 2014 at 11:44 PM, Carl Lerche <m...@carllerche.com> wrote:
> > Re:
> >
> >> Could you also check if the on-disk data size/rate matches the network
> >> traffic?
> >
> > While I have not explicitly checked this, I would say that the answer
> > is no. The network is over 1 Gbps, and I have set up monitoring for
> > disk space and nothing out of the norm is happening there. The expected
> > data is on the order of 500 kbits per sec.
> >
> > cheers.
> >
> > On Thu, Feb 6, 2014 at 9:06 PM, Jun Rao <jun...@gmail.com> wrote:
> >> Could you also check if the on-disk data size/rate matches the network
> >> traffic?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <m...@carllerche.com> wrote:
> >>> So, the "good news" is that the problem came back again. The bad news
> >>> is that I disabled debug logs as they were filling the disk (and I had
> >>> other fires to put out). I will re-enable debug logs and wait for it
> >>> to happen again.
> >>>
> >>> On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> >>> > Carl,
> >>> >
> >>> > It will help if you can list the steps to reproduce this issue
> >>> > starting from a fresh installation. Your setup, the way it stands,
> >>> > seems to have gone through some config and state changes.
> >>> >
> >>> > Thanks,
> >>> > Neha
> >>> >
> >>> > On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >>> >> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
> >>> >> > So, I tried enabling debug logging; I also made some tweaks to the
> >>> >> > config (which I probably shouldn't have) and craziness happened.
> >>> >> >
> >>> >> > First, some more context. Besides the very high network traffic, we
> >>> >> > were seeing some other issues that we were not focusing on yet.
> >>> >> >
> >>> >> > * Even though the log retention was set to 50GB & 24 hours, data logs
> >>> >> > were getting cleaned up far quicker. I'm not entirely sure how much
> >>> >> > quicker, but there was definitely far less than 12 hours and 1GB of
> >>> >> > data.
> >>> >> >
> >>> >> > * Kafka was not properly balanced. We had 3 servers, and only 2 of
> >>> >> > them were partition leaders. One server was a replica for all
> >>> >> > partitions. We tried to run a rebalance command, but it did not work.
> >>> >> > We were going to investigate later.
> >>> >>
> >>> >> Were any of the brokers down for an extended period? If the preferred
> >>> >> replica election command failed, it could be because the preferred
> >>> >> replica was catching up (which could explain the higher than expected
> >>> >> network traffic). Do you monitor the under-replicated partitions count
> >>> >> on your cluster? If you have that data, it could help confirm this.
> >>> >>
> >>> >> Joel
> >>> >>
> >>> >> > So, after restarting all the Kafka brokers, something happened with
> >>> >> > the offsets. The offsets that our consumers had no longer existed.
> >>> >> > It looks like somehow all the contents were lost? The logs show many
> >>> >> > exceptions like:
> >>> >> >
> >>> >> > `Request for offset 770354 but we only have log segments in the range
> >>> >> > 759234 to 759838.`
> >>> >> >
> >>> >> > So, I reset all the consumer offsets to the head of the queue as I
> >>> >> > did not know of anything better to do. Once the dust settled, all the
> >>> >> > issues we were seeing vanished. Communication between Kafka nodes
> >>> >> > appears to be normal, Kafka was able to rebalance, and hopefully log
> >>> >> > retention will be normal.
> >>> >> >
> >>> >> > I am unsure what happened or how to get more debug information.
> >>> >> >
> >>> >> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >>> >> > > Can you enable DEBUG logging in log4j and see what requests are
> >>> >> > > coming in?
> >>> >> > >
> >>> >> > > -Jay
> >>> >> > >
> >>> >> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <m...@carllerche.com> wrote:
> >>> >> > >> Hi Jay,
> >>> >> > >>
> >>> >> > >> I do not believe that I have changed the replica.fetch.wait.max.ms
> >>> >> > >> setting. Here I have included the Kafka config as well as a
> >>> >> > >> snapshot of jnettop from one of the servers.
> >>> >> > >>
> >>> >> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
> >>> >> > >>
> >>> >> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka
> >>> >> > >> server). The top two rows are Kafkas on other servers; you can see
> >>> >> > >> the combined throughput is ~80MB/s.
> >>> >> > >>
> >>> >> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >>> >> > >> > No, this is not normal.
> >>> >> > >> >
> >>> >> > >> > Checking twice a second (using the 500ms default) for new data
> >>> >> > >> > shouldn't cause high network traffic (that should be like < 1KB
> >>> >> > >> > of overhead). I don't think that explains things. Is it possible
> >>> >> > >> > that setting has been overridden?
> >>> >> > >> >
> >>> >> > >> > -Jay
> >>> >> > >> >
> >>> >> > >> > On Tue, Feb 4, 2014 at 9:25 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >>> >> > >> >> Hi Carl,
> >>> >> > >> >>
> >>> >> > >> >> For each partition, the follower will also fetch data from the
> >>> >> > >> >> leader replica, even if there is no new data in the leader
> >>> >> > >> >> replica.
> >>> >> > >> >>
> >>> >> > >> >> One thing you can try is to increase replica.fetch.wait.max.ms
> >>> >> > >> >> (default value 500ms) so that the followers' fetch request
> >>> >> > >> >> frequency to the leader can be reduced, and see if that has some
> >>> >> > >> >> effect on the traffic.
> >>> >> > >> >>
> >>> >> > >> >> Guozhang
> >>> >> > >> >>
> >>> >> > >> >> On Tue, Feb 4, 2014 at 8:46 PM, Carl Lerche <m...@carllerche.com> wrote:
> >>> >> > >> >> > Hello,
> >>> >> > >> >> >
> >>> >> > >> >> > I'm running a 0.8.0 Kafka cluster of 3 servers. The service
> >>> >> > >> >> > that it is for is not in full production yet, so the data
> >>> >> > >> >> > written to the cluster is minimal (it seems to average between
> >>> >> > >> >> > 100kb/s -> 300kb/s per server). I have configured Kafka to
> >>> >> > >> >> > have 3 replicas. I am noticing that each Kafka server is
> >>> >> > >> >> > talking to each of the others at a data rate of 40MB/s (so, a
> >>> >> > >> >> > total of 80MB/s for each server). This communication is
> >>> >> > >> >> > constant.
> >>> >> > >> >> >
> >>> >> > >> >> > Is this normal? This seems like very strange behavior and I'm
> >>> >> > >> >> > not exactly sure how to debug it.
> >>> >> > >> >> >
> >>> >> > >> >> > Thanks,
> >>> >> > >> >> > Carl
> >>> >> > >> >>
> >>> >> > >> >> --
> >>> >> > >> >> -- Guozhang
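Regarding the stats question at the top of the thread: the under-replicated partition count that Joel asks about is exposed over JMX on each broker. Below is a minimal sketch of a standalone checker, assuming the brokers are started with remote JMX enabled (for example JMX_PORT=9999) and using the UnderReplicatedPartitions MBean name from the monitoring documentation; the exact ObjectName format differs between 0.8.x releases, so verify the name with jconsole before relying on it.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and JMX port; pass the real broker host as an argument.
        String host = args.length > 0 ? args[0] : "localhost";
        int jmxPort = 9999;

        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // MBean name as documented for Kafka 0.8.x; confirm with jconsole on your build.
            ObjectName underReplicated = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Object value = conn.getAttribute(underReplicated, "Value");
            System.out.println(host + ": UnderReplicatedPartitions = " + value);
        }
    }
}

A count that stays above zero while the traffic spikes would support the theory that a replica is continuously catching up.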
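To put rough numbers on Jay's and Guozhang's points upthread: with three replicas, steady-state inter-broker traffic should be about twice the produce rate, and an idle follower's fetch polling at the default replica.fetch.wait.max.ms amounts to only kilobytes per second. The figures below (produce rate, size of an empty fetch round trip) are assumptions for illustration, not measurements.

public class ReplicationTrafficEstimate {
    public static void main(String[] args) {
        // Expected steady-state replication traffic: with replication factor 3,
        // every byte produced to a leader is fetched by two followers.
        double produceBytesPerSec = 300_000 / 8.0;   // ~300 kbit/s per server, as reported in the thread
        double expectedReplication = 2 * produceBytesPerSec;

        // Overhead of empty fetch polling: an idle follower re-issues its fetch
        // request every replica.fetch.wait.max.ms (500ms by default).
        double fetchWaitMaxMs = 500.0;
        double bytesPerEmptyRoundTrip = 1024.0;      // assumed request + empty response size
        double pollingOverhead = (1000.0 / fetchWaitMaxMs) * bytesPerEmptyRoundTrip;

        System.out.printf("expected replication traffic ~ %.0f KB/s%n", expectedReplication / 1024);
        System.out.printf("empty-fetch polling overhead ~ %.0f KB/s%n", pollingOverhead / 1024);
        // Both estimates are orders of magnitude below the observed ~40 MB/s per peer,
        // which points at real log data moving between brokers (for example a replica
        // repeatedly re-fetching segments) rather than idle polling.
    }
}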
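And one way to act on Jun's suggestion of comparing on-disk growth against the network rate is to sample the total size of the broker's log directory twice and compute the delta. This is a minimal sketch; the directory path and sampling interval are placeholders, so point it at whatever log.dirs is set to in your server.properties.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class LogDirGrowth {
    public static void main(String[] args) throws Exception {
        // Assumed default location; pass the broker's actual log.dirs value instead.
        Path logDir = Paths.get(args.length > 0 ? args[0] : "/tmp/kafka-logs");
        long intervalMs = 60_000;

        long before = totalSize(logDir);
        Thread.sleep(intervalMs);
        long after = totalSize(logDir);

        double kbPerSec = (after - before) / 1024.0 / (intervalMs / 1000.0);
        System.out.printf("log dir grew by %.1f KB/s over %d s%n", kbPerSec, intervalMs / 1000);
    }

    // Sum of all regular file sizes (log segments and index files) under the log directory.
    static long totalSize(Path root) throws IOException {
        AtomicLong total = new AtomicLong();
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .forEach(p -> total.addAndGet(p.toFile().length()));
        }
        return total.get();
    }
}

If the disk barely grows while jnettop shows tens of MB/s on the replication links, the traffic is almost certainly the same data being fetched over and over rather than new writes.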