Ok, sorry for the lack of concrete information to help debug this issue. I
am not really an ops guy, so I am trying to keep up.

First, I added Boundary to our servers. Normal Kafka behavior should result
in 500 kbps or less on our cluster. Here you can see it peaking at over 1
Gbps: http://f.cl.ly/items/2e0B3Z0h1B2W4535010O/Boundary_-_Streams.png
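To get at Jun's question below about whether the on-disk data rate matches
the network rate, here is the rough check I am planning to run on each
broker. It is just a sketch; DIR is a guess and should be whatever log.dirs
points at in the attached config:

    # Sample the Kafka data dir size twice, 10s apart, to estimate the
    # on-disk write rate; compare against what Boundary/jnettop report.
    # Note: retention deletes can skew this over longer windows.
    DIR=/var/kafka/data   # adjust to the broker's log.dirs
    A=$(du -sb "$DIR" | cut -f1)
    sleep 10
    B=$(du -sb "$DIR" | cut -f1)
    echo "on-disk write rate: $(( (B - A) / 10 )) bytes/sec"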
Second, I reset Kafka from scratch with the version of Kafka 0.8.0
downloaded from an Apache mirror. I wrapped up the setup process with a
bash script that you can see here:
https://gist.github.com/carllerche/779e990e59bb6e25f2b0#file-kafka-topic-L13
(the ensure argument is passed to this script). I also included the Kafka
config.

Finally, the logs. Here are the logs I captured just now, as the issue has
reappeared: http://cl.ly/2p0A3U430g2e/download/logs2.tar.gz And here are
the logs I captured before resetting Kafka to a pristine state:
http://cl.ly/2p0A3U430g2e/download/logs2.tar.gz

I do not know what is going on. If any of the Kafka devs would like access
to the actual servers, I would be more than happy to work with them.

Cheers,
Carl

On Thu, Feb 6, 2014 at 9:06 PM, Jun Rao <jun...@gmail.com> wrote:
> Could you also check if the on-disk data size/rate matches the network
> traffic?
>
> Thanks,
>
> Jun
>
> On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <m...@carllerche.com> wrote:
>
>> So, the "good news" is that the problem came back again. The bad news
>> is that I disabled debug logs, as they were filling the disk (and I had
>> other fires to put out). I will re-enable debug logs and wait for it to
>> happen again.
>>
>> On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <neha.narkh...@gmail.com>
>> wrote:
>> > Carl,
>> >
>> > It will help if you can list the steps to reproduce this issue
>> > starting from a fresh installation. Your setup, the way it stands,
>> > seems to have gone through some config and state changes.
>> >
>> > Thanks,
>> > Neha
>> >
>> > On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>> >
>> >> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
>> >> > So, I tried enabling debug logging, I also made some tweaks to the
>> >> > config (which I probably shouldn't have), and craziness happened.
>> >> >
>> >> > First, some more context. Besides the very high network traffic,
>> >> > we were seeing some other issues that we had not focused on yet.
>> >> >
>> >> > * Even though log retention was set to 50GB & 24 hours, data logs
>> >> > were getting cleaned up far more quickly than that. I'm not
>> >> > entirely sure how much more quickly, but there was definitely far
>> >> > less than 12 hours and 1GB of data.
>> >> >
>> >> > * Kafka was not properly balanced. We had 3 servers, and only 2 of
>> >> > them were partition leaders. One server was a replica for all
>> >> > partitions. We tried to run a rebalance command, but it did not
>> >> > work. We were going to investigate later.
>> >>
>> >> Were any of the brokers down for an extended period? If the
>> >> preferred replica election command failed, it could be because the
>> >> preferred replica was catching up (which could explain the higher
>> >> than expected network traffic). Do you monitor the under-replicated
>> >> partitions count on your cluster? If you have that data it could
>> >> help confirm this.
>> >>
>> >> Joel
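(Joel: I was not monitoring the under-replicated partition count before,
but I have started polling it with the bundled JmxTool, roughly as below.
This is a best guess on my part -- it assumes the brokers were started
with JMX_PORT=9999, and the MBean name is what I believe 0.8.0 uses;
verify the exact name in jconsole first, since the naming seems to differ
between versions.)

    # Poll the under-replicated partition count over JMX
    bin/kafka-run-class.sh kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
      --object-name '"kafka.server":type="ReplicaManager",name="UnderReplicatedPartitions"'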
>> >> > So, after restarting all the kafkas, something happened with the
>> >> > offsets. The offsets that our consumers had no longer existed. It
>> >> > looks like somehow all the contents were lost? The logs show many
>> >> > exceptions like:
>> >> >
>> >> > `Request for offset 770354 but we only have log segments in the
>> >> > range 759234 to 759838.`
>> >> >
>> >> > So, I reset all the consumer offsets to the head of the queue, as
>> >> > I did not know of anything better to do. Once the dust settled,
>> >> > all the issues we were seeing vanished. Communication between
>> >> > Kafka nodes appears to be normal, Kafka was able to rebalance, and
>> >> > hopefully log retention will be normal.
>> >> >
>> >> > I am unsure what happened or how to get more debug information.
>> >> >
>> >> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <jay.kr...@gmail.com>
>> >> > wrote:
>> >> > > Can you enable DEBUG logging in log4j and see what requests are
>> >> > > coming in?
>> >> > >
>> >> > > -Jay
>> >> > >
>> >> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <m...@carllerche.com>
>> >> > > wrote:
>> >> > >
>> >> > >> Hi Jay,
>> >> > >>
>> >> > >> I do not believe that I have changed the
>> >> > >> replica.fetch.wait.max.ms setting. Here I have included the
>> >> > >> kafka config as well as a snapshot of jnettop from one of the
>> >> > >> servers.
>> >> > >>
>> >> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
>> >> > >>
>> >> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka
>> >> > >> server). The top two rows are Kafkas on other servers; you can
>> >> > >> see the combined throughput is ~80MB/s.
>> >> > >>
>> >> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <jay.kr...@gmail.com>
>> >> > >> wrote:
>> >> > >> > No, this is not normal.
>> >> > >> >
>> >> > >> > Checking twice a second (using the 500ms default) for new
>> >> > >> > data shouldn't cause high network traffic (that should be
>> >> > >> > < 1KB of overhead). I don't think that explains things. Is it
>> >> > >> > possible that setting has been overridden?
>> >> > >> >
>> >> > >> > -Jay
>> >> > >> >
>> >> > >> > On Tue, Feb 4, 2014 at 9:25 PM, Guozhang Wang
>> >> > >> > <wangg...@gmail.com> wrote:
>> >> > >> >
>> >> > >> >> Hi Carl,
>> >> > >> >>
>> >> > >> >> For each partition, the follower will fetch data from the
>> >> > >> >> leader replica, even if there is no new data in the leader
>> >> > >> >> replica.
>> >> > >> >>
>> >> > >> >> One thing you can try is to increase
>> >> > >> >> replica.fetch.wait.max.ms (default value 500ms) so that the
>> >> > >> >> followers' fetch-request frequency to the leader can be
>> >> > >> >> reduced, and see if that has some effect on the traffic.
>> >> > >> >>
>> >> > >> >> Guozhang
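(If I end up testing Guozhang's suggestion, my understanding is that the
override just goes in server.properties on each broker and takes effect
after a restart -- please correct me if that is wrong. The 2000 below is
just an arbitrary value to test with:)

    # followers wait up to 2s for new data before the fetch returns
    # (default is 500ms); I believe this should stay well under
    # replica.lag.time.max.ms so followers are not flagged as lagging
    replica.fetch.wait.max.ms=2000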
>> >> > >> >> On Tue, Feb 4, 2014 at 8:46 PM, Carl Lerche
>> >> > >> >> <m...@carllerche.com> wrote:
>> >> > >> >>
>> >> > >> >> > Hello,
>> >> > >> >> >
>> >> > >> >> > I'm running a 0.8.0 Kafka cluster of 3 servers. The
>> >> > >> >> > service that it is for is not in full production yet, so
>> >> > >> >> > the data written to the cluster is minimal (it seems to
>> >> > >> >> > average between 100kb/s and 300kb/s per server). I have
>> >> > >> >> > configured Kafka to have 3 replicas. I am noticing that
>> >> > >> >> > each Kafka server is talking to each of the others at a
>> >> > >> >> > data rate of 40MB/s (so, a total of 80MB/s for each
>> >> > >> >> > server). This communication is constant.
>> >> > >> >> >
>> >> > >> >> > Is this normal? This seems like very strange behavior and
>> >> > >> >> > I'm not exactly sure how to debug it.
>> >> > >> >> >
>> >> > >> >> > Thanks,
>> >> > >> >> > Carl
>> >> > >> >>
>> >> > >> >> --
>> >> > >> >> -- Guozhang
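To make the numbers explicit, here is the back-of-envelope I am working
from (steady state, ignoring protocol overhead, and assuming replication
is the only inter-broker traffic):

    produce rate per broker:    ~100-300kb/s (take 300kb/s as worst case)
    replication factor:         3, so each byte should be fetched once
                                by each of the 2 followers
    expected inter-broker rate: ~300kb/s * 2 = ~600kb/s per broker
    observed inter-broker rate: ~80MB/s per broker

So unless I am misreading the tools, we are seeing orders of magnitude
more replication traffic than the produce rate can account for.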