One last thing: I have collected a snippet of the network traffic
between Kafka instances using tcpdump. However, it contains some
customer data, and less than a minute's worth was over 1 GB, so I can't
really post it here, but I could possibly share it offline if that
would help debug the issue.
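
For anyone who wants to grab a similar capture, something along these
lines should work (eth0 and the port are placeholders; 9092 is Kafka's
default port):

    tcpdump -i eth0 -s 0 -w kafka-traffic.pcap -C 100 'port 9092'

The -C flag rotates the capture file every ~100MB, which helps given
the volume.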

On Thu, Feb 6, 2014 at 11:44 PM, Carl Lerche <m...@carllerche.com> wrote:
> Re:
>
>> Could you also check if the on-disk data size/rate match the network
>> traffic?
>
> While I have not explicitly checked this, I would say that the answer
> is no. The network is over 1Gbps, and I have set up monitoring for
> disk space; nothing out of the norm is happening there. The expected
> data is on the order of 500 kbits per sec.
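>
> A quick way to double-check the on-disk rate (the path is a
> placeholder; use whatever log.dirs points at in server.properties):
>
>     watch -n 60 du -sb /var/kafka-logs
>
> If the byte count grows anywhere near the network rate, that would
> point at real data coming in rather than protocol overhead.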
>
> cheers.
>
> On Thu, Feb 6, 2014 at 9:06 PM, Jun Rao <jun...@gmail.com> wrote:
>> Could you also check if the on-disk data size/rate match the network
>> traffic?
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <m...@carllerche.com> wrote:
>>
>>> So, the "good news" is that the problem came back again. The bad news
>>> is that I disabled debug logs, as they were filling the disk (and I
>>> had other fires to put out). I will re-enable debug logs and wait for
>>> it to happen again.
>>>
>>> On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <neha.narkh...@gmail.com>
>>> wrote:
>>> > Carl,
>>> >
>>> > It will help if you can list the steps to reproduce this issue starting
>>> > from a fresh installation. Your setup, the way it stands, seems to have
>>> > gone through some config and state changes.
>>> >
>>> > Thanks,
>>> > Neha
>>> >
>>> >
>>> > On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>>> >
>>> >> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
>>> >> > So, I tried enabling debug logging, I also made some tweaks to the
>>> >> > config (which I probably shouldn't have) and craziness happened.
>>> >> >
>>> >> > First, some more context. Besides the very high network traffic, we
>>> >> > were seeing some other issues that we were not focusing on yet.
>>> >> >
>>> >> > * Even though the log retention was set to 50GB & 24 hours, data
>>> >> > logs were getting cleaned up far quicker. I'm not entirely sure how
>>> >> > much quicker, but there was definitely far less than 12 hours and
>>> >> > 1GB of data retained (the settings in question are sketched below).
>>> >> >
>>> >> > * Kafka was not properly balanced. We had 3 servers, and only 2 of
>>> >> > them were partition leaders. One server was a replica for all
>>> >> > partitions. We tried to run a rebalance command, but it did not work.
>>> >> > We were going to investigate later.
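>>> >> >
>>> >> > For reference, the settings in question look like this in
>>> >> > server.properties (values are my reading of the "50GB & 24 hours"
>>> >> > above, so treat this as a sketch):
>>> >> >
>>> >> >     log.retention.hours=24
>>> >> >     # 50GB (note: this limit applies per partition, not per topic)
>>> >> >     log.retention.bytes=53687091200
>>> >> >
>>> >> > Retention kicks in when either limit is hit, whichever comes first.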
>>> >>
>>> >> Were any of the brokers down for an extended period? If the preferred
>>> >> replica election command failed it could be because the preferred
>>> >> replica was catching up (which could explain the higher than expected
>>> >> network traffic). Do you monitor the under-replicated partitions count
>>> >> on your cluster? If you have that data it could help confirm this.
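>>> >>
>>> >> For reference (the ZooKeeper connect string is a placeholder), the
>>> >> election command and the metric I mean are:
>>> >>
>>> >>     bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181
>>> >>
>>> >> and the UnderReplicatedPartitions gauge under kafka.server's
>>> >> ReplicaManager in JMX (the exact bean name varies by version).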
>>> >>
>>> >> Joel
>>> >>
>>> >> >
>>> >> > So, after restarting all the Kafka brokers, something happened with
>>> >> > the offsets. The offsets that our consumers had no longer existed.
>>> >> > It looks like somehow all the contents were lost? The logs show many
>>> >> > exceptions like:
>>> >> >
>>> >> > `Request for offset 770354 but we only have log segments in the range
>>> >> > 759234 to 759838.`
>>> >> >
>>> >> > So, I reset all the consumer offsets to the head of the queue, as I
>>> >> > did not know of anything better to do. Once the dust settled, all
>>> >> > the issues we were seeing vanished. Communication between Kafka
>>> >> > nodes appears to be normal, Kafka was able to rebalance, and
>>> >> > hopefully log retention will be normal.
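>>> >> >
>>> >> > In case it helps anyone else hitting the same out-of-range errors,
>>> >> > the standard knob in the 0.8 high-level consumer is the setting
>>> >> > below (this is the documented behavior, not necessarily exactly
>>> >> > what I did):
>>> >> >
>>> >> >     auto.offset.reset=smallest
>>> >> >
>>> >> > smallest snaps back to the oldest available offset; largest skips
>>> >> > ahead to the tail.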
>>> >> >
>>> >> > I am unsure what happened or how to get more debug information.
>>> >> >
>>> >> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <jay.kr...@gmail.com>
>>> >> > wrote:
>>> >> > > Can you enable DEBUG logging in log4j and see what requests are
>>> >> > > coming in?
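>>> >> > >
>>> >> > > Something along these lines in config/log4j.properties should do
>>> >> > > it (the broad kafka logger is chatty, but it will catch the
>>> >> > > incoming requests):
>>> >> > >
>>> >> > >     log4j.logger.kafka=DEBUG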
>>> >> > >
>>> >> > > -Jay
>>> >> > >
>>> >> > >
>>> >> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <m...@carllerche.com>
>>> >> > > wrote:
>>> >> > >
>>> >> > >> Hi Jay,
>>> >> > >>
>>> >> > >> I do not believe that I have changed the replica.fetch.wait.max.ms
>>> >> > >> setting. Here I have included the Kafka config as well as a
>>> >> > >> snapshot of jnettop from one of the servers.
>>> >> > >>
>>> >> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
>>> >> > >>
>>> >> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka
>>> >> > >> server). The top two rows are Kafkas on other servers; you can see
>>> >> > >> the combined throughput is ~80MB/s.
>>> >> > >>
>>> >> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <jay.kr...@gmail.com>
>>> >> > >> wrote:
>>> >> > >> > No this is not normal.
>>> >> > >> >
>>> >> > >> > Checking twice a second (using the 500ms default) for new data
>>> >> > >> > shouldn't cause high network traffic (that should be like < 1KB
>>> >> > >> > of overhead). I don't think that explains things. Is it possible
>>> >> > >> > that setting has been overridden?
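>>> >> > >> >
>>> >> > >> > Rough arithmetic, assuming on the order of a hundred partitions
>>> >> > >> > (a guess): two fetch requests per second per partition at ~100
>>> >> > >> > bytes each is ~20KB/s, three orders of magnitude below the
>>> >> > >> > 40MB/s you're seeing.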
>>> >> > >> >
>>> >> > >> > -Jay
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > On Tue, Feb 4, 2014 at 9:25 PM, Guozhang Wang
>>> >> > >> > <wangg...@gmail.com> wrote:
>>> >> > >> >
>>> >> > >> >> Hi Carl,
>>> >> > >> >>
>>> >> > >> >> For each partition, the follower continuously sends fetch
>>> >> > >> >> requests to the leader replica, even if there is no new data on
>>> >> > >> >> the leader.
>>> >> > >> >>
>>> >> > >> >> One thing you can try is to increase replica.fetch.wait.max.ms
>>> >> > >> >> (default value 500ms) so that the followers' fetch request
>>> >> > >> >> frequency to the leader is reduced, and see if that has some
>>> >> > >> >> effect on the traffic.
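>>> >> > >> >>
>>> >> > >> >> For example, in server.properties on each broker (2000 here is
>>> >> > >> >> just an illustrative value):
>>> >> > >> >>
>>> >> > >> >>     replica.fetch.wait.max.ms=2000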
>>> >> > >> >>
>>> >> > >> >> Guozhang
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >> On Tue, Feb 4, 2014 at 8:46 PM, Carl Lerche
>>> >> > >> >> <m...@carllerche.com> wrote:
>>> >> > >> >>
>>> >> > >> >> > Hello,
>>> >> > >> >> >
>>> >> > >> >> > I'm running a 0.8.0 Kafka cluster of 3 servers. The service
>>> >> > >> >> > that it is for is not in full production yet, so the data
>>> >> > >> >> > written to the cluster is minimal (it seems to average between
>>> >> > >> >> > 100kb/s and 300kb/s per server). I have configured Kafka to
>>> >> > >> >> > have 3 replicas. I am noticing that each Kafka server is
>>> >> > >> >> > talking to each of the others at a data rate of 40MB/s (so, a
>>> >> > >> >> > total of 80MB/s for each server). This communication is
>>> >> > >> >> > constant.
>>> >> > >> >> >
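>>> >> > >> >> > Back-of-envelope, for what it's worth: with 3 replicas, each
>>> >> > >> >> > broker sends each partition's data to two followers, so I'd
>>> >> > >> >> > expect replication traffic of roughly 2x the ~300kb/s intake,
>>> >> > >> >> > i.e. under 1Mb/s per server, about 1000x less than what I'm
>>> >> > >> >> > seeing.
>>> >> > >> >> >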
>>> >> > >> >> > Is this normal? This seems like very strange behavior, and
>>> >> > >> >> > I'm not exactly sure how to debug it.
>>> >> > >> >> >
>>> >> > >> >> > Thanks,
>>> >> > >> >> > Carl
>>> >> > >> >> >
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >> --
>>> >> > >> >> -- Guozhang
>>> >> > >> >>
>>> >> > >>
>>> >>
>>> >>
>>>
