There may be more elegant ways to do this, but I'd think you could just ls each of the directories listed in log.dirs in your Kafka server.properties file. You should see a directory there for each topicname-partitionnumber that broker hosts.
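If you want to script it, something like this rough sketch (Python; the server.properties path is just an assumption, adjust for wherever your install keeps it) would dump the topicname-partitionnumber directory names on a broker:

import os

# Assumption: adjust this path for your install.
SERVER_PROPERTIES = "/etc/kafka/server.properties"

log_dirs = []
with open(SERVER_PROPERTIES) as f:
    for line in f:
        line = line.strip()
        # log.dirs is a comma-separated list; some configs use log.dir instead.
        if line.startswith("log.dirs=") or line.startswith("log.dir="):
            log_dirs = [d.strip() for d in line.split("=", 1)[1].split(",") if d.strip()]

# Every topicname-partitionnumber directory under log.dirs is a partition
# (leader or follower) hosted by this broker.
for d in log_dirs:
    for entry in sorted(os.listdir(d)):
        if os.path.isdir(os.path.join(d, entry)):
            print(entry)

That's just the on-disk view, of course; it won't tell you which of those partitions the broker is currently leading versus following.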
Offhand it sounds to me like maybe something's evicting pages from the buffer cache from time to time, causing Kafka to suddenly do a lot more I/O than usual. Why that happens, I don't know, but that'd be my guess: either something needs more pages for applications all of a sudden, or, like you said, there's some characteristic of the traffic for the partitions on this broker that isn't the same as on all the other brokers. Are the filesystem type and creation parameters the same as on the other hosts? Is the sysctl stuff all tuned the same way (assuming this is Linux, that is)?

Any chance there's some sort of network hiccup that makes some follower get a little behind, and then the act of it trying to catch back up pushes the I/O past what it can sustain steady-state? (If something gets significantly behind, then depending on the size of your buffer cache relative to the retention in your topics, you could have something start reading from, say, the first offset in that topic and partition, which might well require going to disk rather than being satisfied from the buffer cache. I could see that slowing I/O enough, if it's on the edge otherwise, that you can't keep up with the write rate until that consumer gets caught up.)

The other idea would be that, I dunno, maybe there's a topic where the segment size is different, and so when it goes to delete a segment it's spending a lot more time putting blocks from that file back onto the filesystem free list (or whatever data structure it is these days (-: ).

-Steve

On Tue, Sep 22, 2015 at 11:46:49AM -0700, Rajiv Kurian wrote:
> Also any hints on how I can find the exact topic/partitions assigned to
> this broker? I know in ZK we can see the partition -> broker mapping, but I
> am looking for a broker -> partition mapping. I can't be sure if the load
> that is causing this problem is because of leader traffic or follower
> traffic. What is weird is that I rarely if ever see other brokers in the
> cluster have the same problem. With 3 way replication (leader + 2 replicas)
> I'd imagine that the same work load would cause problems on other brokers
> too.