There may be more elegant ways to do this, but I'd think you could just ls each of the directories listed in log.dirs in your Kafka server.properties file. You should see a directory there for each topicname-partitionnumber that broker hosts.
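If you want to script it, something like this rough sketch (Python; the server.properties path is just an assumption, adjust for wherever your install keeps it) would dump the topicname-partitionnumber directory names on a broker:

import os

# Assumption: adjust this path for your install.
SERVER_PROPERTIES = "/etc/kafka/server.properties"

log_dirs = []
with open(SERVER_PROPERTIES) as f:
    for line in f:
        line = line.strip()
        # log.dirs is a comma-separated list; some configs use log.dir instead.
        if line.startswith("log.dirs=") or line.startswith("log.dir="):
            log_dirs = [d.strip() for d in line.split("=", 1)[1].split(",") if d.strip()]

# Every topicname-partitionnumber directory under log.dirs is a partition
# (leader or follower) hosted by this broker.
for d in log_dirs:
    for entry in sorted(os.listdir(d)):
        if os.path.isdir(os.path.join(d, entry)):
            print(entry)

That's just the on-disk view, of course; it won't tell you which of those partitions the broker is currently leading versus following.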
Offhand it sounds to me like maybe something's evicting pages from the buffer cache from time to time, causing Kafka to suddenly do a lot more I/O than usual. Why that happens, I don't know, but that'd be my guess: either something needs more pages for applications all of a sudden, or, like you said, there's some characteristic of the traffic for the partitions on this broker that isn't the same as on all the other brokers. Are the filesystem type and creation parameters the same as on the other hosts? Is the sysctl stuff all tuned the same way (assuming this is Linux, that is)?

Any chance there's some sort of network hiccup that makes some follower get a little behind, and then the act of it trying to catch back up pushes the I/O past what it can sustain steady-state? (If something gets significantly behind, then depending on the size of your buffer cache relative to the retention in your topics, you could have something start reading from, say, the first offset in that topic and partition, which might well require going to disk rather than being satisfied from the buffer cache. I could see that slowing I/O enough, if it's on the edge otherwise, that you can't keep up with the write rate until that consumer gets caught up.)

The other idea would be that, I dunno, maybe there's a topic where the segment size is different, and so when it goes to delete a segment it's spending a lot more time putting blocks from that file back onto the filesystem free list (or whatever data structure it is these days (-: ).

-Steve

On Tue, Sep 22, 2015 at 11:46:49AM -0700, Rajiv Kurian wrote:
> Also any hints on how I can find the exact topic/partitions assigned to
> this broker? I know in ZK we can see the partition -> broker mapping, but I
> am looking for a broker -> partition mapping. I can't be sure if the load
> that is causing this problem is because of leader traffic or follower
> traffic. What is weird is that I rarely if ever see other brokers in the
> cluster have the same problem. With 3 way replication (leader + 2 replicas)
> I'd imagine that the same work load would cause problems on other brokers
> too.