Hey folks,

We're running into an odd issue with MirrorMaker and the fetch request purgatory on the brokers. Our setup consists of two six-node clusters (all running 0.8.2.1 on identical hardware with the same config). All "normal" producing and consuming happens on cluster A. MirrorMaker has been set up to copy all topics (except a tiny blacklist) from cluster A to cluster B.
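In case the details matter, the MirrorMaker instances are launched with roughly the following (the config file names and blacklist pattern here are placeholders, not our literal ones):

```
# Consumer config points at cluster A (source); producer config points at cluster B (target).
bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config consumer-clusterA.properties \
  --producer.config producer-clusterB.properties \
  --blacklist 'some-internal-topic|another-internal-topic' \
  --num.streams 4
```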
Cluster A is completely healthy at the moment. Cluster B is not, which is very odd since it is literally handling the exact same traffic. The graph for Fetch Request Purgatory Size looks like this: https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0

Every time the purgatory shrinks, the resulting latency causes one or more nodes to drop their leadership (they recover quickly). We could probably alleviate the symptoms by decreasing `fetch.purgatory.purge.interval.requests` (it is currently at the default value), but I'd rather try to understand and solve the root cause. Cluster B is handling no outside fetch requests, and turning MirrorMaker off "fixes" the problem, so the fetch requests must be coming from internal replication (MirrorMaker only produces to this cluster; it doesn't consume from it). However, the same data is replicated when it is originally produced in cluster A, and the fetch purgatory size sits stably at ~10k there. There is nothing unusual in the logs on either cluster.

I have read all the wiki pages and JIRA tickets I can find about the new purgatory design in 0.8.2, but nothing stands out as applicable. I'm happy to provide more detailed logs, configuration, etc. if anyone thinks there might be something important in there.

I am completely baffled as to what could be causing this, and I'm starting to think at this point that we've completely misunderstood or misconfigured *something*. Any suggestions would be appreciated.

Thanks,
Evan
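P.S. If we did resort to treating the symptom, the change would just be lowering the purge interval in server.properties on the cluster B brokers, something along these lines (the value is only an example; we're currently on the default):

```
# Purge satisfied fetch requests from the purgatory more frequently
fetch.purgatory.purge.interval.requests=200
```

But I'd still much rather understand why cluster B behaves so differently from cluster A in the first place.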