Any ideas on this? It's still occurring... Is there a separate mailing list or project for mirrormaker that I could ask?
Thanks,
Evan

On Thu, Apr 9, 2015 at 4:36 PM, Evan Huus <evan.h...@shopify.com> wrote:

> Hey Folks, we're running into an odd issue with mirrormaker and the fetch
> request purgatory on the brokers. Our setup consists of two six-node
> clusters (all running 0.8.2.1 on identical hw with the same config). All
> "normal" producing and consuming happens on cluster A. Mirrormaker has been
> set up to copy all topics (except a tiny blacklist) from cluster A to
> cluster B.
>
> Cluster A is completely healthy at the moment. Cluster B is not, which is
> very odd since it is literally handling the exact same traffic.
>
> The graph for Fetch Request Purgatory Size looks like this:
> https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0
>
> Every time the purgatory shrinks, the latency from that causes one or more
> nodes to drop their leadership (it quickly recovers). We could probably
> alleviate the symptoms by decreasing
> `fetch.purgatory.purge.interval.requests` (it is currently at the default
> value) but I'd rather try and understand/solve the root cause here.
>
> Cluster B is handling no outside fetch requests, and turning mirrormaker
> off "fixes" the problem, so clearly (since mirrormaker is producing to this
> cluster not consuming from it) the fetch requests must be coming from
> internal replication. However, the same data is being replicated when it is
> originally produced in cluster A, and the fetch purgatory size sits stably
> at ~10k there. There is nothing unusual in the logs on either cluster.
>
> I have read all the wiki pages and jira tickets I can find about the new
> purgatory design in 0.8.2 but nothing stands out as applicable. I'm happy
> to provide more detailed logs, configuration, etc. if anyone thinks there
> might be something important in there. I am completely baffled as to what
> could be causing this.
>
> Any suggestions would be appreciated. I'm starting to think at this point
> that we've completely misunderstood or misconfigured *something*.
>
> Thanks,
> Evan