Hi Andrew,

It seems the throughput of the new cluster is lower than that of the old cluster, so MirrorMaker cannot send messages fast enough (i.e. they expire). I recommend comparing the configurations of the two setups. For the hanging MirrorMaker instances, I think looking at stack dumps would help you get closer to the root cause.
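
To make the comparison concrete, here is a rough sketch of what I would look at first. This assumes the 0.9.0.1 MirrorMaker producer config is a plain producer.properties file and the process is an ordinary JVM; the property names are standard Kafka producer settings, but the example value is only illustrative, not a recommendation:

    # In the 0.9 producer, batches that sit in the accumulator longer than
    # request.timeout.ms after becoming ready are failed with "Batch Expired",
    # so this is the first setting I would compare (default is 30000):
    request.timeout.ms=120000
    # Also worth comparing between the old and new setups:
    #   batch.size, linger.ms, buffer.memory, retries

For the hanging instances, something like the following against a stuck MirrorMaker process, taken a few times a minute or so apart, should show which threads stay blocked and where:

    jstack <mirrormaker-pid> > mm-threads-$(date +%s).txt
    # or, if jstack is not on the box:
    kill -3 <mirrormaker-pid>    # thread dump goes to the process stdout/log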
Best regards,
Andras

On Mon, Mar 12, 2018 at 7:56 PM, Andrew Otto <o...@wikimedia.org> wrote:
> Hi all,
>
> I’m troubleshooting a MirrorMaker issue, and am not quite sure yet why this
> is happening, so I thought I’d ask here in case anyone else has seen this
> before.
>
> We’ve been running a Kafka 1.0 cluster for a few months now, replicating
> data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker. This had
> mostly been working fine, until last week when we finished migrating a set
> of high volume producers to this new Kafka 1.0 cluster. Before last week,
> the new 1.0 cluster was handling around 50-70K messages per second; as of
> last week it is up to around 150K messages per second.
>
> Everything in the new 1.0 cluster is working totally fine. However, since
> we increased traffic to the new 1.0 cluster, our previously operational
> MirrorMaker instances started dying with a lot of
>
> ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback -
> Error when sending message to topic XXXX with key: null, value: 964 bytes
> with error: Batch Expired
>
> and
>
> ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback -
> Error when sending message to topic XXXX with key: null, value: 964 bytes
> with error: Producer is closed forcefully.
>
> and
>
> INFO kafka.tools.MirrorMaker$ - Exiting on send failure, skip committing
> offsets.
>
>
> This log is printed for some of the higher volume (not really, these are
> around 1K messages per second max) topics that MirrorMaker is replicating.
> This happens a few minutes after the MirrorMaker instance starts. Until
> this happens, it is able to produce fine. Once it happens, the instance
> dies and a rebalance is triggered. I haven’t been able to consistently
> reproduce what happens next, but it seems that after a short series of
> instance flapping and rebalancing, the MirrorMaker instances all get
> stuck. They continue owning partitions, but don’t produce anything, and
> don’t log any more errors. The MirrorMaker consumers start lagging.
>
> A full restart of all instances seems to reset things, but eventually
> they get stuck again.
>
> Perhaps there’s some problem I’ve missed with older MirrorMaker producing
> to new Kafka clusters? Could more load on the brokers cause MirrorMaker
> produce requests to expire like this? We were running the same load on
> our old 0.9.0.1 cluster before this migration, with the same MirrorMaker
> setup, with no problems.
>
> Our MirrorMaker and 1.0 broker configuration is here:
> https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872
>
> Any help is appreciated, thanks!
>
> -Andrew Otto
> Senior Systems Engineer
> Wikimedia Foundation
>