Hi all, I’m troubleshooting a MirrorMaker issue and am not quite sure yet why it is happening, so I thought I’d ask here in case anyone else has seen this before.
We’ve been running a Kafka 1.0 cluster for a few months now, replicating data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker. This had mostly been working fine until last week, when we finished migrating a set of high volume producers to the new Kafka 1.0 cluster. Before last week, the new 1.0 cluster was handling around 50-70K messages per second; as of last week it is up to around 150K messages per second. Everything in the new 1.0 cluster itself is working totally fine.

However, since we increased traffic to the new 1.0 cluster, our previously operational MirrorMaker instances have started dying with a lot of:

  ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback - Error when sending message to topic XXXX with key: null, value: 964 bytes with error: Batch Expired
  ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback - Error when sending message to topic XXXX with key: null, value: 964 bytes with error: Producer is closed forcefully.
  INFO kafka.tools.MirrorMaker$ - Exiting on send failure, skip committing offsets.

These errors are logged for some of the higher volume topics that MirrorMaker is replicating (not really that high: around 1K messages per second max). This happens a few minutes after a MirrorMaker instance starts; until then, it is able to produce fine. Once it happens, the instance dies and a rebalance is triggered. I haven’t been able to consistently reproduce what happens next, but it seems that after a short period of instance flapping and rebalancing, the MirrorMaker instances all get totally stuck. They continue owning partitions, but they don’t produce anything and don’t log any more errors, and the MirrorMaker consumers start lagging. A full restart of all instances seems to reset things, but eventually they get stuck again.

Perhaps there’s some problem I’ve missed with an older MirrorMaker producing to a newer Kafka cluster? Could more load on the brokers cause MirrorMaker produce requests to expire like this? We were running the same load on our old 0.9.0.1 cluster before this migration, with the same MirrorMaker setup, with no problems.

Our MirrorMaker and 1.0 broker configuration is here: https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872 (I’ve also put a sketch of how we run MirrorMaker and which producer settings seem relevant in a P.S. below.)

Any help is appreciated, thanks!

-Andrew Otto
 Senior Systems Engineer
 Wikimedia Foundation
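
P.S. For anyone skimming: we launch MirrorMaker with the stock 0.9.0.1 tooling, roughly like below. The hostnames and values here are illustrative placeholders, not our actual settings (those are in the gist above); I’m mainly including this to show which knobs I understand to be in play. As I understand it, in the 0.9.0.1 producer a batch that sits in the accumulator longer than request.timeout.ms is expired with the "Batch Expired" error above, and MirrorMaker exits on that send failure because abort.on.send.failure is true.

  # MirrorMaker invocation (abort.on.send.failure=true is what makes it exit on send failure)
  kafka-mirror-maker.sh \
    --consumer.config consumer.properties \
    --producer.config producer.properties \
    --whitelist '.*' \
    --num.streams 4 \
    --abort.on.send.failure true

  # producer.properties (illustrative values only; real config is in the gist)
  # placeholder broker hostname for the new 1.0 cluster
  bootstrap.servers=new-kafka-broker:9092
  acks=all
  retries=3
  batch.size=16384
  linger.ms=100
  buffer.memory=33554432
  # batches waiting in the 0.9.0.1 producer's accumulator expire after this timeout
  request.timeout.ms=30000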