Hi all,

I’m troubleshooting a MirrorMaker issue, and am not quite sure yet why this
is happening, so I thought I’d ask here in case anyone else has seen this
before.

We’ve been running a Kafka 1.0 cluster for a few months now, replicating
data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker.  This had
mostly been working fine, until last week when we finished migrating a set
of high volume producers to this new Kafka 1.0 cluster.  Before last week,
the new 1.0 cluster was handling around 50-70K messages per second; as of
last week it is up to around 150K messages per second.

Everything in the new 1.0 cluster is working totally fine.  However, since
we increased traffic to the new 1.0 cluster, our previously operational
MirrorMaker instances started dying with a lot of

ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
Error when sending message to topic XXXX with key: null, value: 964 bytes
with error: Batch Expired

and

ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
Error when sending message to topic XXXX with key: null, value: 964 bytes
with error: Producer is closed forcefully.

and

INFO  kafka.tools.MirrorMaker$  - Exiting on send failure, skip committing
offsets.


These errors are logged for some of the higher volume topics that
MirrorMaker is replicating (not really that high volume; these are around
1K messages per second max).  This happens a few minutes after the
MirrorMaker instance starts.  Until then, it is able to produce fine.  Once
it happens, the instance dies and a rebalance is triggered.  I haven’t been
able to consistently reproduce what happens next, but it seems that after a
short period of instance flapping and rebalancing, the MirrorMaker
instances all get totally stuck.  They continue owning partitions, but
don’t produce anything, and don’t log any more errors.  The MirrorMaker
consumers start lagging.

A full restart of all instances seems to reset things, but eventually they
get stuck again.

Perhaps there’s some problem I’ve missed with older MirrorMaker producing
to new Kafka clusters?  Could more load on the brokers cause MirrorMaker
produce requests to expire like this?  We were running the same load on our
old 0.9.0.1 cluster before this migration, with the same MirrorMaker setup,
with no problems.
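
For reference, my understanding is that the “Batch Expired” error in the
0.9.0.1 producer comes from batches sitting in the record accumulator
longer than request.timeout.ms before they can be sent.  These are the
producer settings I think are relevant, shown here with their stock
0.9.0.1 defaults just for illustration (not necessarily the values we’re
running; our actual settings are in the gist below):

# stock 0.9.0.1 producer defaults, for illustration only
request.timeout.ms=30000                  # batches waiting longer than this are expired
batch.size=16384                          # max bytes per partition batch
linger.ms=0                               # how long to wait for more records before sending
retries=0                                 # expired/failed batches are not retried by default
max.in.flight.requests.per.connection=5
buffer.memory=33554432                    # total memory for buffering unsent records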

Our MirrorMaker and 1.0 broker configuration is here:
https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872

Any help is appreciated, thanks!

-Andrew Otto
 Senior Systems Engineer
 Wikimedia Foundation
