Hi Andrew,

It seems the throughput of the new cluster is lower than that of the old
one, so MirrorMaker cannot send messages fast enough and its batches expire
before they are delivered. I recommend comparing the two clusters'
configurations.
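In particular, I'd check the producer settings on the MirrorMaker side that
control batching and how long a record may sit in a batch before it is
expired. A rough sketch of the relevant producer.properties entries (the
values are only illustrative, not recommendations for your setup):

  # MirrorMaker producer.properties -- illustrative values only
  # If I remember correctly, the 0.9 producer expires batches that have
  # waited longer than request.timeout.ms, which is what surfaces as the
  # "Batch Expired" errors.
  request.timeout.ms=120000
  # Larger batches and a small linger reduce the request rate at high volume.
  batch.size=65536
  linger.ms=100
  # Retry transient send errors instead of letting MirrorMaker exit.
  retries=2147483647
  max.in.flight.requests.per.connection=1

Of course the right values depend on why the new cluster is slower, which
is what comparing the broker configurations should tell you.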
For the hanging MirrorMaker instances, I think looking at thread (stack)
dumps of the stuck processes would help you get closer to the root cause.
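For example, something along these lines against a stuck instance (jstack
ships with the JDK; the pgrep pattern is just my assumption about how the
process is named on your hosts):

  # Take two thread dumps from a hung MirrorMaker JVM, a few seconds apart
  MM_PID=$(pgrep -f kafka.tools.MirrorMaker | head -n 1)
  jstack "$MM_PID" > /tmp/mirrormaker-threads-1.txt
  sleep 10
  jstack "$MM_PID" > /tmp/mirrormaker-threads-2.txt

Comparing the two dumps should show whether the producer/consumer threads
are blocked waiting on something (e.g. a full producer buffer) or simply
idle.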

Best regards,
Andras

On Mon, Mar 12, 2018 at 7:56 PM, Andrew Otto <o...@wikimedia.org> wrote:

> Hi all,
>
> I’m troubleshooting a MirrorMaker issue, and am not quite sure yet why this
> is happening, so I thought I’d ask here in case anyone else has seen this
> before.
>
> We’ve been running a Kafka 1.0 cluster for a few months now, replicating
> data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker.  This had
> mostly been working fine, until last week when we finished migrating a set
> of high volume producers to this new Kafka 1.0 cluster.  Before last week,
> the new 1.0 cluster was handling around 50-70K messages per second; as of
> last week it is up to around 150K messages per second.
>
> Everything in the new 1.0 cluster is working totally fine.  However, since
> we increased traffic to the new 1.0 cluster, our previously operational
> MirrorMaker instances started dying with a lot of
>
> ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
> Error when sending message to topic XXXX with key: null, value: 964 bytes
> with error: Batch Expired
>
> and
>
> ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
> Error when sending message to topic XXXX with key: null, value: 964 bytes
> with error: Producer is closed forcefully.
>
> and
>
> INFO  kafka.tools.MirrorMaker$  - Exiting on send failure, skip committing
> offsets.
>
>
> This log is printed for some of the higher volume (not really, these are
> around 1K messages per second max) topics that MirrorMaker is replicating.
> This happens a few minutes after the MirrorMaker instance starts.  Until
> this happens, it is able to produce fine.  Once it happens, the instance
> dies and a rebalance is triggered. I haven’t been able to
> consistently reproduce what happens next, but it seems that after a short
> series of instance flapping + rebalancing, the MirrorMaker
> instances all totally get stuck.  They continue owning partitions, but
> don’t produce anything, and don’t log any more errors.  The MirrorMaker
> consumers start lagging.
>
> A full restart of all instances seems to reset things, but eventually
> they get stuck again.
>
> Perhaps there’s some problem I’ve missed with older MirrorMaker producing
> to new Kafka clusters?  Could more load on the brokers cause MirrorMaker
> produce requests to expire like this?  We were running the same load on
> our old 0.9.0.1 cluster before this migration, with the same MirrorMaker
> setup, and had no problems.
>
> Our MirrorMaker and 1.0 broker configuration is here:
> https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872
>
> Any help is appreciated, thanks!
>
> -Andrew Otto
>  Senior Systems Engineer
>  Wikimedia Foundation
>
