Hi Andrew,

It seems the throughput of the new cluster is lower than that of the old
one, so MirrorMaker cannot send messages fast enough and its batches expire
before they are delivered. I recommend comparing the two clusters'
configurations.
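In particular, I'd check the producer settings on the MirrorMaker side that
control batching and how long a record may sit in a batch before it is
expired. A rough sketch of the relevant producer.properties entries (the
values are only illustrative, not recommendations for your setup):

  # MirrorMaker producer.properties -- illustrative values only
  # If I remember correctly, the 0.9 producer expires batches that have
  # waited longer than request.timeout.ms, which is what surfaces as the
  # "Batch Expired" errors.
  request.timeout.ms=120000
  # Larger batches and a small linger reduce the request rate at high volume.
  batch.size=65536
  linger.ms=100
  # Retry transient send errors instead of letting MirrorMaker exit.
  retries=2147483647
  max.in.flight.requests.per.connection=1

Of course the right values depend on why the new cluster is slower, which
is what comparing the broker configurations should tell you.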
For the hanging MirrorMaker instances, I think looking at thread (stack)
dumps of the stuck processes would help you get closer to the root cause.
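For example, something along these lines against a stuck instance (jstack
ships with the JDK; the pgrep pattern is just my assumption about how the
process is named on your hosts):

  # Take two thread dumps from a hung MirrorMaker JVM, a few seconds apart
  MM_PID=$(pgrep -f kafka.tools.MirrorMaker | head -n 1)
  jstack "$MM_PID" > /tmp/mirrormaker-threads-1.txt
  sleep 10
  jstack "$MM_PID" > /tmp/mirrormaker-threads-2.txt

Comparing the two dumps should show whether the producer/consumer threads
are blocked waiting on something (e.g. a full producer buffer) or simply
idle.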

Best regards,
Andras

On Mon, Mar 12, 2018 at 7:56 PM, Andrew Otto <o...@wikimedia.org> wrote:

> Hi all,
>
> I’m troubleshooting a MirrorMaker issue, and am not quite sure yet why this
> is happening, so I thought I’d ask here in case anyone else has seen this
> before.
>
> We’ve been running a Kafka 1.0 cluster for a few months now, replicating
> data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker.  This had
> mostly been working fine, until last week when we finished migrating a set
> of high volume producers to this new Kafka 1.0 cluster.  Before last week,
> the new 1.0 cluster was handling around 50-70K messages per second; as of
> last week it is up to around 150K messages per second.
>
> Everything in the new 1.0 cluster is working totally fine.  However, since
> we increased traffic to the new 1.0 cluster, our previously operational
> MirrorMaker instances started dying with a lot of
>
> ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
> Error when sending message to topic XXXX with key: null, value: 964 bytes
> with error: Batch Expired
>
> and
>
> ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  -
> Error when sending message to topic XXXX with key: null, value: 964 bytes
> with error: Producer is closed forcefully.
>
> and
>
> INFO  kafka.tools.MirrorMaker$  - Exiting on send failure, skip committing
> offsets.
>
>
> This log is printed for some of the higher volume (not really, these are
> around 1K messages per second max) topics that MirrorMaker is replicating.
> This happens a few minutes after the MirrorMaker instance starts.  Until
> this happens, it is able to produce fine.  Once it happens, the instance
> dies and a rebalance is triggered. I haven’t been able to
> consistently reproduce what happens next, but it seems that after a short
> series of instance flapping + rebalancing, the MirrorMaker
> instances all totally get stuck.  They continue owning partitions, but
> don’t produce anything, and don’t log any more errors.  The MirrorMaker
> consumers start lagging.
>
> A full restart of all instances seems to reset things, but eventually
> they get stuck again.
>
> Perhaps there’s some problem I’ve missed with older MirrorMaker producing
> to new Kafka clusters?  Could more load on the brokers cause MirrorMaker
> produce requests to expire like this?  We were running the same load on
> our old 0.9.0.1 cluster before this migration, with the same MirrorMaker
> setup, and had no problems.
>
> Our MirrorMaker and 1.0 broker configuration is here:
> https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872
>
> Any help is appreciated, thanks!
>
> -Andrew Otto
>  Senior Systems Engineer
>  Wikimedia Foundation
>
