Hm, the hardware should be mostly beefier in this new cluster, but there
are a couple of differences: mainly, we are now using RAID instead of JBOD,
and our high-volume producers (the 150K messages / second I mentioned) use TLS.

Also, other producers seem to have no problems; it is only MirrorMaker.  Even
stranger, this problem has gone away since yesterday.  Several bounces of
MirrorMaker have happened between then and now (as we tried a few different
MirrorMaker settings).  So far, this is a mystery.

Thanks for the reply, Andras! If it happens again I’ll look into your
suggestion.
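
For anyone who hits this later: a minimal sketch of how I’d capture those
dumps next time.  Running jstack <pid> against the MirrorMaker process is the
usual way; the snippet below is an in-JVM equivalent using ThreadMXBean,
purely for illustration (the class name and output handling are made up, not
part of our tooling).

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Prints stack information for every live thread in the current JVM.
    // Note: ThreadInfo.toString() truncates very deep stacks, so
    // `jstack <pid>` is still the better tool for a stuck MirrorMaker.
    public class ThreadDump {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // true, true: include locked monitors and ownable synchronizers
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }
    }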


On Tue, Mar 13, 2018 at 4:15 PM, Andras Beni <andrasb...@cloudera.com>
wrote:

> Hi Andrew,
>
> It seems the throughput of the new cluster is lower than that of the old
> cluster, and for this reason MirrorMaker cannot send messages fast enough
> (i.e., they expire). I recommend comparing the two clusters’ configurations.
> For the hanging MirrorMaker instances, I think looking at stack dumps would
> help you get closer to the root cause.
>
> Best regards,
> Andras
>
> On Mon, Mar 12, 2018 at 7:56 PM, Andrew Otto <o...@wikimedia.org> wrote:
>
> > Hi all,
> >
> > I’m troubleshooting a MirrorMaker issue, and am not quite sure yet why this
> > is happening, so I thought I’d ask here in case anyone else has seen this
> > before.
> >
> > We’ve been running a Kafka 1.0 cluster for a few months now, replicating
> > data from a Kafka 0.9.0.1 cluster using 0.9.0.1 MirrorMaker.  This had
> > mostly been working fine, until last week, when we finished migrating a set
> > of high-volume producers to this new Kafka 1.0 cluster.  Before last week,
> > the new 1.0 cluster was handling around 50-70K messages per second; as of
> > last week it is up to around 150K messages per second.
> >
> > Everything in the new 1.0 cluster is working totally fine.  However, since
> > we increased traffic to the new 1.0 cluster, our previously operational
> > MirrorMaker instances started dying with a lot of
> >
> > ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback -
> > Error when sending message to topic XXXX with key: null, value: 964 bytes
> > with error: Batch Expired
> >
> > and
> >
> > ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback -
> > Error when sending message to topic XXXX with key: null, value: 964 bytes
> > with error: Producer is closed forcefully.
> >
> > and
> >
> > INFO  kafka.tools.MirrorMaker$  - Exiting on send failure, skip committing
> > offsets.
> >
> >
> > This log is printed for some of the higher-volume topics that MirrorMaker
> > is replicating (not really that high; these are around 1K messages per
> > second max).  This happens a few minutes after the MirrorMaker instance
> > starts.  Until this happens, it is able to produce fine.  Once it happens,
> > the instance dies and a rebalance is triggered.  I haven’t been able to
> > consistently reproduce what happens next, but it seems that after a short
> > series of instance flapping and rebalancing, the MirrorMaker instances all
> > get totally stuck.  They continue owning partitions, but don’t produce
> > anything, and don’t log any more errors.  The MirrorMaker consumers start
> > lagging.
> >
> > A full restart of all instances seems to reset things, but eventually
> > they get stuck again.
> >
> > Perhaps there’s some problem I’ve missed with older MirrorMaker producing
> > to new Kafka clusters?  Could more load on the brokers cause MirrorMaker
> > produce requests to expire like this?  We were running the same load on
> > our old 0.9.0.1 cluster before this migration, with the same MirrorMaker
> > setup, with no problems.
> >
> > Our MirrorMaker and 1.0 broker configuration is here:
> > https://gist.github.com/ottomata/5324fc3becdd20e9a678d5d37c2db872
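> >
> > For reference, here is a rough sketch of the producer settings that, as I
> > understand it, drive the Batch Expired errors: with the 0.9 producer, a
> > batch that sits unsent for longer than request.timeout.ms is expired.
> > This is written as a Java snippet just for illustration; the broker name
> > and the values below are placeholders, not our real settings (those are
> > in the gist above).
> >
> >     import java.util.Properties;
> >     import org.apache.kafka.clients.producer.KafkaProducer;
> >     import org.apache.kafka.common.serialization.ByteArraySerializer;
> >
> >     public class MirrorProducerSketch {
> >         public static void main(String[] args) {
> >             Properties props = new Properties();
> >             props.put("bootstrap.servers", "new-cluster:9092"); // placeholder
> >             props.put("acks", "all");
> >             // Batches still unsent after this long fail with "Batch Expired".
> >             props.put("request.timeout.ms", "120000");
> >             props.put("batch.size", "65536");
> >             props.put("linger.ms", "5");
> >             props.put("buffer.memory", "67108864"); // room to absorb bursts
> >             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(
> >                 props, new ByteArraySerializer(), new ByteArraySerializer());
> >             producer.close();
> >         }
> >     }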
> >
> > Any help is appreciated, thanks!
> >
> > -Andrew Otto
> >  Senior Systems Engineer
> >  Wikimedia Foundation
> >
>
