Correct, we've disabled unclean leader election, and there were no log messages indicating an unclean election either. I believe Kafka thinks it performed a clean election and still lost data.
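
For reference, the producer-side half of that durability setup looks roughly like the sketch below with the 0.9 Java client; the broker/topic-side settings are noted in the comments. This is illustrative only: the bootstrap address is a placeholder, and the topic name is just the one from the log lines in the thread.

    // Illustrative sketch only, not our real producer. Shows the client-side
    // setting that corresponds to required.acks = -1 in the thread below.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurabilitySketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("acks", "all"); // equivalent to -1: wait for all in-sync replicas
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // Broker/topic side (server.properties or per-topic overrides),
            // not client side:
            //   unclean.leader.election.enable=false
            //   min.insync.replicas=2

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("goscribe.client-host_activity",
                                                   "example payload"));
            }
        }
    }

With replication factor 2, min.insync.replicas = 2 and acks = all, my understanding is that a produce request shouldn't be acknowledged until both replicas have the message, which is what makes the truncation described below so surprising.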
--
Mark Smith
m...@qq.is

On Thu, Nov 17, 2016, at 06:23 PM, Tauzell, Dave wrote:
> Do you have:
>
> unclean.leader.election.enable = false ?
>
> Dave
>
> > On Nov 17, 2016, at 19:39, Mark Smith <m...@qq.is> wrote:
> >
> > Hey folks,
> >
> > I work at Dropbox and I was doing some maintenance yesterday, and it
> > looks like we lost some committed data during a preferred replica
> > election. As far as I understand this shouldn't happen, but I have a
> > theory and want to run it by y'all.
> >
> > Preamble:
> > * Kafka 0.9.0.1
> > * required.acks = -1 (all)
> > * min.insync.replicas = 2 (only 2 replicas for the partition, so we
> >   require both to have the data)
> > * consumer is Kafka Connect
> > * 1400 topics, about 15,000 partitions in total
> > * 30 brokers
> >
> > I was performing some rolling restarts of brokers yesterday as part
> > of our regular DRT (disaster testing) process, and at the end that
> > always leaves many partitions that need to be failed back to the
> > preferred replica. There were about 8,000 partitions that needed
> > moving. I started the election in Kafka Manager and it worked, but it
> > looks like 4 of those 8,000 partitions lost a relatively small amount
> > of data at the tail.
> >
> > From the Kafka Connect point of view, we saw a handful of these:
> >
> > [2016-11-17 02:55:26,513] [WorkerSinkTask-clogger-analytics-staging-8-5]
> > INFO Fetch offset 67614479952 is out of range, resetting offset
> > (o.a.k.c.c.i.Fetcher:595)
> >
> > I believe that was because it asked the new leader for data and the
> > new leader had less data than the old leader. Indeed, the old leader
> > became a follower and immediately truncated:
> >
> > 2016-11-17 02:55:27,237 INFO log.Log: Truncating log
> > goscribe.client-host_activity-21 to offset 67614479601.
> >
> > Given the above production settings I don't know why KC would ever
> > see an OffsetOutOfRange error, but this caused KC to reset to the
> > beginning of the partition. Various broker logs for the failover
> > paint the following timeline:
> > https://gist.github.com/zorkian/d80a4eb288d40c1ee7fb5d2d340986d6
> >
> > My current working theory, which I'd love eyes on:
> >
> > 1. The leader receives a produce request and appends to its log,
> >    incrementing the LEO, but given the durability requirements the
> >    HW is not incremented and the produce response is delayed
> >    (normal).
> >
> > 2. The replica sends a Fetch request to the leader as part of the
> >    normal replication flow.
> >
> > 3. The leader increments the HW when it STARTS to respond to the
> >    Fetch request (per fetchMessages in ReplicaManager.scala), so the
> >    HW is updated as soon as we've prepared messages for the response.
> >    Importantly, the HW is updated even though the replica has not yet
> >    actually seen the messages, even with the durability settings
> >    we've got.
> >
> > 4. Meanwhile, Kafka Connect sends a Fetch request to the leader and
> >    receives the messages below the new HW, even though those messages
> >    have not yet reached the replica.
> >
> > 5. The preferred replica election begins (oh the travesty!).
> >
> > 6. The replica starts the become-leader process, and makeLeader
> >    removes this partition from partitionMap, which means that when
> >    the fetch response finally comes in, we ignore it (we discard the
> >    messages the old leader had committed).
> >
> > 7. The old leader starts the become-follower process and truncates
> >    to the HW of the new leader, i.e. the old leader has now thrown
> >    away data it had committed and handed out moments ago.
> >
> > 8. Kafka Connect sends a Fetch request to the new leader, but its
> >    offset is now greater than the HW of the new leader, so we get the
> >    OffsetOutOfRange error and restart.
> >
> > Can someone tell me whether or not this is plausible? If it is, is
> > there a known issue or bug filed for it? I'm not exactly sure what
> > the solution is, but it does seem unfortunate that a normal operation
> > (a leader election with both brokers alive and well) can result in
> > the loss of committed messages.
> >
> > And if my theory doesn't hold, can anybody explain what happened? I'm
> > happy to provide more logs or whatever else is useful.
> >
> > Thanks!
> >
> > --
> > Mark Smith
> > m...@qq.is
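
One more note on the Kafka Connect side: the reset to the beginning of the partition presumably comes from the sink consumer's auto.offset.reset defaulting to earliest. With the plain 0.9 Java consumer, setting it to none makes the same situation fail loudly instead of silently rewinding. A rough sketch, with placeholder bootstrap address, group id and processing loop:

    // Sketch only: a plain 0.9 Java consumer configured so that an out-of-range
    // fetch offset surfaces as an exception instead of a silent reset.
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;

    public class StrictOffsetResetSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");     // placeholder
            props.put("group.id", "clogger-analytics-staging"); // placeholder
            props.put("enable.auto.commit", "false");
            props.put("auto.offset.reset", "none");             // throw rather than rewind
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("goscribe.client-host_activity"));
                while (true) {
                    try {
                        ConsumerRecords<String, String> records = consumer.poll(1000);
                        System.out.println("polled " + records.count() + " records");
                        // ... process and commit offsets here ...
                    } catch (OffsetOutOfRangeException e) {
                        // With auto.offset.reset=none we end up here instead of
                        // re-reading the partition from offset zero.
                        System.err.println("Fetch offset out of range, stopping: " + e);
                        break;
                    }
                }
            }
        }
    }

That wouldn't prevent the loss itself, it would just stop the consumer from quietly re-reading the partition from the start when it happens. I believe the Connect worker can pass a setting like this through to its consumers via a consumer.-prefixed config key, though I'd want to verify that against 0.9 before relying on it.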