Mark,

Thanks for reporting this. First, a clarification. The HW is actually never advanced until all in-sync followers have fetched the corresponding message. For example, in step 2, if all follower replicas issue a fetch request at offset 10, that serves as an indication that all replicas have received messages up to offset 9. So, only then is the HW advanced to offset 10 (the HW is exclusive, i.e., offsets up to 9 are considered committed).
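
To make that concrete, here is a minimal sketch of leader-side HW advancement (simplified, with made-up names such as onFollowerFetch; this is not the actual ReplicaManager code):

    // A follower fetching at offset N implicitly acknowledges that it
    // already has every message below N.
    object HighWatermarkSketch {
      // Log end offsets: the leader's own, plus the last fetch offset
      // seen from each in-sync follower.
      val logEndOffsets =
        scala.collection.mutable.Map("leader" -> 10L, "follower" -> 9L)
      var highWatermark = 9L // exclusive: offsets below this are committed

      def onFollowerFetch(replicaId: String, fetchOffset: Long): Unit = {
        logEndOffsets(replicaId) = fetchOffset
        // The HW is the minimum LEO across the ISR, so it only advances
        // once every in-sync replica has fetched past the new value.
        highWatermark = math.max(highWatermark, logEndOffsets.values.min)
      }

      def main(args: Array[String]): Unit = {
        onFollowerFetch("follower", 10L) // follower asks for offset 10 ...
        println(highWatermark)           // ... so the HW advances to 10
      }
    }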
I think the problems that you are seeing are probably caused by two known issues. The first one is https://issues.apache.org/jira/browse/KAFKA-1211. The issue is that the HW is propagated asynchronously from the leader to the followers. If the leadership changes multiple times very quickly, what can happen is that a follower first truncates its data up to its HW and then immediately becomes the new leader. Since the follower's HW may not be up to date, some previously committed messages could be lost. The second one is https://issues.apache.org/jira/browse/KAFKA-3670. The issue is that controlled shutdown and leader balancing can cause the leadership to change more than once in quick succession, which can expose the data-loss problem in the first issue. The second issue has been fixed in 0.10.0. So, if you upgrade to that version or above, it should significantly reduce the chance of hitting the first issue. We are actively working on the first issue and hopefully it will be addressed in the next release.
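
To illustrate the KAFKA-1211 race, here is a rough sketch (hypothetical names again, and a log modeled as nothing but its end offset; this is not the real broker code) of why truncating to a stale local HW right before becoming leader loses committed messages:

    object Kafka1211Sketch extends App {
      // A replica's state: log end offset (LEO) and locally known HW.
      class ReplicaState(var leo: Long, var hw: Long) {
        // become-follower truncates to the replica's own HW, which can
        // lag the old leader's HW because HW propagation is asynchronous.
        def becomeFollower(): Unit = leo = math.min(leo, hw)
      }

      // The replica has fetched up to offset 100 (all committed on the
      // old leader), but its locally known HW is still 90.
      val r = new ReplicaState(leo = 100L, hw = 90L)
      r.becomeFollower() // truncates to 90, discarding committed 90..99
      // If leadership flips to this replica right away, it starts leading
      // with LEO 90 and the ten committed messages are gone for good.
      println(r.leo) // prints 90
    }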
Jun

On Thu, Nov 17, 2016 at 5:39 PM, Mark Smith <m...@qq.is> wrote:
> Hey folks,
>
> I work at Dropbox and I was doing some maintenance yesterday and it
> looks like we lost some committed data during a preferred replica
> election. As far as I understand this shouldn't happen, but I have a
> theory and want to run it by y'all.
>
> Preamble:
> * Kafka 0.9.0.1
> * required.acks = -1 (All)
> * min.insync.replicas = 2 (only 2 replicas for the partition, so we
>   require both to have the data)
> * consumer is Kafka Connect
> * 1400 topics, total of about 15,000 partitions
> * 30 brokers
>
> I was performing some rolling restarts of brokers yesterday as part of
> our regular DRT (disaster testing) process, and at the end that always
> leaves many partitions that need to be failed back to the preferred
> replica. There were about 8,000 partitions that needed moving. I
> started the election in Kafka Manager and it worked, but it looks like
> 4 of those 8,000 partitions experienced some relatively small amount
> of data loss at the tail.
>
> From the Kafka Connect point of view, we saw a handful of these:
>
> [2016-11-17 02:55:26,513] [WorkerSinkTask-clogger-analytics-staging-8-5]
> INFO Fetch offset 67614479952 is out of range, resetting offset
> (o.a.k.c.c.i.Fetcher:595)
>
> I believe that was because it asked the new leader for data and the
> new leader had less data than the old leader. Indeed, the old leader
> became a follower and immediately truncated:
>
> 2016-11-17 02:55:27,237 INFO log.Log: Truncating log
> goscribe.client-host_activity-21 to offset 67614479601.
>
> Given the above production settings I don't know why KC would ever see
> an OffsetOutOfRange error, but this caused KC to reset to the
> beginning of the partition. Various broker logs for the failover paint
> the following timeline:
> https://gist.github.com/zorkian/d80a4eb288d40c1ee7fb5d2d340986d6
>
> My current working theory that I'd love eyes on:
>
> 1. Leader receives a produce request and appends to the log,
> incrementing the LEO, but given the durability requirements the HW is
> not incremented and the produce response is delayed (normal)
>
> 2. Replica sends a Fetch request to the leader as part of the normal
> replication flow
>
> 3. Leader increments the HW when it STARTS to respond to the Fetch
> request (per fetchMessages in ReplicaManager.scala), so the HW is
> updated as soon as we've prepared messages for the response --
> importantly, the HW is updated even though the replica has not yet
> actually seen the messages, even given the durability settings we've
> got
>
> 4. Meanwhile, Kafka Connect sends a Fetch request to the leader and
> receives the messages below the new HW, but the messages have not
> actually been received by the replica yet
>
> 5. Preferred replica election begins (oh the travesty!)
>
> 6. Replica starts the become-leader process, and makeLeader removes
> this partition from partitionMap, which means that when the response
> finally comes in, we ignore it (we discard the messages the old leader
> had committed)
>
> 7. Old leader starts the become-follower process and truncates to the
> HW of the new leader, i.e. the old leader has now thrown away data it
> had committed and given out moments ago
>
> 8. Kafka Connect sends a Fetch request to the new leader, but its
> offset is now greater than the HW of the new leader, so we get the
> OffsetOutOfRange error and restart
>
> Can someone tell me whether or not this is plausible? If it is, is
> there a known issue/bug filed for it? I'm not exactly sure what the
> solution is, but it does seem unfortunate that a normal operation
> (leader election with both brokers alive and well) can result in the
> loss of committed messages.
>
> And, if my theory doesn't hold, can anybody explain what happened? I'm
> happy to provide more logs or whatever.
>
> Thanks!
>
> --
> Mark Smith
> m...@qq.is