Hey folks,

I work at Dropbox. I was doing some maintenance yesterday, and it looks
like we lost some committed data during a preferred replica election. As
far as I understand, this shouldn't happen, but I have a theory and want
to run it by y'all.

Preamble:
* Kafka 0.9.0.1
* required.acks = -1 (All)
* min.insync.replicas = 2 (only 2 replicas for the partition, so we
require both to have the data)
* consumer is Kafka Connect
* 1,400 topics, about 15,000 partitions in total
* 30 brokers
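
For concreteness, the client-side half of that boils down to something
like the sketch below, expressed as new (Java) producer config; the
bootstrap servers are made up, our actual producer may spell things
slightly differently, and min.insync.replicas=2 lives in the topic/broker
config rather than on the client:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerSettingsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker list, just for illustration.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092");
        // acks=-1/all: the leader only acks a produce once the in-sync replicas
        // have it, which with min.insync.replicas=2 on a 2-replica topic means
        // both copies must have the data before the write is acknowledged.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
        producer.close();
    }
}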

I was performing some rolling restarts of brokers yesterday as part of
our regular DRT (disaster testing) process, which always leaves many
partitions at the end that need to be failed back to the preferred
replica. This time there were about 8,000 partitions that needed moving.
I started the election in Kafka Manager and it worked, but it looks like
4 of those 8,000 partitions lost a relatively small amount of data at
the tail.

From the Kafka Connect point of view, we saw a handful of these:

[2016-11-17 02:55:26,513] [WorkerSinkTask-clogger-analytics-staging-8-5]
INFO Fetch offset 67614479952 is out of range, resetting offset
(o.a.k.c.c.i.Fetcher:595)

I believe that was because it asked the new leader for data and the new
leader had less data than the old leader. Indeed, the old leader became
a follower and immediately truncated:

2016-11-17 02:55:27,237 INFO log.Log: Truncating log
goscribe.client-host_activity-21 to offset 67614479601.

Given the above production settings I don't understand why KC should
ever see an OffsetOutOfRange error, but it did, and that caused KC to
reset to the beginning of the partition. The various broker logs for the
failover paint the following timeline:
https://gist.github.com/zorkian/d80a4eb288d40c1ee7fb5d2d340986d6
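
A side note on the reset-to-beginning behavior: I'm assuming the Connect
worker's underlying consumer runs with auto.offset.reset=earliest (I
haven't confirmed that), which would explain the silent jump back to the
start of the partition when a fetch offset is out of range. Roughly this,
with hypothetical bootstrap and group values:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetResetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "connect-clogger-analytics-staging");  // hypothetical
        // With "earliest", an out-of-range fetch is handled inside the client
        // by resetting to the start of the partition -- which is what we saw.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("goscribe.client-host_activity"));
        // If the stored offset is past the leader's log end, this poll quietly
        // rewinds to the earliest offset instead of surfacing an error.
        consumer.poll(1000);
        consumer.close();
    }
}

(With auto.offset.reset=none the consumer would instead throw an
OffsetOutOfRangeException rather than rewinding.)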

My current working theory, which I'd love eyes on (there's a rough
sketch of this timeline right after the numbered steps):

  1. Leader receives produce request and appends to log, incrementing
  LEO, but given the durability requirements the HW is not incremented
  and the produce response is delayed (normal)

  2. Replica sends Fetch request to leader as part of normal replication
  flow

  3. Leader increments the HW when it STARTS to respond to the Fetch
  request (per fetchMessages in ReplicaManager.scala), so the HW is
  updated as soon as we've prepared the messages for the response --
  importantly, the HW moves even though the replica has not yet actually
  seen those messages, even with the durability settings we've got

  4. Meanwhile, Kafka Connect sends a Fetch request to the leader and
  receives the messages below the new HW, even though those messages
  still haven't actually been received by the replica

  5. Preferred replica election begins (oh the travesty!)

  6. Replica starts the become-leader process, and makeLeader removes
  this partition from partitionMap, which means that when the fetch
  response finally comes in, we ignore it (we discard the messages the
  old leader had committed)

  7. Old leader starts the become-follower process and truncates to the
  HW of the new leader, i.e. the old leader has now thrown away data it
  had committed and handed out moments ago

  8. Kafka Connect sends a Fetch request to the new leader, but its
  offset is now greater than the new leader's HW, so we get the
  OffsetOutOfRange error and reset to the beginning
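
To make that concrete, here's a toy sketch of how I think the offsets
move through steps 1-8. This is NOT Kafka's actual code, just my mental
model of the race, using the offsets from the log lines above:

public class HwRaceSketch {
    // Toy replica state: just a log end offset (LEO) and a high watermark (HW).
    static class Replica {
        long leo;
        long hw;
        Replica(long leo, long hw) { this.leo = leo; this.hw = hw; }
    }

    public static void main(String[] args) {
        Replica oldLeader = new Replica(67614479601L, 67614479601L);
        Replica follower  = new Replica(67614479601L, 67614479601L);

        // 1. Produce appends to the old leader's log; LEO moves, HW does not.
        oldLeader.leo = 67614479952L;

        // 2-3. Follower sends a Fetch; while preparing the response the leader
        //      advances its HW, even though the follower hasn't written the
        //      messages yet (this is the step I'd most like checked).
        oldLeader.hw = oldLeader.leo;

        // 4. Kafka Connect fetches and is handed everything below the new HW.
        long connectOffset = oldLeader.hw;

        // 5-6. Preferred replica election: the follower becomes leader before
        //      its in-flight fetch response is applied, so that data is dropped.
        Replica newLeader = follower;  // LEO/HW still 67614479601

        // 7. Old leader becomes a follower and truncates to the new leader's HW.
        oldLeader.leo = newLeader.hw;
        oldLeader.hw = newLeader.hw;

        // 8. KC's next fetch offset is above anything the new leader has.
        System.out.println("KC offset " + connectOffset
                + " vs new leader LEO " + newLeader.leo
                + " -> OffsetOutOfRange; offsets " + newLeader.leo
                + ".." + (connectOffset - 1) + " are gone");
    }
}

If that model holds, everything between the new leader's HW and the old
leader's HW at election time is committed-but-lost, which matches the
~350 offsets of difference between the two log lines above.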

Can someone tell me whether or not this is plausible? If it is, is there
a known issue/bug filed for it? I'm not exactly sure what the solution
is, but it does seem unfortunate that a normal operation (leader
election with both brokers alive and well) can result in the loss of
committed messages.

And, if my theory doesn't hold, can anybody explain what happened? I'm
happy to provide more logs or whatever.

Thanks!


-- 
Mark Smith
m...@qq.is
