Correct, we've disabled unclean leader election, and there were no log messages indicating an unclean election either. I believe Kafka thinks it performed a clean election and still lost data.
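
For reference, the producer-side half of that durability setup looks roughly like the sketch below with the 0.9 Java client; the broker/topic-side settings are noted in the comments. This is illustrative only: the bootstrap address is a placeholder, and the topic name is just the one from the log lines in the thread.

    // Illustrative sketch only, not our real producer. Shows the client-side
    // setting that corresponds to required.acks = -1 in the thread below.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurabilitySketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("acks", "all"); // equivalent to -1: wait for all in-sync replicas
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // Broker/topic side (server.properties or per-topic overrides),
            // not client side:
            //   unclean.leader.election.enable=false
            //   min.insync.replicas=2

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("goscribe.client-host_activity",
                                                   "example payload"));
            }
        }
    }

With replication factor 2, min.insync.replicas = 2 and acks = all, my understanding is that a produce request shouldn't be acknowledged until both replicas have the message, which is what makes the truncation described below so surprising.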
--
Mark Smith
m...@qq.is

On Thu, Nov 17, 2016, at 06:23 PM, Tauzell, Dave wrote:
> Do you have:
>
> unclean.leader.election.enable = false ?
>
> Dave
>
> > On Nov 17, 2016, at 19:39, Mark Smith <m...@qq.is> wrote:
> >
> > Hey folks,
> >
> > I work at Dropbox and I was doing some maintenance yesterday, and it
> > looks like we lost some committed data during a preferred replica
> > election. As far as I understand this shouldn't happen, but I have a
> > theory and want to run it by y'all.
> >
> > Preamble:
> > * Kafka 0.9.0.1
> > * required.acks = -1 (all)
> > * min.insync.replicas = 2 (only 2 replicas for the partition, so we
> >   require both to have the data)
> > * consumer is Kafka Connect
> > * 1400 topics, about 15,000 partitions in total
> > * 30 brokers
> >
> > I was performing some rolling restarts of brokers yesterday as part
> > of our regular DRT (disaster testing) process, and at the end that
> > always leaves many partitions that need to be failed back to the
> > preferred replica. There were about 8,000 partitions that needed
> > moving. I started the election in Kafka Manager and it worked, but it
> > looks like 4 of those 8,000 partitions lost a relatively small amount
> > of data at the tail.
> >
> > From the Kafka Connect point of view, we saw a handful of these:
> >
> > [2016-11-17 02:55:26,513] [WorkerSinkTask-clogger-analytics-staging-8-5]
> > INFO Fetch offset 67614479952 is out of range, resetting offset
> > (o.a.k.c.c.i.Fetcher:595)
> >
> > I believe that was because it asked the new leader for data and the
> > new leader had less data than the old leader. Indeed, the old leader
> > became a follower and immediately truncated:
> >
> > 2016-11-17 02:55:27,237 INFO log.Log: Truncating log
> > goscribe.client-host_activity-21 to offset 67614479601.
> >
> > Given the above production settings I don't know why KC would ever
> > see an OffsetOutOfRange error, but this caused KC to reset to the
> > beginning of the partition. Various broker logs for the failover
> > paint the following timeline:
> > https://gist.github.com/zorkian/d80a4eb288d40c1ee7fb5d2d340986d6
> >
> > My current working theory, which I'd love eyes on:
> >
> > 1. The leader receives a produce request and appends to its log,
> >    incrementing the LEO, but given the durability requirements the
> >    HW is not incremented and the produce response is delayed
> >    (normal).
> >
> > 2. The replica sends a Fetch request to the leader as part of the
> >    normal replication flow.
> >
> > 3. The leader increments the HW when it STARTS to respond to the
> >    Fetch request (per fetchMessages in ReplicaManager.scala), so the
> >    HW is updated as soon as we've prepared messages for the response.
> >    Importantly, the HW is updated even though the replica has not yet
> >    actually seen the messages, even with the durability settings
> >    we've got.
> >
> > 4. Meanwhile, Kafka Connect sends a Fetch request to the leader and
> >    receives the messages below the new HW, even though those messages
> >    have not yet reached the replica.
> >
> > 5. The preferred replica election begins (oh the travesty!).
> >
> > 6. The replica starts the become-leader process, and makeLeader
> >    removes this partition from partitionMap, which means that when
> >    the fetch response finally comes in, we ignore it (we discard the
> >    messages the old leader had committed).
> >
> > 7. The old leader starts the become-follower process and truncates
> >    to the HW of the new leader, i.e. the old leader has now thrown
> >    away data it had committed and handed out moments ago.
> >
> > 8. Kafka Connect sends a Fetch request to the new leader, but its
> >    offset is now greater than the HW of the new leader, so we get the
> >    OffsetOutOfRange error and restart.
> >
> > Can someone tell me whether or not this is plausible? If it is, is
> > there a known issue or bug filed for it? I'm not exactly sure what
> > the solution is, but it does seem unfortunate that a normal operation
> > (a leader election with both brokers alive and well) can result in
> > the loss of committed messages.
> >
> > And if my theory doesn't hold, can anybody explain what happened? I'm
> > happy to provide more logs or whatever else is useful.
> >
> > Thanks!
> >
> > --
> > Mark Smith
> > m...@qq.is
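
One more note on the Kafka Connect side: the reset to the beginning of the partition presumably comes from the sink consumer's auto.offset.reset defaulting to earliest. With the plain 0.9 Java consumer, setting it to none makes the same situation fail loudly instead of silently rewinding. A rough sketch, with placeholder bootstrap address, group id and processing loop:

    // Sketch only: a plain 0.9 Java consumer configured so that an out-of-range
    // fetch offset surfaces as an exception instead of a silent reset.
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;

    public class StrictOffsetResetSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");     // placeholder
            props.put("group.id", "clogger-analytics-staging"); // placeholder
            props.put("enable.auto.commit", "false");
            props.put("auto.offset.reset", "none");             // throw rather than rewind
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("goscribe.client-host_activity"));
                while (true) {
                    try {
                        ConsumerRecords<String, String> records = consumer.poll(1000);
                        System.out.println("polled " + records.count() + " records");
                        // ... process and commit offsets here ...
                    } catch (OffsetOutOfRangeException e) {
                        // With auto.offset.reset=none we end up here instead of
                        // re-reading the partition from offset zero.
                        System.err.println("Fetch offset out of range, stopping: " + e);
                        break;
                    }
                }
            }
        }
    }

That wouldn't prevent the loss itself, it would just stop the consumer from quietly re-reading the partition from the start when it happens. I believe the Connect worker can pass a setting like this through to its consumers via a consumer.-prefixed config key, though I'd want to verify that against 0.9 before relying on it.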