[ https://issues.apache.org/jira/browse/KAFKA-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670636#comment-15670636 ]
Meyer Kizner commented on KAFKA-4414:
-------------------------------------
What value would you suggest? We're already using 5000ms, which I thought was
relatively short. A shorter timeout makes the issue less likely, but it doesn't
eliminate it: there appears to be a genuine race condition here.
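For reference, the relevant broker settings look roughly like this; the
session timeout is the 5000ms mentioned above, and unclean leader election is
disabled as described in the report (property names per the 0.9 docs, shown
only for illustration):
{code}
# server.properties (illustrative excerpt)
zookeeper.session.timeout.ms=5000
unclean.leader.election.enable=false
{code}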
> Unexpected "Halting because log truncation is not allowed"
> ----------------------------------------------------------
>
> Key: KAFKA-4414
> URL: https://issues.apache.org/jira/browse/KAFKA-4414
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.1
> Reporter: Meyer Kizner
>
> Our Kafka installation runs with unclean leader election disabled, so brokers
> halt when they find that their log end offset for a partition is ahead of the
> leader's. We had two brokers halt today with this issue. After digging through
> the logs, I believe the following timeline describes what occurred and
> suggests a plausible cause.
> * B1, B2, and B3 are replicas of a topic partition, all in the ISR. B2 is
> currently the leader, but B1 is the preferred leader. The controller runs on
> B3.
> * B1 fails, but the controller does not detect the failure immediately.
> * B2 receives a message from a producer and B3 fetches it to stay up to date.
> The message is not yet committed (the high water mark stays below it),
> because B1 is down and so has not acknowledged it.
> * The controller triggers a preferred leader election, making B1 the leader,
> and notifies all replicas.
> * Very shortly afterwards (~200ms), B1's broker registration in ZooKeeper
> expires, so the controller reassigns B2 to be leader again and notifies all
> replicas.
> * Because B3 is the controller while B2 is on another box, B3 hears about
> both of these events before B2 hears about either. B3 truncates its log to
> the high water mark (before the pending message) and resumes fetching from B2.
> * B3 fetches the pending message from B2 again.
> * B2 learns that it has been displaced and then reelected, and truncates its
> log to the high water mark (before the pending message).
> * The next time B3 tries to fetch from B2, its fetch offset is ahead of B2's
> log end offset: the pending message is missing from B2, so B3 halts (see the
> sketch after this list).
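> To make the failure condition concrete, here is a rough sketch of the
> follower-side check as I understand it; the class, method, and parameter
> names are illustrative, not the actual {{ReplicaFetcherThread}} code:
> {code:java}
> // Illustrative model of a follower whose fetch offset is out of range on
> // the leader; not the real fetcher implementation.
> class FollowerFetchSketch {
>     static long handleOffsetOutOfRange(long followerEndOffset,
>                                        long leaderEndOffset,
>                                        long leaderStartOffset,
>                                        boolean uncleanLeaderElectionEnabled) {
>         if (leaderEndOffset < followerEndOffset) {
>             // The follower is ahead of the leader, as B3 is here. With
>             // unclean leader election disabled, truncating would silently
>             // discard data the follower believes exists, so the broker
>             // halts instead.
>             if (!uncleanLeaderElectionEnabled) {
>                 throw new IllegalStateException(
>                     "Halting because log truncation is not allowed");
>             }
>             return leaderEndOffset; // truncate to the leader's end, resume
>         }
>         // Otherwise the follower has fallen behind the leader's earliest
>         // retained offset; restart fetching from there.
>         return leaderStartOffset;
>     }
> }
> {code}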
> In this case, there was no data loss or inconsistency. I haven't fully
> thought through whether either is possible, but it seems likely that both
> are, especially if there had been multiple producers to this topic.
> I'm not completely certain about this timeline, but this sequence of events
> at least appears to be possible. Looking a bit through the controller code,
> there doesn't seem to be anything that forces {{LeaderAndIsrRequest}} to be
> sent in a particular order. If someone with more knowledge of the code base
> believes this is incorrect, I'd be happy to post the logs and/or do some more
> digging.
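> As a toy model of the hazard (hypothetical code, not Kafka's), suppose each
> broker simply applies leadership updates in the order they arrive, and a
> follower truncates to its high water mark on every leader change. Then the
> timing above, where B3 applies both updates before B2 applies either, is
> enough to produce the divergence:
> {code:java}
> // Hypothetical per-partition state machine, for illustration only.
> class PartitionState {
>     private String leader;
>     private long logEndOffset;
>     private long highWaterMark;
>
>     // Apply one LeaderAndIsr-style update in arrival order. Nothing here
>     // coordinates *when* different brokers apply the same update, which is
>     // the gap the timeline above exploits.
>     void apply(String newLeader, String self) {
>         if (newLeader.equals(leader)) {
>             return; // no leadership change for this partition
>         }
>         leader = newLeader;
>         if (!self.equals(newLeader)) {
>             // Becoming a follower: truncate back to the committed point.
>             // B3 does this early and refetches; B2 does it late, ending up
>             // with fewer messages than B3 expects it to have.
>             logEndOffset = highWaterMark;
>         }
>     }
> }
> {code}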