[ 
https://issues.apache.org/jira/browse/KAFKA-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-2827:
-------------------------------
    Affects Version/s:     (was: 0.9.0.1)
                       0.9.0.0

> Producer metadata stall
> -----------------------
>
>                 Key: KAFKA-2827
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2827
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.0
>            Reporter: Geoff Anderson
>
> This is a tracking issue for a suspicious behavior that has been plaguing the 
> stability of certain system tests.
> Producer metadata requests and produce requests can be blocked for quite a 
> while (tens of seconds to minutes) after a hard bounce of a broker, even 
> though new leaders for the topicPartitions on the bounced broker have already 
> been re-elected.
> This shows up in particular during ducktape replication tests where the 
> leader of a topic is bounced with a hard kill. The initial symptom was the 
> console consumer timing out due to no messages being available for 
> consumption during this lengthy producer "stall" period.
> This behavior does not appear to be present in Kafka 0.8.2: 15 of 15 runs of 
> the test in question passed with 0.8.2 substituted for trunk.
> [~benstopford]'s findings so far:
> Theory:
> The first node, Node1, dies. A new controller is elected and it correctly 
> reassigns the partition leaders to the other two nodes (Node2, Node3). 
> However, for some reason, Node1 remains in the ISR list for two of the three 
> partitions:
> Broker 3 cached leader info (Leader:3,ISR:3,2, AllReplicas:1,3,2) for 
> partition test_topic,0
> Broker 3 cached leader info (Leader:3,ISR:3,2,1, AllReplicas:3,2,1) for 
> partition test_topic,2
> Broker 3 cached leader info (Leader:2,ISR:2,1,3, AllReplicas:2,1,3) for 
> partition test_topic,1
> (a little later broker 1 is dropped from the ISR for partition 2; partition 3 
> keeps all three replicas throughout)
> Meanwhile the producer is sending requests with acks=-1. These block on the 
> server because the ISRs for partitions 1 and 2 still contain the dead Node1. 
> As an aside, the producer's metadata requests are also blocked during this 
> period (they are visible if you look in the producer's request queue).
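> For reference, here is a minimal sketch of the producer side of this 
> scenario. The bootstrap addresses, topic name, and serializers are 
> illustrative placeholders, not taken from the test itself:
>
> import java.util.Properties;
> import java.util.concurrent.Future;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
> import org.apache.kafka.clients.producer.RecordMetadata;
>
> public class AcksAllProducer {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Placeholder addresses; the test runs its own three-node cluster.
>         props.put("bootstrap.servers", "node1:9092,node2:9092,node3:9092");
>         // acks=-1 (a.k.a. "all"): the leader responds only once the full ISR
>         // has the record, which is why a dead broker stuck in the ISR blocks
>         // produce requests.
>         props.put("acks", "-1");
>         props.put("key.serializer",
>             "org.apache.kafka.common.serialization.StringSerializer");
>         props.put("value.serializer",
>             "org.apache.kafka.common.serialization.StringSerializer");
>
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>             Future<RecordMetadata> ack =
>                 producer.send(new ProducerRecord<>("test_topic", "key", "value"));
>             // While a dead broker lingers in the partition's ISR, this get()
>             // sits for the length of the observed stall (until the ISR shrinks
>             // or the broker returns and re-replicates).
>             ack.get();
>         }
>     }
> }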
> The consumer times out as it hasn't consumed any data for its timeout period 
> (this is a timeout in the console_consumer). This is because no data is 
> produced during this period, as the producer is blocked. (The producer is 
> usually blocked for around 15 seconds, but this varies from run to run.)
> When Node1 is restarted, everything clears because the fetchers on Node1 can 
> replicate data again. The producer is updated with metadata and continues 
> producing; all is back to normal. Then the next node goes down and the whole 
> cycle repeats.
> Unanswered questions:
> I don't know why we end up in a situation where the ISR is empty (isr=[]). 
> What I do know is that this always coincides with the partition being in a 
> REPLICA_NOT_AVAILABLE or LEADER_NOT_AVAILABLE error state 
> (partition_error_code=5/9).
> To do:
> Dig into why the dead node is not removed from the ISR.
> To get the test running for now, one option would be to wait for the ISR to 
> return to isr=1,2,3 after each node is restarted (see the sketch below).
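> A rough sketch of such a wait, using the Java AdminClient (which only exists 
> in clients newer than 0.9; the ducktape test would get the same information 
> by polling kafka-topics.sh --describe). The bootstrap address, topic name, 
> and helper name here are assumptions for illustration:
>
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.TopicDescription;
>
> public class WaitForIsr {
>     // Blocks until every partition of the topic reports an ISR of the
>     // expected size (e.g. 3 for isr=1,2,3).
>     public static void waitForFullIsr(String bootstrap, String topic,
>                                       int replicationFactor) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
>         try (AdminClient admin = AdminClient.create(props)) {
>             while (true) {
>                 TopicDescription desc = admin
>                     .describeTopics(Collections.singletonList(topic))
>                     .all().get().get(topic);
>                 boolean fullIsr = desc.partitions().stream()
>                     .allMatch(p -> p.isr().size() == replicationFactor);
>                 if (fullIsr) {
>                     return;            // all replicas are back in sync
>                 }
>                 Thread.sleep(1000);    // poll once a second
>             }
>         }
>     }
> }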
> Sample Timeline
> 18:45:59,606 - consumer gets first disconnection error (n1 is dead)
> 18:45:59,500 - producer gets first disconnection error (n1 is dead)
> 18:46:06,001 - ZK times out node1
> 18:46:06,321 - Broker 3 (controller) selects the new leader and ISR for 
> test_topic,0: {"leader":3,"leader_epoch":1,"isr":[3,2]} 
> => so it moves the leader but it doesn't change the ISR on the other two 
> partitions
> 18:46:06,168 - Broker 2 is updated by Broker 3 with ISR (leaders exclude n1 
> BUT ISR still names all 3?)
> 18:46:09,582 - Consumer gives up at 10s timeout
> 18:46:14,866 - Controller is notified by ZK that node 1 is back.
> 18:46:15,845 - metadata update hits the producer (my guess is this is simply 
> because node 1 is now available so replication can continue and release held 
> up produce requests).
> 18:46:16,922 - producer disconnects from node 3 (node 3 dead)
> If we need a workaround for the tests, we can simply set 
> replica.lag.time.max.ms to a lower value, like 1000ms, so that the dead 
> broker is dropped from the ISR sooner (so far untested).
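> On the broker side that would be a single server.properties override (again 
> untested; the value is just the one suggested above):
>
> # Drop a lagging/dead replica from the ISR after 1s (lower than the default)
> replica.lag.time.max.ms=1000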



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
