Geoff Anderson created KAFKA-2827:
-------------------------------------

             Summary: Producer metadata stall
                 Key: KAFKA-2827
                 URL: https://issues.apache.org/jira/browse/KAFKA-2827
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.9.0.1
            Reporter: Geoff Anderson
This is a tracking issue for a suspicious behavior that has been plaguing the stability of certain system tests. Producer metadata requests and produce requests can be blocked for "quite a while" (tens of seconds to minutes) after a hard bounce of a broker, despite the fact that new leaders for the topicPartitions on the bounced broker have already been reelected.

This shows up in particular during ducktape replication tests where the leader of a topic is bounced with a hard kill. The initial symptom was the console consumer timing out because no messages were available for consumption during this lengthy producer "stall" period.

This behavior does not appear to be present in kafka 0.8.2: 15 of 15 runs of the test in question passed with 0.8.2 substituted for trunk.

[~benstopford]'s findings so far:

Theory:

The first node dies, Node1. A new controller is elected and it correctly reassigns the partition leaders to the other two nodes (Node2, Node3). However, for some reason, Node1 remains in the ISR list for two of the three partitions:

Broker 3 cached leader info (Leader:3,ISR:3,2, AllReplicas:1,3,2) for partition test_topic,0
Broker 3 cached leader info (Leader:3,ISR:3,2,1, AllReplicas:3,2,1) for partition test_topic,2
Broker 3 cached leader info (Leader:2,ISR:2,1,3, AllReplicas:2,1,3) for partition test_topic,1

(A little later broker 1 is dropped from the ISR for partition 2; partition 3 keeps all three replicas throughout.)

Meanwhile the producer is sending requests with acks=-1. These block on the server because the ISR for partitions 1,2 still contains the dead Node1. As an aside, metadata requests from the producer are blocked too during this period (they are visible in the producer's request queue).

The consumer times out because it hasn't consumed any data for its timeout period (this is a timeout in the console_consumer). No data is produced during this period because the producer is blocked. (The producer is usually blocked for around 15 seconds, but this varies from run to run.)

When Node1 is restarted everything clears, as the fetchers from Node1 can replicate data again. The producer is updated with metadata and continues producing. All is back to normal. Then the next node goes down and the whole cycle repeats.

Unanswered questions:

I don't know why we end up in a situation where the ISR is empty (isr=[]). What I do know is that this always coincides with the partition being in a REPLICA_NOT_AVAILABLE or LEADER_NOT_AVAILABLE error state (partition_error_code=5/9).

To Do:

Dig into why the dead node is not removed from the ISR.

To get the test running for now, one option would be to wait for the ISR to return to isr=1,2,3 after each node is restarted (a sketch of this is included after the timeline below).

Sample Timeline:

18:45:59,606 - consumer gets first disconnection error (n1 is dead)
18:45:59,500 - producer gets first disconnection error (n1 is dead)
18:46:06,001 - ZK times out node1
18:46:06,321 - Broker 3 (Controller) selects new leader and ISR {"leader":3,"leader_epoch":1,"isr":[3,2]} for test_topic,0 => so it moves the leader but doesn't change the ISR on the other two
18:46:06,168 - Broker 2 is updated by Broker 3 with ISR (leaders exclude n1, BUT the ISR still names all 3?)
18:46:09,582 - Consumer gives up at 10s timeout
18:46:14,866 - Controller is notified by ZK that node 1 is back.
18:46:15,845 - Metadata update hits the producer (my guess is this is simply because node 1 is now available, so replication can continue and release held-up produce requests).
18:46:16,922 - producer disconnects from node 3 (node 3 dead)

If we need a workaround for the tests, we can simply set replica.lag.time.max.ms to a lower value, like 1000ms (so far untested).
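For reference, that (so far untested) workaround is just a one-line override in each broker's server.properties; 1000ms is the value suggested above, versus the 10000ms default:

    # Intended to drop a follower that has stopped fetching from the ISR
    # after ~1s instead of ~10s, so that acks=-1 produce requests held on
    # its partitions are released sooner. Untested for this issue.
    replica.lag.time.max.ms=1000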
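And for the to-do item about waiting for the ISR to return to isr=1,2,3 after each restart, here is a minimal sketch of what the test-side check could look like, assuming a 3-broker cluster, ZooKeeper at localhost:2181 and kafka-topics.sh on the PATH; the helpers (describe_isr, wait_for_full_isr) are illustrative, not part of the existing kafkatest/ducktape code:

    import re
    import subprocess
    import time

    def describe_isr(topic, partition, zookeeper="localhost:2181"):
        """Return the ISR of topic/partition as a set of broker ids,
        parsed from the output of kafka-topics.sh --describe."""
        out = subprocess.check_output(
            ["kafka-topics.sh", "--describe",
             "--zookeeper", zookeeper, "--topic", topic]).decode("utf-8")
        for line in out.splitlines():
            # Partition lines look like:
            #   Topic: test_topic  Partition: 0  Leader: 3  Replicas: 1,3,2  Isr: 3,2
            m = re.search(r"Partition:\s*%d\b.*Isr:\s*([\d,]+)" % partition, line)
            if m:
                return set(int(b) for b in m.group(1).split(","))
        return set()

    def wait_for_full_isr(topic, partition, expected=(1, 2, 3), timeout_sec=60):
        """Poll until the ISR contains every expected broker id, or fail."""
        deadline = time.time() + timeout_sec
        while time.time() < deadline:
            if describe_isr(topic, partition) == set(expected):
                return
            time.sleep(1)
        raise AssertionError("ISR of %s,%d did not return to %s within %ds"
                             % (topic, partition, sorted(expected), timeout_sec))

    # In the replication test, after restarting a bounced broker and before
    # hard-killing the next one:
    #   for p in range(3):
    #       wait_for_full_isr("test_topic", p)

The kafkatest services may already expose a cleaner way to read the ISR; the point is only to gate the next hard kill on the ISR being fully restored.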