[ https://issues.apache.org/jira/browse/KAFKA-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismael Juma updated KAFKA-2827:
-------------------------------
    Affects Version/s:     (was: 0.9.0.1)
                           0.9.0.0

> Producer metadata stall
> -----------------------
>
>                 Key: KAFKA-2827
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2827
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.0
>            Reporter: Geoff Anderson
>
> This is a tracking issue for a suspicious behavior that has been plaguing the stability of certain system tests.
>
> Producer metadata requests and produce requests can be blocked for "quite a while" (tens of seconds to minutes) after a hard bounce of a broker, even though new leaders for the topicPartitions on the bounced broker have already been elected.
>
> This shows up in particular during ducktape replication tests where the leader of a topic is bounced with a hard kill. The initial symptom was the console consumer timing out because no messages were available for consumption during this lengthy producer "stall" period.
>
> This behavior does not appear to be present in Kafka 0.8.2: 15 of 15 runs of the test in question passed with 0.8.2 substituted for trunk.
>
> [~benstopford]'s findings so far:
>
> Theory:
> The first node dies, Node1. A new controller is elected and it correctly reassigns the partition leaders to the other two nodes (Node2, Node3). However, for some reason, Node1 remains in the ISR list for two of the three partitions:
> Broker 3 cached leader info (Leader:3,ISR:3,2, AllReplicas:1,3,2) for partition test_topic,0
> Broker 3 cached leader info (Leader:3,ISR:3,2,1, AllReplicas:3,2,1) for partition test_topic,2
> Broker 3 cached leader info (Leader:2,ISR:2,1,3, AllReplicas:2,1,3) for partition test_topic,1
> (A little later broker 1 is dropped from the ISR for partition 2; partition 1 keeps all three replicas throughout.)
>
> Meanwhile the producer is sending requests with acks=-1. These block on the server because the ISRs for partitions 1 and 2 still contain the dead Node1. As an aside, metadata requests from the producer are blocked too during this period (they are present if you look in the producer's request queue).
>
> The consumer times out because it hasn't consumed any data for its timeout period (this is a timeout in the console_consumer). No data is produced during this period because the producer is blocked. (The producer is usually blocked for around 15 seconds, but this varies from run to run.)
>
> When Node1 is restarted everything clears, as the fetchers from Node1 can replicate data again. The producer is updated with metadata and continues producing. All is back to normal.
>
> The next node goes down and the whole cycle loops round again.
>
> Unanswered questions:
> I don't know why we end up in a situation where the ISR is empty (isr=[]). What I do know is that this always coincides with the partition being in a REPLICA_NOT_AVAILABLE or LEADER_NOT_AVAILABLE error state (partition_error_code=5/9).
>
> To Do:
> Dig into why the dead node is not removed from the ISR.
> To get the test running for now, one option would be to wait for the ISR to return to isr=1,2,3 after each node is restarted, as in the sketch below.
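>
> A minimal, untested sketch of that test-side wait. It assumes the test driver can shell out to kafka-topics.sh, that the --describe output contains an "Isr:" column, and that broker ids are 1,2,3 as in this test; the script path and ZooKeeper address are placeholders:
> {code:python}
> # Poll kafka-topics.sh --describe until every partition of the topic reports
> # the full ISR again. Paths, hosts and broker ids below are placeholders.
> import re
> import subprocess
> import time
>
> def isr_complete(topic, expected_isr=("1", "2", "3"),
>                  zookeeper="localhost:2181",
>                  kafka_topics="/opt/kafka/bin/kafka-topics.sh"):
>     """Return True if every partition's ISR contains all expected broker ids."""
>     out = subprocess.check_output(
>         [kafka_topics, "--zookeeper", zookeeper, "--describe", "--topic", topic],
>         universal_newlines=True)
>     isrs = re.findall(r"Isr:\s*([\d,]+)", out)
>     return bool(isrs) and all(set(expected_isr) <= set(i.split(",")) for i in isrs)
>
> def wait_for_full_isr(topic, timeout_sec=60, poll_sec=2):
>     """Block until the ISR is back to full, or fail after timeout_sec."""
>     deadline = time.time() + timeout_sec
>     while time.time() < deadline:
>         if isr_complete(topic):
>             return
>         time.sleep(poll_sec)
>     raise AssertionError("ISR did not return to full within %ds" % timeout_sec)
> {code}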
> Sample Timeline
> 18:45:59,606 - consumer gets first disconnection error (n1 is dead)
> 18:45:59,500 - producer gets first disconnection error (n1 is dead)
> 18:46:06,001 - ZK times out node1
> 18:46:06,321 - Broker 3 (Controller) selects new leader and ISR {"leader":3,"leader_epoch":1,"isr":[3,2]} for test_topic,0
> => so it moves the leader but it doesn't change the ISR on the other two
> 18:46:06,168 - Broker 2 is updated by Broker 3 with ISR (leaders exclude n1, BUT the ISR still names all 3?)
> 18:46:09,582 - Consumer gives up at 10s timeout
> 18:46:14,866 - Controller is notified by ZK that node 1 is back.
> 18:46:15,845 - Metadata update hits the producer (my guess is this is simply because node 1 is now available, so replication can continue and release the held-up produce requests).
> 18:46:16,922 - Producer disconnects from node 3 (node 3 dead)
>
> If we need a workaround for the tests we can simply set replica.lag.time.max.ms to a lower value, like 1000 ms (so far untested); see the snippet below.
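>
> The suggested broker override, so far untested per the note above (replica.lag.time.max.ms defaults to 10000 ms):
> {code}
> # server.properties override for the test brokers only: drop a lagging/dead
> # replica from the ISR after ~1 second instead of the default 10 seconds.
> replica.lag.time.max.ms=1000
> {code}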