Geoff Anderson created KAFKA-2827:
-------------------------------------

             Summary: Producer metadata stall
                 Key: KAFKA-2827
                 URL: https://issues.apache.org/jira/browse/KAFKA-2827
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.9.0.1
            Reporter: Geoff Anderson


This is a tracking issue for a suspicious behavior that has been plaguing the 
stability of certain system tests.

Producer metadata requests and produce requests can be blocked for "quite a 
while" (tens of seconds to minutes) after a hard bounce of a broker, despite 
the fact that new leaders for the topicPartitions on the bounced broker have 
already been elected.

This shows up in particular during ducktape replication tests where the leader 
of a topic is bounced with a hard kill. The initial symptom was the console 
consumer timing out due to no messages being available for consumption during 
this lengthy producer "stall" period.

This behavior does not appear to be present in Kafka 0.8.2: 15 of 15 runs of 
the test in question passed with 0.8.2 substituted for trunk.

[~benstopford]'s findings so far:
Theory:
The first node, Node1, dies. A new controller is elected and it correctly 
reassigns the partition leaders to the other two nodes (Node2, Node3). 
However, for some reason, Node1 remains in the ISR list for two of the three 
partitions: 
Broker 3 cached leader info (Leader:3,ISR:3,2, AllReplicas:1,3,2) for partition test_topic,0
Broker 3 cached leader info (Leader:3,ISR:3,2,1, AllReplicas:3,2,1) for partition test_topic,2
Broker 3 cached leader info (Leader:2,ISR:2,1,3, AllReplicas:2,1,3) for partition test_topic,1
(A little later broker 1 is dropped from the ISR for partition 2; partition 1 
keeps all three replicas in its ISR throughout.)
Meanwhile the producer is sending requests with acks=-1 (illustrated in the 
sketch at the end of this walkthrough). These block on the server because the 
ISRs for partitions 1 and 2 still contain the dead Node1. As an aside, 
metadata requests from the producer are blocked too during this period (they 
can be seen sitting in the producer's request queue).
The consumer times out because it hasn't consumed any data within its timeout 
period (a timeout in the console_consumer). No data is produced during this 
period because the producer is blocked, usually for around 15 seconds, though 
this varies from run to run.
When Node1 is restarted everything clears: the fetchers on Node1 can 
replicate data again, the producer receives a metadata update, and production 
resumes. All is back to normal. The next node then goes down and the whole 
cycle repeats.
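
As a sketch of the acks=-1 behavior involved here (using the kafka-python 
client purely for illustration - the system tests actually drive the Java 
producer - with placeholder broker addresses, topic and timeouts): with 
acks=-1/all, the leader does not acknowledge a produce request until every 
replica in its current ISR has the data, so a dead broker that is still 
listed in the ISR holds the request on the server until the ISR shrinks or 
the request times out.

# Illustration only: acks=-1 (acks='all') semantics with the kafka-python
# client. The system tests use the Java producer; broker addresses, topic
# and timeouts here are placeholders.
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers=["node2:9092", "node3:9092"],
    acks="all",                # equivalent to acks=-1: wait for the whole ISR
    request_timeout_ms=30000,  # how long a request may be held before failing
)

future = producer.send("test_topic", b"some-value")
try:
    # While the dead broker is still in the ISR the leader cannot satisfy
    # acks=all, so this call blocks until the ISR shrinks or the request
    # times out - the "stall" described above.
    metadata = future.get(timeout=60)
    print("acked at offset %d" % metadata.offset)
except KafkaError as e:
    print("produce failed: %s" % e)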

Unanswered questions: 
I don't know why we end up in a situation where the ISR is empty (isr=[]). 
What I do know is that this always coincides with the partition being in a 
LEADER_NOT_AVAILABLE or REPLICA_NOT_AVAILABLE error state 
(partition_error_code=5/9).
To Do:
Dig into why the dead node is not removed from the ISR.
To get the test running for now, one option would be to wait for the ISR to 
return to isr=1,2,3 after each node is restarted; a sketch of such a wait 
follows.
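
This is roughly what that test-side wait could look like (not the actual 
kafkatest code; the ZooKeeper address, script path, replica set and timeouts 
are placeholders):

# Sketch of the proposed workaround: after restarting a broker, poll the
# topic description until every partition's ISR again contains all three
# replicas. Paths, addresses and timeouts below are placeholders.
import re
import subprocess
import time

def isr_fully_restored(topic, expected_replicas=frozenset({1, 2, 3}),
                       zookeeper="localhost:2181",
                       kafka_topics="bin/kafka-topics.sh"):
    out = subprocess.check_output(
        [kafka_topics, "--describe", "--zookeeper", zookeeper, "--topic", topic],
        universal_newlines=True)
    isrs = re.findall(r"Isr:\s*([\d,]+)", out)
    return bool(isrs) and all(
        set(map(int, isr.split(","))) == expected_replicas for isr in isrs)

def wait_for_full_isr(topic, timeout_sec=120, poll_sec=2):
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if isr_fully_restored(topic):
            return
        time.sleep(poll_sec)
    raise AssertionError("ISR did not return to the full replica set "
                         "within %d seconds" % timeout_sec)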
Sample Timeline
18:45:59,606 - consumer gets first disconnection error (n1 is dead)
18:45:59,500 - producer gets first disconnection error (n1 is dead)
18:46:06,001 - ZK times out node1
18:46:06,321 - Broker 3 (the controller) selects the new leader and ISR for 
test_topic,0: {"leader":3,"leader_epoch":1,"isr":[3,2]}
=> so it moves the leader but it doesn't change the ISR on the other two 
18:46:06,168 - Broker 2 is updated by Broker 3 with the ISR (leaders exclude 
n1 BUT the ISR still names all 3?)
18:46:09,582 - Consumer gives up at 10s timeout
18:46:14,866 - Controller is notified by ZK that node 1 is back.
18:46:15,845 - metadata update hits the producer (my guess is this is simply 
because node 1 is now available so replication can continue and release held up 
produce requests).
18:46:16,922 - producer disconnects from node 3 (node 3 dead)

If we need a workaround for the tests, we can simply set replica.lag.time.max.ms 
to a lower value, such as 1000ms (so far untested); a sketch of how that could 
be wired in follows.
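
Again untested: one way a test-setup step could inject that override before 
the brokers start (the properties path is a placeholder, and in practice this 
would go through whatever mechanism the test harness uses to configure 
brokers):

# Untested sketch: append the lower replica lag timeout to each broker's
# server.properties before startup. The path is a placeholder.
def lower_replica_lag_timeout(props_path="config/server.properties", lag_ms=1000):
    with open(props_path, "a") as f:
        f.write("\nreplica.lag.time.max.ms=%d\n" % lag_ms)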



