[ https://issues.apache.org/jira/browse/KAFKA-12252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Randall Hauch updated KAFKA-12252:
----------------------------------
    Fix Version/s: 2.6.3

> Distributed herder tick thread loops rapidly when worker loses leadership
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-12252
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12252
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>             Fix For: 3.0.0, 2.6.3, 2.7.2, 2.8.1
>
>
> When a new session key is read from the config topic, if the worker is the
> leader, it [schedules a new key
> rotation|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1579-L1581].
> The time between key rotations is configurable but defaults to an hour.
> The herder then continues its tick loop, which usually ends with a long poll
> for rebalance activity. However, when a key rotation is scheduled, it will
> [limit the time spent
> polling|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L384-L388]
> at the end of the tick loop in order to be able to perform the rotation.
> Once woken up, the worker checks to see if a key rotation is necessary and,
> if so, [sets the expected key rotation time to
> Long.MAX_VALUE|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L344],
> then [writes a new session key to the config
> topic|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L345-L348].
> The problem is, [the worker only ever decides a key rotation is necessary if
> it is still the
> leader|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L456-L469].
> If the worker is no longer the leader at the time of the key rotation
> (likely due to falling out of the cluster after losing contact with the group
> coordinator), its key expiration time won’t be reset, and the long poll for
> rebalance activity at the end of the tick loop will be given a timeout of 0
> ms and result in the tick loop being immediately restarted. Even if the
> worker reads a new session key from the config topic, it’ll continue looping
> like this since its scheduled key rotation won’t be updated. At this point,
> the only thing that would help the worker get back into a healthy state would
> be if it were made the leader of the cluster again.
> One possible fix could be to add a conditional check in the tick thread to
> only limit the time spent on rebalance polling if the worker is currently the
> leader.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
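
The timeout behavior described in the issue, and the conditional check it proposes, can be illustrated with a small standalone sketch. This is not the DistributedHerder code and not the patch that was eventually merged; the class name, method names, and parameters below are hypothetical and only model how the rebalance poll timeout might be derived from the scheduled key rotation time.

```java
// Hypothetical model of the poll-timeout calculation; none of these names
// come from the Kafka code base.
public class PollTimeoutSketch {

    // Behavior as described in the report: the poll is capped by the time
    // remaining until the scheduled key rotation, whether or not this worker
    // is still the leader. Once the rotation time has passed, the result is 0.
    static long pollTimeoutBuggy(long nowMs, long keyExpirationMs, long defaultTimeoutMs) {
        long untilRotationMs = Math.max(keyExpirationMs - nowMs, 0L);
        return Math.min(defaultTimeoutMs, untilRotationMs);
    }

    // Proposed check: only the leader limits the poll for the sake of key
    // rotation; every other worker uses the normal long poll.
    static long pollTimeoutFixed(long nowMs, long keyExpirationMs,
                                 long defaultTimeoutMs, boolean isLeader) {
        if (!isLeader) {
            return defaultTimeoutMs;
        }
        long untilRotationMs = Math.max(keyExpirationMs - nowMs, 0L);
        return Math.min(defaultTimeoutMs, untilRotationMs);
    }

    public static void main(String[] args) {
        long nowMs = 10_000L;
        long staleRotationMs = 9_000L;   // rotation was due one second ago
        long defaultTimeoutMs = 5_000L;  // assumed normal long-poll timeout

        // A worker that lost leadership keeps the stale rotation time,
        // so every tick polls with a timeout of 0 ms.
        System.out.println(pollTimeoutBuggy(nowMs, staleRotationMs, defaultTimeoutMs));

        // With the leadership check, the same worker falls back to the long poll.
        System.out.println(pollTimeoutFixed(nowMs, staleRotationMs, defaultTimeoutMs, false));
    }
}
```

In this model a non-leader never lets a stale rotation time shorten its poll, so the tick loop goes back to blocking on rebalance activity, while the leader still wakes up in time to rotate the key.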