Chris Egerton created KAFKA-12252:
-------------------------------------
Summary: Distributed herder tick thread loops rapidly when worker
loses leadership
Key: KAFKA-12252
URL: https://issues.apache.org/jira/browse/KAFKA-12252
Project: Kafka
Issue Type: Bug
Components: KafkaConnect
Reporter: Chris Egerton
Assignee: Chris Egerton
When a new session key is read from the config topic, if the worker is the
leader, it [schedules a new key
rotation|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1579-L1581].
The time between key rotations is configurable but defaults to an hour.
The herder then continues its tick loop, which usually ends with a long poll
for rebalance activity. However, when a key rotation is scheduled, it will
[limit the time spent
polling|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L384-L388]
at the end of the tick loop in order to be able to perform the rotation.
Once woken up, the worker checks to see if a key rotation is necessary and, if
so, [sets the expected key rotation time to
Long.MAX_VALUE|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L344],
then [writes a new session key to the config
topic|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L345-L348].
The problem is that [the worker only ever decides a key rotation is necessary if
it is still the
leader|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L456-L469].
If the worker is no longer the leader at the time of the key rotation (likely
due to falling out of the cluster after losing contact with the group
coordinator), its key expiration time is never reset. Since that time is now in
the past, the long poll for rebalance activity at the end of the tick loop is
given a timeout of 0 ms and returns immediately, so the tick loop restarts right
away. Even if the worker reads a new session key from the config topic, it’ll
continue looping like this, since only the leader schedules a new key rotation
and the stale rotation time is left in place. At this point, the only thing that
would return the worker to a healthy state is being made the leader of the
cluster again.
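To make the failure mode concrete, here is a minimal sketch of the behavior
described above. It is a simplified model, not the actual DistributedHerder
code: the class, field, and method names are hypothetical, and other inputs to
the poll timeout (such as scheduled rebalances) are left out.

```java
// Simplified, hypothetical model of the tick loop's timeout handling.
// Not the actual DistributedHerder implementation.
class TickTimeoutSketch {
    // Scheduled key rotation time in ms; Long.MAX_VALUE means "none scheduled".
    long keyExpiration = Long.MAX_VALUE;
    boolean leader;

    // As described above, a rotation is only deemed necessary on the leader.
    boolean rotationNeeded(long now) {
        return leader && keyExpiration <= now;
    }

    // One pass of the tick loop. Returns the timeout (ms) that would be
    // handed to the rebalance poll at the end of the loop.
    long tick(long now) {
        if (rotationNeeded(now)) {
            // The reset only happens on this path, i.e. only on the leader.
            keyExpiration = Long.MAX_VALUE;
        }
        long pollTimeoutMs = Long.MAX_VALUE;
        if (keyExpiration < Long.MAX_VALUE) {
            // An expiration in the past clamps the timeout to 0.
            pollTimeoutMs = Math.min(pollTimeoutMs, Math.max(keyExpiration - now, 0));
        }
        return pollTimeoutMs;
    }
}
```

In this model, a non-leader whose keyExpiration lies in the past gets a timeout
of 0 from every call to tick, while the same state on a leader is cleared on the
first call.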
One possible fix is to add a check in the tick thread so that the time spent on
rebalance polling is limited only if the worker is currently the leader.
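A rough sketch of what such a check might look like follows. The class and
method names are hypothetical and do not come from the Kafka code base, and the
sketch considers only the key rotation deadline, ignoring any other deadlines
that may also bound the poll.

```java
// Hypothetical illustration of the proposed leader check; not actual Kafka code.
class LeaderGuardedTimeoutSketch {
    // Computes the rebalance poll timeout in ms. The key rotation deadline
    // is honored only when this worker is currently the leader.
    static long pollTimeoutMs(boolean leader, long keyExpiration, long now) {
        long timeoutMs = Long.MAX_VALUE;
        if (leader && keyExpiration < Long.MAX_VALUE) {
            timeoutMs = Math.min(timeoutMs, Math.max(keyExpiration - now, 0));
        }
        return timeoutMs;
    }
}
```

With this guard, a non-leader holding a stale rotation time would fall back to
the normal long poll, while a leader would still wake up in time to rotate the
key.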
--
This message was sent by Atlassian Jira
(v8.3.4#803005)