Chris Egerton created KAFKA-17252:
-------------------------------------

             Summary: Forwarded source task zombie fencings may fail when 
leader has just started
                 Key: KAFKA-17252
                 URL: https://issues.apache.org/jira/browse/KAFKA-17252
             Project: Kafka
          Issue Type: Bug
          Components: connect
            Reporter: Chris Egerton
            Assignee: Chris Egerton


We've observed some flaky integration test failures such as [this 
one|https://ge.apache.org/s/52il7msnknzp2/tests/task/:connect:mirror:test/details/org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest/testMultiNodeCluster()?top-execution=1]
 where a source task fails to start with exactly-once support enabled with this 
stack trace:
{code:java}
org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: This worker 
is still starting up and has not been able to read a session key from the 
config topic yet
        at 
org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:186)
        at 
org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:140)
        at 
org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:101)
        at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$fenceZombieSourceTasks$23(DistributedHerder.java:1329)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583) {code}
This occurs because the leader has not yet read (or believes it has not yet 
read) a session key from the config topic.

However, in a cluster where all nodes have always used the sessioned rebalance 
protocol, this scenario should be impossible: there must be a session key 
present in the topic in order for a leader to handle external requests (such as 
creating connectors), and all workers must read to the end of all internal 
topics before joining the cluster.

The cause of this failure is that, during startup, session keys read from the 
config topic are ignored. The herder does [check its config state snapshot for 
a session 
key|https://github.com/apache/kafka/blob/da14b5a61dc90fc70748278c98ce312a7a433c0d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L463]
 in its tick thread loop, which should help, but it's possible that a worker 
joins the cluster, becomes the leader, and receives a request from a follower 
to fence a zombie source task before this check occurs, which will then cause 
the leader to response with a 503 error, failing the task.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to