[ https://issues.apache.org/jira/browse/KAFKA-15238 ]
Yash Mayya deleted comment on KAFKA-15238:
------------------------------------

    was (Author: yash.mayya):
https://github.com/apache/kafka/pull/14079

> Connect workers can be disabled by DLQ-related blocking admin client calls
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-15238
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15238
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Yash Mayya
>            Assignee: Yash Mayya
>            Priority: Major
>
> When Kafka Connect is run in distributed mode - if a sink connector's task is
> restarted (via a worker's REST API), the following sequence of steps will
> occur (on the DistributedHerder's thread):
>
> # The existing sink task will be stopped
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1367])
> # A new sink task will be started
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1867C40-L1867C40])
> # As a part of the above step, a new {{WorkerSinkTask}} will be instantiated
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L656-L663])
> # The DLQ reporter (see
> [KIP-298|https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect])
> for the sink task is also instantiated and configured as a part of this
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L1800])
> # The DLQ reporter setup involves two synchronous admin client calls to list
> topics and create the DLQ topic if it doesn't already exist
> ([ref|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/DeadLetterQueueReporter.java#L84-L87])
>
> All of these steps occur synchronously on the herder's tick thread, in the
> portion
> [here|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L457-L469]
> where external requests are run. If the admin client call in the DLQ
> reporter setup step blocks for some time (due to auth failures and retries,
> network issues, or any other reason), this can cause the Connect worker
> to become non-functional (REST API requests will time out) and even fall out
> of the Connect cluster and become a zombie (since the tick thread also drives
> group membership functions - see
> [here|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L403],
> [here|https://github.com/apache/kafka/blob/4981fa939d588645401619bfc3e321dc523d10e7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L535]).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
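
For readers who want to see the failure mode described in the issue in isolation, here is a minimal sketch using only the JDK. It is not Connect code: the single-threaded executor stands in for the herder's tick thread, and {{slowDlqSetup}}, {{restRequestDelayMillis}} and the class name are hypothetical names invented for this illustration. The "offload" branch shows one possible mitigation (moving the blocking setup off the tick thread) and is not necessarily what the linked pull request does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TickThreadSketch {

    // Hypothetical stand-in for the blocking list-topics / create-topic
    // admin client calls made during DLQ reporter setup.
    static void slowDlqSetup(long blockMillis) {
        try {
            Thread.sleep(blockMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns how long (in ms) a trivial follow-up request had to wait
    // before it could run on the single "tick" thread.
    static long restRequestDelayMillis(long blockMillis, boolean offload) {
        ExecutorService tickThread = Executors.newSingleThreadExecutor();
        ExecutorService startupPool = Executors.newCachedThreadPool();
        try {
            long start = System.nanoTime();

            // The "restart task" request, handled on the tick thread.
            tickThread.submit(() -> {
                if (offload) {
                    // Assumed mitigation: run the blocking setup elsewhere.
                    startupPool.submit(() -> slowDlqSetup(blockMillis));
                } else {
                    // Behaviour described in the issue: block the tick thread.
                    slowDlqSetup(blockMillis);
                }
            });

            // Any later request (REST call, group membership work) queues
            // behind the restart on the same thread.
            Callable<Long> probe = System::nanoTime;
            Future<Long> ranAt = tickThread.submit(probe);
            return TimeUnit.NANOSECONDS.toMillis(ranAt.get() - start);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            tickThread.shutdownNow();
            startupPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline:    " + restRequestDelayMillis(500, false) + " ms");
        System.out.println("offloaded: " + restRequestDelayMillis(500, true) + " ms");
    }
}
```

With the setup run inline, the follow-up request waits for at least the full blocking duration; with the setup offloaded, it should run almost immediately. In a real worker the same delay would apply to heartbeats and rebalance handling, which is how a long enough block can push the worker out of the group.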