[ 
https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335592#comment-17335592
 ] 
Ryanne Dolan commented on KAFKA-12726:
--------------------------------------

[~ChrisEgerton] ah, indeed I have conflated WorkerTask.stop() and Task.stop(). 
If I'm (re-)reading this correctly, Task.stop() is called from 
WorkerTask.close() 
[here|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L169]
 (not WorkTask.stop()...) which is called at the end of the WorkerTask's main 
loop 
[here|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L197].

wrt KAFKA-10792, we've observed the problem while only running SinkConnectors, 
so I don't think that particular fix can be related.

So it appears the problem still exists -- a stuck Task.stop() will prevent the 
WorkerTask from closing, as we've observed in production -- but it's clear I'm 
mistaken about 1) my remark that Task.stop()s are sequential (they are not) and 
2) where this fix needs to go :)

Lemme relocate this logic to WorkerTask.doClose(). However, this begs the 
question: should we be doing this for every Task method? Seems any stuck method 
would yield the same behavior.

wrt BlockingConnectorTest, that is indeed where I started my investigation, but 
the existing tests don't seem to capture this issue. I'll see if I can add a 
test to repro and show it failing. My understanding is that 
BlockingConnectorTest is only testing whether subsequent Tasks can created and 
run, but doesn't test that stopped Tasks are ever actually stopped: 
https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/test/java/org/apache/kafka/connect/integration/BlockingConnectorTest.java#L219

In that test, I believe the blocked Task will be leaked, which is what we're 
observing in production.

> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
>                 Key: KAFKA-12726
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12726
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: Ryanne Dolan
>            Assignee: Ryanne Dolan
>            Priority: Minor
>
> We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
> in a retry loop). Despite Connect supporting a property 
> task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks 
> can take as long as they want to stop, and the only consequence is an error 
> message.
> Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can 
> prevent any further Tasks from stopping. Moreover, after a rebalance, these 
> lingering tasks can persist along with their replacements. For example, we've 
> seen a Worker's "task-count" metric double following a rebalance.
> While the Connector implementation is ultimately to blame here -- a Task 
> probably shouldn't loop forever in stop() -- we believe the Connect runtime 
> should handle this situation more gracefully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to