[ 
https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335638#comment-17335638
 ] 

Chris Egerton commented on KAFKA-12726:
---------------------------------------

Ahhh, that makes sense. We wrestled with this a bit in KAFKA-9374, and the 
unfortunate conclusion was that you basically can't forcibly kill a thread in 
Java without resorting to 
[this|https://stackoverflow.com/questions/5241822/is-there-a-good-way-to-forcefully-stop-a-java-thread/32909191#32909191].
The current approach is to accept resource leakage if a task is 
irretrievably blocked and, once the graceful shutdown period has elapsed, 
allow a new task to be brought up in its place.
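
To make the "allow leakage and move on" idea concrete, here is a minimal sketch in plain Java. It is not the Connect runtime's code; the class name {{AbandonBlockedTaskSketch}} and the helper {{awaitStop}} are made up for illustration. It waits for a task thread for a bounded time and, if the thread is still blocked, simply gives up on it so a replacement could be started.

```java
import java.util.concurrent.CountDownLatch;

public class AbandonBlockedTaskSketch {

    /**
     * Hypothetical helper: waits up to timeoutMs (must be > 0, since join(0)
     * waits forever) for the task thread to exit. Returns false if the thread
     * is still alive, in which case the caller abandons (leaks) it.
     */
    public static boolean awaitStop(Thread taskThread, long timeoutMs) throws InterruptedException {
        taskThread.join(timeoutMs);
        return !taskThread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate an irretrievably blocked task: it waits on a latch nobody releases.
        CountDownLatch never = new CountDownLatch(1);
        Thread blocked = new Thread(() -> {
            try {
                never.await();
            } catch (InterruptedException e) {
                // A real task might swallow interrupts too; nothing to do here.
            }
        }, "blocked-task");
        blocked.setDaemon(true); // lets the JVM exit even though the thread is leaked
        blocked.start();

        boolean stopped = awaitStop(blocked, 200);
        System.out.println("stopped in time: " + stopped);
        // At this point a replacement task could be brought up; the old thread is leaked.
    }
}
```

The leaked thread keeps whatever resources it holds, which is exactly the trade-off described above.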

It sounds like you're proposing that, if a task has exhausted its graceful 
shutdown period, we invoke {{Task::stop}} from a separate thread, even if the 
task is blocked in the middle of a call to something else like {{preCommit}}, 
{{commitRecord}}, {{put}}, {{poll}}, etc. Is that correct?
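
If I'm reading the proposal right, it would look roughly like the sketch below. Everything here is hypothetical: the {{Task}} interface is a two-method stand-in rather than the real Connect API, and {{shutDown}}, {{BlockingTask}}, and the timeout parameters are invented for illustration. The idea is that once the graceful period lapses, {{stop}} is called from a separate thread (itself bounded by a timeout) while the task thread is still blocked in {{poll}}.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ForcedStopSketch {

    /** Hypothetical stand-in for a Connect task; not the real Kafka Connect interface. */
    public interface Task {
        void poll() throws InterruptedException;
        void stop();
    }

    /** A task whose poll() blocks until stop() is called from another thread. */
    public static class BlockingTask implements Task {
        private final CountDownLatch stopRequested = new CountDownLatch(1);
        @Override public void poll() throws InterruptedException { stopRequested.await(); }
        @Override public void stop() { stopRequested.countDown(); }
    }

    /**
     * Waits gracefulMs for the task thread to exit on its own. If it is still
     * alive, calls task.stop() from a separate thread, waiting at most stopMs
     * for that call, then waits up to stopMs more for the task thread to exit.
     * Both timeouts must be > 0. Returns true if the task thread exited.
     */
    public static boolean shutDown(Task task, Thread taskThread, long gracefulMs, long stopMs)
            throws Exception {
        taskThread.join(gracefulMs);
        if (!taskThread.isAlive()) {
            return true;
        }
        ExecutorService stopper = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "task-stopper");
            t.setDaemon(true); // a stuck stop() must not keep the JVM alive
            return t;
        });
        try {
            Future<?> stopCall = stopper.submit(task::stop);
            try {
                stopCall.get(stopMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                stopCall.cancel(true); // stop() itself is stuck; give up on it
            }
        } finally {
            stopper.shutdownNow();
        }
        taskThread.join(stopMs);
        return !taskThread.isAlive();
    }

    public static void main(String[] args) throws Exception {
        BlockingTask task = new BlockingTask();
        Thread taskThread = new Thread(() -> {
            try {
                task.poll();
            } catch (InterruptedException e) {
                // treated as a request to exit
            }
        }, "task-thread");
        taskThread.setDaemon(true);
        taskThread.start();

        System.out.println("task thread exited: " + shutDown(task, taskThread, 100, 1000));
    }
}
```

This only helps if the task's {{stop}} is safe to call concurrently with whatever the task thread is blocked in, and if {{stop}} actually unblocks it; a task that ignores {{stop}} would still be leaked.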

I was thinking that {{BlockingConnectorTest}} could be a starting point if 
you'd like to reproduce and/or test for this problem, rather than something 
that already covers this scenario as-is. Just a thought; it may not be the 
right tool for the job, so no worries if something else works better.


> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
>                 Key: KAFKA-12726
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12726
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: Ryanne Dolan
>            Assignee: Ryanne Dolan
>            Priority: Minor
>
> We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
> in a retry loop). Despite Connect supporting a property 
> task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks 
> can take as long as they want to stop, and the only consequence is an error 
> message.
> Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can 
> prevent any further Tasks from stopping. Moreover, after a rebalance, these 
> lingering tasks can persist along with their replacements. For example, we've 
> seen a Worker's "task-count" metric double following a rebalance.
> While the Connector implementation is ultimately to blame here -- a Task 
> probably shouldn't loop forever in stop() -- we believe the Connect runtime 
> should handle this situation more gracefully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)