[ https://issues.apache.org/jira/browse/FLINK-14949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982305#comment-16982305 ]
Hwanju Kim commented on FLINK-14949:
------------------------------------

[~azagrebin], thanks for the quick answer and sure, I can work on this.

> Task cancellation can be stuck against out-of-thread error
> ----------------------------------------------------------
>
>                 Key: FLINK-14949
>                 URL: https://issues.apache.org/jira/browse/FLINK-14949
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.8.2
>            Reporter: Hwanju Kim
>            Priority: Major
>
> Task cancellation
> ([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991])
> relies on three separate threads: _TaskCanceler_, _TaskInterrupter_, and
> _TaskCancelerWatchdog_. TaskCanceler performs the cancellation itself,
> TaskInterrupter periodically interrupts a task that does not react, and
> TaskCancelerWatchdog kills the JVM if cancellation has not finished within
> a timeout (3 minutes by default). Together they ensure that cancellation
> either completes or is aborted, so the task reaches a terminal state in
> finite time (FLINK-4715).
>
> However, if creating any of these asynchronous threads fails, for example
> because the process has run out of threads (_java.lang.OutOfMemoryError:
> unable to create new native thread_), the code has already transitioned
> to CANCELING, but nothing performs the cancellation and no watchdog
> watches it. Currently, the jobmanager does [retry
> cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121]
> on any returned error, but the next retry [returns success once it sees
> CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997],
> assuming that cancellation is in progress. The task is then stuck in
> CANCELING for good. Because CANCELING is a non-terminal state, the state
> machine cannot make progress after that.
>
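To make the failure mode described above concrete, here is a minimal toy model, not Flink's actual code. `ToyTask`, `ThreadStarter`, and `runScenario` are hypothetical names introduced only for illustration, and the `OutOfMemoryError` is thrown by hand to simulate thread exhaustion rather than provoked for real.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the stuck-in-CANCELING scenario; all names are hypothetical.
class StuckCancelDemo {
    enum State { RUNNING, CANCELING, CANCELED }

    /** Stand-in for whatever starts the canceler, interrupter, and watchdog threads. */
    interface ThreadStarter {
        void start(Runnable work);
    }

    static final class ToyTask {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;

        ToyTask(ThreadStarter starter) {
            this.starter = starter;
        }

        /** Returns normally when cancellation is believed to be under way. */
        void cancel() {
            if (state.get() == State.CANCELING) {
                return; // a retry assumes cancellation is already in progress
            }
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                // If this call throws, the state has already been changed.
                starter.start(() -> state.set(State.CANCELED));
            }
        }
    }

    /** First attempt plus one retry against a starter that cannot create threads. */
    static State runScenario() {
        ThreadStarter exhausted = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        ToyTask task = new ToyTask(exhausted);
        try {
            task.cancel(); // fails after the transition to CANCELING
        } catch (OutOfMemoryError e) {
            // the caller would retry on error, as the jobmanager does in the report
        }
        task.cancel(); // the retry returns without doing anything
        return task.state.get();
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

In this model `runScenario()` returns `CANCELING`: the first attempt changes the state before thread creation fails, and the retry then treats the state as proof that cancellation is running.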
> One solution: if a task has transitioned to CANCELING but then hits a
> fatal error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true), which
> indicates that it may not have managed to spawn TaskCancelerWatchdog, it
> could immediately treat this as a fatal error (not safely cancellable)
> and call _notifyFatalError_, just as TaskCancelerWatchdog does, but
> eagerly and synchronously. That way, the task can at least leave the
> non-terminal state, and restarting the JVM also clears any leaked threads
> or memory. The same method is also invoked by _failExternally_, but the
> transition to FAILED seems less critical because FAILED is already a
> terminal state.
>
> Reproducing this is straightforward: run an application with multiple
> tasks that keeps creating threads in a loop, each of which never
> finishes, so that one task triggers a failure and the others are then
> cancelled by the full failover. In the web UI dashboard, tasks on a task
> manager where any of the cancellation-related threads failed to spawn
> stay stuck in CANCELING for good.
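The proposed eager handling can be sketched roughly as follows. This is an illustration under stated assumptions, not the eventual patch: `ToyTask`, `ThreadStarter`, and `handlerInvokedOnOom` are hypothetical, `isFatalOrOom` is a simplified stand-in for the _isJvmFatalOrOutOfMemoryError_ check, and the `Consumer<Throwable>` plays the role of _notifyFatalError_, which in the real system would lead to a JVM restart.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Sketch of escalating synchronously when cancellation threads cannot be spawned.
class EagerFatalDemo {
    enum State { RUNNING, CANCELING, CANCELED }

    interface ThreadStarter {
        void start(Runnable work);
    }

    /** Simplified stand-in for the fatal-or-OOM check named in the report. */
    static boolean isFatalOrOom(Throwable t) {
        return t instanceof VirtualMachineError; // OutOfMemoryError is a subclass
    }

    static final class ToyTask {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;
        private final Consumer<Throwable> fatalErrorHandler;

        ToyTask(ThreadStarter starter, Consumer<Throwable> fatalErrorHandler) {
            this.starter = starter;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        void cancel() {
            if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
                return; // already cancelling or finished
            }
            try {
                starter.start(() -> state.set(State.CANCELED));
            } catch (Throwable t) {
                if (isFatalOrOom(t)) {
                    // No watchdog may exist at this point, so escalate right away.
                    fatalErrorHandler.accept(t);
                    return;
                }
                throw t; // non-fatal errors still propagate to the caller
            }
        }
    }

    /** Returns true if the handler saw the simulated out-of-thread error. */
    static boolean handlerInvokedOnOom() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        ThreadStarter exhausted = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        new ToyTask(exhausted, seen::set).cancel();
        return seen.get() instanceof OutOfMemoryError;
    }

    public static void main(String[] args) {
        System.out.println(handlerInvokedOnOom());
    }
}
```

The design point is that the escalation happens on the calling thread, so it does not depend on being able to create yet another thread while the process is out of threads.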