GitHub user StephanEwen opened a pull request: https://github.com/apache/flink/pull/5658
[FLINK-8856] [TaskManager] Move all cancellation interrupt calls to TaskCanceller thread

## What is the purpose of the change

This cleans up the code and guards against a JVM bug where `interrupt()` calls block/deadlock if the thread is engaged in certain I/O operations.

In addition, this makes sure that the process really goes away when the cancellation timeout expires, rather than relying on the TaskManager to be able to properly handle the fatal error notification.

Some minor robustness enhancements related to this change are included in this PR.

## Verifying this change

The change is motivated by an occasional JVM bug that I could not purposefully trigger in tests, so no new test guards against a rollback to the prior state.

All tests were passing prior to this change and are passing after this change.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink fix_task_interrupt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5658.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5658

----
commit 36726fc5c277c649dc360975a34bf7be0afd7a0e
Author: Stephan Ewen <sewen@...>
Date:   2018-03-06T14:54:13Z

    [hotfix] [taskmanager] Fix checkstyle in Task and TaskTest

commit 385b2032bcaa397d21d28e90efa89a44e12ebe99
Author: Stephan Ewen <sewen@...>
Date:   2018-03-06T14:18:33Z

    [FLINK-8856] [TaskManager] Move all cancellation interrupt calls to TaskCanceller thread

    This cleans up the code and guards against a JVM bug where 'interrupt()' calls
    block/deadlock if the thread is engaged in certain I/O operations.

    In addition, this makes sure that the process really goes away when the cancellation
    timeout expires, rather than relying on the TaskManager to be able to properly handle
    the fatal error notification.

commit 3b18d7d9eccc936dc53b05b787f3fd4c19171d4f
Author: Stephan Ewen <sewen@...>
Date:   2018-03-06T15:36:13Z

    [FLINK-8883] [core] Make ThreadDeath a fatal error in ExceptionUtils

commit f3884088b210a061dba4d83323884bece1d31864
Author: Stephan Ewen <sewen@...>
Date:   2018-03-06T16:14:54Z

    [FLINK-8885] [TaskManager] DispatcherThreadFactory registers a fatal error exception handler

    In case dispatcher threads let an exception bubble out (do not handle it), the
    exception handler terminates the process, to ensure we don't leave broken processes.

commit 5236bb73a2482ccdf016d1b9bea5cd0f17f2f620
Author: Stephan Ewen <sewen@...>
Date:   2018-03-06T16:18:38Z

    [hotfix] [runtime] Harden FatalExitExceptionHandler

    In case the logging framework throws an exception when handling the exception,
    we still kill the process, as intended.
---- ---
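For readers unfamiliar with the pattern described under FLINK-8856, here is a minimal sketch of the general idea, not the code from this PR. It assumes the goal is that (a) `interrupt()` is only ever called from a helper thread, so a blocking `interrupt()` cannot stall the caller of cancel, and (b) a separate watchdog enforces the cancellation timeout even if the interrupting thread is stuck. The names `CancellerSketch`, `cancelAsync`, and `onTimeout` are hypothetical and do not appear in Flink.

```java
/** Hypothetical illustration only; none of these names are Flink's. */
final class CancellerSketch {

    private CancellerSketch() {}

    /**
     * Interrupts {@code task} from a dedicated helper thread and starts a
     * separate watchdog that waits up to {@code timeoutMillis} (must be > 0)
     * for the task to finish. If the task is still alive afterwards, the
     * watchdog runs {@code onTimeout}. A real process might call
     * Runtime.getRuntime().halt(...) there; that choice is an assumption.
     *
     * @return the watchdog thread, so callers can join on it
     */
    static Thread cancelAsync(Thread task, long timeoutMillis, Runnable onTimeout) {
        // If interrupt() blocks, only this helper thread is affected.
        Thread interrupter = new Thread(task::interrupt, "interrupter-sketch");
        interrupter.setDaemon(true);

        // The watchdog never calls interrupt(), so it cannot be blocked by it.
        Thread watchdog = new Thread(() -> {
            try {
                task.join(timeoutMillis);
            } catch (InterruptedException e) {
                // Watchdog itself was cancelled; restore the flag and stop.
                Thread.currentThread().interrupt();
                return;
            }
            if (task.isAlive()) {
                onTimeout.run();
            }
        }, "watchdog-sketch");
        watchdog.setDaemon(true);

        interrupter.start();
        watchdog.start();
        return watchdog;
    }
}
```

A caller would pass the task's executing thread, the configured cancellation timeout, and an action that terminates the process. Keeping the watchdog separate from the interrupter is the design point: the timeout still fires when the interrupt call itself hangs.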