Robert Metzger created FLINK-2079:
-------------------------------------

Summary: Add watcher to YARN TM containers to detect stopped actor system
Key: FLINK-2079
URL: https://issues.apache.org/jira/browse/FLINK-2079
Project: Flink
Issue Type: Improvement
Components: TaskManager, YARN Client
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Robert Metzger
I experienced an OutOfMemoryError (caused by the user code) while running Flink on YARN. The TaskManager appears to detect the fatal error correctly. However, the JVM does not shut down, so YARN won't bring up new containers.

Therefore, I want to start a thread in the YarnTaskManagerRunner that periodically (every 30 seconds) checks whether the actor system is still running. If it is not, the thread calls System.exit(1).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
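A minimal sketch of the proposed watcher thread, in Java. It is not the actual patch. The class name `ActorSystemWatchdog`, its constructor parameters, and the `forYarn` factory are hypothetical. The liveness check is abstracted as a `BooleanSupplier` rather than a specific Akka call, so how it is wired to the real actor system is left as an assumption. The shutdown action is injectable, so the polling logic can be exercised without terminating the JVM. The `forYarn` factory applies the values from the issue: a 30-second interval and `System.exit(1)`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical watchdog: periodically polls a liveness check and runs a
 * shutdown action once the check reports the actor system as stopped.
 */
class ActorSystemWatchdog {
    private final ScheduledExecutorService scheduler;
    private final BooleanSupplier isAlive;   // assumed liveness check
    private final Runnable onStopped;        // e.g. () -> System.exit(1)
    private final long intervalMillis;

    ActorSystemWatchdog(BooleanSupplier isAlive, Runnable onStopped, long intervalMillis) {
        this.isAlive = isAlive;
        this.onStopped = onStopped;
        this.intervalMillis = intervalMillis;
        // Daemon thread, so the watchdog itself never keeps the JVM alive.
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "actor-system-watchdog");
            t.setDaemon(true);
            return t;
        });
    }

    /** Wiring described in the issue: check every 30 s, exit with code 1. */
    static ActorSystemWatchdog forYarn(BooleanSupplier isAlive) {
        return new ActorSystemWatchdog(isAlive, () -> System.exit(1), 30_000L);
    }

    void start() {
        scheduler.scheduleWithFixedDelay(() -> {
            boolean alive;
            try {
                alive = isAlive.getAsBoolean();
            } catch (RuntimeException e) {
                alive = false; // a failing check is treated as "stopped"
            }
            if (!alive) {
                onStopped.run();
                scheduler.shutdown(); // stop polling after firing once
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

Catching exceptions from the check matters: when a task scheduled with `scheduleWithFixedDelay` throws, the executor silently suppresses all later runs, which would leave a watchdog that no longer watches anything.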