[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841028#comment-16841028 ]
vinoyang commented on FLINK-5621:
---------------------------------

Hi [~till.rohrmann]

Can you tell me what's your opinion now?

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers
> with operational issues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Assignee: vinoyang
>            Priority: Major
>
> There are cases where jobs can get into a state where no progress can be made
> if there is something pathologically wrong with one of the TaskManager nodes
> in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk
> space. Flink never considers the TM to be "bad" and will keep using it to
> attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will
> commit suicide if that TM was the source of an exception that caused a job to
> fail/restart.
> I'm sure there are plenty of other approaches to solving this..

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)