[ https://issues.apache.org/jira/browse/FLINK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann closed FLINK-11000.
---------------------------------
    Resolution: Abandoned

Closed for inactivity.

> Introduce Resource Blacklist Mechanism
> --------------------------------------
>
>                 Key: FLINK-11000
>                 URL: https://issues.apache.org/jira/browse/FLINK-11000
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Yingjie Cao
>            Assignee: Yingjie Cao
>            Priority: Major
>
> In large clusters, jobs occasionally encounter hardware and software
> environment problems, including missing software libraries, bad hardware,
> and resource shortages such as running out of disk space. These problems
> lead to task failures; the failover strategy handles them by redeploying
> the affected tasks. However, because of location preference and limited
> total resources, a failed task may be scheduled onto the same host again,
> where it fails again, possibly many times. The primary cause of this
> problem is a mismatch between task and resource. Currently, the resource
> allocation algorithm does not take this into account.
> A blacklist mechanism can solve this problem. The basic idea is that when
> a task fails too many times on some resource, the Scheduler will no longer
> assign that resource to that task. The detailed design doc is here:
> [https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
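
The basic idea in the issue description (stop assigning a resource to a task once that task has failed on it too many times) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design from the linked document and not Flink code: the class name `TaskResourceBlacklist`, its methods, the string identifiers, and the fixed failure threshold are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names come from Flink or the design doc.
class TaskResourceBlacklist {
    // Assumption: "too many times" means reaching a fixed per-resource threshold.
    private final int maxFailuresPerResource;
    // task id -> (resource id -> number of failures of that task on that resource)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    TaskResourceBlacklist(int maxFailuresPerResource) {
        this.maxFailuresPerResource = maxFailuresPerResource;
    }

    // Called whenever a task fails on a given resource (for example, a host).
    void recordFailure(String taskId, String resourceId) {
        failures.computeIfAbsent(taskId, k -> new HashMap<>())
                .merge(resourceId, 1, Integer::sum);
    }

    // A resource is blacklisted for a task once the failure count reaches the threshold.
    boolean isBlacklisted(String taskId, String resourceId) {
        Map<String, Integer> perResource = failures.get(taskId);
        return perResource != null
                && perResource.getOrDefault(resourceId, 0) >= maxFailuresPerResource;
    }

    // Picks the first candidate, in preference order, that is not blacklisted for the task.
    Optional<String> pickResource(String taskId, List<String> candidatesInPreferenceOrder) {
        return candidatesInPreferenceOrder.stream()
                .filter(resourceId -> !isBlacklisted(taskId, resourceId))
                .findFirst();
    }

    public static void main(String[] args) {
        TaskResourceBlacklist blacklist = new TaskResourceBlacklist(2);
        List<String> hosts = Arrays.asList("host-a", "host-b");

        // task-1 fails twice on its preferred host.
        blacklist.recordFailure("task-1", "host-a");
        blacklist.recordFailure("task-1", "host-a");

        // task-1 is now steered away from host-a; task-2 is unaffected.
        System.out.println(blacklist.pickResource("task-1", hosts).orElse("none"));
        System.out.println(blacklist.pickResource("task-2", hosts).orElse("none"));
    }
}
```

Note that the blacklist in this sketch is keyed per task, matching the description's wording that the resource is withheld from "that task" rather than from every job. Expiring entries or blacklisting a host cluster-wide would be separate design decisions that the sketch does not attempt.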