Thanks for sharing this design document with the community, Yingjie. I like the design of passing the job-specific blacklisted TMs as a scheduling constraint. This makes a lot of sense to me.

Cheers,
Till

On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:

> Hi everyone,
>
> This post proposes a blacklist mechanism as an enhancement to the Flink
> scheduler. The motivation is as follows.
>
> In our clusters, jobs occasionally encounter hardware and software
> environment problems, including missing software libraries, bad hardware,
> and resource shortages such as running out of disk space. These problems
> lead to task failures; the failover strategy takes care of that and
> redeploys the relevant tasks. But because of location preference and
> limited total resources, the failed task will be scheduled onto the same
> host, where it fails again and again. The primary cause of this problem
> is a mismatch between task and resource. Currently, the resource
> allocation algorithm does not take this into consideration.
>
> We introduce the blacklist mechanism to solve this problem. The basic idea
> is that when a task fails too many times on some resource, the scheduler
> will no longer assign that resource to the task. We have implemented this
> feature in our internal version of Flink, and it currently works fine.
>
> The following is the design draft; we would really appreciate it if you
> could review and comment.
>
> https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
>
> Best,
> Yingjie