Thanks, Yingjie, for sharing this doc. I think this is a very important feature for production.
As you mentioned in your document, an unhealthy node can cause a TM startup failure, yet the cluster manager may offer the same node again for some reason. (I have encountered this scenario in our production environment.) Under your proposal, the RM can blacklist the unhealthy node because of the launch failure.

I have some questions: Do you want every ResourceManager (MesosResourceManager, YarnResourceManager) to implement this policy? If not, and you want Flink itself to implement this mechanism, I think the current RM interface may not be enough.

Thanks.

Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Nov 28, 2018 at 11:29 AM:

> Hi yingjie,
>
> Thanks for proposing the blacklist! I agree that a blacklist is
> important for job maintenance, since some jobs may not be able to fail over
> automatically if some tasks are always scheduled to problematic hosts
> or TMs. This increases the burden on operators, since they need to
> pay more attention to the status of the jobs.
>
> I have read the proposal and left some comments. I think one problem
> is how we cooperate with external resource managers (like YARN or Mesos)
> so that they request resources according to our blacklist. If they
> cannot fully obey the blacklist, then we may need to deal with the
> inappropriate resources.
>
> Looking forward to the future progress of this feature! Thanks again
> for the exciting proposal.
>
> Best,
> Yun Gao
>
> ------------------------------------------------------------------
> From: zhijiang <wangzhijiang...@aliyun.com.INVALID>
> Send Time: 2018 Nov 27 (Tue) 10:40
> To: dev <dev@flink.apache.org>
> Subject: Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist mechanism
>
> Thanks yingjie for bringing this discussion.
>
> I encountered this issue during failover and also noticed other users
> complaining about related issues in the community before.
> So it is necessary to have this mechanism to enhance the scheduling process
> first, and then enrich the internal rules step by step.
> I hope this feature lands in the next major release. :)
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From: Till Rohrmann <trohrm...@apache.org>
> Send Time: 2018 Nov 5 (Mon) 18:43
> To: dev <dev@flink.apache.org>
> Subject: Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist mechanism
>
> Thanks for sharing this design document with the community Yingjie.
>
> I like the design to pass the job specific blacklisted TMs as a scheduling
> constraint. This makes a lot of sense to me.
>
> Cheers,
> Till
>
> On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > This post proposes the blacklist mechanism as an enhancement of the Flink
> > scheduler. The motivation is as follows.
> >
> > In our clusters, jobs occasionally encounter hardware and software environment
> > problems, including missing software libraries, bad hardware, and resource
> > shortages such as running out of disk space. These problems lead to task
> > failure; the failover strategy takes care of that and redeploys the relevant tasks.
> > But because of reasons like location preference and limited total
> > resources, the failed task will be scheduled on the same host again,
> > and then the task will fail again and again, many times. The primary cause of
> > this problem is the mismatch between task and resource. Currently, the
> > resource allocation algorithm does not take these into consideration.
> >
> > We introduce the blacklist mechanism to solve this problem. The basic idea
> > is that when a task fails too many times on some resource, the Scheduler
> > will not assign the resource to that task. We have implemented this feature
> > in our internal version of Flink, and currently it works fine.
> >
> > The following is the design draft. We would really appreciate it if you can
> > review and comment.
> >
> > https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> >
> > Best,
> > Yingjie
> >
> > --
> > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
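
For readers skimming the thread, the basic idea in the original post (stop assigning a resource to a task once that task has failed on it too many times) can be sketched roughly as below. This is only an illustration in Python, not Flink code and not the design in the linked document: the class `TaskBlacklist`, its methods, and the `max_failures` threshold are all hypothetical names chosen for the example.

```python
class TaskBlacklist:
    """Hypothetical per-task blacklist; illustrative only, not a Flink API."""

    def __init__(self, max_failures=3):
        # Assumed threshold for "fails too many times"; the thread gives no number.
        self.max_failures = max_failures
        self._failures = {}  # (task_id, host) -> number of failures seen

    def record_failure(self, task_id, host):
        key = (task_id, host)
        self._failures[key] = self._failures.get(key, 0) + 1

    def is_blacklisted(self, task_id, host):
        return self._failures.get((task_id, host), 0) >= self.max_failures

    def pick_host(self, task_id, candidates):
        # Return the first candidate host not blacklisted for this task,
        # or None if every candidate is blacklisted.
        for host in candidates:
            if not self.is_blacklisted(task_id, host):
                return host
        return None


blacklist = TaskBlacklist(max_failures=2)
blacklist.record_failure("map-1", "host-a")
blacklist.record_failure("map-1", "host-a")

chosen = blacklist.pick_host("map-1", ["host-a", "host-b"])  # skips host-a
other = blacklist.pick_host("map-2", ["host-a", "host-b"])   # unaffected task
```

Keeping the counter keyed by (task, host) mirrors the job-specific scheduling constraint discussed above: a host that is bad for one task stays available to others. Whether an external resource manager such as YARN or Mesos would honor such a constraint is exactly the open question raised in the replies.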