You are right. I think we at least need a new interface to collect the failure information.
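As a rough illustration, such an interface might look like the sketch below. All names here are hypothetical, invented for discussion only, and do not exist in Flink's codebase:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface the ResourceManager (or Scheduler) could
// use to collect failure information for a blacklist. Illustrative only.
interface FailureListener {
    /** Notifies the listener that a task or TM launch failed on a host. */
    void notifyFailure(String host, Throwable cause);
}

// Trivial implementation that just records which hosts failed.
public class RecordingFailureListener implements FailureListener {
    private final List<String> failedHosts = new ArrayList<>();

    @Override
    public void notifyFailure(String host, Throwable cause) {
        failedHosts.add(host);
    }

    public List<String> getFailedHosts() {
        return failedHosts;
    }
}
```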
Best,
Yingjie

Guowei Ma <guowei....@gmail.com> wrote on Wed, Nov 28, 2018 at 2:23 PM:

> Thanks Yingjie for sharing this doc. I think this is a very important
> feature for production.
>
> As you mentioned in your document, an unhealthy node can cause a TM
> startup failure, but cluster management may offer the same node again
> for some reason. (I have encountered such a scenario in our production
> environment.) Under your proposal, the RM can blacklist this unhealthy
> node because of the launch failure.
>
> I have some questions:
> Do you want every ResourceManager (MesosResourceManager,
> YarnResourceManager) to implement this policy?
> If not, and you want Flink itself to implement this mechanism, I think
> the interface of the current RM may not be enough.
>
> Thanks.
>
>
> Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Nov 28, 2018 at
> 11:29 AM:
>
> > Hi Yingjie,
> > Thanks for proposing the blacklist! I agree that a blacklist is
> > important for job maintenance, since some jobs may not be able to
> > fail over automatically if some tasks are always scheduled to the
> > problematic hosts or TMs. This increases the burden on operators,
> > since they need to pay more attention to the status of their jobs.
> >
> > I have read the proposal and left some comments. I think one open
> > problem is how we cooperate with external resource managers (like
> > YARN or Mesos) so that they apply for resources according to our
> > blacklist. If they cannot fully obey the blacklist, then we may need
> > to deal with the inappropriate resources.
> >
> > Looking forward to the future progress of this feature! Thanks again
> > for the exciting proposal.
> >
> > Best,
> > Yun Gao
> >
> > ------------------------------------------------------------------
> > From: zhijiang <wangzhijiang...@aliyun.com.INVALID>
> > Send Time: Tue, Nov 27, 2018 10:40
> > To: dev <dev@flink.apache.org>
> > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing
> > blacklist mechanism
> >
> > Thanks Yingjie for bringing up this discussion.
> >
> > I encountered this issue during failover, and I have also noticed
> > other users complaining about related issues in the community before.
> > So it is necessary to have this mechanism to enhance the scheduling
> > process first, and then to enrich the internal rules step by step.
> > I hope this feature makes it into the next major release. :)
> >
> > Best,
> > Zhijiang
> > ------------------------------------------------------------------
> > From: Till Rohrmann <trohrm...@apache.org>
> > Send Time: Mon, Nov 5, 2018 18:43
> > To: dev <dev@flink.apache.org>
> > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing
> > blacklist mechanism
> >
> > Thanks for sharing this design document with the community, Yingjie.
> >
> > I like the design of passing the job-specific blacklisted TMs as a
> > scheduling constraint. This makes a lot of sense to me.
> >
> > Cheers,
> > Till
> >
> > On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > This post proposes a blacklist mechanism as an enhancement of the
> > > Flink scheduler. The motivation is as follows.
> > >
> > > In our clusters, jobs occasionally encounter hardware and software
> > > environment problems, including missing software libraries, bad
> > > hardware, and resource shortages such as running out of disk space.
> > > These problems lead to task failures; the failover strategy takes
> > > care of that and redeploys the relevant tasks.
> > > But because of reasons like location preference and limited total
> > > resources, the failed task will be scheduled to the same host again,
> > > so the task fails again and again, many times. The primary cause of
> > > this problem is the mismatch between tasks and resources; currently,
> > > the resource allocation algorithm does not take this into
> > > consideration.
> > >
> > > We introduce the blacklist mechanism to solve this problem. The
> > > basic idea is that when a task fails too many times on some
> > > resource, the Scheduler will no longer assign that resource to the
> > > task. We have implemented this feature in our internal version of
> > > Flink, and it currently works fine.
> > >
> > > The following is the design draft. We would really appreciate it if
> > > you could review and comment:
> > >
> > > https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> > >
> > > Best,
> > > Yingjie
> > >
> > >
> > > --
> > > Sent from:
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
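For readers following along: the failure-counting idea quoted above ("when a task fails too many times on some resource, the Scheduler will not assign the resource to that task") could be sketched roughly as below. This is a minimal illustration with invented names, not the proposal's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a per-job failure counter that treats a host
// as blacklisted once tasks have failed on it a configured number of
// times. Class and method names are hypothetical.
public class BlacklistTracker {

    private final int maxFailures;
    private final Map<String, Integer> failureCounts = new ConcurrentHashMap<>();

    public BlacklistTracker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Records one task failure on the given host. */
    public void recordFailure(String host) {
        failureCounts.merge(host, 1, Integer::sum);
    }

    /** Returns true once the host has reached the failure threshold. */
    public boolean isBlacklisted(String host) {
        return failureCounts.getOrDefault(host, 0) >= maxFailures;
    }
}
```

A scheduler could consult `isBlacklisted` before assigning a slot on a host, which matches the "scheduling constraint" framing Till mentions above.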