Thanks, Weihua. Your suggestions make a lot of sense to me. Currently, all blacklisted resources are released from the blacklist if there is no available resource. Releasing only a portion of the blacklisted resources, based on the number of slots needed and some LRU-like algorithm, may be a better choice.
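To make the idea a bit more concrete, here is a minimal, illustrative sketch of that partial, LRU-style release. It is not the code from the design doc or from our internal version; the class and method names (PartialBlacklistRelease, releaseForPendingSlots, etc.) are made up for illustration:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

/**
 * Illustrative sketch only. Hosts are kept in the order they were
 * blacklisted, and when no free resource is left we release just enough
 * of the oldest entries to cover the pending slot requests, instead of
 * clearing the whole blacklist at once.
 */
public class PartialBlacklistRelease {

    // LinkedHashSet preserves insertion order, which approximates
    // "least recently blacklisted first" for the release below.
    private final LinkedHashSet<String> blacklistedHosts = new LinkedHashSet<>();

    public void blacklist(String host) {
        // Re-inserting moves the host to the "most recently blacklisted" end.
        blacklistedHosts.remove(host);
        blacklistedHosts.add(host);
    }

    public boolean isBlacklisted(String host) {
        return blacklistedHosts.contains(host);
    }

    /** Releases at most slotsNeeded hosts, oldest first, and returns them. */
    public List<String> releaseForPendingSlots(int slotsNeeded) {
        List<String> released = new ArrayList<>();
        Iterator<String> it = blacklistedHosts.iterator();
        while (it.hasNext() && released.size() < slotsNeeded) {
            released.add(it.next());
            it.remove();
        }
        return released;
    }
}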
Best,
Yingjie

Weihua Jiang <weihua.ji...@gmail.com> wrote on Wed, Nov 28, 2018 at 2:57 PM:

> This is a quite useful feature for production use. I once encountered such
> a case in a production cluster, and the Storm jobs took 2 hours to stabilize.
> After that, we implemented a similar blacklist solution for Storm.
>
> The design doc looks good to me. Some minor suggestions about blacklist
> removal: in some cases, when the whole cluster is problematic, the worst
> case is that all nodes end up in the blacklist if the blacklist size is
> improperly configured. Then the whole cluster is unavailable for allocation
> and has to wait for the removal timeout. This happens much more easily on a
> small cluster.
>
> The solution I once used was: we do not allocate blacklisted nodes if other
> resources are available. But if no resource is available, we remove nodes
> from the blacklist via some LRU algorithm so that allocation can proceed.
>
> Hope this helps.
>
> Thanks
> Weihua
>
> Guowei Ma <guowei....@gmail.com> wrote on Wed, Nov 28, 2018 at 2:23 PM:
>
> > Thanks yingjie for sharing this doc. I think this is a very important
> > feature for production.
> >
> > As you mentioned in your document, an unhealthy node can cause a TM
> > startup failure, but the cluster management may offer the same node again
> > for some reason. (I have encountered such a scenario in our production
> > environment.) As per your proposal, the RM can blacklist this unhealthy
> > node because of the launch failure.
> >
> > I have some questions:
> > Do you want every ResourceManager (MesosResourceManager,
> > YarnResourceManager) to implement this policy?
> > If not, i.e. if you want Flink itself to implement this mechanism, I think
> > the interface of the current RM may not be enough.
> >
> > Thanks.
> >
> > Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Nov 28, 2018 at 11:29 AM:
> >
> > > Hi yingjie,
> > > Thanks for proposing the blacklist! I agree that a blacklist is
> > > important for job maintenance, since some jobs may not be able to fail
> > > over automatically if some tasks are always scheduled to the problematic
> > > hosts or TMs. This will increase the burden on operators, since they
> > > need to pay more attention to the status of the jobs.
> > >
> > > I have read the proposal and left some comments. I think one problem
> > > is how we cooperate with external resource managers (like YARN or Mesos)
> > > so that they will apply for resources according to our blacklist. If they
> > > cannot fully obey the blacklist, then we may need to deal with the
> > > inappropriate resources.
> > >
> > > Looking forward to the future progress of this feature! Thanks again
> > > for the exciting proposal.
> > >
> > > Best,
> > > Yun Gao
> > >
> > > ------------------------------------------------------------------
> > > From: zhijiang <wangzhijiang...@aliyun.com.INVALID>
> > > Send Time: 2018 Nov 27 (Tue) 10:40
> > > To: dev <dev@flink.apache.org>
> > > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing blacklist mechanism
> > >
> > > Thanks yingjie for bringing up this discussion.
> > >
> > > I encountered this issue during failover and also noticed other users
> > > complaining about related issues in the community before.
> > > So it is necessary to have this mechanism to enhance the scheduling
> > > process first, and then enrich the internal rules step by step.
> > > I hope this feature makes it into the next major release. :)
> > >
> > > Best,
> > > Zhijiang
> > > ------------------------------------------------------------------
> > > From: Till Rohrmann <trohrm...@apache.org>
> > > Send Time: 2018 Nov 5 (Mon) 18:43
> > > To: dev <dev@flink.apache.org>
> > > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing blacklist mechanism
> > >
> > > Thanks for sharing this design document with the community, Yingjie.
> > >
> > > I like the design of passing the job-specific blacklisted TMs as a
> > > scheduling constraint. This makes a lot of sense to me.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > This post proposes a blacklist mechanism as an enhancement of the Flink
> > > > scheduler. The motivation is as follows.
> > > >
> > > > In our clusters, jobs occasionally encounter hardware and software
> > > > environment problems, including missing software libraries, bad
> > > > hardware, and resource shortages such as running out of disk space.
> > > > These problems lead to task failures; the failover strategy takes care
> > > > of that and redeploys the relevant tasks. But because of reasons like
> > > > location preference and limited total resources, the failed task is
> > > > often scheduled onto the same host again, and then the task fails again
> > > > and again, many times. The primary cause of this problem is the
> > > > mismatch between task and resource. Currently, the resource allocation
> > > > algorithm does not take this into consideration.
> > > >
> > > > We introduce the blacklist mechanism to solve this problem. The basic
> > > > idea is that when a task fails too many times on some resource, the
> > > > scheduler will not assign that resource to the task again. We have
> > > > implemented this feature in our internal version of Flink, and
> > > > currently it works fine.
> > > >
> > > > The following is the design draft; we would really appreciate it if
> > > > you could review and comment.
> > > >
> > > > https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> > > >
> > > > Best,
> > > > Yingjie
> > > >
> > > > --
> > > > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
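As a small illustration of the core rule quoted above from the original proposal (a resource is no longer assigned to a task once the task has failed on it too many times), a rough sketch of the kind of bookkeeping that rule implies might look like the following. The names (PerTaskFailureTracker, recordFailure, isBlacklistedFor) are hypothetical and not taken from our internal implementation or the design doc:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only. Counts deployment failures of an execution
 * vertex per host and reports the host as blacklisted for that vertex
 * once the failures reach a configurable threshold, so the scheduler
 * can avoid placing the task there again.
 */
public class PerTaskFailureTracker {

    private final int maxFailuresPerHost;

    // vertexId -> (host -> failure count)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    public PerTaskFailureTracker(int maxFailuresPerHost) {
        this.maxFailuresPerHost = maxFailuresPerHost;
    }

    /** Records one failure of the given vertex on the given host. */
    public void recordFailure(String vertexId, String host) {
        failures.computeIfAbsent(vertexId, k -> new HashMap<>())
                .merge(host, 1, Integer::sum);
    }

    /** True if the scheduler should no longer place this vertex on this host. */
    public boolean isBlacklistedFor(String vertexId, String host) {
        return failures.getOrDefault(vertexId, Collections.emptyMap())
                       .getOrDefault(host, 0) >= maxFailuresPerHost;
    }
}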