This is a very useful feature for production. I once encountered such a
case in a production cluster and the Storm jobs took 2 hours to stabilize.
After that, we implemented a similar blacklist solution for Storm.

The design doc looks good to me. A minor suggestion about blacklist
removal: in some cases, when the whole cluster is problematic, the worst
case is that all the nodes end up in the blacklist if the blacklist size
is improperly configured. Then the whole cluster is unavailable for
allocation and has to wait for the removal timeout. This happens much more
easily on small clusters.

The solution I once used was: do not allocate nodes in the blacklist while
other resources are available; but if no resource is available, remove
nodes from the blacklist via an LRU algorithm so that allocation can
proceed.
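
A minimal sketch of that fallback in Java (class and method names are
hypothetical, not our actual Storm code):

    import java.util.Iterator;
    import java.util.LinkedHashMap;

    // Sketch: a blacklist whose iteration order tracks when nodes were
    // (re-)blacklisted, so the least recently blacklisted node can be
    // released when nothing else is left to allocate.
    public class LruBlacklist {

        private final LinkedHashMap<String, Long> nodes = new LinkedHashMap<>();

        // (Re-)blacklist a node; re-adding moves it to the most recent slot.
        public void add(String nodeId) {
            nodes.remove(nodeId);
            nodes.put(nodeId, System.currentTimeMillis());
        }

        public boolean isBlacklisted(String nodeId) {
            return nodes.containsKey(nodeId);
        }

        // Called only when no non-blacklisted resource is available:
        // evict and return the least recently blacklisted node, or null.
        public String evictLeastRecent() {
            Iterator<String> it = nodes.keySet().iterator();
            if (!it.hasNext()) {
                return null;
            }
            String oldest = it.next();
            it.remove();
            return oldest;
        }
    }

This keeps bad nodes out of rotation in the common case but still
guarantees forward progress when the blacklist would otherwise cover the
whole cluster.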

Hope this helps.

Thanks
Weihua

Guowei Ma <guowei....@gmail.com> wrote on Wed, Nov 28, 2018 at 2:23 PM:

> Thanks yingjie for sharing this doc; I think this is a very important
> feature for production.
>
> As you mentioned in your document, an unhealthy node can cause a TM
> startup failure, but cluster management may offer the same node again for
> some reason. (I have encountered such a scenario in our production
> environment.) As your proposal describes, the RM can blacklist this
> unhealthy node because of the launch failure.
>
> I have some questions:
> Do you want every
> ResourceManager (MesosResourceManager, YarnResourceManager) to implement
> this policy?
> If not, and you want Flink itself to implement this mechanism, I think
> the interface of the current RM may not be enough.
>
> thanks.
>
>
> Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Nov 28, 2018 at 11:29 AM:
>
> > Hi yingjie,
> >       Thanks for proposing the blacklist! I agree that the blacklist is
> > important for job maintenance, since some jobs may not be able to fail
> > over automatically if some tasks are always scheduled to the problematic
> > hosts or TMs. This increases the burden on the operators, since they
> > need to pay more attention to the status of the jobs.
> >
> >       I have read the proposal and left some comments. I think one
> > problem is how we cooperate with external resource managers (like YARN
> > or Mesos) so that they apply for resources according to our blacklist.
> > If they cannot fully obey the blacklist, then we may need to handle the
> > inappropriate resources ourselves.
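> >
> >      For YARN at least, the client API already seems to expose a hook
> > we could build on: as far as I know, AMRMClient#updateBlacklist takes a
> > list of node additions and a list of removals. A minimal sketch (the
> > wrapper class and method names are mine, just for illustration):
> >
> >     import java.util.Collections;
> >     import org.apache.hadoop.yarn.client.api.AMRMClient;
> >
> >     // Sketch: ask YARN to stop offering containers on a bad host,
> >     // and later lift the restriction again.
> >     public class YarnBlacklistHelper {
> >
> >         static void blacklistHost(AMRMClient<?> client, String host) {
> >             client.updateBlacklist(
> >                     Collections.singletonList(host), // additions
> >                     null);                           // removals
> >         }
> >
> >         static void unblacklistHost(AMRMClient<?> client, String host) {
> >             client.updateBlacklist(
> >                     null,
> >                     Collections.singletonList(host));
> >         }
> >     }
> >
> > Whether Mesos offers an equivalent hook is exactly the open question.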
> >
> >      Looking forward to the future progress of this feature! Thanks
> > again for the exciting proposal.
> >
> >
> > Best,
> > Yun Gao
> >
> >
> >
> > ------------------------------------------------------------------
> > From: zhijiang <wangzhijiang...@aliyun.com.INVALID>
> > Sent: Tue, Nov 27, 2018 10:40
> > To: dev <dev@flink.apache.org>
> > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing
> > blacklist mechanism
> >
> > Thanks yingjie for bringing up this discussion.
> >
> > I encountered this issue during failover and also noticed other users
> > complaining about related issues in the community before.
> > So it is necessary to have this mechanism to enhance the scheduling
> > process first, and then enrich the internal rules step by step.
> > Hope to see this feature working in the next major release. :)
> >
> > Best,
> > Zhijiang
> > ------------------------------------------------------------------
> > From: Till Rohrmann <trohrm...@apache.org>
> > Sent: Monday, Nov 5, 2018 18:43
> > To: dev <dev@flink.apache.org>
> > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing
> > blacklist mechanism
> >
> > Thanks for sharing this design document with the community Yingjie.
> >
> > I like the design to pass the job-specific blacklisted TMs as a
> > scheduling constraint. This makes a lot of sense to me.
> >
> > Cheers,
> > Till
> >
> > On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > This post proposes the blacklist mechanism as an enhancement of the
> > > flink scheduler. The motivation is as follows.
> > >
> > > In our clusters, jobs occasionally encounter hardware and software
> > > environment problems, including missing software libraries, bad
> > > hardware, and resource shortages such as running out of disk space.
> > > These problems lead to task failures, and the failover strategy takes
> > > care of that by redeploying the relevant tasks. But because of reasons
> > > like location preference and limited total resources, the failed task
> > > will be scheduled to the same host, and then the task will fail again
> > > and again, many times. The primary cause of this problem is the
> > > mismatch between task and resource. Currently, the resource allocation
> > > algorithm does not take this into consideration.
> > >
> > > We introduce the blacklist mechanism to solve this problem. The basic
> > > idea is that when a task fails too many times on some resource, the
> > > Scheduler will not assign that resource to the task again. We have
> > > implemented this feature in our internal version of flink, and
> > > currently it works fine.
> > >
> > > The following is the design draft. We would really appreciate it if
> > > you could review and comment.
> > >
> > > https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> > >
> > > Best,
> > > Yingjie
> > >
> > >
> > >
> > > --
> > > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > >
> >
> >
> >
>
