You are right. I think we at least need a new interface to collect the failure information.
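As a rough illustration, such an interface might look like the sketch below. All names here are hypothetical, invented for discussion only, and do not exist in Flink's codebase:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface the ResourceManager (or Scheduler) could
// use to collect failure information for a blacklist. Illustrative only.
interface FailureListener {
    /** Notifies the listener that a task or TM launch failed on a host. */
    void notifyFailure(String host, Throwable cause);
}

// Trivial implementation that just records which hosts failed.
public class RecordingFailureListener implements FailureListener {
    private final List<String> failedHosts = new ArrayList<>();

    @Override
    public void notifyFailure(String host, Throwable cause) {
        failedHosts.add(host);
    }

    public List<String> getFailedHosts() {
        return failedHosts;
    }
}
```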
Best,
Yingjie

Guowei Ma <guowei....@gmail.com> wrote on Wed, Nov 28, 2018 at 2:23 PM:

> Thanks Yingjie for sharing this doc. I think this is a very important
> feature for production.
>
> As you mentioned in your document, an unhealthy node can cause a TM
> startup failure, but cluster management may offer the same node again
> for some reason. (I have encountered such a scenario in our production
> environment.) Under your proposal, the RM can blacklist this unhealthy
> node because of the launch failure.
>
> I have some questions:
> Do you want every ResourceManager (MesosResourceManager,
> YarnResourceManager) to implement this policy?
> If not, and you want Flink itself to implement this mechanism, I think
> the interface of the current RM may not be enough.
>
> Thanks.
>
>
> Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Nov 28, 2018 at
> 11:29 AM:
>
> > Hi Yingjie,
> > Thanks for proposing the blacklist! I agree that a blacklist is
> > important for job maintenance, since some jobs may not be able to
> > fail over automatically if some tasks are always scheduled to the
> > problematic hosts or TMs. This increases the burden on operators,
> > since they need to pay more attention to the status of their jobs.
> >
> > I have read the proposal and left some comments. I think one open
> > problem is how we cooperate with external resource managers (like
> > YARN or Mesos) so that they apply for resources according to our
> > blacklist. If they cannot fully obey the blacklist, then we may need
> > to deal with the inappropriate resources.
> >
> > Looking forward to the future progress of this feature! Thanks again
> > for the exciting proposal.
> >
> > Best,
> > Yun Gao
> >
> > ------------------------------------------------------------------
> > From: zhijiang <wangzhijiang...@aliyun.com.INVALID>
> > Send Time: Tue, Nov 27, 2018 10:40
> > To: dev <dev@flink.apache.org>
> > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing
> > blacklist mechanism
> >
> > Thanks Yingjie for bringing up this discussion.
> >
> > I encountered this issue during failover, and I have also noticed
> > other users complaining about related issues in the community before.
> > So it is necessary to have this mechanism to enhance the scheduling
> > process first, and then to enrich the internal rules step by step.
> > I hope this feature makes it into the next major release. :)
> >
> > Best,
> > Zhijiang
> > ------------------------------------------------------------------
> > From: Till Rohrmann <trohrm...@apache.org>
> > Send Time: Mon, Nov 5, 2018 18:43
> > To: dev <dev@flink.apache.org>
> > Subject: Re: [DISCUSS] Enhancing flink scheduler by implementing
> > blacklist mechanism
> >
> > Thanks for sharing this design document with the community, Yingjie.
> >
> > I like the design of passing the job-specific blacklisted TMs as a
> > scheduling constraint. This makes a lot of sense to me.
> >
> > Cheers,
> > Till
> >
> > On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > This post proposes a blacklist mechanism as an enhancement of the
> > > Flink scheduler. The motivation is as follows.
> > >
> > > In our clusters, jobs occasionally encounter hardware and software
> > > environment problems, including missing software libraries, bad
> > > hardware, and resource shortages such as running out of disk space.
> > > These problems lead to task failures; the failover strategy takes
> > > care of that and redeploys the relevant tasks.
> > > But because of reasons like location preference and limited total
> > > resources, the failed task will be scheduled to the same host again,
> > > so the task fails again and again, many times. The primary cause of
> > > this problem is the mismatch between tasks and resources; currently,
> > > the resource allocation algorithm does not take this into
> > > consideration.
> > >
> > > We introduce the blacklist mechanism to solve this problem. The
> > > basic idea is that when a task fails too many times on some
> > > resource, the Scheduler will no longer assign that resource to the
> > > task. We have implemented this feature in our internal version of
> > > Flink, and it currently works fine.
> > >
> > > The following is the design draft. We would really appreciate it if
> > > you could review and comment:
> > >
> > > https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> > >
> > > Best,
> > > Yingjie
> > >
> > >
> > > --
> > > Sent from:
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
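For readers following along: the failure-counting idea quoted above ("when a task fails too many times on some resource, the Scheduler will not assign the resource to that task") could be sketched roughly as below. This is a minimal illustration with invented names, not the proposal's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a per-job failure counter that treats a host
// as blacklisted once tasks have failed on it a configured number of
// times. Class and method names are hypothetical.
public class BlacklistTracker {

    private final int maxFailures;
    private final Map<String, Integer> failureCounts = new ConcurrentHashMap<>();

    public BlacklistTracker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Records one task failure on the given host. */
    public void recordFailure(String host) {
        failureCounts.merge(host, 1, Integer::sum);
    }

    /** Returns true once the host has reached the failure threshold. */
    public boolean isBlacklisted(String host) {
        return failureCounts.getOrDefault(host, 0) >= maxFailures;
    }
}
```

A scheduler could consult `isBlacklisted` before assigning a slot on a host, which matches the "scheduling constraint" framing Till mentions above.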