Thanks, yingjie, for sharing this doc. I think this is a very important
feature for production.

As you mentioned in your document, an unhealthy node can cause a TM
startup failure, but cluster management may offer the same node again for
some reason. (I have encountered such a scenario in our production
environment.) With your proposal, the RM can blacklist this unhealthy node
because of the launch failure.
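
To make the idea concrete, here is a minimal sketch of counting launch
failures per host and blacklisting a host once it reaches a threshold. It
is only an illustration: NodeBlacklist, recordFailure, isBlacklisted and
the threshold parameter are hypothetical names, not part of Flink or of the
design document.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Flink class): tracks launch failures per host. */
final class NodeBlacklist {
    // Assumed policy: a host is blacklisted after this many recorded failures.
    private final int maxFailures;
    private final Map<String, Integer> failuresByHost = new HashMap<>();

    NodeBlacklist(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Record one failed TM launch on the given host. */
    void recordFailure(String host) {
        failuresByHost.merge(host, 1, Integer::sum);
    }

    /** True once the host has reached the failure threshold. */
    boolean isBlacklisted(String host) {
        return failuresByHost.getOrDefault(host, 0) >= maxFailures;
    }
}
```

In such a sketch, an RM would call recordFailure when a launch fails and
decline offers from any host for which isBlacklisted returns true.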

I have some questions:
Do you want every ResourceManager (MesosResourceManager,
YarnResourceManager) to implement this policy?
If not, and you want Flink itself to implement this mechanism, I think the
current RM interface may not be enough.

Thanks.


On Wed, Nov 28, 2018 at 11:29 AM Yun Gao <yungao...@aliyun.com.invalid> wrote:

> Hi yingjie,
>       Thanks for proposing the blacklist! I agree that a blacklist is
> important for job maintenance, since some jobs may not be able to fail over
> automatically if some tasks are always scheduled to the problematic hosts
> or TMs. This increases the burden on operators, since they need to
> pay more attention to the status of the jobs.
>
>       I have read the proposal and left some comments. I think one problem
> is how we cooperate with external resource managers (like YARN or Mesos)
> so that they request resources according to our blacklist. If they
> cannot fully obey the blacklist, then we may need to deal with the
> inappropriate resources.
>
>      Looking forward to further progress on this feature! Thanks again
> for the exciting proposal.
>
>
> Best,
> Yun Gao
>
>
>
> ------------------------------------------------------------------
> From:zhijiang <wangzhijiang...@aliyun.com.INVALID>
> Send Time:2018 Nov 27 (Tue) 10:40
> To:dev <dev@flink.apache.org>
> Subject:Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist
> mechanism
>
> Thanks, yingjie, for bringing up this discussion.
>
> I encountered this issue during failover and have also noticed other users
> complaining about related issues in the community before.
> So it is necessary to first have this mechanism to enhance the scheduling
> process, and then enrich the internal rules step by step.
> I hope this feature will land in the next major release. :)
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From:Till Rohrmann <trohrm...@apache.org>
> Send Time:2018 Nov 5 (Mon) 18:43
> To:dev <dev@flink.apache.org>
> Subject:Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist
> mechanism
>
> Thanks for sharing this design document with the community, Yingjie.
>
> I like the design to pass the job-specific blacklisted TMs as a scheduling
> constraint. This makes a lot of sense to me.
>
> Cheers,
> Till
>
> On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.ying...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > This post proposes a blacklist mechanism as an enhancement to the Flink
> > scheduler. The motivation is as follows.
> >
> > In our clusters, jobs occasionally encounter hardware and software
> > environment problems, including missing software libraries, bad hardware,
> > and resource shortages such as running out of disk space. These problems
> > lead to task failures; the failover strategy takes care of that and
> > redeploys the relevant tasks. But because of factors like location
> > preference and limited total resources, the failed task may be scheduled
> > onto the same host, where it fails again and again. The primary cause of
> > this problem is a mismatch between task and resource. Currently, the
> > resource allocation algorithm does not take this into consideration.
> >
> > We introduce the blacklist mechanism to solve this problem. The basic
> > idea is that when a task fails too many times on some resource, the
> > Scheduler will no longer assign that resource to the task. We have
> > implemented this feature in our internal version of Flink, and currently
> > it works fine.
> >
> > The following is the design draft; we would really appreciate it if you
> > could review and comment.
> >
> >
> > https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> >
> > Best,
> > Yingjie
> >
> >
> >
> > --
> > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >
>
>
>
