Hi Lijie!
*a) “Probably storing inside Zookeeper/Configmap might be helpful here.” Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.*

Sorry, this was a misunderstanding on my side. I thought we were talking about the HA mode (not about active vs. standalone ResourceManager). The original question was: how should the blocklisted nodes be handled at the moment of a leader change? Should we simply forget about them, or try to pre-save that list on remote storage? (A rough sketch of what I mean is in a P.S. at the very bottom of this mail, below the quoted thread.)

On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey...@gmail.com> wrote: > Thanks Lijie and ZhuZhu for the explanation. > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed > some functionalities the external tools(e.g. kubectl taint) could not > support. > > > Best, > Yang > > Lijie Wang <wangdachui9...@gmail.com> 于2022年5月6日周五 22:18写道: > > > Thanks for your feedback, Jiangang and Martijn. > > > > @Jiangang > > > > > > > For auto-detecting, I wonder how to make the strategy and mark a node > > blocked? > > > > In fact, we currently plan to not support auto-detection in this FLIP. > The > > part about auto-detection may be continued in a separate FLIP in the > > future. Some guys have the same concerns as you, and the correctness and > > necessity of auto-detection may require further discussion in the future. > > > > > In session mode, multi jobs can fail on the same bad node and the node > > should be marked blocked. > > By design, the blocklist information will be shared among all jobs in a > > cluster/session. The JM will sync blocklist information with RM. > > > > @Martijn > > > > > I agree with Yang Wang on this. > > As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just > limits > > the load of the node and does not kill all the processes on it) is also > > important, and we think that external systems (*yarn rmadmin or kubectl > > taint*) cannot support it. So we think it makes sense even only > *manually*. > > > > > I also agree with Chesnay that magical mechanisms are indeed super hard > > to get right. > > Yes, as you see, Jiangang(and a few others) have the same concern. > > However, we currently plan to not support auto-detection in this FLIP, > and > > only *manually*. In addition, I'd like to say that the FLIP provides a > > mechanism to support MARK_BLOCKLISTED and > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS, > > the auto-detection may be done by external systems. > > > > Best, > > Lijie > > > > Martijn Visser <mart...@ververica.com> 于2022年5月6日周五 19:04写道: > > > > > > If we only support to block nodes manually, then I could not see > > > the obvious advantages compared with current SRE's approach(via *yarn > > > rmadmin or kubectl taint*). > > > > > > I agree with Yang Wang on this. > > > > > > > To me this sounds yet again like one of those magical mechanisms > that > > > will rarely work just right. > > > > > > I also agree with Chesnay that magical mechanisms are indeed super hard > > to > > > get right. > > > > > > Best regards, > > > > > > Martijn > > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangp...@gmail.com> > > > wrote: > > > > > >> Thanks for the valuable design. The auto-detecting can decrease great > > work > > >> for us. We have implemented the similar feature in our inner flink > > >> version. > > >> Below is something that I care about: > > >> > > >> 1. For auto-detecting, I wonder how to make the strategy and mark a > > >> node > > >> blocked?
Sometimes the blocked node is hard to be detected, for > > >> example, > > >> the upper node or the down node will be blocked when network > > >> unreachable. > > >> 2. I see that the strategy is made in JobMaster side. How about > > >> implementing the similar logic in resource manager? In session > mode, > > >> multi > > >> jobs can fail on the same bad node and the node should be marked > > >> blocked. > > >> If the job makes the strategy, the node may be not marked blocked > if > > >> the > > >> fail times don't exceed the threshold. > > >> > > >> > > >> Zhu Zhu <reed...@gmail.com> 于2022年5月5日周四 23:35写道: > > >> > > >> > Thank you for all your feedback! > > >> > > > >> > Besides the answers from Lijie, I'd like to share some of my > thoughts: > > >> > 1. Whether to enable automatical blocklist > > >> > Generally speaking, it is not a goal of FLIP-224. > > >> > The automatical way should be something built upon the blocklist > > >> > mechanism and well decoupled. It was designed to be a configurable > > >> > blocklist strategy, but I think we can further decouple it by > > >> > introducing a abnormal node detector, as Becket suggested, which > just > > >> > uses the blocklist mechanism once bad nodes are detected. However, > it > > >> > should be a separate FLIP with further dev discussions and feedback > > >> > from users. I also agree with Becket that different users have > > different > > >> > requirements, and we should listen to them. > > >> > > > >> > 2. Is it enough to just take away abnormal nodes externally > > >> > My answer is no. As Lijie has mentioned, we need a way to avoid > > >> > deploying tasks to temporary hot nodes. In this case, users may just > > >> > want to limit the load of the node and do not want to kill all the > > >> > processes on it. Another case is the speculative execution[1] which > > >> > may also leverage this feature to avoid starting mirror tasks on > slow > > >> > nodes. > > >> > > > >> > Thanks, > > >> > Zhu > > >> > > > >> > [1] > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job > > >> > > > >> > Lijie Wang <wangdachui9...@gmail.com> 于2022年5月5日周四 15:56写道: > > >> > > > >> > > > > >> > > Hi everyone, > > >> > > > > >> > > > > >> > > Thanks for your feedback. > > >> > > > > >> > > > > >> > > There's one detail that I'd like to re-emphasize here because it > can > > >> > affect the value and design of the blocklist mechanism (perhaps I > > should > > >> > highlight it in the FLIP). We propose two actions in FLIP: > > >> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as > blocked. > > >> > Future slots should not be allocated from the blocked task manager > or > > >> node. > > >> > But slots that are already allocated will not be affected. A typical > > >> > application scenario is to mitigate machine hotspots. In this case, > we > > >> hope > > >> > that subsequent resource allocations will not be on the hot machine, > > but > > >> > tasks currently running on it should not be affected. > > >> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or > > node > > >> as > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be > > >> restarted on > > >> > non-blocked task managers. > > >> > > > > >> > > For the above 2 actions, the former may more highlight the meaning > > of > > >> > this FLIP, because the external system cannot do that. 
> > >> > > > > >> > > > > >> > > Regarding *Manually* and *Automatically*, I basically agree with > > >> @Becket > > >> > Qin: different users have different answers. Not all users’ > deployment > > >> > environments have a special external system that can perform the > > anomaly > > >> > detection. In addition, adding pluggable/optional auto-detection > > doesn't > > >> > require much extra work on top of manual specification. > > >> > > > > >> > > > > >> > > I will answer your other questions one by one. > > >> > > > > >> > > > > >> > > @Yangze > > >> > > > > >> > > a) I think you are right, we do not need to expose the > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to users. > > >> > > > > >> > > b) We can abstract the `notifyException` to a separate interface > > >> (maybe > > >> > BlocklistExceptionListener), and the ResourceManagerBlocklistHandler > > can > > >> > implement it in the future. > > >> > > > > >> > > > > >> > > @Martijn > > >> > > > > >> > > a) I also think the manual blocking should be done by cluster > > >> operators. > > >> > > > > >> > > b) I think manual blocking makes sense, because according to my > > >> > experience, users are often the first to perceive the machine > problems > > >> > (because of job failover or delay), and they will contact cluster > > >> operators > > >> > to solve it, or even tell the cluster operators which machine is > > >> > problematic. From this point of view, I think the people who really > > need > > >> > the manual blocking are the users, and it’s just performed by the > > >> cluster > > >> > operator, so I think the manual blocking makes sense. > > >> > > > > >> > > > > >> > > @Chesnay > > >> > > > > >> > > We need to touch the logic of JM/SlotPool, because for > > >> MARK_BLOCKLISTED > > >> > , we need to know whether the slot is blocklisted when the task is > > >> > FINISHED/CANCELLED/FAILED. If so, SlotPool should release the slot > > >> > directly to avoid assigning other tasks (of this job) on it. If we > > only > > >> > maintain the blocklist information on the RM, JM needs to retrieve > it > > by > > >> > RPC. I think the performance overhead of that is relatively large, > so > > I > > >> > think it's worth maintaining the blocklist information on the JM > side > > >> and > > >> > syncing them. > > >> > > > > >> > > > > >> > > @Роман > > >> > > > > >> > > a) “Probably storing inside Zookeeper/Configmap might be > helpful > > >> > here.” Can you explain it in detail? I don't fully understand that. > > In > > >> my > > >> > opinion, non-active and active are the same, and no special > treatment > > is > > >> > required. > > >> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will add it > > to > > >> > FLIP. > > >> > > > > >> > > > > >> > > @Yang > > >> > > > > >> > > As mentioned above, AFAK, the external system cannot support the > > >> > MARK_BLOCKLISTED action. > > >> > > > > >> > > > > >> > > Looking forward to your further feedback. > > >> > > > > >> > > > > >> > > Best, > > >> > > > > >> > > Lijie > > >> > > > > >> > > > > >> > > Yang Wang <danrtsey...@gmail.com> 于2022年5月3日周二 21:09写道: > > >> > >> > > >> > >> Thanks Lijie and Zhu for creating the proposal. > > >> > >> > > >> > >> I want to share some thoughts about Flink cluster operations. 
> > >> > >> > > >> > >> In the production environment, the SRE(aka Site Reliability > > Engineer) > > >> > >> already has many tools to detect the unstable nodes, which could > > take > > >> > the > > >> > >> system logs/metrics into consideration. > > >> > >> Then they use graceful-decomission in YARN and taint in K8s to > > >> prevent > > >> > new > > >> > >> allocations on these unstable nodes. > > >> > >> At last, they will evict all the containers and pods running on > > these > > >> > nodes. > > >> > >> This mechanism also works for planned maintenance. So I am afraid > > >> this > > >> > is > > >> > >> not the typical use case for FLIP-224. > > >> > >> > > >> > >> If we only support to block nodes manually, then I could not see > > >> > >> the obvious advantages compared with current SRE's approach(via > > *yarn > > >> > >> rmadmin or kubectl taint*). > > >> > >> At least, we need to have a pluggable component which could > expose > > >> the > > >> > >> potential unstable nodes automatically and block them if enabled > > >> > explicitly. > > >> > >> > > >> > >> > > >> > >> Best, > > >> > >> Yang > > >> > >> > > >> > >> > > >> > >> > > >> > >> Becket Qin <becket....@gmail.com> 于2022年5月2日周一 16:36写道: > > >> > >> > > >> > >> > Thanks for the proposal, Lijie. > > >> > >> > > > >> > >> > This is an interesting feature and discussion, and somewhat > > related > > >> > to the > > >> > >> > design principle about how people should operate Flink. > > >> > >> > > > >> > >> > I think there are three things involved in this FLIP. > > >> > >> > a) Detect and report the unstable node. > > >> > >> > b) Collect the information of the unstable node and form a > > >> > blocklist. > > >> > >> > c) Take the action to block nodes. > > >> > >> > > > >> > >> > My two cents: > > >> > >> > > > >> > >> > 1. It looks like people all agree that Flink should have c). It > > is > > >> > not only > > >> > >> > useful for cases of node failures, but also handy for some > > planned > > >> > >> > maintenance. > > >> > >> > > > >> > >> > 2. People have different opinions on b), i.e. who should be the > > >> brain > > >> > to > > >> > >> > make the decision to block a node. I think this largely depends > > on > > >> > who we > > >> > >> > talk to. Different users would probably give different answers. > > For > > >> > people > > >> > >> > who do have a centralized node health management service, let > > Flink > > >> > do just > > >> > >> > do a) and c) would be preferred. So essentially Flink would be > > one > > >> of > > >> > the > > >> > >> > sources that may detect unstable nodes, report it to that > > service, > > >> > and then > > >> > >> > take the command from that service to block the problematic > > nodes. > > >> On > > >> > the > > >> > >> > other hand, for users who do not have such a service, simply > > >> letting > > >> > Flink > > >> > >> > be clever by itself to block the suspicious nodes might be > > desired > > >> to > > >> > >> > ensure the jobs are running smoothly. > > >> > >> > > > >> > >> > So that indicates a) and b) here should be pluggable / > optional. > > >> > >> > > > >> > >> > In light of this, maybe it would make sense to have something > > >> > pluggable > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes > > actively. > > >> (A > > >> > more > > >> > >> > general interface should be JobInfoReporter<T> which can be > used > > to > > >> > report > > >> > >> > any information of type <T>. 
But I'll just keep the scope > > relevant > > >> to > > >> > this > > >> > >> > FLIP here). Personally speaking, I think it is OK to have a > > default > > >> > >> > implementation of a reporter which just tells Flink to take > > action > > >> to > > >> > block > > >> > >> > problematic nodes and also unblocks them after timeout. > > >> > >> > > > >> > >> > Thanks, > > >> > >> > > > >> > >> > Jiangjie (Becket) Qin > > >> > >> > > > >> > >> > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко < > ro.v.bo...@gmail.com > > > > > >> > wrote: > > >> > >> > > > >> > >> > > Thanks for good initiative, Lijie and Zhu! > > >> > >> > > > > >> > >> > > If it's possible I'd like to participate in development. > > >> > >> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we should > > consider > > >> > to move > > >> > >> > > somehow the information of blocklisted nodes/TMs from active > > >> > >> > > ResourceManager to non-active ones. Probably storing inside > > >> > >> > > Zookeeper/Configmap might be helpful here. > > >> > >> > > > > >> > >> > > And I agree with Martijn that a lot of organizations don't > want > > >> to > > >> > expose > > >> > >> > > such API for a cluster user group. But I think it's necessary > > to > > >> > have the > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding > > >> incorrect > > >> > >> > > automatic behaviour. > > >> > >> > > > > >> > >> > > And another one small suggestion - I think it would be better > > to > > >> > extend > > >> > >> > the > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field and > fill > > it > > >> > at the > > >> > >> > > item creation. This simple addition will allow to: > > >> > >> > > > > >> > >> > > - > > >> > >> > > > > >> > >> > > Provide the ability to users to setup the exact time of > > >> > blocklist end > > >> > >> > > through RestAPI > > >> > >> > > - > > >> > >> > > > > >> > >> > > Not being tied to a single value of > > >> > >> > > *cluster.resource-blacklist.item.timeout* > > >> > >> > > > > >> > >> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler < > > >> ches...@apache.org> > > >> > >> > wrote: > > >> > >> > > > > >> > >> > > > I do share the concern between blurring the lines a bit. > > >> > >> > > > > > >> > >> > > > That said, I'd prefer to not have any auto-detection and > only > > >> > have an > > >> > >> > > > opt-in mechanism > > >> > >> > > > to manually block processes/nodes. To me this sounds yet > > again > > >> > like one > > >> > >> > > > of those > > >> > >> > > > magical mechanisms that will rarely work just right. > > >> > >> > > > An external system can leverage way more information after > > all. > > >> > >> > > > > > >> > >> > > > Moreover, I'm quite concerned about the complexity of this > > >> > proposal. > > >> > >> > > > Tracking on both the RM/JM side; syncing between > components; > > >> > >> > adjustments > > >> > >> > > > to the > > >> > >> > > > slot and resource protocol. > > >> > >> > > > > > >> > >> > > > In a way it seems overly complicated. > > >> > >> > > > > > >> > >> > > > If we look at it purely from an active resource management > > >> > perspective, > > >> > >> > > > then there > > >> > >> > > > isn't really a need to touch the slot protocol at all (or > in > > >> fact > > >> > to > > >> > >> > > > anything in the JobMaster), > > >> > >> > > > because there isn't any point in keeping around blocked TMs > > in > > >> the > > >> > >> > first > > >> > >> > > > place. 
> > >> > >> > > > They'd just be idling, potentially shutting down after a > > while > > >> by > > >> > the > > >> > >> > RM > > >> > >> > > > because of > > >> > >> > > > it (unless we _also_ touch that logic). > > >> > >> > > > Here the blocking of a process (be it by blocking the > process > > >> or > > >> > node) > > >> > >> > is > > >> > >> > > > equivalent with shutting down the blocked process(es). > > >> > >> > > > Once the block is lifted we can just spin it back up. > > >> > >> > > > > > >> > >> > > > And I do wonder whether we couldn't apply the same line of > > >> > thinking to > > >> > >> > > > standalone resource management. > > >> > >> > > > Here being able to stop/restart a process/node manually > > should > > >> be > > >> > a > > >> > >> > core > > >> > >> > > > requirement for a Flink deployment anyway. > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote: > > >> > >> > > > > Hi everyone, > > >> > >> > > > > > > >> > >> > > > > Thanks for creating this FLIP. I can understand the > problem > > >> and > > >> > I see > > >> > >> > > > value > > >> > >> > > > > in the automatic detection and blocklisting. I do have > some > > >> > concerns > > >> > >> > > with > > >> > >> > > > > the ability to manually specify to be blocked resources. > I > > >> have > > >> > two > > >> > >> > > > > concerns; > > >> > >> > > > > > > >> > >> > > > > * Most organizations explicitly have a separation of > > >> concerns, > > >> > >> > meaning > > >> > >> > > > that > > >> > >> > > > > there's a group who's responsible for managing a cluster > > and > > >> > there's > > >> > >> > a > > >> > >> > > > user > > >> > >> > > > > group who uses that cluster. With the introduction of > this > > >> > mechanism, > > >> > >> > > the > > >> > >> > > > > latter group now can influence the responsibility of the > > >> first > > >> > group. > > >> > >> > > So > > >> > >> > > > it > > >> > >> > > > > can be possible that someone from the user group blocks > > >> > something, > > >> > >> > > which > > >> > >> > > > > causes an outage (which could result in paging mechanism > > >> > triggering > > >> > >> > > etc) > > >> > >> > > > > which impacts the first group. > > >> > >> > > > > * How big is the group of people who can go through the > > >> process > > >> > of > > >> > >> > > > manually > > >> > >> > > > > identifying a node that isn't behaving as it should be? I > > do > > >> > think > > >> > >> > this > > >> > >> > > > > group is relatively limited. Does it then make sense to > > >> > introduce > > >> > >> > such > > >> > >> > > a > > >> > >> > > > > feature, which would only be used by a really small user > > >> group > > >> > of > > >> > >> > > Flink? > > >> > >> > > > We > > >> > >> > > > > still have to maintain, test and support such a feature. > > >> > >> > > > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm leaning > > >> towards > > >> > not > > >> > >> > > > exposing > > >> > >> > > > > this to the user group but having this available strictly > > for > > >> > cluster > > >> > >> > > > > operators. They could then also set up their > > >> > paging/metrics/logging > > >> > >> > > > system > > >> > >> > > > > to take this into account. 
> > >> > >> > > > > > > >> > >> > > > > Best regards, > > >> > >> > > > > > > >> > >> > > > > Martijn Visser > > >> > >> > > > > https://twitter.com/MartijnVisser82 > > >> > >> > > > > https://github.com/MartijnVisser > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo < > > karma...@gmail.com > > >> > > > >> > wrote: > > >> > >> > > > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie. > > >> > >> > > > >> > > >> > >> > > > >> +1 for the overall proposal. Just share some cents here: > > >> > >> > > > >> > > >> > >> > > > >> - Why do we need to expose > > >> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval > to > > >> the > > >> > user? > > >> > >> > > > >> I think the semantics of > > >> > `cluster.resource-blacklist.item.timeout` > > >> > >> > is > > >> > >> > > > >> sufficient for the user. How to guarantee the timeout > > >> > mechanism is > > >> > >> > > > >> Flink's internal implementation. I think it will be very > > >> > confusing > > >> > >> > and > > >> > >> > > > >> we do not need to expose it to users. > > >> > >> > > > >> > > >> > >> > > > >> - ResourceManager can notify the exception of a task > > >> manager to > > >> > >> > > > >> `BlacklistHandler` as well. > > >> > >> > > > >> For example, the slot allocation might fail in case the > > >> target > > >> > task > > >> > >> > > > >> manager is busy or has a network jitter. I don't mean we > > >> need > > >> > to > > >> > >> > cover > > >> > >> > > > >> this case in this version, but we can also open a > > >> > `notifyException` > > >> > >> > in > > >> > >> > > > >> `ResourceManagerBlacklistHandler`. > > >> > >> > > > >> > > >> > >> > > > >> - Before we sync the blocklist to ResourceManager, will > > the > > >> > slot of > > >> > >> > a > > >> > >> > > > >> blocked task manager continues to be released and > > allocated? > > >> > >> > > > >> > > >> > >> > > > >> Best, > > >> > >> > > > >> Yangze Guo > > >> > >> > > > >> > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang < > > >> > >> > wangdachui9...@gmail.com> > > >> > >> > > > >> wrote: > > >> > >> > > > >>> Hi Konstantin, > > >> > >> > > > >>> > > >> > >> > > > >>> Thanks for your feedback. I will response your 4 > remarks: > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I think > > >> > “BlockList” > > >> > >> > is > > >> > >> > > > >> good > > >> > >> > > > >>> enough, and I will change it in FLIP. > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea. > Based > > >> on > > >> > the > > >> > >> > > > above, I > > >> > >> > > > >>> would change REST API as following: > > >> > >> > > > >>> > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes > > >> > >> > > > >>> > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers > > >> > >> > > > >>> > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier> > > >> > >> > > > >>> > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier> > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that all > > >> task > > >> > >> > managers > > >> > >> > > > on > > >> > >> > > > >>> this node are blocklisted. All slots on these TMs are > not > > >> > >> > available. 
> > >> > >> > > > This > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs are not > > >> really > > >> > lost, > > >> > >> > > > they > > >> > >> > > > >>> are in an unavailable status, and they are still > > registered > > >> > in this > > >> > >> > > > flink > > >> > >> > > > >>> cluster. They will be available again once the > > >> corresponding > > >> > >> > > blocklist > > >> > >> > > > >> item > > >> > >> > > > >>> is removed. This behavior is the same in > > active/non-active > > >> > >> > clusters. > > >> > >> > > > >>> However in the active clusters, these TMs may be > released > > >> due > > >> > to > > >> > >> > idle > > >> > >> > > > >>> timeouts. > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The > reasons > > >> are > > >> > as > > >> > >> > > > >> following: > > >> > >> > > > >>> a) The timeout will not affect users adding or removing > > >> items > > >> > via > > >> > >> > > REST > > >> > >> > > > >> API, > > >> > >> > > > >>> and users can disable it by configuring it to > > >> Long.MAX_VALUE . > > >> > >> > > > >>> > > >> > >> > > > >>> b) Some node problems can recover after a period of > time > > >> > (such as > > >> > >> > > > machine > > >> > >> > > > >>> hotspots), in which case users may prefer that Flink > can > > do > > >> > this > > >> > >> > > > >>> automatically instead of requiring the user to do it > > >> manually. > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> Best, > > >> > >> > > > >>> > > >> > >> > > > >>> Lijie > > >> > >> > > > >>> > > >> > >> > > > >>> Konstantin Knauf <kna...@apache.org> 于2022年4月27日周三 > > >> 19:23写道: > > >> > >> > > > >>> > > >> > >> > > > >>>> Hi Lijie, > > >> > >> > > > >>>> > > >> > >> > > > >>>> I think, this makes sense and +1 to only support > > manually > > >> > blocking > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies > > can > > >> > also be > > >> > >> > > > >>>> maintained outside of Apache Flink. > > >> > >> > > > >>>> > > >> > >> > > > >>>> A few remarks: > > >> > >> > > > >>>> > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to > the > > >> > >> > controversy > > >> > >> > > > >> around > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this > > >> topic a > > >> > >> > while > > >> > >> > > > >> back > > >> > >> > > > >>>> and there was generally a consensus to avoid the term > > >> > blacklist & > > >> > >> > > > >> whitelist > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or > > "quarantined" > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different > > >> design > > >> > as > > >> > >> > verbs > > >> > >> > > > >> like > > >> > >> > > > >>>> add/remove often considered an anti-pattern for REST > > APIs. > > >> > POST > > >> > >> > on a > > >> > >> > > > >> list > > >> > >> > > > >>>> item is generally the standard to add items. DELETE on > > the > > >> > >> > > individual > > >> > >> > > > >>>> resource is standard to remove an item. > > >> > >> > > > >>>> > > >> > >> > > > >>>> POST <host>/quarantine/items > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier> > > >> > >> > > > >>>> > > >> > >> > > > >>>> We could also consider to separate taskmanagers and > > nodes > > >> in > > >> > the > > >> > >> > > REST > > >> > >> > > > >> API > > >> > >> > > > >>>> (and internal data structures). Any opinion on this? 
> > >> > >> > > > >>>> > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier> > > >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier> > > >> > >> > > > >>>> > > >> > >> > > > >>>> 3) How would blocking nodes behave with non-active > > >> resource > > >> > >> > > managers, > > >> > >> > > > >> i.e. > > >> > >> > > > >>>> standalone or reactive mode? > > >> > >> > > > >>>> > > >> > >> > > > >>>> 4) To keep the implementation even more minimal, do we > > >> need > > >> > the > > >> > >> > > > timeout > > >> > >> > > > >>>> behavior? If items are added/removed manually we could > > >> > delegate > > >> > >> > this > > >> > >> > > > >> to the > > >> > >> > > > >>>> user easily. In my opinion the timeout behavior would > > >> better > > >> > fit > > >> > >> > > into > > >> > >> > > > >>>> specific strategies at a later point. > > >> > >> > > > >>>> > > >> > >> > > > >>>> Looking forward to your thoughts. > > >> > >> > > > >>>> > > >> > >> > > > >>>> Cheers and thank you, > > >> > >> > > > >>>> > > >> > >> > > > >>>> Konstantin > > >> > >> > > > >>>> > > >> > >> > > > >>>> [1] > > >> > >> > > > >>>> > > >> > >> > > > >>>> > > >> > >> > > > >> > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > > > >> > > > https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term > > >> > >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209 > > >> > >> > > > >>>> > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang > < > > >> > >> > > > >>>> wangdachui9...@gmail.com>: > > >> > >> > > > >>>> > > >> > >> > > > >>>>> Hi all, > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Flink job failures may happen due to cluster node > > issues > > >> > >> > > > >> (insufficient > > >> > >> > > > >>>> disk > > >> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink > will > > >> > take care > > >> > >> > > of > > >> > >> > > > >> the > > >> > >> > > > >>>>> failures and redeploy the tasks. However, due to data > > >> > locality > > >> > >> > and > > >> > >> > > > >>>> limited > > >> > >> > > > >>>>> resources, the new tasks are very likely to be > > redeployed > > >> > to the > > >> > >> > > same > > >> > >> > > > >>>>> nodes, which will result in continuous task > > abnormalities > > >> > and > > >> > >> > > affect > > >> > >> > > > >> job > > >> > >> > > > >>>>> progress. > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Currently, Flink users need to manually identify the > > >> > problematic > > >> > >> > > > >> node and > > >> > >> > > > >>>>> take it offline to solve this problem. But this > > approach > > >> has > > >> > >> > > > >> following > > >> > >> > > > >>>>> disadvantages: > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process. > Users > > >> may > > >> > need > > >> > >> > to > > >> > >> > > > >>>> contact > > >> > >> > > > >>>>> cluster administors to do this. The operation can > even > > be > > >> > >> > dangerous > > >> > >> > > > >> and > > >> > >> > > > >>>> not > > >> > >> > > > >>>>> allowed during some important business events. > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> 2. Identifying and solving this kind of problems > > manually > > >> > would > > >> > >> > be > > >> > >> > > > >> slow > > >> > >> > > > >>>> and > > >> > >> > > > >>>>> a waste of human resources. 
> > >> > >> > > > >>>>> > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to > > >> introduce a > > >> > >> > > blacklist > > >> > >> > > > >>>>> mechanism for Flink to filter out problematic > > resources. > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking > > forward > > >> > to your > > >> > >> > > > >>>> feedback. > > >> > >> > > > >>>>> [1] > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> > > >> > >> > > > >> > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Best, > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Lijie > > >> > >> > > > >>>>> > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > > > >> > > > > > > -- Best regards, Roman Boyko e.: ro.v.bo...@gmail.com