Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Lijie Wang Thu, 19 May 2022 02:35:32 -0700

Hi Konstantin,

We found that the Flink REST URL does not support the ":merge" format: a
segment that starts with a colon is recognized as a parameter in the URL.


We will keep the previous approach, i.e.

POST: http://{jm_rest_address:port}/blocklist/taskmanagers
with the "id" and the "merge" flag put into the request body.
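
A minimal sketch of how such a request body and the ADD semantics discussed
earlier in this thread could fit together, written in Python for brevity. The
JSON field names ("id", "merge", "timeout", "endTimestamp", "action", "cause"),
the BlocklistError type, the helper names and the in-memory dict are all
assumptions for illustration, not the FLIP's actual interface. In particular,
"merge" is modelled here as simply replacing the stored item.

```python
import json
import time


class BlocklistError(Exception):
    """Hypothetical error type for a rejected ADD request."""


def build_add_request(tm_id, merge=False, timeout_ms=None,
                      end_timestamp_ms=None, action="MARK_BLOCKED", cause=""):
    """Build the JSON body for POST /blocklist/taskmanagers (field names assumed)."""
    if timeout_ms is not None and end_timestamp_ms is not None:
        raise BlocklistError("specify either timeout or endTimestamp, not both")
    body = {"id": tm_id, "merge": merge, "action": action, "cause": cause}
    if timeout_ms is not None:
        body["timeout"] = timeout_ms
    if end_timestamp_ms is not None:
        body["endTimestamp"] = end_timestamp_ms
    return json.dumps(body)


def handle_add(blocklist, raw_body, now_ms=None):
    """Apply an ADD request to an in-memory dict keyed by task manager id."""
    req = json.loads(raw_body)
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    if "timeout" in req and "endTimestamp" in req:
        raise BlocklistError("specify either timeout or endTimestamp, not both")
    if "timeout" in req:
        end = now_ms + req["timeout"]      # relative timeout -> absolute time
    else:
        end = req.get("endTimestamp")      # None: the item never expires
    tm_id = req["id"]
    if tm_id in blocklist and not req.get("merge", False):
        # Default behaviour: an existing item is an error unless merge is set.
        raise BlocklistError(f"{tm_id} is already blocked and merge is false")
    # Assumption: merging is modelled as overwriting the stored item.
    blocklist[tm_id] = {"action": req["action"], "cause": req["cause"],
                        "endTimestamp": end}
    return blocklist[tm_id]
```

With this shape the URL stays the same for both behaviours, and only the
"merge" flag in the body decides whether a second ADD for the same id is
rejected or applied.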

Best,
Lijie

Lijie Wang <wangdachui9...@gmail.com> wrote on Wed, May 18, 2022 at 09:35:

> Hi Weihua,
> thanks for the feedback.
>
> 1. Yes, only *Manually* is supported in this FLIP, but it's the first step
> towards auto-detection.
> 2. We will print the blocked nodes in the logs. We may also include them in
> the exception thrown when resources are insufficient.
> 3. No. This FLIP won't change the WebUI. The blocklist information can be
> obtained through REST API and metrics.
>
> Best,
> Lijie
>
> Weihua Hu <huweihua....@gmail.com> wrote on Tue, May 17, 2022 at 21:41:
>
>> Hi,
>> Thanks for creating this FLIP.
>> We have implemented an automatic blocklist detection mechanism
>> internally, which is indeed very effective for handling node failures.
>> Because of the large number of nodes, although our SREs already take
>> failed nodes offline automatically, the detection is not 100% accurate and
>> there is some delay.
>> So the blocklist mechanism can help Flink jobs recover from failures much
>> faster.
>>
>> Here are some of my thoughts:
>> 1. In this FLIP, users need to locate machine failures manually, which
>> comes with a certain cost of use.
>> 2. What happens if too many nodes are blocked, resulting in insufficient
>> resources? Will there be a special Exception for the user?
>> 3. Will we display the blocklist information in the WebUI? The blocklist
>> is at the cluster level, and if multiple users share a cluster, some users
>> may be a little confused when resources are insufficient, or when more
>> resources are requested.
>>
>> Also, looking forward to the next FLIP on auto-detection.
>>
>> Best,
>> Weihua
>>
>> > On May 16, 2022, at 11:22 PM, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >
>> > Hi Konstantin,
>> >
>> > Maybe change it to the following:
>> >
>> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> > Merge is not allowed. If the {id} already exists, return an error.
>> > Otherwise, create a new item.
>> >
>> > 2. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> > Merge is allowed. If the {id} already exists, merge. Otherwise, create a
>> > new item.
>> >
>> > WDYT?
>> >
>> > Best,
>> > Lijie
>> >
>> > Konstantin Knauf <kna...@apache.org> wrote on Mon, May 16, 2022 at 20:07:
>> >
>> >> Hi Lijie,
>> >>
>> >> hm, maybe the following is more appropriate in that case
>> >>
>> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> >>
>> >> Best,
>> >>
>> >> Konstantin
>> >>
>> >> On Mon, May 16, 2022 at 07:05, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>
>> >>> Hi Konstantin,
>> >>> thanks for your feedback.
>> >>>
>> >>> From what I understand, PUT should be idempotent. However, we have a
>> >>> *timeout* field in the request. This means that issuing the same
>> >>> request at two different times will lead to different resource states
>> >>> (the timestamps at which the items are to be removed will differ).
>> >>>
>> >>> Should we use PUT in this case? WDYT?
>> >>>
>> >>> Best,
>> >>> Lijie
>> >>>
>> >>> Konstantin Knauf <kna...@apache.org> wrote on Fri, May 13, 2022 at 17:20:
>> >>>
>> >>>> Hi Lijie,
>> >>>>
>> >>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT on
>> >>>> the resource?
>> >>>>
>> >>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Konstantin
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, May 13, 2022 at 11:01, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi everyone,
>> >>>>>
>> >>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and made
>> >>>>> the following changes to the REST API:
>> >>>>> 1. To avoid ambiguity, only one of *timeout* and *endTimestamp* may be
>> >>>>> specified. If both are specified, an error is returned.
>> >>>>> 2. If the specified item is already there, the *ADD* operation has two
>> >>>>> possible behaviors: *return error* (the default) or *merge/update*. We
>> >>>>> added a flag to the request body to control it. You can find more
>> >>>>> details in the "Public Interface" section.
>> >>>>>
>> >>>>> If there is no more feedback, we will start the vote thread next
>> >>>>> week.
>> >>>>>
>> >>>>> Best,
>> >>>>> Lijie
>> >>>>>
>> >>>>> Lijie Wang <wangdachui9...@gmail.com> wrote on Tue, May 10, 2022 at 17:14:
>> >>>>>
>> >>>>>> Hi Becket Qin,
>> >>>>>>
>> >>>>>> Thanks for your suggestions. I have moved the description of
>> >>>>>> configurations, metrics and REST API into the "Public Interface"
>> >>>>>> section, and made a few updates according to your suggestions. In
>> >>>>>> this FLIP, there are no public Java interfaces or pluggables that
>> >>>>>> users need to implement themselves.
>> >>>>>>
>> >>>>>> Answers to your questions:
>> >>>>>> 1. Yes, there are 2 block actions: MARK_BLOCKED and
>> >>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (they have been renamed). Currently,
>> >>>>>> block items can only be added through the REST API, so these 2
>> >>>>>> actions are mentioned in the REST API part (which has now been moved
>> >>>>>> to the public interface section).
>> >>>>>> 2. I agree with you. I have changed the "Cause" field to a String and
>> >>>>>> allow users to specify it via the REST API.
>> >>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned above,
>> >>>>>> we will introduce 2 fields, *timeout* and *endTimestamp*, into the
>> >>>>>> ADD REST API to specify when to remove the blocked item. These 2
>> >>>>>> fields are optional. If neither is specified, the blocked item is
>> >>>>>> permanent and will not be removed. If both are specified, the minimum
>> >>>>>> of *currentTimestamp + timeout* and *endTimestamp* will be used as
>> >>>>>> the time to remove the blocked item. To keep the configuration
>> >>>>>> minimal, we have removed the
>> >>>>>> *cluster.resource-blocklist.item.timeout* configuration option.
>> >>>>>> 4. Yes, the block item will be overridden if the specified item
>> >>>>>> already exists. The ADD operation is *ADD or UPDATE*.
>> >>>>>> 5. Yes. On the JM/RM side, all the blocklist information is
>> >>>>>> maintained in JMBlocklistHandler/RMBlocklistHandler. The blocklist
>> >>>>>> handler (or an abstraction of it behind other interfaces) will be
>> >>>>>> propagated to the different components.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Lijie
>> >>>>>>
>> >>>>>> Becket Qin <becket....@gmail.com> wrote on Tue, May 10, 2022 at 11:26:
>> >>>>>>
>> >>>>>>> Hi Lijie,
>> >>>>>>>
>> >>>>>>> Thanks for updating the FLIP. It looks like the public interface
>> >>>>>>> section did not fully reflect all the user-visible behavior and API.
>> >>>>>>> Can you put everything that users may be aware of there? That would
>> >>>>>>> include the REST API, metrics, configurations, public Java
>> >>>>>>> interfaces or pluggables that users may see or implement themselves,
>> >>>>>>> as well as a brief summary of the behavior of the public API.
>> >>>>>>>
>> >>>>>>> Besides that, I have a few questions:
>> >>>>>>>
>> >>>>>>> 1. According to the conversation in the discussion thread, it looks
>> >>>>>>> like the BlockAction will have "MARK_BLOCKLISTED" and
>> >>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can
>> >>>>>>> you add that to the public interface as well?
>> >>>>>>>
>> >>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a
>> >>>>>>> Throwable and is not reflected in the REST API. Should that be
>> >>>>>>> included in the query response? And should we change that field to
>> >>>>>>> be a String so users may specify the cause via the REST API when
>> >>>>>>> they block some nodes / TMs?
>> >>>>>>>
>> >>>>>>> 3. Would it be useful to allow users to have different timeouts for
>> >>>>>>> different blocked items? So while there is a default timeout, users
>> >>>>>>> can also override it via the REST API when they block an entity.
>> >>>>>>>
>> >>>>>>> 4. Regarding the ADD operation, if the specified item is already
>> >>>>>>> there, will the block item be overridden? For example, if the user
>> >>>>>>> wants to extend the timeout of a blocked item, can they just issue
>> >>>>>>> an ADD command again?
>> >>>>>>>
>> >>>>>>> 5. I am not quite familiar with the details of this, but is there a
>> >>>>>>> source of truth for the blocked list? I think it might be good to
>> >>>>>>> have a single source of truth for the blocked list and just
>> >>>>>>> propagate that list to different components to take the action of
>> >>>>>>> actually blocking the resource.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>>
>> >>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>
>> >>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> Hi everyone,
>> >>>>>>>>
>> >>>>>>>> Based on the discussion in the mailing list, I updated the FLIP doc.
>> >>>>>>>> The changes include:
>> >>>>>>>> 1. Changed the description of the motivation section to more clearly
>> >>>>>>>> describe the problem this FLIP is trying to solve.
>> >>>>>>>> 2. Only *Manually* is supported.
>> >>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Lijie
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Roman Boyko <ro.v.bo...@gmail.com> wrote on Sat, May 7, 2022 at 19:25:
>> >>>>>>>>
>> >>>>>>>>> Hi Lijie!
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be helpful
>> >>>>>>>>> here.” Can you explain it in detail? I don't fully understand that.
>> >>>>>>>>> In my opinion, non-active and active are the same, and no special
>> >>>>>>>>> treatment is required.*
>> >>>>>>>>>
>> >>>>>>>>> Sorry, this was a misunderstanding from my side. I thought we were
>> >>>>>>>>> talking about the HA mode (but not about Active and Standalone
>> >>>>>>>>> ResourceManager). And the original question was - how to handle the
>> >>>>>>>>> blacklisted nodes list at the moment of leader change? Should we
>> >>>>>>>>> simply forget about them or try to pre-save that list on the remote
>> >>>>>>>>> storage?
>> >>>>>>>>>
>> >>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>> >>>>>>>>>>
>> >>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". At the task level, it
>> >>>>>>>>>> is indeed functionality that external tools (e.g. kubectl taint)
>> >>>>>>>>>> cannot support.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Yang
>> >>>>>>>>>>
>> >>>>>>>>>> Lijie Wang <wangdachui9...@gmail.com> wrote on Fri, May 6, 2022 at 22:18:
>> >>>>>>>>>>
>> >>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Jiangang
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy and mark
>> >>>>>>>>>>>> a node blocked?
>> >>>>>>>>>>>
>> >>>>>>>>>>> In fact, we currently plan not to support auto-detection in this
>> >>>>>>>>>>> FLIP. The part about auto-detection may be continued in a
>> >>>>>>>>>>> separate FLIP in the future. Some people have the same concerns
>> >>>>>>>>>>> as you, and the correctness and necessity of auto-detection may
>> >>>>>>>>>>> require further discussion in the future.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> In session mode, multiple jobs can fail on the same bad node
>> >>>>>>>>>>>> and the node should be marked blocked.
>> >>>>>>>>>>> By design, the blocklist information will be shared among all
>> >>>>>>>>>>> jobs in a cluster/session. The JM will sync blocklist
>> >>>>>>>>>>> information with the RM.
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Martijn
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>> As Zhu Zhu and I mentioned above, we think MARK_BLOCKLISTED
>> >>>>>>>>>>> (which just limits the load on the node and does not kill all
>> >>>>>>>>>>> the processes on it) is also important, and we think that
>> >>>>>>>>>>> external systems (*yarn rmadmin or kubectl taint*) cannot
>> >>>>>>>>>>> support it. So we think it makes sense even if blocking is only
>> >>>>>>>>>>> *manual*.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are indeed
>> >>>>>>>>>>>> super hard to get right.
>> >>>>>>>>>>> Yes, as you see, Jiangang (and a few others) have the same
>> >>>>>>>>>>> concern. However, we currently plan not to support
>> >>>>>>>>>>> auto-detection in this FLIP, and to support only *manual*
>> >>>>>>>>>>> blocking. In addition, I'd like to say that the FLIP provides a
>> >>>>>>>>>>> mechanism to support MARK_BLOCKLISTED and
>> >>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS; the auto-detection may be
>> >>>>>>>>>>> done by external systems.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Lijie
>> >>>>>>>>>>>
>> >>>>>>>>>>> Martijn Visser <mart...@ververica.com> wrote on Fri, May 6, 2022 at 19:04:
>> >>>>>>>>>>>
>> >>>>>>>>>>>>> If we only support blocking nodes manually, then I cannot see
>> >>>>>>>>>>>>> obvious advantages compared with the current SRE approach
>> >>>>>>>>>>>>> (via *yarn rmadmin or kubectl taint*).
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> To me this sounds yet again like one of those magical
>> >>>>>>>>>>>>> mechanisms that will rarely work just right.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are indeed
>> >>>>>>>>>>>> super hard to get right.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Martijn
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangp...@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks for the valuable design. Auto-detection can save us a
>> >>>>>>>>>>>>> great deal of work. We have implemented a similar feature in
>> >>>>>>>>>>>>> our internal Flink version.
>> >>>>>>>>>>>>> Below are the things that I care about:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>   1. For auto-detecting, I wonder how to make the strategy and
>> >>>>>>>>>>>>>   mark a node blocked? Sometimes the blocked node is hard to
>> >>>>>>>>>>>>>   detect; for example, the upstream node or the downstream
>> >>>>>>>>>>>>>   node will be blocked when the network is unreachable.
>> >>>>>>>>>>>>>   2. I see that the strategy is made on the JobMaster side.
>> >>>>>>>>>>>>>   How about implementing similar logic in the resource
>> >>>>>>>>>>>>>   manager? In session mode, multiple jobs can fail on the same
>> >>>>>>>>>>>>>   bad node and the node should be marked blocked. If the job
>> >>>>>>>>>>>>>   makes the strategy, the node may not be marked blocked if
>> >>>>>>>>>>>>>   the failure count doesn't exceed the threshold.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> wrote on Thu, May 5, 2022 at 23:35:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thank you for all your feedback!
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share some of my
>> >>>>>>>>>>>>>> thoughts:
>> >>>>>>>>>>>>>> 1. Whether to enable automatic blocklisting
>> >>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>> >>>>>>>>>>>>>> The automatic way should be something built upon the
>> >>>>>>>>>>>>>> blocklist mechanism and well decoupled. It was designed to be
>> >>>>>>>>>>>>>> a configurable blocklist strategy, but I think we can further
>> >>>>>>>>>>>>>> decouple it by introducing an abnormal node detector, as
>> >>>>>>>>>>>>>> Becket suggested, which just uses the blocklist mechanism
>> >>>>>>>>>>>>>> once bad nodes are detected. However, it should be a separate
>> >>>>>>>>>>>>>> FLIP with further dev discussions and feedback from users. I
>> >>>>>>>>>>>>>> also agree with Becket that different users have different
>> >>>>>>>>>>>>>> requirements, and we should listen to them.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes externally
>> >>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a way to
>> >>>>>>>>>>>>>> avoid deploying tasks to temporarily hot nodes. In this case,
>> >>>>>>>>>>>>>> users may just want to limit the load of the node and do not
>> >>>>>>>>>>>>>> want to kill all the processes on it. Another case is the
>> >>>>>>>>>>>>>> speculative execution[1] which may also leverage this feature
>> >>>>>>>>>>>>>> to avoid starting mirror tasks on slow nodes.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>> Zhu
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Lijie Wang <wangdachui9...@gmail.com> wrote on Thu, May 5, 2022 at 15:56:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks for your feedback.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize here
>> >>>>>>>>>>>>>>> because it can affect the value and design of the blocklist
>> >>>>>>>>>>>>>>> mechanism (perhaps I should highlight it in the FLIP). We
>> >>>>>>>>>>>>>>> propose two actions in the FLIP:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or node as
>> >>>>>>>>>>>>>>> blocked. Future slots should not be allocated from the
>> >>>>>>>>>>>>>>> blocked task manager or node. But slots that are already
>> >>>>>>>>>>>>>>> allocated will not be affected. A typical application
>> >>>>>>>>>>>>>>> scenario is to mitigate machine hotspots. In this case, we
>> >>>>>>>>>>>>>>> hope that subsequent resource allocations will not be on the
>> >>>>>>>>>>>>>>> hot machine, but tasks currently running on it should not be
>> >>>>>>>>>>>>>>> affected.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task
>> >>>>>>>>>>>>>>> manager or node as blocked, and evacuate all tasks on it.
>> >>>>>>>>>>>>>>> Evacuated tasks will be restarted on non-blocked task
>> >>>>>>>>>>>>>>> managers.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Of the above 2 actions, the former may better highlight the
>> >>>>>>>>>>>>>>> value of this FLIP, because the external system cannot do
>> >>>>>>>>>>>>>>> that.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I basically agree
>> >>>>>>>>>>>>>>> with @Becket Qin: different users have different answers.
>> >>>>>>>>>>>>>>> Not all users’ deployment environments have a special
>> >>>>>>>>>>>>>>> external system that can perform the anomaly detection. In
>> >>>>>>>>>>>>>>> addition, adding pluggable/optional auto-detection doesn't
>> >>>>>>>>>>>>>>> require much extra work on top of manual specification.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I will answer your other questions one by one.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Yangze
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> a) I think you are right, we do not need to expose the
>> >>>>>>>>>>>>>>> `cluster.resource-blocklist.item.timeout-check-interval` to
>> >>>>>>>>>>>>>>> users.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a separate
>> >>>>>>>>>>>>>>> interface (maybe BlocklistExceptionListener), and the
>> >>>>>>>>>>>>>>> ResourceManagerBlocklistHandler can implement it in the
>> >>>>>>>>>>>>>>> future.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Martijn
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> a) I also think the manual blocking should be done by
>> >>>>>>>>>>>>>>> cluster operators.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) I think manual blocking makes sense, because according to
>> >>>>>>>>>>>>>>> my experience, users are often the first to perceive the
>> >>>>>>>>>>>>>>> machine problems (because of job failover or delay), and
>> >>>>>>>>>>>>>>> they will contact cluster operators to solve it, or even
>> >>>>>>>>>>>>>>> tell the cluster operators which machine is problematic.
>> >>>>>>>>>>>>>>> From this point of view, I think the people who really need
>> >>>>>>>>>>>>>>> the manual blocking are the users, and it’s just performed
>> >>>>>>>>>>>>>>> by the cluster operator, so I think the manual blocking
>> >>>>>>>>>>>>>>> makes sense.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Chesnay
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because for
>> >>>>>>>>>>>>>>> MARK_BLOCKLISTED, we need to know whether the slot is
>> >>>>>>>>>>>>>>> blocklisted when the task is FINISHED/CANCELLED/FAILED. If
>> >>>>>>>>>>>>>>> so, SlotPool should release the slot directly to avoid
>> >>>>>>>>>>>>>>> assigning other tasks (of this job) to it. If we only
>> >>>>>>>>>>>>>>> maintain the blocklist information on the RM, the JM needs
>> >>>>>>>>>>>>>>> to retrieve it by RPC. I think the performance overhead of
>> >>>>>>>>>>>>>>> that is relatively large, so I think it's worth maintaining
>> >>>>>>>>>>>>>>> the blocklist information on the JM side and keeping the two
>> >>>>>>>>>>>>>>> in sync.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Роман
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> a) “Probably storing inside Zookeeper/Configmap might be
>> >>>>>>>>>>>>>>> helpful here.” Can you explain it in detail? I don't fully
>> >>>>>>>>>>>>>>> understand that. In my opinion, non-active and active are
>> >>>>>>>>>>>>>>> the same, and no special treatment is required.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes sense, I will
>> >>>>>>>>>>>>>>> add it to FLIP.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Yang
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> As mentioned above, AFAIK, the external system cannot
>> >>>>>>>>>>>>>>> support the MARK_BLOCKLISTED action.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Looking forward to your further feedback.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Yang Wang <danrtsey...@gmail.com> wrote on Tue, May 3, 2022 at 21:09:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster
>> >>>>>>> operations.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> In the production environment, the SRE (aka Site
>> >>>>>>>>>>>>>>>> Reliability Engineer) already has many tools to detect the
>> >>>>>>>>>>>>>>>> unstable nodes, which could take the system logs/metrics
>> >>>>>>>>>>>>>>>> into consideration.
>> >>>>>>>>>>>>>>>> Then they use graceful decommission in YARN and taint in
>> >>>>>>>>>>>>>>>> K8s to prevent new allocations on these unstable nodes.
>> >>>>>>>>>>>>>>>> Finally, they will evict all the containers and pods
>> >>>>>>>>>>>>>>>> running on these nodes.
>> >>>>>>>>>>>>>>>> This mechanism also works for planned maintenance. So I am
>> >>>>>>>>>>>>>>>> afraid this is not the typical use case for FLIP-224.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> If we only support blocking nodes manually, then I cannot
>> >>>>>>>>>>>>>>>> see obvious advantages compared with the current SRE
>> >>>>>>>>>>>>>>>> approach (via *yarn rmadmin or kubectl taint*).
>> >>>>>>>>>>>>>>>> At least, we need to have a pluggable component which could
>> >>>>>>>>>>>>>>>> expose the potentially unstable nodes automatically and
>> >>>>>>>>>>>>>>>> block them if enabled explicitly.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>> Yang
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Becket Qin <becket....@gmail.com> wrote on Mon, May 2, 2022 at 16:36:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> This is an interesting feature and discussion, and
>> >>>>>>>>>>>>>>>>> somewhat related to the design principle about how people
>> >>>>>>>>>>>>>>>>> should operate Flink.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I think there are three things involved in this FLIP.
>> >>>>>>>>>>>>>>>>>     a) Detect and report the unstable node.
>> >>>>>>>>>>>>>>>>>     b) Collect the information of the unstable node and
>> >>>>>>>>>>>>>>>>>        form a blocklist.
>> >>>>>>>>>>>>>>>>>     c) Take the action to block nodes.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> My two cents:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink should have
>> >>>>>>>>>>>>>>>>> c). It is not only useful for cases of node failures, but
>> >>>>>>>>>>>>>>>>> also handy for some planned maintenance.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e. who should
>> >>>>>>>>>>>>>>>>> be the brain to make the decision to block a node. I think
>> >>>>>>>>>>>>>>>>> this largely depends on who we talk to. Different users
>> >>>>>>>>>>>>>>>>> would probably give different answers. For people who do
>> >>>>>>>>>>>>>>>>> have a centralized node health management service, letting
>> >>>>>>>>>>>>>>>>> Flink do just a) and c) would be preferred. So essentially
>> >>>>>>>>>>>>>>>>> Flink would be one of the sources that may detect unstable
>> >>>>>>>>>>>>>>>>> nodes, report them to that service, and then take the
>> >>>>>>>>>>>>>>>>> command from that service to block the problematic nodes.
>> >>>>>>>>>>>>>>>>> On the other hand, for users who do not have such a
>> >>>>>>>>>>>>>>>>> service, simply letting Flink be clever by itself to block
>> >>>>>>>>>>>>>>>>> the suspicious nodes might be desired to ensure the jobs
>> >>>>>>>>>>>>>>>>> are running smoothly.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> So that indicates a) and b) here should be pluggable /
>> >>>>>>>>>>>>>>>>> optional.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to have
>> >>>>>>>>>>>>>>>>> something pluggable like a UnstableNodeReporter which
>> >>>>>>>>>>>>>>>>> exposes unstable nodes actively. (A more general interface
>> >>>>>>>>>>>>>>>>> should be JobInfoReporter<T> which can be used to report
>> >>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep the scope
>> >>>>>>>>>>>>>>>>> relevant to this FLIP here). Personally speaking, I think
>> >>>>>>>>>>>>>>>>> it is OK to have a default implementation of a reporter
>> >>>>>>>>>>>>>>>>> which just tells Flink to take action to block problematic
>> >>>>>>>>>>>>>>>>> nodes and also unblocks them after timeout.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro.v.bo...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks for the good initiative, Lijie and Zhu!
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in development.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I agree with the 3rd point of Konstantin's reply - we
>> >>>>>>>>>>>>>>>>>> should consider somehow moving the information about
>> >>>>>>>>>>>>>>>>>> blocklisted nodes/TMs from the active ResourceManager to
>> >>>>>>>>>>>>>>>>>> non-active ones. Probably storing inside
>> >>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of organizations
>> >>>>>>>>>>>>>>>>>> don't want to expose such an API to a cluster user group.
>> >>>>>>>>>>>>>>>>>> But I think it's necessary to have the mechanism for
>> >>>>>>>>>>>>>>>>>> unblocking the nodes/TMs anyway, to avoid incorrect
>> >>>>>>>>>>>>>>>>>> automatic behaviour.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And another one small suggestion - I think it
>> >>>> would
>> >>>>> be
>> >>>>>>>>> better
>> >>>>>>>>>>> to
>> >>>>>>>>>>>>>> extend
>> >>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> *BlocklistedItem* class with the
>> >> *endTimestamp*
>> >>>>> field
>> >>>>>>> and
>> >>>>>>>>>> fill
>> >>>>>>>>>>> it
>> >>>>>>>>>>>>>> at the
>> >>>>>>>>>>>>>>>>>> item creation. This simple addition will allow
>> >>> to:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>   - Give users the ability to set the exact end time of a
>> >>>>>>>>>>>>>>>>>>     blocklist item through the REST API
>> >>>>>>>>>>>>>>>>>>   - Not be tied to a single value of
>> >>>>>>>>>>>>>>>>>>     *cluster.resource-blacklist.item.timeout*
>> >>>>>>>>>>>>>>>>>>
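A minimal sketch of the *endTimestamp* suggestion above, assuming a simplified item that holds only an identifier and an expiry instant. The class shape, the `withTimeout` factory and the `isExpired` check are hypothetical names used for illustration and are not taken from FLIP-224:

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical blocklist entry; the field set is an assumption, not the FLIP's class. */
final class BlocklistedItem {
    private final String identifier;    // node or TaskManager id
    private final Instant endTimestamp; // fixed once, at item creation

    BlocklistedItem(String identifier, Instant endTimestamp) {
        this.identifier = identifier;
        this.endTimestamp = endTimestamp;
    }

    /** Derives the end time from a per-item timeout instead of one cluster-wide value. */
    static BlocklistedItem withTimeout(String identifier, Instant now, Duration timeout) {
        return new BlocklistedItem(identifier, now.plus(timeout));
    }

    String getIdentifier() {
        return identifier;
    }

    Instant getEndTimestamp() {
        return endTimestamp;
    }

    /** An item counts as expired once the current time reaches its end timestamp. */
    boolean isExpired(Instant now) {
        return !now.isBefore(endTimestamp);
    }
}
```

Because each item carries its own end time, a REST caller could pass either an absolute timestamp or a timeout, and a periodic check would only need to compare against `getEndTimestamp()`.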
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler
>> >> <
>> >>>>>>>>>>>>> ches...@apache.org>
>> >>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> I do share the concern between blurring the
>> >>>> lines
>> >>>>> a
>> >>>>>>>> bit.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any
>> >>>>> auto-detection
>> >>>>>>>> and
>> >>>>>>>>>> only
>> >>>>>>>>>>>>>> have an
>> >>>>>>>>>>>>>>>>>>> opt-in mechanism
>> >>>>>>>>>>>>>>>>>>> to manually block processes/nodes. To me
>> >> this
>> >>>>> sounds
>> >>>>>>>> yet
>> >>>>>>>>>>> again
>> >>>>>>>>>>>>>> like one
>> >>>>>>>>>>>>>>>>>>> of those
>> >>>>>>>>>>>>>>>>>>> magical mechanisms that will rarely work
>> >> just
>> >>>>> right.
>> >>>>>>>>>>>>>>>>>>> An external system can leverage way more
>> >>>>> information
>> >>>>>>>>> after
>> >>>>>>>>>>> all.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the
>> >>>> complexity
>> >>>>>>> of
>> >>>>>>>>> this
>> >>>>>>>>>>>>>> proposal.
>> >>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing
>> >>> between
>> >>>>>>>>>> components;
>> >>>>>>>>>>>>>>>>> adjustments
>> >>>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>> slot and resource protocol.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> If we look at it purely from an active
>> >>> resource
>> >>>>>>>>> management
>> >>>>>>>>>>>>>> perspective,
>> >>>>>>>>>>>>>>>>>>> then there
>> >>>>>>>>>>>>>>>>>>> isn't really a need to touch the slot
>> >> protocol
>> >>>> at
>> >>>>>>> all
>> >>>>>>>> (or
>> >>>>>>>>>> in
>> >>>>>>>>>>>>> fact
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>>>> anything in the JobMaster),
>> >>>>>>>>>>>>>>>>>>> because there isn't any point in keeping
>> >>> around
>> >>>>>>> blocked
>> >>>>>>>>> TMs
>> >>>>>>>>>>> in
>> >>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> first
>> >>>>>>>>>>>>>>>>>>> place.
>> >>>>>>>>>>>>>>>>>>> They'd just be idling, potentially shutting
>> >>> down
>> >>>>>>> after
>> >>>>>>>> a
>> >>>>>>>>>>> while
>> >>>>>>>>>>>>> by
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> RM
>> >>>>>>>>>>>>>>>>>>> because of
>> >>>>>>>>>>>>>>>>>>> it (unless we _also_ touch that logic).
>> >>>>>>>>>>>>>>>>>>> Here the blocking of a process (be it by
>> >>>> blocking
>> >>>>>>> the
>> >>>>>>>>>> process
>> >>>>>>>>>>>>> or
>> >>>>>>>>>>>>>> node)
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked
>> >>>>>>> process(es).
>> >>>>>>>>>>>>>>>>>>> Once the block is lifted we can just spin it
>> >>>> back
>> >>>>>>> up.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply
>> >> the
>> >>>> same
>> >>>>>>> line
>> >>>>>>>>> of
>> >>>>>>>>>>>>>> thinking to
>> >>>>>>>>>>>>>>>>>>> standalone resource management.
>> >>>>>>>>>>>>>>>>>>> Here being able to stop/restart a
>> >> process/node
>> >>>>>>> manually
>> >>>>>>>>>>> should
>> >>>>>>>>>>>>> be
>> >>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>> core
>> >>>>>>>>>>>>>>>>>>> requirement for a Flink deployment anyway.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>> >>>>>>>>>>>>>>>>>>>> Hi everyone,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can
>> >>>> understand
>> >>>>>>> the
>> >>>>>>>>>> problem
>> >>>>>>>>>>>>> and
>> >>>>>>>>>>>>>> I see
>> >>>>>>>>>>>>>>>>>>> value
>> >>>>>>>>>>>>>>>>>>>> in the automatic detection and
>> >>> blocklisting. I
>> >>>>> do
>> >>>>>>>> have
>> >>>>>>>>>> some
>> >>>>>>>>>>>>>> concerns
>> >>>>>>>>>>>>>>>>>> with
>> >>>>>>>>>>>>>>>>>>>> the ability to manually specify to be
>> >>> blocked
>> >>>>>>>>> resources.
>> >>>>>>>>>> I
>> >>>>>>>>>>>>> have
>> >>>>>>>>>>>>>> two
>> >>>>>>>>>>>>>>>>>>>> concerns;
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a
>> >>>>> separation
>> >>>>>>> of
>> >>>>>>>>>>>>> concerns,
>> >>>>>>>>>>>>>>>>> meaning
>> >>>>>>>>>>>>>>>>>>> that
>> >>>>>>>>>>>>>>>>>>>> there's a group who's responsible for
>> >>>> managing a
>> >>>>>>>>> cluster
>> >>>>>>>>>>> and
>> >>>>>>>>>>>>>> there's
>> >>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>> user
>> >>>>>>>>>>>>>>>>>>>> group who uses that cluster. With the
>> >>>>>>> introduction of
>> >>>>>>>>>> this
>> >>>>>>>>>>>>>> mechanism,
>> >>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>> latter group now can influence the
>> >>>>> responsibility
>> >>>>>>> of
>> >>>>>>>>> the
>> >>>>>>>>>>>>> first
>> >>>>>>>>>>>>>> group.
>> >>>>>>>>>>>>>>>>>> So
>> >>>>>>>>>>>>>>>>>>> it
>> >>>>>>>>>>>>>>>>>>>> can be possible that someone from the user
>> >>>> group
>> >>>>>>>> blocks
>> >>>>>>>>>>>>>> something,
>> >>>>>>>>>>>>>>>>>> which
>> >>>>>>>>>>>>>>>>>>>> causes an outage (which could result in
>> >>> paging
>> >>>>>>>>> mechanism
>> >>>>>>>>>>>>>> triggering
>> >>>>>>>>>>>>>>>>>> etc)
>> >>>>>>>>>>>>>>>>>>>> which impacts the first group.
>> >>>>>>>>>>>>>>>>>>>> * How big is the group of people who can
>> >> go
>> >>>>>>> through
>> >>>>>>>> the
>> >>>>>>>>>>>>> process
>> >>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>> manually
>> >>>>>>>>>>>>>>>>>>>> identifying a node that isn't behaving as
>> >> it
>> >>>>>>> should
>> >>>>>>>>> be? I
>> >>>>>>>>>>> do
>> >>>>>>>>>>>>>> think
>> >>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>> group is relatively limited. Does it then
>> >>> make
>> >>>>>>> sense
>> >>>>>>>> to
>> >>>>>>>>>>>>>> introduce
>> >>>>>>>>>>>>>>>>> such
>> >>>>>>>>>>>>>>>>>> a
>> >>>>>>>>>>>>>>>>>>>> feature, which would only be used by a
>> >>> really
>> >>>>>>> small
>> >>>>>>>>> user
>> >>>>>>>>>>>>> group
>> >>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>> Flink?
>> >>>>>>>>>>>>>>>>>>> We
>> >>>>>>>>>>>>>>>>>>>> still have to maintain, test and support
>> >>> such
>> >>>> a
>> >>>>>>>>> feature.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but
>> >>> I'm
>> >>>>>>>> leaning
>> >>>>>>>>>>>>> towards
>> >>>>>>>>>>>>>> not
>> >>>>>>>>>>>>>>>>>>> exposing
>> >>>>>>>>>>>>>>>>>>>> this to the user group but having this
>> >>>> available
>> >>>>>>>>> strictly
>> >>>>>>>>>>> for
>> >>>>>>>>>>>>>> cluster
>> >>>>>>>>>>>>>>>>>>>> operators. They could then also set up
>> >> their
>> >>>>>>>>>>>>>> paging/metrics/logging
>> >>>>>>>>>>>>>>>>>>> system
>> >>>>>>>>>>>>>>>>>>>> to take this into account.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Martijn Visser
>> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <
>> >>>>>>>>>>> karma...@gmail.com
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just share
>> >>> some
>> >>>>>>> cents
>> >>>>>>>>> here:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>> >>>>>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout-check-interval` to the user?
>> >>>>>>>>>>>>>>>>>>>>> I think the semantics of `cluster.resource-blacklist.item.timeout` are
>> >>>>>>>>>>>>>>>>>>>>> sufficient for the user. How the timeout is enforced is an internal
>> >>>>>>>>>>>>>>>>>>>>> implementation detail of Flink. I think exposing the check interval
>> >>>>>>>>>>>>>>>>>>>>> would be very confusing, and we do not need to expose it to users.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the
>> >> exception
>> >>>> of a
>> >>>>>>> task
>> >>>>>>>>>>>>> manager to
>> >>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well.
>> >>>>>>>>>>>>>>>>>>>>> For example, the slot allocation might
>> >> fail
>> >>>> in
>> >>>>>>> case
>> >>>>>>>>> the
>> >>>>>>>>>>>>> target
>> >>>>>>>>>>>>>> task
>> >>>>>>>>>>>>>>>>>>>>> manager is busy or has a network jitter.
>> >> I
>> >>>>> don't
>> >>>>>>>> mean
>> >>>>>>>>> we
>> >>>>>>>>>>>>> need
>> >>>>>>>>>>>>>> to
>> >>>>>>>>>>>>>>>>> cover
>> >>>>>>>>>>>>>>>>>>>>> this case in this version, but we can
>> >> also
>> >>>>> open a
>> >>>>>>>>>>>>>> `notifyException`
>> >>>>>>>>>>>>>>>>> in
>> >>>>>>>>>>>>>>>>>>>>> `ResourceManagerBlacklistHandler`.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to the ResourceManager, will the slots
>> >>>>>>>>>>>>>>>>>>>>> of a blocked task manager continue to be released and allocated?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>> Yangze Guo
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie
>> >> Wang
>> >>> <
>> >>>>>>>>>>>>>>>>> wangdachui9...@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will respond to your 4 remarks:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the
>> >>>>> controversy. I
>> >>>>>>>>> think
>> >>>>>>>>>>>>>> “BlockList”
>> >>>>>>>>>>>>>>>>> is
>> >>>>>>>>>>>>>>>>>>>>> good
>> >>>>>>>>>>>>>>>>>>>>>> enough, and I will change it in FLIP.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a good idea. Based on the
>> >>>>>>>>>>>>>>>>>>>>>> above, I would change the REST API as follows:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> DELETE
>> >> <host>/blocklist/node/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>> <host>/blocklist/taskmanager/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>
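The endpoint layout listed above could be exercised from a client roughly as follows. This is a sketch only: it builds requests with the JDK's `java.net.http.HttpRequest` (Java 11+) and sends nothing. The `BlocklistRequests` class, the JSON body and the example host are assumptions; only the paths come from the list above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that maps the proposed endpoints to HTTP requests. */
final class BlocklistRequests {

    /** POST <host>/blocklist/nodes; the JSON body format is an assumption. */
    static HttpRequest blockNode(String host, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(host + "/blocklist/nodes"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** GET <host>/blocklist/nodes */
    static HttpRequest listBlockedNodes(String host) {
        return HttpRequest.newBuilder(URI.create(host + "/blocklist/nodes"))
                .GET()
                .build();
    }

    /** DELETE <host>/blocklist/node/<identifier>; the identifier is assumed to be URL-safe. */
    static HttpRequest unblockNode(String host, String identifier) {
        return HttpRequest.newBuilder(URI.create(host + "/blocklist/node/" + identifier))
                .DELETE()
                .build();
    }
}
```

The task manager endpoints would follow the same pattern with `taskmanagers` and `taskmanager` in the path. A request built this way can be sent with `java.net.http.HttpClient`.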
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocked/blocklisted, it means that all task managers
>> >>>>>>>>>>>>>>>>>>>>>> on this node are blocklisted. All slots on these TMs are not
>> >>>>>>>>>>>>>>>>>>>>>> available. This is actually a bit like a TM loss, but these TMs are
>> >>>>>>>>>>>>>>>>>>>>>> not really lost: they are in an unavailable status, and they are
>> >>>>>>>>>>>>>>>>>>>>>> still registered in this Flink cluster. They will be available again
>> >>>>>>>>>>>>>>>>>>>>>> once the corresponding blocklist item is removed. This behavior is
>> >>>>>>>>>>>>>>>>>>>>>> the same in active/non-active clusters. However, in active clusters
>> >>>>>>>>>>>>>>>>>>>>>> these TMs may be released due to idle timeouts.
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to keep it. The reasons are as
>> >>>>>>>>>>>>>>>>>>>>>> follows:
>> >>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users
>> >>> adding
>> >>>> or
>> >>>>>>>>> removing
>> >>>>>>>>>>>>> items
>> >>>>>>>>>>>>>> via
>> >>>>>>>>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>>>>>>>> API,
>> >>>>>>>>>>>>>>>>>>>>>> and users can disable it by configuring
>> >> it
>> >>>> to
>> >>>>>>>>>>>>> Long.MAX_VALUE .
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after
>> >> a
>> >>>>>>> period of
>> >>>>>>>>>> time
>> >>>>>>>>>>>>>> (such as
>> >>>>>>>>>>>>>>>>>>> machine
>> >>>>>>>>>>>>>>>>>>>>>> hotspots), in which case users may
>> >> prefer
>> >>>> that
>> >>>>>>>> Flink
>> >>>>>>>>>> can
>> >>>>>>>>>>> do
>> >>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>>> automatically instead of requiring the
>> >>> user
>> >>>> to
>> >>>>>>> do
>> >>>>>>>> it
>> >>>>>>>>>>>>> manually.
>> >>>>>>>>>>>>>>>>>>>>>>
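One detail of point a) is worth a sketch: if the timeout is configured to Long.MAX_VALUE, naively adding it to the current time overflows. The helper below is hypothetical (not Flink code) and assumes both inputs are non-negative millisecond values. It saturates the end timestamp and treats a saturated value as "never expires":

```java
/** Hypothetical expiry arithmetic for blocklist items; not taken from Flink. */
final class BlocklistTimeout {

    /**
     * Returns the expiry time in epoch milliseconds.
     * Assumes nowMillis and timeoutMillis are non-negative, so a negative sum means overflow.
     */
    static long endTimestamp(long nowMillis, long timeoutMillis) {
        long end = nowMillis + timeoutMillis;
        return end < 0 ? Long.MAX_VALUE : end; // saturate instead of wrapping around
    }

    /** A saturated end timestamp means the timeout is disabled, so the item never expires. */
    static boolean isExpired(long endTimestampMillis, long nowMillis) {
        return endTimestampMillis != Long.MAX_VALUE && nowMillis >= endTimestampMillis;
    }
}
```

With this, an item with the timeout disabled stays in the blocklist until a user removes it through the REST API, while a finite timeout covers the self-recovering cases from point b).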
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 27, 2022 at 19:23, Konstantin Knauf <kna...@apache.org>
>> >>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> I think, this makes sense and +1 to
>> >> only
>> >>>>>>> support
>> >>>>>>>>>>> manually
>> >>>>>>>>>>>>>> blocking
>> >>>>>>>>>>>>>>>>>>>>>>> taskmanagers and nodes. Maybe the
>> >>> different
>> >>>>>>>>> strategies
>> >>>>>>>>>>> can
>> >>>>>>>>>>>>>> also be
>> >>>>>>>>>>>>>>>>>>>>>>> maintained outside of Apache Flink.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than "bla.cklist" due to the controversy
>> >>>>>>>>>>>>>>>>>>>>>>> around the term? [1] There was also a Jira ticket about this topic a
>> >>>>>>>>>>>>>>>>>>>>>>> while back, and there was generally a consensus to avoid the terms
>> >>>>>>>>>>>>>>>>>>>>>>> blacklist & whitelist [2]. We could use "blocklist", "denylist" or
>> >>>>>>>>>>>>>>>>>>>>>>> "quarantined".
>> >>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a slightly different design, as
>> >>>>>>>>>>>>>>>>>>>>>>> verbs like add/remove are often considered an anti-pattern for REST
>> >>>>>>>>>>>>>>>>>>>>>>> APIs. POST on a list item is generally the standard to add items.
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE on the individual resource is standard to remove an item.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>> <host>/quarantine/items/<itemidentifier>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> We could also consider to separate
>> >>>>> taskmanagers
>> >>>>>>>> and
>> >>>>>>>>>>> nodes
>> >>>>>>>>>>>>> in
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>> REST
>> >>>>>>>>>>>>>>>>>>>>> API
>> >>>>>>>>>>>>>>>>>>>>>>> (and internal data structures). Any
>> >>> opinion
>> >>>>> on
>> >>>>>>>> this?
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>> <host>/quarantine/nodes/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>> DELETE
>> >>>>>>> <host>/quarantine/taskmanager/<identifier>
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with
>> >>>>>>> non-active
>> >>>>>>>>>>>>> resource
>> >>>>>>>>>>>>>>>>>> managers,
>> >>>>>>>>>>>>>>>>>>>>> i.e.
>> >>>>>>>>>>>>>>>>>>>>>>> standalone or reactive mode?
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more
>> >>>>>>> minimal,
>> >>>>>>>> do
>> >>>>>>>>> we
>> >>>>>>>>>>>>> need
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>> timeout
>> >>>>>>>>>>>>>>>>>>>>>>> behavior? If items are added/removed
>> >>>> manually
>> >>>>>>> we
>> >>>>>>>>> could
>> >>>>>>>>>>>>>> delegate
>> >>>>>>>>>>>>>>>>> this
>> >>>>>>>>>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>>>>>>> user easily. In my opinion the timeout
>> >>>>> behavior
>> >>>>>>>>> would
>> >>>>>>>>>>>>> better
>> >>>>>>>>>>>>>> fit
>> >>>>>>>>>>>>>>>>>> into
>> >>>>>>>>>>>>>>>>>>>>>>> specific strategies at a later point.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> Konstantin
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> [1] https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>> >>>>>>>>>>>>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 27, 2022 at 04:04, Lijie Wang <wangdachui9...@gmail.com>
>> >>>>>>>>>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to
>> >>>> cluster
>> >>>>>>> node
>> >>>>>>>>>>> issues
>> >>>>>>>>>>>>>>>>>>>>> (insufficient
>> >>>>>>>>>>>>>>>>>>>>>>> disk
>> >>>>>>>>>>>>>>>>>>>>>>>> space, bad hardware, network
>> >>>> abnormalities).
>> >>>>>>>> Flink
>> >>>>>>>>>> will
>> >>>>>>>>>>>>>> take care
>> >>>>>>>>>>>>>>>>>> of
>> >>>>>>>>>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>>>>>>>>> failures and redeploy the tasks.
>> >>> However,
>> >>>>> due
>> >>>>>>> to
>> >>>>>>>>> data
>> >>>>>>>>>>>>>> locality
>> >>>>>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>>>>>>> limited
>> >>>>>>>>>>>>>>>>>>>>>>>> resources, the new tasks are very
>> >> likely
>> >>>> to
>> >>>>> be
>> >>>>>>>>>>> redeployed
>> >>>>>>>>>>>>>> to the
>> >>>>>>>>>>>>>>>>>> same
>> >>>>>>>>>>>>>>>>>>>>>>>> nodes, which will result in continuous
>> >>>> task
>> >>>>>>>>>>> abnormalities
>> >>>>>>>>>>>>>> and
>> >>>>>>>>>>>>>>>>>> affect
>> >>>>>>>>>>>>>>>>>>>>> job
>> >>>>>>>>>>>>>>>>>>>>>>>> progress.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to manually identify the problematic
>> >>>>>>>>>>>>>>>>>>>>>>>> node and take it offline to solve this problem. But this approach
>> >>>>>>>>>>>>>>>>>>>>>>>> has the following disadvantages:
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a heavy process. Users may need to
>> >>>>>>>>>>>>>>>>>>>>>>>> contact cluster administrators to do this. The operation can even be
>> >>>>>>>>>>>>>>>>>>>>>>>> dangerous and not allowed during some important business events.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind of problem manually would be
>> >>>>>>>>>>>>>>>>>>>>>>>> slow and a waste of human resources.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I
>> >>>> propose
>> >>>>>>> to
>> >>>>>>>>>>>>> introduce a
>> >>>>>>>>>>>>>>>>>> blacklist
>> >>>>>>>>>>>>>>>>>>>>>>>> mechanism for Flink to filter out
>> >>>>> problematic
>> >>>>>>>>>>> resources.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> You can find more details in FLIP-224 [1]. Looking forward to your
>> >>>>>>>>>>>>>>>>>>>>>>>> feedback.
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>>>>> Lijie
>> >>>>>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best regards,
>> >>>>>>>>> Roman Boyko
>> >>>>>>>>> e.: ro.v.bo...@gmail.com
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> https://twitter.com/snntrable
>> >>>> https://github.com/knaufk
>> >>>>
>> >>>
>> >>
>> >>
>> >> --
>> >> https://twitter.com/snntrable
>> >> https://github.com/knaufk
>> >>
>>
>>
