Re: [DISCUSS] FLIP-224: Blacklist Mechanism

Lijie Wang Thu, 19 May 2022 22:20:39 -0700

Hi everyone,

I have started a vote for this FLIP [1]. Please cast your vote there or ask
additional questions here.

[1] https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg


Best, Lijie

On Thu, May 19, 2022 at 17:34, Lijie Wang <[email protected]> wrote:

> Hi Konstantin,
>
> We found that Flink REST URLs do not support the ":merge" format: because it
> starts with a colon, it would be recognized as a parameter in the URL.
>
> We will keep the previous approach, i.e.
>
> POST: http://{jm_rest_address:port}/blocklist/taskmanagers
> with the "id" and the "merge" flag put into the request body.
>
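A minimal sketch of the request shape described above, in Python. Only the path /blocklist/taskmanagers and the "id" and "merge" body fields come from this thread; the host, port and example ID are placeholders, and the exact JSON schema is whatever the FLIP's "Public Interface" section defines, not this snippet. The request is only built here, not sent.

```python
import json
import urllib.request


def build_add_request(jm_rest_address, tm_id, merge=False):
    """Build (but do not send) the POST that adds a task manager to the blocklist."""
    url = f"http://{jm_rest_address}/blocklist/taskmanagers"
    # "id" and "merge" travel in the request body rather than in the URL path.
    body = json.dumps({"id": tm_id, "merge": merge}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address and ID, for illustration only.
request = build_add_request("localhost:8081", "tm-1", merge=True)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Keeping the flag in the body sidesteps the colon problem entirely, since the path no longer carries anything that could be parsed as a parameter.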
> Best,
> Lijie
>
> On Wed, May 18, 2022 at 09:35, Lijie Wang <[email protected]> wrote:
>
>> Hi Weihua,
>> thanks for the feedback.
>>
>> 1. Yes, only *manual* blocking is supported in this FLIP, but it is the
>> first step towards auto-detection.
>> 2. We will print the blocked nodes in the logs. We may also include them in
>> the exception thrown for insufficient resources.
>> 3. No. This FLIP won't change the WebUI. The blocklist information can be
>> obtained through the REST API and metrics.
>>
>> Best,
>> Lijie
>>
>> On Tue, May 17, 2022 at 21:41, Weihua Hu <[email protected]> wrote:
>>
>>> Hi,
>>> Thanks for creating this FLIP.
>>> We have implemented an automatic blocklist detection mechanism
>>> internally, which is indeed very effective for handling node failures.
>>> Because of the large number of nodes, although our SREs already support
>>> taking failed nodes offline automatically, the detection is not 100%
>>> accurate and there is some delay.
>>> So the blocklist mechanism can make Flink jobs recover from failures much
>>> faster.
>>>
>>> Here are some of my thoughts:
>>> 1. This FLIP requires users to locate machine failures manually, which
>>> comes with a certain cost of use.
>>> 2. What happens if too many nodes are blocked, resulting in insufficient
>>> resources? Will there be a special exception for the user?
>>> 3. Will we display the blocklist information in the WebUI? The blocklist
>>> is cluster-level, so if multiple users share a cluster, some users may be
>>> a little confused when resources are insufficient, or when more resources
>>> are applied for.
>>>
>>> Also, looking forward to the next FLIP on auto-detection.
>>>
>>> Best,
>>> Weihua
>>>
>>> > On May 16, 2022, at 11:22 PM, Lijie Wang <[email protected]> wrote:
>>> >
>>> > Hi Konstantin,
>>> >
>>> > Maybe change it to the following:
>>> >
>>> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>> > Merge is not allowed. If the {id} already exists, return an error.
>>> > Otherwise, create a new item.
>>> >
>>> > 2. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>> > Merge is allowed. If the {id} already exists, merge. Otherwise, create a
>>> > new item.
>>> >
>>> > WDYT?
>>> >
>>> > Best,
>>> > Lijie
>>> >
>>> > On Mon, May 16, 2022 at 20:07, Konstantin Knauf <[email protected]> wrote:
>>> >
>>> >> Hi Lijie,
>>> >>
>>> >> hm, maybe the following is more appropriate in that case
>>> >>
>>> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>>> >>
>>> >> Best,
>>> >>
>>> >> Konstantin
>>> >>
>>> >> On Mon, May 16, 2022 at 07:05, Lijie Wang <[email protected]> wrote:
>>> >>
>>> >>> Hi Konstantin,
>>> >>> thanks for your feedback.
>>> >>>
>>> >>> From what I understand, PUT should be idempotent. However, we have a
>>> >>> *timeout* field in the request. This means that issuing the same
>>> >>> request at two different times will lead to different resource states
>>> >>> (the timestamps at which the items are removed will differ).
>>> >>>
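The idempotency concern can be illustrated with a small sketch. The function and the dictionary-based state are hypothetical stand-ins for the real handler; the only point is that a relative timeout makes the stored removal time depend on when the request is processed.

```python
def apply_add(blocklist, item_id, timeout_ms, now_ms):
    """Apply an add/replace request and return the new blocklist state."""
    updated = dict(blocklist)
    # The removal time is derived from the moment the request is handled.
    updated[item_id] = now_ms + timeout_ms
    return updated


# The same request body, handled at two different times.
first = apply_add({}, "tm-1", timeout_ms=60_000, now_ms=1_000)
second = apply_add({}, "tm-1", timeout_ms=60_000, now_ms=5_000)
print(first, second, first == second)
```

An absolute end timestamp in the request would not have this property, because the stored value would be the same no matter when the request arrives.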
>>> >>> Should we use PUT in this case? WDYT?
>>> >>>
>>> >>> Best,
>>> >>> Lijie
>>> >>>
>>> >>> On Fri, May 13, 2022 at 17:20, Konstantin Knauf <[email protected]> wrote:
>>> >>>
>>> >>>> Hi Lijie,
>>> >>>>
>>> >>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT on
>>> >>>> the resource?
>>> >>>>
>>> >>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>>> >>>>
>>> >>>> Best,
>>> >>>>
>>> >>>> Konstantin
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Fri, May 13, 2022 at 11:01, Lijie Wang <[email protected]> wrote:
>>> >>>>
>>> >>>>> Hi everyone,
>>> >>>>>
>>> >>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and made
>>> >>>>> the following changes to the REST API:
>>> >>>>> 1. To avoid ambiguity, only one of *timeout* and *endTimestamp* may be
>>> >>>>> specified. If both are specified, an error is returned.
>>> >>>>> 2. If the specified item already exists, the *ADD* operation has two
>>> >>>>> possible behaviors: *return error* (the default) or *merge/update*. We
>>> >>>>> added a flag to the request body to control it. You can find more
>>> >>>>> details in the "Public Interface" section.
>>> >>>>>
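The two rules above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the exception type and the dictionary-based blocklist are hypothetical, "merge/update" is assumed to simply overwrite the stored removal time, and omitting both fields is assumed to mean a permanent item (as stated in an earlier reply quoted further down).

```python
class BlocklistError(Exception):
    """Raised when an ADD request is rejected."""


def resolve_end_timestamp(now_ms, timeout_ms=None, end_timestamp_ms=None):
    """Return the removal time in ms, or None for a permanent item."""
    if timeout_ms is not None and end_timestamp_ms is not None:
        raise BlocklistError("timeout and endTimestamp are mutually exclusive")
    if timeout_ms is not None:
        return now_ms + timeout_ms
    return end_timestamp_ms


def add_item(blocklist, item_id, end_ts, merge=False):
    """Add an item; an existing item is an error unless merge is set."""
    if item_id in blocklist and not merge:
        raise BlocklistError(f"item {item_id} already exists")
    # Assumption: merge/update simply overwrites the stored removal time.
    blocklist[item_id] = end_ts


blocklist = {}
add_item(blocklist, "tm-1", resolve_end_timestamp(1_000, timeout_ms=60_000))
add_item(
    blocklist,
    "tm-1",
    resolve_end_timestamp(1_000, end_timestamp_ms=90_000),
    merge=True,
)
print(blocklist)
```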
>>> >>>>> If there is no more feedback, we will start the vote thread next week.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>> Lijie
>>> >>>>>
>>> >>>>> On Tue, May 10, 2022 at 17:14, Lijie Wang <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> Hi Becket Qin,
>>> >>>>>>
>>> >>>>>> Thanks for your suggestions. I have moved the description of the
>>> >>>>>> configurations, metrics and REST API into the "Public Interface"
>>> >>>>>> section, and made a few updates according to your suggestions. In this
>>> >>>>>> FLIP, there are no public Java interfaces or pluggables that users
>>> >>>>>> need to implement themselves.
>>> >>>>>>
>>> >>>>>> Answers to your questions:
>>> >>>>>> 1. Yes, there are 2 block actions: MARK_BLOCKED and
>>> >>>>>> MARK_BLOCKED_AND_EVACUATE_TASKS (they have been renamed). Currently,
>>> >>>>>> block items can only be added through the REST API, so these 2 actions
>>> >>>>>> are mentioned in the REST API part (which has now been moved to the
>>> >>>>>> public interface section).
>>> >>>>>> 2. I agree with you. I have changed the "Cause" field to a String, and
>>> >>>>>> users can specify it via the REST API.
>>> >>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned above,
>>> >>>>>> we will introduce 2 fields, *timeout* and *endTimestamp*, into the ADD
>>> >>>>>> REST API to specify when to remove the blocked item. Both fields are
>>> >>>>>> optional. If neither is specified, the blocked item is permanent and
>>> >>>>>> will not be removed. If both are specified, the minimum of
>>> >>>>>> *currentTimestamp + timeout* and *endTimestamp* will be used as the
>>> >>>>>> time to remove the blocked item. To keep the configuration minimal, we
>>> >>>>>> have removed the *cluster.resource-blocklist.item.timeout*
>>> >>>>>> configuration option.
>>> >>>>>> 4. Yes, the block item will be overridden if the specified item
>>> >>>>>> already exists. The ADD operation is *ADD or UPDATE*.
>>> >>>>>> 5. Yes. On the JM/RM side, all the blocklist information is maintained
>>> >>>>>> in JMBlocklistHandler/RMBlocklistHandler. The blocklist handler (or
>>> >>>>>> other interfaces abstracted from it) will be propagated to the
>>> >>>>>> different components.
>>> >>>>>>
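The removal-time rule from answer 3 of this reply can be written down as a short sketch. The function name and the millisecond units are assumptions made for illustration; note that the later message from May 13 (quoted earlier in this thread) replaces the "minimum of both" rule with an error when both fields are set.

```python
def removal_time(now_ms, timeout_ms=None, end_timestamp_ms=None):
    """Return when a blocked item is removed, or None if it is permanent."""
    candidates = []
    if timeout_ms is not None:
        candidates.append(now_ms + timeout_ms)
    if end_timestamp_ms is not None:
        candidates.append(end_timestamp_ms)
    # Neither field set: the item is never removed.
    # Both set: the earlier of the two times wins.
    return min(candidates) if candidates else None


print(removal_time(1_000))
print(removal_time(1_000, timeout_ms=500))
print(removal_time(1_000, timeout_ms=500, end_timestamp_ms=1_200))
```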
>>> >>>>>> Best,
>>> >>>>>> Lijie
>>> >>>>>>
>>> >>>>>> On Tue, May 10, 2022 at 11:26, Becket Qin <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Hi Lijie,
>>> >>>>>>>
>>> >>>>>>> Thanks for updating the FLIP. It looks like the public interface
>>> >>>>>>> section does not fully reflect all the user-visible behavior and API.
>>> >>>>>>> Can you put everything that users may be aware of there? That would
>>> >>>>>>> include the REST API, metrics, configurations, public Java interfaces
>>> >>>>>>> or pluggables that users may see or implement themselves, as well as a
>>> >>>>>>> brief summary of the behavior of the public API.
>>> >>>>>>>
>>> >>>>>>> Besides that, I have a few questions:
>>> >>>>>>>
>>> >>>>>>> 1. According to the conversation in the discussion thread, it looks
>>> >>>>>>> like the BlockAction will have "MARK_BLOCKLISTED" and
>>> >>>>>>> "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you
>>> >>>>>>> add that to the public interface as well?
>>> >>>>>>>
>>> >>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a Throwable
>>> >>>>>>> and is not reflected in the REST API. Should that be included in the
>>> >>>>>>> query response? And should we change that field to be a String so
>>> >>>>>>> users may specify the cause via the REST API when they block some
>>> >>>>>>> nodes / TMs?
>>> >>>>>>>
>>> >>>>>>> 3. Would it be useful to allow users to have different timeouts for
>>> >>>>>>> different blocked items? So while there is a default timeout, users
>>> >>>>>>> can also override it via the REST API when they block an entity.
>>> >>>>>>>
>>> >>>>>>> 4. Regarding the ADD operation, if the specified item is already
>>> >>>>>>> there, will the block item be overridden? For example, if the user
>>> >>>>>>> wants to extend the timeout of a blocked item, can they just issue an
>>> >>>>>>> ADD command again?
>>> >>>>>>>
>>> >>>>>>> 5. I am not quite familiar with the details of this, but is there a
>>> >>>>>>> source of truth for the blocked list? I think it might be good to have
>>> >>>>>>> a single source of truth for the blocked list and just propagate that
>>> >>>>>>> list to different components to take the action of actually blocking
>>> >>>>>>> the resource.
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>
>>> >>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi everyone,
>>> >>>>>>>>
>>> >>>>>>>> Based on the discussion on the mailing list, I updated the FLIP doc.
>>> >>>>>>>> The changes include:
>>> >>>>>>>> 1. Changed the description in the motivation section to describe more
>>> >>>>>>>> clearly the problem this FLIP is trying to solve.
>>> >>>>>>>> 2. Only *manual* blocking is supported.
>>> >>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>>> >>>>>>>>
>>> >>>>>>>> Best,
>>> >>>>>>>> Lijie
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, May 7, 2022 at 19:25, Roman Boyko <[email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Hi Lijie!
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be helpful
>>> >>>>>>>>> here.” Can you explain it in detail? I don't fully understand that.
>>> >>>>>>>>> In my opinion, non-active and active are the same, and no special
>>> >>>>>>>>> treatment is required.*
>>> >>>>>>>>>
>>> >>>>>>>>> Sorry, this was a misunderstanding on my side. I thought we were
>>> >>>>>>>>> talking about the HA mode (and not about the Active and Standalone
>>> >>>>>>>>> ResourceManager). The original question was: how should we handle the
>>> >>>>>>>>> list of blacklisted nodes at the moment of a leader change? Should we
>>> >>>>>>>>> simply forget about them or try to pre-save that list in remote
>>> >>>>>>>>> storage?
>>> >>>>>>>>>
>>> >>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <[email protected]> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I just overlooked "MARK_BLOCKLISTED". At the task level, this is
>>> >>>>>>>>>> indeed functionality that external tools (e.g. kubectl taint) cannot
>>> >>>>>>>>>> support.
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> Best,
>>> >>>>>>>>>> Yang
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Fri, May 6, 2022 at 22:18, Lijie Wang <[email protected]> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> @Jiangang
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy and mark a
>>> >>>>>>>>>>>> node blocked?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> In fact, we currently plan not to support auto-detection in this
>>> >>>>>>>>>>> FLIP. The part about auto-detection may be continued in a separate
>>> >>>>>>>>>>> FLIP in the future. Some people have the same concerns as you, and
>>> >>>>>>>>>>> the correctness and necessity of auto-detection may require further
>>> >>>>>>>>>>> discussion in the future.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> In session mode, multi jobs can fail on the same bad node and the
>>> >>>>>>>>>>>> node should be marked blocked.
>>> >>>>>>>>>>> By design, the blocklist information will be shared among all jobs
>>> >>>>>>>>>>> in a cluster/session. The JM will sync blocklist information with
>>> >>>>>>>>>>> the RM.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> @Martijn
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> I agree with Yang Wang on this.
>>> >>>>>>>>>>> As Zhu Zhu and I mentioned above, we think that MARK_BLOCKLISTED
>>> >>>>>>>>>>> (which just limits the load of the node and does not kill all the
>>> >>>>>>>>>>> processes on it) is also important, and we think that external
>>> >>>>>>>>>>> systems (*yarn rmadmin or kubectl taint*) cannot support it. So we
>>> >>>>>>>>>>> think it makes sense even if only *manual* blocking is supported.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are indeed super
>>> >>>>>>>>>>>> hard to get right.
>>> >>>>>>>>>>> Yes, as you can see, Jiangang (and a few others) have the same
>>> >>>>>>>>>>> concern. However, we currently plan not to support auto-detection in
>>> >>>>>>>>>>> this FLIP, only *manual* blocking. In addition, I'd like to say that
>>> >>>>>>>>>>> the FLIP provides a mechanism to support MARK_BLOCKLISTED and
>>> >>>>>>>>>>> MARK_BLOCKLISTED_AND_EVACUATE_TASKS, so the auto-detection may be
>>> >>>>>>>>>>> done by external systems.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best,
>>> >>>>>>>>>>> Lijie
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Fri, May 6, 2022 at 19:04, Martijn Visser <[email protected]> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>>> If we only support to block nodes manually, then I could not see
>>> >>>>>>>>>>>>> the obvious advantages compared with current SRE's approach (via
>>> >>>>>>>>>>>>> *yarn rmadmin or kubectl taint*).
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I agree with Yang Wang on this.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>> To me this sounds yet again like one of those magical mechanisms
>>> >>>>>>>>>>>>> that will rarely work just right.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are indeed super
>>> >>>>>>>>>>>> hard to get right.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Best regards,
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Martijn
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <[email protected]> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>> Thanks for the valuable design. Auto-detection can save us a great
>>> >>>>>>>>>>>>> deal of work. We have implemented a similar feature in our internal
>>> >>>>>>>>>>>>> Flink version.
>>> >>>>>>>>>>>>> Below are the things that I care about:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>   1. For auto-detection, I wonder how to define the strategy and mark
>>> >>>>>>>>>>>>>   a node as blocked? Sometimes the node to block is hard to detect:
>>> >>>>>>>>>>>>>   for example, the upstream node or the downstream node will be
>>> >>>>>>>>>>>>>   blocked when the network is unreachable.
>>> >>>>>>>>>>>>>   2. I see that the strategy is decided on the JobMaster side. How
>>> >>>>>>>>>>>>>   about implementing similar logic in the resource manager? In
>>> >>>>>>>>>>>>>   session mode, multiple jobs can fail on the same bad node and the
>>> >>>>>>>>>>>>>   node should be marked as blocked. If each job decides the strategy
>>> >>>>>>>>>>>>>   on its own, the node may not be marked as blocked if the number of
>>> >>>>>>>>>>>>>   failures does not exceed the threshold.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> On Thu, May 5, 2022 at 23:35, Zhu Zhu <[email protected]> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Thank you for all your feedback!
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share some of my
>>> >>>>>>>>>>>>>> thoughts:
>>> >>>>>>>>>>>>>> 1. Whether to enable automatic blocklisting
>>> >>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224.
>>> >>>>>>>>>>>>>> The automatic way should be something built upon the blocklist
>>> >>>>>>>>>>>>>> mechanism and well decoupled from it. It was designed to be a
>>> >>>>>>>>>>>>>> configurable blocklist strategy, but I think we can further decouple
>>> >>>>>>>>>>>>>> it by introducing an abnormal node detector, as Becket suggested,
>>> >>>>>>>>>>>>>> which just uses the blocklist mechanism once bad nodes are detected.
>>> >>>>>>>>>>>>>> However, it should be a separate FLIP with further dev discussions
>>> >>>>>>>>>>>>>> and feedback from users. I also agree with Becket that different
>>> >>>>>>>>>>>>>> users have different requirements, and we should listen to them.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes externally
>>> >>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a way to avoid
>>> >>>>>>>>>>>>>> deploying tasks to temporarily hot nodes. In this case, users may
>>> >>>>>>>>>>>>>> just want to limit the load of the node and do not want to kill all
>>> >>>>>>>>>>>>>> the processes on it. Another case is speculative execution [1],
>>> >>>>>>>>>>>>>> which may also leverage this feature to avoid starting mirror tasks
>>> >>>>>>>>>>>>>> on slow nodes.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>> Zhu
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> [1]
>>> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On Thu, May 5, 2022 at 15:56, Lijie Wang <[email protected]> wrote:
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Hi everyone,
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Thanks for your feedback.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize here because it can
>>> >>>>>>>>>>>>>>> affect the value and design of the blocklist mechanism (perhaps I
>>> >>>>>>>>>>>>>>> should highlight it in the FLIP). We propose two actions in the FLIP:
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked.
>>> >>>>>>>>>>>>>>> Future slots should not be allocated from the blocked task manager or
>>> >>>>>>>>>>>>>>> node, but slots that are already allocated will not be affected. A
>>> >>>>>>>>>>>>>>> typical application scenario is mitigating machine hotspots. In this
>>> >>>>>>>>>>>>>>> case, we hope that subsequent resource allocations will not land on
>>> >>>>>>>>>>>>>>> the hot machine, but tasks currently running on it should not be
>>> >>>>>>>>>>>>>>> affected.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node
>>> >>>>>>>>>>>>>>> as blocked, and evacuate all tasks on it. Evacuated tasks will be
>>> >>>>>>>>>>>>>>> restarted on non-blocked task managers.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Of these 2 actions, the former may better highlight the value of this
>>> >>>>>>>>>>>>>>> FLIP, because the external system cannot do that.
>>> >>>>>>>>>>>>>>>
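The difference between the two actions can be sketched with a toy model. Only the two action names come from the thread; the Cluster class, its fields and the "first non-blocked task manager" placement rule are hypothetical simplifications and say nothing about how Flink's scheduler actually picks slots.

```python
from enum import Enum


class BlockAction(Enum):
    MARK_BLOCKLISTED = "MARK_BLOCKLISTED"
    MARK_BLOCKLISTED_AND_EVACUATE_TASKS = "MARK_BLOCKLISTED_AND_EVACUATE_TASKS"


class Cluster:
    def __init__(self, task_managers):
        # Maps task manager ID -> list of tasks currently running on it.
        self.tasks = {tm: [] for tm in task_managers}
        self.blocked = set()

    def allocate(self, task):
        """Place a task on the first task manager that is not blocked."""
        for tm in self.tasks:
            if tm not in self.blocked:
                self.tasks[tm].append(task)
                return tm
        raise RuntimeError("no non-blocked task manager available")

    def block(self, tm, action):
        """Block a task manager; optionally move its running tasks elsewhere."""
        self.blocked.add(tm)
        if action is BlockAction.MARK_BLOCKLISTED_AND_EVACUATE_TASKS:
            evacuated, self.tasks[tm] = self.tasks[tm], []
            for task in evacuated:
                self.allocate(task)


cluster = Cluster(["tm-1", "tm-2"])
cluster.allocate("a")  # lands on tm-1
cluster.block("tm-1", BlockAction.MARK_BLOCKLISTED)
cluster.allocate("b")  # tm-1 is skipped; "a" keeps running there
print(cluster.tasks)
cluster.block("tm-1", BlockAction.MARK_BLOCKLISTED_AND_EVACUATE_TASKS)
print(cluster.tasks)  # "a" has been moved off tm-1
```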
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I basically agree with
>>> >>>>>>>>>>>>>>> @Becket Qin: different users have different answers. Not all users’
>>> >>>>>>>>>>>>>>> deployment environments have a special external system that can
>>> >>>>>>>>>>>>>>> perform the anomaly detection. In addition, adding pluggable/optional
>>> >>>>>>>>>>>>>>> auto-detection doesn't require much extra work on top of manual
>>> >>>>>>>>>>>>>>> specification.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> I will answer your other questions one by one.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Yangze
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> a) I think you are right, we do not need to expose the
>>> >>>>>>>>>>>>>>> `cluster.resource-blocklist.item.timeout-check-interval` option to
>>> >>>>>>>>>>>>>>> users.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> b) We can abstract `notifyException` into a separate interface (maybe
>>> >>>>>>>>>>>>>>> BlocklistExceptionListener), and the ResourceManagerBlocklistHandler
>>> >>>>>>>>>>>>>>> can implement it in the future.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Martijn
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> a) I also think the manual blocking should be done by cluster
>>> >>>>>>>>>>>>>>> operators.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> b) I think manual blocking makes sense because, in my experience,
>>> >>>>>>>>>>>>>>> users are often the first to notice machine problems (because of job
>>> >>>>>>>>>>>>>>> failover or delay), and they will contact the cluster operators to
>>> >>>>>>>>>>>>>>> solve it, or even tell the cluster operators which machine is
>>> >>>>>>>>>>>>>>> problematic. From this point of view, the people who really need
>>> >>>>>>>>>>>>>>> manual blocking are the users; it is just performed by the cluster
>>> >>>>>>>>>>>>>>> operator.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Chesnay
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool because, for
>>> >>>>>>>>>>>>>>> MARK_BLOCKLISTED, we need to know whether the slot is blocklisted
>>> >>>>>>>>>>>>>>> when the task is FINISHED/CANCELLED/FAILED. If so, the SlotPool
>>> >>>>>>>>>>>>>>> should release the slot directly to avoid assigning other tasks (of
>>> >>>>>>>>>>>>>>> this job) to it. If we only maintain the blocklist information on the
>>> >>>>>>>>>>>>>>> RM, the JM needs to retrieve it by RPC. I think the performance
>>> >>>>>>>>>>>>>>> overhead of that is relatively large, so I think it's worth
>>> >>>>>>>>>>>>>>> maintaining the blocklist information on the JM side and syncing it.
>>> >>>>>>>>>>>>>>>
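The trade-off described above (a local lookup on the JM instead of an RPC to the RM for every finished task) can be sketched roughly as follows. The class and method names are hypothetical and only illustrate the idea of a JM-side copy that is synced from the RM; they are not the handler classes proposed in the FLIP.

```python
class JobMasterBlocklist:
    """Local copy of the blocklist kept on the JM side (hypothetical class)."""

    def __init__(self):
        self.blocked_tms = set()

    def sync_from_rm(self, blocked_tms):
        """Replace the local copy with the set pushed by the RM."""
        self.blocked_tms = set(blocked_tms)

    def on_task_terminated(self, slot_tm_id):
        """Decide what to do with a slot whose task just reached a terminal state."""
        # Local lookup only; no RPC to the RM is needed here.
        return "release" if slot_tm_id in self.blocked_tms else "reuse"


jm_blocklist = JobMasterBlocklist()
jm_blocklist.sync_from_rm({"tm-1"})
print(jm_blocklist.on_task_terminated("tm-1"))
print(jm_blocklist.on_task_terminated("tm-2"))
```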
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Роман
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> a) “Probably storing inside Zookeeper/Configmap might be helpful
>>> >>>>>>>>>>>>>>> here.” Can you explain it in detail? I don't fully understand that.
>>> >>>>>>>>>>>>>>> In my opinion, non-active and active are the same, and no special
>>> >>>>>>>>>>>>>>> treatment is required.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes sense. I will add it to
>>> >>>>>>>>>>>>>>> the FLIP.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> @Yang
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> As mentioned above, AFAIK, the external system cannot support the
>>> >>>>>>>>>>>>>>> MARK_BLOCKLISTED action.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Looking forward to your further feedback.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Lijie
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> On Tue, May 3, 2022 at 21:09, Yang Wang <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster operations.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> In the production environment, the SRE (aka Site Reliability
>>> >>>>>>>>>>>>>>>> Engineer) already has many tools to detect unstable nodes, which can
>>> >>>>>>>>>>>>>>>> take the system logs/metrics into consideration.
>>> >>>>>>>>>>>>>>>> They then use graceful decommission in YARN and taints in K8s to
>>> >>>>>>>>>>>>>>>> prevent new allocations on these unstable nodes.
>>> >>>>>>>>>>>>>>>> Finally, they evict all the containers and pods running on these
>>> >>>>>>>>>>>>>>>> nodes.
>>> >>>>>>>>>>>>>>>> This mechanism also works for planned maintenance. So I am afraid
>>> >>>>>>>>>>>>>>>> this is not the typical use case for FLIP-224.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> If we only support blocking nodes manually, then I cannot see any
>>> >>>>>>>>>>>>>>>> obvious advantage over the current SRE approach (via *yarn rmadmin
>>> >>>>>>>>>>>>>>>> or kubectl taint*).
>>> >>>>>>>>>>>>>>>> At the least, we need to have a pluggable component which could
>>> >>>>>>>>>>>>>>>> expose the potentially unstable nodes automatically and block them
>>> >>>>>>>>>>>>>>>> if enabled explicitly.
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>> Yang
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 16:36, Becket Qin <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> This is an interesting feature and discussion, and somewhat related
>>> >>>>>>>>>>>>>>>>> to the design principle about how people should operate Flink.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> I think there are three things involved in this FLIP.
>>> >>>>>>>>>>>>>>>>>     a) Detect and report the unstable node.
>>> >>>>>>>>>>>>>>>>>     b) Collect the information of the unstable node and form a
>>> >>>>>>>>>>>>>>>>>     blocklist.
>>> >>>>>>>>>>>>>>>>>     c) Take the action to block nodes.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> My two cents:
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink should have c). It is
>>> >>>>>>>>>>>>>>>>> not only useful for cases of node failures, but also handy for some
>>> >>>>>>>>>>>>>>>>> planned maintenance.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e. who should be the
>>> >>>>>>>>>>>>>>>>> brain to make the decision to block a node. I think this largely
>>> >>>>>>>>>>>>>>>>> depends on who we talk to. Different users would probably give
>>> >>>>>>>>>>>>>>>>> different answers. For people who do have a centralized node health
>>> >>>>>>>>>>>>>>>>> management service, letting Flink do just a) and c) would be
>>> >>>>>>>>>>>>>>>>> preferred. So essentially Flink would be one of the sources that
>>> >>>>>>>>>>>>>>>>> may detect unstable nodes, report them to that service, and then
>>> >>>>>>>>>>>>>>>>> take the command from that service to block the problematic nodes.
>>> >>>>>>>>>>>>>>>>> On the other hand, for users who do not have such a service, simply
>>> >>>>>>>>>>>>>>>>> letting Flink be clever by itself to block the suspicious nodes
>>> >>>>>>>>>>>>>>>>> might be desired to ensure the jobs are running smoothly.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> So that indicates a) and b) here should be pluggable / optional.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to have something
>>> >>>>>>>>>>>>>>>>> pluggable like an UnstableNodeReporter which exposes unstable nodes
>>> >>>>>>>>>>>>>>>>> actively. (A more general interface would be JobInfoReporter<T>,
>>> >>>>>>>>>>>>>>>>> which can be used to report any information of type <T>. But I'll
>>> >>>>>>>>>>>>>>>>> just keep the scope relevant to this FLIP here.) Personally
>>> >>>>>>>>>>>>>>>>> speaking, I think it is OK to have a default implementation of a
>>> >>>>>>>>>>>>>>>>> reporter which just tells Flink to take action to block problematic
>>> >>>>>>>>>>>>>>>>> nodes and also unblocks them after a timeout.
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
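The pluggable reporter idea above can be sketched roughly as follows. This is only an illustration in Python, not code from the FLIP: the name `UnstableNodeReporter` comes from the message, while `TimeoutReporter`, the method names, and the use of plain seconds for time are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Set


class UnstableNodeReporter(ABC):
    """Hypothetical plug-in point that exposes the nodes considered unstable."""

    @abstractmethod
    def unstable_nodes(self, now: float) -> Set[str]:
        """Return the ids of the nodes that should currently be blocked."""


class TimeoutReporter(UnstableNodeReporter):
    """Assumed default behaviour: block a reported node, unblock it after a timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self._timeout = timeout_seconds
        self._reported_at: Dict[str, float] = {}

    def report(self, node_id: str, now: float) -> None:
        # Record (or refresh) the time at which the node was flagged.
        self._reported_at[node_id] = now

    def unstable_nodes(self, now: float) -> Set[str]:
        # Drop entries whose timeout has elapsed, then return what is left.
        self._reported_at = {
            node: reported
            for node, reported in self._reported_at.items()
            if now - reported < self._timeout
        }
        return set(self._reported_at)
```

An external detection service would be a second implementation of the same interface, which is what makes steps a) and b) optional in this picture.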
>>> >>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> Thanks for the good initiative, Lijie and Zhu!
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> If it's possible, I'd like to participate in development.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> I agree with the 3rd point of Konstantin's reply - we should consider
>>> >>>>>>>>>>>>>>>>>> somehow moving the information about blocklisted nodes/TMs from the
>>> >>>>>>>>>>>>>>>>>> active ResourceManager to the non-active ones. Storing it inside
>>> >>>>>>>>>>>>>>>>>> Zookeeper/Configmap might be helpful here.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of organizations don't want to
>>> >>>>>>>>>>>>>>>>>> expose such an API to a cluster user group. But I think it's necessary
>>> >>>>>>>>>>>>>>>>>> to have a mechanism for unblocking the nodes/TMs anyway, to avoid
>>> >>>>>>>>>>>>>>>>>> incorrect automatic behaviour.
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>> One more small suggestion - I think it would be better to extend the
>>> >>>>>>>>>>>>>>>>>> *BlocklistedItem* class with an *endTimestamp* field and fill it at
>>> >>>>>>>>>>>>>>>>>> item creation. This simple addition would:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>   - Let users set the exact end time of a blocklist item through the
>>> >>>>>>>>>>>>>>>>>>     REST API
>>> >>>>>>>>>>>>>>>>>>   - Avoid being tied to a single value of
>>> >>>>>>>>>>>>>>>>>>     *cluster.resource-blacklist.item.timeout*
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
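A minimal sketch of the *endTimestamp* suggestion above, written in Python for brevity rather than as Flink code. Only the names `BlocklistedItem` and `endTimestamp` come from the message; the `start_timestamp` field, the `create` helper, and the `DEFAULT_TIMEOUT_MS` constant (standing in for the cluster-wide item timeout) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the cluster-wide item timeout; the value is purely illustrative.
DEFAULT_TIMEOUT_MS = 60_000


@dataclass(frozen=True)
class BlocklistedItem:
    identifier: str
    start_timestamp: int  # creation time in milliseconds (assumed field)
    end_timestamp: int    # time at which the item stops blocking

    @staticmethod
    def create(identifier: str, now_ms: int,
               end_timestamp: Optional[int] = None) -> "BlocklistedItem":
        # Fill the end time at creation: either the caller's exact value
        # or "now + default timeout" when none is given.
        if end_timestamp is None:
            end_timestamp = now_ms + DEFAULT_TIMEOUT_MS
        return BlocklistedItem(identifier, now_ms, end_timestamp)

    def is_expired(self, now_ms: int) -> bool:
        return now_ms >= self.end_timestamp
```

Because each item carries its own end time, two items created at the same moment can expire at different times, which is the flexibility the suggestion is after.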
>>> >>>>>>>>>>>>>>>>>> On Mon, 2 May 2022 at 14:17, Chesnay Schepler <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> I do share the concern between blurring the lines a bit.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> That said, I'd prefer to not have any auto-detection and only have an
>>> >>>>>>>>>>>>>>>>>>> opt-in mechanism to manually block processes/nodes. To me this sounds
>>> >>>>>>>>>>>>>>>>>>> yet again like one of those magical mechanisms that will rarely work
>>> >>>>>>>>>>>>>>>>>>> just right. An external system can leverage way more information
>>> >>>>>>>>>>>>>>>>>>> after all.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> Moreover, I'm quite concerned about the complexity of this proposal.
>>> >>>>>>>>>>>>>>>>>>> Tracking on both the RM/JM side; syncing between components;
>>> >>>>>>>>>>>>>>>>>>> adjustments to the slot and resource protocol.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> In a way it seems overly complicated.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> If we look at it purely from an active resource management
>>> >>>>>>>>>>>>>>>>>>> perspective, then there isn't really a need to touch the slot
>>> >>>>>>>>>>>>>>>>>>> protocol at all (or in fact to anything in the JobMaster), because
>>> >>>>>>>>>>>>>>>>>>> there isn't any point in keeping around blocked TMs in the first
>>> >>>>>>>>>>>>>>>>>>> place. They'd just be idling, potentially shutting down after a while
>>> >>>>>>>>>>>>>>>>>>> by the RM because of it (unless we _also_ touch that logic). Here the
>>> >>>>>>>>>>>>>>>>>>> blocking of a process (be it by blocking the process or node) is
>>> >>>>>>>>>>>>>>>>>>> equivalent with shutting down the blocked process(es). Once the block
>>> >>>>>>>>>>>>>>>>>>> is lifted we can just spin it back up.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> And I do wonder whether we couldn't apply the same line of thinking
>>> >>>>>>>>>>>>>>>>>>> to standalone resource management. Here being able to stop/restart a
>>> >>>>>>>>>>>>>>>>>>> process/node manually should be a core requirement for a Flink
>>> >>>>>>>>>>>>>>>>>>> deployment anyway.
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>> On 02/05/2022 08:49, Martijn Visser wrote:
>>> >>>>>>>>>>>>>>>>>>>> Hi everyone,
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Thanks for creating this FLIP. I can understand the problem and I
>>> >>>>>>>>>>>>>>>>>>>> see value in the automatic detection and blocklisting. I do have
>>> >>>>>>>>>>>>>>>>>>>> some concerns with the ability to manually specify the resources to
>>> >>>>>>>>>>>>>>>>>>>> be blocked. I have two concerns:
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> * Most organizations explicitly have a separation of concerns,
>>> >>>>>>>>>>>>>>>>>>>> meaning that there's a group who's responsible for managing a
>>> >>>>>>>>>>>>>>>>>>>> cluster and there's a user group who uses that cluster. With the
>>> >>>>>>>>>>>>>>>>>>>> introduction of this mechanism, the latter group now can influence
>>> >>>>>>>>>>>>>>>>>>>> the responsibility of the first group. So it can be possible that
>>> >>>>>>>>>>>>>>>>>>>> someone from the user group blocks something, which causes an outage
>>> >>>>>>>>>>>>>>>>>>>> (which could result in the paging mechanism triggering etc.) which
>>> >>>>>>>>>>>>>>>>>>>> impacts the first group.
>>> >>>>>>>>>>>>>>>>>>>> * How big is the group of people who can go through the process of
>>> >>>>>>>>>>>>>>>>>>>> manually identifying a node that isn't behaving as it should? I do
>>> >>>>>>>>>>>>>>>>>>>> think this group is relatively limited. Does it then make sense to
>>> >>>>>>>>>>>>>>>>>>>> introduce such a feature, which would only be used by a really small
>>> >>>>>>>>>>>>>>>>>>>> user group of Flink? We still have to maintain, test and support
>>> >>>>>>>>>>>>>>>>>>>> such a feature.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> I'm +1 for the autodetection features, but I'm leaning towards not
>>> >>>>>>>>>>>>>>>>>>>> exposing this to the user group but having this available strictly
>>> >>>>>>>>>>>>>>>>>>>> for cluster operators. They could then also set up their
>>> >>>>>>>>>>>>>>>>>>>> paging/metrics/logging system to take this into account.
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Best regards,
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> Martijn Visser
>>> >>>>>>>>>>>>>>>>>>>> https://twitter.com/MartijnVisser82
>>> >>>>>>>>>>>>>>>>>>>> https://github.com/MartijnVisser
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Thanks for driving this, Zhu and Lijie.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> +1 for the overall proposal. Just sharing my two cents here:
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> - Why do we need to expose
>>> >>>>>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout-check-interval` to the
>>> >>>>>>>>>>>>>>>>>>>>> user? I think the semantics of
>>> >>>>>>>>>>>>>>>>>>>>> `cluster.resource-blacklist.item.timeout` are sufficient for the
>>> >>>>>>>>>>>>>>>>>>>>> user. How the timeout is guaranteed is part of Flink's internal
>>> >>>>>>>>>>>>>>>>>>>>> implementation. I think it would be very confusing, and we do not
>>> >>>>>>>>>>>>>>>>>>>>> need to expose it to users.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> - ResourceManager can notify the exception of a task manager to
>>> >>>>>>>>>>>>>>>>>>>>> `BlacklistHandler` as well. For example, the slot allocation might
>>> >>>>>>>>>>>>>>>>>>>>> fail in case the target task manager is busy or has network jitter.
>>> >>>>>>>>>>>>>>>>>>>>> I don't mean we need to cover this case in this version, but we can
>>> >>>>>>>>>>>>>>>>>>>>> also open a `notifyException` in `ResourceManagerBlacklistHandler`.
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> - Before we sync the blocklist to ResourceManager, will the slots of
>>> >>>>>>>>>>>>>>>>>>>>> a blocked task manager continue to be released and allocated?
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>>>>>>> Yangze Guo
>>> >>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>>>>>>> Hi Konstantin,
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Thanks for your feedback. I will respond to your 4 remarks:
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 1) Thanks for reminding me of the controversy. I think “BlockList”
>>> >>>>>>>>>>>>>>>>>>>>>> is good enough, and I will change it in the FLIP.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 2) Your suggestion for the REST API is a good idea. Based on the
>>> >>>>>>>>>>>>>>>>>>>>>> above, I would change the REST API as follows:
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/nodes
>>> >>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/blocklist/taskmanagers
>>> >>>>>>>>>>>>>>>>>>>>>> DELETE <host>/blocklist/node/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>> DELETE <host>/blocklist/taskmanager/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 3) If a node is blocklisted, it means that all task managers on
>>> >>>>>>>>>>>>>>>>>>>>>> this node are blocklisted. All slots on these TMs are not
>>> >>>>>>>>>>>>>>>>>>>>>> available. This is actually a bit like a TM loss, but these TMs are
>>> >>>>>>>>>>>>>>>>>>>>>> not really lost: they are in an unavailable status, and they are
>>> >>>>>>>>>>>>>>>>>>>>>> still registered in this Flink cluster. They will be available
>>> >>>>>>>>>>>>>>>>>>>>>> again once the corresponding blocklist item is removed. This
>>> >>>>>>>>>>>>>>>>>>>>>> behavior is the same in active/non-active clusters. However, in
>>> >>>>>>>>>>>>>>>>>>>>>> active clusters, these TMs may be released due to idle timeouts.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> 4) For the item timeout, I prefer to keep it. The reasons are as
>>> >>>>>>>>>>>>>>>>>>>>>> follows:
>>> >>>>>>>>>>>>>>>>>>>>>> a) The timeout will not affect users adding or removing items via
>>> >>>>>>>>>>>>>>>>>>>>>> the REST API, and users can disable it by configuring it to
>>> >>>>>>>>>>>>>>>>>>>>>> Long.MAX_VALUE.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> b) Some node problems can recover after a period of time (such as
>>> >>>>>>>>>>>>>>>>>>>>>> machine hotspots), in which case users may prefer that Flink
>>> >>>>>>>>>>>>>>>>>>>>>> unblocks the node automatically instead of requiring the user to do
>>> >>>>>>>>>>>>>>>>>>>>>> it manually.
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>> Lijie
>>> >>>>>>>>>>>>>>>>>>>>>>
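The endpoint layout proposed in point 2) above could be exercised with a small helper like the one below. It only builds `urllib.request.Request` objects and sends nothing. The paths follow the list in the message (plural collection for POST/GET, singular resource for DELETE); the helper name `blocklist_request`, the JSON body with an `id` field, and the example host are assumptions, since this message does not define a request body.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

# Collection names as listed in the message; DELETE uses the singular form.
COLLECTION = {"node": "nodes", "taskmanager": "taskmanagers"}


def blocklist_request(host: str, method: str, kind: str,
                      identifier: Optional[str] = None) -> Request:
    """Build (but do not send) a request against the proposed blocklist API."""
    if kind not in COLLECTION:
        raise ValueError(f"unknown resource kind: {kind!r}")
    if method == "GET":
        return Request(f"{host}/blocklist/{COLLECTION[kind]}", method="GET")
    if identifier is None:
        raise ValueError(f"{method} needs an identifier")
    if method == "DELETE":
        # Remove one item: DELETE <host>/blocklist/<kind>/<identifier>
        return Request(f"{host}/blocklist/{kind}/{quote(identifier, safe='')}",
                       method="DELETE")
    if method == "POST":
        # Add one item; the body shape here is a placeholder, not the FLIP's.
        body = json.dumps({"id": identifier}).encode("utf-8")
        return Request(f"{host}/blocklist/{COLLECTION[kind]}", data=body,
                       headers={"Content-Type": "application/json"},
                       method="POST")
    raise ValueError(f"unsupported method: {method!r}")
```

For example, `blocklist_request("http://localhost:8081", "DELETE", "node", "node-1")` targets `http://localhost:8081/blocklist/node/node-1`; passing the result to `urllib.request.urlopen` would actually send it.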
>>> >>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 27, 2022 at 19:23, Konstantin Knauf <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Hi Lijie,
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> I think this makes sense, and +1 to only supporting manually
>>> >>>>>>>>>>>>>>>>>>>>>>> blocking taskmanagers and nodes. Maybe the different strategies
>>> >>>>>>>>>>>>>>>>>>>>>>> can also be maintained outside of Apache Flink.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> A few remarks:
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> 1) Can we use another term than "bla.cklist" due to the
>>> >>>>>>>>>>>>>>>>>>>>>>> controversy around the term? [1] There was also a Jira ticket
>>> >>>>>>>>>>>>>>>>>>>>>>> about this topic a while back, and there was generally a consensus
>>> >>>>>>>>>>>>>>>>>>>>>>> to avoid the terms blacklist & whitelist [2]. We could use
>>> >>>>>>>>>>>>>>>>>>>>>>> "blocklist", "denylist" or "quarantined".
>>> >>>>>>>>>>>>>>>>>>>>>>> 2) For the REST API, I'd prefer a slightly different design, as
>>> >>>>>>>>>>>>>>>>>>>>>>> verbs like add/remove are often considered an anti-pattern for
>>> >>>>>>>>>>>>>>>>>>>>>>> REST APIs. POST on a list item is generally the standard to add
>>> >>>>>>>>>>>>>>>>>>>>>>> items. DELETE on the individual resource is standard to remove an
>>> >>>>>>>>>>>>>>>>>>>>>>> item.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> POST <host>/quarantine/items
>>> >>>>>>>>>>>>>>>>>>>>>>> DELETE <host>/quarantine/items/<itemidentifier>
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> We could also consider separating taskmanagers and nodes in the
>>> >>>>>>>>>>>>>>>>>>>>>>> REST API (and internal data structures). Any opinion on this?
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/nodes
>>> >>>>>>>>>>>>>>>>>>>>>>> POST/GET <host>/quarantine/taskmanager
>>> >>>>>>>>>>>>>>>>>>>>>>> DELETE <host>/quarantine/nodes/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>> DELETE <host>/quarantine/taskmanager/<identifier>
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> 3) How would blocking nodes behave with non-active resource
>>> >>>>>>>>>>>>>>>>>>>>>>> managers, i.e. standalone or reactive mode?
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> 4) To keep the implementation even more minimal, do we need the
>>> >>>>>>>>>>>>>>>>>>>>>>> timeout behavior? If items are added/removed manually, we could
>>> >>>>>>>>>>>>>>>>>>>>>>> delegate this to the user easily. In my opinion the timeout
>>> >>>>>>>>>>>>>>>>>>>>>>> behavior would better fit into specific strategies at a later
>>> >>>>>>>>>>>>>>>>>>>>>>> point.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Cheers and thank you,
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> Konstantin
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> [1] https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
>>> >>>>>>>>>>>>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-18209
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 27, 2022 at 04:04, Lijie Wang <[email protected]> wrote:
>>> >>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Flink job failures may happen due to cluster node issues
>>> >>>>>>>>>>>>>>>>>>>>>>>> (insufficient disk space, bad hardware, network abnormalities).
>>> >>>>>>>>>>>>>>>>>>>>>>>> Flink will take care of the failures and redeploy the tasks.
>>> >>>>>>>>>>>>>>>>>>>>>>>> However, due to data locality and limited resources, the new
>>> >>>>>>>>>>>>>>>>>>>>>>>> tasks are very likely to be redeployed to the same nodes, which
>>> >>>>>>>>>>>>>>>>>>>>>>>> will result in continuous task abnormalities and affect job
>>> >>>>>>>>>>>>>>>>>>>>>>>> progress.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Currently, Flink users need to manually identify the problematic
>>> >>>>>>>>>>>>>>>>>>>>>>>> node and take it offline to solve this problem. But this
>>> >>>>>>>>>>>>>>>>>>>>>>>> approach has the following disadvantages:
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> 1. Taking a node offline can be a heavy process. Users may need
>>> >>>>>>>>>>>>>>>>>>>>>>>> to contact cluster administrators to do this. The operation can
>>> >>>>>>>>>>>>>>>>>>>>>>>> even be dangerous and not allowed during some important business
>>> >>>>>>>>>>>>>>>>>>>>>>>> events.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> 2. Identifying and solving this kind of problem manually would
>>> >>>>>>>>>>>>>>>>>>>>>>>> be slow and a waste of human resources.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> To solve this problem, Zhu Zhu and I propose to introduce a
>>> >>>>>>>>>>>>>>>>>>>>>>>> blacklist mechanism for Flink to filter out problematic
>>> >>>>>>>>>>>>>>>>>>>>>>>> resources.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> You can find more details in FLIP-224 [1]. Looking forward to
>>> >>>>>>>>>>>>>>>>>>>>>>>> your feedback.
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>>>>>> Lijie
>>> >>>>>>>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> --
>>> >>>>>>>>> Best regards,
>>> >>>>>>>>> Roman Boyko
>>> >>>>>>>>> e.: [email protected]
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> https://twitter.com/snntrable
>>> >>>> https://github.com/knaufk
>>> >>>>
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> https://twitter.com/snntrable
>>> >> https://github.com/knaufk
>>> >>
>>>
>>>
