Hi Lijie!
*a) “Probably storing inside Zookeeper/Configmap might be helpful here.” Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.*

Sorry, this was a misunderstanding on my side. I thought we were talking about the HA mode (not about active vs. standalone ResourceManager). The original question was: how should the blocklisted nodes be handled at the moment of a leader change? Should we simply forget about them, or try to pre-save that list on remote storage? (A rough sketch of what I mean is in a P.S. at the very bottom of this mail, below the quoted thread.)

On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey...@gmail.com> wrote: > Thanks Lijie and ZhuZhu for the explanation. > > I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed > some functionalities the external tools(e.g. kubectl taint) could not > support. > > > Best, > Yang > > Lijie Wang <wangdachui9...@gmail.com> 于2022年5月6日周五 22:18写道: > > > Thanks for your feedback, Jiangang and Martijn. > > > > @Jiangang > > > > > > > For auto-detecting, I wonder how to make the strategy and mark a node > > blocked? > > > > In fact, we currently plan to not support auto-detection in this FLIP. > The > > part about auto-detection may be continued in a separate FLIP in the > > future. Some guys have the same concerns as you, and the correctness and > > necessity of auto-detection may require further discussion in the future. > > > > > In session mode, multi jobs can fail on the same bad node and the node > > should be marked blocked. > > By design, the blocklist information will be shared among all jobs in a > > cluster/session. The JM will sync blocklist information with RM. > > > > @Martijn > > > > > I agree with Yang Wang on this. > > As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED(Just > limits > > the load of the node and does not kill all the processes on it) is also > > important, and we think that external systems (*yarn rmadmin or kubectl > > taint*) cannot support it. So we think it makes sense even only > *manually*. > > > > > I also agree with Chesnay that magical mechanisms are indeed super hard > > to get right. > > Yes, as you see, Jiangang(and a few others) have the same concern. > > However, we currently plan to not support auto-detection in this FLIP, > and > > only *manually*. In addition, I'd like to say that the FLIP provides a > > mechanism to support MARK_BLOCKLISTED and > > MARK_BLOCKLISTED_AND_EVACUATE_TASKS, > > the auto-detection may be done by external systems. > > > > Best, > > Lijie > > > > Martijn Visser <mart...@ververica.com> 于2022年5月6日周五 19:04写道: > > > > > > If we only support to block nodes manually, then I could not see > > > the obvious advantages compared with current SRE's approach(via *yarn > > > rmadmin or kubectl taint*). > > > > > > I agree with Yang Wang on this. > > > > > > > To me this sounds yet again like one of those magical mechanisms > that > > > will rarely work just right. > > > > > > I also agree with Chesnay that magical mechanisms are indeed super hard > > to > > > get right. > > > > > > Best regards, > > > > > > Martijn > > > > > > On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangp...@gmail.com> > > > wrote: > > > > > >> Thanks for the valuable design. The auto-detecting can decrease great > > work > > >> for us. We have implemented the similar feature in our inner flink > > >> version. > > >> Below is something that I care about: > > >> > > >> 1. For auto-detecting, I wonder how to make the strategy and mark a > > >> node > > >> blocked?
Sometimes the blocked node is hard to be detected, for > > >> example, > > >> the upper node or the down node will be blocked when network > > >> unreachable. > > >> 2. I see that the strategy is made in JobMaster side. How about > > >> implementing the similar logic in resource manager? In session > mode, > > >> multi > > >> jobs can fail on the same bad node and the node should be marked > > >> blocked. > > >> If the job makes the strategy, the node may be not marked blocked > if > > >> the > > >> fail times don't exceed the threshold. > > >> > > >> > > >> Zhu Zhu <reed...@gmail.com> 于2022年5月5日周四 23:35写道: > > >> > > >> > Thank you for all your feedback! > > >> > > > >> > Besides the answers from Lijie, I'd like to share some of my > thoughts: > > >> > 1. Whether to enable automatical blocklist > > >> > Generally speaking, it is not a goal of FLIP-224. > > >> > The automatical way should be something built upon the blocklist > > >> > mechanism and well decoupled. It was designed to be a configurable > > >> > blocklist strategy, but I think we can further decouple it by > > >> > introducing a abnormal node detector, as Becket suggested, which > just > > >> > uses the blocklist mechanism once bad nodes are detected. However, > it > > >> > should be a separate FLIP with further dev discussions and feedback > > >> > from users. I also agree with Becket that different users have > > different > > >> > requirements, and we should listen to them. > > >> > > > >> > 2. Is it enough to just take away abnormal nodes externally > > >> > My answer is no. As Lijie has mentioned, we need a way to avoid > > >> > deploying tasks to temporary hot nodes. In this case, users may just > > >> > want to limit the load of the node and do not want to kill all the > > >> > processes on it. Another case is the speculative execution[1] which > > >> > may also leverage this feature to avoid starting mirror tasks on > slow > > >> > nodes. > > >> > > > >> > Thanks, > > >> > Zhu > > >> > > > >> > [1] > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job > > >> > > > >> > Lijie Wang <wangdachui9...@gmail.com> 于2022年5月5日周四 15:56写道: > > >> > > > >> > > > > >> > > Hi everyone, > > >> > > > > >> > > > > >> > > Thanks for your feedback. > > >> > > > > >> > > > > >> > > There's one detail that I'd like to re-emphasize here because it > can > > >> > affect the value and design of the blocklist mechanism (perhaps I > > should > > >> > highlight it in the FLIP). We propose two actions in FLIP: > > >> > > > > >> > > 1) MARK_BLOCKLISTED: Just mark the task manager or node as > blocked. > > >> > Future slots should not be allocated from the blocked task manager > or > > >> node. > > >> > But slots that are already allocated will not be affected. A typical > > >> > application scenario is to mitigate machine hotspots. In this case, > we > > >> hope > > >> > that subsequent resource allocations will not be on the hot machine, > > but > > >> > tasks currently running on it should not be affected. > > >> > > > > >> > > 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or > > node > > >> as > > >> > blocked, and evacuate all tasks on it. Evacuated tasks will be > > >> restarted on > > >> > non-blocked task managers. > > >> > > > > >> > > For the above 2 actions, the former may more highlight the meaning > > of > > >> > this FLIP, because the external system cannot do that. 
> > >> > > > > >> > > > > >> > > Regarding *Manually* and *Automatically*, I basically agree with > > >> @Becket > > >> > Qin: different users have different answers. Not all users’ > deployment > > >> > environments have a special external system that can perform the > > anomaly > > >> > detection. In addition, adding pluggable/optional auto-detection > > doesn't > > >> > require much extra work on top of manual specification. > > >> > > > > >> > > > > >> > > I will answer your other questions one by one. > > >> > > > > >> > > > > >> > > @Yangze > > >> > > > > >> > > a) I think you are right, we do not need to expose the > > >> > `cluster.resource-blocklist.item.timeout-check-interval` to users. > > >> > > > > >> > > b) We can abstract the `notifyException` to a separate interface > > >> (maybe > > >> > BlocklistExceptionListener), and the ResourceManagerBlocklistHandler > > can > > >> > implement it in the future. > > >> > > > > >> > > > > >> > > @Martijn > > >> > > > > >> > > a) I also think the manual blocking should be done by cluster > > >> operators. > > >> > > > > >> > > b) I think manual blocking makes sense, because according to my > > >> > experience, users are often the first to perceive the machine > problems > > >> > (because of job failover or delay), and they will contact cluster > > >> operators > > >> > to solve it, or even tell the cluster operators which machine is > > >> > problematic. From this point of view, I think the people who really > > need > > >> > the manual blocking are the users, and it’s just performed by the > > >> cluster > > >> > operator, so I think the manual blocking makes sense. > > >> > > > > >> > > > > >> > > @Chesnay > > >> > > > > >> > > We need to touch the logic of JM/SlotPool, because for > > >> MARK_BLOCKLISTED > > >> > , we need to know whether the slot is blocklisted when the task is > > >> > FINISHED/CANCELLED/FAILED. If so, SlotPool should release the slot > > >> > directly to avoid assigning other tasks (of this job) on it. If we > > only > > >> > maintain the blocklist information on the RM, JM needs to retrieve > it > > by > > >> > RPC. I think the performance overhead of that is relatively large, > so > > I > > >> > think it's worth maintaining the blocklist information on the JM > side > > >> and > > >> > syncing them. > > >> > > > > >> > > > > >> > > @Роман > > >> > > > > >> > > a) “Probably storing inside Zookeeper/Configmap might be > helpful > > >> > here.” Can you explain it in detail? I don't fully understand that. > > In > > >> my > > >> > opinion, non-active and active are the same, and no special > treatment > > is > > >> > required. > > >> > > > > >> > > b) I agree with you, the `endTimestamp` makes sense, I will add it > > to > > >> > FLIP. > > >> > > > > >> > > > > >> > > @Yang > > >> > > > > >> > > As mentioned above, AFAK, the external system cannot support the > > >> > MARK_BLOCKLISTED action. > > >> > > > > >> > > > > >> > > Looking forward to your further feedback. > > >> > > > > >> > > > > >> > > Best, > > >> > > > > >> > > Lijie > > >> > > > > >> > > > > >> > > Yang Wang <danrtsey...@gmail.com> 于2022年5月3日周二 21:09写道: > > >> > >> > > >> > >> Thanks Lijie and Zhu for creating the proposal. > > >> > >> > > >> > >> I want to share some thoughts about Flink cluster operations. 
> > >> > >> > > >> > >> In the production environment, the SRE(aka Site Reliability > > Engineer) > > >> > >> already has many tools to detect the unstable nodes, which could > > take > > >> > the > > >> > >> system logs/metrics into consideration. > > >> > >> Then they use graceful-decomission in YARN and taint in K8s to > > >> prevent > > >> > new > > >> > >> allocations on these unstable nodes. > > >> > >> At last, they will evict all the containers and pods running on > > these > > >> > nodes. > > >> > >> This mechanism also works for planned maintenance. So I am afraid > > >> this > > >> > is > > >> > >> not the typical use case for FLIP-224. > > >> > >> > > >> > >> If we only support to block nodes manually, then I could not see > > >> > >> the obvious advantages compared with current SRE's approach(via > > *yarn > > >> > >> rmadmin or kubectl taint*). > > >> > >> At least, we need to have a pluggable component which could > expose > > >> the > > >> > >> potential unstable nodes automatically and block them if enabled > > >> > explicitly. > > >> > >> > > >> > >> > > >> > >> Best, > > >> > >> Yang > > >> > >> > > >> > >> > > >> > >> > > >> > >> Becket Qin <becket....@gmail.com> 于2022年5月2日周一 16:36写道: > > >> > >> > > >> > >> > Thanks for the proposal, Lijie. > > >> > >> > > > >> > >> > This is an interesting feature and discussion, and somewhat > > related > > >> > to the > > >> > >> > design principle about how people should operate Flink. > > >> > >> > > > >> > >> > I think there are three things involved in this FLIP. > > >> > >> > a) Detect and report the unstable node. > > >> > >> > b) Collect the information of the unstable node and form a > > >> > blocklist. > > >> > >> > c) Take the action to block nodes. > > >> > >> > > > >> > >> > My two cents: > > >> > >> > > > >> > >> > 1. It looks like people all agree that Flink should have c). It > > is > > >> > not only > > >> > >> > useful for cases of node failures, but also handy for some > > planned > > >> > >> > maintenance. > > >> > >> > > > >> > >> > 2. People have different opinions on b), i.e. who should be the > > >> brain > > >> > to > > >> > >> > make the decision to block a node. I think this largely depends > > on > > >> > who we > > >> > >> > talk to. Different users would probably give different answers. > > For > > >> > people > > >> > >> > who do have a centralized node health management service, let > > Flink > > >> > do just > > >> > >> > do a) and c) would be preferred. So essentially Flink would be > > one > > >> of > > >> > the > > >> > >> > sources that may detect unstable nodes, report it to that > > service, > > >> > and then > > >> > >> > take the command from that service to block the problematic > > nodes. > > >> On > > >> > the > > >> > >> > other hand, for users who do not have such a service, simply > > >> letting > > >> > Flink > > >> > >> > be clever by itself to block the suspicious nodes might be > > desired > > >> to > > >> > >> > ensure the jobs are running smoothly. > > >> > >> > > > >> > >> > So that indicates a) and b) here should be pluggable / > optional. > > >> > >> > > > >> > >> > In light of this, maybe it would make sense to have something > > >> > pluggable > > >> > >> > like a UnstableNodeReporter which exposes unstable nodes > > actively. > > >> (A > > >> > more > > >> > >> > general interface should be JobInfoReporter<T> which can be > used > > to > > >> > report > > >> > >> > any information of type <T>. 
But I'll just keep the scope > > relevant > > >> to > > >> > this > > >> > >> > FLIP here). Personally speaking, I think it is OK to have a > > default > > >> > >> > implementation of a reporter which just tells Flink to take > > action > > >> to > > >> > block > > >> > >> > problematic nodes and also unblocks them after timeout. > > >> > >> > > > >> > >> > Thanks, > > >> > >> > > > >> > >> > Jiangjie (Becket) Qin > > >> > >> > > > >> > >> > > > >> > >> > On Mon, May 2, 2022 at 3:27 PM Роман Бойко < > ro.v.bo...@gmail.com > > > > > >> > wrote: > > >> > >> > > > >> > >> > > Thanks for good initiative, Lijie and Zhu! > > >> > >> > > > > >> > >> > > If it's possible I'd like to participate in development. > > >> > >> > > > > >> > >> > > I agree with 3rd point of Konstantin's reply - we should > > consider > > >> > to move > > >> > >> > > somehow the information of blocklisted nodes/TMs from active > > >> > >> > > ResourceManager to non-active ones. Probably storing inside > > >> > >> > > Zookeeper/Configmap might be helpful here. > > >> > >> > > > > >> > >> > > And I agree with Martijn that a lot of organizations don't > want > > >> to > > >> > expose > > >> > >> > > such API for a cluster user group. But I think it's necessary > > to > > >> > have the > > >> > >> > > mechanism for unblocking the nodes/TMs anyway for avoiding > > >> incorrect > > >> > >> > > automatic behaviour. > > >> > >> > > > > >> > >> > > And another one small suggestion - I think it would be better > > to > > >> > extend > > >> > >> > the > > >> > >> > > *BlocklistedItem* class with the *endTimestamp* field and > fill > > it > > >> > at the > > >> > >> > > item creation. This simple addition will allow to: > > >> > >> > > > > >> > >> > > - > > >> > >> > > > > >> > >> > > Provide the ability to users to setup the exact time of > > >> > blocklist end > > >> > >> > > through RestAPI > > >> > >> > > - > > >> > >> > > > > >> > >> > > Not being tied to a single value of > > >> > >> > > *cluster.resource-blacklist.item.timeout* > > >> > >> > > > > >> > >> > > > > >> > >> > > On Mon, 2 May 2022 at 14:17, Chesnay Schepler < > > >> ches...@apache.org> > > >> > >> > wrote: > > >> > >> > > > > >> > >> > > > I do share the concern between blurring the lines a bit. > > >> > >> > > > > > >> > >> > > > That said, I'd prefer to not have any auto-detection and > only > > >> > have an > > >> > >> > > > opt-in mechanism > > >> > >> > > > to manually block processes/nodes. To me this sounds yet > > again > > >> > like one > > >> > >> > > > of those > > >> > >> > > > magical mechanisms that will rarely work just right. > > >> > >> > > > An external system can leverage way more information after > > all. > > >> > >> > > > > > >> > >> > > > Moreover, I'm quite concerned about the complexity of this > > >> > proposal. > > >> > >> > > > Tracking on both the RM/JM side; syncing between > components; > > >> > >> > adjustments > > >> > >> > > > to the > > >> > >> > > > slot and resource protocol. > > >> > >> > > > > > >> > >> > > > In a way it seems overly complicated. > > >> > >> > > > > > >> > >> > > > If we look at it purely from an active resource management > > >> > perspective, > > >> > >> > > > then there > > >> > >> > > > isn't really a need to touch the slot protocol at all (or > in > > >> fact > > >> > to > > >> > >> > > > anything in the JobMaster), > > >> > >> > > > because there isn't any point in keeping around blocked TMs > > in > > >> the > > >> > >> > first > > >> > >> > > > place. 
> > >> > >> > > > They'd just be idling, potentially shutting down after a > > while > > >> by > > >> > the > > >> > >> > RM > > >> > >> > > > because of > > >> > >> > > > it (unless we _also_ touch that logic). > > >> > >> > > > Here the blocking of a process (be it by blocking the > process > > >> or > > >> > node) > > >> > >> > is > > >> > >> > > > equivalent with shutting down the blocked process(es). > > >> > >> > > > Once the block is lifted we can just spin it back up. > > >> > >> > > > > > >> > >> > > > And I do wonder whether we couldn't apply the same line of > > >> > thinking to > > >> > >> > > > standalone resource management. > > >> > >> > > > Here being able to stop/restart a process/node manually > > should > > >> be > > >> > a > > >> > >> > core > > >> > >> > > > requirement for a Flink deployment anyway. > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > On 02/05/2022 08:49, Martijn Visser wrote: > > >> > >> > > > > Hi everyone, > > >> > >> > > > > > > >> > >> > > > > Thanks for creating this FLIP. I can understand the > problem > > >> and > > >> > I see > > >> > >> > > > value > > >> > >> > > > > in the automatic detection and blocklisting. I do have > some > > >> > concerns > > >> > >> > > with > > >> > >> > > > > the ability to manually specify to be blocked resources. > I > > >> have > > >> > two > > >> > >> > > > > concerns; > > >> > >> > > > > > > >> > >> > > > > * Most organizations explicitly have a separation of > > >> concerns, > > >> > >> > meaning > > >> > >> > > > that > > >> > >> > > > > there's a group who's responsible for managing a cluster > > and > > >> > there's > > >> > >> > a > > >> > >> > > > user > > >> > >> > > > > group who uses that cluster. With the introduction of > this > > >> > mechanism, > > >> > >> > > the > > >> > >> > > > > latter group now can influence the responsibility of the > > >> first > > >> > group. > > >> > >> > > So > > >> > >> > > > it > > >> > >> > > > > can be possible that someone from the user group blocks > > >> > something, > > >> > >> > > which > > >> > >> > > > > causes an outage (which could result in paging mechanism > > >> > triggering > > >> > >> > > etc) > > >> > >> > > > > which impacts the first group. > > >> > >> > > > > * How big is the group of people who can go through the > > >> process > > >> > of > > >> > >> > > > manually > > >> > >> > > > > identifying a node that isn't behaving as it should be? I > > do > > >> > think > > >> > >> > this > > >> > >> > > > > group is relatively limited. Does it then make sense to > > >> > introduce > > >> > >> > such > > >> > >> > > a > > >> > >> > > > > feature, which would only be used by a really small user > > >> group > > >> > of > > >> > >> > > Flink? > > >> > >> > > > We > > >> > >> > > > > still have to maintain, test and support such a feature. > > >> > >> > > > > > > >> > >> > > > > I'm +1 for the autodetection features, but I'm leaning > > >> towards > > >> > not > > >> > >> > > > exposing > > >> > >> > > > > this to the user group but having this available strictly > > for > > >> > cluster > > >> > >> > > > > operators. They could then also set up their > > >> > paging/metrics/logging > > >> > >> > > > system > > >> > >> > > > > to take this into account. 
> > >> > >> > > > > > > >> > >> > > > > Best regards, > > >> > >> > > > > > > >> > >> > > > > Martijn Visser > > >> > >> > > > > https://twitter.com/MartijnVisser82 > > >> > >> > > > > https://github.com/MartijnVisser > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > On Fri, 29 Apr 2022 at 09:39, Yangze Guo < > > karma...@gmail.com > > >> > > > >> > wrote: > > >> > >> > > > > > > >> > >> > > > >> Thanks for driving this, Zhu and Lijie. > > >> > >> > > > >> > > >> > >> > > > >> +1 for the overall proposal. Just share some cents here: > > >> > >> > > > >> > > >> > >> > > > >> - Why do we need to expose > > >> > >> > > > >> cluster.resource-blacklist.item.timeout-check-interval > to > > >> the > > >> > user? > > >> > >> > > > >> I think the semantics of > > >> > `cluster.resource-blacklist.item.timeout` > > >> > >> > is > > >> > >> > > > >> sufficient for the user. How to guarantee the timeout > > >> > mechanism is > > >> > >> > > > >> Flink's internal implementation. I think it will be very > > >> > confusing > > >> > >> > and > > >> > >> > > > >> we do not need to expose it to users. > > >> > >> > > > >> > > >> > >> > > > >> - ResourceManager can notify the exception of a task > > >> manager to > > >> > >> > > > >> `BlacklistHandler` as well. > > >> > >> > > > >> For example, the slot allocation might fail in case the > > >> target > > >> > task > > >> > >> > > > >> manager is busy or has a network jitter. I don't mean we > > >> need > > >> > to > > >> > >> > cover > > >> > >> > > > >> this case in this version, but we can also open a > > >> > `notifyException` > > >> > >> > in > > >> > >> > > > >> `ResourceManagerBlacklistHandler`. > > >> > >> > > > >> > > >> > >> > > > >> - Before we sync the blocklist to ResourceManager, will > > the > > >> > slot of > > >> > >> > a > > >> > >> > > > >> blocked task manager continues to be released and > > allocated? > > >> > >> > > > >> > > >> > >> > > > >> Best, > > >> > >> > > > >> Yangze Guo > > >> > >> > > > >> > > >> > >> > > > >> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang < > > >> > >> > wangdachui9...@gmail.com> > > >> > >> > > > >> wrote: > > >> > >> > > > >>> Hi Konstantin, > > >> > >> > > > >>> > > >> > >> > > > >>> Thanks for your feedback. I will response your 4 > remarks: > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 1) Thanks for reminding me of the controversy. I think > > >> > “BlockList” > > >> > >> > is > > >> > >> > > > >> good > > >> > >> > > > >>> enough, and I will change it in FLIP. > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 2) Your suggestion for the REST API is a good idea. > Based > > >> on > > >> > the > > >> > >> > > > above, I > > >> > >> > > > >>> would change REST API as following: > > >> > >> > > > >>> > > >> > >> > > > >>> POST/GET <host>/blocklist/nodes > > >> > >> > > > >>> > > >> > >> > > > >>> POST/GET <host>/blocklist/taskmanagers > > >> > >> > > > >>> > > >> > >> > > > >>> DELETE <host>/blocklist/node/<identifier> > > >> > >> > > > >>> > > >> > >> > > > >>> DELETE <host>/blocklist/taskmanager/<identifier> > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 3) If a node is blocking/blocklisted, it means that all > > >> task > > >> > >> > managers > > >> > >> > > > on > > >> > >> > > > >>> this node are blocklisted. All slots on these TMs are > not > > >> > >> > available. 
> > >> > >> > > > This > > >> > >> > > > >>> is actually a bit like TM losts, but these TMs are not > > >> really > > >> > lost, > > >> > >> > > > they > > >> > >> > > > >>> are in an unavailable status, and they are still > > registered > > >> > in this > > >> > >> > > > flink > > >> > >> > > > >>> cluster. They will be available again once the > > >> corresponding > > >> > >> > > blocklist > > >> > >> > > > >> item > > >> > >> > > > >>> is removed. This behavior is the same in > > active/non-active > > >> > >> > clusters. > > >> > >> > > > >>> However in the active clusters, these TMs may be > released > > >> due > > >> > to > > >> > >> > idle > > >> > >> > > > >>> timeouts. > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> 4) For the item timeout, I prefer to keep it. The > reasons > > >> are > > >> > as > > >> > >> > > > >> following: > > >> > >> > > > >>> a) The timeout will not affect users adding or removing > > >> items > > >> > via > > >> > >> > > REST > > >> > >> > > > >> API, > > >> > >> > > > >>> and users can disable it by configuring it to > > >> Long.MAX_VALUE . > > >> > >> > > > >>> > > >> > >> > > > >>> b) Some node problems can recover after a period of > time > > >> > (such as > > >> > >> > > > machine > > >> > >> > > > >>> hotspots), in which case users may prefer that Flink > can > > do > > >> > this > > >> > >> > > > >>> automatically instead of requiring the user to do it > > >> manually. > > >> > >> > > > >>> > > >> > >> > > > >>> > > >> > >> > > > >>> Best, > > >> > >> > > > >>> > > >> > >> > > > >>> Lijie > > >> > >> > > > >>> > > >> > >> > > > >>> Konstantin Knauf <kna...@apache.org> 于2022年4月27日周三 > > >> 19:23写道: > > >> > >> > > > >>> > > >> > >> > > > >>>> Hi Lijie, > > >> > >> > > > >>>> > > >> > >> > > > >>>> I think, this makes sense and +1 to only support > > manually > > >> > blocking > > >> > >> > > > >>>> taskmanagers and nodes. Maybe the different strategies > > can > > >> > also be > > >> > >> > > > >>>> maintained outside of Apache Flink. > > >> > >> > > > >>>> > > >> > >> > > > >>>> A few remarks: > > >> > >> > > > >>>> > > >> > >> > > > >>>> 1) Can we use another term than "bla.cklist" due to > the > > >> > >> > controversy > > >> > >> > > > >> around > > >> > >> > > > >>>> the term? [1] There was also a Jira Ticket about this > > >> topic a > > >> > >> > while > > >> > >> > > > >> back > > >> > >> > > > >>>> and there was generally a consensus to avoid the term > > >> > blacklist & > > >> > >> > > > >> whitelist > > >> > >> > > > >>>> [2]? We could use "blocklist" "denylist" or > > "quarantined" > > >> > >> > > > >>>> 2) For the REST API, I'd prefer a slightly different > > >> design > > >> > as > > >> > >> > verbs > > >> > >> > > > >> like > > >> > >> > > > >>>> add/remove often considered an anti-pattern for REST > > APIs. > > >> > POST > > >> > >> > on a > > >> > >> > > > >> list > > >> > >> > > > >>>> item is generally the standard to add items. DELETE on > > the > > >> > >> > > individual > > >> > >> > > > >>>> resource is standard to remove an item. > > >> > >> > > > >>>> > > >> > >> > > > >>>> POST <host>/quarantine/items > > >> > >> > > > >>>> DELETE <host>/quarantine/items/<itemidentifier> > > >> > >> > > > >>>> > > >> > >> > > > >>>> We could also consider to separate taskmanagers and > > nodes > > >> in > > >> > the > > >> > >> > > REST > > >> > >> > > > >> API > > >> > >> > > > >>>> (and internal data structures). Any opinion on this? 
> > >> > >> > > > >>>> > > >> > >> > > > >>>> POST/GET <host>/quarantine/nodes > > >> > >> > > > >>>> POST/GET <host>/quarantine/taskmanager > > >> > >> > > > >>>> DELETE <host>/quarantine/nodes/<identifier> > > >> > >> > > > >>>> DELETE <host>/quarantine/taskmanager/<identifier> > > >> > >> > > > >>>> > > >> > >> > > > >>>> 3) How would blocking nodes behave with non-active > > >> resource > > >> > >> > > managers, > > >> > >> > > > >> i.e. > > >> > >> > > > >>>> standalone or reactive mode? > > >> > >> > > > >>>> > > >> > >> > > > >>>> 4) To keep the implementation even more minimal, do we > > >> need > > >> > the > > >> > >> > > > timeout > > >> > >> > > > >>>> behavior? If items are added/removed manually we could > > >> > delegate > > >> > >> > this > > >> > >> > > > >> to the > > >> > >> > > > >>>> user easily. In my opinion the timeout behavior would > > >> better > > >> > fit > > >> > >> > > into > > >> > >> > > > >>>> specific strategies at a later point. > > >> > >> > > > >>>> > > >> > >> > > > >>>> Looking forward to your thoughts. > > >> > >> > > > >>>> > > >> > >> > > > >>>> Cheers and thank you, > > >> > >> > > > >>>> > > >> > >> > > > >>>> Konstantin > > >> > >> > > > >>>> > > >> > >> > > > >>>> [1] > > >> > >> > > > >>>> > > >> > >> > > > >>>> > > >> > >> > > > >> > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > > > >> > > > https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term > > >> > >> > > > >>>> [2] https://issues.apache.org/jira/browse/FLINK-18209 > > >> > >> > > > >>>> > > >> > >> > > > >>>> Am Mi., 27. Apr. 2022 um 04:04 Uhr schrieb Lijie Wang > < > > >> > >> > > > >>>> wangdachui9...@gmail.com>: > > >> > >> > > > >>>> > > >> > >> > > > >>>>> Hi all, > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Flink job failures may happen due to cluster node > > issues > > >> > >> > > > >> (insufficient > > >> > >> > > > >>>> disk > > >> > >> > > > >>>>> space, bad hardware, network abnormalities). Flink > will > > >> > take care > > >> > >> > > of > > >> > >> > > > >> the > > >> > >> > > > >>>>> failures and redeploy the tasks. However, due to data > > >> > locality > > >> > >> > and > > >> > >> > > > >>>> limited > > >> > >> > > > >>>>> resources, the new tasks are very likely to be > > redeployed > > >> > to the > > >> > >> > > same > > >> > >> > > > >>>>> nodes, which will result in continuous task > > abnormalities > > >> > and > > >> > >> > > affect > > >> > >> > > > >> job > > >> > >> > > > >>>>> progress. > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Currently, Flink users need to manually identify the > > >> > problematic > > >> > >> > > > >> node and > > >> > >> > > > >>>>> take it offline to solve this problem. But this > > approach > > >> has > > >> > >> > > > >> following > > >> > >> > > > >>>>> disadvantages: > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> 1. Taking a node offline can be a heavy process. > Users > > >> may > > >> > need > > >> > >> > to > > >> > >> > > > >>>> contact > > >> > >> > > > >>>>> cluster administors to do this. The operation can > even > > be > > >> > >> > dangerous > > >> > >> > > > >> and > > >> > >> > > > >>>> not > > >> > >> > > > >>>>> allowed during some important business events. > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> 2. Identifying and solving this kind of problems > > manually > > >> > would > > >> > >> > be > > >> > >> > > > >> slow > > >> > >> > > > >>>> and > > >> > >> > > > >>>>> a waste of human resources. 
> > >> > >> > > > >>>>> > > >> > >> > > > >>>>> To solve this problem, Zhu Zhu and I propose to > > >> introduce a > > >> > >> > > blacklist > > >> > >> > > > >>>>> mechanism for Flink to filter out problematic > > resources. > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> You can find more details in FLIP-224[1]. Looking > > forward > > >> > to your > > >> > >> > > > >>>> feedback. > > >> > >> > > > >>>>> [1] > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> > > >> > >> > > > >> > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Best, > > >> > >> > > > >>>>> > > >> > >> > > > >>>>> Lijie > > >> > >> > > > >>>>> > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > > > >> > > > > > > -- Best regards, Roman Boyko e.: ro.v.bo...@gmail.com