Hi Lijie,

hm, maybe the following is more appropriate in that case:

POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge

Best,

Konstantin

On Mon, 16 May 2022 at 07:05, Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi Konstantin,

thanks for your feedback.

From what I understand, PUT should be idempotent. However, we have a *timeout* field in the request. This means that issuing the same request at two different times will lead to different resource states (the timestamps at which the items are removed will be different).

Should we use PUT in this case? WDYT?

Best,
Lijie

On Fri, 13 May 2022 at 17:20, Konstantin Knauf <kna...@apache.org> wrote:

Hi Lijie,

wouldn't the REST API-idiomatic way for an update/replace be a PUT on the resource?

PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}

Best,

Konstantin

On Fri, 13 May 2022 at 11:01, Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi everyone,

I've had an offline discussion with Becket Qin and Zhu Zhu, and made the following changes to the REST API:
1. To avoid ambiguity, only one of *timeout* and *endTimestamp* can be specified. If both are specified, an error is returned.
2. If the specified item is already there, the *ADD* operation has two behaviors: *return error* (default value) or *merge/update*, and we add a flag to the request body to control it. You can find more details in the "Public Interface" section.

If there is no more feedback, we will start the vote thread next week.

Best,
Lijie

On Tue, 10 May 2022 at 17:14, Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi Becket Qin,

Thanks for your suggestions. I have moved the description of configurations, metrics and REST API into the "Public Interface" section, and made a few updates according to your suggestions. In this FLIP, there are no public Java interfaces or pluggables that users need to implement by themselves.

Answers to your questions:
1. Yes, there are 2 block actions: MARK_BLOCKED and MARK_BLOCKED_AND_EVACUATE_TASKS (they have been renamed). Currently, block items can only be added through the REST API, so these 2 actions are mentioned in the REST API part (the REST API part has now been moved to the public interface).
2. I agree with you. I have changed the "Cause" field to String, and allow users to specify it via the REST API.
3. Yes, it is useful to allow different timeouts. As mentioned above, we will introduce 2 fields, *timeout* and *endTimestamp*, into the ADD REST API to specify when to remove the blocked item. These 2 fields are optional; if neither is specified, the blocked item is permanent and will not be removed. If both are specified, the minimum of *currentTimestamp + timeout* and *endTimestamp* will be used as the time to remove the blocked item. To keep the configuration minimal, we have removed the *cluster.resource-blocklist.item.timeout* configuration option.
4. Yes, the block item will be overridden if the specified item already exists. The ADD operation is *ADD or UPDATE*.
5. Yes. On the JM/RM side, all the blocklist information is maintained in JMBlocklistHandler/RMBlocklistHandler. The blocklist handler (or an abstraction of it behind other interfaces) will be propagated to different components.

Best,
Lijie

On Tue, 10 May 2022 at 11:26, Becket Qin <becket....@gmail.com> wrote:

Hi Lijie,

Thanks for updating the FLIP. It looks like the public interface section did not fully reflect all the user-visible behavior and API. Can you put everything that users may be aware of there? That would include the REST API, metrics, configurations, public Java interfaces or pluggables that users may see or implement by themselves, as well as a brief summary of the behavior of the public API.

Besides that, I have a few questions:

1. According to the conversation in the discussion thread, it looks like the BlockAction will have "MARK_BLOCKLISTED" and "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you add that to the public interface as well?

2. At this point, the "Cause" field in the BlockingItem is a Throwable and is not reflected in the REST API. Should that be included in the query response? And should we change that field to be a String so users may specify the cause via the REST API when they block some nodes / TMs?

3. Would it be useful to allow users to have different timeouts for different blocked items? So while there is a default timeout, users can also override it via the REST API when they block an entity.

4. Regarding the ADD operation, if the specified item is already there, will the block item be overridden? For example, if the user wants to extend the timeout of a blocked item, can they just issue an ADD command again?

5. I am not quite familiar with the details of this, but is there a source of truth for the blocked list? I think it might be good to have a single source of truth for the blocked list and just propagate that list to different components to take the action of actually blocking the resource.
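
To make questions 3 and 4 concrete, here is a minimal Python sketch of the ADD semantics as they stand after the 13 May update earlier in this thread: *timeout* and *endTimestamp* are mutually exclusive, an item with neither is permanent, and re-adding an existing item fails unless a merge flag is set. The names `BlockedItem`, `resolve_end_timestamp`, `add_item` and `allow_merge` are hypothetical and not taken from the FLIP, and "merge" is assumed here to simply overwrite the existing item.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BlockedItem:
    # Hypothetical shape; field names are illustrative, not taken from the FLIP.
    identifier: str
    cause: str
    end_timestamp_ms: Optional[int]  # None means the item is permanent


def resolve_end_timestamp(
    now_ms: int,
    timeout_ms: Optional[int] = None,
    end_timestamp_ms: Optional[int] = None,
) -> Optional[int]:
    """Return the removal time in milliseconds, or None for a permanent item."""
    if timeout_ms is not None and end_timestamp_ms is not None:
        # Per the 13 May update: specifying both fields is an error.
        raise ValueError("specify either timeout or endTimestamp, not both")
    if timeout_ms is not None:
        return now_ms + timeout_ms
    return end_timestamp_ms


def add_item(
    blocklist: Dict[str, BlockedItem], item: BlockedItem, allow_merge: bool = False
) -> None:
    """ADD: reject an existing item by default, overwrite it when allow_merge is set."""
    if item.identifier in blocklist and not allow_merge:
        raise KeyError(f"{item.identifier} is already blocked")
    # Assumption: "merge/update" is modeled as a plain overwrite.
    blocklist[item.identifier] = item
```

Under these assumptions, extending the timeout of a blocked item means issuing ADD again with the merge flag set. It also shows why the request is not idempotent: the same *timeout* resolves to a different removal time depending on when the request arrives.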

Thanks,

Jiangjie (Becket) Qin

On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi everyone,

Based on the discussion in the mailing list, I updated the FLIP doc. The changes include:
1. Changed the description of the motivation section to more clearly describe the problem this FLIP is trying to solve.
2. Only *Manually* is supported.
3. Adopted some suggestions, such as *endTimestamp*.

Best,
Lijie

On Sat, 7 May 2022 at 19:25, Roman Boyko <ro.v.bo...@gmail.com> wrote:

Hi Lijie!

*a) “Probably storing inside Zookeeper/Configmap might be helpful here.” Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.*

Sorry, this was a misunderstanding on my side. I thought we were talking about the HA mode (not about Active and Standalone ResourceManager). The original question was: how should the list of blacklisted nodes be handled at the moment of a leader change? Should we simply forget about them or try to pre-save that list on remote storage?

On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey...@gmail.com> wrote:

Thanks Lijie and ZhuZhu for the explanation.

I just overlooked the "MARK_BLOCKLISTED". At the task level, this is indeed functionality that external tools (e.g. kubectl taint) cannot support.

Best,
Yang

On Fri, 6 May 2022 at 22:18, Lijie Wang <wangdachui9...@gmail.com> wrote:

Thanks for your feedback, Jiangang and Martijn.

@Jiangang

> For auto-detecting, I wonder how to make the strategy and mark a node blocked?

In fact, we currently plan to not support auto-detection in this FLIP. The part about auto-detection may be continued in a separate FLIP in the future. Some people have the same concerns as you, and the correctness and necessity of auto-detection may require further discussion in the future.

> In session mode, multi jobs can fail on the same bad node and the node should be marked blocked.

By design, the blocklist information will be shared among all jobs in a cluster/session. The JM will sync blocklist information with the RM.

@Martijn

> I agree with Yang Wang on this.

As Zhu Zhu and I mentioned above, we think MARK_BLOCKLISTED (which just limits the load of the node and does not kill all the processes on it) is also important, and we think that external systems (*yarn rmadmin or kubectl taint*) cannot support it. So we think it makes sense even if only *manually*.

> I also agree with Chesnay that magical mechanisms are indeed super hard to get right.

Yes, as you see, Jiangang (and a few others) have the same concern. However, we currently plan to not support auto-detection in this FLIP, and only *manually*. In addition, I'd like to say that the FLIP provides a mechanism to support MARK_BLOCKLISTED and MARK_BLOCKLISTED_AND_EVACUATE_TASKS; the auto-detection may be done by external systems.

Best,
Lijie

On Fri, 6 May 2022 at 19:04, Martijn Visser <mart...@ververica.com> wrote:

> If we only support to block nodes manually, then I could not see the obvious advantages compared with current SRE's approach(via *yarn rmadmin or kubectl taint*).

I agree with Yang Wang on this.

> To me this sounds yet again like one of those magical mechanisms that will rarely work just right.

I also agree with Chesnay that magical mechanisms are indeed super hard to get right.

Best regards,

Martijn

On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangp...@gmail.com> wrote:

Thanks for the valuable design. Auto-detection can save us a great deal of work. We have implemented a similar feature in our internal Flink version. Below are the things I care about:

1. For auto-detecting, I wonder how to make the strategy and mark a node blocked? Sometimes the blocked node is hard to detect; for example, the upper node or the down node will be blocked when the network is unreachable.
2. I see that the strategy is made on the JobMaster side. How about implementing similar logic in the resource manager? In session mode, multi jobs can fail on the same bad node and the node should be marked blocked. If the job makes the strategy, the node may not be marked blocked if the number of failures doesn't exceed the threshold.

On Thu, 5 May 2022 at 23:35, Zhu Zhu <reed...@gmail.com> wrote:

Thank you for all your feedback!

Besides the answers from Lijie, I'd like to share some of my thoughts:

1. Whether to enable automatic blocklisting
Generally speaking, it is not a goal of FLIP-224. The automatic way should be something built upon the blocklist mechanism and well decoupled. It was designed to be a configurable blocklist strategy, but I think we can further decouple it by introducing an abnormal node detector, as Becket suggested, which just uses the blocklist mechanism once bad nodes are detected. However, it should be a separate FLIP with further dev discussions and feedback from users. I also agree with Becket that different users have different requirements, and we should listen to them.

2. Is it enough to just take away abnormal nodes externally
My answer is no. As Lijie has mentioned, we need a way to avoid deploying tasks to temporary hot nodes. In this case, users may just want to limit the load of the node and do not want to kill all the processes on it. Another case is speculative execution [1], which may also leverage this feature to avoid starting mirror tasks on slow nodes.

Thanks,
Zhu

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job

On Thu, 5 May 2022 at 15:56, Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi everyone,

Thanks for your feedback.

There's one detail that I'd like to re-emphasize here because it can affect the value and design of the blocklist mechanism (perhaps I should highlight it in the FLIP). We propose two actions in the FLIP:

1) MARK_BLOCKLISTED: Just mark the task manager or node as blocked. Future slots should not be allocated from the blocked task manager or node. But slots that are already allocated will not be affected. A typical application scenario is to mitigate machine hotspots. In this case, we hope that subsequent resource allocations will not be on the hot machine, but tasks currently running on it should not be affected.

2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the task manager or node as blocked, and evacuate all tasks on it. Evacuated tasks will be restarted on non-blocked task managers.

Of the above 2 actions, the former may better highlight the value of this FLIP, because the external system cannot do that.

Regarding *Manually* and *Automatically*, I basically agree with @Becket Qin: different users have different answers. Not all users’ deployment environments have a special external system that can perform the anomaly detection. In addition, adding pluggable/optional auto-detection doesn't require much extra work on top of manual specification.

I will answer your other questions one by one.

@Yangze

a) I think you are right, we do not need to expose `cluster.resource-blocklist.item.timeout-check-interval` to users.

b) We can abstract `notifyException` into a separate interface (maybe BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can implement it in the future.
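
The difference between the two actions described earlier in this message can be illustrated with a toy model. This is only a sketch under stated assumptions, not Flink code: `ToyCluster`, `allocate` and `block` are hypothetical names, and placing new tasks on the least-loaded non-blocked task manager is an arbitrary choice made for the example.

```python
from enum import Enum


class BlockAction(Enum):
    MARK_BLOCKLISTED = "MARK_BLOCKLISTED"
    MARK_BLOCKLISTED_AND_EVACUATE_TASKS = "MARK_BLOCKLISTED_AND_EVACUATE_TASKS"


class ToyCluster:
    """Toy model (hypothetical): maps a task manager id to its running tasks."""

    def __init__(self, task_managers):
        self.tasks = {tm: set() for tm in task_managers}
        self.blocked = set()

    def allocate(self, task):
        # New slots only come from task managers that are not blocked.
        candidates = [tm for tm in sorted(self.tasks) if tm not in self.blocked]
        if not candidates:
            raise RuntimeError("no non-blocked task manager available")
        # Assumption for the example: pick the least-loaded candidate.
        target = min(candidates, key=lambda tm: len(self.tasks[tm]))
        self.tasks[target].add(task)
        return target

    def block(self, tm, action):
        # Both actions stop future allocations on the task manager.
        self.blocked.add(tm)
        if action is BlockAction.MARK_BLOCKLISTED_AND_EVACUATE_TASKS:
            # Only this action touches tasks that are already running.
            evacuated = sorted(self.tasks[tm])
            self.tasks[tm].clear()
            for task in evacuated:
                self.allocate(task)
```

In this model, MARK_BLOCKLISTED leaves the running tasks where they are and only affects later `allocate` calls, which is the hotspot-mitigation case above. MARK_BLOCKLISTED_AND_EVACUATE_TASKS additionally moves the running tasks to non-blocked task managers.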

@Martijn

a) I also think the manual blocking should be done by cluster operators.

b) I think manual blocking makes sense because, in my experience, users are often the first to perceive machine problems (because of job failover or delay), and they will contact cluster operators to solve them, or even tell the cluster operators which machine is problematic. From this point of view, I think the people who really need manual blocking are the users, and it is just performed by the cluster operator.

@Chesnay

We need to touch the logic of JM/SlotPool, because for MARK_BLOCKLISTED, we need to know whether the slot is blocklisted when the task is FINISHED/CANCELLED/FAILED. If so, SlotPool should release the slot directly to avoid assigning other tasks (of this job) to it. If we only maintain the blocklist information on the RM, the JM needs to retrieve it by RPC. I think the performance overhead of that is relatively large, so I think it's worth maintaining the blocklist information on the JM side and syncing it.

@Роман

a) “Probably storing inside Zookeeper/Configmap might be helpful here.” Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.

b) I agree with you, the `endTimestamp` makes sense, I will add it to the FLIP.

@Yang

As mentioned above, AFAIK, the external system cannot support the MARK_BLOCKLISTED action.

Looking forward to your further feedback.

Best,

Lijie

On Tue, 3 May 2022 at 21:09, Yang Wang <danrtsey...@gmail.com> wrote:

Thanks Lijie and Zhu for creating the proposal.

I want to share some thoughts about Flink cluster operations.

In the production environment, the SRE (Site Reliability Engineer) already has many tools to detect unstable nodes, which can take the system logs/metrics into consideration. Then they use graceful-decommission in YARN and taint in K8s to prevent new allocations on these unstable nodes. Finally, they evict all the containers and pods running on these nodes. This mechanism also works for planned maintenance. So I am afraid this is not the typical use case for FLIP-224.

If we only support to block nodes manually, then I could not see the obvious advantages compared with current SRE's approach(via *yarn rmadmin or kubectl taint*). At least, we need to have a pluggable component which could expose the potential unstable nodes automatically and block them if enabled explicitly.

Best,
Yang

On Mon, 2 May 2022 at 16:36, Becket Qin <becket....@gmail.com> wrote:

Thanks for the proposal, Lijie.

This is an interesting feature and discussion, and somewhat related to the design principle about how people should operate Flink.

I think there are three things involved in this FLIP.
     a) Detect and report the unstable node.
     b) Collect the information of the unstable node and form a blocklist.
     c) Take the action to block nodes.

My two cents:

1. It looks like people all agree that Flink should have c). It is not only useful for cases of node failures, but also handy for some planned maintenance.

2. People have different opinions on b), i.e. who should be the brain to make the decision to block a node. I think this largely depends on who we talk to. Different users would probably give different answers. For people who do have a centralized node health management service, letting Flink just do a) and c) would be preferred. So essentially Flink would be one of the sources that may detect unstable nodes, report them to that service, and then take the command from that service to block the problematic nodes. On the other hand, for users who do not have such a service, simply letting Flink be clever by itself to block the suspicious nodes might be desired to ensure the jobs are running smoothly.

So that indicates a) and b) here should be pluggable / optional.

In light of this, maybe it would make sense to have something pluggable like an UnstableNodeReporter which exposes unstable nodes actively. (A more general interface should be JobInfoReporter<T> which can be used to report any information of type <T>. But I'll just keep the scope relevant to this FLIP here.) Personally speaking, I think it is OK to have a default implementation of a reporter which just tells Flink to take action to block problematic nodes and also unblocks them after a timeout.

Thanks,

Jiangjie (Becket) Qin

On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro.v.bo...@gmail.com> wrote:

Thanks for the good initiative, Lijie and Zhu!

If it's possible I'd like to participate in development.

I agree with the 3rd point of Konstantin's reply: we should consider somehow moving the information about blocklisted nodes/TMs from the active ResourceManager to non-active ones. Probably storing inside Zookeeper/Configmap might be helpful here.

And I agree with Martijn that a lot of organizations don't want to expose such an API to a cluster user group. But I think it's necessary to have a mechanism for unblocking the nodes/TMs anyway, to avoid incorrect automatic behaviour.

And one more small suggestion: I think it would be better to extend the *BlocklistedItem* class with the *endTimestamp* field and fill it at item creation. This simple addition will allow us to:

   - Give users the ability to set the exact end time of a blocklist entry through the REST API
   - Not be tied to a single value of *cluster.resource-blacklist.item.timeout*

On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ches...@apache.org> wrote:

I do share the concern about blurring the lines a bit.

That said, I'd prefer to not have any auto-detection and only have an opt-in mechanism to manually block processes/nodes. To me this sounds yet again like one of those magical mechanisms that will rarely work just right. An external system can leverage way more information after all.

Moreover, I'm quite concerned about the complexity of this proposal. Tracking on both the RM/JM side; syncing between components; adjustments to the slot and resource protocol.

In a way it seems overly complicated.

If we look at it purely from an active resource management perspective, then there isn't really a need to touch the slot protocol at all (or in fact anything in the JobMaster), because there isn't any point in keeping around blocked TMs in the first place. They'd just be idling, potentially being shut down after a while by the RM because of it (unless we _also_ touch that logic). Here the blocking of a process (be it by blocking the process or node) is equivalent to shutting down the blocked process(es). Once the block is lifted we can just spin it back up.
>
> And I do wonder whether we couldn't apply the same line of thinking to
> standalone resource management. Here being able to stop/restart a
> process/node manually should be a core requirement for a Flink deployment
> anyway.
>
> On 02/05/2022 08:49, Martijn Visser wrote:
>
> Hi everyone,
>
> Thanks for creating this FLIP. I can understand the problem and I see value
> in the automatic detection and blocklisting. I do have some concerns about
> the ability to manually specify resources to be blocked. I have two
> concerns:
>
> * Most organizations explicitly have a separation of concerns, meaning that
>   there's a group that's responsible for managing a cluster and there's a
>   user group that uses that cluster.
>   With the introduction of this mechanism, the latter group can now
>   influence the responsibility of the first group. So it is possible that
>   someone from the user group blocks something, which causes an outage
>   (which could result in a paging mechanism triggering, etc.) that impacts
>   the first group.
> * How big is the group of people who can go through the process of manually
>   identifying a node that isn't behaving as it should? I do think this
>   group is relatively limited. Does it then make sense to introduce such a
>   feature, which would only be used by a really small group of Flink users?
>   We still have to maintain, test and support such a feature.
>
> I'm +1 for the autodetection features, but I'm leaning towards not exposing
> this to the user group but having this available strictly for cluster
> operators. They could then also set up their paging/metrics/logging system
> to take this into account.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
> On Fri, 29 Apr 2022 at 09:39, Yangze Guo <karma...@gmail.com> wrote:
>
> Thanks for driving this, Zhu and Lijie.
>
> +1 for the overall proposal. Just to share my two cents here:
>
> - Why do we need to expose
>   cluster.resource-blacklist.item.timeout-check-interval to the user?
>   I think the semantics of `cluster.resource-blacklist.item.timeout` are
>   sufficient for the user.
>   How the timeout mechanism is guaranteed is an internal implementation
>   detail of Flink. I think it would be very confusing, and we do not need
>   to expose it to users.
>
> - ResourceManager can notify the `BlacklistHandler` of a task manager's
>   exception as well. For example, slot allocation might fail if the target
>   task manager is busy or experiences network jitter. I don't mean we need
>   to cover this case in this version, but we could also expose a
>   `notifyException` in `ResourceManagerBlacklistHandler`.
>
> - Before we sync the blocklist to the ResourceManager, will the slots of a
>   blocked task manager continue to be released and allocated?
>
> Best,
> Yangze Guo
>
> On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wangdachui9...@gmail.com> wrote:
>
> Hi Konstantin,
>
> Thanks for your feedback.
> I will respond to your 4 remarks:
>
> 1) Thanks for reminding me of the controversy. I think “BlockList” is good
> enough, and I will change it in the FLIP.
>
> 2) Your suggestion for the REST API is a good idea. Based on the above, I
> would change the REST API as follows:
>
> POST/GET <host>/blocklist/nodes
>
> POST/GET <host>/blocklist/taskmanagers
>
> DELETE <host>/blocklist/node/<identifier>
>
> DELETE <host>/blocklist/taskmanager/<identifier>
>
> 3) If a node is blocked/blocklisted, it means that all task managers on
> this node are blocklisted. All slots on these TMs are unavailable.
> This is actually a bit like losing the TMs, but these TMs are not really
> lost: they are in an unavailable state, and they are still registered in
> this Flink cluster. They will be available again once the corresponding
> blocklist item is removed. This behavior is the same in active/non-active
> clusters. However, in active clusters, these TMs may be released due to
> idle timeouts.
>
> 4) For the item timeout, I prefer to keep it. The reasons are as follows:
>
> a) The timeout will not affect users adding or removing items via the REST
> API, and users can disable it by configuring it to Long.MAX_VALUE.
>
> b) Some node problems can recover after a period of time (such as machine
> hotspots), in which case users may prefer that Flink do this
> automatically instead of requiring the user to do it manually.
>
> Best,
>
> Lijie
>
> On Wed, 27 Apr 2022 at 19:23, Konstantin Knauf <kna...@apache.org> wrote:
>
> Hi Lijie,
>
> I think this makes sense, and +1 to only supporting manually blocking
> taskmanagers and nodes. Maybe the different strategies can also be
> maintained outside of Apache Flink.
>
> A few remarks:
>
> 1) Can we use another term than "blacklist" due to the controversy around
> the term?
> [1] There was also a Jira ticket about this topic a while back, and there
> was generally a consensus to avoid the terms blacklist & whitelist [2].
> We could use "blocklist", "denylist" or "quarantined".
>
> 2) For the REST API, I'd prefer a slightly different design, as verbs like
> add/remove are often considered an anti-pattern for REST APIs. POST on the
> list resource is generally the standard way to add items. DELETE on the
> individual resource is the standard way to remove an item.
>
> POST <host>/quarantine/items
> DELETE <host>/quarantine/items/<itemidentifier>
>
> We could also consider separating taskmanagers and nodes in the REST API
> (and internal data structures). Any opinion on this?
>
> POST/GET <host>/quarantine/nodes
> POST/GET <host>/quarantine/taskmanager
> DELETE <host>/quarantine/nodes/<identifier>
> DELETE <host>/quarantine/taskmanager/<identifier>
>
> 3) How would blocking nodes behave with non-active resource managers, i.e.
> standalone or reactive mode?
>
> 4) To keep the implementation even more minimal, do we need the timeout
> behavior? If items are added/removed manually we could delegate this to the
> user easily. In my opinion the timeout behavior would better fit into
> specific strategies at a later point.
>
> Looking forward to your thoughts.
>
> Cheers and thank you,
>
> Konstantin
>
> [1] https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> [2] https://issues.apache.org/jira/browse/FLINK-18209
>
> On Wed, 27 Apr 2022 at 04:04, Lijie Wang <wangdachui9...@gmail.com> wrote:
>
> Hi all,
>
> Flink job failures may happen due to cluster node issues (insufficient disk
> space, bad hardware, network abnormalities). Flink will take care of the
> failures and redeploy the tasks.
> However, due to data locality and limited resources, the new tasks are
> very likely to be redeployed to the same nodes, which will result in
> continuous task abnormalities and affect job progress.
>
> Currently, Flink users need to manually identify the problematic node and
> take it offline to solve this problem. But this approach has the following
> disadvantages:
>
> 1. Taking a node offline can be a heavy process. Users may need to contact
> cluster administrators to do this. The operation can even be dangerous and
> not allowed during some important business events.
>
> 2. Identifying and solving this kind of problem manually would be slow and
> a waste of human resources.
>
> To solve this problem, Zhu Zhu and I propose to introduce a blacklist
> mechanism for Flink to filter out problematic resources.
>
> You can find more details in FLIP-224 [1]. Looking forward to your
> feedback.
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>
> Best,
>
> Lijie
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.bo...@gmail.com
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk

--
https://twitter.com/snntrable https://github.com/knaufk
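
For readers following the timeout discussion quoted above, Roman Boyko's suggestion (store an *endTimestamp* on each blocklisted item when it is created) could look roughly like the Java sketch below. This is only an illustration under stated assumptions, not the FLIP-224 implementation: the class name BlocklistedItemSketch, its fields, and the withTimeout/isExpired helpers are all hypothetical, and treating Long.MAX_VALUE as "never expires" simply mirrors the convention Lijie mentions for disabling the timeout.

```java
/** Hypothetical sketch of a blocklist entry; not the actual FLIP-224 class. */
final class BlocklistedItemSketch {
    /** Assumed convention: Long.MAX_VALUE means the item never expires. */
    static final long NEVER = Long.MAX_VALUE;

    final String identifier;  // task manager ID or node ID (illustrative)
    final String cause;       // free-form reason for blocking
    final long endTimestamp;  // epoch millis at which the item should be removed

    BlocklistedItemSketch(String identifier, String cause, long endTimestamp) {
        this.identifier = identifier;
        this.cause = cause;
        this.endTimestamp = endTimestamp;
    }

    /** Computes endTimestamp once, at creation time, from a relative timeout. */
    static BlocklistedItemSketch withTimeout(
            String identifier, String cause, long nowMillis, long timeoutMillis) {
        if (nowMillis < 0 || timeoutMillis < 0) {
            throw new IllegalArgumentException("time values must be non-negative");
        }
        // Saturate at NEVER instead of overflowing for very large timeouts.
        long end = timeoutMillis >= NEVER - nowMillis ? NEVER : nowMillis + timeoutMillis;
        return new BlocklistedItemSketch(identifier, cause, end);
    }

    /** True once the given time has reached endTimestamp; never true for NEVER. */
    boolean isExpired(long nowMillis) {
        return endTimestamp != NEVER && nowMillis >= endTimestamp;
    }
}
```

With the end time fixed per item, a periodic cleanup only has to compare the clock against endTimestamp, and different items can carry different timeouts without a single cluster-wide setting.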