Hi Konstantin,

We found that the Flink REST URL does not support the ":merge" format, which is recognized as a parameter in the URL (because it starts with a colon).
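
To illustrate the problem, here is a minimal sketch of colon-prefixed path parameters. It is a simplified stand-in and not Flink's actual REST router: the class `ColonPathParamSketch`, the `match` helper, and the `tm-1` id are hypothetical, and the sketch assumes a router that splits a pattern on "/" and treats every segment starting with ":" as a parameter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified illustration of colon-prefixed path parameters; NOT Flink's actual router. */
final class ColonPathParamSketch {

    /**
     * Matches a concrete path against a pattern, segment by segment.
     * A pattern segment that starts with ':' is treated as a parameter
     * and captures the whole corresponding path segment.
     */
    static Optional<Map<String, String>> match(String pattern, String path) {
        String[] patternSegments = pattern.split("/");
        String[] pathSegments = path.split("/");
        if (patternSegments.length != pathSegments.length) {
            return Optional.empty();
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < patternSegments.length; i++) {
            String segment = patternSegments[i];
            if (segment.startsWith(":")) {
                // Everything after the leading colon is the parameter name, so
                // ":id:merge" is one parameter called "id:merge", not "id" plus a literal suffix.
                params.put(segment.substring(1), pathSegments[i]);
            } else if (!segment.equals(pathSegments[i])) {
                return Optional.empty();
            }
        }
        return Optional.of(params);
    }

    public static void main(String[] args) {
        String pattern = "/blocklist/taskmanagers/:id:merge";
        // Both requests match the same pattern and bind the same single parameter.
        System.out.println(match(pattern, "/blocklist/taskmanagers/tm-1:merge"));
        System.out.println(match(pattern, "/blocklist/taskmanagers/tm-1"));
    }
}
```

Under these assumptions both calls match and bind a single parameter named `id:merge`, so such a pattern cannot tell a plain add from a merge.
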
We will keep the previous way, i.e. POST: http://{jm_rest_address:port}/blocklist/taskmanagers, with the "id" and the "merge" flag put into the request body.

Best,
Lijie

On Wed, May 18, 2022 at 09:35, Lijie Wang <wangdachui9...@gmail.com> wrote:

> Hi Weihua,
> thanks for feedback.
>
> 1. Yes, only *Manually* is supported in this FLIP, but it's the first step towards auto-detection.
> 2. We will print the blocked nodes in logs. Maybe also put it into the exception of insufficient resources.
> 3. No. This FLIP won't change the WebUI. The blocklist information can be obtained through REST API and metrics.
>
> Best,
> Lijie
>
> On Tue, May 17, 2022 at 21:41, Weihua Hu <huweihua....@gmail.com> wrote:
>
>> Hi,
>> Thanks for creating this FLIP.
>> We have implemented an automatic blocklist detection mechanism internally, which is indeed very effective for handling node failures.
>> Due to the large number of nodes, although SREs already support automatically taking failed nodes offline, the detection is not 100% accurate and there is some delay.
>> So the blocklist mechanism can make Flink jobs recover from failure much faster.
>>
>> Here are some of my thoughts:
>> 1. In this FLIP, users need to locate machine failures manually, so there is a certain cost of use.
>> 2. What happens if too many nodes are blocked, resulting in insufficient resources? Will there be a special Exception for the user?
>> 3. Will we display the blocklist information in the WebUI? The blocklist is for cluster level, and if multiple users share a cluster, some users may be a little confused when resources are not enough, or when resources are applied for more.
>>
>> Also, looking forward to the next FLIP on auto-detection.
>>
>> Best,
>> Weihua
>>
>> > On May 16, 2022, at 11:22 PM, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >
>> > Hi Konstantin,
>> >
>> > Maybe change it to the following:
>> >
>> > 1. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> > Merge is not allowed.
>> > If the {id} already exists, return error. Otherwise, create a new item.
>> >
>> > 2. POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> > Merge is allowed. If the {id} already exists, merge. Otherwise, create a new item.
>> >
>> > WDYT?
>> >
>> > Best,
>> > Lijie
>> >
>> > On Mon, May 16, 2022 at 20:07, Konstantin Knauf <kna...@apache.org> wrote:
>> >
>> >> Hi Lijie,
>> >>
>> >> hm, maybe the following is more appropriate in that case
>> >>
>> >> POST: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}:merge
>> >>
>> >> Best,
>> >>
>> >> Konstantin
>> >>
>> >> On Mon, May 16, 2022 at 07:05, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>
>> >>> Hi Konstantin,
>> >>> thanks for your feedback.
>> >>>
>> >>> From what I understand, PUT should be idempotent. However, we have a *timeout* field in the request. This means that initiating the same request at two different times will lead to different resource status (timestamps of the items to be removed will be different).
>> >>>
>> >>> Should we use PUT in this case? WDYT?
>> >>>
>> >>> Best,
>> >>> Lijie
>> >>>
>> >>> On Fri, May 13, 2022 at 17:20, Konstantin Knauf <kna...@apache.org> wrote:
>> >>>
>> >>>> Hi Lijie,
>> >>>>
>> >>>> wouldn't the REST API-idiomatic way for an update/replace be a PUT on the resource?
>> >>>>
>> >>>> PUT: http://{jm_rest_address:port}/blocklist/taskmanagers/{id}
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Konstantin
>> >>>>
>> >>>> On Fri, May 13, 2022 at 11:01, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi everyone,
>> >>>>>
>> >>>>> I've had an offline discussion with Becket Qin and Zhu Zhu, and made the following changes on REST API:
>> >>>>> 1. To avoid ambiguity, only one of *timeout* and *endTimestamp* can be specified. If both are specified, an error is returned.
>> >>>>> 2. If the specified item is already there, the *ADD* operation has two behaviors: *return error* (default value) or *merge/update*, and we add a flag to the request body to control it. You can find more details in the "Public Interface" section.
>> >>>>>
>> >>>>> If there is no more feedback, we will start the vote thread next week.
>> >>>>>
>> >>>>> Best,
>> >>>>> Lijie
>> >>>>>
>> >>>>> On Tue, May 10, 2022 at 17:14, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Hi Becket Qin,
>> >>>>>>
>> >>>>>> Thanks for your suggestions. I have moved the description of configurations, metrics and REST API into the "Public Interface" section, and made a few updates according to your suggestion. And in this FLIP, there are no public Java interfaces or pluggables that users need to implement by themselves.
>> >>>>>>
>> >>>>>> Answers to your questions:
>> >>>>>> 1. Yes, there are 2 block actions: MARK_BLOCKED and MARK_BLOCKED_AND_EVACUATE_TASKS (they have been renamed). Currently, block items can only be added through the REST API, so these 2 actions are mentioned in the REST API part (the REST API part has been moved to the public interface now).
>> >>>>>> 2. I agree with you. I have changed the "Cause" field to String, and allow users to specify it via REST API.
>> >>>>>> 3. Yes, it is useful to allow different timeouts. As mentioned above, we will introduce 2 fields: *timeout* and *endTimestamp* into the ADD REST API to specify when to remove the blocked item. These 2 fields are optional; if neither is specified, it means that the blocked item is permanent and will not be removed.
>> >>>>>> If both are specified, the minimum of *currentTimestamp+timeout* and *endTimestamp* will be used as the time to remove the blocked item. To keep the configurations more minimal, we have removed the *cluster.resource-blocklist.item.timeout* configuration option.
>> >>>>>> 4. Yes, the block item will be overridden if the specified item already exists. The ADD operation is *ADD or UPDATE*.
>> >>>>>> 5. Yes. On JM/RM side, all the blocklist information is maintained in JMBlocklistHandler/RMBlocklistHandler. The blocklist handler (or abstracted to other interfaces) will be propagated to different components.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Lijie
>> >>>>>>
>> >>>>>> On Tue, May 10, 2022 at 11:26, Becket Qin <becket....@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> Hi Lijie,
>> >>>>>>>
>> >>>>>>> Thanks for updating the FLIP. It looks like the public interface section did not fully reflect all the user sensible behavior and API. Can you put everything that users may be aware of there? That would include the REST API, metrics, configurations, public java Interfaces or pluggables that users may see or implement by themselves, as well as a brief summary of the behavior of the public API.
>> >>>>>>>
>> >>>>>>> Besides that, I have a few questions:
>> >>>>>>>
>> >>>>>>> 1. According to the conversation in the discussion thread, it looks like the BlockAction will have "MARK_BLOCKLISTED" and "MARK_BLOCKLISTED_AND_EVACUATE_TASKS". Is that the case? If so, can you add that to the public interface as well?
>> >>>>>>>
>> >>>>>>> 2. At this point, the "Cause" field in the BlockingItem is a Throwable and is not reflected in the REST API.
>> >>>>>>> Should that be included in the query response? And should we change that field to be a String so users may specify the cause via the REST API when they block some nodes / TMs?
>> >>>>>>>
>> >>>>>>> 3. Would it be useful to allow users to have different timeouts for different blocked items? So while there is a default timeout, users can also override it via the REST API when they block an entity.
>> >>>>>>>
>> >>>>>>> 4. Regarding the ADD operation, if the specified item is already there, will the block item be overridden? For example, if the user wants to extend the timeout of a blocked item, can they just issue an ADD command again?
>> >>>>>>>
>> >>>>>>> 5. I am not quite familiar with the details of this, but is there a source of truth for the blocked list? I think it might be good to have a single source of truth for the blocked list and just propagate that list to different components to take the action of actually blocking the resource.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>>
>> >>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>
>> >>>>>>> On Mon, May 9, 2022 at 5:54 PM Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> Hi everyone,
>> >>>>>>>>
>> >>>>>>>> Based on the discussion in the mailing list, I updated the FLIP doc, the changes include:
>> >>>>>>>> 1. Changed the description of the motivation section to more clearly describe the problem this FLIP is trying to solve.
>> >>>>>>>> 2. Only *Manually* is supported.
>> >>>>>>>> 3. Adopted some suggestions, such as *endTimestamp*.
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Lijie
>> >>>>>>>>
>> >>>>>>>> On Sat, May 7, 2022 at 19:25, Roman Boyko <ro.v.bo...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Lijie!
>> >>>>>>>>>
>> >>>>>>>>> *a) “Probably storing inside Zookeeper/Configmap might be helpful here.” Can you explain it in detail? I don't fully understand that. In my opinion, non-active and active are the same, and no special treatment is required.*
>> >>>>>>>>>
>> >>>>>>>>> Sorry this was a misunderstanding from my side. I thought we were talking about the HA mode (but not about Active and Standalone ResourceManager). And the original question was - how to handle the blacklisted nodes list at the moment of leader change? Should we simply forget about them or try to pre-save that list on the remote storage?
>> >>>>>>>>>
>> >>>>>>>>> On Sat, 7 May 2022 at 10:51, Yang Wang <danrtsey...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Thanks Lijie and ZhuZhu for the explanation.
>> >>>>>>>>>>
>> >>>>>>>>>> I just overlooked the "MARK_BLOCKLISTED". For tasks level, it is indeed some functionalities the external tools (e.g. kubectl taint) could not support.
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Yang
>> >>>>>>>>>>
>> >>>>>>>>>> On Fri, May 6, 2022 at 22:18, Lijie Wang <wangdachui9...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Thanks for your feedback, Jiangang and Martijn.
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Jiangang
>> >>>>>>>>>>>
>> >>>>>>>>>>>> For auto-detecting, I wonder how to make the strategy and mark a node blocked?
>> >>>>>>>>>>>
>> >>>>>>>>>>> In fact, we currently plan to not support auto-detection in this FLIP.
>> >>>>>>>>>>> The part about auto-detection may be continued in a separate FLIP in the future. Some guys have the same concerns as you, and the correctness and necessity of auto-detection may require further discussion in the future.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> In session mode, multi jobs can fail on the same bad node and the node should be marked blocked.
>> >>>>>>>>>>> By design, the blocklist information will be shared among all jobs in a cluster/session. The JM will sync blocklist information with RM.
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Martijn
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>> As Zhu Zhu and I mentioned above, we think the MARK_BLOCKLISTED (just limits the load of the node and does not kill all the processes on it) is also important, and we think that external systems (*yarn rmadmin or kubectl taint*) cannot support it. So we think it makes sense even only *manually*.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are indeed super hard to get right.
>> >>>>>>>>>>> Yes, as you see, Jiangang (and a few others) have the same concern. However, we currently plan to not support auto-detection in this FLIP, and only *manually*. In addition, I'd like to say that the FLIP provides a mechanism to support MARK_BLOCKLISTED and MARK_BLOCKLISTED_AND_EVACUATE_TASKS, the auto-detection may be done by external systems.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Lijie
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Fri, May 6, 2022 at 19:04, Martijn Visser <mart...@ververica.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>>> If we only support to block nodes manually, then I could not see the obvious advantages compared with current SRE's approach (via *yarn rmadmin or kubectl taint*).
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I agree with Yang Wang on this.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> To me this sounds yet again like one of those magical mechanisms that will rarely work just right.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I also agree with Chesnay that magical mechanisms are indeed super hard to get right.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Best regards,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Martijn
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Fri, 6 May 2022 at 12:03, Jiangang Liu <liujiangangp...@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks for the valuable design. The auto-detecting can save us a great deal of work. We have implemented a similar feature in our internal Flink version.
>> >>>>>>>>>>>>> Below is something that I care about:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> 1. For auto-detecting, I wonder how to make the strategy and mark a node blocked? Sometimes the blocked node is hard to detect; for example, the upper node or the down node will be blocked when the network is unreachable.
>> >>>>>>>>>>>>> 2. I see that the strategy is made on the JobMaster side. How about implementing similar logic in the resource manager?
>> >> In >> >>>>>>> session >> >>>>>>>>>> mode, >> >>>>>>>>>>>>> multi >> >>>>>>>>>>>>> jobs can fail on the same bad node and the node >> >> should >> >>>> be >> >>>>>>>> marked >> >>>>>>>>>>>>> blocked. >> >>>>>>>>>>>>> If the job makes the strategy, the node may be not >> >>>> marked >> >>>>>>>> blocked >> >>>>>>>>>> if >> >>>>>>>>>>>>> the >> >>>>>>>>>>>>> fail times don't exceed the threshold. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年5月5日周四 23:35写道: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thank you for all your feedback! >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Besides the answers from Lijie, I'd like to share >> >> some >> >>> of >> >>>>> my >> >>>>>>>>>> thoughts: >> >>>>>>>>>>>>>> 1. Whether to enable automatical blocklist >> >>>>>>>>>>>>>> Generally speaking, it is not a goal of FLIP-224. >> >>>>>>>>>>>>>> The automatical way should be something built upon >> >> the >> >>>>>>> blocklist >> >>>>>>>>>>>>>> mechanism and well decoupled. It was designed to be a >> >>>>>>>> configurable >> >>>>>>>>>>>>>> blocklist strategy, but I think we can further >> >> decouple >> >>>> it >> >>>>> by >> >>>>>>>>>>>>>> introducing a abnormal node detector, as Becket >> >>>> suggested, >> >>>>>>> which >> >>>>>>>>>> just >> >>>>>>>>>>>>>> uses the blocklist mechanism once bad nodes are >> >>> detected. >> >>>>>>>> However, >> >>>>>>>>>> it >> >>>>>>>>>>>>>> should be a separate FLIP with further dev >> >> discussions >> >>>> and >> >>>>>>>>> feedback >> >>>>>>>>>>>>>> from users. I also agree with Becket that different >> >>> users >> >>>>>>> have >> >>>>>>>>>>> different >> >>>>>>>>>>>>>> requirements, and we should listen to them. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> 2. Is it enough to just take away abnormal nodes >> >>>> externally >> >>>>>>>>>>>>>> My answer is no. As Lijie has mentioned, we need a >> >> way >> >>> to >> >>>>>>> avoid >> >>>>>>>>>>>>>> deploying tasks to temporary hot nodes. 
In this case, >> >>>> users >> >>>>>>> may >> >>>>>>>>> just >> >>>>>>>>>>>>>> want to limit the load of the node and do not want to >> >>>> kill >> >>>>>>> all >> >>>>>>>> the >> >>>>>>>>>>>>>> processes on it. Another case is the speculative >> >>>>> execution[1] >> >>>>>>>>> which >> >>>>>>>>>>>>>> may also leverage this feature to avoid starting >> >> mirror >> >>>>>>> tasks on >> >>>>>>>>>> slow >> >>>>>>>>>>>>>> nodes. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks, >> >>>>>>>>>>>>>> Zhu >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> [1] >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>> >> >>>> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Lijie Wang <wangdachui9...@gmail.com> 于2022年5月5日周四 >> >>>>> 15:56写道: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Hi everyone, >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Thanks for your feedback. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> There's one detail that I'd like to re-emphasize >> >> here >> >>>>>>> because >> >>>>>>>> it >> >>>>>>>>>> can >> >>>>>>>>>>>>>> affect the value and design of the blocklist >> >> mechanism >> >>>>>>> (perhaps >> >>>>>>>> I >> >>>>>>>>>>> should >> >>>>>>>>>>>>>> highlight it in the FLIP). We propose two actions in >> >>>> FLIP: >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> 1) MARK_BLOCKLISTED: Just mark the task manager or >> >>> node >> >>>>> as >> >>>>>>>>>> blocked. >> >>>>>>>>>>>>>> Future slots should not be allocated from the blocked >> >>>> task >> >>>>>>>> manager >> >>>>>>>>>> or >> >>>>>>>>>>>>> node. >> >>>>>>>>>>>>>> But slots that are already allocated will not be >> >>>> affected. >> >>>>> A >> >>>>>>>>> typical >> >>>>>>>>>>>>>> application scenario is to mitigate machine hotspots. 
>> >>> In >> >>>>> this >> >>>>>>>>> case, >> >>>>>>>>>> we >> >>>>>>>>>>>>> hope >> >>>>>>>>>>>>>> that subsequent resource allocations will not be on >> >> the >> >>>> hot >> >>>>>>>>> machine, >> >>>>>>>>>>> but >> >>>>>>>>>>>>>> tasks currently running on it should not be affected. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> 2) MARK_BLOCKLISTED_AND_EVACUATE_TASKS: Mark the >> >> task >> >>>>>>> manager >> >>>>>>>> or >> >>>>>>>>>>> node >> >>>>>>>>>>>>> as >> >>>>>>>>>>>>>> blocked, and evacuate all tasks on it. Evacuated >> >> tasks >> >>>> will >> >>>>>>> be >> >>>>>>>>>>>>> restarted on >> >>>>>>>>>>>>>> non-blocked task managers. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> For the above 2 actions, the former may more >> >>> highlight >> >>>>> the >> >>>>>>>>> meaning >> >>>>>>>>>>> of >> >>>>>>>>>>>>>> this FLIP, because the external system cannot do >> >> that. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Regarding *Manually* and *Automatically*, I >> >> basically >> >>>>> agree >> >>>>>>>> with >> >>>>>>>>>>>>> @Becket >> >>>>>>>>>>>>>> Qin: different users have different answers. Not all >> >>>> users’ >> >>>>>>>>>> deployment >> >>>>>>>>>>>>>> environments have a special external system that can >> >>>>> perform >> >>>>>>> the >> >>>>>>>>>>> anomaly >> >>>>>>>>>>>>>> detection. In addition, adding pluggable/optional >> >>>>>>> auto-detection >> >>>>>>>>>>> doesn't >> >>>>>>>>>>>>>> require much extra work on top of manual >> >> specification. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> I will answer your other questions one by one. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> @Yangze >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> a) I think you are right, we do not need to expose >> >>> the >> >>>>>>>>>>>>>> >> >>> `cluster.resource-blocklist.item.timeout-check-interval` >> >>>> to >> >>>>>>>> users. 
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) We can abstract the `notifyException` to a separate interface (maybe BlocklistExceptionListener), and the ResourceManagerBlocklistHandler can implement it in the future.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Martijn
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> a) I also think the manual blocking should be done by cluster operators.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> b) I think manual blocking makes sense, because according to my experience, users are often the first to perceive the machine problems (because of job failover or delay), and they will contact cluster operators to solve it, or even tell the cluster operators which machine is problematic. From this point of view, I think the people who really need the manual blocking are the users, and it’s just performed by the cluster operator, so I think the manual blocking makes sense.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> @Chesnay
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> We need to touch the logic of JM/SlotPool, because for MARK_BLOCKLISTED, we need to know whether the slot is blocklisted when the task is FINISHED/CANCELLED/FAILED. If so, SlotPool should release the slot directly to avoid assigning other tasks (of this job) on it.
>> >>>>>>> If >> >>>>>>>> we >> >>>>>>>>>>> only >> >>>>>>>>>>>>>> maintain the blocklist information on the RM, JM >> >> needs >> >>> to >> >>>>>>>> retrieve >> >>>>>>>>>> it >> >>>>>>>>>>> by >> >>>>>>>>>>>>>> RPC. I think the performance overhead of that is >> >>>> relatively >> >>>>>>>> large, >> >>>>>>>>>> so >> >>>>>>>>>>> I >> >>>>>>>>>>>>>> think it's worth maintaining the blocklist >> >> information >> >>> on >> >>>>>>> the JM >> >>>>>>>>>> side >> >>>>>>>>>>>>> and >> >>>>>>>>>>>>>> syncing them. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> @Роман >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> a) “Probably storing inside Zookeeper/Configmap >> >>>> might >> >>>>>>> be >> >>>>>>>>>> helpful >> >>>>>>>>>>>>>> here.” Can you explain it in detail? I don't fully >> >>>>>>> understand >> >>>>>>>>> that. >> >>>>>>>>>>> In >> >>>>>>>>>>>>> my >> >>>>>>>>>>>>>> opinion, non-active and active are the same, and no >> >>>> special >> >>>>>>>>>> treatment >> >>>>>>>>>>> is >> >>>>>>>>>>>>>> required. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> b) I agree with you, the `endTimestamp` makes >> >> sense, >> >>> I >> >>>>> will >> >>>>>>>> add >> >>>>>>>>> it >> >>>>>>>>>>> to >> >>>>>>>>>>>>>> FLIP. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> @Yang >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> As mentioned above, AFAK, the external system >> >> cannot >> >>>>>>> support >> >>>>>>>> the >> >>>>>>>>>>>>>> MARK_BLOCKLISTED action. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Looking forward to your further feedback. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Best, >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Lijie >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Yang Wang <danrtsey...@gmail.com> 于2022年5月3日周二 >> >>>> 21:09写道: >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Thanks Lijie and Zhu for creating the proposal. >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> I want to share some thoughts about Flink cluster >> >>>>>>> operations. 
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> In the production environment, the SRE (aka Site Reliability Engineer) already has many tools to detect the unstable nodes, which could take the system logs/metrics into consideration.
>> >>>>>>>>>>>>>>>> Then they use graceful-decommission in YARN and taint in K8s to prevent new allocations on these unstable nodes.
>> >>>>>>>>>>>>>>>> At last, they will evict all the containers and pods running on these nodes.
>> >>>>>>>>>>>>>>>> This mechanism also works for planned maintenance. So I am afraid this is not the typical use case for FLIP-224.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> If we only support to block nodes manually, then I could not see the obvious advantages compared with current SRE's approach (via *yarn rmadmin or kubectl taint*).
>> >>>>>>>>>>>>>>>> At least, we need to have a pluggable component which could expose the potential unstable nodes automatically and block them if enabled explicitly.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>>>>> Yang
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 16:36, Becket Qin <becket....@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks for the proposal, Lijie.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> This is an interesting feature and discussion, and somewhat related to the design principle about how people should operate Flink.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I think there are three things involved in this FLIP.
>> >>>>>>>>>>>>>>>>>      a) Detect and report the unstable node.
>> >>>>>>>>>>>>>>>>>      b) Collect the information of the unstable node and form a blocklist.
>> >>>>>>>>>>>>>>>>>      c) Take the action to block nodes.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> My two cents:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. It looks like people all agree that Flink should have c). It is not only useful for cases of node failures, but also handy for some planned maintenance.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 2. People have different opinions on b), i.e. who should be the brain to make the decision to block a node. I think this largely depends on who we talk to. Different users would probably give different answers. For people who do have a centralized node health management service, letting Flink do just a) and c) would be preferred.
So essentially >> >>>> Flink >> >>>>>>> would >> >>>>>>>>> be >> >>>>>>>>>>> one >> >>>>>>>>>>>>> of >> >>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> sources that may detect unstable nodes, report >> >> it >> >>> to >> >>>>>>> that >> >>>>>>>>>>> service, >> >>>>>>>>>>>>>> and then >> >>>>>>>>>>>>>>>>> take the command from that service to block the >> >>>>>>> problematic >> >>>>>>>>>>> nodes. >> >>>>>>>>>>>>> On >> >>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>> other hand, for users who do not have such a >> >>>> service, >> >>>>>>>> simply >> >>>>>>>>>>>>> letting >> >>>>>>>>>>>>>> Flink >> >>>>>>>>>>>>>>>>> be clever by itself to block the suspicious >> >> nodes >> >>>>> might >> >>>>>>> be >> >>>>>>>>>>> desired >> >>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>>> ensure the jobs are running smoothly. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> So that indicates a) and b) here should be >> >>>> pluggable / >> >>>>>>>>>> optional. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> In light of this, maybe it would make sense to >> >>> have >> >>>>>>>> something >> >>>>>>>>>>>>>> pluggable >> >>>>>>>>>>>>>>>>> like a UnstableNodeReporter which exposes >> >> unstable >> >>>>> nodes >> >>>>>>>>>>> actively. >> >>>>>>>>>>>>> (A >> >>>>>>>>>>>>>> more >> >>>>>>>>>>>>>>>>> general interface should be JobInfoReporter<T> >> >>> which >> >>>>>>> can be >> >>>>>>>>>> used >> >>>>>>>>>>> to >> >>>>>>>>>>>>>> report >> >>>>>>>>>>>>>>>>> any information of type <T>. But I'll just keep >> >>> the >> >>>>>>> scope >> >>>>>>>>>>> relevant >> >>>>>>>>>>>>> to >> >>>>>>>>>>>>>> this >> >>>>>>>>>>>>>>>>> FLIP here). Personally speaking, I think it is >> >> OK >> >>> to >> >>>>>>> have a >> >>>>>>>>>>> default >> >>>>>>>>>>>>>>>>> implementation of a reporter which just tells >> >>> Flink >> >>>> to >> >>>>>>> take >> >>>>>>>>>>> action >> >>>>>>>>>>>>> to >> >>>>>>>>>>>>>> block >> >>>>>>>>>>>>>>>>> problematic nodes and also unblocks them after >> >>>>> timeout. 
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Mon, May 2, 2022 at 3:27 PM Роман Бойко <ro.v.bo...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks for good initiative, Lijie and Zhu!
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> If it's possible I'd like to participate in development.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> I agree with 3rd point of Konstantin's reply - we should consider to move somehow the information of blocklisted nodes/TMs from active ResourceManager to non-active ones. Probably storing inside Zookeeper/Configmap might be helpful here.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And I agree with Martijn that a lot of organizations don't want to expose such API for a cluster user group. But I think it's necessary to have the mechanism for unblocking the nodes/TMs anyway for avoiding incorrect automatic behaviour.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> And another one small suggestion - I think it would be better to extend the *BlocklistedItem* class with the *endTimestamp* field and fill it at the item creation.
This simple addition would make it possible to:

- Let users set the exact end time of a blocklist entry through the REST API
- Avoid being tied to a single value of *cluster.resource-blacklist.item.timeout*

On Mon, 2 May 2022 at 14:17, Chesnay Schepler <ches...@apache.org> wrote:

I do share the concern about blurring the lines a bit.

That said, I'd prefer to not have any auto-detection and only have an opt-in mechanism to manually block processes/nodes. To me this sounds yet again like one of those magical mechanisms that will rarely work just right. An external system can leverage way more information after all.

Moreover, I'm quite concerned about the complexity of this proposal. Tracking on both the RM/JM side; syncing between components; adjustments to the slot and resource protocol.

In a way it seems overly complicated.

If we look at it purely from an active resource management perspective, then there isn't really a need to touch the slot protocol at all (or in fact anything in the JobMaster), because there isn't any point in keeping around blocked TMs in the first place. They'd just be idling, potentially being shut down after a while by the RM because of it (unless we _also_ touch that logic). Here the blocking of a process (be it by blocking the process or node) is equivalent to shutting down the blocked process(es). Once the block is lifted we can just spin it back up.

And I do wonder whether we couldn't apply the same line of thinking to standalone resource management. Here being able to stop/restart a process/node manually should be a core requirement for a Flink deployment anyway.

On 02/05/2022 08:49, Martijn Visser wrote:

Hi everyone,

Thanks for creating this FLIP.
I can understand the problem and I see value in the automatic detection and blocklisting. I do have two concerns about the ability to manually specify resources to be blocked:

* Most organizations explicitly have a separation of concerns, meaning that there's a group who's responsible for managing a cluster and there's a user group who uses that cluster. With the introduction of this mechanism, the latter group can now influence the responsibility of the first group. So it is possible that someone from the user group blocks something, which causes an outage (which could result in a paging mechanism triggering, etc.) which impacts the first group.
* How big is the group of people who can go through the process of manually identifying a node that isn't behaving as it should be?
  I do think this group is relatively limited. Does it then make sense to introduce such a feature, which would only be used by a really small user group of Flink? We still have to maintain, test and support such a feature.

I'm +1 for the autodetection features, but I'm leaning towards not exposing this to the user group but having this available strictly for cluster operators. They could then also set up their paging/metrics/logging system to take this into account.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

On Fri, 29 Apr 2022 at 09:39, Yangze Guo <karma...@gmail.com> wrote:

Thanks for driving this, Zhu and Lijie.

+1 for the overall proposal.
Just sharing my two cents here:

- Why do we need to expose `cluster.resource-blacklist.item.timeout-check-interval` to the user? I think the semantics of `cluster.resource-blacklist.item.timeout` are sufficient for the user. How the timeout is enforced is an internal implementation detail of Flink. I think exposing the check interval will be very confusing, and we do not need to expose it to users.

- ResourceManager can notify `BlacklistHandler` of a task manager's exceptions as well. For example, slot allocation might fail when the target task manager is busy or suffers from network jitter. I don't mean we need to cover this case in this version, but we can also open a `notifyException` in `ResourceManagerBlacklistHandler`.

- Before we sync the blocklist to ResourceManager, will the slots of a blocked task manager continue to be released and allocated?

Best,
Yangze Guo

On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi Konstantin,

Thanks for your feedback. I will respond to your 4 remarks:

1) Thanks for reminding me of the controversy. I think “BlockList” is good enough, and I will change it in the FLIP.

2) Your suggestion for the REST API is a good idea. Based on the above, I would change the REST API as follows:

POST/GET <host>/blocklist/nodes
POST/GET <host>/blocklist/taskmanagers
DELETE <host>/blocklist/node/<identifier>
DELETE <host>/blocklist/taskmanager/<identifier>

3) If a node is blocked/blocklisted, it means that all task managers on this node are blocklisted. All slots on these TMs are unavailable.
This is actually a bit like a TM loss, but these TMs are not really lost: they are in an unavailable status, and they are still registered in this Flink cluster. They will be available again once the corresponding blocklist item is removed. This behavior is the same in active/non-active clusters. However, in active clusters these TMs may be released due to idle timeouts.

4) For the item timeout, I prefer to keep it. The reasons are as follows:

a) The timeout will not affect users adding or removing items via the REST API, and users can disable it by configuring it to Long.MAX_VALUE.

b) Some node problems can recover after a period of time (such as machine hotspots), in which case users may prefer that Flink unblocks the node automatically instead of requiring the user to do it manually.

Best,

Lijie

On Wed, Apr 27, 2022 at 19:23, Konstantin Knauf <kna...@apache.org> wrote:

Hi Lijie,

I think this makes sense, and +1 to only supporting manually blocking taskmanagers and nodes. Maybe the different strategies can also be maintained outside of Apache Flink.

A few remarks:

1) Can we use another term than "bla.cklist" due to the controversy around the term? [1] There was also a Jira ticket about this topic a while back, and there was generally a consensus to avoid the terms blacklist & whitelist [2].
We could use "blocklist", "denylist" or "quarantined".

2) For the REST API, I'd prefer a slightly different design, as verbs like add/remove are often considered an anti-pattern for REST APIs. POST on the list resource is generally the standard way to add items. DELETE on the individual resource is the standard way to remove an item.

POST <host>/quarantine/items
DELETE <host>/quarantine/items/<itemidentifier>

We could also consider separating taskmanagers and nodes in the REST API (and internal data structures). Any opinion on this?

POST/GET <host>/quarantine/nodes
POST/GET <host>/quarantine/taskmanager
DELETE <host>/quarantine/nodes/<identifier>
DELETE <host>/quarantine/taskmanager/<identifier>

3) How would blocking nodes behave with non-active resource managers, i.e. standalone or reactive mode?

4) To keep the implementation even more minimal, do we need the timeout behavior? If items are added/removed manually we could delegate this to the user easily. In my opinion the timeout behavior would better fit into specific strategies at a later point.

Looking forward to your thoughts.

Cheers and thank you,

Konstantin

[1] https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
[2] https://issues.apache.org/jira/browse/FLINK-18209

On Wed, Apr 27, 2022 at 04:04, Lijie Wang <wangdachui9...@gmail.com> wrote:

Hi all,

Flink job failures may happen due to cluster node issues (insufficient disk space, bad hardware, network abnormalities).
Flink will take care of the failures and redeploy the tasks. However, due to data locality and limited resources, the new tasks are very likely to be redeployed to the same nodes, which will result in continuous task abnormalities and affect job progress.

Currently, Flink users need to manually identify the problematic node and take it offline to solve this problem. But this approach has the following disadvantages:

1. Taking a node offline can be a heavy process. Users may need to contact cluster administrators to do this. The operation can even be dangerous and not allowed during some important business events.

2. Identifying and solving this kind of problem manually would be slow and a waste of human resources.

To solve this problem, Zhu Zhu and I propose to introduce a blacklist mechanism for Flink to filter out problematic resources.

You can find more details in FLIP-224 [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism

Best,

Lijie