Thanks for driving this, Zhu and Lijie. +1 for the overall proposal. Just sharing my two cents here:
- Why do we need to expose
  `cluster.resource-blacklist.item.timeout-check-interval` to the user? I
  think the semantics of `cluster.resource-blacklist.item.timeout` are
  sufficient for the user. How the timeout is enforced is an internal
  implementation detail of Flink. Exposing the check interval would be
  very confusing, and I do not think we need to expose it to users.
- ResourceManager can also notify `BlacklistHandler` of task manager
  exceptions. For example, slot allocation might fail when the target
  task manager is busy or experiences network jitter. I don't mean we
  need to cover this case in this version, but we could also add a
  `notifyException` method to `ResourceManagerBlacklistHandler`.
- Before we sync the blocklist to ResourceManager, will the slots of a
  blocked task manager continue to be released and allocated?

Best,
Yangze Guo

On Thu, Apr 28, 2022 at 3:11 PM Lijie Wang <wangdachui9...@gmail.com> wrote:
>
> Hi Konstantin,
>
> Thanks for your feedback. I will respond to your 4 remarks:
>
> 1) Thanks for reminding me of the controversy. I think “BlockList” is
> good enough, and I will change it in the FLIP.
>
> 2) Your suggestion for the REST API is a good idea. Based on the above,
> I would change the REST API as follows:
>
> POST/GET <host>/blocklist/nodes
>
> POST/GET <host>/blocklist/taskmanagers
>
> DELETE <host>/blocklist/node/<identifier>
>
> DELETE <host>/blocklist/taskmanager/<identifier>
>
> 3) If a node is blocking/blocklisted, it means that all task managers
> on this node are blocklisted. All slots on these TMs are unavailable.
> This is actually a bit like losing the TMs, but these TMs are not really
> lost: they are in an unavailable state and are still registered in the
> Flink cluster. They will be available again once the corresponding
> blocklist item is removed. This behavior is the same in active and
> non-active clusters. However, in active clusters, these TMs may be
> released due to idle timeouts.
>
> 4) For the item timeout, I prefer to keep it.
> The reasons are as follows:
>
> a) The timeout will not affect users adding or removing items via the
> REST API, and users can disable it by setting it to Long.MAX_VALUE.
>
> b) Some node problems can recover after a period of time (such as
> machine hotspots), in which case users may prefer that Flink handle
> this automatically instead of requiring the user to do it manually.
>
> Best,
> Lijie
>
> On Wed, Apr 27, 2022 at 19:23, Konstantin Knauf <kna...@apache.org> wrote:
>
> > Hi Lijie,
> >
> > I think this makes sense, and +1 to only supporting manually blocking
> > taskmanagers and nodes. Maybe the different strategies can also be
> > maintained outside of Apache Flink.
> >
> > A few remarks:
> >
> > 1) Can we use another term than "blacklist" due to the controversy
> > around the term? [1] There was also a Jira ticket about this topic a
> > while back, and there was generally a consensus to avoid the terms
> > blacklist & whitelist [2]. We could use "blocklist", "denylist" or
> > "quarantined".
> > 2) For the REST API, I'd prefer a slightly different design, as verbs
> > like add/remove are often considered an anti-pattern for REST APIs.
> > POST on a list item is generally the standard to add items. DELETE on
> > the individual resource is standard to remove an item.
> >
> > POST <host>/quarantine/items
> > DELETE <host>/quarantine/items/<itemidentifier>
> >
> > We could also consider separating taskmanagers and nodes in the REST
> > API (and internal data structures). Any opinion on this?
> >
> > POST/GET <host>/quarantine/nodes
> > POST/GET <host>/quarantine/taskmanager
> > DELETE <host>/quarantine/nodes/<identifier>
> > DELETE <host>/quarantine/taskmanager/<identifier>
> >
> > 3) How would blocking nodes behave with non-active resource managers,
> > i.e. standalone or reactive mode?
> >
> > 4) To keep the implementation even more minimal, do we need the
> > timeout behavior? If items are added/removed manually we could
> > delegate this to the user easily.
> > In my opinion, the timeout behavior would better fit into specific
> > strategies at a later point.
> >
> > Looking forward to your thoughts.
> >
> > Cheers and thank you,
> >
> > Konstantin
> >
> > [1]
> > https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
> > [2] https://issues.apache.org/jira/browse/FLINK-18209
> >
> > On Wed, Apr 27, 2022 at 04:04, Lijie Wang <wangdachui9...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Flink job failures may happen due to cluster node issues
> > > (insufficient disk space, bad hardware, network abnormalities).
> > > Flink will take care of the failures and redeploy the tasks.
> > > However, due to data locality and limited resources, the new tasks
> > > are very likely to be redeployed to the same nodes, which will
> > > result in continuous task abnormalities and affect job progress.
> > >
> > > Currently, Flink users need to manually identify the problematic
> > > node and take it offline to solve this problem. But this approach
> > > has the following disadvantages:
> > >
> > > 1. Taking a node offline can be a heavy process. Users may need to
> > > contact cluster administrators to do this. The operation can even be
> > > dangerous and not allowed during some important business events.
> > >
> > > 2. Identifying and solving this kind of problem manually would be
> > > slow and a waste of human resources.
> > >
> > > To solve this problem, Zhu Zhu and I propose to introduce a
> > > blacklist mechanism for Flink to filter out problematic resources.
> > >
> > > You can find more details in FLIP-224 [1]. Looking forward to your
> > > feedback.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
> > >
> > > Best,
> > >
> > > Lijie