Hi Lijie,

I think this makes sense, and +1 to only supporting manual blocking of taskmanagers and nodes. Maybe the different strategies can also be maintained outside of Apache Flink.
A few remarks:

1) Can we use a term other than "blacklist", given the controversy around it? [1] There was also a Jira ticket about this topic a while back, and there was general consensus to avoid the terms blacklist and whitelist. [2] We could use "blocklist", "denylist" or "quarantined".

2) For the REST API, I'd prefer a slightly different design, as verbs like add/remove are often considered an anti-pattern for REST APIs. POST on the list resource is generally the standard way to add items. DELETE on the individual resource is the standard way to remove an item.

    POST <host>/quarantine/items
    DELETE <host>/quarantine/items/<itemidentifier>

We could also consider separating taskmanagers and nodes in the REST API (and in the internal data structures). Any opinion on this?

    POST/GET <host>/quarantine/nodes
    POST/GET <host>/quarantine/taskmanager
    DELETE <host>/quarantine/nodes/<identifier>
    DELETE <host>/quarantine/taskmanager/<identifier>

3) How would blocking nodes behave with non-active resource managers, i.e. standalone or reactive mode?

4) To keep the implementation even more minimal, do we need the timeout behavior? If items are added and removed manually, we could easily delegate this to the user. In my opinion, the timeout behavior would fit better into specific strategies at a later point.

Looking forward to your thoughts.

Cheers and thank you,

Konstantin

[1] https://en.wikipedia.org/wiki/Blacklist_(computing)#Controversy_over_use_of_the_term
[2] https://issues.apache.org/jira/browse/FLINK-18209

On Wed, Apr 27, 2022 at 04:04, Lijie Wang <wangdachui9...@gmail.com> wrote:

> Hi all,
>
> Flink job failures may happen due to cluster node issues (insufficient disk space, bad hardware, network abnormalities). Flink will take care of the failures and redeploy the tasks. However, due to data locality and limited resources, the new tasks are very likely to be redeployed to the same nodes, which will result in continuous task abnormalities and affect job progress.
>
> Currently, Flink users need to manually identify the problematic node and take it offline to solve this problem. But this approach has the following disadvantages:
>
> 1. Taking a node offline can be a heavy process. Users may need to contact cluster administrators to do this. The operation can even be dangerous and not allowed during some important business events.
>
> 2. Identifying and solving this kind of problem manually would be slow and a waste of human resources.
>
> To solve this problem, Zhu Zhu and I propose to introduce a blacklist mechanism for Flink to filter out problematic resources.
>
> You can find more details in FLIP-224 [1]. Looking forward to your feedback.
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224%3A+Blacklist+Mechanism
>
> Best,
> Lijie