[ https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xingbo Huang updated FLINK-18229: --------------------------------- Fix Version/s: (was: 1.16.0) 1.17.0 > Pending worker requests should be properly cleared > -------------------------------------------------- > > Key: FLINK-18229 > URL: https://issues.apache.org/jira/browse/FLINK-18229 > Project: Flink > Issue Type: Sub-task > Components: Deployment / Kubernetes, Deployment / YARN, Runtime / > Coordination > Affects Versions: 1.9.3, 1.10.1, 1.11.0 > Reporter: Xintong Song > Assignee: Weihua Hu > Priority: Major > Fix For: 1.17.0 > > > Currently, if Kubernetes/Yarn does not have enough resources to fulfill > Flink's resource requirement, there will be pending pod/container requests on > Kubernetes/Yarn. These pending resource requirements are never cleared until > either fulfilled or the Flink cluster is shutdown. > However, sometimes Flink no longer needs the pending resources. E.g., the > slot request is then fulfilled by another slots that become available, or the > job failed due to slot request timeout (in a session cluster). In such cases, > Flink does not remove the resource request until the resource is allocated, > then it discovers that it no longer needs the allocated resource and release > them. This would affect the underlying Kubernetes/Yarn cluster, especially > when the cluster is under heavy workload. > It would be good for Flink to cancel pod/container requests as earlier as > possible if it can discover that some of the pending workers are no longer > needed. > There are several approaches potentially achieve this. > # We can always check whether there's a pending worker that can be canceled > when a \{{PendingTaskManagerSlot}} is unassigned. > # We can have a separate timeout for requesting new worker. If the resource > cannot be allocated within the given time since requested, we should cancel > that resource request and claim a resource allocation failure. > # We can share the same timeout for starting new worker (proposed in > FLINK-13554). This is similar to 2), but it requires the worker to be > registered, rather than allocated, before timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010)