[ https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Baozhu Zhao updated FLINK-37813:
--------------------------------
    Summary: JobManager re-allocation slots upon failover causes ResourceManager start more TaskManager and release unwanted TaskManager failure  (was: JobManager re-allocation slots upon exit causes ResourceManager start more TaskManager and release unwanted TaskManager failure)

> JobManager re-allocation slots upon failover causes ResourceManager start more TaskManager and release unwanted TaskManager failure
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: new-jm.log, old-jm.log, re-allocate-slot.png, slot-report.png, 注册的tm.png
>
>
> Environment:
>  * version: 1.17
>  * resource provider:
>  * job description: parallelism=27, slotPerWorker=10, so 3 workers are needed
>  * job config: cluster.fine-grained-resource-management.enabled=true
>
> Issue description:
>  * When the JobManager fails over, the SlotReport of the registered
> TaskManagers does not meet expectations, so the ResourceManager is unable
> to release the idle TaskManagers.
>
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager then
> reallocates the slots to the existing TaskManagers. Before the slot
> allocation completes, kill the JobManager, which puts the job in the
> SUSPEND state. There is a chance that `FineGrainedSlotManager` calls
> `declareNeededResources()` again to allocate slots after releasing
> them. [^old-jm.log]
> ^!re-allocate-slot.png|width=1152,height=501!^
> 2. After the new JobManager is launched, the existing TaskManagers
> register, and the slotNum in the SlotReport reported by an existing
> TaskManager is larger than slotPerWorker. As a result,
> `ActiveResourceManager` fails to correctly calculate
> 'releaseOrRequestWorkerNumber' when it periodically checks for and
> releases idle TaskManagers. [^new-jm.log]
> !slot-report.png|width=991,height=554!
> !注册的tm.png|width=372,height=190!
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
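
For reference, the worker count given in the environment section (parallelism 27, 10 slots per worker, 3 workers) follows from a ceiling division. The sketch below is a minimal illustration under stated assumptions: `SlotMath`, `requiredWorkers`, and `reportExceedsWorkerCapacity` are hypothetical names, not Flink APIs, and the reported slot count of 12 is a made-up value. It only shows the arithmetic and the kind of sanity check that a SlotReport larger than slotPerWorker would fail; it does not reproduce how `ActiveResourceManager` actually computes 'releaseOrRequestWorkerNumber'.

```java
// Hypothetical helper for illustration only; none of these names are Flink APIs.
public final class SlotMath {
    private SlotMath() {}

    /** Workers needed to host `parallelism` slots when each worker offers `slotsPerWorker`. */
    public static int requiredWorkers(int parallelism, int slotsPerWorker) {
        if (parallelism < 0 || slotsPerWorker <= 0) {
            throw new IllegalArgumentException(
                    "parallelism must be >= 0 and slotsPerWorker must be > 0");
        }
        // Integer ceiling division: round up so a partially used worker still counts.
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    /** True when a TaskManager reports more slots than one worker is configured to offer. */
    public static boolean reportExceedsWorkerCapacity(int reportedSlots, int slotsPerWorker) {
        return reportedSlots > slotsPerWorker;
    }

    public static void main(String[] args) {
        // Values from the issue: parallelism=27, slotPerWorker=10.
        System.out.println(requiredWorkers(27, 10)); // prints 3
        // 12 is an assumed example of an oversized slot report, not a value from the logs.
        System.out.println(reportExceedsWorkerCapacity(12, 10)); // prints true
    }
}
```

Any accounting that derives a worker count from total slots divided by slotPerWorker implicitly assumes that no single TaskManager reports more than slotPerWorker slots, which is the assumption the second step of the reproduction appears to break.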