[ https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Baozhu Zhao updated FLINK-37813:
--------------------------------
    Affects Version/s: 1.19.2

> JobManager failover during allocation slots causes ResourceManager to release unwanted TaskManager failure
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: new-tm.log, old-tm.log, 注册的tm.png
>
>
> Environment:
> * version: 1.17
> * resource provider:
> * job desc: parallelism=27, slotPerWorker=10, so 3 workers are needed
> * job config: cluster.fine-grained-resource-management.enabled=true
>
> Issue description:
> * When the JobManager fails over, the SlotReport of a re-registering
> TaskManager does not match expectations, so the ResourceManager is unable
> to release idle TaskManagers.
>
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager then
> reallocates the slots on the existing TaskManagers. Before the slot
> allocation completes, kill the JobManager, which puts the job in the
> SUSPENDED state. [^old-tm.log]
> 2. After the new JobManager is launched, the existing TaskManagers register
> with it, and the slotNum in the SlotReport sent by an existing TaskManager
> is larger than slotPerWorker. As a result, `ActiveResourceManager` fails to
> calculate 'releaseOrRequestWorkerNumber' correctly when it periodically
> checks for and releases idle TaskManagers. [^new-tm.log]
>
> !注册的tm.png|width=372,height=190!
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
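
The numbers in the report can be checked with a small sketch. This is not Flink's implementation of `ActiveResourceManager` or of 'releaseOrRequestWorkerNumber'; the class `SlotReportCheck` and both of its methods are hypothetical names introduced only for illustration. Only parallelism=27 and slotPerWorker=10 come from the issue; the per-TaskManager slot counts in `main` are assumed values, not taken from the attached logs. It shows why 3 workers are expected and how a TaskManager whose reported slot count exceeds slotPerWorker could be flagged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Flink: illustrates the arithmetic in this report. */
public class SlotReportCheck {

    /** Number of workers needed to host the given parallelism, rounded up. */
    public static int workersNeeded(int parallelism, int slotsPerWorker) {
        if (slotsPerWorker <= 0) {
            throw new IllegalArgumentException("slotsPerWorker must be positive");
        }
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    /** IDs of workers whose reported slot count exceeds the configured slots per worker. */
    public static List<String> overReportingWorkers(
            Map<String, Integer> reportedSlots, int slotsPerWorker) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : reportedSlots.entrySet()) {
            if (entry.getValue() > slotsPerWorker) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int parallelism = 27;    // from the issue
        int slotsPerWorker = 10; // from the issue

        System.out.println("workers needed: " + workersNeeded(parallelism, slotsPerWorker));

        // Assumed slot counts per TaskManager, for illustration only.
        Map<String, Integer> reported = new LinkedHashMap<>();
        reported.put("tm-1", 10);
        reported.put("tm-2", 12); // larger than slotsPerWorker, as described in step 2
        System.out.println("over-reporting: " + overReportingWorkers(reported, slotsPerWorker));
    }
}
```

A check of this kind is only a diagnostic aid: it makes the unexpected SlotReport visible but does not address why the slot count grows after the failover.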