[ https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018835#comment-18018835 ]
Lijie Wang commented on FLINK-37813:
------------------------------------

I encountered the same problem. It occurs when the JobManager restarts and the TaskManager pod is recovered from the previous attempt.

!image-2025-09-08-22-22-20-365.png|width=1185,height=185!

taskmanager-1-1 has only one slot, but it reports two here.

> SlotManager re-allocation slots upon failover causes ResourceManager start
> more TaskManager and release unwanted TaskManager failure
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: image-2025-09-08-22-19-26-427.png,
> image-2025-09-08-22-22-20-365.png, new-jm.log, old-jm.log,
> re-allocate-slot.png, slot-report.png, 注册的tm.png
>
>
> Environment:
> * version: 1.17
> * resource provider:
> * job desc: job parallelism=27, slotPerWorker=10, so 3 workers are needed
> * job config: cluster.fine-grained-resource-management.enabled=true
>
> Issue description:
> * When the JobManager fails over, the SlotReport of a re-registered
> TaskManager does not meet expectations, so the ResourceManager cannot
> release the free TaskManager.
>
> Steps to reproduce:
> 1. Killing a TaskManager causes the job to fail, and the slot manager
> reallocates the slots to the existing TaskManagers. Before the slot
> allocation completes, kill the JobManager, which puts the job in the
> SUSPENDED state. There is a chance that `FineGrainedSlotManager` calls
> `declareNeededResources()` again and allocates slots after they have been
> released. [^old-jm.log]
> ^!re-allocate-slot.png|width=1152,height=501!^
> 2. After the new JobManager is launched, the existing TaskManagers register
> with it, and the slot count in the SlotReport sent by an existing
> TaskManager is larger than slotPerWorker. As a result,
> `ActiveResourceManager` cannot correctly calculate
> 'releaseOrRequestWorkerNumber' during its periodic check for idle
> TaskManagers to release. [^new-jm.log]
> !slot-report.png|width=991,height=554!
> !注册的tm.png|width=372,height=190!
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)