[jira] [Commented] (FLINK-37813) SlotManager re-allocation slots upon failover causes ResourceManager start more TaskManager and release unwanted TaskManager failure

Lijie Wang (Jira) Mon, 08 Sep 2025 07:24:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018835#comment-18018835
 ]


Lijie Wang commented on FLINK-37813:
------------------------------------

I encountered the same problem. It occurs when the JobManager restarts and the 
TaskManager pods are recovered from the previous attempt. 

!image-2025-09-08-22-22-20-365.png|width=1185,height=185!
taskmanager-1-1 has only one slot, but it reports two here.

> SlotManager  re-allocation slots  upon failover causes ResourceManager start 
> more  TaskManager and release unwanted TaskManager failure
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: image-2025-09-08-22-19-26-427.png, 
> image-2025-09-08-22-22-20-365.png, new-jm.log, old-jm.log, 
> re-allocate-slot.png, slot-report.png, 注册的tm.png
>
>
> Environment:
>  * version: 1.17
>  * resource provider:
>  * job description: parallelism=27, slotPerWorker=10, so 3 workers are needed
>  * job config: cluster.fine-grained-resource-management.enabled=true
>  
> Issue description:
>  * After a JobManager failover, the SlotReport sent by the re-registered 
> TaskManagers does not match expectations, so the ResourceManager is unable 
> to release the idle TaskManagers.
>  
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager then starts 
> reallocating slots to the remaining TaskManagers. Before the slot allocation 
> completes, kill the JobManager so that the job enters the SUSPENDED state. 
> There is a chance that `FineGrainedSlotManager` calls 
> `declareNeededResources()` again and allocates slots after they have been 
> released. [^old-jm.log] 
> ^!re-allocate-slot.png|width=1152,height=501!^
> 2. After the new JobManager is launched, the existing TaskManagers register 
> with it, and the slotNum in the SlotReport sent by an existing TaskManager 
> is larger than slotPerWorker. As a result, the `ActiveResourceManager` 
> cannot correctly calculate 'releaseOrRequestWorkerNumber' when it 
> periodically checks for and releases idle TaskManagers. [^new-jm.log]
> !slot-report.png|width=991,height=554!
> !注册的tm.png|width=372,height=190!
>  
>  
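
For readers following the numbers in the description above, here is a minimal
sketch of the arithmetic involved. It is not Flink code and does not reproduce
how `ActiveResourceManager` actually computes 'releaseOrRequestWorkerNumber'.
The class `SlotReportSketch`, both helper methods, and the worker ids and slot
counts in `main` are hypothetical and chosen only for illustration. The sketch
shows why parallelism=27 with slotPerWorker=10 needs 3 workers, and how a
reported slot count larger than slotPerWorker could be flagged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlotReportSketch {

    /** Workers needed to host `parallelism` slots when each worker offers `slotsPerWorker` (ceiling division). */
    public static int requiredWorkers(int parallelism, int slotsPerWorker) {
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    /** Hypothetical diagnostic: ids of workers whose reported slot count exceeds the configured slots per worker. */
    public static List<String> overReportingWorkers(Map<String, Integer> reportedSlots, int slotsPerWorker) {
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : reportedSlots.entrySet()) {
            if (entry.getValue() > slotsPerWorker) {
                suspicious.add(entry.getKey());
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // Values from the issue description: parallelism=27, slotPerWorker=10.
        System.out.println(requiredWorkers(27, 10)); // prints 3

        // Made-up slot reports; "tm-2" reports more slots than one worker can have.
        Map<String, Integer> reported = new LinkedHashMap<>();
        reported.put("tm-1", 10);
        reported.put("tm-2", 12);
        reported.put("tm-3", 10);
        System.out.println(overReportingWorkers(reported, 10)); // prints [tm-2]
    }
}
```

A check of this kind only detects the inconsistent report. It does not explain
why the TaskManager reports the extra slots after the failover.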



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
