[jira] [Updated] (FLINK-37813) SlotManager re-allocation slots upon failover causes ResourceManager start more TaskManager and release unwanted TaskManager failure

Baozhu Zhao (Jira) Thu, 22 May 2025 23:16:25 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Baozhu Zhao updated FLINK-37813:
--------------------------------
    Summary: SlotManager re-allocation slots upon failover causes 
ResourceManager start more TaskManager and release unwanted TaskManager 
failure  (was: JobManager re-allocation slots upon failover causes 
ResourceManager start more TaskManager and release unwanted TaskManager 
failure)

> SlotManager re-allocation slots upon failover causes ResourceManager start 
> more TaskManager and release unwanted TaskManager failure
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: new-jm.log, old-jm.log, re-allocate-slot.png, 
> slot-report.png, 注册的tm.png
>
>
> Environment:
>  * version: 1.17
>  * resource provider:
>  * job desc: job parallelism=27, slotPerWorker=10, so 3 workers are needed
>  * job config: cluster.fine-grained-resource-management.enabled=true
>  
> Issue description:
>  * After a JobManager failover, the SlotReport sent by a re-registering 
> TaskManager does not match expectations, so the ResourceManager cannot 
> release the idle TaskManager.
>  
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager then 
> reallocates the slots to the existing TaskManagers. Before the slot 
> allocation completes, kill the JobManager, which puts the job in a SUSPEND 
> state. There is a chance that `FineGrainedSlotManager` calls the method 
> `declareNeededResources()` again and allocates slots after they have been 
> released. [^old-jm.log] 
> ^!re-allocate-slot.png|width=1152,height=501!^
> 2. After the new JobManager is launched, the existing TaskManagers register 
> again, and the slotNum in the SlotReport sent by an existing TaskManager is 
> larger than slotPerWorker. As a result, `ActiveResourceManager` cannot 
> correctly calculate 'releaseOrRequestWorkerNumber' during its scheduled 
> check that releases idle TaskManagers. [^new-jm.log]
> !slot-report.png|width=991,height=554!
> !注册的tm.png|width=372,height=190!
>  
>  
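
The figures in the report can be checked with a small sketch. It shows why parallelism=27 with slotPerWorker=10 needs 3 workers, and what it means for a reported slotNum to exceed slotPerWorker. The class and method names are hypothetical and are not Flink APIs, and the reported value 12 is a made-up illustration, not a number taken from the attached logs.

```java
// Sketch only: WorkerCountSketch and its methods are hypothetical names,
// not part of Flink. They restate the arithmetic from the issue description.
final class WorkerCountSketch {

    /** Number of workers needed to host `parallelism` slots (ceiling division). */
    static int workersNeeded(int parallelism, int slotsPerWorker) {
        if (parallelism < 0 || slotsPerWorker <= 0) {
            throw new IllegalArgumentException("parallelism must be >= 0 and slotsPerWorker > 0");
        }
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    /** True when a reported slot count is larger than the configured per-worker count. */
    static boolean reportExceedsConfiguredSlots(int reportedSlotNum, int slotsPerWorker) {
        return reportedSlotNum > slotsPerWorker;
    }

    public static void main(String[] args) {
        // Values from the issue: parallelism=27, slotPerWorker=10.
        System.out.println(workersNeeded(27, 10)); // prints 3
        // 12 is an assumed example of an over-sized slot report.
        System.out.println(reportExceedsConfiguredSlots(12, 10)); // prints true
    }
}
```

Any calculation that assumes every worker contributes exactly slotPerWorker slots would be thrown off by a report for which the second check returns true, which is consistent with the behaviour described in step 2.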



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
