[jira] [Updated] (FLINK-37813) JobManager failover during allocation slots causes ResourceManager to release unwanted TaskManager failure

Baozhu Zhao (Jira) Mon, 19 May 2025 06:34:10 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-37813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Baozhu Zhao updated FLINK-37813:
--------------------------------
    Affects Version/s: 1.19.2

> JobManager failover during allocation slots causes ResourceManager to release 
> unwanted TaskManager failure
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37813
>                 URL: https://issues.apache.org/jira/browse/FLINK-37813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.2
>            Reporter: Baozhu Zhao
>            Priority: Major
>         Attachments: new-tm.log, old-tm.log, 注册的tm.png
>
>
> Environment:
>  * version: 1.17
>  * resource provider:
>  * job desc: parallelism=27, slotPerWorker=10, so 3 workers are needed
>  * job config: cluster.fine-grained-resource-management.enabled=true
>  
> Issue description:
>  * After a JobManager failover, the SlotReport sent by the re-registering 
> TaskManagers does not match expectations, so the ResourceManager is unable 
> to release the idle TaskManager.
>  
> Steps to reproduce:
> 1. Kill a TaskManager so that the job fails; the slot manager then starts 
> reallocating slots to the remaining TaskManagers. Before the slot allocation 
> completes, kill the JobManager, which puts the job into the SUSPEND state.  
> [^old-tm.log]
> 2. After the new JobManager is launched, the existing TaskManagers register 
> again, and the slotNum in the SlotReport they send is larger than 
> slotPerWorker. As a result, the `ActiveResourceManager` fails to correctly 
> calculate 'releaseOrRequestWorkerNumber' during its periodic check for idle 
> TaskManagers to release.   [^new-tm.log]
>  
> !注册的tm.png|width=372,height=190!
>  
>  
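
To make the numbers in the report above concrete, here is a minimal Java sketch. It is not Flink code and does not reproduce the `ActiveResourceManager` logic: the class and method names (`SlotReportSketch`, `requiredWorkers`, `overReportingTaskManagers`), the TaskManager ids, and the slot count of 14 are hypothetical and chosen only for illustration. It shows the expected worker count for parallelism=27 with slotPerWorker=10, and a simple check that flags any TaskManager whose reported slot count exceeds slotPerWorker, which is the symptom described in step 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical helpers, not part of Flink.
class SlotReportSketch {

    // Workers needed to host the job: ceil(parallelism / slotsPerWorker).
    static int requiredWorkers(int parallelism, int slotsPerWorker) {
        if (parallelism < 0 || slotsPerWorker <= 0) {
            throw new IllegalArgumentException("invalid parallelism or slotsPerWorker");
        }
        return (parallelism + slotsPerWorker - 1) / slotsPerWorker;
    }

    // Returns the ids (sorted) of TaskManagers whose reported slot count
    // is larger than the configured number of slots per worker.
    static List<String> overReportingTaskManagers(
            Map<String, Integer> reportedSlots, int slotsPerWorker) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(reportedSlots).entrySet()) {
            if (e.getValue() > slotsPerWorker) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values from the report: parallelism=27, slotPerWorker=10.
        System.out.println(requiredWorkers(27, 10));

        // Assumed slot counts, for illustration only.
        Map<String, Integer> reported = new TreeMap<>();
        reported.put("tm-old-1", 14);
        reported.put("tm-old-2", 10);
        reported.put("tm-new", 10);
        System.out.println(overReportingTaskManagers(reported, 10));
    }
}
```

With the values from the report, `requiredWorkers(27, 10)` evaluates to 3, matching the "3 workers" stated in the environment section. A check like `overReportingTaskManagers` could be run against the slot counts seen in new-tm.log to confirm which TaskManagers re-registered with more slots than slotPerWorker.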



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
