Baozhu Zhao created FLINK-37813:
-----------------------------------

             Summary: JobManager failover during allocation slots causes 
ResourceManager to release unwanted TaskManager failure
                 Key: FLINK-37813
                 URL: https://issues.apache.org/jira/browse/FLINK-37813
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.17.2
         Environment: Environment description:

Flink running on k8s

Flink version 1.17

The job requires 3 TaskManagers with 10 slots each. The option 
`cluster.fine-grained-resource-management.enabled=true` is enabled.
            Reporter: Baozhu Zhao
         Attachments: new-tm.log, old-tm.log, 注册的tm.png

Environment description:

Flink running on k8s

The job requires 3 TaskManagers with 10 slots each. The option 
`cluster.fine-grained-resource-management.enabled=true` is enabled.

Problem description:

After a JobManager failover, the slot reports sent by the re-registering TaskManagers are not as expected, so idle TaskManagers cannot be released.

 

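Steps to reproduce:

1. Kill one TaskManager so that the job fails over. The slot manager then re-allocates slots 
onto the remaining TaskManagers. Before slot allocation completes, kill the JobManager; the job enters the suspending state. [^old-tm.log]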

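2. After the new JM starts, the remaining TaskManagers register with it. The slotReport each of them 
sends at this point contains more slots than that of a normal TaskManager. As a result, when the ResourceManager 
periodically checks for and releases idle TaskManagers, it cannot compute `releaseOrRequestWorkerNumber` correctly, and the idle TaskManagers are not released. [^new-tm.log]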
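
To spot the symptom in step 2, the registered slot counts can be compared with the configured slots per TaskManager (10 in this setup). The following is only a minimal diagnostic sketch, not Flink code: the class `SlotReportCheck`, the method `overReporting`, the TaskManager IDs and the slot counts are all hypothetical, and the counts would have to be read manually from the ResourceManager log or the web UI.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Flink: flags TaskManagers whose
// reported slot count is larger than the configured slots per TaskManager.
public class SlotReportCheck {

    // reportedSlots maps a TaskManager ID to the number of slots in its slot report.
    public static List<String> overReporting(Map<String, Integer> reportedSlots,
                                             int slotsPerTaskManager) {
        return reportedSlots.entrySet().stream()
                .filter(e -> e.getValue() > slotsPerTaskManager)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up example values; insertion order is preserved by LinkedHashMap.
        Map<String, Integer> reported = new LinkedHashMap<>();
        reported.put("tm-1", 10);
        reported.put("tm-2", 13); // reports more slots than a normal TaskManager
        reported.put("tm-3", 10);

        System.out.println(overReporting(reported, 10)); // prints [tm-2]
    }
}
```

Any TaskManager returned by such a check would be a candidate for the inflated slotReport described above.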

 

!注册的tm.png|width=372,height=190!

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
