Baozhu Zhao created FLINK-37813:
-----------------------------------

             Summary: JobManager failover during allocation slots causes ResourceManager to release unwanted TaskManager failure
                 Key: FLINK-37813
                 URL: https://issues.apache.org/jira/browse/FLINK-37813
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.17.2
         Environment:
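Flink on Kubernetes runtime environment
Flink version: 1.17
The job requires 3 TaskManagers with 10 slots each. The option `cluster.fine-grained-resource-management.enabled=true` is set.
            Reporter: Baozhu Zhao
         Attachments: new-tm.log, old-tm.log, 注册的tm.png


Environment:
Flink on Kubernetes. The job requires 3 TaskManagers with 10 slots each. The option `cluster.fine-grained-resource-management.enabled=true` is set.

Problem:
After a JobManager failover, the slot reports sent by the re-registering TaskManagers are not as expected, so idle TaskManagers cannot be released.

Steps to reproduce:
1. Kill one TaskManager so that the job fails over. The slot manager re-allocates slots on the surviving TaskManagers. Before the slot allocation completes, kill the JobManager; the job enters the suspending state. [^old-tm.log]
2. After the new JobManager starts, the surviving TaskManagers register with it. The slotReport each of them sends contains more slots than that of a normal TaskManager. As a result, when the ResourceManager periodically checks for and releases idle TaskManagers, it cannot compute `releaseOrRequestWorkerNumber` correctly, and the idle TaskManager is not released. [^new-tm.log]

!注册的tm.png|width=372,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)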