Aviv Dozorets created FLINK-35594:
-------------------------------------

Summary: Downscaling doesn't release TaskManagers
Key: FLINK-35594
URL: https://issues.apache.org/jira/browse/FLINK-35594
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.18.1
Environment:
* Flink 1.18.1 (Java 11, Temurin)
* Kubernetes Operator 1.8
* Kubernetes v1.28.9-eks-036c24b (AWS EKS)
Autoscaling configuration:
{code:java}
jobmanager.scheduler: adaptive
job.autoscaler.enabled: "true"
job.autoscaler.metrics.window: 15m
job.autoscaler.stabilization.interval: 15m
job.autoscaler.scaling.effectiveness.threshold: 0.2
job.autoscaler.target.utilization: "0.75"
job.autoscaler.target.utilization.boundary: "0.25"
job.autoscaler.metrics.busy-time.aggregator: "AVG"
job.autoscaler.restart.time-tracking.enabled: "true"
{code}

Reporter: Aviv Dozorets
Attachments: Screenshot 2024-06-10 at 12.50.37 PM.png

(Follow-up to a Slack conversation in the #troubleshooting channel.)

I've recently observed behavior that should be improved: a Flink DataStream job running with the autoscaler (backed by the Kubernetes Operator) and the adaptive scheduler doesn't release TaskManagers when scaling down. In my case the job started with an initial parallelism of 64 on 4 TaskManagers with 16 cores each (1:1 core:slot ratio), and was then scaled down to a parallelism of 16.

My expectation: a single TaskManager should remain up and running.
Reality: all 4 initial TaskManagers keep running, with varying, unequal numbers of available slots on each.

I didn't find an existing configuration option to change this behavior.
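For reference, below is a minimal sketch of the idle-release options I would have expected to be relevant here. Both exist in Flink's configuration, but whether the adaptive scheduler honors them is exactly what's unclear; the values are illustrative, not a verified fix:
{code:java}
# Sketch only, not a verified fix: existing idle-release settings.
# Whether they take effect under jobmanager.scheduler: adaptive is
# precisely what this issue questions.

# Release a TaskManager once it has been idle for this long (ms).
resourcemanager.taskmanager-timeout: 60000

# Release an idle slot from the JobMaster's slot pool after this long (ms).
slot.idle.timeout: 60000
{code}
If the adaptive scheduler intentionally ignores these, documenting that (or exposing an equivalent knob) would already help.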