Hello,

I’ve recently started using the Flink Kubernetes Operator with the autoscaler feature and have run into OOMKilled issues. From my investigation, the operator automatically calculates and adjusts the memory settings based on the initial configuration and the current traffic. This mechanism works in principle, and I can see the deployments being adjusted as data is processed. However, the autoscaler tends to set the CPU and memory resource limits for the Kubernetes pods too low, which results in the pods being OOMKilled once they exceed those limits.
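To make the setup concrete, the autoscaler-related part of my flinkConfiguration looks roughly like the snippet below. The values are illustrative rather than the exact ones from the attached file, and the two limit-factor options at the end are keys I found in the Flink Kubernetes documentation but have not confirmed are honoured by the autoscaler's memory tuning:

  flinkConfiguration:
    # Autoscaler enabled; my understanding is that memory tuning is what rewrites the memory settings.
    job.autoscaler.enabled: "true"
    job.autoscaler.memory.tuning.enabled: "true"
    # Baseline memory the operator starts from (illustrative value).
    taskmanager.memory.process.size: "12g"
    # Default 1.0, i.e. limits == requests; unclear to me whether the tuned settings apply these factors.
    kubernetes.taskmanager.memory.limit-factor: "1.2"
    kubernetes.taskmanager.cpu.limit-factor: "1.2"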
The limits are being set almost equal to the total configured memory, without any additional buffer to provide leeway. I’ve tried to override or manually set the resource limits for the TaskManager, but these changes don’t seem to take effect. From the perspective of the CRD definition this configuration is permitted, but it doesn’t appear to be functioning as expected:
https://github.com/apache/flink-kubernetes-operator/blob/release-1.13/helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml#L865-L890

See the attachment for the full example; the relevant part is:

  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          # TODO: Investigate not working
          resources:
            limits:
              cpu: 3.5
              memory: "12Gi"

*I have a couple of questions:*
- Is this a known issue with the Flink Operator, or could it be a configuration problem on my end?
- Is there currently a way to explicitly define Kubernetes resource limits for the flink-main-container?

*Environment details:*
- Used FlinkDeployment CRD: see the attachment with all the settings (FlinkDeployment-Example.yaml)
- Flink: 2.1.1
- Flink Operator: 1.13
- Kubernetes: 1.33
- Python: 3.12
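For quick reference, the attached FlinkDeployment is shaped roughly like the trimmed sketch below; everything other than the resources block quoted earlier is a placeholder, and the attachment has the real structure and values:

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: example-job                       # placeholder name
  spec:
    taskManager:
      resource:
        memory: "12g"                       # illustrative baseline the operator/autoscaler works from
        cpu: 3.5
      podTemplate:                          # the override I am trying; the exact placement is in the attachment
        spec:
          containers:
            - name: flink-main-container
              resources:
                limits:
                  cpu: 3.5
                  memory: "12Gi"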
Any insights or suggestions would be greatly appreciated. Thank you!

Sebastian YEPES

Attachment: FlinkDeployment-Example.yaml (application/yaml)
