Hello,

I’ve recently started using the Flink Kubernetes Operator with the
autoscaler feature and have run into OOMKilled issues. From my
investigation, the operator automatically calculates and adjusts memory
settings based on the initial configuration and the current traffic. This
mechanism works in principle, and I can see the deployments being adjusted
as data is processed. However, I’ve noticed that the autoscaler tends to
set the CPU and memory resource limits of the Kubernetes pods too low,
which results in the pods being killed for exceeding their resources.

The limits are set almost equal to the total configured memory, without
any additional buffer to provide some leeway.
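
For context, the relevant part of my flinkConfiguration looks roughly like
this (the exact values are in the attached FlinkDeployment-Example.yaml;
the memory tuning key name is the one from the autoscaler docs, and the
size below is only illustrative):

  flinkConfiguration:
    job.autoscaler.enabled: "true"
    # Lets the operator recompute and rewrite the TaskManager memory settings
    job.autoscaler.memory.tuning.enabled: "true"
    # Illustrative size, not the exact value from the attachment
    taskmanager.memory.process.size: "12g"

With a setup like this, the pod memory limit ends up at roughly the
configured process size, so any usage above that immediately hits the
limit and the container is killed.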

I’ve tried to override or manually set the resource limits for the
TaskManager via the pod template, but these changes don’t seem to take
effect. According to the CRD definition this configuration is permitted,
but it doesn’t appear to work as expected:
https://github.com/apache/flink-kubernetes-operator/blob/release-1.13/helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml#L865-L890

See the attachment for the full example; the relevant part is:

>   podTemplate:
>     spec:
>       containers:
>         - name: flink-main-container
>           # TODO: Investigate not working
>           resources:
>             limits:
>               cpu: 3.5
>               memory: "12Gi"
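
What I’m effectively looking for is a way to add headroom between the
configured Flink memory and the pod limit. There appear to be limit-factor
options in Flink’s native Kubernetes configuration that sound like exactly
that, but I haven’t been able to confirm whether the operator-managed
autoscaler honors them when it rewrites the resources, so the snippet
below is only what I would try, not something I have verified:

  flinkConfiguration:
    # From Flink’s native Kubernetes configuration options (default 1.0);
    # unclear to me whether the operator/autoscaler respects these.
    kubernetes.taskmanager.memory.limit-factor: "1.2"
    kubernetes.taskmanager.cpu.limit-factor: "1.5"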

*I have a couple of questions:*
- Is this a known issue with the Flink Operator, or could it be a
configuration problem on my end?
- Is there currently a way to explicitly define Kubernetes resource limits
for the flink-main-container?

*Environment details:*
 FlinkDeployment CRD used: see the attachment with all the settings
(FlinkDeployment-Example.yaml)
 Flink: 2.1.1
 Flink Operator: 1.13
 Kubernetes: 1.33
 Python: 3.12


Any insights or suggestions would be greatly appreciated.

Thank you!
 Sebastian YEPES

Attachment: FlinkDeployment-Example.yaml