[ https://issues.apache.org/jira/browse/FLINK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nishant More updated FLINK-37773:
---------------------------------
    Attachment: Screenshot 2025-06-02 at 11.53.28 AM.png

> Extra TMs are started when Jobmanager is OOM killed in some FlinkDeployment runs
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-37773
>                 URL: https://issues.apache.org/jira/browse/FLINK-37773
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.10.0
>            Reporter: Nihar Rao
>            Priority: Major
>        Attachments: Screenshot 2025-04-24 at 3.29.47 PM.png, Screenshot 2025-06-02 at 11.53.28 AM.png
>
>
> Hi,
> We are running into a weird issue with Apache Flink Kubernetes Operator 1.10.0 and Apache Flink 1.19.1. We run jobs in native Kubernetes application mode using the FlinkDeployment CRD. The affected job runs with 24 taskmanagers and 1 jobmanager replica, with HA enabled (a minimal sketch of such a FlinkDeployment spec is shown after this description).
> Below is a chronological summary of events:
> 1. The job was initially started with 24 task managers.
> 2. The JM pod was OOMKilled. This is confirmed by our KSM metrics, and {{kubectl describe pod <JM pod>}} also shows that the pod restarted due to OOM.
> 3. After the JM was OOMKilled, it was restarted and 24 new taskmanager pods were started, which is confirmed by the available task slots section of the Flink UI.
> 4. There was no impact on the job (it restarted successfully), but there are now 48 taskmanagers running, of which 24 are standby. The expected behaviour after a JM OOM with HA enabled is that no new task managers are started.
> 5. I have attached a screenshot of the Flink UI showing the 24 extra TMs (48 task slots) and included the kubectl output below.
> I also checked the Kubernetes operator pod logs and found nothing that could explain this behaviour. This has happened a few times now with different jobs, and although we have purposely OOMKilled the jobmanager of one of our test jobs many times, we have not been able to reproduce it. It looks to be an edge case which is difficult to reproduce.
> Can you please help us figure out how to debug this, as the Kubernetes operator does not show any relevant information on why this happened? Thanks, and let me know if you need further information.
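>
> For context, here is a minimal sketch of what our FlinkDeployment spec roughly looks like. The image name, resource sizes, HA storage path and slot count below are illustrative placeholders rather than our actual values; only the application mode, HA setup, single JM replica and parallelism of 24 match the job described above:
>
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: ioi-quality
> spec:
>   image: <our-job-image>                            # placeholder: real image not shared
>   flinkVersion: v1_19
>   mode: native                                      # native Kubernetes application mode
>   serviceAccount: flink
>   flinkConfiguration:
>     taskmanager.numberOfTaskSlots: "1"              # assumption: 1 slot per TM, so parallelism 24 -> 24 TMs
>     high-availability.type: kubernetes              # HA enabled (Kubernetes HA services)
>     high-availability.storageDir: s3://<bucket>/ha  # placeholder: real HA storage not shared
>   jobManager:
>     replicas: 1
>     resource:
>       memory: "2048m"                               # placeholder: real sizes not shared
>       cpu: 1
>   taskManager:
>     resource:
>       memory: "4096m"                               # placeholder: real sizes not shared
>       cpu: 2
>   job:
>     jarURI: local:///opt/flink/usrlib/<job>.jar     # placeholder: real jar path not shared
>     parallelism: 24
>     upgradeMode: last-state
>
> Note that in native application mode the number of taskmanagers follows from the job parallelism and slots per TM rather than a fixed replica count, which is why exactly 24 TMs are expected here.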
>
> kubectl get pod output showing 24 extra TMs:
> NAME                           READY   STATUS    RESTARTS      AGE
> ioi-quality-667f575877-btfkv   1/1     Running   1 (39h ago)   4d16h
> ioi-quality-taskmanager-1-1    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-10   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-11   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-12   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-13   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-14   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-15   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-16   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-17   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-18   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-19   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-2    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-20   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-21   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-22   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-23   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-24   1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-3    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-4    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-5    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-6    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-7    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-8    1/1     Running   0             4d16h
> ioi-quality-taskmanager-1-9    1/1     Running   0             4d16h
> ioi-quality-taskmanager-2-1    1/1     Running   0             39h
> ioi-quality-taskmanager-2-10   1/1     Running   0             39h
> ioi-quality-taskmanager-2-11   1/1     Running   0             39h
> ioi-quality-taskmanager-2-12   1/1     Running   0             39h
> ioi-quality-taskmanager-2-13   1/1     Running   0             39h
> ioi-quality-taskmanager-2-14   1/1     Running   0             39h
> ioi-quality-taskmanager-2-15   1/1     Running   0             39h
> ioi-quality-taskmanager-2-16   1/1     Running   0             39h
> ioi-quality-taskmanager-2-17   1/1     Running   0             39h
> ioi-quality-taskmanager-2-18   1/1     Running   0             39h
> ioi-quality-taskmanager-2-19   1/1     Running   0             39h
> ioi-quality-taskmanager-2-2    1/1     Running   0             39h
> ioi-quality-taskmanager-2-20   1/1     Running   0             39h
> ioi-quality-taskmanager-2-21   1/1     Running   0             39h
> ioi-quality-taskmanager-2-22   1/1     Running   0             39h
> ioi-quality-taskmanager-2-23   1/1     Running   0             39h
> ioi-quality-taskmanager-2-24   1/1     Running   0             39h
> ioi-quality-taskmanager-2-3    1/1     Running   0             39h
> ioi-quality-taskmanager-2-4    1/1     Running   0             39h
> ioi-quality-taskmanager-2-5    1/1     Running   0             39h
> ioi-quality-taskmanager-2-6    1/1     Running   0             39h
> ioi-quality-taskmanager-2-7    1/1     Running   0             39h
> ioi-quality-taskmanager-2-8    1/1     Running   0             39h
> ioi-quality-taskmanager-2-9    1/1     Running   0             39h
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)