Clara Xiong created FLINK-36140:
-----------------------------------
Summary: Log a warning when pods are terminated by kubernetes
Key: FLINK-36140
URL: https://issues.apache.org/jira/browse/FLINK-36140
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.19.1
Reporter: Clara Xiong
Scheduled maintenance or buggy nodes on Kubernetes can result in random pod
termination and, during a rolling restart of the Kubernetes cluster nodes, a
series of job restarts. The larger the job, the higher the chance that it is
affected. Jobs should be able to auto-recover from these issues, but the
restarts can cause unwanted turbulence in large-scale pipelines.
In this case, it is very difficult to identify what is causing the restarts
without already knowing about the issue at the Kubernetes layer and the keyword
to search for, because the pod termination is logged only at INFO level.
We need to log this at a higher level. If changing it from INFO to ERROR breaks
monitoring, we should at least log it as a warning.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)