Clara Xiong created FLINK-36140:
-----------------------------------

             Summary: Log a warning when pods are terminated by kubernetes
                 Key: FLINK-36140
                 URL: https://issues.apache.org/jira/browse/FLINK-36140
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes
    Affects Versions: 1.19.1
            Reporter: Clara Xiong


Scheduled maintenance or faulty nodes on Kubernetes can result in random pod 
termination and eventually in a series of job restarts as the Kubernetes 
cluster nodes go through a rolling restart. The larger the job, the higher the 
chance that it is affected. Jobs should be able to auto-recover from these 
terminations, but the restarts can cause unwanted turbulence in large-scale 
pipelines.

In this case, it is very difficult to identify what is causing the restarts 
without already knowing about the issue at the Kubernetes layer and the keyword 
to search for, because the pod termination is logged only at INFO level.

We need to log this at a higher level. If changing it from INFO to ERROR breaks 
monitoring, we should at least log it as a warning.
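
As a rough illustration of the requested change, the following minimal sketch 
picks the log level for a pod-termination event and builds a message with a 
stable, searchable keyword. It is not Flink's actual code: the class 
PodTerminationLogging, its methods, the message wording, and the pod name are 
all hypothetical, and java.util.logging is used only to keep the example 
self-contained (its SEVERE and WARNING levels stand in for ERROR and WARN).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch only; this is not Flink's Kubernetes integration code. */
public class PodTerminationLogging {
    private static final Logger LOG =
            Logger.getLogger(PodTerminationLogging.class.getName());

    /**
     * Chooses the level for pod-termination events: SEVERE (ERROR) when
     * monitoring tolerates it, otherwise WARNING as the fallback.
     */
    public static Level terminationLevel(boolean errorLevelBreaksMonitoring) {
        return errorLevelBreaksMonitoring ? Level.WARNING : Level.SEVERE;
    }

    /** Builds a message with a fixed phrase that operators can search for. */
    public static String terminationMessage(String podName, String reason) {
        return String.format(
                "Pod %s was terminated by Kubernetes (reason: %s).", podName, reason);
    }

    /** Logs the termination at the chosen level instead of INFO. */
    public static void logPodTerminated(
            String podName, String reason, boolean errorLevelBreaksMonitoring) {
        LOG.log(terminationLevel(errorLevelBreaksMonitoring),
                terminationMessage(podName, reason));
    }

    public static void main(String[] args) {
        // Hypothetical pod name and reason, for illustration only.
        logPodTerminated("taskmanager-1-3", "node drained for maintenance", true);
    }
}
```

Keeping the level choice in one place would make it easy to switch between the 
two options discussed above without touching the call sites.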



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
