Max Feng created FLINK-37483:
--------------------------------

             Summary: Native Kubernetes clusters losing checkpoint state on FAILED
                 Key: FLINK-37483
                 URL: https://issues.apache.org/jira/browse/FLINK-37483
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.20.1
            Reporter: Max Feng


We're running Flink 1.20 with native Kubernetes application-mode clusters, and 
we're running into an issue where clusters restart without restoring the 
checkpoint referenced in the Kubernetes HA ConfigMaps.

To the best of our understanding, here's what's happening: 

1) We're running application-mode clusters on native Kubernetes with 
externalized checkpoints, retained on cancellation (see the configuration 
sketch after this list). We attempt to restore a job from a checkpoint; the 
checkpoint reference is held in the Kubernetes HA ConfigMap.
2) The job goes to state FAILED.
3) The HA ConfigMap containing the checkpoint reference is cleaned up.
4) The JobManager pod exits. Because it is managed by a Kubernetes Deployment, 
it is immediately restarted.
5) Upon restart, the new JobManager finds no checkpoint to restore from.
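
For reference, this is roughly the configuration involved; the cluster-id, 
bucket, and paths below are illustrative placeholders, not our actual values:

    kubernetes.cluster-id: my-application-cluster
    high-availability.type: kubernetes
    high-availability.storageDir: s3://my-bucket/flink/ha
    state.checkpoints.dir: s3://my-bucket/flink/checkpoints
    execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION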

We think this is a bad combination of the following behaviors:
* Reaching the terminal state FAILED triggers cleanup, which deletes the HA 
ConfigMaps in native Kubernetes mode (this can be observed with the kubectl 
sketch below).
* FAILED does not actually stop the job in native Kubernetes mode; because the 
JobManager runs as a Kubernetes Deployment, the pod is immediately restarted 
and the job is retried.
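
A minimal way to observe the first behavior from outside the cluster, assuming 
the labels Flink's Kubernetes HA services attach to their ConfigMaps (the 
cluster-id is again a placeholder):

    # Watch the HA ConfigMaps for the cluster; they disappear once the job
    # reaches FAILED and cleanup runs.
    kubectl get configmaps --watch \
      --selector='app=my-application-cluster,configmap-type=high-availability'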


