Hi flink-kubernetes-operator maintainers,
We recently migrated to the official operator and are seeing a new issue
where our FlinkDeployments can fail and crashloop looking for a
non-existent savepoint. On closer inspection, the job is attempting to
restart from the savepoint specified in execution.savepoint.path. This
config is new for us (it wasn't set by our previous operator) and appears
to be set automatically behind the scenes by the official operator. The
savepoint referenced in execution.savepoint.path did exist at one point
but gets deleted after some amount of time (in the latest example, a few
hours). Then, when there is some pod disruption, the job attempts to
restart from the (now deleted) savepoint and starts crashlooping.
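Our current working theory (an assumption on our side, not something we
have verified in the operator code) is that the periodic savepoints we
enabled are eventually rotated out by the operator's savepoint history
cleanup, while the generated execution.savepoint.path still points at one
of the disposed savepoints. For reference, these are the settings we
believe are involved; we only set the first one (see the spec further
down), so the history limits would be at whatever the defaults are as we
understand them:

kubernetes.operator.periodic.savepoint.interval: 600s   # set by us, see spec below
kubernetes.operator.savepoint.history.max.count: 10     # not set by us; assumed default
kubernetes.operator.savepoint.history.max.age: 24h      # not set by us; assumed default

Please correct us if these keys are not the ones that govern when older
savepoints get cleaned up.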
We're hoping you can help us troubleshoot and figure out whether this can
be solved through configuration (we are using the equivalent configs from
our previous operator, where we did not have this issue). Details on
versions and Kubernetes state are below for your reference. Thank you for
your support!
Flink Version: 1.14.5
Flink Operator Version: 1.4.0
At the time of the issue, here is the config we see in the flink-config
ConfigMap (the savepoint savepoint-bad5e5-6ab08cf0808e had already been
deleted from S3 at this point):
kubernetes.jobmanager.replicas: 1
jobmanager.rpc.address: <SOMETHING>
metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
kubernetes.service-account: <SOMETHING>
kubernetes.cluster-id: <SOMETHING>
pipeline.auto-generate-uids: false
metrics.scope.tm: flink.taskmanager.metric
parallelism.default: 2
kubernetes.namespace: <SOMETHING>
metrics.reporters: prom
kubernetes.jobmanager.owner.reference: <SOMETHING>
metrics.reporter.prom.port: 9090
taskmanager.memory.process.size: 10G
kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
pipeline.name: <SOMETHING>
execution.savepoint.path: s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e
kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_12924532349572558288.yaml
state.backend.rocksdb.localdir: /rocksdb/
kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_1129545383743356980.yaml
web.cancel.enable: false
execution.checkpointing.timeout: 5 min
kubernetes.container.image.pull-policy: IfNotPresent
$internal.pipeline.job-id: bad5e5682b8f4fbefbf75b00d285ac10
kubernetes.jobmanager.cpu: 2.0
state.backend: filesystem
$internal.flink.version: v1_14
kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_824610597202468981.yaml
blob.server.port: 6124
kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:14
metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
state.savepoints.dir: s3://<SOMETHING>/savepoints
kubernetes.taskmanager.cpu: 2.0
execution.savepoint.ignore-unclaimed-state: true
$internal.application.program-args:
kubernetes.container.image: <SOMETHING>
taskmanager.numberOfTaskSlots: 1
metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
kubernetes.rest-service.exposed.type: ClusterIP
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
$internal.application.main: <SOMETHING>
metrics.scope.jm: flink.jobmanager.metric
execution.target: kubernetes-application
jobmanager.memory.process.size: 10G
metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
taskmanager.rpc.port: 6122
internal.cluster.execution-mode: NORMAL
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
pipeline.jars: local:///build/flink/usrlib/<SOMETHING>.jar
state.checkpoints.dir: s3://<SOMETHING>/checkpoints
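If it would help, we can also pull the savepoint-related part of the
FlinkDeployment status from the same time window. The sketch below only
shows the shape we expect based on our reading of the CRD (field names and
values are our assumptions / placeholders, not an actual excerpt), i.e. a
lastSavepoint/savepointHistory that no longer includes the path still
referenced by execution.savepoint.path:

status:
  jobStatus:
    savepointInfo:
      lastSavepoint:
        location: s3://<SOMETHING>/savepoints/savepoint-bad5e5-<newer-id>
        triggerType: PERIODIC
      savepointHistory:
        - location: s3://<SOMETHING>/savepoints/savepoint-bad5e5-<newer-id>
          triggerType: PERIODIC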
At the time of the issue, here is our FlinkDeployment Spec:
Spec:
Flink Configuration:
execution.checkpointing.timeout: 5 min
kubernetes.operator.job.restart.failed: true
kubernetes.operator.periodic.savepoint.interval: 600s
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9090
metrics.reporters: prom
metrics.scope.jm: flink.jobmanager.metric
metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
metrics.scope.tm: flink.taskmanager.metric
metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
pipeline.auto-generate-uids: false
pipeline.name: <SOMETHING>
state.backend: filesystem
state.backend.rocksdb.localdir: /rocksdb/
state.checkpoints.dir: s3://<SOMETHING>/checkpoints
state.savepoints.dir: s3://<SOMETHING>/savepoints
Flink Version: v1_14
Image: <SOMETHING>
Image Pull Policy: IfNotPresent
Job:
Allow Non Restored State: true
Args:
Entry Class: <SOMETHING>
Initial Savepoint Path: s3a://<SOMETHING>/savepoint-bad5e5-577c6a76aec5
Jar URI: local:///build/flink/usrlib/<SOMETHING>.jar
Parallelism: 2
State: running
Upgrade Mode: savepoint
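Lastly, if the savepoint history cleanup is indeed what removes these
files, would explicitly extending the retention in our spec be the
recommended way to avoid the crashloop, or is there a better-supported
option? A rough sketch of what we would try (key names as we understand
them from the operator docs, and assuming per-deployment overrides of
kubernetes.operator.* keys are honored; the values are just illustrative):

spec:
  flinkConfiguration:
    kubernetes.operator.savepoint.history.max.age: "72h"   # illustrative value
    kubernetes.operator.savepoint.history.max.count: "30"  # illustrative value

Please let us know if we are looking at the wrong knobs or if more
information would be useful.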