Hi,

Operator savepoint retention and savepoint upgrades have nothing to do with each other, I think. Retention applies only to periodic savepoints triggered by the operator itself.
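For illustration, the operator-side settings that control periodic savepoints and how long they are kept look roughly like the fragment below (key names from the operator's configuration reference; the values are just example choices, not recommendations):

```
# flinkConfiguration fragment on the FlinkDeployment (example values)
kubernetes.operator.periodic.savepoint.interval: 600s
# How long / how many operator-triggered savepoints to retain before cleanup:
kubernetes.operator.savepoint.history.max.age: 72h
kubernetes.operator.savepoint.history.max.count: 10
```

If the retention window is shorter than the time between upgrades or restarts, a savepoint referenced in execution.savepoint.path can be cleaned up before the job ever tries to restore from it.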
I would upgrade to the latest 1.6.0 operator version before investigating further.

Cheers,
Gyula

On Sat, 23 Sep 2023 at 06:02, Nathan Moderwell <nathan.moderw...@robinhood.com> wrote:

> Small update on this. I see that the issue is that we use `upgradeMode:
> savepoint`, but have not configured the operator to retain savepoints for
> long enough (the previous operator we used never deleted savepoints, so we
> didn't run into this). I am reconfiguring to use `upgradeMode: last-state`
> and enabling HA to see if this provides us more stable job restoration on
> pod disruption.
>
> On Fri, Sep 22, 2023 at 10:20 AM Nathan Moderwell <nathan.moderw...@robinhood.com> wrote:
>
>> Hi flink-kubernetes-operator maintainers,
>>
>> We have recently migrated to the official operator and are seeing a new
>> issue where our FlinkDeployments can fail and crashloop looking for a
>> non-existent savepoint. On further inspection, the job is attempting to
>> restart from the savepoint specified in execution.savepoint.path. This
>> config is new for us (it wasn't set by our previous operator) and seems
>> to be set automatically behind the scenes by the official operator. We
>> see that the savepoint in execution.savepoint.path existed, but it gets
>> deleted after some amount of time (in the latest example, a few hours).
>> Then, when there is some pod disruption, the job attempts to restart from
>> the (deleted) savepoint and starts crashlooping.
>>
>> Hoping you can help us troubleshoot and figure out whether this can be
>> solved through configuration (we are using the equivalent configs from
>> our previous operator, where we did not have this issue). Adding some
>> details on versions and k8s state for your reference. Thank you for your
>> support!
>>
>> Flink Version: 1.14.5
>> Flink Operator Version: 1.4.0
>>
>> At the time of the issue, here is the flink-config we see in the
>> configmap (the savepoint savepoint-bad5e5-6ab08cf0808e has been deleted
>> from s3 at this point):
>>
>> kubernetes.jobmanager.replicas: 1
>> jobmanager.rpc.address: <SOMETHING>
>> metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
>> kubernetes.service-account: <SOMETHING>
>> kubernetes.cluster-id: <SOMETHING>
>> pipeline.auto-generate-uids: false
>> metrics.scope.tm: flink.taskmanager.metric
>> parallelism.default: 2
>> kubernetes.namespace: <SOMETHING>
>> metrics.reporters: prom
>> kubernetes.jobmanager.owner.reference: <SOMETHING>
>> metrics.reporter.prom.port: 9090
>> taskmanager.memory.process.size: 10G
>> kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
>> pipeline.name: <SOMETHING>
>> execution.savepoint.path: s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e
>> kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_12924532349572558288.yaml
>> state.backend.rocksdb.localdir: /rocksdb/
>> kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_1129545383743356980.yaml
>> web.cancel.enable: false
>> execution.checkpointing.timeout: 5 min
>> kubernetes.container.image.pull-policy: IfNotPresent
>> $internal.pipeline.job-id: bad5e5682b8f4fbefbf75b00d285ac10
>> kubernetes.jobmanager.cpu: 2.0
>> state.backend: filesystem
>> $internal.flink.version: v1_14
>> kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_824610597202468981.yaml
>> blob.server.port: 6124
>> kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:14
>> metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
>> state.savepoints.dir: s3://<SOMETHING>/savepoints
>> kubernetes.taskmanager.cpu: 2.0
>> execution.savepoint.ignore-unclaimed-state: true
>> $internal.application.program-args:
>> kubernetes.container.image: <SOMETHING>
>> taskmanager.numberOfTaskSlots: 1
>> metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
>> kubernetes.rest-service.exposed.type: ClusterIP
>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>> $internal.application.main: <SOMETHING>
>> metrics.scope.jm: flink.jobmanager.metric
>> execution.target: kubernetes-application
>> jobmanager.memory.process.size: 10G
>> metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
>> taskmanager.rpc.port: 6122
>> internal.cluster.execution-mode: NORMAL
>> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
>> pipeline.jars: local:///build/flink/usrlib/<SOMETHING>.jar
>> state.checkpoints.dir: s3://<SOMETHING>/checkpoints
>>
>> At the time of the issue, here is our FlinkDeployment Spec:
>>
>> Spec:
>>   Flink Configuration:
>>     execution.checkpointing.timeout: 5 min
>>     kubernetes.operator.job.restart.failed: true
>>     kubernetes.operator.periodic.savepoint.interval: 600s
>>     metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>>     metrics.reporter.prom.port: 9090
>>     metrics.reporters: prom
>>     metrics.scope.jm: flink.jobmanager.metric
>>     metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
>>     metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
>>     metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
>>     metrics.scope.tm: flink.taskmanager.metric
>>     metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
>>     pipeline.auto-generate-uids: false
>>     pipeline.name: <SOMETHING>
>>     state.backend: filesystem
>>     state.backend.rocksdb.localdir: /rocksdb/
>>     state.checkpoints.dir: s3://<SOMETHING>/checkpoints
>>     state.savepoints.dir: s3://<SOMETHING>/savepoints
>>   Flink Version: v1_14
>>   Image: <SOMETHING>
>>   Image Pull Policy: IfNotPresent
>>   Job:
>>     Allow Non Restored State: true
>>     Args:
>>     Entry Class: <SOMETHING>
>>     Initial Savepoint Path: s3a://<SOMETHING>/savepoint-bad5e5-577c6a76aec5
>>     Jar URI: local:///build/flink/usrlib/<SOMETHING>.jar
>>     Parallelism: 2
>>     State: running
>>     Upgrade Mode: savepoint
>>
>
> --
>
> <http://www.robinhood.com/>
>
> Nathan Moderwell
>
> Senior Machine Learning Engineer
>
> Menlo Park, CA
>
> Don't copy, share, or use this email without permission. If you received
> it by accident, please let us know and then delete it right away.
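For reference, the switch Nathan describes in his update (last-state upgrades with HA enabled) might look roughly like the FlinkDeployment fragment below. This is a sketch, not a verified setup: the name and storage paths are placeholders, and it assumes Kubernetes-based HA, which last-state upgrade mode requires since the job is restored from the latest checkpoint recorded in HA metadata rather than from a savepoint:

```
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job          # placeholder
spec:
  flinkVersion: v1_14
  flinkConfiguration:
    # Kubernetes HA is a prerequisite for upgradeMode: last-state
    high-availability: kubernetes
    high-availability.storageDir: s3://<SOMETHING>/ha
    state.checkpoints.dir: s3://<SOMETHING>/checkpoints
  job:
    jarURI: local:///build/flink/usrlib/<SOMETHING>.jar
    parallelism: 2
    upgradeMode: last-state
```

With this mode the operator no longer depends on a retained savepoint surviving until the next restart, which sidesteps the retention problem described in the thread (at the cost of relying on externalized checkpoints and HA metadata instead).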