Small update on this. I see that the issue is that we use `upgradeMode: savepoint` but have not configured the operator to retain savepoints for long enough (the previous operator we used never deleted savepoints, so we never ran into this). I am reconfiguring to use `upgradeMode: last-state` and enabling HA to see if this gives us more stable job restoration on pod disruption.
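For reference, here is roughly the change I am making in the FlinkDeployment (a sketch, not our exact manifest; the HA keys below are the standard Flink 1.14 Kubernetes HA settings, and the storage path is a placeholder):

    spec:
      job:
        # last-state restores from the latest checkpoint tracked in HA
        # metadata, so it does not depend on a savepoint file still existing
        upgradeMode: last-state
      flinkConfiguration:
        # Kubernetes HA services (required for last-state upgrades)
        high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
        high-availability.storageDir: s3://<SOMETHING>/ha

If I understand the docs correctly, this makes restoration after a pod disruption depend on the checkpoints recorded in HA metadata rather than on execution.savepoint.path pointing at a savepoint that may already have been cleaned up.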
On Fri, Sep 22, 2023 at 10:20 AM Nathan Moderwell <nathan.moderw...@robinhood.com> wrote:

> Hi flink-kubernetes-operator maintainers,
>
> We have recently migrated to the official operator and are seeing a new
> issue where our FlinkDeployments can fail and crashloop looking for a
> non-existent savepoint. On further inspection, the job is attempting to
> restart from the savepoint specified in execution.savepoint.path. This
> config is new for us (it wasn't set by our previous operator) and seems to
> be set automatically behind the scenes by the official operator. We see
> that the savepoint in execution.savepoint.path existed but was deleted
> after some amount of time (in the latest example, a few hours). Then, when
> there is some pod disruption, the job attempts to restart from the
> (deleted) savepoint and starts crashlooping.
>
> Hoping you can help us troubleshoot and figure out whether this can be
> solved through configuration (we are using the equivalent configs from our
> previous operator, where we did not have this issue). Adding some details
> on version and k8s state for your reference. Thank you for your support!
>
> Flink Version: 1.14.5
> Flink Operator Version: 1.4.0
>
> At the time of the issue, here is the flink-config we see in the configmap
> (the savepoint savepoint-bad5e5-6ab08cf0808e has been deleted from s3 at
> this point):
>
> kubernetes.jobmanager.replicas: 1
> jobmanager.rpc.address: <SOMETHING>
> metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
> kubernetes.service-account: <SOMETHING>
> kubernetes.cluster-id: <SOMETHING>
> pipeline.auto-generate-uids: false
> metrics.scope.tm: flink.taskmanager.metric
> parallelism.default: 2
> kubernetes.namespace: <SOMETHING>
> metrics.reporters: prom
> kubernetes.jobmanager.owner.reference: <SOMETHING>
> metrics.reporter.prom.port: 9090
> taskmanager.memory.process.size: 10G
> kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
> pipeline.name: <SOMETHING>
> execution.savepoint.path: s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e
> kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_12924532349572558288.yaml
> state.backend.rocksdb.localdir: /rocksdb/
> kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_1129545383743356980.yaml
> web.cancel.enable: false
> execution.checkpointing.timeout: 5 min
> kubernetes.container.image.pull-policy: IfNotPresent
> $internal.pipeline.job-id: bad5e5682b8f4fbefbf75b00d285ac10
> kubernetes.jobmanager.cpu: 2.0
> state.backend: filesystem
> $internal.flink.version: v1_14
> kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_824610597202468981.yaml
> blob.server.port: 6124
> kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:14
> metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
> state.savepoints.dir: s3://<SOMETHING>/savepoints
> kubernetes.taskmanager.cpu: 2.0
> execution.savepoint.ignore-unclaimed-state: true
> $internal.application.program-args:
> kubernetes.container.image: <SOMETHING>
> taskmanager.numberOfTaskSlots: 1
> metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
> kubernetes.rest-service.exposed.type: ClusterIP
> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
> $internal.application.main: <SOMETHING>
> metrics.scope.jm: flink.jobmanager.metric
> execution.target: kubernetes-application
> jobmanager.memory.process.size: 10G
> metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
> taskmanager.rpc.port: 6122
> internal.cluster.execution-mode: NORMAL
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
> pipeline.jars: local:///build/flink/usrlib/<SOMETHING>.jar
> state.checkpoints.dir: s3://<SOMETHING>/checkpoints
>
> At the time of the issue, here is our FlinkDeployment Spec:
>
> Spec:
>   Flink Configuration:
>     execution.checkpointing.timeout: 5 min
>     kubernetes.operator.job.restart.failed: true
>     kubernetes.operator.periodic.savepoint.interval: 600s
>     metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>     metrics.reporter.prom.port: 9090
>     metrics.reporters: prom
>     metrics.scope.jm: flink.jobmanager.metric
>     metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
>     metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
>     metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
>     metrics.scope.tm: flink.taskmanager.metric
>     metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
>     pipeline.auto-generate-uids: false
>     pipeline.name: <SOMETHING>
>     state.backend: filesystem
>     state.backend.rocksdb.localdir: /rocksdb/
>     state.checkpoints.dir: s3://<SOMETHING>/checkpoints
>     state.savepoints.dir: s3://<SOMETHING>/savepoints
>   Flink Version: v1_14
>   Image: <SOMETHING>
>   Image Pull Policy: IfNotPresent
>   Job:
>     Allow Non Restored State: true
>     Args:
>     Entry Class: <SOMETHING>
>     Initial Savepoint Path: s3a://<SOMETHING>/savepoint-bad5e5-577c6a76aec5
>     Jar URI: local:///build/flink/usrlib/<SOMETHING>.jar
>     Parallelism: 2
>     State: running
>     Upgrade Mode: savepoint
>
> --
> Nathan Moderwell
> Senior Machine Learning Engineer
> Menlo Park, CA
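P.S. For anyone hitting the same issue who wants to stay on `upgradeMode: savepoint`: the operator keeps a bounded savepoint history and disposes of entries that fall out of it. If I am reading the 1.4.0 docs correctly, retention can be extended with something like the following (values are illustrative, not a recommendation):

    kubernetes.operator.savepoint.history.max.age: 72h
    kubernetes.operator.savepoint.history.max.count: 10

That should keep the savepoint referenced by execution.savepoint.path around long enough to survive a pod disruption between periodic savepoints.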