Hi

I think the operator's savepoint retention and savepoint-based upgrades are
unrelated. Retention applies only to periodic savepoints triggered by the
operator itself.
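
For reference, a sketch of the operator options that control cleanup of
savepoint history, assuming the `kubernetes.operator.savepoint.history.*`
keys from the operator configuration reference. The values are illustrative
only, not a recommendation for this job:

```yaml
spec:
  flinkConfiguration:
    # Interval already present in the FlinkDeployment quoted below.
    kubernetes.operator.periodic.savepoint.interval: 600s
    # Illustrative values: maximum age and count of savepoint history entries.
    kubernetes.operator.savepoint.history.max.age: 24 h
    kubernetes.operator.savepoint.history.max.count: "10"
```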

I would upgrade to the latest 1.6.0 operator version before investigating
further.

Cheers
Gyula


On Sat, 23 Sep 2023 at 06:02, Nathan Moderwell <
nathan.moderw...@robinhood.com> wrote:

> Small update on this. I see that the issue is that we use `upgradeMode:
> savepoint`, but have not configured the operator to retain savepoints for
> long enough (the previous operator we used never deleted savepoints so we
> didn't run into this). I am reconfiguring to use `upgradeMode: last-state`
> and enabling HA to see if this provides us more stable job restoration on
> pod disruption.
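>
> A minimal sketch of that change, assuming Kubernetes-based HA with the
> factory class used for Flink 1.14. The `high-availability.storageDir`
> path is a placeholder and the rest of the spec is omitted:
>
> ```yaml
> spec:
>   flinkConfiguration:
>     # last-state upgrades restore from HA metadata, so HA must be enabled.
>     high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
>     # Placeholder location (assumption), not taken from the real deployment.
>     high-availability.storageDir: s3://<SOMETHING>/ha
>   job:
>     upgradeMode: last-state
> ```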
>
> On Fri, Sep 22, 2023 at 10:20 AM Nathan Moderwell <
> nathan.moderw...@robinhood.com> wrote:
>
>> Hi flink-kubernetes-operator maintainers,
>>
>> We have recently migrated to the official operator and are seeing a new
>> issue where our FlinkDeployments can fail and crashloop looking for a
>> non-existent savepoint. On further inspection, the job is attempting to
>> restart from the savepoint specified in execution.savepoint.path. This
>> config is new for us (it wasn't set by the previous operator) and seems to
>> be set automatically behind the scenes by the official operator. We see
>> that the savepoint in execution.savepoint.path existed but was deleted
>> after some amount of time (in the latest example, a few hours). Then, when
>> there is some pod disruption, the job attempts to restart from the
>> savepoint (which was deleted) and starts crashlooping.
>>
>> Hoping you can help us troubleshoot and figure out if this can be solved
>> through configuration (we are using equivalent configs from our previous
>> operator where we did not have this issue). Adding some details on version
>> and k8s state for your reference. Thank you for your support!
>>
>> Flink Version: 1.14.5
>> Flink Operator Version: 1.4.0
>>
>> At the time of the issue, here is the flink-config we see in the
>> configmap (the savepoint savepoint-bad5e5-6ab08cf0808e has been deleted
>> from s3 at this point):
>>
>> kubernetes.jobmanager.replicas: 1
>> jobmanager.rpc.address: <SOMETHING>
>> metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
>> kubernetes.service-account: <SOMETHING>
>> kubernetes.cluster-id: <SOMETHING>
>> pipeline.auto-generate-uids: false
>> metrics.scope.tm: flink.taskmanager.metric
>> parallelism.default: 2
>> kubernetes.namespace: <SOMETHING>
>> metrics.reporters: prom
>> kubernetes.jobmanager.owner.reference: <SOMETHING>
>> metrics.reporter.prom.port: 9090
>> taskmanager.memory.process.size: 10G
>> kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
>> pipeline.name: <SOMETHING>
>> execution.savepoint.path: s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e
>> kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_12924532349572558288.yaml
>> state.backend.rocksdb.localdir: /rocksdb/
>> kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_1129545383743356980.yaml
>> web.cancel.enable: false
>> execution.checkpointing.timeout: 5 min
>> kubernetes.container.image.pull-policy: IfNotPresent
>> $internal.pipeline.job-id: bad5e5682b8f4fbefbf75b00d285ac10
>> kubernetes.jobmanager.cpu: 2.0
>> state.backend: filesystem
>> $internal.flink.version: v1_14
>> kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_824610597202468981.yaml
>> blob.server.port: 6124
>> kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:14
>> metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
>> state.savepoints.dir: s3://<SOMETHING>/savepoints
>> kubernetes.taskmanager.cpu: 2.0
>> execution.savepoint.ignore-unclaimed-state: true
>> $internal.application.program-args:
>> kubernetes.container.image: <SOMETHING>
>> taskmanager.numberOfTaskSlots: 1
>> metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
>> kubernetes.rest-service.exposed.type: ClusterIP
>> metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
>> $internal.application.main: <SOMETHING>
>> metrics.scope.jm: flink.jobmanager.metric
>> execution.target: kubernetes-application
>> jobmanager.memory.process.size: 10G
>> metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
>> taskmanager.rpc.port: 6122
>> internal.cluster.execution-mode: NORMAL
>> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
>> pipeline.jars: local:///build/flink/usrlib/<SOMETHING>.jar
>> state.checkpoints.dir: s3://<SOMETHING>/checkpoints
>>
>> At the time of the issue, here is our FlinkDeployment Spec:
>>
>> Spec:
>>   Flink Configuration:
>>     execution.checkpointing.timeout:                  5 min
>>     kubernetes.operator.job.restart.failed:           true
>>     kubernetes.operator.periodic.savepoint.interval:  600s
>>     metrics.reporter.prom.class:                      org.apache.flink.metrics.prometheus.PrometheusReporter
>>     metrics.reporter.prom.port:                       9090
>>     metrics.reporters:                                prom
>>     metrics.scope.jm:                                 flink.jobmanager.metric
>>     metrics.scope.jm.job:                             flink.jobmanager.job.<job_name>.metric
>>     metrics.scope.operator:                           flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
>>     metrics.scope.task:                               flink.taskmanager.job.<job_name>.task.<task_name>.metric
>>     metrics.scope.tm:                                 flink.taskmanager.metric
>>     metrics.scope.tm.job:                             flink.taskmanager.job.<job_name>.metric
>>     pipeline.auto-generate-uids:                      false
>>     pipeline.name:                                    <SOMETHING>
>>     state.backend:                                    filesystem
>>     state.backend.rocksdb.localdir:                   /rocksdb/
>>     state.checkpoints.dir:                            s3://<SOMETHING>/checkpoints
>>     state.savepoints.dir:                             s3://<SOMETHING>/savepoints
>>   Flink Version:                                      v1_14
>>   Image:                                              <SOMETHING>
>>   Image Pull Policy:                                  IfNotPresent
>>   Job:
>>     Allow Non Restored State:  true
>>     Args:
>>     Entry Class:             <SOMETHING>
>>     Initial Savepoint Path:  s3a://<SOMETHING>/savepoint-bad5e5-577c6a76aec5
>>     Jar URI:                 local:///build/flink/usrlib/<SOMETHING>.jar
>>     Parallelism:             2
>>     State:                   running
>>     Upgrade Mode:            savepoint
>>
>>
>>
>
> --
>
> Nathan Moderwell
>
> Senior Machine Learning Engineer
>
> Menlo Park, CA
>
