Thanks Weihua and Gyula,

@Weihua
> If you restart the flink cluster by deleting/recreating the deployment
> directly, it will automatically be restored from the latest
> checkpoint[1], so maybe just enabling checkpointing is enough.

I'm not sure I follow: we might have changes to the job that will require
us to restore from a savepoint, where restoring from a checkpoint wouldn't
be possible due to significant changes to the JobGraph.
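For context, our understanding is that the restore-from-latest-checkpoint behaviour on deployment restart relies on checkpointing plus Kubernetes HA being configured. Roughly like the following sketch (the cluster-id and S3 paths are placeholders, not our actual config):

```yaml
# flink-conf.yaml (sketch). Kubernetes HA stores JobManager metadata,
# including the pointer to the latest checkpoint, in ConfigMaps, so a
# recreated deployment can resume from the latest checkpoint.
kubernetes.cluster-id: my-flink-job                      # placeholder
high-availability: kubernetes
high-availability.storageDir: s3://my-bucket/flink/ha    # placeholder

# Periodic checkpoints to durable storage.
execution.checkpointing.interval: 60s
state.checkpoints.dir: s3://my-bucket/flink/checkpoints  # placeholder
state.savepoints.dir: s3://my-bucket/flink/savepoints    # placeholder
```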
> But if you want to use a savepoint, you need to check whether the latest
> savepoint is successful (checking for _metadata in the savepoint dir
> works in most scenarios, but in some cases the _metadata may not be
> complete).

Yes, that is basically what our savepoint-restore script does: it checks S3
to see whether we have any savepoints generated and passes the latest one to
the "--fromSavepoint" argument.

@Gyula
> Did you check the https://github.com/apache/flink-kubernetes-operator by
> any chance?

Interesting, no, I had missed this! I will have a look, but it would also be
interesting to see how this was solved before the introduction of the Flink
operator.

On Tue, Jul 5, 2022 at 16:37 Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi!
>
> Did you check the https://github.com/apache/flink-kubernetes-operator by
> any chance?
>
> It provides many of the application lifecycle features that you are
> probably after straight out of the box. It also includes both manual and
> periodic savepoint triggering in the latest upcoming version :)
>
> Cheers,
> Gyula
>
> On Tue, Jul 5, 2022 at 5:34 PM Weihua Hu <huweihua....@gmail.com> wrote:
>
>> Hi, Jonas
>>
>> If you restart the flink cluster by deleting/recreating the deployment
>> directly, it will automatically be restored from the latest
>> checkpoint[1], so maybe just enabling checkpointing is enough.
>> But if you want to use a savepoint, you need to check whether the latest
>> savepoint is successful (checking for _metadata in the savepoint dir
>> works in most scenarios, but in some cases the _metadata may not be
>> complete).
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
>>
>> Best,
>> Weihua
>>
>> On Tue, Jul 5, 2022 at 10:54 PM jonas eyob <jonas.e...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> We are running a standalone job on Kubernetes using application
>>> deployment mode, with HA enabled.
>>>
>>> We have attempted to automate how we create and restore savepoints by
>>> running one script that generates a savepoint (using a Kubernetes
>>> preStop hook) and another that restores from a savepoint (located in
>>> an S3 bucket).
>>>
>>> Restoring from a savepoint is typically not a problem once we have a
>>> savepoint generated and accessible in our S3 bucket. The problem is
>>> generating the savepoint, which hasn't been very reliable thus far.
>>> The logs are not particularly helpful either, so we wanted to rethink
>>> how we go about taking savepoints.
>>>
>>> Are there any best practices for doing this in a CI/CD manner, given
>>> our setup?
>>>
>>> --
>>>
>>> *Best regards*
>>> *Jonas Eyob*
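P.S. To make the "--fromSavepoint" selection concrete, our restore-side check looks roughly like the sketch below: pick the newest savepoint directory that actually contains a _metadata file. The directory layout and function name are illustrative, and as Weihua notes, the presence of _metadata is only a heuristic for completeness:

```shell
#!/bin/sh
# Sketch: pick the newest savepoint directory that looks complete, i.e.
# contains a _metadata file. Assumes Flink's usual layout of
# <dir>/savepoint-<shortjobid>-<random>/, with the listing done against a
# local mount or sync of the S3 bucket (adapt to your S3 tooling).
latest_savepoint() {
  # Newest-first by modification time; skip dirs without _metadata.
  for sp in $(ls -1td "$1"/savepoint-* 2>/dev/null); do
    if [ -f "$sp/_metadata" ]; then
      echo "$sp"
      return 0
    fi
  done
  return 1
}

# Hypothetical usage in the container entrypoint:
#   standalone-job.sh start-foreground \
#     --fromSavepoint "$(latest_savepoint /mnt/savepoints)" ...
```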
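On the generation side, one way to make the preStop hook less opaque than a fire-and-forget script is to trigger the savepoint through the JobManager REST API and poll the async status until it completes, so failures surface explicitly. A rough sketch, not our actual script; host, job id and target directory are placeholders:

```shell
#!/bin/sh
# Trigger a savepoint via the Flink REST API.
# POST /jobs/:jobid/savepoints returns {"request-id": "<trigger-id>"}.
trigger_savepoint() {  # args: <jobmanager-host:port> <job-id> <target-dir>
  curl -s -X POST "http://$1/jobs/$2/savepoints" \
       -H 'Content-Type: application/json' \
       -d "{\"target-directory\": \"$3\", \"cancel-job\": false}"
}

# Pull the "request-id" out of the JSON response (avoids a jq dependency).
request_id() {
  sed -n 's/.*"request-id" *: *"\([^"]*\)".*/\1/p'
}

# Poll GET /jobs/:jobid/savepoints/:triggerid until the async operation
# reports COMPLETED, so the preStop hook only exits once the savepoint is
# actually done (or the pod's termination grace period runs out).
wait_for_savepoint() {  # args: <jobmanager-host:port> <job-id> <trigger-id>
  while :; do
    status=$(curl -s "http://$1/jobs/$2/savepoints/$3")
    case "$status" in
      *'"COMPLETED"'*) echo "$status"; return 0 ;;
    esac
    sleep 2
  done
}
```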