Hi! Did you check https://github.com/apache/flink-kubernetes-operator by any chance?
It provides many of the application lifecycle features you are probably after straight out of the box. It also includes both manual and periodic savepoint triggering in the latest upcoming version :)

Cheers,
Gyula

On Tue, Jul 5, 2022 at 5:34 PM Weihua Hu <huweihua....@gmail.com> wrote:

> Hi, jonas
>
> If you restart the Flink cluster by deleting/creating the deployment directly, it will automatically be restored from the latest checkpoint [1], so maybe just enabling checkpointing is enough.
> But if you want to use savepoints, you need to check whether the latest savepoint completed successfully (checking whether a _metadata file exists in the savepoint dir is useful in most scenarios, but in some cases the _metadata may not be complete).
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
>
> Best,
> Weihua
>
> On Tue, Jul 5, 2022 at 10:54 PM jonas eyob <jonas.e...@gmail.com> wrote:
>
>> Hi!
>>
>> We are running a standalone job on Kubernetes using application deployment mode, with HA enabled.
>>
>> We have attempted to automate how we create and restore savepoints by running one script that generates a savepoint (using a k8s preStop hook) and another that restores from a savepoint (located in an S3 bucket).
>>
>> Restoring from a savepoint is typically not a problem once we have a savepoint generated and accessible in our S3 bucket. The problem is generating the savepoint, which hasn't been very reliable so far. The logs are not particularly helpful either, so we wanted to rethink how we go about taking savepoints.
>>
>> Are there any best practices for doing this in a CI/CD manner, given our setup?
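To make Gyula's suggestion concrete: with the flink-kubernetes-operator, a manual savepoint is typically triggered by bumping `savepointTriggerNonce` in the `spec.job` section of the FlinkDeployment resource. A minimal sketch (the deployment name `my-app` and the nonce value are placeholders, not from this thread):

```shell
#!/bin/sh
# Hypothetical helper: build the JSON merge patch that bumps
# spec.job.savepointTriggerNonce on a FlinkDeployment custom resource.
# The operator triggers a new savepoint whenever the nonce value changes.
build_savepoint_patch() {
  printf '{"spec":{"job":{"savepointTriggerNonce": %s}}}' "$1"
}

# Usage ("my-app" is a placeholder deployment name):
#   kubectl patch flinkdeployment my-app --type=merge \
#     -p "$(build_savepoint_patch 42)"
```

Periodic savepoints are configured separately through the operator's `flinkConfiguration` in recent versions; see the operator docs for the exact key and supported interval formats.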
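Weihua's _metadata check can be scripted as a quick pre-restore sanity check. A minimal sketch for a filesystem-visible savepoint directory (`check_savepoint` is a hypothetical helper name; for savepoints in S3 you would list the object with the AWS CLI instead):

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the savepoint directory contains
# a non-empty _metadata file. Per Weihua's caveat, _metadata may exist
# but still be incomplete, so treat this as a sanity check, not a
# guarantee that the savepoint is usable.
check_savepoint() {
  [ -s "$1/_metadata" ]
}

# For savepoints stored in S3, the same idea with the AWS CLI
# (bucket and path are placeholders):
#   aws s3 ls "s3://my-bucket/savepoints/savepoint-abc123/_metadata"
```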
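For the preStop-hook approach jonas describes, savepoints can also be triggered through Flink's REST API: `POST /jobs/<job-id>/savepoints` returns a `request-id` that you poll to see whether the savepoint completed. A sketch, assuming the JobManager REST endpoint and job id are available as `$FLINK_REST` and `$JOB_ID` (both assumptions, not from this thread):

```shell
#!/bin/sh
# Hypothetical helper: pull the "request-id" field out of Flink's
# savepoint-trigger response, e.g. {"request-id":"abc-123"}.
extract_trigger_id() {
  sed -n 's/.*"request-id":"\([^"]*\)".*/\1/p'
}

# Usage against a live cluster (FLINK_REST, JOB_ID and the target
# directory are placeholders):
#   TRIGGER_ID=$(curl -s -X POST \
#     -H 'Content-Type: application/json' \
#     -d '{"target-directory":"s3://my-bucket/savepoints","cancel-job":false}' \
#     "$FLINK_REST/jobs/$JOB_ID/savepoints" | extract_trigger_id)
#   # Poll until the response reports the savepoint as completed:
#   curl -s "$FLINK_REST/jobs/$JOB_ID/savepoints/$TRIGGER_ID"
```

Polling for completion (rather than assuming the trigger succeeded) addresses the reliability problem described above: the hook can fail loudly when the savepoint never completes instead of leaving a partial one behind.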