Hi Jonas! Generally, managed platforms used to provide the functionality that you are after. Otherwise it's mostly home-grown CI/CD integrations :)
The Kubernetes Operator is perhaps the first initiative to bring proper application lifecycle management directly into the ecosystem.

Cheers,
Gyula

On Tue, Jul 5, 2022 at 6:45 PM jonas eyob <jonas.e...@gmail.com> wrote:

> Thanks Weihua and Gyula,
>
> @Weihua
>
> > If you restart the Flink cluster by deleting/creating the deployment
> > directly, it will be automatically restored from the latest
> > checkpoint[1], so maybe just enabling checkpoints is enough.
>
> Not sure I follow. We might have changes to the job that will require us
> to restore from a savepoint, where checkpoints wouldn't be usable due to
> significant changes to the JobGraph.
>
> > But if you want to use savepoints, you need to check whether the latest
> > savepoint is successful (checking whether the savepoint dir contains a
> > _metadata file works in most scenarios, but in some cases the _metadata
> > may be incomplete).
>
> Yes, that is basically what our savepoint restore script does: it checks
> S3 to see if we have any savepoints generated and passes the latest one
> to the "--fromSavepoint" argument.
>
> @Gyula
>
> > Did you check the https://github.com/apache/flink-kubernetes-operator
> > by any chance?
>
> Interesting, no, I had missed this! I will have a look, but it would also
> be interesting to see how this was solved before the introduction of the
> Flink operator.
>
> On Tue, Jul 5, 2022 at 4:37 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi!
>>
>> Did you check the https://github.com/apache/flink-kubernetes-operator
>> by any chance?
>>
>> It provides many of the application lifecycle features that you are
>> probably after straight out of the box. It also includes both manual
>> and periodic savepoint triggering in the latest upcoming version :)
>>
>> Cheers,
>> Gyula
>>
>> On Tue, Jul 5, 2022 at 5:34 PM Weihua Hu <huweihua....@gmail.com> wrote:
>>
>>> Hi, jonas
>>>
>>> If you restart the Flink cluster by deleting/creating the deployment
>>> directly, it will be automatically restored from the latest
>>> checkpoint[1], so maybe just enabling checkpoints is enough.
>>> But if you want to use savepoints, you need to check whether the latest
>>> savepoint is successful (checking whether the savepoint dir contains a
>>> _metadata file works in most scenarios, but in some cases the _metadata
>>> may be incomplete).
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
>>>
>>> Best,
>>> Weihua
>>>
>>>
>>> On Tue, Jul 5, 2022 at 10:54 PM jonas eyob <jonas.e...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> We are running a standalone job on Kubernetes using application
>>>> deployment mode, with HA enabled.
>>>>
>>>> We have attempted to automate how we create and restore savepoints by
>>>> running one script that triggers a savepoint (using a Kubernetes
>>>> preStop hook) and another that restores from a savepoint (located in
>>>> an S3 bucket).
>>>>
>>>> Restoring from a savepoint is typically not a problem once we have a
>>>> savepoint generated and accessible in our S3 bucket. The problem is
>>>> generating the savepoint, which hasn't been very reliable so far. The
>>>> logs are not particularly helpful either, so we wanted to rethink how
>>>> we go about taking savepoints.
>>>>
>>>> Are there any best practices for doing this in a CI/CD manner given
>>>> our setup?
>>>>
>>>> --
>>>>
>
> --
> *Kind regards*
> *Jonas Eyob*
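For readers finding this thread later: the validation step Weihua describes — only restoring from a savepoint directory that actually contains a completed `_metadata` object — can be sketched roughly as below. This is a minimal illustration, not code from the thread; the function name and the `(key, last_modified)` listing shape are assumptions (in practice you would feed it the keys and timestamps from an S3 listing, e.g. boto3's `list_objects_v2`, rather than the hard-coded example list):

```python
from typing import Iterable, Optional, Tuple

def latest_valid_savepoint(objects: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Pick the savepoint directory to restore from, or None if there is none.

    `objects` is an iterable of (key, last_modified) pairs, as you would get
    from listing the savepoint prefix in S3 (hypothetical shape; adapt to your
    client). A savepoint counts as complete only if its directory contains a
    _metadata object; among complete ones, the most recently written wins.
    Note: Flink's savepoint directory names are not chronological, which is
    why we sort by timestamp rather than by name.
    """
    candidates = {}
    for key, mtime in objects:
        if key.endswith("/_metadata"):
            savepoint_dir = key.rsplit("/", 1)[0]
            candidates[savepoint_dir] = max(candidates.get(savepoint_dir, 0.0), mtime)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Example: two savepoint dirs; the newer one is missing _metadata (incomplete),
# so the older but complete savepoint is chosen.
listing = [
    ("savepoints/savepoint-aa-1/_metadata", 100.0),
    ("savepoints/savepoint-aa-1/part-0", 100.0),
    ("savepoints/savepoint-bb-2/part-0", 200.0),  # no _metadata: still in flight
]
print(latest_valid_savepoint(listing))  # -> savepoints/savepoint-aa-1
```

The selected path would then be handed to the job via `--fromSavepoint`. The timestamp-based tie-break is a design choice under the assumption stated in the docstring; as Weihua notes, even a present `_metadata` can in rare cases be incomplete, so this check is a heuristic rather than a guarantee.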