Hi Jonas! Generally, managed platforms used to provide the functionality that you are after. Otherwise it's mostly home-grown CI/CD integrations :)
The Kubernetes Operator is perhaps the first initiative to bring proper application lifecycle management directly into the ecosystem.

Cheers,
Gyula

On Tue, Jul 5, 2022 at 6:45 PM jonas eyob <jonas.e...@gmail.com> wrote:

> Thanks Weihua and Gyula,
>
> @Weihua
>
> > If you restart the Flink cluster by deleting/creating the deployment
> > directly, it will be automatically restored from the latest
> > checkpoint[1], so maybe just enabling checkpoints is enough.
>
> Not sure I follow. We might have changes to the job that will require us
> to restore from a savepoint, where checkpoints wouldn't be usable due to
> significant changes to the JobGraph.
>
> > But if you want to use savepoints, you need to check whether the latest
> > savepoint is successful (checking whether the savepoint dir contains a
> > _metadata file works in most scenarios, but in some cases the _metadata
> > may be incomplete).
>
> Yes, that is basically what our savepoint restore script does: it checks
> S3 to see if we have any savepoints generated and passes the latest one
> to the "--fromSavepoint" argument.
>
> @Gyula
>
> > Did you check the https://github.com/apache/flink-kubernetes-operator
> > by any chance?
>
> Interesting, no, I had missed this! I will have a look, but it would also
> be interesting to see how this was solved before the introduction of the
> Flink operator.
>
> On Tue, Jul 5, 2022 at 4:37 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi!
>>
>> Did you check the https://github.com/apache/flink-kubernetes-operator
>> by any chance?
>>
>> It provides many of the application lifecycle features that you are
>> probably after straight out of the box. It also includes both manual
>> and periodic savepoint triggering in the latest upcoming version :)
>>
>> Cheers,
>> Gyula
>>
>> On Tue, Jul 5, 2022 at 5:34 PM Weihua Hu <huweihua....@gmail.com> wrote:
>>
>>> Hi, jonas
>>>
>>> If you restart the Flink cluster by deleting/creating the deployment
>>> directly, it will be automatically restored from the latest
>>> checkpoint[1], so maybe just enabling checkpoints is enough.
>>> But if you want to use savepoints, you need to check whether the latest
>>> savepoint is successful (checking whether the savepoint dir contains a
>>> _metadata file works in most scenarios, but in some cases the _metadata
>>> may be incomplete).
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
>>>
>>> Best,
>>> Weihua
>>>
>>>
>>> On Tue, Jul 5, 2022 at 10:54 PM jonas eyob <jonas.e...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> We are running a standalone job on Kubernetes using application
>>>> deployment mode, with HA enabled.
>>>>
>>>> We have attempted to automate how we create and restore savepoints by
>>>> running one script that triggers a savepoint (using a Kubernetes
>>>> preStop hook) and another that restores from a savepoint (located in
>>>> an S3 bucket).
>>>>
>>>> Restoring from a savepoint is typically not a problem once we have a
>>>> savepoint generated and accessible in our S3 bucket. The problem is
>>>> generating the savepoint, which hasn't been very reliable so far. The
>>>> logs are not particularly helpful either, so we wanted to rethink how
>>>> we go about taking savepoints.
>>>>
>>>> Are there any best practices for doing this in a CI/CD manner given
>>>> our setup?
>>>>
>>>> --
>>>>
>
> --
> *Kind regards*
> *Jonas Eyob*
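For readers finding this thread later: the validation step Weihua describes — only restoring from a savepoint directory that actually contains a completed `_metadata` object — can be sketched roughly as below. This is a minimal illustration, not code from the thread; the function name and the `(key, last_modified)` listing shape are assumptions (in practice you would feed it the keys and timestamps from an S3 listing, e.g. boto3's `list_objects_v2`, rather than the hard-coded example list):

```python
from typing import Iterable, Optional, Tuple

def latest_valid_savepoint(objects: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Pick the savepoint directory to restore from, or None if there is none.

    `objects` is an iterable of (key, last_modified) pairs, as you would get
    from listing the savepoint prefix in S3 (hypothetical shape; adapt to your
    client). A savepoint counts as complete only if its directory contains a
    _metadata object; among complete ones, the most recently written wins.
    Note: Flink's savepoint directory names are not chronological, which is
    why we sort by timestamp rather than by name.
    """
    candidates = {}
    for key, mtime in objects:
        if key.endswith("/_metadata"):
            savepoint_dir = key.rsplit("/", 1)[0]
            candidates[savepoint_dir] = max(candidates.get(savepoint_dir, 0.0), mtime)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Example: two savepoint dirs; the newer one is missing _metadata (incomplete),
# so the older but complete savepoint is chosen.
listing = [
    ("savepoints/savepoint-aa-1/_metadata", 100.0),
    ("savepoints/savepoint-aa-1/part-0", 100.0),
    ("savepoints/savepoint-bb-2/part-0", 200.0),  # no _metadata: still in flight
]
print(latest_valid_savepoint(listing))  # -> savepoints/savepoint-aa-1
```

The selected path would then be handed to the job via `--fromSavepoint`. The timestamp-based tie-break is a design choice under the assumption stated in the docstring; as Weihua notes, even a present `_metadata` can in rare cases be incomplete, so this check is a heuristic rather than a guarantee.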