Please note that when the job is canceled, the HA data (including the checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
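As a quick way to see this, here is a sketch for checking the HA data before and after the cancel. The label selector is an assumption based on the labels Flink usually attaches to its HA ConfigMaps, so adjust it to your deployment:

    # list the HA ConfigMaps that belong to this cluster-id
    # (label selector assumed; adjust to your setup)
    kubectl get configmaps -l app=<cluster-id>,configmap-type=high-availability

    # inspect the checkpoint pointers stored in one of them
    kubectl describe configmap <name-from-the-list-above>

Running this before and after the cancel should show the entries disappearing.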
But it is strange that "-s/--fromSavepoint" does not take effect when redeploying the Flink application. The JobManager logs could help a lot in finding the root cause.

Best,
Yang

Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, Jul 22, 2021 at 11:09 PM:

> Hey Thomas,
>
> Hmm, I see no reason why you should not be able to update the checkpoint
> interval at runtime, and I don't believe that information is stored in a
> savepoint. Can you share the JobManager logs of the job where this is
> ignored?
>
> Thanks,
> Austin
>
> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:
>
>> Hey Austin,
>>
>> Thanks for your help.
>>
>> I tried to change the checkpoint interval as an example. The value for it
>> comes from an additional config file and is read and set within main() of
>> the job.
>>
>> The job is running in Application mode. Basically the same configuration
>> as on the official Flink website, but instead of running the JobManager as
>> a job it is created as a deployment.
>>
>> For the redeployment of the job, the REST API is used to create a
>> savepoint and cancel the job. After completion, the deployment is updated
>> and the pods are recreated. The -s <latest_savepoint> is always added as a
>> parameter to start the JobManager (standalone-job.sh). The CLI is not
>> involved; we have automated these steps. But I tried the steps manually
>> and got the same results.
>>
>> I also tried to trigger a savepoint, scale the pods down, update the
>> start parameter with the recent savepoint, and rename
>> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>>
>> When I trigger a savepoint with cancel, I also see that the HA config
>> maps are cleaned up.
>>
>> Kr Thomas
>>
>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed, Jul 21,
>> 2021 at 4:52 PM:
>>
>>> Hi Thomas,
>>>
>>> I've got a few questions that will hopefully help find an answer:
>>>
>>> What job properties are you trying to change? Something like
>>> parallelism?
>>>
>>> What mode is your job running in? I.e., Session, Per-Job, or
>>> Application?
>>>
>>> Can you also describe how you're redeploying the job? Are you using the
>>> Native Kubernetes integration or Standalone (i.e., writing k8s manifest
>>> files yourself)? It sounds like you are using the Flink CLI as well, is
>>> that correct?
>>>
>>> Thanks,
>>> Austin
>>>
>>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> we have some application clusters running on Kubernetes and are
>>>> exploring the HA mode, which is working as expected. When we try to
>>>> upgrade a job, e.g. trigger a savepoint, cancel the job, and redeploy,
>>>> Flink does not restart from the savepoint we provide using the -s
>>>> parameter, so all state is lost.
>>>>
>>>> If we just trigger the savepoint without canceling the job and
>>>> redeploy, the HA mode picks up from the latest savepoint.
>>>>
>>>> But this way we cannot upgrade job properties, as they seem to be
>>>> picked up from the savepoint.
>>>>
>>>> Is there any advice on how to do upgrades with HA enabled?
>>>>
>>>> Flink version is 1.12.2.
>>>>
>>>> Thanks for your help.
>>>>
>>>> Kr Thomas
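For reference, here is a rough sketch of the savepoint-and-redeploy sequence described in the thread above. The endpoints follow the standard Flink REST API; the bucket, savepoint path, and job class name are placeholders, not values taken from the thread:

    # 1) Trigger a savepoint and cancel the job via the REST API
    curl -X POST http://<jobmanager>:8081/jobs/<job-id>/savepoints \
         -H 'Content-Type: application/json' \
         -d '{"target-directory": "s3://my-bucket/savepoints", "cancel-job": true}'

    # 2) Poll the request id returned above until the savepoint has completed
    curl http://<jobmanager>:8081/jobs/<job-id>/savepoints/<request-id>

    # 3) Restart the JobManager from that savepoint, e.g. via the container args
    #    of the standalone application-mode entrypoint (placeholders assumed):
    #    args: ["standalone-job", "--job-classname", "com.example.MyJob",
    #           "--fromSavepoint", "s3://my-bucket/savepoints/savepoint-xxxx",
    #           "--allowNonRestoredState"]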