Great, glad it was an easy fix :) Thanks for following up!

On Fri, Jul 23, 2021 at 3:54 AM Thms Hmm <thms....@gmail.com> wrote:
> Finally I found the mistake. I put the "--host 10.1.2.3" param as one
> argument. I think the savepoint argument was not interpreted correctly or
> was ignored. It might be that "-s" was used as the value for "--host
> 10.1.2.3" and "s3p://…" as a new param, and because these are not valid
> arguments they were ignored.
>
> Not working:
>
> 23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
> ...
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 10.1.2.3
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234
>
> -----
>
> Working:
>
> 23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
> ...
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 10.1.2.3
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234
> ...
> 23.07.2021 09:37:12.932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 00000000000000000000000000000000 from savepoint s3p://bucket/job1/savepoints/savepoint-000000-1234 ()
>
> Thanks again for your help.
>
> Kr Thomas
>
> Yang Wang <danrtsey...@gmail.com> wrote on Fri, Jul 23, 2021 at 04:34:
>
>> Please note that when the job is canceled, the HA data (including the
>> checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
>>
>> But it is strange that the "-s/--fromSavepoint" option does not take effect when
>> redeploying the Flink application.
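The root cause described above, a flag and its value packed into a single argv entry, can be reproduced with a minimal shell sketch. The `print_args` helper is hypothetical (a stand-in for how `standalone-job.sh` receives and logs its program arguments); the savepoint path is a placeholder, not the one from the original logs.

```shell
#!/bin/sh
# Stand-in for an entrypoint script: print each argv entry on its own
# line, the way ClusterEntrypoint logs "Program Arguments".
print_args() {
  for a in "$@"; do
    printf '%s\n' "$a"
  done
}

# Broken: "--host 10.1.2.3" is ONE argv entry. The option parser sees an
# unknown option named "--host 10.1.2.3", and the following -s/savepoint
# pair can be misinterpreted or dropped.
print_args "--host 10.1.2.3" -s "s3p://bucket/savepoints/sp-1"

echo "-----"

# Working: flag and value are TWO separate argv entries.
print_args --host 10.1.2.3 -s "s3p://bucket/savepoints/sp-1"
```

In Kubernetes manifests the same rule applies: each flag and each value must be its own element of the container `args:` list, since the list is passed to the process verbatim without shell word-splitting.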
>> The JobManager logs could help a lot to find the root cause.
>>
>> Best,
>> Yang
>>
>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, Jul 22, 2021 at 11:09 PM:
>>
>>> Hey Thomas,
>>>
>>> Hmm, I see no reason why you should not be able to update the checkpoint
>>> interval at runtime, and I don't believe that information is stored in a
>>> savepoint. Can you share the JobManager logs of the job where this is
>>> ignored?
>>>
>>> Thanks,
>>> Austin
>>>
>>> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:
>>>
>>>> Hey Austin,
>>>>
>>>> Thanks for your help.
>>>>
>>>> I tried to change the checkpoint interval as an example. The value for it
>>>> comes from an additional config file and is read and set within main() of
>>>> the job.
>>>>
>>>> The job is running in Application mode. It is basically the same
>>>> configuration as from the official Flink website, but instead of running the
>>>> JobManager as a job, it is created as a deployment.
>>>>
>>>> For the redeployment of the job, the REST API is triggered to create a
>>>> savepoint and cancel the job. After completion, the deployment is updated
>>>> and the pods are recreated. The "-s <latest_savepoint>" is always added as a
>>>> parameter to start the JobManager (standalone-job.sh). The CLI is not involved.
>>>> We have automated these steps, but I also tried the steps manually and got the
>>>> same results.
>>>>
>>>> I also tried to trigger a savepoint, scale the pods down, update the
>>>> start parameter with the recent savepoint, and renamed
>>>> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>>>>
>>>> When I trigger a savepoint with cancel, I also see that the HA config
>>>> maps are cleaned up.
>>>>
>>>> Kr Thomas
>>>>
>>>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed, Jul 21, 2021 at 16:52:
>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> I've got a few questions that will hopefully help find an answer:
>>>>>
>>>>> What job properties are you trying to change?
>>>>> Something like parallelism?
>>>>>
>>>>> What mode is your job running in? I.e., Session, Per-Job, or
>>>>> Application?
>>>>>
>>>>> Can you also describe how you're redeploying the job? Are you using
>>>>> the Native Kubernetes integration or Standalone (i.e., writing k8s manifest
>>>>> files yourself)? It sounds like you are using the Flink CLI as well, is
>>>>> that correct?
>>>>>
>>>>> Thanks,
>>>>> Austin
>>>>>
>>>>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> we have some application clusters running on Kubernetes and are exploring
>>>>>> the HA mode, which is working as expected. When we try to upgrade a job,
>>>>>> e.g. trigger a savepoint, cancel the job, and redeploy, Flink is not
>>>>>> restarting from the savepoint we provide using the -s parameter, so all
>>>>>> state is lost.
>>>>>>
>>>>>> If we just trigger the savepoint without canceling the job and
>>>>>> redeploy, the HA mode picks up from the latest checkpoint.
>>>>>>
>>>>>> But this way we cannot upgrade job properties, as they seem to be picked up
>>>>>> from the savepoint.
>>>>>>
>>>>>> Is there any advice on how to do upgrades with HA enabled?
>>>>>>
>>>>>> Flink version is 1.12.2.
>>>>>>
>>>>>> Thanks for your help.
>>>>>>
>>>>>> Kr thomas
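The automated upgrade flow described in this thread (trigger a savepoint with cancel via the REST API, then restart the JobManager with `-s`) can be sketched as below. This is a hedged sketch, not the poster's actual tooling: the REST address, job id, and savepoint directory are placeholders, and only the request-building helper is exercised here. The endpoints used (`POST /jobs/<jobid>/savepoints` with a `cancel-job` field, and polling `GET /jobs/<jobid>/savepoints/<trigger-id>`) are part of Flink's documented REST API.

```shell
#!/bin/sh
# Placeholders -- substitute your JobManager address and job id.
FLINK_REST="http://jobmanager.example:8081"
JOB_ID="00000000000000000000000000000000"

# Build the JSON body for a savepoint-with-cancel request.
savepoint_body() {
  printf '{"target-directory": "%s", "cancel-job": true}' "$1"
}

# 1) Trigger a savepoint and cancel the job in one request.
#    The response contains a "request-id" (the trigger id).
trigger_savepoint_with_cancel() {
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -d "$(savepoint_body "$1")" \
    "$FLINK_REST/jobs/$JOB_ID/savepoints"
}

# 2) Poll GET $FLINK_REST/jobs/$JOB_ID/savepoints/<trigger-id> until the
#    status is COMPLETED, then read the final savepoint location from the
#    response and restart the JobManager with it -- note the flag and the
#    path as TWO separate arguments:
#
#      standalone-job.sh ... -s <savepoint-location>
```

As Yang Wang notes above, cancellation cleans up the HA data in the ConfigMaps, so the `-s` argument on the redeployed JobManager is what carries the state forward; it must therefore be parsed correctly.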