Finally I found the mistake. I had passed "--host 10.1.2.3" as a single argument. Because of that, the savepoint argument was apparently not interpreted correctly, or was ignored entirely. My guess is that "-s" was taken as the value for "--host 10.1.2.3" and "s3p://…" as a new parameter, and since these are not valid arguments in that position they were ignored.
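For anyone hitting the same issue, here is a minimal sketch of the difference on the invocation side. This is only an illustration under the assumption that the JobManager container is started via standalone-job.sh, as in our setup; the host and savepoint path are the ones from the log excerpts below, and the remaining job arguments are omitted.

# Broken: the quoted string reaches the entrypoint as a single program
# argument "--host 10.1.2.3", so "-s" and the savepoint path are no longer
# parsed as a flag/value pair and end up being ignored.
bin/standalone-job.sh start-foreground "--host 10.1.2.3" -s s3p://bucket/job1/savepoints/savepoint-000000-1234

# Working: the host flag and its value are two separate arguments, and
# "-s <savepoint>" is picked up as expected.
bin/standalone-job.sh start-foreground --host 10.1.2.3 -s s3p://bucket/job1/savepoints/savepoint-000000-1234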
Not working:

23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
...
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 10.1.2.3
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234

-------------

Working:

23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
...
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 10.1.2.3
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234
...
23.07.2021 09:37:12.932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 00000000000000000000000000000000 from savepoint s3p://bucket/job1/savepoints/savepoint-000000-1234 ()

For reference, I have added a rough sketch of the savepoint-and-cancel REST calls we automate during redeployment below the quoted thread.

Thanks again for your help.

Kr Thomas

Yang Wang <danrtsey...@gmail.com> wrote on Fri, 23 Jul 2021 at 04:34:

> Please note that when the job is canceled, the HA data (including the
> checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
>
> But it is strange that the "-s/--fromSavepoint" option does not take effect when
> redeploying the Flink application. The JobManager logs could help a lot to
> find the root cause.
>
> Best,
> Yang
>
> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, 22 Jul 2021 at 11:09 PM:
>
>> Hey Thomas,
>>
>> Hmm, I see no reason why you should not be able to update the checkpoint
>> interval at runtime, and I don't believe that information is stored in a
>> savepoint. Can you share the JobManager logs of the job where this is
>> ignored?
>>
>> Thanks,
>> Austin
>>
>> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:
>>
>>> Hey Austin,
>>>
>>> Thanks for your help.
>>>
>>> I tried to change the checkpoint interval as an example. The value for it
>>> comes from an additional config file and is read and set within main() of the job.
>>>
>>> The job is running in Application mode, with basically the same configuration
>>> as on the official Flink website, except that the JobManager is created as a
>>> Deployment instead of a Job.
>>>
>>> For the redeployment of the job, the REST API is triggered to create a
>>> savepoint and cancel the job. After completion, the deployment is updated
>>> and the pods are recreated. The -s <latest_savepoint> is always added as a
>>> parameter to start the JobManager (standalone-job.sh). The CLI is not involved.
>>> We have automated these steps, but I also tried them manually with the
>>> same results.
>>>
>>> I also tried to trigger a savepoint, scale the pods down, update the
>>> start parameter with the recent savepoint, and rename
>>> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>>>
>>> When I trigger a savepoint with cancel, I also see that the HA config
>>> maps are cleaned up.
>>>
>>> Kr Thomas
>>>
>>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed, 21 Jul 2021 at 16:52:
>>>
>>>> Hi Thomas,
>>>>
>>>> I've got a few questions that will hopefully help us find an answer:
>>>>
>>>> What job properties are you trying to change? Something like
>>>> parallelism?
>>>>
>>>> What mode is your job running in? i.e., Session, Per-Job, or
>>>> Application?
>>>>
>>>> Can you also describe how you're redeploying the job? Are you using the
>>>> Native Kubernetes integration or Standalone (i.e. writing the k8s manifest
>>>> files yourself)? It sounds like you are using the Flink CLI as well, is
>>>> that correct?
>>>>
>>>> Thanks,
>>>> Austin
>>>>
>>>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> we have some application clusters running on Kubernetes and are exploring
>>>>> the HA mode, which is working as expected. When we try to upgrade a job,
>>>>> e.g. trigger a savepoint, cancel the job and redeploy, Flink does not
>>>>> restart from the savepoint we provide using the -s parameter, so all
>>>>> state is lost.
>>>>>
>>>>> If we just trigger the savepoint without canceling the job and redeploy,
>>>>> the HA mode picks up from the latest savepoint.
>>>>>
>>>>> But this way we cannot upgrade job properties, as they seem to be picked
>>>>> up from the savepoint.
>>>>>
>>>>> Is there any advice on how to do upgrades with HA enabled?
>>>>>
>>>>> The Flink version is 1.12.2.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> Kr thomas
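As mentioned above the quoted thread, here is a rough sketch of the savepoint-and-cancel step we automate before updating the Deployment. The JobManager address and job ID are placeholders for our setup; the calls use the standard monitoring REST API, and the returned savepoint location is what we then pass to standalone-job.sh via -s.

# Placeholders for our JobManager REST endpoint and the job to upgrade.
JM_ADDR=http://my-jobmanager:8081
JOB_ID=00000000000000000000000000000000

# Trigger a savepoint and cancel the job in one request.
curl -s -X POST "$JM_ADDR/jobs/$JOB_ID/savepoints" \
  -H "Content-Type: application/json" \
  -d '{"target-directory": "s3p://bucket/job1/savepoints", "cancel-job": true}'
# -> {"request-id": "<trigger-id>"}

# Poll the trigger until the savepoint is COMPLETED and read its location.
curl -s "$JM_ADDR/jobs/$JOB_ID/savepoints/<trigger-id>"
# -> {"status":{"id":"COMPLETED"},"operation":{"location":"s3p://bucket/job1/savepoints/savepoint-..."}}

# The reported location is then added to the JobManager start parameters
# on redeploy, e.g.: -s s3p://bucket/job1/savepoints/savepoint-...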