Hey Thomas,

Hmm, I see no reason why you should not be able to update the checkpoint interval at runtime, and I don't believe that information is stored in a savepoint. Can you share the JobManager logs of the job where this is ignored?
Thanks,
Austin

On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:

> Hey Austin,
>
> Thanks for your help.
>
> I tried to change the checkpoint interval as an example. The value for it
> comes from an additional config file and is read and set within main() of
> the job.
>
> The job is running in Application mode. Basically the same configuration
> as from the official Flink website, but instead of running the JobManager
> as a job, it is created as a deployment.
>
> For the redeployment of the job, the REST API is triggered to create a
> savepoint and cancel the job. After completion, the deployment is updated
> and the pods are recreated. The -s <latest_savepoint> is always added as a
> parameter to start the JobManager (standalone-job.sh). The CLI is not
> involved; we have automated these steps. But I tried the steps manually
> and got the same results.
>
> I also tried to trigger a savepoint, scale the pods down, update the
> start parameter with the recent savepoint, and renamed
> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>
> When I trigger a savepoint with cancel, I also see that the HA config
> maps are cleaned up.
>
> Kr Thomas
>
> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed., Jul 21,
> 2021 at 16:52:
>
>> Hi Thomas,
>>
>> I've got a few questions that will hopefully help find an answer:
>>
>> What job properties are you trying to change? Something like
>> parallelism?
>>
>> What mode is your job running in? I.e., Session, Per-Job, or
>> Application?
>>
>> Can you also describe how you're redeploying the job? Are you using the
>> Native Kubernetes integration or Standalone (i.e., writing k8s manifest
>> files yourself)? It sounds like you are using the Flink CLI as well, is
>> that correct?
>>
>> Thanks,
>> Austin
>>
>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> we have some application clusters running on Kubernetes and are
>>> exploring the HA mode, which is working as expected. When we try to
>>> upgrade a job, e.g. trigger a savepoint, cancel the job, and redeploy,
>>> Flink does not restart from the savepoint we provide using the -s
>>> parameter, so all state is lost.
>>>
>>> If we just trigger the savepoint without canceling the job and
>>> redeploy, the HA mode picks up from the latest savepoint.
>>>
>>> But this way we cannot upgrade job properties, as they seem to be
>>> picked up from the savepoint.
>>>
>>> Is there any advice on how to do upgrades with HA enabled?
>>>
>>> Flink version is 1.12.2.
>>>
>>> Thanks for your help.
>>>
>>> Kr Thomas
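
[Editor's note: for readers following the upgrade procedure discussed in this thread, the steps Thomas describes (savepoint-with-cancel via the REST API, then restarting the standalone JobManager from that savepoint) can be sketched roughly as below. The JobManager address, job id, savepoint directory, and job classname are placeholders, and jq is assumed to be available; endpoints follow Flink's monitoring REST API as of 1.12. This is an illustrative sketch, not the poster's exact automation.]

```shell
#!/usr/bin/env sh
JM="http://jobmanager:8081"      # placeholder JobManager REST address
JOB_ID="<your-job-id>"           # placeholder job id

# 1. Trigger a savepoint and cancel the job in one call ("cancel-job": true).
#    The response contains a trigger id under "request-id".
TRIGGER_ID=$(curl -s -X POST "${JM}/jobs/${JOB_ID}/savepoints" \
  -H "Content-Type: application/json" \
  -d '{"target-directory": "s3://bucket/savepoints", "cancel-job": true}' \
  | jq -r '."request-id"')

# 2. Poll the savepoint status until COMPLETED, then read the savepoint path
#    from "operation.location".
SAVEPOINT=$(curl -s "${JM}/jobs/${JOB_ID}/savepoints/${TRIGGER_ID}" \
  | jq -r '.operation.location')

# 3. Update the deployment so the JobManager entrypoint is started with
#    -s <savepoint>, as described in the thread (standalone Application mode).
./bin/standalone-job.sh start \
  --job-classname com.example.MyJob \
  -s "${SAVEPOINT}"
```

Note that step 2 is simplified: in practice the status endpoint must be polled in a loop until the operation reports COMPLETED before the location field is present.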