Please note that when the job is canceled, the HA data (including the checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
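As a quick way to see this, here is a sketch for checking the HA data before and after the cancel. The label selector is an assumption based on the labels Flink usually attaches to its HA ConfigMaps, so adjust it to your deployment:

    # list the HA ConfigMaps that belong to this cluster-id
    # (label selector assumed; adjust to your setup)
    kubectl get configmaps -l app=<cluster-id>,configmap-type=high-availability

    # inspect the checkpoint pointers stored in one of them
    kubectl describe configmap <name-from-the-list-above>

Running this before and after the cancel should show the entries disappearing.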
But it is strange that "-s/--fromSavepoint" does not take effect when redeploying the Flink application. The JobManager logs could help a lot in finding the root cause.

Best,
Yang

Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, Jul 22, 2021 at 11:09 PM:

> Hey Thomas,
>
> Hmm, I see no reason why you should not be able to update the checkpoint
> interval at runtime, and I don't believe that information is stored in a
> savepoint. Can you share the JobManager logs of the job where this is
> ignored?
>
> Thanks,
> Austin
>
> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:
>
>> Hey Austin,
>>
>> Thanks for your help.
>>
>> I tried to change the checkpoint interval as an example. The value for it
>> comes from an additional config file and is read and set within main() of
>> the job.
>>
>> The job is running in Application mode. Basically the same configuration
>> as on the official Flink website, but instead of running the JobManager as
>> a job it is created as a deployment.
>>
>> For the redeployment of the job, the REST API is used to create a
>> savepoint and cancel the job. After completion, the deployment is updated
>> and the pods are recreated. The -s <latest_savepoint> is always added as a
>> parameter to start the JobManager (standalone-job.sh). The CLI is not
>> involved; we have automated these steps. But I tried the steps manually
>> and got the same results.
>>
>> I also tried to trigger a savepoint, scale the pods down, update the
>> start parameter with the recent savepoint, and rename
>> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>>
>> When I trigger a savepoint with cancel, I also see that the HA config
>> maps are cleaned up.
>>
>> Kr Thomas
>>
>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed, Jul 21,
>> 2021 at 4:52 PM:
>>
>>> Hi Thomas,
>>>
>>> I've got a few questions that will hopefully help find an answer:
>>>
>>> What job properties are you trying to change? Something like
>>> parallelism?
>>>
>>> What mode is your job running in? I.e., Session, Per-Job, or
>>> Application?
>>>
>>> Can you also describe how you're redeploying the job? Are you using the
>>> Native Kubernetes integration or Standalone (i.e., writing k8s manifest
>>> files yourself)? It sounds like you are using the Flink CLI as well, is
>>> that correct?
>>>
>>> Thanks,
>>> Austin
>>>
>>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> we have some application clusters running on Kubernetes and are
>>>> exploring the HA mode, which is working as expected. When we try to
>>>> upgrade a job, e.g. trigger a savepoint, cancel the job, and redeploy,
>>>> Flink does not restart from the savepoint we provide using the -s
>>>> parameter, so all state is lost.
>>>>
>>>> If we just trigger the savepoint without canceling the job and
>>>> redeploy, the HA mode picks up from the latest savepoint.
>>>>
>>>> But this way we cannot upgrade job properties, as they seem to be
>>>> picked up from the savepoint.
>>>>
>>>> Is there any advice on how to do upgrades with HA enabled?
>>>>
>>>> Flink version is 1.12.2.
>>>>
>>>> Thanks for your help.
>>>>
>>>> Kr Thomas
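For reference, here is a rough sketch of the savepoint-and-redeploy sequence described in the thread above. The endpoints follow the standard Flink REST API; the bucket, savepoint path, and job class name are placeholders, not values taken from the thread:

    # 1) Trigger a savepoint and cancel the job via the REST API
    curl -X POST http://<jobmanager>:8081/jobs/<job-id>/savepoints \
         -H 'Content-Type: application/json' \
         -d '{"target-directory": "s3://my-bucket/savepoints", "cancel-job": true}'

    # 2) Poll the request id returned above until the savepoint has completed
    curl http://<jobmanager>:8081/jobs/<job-id>/savepoints/<request-id>

    # 3) Restart the JobManager from that savepoint, e.g. via the container args
    #    of the standalone application-mode entrypoint (placeholders assumed):
    #    args: ["standalone-job", "--job-classname", "com.example.MyJob",
    #           "--fromSavepoint", "s3://my-bucket/savepoints/savepoint-xxxx",
    #           "--allowNonRestoredState"]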