Finally I found the mistake. I had passed "--host 10.1.2.3" as a single argument. Because of that, the savepoint argument was apparently not interpreted correctly, or was ignored entirely. My guess is that "-s" was taken as the value for "--host 10.1.2.3" and "s3p://…" as a new parameter, and since these are not valid arguments in that position they were ignored.
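For anyone hitting the same issue, here is a minimal sketch of the difference on the invocation side. This is only an illustration under the assumption that the JobManager container is started via standalone-job.sh, as in our setup; the host and savepoint path are the ones from the log excerpts below, and the remaining job arguments are omitted.

# Broken: the quoted string reaches the entrypoint as a single program
# argument "--host 10.1.2.3", so "-s" and the savepoint path are no longer
# parsed as a flag/value pair and end up being ignored.
bin/standalone-job.sh start-foreground "--host 10.1.2.3" -s s3p://bucket/job1/savepoints/savepoint-000000-1234

# Working: the host flag and its value are two separate arguments, and
# "-s <savepoint>" is picked up as expected.
bin/standalone-job.sh start-foreground --host 10.1.2.3 -s s3p://bucket/job1/savepoints/savepoint-000000-1234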
Not working:

23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
...
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 10.1.2.3
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234

-------------

Working:

23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
...
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 10.1.2.3
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234
...
23.07.2021 09:37:12.932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 00000000000000000000000000000000 from savepoint s3p://bucket/job1/savepoints/savepoint-000000-1234 ()

For reference, I have added a rough sketch of the savepoint-and-cancel REST calls we automate during redeployment below the quoted thread.

Thanks again for your help.

Kr Thomas

Yang Wang <danrtsey...@gmail.com> wrote on Fri, 23 Jul 2021 at 04:34:

> Please note that when the job is canceled, the HA data (including the
> checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
>
> But it is strange that the "-s/--fromSavepoint" option does not take effect when
> redeploying the Flink application. The JobManager logs could help a lot to
> find the root cause.
>
> Best,
> Yang
>
> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, 22 Jul 2021 at 11:09 PM:
>
>> Hey Thomas,
>>
>> Hmm, I see no reason why you should not be able to update the checkpoint
>> interval at runtime, and I don't believe that information is stored in a
>> savepoint. Can you share the JobManager logs of the job where this is
>> ignored?
>>
>> Thanks,
>> Austin
>>
>> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:
>>
>>> Hey Austin,
>>>
>>> Thanks for your help.
>>>
>>> I tried to change the checkpoint interval as an example. The value for it
>>> comes from an additional config file and is read and set within main() of the job.
>>>
>>> The job is running in Application mode, with basically the same configuration
>>> as on the official Flink website, except that the JobManager is created as a
>>> Deployment instead of a Job.
>>>
>>> For the redeployment of the job, the REST API is triggered to create a
>>> savepoint and cancel the job. After completion, the deployment is updated
>>> and the pods are recreated. The -s <latest_savepoint> is always added as a
>>> parameter to start the JobManager (standalone-job.sh). The CLI is not involved.
>>> We have automated these steps, but I also tried them manually with the
>>> same results.
>>>
>>> I also tried to trigger a savepoint, scale the pods down, update the
>>> start parameter with the recent savepoint, and rename
>>> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>>>
>>> When I trigger a savepoint with cancel, I also see that the HA config
>>> maps are cleaned up.
>>>
>>> Kr Thomas
>>>
>>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed, 21 Jul 2021 at 16:52:
>>>
>>>> Hi Thomas,
>>>>
>>>> I've got a few questions that will hopefully help us find an answer:
>>>>
>>>> What job properties are you trying to change? Something like
>>>> parallelism?
>>>>
>>>> What mode is your job running in? i.e., Session, Per-Job, or
>>>> Application?
>>>>
>>>> Can you also describe how you're redeploying the job? Are you using the
>>>> Native Kubernetes integration or Standalone (i.e. writing the k8s manifest
>>>> files yourself)? It sounds like you are using the Flink CLI as well, is
>>>> that correct?
>>>>
>>>> Thanks,
>>>> Austin
>>>>
>>>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> we have some application clusters running on Kubernetes and are exploring
>>>>> the HA mode, which is working as expected. When we try to upgrade a job,
>>>>> e.g. trigger a savepoint, cancel the job and redeploy, Flink does not
>>>>> restart from the savepoint we provide using the -s parameter, so all
>>>>> state is lost.
>>>>>
>>>>> If we just trigger the savepoint without canceling the job and redeploy,
>>>>> the HA mode picks up from the latest savepoint.
>>>>>
>>>>> But this way we cannot upgrade job properties, as they seem to be picked
>>>>> up from the savepoint.
>>>>>
>>>>> Is there any advice on how to do upgrades with HA enabled?
>>>>>
>>>>> The Flink version is 1.12.2.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> Kr thomas
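As mentioned above the quoted thread, here is a rough sketch of the savepoint-and-cancel step we automate before updating the Deployment. The JobManager address and job ID are placeholders for our setup; the calls use the standard monitoring REST API, and the returned savepoint location is what we then pass to standalone-job.sh via -s.

# Placeholders for our JobManager REST endpoint and the job to upgrade.
JM_ADDR=http://my-jobmanager:8081
JOB_ID=00000000000000000000000000000000

# Trigger a savepoint and cancel the job in one request.
curl -s -X POST "$JM_ADDR/jobs/$JOB_ID/savepoints" \
  -H "Content-Type: application/json" \
  -d '{"target-directory": "s3p://bucket/job1/savepoints", "cancel-job": true}'
# -> {"request-id": "<trigger-id>"}

# Poll the trigger until the savepoint is COMPLETED and read its location.
curl -s "$JM_ADDR/jobs/$JOB_ID/savepoints/<trigger-id>"
# -> {"status":{"id":"COMPLETED"},"operation":{"location":"s3p://bucket/job1/savepoints/savepoint-..."}}

# The reported location is then added to the JobManager start parameters
# on redeploy, e.g.: -s s3p://bucket/job1/savepoints/savepoint-...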