Great, glad it was an easy fix :) Thanks for following up!

On Fri, Jul 23, 2021 at 3:54 AM Thms Hmm <thms....@gmail.com> wrote:
> Finally I found the mistake. I put the "--host 10.1.2.3" param as one
> argument. I think the savepoint argument was not interpreted correctly or
> was ignored. It might be that "-s" was used as the value for "--host
> 10.1.2.3" and "s3p://…" as a new param, and because these are not valid
> arguments they were ignored.
>
> Not working:
>
> 23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
> ...
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host 10.1.2.3
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234
>
> -----
>
> Working:
>
> 23.07.2021 09:19:54.546 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
> ...
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 10.1.2.3
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -s
> 23.07.2021 09:19:54.549 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - s3p://bucket/job1/savepoints/savepoint-000000-1234
> ...
> 23.07.2021 09:37:12.932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job 00000000000000000000000000000000 from savepoint s3p://bucket/job1/savepoints/savepoint-000000-1234 ()
>
> Thanks again for your help.
>
> Kr Thomas
>
> Yang Wang <danrtsey...@gmail.com> wrote on Fri, Jul 23, 2021 at 04:34:
>
>> Please note that when the job is canceled, the HA data (including the
>> checkpoint pointers) stored in the ConfigMap/ZNode will be deleted.
>>
>> But it is strange that the "-s/--fromSavepoint" option does not take effect when
>> redeploying the Flink application.
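The root cause described above, a flag and its value packed into a single argv entry, can be reproduced with a minimal shell sketch. The `print_args` helper is hypothetical (a stand-in for how `standalone-job.sh` receives and logs its program arguments); the savepoint path is a placeholder, not the one from the original logs.

```shell
#!/bin/sh
# Stand-in for an entrypoint script: print each argv entry on its own
# line, the way ClusterEntrypoint logs "Program Arguments".
print_args() {
  for a in "$@"; do
    printf '%s\n' "$a"
  done
}

# Broken: "--host 10.1.2.3" is ONE argv entry. The option parser sees an
# unknown option named "--host 10.1.2.3", and the following -s/savepoint
# pair can be misinterpreted or dropped.
print_args "--host 10.1.2.3" -s "s3p://bucket/savepoints/sp-1"

echo "-----"

# Working: flag and value are TWO separate argv entries.
print_args --host 10.1.2.3 -s "s3p://bucket/savepoints/sp-1"
```

In Kubernetes manifests the same rule applies: each flag and each value must be its own element of the container `args:` list, since the list is passed to the process verbatim without shell word-splitting.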
>> The JobManager logs could help a lot to find the root cause.
>>
>> Best,
>> Yang
>>
>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, Jul 22, 2021 at 11:09 PM:
>>
>>> Hey Thomas,
>>>
>>> Hmm, I see no reason why you should not be able to update the checkpoint
>>> interval at runtime, and I don't believe that information is stored in a
>>> savepoint. Can you share the JobManager logs of the job where this is
>>> ignored?
>>>
>>> Thanks,
>>> Austin
>>>
>>> On Wed, Jul 21, 2021 at 11:47 AM Thms Hmm <thms....@gmail.com> wrote:
>>>
>>>> Hey Austin,
>>>>
>>>> Thanks for your help.
>>>>
>>>> I tried to change the checkpoint interval as an example. The value for it
>>>> comes from an additional config file and is read and set within main() of
>>>> the job.
>>>>
>>>> The job is running in Application mode. It is basically the same
>>>> configuration as from the official Flink website, but instead of running the
>>>> JobManager as a job, it is created as a deployment.
>>>>
>>>> For the redeployment of the job, the REST API is triggered to create a
>>>> savepoint and cancel the job. After completion, the deployment is updated
>>>> and the pods are recreated. The "-s <latest_savepoint>" is always added as a
>>>> parameter to start the JobManager (standalone-job.sh). The CLI is not involved.
>>>> We have automated these steps, but I also tried the steps manually and got the
>>>> same results.
>>>>
>>>> I also tried to trigger a savepoint, scale the pods down, update the
>>>> start parameter with the recent savepoint, and renamed
>>>> 'kubernetes.cluster-id' as well as 'high-availability.storageDir'.
>>>>
>>>> When I trigger a savepoint with cancel, I also see that the HA config
>>>> maps are cleaned up.
>>>>
>>>> Kr Thomas
>>>>
>>>> Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Wed, Jul 21, 2021 at 16:52:
>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> I've got a few questions that will hopefully help find an answer:
>>>>>
>>>>> What job properties are you trying to change?
>>>>> Something like parallelism?
>>>>>
>>>>> What mode is your job running in? I.e., Session, Per-Job, or
>>>>> Application?
>>>>>
>>>>> Can you also describe how you're redeploying the job? Are you using
>>>>> the Native Kubernetes integration or Standalone (i.e., writing k8s manifest
>>>>> files yourself)? It sounds like you are using the Flink CLI as well, is
>>>>> that correct?
>>>>>
>>>>> Thanks,
>>>>> Austin
>>>>>
>>>>> On Wed, Jul 21, 2021 at 4:05 AM Thms Hmm <thms....@gmail.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> we have some application clusters running on Kubernetes and are exploring
>>>>>> the HA mode, which is working as expected. When we try to upgrade a job,
>>>>>> e.g. trigger a savepoint, cancel the job, and redeploy, Flink is not
>>>>>> restarting from the savepoint we provide using the -s parameter, so all
>>>>>> state is lost.
>>>>>>
>>>>>> If we just trigger the savepoint without canceling the job and
>>>>>> redeploy, the HA mode picks up from the latest checkpoint.
>>>>>>
>>>>>> But this way we cannot upgrade job properties, as they seem to be picked up
>>>>>> from the savepoint.
>>>>>>
>>>>>> Is there any advice on how to do upgrades with HA enabled?
>>>>>>
>>>>>> Flink version is 1.12.2.
>>>>>>
>>>>>> Thanks for your help.
>>>>>>
>>>>>> Kr thomas
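The automated upgrade flow described in this thread (trigger a savepoint with cancel via the REST API, then restart the JobManager with `-s`) can be sketched as below. This is a hedged sketch, not the poster's actual tooling: the REST address, job id, and savepoint directory are placeholders, and only the request-building helper is exercised here. The endpoints used (`POST /jobs/<jobid>/savepoints` with a `cancel-job` field, and polling `GET /jobs/<jobid>/savepoints/<trigger-id>`) are part of Flink's documented REST API.

```shell
#!/bin/sh
# Placeholders -- substitute your JobManager address and job id.
FLINK_REST="http://jobmanager.example:8081"
JOB_ID="00000000000000000000000000000000"

# Build the JSON body for a savepoint-with-cancel request.
savepoint_body() {
  printf '{"target-directory": "%s", "cancel-job": true}' "$1"
}

# 1) Trigger a savepoint and cancel the job in one request.
#    The response contains a "request-id" (the trigger id).
trigger_savepoint_with_cancel() {
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -d "$(savepoint_body "$1")" \
    "$FLINK_REST/jobs/$JOB_ID/savepoints"
}

# 2) Poll GET $FLINK_REST/jobs/$JOB_ID/savepoints/<trigger-id> until the
#    status is COMPLETED, then read the final savepoint location from the
#    response and restart the JobManager with it -- note the flag and the
#    path as TWO separate arguments:
#
#      standalone-job.sh ... -s <savepoint-location>
```

As Yang Wang notes above, cancellation cleans up the HA data in the ConfigMaps, so the `-s` argument on the redeployed JobManager is what carries the state forward; it must therefore be parsed correctly.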