Hello Parag,

Looking at the last command you sent, it seems that you are passing only the savepoint directory when restarting the job, not the path of a specific savepoint instance.
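For illustration, here is a minimal sketch of how the full path could be assembled before it is handed to "--fromSavepoint". The bucket name is taken from your command and the directory layout follows the convention described in this mail; the <JOB_ID> and <SAVEPOINT_ID> values are placeholders (assumptions on my side), so please replace them with what you actually find in the bucket.

```shell
#!/bin/sh
# Savepoint directory as passed in the original command.
SAVEPOINT_DIRECTORY="s3://s3-health-service-discovery/savepoints"

# Placeholders (assumptions): replace with the real job ID and savepoint ID
# found under the savepoint directory.
JOB_ID="<JOB_ID>"
SAVEPOINT_ID="<SAVEPOINT_ID>"

# Full path of one concrete savepoint instance, not just the parent directory.
SAVEPOINT_PATH="${SAVEPOINT_DIRECTORY}/${JOB_ID}/savepoint-${SAVEPOINT_ID}"

# Print the argument pair that would replace the directory-only value, e.g.
#   /docker-entrypoint.sh "standalone-job" ... "--fromSavepoint" "${SAVEPOINT_PATH}" ${args}
printf '%s %s\n' "--fromSavepoint" "${SAVEPOINT_PATH}"
```

The snippet only prints the argument pair; it does not start Flink. The point is that the value after "--fromSavepoint" ends in a concrete savepoint-<SAVEPOINT_ID> entry rather than in the parent directory.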
When a savepoint is completed, it is usually materialized under
<SAVEPOINT_DIRECTORY>/<JOB_ID>/savepoint-<SAVEPOINT_ID>. Can you please try to
find the latest successful savepoint under <SAVEPOINT_DIRECTORY>/<JOB_ID> and
pass the full path of the savepoint instance?

Sincerely,
Ali

On Wed, Oct 6, 2021 at 12:58 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

> Hi Parag,
>
> When you restore from a savepoint do you see a line like: "Restoring job
> {} from {}" in jobmanagers logs? Is the entire state lost or just part of
> it? Could you explain a bit what does your job look like and how do you
> check that the state is lost?
>
> Sorry if too obvious, but what are the "accumulators" you refer to? Are
> they *State primitives[1] or really constructs that are called
> Accumulator[2]? The latter are not checkpointed.
>
> Best,
>
> Dawid
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#using-keyed-state
>
> [2]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/user_defined_functions/#accumulators--counters
>
> On 06/10/2021 08:42, Parag Somani wrote:
>
> Yes Nico. I have evaluated this.
>
> I have tried below:
>
> 1. Take the savepoint
> 2. Stop the job
> 3. Shutdown the instances
> 4. Started new pod using below command:
>
> /docker-entrypoint.sh "standalone-job"
> "-Ds3.access-key=${AWS_ACCESS_KEY_ID}"
> "-Ds3.secret-key=${AWS_SECRET_ACCESS_KEY}"
> "-Ds3.endpoint=${AWS_S3_ENDPOINT}"
> "-Dhigh-availability.zookeeper.quorum=${ZOOKEEPER_CLUSTER}"
> "--job-classname" "com.test.MySpringBootApp"
> "--fromSavepoint" "s3://s3-health-service-discovery/savepoints" ${args}
>
> I haven't observed any errors during start-up in logs. But the state got
> reset i.e. values stored inside the accumulator got flushed.
>
> On Tue, Oct 5, 2021 at 9:40 PM Nicolaus Weidner <nicolaus.weid...@ververica.com> wrote:
>
>> Hi Parag,
>>
>> I am not so familiar with the setup you are using, but did you check out
>> [1]? Maybe the parameter
>> [--fromSavepoint /path/to/savepoint [--allowNonRestoredState]]
>> is what you are looking for?
>>
>> Best regards,
>> Nico
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/#application-mode-on-docker
>>
>> On Tue, Oct 5, 2021 at 12:37 PM Parag Somani <somanipa...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We are currently using Apache flink 1.12.0 deployed on k8s cluster of
>>> 1.18 with zk for HA. Due to certain vulnerabilities in container related
>>> with few jar(like netty-*, meso), we are forced to upgrade.
>>>
>>> While upgrading flink to 1.14.0, faced NPE,
>>> https://issues.apache.org/jira/browse/FLINK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17402570#comment-17402570
>>>
>>> To address it, I have followed steps
>>>
>>> 1. savepoint creation
>>> 2. Stop the job
>>> 3. Restore from save point where i am facing challenge.
>>>
>>> For step #3 from above, i was able to restore from savepoint mainly
>>> because:
>>> "bin/flink run -s :savepointPath [:runArgs] "
>>> It majorly about restarting a jar file uploaded. As our application is
>>> based on k8s and running using docker, i was not able to restore it. And
>>> because of it, state of variables in accumulator got corrupted and i lost
>>> the data in one of env.
>>>
>>> My query is, what is preffered way to restore from savepoint, if
>>> application is running on k8s using docker.
>>>
>>> We are using following command to run job manager:
>>> /docker-entrypoint.sh "standalone-job"
>>> "-Ds3.access-key=${AWS_ACCESS_KEY_ID}"
>>> "-Ds3.secret-key=${AWS_SECRET_ACCESS_KEY}"
>>> "-Ds3.endpoint=${AWS_S3_ENDPOINT}"
>>> "-Dhigh-availability.zookeeper.quorum=${ZOOKEEPER_CLUSTER}"
>>> "--job-classname" "<class-name>" ${args}
>>>
>>> Thank you in advance...!
>>>
>>> --
>>> Regards,
>>> Parag Surajmal Somani.
>>>
>>
>
> --
> Regards,
> Parag Surajmal Somani.
>

--
Ali Bahadır Zeybek | Solutions Architect