As a side note, I am assuming that you are using the same Flink job and the same Flink version before and after the savepoint. Am I correct?

Cheers,
Kostas

On Mon, Nov 25, 2019 at 2:40 PM Kostas Kloudas <kklou...@gmail.com> wrote:
>
> Hi Singh,
>
> This behaviour is strange.
> One thing I can recommend, to see if the two jobs are identical, is to
> also launch the second job without a savepoint (just start from scratch)
> and simply look at the web interface to see if everything is there.
>
> Also, could you please provide some code from your job, just to see if
> there is anything problematic with the application code?
> Normally there should be no problem with not providing UIDs for some
> stateless operators.
>
> Cheers,
> Kostas
>
> On Sat, Nov 23, 2019 at 11:16 AM M Singh <mans2si...@yahoo.com> wrote:
> >
> > Hey Folks:
> >
> > Please let me know how to resolve this issue, since using
> > --allowNonRestoredState without knowing whether any state will be lost
> > seems risky.
> >
> > Thanks
> >
> > On Friday, November 22, 2019, 02:55:09 PM EST, M Singh
> > <mans2si...@yahoo.com> wrote:
> >
> > Hi:
> >
> > I have a Flink application in which some of the operators have a uid and
> > name and some stateless ones don't.
> >
> > I've taken a savepoint and tried to start another instance of the
> > application from it. I get the following exception, which indicates that
> > an operator is not available to the new program, even though the second
> > job is the same as the first and is just running from the first job's
> > savepoint.
> >
> > Caused by: java.lang.IllegalStateException: Failed to rollback to
> > checkpoint/savepoint
> > s3://mybucket/state/savePoint/mysavepointfolder/66s4c6402d7532801287290436fa9fadd/savepoint-664c64-fa235d26d379.
> > Cannot map checkpoint/savepoint state for operator
> > d1a56c5a9ce5e3f1b03e01cac458bb4f to the new program, because the operator
> > is not available in the new program. If you want to allow to skip this,
> > you can set the --allowNonRestoredState option on the CLI.
> >     at org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:205)
> >     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1102)
> >     at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1219)
> >     at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1143)
> >     at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:294)
> >     at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
> >     ... 10 more
> >
> > I've also tried to start an application instance from a checkpoint of
> > the first instance, but it gives the same exception indicating that the
> > operator is not available.
> >
> > Questions:
> >
> > 1. Is this a problem because some of the operators don't have a uid?
> > 2. Is it required to have uids even for stateless operators like simple
> >    map or filter operators?
> > 3. Is there a way to find out which operator is not available in the new
> >    application, even though I am running the same application?
> > 4. Is there a way to figure out whether this is the only missing
> >    operator, or whether there are others whose mapping is missing for
> >    the second instance run?
> > 5. Is this issue resolved in Apache Flink 1.9 (I am still using Flink
> >    1.6)?
> >
> > If there are any additional pointers, please let me know.
> >
> > Thanks
> >
> > Mans
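
Regarding questions 3 and 4 in the quoted message, here is a minimal sketch of the comparison being asked about. It assumes you already have two lists of operator IDs (as hex strings), one taken from the savepoint and one from the new job. How to obtain those lists is not covered here, and the class name, method name, and placeholder IDs are made up for illustration; none of this is Flink API. Only the ID d1a56c5a9ce5e3f1b03e01cac458bb4f comes from the exception above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Flink: plain set difference on operator IDs.
final class OperatorIdDiff {

    // Returns the IDs that appear in the savepoint but not in the new job,
    // preserving the order in which they were listed for the savepoint.
    static Set<String> missingInNewJob(List<String> savepointIds, List<String> newJobIds) {
        Set<String> missing = new LinkedHashSet<>(savepointIds);
        missing.removeAll(new HashSet<>(newJobIds));
        return missing;
    }

    public static void main(String[] args) {
        // Assumed inputs: in practice these would be collected by the reader.
        List<String> savepointIds = Arrays.asList(
                "d1a56c5a9ce5e3f1b03e01cac458bb4f",   // ID from the exception in this thread
                "00000000000000000000000000000001");  // placeholder
        List<String> newJobIds = Arrays.asList(
                "00000000000000000000000000000001");  // placeholder

        for (String id : missingInNewJob(savepointIds, newJobIds)) {
            System.out.println("no matching operator in new job: " + id);
        }
    }
}
```

If more than one ID is reported, the operator named in the exception would not be the only one without a mapping, which is the situation question 4 asks about.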