Hi,

where exactly did you read many times that incremental checkpoints cannot reference files from previous checkpoints? We would have to correct that information, because referencing files from previous checkpoints is exactly how incremental checkpoints work. That said, in this case I would consider it extremely unlikely that checkpoint 1620 still references a file from checkpoint 1, in particular since the files for that checkpoint have already been deleted, which should only happen once they are no longer referenced. Which version of Flink are you using, and what is your distributed filesystem? Is there any way to reproduce the problem?
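To illustrate the idea of "a file is only deleted once no retained checkpoint references it", here is a minimal reference-counting sketch. It is not Flink's actual implementation; the class name `SharedFileRefCounter`, its methods, and the `.sst` file names are all hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; this is not Flink's real bookkeeping code. */
class SharedFileRefCounter {
    // file name -> number of retained checkpoints that still reference it
    private final Map<String, Integer> refCounts = new HashMap<>();
    // checkpoint id -> files that this checkpoint references (new or inherited)
    private final Map<Long, List<String>> filesByCheckpoint = new HashMap<>();

    /** Records a completed checkpoint together with every file it references. */
    void register(long checkpointId, List<String> files) {
        filesByCheckpoint.put(checkpointId, new ArrayList<>(files));
        for (String file : files) {
            refCounts.merge(file, 1, Integer::sum);
        }
    }

    /** Drops a checkpoint and returns the files that no remaining checkpoint references. */
    List<String> discard(long checkpointId) {
        List<String> deletable = new ArrayList<>();
        List<String> files = filesByCheckpoint.remove(checkpointId);
        if (files == null) {
            return deletable; // unknown checkpoint: nothing to release
        }
        for (String file : files) {
            // Decrement the count; returning null removes the entry when it reaches zero.
            Integer remaining = refCounts.computeIfPresent(file, (k, c) -> c == 1 ? null : c - 1);
            if (remaining == null) {
                deletable.add(file);
            }
        }
        return deletable;
    }

    boolean isReferenced(String file) {
        return refCounts.containsKey(file);
    }

    public static void main(String[] args) {
        SharedFileRefCounter registry = new SharedFileRefCounter();
        registry.register(1L, Arrays.asList("a.sst", "b.sst"));
        // Checkpoint 2 is incremental: it reuses a.sst and adds c.sst.
        registry.register(2L, Arrays.asList("a.sst", "c.sst"));
        // Discarding checkpoint 1 may only delete b.sst; a.sst is still in use.
        System.out.println(registry.discard(1L));
        System.out.println(registry.isReferenced("a.sst"));
    }
}
```

In this sketch, a restore that fails on a missing file from chk-1 would mean either that the latest checkpoint still referenced that file when it was deleted, or that the file was removed outside of this bookkeeping, which is why the Flink version and the filesystem matter here.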
Best,
Stefan

> On 21.11.2017 at 14:30, gerardg <ger...@talaia.io> wrote:
>
> Hello,
>
> We have a task that fails to restart from a checkpoint with the following
> error:
>
> java.lang.IllegalStateException: Could not initialize keyed state backend.
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:321)
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:217)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:676)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:663)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: /home/gluster/flink/checkpoints/fac589c7248186bda2ad7b711f174973/chk-1/a069f85e-4ceb-4fba-9308-fb238f31574f (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.<init>(FileInputStream.java:138)
>     at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:49)
>     at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
>     at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:70)
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.readStateData(RocksDBKeyedStateBackend.java:1290)
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.readAllStateData(RocksDBKeyedStateBackend.java:1477)
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstance(RocksDBKeyedStateBackend.java:1333)
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:1512)
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:979)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.createKeyedStateBackend(StreamTask.java:772)
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:311)
>     ... 6 common frames omitted
>
> It seems that it tries to restore the job using checkpoint number 1 (which
> was automatically deleted by flink), when the latest checkpoint is the 1620.
> And I can actually see how it logged that it would try to restore from
> checkpoint 1620:
>
> Found 1 checkpoints in ZooKeeper.
> Trying to retrieve checkpoint 1620.
> Restoring from latest valid checkpoint: Checkpoint 1620 @ 1511267100332 for
> fac589c7248186bda2ad7b711f174973.
>
> I have incremental checkpointing enabled, but I read many times that
> checkpoints do not reference themselves so I'm not sure what could be
> happening.
>
> Gerard
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/