> I am not able to consistently reproduce this issue. It seems to only occur
> when the failover happens at the wrong time. I have disabled task local
> recovery and will report back if we see this again.
Thanks, please post any results here.

> The SST files are not the ones for task local recovery, those would be in a
> different directory (we have configured io.tmp.dirs as /mnt/data/tmp).

Those files on /mnt could still be checked against the ones in the checkpoint
directories (on S3/DFS); the sizes should match (a rough sketch of such a
check is appended after the quoted thread below).

I'm also curious why you place local recovery files on a remote FS (I assume
/mnt/data/tmp is a remote FS or a persistent volume). Currently, if a TM is
lost (e.g. the process dies), those files cannot be used and recovery will
fall back to S3/DFS, so this probably incurs some IO/latency unnecessarily.

Regards,
Roman

On Tue, May 25, 2021 at 2:16 PM Peter Westermann
<no.westerm...@genesys.com> wrote:
>
> Hi Roman,
>
> I am not able to consistently reproduce this issue. It seems to only occur
> when the failover happens at the wrong time. I have disabled task local
> recovery and will report back if we see this again. We need incremental
> checkpoints for our workload.
>
> The SST files are not the ones for task local recovery, those would be in a
> different directory (we have configured io.tmp.dirs as /mnt/data/tmp).
>
> Thanks,
> Peter
>
> From: Roman Khachatryan <ro...@apache.org>
> Date: Thursday, May 20, 2021 at 4:54 PM
> To: Peter Westermann <no.westerm...@genesys.com>
> Cc: user@flink.apache.org <user@flink.apache.org>
> Subject: Re: Job recovery issues with state restoration
>
> Hi Peter,
>
> Do you experience this issue if running without local recovery or
> incremental checkpoints enabled?
> Or have you maybe compared local (on TM) and remote (on DFS) SST files?
>
> Regards,
> Roman
>
> On Thu, May 20, 2021 at 5:54 PM Peter Westermann
> <no.westerm...@genesys.com> wrote:
> >
> > Hello,
> >
> > I’ve reported issues around checkpoint recovery in case of a job failure
> > due to zookeeper connection loss in the past. I am still seeing issues
> > occasionally.
> >
> > This is for Flink 1.12.3 with zookeeper for HA, S3 as the state backend,
> > incremental checkpoints, and task-local recovery enabled.
> >
> > Here’s what happened: A zookeeper instance was terminated as part of a
> > deployment for our zookeeper service; this caused a new jobmanager leader
> > election (so far so good). A leader was elected and the job was restarted
> > from the latest checkpoint but never became healthy. The root exception
> > and the logs show issues reading state:
> >
> > o.r.RocksDBException: Sst file size mismatch:
> > /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003579.sst.
> > Size recorded in manifest 36718, actual size 2570\
> > Sst file size mismatch:
> > /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003573.sst.
> > Size recorded in manifest 13756, actual size 1307\
> > Sst file size mismatch:
> > /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003575.sst.
> > Size recorded in manifest 16278, actual size 1138\
> > Sst file size mismatch:
> > /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003576.sst.
> > Size recorded in manifest 23108, actual size 1267\
> > Sst file size mismatch:
> > /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003577.sst.
> > Size recorded in manifest 148089, actual size 1293\
> > \
> > \\tat org.rocksdb.RocksDB.open(RocksDB.java)\
> > \\tat org.rocksdb.RocksDB.open(RocksDB.java:286)\
> > \\tat o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)\
> > \\t... 22 common frames omitted\
> > Wrapped by: java.io.IOException: Error while opening RocksDB instance.\
> > \\tat o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)\
> > \\tat o.a.f.c.s.s.r.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:145)\
> > \\tat o.a.f.c.s.s.r.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOper...
> >
> > Since we retain multiple checkpoints, I tried redeploying the job from all
> > checkpoints that were still available. All those attempts led to similar
> > failures. (I eventually had to use an older savepoint to recover the job.)
> >
> > Any guidance for avoiding this would be appreciated.
> >
> > Peter
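
A minimal sketch of the size check described above, assuming boto3 and
placeholder values for the checkpoint bucket/prefix and the io.tmp.dirs
location. Incremental-checkpoint files are stored under UUID-based names in
the checkpoint's shared/ directory, so this only compares file sizes, not
file names:

# Hedged sketch: list the sizes of the SST files Flink left under io.tmp.dirs
# and flag any whose size has no counterpart among the objects below the
# checkpoint's shared/ prefix on S3. Bucket, prefix, and local path are
# placeholders, not values taken from the thread.
from collections import Counter
from pathlib import Path

import boto3

LOCAL_TMP_DIR = Path("/mnt/data/tmp")      # io.tmp.dirs, as configured (assumption)
BUCKET = "my-flink-checkpoints"            # placeholder checkpoint bucket
PREFIX = "checkpoints/<job-id>/shared/"    # placeholder shared-state prefix


def local_sst_sizes(root: Path) -> dict:
    """Map each local RocksDB SST file under flink-io-* to its on-disk size."""
    return {p: p.stat().st_size for p in root.glob("flink-io-*/**/db/*.sst")}


def remote_object_sizes(bucket: str, prefix: str) -> Counter:
    """Count the sizes of all objects below the checkpoint's shared/ prefix."""
    s3 = boto3.client("s3")
    sizes = Counter()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            sizes[obj["Size"]] += 1
    return sizes


if __name__ == "__main__":
    remote = remote_object_sizes(BUCKET, PREFIX)
    for path, size in sorted(local_sst_sizes(LOCAL_TMP_DIR).items()):
        marker = "ok" if remote.get(size) else "NO REMOTE OBJECT OF THIS SIZE"
        print(f"{path}  {size} bytes  {marker}")

A size match found this way is only a heuristic (different files can share a
size), but a local SST whose size matches nothing in the checkpoint directory
would point to the kind of truncation reported in the RocksDB error above.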