Hello,

I’ve reported issues with checkpoint recovery after job failures caused by
ZooKeeper connection loss in the past, and I am still seeing them occasionally.
This is on Flink 1.12.3 with ZooKeeper for HA, the RocksDB state backend with
checkpoints on S3, incremental checkpoints, and task-local recovery enabled.
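
For reference, the relevant parts of our flink-conf.yaml look roughly like
this (bucket names and the ZooKeeper quorum are placeholders):

    state.backend: rocksdb
    state.backend.incremental: true
    state.backend.local-recovery: true
    state.checkpoints.dir: s3://<bucket>/checkpoints
    high-availability: zookeeper
    high-availability.zookeeper.quorum: <zk-1>:2181,<zk-2>:2181,<zk-3>:2181
    high-availability.storageDir: s3://<bucket>/ha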

Here’s what happened: a ZooKeeper instance was terminated as part of a
deployment of our ZooKeeper service, which triggered a new JobManager leader
election (so far so good). A leader was elected and the job was restarted from
the latest checkpoint, but it never became healthy. The root exception and the
logs show problems reading state:
o.r.RocksDBException: Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003579.sst. Size recorded in manifest 36718, actual size 2570
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003573.sst. Size recorded in manifest 13756, actual size 1307
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003575.sst. Size recorded in manifest 16278, actual size 1138
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003576.sst. Size recorded in manifest 23108, actual size 1267
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003577.sst. Size recorded in manifest 148089, actual size 1293

    at org.rocksdb.RocksDB.open(RocksDB.java)
    at org.rocksdb.RocksDB.open(RocksDB.java:286)
    at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
    ... 22 common frames omitted
Wrapped by: java.io.IOException: Error while opening RocksDB instance.
    at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
    at o.a.f.c.s.s.r.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:145)
    at o.a.f.c.s.s.r.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOper...

Since we retain multiple checkpoints, I tried redeploying the job from every
checkpoint that was still available, but all of those attempts led to similar
failures. (I eventually had to recover the job from an older savepoint.)
Any guidance for avoiding this would be appreciated.

Peter
