Unable to recover from checkpoint

Sivaprasanna Thu, 30 Jul 2020 05:46:17 -0700

Hello,

We recently ran into an unexpected scenario. Our stateful streaming
pipeline uses RocksDB as the state backend and has incremental checkpointing
enabled. We have RETAIN_ON_CANCELLATION enabled, so earlier cancellations
and restarts had left behind a lot of unattended checkpoint directories,
which amounted to almost 1 TB. Today we manually cleared these
directories and left the currently running job's checkpoint directory
untouched. A few hours later, the job ran into an unrelated error and failed,
but when it attempted to restore from the latest successful checkpoint, it
failed with java.io.FileNotFoundException: File does not exist:
/path/to/an/older/checkpoint/45a55300adab66d7cc49ff5e50ee5b62/shared/f7ace888-059b-4256-966c-51c1549aa6e4


So I have a few questions:
- Are we not supposed to clear these older checkpoint directories which
were created by previous runs of the pipeline?
- Does the /shared directory under the current checkpoint directory not
have all the necessary files to recover?
- What is the recommended procedure to clear remnant checkpoint
directories? Here, by remnant, I mean directories from previous runs of the
job that were cancelled and then manually restarted from the latest
checkpoint (let's say chk-123). The new job is running fine and has made
further checkpoints. Can we delete chk-123?
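
To make the last question concrete, here is a rough sketch of the kind of
pre-deletion check we have in mind. It rests on assumptions that we have not
confirmed: that the checkpoint's _metadata file stores the paths of shared
files as readable strings, and that each older checkpoint directory is named
after its 32-character hex job ID, as in the path above. The function names
are hypothetical and this is a heuristic, not an official Flink tool.

```python
import re
from pathlib import Path

# Heuristic: a job ID is a run of exactly 32 lowercase hex characters.
JOB_ID_PATTERN = re.compile(rb"(?<![0-9a-f])[0-9a-f]{32}(?![0-9a-f])")


def referenced_job_ids(metadata_file):
    """Return every 32-hex-character ID found in the raw bytes of a file."""
    data = Path(metadata_file).read_bytes()
    return {match.decode("ascii") for match in JOB_ID_PATTERN.findall(data)}


def dirs_still_referenced(metadata_file, candidate_dirs):
    """Return the candidate directories whose name appears in the metadata."""
    ids = referenced_job_ids(metadata_file)
    return [d for d in candidate_dirs if Path(d).name in ids]
```

If dirs_still_referenced returned a non-empty list for the latest checkpoint's
_metadata file, we would leave those directories in place. An empty result
would not prove that deletion is safe, since the assumptions above may not
hold.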

Thanks,
Sivaprasanna
