Hello,

We recently ran into an unexpected scenario. Our stateful streaming pipeline uses RocksDB as the state backend and has incremental checkpointing enabled. We have RETAIN_ON_CANCELLATION enabled, so previous cancellations and restarts had left behind a lot of unattended checkpoint directories, amounting to almost 1 TB. Today we manually cleared these directories and left only the currently running job's checkpoint directory untouched.

A few hours later, the job hit an unrelated error and failed. When it attempted to restore from the latest successful checkpoint, the restore failed with:

java.io.FileNotFoundException: File does not exist: /path/to/an/older/checkpoint/45a55300adab66d7cc49ff5e50ee5b62/shared/f7ace888-059b-4256-966c-51c1549aa6e4
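The path in the exception points into an older job's shared directory, so the current checkpoint evidently still referenced files there. A check along the following lines, run before deleting anything, might have flagged that dependency. This is only a rough sketch under assumptions: it treats the checkpoint's _metadata file as bytes in which referenced file paths appear as readable strings, it assumes job directories are named with 32 hex characters (as in the path above), and it assumes the checkpoint root and the _metadata file are reachable through the local filesystem. The function names are hypothetical and are not part of Flink.

```python
import re
from pathlib import Path

# Assumption: job directories are 32 hex characters, as in the path from the exception,
# and contain "shared", "taskowned" or "chk-N" subdirectories.
JOB_ID_DIR = re.compile(rb"/([0-9a-f]{32})/(?:shared|taskowned|chk-\d+)/")


def referenced_job_ids(metadata_file):
    """Return the job-ID directory names mentioned inside a checkpoint _metadata file.

    Heuristic only: assumes referenced paths are stored as readable byte strings.
    """
    data = Path(metadata_file).read_bytes()
    return {m.group(1).decode("ascii") for m in JOB_ID_DIR.finditer(data)}


def safe_to_delete(checkpoint_root, current_job_id, metadata_file):
    """List job directories under checkpoint_root that the metadata does not mention."""
    keep = referenced_job_ids(metadata_file) | {current_job_id}
    return sorted(
        d.name
        for d in Path(checkpoint_root).iterdir()
        if d.is_dir() and d.name not in keep
    )
```

Calling safe_to_delete with the checkpoint root, the running job's ID, and the path to the latest chk-N/_metadata would return only those directories the metadata never mentions. A directory missing from that list is a sign that it may still be needed, not proof of the opposite, so the result should be treated as a hint rather than a guarantee.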
So I have a few questions:

- Are we not supposed to clear these older checkpoint directories, which were created by previous runs of the pipeline?
- Does the /shared directory under the current checkpoint directory not have all the files necessary to recover?
- What is the recommended procedure for clearing remnant checkpoint directories?

Here, by remnant, I mean the checkpoint directories of previous runs of the job that were cancelled and then manually restarted from the latest checkpoint (let's say chk-123). The new job is running fine and has made further checkpoints. Can we delete chk-123?

Thanks,
Sivaprasanna