Hi Sivaprasanna,

With RocksDBStateBackend incremental checkpoints, the latest checkpoint
may still reference files from previous checkpoints (the files in the
shared directory), so deleting files that belong to a previous checkpoint
may lead to a FileNotFoundException. Currently, the only way to find out
which files belong to a specific checkpoint is to parse the metadata
manually. There is an issue, FLINK-17571, that asks for a way to show the
files belonging to a specific checkpoint.
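
As a rough illustration of parsing the metadata manually, here is a minimal
Python sketch. It is a heuristic, not an official Flink tool, and it rests on
assumptions: that the checkpoint's _metadata file stores shared file paths as
plain text, that they follow the layout <base>/<32-hex job id>/shared/<uuid>
seen in the error message below, and that small state inlined into the
metadata needs no file at all. The function names and the base_dir parameter
are made up for this example.

```python
import re

def referenced_shared_files(metadata_bytes, base_dir):
    """Return the shared-state file paths that appear in a _metadata blob.

    Heuristic: scans the raw bytes for <base_dir>/<job id>/shared/<uuid>.
    base_dir is the checkpoint base directory (an assumed input).
    """
    prefix = re.escape(base_dir.rstrip("/").encode("utf-8"))
    pattern = re.compile(
        prefix
        + rb"/[0-9a-f]{32}/shared/"                       # job id directory
        + rb"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"  # uuid file name
    )
    return {m.group(0).decode("utf-8") for m in pattern.finditer(metadata_bytes)}

def referenced_job_dirs(shared_files):
    """Map shared file paths to the job checkpoint directories that own them."""
    return {path.rsplit("/shared/", 1)[0] for path in shared_files}
```

Reading chk-N/_metadata of the checkpoint you want to keep and passing its
bytes to referenced_shared_files gives the set of shared files it still
needs. Any job directory returned by referenced_job_dirs should not be
deleted while that checkpoint is in use.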

Best,
Congxian


Sivaprasanna <sivaprasanna...@gmail.com> wrote on Thu, Jul 30, 2020 at 8:46 PM:

> Hello,
>
> We recently ran into an unexpected scenario. Our stateful streaming
> pipeline uses RocksDB as the backend and has incremental checkpointing
> enabled. We have RETAIN_ON_CANCELLATION enabled, so some of the previous
> cancellations and restarts had left behind a lot of unattended checkpoint
> directories, which amounted to almost 1 TB. Today we manually cleared these
> directories and left the currently running job's checkpoint directory
> untouched. A few hours later, the job ran into some other error and failed,
> but when it attempted to use the latest successful checkpoint, it failed
> saying java.io.FileNotFoundException: File does not exist:
> /path/to/an/older/checkpoint/45a55300adab66d7cc49ff5e50ee5b62/shared/f7ace888-059b-4256-966c-51c1549aa6e4
>
> So I have a few questions:
> - Are we not supposed to clear these older checkpoint directories which
> were created by previous runs of the pipeline?
> - Does the /shared directory under the current checkpoint directory not
> have all the necessary files to recover?
> - What is the recommended procedure to clear remnant checkpoint
> directories? Here, by remnant, I mean previous runs of the job that were
> cancelled and then manually restarted from the latest checkpoint (let's say
> chk-123). The new job is running fine and has made further checkpoints. Can
> we delete chk-123?
>
> Thanks,
> Sivaprasanna
>
