Hello,

The exact semantics of checkpointing and task recovery are still a little confusing to me after reading the docs, so I have a few questions.

- What does Flink consider a task failure? Is it any exception that the job does not handle?
- Do the failure recovery strategies mentioned in https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html refer to restarting from the most recent checkpoint? E.g., for the fixed-delay strategy, is a fixed number of restarts attempted from a specific checkpoint?
- The docs mention a command to resume from a checkpoint. In the checkpoint metadata path I have configured, I only see a series of directories named by hashes:

    - 24c8d7a38dd90ca8bd5f04c36d1442ba
      - shared
      - taskowned
    - 5d202a0ba04cdc1b917892c1e35d00dc
      - shared
      - taskowned

  How do I know which is the most recent checkpoint?

I would really appreciate any help. Thank you.