Hi Amran,

See my inline answers.
Best,
Vino

amran dean <adfs54545...@gmail.com> wrote on Wed, Oct 30, 2019 at 2:59 AM:

> Hello,
> Exact semantics for checkpointing/task recovery are still a little
> confusing to me after parsing docs: so a few questions.
>
> - What does Flink consider a task failure? Is it any exception that the
> job does not handle?
>

*Flink considers a task to have failed whenever anything makes the task itself unable to continue running.*

>
> - Do the failure recovery strategies mentioned in
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html
> refer to restarting from the most recent checkpoint?
> E.g for fixed-delay recoveries, a fixed number of restarts from a specific
> checkpoint are attempted.
>

*For an automatic restart, Flink tries to restore from the most recent checkpoint.*

>
> - The docs mention the following command to resume from a checkpoint. In
> the checkpoint metadata path I have configured, I only see a series of
> directories named by hashes:
>
> - 24c8d7a38dd90ca8bd5f04c36d1442ba
>   - shared
>   - taskowned
> - 5d202a0ba04cdc1b917892c1e35d00dc
>   - shared
>   - taskowned
> How do I know which is the most recent checkpoint?
>

*In the checkpoint directory corresponding to the job ID, you should see folders with names like "chk-xxx". Specify the path of the one you want to resume from. For more details, see [1].*

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint

>
> Really appreciate any help. Thank you.
>
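
P.S. If it helps with the last question, here is a minimal Python sketch for picking the newest "chk-xxx" folder under a job's checkpoint directory. It is not part of Flink: the function name `latest_checkpoint` is my own, and it assumes that the checkpoint folders are named `chk-<number>` and that a larger number means a more recent checkpoint. Please check both assumptions against your own directory before relying on it.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint folders are named "chk-<number>", e.g. "chk-42".
_CHK_PATTERN = re.compile(r"^chk-(\d+)$")


def latest_checkpoint(job_checkpoint_dir: str) -> Optional[Path]:
    """Return the chk-<number> subdirectory with the largest number.

    Hypothetical helper, not a Flink API. Returns None when the directory
    contains no chk-<number> folder (for example only shared/ and taskowned/).
    """
    best: Optional[Path] = None
    best_id = -1
    for entry in Path(job_checkpoint_dir).iterdir():
        match = _CHK_PATTERN.match(entry.name)
        if match and entry.is_dir():
            # Compare numerically so that chk-10 ranks above chk-9.
            chk_id = int(match.group(1))
            if chk_id > best_id:
                best, best_id = entry, chk_id
    return best
```

You would call it with the directory named after your job ID (one of the hash-named directories you listed) and pass the returned path to the resume command described in [1]. If it returns None, that directory holds no "chk-xxx" folder to resume from.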