Hi Amran,

See my inline answers.

Best,
Vino

On Wed, Oct 30, 2019 at 2:59 AM, amran dean <adfs54545...@gmail.com> wrote:

> Hello,
> Exact semantics for checkpointing/task recovery are still a little
> confusing to me after parsing docs: so a few questions.
>
> - What does Flink consider a task failure? Is it any exception that the
> job does not handle?
>

*Flink treats as a task failure anything that makes the task itself unable
to continue running.*

>
> - Do the failure recovery strategies mentioned in
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html
>  refer
> to restarting from the most recent checkpoint?
> E.g for fixed-delay recoveries, a fixed number of restarts from a specific
> checkpoint are attempted.
>

*For an automatic restart, Flink tries to restore from the most recent
completed checkpoint.*

>
> - The docs mention the following command to resume from a checkpoint. In
> the checkpoint metadata path I have configured, I only see a series of
> directories named by hashes:
>
> - 24c8d7a38dd90ca8bd5f04c36d1442ba
>     - shared
>     - taskowned
> - 5d202a0ba04cdc1b917892c1e35d00dc
>     - shared
>     - taskowned
> How do I know which is the most recent checkpoint?
>

*In the checkpoint directory for your job ID, you should see folders with
names like "chk-xxx". Specify the path to one of those folders. For more
details, see [1].*

[1]:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint
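To pick the most recent "chk-xxx" folder programmatically, something like the following might work. It is only a minimal sketch, not anything Flink ships. It assumes the checkpoints sit on a local or mounted filesystem, that the number after "chk-" grows with each checkpoint, and that a completed checkpoint folder contains a `_metadata` file. The function name `latest_checkpoint` and its parameters are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Folder names of the form chk-<number>, e.g. chk-42.
CHK_DIR = re.compile(r"chk-(\d+)")


def latest_checkpoint(job_checkpoint_dir: str,
                      require_metadata: bool = True) -> Optional[Path]:
    """Return the chk-<n> subfolder with the highest n, or None if none exists.

    Assumption: a completed checkpoint contains a `_metadata` file, so
    folders without one are skipped when require_metadata is True.
    """
    best: Optional[Path] = None
    best_id = -1
    for entry in Path(job_checkpoint_dir).iterdir():
        match = CHK_DIR.fullmatch(entry.name)
        if match is None or not entry.is_dir():
            continue  # skips "shared", "taskowned", and stray files
        if require_metadata and not (entry / "_metadata").is_file():
            continue  # assumed to be incomplete
        checkpoint_id = int(match.group(1))
        if checkpoint_id > best_id:
            best_id = checkpoint_id
            best = entry
    return best
```

Called on one of the hash-named job folders, e.g. `latest_checkpoint("/path/to/checkpoints/<jobID>")`, it returns the folder whose path you would then give to the resume command described in [1]. The numeric comparison matters: sorting the names as strings would put "chk-9" after "chk-12".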

>
> Really appreciate any help. Thank you.
>
