Hi Averell,

Checkpoints are triggered automatically and periodically, according to the
checkpoint interval configured by the user, so there should be no doubt
about that part.
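
In case it helps, here is a minimal sketch (in Java; the class name and the
60-second interval are just placeholders, not from your setup) of enabling
periodic checkpointing on a StreamExecutionEnvironment:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Ask Flink to trigger a checkpoint every 60 seconds (the interval is up to you).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // ... define sources/operators/sinks and call env.execute(...) ...
    }
}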

There are many reasons why a job can fail.
Technically, a failure means the job does not reach its final (terminal)
state normally.
Document [1] contains the job status transition diagram, so you can see how
a Flink job moves between states.
Typical causes include: a subtask failing or throwing an exception, a
subtask throwing an exception while taking a checkpoint (which does not
necessarily fail the job), a connection timeout between a TaskManager (TM)
and the JobManager (JM), a TM going down, a JM leader switch, and so on.

*So, in these scenarios (including the ones you listed), you can simulate
the failure and recovery of a job.*
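
For example, a rough sketch of simulating the first cause (a subtask
throwing an exception) could look like the following. The socket source,
port 9999, the "poison" record, and the restart/checkpoint values are
arbitrary placeholders, not something specific to your job:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailureSimulationExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds so there is state to restore from.
        env.enableCheckpointing(10_000L);

        // Restart up to 3 times, waiting 10 seconds between attempts.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        env.socketTextStream("localhost", 9999)
           // Sending the hypothetical "poison" record makes a subtask throw,
           // failing the job so that it is restarted from the latest checkpoint.
           .map((MapFunction<String, String>) value -> {
               if ("poison".equals(value)) {
                   throw new RuntimeException("Simulated subtask failure");
               }
               return value;
           })
           .print();

        env.execute("failure-simulation");
    }
}

Killing a TM process or the whole cluster, as you enumerated, exercises the
same recovery path from the latest completed checkpoint.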

More specifically, job recovery is based on the child nodes under
ZooKeeper's "/jobgraph" path.
If a job does not reach a terminal state normally, its child nodes are not
cleaned up, so the job can be recovered from them.

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html

Thanks, vino.

Averell <lvhu...@gmail.com> wrote on Fri, Aug 24, 2018 at 9:17 PM:

> Hi Vino,
>
> Regarding this statement "/Checkpoints are taken automatically and are used
> for automatic restarting job in case of a failure/", I do not quite
> understand the definition of a failure, and how to simulate that while
> testing my application. Possible scenarios that I can think of:
>    (1) flink application killed
>    (2) cluster crashed
>    (3) one of the servers in the cluster crashed
>    (4) unhandled exception raised when abnormal data received
>    ...
>
> Could you please help explain?
>
> Thanks and regards,
> Averell
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>