Hi Averell,

Checkpoints are triggered automatically and periodically, according to the checkpoint interval set by the user, so you do not need to do anything extra to trigger them.
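For reference, here is a minimal sketch of how that interval is configured on the execution environment (the 60-second interval and the minimum pause are made-up values, just for illustration):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointIntervalExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint automatically every 60 seconds, in exactly-once mode.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            // Optionally require at least 30 seconds between the end of one
            // checkpoint and the start of the next.
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);

            // ... build the pipeline here and finish with env.execute("...");
        }
    }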
There are many reasons for a job failure; the technical definition is that the job does not reach its terminal state normally. Here is a document [1] with the state transition diagram of the job status, where you can see how a Flink job moves between states. For example:

- a subtask fails or throws an exception;
- a subtask throws an exception while taking a checkpoint (this does not necessarily fail the whole job);
- a connection timeout between a TM and the JM;
- a TM goes down;
- a JM leader switch;

and more.

*So, in these scenarios (including your own enumeration) you can simulate the failure recovery of a job.* I have put a small sketch at the end of this mail showing one way to do that.

More specifically, job recovery is based on the child nodes of ZooKeeper's "/jobgraph" node. If a job does not enter a terminal state normally, the child nodes of that job will not be cleaned up.

[1]: https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html

Thanks,
vino.

Averell <lvhu...@gmail.com> wrote on Fri, Aug 24, 2018 at 9:17 PM:

> Hi Vino,
>
> Regarding this statement "/Checkpoints are taken automatically and are used
> for automatic restarting job in case of a failure/", I do not quite
> understand the definition of a failure, and how to simulate that while
> testing my application. Possible scenarios that I can think of:
> (1) flink application killed
> (2) cluster crashed
> (3) one of the servers in the cluster crashed
> (4) unhandled exception raised when abnormal data received
> ...
>
> Could you please help explain?
>
> Thanks and regards,
> Averell
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
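P.S. As promised above, here is a rough sketch (my own illustration, not something from the docs) of how you could simulate your scenario (4), an unhandled exception on abnormal data. The socket source, port 9999, the "poison" marker and the restart-strategy numbers are all made-up placeholders. With checkpointing enabled and a restart strategy configured, the thrown exception fails the subtask and Flink restarts the job from the latest successful checkpoint; only after the restart attempts are exhausted does the job go to the FAILED state.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SimulatedFailureExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 10 seconds so there is always a recent state to restore.
            env.enableCheckpointing(10_000L);

            // Restart up to 3 times with a 10-second delay between attempts.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            // Feed the job from a local socket, e.g. started with: nc -lk 9999
            env.socketTextStream("localhost", 9999)
                    .map(new MapFunction<String, String>() {
                        @Override
                        public String map(String value) {
                            // Simulate scenario (4): an unhandled exception on "abnormal" data.
                            if (value.contains("poison")) {
                                throw new RuntimeException("simulated failure on abnormal record");
                            }
                            return value;
                        }
                    })
                    .print();

            env.execute("simulated-failure-example");
        }
    }

You can also approximate your scenario (3) with the same job by killing one TaskManager process while it is running; if the remaining TMs still have enough free slots, the job is restarted in the same way.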