Hi Vino,

So I will use the default setting of DELETE_ON_CANCELLATION. When the program is cancelled, the checkpoint will be deleted; when the program fails, the checkpoint will not be deleted, so I will still have a checkpoint that can be used to resume. Please correct me if I am wrong.
Thanks.

Best
Henry

> On Sep 25, 2018, at 2:22 PM, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Henry,
>
> I gave a blue comment in your original email.
>
> Thanks, vino.
>
> 徐涛 <happydexu...@gmail.com> wrote on Tue, Sep 25, 2018 at 12:56 PM:
> Hi Vino,
> What is the definition of, and the difference between, job cancel and job fail?
> Can I say that if the program is shut down manually, then it is a
> job cancel, and if the program is shut down due to some error, it
> is a job fail?
>
> [vino] This is not entirely true: manually triggering a cancel may also lead
> to failure. You can think of it this way: if a human triggers the cancel and
> each task instance is cancelled correctly, then the job's final status is
> canceled. If the job ends because of some anomaly, its final status is failed.
>
> This is important because it is the prerequisite for the following
> question:
>
> The Flink 1.6 documentation says:
> "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the
> checkpoint when the job is cancelled. Note that you have to manually clean up
> the checkpoint state after cancellation in this case.
> ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the
> checkpoint when the job is cancelled. The checkpoint state will only be
> available if the job fails."
> But it does not say whether the checkpoint will be retained on failure.
> If the checkpoint behaviour on failure is the same as on cancel, then I have
> to use RETAIN_ON_CANCELLATION, because otherwise the checkpoint
> will be deleted when the job fails.
> If the checkpoint is not deleted on failure, then in this case it
> is safe when the job fails.
>
> [vino] In the configuration there are two enumeration classes,
> `CheckpointRetentionPolicy` and `ExternalizedCheckpointCleanup`; you need to
> consider which configuration you want to use. Your main concern is
> ExternalizedCheckpointCleanup, which cleans up the metadata for externalized
> checkpoints. Are you sure you want to use it? By default Flink manages
> checkpoint cleanup itself, i.e. the checkpoints are not externalized.
>
> Best
> Henry
>
>> On Sep 25, 2018, at 11:16 AM, vino yang <yanghua1...@gmail.com> wrote:
>>
>> Hi Henry,
>>
>> To answer your questions:
>>
>> What is the definition and difference between job cancel and job fails?
>>
>> > Both cancellation and failure cause the job to enter a terminal
>> > state. But cancellation is triggered manually and the job terminates
>> > normally, while failure is usually a passive termination due to an
>> > exception.
>>
>> If I use DELETE_ON_CANCELLATION option, in this case, do I have the
>> checkpoint to resume the program?
>>
>> > No. If you use externalized checkpoints, you cannot resume from
>> > them after the job has been cancelled.
>>
>> I mean if I can guarantee that a savepoint can always be made before
>> manual cancellation. If I use DELETE_ON_CANCELLATION option on checkpoints,
>> is there any probability that I do not have a checkpoint to recover from?
>>
>> > According to the latest source code, a savepoint is not affected by
>> > CheckpointRetentionPolicy; it needs to be cleaned up manually.
>>
>> Thanks, vino.
>>
>> 徐涛 <happydexu...@gmail.com> wrote on Tue, Sep 25, 2018 at 11:06 AM:
>> Hi All,
>> I mean if I can guarantee that a savepoint can always be made before
>> manual cancellation. If I use the DELETE_ON_CANCELLATION option on checkpoints,
>> is there any probability that I do not have a checkpoint to recover from?
>> Thanks a lot.
>>
>> Best
>> Henry
>>
>>> On Sep 25, 2018, at 10:41 AM, 徐涛 <happydexu...@gmail.com> wrote:
>>>
>>> Hi All,
>>> The Flink documentation says:
>>> DELETE_ON_CANCELLATION: "Delete the checkpoint when the job is
>>> cancelled. The checkpoint state will only be available if the job fails."
>>> What is the definition of, and the difference between, job cancel and job fail?
>>> Suppose I run the program on YARN and, after a few days, the YARN application
>>> fails for some reason.
>>> If I use the DELETE_ON_CANCELLATION option, do I have a
>>> checkpoint to resume the program from in this case?
>>>
>>> If the checkpoints are only deleted when I cancel the program, I can
>>> always make a savepoint before cancellation. Then it seems that I only
>>> need to set DELETE_ON_CANCELLATION.
>>> I cannot find a case where RETAIN_ON_CANCELLATION should be used.
>>>
>>> Best
>>> Henry
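
The retention behaviour discussed in this thread can be written down as a small truth table. The Java sketch below is a toy model, not Flink code: the class `CheckpointRetentionSketch`, the enum `Terminal`, and the method `isRetained` are hypothetical names introduced for illustration; only the two option names come from the quoted Flink 1.6 documentation. It assumes externalized checkpoints are enabled and that a failed job keeps its checkpoint under both options, which is how the quoted documentation reads.

```java
// Toy model only: these types are illustrative and are NOT Flink classes.
final class CheckpointRetentionSketch {

    /** The two cleanup options quoted from the Flink 1.6 documentation. */
    enum Cleanup { RETAIN_ON_CANCELLATION, DELETE_ON_CANCELLATION }

    /** The two terminal job states discussed in this thread (hypothetical enum). */
    enum Terminal { CANCELED, FAILED }

    /**
     * Returns true if the externalized checkpoint is assumed to still be
     * available after the job reached the given terminal state.
     */
    static boolean isRetained(Cleanup cleanup, Terminal terminal) {
        if (terminal == Terminal.FAILED) {
            // Assumption from the quoted docs: on failure the checkpoint is kept
            // under both options.
            return true;
        }
        // On cancellation only RETAIN_ON_CANCELLATION keeps the checkpoint.
        return cleanup == Cleanup.RETAIN_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        // Print the full table for both options and both terminal states.
        for (Cleanup c : Cleanup.values()) {
            for (Terminal t : Terminal.values()) {
                String outcome = isRetained(c, t) ? "retained" : "deleted";
                System.out.println(c + " + " + t + " -> " + outcome);
            }
        }
    }
}
```

Under this model, DELETE_ON_CANCELLATION only loses the checkpoint on a clean cancel, which is the case a savepoint taken before cancelling would cover. Savepoints are outside the model, since per vino's reply they are not affected by the retention policy.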