Hi Vino,
So I will use the default setting of DELETE_ON_CANCELLATION. When the
program is cancelled, the checkpoint will be deleted; when the program fails,
the checkpoint will not be deleted, so I still have a checkpoint that can be
used to resume.
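
A minimal sketch of the behaviour described above, written as a plain-Java
model and not as Flink's actual API. The class name, the FinalState enum and
the checkpointRetained method are hypothetical names chosen for illustration;
only the two cleanup mode names come from the Flink documentation quoted
further down in this thread.

```java
// Simplified model of the retention rules discussed in this thread.
// This is NOT Flink code; it only encodes the documented behaviour.
public class CheckpointCleanupModel {

    // The two cleanup modes named in the Flink 1.6 documentation.
    enum Cleanup { RETAIN_ON_CANCELLATION, DELETE_ON_CANCELLATION }

    // Hypothetical terminal job states relevant to the question.
    enum FinalState { CANCELED, FAILED }

    // Returns true if the externalized checkpoint is expected to remain
    // available after the job reaches the given terminal state.
    static boolean checkpointRetained(Cleanup mode, FinalState state) {
        if (state == FinalState.FAILED) {
            // Per the docs, the checkpoint state stays available on failure
            // in both modes.
            return true;
        }
        // On cancellation, only RETAIN_ON_CANCELLATION keeps the checkpoint.
        return mode == Cleanup.RETAIN_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        // Print every combination of cleanup mode and terminal state.
        for (Cleanup mode : Cleanup.values()) {
            for (FinalState state : FinalState.values()) {
                System.out.println(mode + " + " + state
                        + " -> retained=" + checkpointRetained(mode, state));
            }
        }
    }
}
```

Under this model the only combination that loses the checkpoint is
DELETE_ON_CANCELLATION together with a job that ends as CANCELED.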
Please correct me if I am wrong.
Thanks.
Best
Henry
> On Sep 25, 2018, at 2:22 PM, vino yang <[email protected]> wrote:
>
> Hi Henry,
>
> I have added my comments in blue to your original email.
>
> Thanks, vino.
>
> 徐涛 <[email protected] <mailto:[email protected]>> wrote on Tue, Sep 25,
> 2018 at 12:56 PM:
> Hi Vino,
> What is the definition of, and the difference between, a job cancel and a
> job failure?
> Can I say that if the program is shut down manually, it is a job cancel,
> and if the program is shut down due to some error, it is a job failure?
>
>
> This is not entirely true; manually triggering a cancel may also lead to
> failure. You can think of it this way: if a human triggers the cancel and
> every task instance is cancelled correctly, then the job's final status is
> canceled. If the job terminates because of some anomaly, its final state is
> failed.
>
> This is important because it is the prerequisite for the following
> question:
>
> In the document of Flink 1.6, it says:
> "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the
> checkpoint when the job is cancelled. Note that you have to manually clean up
> the checkpoint state after cancellation in this case.
> ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the
> checkpoint when the job is cancelled. The checkpoint state will only be
> available if the job fails."
> But it does not say whether the checkpoint will be retained on failure.
> If the checkpoint behavior on failure is the same as on cancel, then I have
> to use RETAIN_ON_CANCELLATION, because otherwise the checkpoint will be
> deleted when the job fails.
> If the checkpoint is not deleted on failure, then in this case it is safe
> when the job fails.
>
> In the configuration, there are two enumeration classes,
> `CheckpointRetentionPolicy` and `ExternalizedCheckpointCleanup`; you need to
> consider which configuration you want to use. Your main concern is
> ExternalizedCheckpointCleanup, which cleans up the metadata for externalized
> checkpoints. Are you sure you want to use it? By default, Flink manages
> checkpoint cleanup itself, which means checkpoints are non-externalized.
>
>
> Best
> Henry
>
>
>
>> On Sep 25, 2018, at 11:16 AM, vino yang <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Hi Henry,
>>
>> To answer your questions:
>>
>> What is the definition of, and the difference between, a job cancel and
>> a job failure?
>>
>> > Both cancellation and failure cause the job to enter a terminal state.
>> > But cancellation is triggered manually and terminates normally, while
>> > failure is usually a passive termination due to an exception.
>>
>> If I use the DELETE_ON_CANCELLATION option, do I have a checkpoint to
>> resume the program from in this case?
>>
>> > No, if you use externalized checkpoints, you cannot resume from
>> > externalized checkpoints after the job has been cancelled.
>>
>> I mean, suppose I can guarantee that a savepoint is always taken before a
>> manual cancellation. If I use the DELETE_ON_CANCELLATION option on
>> checkpoints, is there any chance that I do not have a checkpoint to
>> recover from?
>>
>> > According to the latest source code, a savepoint is not affected by
>> > CheckpointRetentionPolicy; it needs to be cleaned up manually.
>>
>> Thanks, vino.
>>
>> 徐涛 <[email protected] <mailto:[email protected]>> wrote on Tue, Sep 25,
>> 2018 at 11:06 AM:
>> Hi All,
>> I mean, suppose I can guarantee that a savepoint is always taken before a
>> manual cancellation. If I use the DELETE_ON_CANCELLATION option on
>> checkpoints, is there any chance that I do not have a checkpoint to
>> recover from?
>> Thanks a lot.
>>
>> Best
>> Henry
>>
>>
>>
>>> On Sep 25, 2018, at 10:41 AM, 徐涛 <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Hi All,
>>> In the Flink documentation, it says
>>> DELETE_ON_CANCELLATION: “Delete the checkpoint when the job is
>>> cancelled. The checkpoint state will only be available if the job fails.”
>>> What is the definition of, and the difference between, a job cancel and
>>> a job failure?
>>> Suppose I run the program on YARN, and after a few days the YARN
>>> application fails for some reason.
>>> If I use the DELETE_ON_CANCELLATION option, do I have a checkpoint to
>>> resume the program from in this case?
>>>
>>> If the checkpoints are only deleted when I cancel the program, I can
>>> always take a savepoint before cancellation. So it seems that I only
>>> ever need to set DELETE_ON_CANCELLATION.
>>> I cannot find a case where RETAIN_ON_CANCELLATION should be used.
>>>
>>>
>>> Best
>>> Henry
>>>
>>
>