Hi Sohi,

Could it be that you configured your job's tasks to fail when a checkpoint fails
(streamExecutionEnvironment.getCheckpointConfig().setFailOnCheckpointingErrors(true))?
Could you also send the complete job master log?

If checkpoint 470 was subsumed by 471, its directory may have been removed
to release resources while some tasks were still writing their part of
checkpoint 470. Those tasks then fail because they can no longer access the
removed files. Such a failure could be ignored, since the checkpoint was
simply subsumed by the next successful one, but
setFailOnCheckpointingErrors(true) causes the job to fail.
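
To make the setting concrete, here is a minimal sketch, assuming Flink 1.7
and the DataStream API. The class name, the 60 s checkpoint interval, the
restart values, and the placeholder pipeline are illustrative assumptions,
not taken from your job:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name, for illustration only.
public class CheckpointFailureConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed interval: trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // With false, a task that hits an error while checkpointing declines
        // that checkpoint instead of failing; with true, the task fails and
        // the restart strategy below decides what happens next.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);

        // Assumed values: at most 3 restart attempts, 10 seconds apart.
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // Placeholder pipeline so that the sketch is a complete job.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-failure-config-sketch");
    }
}
```

Whether declining the checkpoint is acceptable depends on how much you rely
on every checkpoint completing, so please treat the values above as a
starting point only.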

Best,
Andrey

On Wed, Jan 16, 2019 at 3:20 AM Congxian Qiu <qcx978132...@gmail.com> wrote:

> Hi, Sohi
> You can check out doc[1][2] to find out the answer.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/checkpointing.html#enabling-and-configuring-checkpointing
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/restart_strategies.html
>
> sohimankotia <sohimanko...@gmail.com> wrote on Tue, Jan 15, 2019 at 4:16 PM:
>
>> Yes. The file got deleted.
>>
>> 2019-01-15 10:40:41,360 INFO FSNamesystem.audit: allowed=true   ugi=hdfs
>> (auth:SIMPLE)  ip=/192.168.3.184       cmd=delete
>> src=/pipeline/job/checkpoints/e9a08c0661a6c31b5af540cf352e1265/chk-470/5fb3a899-8c0f-45f6-a847-42cbb71e6d19
>> dst=null        perm=null       proto=rpc
>>
>> Looks like the file was deleted by the job itself.
>>
>> Does that cause the job to restart?
>>
>> If a checkpoint fails, should the job try the next checkpoint or restart?
>>
>
>
> --
> Best,
> Congxian
>
