Hi Laura,

First of all, Flink only keeps one completed checkpoint by default [1].
Please check whether the retention count you have configured matches the
number of completedCheckpoint files you see.
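
For reference, a minimal flink-conf.yaml sketch of that setting (1 is
already the default, so the key is usually absent unless someone raised it):

    # Maximum number of completed checkpoints to retain (default: 1).
    state.checkpoints.num-retained: 1

If the configured count matches the number of files, the growth has another
cause: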

1) Cleanup of completed checkpoints is done by the JobManager (JM). Please
confirm that the JM can actually access and delete the files on your
filesystem [2].
2) The JM removes the metadata of an old completed checkpoint from
ZooKeeper with a background thread, and only deletes the checkpoint data
after that removal has succeeded (see the sketch after this list) [3]. If
the causes above are ruled out, could you provide the JM log so we can
confirm whether this is the reason? It may also be appropriate to ping
Till.
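
To make 2) concrete, here is a much-simplified sketch of that two-step
pattern. This is not Flink's actual code; ZkClient and StateHandle are
hypothetical stand-ins. The point is that the completedCheckpoint file is
only deleted by a background thread after the ZooKeeper node has been
removed, so a failure in either step silently leaves the file behind:

    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;

    // Simplified sketch of the cleanup pattern in
    // ZooKeeperStateHandleStore [3]. ZkClient and StateHandle are
    // hypothetical stand-ins, not Flink classes.
    class CheckpointPointerStore {

        interface ZkClient { void deleteNode(String path) throws Exception; }
        interface StateHandle { void discardState() throws Exception; }

        private final ZkClient zk;
        private final Executor background = Executors.newSingleThreadExecutor();

        CheckpointPointerStore(ZkClient zk) { this.zk = zk; }

        void releaseAndTryRemove(String path, StateHandle handle) {
            background.execute(() -> {
                try {
                    // Step 1: remove the pointer node in ZooKeeper.
                    zk.deleteNode(path);
                    // Step 2: only after the node is gone, delete the
                    // completedCheckpoint file it pointed to. If either
                    // step fails (e.g. the JM cannot access the HA
                    // storage dir), the file is orphaned and accumulates.
                    handle.discardState();
                } catch (Exception e) {
                    // A failure here would show up in the JM log.
                    System.err.println("Could not remove checkpoint: " + e);
                }
            });
        }
    }

If your JM log shows failures at the equivalent of step 2, that would
explain the accumulation.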

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/state/checkpointing.html#state-checkpoints-num-retained
[2]: https://stackoverflow.com/questions/44928624/apache-flink-not-deleting-old-checkpoints
[3]: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L437

Thanks, vino.

Laura Uzcátegui <laura.uzcateg...@gmail.com> wrote on Thursday, August 30, 2018 at 10:52 PM:

> Hello,
>
>  At work, we are currently standing up a cluster with the following
> configuration:
>
>
>    - Flink version: 1.4.2
>    - HA Enabled with Zookeeper
>    - State backend : rocksDB
>    - state.checkpoints.dir: hdfs://namenode:9000/flink/checkpoints
>    - state.backend.rocksdb.checkpointdir:
>    hdfs://namenode:9000/flink/checkpoints
>    - *high-availability.storageDir*: hdfs://namenode:9000/flink/recovery
>
> We also have a job running with checkpointing enabled but without
> externalized checkpoints.
>
> We run this job multiple times a day from our integration-test pipeline,
> and we noticed that the number of completedCheckpoint files stored in the
> *high-availability.storageDir* folder is constantly increasing, which
> makes us wonder whether there is no cleanup policy for the filesystem
> when HA is enabled.
>
> Under what circumstances would there be an ever-increasing number of
> completedCheckpoint files in the HA storage dir when there is only a
> single job running over and over again?
>
> Here is a list of the files we see accumulating over time, eventually
> reaching the maximum number of files allowed on the filesystem.
>
> completedCheckpoint00d86c01d8b9
> completedCheckpoint00d86e9030a9
> completedCheckpoint00d877b74355
> completedCheckpoint00d87b3dd9ad
> completedCheckpoint00d8815d9afd
> completedCheckpoint00d88973195c
> completedCheckpoint00d88b4792f2
> completedCheckpoint00d890d499dc
> completedCheckpoint00d91b00ada2
>
>
> Cheers,
>
>
> Laura U.
>
>
