Hi Laura! Vino had good pointers. There really should be no case in which this is not cleaned up.
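Roughly, the sequence described in vino's second point and in the ZooKeeperStateHandleStore link below is: remove the checkpoint's metadata node from ZooKeeper first, then discard the completedCheckpoint file under high-availability.storageDir. A minimal sketch of that intended order, with hypothetical names rather than the actual Flink code:

import org.apache.curator.framework.CuratorFramework;
import org.apache.flink.runtime.checkpoint.CompletedCheckpoint;
import org.apache.flink.runtime.state.RetrievableStateHandle;

public class CheckpointCleanupSketch {

    // Hypothetical helper mirroring the order of operations described in this
    // thread; the real logic lives in ZooKeeperStateHandleStore and the
    // ZooKeeper-backed completed checkpoint store (Flink shades Curator, a
    // plain import is used here for readability).
    static void removeSubsumedCheckpoint(
            CuratorFramework zkClient,
            String zkPath,
            RetrievableStateHandle<CompletedCheckpoint> metadataHandle) throws Exception {

        // 1) Remove the checkpoint's metadata node from ZooKeeper.
        zkClient.delete().forPath(zkPath);

        // 2) Only after the ZK node is gone, discard the backing state handle,
        //    which deletes the completedCheckpointXXXX file under
        //    high-availability.storageDir.
        metadataHandle.discardState();
    }
}

If the second step fails, for example because the JobManager cannot access the files on HDFS, the completedCheckpoint files would pile up the way you describe.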
Is this a bounded job that ends? Is it always the last of the bounded job's checkpoints that remains?

Best,
Stephan

On Fri, Aug 31, 2018 at 5:02 AM, vino yang <yanghua1...@gmail.com> wrote:

> Hi Laura,
>
> First of all, Flink only keeps one completed checkpoint by default [1]. You
> need to confirm whether your configuration is consistent with the number
> of files you see. If it is, there may be other reasons:
>
> 1) Cleanup of completed checkpoints is done by the JM. You need to confirm
> whether it can access your files. [2]
> 2) The JM asynchronously cleans up the metadata of old completed
> checkpoints in ZooKeeper with a background thread. Only after that cleanup
> succeeds does it remove the checkpoint data. If the above reasons are ruled
> out, then perhaps you could provide the JM's log to help us confirm the
> cause. I think it would be more appropriate to ping Till. [3]
>
> [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/state/checkpointing.html#state-checkpoints-num-retained
> [2]: https://stackoverflow.com/questions/44928624/apache-flink-not-deleting-old-checkpoints
> [3]: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L437
>
> Thanks, vino.
>
> Laura Uzcátegui <laura.uzcateg...@gmail.com> wrote on Thu, Aug 30, 2018 at 10:52 PM:
>
>> Hello,
>>
>> At work, we are currently standing up a cluster with the following
>> configuration:
>>
>> - Flink version: 1.4.2
>> - HA enabled with ZooKeeper
>> - State backend: RocksDB
>> - state.checkpoints.dir: hdfs://namenode:9000/flink/checkpoints
>> - state.backend.rocksdb.checkpointdir: hdfs://namenode:9000/flink/checkpoints
>> - high-availability.storageDir: hdfs://namenode:9000/flink/recovery
>>
>> We also have a job running with checkpointing enabled and without
>> externalized checkpoints.
>>
>> We run this job multiple times a day from our integration-test pipeline,
>> and we have started noticing that the number of completedCheckpoint files
>> stored under high-availability.storageDir keeps growing, which makes us
>> wonder whether there is no cleanup policy for the filesystem when HA is
>> enabled.
>>
>> Under what circumstances would there be an ever-increasing number of
>> completedCheckpoint files in the HA storage dir when there is only a
>> single job running over and over again?
>>
>> Here is a list of what we are seeing accumulate over time, eventually
>> reaching the maximum number of files allowed on the filesystem:
>>
>> completedCheckpoint00d86c01d8b9
>> completedCheckpoint00d86e9030a9
>> completedCheckpoint00d877b74355
>> completedCheckpoint00d87b3dd9ad
>> completedCheckpoint00d8815d9afd
>> completedCheckpoint00d88973195c
>> completedCheckpoint00d88b4792f2
>> completedCheckpoint00d890d499dc
>> completedCheckpoint00d91b00ada2
>>
>> Cheers,
>>
>> Laura U.
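For reference, the retention setting behind vino's [1] is state.checkpoints.num-retained in flink-conf.yaml (default 1); with ZooKeeper HA, each retained checkpoint also has a completedCheckpoint metadata file under high-availability.storageDir. A minimal sketch of how a job like the one above might enable checkpointing against the configured RocksDB backend; the interval, class name, and job name are illustrative assumptions, not Laura's actual job, while the HDFS path is taken from the configuration quoted above:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (interval chosen only for illustration).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB state backend writing checkpoint data to the HDFS path from the
        // thread's configuration. Externalized checkpoints are not enabled, matching
        // the setup described above.
        env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:9000/flink/checkpoints"));

        // ... sources, operators, and sinks of the actual job go here ...

        env.execute("checkpointed-integration-test-job");
    }
}

Note that the number of completed checkpoints Flink retains is not set in the job code but in flink-conf.yaml via state.checkpoints.num-retained, per [1].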