Hi Prashant! I assume you are using Flink 1.3.0 or 1.3.1?
Here are some things you can do:

- I would try to disable incremental checkpointing for a start and see what happens then. That alone should already reduce the number of files. (A short sketch of how to set this programmatically is at the bottom of this mail.)

- Is it possible for you to run a patched version of Flink? If yes, can you try the following: in the class "FileStateHandle", in the method "discardState()", remove the code around "FileUtils.deletePathIfEmpty(...)" - that call probably does not work well when hitting too many S3 files. (A sketch of the patched method is also at the bottom of this mail.)

- You can delete old "completedCheckpointXXXYYY" files, but please do not delete the other two file types - they are needed for HA recovery.

Greetings,
Stephan

On Mon, Jul 24, 2017 at 3:46 AM, prashantnayak <prash...@intellifylearning.com> wrote:

> Hi Xiaogang and Stephan
>
> We're continuing to test and have now set up the cluster to disable
> incremental RocksDB checkpointing as well as increase the checkpoint
> interval from 30s to 120s (not ideal really :-( )
>
> We'll run it with a large number of jobs and report back if this setup
> shows improvement.
>
> Appreciate any other insights you might have around this problem.
>
> Thanks
> Prashant
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14392.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
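

For the first point, here is a minimal sketch of disabling incremental checkpointing programmatically, assuming the Flink 1.3 RocksDBStateBackend constructor that takes an enableIncrementalCheckpointing flag. The class name and the S3 path are just placeholders for your own job and bucket:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FullCheckpointJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Second constructor argument = enableIncrementalCheckpointing.
            // Passing false makes every checkpoint a full RocksDB snapshot,
            // which should keep the number of files in S3 much smaller.
            env.setStateBackend(
                    new RocksDBStateBackend("s3://<your-bucket>/checkpoints", false));

            // e.g. the 120s interval you mentioned
            env.enableCheckpointing(120_000);

            // ... build the rest of the job as usual, then env.execute("...");
        }
    }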
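
And a rough sketch of the "FileStateHandle" change, for the second point. This is only an illustration of where the call sits, written from memory of the 1.3 sources, so the exact surrounding code in your checkout may differ slightly - the rest of the class stays as it is:

    // Fragment of org.apache.flink.runtime.state.filesystem.FileStateHandle

    @Override
    public void discardState() throws Exception {
        FileSystem fs = getFileSystem();

        // Still delete the checkpoint file itself.
        fs.delete(filePath, false);

        // Removed: the cleanup of the (possibly empty) parent directory.
        // It issues a directory listing per deleted file, which is expensive
        // on S3 and is the part that seems to misbehave with very many files.
        //
        // FileUtils.deletePathIfEmpty(fs, filePath.getParent());
    }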