Hi Prashant! I assume you are using Flink 1.3.0 or 1.3.1?
Here are some things you can do:

- I would try to disable incremental checkpointing for a start and see what happens then. That alone should already reduce the number of files. (A short sketch of how to set this programmatically is at the bottom of this mail.)

- Is it possible for you to run a patched version of Flink? If yes, can you try the following: in the class "FileStateHandle", in the method "discardState()", remove the code around "FileUtils.deletePathIfEmpty(...)" - that call probably does not work well when hitting too many S3 files. (A sketch of the patched method is also at the bottom of this mail.)

- You can delete old "completedCheckpointXXXYYY" files, but please do not delete the other two file types - they are needed for HA recovery.

Greetings,
Stephan

On Mon, Jul 24, 2017 at 3:46 AM, prashantnayak <prash...@intellifylearning.com> wrote:

> Hi Xiaogang and Stephan
>
> We're continuing to test and have now set up the cluster to disable
> incremental RocksDB checkpointing as well as increase the checkpoint
> interval from 30s to 120s (not ideal really :-( )
>
> We'll run it with a large number of jobs and report back if this setup
> shows improvement.
>
> Appreciate any other insights you might have around this problem.
>
> Thanks
> Prashant
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14392.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
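

For the first point, here is a minimal sketch of disabling incremental checkpointing programmatically, assuming the Flink 1.3 RocksDBStateBackend constructor that takes an enableIncrementalCheckpointing flag. The class name and the S3 path are just placeholders for your own job and bucket:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FullCheckpointJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Second constructor argument = enableIncrementalCheckpointing.
            // Passing false makes every checkpoint a full RocksDB snapshot,
            // which should keep the number of files in S3 much smaller.
            env.setStateBackend(
                    new RocksDBStateBackend("s3://<your-bucket>/checkpoints", false));

            // e.g. the 120s interval you mentioned
            env.enableCheckpointing(120_000);

            // ... build the rest of the job as usual, then env.execute("...");
        }
    }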
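
And a rough sketch of the "FileStateHandle" change, for the second point. This is only an illustration of where the call sits, written from memory of the 1.3 sources, so the exact surrounding code in your checkout may differ slightly - the rest of the class stays as it is:

    // Fragment of org.apache.flink.runtime.state.filesystem.FileStateHandle

    @Override
    public void discardState() throws Exception {
        FileSystem fs = getFileSystem();

        // Still delete the checkpoint file itself.
        fs.delete(filePath, false);

        // Removed: the cleanup of the (possibly empty) parent directory.
        // It issues a directory listing per deleted file, which is expensive
        // on S3 and is the part that seems to misbehave with very many files.
        //
        // FileUtils.deletePathIfEmpty(fs, filePath.getParent());
    }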