Hi Xiaogang and Stephan

Thank you for your response, and sorry about the delay in replying (I was
traveling).

We've been trying to figure out what triggers this, but your point about the
master not being able to delete files "in time" seems to be correct.

We've been testing out two different environments:
 1.  one where we have a few jobs (< 10), but these jobs have processed a
large number of records (e.g. 200-300 million or more)
 2.  one where we have many jobs (> 40), but these jobs process a very low
number of records

We observe that in (1), recovery and checkpoint directory growth is closely
proportional to the number of jobs and the configured number of retained
checkpoints (we set it to 2).

We observe that in (2), the recovery, checkpoint, and externalized-checkpoint
directories grow very fast.  This environment eventually gets bogged down,
becomes unresponsive, and then dies.

To answer some of your other questions:

  - Does it only occur with incremental checkpoints, or also with regular
checkpoints?
        we believe this occurs in both cases
  - How many checkpoints do you retain?
        we retain 2
  - Do you use externalized checkpoints?
        yes, and we set retention = 2 and retain_on_cancellation
  - Do you use a highly-available setup with ZooKeeper?
        yes, we do
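
For reference, this is roughly how the checkpointing side is wired up in our
jobs - a minimal sketch against the standard Flink 1.3 APIs; the interval,
paths, and ZooKeeper quorum below are placeholders rather than our actual
values:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every minute (the interval here is a placeholder).
        env.enableCheckpointing(60_000);

        // Externalized checkpoints, retained when the job is cancelled.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // RocksDB backend with incremental checkpoints enabled; we believe the
        // growth happens with regular checkpoints as well.
        env.setStateBackend(new RocksDBStateBackend("s3://our-bucket/checkpoints", true));

        // The rest lives in flink-conf.yaml (placeholder values):
        //   state.checkpoints.num-retained: 2
        //   high-availability: zookeeper
        //   high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
        //   high-availability.storageDir: s3://our-bucket/recovery

        env.fromElements(1, 2, 3).print();   // job topology elided
        env.execute("checkpoint-setup-sketch");
    }
}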

We recently bumped up the JobManager (appMaster) CPU and heap in environment
#2 (increased to 4 CPUs, a 2GB heap, and 2.5GB of memory allocated to the
Mesos container), but that has had no effect.

We'd definitely appreciate any additional insight you might be able to
provide; this is impeding our production deployments.

Is there any way we can at least mitigate this growth?  For example, we have a
script that can be cron'd to delete files in the S3 recovery directory that
are older than X hours.  Is it OK to run this script and keep only the last
hour's worth of recovery files?

We notice that there are roughly three types of files in the recovery
directory:
   - completedCheckpointXXXYYY
   - mesosWorkerStoreXXXYYY
   - submittedJobGraphXXXXYYY

Is it OK to have the cron job prune all of these so we only keep the last
hour's worth, or perhaps just the completedCheckpoint files?
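
In case it helps to be concrete, this is roughly what that cron'd pruning job
would do - a minimal sketch using the AWS SDK for Java; the bucket name, key
prefix, and cutoff below are placeholders, and for now it only targets the
completedCheckpoint files:

import java.util.Date;
import java.util.concurrent.TimeUnit;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class PruneRecoveryDir {
    public static void main(String[] args) {
        String bucket = "our-bucket";                      // placeholder
        String prefix = "recovery/completedCheckpoint";    // placeholder prefix
        long maxAgeHours = 1;

        Date cutoff = new Date(System.currentTimeMillis()
                - TimeUnit.HOURS.toMillis(maxAgeHours));
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Walk every page of the listing and delete objects older than the cutoff.
        ObjectListing listing = s3.listObjects(bucket, prefix);
        while (true) {
            for (S3ObjectSummary obj : listing.getObjectSummaries()) {
                if (obj.getLastModified().before(cutoff)) {
                    System.out.println("Deleting " + obj.getKey());
                    s3.deleteObject(bucket, obj.getKey());
                }
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}

We would obviously hold off on running this until we hear whether deleting any
of these files out from under the JobManager is safe.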

Happy to provide any additional detail you need.  Just let me know...

Thanks
Prashant


