Hi Prashantnayak,

Thanks a lot for reporting this problem. Can you provide more details so that we can address it?
I am guessing the master has to delete too many files when a checkpoint is subsumed, which is very common in our case. The number of files in the recovery directory keeps increasing if the master cannot delete these files in time. This usually happens when the checkpoint interval is very small and the degree of parallelism is very large.

Regards,
Xiaogang

2017-07-15 0:31 GMT+08:00 Stephan Ewen <se...@apache.org>:

> Hi!
>
> I am looping in Stefan and Xiaogang, who worked a lot on incremental
> checkpointing.
>
> Some background on incremental checkpoints: Incremental checkpoints store
> "pieces" of the state (RocksDB ssTables) that are shared between
> checkpoints. Hence they naturally use more files than non-incremental
> checkpoints.
>
> You could help us understand this with a few more details:
>   - Does it only occur with incremental checkpoints, or also with regular
>     checkpoints?
>   - How many checkpoints do you retain?
>   - Do you use externalized checkpoints?
>   - Do you use a highly-available setup with ZooKeeper?
>
> Thanks,
> Stephan
>
>
> On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <
> prash...@intellifylearning.com> wrote:
>
>> To add one more data point... it seems like the recovery directory is the
>> bottleneck somehow, so if we delete the recovery directory and restart the
>> job manager, it comes back and is responsive.
>>
>> Of course, we lose all jobs, since none can be recovered... and that is of
>> course not ideal.
>>
>> So the question seems to be why the recovery directory grows exponentially
>> in the first place.
>>
>> I can't imagine we're the only ones to see this... or we must be
>> configuring something wrong while testing Flink 1.3.1
>>
>> Thanks for your help in advance
>>
>> Prashant
>>
>> --
>> View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14271.html
>> Sent from the Apache Flink User Mailing List archive at Nabble.com.
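
P.S. To make the guess at the top of this mail a bit more concrete, here is a rough back-of-envelope sketch in Python. It is not Flink code and does not reflect measured behaviour: the function name and all parameters (files per subtask, deletions per second) are hypothetical placeholders, and it assumes that every completed checkpoint leaves one full set of files from the subsumed checkpoint for the master to delete. It only illustrates how a small interval combined with a large parallelism could let undeleted files pile up.

```python
def backlog_after(seconds, parallelism, files_per_subtask, interval_s, deletes_per_s):
    """Estimate how many undeleted files remain after `seconds`.

    All parameters are illustrative assumptions, not values taken from Flink.
    """
    backlog = 0
    checkpoints = seconds // interval_s
    for _ in range(checkpoints):
        # Assumption: each completed checkpoint subsumes an older one,
        # leaving parallelism * files_per_subtask files to be deleted.
        backlog += parallelism * files_per_subtask
        # Assumption: the master deletes at most deletes_per_s files per
        # second during the interval before the next checkpoint completes.
        backlog = max(0, backlog - deletes_per_s * interval_s)
    return backlog


# Small interval: deletions cannot keep up and the backlog keeps growing.
print(backlog_after(60, parallelism=100, files_per_subtask=2,
                    interval_s=1, deletes_per_s=50))
# Larger interval: the same deletion rate clears each batch in time.
print(backlog_after(60, parallelism=100, files_per_subtask=2,
                    interval_s=10, deletes_per_s=50))
```

Under these assumptions the backlog grows steadily rather than exponentially, so this would only account for part of what was reported; the answers to Stephan's questions are still needed to narrow it down.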