Hi! I am looping in Stefan and Xiaogang, who have worked a lot on incremental checkpointing.

Some background on incremental checkpoints: Incremental checkpoints store "pieces" of the state (RocksDB SSTables) that are shared between checkpoints. Hence they naturally use more files than non-incremental checkpoints.

You could help us understand this with a few more details:

  - Does it only occur with incremental checkpoints, or also with regular checkpoints?
  - How many checkpoints do you retain?
  - Do you use externalized checkpoints?
  - Do you use a highly-available setup with ZooKeeper?

Thanks,
Stephan

On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <prash...@intellifylearning.com> wrote:

> To add one more data point... it seems like the recovery directory is the
> bottleneck somehow.. so if we delete the recovery directory and restart the
> job manager - it comes back and is responsive.
>
> Of course, we lose all jobs, since none can be recovered... and that is of
> course not ideal.
>
> So the question seems to be why the recovery directory grows exponentially
> in the first place.
>
> I can't imagine we're the only ones to see this... or we must be configuring
> something wrong while testing Flink 1.3.1
>
> Thanks for your help in advance
>
> Prashant
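
P.S. To make the background note on shared files above more concrete, here is a toy sketch in Python. It is not Flink's actual bookkeeping: the class `SharedFileRegistry`, its methods, and the `.sst` file names are all hypothetical, chosen only for illustration. It assumes each checkpoint references a set of files and that a shared file may be deleted only once no retained checkpoint references it.

```python
# Toy model only: this is NOT how Flink is implemented. All names are hypothetical.
from collections import Counter


class SharedFileRegistry:
    """Reference-counts files that are shared between retained checkpoints."""

    def __init__(self, max_retained):
        self.max_retained = max_retained
        self.retained = []           # (checkpoint_id, files), oldest first
        self.ref_counts = Counter()  # file name -> number of retained checkpoints using it

    def add_checkpoint(self, checkpoint_id, files):
        """Register a checkpoint and return the files that became deletable."""
        files = frozenset(files)
        self.retained.append((checkpoint_id, files))
        for name in files:
            self.ref_counts[name] += 1

        deleted = []
        # Drop the oldest checkpoints beyond the retention limit.
        while len(self.retained) > self.max_retained:
            _, old_files = self.retained.pop(0)
            for name in old_files:
                self.ref_counts[name] -= 1
                if self.ref_counts[name] == 0:
                    # No retained checkpoint uses this file any more.
                    del self.ref_counts[name]
                    deleted.append(name)
        return sorted(deleted)

    def files_on_storage(self):
        """Files that must stay on storage for the retained checkpoints."""
        return sorted(self.ref_counts)


registry = SharedFileRegistry(max_retained=2)
registry.add_checkpoint(1, {"001.sst", "002.sst"})
registry.add_checkpoint(2, {"001.sst", "002.sst", "003.sst"})
# Assume a compaction replaced 001 and 002 with 004 before checkpoint 3.
registry.add_checkpoint(3, {"003.sst", "004.sst"})
print(registry.files_on_storage())
```

In this model the storage still holds files that only an older retained checkpoint needs, so the file count depends on how many checkpoints are retained and on whether references to old files are ever released. That is why the answers to the questions above matter.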