Hi, We're running a job with on the order of >100GiB of state. For our initial run we wanted to keep things simple, so we allocated a single core node with 1 Taskmanager and 1 parallelism and 1 TiB storage (split between 4 disks on that machine). Overall, things are actually moving pretty smoothly, except for checkpointing. Checkpoints are set to be incremental, yet they're all in the range of 10-20 GiB -- we do have a lot of data being updated in real-time, retracts+appends -- and they take around 10-30 min. We have the Taskmanager to set to checkpoint every 5 min which means we're spending the majority of our time just checkpointing.
My question is, what sort of bottlenecks should we be investigating and what are some things we can try to improve our checkpoint times? Some things we're considering are: Increasing parallelism, hoping that this will partition the data and each operator will therefore checkpoint faster. Changing time between checkpoints, though we don't have a good understanding of how this might affect total time. Also, we are hesitant to use unaligned checkpointing at the moment and are hoping for some other options. Thanks! -- Rex Fenley | Software Engineer - Mobile and Backend Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | FOLLOW US <https://twitter.com/remindhq> | LIKE US <https://www.facebook.com/remindhq>