Hi,

We're running a job with on the order of >100GiB of state. For our initial
run we wanted to keep things simple, so we allocated a single core node
with 1 Taskmanager and 1 parallelism and 1 TiB storage (split between 4
disks on that machine). Overall, things are actually moving pretty
smoothly, except for checkpointing. Checkpoints are set to be incremental,
yet they're all in the range of 10-20 GiB -- we do have a lot of data being
updated in real-time, retracts+appends -- and they take around 10-30 min.
We have the Taskmanager to set to checkpoint every 5 min which means we're
spending the majority of our time just checkpointing.

My question is, what sort of bottlenecks should we be investigating and
what are some things we can try to improve our checkpoint times?

Some things we're considering are:
Increasing parallelism, hoping that this will partition the data and each
operator will therefore checkpoint faster.
Changing time between checkpoints, though we don't have a good
understanding of how this might affect total time.

Also, we are hesitant to use unaligned checkpointing at the moment and are
hoping for some other options.

Thanks!

-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
 |  FOLLOW
US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>

Reply via email to