Hello all!

We are trying to bring our Flink job closer to real-time processing, and
currently our main issue is the latency introduced during checkpoints. Our
job uses RocksDB with periodic checkpoints of a few hundred GB
every 15 minutes. We are trying to reduce the checkpoint duration, but
our main concern is that, during checkpoints, 70% of our CPU is
used for checkpointing (*FullSnapshotAsyncWriter.writeKVStateData*).

Ideally, we would like to allocate a fixed amount of our CPU resources to
this task (let's say 10%), which would allow the regular processing of data
to remain stable while checkpointing. This comes at the expense of having
10% idle CPU in-between checkpoints and having longer checkpoint durations,
but we are OK with this tradeoff if it brings more predictable latency
overall.

However, I didn't find any setting to achieve this. It seems like these
checkpointing tasks are scheduled in the *asyncOperationsThreadPool* that
resides in *StreamTask.java* and this pool seems to be unbounded.
Do you think that putting an upper bound on this thread pool would achieve
the outcome we expect? And if so, is there a way to configure such a bound?
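
To make the question concrete, here is a minimal sketch of what I have in
mind, written in plain Java rather than against Flink internals. It assumes
the pool currently behaves like an unbounded cached pool; *newBoundedPool*
and *peakConcurrency* are hypothetical names for illustration, not Flink
APIs, and the 10% figure is just the example from above. Note that a thread
cap only limits how many cores the snapshot work can occupy at once, so it
approximates a CPU share rather than enforcing one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedAsyncPoolSketch {

    // Hypothetical helper (not a Flink API): a pool that never runs more than
    // maxThreads tasks at once. Extra tasks wait in the unbounded queue.
    static ThreadPoolExecutor newBoundedPool(int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        // Let idle threads exit between checkpoints.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    // Submits `tasks` dummy jobs and returns the highest number that were
    // observed running at the same time.
    static int peakConcurrency(int maxThreads, int tasks) {
        ThreadPoolExecutor pool = newBoundedPool(maxThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(50); // stand-in for snapshot work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) {
        // Assumed sizing rule: roughly 10% of the available cores, at least one.
        int cores = Runtime.getRuntime().availableProcessors();
        int maxThreads = Math.max(1, (int) Math.floor(cores * 0.10));
        System.out.println("max threads = " + maxThreads
                + ", peak concurrency = " + peakConcurrency(maxThreads, 8));
    }
}
```

With a cap like this, tasks beyond the limit would queue instead of running
concurrently, which is the longer-checkpoint tradeoff described above.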

Thanks a lot!

Robin
