Hello all!

We are trying to bring our Flink job closer to real-time processing, and our main issue right now is the latency introduced during checkpoints. Our job uses RocksDB with periodic checkpoints of a few hundred GB every 15 minutes. We are trying to reduce the checkpoint duration, but our main concern is that, while a checkpoint is running, 70% of our CPU is spent on checkpointing (*FullSnapshotAsyncWriter.writeKVStateData*).
Ideally, we would like to allocate a fixed share of our CPU resources to this task (say 10%), so that regular data processing remains stable while a checkpoint is running. This comes at the cost of 10% of the CPU sitting idle between checkpoints and of longer checkpoint durations, but we are OK with this tradeoff if it brings more predictable latency overall.

However, I could not find any setting to achieve this. It seems that these checkpointing tasks are scheduled on the *asyncOperationsThreadPool* in *StreamTask.java*, and this pool appears to be unbounded. Do you think that putting an upper bound on this thread pool would achieve the outcome we expect? If so, is there a way to add this bound?

Thanks a lot!
Robin
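
P.S. To make the idea concrete, here is a minimal standalone sketch of what I mean by bounding the pool. It does not use any Flink code: the class `BoundedSnapshotPoolSketch`, the method `peakConcurrency`, and the sleeping task that stands in for the snapshot writer are all hypothetical, and I am only assuming that the current pool behaves like a cached (unbounded) thread pool. It compares how many simulated snapshot tasks run at the same time on an unbounded pool versus a pool capped at 2 threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedSnapshotPoolSketch {

    /**
     * Submits `tasks` simulated snapshot tasks to `pool`, waits for all of
     * them, and returns the highest number that were running concurrently.
     */
    public static int peakConcurrency(ExecutorService pool, int tasks)
            throws InterruptedException {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    // Hypothetical stand-in for the snapshot write work.
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: the existing pool behaves like this unbounded one.
        int unbounded = peakConcurrency(Executors.newCachedThreadPool(), 16);
        // What we would like instead: at most 2 snapshot threads at a time.
        int bounded = peakConcurrency(Executors.newFixedThreadPool(2), 16);
        System.out.println("cached pool peak concurrency = " + unbounded);
        System.out.println("fixed pool peak concurrency  = " + bounded);
    }
}
```

I realize that a thread cap limits parallelism rather than CPU share directly: with N cores, k busy snapshot threads can occupy at most roughly k/N of the CPU, so the 10% target would only be approximated by choosing k accordingly.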