Hello all, hope you're well :) We are trying to build a Flink job that consumes data from Kafka with latency that is as low and as stable as possible. Currently, our main limitation appears when the job checkpoints its RocksDB state: backpressure is applied to the stream, which increases latency. I am wondering whether there are ways to configure Flink so that checkpointing affects the flow of data as little as possible?
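To make the question concrete, this is roughly the kind of tuning I imagine could help, expressed as flink-conf.yaml settings (the values are purely illustrative, not what we run in production, and I'm not sure these are the right knobs at all):

    # Sketch only -- illustrative values, not validated on our cluster
    state.backend.incremental: true                           # incremental RocksDB checkpoints to reduce per-checkpoint work
    execution.checkpointing.interval: 60s                     # example interval
    execution.checkpointing.min-pause: 30s                    # guaranteed pause between two checkpoints
    execution.checkpointing.max-concurrent-checkpoints: 1     # never overlap checkpoints
    state.backend.rocksdb.checkpoint.transfer.thread.num: 2   # fewer threads uploading state, hoping for a smaller CPU spike

Is this the right direction, or are there other settings we should be looking at?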
In our case, the backpressure seems to arise from CPU consumption, because:
- CPU usage is around 80% when checkpoints aren't running, and capped at 100% when they are
- checkpoint alignment time is very low, and using unaligned checkpoints doesn't appear to help with the backpressure
- the network (async) part of the checkpoint should in theory not cause backpressure, since those resources would be available to the main stream during async waits, but I might be wrong about that

What we would really like to achieve is isolating the compute resources used for checkpointing from the ones used for task slots (see the sketch in the P.S. below). That would of course mean oversizing our cluster, so that resources reserved for checkpointing sit idle when no checkpoint is running, and it would also mean longer checkpoints than today, where checkpoints seem to use the CPU cores attributed to task slots. We are OK with that to some degree, but we don't know how to achieve this isolation. Do you have any clue?

Lastly, we currently have nodes with 8 cores but allocate only 6 task slots, and we have set the following:
state.backend.rocksdb.thread.num: 6
state.backend.rocksdb.writebuffer.count: 6

Thanks all for your help!
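P.S. For concreteness, the kind of separation we are imagining would look something like this in flink-conf.yaml; the slot count and thread counts below are just assumptions for the sake of illustration, and my understanding is that slots don't strictly pin CPU cores, so this only reduces contention rather than enforcing isolation:

    # 8-core nodes: try to leave CPU headroom for checkpointing/background work (sketch only)
    taskmanager.numberOfTaskSlots: 4      # fewer slots per node than cores, hoping to keep some cores free
    state.backend.rocksdb.thread.num: 2   # cap RocksDB background threads (flush/compaction)

If there is a more direct way to reserve compute for the checkpointing path, that is exactly what we are looking for.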