Hey Robin,

Thanks for sharing the detailed information. May I ask: when you say "CPU usage is around 80% when checkpoints aren't running, and capped at 100% when they are", do you see a zigzag pattern of CPU usage, or does it stay capped at 100%?
I think one possibility is that the sync phase of the checkpoint (the write-buffer flush during the sync phase) triggers a RocksDB compaction; we have seen this happen on Ververica services as well. For now, you could try making checkpoints less frequent (increase the checkpoint interval) to reduce the frequency of compaction. Please let me know whether this helps. (I put a small Java sketch of this below your quoted mail.) In the long term, I think we probably need to separate the compaction process from the internal DB and control/schedule compactions ourselves, since compaction takes a good amount of CPU and reduces TPS.

Best,
Yuan

On Thu, Oct 13, 2022 at 11:39 PM Robin Cassan via user <user@flink.apache.org> wrote:

> Hello all, hope you're well :)
> We are attempting to build a Flink job with minimal and stable latency (as much as possible) that consumes data from Kafka. Currently our main limitation happens when our job checkpoints the RocksDB state: backpressure is applied on the stream, causing latency. I am wondering if there are ways to configure Flink so that the checkpointing process affects the flow of data as little as possible?
>
> In our case, backpressure seems to arise from CPU consumption, because:
> - CPU usage is around 80% when checkpoints aren't running, and capped at 100% when they are
> - checkpoint alignment time is very low, and using unaligned checkpoints doesn't appear to help with backpressure
> - the network (async) part of the checkpoint should in theory not cause backpressure, since resources would be used for the main stream during async waits, but I might be wrong
>
> What we would really like to achieve is isolating the compute resources used for checkpointing from the ones used for task slots. This would of course mean that we need to oversize our cluster to keep resources available for checkpointing even when it isn't running, and also that checkpoints would take longer than today, where they seem to use CPU cores attributed to task slots. We are OK with that to some degree, but we don't know how to achieve this isolation. Do you have any clue?
>
> Lastly, we currently have nodes with 8 cores but allocate 6 task slots, and we have set the following settings:
>
> state.backend.rocksdb.thread.num: 6
> state.backend.rocksdb.writebuffer.count: 6
>
> Thanks all for your help!
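
P.S. In case it is useful, here is a minimal sketch of the checkpoint-interval tuning I mean, assuming the Java DataStream API. The 10-minute interval and 5-minute minimum pause are placeholder values, not recommendations; please adapt them to your latency budget.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint less often, so the sync-phase write-buffer flush (and the
        // compaction it can trigger) happens less frequently. 10 min is a placeholder.
        env.enableCheckpointing(10 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // Guarantee some breathing room between the end of one checkpoint and the
        // start of the next, so the job can catch up. 5 min is a placeholder.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5 * 60 * 1000L);

        // Define your Kafka source and operators here, then:
        env.execute("checkpoint-interval-sketch");
    }
}

If you prefer setting this in flink-conf.yaml, the same knobs are exposed as execution.checkpointing.interval and execution.checkpointing.min-pause in recent Flink versions.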