Hello all, hope you're well :)
We are trying to build a Flink job that consumes data from Kafka with latency
that is as low and as stable as possible. Our main limitation at the moment is
that when the job checkpoints its RocksDB state, backpressure is applied to
the stream, which adds latency. Is there a way to configure Flink so that the
checkpointing process affects the flow of data as little as possible?
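
For context, these are the kinds of checkpoint-related settings we have been
looking at so far (a sketch with illustrative values, not necessarily what we
run in production; key names may differ slightly between Flink versions):

state.backend.incremental: true                         # upload only changed SST files
execution.checkpointing.interval: 60s                   # how often checkpoints are triggered
execution.checkpointing.min-pause: 30s                  # guaranteed gap between two checkpoints
execution.checkpointing.max-concurrent-checkpoints: 1   # never overlap checkpoints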

In our case, the backpressure seems to come from CPU consumption, because:
- CPU usage sits around 80% when no checkpoint is running and is capped at
100% while one is
- checkpoint alignment time is very low, and switching to unaligned
checkpoints (the setting is sketched just after this list) doesn't appear to
help with the backpressure
- the async (network) part of the checkpoint should in theory not cause
backpressure, since the CPU would be free for the main stream during the
async waits (but I might be wrong)
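
For reference, the unaligned-checkpoint toggle we experimented with (in our
Flink version this lives in flink-conf.yaml; newer versions may use a
different key name):

execution.checkpointing.unaligned: true   # let barriers overtake in-flight data instead of waiting for alignment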

What we would really like is to isolate the compute resources used for
checkpointing from those used by the task slots. That would of course mean
oversizing our cluster, since the resources reserved for checkpointing would
sit idle whenever no checkpoint is running, and it would also mean longer
checkpoints than today, where checkpointing seems to use the CPU cores
assigned to the task slots. We are OK with that to some degree, but we don't
know how to achieve this isolation. Do you have any clue?
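
To make the idea concrete, this is roughly the kind of setup we picture (the
values are illustrative, and we don't know whether these knobs actually give
any CPU isolation, since the OS still schedules all threads on the same
cores):

taskmanager.numberOfTaskSlots: 6                          # fewer slots than the 8 physical cores
state.backend.rocksdb.thread.num: 2                       # cap RocksDB background flush/compaction threads
state.backend.rocksdb.checkpoint.transfer.thread.num: 1   # threads transferring checkpoint files, if we understand the option correctly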

Lastly, our nodes have 8 cores but we allocate only 6 task slots, and we have
set the following:

state.backend.rocksdb.thread.num: 6
state.backend.rocksdb.writebuffer.count: 6


Thanks all for your help!
