Hi,

We are having trouble with record throughput that we believe is caused by slow checkpoints. The job uses Kafka as both source and sink, plus a Redis-backed service running in the same cluster that is used to enrich the data in a transformation before the records are written back to Kafka. Below is a description of the job:
- Flink version 1.12.5
- Source topic: 24 partitions
- Multiple sink topics
- Parallelism set to 24
- Operators applied: a map function and a process function that fetches the Redis data
- EXACTLY_ONCE processing is required

We have switched between aligned and unaligned checkpoints, but with no improvement in performance. What we have observed is that, on average, the majority of operators and their respective subtasks acknowledge checkpoints within milliseconds, but 1 or 2 subtasks wait 2 to 4 minutes before acknowledging. The subtask load also appears skewed after the transformations that precede the sinks (we tried rebalance and shuffle here, but with no improvement). Checkpoint duration varies between 5 seconds and 7 minutes.

We believe this is slowing our overall job throughput: since the Kafka transaction commits only complete when a checkpoint completes, slow checkpoints delay the commits, which creates upstream backpressure and a growing offset lag on the source Kafka topic. We would ideally like to decrease the checkpoint interval once durations are low and stable.

For reference, I have pasted a simplified sketch of the checkpoint configuration and the pipeline topology at the end of this mail.

Any help on this would be greatly appreciated.

Best regards,
Terry
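P.S. Here is a minimal sketch of how checkpointing is configured for the job. The interval, timeout and pause values shown are illustrative rather than our exact settings, and the unaligned-checkpoints line is the one we have been toggling on and off:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(24);

        // Checkpoint in EXACTLY_ONCE mode; the 60s interval is illustrative
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointTimeout(10 * 60_000L);      // allow slow checkpoints to finish
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);  // gap between consecutive checkpoints
        checkpointConfig.enableUnalignedCheckpoints();            // toggled on/off while testing

        // ... pipeline definition and env.execute() as in the topology sketch below
    }
}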
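And a simplified view of the pipeline topology. The topic names, connection properties, class names and the bodies of the map/process functions below are placeholders (the real job fans out to multiple sink topics, and the process function does the Redis lookup), but the operator chain and the EXACTLY_ONCE producer semantic reflect what we run:

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EnrichmentPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpointing configured as in the previous sketch

        Properties consumerProps = new Properties();
        consumerProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        consumerProps.setProperty("group.id", "enrichment-job");      // placeholder

        Properties producerProps = new Properties();
        producerProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        // illustrative; for EXACTLY_ONCE this must not exceed the broker's transaction.max.timeout.ms
        producerProps.setProperty("transaction.timeout.ms", "900000");

        DataStream<String> records = env.addSource(
                new FlinkKafkaConsumer<>("source-topic", new SimpleStringSchema(), consumerProps));

        DataStream<String> enriched = records
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value; // stands in for our map transformation
                    }
                })
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        // Redis lookup that enriches the record happens here (client code omitted)
                        out.collect(value);
                    }
                })
                .rebalance(); // the rebalance we tried before the sinks

        // One of several sink topics; transactions commit when the checkpoint completes
        enriched.addSink(new FlinkKafkaProducer<>(
                "sink-topic-1",
                new KafkaSerializationSchema<String>() {
                    @Override
                    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                        return new ProducerRecord<>("sink-topic-1", element.getBytes(StandardCharsets.UTF_8));
                    }
                },
                producerProps,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE));

        env.execute("kafka-redis-enrichment"); // placeholder job name
    }
}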