Hi,

We are having trouble with record throughput that we believe is caused by slow checkpoints. The job uses Kafka as both source and sink, plus a Redis-backed service running in the same cluster that is used to enrich the data in a transformation before the records are written back to Kafka. Below is a description of the job:
- Flink version 1.12.5
- Source topic: 24 partitions
- Multiple sink topics
- Parallelism set to 24
- Operators applied: a map function and a process function that fetches the Redis data
- EXACTLY_ONCE processing is required

We have switched between aligned and unaligned checkpoints, but with no improvement in performance. What we have observed is that, on average, the majority of operators and their respective subtasks acknowledge checkpoints within milliseconds, but 1 or 2 subtasks wait 2 to 4 minutes before acknowledging. The subtask load also appears skewed after the transformations that precede the sinks (we tried rebalance and shuffle here, but with no improvement). Checkpoint duration varies between 5 seconds and 7 minutes.

We believe this is slowing our overall job throughput: since the Kafka transaction commits only complete when a checkpoint completes, slow checkpoints delay the commits, which creates upstream backpressure and a growing offset lag on the source Kafka topic. We would ideally like to decrease the checkpoint interval once durations are low and stable.

For reference, I have pasted a simplified sketch of the checkpoint configuration and the pipeline topology at the end of this mail.

Any help on this would be greatly appreciated.

Best regards,
Terry
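P.S. Here is a minimal sketch of how checkpointing is configured for the job. The interval, timeout and pause values shown are illustrative rather than our exact settings, and the unaligned-checkpoints line is the one we have been toggling on and off:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(24);

        // Checkpoint in EXACTLY_ONCE mode; the 60s interval is illustrative
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointTimeout(10 * 60_000L);      // allow slow checkpoints to finish
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);  // gap between consecutive checkpoints
        checkpointConfig.enableUnalignedCheckpoints();            // toggled on/off while testing

        // ... pipeline definition and env.execute() as in the topology sketch below
    }
}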
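And a simplified view of the pipeline topology. The topic names, connection properties, class names and the bodies of the map/process functions below are placeholders (the real job fans out to multiple sink topics, and the process function does the Redis lookup), but the operator chain and the EXACTLY_ONCE producer semantic reflect what we run:

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EnrichmentPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpointing configured as in the previous sketch

        Properties consumerProps = new Properties();
        consumerProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        consumerProps.setProperty("group.id", "enrichment-job");      // placeholder

        Properties producerProps = new Properties();
        producerProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        // illustrative; for EXACTLY_ONCE this must not exceed the broker's transaction.max.timeout.ms
        producerProps.setProperty("transaction.timeout.ms", "900000");

        DataStream<String> records = env.addSource(
                new FlinkKafkaConsumer<>("source-topic", new SimpleStringSchema(), consumerProps));

        DataStream<String> enriched = records
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value; // stands in for our map transformation
                    }
                })
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        // Redis lookup that enriches the record happens here (client code omitted)
                        out.collect(value);
                    }
                })
                .rebalance(); // the rebalance we tried before the sinks

        // One of several sink topics; transactions commit when the checkpoint completes
        enriched.addSink(new FlinkKafkaProducer<>(
                "sink-topic-1",
                new KafkaSerializationSchema<String>() {
                    @Override
                    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                        return new ProducerRecord<>("sink-topic-1", element.getBytes(StandardCharsets.UTF_8));
                    }
                },
                producerProps,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE));

        env.execute("kafka-redis-enrichment"); // placeholder job name
    }
}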