Hey folks,

We currently run PyFlink 1.17 on Kubernetes (managed by the Flink Kubernetes operator), with HA and checkpointing enabled (fixed-retries restart policy). We produce into Kafka with the AT_LEAST_ONCE delivery guarantee.
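For context, the sink is configured roughly like the sketch below (topic, broker address, and serialization schema are placeholders rather than our actual values):

    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream.connectors.base import DeliveryGuarantee
    from pyflink.datastream.connectors.kafka import (
        KafkaRecordSerializationSchema,
        KafkaSink,
    )

    # Placeholder broker / topic names for illustration only.
    sink = (
        KafkaSink.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_record_serializer(
            KafkaRecordSerializationSchema.builder()
            .set_topic("output-topic")
            .set_value_serialization_schema(SimpleStringSchema())
            .build()
        )
        # AT_LEAST_ONCE: the producer is flushed on checkpoint, no Kafka transactions.
        .set_delivery_guarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build()
    )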
Our application failed when trying to produce a message larger than Kafka's message.max.bytes. This offset was never going to be committed, so Flink HA was not able to recover the application. Upon a manual restart, it looks like the offending offset has been lost: it was not picked up after rewinding to the checkpointed offset, and it was not committed to Kafka. I would have expected this offset not to make it past the KafkaProducer commit at the checkpoint barrier, and that the app would fail on it again.

I understand that there are failure scenarios that can end in data loss when the Kafka delivery guarantee is set to EXACTLY_ONCE and Kafka expires an uncommitted transaction. However, it's unclear to me whether other corner cases apply to the AT_LEAST_ONCE guarantee. Upon broker failures and app restarts, I would expect duplicate messages but no data loss. The problem, as I see it, is that this commit was never going to happen.

Is this expected behaviour? Am I missing something here?

Cheers,
--
Gabriele Modena (he / him)
Staff Software Engineer
Wikimedia Foundation