Hey folks,

We currently run (py) flink 1.17 on k8s (managed by flink k8s
operator), with HA and checkpointing (fixed retries policy). We
produce into Kafka with AT_LEAST_ONCE delivery guarantee.
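
For context, the sink is set up roughly along these lines (a minimal PyFlink
1.17 sketch; broker address, topic, checkpoint interval and retry counts are
placeholders, not our actual values):

    from pyflink.common.restart_strategy import RestartStrategies
    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors.kafka import (
        DeliveryGuarantee,
        KafkaRecordSerializationSchema,
        KafkaSink,
    )

    env = StreamExecutionEnvironment.get_execution_environment()
    # Checkpointing enabled; HA itself is configured on the k8s operator side.
    env.enable_checkpointing(60_000)
    # Fixed retries restart policy (placeholder attempt count / delay).
    env.set_restart_strategy(RestartStrategies.fixed_delay_restart(3, 10_000))

    sink = (
        KafkaSink.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_record_serializer(
            KafkaRecordSerializationSchema.builder()
            .set_topic("output-topic")
            .set_value_serialization_schema(SimpleStringSchema())
            .build()
        )
        # AT_LEAST_ONCE: producer is flushed on checkpoint, no Kafka transactions.
        .set_delivery_guarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build()
    )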

Our application failed when trying to produce a message larger than
Kafka's message.max.bytes limit. This offset was never going to be
committed, so Flink HA was not able to recover the application.

Upon a manual restart, it looks like the offending offset has been
lost: it was not picked up after rewinding to the checkpointed offset,
and it was not committed to Kafka. I would have expected this offset
not to have made it past the checkpoint barrier and the KafkaProducer
commit, and that the app would fail on it again.
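
For reference, the source side looks roughly like this (again a sketch
with placeholder names); my understanding is that on recovery the
reading position comes from the checkpointed offsets, while the offsets
committed back to Kafka on checkpoint completion are informational only:

    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.common.watermark_strategy import WatermarkStrategy
    from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

    source = (
        KafkaSource.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_topics("input-topic")
        .set_group_id("my-consumer-group")
        # Starting position used when there is no checkpoint to restore from.
        .set_starting_offsets(KafkaOffsetsInitializer.committed_offsets())
        # Default true: offsets are committed to Kafka when a checkpoint completes,
        # but Flink restores from its own checkpointed offsets, not from these.
        .set_property("commit.offsets.on.checkpoint", "true")
        .set_value_only_deserializer(SimpleStringSchema())
        .build()
    )

    # stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
    # (env as in the sink sketch above)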

I understand that there are failure scenarios that can end in data
loss when the Kafka delivery guarantee is set to EXACTLY_ONCE and Kafka
expires an uncommitted transaction.

However, it's unclear to me whether other corner cases apply to
AT_LEAST_ONCE guarantees. Upon broker failures and app restarts, I
would expect duplicate messages but no data loss. What I do see as a
problem is that this commit was never going to happen.

Is this expected behaviour? Am I missing something here?

Cheers,
-- 
Gabriele Modena (he / him)
Staff Software Engineer
Wikimedia Foundation
