* To clarify: by "different output" I mean that, for the same input
message, the output message could be slightly smaller due to the factors
mentioned above and thus fall into the allowed size range without causing
any failures
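
For illustration, here is a minimal, purely hypothetical PyFlink-style
guard showing what "falling into the allowed size range" means in
practice - the fits_in_kafka helper and the 1048588-byte value (the Kafka
broker default for message.max.bytes) are assumptions, not part of the
original setup:

    # Hypothetical pre-sink filter: drop records whose serialized value
    # would exceed the broker's message.max.bytes (1048588 bytes is the
    # broker default; check your cluster's actual setting).
    MAX_MESSAGE_BYTES = 1048588

    def fits_in_kafka(value: str) -> bool:
        # The on-the-wire size also includes key, headers and record
        # overhead, and compression may shrink it - this is only a
        # rough upper-bound check on the value payload.
        return len(value.encode("utf-8")) < MAX_MESSAGE_BYTES

    stream.filter(fits_in_kafka).sink_to(kafka_sink)

Whether a given record trips the limit therefore depends on its exact
serialized size, which is why a rerun over the same input can succeed.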

On Thu, 26 Oct 2023 at 21:52, Alexander Fedulov <alexander.fedu...@gmail.com>
wrote:

> Your expectations are correct. With AT_LEAST_ONCE, Flink will wait
> for all outstanding records in the Kafka buffers to be acknowledged before
> marking the checkpoint successful (which is also when the offsets of the
> sources are recorded). That said, there might be other factors involved
> that could lead to a different output even when reading the same data from
> the sources - such as using processing time (instead of event time) or
> doing some sort of lookup calls to external systems. If you absolutely
> cannot think of a scenario where this could be the case for your
> application, please try to reproduce the error reliably - this is
> something that needs further investigation.
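>
> For reference, a minimal sketch of an AT_LEAST_ONCE KafkaSink in PyFlink
> (broker address and topic name are placeholders; with this guarantee the
> sink flushes its producer buffers on every checkpoint):
>
>     from pyflink.common.serialization import SimpleStringSchema
>     from pyflink.datastream.connectors.base import DeliveryGuarantee
>     from pyflink.datastream.connectors.kafka import (
>         KafkaSink,
>         KafkaRecordSerializationSchema,
>     )
>
>     # AT_LEAST_ONCE: pending records are flushed and acknowledged
>     # before a checkpoint completes, so after a failure they may be
>     # re-sent (duplicates), but should not be lost.
>     sink = KafkaSink.builder() \
>         .set_bootstrap_servers("broker-1:9092") \
>         .set_record_serializer(
>             KafkaRecordSerializationSchema.builder()
>             .set_topic("output-topic")
>             .set_value_serialization_schema(SimpleStringSchema())
>             .build()
>         ) \
>         .set_delivery_guarantee(DeliveryGuarantee.AT_LEAST_ONCE) \
>         .build()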
>
> Best,
> Alexander Fedulov
>
> On Mon, 23 Oct 2023 at 19:11, Gabriele Modena <gmod...@wikimedia.org>
> wrote:
>
>> Hey folks,
>>
>> We currently run (py) flink 1.17 on k8s (managed by flink k8s
>> operator), with HA and checkpointing (fixed retries policy). We
>> produce into Kafka with AT_LEAST_ONCE delivery guarantee.
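>>
>> For context, the checkpointing and restart setup looks roughly like
>> the following sketch (the interval and attempt counts here are
>> illustrative placeholders, not our actual values):
>>
>>     from pyflink.common import RestartStrategies
>>     from pyflink.datastream import StreamExecutionEnvironment
>>
>>     env = StreamExecutionEnvironment.get_execution_environment()
>>     # Checkpoint periodically so Kafka source offsets are recorded.
>>     env.enable_checkpointing(60_000)  # every 60 s (illustrative)
>>     # Fixed retries: give up after 3 attempts, 10 s apart.
>>     env.set_restart_strategy(
>>         RestartStrategies.fixed_delay_restart(3, 10_000))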
>>
>> Our application failed when trying to produce a message larger than
>> Kafka's message.max.bytes limit. This offset was never going to be
>> committed, so Flink HA was not able to recover the application.
>>
>> Upon a manual restart, it looks like the offending offset was lost:
>> it was not picked up after rewinding to the checkpointed offset, and
>> it was not committed to Kafka. I would have expected this offset not
>> to have made it past the KafkaProducer commit checkpoint barrier,
>> and that the app would fail on it again.
>>
>> I understand that there are failure scenarios that could end in data
>> loss when the Kafka delivery guarantee is set to EXACTLY_ONCE and
>> Kafka expires an uncommitted transaction.
>>
>> However, it's unclear to me whether similar corner cases apply to the
>> AT_LEAST_ONCE guarantee. Upon broker failures and app restarts, I
>> would expect duplicate messages but no data loss. What I see as the
>> problem here is that this commit was never going to happen.
>>
>> Is this expected behaviour? Am I missing something here?
>>
>> Cheers,
>> --
>> Gabriele Modena (he / him)
>> Staff Software Engineer
>> Wikimedia Foundation
>>
>
