Re: Flink KafkaProducer Failed Transaction Stalling the whole flow

Alexis Sarda-Espinosa Mon, 18 Dec 2023 07:21:21 -0800

Hi Dominik,

Sounds like it could be this?
https://issues.apache.org/jira/browse/FLINK-28060


It doesn't mention transactions but I'd guess it could be the same
mechanism.

Regards,
Alexis.

On Mon, 18 Dec 2023, 07:51 Dominik Wosiński, <wos...@gmail.com> wrote:

> Hey,
> I've got a question regarding the transaction failures in EXACTLY_ONCE
> flow with Flink 1.15.3 with Confluent Cloud Kafka.
>
> The case is that there is a FlinkKafkaProducer in EXACTLY_ONCE setup with
> the default transaction.timeout.ms of 15 min.
>
> During processing, the job had some issues that caused a checkpoint to
> time out. That in turn caused transaction issues, and the transaction
> failed with the following logs:
> Unable to commit transaction
> (org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@5d0d5082)
> because its producer is already fenced. This means that you either have a
> different producer with the same 'transactional.id' (this is unlikely
> with the 'KafkaSink' as all generated ids are unique and shouldn't be
> reused) or recovery took longer than 'transaction.timeout.ms' (900000ms).
> In both cases this most likely signals data loss, please consult the Flink
> documentation for more details.
> Up to this point everything is pretty clear. After that however, the job
> continued to work normally but every single transaction was failing with:
> Unable to commit transaction
> (org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@5a924600)
> because it's in an invalid state. Most likely the transaction has been
> aborted for some reason. Please check the Kafka logs for more details.
> This effectively stalls all downstream processing, because no transaction
> would ever be committed.
>
> I've read through the docs and understand that this is kind of a known
> issue due to the fact that Kafka doesn't effectively support 2PC, but why
> doesn't that cause a failure and restart of the whole job? Currently, the
> job processes everything normally and hides the issue until it has grown
> catastrophic.
>
> Thanks in advance,
> Cheers.
>
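
For reference, the first error in the quoted logs says recovery took longer
than 'transaction.timeout.ms'. A minimal sketch of that relationship follows,
assuming the usual rule of thumb that the transaction timeout should exceed
the longest checkpoint duration plus the longest restart duration. The class
name, method names, and the two 10-minute durations are hypothetical and not
taken from the job discussed here; only the 15-minute timeout and the
property name come from the thread.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper, not part of Flink or the Kafka client.
class TransactionTimeoutCheck {

    // True if the transaction timeout is strictly longer than one
    // worst-case checkpoint plus one worst-case restart.
    static boolean coversCheckpointAndRestart(Duration transactionTimeout,
                                              Duration maxCheckpointDuration,
                                              Duration maxRestartDuration) {
        Duration required = maxCheckpointDuration.plus(maxRestartDuration);
        return transactionTimeout.compareTo(required) > 0;
    }

    // Builds producer properties carrying the Kafka client setting
    // transaction.timeout.ms, expressed in milliseconds.
    static Properties producerProperties(Duration transactionTimeout) {
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms",
                String.valueOf(transactionTimeout.toMillis()));
        return props;
    }

    public static void main(String[] args) {
        Duration transactionTimeout = Duration.ofMinutes(15);    // value from the thread
        Duration maxCheckpointDuration = Duration.ofMinutes(10); // assumed for illustration
        Duration maxRestartDuration = Duration.ofMinutes(10);    // assumed for illustration

        Properties props = producerProperties(transactionTimeout);
        boolean safe = coversCheckpointAndRestart(
                transactionTimeout, maxCheckpointDuration, maxRestartDuration);

        System.out.println("transaction.timeout.ms = "
                + props.getProperty("transaction.timeout.ms"));
        System.out.println("covers checkpoint + restart: " + safe);
    }
}
```

A check like this only flags a timeout that is too short for the assumed
durations. It does not address why the job keeps running after the commit
failures, which is the actual question in the quoted message.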
