Flink job unable to checkpoint (timeout) after restart with savepoint with KafkaSink

vararu.va...@gmail.com Tue, 03 Sep 2024 04:27:03 -0700

Hi,

We tried to restart two different jobs from a savepoint:

FlinkKafkaProducer -> KafkaSink, with a new UID on the sink and the --allowNonRestoredState flag to reset the state of the sink operator.
KafkaSink -> KafkaSink, with a new UID on the sink and the --allowNonRestoredState flag to reset the state of the sink operator.

In both cases we changed the UID of the Kafka sink to make sure that its state is reset. We restarted from a savepoint, rather than starting fresh, to keep the source operator state (no data duplication or loss is allowed).

The problem is that in both cases the job could no longer checkpoint. Every checkpoint failed after the configured timeout (3 minutes in our case). Before the restart, checkpoints normally took under 1 second. I tried increasing the timeout, but it made no difference, and the failures were clearly caused by the Kafka sink.

I have observed a lot of logs like this (not sure if they are related to the issue):

2024-09-03 08:12:09,550 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] Invoking InitProducerId for the first time in order to acquire a producer ID

2024-09-03 08:12:09,552 INFO org.apache.kafka.clients.Metadata [] - [Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] Cluster ID: LtOP7cS0SOis0BcZNqaPJA

2024-09-03 08:12:09,552 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] Discovered transaction coordinator ec2-63-32-61-53.eu-west-1.compute.amazonaws.com:9092 (id: 1010 rack: null)
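
To judge whether these lines are related, it may help to pull the transactionalId out of each log line and see how its trailing numbers change over time. Below is a minimal sketch in plain Java (no Flink or Kafka dependencies). The class and method names are hypothetical, and the split of the ID into a prefix followed by two numeric fields is only an assumption based on the shape of the IDs in the logs above, not something confirmed from the connector.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TransactionalIdScan {
    // Matches the "transactionalId=<value>" field in the Kafka client log prefix;
    // the value ends at the first comma, closing bracket, or whitespace.
    private static final Pattern TX_ID =
            Pattern.compile("transactionalId=([^,\\]\\s]+)");

    // Assumption: the ID has the shape "<prefix>-<number>-<number>".
    private static final Pattern SUFFIX = Pattern.compile("^(.*)-(\\d+)-(\\d+)$");

    /** Returns the transactionalId found in a log line, or null if there is none. */
    public static String extractTransactionalId(String logLine) {
        Matcher m = TX_ID.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    /** Splits an ID into {prefix, firstNumber, secondNumber}, or returns null if it does not fit. */
    public static String[] splitSuffix(String transactionalId) {
        Matcher m = SUFFIX.matcher(transactionalId);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        // Log line copied from the output above.
        String line = "2024-09-03 08:12:09,550 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - "
                + "[Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, "
                + "transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] "
                + "Invoking InitProducerId for the first time in order to acquire a producer ID";

        String id = extractTransactionalId(line);
        String[] parts = splitSuffix(id);
        System.out.println("prefix       = " + parts[0]);
        System.out.println("first number = " + parts[1]);
        System.out.println("last number  = " + parts[2]);
    }
}
```

Running this over the full task manager log and watching whether the last number keeps changing during a stuck checkpoint would show whether the sink is cycling through many transactional IDs.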

Kafka server version - kafka_2.12-2.6.0

Flink Kafka connector version - 3.1.0-1.18

Kafka client version - org.apache.kafka:kafka-clients:jar:3.4.0

Cheers,
Vadim.
