Hi,

We tried to restart two different jobs from a savepoint.
In both cases we changed the UID of the Kafka sink to make sure that its state is reset. However, we restarted from a savepoint in order to keep the source operator state (no data duplication or loss is allowed).

The problem is that in both cases the job could no longer complete a checkpoint. Every checkpoint failed after the configured timeout (3 minutes in our case). Before the restart, checkpoints normally took under 1 second. I tried increasing the timeout, but it made no difference, and it was clearly because of the Kafka sink.

I have observed a lot of log entries like these (not sure whether they are related to the issue):

2024-09-03 08:12:09,550 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] Invoking InitProducerId for the first time in order to acquire a producer ID
2024-09-03 08:12:09,552 INFO org.apache.kafka.clients.Metadata [] - [Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] Cluster ID: LtOP7cS0SOis0BcZNqaPJA
2024-09-03 08:12:09,552 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187, transactionalId=fk8s-480dcb71187e8ab619944412e95cb04e22388b17-20211116144116-0-3187] Discovered transaction coordinator ec2-63-32-61-53.eu-west-1.compute.amazonaws.com:9092 (id: 1010 rack: null)

Kafka server version - kafka_2.12-2.6.0
Flink Kafka connector version - 3.1.0-1.18
Kafka client version - org.apache.kafka:kafka-clients:jar:3.4.0

Cheers,