Hi Chohan,

Which Kafka client version are you using? Given that this started today, 
did you recently change the Kafka client version?

A little more context (the exception call stack and more of the log) might 
help in finding out what is going on 😊

Regards

Thias

-----Original Message-----
From: Shahid Chohan <cho...@stripe.com>
Sent: Wednesday, 1 September 2021 05:05
To: user@flink.apache.org
Subject: Unrecoverable apps due to timeouts on transaction state initialization



Today I started seeing the following exception across all of the exactly-once 
Kafka sink apps I have deployed:



org.apache.kafka.common.errors.TimeoutException: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.



The apps are all on Flink v1.10.2.



I tried the following workarounds sequentially for a single app, but I 
continued to get the same exception:

- changing the sink uid and restoring while allowing non-restored state
- changing the Kafka producer id and restoring while allowing non-restored state
- changing the output Kafka topic to a new one and restoring while allowing 
  non-restored state
- deploying from scratch (no previous checkpoint/savepoint)
- doubling the timeout for state initialization from 60s to 120s
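
For reference on the last workaround: the 60000 ms in the exception matches the Kafka producer's default `max.block.ms`, which bounds how long `initTransactions()` may block. Below is a minimal sketch of what doubling it could look like as producer properties, assuming that is the setting that was changed. The helper `to_properties` and both dictionaries are hypothetical names used only for illustration.

```python
# Producer-side setting that bounds how long initTransactions() may block.
# Assumption: this is the "timeout for state initialization" referred to above.
default_producer_config = {"max.block.ms": 60_000}   # Kafka producer default
doubled_producer_config = {"max.block.ms": 120_000}  # the 120s workaround


def to_properties(config):
    """Render a config dict as sorted Java-style .properties lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


print(to_properties(doubled_producer_config))
```

Raising this value only lengthens the wait. It does not change what the transaction coordinator on the broker side is doing while the producer waits.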



My mental model is that we have completely disassociated the Flink app from any 
pending transactions on the Kafka side (by changing the uid, producer id, and 
output topic), so it should be able to recover from scratch. The Kafka 
clusters are otherwise healthy and accepting writes from non-exactly-once Flink 
apps and all other Kafka producers.



On the Kafka side, we have the following configs set:



transaction.max.timeout.ms=3600000

transaction.remove.expired.transaction.cleanup.interval.ms=86400000
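
To make these two values easier to read, here is a small sketch that converts them to durations and checks a producer transaction timeout against the broker cap. The one-hour producer timeout is an assumption (it is the value Flink's Kafka producer uses for `transaction.timeout.ms` unless overridden), and `as_duration` and `check_transaction_timeout` are hypothetical helpers.

```python
from datetime import timedelta

# Broker settings quoted above.
broker_config = {
    # Largest transaction.timeout.ms the broker accepts from a producer.
    "transaction.max.timeout.ms": 3_600_000,
    # How often the broker removes transactional IDs that have expired
    # (expiry itself is governed by transactional.id.expiration.ms).
    "transaction.remove.expired.transaction.cleanup.interval.ms": 86_400_000,
}

# Assumed producer-side value, not taken from the original deployment.
producer_transaction_timeout_ms = 3_600_000


def as_duration(ms):
    """Convert a millisecond setting to a timedelta."""
    return timedelta(milliseconds=ms)


def check_transaction_timeout(producer_timeout_ms, broker_max_ms):
    """Return True if the producer timeout does not exceed the broker cap."""
    return producer_timeout_ms <= broker_max_ms


for key, ms in broker_config.items():
    print(f"{key} = {as_duration(ms)}")

print(check_transaction_timeout(
    producer_transaction_timeout_ms,
    broker_config["transaction.max.timeout.ms"],
))
```

With these numbers the cap is one hour and the cleanup interval is one day, so the assumed producer timeout sits exactly at the broker limit.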



I'm considering shortening the cleanup interval so that any hanging 
transactions on the Kafka side can get garbage collected sooner. Or I might 
just wait it out and accept the downtime.



But otherwise, I am out of ideas and unsure how to proceed. Any help would be 
much appreciated.

