Hi,

I'm investigating the possibilities of exactly-once semantics for the Debezium [1] Kafka Connect source connectors, which implement change data capture for various databases. Debezium has two phases: an initial snapshot phase and a streaming phase. The initial snapshot phase loads the existing data from the database and sends it to Kafka; the subsequent streaming phase captures any changes to that data.
Exactly-once delivery seems to work really well during the streaming phase. Now I'm investigating how to ensure exactly-once delivery for the initial snapshot phase. If the snapshot fails (e.g. due to a DB connection drop or a worker node crash), we force a new snapshot after the restart, because the data may change in the meantime and the snapshot has to reflect the state of the data at the time it was taken. However, re-taking the snapshot produces duplicate records in the related Kafka topics.

Probably the easiest solution to this issue is to run the whole snapshot in a single Kafka transaction (see the sketch below). This may result in a huge transaction, containing millions of records, in some cases even billions. As these records cannot be consumed until the transaction is committed, and the logs therefore cannot be compacted, this could lead to a huge increase in the size of the Kafka logs. Also, since taking a snapshot of a large DB is a time-consuming process, it would very likely run into transaction timeouts (unless the timeout is set to a very large value).

Is my understanding of the impact of very large transactions correct? Are there any other drawbacks I'm missing (e.g. can it also cause memory issues or something similar)?

Thanks in advance!

Vojta

[1] https://debezium.io/
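To make the idea concrete, here is a minimal sketch of what I mean by "the whole snapshot in a single transaction", using a plain KafkaProducer rather than the Connect framework; the topic name and transactional.id are made up for illustration:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.AuthorizationException;
    import org.apache.kafka.common.errors.OutOfOrderSequenceException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SnapshotInSingleTransaction {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            // Stable transactional.id, so a restarted snapshot fences off
            // the previous producer instance (id is made up for this sketch).
            props.put("transactional.id", "snapshot-tx-1");
            // Client-side transaction timeout; the broker rejects values above
            // its transaction.max.timeout.ms (15 minutes by default).
            props.put("transaction.timeout.ms", "900000");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Imagine iterating over every row of the snapshot here;
                // in reality this loop could emit millions of records.
                for (int i = 0; i < 1_000; i++) {
                    producer.send(new ProducerRecord<>("snapshot-topic", "key-" + i, "row-" + i));
                }
                // Only after this commit do the records become visible
                // to read_committed consumers.
                producer.commitTransaction();
            } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
                // Fatal errors: the producer cannot recover, just close it.
                producer.close();
            } catch (KafkaException e) {
                // Abort so nothing becomes visible; the snapshot can then be
                // retaken from scratch without leaving duplicates behind.
                producer.abortTransaction();
            }
            producer.close();
        }
    }

If I read the defaults right, the producer-side transaction.timeout.ms is capped by the broker's transaction.max.timeout.ms (15 minutes by default), so for a multi-hour snapshot both would have to be raised, which is exactly the timeout concern above.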