Hi,
I'm investigating the possibilities of exactly-once semantics for Debezium [1] 
Kafka Connect source connectors, which implement change data capture for 
various databases. Debezium has two phases: the initial snapshot phase and 
the streaming phase. The initial snapshot phase loads the existing data from 
the database and sends it to Kafka; the subsequent streaming phase captures 
any changes to the data.

Exactly-once delivery seems to work really well during the streaming phase. 
Now I'm investigating how to ensure exactly-once delivery for the initial 
snapshot phase. If the snapshot fails (e.g. due to a DB connection drop or a 
worker node crash), we force a new snapshot after the restart, as the data may 
change during the restart and the snapshot has to reflect the state of the data 
at the time it was taken. However, re-taking the snapshot produces duplicate 
records in the related Kafka topics.
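For reference, the streaming-phase delivery relies on the exactly-once source 
support from KIP-618 (Kafka 3.3+); roughly these settings (shown here without 
the rest of the worker and connector configuration):

    # worker configuration
    exactly.once.source.support=enabled

    # connector configuration
    exactly.once.support=required
    transaction.boundary=poll

If I understand it correctly, with transaction.boundary=poll each batch 
returned from SourceTask.poll() is committed in its own transaction together 
with its source offsets. That is exactly what makes the streaming phase work, 
but it also means the records produced by a partially completed snapshot are 
already committed and stay in the topics when the snapshot is re-taken.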

Probably the easiest solution to this issue is to run the whole snapshot in 
a single Kafka transaction. This may result in a huge transaction, containing 
millions, in some cases even billions, of records. As these records cannot be 
consumed until the transaction is committed, and the logs therefore cannot be 
compacted, this could lead to a huge increase in the size of the Kafka logs. 
Also, as the snapshot is a time-consuming process for large DBs, it would very 
likely run into transaction timeouts (unless the timeout is set to a very 
large value).
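To make it concrete, below is a minimal sketch of what I mean by a single 
transaction around the whole snapshot, written against the plain producer API 
rather than Connect (the topic name, transactional.id, and snapshotRows() are 
made up for illustration):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.AuthorizationException;
    import org.apache.kafka.common.errors.OutOfOrderSequenceException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SnapshotInOneTransaction {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            // made-up id; must be stable across restarts so a zombie producer gets fenced
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "debezium-snapshot-1");
            // client-side limit; the broker additionally caps transactions at
            // transaction.max.timeout.ms (15 minutes by default)
            props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 900_000);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                try {
                    producer.beginTransaction();
                    // snapshotRows() stands in for streaming all existing rows from the DB
                    for (String[] row : snapshotRows()) {
                        producer.send(new ProducerRecord<>("snapshot-topic", row[0], row[1]));
                    }
                    // nothing is visible to read_committed consumers until this succeeds
                    producer.commitTransaction();
                } catch (ProducerFencedException | OutOfOrderSequenceException
                         | AuthorizationException e) {
                    throw e; // fatal for this producer; closed by try-with-resources
                } catch (KafkaException e) {
                    producer.abortTransaction(); // readers never see the partial snapshot
                    throw e;
                }
            }
        }

        // placeholder for the actual snapshot SELECT against the source database
        private static Iterable<String[]> snapshotRows() {
            return Collections.singletonList(new String[] {"key-1", "value-1"});
        }
    }

The commitTransaction() at the end is the crux: nothing becomes visible to 
read_committed consumers until it succeeds, and the broker aborts any 
transaction that exceeds transaction.max.timeout.ms, which is where my 
timeout concern above comes from.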

Is my understanding of the impact of very large transactions correct? Are 
there any other drawbacks I'm missing (e.g. can it also result in memory 
issues or something similar)?

Thanks in advance!
Vojta

[1] https://debezium.io/
