Hi, I am reading the book “Introduction to Apache Flink”, and it mentions two ways to achieve exactly-once delivery at the sink:

1. The first way is to buffer all output at the sink and commit it atomically when the sink receives a checkpoint record.
2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, we need to roll back the output, overwriting the dirty data and effectively deleting dirty data that has already been written to the output.
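To make sure I understand the first way, here is a minimal sketch of the buffer-and-commit pattern in plain Java. This is a hypothetical illustration, not the actual Flink sink API: records are held back until a checkpoint completes, and anything not yet committed is simply dropped on failure and replayed by the source, so the committed output sees each record exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the Flink API): a sink that buffers records
// between checkpoints and commits the buffer atomically on checkpoint.
public class BufferingSink {
    private final List<String> buffer = new ArrayList<>();    // uncommitted records
    private final List<String> committed = new ArrayList<>(); // stands in for the durable output

    public void invoke(String record) {
        buffer.add(record); // nothing reaches the output yet
    }

    // Called when the sink receives the checkpoint record: the whole
    // buffer becomes visible in one atomic step.
    public void onCheckpointComplete() {
        committed.addAll(buffer);
        buffer.clear();
    }

    // On failure, uncommitted records are discarded; the source replays
    // them from the last checkpoint, so the output never sees duplicates.
    public void onFailure() {
        buffer.clear();
    }

    public List<String> output() {
        return committed;
    }
}
```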
I read the code of the Elasticsearch sink and found a flushOnCheckpoint option; if it is set to true, changes accumulate until a checkpoint is made. I guess this guarantees at-least-once delivery: although it uses batch flushes, a flush is not an atomic action, so it cannot guarantee exactly-once delivery. My questions are:

1. Since many sinks do not support transactions, in those cases I have to choose the second way to achieve exactly-once. That means I have to track all the records written between checkpoints and delete the dirty ones on failure, which is a rather heavy operation.
2. I guess a MySQL sink could support exactly-once delivery because MySQL supports transactions, but in that case I have to size each batch by the number of records between checkpoints rather than a fixed number, 100 for example. When there are a lot of records between checkpoints, that is a heavy operation as well.

Best,
Henry
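For the second way (eager writes plus rollback), here is a hypothetical sketch of how the dirty-data cleanup could look. The epoch tagging and the in-memory store are my own assumptions for illustration; a real sink would need an external store that can delete by some version/epoch column, which is roughly what a transactional database like MySQL makes possible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the book's second approach: write eagerly, tag each
// record with the checkpoint epoch it belongs to, and on failure delete every
// record written after the last completed checkpoint before replay begins.
public class EagerSink {
    private static final class Row {
        final long epoch;
        final String value;
        Row(long epoch, String value) { this.epoch = epoch; this.value = value; }
    }

    private final List<Row> store = new ArrayList<>(); // stands in for the external output
    private long epoch = 0;                            // current checkpoint epoch
    private long lastCompleted = -1;                   // last checkpoint known durable

    public void invoke(String record) {
        // visible immediately, but "dirty" until the next checkpoint completes
        store.add(new Row(epoch, record));
    }

    public void onCheckpointComplete() {
        lastCompleted = epoch;
        epoch++;
    }

    // Roll back: delete all rows newer than the last completed checkpoint;
    // the source then replays from that checkpoint without duplicating output.
    public void onFailure() {
        store.removeIf(r -> r.epoch > lastCompleted);
        epoch = lastCompleted + 1;
    }

    public List<String> output() {
        List<String> out = new ArrayList<>();
        for (Row r : store) out.add(r.value);
        return out;
    }
}
```

The rollback here is the "heavy" part the question is about: its cost grows with the number of dirty records written since the last checkpoint, just as a MySQL transaction spanning a whole checkpoint interval would.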