Hi Hequn,

Thanks a lot for your response. I have a few questions about this topic. Would you please help me with them?

1. I have heard of the idempotent way, but I do not know how to implement it. Would you please enlighten me with an example? (I have put a sketch of my own understanding below this list.)
2. If the dirty data is newly inserted rather than an update of existing rows, then overwriting alone is not enough, I think.
3. If we use two-phase commit, the sink must support transactions.
3.1 If the sink does not support transactions, Elasticsearch for example, do I have to use the idempotent way to implement exactly-once?
3.2 If the sink does support transactions, MySQL for example, then both the idempotent way and two-phase commit are OK. But as you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, so I would still have to use the idempotent way to implement exactly-once.
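For question 1, here is a minimal sketch of how I understand the idempotent way: a MySQL upsert keyed on a primary key, so that a record replayed after a failure overwrites the previous value instead of producing a duplicate row. The table word_count(word VARCHAR PRIMARY KEY, cnt BIGINT), the connection settings, and the class name are all made up for illustration. (For a sink like Elasticsearch, I suppose the same idea would be to use a deterministic document id, so re-indexing after a replay overwrites the old document.) Is this the right idea?

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class UpsertMysqlSink extends RichSinkFunction<Tuple2<String, Long>> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
        // The primary key on `word` makes the write idempotent: replaying
        // the same record after a failure overwrites the old value instead
        // of inserting a duplicate row.
        statement = connection.prepareStatement(
                "INSERT INTO word_count (word, cnt) VALUES (?, ?) "
                        + "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)");
    }

    @Override
    public void invoke(Tuple2<String, Long> value, Context context) throws Exception {
        statement.setString(1, value.f0);
        statement.setLong(2, value.f1);
        statement.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}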
Best
Henry

> On Oct 13, 2018, at 11:43 AM, Hequn Cheng <chenghe...@gmail.com> wrote:
>
> Hi Henry,
>
> Yes, exactly-once using the atomic way is heavy for MySQL. However, you don't
> have to buffer data if you choose option 2. You can simply overwrite old
> records with new ones if the result data is idempotent, and this way can also
> achieve exactly-once.
> There is a document about End-to-End Exactly-Once Processing in Apache
> Flink [1], which may be helpful for you.
>
> Best, Hequn
>
> [1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
>
> On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <happydexu...@gmail.com> wrote:
> Hi,
> I am reading the book "Introduction to Apache Flink", and the book mentions
> two ways to achieve sink exactly-once:
> 1. The first way is to buffer all output at the sink and commit it
> atomically when the sink receives a checkpoint record.
> 2. The second way is to eagerly write data to the output, keeping in mind
> that some of this data might be "dirty" and replayed after a failure. If
> there is a failure, we need to roll back the output, overwriting the dirty
> data and effectively deleting dirty data that has already been written to
> the output.
>
> I read the code of the Elasticsearch sink and found there is a
> flushOnCheckpoint option; if set to true, changes accumulate until a
> checkpoint is made. I guess this guarantees at-least-once delivery: although
> it uses a batch flush, the flush is not an atomic action, so it cannot
> guarantee exactly-once delivery.
>
> My questions are:
> 1. Many sinks do not support transactions, so in those cases I have to
> choose way 2 to achieve exactly-once. But then I have to buffer all the
> records between checkpoints and delete them on failure, which is a heavy
> action.
> 2. I guess a MySQL sink should support exactly-once delivery because it
> supports transactions, but then I have to size the batch by the number of
> records between checkpoints rather than a fixed number such as 100. When
> there are a lot of items between checkpoints, that is a heavy action as
> well.
>
> Best,
> Henry
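P.S. After reading the post in [1], my understanding is that the two-phase commit way is implemented by extending Flink's TwoPhaseCommitSinkFunction. Below is a minimal sketch loosely following the file-based example in that post; the temp-file transaction, the paths, and the class name are only illustrative. Please correct me if I misunderstand the pattern.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// IN = String record, TXN = path of a temp file buffering one checkpoint
// interval of output, CONTEXT = Void (unused here).
public class FileTwoPhaseCommitSink
        extends TwoPhaseCommitSinkFunction<String, String, Void> {

    public FileTwoPhaseCommitSink() {
        super(StringSerializer.INSTANCE, VoidSerializer.INSTANCE);
    }

    @Override
    protected String beginTransaction() throws Exception {
        // Start a new transaction: a fresh temp file that collects all
        // records written between two checkpoints.
        String path = "/tmp/2pc-" + UUID.randomUUID();
        Files.createFile(Paths.get(path));
        return path;
    }

    @Override
    protected void invoke(String transaction, String value, Context context)
            throws Exception {
        // Eagerly write each record into the pending transaction.
        Files.write(Paths.get(transaction),
                (value + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.APPEND);
    }

    @Override
    protected void preCommit(String transaction) throws Exception {
        // Phase 1: called on checkpoint. For a database sink this is where
        // buffered statements would be flushed but not yet committed.
        // Nothing to do for the file case; the data is already on disk.
    }

    @Override
    protected void commit(String transaction) {
        // Phase 2: called once the checkpoint completes. Must be idempotent,
        // since Flink may re-invoke it during recovery.
        try {
            Files.move(Paths.get(transaction),
                    Paths.get(transaction + ".committed"),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            // A missing temp file means this transaction was already committed.
        }
    }

    @Override
    protected void abort(String transaction) {
        // Roll back: discard the "dirty" buffered data.
        try {
            Files.deleteIfExists(Paths.get(transaction));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}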