Hi Henry,

> 1. I have heard of an idempotent way, but I do not know how to implement it. Would you please enlighten me with an example?

Idempotence is a property of the result data. For example, you can overwrite old values with new ones using a primary key.
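As a minimal sketch of the "overwrite by primary key" idea (a toy illustration only, not Flink or MySQL API -- a plain dict stands in for the keyed external store):

```python
# Hypothetical sketch (not Flink API): an idempotent "upsert" write.
# Each record carries a primary key; writing overwrites the old value
# for that key, so replaying the same records after a failure leaves
# the table in the same final state. The *result* is exactly-once even
# though individual writes may happen more than once.

def upsert(table, batch):
    for key, value in batch:
        table[key] = value  # overwrite by primary key

table = {}
batch = [("a", 1), ("b", 2), ("a", 3)]

upsert(table, batch)   # normal run
upsert(table, batch)   # replay after a failure: same final state
print(table)           # {'a': 3, 'b': 2}
```

In a real sink this corresponds to something like an upsert statement (e.g. keyed writes where the key determines the row), so retries are harmless.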
> 2. If dirty data are *added* but not updated

That works against idempotence. Idempotence ensures that the result is consistent in the end.

> 3. If using two-phase commit, the sink must support transactions.

I think the answer is yes.

Best,
Hequn

On Sat, Oct 13, 2018 at 8:49 PM 徐涛 <happydexu...@gmail.com> wrote:

> Hi Hequn,
> Thanks a lot for your response. I have a few questions about this topic.
> Would you please help me with them?
> 1. I have heard of an idempotent way, but I do not know how to implement it. Would you please enlighten me with an example?
> 2. If dirty data are *added* but not updated, then overwriting alone is not enough, I think.
> 3. If using two-phase commit, the sink must support transactions.
> 3.1 If the sink does not support transactions, for example Elasticsearch, do I *have to* use the idempotent approach to achieve exactly-once?
> 3.2 If the sink supports transactions, for example MySQL, both the idempotent approach and two-phase commit are OK. But as you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, so I would still have to use the idempotent way to achieve exactly-once.
>
> Best,
> Henry
>
> On Oct 13, 2018, at 11:43 AM, Hequn Cheng <chenghe...@gmail.com> wrote:
>
> Hi Henry,
>
> Yes, exactly-once via the atomic way is heavy for MySQL. However, you don't have to buffer data if you choose option 2: you can simply overwrite old records with new ones if the result data is idempotent, and this way also achieves exactly-once.
> There is a document about End-to-End Exactly-Once Processing in Apache Flink [1], which may be helpful for you.
>
> Best,
> Hequn
>
> [1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
>
>
> On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <happydexu...@gmail.com> wrote:
>
>> Hi,
>> I am reading the book "Introduction to Apache Flink", and the book mentions two ways to achieve sink exactly-once:
>> 1. The first way is to buffer all output at the sink and commit it atomically when the sink receives a checkpoint record.
>> 2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be "dirty" and replayed after a failure. If there is a failure, we need to roll back the output, overwriting the dirty data and effectively deleting dirty data that has already been written to the output.
>>
>> I read the code of the Elasticsearch sink and found there is a flushOnCheckpoint option; if set to true, changes accumulate until a checkpoint is made. I guess this guarantees at-least-once delivery: although it uses batch flushes, the flush is not an atomic action, so it cannot guarantee exactly-once delivery.
>>
>> My questions are:
>> 1. Many sinks do not support transactions, so in that case I have to choose option 2 to achieve exactly-once. I then have to buffer all the records between checkpoints and delete them on failure, which is a rather heavy action.
>> 2. I guess a MySQL sink should support exactly-once delivery, because MySQL supports transactions, but in that case I have to size each batch by the number of records between checkpoints rather than by a fixed number, 100 for example. When there are a lot of items between checkpoints, that is a heavy action as well.
>>
>> Best,
>> Henry
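The first approach quoted above (buffer all output, commit atomically on checkpoint) can be sketched roughly as follows. This is a toy illustration under simplifying assumptions, not Flink's actual transactional sink API; the two lists stand in for a transactional external store:

```python
# Hypothetical sketch (not Flink API): buffer output between checkpoints
# and commit it atomically when a checkpoint completes. If a failure
# happens before the commit, the pending buffer is simply discarded and
# the records are replayed, so nothing "dirty" ever reaches the store.

class BufferingSink:
    def __init__(self):
        self.pending = []    # uncommitted records since the last checkpoint
        self.committed = []  # stands in for the external store

    def write(self, record):
        self.pending.append(record)

    def on_checkpoint(self):
        # In a real store this step must be atomic (e.g. one transaction).
        self.committed.extend(self.pending)
        self.pending = []

    def on_failure(self):
        self.pending = []    # drop dirty, uncommitted output

sink = BufferingSink()
sink.write("r1"); sink.write("r2")
sink.on_failure()                    # crash before checkpoint: nothing committed
sink.write("r1"); sink.write("r2")   # source replays the records
sink.on_checkpoint()
print(sink.committed)                # ['r1', 'r2']
```

This also shows why the approach is heavy when many records arrive between checkpoints: everything must sit in the buffer and be committed in one large batch.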