Hi Hequn,

Thanks a lot for your response. I have a few questions about this topic. Would you please help me with them?

1. I have heard of the idempotent way, but I do not know how to implement it. Would you please enlighten me with an example? (I have put a sketch of my own understanding below this list.)
2. If the dirty data is newly inserted rather than an update of existing rows, then overwriting alone is not enough, I think.
3. If we use two-phase commit, the sink must support transactions.
3.1 If the sink does not support transactions, Elasticsearch for example, do I have to use the idempotent way to implement exactly-once?
3.2 If the sink does support transactions, MySQL for example, then both the idempotent way and two-phase commit are OK. But as you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, so I would still have to use the idempotent way to implement exactly-once.
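For question 1, here is a minimal sketch of how I understand the idempotent way: a MySQL upsert keyed on a primary key, so that a record replayed after a failure overwrites the previous value instead of producing a duplicate row. The table word_count(word VARCHAR PRIMARY KEY, cnt BIGINT), the connection settings, and the class name are all made up for illustration. (For a sink like Elasticsearch, I suppose the same idea would be to use a deterministic document id, so re-indexing after a replay overwrites the old document.) Is this the right idea?

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class UpsertMysqlSink extends RichSinkFunction<Tuple2<String, Long>> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
        // The primary key on `word` makes the write idempotent: replaying
        // the same record after a failure overwrites the old value instead
        // of inserting a duplicate row.
        statement = connection.prepareStatement(
                "INSERT INTO word_count (word, cnt) VALUES (?, ?) "
                        + "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)");
    }

    @Override
    public void invoke(Tuple2<String, Long> value, Context context) throws Exception {
        statement.setString(1, value.f0);
        statement.setLong(2, value.f1);
        statement.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}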
Best
Henry

> On Oct 13, 2018, at 11:43 AM, Hequn Cheng <chenghe...@gmail.com> wrote:
>
> Hi Henry,
>
> Yes, exactly-once using the atomic way is heavy for MySQL. However, you don't
> have to buffer data if you choose option 2. You can simply overwrite old
> records with new ones if the result data is idempotent, and this way can also
> achieve exactly-once.
> There is a document about End-to-End Exactly-Once Processing in Apache
> Flink [1], which may be helpful for you.
>
> Best, Hequn
>
> [1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
>
> On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <happydexu...@gmail.com> wrote:
> Hi,
> I am reading the book "Introduction to Apache Flink", and the book mentions
> two ways to achieve sink exactly-once:
> 1. The first way is to buffer all output at the sink and commit it
> atomically when the sink receives a checkpoint record.
> 2. The second way is to eagerly write data to the output, keeping in mind
> that some of this data might be "dirty" and replayed after a failure. If
> there is a failure, we need to roll back the output, overwriting the dirty
> data and effectively deleting dirty data that has already been written to
> the output.
>
> I read the code of the Elasticsearch sink and found there is a
> flushOnCheckpoint option; if set to true, changes accumulate until a
> checkpoint is made. I guess this guarantees at-least-once delivery: although
> it uses a batch flush, the flush is not an atomic action, so it cannot
> guarantee exactly-once delivery.
>
> My questions are:
> 1. Many sinks do not support transactions, so in those cases I have to
> choose way 2 to achieve exactly-once. But then I have to buffer all the
> records between checkpoints and delete them on failure, which is a heavy
> action.
> 2. I guess a MySQL sink should support exactly-once delivery because it
> supports transactions, but then I have to size the batch by the number of
> records between checkpoints rather than a fixed number such as 100. When
> there are a lot of items between checkpoints, that is a heavy action as
> well.
>
> Best,
> Henry
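P.S. After reading the post in [1], my understanding is that the two-phase commit way is implemented by extending Flink's TwoPhaseCommitSinkFunction. Below is a minimal sketch loosely following the file-based example in that post; the temp-file transaction, the paths, and the class name are only illustrative. Please correct me if I misunderstand the pattern.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// IN = String record, TXN = path of a temp file buffering one checkpoint
// interval of output, CONTEXT = Void (unused here).
public class FileTwoPhaseCommitSink
        extends TwoPhaseCommitSinkFunction<String, String, Void> {

    public FileTwoPhaseCommitSink() {
        super(StringSerializer.INSTANCE, VoidSerializer.INSTANCE);
    }

    @Override
    protected String beginTransaction() throws Exception {
        // Start a new transaction: a fresh temp file that collects all
        // records written between two checkpoints.
        String path = "/tmp/2pc-" + UUID.randomUUID();
        Files.createFile(Paths.get(path));
        return path;
    }

    @Override
    protected void invoke(String transaction, String value, Context context)
            throws Exception {
        // Eagerly write each record into the pending transaction.
        Files.write(Paths.get(transaction),
                (value + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.APPEND);
    }

    @Override
    protected void preCommit(String transaction) throws Exception {
        // Phase 1: called on checkpoint. For a database sink this is where
        // buffered statements would be flushed but not yet committed.
        // Nothing to do for the file case; the data is already on disk.
    }

    @Override
    protected void commit(String transaction) {
        // Phase 2: called once the checkpoint completes. Must be idempotent,
        // since Flink may re-invoke it during recovery.
        try {
            Files.move(Paths.get(transaction),
                    Paths.get(transaction + ".committed"),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            // A missing temp file means this transaction was already committed.
        }
    }

    @Override
    protected void abort(String transaction) {
        // Roll back: discard the "dirty" buffered data.
        try {
            Files.deleteIfExists(Paths.get(transaction));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}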