Hi Henry,

> 1. I have heard of an idempotent way, but I do not know how to implement it. Would you please enlighten me with an example?

Idempotence is a property of the result data. For example, you can overwrite old values with new ones using a primary key.
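As a minimal sketch of the "overwrite by primary key" idea (a toy illustration only, not Flink or MySQL API -- a plain dict stands in for the keyed external store):

```python
# Hypothetical sketch (not Flink API): an idempotent "upsert" write.
# Each record carries a primary key; writing overwrites the old value
# for that key, so replaying the same records after a failure leaves
# the table in the same final state. The *result* is exactly-once even
# though individual writes may happen more than once.

def upsert(table, batch):
    for key, value in batch:
        table[key] = value  # overwrite by primary key

table = {}
batch = [("a", 1), ("b", 2), ("a", 3)]

upsert(table, batch)   # normal run
upsert(table, batch)   # replay after a failure: same final state
print(table)           # {'a': 3, 'b': 2}
```

In a real sink this corresponds to something like an upsert statement (e.g. keyed writes where the key determines the row), so retries are harmless.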
> 2. If dirty data are *added* but not updated

That works against idempotence. Idempotence ensures that the result is consistent in the end.

> 3. If using two-phase commit, the sink must support transactions.

I think the answer is yes.

Best,
Hequn

On Sat, Oct 13, 2018 at 8:49 PM 徐涛 <happydexu...@gmail.com> wrote:

> Hi Hequn,
> Thanks a lot for your response. I have a few questions about this topic.
> Would you please help me with them?
> 1. I have heard of an idempotent way, but I do not know how to implement it. Would you please enlighten me with an example?
> 2. If dirty data are *added* but not updated, then overwriting alone is not enough, I think.
> 3. If using two-phase commit, the sink must support transactions.
> 3.1 If the sink does not support transactions, for example Elasticsearch, do I *have to* use the idempotent approach to achieve exactly-once?
> 3.2 If the sink supports transactions, for example MySQL, both the idempotent approach and two-phase commit are OK. But as you say, if there are a lot of items between checkpoints, the batch insert is a heavy action, so I would still have to use the idempotent way to achieve exactly-once.
>
> Best,
> Henry
>
> On Oct 13, 2018, at 11:43 AM, Hequn Cheng <chenghe...@gmail.com> wrote:
>
> Hi Henry,
>
> Yes, exactly-once via the atomic way is heavy for MySQL. However, you don't have to buffer data if you choose option 2: you can simply overwrite old records with new ones if the result data is idempotent, and this way also achieves exactly-once.
> There is a document about End-to-End Exactly-Once Processing in Apache Flink [1], which may be helpful for you.
>
> Best,
> Hequn
>
> [1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
>
>
> On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <happydexu...@gmail.com> wrote:
>
>> Hi,
>> I am reading the book "Introduction to Apache Flink", and the book mentions two ways to achieve sink exactly-once:
>> 1. The first way is to buffer all output at the sink and commit it atomically when the sink receives a checkpoint record.
>> 2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be "dirty" and replayed after a failure. If there is a failure, we need to roll back the output, overwriting the dirty data and effectively deleting dirty data that has already been written to the output.
>>
>> I read the code of the Elasticsearch sink and found there is a flushOnCheckpoint option; if set to true, changes accumulate until a checkpoint is made. I guess this guarantees at-least-once delivery: although it uses batch flushes, the flush is not an atomic action, so it cannot guarantee exactly-once delivery.
>>
>> My questions are:
>> 1. Many sinks do not support transactions, so in that case I have to choose option 2 to achieve exactly-once. I then have to buffer all the records between checkpoints and delete them on failure, which is a rather heavy action.
>> 2. I guess a MySQL sink should support exactly-once delivery, because MySQL supports transactions, but in that case I have to size each batch by the number of records between checkpoints rather than by a fixed number, 100 for example. When there are a lot of items between checkpoints, that is a heavy action as well.
>>
>> Best,
>> Henry
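The first approach quoted above (buffer all output, commit atomically on checkpoint) can be sketched roughly as follows. This is a toy illustration under simplifying assumptions, not Flink's actual transactional sink API; the two lists stand in for a transactional external store:

```python
# Hypothetical sketch (not Flink API): buffer output between checkpoints
# and commit it atomically when a checkpoint completes. If a failure
# happens before the commit, the pending buffer is simply discarded and
# the records are replayed, so nothing "dirty" ever reaches the store.

class BufferingSink:
    def __init__(self):
        self.pending = []    # uncommitted records since the last checkpoint
        self.committed = []  # stands in for the external store

    def write(self, record):
        self.pending.append(record)

    def on_checkpoint(self):
        # In a real store this step must be atomic (e.g. one transaction).
        self.committed.extend(self.pending)
        self.pending = []

    def on_failure(self):
        self.pending = []    # drop dirty, uncommitted output

sink = BufferingSink()
sink.write("r1"); sink.write("r2")
sink.on_failure()                    # crash before checkpoint: nothing committed
sink.write("r1"); sink.write("r2")   # source replays the records
sink.on_checkpoint()
print(sink.committed)                # ['r1', 'r2']
```

This also shows why the approach is heavy when many records arrive between checkpoints: everything must sit in the buffer and be committed in one large batch.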