Re: flink snapshotting fault-tolerance

Stavros Kontopoulos Thu, 19 May 2016 11:34:26 -0700

Yes thats what i was thinking thnx. When people here exactly once they
think are you sure, there is something hidden there... because theory is
theory :)
So if you keep getting invalidated snapshots but data passes through
operators you issue a warning or fail the pipeline and return an exception
to the driver?



On Thu, May 19, 2016 at 9:30 PM, Paris Carbone <par...@kth.se> wrote:

> In that case, typically a timeout invalidates the whole snapshot (all
> states for the same epoch) until eventually we have a full complete
> snapshot.
>
>
> On 19 May 2016, at 20:26, Stavros Kontopoulos <st.kontopou...@gmail.com>
> wrote:
>
> "Checkpoints are only confirmed if all parallel subtasks successfully
> created a valid snapshot of the state." as stated by Robert. So to rephrase
> my question... how confirmation that all snapshots are finished is done and
> what happens if some task is very slow...or is blocked?
> If you have N tasks confirmed and one missing what do you do? You start a
> new checkpoint for that one? or a global new checkpoint for the rest of N
> tasks as well?
>
> On Thu, May 19, 2016 at 9:21 PM, Paris Carbone <par...@kth.se> wrote:
>
>>
>> Regarding your last question,
>> If a checkpoint expires it just gets invalidated and a new complete
>> checkpoint will eventually occur that can be used for recovery. If I am
>> wrong, or something has changed please correct me.
>>
>> Paris
>>
>> On 19 May 2016, at 20:14, Paris Carbone <par...@kth.se> wrote:
>>
>> Hi Stavros,
>>
>> Currently, rollback failure recovery in Flink works in the pipeline
>> level, not in the task level (see Millwheel [1]). It further builds on
>> repayable stream logs (i.e. Kafka), thus, there is no need for 3pc or
>> backup in the pipeline sources. You can also check this presentation [2]
>> which explains the basic concepts more in detail I hope. Mind that many
>> upcoming optimisation opportunities are going to be addressed in the not so
>> long-term Flink roadmap.
>>
>> Paris
>>
>> [1]
>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
>> [2]
>> http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha
>>
>>
>> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>>
>>
>> <http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
>>
>> On 19 May 2016, at 19:43, Stavros Kontopoulos <st.kontopou...@gmail.com>
>> wrote:
>>
>> Cool thnx. So if a checkpoint expires the pipeline will block or fail in
>> total or only the specific task related to the operator (running along with
>> the checkpoint task) or nothing happens?
>>
>> On Tue, May 17, 2016 at 3:49 PM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi Stravos,
>>>
>>> I haven't implemented our checkpointing mechanism and I didn't
>>> participate in the design decisions while implementing it, so I can not
>>> compare it in detail to other approaches.
>>>
>>> From a "does it work perspective": Checkpoints are only confirmed if all
>>> parallel subtasks successfully created a valid snapshot of the state. So if
>>> there is a failure in the checkpointing mechanism, no valid checkpoint will
>>> be created. The system will recover from the last valid checkpoint.
>>> There is a timeout for checkpoints. So if a barrier doesn't pass through
>>> the system for a certain period of time, the checkpoint is cancelled. The
>>> default timeout is 10 minutes.
>>>
>>> Regards,
>>> Robert
>>>
>>>
>>> On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos <
>>> st.kontopou...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I was looking into the flink snapshotting algorithm details also
>>>> mentioned here:
>>>>
>>>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
>>>>
>>>> https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
>>>>
>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html
>>>>
>>>> From other sources i understand that it assumes no failures to work for
>>>> message delivery or for example a process hanging for ever:
>>>> https://en.wikipedia.org/wiki/Snapshot_algorithm
>>>>
>>>> https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/
>>>>
>>>> So my understanding (maybe wrong) is that this is a solution which
>>>> seems not to address the fault tolerance issue in a strong manner like for
>>>> example if it was to use a 3pc protocol for local state propagation and
>>>> global agreement. I know the latter is not efficient just mentioning it for
>>>> comparison.
>>>>
>>>> How the algorithm behaves in practical terms under the presence of its
>>>> own failures (this is a background process collecting partial states)? Are
>>>> there timeouts for reaching a barrier?
>>>>
>>>> PS. have not looked deep into the code details yet, planning to.
>>>>
>>>> Best,
>>>> Stavros
>>>>
>>>>
>>>
>>
>>
>>
>
>

Re: flink snapshotting fault-tolerance

Reply via email to