Re: flink snapshotting fault-tolerance

Paris Carbone Thu, 19 May 2016 11:28:02 -0700

True, if you like formal modelling and stuff like that you can think of it as a 
more relaxed/abortable operation (e.g. like abortable consensus) which yields 
the same guarantees and works ok in semi-synchronous distributed systems (as in 
the case of Flink).

On 19 May 2016, at 20:22, Stavros Kontopoulos 
<[email protected]<mailto:[email protected]>> wrote:

Hey thnx for the links. There are assumptions though like reliable channels... 
since you rely on tcp in practice and if a checkpoint fails or is very slow 
then you need to deal with it.... thats why i asked previously what happens 
then..  3cp does not need assumptions i think, but engineering is more 
practical (it should be) and a different story in general.
The [2] mentions also the assumptions...

Best,
Stavros

On Thu, May 19, 2016 at 9:14 PM, Paris Carbone 
<[email protected]<mailto:[email protected]>> wrote:
Hi Stavros,

Currently, rollback failure recovery in Flink works in the pipeline level, not 
in the task level (see Millwheel [1]). It further builds on repayable stream 
logs (i.e. Kafka), thus, there is no need for 3pc or backup in the pipeline 
sources. You can also check this presentation [2] which explains the basic 
concepts more in detail I hope. Mind that many upcoming optimisation 
opportunities are going to be addressed in the not so long-term Flink roadmap.

Paris

[1] 
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
[2] 
http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha

<http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>

<http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha>
On 19 May 2016, at 19:43, Stavros Kontopoulos 
<[email protected]<mailto:[email protected]>> wrote:

Cool thnx. So if a checkpoint expires the pipeline will block or fail in total 
or only the specific task related to the operator (running along with the 
checkpoint task) or nothing happens?

On Tue, May 17, 2016 at 3:49 PM, Robert Metzger 
<[email protected]<mailto:[email protected]>> wrote:
Hi Stravos,

I haven't implemented our checkpointing mechanism and I didn't participate in 
the design decisions while implementing it, so I can not compare it in detail 
to other approaches.

>From a "does it work perspective": Checkpoints are only confirmed if all 
>parallel subtasks successfully created a valid snapshot of the state. So if 
>there is a failure in the checkpointing mechanism, no valid checkpoint will be 
>created. The system will recover from the last valid checkpoint.
There is a timeout for checkpoints. So if a barrier doesn't pass through the 
system for a certain period of time, the checkpoint is cancelled. The default 
timeout is 10 minutes.

Regards,
Robert

On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

I was looking into the flink snapshotting algorithm details also mentioned here:
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html

>From other sources i understand that it assumes no failures to work for 
>message delivery or for example a process hanging for ever:
https://en.wikipedia.org/wiki/Snapshot_algorithm
https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/

So my understanding (maybe wrong) is that this is a solution which seems not to 
address the fault tolerance issue in a strong manner like for example if it was 
to use a 3pc protocol for local state propagation and global agreement. I know 
the latter is not efficient just mentioning it for comparison.

How the algorithm behaves in practical terms under the presence of its own 
failures (this is a background process collecting partial states)? Are there 
timeouts for reaching a barrier?

PS. have not looked deep into the code details yet, planning to.

Best,
Stavros

Re: flink snapshotting fault-tolerance

Reply via email to