Hi, good questions. Regarding 1., you are right: when the JobManager fails, that state is lost. Ufuk, Till and Stephan are currently working on making the JobManager fault tolerant by running hot-standby JobManagers and storing the important JobManager state in ZooKeeper. Maybe they can comment further on their work here.
Regarding 2. (I hope I'm getting this right): the whole state of one parallel instance of an operator is stored as one piece in an HDFS file, so the number of keys does not increase the number of files that need to be stored. The checkpointing overhead can be adjusted by tuning the checkpointing interval. I think Gyula, Paris and Stephan are also working on incremental state checkpoints and on a key-value store that can gracefully go out-of-core and can also do incremental checkpoints.

Cheers,
Aljoscha

On Mon, 6 Jul 2015 at 09:45 Nathan Forager <nathan.forager...@gmail.com> wrote:

> hi there,
>
> I noticed the 0.9 release announces exactly-once semantics for streams. I
> looked at the user guide and the primary mechanism for recovery appears to
> be checkpointing of user state. I have a few questions:
>
> 1. The default behavior is that checkpoints are kept in memory on the
> JobManager. Am I correct in assuming that this does *not* guarantee failure
> recovery or exactly-once semantics if the driver fails?
>
> 2. The alternative recommended approach is to checkpoint to HDFS. If I have
> a program where I am doing something like aggregating counts over thousands
> of keys... it doesn't seem tenable to save a huge number of checkpoint files
> to HDFS with acceptable latency.
>
> Could you comment at all on the persistence model for exactly-once in
> Flink? I am pretty confused because checkpointing to HDFS in this way seems
> to have limitations around scalability and latency.
>
> - nate
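[To make the checkpoint-interval point above concrete, here is a minimal sketch of enabling periodic checkpointing in a streaming job. The enableCheckpointing() call is the public 0.9 streaming API; the flink-conf.yaml keys mentioned in the comments for directing checkpoints to HDFS are an assumption, and the hdfs:// path and port are placeholders, so check the configuration docs for the exact names in your version.]

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a consistent snapshot of all operator state every 10 seconds.
        // A longer interval means less checkpointing overhead, but more
        // records to replay after a failure.
        env.enableCheckpointing(10000);

        // Where the snapshots are persisted is a cluster-level setting.
        // Assumed flink-conf.yaml keys for writing them to HDFS instead of
        // keeping them in JobManager memory (verify against your version):
        //   state.backend: filesystem
        //   state.backend.fs.checkpointdir: hdfs://namenode:9000/flink/checkpoints

        env.socketTextStream("localhost", 9999)
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toLowerCase();
                    }
                })
                .print();

        env.execute("checkpointed job");
    }
}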