Yes, that would be great!

Thank you, Fabian

2018-03-23 3:06 GMT+01:00 Ashish Pokharel <>:

> Fabian, that sounds good. Should I recap some bullets in an email and
> start a new thread then?
> Thanks, Ashish
> On Mar 22, 2018, at 5:16 AM, Fabian Hueske <> wrote:
> Hi Ashish,
> Agreed!
> I think the right approach would be to gather the requirements and start a
> discussion on the dev mailing list.
> Contributors and committers who are more familiar with the checkpointing
> and recovery internals should discuss a solution that can be integrated and
> doesn't break with the current mechanism.
> For instance (not sure whether this is feasible or solves the problem) one
> could only do local checkpoints and not write to the distributed persistent
> storage. That would bring down checkpointing costs and the recovery life
> cycle would not need to be radically changed.
> Best, Fabian
> 2018-03-20 22:56 GMT+01:00 Ashish Pokharel <>:
>> I definitely like the idea of event based checkpointing :)
>> Fabian, I do agree with your point that it is not possible to take a
>> rescue checkpoint consistently. The point here, however, is not about the
>> operator that actually failed. It’s to avoid data loss across 100s
>> (probably 1000s) of parallel operators which are being restarted even though
>> they are “healthy”. We have 100k (nearing a million soon) elements pushing data.
>> Losing a few seconds' worth of data for a few of them is not good but
>> “acceptable” as long as the damage can be controlled. Of course, we are going to use rocksdb +
>> 2-phase commit with Kafka where we need exactly once guarantees. The
>> proposal of “fine grain recovery”
>> (https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures)
>> seems like a good start, at least from a damage-control perspective, but even
>> with that it feels like something like an “event based approach” could be done
>> for the sub-set of the job graph that is “healthy”.
>> Thanks, Ashish
>> On Mar 20, 2018, at 9:53 AM, Fabian Hueske <> wrote:
>> Well, that's not that easy to do, because checkpoints must be coordinated
>> and triggered by the JobManager.
>> Also, the checkpointing mechanism with flowing checkpoint barriers (to
>> ensure checkpoint consistency) won't work once a task failed because it
>> cannot continue processing and forward barriers. If the task failed with an
>> OOME, the whole JVM is gone anyway.
>> I don't think it is possible to take something like a consistent rescue
>> checkpoint in case of a failure.
>> It might be possible to checkpoint application state of non-failed tasks,
>> but this would result in data loss for the failed task and we would need to
>> weigh the use cases for such a feature against the implementation effort.
>> Maybe there are better ways to address such use cases.
>> Best, Fabian
>> 2018-03-20 6:43 GMT+01:00 makeyang <>:
>>> Currently there is only a time-based way to trigger a checkpoint. Based on
>>> this discussion, I think Flink needs to introduce an event-based way to
>>> trigger a checkpoint; for example, restarting a task manager should count
>>> as an event.