Yes, that would be great!

Thank you, Fabian

2018-03-23 3:06 GMT+01:00 Ashish Pokharel <ashish...@yahoo.com>:

> Fabian, that sounds good. Should I recap some bullets in an email and
> start a new thread then?
>
> Thanks, Ashish
>
>
> On Mar 22, 2018, at 5:16 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Ashish,
>
> Agreed!
> I think the right approach would be to gather the requirements and start a
> discussion on the dev mailing list.
> Contributors and committers who are more familiar with the checkpointing
> and recovery internals should discuss a solution that can be integrated and
> doesn't break with the current mechanism.
> For instance (not sure whether this is feasible or whether it solves the
> problem), one could take only local checkpoints and skip the write to the
> distributed persistent storage. That would bring down checkpointing costs,
> and the recovery life cycle would not need to change radically.
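>
> For reference, a rough sketch of the closest existing knob I'm aware of:
> task-local recovery, which keeps an additional local copy of each task's
> snapshot so recovery can avoid re-reading the distributed store (it does
> not skip the remote write, so the idea above would go a step further). The
> config key and the local-environment setup below are assumptions to check
> against the Flink version in use:
>
>     import org.apache.flink.configuration.Configuration;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
>     public class LocalRecoverySketch {
>         public static void main(String[] args) throws Exception {
>             Configuration conf = new Configuration();
>             // Assumed key: keep a secondary local copy of task state for recovery.
>             conf.setBoolean("state.backend.local-recovery", true);
>
>             StreamExecutionEnvironment env =
>                 StreamExecutionEnvironment.createLocalEnvironment(1, conf);
>             // Checkpoints are still triggered periodically by the JobManager.
>             env.enableCheckpointing(10000);
>
>             // ... job definition ...
>             // env.execute("local-recovery-sketch");
>         }
>     }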
>
> Best, Fabian
>
> 2018-03-20 22:56 GMT+01:00 Ashish Pokharel <ashish...@yahoo.com>:
>
>> I definitely like the idea of event based checkpointing :)
>>
>> Fabian, I do agree with your point that it is not possible to take a
>> rescue checkpoint consistently. The concern here, however, is not the
>> operator that actually failed. It’s to avoid data loss across the hundreds
>> (probably thousands) of parallel operators that are being restarted even
>> though they are “healthy”. We have 100k (nearing a million soon) elements
>> pushing data. Losing a few seconds’ worth of data for a few of them is not
>> good, but “acceptable” as long as the damage can be controlled. Of course,
>> we are going to use RocksDB + two-phase commit with Kafka where we need
>> exactly-once guarantees. The proposal for fine-grained recovery (FLIP-1,
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures)
>> seems like a good start, at least from a damage-control perspective, but
>> even with that it feels like an “event based” approach could be applied to
>> the subset of the job graph that is “healthy”.
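>>
>> For context, a minimal sketch of the setup described above, assuming the
>> Kafka 0.11 connector and HDFS for checkpoints; topic names, paths, and
>> the exact constructor overload should be checked against the connector
>> version in use:
>>
>>     import java.util.Properties;
>>
>>     import org.apache.flink.api.common.serialization.SimpleStringSchema;
>>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>>     import org.apache.flink.streaming.api.CheckpointingMode;
>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>     import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
>>     import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
>>
>>     public class ExactlyOnceSketch {
>>         public static void main(String[] args) throws Exception {
>>             StreamExecutionEnvironment env =
>>                 StreamExecutionEnvironment.getExecutionEnvironment();
>>
>>             // RocksDB keyed state with incremental checkpoints to limit the
>>             // volume written to the distributed store per checkpoint.
>>             env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
>>             env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE);
>>
>>             Properties kafkaProps = new Properties();
>>             kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
>>             // Transaction timeout must cover the longest expected checkpoint.
>>             kafkaProps.setProperty("transaction.timeout.ms", "900000");
>>
>>             // Two-phase-commit sink: records only become visible downstream
>>             // once the checkpoint that covers them completes.
>>             FlinkKafkaProducer011<String> sink = new FlinkKafkaProducer011<>(
>>                 "output-topic",
>>                 new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
>>                 kafkaProps,
>>                 FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);
>>
>>             env.fromElements("a", "b", "c").addSink(sink);
>>             env.execute("exactly-once-sketch");
>>         }
>>     }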
>>
>> Thanks, Ashish
>>
>>
>> On Mar 20, 2018, at 9:53 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>> Well, that's not that easy to do, because checkpoints must be coordinated
>> and triggered by the JobManager.
>> Also, the checkpointing mechanism with flowing checkpoint barriers (to
>> ensure checkpoint consistency) won't work once a task has failed, because
>> it cannot continue processing and forwarding barriers. If the task failed
>> with an OOME, the whole JVM is gone anyway.
>> I don't think it is possible to take something like a consistent rescue
>> checkpoint in case of a failure.
>>
>> It might be possible to checkpoint the application state of non-failed
>> tasks, but this would result in data loss for the failed task, and we would
>> need to weigh the use cases for such a feature against the implementation
>> effort.
>> Maybe there are better ways to address such use cases.
>>
>> Best, Fabian
>>
>> 2018-03-20 6:43 GMT+01:00 makeyang <riverbuild...@hotmail.com>:
>>
>>> Currently there is only a time-based way to trigger a checkpoint. Based on
>>> this discussion, I think Flink needs to introduce an event-based way to
>>> trigger checkpoints; for example, the restart of a TaskManager should count
>>> as such an event.
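>>>
>>> For reference, a short sketch of what "time based" means today: the
>>> interval passed to enableCheckpointing() is the only trigger exposed to
>>> the user (values below are placeholders):
>>>
>>>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>>
>>>     // The JobManager injects checkpoint barriers purely on this timer; there
>>>     // is no user-facing API to trigger a checkpoint on an event such as a
>>>     // TaskManager restart (savepoints can be triggered manually, but that is
>>>     // a separate mechanism).
>>>     env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE);
>>>     env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);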
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
