Re: Restart hook and checkpoint

Ashish Pokharel Tue, 20 Mar 2018 14:56:51 -0700

I definitely like the idea of event based checkpointing :) 

Fabian, I do agree with your point that it is not possible to take a rescue 
checkpoint consistently. The basis here however is not around the operator that 
actually failed. It’s to avoid data loss across 100s (probably 1000s of 
parallel operators) which are being restarted and are “healthy”. We have 100k 
(nearing million soon) elements pushing data. Losing few seconds worth of data 
for few is not good but “acceptable” as long as damage can be controlled. Of 
course, we are going to use rocksdb + 2-phase commit with Kafka where we need 
exactly once guarantees. The proposal of “fine grain recovery 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
 
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+:+Fine+Grained+Recovery+from+Task+Failures>”
 seems like a good start at least from damage control perspective but even with 
that it feels like something like “event based approach” can be done for a 
sub-set of job graph that are “healthy”.


Thanks, Ashish


> On Mar 20, 2018, at 9:53 AM, Fabian Hueske <fhue...@gmail.com> wrote:
> 
> Well, that's not that easy to do, because checkpoints must be coordinated and 
> triggered the JobManager.
> Also, the checkpointing mechanism with flowing checkpoint barriers (to ensure 
> checkpoint consistency) won't work once a task failed because it cannot 
> continue processing and forward barriers. If the task failed with an OOME, 
> the whole JVM is gone anyway.
> I don't think it is possible to take something like a consistent rescue 
> checkpoint in case of a failure. 
> 
> I might be possible to checkpoint application state of non-failed tasks, but 
> this would result in data loss for the failed task and we would need to weigh 
> the use cases for such a feature are the implementation effort.
> Maybe there are better ways to address such use cases.
> 
> Best, Fabian
> 
> 2018-03-20 6:43 GMT+01:00 makeyang <riverbuild...@hotmail.com 
> <mailto:riverbuild...@hotmail.com>>:
> currently there is only time based way to trigger a checkpoint. based on this
> discussion, I think flink need to introduce event based way to trigger
> checkpoint such as restart a task manager should be count as a event.
> 
> 
> 
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ 
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/>
>

Re: Restart hook and checkpoint

Reply via email to