I definitely like the idea of event based checkpointing :) Fabian, I do agree with your point that it is not possible to take a rescue checkpoint consistently. The basis here however is not around the operator that actually failed. It’s to avoid data loss across 100s (probably 1000s of parallel operators) which are being restarted and are “healthy”. We have 100k (nearing million soon) elements pushing data. Losing few seconds worth of data for few is not good but “acceptable” as long as damage can be controlled. Of course, we are going to use rocksdb + 2-phase commit with Kafka where we need exactly once guarantees. The proposal of “fine grain recovery https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures <https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+:+Fine+Grained+Recovery+from+Task+Failures>” seems like a good start at least from damage control perspective but even with that it feels like something like “event based approach” can be done for a sub-set of job graph that are “healthy”.
Thanks, Ashish > On Mar 20, 2018, at 9:53 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > Well, that's not that easy to do, because checkpoints must be coordinated and > triggered the JobManager. > Also, the checkpointing mechanism with flowing checkpoint barriers (to ensure > checkpoint consistency) won't work once a task failed because it cannot > continue processing and forward barriers. If the task failed with an OOME, > the whole JVM is gone anyway. > I don't think it is possible to take something like a consistent rescue > checkpoint in case of a failure. > > I might be possible to checkpoint application state of non-failed tasks, but > this would result in data loss for the failed task and we would need to weigh > the use cases for such a feature are the implementation effort. > Maybe there are better ways to address such use cases. > > Best, Fabian > > 2018-03-20 6:43 GMT+01:00 makeyang <riverbuild...@hotmail.com > <mailto:riverbuild...@hotmail.com>>: > currently there is only time based way to trigger a checkpoint. based on this > discussion, I think flink need to introduce event based way to trigger > checkpoint such as restart a task manager should be count as a event. > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ > <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> >