Yes, that would be great! Thank you, Fabian
2018-03-23 3:06 GMT+01:00 Ashish Pokharel <ashish...@yahoo.com>:

> Fabian, that sounds good. Should I recap some bullets in an email and
> start a new thread then?
>
> Thanks, Ashish
>
>
> On Mar 22, 2018, at 5:16 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Ashish,
>
> Agreed!
> I think the right approach would be to gather the requirements and start
> a discussion on the dev mailing list.
> Contributors and committers who are more familiar with the checkpointing
> and recovery internals should discuss a solution that can be integrated
> and doesn't break the current mechanism.
> For instance (not sure whether this is feasible or solves the problem),
> one could do only local checkpoints and not write to the distributed
> persistent storage. That would bring down checkpointing costs, and the
> recovery life cycle would not need to be changed radically.
>
> Best, Fabian
>
> 2018-03-20 22:56 GMT+01:00 Ashish Pokharel <ashish...@yahoo.com>:
>
>> I definitely like the idea of event-based checkpointing :)
>>
>> Fabian, I do agree with your point that it is not possible to take a
>> rescue checkpoint consistently. The focus here, however, is not the
>> operator that actually failed. It is to avoid data loss across hundreds
>> (probably thousands) of parallel operators that are being restarted and
>> are "healthy". We have 100k (nearing a million soon) elements pushing
>> data. Losing a few seconds' worth of data for a few of them is not good
>> but "acceptable", as long as the damage can be controlled. Of course, we
>> are going to use RocksDB + two-phase commit with Kafka where we need
>> exactly-once guarantees. The proposal for fine-grained recovery
>> (https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures)
>> seems like a good start, at least from a damage-control perspective, but
>> even with that it feels like an "event-based approach" could be applied
>> to the subset of the job graph that is "healthy".
>>
>> Thanks, Ashish
>>
>>
>> On Mar 20, 2018, at 9:53 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>> Well, that's not that easy to do, because checkpoints must be coordinated
>> and triggered by the JobManager.
>> Also, the checkpointing mechanism with flowing checkpoint barriers (to
>> ensure checkpoint consistency) won't work once a task has failed, because
>> it cannot continue processing and forwarding barriers. If the task failed
>> with an OOME, the whole JVM is gone anyway.
>> I don't think it is possible to take something like a consistent rescue
>> checkpoint in case of a failure.
>>
>> It might be possible to checkpoint the application state of non-failed
>> tasks, but this would result in data loss for the failed task, and we
>> would need to weigh the use cases for such a feature against the
>> implementation effort. Maybe there are better ways to address such use
>> cases.
>>
>> Best, Fabian
>>
>> 2018-03-20 6:43 GMT+01:00 makeyang <riverbuild...@hotmail.com>:
>>
>>> Currently there is only a time-based way to trigger a checkpoint. Based
>>> on this discussion, I think Flink needs to introduce an event-based way
>>> to trigger checkpoints, e.g. restarting a task manager should count as
>>> such an event.
>>>
>>> --
>>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
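
For readers following the thread: the only checkpoint trigger Flink exposes today is the time-based one makeyang refers to, configured on the StreamExecutionEnvironment. Below is a minimal sketch of that configuration; the class name, interval values, and placeholder pipeline are illustrative only and not taken from the thread.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints are triggered by the JobManager on a fixed timer;
        // this interval is the only trigger mechanism discussed above.
        env.enableCheckpointing(60_000L); // every 60 seconds

        // Explicit exactly-once mode, as needed for the RocksDB +
        // two-phase-commit Kafka setup Ashish describes.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);

        // Placeholder pipeline so the sketch runs; a real job would read
        // from Kafka and write with a transactional (2PC) producer.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-interval-sketch");
    }
}

As a side note, task-local recovery (the state.backend.local-recovery setting) points in the direction of Fabian's local-checkpoint idea, but as designed it keeps a local copy in addition to, not instead of, the write to distributed storage.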