Hi Srikanth! That is an interesting idea. I have it on my mind to create a design doc for checkpointing improvements. That could be added as a proposal there.
I hope I'll be able to start with that design doc next week.

Greetings,
Stephan

On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <mj...@apache.org> wrote:

> I don't think barriers can "expire" as of now. Might be a nice idea
> though -- I don't know if this might be a problem in production.
>
> Furthermore, I want to point out that an "expiring checkpoint" would
> not break exactly-once processing, as the latest successful checkpoint
> can always be used to recover correctly. Only the recovery time would
> increase, because if a "barrier expires" and no checkpoint can be
> stored, more data has to be replayed using the "old" checkpoint.
>
> -Matthias
>
> On 05/12/2016 09:21 PM, Srikanth wrote:
> > Hello,
> >
> > I was reading about Flink's checkpointing and wanted to check if I
> > correctly understood the usage of barriers for exactly-once processing.
> > 1) An operator does alignment by buffering records that arrive after a
> > barrier until it receives a barrier from all upstream operator instances.
> > 2) A barrier is always preceded by a watermark to trigger processing of
> > all windows that are complete.
> > 3) Records in windows that are not triggered are also saved as part of
> > the checkpoint. These windows are repopulated when restoring from
> > checkpoints.
> >
> > In production setups, were there any cases where alignment during
> > checkpointing caused unacceptable latency?
> > If so, is there a way to indicate, say, a maximum wait of 100 ms? That
> > way we have exactly-once in most situations but prefer at-least-once
> > over higher latency in corner cases.
> >
> > Srikanth
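
To make the alignment step and the "expiring barrier" idea from this thread concrete, here is a minimal toy model in Python. It is only a sketch and not Flink code: the class `AligningOperator`, the `max_align_ms` knob, and the explicit `now_ms` timestamps are all hypothetical names invented for illustration. It assumes that an expired alignment simply abandons that checkpoint, which matches Matthias's point that recovery would then fall back to the last successful one.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class AligningOperator:
    """Toy model of barrier alignment on one operator (not Flink code)."""
    num_channels: int
    max_align_ms: Optional[int] = None  # hypothetical knob, not a Flink setting
    blocked: Set[int] = field(default_factory=set)        # channels whose barrier arrived
    buffered: List[Tuple[int, str]] = field(default_factory=list)
    processed: List[str] = field(default_factory=list)
    completed: List[int] = field(default_factory=list)    # checkpoint ids that aligned
    abandoned: Set[int] = field(default_factory=set)      # checkpoint ids that expired
    align_start_ms: Optional[int] = None
    current_id: Optional[int] = None

    def on_record(self, channel: int, record: str, now_ms: int) -> None:
        self._maybe_expire(now_ms)
        if channel in self.blocked:
            # This channel already delivered its barrier: hold the record back.
            self.buffered.append((channel, record))
        else:
            self.processed.append(record)

    def on_barrier(self, channel: int, checkpoint_id: int, now_ms: int) -> None:
        self._maybe_expire(now_ms)
        if checkpoint_id in self.abandoned:
            return  # late barrier of a checkpoint we already gave up on
        if not self.blocked:
            # First barrier of this checkpoint starts the alignment phase.
            self.align_start_ms = now_ms
            self.current_id = checkpoint_id
        self.blocked.add(channel)
        if len(self.blocked) == self.num_channels:
            # Barriers from all inputs seen: state would be snapshotted here.
            self.completed.append(checkpoint_id)
            self._release()

    def _maybe_expire(self, now_ms: int) -> None:
        # Give up on alignment once it has lasted longer than max_align_ms.
        if (self.max_align_ms is not None and self.blocked
                and now_ms - self.align_start_ms > self.max_align_ms):
            self.abandoned.add(self.current_id)
            self._release()

    def _release(self) -> None:
        # Process everything held back during alignment and unblock channels.
        self.processed.extend(record for _, record in self.buffered)
        self.buffered.clear()
        self.blocked.clear()
        self.align_start_ms = None
```

With `max_align_ms=None` the operator always waits for every input, which is the behaviour described in point 1). With a limit set, a slow input costs at most that much extra latency, at the price of a skipped checkpoint and a longer replay on recovery.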