You can use the checkpoint mode to "at least once". That way, barriers never block.
On Fri, May 13, 2016 at 6:05 PM, Srikanth <srikanth...@gmail.com> wrote: > I have a follow up. Is there a recommendation of list of knobs that can be > tuned if at least once guarantee while handling failure is good enough? > For cases like alert generation, non idempotent sink, etc where the system > can live with duplicates or has other mechanism to handle them. > > Srikanth > > On Fri, May 13, 2016 at 8:09 AM, Stephan Ewen <se...@apache.org> wrote: > >> Hi Srikanth! >> >> That is an interesting idea. >> I have it on my mind to create a design doc for checkpointing >> improvements. That could be added as a proposal there. >> >> I hope I'll be able to start with that design doc next week. >> >> Greetings, >> Stephan >> >> >> On Fri, May 13, 2016 at 1:35 PM, Matthias J. Sax <mj...@apache.org> >> wrote: >> >>> I don't think barries can "expire" as of now. Might be a nice idea >>> thought -- I don't know if this might be a problem in production. >>> >>> Furthermore, I want to point out, that an "expiring checkpoint" would >>> not break exactly-once processing, as the latest successful checkpoint >>> can always be used to recover correctly. Only the recovery-time would be >>> increase. because if a "barrier expires" and no checkpoint can be >>> stored, more data has to be replayed using the "old" checkpoint". >>> >>> >>> -Matthias >>> >>> On 05/12/2016 09:21 PM, Srikanth wrote: >>> > Hello, >>> > >>> > I was reading about Flink's checkpoint and wanted to check if I >>> > correctly understood the usage of barriers for exactly once processing. >>> > 1) Operator does alignment by buffering records coming after a barrier >>> > until it receives barrier from all upstream operators instances. >>> > 2) Barrier is always preceded by a watermark to trigger processing all >>> > windows that are complete. >>> > 3) Records in windows that are not triggered are also saved as part of >>> > checkpoint. These windows are repopulated when restoring from >>> checkpoints. >>> > >>> > In production setups, were there any cases where alignment during >>> > checkpointing caused unacceptable latency? >>> > If so, is there a way to indicate say wait for a MAX 100 ms? That way >>> we >>> > have exactly-once in most situations but prefer at least once over >>> > higher latency in corner cases. >>> > >>> > Srikanth >>> >>> >> >