Hi Chen,

Just to add to the previous discussion: we are currently discussing possible improvements to the windowing mechanism/semantics here:
https://docs.google.com/document/d/1Xp-YBf87vLTduYSivgqWVEMjYUmkA-hyb4muX3KRl08/edit#heading=h.psfzjlv68tp

and the related discussion on the Flink dev mailing list, in the thread titled: [DISCUSS] Allowed Lateness in Flink

In the doc you can also see that allowed lateness is already in the master branch and will be part of the upcoming release. Any input to the discussion is more than welcome, so feel free to jump in.

Kostas

> On Jul 7, 2016, at 1:17 AM, Chen Qin <qinnc...@gmail.com> wrote:
>
> Jamie,
>
> Sorry for the late reply, some of my thoughts inline.
>
> -Chen
>
> Another way to do this is to kick off a parallel job to do the backfill from the previous savepoint without stopping the current "realtime" job. This way you would not have to have a "blackout". This assumes your final sink can tolerate some parallel writes to it, OR that you have two different sinks and you throw a switch from one to the other for downstream jobs, etc.
>
> Sounds great to me. I think it will solve the "blackout" issue I mentioned. The sink might have to work in a more read-check-write fashion, but that should be fine.
>
> In general, I don't know if there are good solutions for all of these scenarios. Some might keep messages in windows longer (messages not purged yet); some might kick off another pipeline just dealing with the affected windows (messages already purged). What would be the suggested patterns?
>
> Of course, ideally you would just keep data in windows long enough that you don't purge window state until you're sure there is no more data coming. The problem with this approach in the real world is that you may be wrong with whatever time you choose ;) I would suggest doing the best job possible upfront by using an appropriate watermark strategy to deal with most of the data. Then process the truly late data with a separate path in the application code. This "separate" path may have to deal with merging late data with the data that has already been written to the sink, but this is definitely possible depending on the sink.
>
> Makes sense. Truly late events should go through a side job that merges them with whatever has been written to the sink. That might also imply that both sinks are able to do read-check-merge.
>
> E.g. a job counting search keywords from the beginning: an outage causes some hosts, partitioned by keyword, to go down for a couple of days. A backfill job starts loading and adding counts; after it has backfilled all the missing keywords and merged the aggregation results, it might need to write into windows that are still current (not yet written out) and let the main job pick up and merge the results.
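For anyone who wants to try this out, below is a rough sketch of how allowed lateness is used from the Java DataStream API, combined with a bounded-out-of-orderness watermark that deals with most of the disorder upfront. The keyword-count schema, window size, and durations are made up purely for illustration; they are not taken from the doc or the thread above.

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeywordCountsWithLateness {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Illustrative source of (keyword, count, eventTimestamp) records.
        DataStream<Tuple3<String, Long, Long>> events = env
            .fromElements(
                Tuple3.of("flink", 1L, 1467850000000L),
                Tuple3.of("kafka", 1L, 1467850060000L),
                Tuple3.of("flink", 1L, 1467849700000L))
            // Watermark lags the highest seen timestamp by 5 minutes, so moderately
            // out-of-order records are still assigned to their window on time.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Long>>(Time.minutes(5)) {
                    @Override
                    public long extractTimestamp(Tuple3<String, Long, Long> record) {
                        return record.f2;
                    }
                });

        events
            .keyBy(0)                                              // key by keyword
            .window(TumblingEventTimeWindows.of(Time.minutes(10)))
            // Keep window state for one extra hour after the watermark passes the
            // end of the window, so late arrivals update the emitted count instead
            // of being dropped.
            .allowedLateness(Time.hours(1))
            .sum(1)
            .print();

        env.execute("keyword counts with allowed lateness");
    }
}

Anything that arrives after the allowed lateness has expired is still dropped by the window operator, which is exactly where the separate backfill job and the read-check-merge against the sink discussed above would come in.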