As I said before, you can solve that with a custom WatermarkAssigner. Collect a histogram, take the median out of X samples, ignore outliers, etc.
2017-12-12 13:37 GMT+01:00 Jinhua Luo <luajit...@gmail.com>: > Think about we have a normal ordered stream, if an abnormal event A > appears and thus advances the watermark, making all subsequent normal > events (earlier than A) late, I think it's a mistake. > The ways you listed cannot help this mistake. The normal events cannot > be dropped, and the lateness may be hard to determine (it depends on > the timestamp of the abnormal event) and re-triggered the window to > downstream brings in side-effect. > If the abnormal event appears in the middle of stream, then maybe we > could filter out this event checking the delta with the last element, > but what if the abnormal event is the first event emitted by the > source? > > > 2017-12-12 19:25 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > > Early events are usually not an issue because the can be kept in state > until > > they are ready to be processed. > > Also, depending on the watermark assigner often push the watermark ahead > > such that they are not early but all other events are late. > > > > Handling of late events depends on your use case and there are the three > > options that I already listed: > > > > 1) dropping > > 2) keeping state of "completed" computations for some time (allowed > > lateness). If a late event arrives, you can update the result and emit an > > update. In this case your downstream operators systems have to be able to > > deal with updates. > > 3) send the late events to a different channel via side outputs and > handle > > them later. > > > > > > > > 2017-12-12 12:14 GMT+01:00 Jinhua Luo <luajit...@gmail.com>: > >> > >> Yes, I know flink is flexible. > >> > >> But I am thinking when the event sequence is mess (e,g, branches of > >> time-series events interleaved, but each branch has completely > >> different time periods), then it's hard to apply them into streaming > >> api, because no matter which way you generate watermark, the watermark > >> cannot be backward or branching. > >> > >> Is there any best practice to handle late event and/or early event? > >> > >> > >> 2017-12-12 18:24 GMT+08:00 Fabian Hueske <fhue...@gmail.com>: > >> > Hi, > >> > > >> > this depends on how you generate watermarks [1]. > >> > You could generate watermarks with a four hour delay and be fine (at > the > >> > cost of a four hour latency) or have some checks that you don't > >> > increment a > >> > watermark by more than x minutes at a time. > >> > These considerations are quite use case specific, so it's hard to give > >> > an > >> > advice that applies to all cases. > >> > > >> > There are also different strategies for how to handle late data in > >> > windows. > >> > You can drop it (default behavior), you can update previously emitted > >> > results (allowed lateness) [2], or emit them to a side output [3]. > >> > > >> > Flink is quite flexible when dealing with watermarks and late data. > >> > > >> > Best, Fabian > >> > > >> > [1] > >> > > >> > https://ci.apache.org/projects/flink/flink-docs- > release-1.3/dev/event_timestamps_watermarks.html > >> > [2] > >> > > >> > https://ci.apache.org/projects/flink/flink-docs- > release-1.3/dev/windows.html#allowed-lateness > >> > [3] > >> > > >> > https://ci.apache.org/projects/flink/flink-docs- > release-1.3/dev/windows.html#getting-late-data-as-a-side-output > >> > > >> > 2017-12-12 10:16 GMT+01:00 Jinhua Luo <luajit...@gmail.com>: > >> >> > >> >> Hi All, > >> >> > >> >> The watermark is monotonous incremental in a stream, correct? > >> >> > >> >> Given a stream out-of-order extremely, e.g. > >> >> e4(12:04:33) --> e3 (15:00:22) --> e2(12:04:21) --> e1 (12:03:01) > >> >> > >> >> Here e1 appears first, so watermark start from 12:03:01, so e3 is an > >> >> early event, it would be placed in another window, and fired > >> >> individually, correct? If so, the result is not bad. > >> >> > >> >> The worse case is: > >> >> > >> >> e4(12:04:33) --> e3 (12:03:01) --> e2(12:04:21) --> e1 (15:00:22) > >> >> > >> >> > >> >> Then e2,e3,e4 would be considered late events and get discarded? And > >> >> the watermark are set to a wrong value permanently? > >> >> > >> >> So the stream must not be that out-of-order, otherwise flink could > not > >> >> handle them well? > >> > > >> > > > > > >