I also don't like big changes but sometimes they are necessary. The reason why I'm so adamant about out-of-order processing is that out-of-order elements are not some exception that occurs once in a while; they occur constantly in a distributed system. For example, in this: https://gist.github.com/aljoscha/a367012646ab98208d27 the resulting windows are completely bogus because the current windowing system assumes elements to globally arrive in order, which is simply not true. (The example has a source that generates increasing integers. Then these pass through a map and are unioned with the original DataStream before a window operator.) This simulates elements arriving from different operators at a windowing operator. The example is also DOP=1, I imagine this to get worse with higher DOP.
What do you mean by costly? As I said, I have a proof-of-concept windowing operator that can handle out-or-order elements. This is an example using the current Flink API: https://gist.github.com/aljoscha/f8dce0691732e344bbe8. (It is an infinite source of tuples and a 5 second window operator that counts the tuples.) The first problem is that this code deadlocks because of the thread that emits fake elements. If I disable the fake element code it works, but the throughput using my mockup is 4 times higher . The gap widens dramatically if the window size increases. So, it actually increases performance (unless I'm making a mistake in my explorations) and can handle elements that arrive out-of-order (which happens basically always in a real-world windowing use-cases). On Tue, 23 Jun 2015 at 12:51 Stephan Ewen <se...@apache.org> wrote: > What I like a lot about Aljoscha's proposed design is that we need no > different code for "system time" vs. "event time". It only differs in where > the timestamps are assigned. > > The OOP approach also gives you the semantics of total ordering without > imposing merges on the streams. > > On Tue, Jun 23, 2015 at 10:43 AM, Matthias J. Sax < > mj...@informatik.hu-berlin.de> wrote: > > > I agree that there should be multiple alternatives the user(!) can > > choose from. Partial out-of-order processing works for many/most > > aggregates. However, if you consider Event-Pattern-Matching, global > > ordering in necessary (even if the performance penalty might be high). > > > > I would also keep "system-time windows" as an alternative to "source > > assigned ts-windows". > > > > It might also be interesting to consider the following paper for > > overlapping windows: "Resource sharing in continuous sliding-window > > aggregates" > > > > > https://dl.acm.org/citation.cfm?id=1316720 > > > > -Matthias > > > > On 06/23/2015 10:37 AM, Gyula Fóra wrote: > > > Hey > > > > > > I think we should not block PRs unnecessarily if your suggested changes > > > might touch them at some point. > > > > > > Also I still think we should not put everything in the Datastream > because > > > it will be a huge mess. > > > > > > Also we need to agree on the out of order processing, whether we want > it > > > the way you proposed it(which is quite costly). Another alternative > > > approach there which fits in the current windowing is to filter out if > > > order events and apply a special handling operator on them. This would > be > > > fairly lightweight. > > > > > > My point is that we need to consider some alternative solutions. And we > > > should not block contributions along the way. > > > > > > Cheers > > > Gyula > > > > > > On Tue, Jun 23, 2015 at 9:55 AM Aljoscha Krettek <aljos...@apache.org> > > > wrote: > > > > > >> The reason I posted this now is that we need to think about the API > and > > >> windowing before proceeding with the PRs of Gabor (inverse reduce) and > > >> Gyula (removal of "aggregate" functions on DataStream). > > >> > > >> For the windowing, I think that the current model does not work for > > >> out-of-order processing. Therefore, the whole windowing infrastructure > > will > > >> basically have to be redone. Meaning also that any work on the > > >> pre-aggregators or optimizations that we do now becomes useless. > > >> > > >> For the API, I proposed to restructure the interactions between all > the > > >> different *DataStream classes and grouping/windowing. (See API section > > of > > >> the doc I posted.) > > >> > > >> On Mon, 22 Jun 2015 at 21:56 Gyula Fóra <gyula.f...@gmail.com> wrote: > > >> > > >>> Hi Aljoscha, > > >>> > > >>> Thanks for the nice summary, this is a very good initiative. > > >>> > > >>> I added some comments to the respective sections (where I didnt fully > > >> agree > > >>> :).). > > >>> At some point I think it would be good to have a public hangout > session > > >> on > > >>> this, which could make a more dynamic discussion. > > >>> > > >>> Cheers, > > >>> Gyula > > >>> > > >>> Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. > > >> 22., > > >>> H, 21:34): > > >>> > > >>>> Hi, > > >>>> with people proposing changes to the streaming part I also wanted to > > >>> throw > > >>>> my hat into the ring. :D > > >>>> > > >>>> During the last few months, while I was getting acquainted with the > > >>>> streaming system, I wrote down some thoughts I had about how things > > >> could > > >>>> be improved. Hopefully, they are in somewhat coherent shape now, so > > >>> please > > >>>> have a look if you are interested in this: > > >>>> > > >>>> > > >>> > > >> > > > https://docs.google.com/document/d/1rSoHyhUhm2IE30o5tkR8GEetjFvMRMNxvsCfoPsW6_4/edit?usp=sharing > > >>>> > > >>>> This mostly covers: > > >>>> - Timestamps assigned at sources > > >>>> - Out-of-order processing of elements in window operators > > >>>> - API design > > >>>> > > >>>> Please let me know what you think. Comment in the document or here > in > > >> the > > >>>> mailing list. > > >>>> > > >>>> I have a PR in the makings that would introduce source timestamps > and > > >>>> watermarks for keeping track of them. I also hacked a > proof-of-concept > > >>> of a > > >>>> windowing system that is able to process out-of-order elements > using a > > >>>> FlatMap operator. (It uses panes to perform efficient > > >> pre-aggregations.) > > >>>> > > >>>> Cheers, > > >>>> Aljoscha > > >>>> > > >>> > > >> > > > > > > > >