I agree, let's separate these topics from each other so we can get a faster resolution.
There is already a state discussion in the thread we started with Paris.

On Wed, Jun 24, 2015 at 9:24 AM Kostas Tzoumas <ktzou...@apache.org> wrote:

> I agree with supporting out-of-order out of the box :-), even if this means
> a major refactoring. This is the right time to refactor the streaming API
> before we pull it out of beta. I think that this is more important than new
> features in the streaming API, which can be prioritized once the API is out
> of beta (meaning that, IMO, this is the right time to stall PRs until we
> agree on the design).
>
> There are three sections in the document: windowing, state, and API. How
> convoluted are those with each other? Can we separate the discussion or do
> we need to discuss those all together? I think part of the difficulty is
> that we are discussing three design choices at once.
>
> On Tue, Jun 23, 2015 at 7:43 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Out of order is ubiquitous in the real world. Typically, what happens is
> > that businesses will declare a maximum allowable delay for delayed
> > transactions and will commit to results when that delay is reached.
> > Transactions that arrive later than this cutoff are collected specially
> > as corrections which are reported/used when possible.
> >
> > Clearly, ordering can also be violated during processing, but if the data
> > is originally out of order, the situation can't be repaired by any
> > protocol fixes that prevent transactions from becoming disordered but has
> > to be handled at the data level.
> >
> > On Tue, Jun 23, 2015 at 9:29 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> > > I also don't like big changes but sometimes they are necessary. The
> > > reason why I'm so adamant about out-of-order processing is that
> > > out-of-order elements are not some exception that occurs once in a
> > > while; they occur constantly in a distributed system. For example, in
> > > this:
> > > https://gist.github.com/aljoscha/a367012646ab98208d27
> > > the resulting windows are completely bogus because the current
> > > windowing system assumes elements to globally arrive in order, which is
> > > simply not true. (The example has a source that generates increasing
> > > integers. Then these pass through a map and are unioned with the
> > > original DataStream before a window operator.) This simulates elements
> > > arriving from different operators at a windowing operator. The example
> > > is also DOP=1, I imagine this to get worse with higher DOP.
> > >
> > > What do you mean by costly? As I said, I have a proof-of-concept
> > > windowing operator that can handle out-of-order elements. This is an
> > > example using the current Flink API:
> > > https://gist.github.com/aljoscha/f8dce0691732e344bbe8
> > > (It is an infinite source of tuples and a 5 second window operator that
> > > counts the tuples.) The first problem is that this code deadlocks
> > > because of the thread that emits fake elements. If I disable the fake
> > > element code it works, but the throughput using my mockup is 4 times
> > > higher. The gap widens dramatically if the window size increases.
> > >
> > > So, it actually increases performance (unless I'm making a mistake in
> > > my explorations) and can handle elements that arrive out-of-order
> > > (which happens basically always in real-world windowing use cases).
> > >
> > > On Tue, 23 Jun 2015 at 12:51 Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > What I like a lot about Aljoscha's proposed design is that we need no
> > > > different code for "system time" vs. "event time". It only differs in
> > > > where the timestamps are assigned.
> > > >
> > > > The OOP approach also gives you the semantics of total ordering
> > > > without imposing merges on the streams.
> > > >
> > > > On Tue, Jun 23, 2015 at 10:43 AM, Matthias J. Sax <mj...@informatik.hu-berlin.de> wrote:
> > > >
> > > > > I agree that there should be multiple alternatives the user(!) can
> > > > > choose from. Partial out-of-order processing works for many/most
> > > > > aggregates. However, if you consider Event-Pattern-Matching, global
> > > > > ordering is necessary (even if the performance penalty might be
> > > > > high).
> > > > >
> > > > > I would also keep "system-time windows" as an alternative to
> > > > > "source assigned ts-windows".
> > > > >
> > > > > It might also be interesting to consider the following paper for
> > > > > overlapping windows: "Resource sharing in continuous sliding-window
> > > > > aggregates"
> > > > >
> > > > > https://dl.acm.org/citation.cfm?id=1316720
> > > > >
> > > > > -Matthias
> > > > >
> > > > > On 06/23/2015 10:37 AM, Gyula Fóra wrote:
> > > > > > Hey
> > > > > >
> > > > > > I think we should not block PRs unnecessarily if your suggested
> > > > > > changes might touch them at some point.
> > > > > >
> > > > > > Also I still think we should not put everything in the DataStream
> > > > > > because it will be a huge mess.
> > > > > >
> > > > > > Also we need to agree on the out-of-order processing, whether we
> > > > > > want it the way you proposed it (which is quite costly). Another
> > > > > > alternative approach there which fits in the current windowing is
> > > > > > to filter out out-of-order events and apply a special handling
> > > > > > operator on them. This would be fairly lightweight.
> > > > > >
> > > > > > My point is that we need to consider some alternative solutions.
> > > > > > And we should not block contributions along the way.
> > > > > >
> > > > > > Cheers
> > > > > > Gyula
> > > > > >
> > > > > > On Tue, Jun 23, 2015 at 9:55 AM Aljoscha Krettek <aljos...@apache.org> wrote:
> > > > > >
> > > > > >> The reason I posted this now is that we need to think about the
> > > > > >> API and windowing before proceeding with the PRs of Gabor
> > > > > >> (inverse reduce) and Gyula (removal of "aggregate" functions on
> > > > > >> DataStream).
> > > > > >>
> > > > > >> For the windowing, I think that the current model does not work
> > > > > >> for out-of-order processing. Therefore, the whole windowing
> > > > > >> infrastructure will basically have to be redone. Meaning also
> > > > > >> that any work on the pre-aggregators or optimizations that we do
> > > > > >> now becomes useless.
> > > > > >>
> > > > > >> For the API, I proposed to restructure the interactions between
> > > > > >> all the different *DataStream classes and grouping/windowing.
> > > > > >> (See API section of the doc I posted.)
> > > > > >>
> > > > > >> On Mon, 22 Jun 2015 at 21:56 Gyula Fóra <gyula.f...@gmail.com> wrote:
> > > > > >>
> > > > > >>> Hi Aljoscha,
> > > > > >>>
> > > > > >>> Thanks for the nice summary, this is a very good initiative.
> > > > > >>>
> > > > > >>> I added some comments to the respective sections (where I
> > > > > >>> didn't fully agree :).).
> > > > > >>> At some point I think it would be good to have a public hangout
> > > > > >>> session on this, which could make a more dynamic discussion.
> > > > > >>>
> > > > > >>> Cheers,
> > > > > >>> Gyula
> > > > > >>>
> > > > > >>> Aljoscha Krettek <aljos...@apache.org> wrote (on Mon, Jun 22, 2015, 21:34):
> > > > > >>>
> > > > > >>>> Hi,
> > > > > >>>> with people proposing changes to the streaming part I also
> > > > > >>>> wanted to throw my hat into the ring. :D
> > > > > >>>>
> > > > > >>>> During the last few months, while I was getting acquainted
> > > > > >>>> with the streaming system, I wrote down some thoughts I had
> > > > > >>>> about how things could be improved. Hopefully, they are in
> > > > > >>>> somewhat coherent shape now, so please have a look if you are
> > > > > >>>> interested in this:
> > > > > >>>>
> > > > > >>>> https://docs.google.com/document/d/1rSoHyhUhm2IE30o5tkR8GEetjFvMRMNxvsCfoPsW6_4/edit?usp=sharing
> > > > > >>>>
> > > > > >>>> This mostly covers:
> > > > > >>>> - Timestamps assigned at sources
> > > > > >>>> - Out-of-order processing of elements in window operators
> > > > > >>>> - API design
> > > > > >>>>
> > > > > >>>> Please let me know what you think. Comment in the document or
> > > > > >>>> here in the mailing list.
> > > > > >>>>
> > > > > >>>> I have a PR in the making that would introduce source
> > > > > >>>> timestamps and watermarks for keeping track of them. I also
> > > > > >>>> hacked a proof-of-concept of a windowing system that is able
> > > > > >>>> to process out-of-order elements using a FlatMap operator. (It
> > > > > >>>> uses panes to perform efficient pre-aggregations.)
> > > > > >>>>
> > > > > >>>> Cheers,
> > > > > >>>> Aljoscha
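
To make the out-of-order handling discussed in this thread concrete (a maximum allowable delay, results committed once that delay has passed, and later arrivals collected as corrections), here is a minimal Python sketch. It is not Flink code and not the proof-of-concept mentioned above; the class name, the method names, and the watermark rule (highest timestamp seen minus the allowed delay) are assumptions chosen purely for illustration.

```python
from collections import defaultdict


class OutOfOrderWindowCounter:
    """Counts elements per tumbling event-time window, tolerating bounded disorder.

    Hypothetical illustration only; none of these names come from Flink.
    """

    def __init__(self, window_size, max_delay):
        self.window_size = window_size    # window length, in timestamp units
        self.max_delay = max_delay        # maximum allowable delay before committing
        self.counts = defaultdict(int)    # window start -> running count (pre-aggregate)
        self.watermark = float("-inf")    # elements older than this count as late
        self.corrections = []             # late timestamps, kept for separate handling

    def process(self, timestamp):
        """Add one element and return the (window_start, count) results committed by it."""
        window_start = timestamp - timestamp % self.window_size
        if window_start + self.window_size <= self.watermark:
            # This window was already committed, so record a correction instead.
            self.corrections.append(timestamp)
            return []
        self.counts[window_start] += 1
        # Assumed watermark rule: highest timestamp seen minus the allowed delay.
        self.watermark = max(self.watermark, timestamp - self.max_delay)
        return self._commit()

    def _commit(self):
        """Emit every window whose end is at or before the watermark, oldest first."""
        ready = sorted(s for s in self.counts if s + self.window_size <= self.watermark)
        return [(s, self.counts.pop(s)) for s in ready]


counter = OutOfOrderWindowCounter(window_size=5, max_delay=2)
results = []
for ts in [1, 3, 2, 6, 4, 8, 12, 0]:   # timestamps arrive out of order
    results.extend(counter.process(ts))

print(results)               # [(0, 4), (5, 2)]
print(counter.corrections)   # [0]
```

In this run the elements 2 and 4 arrive after larger timestamps but still within the allowed delay, so they are counted in window [0, 5). The final element 0 arrives after that window was committed and ends up in the corrections list. The same code path serves "system time" and "event time"; only the place where the timestamp is assigned would differ.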