I agree with supporting out-of-order out of the box :-), even if this means
a major refactoring. This is the right time to refactor the streaming API
before we pull it out of beta. I think that this is more important than new
features in the streaming API, which can be prioritized once the API is out
of beta (meaning, that IMO this is the right time to stall PRs until we
agree on the design).

There are three sections in the document: windowing, state, and API. How
convoluted are those with each other? Can we separate the discussion or do
we need to discuss those all together? I think part of the difficulty is
that we are discussing three design choices at once.

On Tue, Jun 23, 2015 at 7:43 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Out of order is ubiquitous in the real-world.  Typically, what happens is
> that businesses will declare a maximum allowable delay for delayed
> transactions and will commit to results when that delay is reached.
> Transactions that arrive later than this cutoff are collected specially as
> corrections which are reported/used when possible.
>
> Clearly, ordering can also be violated during processing, but if the data
> is originally out of order the situation can't be repaired by any protocol
> fixes that prevent transactions from becoming disordered but has to handled
> at the data level.
>
>
>
>
> On Tue, Jun 23, 2015 at 9:29 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I also don't like big changes but sometimes they are necessary. The
> reason
> > why I'm so adamant about out-of-order processing is that out-of-order
> > elements are not some exception that occurs once in a while; they occur
> > constantly in a distributed system. For example, in this:
> > https://gist.github.com/aljoscha/a367012646ab98208d27 the resulting
> > windows
> > are completely bogus because the current windowing system assumes
> elements
> > to globally arrive in order, which is simply not true. (The example has a
> > source that generates increasing integers. Then these pass through a map
> > and are unioned with the original DataStream before a window operator.)
> > This simulates elements arriving from different operators at a windowing
> > operator. The example is also DOP=1, I imagine this to get worse with
> > higher DOP.
> >
> > What do you mean by costly? As I said, I have a proof-of-concept
> windowing
> > operator that can handle out-or-order elements. This is an example using
> > the current Flink API:
> > https://gist.github.com/aljoscha/f8dce0691732e344bbe8.
> > (It is an infinite source of tuples and a 5 second window operator that
> > counts the tuples.) The first problem is that this code deadlocks because
> > of the thread that emits fake elements. If I disable the fake element
> code
> > it works, but the throughput using my mockup is 4 times higher . The gap
> > widens dramatically if the window size increases.
> >
> > So, it actually increases performance (unless I'm making a mistake in my
> > explorations) and can handle elements that arrive out-of-order (which
> > happens basically always in a real-world windowing use-cases).
> >
> >
> > On Tue, 23 Jun 2015 at 12:51 Stephan Ewen <se...@apache.org> wrote:
> >
> > > What I like a lot about Aljoscha's proposed design is that we need no
> > > different code for "system time" vs. "event time". It only differs in
> > where
> > > the timestamps are assigned.
> > >
> > > The OOP approach also gives you the semantics of total ordering without
> > > imposing merges on the streams.
> > >
> > > On Tue, Jun 23, 2015 at 10:43 AM, Matthias J. Sax <
> > > mj...@informatik.hu-berlin.de> wrote:
> > >
> > > > I agree that there should be multiple alternatives the user(!) can
> > > > choose from. Partial out-of-order processing works for many/most
> > > > aggregates. However, if you consider Event-Pattern-Matching, global
> > > > ordering in necessary (even if the performance penalty might be
> high).
> > > >
> > > > I would also keep "system-time windows" as an alternative to "source
> > > > assigned ts-windows".
> > > >
> > > > It might also be interesting to consider the following paper for
> > > > overlapping windows: "Resource sharing in continuous sliding-window
> > > > aggregates"
> > > >
> > > > > https://dl.acm.org/citation.cfm?id=1316720
> > > >
> > > > -Matthias
> > > >
> > > > On 06/23/2015 10:37 AM, Gyula Fóra wrote:
> > > > > Hey
> > > > >
> > > > > I think we should not block PRs unnecessarily if your suggested
> > changes
> > > > > might touch them at some point.
> > > > >
> > > > > Also I still think we should not put everything in the Datastream
> > > because
> > > > > it will be a huge mess.
> > > > >
> > > > > Also we need to agree on the out of order processing, whether we
> want
> > > it
> > > > > the way you proposed it(which is quite costly). Another alternative
> > > > > approach there which fits in the current windowing is to filter out
> > if
> > > > > order events and apply a special handling operator on them. This
> > would
> > > be
> > > > > fairly lightweight.
> > > > >
> > > > > My point is that we need to consider some alternative solutions.
> And
> > we
> > > > > should not block contributions along the way.
> > > > >
> > > > > Cheers
> > > > > Gyula
> > > > >
> > > > > On Tue, Jun 23, 2015 at 9:55 AM Aljoscha Krettek <
> > aljos...@apache.org>
> > > > > wrote:
> > > > >
> > > > >> The reason I posted this now is that we need to think about the
> API
> > > and
> > > > >> windowing before proceeding with the PRs of Gabor (inverse reduce)
> > and
> > > > >> Gyula (removal of "aggregate" functions on DataStream).
> > > > >>
> > > > >> For the windowing, I think that the current model does not work
> for
> > > > >> out-of-order processing. Therefore, the whole windowing
> > infrastructure
> > > > will
> > > > >> basically have to be redone. Meaning also that any work on the
> > > > >> pre-aggregators or optimizations that we do now becomes useless.
> > > > >>
> > > > >> For the API, I proposed to restructure the interactions between
> all
> > > the
> > > > >> different *DataStream classes and grouping/windowing. (See API
> > section
> > > > of
> > > > >> the doc I posted.)
> > > > >>
> > > > >> On Mon, 22 Jun 2015 at 21:56 Gyula Fóra <gyula.f...@gmail.com>
> > wrote:
> > > > >>
> > > > >>> Hi Aljoscha,
> > > > >>>
> > > > >>> Thanks for the nice summary, this is a very good initiative.
> > > > >>>
> > > > >>> I added some comments to the respective sections (where I didnt
> > fully
> > > > >> agree
> > > > >>> :).).
> > > > >>> At some point I think it would be good to have a public hangout
> > > session
> > > > >> on
> > > > >>> this, which could make a more dynamic discussion.
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Gyula
> > > > >>>
> > > > >>> Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015.
> > jún.
> > > > >> 22.,
> > > > >>> H, 21:34):
> > > > >>>
> > > > >>>> Hi,
> > > > >>>> with people proposing changes to the streaming part I also
> wanted
> > to
> > > > >>> throw
> > > > >>>> my hat into the ring. :D
> > > > >>>>
> > > > >>>> During the last few months, while I was getting acquainted with
> > the
> > > > >>>> streaming system, I wrote down some thoughts I had about how
> > things
> > > > >> could
> > > > >>>> be improved. Hopefully, they are in somewhat coherent shape now,
> > so
> > > > >>> please
> > > > >>>> have a look if you are interested in this:
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1rSoHyhUhm2IE30o5tkR8GEetjFvMRMNxvsCfoPsW6_4/edit?usp=sharing
> > > > >>>>
> > > > >>>> This mostly covers:
> > > > >>>>  - Timestamps assigned at sources
> > > > >>>>  - Out-of-order processing of elements in window operators
> > > > >>>>  - API design
> > > > >>>>
> > > > >>>> Please let me know what you think. Comment in the document or
> here
> > > in
> > > > >> the
> > > > >>>> mailing list.
> > > > >>>>
> > > > >>>> I have a PR in the makings that would introduce source
> timestamps
> > > and
> > > > >>>> watermarks for keeping track of them. I also hacked a
> > > proof-of-concept
> > > > >>> of a
> > > > >>>> windowing system that is able to process out-of-order elements
> > > using a
> > > > >>>> FlatMap operator. (It uses panes to perform efficient
> > > > >> pre-aggregations.)
> > > > >>>>
> > > > >>>> Cheers,
> > > > >>>> Aljoscha
> > > > >>>>
> > > > >>>
> > > > >>
> > > > >
> > > >
> > > >
> > >
> >
>

Reply via email to