Re: Towards a spec for robust streaming SQL, Part 1

Fabian Hueske Tue, 09 May 2017 15:06:45 -0700

Hi Tyler,

thank you very much for this excellent write-up and the super nice
visualizations!
You are discussing a lot of the things that we have been thinking about as
well from a different perspective.
IMO, yours and the Flink model are pretty much aligned although we use a
different terminology (which is not yet completely established). So there
should be room for unification ;-)


Allow me a few words on the current state in Flink. In the upcoming 1.3.0
release, we will have support for group window (TUMBLE, HOP, SESSION), OVER
RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
aggregations. The group windows are triggered by watermark and the over
window and non-windowed aggregations emit for each input record
(AtCount(1)). The window aggregations do not support early or late firing
(late records are dropped), so no updates here. However, the non-windowed
aggregates produce updates (in acc and acc/retract mode). Based on this we
will work on better control for late updates and early firing as well as
joins in the next release.

Reading the document, I did not find any major difference in our concepts.
In fact, we are aiming to support the cases you are describing as well.
I have a question though. Would you classify an OVER aggregation as a
stream -> stream or stream -> table operation? It collects records to
aggregate them, but emits one record for each input row. Depending on the
window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
emit the result record when the input record is received.

I'm looking forward to the second part.

Cheers, Fabian



2017-05-09 0:34 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:

> Any thoughts here Fabian? I'm planning to start sending out some more
> emails towards the end of the week.
>
> -Tyler
>
>
> On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <taki...@google.com> wrote:
>
> > No worries, thanks for the heads up. Good luck wrapping all that stuff
> up.
> >
> > -Tyler
> >
> > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fhue...@gmail.com>
> wrote:
> >
> >> Hi Tyler,
> >>
> >> thanks for pushing this effort and including the Flink list.
> >> I haven't managed to read the doc yet, but just wanted to thank you for
> >> the
> >> write-up and let you know that I'm very interested in this discussion.
> >>
> >> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
> >> getting as many contributions merged before the release is forked off.
> >> When that happened, I'll have more time to read and comment.
> >>
> >> Thanks,
> >> Fabian
> >>
> >>
> >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
> >>
> >> > Good point, when you start talking about anything less than a full
> join,
> >> > triggers get involved to describe how one actually achieves the
> desired
> >> > semantics, and they may end up being tied to just one of the inputs
> >> (e.g.,
> >> > you may only care about the watermark for one side of the join). Am
> >> > expecting us to address these sorts of details more precisely in doc
> #2.
> >> >
> >> > -Tyler
> >> >
> >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> <k...@google.com.invalid
> >> >
> >> > wrote:
> >> >
> >> > > There's something to be said about having different triggering
> >> depending
> >> > on
> >> > > which side of a join data comes from, perhaps?
> >> > >
> >> > > (delightful doc, as usual)
> >> > >
> >> > > Kenn
> >> > >
> >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> >> <taki...@google.com.invalid
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> >> basically
> >> > > > flatten + GBK. Flatten is a non-grouping operation that merges the
> >> > input
> >> > > > streams into a single output stream. GBK then groups the data
> within
> >> > that
> >> > > > single union stream as you might otherwise expect, yielding a
> single
> >> > > table.
> >> > > > So I think it doesn't really impact things much. Grouping,
> >> aggregation,
> >> > > > window merging etc all just act upon the merged stream and
> generate
> >> > what
> >> > > is
> >> > > > effectively a merged table.
> >> > > >
> >> > > > -Tyler
> >> > > >
> >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> >> <lc...@google.com.invalid
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > The doc is a good read.
> >> > > > >
> >> > > > > I think you do a great job of explaining table -> stream, stream
> >> ->
> >> > > > stream,
> >> > > > > and stream -> table when there is only one stream.
> >> > > > > But when there are multiple streams reading/writing to a table,
> >> how
> >> > > does
> >> > > > > that impact what occurs?
> >> > > > > For example, with CoGBK you have multiple streams writing to a
> >> table,
> >> > > how
> >> > > > > does that impact window merging?
> >> > > > >
> >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> >> > > <taki...@google.com.invalid
> >> > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> >> > > > > >
> >> > > > > > Apologies for the big cross post, but I thought this might be
> >> > > something
> >> > > > > all
> >> > > > > > three communities would find relevant.
> >> > > > > >
> >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> Calcite,
> >> > > thanks
> >> > > > to
> >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> >> conclusion
> >> > > > about
> >> > > > > > how to elegantly support the full suite of streaming
> >> functionality
> >> > in
> >> > > > the
> >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> community
> >> > have
> >> > > > been
> >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> >> others,
> >> > > > thank
> >> > > > > > you! :-), but from my understanding we still don't have a full
> >> spec
> >> > > for
> >> > > > > how
> >> > > > > > to support robust streaming in SQL (including but not limited
> >> to,
> >> > > > e.g., a
> >> > > > > > triggers analogue such as EMIT).
> >> > > > > >
> >> > > > > > I've been spending a lot of time thinking about this and have
> >> some
> >> > > > > opinions
> >> > > > > > about how I think it should look that I've already written
> down,
> >> > so I
> >> > > > > > volunteered to try to drive forward agreement on a general
> >> > streaming
> >> > > > SQL
> >> > > > > > spec between our three communities (well, technically I
> >> volunteered
> >> > > to
> >> > > > do
> >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks might
> >> want
> >> > to
> >> > > > > join
> >> > > > > > in since you're going that direction already anyway and will
> >> have
> >> > > > useful
> >> > > > > > insights :-).
> >> > > > > >
> >> > > > > > My plan was to do this by sharing two docs:
> >> > > > > >
> >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> >> context,
> >> > > and
> >> > > > > >    really only mentions SQL in passing. But it describes the
> >> > > > relationship
> >> > > > > >    between the Beam Model and the "streams & tables" way of
> >> > thinking,
> >> > > > > which
> >> > > > > >    turns out to be useful in understanding what robust
> >> streaming in
> >> > > SQL
> >> > > > > > might
> >> > > > > >    look like. Many of you probably already know some or all of
> >> > what's
> >> > > > in
> >> > > > > > here,
> >> > > > > >    but I felt it was necessary to have it all written down in
> >> order
> >> > > to
> >> > > > > > justify
> >> > > > > >    some of the proposals I wanted to make in the second doc.
> >> > > > > >
> >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc
> >> is
> >> > > that
> >> > > > it
> >> > > > > >    would become a general specification for what robust
> >> streaming
> >> > SQL
> >> > > > in
> >> > > > > >    Calcite should look like. It would start out as a basic
> >> proposal
> >> > > of
> >> > > > > what
> >> > > > > >    things *could* look like (combining both what things look
> >> like
> >> > now
> >> > > > as
> >> > > > > > well
> >> > > > > >    as a set of proposed changes for the future), and we could
> >> all
> >> > > > iterate
> >> > > > > > on
> >> > > > > >    it together until we get to something we're happy with.
> >> > > > > >
> >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> monster,
> >> > so I
> >> > > > > > figured I'd share it and let folks hack at it with comments if
> >> they
> >> > > > have
> >> > > > > > any, while I try to get the second doc ready in the meantime.
> As
> >> > part
> >> > > > of
> >> > > > > > getting doc #2 ready, I'll be starting a separate thread to
> try
> >> to
> >> > > > gather
> >> > > > > > input on what things are already in flight for streaming SQL
> >> across
> >> > > the
> >> > > > > > various communities, to make sure the proposal captures
> >> everything
> >> > > > that's
> >> > > > > > going on as accurately as it can.
> >> > > > > >
> >> > > > > > If you have any questions or comments, I'm interested to hear
> >> them.
> >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> >> > > > > >
> >> > > > > >   http://s.apache.org/beam-streams-tables
> >> > > > > >
> >> > > > > > -Tyler
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Reply via email to