Re: Towards a spec for robust streaming SQL, Part 1

Shaoxuan Wang Fri, 12 May 2017 21:08:43 -0700

Tyler,
Yes, dynamic table changes over time. You can find more details about
dynamic table from this Flink blog (
https://flink.apache.org/news/2017/04/04/dynamic-tables.html). Fabian, me
and Xiaowei posted it a week before the flink-forward@SF.
"A dynamic table is a table that is continuously updated and can be queried
like a regular, static table. However, in contrast to a query on a batch
table which terminates and returns a static table as result, a query on a
dynamic table runs continuously and produces a table that is continuously
updated depending on the modification on the input table. Hence, the
resulting table is a dynamic table as well."


Regards,
Shaoxuan


On Sat, May 13, 2017 at 1:05 AM, Tyler Akidau <taki...@google.com.invalid>
wrote:

> Being able to support an EMIT config independent of the query itself sounds
> great for compatible use cases (which should be many :-).
>
> Shaoxuan, can you please refresh my memory what a dynamic table means in
> Flink? It's basically just a state table, right? The "dynamic" part of the
> name is to simply to imply it can evolve and change over time?
>
> -Tyler
>
>
> On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <wshaox...@gmail.com> wrote:
>
> > Thanks to Tyler and Fabian for sharing your thoughts.
> >
> > Regarding to the early/late update control of FLINK. IMO, each dynamic
> > table can have an EMIT config. For FLINK table-API, this can be easily
> > implemented in different manners, case by case. For instance, in window
> > aggregate, we could define "when EMIT a result" via a windowConf per each
> > window when we create windows. Unfortunately we do not have such
> > flexibility (as compared with TableAPI) in SQL query, we need find a way
> to
> > embed this EMIT config.
> >
> > Regards,
> > Shaoxuan
> >
> >
> > On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fhue...@gmail.com>
> wrote:
> >
> > > 2017-05-11 7:14 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
> > >
> > > > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fhue...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Tyler,
> > > > >
> > > > > thank you very much for this excellent write-up and the super nice
> > > > > visualizations!
> > > > > You are discussing a lot of the things that we have been thinking
> > about
> > > > as
> > > > > well from a different perspective.
> > > > > IMO, yours and the Flink model are pretty much aligned although we
> > use
> > > a
> > > > > different terminology (which is not yet completely established). So
> > > there
> > > > > should be room for unification ;-)
> > > > >
> > > >
> > > > Good to hear, thanks for giving it a look. :-)
> > > >
> > > >
> > > > > Allow me a few words on the current state in Flink. In the upcoming
> > > 1.3.0
> > > > > release, we will have support for group window (TUMBLE, HOP,
> > SESSION),
> > > > OVER
> > > > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > > > aggregations. The group windows are triggered by watermark and the
> > over
> > > > > window and non-windowed aggregations emit for each input record
> > > > > (AtCount(1)). The window aggregations do not support early or late
> > > firing
> > > > > (late records are dropped), so no updates here. However, the
> > > non-windowed
> > > > > aggregates produce updates (in acc and acc/retract mode). Based on
> > this
> > > > we
> > > > > will work on better control for late updates and early firing as
> well
> > > as
> > > > > joins in the next release.
> > > > >
> > > >
> > > > Impressive. At this rate there's a good chance we'll just be doing
> > > catchup
> > > > and thanking you for building everything. ;-) Do you have ideas for
> > what
> > > > you want your early/late updates control to look like? That's one of
> > the
> > > > areas I'd like to see better defined for us. And how deep are you
> going
> > > > with joins?
> > > >
> > >
> > > Right now (well actually I merged the change 1h ago) we are using a
> > > QueryConfig object to specify state retention intervals to be able to
> > clean
> > > up state for inactive keys.
> > > A running a query looks like this:
> > >
> > > // -------
> > > val env = StreamExecutionEnvironment.getExecutionEnvironment
> > > val tEnv = TableEnvironment.getTableEnvironment(env)
> > >
> > > val qConf = tEnv.queryConfig.withIdleStateRetentionTime(
> Time.hours(12))
> > //
> > > state of inactive keys is kept for 12 hours
> > >
> > > val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP
> BY
> > > a")
> > > val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf)
> //
> > > boolean flag for acc/retract
> > >
> > > env.execute()
> > > // -------
> > >
> > > We plan to use the QueryConfig also to specify early/late updates. Our
> > main
> > > motivation is to have a uniform and standard SQL for batch and
> streaming.
> > > Hence, we have to move the configuration out of the query. But I agree,
> > > that it would be very nice to be able to include it in the query. I
> think
> > > it should not be difficult to optionally support an EMIT clause as
> well.
> > >
> > >
> > > >
> > > > > Reading the document, I did not find any major difference in our
> > > > concepts.
> > > > > In fact, we are aiming to support the cases you are describing as
> > well.
> > > > > I have a question though. Would you classify an OVER aggregation
> as a
> > > > > stream -> stream or stream -> table operation? It collects records
> to
> > > > > aggregate them, but emits one record for each input row. Depending
> on
> > > the
> > > > > window definition (i.e., with FOLLOWING CURRENT ROW), you can
> compute
> > > and
> > > > > emit the result record when the input record is received.
> > > > >
> > > >
> > > > I would call it a composite stream → stream operation (since SQL,
> like
> > > the
> > > > standard Beam/Flink APIs, is a higher level set of constructs than
> raw
> > > > streams/tables operations) consisting of a stream → table windowed
> > > grouping
> > > > followed by a table → stream triggering on every element, basically
> as
> > > you
> > > > described in the previous paragraph.
> > > >
> > > >
> > > That makes sense. Thanks :-)
> > >
> > >
> > > > -Tyler
> > > >
> > > >
> > > > >
> > > > > I'm looking forward to the second part.
> > > > >
> > > > > Cheers, Fabian
> > > > >
> > > > >
> > > > >
> > > > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <taki...@google.com.invalid
> >:
> > > > >
> > > > > > Any thoughts here Fabian? I'm planning to start sending out some
> > more
> > > > > > emails towards the end of the week.
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <taki...@google.com
> >
> > > > wrote:
> > > > > >
> > > > > > > No worries, thanks for the heads up. Good luck wrapping all
> that
> > > > stuff
> > > > > > up.
> > > > > > >
> > > > > > > -Tyler
> > > > > > >
> > > > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <
> > fhue...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Tyler,
> > > > > > >>
> > > > > > >> thanks for pushing this effort and including the Flink list.
> > > > > > >> I haven't managed to read the doc yet, but just wanted to
> thank
> > > you
> > > > > for
> > > > > > >> the
> > > > > > >> write-up and let you know that I'm very interested in this
> > > > discussion.
> > > > > > >>
> > > > > > >> We are very close to the feature freeze of Flink 1.3 and I'm
> > quite
> > > > > busy
> > > > > > >> getting as many contributions merged before the release is
> > forked
> > > > off.
> > > > > > >> When that happened, I'll have more time to read and comment.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Fabian
> > > > > > >>
> > > > > > >>
> > > > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau
> > <taki...@google.com.invalid
> > > > >:
> > > > > > >>
> > > > > > >> > Good point, when you start talking about anything less than
> a
> > > full
> > > > > > join,
> > > > > > >> > triggers get involved to describe how one actually achieves
> > the
> > > > > > desired
> > > > > > >> > semantics, and they may end up being tied to just one of the
> > > > inputs
> > > > > > >> (e.g.,
> > > > > > >> > you may only care about the watermark for one side of the
> > join).
> > > > Am
> > > > > > >> > expecting us to address these sorts of details more
> precisely
> > in
> > > > doc
> > > > > > #2.
> > > > > > >> >
> > > > > > >> > -Tyler
> > > > > > >> >
> > > > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > > > <k...@google.com.invalid
> > > > > > >> >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > There's something to be said about having different
> > triggering
> > > > > > >> depending
> > > > > > >> > on
> > > > > > >> > > which side of a join data comes from, perhaps?
> > > > > > >> > >
> > > > > > >> > > (delightful doc, as usual)
> > > > > > >> > >
> > > > > > >> > > Kenn
> > > > > > >> > >
> > > > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > > > >> <taki...@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Thanks for reading, Luke. The simple answer is that
> CoGBK
> > is
> > > > > > >> basically
> > > > > > >> > > > flatten + GBK. Flatten is a non-grouping operation that
> > > merges
> > > > > the
> > > > > > >> > input
> > > > > > >> > > > streams into a single output stream. GBK then groups the
> > > data
> > > > > > within
> > > > > > >> > that
> > > > > > >> > > > single union stream as you might otherwise expect,
> > yielding
> > > a
> > > > > > single
> > > > > > >> > > table.
> > > > > > >> > > > So I think it doesn't really impact things much.
> Grouping,
> > > > > > >> aggregation,
> > > > > > >> > > > window merging etc all just act upon the merged stream
> and
> > > > > > generate
> > > > > > >> > what
> > > > > > >> > > is
> > > > > > >> > > > effectively a merged table.
> > > > > > >> > > >
> > > > > > >> > > > -Tyler
> > > > > > >> > > >
> > > > > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > > > > >> <lc...@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > The doc is a good read.
> > > > > > >> > > > >
> > > > > > >> > > > > I think you do a great job of explaining table ->
> > stream,
> > > > > stream
> > > > > > >> ->
> > > > > > >> > > > stream,
> > > > > > >> > > > > and stream -> table when there is only one stream.
> > > > > > >> > > > > But when there are multiple streams reading/writing
> to a
> > > > > table,
> > > > > > >> how
> > > > > > >> > > does
> > > > > > >> > > > > that impact what occurs?
> > > > > > >> > > > > For example, with CoGBK you have multiple streams
> > writing
> > > > to a
> > > > > > >> table,
> > > > > > >> > > how
> > > > > > >> > > > > does that impact window merging?
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > > > >> > > <taki...@google.com.invalid
> > > > > > >> > > > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > > >> > > > > >
> > > > > > >> > > > > > Apologies for the big cross post, but I thought this
> > > might
> > > > > be
> > > > > > >> > > something
> > > > > > >> > > > > all
> > > > > > >> > > > > > three communities would find relevant.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Beam is finally making progress on a SQL DSL
> utilizing
> > > > > > Calcite,
> > > > > > >> > > thanks
> > > > > > >> > > > to
> > > > > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to
> > some
> > > > > > >> conclusion
> > > > > > >> > > > about
> > > > > > >> > > > > > how to elegantly support the full suite of streaming
> > > > > > >> functionality
> > > > > > >> > in
> > > > > > >> > > > the
> > > > > > >> > > > > > Beam model in via Calcite SQL. You folks in the
> Flink
> > > > > > community
> > > > > > >> > have
> > > > > > >> > > > been
> > > > > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> > > > amongst
> > > > > > >> others,
> > > > > > >> > > > thank
> > > > > > >> > > > > > you! :-), but from my understanding we still don't
> > have
> > > a
> > > > > full
> > > > > > >> spec
> > > > > > >> > > for
> > > > > > >> > > > > how
> > > > > > >> > > > > > to support robust streaming in SQL (including but
> not
> > > > > limited
> > > > > > >> to,
> > > > > > >> > > > e.g., a
> > > > > > >> > > > > > triggers analogue such as EMIT).
> > > > > > >> > > > > >
> > > > > > >> > > > > > I've been spending a lot of time thinking about this
> > and
> > > > > have
> > > > > > >> some
> > > > > > >> > > > > opinions
> > > > > > >> > > > > > about how I think it should look that I've already
> > > written
> > > > > > down,
> > > > > > >> > so I
> > > > > > >> > > > > > volunteered to try to drive forward agreement on a
> > > general
> > > > > > >> > streaming
> > > > > > >> > > > SQL
> > > > > > >> > > > > > spec between our three communities (well,
> technically
> > I
> > > > > > >> volunteered
> > > > > > >> > > to
> > > > > > >> > > > do
> > > > > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink
> > folks
> > > > > might
> > > > > > >> want
> > > > > > >> > to
> > > > > > >> > > > > join
> > > > > > >> > > > > > in since you're going that direction already anyway
> > and
> > > > will
> > > > > > >> have
> > > > > > >> > > > useful
> > > > > > >> > > > > > insights :-).
> > > > > > >> > > > > >
> > > > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > > > >> > > > > >
> > > > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one
> is
> > > for
> > > > > > >> context,
> > > > > > >> > > and
> > > > > > >> > > > > >    really only mentions SQL in passing. But it
> > describes
> > > > the
> > > > > > >> > > > relationship
> > > > > > >> > > > > >    between the Beam Model and the "streams & tables"
> > way
> > > > of
> > > > > > >> > thinking,
> > > > > > >> > > > > which
> > > > > > >> > > > > >    turns out to be useful in understanding what
> robust
> > > > > > >> streaming in
> > > > > > >> > > SQL
> > > > > > >> > > > > > might
> > > > > > >> > > > > >    look like. Many of you probably already know some
> > or
> > > > all
> > > > > of
> > > > > > >> > what's
> > > > > > >> > > > in
> > > > > > >> > > > > > here,
> > > > > > >> > > > > >    but I felt it was necessary to have it all
> written
> > > down
> > > > > in
> > > > > > >> order
> > > > > > >> > > to
> > > > > > >> > > > > > justify
> > > > > > >> > > > > >    some of the proposals I wanted to make in the
> > second
> > > > doc.
> > > > > > >> > > > > >
> > > > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal
> for
> > > this
> > > > > doc
> > > > > > >> is
> > > > > > >> > > that
> > > > > > >> > > > it
> > > > > > >> > > > > >    would become a general specification for what
> > robust
> > > > > > >> streaming
> > > > > > >> > SQL
> > > > > > >> > > > in
> > > > > > >> > > > > >    Calcite should look like. It would start out as a
> > > basic
> > > > > > >> proposal
> > > > > > >> > > of
> > > > > > >> > > > > what
> > > > > > >> > > > > >    things *could* look like (combining both what
> > things
> > > > look
> > > > > > >> like
> > > > > > >> > now
> > > > > > >> > > > as
> > > > > > >> > > > > > well
> > > > > > >> > > > > >    as a set of proposed changes for the future), and
> > we
> > > > > could
> > > > > > >> all
> > > > > > >> > > > iterate
> > > > > > >> > > > > > on
> > > > > > >> > > > > >    it together until we get to something we're happy
> > > with.
> > > > > > >> > > > > >
> > > > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit
> of
> > a
> > > > > > monster,
> > > > > > >> > so I
> > > > > > >> > > > > > figured I'd share it and let folks hack at it with
> > > > comments
> > > > > if
> > > > > > >> they
> > > > > > >> > > > have
> > > > > > >> > > > > > any, while I try to get the second doc ready in the
> > > > > meantime.
> > > > > > As
> > > > > > >> > part
> > > > > > >> > > > of
> > > > > > >> > > > > > getting doc #2 ready, I'll be starting a separate
> > thread
> > > > to
> > > > > > try
> > > > > > >> to
> > > > > > >> > > > gather
> > > > > > >> > > > > > input on what things are already in flight for
> > streaming
> > > > SQL
> > > > > > >> across
> > > > > > >> > > the
> > > > > > >> > > > > > various communities, to make sure the proposal
> > captures
> > > > > > >> everything
> > > > > > >> > > > that's
> > > > > > >> > > > > > going on as accurately as it can.
> > > > > > >> > > > > >
> > > > > > >> > > > > > If you have any questions or comments, I'm
> interested
> > to
> > > > > hear
> > > > > > >> them.
> > > > > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams
> &
> > > > > Tables":
> > > > > > >> > > > > >
> > > > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > > >> > > > > >
> > > > > > >> > > > > > -Tyler
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Reply via email to