On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fhue...@gmail.com> wrote:
> Hi Tyler,
>
> thank you very much for this excellent write-up and the super nice visualizations!
> You are discussing a lot of the things that we have been thinking about as well from a different perspective.
> IMO, yours and the Flink model are pretty much aligned although we use a different terminology (which is not yet completely established). So there should be room for unification ;-)

Good to hear, thanks for giving it a look. :-)

> Allow me a few words on the current state in Flink. In the upcoming 1.3.0 release, we will have support for group windows (TUMBLE, HOP, SESSION), OVER RANGE/ROW windows (without FOLLOWING), and non-windowed GROUP BY aggregations. The group windows are triggered by watermark, and the OVER windows and non-windowed aggregations emit for each input record (AtCount(1)). The window aggregations do not support early or late firing (late records are dropped), so no updates here. However, the non-windowed aggregates produce updates (in acc and acc/retract mode). Based on this we will work on better control for late updates and early firing, as well as joins, in the next release.

Impressive. At this rate there's a good chance we'll just be doing catchup and thanking you for building everything. ;-) Do you have ideas for what you want your early/late updates control to look like? That's one of the areas I'd like to see better defined for us. And how deep are you going with joins?

> Reading the document, I did not find any major difference in our concepts. In fact, we are aiming to support the cases you are describing as well.
> I have a question though. Would you classify an OVER aggregation as a stream -> stream or stream -> table operation? It collects records to aggregate them, but emits one record for each input row. Depending on the window definition (i.e., a window ending at CURRENT ROW, with no FOLLOWING), you can compute and emit the result record when the input record is received.

I would call it a composite stream → stream operation (since SQL, like the standard Beam/Flink APIs, is a higher-level set of constructs than raw streams/tables operations), consisting of a stream → table windowed grouping followed by a table → stream triggering on every element, basically as you described in the previous paragraph.
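To make that concrete, here is a rough sketch of the two shapes in approximately Calcite/Flink-1.3-flavored syntax (illustrative only; the Orders table and its user_id, amount, and rowtime columns are made up, and the exact dialect details may differ):

  -- Group window: stream -> table grouping, with one result row per key and
  -- window, emitted once the watermark passes the end of the window.
  SELECT user_id,
         TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS window_end,
         SUM(amount) AS total
  FROM Orders
  GROUP BY user_id, TUMBLE(rowtime, INTERVAL '1' HOUR);

  -- OVER window: the same kind of per-key accumulation (stream -> table),
  -- but with an implicit per-element trigger (table -> stream), so one
  -- output row is emitted for every input row.
  SELECT user_id,
         amount,
         SUM(amount) OVER (
           PARTITION BY user_id
           ORDER BY rowtime
           RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW) AS running_total
  FROM Orders;

The second query is the composite I mean: the window frame does the stream → table accumulation, and the per-row emission is effectively an AtCount(1)-style table → stream trigger.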
-Tyler

> I'm looking forward to the second part.
>
> Cheers, Fabian
>
> 2017-05-09 0:34 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
>
> > Any thoughts here Fabian? I'm planning to start sending out some more emails towards the end of the week.
> >
> > -Tyler
> >
> > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <taki...@google.com> wrote:
> >
> > > No worries, thanks for the heads up. Good luck wrapping all that stuff up.
> > >
> > > -Tyler
> > >
> > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fhue...@gmail.com> wrote:
> > >
> > > > Hi Tyler,
> > > >
> > > > thanks for pushing this effort and including the Flink list.
> > > > I haven't managed to read the doc yet, but just wanted to thank you for the write-up and let you know that I'm very interested in this discussion.
> > > >
> > > > We are very close to the feature freeze of Flink 1.3 and I'm quite busy getting as many contributions merged before the release is forked off. Once that has happened, I'll have more time to read and comment.
> > > >
> > > > Thanks,
> > > > Fabian
> > > >
> > > > 2017-04-22 0:16 GMT+02:00 Tyler Akidau <taki...@google.com.invalid>:
> > > >
> > > > > Good point, when you start talking about anything less than a full join, triggers get involved to describe how one actually achieves the desired semantics, and they may end up being tied to just one of the inputs (e.g., you may only care about the watermark for one side of the join). Am expecting us to address these sorts of details more precisely in doc #2.
> > > > >
> > > > > -Tyler
> > > > >
> > > > > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <k...@google.com.invalid> wrote:
> > > > >
> > > > > > There's something to be said about having different triggering depending on which side of a join data comes from, perhaps?
> > > > > >
> > > > > > (delightful doc, as usual)
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau <taki...@google.com.invalid> wrote:
> > > > > >
> > > > > > > Thanks for reading, Luke. The simple answer is that CoGBK is basically flatten + GBK. Flatten is a non-grouping operation that merges the input streams into a single output stream. GBK then groups the data within that single union stream as you might otherwise expect, yielding a single table. So I think it doesn't really impact things much. Grouping, aggregation, window merging, etc. all just act upon the merged stream and generate what is effectively a merged table.
> > > > > > >
> > > > > > > -Tyler
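(In SQL terms, that flatten + GBK decomposition is roughly a UNION ALL followed by a GROUP BY. The sketch below is illustrative only: StreamA, StreamB, k, and v are hypothetical names, and COLLECT simply stands in for "gather the values per key":

  -- CoGBK over two keyed inputs ~ flatten (UNION ALL) + GBK (GROUP BY):
  SELECT k, COLLECT(v) AS grouped_values
  FROM (
    SELECT k, v FROM StreamA
    UNION ALL
    SELECT k, v FROM StreamB
  ) AS Unioned
  GROUP BY k;

Grouping, window merging, etc. then operate on the unioned stream exactly as they would for a single-input GBK, yielding the single merged table described above.)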
> > > > > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > > > > >
> > > > > > > > The doc is a good read.
> > > > > > > >
> > > > > > > > I think you do a great job of explaining table -> stream, stream -> stream, and stream -> table when there is only one stream.
> > > > > > > > But when there are multiple streams reading/writing to a table, how does that impact what occurs?
> > > > > > > > For example, with CoGBK you have multiple streams writing to a table, how does that impact window merging?
> > > > > > > >
> > > > > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau <taki...@google.com.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > > > > >
> > > > > > > > > Apologies for the big cross post, but I thought this might be something all three communities would find relevant.
> > > > > > > > >
> > > > > > > > > Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to Mingmin Xu. As you can imagine, we need to come to some conclusion about how to elegantly support the full suite of streaming functionality in the Beam model via Calcite SQL. You folks in the Flink community have been pushing on this (e.g., adding windowing constructs, amongst others, thank you! :-), but from my understanding we still don't have a full spec for how to support robust streaming in SQL (including but not limited to, e.g., a triggers analogue such as EMIT).
> > > > > > > > >
> > > > > > > > > I've been spending a lot of time thinking about this and have some opinions about how I think it should look that I've already written down, so I volunteered to try to drive forward agreement on a general streaming SQL spec between our three communities (well, technically I volunteered to do that w/ Beam and Calcite, but I figured you Flink folks might want to join in since you're going that direction already anyway and will have useful insights :-).
> > > > > > > > >
> > > > > > > > > My plan was to do this by sharing two docs:
> > > > > > > > >
> > > > > > > > > 1. The Beam Model : Streams & Tables - This one is for context, and really only mentions SQL in passing. But it describes the relationship between the Beam Model and the "streams & tables" way of thinking, which turns out to be useful in understanding what robust streaming in SQL might look like. Many of you probably already know some or all of what's in here, but I felt it was necessary to have it all written down in order to justify some of the proposals I wanted to make in the second doc.
> > > > > > > > >
> > > > > > > > > 2. A streaming SQL spec for Calcite - The goal for this doc is that it would become a general specification for what robust streaming SQL in Calcite should look like. It would start out as a basic proposal of what things *could* look like (combining both what things look like now as well as a set of proposed changes for the future), and we could all iterate on it together until we get to something we're happy with.
> > > > > > > > >
> > > > > > > > > At this point, I have doc #1 ready, and it's a bit of a monster, so I figured I'd share it and let folks hack at it with comments if they have any, while I try to get the second doc ready in the meantime. As part of getting doc #2 ready, I'll be starting a separate thread to try to gather input on what things are already in flight for streaming SQL across the various communities, to make sure the proposal captures everything that's going on as accurately as it can.
> > > > > > > > >
> > > > > > > > > If you have any questions or comments, I'm interested to hear them.
> > > > > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> > > > > > > > >
> > > > > > > > > http://s.apache.org/beam-streams-tables
> > > > > > > > >
> > > > > > > > > -Tyler