It is not a bit different than the batch API, because streaming semantics are a bit different ;-)
One good thing is that we can make things better that were sub-optimal in the Batch API. On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen <se...@apache.org> wrote: > keyBy() does not do any grouping. Grouping in streams in not defined > without windows. > > On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra <gyula.f...@gmail.com> wrote: > >> If we only want to have either keyBy or groupBy, why not keep groupBy? >> That >> would be more consistent with the batch api. >> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <se...@apache.org> wrote: >> >> > Concerning your comments: >> > >> > 1) In the new design, there is no grouping without windowing. The >> > KeyedDataStream subsumes the grouping and key-ing for partitioned state. >> > >> > The keyBy() + window() makes a parallel grouped window >> > keyBy() alone allows access to partitioned state. >> > >> > My thought was that this is simpler, because it needs not groupBy() >> and >> > keyBy(), but one construct to handle both cases. >> > >> > 2) The discretization is a rough thought and is nothing for the short >> term. >> > It totally needs more thoughts. I put it there to have it as a sketch >> for >> > how to evolve this. >> > >> > The idea is of course to not have a single data set, but a series of >> > data set. In each discrete time slice, the data set can be treated like >> a >> > regular data set. >> > >> > Let's kick off a separate design for the discretization. Joins are >> good >> > to talk about (data sets can be joined with data set), and I am sure >> there >> > are more questions coming up. >> > >> > >> > Does that make sense? >> > >> > >> > >> > >> > >> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra <gyula.f...@gmail.com> >> wrote: >> > >> > > I think Marton has some good points here. >> > > >> > > 1) Is KeyedDataStream a better name if this is only a renaming? >> > > >> > > 2) the discretize semantics is unclear indeed. Are we operating on a >> > single >> > > or sequence of datasets? If the latter why not call it something else >> > > (dstream). How are joins and other binary operators defined for >> different >> > > discretizations etc. >> > > On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi <mbala...@apache.org> >> > > wrote: >> > > >> > > > Generally I agree with the new design. Two concerns: >> > > > >> > > > 1) Does KeyedDataStream replace GroupedDataStream or is it the >> latter a >> > > > special case of the former? >> > > > >> > > > The KeyedDataStream as described in the design document is a bit >> > unclear >> > > > for me. It lists the following usages: >> > > > a) It is the first step in building a window stream, on top of >> which >> > > the >> > > > grouped/windowed aggregation and reduce-style function can be >> applied >> > > > b) It allows to use the "by-key" state of functions. Here, every >> > record >> > > > has access to a state that is scoped by its key. Key-scoped state >> can >> > be >> > > > automatically redistributed and repartitioned. >> > > > >> > > > The code snippet describes a use case where the computation and the >> > > access >> > > > of the state is used the way currently the GroupedDataStream should >> > > work. I >> > > > suppose this is the example for case b). Would case a) also window >> > > elements >> > > > by key? If yes, then this is practically a renaming and enhancement >> of >> > > the >> > > > GroupedDataStream functionality with keyed state. Then the >> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner, >> > > > KeySelector)construction does not make much sense as the user only >> > > operates >> > > > within the scope of the keyselector and not the partitioner anyway. >> > > > >> > > > I personally think KeyedDataStream as a name does not necessarily >> > suggest >> > > > that the records are grouped by key, it only suggests partitioning >> by >> > > key - >> > > > at least for me. :) >> > > > >> > > > 2) The API for discretization is not convenient IMHO >> > > > >> > > > The discretization part declares that the output of >> > > DataStream.discretize() >> > > > is a sequence of DataSets. I love this approach, but then in the >> code >> > > > snippet the return value of this function is simply a DataSet and >> uses >> > it >> > > > as such. The take home message of that code is the following: this >> is >> > > > actually the way you would like to program on these sequence of >> > DataSets, >> > > > most probably you would like to do the same with each of them. If >> that >> > is >> > > > the case we should provide a nice utility for that. I think Spark >> > > > Streaming's DStream.foreachRDD() is fairly useful for this purpose. >> > > > >> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra <gyula.f...@gmail.com> >> > > wrote: >> > > > >> > > > > +1 >> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen <se...@apache.org> >> > wrote: >> > > > > >> > > > > > If naming is the only concern, then we should go ahead, because >> we >> > > can >> > > > > > change names easily (before the release). >> > > > > > >> > > > > > In fact, I don't think it leaves a bad impression. Global >> windows >> > are >> > > > > > non-parallel windows. There are also parallel windows. Pick what >> > you >> > > > need >> > > > > > and what works. >> > > > > > >> > > > > > >> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula Fóra < >> gyula.f...@gmail.com> >> > > > > wrote: >> > > > > > >> > > > > > > I think we agree on everything its more of a naming issue :) >> > > > > > > >> > > > > > > I thought it might be misleading that global time windows are >> > > > > > > "non-parallel" windows. We dont want to give a bad impression. >> > > (Also >> > > > we >> > > > > > > dont want them to think that every global window is parallel >> but >> > > > thats >> > > > > > not >> > > > > > > a problem here) >> > > > > > > >> > > > > > > Gyula >> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen < >> se...@apache.org> >> > > > wrote: >> > > > > > > >> > > > > > > > Okay, what is missing about the windowing in your opinion? >> > > > > > > > >> > > > > > > > The core points of the document are: >> > > > > > > > >> > > > > > > > - The parallel windows are per group only. >> > > > > > > > >> > > > > > > > - The implementation of the parallel windows holds window >> > data >> > > in >> > > > > the >> > > > > > > > group buffers. >> > > > > > > > >> > > > > > > > - The global windows are non-parallel. May have parallel >> > > > > > > pre-aggregation, >> > > > > > > > if they are time windows. >> > > > > > > > >> > > > > > > > - Time may be operator time (timer thread), or watermark >> > time. >> > > > > > > Watermark >> > > > > > > > time can refer to ingress or event time. >> > > > > > > > >> > > > > > > > - Windows that do not pre-aggregate may require elements >> in >> > > > order. >> > > > > > Not >> > > > > > > > part of the first prototype. >> > > > > > > > >> > > > > > > > Do we agree on those points? >> > > > > > > > >> > > > > > > > >> > > > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula Fóra < >> > > gyula.f...@gmail.com> >> > > > > > > wrote: >> > > > > > > > >> > > > > > > > > In general I like it, although the main difference between >> > the >> > > > > > current >> > > > > > > > and >> > > > > > > > > the new one is the windowing and that is still not very >> > clear. >> > > > > > > > > >> > > > > > > > > Where do we have the full stream time windows for >> > > instance?(which >> > > > > is >> > > > > > > > > parallel but not keyed) >> > > > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek < >> > > > > > aljos...@apache.org> >> > > > > > > > > wrote: >> > > > > > > > > >> > > > > > > > > > +1 I like it as well. >> > > > > > > > > > >> > > > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas < >> > > > ktzou...@apache.org >> > > > > > >> > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > +1 from my side >> > > > > > > > > > > >> > > > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan Ewen < >> > > > > se...@apache.org> >> > > > > > > > > wrote: >> > > > > > > > > > > >> > > > > > > > > > > > Do we have consensus on these designs? >> > > > > > > > > > > > >> > > > > > > > > > > > If we have, we should get to implementing this soon, >> > > > because >> > > > > > > > > basically >> > > > > > > > > > > all >> > > > > > > > > > > > streaming patches will have to be revisited in >> light of >> > > > > this... >> > > > > > > > > > > > >> > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM, Gyula Fóra < >> > > > > > gyula.f...@gmail.com >> > > > > > > > >> > > > > > > > > > wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > > You are right thats an important issue. >> > > > > > > > > > > > > >> > > > > > > > > > > > > And I think we should also do some renaming with >> the >> > > > > > > "iterations" >> > > > > > > > > > > because >> > > > > > > > > > > > > they are not really iterations like in the batch >> case >> > > and >> > > > > it >> > > > > > > > might >> > > > > > > > > > > > confuse >> > > > > > > > > > > > > some users. >> > > > > > > > > > > > > Maybe we can call them loops or cycles and rename >> the >> > > api >> > > > > > calls >> > > > > > > > to >> > > > > > > > > > make >> > > > > > > > > > > > it >> > > > > > > > > > > > > more intuitive what happens. It is really just a >> > cyclic >> > > > > > > dataflow. >> > > > > > > > > > > > > >> > > > > > > > > > > > > Aljoscha Krettek <aljos...@apache.org> ezt írta >> > > > (időpont: >> > > > > > > 2015. >> > > > > > > > > júl. >> > > > > > > > > > > 7., >> > > > > > > > > > > > > K, >> > > > > > > > > > > > > 15:35): >> > > > > > > > > > > > > >> > > > > > > > > > > > > > Hi, >> > > > > > > > > > > > > > I just noticed that we don't have anything about >> > how >> > > > > > > iterations >> > > > > > > > > and >> > > > > > > > > > > > > > timestamps/watermarks should interact. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Cheers, >> > > > > > > > > > > > > > Aljoscha >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Mon, 6 Jul 2015 at 23:56 Stephan Ewen < >> > > > > se...@apache.org >> > > > > > > >> > > > > > > > > wrote: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi all! >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > As many of you know, there are a ongoing >> efforts >> > to >> > > > > > > > consolidate >> > > > > > > > > > the >> > > > > > > > > > > > > > > streaming API for the next release, and then >> > > graduate >> > > > > it >> > > > > > > > (from >> > > > > > > > > > beta >> > > > > > > > > > > > > > > status). >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > In the process of this consolidation, we want >> to >> > > > > achieve >> > > > > > > the >> > > > > > > > > > > > following >> > > > > > > > > > > > > > > goals. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Make the code more robust and simplify it >> in >> > > parts >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Clearly define the semantics of the >> > constructs. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Prepare it for support of more advanced >> > > concepts, >> > > > > like >> > > > > > > > > > > > partitionable >> > > > > > > > > > > > > > > state, and event time. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Cut support for certain corner cases that >> were >> > > > > > > prototyped, >> > > > > > > > > but >> > > > > > > > > > > > > turned >> > > > > > > > > > > > > > > out to be not efficiently doable >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Based on prior discussions on the mailing >> list, >> > > > > Aljoscha >> > > > > > > and >> > > > > > > > me >> > > > > > > > > > > > drafted >> > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > design documents below, which outline how the >> > > > > > consolidated >> > > > > > > > API >> > > > > > > > > > > would >> > > > > > > > > > > > > > like. >> > > > > > > > > > > > > > > We focused in constructs, time, and window >> > > semantics. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on how to restructure the >> > Streaming >> > > > > API: >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on definitions of time, order, >> > and >> > > > the >> > > > > > > > > resulting >> > > > > > > > > > > > > > semantics: >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Note: The design of the interfaces and >> concepts >> > for >> > > > > > > advanced >> > > > > > > > > > state >> > > > > > > > > > > in >> > > > > > > > > > > > > > > functions is not in here. That is part of a >> > > separate >> > > > > > design >> > > > > > > > > > > > discussion >> > > > > > > > > > > > > > and >> > > > > > > > > > > > > > > orthogonal to the designs drafted here. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Please have a look and voice questions and >> > > concerns. >> > > > > > Since >> > > > > > > we >> > > > > > > > > > > should >> > > > > > > > > > > > > not >> > > > > > > > > > > > > > > break the streaming API more than once, we >> should >> > > > make >> > > > > > sure >> > > > > > > > > this >> > > > > > > > > > > > > > > consolidation brings it into the shape we >> want it >> > > to >> > > > be >> > > > > > in. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Greetings, >> > > > > > > > > > > > > > > Stephan >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> > >