Re: Design documents for consolidated DataStream API

Stephan Ewen Tue, 14 Jul 2015 01:35:18 -0700

Concerning your comments:

1) In the new design, there is no grouping without windowing. The
KeyedDataStream subsumes the grouping and key-ing for partitioned state.


    The keyBy() + window() makes a parallel grouped window
    keyBy() alone allows access to partitioned state.

    My thought was that this is simpler, because it needs not groupBy() and
keyBy(), but one construct to handle both cases.

2) The discretization is a rough thought and is nothing for the short term.
It totally needs more thoughts. I put it there to have it as a sketch for
how to evolve this.

    The idea is of course to not have a single data set, but a series of
data set. In each discrete time slice, the data set can be treated like a
regular data set.

    Let's kick off a separate design for the discretization. Joins are good
to talk about (data sets can be joined with data set), and I am sure there
are more questions coming up.


Does that make sense?





On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> I think Marton has some good points here.
>
> 1) Is KeyedDataStream a better name if this is only a renaming?
>
> 2) the discretize semantics is unclear indeed. Are we operating on a single
> or sequence of datasets? If the latter why not call it something else
> (dstream). How are joins and other binary operators defined for different
> discretizations etc.
> On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi <mbala...@apache.org>
> wrote:
>
> > Generally I agree with the new design. Two concerns:
> >
> > 1) Does KeyedDataStream replace GroupedDataStream or is it the latter a
> > special case of the former?
> >
> > The KeyedDataStream as described in the design document is a bit unclear
> > for me. It lists the following usages:
> >   a) It is the first step in building a window stream, on top of which
> the
> > grouped/windowed aggregation and reduce-style function can be applied
> >   b) It allows to use the "by-key" state of functions. Here, every record
> > has access to a state that is scoped by its key. Key-scoped state can be
> > automatically redistributed and repartitioned.
> >
> > The code snippet describes a use case where the computation and the
> access
> > of the state is used the way currently the GroupedDataStream should
> work. I
> > suppose this is the example for case b). Would case a) also window
> elements
> > by key? If yes, then this is practically a renaming and enhancement of
> the
> > GroupedDataStream functionality with keyed state. Then the
> > StreamExecutionEnvironment.createKeyedStream(Partitioner,
> > KeySelector)construction does not make much sense as the user only
> operates
> > within the scope of the keyselector and not the partitioner anyway.
> >
> > I personally think KeyedDataStream as a name does not necessarily suggest
> > that the records are grouped by key, it only suggests partitioning by
> key -
> > at least for me. :)
> >
> > 2) The API for discretization is not convenient IMHO
> >
> > The discretization part declares that the output of
> DataStream.discretize()
> > is a sequence of DataSets. I love this approach, but then in the code
> > snippet the return value of this function is simply a DataSet and uses it
> > as such. The take home message of that code is the following: this is
> > actually the way you would like to program on these sequence of DataSets,
> > most probably you would like to do the same with each of them. If that is
> > the case we should provide a nice utility for that. I think Spark
> > Streaming's DStream.foreachRDD() is fairly useful for this purpose.
> >
> > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra <gyula.f...@gmail.com>
> wrote:
> >
> > > +1
> > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > If naming is the only concern, then we should go ahead, because we
> can
> > > > change names easily (before the release).
> > > >
> > > > In fact, I don't think it leaves a bad impression. Global windows are
> > > > non-parallel windows. There are also parallel windows. Pick what you
> > need
> > > > and what works.
> > > >
> > > >
> > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula Fóra <gyula.f...@gmail.com>
> > > wrote:
> > > >
> > > > > I think we agree on everything its more of a naming issue :)
> > > > >
> > > > > I thought it might be misleading that global time windows are
> > > > > "non-parallel" windows. We dont want to give a bad impression.
> (Also
> > we
> > > > > dont want them to think that every global window is parallel but
> > thats
> > > > not
> > > > > a problem here)
> > > > >
> > > > > Gyula
> > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen <se...@apache.org>
> > wrote:
> > > > >
> > > > > > Okay, what is missing about the windowing in your opinion?
> > > > > >
> > > > > > The core points of the document are:
> > > > > >
> > > > > >   - The parallel windows are per group only.
> > > > > >
> > > > > >   - The implementation of the parallel windows holds window data
> in
> > > the
> > > > > > group buffers.
> > > > > >
> > > > > >   - The global windows are non-parallel. May have parallel
> > > > > pre-aggregation,
> > > > > > if they are time windows.
> > > > > >
> > > > > >   - Time may be operator time (timer thread), or watermark time.
> > > > > Watermark
> > > > > > time can refer to ingress or event time.
> > > > > >
> > > > > >   - Windows that do not pre-aggregate may require elements in
> > order.
> > > > Not
> > > > > > part of the first prototype.
> > > > > >
> > > > > > Do we agree on those points?
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula Fóra <
> gyula.f...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > In general I like it, although the main difference between the
> > > > current
> > > > > > and
> > > > > > > the new one is the windowing and that is still not very clear.
> > > > > > >
> > > > > > > Where do we have the full stream time windows for
> instance?(which
> > > is
> > > > > > > parallel but not keyed)
> > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek <
> > > > aljos...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 I like it as well.
> > > > > > > >
> > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas <
> > ktzou...@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 from my side
> > > > > > > > >
> > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan Ewen <
> > > se...@apache.org>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Do we have consensus on these designs?
> > > > > > > > > >
> > > > > > > > > > If we have, we should get to implementing this soon,
> > because
> > > > > > > basically
> > > > > > > > > all
> > > > > > > > > > streaming patches will have to be revisited in light of
> > > this...
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM, Gyula Fóra <
> > > > gyula.f...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > You are right thats an important issue.
> > > > > > > > > > >
> > > > > > > > > > > And I think we should also do some renaming with the
> > > > > "iterations"
> > > > > > > > > because
> > > > > > > > > > > they are not really iterations like in the batch case
> and
> > > it
> > > > > > might
> > > > > > > > > > confuse
> > > > > > > > > > > some users.
> > > > > > > > > > > Maybe we can call them loops or cycles and rename the
> api
> > > > calls
> > > > > > to
> > > > > > > > make
> > > > > > > > > > it
> > > > > > > > > > > more intuitive what happens. It is really just a cyclic
> > > > > dataflow.
> > > > > > > > > > >
> > > > > > > > > > > Aljoscha Krettek <aljos...@apache.org> ezt írta
> > (időpont:
> > > > > 2015.
> > > > > > > júl.
> > > > > > > > > 7.,
> > > > > > > > > > > K,
> > > > > > > > > > > 15:35):
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > I just noticed that we don't have anything about how
> > > > > iterations
> > > > > > > and
> > > > > > > > > > > > timestamps/watermarks should interact.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Aljoscha
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 6 Jul 2015 at 23:56 Stephan Ewen <
> > > se...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all!
> > > > > > > > > > > > >
> > > > > > > > > > > > > As many of you know, there are a ongoing efforts to
> > > > > > consolidate
> > > > > > > > the
> > > > > > > > > > > > > streaming API for the next release, and then
> graduate
> > > it
> > > > > > (from
> > > > > > > > beta
> > > > > > > > > > > > > status).
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the process of this consolidation, we want to
> > > achieve
> > > > > the
> > > > > > > > > > following
> > > > > > > > > > > > > goals.
> > > > > > > > > > > > >
> > > > > > > > > > > > >  - Make the code more robust and simplify it in
> parts
> > > > > > > > > > > > >
> > > > > > > > > > > > >  - Clearly define the semantics of the constructs.
> > > > > > > > > > > > >
> > > > > > > > > > > > >  - Prepare it for support of more advanced
> concepts,
> > > like
> > > > > > > > > > partitionable
> > > > > > > > > > > > > state, and event time.
> > > > > > > > > > > > >
> > > > > > > > > > > > >  - Cut support for certain corner cases that were
> > > > > prototyped,
> > > > > > > but
> > > > > > > > > > > turned
> > > > > > > > > > > > > out to be not efficiently doable
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Based on prior discussions on the mailing list,
> > > Aljoscha
> > > > > and
> > > > > > me
> > > > > > > > > > drafted
> > > > > > > > > > > > the
> > > > > > > > > > > > > design documents below, which outline how the
> > > > consolidated
> > > > > > API
> > > > > > > > > would
> > > > > > > > > > > > like.
> > > > > > > > > > > > > We focused in constructs, time, and window
> semantics.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Design document on how to restructure the Streaming
> > > API:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
> > > > > > > > > > > > >
> > > > > > > > > > > > > Design document on definitions of time, order, and
> > the
> > > > > > > resulting
> > > > > > > > > > > > semantics:
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Note: The design of the interfaces and concepts for
> > > > > advanced
> > > > > > > > state
> > > > > > > > > in
> > > > > > > > > > > > > functions is not in here. That is part of a
> separate
> > > > design
> > > > > > > > > > discussion
> > > > > > > > > > > > and
> > > > > > > > > > > > > orthogonal to the designs drafted here.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please have a look and voice questions and
> concerns.
> > > > Since
> > > > > we
> > > > > > > > > should
> > > > > > > > > > > not
> > > > > > > > > > > > > break the streaming API more than once, we should
> > make
> > > > sure
> > > > > > > this
> > > > > > > > > > > > > consolidation brings it into the shape we want it
> to
> > be
> > > > in.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Greetings,
> > > > > > > > > > > > > Stephan
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Design documents for consolidated DataStream API

Reply via email to