Re: [DISCUSS] KIP-150 - Kafka-Streams Cogroup

Guozhang Wang Tue, 06 Jun 2017 15:59:58 -0700

Kyle,

Thanks a lot for the updated KIP. It looks good to me.



Guozhang


On Fri, Jun 2, 2017 at 5:37 AM, Jim Jagielski <[email protected]> wrote:

> This makes much more sense to me. +1
>
> > On Jun 1, 2017, at 10:33 AM, Kyle Winkelman <[email protected]>
> wrote:
> >
> > I have updated the KIP and my PR. Let me know what you think.
> > To create a cogrouped stream, just call cogroup on a KGroupedStream and
> > supply the initializer, aggValueSerde, and an aggregator. Then continue
> > adding KGroupedStreams and aggregators. Then call one of the many
> > aggregate calls to create a KTable.
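
The call sequence described above can be sketched in plain Java, without the
Kafka Streams library. Everything here is an assumption made for illustration:
the names CogroupSketch, Cogrouped and Aggregator are hypothetical stand-ins
for the KIP's interfaces, each "stream" is just a list of key/value entries,
the streams are processed one after another rather than interleaved, and the
aggValueSerde and store name are left out.

```java
import java.util.HashMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

class CogroupSketch {
    // Hypothetical stand-in for an aggregator: folds one value into the aggregate.
    interface Aggregator<K, V, A> {
        A apply(K key, V value, A aggregate);
    }

    // Hypothetical stand-in for the cogrouped stream: one initializer, many
    // (stream, aggregator) pairs, one shared result map.
    static final class Cogrouped<K, A> {
        private final Supplier<A> initializer;
        private final List<Consumer<Map<K, A>>> steps = new ArrayList<>();

        Cogrouped(Supplier<A> initializer) {
            this.initializer = initializer;
        }

        // V is bound per call, so the aggregator must match this stream's value type.
        <V> Cogrouped<K, A> cogroup(List<Map.Entry<K, V>> stream, Aggregator<K, V, A> aggregator) {
            steps.add(store -> {
                for (Map.Entry<K, V> record : stream) {
                    K key = record.getKey();
                    A current = store.containsKey(key) ? store.get(key) : initializer.get();
                    store.put(key, aggregator.apply(key, record.getValue(), current));
                }
            });
            return this;
        }

        // Runs every registered pair against a single store and returns it.
        Map<K, A> aggregate() {
            Map<K, A> store = new HashMap<>();
            for (Consumer<Map<K, A>> step : steps) {
                step.accept(store);
            }
            return store;
        }
    }

    // Entry point mirroring "call cogroup on a grouped stream with an initializer".
    static <K, V, A> Cogrouped<K, A> cogroup(Supplier<A> initializer,
                                             List<Map.Entry<K, V>> first,
                                             Aggregator<K, V, A> aggregator) {
        return new Cogrouped<K, A>(initializer).cogroup(first, aggregator);
    }

    static Map<String, Integer> demo() {
        List<Map.Entry<String, Integer>> clicks = List.of(Map.entry("a", 1), Map.entry("a", 1));
        List<Map.Entry<String, String>> views = List.of(Map.entry("a", "xyz"), Map.entry("b", "hi"));
        Supplier<Integer> initializer = () -> 0;
        return cogroup(initializer, clicks, (key, value, agg) -> agg + value)
                .cogroup(views, (key, value, agg) -> agg + value.length())
                .aggregate();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the value type is bound separately on each cogroup call, an aggregator
written for Integer values cannot be paired with the String-valued stream; the
compiler rejects it. That is the type checking a flat list of aggregators
would lose.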
> >
> > Thanks,
> > Kyle
> >
> > On Jun 1, 2017 4:03 AM, "Damian Guy" <[email protected]> wrote:
> >
> >> Hi Kyle,
> >>
> >> Thanks for the update. I think just one initializer makes sense as it
> >> should only be called once per key and generally it is just going to
> >> create a new instance of whatever the Aggregate class is.
> >>
> >> Cheers,
> >> Damian
> >>
> >> On Wed, 31 May 2017 at 20:09 Kyle Winkelman <[email protected]>
> >> wrote:
> >>
> >>> Hello all,
> >>>
> >>> I have spent some more time on this and the best alternative I have come
> >>> up with is:
> >>> KGroupedStream has a single cogroup call that takes an initializer and an
> >>> aggregator.
> >>> CogroupedKStream has a cogroup call that takes additional groupedStream
> >>> aggregator pairs.
> >>> CogroupedKStream has multiple aggregate methods that create the different
> >>> stores.
> >>>
> >>> I plan on updating the KIP, but I want people's input on whether we should
> >>> have the initializer be passed in once at the beginning or if we should
> >>> instead have the initializer be required for each call to one of the
> >>> aggregate calls. The first makes more sense to me but doesn't allow the
> >>> user to specify different initializers for different tables.
> >>>
> >>> Thanks,
> >>> Kyle
> >>>
> >>> On May 24, 2017 7:46 PM, "Kyle Winkelman" <[email protected]>
> >>> wrote:
> >>>
> >>>> Yeah, I really like that idea. I'll see what I can do to update the KIP
> >>>> and my PR when I have some time. I'm not sure how well creating the
> >>>> KStreamAggregates will go, though, because at that point I will have
> >>>> thrown away the type of the values. It will be type safe; I just may need
> >>>> to do a little forcing.
> >>>>
> >>>> Thanks,
> >>>> Kyle
> >>>>
> >>>> On May 24, 2017 3:28 PM, "Guozhang Wang" <[email protected]> wrote:
> >>>>
> >>>>> Kyle,
> >>>>>
> >>>>> Thanks for the explanations, my previous read on the wiki examples
> was
> >>>>> wrong.
> >>>>>
> >>>>> So I guess my motivation should be "reduced" to: can we move the window
> >>>>> specs param from "KGroupedStream#cogroup(..)" to
> >>>>> "CogroupedKStream#aggregate(..)", and my motivations are:
> >>>>>
> >>>>> 1. minor: we can reduce the number of generics in CogroupedKStream from
> >>>>> 3 to 2.
> >>>>> 2. major: this is for extensibility of the APIs, and since we are
> >>>>> removing the "Evolving" annotations on Streams it may be harder to change
> >>>>> it again in the future. The extended use cases are that people wanted to
> >>>>> have windowed running aggregates on different granularities, e.g. "give
> >>>>> me the counts per-minute, per-hour, per-day and per-week", and today in
> >>>>> DSL we need to specify that case in multiple aggregate operators, each of
> >>>>> which gets a state store / changelog, etc. And it is possible to optimize
> >>>>> it as well to a single state store. Its implementation would be tricky as
> >>>>> you need to contain windows of different lengths within your window store
> >>>>> but just from the public API point of view, it could be specified as:
> >>>>>
> >>>>> CogroupedKStream stream = stream1.cogroup(stream2, ... "state-store-name");
> >>>>>
> >>>>> table1 = stream.aggregate(/*per-minute window*/)
> >>>>> table2 = stream.aggregate(/*per-hour window*/)
> >>>>> table3 = stream.aggregate(/*per-day window*/)
> >>>>>
> >>>>> while underneath we are only using a single store "state-store-name"
> >>>>> for it.
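
One way to picture several window granularities sharing one store is to key
each entry by (record key, window size, window start). The sketch below is
plain Java and purely illustrative: the class name, the string-concatenated
store key, and the use of tumbling windows aligned to timestamp 0 are all
assumptions, not how Kafka Streams window stores are actually laid out, and
timestamps are assumed to be non-negative.

```java
import java.util.HashMap;
import java.util.Map;

class MultiWindowCountSketch {
    static final long MINUTE_MS = 60_000L;
    static final long HOUR_MS = 60 * MINUTE_MS;
    static final long DAY_MS = 24 * HOUR_MS;

    // The single backing "store": composite key -> running count.
    private final Map<String, Long> store = new HashMap<>();
    private final long[] windowSizesMs;

    MultiWindowCountSketch(long... windowSizesMs) {
        this.windowSizesMs = windowSizesMs.clone();
    }

    // Start of the tumbling window of the given size that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Hypothetical composite key: record key, window size, window start.
    private static String storeKey(String key, long sizeMs, long startMs) {
        return key + "|" + sizeMs + "|" + startMs;
    }

    // One incoming record updates one entry per configured granularity.
    void record(String key, long timestampMs) {
        for (long sizeMs : windowSizesMs) {
            store.merge(storeKey(key, sizeMs, windowStart(timestampMs, sizeMs)), 1L, Long::sum);
        }
    }

    // Count for the window of the given size that contains the timestamp.
    long count(String key, long sizeMs, long timestampMs) {
        return store.getOrDefault(storeKey(key, sizeMs, windowStart(timestampMs, sizeMs)), 0L);
    }

    public static void main(String[] args) {
        MultiWindowCountSketch counts = new MultiWindowCountSketch(MINUTE_MS, HOUR_MS, DAY_MS);
        counts.record("k", 0L);
        counts.record("k", 30_000L);
        counts.record("k", 90_000L);
        System.out.println(counts.count("k", MINUTE_MS, 0L));
        System.out.println(counts.count("k", HOUR_MS, 0L));
    }
}
```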
> >>>>>
> >>>>>
> >>>>> Although this feature is out of the scope of this KIP, I'd like to
> >>>>> discuss if we can "leave the door open" to make such changes without
> >>>>> modifying the public APIs.
> >>>>>
> >>>>> Guozhang
> >>>>>
> >>>>>
> >>>>> On Wed, May 24, 2017 at 3:57 AM, Kyle Winkelman <
> >>> [email protected]
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> I allow defining a single window/sessionwindow one time when you make
> >>>>>> the cogroup call from a KGroupedStream. From then on you are using the
> >>>>>> cogroup call from within CogroupedKStream, which doesn't accept any
> >>>>>> additional windows/sessionwindows.
> >>>>>>
> >>>>>> Is this what you meant by your question or did I misunderstand?
> >>>>>>
> >>>>>> On May 23, 2017 9:33 PM, "Guozhang Wang" <[email protected]>
> >> wrote:
> >>>>>>
> >>>>>> Another question that came to me is on "window alignment": from the KIP
> >>>>>> it seems you are allowing users to specify a (potentially different)
> >>>>>> window spec in each co-grouped input stream. So if these window specs are
> >>>>>> different how should we "align" them with different input streams? I
> >>>>>> think it is more natural to only specify one window spec in the
> >>>>>>
> >>>>>> KTable<RK, V> CogroupedKStream#aggregate(Windows);
> >>>>>>
> >>>>>>
> >>>>>> And remove it from the cogroup() functions. WDYT?
> >>>>>>
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>> On Tue, May 23, 2017 at 6:22 PM, Guozhang Wang <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Thanks for the proposal Kyle, this is a quite common use case to
> >>>>>>> support such multi-way table join (i.e. N source tables with N aggregate
> >>>>>>> func) with a single store and N+1 serdes, I have seen lots of people
> >>>>>>> using the low-level PAPI to achieve this goal.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, May 19, 2017 at 10:04 AM, Kyle Winkelman <
> >>>>>> [email protected]
> >>>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I like your point about not handling other cases such as count and
> >>>>>>>> reduce.
> >>>>>>>>
> >>>>>>>> I think that reduce may not make sense because reduce assumes that the
> >>>>>>>> input values are the same type as the output values. With cogroup there
> >>>>>>>> may be multiple different input types and then your output type can't
> >>>>>>>> be multiple different things. In the case where you have all matching
> >>>>>>>> value types you can do KStreamBuilder#merge followed by the reduce.
> >>>>>>>>
> >>>>>>>> As for count, I think it is possible to call count on all the
> >>>>>>>> individual grouped streams and then do joins. Otherwise we could maybe
> >>>>>>>> make a special call in groupedstream for this case, because in this case
> >>>>>>>> we don't need to do type checking on the values. It could be similar to
> >>>>>>>> the current count methods but accept a varargs of additional grouped
> >>>>>>>> streams as well and make sure they have a key type of K.
> >>>>>>>>
> >>>>>>>> The way I have put the KIP together is to ensure that we do type
> >>>>>>>> checking. I don't see a way we could group them all first and then make
> >>>>>>>> a call to count, reduce, or aggregate because with aggregate they would
> >>>>>>>> need to pass a list of aggregators and we would have no way of type
> >>>>>>>> checking that they match the grouped streams.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Kyle
> >>>>>>>>
> >>>>>>>> On May 19, 2017 11:42 AM, "Xavier Léauté" <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Sorry to jump on this thread so late. I agree this is a very
> >>> useful
> >>>>>>>>> addition and wanted to provide an additional use-case and some
> >>> more
> >>>>>>>>> comments.
> >>>>>>>>>
> >>>>>>>>> This is actually a very common analytics use-case in the ad-tech
> >>>>>>>>> industry. The typical setup will have an auction stream, an impression
> >>>>>>>>> stream, and a click stream. Those three streams need to be combined to
> >>>>>>>>> compute aggregate statistics (e.g. impression statistics, and
> >>>>>>>>> click-through rates), since most of the attributes of interest are only
> >>>>>>>>> present in the auction stream.
> >>>>>>>>>
> >>>>>>>>> A simple way to do this is to co-group all the streams by the
> >>>>> auction
> >>>>>>>> key,
> >>>>>>>>> and process updates to the co-group as events for each stream
> >>> come
> >>>>> in,
> >>>>>>>>> keeping only one value from each stream before sending
> >> downstream
> >>>>> for
> >>>>>>>>> further processing / aggregation.
> >>>>>>>>>
> >>>>>>>>> One could view the result of that co-group operation as a
> >>> "KTable"
> >>>>>> with
> >>>>>>>>> multiple values per key. The key being the grouping key, and
> >> the
> >>>>>> values
> >>>>>>>>> consisting of one value per stream.
> >>>>>>>>>
> >>>>>>>>> What I like about Kyle's approach is that it allows elegant co-grouping
> >>>>>>>>> of multiple streams without having to worry about the number of
> >>>>>>>>> streams, and avoids dealing with Tuple types or other generic
> >>>>>>>>> interfaces that could get messy if we wanted to preserve all the value
> >>>>>>>>> types in the resulting co-grouped stream.
> >>>>>>>>>
> >>>>>>>>> My only concern is that we only allow the cogroup + aggregate
> >>>>> combined
> >>>>>>>>> operation. This forces the user to build their own tuple
> >>>>> serialization
> >>>>>>>>> format if they want to preserve the individual input stream
> >>> values
> >>>>> as
> >>>>>> a
> >>>>>>>>> group. It also deviates quite a bit from our approach in
> >>>>>> KGroupedStream
> >>>>>>>>> which offers other operations, such as count and reduce, which
> >>>>> should
> >>>>>>>> also
> >>>>>>>>> be applicable to a co-grouped stream.
> >>>>>>>>>
> >>>>>>>>> Overall I still think this is a really useful addition, but I feel we
> >>>>>>>>> haven't spent much time trying to explore alternative DSLs that could
> >>>>>>>>> maybe generalize better or match our existing syntax more closely.
> >>>>>>>>>
> >>>>>>>>> On Tue, May 9, 2017 at 8:08 AM Kyle Winkelman <
> >>>>>> [email protected]
> >>>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Eno, is there anyone else that is an expert in the kafka
> >>> streams
> >>>>>> realm
> >>>>>>>>> that
> >>>>>>>>>> I should reach out to for input?
> >>>>>>>>>>
> >>>>>>>>>> I believe Damian Guy is still planning on reviewing this more
> >>> in
> >>>>>> depth
> >>>>>>>>> so I
> >>>>>>>>>> will wait for his inputs before continuing.
> >>>>>>>>>>
> >>>>>>>>>> On May 9, 2017 7:30 AM, "Eno Thereska" <
> >> [email protected]
> >>>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks Kyle, good arguments.
> >>>>>>>>>>>
> >>>>>>>>>>> Eno
> >>>>>>>>>>>
> >>>>>>>>>>>> On May 7, 2017, at 5:06 PM, Kyle Winkelman <
> >>>>>>>> [email protected]
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> *- minor: could you add an exact example (similar to what
> >>>>> Jay’s
> >>>>>>>>> example
> >>>>>>>>>>> is,
> >>>>>>>>>>>> or like your Spark/Pig pointers had) to make this super
> >>>>>> concrete?*
> >>>>>>>>>>>> I have added a more concrete example to the KIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> *- my main concern is that we’re exposing this
> >> optimization
> >>>>> to
> >>>>>> the
> >>>>>>>>> DSL.
> >>>>>>>>>>> In
> >>>>>>>>>>>> an ideal world, an optimizer would take the existing DSL
> >>> and
> >>>>> do
> >>>>>>>> the
> >>>>>>>>>> right
> >>>>>>>>>>>> thing under the covers (create just one state store,
> >>> arrange
> >>>>> the
> >>>>>>>>> nodes
> >>>>>>>>>>>> etc). The original DSL had a bunch of small, composable
> >>>>> pieces
> >>>>>>>>> (group,
> >>>>>>>>>>>> aggregate, join) that this proposal groups together. I’d
> >>>>> like to
> >>>>>>>> hear
> >>>>>>>>>>> your
> >>>>>>>>>>>> thoughts on whether it’s possible to do this optimization
> >>>>> with
> >>>>>> the
> >>>>>>>>>>> current
> >>>>>>>>>>>> DSL, at the topology builder level.*
> >>>>>>>>>>>> You would have to make a lot of checks to understand if it is even
> >>>>>>>>>>>> possible to make this optimization:
> >>>>>>>>>>>> 1. Make sure they are all KTableKTableOuterJoins.
> >>>>>>>>>>>> 2. None of the intermediate KTables are used for anything else.
> >>>>>>>>>>>> 3. None of the intermediate stores are used. (This may be impossible,
> >>>>>>>>>>>> especially if they use KafkaStreams#store after the topology has
> >>>>>>>>>>>> already been built.)
> >>>>>>>>>>>> You would then need to make decisions during the optimization:
> >>>>>>>>>>>> 1. Your new initializer would be the composite of all the individual
> >>>>>>>>>>>> initializers and the valueJoiners.
> >>>>>>>>>>>> 2. I am having a hard time thinking about how you would turn the
> >>>>>>>>>>>> aggregators and valueJoiners into an aggregator that would work on the
> >>>>>>>>>>>> final object, but this may be possible.
> >>>>>>>>>>>> 3. Which state store would you use? The ones declared would be for the
> >>>>>>>>>>>> aggregate values. None of the declared ones would be guaranteed to hold
> >>>>>>>>>>>> the final object. This would mean you must create a new state store and
> >>>>>>>>>>>> not create any of the declared ones.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The main argument I have against it is even if it could
> >> be
> >>>>> done
> >>>>>> I
> >>>>>>>>> don't
> >>>>>>>>>>>> know that we would want to have this be an optimization
> >> in
> >>>>> the
> >>>>>>>>>> background
> >>>>>>>>>>>> because the user would still be required to think about
> >> all
> >>>>> of
> >>>>>> the
> >>>>>>>>>>>> intermediate values that they shouldn't need to worry
> >> about
> >>>>> if
> >>>>>>>> they
> >>>>>>>>>> only
> >>>>>>>>>>>> care about the final object.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In my opinion cogroup is a common enough case that it should be part of
> >>>>>>>>>>>> the composable pieces (group, aggregate, join) because we want to allow
> >>>>>>>>>>>> people to join more than 2 streams in an easy way. Right now I don't
> >>>>>>>>>>>> think we give them ways of handling this use case easily.
> >>>>>>>>>>>>
> >>>>>>>>>>>> *-I think there will be scope for several such
> >>> optimizations
> >>>>> in
> >>>>>>>> the
> >>>>>>>>>>> future
> >>>>>>>>>>>> and perhaps at some point we need to think about
> >> decoupling
> >>>>> the
> >>>>>>>> 1:1
> >>>>>>>>>>> mapping
> >>>>>>>>>>>> from the DSL into the physical topology.*
> >>>>>>>>>>>> I would argue that cogroup is not just an optimization; it is a new way
> >>>>>>>>>>>> for the users to look at accomplishing a problem that requires multiple
> >>>>>>>>>>>> streams. I may sound like a broken record but I don't think users should
> >>>>>>>>>>>> have to build the N-1 intermediate tables and deal with their
> >>>>>>>>>>>> initializers, serdes and stores if all they care about is the final
> >>>>>>>>>>>> object.
> >>>>>>>>>>>> Now if for example someone uses cogroup but doesn't supply additional
> >>>>>>>>>>>> streams and aggregators this case is equivalent to a single grouped
> >>>>>>>>>>>> stream making an aggregate call. This case is what I view as an
> >>>>>>>>>>>> optimization: we could remove the KStreamCogroup and act as if there was
> >>>>>>>>>>>> just a call to KGroupedStream#aggregate instead of calling
> >>>>>>>>>>>> KGroupedStream#cogroup. (I would prefer to just write a warning saying
> >>>>>>>>>>>> that this is not how cogroup is to be used.)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Kyle
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sun, May 7, 2017 at 5:41 AM, Eno Thereska <
> >>>>>>>> [email protected]
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Kyle,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the KIP again. A couple of comments:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - minor: could you add an exact example (similar to what
> >>>>> Jay’s
> >>>>>>>>> example
> >>>>>>>>>>> is,
> >>>>>>>>>>>>> or like your Spark/Pig pointers had) to make this super
> >>>>>> concrete?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - my main concern is that we’re exposing this
> >> optimization
> >>>>> to
> >>>>>> the
> >>>>>>>>> DSL.
> >>>>>>>>>>> In
> >>>>>>>>>>>>> an ideal world, an optimizer would take the existing DSL
> >>>>> and do
> >>>>>>>> the
> >>>>>>>>>>> right
> >>>>>>>>>>>>> thing under the covers (create just one state store,
> >>> arrange
> >>>>>> the
> >>>>>>>>> nodes
> >>>>>>>>>>>>> etc). The original DSL had a bunch of small, composable
> >>>>> pieces
> >>>>>>>>> (group,
> >>>>>>>>>>>>> aggregate, join) that this proposal groups together. I’d
> >>>>> like
> >>>>>> to
> >>>>>>>>> hear
> >>>>>>>>>>> your
> >>>>>>>>>>>>> thoughts on whether it’s possible to do this
> >> optimization
> >>>>> with
> >>>>>>>> the
> >>>>>>>>>>> current
> >>>>>>>>>>>>> DSL, at the topology builder level.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think there will be scope for several such
> >> optimizations
> >>>>> in
> >>>>>> the
> >>>>>>>>>> future
> >>>>>>>>>>>>> and perhaps at some point we need to think about
> >>> decoupling
> >>>>> the
> >>>>>>>> 1:1
> >>>>>>>>>>> mapping
> >>>>>>>>>>>>> from the DSL into the physical topology.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On May 5, 2017, at 4:39 PM, Jay Kreps <
> >> [email protected]>
> >>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I haven't digested the proposal but the use case is
> >>> pretty
> >>>>>>>> common.
> >>>>>>>>> An
> >>>>>>>>>>>>>> example would be the "customer 360" or "unified
> >> customer
> >>>>>>>> profile"
> >>>>>>>>> use
> >>>>>>>>>>>>> case
> >>>>>>>>>>>>>> we often use. In that use case you have a dozen systems
> >>>>> each
> >>>>>> of
> >>>>>>>>> which
> >>>>>>>>>>> has
> >>>>>>>>>>>>>> some information about your customer (account details,
> >>>>>> settings,
> >>>>>>>>>>> billing
> >>>>>>>>>>>>>> info, customer service contacts, purchase history,
> >> etc).
> >>>>> Your
> >>>>>>>> goal
> >>>>>>>>> is
> >>>>>>>>>>> to
> >>>>>>>>>>>>>> join/munge these into a single profile record for each
> >>>>>> customer
> >>>>>>>>> that
> >>>>>>>>>>> has
> >>>>>>>>>>>>>> all the relevant info in a usable form and is
> >> up-to-date
> >>>>> with
> >>>>>>>> all
> >>>>>>>>> the
> >>>>>>>>>>>>>> source systems. If you implement that with kstreams as
> >> a
> >>>>>>>> sequence
> >>>>>>>>> of
> >>>>>>>>>>>>> joins
> >>>>>>>>>>>>>> i think today we'd fully materialize N-1 intermediate
> >>>>> tables.
> >>>>>>>> But
> >>>>>>>>>>> clearly
> >>>>>>>>>>>>>> you only need a single stage to group all these things
> >>> that
> >>>>>> are
> >>>>>>>>>> already
> >>>>>>>>>>>>>> co-partitioned. A distributed database would do this under the covers,
> >>>>>>>>>>>>>> which is arguably better (at least when it does the right thing), and
> >>>>>>>>>>>>>> perhaps we could do the same thing, but I'm not sure we know the
> >>>>>>>>>>>>>> partitioning so we may need an explicit cogroup command that implies
> >>>>>>>>>>>>>> they are already co-partitioned.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Jay
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, May 5, 2017 at 5:56 AM, Kyle Winkelman <
> >>>>>>>>>>> [email protected]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yeah, that's a good way to look at it.
> >>>>>>>>>>>>>>> I have seen this type of functionality in a couple of other platforms
> >>>>>>>>>>>>>>> like Spark and Pig.
> >>>>>>>>>>>>>>> https://spark.apache.org/docs/0.6.2/api/core/spark/PairRDDFunctions.html
> >>>>>>>>>>>>>>> https://www.tutorialspoint.com/apache_pig/apache_pig_cogroup_operator.htm
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On May 5, 2017 7:43 AM, "Damian Guy" <
> >>>>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Kyle,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If i'm reading this correctly it is like an N way
> >> outer
> >>>>>> join?
> >>>>>>>> So
> >>>>>>>>> an
> >>>>>>>>>>>>> input
> >>>>>>>>>>>>>>>> on any stream will always produce a new aggregated
> >>> value
> >>>>> -
> >>>>>> is
> >>>>>>>>> that
> >>>>>>>>>>>>>>> correct?
> >>>>>>>>>>>>>>>> Effectively, each Aggregator just looks up the
> >> current
> >>>>>> value,
> >>>>>>>>>>>>> aggregates
> >>>>>>>>>>>>>>>> and forwards the result.
> >>>>>>>>>>>>>>>> I need to look into it and think about it a bit more,
> >>>>> but it
> >>>>>>>>> seems
> >>>>>>>>>>> like
> >>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>> could be a useful optimization.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, 4 May 2017 at 23:21 Kyle Winkelman <
> >>>>>>>>>> [email protected]
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I sure can. I have added the following description
> >> to
> >>> my
> >>>>>>>> KIP. If
> >>>>>>>>>>> this
> >>>>>>>>>>>>>>>>> doesn't help let me know and I will take some more
> >>> time
> >>>>> to
> >>>>>>>>> build a
> >>>>>>>>>>>>>>>> diagram
> >>>>>>>>>>>>>>>>> and make more of a step by step description:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Example with Current API:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> KTable<K, V1> table1 = builder.stream("topic1").groupByKey()
> >>>>>>>>>>>>>>>>>     .aggregate(initializer1, aggregator1, aggValueSerde1, storeName1);
> >>>>>>>>>>>>>>>>> KTable<K, V2> table2 = builder.stream("topic2").groupByKey()
> >>>>>>>>>>>>>>>>>     .aggregate(initializer2, aggregator2, aggValueSerde2, storeName2);
> >>>>>>>>>>>>>>>>> KTable<K, V3> table3 = builder.stream("topic3").groupByKey()
> >>>>>>>>>>>>>>>>>     .aggregate(initializer3, aggregator3, aggValueSerde3, storeName3);
> >>>>>>>>>>>>>>>>> KTable<K, CG> cogrouped = table1
> >>>>>>>>>>>>>>>>>     .outerJoin(table2, joinerOneAndTwo)
> >>>>>>>>>>>>>>>>>     .outerJoin(table3, joinerOneTwoAndThree);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> As you can see this creates 3 StateStores, requires 3 initializers, and 3
> >>>>>>>>>>>>>>>>> aggValueSerdes. This also puts pressure on the user to define what the
> >>>>>>>>>>>>>>>>> intermediate values are going to be (V1, V2, V3). They are left with a
> >>>>>>>>>>>>>>>>> couple of choices: first, to make V1, V2, and V3 all the same as CG so
> >>>>>>>>>>>>>>>>> that the two joiners are more like mergers, or second, to make them
> >>>>>>>>>>>>>>>>> intermediate states such as Topic1Map, Topic2Map, and Topic3Map so that
> >>>>>>>>>>>>>>>>> the joiners use those to build the final aggregate CG value. This is
> >>>>>>>>>>>>>>>>> something the user could avoid thinking about with this KIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> When a new input arrives, let's say at "topic1", it will first go through
> >>>>>>>>>>>>>>>>> a KStreamAggregate grabbing the current aggregate from storeName1. It will
> >>>>>>>>>>>>>>>>> produce this in the form of the first intermediate value and get sent
> >>>>>>>>>>>>>>>>> through a KTableKTableOuterJoin where it will look up the current value
> >>>>>>>>>>>>>>>>> of the key in storeName2. It will use the first joiner to calculate the
> >>>>>>>>>>>>>>>>> second intermediate value, which will go through an additional
> >>>>>>>>>>>>>>>>> KTableKTableOuterJoin. Here it will look up the current value of the key
> >>>>>>>>>>>>>>>>> in storeName3 and use the second joiner to build the final aggregate
> >>>>>>>>>>>>>>>>> value.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If you think through all possibilities for incoming
> >>>>> topics
> >>>>>>>> you
> >>>>>>>>>> will
> >>>>>>>>>>>>> see
> >>>>>>>>>>>>>>>>> that no matter which topic it comes in through all
> >>> three
> >>>>>>>> stores
> >>>>>>>>>> are
> >>>>>>>>>>>>>>>> queried
> >>>>>>>>>>>>>>>>> and all of the joiners must get used.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Topology wise for N incoming streams this creates N
> >>>>>>>>>>>>>>>>> KStreamAggregates, 2*(N-1) KTableKTableOuterJoins,
> >> and
> >>>>> N-1
> >>>>>>>>>>>>>>>>> KTableKTableJoinMergers.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Example with Proposed API:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> KGroupedStream<K, V1> grouped1 = builder.stream("topic1").groupByKey();
> >>>>>>>>>>>>>>>>> KGroupedStream<K, V2> grouped2 = builder.stream("topic2").groupByKey();
> >>>>>>>>>>>>>>>>> KGroupedStream<K, V3> grouped3 = builder.stream("topic3").groupByKey();
> >>>>>>>>>>>>>>>>> KTable<K, CG> cogrouped = grouped1
> >>>>>>>>>>>>>>>>>      .cogroup(initializer1, aggregator1, aggValueSerde1, storeName1)
> >>>>>>>>>>>>>>>>>      .cogroup(grouped2, aggregator2)
> >>>>>>>>>>>>>>>>>      .cogroup(grouped3, aggregator3)
> >>>>>>>>>>>>>>>>>      .aggregate();
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> As you can see this creates 1 StateStore, requires 1
> >>>>>>>>> initializer,
> >>>>>>>>>>> and
> >>>>>>>>>>>>> 1
> >>>>>>>>>>>>>>>>> aggValueSerde. The user no longer has to worry about
> >>> the
> >>>>>>>>>>> intermediate
> >>>>>>>>>>>>>>>>> values and the joiners. All they have to think about
> >>> is
> >>>>> how
> >>>>>>>> each
> >>>>>>>>>>>>> stream
> >>>>>>>>>>>>>>>>> impacts the creation of the final CG object.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> When a new input arrives, let's say at "topic1", it will first go through
> >>>>>>>>>>>>>>>>> a KStreamAggregate and grab the current aggregate from storeName1. It will
> >>>>>>>>>>>>>>>>> add its incoming object to the aggregate, update the store and pass the
> >>>>>>>>>>>>>>>>> new aggregate on. This new aggregate goes through the KStreamCogroup,
> >>>>>>>>>>>>>>>>> which is pretty much just a pass-through processor, and you are done.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Topology wise for N incoming streams the new API will only ever create N
> >>>>>>>>>>>>>>>>> KStreamAggregates and 1 KStreamCogroup.
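
The two processor counts quoted in this message can be compared with a little
arithmetic. The formulas below are taken directly from the text (N
KStreamAggregates, 2*(N-1) KTableKTableOuterJoins and N-1
KTableKTableJoinMergers today; N KStreamAggregates and 1 KStreamCogroup with
the proposal); the class and method names are hypothetical.

```java
class TopologySizeSketch {
    // Current API: N aggregates + 2*(N-1) outer joins + (N-1) join mergers.
    static int currentApiNodes(int n) {
        requireAtLeastOneStream(n);
        return n + 2 * (n - 1) + (n - 1);
    }

    // Proposed API: N aggregates + 1 cogroup processor.
    static int proposedApiNodes(int n) {
        requireAtLeastOneStream(n);
        return n + 1;
    }

    private static void requireAtLeastOneStream(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one input stream, got " + n);
        }
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 3, 12}) {
            System.out.println(n + " streams: " + currentApiNodes(n)
                    + " nodes today, " + proposedApiNodes(n) + " with cogroup");
        }
    }
}
```

For the three-topic example that is 9 processors against 4, and the gap widens
linearly as more streams are added.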
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, May 4, 2017 at 4:42 PM, Matthias J. Sax <
> >>>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Kyle,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> thanks a lot for the KIP. Maybe I am a little slow, but I could not
> >>>>>>>>>>>>>>>>>> follow completely. Could you maybe add a more concrete example, like 3
> >>>>>>>>>>>>>>>>>> streams with 3 records each (plus expected result), and show the
> >>>>>>>>>>>>>>>>>> difference between the current way to implement it and the proposed API?
> >>>>>>>>>>>>>>>>>> This could also cover the internal processing to see what store calls
> >>>>>>>>>>>>>>>>>> would be required for both approaches etc.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I think, it's pretty advanced stuff you propose,
> >> and
> >>> it
> >>>>>>>> would
> >>>>>>>>>> help
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> understand it better.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks a lot!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 5/4/17 11:39 AM, Kyle Winkelman wrote:
> >>>>>>>>>>>>>>>>>>> I have made a pull request. It can be found here.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/2975
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I plan to write some more unit tests for my
> >> classes
> >>>>> and
> >>>>>> get
> >>>>>>>>>> around
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> writing documentation for the public api
> >> additions.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> One thing I was curious about is during the
> >>>>>>>>>>>>>>>>>>> KCogroupedStreamImpl#aggregate method I pass null to the
> >>>>>>>>>>>>>>>>>>> KGroupedStream#repartitionIfRequired method. I can't supply the store
> >>>>>>>>>>>>>>>>>>> name because if more than one grouped stream repartitions an error is
> >>>>>>>>>>>>>>>>>>> thrown. Is there some name that someone can recommend or should I leave
> >>>>>>>>>>>>>>>>>>> the null and allow it to fall back to the KGroupedStream.name?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Should this be expanded to handle grouped tables? This would be pretty
> >>>>>>>>>>>>>>>>>>> easy for a normal aggregate but one allowing session stores and windowed
> >>>>>>>>>>>>>>>>>>> stores would require KTableSessionWindowAggregate and
> >>>>>>>>>>>>>>>>>>> KTableWindowAggregate implementations.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>> Kyle
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On May 4, 2017 1:24 PM, "Eno Thereska" <
> >>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I’ll look as well asap, sorry, been swamped.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>>>>>>>>> On May 4, 2017, at 6:17 PM, Damian Guy <
> >>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Kyle,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the KIP. I apologize that i haven't
> >> had
> >>>>> the
> >>>>>>>>> chance
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>> look
> >>>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>> the KIP yet, but will schedule some time to look
> >>>>> into
> >>>>>> it
> >>>>>>>>>>>>>>> tomorrow.
> >>>>>>>>>>>>>>>>> For
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> implementation, can you raise a PR against kafka
> >>>>> trunk
> >>>>>>>> and
> >>>>>>>>>> mark
> >>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>> WIP?
> >>>>>>>>>>>>>>>>>>>>> It will be easier to review what you have done.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Thu, 4 May 2017 at 11:50 Kyle Winkelman <
> >>>>>>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I am replying to this in hopes it will draw
> >> some
> >>>>>>>> attention
> >>>>>>>>> to
> >>>>>>>>>>> my
> >>>>>>>>>>>>>>>> KIP
> >>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>> I
> >>>>>>>>>>>>>>>>>>>>>> haven't heard from anyone in a couple days.
> >> This
> >>>>> is my
> >>>>>>>>> first
> >>>>>>>>>>> KIP
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>>>>>> first large contribution to the project so I'm
> >>>>> sure I
> >>>>>>>> did
> >>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>>> wrong.
> >>>>>>>>>>>>>>>>>>>>>> ;)
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On May 1, 2017 4:18 PM, "Kyle Winkelman" <
> >>>>>>>>>>>>>>>> [email protected]>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I have created KIP-150 to facilitate
> >> discussion
> >>>>> about
> >>>>>>>>> adding
> >>>>>>>>>>>>>>>>> cogroup
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>>>> the streams DSL.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Please find the KIP here:
> >>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-150+-+Kafka-Streams+Cogroup
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Please find my initial implementation here:
> >>>>>>>>>>>>>>>>>>>>>>> https://github.com/KyleWinkelman/kafka
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>> Kyle Winkelman
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> -- Guozhang
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -- Guozhang
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> -- Guozhang
> >>>>>
> >>>>
> >>>
> >>
>
>


-- 
-- Guozhang
