Re: [DISCUSS] KIP-372: Naming Joins and Grouping

John Roesler Fri, 14 Sep 2018 15:06:38 -0700

Hey Bill (and Guozhang),

Apologies. I misunderstood what Guozhang was getting at in his earlier
remark.


I'm a +1 on the current proposal.

Thanks,
-John

On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck <[email protected]> wrote:

> All,
>
> Thanks for the comments, I'll respond to all of your comments in the
> recieved order.
>
>
> Guozhang,
>
> >  And if [join-name] not specified, stay the same, which is:
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
> and S-T join
>
>   I believe the current approach is
> [appId]-[previous-processor-name]-repartition, but the main point is if no
> name is provided, the current naming scheme will remain in place
>
> >  I think it is more natural to rename it to
> >  * [app-id]-[join-name]-left-repartition
> >  * [app-id]-[join-name]-right-repartition
>
> I agree and I'll update the KIP accordingly
>
> > 2 I'd suggest to use the name to also define the corresponding processor
>
> I'll address this in conjunction with comments from Matthias and your
> second round of comments.
>
>
> >3. Could you also specify how this would affect the optimization for
>
>                  merging multiple repartition topics?
>
>           With an optimization where we reduce the number of repartition
> topics, if the user has not named the repartition topics, we'll generate a
> name for
>           the optimized repartition topic. In the case where users have
> provided names for the repartition topics,
>           we'll reuse the name from the first of the named repartition
> topics.
>           However, optimizing a topology by reducing the number of
> repartition topics may require an application reset, but that is beyond the
> scope of this KIP.
>
>           I'll update the KIP with this case
>
>           I've also addressed 4.a and 4.b in the "Compatibility,
> Deprecation and Migration plan" section
>
>
> Matthias
>
> > 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> > `app-id]-[join-name]-left|right-repartition` but we should not change
> > the pattern depending if the user specifies a name of not.
>
> I've updated the KIP to say in the case of Joins and a name is provided
> we'll go with `app-id-name-left|right-repartition` in all cases if the name
> is not provided
> we will continue to use the current naming process.
>
> > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > to be orthogonal, and thus KIP-372 should not change any processor
> > names, but KIP-307 should define a holistic strategy for all processor.
> > Otherwise, we might up with different strategies or revert what we
> > decide in this KIP if it's not compatible with KIP-307.
>
> I agree and have updated the name of this KIP to be more precise.
>
> Eno
>
> > I know we don't normally have a "Related work" section in KIPs, but
> > sometimes I find it useful to see what others have done in similar cases.
> > Since this will be important for rolling re-deployments, I wonder what
> > other frameworks like Flink (or Samza) have done in these cases. Perhaps
> > they have done nothing, in which case it's fine to do this from first
> > principles, but IMO it would be good to know just to make sure we're
> > heading in the right direction.
>
> > Also I don't get a good feel for how much work this will be for an end
> user
> > who is doing the rolling deployment, perhaps an end-to-end example would
> > help.
>
> I can't speak for the other projects, so I'll have to take a look and see.
>
> As for getting a feel for how much work this will be for an end user, could
> you look at the
> updated KIP and see if it clears things up?
>
>
> Matthias
>
> > (1) For `Grouped` should we add `with(String name, Serde<K> key,
> > Serde<V> value)` to allow specifying all parameters at once?
>
> > Produced/Consumed/Serialized etc follow a similar pattern. There is one
> > static method for each config parameter, plus one static method that
> > accepts all parameters. Might be more consistent if we follow this
> pattern.
>
> > (2) It seems that `Serialized` is only used in `groupBy` and
> > `groupByKey` -- because both methods accepting `Serialized` parameter
> > are deprecated and replaced with methods accepting `Grouped`, it seems
> > that we also want to deprecate `Serialized`?
>
> > (3) About naming repartition topics: thinking about this once more, I
> > actually prefer to use `left|right` instead of `this|other` :)
>
> For 1, I agree and I'll update the KIP.
>
> For 2, Yes that seems to make sense and I'll update the KIP
>
> Part 3 addressed in earlier comments and the KIP is clarified.
>
> Guozhang
>
> > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> is
> > to extend ValueJoiner with Named and hence this part is what I meant to
> > have "overlaps".
>
> > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > joins but not T-T joins, having the naming schemes on Joined would not be
> > sufficient, and extending ValueJoiner would indeed be a good choice.
>
> > As for groupBy, since it is using KeyValueMapper which is supposed to be
> > extended with Named in KIP-307, it does not require extending to
> processor
> > nodes as well.
>
> > Given this, I'm fine with limiting the scope to only repartition topics.
>
> I agree with limiting the scope of this KIP to only repartition topics and
> have updated the KIP
>
> John
>
> > I think it's slightly out of scope for this KIP, but I'm not sure it's
> > right to add a name to ValueJoiner or KeyValueMapper.
> > Both of those are "functional interfaces", that is, they are basically
> > named functions.
> >
> > It seems like we should preserve this property both to provide a clean
> > separation between the operation *logic* and the operator *config*.
> >
> > It's a good point that `Joined` doesn't appear in the KTable join.
> However,
> > KTable join does have the variant that accepts "Materialized".
> > This makes it similar to the grouping operators that currently accept
> > "Serialized", namely the operator is already configurable, but only in
> > a narrow way (we can only configure the materialization of the table or
> the
> > serialization of the grouping).
> >
> > Consider as an alternative the KStream.join operation that takes a config
> > object called "Joined". This implies no more than that we are
> > configuring the join operation. If we are deciding to allow adding a name
> > to the operation, it's natural to add it to the operation's config.
> >
> > Conversely, we also want to name the KTable.join operation, not the
> > materialization itself (which already has a name with separate
> semantics).
> >
> > To me, this suggests that, just like we propose to subsume the Serialized
> > config for grouping into a more general config called Grouped, we should
> > also want to do something similar and replace `KTable#join(KTable<>,
> > ValueJoiner<>, Materialized<>)` with one taking a more general
> > configuration, maybe:
> > `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>
> I'm not sure I completely follow what you are suggesting.  But it seems
> like you are in favor of combining the naming of join
>
> and grouping operation and processors with this KIP.
>
> I'll say that I don't agree we should do the renaming in this KIP and
> restrict it to naming repartition topics
>
>
>    1. It gives us a good head start for users to be able to do rolling
>    updates once repartition topics have been named
>    2. Undertaking any KIP changes almost always introduces some uncertainty
>    with respect to what changes will introduce, so a smaller scope helps
>    ensure the success of the KIP
>    3. We have an existing proposal KIP-307 for processor naming and I think
>    it's best to do it all in one singular KIP so we don't have to undo any
>    inconsistent previous work.
>
> So I'll have to agree with Matthias and Guozhang in limiting the scope of
> this KIP to naming repartition topics.
>
> To that end, I'll also say we defer adding another general configuration
> object `TableJoined` to the work of naming operations and processors in
> KIP-307
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Sep 13, 2018 at 6:49 PM John Roesler <[email protected]> wrote:
>
> > Hey all,
> >
> > I think it's slightly out of scope for this KIP, but I'm not sure it's
> > right to add a name to ValueJoiner or KeyValueMapper.
> > Both of those are "functional interfaces", that is, they are basically
> > named functions.
> >
> > It seems like we should preserve this property both to provide a clean
> > separation between the operation *logic* and the operator *config*.
> >
> > It's a good point that `Joined` doesn't appear in the KTable join.
> However,
> > KTable join does have the variant that accepts "Materialized".
> > This makes it similar to the grouping operators that currently accept
> > "Serialized", namely the operator is already configurable, but only in
> > a narrow way (we can only configure the materialization of the table or
> the
> > serialization of the grouping).
> >
> > Consider as an alternative the KStream.join operation that takes a config
> > object called "Joined". This implies no more than that we are
> > configuring the join operation. If we are deciding to allow adding a name
> > to the operation, it's natural to add it to the operation's config.
> >
> > Conversely, we also want to name the KTable.join operation, not the
> > materialization itself (which already has a name with separate
> semantics).
> >
> > To me, this suggests that, just like we propose to subsume the Serialized
> > config for grouping into a more general config called Grouped, we should
> > also want to do something similar and replace `KTable#join(KTable<>,
> > ValueJoiner<>, Materialized<>)` with one taking a more general
> > configuration, maybe:
> > `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
> >
> > Does this make sense?
> >
> > Thanks,
> > -John
> >
> > On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <[email protected]>
> wrote:
> >
> > > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> > is
> > > to extend ValueJoiner with Named and hence this part is what I meant to
> > > have "overlaps".
> > >
> > > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > > joins but not T-T joins, having the naming schemes on Joined would not
> be
> > > sufficient, and extending ValueJoiner would indeed be a good choice.
> > >
> > > As for groupBy, since it is using KeyValueMapper which is supposed to
> be
> > > extended with Named in KIP-307, it does not require extending to
> > processor
> > > nodes as well.
> > >
> > >
> > > Given this, I'm fine with limiting the scope to only repartition
> topics.
> > >
> > >
> > > Guozhang
> > >
> > > On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <
> [email protected]
> > >
> > > wrote:
> > >
> > > > Follow up comments:
> > > >
> > > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> > or
> > > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > > the pattern depending if the user specifies a name of not. I am fine
> > > > with both patterns---just want to make sure with stick with one.
> > > >
> > > > 2) I didn't see why we would need to do this in this KIP. KIP-307
> seems
> > > > to be orthogonal, and thus KIP-372 should not change any processor
> > > > names, but KIP-307 should define a holistic strategy for all
> processor.
> > > > Otherwise, we might up with different strategies or revert what we
> > > > decide in this KIP if it's not compatible with KIP-307.
> > > >
> > > >
> > > > -Matthias
> > > >
> > > >
> > > > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > > > Hello Bill,
> > > > >
> > > > > I made a pass over your proposal and here are some questions:
> > > > >
> > > > > 1. For Joined names, the current proposal is to define the
> > repartition
> > > > > topic names as
> > > > >
> > > > > * [app-id]-this-[join-name]-repartition
> > > > >
> > > > > * [app-id]-other-[join-name]-repartition
> > > > >
> > > > >
> > > > > And if [join-name] not specified, stay the same, which is:
> > > > >
> > > > > * [previous-processor-name]-repartition for both Stream-Stream
> (S-S)
> > > > join
> > > > > and S-T join
> > > > >
> > > > > I think it is more natural to rename it to
> > > > >
> > > > > * [app-id]-[join-name]-left-repartition
> > > > >
> > > > > * [app-id]-[join-name]-right-repartition
> > > > >
> > > > >
> > > > > 2. I'd suggest to use the name to also define the corresponding
> > > processor
> > > > > names accordingly, in addition to the repartition topic names. Note
> > > that
> > > > > for joins, this may be overlapping with KIP-307
> > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > > > as
> > > > > it also have proposals for defining processor names for join
> > operators
> > > as
> > > > > well.
> > > > >
> > > > > 3. Could you also specify how this would affect the optimization
> for
> > > > > merging multiple repartition topics?
> > > > >
> > > > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> > > could
> > > > > you also mention the following scenarios, if any of the upgrade
> path
> > > > would
> > > > > be changed:
> > > > >
> > > > >  a) changing user DSL code: under which scenarios users can now do
> a
> > > > > rolling bounce instead of resetting applications.
> > > > >
> > > > >  b) upgrading from older version to new version, with all the names
> > > > > specified, and with optimization turned on. E.g. say we have the
> code
> > > > > written in 2.1 with all names specified, and now upgrading to 2.2
> > with
> > > > new
> > > > > optimizations that may potentially change the repartition topics.
> Is
> > > that
> > > > > always safe to do?
> > > > >
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <[email protected]>
> > > wrote:
> > > > >
> > > > >> All I'd like to start a discussion on KIP-372 for the naming of
> > joins
> > > > and
> > > > >> grouping operations in Kafka Streams.
> > > > >>
> > > > >> The KIP page can be found here:
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > >> 372%3A+Naming+Joins+and+Grouping
> > > > >>
> > > > >> I look forward to feedback and comments.
> > > > >>
> > > > >> Thanks,
> > > > >> Bill
> > > > >>
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Reply via email to