Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Matthias J. Sax Sun, 16 Sep 2018 12:04:30 -0700

Thant for the update Bill. LGTM.


On 9/14/18 5:33 PM, Guozhang Wang wrote:
> Hello Bill,
> 
> I've made another pass over the wiki page and it lgtm now. Thanks!
> 
> 
> Guozhang
> 
> 
> On Fri, Sep 14, 2018 at 3:05 PM, John Roesler <j...@confluent.io> wrote:
> 
>> Hey Bill (and Guozhang),
>>
>> Apologies. I misunderstood what Guozhang was getting at in his earlier
>> remark.
>>
>> I'm a +1 on the current proposal.
>>
>> Thanks,
>> -John
>>
>> On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck <bbej...@gmail.com> wrote:
>>
>>> All,
>>>
>>> Thanks for the comments, I'll respond to all of your comments in the
>>> recieved order.
>>>
>>>
>>> Guozhang,
>>>
>>>>  And if [join-name] not specified, stay the same, which is:
>>>> * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>> join
>>> and S-T join
>>>
>>>   I believe the current approach is
>>> [appId]-[previous-processor-name]-repartition, but the main point is if
>> no
>>> name is provided, the current naming scheme will remain in place
>>>
>>>>  I think it is more natural to rename it to
>>>>  * [app-id]-[join-name]-left-repartition
>>>>  * [app-id]-[join-name]-right-repartition
>>>
>>> I agree and I'll update the KIP accordingly
>>>
>>>> 2 I'd suggest to use the name to also define the corresponding
>> processor
>>>
>>> I'll address this in conjunction with comments from Matthias and your
>>> second round of comments.
>>>
>>>
>>>> 3. Could you also specify how this would affect the optimization for
>>>
>>>                  merging multiple repartition topics?
>>>
>>>           With an optimization where we reduce the number of repartition
>>> topics, if the user has not named the repartition topics, we'll generate
>> a
>>> name for
>>>           the optimized repartition topic. In the case where users have
>>> provided names for the repartition topics,
>>>           we'll reuse the name from the first of the named repartition
>>> topics.
>>>           However, optimizing a topology by reducing the number of
>>> repartition topics may require an application reset, but that is beyond
>> the
>>> scope of this KIP.
>>>
>>>           I'll update the KIP with this case
>>>
>>>           I've also addressed 4.a and 4.b in the "Compatibility,
>>> Deprecation and Migration plan" section
>>>
>>>
>>> Matthias
>>>
>>>> 1) We should either use `[app-id]-this|other-[join-name]-repartition`
>> or
>>>> `app-id]-[join-name]-left|right-repartition` but we should not change
>>>> the pattern depending if the user specifies a name of not.
>>>
>>> I've updated the KIP to say in the case of Joins and a name is provided
>>> we'll go with `app-id-name-left|right-repartition` in all cases if the
>> name
>>> is not provided
>>> we will continue to use the current naming process.
>>>
>>>> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
>>>> to be orthogonal, and thus KIP-372 should not change any processor
>>>> names, but KIP-307 should define a holistic strategy for all processor.
>>>> Otherwise, we might up with different strategies or revert what we
>>>> decide in this KIP if it's not compatible with KIP-307.
>>>
>>> I agree and have updated the name of this KIP to be more precise.
>>>
>>> Eno
>>>
>>>> I know we don't normally have a "Related work" section in KIPs, but
>>>> sometimes I find it useful to see what others have done in similar
>> cases.
>>>> Since this will be important for rolling re-deployments, I wonder what
>>>> other frameworks like Flink (or Samza) have done in these cases.
>> Perhaps
>>>> they have done nothing, in which case it's fine to do this from first
>>>> principles, but IMO it would be good to know just to make sure we're
>>>> heading in the right direction.
>>>
>>>> Also I don't get a good feel for how much work this will be for an end
>>> user
>>>> who is doing the rolling deployment, perhaps an end-to-end example
>> would
>>>> help.
>>>
>>> I can't speak for the other projects, so I'll have to take a look and
>> see.
>>>
>>> As for getting a feel for how much work this will be for an end user,
>> could
>>> you look at the
>>> updated KIP and see if it clears things up?
>>>
>>>
>>> Matthias
>>>
>>>> (1) For `Grouped` should we add `with(String name, Serde<K> key,
>>>> Serde<V> value)` to allow specifying all parameters at once?
>>>
>>>> Produced/Consumed/Serialized etc follow a similar pattern. There is one
>>>> static method for each config parameter, plus one static method that
>>>> accepts all parameters. Might be more consistent if we follow this
>>> pattern.
>>>
>>>> (2) It seems that `Serialized` is only used in `groupBy` and
>>>> `groupByKey` -- because both methods accepting `Serialized` parameter
>>>> are deprecated and replaced with methods accepting `Grouped`, it seems
>>>> that we also want to deprecate `Serialized`?
>>>
>>>> (3) About naming repartition topics: thinking about this once more, I
>>>> actually prefer to use `left|right` instead of `this|other` :)
>>>
>>> For 1, I agree and I'll update the KIP.
>>>
>>> For 2, Yes that seems to make sense and I'll update the KIP
>>>
>>> Part 3 addressed in earlier comments and the KIP is clarified.
>>>
>>> Guozhang
>>>
>>>> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
>>>> `groupBy/groupByKey` naming schemes, and for joins its current proposal
>>> is
>>>> to extend ValueJoiner with Named and hence this part is what I meant to
>>>> have "overlaps".
>>>
>>>> Thinking about it a bit more, since Joined is only used for S-S and S-T
>>>> joins but not T-T joins, having the naming schemes on Joined would not
>> be
>>>> sufficient, and extending ValueJoiner would indeed be a good choice.
>>>
>>>> As for groupBy, since it is using KeyValueMapper which is supposed to
>> be
>>>> extended with Named in KIP-307, it does not require extending to
>>> processor
>>>> nodes as well.
>>>
>>>> Given this, I'm fine with limiting the scope to only repartition
>> topics.
>>>
>>> I agree with limiting the scope of this KIP to only repartition topics
>> and
>>> have updated the KIP
>>>
>>> John
>>>
>>>> I think it's slightly out of scope for this KIP, but I'm not sure it's
>>>> right to add a name to ValueJoiner or KeyValueMapper.
>>>> Both of those are "functional interfaces", that is, they are basically
>>>> named functions.
>>>>
>>>> It seems like we should preserve this property both to provide a clean
>>>> separation between the operation *logic* and the operator *config*.
>>>>
>>>> It's a good point that `Joined` doesn't appear in the KTable join.
>>> However,
>>>> KTable join does have the variant that accepts "Materialized".
>>>> This makes it similar to the grouping operators that currently accept
>>>> "Serialized", namely the operator is already configurable, but only in
>>>> a narrow way (we can only configure the materialization of the table or
>>> the
>>>> serialization of the grouping).
>>>>
>>>> Consider as an alternative the KStream.join operation that takes a
>> config
>>>> object called "Joined". This implies no more than that we are
>>>> configuring the join operation. If we are deciding to allow adding a
>> name
>>>> to the operation, it's natural to add it to the operation's config.
>>>>
>>>> Conversely, we also want to name the KTable.join operation, not the
>>>> materialization itself (which already has a name with separate
>>> semantics).
>>>>
>>>> To me, this suggests that, just like we propose to subsume the
>> Serialized
>>>> config for grouping into a more general config called Grouped, we
>> should
>>>> also want to do something similar and replace `KTable#join(KTable<>,
>>>> ValueJoiner<>, Materialized<>)` with one taking a more general
>>>> configuration, maybe:
>>>> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>>>
>>> I'm not sure I completely follow what you are suggesting.  But it seems
>>> like you are in favor of combining the naming of join
>>>
>>> and grouping operation and processors with this KIP.
>>>
>>> I'll say that I don't agree we should do the renaming in this KIP and
>>> restrict it to naming repartition topics
>>>
>>>
>>>    1. It gives us a good head start for users to be able to do rolling
>>>    updates once repartition topics have been named
>>>    2. Undertaking any KIP changes almost always introduces some
>> uncertainty
>>>    with respect to what changes will introduce, so a smaller scope helps
>>>    ensure the success of the KIP
>>>    3. We have an existing proposal KIP-307 for processor naming and I
>> think
>>>    it's best to do it all in one singular KIP so we don't have to undo
>> any
>>>    inconsistent previous work.
>>>
>>> So I'll have to agree with Matthias and Guozhang in limiting the scope of
>>> this KIP to naming repartition topics.
>>>
>>> To that end, I'll also say we defer adding another general configuration
>>> object `TableJoined` to the work of naming operations and processors in
>>> KIP-307
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Sep 13, 2018 at 6:49 PM John Roesler <j...@confluent.io> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I think it's slightly out of scope for this KIP, but I'm not sure it's
>>>> right to add a name to ValueJoiner or KeyValueMapper.
>>>> Both of those are "functional interfaces", that is, they are basically
>>>> named functions.
>>>>
>>>> It seems like we should preserve this property both to provide a clean
>>>> separation between the operation *logic* and the operator *config*.
>>>>
>>>> It's a good point that `Joined` doesn't appear in the KTable join.
>>> However,
>>>> KTable join does have the variant that accepts "Materialized".
>>>> This makes it similar to the grouping operators that currently accept
>>>> "Serialized", namely the operator is already configurable, but only in
>>>> a narrow way (we can only configure the materialization of the table or
>>> the
>>>> serialization of the grouping).
>>>>
>>>> Consider as an alternative the KStream.join operation that takes a
>> config
>>>> object called "Joined". This implies no more than that we are
>>>> configuring the join operation. If we are deciding to allow adding a
>> name
>>>> to the operation, it's natural to add it to the operation's config.
>>>>
>>>> Conversely, we also want to name the KTable.join operation, not the
>>>> materialization itself (which already has a name with separate
>>> semantics).
>>>>
>>>> To me, this suggests that, just like we propose to subsume the
>> Serialized
>>>> config for grouping into a more general config called Grouped, we
>> should
>>>> also want to do something similar and replace `KTable#join(KTable<>,
>>>> ValueJoiner<>, Materialized<>)` with one taking a more general
>>>> configuration, maybe:
>>>> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wangg...@gmail.com>
>>> wrote:
>>>>
>>>>> Just to clarify on 2): currently KIP-307 do not have proposed APIs
>> for
>>>>> `groupBy/groupByKey` naming schemes, and for joins its current
>> proposal
>>>> is
>>>>> to extend ValueJoiner with Named and hence this part is what I meant
>> to
>>>>> have "overlaps".
>>>>>
>>>>> Thinking about it a bit more, since Joined is only used for S-S and
>> S-T
>>>>> joins but not T-T joins, having the naming schemes on Joined would
>> not
>>> be
>>>>> sufficient, and extending ValueJoiner would indeed be a good choice.
>>>>>
>>>>> As for groupBy, since it is using KeyValueMapper which is supposed to
>>> be
>>>>> extended with Named in KIP-307, it does not require extending to
>>>> processor
>>>>> nodes as well.
>>>>>
>>>>>
>>>>> Given this, I'm fine with limiting the scope to only repartition
>>> topics.
>>>>>
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <
>>> matth...@confluent.io
>>>>>
>>>>> wrote:
>>>>>
>>>>>> Follow up comments:
>>>>>>
>>>>>> 1) We should either use `[app-id]-this|other-[join-
>> name]-repartition`
>>>> or
>>>>>> `app-id]-[join-name]-left|right-repartition` but we should not
>> change
>>>>>> the pattern depending if the user specifies a name of not. I am
>> fine
>>>>>> with both patterns---just want to make sure with stick with one.
>>>>>>
>>>>>> 2) I didn't see why we would need to do this in this KIP. KIP-307
>>> seems
>>>>>> to be orthogonal, and thus KIP-372 should not change any processor
>>>>>> names, but KIP-307 should define a holistic strategy for all
>>> processor.
>>>>>> Otherwise, we might up with different strategies or revert what we
>>>>>> decide in this KIP if it's not compatible with KIP-307.
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 9/12/18 6:28 PM, Guozhang Wang wrote:
>>>>>>> Hello Bill,
>>>>>>>
>>>>>>> I made a pass over your proposal and here are some questions:
>>>>>>>
>>>>>>> 1. For Joined names, the current proposal is to define the
>>>> repartition
>>>>>>> topic names as
>>>>>>>
>>>>>>> * [app-id]-this-[join-name]-repartition
>>>>>>>
>>>>>>> * [app-id]-other-[join-name]-repartition
>>>>>>>
>>>>>>>
>>>>>>> And if [join-name] not specified, stay the same, which is:
>>>>>>>
>>>>>>> * [previous-processor-name]-repartition for both Stream-Stream
>>> (S-S)
>>>>>> join
>>>>>>> and S-T join
>>>>>>>
>>>>>>> I think it is more natural to rename it to
>>>>>>>
>>>>>>> * [app-id]-[join-name]-left-repartition
>>>>>>>
>>>>>>> * [app-id]-[join-name]-right-repartition
>>>>>>>
>>>>>>>
>>>>>>> 2. I'd suggest to use the name to also define the corresponding
>>>>> processor
>>>>>>> names accordingly, in addition to the repartition topic names.
>> Note
>>>>> that
>>>>>>> for joins, this may be overlapping with KIP-307
>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
>>>>>>> as
>>>>>>> it also have proposals for defining processor names for join
>>>> operators
>>>>> as
>>>>>>> well.
>>>>>>>
>>>>>>> 3. Could you also specify how this would affect the optimization
>>> for
>>>>>>> merging multiple repartition topics?
>>>>>>>
>>>>>>> 4. In the "Compatibility, Deprecation, and Migration Plan"
>> section,
>>>>> could
>>>>>>> you also mention the following scenarios, if any of the upgrade
>>> path
>>>>>> would
>>>>>>> be changed:
>>>>>>>
>>>>>>>  a) changing user DSL code: under which scenarios users can now
>> do
>>> a
>>>>>>> rolling bounce instead of resetting applications.
>>>>>>>
>>>>>>>  b) upgrading from older version to new version, with all the
>> names
>>>>>>> specified, and with optimization turned on. E.g. say we have the
>>> code
>>>>>>> written in 2.1 with all names specified, and now upgrading to 2.2
>>>> with
>>>>>> new
>>>>>>> optimizations that may potentially change the repartition topics.
>>> Is
>>>>> that
>>>>>>> always safe to do?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bbej...@gmail.com>
>>>>> wrote:
>>>>>>>
>>>>>>>> All I'd like to start a discussion on KIP-372 for the naming of
>>>> joins
>>>>>> and
>>>>>>>> grouping operations in Kafka Streams.
>>>>>>>>
>>>>>>>> The KIP page can be found here:
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>> 372%3A+Naming+Joins+and+Grouping
>>>>>>>>
>>>>>>>> I look forward to feedback and comments.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Bill
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -- Guozhang
>>>>>
>>>>
>>>
>>
> 
> 
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Reply via email to