Re: [DISCUSS] KIP-307: Allow to define custom processor names with KStreams DSL

Florian Hussonnois Thu, 05 Jul 2018 14:53:57 -0700

Hi, thank you very much for all you suggestions. I've started to update the
KIP (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL
).
Also, I propose to rename the Processed class into Described - this will be
more meaningful (but this is just a detail).


I'm OK to not enforcing uppercase for specific names but should we allow
arbitrary names with whitespaces for example ? Currently, I can't tell if
this can lead to some side effects ?

Le lun. 11 juin 2018 à 01:31, Matthias J. Sax <matth...@confluent.io> a
écrit :

> Just catching up on this thread.
>
> I like the general idea. Couple of comments:
>
>  - I think that adding `Processed` (or maybe a different name?) is a
> valid proposal for stateless operators that only have a single overload
> atm. It would align with the overall API design.
>
>  - for all methods with multiple existing overloads, we can consider to
> extend `Consumed`, `Produced`, `Materialized` etc to take an additional
> processor name (not sure atm how elegant this is; we would need to
> "play" with the API a little bit; the advantage would be, that we do not
> add more overloads what seems to be key for this KIP)
>
>  - operators return void: while I agree that the "name first" chaining
> idea is not very intuitive, it might still work, if we name the method
> correctly (again, we would need to "play" with the API a little bit to see)
>
>  - for DSL operators that are translated to multiple nodes: it might
> make sense to use the specified operator name as prefix and add
> reasonable suffixes. For example, a join translates into 5 operators
> that could be name "name-left-store-processor",
> "name-left-join-processor", "name-right-store-processor",
> "name-right-join-processor", and "name-join-merge-processor" (or
> similar). Maybe just using numbers might also work.
>
>  - I think, we should strip the number suffixes if a user provides names
>
>  - enforcing upper case seems to be tricky: for example, we do not
> enforce upper case for store names and we cannot easily change it as it
> would break compatibility -- thus, for consistency reasons we might not
> want to do this
>
>  - for better understand of the impact of the KIP, it would be quite
> helpful if you would list all method names that are affected in the KIP
> (ie, list all newly added overloads)
>
>
> -Matthias
>
>
> On 5/31/18 6:40 PM, Guozhang Wang wrote:
> > Hi Florian,
> >
> > Re 1: I think changing the KStreamImpl / KTableImpl to allow modifying
> the
> > processor name after the operator is fine as long as we do the check
> again
> > when modifying that. In fact, we are having some topology optimization
> > going on which may modify processor names in the final topology anyways (
> > https://github.com/apache/kafka/pull/4983). Semantically I think it is
> > easier to understand to developers than "deciding the processor name for
> > the next operator".
> >
> > Re 2: Yeah I'm thinking that for operators that translates to multiple
> > processor names, we can still use the provided "hint" to name the
> processor
> > names, e.g. for Joins we can name them as `join-foo-this` and
> > `join-foo-that` etc if user calls `as("foo")`.
> >
> > Re 3: The motivation I had about removing the suffix is that it has huge
> > restrictions on topology compatibilities: consider if user code added a
> new
> > operator, or library does some optimization to remove some operators, the
> > suffix indexing may be changed for a large amount of the processor names:
> > this will in turn change the internal state store names, as well as
> > internal topic names as well, making the new application topology to be
> > incompatible with the ones. One rationale I had about this KIP is that
> > aligned this effort, moving forward we can allow users to customize
> > internal names so that they can still be reused even with topology
> changes
> > (e.g. KIP-230), so I think removing the suffix index would be more
> > applicable in the long run.
> >
> >
> >
> > Guozhang
> >
> >
> >
> >
> >
> > On Thu, May 31, 2018 at 3:08 PM, Florian Hussonnois <
> fhussonn...@gmail.com>
> > wrote:
> >
> >> Hi ,
> >> Thank you very much for your feedback.
> >>
> >> 1/
> >> I agree that overloading most of the methods with a Processed is not
> ideal.
> >> I've started modifying the KStream API and I got to the same conclusion.
> >> Also ading a new method directly to KStreamImpl and KTableImpl classes
> >> seems to be a better option.
> >>
> >> However a processor name cannot be redefined after calling an operator
> (or
> >> maybe I miss something in the code).
> >> From my understanding, this will only set the KStream name property not
> the
> >> processor name previsouly added to the topology builder - leading to
> >> InvalidTopology exception.
> >>
> >> So the new method should actually defines the name of the next
> processor :
> >> Below is an example :
> >>
> >> *stream.as <http://stream.as>(Processed.name("MAPPE_TO_UPPERCASE")*
> >> *          .map( (k, v) -> KeyValue.pair(k, v.toUpperCase()))*
> >>
> >> I think this approach could solve the cases for methods returning void ?
> >>
> >> Regarding this new method we have two possible implementations :
> >>
> >>    1. Adding a method like : withName(String processorName)
> >>    2. or adding a method accepting an Processed object : as(Processed).
> >>
> >> I think solution 2. is preferable as the Processed class could be
> enriched
> >> further (in futur).
> >>
> >> 2/
> >> As Guozhang said some operators add internal processors.
> >> For example the branch() method create one KStreamBranch processor to
> route
> >> records and one KStreamPassThrough processor for each branch.
> >> In that situation only the parent processor can be named. For children
> >> processors we could keep the current behaviour that add a suffix (i.e
> >> KSTREAM-BRANCHCHILD-)
> >>
> >> This also the case for the join() method that result to adding multiple
> >> processors to the topology (windowing, left/right joins and a merge
> >> processor).
> >> I think, like for the branch method users could only define a processor
> >> name prefix.
> >>
> >> 3/
> >> I think we should  still added a suffix like "-0000000000" to processor
> >> name and enforce uppercases as this will keep some consistency with the
> >> ones generated by the API.
> >>
> >> 4/
> >> Yes, the KTable interface should be modified like KStream to allow
> custom
> >> processor names definition.
> >>
> >> Thanks,
> >>
> >>
> >> Le jeu. 31 mai 2018 à 19:18, Damian Guy <damian....@gmail.com> a écrit
> :
> >>
> >>> Hi Florian,
> >>>
> >>> Thanks for the KIP. What about KTable and other DSL interfaces? Will
> they
> >>> not want to be able to do the same thing?
> >>> It would be good to see a complete set of the public API changes.
> >>>
> >>> Cheers,
> >>> Damian
> >>>
> >>> On Wed, 30 May 2018 at 19:45 Guozhang Wang <wangg...@gmail.com> wrote:
> >>>
> >>>> Hello Florian,
> >>>>
> >>>> Thanks for the KIP. I have some meta feedbacks on the proposal:
> >>>>
> >>>> 1. You mentioned that this `Processed` object will be added to a new
> >>>> overloaded variant of all the stateless operators, what about the
> >>> stateful
> >>>> operators? Would like to hear your opinions if you have thought about
> >>> that:
> >>>> note for stateful operators they will usually be mapped to multiple
> >>>> processor node names, so we probably need to come up with some ways to
> >>>> define all their names.
> >>>>
> >>>> 2. I share the same concern with Bill as for adding lots of new
> >> overload
> >>>> functions into the stateless operators, as we have just spent quite
> >> some
> >>>> effort in trimming them since 1.0.0 release. If the goal is to just
> >>> provide
> >>>> some "hints" on the generated processor node names, not strictly
> >>> enforcing
> >>>> the exact names that to be generated, then how about we just add a new
> >>>> function to `KStream` and `KTable` classes like: "as(Processed)", with
> >>> the
> >>>> semantics as "the latest operators that generate this KStream / KTable
> >>> will
> >>>> be named accordingly to this hint".
> >>>>
> >>>> The only caveat, is that for all operators like `KStream#to` and
> >>>> `KStream#print` that returns void, this alternative would not work.
> But
> >>> for
> >>>> the current operators:
> >>>>
> >>>> a. KStream#print,
> >>>> b. KStream#foreach,
> >>>> c. KStream#to,
> >>>> d. KStream#process
> >>>>
> >>>> I personally felt that except `KStream#process` users would not
> usually
> >>>> bother to override their names, and for `KStream#process` we could add
> >> an
> >>>> overload variant with the additional Processed object.
> >>>>
> >>>>
> >>>> 3. In your example, the processor names are still added with a suffix
> >>> like
> >>>> "
> >>>> -0000000000", is this intentional? If yes, why (I thought with user
> >>>> specified processor name hints we will not add suffix to distinguish
> >>>> different nodes of the same type any more)?
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Tue, May 29, 2018 at 6:47 AM, Bill Bejeck <bbej...@gmail.com>
> >> wrote:
> >>>>
> >>>>> Hi Florian,
> >>>>>
> >>>>> Thanks for the KIP.  I think being able to add more context to the
> >>>>> processor names would be useful.
> >>>>>
> >>>>> I like the idea of adding a "withProcessorName" to Produced, Consumed
> >>> and
> >>>>> Joined.
> >>>>>
> >>>>> But instead of adding the "Processed" parameter to a large percentage
> >>> of
> >>>>> the methods, which would result in overloaded methods (which we
> >> removed
> >>>>> quite a bit with KIP-182) what do you think of adding a method
> >>>>> to the AbstractStream class "withName(String processorName)"? BTW I"m
> >>> not
> >>>>> married to the method name, it's the best I can do off the top of my
> >>>> head.
> >>>>>
> >>>>> For the methods that return void, we'd have to add a parameter, but
> >>> that
> >>>>> would at least cut down on the number of overloaded methods in the
> >> API.
> >>>>>
> >>>>> Just my 2 cents.
> >>>>>
> >>>>> Thanks,
> >>>>> Bill
> >>>>>
> >>>>> On Sun, May 27, 2018 at 4:13 PM, Florian Hussonnois <
> >>>> fhussonn...@gmail.com
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I would like to start a new discussion on following KIP :
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL
> >>>>>>
> >>>>>> This is still a draft.
> >>>>>>
> >>>>>> Looking forward for your feedback.
> >>>>>> --
> >>>>>> Florian HUSSONNOIS
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>>
> >>>
> >>
> >>
> >> --
> >> Florian HUSSONNOIS
> >>
> >
> >
> >
>
>

-- 
Florian HUSSONNOIS

Re: [DISCUSS] KIP-307: Allow to define custom processor names with KStreams DSL

Reply via email to