Re: [DISCUSS] KIP-307: Allow to define custom processor names with KStreams DSL

Matthias J. Sax Fri, 19 Oct 2018 11:14:29 -0700

What is the status of this KIP?

-Matthias


On 7/19/18 5:17 PM, Guozhang Wang wrote:
> Hello Florian,
> 
> Sorry for being late... I find myself apologizing for late replies a lot
> these days. But I do want to push this KIP forward, as I see it as a
> very important and helpful feature for extensibility.
> 
> About the exceptions, I've gone through them and hopefully it is an
> exhaustive list:
> 
> 1. KTable#toStream()
> 2. KStream#merge(KStream)
> 3. KStream#process() / transform() / transformValues()
> 4. KGroupedTable / KGroupedStream#count()
> 
> 
> Here's my reasoning:
> 
> * It is okay not to let users override the names for 1/2, since they are
> too trivial to be useful for debugging, plus their processor names would
> not determine any related topic / store names.
> * For 3, I'd vote for adding overloaded functions with Named.
> * For 4, if a user really wants to name the processor, she can call
> aggregate() instead, so I think it is okay to skip this case.
> 
> 
> Guozhang
> 
> 
> 
> On Fri, Jul 6, 2018 at 3:06 PM, Florian Hussonnois <fhussonn...@gmail.com>
> wrote:
> 
>> Hi,
>>
>> Option #3 seems to be a good alternative, and I find the API more
>> elegant (thanks John).
>>
>> But we still need to overload some methods, either because they do
>> not accept an action instance or because they are translated to multiple
>> processors.
>>
>> For example, this is the case for the methods branch() and merge(). We could
>> introduce a new interface Named (or maybe a different name?) with a method
>> name(). All action interfaces could extend this one to implement option 3).
>> This would result in the following overloads:
>>
>> KStream<K, V> merge(final Named name, final KStream<K, V> stream);
>> KStream<K, V>[] branch(final Named name,
>>                        final Predicate<? super K, ? super V>... predicates);
>>
>> N.B.: The list above is not exhaustive.
>>
>> ---------
>> The user's code would become:
>>
>>         KStream<String, Integer> stream = builder.stream("test");
>>         KStream<String, Integer>[] branches = stream.branch(
>>                 Named.with("BRANCH-STREAM-ON-VALUE"),
>>                 Predicate.named("STREAM-PAIR-VALUE", (k, v) -> v % 2 == 0),
>>                 Predicate.named("STREAM-IMPAIR-VALUE", (k, v) -> v % 2 != 0));
>>
>>         branches[0].to("pair");
>>         branches[1].to("impair");
>> ---------
>>
>> This is a mix of options 3) and 1).
>>
>> On Fri, Jul 6, 2018 at 22:58, Guozhang Wang <wangg...@gmail.com> wrote:
>>
>>> Hi folks, just to summarize the options we have so far:
>>>
>>> 1) Add a new "as" for KTable / KStream, plus add new fields to the
>>> control objects of operators that return void (the current wiki's proposal).
>>>
>>> Pros: no more overloads.
>>> Cons: a bit of a departure from the current high-level API design of the DSL,
>>> plus the inconsistency between operators that return void and
>>> operators that do not.
>>>
>>> 2) Add overloaded functions for all operators that accept a new control
>>> object "Described".
>>>
>>> Pros: consistent with current APIs.
>>> Cons: lots of overloaded functions to add.
>>>
>>> 3) Add another default function in the interface (thank you J8!) as John
>>> proposed.
>>>
>>> Pros: no overloaded functions, no "Described".
>>> Cons: do we really lose lambda functions (it seems not, if we provide a
>>> "named" for each func)? Plus "Described" may be more extensible than a
>>> single `String`.
>>>
>>>
>>> My principle for deciding which one is better is primarily "how
>>> to make it easy for advanced users to use the additional API, while keeping
>>> it hidden from normal users who do not care at all". For that purpose I
>>> think 3) > 1) > 2).
>>>
>>> One caveat, though: changing the interface would be source-compatible
>>> but not binary-compatible, right? I.e., users would need to recompile
>>> their code even though no changes are needed.
>>>
>>>
>>>
>>> Another note: for 3), if we really want to keep the extensibility of
>>> Described, we could do something like:
>>>
>>> ---------
>>>
>>> public interface Predicate<K, V> {
>>>     // existing method
>>>     boolean test(final K key, final V value);
>>>
>>>     // new default method adds the ability to name the predicate
>>>     default Described described() {
>>>         return new Described(null);
>>>     }
>>> }
>>>
>>> ----------
>>>
>>> where user's code becomes:
>>>
>>> // note: `named` now just sets a Described("key") in "described()".
>>> stream.filter(named("key", (k, v) -> true));
>>>
>>> stream.filter(described(
>>>     Described.as("key", /* any other fancy parameters in the future */),
>>>     (k, v) -> true));
>>> ----------
>>>
>>>
>>> I feel it is not very likely that we'd need to extend it further in the
>>> future, so just a `String` would be good enough. I am just listing all
>>> the possibilities here.
>>>
>>>
>>>
>>> Guozhang
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jul 6, 2018 at 8:19 AM, John Roesler <j...@confluent.io> wrote:
>>>
>>>> Hi Florian,
>>>>
>>>> Sorry I'm late to the party, but I missed the message originally.
>>>>
>>>> Regarding the names, it's probably a good idea to stick to the same
>>>> character set we're currently using: letters, numbers, and hyphens. The
>>>> names are used in Kafka topics, files and folders, and RocksDB databases,
>>>> and we also need them to work with the file systems of Windows, Linux,
>>>> and MacOS. My opinion is that with a situation like that, it's better to
>>>> be conservative. It might also be a good idea to impose an upper limit on
>>>> name length to avoid running afoul of any of those systems.
>>>>
>>>> ---
>>>>
>>>> It seems like there's a small debate between 1) adding a new method to
>>>> KStream (and maybe KTable) to modify its name after the fact, or 2)
>>>> piggy-backing on the config objects where they exist and adding one where
>>>> they don't. To me, #2 is the better alternative even though it produces
>>>> more overloads and may be a bit awkward in places.
>>>>
>>>> The reason is simply that #1 is a high-level departure from the
>>>> graph-building paradigm we're using in the DSL. Consider:
>>>>
>>>> Graph.node1(config).node2(config)
>>>>
>>>> vs
>>>>
>>>> Graph.node1().config().node2().config()
>>>>
>>>> We could have done either, but we picked the former. I think it's probably
>>>> a good goal to try and stick to it so that developers can develop and rely
>>>> on their instincts for how the DSL will behave.
>>>>
>>>> I do want to present one alternative to adding new config objects: we can
>>>> just add a "name()" method to all our "action" interfaces. For example,
>>>> I'll demonstrate how we can add a "name" to Predicate and then use it to
>>>> name a "KStream#filter" DSL operator:
>>>>
>>>> public interface Predicate<K, V> {
>>>>     // existing method
>>>>     boolean test(final K key, final V value);
>>>>
>>>>     // new default method adds the ability to name the predicate
>>>>     default String name() {
>>>>         return null;
>>>>     }
>>>>
>>>>     // new static factory method adds the ability to wrap lambda predicates
>>>>     // with a named predicate
>>>>     static <K, V> Predicate<K, V> named(final String name,
>>>>                                         final Predicate<K, V> predicate) {
>>>>         return new Predicate<K, V>() {
>>>>             @Override
>>>>             public boolean test(final K key, final V value) {
>>>>                 return predicate.test(key, value);
>>>>             }
>>>>
>>>>             @Override
>>>>             public String name() {
>>>>                 return name;
>>>>             }
>>>>         };
>>>>     }
>>>> }
>>>>
>>>> Then, here's how it would look to use it:
>>>>
>>>> // Anonymous predicates continue to work just fine
>>>> stream.filter((k, v) -> true);
>>>>
>>>> // Devs can swap in a Predicate that implements the name() method.
>>>> stream.filter(new Predicate<Object, Object>() {
>>>>     @Override
>>>>     public boolean test(final Object key, final Object value) {
>>>>         return true;
>>>>     }
>>>>
>>>>     @Override
>>>>     public String name() {
>>>>         return "hey";
>>>>     }
>>>> });
>>>>
>>>> // Or they can wrap their existing lambda using the static factory method
>>>> stream.filter(named("key", (k, v) -> true));
>>>>
>>>> Just a thought.
>>>>
>>>> Overall, I think it's really valuable to be able to name the processors,
>>>> for all the reasons you mentioned in the KIP. So thank you for introducing
>>>> this!
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Thu, Jul 5, 2018 at 4:53 PM Florian Hussonnois <fhussonn...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, thank you very much for all your suggestions. I've started to update
>>>>> the KIP (
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL
>>>>> ).
>>>>> Also, I propose renaming the Processed class to Described - this will be
>>>>> more meaningful (but this is just a detail).
>>>>>
>>>>> I'm OK with not enforcing uppercase for specific names, but should we allow
>>>>> arbitrary names, with whitespace for example? Currently, I can't tell if
>>>>> this could lead to side effects.
>>>>>
>>>>> On Mon, Jun 11, 2018 at 01:31, Matthias J. Sax <matth...@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> Just catching up on this thread.
>>>>>>
>>>>>> I like the general idea. Couple of comments:
>>>>>>
>>>>>>  - I think that adding `Processed` (or maybe a different name?) is a
>>>>>> valid proposal for stateless operators that only have a single overload
>>>>>> atm. It would align with the overall API design.
>>>>>>
>>>>>>  - for all methods with multiple existing overloads, we can consider
>>>>>> extending `Consumed`, `Produced`, `Materialized` etc. to take an additional
>>>>>> processor name (not sure atm how elegant this is; we would need to
>>>>>> "play" with the API a little bit; the advantage would be that we do not
>>>>>> add more overloads, which seems to be key for this KIP)
>>>>>>
>>>>>>  - operators that return void: while I agree that the "name first" chaining
>>>>>> idea is not very intuitive, it might still work if we name the method
>>>>>> correctly (again, we would need to "play" with the API a little bit to
>>>>>> see)
>>>>>>
>>>>>>  - for DSL operators that are translated to multiple nodes: it might
>>>>>> make sense to use the specified operator name as a prefix and add
>>>>>> reasonable suffixes. For example, a join translates into 5 operators
>>>>>> that could be named "name-left-store-processor",
>>>>>> "name-left-join-processor", "name-right-store-processor",
>>>>>> "name-right-join-processor", and "name-join-merge-processor" (or
>>>>>> similar). Maybe just using numbers might also work.
>>>>>>
>>>>>>  - I think we should strip the number suffixes if a user provides names
>>>>>>
>>>>>>  - enforcing upper case seems to be tricky: for example, we do not
>>>>>> enforce upper case for store names and we cannot easily change it as it
>>>>>> would break compatibility -- thus, for consistency reasons we might not
>>>>>> want to do this
>>>>>>
>>>>>>  - for a better understanding of the impact of the KIP, it would be quite
>>>>>> helpful if you would list all method names that are affected in the KIP
>>>>>> (i.e., list all newly added overloads)
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 5/31/18 6:40 PM, Guozhang Wang wrote:
>>>>>>> Hi Florian,
>>>>>>>
>>>>>>> Re 1: I think changing the KStreamImpl / KTableImpl to allow modifying
>>>>>>> the processor name after the operator is fine as long as we do the check
>>>>>>> again when modifying it. In fact, we have some topology optimization
>>>>>>> going on which may modify processor names in the final topology anyway (
>>>>>>> https://github.com/apache/kafka/pull/4983). Semantically I think it is
>>>>>>> easier for developers to understand than "deciding the processor name
>>>>>>> for the next operator".
>>>>>>>
>>>>>>> Re 2: Yeah, I'm thinking that for operators that translate to multiple
>>>>>>> processor names, we can still use the provided "hint" to name the
>>>>>>> processors, e.g. for joins we can name them `join-foo-this` and
>>>>>>> `join-foo-that` etc. if the user calls `as("foo")`.
>>>>>>>
>>>>>>> Re 3: The motivation I had for removing the suffix is that it imposes
>>>>>>> huge restrictions on topology compatibility: if user code adds a new
>>>>>>> operator, or the library does some optimization to remove some operators,
>>>>>>> the suffix indexing may change for a large number of processor names.
>>>>>>> This will in turn change the internal state store names as well as the
>>>>>>> internal topic names, making the new application topology incompatible
>>>>>>> with the old one. One rationale I had for this KIP is that, aligned with
>>>>>>> this effort, moving forward we can allow users to customize internal
>>>>>>> names so that they can still be reused even with topology changes
>>>>>>> (e.g. KIP-230), so I think removing the suffix index would be more
>>>>>>> applicable in the long run.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 31, 2018 at 3:08 PM, Florian Hussonnois <fhussonn...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi ,
>>>>>>>> Thank you very much for your feedback.
>>>>>>>>
>>>>>>>> 1/
>>>>>>>> I agree that overloading most of the methods with a Processed is not
>>>>>>>> ideal. I've started modifying the KStream API and I came to the same
>>>>>>>> conclusion. Also, adding a new method directly to the KStreamImpl and
>>>>>>>> KTableImpl classes seems to be a better option.
>>>>>>>>
>>>>>>>> However, a processor name cannot be redefined after calling an operator
>>>>>>>> (or maybe I am missing something in the code).
>>>>>>>> From my understanding, this would only set the KStream name property, not
>>>>>>>> the processor name previously added to the topology builder, leading to
>>>>>>>> an InvalidTopology exception.
>>>>>>>>
>>>>>>>> So the new method should actually define the name of the next processor.
>>>>>>>> Below is an example:
>>>>>>>>
>>>>>>>> stream.as(Processed.name("MAPPE_TO_UPPERCASE"))
>>>>>>>>       .map((k, v) -> KeyValue.pair(k, v.toUpperCase()))
>>>>>>>>
>>>>>>>> I think this approach could solve the cases for methods returning void?
>>>>>>>>
>>>>>>>> Regarding this new method, we have two possible implementations:
>>>>>>>>
>>>>>>>>    1. Adding a method like: withName(String processorName)
>>>>>>>>    2. or adding a method accepting a Processed object: as(Processed).
>>>>>>>>
>>>>>>>> I think solution 2. is preferable as the Processed class could be
>>>>>>>> enriched further (in the future).
>>>>>>>>
>>>>>>>> 2/
>>>>>>>> As Guozhang said, some operators add internal processors.
>>>>>>>> For example, the branch() method creates one KStreamBranch processor to
>>>>>>>> route records and one KStreamPassThrough processor for each branch.
>>>>>>>> In that situation only the parent processor can be named. For child
>>>>>>>> processors we could keep the current behaviour, which adds a suffix
>>>>>>>> (i.e. KSTREAM-BRANCHCHILD-).
>>>>>>>>
>>>>>>>> This is also the case for the join() method, which results in adding
>>>>>>>> multiple processors to the topology (windowing, left/right joins and a
>>>>>>>> merge processor).
>>>>>>>> I think that, as for the branch method, users could only define a
>>>>>>>> processor name prefix.
>>>>>>>>
>>>>>>>> 3/
>>>>>>>> I think we should still add a suffix like "-0000000000" to the processor
>>>>>>>> name and enforce uppercase, as this will keep some consistency with the
>>>>>>>> names generated by the API.
>>>>>>>>
>>>>>>>> 4/
>>>>>>>> Yes, the KTable interface should be modified like KStream to allow
>>>>>>>> custom processor name definitions.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 31, 2018 at 19:18, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Florian,
>>>>>>>>>
>>>>>>>>> Thanks for the KIP. What about KTable and other DSL interfaces? Will
>>>>>>>>> they not want to be able to do the same thing?
>>>>>>>>> It would be good to see a complete set of the public API changes.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Damian
>>>>>>>>>
>>>>>>>>> On Wed, 30 May 2018 at 19:45 Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Florian,
>>>>>>>>>>
>>>>>>>>>> Thanks for the KIP. I have some meta feedback on the proposal:
>>>>>>>>>>
>>>>>>>>>> 1. You mentioned that this `Processed` object will be added to a new
>>>>>>>>>> overloaded variant of all the stateless operators. What about the
>>>>>>>>>> stateful operators? I would like to hear your opinion if you have
>>>>>>>>>> thought about that: note that stateful operators will usually be mapped
>>>>>>>>>> to multiple processor node names, so we probably need to come up with
>>>>>>>>>> some way to define all their names.
>>>>>>>>>>
>>>>>>>>>> 2. I share Bill's concern about adding lots of new overloaded
>>>>>>>>>> functions to the stateless operators, as we have just spent quite some
>>>>>>>>>> effort trimming them since the 1.0.0 release. If the goal is just to
>>>>>>>>>> provide some "hints" on the generated processor node names, not to
>>>>>>>>>> strictly enforce the exact names to be generated, then how about we
>>>>>>>>>> just add a new function to the `KStream` and `KTable` classes like
>>>>>>>>>> "as(Processed)", with the semantics "the latest operators that
>>>>>>>>>> generate this KStream / KTable will be named according to this hint".
>>>>>>>>>>
>>>>>>>>>> The only caveat is that for operators like `KStream#to` and
>>>>>>>>>> `KStream#print` that return void, this alternative would not work. But
>>>>>>>>>> for the current operators:
>>>>>>>>>>
>>>>>>>>>> a. KStream#print,
>>>>>>>>>> b. KStream#foreach,
>>>>>>>>>> c. KStream#to,
>>>>>>>>>> d. KStream#process
>>>>>>>>>>
>>>>>>>>>> I personally feel that, except for `KStream#process`, users would not
>>>>>>>>>> usually bother to override their names, and for `KStream#process` we
>>>>>>>>>> could add an overloaded variant with the additional Processed object.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3. In your example, the processor names still get a suffix like
>>>>>>>>>> "-0000000000". Is this intentional? If yes, why (I thought that with
>>>>>>>>>> user-specified processor name hints we would no longer add a suffix to
>>>>>>>>>> distinguish different nodes of the same type)?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 29, 2018 at 6:47 AM, Bill Bejeck <bbej...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Florian,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the KIP. I think being able to add more context to the
>>>>>>>>>>> processor names would be useful.
>>>>>>>>>>>
>>>>>>>>>>> I like the idea of adding a "withProcessorName" to Produced, Consumed
>>>>>>>>>>> and Joined.
>>>>>>>>>>>
>>>>>>>>>>> But instead of adding the "Processed" parameter to a large percentage
>>>>>>>>>>> of the methods, which would result in overloaded methods (many of
>>>>>>>>>>> which we removed with KIP-182), what do you think of adding a
>>>>>>>>>>> method "withName(String processorName)" to the AbstractStream class?
>>>>>>>>>>> BTW I'm not married to the method name, it's the best I can do off
>>>>>>>>>>> the top of my head.
>>>>>>>>>>>
>>>>>>>>>>> For the methods that return void, we'd have to add a parameter, but
>>>>>>>>>>> that would at least cut down on the number of overloaded methods in
>>>>>>>>>>> the API.
>>>>>>>>>>>
>>>>>>>>>>> Just my 2 cents.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Bill
>>>>>>>>>>>
>>>>>>>>>>> On Sun, May 27, 2018 at 4:13 PM, Florian Hussonnois
>>>>>>>>>>> <fhussonn...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to start a new discussion on the following KIP:
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL
>>>>>>>>>>>>
>>>>>>>>>>>> This is still a draft.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>>>> --
>>>>>>>>>>>> Florian HUSSONNOIS
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Florian HUSSONNOIS
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Florian HUSSONNOIS
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -- Guozhang
>>>
>>
>>
>> --
>> Florian HUSSONNOIS
>>
> 
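A minimal sketch of the naming rule John suggests in the quoted thread above (letters, numbers, and hyphens only, plus an upper limit on length). The class name, the method name, and the 249-character cap are assumptions made for illustration; the thread only asks for "an upper limit" and does not name a value, and nothing here is part of the KIP or of Kafka Streams.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kafka Streams: checks a user-supplied
// processor name against the conservative rule discussed in the thread.
final class ProcessorNameValidator {
    // Assumed upper limit; the thread does not specify a value.
    static final int MAX_LENGTH = 249;

    // Letters, digits, and hyphens only, at least one character.
    private static final Pattern LEGAL = Pattern.compile("[A-Za-z0-9-]+");

    static boolean isValid(final String name) {
        return name != null
            && name.length() <= MAX_LENGTH
            && LEGAL.matcher(name).matches();
    }

    public static void main(final String[] args) {
        System.out.println(isValid("BRANCH-STREAM-ON-VALUE")); // true
        System.out.println(isValid("map to uppercase"));       // false: whitespace
        System.out.println(isValid("MAPPE_TO_UPPERCASE"));     // false: underscore
    }
}
```

Under this rule a name containing whitespace, which Florian asks about, would simply be rejected up front rather than risk side effects in topic, file, or store names.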
> 
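Matthias's idea in the quoted thread of using the user-provided name as a prefix for operators that expand into several processors could be sketched as follows. The suffixes are the ones from his join example; the class and method names are hypothetical, and this is not how Kafka Streams necessarily derives the names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: derive the five processor names of a join
// from a single user-provided prefix.
final class JoinProcessorNames {
    // Suffixes taken from the example names given in the thread.
    private static final String[] SUFFIXES = {
        "left-store-processor",
        "left-join-processor",
        "right-store-processor",
        "right-join-processor",
        "join-merge-processor"
    };

    static List<String> derive(final String prefix) {
        final List<String> names = new ArrayList<>();
        for (final String suffix : SUFFIXES) {
            names.add(prefix + "-" + suffix);
        }
        return names;
    }

    public static void main(final String[] args) {
        // Prints the five derived names, one per line.
        derive("orders").forEach(System.out::println);
    }
}
```

Because every derived name depends only on the prefix, the names stay the same when unrelated operators are added to or removed from the topology.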
>
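Guozhang's compatibility concern about index suffixes (his "Re 3" in the quoted thread) can be illustrated with a small simulation. This is a sketch under stated assumptions: the generator below only mimics index-based naming with a 10-digit counter like the "-0000000000" suffix mentioned in the thread, and the operator type prefixes are illustrative; it is not Kafka's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation, not Kafka code: each operator is named
// "<TYPE>-<10-digit index>", where the index counts operators in order.
final class SuffixIndexDemo {
    static List<String> generatedNames(final List<String> operatorTypes) {
        final List<String> names = new ArrayList<>();
        int index = 0;
        for (final String type : operatorTypes) {
            names.add(String.format("%s-%010d", type, index++));
        }
        return names;
    }

    public static void main(final String[] args) {
        final List<String> before = generatedNames(Arrays.asList(
            "KSTREAM-SOURCE", "KSTREAM-FILTER", "KSTREAM-AGGREGATE"));
        // Same topology with one extra operator inserted upstream.
        final List<String> after = generatedNames(Arrays.asList(
            "KSTREAM-SOURCE", "KSTREAM-MAP", "KSTREAM-FILTER", "KSTREAM-AGGREGATE"));

        System.out.println(before.get(2)); // KSTREAM-AGGREGATE-0000000002
        System.out.println(after.get(3));  // KSTREAM-AGGREGATE-0000000003
    }
}
```

The aggregate operator is unchanged, yet its generated name shifts once a map step is inserted before it. Any state store or internal topic named after it would shift as well, which is the incompatibility that user-provided names without an index suffix are meant to avoid.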

