Re: [DISCUSS] KIP-478 Strongly Typed Processor API

John Roesler Fri, 21 Jun 2019 17:10:16 -0700

After kicking the naming around a bit more, it seems like any package
name change is a bit "weird" because it fragments the package and
directory structure. If we can come up with a reasonable name for the
interface after all, it seems like the better choice.


The real challenge is that the existing name "Processor" seems just
about perfect. In picking a new name, we need to consider the ultimate
state, after the deprecation period, when we entirely remove
Processor. In this context, TypedProcessor seems a little odd to me,
because it seems to imply that there should also be an "untyped
processor".

After kicking around a few other ideas, what does everyone think about
"RecordProcessor"? I _think_ maybe it stands on its own just fine,
because it's a thing that processes... records?

If others agree with this, I can change the proposal to RecordProcessor.

Thanks,
-John

On Fri, Jun 21, 2019 at 6:42 PM John Roesler <j...@confluent.io> wrote:
>
> Hi all,
>
> I've updated the KIP with the feedback so far.
>
> The naming question is still the biggest (only?) outstanding issue. It
> would be good to hear some more thoughts on it.
>
> As we stand now, there's one vote for changing the package name to
> something like 'typedprocessor', one for changing the interface to
> TypedProcessor (as in the PoC), and one for just changing the
> Processor interface in-place, breaking source compatibility.
>
> How can we resolve this decision?
>
> Thanks,
> -John
>
> On Thu, Jun 20, 2019 at 5:44 PM John Roesler <j...@confluent.io> wrote:
> >
> > Thanks for the feedback, Guozhang and Matthias,
> >
> > Regarding motivation: I'll update the wiki. Briefly:
> > * Any processor can benefit. Imagine a pure user of the ProcessorAPI
> > who has very complex processing logic. I have seen several processor
> > implementation that are hundreds of lines long and call
> > `context.forward` in many different locations and branches. In such an
> > implementation, it would be very easy to have a bug in a rarely used
> > branch that forwards the wrong kind of value. This would structurally
> > prevent that from happening.
> > * Also, anyone who heavily uses the ProcessorAPI would likely have
> > developed helper methods to wire together processors, just as we have
> > in the DSL implementation. This change would enable them to ensure at
> > compile time that they are actually wiring together compatible types.
> > This was actually _my_ original motivation, since I found it very
> > difficult and time consuming to follow the Streams DSL internal
> > builders.
> >
> > Regarding breaking the source compatibility of Processor: I would
> > _love_ to side-step the naming problem, but I really don't know if
> > it's excusable to break compatibility. I suspect that our oldest and
> > dearest friends are using the ProcessorAPI in some form or another,
> > and all their source code would break. It sucks to have to create a
> > whole new interface to get around this, but it feels like the right
> > thing to do. Would be nice to get even more feedback on this point,
> > though.
> >
> > Regarding the types of stores, as I said in my response to Sophie,
> > it's not an issue.
> >
> > Regarding the change to StreamsBuilder, it doesn't pin the types in
> > any way, since all the types are bounded by Object only, and there are
> > no extra constraints between arguments (each type is used only once in
> > one argument). But maybe I missed the point you were asking about.
> > Since the type takes generic paramters, we should allow users to pass
> > in parameterized arguments. Otherwise, they would _have to_ give us a
> > raw type, and they would be forced to get a "rawtyes" warning from the
> > compiler. So, it's our obligation in any API that accepts a
> > parameterized-type parameter to allow people to actually pass a
> > parameterized type, even if we don't actually use the parameters.
> >
> > The naming question is a complex one, as I took pains to detail
> > previously. Please don't just pick out one minor point, call it weak,
> > and then claim that it invalidates the whole decision. I don't think
> > there's a clear best choice, so I'm more than happy for someone to
> > advocate for renaming the class instead of the package. Can you
> > provide some reasons why you think that would be better?
> >
> > Regarding the deprecated methods, you're absolutely right. I'll update the 
> > KIP.
> >
> > Thanks again for all the feedback!
> > -John
> >
> > On Thu, Jun 20, 2019 at 4:34 PM Matthias J. Sax <matth...@confluent.io> 
> > wrote:
> > >
> > > Just want to second what Sophie said about the stores. The type of a
> > > used stores is completely independent of input/output types.
> > >
> > > This related to change `addGlobalStore()` method. Why do you want to pin
> > > the types? In fact, people request the ability to filter() and maybe
> > > even map() the data before they are put into the global store. Limiting
> > > the types seems to be a step backward here?
> > >
> > >
> > >
> > > Also, the pack name is questionable.
> > >
> > > > This wouldn't be the first project to do something like this...
> > >
> > > Not a strong argument. I would actually propose to not a a new package,
> > > but just a new class `TypedProcessor`.
> > >
> > >
> > > For `ProcessorContext#forward` methods -- some of those methods are
> > > already deprecated. While the will still be affected, it would be worth
> > > to mark them as deprecated in the wiki page, too.
> > >
> > >
> > > @Guozhang: I dont' think we should break source compatibility in a minor
> > > release.
> > >
> > >
> > >
> > > -Matthias
> > >
> > >
> > >
> > > On 6/20/19 1:43 PM, Guozhang Wang wrote:
> > > > Hi John,
> > > >
> > > > Thanks for KIP! I've a few comments below:
> > > >
> > > > 1. So far the "Motivation" section is very general, and the only 
> > > > concrete
> > > > example that I have in mind is `TransformValues#punctuate`. Do we have 
> > > > any
> > > > other concrete issues that drive this KIP? If not then I feel better to
> > > > narrow the scope of this KIP to:
> > > >
> > > > 1.a) modifying ProcessorContext only with the output types on forward.
> > > > 1.b) modifying Transformer signature to have generics of 
> > > > ProcessorContext,
> > > > and then lift the restricting of not using punctuate: if user did not
> > > > follow the enforced typing and just code without generics, they will get
> > > > warning at compile time and get run-time error if they forward 
> > > > wrong-typed
> > > > records, which I think would be acceptable.
> > > >
> > > > I feel this would be a good solution for this specific issue; again, 
> > > > feel
> > > > free to update the wiki page with other known issues that cannot be
> > > > resolved.
> > > >
> > > > 2. If, we want to go with the current scope then my next question would 
> > > > be,
> > > > how much breakage we would introducing if we just modify the Processor
> > > > signature directly? My feeling is that DSL users would be most likely 
> > > > not
> > > > affected and PAPI users only need to modify a few lines on class
> > > > declaration. I feel it worth doing some research on this part and then
> > > > decide if we really want to bite the bullet of duplicated Processor /
> > > > ProcessorSupplier classes for maintaining compatibility.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > >
> > > > On Wed, Jun 19, 2019 at 12:21 PM John Roesler <j...@confluent.io> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> In response to the feedback so far, I changed the package name from
> > > >> `processor2` to `processor.generic`.
> > > >>
> > > >> Thanks,
> > > >> -John
> > > >>
> > > >> On Mon, Jun 17, 2019 at 4:49 PM John Roesler <j...@confluent.io> wrote:
> > > >>>
> > > >>> Thanks for the feedback, Sophie!
> > > >>>
> > > >>> I actually felt a little uneasy when I wrote that remark, because it's
> > > >>> not restricted at all in the API, it's just available to you if you
> > > >>> choose to give your stores and context the same parameters. So, I
> > > >>> think your use case is valid, and also perfectly permissable under the
> > > >>> current KIP. Sorry for sowing confusion on my own discussion thread!
> > > >>>
> > > >>> I'm not crazy about the package name, either. I went with it only
> > > >>> because there's seemingly nothing special about the new package except
> > > >>> that it can't have the same name as the old one. Otherwise, the
> > > >>> existing "processor" and "Processor" names for the package and class
> > > >>> are perfectly satisfying. Rather than pile on additional semantics, it
> > > >>> seemed cleaner to just add a number to the package name.
> > > >>>
> > > >>> This wouldn't be the first project to do something like this... Apache
> > > >>> Commons, for example, has added a "2" to the end of some of their
> > > >>> packages for exactly the same reason.
> > > >>>
> > > >>> I'm open to any suggestions. For example, we could do something like
> > > >>> org.apache.kafka.streams.typedprocessor.Processor or
> > > >>> org.apache.kafka.streams.processor.typed.Processor , which would have
> > > >>> just about the same effect. One microscopic thought is that, if
> > > >>> there's another interface in the "processor" package that we wish to
> > > >>> do the same thing to, would _could_ pile it in to "processor2", but we
> > > >>> couldn't do the same if we use a package that has "typed" in the name,
> > > >>> unless that change is _also_ related to types in some way. But this
> > > >>> seems like a very minor concern.
> > > >>>
> > > >>> What's your preference?
> > > >>> -John
> > > >>>
> > > >>> On Mon, Jun 17, 2019 at 3:56 PM Sophie Blee-Goldman 
> > > >>> <sop...@confluent.io>
> > > >> wrote:
> > > >>>>
> > > >>>> Hey John, thanks for writing this up! I like the proposal but there's
> > > >> one
> > > >>>> point that I think may be too restrictive:
> > > >>>>
> > > >>>> "A processor that happens to use a typed store is actually emitting 
> > > >>>> the
> > > >>>> same types that it is storing."
> > > >>>>
> > > >>>> I can imagine someone could want to leverage this new type safety
> > > >> without
> > > >>>> also limiting how they can interact with/use their store. As an
> > > >> (admittedly
> > > >>>> contrived) example, say you have an input stream of purchases of a
> > > >> certain
> > > >>>> type (entertainment, food, etc), and on seeing a new record you want 
> > > >>>> to
> > > >>>> output how many types of purchase a shopper has made more than 5
> > > >> purchases
> > > >>>> of in the last month. Your state store will probably be holding some
> > > >> more
> > > >>>> complicated PurchaseHistory object (keyed by user), but your output 
> > > >>>> is
> > > >> just
> > > >>>> a <User, Long>
> > > >>>>
> > > >>>> I'm also not crazy about "processor2" as the package name ... not 
> > > >>>> sure
> > > >> what
> > > >>>> a better one would be though (something with "typed"?)
> > > >>>>
> > > >>>> On Mon, Jun 17, 2019 at 12:47 PM John Roesler <j...@confluent.io>
> > > >> wrote:
> > > >>>>
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> I'd like to propose KIP-478 (
> > > >> https://cwiki.apache.org/confluence/x/2SkLBw
> > > >>>>> ).
> > > >>>>>
> > > >>>>> This proposal would add output type bounds to the Processor 
> > > >>>>> interface
> > > >>>>> in Kafka Streams, which enables static checking of a number of 
> > > >>>>> useful
> > > >>>>> properties:
> > > >>>>> * A processor B that consumes the output of processor A is actually
> > > >>>>> expecting the same types that processor A produces.
> > > >>>>> * A processor that happens to use a typed store is actually emitting
> > > >>>>> the same types that it is storing.
> > > >>>>> * A processor is simply forwarding the expected types in all code
> > > >> paths.
> > > >>>>> * Processors added via the Streams DSL, which are not permitted to
> > > >>>>> forward results at all are statically prevented from doing so by the
> > > >>>>> compiler
> > > >>>>>
> > > >>>>> Internally, we can use the above properties to achieve a much higher
> > > >>>>> level of confidence in the Streams DSL implementation's correctness.
> > > >>>>> Actually, while doing the POC, I found a few bugs and mistakes, 
> > > >>>>> which
> > > >>>>> become structurally impossible with KIP-478.
> > > >>>>>
> > > >>>>> Additionally, the stronger types dramatically improve the
> > > >>>>> self-documentation of our Streams internal implementations, which
> > > >>>>> makes it much easier for new contributors to ramp up with 
> > > >>>>> confidence.
> > > >>>>>
> > > >>>>> Thanks so much for your consideration!
> > > >>>>> -John
> > > >>>>>
> > > >>
> > > >
> > > >
> > >

Re: [DISCUSS] KIP-478 Strongly Typed Processor API

Reply via email to