Re: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Michael Noll Mon, 20 Mar 2017 13:00:06 -0700

Hmm, I must admit I don't like this last update all too much.

Basically we would have:


    StreamsBuilder builder = new StreamsBuilder();

    // And here you'd define your...well, what actually?
    // Ah right, you are composing a topology here, though you are not
aware of it.

    KafkaStreams streams = new KafkaStreams(builder.build(),
streamsConfiguration);

So what are you building here with StreamsBuilder?  Streams (hint: No)?
And what about tables -- is there a TableBuilder (hint: No)?

I also interpret Guozhang's last response as that he'd prefer to have
"Topology" in the class/interface names.  I am aware that we shouldn't
necessarily use the status quo to make decisions about future changes, but
the very first concept we explain in the Kafka Streams documentation is
"Stream Processing Topology":
https://kafka.apache.org/0102/documentation/streams#streams_concepts

-Michael



On Mon, Mar 20, 2017 at 7:55 PM, Matthias J. Sax <matth...@confluent.io>
wrote:

> \cc users list
>
>
> -------- Forwarded Message --------
> Subject: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API
> Date: Mon, 20 Mar 2017 11:51:01 -0700
> From: Matthias J. Sax <matth...@confluent.io>
> Organization: Confluent Inc
> To: d...@kafka.apache.org
>
> I want to push this discussion further.
>
> Guozhang's argument about "exposing" the Topology class is valid. It's a
> public class anyway, so it's not as issue. However, I think the question
> is not too much about exposing but about "advertising" (ie, putting it
> into the focus) or not at DSL level.
>
>
> If I interpret the last replies correctly, it seems that we could agree
> on "StreamsBuilder" as name. I did update the KIP accordingly. Please
> correct me, if I got this wrong.
>
>
> If there are not other objects -- this naming discussion was the last
> open point to far -- I would like the start the VOTE thread.
>
>
> -Matthias
>
>
> On 3/14/17 2:37 PM, Guozhang Wang wrote:
> > I'd like to keep the term "Topology" inside the builder class since, as
> > Matthias mentioned, this builder#build() function returns a "Topology"
> > object, whose type is a public class anyways. Although you can argue to
> let
> > users always call
> >
> > "new KafkaStreams(builder.build())"
> >
> > I think it is still more benefit to expose this concept.
> >
> >
> >
> > Guozhang
> >
> > On Tue, Mar 14, 2017 at 10:43 AM, Matthias J. Sax <matth...@confluent.io
> >
> > wrote:
> >
> >> Thanks for your input Michael.
> >>
> >>>> - KafkaStreams as the new name for the builder that creates the
> logical
> >>>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and
> >>>> `KafkaStreams.table("input-topic")`.
> >>
> >> I don't thinks this is a good idea, for multiple reasons:
> >>
> >> (1) We would reuse a name for a completely different purpose. The same
> >> argument for not renaming KStreamBuilder to TopologyBuilder. The
> >> confusion would just be too large.
> >>
> >> So if we would start from scratch, it might be ok to do so, but now we
> >> cannot make this move, IMHO.
> >>
> >> Also a clarification question: do you suggest to have static methods
> >> #stream and #table -- I am not sure if this would work?
> >> (or was you code snippet just simplification?)
> >>
> >>
> >> (2) Kafka Streams is basically a "processing client" next to consumer
> >> and producer client. Thus, the name KafkaStreams aligns to the naming
> >> schema of KafkaConsumer and KafkaProducer. I am not sure if it would be
> >> a good choice to "break" this naming scheme.
> >>
> >> Btw: this is also the reason, why we have KafkaStreams#close() -- and
> >> not KafkaStreams#stop() -- because #close() aligns with consumer and
> >> producer client.
> >>
> >>
> >> (3) On more argument against using KafkaStreams as DSL entry class would
> >> be, that it would need to create a Topology that can be given to the
> >> "runner/processing-client". Thus the pattern would be
> >>
> >>> Topology topology = streams.build();
> >>> KafkaStramsRunner runner = new KafkaStreamsRunner(..., topology)
> >>
> >> (or of course as a one liner).
> >>
> >>
> >>
> >> On the other hand, there was the idea (that we intentionally excluded
> >> from the KIP), to change the "client instantiation" pattern.
> >>
> >> Right now, a new client in actively instantiated (ie, by calling "new")
> >> and the topology if provided as a constructor argument. However,
> >> especially for DSL (not sure if it would make sense for PAPI), the DSL
> >> builder could create the client for the user.
> >>
> >> Something like this:
> >>
> >>> KStreamBuilder builder = new KStreamBuilder();
> >>> builder.whatever() // use the builder
> >>>
> >>> StreamsConfig config = ....
> >>> KafkaStreams streams = builder.getKafkaStreams(config);
> >>
> >> If we change the patter like this, the notion a the "DSL builder" would
> >> change, as it does not create a topology anymore, but it creates the
> >> "processing client". This would address Jay's concern about "not
> >> exposing concept users don't need the understand" and would not require
> >> to include the word "Topology" in the DSL builder class name, because
> >> the builder does not build a Topology anymore.
> >>
> >> I just put some names that came to my mind first hand -- did not think
> >> about good names. It's just to discuss the pattern.
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >>
> >>
> >>
> >> On 3/14/17 3:36 AM, Michael Noll wrote:
> >>> I see Jay's point, and I agree with much of it -- notably about being
> >>> careful which concepts we do and do not expose, depending on which user
> >>> group / user type is affected.  That said, I'm not sure yet whether or
> >> not
> >>> we should get rid of "Topology" (or a similar term) in the DSL.
> >>>
> >>> For what it's worth, here's how related technologies define/name their
> >>> "topologies" and "builders".  Note that, in all cases, it's about
> >>> constructing a logical processing plan, which then is being
> executed/run.
> >>>
> >>> - `Pipeline` (Google Dataflow/Apache Beam)
> >>>     - To add a source you first instantiate the Source (e.g.
> >>> `TextIO.Read.from("gs://some/inputData.txt")`),
> >>>       then attach it to your processing plan via
> >> `Pipeline#apply(<source>)`.
> >>>       This setup is a bit different to our DSL because in our DSL the
> >>> builder does both, i.e.
> >>>       instantiating + auto-attaching to itself.
> >>>     - To execute the processing plan you call `Pipeline#execute()`.
> >>> - `StreamingContext`` (Spark): This setup is similar to our DSL.
> >>>     - To add a source you call e.g.
> >>> `StreamingContext#socketTextStream("localhost", 9999)`.
> >>>     - To execute the processing plan you call
> >> `StreamingContext#execute()`.
> >>> - `StreamExecutionEnvironment` (Flink): This setup is similar to our
> DSL.
> >>>     - To add a source you call e.g.
> >>> `StreamExecutionEnvironment#socketTextStream("localhost", 9999)`.
> >>>     - To execute the processing plan you call
> >>> `StreamExecutionEnvironment#execute()`.
> >>> - `Graph`/`Flow` (Akka Streams), as a result of composing Sources (~
> >>> `KStreamBuilder.stream()`) and Sinks (~ `KStream#to()`)
> >>>   into Flows, which are [Runnable]Graphs.
> >>>     - You instantiate a Source directly, and then compose the Source
> with
> >>> Sinks to create a RunnableGraph:
> >>>       see signature `Source#to[Mat2](sink: Graph[SinkShape[Out],
> Mat2]):
> >>> RunnableGraph[Mat]`.
> >>>     - To execute the processing plan you call `Flow#run()`.
> >>>
> >>> In our DSL, in comparison, we do:
> >>>
> >>> - `KStreamBuilder` (Kafka Streams API)
> >>>     - To add a source you call e.g. `KStreamBuilder#stream("input-
> >> topic")`.
> >>>     - To execute the processing plan you create a `KafkaStreams`
> instance
> >>> from `KStreamBuilder`
> >>>       (where the builder will instantiate the topology = processing
> plan
> >> to
> >>> be executed), and then
> >>>       call `KafkaStreams#start()`.  Think of `KafkaStreams` as our
> >> runner.
> >>>
> >>> First, I agree with the sentiment that the current name of
> >> `KStreamBuilder`
> >>> isn't great (which is why we're having this discussion).  Also, that
> >>> finding a good name is tricky. ;-)
> >>>
> >>> Second, even though I agree with many of Jay's points I'm not sure
> >> whether
> >>> I like the `StreamsBuilder` suggestion (i.e. any name that does not
> >> include
> >>> "topology" or a similar term) that much more.  It still doesn't
> describe
> >>> what that class actually does, and what the difference to
> `KafkaStreams`
> >>> is.  IMHO, the point of `KStreamBuilder` is that it lets you build a
> >>> logical plan (what we call "topology"), and `KafkaStreams` is the thing
> >>> that executes that plan.  I'm not yet convinced that abstracting these
> >> two
> >>> points away from the user is a good idea if the argument is that it's
> >>> potentially confusing to beginners (a claim which I am not sure is
> >> actually
> >>> true).
> >>>
> >>> That said, if we rather favor "good-sounding but perhaps less
> technically
> >>> correct names", I'd argue we should not even use something like
> >> "Builder".
> >>> We could, for example, also pick the following names:
> >>>
> >>> - KafkaStreams as the new name for the builder that creates the logical
> >>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and
> >>> `KafkaStreams.table("input-topic")`.
> >>> - KafkaStreamsRunner as the new name for the executioner of the plan,
> >> with
> >>> `KafkaStreamsRunner(KafkaStreams).run()`.
> >>>
> >>>
> >>>
> >>> On Tue, Mar 14, 2017 at 5:56 AM, Sriram Subramanian <r...@confluent.io>
> >>> wrote:
> >>>
> >>>> StreamsBuilder would be my vote.
> >>>>
> >>>>> On Mar 13, 2017, at 9:42 PM, Jay Kreps <j...@confluent.io> wrote:
> >>>>>
> >>>>> Hey Matthias,
> >>>>>
> >>>>> Make sense, I'm more advocating for removing the word topology than
> any
> >>>>> particular new replacement.
> >>>>>
> >>>>> -Jay
> >>>>>
> >>>>> On Mon, Mar 13, 2017 at 12:30 PM, Matthias J. Sax <
> >> matth...@confluent.io
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Jay,
> >>>>>>
> >>>>>> thanks for your feedback
> >>>>>>
> >>>>>>> What if instead we called it KStreamsBuilder?
> >>>>>>
> >>>>>> That's the current name and I personally think it's not the best
> one.
> >>>>>> The main reason why I don't like KStreamsBuilder is, that we have
> the
> >>>>>> concepts of KStreams and KTables, and the builder creates both.
> >> However,
> >>>>>> the name puts he focus on KStream and devalues KTable.
> >>>>>>
> >>>>>> I understand your argument, and I am personally open the remove the
> >>>>>> "Topology" part, and name it "StreamsBuilder". Not sure what others
> >>>>>> think about this.
> >>>>>>
> >>>>>>
> >>>>>> About Processor API: I like the idea in general, but I thinks it's
> out
> >>>>>> of scope for this KIP. KIP-120 has the focus on removing leaking
> >>>>>> internal APIs and do some cleanup how our API reflects some
> concepts.
> >>>>>>
> >>>>>> However, I added your idea to API discussion Wiki page and we take
> if
> >>>>>> from there:
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/
> >>>>>> Kafka+Streams+Discussions
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>>> On 3/13/17 11:52 AM, Jay Kreps wrote:
> >>>>>>> Two things:
> >>>>>>>
> >>>>>>>   1. This is a minor thing but the proposed new name for
> >> KStreamBuilder
> >>>>>>>   is StreamsTopologyBuilder. I actually think we should not put
> >>>>>> topology in
> >>>>>>>   the name as topology is not a concept you need to understand at
> the
> >>>>>>>   kstreams layer right now. I'd think of three categories of
> >> concepts:
> >>>>>> (1)
> >>>>>>>   concepts you need to understand to get going even for a simple
> >>>>>> example, (2)
> >>>>>>>   concepts you need to understand to operate and debug a real
> >>>>>> production app,
> >>>>>>>   (3) concepts we truly abstract and you don't need to ever
> >> understand.
> >>>>>> I
> >>>>>>>   think in the kstream layer topologies are currently category (2),
> >> and
> >>>>>> this
> >>>>>>>   is where they belong. By introducing the name in even the
> simplest
> >>>>>> example
> >>>>>>>   it means the user has to go read about toplogies to really
> >> understand
> >>>>>> even
> >>>>>>>   this simple snippet. What if instead we called it
> KStreamsBuilder?
> >>>>>>>   2. For the processor api, I think this api is mostly not for end
> >>>>>> users.
> >>>>>>>   However this are a couple cases where it might make sense to
> expose
> >>>>>> it. I
> >>>>>>>   think users coming from Samza, or JMS's MessageListener (
> >>>>>>>   https://docs.oracle.com/javaee/7/api/javax/jms/
> >> MessageListener.html)
> >>>>>>>   understand a simple callback interface for message processing. In
> >>>>>> fact,
> >>>>>>>   people often ask why Kafka's consumer doesn't provide such an
> >>>>>> interface.
> >>>>>>>   I'd argue we do, it's KafkaStreams. The only issue is that the
> >>>>>> processor
> >>>>>>>   API documentation is a bit scary for a person implementing this
> >> type
> >>>>>> of
> >>>>>>>   api. My observation is that people using this style of API don't
> >> do a
> >>>>>> lot
> >>>>>>>   of cross-message operations, then just do single message
> operations
> >>>>>> and use
> >>>>>>>   a database for anything that spans messages. They also don't
> factor
> >>>>>> their
> >>>>>>>   code into many MessageListeners and compose them, they just have
> >> one
> >>>>>>>   listener that has the complete handling logic. Say I am a user
> who
> >>>>>> wants to
> >>>>>>>   implement a single Processor in this style. Do we have an easy
> way
> >> to
> >>>>>> do
> >>>>>>>   that today (either with the .transform/.process methods in
> kstreams
> >>>>>> or with
> >>>>>>>   the topology apis)? Is there anything we can do in the way of
> >> trivial
> >>>>>>>   helper code to make this better? Also, how can we explain that
> >>>>>> pattern to
> >>>>>>>   people? I think currently we have pretty in-depth docs on our
> apis
> >>>>>> but I
> >>>>>>>   suspect a person trying to figure out how to implement a simple
> >>>>>> callback
> >>>>>>>   might get a bit lost trying to figure out how to wire it up. A
> >> simple
> >>>>>> five
> >>>>>>>   line example in the docs would probably help a lot. Not sure if
> >> this
> >>>>>> is
> >>>>>>>   best addressed in this KIP or is a side comment.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> -Jay
> >>>>>>>
> >>>>>>> On Fri, Feb 3, 2017 at 3:33 PM, Matthias J. Sax <
> >> matth...@confluent.io
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> I did prepare a KIP to do some cleanup some of Kafka's Streaming
> >> API.
> >>>>>>>>
> >>>>>>>> Please have a look here:
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>> 120%3A+Cleanup+Kafka+Streams+builder+API
> >>>>>>>>
> >>>>>>>> Looking forward to your feedback!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> >>
> >
> >
>
>
>
>

Re: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Reply via email to