Re: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Guozhang Wang Tue, 21 Mar 2017 15:07:37 -0700

Just to clarify, I did want to have the term `Topology` as part of the
class name, for the reasons above. I'm not too worried about to be
consistent with the previous names, but I feel the `XXTopologyBuilder` is
better than `XXStreamsBuilder` since it's build() function returns a
Topology object.



Guozhang


On Mon, Mar 20, 2017 at 12:53 PM, Michael Noll <mich...@confluent.io> wrote:

> Hmm, I must admit I don't like this last update all too much.
>
> Basically we would have:
>
>     StreamsBuilder builder = new StreamsBuilder();
>
>     // And here you'd define your...well, what actually?
>     // Ah right, you are composing a topology here, though you are not
> aware of it.
>
>     KafkaStreams streams = new KafkaStreams(builder.build(),
> streamsConfiguration);
>
> So what are you building here with StreamsBuilder?  Streams (hint: No)?
> And what about tables -- is there a TableBuilder (hint: No)?
>
> I also interpret Guozhang's last response as that he'd prefer to have
> "Topology" in the class/interface names.  I am aware that we shouldn't
> necessarily use the status quo to make decisions about future changes, but
> the very first concept we explain in the Kafka Streams documentation is
> "Stream Processing Topology":
> https://kafka.apache.org/0102/documentation/streams#streams_concepts
>
> -Michael
>
>
>
> On Mon, Mar 20, 2017 at 7:55 PM, Matthias J. Sax <matth...@confluent.io>
> wrote:
>
> > \cc users list
> >
> >
> > -------- Forwarded Message --------
> > Subject: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API
> > Date: Mon, 20 Mar 2017 11:51:01 -0700
> > From: Matthias J. Sax <matth...@confluent.io>
> > Organization: Confluent Inc
> > To: d...@kafka.apache.org
> >
> > I want to push this discussion further.
> >
> > Guozhang's argument about "exposing" the Topology class is valid. It's a
> > public class anyway, so it's not as issue. However, I think the question
> > is not too much about exposing but about "advertising" (ie, putting it
> > into the focus) or not at DSL level.
> >
> >
> > If I interpret the last replies correctly, it seems that we could agree
> > on "StreamsBuilder" as name. I did update the KIP accordingly. Please
> > correct me, if I got this wrong.
> >
> >
> > If there are not other objects -- this naming discussion was the last
> > open point to far -- I would like the start the VOTE thread.
> >
> >
> > -Matthias
> >
> >
> > On 3/14/17 2:37 PM, Guozhang Wang wrote:
> > > I'd like to keep the term "Topology" inside the builder class since, as
> > > Matthias mentioned, this builder#build() function returns a "Topology"
> > > object, whose type is a public class anyways. Although you can argue to
> > let
> > > users always call
> > >
> > > "new KafkaStreams(builder.build())"
> > >
> > > I think it is still more benefit to expose this concept.
> > >
> > >
> > >
> > > Guozhang
> > >
> > > On Tue, Mar 14, 2017 at 10:43 AM, Matthias J. Sax <
> matth...@confluent.io
> > >
> > > wrote:
> > >
> > >> Thanks for your input Michael.
> > >>
> > >>>> - KafkaStreams as the new name for the builder that creates the
> > logical
> > >>>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and
> > >>>> `KafkaStreams.table("input-topic")`.
> > >>
> > >> I don't thinks this is a good idea, for multiple reasons:
> > >>
> > >> (1) We would reuse a name for a completely different purpose. The same
> > >> argument for not renaming KStreamBuilder to TopologyBuilder. The
> > >> confusion would just be too large.
> > >>
> > >> So if we would start from scratch, it might be ok to do so, but now we
> > >> cannot make this move, IMHO.
> > >>
> > >> Also a clarification question: do you suggest to have static methods
> > >> #stream and #table -- I am not sure if this would work?
> > >> (or was you code snippet just simplification?)
> > >>
> > >>
> > >> (2) Kafka Streams is basically a "processing client" next to consumer
> > >> and producer client. Thus, the name KafkaStreams aligns to the naming
> > >> schema of KafkaConsumer and KafkaProducer. I am not sure if it would
> be
> > >> a good choice to "break" this naming scheme.
> > >>
> > >> Btw: this is also the reason, why we have KafkaStreams#close() -- and
> > >> not KafkaStreams#stop() -- because #close() aligns with consumer and
> > >> producer client.
> > >>
> > >>
> > >> (3) On more argument against using KafkaStreams as DSL entry class
> would
> > >> be, that it would need to create a Topology that can be given to the
> > >> "runner/processing-client". Thus the pattern would be
> > >>
> > >>> Topology topology = streams.build();
> > >>> KafkaStramsRunner runner = new KafkaStreamsRunner(..., topology)
> > >>
> > >> (or of course as a one liner).
> > >>
> > >>
> > >>
> > >> On the other hand, there was the idea (that we intentionally excluded
> > >> from the KIP), to change the "client instantiation" pattern.
> > >>
> > >> Right now, a new client in actively instantiated (ie, by calling
> "new")
> > >> and the topology if provided as a constructor argument. However,
> > >> especially for DSL (not sure if it would make sense for PAPI), the DSL
> > >> builder could create the client for the user.
> > >>
> > >> Something like this:
> > >>
> > >>> KStreamBuilder builder = new KStreamBuilder();
> > >>> builder.whatever() // use the builder
> > >>>
> > >>> StreamsConfig config = ....
> > >>> KafkaStreams streams = builder.getKafkaStreams(config);
> > >>
> > >> If we change the patter like this, the notion a the "DSL builder"
> would
> > >> change, as it does not create a topology anymore, but it creates the
> > >> "processing client". This would address Jay's concern about "not
> > >> exposing concept users don't need the understand" and would not
> require
> > >> to include the word "Topology" in the DSL builder class name, because
> > >> the builder does not build a Topology anymore.
> > >>
> > >> I just put some names that came to my mind first hand -- did not think
> > >> about good names. It's just to discuss the pattern.
> > >>
> > >>
> > >>
> > >> -Matthias
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 3/14/17 3:36 AM, Michael Noll wrote:
> > >>> I see Jay's point, and I agree with much of it -- notably about being
> > >>> careful which concepts we do and do not expose, depending on which
> user
> > >>> group / user type is affected.  That said, I'm not sure yet whether
> or
> > >> not
> > >>> we should get rid of "Topology" (or a similar term) in the DSL.
> > >>>
> > >>> For what it's worth, here's how related technologies define/name
> their
> > >>> "topologies" and "builders".  Note that, in all cases, it's about
> > >>> constructing a logical processing plan, which then is being
> > executed/run.
> > >>>
> > >>> - `Pipeline` (Google Dataflow/Apache Beam)
> > >>>     - To add a source you first instantiate the Source (e.g.
> > >>> `TextIO.Read.from("gs://some/inputData.txt")`),
> > >>>       then attach it to your processing plan via
> > >> `Pipeline#apply(<source>)`.
> > >>>       This setup is a bit different to our DSL because in our DSL the
> > >>> builder does both, i.e.
> > >>>       instantiating + auto-attaching to itself.
> > >>>     - To execute the processing plan you call `Pipeline#execute()`.
> > >>> - `StreamingContext`` (Spark): This setup is similar to our DSL.
> > >>>     - To add a source you call e.g.
> > >>> `StreamingContext#socketTextStream("localhost", 9999)`.
> > >>>     - To execute the processing plan you call
> > >> `StreamingContext#execute()`.
> > >>> - `StreamExecutionEnvironment` (Flink): This setup is similar to our
> > DSL.
> > >>>     - To add a source you call e.g.
> > >>> `StreamExecutionEnvironment#socketTextStream("localhost", 9999)`.
> > >>>     - To execute the processing plan you call
> > >>> `StreamExecutionEnvironment#execute()`.
> > >>> - `Graph`/`Flow` (Akka Streams), as a result of composing Sources (~
> > >>> `KStreamBuilder.stream()`) and Sinks (~ `KStream#to()`)
> > >>>   into Flows, which are [Runnable]Graphs.
> > >>>     - You instantiate a Source directly, and then compose the Source
> > with
> > >>> Sinks to create a RunnableGraph:
> > >>>       see signature `Source#to[Mat2](sink: Graph[SinkShape[Out],
> > Mat2]):
> > >>> RunnableGraph[Mat]`.
> > >>>     - To execute the processing plan you call `Flow#run()`.
> > >>>
> > >>> In our DSL, in comparison, we do:
> > >>>
> > >>> - `KStreamBuilder` (Kafka Streams API)
> > >>>     - To add a source you call e.g. `KStreamBuilder#stream("input-
> > >> topic")`.
> > >>>     - To execute the processing plan you create a `KafkaStreams`
> > instance
> > >>> from `KStreamBuilder`
> > >>>       (where the builder will instantiate the topology = processing
> > plan
> > >> to
> > >>> be executed), and then
> > >>>       call `KafkaStreams#start()`.  Think of `KafkaStreams` as our
> > >> runner.
> > >>>
> > >>> First, I agree with the sentiment that the current name of
> > >> `KStreamBuilder`
> > >>> isn't great (which is why we're having this discussion).  Also, that
> > >>> finding a good name is tricky. ;-)
> > >>>
> > >>> Second, even though I agree with many of Jay's points I'm not sure
> > >> whether
> > >>> I like the `StreamsBuilder` suggestion (i.e. any name that does not
> > >> include
> > >>> "topology" or a similar term) that much more.  It still doesn't
> > describe
> > >>> what that class actually does, and what the difference to
> > `KafkaStreams`
> > >>> is.  IMHO, the point of `KStreamBuilder` is that it lets you build a
> > >>> logical plan (what we call "topology"), and `KafkaStreams` is the
> thing
> > >>> that executes that plan.  I'm not yet convinced that abstracting
> these
> > >> two
> > >>> points away from the user is a good idea if the argument is that it's
> > >>> potentially confusing to beginners (a claim which I am not sure is
> > >> actually
> > >>> true).
> > >>>
> > >>> That said, if we rather favor "good-sounding but perhaps less
> > technically
> > >>> correct names", I'd argue we should not even use something like
> > >> "Builder".
> > >>> We could, for example, also pick the following names:
> > >>>
> > >>> - KafkaStreams as the new name for the builder that creates the
> logical
> > >>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and
> > >>> `KafkaStreams.table("input-topic")`.
> > >>> - KafkaStreamsRunner as the new name for the executioner of the plan,
> > >> with
> > >>> `KafkaStreamsRunner(KafkaStreams).run()`.
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Mar 14, 2017 at 5:56 AM, Sriram Subramanian <
> r...@confluent.io>
> > >>> wrote:
> > >>>
> > >>>> StreamsBuilder would be my vote.
> > >>>>
> > >>>>> On Mar 13, 2017, at 9:42 PM, Jay Kreps <j...@confluent.io> wrote:
> > >>>>>
> > >>>>> Hey Matthias,
> > >>>>>
> > >>>>> Make sense, I'm more advocating for removing the word topology than
> > any
> > >>>>> particular new replacement.
> > >>>>>
> > >>>>> -Jay
> > >>>>>
> > >>>>> On Mon, Mar 13, 2017 at 12:30 PM, Matthias J. Sax <
> > >> matth...@confluent.io
> > >>>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Jay,
> > >>>>>>
> > >>>>>> thanks for your feedback
> > >>>>>>
> > >>>>>>> What if instead we called it KStreamsBuilder?
> > >>>>>>
> > >>>>>> That's the current name and I personally think it's not the best
> > one.
> > >>>>>> The main reason why I don't like KStreamsBuilder is, that we have
> > the
> > >>>>>> concepts of KStreams and KTables, and the builder creates both.
> > >> However,
> > >>>>>> the name puts he focus on KStream and devalues KTable.
> > >>>>>>
> > >>>>>> I understand your argument, and I am personally open the remove
> the
> > >>>>>> "Topology" part, and name it "StreamsBuilder". Not sure what
> others
> > >>>>>> think about this.
> > >>>>>>
> > >>>>>>
> > >>>>>> About Processor API: I like the idea in general, but I thinks it's
> > out
> > >>>>>> of scope for this KIP. KIP-120 has the focus on removing leaking
> > >>>>>> internal APIs and do some cleanup how our API reflects some
> > concepts.
> > >>>>>>
> > >>>>>> However, I added your idea to API discussion Wiki page and we take
> > if
> > >>>>>> from there:
> > >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/
> > >>>>>> Kafka+Streams+Discussions
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> -Matthias
> > >>>>>>
> > >>>>>>
> > >>>>>>> On 3/13/17 11:52 AM, Jay Kreps wrote:
> > >>>>>>> Two things:
> > >>>>>>>
> > >>>>>>>   1. This is a minor thing but the proposed new name for
> > >> KStreamBuilder
> > >>>>>>>   is StreamsTopologyBuilder. I actually think we should not put
> > >>>>>> topology in
> > >>>>>>>   the name as topology is not a concept you need to understand at
> > the
> > >>>>>>>   kstreams layer right now. I'd think of three categories of
> > >> concepts:
> > >>>>>> (1)
> > >>>>>>>   concepts you need to understand to get going even for a simple
> > >>>>>> example, (2)
> > >>>>>>>   concepts you need to understand to operate and debug a real
> > >>>>>> production app,
> > >>>>>>>   (3) concepts we truly abstract and you don't need to ever
> > >> understand.
> > >>>>>> I
> > >>>>>>>   think in the kstream layer topologies are currently category
> (2),
> > >> and
> > >>>>>> this
> > >>>>>>>   is where they belong. By introducing the name in even the
> > simplest
> > >>>>>> example
> > >>>>>>>   it means the user has to go read about toplogies to really
> > >> understand
> > >>>>>> even
> > >>>>>>>   this simple snippet. What if instead we called it
> > KStreamsBuilder?
> > >>>>>>>   2. For the processor api, I think this api is mostly not for
> end
> > >>>>>> users.
> > >>>>>>>   However this are a couple cases where it might make sense to
> > expose
> > >>>>>> it. I
> > >>>>>>>   think users coming from Samza, or JMS's MessageListener (
> > >>>>>>>   https://docs.oracle.com/javaee/7/api/javax/jms/
> > >> MessageListener.html)
> > >>>>>>>   understand a simple callback interface for message processing.
> In
> > >>>>>> fact,
> > >>>>>>>   people often ask why Kafka's consumer doesn't provide such an
> > >>>>>> interface.
> > >>>>>>>   I'd argue we do, it's KafkaStreams. The only issue is that the
> > >>>>>> processor
> > >>>>>>>   API documentation is a bit scary for a person implementing this
> > >> type
> > >>>>>> of
> > >>>>>>>   api. My observation is that people using this style of API
> don't
> > >> do a
> > >>>>>> lot
> > >>>>>>>   of cross-message operations, then just do single message
> > operations
> > >>>>>> and use
> > >>>>>>>   a database for anything that spans messages. They also don't
> > factor
> > >>>>>> their
> > >>>>>>>   code into many MessageListeners and compose them, they just
> have
> > >> one
> > >>>>>>>   listener that has the complete handling logic. Say I am a user
> > who
> > >>>>>> wants to
> > >>>>>>>   implement a single Processor in this style. Do we have an easy
> > way
> > >> to
> > >>>>>> do
> > >>>>>>>   that today (either with the .transform/.process methods in
> > kstreams
> > >>>>>> or with
> > >>>>>>>   the topology apis)? Is there anything we can do in the way of
> > >> trivial
> > >>>>>>>   helper code to make this better? Also, how can we explain that
> > >>>>>> pattern to
> > >>>>>>>   people? I think currently we have pretty in-depth docs on our
> > apis
> > >>>>>> but I
> > >>>>>>>   suspect a person trying to figure out how to implement a simple
> > >>>>>> callback
> > >>>>>>>   might get a bit lost trying to figure out how to wire it up. A
> > >> simple
> > >>>>>> five
> > >>>>>>>   line example in the docs would probably help a lot. Not sure if
> > >> this
> > >>>>>> is
> > >>>>>>>   best addressed in this KIP or is a side comment.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>>
> > >>>>>>> -Jay
> > >>>>>>>
> > >>>>>>> On Fri, Feb 3, 2017 at 3:33 PM, Matthias J. Sax <
> > >> matth...@confluent.io
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi All,
> > >>>>>>>>
> > >>>>>>>> I did prepare a KIP to do some cleanup some of Kafka's Streaming
> > >> API.
> > >>>>>>>>
> > >>>>>>>> Please have a look here:
> > >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >>>>>>>> 120%3A+Cleanup+Kafka+Streams+builder+API
> > >>>>>>>>
> > >>>>>>>> Looking forward to your feedback!
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> -Matthias
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
> >
> >
>



-- 
-- Guozhang

Re: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Reply via email to