\cc users list
-------- Forwarded Message -------- Subject: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API Date: Mon, 20 Mar 2017 11:51:01 -0700 From: Matthias J. Sax <matth...@confluent.io> Organization: Confluent Inc To: d...@kafka.apache.org I want to push this discussion further. Guozhang's argument about "exposing" the Topology class is valid. It's a public class anyway, so it's not as issue. However, I think the question is not too much about exposing but about "advertising" (ie, putting it into the focus) or not at DSL level. If I interpret the last replies correctly, it seems that we could agree on "StreamsBuilder" as name. I did update the KIP accordingly. Please correct me, if I got this wrong. If there are not other objects -- this naming discussion was the last open point to far -- I would like the start the VOTE thread. -Matthias On 3/14/17 2:37 PM, Guozhang Wang wrote: > I'd like to keep the term "Topology" inside the builder class since, as > Matthias mentioned, this builder#build() function returns a "Topology" > object, whose type is a public class anyways. Although you can argue to let > users always call > > "new KafkaStreams(builder.build())" > > I think it is still more benefit to expose this concept. > > > > Guozhang > > On Tue, Mar 14, 2017 at 10:43 AM, Matthias J. Sax <matth...@confluent.io> > wrote: > >> Thanks for your input Michael. >> >>>> - KafkaStreams as the new name for the builder that creates the logical >>>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and >>>> `KafkaStreams.table("input-topic")`. >> >> I don't thinks this is a good idea, for multiple reasons: >> >> (1) We would reuse a name for a completely different purpose. The same >> argument for not renaming KStreamBuilder to TopologyBuilder. The >> confusion would just be too large. >> >> So if we would start from scratch, it might be ok to do so, but now we >> cannot make this move, IMHO. >> >> Also a clarification question: do you suggest to have static methods >> #stream and #table -- I am not sure if this would work? >> (or was you code snippet just simplification?) >> >> >> (2) Kafka Streams is basically a "processing client" next to consumer >> and producer client. Thus, the name KafkaStreams aligns to the naming >> schema of KafkaConsumer and KafkaProducer. I am not sure if it would be >> a good choice to "break" this naming scheme. >> >> Btw: this is also the reason, why we have KafkaStreams#close() -- and >> not KafkaStreams#stop() -- because #close() aligns with consumer and >> producer client. >> >> >> (3) On more argument against using KafkaStreams as DSL entry class would >> be, that it would need to create a Topology that can be given to the >> "runner/processing-client". Thus the pattern would be >> >>> Topology topology = streams.build(); >>> KafkaStramsRunner runner = new KafkaStreamsRunner(..., topology) >> >> (or of course as a one liner). >> >> >> >> On the other hand, there was the idea (that we intentionally excluded >> from the KIP), to change the "client instantiation" pattern. >> >> Right now, a new client in actively instantiated (ie, by calling "new") >> and the topology if provided as a constructor argument. However, >> especially for DSL (not sure if it would make sense for PAPI), the DSL >> builder could create the client for the user. >> >> Something like this: >> >>> KStreamBuilder builder = new KStreamBuilder(); >>> builder.whatever() // use the builder >>> >>> StreamsConfig config = .... >>> KafkaStreams streams = builder.getKafkaStreams(config); >> >> If we change the patter like this, the notion a the "DSL builder" would >> change, as it does not create a topology anymore, but it creates the >> "processing client". This would address Jay's concern about "not >> exposing concept users don't need the understand" and would not require >> to include the word "Topology" in the DSL builder class name, because >> the builder does not build a Topology anymore. >> >> I just put some names that came to my mind first hand -- did not think >> about good names. It's just to discuss the pattern. >> >> >> >> -Matthias >> >> >> >> >> >> On 3/14/17 3:36 AM, Michael Noll wrote: >>> I see Jay's point, and I agree with much of it -- notably about being >>> careful which concepts we do and do not expose, depending on which user >>> group / user type is affected. That said, I'm not sure yet whether or >> not >>> we should get rid of "Topology" (or a similar term) in the DSL. >>> >>> For what it's worth, here's how related technologies define/name their >>> "topologies" and "builders". Note that, in all cases, it's about >>> constructing a logical processing plan, which then is being executed/run. >>> >>> - `Pipeline` (Google Dataflow/Apache Beam) >>> - To add a source you first instantiate the Source (e.g. >>> `TextIO.Read.from("gs://some/inputData.txt")`), >>> then attach it to your processing plan via >> `Pipeline#apply(<source>)`. >>> This setup is a bit different to our DSL because in our DSL the >>> builder does both, i.e. >>> instantiating + auto-attaching to itself. >>> - To execute the processing plan you call `Pipeline#execute()`. >>> - `StreamingContext`` (Spark): This setup is similar to our DSL. >>> - To add a source you call e.g. >>> `StreamingContext#socketTextStream("localhost", 9999)`. >>> - To execute the processing plan you call >> `StreamingContext#execute()`. >>> - `StreamExecutionEnvironment` (Flink): This setup is similar to our DSL. >>> - To add a source you call e.g. >>> `StreamExecutionEnvironment#socketTextStream("localhost", 9999)`. >>> - To execute the processing plan you call >>> `StreamExecutionEnvironment#execute()`. >>> - `Graph`/`Flow` (Akka Streams), as a result of composing Sources (~ >>> `KStreamBuilder.stream()`) and Sinks (~ `KStream#to()`) >>> into Flows, which are [Runnable]Graphs. >>> - You instantiate a Source directly, and then compose the Source with >>> Sinks to create a RunnableGraph: >>> see signature `Source#to[Mat2](sink: Graph[SinkShape[Out], Mat2]): >>> RunnableGraph[Mat]`. >>> - To execute the processing plan you call `Flow#run()`. >>> >>> In our DSL, in comparison, we do: >>> >>> - `KStreamBuilder` (Kafka Streams API) >>> - To add a source you call e.g. `KStreamBuilder#stream("input- >> topic")`. >>> - To execute the processing plan you create a `KafkaStreams` instance >>> from `KStreamBuilder` >>> (where the builder will instantiate the topology = processing plan >> to >>> be executed), and then >>> call `KafkaStreams#start()`. Think of `KafkaStreams` as our >> runner. >>> >>> First, I agree with the sentiment that the current name of >> `KStreamBuilder` >>> isn't great (which is why we're having this discussion). Also, that >>> finding a good name is tricky. ;-) >>> >>> Second, even though I agree with many of Jay's points I'm not sure >> whether >>> I like the `StreamsBuilder` suggestion (i.e. any name that does not >> include >>> "topology" or a similar term) that much more. It still doesn't describe >>> what that class actually does, and what the difference to `KafkaStreams` >>> is. IMHO, the point of `KStreamBuilder` is that it lets you build a >>> logical plan (what we call "topology"), and `KafkaStreams` is the thing >>> that executes that plan. I'm not yet convinced that abstracting these >> two >>> points away from the user is a good idea if the argument is that it's >>> potentially confusing to beginners (a claim which I am not sure is >> actually >>> true). >>> >>> That said, if we rather favor "good-sounding but perhaps less technically >>> correct names", I'd argue we should not even use something like >> "Builder". >>> We could, for example, also pick the following names: >>> >>> - KafkaStreams as the new name for the builder that creates the logical >>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and >>> `KafkaStreams.table("input-topic")`. >>> - KafkaStreamsRunner as the new name for the executioner of the plan, >> with >>> `KafkaStreamsRunner(KafkaStreams).run()`. >>> >>> >>> >>> On Tue, Mar 14, 2017 at 5:56 AM, Sriram Subramanian <r...@confluent.io> >>> wrote: >>> >>>> StreamsBuilder would be my vote. >>>> >>>>> On Mar 13, 2017, at 9:42 PM, Jay Kreps <j...@confluent.io> wrote: >>>>> >>>>> Hey Matthias, >>>>> >>>>> Make sense, I'm more advocating for removing the word topology than any >>>>> particular new replacement. >>>>> >>>>> -Jay >>>>> >>>>> On Mon, Mar 13, 2017 at 12:30 PM, Matthias J. Sax < >> matth...@confluent.io >>>>> >>>>> wrote: >>>>> >>>>>> Jay, >>>>>> >>>>>> thanks for your feedback >>>>>> >>>>>>> What if instead we called it KStreamsBuilder? >>>>>> >>>>>> That's the current name and I personally think it's not the best one. >>>>>> The main reason why I don't like KStreamsBuilder is, that we have the >>>>>> concepts of KStreams and KTables, and the builder creates both. >> However, >>>>>> the name puts he focus on KStream and devalues KTable. >>>>>> >>>>>> I understand your argument, and I am personally open the remove the >>>>>> "Topology" part, and name it "StreamsBuilder". Not sure what others >>>>>> think about this. >>>>>> >>>>>> >>>>>> About Processor API: I like the idea in general, but I thinks it's out >>>>>> of scope for this KIP. KIP-120 has the focus on removing leaking >>>>>> internal APIs and do some cleanup how our API reflects some concepts. >>>>>> >>>>>> However, I added your idea to API discussion Wiki page and we take if >>>>>> from there: >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/ >>>>>> Kafka+Streams+Discussions >>>>>> >>>>>> >>>>>> >>>>>> -Matthias >>>>>> >>>>>> >>>>>>> On 3/13/17 11:52 AM, Jay Kreps wrote: >>>>>>> Two things: >>>>>>> >>>>>>> 1. This is a minor thing but the proposed new name for >> KStreamBuilder >>>>>>> is StreamsTopologyBuilder. I actually think we should not put >>>>>> topology in >>>>>>> the name as topology is not a concept you need to understand at the >>>>>>> kstreams layer right now. I'd think of three categories of >> concepts: >>>>>> (1) >>>>>>> concepts you need to understand to get going even for a simple >>>>>> example, (2) >>>>>>> concepts you need to understand to operate and debug a real >>>>>> production app, >>>>>>> (3) concepts we truly abstract and you don't need to ever >> understand. >>>>>> I >>>>>>> think in the kstream layer topologies are currently category (2), >> and >>>>>> this >>>>>>> is where they belong. By introducing the name in even the simplest >>>>>> example >>>>>>> it means the user has to go read about toplogies to really >> understand >>>>>> even >>>>>>> this simple snippet. What if instead we called it KStreamsBuilder? >>>>>>> 2. For the processor api, I think this api is mostly not for end >>>>>> users. >>>>>>> However this are a couple cases where it might make sense to expose >>>>>> it. I >>>>>>> think users coming from Samza, or JMS's MessageListener ( >>>>>>> https://docs.oracle.com/javaee/7/api/javax/jms/ >> MessageListener.html) >>>>>>> understand a simple callback interface for message processing. In >>>>>> fact, >>>>>>> people often ask why Kafka's consumer doesn't provide such an >>>>>> interface. >>>>>>> I'd argue we do, it's KafkaStreams. The only issue is that the >>>>>> processor >>>>>>> API documentation is a bit scary for a person implementing this >> type >>>>>> of >>>>>>> api. My observation is that people using this style of API don't >> do a >>>>>> lot >>>>>>> of cross-message operations, then just do single message operations >>>>>> and use >>>>>>> a database for anything that spans messages. They also don't factor >>>>>> their >>>>>>> code into many MessageListeners and compose them, they just have >> one >>>>>>> listener that has the complete handling logic. Say I am a user who >>>>>> wants to >>>>>>> implement a single Processor in this style. Do we have an easy way >> to >>>>>> do >>>>>>> that today (either with the .transform/.process methods in kstreams >>>>>> or with >>>>>>> the topology apis)? Is there anything we can do in the way of >> trivial >>>>>>> helper code to make this better? Also, how can we explain that >>>>>> pattern to >>>>>>> people? I think currently we have pretty in-depth docs on our apis >>>>>> but I >>>>>>> suspect a person trying to figure out how to implement a simple >>>>>> callback >>>>>>> might get a bit lost trying to figure out how to wire it up. A >> simple >>>>>> five >>>>>>> line example in the docs would probably help a lot. Not sure if >> this >>>>>> is >>>>>>> best addressed in this KIP or is a side comment. >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> -Jay >>>>>>> >>>>>>> On Fri, Feb 3, 2017 at 3:33 PM, Matthias J. Sax < >> matth...@confluent.io >>>>> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi All, >>>>>>>> >>>>>>>> I did prepare a KIP to do some cleanup some of Kafka's Streaming >> API. >>>>>>>> >>>>>>>> Please have a look here: >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP- >>>>>>>> 120%3A+Cleanup+Kafka+Streams+builder+API >>>>>>>> >>>>>>>> Looking forward to your feedback! >>>>>>>> >>>>>>>> >>>>>>>> -Matthias >>>>>> >>>>>> >>>> >>> >> >> > >
signature.asc
Description: OpenPGP digital signature