Fwd: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Matthias J. Sax Mon, 20 Mar 2017 12:01:59 -0700

\cc users list


-------- Forwarded Message --------
Subject: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API
Date: Mon, 20 Mar 2017 11:51:01 -0700
From: Matthias J. Sax <[email protected]>
Organization: Confluent Inc
To: [email protected]

I want to push this discussion further.

Guozhang's argument about "exposing" the Topology class is valid. It's a
public class anyway, so it's not as issue. However, I think the question
is not too much about exposing but about "advertising" (ie, putting it
into the focus) or not at DSL level.


If I interpret the last replies correctly, it seems that we could agree
on "StreamsBuilder" as name. I did update the KIP accordingly. Please
correct me, if I got this wrong.


If there are not other objects -- this naming discussion was the last
open point to far -- I would like the start the VOTE thread.


-Matthias


On 3/14/17 2:37 PM, Guozhang Wang wrote:
> I'd like to keep the term "Topology" inside the builder class since, as
> Matthias mentioned, this builder#build() function returns a "Topology"
> object, whose type is a public class anyways. Although you can argue to let
> users always call
> 
> "new KafkaStreams(builder.build())"
> 
> I think it is still more benefit to expose this concept.
> 
> 
> 
> Guozhang
> 
> On Tue, Mar 14, 2017 at 10:43 AM, Matthias J. Sax <[email protected]>
> wrote:
> 
>> Thanks for your input Michael.
>>
>>>> - KafkaStreams as the new name for the builder that creates the logical
>>>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and
>>>> `KafkaStreams.table("input-topic")`.
>>
>> I don't thinks this is a good idea, for multiple reasons:
>>
>> (1) We would reuse a name for a completely different purpose. The same
>> argument for not renaming KStreamBuilder to TopologyBuilder. The
>> confusion would just be too large.
>>
>> So if we would start from scratch, it might be ok to do so, but now we
>> cannot make this move, IMHO.
>>
>> Also a clarification question: do you suggest to have static methods
>> #stream and #table -- I am not sure if this would work?
>> (or was you code snippet just simplification?)
>>
>>
>> (2) Kafka Streams is basically a "processing client" next to consumer
>> and producer client. Thus, the name KafkaStreams aligns to the naming
>> schema of KafkaConsumer and KafkaProducer. I am not sure if it would be
>> a good choice to "break" this naming scheme.
>>
>> Btw: this is also the reason, why we have KafkaStreams#close() -- and
>> not KafkaStreams#stop() -- because #close() aligns with consumer and
>> producer client.
>>
>>
>> (3) On more argument against using KafkaStreams as DSL entry class would
>> be, that it would need to create a Topology that can be given to the
>> "runner/processing-client". Thus the pattern would be
>>
>>> Topology topology = streams.build();
>>> KafkaStramsRunner runner = new KafkaStreamsRunner(..., topology)
>>
>> (or of course as a one liner).
>>
>>
>>
>> On the other hand, there was the idea (that we intentionally excluded
>> from the KIP), to change the "client instantiation" pattern.
>>
>> Right now, a new client in actively instantiated (ie, by calling "new")
>> and the topology if provided as a constructor argument. However,
>> especially for DSL (not sure if it would make sense for PAPI), the DSL
>> builder could create the client for the user.
>>
>> Something like this:
>>
>>> KStreamBuilder builder = new KStreamBuilder();
>>> builder.whatever() // use the builder
>>>
>>> StreamsConfig config = ....
>>> KafkaStreams streams = builder.getKafkaStreams(config);
>>
>> If we change the patter like this, the notion a the "DSL builder" would
>> change, as it does not create a topology anymore, but it creates the
>> "processing client". This would address Jay's concern about "not
>> exposing concept users don't need the understand" and would not require
>> to include the word "Topology" in the DSL builder class name, because
>> the builder does not build a Topology anymore.
>>
>> I just put some names that came to my mind first hand -- did not think
>> about good names. It's just to discuss the pattern.
>>
>>
>>
>> -Matthias
>>
>>
>>
>>
>>
>> On 3/14/17 3:36 AM, Michael Noll wrote:
>>> I see Jay's point, and I agree with much of it -- notably about being
>>> careful which concepts we do and do not expose, depending on which user
>>> group / user type is affected.  That said, I'm not sure yet whether or
>> not
>>> we should get rid of "Topology" (or a similar term) in the DSL.
>>>
>>> For what it's worth, here's how related technologies define/name their
>>> "topologies" and "builders".  Note that, in all cases, it's about
>>> constructing a logical processing plan, which then is being executed/run.
>>>
>>> - `Pipeline` (Google Dataflow/Apache Beam)
>>>     - To add a source you first instantiate the Source (e.g.
>>> `TextIO.Read.from("gs://some/inputData.txt")`),
>>>       then attach it to your processing plan via
>> `Pipeline#apply(<source>)`.
>>>       This setup is a bit different to our DSL because in our DSL the
>>> builder does both, i.e.
>>>       instantiating + auto-attaching to itself.
>>>     - To execute the processing plan you call `Pipeline#execute()`.
>>> - `StreamingContext`` (Spark): This setup is similar to our DSL.
>>>     - To add a source you call e.g.
>>> `StreamingContext#socketTextStream("localhost", 9999)`.
>>>     - To execute the processing plan you call
>> `StreamingContext#execute()`.
>>> - `StreamExecutionEnvironment` (Flink): This setup is similar to our DSL.
>>>     - To add a source you call e.g.
>>> `StreamExecutionEnvironment#socketTextStream("localhost", 9999)`.
>>>     - To execute the processing plan you call
>>> `StreamExecutionEnvironment#execute()`.
>>> - `Graph`/`Flow` (Akka Streams), as a result of composing Sources (~
>>> `KStreamBuilder.stream()`) and Sinks (~ `KStream#to()`)
>>>   into Flows, which are [Runnable]Graphs.
>>>     - You instantiate a Source directly, and then compose the Source with
>>> Sinks to create a RunnableGraph:
>>>       see signature `Source#to[Mat2](sink: Graph[SinkShape[Out], Mat2]):
>>> RunnableGraph[Mat]`.
>>>     - To execute the processing plan you call `Flow#run()`.
>>>
>>> In our DSL, in comparison, we do:
>>>
>>> - `KStreamBuilder` (Kafka Streams API)
>>>     - To add a source you call e.g. `KStreamBuilder#stream("input-
>> topic")`.
>>>     - To execute the processing plan you create a `KafkaStreams` instance
>>> from `KStreamBuilder`
>>>       (where the builder will instantiate the topology = processing plan
>> to
>>> be executed), and then
>>>       call `KafkaStreams#start()`.  Think of `KafkaStreams` as our
>> runner.
>>>
>>> First, I agree with the sentiment that the current name of
>> `KStreamBuilder`
>>> isn't great (which is why we're having this discussion).  Also, that
>>> finding a good name is tricky. ;-)
>>>
>>> Second, even though I agree with many of Jay's points I'm not sure
>> whether
>>> I like the `StreamsBuilder` suggestion (i.e. any name that does not
>> include
>>> "topology" or a similar term) that much more.  It still doesn't describe
>>> what that class actually does, and what the difference to `KafkaStreams`
>>> is.  IMHO, the point of `KStreamBuilder` is that it lets you build a
>>> logical plan (what we call "topology"), and `KafkaStreams` is the thing
>>> that executes that plan.  I'm not yet convinced that abstracting these
>> two
>>> points away from the user is a good idea if the argument is that it's
>>> potentially confusing to beginners (a claim which I am not sure is
>> actually
>>> true).
>>>
>>> That said, if we rather favor "good-sounding but perhaps less technically
>>> correct names", I'd argue we should not even use something like
>> "Builder".
>>> We could, for example, also pick the following names:
>>>
>>> - KafkaStreams as the new name for the builder that creates the logical
>>> plan, with e.g. `KafkaStreams.stream("intput-topic")` and
>>> `KafkaStreams.table("input-topic")`.
>>> - KafkaStreamsRunner as the new name for the executioner of the plan,
>> with
>>> `KafkaStreamsRunner(KafkaStreams).run()`.
>>>
>>>
>>>
>>> On Tue, Mar 14, 2017 at 5:56 AM, Sriram Subramanian <[email protected]>
>>> wrote:
>>>
>>>> StreamsBuilder would be my vote.
>>>>
>>>>> On Mar 13, 2017, at 9:42 PM, Jay Kreps <[email protected]> wrote:
>>>>>
>>>>> Hey Matthias,
>>>>>
>>>>> Make sense, I'm more advocating for removing the word topology than any
>>>>> particular new replacement.
>>>>>
>>>>> -Jay
>>>>>
>>>>> On Mon, Mar 13, 2017 at 12:30 PM, Matthias J. Sax <
>> [email protected]
>>>>>
>>>>> wrote:
>>>>>
>>>>>> Jay,
>>>>>>
>>>>>> thanks for your feedback
>>>>>>
>>>>>>> What if instead we called it KStreamsBuilder?
>>>>>>
>>>>>> That's the current name and I personally think it's not the best one.
>>>>>> The main reason why I don't like KStreamsBuilder is, that we have the
>>>>>> concepts of KStreams and KTables, and the builder creates both.
>> However,
>>>>>> the name puts he focus on KStream and devalues KTable.
>>>>>>
>>>>>> I understand your argument, and I am personally open the remove the
>>>>>> "Topology" part, and name it "StreamsBuilder". Not sure what others
>>>>>> think about this.
>>>>>>
>>>>>>
>>>>>> About Processor API: I like the idea in general, but I thinks it's out
>>>>>> of scope for this KIP. KIP-120 has the focus on removing leaking
>>>>>> internal APIs and do some cleanup how our API reflects some concepts.
>>>>>>
>>>>>> However, I added your idea to API discussion Wiki page and we take if
>>>>>> from there:
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/
>>>>>> Kafka+Streams+Discussions
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>>> On 3/13/17 11:52 AM, Jay Kreps wrote:
>>>>>>> Two things:
>>>>>>>
>>>>>>>   1. This is a minor thing but the proposed new name for
>> KStreamBuilder
>>>>>>>   is StreamsTopologyBuilder. I actually think we should not put
>>>>>> topology in
>>>>>>>   the name as topology is not a concept you need to understand at the
>>>>>>>   kstreams layer right now. I'd think of three categories of
>> concepts:
>>>>>> (1)
>>>>>>>   concepts you need to understand to get going even for a simple
>>>>>> example, (2)
>>>>>>>   concepts you need to understand to operate and debug a real
>>>>>> production app,
>>>>>>>   (3) concepts we truly abstract and you don't need to ever
>> understand.
>>>>>> I
>>>>>>>   think in the kstream layer topologies are currently category (2),
>> and
>>>>>> this
>>>>>>>   is where they belong. By introducing the name in even the simplest
>>>>>> example
>>>>>>>   it means the user has to go read about toplogies to really
>> understand
>>>>>> even
>>>>>>>   this simple snippet. What if instead we called it KStreamsBuilder?
>>>>>>>   2. For the processor api, I think this api is mostly not for end
>>>>>> users.
>>>>>>>   However this are a couple cases where it might make sense to expose
>>>>>> it. I
>>>>>>>   think users coming from Samza, or JMS's MessageListener (
>>>>>>>   https://docs.oracle.com/javaee/7/api/javax/jms/
>> MessageListener.html)
>>>>>>>   understand a simple callback interface for message processing. In
>>>>>> fact,
>>>>>>>   people often ask why Kafka's consumer doesn't provide such an
>>>>>> interface.
>>>>>>>   I'd argue we do, it's KafkaStreams. The only issue is that the
>>>>>> processor
>>>>>>>   API documentation is a bit scary for a person implementing this
>> type
>>>>>> of
>>>>>>>   api. My observation is that people using this style of API don't
>> do a
>>>>>> lot
>>>>>>>   of cross-message operations, then just do single message operations
>>>>>> and use
>>>>>>>   a database for anything that spans messages. They also don't factor
>>>>>> their
>>>>>>>   code into many MessageListeners and compose them, they just have
>> one
>>>>>>>   listener that has the complete handling logic. Say I am a user who
>>>>>> wants to
>>>>>>>   implement a single Processor in this style. Do we have an easy way
>> to
>>>>>> do
>>>>>>>   that today (either with the .transform/.process methods in kstreams
>>>>>> or with
>>>>>>>   the topology apis)? Is there anything we can do in the way of
>> trivial
>>>>>>>   helper code to make this better? Also, how can we explain that
>>>>>> pattern to
>>>>>>>   people? I think currently we have pretty in-depth docs on our apis
>>>>>> but I
>>>>>>>   suspect a person trying to figure out how to implement a simple
>>>>>> callback
>>>>>>>   might get a bit lost trying to figure out how to wire it up. A
>> simple
>>>>>> five
>>>>>>>   line example in the docs would probably help a lot. Not sure if
>> this
>>>>>> is
>>>>>>>   best addressed in this KIP or is a side comment.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> -Jay
>>>>>>>
>>>>>>> On Fri, Feb 3, 2017 at 3:33 PM, Matthias J. Sax <
>> [email protected]
>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I did prepare a KIP to do some cleanup some of Kafka's Streaming
>> API.
>>>>>>>>
>>>>>>>> Please have a look here:
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>> 120%3A+Cleanup+Kafka+Streams+builder+API
>>>>>>>>
>>>>>>>> Looking forward to your feedback!
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>
>>>
>>
>>
> 
>

signature.asc
Description: OpenPGP digital signature

Fwd: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Reply via email to