Thanks for your input, Michael.

>> - KafkaStreams as the new name for the builder that creates the logical
>> plan, with e.g. `KafkaStreams.stream("input-topic")` and
>> `KafkaStreams.table("input-topic")`.
I don't think this is a good idea, for multiple reasons:

(1) We would reuse a name for a completely different purpose -- the same
argument as for not renaming KStreamBuilder to TopologyBuilder. The
confusion would just be too large. If we were starting from scratch it
might be ok, but now we cannot make this move, IMHO.

Also, a clarification question: do you suggest having static methods
#stream and #table? I am not sure if this would work (or was your code
snippet just a simplification?).

(2) Kafka Streams is basically a "processing client" next to the consumer
and producer clients. Thus, the name KafkaStreams aligns with the naming
scheme of KafkaConsumer and KafkaProducer. I am not sure it would be a
good choice to break this naming scheme. Btw: this is also the reason why
we have KafkaStreams#close() -- and not KafkaStreams#stop() -- because
#close() aligns with the consumer and producer clients.

(3) One more argument against using KafkaStreams as the DSL entry class is
that it would need to create a Topology that can be given to the
"runner/processing-client". Thus the pattern would be:

> Topology topology = streams.build();
> KafkaStreamsRunner runner = new KafkaStreamsRunner(..., topology)

(or of course as a one-liner).

On the other hand, there was the idea (that we intentionally excluded from
the KIP) to change the "client instantiation" pattern. Right now, a new
client is actively instantiated (i.e., by calling "new") and the topology
is provided as a constructor argument. However, especially for the DSL
(not sure if it would make sense for the PAPI), the DSL builder could
create the client for the user. Something like this:

> KStreamBuilder builder = new KStreamBuilder();
> builder.whatever() // use the builder
>
> StreamsConfig config = ....
> KafkaStreams streams = builder.getKafkaStreams(config);

If we change the pattern like this, the notion of the "DSL builder" would
change: it would not create a topology anymore, but the "processing
client". This would address Jay's concern about not exposing concepts
users don't need to understand, and it would not require including the
word "Topology" in the DSL builder class name, because the builder would
not build a Topology anymore.

I just put down the first names that came to mind -- I did not think about
good names. It's just to discuss the pattern.
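To make this concrete, here is a rough end-to-end sketch of both
instantiation patterns with the API as it exists today (a sketch only:
the topic names, the application id, and in particular the
getKafkaStreams() method are made up for this discussion and do not
exist in the current API):

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class InstantiationPatternSketch {

        public static void main(final String[] args) {
            final Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pattern-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            final StreamsConfig config = new StreamsConfig(props);

            // Current pattern: the builder describes the logical plan, and
            // the user actively instantiates the processing client via "new".
            final KStreamBuilder builder = new KStreamBuilder();
            builder.stream("input-topic").to("output-topic");
            final KafkaStreams streams = new KafkaStreams(builder, config);
            streams.start();

            // Alternative pattern sketched above (hypothetical -- a method
            // like getKafkaStreams() does not exist today): the DSL builder
            // creates the processing client itself, so no topology-like
            // object ever surfaces to the user.
            //
            //   final KStreamBuilder builder2 = new KStreamBuilder();
            //   builder2.stream("input-topic").to("output-topic");
            //   final KafkaStreams streams2 = builder2.getKafkaStreams(config);
            //   streams2.start();
        }
    }

If we went with the second pattern, the open question would be what a good
name for that builder is, since "TopologyBuilder"-style names would no
longer fit.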
-Matthias


On 3/14/17 3:36 AM, Michael Noll wrote:
> I see Jay's point, and I agree with much of it -- notably about being
> careful which concepts we do and do not expose, depending on which user
> group / user type is affected. That said, I'm not sure yet whether or not
> we should get rid of "Topology" (or a similar term) in the DSL.
>
> For what it's worth, here's how related technologies define/name their
> "topologies" and "builders". Note that, in all cases, it's about
> constructing a logical processing plan, which then is being executed/run.
>
> - `Pipeline` (Google Dataflow/Apache Beam)
>   - To add a source you first instantiate the Source (e.g.
>     `TextIO.Read.from("gs://some/inputData.txt")`), then attach it to your
>     processing plan via `Pipeline#apply(<source>)`. This setup is a bit
>     different to our DSL because in our DSL the builder does both, i.e.
>     instantiating + auto-attaching to itself.
>   - To execute the processing plan you call `Pipeline#execute()`.
> - `StreamingContext` (Spark): This setup is similar to our DSL.
>   - To add a source you call e.g.
>     `StreamingContext#socketTextStream("localhost", 9999)`.
>   - To execute the processing plan you call `StreamingContext#execute()`.
> - `StreamExecutionEnvironment` (Flink): This setup is similar to our DSL.
>   - To add a source you call e.g.
>     `StreamExecutionEnvironment#socketTextStream("localhost", 9999)`.
>   - To execute the processing plan you call
>     `StreamExecutionEnvironment#execute()`.
> - `Graph`/`Flow` (Akka Streams), as a result of composing Sources
>   (~ `KStreamBuilder.stream()`) and Sinks (~ `KStream#to()`) into Flows,
>   which are [Runnable]Graphs.
>   - You instantiate a Source directly, and then compose the Source with
>     Sinks to create a RunnableGraph: see signature
>     `Source#to[Mat2](sink: Graph[SinkShape[Out], Mat2]): RunnableGraph[Mat]`.
>   - To execute the processing plan you call `Flow#run()`.
>
> In our DSL, in comparison, we do:
>
> - `KStreamBuilder` (Kafka Streams API)
>   - To add a source you call e.g. `KStreamBuilder#stream("input-topic")`.
>   - To execute the processing plan you create a `KafkaStreams` instance
>     from `KStreamBuilder` (where the builder will instantiate the topology
>     = processing plan to be executed), and then call `KafkaStreams#start()`.
>     Think of `KafkaStreams` as our runner.
>
> First, I agree with the sentiment that the current name of `KStreamBuilder`
> isn't great (which is why we're having this discussion). Also, that
> finding a good name is tricky. ;-)
>
> Second, even though I agree with many of Jay's points I'm not sure whether
> I like the `StreamsBuilder` suggestion (i.e. any name that does not include
> "topology" or a similar term) that much more. It still doesn't describe
> what that class actually does, and what the difference to `KafkaStreams`
> is. IMHO, the point of `KStreamBuilder` is that it lets you build a
> logical plan (what we call "topology"), and `KafkaStreams` is the thing
> that executes that plan. I'm not yet convinced that abstracting these two
> points away from the user is a good idea if the argument is that it's
> potentially confusing to beginners (a claim which I am not sure is
> actually true).
>
> That said, if we rather favor "good-sounding but perhaps less technically
> correct names", I'd argue we should not even use something like "Builder".
> We could, for example, also pick the following names:
>
> - KafkaStreams as the new name for the builder that creates the logical
>   plan, with e.g. `KafkaStreams.stream("input-topic")` and
>   `KafkaStreams.table("input-topic")`.
> - KafkaStreamsRunner as the new name for the executor of the plan, with
>   `KafkaStreamsRunner(KafkaStreams).run()`.
>
>
> On Tue, Mar 14, 2017 at 5:56 AM, Sriram Subramanian <r...@confluent.io>
> wrote:
>
>> StreamsBuilder would be my vote.
>>
>>> On Mar 13, 2017, at 9:42 PM, Jay Kreps <j...@confluent.io> wrote:
>>>
>>> Hey Matthias,
>>>
>>> Makes sense, I'm more advocating for removing the word topology than any
>>> particular new replacement.
>>>
>>> -Jay
>>>
>>> On Mon, Mar 13, 2017 at 12:30 PM, Matthias J. Sax <matth...@confluent.io>
>>> wrote:
>>>
>>>> Jay,
>>>>
>>>> thanks for your feedback.
>>>>
>>>>> What if instead we called it KStreamsBuilder?
>>>>
>>>> That's the current name and I personally think it's not the best one.
>>>> The main reason why I don't like KStreamsBuilder is that we have the
>>>> concepts of KStreams and KTables, and the builder creates both.
>>>> However, the name puts the focus on KStream and devalues KTable.
>>>>
>>>> I understand your argument, and I am personally open to removing the
>>>> "Topology" part and naming it "StreamsBuilder".
>>>> Not sure what others think about this.
>>>>
>>>> About the Processor API: I like the idea in general, but I think it's
>>>> out of scope for this KIP. KIP-120 focuses on removing leaking internal
>>>> APIs and cleaning up how our API reflects some concepts.
>>>>
>>>> However, I added your idea to the API discussion wiki page and we can
>>>> take it from there:
>>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Discussions
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>>> On 3/13/17 11:52 AM, Jay Kreps wrote:
>>>>> Two things:
>>>>>
>>>>> 1. This is a minor thing, but the proposed new name for KStreamBuilder
>>>>> is StreamsTopologyBuilder. I actually think we should not put topology
>>>>> in the name, as topology is not a concept you need to understand at
>>>>> the kstreams layer right now. I'd think of three categories of
>>>>> concepts: (1) concepts you need to understand to get going even for a
>>>>> simple example, (2) concepts you need to understand to operate and
>>>>> debug a real production app, (3) concepts we truly abstract and you
>>>>> don't need to ever understand. I think in the kstream layer topologies
>>>>> are currently category (2), and this is where they belong. By
>>>>> introducing the name in even the simplest example it means the user
>>>>> has to go read about topologies to really understand even this simple
>>>>> snippet. What if instead we called it KStreamsBuilder?
>>>>> 2. For the processor api, I think this api is mostly not for end
>>>>> users. However there are a couple of cases where it might make sense
>>>>> to expose it. I think users coming from Samza, or JMS's MessageListener
>>>>> (https://docs.oracle.com/javaee/7/api/javax/jms/MessageListener.html),
>>>>> understand a simple callback interface for message processing. In
>>>>> fact, people often ask why Kafka's consumer doesn't provide such an
>>>>> interface. I'd argue we do, it's KafkaStreams. The only issue is that
>>>>> the processor API documentation is a bit scary for a person
>>>>> implementing this type of api. My observation is that people using
>>>>> this style of API don't do a lot of cross-message operations; they
>>>>> just do single-message operations and use a database for anything that
>>>>> spans messages. They also don't factor their code into many
>>>>> MessageListeners and compose them, they just have one listener that
>>>>> has the complete handling logic. Say I am a user who wants to
>>>>> implement a single Processor in this style. Do we have an easy way to
>>>>> do that today (either with the .transform/.process methods in kstreams
>>>>> or with the topology apis)? Is there anything we can do in the way of
>>>>> trivial helper code to make this better? Also, how can we explain that
>>>>> pattern to people? I think currently we have pretty in-depth docs on
>>>>> our apis, but I suspect a person trying to figure out how to implement
>>>>> a simple callback might get a bit lost trying to figure out how to
>>>>> wire it up. A simple five-line example in the docs would probably help
>>>>> a lot [a rough sketch follows at the end of this mail]. Not sure if
>>>>> this is best addressed in this KIP or is a side comment.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -Jay
>>>>>
>>>>> On Fri, Feb 3, 2017 at 3:33 PM, Matthias J. Sax <matth...@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I did prepare a KIP to clean up some of Kafka's Streams API.
>>>>>>
>>>>>> Please have a look here:
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-120%3A+Cleanup+Kafka+Streams+builder+API
>>>>>>
>>>>>> Looking forward to your feedback!
>>>>>>
>>>>>>
>>>>>> -Matthias
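Regarding Jay's point (2) above about a simple, MessageListener-style
callback: with the current API, a "single Processor" app wired up via
KStream#process() could look roughly like the sketch below (the topic
name, application id, and wrapper class are made up for illustration, and
this is just one possible way to wire it up):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.processor.Processor;
    import org.apache.kafka.streams.processor.ProcessorContext;

    public class SingleCallbackSketch {

        public static void main(final String[] args) {
            final KStreamBuilder builder = new KStreamBuilder();

            // One "listener" that holds the complete per-message handling
            // logic, attached via KStream#process().
            builder.stream(Serdes.String(), Serdes.String(), "input-topic")
                   .process(() -> new Processor<String, String>() {
                       @Override
                       public void init(final ProcessorContext context) { }

                       @Override
                       public void process(final String key, final String value) {
                           // per-message logic goes here, e.g. look up or
                           // write to a database
                           System.out.println(key + " -> " + value);
                       }

                       @Override
                       public void punctuate(final long timestamp) { }

                       @Override
                       public void close() { }
                   });

            final Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "callback-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            new KafkaStreams(builder, new StreamsConfig(props)).start();
        }
    }

Whether some trivial helper could hide the empty init/punctuate/close
methods is probably the interesting follow-up to Jay's question.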