Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Aditya Auradkar Mon, 27 Jul 2015 22:47:38 -0700

+1 on comparison with existing solutions. On a high level, it seems nice to
have a transform library inside Kafka; a lot of the building blocks are
already there to build a stream processing framework. However, the details
are tricky to get right, and I think this discussion will get a lot more
interesting when we have something concrete to look at. I'm +1 for the
general idea.
How far away are we from having a prototype patch to play with?


Couple of observations:
- Since the input source for each processor is always Kafka, you get basic
client-side partition management out of the box if it uses the high-level
consumer.
- The KIP states that command-line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library that
makes it possible to build stream-processing-as-a-service, or does it extend
to providing such a service within Kafka itself?

Aditya

On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> Hi,
>
> Since we will be discussing KIP-28 in the call tomorrow, can you
> update the KIP with the feature-comparison with existing solutions?
> I admit that I do not see a need for single-event-producer-consumer
> pair (AKA Flume Interceptor). I've seen tons of people implement such
> apps in the last year, and it seemed easy. Now, perhaps we were doing
> it all wrong... but I'd like to know how :)
>
> If we are talking about a bigger story (i.e. DSL, real
> stream-processing, etc), that's a different discussion. I've seen a
> bunch of misconceptions about SparkStreaming in this discussion, and I
> have some thoughts in that regard, but I'd rather not go into that if
> that's outside the scope of this KIP.
>
> Gwen
>
>
> On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang <wangg...@gmail.com> wrote:
> > Hi Ewen,
> >
> > Replies inlined.
> >
> > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> >> Just some notes on the KIP doc itself:
> >>
> >> * It'd be useful to clarify at what point the plain consumer + custom code
> >> + producer breaks down. I think trivial filtering and aggregation on a
> >> single stream usually work fine with this model. Anything that needs
> >> more complex joins, windowing, etc. is where it breaks down. I think most
> >> interesting applications require that functionality, but it's helpful to
> >> make this really clear in the motivation -- right now, Kafka only provides
> >> the lowest level plumbing for stream processing applications, so most
> >> interesting apps require very heavyweight frameworks.
> >>
> >
> > I think for users to efficiently express complex logic like joins,
> > windowing, etc., a higher-level programming interface beyond the process()
> > interface would definitely be better, but that does not necessarily require
> > a "heavyweight" framework, which usually includes more than just the
> > high-level functional programming model. I would argue that a better
> > alternative for users who want some high-level programming interface but
> > not a heavyweight stream processing framework would be a solution that
> > includes the processor library plus another DSL layer on top of it.
> >
> >
> >
> >> * I think the feature comparison of plain producer/consumer, stream
> >> processing frameworks, and this new library is a good start, but we might
> >> want something more thorough and structured, like a feature matrix. Right
> >> now it's hard to figure out exactly how they relate to each other.
> >>
> >
> > Cool, I can do that.
> >
> >
> >> * I'd personally push the library vs. framework story very strongly -- the
> >> total buy-in and weak integration story of stream processing frameworks is
> >> a big downside and makes a library a really compelling (and currently
> >> unavailable, as far as I am aware) alternative.
> >>
> >
> > Are you suggesting there is still some content missing about the
> > motivations for adding the proposed library in the wiki page?
> >
> >
> >> * Comment about in-memory storage of other frameworks is interesting -- it
> >> is specific to the framework, but is supposed to also give performance
> >> benefits. The high-level functional processing interface would allow for
> >> combining multiple operations when there's no shuffle, but when there is a
> >> shuffle, we'll always be writing to Kafka, right? Spark (and presumably
> >> Spark Streaming) is supposed to get a big win by handling shuffles such
> >> that the data just stays in cache and never actually hits disk, or at least
> >> hits disk in the background. Will we take a hit because we always write to
> >> Kafka?
> >>
> >
> > I agree with Neha's comments here. One more point I want to make is that
> > materializing to Kafka is not necessarily much worse than keeping data in
> > memory if the downstream consumption is caught up, such that most of the
> > reads will be hitting file cache. I remember Samza has illustrated that
> > under such scenarios its throughput is actually quite comparable to Spark
> > Streaming / Storm.
> >
> >
> >> * I really struggled with the structure of the KIP template with Copycat
> >> because the flow doesn't work well for proposals like this. They aren't as
> >> concrete as the changes the KIP template was designed for. I'd completely
> >> ignore that template in favor of optimizing for clarity if I were you.
> >>
> >> -Ewen
> >>
> >> On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I just posted KIP-28: Add a transform client for data processing
> >> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing>.
> >> >
> >> > The wiki page does not yet have the full design / implementation details,
> >> > and this email is to kick off the conversation on whether we should add
> >> > this new client with the described motivations, and if yes, what features /
> >> > functionalities should be included.
> >> >
> >> > Looking forward to your feedback!
> >> >
> >> > -- Guozhang
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Ewen
> >>
> >
> >
> >
> > --
> > -- Guozhang
>
