Kafka Metamorphosis: Data streams in, cockroaches stream out :-)

On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales <
g...@apache.org> wrote:

> Forgot to add. On the naming issues, Kafka Metamorphosis is a clear winner
> :)
>
> --
> Gianmarco
>
> On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <g...@apache.org>
> wrote:
>
> > Hi,
> >
> > @Martin, thanks for your comments.
> > Maybe I'm missing some important point, but I think coupling the releases
> > is actually a *good* thing.
> > To make an example, would it be better if the MR and HDFS components of
> > Hadoop had different release schedules?
> >
> > Actually, keeping the discussion in a single place would make agreeing on
> > releases (and backwards compatibility) much easier, as everybody would be
> > responsible for the whole codebase.
> >
> > That said, I like the idea of absorbing samza-core as a sub-project, and
> > leave the fancy stuff separate.
> > It probably gives 90% of the benefits we have been discussing here.
> >
> > Cheers,
> >
> > --
> > Gianmarco
> >
> > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> wrote:
> >
> >> Hey Martin,
> >>
> >> I agree coupling release schedules is a downside.
> >>
> >> Definitely we can try to solve some of the integration problems in
> >> Confluent Platform or in other distributions. But I think this ends up
> >> being really shallow. I guess I feel that to really get a good user
> experience
> >> the two systems have to kind of feel like part of the same thing and you
> >> can't really add that in later--you can put both in the same
> downloadable
> >> tar file but it doesn't really give a very cohesive feeling. I agree
> that
> >> ultimately any of the project stuff is as much social and naming as
> >> anything else--theoretically two totally independent projects could work
> >> to
> >> tightly align. In practice this seems to be quite difficult though.
> >>
> >> For the frameworks--totally agree it would be good to maintain the
> >> framework support with the project. In some cases there may not be too
> >> much
> >> there since the integration gets lighter but I think whatever stubs you
> >> need should be included. So no I definitely wasn't trying to imply
> >> dropping
> >> support for these frameworks, just making the integration lighter by
> >> separating process management from partition management.
> >>
> >> You raise two good points we would have to figure out if we went down
> the
> >> alignment path:
> >> 1. With respect to the name, yeah I think the first question is whether
> >> some "re-branding" would be worth it. If so then I think we can have a
> big
> >> thread on the name. I'm definitely not set on Kafka Streaming or Kafka
> >> Streams I was just using them to be kind of illustrative. I agree with
> >> your
> >> critique of these names, though I think people would get the idea.
> >> 2. Yeah you also raise a good point about how to "factor" it. Here are
> the
> >> options I see (I could get enthusiastic about any of them):
> >>    a. One repo for both Kafka and Samza
> >>    b. Two repos, retaining the current separation
> >>    c. Two repos, the equivalent of samza-api and samza-core is absorbed
> >> almost like a third client
> >>
> >> Cheers,
> >>
> >> -Jay
> >>
> >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <mar...@kleppmann.com>
> >> wrote:
> >>
> >> > Ok, thanks for the clarifications. Just a few follow-up comments.
> >> >
> >> > - I see the appeal of merging with Kafka or becoming a subproject: the
> >> > reasons you mention are good. The risk I see is that release schedules
> >> > become coupled to each other, which can slow everyone down, and large
> >> > projects with many contributors are harder to manage. (Jakob, can you
> >> speak
> >> > from experience, having seen a wider range of Hadoop ecosystem
> >> projects?)
> >> >
> >> > Some of the goals of a better unified developer experience could also
> be
> >> > solved by integrating Samza nicely into a Kafka distribution (such as
> >> > Confluent's). I'm not against merging projects if we decide that's the
> >> way
> >> > to go, just pointing out the same goals can perhaps also be achieved
> in
> >> > other ways.
> >> >
> >> > - With regard to dropping the YARN dependency: are you proposing that
> >> > Samza doesn't give any help to people wanting to run on
> >> YARN/Mesos/AWS/etc?
> >> > So the docs would basically have a link to Slider and nothing else? Or
> >> > would we maintain integrations with a bunch of popular deployment
> >> methods
> >> > (e.g. the necessary glue and shell scripts to make Samza work with
> >> Slider)?
> >> >
> >> > I absolutely think it's a good idea to have the "as a library" and
> "as a
> >> > process" (using Yi's taxonomy) options for people who want them, but I
> >> > think there should also be a low-friction path for common "as a
> service"
> >> > deployment methods, for which we probably need to maintain
> integrations.
> >> >
> >> > - Project naming: "Kafka Streams" seems odd to me, because Kafka is
> all
> >> > about streams already. Perhaps "Kafka Transformers" or "Kafka Filters"
> >> > would be more apt?
> >> >
> >> > One suggestion: perhaps the core of Samza (stream transformation with
> >> > state management -- i.e. the "Samza as a library" bit) could become
> >> part of
> >> > Kafka, while higher-level tools such as streaming SQL and integrations
> >> with
> >> > deployment frameworks remain in a separate project? In other words,
> >> Kafka
> >> > would absorb the proven, stable core of Samza, which would become the
> >> > "third Kafka client" mentioned early in this thread. The Samza project
> >> > would then target that third Kafka client as its base API, and the
> >> project
> >> > would be freed up to explore more experimental new horizons.
> >> >
> >> > Martin
> >> >
> >> > On 6 Jul 2015, at 18:51, Jay Kreps <jay.kr...@gmail.com> wrote:
> >> >
> >> > > Hey Martin,
> >> > >
> >> > > For the YARN/Mesos/etc decoupling I actually don't think it ties our
> >> > hands
> >> > > at all, all it does is refactor things. The division of
> >> responsibility is
> >> > > that Samza core is responsible for task lifecycle, state, and
> >> partition
> >> > > management (using the Kafka co-ordinator) but it is NOT responsible
> >> for
> >> > > packaging, configuration deployment or execution of processes. The
> >> > problem
> >> > > of packaging and starting these processes is
> >> > > framework/environment-specific. This leaves individual frameworks to
> >> be
> >> > as
> >> > > fancy or vanilla as they like. So you can get simple stateless
> >> support in
> >> > > YARN, Mesos, etc using their off-the-shelf app framework (Slider,
> >> > Marathon,
> >> > > etc). These are well known by people and have nice UIs and a lot of
> >> > > flexibility. I don't think they have node affinity as a built in
> >> option
> >> > > (though I could be wrong). So if we want that we can either wait for
> >> them
> >> > > to add it or do a custom framework to add that feature (as now).
> >> > Obviously
> >> > > if you manage things with old-school ops tools (puppet/chef/etc) you
> >> get
> >> > > locality easily. The nice thing, though, is that all the samza
> >> "business
> >> > > logic" around partition management and fault tolerance is in Samza
> >> core
> >> > so
> >> > > it is shared across frameworks and the framework specific bit is
> just
> >> > > whether it is smart enough to try to get the same host when a job is
> >> > > restarted.
> >> > >
> >> > > With respect to the Kafka-alignment, yeah I think the goal would be
> >> (a)
> >> > > actually get better alignment in user experience, and (b) express
> >> this in
> >> > > the naming and project branding. Specifically:
> >> > > 1. Website/docs, it would be nice for the "transformation" api to be
> >> > > discoverable in the main Kafka docs--i.e. be able to explain when to
> >> use
> >> > > the consumer and when to use the stream processing functionality and
> >> lead
> >> > > people into that experience.
> >> > > 2. Align releases so if you get Kafka 1.4.2 (or whatever) that has
> >> both
> >> > > Kafka and the stream processing part and they actually work
> together.
> >> > > 3. Unify the programming experience so the client and Samza api
> share
> >> > > config/monitoring/naming/packaging/etc.
> >> > >
> >> > > I think sub-projects keep separate committers and can have a
> separate
> >> > repo,
> >> > > but I'm actually not really sure (I can't find a definition of a
> >> > subproject
> >> > > in Apache).
> >> > >
> >> > > Basically at a high-level you want the experience to "feel" like a
> >> single
> >> > > system, not two relatively independent things that are kind of
> >> awkwardly
> >> > > glued together.
> >> > >
> >> > > I think if we did that then having naming or branding like "kafka
> >> > > streaming" or "kafka streams" or something like that would actually
> >> do a
> >> > > good job of conveying what it is. I do think this would help adoption
> >> > quite
> >> > > a lot as it would correctly convey that using Kafka Streaming with
> >> Kafka
> >> > is
> >> > > a fairly seamless experience and Kafka is pretty heavily adopted at
> >> this
> >> > > point.
> >> > >
> >> > > Fwiw we actually considered this model originally when open sourcing
> >> > Samza,
> >> > > however at that time Kafka was relatively unknown and we decided not
> >> to
> >> > do
> >> > > it since we felt it would be limiting. From my point of view the
> three
> >> > > things have changed (1) Kafka is now really heavily used for stream
> >> > > processing, (2) we learned that abstracting out the stream well is
> >> > > basically impossible, (3) we learned it is really hard to keep the
> two
> >> > > things feeling like a single product.
> >> > >
> >> > > -Jay
> >> > >
> >> > >
> >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <
> >> mar...@kleppmann.com>
> >> > > wrote:
> >> > >
> >> > >> Hi all,
> >> > >>
> >> > >> Lots of good thoughts here.
> >> > >>
> >> > >> I agree with the general philosophy of tying Samza more firmly to
> >> Kafka.
> >> > >> After I spent a while looking at integrating other message brokers
> >> (e.g.
> >> > >> Kinesis) with SystemConsumer, I came to the conclusion that
> >> > SystemConsumer
> >> > >> tacitly assumes a model so much like Kafka's that pretty much
> nobody
> >> but
> >> > >> Kafka actually implements it. (Databus is perhaps an exception, but
> >> it
> >> > >> isn't widely used outside of LinkedIn.) Thus, making Samza fully
> >> > dependent
> >> > >> on Kafka acknowledges that the system-independence was never as
> real
> >> as
> >> > we
> >> > >> perhaps made it out to be. The gains of code reuse are real.
> >> > >>
> >> > >> The idea of decoupling Samza from YARN has also always been
> >> appealing to
> >> > >> me, for various reasons already mentioned in this thread. Although
> >> > making
> >> > >> Samza jobs deployable on anything (YARN/Mesos/AWS/etc) seems
> >> laudable,
> >> > I am
> >> > >> a little concerned that it will restrict us to a lowest common
> >> > denominator.
> >> > >> For example, would host affinity (SAMZA-617) still be possible? For
> >> jobs
> >> > >> with large amounts of state, I think SAMZA-617 would be a big boon,
> >> > since
> >> > >> restoring state off the changelog on every single restart is
> painful,
> >> > due
> >> > >> to long recovery times. It would be a shame if the decoupling from
> >> YARN
> >> > >> made host affinity impossible.
> >> > >>
> >> > >> Jay, a question about the proposed API for instantiating a job in
> >> code
> >> > >> (rather than a properties file): when submitting a job to a
> cluster,
> >> is
> >> > the
> >> > >> idea that the instantiation code runs on a client somewhere, which
> >> then
> >> > >> pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or does that
> >> code
> >> > run
> >> > >> on each container that is part of the job (in which case, how does
> >> the
> >> > job
> >> > >> submission to the cluster work)?
> >> > >>
> >> > >> I agree with Garry that it doesn't feel right to make a 1.0 release
> >> > with a
> >> > >> plan for it to be immediately obsolete. So if this is going to
> >> happen, I
> >> > >> think it would be more honest to stick with 0.* version numbers
> until
> >> > the
> >> > >> library-ified Samza has been implemented, is stable and widely
> used.
> >> > >>
> >> > >> Should the new Samza be a subproject of Kafka? There is precedent
> for
> >> > >> tight coupling between different Apache projects (e.g. Curator and
> >> > >> Zookeeper, or Slider and YARN), so I think remaining separate would
> >> be
> >> > ok.
> >> > >> Even if Samza is fully dependent on Kafka, there is enough
> substance
> >> in
> >> > >> Samza that it warrants being a separate project. An argument in
> >> favour
> >> > of
> >> > >> merging would be if we think Kafka has a much stronger "brand
> >> presence"
> >> > >> than Samza; I'm ambivalent on that one. If the Kafka project is
> >> willing
> >> > to
> >> > >> endorse Samza as the "official" way of doing stateful stream
> >> > >> transformations, that would probably have much the same effect as
> >> > >> re-branding Samza as "Kafka Stream Processors" or suchlike. Close
> >> > >> collaboration between the two projects will be needed in any case.
> >> > >>
> >> > >> From a project management perspective, I guess the "new Samza"
> would
> >> > have
> >> > >> to be developed on a branch alongside ongoing maintenance of the
> >> current
> >> > >> line of development? I think it would be important to continue
> >> > supporting
> >> > >> existing users, and provide a graceful migration path to the new
> >> > version.
> >> > >> Leaving the current versions unsupported and forcing people to
> >> rewrite
> >> > >> their jobs would send a bad signal.
> >> > >>
> >> > >> Best,
> >> > >> Martin
> >> > >>
> >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> wrote:
> >> > >>
> >> > >>> Hey Garry,
> >> > >>>
> >> > >>> Yeah that's super frustrating. I'd be happy to chat more about
> this
> >> if
> >> > >>> you'd be interested. I think Chris and I started with the idea of
> >> "what
> >> > >>> would it take to make Samza a kick-ass ingestion tool" but
> >> ultimately
> >> > we
> >> > >>> kind of came around to the idea that ingestion and transformation
> >> had
> >> > >>> pretty different needs and coupling the two made things hard.
> >> > >>>
> >> > >>> For what it's worth I think copycat (KIP-26) actually will do what
> >> you
> >> > >> are
> >> > >>> looking for.
> >> > >>>
> >> > >>> With regard to your point about slider, I don't necessarily
> >> disagree.
> >> > >> But I
> >> > >>> think getting good YARN support is quite doable and I think we can
> >> make
> >> > >>> that work well. I think the issue this proposal solves is that
> >> > >> technically
> >> > >>> it is pretty hard to support multiple cluster management systems
> the
> >> > way
> >> > >>> things are now, you need to write an "app master" or "framework"
> for
> >> > each
> >> > >>> and they are all a little different so testing is really hard. In
> >> the
> >> > >>> absence of this we have been stuck with just YARN which has
> >> fantastic
> >> > >>> penetration in the Hadoopy part of the org, but zero penetration
> >> > >> elsewhere.
> >> > >>> Given the huge amount of work being put in to slider, marathon,
> aws
> >> > >>> tooling, not to mention the umpteen related packaging technologies
> >> > people
> >> > >>> want to use (Docker, Kubernetes, various cloud-specific deploy
> >> tools,
> >> > >> etc)
> >> > >>> I really think it is important to get this right.
> >> > >>>
> >> > >>> -Jay
> >> > >>>
> >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <
> >> > >>> g.turking...@improvedigital.com> wrote:
> >> > >>>
> >> > >>>> Hi all,
> >> > >>>>
> >> > >>>> I think the question below re does Samza become a sub-project of
> >> Kafka
> >> > >>>> highlights the broader point around migration. Chris mentions
> >> Samza's
> >> > >>>> maturity is heading towards a v1 release but I'm not sure it
> feels
> >> > >> right to
> >> > >>>> launch a v1 then immediately plan to deprecate most of it.
> >> > >>>>
> >> > >>>> From a selfish perspective I have some guys who have started
> >> working
> >> > >> with
> >> > >>>> Samza and building some new consumers/producers was next up.
> Sounds
> >> > like
> >> > >>>> that is absolutely not the direction to go. I need to look into
> the
> >> > KIP
> >> > >> in
> >> > >>>> more detail but for me the attractiveness of adding new Samza
> >> > >>>> consumer/producers -- even if yes all they were doing was really
> >> > getting
> >> > >>>> data into and out of Kafka --  was to avoid  having to worry
> about
> >> the
> >> > >>>> lifecycle management of external clients. If there is a generic
> >> Kafka
> >> > >>>> ingress/egress layer that I can plug a new connector into and
> have
> >> a
> >> > >> lot of
> >> > >>>> the heavy lifting re scale and reliability done for me then it
> >> gives
> >> > me
> >> > >> all
> >> > >>>> the pushing new consumers/producers would. If not then it
> >> complicates
> >> > my
> >> > >>>> operational deployments.
> >> > >>>>
> >> > >>>> Which is similar to my other question with the proposal -- if we
> >> > build a
> >> > >>>> fully available/stand-alone Samza plus the requisite shims to
> >> > integrate
> >> > >>>> with Slider etc I suspect the former may be a lot more work than
> we
> >> > >> think.
> >> > >>>> We may make it much easier for a newcomer to get something
> running
> >> but
> >> > >>>> having them step up and get a reliable production deployment may
> >> still
> >> > >>>> dominate mailing list  traffic, if for different reasons than
> >> today.
> >> > >>>>
> >> > >>>> Don't get me wrong -- I'm comfortable with making the Samza
> >> dependency
> >> > >> on
> >> > >>>> Kafka much more explicit and I absolutely see the benefits  in
> the
> >> > >>>> reduction of duplication and clashing terminologies/abstractions
> >> that
> >> > >>>> Chris/Jay describe. Samza as a library would likely be a very
> nice
> >> > tool
> >> > >> to
> >> > >>>> add to the Kafka ecosystem. I just have the concerns above re the
> >> > >>>> operational side.
> >> > >>>>
> >> > >>>> Garry
> >> > >>>>
> >> > >>>> -----Original Message-----
> >> > >>>> From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
> >> > >>>> Sent: 02 July 2015 12:56
> >> > >>>> To: dev@samza.apache.org
> >> > >>>> Subject: Re: Thoughts and observations on Samza
> >> > >>>>
> >> > >>>> Very interesting thoughts.
> >> > >>>> From outside, I have always perceived Samza as a computing layer
> >> over
> >> > >>>> Kafka.
> >> > >>>>
> >> > >>>> The question, maybe a bit provocative, is "should Samza be a
> >> > sub-project
> >> > >>>> of Kafka then?"
> >> > >>>> Or does it make sense to keep it as a separate project with a
> >> separate
> >> > >>>> governance?
> >> > >>>>
> >> > >>>> Cheers,
> >> > >>>>
> >> > >>>> --
> >> > >>>> Gianmarco
> >> > >>>>
> >> > >>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com> wrote:
> >> > >>>>
> >> > >>>>> Overall, I agree to couple with Kafka more tightly. Because
> Samza
> >> de
> >> > >>>>> facto is based on Kafka, and it should leverage what Kafka has.
> At
> >> > the
> >> > >>>>> same time, Kafka does not need to reinvent what Samza already
> >> has. I
> >> > >>>>> also like the idea of separating the ingestion and
> transformation.
> >> > >>>>>
> >> > >>>>> But it is a little difficult for me to imagine what Samza will
> >> look
> >> > >>>> like.
> >> > >>>>> And I feel Chris and Jay have a little difference in terms of
> how
> >> > >>>>> Samza should look.
> >> > >>>>>
> >> > >>>>> *** Will it look like what Jay's code shows (A client of Kafka)
> ?
> >> And
> >> > >>>>> user's application code calls this client?
> >> > >>>>>
> >> > >>>>> 1. If we make Samza be a library of Kafka (like what the code
> >> shows),
> >> > >>>>> how do we implement auto-balance and fault-tolerance? Are they
> >> taken
> >> > >>>>> care of by the Kafka broker or other mechanism, such as "Samza
> >> worker"
> >> > >>>>> (just make up the name) ?
> >> > >>>>>
> >> > >>>>> 2. What about other features, such as auto-scaling, shared
> state,
> >> > >>>>> monitoring?
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> *** If we have Samza standalone, (is this what Chris suggests?)
> >> > >>>>>
> >> > >>>>> 1. we still need to ingest data from Kafka and produce to it.
> >> Then it
> >> > >>>>> becomes the same as what Samza looks like now, except it does
> not
> >> > rely
> >> > >>>>> on Yarn anymore.
> >> > >>>>>
> >> > >>>>> 2. if it is standalone, how can it leverage Kafka's metrics,
> logs,
> >> > >>>>> etc? Use Kafka code as the dependency?
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Thanks,
> >> > >>>>>
> >> > >>>>> Fang, Yan
> >> > >>>>> yanfang...@gmail.com
> >> > >>>>>
> >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <
> wangg...@gmail.com
> >> >
> >> > >>>> wrote:
> >> > >>>>>
> >> > >>>>>> Read through the code example and it looks good to me. A few
> >> > >>>>>> thoughts regarding deployment:
> >> > >>>>>>
> >> > >>>>>> Today Samza deploys as executable runnable like:
> >> > >>>>>>
> >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=... --config-path=file://...
> >> > >>>>>>
> >> > >>>>>> And this proposal advocates for deploying Samza more as embedded
> >> > >>>>>> libraries in user application code (ignoring the terminology
> >> since
> >> > >>>>>> it is not the
> >> > >>>>> same
> >> > >>>>>> as the prototype code):
> >> > >>>>>>
> >> > >>>>>> StreamTask task = new MyStreamTask(configs);
> >> > >>>>>> Thread thread = new Thread(task);
> >> > >>>>>> thread.start();
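> >> > >>>>>>
> >> > >>>>>> For illustration only: the same embedded call wrapped in a main()
> >> > >>>>>> gives back the runnable mode. This just follows the pseudocode
> >> > >>>>>> above, so MyStreamTask, Config and loadConfig() are placeholders
> >> > >>>>>> rather than real APIs:
> >> > >>>>>>
> >> > >>>>>> public class MyStreamJobMain {
> >> > >>>>>>   public static void main(String[] args) {
> >> > >>>>>>     Config configs = loadConfig(args); // placeholder config loading
> >> > >>>>>>     StreamTask task = new MyStreamTask(configs);
> >> > >>>>>>     new Thread(task).start(); // same embedded-library call as above
> >> > >>>>>>   }
> >> > >>>>>> }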
> >> > >>>>>>
> >> > >>>>>> I think both of these deployment modes are important for
> >> different
> >> > >>>>>> types
> >> > >>>>> of
> >> > >>>>>> users. That said, I think making Samza purely standalone is
> still
> >> > >>>>>> sufficient for either runnable or library modes.
> >> > >>>>>>
> >> > >>>>>> Guozhang
> >> > >>>>>>
> >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <j...@confluent.io>
> >> > wrote:
> >> > >>>>>>
> >> > >>>>>>> Looks like gmail mangled the code example, it was supposed to
> >> look
> >> > >>>>>>> like
> >> > >>>>>>> this:
> >> > >>>>>>>
> >> > >>>>>>> Properties props = new Properties();
> >> > >>>>>>> props.put("bootstrap.servers", "localhost:4242");
> >> > >>>>>>> StreamingConfig config = new StreamingConfig(props);
> >> > >>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> >> > >>>>>>> config.processor(ExampleStreamProcessor.class);
> >> > >>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> >> > >>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> >> > >>>>>>> container.run();
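> >> > >>>>>>>
> >> > >>>>>>> For illustration only, ExampleStreamProcessor might look roughly
> >> > >>>>>>> like the sketch below; the interface name and process() signature
> >> > >>>>>>> here are guesses based on the StreamTask analogy, not the
> >> > >>>>>>> prototype's actual API:
> >> > >>>>>>>
> >> > >>>>>>> public class ExampleStreamProcessor implements StreamProcessor<String, String> {
> >> > >>>>>>>   // called once for each record consumed from the subscribed topics
> >> > >>>>>>>   public void process(String key, String value, RecordCollector<String, String> collector) {
> >> > >>>>>>>     // e.g. forward an upper-cased copy of the value to an output topic
> >> > >>>>>>>     collector.send("test-output-topic", key, value.toUpperCase());
> >> > >>>>>>>   }
> >> > >>>>>>> }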
> >> > >>>>>>>
> >> > >>>>>>> -Jay
> >> > >>>>>>>
> >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <j...@confluent.io
> >
> >> > >>>> wrote:
> >> > >>>>>>>
> >> > >>>>>>>> Hey guys,
> >> > >>>>>>>>
> >> > >>>>>>>> This came out of some conversations Chris and I were having
> >> > >>>>>>>> around
> >> > >>>>>>> whether
> >> > >>>>>>>> it would make sense to use Samza as a kind of data ingestion
> >> > >>>>> framework
> >> > >>>>>>> for
> >> > >>>>>>>> Kafka (which ultimately lead to KIP-26 "copycat"). This kind
> of
> >> > >>>>>> combined
> >> > >>>>>>>> with complaints around config and YARN and the discussion
> >> around
> >> > >>>>>>>> how
> >> > >>>>> to
> >> > >>>>>>>> best do a standalone mode.
> >> > >>>>>>>>
> >> > >>>>>>>> So the thought experiment was, given that Samza was basically
> >> > >>>>>>>> already totally Kafka specific, what if you just embraced
> that
> >> > >>>>>>>> and turned it
> >> > >>>>>> into
> >> > >>>>>>>> something less like a heavyweight framework and more like a
> >> > >>>>>>>> third
> >> > >>>>> Kafka
> >> > >>>>>>>> client--a kind of "producing consumer" with state management
> >> > >>>>>> facilities.
> >> > >>>>>>>> Basically a library. Instead of a complex stream processing
> >> > >>>>>>>> framework
> >> > >>>>>>> this
> >> > >>>>>>>> would actually be a very simple thing, not much more
> >> complicated
> >> > >>>>>>>> to
> >> > >>>>> use
> >> > >>>>>>> or
> >> > >>>>>>>> operate than a Kafka consumer. As Chris said, we thought about it;
> >> > >>>>>>>> a lot of what Samza (and the other stream processing systems) were
> >> > >>>>>>>> doing seemed like kind of a hangover from MapReduce.
> >> > >>>>>>>>
> >> > >>>>>>>> Of course you need to ingest/output data to and from the
> stream
> >> > >>>>>>>> processing. But when we actually looked into how that would
> >> > >>>>>>>> work,
> >> > >>>>> Samza
> >> > >>>>>>>> isn't really an ideal data ingestion framework for a bunch of
> >> > >>>>> reasons.
> >> > >>>>>> To
> >> > >>>>>>>> really do that right you need a pretty different internal
> data
> >> > >>>>>>>> model
> >> > >>>>>> and
> >> > >>>>>>>> set of apis. So what if you split them and had an api for
> Kafka
> >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) and a separate api for
> >> Kafka
> >> > >>>>>>>> transformation (Samza).
> >> > >>>>>>>>
> >> > >>>>>>>> This would also allow really embracing the same terminology
> and
> >> > >>>>>>>> conventions. One complaint about the current state is that
> the
> >> > >>>>>>>> two
> >> > >>>>>>> systems
> >> > >>>>>>>> kind of feel bolted on. Terminology like "stream" vs "topic"
> >> and
> >> > >>>>>>> different
> >> > >>>>>>>> config and monitoring systems means you kind of have to learn
> >> > >>>>>>>> Kafka's
> >> > >>>>>>> way,
> >> > >>>>>>>> then learn Samza's slightly different way, then kind of
> >> > >>>>>>>> understand
> >> > >>>>> how
> >> > >>>>>>> they
> >> > >>>>>>>> map to each other, which having walked a few people through
> >> this
> >> > >>>>>>>> is surprisingly tricky for folks to get.
> >> > >>>>>>>>
> >> > >>>>>>>> Since I have been spending a lot of time on airplanes I
> hacked
> >> > >>>>>>>> up an earnest but still somewhat incomplete prototype of what
> >> > >>>>>>>> this would
> >> > >>>>> look
> >> > >>>>>>>> like. This is just unceremoniously dumped into Kafka as it
> >> > >>>>>>>> required a
> >> > >>>>>> few
> >> > >>>>>>>> changes to the new consumer. Here is the code:
> >> > >>>>>>>>
> >> > >>>>>>>>
> >> > >>>>>>>
> >> > >>>>>>
> >> > >>>>>
> >> >
> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
> >> > >>>>> /apache/kafka/clients/streaming
> >> > >>>>>>>>
> >> > >>>>>>>> For the purpose of the prototype I just liberally renamed
> >> > >>>>>>>> everything
> >> > >>>>> to
> >> > >>>>>>>> try to align it with Kafka with no regard for compatibility.
> >> > >>>>>>>>
> >> > >>>>>>>> To use this would be something like this:
> >> > >>>>>>>> Properties props = new Properties();
> >> > >>>>>>>> props.put("bootstrap.servers", "localhost:4242");
> >> > >>>>>>>> StreamingConfig config = new StreamingConfig(props);
> >> > >>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> >> > >>>>>>>> config.processor(ExampleStreamProcessor.class);
> >> > >>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> >> > >>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> >> > >>>>>>>> container.run();
> >> > >>>>>>>>
> >> > >>>>>>>> KafkaStreaming is basically the SamzaContainer;
> StreamProcessor
> >> > >>>>>>>> is basically StreamTask.
> >> > >>>>>>>>
> >> > >>>>>>>> So rather than putting all the class names in a file and then
> >> > >>>>>>>> having
> >> > >>>>>> the
> >> > >>>>>>>> job assembled by reflection, you just instantiate the
> container
> >> > >>>>>>>> programmatically. Work is balanced over however many
> instances
> >> > >>>>>>>> of
> >> > >>>>> this
> >> > >>>>>>> are
> >> > >>>>>>>> alive at any time (i.e. if an instance dies, new tasks are
> >> added
> >> > >>>>>>>> to
> >> > >>>>> the
> >> > >>>>>>>> existing containers without shutting them down).
> >> > >>>>>>>>
> >> > >>>>>>>> We would provide some glue for running this stuff in YARN via
> >> > >>>>>>>> Slider, Mesos via Marathon, and AWS using some of their tools
> >> > >>>>>>>> but from the
> >> > >>>>>> point
> >> > >>>>>>> of
> >> > >>>>>>>> view of these frameworks these stream processing jobs are
> just
> >> > >>>>>> stateless
> >> > >>>>>>>> services that can come and go and expand and contract at
> will.
> >> > >>>>>>>> There
> >> > >>>>> is
> >> > >>>>>>> no
> >> > >>>>>>>> more custom scheduler.
> >> > >>>>>>>>
> >> > >>>>>>>> Here are some relevant details:
> >> > >>>>>>>>
> >> > >>>>>>>>  1. It is only ~1300 lines of code, it would get larger if we
> >> > >>>>>>>>  productionized it but not vastly larger. We really do get a ton of
> >> > >>>>>>>>  leverage out of Kafka.
> >> > >>>>>>>>  2. Partition management is fully delegated to the new consumer. This
> >> > >>>>>>>>  is nice since now any partition management strategy available to the
> >> > >>>>>>>>  Kafka consumer is also available to Samza (and vice versa) and with
> >> > >>>>>>>>  the exact same configs (see the sketch below).
> >> > >>>>>>>>  3. It supports state as well as state reuse
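> >> > >>>>>>>>
> >> > >>>>>>>>  Sketch for item 2 (illustrative only): roughly how the container can
> >> > >>>>>>>>  lean on the new consumer's group management instead of doing its own
> >> > >>>>>>>>  partition assignment. The rebalance-listener callbacks follow the new
> >> > >>>>>>>>  consumer API as I understand it and may differ in detail; props and
> >> > >>>>>>>>  task are assumed from context.
> >> > >>>>>>>>
> >> > >>>>>>>>  KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
> >> > >>>>>>>>  consumer.subscribe(Arrays.asList("test-topic-1", "test-topic-2"),
> >> > >>>>>>>>      new ConsumerRebalanceListener() {
> >> > >>>>>>>>        public void onPartitionsAssigned(Collection<TopicPartition> parts) {
> >> > >>>>>>>>          // container creates tasks / restores state for these partitions
> >> > >>>>>>>>        }
> >> > >>>>>>>>        public void onPartitionsRevoked(Collection<TopicPartition> parts) {
> >> > >>>>>>>>          // container flushes state and commits offsets before giving them up
> >> > >>>>>>>>        }
> >> > >>>>>>>>      });
> >> > >>>>>>>>  while (true) {
> >> > >>>>>>>>    for (ConsumerRecord<byte[], byte[]> record : consumer.poll(100))
> >> > >>>>>>>>      task.process(record); // hand each record to the user's StreamProcessor
> >> > >>>>>>>>  }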
> >> > >>>>>>>>
> >> > >>>>>>>> Anyhow take a look, hopefully it is thought provoking.
> >> > >>>>>>>>
> >> > >>>>>>>> -Jay
> >> > >>>>>>>>
> >> > >>>>>>>>
> >> > >>>>>>>>
> >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <
> >> > >>>>>> criccom...@apache.org>
> >> > >>>>>>>> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Hey all,
> >> > >>>>>>>>>
> >> > >>>>>>>>> I have had some discussions with Samza engineers at LinkedIn
> >> > >>>>>>>>> and
> >> > >>>>>>> Confluent
> >> > >>>>>>>>> and we came up with a few observations and would like to
> >> > >>>>>>>>> propose
> >> > >>>>> some
> >> > >>>>>>>>> changes.
> >> > >>>>>>>>>
> >> > >>>>>>>>> We've observed some things that I want to call out about
> >> > >>>>>>>>> Samza's
> >> > >>>>>> design,
> >> > >>>>>>>>> and I'd like to propose some changes.
> >> > >>>>>>>>>
> >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment system.
> >> > >>>>>>>>> * Samza is too pluggable.
> >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's consumer
> >> > >>>>>>>>> APIs
> >> > >>>>> are
> >> > >>>>>>>>> trying to solve a lot of the same problems.
> >> > >>>>>>>>>
> >> > >>>>>>>>> All three of these issues are related, but I'll address them
> >> in
> >> > >>>>> order.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Deployment
> >> > >>>>>>>>>
> >> > >>>>>>>>> Samza strongly depends on the use of a dynamic deployment
> >> > >>>>>>>>> scheduler
> >> > >>>>>> such
> >> > >>>>>>>>> as
> >> > >>>>>>>>> YARN, Mesos, etc. When we initially built Samza, we bet that
> >> > >>>>>>>>> there
> >> > >>>>>> would
> >> > >>>>>>>>> be
> >> > >>>>>>>>> one or two winners in this area, and we could support them,
> >> and
> >> > >>>>>>>>> the
> >> > >>>>>> rest
> >> > >>>>>>>>> would go away. In reality, there are many variations.
> >> > >>>>>>>>> Furthermore,
> >> > >>>>>> many
> >> > >>>>>>>>> people still prefer to just start their processors like
> normal
> >> > >>>>>>>>> Java processes, and use traditional deployment scripts such
> as
> >> > >>>>>>>>> Fabric,
> >> > >>>>>> Chef,
> >> > >>>>>>>>> Ansible, etc. Forcing a deployment system on users makes the
> >> > >>>>>>>>> Samza start-up process really painful for first time users.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Dynamic deployment as a requirement was also a bit of a
> >> > >>>>>>>>> mis-fire
> >> > >>>>>> because
> >> > >>>>>>>>> of
> >> > >>>>>>>>> a fundamental misunderstanding between the nature of batch
> >> jobs
> >> > >>>>>>>>> and
> >> > >>>>>>> stream
> >> > >>>>>>>>> processing jobs. Early on, we made conscious effort to favor
> >> > >>>>>>>>> the
> >> > >>>>>> Hadoop
> >> > >>>>>>>>> (Map/Reduce) way of doing things, since it worked and was
> well
> >> > >>>>>>> understood.
> >> > >>>>>>>>> One thing that we missed was that batch jobs have a definite
> >> > >>>>>> beginning,
> >> > >>>>>>>>> and
> >> > >>>>>>>>> end, and stream processing jobs don't (usually). This leads
> to
> >> > >>>>>>>>> a
> >> > >>>>> much
> >> > >>>>>>>>> simpler scheduling problem for stream processors. You
> >> basically
> >> > >>>>>>>>> just
> >> > >>>>>>> need
> >> > >>>>>>>>> to find a place to start the processor, and start it. The
> way
> >> > >>>>>>>>> we run grids, at LinkedIn, there's no concept of a cluster
> >> > >>>>>>>>> being "full". We always
> >> > >>>>>> add
> >> > >>>>>>>>> more machines. The problem with coupling Samza with a
> >> scheduler
> >> > >>>>>>>>> is
> >> > >>>>>> that
> >> > >>>>>>>>> Samza (as a framework) now has to handle deployment. This
> >> pulls
> >> > >>>>>>>>> in a
> >> > >>>>>>> bunch
> >> > >>>>>>>>> of things such as configuration distribution (config
> stream),
> >> > >>>>>>>>> shell
> >> > >>>>>>>>> scripts
> >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all the .tgz stuff),
> >> etc.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Another reason for requiring dynamic deployment was to
> support
> >> > >>>>>>>>> data locality. If you want to have locality, you need to put
> >> > >>>>>>>>> your
> >> > >>>>>> processors
> >> > >>>>>>>>> close to the data they're processing. Upon further
> >> > >>>>>>>>> investigation,
> >> > >>>>>>> though,
> >> > >>>>>>>>> this feature is not that beneficial. There is some good
> >> > >>>>>>>>> discussion
> >> > >>>>>> about
> >> > >>>>>>>>> some problems with it on SAMZA-335. Again, we took the
> >> > >>>>>>>>> Map/Reduce
> >> > >>>>>> path,
> >> > >>>>>>>>> but
> >> > >>>>>>>>> there are some fundamental differences between HDFS and
> Kafka.
> >> > >>>>>>>>> HDFS
> >> > >>>>>> has
> >> > >>>>>>>>> blocks, while Kafka has partitions. This leads to less
> >> > >>>>>>>>> optimization potential with stream processors on top of
> Kafka.
> >> > >>>>>>>>>
> >> > >>>>>>>>> This feature is also used as a crutch. Samza doesn't have
> any
> >> > >>>>>>>>> built
> >> > >>>>> in
> >> > >>>>>>>>> fault-tolerance logic. Instead, it depends on the dynamic
> >> > >>>>>>>>> deployment scheduling system to handle restarts when a
> >> > >>>>>>>>> processor dies. This has
> >> > >>>>>>> made
> >> > >>>>>>>>> it very difficult to write a standalone Samza container
> >> > >>>> (SAMZA-516).
> >> > >>>>>>>>>
> >> > >>>>>>>>> Pluggability
> >> > >>>>>>>>>
> >> > >>>>>>>>> In some cases pluggability is good, but I think that we've
> >> gone
> >> > >>>>>>>>> too
> >> > >>>>>> far
> >> > >>>>>>>>> with it. Currently, Samza has:
> >> > >>>>>>>>>
> >> > >>>>>>>>> * Pluggable config.
> >> > >>>>>>>>> * Pluggable metrics.
> >> > >>>>>>>>> * Pluggable deployment systems.
> >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer,
> SystemProducer,
> >> > >>>> etc).
> >> > >>>>>>>>> * Pluggable serdes.
> >> > >>>>>>>>> * Pluggable storage engines.
> >> > >>>>>>>>> * Pluggable strategies for just about every component
> >> > >>>>> (MessageChooser,
> >> > >>>>>>>>> SystemStreamPartitionGrouper, ConfigRewriter, etc).
> >> > >>>>>>>>>
> >> > >>>>>>>>> There's probably more that I've forgotten, as well. Some of
> >> > >>>>>>>>> these
> >> > >>>>> are
> >> > >>>>>>>>> useful, but some have proven not to be. This all comes at a
> >> cost:
> >> > >>>>>>>>> complexity. This complexity is making it harder for our
> users
> >> > >>>>>>>>> to
> >> > >>>>> pick
> >> > >>>>>> up
> >> > >>>>>>>>> and use Samza out of the box. It also makes it difficult for
> >> > >>>>>>>>> Samza developers to reason about the characteristics of
> >> > >>>>>>>>> the container (since the characteristics change depending on
> >> > >>>>>>>>> which plugins are used).
> >> > >>>>>>>>>
> >> > >>>>>>>>> The issues with pluggability are most visible in the System
> >> APIs.
> >> > >>>>> What
> >> > >>>>>>>>> Samza really requires to be functional is Kafka as its
> >> > >>>>>>>>> transport
> >> > >>>>>> layer.
> >> > >>>>>>>>> But
> >> > >>>>>>>>> we've conflated two unrelated use cases into one API:
> >> > >>>>>>>>>
> >> > >>>>>>>>> 1. Get data into/out of Kafka.
> >> > >>>>>>>>> 2. Process the data in Kafka.
> >> > >>>>>>>>>
> >> > >>>>>>>>> The current System API supports both of these use cases. The
> >> > >>>>>>>>> problem
> >> > >>>>>> is,
> >> > >>>>>>>>> we
> >> > >>>>>>>>> actually want different features for each use case. By
> >> papering
> >> > >>>>>>>>> over
> >> > >>>>>>> these
> >> > >>>>>>>>> two use cases, and providing a single API, we've introduced
> a
> >> > >>>>>>>>> ton of
> >> > >>>>>>> leaky
> >> > >>>>>>>>> abstractions.
> >> > >>>>>>>>>
> >> > >>>>>>>>> For example, what we'd really like in (2) is to have
> >> > >>>>>>>>> monotonically increasing longs for offsets (like Kafka).
> This
> >> > >>>>>>>>> would be at odds
> >> > >>>>> with
> >> > >>>>>>> (1),
> >> > >>>>>>>>> though, since different systems have different
> >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
> >> > >>>>>>>>> There was discussion both on the mailing list and the SQL
> >> JIRAs
> >> > >>>>> about
> >> > >>>>>>> the
> >> > >>>>>>>>> need for this.
> >> > >>>>>>>>>
> >> > >>>>>>>>> The same thing holds true for replayability. Kafka allows us
> >> to
> >> > >>>>> rewind
> >> > >>>>>>>>> when
> >> > >>>>>>>>> we have a failure. Many other systems don't. In some cases,
> >> > >>>>>>>>> systems
> >> > >>>>>>> return
> >> > >>>>>>>>> null for their offsets (e.g. WikipediaSystemConsumer)
> because
> >> > >>>>>>>>> they
> >> > >>>>>> have
> >> > >>>>>>> no
> >> > >>>>>>>>> offsets.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Partitioning is another example. Kafka supports
> partitioning,
> >> > >>>>>>>>> but
> >> > >>>>> many
> >> > >>>>>>>>> systems don't. We model this by having a single partition
> for
> >> > >>>>>>>>> those systems. Still, other systems model partitioning
> >> > >>>> differently (e.g.
> >> > >>>>>>>>> Kinesis).
> >> > >>>>>>>>>
> >> > >>>>>>>>> The SystemAdmin interface is also a mess. Creating streams
> in
> >> a
> >> > >>>>>>>>> system-agnostic way is almost impossible. As is modeling
> >> > >>>>>>>>> metadata
> >> > >>>>> for
> >> > >>>>>>> the
> >> > >>>>>>>>> system (replication factor, partitions, location, etc). The
> >> > >>>>>>>>> list
> >> > >>>>> goes
> >> > >>>>>>> on.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Duplicate work
> >> > >>>>>>>>>
> >> > >>>>>>>>> At the time that we began writing Samza, Kafka's consumer
> and
> >> > >>>>> producer
> >> > >>>>>>>>> APIs
> >> > >>>>>>>>> had a relatively weak feature set. On the consumer-side, you
> >> > >>>>>>>>> had two
> >> > >>>>>>>>> options: use the high level consumer, or the simple
> consumer.
> >> > >>>>>>>>> The
> >> > >>>>>>> problem
> >> > >>>>>>>>> with the high-level consumer was that it controlled your
> >> > >>>>>>>>> offsets, partition assignments, and the order in which you
> >> > >>>>>>>>> received messages. The
> >> > >>>>> problem
> >> > >>>>>>>>> with
> >> > >>>>>>>>> the simple consumer is that it's not simple. It's basic. You
> >> > >>>>>>>>> end up
> >> > >>>>>>> having
> >> > >>>>>>>>> to handle a lot of really low-level stuff that you
> shouldn't.
> >> > >>>>>>>>> We
> >> > >>>>>> spent a
> >> > >>>>>>>>> lot of time to make Samza's KafkaSystemConsumer very robust.
> >> It
> >> > >>>>>>>>> also allows us to support some cool features:
> >> > >>>>>>>>>
> >> > >>>>>>>>> * Per-partition message ordering and prioritization.
> >> > >>>>>>>>> * Tight control over partition assignment to support joins,
> >> > >>>>>>>>> global
> >> > >>>>>> state
> >> > >>>>>>>>> (if we want to implement it :)), etc.
> >> > >>>>>>>>> * Tight control over offset checkpointing.
> >> > >>>>>>>>>
> >> > >>>>>>>>> What we didn't realize at the time is that these features
> >> > >>>>>>>>> should
> >> > >>>>>>> actually
> >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers (not just Samza stream
> >> > >>>>>> processors)
> >> > >>>>>>>>> end up wanting to do things like joins and partition
> >> > >>>>>>>>> assignment. The
> >> > >>>>>>> Kafka
> >> > >>>>>>>>> community has come to the same conclusion. They're adding a
> >> ton
> >> > >>>>>>>>> of upgrades into their new Kafka consumer implementation.
> To a
> >> > >>>>>>>>> large extent,
> >> > >>>>> it's
> >> > >>>>>>>>> duplicate work to what we've already done in Samza.
> >> > >>>>>>>>>
> >> > >>>>>>>>> On top of this, Kafka ended up taking a very similar
> approach
> >> > >>>>>>>>> to
> >> > >>>>>> Samza's
> >> > >>>>>>>>> KafkaCheckpointManager implementation for handling offset
> >> > >>>>>> checkpointing.
> >> > >>>>>>>>> Like Samza, Kafka's new offset management feature stores
> >> offset
> >> > >>>>>>>>> checkpoints in a topic, and allows you to fetch them from
> the
> >> > >>>>>>>>> broker.
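> >> > >>>>>>>>>
> >> > >>>>>>>>> As a rough illustration (using the new consumer API as I understand
> >> > >>>>>>>>> it; exact names may differ, and consumer and lastProcessedOffset are
> >> > >>>>>>>>> assumed from context), checkpointing through the broker looks
> >> > >>>>>>>>> something like:
> >> > >>>>>>>>>
> >> > >>>>>>>>> TopicPartition tp = new TopicPartition("test-topic-1", 0);
> >> > >>>>>>>>> // checkpoint: commit the next offset to read back to the broker
> >> > >>>>>>>>> consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(lastProcessedOffset + 1)));
> >> > >>>>>>>>> // on restart: ask the broker where we left off and resume there
> >> > >>>>>>>>> OffsetAndMetadata checkpoint = consumer.committed(tp);
> >> > >>>>>>>>> consumer.seek(tp, checkpoint.offset());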
> >> > >>>>>>>>>
> >> > >>>>>>>>> A lot of this seems like a waste, since we could have shared
> >> > >>>>>>>>> the
> >> > >>>>> work
> >> > >>>>>> if
> >> > >>>>>>>>> it
> >> > >>>>>>>>> had been done in Kafka from the get-go.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Vision
> >> > >>>>>>>>>
> >> > >>>>>>>>> All of this leads me to a rather radical proposal. Samza is
> >> > >>>>> relatively
> >> > >>>>>>>>> stable at this point. I'd venture to say that we're near a
> 1.0
> >> > >>>>>> release.
> >> > >>>>>>>>> I'd
> >> > >>>>>>>>> like to propose that we take what we've learned, and begin
> >> > >>>>>>>>> thinking
> >> > >>>>>>> about
> >> > >>>>>>>>> Samza beyond 1.0. What would we change if we were starting
> >> from
> >> > >>>>>> scratch?
> >> > >>>>>>>>> My
> >> > >>>>>>>>> proposal is to:
> >> > >>>>>>>>>
> >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run Samza processors,
> >> > >>>>>>>>> and eliminate all direct dependencies on YARN, Mesos, etc.
> >> > >>>>>>>>> 2. Make a definitive call to support only Kafka as the stream
> >> > >>>>>>>>> processing layer.
> >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, serialization, and config
> >> > >>>>>>>>> systems, and simply use Kafka's instead.
> >> > >>>>>>>>>
> >> > >>>>>>>>> This would fix all of the issues that I outlined above. It
> >> > >>>>>>>>> should
> >> > >>>>> also
> >> > >>>>>>>>> shrink the Samza code base pretty dramatically. Supporting
> >> only
> >> > >>>>>>>>> a standalone container will allow Samza to be executed on
> YARN
> >> > >>>>>>>>> (using Slider), Mesos (using Marathon/Aurora), or most other
> >> > >>>>>>>>> in-house
> >> > >>>>>>> deployment
> >> > >>>>>>>>> systems. This should make life a lot easier for new users.
> >> > >>>>>>>>> Imagine
> >> > >>>>>>> having
> >> > >>>>>>>>> the hello-samza tutorial without YARN. The drop in mailing
> >> list
> >> > >>>>>> traffic
> >> > >>>>>>>>> will be pretty dramatic.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Coupling with Kafka seems long overdue to me. The reality
> is,
> >> > >>>>> everyone
> >> > >>>>>>>>> that
> >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We basically require
> >> it
> >> > >>>>>> already
> >> > >>>>>>> in
> >> > >>>>>>>>> order for most features to work. Those that are using other
> >> > >>>>>>>>> systems
> >> > >>>>>> are
> >> > >>>>>>>>> generally using it for ingest into Kafka (1), and then they
> do
> >> > >>>>>>>>> the processing on top. There is already discussion (
> >> > >>>>>>>>>
> >> > >>>>>>>
> >> > >>>>>>
> >> > >>>>>
> >> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
> >> > >>>>> 767
> >> > >>>>>>>>> )
> >> > >>>>>>>>> in Kafka to make ingesting into Kafka extremely easy.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Once we make the call to couple with Kafka, we can leverage
> a
> >> > >>>>>>>>> ton of
> >> > >>>>>>> their
> >> > >>>>>>>>> ecosystem. We no longer have to maintain our own config,
> >> > >>>>>>>>> metrics,
> >> > >>>>> etc.
> >> > >>>>>>> We
> >> > >>>>>>>>> can all share the same libraries, and make them better. This
> >> > >>>>>>>>> will
> >> > >>>>> also
> >> > >>>>>>>>> allow us to share the consumer/producer APIs, and will let
> us
> >> > >>>>> leverage
> >> > >>>>>>>>> their offset management and partition management, rather
> than
> >> > >>>>>>>>> having
> >> > >>>>>> our
> >> > >>>>>>>>> own. All of the coordinator stream code would go away, as
> >> would
> >> > >>>>>>>>> most
> >> > >>>>>> of
> >> > >>>>>>>>> the
> >> > >>>>>>>>> YARN AppMaster code. We'd probably have to push some
> partition
> >> > >>>>>>> management
> >> > >>>>>>>>> features into the Kafka broker, but they're already moving
> in
> >> > >>>>>>>>> that direction with the new consumer API. The features we
> have
> >> > >>>>>>>>> for
> >> > >>>>>> partition
> >> > >>>>>>>>> assignment aren't unique to Samza, and seem like they should
> >> be
> >> > >>>>>>>>> in
> >> > >>>>>> Kafka
> >> > >>>>>>>>> anyway. There will always be some niche usages which will
> >> > >>>>>>>>> require
> >> > >>>>>> extra
> >> > >>>>>>>>> care and hence full control over partition assignments much
> >> > >>>>>>>>> like the
> >> > >>>>>>> Kafka
> >> > >>>>>>>>> low level consumer api. These would continue to be
> supported.
> >> > >>>>>>>>>
> >> > >>>>>>>>> These items will be good for the Samza community. They'll
> make
> >> > >>>>>>>>> Samza easier to use, and make it easier for developers to
> add
> >> > >>>>>>>>> new features.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Obviously this is a fairly large (and somewhat backwards
> >> > >>>>> incompatible
> >> > >>>>>>>>> change). If we choose to go this route, it's important that
> we
> >> > >>>>> openly
> >> > >>>>>>>>> communicate how we're going to provide a migration path from
> >> > >>>>>>>>> the
> >> > >>>>>>> existing
> >> > >>>>>>>>> APIs to the new ones (if we make incompatible changes). I
> >> think
> >> > >>>>>>>>> at a minimum, we'd probably need to provide a wrapper to
> allow
> >> > >>>>>>>>> existing StreamTask implementations to continue running on
> the
> >> > >>>> new container.
> >> > >>>>>>> It's
> >> > >>>>>>>>> also important that we openly communicate about timing, and
> >> > >>>>>>>>> stages
> >> > >>>>> of
> >> > >>>>>>> the
> >> > >>>>>>>>> migration.
> >> > >>>>>>>>>
> >> > >>>>>>>>> If you made it this far, I'm sure you have opinions. :)
> Please
> >> > >>>>>>>>> send
> >> > >>>>>> your
> >> > >>>>>>>>> thoughts and feedback.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Cheers,
> >> > >>>>>>>>> Chris
> >> > >>>>>>>>>
> >> > >>>>>>>>
> >> > >>>>>>>>
> >> > >>>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> --
> >> > >>>>>> -- Guozhang
> >> > >>>>>>
> >> > >>>>>
> >> > >>>>
> >> > >>
> >> > >>
> >> >
> >> >
> >> >
> >>
> >
> >
>
