Hi all,

Interesting stuff! Jumping in a bit late, but here goes...

I'd definitely be excited about a slimmed-down and more Kafka-specific
Samza -- you don't seem to lose much functionality that people
actually use, and the gains in simplicity / code sharing seem
potentially very large. (I've spent a bunch of time peeling back those
layers of abstraction to get e.g. more control over message send order,
and working directly against Kafka's APIs would have been much
easier.) I also like the approach of letting Kafka code do the heavy
lifting and letting stream processing systems build on those -- good,
reusable implementations would be great for the whole
stream-processing ecosystem, and Samza in particular.
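
To make that concrete, here's a rough, illustrative sketch -- not from
anyone's prototype; broker address and topic names are made up -- of the
kind of send-order control you get by going straight to the producer API,
capping in-flight requests so that retries can't reorder writes:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());
// Allow only one in-flight request per connection so retries can't
// reorder messages sent to the same partition.
props.put("max.in.flight.requests.per.connection", "1");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("output-topic", "key", "first"));
producer.send(new ProducerRecord<>("output-topic", "key", "second"));
producer.close();

Doing that through the extra Samza layers was much more work than it
needed to be.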

On the other hand, I do hope that using Kafka's group membership /
partition assignment / etc. stays optional. As far as I can tell,
~every major stream processing system that uses Kafka has chosen (or
switched to) 'static' partitioning, where each logical task consumes a
fixed set of partitions. When deploying dynamically (a la Storm / Mesos
/ YARN), the underlying system is already doing failure detection and
transferring work between hosts when machines go down, so using
Kafka's implementation is redundant at best -- and at worst, the
interaction between the two systems can make outages worse.
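
For what it's worth, here's a minimal sketch (illustrative only; broker
address and topic name are made up) of what that 'static' style looks
like against the new consumer API -- the task pins its partitions with
assign() instead of joining a group via subscribe(), so broker-side
rebalancing never enters the picture:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// This logical task always owns partitions 0 and 1. Failure detection
// and placement are left to whatever starts the process (YARN, Mesos,
// puppet, ...), so a restart elsewhere can't trigger a rebalance.
consumer.assign(Arrays.asList(
    new TopicPartition("test-topic-1", 0),
    new TopicPartition("test-topic-1", 1)));
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(100);
  for (ConsumerRecord<String, String> record : records)
    process(record);  // process() is a stand-in for the task's logic
}

With subscribe() and a group.id, the coordinator would instead move
partitions around as instances come and go -- which is exactly the
behavior I'd want to be able to switch off when the cluster manager is
already handling failures.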

And thanks to Chris / Jay for getting this ball rolling. Exciting times...

On Tue, Jul 7, 2015 at 2:35 PM, Jay Kreps <j...@confluent.io> wrote:
> Hey Roger,
>
> I couldn't agree more. We spent a bunch of time talking to people and that
> is exactly the stuff we heard time and again. What makes it hard, of
> course, is that there is some tension between compatibility with what's
> there now and making things better for new users.
>
> I also strongly agree with the importance of multi-language support. We are
> talking now about Java, but for application development use cases people
> want to work in whatever language they are using elsewhere. I think moving
> to a model where Kafka itself does the group membership, lifecycle control,
> and partition assignment has the advantage of putting all that complex
> stuff behind a clean api that the clients are already going to be
> implementing for their consumer, so the added functionality for stream
> processing beyond a consumer becomes very minor.
>
> -Jay
>
> On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <roger.hoo...@gmail.com>
> wrote:
>
>> Metamorphosis...nice. :)
>>
>> This has been a great discussion.  As a user of Samza who's recently
>> integrated it into a relatively large organization, I just want to add
>> support to a few points already made.
>>
>> The biggest hurdles to adoption of Samza as it currently exists that I've
>> experienced are:
>> 1) YARN - YARN is overly complex in many environments where Puppet would do
>> just fine but it was the only mechanism to get fault tolerance.
>> 2) Configuration - I think I like the idea of configuring most of the job
>> in code rather than config files.  In general, I think the goal should be
>> to make it harder to make mistakes, especially of the kind where the code
>> expects something and the config doesn't match.  The current config is
>> quite intricate and error-prone.  For example, the application logic may
>> depend on bootstrapping a topic but rather than asserting that in the code,
>> you have to rely on getting the config right.  Likewise with serdes, the
>> Java representations produced by various serdes (JSON, Avro, etc.) are not
>> equivalent so you cannot just reconfigure a serde without changing the
>> code.   It would be nice for jobs to be able to assert what they expect
>> from their input topics in terms of partitioning.  This is getting a little
>> off topic but I was even thinking about creating a "Samza config linter"
>> that would sanity check a set of configs.  Especially in organizations
>> where config is managed by a different team than the application developer,
>> it's very hard to avoid config mistakes.
>> 3) Java/Scala centric - for many teams (especially DevOps-type folks), the
>> pain of the Java toolchain (maven, slow builds, weak command line support,
>> configuration over convention) really inhibits productivity.  As more and
>> more high-quality clients become available for Kafka, I hope they'll follow
>> Samza's model.  Not sure how much it affects the proposals in this thread
>> but please consider other languages in the ecosystem as well.  From what
>> I've heard, Spark has more Python users than Java/Scala.
>> (FYI, we added a Jython wrapper for the Samza API
>>
>> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
>> and are working on a Yeoman generator
>> https://github.com/Quantiply/generator-rico for Jython/Samza projects to
>> alleviate some of the pain)
>>
>> I also want to underscore Jay's point about improving the user experience.
>> That's a very important factor for adoption.  I think the goal should be to
>> make Samza as easy to get started with as something like Logstash.
>> Logstash is vastly inferior in terms of capabilities to Samza but it's easy
>> to get started and that makes a big difference.
>>
>> Cheers,
>>
>> Roger
>>
>>
>>
>>
>>
>> On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales <
>> g...@apache.org> wrote:
>>
>> > Forgot to add. On the naming issues, Kafka Metamorphosis is a clear
>> winner
>> > :)
>> >
>> > --
>> > Gianmarco
>> >
>> > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <g...@apache.org
>> >
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > @Martin, thanks for your comments.
>> > > Maybe I'm missing some important point, but I think coupling the
>> releases
>> > > is actually a *good* thing.
>> > > To make an example, would it be better if the MR and HDFS components of
>> > > Hadoop had different release schedules?
>> > >
>> > > Actually, keeping the discussion in a single place would make agreeing
>> on
>> > > releases (and backwards compatibility) much easier, as everybody would
>> be
>> > > responsible for the whole codebase.
>> > >
>> > > That said, I like the idea of absorbing samza-core as a sub-project,
>> and
>> > > leave the fancy stuff separate.
>> > > It probably gives 90% of the benefits we have been discussing here.
>> > >
>> > > Cheers,
>> > >
>> > > --
>> > > Gianmarco
>> > >
>> > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> wrote:
>> > >
>> > >> Hey Martin,
>> > >>
>> > >> I agree coupling release schedules is a downside.
>> > >>
>> > >> Definitely we can try to solve some of the integration problems in
>> > >> Confluent Platform or in other distributions. But I think this ends up
>> > >> being really shallow. I guess I feel to really get a good user
>> > experience
>> > >> the two systems have to kind of feel like part of the same thing and
>> you
>> > >> can't really add that in later--you can put both in the same
>> > downloadable
>> > >> tar file but it doesn't really give a very cohesive feeling. I agree
>> > that
>> > >> ultimately any of the project stuff is as much social and naming as
>> > >> anything else--theoretically two totally independent projects could
>> work
>> > >> to
>> > >> tightly align. In practice this seems to be quite difficult though.
>> > >>
>> > >> For the frameworks--totally agree it would be good to maintain the
>> > >> framework support with the project. In some cases there may not be too
>> > >> much
>> > >> there since the integration gets lighter but I think whatever stubs
>> you
>> > >> need should be included. So no I definitely wasn't trying to imply
>> > >> dropping
>> > >> support for these frameworks, just making the integration lighter by
>> > >> separating process management from partition management.
>> > >>
>> > >> You raise two good points we would have to figure out if we went down
>> > the
>> > >> alignment path:
>> > >> 1. With respect to the name, yeah I think the first question is
>> whether
>> > >> some "re-branding" would be worth it. If so then I think we can have a
>> > big
>> > >> thread on the name. I'm definitely not set on Kafka Streaming or Kafka
>> > >> Streams I was just using them to be kind of illustrative. I agree with
>> > >> your
>> > >> critique of these names, though I think people would get the idea.
>> > >> 2. Yeah you also raise a good point about how to "factor" it. Here are
>> > the
>> > >> options I see (I could get enthusiastic about any of them):
>> > >>    a. One repo for both Kafka and Samza
>> > >>    b. Two repos, retaining the current separation
>> > >>    c. Two repos, the equivalent of samza-api and samza-core is
>> absorbed
>> > >> almost like a third client
>> > >>
>> > >> Cheers,
>> > >>
>> > >> -Jay
>> > >>
>> > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <
>> mar...@kleppmann.com>
>> > >> wrote:
>> > >>
>> > >> > Ok, thanks for the clarifications. Just a few follow-up comments.
>> > >> >
>> > >> > - I see the appeal of merging with Kafka or becoming a subproject:
>> the
>> > >> > reasons you mention are good. The risk I see is that release
>> schedules
>> > >> > become coupled to each other, which can slow everyone down, and
>> large
>> > >> > projects with many contributors are harder to manage. (Jakob, can
>> you
>> > >> speak
>> > >> > from experience, having seen a wider range of Hadoop ecosystem
>> > >> projects?)
>> > >> >
>> > >> > Some of the goals of a better unified developer experience could
>> also
>> > be
>> > >> > solved by integrating Samza nicely into a Kafka distribution (such
>> as
>> > >> > Confluent's). I'm not against merging projects if we decide that's
>> the
>> > >> way
>> > >> > to go, just pointing out the same goals can perhaps also be achieved
>> > in
>> > >> > other ways.
>> > >> >
>> > >> > - With regard to dropping the YARN dependency: are you proposing
>> that
>> > >> > Samza doesn't give any help to people wanting to run on
>> > >> YARN/Mesos/AWS/etc?
>> > >> > So the docs would basically have a link to Slider and nothing else?
>> Or
>> > >> > would we maintain integrations with a bunch of popular deployment
>> > >> methods
>> > >> > (e.g. the necessary glue and shell scripts to make Samza work with
>> > >> Slider)?
>> > >> >
>> > >> > I absolutely think it's a good idea to have the "as a library" and
>> > "as a
>> > >> > process" (using Yi's taxonomy) options for people who want them,
>> but I
>> > >> > think there should also be a low-friction path for common "as a
>> > service"
>> > >> > deployment methods, for which we probably need to maintain
>> > integrations.
>> > >> >
>> > >> > - Project naming: "Kafka Streams" seems odd to me, because Kafka is
>> > all
>> > >> > about streams already. Perhaps "Kafka Transformers" or "Kafka
>> Filters"
>> > >> > would be more apt?
>> > >> >
>> > >> > One suggestion: perhaps the core of Samza (stream transformation
>> with
>> > >> > state management -- i.e. the "Samza as a library" bit) could become
>> > >> part of
>> > >> > Kafka, while higher-level tools such as streaming SQL and
>> integrations
>> > >> with
>> > >> > deployment frameworks remain in a separate project? In other words,
>> > >> Kafka
>> > >> > would absorb the proven, stable core of Samza, which would become
>> the
>> > >> > "third Kafka client" mentioned early in this thread. The Samza
>> project
>> > >> > would then target that third Kafka client as its base API, and the
>> > >> project
>> > >> > would be freed up to explore more experimental new horizons.
>> > >> >
>> > >> > Martin
>> > >> >
>> > >> > On 6 Jul 2015, at 18:51, Jay Kreps <jay.kr...@gmail.com> wrote:
>> > >> >
>> > >> > > Hey Martin,
>> > >> > >
>> > >> > > For the YARN/Mesos/etc decoupling I actually don't think it ties
>> our
>> > >> > hands
>> > >> > > at all, all it does is refactor things. The division of
>> > >> responsibility is
>> > >> > > that Samza core is responsible for task lifecycle, state, and
>> > >> partition
>> > >> > > management (using the Kafka co-ordinator) but it is NOT
>> responsible
>> > >> for
>> > >> > > packaging, configuration deployment or execution of processes. The
>> > >> > problem
>> > >> > > of packaging and starting these processes is
>> > >> > > framework/environment-specific. This leaves individual frameworks
>> to
>> > >> be
>> > >> > as
>> > >> > > fancy or vanilla as they like. So you can get simple stateless
>> > >> support in
>> > >> > > YARN, Mesos, etc using their off-the-shelf app framework (Slider,
>> > >> > Marathon,
>> > >> > > etc). These are well known by people and have nice UIs and a lot
>> of
>> > >> > > flexibility. I don't think they have node affinity as a built in
>> > >> option
>> > >> > > (though I could be wrong). So if we want that we can either wait
>> for
>> > >> them
>> > >> > > to add it or do a custom framework to add that feature (as now).
>> > >> > Obviously
>> > >> > > if you manage things with old-school ops tools (puppet/chef/etc)
>> you
>> > >> get
>> > >> > > locality easily. The nice thing, though, is that all the samza
>> > >> "business
>> > >> > > logic" around partition management and fault tolerance is in Samza
>> > >> core
>> > >> > so
>> > >> > > it is shared across frameworks and the framework specific bit is
>> > just
>> > >> > > whether it is smart enough to try to get the same host when a job
>> is
>> > >> > > restarted.
>> > >> > >
>> > >> > > With respect to the Kafka-alignment, yeah I think the goal would
>> be
>> > >> (a)
>> > >> > > actually get better alignment in user experience, and (b) express
>> > >> this in
>> > >> > > the naming and project branding. Specifically:
>> > >> > > 1. Website/docs, it would be nice for the "transformation" api to
>> be
>> > >> > > discoverable in the main Kafka docs--i.e. be able to explain when
>> to
>> > >> use
>> > >> > > the consumer and when to use the stream processing functionality
>> and
>> > >> lead
>> > >> > > people into that experience.
>> > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or whatever) that
>> has
>> > >> both
>> > >> > > Kafka and the stream processing part and they actually work
>> > together.
>> > >> > > 3. Unify the programming experience so the client and Samza api
>> > share
>> > >> > > config/monitoring/naming/packaging/etc.
>> > >> > >
>> > >> > > I think sub-projects keep separate committers and can have a
>> > separate
>> > >> > repo,
>> > >> > > but I'm actually not really sure (I can't find a definition of a
>> > >> > subproject
>> > >> > > in Apache).
>> > >> > >
>> > >> > > Basically at a high-level you want the experience to "feel" like a
>> > >> single
>> > >> > > system, not two relatively independent things that are kind of
>> > >> awkwardly
>> > >> > > glued together.
>> > >> > >
>> > >> > > I think if we did that then having naming or branding like "kafka
>> > >> > > streaming" or "kafka streams" or something like that would
>> actually
>> > >> do a
>> > >> > > good job of conveying what it is. I do that this would help
>> adoption
>> > >> > quite
>> > >> > > a lot as it would correctly convey that using Kafka Streaming with
>> > >> Kafka
>> > >> > is
>> > >> > > a fairly seamless experience and Kafka is pretty heavily adopted
>> at
>> > >> this
>> > >> > > point.
>> > >> > >
>> > >> > > Fwiw we actually considered this model originally when open
>> sourcing
>> > >> > Samza,
>> > >> > > however at that time Kafka was relatively unknown and we decided
>> not
>> > >> to
>> > >> > do
>> > >> > > it since we felt it would be limiting. From my point of view the
>> > three
>> > >> > > things have changed (1) Kafka is now really heavily used for
>> stream
>> > >> > > processing, (2) we learned that abstracting out the stream well is
>> > >> > > basically impossible, (3) we learned it is really hard to keep the
>> > two
>> > >> > > things feeling like a single product.
>> > >> > >
>> > >> > > -Jay
>> > >> > >
>> > >> > >
>> > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <
>> > >> mar...@kleppmann.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > >> Hi all,
>> > >> > >>
>> > >> > >> Lots of good thoughts here.
>> > >> > >>
>> > >> > >> I agree with the general philosophy of tying Samza more firmly to
>> > >> Kafka.
>> > >> > >> After I spent a while looking at integrating other message
>> brokers
>> > >> (e.g.
>> > >> > >> Kinesis) with SystemConsumer, I came to the conclusion that
>> > >> > SystemConsumer
>> > >> > >> tacitly assumes a model so much like Kafka's that pretty much
>> > nobody
>> > >> but
>> > >> > >> Kafka actually implements it. (Databus is perhaps an exception,
>> but
>> > >> it
>> > >> > >> isn't widely used outside of LinkedIn.) Thus, making Samza fully
>> > >> > dependent
>> > >> > >> on Kafka acknowledges that the system-independence was never as
>> > real
>> > >> as
>> > >> > we
>> > >> > >> perhaps made it out to be. The gains of code reuse are real.
>> > >> > >>
>> > >> > >> The idea of decoupling Samza from YARN has also always been
>> > >> appealing to
>> > >> > >> me, for various reasons already mentioned in this thread.
>> Although
>> > >> > making
>> > >> > >> Samza jobs deployable on anything (YARN/Mesos/AWS/etc) seems
>> > >> laudable,
>> > >> > I am
>> > >> > >> a little concerned that it will restrict us to a lowest common
>> > >> > denominator.
>> > >> > >> For example, would host affinity (SAMZA-617) still be possible?
>> For
>> > >> jobs
>> > >> > >> with large amounts of state, I think SAMZA-617 would be a big
>> boon,
>> > >> > since
>> > >> > >> restoring state off the changelog on every single restart is
>> > painful,
>> > >> > due
>> > >> > >> to long recovery times. It would be a shame if the decoupling
>> from
>> > >> YARN
>> > >> > >> made host affinity impossible.
>> > >> > >>
>> > >> > >> Jay, a question about the proposed API for instantiating a job in
>> > >> code
>> > >> > >> (rather than a properties file): when submitting a job to a
>> > cluster,
>> > >> is
>> > >> > the
>> > >> > >> idea that the instantiation code runs on a client somewhere,
>> which
>> > >> then
>> > >> > >> pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or does that
>> > >> code
>> > >> > run
>> > >> > >> on each container that is part of the job (in which case, how
>> does
>> > >> the
>> > >> > job
>> > >> > >> submission to the cluster work)?
>> > >> > >>
>> > >> > >> I agree with Garry that it doesn't feel right to make a 1.0
>> release
>> > >> > with a
>> > >> > >> plan for it to be immediately obsolete. So if this is going to
>> > >> happen, I
>> > >> > >> think it would be more honest to stick with 0.* version numbers
>> > until
>> > >> > the
>> > >> > >> library-ified Samza has been implemented, is stable and widely
>> > used.
>> > >> > >>
>> > >> > >> Should the new Samza be a subproject of Kafka? There is precedent
>> > for
>> > >> > >> tight coupling between different Apache projects (e.g. Curator
>> and
>> > >> > >> Zookeeper, or Slider and YARN), so I think remaining separate
>> would
>> > >> be
>> > >> > ok.
>> > >> > >> Even if Samza is fully dependent on Kafka, there is enough
>> > substance
>> > >> in
>> > >> > >> Samza that it warrants being a separate project. An argument in
>> > >> favour
>> > >> > of
>> > >> > >> merging would be if we think Kafka has a much stronger "brand
>> > >> presence"
>> > >> > >> than Samza; I'm ambivalent on that one. If the Kafka project is
>> > >> willing
>> > >> > to
>> > >> > >> endorse Samza as the "official" way of doing stateful stream
>> > >> > >> transformations, that would probably have much the same effect as
>> > >> > >> re-branding Samza as "Kafka Stream Processors" or suchlike. Close
>> > >> > >> collaboration between the two projects will be needed in any
>> case.
>> > >> > >>
>> > >> > >> From a project management perspective, I guess the "new Samza"
>> > would
>> > >> > have
>> > >> > >> to be developed on a branch alongside ongoing maintenance of the
>> > >> current
>> > >> > >> line of development? I think it would be important to continue
>> > >> > supporting
>> > >> > >> existing users, and provide a graceful migration path to the new
>> > >> > version.
>> > >> > >> Leaving the current versions unsupported and forcing people to
>> > >> rewrite
>> > >> > >> their jobs would send a bad signal.
>> > >> > >>
>> > >> > >> Best,
>> > >> > >> Martin
>> > >> > >>
>> > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> wrote:
>> > >> > >>
>> > >> > >>> Hey Garry,
>> > >> > >>>
>> > >> > >>> Yeah that's super frustrating. I'd be happy to chat more about
>> > this
>> > >> if
>> > >> > >>> you'd be interested. I think Chris and I started with the idea
>> of
>> > >> "what
>> > >> > >>> would it take to make Samza a kick-ass ingestion tool" but
>> > >> ultimately
>> > >> > we
>> > >> > >>> kind of came around to the idea that ingestion and
>> transformation
>> > >> had
>> > >> > >>> pretty different needs and coupling the two made things hard.
>> > >> > >>>
>> > >> > >>> For what it's worth I think copycat (KIP-26) actually will do
>> what
>> > >> you
>> > >> > >> are
>> > >> > >>> looking for.
>> > >> > >>>
>> > >> > >>> With regard to your point about slider, I don't necessarily
>> > >> disagree.
>> > >> > >> But I
>> > >> > >>> think getting good YARN support is quite doable and I think we
>> can
>> > >> make
>> > >> > >>> that work well. I think the issue this proposal solves is that
>> > >> > >> technically
>> > >> > >>> it is pretty hard to support multiple cluster management systems
>> > the
>> > >> > way
>> > >> > >>> things are now, you need to write an "app master" or "framework"
>> > for
>> > >> > each
>> > >> > >>> and they are all a little different so testing is really hard.
>> In
>> > >> the
>> > >> > >>> absence of this we have been stuck with just YARN which has
>> > >> fantastic
>> > >> > >>> penetration in the Hadoopy part of the org, but zero penetration
>> > >> > >> elsewhere.
>> > >> > >>> Given the huge amount of work being put in to slider, marathon,
>> > aws
>> > >> > >>> tooling, not to mention the umpteen related packaging
>> technologies
>> > >> > people
>> > >> > >>> want to use (Docker, Kubernetes, various cloud-specific deploy
>> > >> tools,
>> > >> > >> etc)
>> > >> > >>> I really think it is important to get this right.
>> > >> > >>>
>> > >> > >>> -Jay
>> > >> > >>>
>> > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <
>> > >> > >>> g.turking...@improvedigital.com> wrote:
>> > >> > >>>
>> > >> > >>>> Hi all,
>> > >> > >>>>
>> > >> > >>>> I think the question below re does Samza become a sub-project
>> of
>> > >> Kafka
>> > >> > >>>> highlights the broader point around migration. Chris mentions
>> > >> Samza's
>> > >> > >>>> maturity is heading towards a v1 release but I'm not sure it
>> > feels
>> > >> > >> right to
>> > >> > >>>> launch a v1 then immediately plan to deprecate most of it.
>> > >> > >>>>
>> > >> > >>>> From a selfish perspective I have some guys who have started
>> > >> working
>> > >> > >> with
>> > >> > >>>> Samza and building some new consumers/producers was next up.
>> > Sounds
>> > >> > like
>> > >> > >>>> that is absolutely not the direction to go. I need to look into
>> > the
>> > >> > KIP
>> > >> > >> in
>> > >> > >>>> more detail but for me the attractiveness of adding new Samza
>> > >> > >>>> consumer/producers -- even if yes all they were doing was
>> really
>> > >> > getting
>> > >> > >>>> data into and out of Kafka --  was to avoid  having to worry
>> > about
>> > >> the
>> > >> > >>>> lifecycle management of external clients. If there is a generic
>> > >> Kafka
>> > >> > >>>> ingress/egress layer that I can plug a new connector into and
>> > have
>> > >> a
>> > >> > >> lot of
>> > >> > >>>> the heavy lifting re scale and reliability done for me then it
>> > >> gives
>> > >> > me
>> > >> > >> all
>> > >> > >>>> the pushing new consumers/producers would. If not then it
>> > >> complicates
>> > >> > my
>> > >> > >>>> operational deployments.
>> > >> > >>>>
>> > >> > >>>> Which is similar to my other question with the proposal -- if
>> we
>> > >> > build a
>> > >> > >>>> fully available/stand-alone Samza plus the requisite shims to
>> > >> > integrate
>> > >> > >>>> with Slider etc I suspect the former may be a lot more work
>> than
>> > we
>> > >> > >> think.
>> > >> > >>>> We may make it much easier for a newcomer to get something
>> > running
>> > >> but
>> > >> > >>>> having them step up and get a reliable production deployment
>> may
>> > >> still
>> > >> > >>>> dominate mailing list  traffic, if for different reasons than
>> > >> today.
>> > >> > >>>>
>> > >> > >>>> Don't get me wrong -- I'm comfortable with making the Samza
>> > >> dependency
>> > >> > >> on
>> > >> > >>>> Kafka much more explicit and I absolutely see the benefits  in
>> > the
>> > >> > >>>> reduction of duplication and clashing
>> terminologies/abstractions
>> > >> that
>> > >> > >>>> Chris/Jay describe. Samza as a library would likely be a very
>> > nice
>> > >> > tool
>> > >> > >> to
>> > >> > >>>> add to the Kafka ecosystem. I just have the concerns above re
>> the
>> > >> > >>>> operational side.
>> > >> > >>>>
>> > >> > >>>> Garry
>> > >> > >>>>
>> > >> > >>>> -----Original Message-----
>> > >> > >>>> From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
>> > >> > >>>> Sent: 02 July 2015 12:56
>> > >> > >>>> To: dev@samza.apache.org
>> > >> > >>>> Subject: Re: Thoughts and observations on Samza
>> > >> > >>>>
>> > >> > >>>> Very interesting thoughts.
>> > >> > >>>> From outside, I have always perceived Samza as a computing
>> layer
>> > >> over
>> > >> > >>>> Kafka.
>> > >> > >>>>
>> > >> > >>>> The question, maybe a bit provocative, is "should Samza be a
>> > >> > sub-project
>> > >> > >>>> of Kafka then?"
>> > >> > >>>> Or does it make sense to keep it as a separate project with a
>> > >> separate
>> > >> > >>>> governance?
>> > >> > >>>>
>> > >> > >>>> Cheers,
>> > >> > >>>>
>> > >> > >>>> --
>> > >> > >>>> Gianmarco
>> > >> > >>>>
>> > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com>
>> wrote:
>> > >> > >>>>
>> > >> > >>>>> Overall, I agree to couple with Kafka more tightly. Because
>> > Samza
>> > >> de
>> > >> > >>>>> facto is based on Kafka, and it should leverage what Kafka
>> has.
>> > At
>> > >> > the
>> > >> > >>>>> same time, Kafka does not need to reinvent what Samza already
>> > >> has. I
>> > >> > >>>>> also like the idea of separating the ingestion and
>> > transformation.
>> > >> > >>>>>
>> > >> > >>>>> But it is a little difficult for me to image how the Samza
>> will
>> > >> look
>> > >> > >>>> like.
>> > >> > >>>>> And I feel Chris and Jay have a little difference in terms of
>> > how
>> > >> > >>>>> Samza should look like.
>> > >> > >>>>>
>> > >> > >>>>> *** Will it look like what Jay's code shows (A client of
>> Kakfa)
>> > ?
>> > >> And
>> > >> > >>>>> user's application code calls this client?
>> > >> > >>>>>
>> > >> > >>>>> 1. If we make Samza be a library of Kafka (like what the code
>> > >> shows),
>> > >> > >>>>> how do we implement auto-balance and fault-tolerance? Are they
>> > >> taken
>> > >> > >>>>> care by the Kafka broker or other mechanism, such as "Samza
>> > >> worker"
>> > >> > >>>>> (just make up the name) ?
>> > >> > >>>>>
>> > >> > >>>>> 2. What about other features, such as auto-scaling, shared
>> > state,
>> > >> > >>>>> monitoring?
>> > >> > >>>>>
>> > >> > >>>>>
>> > >> > >>>>> *** If we have Samza standalone, (is this what Chris
>> suggests?)
>> > >> > >>>>>
>> > >> > >>>>> 1. we still need to ingest data from Kakfa and produce to it.
>> > >> Then it
>> > >> > >>>>> becomes the same as what Samza looks like now, except it does
>> > not
>> > >> > rely
>> > >> > >>>>> on Yarn anymore.
>> > >> > >>>>>
>> > >> > >>>>> 2. if it is standalone, how can it leverage Kafka's metrics,
>> > logs,
>> > >> > >>>>> etc? Use Kafka code as the dependency?
>> > >> > >>>>>
>> > >> > >>>>>
>> > >> > >>>>> Thanks,
>> > >> > >>>>>
>> > >> > >>>>> Fang, Yan
>> > >> > >>>>> yanfang...@gmail.com
>> > >> > >>>>>
>> > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <
>> > wangg...@gmail.com
>> > >> >
>> > >> > >>>> wrote:
>> > >> > >>>>>
>> > >> > >>>>>> Read through the code example and it looks good to me. A few
>> > >> > >>>>>> thoughts regarding deployment:
>> > >> > >>>>>>
>> > >> > >>>>>> Today Samza deploys as executable runnable like:
>> > >> > >>>>>>
>> > >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=...
>> > >> > >>>> --config-path=file://...
>> > >> > >>>>>>
>> > >> > >>>>>> And this proposal advocate for deploying Samza more as
>> embedded
>> > >> > >>>>>> libraries in user application code (ignoring the terminology
>> > >> since
>> > >> > >>>>>> it is not the
>> > >> > >>>>> same
>> > >> > >>>>>> as the prototype code):
>> > >> > >>>>>>
>> > >> > >>>>>> StreamTask task = new MyStreamTask(configs); Thread thread =
>> > new
>> > >> > >>>>>> Thread(task); thread.start();
>> > >> > >>>>>>
>> > >> > >>>>>> I think both of these deployment modes are important for
>> > >> different
>> > >> > >>>>>> types
>> > >> > >>>>> of
>> > >> > >>>>>> users. That said, I think making Samza purely standalone is
>> > still
>> > >> > >>>>>> sufficient for either runnable or library modes.
>> > >> > >>>>>>
>> > >> > >>>>>> Guozhang
>> > >> > >>>>>>
>> > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <
>> j...@confluent.io>
>> > >> > wrote:
>> > >> > >>>>>>
>> > >> > >>>>>>> Looks like gmail mangled the code example, it was supposed
>> to
>> > >> look
>> > >> > >>>>>>> like
>> > >> > >>>>>>> this:
>> > >> > >>>>>>>
>> > >> > >>>>>>> Properties props = new Properties();
>> > >> > >>>>>>> props.put("bootstrap.servers", "localhost:4242");
>> > >> > >>>>>>> StreamingConfig config = new StreamingConfig(props);
>> > >> > >>>>>>> config.subscribe("test-topic-1", "test-topic-2");
>> > >> > >>>>>>> config.processor(ExampleStreamProcessor.class);
>> > >> > >>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
>> > >> > >>>>>>> KafkaStreaming container = new KafkaStreaming(config);
>> > >> > >>>>>>> container.run();
>> > >> > >>>>>>>
>> > >> > >>>>>>> -Jay
>> > >> > >>>>>>>
>> > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <
>> j...@confluent.io
>> > >
>> > >> > >>>> wrote:
>> > >> > >>>>>>>
>> > >> > >>>>>>>> Hey guys,
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> This came out of some conversations Chris and I were having
>> > >> > >>>>>>>> around
>> > >> > >>>>>>> whether
>> > >> > >>>>>>>> it would make sense to use Samza as a kind of data
>> ingestion
>> > >> > >>>>> framework
>> > >> > >>>>>>> for
>> > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26 "copycat"). This
>> kind
>> > of
>> > >> > >>>>>> combined
>> > >> > >>>>>>>> with complaints around config and YARN and the discussion
>> > >> around
>> > >> > >>>>>>>> how
>> > >> > >>>>> to
>> > >> > >>>>>>>> best do a standalone mode.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> So the thought experiment was, given that Samza was
>> basically
>> > >> > >>>>>>>> already totally Kafka specific, what if you just embraced
>> > that
>> > >> > >>>>>>>> and turned it
>> > >> > >>>>>> into
>> > >> > >>>>>>>> something less like a heavyweight framework and more like a
>> > >> > >>>>>>>> third
>> > >> > >>>>> Kafka
>> > >> > >>>>>>>> client--a kind of "producing consumer" with state
>> management
>> > >> > >>>>>> facilities.
>> > >> > >>>>>>>> Basically a library. Instead of a complex stream processing
>> > >> > >>>>>>>> framework
>> > >> > >>>>>>> this
>> > >> > >>>>>>>> would actually be a very simple thing, not much more
>> > >> complicated
>> > >> > >>>>>>>> to
>> > >> > >>>>> use
>> > >> > >>>>>>> or
>> > >> > >>>>>>>> operate than a Kafka consumer. As Chris said we thought
>> about
>> > >> it
>> > >> > >>>>>>>> a
>> > >> > >>>>> lot
>> > >> > >>>>>> of
>> > >> > >>>>>>>> what Samza (and the other stream processing systems were
>> > doing)
>> > >> > >>>>> seemed
>> > >> > >>>>>>> like
>> > >> > >>>>>>>> kind of a hangover from MapReduce.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> Of course you need to ingest/output data to and from the
>> > stream
>> > >> > >>>>>>>> processing. But when we actually looked into how that would
>> > >> > >>>>>>>> work,
>> > >> > >>>>> Samza
>> > >> > >>>>>>>> isn't really an ideal data ingestion framework for a bunch
>> of
>> > >> > >>>>> reasons.
>> > >> > >>>>>> To
>> > >> > >>>>>>>> really do that right you need a pretty different internal
>> > data
>> > >> > >>>>>>>> model
>> > >> > >>>>>> and
>> > >> > >>>>>>>> set of apis. So what if you split them and had an api for
>> > Kafka
>> > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) and a separate api for
>> > >> Kafka
>> > >> > >>>>>>>> transformation (Samza).
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> This would also allow really embracing the same terminology
>> > and
>> > >> > >>>>>>>> conventions. One complaint about the current state is that
>> > the
>> > >> > >>>>>>>> two
>> > >> > >>>>>>> systems
>> > >> > >>>>>>>> kind of feel bolted on. Terminology like "stream" vs
>> "topic"
>> > >> and
>> > >> > >>>>>>> different
>> > >> > >>>>>>>> config and monitoring systems means you kind of have to
>> learn
>> > >> > >>>>>>>> Kafka's
>> > >> > >>>>>>> way,
>> > >> > >>>>>>>> then learn Samza's slightly different way, then kind of
>> > >> > >>>>>>>> understand
>> > >> > >>>>> how
>> > >> > >>>>>>> they
>> > >> > >>>>>>>> map to each other, which having walked a few people through
>> > >> this
>> > >> > >>>>>>>> is surprisingly tricky for folks to get.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> Since I have been spending a lot of time on airplanes I
>> > hacked
>> > >> > >>>>>>>> up an ernest but still somewhat incomplete prototype of
>> what
>> > >> > >>>>>>>> this would
>> > >> > >>>>> look
>> > >> > >>>>>>>> like. This is just unceremoniously dumped into Kafka as it
>> > >> > >>>>>>>> required a
>> > >> > >>>>>> few
>> > >> > >>>>>>>> changes to the new consumer. Here is the code:
>> > >> > >>>>>>>>
>> > >> > >>>>>>>>
>> > >> > >>>>>>>
>> > >> > >>>>>>
>> > >> > >>>>>
>> > >> >
>> > https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
>> > >> > >>>>> /apache/kafka/clients/streaming
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> For the purpose of the prototype I just liberally renamed
>> > >> > >>>>>>>> everything
>> > >> > >>>>> to
>> > >> > >>>>>>>> try to align it with Kafka with no regard for
>> compatibility.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> To use this would be something like this:
>> > >> > >>>>>>>> Properties props = new Properties();
>> > >> > >>>>>>>> props.put("bootstrap.servers", "localhost:4242");
>> > >> > >>>>>>>> StreamingConfig config = new
>> > >> > >>>>> StreamingConfig(props);
>> > >> > >>>>>>> config.subscribe("test-topic-1",
>> > >> > >>>>>>>> "test-topic-2");
>> > >> config.processor(ExampleStreamProcessor.class);
>> > >> > >>>>>>> config.serialization(new
>> > >> > >>>>>>>> StringSerializer(), new StringDeserializer());
>> KafkaStreaming
>> > >> > >>>>>> container =
>> > >> > >>>>>>>> new KafkaStreaming(config); container.run();
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> KafkaStreaming is basically the SamzaContainer;
>> > StreamProcessor
>> > >> > >>>>>>>> is basically StreamTask.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> So rather than putting all the class names in a file and
>> then
>> > >> > >>>>>>>> having
>> > >> > >>>>>> the
>> > >> > >>>>>>>> job assembled by reflection, you just instantiate the
>> > container
>> > >> > >>>>>>>> programmatically. Work is balanced over however many
>> > instances
>> > >> > >>>>>>>> of
>> > >> > >>>>> this
>> > >> > >>>>>>> are
>> > >> > >>>>>>>> alive at any time (i.e. if an instance dies, new tasks are
>> > >> added
>> > >> > >>>>>>>> to
>> > >> > >>>>> the
>> > >> > >>>>>>>> existing containers without shutting them down).
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> We would provide some glue for running this stuff in YARN
>> via
>> > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS using some of their
>> tools
>> > >> > >>>>>>>> but from the
>> > >> > >>>>>> point
>> > >> > >>>>>>> of
>> > >> > >>>>>>>> view of these frameworks these stream processing jobs are
>> > just
>> > >> > >>>>>> stateless
>> > >> > >>>>>>>> services that can come and go and expand and contract at
>> > will.
>> > >> > >>>>>>>> There
>> > >> > >>>>> is
>> > >> > >>>>>>> no
>> > >> > >>>>>>>> more custom scheduler.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> Here are some relevant details:
>> > >> > >>>>>>>>
>> > >> > >>>>>>>>  1. It is only ~1300 lines of code, it would get larger if
>> we
>> > >> > >>>>>>>>  productionized but not vastly larger. We really do get a
>> ton
>> > >> > >>>>>>>> of
>> > >> > >>>>>>> leverage
>> > >> > >>>>>>>>  out of Kafka.
>> > >> > >>>>>>>>  2. Partition management is fully delegated to the new
>> > >> consumer.
>> > >> > >>>>> This
>> > >> > >>>>>>>>  is nice since now any partition management strategy
>> > available
>> > >> > >>>>>>>> to
>> > >> > >>>>>> Kafka
>> > >> > >>>>>>>>  consumer is also available to Samza (and vice versa) and
>> > with
>> > >> > >>>>>>>> the
>> > >> > >>>>>>> exact
>> > >> > >>>>>>>>  same configs.
>> > >> > >>>>>>>>  3. It supports state as well as state reuse
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> Anyhow take a look, hopefully it is thought provoking.
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> -Jay
>> > >> > >>>>>>>>
>> > >> > >>>>>>>>
>> > >> > >>>>>>>>
>> > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <
>> > >> > >>>>>> criccom...@apache.org>
>> > >> > >>>>>>>> wrote:
>> > >> > >>>>>>>>
>> > >> > >>>>>>>>> Hey all,
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> I have had some discussions with Samza engineers at
>> LinkedIn
>> > >> > >>>>>>>>> and
>> > >> > >>>>>>> Confluent
>> > >> > >>>>>>>>> and we came up with a few observations and would like to
>> > >> > >>>>>>>>> propose
>> > >> > >>>>> some
>> > >> > >>>>>>>>> changes.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> We've observed some things that I want to call out about
>> > >> > >>>>>>>>> Samza's
>> > >> > >>>>>> design,
>> > >> > >>>>>>>>> and I'd like to propose some changes.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment system.
>> > >> > >>>>>>>>> * Samza is too pluggable.
>> > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's
>> consumer
>> > >> > >>>>>>>>> APIs
>> > >> > >>>>> are
>> > >> > >>>>>>>>> trying to solve a lot of the same problems.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> All three of these issues are related, but I'll address
>> them
>> > >> in
>> > >> > >>>>> order.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Deployment
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Samza strongly depends on the use of a dynamic deployment
>> > >> > >>>>>>>>> scheduler
>> > >> > >>>>>> such
>> > >> > >>>>>>>>> as
>> > >> > >>>>>>>>> YARN, Mesos, etc. When we initially built Samza, we bet
>> that
>> > >> > >>>>>>>>> there
>> > >> > >>>>>> would
>> > >> > >>>>>>>>> be
>> > >> > >>>>>>>>> one or two winners in this area, and we could support
>> them,
>> > >> and
>> > >> > >>>>>>>>> the
>> > >> > >>>>>> rest
>> > >> > >>>>>>>>> would go away. In reality, there are many variations.
>> > >> > >>>>>>>>> Furthermore,
>> > >> > >>>>>> many
>> > >> > >>>>>>>>> people still prefer to just start their processors like
>> > normal
>> > >> > >>>>>>>>> Java processes, and use traditional deployment scripts
>> such
>> > as
>> > >> > >>>>>>>>> Fabric,
>> > >> > >>>>>> Chef,
>> > >> > >>>>>>>>> Ansible, etc. Forcing a deployment system on users makes
>> the
>> > >> > >>>>>>>>> Samza start-up process really painful for first time
>> users.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Dynamic deployment as a requirement was also a bit of a
>> > >> > >>>>>>>>> mis-fire
>> > >> > >>>>>> because
>> > >> > >>>>>>>>> of
>> > >> > >>>>>>>>> a fundamental misunderstanding between the nature of batch
>> > >> jobs
>> > >> > >>>>>>>>> and
>> > >> > >>>>>>> stream
>> > >> > >>>>>>>>> processing jobs. Early on, we made conscious effort to
>> favor
>> > >> > >>>>>>>>> the
>> > >> > >>>>>> Hadoop
>> > >> > >>>>>>>>> (Map/Reduce) way of doing things, since it worked and was
>> > well
>> > >> > >>>>>>> understood.
>> > >> > >>>>>>>>> One thing that we missed was that batch jobs have a
>> definite
>> > >> > >>>>>> beginning,
>> > >> > >>>>>>>>> and
>> > >> > >>>>>>>>> end, and stream processing jobs don't (usually). This
>> leads
>> > to
>> > >> > >>>>>>>>> a
>> > >> > >>>>> much
>> > >> > >>>>>>>>> simpler scheduling problem for stream processors. You
>> > >> basically
>> > >> > >>>>>>>>> just
>> > >> > >>>>>>> need
>> > >> > >>>>>>>>> to find a place to start the processor, and start it. The
>> > way
>> > >> > >>>>>>>>> we run grids, at LinkedIn, there's no concept of a cluster
>> > >> > >>>>>>>>> being "full". We always
>> > >> > >>>>>> add
>> > >> > >>>>>>>>> more machines. The problem with coupling Samza with a
>> > >> scheduler
>> > >> > >>>>>>>>> is
>> > >> > >>>>>> that
>> > >> > >>>>>>>>> Samza (as a framework) now has to handle deployment. This
>> > >> pulls
>> > >> > >>>>>>>>> in a
>> > >> > >>>>>>> bunch
>> > >> > >>>>>>>>> of things such as configuration distribution (config
>> > stream),
>> > >> > >>>>>>>>> shell
>> > >> > >>>>>>> scrips
>> > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all the .tgz
>> stuff),
>> > >> etc.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Another reason for requiring dynamic deployment was to
>> > support
>> > >> > >>>>>>>>> data locality. If you want to have locality, you need to
>> put
>> > >> > >>>>>>>>> your
>> > >> > >>>>>> processors
>> > >> > >>>>>>>>> close to the data they're processing. Upon further
>> > >> > >>>>>>>>> investigation,
>> > >> > >>>>>>> though,
>> > >> > >>>>>>>>> this feature is not that beneficial. There is some good
>> > >> > >>>>>>>>> discussion
>> > >> > >>>>>> about
>> > >> > >>>>>>>>> some problems with it on SAMZA-335. Again, we took the
>> > >> > >>>>>>>>> Map/Reduce
>> > >> > >>>>>> path,
>> > >> > >>>>>>>>> but
>> > >> > >>>>>>>>> there are some fundamental differences between HDFS and
>> > Kafka.
>> > >> > >>>>>>>>> HDFS
>> > >> > >>>>>> has
>> > >> > >>>>>>>>> blocks, while Kafka has partitions. This leads to less
>> > >> > >>>>>>>>> optimization potential with stream processors on top of
>> > Kafka.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> This feature is also used as a crutch. Samza doesn't have
>> > any
>> > >> > >>>>>>>>> built
>> > >> > >>>>> in
>> > >> > >>>>>>>>> fault-tolerance logic. Instead, it depends on the dynamic
>> > >> > >>>>>>>>> deployment scheduling system to handle restarts when a
>> > >> > >>>>>>>>> processor dies. This has
>> > >> > >>>>>>> made
>> > >> > >>>>>>>>> it very difficult to write a standalone Samza container
>> > >> > >>>> (SAMZA-516).
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Pluggability
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> In some cases pluggability is good, but I think that we've
>> > >> gone
>> > >> > >>>>>>>>> too
>> > >> > >>>>>> far
>> > >> > >>>>>>>>> with it. Currently, Samza has:
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> * Pluggable config.
>> > >> > >>>>>>>>> * Pluggable metrics.
>> > >> > >>>>>>>>> * Pluggable deployment systems.
>> > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer,
>> > SystemProducer,
>> > >> > >>>> etc).
>> > >> > >>>>>>>>> * Pluggable serdes.
>> > >> > >>>>>>>>> * Pluggable storage engines.
>> > >> > >>>>>>>>> * Pluggable strategies for just about every component
>> > >> > >>>>> (MessageChooser,
>> > >> > >>>>>>>>> SystemStreamPartitionGrouper, ConfigRewriter, etc).
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> There's probably more that I've forgotten, as well. Some
>> of
>> > >> > >>>>>>>>> these
>> > >> > >>>>> are
>> > >> > >>>>>>>>> useful, but some have proven not to be. This all comes at
>> a
>> > >> cost:
>> > >> > >>>>>>>>> complexity. This complexity is making it harder for our
>> > users
>> > >> > >>>>>>>>> to
>> > >> > >>>>> pick
>> > >> > >>>>>> up
>> > >> > >>>>>>>>> and use Samza out of the box. It also makes it difficult
>> for
>> > >> > >>>>>>>>> Samza developers to reason about what the characteristics
>> of
>> > >> > >>>>>>>>> the container (since the characteristics change depending
>> on
>> > >> > >>>>>>>>> which plugins are use).
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> The issues with pluggability are most visible in the
>> System
>> > >> APIs.
>> > >> > >>>>> What
>> > >> > >>>>>>>>> Samza really requires to be functional is Kafka as its
>> > >> > >>>>>>>>> transport
>> > >> > >>>>>> layer.
>> > >> > >>>>>>>>> But
>> > >> > >>>>>>>>> we've conflated two unrelated use cases into one API:
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> 1. Get data into/out of Kafka.
>> > >> > >>>>>>>>> 2. Process the data in Kafka.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> The current System API supports both of these use cases.
>> The
>> > >> > >>>>>>>>> problem
>> > >> > >>>>>> is,
>> > >> > >>>>>>>>> we
>> > >> > >>>>>>>>> actually want different features for each use case. By
>> > >> papering
>> > >> > >>>>>>>>> over
>> > >> > >>>>>>> these
>> > >> > >>>>>>>>> two use cases, and providing a single API, we've
>> introduced
>> > a
>> > >> > >>>>>>>>> ton of
>> > >> > >>>>>>> leaky
>> > >> > >>>>>>>>> abstractions.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> For example, what we'd really like in (2) is to have
>> > >> > >>>>>>>>> monotonically increasing longs for offsets (like Kafka).
>> > This
>> > >> > >>>>>>>>> would be at odds
>> > >> > >>>>> with
>> > >> > >>>>>>> (1),
>> > >> > >>>>>>>>> though, since different systems have different
>> > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
>> > >> > >>>>>>>>> There was discussion both on the mailing list and the SQL
>> > >> JIRAs
>> > >> > >>>>> about
>> > >> > >>>>>>> the
>> > >> > >>>>>>>>> need for this.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> The same thing holds true for replayability. Kafka allows
>> us
>> > >> to
>> > >> > >>>>> rewind
>> > >> > >>>>>>>>> when
>> > >> > >>>>>>>>> we have a failure. Many other systems don't. In some
>> cases,
>> > >> > >>>>>>>>> systems
>> > >> > >>>>>>> return
>> > >> > >>>>>>>>> null for their offsets (e.g. WikipediaSystemConsumer)
>> > because
>> > >> > >>>>>>>>> they
>> > >> > >>>>>> have
>> > >> > >>>>>>> no
>> > >> > >>>>>>>>> offsets.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Partitioning is another example. Kafka supports
>> > partitioning,
>> > >> > >>>>>>>>> but
>> > >> > >>>>> many
>> > >> > >>>>>>>>> systems don't. We model this by having a single partition
>> > for
>> > >> > >>>>>>>>> those systems. Still, other systems model partitioning
>> > >> > >>>> differently (e.g.
>> > >> > >>>>>>>>> Kinesis).
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> The SystemAdmin interface is also a mess. Creating streams
>> > in
>> > >> a
>> > >> > >>>>>>>>> system-agnostic way is almost impossible. As is modeling
>> > >> > >>>>>>>>> metadata
>> > >> > >>>>> for
>> > >> > >>>>>>> the
>> > >> > >>>>>>>>> system (replication factor, partitions, location, etc).
>> The
>> > >> > >>>>>>>>> list
>> > >> > >>>>> goes
>> > >> > >>>>>>> on.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Duplicate work
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> At the time that we began writing Samza, Kafka's consumer
>> > and
>> > >> > >>>>> producer
>> > >> > >>>>>>>>> APIs
>> > >> > >>>>>>>>> had a relatively weak feature set. On the consumer-side,
>> you
>> > >> > >>>>>>>>> had two
>> > >> > >>>>>>>>> options: use the high level consumer, or the simple
>> > consumer.
>> > >> > >>>>>>>>> The
>> > >> > >>>>>>> problem
>> > >> > >>>>>>>>> with the high-level consumer was that it controlled your
>> > >> > >>>>>>>>> offsets, partition assignments, and the order in which you
>> > >> > >>>>>>>>> received messages. The
>> > >> > >>>>> problem
>> > >> > >>>>>>>>> with
>> > >> > >>>>>>>>> the simple consumer is that it's not simple. It's basic.
>> You
>> > >> > >>>>>>>>> end up
>> > >> > >>>>>>> having
>> > >> > >>>>>>>>> to handle a lot of really low-level stuff that you
>> > shouldn't.
>> > >> > >>>>>>>>> We
>> > >> > >>>>>> spent a
>> > >> > >>>>>>>>> lot of time to make Samza's KafkaSystemConsumer very
>> robust.
>> > >> It
>> > >> > >>>>>>>>> also allows us to support some cool features:
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> * Per-partition message ordering and prioritization.
>> > >> > >>>>>>>>> * Tight control over partition assignment to support
>> joins,
>> > >> > >>>>>>>>> global
>> > >> > >>>>>> state
>> > >> > >>>>>>>>> (if we want to implement it :)), etc.
>> > >> > >>>>>>>>> * Tight control over offset checkpointing.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> What we didn't realize at the time is that these features
>> > >> > >>>>>>>>> should
>> > >> > >>>>>>> actually
>> > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers (not just Samza
>> stream
>> > >> > >>>>>> processors)
>> > >> > >>>>>>>>> end up wanting to do things like joins and partition
>> > >> > >>>>>>>>> assignment. The
>> > >> > >>>>>>> Kafka
>> > >> > >>>>>>>>> community has come to the same conclusion. They're adding
>> a
>> > >> ton
>> > >> > >>>>>>>>> of upgrades into their new Kafka consumer implementation.
>> > To a
>> > >> > >>>>>>>>> large extent,
>> > >> > >>>>> it's
>> > >> > >>>>>>>>> duplicate work to what we've already done in Samza.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> On top of this, Kafka ended up taking a very similar
>> > approach
>> > >> > >>>>>>>>> to
>> > >> > >>>>>> Samza's
>> > >> > >>>>>>>>> KafkaCheckpointManager implementation for handling offset
>> > >> > >>>>>> checkpointing.
>> > >> > >>>>>>>>> Like Samza, Kafka's new offset management feature stores
>> > >> offset
>> > >> > >>>>>>>>> checkpoints in a topic, and allows you to fetch them from
>> > the
>> > >> > >>>>>>>>> broker.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> A lot of this seems like a waste, since we could have
>> shared
>> > >> > >>>>>>>>> the
>> > >> > >>>>> work
>> > >> > >>>>>> if
>> > >> > >>>>>>>>> it
>> > >> > >>>>>>>>> had been done in Kafka from the get-go.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Vision
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> All of this leads me to a rather radical proposal. Samza
>> is
>> > >> > >>>>> relatively
>> > >> > >>>>>>>>> stable at this point. I'd venture to say that we're near a
>> > 1.0
>> > >> > >>>>>> release.
>> > >> > >>>>>>>>> I'd
>> > >> > >>>>>>>>> like to propose that we take what we've learned, and begin
>> > >> > >>>>>>>>> thinking
>> > >> > >>>>>>> about
>> > >> > >>>>>>>>> Samza beyond 1.0. What would we change if we were starting
>> > >> from
>> > >> > >>>>>> scratch?
>> > >> > >>>>>>>>> My
>> > >> > >>>>>>>>> proposal is to:
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run Samza
>> > >> > >>>>>>>>> processors, and eliminate all direct dependences on YARN,
>> > >> Mesos,
>> > >> > >>>> etc.
>> > >> > >>>>>>>>> 2. Make a definitive call to support only Kafka as the
>> > stream
>> > >> > >>>>>> processing
>> > >> > >>>>>>>>> layer.
>> > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, serialization, and
>> > >> > >>>>>>>>> config
>> > >> > >>>>>>> systems,
>> > >> > >>>>>>>>> and simply use Kafka's instead.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> This would fix all of the issues that I outlined above. It
>> > >> > >>>>>>>>> should
>> > >> > >>>>> also
>> > >> > >>>>>>>>> shrink the Samza code base pretty dramatically. Supporting
>> > >> only
>> > >> > >>>>>>>>> a standalone container will allow Samza to be executed on
>> > YARN
>> > >> > >>>>>>>>> (using Slider), Mesos (using Marathon/Aurora), or most
>> other
>> > >> > >>>>>>>>> in-house
>> > >> > >>>>>>> deployment
>> > >> > >>>>>>>>> systems. This should make life a lot easier for new users.
>> > >> > >>>>>>>>> Imagine
>> > >> > >>>>>>> having
>> > >> > >>>>>>>>> the hello-samza tutorial without YARN. The drop in mailing
>> > >> list
>> > >> > >>>>>> traffic
>> > >> > >>>>>>>>> will be pretty dramatic.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Coupling with Kafka seems long overdue to me. The reality
>> > is,
>> > >> > >>>>> everyone
>> > >> > >>>>>>>>> that
>> > >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We basically
>> require
>> > >> it
>> > >> > >>>>>> already
>> > >> > >>>>>>> in
>> > >> > >>>>>>>>> order for most features to work. Those that are using
>> other
>> > >> > >>>>>>>>> systems
>> > >> > >>>>>> are
>> > >> > >>>>>>>>> generally using it for ingest into Kafka (1), and then
>> they
>> > do
>> > >> > >>>>>>>>> the processing on top. There is already discussion (
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>
>> > >> > >>>>>>
>> > >> > >>>>>
>> > >> >
>> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
>> > >> > >>>>> 767
>> > >> > >>>>>>>>> )
>> > >> > >>>>>>>>> in Kafka to make ingesting into Kafka extremely easy.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Once we make the call to couple with Kafka, we can
>> leverage
>> > a
>> > >> > >>>>>>>>> ton of
>> > >> > >>>>>>> their
>> > >> > >>>>>>>>> ecosystem. We no longer have to maintain our own config,
>> > >> > >>>>>>>>> metrics,
>> > >> > >>>>> etc.
>> > >> > >>>>>>> We
>> > >> > >>>>>>>>> can all share the same libraries, and make them better.
>> This
>> > >> > >>>>>>>>> will
>> > >> > >>>>> also
>> > >> > >>>>>>>>> allow us to share the consumer/producer APIs, and will let
>> > us
>> > >> > >>>>> leverage
>> > >> > >>>>>>>>> their offset management and partition management, rather
>> > than
>> > >> > >>>>>>>>> having
>> > >> > >>>>>> our
>> > >> > >>>>>>>>> own. All of the coordinator stream code would go away, as
>> > >> would
>> > >> > >>>>>>>>> most
>> > >> > >>>>>> of
>> > >> > >>>>>>>>> the
>> > >> > >>>>>>>>> YARN AppMaster code. We'd probably have to push some
>> > partition
>> > >> > >>>>>>> management
>> > >> > >>>>>>>>> features into the Kafka broker, but they're already moving
>> > in
>> > >> > >>>>>>>>> that direction with the new consumer API. The features we
>> > have
>> > >> > >>>>>>>>> for
>> > >> > >>>>>> partition
>> > >> > >>>>>>>>> assignment aren't unique to Samza, and seem like they
>> should
>> > >> be
>> > >> > >>>>>>>>> in
>> > >> > >>>>>> Kafka
>> > >> > >>>>>>>>> anyway. There will always be some niche usages which will
>> > >> > >>>>>>>>> require
>> > >> > >>>>>> extra
>> > >> > >>>>>>>>> care and hence full control over partition assignments
>> much
>> > >> > >>>>>>>>> like the
>> > >> > >>>>>>> Kafka
>> > >> > >>>>>>>>> low level consumer api. These would continue to be
>> > supported.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> These items will be good for the Samza community. They'll
>> > make
>> > >> > >>>>>>>>> Samza easier to use, and make it easier for developers to
>> > add
>> > >> > >>>>>>>>> new features.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Obviously this is a fairly large (and somewhat backwards
>> > >> > >>>>> incompatible
>> > >> > >>>>>>>>> change). If we choose to go this route, it's important
>> that
>> > we
>> > >> > >>>>> openly
>> > >> > >>>>>>>>> communicate how we're going to provide a migration path
>> from
>> > >> > >>>>>>>>> the
>> > >> > >>>>>>> existing
>> > >> > >>>>>>>>> APIs to the new ones (if we make incompatible changes). I
>> > >> think
>> > >> > >>>>>>>>> at a minimum, we'd probably need to provide a wrapper to
>> > allow
>> > >> > >>>>>>>>> existing StreamTask implementations to continue running on
>> > the
>> > >> > >>>> new container.
>> > >> > >>>>>>> It's
>> > >> > >>>>>>>>> also important that we openly communicate about timing,
>> and
>> > >> > >>>>>>>>> stages
>> > >> > >>>>> of
>> > >> > >>>>>>> the
>> > >> > >>>>>>>>> migration.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> If you made it this far, I'm sure you have opinions. :)
>> > Please
>> > >> > >>>>>>>>> send
>> > >> > >>>>>> your
>> > >> > >>>>>>>>> thoughts and feedback.
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>> Cheers,
>> > >> > >>>>>>>>> Chris
>> > >> > >>>>>>>>>
>> > >> > >>>>>>>>
>> > >> > >>>>>>>>
>> > >> > >>>>>>>
>> > >> > >>>>>>
>> > >> > >>>>>>
>> > >> > >>>>>>
>> > >> > >>>>>> --
>> > >> > >>>>>> -- Guozhang
>> > >> > >>>>>>
>> > >> > >>>>>
>> > >> > >>>>
>> > >> > >>
>> > >> > >>
>> > >> >
>> > >> >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
