Re: Thoughts and obesrvations on Samza

Yan Fang Fri, 10 Jul 2015 14:40:16 -0700

Thanks, Jay. This argument persuaded me actually. :)

Fang, Yan
yanfang...@gmail.com


On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <j...@confluent.io> wrote:

> Hey Yan,
>
> Yeah philosophically I think the argument is that you should capture the
> stream in Kafka independent of the transformation. This is obviously a
> Kafka-centric view point.
>
> Advantages of this:
> - In practice I think this is what e.g. Storm people often end up doing
> anyway. You usually need to throttle any access to a live serving database.
> - Can have multiple subscribers and they get the same thing without
> additional load on the source system.
> - Applications can tap into the stream if need be by subscribing.
> - You can debug your transformation by tailing the Kafka topic with the
> console consumer
> - Can tee off the same data stream for batch analysis or Lambda arch style
> re-processing
>
> The disadvantage is that it will use Kafka resources. But the idea is
> eventually you will have multiple subscribers to any data source (at least
> for monitoring) so you will end up there soon enough anyway.
>
> Down the road the technical benefit is that I think it gives us a good path
> towards end-to-end exactly once semantics from source to destination.
> Basically the connectors need to support idempotence when talking to Kafka
> and we need the transactional write feature in Kafka to make the
> transformation atomic. This is actually pretty doable if you separate
> connector=>kafka problem from the generic transformations which are always
> kafka=>kafka. However I think it is quite impossible to do in a all_things
> => all_things environment. Today you can say "well the semantics of the
> Samza APIs depend on the connectors you use" but it is actually worse then
> that because the semantics actually depend on the pairing of connectors--so
> not only can you probably not get a usable "exactly once" guarantee
> end-to-end it can actually be quite hard to reverse engineer what property
> (if any) your end-to-end flow has if you have heterogenous systems.
>
> -Jay
>
> On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> > {quote}
> > maintained in a separate repository and retaining the existing
> > committership but sharing as much else as possible (website, etc)
> > {quote}
> >
> > Overall, I agree on this idea. Now the question is more about "how to do
> > it".
> >
> > On the other hand, one thing I want to point out is that, if we decide to
> > go this way, how do we want to support
> > otherSystem-transformation-otherSystem use case?
> >
> > Basically, there are four user groups here:
> >
> > 1. Kafka-transformation-Kafka
> > 2. Kafka-transformation-otherSystem
> > 3. otherSystem-transformation-Kafka
> > 4. otherSystem-transformation-otherSystem
> >
> > For group 1, they can easily use the new Samza library to achieve. For
> > group 2 and 3, they can use copyCat -> transformation -> Kafka or Kafka->
> > transformation -> copyCat.
> >
> > The problem is for group 4. Do we want to abandon this or still support
> it?
> > Of course, this use case can be achieved by using copyCat ->
> transformation
> > -> Kafka -> transformation -> copyCat, the thing is how we persuade them
> to
> > do this long chain. If yes, it will also be a win for Kafka too. Or if
> > there is no one in this community actually doing this so far, maybe ok to
> > not support the group 4 directly.
> >
> > Thanks,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> wrote:
> >
> > > Yeah I agree with this summary. I think there are kind of two questions
> > > here:
> > > 1. Technically does alignment/reliance on Kafka make sense
> > > 2. Branding wise (naming, website, concepts, etc) does alignment with
> > Kafka
> > > make sense
> > >
> > > Personally I do think both of these things would be really valuable,
> and
> > > would dramatically alter the trajectory of the project.
> > >
> > > My preference would be to see if people can mostly agree on a direction
> > > rather than splintering things off. From my point of view the ideal
> > outcome
> > > of all the options discussed would be to make Samza a closely aligned
> > > subproject, maintained in a separate repository and retaining the
> > existing
> > > committership but sharing as much else as possible (website, etc). No
> > idea
> > > about how these things work, Jacob, you probably know more.
> > >
> > > No discussion amongst the Kafka folks has happened on this, but likely
> we
> > > should figure out what the Samza community actually wants first.
> > >
> > > I admit that this is a fairly radical departure from how things are.
> > >
> > > If that doesn't fly, I think, yeah we could leave Samza as it is and do
> > the
> > > more radical reboot inside Kafka. From my point of view that does leave
> > > things in a somewhat confusing state since now there are two stream
> > > processing systems more or less coupled to Kafka in large part made by
> > the
> > > same people. But, arguably that might be a cleaner way to make the
> > cut-over
> > > and perhaps less risky for Samza community since if it works people can
> > > switch and if it doesn't nothing will have changed. Dunno, how do
> people
> > > feel about this?
> > >
> > > -Jay
> > >
> > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com>
> wrote:
> > >
> > > > >  This leads me to thinking that merging projects and communities
> > might
> > > > be a good idea: with the union of experience from both communities,
> we
> > > will
> > > > probably build a better system that is better for users.
> > > > Is this what's being proposed though? Merging the projects seems like
> > > > a consequence of at most one of the three directions under
> discussion:
> > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka for
> > > > configuration, etc. (to a greater or lesser extent to be determined)
> > > > but the Samza community would not automatically merge withe Kafka
> > > > community (the Phoenix/HBase example is a good one here).
> > > > 2) Samza Reboot: The Samza community continues to exist with a
> limited
> > > > project scope, but similarly would not need to be part of the Kafka
> > > > community (ie given committership) to progress.  Here, maybe the
> Samza
> > > > team would become a subproject of Kafka (the Board frowns on
> > > > subprojects at the moment, so I'm not sure if that's even feasible),
> > > > but that would not be required.
> > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka
> > > > team builds its own streaming library, possibly off of Jay's
> > > > prototype, which has not direct lineage to the Samza team.  There's
> no
> > > > reason for the Kafka team to bring in the Samza team.
> > > >
> > > > Is the Kafka community on board with this?
> > > >
> > > > To be clear, all three options under discussion are interesting,
> > > > technically valid and likely healthy directions for the project.
> > > > Also, they are not mutually exclusive.  The Samza community could
> > > > decide to pursue, say, 'Samza 2.0', while the Kafka community went
> > > > forward with 'Hey Samza!'  My points above are directed entirely at
> > > > the community aspect of these choices.
> > > > -Jakob
> > > >
> > > > On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com>
> wrote:
> > > > > That's great.  Thanks, Jay.
> > > > >
> > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io>
> wrote:
> > > > >
> > > > >> Yeah totally agree. I think you have this issue even today, right?
> > > I.e.
> > > > if
> > > > >> you need to make a simple config change and you're running in YARN
> > > today
> > > > >> you end up bouncing the job which then rebuilds state. I think the
> > fix
> > > > is
> > > > >> exactly what you described which is to have a long timeout on
> > > partition
> > > > >> movement for stateful jobs so that if a job is just getting
> bounced,
> > > and
> > > > >> the cluster manager (or admin) is smart enough to restart it on
> the
> > > same
> > > > >> host when possible, it can optimistically reuse any existing state
> > it
> > > > finds
> > > > >> on disk (if it is valid).
> > > > >>
> > > > >> So in this model the charter of the CM is to place processes as
> > > > stickily as
> > > > >> possible and to restart or re-place failed processes. The charter
> of
> > > the
> > > > >> partition management system is to control the assignment of work
> to
> > > > these
> > > > >> processes. The nice thing about this is that the work assignment,
> > > > timeouts,
> > > > >> behavior, configs, and code will all be the same across all
> cluster
> > > > >> managers.
> > > > >>
> > > > >> So I think that prototype would actually give you exactly what you
> > > want
> > > > >> today for any cluster manager (or manual placement + restart
> script)
> > > > that
> > > > >> was sticky in terms of host placement since there is already a
> > > > configurable
> > > > >> partition movement timeout and task-by-task state reuse with a
> check
> > > on
> > > > >> state validity.
> > > > >>
> > > > >> -Jay
> > > > >>
> > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <
> > roger.hoo...@gmail.com
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> > That would be great to let Kafka do as much heavy lifting as
> > > possible
> > > > and
> > > > >> > make it easier for other languages to implement Samza apis.
> > > > >> >
> > > > >> > One thing to watch out for is the interplay between Kafka's
> group
> > > > >> > management and the external scheduler/process manager's fault
> > > > tolerance.
> > > > >> > If a container dies, the Kafka group membership protocol will
> try
> > to
> > > > >> assign
> > > > >> > it's tasks to other containers while at the same time the
> process
> > > > manager
> > > > >> > is trying to relaunch the container.  Without some consideration
> > for
> > > > this
> > > > >> > (like a configurable amount of time to wait before Kafka alters
> > the
> > > > group
> > > > >> > membership), there may be thrashing going on which is especially
> > bad
> > > > for
> > > > >> > containers with large amounts of local state.
> > > > >> >
> > > > >> > Someone else pointed this out already but I thought it might be
> > > worth
> > > > >> > calling out again.
> > > > >> >
> > > > >> > Cheers,
> > > > >> >
> > > > >> > Roger
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io>
> > > wrote:
> > > > >> >
> > > > >> > > Hey Roger,
> > > > >> > >
> > > > >> > > I couldn't agree more. We spent a bunch of time talking to
> > people
> > > > and
> > > > >> > that
> > > > >> > > is exactly the stuff we heard time and again. What makes it
> > hard,
> > > of
> > > > >> > > course, is that there is some tension between compatibility
> with
> > > > what's
> > > > >> > > there now and making things better for new users.
> > > > >> > >
> > > > >> > > I also strongly agree with the importance of multi-language
> > > > support. We
> > > > >> > are
> > > > >> > > talking now about Java, but for application development use
> > cases
> > > > >> people
> > > > >> > > want to work in whatever language they are using elsewhere. I
> > > think
> > > > >> > moving
> > > > >> > > to a model where Kafka itself does the group membership,
> > lifecycle
> > > > >> > control,
> > > > >> > > and partition assignment has the advantage of putting all that
> > > > complex
> > > > >> > > stuff behind a clean api that the clients are already going to
> > be
> > > > >> > > implementing for their consumer, so the added functionality
> for
> > > > stream
> > > > >> > > processing beyond a consumer becomes very minor.
> > > > >> > >
> > > > >> > > -Jay
> > > > >> > >
> > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <
> > > > roger.hoo...@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Metamorphosis...nice. :)
> > > > >> > > >
> > > > >> > > > This has been a great discussion.  As a user of Samza who's
> > > > recently
> > > > >> > > > integrated it into a relatively large organization, I just
> > want
> > > to
> > > > >> add
> > > > >> > > > support to a few points already made.
> > > > >> > > >
> > > > >> > > > The biggest hurdles to adoption of Samza as it currently
> > exists
> > > > that
> > > > >> > I've
> > > > >> > > > experienced are:
> > > > >> > > > 1) YARN - YARN is overly complex in many environments where
> > > Puppet
> > > > >> > would
> > > > >> > > do
> > > > >> > > > just fine but it was the only mechanism to get fault
> > tolerance.
> > > > >> > > > 2) Configuration - I think I like the idea of configuring
> most
> > > of
> > > > the
> > > > >> > job
> > > > >> > > > in code rather than config files.  In general, I think the
> > goal
> > > > >> should
> > > > >> > be
> > > > >> > > > to make it harder to make mistakes, especially of the kind
> > where
> > > > the
> > > > >> > code
> > > > >> > > > expects something and the config doesn't match.  The current
> > > > config
> > > > >> is
> > > > >> > > > quite intricate and error-prone.  For example, the
> application
> > > > logic
> > > > >> > may
> > > > >> > > > depend on bootstrapping a topic but rather than asserting
> that
> > > in
> > > > the
> > > > >> > > code,
> > > > >> > > > you have to rely on getting the config right.  Likewise with
> > > > serdes,
> > > > >> > the
> > > > >> > > > Java representations produced by various serdes (JSON, Avro,
> > > etc.)
> > > > >> are
> > > > >> > > not
> > > > >> > > > equivalent so you cannot just reconfigure a serde without
> > > changing
> > > > >> the
> > > > >> > > > code.   It would be nice for jobs to be able to assert what
> > they
> > > > >> expect
> > > > >> > > > from their input topics in terms of partitioning.  This is
> > > > getting a
> > > > >> > > little
> > > > >> > > > off topic but I was even thinking about creating a "Samza
> > config
> > > > >> > linter"
> > > > >> > > > that would sanity check a set of configs.  Especially in
> > > > >> organizations
> > > > >> > > > where config is managed by a different team than the
> > application
> > > > >> > > developer,
> > > > >> > > > it's very hard to get avoid config mistakes.
> > > > >> > > > 3) Java/Scala centric - for many teams (especially
> DevOps-type
> > > > >> folks),
> > > > >> > > the
> > > > >> > > > pain of the Java toolchain (maven, slow builds, weak command
> > > line
> > > > >> > > support,
> > > > >> > > > configuration over convention) really inhibits productivity.
> > As
> > > > more
> > > > >> > and
> > > > >> > > > more high-quality clients become available for Kafka, I hope
> > > > they'll
> > > > >> > > follow
> > > > >> > > > Samza's model.  Not sure how much it affects the proposals
> in
> > > this
> > > > >> > thread
> > > > >> > > > but please consider other languages in the ecosystem as
> well.
> > > > From
> > > > >> > what
> > > > >> > > > I've heard, Spark has more Python users than Java/Scala.
> > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> > > > >> > > > and are working on a Yeoman generator
> > > > >> > > > https://github.com/Quantiply/generator-rico for
> Jython/Samza
> > > > >> projects
> > > > >> > to
> > > > >> > > > alleviate some of the pain)
> > > > >> > > >
> > > > >> > > > I also want to underscore Jay's point about improving the
> user
> > > > >> > > experience.
> > > > >> > > > That's a very important factor for adoption.  I think the
> goal
> > > > should
> > > > >> > be
> > > > >> > > to
> > > > >> > > > make Samza as easy to get started with as something like
> > > Logstash.
> > > > >> > > > Logstash is vastly inferior in terms of capabilities to
> Samza
> > > but
> > > > >> it's
> > > > >> > > easy
> > > > >> > > > to get started and that makes a big difference.
> > > > >> > > >
> > > > >> > > > Cheers,
> > > > >> > > >
> > > > >> > > > Roger
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci
> > Morales <
> > > > >> > > > g...@apache.org> wrote:
> > > > >> > > >
> > > > >> > > > > Forgot to add. On the naming issues, Kafka Metamorphosis
> is
> > a
> > > > clear
> > > > >> > > > winner
> > > > >> > > > > :)
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > > Gianmarco
> > > > >> > > > >
> > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <
> > > > >> > > g...@apache.org
> > > > >> > > > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi,
> > > > >> > > > > >
> > > > >> > > > > > @Martin, thanks for you comments.
> > > > >> > > > > > Maybe I'm missing some important point, but I think
> > coupling
> > > > the
> > > > >> > > > releases
> > > > >> > > > > > is actually a *good* thing.
> > > > >> > > > > > To make an example, would it be better if the MR and
> HDFS
> > > > >> > components
> > > > >> > > of
> > > > >> > > > > > Hadoop had different release schedules?
> > > > >> > > > > >
> > > > >> > > > > > Actually, keeping the discussion in a single place would
> > > make
> > > > >> > > agreeing
> > > > >> > > > on
> > > > >> > > > > > releases (and backwards compatibility) much easier, as
> > > > everybody
> > > > >> > > would
> > > > >> > > > be
> > > > >> > > > > > responsible for the whole codebase.
> > > > >> > > > > >
> > > > >> > > > > > That said, I like the idea of absorbing samza-core as a
> > > > >> > sub-project,
> > > > >> > > > and
> > > > >> > > > > > leave the fancy stuff separate.
> > > > >> > > > > > It probably gives 90% of the benefits we have been
> > > discussing
> > > > >> here.
> > > > >> > > > > >
> > > > >> > > > > > Cheers,
> > > > >> > > > > >
> > > > >> > > > > > --
> > > > >> > > > > > Gianmarco
> > > > >> > > > > >
> > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com
> >
> > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > >> Hey Martin,
> > > > >> > > > > >>
> > > > >> > > > > >> I agree coupling release schedules is a downside.
> > > > >> > > > > >>
> > > > >> > > > > >> Definitely we can try to solve some of the integration
> > > > problems
> > > > >> in
> > > > >> > > > > >> Confluent Platform or in other distributions. But I
> think
> > > > this
> > > > >> > ends
> > > > >> > > up
> > > > >> > > > > >> being really shallow. I guess I feel to really get a
> good
> > > > user
> > > > >> > > > > experience
> > > > >> > > > > >> the two systems have to kind of feel like part of the
> > same
> > > > thing
> > > > >> > and
> > > > >> > > > you
> > > > >> > > > > >> can't really add that in later--you can put both in the
> > > same
> > > > >> > > > > downloadable
> > > > >> > > > > >> tar file but it doesn't really give a very cohesive
> > > feeling.
> > > > I
> > > > >> > agree
> > > > >> > > > > that
> > > > >> > > > > >> ultimately any of the project stuff is as much social
> and
> > > > naming
> > > > >> > as
> > > > >> > > > > >> anything else--theoretically two totally independent
> > > projects
> > > > >> > could
> > > > >> > > > work
> > > > >> > > > > >> to
> > > > >> > > > > >> tightly align. In practice this seems to be quite
> > difficult
> > > > >> > though.
> > > > >> > > > > >>
> > > > >> > > > > >> For the frameworks--totally agree it would be good to
> > > > maintain
> > > > >> the
> > > > >> > > > > >> framework support with the project. In some cases there
> > may
> > > > not
> > > > >> be
> > > > >> > > too
> > > > >> > > > > >> much
> > > > >> > > > > >> there since the integration gets lighter but I think
> > > whatever
> > > > >> > stubs
> > > > >> > > > you
> > > > >> > > > > >> need should be included. So no I definitely wasn't
> trying
> > > to
> > > > >> imply
> > > > >> > > > > >> dropping
> > > > >> > > > > >> support for these frameworks, just making the
> integration
> > > > >> lighter
> > > > >> > by
> > > > >> > > > > >> separating process management from partition
> management.
> > > > >> > > > > >>
> > > > >> > > > > >> You raise two good points we would have to figure out
> if
> > we
> > > > went
> > > > >> > > down
> > > > >> > > > > the
> > > > >> > > > > >> alignment path:
> > > > >> > > > > >> 1. With respect to the name, yeah I think the first
> > > question
> > > > is
> > > > >> > > > whether
> > > > >> > > > > >> some "re-branding" would be worth it. If so then I
> think
> > we
> > > > can
> > > > >> > > have a
> > > > >> > > > > big
> > > > >> > > > > >> thread on the name. I'm definitely not set on Kafka
> > > > Streaming or
> > > > >> > > Kafka
> > > > >> > > > > >> Streams I was just using them to be kind of
> > illustrative. I
> > > > >> agree
> > > > >> > > with
> > > > >> > > > > >> your
> > > > >> > > > > >> critique of these names, though I think people would
> get
> > > the
> > > > >> idea.
> > > > >> > > > > >> 2. Yeah you also raise a good point about how to
> "factor"
> > > it.
> > > > >> Here
> > > > >> > > are
> > > > >> > > > > the
> > > > >> > > > > >> options I see (I could get enthusiastic about any of
> > them):
> > > > >> > > > > >>    a. One repo for both Kafka and Samza
> > > > >> > > > > >>    b. Two repos, retaining the current seperation
> > > > >> > > > > >>    c. Two repos, the equivalent of samza-api and
> > samza-core
> > > > is
> > > > >> > > > absorbed
> > > > >> > > > > >> almost like a third client
> > > > >> > > > > >>
> > > > >> > > > > >> Cheers,
> > > > >> > > > > >>
> > > > >> > > > > >> -Jay
> > > > >> > > > > >>
> > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <
> > > > >> > > > mar...@kleppmann.com>
> > > > >> > > > > >> wrote:
> > > > >> > > > > >>
> > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few
> follow-up
> > > > >> > comments.
> > > > >> > > > > >> >
> > > > >> > > > > >> > - I see the appeal of merging with Kafka or becoming
> a
> > > > >> > subproject:
> > > > >> > > > the
> > > > >> > > > > >> > reasons you mention are good. The risk I see is that
> > > > release
> > > > >> > > > schedules
> > > > >> > > > > >> > become coupled to each other, which can slow everyone
> > > down,
> > > > >> and
> > > > >> > > > large
> > > > >> > > > > >> > projects with many contributors are harder to manage.
> > > > (Jakob,
> > > > >> > can
> > > > >> > > > you
> > > > >> > > > > >> speak
> > > > >> > > > > >> > from experience, having seen a wider range of Hadoop
> > > > ecosystem
> > > > >> > > > > >> projects?)
> > > > >> > > > > >> >
> > > > >> > > > > >> > Some of the goals of a better unified developer
> > > experience
> > > > >> could
> > > > >> > > > also
> > > > >> > > > > be
> > > > >> > > > > >> > solved by integrating Samza nicely into a Kafka
> > > > distribution
> > > > >> > (such
> > > > >> > > > as
> > > > >> > > > > >> > Confluent's). I'm not against merging projects if we
> > > decide
> > > > >> > that's
> > > > >> > > > the
> > > > >> > > > > >> way
> > > > >> > > > > >> > to go, just pointing out the same goals can perhaps
> > also
> > > be
> > > > >> > > achieved
> > > > >> > > > > in
> > > > >> > > > > >> > other ways.
> > > > >> > > > > >> >
> > > > >> > > > > >> > - With regard to dropping the YARN dependency: are
> you
> > > > >> proposing
> > > > >> > > > that
> > > > >> > > > > >> > Samza doesn't give any help to people wanting to run
> on
> > > > >> > > > > >> YARN/Mesos/AWS/etc?
> > > > >> > > > > >> > So the docs would basically have a link to Slider and
> > > > nothing
> > > > >> > > else?
> > > > >> > > > Or
> > > > >> > > > > >> > would we maintain integrations with a bunch of
> popular
> > > > >> > deployment
> > > > >> > > > > >> methods
> > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to make
> > Samza
> > > > work
> > > > >> > with
> > > > >> > > > > >> Slider)?
> > > > >> > > > > >> >
> > > > >> > > > > >> > I absolutely think it's a good idea to have the "as a
> > > > library"
> > > > >> > and
> > > > >> > > > > "as a
> > > > >> > > > > >> > process" (using Yi's taxonomy) options for people who
> > > want
> > > > >> them,
> > > > >> > > > but I
> > > > >> > > > > >> > think there should also be a low-friction path for
> > common
> > > > "as
> > > > >> a
> > > > >> > > > > service"
> > > > >> > > > > >> > deployment methods, for which we probably need to
> > > maintain
> > > > >> > > > > integrations.
> > > > >> > > > > >> >
> > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to me,
> > > because
> > > > >> Kafka
> > > > >> > > is
> > > > >> > > > > all
> > > > >> > > > > >> > about streams already. Perhaps "Kafka Transformers"
> or
> > > > "Kafka
> > > > >> > > > Filters"
> > > > >> > > > > >> > would be more apt?
> > > > >> > > > > >> >
> > > > >> > > > > >> > One suggestion: perhaps the core of Samza (stream
> > > > >> transformation
> > > > >> > > > with
> > > > >> > > > > >> > state management -- i.e. the "Samza as a library"
> bit)
> > > > could
> > > > >> > > become
> > > > >> > > > > >> part of
> > > > >> > > > > >> > Kafka, while higher-level tools such as streaming SQL
> > and
> > > > >> > > > integrations
> > > > >> > > > > >> with
> > > > >> > > > > >> > deployment frameworks remain in a separate project?
> In
> > > > other
> > > > >> > > words,
> > > > >> > > > > >> Kafka
> > > > >> > > > > >> > would absorb the proven, stable core of Samza, which
> > > would
> > > > >> > become
> > > > >> > > > the
> > > > >> > > > > >> > "third Kafka client" mentioned early in this thread.
> > The
> > > > Samza
> > > > >> > > > project
> > > > >> > > > > >> > would then target that third Kafka client as its base
> > > API,
> > > > and
> > > > >> > the
> > > > >> > > > > >> project
> > > > >> > > > > >> > would be freed up to explore more experimental new
> > > > horizons.
> > > > >> > > > > >> >
> > > > >> > > > > >> > Martin
> > > > >> > > > > >> >
> > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
> > jay.kr...@gmail.com>
> > > > >> wrote:
> > > > >> > > > > >> >
> > > > >> > > > > >> > > Hey Martin,
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually don't
> > > think
> > > > it
> > > > >> > ties
> > > > >> > > > our
> > > > >> > > > > >> > hands
> > > > >> > > > > >> > > at all, all it does is refactor things. The
> division
> > of
> > > > >> > > > > >> responsibility is
> > > > >> > > > > >> > > that Samza core is responsible for task lifecycle,
> > > state,
> > > > >> and
> > > > >> > > > > >> partition
> > > > >> > > > > >> > > management (using the Kafka co-ordinator) but it is
> > NOT
> > > > >> > > > responsible
> > > > >> > > > > >> for
> > > > >> > > > > >> > > packaging, configuration deployment or execution of
> > > > >> processes.
> > > > >> > > The
> > > > >> > > > > >> > problem
> > > > >> > > > > >> > > of packaging and starting these processes is
> > > > >> > > > > >> > > framework/environment-specific. This leaves
> > individual
> > > > >> > > frameworks
> > > > >> > > > to
> > > > >> > > > > >> be
> > > > >> > > > > >> > as
> > > > >> > > > > >> > > fancy or vanilla as they like. So you can get
> simple
> > > > >> stateless
> > > > >> > > > > >> support in
> > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app
> > > framework
> > > > >> > > (Slider,
> > > > >> > > > > >> > Marathon,
> > > > >> > > > > >> > > etc). These are well known by people and have nice
> > UIs
> > > > and a
> > > > >> > lot
> > > > >> > > > of
> > > > >> > > > > >> > > flexibility. I don't think they have node affinity
> > as a
> > > > >> built
> > > > >> > in
> > > > >> > > > > >> option
> > > > >> > > > > >> > > (though I could be wrong). So if we want that we
> can
> > > > either
> > > > >> > wait
> > > > >> > > > for
> > > > >> > > > > >> them
> > > > >> > > > > >> > > to add it or do a custom framework to add that
> > feature
> > > > (as
> > > > >> > now).
> > > > >> > > > > >> > Obviously
> > > > >> > > > > >> > > if you manage things with old-school ops tools
> > > > >> > (puppet/chef/etc)
> > > > >> > > > you
> > > > >> > > > > >> get
> > > > >> > > > > >> > > locality easily. The nice thing, though, is that
> all
> > > the
> > > > >> samza
> > > > >> > > > > >> "business
> > > > >> > > > > >> > > logic" around partition management and fault
> > tolerance
> > > > is in
> > > > >> > > Samza
> > > > >> > > > > >> core
> > > > >> > > > > >> > so
> > > > >> > > > > >> > > it is shared across frameworks and the framework
> > > specific
> > > > >> bit
> > > > >> > is
> > > > >> > > > > just
> > > > >> > > > > >> > > whether it is smart enough to try to get the same
> > host
> > > > when
> > > > >> a
> > > > >> > > job
> > > > >> > > > is
> > > > >> > > > > >> > > restarted.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I think
> the
> > > > goal
> > > > >> > would
> > > > >> > > > be
> > > > >> > > > > >> (a)
> > > > >> > > > > >> > > actually get better alignment in user experience,
> and
> > > (b)
> > > > >> > > express
> > > > >> > > > > >> this in
> > > > >> > > > > >> > > the naming and project branding. Specifically:
> > > > >> > > > > >> > > 1. Website/docs, it would be nice for the
> > > > "transformation"
> > > > >> api
> > > > >> > > to
> > > > >> > > > be
> > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be able
> to
> > > > explain
> > > > >> > > when
> > > > >> > > > to
> > > > >> > > > > >> use
> > > > >> > > > > >> > > the consumer and when to use the stream processing
> > > > >> > functionality
> > > > >> > > > and
> > > > >> > > > > >> lead
> > > > >> > > > > >> > > people into that experience.
> > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or
> > > > whatever)
> > > > >> > that
> > > > >> > > > has
> > > > >> > > > > >> both
> > > > >> > > > > >> > > Kafka and the stream processing part and they
> > actually
> > > > work
> > > > >> > > > > together.
> > > > >> > > > > >> > > 3. Unify the programming experience so the client
> and
> > > > Samza
> > > > >> > api
> > > > >> > > > > share
> > > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > I think sub-projects keep separate committers and
> can
> > > > have a
> > > > >> > > > > separate
> > > > >> > > > > >> > repo,
> > > > >> > > > > >> > > but I'm actually not really sure (I can't find a
> > > > definition
> > > > >> > of a
> > > > >> > > > > >> > subproject
> > > > >> > > > > >> > > in Apache).
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Basically at a high-level you want the experience
> to
> > > > "feel"
> > > > >> > > like a
> > > > >> > > > > >> single
> > > > >> > > > > >> > > system, not to relatively independent things that
> are
> > > > kind
> > > > >> of
> > > > >> > > > > >> awkwardly
> > > > >> > > > > >> > > glued together.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > I think if we did that they having naming or
> branding
> > > > like
> > > > >> > > "kafka
> > > > >> > > > > >> > > streaming" or "kafka streams" or something like
> that
> > > > would
> > > > >> > > > actually
> > > > >> > > > > >> do a
> > > > >> > > > > >> > > good job of conveying what it is. I do that this
> > would
> > > > help
> > > > >> > > > adoption
> > > > >> > > > > >> > quite
> > > > >> > > > > >> > > a lot as it would correctly convey that using Kafka
> > > > >> Streaming
> > > > >> > > with
> > > > >> > > > > >> Kafka
> > > > >> > > > > >> > is
> > > > >> > > > > >> > > a fairly seamless experience and Kafka is pretty
> > > heavily
> > > > >> > adopted
> > > > >> > > > at
> > > > >> > > > > >> this
> > > > >> > > > > >> > > point.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Fwiw we actually considered this model originally
> > when
> > > > open
> > > > >> > > > sourcing
> > > > >> > > > > >> > Samza,
> > > > >> > > > > >> > > however at that time Kafka was relatively unknown
> and
> > > we
> > > > >> > decided
> > > > >> > > > not
> > > > >> > > > > >> to
> > > > >> > > > > >> > do
> > > > >> > > > > >> > > it since we felt it would be limiting. From my
> point
> > of
> > > > view
> > > > >> > the
> > > > >> > > > > three
> > > > >> > > > > >> > > things have changed (1) Kafka is now really heavily
> > > used
> > > > for
> > > > >> > > > stream
> > > > >> > > > > >> > > processing, (2) we learned that abstracting out the
> > > > stream
> > > > >> > well
> > > > >> > > is
> > > > >> > > > > >> > > basically impossible, (3) we learned it is really
> > hard
> > > to
> > > > >> keep
> > > > >> > > the
> > > > >> > > > > two
> > > > >> > > > > >> > > things feeling like a single product.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > -Jay
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <
> > > > >> > > > > >> mar...@kleppmann.com>
> > > > >> > > > > >> > > wrote:
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >> Hi all,
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Lots of good thoughts here.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> I agree with the general philosophy of tying Samza
> > > more
> > > > >> > firmly
> > > > >> > > to
> > > > >> > > > > >> Kafka.
> > > > >> > > > > >> > >> After I spent a while looking at integrating other
> > > > message
> > > > >> > > > brokers
> > > > >> > > > > >> (e.g.
> > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the
> > conclusion
> > > > that
> > > > >> > > > > >> > SystemConsumer
> > > > >> > > > > >> > >> tacitly assumes a model so much like Kafka's that
> > > pretty
> > > > >> much
> > > > >> > > > > nobody
> > > > >> > > > > >> but
> > > > >> > > > > >> > >> Kafka actually implements it. (Databus is perhaps
> an
> > > > >> > exception,
> > > > >> > > > but
> > > > >> > > > > >> it
> > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.) Thus,
> making
> > > > Samza
> > > > >> > > fully
> > > > >> > > > > >> > dependent
> > > > >> > > > > >> > >> on Kafka acknowledges that the system-independence
> > was
> > > > >> never
> > > > >> > as
> > > > >> > > > > real
> > > > >> > > > > >> as
> > > > >> > > > > >> > we
> > > > >> > > > > >> > >> perhaps made it out to be. The gains of code reuse
> > are
> > > > >> real.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has also
> > always
> > > > been
> > > > >> > > > > >> appealing to
> > > > >> > > > > >> > >> me, for various reasons already mentioned in this
> > > > thread.
> > > > >> > > > Although
> > > > >> > > > > >> > making
> > > > >> > > > > >> > >> Samza jobs deployable on anything
> > (YARN/Mesos/AWS/etc)
> > > > >> seems
> > > > >> > > > > >> laudable,
> > > > >> > > > > >> > I am
> > > > >> > > > > >> > >> a little concerned that it will restrict us to a
> > > lowest
> > > > >> > common
> > > > >> > > > > >> > denominator.
> > > > >> > > > > >> > >> For example, would host affinity (SAMZA-617) still
> > be
> > > > >> > possible?
> > > > >> > > > For
> > > > >> > > > > >> jobs
> > > > >> > > > > >> > >> with large amounts of state, I think SAMZA-617
> would
> > > be
> > > > a
> > > > >> big
> > > > >> > > > boon,
> > > > >> > > > > >> > since
> > > > >> > > > > >> > >> restoring state off the changelog on every single
> > > > restart
> > > > >> is
> > > > >> > > > > painful,
> > > > >> > > > > >> > due
> > > > >> > > > > >> > >> to long recovery times. It would be a shame if the
> > > > >> decoupling
> > > > >> > > > from
> > > > >> > > > > >> YARN
> > > > >> > > > > >> > >> made host affinity impossible.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Jay, a question about the proposed API for
> > > > instantiating a
> > > > >> > job
> > > > >> > > in
> > > > >> > > > > >> code
> > > > >> > > > > >> > >> (rather than a properties file): when submitting a
> > job
> > > > to a
> > > > >> > > > > cluster,
> > > > >> > > > > >> is
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> idea that the instantiation code runs on a client
> > > > >> somewhere,
> > > > >> > > > which
> > > > >> > > > > >> then
> > > > >> > > > > >> > >> pokes the necessary endpoints on
> YARN/Mesos/AWS/etc?
> > > Or
> > > > >> does
> > > > >> > > that
> > > > >> > > > > >> code
> > > > >> > > > > >> > run
> > > > >> > > > > >> > >> on each container that is part of the job (in
> which
> > > > case,
> > > > >> how
> > > > >> > > > does
> > > > >> > > > > >> the
> > > > >> > > > > >> > job
> > > > >> > > > > >> > >> submission to the cluster work)?
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> I agree with Garry that it doesn't feel right to
> > make
> > > a
> > > > 1.0
> > > > >> > > > release
> > > > >> > > > > >> > with a
> > > > >> > > > > >> > >> plan for it to be immediately obsolete. So if this
> > is
> > > > going
> > > > >> > to
> > > > >> > > > > >> happen, I
> > > > >> > > > > >> > >> think it would be more honest to stick with 0.*
> > > version
> > > > >> > numbers
> > > > >> > > > > until
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >> library-ified Samza has been implemented, is
> stable
> > > and
> > > > >> > widely
> > > > >> > > > > used.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Should the new Samza be a subproject of Kafka?
> There
> > > is
> > > > >> > > precedent
> > > > >> > > > > for
> > > > >> > > > > >> > >> tight coupling between different Apache projects
> > (e.g.
> > > > >> > Curator
> > > > >> > > > and
> > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think
> remaining
> > > > >> separate
> > > > >> > > > would
> > > > >> > > > > >> be
> > > > >> > > > > >> > ok.
> > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka, there
> is
> > > > enough
> > > > >> > > > > substance
> > > > >> > > > > >> in
> > > > >> > > > > >> > >> Samza that it warrants being a separate project.
> An
> > > > >> argument
> > > > >> > in
> > > > >> > > > > >> favour
> > > > >> > > > > >> > of
> > > > >> > > > > >> > >> merging would be if we think Kafka has a much
> > stronger
> > > > >> "brand
> > > > >> > > > > >> presence"
> > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If the
> Kafka
> > > > >> project
> > > > >> > is
> > > > >> > > > > >> willing
> > > > >> > > > > >> > to
> > > > >> > > > > >> > >> endorse Samza as the "official" way of doing
> > stateful
> > > > >> stream
> > > > >> > > > > >> > >> transformations, that would probably have much the
> > > same
> > > > >> > effect
> > > > >> > > as
> > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream Processors" or
> > > > suchlike.
> > > > >> > > Close
> > > > >> > > > > >> > >> collaboration between the two projects will be
> > needed
> > > in
> > > > >> any
> > > > >> > > > case.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> From a project management perspective, I guess the
> > > "new
> > > > >> > Samza"
> > > > >> > > > > would
> > > > >> > > > > >> > have
> > > > >> > > > > >> > >> to be developed on a branch alongside ongoing
> > > > maintenance
> > > > >> of
> > > > >> > > the
> > > > >> > > > > >> current
> > > > >> > > > > >> > >> line of development? I think it would be important
> > to
> > > > >> > continue
> > > > >> > > > > >> > supporting
> > > > >> > > > > >> > >> existing users, and provide a graceful migration
> > path
> > > to
> > > > >> the
> > > > >> > > new
> > > > >> > > > > >> > version.
> > > > >> > > > > >> > >> Leaving the current versions unsupported and
> forcing
> > > > people
> > > > >> > to
> > > > >> > > > > >> rewrite
> > > > >> > > > > >> > >> their jobs would send a bad signal.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Best,
> > > > >> > > > > >> > >> Martin
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <
> > j...@confluent.io>
> > > > >> wrote:
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>> Hey Garry,
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy to
> chat
> > > > more
> > > > >> > about
> > > > >> > > > > this
> > > > >> > > > > >> if
> > > > >> > > > > >> > >>> you'd be interested. I think Chris and I started
> > with
> > > > the
> > > > >> > idea
> > > > >> > > > of
> > > > >> > > > > >> "what
> > > > >> > > > > >> > >>> would it take to make Samza a kick-ass ingestion
> > > tool"
> > > > but
> > > > >> > > > > >> ultimately
> > > > >> > > > > >> > we
> > > > >> > > > > >> > >>> kind of came around to the idea that ingestion
> and
> > > > >> > > > transformation
> > > > >> > > > > >> had
> > > > >> > > > > >> > >>> pretty different needs and coupling the two made
> > > things
> > > > >> > hard.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> For what it's worth I think copycat (KIP-26)
> > actually
> > > > will
> > > > >> > do
> > > > >> > > > what
> > > > >> > > > > >> you
> > > > >> > > > > >> > >> are
> > > > >> > > > > >> > >>> looking for.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> With regard to your point about slider, I don't
> > > > >> necessarily
> > > > >> > > > > >> disagree.
> > > > >> > > > > >> > >> But I
> > > > >> > > > > >> > >>> think getting good YARN support is quite doable
> > and I
> > > > >> think
> > > > >> > we
> > > > >> > > > can
> > > > >> > > > > >> make
> > > > >> > > > > >> > >>> that work well. I think the issue this proposal
> > > solves
> > > > is
> > > > >> > that
> > > > >> > > > > >> > >> technically
> > > > >> > > > > >> > >>> it is pretty hard to support multiple cluster
> > > > management
> > > > >> > > systems
> > > > >> > > > > the
> > > > >> > > > > >> > way
> > > > >> > > > > >> > >>> things are now, you need to write an "app master"
> > or
> > > > >> > > "framework"
> > > > >> > > > > for
> > > > >> > > > > >> > each
> > > > >> > > > > >> > >>> and they are all a little different so testing is
> > > > really
> > > > >> > hard.
> > > > >> > > > In
> > > > >> > > > > >> the
> > > > >> > > > > >> > >>> absence of this we have been stuck with just YARN
> > > which
> > > > >> has
> > > > >> > > > > >> fantastic
> > > > >> > > > > >> > >>> penetration in the Hadoopy part of the org, but
> > zero
> > > > >> > > penetration
> > > > >> > > > > >> > >> elsewhere.
> > > > >> > > > > >> > >>> Given the huge amount of work being put in to
> > slider,
> > > > >> > > marathon,
> > > > >> > > > > aws
> > > > >> > > > > >> > >>> tooling, not to mention the umpteen related
> > packaging
> > > > >> > > > technologies
> > > > >> > > > > >> > people
> > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various
> > > cloud-specific
> > > > >> > deploy
> > > > >> > > > > >> tools,
> > > > >> > > > > >> > >> etc)
> > > > >> > > > > >> > >>> I really think it is important to get this right.
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> -Jay
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington
> <
> > > > >> > > > > >> > >>> g.turking...@improvedigital.com> wrote:
> > > > >> > > > > >> > >>>
> > > > >> > > > > >> > >>>> Hi all,
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> I think the question below re does Samza become
> a
> > > > >> > sub-project
> > > > >> > > > of
> > > > >> > > > > >> Kafka
> > > > >> > > > > >> > >>>> highlights the broader point around migration.
> > Chris
> > > > >> > mentions
> > > > >> > > > > >> Samza's
> > > > >> > > > > >> > >>>> maturity is heading towards a v1 release but I'm
> > not
> > > > sure
> > > > >> > it
> > > > >> > > > > feels
> > > > >> > > > > >> > >> right to
> > > > >> > > > > >> > >>>> launch a v1 then immediately plan to deprecate
> > most
> > > of
> > > > >> it.
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> From a selfish perspective I have some guys who
> > have
> > > > >> > started
> > > > >> > > > > >> working
> > > > >> > > > > >> > >> with
> > > > >> > > > > >> > >>>> Samza and building some new consumers/producers
> > was
> > > > next
> > > > >> > up.
> > > > >> > > > > Sounds
> > > > >> > > > > >> > like
> > > > >> > > > > >> > >>>> that is absolutely not the direction to go. I
> need
> > > to
> > > > >> look
> > > > >> > > into
> > > > >> > > > > the
> > > > >> > > > > >> > KIP
> > > > >> > > > > >> > >> in
> > > > >> > > > > >> > >>>> more detail but for me the attractiveness of
> > adding
> > > > new
> > > > >> > Samza
> > > > >> > > > > >> > >>>> consumer/producers -- even if yes all they were
> > > doing
> > > > was
> > > > >> > > > really
> > > > >> > > > > >> > getting
> > > > >> > > > > >> > >>>> data into and out of Kafka --  was to avoid
> > having
> > > to
> > > > >> > worry
> > > > >> > > > > about
> > > > >> > > > > >> the
> > > > >> > > > > >> > >>>> lifecycle management of external clients. If
> there
> > > is
> > > > a
> > > > >> > > generic
> > > > >> > > > > >> Kafka
> > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a new
> > connector
> > > > into
> > > > >> > and
> > > > >> > > > > have
> > > > >> > > > > >> a
> > > > >> > > > > >> > >> lot of
> > > > >> > > > > >> > >>>> the heavy lifting re scale and reliability done
> > for
> > > me
> > > > >> then
> > > > >> > > it
> > > > >> > > > > >> gives
> > > > >> > > > > >> > me
> > > > >> > > > > >> > >> all
> > > > >> > > > > >> > >>>> the pushing new consumers/producers would. If
> not
> > > > then it
> > > > >> > > > > >> complicates
> > > > >> > > > > >> > my
> > > > >> > > > > >> > >>>> operational deployments.
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> Which is similar to my other question with the
> > > > proposal
> > > > >> --
> > > > >> > if
> > > > >> > > > we
> > > > >> > > > > >> > build a
> > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus the
> > requisite
> > > > >> shims
> > > > >> > to
> > > > >> > > > > >> > integrate
> > > > >> > > > > >> > >>>> with Slider etc I suspect the former may be a
> lot
> > > more
> > > > >> work
> > > > >> > > > than
> > > > >> > > > > we
> > > > >> > > > > >> > >> think.
> > > > >> > > > > >> > >>>> We may make it much easier for a newcomer to get
> > > > >> something
> > > > >> > > > > running
> > > > >> > > > > >> but
> > > > >> > > > > >> > >>>> having them step up and get a reliable
> production
> > > > >> > deployment
> > > > >> > > > may
> > > > >> > > > > >> still
> > > > >> > > > > >> > >>>> dominate mailing list  traffic, if for different
> > > > reasons
> > > > >> > than
> > > > >> > > > > >> today.
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with
> making
> > > the
> > > > >> Samza
> > > > >> > > > > >> dependency
> > > > >> > > > > >> > >> on
> > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely see
> the
> > > > >> benefits
> > > > >> > > in
> > > > >> > > > > the
> > > > >> > > > > >> > >>>> reduction of duplication and clashing
> > > > >> > > > terminologies/abstractions
> > > > >> > > > > >> that
> > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library would
> > likely
> > > > be a
> > > > >> > very
> > > > >> > > > > nice
> > > > >> > > > > >> > tool
> > > > >> > > > > >> > >> to
> > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the
> > concerns
> > > > >> above
> > > > >> > re
> > > > >> > > > the
> > > > >> > > > > >> > >>>> operational side.
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> Garry
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> -----Original Message-----
> > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales [mailto:
> > > > >> > g...@apache.org
> > > > >> > > ]
> > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
> > > > >> > > > > >> > >>>> To: dev@samza.apache.org
> > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on Samza
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> Very interesting thoughts.
> > > > >> > > > > >> > >>>> From outside, I have always perceived Samza as a
> > > > >> computing
> > > > >> > > > layer
> > > > >> > > > > >> over
> > > > >> > > > > >> > >>>> Kafka.
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is
> "should
> > > > Samza
> > > > >> be
> > > > >> > a
> > > > >> > > > > >> > sub-project
> > > > >> > > > > >> > >>>> of Kafka then?"
> > > > >> > > > > >> > >>>> Or does it make sense to keep it as a separate
> > > project
> > > > >> > with a
> > > > >> > > > > >> separate
> > > > >> > > > > >> > >>>> governance?
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> Cheers,
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> --
> > > > >> > > > > >> > >>>> Gianmarco
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <
> > > > yanfang...@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more
> > tightly.
> > > > >> > Because
> > > > >> > > > > Samza
> > > > >> > > > > >> de
> > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should leverage
> > > what
> > > > >> Kafka
> > > > >> > > > has.
> > > > >> > > > > At
> > > > >> > > > > >> > the
> > > > >> > > > > >> > >>>>> same time, Kafka does not need to reinvent what
> > > Samza
> > > > >> > > already
> > > > >> > > > > >> has. I
> > > > >> > > > > >> > >>>>> also like the idea of separating the ingestion
> > and
> > > > >> > > > > transformation.
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> But it is a little difficult for me to image
> how
> > > the
> > > > >> Samza
> > > > >> > > > will
> > > > >> > > > > >> look
> > > > >> > > > > >> > >>>> like.
> > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little
> difference
> > > in
> > > > >> terms
> > > > >> > > of
> > > > >> > > > > how
> > > > >> > > > > >> > >>>>> Samza should look like.
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code shows (A
> > > > client of
> > > > >> > > > Kakfa)
> > > > >> > > > > ?
> > > > >> > > > > >> And
> > > > >> > > > > >> > >>>>> user's application code calls this client?
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka (like
> > > what
> > > > the
> > > > >> > > code
> > > > >> > > > > >> shows),
> > > > >> > > > > >> > >>>>> how do we implement auto-balance and
> > > fault-tolerance?
> > > > >> Are
> > > > >> > > they
> > > > >> > > > > >> taken
> > > > >> > > > > >> > >>>>> care by the Kafka broker or other mechanism,
> such
> > > as
> > > > >> > "Samza
> > > > >> > > > > >> worker"
> > > > >> > > > > >> > >>>>> (just make up the name) ?
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> 2. What about other features, such as
> > auto-scaling,
> > > > >> shared
> > > > >> > > > > state,
> > > > >> > > > > >> > >>>>> monitoring?
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is this what
> > > Chris
> > > > >> > > > suggests?)
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> 1. we still need to ingest data from Kakfa and
> > > > produce
> > > > >> to
> > > > >> > > it.
> > > > >> > > > > >> Then it
> > > > >> > > > > >> > >>>>> becomes the same as what Samza looks like now,
> > > > except it
> > > > >> > > does
> > > > >> > > > > not
> > > > >> > > > > >> > rely
> > > > >> > > > > >> > >>>>> on Yarn anymore.
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it leverage
> > Kafka's
> > > > >> > metrics,
> > > > >> > > > > logs,
> > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> Thanks,
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> Fang, Yan
> > > > >> > > > > >> > >>>>> yanfang...@gmail.com
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <
> > > > >> > > > > wangg...@gmail.com
> > > > >> > > > > >> >
> > > > >> > > > > >> > >>>> wrote:
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>>> Read through the code example and it looks
> good
> > to
> > > > me.
> > > > >> A
> > > > >> > > few
> > > > >> > > > > >> > >>>>>> thoughts regarding deployment:
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> Today Samza deploys as executable runnable
> like:
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh
> --config-factory=...
> > > > >> > > > > >> > >>>> --config-path=file://...
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> And this proposal advocate for deploying Samza
> > > more
> > > > as
> > > > >> > > > embedded
> > > > >> > > > > >> > >>>>>> libraries in user application code (ignoring
> the
> > > > >> > > terminology
> > > > >> > > > > >> since
> > > > >> > > > > >> > >>>>>> it is not the
> > > > >> > > > > >> > >>>>> same
> > > > >> > > > > >> > >>>>>> as the prototype code):
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> StreamTask task = new MyStreamTask(configs);
> > > Thread
> > > > >> > thread
> > > > >> > > =
> > > > >> > > > > new
> > > > >> > > > > >> > >>>>>> Thread(task); thread.start();
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> I think both of these deployment modes are
> > > important
> > > > >> for
> > > > >> > > > > >> different
> > > > >> > > > > >> > >>>>>> types
> > > > >> > > > > >> > >>>>> of
> > > > >> > > > > >> > >>>>>> users. That said, I think making Samza purely
> > > > >> standalone
> > > > >> > is
> > > > >> > > > > still
> > > > >> > > > > >> > >>>>>> sufficient for either runnable or library
> modes.
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> Guozhang
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <
> > > > >> > > > j...@confluent.io>
> > > > >> > > > > >> > wrote:
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code example, it
> > was
> > > > >> > supposed
> > > > >> > > > to
> > > > >> > > > > >> look
> > > > >> > > > > >> > >>>>>>> like
> > > > >> > > > > >> > >>>>>>> this:
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>> Properties props = new Properties();
> > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers",
> > "localhost:4242");
> > > > >> > > > > >> StreamingConfig
> > > > >> > > > > >> > >>>>>>> config = new StreamingConfig(props);
> > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
> > "test-topic-2");
> > > > >> > > > > >> > >>>>>>>
> config.processor(ExampleStreamProcessor.class);
> > > > >> > > > > >> > >>>>>>> config.serialization(new StringSerializer(),
> > new
> > > > >> > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming
> > container =
> > > > new
> > > > >> > > > > >> > >>>>>>> KafkaStreaming(config); container.run();
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>> -Jay
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <
> > > > >> > > > j...@confluent.io
> > > > >> > > > > >
> > > > >> > > > > >> > >>>> wrote:
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>>> Hey guys,
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> This came out of some conversations Chris
> and
> > I
> > > > were
> > > > >> > > having
> > > > >> > > > > >> > >>>>>>>> around
> > > > >> > > > > >> > >>>>>>> whether
> > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a kind
> of
> > > data
> > > > >> > > > ingestion
> > > > >> > > > > >> > >>>>> framework
> > > > >> > > > > >> > >>>>>>> for
> > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26
> > > "copycat").
> > > > >> This
> > > > >> > > > kind
> > > > >> > > > > of
> > > > >> > > > > >> > >>>>>> combined
> > > > >> > > > > >> > >>>>>>>> with complaints around config and YARN and
> the
> > > > >> > discussion
> > > > >> > > > > >> around
> > > > >> > > > > >> > >>>>>>>> how
> > > > >> > > > > >> > >>>>> to
> > > > >> > > > > >> > >>>>>>>> best do a standalone mode.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given that
> > Samza
> > > > was
> > > > >> > > > basically
> > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what if you
> > just
> > > > >> > embraced
> > > > >> > > > > that
> > > > >> > > > > >> > >>>>>>>> and turned it
> > > > >> > > > > >> > >>>>>> into
> > > > >> > > > > >> > >>>>>>>> something less like a heavyweight framework
> > and
> > > > more
> > > > >> > > like a
> > > > >> > > > > >> > >>>>>>>> third
> > > > >> > > > > >> > >>>>> Kafka
> > > > >> > > > > >> > >>>>>>>> client--a kind of "producing consumer" with
> > > state
> > > > >> > > > management
> > > > >> > > > > >> > >>>>>> facilities.
> > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a complex
> > stream
> > > > >> > > processing
> > > > >> > > > > >> > >>>>>>>> framework
> > > > >> > > > > >> > >>>>>>> this
> > > > >> > > > > >> > >>>>>>>> would actually be a very simple thing, not
> > much
> > > > more
> > > > >> > > > > >> complicated
> > > > >> > > > > >> > >>>>>>>> to
> > > > >> > > > > >> > >>>>> use
> > > > >> > > > > >> > >>>>>>> or
> > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris said
> > we
> > > > >> thought
> > > > >> > > > about
> > > > >> > > > > >> it
> > > > >> > > > > >> > >>>>>>>> a
> > > > >> > > > > >> > >>>>> lot
> > > > >> > > > > >> > >>>>>> of
> > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream processing
> > > > systems
> > > > >> > were
> > > > >> > > > > doing)
> > > > >> > > > > >> > >>>>> seemed
> > > > >> > > > > >> > >>>>>>> like
> > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output data to
> > and
> > > > from
> > > > >> > the
> > > > >> > > > > stream
> > > > >> > > > > >> > >>>>>>>> processing. But when we actually looked into
> > how
> > > > that
> > > > >> > > would
> > > > >> > > > > >> > >>>>>>>> work,
> > > > >> > > > > >> > >>>>> Samza
> > > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion
> framework
> > > > for a
> > > > >> > > bunch
> > > > >> > > > of
> > > > >> > > > > >> > >>>>> reasons.
> > > > >> > > > > >> > >>>>>> To
> > > > >> > > > > >> > >>>>>>>> really do that right you need a pretty
> > different
> > > > >> > internal
> > > > >> > > > > data
> > > > >> > > > > >> > >>>>>>>> model
> > > > >> > > > > >> > >>>>>> and
> > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split them and
> had
> > > an
> > > > api
> > > > >> > for
> > > > >> > > > > Kafka
> > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) and a
> > > separate
> > > > >> api
> > > > >> > > for
> > > > >> > > > > >> Kafka
> > > > >> > > > > >> > >>>>>>>> transformation (Samza).
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> This would also allow really embracing the
> > same
> > > > >> > > terminology
> > > > >> > > > > and
> > > > >> > > > > >> > >>>>>>>> conventions. One complaint about the current
> > > > state is
> > > > >> > > that
> > > > >> > > > > the
> > > > >> > > > > >> > >>>>>>>> two
> > > > >> > > > > >> > >>>>>>> systems
> > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology like
> > > "stream"
> > > > vs
> > > > >> > > > "topic"
> > > > >> > > > > >> and
> > > > >> > > > > >> > >>>>>>> different
> > > > >> > > > > >> > >>>>>>>> config and monitoring systems means you kind
> > of
> > > > have
> > > > >> to
> > > > >> > > > learn
> > > > >> > > > > >> > >>>>>>>> Kafka's
> > > > >> > > > > >> > >>>>>>> way,
> > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different way,
> > then
> > > > kind
> > > > >> of
> > > > >> > > > > >> > >>>>>>>> understand
> > > > >> > > > > >> > >>>>> how
> > > > >> > > > > >> > >>>>>>> they
> > > > >> > > > > >> > >>>>>>>> map to each other, which having walked a few
> > > > people
> > > > >> > > through
> > > > >> > > > > >> this
> > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to get.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of time on
> > > > >> airplanes I
> > > > >> > > > > hacked
> > > > >> > > > > >> > >>>>>>>> up an ernest but still somewhat incomplete
> > > > prototype
> > > > >> of
> > > > >> > > > what
> > > > >> > > > > >> > >>>>>>>> this would
> > > > >> > > > > >> > >>>>> look
> > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously dumped
> into
> > > > Kafka
> > > > >> as
> > > > >> > > it
> > > > >> > > > > >> > >>>>>>>> required a
> > > > >> > > > > >> > >>>>>> few
> > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is the
> code:
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> >
> > > > >> > > > >
> > > > >> >
> > > >
> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
> > > > >> > > > > >> > >>>>> /apache/kafka/clients/streaming
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I just
> > > liberally
> > > > >> > renamed
> > > > >> > > > > >> > >>>>>>>> everything
> > > > >> > > > > >> > >>>>> to
> > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no regard
> for
> > > > >> > > > compatibility.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> To use this would be something like this:
> > > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
> > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers",
> > > "localhost:4242");
> > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new
> > > > >> > > > > >> > >>>>> StreamingConfig(props);
> > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
> > > > >> > > > > >> > >>>>>>>> "test-topic-2");
> > > > >> > > > > >> config.processor(ExampleStreamProcessor.class);
> > > > >> > > > > >> > >>>>>>> config.serialization(new
> > > > >> > > > > >> > >>>>>>>> StringSerializer(), new
> StringDeserializer());
> > > > >> > > > KafkaStreaming
> > > > >> > > > > >> > >>>>>> container =
> > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config); container.run();
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
> > SamzaContainer;
> > > > >> > > > > StreamProcessor
> > > > >> > > > > >> > >>>>>>>> is basically StreamTask.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> So rather than putting all the class names
> in
> > a
> > > > file
> > > > >> > and
> > > > >> > > > then
> > > > >> > > > > >> > >>>>>>>> having
> > > > >> > > > > >> > >>>>>> the
> > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just
> > > instantiate
> > > > the
> > > > >> > > > > container
> > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced over
> > however
> > > > many
> > > > >> > > > > instances
> > > > >> > > > > >> > >>>>>>>> of
> > > > >> > > > > >> > >>>>> this
> > > > >> > > > > >> > >>>>>>> are
> > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an instance dies,
> > new
> > > > >> tasks
> > > > >> > > are
> > > > >> > > > > >> added
> > > > >> > > > > >> > >>>>>>>> to
> > > > >> > > > > >> > >>>>> the
> > > > >> > > > > >> > >>>>>>>> existing containers without shutting them
> > down).
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> We would provide some glue for running this
> > > stuff
> > > > in
> > > > >> > YARN
> > > > >> > > > via
> > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS using
> some
> > > of
> > > > >> their
> > > > >> > > > tools
> > > > >> > > > > >> > >>>>>>>> but from the
> > > > >> > > > > >> > >>>>>> point
> > > > >> > > > > >> > >>>>>>> of
> > > > >> > > > > >> > >>>>>>>> view of these frameworks these stream
> > processing
> > > > jobs
> > > > >> > are
> > > > >> > > > > just
> > > > >> > > > > >> > >>>>>> stateless
> > > > >> > > > > >> > >>>>>>>> services that can come and go and expand and
> > > > contract
> > > > >> > at
> > > > >> > > > > will.
> > > > >> > > > > >> > >>>>>>>> There
> > > > >> > > > > >> > >>>>> is
> > > > >> > > > > >> > >>>>>>> no
> > > > >> > > > > >> > >>>>>>>> more custom scheduler.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code, it would
> > get
> > > > >> larger
> > > > >> > > if
> > > > >> > > > we
> > > > >> > > > > >> > >>>>>>>>  productionized but not vastly larger. We
> > really
> > > > do
> > > > >> > get a
> > > > >> > > > ton
> > > > >> > > > > >> > >>>>>>>> of
> > > > >> > > > > >> > >>>>>>> leverage
> > > > >> > > > > >> > >>>>>>>>  out of Kafka.
> > > > >> > > > > >> > >>>>>>>>  2. Partition management is fully delegated
> to
> > > the
> > > > >> new
> > > > >> > > > > >> consumer.
> > > > >> > > > > >> > >>>>> This
> > > > >> > > > > >> > >>>>>>>>  is nice since now any partition management
> > > > strategy
> > > > >> > > > > available
> > > > >> > > > > >> > >>>>>>>> to
> > > > >> > > > > >> > >>>>>> Kafka
> > > > >> > > > > >> > >>>>>>>>  consumer is also available to Samza (and
> vice
> > > > versa)
> > > > >> > and
> > > > >> > > > > with
> > > > >> > > > > >> > >>>>>>>> the
> > > > >> > > > > >> > >>>>>>> exact
> > > > >> > > > > >> > >>>>>>>>  same configs.
> > > > >> > > > > >> > >>>>>>>>  3. It supports state as well as state reuse
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is thought
> > > > >> provoking.
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> -Jay
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris
> > > Riccomini <
> > > > >> > > > > >> > >>>>>> criccom...@apache.org>
> > > > >> > > > > >> > >>>>>>>> wrote:
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Hey all,
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza
> > > engineers
> > > > at
> > > > >> > > > LinkedIn
> > > > >> > > > > >> > >>>>>>>>> and
> > > > >> > > > > >> > >>>>>>> Confluent
> > > > >> > > > > >> > >>>>>>>>> and we came up with a few observations and
> > > would
> > > > >> like
> > > > >> > to
> > > > >> > > > > >> > >>>>>>>>> propose
> > > > >> > > > > >> > >>>>> some
> > > > >> > > > > >> > >>>>>>>>> changes.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> We've observed some things that I want to
> > call
> > > > out
> > > > >> > about
> > > > >> > > > > >> > >>>>>>>>> Samza's
> > > > >> > > > > >> > >>>>>> design,
> > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some changes.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic
> > deployment
> > > > >> system.
> > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
> > > > >> > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and
> > > > Kafka's
> > > > >> > > > consumer
> > > > >> > > > > >> > >>>>>>>>> APIs
> > > > >> > > > > >> > >>>>> are
> > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same problems.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> All three of these issues are related, but
> > I'll
> > > > >> > address
> > > > >> > > > them
> > > > >> > > > > >> in
> > > > >> > > > > >> > >>>>> order.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Deployment
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a
> > dynamic
> > > > >> > > deployment
> > > > >> > > > > >> > >>>>>>>>> scheduler
> > > > >> > > > > >> > >>>>>> such
> > > > >> > > > > >> > >>>>>>>>> as
> > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially built
> > > Samza,
> > > > we
> > > > >> > bet
> > > > >> > > > that
> > > > >> > > > > >> > >>>>>>>>> there
> > > > >> > > > > >> > >>>>>> would
> > > > >> > > > > >> > >>>>>>>>> be
> > > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and we
> could
> > > > >> support
> > > > >> > > > them,
> > > > >> > > > > >> and
> > > > >> > > > > >> > >>>>>>>>> the
> > > > >> > > > > >> > >>>>>> rest
> > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are many
> > > > >> variations.
> > > > >> > > > > >> > >>>>>>>>> Furthermore,
> > > > >> > > > > >> > >>>>>> many
> > > > >> > > > > >> > >>>>>>>>> people still prefer to just start their
> > > > processors
> > > > >> > like
> > > > >> > > > > normal
> > > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional
> > deployment
> > > > >> scripts
> > > > >> > > > such
> > > > >> > > > > as
> > > > >> > > > > >> > >>>>>>>>> Fabric,
> > > > >> > > > > >> > >>>>>> Chef,
> > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment system
> on
> > > > users
> > > > >> > makes
> > > > >> > > > the
> > > > >> > > > > >> > >>>>>>>>> Samza start-up process really painful for
> > first
> > > > time
> > > > >> > > > users.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement was
> also
> > a
> > > > bit
> > > > >> of
> > > > >> > a
> > > > >> > > > > >> > >>>>>>>>> mis-fire
> > > > >> > > > > >> > >>>>>> because
> > > > >> > > > > >> > >>>>>>>>> of
> > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding between the
> > > > nature of
> > > > >> > > batch
> > > > >> > > > > >> jobs
> > > > >> > > > > >> > >>>>>>>>> and
> > > > >> > > > > >> > >>>>>>> stream
> > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made
> conscious
> > > > effort
> > > > >> to
> > > > >> > > > favor
> > > > >> > > > > >> > >>>>>>>>> the
> > > > >> > > > > >> > >>>>>> Hadoop
> > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things, since it
> > > worked
> > > > >> and
> > > > >> > > was
> > > > >> > > > > well
> > > > >> > > > > >> > >>>>>>> understood.
> > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that batch
> jobs
> > > > have a
> > > > >> > > > definite
> > > > >> > > > > >> > >>>>>> beginning,
> > > > >> > > > > >> > >>>>>>>>> and
> > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs don't
> > > (usually).
> > > > >> This
> > > > >> > > > leads
> > > > >> > > > > to
> > > > >> > > > > >> > >>>>>>>>> a
> > > > >> > > > > >> > >>>>> much
> > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for stream
> > > processors.
> > > > >> You
> > > > >> > > > > >> basically
> > > > >> > > > > >> > >>>>>>>>> just
> > > > >> > > > > >> > >>>>>>> need
> > > > >> > > > > >> > >>>>>>>>> to find a place to start the processor, and
> > > start
> > > > >> it.
> > > > >> > > The
> > > > >> > > > > way
> > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no
> concept
> > > of
> > > > a
> > > > >> > > cluster
> > > > >> > > > > >> > >>>>>>>>> being "full". We always
> > > > >> > > > > >> > >>>>>> add
> > > > >> > > > > >> > >>>>>>>>> more machines. The problem with coupling
> > Samza
> > > > with
> > > > >> a
> > > > >> > > > > >> scheduler
> > > > >> > > > > >> > >>>>>>>>> is
> > > > >> > > > > >> > >>>>>> that
> > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to handle
> > > > deployment.
> > > > >> > > This
> > > > >> > > > > >> pulls
> > > > >> > > > > >> > >>>>>>>>> in a
> > > > >> > > > > >> > >>>>>>> bunch
> > > > >> > > > > >> > >>>>>>>>> of things such as configuration
> distribution
> > > > (config
> > > > >> > > > > stream),
> > > > >> > > > > >> > >>>>>>>>> shell
> > > > >> > > > > >> > >>>>>>> scrips
> > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all
> > the
> > > > .tgz
> > > > >> > > > stuff),
> > > > >> > > > > >> etc.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic
> > deployment
> > > > was
> > > > >> to
> > > > >> > > > > support
> > > > >> > > > > >> > >>>>>>>>> data locality. If you want to have
> locality,
> > > you
> > > > >> need
> > > > >> > to
> > > > >> > > > put
> > > > >> > > > > >> > >>>>>>>>> your
> > > > >> > > > > >> > >>>>>> processors
> > > > >> > > > > >> > >>>>>>>>> close to the data they're processing. Upon
> > > > further
> > > > >> > > > > >> > >>>>>>>>> investigation,
> > > > >> > > > > >> > >>>>>>> though,
> > > > >> > > > > >> > >>>>>>>>> this feature is not that beneficial. There
> is
> > > > some
> > > > >> > good
> > > > >> > > > > >> > >>>>>>>>> discussion
> > > > >> > > > > >> > >>>>>> about
> > > > >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335. Again,
> we
> > > > took
> > > > >> the
> > > > >> > > > > >> > >>>>>>>>> Map/Reduce
> > > > >> > > > > >> > >>>>>> path,
> > > > >> > > > > >> > >>>>>>>>> but
> > > > >> > > > > >> > >>>>>>>>> there are some fundamental differences
> > between
> > > > HDFS
> > > > >> > and
> > > > >> > > > > Kafka.
> > > > >> > > > > >> > >>>>>>>>> HDFS
> > > > >> > > > > >> > >>>>>> has
> > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions. This
> > leads
> > > to
> > > > >> less
> > > > >> > > > > >> > >>>>>>>>> optimization potential with stream
> processors
> > > on
> > > > top
> > > > >> > of
> > > > >> > > > > Kafka.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch.
> Samza
> > > > doesn't
> > > > >> > > have
> > > > >> > > > > any
> > > > >> > > > > >> > >>>>>>>>> built
> > > > >> > > > > >> > >>>>> in
> > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it depends
> on
> > > the
> > > > >> > > dynamic
> > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to handle
> > restarts
> > > > >> when a
> > > > >> > > > > >> > >>>>>>>>> processor dies. This has
> > > > >> > > > > >> > >>>>>>> made
> > > > >> > > > > >> > >>>>>>>>> it very difficult to write a standalone
> Samza
> > > > >> > container
> > > > >> > > > > >> > >>>> (SAMZA-516).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Pluggability
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, but I
> > think
> > > > that
> > > > >> > > we've
> > > > >> > > > > >> gone
> > > > >> > > > > >> > >>>>>>>>> too
> > > > >> > > > > >> > >>>>>> far
> > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems
> > (SystemConsumer,
> > > > >> > > > > SystemProducer,
> > > > >> > > > > >> > >>>> etc).
> > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every
> > > > >> component
> > > > >> > > > > >> > >>>>> (MessageChooser,
> > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper,
> ConfigRewriter,
> > > > etc).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> There's probably more that I've forgotten,
> as
> > > > well.
> > > > >> > Some
> > > > >> > > > of
> > > > >> > > > > >> > >>>>>>>>> these
> > > > >> > > > > >> > >>>>> are
> > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not to be.
> This
> > > all
> > > > >> comes
> > > > >> > > at
> > > > >> > > > a
> > > > >> > > > > >> cost:
> > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is making it
> > harder
> > > > for
> > > > >> > our
> > > > >> > > > > users
> > > > >> > > > > >> > >>>>>>>>> to
> > > > >> > > > > >> > >>>>> pick
> > > > >> > > > > >> > >>>>>> up
> > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It also makes
> > it
> > > > >> > difficult
> > > > >> > > > for
> > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about what the
> > > > >> > > characteristics
> > > > >> > > > of
> > > > >> > > > > >> > >>>>>>>>> the container (since the characteristics
> > change
> > > > >> > > depending
> > > > >> > > > on
> > > > >> > > > > >> > >>>>>>>>> which plugins are use).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most
> visible
> > > in
> > > > the
> > > > >> > > > System
> > > > >> > > > > >> APIs.
> > > > >> > > > > >> > >>>>> What
> > > > >> > > > > >> > >>>>>>>>> Samza really requires to be functional is
> > Kafka
> > > > as
> > > > >> its
> > > > >> > > > > >> > >>>>>>>>> transport
> > > > >> > > > > >> > >>>>>> layer.
> > > > >> > > > > >> > >>>>>>>>> But
> > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use cases
> into
> > > one
> > > > >> API:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
> > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The current System API supports both of
> these
> > > use
> > > > >> > cases.
> > > > >> > > > The
> > > > >> > > > > >> > >>>>>>>>> problem
> > > > >> > > > > >> > >>>>>> is,
> > > > >> > > > > >> > >>>>>>>>> we
> > > > >> > > > > >> > >>>>>>>>> actually want different features for each
> use
> > > > case.
> > > > >> By
> > > > >> > > > > >> papering
> > > > >> > > > > >> > >>>>>>>>> over
> > > > >> > > > > >> > >>>>>>> these
> > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a single API,
> > > we've
> > > > >> > > > introduced
> > > > >> > > > > a
> > > > >> > > > > >> > >>>>>>>>> ton of
> > > > >> > > > > >> > >>>>>>> leaky
> > > > >> > > > > >> > >>>>>>>>> abstractions.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in (2)
> is
> > to
> > > > have
> > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for offsets
> > > (like
> > > > >> > Kafka).
> > > > >> > > > > This
> > > > >> > > > > >> > >>>>>>>>> would be at odds
> > > > >> > > > > >> > >>>>> with
> > > > >> > > > > >> > >>>>>>> (1),
> > > > >> > > > > >> > >>>>>>>>> though, since different systems have
> > different
> > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
> > > > >> > > > > >> > >>>>>>>>> There was discussion both on the mailing
> list
> > > and
> > > > >> the
> > > > >> > > SQL
> > > > >> > > > > >> JIRAs
> > > > >> > > > > >> > >>>>> about
> > > > >> > > > > >> > >>>>>>> the
> > > > >> > > > > >> > >>>>>>>>> need for this.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The same thing holds true for
> replayability.
> > > > Kafka
> > > > >> > > allows
> > > > >> > > > us
> > > > >> > > > > >> to
> > > > >> > > > > >> > >>>>> rewind
> > > > >> > > > > >> > >>>>>>>>> when
> > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other systems
> don't.
> > In
> > > > some
> > > > >> > > > cases,
> > > > >> > > > > >> > >>>>>>>>> systems
> > > > >> > > > > >> > >>>>>>> return
> > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g.
> > > > >> WikipediaSystemConsumer)
> > > > >> > > > > because
> > > > >> > > > > >> > >>>>>>>>> they
> > > > >> > > > > >> > >>>>>> have
> > > > >> > > > > >> > >>>>>>> no
> > > > >> > > > > >> > >>>>>>>>> offsets.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka
> > supports
> > > > >> > > > > partitioning,
> > > > >> > > > > >> > >>>>>>>>> but
> > > > >> > > > > >> > >>>>> many
> > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by having a
> > single
> > > > >> > > partition
> > > > >> > > > > for
> > > > >> > > > > >> > >>>>>>>>> those systems. Still, other systems model
> > > > >> partitioning
> > > > >> > > > > >> > >>>> differently (e.g.
> > > > >> > > > > >> > >>>>>>>>> Kinesis).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess.
> > > > Creating
> > > > >> > > streams
> > > > >> > > > > in
> > > > >> > > > > >> a
> > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost impossible.
> As
> > is
> > > > >> > modeling
> > > > >> > > > > >> > >>>>>>>>> metadata
> > > > >> > > > > >> > >>>>> for
> > > > >> > > > > >> > >>>>>>> the
> > > > >> > > > > >> > >>>>>>>>> system (replication factor, partitions,
> > > location,
> > > > >> > etc).
> > > > >> > > > The
> > > > >> > > > > >> > >>>>>>>>> list
> > > > >> > > > > >> > >>>>> goes
> > > > >> > > > > >> > >>>>>>> on.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Duplicate work
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> At the time that we began writing Samza,
> > > Kafka's
> > > > >> > > consumer
> > > > >> > > > > and
> > > > >> > > > > >> > >>>>> producer
> > > > >> > > > > >> > >>>>>>>>> APIs
> > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On the
> > > > >> > consumer-side,
> > > > >> > > > you
> > > > >> > > > > >> > >>>>>>>>> had two
> > > > >> > > > > >> > >>>>>>>>> options: use the high level consumer, or
> the
> > > > simple
> > > > >> > > > > consumer.
> > > > >> > > > > >> > >>>>>>>>> The
> > > > >> > > > > >> > >>>>>>> problem
> > > > >> > > > > >> > >>>>>>>>> with the high-level consumer was that it
> > > > controlled
> > > > >> > your
> > > > >> > > > > >> > >>>>>>>>> offsets, partition assignments, and the
> order
> > > in
> > > > >> which
> > > > >> > > you
> > > > >> > > > > >> > >>>>>>>>> received messages. The
> > > > >> > > > > >> > >>>>> problem
> > > > >> > > > > >> > >>>>>>>>> with
> > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not
> simple.
> > > It's
> > > > >> > basic.
> > > > >> > > > You
> > > > >> > > > > >> > >>>>>>>>> end up
> > > > >> > > > > >> > >>>>>>> having
> > > > >> > > > > >> > >>>>>>>>> to handle a lot of really low-level stuff
> > that
> > > > you
> > > > >> > > > > shouldn't.
> > > > >> > > > > >> > >>>>>>>>> We
> > > > >> > > > > >> > >>>>>> spent a
> > > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's
> > KafkaSystemConsumer
> > > > very
> > > > >> > > > robust.
> > > > >> > > > > >> It
> > > > >> > > > > >> > >>>>>>>>> also allows us to support some cool
> features:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and
> > > > prioritization.
> > > > >> > > > > >> > >>>>>>>>> * Tight control over partition assignment
> to
> > > > support
> > > > >> > > > joins,
> > > > >> > > > > >> > >>>>>>>>> global
> > > > >> > > > > >> > >>>>>> state
> > > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)), etc.
> > > > >> > > > > >> > >>>>>>>>> * Tight control over offset checkpointing.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is that
> > > these
> > > > >> > > features
> > > > >> > > > > >> > >>>>>>>>> should
> > > > >> > > > > >> > >>>>>>> actually
> > > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers (not
> > just
> > > > >> Samza
> > > > >> > > > stream
> > > > >> > > > > >> > >>>>>> processors)
> > > > >> > > > > >> > >>>>>>>>> end up wanting to do things like joins and
> > > > partition
> > > > >> > > > > >> > >>>>>>>>> assignment. The
> > > > >> > > > > >> > >>>>>>> Kafka
> > > > >> > > > > >> > >>>>>>>>> community has come to the same conclusion.
> > > > They're
> > > > >> > > adding
> > > > >> > > > a
> > > > >> > > > > >> ton
> > > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka consumer
> > > > >> > > implementation.
> > > > >> > > > > To a
> > > > >> > > > > >> > >>>>>>>>> large extent,
> > > > >> > > > > >> > >>>>> it's
> > > > >> > > > > >> > >>>>>>>>> duplicate work to what we've already done
> in
> > > > Samza.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a
> very
> > > > similar
> > > > >> > > > > approach
> > > > >> > > > > >> > >>>>>>>>> to
> > > > >> > > > > >> > >>>>>> Samza's
> > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation for
> > > > handling
> > > > >> > > offset
> > > > >> > > > > >> > >>>>>> checkpointing.
> > > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset management
> > > feature
> > > > >> > stores
> > > > >> > > > > >> offset
> > > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows you to
> > fetch
> > > > them
> > > > >> > > from
> > > > >> > > > > the
> > > > >> > > > > >> > >>>>>>>>> broker.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we
> > > could
> > > > >> have
> > > > >> > > > shared
> > > > >> > > > > >> > >>>>>>>>> the
> > > > >> > > > > >> > >>>>> work
> > > > >> > > > > >> > >>>>>> if
> > > > >> > > > > >> > >>>>>>>>> it
> > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the get-go.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Vision
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather radical
> > > > proposal.
> > > > >> > Samza
> > > > >> > > > is
> > > > >> > > > > >> > >>>>> relatively
> > > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to say
> that
> > > > we're
> > > > >> > > near a
> > > > >> > > > > 1.0
> > > > >> > > > > >> > >>>>>> release.
> > > > >> > > > > >> > >>>>>>>>> I'd
> > > > >> > > > > >> > >>>>>>>>> like to propose that we take what we've
> > > learned,
> > > > and
> > > > >> > > begin
> > > > >> > > > > >> > >>>>>>>>> thinking
> > > > >> > > > > >> > >>>>>>> about
> > > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we change if
> we
> > > were
> > > > >> > > starting
> > > > >> > > > > >> from
> > > > >> > > > > >> > >>>>>> scratch?
> > > > >> > > > > >> > >>>>>>>>> My
> > > > >> > > > > >> > >>>>>>>>> proposal is to:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to
> > run
> > > > Samza
> > > > >> > > > > >> > >>>>>>>>> processors, and eliminate all direct
> > > dependences
> > > > on
> > > > >> > > YARN,
> > > > >> > > > > >> Mesos,
> > > > >> > > > > >> > >>>> etc.
> > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support only
> > Kafka
> > > > as
> > > > >> the
> > > > >> > > > > stream
> > > > >> > > > > >> > >>>>>> processing
> > > > >> > > > > >> > >>>>>>>>> layer.
> > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging,
> > > > >> serialization,
> > > > >> > > and
> > > > >> > > > > >> > >>>>>>>>> config
> > > > >> > > > > >> > >>>>>>> systems,
> > > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that I
> > > outlined
> > > > >> > above.
> > > > >> > > It
> > > > >> > > > > >> > >>>>>>>>> should
> > > > >> > > > > >> > >>>>> also
> > > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty
> > dramatically.
> > > > >> > > Supporting
> > > > >> > > > > >> only
> > > > >> > > > > >> > >>>>>>>>> a standalone container will allow Samza to
> be
> > > > >> executed
> > > > >> > > on
> > > > >> > > > > YARN
> > > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using
> > Marathon/Aurora),
> > > or
> > > > >> most
> > > > >> > > > other
> > > > >> > > > > >> > >>>>>>>>> in-house
> > > > >> > > > > >> > >>>>>>> deployment
> > > > >> > > > > >> > >>>>>>>>> systems. This should make life a lot easier
> > for
> > > > new
> > > > >> > > users.
> > > > >> > > > > >> > >>>>>>>>> Imagine
> > > > >> > > > > >> > >>>>>>> having
> > > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN. The
> > drop
> > > > in
> > > > >> > > mailing
> > > > >> > > > > >> list
> > > > >> > > > > >> > >>>>>> traffic
> > > > >> > > > > >> > >>>>>>>>> will be pretty dramatic.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long overdue to
> me.
> > > The
> > > > >> > > reality
> > > > >> > > > > is,
> > > > >> > > > > >> > >>>>> everyone
> > > > >> > > > > >> > >>>>>>>>> that
> > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We
> > > > basically
> > > > >> > > > require
> > > > >> > > > > >> it
> > > > >> > > > > >> > >>>>>> already
> > > > >> > > > > >> > >>>>>>> in
> > > > >> > > > > >> > >>>>>>>>> order for most features to work. Those that
> > are
> > > > >> using
> > > > >> > > > other
> > > > >> > > > > >> > >>>>>>>>> systems
> > > > >> > > > > >> > >>>>>> are
> > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into Kafka
> (1),
> > > and
> > > > >> then
> > > > >> > > > they
> > > > >> > > > > do
> > > > >> > > > > >> > >>>>>>>>> the processing on top. There is already
> > > > discussion (
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> >
> > > > >> > > > >
> > > > >> >
> > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
> > > > >> > > > > >> > >>>>> 767
> > > > >> > > > > >> > >>>>>>>>> )
> > > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka
> > extremely
> > > > >> easy.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with Kafka,
> > we
> > > > can
> > > > >> > > > leverage
> > > > >> > > > > a
> > > > >> > > > > >> > >>>>>>>>> ton of
> > > > >> > > > > >> > >>>>>>> their
> > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to maintain
> our
> > > own
> > > > >> > config,
> > > > >> > > > > >> > >>>>>>>>> metrics,
> > > > >> > > > > >> > >>>>> etc.
> > > > >> > > > > >> > >>>>>>> We
> > > > >> > > > > >> > >>>>>>>>> can all share the same libraries, and make
> > them
> > > > >> > better.
> > > > >> > > > This
> > > > >> > > > > >> > >>>>>>>>> will
> > > > >> > > > > >> > >>>>> also
> > > > >> > > > > >> > >>>>>>>>> allow us to share the consumer/producer
> APIs,
> > > and
> > > > >> will
> > > > >> > > let
> > > > >> > > > > us
> > > > >> > > > > >> > >>>>> leverage
> > > > >> > > > > >> > >>>>>>>>> their offset management and partition
> > > management,
> > > > >> > rather
> > > > >> > > > > than
> > > > >> > > > > >> > >>>>>>>>> having
> > > > >> > > > > >> > >>>>>> our
> > > > >> > > > > >> > >>>>>>>>> own. All of the coordinator stream code
> would
> > > go
> > > > >> away,
> > > > >> > > as
> > > > >> > > > > >> would
> > > > >> > > > > >> > >>>>>>>>> most
> > > > >> > > > > >> > >>>>>> of
> > > > >> > > > > >> > >>>>>>>>> the
> > > > >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably have to
> > push
> > > > some
> > > > >> > > > > partition
> > > > >> > > > > >> > >>>>>>> management
> > > > >> > > > > >> > >>>>>>>>> features into the Kafka broker, but they're
> > > > already
> > > > >> > > moving
> > > > >> > > > > in
> > > > >> > > > > >> > >>>>>>>>> that direction with the new consumer API.
> The
> > > > >> features
> > > > >> > > we
> > > > >> > > > > have
> > > > >> > > > > >> > >>>>>>>>> for
> > > > >> > > > > >> > >>>>>> partition
> > > > >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza, and seem
> > > like
> > > > >> they
> > > > >> > > > should
> > > > >> > > > > >> be
> > > > >> > > > > >> > >>>>>>>>> in
> > > > >> > > > > >> > >>>>>> Kafka
> > > > >> > > > > >> > >>>>>>>>> anyway. There will always be some niche
> > usages
> > > > which
> > > > >> > > will
> > > > >> > > > > >> > >>>>>>>>> require
> > > > >> > > > > >> > >>>>>> extra
> > > > >> > > > > >> > >>>>>>>>> care and hence full control over partition
> > > > >> assignments
> > > > >> > > > much
> > > > >> > > > > >> > >>>>>>>>> like the
> > > > >> > > > > >> > >>>>>>> Kafka
> > > > >> > > > > >> > >>>>>>>>> low level consumer api. These would
> continue
> > to
> > > > be
> > > > >> > > > > supported.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> These items will be good for the Samza
> > > community.
> > > > >> > > They'll
> > > > >> > > > > make
> > > > >> > > > > >> > >>>>>>>>> Samza easier to use, and make it easier for
> > > > >> developers
> > > > >> > > to
> > > > >> > > > > add
> > > > >> > > > > >> > >>>>>>>>> new features.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large (and
> > somewhat
> > > > >> > backwards
> > > > >> > > > > >> > >>>>> incompatible
> > > > >> > > > > >> > >>>>>>>>> change). If we choose to go this route,
> it's
> > > > >> important
> > > > >> > > > that
> > > > >> > > > > we
> > > > >> > > > > >> > >>>>> openly
> > > > >> > > > > >> > >>>>>>>>> communicate how we're going to provide a
> > > > migration
> > > > >> > path
> > > > >> > > > from
> > > > >> > > > > >> > >>>>>>>>> the
> > > > >> > > > > >> > >>>>>>> existing
> > > > >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make
> incompatible
> > > > >> > changes).
> > > > >> > > I
> > > > >> > > > > >> think
> > > > >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need to
> provide a
> > > > >> wrapper
> > > > >> > to
> > > > >> > > > > allow
> > > > >> > > > > >> > >>>>>>>>> existing StreamTask implementations to
> > continue
> > > > >> > running
> > > > >> > > on
> > > > >> > > > > the
> > > > >> > > > > >> > >>>> new container.
> > > > >> > > > > >> > >>>>>>> It's
> > > > >> > > > > >> > >>>>>>>>> also important that we openly communicate
> > about
> > > > >> > timing,
> > > > >> > > > and
> > > > >> > > > > >> > >>>>>>>>> stages
> > > > >> > > > > >> > >>>>> of
> > > > >> > > > > >> > >>>>>>> the
> > > > >> > > > > >> > >>>>>>>>> migration.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure you have
> > > > opinions.
> > > > >> > :)
> > > > >> > > > > Please
> > > > >> > > > > >> > >>>>>>>>> send
> > > > >> > > > > >> > >>>>>> your
> > > > >> > > > > >> > >>>>>>>>> thoughts and feedback.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Cheers,
> > > > >> > > > > >> > >>>>>>>>> Chris
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>> --
> > > > >> > > > > >> > >>>>>> -- Guozhang
> > > > >> > > > > >> > >>>>>>
> > > > >> > > > > >> > >>>>>
> > > > >> > > > > >> > >>>>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >>
> > > > >> > > > > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Re: Thoughts and obesrvations on Samza

Reply via email to