Re: Thoughts and obesrvations on Samza

Jay Kreps Fri, 03 Jul 2015 12:03:54 -0700

Hey Yi,

Yeah let me try to be more concrete. Any partition assignment management
requires two things
  G - a group of consumer instances
  f  - a function f(G) that yields the mapping of partitions to consumer
instances
The hard thing is to determine G, this is what requires the heartbeats,
coordinator, etc. So I think we agree that G should be maintained by the
kafka co-ordinator, and I think we all agree that in any design f has to be
pluggable.


Note that if f is deterministic that you can actually do this two ways:
1. Original Proposal: the coordinator determines G and gives back f(G) to
the client
2. Current Proposal: the coordinator determines G and gives back G to the
clients each of whom compute f(G) to determine their own assignment.

As you say the current proposal you link to is (2), but given the needs
arguably we should consider (1). Maintaining G is in any case the really
hard thing, and that will be done in Kafka in either proposal it is just a
question of where the partition assignment function f is plugged in (client
vs server side).

-Jay

On Thu, Jul 2, 2015 at 4:44 PM, Yi Pan <nickpa...@gmail.com> wrote:

> @Jay,
>
> {quote}
> I think it may be possible to rework the assignment feature in the consumer
> to make this always be a client-side concern so that Samza, the Kafka
> consumer, and Copycat can all use the same facility.
> {quote}
> Thanks! I like that idea.
>
> {quote}
> So it may make sense to revist this, I don't think it is necessarily a
> massive
> change and would give more flexibility for the variety of cases.
> {quote}
> I totally agree.
>
> P.S. just for my education,
> {quote}
> The original design for the kafka coordinator was that the coordinator
> would
> just coordinate *group* membership and the actual assignment of partitions
> to members of the group would be done client side.
> {quote}
> Please correct me if I am wrong. Is the link here still valid:
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Detailed+Consumer+Coordinator+Design
> ?
> If yes, I thought that the assignment is done by the broker as in
> KAFKA-167? Maybe we can discuss and clarify this in person.
>
> Thanks a lot!
>
> -Yi
>
> On Thu, Jul 2, 2015 at 3:52 PM, Jay Kreps <j...@confluent.io> wrote:
>
> > Yeah, hey Yi, I get what you are saying now. At the risk of getting into
> > the weeds a bit you are exactly right, a similar thing is needed for
> > copycat/kip-26. At the risk of getting a bit into the weeds, I think it
> may
> > be possible to rework the assignment feature in the consumer to make this
> > always be a client-side concern so that Samza, the Kafka consumer, and
> > Copycat can all use the same facility.
> >
> > The original design for the kafka coordinator was that the coordinator
> > would just coordinate *group* membership and the actual assignment of
> > partitions to members of the group would be done client side. The
> advantage
> > of this was that it was more general the disadvantage was that the server
> > couldn't really check or monitor the partition assignment. Since we
> didn't
> > have any other use case for generic group management we went with the
> more
> > specific partition assignment.
> >
> > However a few things have changed since that original design:
> > 1. We now have the additional use cases of copycat and Samza
> > 2. We now realize that the assignment strategies don't actually
> necessarily
> > ensure each partition is assigned to only one consumer--there are really
> > valid use cases for broadcast or multiple replica assignment schemes--so
> we
> > can't actually make the a hard assertion on the server.
> >
> > So it may make sense to revist this, I don't think it is necessarily a
> > massive change and would give more flexibility for the variety of cases.
> >
> > -Jay
> >
> > On Thu, Jul 2, 2015 at 3:38 PM, Yi Pan <nickpa...@gmail.com> wrote:
> >
> > > @Guozhang, yes, that's what I meant. From Kafka consumers' point of
> view,
> > > it pretty much boils down to answer the following question:
> > > 1. For Kafka consumer in each container (i.e. a Samza worker), which
> > topic
> > > partitions it should consume.
> > > Samza's current StreamTask model still makes sense to me and the
> > > partition-to-task mapping is the debatable point that whether that
> should
> > > be in Kafka or stays in a separate module. As we discussed earlier,
> some
> > > simple partition-to-task mapping maybe expressed as co-partition
> > > distribution among different topics in Kafka (forgive me if I had make
> > > mistakes here since I am not 100% sure about how partition distribution
> > > policies work in Kafka). However, more complex and application-specific
> > > partition-to-task mapping would need to stay outside of Kafka. One
> > example
> > > is the discussion on SQL tasks:
> > >
> > >
> >
> https://issues.apache.org/jira/browse/SAMZA-516?focusedCommentId=14288685&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288685
> > >
> > > On Thu, Jul 2, 2015 at 2:47 PM, Guozhang Wang <wangg...@gmail.com>
> > wrote:
> > >
> > > > Since the resource scheduling systems like YARN / Mesos only gives
> > Samza
> > > a
> > > > couple of resource units (or "containers") to run processes, while
> > Samza
> > > > still needs to handle task assignment / scheduling like which tasks
> > > should
> > > > be allocated to which containers that consume from which partitions,
> > > etc. I
> > > > think this is want Yi meant for "partition management"?
> > > >
> > > > On Thu, Jul 2, 2015 at 2:35 PM, Yi Pan <nickpa...@gmail.com> wrote:
> > > >
> > > > > @Jay, yes, the current function in the JobCoordinator is just
> > partition
> > > > > management. Maybe we should just call it PartitionManager to make
> it
> > > > > explicit.
> > > > >
> > > > > -Yi
> > > > >
> > > > > On Thu, Jul 2, 2015 at 2:24 PM, Jay Kreps <j...@confluent.io>
> wrote:
> > > > >
> > > > > > Hey Yi,
> > > > > >
> > > > > > What does the JobCoordinator do? YARN/Mesos/etc would be doing
> the
> > > > actual
> > > > > > resource assignment, process restart, etc, right? Is the
> additional
> > > > value
> > > > > > add of the JobCoordinator just partition management?
> > > > > >
> > > > > > -Jay
> > > > > >
> > > > > > On Thu, Jul 2, 2015 at 11:32 AM, Yi Pan <nickpa...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Hi, all,
> > > > > > >
> > > > > > >
> > > > > > > Thanks Chris for sending out this proposal and Jay for sharing
> > the
> > > > > > > extremely illustrative prototype code.
> > > > > > >
> > > > > > >
> > > > > > > I have been thinking it over many times and want to list out my
> > > > > personal
> > > > > > > opinions below:
> > > > > > >
> > > > > > > 1. Generally, I agree with most of the people here on the
> mailing
> > > > list
> > > > > on
> > > > > > > two points:
> > > > > > >
> > > > > > >    a. Deeper integration w/ Kafka is great. No more confusing
> > > mapping
> > > > > > from
> > > > > > > SystemStreamPartition to TopicPartition etc.
> > > > > > >
> > > > > > >    b. Separation the ingestion vs transformation greatly
> simplify
> > > the
> > > > > > > systems APIs
> > > > > > >
> > > > > > > Having the above two changes would allow us to remove many
> > > > unnecessary
> > > > > > > complexities introduced by those pluggable interfaces Chris’
> > > pointed
> > > > > out,
> > > > > > > e.g. pluggable streaming systems and serde.
> > > > > > >
> > > > > > >
> > > > > > > To recall one of Chris’s statement on difficulties in dynamic
> > > > > > deployment, I
> > > > > > > believe that the difficulties are mainly the result of
> > > tight-coupling
> > > > > of
> > > > > > > partition assignment vs the container deployment in the current
> > > > system.
> > > > > > The
> > > > > > > current container deployment requires a pre-defined partition
> > > > > assignment
> > > > > > > strategy coupled together w/ the deployment configuration
> before
> > we
> > > > can
> > > > > > > submit to YARN and start the Samza container, which makes the
> > > > launching
> > > > > > > process super long. Also, fault-tolerance and the embedded
> > > > > JobCoordinator
> > > > > > > code in YARN AppMaster is another way of  making dynamic
> > deployment
> > > > > more
> > > > > > > complex and difficult.
> > > > > > >
> > > > > > >
> > > > > > > First, borrowing Yan’s term, let’s call the Samza standalone
> > > process
> > > > a
> > > > > > > Samza worker. Here is what I have been thinking:
> > > > > > >
> > > > > > > 1. Separate the execution framework from partition
> > assignment/load
> > > > > > > balancing:
> > > > > > >
> > > > > > >     a. a Samza worker should be launched by execution framework
> > > that
> > > > > only
> > > > > > > deals w/ process placement to available nodes. The execution
> > > > framework
> > > > > > now
> > > > > > > should only deal w/ how many such processes are needed, where
> to
> > > put
> > > > > > them,
> > > > > > > and how to keep them alive.
> > > > > > >
> > > > > > >     b. Partition assignment/load balancing can be a pluggable
> > > > interface
> > > > > > in
> > > > > > > Samza that allows the Samza workers to ask for partition
> > > assignments.
> > > > > > Let’s
> > > > > > > borrow the name JobCoordinator for now. To allow
> fault-tolerance
> > in
> > > > > case
> > > > > > of
> > > > > > > failure, the partition assignments to workers need to be
> dynamic.
> > > > > Hence,
> > > > > > > the abstract interface would be much like what Jay’s code
> > > illustrate:
> > > > > > > get()/onAssigned()/onRevoke(). The implementation of the
> > partition
> > > > > > > assignment can be either:
> > > > > > >
> > > > > > >         a) completely rely on Kafka.
> > > > > > >
> > > > > > >         b) explicit partition assignment via JobCoordinator.
> > > Chris’s
> > > > > work
> > > > > > > in SAMZA-516 can be easily incorporated here. The use case in
> > > > SAMZA-41
> > > > > > that
> > > > > > > runs Samza ProcessJob w/ static partition assignment can be
> > > > implemented
> > > > > > of
> > > > > > > JobCoordinator via any home-grown implementation of distributed
> > > > > > > coordinator. All the work we did in LinkedIn to support dynamic
> > > > > partition
> > > > > > > assignment and host-affinity SAMZA-617 can be nicely reused as
> an
> > > > > > > implementation of JobCoordinator.
> > > > > > >
> > > > > > >
> > > > > > > When we did the above work, I can see three usage patterns in
> > > Samza:
> > > > > > >
> > > > > > >    a. Samza as a library: Samza has a set of libraries to
> provide
> > > > > stream
> > > > > > > processing, just like a third Kafka client (as illustrated in
> > Jay’s
> > > > > > > example). The execution/deployment is totally controlled by the
> > > > > > application
> > > > > > > and the partition coordination is done via Kafka
> > > > > > >
> > > > > > >    b. Samza as a process: Samza runs as a standalone process.
> > There
> > > > may
> > > > > > not
> > > > > > > be a execution framework to launch and deploy Samza processes.
> > The
> > > > > > > partition assignment is pluggable JobCoordinator.
> > > > > > >
> > > > > > >    c. Samza as a service: Samza runs as a collection of
> > processes.
> > > > > There
> > > > > > > will be an execution framework to allocate resource, launch and
> > > > deploy
> > > > > > > Samza workers and keep them alive. The same pluggable
> > > JobCoordinator
> > > > is
> > > > > > > desirable here as well.
> > > > > > >
> > > > > > >
> > > > > > > Lastly, I would argue that CopyCat in KIP-26 should probably
> > follow
> > > > the
> > > > > > > same model. Hence, in Samza as a service model as in LinkedIn,
> we
> > > can
> > > > > use
> > > > > > > the same fault tolerance execution framework to run CopyCat and
> > > Samza
> > > > > w/o
> > > > > > > the need to operate two service platforms, which should address
> > > > > Sriram’s
> > > > > > > comment in the email thread.
> > > > > > >
> > > > > > >
> > > > > > > Hope the above makes sense. Thanks all!
> > > > > > >
> > > > > > >
> > > > > > > -Yi
> > > > > > >
> > > > > > > On Thu, Jul 2, 2015 at 9:53 AM, Sriram <sriram....@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > One thing that is worth exploring is to have a transformation
> > and
> > > > > > > > ingestion library in Kafka but use the same framework for
> fault
> > > > > > > tolerance,
> > > > > > > > resource isolation and management. The biggest difference I
> see
> > > in
> > > > > > these
> > > > > > > > two use cases is the API and data model.
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Jul 2, 2015, at 8:59 AM, Jay Kreps <j...@confluent.io>
> > > wrote:
> > > > > > > > >
> > > > > > > > > Hey Garry,
> > > > > > > > >
> > > > > > > > > Yeah that's super frustrating. I'd be happy to chat more
> > about
> > > > this
> > > > > > if
> > > > > > > > > you'd be interested. I think Chris and I started with the
> > idea
> > > of
> > > > > > "what
> > > > > > > > > would it take to make Samza a kick-ass ingestion tool" but
> > > > > ultimately
> > > > > > > we
> > > > > > > > > kind of came around to the idea that ingestion and
> > > transformation
> > > > > had
> > > > > > > > > pretty different needs and coupling the two made things
> hard.
> > > > > > > > >
> > > > > > > > > For what it's worth I think copycat (KIP-26) actually will
> do
> > > > what
> > > > > > you
> > > > > > > > are
> > > > > > > > > looking for.
> > > > > > > > >
> > > > > > > > > With regard to your point about slider, I don't necessarily
> > > > > disagree.
> > > > > > > > But I
> > > > > > > > > think getting good YARN support is quite doable and I think
> > we
> > > > can
> > > > > > make
> > > > > > > > > that work well. I think the issue this proposal solves is
> > that
> > > > > > > > technically
> > > > > > > > > it is pretty hard to support multiple cluster management
> > > systems
> > > > > the
> > > > > > > way
> > > > > > > > > things are now, you need to write an "app master" or
> > > "framework"
> > > > > for
> > > > > > > each
> > > > > > > > > and they are all a little different so testing is really
> > hard.
> > > In
> > > > > the
> > > > > > > > > absence of this we have been stuck with just YARN which has
> > > > > fantastic
> > > > > > > > > penetration in the Hadoopy part of the org, but zero
> > > penetration
> > > > > > > > elsewhere.
> > > > > > > > > Given the huge amount of work being put in to slider,
> > marathon,
> > > > aws
> > > > > > > > > tooling, not to mention the umpteen related packaging
> > > > technologies
> > > > > > > people
> > > > > > > > > want to use (Docker, Kubernetes, various cloud-specific
> > deploy
> > > > > tools,
> > > > > > > > etc)
> > > > > > > > > I really think it is important to get this right.
> > > > > > > > >
> > > > > > > > > -Jay
> > > > > > > > >
> > > > > > > > > On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <
> > > > > > > > > g.turking...@improvedigital.com> wrote:
> > > > > > > > >
> > > > > > > > >> Hi all,
> > > > > > > > >>
> > > > > > > > >> I think the question below re does Samza become a
> > sub-project
> > > of
> > > > > > Kafka
> > > > > > > > >> highlights the broader point around migration. Chris
> > mentions
> > > > > > Samza's
> > > > > > > > >> maturity is heading towards a v1 release but I'm not sure
> it
> > > > feels
> > > > > > > > right to
> > > > > > > > >> launch a v1 then immediately plan to deprecate most of it.
> > > > > > > > >>
> > > > > > > > >> From a selfish perspective I have some guys who have
> started
> > > > > working
> > > > > > > > with
> > > > > > > > >> Samza and building some new consumers/producers was next
> up.
> > > > > Sounds
> > > > > > > like
> > > > > > > > >> that is absolutely not the direction to go. I need to look
> > > into
> > > > > the
> > > > > > > KIP
> > > > > > > > in
> > > > > > > > >> more detail but for me the attractiveness of adding new
> > Samza
> > > > > > > > >> consumer/producers -- even if yes all they were doing was
> > > really
> > > > > > > getting
> > > > > > > > >> data into and out of Kafka --  was to avoid  having to
> worry
> > > > about
> > > > > > the
> > > > > > > > >> lifecycle management of external clients. If there is a
> > > generic
> > > > > > Kafka
> > > > > > > > >> ingress/egress layer that I can plug a new connector into
> > and
> > > > > have a
> > > > > > > > lot of
> > > > > > > > >> the heavy lifting re scale and reliability done for me
> then
> > it
> > > > > gives
> > > > > > > me
> > > > > > > > all
> > > > > > > > >> the pushing new consumers/producers would. If not then it
> > > > > > complicates
> > > > > > > my
> > > > > > > > >> operational deployments.
> > > > > > > > >>
> > > > > > > > >> Which is similar to my other question with the proposal --
> > if
> > > we
> > > > > > > build a
> > > > > > > > >> fully available/stand-alone Samza plus the requisite shims
> > to
> > > > > > > integrate
> > > > > > > > >> with Slider etc I suspect the former may be a lot more
> work
> > > than
> > > > > we
> > > > > > > > think.
> > > > > > > > >> We may make it much easier for a newcomer to get something
> > > > running
> > > > > > but
> > > > > > > > >> having them step up and get a reliable production
> deployment
> > > may
> > > > > > still
> > > > > > > > >> dominate mailing list  traffic, if for different reasons
> > than
> > > > > today.
> > > > > > > > >>
> > > > > > > > >> Don't get me wrong -- I'm comfortable with making the
> Samza
> > > > > > dependency
> > > > > > > > on
> > > > > > > > >> Kafka much more explicit and I absolutely see the benefits
> > in
> > > > the
> > > > > > > > >> reduction of duplication and clashing
> > > terminologies/abstractions
> > > > > > that
> > > > > > > > >> Chris/Jay describe. Samza as a library would likely be a
> > very
> > > > nice
> > > > > > > tool
> > > > > > > > to
> > > > > > > > >> add to the Kafka ecosystem. I just have the concerns above
> > re
> > > > the
> > > > > > > > >> operational side.
> > > > > > > > >>
> > > > > > > > >> Garry
> > > > > > > > >>
> > > > > > > > >> -----Original Message-----
> > > > > > > > >> From: Gianmarco De Francisci Morales [mailto:
> > g...@apache.org]
> > > > > > > > >> Sent: 02 July 2015 12:56
> > > > > > > > >> To: dev@samza.apache.org
> > > > > > > > >> Subject: Re: Thoughts and obesrvations on Samza
> > > > > > > > >>
> > > > > > > > >> Very interesting thoughts.
> > > > > > > > >> From outside, I have always perceived Samza as a computing
> > > layer
> > > > > > over
> > > > > > > > >> Kafka.
> > > > > > > > >>
> > > > > > > > >> The question, maybe a bit provocative, is "should Samza
> be a
> > > > > > > sub-project
> > > > > > > > >> of Kafka then?"
> > > > > > > > >> Or does it make sense to keep it as a separate project
> with
> > a
> > > > > > separate
> > > > > > > > >> governance?
> > > > > > > > >>
> > > > > > > > >> Cheers,
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Gianmarco
> > > > > > > > >>
> > > > > > > > >>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com>
> > > > wrote:
> > > > > > > > >>>
> > > > > > > > >>> Overall, I agree to couple with Kafka more tightly.
> Because
> > > > Samza
> > > > > > de
> > > > > > > > >>> facto is based on Kafka, and it should leverage what
> Kafka
> > > has.
> > > > > At
> > > > > > > the
> > > > > > > > >>> same time, Kafka does not need to reinvent what Samza
> > already
> > > > > has.
> > > > > > I
> > > > > > > > >>> also like the idea of separating the ingestion and
> > > > > transformation.
> > > > > > > > >>>
> > > > > > > > >>> But it is a little difficult for me to image how the
> Samza
> > > will
> > > > > > look
> > > > > > > > >> like.
> > > > > > > > >>> And I feel Chris and Jay have a little difference in
> terms
> > of
> > > > how
> > > > > > > > >>> Samza should look like.
> > > > > > > > >>>
> > > > > > > > >>> *** Will it look like what Jay's code shows (A client of
> > > > Kakfa) ?
> > > > > > And
> > > > > > > > >>> user's application code calls this client?
> > > > > > > > >>>
> > > > > > > > >>> 1. If we make Samza be a library of Kafka (like what the
> > code
> > > > > > shows),
> > > > > > > > >>> how do we implement auto-balance and fault-tolerance? Are
> > > they
> > > > > > taken
> > > > > > > > >>> care by the Kafka broker or other mechanism, such as
> "Samza
> > > > > worker"
> > > > > > > > >>> (just make up the name) ?
> > > > > > > > >>>
> > > > > > > > >>> 2. What about other features, such as auto-scaling,
> shared
> > > > state,
> > > > > > > > >>> monitoring?
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> *** If we have Samza standalone, (is this what Chris
> > > suggests?)
> > > > > > > > >>>
> > > > > > > > >>> 1. we still need to ingest data from Kakfa and produce to
> > it.
> > > > > Then
> > > > > > it
> > > > > > > > >>> becomes the same as what Samza looks like now, except it
> > does
> > > > not
> > > > > > > rely
> > > > > > > > >>> on Yarn anymore.
> > > > > > > > >>>
> > > > > > > > >>> 2. if it is standalone, how can it leverage Kafka's
> > metrics,
> > > > > logs,
> > > > > > > > >>> etc? Use Kafka code as the dependency?
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> Thanks,
> > > > > > > > >>>
> > > > > > > > >>> Fang, Yan
> > > > > > > > >>> yanfang...@gmail.com
> > > > > > > > >>>
> > > > > > > > >>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <
> > > > > wangg...@gmail.com
> > > > > > >
> > > > > > > > >>> wrote:
> > > > > > > > >>>
> > > > > > > > >>>> Read through the code example and it looks good to me. A
> > few
> > > > > > > > >>>> thoughts regarding deployment:
> > > > > > > > >>>>
> > > > > > > > >>>> Today Samza deploys as executable runnable like:
> > > > > > > > >>>>
> > > > > > > > >>>> deploy/samza/bin/run-job.sh --config-factory=...
> > > > > > > > >> --config-path=file://...
> > > > > > > > >>>>
> > > > > > > > >>>> And this proposal advocate for deploying Samza more as
> > > > embedded
> > > > > > > > >>>> libraries in user application code (ignoring the
> > terminology
> > > > > since
> > > > > > > > >>>> it is not the
> > > > > > > > >>> same
> > > > > > > > >>>> as the prototype code):
> > > > > > > > >>>>
> > > > > > > > >>>> StreamTask task = new MyStreamTask(configs); Thread
> > thread =
> > > > new
> > > > > > > > >>>> Thread(task); thread.start();
> > > > > > > > >>>>
> > > > > > > > >>>> I think both of these deployment modes are important for
> > > > > different
> > > > > > > > >>>> types
> > > > > > > > >>> of
> > > > > > > > >>>> users. That said, I think making Samza purely standalone
> > is
> > > > > still
> > > > > > > > >>>> sufficient for either runnable or library modes.
> > > > > > > > >>>>
> > > > > > > > >>>> Guozhang
> > > > > > > > >>>>
> > > > > > > > >>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <
> > > > j...@confluent.io>
> > > > > > > > wrote:
> > > > > > > > >>>>>
> > > > > > > > >>>>> Looks like gmail mangled the code example, it was
> > supposed
> > > to
> > > > > > look
> > > > > > > > >>>>> like
> > > > > > > > >>>>> this:
> > > > > > > > >>>>>
> > > > > > > > >>>>> Properties props = new Properties();
> > > > > > > > >>>>> props.put("bootstrap.servers", "localhost:4242");
> > > > > StreamingConfig
> > > > > > > > >>>>> config = new StreamingConfig(props);
> > > > > > > > >>>>> config.subscribe("test-topic-1", "test-topic-2");
> > > > > > > > >>>>> config.processor(ExampleStreamProcessor.class);
> > > > > > > > >>>>> config.serialization(new StringSerializer(), new
> > > > > > > > >>>>> StringDeserializer()); KafkaStreaming container = new
> > > > > > > > >>>>> KafkaStreaming(config); container.run();
> > > > > > > > >>>>>
> > > > > > > > >>>>> -Jay
> > > > > > > > >>>>>
> > > > > > > > >>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <
> > > > j...@confluent.io>
> > > > > > > > >> wrote:
> > > > > > > > >>>>>
> > > > > > > > >>>>>> Hey guys,
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> This came out of some conversations Chris and I were
> > > having
> > > > > > > > >>>>>> around
> > > > > > > > >>>>> whether
> > > > > > > > >>>>>> it would make sense to use Samza as a kind of data
> > > ingestion
> > > > > > > > >>> framework
> > > > > > > > >>>>> for
> > > > > > > > >>>>>> Kafka (which ultimately lead to KIP-26 "copycat").
> This
> > > kind
> > > > > of
> > > > > > > > >>>> combined
> > > > > > > > >>>>>> with complaints around config and YARN and the
> > discussion
> > > > > around
> > > > > > > > >>>>>> how
> > > > > > > > >>> to
> > > > > > > > >>>>>> best do a standalone mode.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> So the thought experiment was, given that Samza was
> > > > basically
> > > > > > > > >>>>>> already totally Kafka specific, what if you just
> > embraced
> > > > that
> > > > > > > > >>>>>> and turned it
> > > > > > > > >>>> into
> > > > > > > > >>>>>> something less like a heavyweight framework and more
> > like
> > > a
> > > > > > > > >>>>>> third
> > > > > > > > >>> Kafka
> > > > > > > > >>>>>> client--a kind of "producing consumer" with state
> > > management
> > > > > > > > >>>> facilities.
> > > > > > > > >>>>>> Basically a library. Instead of a complex stream
> > > processing
> > > > > > > > >>>>>> framework
> > > > > > > > >>>>> this
> > > > > > > > >>>>>> would actually be a very simple thing, not much more
> > > > > complicated
> > > > > > > > >>>>>> to
> > > > > > > > >>> use
> > > > > > > > >>>>> or
> > > > > > > > >>>>>> operate than a Kafka consumer. As Chris said we
> thought
> > > > about
> > > > > it
> > > > > > > > >>>>>> a
> > > > > > > > >>> lot
> > > > > > > > >>>> of
> > > > > > > > >>>>>> what Samza (and the other stream processing systems
> were
> > > > > doing)
> > > > > > > > >>> seemed
> > > > > > > > >>>>> like
> > > > > > > > >>>>>> kind of a hangover from MapReduce.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Of course you need to ingest/output data to and from
> the
> > > > > stream
> > > > > > > > >>>>>> processing. But when we actually looked into how that
> > > would
> > > > > > > > >>>>>> work,
> > > > > > > > >>> Samza
> > > > > > > > >>>>>> isn't really an ideal data ingestion framework for a
> > bunch
> > > > of
> > > > > > > > >>> reasons.
> > > > > > > > >>>> To
> > > > > > > > >>>>>> really do that right you need a pretty different
> > internal
> > > > data
> > > > > > > > >>>>>> model
> > > > > > > > >>>> and
> > > > > > > > >>>>>> set of apis. So what if you split them and had an api
> > for
> > > > > Kafka
> > > > > > > > >>>>>> ingress/egress (copycat AKA KIP-26) and a separate api
> > for
> > > > > Kafka
> > > > > > > > >>>>>> transformation (Samza).
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> This would also allow really embracing the same
> > > terminology
> > > > > and
> > > > > > > > >>>>>> conventions. One complaint about the current state is
> > that
> > > > the
> > > > > > > > >>>>>> two
> > > > > > > > >>>>> systems
> > > > > > > > >>>>>> kind of feel bolted on. Terminology like "stream" vs
> > > "topic"
> > > > > and
> > > > > > > > >>>>> different
> > > > > > > > >>>>>> config and monitoring systems means you kind of have
> to
> > > > learn
> > > > > > > > >>>>>> Kafka's
> > > > > > > > >>>>> way,
> > > > > > > > >>>>>> then learn Samza's slightly different way, then kind
> of
> > > > > > > > >>>>>> understand
> > > > > > > > >>> how
> > > > > > > > >>>>> they
> > > > > > > > >>>>>> map to each other, which having walked a few people
> > > through
> > > > > this
> > > > > > > > >>>>>> is surprisingly tricky for folks to get.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Since I have been spending a lot of time on airplanes
> I
> > > > hacked
> > > > > > > > >>>>>> up an ernest but still somewhat incomplete prototype
> of
> > > what
> > > > > > > > >>>>>> this would
> > > > > > > > >>> look
> > > > > > > > >>>>>> like. This is just unceremoniously dumped into Kafka
> as
> > it
> > > > > > > > >>>>>> required a
> > > > > > > > >>>> few
> > > > > > > > >>>>>> changes to the new consumer. Here is the code:
> > > > > > > > >>>
> > > > > > >
> > > >
> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
> > > > > > > > >>> /apache/kafka/clients/streaming
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> For the purpose of the prototype I just liberally
> > renamed
> > > > > > > > >>>>>> everything
> > > > > > > > >>> to
> > > > > > > > >>>>>> try to align it with Kafka with no regard for
> > > compatibility.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> To use this would be something like this:
> > > > > > > > >>>>>> Properties props = new Properties();
> > > > > > > > >>>>>> props.put("bootstrap.servers", "localhost:4242");
> > > > > > > > >>>>>> StreamingConfig config = new
> > > > > > > > >>> StreamingConfig(props);
> > > > > > > > >>>>> config.subscribe("test-topic-1",
> > > > > > > > >>>>>> "test-topic-2");
> > > > > config.processor(ExampleStreamProcessor.class);
> > > > > > > > >>>>> config.serialization(new
> > > > > > > > >>>>>> StringSerializer(), new StringDeserializer());
> > > > KafkaStreaming
> > > > > > > > >>>> container =
> > > > > > > > >>>>>> new KafkaStreaming(config); container.run();
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> KafkaStreaming is basically the SamzaContainer;
> > > > > StreamProcessor
> > > > > > > > >>>>>> is basically StreamTask.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> So rather than putting all the class names in a file
> and
> > > > then
> > > > > > > > >>>>>> having
> > > > > > > > >>>> the
> > > > > > > > >>>>>> job assembled by reflection, you just instantiate the
> > > > > container
> > > > > > > > >>>>>> programmatically. Work is balanced over however many
> > > > instances
> > > > > > > > >>>>>> of
> > > > > > > > >>> this
> > > > > > > > >>>>> are
> > > > > > > > >>>>>> alive at any time (i.e. if an instance dies, new tasks
> > are
> > > > > added
> > > > > > > > >>>>>> to
> > > > > > > > >>> the
> > > > > > > > >>>>>> existing containers without shutting them down).
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> We would provide some glue for running this stuff in
> > YARN
> > > > via
> > > > > > > > >>>>>> Slider, Mesos via Marathon, and AWS using some of
> their
> > > > tools
> > > > > > > > >>>>>> but from the
> > > > > > > > >>>> point
> > > > > > > > >>>>> of
> > > > > > > > >>>>>> view of these frameworks these stream processing jobs
> > are
> > > > just
> > > > > > > > >>>> stateless
> > > > > > > > >>>>>> services that can come and go and expand and contract
> at
> > > > will.
> > > > > > > > >>>>>> There
> > > > > > > > >>> is
> > > > > > > > >>>>> no
> > > > > > > > >>>>>> more custom scheduler.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Here are some relevant details:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>   1. It is only ~1300 lines of code, it would get
> larger
> > > if
> > > > we
> > > > > > > > >>>>>>   productionized but not vastly larger. We really do
> > get a
> > > > ton
> > > > > > > > >>>>>> of
> > > > > > > > >>>>> leverage
> > > > > > > > >>>>>>   out of Kafka.
> > > > > > > > >>>>>>   2. Partition management is fully delegated to the
> new
> > > > > > consumer.
> > > > > > > > >>> This
> > > > > > > > >>>>>>   is nice since now any partition management strategy
> > > > > available
> > > > > > > > >>>>>> to
> > > > > > > > >>>> Kafka
> > > > > > > > >>>>>>   consumer is also available to Samza (and vice versa)
> > and
> > > > > with
> > > > > > > > >>>>>> the
> > > > > > > > >>>>> exact
> > > > > > > > >>>>>>   same configs.
> > > > > > > > >>>>>>   3. It supports state as well as state reuse
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Anyhow take a look, hopefully it is thought provoking.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> -Jay
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <
> > > > > > > > >>>> criccom...@apache.org>
> > > > > > > > >>>>>> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>> Hey all,
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I have had some discussions with Samza engineers at
> > > > LinkedIn
> > > > > > > > >>>>>>> and
> > > > > > > > >>>>> Confluent
> > > > > > > > >>>>>>> and we came up with a few observations and would like
> > to
> > > > > > > > >>>>>>> propose
> > > > > > > > >>> some
> > > > > > > > >>>>>>> changes.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> We've observed some things that I want to call out
> > about
> > > > > > > > >>>>>>> Samza's
> > > > > > > > >>>> design,
> > > > > > > > >>>>>>> and I'd like to propose some changes.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> * Samza is dependent upon a dynamic deployment
> system.
> > > > > > > > >>>>>>> * Samza is too pluggable.
> > > > > > > > >>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's
> > > > consumer
> > > > > > > > >>>>>>> APIs
> > > > > > > > >>> are
> > > > > > > > >>>>>>> trying to solve a lot of the same problems.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> All three of these issues are related, but I'll
> address
> > > > them
> > > > > in
> > > > > > > > >>> order.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Deployment
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Samza strongly depends on the use of a dynamic
> > deployment
> > > > > > > > >>>>>>> scheduler
> > > > > > > > >>>> such
> > > > > > > > >>>>>>> as
> > > > > > > > >>>>>>> YARN, Mesos, etc. When we initially built Samza, we
> bet
> > > > that
> > > > > > > > >>>>>>> there
> > > > > > > > >>>> would
> > > > > > > > >>>>>>> be
> > > > > > > > >>>>>>> one or two winners in this area, and we could support
> > > them,
> > > > > and
> > > > > > > > >>>>>>> the
> > > > > > > > >>>> rest
> > > > > > > > >>>>>>> would go away. In reality, there are many variations.
> > > > > > > > >>>>>>> Furthermore,
> > > > > > > > >>>> many
> > > > > > > > >>>>>>> people still prefer to just start their processors
> like
> > > > > normal
> > > > > > > > >>>>>>> Java processes, and use traditional deployment
> scripts
> > > such
> > > > > as
> > > > > > > > >>>>>>> Fabric,
> > > > > > > > >>>> Chef,
> > > > > > > > >>>>>>> Ansible, etc. Forcing a deployment system on users
> > makes
> > > > the
> > > > > > > > >>>>>>> Samza start-up process really painful for first time
> > > users.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Dynamic deployment as a requirement was also a bit
> of a
> > > > > > > > >>>>>>> mis-fire
> > > > > > > > >>>> because
> > > > > > > > >>>>>>> of
> > > > > > > > >>>>>>> a fundamental misunderstanding between the nature of
> > > batch
> > > > > jobs
> > > > > > > > >>>>>>> and
> > > > > > > > >>>>> stream
> > > > > > > > >>>>>>> processing jobs. Early on, we made conscious effort
> to
> > > > favor
> > > > > > > > >>>>>>> the
> > > > > > > > >>>> Hadoop
> > > > > > > > >>>>>>> (Map/Reduce) way of doing things, since it worked and
> > was
> > > > > well
> > > > > > > > >>>>> understood.
> > > > > > > > >>>>>>> One thing that we missed was that batch jobs have a
> > > > definite
> > > > > > > > >>>> beginning,
> > > > > > > > >>>>>>> and
> > > > > > > > >>>>>>> end, and stream processing jobs don't (usually). This
> > > leads
> > > > > to
> > > > > > > > >>>>>>> a
> > > > > > > > >>> much
> > > > > > > > >>>>>>> simpler scheduling problem for stream processors. You
> > > > > basically
> > > > > > > > >>>>>>> just
> > > > > > > > >>>>> need
> > > > > > > > >>>>>>> to find a place to start the processor, and start it.
> > The
> > > > way
> > > > > > > > >>>>>>> we run grids, at LinkedIn, there's no concept of a
> > > cluster
> > > > > > > > >>>>>>> being "full". We always
> > > > > > > > >>>> add
> > > > > > > > >>>>>>> more machines. The problem with coupling Samza with a
> > > > > scheduler
> > > > > > > > >>>>>>> is
> > > > > > > > >>>> that
> > > > > > > > >>>>>>> Samza (as a framework) now has to handle deployment.
> > This
> > > > > pulls
> > > > > > > > >>>>>>> in a
> > > > > > > > >>>>> bunch
> > > > > > > > >>>>>>> of things such as configuration distribution (config
> > > > stream),
> > > > > > > > >>>>>>> shell
> > > > > > > > >>>>> scrips
> > > > > > > > >>>>>>> (bin/run-job.sh, JobRunner), packaging (all the .tgz
> > > > stuff),
> > > > > > etc.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Another reason for requiring dynamic deployment was
> to
> > > > > support
> > > > > > > > >>>>>>> data locality. If you want to have locality, you need
> > to
> > > > put
> > > > > > > > >>>>>>> your
> > > > > > > > >>>> processors
> > > > > > > > >>>>>>> close to the data they're processing. Upon further
> > > > > > > > >>>>>>> investigation,
> > > > > > > > >>>>> though,
> > > > > > > > >>>>>>> this feature is not that beneficial. There is some
> good
> > > > > > > > >>>>>>> discussion
> > > > > > > > >>>> about
> > > > > > > > >>>>>>> some problems with it on SAMZA-335. Again, we took
> the
> > > > > > > > >>>>>>> Map/Reduce
> > > > > > > > >>>> path,
> > > > > > > > >>>>>>> but
> > > > > > > > >>>>>>> there are some fundamental differences between HDFS
> and
> > > > > Kafka.
> > > > > > > > >>>>>>> HDFS
> > > > > > > > >>>> has
> > > > > > > > >>>>>>> blocks, while Kafka has partitions. This leads to
> less
> > > > > > > > >>>>>>> optimization potential with stream processors on top
> of
> > > > > Kafka.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> This feature is also used as a crutch. Samza doesn't
> > have
> > > > any
> > > > > > > > >>>>>>> built
> > > > > > > > >>> in
> > > > > > > > >>>>>>> fault-tolerance logic. Instead, it depends on the
> > dynamic
> > > > > > > > >>>>>>> deployment scheduling system to handle restarts when
> a
> > > > > > > > >>>>>>> processor dies. This has
> > > > > > > > >>>>> made
> > > > > > > > >>>>>>> it very difficult to write a standalone Samza
> container
> > > > > > > > >> (SAMZA-516).
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Pluggability
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> In some cases pluggability is good, but I think that
> > > we've
> > > > > gone
> > > > > > > > >>>>>>> too
> > > > > > > > >>>> far
> > > > > > > > >>>>>>> with it. Currently, Samza has:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> * Pluggable config.
> > > > > > > > >>>>>>> * Pluggable metrics.
> > > > > > > > >>>>>>> * Pluggable deployment systems.
> > > > > > > > >>>>>>> * Pluggable streaming systems (SystemConsumer,
> > > > > SystemProducer,
> > > > > > > > >> etc).
> > > > > > > > >>>>>>> * Pluggable serdes.
> > > > > > > > >>>>>>> * Pluggable storage engines.
> > > > > > > > >>>>>>> * Pluggable strategies for just about every component
> > > > > > > > >>> (MessageChooser,
> > > > > > > > >>>>>>> SystemStreamPartitionGrouper, ConfigRewriter, etc).
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> There's probably more that I've forgotten, as well.
> > Some
> > > of
> > > > > > > > >>>>>>> these
> > > > > > > > >>> are
> > > > > > > > >>>>>>> useful, but some have proven not to be. This all
> comes
> > > at a
> > > > > > cost:
> > > > > > > > >>>>>>> complexity. This complexity is making it harder for
> our
> > > > users
> > > > > > > > >>>>>>> to
> > > > > > > > >>> pick
> > > > > > > > >>>> up
> > > > > > > > >>>>>>> and use Samza out of the box. It also makes it
> > difficult
> > > > for
> > > > > > > > >>>>>>> Samza developers to reason about what the
> > characteristics
> > > > of
> > > > > > > > >>>>>>> the container (since the characteristics change
> > depending
> > > > on
> > > > > > > > >>>>>>> which plugins are use).
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The issues with pluggability are most visible in the
> > > System
> > > > > > APIs.
> > > > > > > > >>> What
> > > > > > > > >>>>>>> Samza really requires to be functional is Kafka as
> its
> > > > > > > > >>>>>>> transport
> > > > > > > > >>>> layer.
> > > > > > > > >>>>>>> But
> > > > > > > > >>>>>>> we've conflated two unrelated use cases into one API:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> 1. Get data into/out of Kafka.
> > > > > > > > >>>>>>> 2. Process the data in Kafka.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The current System API supports both of these use
> > cases.
> > > > The
> > > > > > > > >>>>>>> problem
> > > > > > > > >>>> is,
> > > > > > > > >>>>>>> we
> > > > > > > > >>>>>>> actually want different features for each use case.
> By
> > > > > papering
> > > > > > > > >>>>>>> over
> > > > > > > > >>>>> these
> > > > > > > > >>>>>>> two use cases, and providing a single API, we've
> > > > introduced a
> > > > > > > > >>>>>>> ton of
> > > > > > > > >>>>> leaky
> > > > > > > > >>>>>>> abstractions.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> For example, what we'd really like in (2) is to have
> > > > > > > > >>>>>>> monotonically increasing longs for offsets (like
> > Kafka).
> > > > This
> > > > > > > > >>>>>>> would be at odds
> > > > > > > > >>> with
> > > > > > > > >>>>> (1),
> > > > > > > > >>>>>>> though, since different systems have different
> > > > > > > > >>>>> SCNs/Offsets/UUIDs/vectors.
> > > > > > > > >>>>>>> There was discussion both on the mailing list and the
> > SQL
> > > > > JIRAs
> > > > > > > > >>> about
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>> need for this.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The same thing holds true for replayability. Kafka
> > allows
> > > > us
> > > > > to
> > > > > > > > >>> rewind
> > > > > > > > >>>>>>> when
> > > > > > > > >>>>>>> we have a failure. Many other systems don't. In some
> > > cases,
> > > > > > > > >>>>>>> systems
> > > > > > > > >>>>> return
> > > > > > > > >>>>>>> null for their offsets (e.g. WikipediaSystemConsumer)
> > > > because
> > > > > > > > >>>>>>> they
> > > > > > > > >>>> have
> > > > > > > > >>>>> no
> > > > > > > > >>>>>>> offsets.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Partitioning is another example. Kafka supports
> > > > partitioning,
> > > > > > > > >>>>>>> but
> > > > > > > > >>> many
> > > > > > > > >>>>>>> systems don't. We model this by having a single
> > partition
> > > > for
> > > > > > > > >>>>>>> those systems. Still, other systems model
> partitioning
> > > > > > > > >> differently (e.g.
> > > > > > > > >>>>>>> Kinesis).
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The SystemAdmin interface is also a mess. Creating
> > > streams
> > > > > in a
> > > > > > > > >>>>>>> system-agnostic way is almost impossible. As is
> > modeling
> > > > > > > > >>>>>>> metadata
> > > > > > > > >>> for
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>> system (replication factor, partitions, location,
> etc).
> > > The
> > > > > > > > >>>>>>> list
> > > > > > > > >>> goes
> > > > > > > > >>>>> on.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Duplicate work
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> At the time that we began writing Samza, Kafka's
> > consumer
> > > > and
> > > > > > > > >>> producer
> > > > > > > > >>>>>>> APIs
> > > > > > > > >>>>>>> had a relatively weak feature set. On the
> > consumer-side,
> > > > you
> > > > > > > > >>>>>>> had two
> > > > > > > > >>>>>>> options: use the high level consumer, or the simple
> > > > consumer.
> > > > > > > > >>>>>>> The
> > > > > > > > >>>>> problem
> > > > > > > > >>>>>>> with the high-level consumer was that it controlled
> > your
> > > > > > > > >>>>>>> offsets, partition assignments, and the order in
> which
> > > you
> > > > > > > > >>>>>>> received messages. The
> > > > > > > > >>> problem
> > > > > > > > >>>>>>> with
> > > > > > > > >>>>>>> the simple consumer is that it's not simple. It's
> > basic.
> > > > You
> > > > > > > > >>>>>>> end up
> > > > > > > > >>>>> having
> > > > > > > > >>>>>>> to handle a lot of really low-level stuff that you
> > > > shouldn't.
> > > > > > > > >>>>>>> We
> > > > > > > > >>>> spent a
> > > > > > > > >>>>>>> lot of time to make Samza's KafkaSystemConsumer very
> > > > robust.
> > > > > It
> > > > > > > > >>>>>>> also allows us to support some cool features:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> * Per-partition message ordering and prioritization.
> > > > > > > > >>>>>>> * Tight control over partition assignment to support
> > > joins,
> > > > > > > > >>>>>>> global
> > > > > > > > >>>> state
> > > > > > > > >>>>>>> (if we want to implement it :)), etc.
> > > > > > > > >>>>>>> * Tight control over offset checkpointing.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> What we didn't realize at the time is that these
> > features
> > > > > > > > >>>>>>> should
> > > > > > > > >>>>> actually
> > > > > > > > >>>>>>> be in Kafka. A lot of Kafka consumers (not just Samza
> > > > stream
> > > > > > > > >>>> processors)
> > > > > > > > >>>>>>> end up wanting to do things like joins and partition
> > > > > > > > >>>>>>> assignment. The
> > > > > > > > >>>>> Kafka
> > > > > > > > >>>>>>> community has come to the same conclusion. They're
> > > adding a
> > > > > ton
> > > > > > > > >>>>>>> of upgrades into their new Kafka consumer
> > implementation.
> > > > To
> > > > > a
> > > > > > > > >>>>>>> large extent,
> > > > > > > > >>> it's
> > > > > > > > >>>>>>> duplicate work to what we've already done in Samza.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> On top of this, Kafka ended up taking a very similar
> > > > approach
> > > > > > > > >>>>>>> to
> > > > > > > > >>>> Samza's
> > > > > > > > >>>>>>> KafkaCheckpointManager implementation for handling
> > offset
> > > > > > > > >>>> checkpointing.
> > > > > > > > >>>>>>> Like Samza, Kafka's new offset management feature
> > stores
> > > > > offset
> > > > > > > > >>>>>>> checkpoints in a topic, and allows you to fetch them
> > from
> > > > the
> > > > > > > > >>>>>>> broker.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> A lot of this seems like a waste, since we could have
> > > > shared
> > > > > > > > >>>>>>> the
> > > > > > > > >>> work
> > > > > > > > >>>> if
> > > > > > > > >>>>>>> it
> > > > > > > > >>>>>>> had been done in Kafka from the get-go.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Vision
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> All of this leads me to a rather radical proposal.
> > Samza
> > > is
> > > > > > > > >>> relatively
> > > > > > > > >>>>>>> stable at this point. I'd venture to say that we're
> > near
> > > a
> > > > > 1.0
> > > > > > > > >>>> release.
> > > > > > > > >>>>>>> I'd
> > > > > > > > >>>>>>> like to propose that we take what we've learned, and
> > > begin
> > > > > > > > >>>>>>> thinking
> > > > > > > > >>>>> about
> > > > > > > > >>>>>>> Samza beyond 1.0. What would we change if we were
> > > starting
> > > > > from
> > > > > > > > >>>> scratch?
> > > > > > > > >>>>>>> My
> > > > > > > > >>>>>>> proposal is to:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> 1. Make Samza standalone the *only* way to run Samza
> > > > > > > > >>>>>>> processors, and eliminate all direct dependences on
> > YARN,
> > > > > > Mesos,
> > > > > > > > >> etc.
> > > > > > > > >>>>>>> 2. Make a definitive call to support only Kafka as
> the
> > > > stream
> > > > > > > > >>>> processing
> > > > > > > > >>>>>>> layer.
> > > > > > > > >>>>>>> 3. Eliminate Samza's metrics, logging, serialization,
> > and
> > > > > > > > >>>>>>> config
> > > > > > > > >>>>> systems,
> > > > > > > > >>>>>>> and simply use Kafka's instead.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> This would fix all of the issues that I outlined
> above.
> > > It
> > > > > > > > >>>>>>> should
> > > > > > > > >>> also
> > > > > > > > >>>>>>> shrink the Samza code base pretty dramatically.
> > > Supporting
> > > > > only
> > > > > > > > >>>>>>> a standalone container will allow Samza to be
> executed
> > on
> > > > > YARN
> > > > > > > > >>>>>>> (using Slider), Mesos (using Marathon/Aurora), or
> most
> > > > other
> > > > > > > > >>>>>>> in-house
> > > > > > > > >>>>> deployment
> > > > > > > > >>>>>>> systems. This should make life a lot easier for new
> > > users.
> > > > > > > > >>>>>>> Imagine
> > > > > > > > >>>>> having
> > > > > > > > >>>>>>> the hello-samza tutorial without YARN. The drop in
> > > mailing
> > > > > list
> > > > > > > > >>>> traffic
> > > > > > > > >>>>>>> will be pretty dramatic.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Coupling with Kafka seems long overdue to me. The
> > reality
> > > > is,
> > > > > > > > >>> everyone
> > > > > > > > >>>>>>> that
> > > > > > > > >>>>>>> I'm aware of is using Samza with Kafka. We basically
> > > > require
> > > > > it
> > > > > > > > >>>> already
> > > > > > > > >>>>> in
> > > > > > > > >>>>>>> order for most features to work. Those that are using
> > > other
> > > > > > > > >>>>>>> systems
> > > > > > > > >>>> are
> > > > > > > > >>>>>>> generally using it for ingest into Kafka (1), and
> then
> > > they
> > > > > do
> > > > > > > > >>>>>>> the processing on top. There is already discussion (
> > > > > > > > >>>
> > > > > > >
> > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
> > > > > > > > >>> 767
> > > > > > > > >>>>>>> )
> > > > > > > > >>>>>>> in Kafka to make ingesting into Kafka extremely easy.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Once we make the call to couple with Kafka, we can
> > > > leverage a
> > > > > > > > >>>>>>> ton of
> > > > > > > > >>>>> their
> > > > > > > > >>>>>>> ecosystem. We no longer have to maintain our own
> > config,
> > > > > > > > >>>>>>> metrics,
> > > > > > > > >>> etc.
> > > > > > > > >>>>> We
> > > > > > > > >>>>>>> can all share the same libraries, and make them
> better.
> > > > This
> > > > > > > > >>>>>>> will
> > > > > > > > >>> also
> > > > > > > > >>>>>>> allow us to share the consumer/producer APIs, and
> will
> > > let
> > > > us
> > > > > > > > >>> leverage
> > > > > > > > >>>>>>> their offset management and partition management,
> > rather
> > > > than
> > > > > > > > >>>>>>> having
> > > > > > > > >>>> our
> > > > > > > > >>>>>>> own. All of the coordinator stream code would go
> away,
> > as
> > > > > would
> > > > > > > > >>>>>>> most
> > > > > > > > >>>> of
> > > > > > > > >>>>>>> the
> > > > > > > > >>>>>>> YARN AppMaster code. We'd probably have to push some
> > > > > partition
> > > > > > > > >>>>> management
> > > > > > > > >>>>>>> features into the Kafka broker, but they're already
> > > moving
> > > > in
> > > > > > > > >>>>>>> that direction with the new consumer API. The
> features
> > we
> > > > > have
> > > > > > > > >>>>>>> for
> > > > > > > > >>>> partition
> > > > > > > > >>>>>>> assignment aren't unique to Samza, and seem like they
> > > > should
> > > > > be
> > > > > > > > >>>>>>> in
> > > > > > > > >>>> Kafka
> > > > > > > > >>>>>>> anyway. There will always be some niche usages which
> > will
> > > > > > > > >>>>>>> require
> > > > > > > > >>>> extra
> > > > > > > > >>>>>>> care and hence full control over partition
> assignments
> > > much
> > > > > > > > >>>>>>> like the
> > > > > > > > >>>>> Kafka
> > > > > > > > >>>>>>> low level consumer api. These would continue to be
> > > > supported.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> These items will be good for the Samza community.
> > They'll
> > > > > make
> > > > > > > > >>>>>>> Samza easier to use, and make it easier for
> developers
> > to
> > > > add
> > > > > > > > >>>>>>> new features.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Obviously this is a fairly large (and somewhat
> > backwards
> > > > > > > > >>> incompatible
> > > > > > > > >>>>>>> change). If we choose to go this route, it's
> important
> > > that
> > > > > we
> > > > > > > > >>> openly
> > > > > > > > >>>>>>> communicate how we're going to provide a migration
> path
> > > > from
> > > > > > > > >>>>>>> the
> > > > > > > > >>>>> existing
> > > > > > > > >>>>>>> APIs to the new ones (if we make incompatible
> > changes). I
> > > > > think
> > > > > > > > >>>>>>> at a minimum, we'd probably need to provide a wrapper
> > to
> > > > > allow
> > > > > > > > >>>>>>> existing StreamTask implementations to continue
> running
> > > on
> > > > > the
> > > > > > > > >> new container.
> > > > > > > > >>>>> It's
> > > > > > > > >>>>>>> also important that we openly communicate about
> timing,
> > > and
> > > > > > > > >>>>>>> stages
> > > > > > > > >>> of
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>> migration.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> If you made it this far, I'm sure you have opinions.
> :)
> > > > > Please
> > > > > > > > >>>>>>> send
> > > > > > > > >>>> your
> > > > > > > > >>>>>>> thoughts and feedback.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Cheers,
> > > > > > > > >>>>>>> Chris
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> --
> > > > > > > > >>>> -- Guozhang
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>

Re: Thoughts and obesrvations on Samza

Reply via email to