Re: Samza and KStreams (KIP-28): LinkedIn's POV

Renato Marroquín Mogrovejo Sun, 04 Oct 2015 21:45:59 -0700

Hi Yi,

Thanks a lot for sharing this POV, makes a lot more sense to me now what
both (KStreams and Samza) are targeting to.



Best,

Renato M.

2015-10-03 20:13 GMT+02:00 Roger Hoover <roger.hoo...@gmail.com>:

> Hi Yi,
>
> Thank you for sharing this update and perspective.  I tend to agree that
> for simple, stateless cases, things could be easier and hopefully KStreams
> may help with that.  I also appreciate a lot of features that Samza already
> supports for operations.  As previously discussed, the biggest request I
> have is being able to run Samza without YARN, under something like
> Kubernetes instead.
>
> Also, I'm curious.  What's the current state of the Samza SQL physical
> operators?  Are they used in production yet?  Is there a doc on how to use
> them?
>
> Thanks,
>
> Roger
>
>
>
> On Fri, Oct 2, 2015 at 1:54 PM, Yi Pan <nickpa...@gmail.com> wrote:
>
> > Hi, all Samza-lovers,
> >
> > This question on the relationship of Kafka KStream (KIP-28) and Samza has
> > come up a couple times recently. So we wanted to clarify where we stand
> at
> > LinkedIn in terms of this discussion.
> >
> > Samza has historically had a symbiotic relationship with Kafka and will
> > continue to work very well with Kafka.  Earlier in the year, we had an
> > in-depth discussion exploring an even deeper integration with Kafka.
> After
> > hitting multiple practical issues (e.g. Apache rules) and technical
> issues
> > we had to give up on that idea.  As a fall out of the discussion, the
> Kafka
> > community is adding some of the basic event processing capabilities into
> > Kafka core directly. The basic callback/push style programming model by
> > itself is a great addition to the Kafka API set.
> >
> > However at LinkedIn, we continue to be firmly committed to Samza as our
> > stream processing framework. Although KStream is a nice addition to Kafka
> > stack, our goals for Samza are broader. There are some key technical
> > differences that makes Samza the right strategy for us.
> >
> > 1.  Support for non-kafka systems :
> >
> > At LinkedIn a larger percentage of our streaming jobs use Databus as an
> > input source.   For any such non-Kafka source, although the CopyCat
> > connector framework gives a common model for pulling data out of a source
> > and pushing it into Kafka, it introduces yet another piece of
> > infrastructure that we have to operate and manage.  Also for any
> companies
> > who are already on AWS, Google Compute, Azure etc.  asking them to deploy
> > and operate kafka in AWS instead of using the natively supported services
> > like Kinesis, Google Cloud pub-sub, etc. etc. can potentially be a
> > non-starter.  With more acquisitions at LinkedIn that use AWS we are also
> > facing this first hand.  The Samza community has a healthy set of system
> > producers/consumers which are in the works (Kinesis, ActiveMQ,
> > ElasticSearch, HDFS, etc.).
> >
> > 2. We run Samza as a Stream Processing Service at LinkedIn. This is
> > fundamentally different from KStream.
> >
> > This is similar to AWS Lambda and Google Cloud Dataflow, Azure Stream
> > Insight and similar services.  The service makes it much easier for
> > developers to get their stream processing jobs up and running in
> production
> > by helping with the most common problems like monitoring, dashboards,
> > auto-scale, rolling upgrades and such.
> >
> > The truth is that if the stream processing application is stateless then
> > some of these common problems are not as involved and can be solved even
> on
> > regular IaaS platforms like EC2 and such.   Arguably stateless
> applications
> > can be built easily on top of the native APIs from the input source like
> > kafka, kinesis etc.
> >
> > The place where Samza shines is with stateful stream processing
> > applications.  When a Samza application uses the local rocks DB based
> > state, the application needs special care in terms of rolling upgrades,
> > addition/removal of containers/machines, temporary machine failures,
> > capacity management.  We have already done some really good work in Samza
> > 0.10 to make sure that we don't reseed the state from kafka (i.e.
> > host-affinity feature that allows to reuse the local states).  In the
> > absence of this feature, we had multiple production issues caused due to
> > network saturation during state reseeding from kafka.   The problems with
> > stateful applications are similar to problems encountered when building
> > no-sql databases and other data systems.
> >
> > There are surely some scenarios where customers don't want any YARN
> > dependency and want to run their stream processing application on a
> > dedicated set of nodes.  This is where KStream clearly has an advantage
> > over current Samza. Prior to KStream we had a patch in Samza which also
> > solved the same problem (SAMZA-516). We do expect to finish this patch
> soon
> > and formally support Stand Alone Samza.
> >
> > 3. Operators for Stream Processing and SQL :
> >
> > At LinkedIn, there is a growing need to iterate Samza jobs faster in the
> > loop of implement, deploy, revise the code, and deploy again. A key
> > bottleneck that slows down this iteration is the implementation of a
> Samza
> > job. It has long-been recognized in the Samza community that there is a
> > strong need for a high-level language support to shorten this iterative
> > process. Since last year, we have identified SQL as the user-facing
> > high-level language and completed the high-level design and started
> > prototyping it in Samza. The prototype starts with a set of physical
> > operators which are crucial to the correctness of streaming processing,
> > namely, the window operator, aggregation, and join. KStream adopts some
> of
> > these core ideas. However, our view in Samza’s SQL support goes beyond
> > what’s covered in KStream. We want Samza’s SQL support to be as easy as
> > Google Dataflow and Azure Stream Analytics, in which a user can upload a
> > query statement and the system will parse the query, translate it into a
> > distributed execution plan, allocate the containers and stream resources
> in
> > a cluster, and deploy it. To support this grand vision, our effort in
> > building the SQL operators API, the query planner and optimizers is
> vastly
> > different from what KStream covers, which only covers a single node
> > programming interface.
> >
> > Independent of these strategic differences, one big aspect for us is also
> > the fact that Samza is an established and mature system which we have
> > successfully operationalized and has been running in production for a few
> > years.
> >
> >
> >
> > Thanks!
> >
> > -Yi Pan
> > Samza @ LinkedIn
> >
>

Re: Samza and KStreams (KIP-28): LinkedIn's POV

Reply via email to