Samza and KStreams (KIP-28): LinkedIn's POV

Yi Pan Fri, 02 Oct 2015 13:55:40 -0700

Hi, all Samza-lovers,

This question on the relationship of Kafka KStream (KIP-28) and Samza has
come up a couple times recently. So we wanted to clarify where we stand at
LinkedIn in terms of this discussion.


Samza has historically had a symbiotic relationship with Kafka and will
continue to work very well with Kafka.  Earlier in the year, we had an
in-depth discussion exploring an even deeper integration with Kafka.  After
hitting multiple practical issues (e.g. Apache rules) and technical issues
we had to give up on that idea.  As a fall out of the discussion, the Kafka
community is adding some of the basic event processing capabilities into
Kafka core directly. The basic callback/push style programming model by
itself is a great addition to the Kafka API set.

However at LinkedIn, we continue to be firmly committed to Samza as our
stream processing framework. Although KStream is a nice addition to Kafka
stack, our goals for Samza are broader. There are some key technical
differences that makes Samza the right strategy for us.

1.  Support for non-kafka systems :

At LinkedIn a larger percentage of our streaming jobs use Databus as an
input source.   For any such non-Kafka source, although the CopyCat
connector framework gives a common model for pulling data out of a source
and pushing it into Kafka, it introduces yet another piece of
infrastructure that we have to operate and manage.  Also for any companies
who are already on AWS, Google Compute, Azure etc.  asking them to deploy
and operate kafka in AWS instead of using the natively supported services
like Kinesis, Google Cloud pub-sub, etc. etc. can potentially be a
non-starter.  With more acquisitions at LinkedIn that use AWS we are also
facing this first hand.  The Samza community has a healthy set of system
producers/consumers which are in the works (Kinesis, ActiveMQ,
ElasticSearch, HDFS, etc.).

2. We run Samza as a Stream Processing Service at LinkedIn. This is
fundamentally different from KStream.

This is similar to AWS Lambda and Google Cloud Dataflow, Azure Stream
Insight and similar services.  The service makes it much easier for
developers to get their stream processing jobs up and running in production
by helping with the most common problems like monitoring, dashboards,
auto-scale, rolling upgrades and such.

The truth is that if the stream processing application is stateless then
some of these common problems are not as involved and can be solved even on
regular IaaS platforms like EC2 and such.   Arguably stateless applications
can be built easily on top of the native APIs from the input source like
kafka, kinesis etc.

The place where Samza shines is with stateful stream processing
applications.  When a Samza application uses the local rocks DB based
state, the application needs special care in terms of rolling upgrades,
addition/removal of containers/machines, temporary machine failures,
capacity management.  We have already done some really good work in Samza
0.10 to make sure that we don't reseed the state from kafka (i.e.
host-affinity feature that allows to reuse the local states).  In the
absence of this feature, we had multiple production issues caused due to
network saturation during state reseeding from kafka.   The problems with
stateful applications are similar to problems encountered when building
no-sql databases and other data systems.

There are surely some scenarios where customers don't want any YARN
dependency and want to run their stream processing application on a
dedicated set of nodes.  This is where KStream clearly has an advantage
over current Samza. Prior to KStream we had a patch in Samza which also
solved the same problem (SAMZA-516). We do expect to finish this patch soon
and formally support Stand Alone Samza.

3. Operators for Stream Processing and SQL :

At LinkedIn, there is a growing need to iterate Samza jobs faster in the
loop of implement, deploy, revise the code, and deploy again. A key
bottleneck that slows down this iteration is the implementation of a Samza
job. It has long-been recognized in the Samza community that there is a
strong need for a high-level language support to shorten this iterative
process. Since last year, we have identified SQL as the user-facing
high-level language and completed the high-level design and started
prototyping it in Samza. The prototype starts with a set of physical
operators which are crucial to the correctness of streaming processing,
namely, the window operator, aggregation, and join. KStream adopts some of
these core ideas. However, our view in Samza’s SQL support goes beyond
what’s covered in KStream. We want Samza’s SQL support to be as easy as
Google Dataflow and Azure Stream Analytics, in which a user can upload a
query statement and the system will parse the query, translate it into a
distributed execution plan, allocate the containers and stream resources in
a cluster, and deploy it. To support this grand vision, our effort in
building the SQL operators API, the query planner and optimizers is vastly
different from what KStream covers, which only covers a single node
programming interface.

Independent of these strategic differences, one big aspect for us is also
the fact that Samza is an established and mature system which we have
successfully operationalized and has been running in production for a few
years.



Thanks!

-Yi Pan
Samza @ LinkedIn

Samza and KStreams (KIP-28): LinkedIn's POV

Reply via email to