Hi, all Samza-lovers, This question on the relationship of Kafka KStream (KIP-28) and Samza has come up a couple times recently. So we wanted to clarify where we stand at LinkedIn in terms of this discussion.
Samza has historically had a symbiotic relationship with Kafka and will continue to work very well with Kafka. Earlier in the year, we had an in-depth discussion exploring an even deeper integration with Kafka. After hitting multiple practical issues (e.g. Apache rules) and technical issues we had to give up on that idea. As a fall out of the discussion, the Kafka community is adding some of the basic event processing capabilities into Kafka core directly. The basic callback/push style programming model by itself is a great addition to the Kafka API set. However at LinkedIn, we continue to be firmly committed to Samza as our stream processing framework. Although KStream is a nice addition to Kafka stack, our goals for Samza are broader. There are some key technical differences that makes Samza the right strategy for us. 1. Support for non-kafka systems : At LinkedIn a larger percentage of our streaming jobs use Databus as an input source. For any such non-Kafka source, although the CopyCat connector framework gives a common model for pulling data out of a source and pushing it into Kafka, it introduces yet another piece of infrastructure that we have to operate and manage. Also for any companies who are already on AWS, Google Compute, Azure etc. asking them to deploy and operate kafka in AWS instead of using the natively supported services like Kinesis, Google Cloud pub-sub, etc. etc. can potentially be a non-starter. With more acquisitions at LinkedIn that use AWS we are also facing this first hand. The Samza community has a healthy set of system producers/consumers which are in the works (Kinesis, ActiveMQ, ElasticSearch, HDFS, etc.). 2. We run Samza as a Stream Processing Service at LinkedIn. This is fundamentally different from KStream. This is similar to AWS Lambda and Google Cloud Dataflow, Azure Stream Insight and similar services. The service makes it much easier for developers to get their stream processing jobs up and running in production by helping with the most common problems like monitoring, dashboards, auto-scale, rolling upgrades and such. The truth is that if the stream processing application is stateless then some of these common problems are not as involved and can be solved even on regular IaaS platforms like EC2 and such. Arguably stateless applications can be built easily on top of the native APIs from the input source like kafka, kinesis etc. The place where Samza shines is with stateful stream processing applications. When a Samza application uses the local rocks DB based state, the application needs special care in terms of rolling upgrades, addition/removal of containers/machines, temporary machine failures, capacity management. We have already done some really good work in Samza 0.10 to make sure that we don't reseed the state from kafka (i.e. host-affinity feature that allows to reuse the local states). In the absence of this feature, we had multiple production issues caused due to network saturation during state reseeding from kafka. The problems with stateful applications are similar to problems encountered when building no-sql databases and other data systems. There are surely some scenarios where customers don't want any YARN dependency and want to run their stream processing application on a dedicated set of nodes. This is where KStream clearly has an advantage over current Samza. Prior to KStream we had a patch in Samza which also solved the same problem (SAMZA-516). We do expect to finish this patch soon and formally support Stand Alone Samza. 3. Operators for Stream Processing and SQL : At LinkedIn, there is a growing need to iterate Samza jobs faster in the loop of implement, deploy, revise the code, and deploy again. A key bottleneck that slows down this iteration is the implementation of a Samza job. It has long-been recognized in the Samza community that there is a strong need for a high-level language support to shorten this iterative process. Since last year, we have identified SQL as the user-facing high-level language and completed the high-level design and started prototyping it in Samza. The prototype starts with a set of physical operators which are crucial to the correctness of streaming processing, namely, the window operator, aggregation, and join. KStream adopts some of these core ideas. However, our view in Samza’s SQL support goes beyond what’s covered in KStream. We want Samza’s SQL support to be as easy as Google Dataflow and Azure Stream Analytics, in which a user can upload a query statement and the system will parse the query, translate it into a distributed execution plan, allocate the containers and stream resources in a cluster, and deploy it. To support this grand vision, our effort in building the SQL operators API, the query planner and optimizers is vastly different from what KStream covers, which only covers a single node programming interface. Independent of these strategic differences, one big aspect for us is also the fact that Samza is an established and mature system which we have successfully operationalized and has been running in production for a few years. Thanks! -Yi Pan Samza @ LinkedIn