Hi Yi, Thanks a lot for sharing this POV, makes a lot more sense to me now what both (KStreams and Samza) are targeting to.
Best, Renato M. 2015-10-03 20:13 GMT+02:00 Roger Hoover <roger.hoo...@gmail.com>: > Hi Yi, > > Thank you for sharing this update and perspective. I tend to agree that > for simple, stateless cases, things could be easier and hopefully KStreams > may help with that. I also appreciate a lot of features that Samza already > supports for operations. As previously discussed, the biggest request I > have is being able to run Samza without YARN, under something like > Kubernetes instead. > > Also, I'm curious. What's the current state of the Samza SQL physical > operators? Are they used in production yet? Is there a doc on how to use > them? > > Thanks, > > Roger > > > > On Fri, Oct 2, 2015 at 1:54 PM, Yi Pan <nickpa...@gmail.com> wrote: > > > Hi, all Samza-lovers, > > > > This question on the relationship of Kafka KStream (KIP-28) and Samza has > > come up a couple times recently. So we wanted to clarify where we stand > at > > LinkedIn in terms of this discussion. > > > > Samza has historically had a symbiotic relationship with Kafka and will > > continue to work very well with Kafka. Earlier in the year, we had an > > in-depth discussion exploring an even deeper integration with Kafka. > After > > hitting multiple practical issues (e.g. Apache rules) and technical > issues > > we had to give up on that idea. As a fall out of the discussion, the > Kafka > > community is adding some of the basic event processing capabilities into > > Kafka core directly. The basic callback/push style programming model by > > itself is a great addition to the Kafka API set. > > > > However at LinkedIn, we continue to be firmly committed to Samza as our > > stream processing framework. Although KStream is a nice addition to Kafka > > stack, our goals for Samza are broader. There are some key technical > > differences that makes Samza the right strategy for us. > > > > 1. Support for non-kafka systems : > > > > At LinkedIn a larger percentage of our streaming jobs use Databus as an > > input source. For any such non-Kafka source, although the CopyCat > > connector framework gives a common model for pulling data out of a source > > and pushing it into Kafka, it introduces yet another piece of > > infrastructure that we have to operate and manage. Also for any > companies > > who are already on AWS, Google Compute, Azure etc. asking them to deploy > > and operate kafka in AWS instead of using the natively supported services > > like Kinesis, Google Cloud pub-sub, etc. etc. can potentially be a > > non-starter. With more acquisitions at LinkedIn that use AWS we are also > > facing this first hand. The Samza community has a healthy set of system > > producers/consumers which are in the works (Kinesis, ActiveMQ, > > ElasticSearch, HDFS, etc.). > > > > 2. We run Samza as a Stream Processing Service at LinkedIn. This is > > fundamentally different from KStream. > > > > This is similar to AWS Lambda and Google Cloud Dataflow, Azure Stream > > Insight and similar services. The service makes it much easier for > > developers to get their stream processing jobs up and running in > production > > by helping with the most common problems like monitoring, dashboards, > > auto-scale, rolling upgrades and such. > > > > The truth is that if the stream processing application is stateless then > > some of these common problems are not as involved and can be solved even > on > > regular IaaS platforms like EC2 and such. Arguably stateless > applications > > can be built easily on top of the native APIs from the input source like > > kafka, kinesis etc. > > > > The place where Samza shines is with stateful stream processing > > applications. When a Samza application uses the local rocks DB based > > state, the application needs special care in terms of rolling upgrades, > > addition/removal of containers/machines, temporary machine failures, > > capacity management. We have already done some really good work in Samza > > 0.10 to make sure that we don't reseed the state from kafka (i.e. > > host-affinity feature that allows to reuse the local states). In the > > absence of this feature, we had multiple production issues caused due to > > network saturation during state reseeding from kafka. The problems with > > stateful applications are similar to problems encountered when building > > no-sql databases and other data systems. > > > > There are surely some scenarios where customers don't want any YARN > > dependency and want to run their stream processing application on a > > dedicated set of nodes. This is where KStream clearly has an advantage > > over current Samza. Prior to KStream we had a patch in Samza which also > > solved the same problem (SAMZA-516). We do expect to finish this patch > soon > > and formally support Stand Alone Samza. > > > > 3. Operators for Stream Processing and SQL : > > > > At LinkedIn, there is a growing need to iterate Samza jobs faster in the > > loop of implement, deploy, revise the code, and deploy again. A key > > bottleneck that slows down this iteration is the implementation of a > Samza > > job. It has long-been recognized in the Samza community that there is a > > strong need for a high-level language support to shorten this iterative > > process. Since last year, we have identified SQL as the user-facing > > high-level language and completed the high-level design and started > > prototyping it in Samza. The prototype starts with a set of physical > > operators which are crucial to the correctness of streaming processing, > > namely, the window operator, aggregation, and join. KStream adopts some > of > > these core ideas. However, our view in Samza’s SQL support goes beyond > > what’s covered in KStream. We want Samza’s SQL support to be as easy as > > Google Dataflow and Azure Stream Analytics, in which a user can upload a > > query statement and the system will parse the query, translate it into a > > distributed execution plan, allocate the containers and stream resources > in > > a cluster, and deploy it. To support this grand vision, our effort in > > building the SQL operators API, the query planner and optimizers is > vastly > > different from what KStream covers, which only covers a single node > > programming interface. > > > > Independent of these strategic differences, one big aspect for us is also > > the fact that Samza is an established and mature system which we have > > successfully operationalized and has been running in production for a few > > years. > > > > > > > > Thanks! > > > > -Yi Pan > > Samza @ LinkedIn > > >