Yi, What you just summarized makes a whole lot more sense to me. Shamelessly I am looking at this shift as a customer with a production workflow riding on it so I am looking for some kind of consistency into the future of Samza. This makes me feel a lot better about it.
Thank you! On Sun, Jul 12, 2015 at 10:44 PM, Yi Pan <nickpa...@gmail.com> wrote: > Just to make it explicitly clear what I am proposing, here is a version of > more detailed description: > > The fourth option (in addition to what Jakob summarized) we are proposing > is: > > - Recharter Samza to “stream processing as a service” > > - The current Samza core (the basic transformation API w/ basic partition > and offset management build-in) will be moved to Kafka Streams (i.e. part > of Kafka) and supports “run-as-a-library” > > - Deprecate the SystemConsumers and SystemProducers APIs and move them to > Copycat > > - The current SQL development: > > * physical operators and a Trident-like stream API should stay in Kafka > Streams as libraries, enabling any standalone deployment to use the core > window/join functions > > * query parser/planner and execution on top of a distributed service > should stay in new Samza (i.e. “stream processing as a service”) > > - Advanced features related to job scheduling/state management stays in new > Samza (i.e. “streaming processing as a service”) > > * Any advanced PartitionManager implementation that can be plugged into > Kafka Streams > > * Any auto-scaling, dynamic configuration via coordinator stream > > * Any advanced state management s.t. host-affinity etc. > > > Pros: > > - W/ the current Samza core as Kafka Streams and move the ingestion to > Copycat, we achieved most of the goals in the initial proposal: > > * Tighter coupling w/ Kafka > > * Reuse Kafka’s build-in functionalities, such as offset manager, basic > partition distribution > > * Separation of ingestion vs transformation APIs > > * offload a lot of system-specific configuration to Kafka Streams and > Copycat (i.e. SystemFactory configure, serde configure, etc.) > > * remove YARN dependency and make standalone deployment easy. As > Guozhang mentioned, it would be really easy to start a process that > internally run Kafka Streams as library. > > - By re-chartering Samza as “stream processing as a service”, we address > the concern regarding to > > * Pluggable partition management > > * Running in a distributed cluster to manage process lifecycle, > fault-tolerance, resource-allocation, etc. > > * More advanced features s.t. host-affinity, auto-scaling, and dynamic > configure changes, etc. > > > Regarding to the code and community organization, I think the following may > be the best: > > Code: > > - A Kafka sub-project Kafka Streams to hold samza-core, samza-kv-store, and > the physical operator layer as library in SQL: this would allow better > alignment w/ Kafka, in code, doc, and branding > > - Retain the current Samza project just to keep > > * A pluggable explicit partition management in Kafka Streams client > > * Integration w/ cluster-management systems for advanced features: > > * host-affinity, auto-scaling,, dynamic configuration, etc. > > * It will fully depend on the Kafka Streams API and remove all support > for SystemConsumers/SystemProducers in the future > > Community: (this is almost the same as what Chris proposed) > > - Kafka Streams: the current Samza community should be supporting this > effort together with some Kafka members, since most of the code here will > be from samza-core, samza-kv-store, and samza-sql. > > - new Samza: the current Samza community should continue serve the course > to support more advanced features to run Kafka Streams as a service. > Arguably, the new Samza framework may be used to run Copycat workers as > well, at least to manage Copycat worker’s lifecycle in a clustered > environment. Hence, it would stay as a general stream processing framework > that takes in any source and output to any destination, just the transport > system is fixed to Kafka. > > On Sun, Jul 12, 2015 at 7:29 PM, Yi Pan <nickpa...@gmail.com> wrote: > > > Hi, Chris, > > > > Thanks for sending out this concrete set of points here. I agree w/ all > > but have a slight different point view on 8). > > > > My view on this is: instead of sunset Samza as TLP, can we re-charter the > > scope of Samza to be the home for "running streaming process as a > service"? > > > > My main motivation is from the following points from a long internal > > discussion in LinkedIn: > > > > - There is a clear ask for pluggable partition management, like we do in > > LinkedIn, and as Ben Kirwin has mentioned in > > > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3ccacux-d-yjx++2gnf_1laf10kyuvyamg7up_dt19v0znmmhb...@mail.gmail.com%3E > > - There are concerns on lack of support for running stream processing in > a > > cluster: lifecycle management, resource allocation, fault tolerance, etc. > > - There is a question to how to support more advanced features s.t. > > host-affinity, auto-scaling, and dynamic configuration in Samza jobs, as > > raised by Martin here: > > > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3c0d66efd0-b7cd-4e4e-8b2f-2716167c3...@kleppmann.com%3E > > > > We have use cases that need to address all the above three cases and most > > of the functions are all in the current Samza project, in some flavor. We > > are all supporting to merge the samza-core functionalities into Kafka > > Streams, but there is a question where we keep these functions in the > > future. One option is to start a new project that includes these > functions > > that are closely related w/ "run stream-processing as-a-service", while > > another personally more attractive option is to re-charter Samza project > > just do "run stream processing as-a-service". We can avoid the overhead > of > > re-starting another community for this project. Personally, I felt that > > here are the benefits we should be getting: > > > > 1. We have already agreed mostly that Kafka Streams API would allow some > > pluggable partition management functions. Hence, the advanced partition > > management can live out-side the new Kafka Streams core w/o affecting the > > run-as-a-library model in Kafka Streams. > > 2. The integration w/ cluster management system and advanced features > > listed above stays in the same project and allow existing users enjoy > > no-impact migration to Kafka Stream as the core. That also addresses > Tim's > > question on "removing the support for YARN". > > 3. A separate project for stream-processing-as-a-service also allow the > > new Kafka Streams being independent to any cluster management and just > > focusing on stream process core functions, while leaving the functions > that > > requires cluster-resource and state management to a separate layer. > > > > Please feel free to comment. Thanks! > > > > On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <criccom...@apache.org> > > wrote: > > > >> Hey all, > >> > >> I want to start by saying that I'm absolutely thrilled to be a part of > >> this > >> community. The amount of level-headed, thoughtful, educated discussion > >> that's gone on over the past ~10 days is overwhelming. Wonderful. > >> > >> It seems like discussion is waning a bit, and we've reached some > >> conclusions. There are several key emails in this threat, which I want > to > >> call out: > >> > >> 1. Jakob's summary of the three potential ways forward. > >> > >> > >> > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E > >> 2. Julian's call out that we should be focusing on community over code. > >> > >> > >> > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E > >> 3. Martin's summary about the benefits of merging communities. > >> > >> > >> > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E > >> 4. Jakob's comments about the distinction between community and code > >> paths. > >> > >> > >> > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E > >> > >> I agree with the comments on all of these emails. I think Martin's > summary > >> of his position aligns very closely with my own. To that end, I think we > >> should get concrete about what the proposal is, and call a vote on it. > >> Given that Jay, Martin, and I seem to be aligning fairly closely, I > think > >> we should start with: > >> > >> 1. [community] Make Samza a subproject of Kafka. > >> 2. [community] Make all Samza PMC/committers committers of the > subproject. > >> 3. [community] Migrate Samza's website/documentation into Kafka's. > >> 4. [code] Have the Samza community and the Kafka community start a > >> from-scratch reboot together in the new Kafka subproject. We can > >> borrow/copy & paste significant chunks of code from Samza's code base. > >> 5. [code] The subproject would intentionally eliminate support for both > >> other streaming systems and all deployment systems. > >> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26 > >> (copy cat) > >> 7. [code] Attempt to provide a bridge from the new subproject's > processor > >> interface to our legacy StreamTask interface. > >> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka > >> subproject that has a fault-tolerant container with state management. > >> > >> It's likely that (6) and (7) won't be fully drop-in. Still, the closer > we > >> can get, the better it's going to be for our existing community. > >> > >> One thing that I didn't touch on with (2) is whether any Samza PMC > members > >> should be rolled into Kafka PMC membership as well (though, Jay and > Jakob > >> are already PMC members on both). I think that Samza's community > deserves > >> a > >> voice on the PMC, so I'd propose that we roll at least a few PMC members > >> into the Kafka PMC, but I don't have a strong framework for which people > >> to > >> pick. > >> > >> Before (8), I think that Samza's TLP can continue to commit bug fixes > and > >> patches as it sees fit, provided that we openly communicate that we > won't > >> necessarily migrate new features to the new subproject, and that the TLP > >> will be shut down after the migration to the Kafka subproject occurs. > >> > >> Jakob, I could use your guidance here about about how to achieve this > from > >> an Apache process perspective (sorry). > >> > >> * Should I just call a vote on this proposal? > >> * Should it happen on dev or private? > >> * Do committers have binding votes, or just PMC? > >> > >> Having trouble finding much detail on the Apache wikis. :( > >> > >> Cheers, > >> Chris > >> > >> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <yanfang...@gmail.com> wrote: > >> > >> > Thanks, Jay. This argument persuaded me actually. :) > >> > > >> > Fang, Yan > >> > yanfang...@gmail.com > >> > > >> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <j...@confluent.io> wrote: > >> > > >> > > Hey Yan, > >> > > > >> > > Yeah philosophically I think the argument is that you should capture > >> the > >> > > stream in Kafka independent of the transformation. This is > obviously a > >> > > Kafka-centric view point. > >> > > > >> > > Advantages of this: > >> > > - In practice I think this is what e.g. Storm people often end up > >> doing > >> > > anyway. You usually need to throttle any access to a live serving > >> > database. > >> > > - Can have multiple subscribers and they get the same thing without > >> > > additional load on the source system. > >> > > - Applications can tap into the stream if need be by subscribing. > >> > > - You can debug your transformation by tailing the Kafka topic with > >> the > >> > > console consumer > >> > > - Can tee off the same data stream for batch analysis or Lambda arch > >> > style > >> > > re-processing > >> > > > >> > > The disadvantage is that it will use Kafka resources. But the idea > is > >> > > eventually you will have multiple subscribers to any data source (at > >> > least > >> > > for monitoring) so you will end up there soon enough anyway. > >> > > > >> > > Down the road the technical benefit is that I think it gives us a > good > >> > path > >> > > towards end-to-end exactly once semantics from source to > destination. > >> > > Basically the connectors need to support idempotence when talking to > >> > Kafka > >> > > and we need the transactional write feature in Kafka to make the > >> > > transformation atomic. This is actually pretty doable if you > separate > >> > > connector=>kafka problem from the generic transformations which are > >> > always > >> > > kafka=>kafka. However I think it is quite impossible to do in a > >> > all_things > >> > > => all_things environment. Today you can say "well the semantics of > >> the > >> > > Samza APIs depend on the connectors you use" but it is actually > worse > >> > then > >> > > that because the semantics actually depend on the pairing of > >> > connectors--so > >> > > not only can you probably not get a usable "exactly once" guarantee > >> > > end-to-end it can actually be quite hard to reverse engineer what > >> > property > >> > > (if any) your end-to-end flow has if you have heterogenous systems. > >> > > > >> > > -Jay > >> > > > >> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <yanfang...@gmail.com> > >> wrote: > >> > > > >> > > > {quote} > >> > > > maintained in a separate repository and retaining the existing > >> > > > committership but sharing as much else as possible (website, etc) > >> > > > {quote} > >> > > > > >> > > > Overall, I agree on this idea. Now the question is more about "how > >> to > >> > do > >> > > > it". > >> > > > > >> > > > On the other hand, one thing I want to point out is that, if we > >> decide > >> > to > >> > > > go this way, how do we want to support > >> > > > otherSystem-transformation-otherSystem use case? > >> > > > > >> > > > Basically, there are four user groups here: > >> > > > > >> > > > 1. Kafka-transformation-Kafka > >> > > > 2. Kafka-transformation-otherSystem > >> > > > 3. otherSystem-transformation-Kafka > >> > > > 4. otherSystem-transformation-otherSystem > >> > > > > >> > > > For group 1, they can easily use the new Samza library to achieve. > >> For > >> > > > group 2 and 3, they can use copyCat -> transformation -> Kafka or > >> > Kafka-> > >> > > > transformation -> copyCat. > >> > > > > >> > > > The problem is for group 4. Do we want to abandon this or still > >> support > >> > > it? > >> > > > Of course, this use case can be achieved by using copyCat -> > >> > > transformation > >> > > > -> Kafka -> transformation -> copyCat, the thing is how we > persuade > >> > them > >> > > to > >> > > > do this long chain. If yes, it will also be a win for Kafka too. > Or > >> if > >> > > > there is no one in this community actually doing this so far, > maybe > >> ok > >> > to > >> > > > not support the group 4 directly. > >> > > > > >> > > > Thanks, > >> > > > > >> > > > Fang, Yan > >> > > > yanfang...@gmail.com > >> > > > > >> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> > >> wrote: > >> > > > > >> > > > > Yeah I agree with this summary. I think there are kind of two > >> > questions > >> > > > > here: > >> > > > > 1. Technically does alignment/reliance on Kafka make sense > >> > > > > 2. Branding wise (naming, website, concepts, etc) does alignment > >> with > >> > > > Kafka > >> > > > > make sense > >> > > > > > >> > > > > Personally I do think both of these things would be really > >> valuable, > >> > > and > >> > > > > would dramatically alter the trajectory of the project. > >> > > > > > >> > > > > My preference would be to see if people can mostly agree on a > >> > direction > >> > > > > rather than splintering things off. From my point of view the > >> ideal > >> > > > outcome > >> > > > > of all the options discussed would be to make Samza a closely > >> aligned > >> > > > > subproject, maintained in a separate repository and retaining > the > >> > > > existing > >> > > > > committership but sharing as much else as possible (website, > >> etc). No > >> > > > idea > >> > > > > about how these things work, Jacob, you probably know more. > >> > > > > > >> > > > > No discussion amongst the Kafka folks has happened on this, but > >> > likely > >> > > we > >> > > > > should figure out what the Samza community actually wants first. > >> > > > > > >> > > > > I admit that this is a fairly radical departure from how things > >> are. > >> > > > > > >> > > > > If that doesn't fly, I think, yeah we could leave Samza as it is > >> and > >> > do > >> > > > the > >> > > > > more radical reboot inside Kafka. From my point of view that > does > >> > leave > >> > > > > things in a somewhat confusing state since now there are two > >> stream > >> > > > > processing systems more or less coupled to Kafka in large part > >> made > >> > by > >> > > > the > >> > > > > same people. But, arguably that might be a cleaner way to make > the > >> > > > cut-over > >> > > > > and perhaps less risky for Samza community since if it works > >> people > >> > can > >> > > > > switch and if it doesn't nothing will have changed. Dunno, how > do > >> > > people > >> > > > > feel about this? > >> > > > > > >> > > > > -Jay > >> > > > > > >> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan < > jgho...@gmail.com> > >> > > wrote: > >> > > > > > >> > > > > > > This leads me to thinking that merging projects and > >> communities > >> > > > might > >> > > > > > be a good idea: with the union of experience from both > >> communities, > >> > > we > >> > > > > will > >> > > > > > probably build a better system that is better for users. > >> > > > > > Is this what's being proposed though? Merging the projects > seems > >> > like > >> > > > > > a consequence of at most one of the three directions under > >> > > discussion: > >> > > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka > >> for > >> > > > > > configuration, etc. (to a greater or lesser extent to be > >> > determined) > >> > > > > > but the Samza community would not automatically merge withe > >> Kafka > >> > > > > > community (the Phoenix/HBase example is a good one here). > >> > > > > > 2) Samza Reboot: The Samza community continues to exist with a > >> > > limited > >> > > > > > project scope, but similarly would not need to be part of the > >> Kafka > >> > > > > > community (ie given committership) to progress. Here, maybe > the > >> > > Samza > >> > > > > > team would become a subproject of Kafka (the Board frowns on > >> > > > > > subprojects at the moment, so I'm not sure if that's even > >> > feasible), > >> > > > > > but that would not be required. > >> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option > the > >> > Kafka > >> > > > > > team builds its own streaming library, possibly off of Jay's > >> > > > > > prototype, which has not direct lineage to the Samza team. > >> There's > >> > > no > >> > > > > > reason for the Kafka team to bring in the Samza team. > >> > > > > > > >> > > > > > Is the Kafka community on board with this? > >> > > > > > > >> > > > > > To be clear, all three options under discussion are > interesting, > >> > > > > > technically valid and likely healthy directions for the > project. > >> > > > > > Also, they are not mutually exclusive. The Samza community > >> could > >> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka community > >> went > >> > > > > > forward with 'Hey Samza!' My points above are directed > >> entirely at > >> > > > > > the community aspect of these choices. > >> > > > > > -Jakob > >> > > > > > > >> > > > > > On 10 July 2015 at 09:10, Roger Hoover < > roger.hoo...@gmail.com> > >> > > wrote: > >> > > > > > > That's great. Thanks, Jay. > >> > > > > > > > >> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps < > j...@confluent.io> > >> > > wrote: > >> > > > > > > > >> > > > > > >> Yeah totally agree. I think you have this issue even today, > >> > right? > >> > > > > I.e. > >> > > > > > if > >> > > > > > >> you need to make a simple config change and you're running > in > >> > YARN > >> > > > > today > >> > > > > > >> you end up bouncing the job which then rebuilds state. I > >> think > >> > the > >> > > > fix > >> > > > > > is > >> > > > > > >> exactly what you described which is to have a long timeout > on > >> > > > > partition > >> > > > > > >> movement for stateful jobs so that if a job is just getting > >> > > bounced, > >> > > > > and > >> > > > > > >> the cluster manager (or admin) is smart enough to restart > it > >> on > >> > > the > >> > > > > same > >> > > > > > >> host when possible, it can optimistically reuse any > existing > >> > state > >> > > > it > >> > > > > > finds > >> > > > > > >> on disk (if it is valid). > >> > > > > > >> > >> > > > > > >> So in this model the charter of the CM is to place > processes > >> as > >> > > > > > stickily as > >> > > > > > >> possible and to restart or re-place failed processes. The > >> > charter > >> > > of > >> > > > > the > >> > > > > > >> partition management system is to control the assignment of > >> work > >> > > to > >> > > > > > these > >> > > > > > >> processes. The nice thing about this is that the work > >> > assignment, > >> > > > > > timeouts, > >> > > > > > >> behavior, configs, and code will all be the same across all > >> > > cluster > >> > > > > > >> managers. > >> > > > > > >> > >> > > > > > >> So I think that prototype would actually give you exactly > >> what > >> > you > >> > > > > want > >> > > > > > >> today for any cluster manager (or manual placement + > restart > >> > > script) > >> > > > > > that > >> > > > > > >> was sticky in terms of host placement since there is > already > >> a > >> > > > > > configurable > >> > > > > > >> partition movement timeout and task-by-task state reuse > with > >> a > >> > > check > >> > > > > on > >> > > > > > >> state validity. > >> > > > > > >> > >> > > > > > >> -Jay > >> > > > > > >> > >> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover < > >> > > > roger.hoo...@gmail.com > >> > > > > > > >> > > > > > >> wrote: > >> > > > > > >> > >> > > > > > >> > That would be great to let Kafka do as much heavy lifting > >> as > >> > > > > possible > >> > > > > > and > >> > > > > > >> > make it easier for other languages to implement Samza > apis. > >> > > > > > >> > > >> > > > > > >> > One thing to watch out for is the interplay between > Kafka's > >> > > group > >> > > > > > >> > management and the external scheduler/process manager's > >> fault > >> > > > > > tolerance. > >> > > > > > >> > If a container dies, the Kafka group membership protocol > >> will > >> > > try > >> > > > to > >> > > > > > >> assign > >> > > > > > >> > it's tasks to other containers while at the same time the > >> > > process > >> > > > > > manager > >> > > > > > >> > is trying to relaunch the container. Without some > >> > consideration > >> > > > for > >> > > > > > this > >> > > > > > >> > (like a configurable amount of time to wait before Kafka > >> > alters > >> > > > the > >> > > > > > group > >> > > > > > >> > membership), there may be thrashing going on which is > >> > especially > >> > > > bad > >> > > > > > for > >> > > > > > >> > containers with large amounts of local state. > >> > > > > > >> > > >> > > > > > >> > Someone else pointed this out already but I thought it > >> might > >> > be > >> > > > > worth > >> > > > > > >> > calling out again. > >> > > > > > >> > > >> > > > > > >> > Cheers, > >> > > > > > >> > > >> > > > > > >> > Roger > >> > > > > > >> > > >> > > > > > >> > > >> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps < > >> j...@confluent.io> > >> > > > > wrote: > >> > > > > > >> > > >> > > > > > >> > > Hey Roger, > >> > > > > > >> > > > >> > > > > > >> > > I couldn't agree more. We spent a bunch of time talking > >> to > >> > > > people > >> > > > > > and > >> > > > > > >> > that > >> > > > > > >> > > is exactly the stuff we heard time and again. What > makes > >> it > >> > > > hard, > >> > > > > of > >> > > > > > >> > > course, is that there is some tension between > >> compatibility > >> > > with > >> > > > > > what's > >> > > > > > >> > > there now and making things better for new users. > >> > > > > > >> > > > >> > > > > > >> > > I also strongly agree with the importance of > >> multi-language > >> > > > > > support. We > >> > > > > > >> > are > >> > > > > > >> > > talking now about Java, but for application development > >> use > >> > > > cases > >> > > > > > >> people > >> > > > > > >> > > want to work in whatever language they are using > >> elsewhere. > >> > I > >> > > > > think > >> > > > > > >> > moving > >> > > > > > >> > > to a model where Kafka itself does the group > membership, > >> > > > lifecycle > >> > > > > > >> > control, > >> > > > > > >> > > and partition assignment has the advantage of putting > all > >> > that > >> > > > > > complex > >> > > > > > >> > > stuff behind a clean api that the clients are already > >> going > >> > to > >> > > > be > >> > > > > > >> > > implementing for their consumer, so the added > >> functionality > >> > > for > >> > > > > > stream > >> > > > > > >> > > processing beyond a consumer becomes very minor. > >> > > > > > >> > > > >> > > > > > >> > > -Jay > >> > > > > > >> > > > >> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover < > >> > > > > > roger.hoo...@gmail.com> > >> > > > > > >> > > wrote: > >> > > > > > >> > > > >> > > > > > >> > > > Metamorphosis...nice. :) > >> > > > > > >> > > > > >> > > > > > >> > > > This has been a great discussion. As a user of Samza > >> > who's > >> > > > > > recently > >> > > > > > >> > > > integrated it into a relatively large organization, I > >> just > >> > > > want > >> > > > > to > >> > > > > > >> add > >> > > > > > >> > > > support to a few points already made. > >> > > > > > >> > > > > >> > > > > > >> > > > The biggest hurdles to adoption of Samza as it > >> currently > >> > > > exists > >> > > > > > that > >> > > > > > >> > I've > >> > > > > > >> > > > experienced are: > >> > > > > > >> > > > 1) YARN - YARN is overly complex in many environments > >> > where > >> > > > > Puppet > >> > > > > > >> > would > >> > > > > > >> > > do > >> > > > > > >> > > > just fine but it was the only mechanism to get fault > >> > > > tolerance. > >> > > > > > >> > > > 2) Configuration - I think I like the idea of > >> configuring > >> > > most > >> > > > > of > >> > > > > > the > >> > > > > > >> > job > >> > > > > > >> > > > in code rather than config files. In general, I > think > >> the > >> > > > goal > >> > > > > > >> should > >> > > > > > >> > be > >> > > > > > >> > > > to make it harder to make mistakes, especially of the > >> kind > >> > > > where > >> > > > > > the > >> > > > > > >> > code > >> > > > > > >> > > > expects something and the config doesn't match. The > >> > current > >> > > > > > config > >> > > > > > >> is > >> > > > > > >> > > > quite intricate and error-prone. For example, the > >> > > application > >> > > > > > logic > >> > > > > > >> > may > >> > > > > > >> > > > depend on bootstrapping a topic but rather than > >> asserting > >> > > that > >> > > > > in > >> > > > > > the > >> > > > > > >> > > code, > >> > > > > > >> > > > you have to rely on getting the config right. > Likewise > >> > with > >> > > > > > serdes, > >> > > > > > >> > the > >> > > > > > >> > > > Java representations produced by various serdes > (JSON, > >> > Avro, > >> > > > > etc.) > >> > > > > > >> are > >> > > > > > >> > > not > >> > > > > > >> > > > equivalent so you cannot just reconfigure a serde > >> without > >> > > > > changing > >> > > > > > >> the > >> > > > > > >> > > > code. It would be nice for jobs to be able to > assert > >> > what > >> > > > they > >> > > > > > >> expect > >> > > > > > >> > > > from their input topics in terms of partitioning. > >> This is > >> > > > > > getting a > >> > > > > > >> > > little > >> > > > > > >> > > > off topic but I was even thinking about creating a > >> "Samza > >> > > > config > >> > > > > > >> > linter" > >> > > > > > >> > > > that would sanity check a set of configs. Especially > >> in > >> > > > > > >> organizations > >> > > > > > >> > > > where config is managed by a different team than the > >> > > > application > >> > > > > > >> > > developer, > >> > > > > > >> > > > it's very hard to get avoid config mistakes. > >> > > > > > >> > > > 3) Java/Scala centric - for many teams (especially > >> > > DevOps-type > >> > > > > > >> folks), > >> > > > > > >> > > the > >> > > > > > >> > > > pain of the Java toolchain (maven, slow builds, weak > >> > command > >> > > > > line > >> > > > > > >> > > support, > >> > > > > > >> > > > configuration over convention) really inhibits > >> > productivity. > >> > > > As > >> > > > > > more > >> > > > > > >> > and > >> > > > > > >> > > > more high-quality clients become available for > Kafka, I > >> > hope > >> > > > > > they'll > >> > > > > > >> > > follow > >> > > > > > >> > > > Samza's model. Not sure how much it affects the > >> proposals > >> > > in > >> > > > > this > >> > > > > > >> > thread > >> > > > > > >> > > > but please consider other languages in the ecosystem > as > >> > > well. > >> > > > > > From > >> > > > > > >> > what > >> > > > > > >> > > > I've heard, Spark has more Python users than > >> Java/Scala. > >> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API > >> > > > > > >> > > > > >> > > > > > >> > > > > >> > > > > > >> > > > >> > > > > > >> > > >> > > > > > >> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza > >> > > > > > >> > > > and are working on a Yeoman generator > >> > > > > > >> > > > https://github.com/Quantiply/generator-rico for > >> > > Jython/Samza > >> > > > > > >> projects > >> > > > > > >> > to > >> > > > > > >> > > > alleviate some of the pain) > >> > > > > > >> > > > > >> > > > > > >> > > > I also want to underscore Jay's point about improving > >> the > >> > > user > >> > > > > > >> > > experience. > >> > > > > > >> > > > That's a very important factor for adoption. I think > >> the > >> > > goal > >> > > > > > should > >> > > > > > >> > be > >> > > > > > >> > > to > >> > > > > > >> > > > make Samza as easy to get started with as something > >> like > >> > > > > Logstash. > >> > > > > > >> > > > Logstash is vastly inferior in terms of capabilities > to > >> > > Samza > >> > > > > but > >> > > > > > >> it's > >> > > > > > >> > > easy > >> > > > > > >> > > > to get started and that makes a big difference. > >> > > > > > >> > > > > >> > > > > > >> > > > Cheers, > >> > > > > > >> > > > > >> > > > > > >> > > > Roger > >> > > > > > >> > > > > >> > > > > > >> > > > > >> > > > > > >> > > > > >> > > > > > >> > > > > >> > > > > > >> > > > > >> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De > Francisci > >> > > > Morales < > >> > > > > > >> > > > g...@apache.org> wrote: > >> > > > > > >> > > > > >> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka > >> Metamorphosis > >> > > is > >> > > > a > >> > > > > > clear > >> > > > > > >> > > > winner > >> > > > > > >> > > > > :) > >> > > > > > >> > > > > > >> > > > > > >> > > > > -- > >> > > > > > >> > > > > Gianmarco > >> > > > > > >> > > > > > >> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci > >> Morales > >> > < > >> > > > > > >> > > g...@apache.org > >> > > > > > >> > > > > > >> > > > > > >> > > > > wrote: > >> > > > > > >> > > > > > >> > > > > > >> > > > > > Hi, > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > @Martin, thanks for you comments. > >> > > > > > >> > > > > > Maybe I'm missing some important point, but I > think > >> > > > coupling > >> > > > > > the > >> > > > > > >> > > > releases > >> > > > > > >> > > > > > is actually a *good* thing. > >> > > > > > >> > > > > > To make an example, would it be better if the MR > >> and > >> > > HDFS > >> > > > > > >> > components > >> > > > > > >> > > of > >> > > > > > >> > > > > > Hadoop had different release schedules? > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > Actually, keeping the discussion in a single > place > >> > would > >> > > > > make > >> > > > > > >> > > agreeing > >> > > > > > >> > > > on > >> > > > > > >> > > > > > releases (and backwards compatibility) much > >> easier, as > >> > > > > > everybody > >> > > > > > >> > > would > >> > > > > > >> > > > be > >> > > > > > >> > > > > > responsible for the whole codebase. > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > That said, I like the idea of absorbing > samza-core > >> as > >> > a > >> > > > > > >> > sub-project, > >> > > > > > >> > > > and > >> > > > > > >> > > > > > leave the fancy stuff separate. > >> > > > > > >> > > > > > It probably gives 90% of the benefits we have > been > >> > > > > discussing > >> > > > > > >> here. > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > Cheers, > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > -- > >> > > > > > >> > > > > > Gianmarco > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps < > >> > jay.kr...@gmail.com > >> > > > > >> > > > > > wrote: > >> > > > > > >> > > > > > > >> > > > > > >> > > > > >> Hey Martin, > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> I agree coupling release schedules is a > downside. > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> Definitely we can try to solve some of the > >> > integration > >> > > > > > problems > >> > > > > > >> in > >> > > > > > >> > > > > >> Confluent Platform or in other distributions. > But > >> I > >> > > think > >> > > > > > this > >> > > > > > >> > ends > >> > > > > > >> > > up > >> > > > > > >> > > > > >> being really shallow. I guess I feel to really > >> get a > >> > > good > >> > > > > > user > >> > > > > > >> > > > > experience > >> > > > > > >> > > > > >> the two systems have to kind of feel like part > of > >> the > >> > > > same > >> > > > > > thing > >> > > > > > >> > and > >> > > > > > >> > > > you > >> > > > > > >> > > > > >> can't really add that in later--you can put both > >> in > >> > the > >> > > > > same > >> > > > > > >> > > > > downloadable > >> > > > > > >> > > > > >> tar file but it doesn't really give a very > >> cohesive > >> > > > > feeling. > >> > > > > > I > >> > > > > > >> > agree > >> > > > > > >> > > > > that > >> > > > > > >> > > > > >> ultimately any of the project stuff is as much > >> social > >> > > and > >> > > > > > naming > >> > > > > > >> > as > >> > > > > > >> > > > > >> anything else--theoretically two totally > >> independent > >> > > > > projects > >> > > > > > >> > could > >> > > > > > >> > > > work > >> > > > > > >> > > > > >> to > >> > > > > > >> > > > > >> tightly align. In practice this seems to be > quite > >> > > > difficult > >> > > > > > >> > though. > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> For the frameworks--totally agree it would be > >> good to > >> > > > > > maintain > >> > > > > > >> the > >> > > > > > >> > > > > >> framework support with the project. In some > cases > >> > there > >> > > > may > >> > > > > > not > >> > > > > > >> be > >> > > > > > >> > > too > >> > > > > > >> > > > > >> much > >> > > > > > >> > > > > >> there since the integration gets lighter but I > >> think > >> > > > > whatever > >> > > > > > >> > stubs > >> > > > > > >> > > > you > >> > > > > > >> > > > > >> need should be included. So no I definitely > wasn't > >> > > trying > >> > > > > to > >> > > > > > >> imply > >> > > > > > >> > > > > >> dropping > >> > > > > > >> > > > > >> support for these frameworks, just making the > >> > > integration > >> > > > > > >> lighter > >> > > > > > >> > by > >> > > > > > >> > > > > >> separating process management from partition > >> > > management. > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> You raise two good points we would have to > figure > >> out > >> > > if > >> > > > we > >> > > > > > went > >> > > > > > >> > > down > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> alignment path: > >> > > > > > >> > > > > >> 1. With respect to the name, yeah I think the > >> first > >> > > > > question > >> > > > > > is > >> > > > > > >> > > > whether > >> > > > > > >> > > > > >> some "re-branding" would be worth it. If so > then I > >> > > think > >> > > > we > >> > > > > > can > >> > > > > > >> > > have a > >> > > > > > >> > > > > big > >> > > > > > >> > > > > >> thread on the name. I'm definitely not set on > >> Kafka > >> > > > > > Streaming or > >> > > > > > >> > > Kafka > >> > > > > > >> > > > > >> Streams I was just using them to be kind of > >> > > > illustrative. I > >> > > > > > >> agree > >> > > > > > >> > > with > >> > > > > > >> > > > > >> your > >> > > > > > >> > > > > >> critique of these names, though I think people > >> would > >> > > get > >> > > > > the > >> > > > > > >> idea. > >> > > > > > >> > > > > >> 2. Yeah you also raise a good point about how to > >> > > "factor" > >> > > > > it. > >> > > > > > >> Here > >> > > > > > >> > > are > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> options I see (I could get enthusiastic about > any > >> of > >> > > > them): > >> > > > > > >> > > > > >> a. One repo for both Kafka and Samza > >> > > > > > >> > > > > >> b. Two repos, retaining the current > seperation > >> > > > > > >> > > > > >> c. Two repos, the equivalent of samza-api and > >> > > > samza-core > >> > > > > > is > >> > > > > > >> > > > absorbed > >> > > > > > >> > > > > >> almost like a third client > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> Cheers, > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> -Jay > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin > Kleppmann < > >> > > > > > >> > > > mar...@kleppmann.com> > >> > > > > > >> > > > > >> wrote: > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few > >> > > follow-up > >> > > > > > >> > comments. > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka or > >> > becoming > >> > > a > >> > > > > > >> > subproject: > >> > > > > > >> > > > the > >> > > > > > >> > > > > >> > reasons you mention are good. The risk I see > is > >> > that > >> > > > > > release > >> > > > > > >> > > > schedules > >> > > > > > >> > > > > >> > become coupled to each other, which can slow > >> > everyone > >> > > > > down, > >> > > > > > >> and > >> > > > > > >> > > > large > >> > > > > > >> > > > > >> > projects with many contributors are harder to > >> > manage. > >> > > > > > (Jakob, > >> > > > > > >> > can > >> > > > > > >> > > > you > >> > > > > > >> > > > > >> speak > >> > > > > > >> > > > > >> > from experience, having seen a wider range of > >> > Hadoop > >> > > > > > ecosystem > >> > > > > > >> > > > > >> projects?) > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > Some of the goals of a better unified > developer > >> > > > > experience > >> > > > > > >> could > >> > > > > > >> > > > also > >> > > > > > >> > > > > be > >> > > > > > >> > > > > >> > solved by integrating Samza nicely into a > Kafka > >> > > > > > distribution > >> > > > > > >> > (such > >> > > > > > >> > > > as > >> > > > > > >> > > > > >> > Confluent's). I'm not against merging projects > >> if > >> > we > >> > > > > decide > >> > > > > > >> > that's > >> > > > > > >> > > > the > >> > > > > > >> > > > > >> way > >> > > > > > >> > > > > >> > to go, just pointing out the same goals can > >> perhaps > >> > > > also > >> > > > > be > >> > > > > > >> > > achieved > >> > > > > > >> > > > > in > >> > > > > > >> > > > > >> > other ways. > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > - With regard to dropping the YARN dependency: > >> are > >> > > you > >> > > > > > >> proposing > >> > > > > > >> > > > that > >> > > > > > >> > > > > >> > Samza doesn't give any help to people wanting > to > >> > run > >> > > on > >> > > > > > >> > > > > >> YARN/Mesos/AWS/etc? > >> > > > > > >> > > > > >> > So the docs would basically have a link to > >> Slider > >> > and > >> > > > > > nothing > >> > > > > > >> > > else? > >> > > > > > >> > > > Or > >> > > > > > >> > > > > >> > would we maintain integrations with a bunch of > >> > > popular > >> > > > > > >> > deployment > >> > > > > > >> > > > > >> methods > >> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to > >> make > >> > > > Samza > >> > > > > > work > >> > > > > > >> > with > >> > > > > > >> > > > > >> Slider)? > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > I absolutely think it's a good idea to have > the > >> > "as a > >> > > > > > library" > >> > > > > > >> > and > >> > > > > > >> > > > > "as a > >> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for > >> people > >> > who > >> > > > > want > >> > > > > > >> them, > >> > > > > > >> > > > but I > >> > > > > > >> > > > > >> > think there should also be a low-friction path > >> for > >> > > > common > >> > > > > > "as > >> > > > > > >> a > >> > > > > > >> > > > > service" > >> > > > > > >> > > > > >> > deployment methods, for which we probably need > >> to > >> > > > > maintain > >> > > > > > >> > > > > integrations. > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to > >> me, > >> > > > > because > >> > > > > > >> Kafka > >> > > > > > >> > > is > >> > > > > > >> > > > > all > >> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka > >> Transformers" > >> > > or > >> > > > > > "Kafka > >> > > > > > >> > > > Filters" > >> > > > > > >> > > > > >> > would be more apt? > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza > >> (stream > >> > > > > > >> transformation > >> > > > > > >> > > > with > >> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a > >> library" > >> > > bit) > >> > > > > > could > >> > > > > > >> > > become > >> > > > > > >> > > > > >> part of > >> > > > > > >> > > > > >> > Kafka, while higher-level tools such as > >> streaming > >> > SQL > >> > > > and > >> > > > > > >> > > > integrations > >> > > > > > >> > > > > >> with > >> > > > > > >> > > > > >> > deployment frameworks remain in a separate > >> project? > >> > > In > >> > > > > > other > >> > > > > > >> > > words, > >> > > > > > >> > > > > >> Kafka > >> > > > > > >> > > > > >> > would absorb the proven, stable core of Samza, > >> > which > >> > > > > would > >> > > > > > >> > become > >> > > > > > >> > > > the > >> > > > > > >> > > > > >> > "third Kafka client" mentioned early in this > >> > thread. > >> > > > The > >> > > > > > Samza > >> > > > > > >> > > > project > >> > > > > > >> > > > > >> > would then target that third Kafka client as > its > >> > base > >> > > > > API, > >> > > > > > and > >> > > > > > >> > the > >> > > > > > >> > > > > >> project > >> > > > > > >> > > > > >> > would be freed up to explore more experimental > >> new > >> > > > > > horizons. > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > Martin > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps < > >> > > > jay.kr...@gmail.com> > >> > > > > > >> wrote: > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > Hey Martin, > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually > >> > don't > >> > > > > think > >> > > > > > it > >> > > > > > >> > ties > >> > > > > > >> > > > our > >> > > > > > >> > > > > >> > hands > >> > > > > > >> > > > > >> > > at all, all it does is refactor things. The > >> > > division > >> > > > of > >> > > > > > >> > > > > >> responsibility is > >> > > > > > >> > > > > >> > > that Samza core is responsible for task > >> > lifecycle, > >> > > > > state, > >> > > > > > >> and > >> > > > > > >> > > > > >> partition > >> > > > > > >> > > > > >> > > management (using the Kafka co-ordinator) > but > >> it > >> > is > >> > > > NOT > >> > > > > > >> > > > responsible > >> > > > > > >> > > > > >> for > >> > > > > > >> > > > > >> > > packaging, configuration deployment or > >> execution > >> > of > >> > > > > > >> processes. > >> > > > > > >> > > The > >> > > > > > >> > > > > >> > problem > >> > > > > > >> > > > > >> > > of packaging and starting these processes is > >> > > > > > >> > > > > >> > > framework/environment-specific. This leaves > >> > > > individual > >> > > > > > >> > > frameworks > >> > > > > > >> > > > to > >> > > > > > >> > > > > >> be > >> > > > > > >> > > > > >> > as > >> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you can > get > >> > > simple > >> > > > > > >> stateless > >> > > > > > >> > > > > >> support in > >> > > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf > app > >> > > > > framework > >> > > > > > >> > > (Slider, > >> > > > > > >> > > > > >> > Marathon, > >> > > > > > >> > > > > >> > > etc). These are well known by people and > have > >> > nice > >> > > > UIs > >> > > > > > and a > >> > > > > > >> > lot > >> > > > > > >> > > > of > >> > > > > > >> > > > > >> > > flexibility. I don't think they have node > >> > affinity > >> > > > as a > >> > > > > > >> built > >> > > > > > >> > in > >> > > > > > >> > > > > >> option > >> > > > > > >> > > > > >> > > (though I could be wrong). So if we want > that > >> we > >> > > can > >> > > > > > either > >> > > > > > >> > wait > >> > > > > > >> > > > for > >> > > > > > >> > > > > >> them > >> > > > > > >> > > > > >> > > to add it or do a custom framework to add > that > >> > > > feature > >> > > > > > (as > >> > > > > > >> > now). > >> > > > > > >> > > > > >> > Obviously > >> > > > > > >> > > > > >> > > if you manage things with old-school ops > tools > >> > > > > > >> > (puppet/chef/etc) > >> > > > > > >> > > > you > >> > > > > > >> > > > > >> get > >> > > > > > >> > > > > >> > > locality easily. The nice thing, though, is > >> that > >> > > all > >> > > > > the > >> > > > > > >> samza > >> > > > > > >> > > > > >> "business > >> > > > > > >> > > > > >> > > logic" around partition management and fault > >> > > > tolerance > >> > > > > > is in > >> > > > > > >> > > Samza > >> > > > > > >> > > > > >> core > >> > > > > > >> > > > > >> > so > >> > > > > > >> > > > > >> > > it is shared across frameworks and the > >> framework > >> > > > > specific > >> > > > > > >> bit > >> > > > > > >> > is > >> > > > > > >> > > > > just > >> > > > > > >> > > > > >> > > whether it is smart enough to try to get the > >> same > >> > > > host > >> > > > > > when > >> > > > > > >> a > >> > > > > > >> > > job > >> > > > > > >> > > > is > >> > > > > > >> > > > > >> > > restarted. > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I > >> think > >> > > the > >> > > > > > goal > >> > > > > > >> > would > >> > > > > > >> > > > be > >> > > > > > >> > > > > >> (a) > >> > > > > > >> > > > > >> > > actually get better alignment in user > >> experience, > >> > > and > >> > > > > (b) > >> > > > > > >> > > express > >> > > > > > >> > > > > >> this in > >> > > > > > >> > > > > >> > > the naming and project branding. > Specifically: > >> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the > >> > > > > > "transformation" > >> > > > > > >> api > >> > > > > > >> > > to > >> > > > > > >> > > > be > >> > > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be > >> able > >> > > to > >> > > > > > explain > >> > > > > > >> > > when > >> > > > > > >> > > > to > >> > > > > > >> > > > > >> use > >> > > > > > >> > > > > >> > > the consumer and when to use the stream > >> > processing > >> > > > > > >> > functionality > >> > > > > > >> > > > and > >> > > > > > >> > > > > >> lead > >> > > > > > >> > > > > >> > > people into that experience. > >> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 > >> (or > >> > > > > > whatever) > >> > > > > > >> > that > >> > > > > > >> > > > has > >> > > > > > >> > > > > >> both > >> > > > > > >> > > > > >> > > Kafka and the stream processing part and > they > >> > > > actually > >> > > > > > work > >> > > > > > >> > > > > together. > >> > > > > > >> > > > > >> > > 3. Unify the programming experience so the > >> client > >> > > and > >> > > > > > Samza > >> > > > > > >> > api > >> > > > > > >> > > > > share > >> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc. > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > I think sub-projects keep separate > committers > >> and > >> > > can > >> > > > > > have a > >> > > > > > >> > > > > separate > >> > > > > > >> > > > > >> > repo, > >> > > > > > >> > > > > >> > > but I'm actually not really sure (I can't > >> find a > >> > > > > > definition > >> > > > > > >> > of a > >> > > > > > >> > > > > >> > subproject > >> > > > > > >> > > > > >> > > in Apache). > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > Basically at a high-level you want the > >> experience > >> > > to > >> > > > > > "feel" > >> > > > > > >> > > like a > >> > > > > > >> > > > > >> single > >> > > > > > >> > > > > >> > > system, not to relatively independent things > >> that > >> > > are > >> > > > > > kind > >> > > > > > >> of > >> > > > > > >> > > > > >> awkwardly > >> > > > > > >> > > > > >> > > glued together. > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > I think if we did that they having naming or > >> > > branding > >> > > > > > like > >> > > > > > >> > > "kafka > >> > > > > > >> > > > > >> > > streaming" or "kafka streams" or something > >> like > >> > > that > >> > > > > > would > >> > > > > > >> > > > actually > >> > > > > > >> > > > > >> do a > >> > > > > > >> > > > > >> > > good job of conveying what it is. I do that > >> this > >> > > > would > >> > > > > > help > >> > > > > > >> > > > adoption > >> > > > > > >> > > > > >> > quite > >> > > > > > >> > > > > >> > > a lot as it would correctly convey that > using > >> > Kafka > >> > > > > > >> Streaming > >> > > > > > >> > > with > >> > > > > > >> > > > > >> Kafka > >> > > > > > >> > > > > >> > is > >> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka is > >> pretty > >> > > > > heavily > >> > > > > > >> > adopted > >> > > > > > >> > > > at > >> > > > > > >> > > > > >> this > >> > > > > > >> > > > > >> > > point. > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > Fwiw we actually considered this model > >> originally > >> > > > when > >> > > > > > open > >> > > > > > >> > > > sourcing > >> > > > > > >> > > > > >> > Samza, > >> > > > > > >> > > > > >> > > however at that time Kafka was relatively > >> unknown > >> > > and > >> > > > > we > >> > > > > > >> > decided > >> > > > > > >> > > > not > >> > > > > > >> > > > > >> to > >> > > > > > >> > > > > >> > do > >> > > > > > >> > > > > >> > > it since we felt it would be limiting. From > my > >> > > point > >> > > > of > >> > > > > > view > >> > > > > > >> > the > >> > > > > > >> > > > > three > >> > > > > > >> > > > > >> > > things have changed (1) Kafka is now really > >> > heavily > >> > > > > used > >> > > > > > for > >> > > > > > >> > > > stream > >> > > > > > >> > > > > >> > > processing, (2) we learned that abstracting > >> out > >> > the > >> > > > > > stream > >> > > > > > >> > well > >> > > > > > >> > > is > >> > > > > > >> > > > > >> > > basically impossible, (3) we learned it is > >> really > >> > > > hard > >> > > > > to > >> > > > > > >> keep > >> > > > > > >> > > the > >> > > > > > >> > > > > two > >> > > > > > >> > > > > >> > > things feeling like a single product. > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > -Jay > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin > >> Kleppmann > >> > < > >> > > > > > >> > > > > >> mar...@kleppmann.com> > >> > > > > > >> > > > > >> > > wrote: > >> > > > > > >> > > > > >> > > > >> > > > > > >> > > > > >> > >> Hi all, > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> Lots of good thoughts here. > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> I agree with the general philosophy of > tying > >> > Samza > >> > > > > more > >> > > > > > >> > firmly > >> > > > > > >> > > to > >> > > > > > >> > > > > >> Kafka. > >> > > > > > >> > > > > >> > >> After I spent a while looking at > integrating > >> > other > >> > > > > > message > >> > > > > > >> > > > brokers > >> > > > > > >> > > > > >> (e.g. > >> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the > >> > > > conclusion > >> > > > > > that > >> > > > > > >> > > > > >> > SystemConsumer > >> > > > > > >> > > > > >> > >> tacitly assumes a model so much like > Kafka's > >> > that > >> > > > > pretty > >> > > > > > >> much > >> > > > > > >> > > > > nobody > >> > > > > > >> > > > > >> but > >> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus is > >> > perhaps > >> > > an > >> > > > > > >> > exception, > >> > > > > > >> > > > but > >> > > > > > >> > > > > >> it > >> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.) > Thus, > >> > > making > >> > > > > > Samza > >> > > > > > >> > > fully > >> > > > > > >> > > > > >> > dependent > >> > > > > > >> > > > > >> > >> on Kafka acknowledges that the > >> > system-independence > >> > > > was > >> > > > > > >> never > >> > > > > > >> > as > >> > > > > > >> > > > > real > >> > > > > > >> > > > > >> as > >> > > > > > >> > > > > >> > we > >> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of > code > >> > reuse > >> > > > are > >> > > > > > >> real. > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has > >> also > >> > > > always > >> > > > > > been > >> > > > > > >> > > > > >> appealing to > >> > > > > > >> > > > > >> > >> me, for various reasons already mentioned > in > >> > this > >> > > > > > thread. > >> > > > > > >> > > > Although > >> > > > > > >> > > > > >> > making > >> > > > > > >> > > > > >> > >> Samza jobs deployable on anything > >> > > > (YARN/Mesos/AWS/etc) > >> > > > > > >> seems > >> > > > > > >> > > > > >> laudable, > >> > > > > > >> > > > > >> > I am > >> > > > > > >> > > > > >> > >> a little concerned that it will restrict us > >> to a > >> > > > > lowest > >> > > > > > >> > common > >> > > > > > >> > > > > >> > denominator. > >> > > > > > >> > > > > >> > >> For example, would host affinity > (SAMZA-617) > >> > still > >> > > > be > >> > > > > > >> > possible? > >> > > > > > >> > > > For > >> > > > > > >> > > > > >> jobs > >> > > > > > >> > > > > >> > >> with large amounts of state, I think > >> SAMZA-617 > >> > > would > >> > > > > be > >> > > > > > a > >> > > > > > >> big > >> > > > > > >> > > > boon, > >> > > > > > >> > > > > >> > since > >> > > > > > >> > > > > >> > >> restoring state off the changelog on every > >> > single > >> > > > > > restart > >> > > > > > >> is > >> > > > > > >> > > > > painful, > >> > > > > > >> > > > > >> > due > >> > > > > > >> > > > > >> > >> to long recovery times. It would be a shame > >> if > >> > the > >> > > > > > >> decoupling > >> > > > > > >> > > > from > >> > > > > > >> > > > > >> YARN > >> > > > > > >> > > > > >> > >> made host affinity impossible. > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> Jay, a question about the proposed API for > >> > > > > > instantiating a > >> > > > > > >> > job > >> > > > > > >> > > in > >> > > > > > >> > > > > >> code > >> > > > > > >> > > > > >> > >> (rather than a properties file): when > >> > submitting a > >> > > > job > >> > > > > > to a > >> > > > > > >> > > > > cluster, > >> > > > > > >> > > > > >> is > >> > > > > > >> > > > > >> > the > >> > > > > > >> > > > > >> > >> idea that the instantiation code runs on a > >> > client > >> > > > > > >> somewhere, > >> > > > > > >> > > > which > >> > > > > > >> > > > > >> then > >> > > > > > >> > > > > >> > >> pokes the necessary endpoints on > >> > > YARN/Mesos/AWS/etc? > >> > > > > Or > >> > > > > > >> does > >> > > > > > >> > > that > >> > > > > > >> > > > > >> code > >> > > > > > >> > > > > >> > run > >> > > > > > >> > > > > >> > >> on each container that is part of the job > (in > >> > > which > >> > > > > > case, > >> > > > > > >> how > >> > > > > > >> > > > does > >> > > > > > >> > > > > >> the > >> > > > > > >> > > > > >> > job > >> > > > > > >> > > > > >> > >> submission to the cluster work)? > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel > >> right to > >> > > > make > >> > > > > a > >> > > > > > 1.0 > >> > > > > > >> > > > release > >> > > > > > >> > > > > >> > with a > >> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete. So > if > >> > this > >> > > > is > >> > > > > > going > >> > > > > > >> > to > >> > > > > > >> > > > > >> happen, I > >> > > > > > >> > > > > >> > >> think it would be more honest to stick with > >> 0.* > >> > > > > version > >> > > > > > >> > numbers > >> > > > > > >> > > > > until > >> > > > > > >> > > > > >> > the > >> > > > > > >> > > > > >> > >> library-ified Samza has been implemented, > is > >> > > stable > >> > > > > and > >> > > > > > >> > widely > >> > > > > > >> > > > > used. > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of > >> Kafka? > >> > > There > >> > > > > is > >> > > > > > >> > > precedent > >> > > > > > >> > > > > for > >> > > > > > >> > > > > >> > >> tight coupling between different Apache > >> projects > >> > > > (e.g. > >> > > > > > >> > Curator > >> > > > > > >> > > > and > >> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think > >> > > remaining > >> > > > > > >> separate > >> > > > > > >> > > > would > >> > > > > > >> > > > > >> be > >> > > > > > >> > > > > >> > ok. > >> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka, > >> there > >> > > is > >> > > > > > enough > >> > > > > > >> > > > > substance > >> > > > > > >> > > > > >> in > >> > > > > > >> > > > > >> > >> Samza that it warrants being a separate > >> project. > >> > > An > >> > > > > > >> argument > >> > > > > > >> > in > >> > > > > > >> > > > > >> favour > >> > > > > > >> > > > > >> > of > >> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a > much > >> > > > stronger > >> > > > > > >> "brand > >> > > > > > >> > > > > >> presence" > >> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If > >> the > >> > > Kafka > >> > > > > > >> project > >> > > > > > >> > is > >> > > > > > >> > > > > >> willing > >> > > > > > >> > > > > >> > to > >> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of > doing > >> > > > stateful > >> > > > > > >> stream > >> > > > > > >> > > > > >> > >> transformations, that would probably have > >> much > >> > the > >> > > > > same > >> > > > > > >> > effect > >> > > > > > >> > > as > >> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream > >> Processors" > >> > or > >> > > > > > suchlike. > >> > > > > > >> > > Close > >> > > > > > >> > > > > >> > >> collaboration between the two projects will > >> be > >> > > > needed > >> > > > > in > >> > > > > > >> any > >> > > > > > >> > > > case. > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> From a project management perspective, I > >> guess > >> > the > >> > > > > "new > >> > > > > > >> > Samza" > >> > > > > > >> > > > > would > >> > > > > > >> > > > > >> > have > >> > > > > > >> > > > > >> > >> to be developed on a branch alongside > ongoing > >> > > > > > maintenance > >> > > > > > >> of > >> > > > > > >> > > the > >> > > > > > >> > > > > >> current > >> > > > > > >> > > > > >> > >> line of development? I think it would be > >> > important > >> > > > to > >> > > > > > >> > continue > >> > > > > > >> > > > > >> > supporting > >> > > > > > >> > > > > >> > >> existing users, and provide a graceful > >> migration > >> > > > path > >> > > > > to > >> > > > > > >> the > >> > > > > > >> > > new > >> > > > > > >> > > > > >> > version. > >> > > > > > >> > > > > >> > >> Leaving the current versions unsupported > and > >> > > forcing > >> > > > > > people > >> > > > > > >> > to > >> > > > > > >> > > > > >> rewrite > >> > > > > > >> > > > > >> > >> their jobs would send a bad signal. > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> Best, > >> > > > > > >> > > > > >> > >> Martin > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps < > >> > > > j...@confluent.io> > >> > > > > > >> wrote: > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >>> Hey Garry, > >> > > > > > >> > > > > >> > >>> > >> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be > happy > >> to > >> > > chat > >> > > > > > more > >> > > > > > >> > about > >> > > > > > >> > > > > this > >> > > > > > >> > > > > >> if > >> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I > >> > started > >> > > > with > >> > > > > > the > >> > > > > > >> > idea > >> > > > > > >> > > > of > >> > > > > > >> > > > > >> "what > >> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass > >> > ingestion > >> > > > > tool" > >> > > > > > but > >> > > > > > >> > > > > >> ultimately > >> > > > > > >> > > > > >> > we > >> > > > > > >> > > > > >> > >>> kind of came around to the idea that > >> ingestion > >> > > and > >> > > > > > >> > > > transformation > >> > > > > > >> > > > > >> had > >> > > > > > >> > > > > >> > >>> pretty different needs and coupling the > two > >> > made > >> > > > > things > >> > > > > > >> > hard. > >> > > > > > >> > > > > >> > >>> > >> > > > > > >> > > > > >> > >>> For what it's worth I think copycat > (KIP-26) > >> > > > actually > >> > > > > > will > >> > > > > > >> > do > >> > > > > > >> > > > what > >> > > > > > >> > > > > >> you > >> > > > > > >> > > > > >> > >> are > >> > > > > > >> > > > > >> > >>> looking for. > >> > > > > > >> > > > > >> > >>> > >> > > > > > >> > > > > >> > >>> With regard to your point about slider, I > >> don't > >> > > > > > >> necessarily > >> > > > > > >> > > > > >> disagree. > >> > > > > > >> > > > > >> > >> But I > >> > > > > > >> > > > > >> > >>> think getting good YARN support is quite > >> doable > >> > > > and I > >> > > > > > >> think > >> > > > > > >> > we > >> > > > > > >> > > > can > >> > > > > > >> > > > > >> make > >> > > > > > >> > > > > >> > >>> that work well. I think the issue this > >> proposal > >> > > > > solves > >> > > > > > is > >> > > > > > >> > that > >> > > > > > >> > > > > >> > >> technically > >> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple > >> cluster > >> > > > > > management > >> > > > > > >> > > systems > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> > way > >> > > > > > >> > > > > >> > >>> things are now, you need to write an "app > >> > master" > >> > > > or > >> > > > > > >> > > "framework" > >> > > > > > >> > > > > for > >> > > > > > >> > > > > >> > each > >> > > > > > >> > > > > >> > >>> and they are all a little different so > >> testing > >> > is > >> > > > > > really > >> > > > > > >> > hard. > >> > > > > > >> > > > In > >> > > > > > >> > > > > >> the > >> > > > > > >> > > > > >> > >>> absence of this we have been stuck with > just > >> > YARN > >> > > > > which > >> > > > > > >> has > >> > > > > > >> > > > > >> fantastic > >> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the > org, > >> but > >> > > > zero > >> > > > > > >> > > penetration > >> > > > > > >> > > > > >> > >> elsewhere. > >> > > > > > >> > > > > >> > >>> Given the huge amount of work being put in > >> to > >> > > > slider, > >> > > > > > >> > > marathon, > >> > > > > > >> > > > > aws > >> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen > related > >> > > > packaging > >> > > > > > >> > > > technologies > >> > > > > > >> > > > > >> > people > >> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various > >> > > > > cloud-specific > >> > > > > > >> > deploy > >> > > > > > >> > > > > >> tools, > >> > > > > > >> > > > > >> > >> etc) > >> > > > > > >> > > > > >> > >>> I really think it is important to get this > >> > right. > >> > > > > > >> > > > > >> > >>> > >> > > > > > >> > > > > >> > >>> -Jay > >> > > > > > >> > > > > >> > >>> > >> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry > >> > Turkington > >> > > < > >> > > > > > >> > > > > >> > >>> g.turking...@improvedigital.com> wrote: > >> > > > > > >> > > > > >> > >>> > >> > > > > > >> > > > > >> > >>>> Hi all, > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> I think the question below re does Samza > >> > become > >> > > a > >> > > > > > >> > sub-project > >> > > > > > >> > > > of > >> > > > > > >> > > > > >> Kafka > >> > > > > > >> > > > > >> > >>>> highlights the broader point around > >> migration. > >> > > > Chris > >> > > > > > >> > mentions > >> > > > > > >> > > > > >> Samza's > >> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1 release > >> but > >> > I'm > >> > > > not > >> > > > > > sure > >> > > > > > >> > it > >> > > > > > >> > > > > feels > >> > > > > > >> > > > > >> > >> right to > >> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to > >> deprecate > >> > > > most > >> > > > > of > >> > > > > > >> it. > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some > guys > >> > who > >> > > > have > >> > > > > > >> > started > >> > > > > > >> > > > > >> working > >> > > > > > >> > > > > >> > >> with > >> > > > > > >> > > > > >> > >>>> Samza and building some new > >> > consumers/producers > >> > > > was > >> > > > > > next > >> > > > > > >> > up. > >> > > > > > >> > > > > Sounds > >> > > > > > >> > > > > >> > like > >> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to > >> go. I > >> > > need > >> > > > > to > >> > > > > > >> look > >> > > > > > >> > > into > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> > KIP > >> > > > > > >> > > > > >> > >> in > >> > > > > > >> > > > > >> > >>>> more detail but for me the attractiveness > >> of > >> > > > adding > >> > > > > > new > >> > > > > > >> > Samza > >> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all > they > >> > were > >> > > > > doing > >> > > > > > was > >> > > > > > >> > > > really > >> > > > > > >> > > > > >> > getting > >> > > > > > >> > > > > >> > >>>> data into and out of Kafka -- was to > avoid > >> > > > having > >> > > > > to > >> > > > > > >> > worry > >> > > > > > >> > > > > about > >> > > > > > >> > > > > >> the > >> > > > > > >> > > > > >> > >>>> lifecycle management of external clients. > >> If > >> > > there > >> > > > > is > >> > > > > > a > >> > > > > > >> > > generic > >> > > > > > >> > > > > >> Kafka > >> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a > new > >> > > > connector > >> > > > > > into > >> > > > > > >> > and > >> > > > > > >> > > > > have > >> > > > > > >> > > > > >> a > >> > > > > > >> > > > > >> > >> lot of > >> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and > reliability > >> > done > >> > > > for > >> > > > > me > >> > > > > > >> then > >> > > > > > >> > > it > >> > > > > > >> > > > > >> gives > >> > > > > > >> > > > > >> > me > >> > > > > > >> > > > > >> > >> all > >> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers > would. > >> If > >> > > not > >> > > > > > then it > >> > > > > > >> > > > > >> complicates > >> > > > > > >> > > > > >> > my > >> > > > > > >> > > > > >> > >>>> operational deployments. > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> Which is similar to my other question > with > >> the > >> > > > > > proposal > >> > > > > > >> -- > >> > > > > > >> > if > >> > > > > > >> > > > we > >> > > > > > >> > > > > >> > build a > >> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus > the > >> > > > requisite > >> > > > > > >> shims > >> > > > > > >> > to > >> > > > > > >> > > > > >> > integrate > >> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former may > >> be a > >> > > lot > >> > > > > more > >> > > > > > >> work > >> > > > > > >> > > > than > >> > > > > > >> > > > > we > >> > > > > > >> > > > > >> > >> think. > >> > > > > > >> > > > > >> > >>>> We may make it much easier for a newcomer > >> to > >> > get > >> > > > > > >> something > >> > > > > > >> > > > > running > >> > > > > > >> > > > > >> but > >> > > > > > >> > > > > >> > >>>> having them step up and get a reliable > >> > > production > >> > > > > > >> > deployment > >> > > > > > >> > > > may > >> > > > > > >> > > > > >> still > >> > > > > > >> > > > > >> > >>>> dominate mailing list traffic, if for > >> > different > >> > > > > > reasons > >> > > > > > >> > than > >> > > > > > >> > > > > >> today. > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable > with > >> > > making > >> > > > > the > >> > > > > > >> Samza > >> > > > > > >> > > > > >> dependency > >> > > > > > >> > > > > >> > >> on > >> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely > >> see > >> > > the > >> > > > > > >> benefits > >> > > > > > >> > > in > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing > >> > > > > > >> > > > terminologies/abstractions > >> > > > > > >> > > > > >> that > >> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library > >> would > >> > > > likely > >> > > > > > be a > >> > > > > > >> > very > >> > > > > > >> > > > > nice > >> > > > > > >> > > > > >> > tool > >> > > > > > >> > > > > >> > >> to > >> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have > the > >> > > > concerns > >> > > > > > >> above > >> > > > > > >> > re > >> > > > > > >> > > > the > >> > > > > > >> > > > > >> > >>>> operational side. > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> Garry > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> -----Original Message----- > >> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales > >> [mailto: > >> > > > > > >> > g...@apache.org > >> > > > > > >> > > ] > >> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56 > >> > > > > > >> > > > > >> > >>>> To: dev@samza.apache.org > >> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on > >> > Samza > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> Very interesting thoughts. > >> > > > > > >> > > > > >> > >>>> From outside, I have always perceived > Samza > >> > as a > >> > > > > > >> computing > >> > > > > > >> > > > layer > >> > > > > > >> > > > > >> over > >> > > > > > >> > > > > >> > >>>> Kafka. > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is > >> > > "should > >> > > > > > Samza > >> > > > > > >> be > >> > > > > > >> > a > >> > > > > > >> > > > > >> > sub-project > >> > > > > > >> > > > > >> > >>>> of Kafka then?" > >> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a > >> separate > >> > > > > project > >> > > > > > >> > with a > >> > > > > > >> > > > > >> separate > >> > > > > > >> > > > > >> > >>>> governance? > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> Cheers, > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> -- > >> > > > > > >> > > > > >> > >>>> Gianmarco > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang < > >> > > > > > yanfang...@gmail.com> > >> > > > > > >> > > > wrote: > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka > more > >> > > > tightly. > >> > > > > > >> > Because > >> > > > > > >> > > > > Samza > >> > > > > > >> > > > > >> de > >> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should > >> > leverage > >> > > > > what > >> > > > > > >> Kafka > >> > > > > > >> > > > has. > >> > > > > > >> > > > > At > >> > > > > > >> > > > > >> > the > >> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to > reinvent > >> > what > >> > > > > Samza > >> > > > > > >> > > already > >> > > > > > >> > > > > >> has. I > >> > > > > > >> > > > > >> > >>>>> also like the idea of separating the > >> > ingestion > >> > > > and > >> > > > > > >> > > > > transformation. > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to > >> image > >> > > how > >> > > > > the > >> > > > > > >> Samza > >> > > > > > >> > > > will > >> > > > > > >> > > > > >> look > >> > > > > > >> > > > > >> > >>>> like. > >> > > > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little > >> > > difference > >> > > > > in > >> > > > > > >> terms > >> > > > > > >> > > of > >> > > > > > >> > > > > how > >> > > > > > >> > > > > >> > >>>>> Samza should look like. > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code > >> shows > >> > (A > >> > > > > > client of > >> > > > > > >> > > > Kakfa) > >> > > > > > >> > > > > ? > >> > > > > > >> > > > > >> And > >> > > > > > >> > > > > >> > >>>>> user's application code calls this > client? > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of > Kafka > >> > (like > >> > > > > what > >> > > > > > the > >> > > > > > >> > > code > >> > > > > > >> > > > > >> shows), > >> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and > >> > > > > fault-tolerance? > >> > > > > > >> Are > >> > > > > > >> > > they > >> > > > > > >> > > > > >> taken > >> > > > > > >> > > > > >> > >>>>> care by the Kafka broker or other > >> mechanism, > >> > > such > >> > > > > as > >> > > > > > >> > "Samza > >> > > > > > >> > > > > >> worker" > >> > > > > > >> > > > > >> > >>>>> (just make up the name) ? > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as > >> > > > auto-scaling, > >> > > > > > >> shared > >> > > > > > >> > > > > state, > >> > > > > > >> > > > > >> > >>>>> monitoring? > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is > this > >> > what > >> > > > > Chris > >> > > > > > >> > > > suggests?) > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from > Kakfa > >> > and > >> > > > > > produce > >> > > > > > >> to > >> > > > > > >> > > it. > >> > > > > > >> > > > > >> Then it > >> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks > like > >> > now, > >> > > > > > except it > >> > > > > > >> > > does > >> > > > > > >> > > > > not > >> > > > > > >> > > > > >> > rely > >> > > > > > >> > > > > >> > >>>>> on Yarn anymore. > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it > >> leverage > >> > > > Kafka's > >> > > > > > >> > metrics, > >> > > > > > >> > > > > logs, > >> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency? > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> Thanks, > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> Fang, Yan > >> > > > > > >> > > > > >> > >>>>> yanfang...@gmail.com > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang > >> > Wang < > >> > > > > > >> > > > > wangg...@gmail.com > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > >>>> wrote: > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>>>> Read through the code example and it > >> looks > >> > > good > >> > > > to > >> > > > > > me. > >> > > > > > >> A > >> > > > > > >> > > few > >> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment: > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable > >> runnable > >> > > like: > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh > >> > > --config-factory=... > >> > > > > > >> > > > > >> > >>>> --config-path=file://... > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> And this proposal advocate for > deploying > >> > Samza > >> > > > > more > >> > > > > > as > >> > > > > > >> > > > embedded > >> > > > > > >> > > > > >> > >>>>>> libraries in user application code > >> (ignoring > >> > > the > >> > > > > > >> > > terminology > >> > > > > > >> > > > > >> since > >> > > > > > >> > > > > >> > >>>>>> it is not the > >> > > > > > >> > > > > >> > >>>>> same > >> > > > > > >> > > > > >> > >>>>>> as the prototype code): > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> StreamTask task = new > >> MyStreamTask(configs); > >> > > > > Thread > >> > > > > > >> > thread > >> > > > > > >> > > = > >> > > > > > >> > > > > new > >> > > > > > >> > > > > >> > >>>>>> Thread(task); thread.start(); > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> I think both of these deployment modes > >> are > >> > > > > important > >> > > > > > >> for > >> > > > > > >> > > > > >> different > >> > > > > > >> > > > > >> > >>>>>> types > >> > > > > > >> > > > > >> > >>>>> of > >> > > > > > >> > > > > >> > >>>>>> users. That said, I think making Samza > >> > purely > >> > > > > > >> standalone > >> > > > > > >> > is > >> > > > > > >> > > > > still > >> > > > > > >> > > > > >> > >>>>>> sufficient for either runnable or > library > >> > > modes. > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> Guozhang > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay > >> Kreps > >> > < > >> > > > > > >> > > > j...@confluent.io> > >> > > > > > >> > > > > >> > wrote: > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code > >> example, > >> > it > >> > > > was > >> > > > > > >> > supposed > >> > > > > > >> > > > to > >> > > > > > >> > > > > >> look > >> > > > > > >> > > > > >> > >>>>>>> like > >> > > > > > >> > > > > >> > >>>>>>> this: > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties(); > >> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers", > >> > > > "localhost:4242"); > >> > > > > > >> > > > > >> StreamingConfig > >> > > > > > >> > > > > >> > >>>>>>> config = new StreamingConfig(props); > >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", > >> > > > "test-topic-2"); > >> > > > > > >> > > > > >> > >>>>>>> > >> > > config.processor(ExampleStreamProcessor.class); > >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new > >> > StringSerializer(), > >> > > > new > >> > > > > > >> > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming > >> > > > container = > >> > > > > > new > >> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming(config); > container.run(); > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>>> -Jay > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay > >> > Kreps < > >> > > > > > >> > > > j...@confluent.io > >> > > > > > >> > > > > > > >> > > > > > >> > > > > >> > >>>> wrote: > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> Hey guys, > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations > >> Chris > >> > > and > >> > > > I > >> > > > > > were > >> > > > > > >> > > having > >> > > > > > >> > > > > >> > >>>>>>>> around > >> > > > > > >> > > > > >> > >>>>>>> whether > >> > > > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a > >> kind > >> > > of > >> > > > > data > >> > > > > > >> > > > ingestion > >> > > > > > >> > > > > >> > >>>>> framework > >> > > > > > >> > > > > >> > >>>>>>> for > >> > > > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to > KIP-26 > >> > > > > "copycat"). > >> > > > > > >> This > >> > > > > > >> > > > kind > >> > > > > > >> > > > > of > >> > > > > > >> > > > > >> > >>>>>> combined > >> > > > > > >> > > > > >> > >>>>>>>> with complaints around config and > YARN > >> and > >> > > the > >> > > > > > >> > discussion > >> > > > > > >> > > > > >> around > >> > > > > > >> > > > > >> > >>>>>>>> how > >> > > > > > >> > > > > >> > >>>>> to > >> > > > > > >> > > > > >> > >>>>>>>> best do a standalone mode. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given > >> that > >> > > > Samza > >> > > > > > was > >> > > > > > >> > > > basically > >> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what > if > >> > you > >> > > > just > >> > > > > > >> > embraced > >> > > > > > >> > > > > that > >> > > > > > >> > > > > >> > >>>>>>>> and turned it > >> > > > > > >> > > > > >> > >>>>>> into > >> > > > > > >> > > > > >> > >>>>>>>> something less like a heavyweight > >> > framework > >> > > > and > >> > > > > > more > >> > > > > > >> > > like a > >> > > > > > >> > > > > >> > >>>>>>>> third > >> > > > > > >> > > > > >> > >>>>> Kafka > >> > > > > > >> > > > > >> > >>>>>>>> client--a kind of "producing > consumer" > >> > with > >> > > > > state > >> > > > > > >> > > > management > >> > > > > > >> > > > > >> > >>>>>> facilities. > >> > > > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a > >> complex > >> > > > stream > >> > > > > > >> > > processing > >> > > > > > >> > > > > >> > >>>>>>>> framework > >> > > > > > >> > > > > >> > >>>>>>> this > >> > > > > > >> > > > > >> > >>>>>>>> would actually be a very simple > thing, > >> not > >> > > > much > >> > > > > > more > >> > > > > > >> > > > > >> complicated > >> > > > > > >> > > > > >> > >>>>>>>> to > >> > > > > > >> > > > > >> > >>>>> use > >> > > > > > >> > > > > >> > >>>>>>> or > >> > > > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As > Chris > >> > said > >> > > > we > >> > > > > > >> thought > >> > > > > > >> > > > about > >> > > > > > >> > > > > >> it > >> > > > > > >> > > > > >> > >>>>>>>> a > >> > > > > > >> > > > > >> > >>>>> lot > >> > > > > > >> > > > > >> > >>>>>> of > >> > > > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream > >> > processing > >> > > > > > systems > >> > > > > > >> > were > >> > > > > > >> > > > > doing) > >> > > > > > >> > > > > >> > >>>>> seemed > >> > > > > > >> > > > > >> > >>>>>>> like > >> > > > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output > >> data > >> > to > >> > > > and > >> > > > > > from > >> > > > > > >> > the > >> > > > > > >> > > > > stream > >> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually > looked > >> > into > >> > > > how > >> > > > > > that > >> > > > > > >> > > would > >> > > > > > >> > > > > >> > >>>>>>>> work, > >> > > > > > >> > > > > >> > >>>>> Samza > >> > > > > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion > >> > > framework > >> > > > > > for a > >> > > > > > >> > > bunch > >> > > > > > >> > > > of > >> > > > > > >> > > > > >> > >>>>> reasons. > >> > > > > > >> > > > > >> > >>>>>> To > >> > > > > > >> > > > > >> > >>>>>>>> really do that right you need a > pretty > >> > > > different > >> > > > > > >> > internal > >> > > > > > >> > > > > data > >> > > > > > >> > > > > >> > >>>>>>>> model > >> > > > > > >> > > > > >> > >>>>>> and > >> > > > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split > them > >> and > >> > > had > >> > > > > an > >> > > > > > api > >> > > > > > >> > for > >> > > > > > >> > > > > Kafka > >> > > > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) > >> and a > >> > > > > separate > >> > > > > > >> api > >> > > > > > >> > > for > >> > > > > > >> > > > > >> Kafka > >> > > > > > >> > > > > >> > >>>>>>>> transformation (Samza). > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> This would also allow really > embracing > >> the > >> > > > same > >> > > > > > >> > > terminology > >> > > > > > >> > > > > and > >> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about the > >> > current > >> > > > > > state is > >> > > > > > >> > > that > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> > >>>>>>>> two > >> > > > > > >> > > > > >> > >>>>>>> systems > >> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology > >> like > >> > > > > "stream" > >> > > > > > vs > >> > > > > > >> > > > "topic" > >> > > > > > >> > > > > >> and > >> > > > > > >> > > > > >> > >>>>>>> different > >> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems means > you > >> > kind > >> > > > of > >> > > > > > have > >> > > > > > >> to > >> > > > > > >> > > > learn > >> > > > > > >> > > > > >> > >>>>>>>> Kafka's > >> > > > > > >> > > > > >> > >>>>>>> way, > >> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different > >> way, > >> > > > then > >> > > > > > kind > >> > > > > > >> of > >> > > > > > >> > > > > >> > >>>>>>>> understand > >> > > > > > >> > > > > >> > >>>>> how > >> > > > > > >> > > > > >> > >>>>>>> they > >> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having > walked > >> a > >> > few > >> > > > > > people > >> > > > > > >> > > through > >> > > > > > >> > > > > >> this > >> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to > >> get. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of > >> time > >> > on > >> > > > > > >> airplanes I > >> > > > > > >> > > > > hacked > >> > > > > > >> > > > > >> > >>>>>>>> up an ernest but still somewhat > >> incomplete > >> > > > > > prototype > >> > > > > > >> of > >> > > > > > >> > > > what > >> > > > > > >> > > > > >> > >>>>>>>> this would > >> > > > > > >> > > > > >> > >>>>> look > >> > > > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously > >> dumped > >> > > into > >> > > > > > Kafka > >> > > > > > >> as > >> > > > > > >> > > it > >> > > > > > >> > > > > >> > >>>>>>>> required a > >> > > > > > >> > > > > >> > >>>>>> few > >> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is > >> the > >> > > code: > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > > >> > > > > > >> > > >> > > > > > > >> > > > >> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org > >> > > > > > >> > > > > >> > >>>>> /apache/kafka/clients/streaming > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I > just > >> > > > > liberally > >> > > > > > >> > renamed > >> > > > > > >> > > > > >> > >>>>>>>> everything > >> > > > > > >> > > > > >> > >>>>> to > >> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no > >> regard > >> > > for > >> > > > > > >> > > > compatibility. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> To use this would be something like > >> this: > >> > > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties(); > >> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers", > >> > > > > "localhost:4242"); > >> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new > >> > > > > > >> > > > > >> > >>>>> StreamingConfig(props); > >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", > >> > > > > > >> > > > > >> > >>>>>>>> "test-topic-2"); > >> > > > > > >> > > > > >> config.processor(ExampleStreamProcessor.class); > >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new > >> > > > > > >> > > > > >> > >>>>>>>> StringSerializer(), new > >> > > StringDeserializer()); > >> > > > > > >> > > > KafkaStreaming > >> > > > > > >> > > > > >> > >>>>>> container = > >> > > > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config); > >> > container.run(); > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the > >> > > > SamzaContainer; > >> > > > > > >> > > > > StreamProcessor > >> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the class > >> names > >> > > in > >> > > > a > >> > > > > > file > >> > > > > > >> > and > >> > > > > > >> > > > then > >> > > > > > >> > > > > >> > >>>>>>>> having > >> > > > > > >> > > > > >> > >>>>>> the > >> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just > >> > > > > instantiate > >> > > > > > the > >> > > > > > >> > > > > container > >> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced > over > >> > > > however > >> > > > > > many > >> > > > > > >> > > > > instances > >> > > > > > >> > > > > >> > >>>>>>>> of > >> > > > > > >> > > > > >> > >>>>> this > >> > > > > > >> > > > > >> > >>>>>>> are > >> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an > instance > >> > dies, > >> > > > new > >> > > > > > >> tasks > >> > > > > > >> > > are > >> > > > > > >> > > > > >> added > >> > > > > > >> > > > > >> > >>>>>>>> to > >> > > > > > >> > > > > >> > >>>>> the > >> > > > > > >> > > > > >> > >>>>>>>> existing containers without shutting > >> them > >> > > > down). > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for > running > >> > this > >> > > > > stuff > >> > > > > > in > >> > > > > > >> > YARN > >> > > > > > >> > > > via > >> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS > >> using > >> > > some > >> > > > > of > >> > > > > > >> their > >> > > > > > >> > > > tools > >> > > > > > >> > > > > >> > >>>>>>>> but from the > >> > > > > > >> > > > > >> > >>>>>> point > >> > > > > > >> > > > > >> > >>>>>>> of > >> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these stream > >> > > > processing > >> > > > > > jobs > >> > > > > > >> > are > >> > > > > > >> > > > > just > >> > > > > > >> > > > > >> > >>>>>> stateless > >> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and > >> expand > >> > and > >> > > > > > contract > >> > > > > > >> > at > >> > > > > > >> > > > > will. > >> > > > > > >> > > > > >> > >>>>>>>> There > >> > > > > > >> > > > > >> > >>>>> is > >> > > > > > >> > > > > >> > >>>>>>> no > >> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details: > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> 1. It is only ~1300 lines of code, > it > >> > would > >> > > > get > >> > > > > > >> larger > >> > > > > > >> > > if > >> > > > > > >> > > > we > >> > > > > > >> > > > > >> > >>>>>>>> productionized but not vastly > larger. > >> We > >> > > > really > >> > > > > > do > >> > > > > > >> > get a > >> > > > > > >> > > > ton > >> > > > > > >> > > > > >> > >>>>>>>> of > >> > > > > > >> > > > > >> > >>>>>>> leverage > >> > > > > > >> > > > > >> > >>>>>>>> out of Kafka. > >> > > > > > >> > > > > >> > >>>>>>>> 2. Partition management is fully > >> > delegated > >> > > to > >> > > > > the > >> > > > > > >> new > >> > > > > > >> > > > > >> consumer. > >> > > > > > >> > > > > >> > >>>>> This > >> > > > > > >> > > > > >> > >>>>>>>> is nice since now any partition > >> > management > >> > > > > > strategy > >> > > > > > >> > > > > available > >> > > > > > >> > > > > >> > >>>>>>>> to > >> > > > > > >> > > > > >> > >>>>>> Kafka > >> > > > > > >> > > > > >> > >>>>>>>> consumer is also available to Samza > >> (and > >> > > vice > >> > > > > > versa) > >> > > > > > >> > and > >> > > > > > >> > > > > with > >> > > > > > >> > > > > >> > >>>>>>>> the > >> > > > > > >> > > > > >> > >>>>>>> exact > >> > > > > > >> > > > > >> > >>>>>>>> same configs. > >> > > > > > >> > > > > >> > >>>>>>>> 3. It supports state as well as > state > >> > reuse > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is > >> > thought > >> > > > > > >> provoking. > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> -Jay > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, > Chris > >> > > > > Riccomini < > >> > > > > > >> > > > > >> > >>>>>> criccom...@apache.org> > >> > > > > > >> > > > > >> > >>>>>>>> wrote: > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Hey all, > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with > Samza > >> > > > > engineers > >> > > > > > at > >> > > > > > >> > > > LinkedIn > >> > > > > > >> > > > > >> > >>>>>>>>> and > >> > > > > > >> > > > > >> > >>>>>>> Confluent > >> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few > observations > >> > and > >> > > > > would > >> > > > > > >> like > >> > > > > > >> > to > >> > > > > > >> > > > > >> > >>>>>>>>> propose > >> > > > > > >> > > > > >> > >>>>> some > >> > > > > > >> > > > > >> > >>>>>>>>> changes. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I > >> want to > >> > > > call > >> > > > > > out > >> > > > > > >> > about > >> > > > > > >> > > > > >> > >>>>>>>>> Samza's > >> > > > > > >> > > > > >> > >>>>>> design, > >> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some > changes. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic > >> > > > deployment > >> > > > > > >> system. > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable. > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza's > >> SystemConsumer/SystemProducer > >> > and > >> > > > > > Kafka's > >> > > > > > >> > > > consumer > >> > > > > > >> > > > > >> > >>>>>>>>> APIs > >> > > > > > >> > > > > >> > >>>>> are > >> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same > >> > problems. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are > related, > >> > but > >> > > > I'll > >> > > > > > >> > address > >> > > > > > >> > > > them > >> > > > > > >> > > > > >> in > >> > > > > > >> > > > > >> > >>>>> order. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Deployment > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use > of a > >> > > > dynamic > >> > > > > > >> > > deployment > >> > > > > > >> > > > > >> > >>>>>>>>> scheduler > >> > > > > > >> > > > > >> > >>>>>> such > >> > > > > > >> > > > > >> > >>>>>>>>> as > >> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially > >> built > >> > > > > Samza, > >> > > > > > we > >> > > > > > >> > bet > >> > > > > > >> > > > that > >> > > > > > >> > > > > >> > >>>>>>>>> there > >> > > > > > >> > > > > >> > >>>>>> would > >> > > > > > >> > > > > >> > >>>>>>>>> be > >> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and > >> we > >> > > could > >> > > > > > >> support > >> > > > > > >> > > > them, > >> > > > > > >> > > > > >> and > >> > > > > > >> > > > > >> > >>>>>>>>> the > >> > > > > > >> > > > > >> > >>>>>> rest > >> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are > >> many > >> > > > > > >> variations. > >> > > > > > >> > > > > >> > >>>>>>>>> Furthermore, > >> > > > > > >> > > > > >> > >>>>>> many > >> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just start > >> their > >> > > > > > processors > >> > > > > > >> > like > >> > > > > > >> > > > > normal > >> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional > >> > > > deployment > >> > > > > > >> scripts > >> > > > > > >> > > > such > >> > > > > > >> > > > > as > >> > > > > > >> > > > > >> > >>>>>>>>> Fabric, > >> > > > > > >> > > > > >> > >>>>>> Chef, > >> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment > >> system > >> > > on > >> > > > > > users > >> > > > > > >> > makes > >> > > > > > >> > > > the > >> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really > painful > >> for > >> > > > first > >> > > > > > time > >> > > > > > >> > > > users. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement > >> was > >> > > also > >> > > > a > >> > > > > > bit > >> > > > > > >> of > >> > > > > > >> > a > >> > > > > > >> > > > > >> > >>>>>>>>> mis-fire > >> > > > > > >> > > > > >> > >>>>>> because > >> > > > > > >> > > > > >> > >>>>>>>>> of > >> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding > between > >> > the > >> > > > > > nature of > >> > > > > > >> > > batch > >> > > > > > >> > > > > >> jobs > >> > > > > > >> > > > > >> > >>>>>>>>> and > >> > > > > > >> > > > > >> > >>>>>>> stream > >> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made > >> > > conscious > >> > > > > > effort > >> > > > > > >> to > >> > > > > > >> > > > favor > >> > > > > > >> > > > > >> > >>>>>>>>> the > >> > > > > > >> > > > > >> > >>>>>> Hadoop > >> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things, > >> since > >> > it > >> > > > > worked > >> > > > > > >> and > >> > > > > > >> > > was > >> > > > > > >> > > > > well > >> > > > > > >> > > > > >> > >>>>>>> understood. > >> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that > >> batch > >> > > jobs > >> > > > > > have a > >> > > > > > >> > > > definite > >> > > > > > >> > > > > >> > >>>>>> beginning, > >> > > > > > >> > > > > >> > >>>>>>>>> and > >> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs > don't > >> > > > > (usually). > >> > > > > > >> This > >> > > > > > >> > > > leads > >> > > > > > >> > > > > to > >> > > > > > >> > > > > >> > >>>>>>>>> a > >> > > > > > >> > > > > >> > >>>>> much > >> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for > stream > >> > > > > processors. > >> > > > > > >> You > >> > > > > > >> > > > > >> basically > >> > > > > > >> > > > > >> > >>>>>>>>> just > >> > > > > > >> > > > > >> > >>>>>>> need > >> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the > >> processor, > >> > and > >> > > > > start > >> > > > > > >> it. > >> > > > > > >> > > The > >> > > > > > >> > > > > way > >> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's > no > >> > > concept > >> > > > > of > >> > > > > > a > >> > > > > > >> > > cluster > >> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always > >> > > > > > >> > > > > >> > >>>>>> add > >> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with > >> coupling > >> > > > Samza > >> > > > > > with > >> > > > > > >> a > >> > > > > > >> > > > > >> scheduler > >> > > > > > >> > > > > >> > >>>>>>>>> is > >> > > > > > >> > > > > >> > >>>>>> that > >> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to > >> handle > >> > > > > > deployment. > >> > > > > > >> > > This > >> > > > > > >> > > > > >> pulls > >> > > > > > >> > > > > >> > >>>>>>>>> in a > >> > > > > > >> > > > > >> > >>>>>>> bunch > >> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration > >> > > distribution > >> > > > > > (config > >> > > > > > >> > > > > stream), > >> > > > > > >> > > > > >> > >>>>>>>>> shell > >> > > > > > >> > > > > >> > >>>>>>> scrips > >> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), > packaging > >> > (all > >> > > > the > >> > > > > > .tgz > >> > > > > > >> > > > stuff), > >> > > > > > >> > > > > >> etc. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic > >> > > > deployment > >> > > > > > was > >> > > > > > >> to > >> > > > > > >> > > > > support > >> > > > > > >> > > > > >> > >>>>>>>>> data locality. If you want to have > >> > > locality, > >> > > > > you > >> > > > > > >> need > >> > > > > > >> > to > >> > > > > > >> > > > put > >> > > > > > >> > > > > >> > >>>>>>>>> your > >> > > > > > >> > > > > >> > >>>>>> processors > >> > > > > > >> > > > > >> > >>>>>>>>> close to the data they're > processing. > >> > Upon > >> > > > > > further > >> > > > > > >> > > > > >> > >>>>>>>>> investigation, > >> > > > > > >> > > > > >> > >>>>>>> though, > >> > > > > > >> > > > > >> > >>>>>>>>> this feature is not that beneficial. > >> > There > >> > > is > >> > > > > > some > >> > > > > > >> > good > >> > > > > > >> > > > > >> > >>>>>>>>> discussion > >> > > > > > >> > > > > >> > >>>>>> about > >> > > > > > >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335. > >> > Again, > >> > > we > >> > > > > > took > >> > > > > > >> the > >> > > > > > >> > > > > >> > >>>>>>>>> Map/Reduce > >> > > > > > >> > > > > >> > >>>>>> path, > >> > > > > > >> > > > > >> > >>>>>>>>> but > >> > > > > > >> > > > > >> > >>>>>>>>> there are some fundamental > differences > >> > > > between > >> > > > > > HDFS > >> > > > > > >> > and > >> > > > > > >> > > > > Kafka. > >> > > > > > >> > > > > >> > >>>>>>>>> HDFS > >> > > > > > >> > > > > >> > >>>>>> has > >> > > > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions. > >> This > >> > > > leads > >> > > > > to > >> > > > > > >> less > >> > > > > > >> > > > > >> > >>>>>>>>> optimization potential with stream > >> > > processors > >> > > > > on > >> > > > > > top > >> > > > > > >> > of > >> > > > > > >> > > > > Kafka. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a > crutch. > >> > > Samza > >> > > > > > doesn't > >> > > > > > >> > > have > >> > > > > > >> > > > > any > >> > > > > > >> > > > > >> > >>>>>>>>> built > >> > > > > > >> > > > > >> > >>>>> in > >> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it > >> > depends > >> > > on > >> > > > > the > >> > > > > > >> > > dynamic > >> > > > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to > handle > >> > > > restarts > >> > > > > > >> when a > >> > > > > > >> > > > > >> > >>>>>>>>> processor dies. This has > >> > > > > > >> > > > > >> > >>>>>>> made > >> > > > > > >> > > > > >> > >>>>>>>>> it very difficult to write a > >> standalone > >> > > Samza > >> > > > > > >> > container > >> > > > > > >> > > > > >> > >>>> (SAMZA-516). > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Pluggability > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, > >> but I > >> > > > think > >> > > > > > that > >> > > > > > >> > > we've > >> > > > > > >> > > > > >> gone > >> > > > > > >> > > > > >> > >>>>>>>>> too > >> > > > > > >> > > > > >> > >>>>>> far > >> > > > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has: > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config. > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics. > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems. > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems > >> > > > (SystemConsumer, > >> > > > > > >> > > > > SystemProducer, > >> > > > > > >> > > > > >> > >>>> etc). > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes. > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines. > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just > about > >> > every > >> > > > > > >> component > >> > > > > > >> > > > > >> > >>>>> (MessageChooser, > >> > > > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper, > >> > > ConfigRewriter, > >> > > > > > etc). > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've > >> > forgotten, > >> > > as > >> > > > > > well. > >> > > > > > >> > Some > >> > > > > > >> > > > of > >> > > > > > >> > > > > >> > >>>>>>>>> these > >> > > > > > >> > > > > >> > >>>>> are > >> > > > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not to > >> be. > >> > > This > >> > > > > all > >> > > > > > >> comes > >> > > > > > >> > > at > >> > > > > > >> > > > a > >> > > > > > >> > > > > >> cost: > >> > > > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is > making > >> it > >> > > > harder > >> > > > > > for > >> > > > > > >> > our > >> > > > > > >> > > > > users > >> > > > > > >> > > > > >> > >>>>>>>>> to > >> > > > > > >> > > > > >> > >>>>> pick > >> > > > > > >> > > > > >> > >>>>>> up > >> > > > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It > also > >> > makes > >> > > > it > >> > > > > > >> > difficult > >> > > > > > >> > > > for > >> > > > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about > what > >> the > >> > > > > > >> > > characteristics > >> > > > > > >> > > > of > >> > > > > > >> > > > > >> > >>>>>>>>> the container (since the > >> characteristics > >> > > > change > >> > > > > > >> > > depending > >> > > > > > >> > > > on > >> > > > > > >> > > > > >> > >>>>>>>>> which plugins are use). > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are > most > >> > > visible > >> > > > > in > >> > > > > > the > >> > > > > > >> > > > System > >> > > > > > >> > > > > >> APIs. > >> > > > > > >> > > > > >> > >>>>> What > >> > > > > > >> > > > > >> > >>>>>>>>> Samza really requires to be > >> functional is > >> > > > Kafka > >> > > > > > as > >> > > > > > >> its > >> > > > > > >> > > > > >> > >>>>>>>>> transport > >> > > > > > >> > > > > >> > >>>>>> layer. > >> > > > > > >> > > > > >> > >>>>>>>>> But > >> > > > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use > >> cases > >> > > into > >> > > > > one > >> > > > > > >> API: > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka. > >> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports both > >> of > >> > > these > >> > > > > use > >> > > > > > >> > cases. > >> > > > > > >> > > > The > >> > > > > > >> > > > > >> > >>>>>>>>> problem > >> > > > > > >> > > > > >> > >>>>>> is, > >> > > > > > >> > > > > >> > >>>>>>>>> we > >> > > > > > >> > > > > >> > >>>>>>>>> actually want different features for > >> each > >> > > use > >> > > > > > case. > >> > > > > > >> By > >> > > > > > >> > > > > >> papering > >> > > > > > >> > > > > >> > >>>>>>>>> over > >> > > > > > >> > > > > >> > >>>>>>> these > >> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a > single > >> > API, > >> > > > > we've > >> > > > > > >> > > > introduced > >> > > > > > >> > > > > a > >> > > > > > >> > > > > >> > >>>>>>>>> ton of > >> > > > > > >> > > > > >> > >>>>>>> leaky > >> > > > > > >> > > > > >> > >>>>>>>>> abstractions. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like > in > >> (2) > >> > > is > >> > > > to > >> > > > > > have > >> > > > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for > >> > offsets > >> > > > > (like > >> > > > > > >> > Kafka). > >> > > > > > >> > > > > This > >> > > > > > >> > > > > >> > >>>>>>>>> would be at odds > >> > > > > > >> > > > > >> > >>>>> with > >> > > > > > >> > > > > >> > >>>>>>> (1), > >> > > > > > >> > > > > >> > >>>>>>>>> though, since different systems have > >> > > > different > >> > > > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors. > >> > > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the > >> mailing > >> > > list > >> > > > > and > >> > > > > > >> the > >> > > > > > >> > > SQL > >> > > > > > >> > > > > >> JIRAs > >> > > > > > >> > > > > >> > >>>>> about > >> > > > > > >> > > > > >> > >>>>>>> the > >> > > > > > >> > > > > >> > >>>>>>>>> need for this. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for > >> > > replayability. > >> > > > > > Kafka > >> > > > > > >> > > allows > >> > > > > > >> > > > us > >> > > > > > >> > > > > >> to > >> > > > > > >> > > > > >> > >>>>> rewind > >> > > > > > >> > > > > >> > >>>>>>>>> when > >> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other > systems > >> > > don't. > >> > > > In > >> > > > > > some > >> > > > > > >> > > > cases, > >> > > > > > >> > > > > >> > >>>>>>>>> systems > >> > > > > > >> > > > > >> > >>>>>>> return > >> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g. > >> > > > > > >> WikipediaSystemConsumer) > >> > > > > > >> > > > > because > >> > > > > > >> > > > > >> > >>>>>>>>> they > >> > > > > > >> > > > > >> > >>>>>> have > >> > > > > > >> > > > > >> > >>>>>>> no > >> > > > > > >> > > > > >> > >>>>>>>>> offsets. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. > Kafka > >> > > > supports > >> > > > > > >> > > > > partitioning, > >> > > > > > >> > > > > >> > >>>>>>>>> but > >> > > > > > >> > > > > >> > >>>>> many > >> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by > >> having a > >> > > > single > >> > > > > > >> > > partition > >> > > > > > >> > > > > for > >> > > > > > >> > > > > >> > >>>>>>>>> those systems. Still, other systems > >> model > >> > > > > > >> partitioning > >> > > > > > >> > > > > >> > >>>> differently (e.g. > >> > > > > > >> > > > > >> > >>>>>>>>> Kinesis). > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a > >> mess. > >> > > > > > Creating > >> > > > > > >> > > streams > >> > > > > > >> > > > > in > >> > > > > > >> > > > > >> a > >> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost > >> impossible. > >> > > As > >> > > > is > >> > > > > > >> > modeling > >> > > > > > >> > > > > >> > >>>>>>>>> metadata > >> > > > > > >> > > > > >> > >>>>> for > >> > > > > > >> > > > > >> > >>>>>>> the > >> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor, > >> partitions, > >> > > > > location, > >> > > > > > >> > etc). > >> > > > > > >> > > > The > >> > > > > > >> > > > > >> > >>>>>>>>> list > >> > > > > > >> > > > > >> > >>>>> goes > >> > > > > > >> > > > > >> > >>>>>>> on. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Duplicate work > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing > >> Samza, > >> > > > > Kafka's > >> > > > > > >> > > consumer > >> > > > > > >> > > > > and > >> > > > > > >> > > > > >> > >>>>> producer > >> > > > > > >> > > > > >> > >>>>>>>>> APIs > >> > > > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. > On > >> the > >> > > > > > >> > consumer-side, > >> > > > > > >> > > > you > >> > > > > > >> > > > > >> > >>>>>>>>> had two > >> > > > > > >> > > > > >> > >>>>>>>>> options: use the high level > consumer, > >> or > >> > > the > >> > > > > > simple > >> > > > > > >> > > > > consumer. > >> > > > > > >> > > > > >> > >>>>>>>>> The > >> > > > > > >> > > > > >> > >>>>>>> problem > >> > > > > > >> > > > > >> > >>>>>>>>> with the high-level consumer was > that > >> it > >> > > > > > controlled > >> > > > > > >> > your > >> > > > > > >> > > > > >> > >>>>>>>>> offsets, partition assignments, and > >> the > >> > > order > >> > > > > in > >> > > > > > >> which > >> > > > > > >> > > you > >> > > > > > >> > > > > >> > >>>>>>>>> received messages. The > >> > > > > > >> > > > > >> > >>>>> problem > >> > > > > > >> > > > > >> > >>>>>>>>> with > >> > > > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not > >> > > simple. > >> > > > > It's > >> > > > > > >> > basic. > >> > > > > > >> > > > You > >> > > > > > >> > > > > >> > >>>>>>>>> end up > >> > > > > > >> > > > > >> > >>>>>>> having > >> > > > > > >> > > > > >> > >>>>>>>>> to handle a lot of really low-level > >> stuff > >> > > > that > >> > > > > > you > >> > > > > > >> > > > > shouldn't. > >> > > > > > >> > > > > >> > >>>>>>>>> We > >> > > > > > >> > > > > >> > >>>>>> spent a > >> > > > > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's > >> > > > KafkaSystemConsumer > >> > > > > > very > >> > > > > > >> > > > robust. > >> > > > > > >> > > > > >> It > >> > > > > > >> > > > > >> > >>>>>>>>> also allows us to support some cool > >> > > features: > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and > >> > > > > > prioritization. > >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition > >> assignment > >> > > to > >> > > > > > support > >> > > > > > >> > > > joins, > >> > > > > > >> > > > > >> > >>>>>>>>> global > >> > > > > > >> > > > > >> > >>>>>> state > >> > > > > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)), > etc. > >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset > >> > checkpointing. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time > is > >> > that > >> > > > > these > >> > > > > > >> > > features > >> > > > > > >> > > > > >> > >>>>>>>>> should > >> > > > > > >> > > > > >> > >>>>>>> actually > >> > > > > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka > consumers > >> > (not > >> > > > just > >> > > > > > >> Samza > >> > > > > > >> > > > stream > >> > > > > > >> > > > > >> > >>>>>> processors) > >> > > > > > >> > > > > >> > >>>>>>>>> end up wanting to do things like > joins > >> > and > >> > > > > > partition > >> > > > > > >> > > > > >> > >>>>>>>>> assignment. The > >> > > > > > >> > > > > >> > >>>>>>> Kafka > >> > > > > > >> > > > > >> > >>>>>>>>> community has come to the same > >> > conclusion. > >> > > > > > They're > >> > > > > > >> > > adding > >> > > > > > >> > > > a > >> > > > > > >> > > > > >> ton > >> > > > > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka > >> consumer > >> > > > > > >> > > implementation. > >> > > > > > >> > > > > To a > >> > > > > > >> > > > > >> > >>>>>>>>> large extent, > >> > > > > > >> > > > > >> > >>>>> it's > >> > > > > > >> > > > > >> > >>>>>>>>> duplicate work to what we've already > >> done > >> > > in > >> > > > > > Samza. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up > taking > >> a > >> > > very > >> > > > > > similar > >> > > > > > >> > > > > approach > >> > > > > > >> > > > > >> > >>>>>>>>> to > >> > > > > > >> > > > > >> > >>>>>> Samza's > >> > > > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager > implementation > >> for > >> > > > > > handling > >> > > > > > >> > > offset > >> > > > > > >> > > > > >> > >>>>>> checkpointing. > >> > > > > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset > >> management > >> > > > > feature > >> > > > > > >> > stores > >> > > > > > >> > > > > >> offset > >> > > > > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows > >> you to > >> > > > fetch > >> > > > > > them > >> > > > > > >> > > from > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> > >>>>>>>>> broker. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, > >> since > >> > we > >> > > > > could > >> > > > > > >> have > >> > > > > > >> > > > shared > >> > > > > > >> > > > > >> > >>>>>>>>> the > >> > > > > > >> > > > > >> > >>>>> work > >> > > > > > >> > > > > >> > >>>>>> if > >> > > > > > >> > > > > >> > >>>>>>>>> it > >> > > > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the > >> get-go. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Vision > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather > >> radical > >> > > > > > proposal. > >> > > > > > >> > Samza > >> > > > > > >> > > > is > >> > > > > > >> > > > > >> > >>>>> relatively > >> > > > > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to > >> say > >> > > that > >> > > > > > we're > >> > > > > > >> > > near a > >> > > > > > >> > > > > 1.0 > >> > > > > > >> > > > > >> > >>>>>> release. > >> > > > > > >> > > > > >> > >>>>>>>>> I'd > >> > > > > > >> > > > > >> > >>>>>>>>> like to propose that we take what > >> we've > >> > > > > learned, > >> > > > > > and > >> > > > > > >> > > begin > >> > > > > > >> > > > > >> > >>>>>>>>> thinking > >> > > > > > >> > > > > >> > >>>>>>> about > >> > > > > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we > >> change if > >> > > we > >> > > > > were > >> > > > > > >> > > starting > >> > > > > > >> > > > > >> from > >> > > > > > >> > > > > >> > >>>>>> scratch? > >> > > > > > >> > > > > >> > >>>>>>>>> My > >> > > > > > >> > > > > >> > >>>>>>>>> proposal is to: > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* > >> way > >> > to > >> > > > run > >> > > > > > Samza > >> > > > > > >> > > > > >> > >>>>>>>>> processors, and eliminate all direct > >> > > > > dependences > >> > > > > > on > >> > > > > > >> > > YARN, > >> > > > > > >> > > > > >> Mesos, > >> > > > > > >> > > > > >> > >>>> etc. > >> > > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support > >> only > >> > > > Kafka > >> > > > > > as > >> > > > > > >> the > >> > > > > > >> > > > > stream > >> > > > > > >> > > > > >> > >>>>>> processing > >> > > > > > >> > > > > >> > >>>>>>>>> layer. > >> > > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, > logging, > >> > > > > > >> serialization, > >> > > > > > >> > > and > >> > > > > > >> > > > > >> > >>>>>>>>> config > >> > > > > > >> > > > > >> > >>>>>>> systems, > >> > > > > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues > that > >> I > >> > > > > outlined > >> > > > > > >> > above. > >> > > > > > >> > > It > >> > > > > > >> > > > > >> > >>>>>>>>> should > >> > > > > > >> > > > > >> > >>>>> also > >> > > > > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty > >> > > > dramatically. > >> > > > > > >> > > Supporting > >> > > > > > >> > > > > >> only > >> > > > > > >> > > > > >> > >>>>>>>>> a standalone container will allow > >> Samza > >> > to > >> > > be > >> > > > > > >> executed > >> > > > > > >> > > on > >> > > > > > >> > > > > YARN > >> > > > > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using > >> > > > Marathon/Aurora), > >> > > > > or > >> > > > > > >> most > >> > > > > > >> > > > other > >> > > > > > >> > > > > >> > >>>>>>>>> in-house > >> > > > > > >> > > > > >> > >>>>>>> deployment > >> > > > > > >> > > > > >> > >>>>>>>>> systems. This should make life a lot > >> > easier > >> > > > for > >> > > > > > new > >> > > > > > >> > > users. > >> > > > > > >> > > > > >> > >>>>>>>>> Imagine > >> > > > > > >> > > > > >> > >>>>>>> having > >> > > > > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without > YARN. > >> > The > >> > > > drop > >> > > > > > in > >> > > > > > >> > > mailing > >> > > > > > >> > > > > >> list > >> > > > > > >> > > > > >> > >>>>>> traffic > >> > > > > > >> > > > > >> > >>>>>>>>> will be pretty dramatic. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long > >> overdue to > >> > > me. > >> > > > > The > >> > > > > > >> > > reality > >> > > > > > >> > > > > is, > >> > > > > > >> > > > > >> > >>>>> everyone > >> > > > > > >> > > > > >> > >>>>>>>>> that > >> > > > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with > >> Kafka. > >> > We > >> > > > > > basically > >> > > > > > >> > > > require > >> > > > > > >> > > > > >> it > >> > > > > > >> > > > > >> > >>>>>> already > >> > > > > > >> > > > > >> > >>>>>>> in > >> > > > > > >> > > > > >> > >>>>>>>>> order for most features to work. > Those > >> > that > >> > > > are > >> > > > > > >> using > >> > > > > > >> > > > other > >> > > > > > >> > > > > >> > >>>>>>>>> systems > >> > > > > > >> > > > > >> > >>>>>> are > >> > > > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into > >> Kafka > >> > > (1), > >> > > > > and > >> > > > > > >> then > >> > > > > > >> > > > they > >> > > > > > >> > > > > do > >> > > > > > >> > > > > >> > >>>>>>>>> the processing on top. There is > >> already > >> > > > > > discussion ( > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > > >> > > > > > >> > > >> > > > > > > >> > > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851 > >> > > > > > >> > > > > >> > >>>>> 767 > >> > > > > > >> > > > > >> > >>>>>>>>> ) > >> > > > > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into > Kafka > >> > > > extremely > >> > > > > > >> easy. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with > >> > Kafka, > >> > > > we > >> > > > > > can > >> > > > > > >> > > > leverage > >> > > > > > >> > > > > a > >> > > > > > >> > > > > >> > >>>>>>>>> ton of > >> > > > > > >> > > > > >> > >>>>>>> their > >> > > > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to > >> maintain > >> > > our > >> > > > > own > >> > > > > > >> > config, > >> > > > > > >> > > > > >> > >>>>>>>>> metrics, > >> > > > > > >> > > > > >> > >>>>> etc. > >> > > > > > >> > > > > >> > >>>>>>> We > >> > > > > > >> > > > > >> > >>>>>>>>> can all share the same libraries, > and > >> > make > >> > > > them > >> > > > > > >> > better. > >> > > > > > >> > > > This > >> > > > > > >> > > > > >> > >>>>>>>>> will > >> > > > > > >> > > > > >> > >>>>> also > >> > > > > > >> > > > > >> > >>>>>>>>> allow us to share the > >> consumer/producer > >> > > APIs, > >> > > > > and > >> > > > > > >> will > >> > > > > > >> > > let > >> > > > > > >> > > > > us > >> > > > > > >> > > > > >> > >>>>> leverage > >> > > > > > >> > > > > >> > >>>>>>>>> their offset management and > partition > >> > > > > management, > >> > > > > > >> > rather > >> > > > > > >> > > > > than > >> > > > > > >> > > > > >> > >>>>>>>>> having > >> > > > > > >> > > > > >> > >>>>>> our > >> > > > > > >> > > > > >> > >>>>>>>>> own. All of the coordinator stream > >> code > >> > > would > >> > > > > go > >> > > > > > >> away, > >> > > > > > >> > > as > >> > > > > > >> > > > > >> would > >> > > > > > >> > > > > >> > >>>>>>>>> most > >> > > > > > >> > > > > >> > >>>>>> of > >> > > > > > >> > > > > >> > >>>>>>>>> the > >> > > > > > >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably > >> have > >> > to > >> > > > push > >> > > > > > some > >> > > > > > >> > > > > partition > >> > > > > > >> > > > > >> > >>>>>>> management > >> > > > > > >> > > > > >> > >>>>>>>>> features into the Kafka broker, but > >> > they're > >> > > > > > already > >> > > > > > >> > > moving > >> > > > > > >> > > > > in > >> > > > > > >> > > > > >> > >>>>>>>>> that direction with the new consumer > >> API. > >> > > The > >> > > > > > >> features > >> > > > > > >> > > we > >> > > > > > >> > > > > have > >> > > > > > >> > > > > >> > >>>>>>>>> for > >> > > > > > >> > > > > >> > >>>>>> partition > >> > > > > > >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza, > and > >> > seem > >> > > > > like > >> > > > > > >> they > >> > > > > > >> > > > should > >> > > > > > >> > > > > >> be > >> > > > > > >> > > > > >> > >>>>>>>>> in > >> > > > > > >> > > > > >> > >>>>>> Kafka > >> > > > > > >> > > > > >> > >>>>>>>>> anyway. There will always be some > >> niche > >> > > > usages > >> > > > > > which > >> > > > > > >> > > will > >> > > > > > >> > > > > >> > >>>>>>>>> require > >> > > > > > >> > > > > >> > >>>>>> extra > >> > > > > > >> > > > > >> > >>>>>>>>> care and hence full control over > >> > partition > >> > > > > > >> assignments > >> > > > > > >> > > > much > >> > > > > > >> > > > > >> > >>>>>>>>> like the > >> > > > > > >> > > > > >> > >>>>>>> Kafka > >> > > > > > >> > > > > >> > >>>>>>>>> low level consumer api. These would > >> > > continue > >> > > > to > >> > > > > > be > >> > > > > > >> > > > > supported. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> These items will be good for the > Samza > >> > > > > community. > >> > > > > > >> > > They'll > >> > > > > > >> > > > > make > >> > > > > > >> > > > > >> > >>>>>>>>> Samza easier to use, and make it > >> easier > >> > for > >> > > > > > >> developers > >> > > > > > >> > > to > >> > > > > > >> > > > > add > >> > > > > > >> > > > > >> > >>>>>>>>> new features. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large > (and > >> > > > somewhat > >> > > > > > >> > backwards > >> > > > > > >> > > > > >> > >>>>> incompatible > >> > > > > > >> > > > > >> > >>>>>>>>> change). If we choose to go this > >> route, > >> > > it's > >> > > > > > >> important > >> > > > > > >> > > > that > >> > > > > > >> > > > > we > >> > > > > > >> > > > > >> > >>>>> openly > >> > > > > > >> > > > > >> > >>>>>>>>> communicate how we're going to > >> provide a > >> > > > > > migration > >> > > > > > >> > path > >> > > > > > >> > > > from > >> > > > > > >> > > > > >> > >>>>>>>>> the > >> > > > > > >> > > > > >> > >>>>>>> existing > >> > > > > > >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make > >> > > incompatible > >> > > > > > >> > changes). > >> > > > > > >> > > I > >> > > > > > >> > > > > >> think > >> > > > > > >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need to > >> > > provide a > >> > > > > > >> wrapper > >> > > > > > >> > to > >> > > > > > >> > > > > allow > >> > > > > > >> > > > > >> > >>>>>>>>> existing StreamTask implementations > to > >> > > > continue > >> > > > > > >> > running > >> > > > > > >> > > on > >> > > > > > >> > > > > the > >> > > > > > >> > > > > >> > >>>> new container. > >> > > > > > >> > > > > >> > >>>>>>> It's > >> > > > > > >> > > > > >> > >>>>>>>>> also important that we openly > >> communicate > >> > > > about > >> > > > > > >> > timing, > >> > > > > > >> > > > and > >> > > > > > >> > > > > >> > >>>>>>>>> stages > >> > > > > > >> > > > > >> > >>>>> of > >> > > > > > >> > > > > >> > >>>>>>> the > >> > > > > > >> > > > > >> > >>>>>>>>> migration. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure > you > >> > have > >> > > > > > opinions. > >> > > > > > >> > :) > >> > > > > > >> > > > > Please > >> > > > > > >> > > > > >> > >>>>>>>>> send > >> > > > > > >> > > > > >> > >>>>>> your > >> > > > > > >> > > > > >> > >>>>>>>>> thoughts and feedback. > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>>> Cheers, > >> > > > > > >> > > > > >> > >>>>>>>>> Chris > >> > > > > > >> > > > > >> > >>>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>>> > >> > > > > > >> > > > > >> > >>>>>>> > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>>> -- > >> > > > > > >> > > > > >> > >>>>>> -- Guozhang > >> > > > > > >> > > > > >> > >>>>>> > >> > > > > > >> > > > > >> > >>>>> > >> > > > > > >> > > > > >> > >>>> > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > >> > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > >> > > > >> > > > > > >> > > >> > > > > > >> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > > >