That was meant to be "thread" not "threat". lol. :)

On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <criccom...@apache.org> wrote:
> Hey all,
>
> I want to start by saying that I'm absolutely thrilled to be a part of this community. The amount of level-headed, thoughtful, educated discussion that's gone on over the past ~10 days is overwhelming. Wonderful.
>
> It seems like discussion is waning a bit, and we've reached some conclusions. There are several key emails in this threat, which I want to call out:
>
> 1. Jakob's summary of the three potential ways forward.
>    http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
> 2. Julian's call out that we should be focusing on community over code.
>    http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
> 3. Martin's summary about the benefits of merging communities.
>    http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
> 4. Jakob's comments about the distinction between community and code paths.
>    http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E
>
> I agree with the comments on all of these emails. I think Martin's summary of his position aligns very closely with my own. To that end, I think we should get concrete about what the proposal is, and call a vote on it. Given that Jay, Martin, and I seem to be aligning fairly closely, I think we should start with:
>
> 1. [community] Make Samza a subproject of Kafka.
> 2. [community] Make all Samza PMC/committers committers of the subproject.
> 3. [community] Migrate Samza's website/documentation into Kafka's.
> 4. [code] Have the Samza community and the Kafka community start a from-scratch reboot together in the new Kafka subproject. We can borrow/copy & paste significant chunks of code from Samza's code base.
> 5. [code] The subproject would intentionally eliminate support for both other streaming systems and all deployment systems.
> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26 (copy cat)
> 7. [code] Attempt to provide a bridge from the new subproject's processor interface to our legacy StreamTask interface.
> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka subproject that has a fault-tolerant container with state management.
>
> It's likely that (6) and (7) won't be fully drop-in. Still, the closer we can get, the better it's going to be for our existing community.
>
> One thing that I didn't touch on with (2) is whether any Samza PMC members should be rolled into Kafka PMC membership as well (though, Jay and Jakob are already PMC members on both). I think that Samza's community deserves a voice on the PMC, so I'd propose that we roll at least a few PMC members into the Kafka PMC, but I don't have a strong framework for which people to pick.
>
> Before (8), I think that Samza's TLP can continue to commit bug fixes and patches as it sees fit, provided that we openly communicate that we won't necessarily migrate new features to the new subproject, and that the TLP will be shut down after the migration to the Kafka subproject occurs.
>
> Jakob, I could use your guidance here about how to achieve this from an Apache process perspective (sorry).
>
> * Should I just call a vote on this proposal?
> * Should it happen on dev or private?
> * Do committers have binding votes, or just PMC?
>
> Having trouble finding much detail on the Apache wikis. :(
>
> Cheers,
> Chris
>
> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
>> Thanks, Jay. This argument persuaded me actually.
>> :)
>>
>> Fang, Yan
>> yanfang...@gmail.com
>>
>> On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <j...@confluent.io> wrote:
>>
>> > Hey Yan,
>> >
>> > Yeah philosophically I think the argument is that you should capture the stream in Kafka independent of the transformation. This is obviously a Kafka-centric view point.
>> >
>> > Advantages of this:
>> > - In practice I think this is what e.g. Storm people often end up doing anyway. You usually need to throttle any access to a live serving database.
>> > - Can have multiple subscribers and they get the same thing without additional load on the source system.
>> > - Applications can tap into the stream if need be by subscribing.
>> > - You can debug your transformation by tailing the Kafka topic with the console consumer
>> > - Can tee off the same data stream for batch analysis or Lambda arch style re-processing
>> >
>> > The disadvantage is that it will use Kafka resources. But the idea is eventually you will have multiple subscribers to any data source (at least for monitoring) so you will end up there soon enough anyway.
>> >
>> > Down the road the technical benefit is that I think it gives us a good path towards end-to-end exactly once semantics from source to destination. Basically the connectors need to support idempotence when talking to Kafka and we need the transactional write feature in Kafka to make the transformation atomic. This is actually pretty doable if you separate connector=>kafka problem from the generic transformations which are always kafka=>kafka. However I think it is quite impossible to do in an all_things => all_things environment. Today you can say "well the semantics of the Samza APIs depend on the connectors you use" but it is actually worse than that because the semantics actually depend on the pairing of connectors--so not only can you probably not get a usable "exactly once" guarantee end-to-end it can actually be quite hard to reverse engineer what property (if any) your end-to-end flow has if you have heterogeneous systems.
>> >
>> > -Jay
>> >
>> > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <yanfang...@gmail.com> wrote:
>> >
>> > > {quote}
>> > > maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc)
>> > > {quote}
>> > >
>> > > Overall, I agree on this idea. Now the question is more about "how to do it".
>> > >
>> > > On the other hand, one thing I want to point out is that, if we decide to go this way, how do we want to support otherSystem-transformation-otherSystem use case?
>> > >
>> > > Basically, there are four user groups here:
>> > >
>> > > 1. Kafka-transformation-Kafka
>> > > 2. Kafka-transformation-otherSystem
>> > > 3. otherSystem-transformation-Kafka
>> > > 4. otherSystem-transformation-otherSystem
>> > >
>> > > For group 1, they can easily use the new Samza library to achieve. For group 2 and 3, they can use copyCat -> transformation -> Kafka or Kafka -> transformation -> copyCat.
>> > >
>> > > The problem is for group 4. Do we want to abandon this or still support it? Of course, this use case can be achieved by using copyCat -> transformation -> Kafka -> transformation -> copyCat, the thing is how we persuade them to do this long chain. If yes, it will also be a win for Kafka too. Or if there is no one in this community actually doing this so far, maybe ok to not support the group 4 directly.
>> > >
>> > > Thanks,
>> > >
>> > > Fang, Yan
>> > > yanfang...@gmail.com
>> > >
>> > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> wrote:
>> > >
>> > > > Yeah I agree with this summary. I think there are kind of two questions here:
>> > > > 1. Technically does alignment/reliance on Kafka make sense
>> > > > 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense
>> > > >
>> > > > Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project.
>> > > >
>> > > > My preference would be to see if people can mostly agree on a direction rather than splintering things off. From my point of view the ideal outcome of all the options discussed would be to make Samza a closely aligned subproject, maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc). No idea about how these things work, Jakob, you probably know more.
>> > > >
>> > > > No discussion amongst the Kafka folks has happened on this, but likely we should figure out what the Samza community actually wants first.
>> > > >
>> > > > I admit that this is a fairly radical departure from how things are.
>> > > >
>> > > > If that doesn't fly, I think, yeah we could leave Samza as it is and do the more radical reboot inside Kafka. From my point of view that does leave things in a somewhat confusing state since now there are two stream processing systems more or less coupled to Kafka in large part made by the same people. But, arguably that might be a cleaner way to make the cut-over and perhaps less risky for Samza community since if it works people can switch and if it doesn't nothing will have changed. Dunno, how do people feel about this?
>> > > >
>> > > > -Jay
>> > > >
>> > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com> wrote:
>> > > >
>> > > > > > This leads me to thinking that merging projects and communities might be a good idea: with the union of experience from both communities, we will probably build a better system that is better for users.
>> > > > > Is this what's being proposed though? Merging the projects seems like a consequence of at most one of the three directions under discussion:
>> > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka for configuration, etc. (to a greater or lesser extent to be determined) but the Samza community would not automatically merge with the Kafka community (the Phoenix/HBase example is a good one here).
>> > > > > 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (ie given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required.
>> > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has no direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team.
>> > > > >
>> > > > > Is the Kafka community on board with this?
>> > > > >
>> > > > > To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices.
>> > > > > -Jakob
>> > > > >
>> > > > > On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com> wrote:
>> > > > > > That's great. Thanks, Jay.
>> > > > > >
>> > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io> wrote:
>> > > > > >
>> > > > > >> Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid).
>> > > > > >>
>> > > > > >> So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the assignment of work to these processes. The nice thing about this is that the work assignment, timeouts, behavior, configs, and code will all be the same across all cluster managers.
>> > > > > >>
>> > > > > >> So I think that prototype would actually give you exactly what you want today for any cluster manager (or manual placement + restart script) that was sticky in terms of host placement since there is already a configurable partition movement timeout and task-by-task state reuse with a check on state validity.
>> > > > > >>
>> > > > > >> -Jay
>> > > > > >>
>> > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>> > > > > >>
>> > > > > >> > That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis.
>> > > > > >> >
>> > > > > >> > One thing to watch out for is the interplay between Kafka's group management and the external scheduler/process manager's fault tolerance. If a container dies, the Kafka group membership protocol will try to assign its tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state.
>> > > > > >> >
>> > > > > >> > Someone else pointed this out already but I thought it might be worth calling out again.
>> > > > > >> >
>> > > > > >> > Cheers,
>> > > > > >> >
>> > > > > >> > Roger
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io> wrote:
>> > > > > >> >
>> > > > > >> > > Hey Roger,
>> > > > > >> > >
>> > > > > >> > > I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users.
>> > > > > >> > >
>> > > > > >> > > I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere. I think moving to a model where Kafka itself does the group membership, lifecycle control, and partition assignment has the advantage of putting all that complex stuff behind a clean api that the clients are already going to be implementing for their consumer, so the added functionality for stream processing beyond a consumer becomes very minor.
>> > > > > >> > >
>> > > > > >> > > -Jay
>> > > > > >> > >
>> > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>> > > > > >> > >
>> > > > > >> > > > Metamorphosis...nice. :)
>> > > > > >> > > >
>> > > > > >> > > > This has been a great discussion. As a user of Samza who's recently integrated it into a relatively large organization, I just want to add support to a few points already made.
>> > > > > >> > > >
>> > > > > >> > > > The biggest hurdles to adoption of Samza as it currently exists that I've experienced are:
>> > > > > >> > > > 1) YARN - YARN is overly complex in many environments where Puppet would do just fine but it was the only mechanism to get fault tolerance.
>> > > > > >> > > > 2) Configuration - I think I like the idea of configuring most of the job in code rather than config files. In general, I think the goal should be to make it harder to make mistakes, especially of the kind where the code expects something and the config doesn't match. The current config is quite intricate and error-prone. For example, the application logic may depend on bootstrapping a topic but rather than asserting that in the code, you have to rely on getting the config right. Likewise with serdes, the Java representations produced by various serdes (JSON, Avro, etc.) are not equivalent so you cannot just reconfigure a serde without changing the code. It would be nice for jobs to be able to assert what they expect from their input topics in terms of partitioning. This is getting a little off topic but I was even thinking about creating a "Samza config linter" that would sanity check a set of configs. Especially in organizations where config is managed by a different team than the application developer, it's very hard to avoid config mistakes.
>> > > > > >> > > > 3) Java/Scala centric - for many teams (especially DevOps-type folks), the pain of the Java toolchain (maven, slow builds, weak command line support, configuration over convention) really inhibits productivity. As more and more high-quality clients become available for Kafka, I hope they'll follow Samza's model. Not sure how much it affects the proposals in this thread but please consider other languages in the ecosystem as well. From what I've heard, Spark has more Python users than Java/Scala.
>> > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza and are working on a Yeoman generator https://github.com/Quantiply/generator-rico for Jython/Samza projects to alleviate some of the pain)
>> > > > > >> > > >
>> > > > > >> > > > I also want to underscore Jay's point about improving the user experience.
>> > > > > >> > > > That's a very important factor for adoption. I think the goal should be to make Samza as easy to get started with as something like Logstash. Logstash is vastly inferior in terms of capabilities to Samza but it's easy to get started and that makes a big difference.
>> > > > > >> > > >
>> > > > > >> > > > Cheers,
>> > > > > >> > > >
>> > > > > >> > > > Roger
>> > > > > >> > > >
>> > > > > >> > > >
>> > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales <g...@apache.org> wrote:
>> > > > > >> > > >
>> > > > > >> > > > > Forgot to add. On the naming issues, Kafka Metamorphosis is a clear winner :)
>> > > > > >> > > > >
>> > > > > >> > > > > --
>> > > > > >> > > > > Gianmarco
>> > > > > >> > > > >
>> > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <g...@apache.org> wrote:
>> > > > > >> > > > >
>> > > > > >> > > > > > Hi,
>> > > > > >> > > > > >
>> > > > > >> > > > > > @Martin, thanks for your comments.
>> > > > > >> > > > > > Maybe I'm missing some important point, but I think coupling the releases is actually a *good* thing.
>> > > > > >> > > > > > To make an example, would it be better if the MR and HDFS components of Hadoop had different release schedules?
>> > > > > >> > > > > >
>> > > > > >> > > > > > Actually, keeping the discussion in a single place would make agreeing on releases (and backwards compatibility) much easier, as everybody would be responsible for the whole codebase.
>> > > > > >> > > > > >
>> > > > > >> > > > > > That said, I like the idea of absorbing samza-core as a sub-project, and leave the fancy stuff separate.
>> > > > > >> > > > > > It probably gives 90% of the benefits we have been discussing here.
>> > > > > >> > > > > >
>> > > > > >> > > > > > Cheers,
>> > > > > >> > > > > >
>> > > > > >> > > > > > --
>> > > > > >> > > > > > Gianmarco
>> > > > > >> > > > > >
>> > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> wrote:
>> > > > > >> > > > > >
>> > > > > >> > > > > >> Hey Martin,
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> I agree coupling release schedules is a downside.
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> Definitely we can try to solve some of the integration problems in Confluent Platform or in other distributions. But I think this ends up being really shallow. I guess I feel to really get a good user experience the two systems have to kind of feel like part of the same thing and you can't really add that in later--you can put both in the same downloadable tar file but it doesn't really give a very cohesive feeling. I agree that ultimately any of the project stuff is as much social and naming as anything else--theoretically two totally independent projects could work to tightly align. In practice this seems to be quite difficult though.
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> For the frameworks--totally agree it would be good to maintain the framework support with the project. In some cases there may not be too much there since the integration gets lighter but I think whatever stubs you need should be included. So no I definitely wasn't trying to imply dropping support for these frameworks, just making the integration lighter by separating process management from partition management.
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> You raise two good points we would have to figure out if we went down the alignment path:
>> > > > > >> > > > > >> 1. With respect to the name, yeah I think the first question is whether some "re-branding" would be worth it. If so then I think we can have a big thread on the name. I'm definitely not set on Kafka Streaming or Kafka Streams I was just using them to be kind of illustrative. I agree with your critique of these names, though I think people would get the idea.
>> > > > > >> > > > > >> 2. Yeah you also raise a good point about how to "factor" it. Here are the options I see (I could get enthusiastic about any of them):
>> > > > > >> > > > > >>    a. One repo for both Kafka and Samza
>> > > > > >> > > > > >>    b. Two repos, retaining the current separation
>> > > > > >> > > > > >>    c. Two repos, the equivalent of samza-api and samza-core is absorbed almost like a third client
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> Cheers,
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> -Jay
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <mar...@kleppmann.com> wrote:
>> > > > > >> > > > > >>
>> > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few follow-up comments.
>> > > > > >> > > > > >> >
>> > > > > >> > > > > >> > - I see the appeal of merging with Kafka or becoming a subproject: the reasons you mention are good. The risk I see is that release schedules become coupled to each other, which can slow everyone down, and large projects with many contributors are harder to manage. (Jakob, can you speak from experience, having seen a wider range of Hadoop ecosystem projects?)
>> > > > > >> > > > > >> >
>> > > > > >> > > > > >> > Some of the goals of a better unified developer experience could also be solved by integrating Samza nicely into a Kafka distribution (such as Confluent's). I'm not against merging projects if we decide that's the way to go, just pointing out the same goals can perhaps also be achieved in other ways.
>> > > > > >> > > > > >> >
>> > > > > >> > > > > >> > - With regard to dropping the YARN dependency: are you proposing that Samza doesn't give any help to people wanting to run on YARN/Mesos/AWS/etc? So the docs would basically have a link to Slider and nothing else? Or would we maintain integrations with a bunch of popular deployment methods (e.g. the necessary glue and shell scripts to make Samza work with Slider)?
>> > > > > >> > > > > >> > >> > > > > >> > > > > >> > I absolutely think it's a good idea to have the >> "as a >> > > > > library" >> > > > > >> > and >> > > > > >> > > > > "as a >> > > > > >> > > > > >> > process" (using Yi's taxonomy) options for people >> who >> > > > want >> > > > > >> them, >> > > > > >> > > > but I >> > > > > >> > > > > >> > think there should also be a low-friction path for >> > > common >> > > > > "as >> > > > > >> a >> > > > > >> > > > > service" >> > > > > >> > > > > >> > deployment methods, for which we probably need to >> > > > maintain >> > > > > >> > > > > integrations. >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to me, >> > > > because >> > > > > >> Kafka >> > > > > >> > > is >> > > > > >> > > > > all >> > > > > >> > > > > >> > about streams already. Perhaps "Kafka >> Transformers" >> > or >> > > > > "Kafka >> > > > > >> > > > Filters" >> > > > > >> > > > > >> > would be more apt? >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > One suggestion: perhaps the core of Samza (stream >> > > > > >> transformation >> > > > > >> > > > with >> > > > > >> > > > > >> > state management -- i.e. the "Samza as a library" >> > bit) >> > > > > could >> > > > > >> > > become >> > > > > >> > > > > >> part of >> > > > > >> > > > > >> > Kafka, while higher-level tools such as streaming >> SQL >> > > and >> > > > > >> > > > integrations >> > > > > >> > > > > >> with >> > > > > >> > > > > >> > deployment frameworks remain in a separate >> project? >> > In >> > > > > other >> > > > > >> > > words, >> > > > > >> > > > > >> Kafka >> > > > > >> > > > > >> > would absorb the proven, stable core of Samza, >> which >> > > > would >> > > > > >> > become >> > > > > >> > > > the >> > > > > >> > > > > >> > "third Kafka client" mentioned early in this >> thread. 
>> > > The >> > > > > Samza >> > > > > >> > > > project >> > > > > >> > > > > >> > would then target that third Kafka client as its >> base >> > > > API, >> > > > > and >> > > > > >> > the >> > > > > >> > > > > >> project >> > > > > >> > > > > >> > would be freed up to explore more experimental new >> > > > > horizons. >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > Martin >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps < >> > > jay.kr...@gmail.com> >> > > > > >> wrote: >> > > > > >> > > > > >> > >> > > > > >> > > > > >> > > Hey Martin, >> > > > > >> > > > > >> > > >> > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually >> don't >> > > > think >> > > > > it >> > > > > >> > ties >> > > > > >> > > > our >> > > > > >> > > > > >> > hands >> > > > > >> > > > > >> > > at all, all it does is refactor things. The >> > division >> > > of >> > > > > >> > > > > >> responsibility is >> > > > > >> > > > > >> > > that Samza core is responsible for task >> lifecycle, >> > > > state, >> > > > > >> and >> > > > > >> > > > > >> partition >> > > > > >> > > > > >> > > management (using the Kafka co-ordinator) but >> it is >> > > NOT >> > > > > >> > > > responsible >> > > > > >> > > > > >> for >> > > > > >> > > > > >> > > packaging, configuration deployment or >> execution of >> > > > > >> processes. >> > > > > >> > > The >> > > > > >> > > > > >> > problem >> > > > > >> > > > > >> > > of packaging and starting these processes is >> > > > > >> > > > > >> > > framework/environment-specific. This leaves >> > > individual >> > > > > >> > > frameworks >> > > > > >> > > > to >> > > > > >> > > > > >> be >> > > > > >> > > > > >> > as >> > > > > >> > > > > >> > > fancy or vanilla as they like. 
> > So you can get simple stateless support in YARN, Mesos, etc using
> > their off-the-shelf app framework (Slider, Marathon, etc). These are
> > well known by people and have nice UIs and a lot of flexibility. I
> > don't think they have node affinity as a built in option (though I
> > could be wrong). So if we want that we can either wait for them to
> > add it or do a custom framework to add that feature (as now).
> > Obviously if you manage things with old-school ops tools
> > (puppet/chef/etc) you get locality easily. The nice thing, though, is
> > that all the samza "business logic" around partition management and
> > fault tolerance is in Samza core so it is shared across frameworks
> > and the framework specific bit is just whether it is smart enough to
> > try to get the same host when a job is restarted.
> >
> > With respect to the Kafka-alignment, yeah I think the goal would be
> > (a) actually get better alignment in user experience, and (b) express
> > this in the naming and project branding. Specifically:
> > 1. Website/docs, it would be nice for the "transformation" api to be
> >    discoverable in the main Kafka docs--i.e. be able to explain when
> >    to use the consumer and when to use the stream processing
> >    functionality and lead people into that experience.
> > 2. Align releases so if you get Kafkza 1.4.2 (or whatever) that has
> >    both Kafka and the stream processing part and they actually work
> >    together.
> > 3. Unify the programming experience so the client and Samza api share
> >    config/monitoring/naming/packaging/etc.
> >
> > I think sub-projects keep separate committers and can have a separate
> > repo, but I'm actually not really sure (I can't find a definition of
> > a subproject in Apache).
> >
> > Basically at a high-level you want the experience to "feel" like a
> > single system, not two relatively independent things that are kind of
> > awkwardly glued together.
> >
> > I think if we did that then having naming or branding like "kafka
> > streaming" or "kafka streams" or something like that would actually
> > do a good job of conveying what it is. I do think that this would
> > help adoption quite a lot as it would correctly convey that using
> > Kafka Streaming with Kafka is a fairly seamless experience and Kafka
> > is pretty heavily adopted at this point.
> >
> > Fwiw we actually considered this model originally when open sourcing
> > Samza, however at that time Kafka was relatively unknown and we
> > decided not to do it since we felt it would be limiting.
> > From my point of view three things have changed: (1) Kafka is now
> > really heavily used for stream processing, (2) we learned that
> > abstracting out the stream well is basically impossible, (3) we
> > learned it is really hard to keep the two things feeling like a
> > single product.
> >
> > -Jay
> >
> > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann
> > <mar...@kleppmann.com> wrote:
> >
> >> Hi all,
> >>
> >> Lots of good thoughts here.
> >>
> >> I agree with the general philosophy of tying Samza more firmly to
> >> Kafka. After I spent a while looking at integrating other message
> >> brokers (e.g. Kinesis) with SystemConsumer, I came to the conclusion
> >> that SystemConsumer tacitly assumes a model so much like Kafka's
> >> that pretty much nobody but Kafka actually implements it. (Databus
> >> is perhaps an exception, but it isn't widely used outside of
> >> LinkedIn.)
> >> Thus, making Samza fully dependent on Kafka acknowledges that the
> >> system-independence was never as real as we perhaps made it out to
> >> be. The gains of code reuse are real.
> >>
> >> The idea of decoupling Samza from YARN has also always been
> >> appealing to me, for various reasons already mentioned in this
> >> thread. Although making Samza jobs deployable on anything
> >> (YARN/Mesos/AWS/etc) seems laudable, I am a little concerned that it
> >> will restrict us to a lowest common denominator. For example, would
> >> host affinity (SAMZA-617) still be possible? For jobs with large
> >> amounts of state, I think SAMZA-617 would be a big boon, since
> >> restoring state off the changelog on every single restart is
> >> painful, due to long recovery times. It would be a shame if the
> >> decoupling from YARN made host affinity impossible.
> >>
> >> Jay, a question about the proposed API for instantiating a job in
> >> code (rather than a properties file): when submitting a job to a
> >> cluster, is the idea that the instantiation code runs on a client
> >> somewhere, which then pokes the necessary endpoints on
> >> YARN/Mesos/AWS/etc? Or does that code run on each container that is
> >> part of the job (in which case, how does the job submission to the
> >> cluster work)?
> >>
> >> I agree with Garry that it doesn't feel right to make a 1.0 release
> >> with a plan for it to be immediately obsolete. So if this is going
> >> to happen, I think it would be more honest to stick with 0.* version
> >> numbers until the library-ified Samza has been implemented, is
> >> stable and widely used.
> >>
> >> Should the new Samza be a subproject of Kafka?
> >> There is precedent for tight coupling between different Apache
> >> projects (e.g. Curator and Zookeeper, or Slider and YARN), so I
> >> think remaining separate would be ok. Even if Samza is fully
> >> dependent on Kafka, there is enough substance in Samza that it
> >> warrants being a separate project. An argument in favour of merging
> >> would be if we think Kafka has a much stronger "brand presence" than
> >> Samza; I'm ambivalent on that one. If the Kafka project is willing
> >> to endorse Samza as the "official" way of doing stateful stream
> >> transformations, that would probably have much the same effect as
> >> re-branding Samza as "Kafka Stream Processors" or suchlike. Close
> >> collaboration between the two projects will be needed in any case.
> >>
> >> From a project management perspective, I guess the "new Samza" would
> >> have to be developed on a branch alongside ongoing maintenance of
> >> the current line of development? I think it would be important to
> >> continue supporting existing users, and provide a graceful migration
> >> path to the new version. Leaving the current versions unsupported
> >> and forcing people to rewrite their jobs would send a bad signal.
> >>
> >> Best,
> >> Martin
> >>
> >> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> wrote:
> >>
> >>> Hey Garry,
> >>>
> >>> Yeah that's super frustrating. I'd be happy to chat more about this
> >>> if you'd be interested.
> >>> I think Chris and I started with the idea of "what would it take to
> >>> make Samza a kick-ass ingestion tool" but ultimately we kind of
> >>> came around to the idea that ingestion and transformation had
> >>> pretty different needs and coupling the two made things hard.
> >>>
> >>> For what it's worth I think copycat (KIP-26) actually will do what
> >>> you are looking for.
> >>>
> >>> With regard to your point about slider, I don't necessarily
> >>> disagree. But I think getting good YARN support is quite doable and
> >>> I think we can make that work well. I think the issue this proposal
> >>> solves is that technically it is pretty hard to support multiple
> >>> cluster management systems the way things are now, you need to
> >>> write an "app master" or "framework" for each and they are all a
> >>> little different so testing is really hard.
> >>> In the absence of this we have been stuck with just YARN which has
> >>> fantastic penetration in the Hadoopy part of the org, but zero
> >>> penetration elsewhere. Given the huge amount of work being put in
> >>> to slider, marathon, aws tooling, not to mention the umpteen
> >>> related packaging technologies people want to use (Docker,
> >>> Kubernetes, various cloud-specific deploy tools, etc) I really
> >>> think it is important to get this right.
> >>>
> >>> -Jay
> >>>
> >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington
> >>> <g.turking...@improvedigital.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I think the question below re does Samza become a sub-project of
> >>>> Kafka highlights the broader point around migration. Chris
> >>>> mentions Samza's maturity is heading towards a v1 release but I'm
> >>>> not sure it feels right to launch a v1 then immediately plan to
> >>>> deprecate most of it.
> >>>>
> >>>> From a selfish perspective I have some guys who have started
> >>>> working with Samza and building some new consumers/producers was
> >>>> next up. Sounds like that is absolutely not the direction to go. I
> >>>> need to look into the KIP in more detail but for me the
> >>>> attractiveness of adding new Samza consumer/producers -- even if
> >>>> yes all they were doing was really getting data into and out of
> >>>> Kafka -- was to avoid having to worry about the lifecycle
> >>>> management of external clients. If there is a generic Kafka
> >>>> ingress/egress layer that I can plug a new connector into and have
> >>>> a lot of the heavy lifting re scale and reliability done for me
> >>>> then it gives me all the pushing new consumers/producers would. If
> >>>> not then it complicates my operational deployments.
> >>>>
> >>>> Which is similar to my other question with the proposal -- if we
> >>>> build a fully available/stand-alone Samza plus the requisite shims
> >>>> to integrate with Slider etc I suspect the former may be a lot
> >>>> more work than we think. We may make it much easier for a newcomer
> >>>> to get something running but having them step up and get a
> >>>> reliable production deployment may still dominate mailing list
> >>>> traffic, if for different reasons than today.
> >>>>
> >>>> Don't get me wrong -- I'm comfortable with making the Samza
> >>>> dependency on Kafka much more explicit and I absolutely see the
> >>>> benefits in the reduction of duplication and clashing
> >>>> terminologies/abstractions that Chris/Jay describe. Samza as a
> >>>> library would likely be a very nice tool to add to the Kafka
> >>>> ecosystem.
> >>>> I just have the concerns above re the operational side.
> >>>>
> >>>> Garry
> >>>>
> >>>> -----Original Message-----
> >>>> From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
> >>>> Sent: 02 July 2015 12:56
> >>>> To: dev@samza.apache.org
> >>>> Subject: Re: Thoughts and obesrvations on Samza
> >>>>
> >>>> Very interesting thoughts.
> >>>> From outside, I have always perceived Samza as a computing layer
> >>>> over Kafka.
> >>>>
> >>>> The question, maybe a bit provocative, is "should Samza be a
> >>>> sub-project of Kafka then?"
> >>>> Or does it make sense to keep it as a separate project with a
> >>>> separate governance?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> --
> >>>> Gianmarco
> >>>>
> >>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com> wrote:
> >>>>
> >>>>> Overall, I agree to couple with Kafka more tightly.
> >>>>> Because Samza de facto is based on Kafka, and it should leverage
> >>>>> what Kafka has. At the same time, Kafka does not need to reinvent
> >>>>> what Samza already has. I also like the idea of separating the
> >>>>> ingestion and transformation.
> >>>>>
> >>>>> But it is a little difficult for me to imagine what Samza will
> >>>>> look like. And I feel Chris and Jay have a little difference in
> >>>>> terms of how Samza should look.
> >>>>>
> >>>>> *** Will it look like what Jay's code shows (a client of Kafka)?
> >>>>> And user's application code calls this client?
> >>>>>
> >>>>> 1. If we make Samza be a library of Kafka (like what the code
> >>>>> shows), how do we implement auto-balance and fault-tolerance? Are
> >>>>> they taken care of by the Kafka broker or other mechanism, such
> >>>>> as "Samza worker" (just making up the name)?
> >>>>>
> >>>>> 2. What about other features, such as auto-scaling, shared state,
> >>>>> monitoring?
> >>>>>
> >>>>>
> >>>>> *** If we have Samza standalone, (is this what Chris suggests?)
> >>>>>
> >>>>> 1. we still need to ingest data from Kafka and produce to it.
> >>>>> Then it becomes the same as what Samza looks like now, except it
> >>>>> does not rely on Yarn anymore.
> >>>>>
> >>>>> 2. if it is standalone, how can it leverage Kafka's metrics,
> >>>>> logs, etc? Use Kafka code as the dependency?
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Fang, Yan
> >>>>> yanfang...@gmail.com
> >>>>>
> >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang
> >>>>> <wangg...@gmail.com> wrote:
> >>>>>
> >>>>>> Read through the code example and it looks good to me. A few
> >>>>>> thoughts regarding deployment:
> >>>>>>
> >>>>>> Today Samza deploys as executable runnable like:
> >>>>>>
> >>>>>> deploy/samza/bin/run-job.sh --config-factory=... --config-path=file://...
> >>>>>>
> >>>>>> And this proposal advocates for deploying Samza more as embedded
> >>>>>> libraries in user application code (ignoring the terminology
> >>>>>> since it is not the same as the prototype code):
> >>>>>>
> >>>>>> StreamTask task = new MyStreamTask(configs);
> >>>>>> Thread thread = new Thread(task);
> >>>>>> thread.start();
> >>>>>>
> >>>>>> I think both of these deployment modes are important for
> >>>>>> different types of users. That said, I think making Samza purely
> >>>>>> standalone is still sufficient for either runnable or library
> >>>>>> modes.
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <j...@confluent.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Looks like gmail mangled the code example, it was supposed to
> >>>>>>> look like this:
> >>>>>>>
> >>>>>>> Properties props = new Properties();
> >>>>>>> props.put("bootstrap.servers", "localhost:4242");
> >>>>>>> StreamingConfig config = new StreamingConfig(props);
> >>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> >>>>>>> config.processor(ExampleStreamProcessor.class);
> >>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> >>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> >>>>>>> container.run();
> >>>>>>>
> >>>>>>> -Jay
> >>>>>>>
> >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <j...@confluent.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey guys,
> >>>>>>>>
> >>>>>>>> This came out of some conversations Chris and I were having
> >>>>>>>> around whether it would make sense to use Samza as a kind of
> >>>>>>>> data ingestion framework for Kafka (which ultimately led to
> >>>>>>>> KIP-26 "copycat"). This kind of combined with complaints
> >>>>>>>> around config and YARN and the discussion around how to best
> >>>>>>>> do a standalone mode.
> >>>>>>>>
> >>>>>>>> So the thought experiment was, given that Samza was basically
> >>>>>>>> already totally Kafka specific, what if you just embraced that
> >>>>>>>> and turned it into something less like a heavyweight framework
> >>>>>>>> and more like a third Kafka client--a kind of "producing
> >>>>>>>> consumer" with state management facilities. Basically a
> >>>>>>>> library.
>> > > > > >> > > > > >> > >>>>>>>> Instead of a complex stream processing framework this would actually be a very simple thing, not much more complicated to use or operate than a Kafka consumer. As Chris said, when we thought about it, a lot of what Samza (and the other stream processing systems) were doing seemed like kind of a hangover from MapReduce.
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output data to and from the stream processing. But when we actually looked into how that would work, Samza isn't really an ideal data ingestion framework for a bunch of reasons. To really do that right you need a pretty different internal data model and set of apis.
>> > > > > >> > > > > >> > >>>>>>>> So what if you split them and had an api for Kafka ingress/egress (copycat AKA KIP-26) and a separate api for Kafka transformation (Samza)?
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> This would also allow really embracing the same terminology and conventions. One complaint about the current state is that the two systems kind of feel bolted on. Terminology like "stream" vs "topic" and different config and monitoring systems means you kind of have to learn Kafka's way, then learn Samza's slightly different way, then kind of understand how they map to each other, which, having walked a few people through this, is surprisingly tricky for folks to get.
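
The "producing consumer" idea described above (a library loop that reads records, updates local state, and writes results back out) might be sketched roughly as follows. This is only an illustration under stated assumptions: in-memory queues stand in for Kafka topics, and the names ProducingConsumerSketch, Processor, and run are hypothetical and are not taken from the prototype or from any Kafka or Samza API.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ProducingConsumerSketch {

    /** Hypothetical callback: one input record in, zero or more output records out. */
    public interface Processor {
        void process(String key, String value, Map<String, Long> state, Queue<String> output);
    }

    /**
     * Drains the input queue (a stand-in for a subscribed topic), passes each
     * record to the processor together with its local state, and returns
     * whatever the processor emitted (a stand-in for an output topic).
     */
    public static Queue<String> run(Queue<String[]> input, Processor processor,
                                    Map<String, Long> state) {
        Queue<String> output = new ArrayDeque<>();
        String[] record;
        while ((record = input.poll()) != null) {
            processor.process(record[0], record[1], state, output);
        }
        return output;
    }

    public static void main(String[] args) {
        // Each record is a {key, value} pair.
        Queue<String[]> input = new ArrayDeque<>();
        input.add(new String[] {"user-1", "home"});
        input.add(new String[] {"user-2", "about"});
        input.add(new String[] {"user-3", "home"});

        // Local state: a running count per value.
        Map<String, Long> counts = new HashMap<>();

        Queue<String> out = run(input, (key, value, state, output) -> {
            long n = state.merge(value, 1L, Long::sum);
            output.add(value + "=" + n);
        }, counts);

        System.out.println(out);
    }
}
```

The point of the sketch is only that the whole thing is an ordinary object driven by an ordinary loop in the caller's process; nothing in it needs a scheduler or a job-submission step.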
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of time on airplanes I hacked up an earnest but still somewhat incomplete prototype of what this would look like. This is just unceremoniously dumped into Kafka as it required a few changes to the new consumer. Here is the code:
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I just liberally renamed everything to try to align it with Kafka with no regard for compatibility.
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> To use this would be something like this:
>> > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
>> > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers", "localhost:4242");
>> > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new StreamingConfig(props);
>> > > > > >> > > > > >> > >>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
>> > > > > >> > > > > >> > >>>>>>>> config.processor(ExampleStreamProcessor.class);
>> > > > > >> > > > > >> > >>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
>> > > > > >> > > > > >> > >>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
>> > > > > >> > > > > >> > >>>>>>>> container.run();
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the SamzaContainer; StreamProcessor is basically StreamTask.
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> So rather than putting all the class names in a file and then having the job assembled by reflection, you just instantiate the container programmatically. Work is balanced over however many instances of this are alive at any time (i.e. if an instance dies, new tasks are added to the existing containers without shutting them down).
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> We would provide some glue for running this stuff in YARN via Slider, Mesos via Marathon, and AWS using some of their tools but from the point of view of these frameworks these stream processing jobs are just stateless services that can come and go and expand and contract at will. There is no more custom scheduler.
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> 1. It is only ~1300 lines of code, it would get larger if we productionized but not vastly larger. We really do get a ton of leverage out of Kafka.
>> > > > > >> > > > > >> > >>>>>>>> 2. Partition management is fully delegated to the new consumer. This is nice since now any partition management strategy available to Kafka consumer is also available to Samza (and vice versa) and with the exact same configs.
>> > > > > >> > > > > >> > >>>>>>>> 3. It supports state as well as state reuse
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is thought provoking.
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> -Jay
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <criccom...@apache.org> wrote:
>> > > > > >> > > > > >> > >>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Hey all,
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza engineers at LinkedIn and Confluent and we came up with a few observations and would like to propose some changes.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I want to call out about Samza's design, and I'd like to propose some changes.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment system.
>> > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
>> > > > > >> > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's consumer APIs are trying to solve a lot of the same problems.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> All three of these issues are related, but I'll address them in order.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Deployment
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a dynamic deployment scheduler such as YARN, Mesos, etc.
>> > > > > >> > > > > >> > >>>>>>>>> When we initially built Samza, we bet that there would be one or two winners in this area, and we could support them, and the rest would go away. In reality, there are many variations. Furthermore, many people still prefer to just start their processors like normal Java processes, and use traditional deployment scripts such as Fabric, Chef, Ansible, etc. Forcing a deployment system on users makes the Samza start-up process really painful for first-time users.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement was also a bit of a misfire because of a fundamental misunderstanding of the difference between batch jobs and stream processing jobs.
>> > > > > >> > > > > >> > >>>>>>>>> Early on, we made a conscious effort to favor the Hadoop (Map/Reduce) way of doing things, since it worked and was well understood. One thing that we missed was that batch jobs have a definite beginning and end, and stream processing jobs don't (usually). This leads to a much simpler scheduling problem for stream processors. You basically just need to find a place to start the processor, and start it. The way we run grids at LinkedIn, there's no concept of a cluster being "full". We always add more machines. The problem with coupling Samza with a scheduler is that Samza (as a framework) now has to handle deployment.
>> > > > > >> > > > > >> > >>>>>>>>> This pulls in a bunch of things such as configuration distribution (config stream), shell scripts (bin/run-job.sh, JobRunner), packaging (all the .tgz stuff), etc.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic deployment was to support data locality. If you want to have locality, you need to put your processors close to the data they're processing. Upon further investigation, though, this feature is not that beneficial. There is some good discussion about some problems with it on SAMZA-335. Again, we took the Map/Reduce path, but there are some fundamental differences between HDFS and Kafka. HDFS has blocks, while Kafka has partitions.
>> > > > > >> > > > > >> > >>>>>>>>> This leads to less optimization potential with stream processors on top of Kafka.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch. Samza doesn't have any built-in fault-tolerance logic. Instead, it depends on the dynamic deployment scheduling system to handle restarts when a processor dies. This has made it very difficult to write a standalone Samza container (SAMZA-516).
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Pluggability
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, but I think that we've gone too far with it. Currently, Samza has:
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
>> > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every component (MessageChooser, SystemStreamPartitionGrouper, ConfigRewriter, etc).
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've forgotten, as well. Some of these are useful, but some have proven not to be. This all comes at a cost: complexity. This complexity is making it harder for our users to pick up and use Samza out of the box. It also makes it difficult for Samza developers to reason about the characteristics of the container (since the characteristics change depending on which plugins are used).
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most visible in the System APIs. What Samza really requires to be functional is Kafka as its transport layer.
>> > > > > >> > > > > >> > >>>>>>>>> But we've conflated two unrelated use cases into one API:
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
>> > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> The current System API supports both of these use cases. The problem is, we actually want different features for each use case. By papering over these two use cases, and providing a single API, we've introduced a ton of leaky abstractions.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in (2) is to have monotonically increasing longs for offsets (like Kafka). This would be at odds with (1), though, since different systems have different SCNs/Offsets/UUIDs/vectors.
>> > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the mailing list and the SQL JIRAs about the need for this.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for replayability. Kafka allows us to rewind when we have a failure. Many other systems don't. In some cases, systems return null for their offsets (e.g. WikipediaSystemConsumer) because they have no offsets.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka supports partitioning, but many systems don't. We model this by having a single partition for those systems. Still, other systems model partitioning differently (e.g. Kinesis).
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess.
>> > > > > >> > > > > >> > >>>>>>>>> Creating streams in a system-agnostic way is almost impossible. As is modeling metadata for the system (replication factor, partitions, location, etc). The list goes on.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Duplicate work
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing Samza, Kafka's consumer and producer APIs had a relatively weak feature set. On the consumer side, you had two options: use the high-level consumer, or the simple consumer. The problem with the high-level consumer was that it controlled your offsets, partition assignments, and the order in which you received messages. The problem with the simple consumer is that it's not simple. It's basic.
>> > > > > >> > > > > >> > >>>>>>>>> You end up having to handle a lot of really low-level stuff that you shouldn't. We spent a lot of time making Samza's KafkaSystemConsumer very robust. It also allows us to support some cool features:
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and prioritization.
>> > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition assignment to support joins, global state (if we want to implement it :)), etc.
>> > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset checkpointing.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is that these features should actually be in Kafka. A lot of Kafka consumers (not just Samza stream processors) end up wanting to do things like joins and partition assignment. The Kafka community has come to the same conclusion.
>> > > > > >> > > > > >> > >>>>>>>>> They're adding a ton of upgrades into their new Kafka consumer implementation. To a large extent, it duplicates work that we've already done in Samza.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a very similar approach to Samza's KafkaCheckpointManager implementation for handling offset checkpointing. Like Samza, Kafka's new offset management feature stores offset checkpoints in a topic, and allows you to fetch them from the broker.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we could have shared the work if it had been done in Kafka from the get-go.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> Vision
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather radical proposal.
>> > > > > >> > > > > >> > >>>>>>>>> Samza is relatively stable at this point. I'd venture to say that we're near a 1.0 release. I'd like to propose that we take what we've learned, and begin thinking about Samza beyond 1.0. What would we change if we were starting from scratch? My proposal is to:
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run Samza processors, and eliminate all direct dependencies on YARN, Mesos, etc.
>> > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support only Kafka as the stream processing layer.
>> > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, serialization, and config systems, and simply use Kafka's instead.
>> > > > > >> > > > > >> > >>>>>>>>>
>> > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that I outlined above.
>> It should also shrink the Samza code base pretty dramatically.
>> Supporting only a standalone container will allow Samza to be
>> executed on YARN (using Slider), Mesos (using Marathon/Aurora), or
>> most other in-house deployment systems. This should make life a lot
>> easier for new users. Imagine having the hello-samza tutorial without
>> YARN. The drop in mailing list traffic will be pretty dramatic.
>>
>> Coupling with Kafka seems long overdue to me. The reality is,
>> everyone that I'm aware of is using Samza with Kafka. We basically
>> require it already in order for most features to work.
>> Those that are using other systems are generally using it for ingest
>> into Kafka (1), and then they do the processing on top. There is
>> already discussion (
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
>> ) in Kafka to make ingesting into Kafka extremely easy.
>>
>> Once we make the call to couple with Kafka, we can leverage a ton of
>> their ecosystem. We no longer have to maintain our own config,
>> metrics, etc. We can all share the same libraries, and make them
>> better.
>> This will also allow us to share the consumer/producer APIs, and will
>> let us leverage their offset management and partition management,
>> rather than having our own. All of the coordinator stream code would
>> go away, as would most of the YARN AppMaster code. We'd probably have
>> to push some partition management features into the Kafka broker, but
>> they're already moving in that direction with the new consumer API.
>> The features we have for partition assignment aren't unique to Samza,
>> and seem like they should be in Kafka anyway.
>> There will always be some niche usages which will require extra care
>> and hence full control over partition assignments, much like the
>> Kafka low-level consumer API. These would continue to be supported.
>>
>> These items will be good for the Samza community. They'll make Samza
>> easier to use, and make it easier for developers to add new features.
>>
>> Obviously this is a fairly large (and somewhat backwards
>> incompatible) change. If we choose to go this route, it's important
>> that we openly communicate how we're going to provide a migration
>> path from the existing APIs to the new ones (if we make incompatible
>> changes).
>> I think at a minimum, we'd probably need to provide a wrapper to
>> allow existing StreamTask implementations to continue running on the
>> new container. It's also important that we openly communicate about
>> timing, and stages of the migration.
>>
>> If you made it this far, I'm sure you have opinions. :) Please send
>> your thoughts and feedback.
>>
>> Cheers,
>> Chris
>
> --
> -- Guozhang