Thanks, Jay. This argument persuaded me actually. :)

Fang, Yan
yanfang...@gmail.com

On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <j...@confluent.io> wrote:
> Hey Yan,
>
> Yeah philosophically I think the argument is that you should capture the stream in Kafka independent of the transformation. This is obviously a Kafka-centric view point.
>
> Advantages of this:
> - In practice I think this is what e.g. Storm people often end up doing anyway. You usually need to throttle any access to a live serving database.
> - Can have multiple subscribers and they get the same thing without additional load on the source system.
> - Applications can tap into the stream if need be by subscribing.
> - You can debug your transformation by tailing the Kafka topic with the console consumer
> - Can tee off the same data stream for batch analysis or Lambda arch style re-processing
>
> The disadvantage is that it will use Kafka resources. But the idea is eventually you will have multiple subscribers to any data source (at least for monitoring) so you will end up there soon enough anyway.
>
> Down the road the technical benefit is that I think it gives us a good path towards end-to-end exactly once semantics from source to destination. Basically the connectors need to support idempotence when talking to Kafka and we need the transactional write feature in Kafka to make the transformation atomic. This is actually pretty doable if you separate the connector=>kafka problem from the generic transformations which are always kafka=>kafka. However I think it is quite impossible to do in an all_things => all_things environment. Today you can say "well the semantics of the Samza APIs depend on the connectors you use" but it is actually worse than that because the semantics actually depend on the pairing of connectors, so not only can you probably not get a usable "exactly once" guarantee end-to-end, it can actually be quite hard to reverse engineer what property (if any) your end-to-end flow has if you have heterogeneous systems.
>
> -Jay
>
> On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> > {quote}
> > maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc)
> > {quote}
> >
> > Overall, I agree on this idea. Now the question is more about "how to do it".
> >
> > On the other hand, one thing I want to point out is that, if we decide to go this way, how do we want to support the otherSystem-transformation-otherSystem use case?
> >
> > Basically, there are four user groups here:
> >
> > 1. Kafka-transformation-Kafka
> > 2. Kafka-transformation-otherSystem
> > 3. otherSystem-transformation-Kafka
> > 4. otherSystem-transformation-otherSystem
> >
> > For group 1, they can easily use the new Samza library to achieve this. For group 2 and 3, they can use copyCat -> transformation -> Kafka or Kafka -> transformation -> copyCat.
> >
> > The problem is for group 4. Do we want to abandon this or still support it? Of course, this use case can be achieved by using copyCat -> transformation -> Kafka -> transformation -> copyCat, the thing is how we persuade them to do this long chain. If yes, it will also be a win for Kafka too. Or if there is no one in this community actually doing this so far, maybe ok to not support the group 4 directly.
> >
> > Thanks,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> wrote:
> >
> > > Yeah I agree with this summary. I think there are kind of two questions here:
> > > 1. Technically does alignment/reliance on Kafka make sense
> > > 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense
> > >
> > > Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project.
> > > > > > My preference would be to see if people can mostly agree on a direction > > > rather than splintering things off. From my point of view the ideal > > outcome > > > of all the options discussed would be to make Samza a closely aligned > > > subproject, maintained in a separate repository and retaining the > > existing > > > committership but sharing as much else as possible (website, etc). No > > idea > > > about how these things work, Jacob, you probably know more. > > > > > > No discussion amongst the Kafka folks has happened on this, but likely > we > > > should figure out what the Samza community actually wants first. > > > > > > I admit that this is a fairly radical departure from how things are. > > > > > > If that doesn't fly, I think, yeah we could leave Samza as it is and do > > the > > > more radical reboot inside Kafka. From my point of view that does leave > > > things in a somewhat confusing state since now there are two stream > > > processing systems more or less coupled to Kafka in large part made by > > the > > > same people. But, arguably that might be a cleaner way to make the > > cut-over > > > and perhaps less risky for Samza community since if it works people can > > > switch and if it doesn't nothing will have changed. Dunno, how do > people > > > feel about this? > > > > > > -Jay > > > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com> > wrote: > > > > > > > > This leads me to thinking that merging projects and communities > > might > > > > be a good idea: with the union of experience from both communities, > we > > > will > > > > probably build a better system that is better for users. > > > > Is this what's being proposed though? Merging the projects seems like > > > > a consequence of at most one of the three directions under > discussion: > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka for > > > > configuration, etc. 
(to a greater or lesser extent to be determined) but the Samza community would not automatically merge with the Kafka community (the Phoenix/HBase example is a good one here).
> > > > 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (i.e. given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required.
> > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has no direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team.
> > > >
> > > > Is the Kafka community on board with this?
> > > >
> > > > To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices.
> > > > -Jakob
> > > >
> > > > On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > > > > That's great. Thanks, Jay.
> > > > >
> > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io> wrote:
> > > > >
> > > > >> Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state.
I think the > > fix > > > > is > > > > >> exactly what you described which is to have a long timeout on > > > partition > > > > >> movement for stateful jobs so that if a job is just getting > bounced, > > > and > > > > >> the cluster manager (or admin) is smart enough to restart it on > the > > > same > > > > >> host when possible, it can optimistically reuse any existing state > > it > > > > finds > > > > >> on disk (if it is valid). > > > > >> > > > > >> So in this model the charter of the CM is to place processes as > > > > stickily as > > > > >> possible and to restart or re-place failed processes. The charter > of > > > the > > > > >> partition management system is to control the assignment of work > to > > > > these > > > > >> processes. The nice thing about this is that the work assignment, > > > > timeouts, > > > > >> behavior, configs, and code will all be the same across all > cluster > > > > >> managers. > > > > >> > > > > >> So I think that prototype would actually give you exactly what you > > > want > > > > >> today for any cluster manager (or manual placement + restart > script) > > > > that > > > > >> was sticky in terms of host placement since there is already a > > > > configurable > > > > >> partition movement timeout and task-by-task state reuse with a > check > > > on > > > > >> state validity. > > > > >> > > > > >> -Jay > > > > >> > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover < > > roger.hoo...@gmail.com > > > > > > > > >> wrote: > > > > >> > > > > >> > That would be great to let Kafka do as much heavy lifting as > > > possible > > > > and > > > > >> > make it easier for other languages to implement Samza apis. > > > > >> > > > > > >> > One thing to watch out for is the interplay between Kafka's > group > > > > >> > management and the external scheduler/process manager's fault > > > > tolerance. 
> > > > >> > If a container dies, the Kafka group membership protocol will try to assign its tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state.
> > > > >> >
> > > > >> > Someone else pointed this out already but I thought it might be worth calling out again.
> > > > >> >
> > > > >> > Cheers,
> > > > >> >
> > > > >> > Roger
> > > > >> >
> > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io> wrote:
> > > > >> >
> > > > >> > > Hey Roger,
> > > > >> > >
> > > > >> > > I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users.
> > > > >> > >
> > > > >> > > I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere.
I > > > think > > > > >> > moving > > > > >> > > to a model where Kafka itself does the group membership, > > lifecycle > > > > >> > control, > > > > >> > > and partition assignment has the advantage of putting all that > > > > complex > > > > >> > > stuff behind a clean api that the clients are already going to > > be > > > > >> > > implementing for their consumer, so the added functionality > for > > > > stream > > > > >> > > processing beyond a consumer becomes very minor. > > > > >> > > > > > > >> > > -Jay > > > > >> > > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover < > > > > roger.hoo...@gmail.com> > > > > >> > > wrote: > > > > >> > > > > > > >> > > > Metamorphosis...nice. :) > > > > >> > > > > > > > >> > > > This has been a great discussion. As a user of Samza who's > > > > recently > > > > >> > > > integrated it into a relatively large organization, I just > > want > > > to > > > > >> add > > > > >> > > > support to a few points already made. > > > > >> > > > > > > > >> > > > The biggest hurdles to adoption of Samza as it currently > > exists > > > > that > > > > >> > I've > > > > >> > > > experienced are: > > > > >> > > > 1) YARN - YARN is overly complex in many environments where > > > Puppet > > > > >> > would > > > > >> > > do > > > > >> > > > just fine but it was the only mechanism to get fault > > tolerance. > > > > >> > > > 2) Configuration - I think I like the idea of configuring > most > > > of > > > > the > > > > >> > job > > > > >> > > > in code rather than config files. In general, I think the > > goal > > > > >> should > > > > >> > be > > > > >> > > > to make it harder to make mistakes, especially of the kind > > where > > > > the > > > > >> > code > > > > >> > > > expects something and the config doesn't match. The current > > > > config > > > > >> is > > > > >> > > > quite intricate and error-prone. 
For example, the application logic may depend on bootstrapping a topic but rather than asserting that in the code, you have to rely on getting the config right. Likewise with serdes, the Java representations produced by various serdes (JSON, Avro, etc.) are not equivalent so you cannot just reconfigure a serde without changing the code. It would be nice for jobs to be able to assert what they expect from their input topics in terms of partitioning. This is getting a little off topic but I was even thinking about creating a "Samza config linter" that would sanity check a set of configs. Especially in organizations where config is managed by a different team than the application developer, it's very hard to avoid config mistakes.
> > > > >> > > > 3) Java/Scala centric - for many teams (especially DevOps-type folks), the pain of the Java toolchain (maven, slow builds, weak command line support, configuration over convention) really inhibits productivity. As more and more high-quality clients become available for Kafka, I hope they'll follow Samza's model. Not sure how much it affects the proposals in this thread but please consider other languages in the ecosystem as well. From what I've heard, Spark has more Python users than Java/Scala.
> > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
> > > > >> > > > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> > > > >> > > > and are working on a Yeoman generator
> > > > >> > > > https://github.com/Quantiply/generator-rico for Jython/Samza projects to alleviate some of the pain)
> > > > >> > > >
> > > > >> > > > I also want to underscore Jay's point about improving the user experience. That's a very important factor for adoption. I think the goal should be to make Samza as easy to get started with as something like Logstash. Logstash is vastly inferior in terms of capabilities to Samza but it's easy to get started and that makes a big difference.
> > > > >> > > >
> > > > >> > > > Cheers,
> > > > >> > > >
> > > > >> > > > Roger
> > > > >> > > >
> > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales <g...@apache.org> wrote:
> > > > >> > > >
> > > > >> > > > > Forgot to add. On the naming issues, Kafka Metamorphosis is a clear winner :)
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > > Gianmarco
> > > > >> > > > >
> > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <g...@apache.org> wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi,
> > > > >> > > > > >
> > > > >> > > > > > @Martin, thanks for your comments.
> > > > >> > > > > > Maybe I'm missing some important point, but I think coupling the releases is actually a *good* thing.
> > > > >> > > > > > To make an example, would it be better if the MR and > HDFS > > > > >> > components > > > > >> > > of > > > > >> > > > > > Hadoop had different release schedules? > > > > >> > > > > > > > > > >> > > > > > Actually, keeping the discussion in a single place would > > > make > > > > >> > > agreeing > > > > >> > > > on > > > > >> > > > > > releases (and backwards compatibility) much easier, as > > > > everybody > > > > >> > > would > > > > >> > > > be > > > > >> > > > > > responsible for the whole codebase. > > > > >> > > > > > > > > > >> > > > > > That said, I like the idea of absorbing samza-core as a > > > > >> > sub-project, > > > > >> > > > and > > > > >> > > > > > leave the fancy stuff separate. > > > > >> > > > > > It probably gives 90% of the benefits we have been > > > discussing > > > > >> here. > > > > >> > > > > > > > > > >> > > > > > Cheers, > > > > >> > > > > > > > > > >> > > > > > -- > > > > >> > > > > > Gianmarco > > > > >> > > > > > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com > > > > > > wrote: > > > > >> > > > > > > > > > >> > > > > >> Hey Martin, > > > > >> > > > > >> > > > > >> > > > > >> I agree coupling release schedules is a downside. > > > > >> > > > > >> > > > > >> > > > > >> Definitely we can try to solve some of the integration > > > > problems > > > > >> in > > > > >> > > > > >> Confluent Platform or in other distributions. But I > think > > > > this > > > > >> > ends > > > > >> > > up > > > > >> > > > > >> being really shallow. I guess I feel to really get a > good > > > > user > > > > >> > > > > experience > > > > >> > > > > >> the two systems have to kind of feel like part of the > > same > > > > thing > > > > >> > and > > > > >> > > > you > > > > >> > > > > >> can't really add that in later--you can put both in the > > > same > > > > >> > > > > downloadable > > > > >> > > > > >> tar file but it doesn't really give a very cohesive > > > feeling. 
> > > > I > > > > >> > agree > > > > >> > > > > that > > > > >> > > > > >> ultimately any of the project stuff is as much social > and > > > > naming > > > > >> > as > > > > >> > > > > >> anything else--theoretically two totally independent > > > projects > > > > >> > could > > > > >> > > > work > > > > >> > > > > >> to > > > > >> > > > > >> tightly align. In practice this seems to be quite > > difficult > > > > >> > though. > > > > >> > > > > >> > > > > >> > > > > >> For the frameworks--totally agree it would be good to > > > > maintain > > > > >> the > > > > >> > > > > >> framework support with the project. In some cases there > > may > > > > not > > > > >> be > > > > >> > > too > > > > >> > > > > >> much > > > > >> > > > > >> there since the integration gets lighter but I think > > > whatever > > > > >> > stubs > > > > >> > > > you > > > > >> > > > > >> need should be included. So no I definitely wasn't > trying > > > to > > > > >> imply > > > > >> > > > > >> dropping > > > > >> > > > > >> support for these frameworks, just making the > integration > > > > >> lighter > > > > >> > by > > > > >> > > > > >> separating process management from partition > management. > > > > >> > > > > >> > > > > >> > > > > >> You raise two good points we would have to figure out > if > > we > > > > went > > > > >> > > down > > > > >> > > > > the > > > > >> > > > > >> alignment path: > > > > >> > > > > >> 1. With respect to the name, yeah I think the first > > > question > > > > is > > > > >> > > > whether > > > > >> > > > > >> some "re-branding" would be worth it. If so then I > think > > we > > > > can > > > > >> > > have a > > > > >> > > > > big > > > > >> > > > > >> thread on the name. I'm definitely not set on Kafka > > > > Streaming or > > > > >> > > Kafka > > > > >> > > > > >> Streams I was just using them to be kind of > > illustrative. 
I agree with your critique of these names, though I think people would get the idea.
> > > > >> > > > > >> 2. Yeah you also raise a good point about how to "factor" it. Here are the options I see (I could get enthusiastic about any of them):
> > > > >> > > > > >>    a. One repo for both Kafka and Samza
> > > > >> > > > > >>    b. Two repos, retaining the current separation
> > > > >> > > > > >>    c. Two repos, the equivalent of samza-api and samza-core is absorbed almost like a third client
> > > > >> > > > > >>
> > > > >> > > > > >> Cheers,
> > > > >> > > > > >>
> > > > >> > > > > >> -Jay
> > > > >> > > > > >>
> > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <mar...@kleppmann.com> wrote:
> > > > >> > > > > >>
> > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few follow-up comments.
> > > > >> > > > > >> >
> > > > >> > > > > >> > - I see the appeal of merging with Kafka or becoming a subproject: the reasons you mention are good. The risk I see is that release schedules become coupled to each other, which can slow everyone down, and large projects with many contributors are harder to manage. (Jakob, can you speak from experience, having seen a wider range of Hadoop ecosystem projects?)
> > > > >> > > > > >> > > > > > >> > > > > >> > Some of the goals of a better unified developer > > > experience > > > > >> could > > > > >> > > > also > > > > >> > > > > be > > > > >> > > > > >> > solved by integrating Samza nicely into a Kafka > > > > distribution > > > > >> > (such > > > > >> > > > as > > > > >> > > > > >> > Confluent's). I'm not against merging projects if we > > > decide > > > > >> > that's > > > > >> > > > the > > > > >> > > > > >> way > > > > >> > > > > >> > to go, just pointing out the same goals can perhaps > > also > > > be > > > > >> > > achieved > > > > >> > > > > in > > > > >> > > > > >> > other ways. > > > > >> > > > > >> > > > > > >> > > > > >> > - With regard to dropping the YARN dependency: are > you > > > > >> proposing > > > > >> > > > that > > > > >> > > > > >> > Samza doesn't give any help to people wanting to run > on > > > > >> > > > > >> YARN/Mesos/AWS/etc? > > > > >> > > > > >> > So the docs would basically have a link to Slider and > > > > nothing > > > > >> > > else? > > > > >> > > > Or > > > > >> > > > > >> > would we maintain integrations with a bunch of > popular > > > > >> > deployment > > > > >> > > > > >> methods > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to make > > Samza > > > > work > > > > >> > with > > > > >> > > > > >> Slider)? > > > > >> > > > > >> > > > > > >> > > > > >> > I absolutely think it's a good idea to have the "as a > > > > library" > > > > >> > and > > > > >> > > > > "as a > > > > >> > > > > >> > process" (using Yi's taxonomy) options for people who > > > want > > > > >> them, > > > > >> > > > but I > > > > >> > > > > >> > think there should also be a low-friction path for > > common > > > > "as > > > > >> a > > > > >> > > > > service" > > > > >> > > > > >> > deployment methods, for which we probably need to > > > maintain > > > > >> > > > > integrations. 
> > > > >> > > > > >> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to me, > > > because > > > > >> Kafka > > > > >> > > is > > > > >> > > > > all > > > > >> > > > > >> > about streams already. Perhaps "Kafka Transformers" > or > > > > "Kafka > > > > >> > > > Filters" > > > > >> > > > > >> > would be more apt? > > > > >> > > > > >> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza (stream > > > > >> transformation > > > > >> > > > with > > > > >> > > > > >> > state management -- i.e. the "Samza as a library" > bit) > > > > could > > > > >> > > become > > > > >> > > > > >> part of > > > > >> > > > > >> > Kafka, while higher-level tools such as streaming SQL > > and > > > > >> > > > integrations > > > > >> > > > > >> with > > > > >> > > > > >> > deployment frameworks remain in a separate project? > In > > > > other > > > > >> > > words, > > > > >> > > > > >> Kafka > > > > >> > > > > >> > would absorb the proven, stable core of Samza, which > > > would > > > > >> > become > > > > >> > > > the > > > > >> > > > > >> > "third Kafka client" mentioned early in this thread. > > The > > > > Samza > > > > >> > > > project > > > > >> > > > > >> > would then target that third Kafka client as its base > > > API, > > > > and > > > > >> > the > > > > >> > > > > >> project > > > > >> > > > > >> > would be freed up to explore more experimental new > > > > horizons. > > > > >> > > > > >> > > > > > >> > > > > >> > Martin > > > > >> > > > > >> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps < > > jay.kr...@gmail.com> > > > > >> wrote: > > > > >> > > > > >> > > > > > >> > > > > >> > > Hey Martin, > > > > >> > > > > >> > > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually don't > > > think > > > > it > > > > >> > ties > > > > >> > > > our > > > > >> > > > > >> > hands > > > > >> > > > > >> > > at all, all it does is refactor things. 
The > division > > of > > > > >> > > > > >> responsibility is > > > > >> > > > > >> > > that Samza core is responsible for task lifecycle, > > > state, > > > > >> and > > > > >> > > > > >> partition > > > > >> > > > > >> > > management (using the Kafka co-ordinator) but it is > > NOT > > > > >> > > > responsible > > > > >> > > > > >> for > > > > >> > > > > >> > > packaging, configuration deployment or execution of > > > > >> processes. > > > > >> > > The > > > > >> > > > > >> > problem > > > > >> > > > > >> > > of packaging and starting these processes is > > > > >> > > > > >> > > framework/environment-specific. This leaves > > individual > > > > >> > > frameworks > > > > >> > > > to > > > > >> > > > > >> be > > > > >> > > > > >> > as > > > > >> > > > > >> > > fancy or vanilla as they like. So you can get > simple > > > > >> stateless > > > > >> > > > > >> support in > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app > > > framework > > > > >> > > (Slider, > > > > >> > > > > >> > Marathon, > > > > >> > > > > >> > > etc). These are well known by people and have nice > > UIs > > > > and a > > > > >> > lot > > > > >> > > > of > > > > >> > > > > >> > > flexibility. I don't think they have node affinity > > as a > > > > >> built > > > > >> > in > > > > >> > > > > >> option > > > > >> > > > > >> > > (though I could be wrong). So if we want that we > can > > > > either > > > > >> > wait > > > > >> > > > for > > > > >> > > > > >> them > > > > >> > > > > >> > > to add it or do a custom framework to add that > > feature > > > > (as > > > > >> > now). > > > > >> > > > > >> > Obviously > > > > >> > > > > >> > > if you manage things with old-school ops tools > > > > >> > (puppet/chef/etc) > > > > >> > > > you > > > > >> > > > > >> get > > > > >> > > > > >> > > locality easily. 
The nice thing, though, is that > all > > > the > > > > >> samza > > > > >> > > > > >> "business > > > > >> > > > > >> > > logic" around partition management and fault > > tolerance > > > > is in > > > > >> > > Samza > > > > >> > > > > >> core > > > > >> > > > > >> > so > > > > >> > > > > >> > > it is shared across frameworks and the framework > > > specific > > > > >> bit > > > > >> > is > > > > >> > > > > just > > > > >> > > > > >> > > whether it is smart enough to try to get the same > > host > > > > when > > > > >> a > > > > >> > > job > > > > >> > > > is > > > > >> > > > > >> > > restarted. > > > > >> > > > > >> > > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I think > the > > > > goal > > > > >> > would > > > > >> > > > be > > > > >> > > > > >> (a) > > > > >> > > > > >> > > actually get better alignment in user experience, > and > > > (b) > > > > >> > > express > > > > >> > > > > >> this in > > > > >> > > > > >> > > the naming and project branding. Specifically: > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the > > > > "transformation" > > > > >> api > > > > >> > > to > > > > >> > > > be > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be able > to > > > > explain > > > > >> > > when > > > > >> > > > to > > > > >> > > > > >> use > > > > >> > > > > >> > > the consumer and when to use the stream processing > > > > >> > functionality > > > > >> > > > and > > > > >> > > > > >> lead > > > > >> > > > > >> > > people into that experience. > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or > > > > whatever) > > > > >> > that > > > > >> > > > has > > > > >> > > > > >> both > > > > >> > > > > >> > > Kafka and the stream processing part and they > > actually > > > > work > > > > >> > > > > together. > > > > >> > > > > >> > > 3. 
Unify the programming experience so the client and Samza api share config/monitoring/naming/packaging/etc.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > I think sub-projects keep separate committers and can have a separate repo, but I'm actually not really sure (I can't find a definition of a subproject in Apache).
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Basically at a high-level you want the experience to "feel" like a single system, not two relatively independent things that are kind of awkwardly glued together.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > I think if we did that then having naming or branding like "kafka streaming" or "kafka streams" or something like that would actually do a good job of conveying what it is. I do think this would help adoption quite a lot as it would correctly convey that using Kafka Streaming with Kafka is a fairly seamless experience and Kafka is pretty heavily adopted at this point.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Fwiw we actually considered this model originally when open sourcing Samza, however at that time Kafka was relatively unknown and we decided not to do it since we felt it would be limiting. From my point of view three things have changed: (1) Kafka is now really heavily used for stream processing, (2) we learned that abstracting out the stream well is basically impossible, (3) we learned it is really hard to keep the two things feeling like a single product.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > -Jay
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <mar...@kleppmann.com> wrote:
> > > > >> > > > > >> > >
> > > > >> > > > > >> > >> Hi all,
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> Lots of good thoughts here.
> > > > >> > > > > >> > >>
> > > > >> > > > > >> > >> I agree with the general philosophy of tying Samza more firmly to Kafka. After I spent a while looking at integrating other message brokers (e.g.
Kinesis) with SystemConsumer, I came to the conclusion that SystemConsumer tacitly assumes a model so much like Kafka's that pretty much nobody but Kafka actually implements it. (Databus is perhaps an exception, but it isn't widely used outside of LinkedIn.) Thus, making Samza fully dependent on Kafka acknowledges that the system-independence was never as real as we perhaps made it out to be. The gains of code reuse are real.

The idea of decoupling Samza from YARN has also always been appealing to me, for various reasons already mentioned in this thread. Although making Samza jobs deployable on anything (YARN/Mesos/AWS/etc) seems laudable, I am a little concerned that it will restrict us to a lowest common denominator. For example, would host affinity (SAMZA-617) still be possible?
For jobs with large amounts of state, I think SAMZA-617 would be a big boon, since restoring state off the changelog on every single restart is painful, due to long recovery times. It would be a shame if the decoupling from YARN made host affinity impossible.

Jay, a question about the proposed API for instantiating a job in code (rather than a properties file): when submitting a job to a cluster, is the idea that the instantiation code runs on a client somewhere, which then pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or does that code run on each container that is part of the job (in which case, how does the job submission to the cluster work)?

I agree with Garry that it doesn't feel right to make a 1.0 release with a plan for it to be immediately obsolete.
So if this is going to happen, I think it would be more honest to stick with 0.* version numbers until the library-ified Samza has been implemented, is stable and widely used.

Should the new Samza be a subproject of Kafka? There is precedent for tight coupling between different Apache projects (e.g. Curator and Zookeeper, or Slider and YARN), so I think remaining separate would be ok. Even if Samza is fully dependent on Kafka, there is enough substance in Samza that it warrants being a separate project. An argument in favour of merging would be if we think Kafka has a much stronger "brand presence" than Samza; I'm ambivalent on that one. If the Kafka project is willing to endorse Samza as the "official" way of doing stateful stream transformations, that would probably have much the same effect as re-branding Samza as "Kafka Stream Processors" or suchlike. Close collaboration between the two projects will be needed in any case.

From a project management perspective, I guess the "new Samza" would have to be developed on a branch alongside ongoing maintenance of the current line of development? I think it would be important to continue supporting existing users, and provide a graceful migration path to the new version. Leaving the current versions unsupported and forcing people to rewrite their jobs would send a bad signal.

Best,
Martin

On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> wrote:

Hey Garry,

Yeah that's super frustrating. I'd be happy to chat more about this if you'd be interested. I think Chris and I started with the idea of "what would it take to make Samza a kick-ass ingestion tool" but ultimately we kind of came around to the idea that ingestion and transformation had pretty different needs and coupling the two made things hard.

For what it's worth I think copycat (KIP-26) actually will do what you are looking for.

With regard to your point about Slider, I don't necessarily disagree. But I think getting good YARN support is quite doable and I think we can make that work well. I think the issue this proposal solves is that technically it is pretty hard to support multiple cluster management systems the way things are now: you need to write an "app master" or "framework" for each and they are all a little different so testing is really hard. In the absence of this we have been stuck with just YARN which has fantastic penetration in the Hadoopy part of the org, but zero penetration elsewhere.
Given the huge amount of work being put in to Slider, Marathon, AWS tooling, not to mention the umpteen related packaging technologies people want to use (Docker, Kubernetes, various cloud-specific deploy tools, etc) I really think it is important to get this right.

-Jay

On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <g.turking...@improvedigital.com> wrote:

Hi all,

I think the question below re does Samza become a sub-project of Kafka highlights the broader point around migration. Chris mentions Samza's maturity is heading towards a v1 release but I'm not sure it feels right to launch a v1 then immediately plan to deprecate most of it.

From a selfish perspective I have some guys who have started working with Samza and building some new consumers/producers was next up. Sounds like that is absolutely not the direction to go.
I need to look into the KIP in more detail but for me the attractiveness of adding new Samza consumer/producers -- even if yes all they were doing was really getting data into and out of Kafka -- was to avoid having to worry about the lifecycle management of external clients. If there is a generic Kafka ingress/egress layer that I can plug a new connector into and have a lot of the heavy lifting re scale and reliability done for me then it gives me all the pushing new consumers/producers would. If not then it complicates my operational deployments.

Which is similar to my other question with the proposal -- if we build a fully available/stand-alone Samza plus the requisite shims to integrate with Slider etc I suspect the former may be a lot more work than we think.
We may make it much easier for a newcomer to get something running but having them step up and get a reliable production deployment may still dominate mailing list traffic, if for different reasons than today.

Don't get me wrong -- I'm comfortable with making the Samza dependency on Kafka much more explicit and I absolutely see the benefits in the reduction of duplication and clashing terminologies/abstractions that Chris/Jay describe. Samza as a library would likely be a very nice tool to add to the Kafka ecosystem. I just have the concerns above re the operational side.

Garry

-----Original Message-----
From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
Sent: 02 July 2015 12:56
To: dev@samza.apache.org
Subject: Re: Thoughts and obesrvations on Samza

Very interesting thoughts.
From outside, I have always perceived Samza as a computing layer over Kafka.

The question, maybe a bit provocative, is "should Samza be a sub-project of Kafka then?" Or does it make sense to keep it as a separate project with a separate governance?

Cheers,

--
Gianmarco

On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com> wrote:

Overall, I agree with coupling to Kafka more tightly, because Samza is de facto based on Kafka and should leverage what Kafka has. At the same time, Kafka does not need to reinvent what Samza already has. I also like the idea of separating the ingestion and transformation.

But it is a little difficult for me to imagine what Samza will look like. And I feel Chris and Jay differ a little in terms of what Samza should look like.

*** Will it look like what Jay's code shows (a client of Kafka)? And the user's application code calls this client?

1. If we make Samza a library of Kafka (like what the code shows), how do we implement auto-balance and fault-tolerance? Are they taken care of by the Kafka broker or by some other mechanism, such as a "Samza worker" (I just made up the name)?

2. What about other features, such as auto-scaling, shared state, monitoring?

*** If we have Samza standalone (is this what Chris suggests?)

1. We still need to ingest data from Kafka and produce to it. Then it becomes the same as what Samza looks like now, except it does not rely on YARN anymore.

2. If it is standalone, how can it leverage Kafka's metrics, logs, etc? Use Kafka code as the dependency?

Thanks,

Fang, Yan
yanfang...@gmail.com

On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <wangg...@gmail.com> wrote:

Read through the code example and it looks good to me. A few thoughts regarding deployment:

Today Samza deploys as an executable runnable like:

    deploy/samza/bin/run-job.sh --config-factory=... --config-path=file://...

And this proposal advocates deploying Samza more as an embedded library in user application code (ignoring the terminology since it is not the same as the prototype code):

    StreamTask task = new MyStreamTask(configs);
    Thread thread = new Thread(task);
    thread.start();

I think both of these deployment modes are important for different types of users.
That said, I think making Samza purely standalone is still sufficient for either runnable or library modes.

Guozhang

On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <j...@confluent.io> wrote:

Looks like gmail mangled the code example, it was supposed to look like this:

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:4242");
    StreamingConfig config = new StreamingConfig(props);
    config.subscribe("test-topic-1", "test-topic-2");
    config.processor(ExampleStreamProcessor.class);
    config.serialization(new StringSerializer(), new StringDeserializer());
    KafkaStreaming container = new KafkaStreaming(config);
    container.run();

-Jay

On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <j...@confluent.io> wrote:

Hey guys,

This came out of some conversations Chris and I were having around whether
it would make sense to use Samza as a kind of data ingestion framework for Kafka (which ultimately led to KIP-26 "copycat"). This kind of combined with complaints around config and YARN and the discussion around how to best do a standalone mode.

So the thought experiment was, given that Samza was basically already totally Kafka specific, what if you just embraced that and turned it into something less like a heavyweight framework and more like a third Kafka client--a kind of "producing consumer" with state management facilities. Basically a library. Instead of a complex stream processing framework this would actually be a very simple thing, not much more complicated to use or operate than a Kafka consumer.
As Chris said, when we thought about it, a lot of what Samza (and the other stream processing systems) were doing seemed like kind of a hangover from MapReduce.

Of course you need to ingest/output data to and from the stream processing. But when we actually looked into how that would work, Samza isn't really an ideal data ingestion framework for a bunch of reasons. To really do that right you need a pretty different internal data model and set of APIs. So what if you split them and had an API for Kafka ingress/egress (copycat AKA KIP-26) and a separate API for Kafka transformation (Samza)?

This would also allow really embracing the same terminology and conventions.
One complaint about the current state is that the two systems kind of feel bolted on. Terminology like "stream" vs "topic" and different config and monitoring systems mean you kind of have to learn Kafka's way, then learn Samza's slightly different way, then kind of understand how they map to each other, which, having walked a few people through this, is surprisingly tricky for folks to get.

Since I have been spending a lot of time on airplanes I hacked up an earnest but still somewhat incomplete prototype of what this would look like. This is just unceremoniously dumped into Kafka as it required a few changes to the new consumer.
Here is the code:

https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming

For the purpose of the prototype I just liberally renamed everything to try to align it with Kafka with no regard for compatibility.

To use this would be something like this:

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:4242");
    StreamingConfig config = new StreamingConfig(props);
    config.subscribe("test-topic-1", "test-topic-2");
    config.processor(ExampleStreamProcessor.class);
    config.serialization(new StringSerializer(), new StringDeserializer());
    KafkaStreaming container = new KafkaStreaming(config);
    container.run();

KafkaStreaming is basically the SamzaContainer; StreamProcessor is basically StreamTask.
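
To make the "producing consumer with state" idea a little more concrete, here is a minimal in-memory sketch. It is not the prototype's API: the StreamProcessor interface itself is not shown in this thread, so the names RecordProcessor, CountingProcessor and run below are hypothetical stand-ins, and plain lists take the place of Kafka topics. It only illustrates the shape of the idea: a container loop hands each record to a processor, which updates local state and emits output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types come from Kafka, Samza, or the prototype.
public class ProducingConsumerSketch {

    /** Hypothetical per-record callback, loosely analogous to a StreamTask. */
    public interface RecordProcessor {
        void process(String key, String value, List<String> output);
    }

    /** Keeps a per-key count in local state and emits the running count. */
    public static class CountingProcessor implements RecordProcessor {
        private final Map<String, Integer> state = new HashMap<>();

        @Override
        public void process(String key, String value, List<String> output) {
            int count = state.merge(key, 1, Integer::sum);
            output.add(key + "=" + count);
        }
    }

    /** Stand-in for the container loop: feeds every input record to the processor. */
    public static List<String> run(RecordProcessor processor, List<String[]> input) {
        List<String> output = new ArrayList<>();
        for (String[] record : input) {
            processor.process(record[0], record[1], output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] {"a", "x"});
        input.add(new String[] {"b", "y"});
        input.add(new String[] {"a", "z"});
        // Prints [a=1, b=1, a=2]
        System.out.println(run(new CountingProcessor(), input));
    }
}
```

In a real deployment the input list would be a subscribed topic and the output list a producer, which is the part the prototype delegates to the Kafka clients.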

So rather than putting all the class names in a file and then having the job assembled by reflection, you just instantiate the container programmatically. Work is balanced over however many instances of this are alive at any time (i.e. if an instance dies, new tasks are added to the existing containers without shutting them down).

We would provide some glue for running this stuff in YARN via Slider, Mesos via Marathon, and AWS using some of their tools but from the point of view of these frameworks these stream processing jobs are just stateless services that can come and go and expand and contract at will. There is no more custom scheduler.

Here are some relevant details:

1. It is only ~1300 lines of code, it would get larger if we productionized but not vastly larger. We really do get a ton of leverage out of Kafka.
2. Partition management is fully delegated to the new consumer. This is nice since now any partition management strategy available to Kafka consumer is also available to Samza (and vice versa) and with the exact same configs.
3. It supports state as well as state reuse

Anyhow take a look, hopefully it is thought provoking.
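
The point about work being balanced over however many instances are alive can be sketched as follows. This is only an illustration, assuming a simple round-robin strategy; it is not the new consumer's actual assignment logic, and RebalanceSketch and assign are hypothetical names. It shows how the partitions of a dead instance end up on the survivors once the assignment is recomputed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of round-robin partition balancing; not Kafka's real assignor.
public class RebalanceSketch {

    /** Spreads partitions 0..partitionCount-1 over the live instances, in name order. */
    public static Map<String, List<Integer>> assign(List<String> liveInstances, int partitionCount) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String instance : liveInstances) {
            assignment.put(instance, new ArrayList<>());
        }
        List<String> ordered = new ArrayList<>(assignment.keySet());
        if (ordered.isEmpty()) {
            return assignment; // nobody alive, nothing to assign
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = ordered.get(partition % ordered.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three live instances share six partitions.
        System.out.println(assign(Arrays.asList("a", "b", "c"), 6));
        // Instance "c" dies; its partitions move to the remaining instances.
        System.out.println(assign(Arrays.asList("a", "b"), 6));
    }
}
```

Because the assignment is a pure function of the live set, the surviving instances only need to pick up the extra partitions; nothing has to be restarted.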
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> -Jay
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <criccom...@apache.org> wrote:
> > > > >> > > > > >> > >>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Hey all,
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza engineers at LinkedIn and Confluent and we came up with a few observations and would like to propose some changes.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> We've observed some things that I want to call out about Samza's design, and I'd like to propose some changes.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment system.
> > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
> > > > >> > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's consumer APIs are trying to solve a lot of the same problems.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> All three of these issues are related, but I'll address them in order.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Deployment
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a dynamic deployment scheduler such as YARN, Mesos, etc. When we initially built Samza, we bet that there would be one or two winners in this area, and we could support them, and the rest would go away. In reality, there are many variations. Furthermore, many people still prefer to just start their processors like normal Java processes, and use traditional deployment scripts such as Fabric, Chef, Ansible, etc. Forcing a deployment system on users makes the Samza start-up process really painful for first time users.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement was also a bit of a misfire because of a fundamental misunderstanding of the difference between batch jobs and stream processing jobs. Early on, we made a conscious effort to favor the Hadoop (Map/Reduce) way of doing things, since it worked and was well understood. One thing that we missed was that batch jobs have a definite beginning and end, and stream processing jobs don't (usually). This leads to a much simpler scheduling problem for stream processors. You basically just need to find a place to start the processor, and start it. The way we run grids, at LinkedIn, there's no concept of a cluster being "full". We always add more machines. The problem with coupling Samza with a scheduler is that Samza (as a framework) now has to handle deployment. This pulls in a bunch of things such as configuration distribution (config stream), shell scripts (bin/run-job.sh, JobRunner), packaging (all the .tgz stuff), etc.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic deployment was to support data locality. If you want to have locality, you need to put your processors close to the data they're processing. Upon further investigation, though, this feature is not that beneficial. There is some good discussion about some problems with it on SAMZA-335. Again, we took the Map/Reduce path, but there are some fundamental differences between HDFS and Kafka. HDFS has blocks, while Kafka has partitions. This leads to less optimization potential with stream processors on top of Kafka.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch. Samza doesn't have any built-in fault-tolerance logic. Instead, it depends on the dynamic deployment scheduling system to handle restarts when a processor dies. This has made it very difficult to write a standalone Samza container (SAMZA-516).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Pluggability
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, but I think that we've gone too far with it. Currently, Samza has:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
> > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
> > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every component (MessageChooser, SystemStreamPartitionGrouper, ConfigRewriter, etc).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> There's probably more that I've forgotten, as well. Some of these are useful, but some have proven not to be. This all comes at a cost: complexity. This complexity is making it harder for our users to pick up and use Samza out of the box. It also makes it difficult for Samza developers to reason about the characteristics of the container (since the characteristics change depending on which plugins are used).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most visible in the System APIs. What Samza really requires to be functional is Kafka as its transport layer. But we've conflated two unrelated use cases into one API:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
> > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The current System API supports both of these use cases. The problem is, we actually want different features for each use case. By papering over these two use cases, and providing a single API, we've introduced a ton of leaky abstractions.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in (2) is to have monotonically increasing longs for offsets (like Kafka). This would be at odds with (1), though, since different systems have different SCNs/Offsets/UUIDs/vectors. There was discussion both on the mailing list and the SQL JIRAs about the need for this.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The same thing holds true for replayability. Kafka allows us to rewind when we have a failure. Many other systems don't. In some cases, systems return null for their offsets (e.g. WikipediaSystemConsumer) because they have no offsets.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka supports partitioning, but many systems don't. We model this by having a single partition for those systems. Still, other systems model partitioning differently (e.g. Kinesis).
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess. Creating streams in a system-agnostic way is almost impossible, as is modeling metadata for the system (replication factor, partitions, location, etc). The list goes on.
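The offset and replayability mismatch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LogSource`, `FeedSource`, and `resume_point` are hypothetical names, not Samza or Kafka APIs. One source behaves like a Kafka partition (monotonically increasing integer offsets, so a reader can rewind); the other has no offsets at all and reports `None`, so nothing useful can be checkpointed for it.

```python
from typing import Iterator, List, Optional, Tuple


class LogSource:
    """Kafka-like source: each message has a monotonically increasing offset."""

    def __init__(self, messages: List[str]) -> None:
        self._messages = list(messages)

    def read_from(self, offset: int) -> Iterator[Tuple[Optional[int], str]]:
        # Reading can start at any previously seen offset, which is what
        # makes replay after a failure possible.
        for i in range(offset, len(self._messages)):
            yield i, self._messages[i]


class FeedSource:
    """Non-replayable source: messages carry no offset and are consumed once."""

    def __init__(self, messages: List[str]) -> None:
        self._pending = list(messages)

    def read(self) -> Iterator[Tuple[Optional[int], str]]:
        while self._pending:
            yield None, self._pending.pop(0)


def resume_point(last_offset: Optional[int]) -> Optional[int]:
    """Offset to restart from after a failure, or None if replay is impossible."""
    return None if last_offset is None else last_offset + 1


log = LogSource(["m0", "m1", "m2"])
last_seen, _ = list(log.read_from(0))[1]       # processed m0 and m1, then crashed
print(list(log.read_from(resume_point(last_seen))))  # [(2, 'm2')]

feed = FeedSource(["e0", "e1"])
feed_offset, _ = next(feed.read())
print(resume_point(feed_offset))               # None
```

A single API that has to cover both kinds of source ends up typed around the weaker one, which is the leaky abstraction being described.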
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Duplicate work
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> At the time that we began writing Samza, Kafka's consumer and producer APIs had a relatively weak feature set. On the consumer side, you had two options: use the high-level consumer, or the simple consumer. The problem with the high-level consumer was that it controlled your offsets, partition assignments, and the order in which you received messages. The problem with the simple consumer is that it's not simple. It's basic. You end up having to handle a lot of really low-level stuff that you shouldn't. We spent a lot of time making Samza's KafkaSystemConsumer very robust. It also allows us to support some cool features:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and prioritization.
> > > > >> > > > > >> > >>>>>>>>> * Tight control over partition assignment to support joins, global state (if we want to implement it :)), etc.
> > > > >> > > > > >> > >>>>>>>>> * Tight control over offset checkpointing.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is that these features should actually be in Kafka. A lot of Kafka consumers (not just Samza stream processors) end up wanting to do things like joins and partition assignment. The Kafka community has come to the same conclusion. They're adding a ton of upgrades into their new Kafka consumer implementation. To a large extent, it duplicates work we've already done in Samza.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a very similar approach to Samza's KafkaCheckpointManager implementation for handling offset checkpointing. Like Samza, Kafka's new offset management feature stores offset checkpoints in a topic, and allows you to fetch them from the broker.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we could have shared the work if it had been done in Kafka from the get-go.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Vision
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather radical proposal. Samza is relatively stable at this point. I'd venture to say that we're near a 1.0 release. I'd like to propose that we take what we've learned, and begin thinking about Samza beyond 1.0. What would we change if we were starting from scratch? My proposal is to:
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run Samza processors, and eliminate all direct dependencies on YARN, Mesos, etc.
> > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support only Kafka as the stream processing layer.
> > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, serialization, and config systems, and simply use Kafka's instead.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that I outlined above. It should also shrink the Samza code base pretty dramatically. Supporting only a standalone container will allow Samza to be executed on YARN (using Slider), Mesos (using Marathon/Aurora), or most other in-house deployment systems. This should make life a lot easier for new users. Imagine having the hello-samza tutorial without YARN. The drop in mailing list traffic will be pretty dramatic.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long overdue to me. The reality is, everyone that I'm aware of is using Samza with Kafka. We basically require it already in order for most features to work. Those that are using other systems are generally using them for ingest into Kafka (1), and then they do the processing on top. There is already discussion (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767) in Kafka to make ingesting into Kafka extremely easy.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with Kafka, we can leverage a ton of their ecosystem. We no longer have to maintain our own config, metrics, etc. We can all share the same libraries, and make them better. This will also allow us to share the consumer/producer APIs, and will let us leverage their offset management and partition management, rather than having our own. All of the coordinator stream code would go away, as would most of the YARN AppMaster code. We'd probably have to push some partition management features into the Kafka broker, but they're already moving in that direction with the new consumer API. The features we have for partition assignment aren't unique to Samza, and seem like they should be in Kafka anyway. There will always be some niche usages which will require extra care and hence full control over partition assignments, much like the Kafka low-level consumer API. These would continue to be supported.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> These items will be good for the Samza community. They'll make Samza easier to use, and make it easier for developers to add new features.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large (and somewhat backwards incompatible) change. If we choose to go this route, it's important that we openly communicate how we're going to provide a migration path from the existing APIs to the new ones (if we make incompatible changes). I think at a minimum, we'd probably need to provide a wrapper to allow existing StreamTask implementations to continue running on the new container. It's also important that we openly communicate about timing, and stages of the migration.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure you have opinions. :) Please send your thoughts and feedback.
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>>>>> Cheers,
> > > > >> > > > > >> > >>>>>>>>> Chris
> > > > >> > > > > >> > >>>>>>>>>
> > > > >> > > > > >> > >>>>>> --
> > > > >> > > > > >> > >>>>>> -- Guozhang