Re: [PROPOSAL] Kafka for the Apache Incubator

Jun Rao Thu, 23 Jun 2011 08:29:25 -0700

Thanks, Tommaso, Chris and Mattmann.

Jun


On Thu, Jun 23, 2011 at 8:07 AM, Tommaso Teofili
<tommaso.teof...@gmail.com> wrote:

> Wow, very nice proposal guys!
> Tommaso
>
> 2011/6/22 Jun Rao <jun...@gmail.com>
>
> > Hi,
> >
> > I would like to propose Kafka to be an Apache Incubator project.  Kafka
> > is a distributed, high throughput, publish-subscribe system for
> > processing large amounts of streaming data.
> >
> > Here's a link to the proposal in the Incubator wiki
> > http://wiki.apache.org/incubator/KafkaProposal
> >
> > I've also pasted the initial contents below.
> >
> > Thanks,
> >
> > Jun
> >
> > == Abstract ==
> > Kafka is a distributed publish-subscribe system for processing large
> > amounts of streaming data.
> >
> > == Proposal ==
> > Kafka provides an extremely high throughput distributed publish/subscribe
> > messaging system.  Additionally, it supports relatively long term
> > persistence of messages to support a wide variety of consumers,
> > partitioning of the message stream across servers and consumers, and
> > functionality for loading data into Apache Hadoop for offline, batch
> > processing.
> >
> > == Background ==
> > Kafka was developed at LinkedIn to process the large amounts of events
> > generated by that company's website and provide a common repository for
> > many types of consumers to access and process those events. Kafka has
> > been used in production at LinkedIn scale to handle dozens of types of
> > events including page views, searches and social network activity. Kafka
> > clusters at LinkedIn currently process more than two billion events per
> > day.
> >
> > Kafka fills the gap between messaging systems such as Apache ActiveMQ,
> > which can provide high-volume messaging but lack persistence of those
> > messages, and log processing systems such as Scribe and Flume, which do
> > not provide adequate latency for our diverse set of consumers.  Kafka can
> > also be inserted into traditional log-processing systems, acting as an
> > intermediate step before further processing. Kafka focuses relentlessly
> > on performance and throughput by neither introspecting message content
> > nor indexing messages on the broker.  We also achieve high performance by
> > depending on Java's sendFile/transferTo capabilities to minimize
> > intermediate buffer copies and relying on the OS's pagecache to
> > efficiently serve up message contents to consumers.
> >
> > Kafka is written in Scala and depends on Apache ZooKeeper for
> > coordination amongst its producers, brokers and consumers.
> >
> > Kafka was developed internally at LinkedIn to meet our particular use
> > cases, but will be useful to many organizations facing a similar need to
> > reliably process large amounts of streaming data.  Therefore, we would
> > like to share it with the ASF and begin developing a community of
> > developers and users within Apache.
> >
> > == Rationale ==
> > Many organizations can benefit from a reliable stream processing system
> > such as Kafka.  While our use case of processing events from a very large
> > website like LinkedIn has driven the design of Kafka, its uses are varied
> > and we expect many new use cases to emerge.  Kafka provides a natural
> > bridge between near real-time event processing and offline batch
> > processing and will appeal to many users.
> >
> > == Current Status ==
> > === Meritocracy ===
> > Our intent with this incubator proposal is to start building a diverse
> > developer community around Kafka following the Apache meritocracy model.
> > Since Kafka was open sourced we have solicited contributions via the
> > website and presentations given to user groups and technical audiences.
> > We have had positive responses to these and have received several
> > contributions and clients for other languages.  We plan to continue this
> > support for new contributors and work with those who contribute
> > significantly to the project to make them committers.
> >
> > === Community ===
> > Kafka is currently being developed by engineers within LinkedIn and used
> > in production in that company. Additionally, we have active users in or
> > have received contributions from a diverse set of companies including
> > MediaSift, SocialTwist, Clearspring and Urban Airship. Recent public
> > presentations of Kafka and its goals garnered much interest from
> > potential contributors. We hope to extend our contributor base
> > significantly and invite all those who are interested in building
> > high-throughput distributed systems to participate.  We have begun
> > receiving contributions from outside of LinkedIn, including clients for
> > several languages such as Ruby, PHP, Clojure, .NET and Python.
> >
> > To further this goal, we use GitHub issue tracking and branching
> > facilities and maintain a public mailing list via Google Groups.
> >
> > === Core Developers ===
> > Kafka is currently being developed by four engineers at LinkedIn: Neha
> > Narkhede, Jun Rao, Jakob Homan and Jay Kreps. Jun has experience within
> > Apache as a Cassandra committer and PMC member. Neha has been an active
> > contributor to several projects LinkedIn has open sourced, including
> > Bobo, Sensei and Zoie. Jay has experience with open source software as
> > the originator of Project Voldemort, as well as being active within the
> > Hadoop ecosystem community. Jakob is an Apache Hadoop committer and PMC
> > member and a previous Apache ZooKeeper contributor.
> >
> > === Alignment ===
> > The ASF is the natural choice to host the Kafka project as its goal of
> > encouraging community-driven open-source projects fits with our vision
> > for Kafka.  Additionally, many other projects with which we are familiar
> > and with which we expect Kafka to integrate, such as Apache Hadoop, Pig,
> > ZooKeeper and log4j, are hosted by the ASF, and we will benefit and
> > provide benefit by close proximity to them.
> >
> > == Known Risks ==
> > === Orphaned Products ===
> > The core developers plan to work full time on the project. There is very
> > little risk of Kafka being abandoned as it is a critical part of
> > LinkedIn's internal infrastructure and is in production use.
> >
> > === Inexperience with Open Source ===
> > All of the core developers have experience with open source development.
> > LinkedIn open sourced Kafka several months ago and has been receiving
> > contributions since.  Jun is an Apache Cassandra committer and PMC
> > member.  Jay and Neha have been involved with several open source
> > projects released by LinkedIn.  Jakob has been actively involved with the
> > ASF as a full-time Hadoop committer and PMC member.
> >
> > === Homogeneous Developers ===
> > The current core developers are all from LinkedIn. However, we hope to
> > establish a developer community that includes contributors from several
> > corporations and we are actively encouraging new contributors via the
> > mailing lists and public presentations of Kafka.
> >
> > === Reliance on Salaried Developers ===
> > Currently, the developers are paid to do work on Kafka. However, once the
> > project has a community built around it, we expect to get committers,
> > developers and community from outside the current core developers.
> > Because LinkedIn relies on Kafka internally, though, the reliance on
> > salaried developers is unlikely to change.
> >
> > === Relationships with Other Apache Products ===
> > Kafka is deeply integrated with Apache products. Kafka uses Apache
> > ZooKeeper to coordinate its state amongst the brokers, consumers, and
> > soon, the producers.  Kafka provides input formats to allow Hadoop
> > MapReduce to load data directly from Kafka.  Kafka provides an appender
> > to allow consuming data directly from Apache log4j.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > While we respect the reputation of the Apache brand and have no doubts
> > that it will attract contributors and users, our interest is primarily to
> > give Kafka a solid home as an open source project following an
> > established development model. We have also given reasons in the
> > Rationale and Alignment sections.
> >
> > == Documentation ==
> > Information about Kafka can be found at [http://sna-projects.com/kafka/].
> > The following links provide more information about the project:
> >
> >  * Kafka roadmap and goals: [http://sna-projects.com/kafka/projects.php]
> >  * The GitHub site: [https://github.com/kafka-dev/kafka]
> >  * Kafka overview from Jay Kreps:
> >    [http://www.slideshare.net/ydn/hug-january-2011-kafka-presentation]
> >  * Kafka overview from Jakob Homan: [http://bit.ly/fLmoZz]
> >  * Kafka paper at NetDB 2011:
> >    [http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf]
> >
> > == Initial Source ==
> > Kafka has been under development at LinkedIn since November 2009.  It was
> > open sourced by LinkedIn in January 2011.  It is currently hosted on
> > GitHub under the Apache license at [https://github.com/kafka-dev/kafka]
> >
> > Kafka is mainly written in Scala with some performance testing code in
> > Java.  Several clients have been contributed in other languages,
> > including Ruby, PHP, Clojure, .NET and Python.  Its source tree is
> > entirely self-contained and relies on simple build tool (sbt) as its
> > build system and dependency resolution mechanism.
> >
> > == External Dependencies ==
> > The dependencies all have Apache compatible licenses.
> >
> > == Cryptography ==
> > Not applicable.
> >
> > == Required Resources ==
> > === Mailing Lists ===
> >  * kafka-private for private PMC discussions (with moderated
> >    subscriptions)
> >  * kafka-dev
> >  * kafka-commits
> >  * kafka-user
> >
> > === Subversion Directory ===
> > [https://svn.apache.org/repos/asf/incubator/kafka]
> >
> > === Issue Tracking ===
> > JIRA Kafka (KAFKA)
> >
> > === Other Resources ===
> > The existing code already has unit tests, so we would like a Hudson
> > instance to run them whenever a new patch is submitted. This can be added
> > after project creation.
> >
> > == Initial Committers ==
> >  * Jay Kreps
> >  * Jun Rao
> >  * Neha Narkhede
> >  * Jakob Homan
> >
> > == Affiliations ==
> >  * Jay Kreps (LinkedIn)
> >  * Jun Rao (LinkedIn)
> >  * Neha Narkhede (LinkedIn)
> >  * Jakob Homan (LinkedIn)
> >
> > == Sponsors ==
> > === Champion ===
> > Chris Douglas (Apache Member)
> >
> > === Nominated Mentors ===
> >  * Alan Cabrera (Apache Member)
> >  * Geir Magnusson, Jr. (Apache Member and Director)
> >  * Owen O'Malley (Apache Member)
> >
> > === Sponsoring Entity ===
> > We are requesting the Incubator to sponsor this project.
> >
>
