Thank you all for your support. Looking forward to working with the Apache community.
-leo

On Sep 26, 2011, at 9:47 AM, Patrick Hunt wrote:

> This passes, with 16 +1 votes, plenty of them binding, and no -1 votes.
>
> Thanks to all who voted!
>
> We can now get started creating the Apache S4 podling.
>
> Patrick
>
> On Tue, Sep 20, 2011 at 1:56 PM, Patrick Hunt <ph...@apache.org> wrote:
>> It's been nearly a week since the S4 proposal was submitted for
>> discussion. A few questions were asked, and the proposal was clarified
>> in response. Sufficient mentors have volunteered. I thus feel we are
>> now ready for a vote.
>>
>> The latest proposal can be found at the end of this email and at:
>>
>> http://wiki.apache.org/incubator/S4Proposal
>>
>> The discussion regarding the proposal can be found at:
>>
>> http://s.apache.org/RMU
>>
>> Please cast your votes:
>>
>> [ ] +1 Accept S4 for incubation
>> [ ] +0 Indifferent to S4 incubation
>> [ ] -1 Reject S4 for incubation
>>
>> This vote will close 72 hours from now.
>>
>> Thanks,
>>
>> Patrick
>>
>> ------------------
>> = S4 Proposal =
>>
>> == Abstract ==
>>
>> S4 (Simple Scalable Streaming System) is a general-purpose,
>> distributed, scalable, partially fault-tolerant, pluggable platform
>> that allows programmers to easily develop applications for processing
>> continuous, unbounded streams of data.
>>
>> == Proposal ==
>>
>> S4 is a software platform written in Java. Clients that send and
>> receive events can be written in any programming language. S4 also
>> includes a collection of modules called Processing Elements (or PEs
>> for short) that implement basic functionality and can be used by
>> application developers. In S4, keyed data events are routed with
>> affinity to Processing Elements (PEs), which consume the events and do
>> one or both of the following: (1) ''emit'' one or more events which
>> may be consumed by other PEs, (2) ''publish'' results.
>> The architecture resembles the Actors model, providing semantics of
>> encapsulation and location transparency, thus allowing applications to
>> be massively concurrent while exposing a simple programming interface
>> to application developers.
>>
>> To drive adoption and increase the number of contributors to the
>> project, we may need to prioritize the focus based on feedback from
>> the community. We believe that one of the top priorities and driving
>> design principles for the S4 project is to provide a simple API that
>> hides most of the complexity associated with distributed systems and
>> concurrency. The project grew out of the need to provide a flexible
>> platform for application developers and scientists that can be used
>> for quick experimentation and production.
>>
>> S4 differs from existing Apache projects in a number of fundamental
>> ways. Flume is an Incubator project that focuses on log processing,
>> performing lightweight processing in a distributed fashion and
>> accumulating log data in a centralized repository for batch
>> processing. S4 instead performs all stream processing in a distributed
>> fashion and enables applications to form arbitrary graphs to process
>> streams of events. We see Flume as a complementary project. We also
>> expect S4 to complement Hadoop processing and in some cases to
>> supersede it. Kafka is another Incubator project that focuses on
>> processing large amounts of stream data. The design of Kafka, however,
>> follows the pub-sub paradigm, which focuses on delivering messages
>> containing arbitrary data from source processes (publishers) to
>> consumer processes (subscribers). Compared to S4, Kafka is an
>> intermediate step between data generation and processing, while S4 is
>> itself a platform for processing streams of events.
>>
>> S4 overall addresses a need of existing applications to process
>> streams of events beyond moving data to a centralized repository for
>> batch processing.
>> It complements the features of existing Apache
>> projects, such as Hadoop, Flume, and Kafka, by providing a flexible
>> platform for distributed event processing.
>>
>> == Background ==
>>
>> S4 was initially developed at Yahoo! Labs starting in 2008 to process
>> user feedback in the context of search advertising. The project was
>> licensed under the Apache License version 2.0 in October 2010. The
>> project documentation is currently available at http://s4.io .
>>
>> == Rationale ==
>>
>> Stream computing has been growing steadily over the last 20 years.
>> However, recently there has been an explosion in real-time data
>> sources including the Web, sensor networks, financial securities
>> analysis and trading, traffic monitoring, natural language processing
>> of news and social data, and much more.
>>
>> While Hadoop has evolved into a standard open source solution for batch
>> processing of massive data sets, there is no equivalent
>> community-supported open source platform for processing data streams in
>> real time. While various research projects have evolved into
>> proprietary commercial products, S4 has the potential to fill the gap.
>> Many projects that require a scalable stream processing architecture
>> currently use Hadoop by segmenting the input stream into data batches.
>> This solution is not efficient, results in high latency, and
>> introduces unnecessary complexity.
>>
>> The S4 design is primarily driven by large scale applications for data
>> mining and machine learning in a production environment. We think that
>> the S4 design is surprisingly flexible and lends itself to run in
>> large clusters built with commodity hardware.
>>
>> S4 enables application programmers to focus more on the application
>> and less on the infrastructure. S4 also provides a consistent
>> graph-oriented programming model that, if widely adopted, will facilitate
>> sharing of basic components across developers.
>>
>> == Initial Goals ==
>>
>> The basic S4 infrastructure is complete and can be used in real-world
>> applications. However, many additional components need to be developed
>> and improved. Some areas we hope to focus on in Apache:
>>
>> * Add a reliable communication protocol option to the communication
>> layer for low bandwidth control messages that require guaranteed
>> delivery.
>> * Higher-performance serialization and inter-node communication.
>> * Functionality to save the state of PEs at runtime transparently and
>> restore it at startup.
>> * Intelligent load shedding strategies.
>> * Dynamic load balancing to make it possible to add and remove nodes
>> from the cluster without data loss.
>> * Dynamic application loading and unloading.
>> * Migration to a pure object-oriented design that takes advantage of
>> Java static typing using Generics in the framework code. (Keep it
>> simple for the application developer.)
>> * Eliminate string identifiers and XML configuration.
>> * Adopt JSR 330 (Dependency Injection for Java).
>> * Add real-time query support.
>> * Add a cluster management system.
>>
>> Clearly this is a long list, but it sets the high-level roadmap for the project.
>>
>> == Current Status ==
>>
>> The project has been under development at Yahoo! since late 2008, and
>> it was open sourced in October 2010. Since then we have received
>> patches from developers, started a discussion forum, and improved the
>> documentation.
>>
>> === Meritocracy ===
>>
>> The S4 project was initially developed at Yahoo! Labs, a
>> research-oriented organization that values original ideas and
>> individual contributions. The design evolved in a bottom-up fashion,
>> where decisions were driven by the application and the long-term
>> viability and flexibility of the platform. Once the project became
>> open-source it continued to be managed by those who were actively
>> doing the work.
>>
>> === Community ===
>>
>> S4 is currently in use internally at Yahoo!, and since it was released
>> as an open source project it has received positive feedback and
>> contributions from developers.
>>
>> === Core Developers ===
>>
>> S4 developers span a few companies and work on a voluntary basis. We
>> expect to have developers from other organizations joining the team in
>> the next few months, especially if S4 joins the Apache Incubator
>> project. Being an Apache Incubator project is likely to attract the
>> attention of more talented developers.
>>
>> One interesting aspect of the current group of developers is the
>> diverse background:
>>
>> * Kishore Gopalakrishna was the main developer of the communication
>> layer and the integration with ZooKeeper. He has been an active
>> contributor to Hadoop.
>> * Flavio Junqueira has a background in distributed computing. He is a
>> committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
>> BookKeeper.
>> * Matthieu Morel has extensive background in distributed systems; he
>> likes theory and loves to implement things. He has been the main
>> designer and implementor of S4 checkpointing.
>> * Anish Nair has been the
>> project’s main customer. With his background in natural language
>> processing and algorithms he developed the applications that drove the
>> S4 design including processing of social feeds and real-time
>> recommendation engines.
>> * Leo Neumeyer has a background in signal processing and statistical
>> modeling but has been advocating clean simple software design
>> throughout his career. At Yahoo! he conceived and championed the S4
>> project as a solution to improve monetization in search advertising.
>> * Bruce Robbins has been the main S4 developer, taking the concept
>> from idea to releases. Bruce's engineering experience ranges from
>> programming Mainframe computers to assembly code.
>>
>> === Alignment ===
>>
>> S4 brings stream processing capabilities that complement Hadoop's
>> batch processing capabilities.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> S4 has been used in production at Yahoo! and is being evaluated by
>> other organizations. The developers have continued to support the
>> project on their own time. We believe that adoption will increase
>> significantly as more tools and documentation become available. As the
>> project evolves, we may see new ideas that we may want to adopt or, if
>> it makes sense and is practical, we may want to merge two or more open
>> source projects. We believe that there is a clear need to have a
>> well-supported open source stream processing platform, and therefore there
>> is a low risk of the project becoming orphaned. However, we are open to
>> combining projects in order to have fewer projects with a more active
>> community. Ultimately, this will be decided by the design ideas, the
>> implementation quality, and the adoption.
>>
>> === Inexperience with Open Source ===
>>
>> The S4 code was open sourced by Yahoo! under the Apache 2.0 license. One
>> committer of the S4 project, Flavio Junqueira, is intimately familiar
>> with the Apache model for open-source development and is experienced
>> with working with new contributors. Flavio is both a committer and a PMC
>> member for ZooKeeper. The other developers have had experience as
>> contributors in other open-source projects. Most of the original S4
>> developers continue to be committers.
>>
>> === Homogeneous Developers ===
>>
>> The initial set of committers for S4 represents four different
>> companies: A9, LinkedIn, Quantbench, and Yahoo!. This set is diverse
>> enough for a starting project.
>>
>> === Reliance on Salaried Developers ===
>>
>> Some committers are contributing as part of their jobs, but as we move
>> to a more diverse set of developers we expect a good mix of salaried
>> and volunteer time.
>>
>> === Relationships with Other Apache Projects ===
>>
>> S4 relies on the following Apache projects:
>>
>> * BCEL (bytecode generation library)
>> * commons cli (command line interface)
>> * commons logging (needed by some other dependency)
>> * log4j
>> * commons jexl (expression processing)
>> * zookeeper
>> * Maven and its usual plug-ins (build time only)
>>
>> Compared to existing projects, S4 complements existing functionality
>> in a few ways, summarized below:
>> * Flume: S4 processes streams in a distributed fashion and enables
>> applications to form arbitrary graphs of processing elements. Flume
>> focuses on accumulating streams of logs in a centralized repository for
>> batch processing.
>> * Kafka: Kafka is a pub/sub messaging layer that sits between
>> event generation and processing, while S4 itself forwards events
>> and processes them in a streaming fashion.
>> * Hadoop: Hadoop focuses on batch processing of large data sets,
>> while S4 is a platform for stream processing of events. We would like
>> to implement extensions that enable processing in both platforms with
>> the same code.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> The project has already received a significant amount of attention and
>> so far has been associated with Yahoo!. We would like, however, to
>> foster the development of a community around S4 that evolves
>> independently of the interests of a single company. Given the reliance
>> of S4 on some Apache projects and the principles promoted by the
>> foundation, we find it a suitable home for the project.
>>
>> == Documentation ==
>>
>> * S4 Website: http://s4.io
>> * S4 documentation: http://docs.s4.io/
>> * S4 Forum: http://groups.google.com/group/s4-project/topics
>> * S4 Mailing list (with archives): http://groups.google.com/group/s4-project
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> The S4 source code is already licensed under Apache Software License
>> 2.0.
>> The source code is available at https://github.com/s4
>>
>>
>> == External Dependencies ==
>>
>> * asm (3-clause BSD license)
>> * json (json.org's own license
>> http://www.crockford.com/JSON/license.html which is acceptable as per
>> Apache FAQ: http://www.apache.org/legal/resolved.html#json)
>> * kryo (4-clause BSD license)
>> * spring framework (Apache license - v 2)
>> * codehaus jackson (Apache license)
>> * junit (Common Public License - v 1.0)
>>
>> == Cryptography ==
>> None
>>
>> == Required Resources ==
>>
>> === Mailing lists ===
>> * s4-dev
>> * s4-user
>> * s4-private (with moderated subscriptions)
>> * s4-commit
>>
>> === Subversion Directory ===
>>
>> https://svn.apache.org/repos/asf/incubator/s4
>>
>> === Issue Tracking ===
>>
>> JIRA S4 (S4)
>>
>> == Initial Committers ==
>> * Kishore Gopalakrishna (kg at s4 dot io)
>> * Flavio Junqueira (fpj at s4 dot io)
>> * Matthieu Morel (mm at s4 dot io)
>> * Anish Nair (an at s4 dot com)
>> * Leo Neumeyer (leo at s4 dot io)
>> * Bruce Robbins (br at s4 dot io)
>>
>> == Affiliations ==
>> * Kishore Gopalakrishna, Linkedin
>> * Flavio Junqueira, Yahoo!
>> * Matthieu Morel, Yahoo!
>> * Anish Nair, A9
>> * Leo Neumeyer, Quantbench
>> * Bruce Robbins, Yahoo!
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>> * Patrick Hunt
>>
>> === Nominated Mentors ===
>>
>> * Patrick Hunt
>> * Owen O’Malley
>> * Arun Murthy
>>
>> === Sponsoring Entity ===
>>
>> * Apache Incubator PMC
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org