Re: [VOTE] S4 to join the Incubator

Leo Neumeyer Mon, 26 Sep 2011 09:56:07 -0700

Thank you all for your support, looking forward to working with the Apache 
community.


-leo

On Sep 26, 2011, at 9:47 AM, Patrick Hunt wrote:

> This passes, with 16 +1 votes, plenty of them binding, and no -1 votes.
> 
> Thanks to all who voted!
> 
> We can now get started creating the Apache S4 podling.
> 
> Patrick
> 
> On Tue, Sep 20, 2011 at 1:56 PM, Patrick Hunt <ph...@apache.org> wrote:
>> It's been nearly a week since the S4 proposal was submitted for
>> discussion.  A few questions were asked, and the proposal was clarified
>> in response.  Sufficient mentors have volunteered.  I thus feel we are
>> now ready for a vote.
>> 
>> The latest proposal can be found at the end of this email and at:
>> 
>>  http://wiki.apache.org/incubator/S4Proposal
>> 
>> The discussion regarding the proposal can be found at:
>> 
>>  http://s.apache.org/RMU
>> 
>> Please cast your votes:
>> 
>> [  ] +1 Accept S4 for incubation
>> [  ] +0 Indifferent to S4 incubation
>> [  ] -1 Reject S4 for incubation
>> 
>> This vote will close 72 hours from now.
>> 
>> Thanks,
>> 
>> Patrick
>> 
>> ------------------
>> = S4 Proposal =
>> 
>> == Abstract ==
>> 
>> S4 (Simple Scalable Streaming System) is a general-purpose,
>> distributed, scalable, partially fault-tolerant, pluggable platform
>> that allows programmers to easily develop applications for processing
>> continuous, unbounded streams of data.
>> 
>> == Proposal ==
>> 
>> S4 is a software platform written in Java. Clients that send and
>> receive events can be written in any programming language. S4 also
>> includes a collection of modules called Processing Elements (or PEs
>> for short) that implement basic functionality and can be used by
>> application developers. In S4, keyed data events are routed with
>> affinity to Processing Elements (PEs), which consume the events and do
>> one or both of the following: (1) ''emit'' one or more events which
>> may be consumed by other PEs, (2) ''publish'' results. The
>> architecture resembles the Actors model, providing semantics of
>> encapsulation and location transparency, thus allowing applications to
>> be massively concurrent while exposing a simple programming interface
>> to application developers.
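
A minimal Java sketch of the keyed routing described above, for
illustration only. It does not use the S4 API: Event, SumPE, Router, and
nodeFor are hypothetical names, and hashing the key modulo the node count
is an assumed way to obtain affinity, not something the proposal specifies.

```java
import java.util.Map;
import java.util.TreeMap;

public class KeyedRoutingSketch {

    /** A keyed data event; the key decides which PE instance consumes it. */
    static final class Event {
        final String key;
        final long value;

        Event(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical PE: keeps a running total for exactly one key. */
    static final class SumPE {
        private final String key;
        private long total;

        SumPE(String key) {
            this.key = key;
        }

        /** Consume one event addressed to this PE's key. */
        void consume(Event event) {
            total += event.value;
        }

        /** Publish the current result as "key=total". */
        String publish() {
            return key + "=" + total;
        }
    }

    /** Routes events with key affinity: one PE instance per distinct key. */
    static final class Router {
        private final Map<String, SumPE> pesByKey = new TreeMap<>();

        /** Create the PE for a key on first use, then hand it the event. */
        void route(Event event) {
            pesByKey.computeIfAbsent(event.key, SumPE::new).consume(event);
        }

        /** Returns the published result for a key, or null if no PE exists. */
        String publish(String key) {
            SumPE pe = pesByKey.get(key);
            return pe == null ? null : pe.publish();
        }

        int peCount() {
            return pesByKey.size();
        }
    }

    /** Assumed placement rule: the same key always maps to the same node. */
    static int nodeFor(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        Router router = new Router();
        router.route(new Event("user-1", 2));
        router.route(new Event("user-2", 5));
        router.route(new Event("user-1", 3));
        System.out.println(router.publish("user-1"));
        System.out.println(router.publish("user-2"));
    }
}
```

Both "user-1" events reach the same SumPE instance, so that PE can keep
per-key state without locks or shared data structures; the application
code only ever sees consume and publish.
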
>> 
>> To drive adoption and increase the number of contributors to the
>> project, we may need to prioritize the focus based on feedback from
>> the community. We believe that one of the top priorities and driving
>> design principles for the S4 project is to provide a simple API that
>> hides most of the complexity associated with distributed systems and
>> concurrency. The project grew out of the need to provide a flexible
>> platform for application developers and scientists that can be used
>> for quick experimentation and production.
>> 
>> S4 differs from existing Apache projects in a number of fundamental
>> ways. Flume is an Incubator project that focuses on log processing,
>> performing lightweight processing in a distributed fashion and
>> accumulating log data in a centralized repository for batch
>> processing. S4 instead performs all stream processing in a distributed
>> fashion and enables applications to form arbitrary graphs to process
>> streams of events. We see Flume as a complementary project. We also
>> expect S4 to complement Hadoop processing and in some cases to
>> supersede it. Kafka is another Incubator project that focuses on
>> processing large amounts of stream data. The design of Kafka, however,
>> follows the pub-sub paradigm, which focuses on delivering messages
>> containing arbitrary data from source processes (publishers) to
>> consumer processes (subscribers). Compared to S4, Kafka is an
>> intermediate step between data generation and processing, while S4 is
>> itself a platform for processing streams of events.
>> 
>> S4 overall addresses a need of existing applications to process
>> streams of events beyond moving data to a centralized repository for
>> batch processing. It complements the features of existing Apache
>> projects, such as Hadoop, Flume, and Kafka, by providing a flexible
>> platform for distributed event processing.
>> 
>> == Background ==
>> 
>> S4 was initially developed at Yahoo! Labs starting in 2008 to process
>> user feedback in the context of search advertising. The project was
>> licensed under the Apache License version 2.0 in October 2010. The
>> project documentation is currently available at http://s4.io .
>> 
>> == Rationale ==
>> 
>> Stream computing has been growing steadily over the last 20 years.
>> However, recently there has been an explosion in real-time data
>> sources including the Web, sensor networks, financial securities
>> analysis and trading, traffic monitoring, natural language processing
>> of news and social data, and much more.
>> 
>> Although Hadoop has evolved into a standard open source solution for
>> batch processing of massive data sets, there is no equivalent
>> community-supported open source platform for processing data streams
>> in real time. While various research projects have evolved into
>> proprietary commercial products, S4 has the potential to fill the gap.
>> Many projects that require a scalable stream processing architecture
>> currently use Hadoop by segmenting the input stream into data batches.
>> This solution is not efficient, results in high latency, and
>> introduces unnecessary complexity.
>> 
>> The S4 design is primarily driven by large scale applications for data
>> mining and machine learning in a production environment. We think that
>> the S4 design is surprisingly flexible and lends itself to run in
>> large clusters built with commodity hardware.
>> 
>> S4 enables application programmers to focus more on the application
>> and less on the infrastructure. S4 also provides a consistent graph
>> oriented programming model that, if widely adopted, will facilitate
>> sharing of basic components across developers.
>> 
>> == Initial Goals ==
>> 
>> The basic S4 infrastructure is complete and can be used in real-world
>> applications. However, many additional components need to be developed
>> and improved. Some areas we hope to focus on in Apache:
>> 
>>  * Add a reliable communication protocol option to the communication
>> layer for low bandwidth control messages that require guaranteed
>> delivery.
>>  * Higher-performance serialization and inter-node communication.
>>  * Functionality to save the state of PEs at runtime transparently and
>> restore it at startup.
>>  * Intelligent load shedding strategies.
>>  * Dynamic load balancing to make it possible to add and remove nodes
>> from the cluster without data loss.
>>  * Dynamic application loading and unloading.
>>  * Migration to a pure object-oriented design that takes advantage of
>> Java static typing using Generics in the framework code. (Keep it
>> simple for the application developer.)
>>  * Eliminate string identifiers and XML configuration.
>>  * Adopt JSR 330 (Dependency Injection for Java).
>>  * Add real-time query support.
>>  * Add a cluster management system.
>> 
>> Clearly this is a long list, but it sets the high-level roadmap for
>> the project.
>> 
>> == Current Status ==
>> 
>> The project has been under development at Yahoo! since late 2008, and
>> it was open sourced in October 2010. Since then we have received
>> patches from developers, started a discussion forum, and improved the
>> documentation.
>> 
>> === Meritocracy ===
>> 
>> The S4 project was initially developed at Yahoo! Labs, a
>> research-oriented organization that values original ideas and
>> individual contributions. The design evolved in a bottom up fashion,
>> where decisions were driven by the application and the long-term
>> viability and flexibility of the platform. Once the project became
>> open-source it continued to be managed by those who were actively
>> doing the work.
>> 
>> === Community ===
>> 
>> S4 is currently in use internally at Yahoo!, and since it was released
>> as an open source project it has received positive feedback and
>> contributions from developers.
>> 
>> === Core Developers ===
>> 
>> S4 developers span a few companies and work on a voluntary basis. We
>> expect to have developers from other organizations joining the team in
>> the next few months, especially if S4 joins the Apache Incubator
>> project. Being an Apache Incubator project is likely to attract the
>> attention of more talented developers.
>> 
>> One interesting aspect of the current group of developers is the
>> diverse background:
>> 
>>  * Kishore Gopalakrishna was the main developer of the communication
>> layer and the integration with ZooKeeper. He has been an active
>> contributor to Hadoop.
>>  * Flavio Junqueira has a background in distributed computing. He is a
>> committer of ZooKeeper, a ZooKeeper PMC member, and a committer of
>> BookKeeper.
>>  * Matthieu Morel has an extensive background in distributed systems;
>> he likes theory and loves to implement things. He has been the main
>> designer and implementor of S4 checkpointing.
>>  * Anish Nair has been the project’s main customer. With his
>> background in natural language processing and algorithms, he
>> developed the applications that drove the S4 design, including
>> processing of social feeds and real-time recommendation engines.
>>  * Leo Neumeyer has a background in signal processing and statistical
>> modeling but has been advocating clean simple software design
>> throughout his career. At Yahoo! he conceived and championed the S4
>> project as a solution to improve monetization in search advertising.
>>  * Bruce Robbins has been the main S4 developer, taking the concept
>> from idea to releases. Bruce's engineering experience ranges from
>> programming Mainframe computers to assembly code.
>> 
>> === Alignment ===
>> 
>> S4 brings stream processing capabilities that complement Hadoop's
>> batch processing capabilities.
>> 
>> == Known Risks ==
>> 
>> === Orphaned Products ===
>> 
>> S4 has been used in production at Yahoo! and is being evaluated by
>> other organizations. The developers have continued to support the
>> project on their own time. We believe that adoption will increase
>> significantly as more tools and documentation become available. As the
>> project evolves, we may see new ideas that we may want to adopt or, if
>> it makes sense and is practical, we may want to merge two or more open
>> source projects. We believe that there is a clear need to have a well
>> supported open source stream processing platform and therefore, there
>> is low risk of the project becoming orphaned. However, we are open to
>> combining projects in order to have fewer projects with a more active
>> community. Ultimately, this will be decided by the design ideas, the
>> implementation quality, and the adoption.
>> 
>> === Inexperience with Open Source ===
>> 
>> The S4 code was open sourced by Yahoo! under the Apache 2.0 license. One
>> committer of the S4 project, Flavio Junqueira, is intimately familiar
>> with the Apache model for open-source development and is experienced
>> with working with new contributors. Flavio is both a committer and a PMC
>> member for ZooKeeper. The other developers have had experience as
>> contributors in other open-source projects. Most of the original S4
>> developers continue to be committers.
>> 
>> === Homogeneous Developers ===
>> 
>> The initial set of committers for S4 represents four different
>> companies: A9, LinkedIn, Quantbench, and Yahoo!. This set is diverse
>> enough for a starting project.
>> 
>> === Reliance on Salaried Developers ===
>> 
>> Some committers are contributing as part of their jobs, but as we move
>> to a more diverse set of developers we expect a good mix of salaried
>> and volunteer time.
>> 
>> === Relationships with Other Apache Projects ===
>> 
>> S4 relies on the following Apache projects:
>> 
>>  * BCEL (bytecode generation library)
>>  * commons cli (command line interface)
>>  * commons logging (needed by some other dependency)
>>  * log4j
>>  * commons jexl (expression processing)
>>  * zookeeper
>>  * Maven and its usual plug-ins (build time only)
>> 
>> Compared to existing projects, S4 complements existing functionality
>> in a few ways summarized below:
>>  * Flume: S4 processes streams in a distributed fashion and enables
>> applications to form arbitrary graphs of processing elements. Flume
>> focuses on accumulating streams of logs in a centralized repository for
>> batch processing.
>>  * Kafka: Kafka is a pub/sub messaging layer that interposes
>> generation of events and processing, while S4 itself forwards events
>> and processes them in a stream fashion.
>>  * Hadoop: Hadoop focuses on batch processing of large data sets,
>> while S4 is a platform for stream processing of events. We would like
>> to implement extensions that enable processing in both platforms with
>> the same code.
>> 
>> === An Excessive Fascination with the Apache Brand ===
>> 
>> The project has already received a significant amount of attention and
>> so far has been associated with Yahoo!. We would like, however, to
>> foster the development of a community around S4 that evolves
>> independently of the interests of a single company. Given the reliance
>> of S4 on some Apache projects and the principles promoted by the
>> foundation, we find it a suitable home for the project.
>> 
>> == Documentation ==
>> 
>>  * S4 Website: http://s4.io
>>  * S4 documentation: http://docs.s4.io/
>>  * S4 Forum: http://groups.google.com/group/s4-project/topics
>>  * S4 Mailing list (with archives): http://groups.google.com/group/s4-project
>> 
>> == Source and Intellectual Property Submission Plan ==
>> 
>> The S4 source code is already licensed under Apache Software License
>> 2.0. The source code is available at https://github.com/s4
>> 
>> 
>> == External Dependencies ==
>> 
>>  * asm (3-clause BSD license)
>>  * json (json.org's own license
>> http://www.crockford.com/JSON/license.html which is acceptable as per
>> Apache FAQ: http://www.apache.org/legal/resolved.html#json)
>>  * kryo (4-clause BSD license)
>>  * spring framework (Apache license - v 2)
>>  * codehaus jackson (Apache license)
>>  * junit (Common Public License - v 1.0)
>> 
>> == Cryptography ==
>> None
>> 
>> == Required Resources ==
>> 
>> === Mailing lists ===
>>  * s4-dev
>>  * s4-user
>>  * s4-private (with moderated subscriptions)
>>  * s4-commit
>> 
>> === Subversion Directory ===
>> 
>> https://svn.apache.org/repos/asf/incubator/s4
>> 
>> === Issue Tracking ===
>> 
>> JIRA S4 (S4)
>> 
>> == Initial Committers ==
>>  * Kishore Gopalakrishna (kg at s4 dot io)
>>  * Flavio Junqueira (fpj at s4 dot io)
>>  * Matthieu Morel (mm at s4 dot io)
>>  * Anish Nair (an at s4 dot com)
>>  * Leo Neumeyer (leo at s4 dot io)
>>  * Bruce Robbins (br at s4 dot io)
>> 
>> == Affiliations ==
>>  * Kishore Gopalakrishna, LinkedIn
>>  * Flavio Junqueira, Yahoo!
>>  * Matthieu Morel, Yahoo!
>>  * Anish Nair, A9
>>  * Leo Neumeyer, Quantbench
>>  * Bruce Robbins, Yahoo!
>> 
>> == Sponsors ==
>> 
>> === Champion ===
>> 
>>  * Patrick Hunt
>> 
>> === Nominated Mentors ===
>> 
>>  * Patrick Hunt
>>  * Owen O’Malley
>>  * Arun Murthy
>> 
>> === Sponsoring Entity ===
>> 
>>  * Apache Incubator PMC
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
> 


