Hi, I have a few questions.
My understanding is that a podling requires 3 +1s for progress (releases, new committers). Does this mean we need at least 3 mentors? Would it be helpful to have "extras"? Is it OK for the Champion to also be a Mentor? Are there any concerns or points for discussion with the proposal? (Or are +1s basically saying LGTM?)

Thanks,
Jon.

On Tue, May 31, 2011 at 4:13 AM, Mohammad Nour El-Din <
nour.moham...@gmail.com> wrote:

> +1 (binding)
>
> On Tue, May 31, 2011 at 11:59 AM, Mark Struberg <strub...@yahoo.de> wrote:
> > +1
> >
> > LieGrue,
> > strub
> >
> > --- On Mon, 5/30/11, Yoav Shapira <yo...@apache.org> wrote:
> >
> >> From: Yoav Shapira <yo...@apache.org>
> >> Subject: Re: [PROPOSAL] Flume for the Apache Incubator
> >> To: general@incubator.apache.org
> >> Date: Monday, May 30, 2011, 11:18 PM
> >> On Fri, May 27, 2011 at 10:18 AM, Jonathan Hsieh <j...@cloudera.com> wrote:
> >> > I would like to propose Flume to be an Apache Incubator project. Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data to scalable data storage systems such as Apache Hadoop's HDFS.
> >> >
> >> > Here's a link to the proposal in the Incubator wiki:
> >> > http://wiki.apache.org/incubator/FlumeProposal
> >>
> >> +1, cool stuff.
> >>
> >> Yoav
> >>
> >> >
> >> > I've also pasted the initial contents below.
> >> >
> >> > Thanks!
> >> > Jon.
> >> >
> >> > = Flume - A Distributed Log Collection System =
> >> >
> >> > == Abstract ==
> >> >
> >> > Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data to scalable data storage systems such as Apache Hadoop's HDFS.
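The abstract above describes a source-to-sink event flow. A minimal sketch of that model is below; note that every name in it (`Event`, `EventSource`, `EventSink`, and so on) is a purely hypothetical illustration, not Flume's actual API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of a source -> sink event pipeline of the kind the
// abstract describes. None of these names are taken from the Flume codebase.
public class PipelineSketch {

    /** A log event; a real system would also carry a timestamp, host, etc. */
    static final class Event {
        final String body;
        Event(String body) { this.body = body; }
    }

    /** Produces events, e.g. by tailing a log file. */
    interface EventSource {
        Event next(); // returns null when drained
    }

    /** Consumes events, e.g. by appending them to a file in HDFS. */
    interface EventSink {
        void append(Event e);
    }

    /** A source backed by an in-memory queue, standing in for a log tail. */
    static final class QueueSource implements EventSource {
        private final Queue<Event> q = new ArrayDeque<>();
        QueueSource(List<String> lines) {
            for (String line : lines) q.add(new Event(line));
        }
        public Event next() { return q.poll(); }
    }

    /** A sink that just collects event bodies, standing in for a durable store. */
    static final class CollectingSink implements EventSink {
        final List<String> stored = new ArrayList<>();
        public void append(Event e) { stored.add(e.body); }
    }

    /** The driver loop: pump every event from the source into the sink. */
    static void drive(EventSource src, EventSink sink) {
        for (Event e = src.next(); e != null; e = src.next()) {
            sink.append(e);
        }
    }

    public static void main(String[] args) {
        QueueSource src = new QueueSource(List.of("line1", "line2", "line3"));
        CollectingSink sink = new CollectingSink();
        drive(src, sink);
        System.out.println(sink.stored); // [line1, line2, line3]
    }
}
```

The reliability and failover mechanisms mentioned in the proposal would sit between the source and sink in a real deployment; this sketch shows only the basic data flow.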
> >> >
> >> > == Proposal ==
> >> >
> >> > Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Its main goal is to deliver data from applications to Hadoop's HDFS. It has a simple and flexible architecture for transporting streaming event data via Flume nodes to the data store. It is robust and fault-tolerant, with tunable reliability mechanisms that rely upon many failover and recovery mechanisms. The system is centrally configured and allows for intelligent dynamic management. It uses a simple, extensible data model that allows for lightweight online analytic applications. It provides a pluggable mechanism by which new sources, destinations, and analytic functions can be integrated into a Flume pipeline.
> >> >
> >> > == Background ==
> >> >
> >> > Flume was initially developed by Cloudera to enable reliable and simplified collection of log information from many distributed sources. It was later open-sourced by Cloudera on GitHub as an Apache 2.0-licensed project in June 2010. Since then, Flume has been formally released five times, as versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1 (Oct 2010), 0.9.2 (Nov 2010), and 0.9.3 (Feb 2011). These releases are also distributed by Cloudera, as source and binaries along with enhancements, as part of Cloudera's Distribution including Apache Hadoop (CDH).
> >> >
> >> > == Rationale ==
> >> >
> >> > Collecting log information in a data center in a timely, reliable, and efficient manner is a difficult but important challenge, because when aggregated and analyzed, log information can yield valuable business insights.
> >> > We believe that users and operators need a manageable, systematic approach to log collection that simplifies the creation, monitoring, and administration of reliable log data pipelines. Today, this collection is often attempted by periodically shipping data in batches and by using potentially unreliable and inefficient ad-hoc methods.
> >> >
> >> > Log data is typically generated by various systems running within a data center that can range from a few machines to hundreds of machines. In aggregate, the data acts like a large-volume continuous stream whose format and content can vary widely. The volume and variety of raw log data make Apache Hadoop's HDFS file system an ideal storage location before the eventual analysis. Unfortunately, HDFS has limitations with regard to durability, as well as scaling limitations when handling a large number of low-bandwidth connections or small files. Similar technical challenges arise when attempting to write data to other data storage services.
> >> >
> >> > Flume addresses these challenges by providing a reliable, scalable, manageable, and extensible solution. It uses a streaming design for capturing and aggregating log information from varied sources in a distributed environment, and it has centralized management features for minimal configuration and management overhead.
> >> >
> >> > == Initial Goals ==
> >> >
> >> > Flume is currently in its first major release, with a considerable number of enhancement requests, tasks, and issues recorded towards its future development.
> >> > The initial goal of this project will be to continue to build community in the spirit of the "Apache Way", and to address the most highly requested features and bug fixes for the next dot release.
> >> >
> >> > Some goals include:
> >> > * Standing up a sustainable Apache-based community around the Flume codebase.
> >> > * Implementing core functionality of a usable, highly available Flume master.
> >> > * Performance, usability, and robustness improvements.
> >> > * Improving the ability to monitor and diagnose problems as data is transported.
> >> > * Providing a centralized place for contributed connectors and related projects.
> >> >
> >> > = Current Status =
> >> >
> >> > == Meritocracy ==
> >> >
> >> > Flume was initially developed by Jonathan Hsieh in July 2009, along with the development team at Cloudera. Developers external to Cloudera provided feedback, suggested features and fixes, and implemented extensions of Flume. The Cloudera engineering team has since maintained the project, with Jonathan Hsieh, Henry Robinson, and Patrick Hunt dedicated to its improvement. Contributors to Flume and its connectors include developers from different companies and different parts of the world.
> >> >
> >> > == Community ==
> >> >
> >> > Flume is currently used by a number of organizations all over the world. Flume has an active and growing user and developer community, with active participation on the [user|https://groups.google.com/a/cloudera.org/group/flume-user/topics] and [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics] mailing lists. The users and developers also communicate via IRC on #flume at irc.freenode.net.
> >> >
> >> > Since open-sourcing the project, over 15 different people from diverse organizations have contributed code. During this period, the project team has hosted open, in-person, quarterly meetups to discuss new features, new designs, and new use-case stories.
> >> >
> >> > == Core Developers ==
> >> >
> >> > The core developers for the Flume project are:
> >> > * Andrew Bayer: Andrew has a lot of expertise with build tools, specifically Jenkins continuous integration and Maven.
> >> > * Jonathan Hsieh: Jonathan designed and implemented much of the original code.
> >> > * Patrick Hunt: Patrick has improved the web interfaces of Flume components and contributed several build quality improvements.
> >> > * Bruce Mitchener: Bruce has improved the internal logging infrastructure as well as edited significant portions of the Flume manual.
> >> > * Henry Robinson: Henry has implemented much of the ZooKeeper integration and plugin mechanisms, as well as several Flume features and bug fixes.
> >> > * Eric Sammer: Eric has implemented the Maven build, as well as several Flume features and bug fixes.
> >> >
> >> > All core developers of the Flume project have contributed to Hadoop or related Apache projects and are very familiar with Apache principles and the philosophy of community-driven software development.
> >> >
> >> > == Alignment ==
> >> >
> >> > Flume complements Hadoop MapReduce, Pig, Hive, and HBase by providing a robust mechanism for integrating log data from external systems for effective analysis. Its design enables efficient integration of newly ingested data into Hive's data warehouse.
> >> >
> >> > Flume's architecture is open and easily extensible. This has encouraged many users to contribute integration plugins for other projects.
> >> > For example, several users have contributed connectors to message queuing and bus services, to several open-source data stores, to incremental search indexes, and to stream analysis engines.
> >> >
> >> > = Known Risks =
> >> >
> >> > == Orphaned Products ==
> >> >
> >> > Flume is already deployed in production at multiple companies, which are actively participating in feature requests and user-led discussions. Flume is gaining traction with developers, so the risk of it being orphaned is minimal.
> >> >
> >> > == Inexperience with Open Source ==
> >> >
> >> > All code developed for Flume has been open-sourced by Cloudera under the Apache 2.0 license. All committers of the Flume project are intimately familiar with the Apache model for open-source development and are experienced in working with new contributors.
> >> >
> >> > == Homogeneous Developers ==
> >> >
> >> > The initial set of committers comes from a small set of organizations. However, we expect that once approved for incubation, the project will attract new contributors from diverse organizations and will thus grow organically. The participation of developers from several different organizations on the mailing list is a strong indication for this assertion.
> >> >
> >> > == Reliance on Salaried Developers ==
> >> >
> >> > It is expected that Flume will be developed on both salaried and volunteer time, although all of the initial developers will work on it mainly on salaried time.
> >> >
> >> > == Relationships with Other Apache Products ==
> >> >
> >> > Flume depends upon other Apache projects: Apache Hadoop, Apache Log4j, Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple Apache Commons components. Its build depends upon Apache Ant and Apache Maven.
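The pluggable connector mechanism described under Alignment, where contributed sinks plug into a centrally configured pipeline, might be sketched as a simple name-to-factory registry. Again, all names here (`EventSink`, `ConnectorRegistrySketch`, the "console" connector) are hypothetical illustrations rather than Flume's real plugin API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a pluggable connector (sink) registry: contributed
// connectors register a factory under a name, and the pipeline instantiates
// sinks from configuration alone. Names are illustrative, not Flume's API.
public class ConnectorRegistrySketch {

    /** Minimal sink contract a contributed connector would implement. */
    interface EventSink {
        void append(String eventBody);
    }

    /** Maps a connector name (as it would appear in a config) to a factory. */
    static final Map<String, Supplier<EventSink>> REGISTRY = new HashMap<>();

    static void register(String name, Supplier<EventSink> factory) {
        REGISTRY.put(name, factory);
    }

    /** Look up and instantiate a sink by its configured name. */
    static EventSink create(String name) {
        Supplier<EventSink> factory = REGISTRY.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("unknown sink: " + name);
        }
        return factory.get();
    }

    /** A stand-in connector that just echoes events to stdout. */
    static final class ConsoleSink implements EventSink {
        public void append(String eventBody) {
            System.out.println("event: " + eventBody);
        }
    }

    public static void main(String[] args) {
        // A contributed connector registers itself under a config name...
        register("console", ConsoleSink::new);
        // ...and the pipeline creates it by name, never referencing the class.
        EventSink sink = create("console");
        sink.append("hello");
    }
}
```

A real implementation would also pass per-sink configuration arguments to the factory; the registry shape is the point here.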
> >> >
> >> > Flume users have created connectors that interact with several other Apache projects, including Apache HBase and Apache Cassandra.
> >> >
> >> > Flume's functionality has some direct or indirect overlap with that of Apache Chukwa, but there are several significant architectural differences. Both systems can be used to collect log data and write it to HDFS. However, Chukwa's primary goals are the analytic and monitoring aspects of a Hadoop cluster. Instead of focusing on analytics, Flume focuses primarily upon data transport and integration with a wide set of data sources and data destinations. Architecturally, Chukwa's components are individually and statically configured, and Chukwa depends upon Hadoop MapReduce for its core functionality. In contrast, Flume's components are dynamically and centrally configured, and Flume does not depend directly upon Hadoop MapReduce. Furthermore, Flume provides a more general model for handling data and enables integration with projects such as Apache Hive, data stores such as Apache HBase, Apache Cassandra, and Voldemort, and several Apache Lucene-related projects.
> >> >
> >> > == An Excessive Fascination with the Apache Brand ==
> >> >
> >> > We would like Flume to become an Apache project to further foster a healthy community of contributors and consumers around the project. Since Flume directly interacts with many Apache Hadoop-related projects and solves an important problem for many Hadoop users, residing in the Apache Software Foundation will increase interaction with the larger community.
> >> >
> >> > = Documentation =
> >> >
> >> > * All Flume documentation (User Guide, Developer Guide, Cookbook, and Windows Guide) is maintained within the Flume sources and can be built directly.
> >> > * Cloudera provides documentation specific to its distribution of Flume at: http://archive.cloudera.com/cdh/3/flume/
> >> > * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
> >> > * Flume JIRA at Cloudera: https://issues.cloudera.org/browse/flume
> >> >
> >> > = Initial Source =
> >> >
> >> > * https://github.com/cloudera/flume/tree/
> >> >
> >> > == Source and Intellectual Property Submission Plan ==
> >> >
> >> > * The initial source is already licensed under the Apache License, Version 2.0: https://github.com/cloudera/flume/blob/master/LICENSE
> >> >
> >> > == External Dependencies ==
> >> >
> >> > The required external dependencies all carry the Apache License or compatible licenses. The following components with non-Apache licenses are enumerated:
> >> >
> >> > * org.arabidopsis.ahocorasick: BSD-style
> >> >
> >> > Non-Apache build tools used by Flume are as follows:
> >> >
> >> > * AsciiDoc: GNU GPLv2
> >> > * FindBugs: GNU LGPL
> >> > * Cobertura: GNU GPLv2
> >> > * PMD: BSD-style
> >> >
> >> > == Cryptography ==
> >> >
> >> > Flume uses standard APIs and tools for SSH and SSL communication where necessary.
> >> >
> >> > = Required Resources =
> >> >
> >> > == Mailing lists ==
> >> >
> >> > * flume-private (with moderated subscriptions)
> >> > * flume-dev
> >> > * flume-commits
> >> > * flume-user
> >> >
> >> > == Subversion Directory ==
> >> >
> >> > https://svn.apache.org/repos/asf/incubator/flume
> >> >
> >> > == Issue Tracking ==
> >> >
> >> > JIRA Flume (FLUME)
> >> >
> >> > == Other Resources ==
> >> >
> >> > The existing code already has unit and integration tests, so we would like a Hudson instance to run them whenever a new patch is submitted. This can be added after project creation.
> >> >
> >> > = Initial Committers =
> >> >
> >> > * Andrew Bayer (abayer at cloudera dot com)
> >> > * Jonathan Hsieh (jon at cloudera dot com)
> >> > * Aaron Kimball (akimball83 at gmail dot com)
> >> > * Bruce Mitchener (bruce.mitchener at gmail dot com)
> >> > * Arvind Prabhakar (arvind at cloudera dot com)
> >> > * Ahmed Radwan (ahmed at cloudera dot com)
> >> > * Henry Robinson (henry at cloudera dot com)
> >> > * Eric Sammer (esammer at cloudera dot com)
> >> >
> >> > = Affiliations =
> >> >
> >> > * Andrew Bayer, Cloudera
> >> > * Jonathan Hsieh, Cloudera
> >> > * Aaron Kimball, Odiago
> >> > * Bruce Mitchener, Independent
> >> > * Arvind Prabhakar, Cloudera
> >> > * Ahmed Radwan, Cloudera
> >> > * Henry Robinson, Cloudera
> >> > * Eric Sammer, Cloudera
> >> >
> >> > = Sponsors =
> >> >
> >> > == Champion ==
> >> >
> >> > * Nigel Daley
> >> >
> >> > == Nominated Mentors ==
> >> >
> >> > * Tom White
> >> > * Nigel Daley
> >> >
> >> > == Sponsoring Entity ==
> >> >
> >> > * Apache Incubator PMC
> >> >
> >> > --
> >> > // Jonathan Hsieh (shay)
> >> > // Software Engineer, Cloudera
> >> > // j...@cloudera.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >> For additional commands, e-mail: general-h...@incubator.apache.org
>
> --
> Thanks
> - Mohammad Nour
> Author of (WebSphere Application Server Community Edition 2.0 User Guide)
> http://www.redbooks.ibm.com/abstracts/sg247585.html
> - LinkedIn: http://www.linkedin.com/in/mnour
> - Blog: http://tadabborat.blogspot.com
> ----
> "Life is like riding a bicycle.
To keep your balance you must keep moving"
> - Albert Einstein
>
> "Writing clean code is what you must do in order to call yourself a professional. There is no reasonable excuse for doing anything less than your best."
> - Clean Code: A Handbook of Agile Software Craftsmanship
>
> "Stay hungry, stay foolish."
> - Steve Jobs

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// j...@cloudera.com