Re: [PROPOSAL] Flume for the Apache Incubator

Patrick Hunt Mon, 06 Jun 2011 16:14:21 -0700

Jon, I'm available as a mentor if you're still looking.

Patrick


On Mon, Jun 6, 2011 at 9:59 AM, Jonathan Hsieh <j...@cloudera.com> wrote:
> It looks like we've gotten many positive responses, and thus far have had no
> issues brought up.  We have 3 mentors signed up now, but if anyone else is
> willing, we'd be interested in adding at most one or two more mentors.
>
> Discussion seems to have tapered off, so unless I hear otherwise, I plan on
> opening the vote late on Tuesday, 6/7!
>
> Thanks,
> Jon.
>
> On Tue, May 31, 2011 at 12:28 PM, Upayavira <u...@odoko.co.uk> wrote:
>
>>
>>
>> On Tue, 31 May 2011 11:53 -0700, "Jonathan Hsieh" <j...@cloudera.com>
>> wrote:
>> > Hi,
>> >
>> > I have a few questions.
>> >
>> > My understanding is that a podling requires 3 +1s for progress (releases,
>> > new committers).  Does this mean we need at least 3 mentors?  Would it be
>> > helpful to have "extra"?
>>
>> Strictly you do not need three mentors, but having three means you (in
>> theory) have three people with binding votes watching your progress,
>> which makes the necessary votes much easier, so having 'extra' can help.
>> Too many mentors though can lead to them all thinking that someone else
>> is doing it. Three does seem an optimal number.
>>
>> > Is having the Champion being a Mentor ok?
>>
>> Yes, it is fine.
>>
>> > Are there any concerns/discussion with the proposal?  (or are +1's
>> > basically saying lgtm.)
>>
>> I think the +1s are saying that (although I haven't read the proposal).
>>
>> Upayavira
>>
>>
>> > On Tue, May 31, 2011 at 4:13 AM, Mohammad Nour El-Din <
>> > nour.moham...@gmail.com> wrote:
>> >
>> > > +1 (binding)
>> > >
>> > > On Tue, May 31, 2011 at 11:59 AM, Mark Struberg <strub...@yahoo.de>
>> wrote:
>> > > > +1
>> > > >
>> > > > LieGrue,
>> > > > strub
>> > > >
>> > > > --- On Mon, 5/30/11, Yoav Shapira <yo...@apache.org> wrote:
>> > > >
>> > > >> From: Yoav Shapira <yo...@apache.org>
>> > > >> Subject: Re: [PROPOSAL] Flume for the Apache Incubator
>> > > >> To: general@incubator.apache.org
>> > > >> Date: Monday, May 30, 2011, 11:18 PM
>> > > >> On Fri, May 27, 2011 at 10:18 AM,
>> > > >> Jonathan Hsieh <j...@cloudera.com>
>> > > >> wrote:
>> > > >> > I would like to propose Flume to be an Apache
>> > > >> Incubator project.  Flume is a
>> > > >> > distributed, reliable, and available system for
>> > > >> efficiently collecting,
>> > > >> > aggregating, and moving large amounts of log data to
>> > > >> scalable data storage
>> > > >> > systems such as Apache Hadoop's HDFS.
>> > > >> >
>> > > >> > Here's a link to the proposal in the Incubator wiki
>> > > >> > http://wiki.apache.org/incubator/FlumeProposal
>> > > >>
>> > > >> +1, cool stuff.
>> > > >>
>> > > >> Yoav
>> > > >>
>> > > >> >
>> > > >> > I've also pasted the initial contents below.
>> > > >> >
>> > > >> > Thanks!
>> > > >> > Jon.
>> > > >> >
>> > > >> > = Flume - A Distributed Log Collection System =
>> > > >> >
>> > > >> > == Abstract ==
>> > > >> >
>> > > >> > Flume is a distributed, reliable, and available system
>> > > >> for efficiently
>> > > >> > collecting, aggregating, and moving large amounts of
>> > > >> log data to scalable
>> > > >> > data storage systems such as Apache Hadoop's HDFS.
>> > > >> >
>> > > >> > == Proposal ==
>> > > >> >
>> > > >> > Flume is a distributed, reliable, and available system
>> > > >> for efficiently
>> > > >> > collecting, aggregating, and moving large amounts of
>> > > >> log data from many
>> > > >> > different sources to a centralized data store. Its
>> > > >> main goal is to deliver
>> > > >> > data from applications to Hadoop’s HDFS.  It has a
>> > > >> simple and flexible
>> > > >> > architecture for transporting streaming event data via
>> > > >> flume nodes to the
>> > > >> > data store.  It is robust and fault-tolerant with
>> > > >> tunable reliability
>> > > >> > mechanisms that rely upon many failover and recovery
>> > > >> mechanisms. The system
>> > > >> > is centrally configured and allows for intelligent
>> > > >> dynamic management. It
>> > > >> > uses a simple extensible data model that allows for
>> > > >> lightweight online
>> > > >> > analytic applications.  It provides a pluggable
>> > > >> mechanism by which new
>> > > >> > sources, destinations, and analytic functions
>> > > >> can be integrated within
>> > > >> > a Flume pipeline.
>> > > >> >
>> > > >> > == Background ==
>> > > >> >
>> > > >> > Flume was initially developed by Cloudera to enable
>> > > >> reliable and simplified
>> > > >> > collection of log information from many distributed
>> > > >> sources. It was later
>> > > >> > open-sourced by Cloudera on GitHub as an Apache 2.0
>> > > >> licensed project in June
>> > > >> > 2010. During this time Flume has been formally
>> > > >> released five times as
>> > > >> > versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1
>> > > >> (Oct 2010), 0.9.2 (Nov
>> > > >> > 2010), and 0.9.3 (Feb 2011).  These releases are also
>> > > >> distributed by
>> > > >> > Cloudera as source and binaries along with
>> > > >> enhancements as part of Cloudera
>> > > >> > Distribution including Apache Hadoop (CDH).
>> > > >> >
>> > > >> > == Rationale ==
>> > > >> >
>> > > >> > Collecting log information in a data center in a
>> > > >> timely, reliable, and
>> > > >> > efficient manner is a difficult challenge but
>> > > >> important because when
>> > > >> > aggregated and analyzed, log information can yield
>> > > >> valuable business
>> > > >> > insights.   We believe that users and operators need
>> > > >> a manageable systematic
>> > > >> > approach for log collection that simplifies the
>> > > >> creation, the monitoring,
>> > > >> > and the administration of reliable log data pipelines.
>> > > >>  Oftentimes today,
>> > > >> > this collection is attempted by periodically shipping
>> > > >> data in batches and by
>> > > >> > using potentially unreliable and inefficient ad-hoc
>> > > >> methods.
>> > > >> >
>> > > >> > Log data is typically generated in various systems
>> > > >> running within a data
>> > > >> > center that can range from a few machines to hundreds
>> > > >> of machines.  In
>> > > >> > aggregate, the data acts like a large-volume
>> > > >> continuous stream with contents
>> > > >> > that can have highly-varied format and highly-varied
>> > > >> content.  The volume
>> > > >> > and variety of raw log data makes Apache Hadoop's HDFS
>> > > >> file system an ideal
>> > > >> > storage location before the eventual analysis.
>> > > >>  Unfortunately, HDFS has
>> > > >> > limitations with regards to durability as well as
>> > > >> scaling limitations when
>> > > >> > handling a large number of low-bandwidth connections
>> > > >> or small files.
>> > > >> >  Similar technical challenges are also suffered when
>> > > >> attempting to write
>> > > >> > data to other data storage services.
>> > > >> >
>> > > >> > Flume addresses these challenges by providing a
>> > > >> reliable, scalable,
>> > > >> > manageable, and extensible solution.  It uses a
>> > > >> streaming design for
>> > > >> > capturing and aggregating log information from varied
>> > > >> sources in a
>> > > >> > distributed environment and has centralized management
>> > > >> features for minimal
>> > > >> > configuration and management overhead.
>> > > >> >
>> > > >> > == Initial Goals ==
>> > > >> >
>> > > >> > Flume is currently in its first major release with a
>> > > >> considerable number of
>> > > >> > enhancement requests, tasks, and issues recorded
>> > > >> towards its future
>> > > >> > development. The initial goal of this project will be
>> > > >> to continue to build
>> > > >> > community in the spirit of the "Apache Way", and to
>> > > >> address the highly
>> > > >> > requested features and bug-fixes towards the next dot
>> > > >> release.
>> > > >> >
>> > > >> > Some goals include:
>> > > >> > * To stand up a sustaining Apache-based community
>> > > >> around the Flume codebase.
>> > > >> > * Implementing core functionality of a usable
>> > > >> highly-available Flume master.
>> > > >> > * Performance, usability, and robustness
>> > > >> improvements.
>> > > >> > * Improving the ability to monitor and diagnose
>> > > >> problems as data is
>> > > >> > transported.
>> > > >> > * Providing a centralized place for contributed
>> > > >> connectors and related
>> > > >> > projects.
>> > > >> >
>> > > >> > = Current Status =
>> > > >> >
>> > > >> > == Meritocracy ==
>> > > >> >
>> > > >> > Flume was initially developed by Jonathan Hsieh in
>> > > >> July 2009 along with
>> > > >> > the development team at Cloudera. Developers external to
>> > > >> Cloudera provided
>> > > >> > feedback, suggested features and fixes, and implemented
>> > > >> extensions of Flume.
>> > > >> > The Cloudera engineering team has since maintained the
>> > > >> project with Jonathan
>> > > >> > Hsieh, Henry Robinson, and Patrick Hunt dedicated
>> > > >> towards its improvement.
>> > > >> > Contributors to Flume and its connectors include
>> > > >> developers from different
>> > > >> > companies and different parts of the world.
>> > > >> >
>> > > >> > == Community ==
>> > > >> >
>> > > >> > Flume is currently used by a number of organizations
>> > > >> all over the world.
>> > > >> > Flume has an active and growing user and developer
>> > > >> community with active
>> > > >> > participation in [user|
>> > > >> > https://groups.google.com/a/cloudera.org/group/flume-user/topics]
>> > > >> and
>> > > >> > [developer|
>> > > https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
>> > > >> > mailing lists.  The users and developers also
>> > > >> communicate via IRC on #flume
>> > > >> > at irc.freenode.net.
>> > > >> >
>> > > >> > Since open sourcing the project, there have been over
>> > > >> 15 different people
>> > > >> > from diverse organizations who have contributed code.
>> > > >> During this period,
>> > > >> > the project team has hosted open, in-person, quarterly
>> > > >> meetups to discuss
>> > > >> > new features, new designs, and new use-case stories.
>> > > >> >
>> > > >> > == Core Developers ==
>> > > >> >
>> > > >> > The core developers for the Flume project are:
>> > > >> >  * Andrew Bayer: Andrew has a lot of expertise with
>> > > >> build tools,
>> > > >> > specifically Jenkins continuous integration and
>> > > >> Maven.
>> > > >> >  * Jonathan Hsieh: Jonathan designed and implemented
>> > > >> much of the original
>> > > >> > code.
>> > > >> >  * Patrick Hunt: Patrick has improved the web
>> > > >> interfaces of Flume components
>> > > >> > and contributed several build quality  improvements.
>> > > >> >  * Bruce Mitchener: Bruce has improved the internal
>> > > >> logging infrastructure
>> > > >> > as well as edited significant portions of the Flume
>> > > >> manual.
>> > > >> >  * Henry Robinson: Henry has implemented much of the
>> > > >> ZooKeeper integration,
>> > > >> > plugin mechanisms, as well as several Flume features
>> > > >> and bug fixes.
>> > > >> >  * Eric Sammer: Eric has implemented the Maven build,
>> > > >> as well as several
>> > > >> > Flume features and bug fixes.
>> > > >> >
>> > > >> > All core developers of the Flume project have
>> > > >> contributed towards Hadoop or
>> > > >> > related Apache projects and are very familiar with
>> > > >> Apache principles and
>> > > >> > philosophy for community driven software development.
>> > > >> >
>> > > >> > == Alignment ==
>> > > >> >
>> > > >> > Flume complements Hadoop Map-Reduce, Pig, Hive, HBase
>> > > >> by providing a robust
>> > > >> > mechanism to allow log data integration from external
>> > > >> systems for effective
>> > > >> > analysis.  Its design enables efficient integration of
>> > > >> newly ingested data to
>> > > >> > Hive's data warehouse.
>> > > >> >
>> > > >> > Flume's architecture is open and easily extensible.
>> > > >>  This has encouraged
>> > > >> > many users to contribute plugins that integrate with other
>> > > >> projects.  For example,
>> > > >> > several users have contributed connectors to message
>> > > >> queuing and bus
>> > > >> > services, to several open source data stores, to
>> > > >> incremental search indexes,
>> > > >> > and to stream analysis engines.
>> > > >> >
>> > > >> > = Known Risks =
>> > > >> >
>> > > >> > == Orphaned Products ==
>> > > >> >
>> > > >> > Flume is already deployed in production at multiple
>> > > >> companies and they are
>> > > >> > actively participating in feature requests and user
>> > > >> led discussions. Flume
>> > > >> > is getting traction with developers and thus the risks
>> > > >> of it being orphaned
>> > > >> > are minimal.
>> > > >> >
>> > > >> > == Inexperience with Open Source ==
>> > > >> >
>> > > >> > All code developed for Flume is open sourced by
>> > > >> Cloudera under the Apache
>> > > >> > 2.0 license.  All committers of the Flume project are
>> > > >> intimately familiar with
>> > > >> > the Apache model for open-source development and are
>> > > >> experienced with
>> > > >> > working with new contributors.
>> > > >> >
>> > > >> > == Homogeneous Developers ==
>> > > >> >
>> > > >> > The initial set of committers is from a reduced set of
>> > > >> organizations.
>> > > >> > However, we expect that once approved for incubation,
>> > > >> the project will
>> > > >> > attract new contributors from diverse organizations
>> > > >> and will thus grow
>> > > >> > organically. The participation of developers from
>> > > >> several different
>> > > >> > organizations in the mailing list is a strong
>> > > >> indication for this assertion.
>> > > >> >
>> > > >> > == Reliance on Salaried Developers ==
>> > > >> >
>> > > >> > It is expected that Flume will be developed on
>> > > >> salaried and volunteer time,
>> > > >> > although all of the initial developers will work on it
>> > > >> mainly on salaried
>> > > >> > time.
>> > > >> >
>> > > >> > == Relationships with Other Apache Products ==
>> > > >> >
>> > > >> > Flume depends upon other Apache Projects: Apache
>> > > >> Hadoop, Apache Log4J,
>> > > >> > Apache ZooKeeper, Apache Thrift, Apache Avro, multiple
>> > > >> Apache Commons
>> > > >> > components. Its build depends upon Apache Ant and
>> > > >> Apache Maven.
>> > > >> >
>> > > >> > Flume users have created connectors that interact with
>> > > >> several other Apache
>> > > >> > projects including Apache HBase and Apache Cassandra.
>> > > >> >
>> > > >> > Flume's functionality has some indirect or direct
>> > > >> overlap with the
>> > > >> > functionality of Apache Chukwa but has several
>> > > >> significant architectural
>> > > >> > differences.  Both systems can be used to collect
>> > > >> log data to write to
>> > > >> > HDFS.  However, Chukwa's primary goals are the
>> > > >> analytic and monitoring
>> > > >> > aspects of a Hadoop cluster.  Instead of focusing on
>> > > >> analytics, Flume
>> > > >> > focuses primarily upon data transport and integration
>> > > >> with a wide set of
>> > > >> > data sources and data destinations.
>> > > >> Architecturally, Chukwa components are
>> > > >> > individually and statically configured.  It also
>> > > >> depends upon Hadoop
>> > > >> > MapReduce for its core functionality.  In contrast,
>> > > >> Flume's components are
>> > > >> > dynamically and centrally configured and do not
>> > > >> depend directly upon
>> > > >> > Hadoop MapReduce.  Furthermore, Flume provides a more
>> > > >> general model for
>> > > >> > handling data and enables integration with projects
>> > > >> such as Apache Hive,
>> > > >> > data stores such as Apache HBase, Apache Cassandra and
>> > > >> Voldemort, and
>> > > >> > several Apache Lucene-related projects.
>> > > >> >
>> > > >> > == An Excessive Fascination with the Apache Brand ==
>> > > >> >
>> > > >> > We would like Flume to become an Apache project to
>> > > >> further foster a healthy
>> > > >> > community of contributors and consumers around the
>> > > >> project.  Since Flume
>> > > >> > directly interacts with many Apache Hadoop-related
>> > > >> projects and solves an
>> > > >> > important problem for many Hadoop users, residing in
>> > > >> the Apache Software
>> > > >> > Foundation will increase interaction with the larger
>> > > >> community.
>> > > >> >
>> > > >> > = Documentation =
>> > > >> >
>> > > >> >  * All Flume documentation (User Guide, Developer
>> > > >> Guide, Cookbook, and
>> > > >> > Windows Guide) is maintained within Flume sources and
>> > > >> can be built directly.
>> > > >> >  * Cloudera provides documentation specific to its
>> > > >> distribution of Flume at:
>> > > >> > http://archive.cloudera.com/cdh/3/flume/
>> > > >> >  * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
>> > > >> >  * Flume jira at Cloudera:
>> https://issues.cloudera.org/browse/flume
>> > > >> >
>> > > >> > = Initial Source =
>> > > >> >
>> > > >> >  * https://github.com/cloudera/flume/tree/
>> > > >> >
>> > > >> > == Source and Intellectual Property Submission Plan
>> > > >> ==
>> > > >> >
>> > > >> >  * The initial source is already licensed under the
>> > > >> Apache License, Version
>> > > >> > 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
>> > > >> >
>> > > >> > == External Dependencies ==
>> > > >> >
>> > > >> > The required external dependencies are all Apache
>> > > >> License or compatible
>> > > >> > licenses. The components with non-Apache
>> > > >> licenses are enumerated below:
>> > > >> >
>> > > >> >  * org.arabidopsis.ahocorasick : BSD-style
>> > > >> >
>> > > >> > Non-Apache build tools that are used by Flume are as
>> > > >> follows:
>> > > >> >
>> > > >> >  * AsciiDoc: GNU GPLv2
>> > > >> >  * FindBugs: GNU LGPL
>> > > >> >  * Cobertura: GNU GPLv2
>> > > >> >  * PMD : BSD-style
>> > > >> >
>> > > >> > == Cryptography ==
>> > > >> >
>> > > >> > Flume uses standard APIs and tools for SSH and SSL
>> > > >> communication where
>> > > >> > necessary.
>> > > >> >
>> > > >> > = Required  Resources =
>> > > >> >
>> > > >> > == Mailing lists ==
>> > > >> >
>> > > >> >  * flume-private (with moderated subscriptions)
>> > > >> >  * flume-dev
>> > > >> >  * flume-commits
>> > > >> >  * flume-user
>> > > >> >
>> > > >> > == Subversion Directory ==
>> > > >> >
>> > > >> > https://svn.apache.org/repos/asf/incubator/flume
>> > > >> >
>> > > >> > == Issue Tracking ==
>> > > >> >
>> > > >> > JIRA Flume (FLUME)
>> > > >> >
>> > > >> > == Other Resources ==
>> > > >> >
>> > > >> > The existing code already has unit and integration
>> > > >> tests so we would like a
>> > > >> > Hudson instance to run them whenever a new patch is
>> > > >> submitted. This can be
>> > > >> > added after project creation.
>> > > >> >
>> > > >> > = Initial Committers =
>> > > >> >
>> > > >> >  * Andrew Bayer (abayer at cloudera dot com)
>> > > >> >  * Jonathan Hsieh (jon at cloudera dot com)
>> > > >> >  * Aaron Kimball (akimball83 at gmail dot com)
>> > > >> >  * Bruce Mitchener (bruce.mitchener at gmail dot
>> > > >> com)
>> > > >> >  * Arvind Prabhakar (arvind at cloudera dot com)
>> > > >> >  * Ahmed Radwan (ahmed at cloudera dot com)
>> > > >> >  * Henry Robinson (henry at cloudera dot com)
>> > > >> >  * Eric Sammer (esammer at cloudera dot com)
>> > > >> >
>> > > >> > = Affiliations =
>> > > >> >
>> > > >> >  * Andrew Bayer, Cloudera
>> > > >> >  * Jonathan Hsieh, Cloudera
>> > > >> >  * Aaron Kimball, Odiago
>> > > >> >  * Bruce Mitchener, Independent
>> > > >> >  * Arvind Prabhakar, Cloudera
>> > > >> >  * Ahmed Radwan, Cloudera
>> > > >> >  * Henry Robinson, Cloudera
>> > > >> >  * Eric Sammer, Cloudera
>> > > >> >
>> > > >> >
>> > > >> > = Sponsors =
>> > > >> >
>> > > >> > == Champion ==
>> > > >> >
>> > > >> >  * Nigel Daley
>> > > >> >
>> > > >> > == Nominated Mentors ==
>> > > >> >
>> > > >> >  * Tom White
>> > > >> >  * Nigel Daley
>> > > >> >
>> > > >> > == Sponsoring Entity ==
>> > > >> >
>> > > >> >  * Apache Incubator PMC
>> > > >> >
>> > > >> >
>> > > >> > --
>> > > >> > // Jonathan Hsieh (shay)
>> > > >> > // Software Engineer, Cloudera
>> > > >> > // j...@cloudera.com
>> > > >> >
>> > > >>
>> > > >>
>> ---------------------------------------------------------------------
>> > > >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> > > >> For additional commands, e-mail: general-h...@incubator.apache.org
>> > > >>
>> > > >>
>> > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks
>> > > - Mohammad Nour
>> > >   Author of (WebSphere Application Server Community Edition 2.0 User
>> Guide)
>> > >   http://www.redbooks.ibm.com/abstracts/sg247585.html
>> > > - LinkedIn: http://www.linkedin.com/in/mnour
>> > > - Blog: http://tadabborat.blogspot.com
>> > > ----
>> > > "Life is like riding a bicycle. To keep your balance you must keep
>> moving"
>> > > - Albert Einstein
>> > >
>> > > "Writing clean code is what you must do in order to call yourself a
>> > > professional. There is no reasonable excuse for doing anything less
>> > > than your best."
>> > > - Clean Code: A Handbook of Agile Software Craftsmanship
>> > >
>> > > "Stay hungry, stay foolish."
>> > > - Steve Jobs
>> > >
>> > >
>> > >
>> >
>> >
>> > --
>> > // Jonathan Hsieh (shay)
>> > // Software Engineer, Cloudera
>> > // j...@cloudera.com
>> >
>>
>>
>>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // j...@cloudera.com
>

