Re: [PROPOSAL] Flume for the Apache Incubator

Mark Struberg Tue, 31 May 2011 03:00:08 -0700

+1 

LieGrue,
strub


--- On Mon, 5/30/11, Yoav Shapira <yo...@apache.org> wrote:

> From: Yoav Shapira <yo...@apache.org>
> Subject: Re: [PROPOSAL] Flume for the Apache Incubator
> To: general@incubator.apache.org
> Date: Monday, May 30, 2011, 11:18 PM
> On Fri, May 27, 2011 at 10:18 AM,
> Jonathan Hsieh <j...@cloudera.com>
> wrote:
> > I would like to propose Flume to be an Apache
> Incubator project.  Flume is a
> > distributed, reliable, and available system for
> efficiently collecting,
> > aggregating, and moving large amounts of log data to
> scalable data storage
> > systems such as Apache Hadoop's HDFS.
> >
> > Here's a link to the proposal in the Incubator wiki
> > http://wiki.apache.org/incubator/FlumeProposal
> 
> +1, cool stuff.
> 
> Yoav
> 
> >
> > I've also pasted the initial contents below.
> >
> > Thanks!
> > Jon.
> >
> > = Flume - A Distributed Log Collection System =
> >
> > == Abstract ==
> >
> > Flume is a distributed, reliable, and available system
> for efficiently
> > collecting, aggregating, and moving large amounts of
> log data to scalable
> > data storage systems such as Apache Hadoop's HDFS.
> >
> > == Proposal ==
> >
> > Flume is a distributed, reliable, and available system
> for efficiently
> > collecting, aggregating, and moving large amounts of
> log data from many
> > different sources to a centralized data store. Its
> main goal is to deliver
> > data from applications to Hadoop’s HDFS.  It has a
> simple and flexible
> > architecture for transporting streaming event data via
> flume nodes to the
> > data store.  It is robust and fault-tolerant with
> tunable reliability
> > mechanisms that rely upon many failover and recovery
> mechanisms. The system
> > is centrally configured and allows for intelligent
> dynamic management. It
> > uses a simple extensible data model that allows for
> lightweight online
> > analytic applications.  It provides a pluggable
> mechanism by which new
> > sources, destinations, and analytic functions
> can be integrated within
> > a Flume pipeline.
> >
> > == Background ==
> >
> > Flume was initially developed by Cloudera to enable
> reliable and simplified
> > collection of log information from many distributed
> sources. It was later
> > open-sourced by Cloudera on GitHub as an Apache 2.0
> licensed project in June
> > 2010. During this time Flume has been formally
> released five times as
> > versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1
> (Oct 2010), 0.9.2 (Nov
> > 2010), and 0.9.3 (Feb 2011).  These releases are also
> distributed by
> > Cloudera as source and binaries along with
> enhancements as part of Cloudera
> > Distribution including Apache Hadoop (CDH).
> >
> > == Rationale ==
> >
> > Collecting log information in a data center in a
> timely, reliable, and
> > efficient manner is a difficult challenge but
> important because when
> > aggregated and analyzed, log information can yield
> valuable business
> > insights.   We believe that users and operators need
> a manageable systematic
> > approach for log collection that simplifies the
> creation, the monitoring,
> > and the administration of reliable log data pipelines.
>  Oftentimes today,
> > this collection is attempted by periodically shipping
> data in batches and by
> > using potentially unreliable and inefficient ad-hoc
> methods.
> >
> > Log data is typically generated in various systems
> running within a data
> > center that can range from a few machines to hundreds
> of machines.  In
> > aggregate, the data acts like a large-volume
> continuous stream with contents
> > that can have highly-varied format and highly-varied
> content.  The volume
> > and variety of raw log data make Apache Hadoop's HDFS
> file system an ideal
> > storage location before the eventual analysis.
>  Unfortunately, HDFS has
> > limitations with regard to durability as well as
> scaling limitations when
> > handling a large number of low-bandwidth connections
> or small files.
> >  Similar technical challenges are also suffered when
> attempting to write
> > data to other data storage services.
> >
> > Flume addresses these challenges by providing a
> reliable, scalable,
> > manageable, and extensible solution.  It uses a
> streaming design for
> > capturing and aggregating log information from varied
> sources in a
> > distributed environment and has centralized management
> features for minimal
> > configuration and management overhead.
> >
> > == Initial Goals ==
> >
> > Flume is currently in its first major release with a
> considerable number of
> > enhancement requests, tasks, and issues recorded
> towards its future
> > development. The initial goal of this project will be
> to continue to build
> > community in the spirit of the "Apache Way", and to
> address the highly
> > requested features and bug-fixes towards the next dot
> release.
> >
> > Some goals include:
> > * Standing up a sustaining Apache-based community
> around the Flume codebase.
> > * Implementing core functionality of a usable
> highly-available Flume master.
> > * Performance, usability, and robustness
> improvements.
> > * Improving the ability to monitor and diagnose
> problems as data is
> > transported.
> > * Providing a centralized place for contributed
> connectors and related
> > projects.
> >
> > = Current Status =
> >
> > == Meritocracy ==
> >
> > Flume was initially developed by Jonathan Hsieh in
> July 2009 along with
> > the development team at Cloudera. Developers external to
> Cloudera provided
> > feedback, suggested features and fixes, and implemented
> extensions of Flume.
> > The Cloudera engineering team has since maintained the
> project with Jonathan
> > Hsieh, Henry Robinson, and Patrick Hunt dedicated
> towards its improvement.
> > Contributors to Flume and its connectors include
> developers from different
> > companies and different parts of the world.
> >
> > == Community ==
> >
> > Flume is currently used by a number of organizations
> all over the world.
> > Flume has an active and growing user and developer
> community with active
> > participation in [user|
> > https://groups.google.com/a/cloudera.org/group/flume-user/topics]
> and
> > [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
> > mailing lists.  The users and developers also
> communicate via IRC on #flume
> > at irc.freenode.net.
> >
> > Since open sourcing the project, there have been over
> 15 different people
> > from diverse organizations who have contributed code.
> During this period,
> > the project team has hosted open, in-person, quarterly
> meetups to discuss
> > new features, new designs, and new use-case stories.
> >
> > == Core Developers ==
> >
> > The core developers for the Flume project are:
> >  * Andrew Bayer: Andrew has a lot of expertise with
> build tools,
> > specifically Jenkins continuous integration and
> Maven.
> >  * Jonathan Hsieh: Jonathan designed and implemented
> much of the original
> > code.
> >  * Patrick Hunt: Patrick has improved the web
> interfaces of Flume components
> > and contributed several build quality improvements.
> >  * Bruce Mitchener: Bruce has improved the internal
> logging infrastructure
> > as well as edited significant portions of the Flume
> manual.
> >  * Henry Robinson: Henry has implemented much of the
> ZooKeeper integration,
> > plugin mechanisms, as well as several Flume features
> and bug fixes.
> >  * Eric Sammer: Eric has implemented the Maven build,
> as well as several
> > Flume features and bug fixes.
> >
> > All core developers of the Flume project have
> contributed towards Hadoop or
> > related Apache projects and are very familiar with
> Apache principles and
> > philosophy for community-driven software development.
> >
> > == Alignment ==
> >
> > Flume complements Hadoop Map-Reduce, Pig, Hive, and HBase
> by providing a robust
> > mechanism to allow log data integration from external
> systems for effective
> > analysis.  Its design enables efficient integration of
> newly ingested data into
> > Hive's data warehouse.
> >
> > Flume's architecture is open and easily extensible.
>  This has encouraged
> > many users to contribute plugins that integrate with other
> projects.  For example,
> > several users have contributed connectors to message
> queuing and bus
> > services, to several open source data stores, to
> incremental search indexes,
> > and to stream analysis engines.
> >
> > = Known Risks =
> >
> > == Orphaned Products ==
> >
> > Flume is already deployed in production at multiple
> companies and they are
> > actively participating in feature requests and
> user-led discussions. Flume
> > is getting traction with developers and thus the risks
> of it being orphaned
> > are minimal.
> >
> > == Inexperience with Open Source ==
> >
> > All code developed for Flume has been open sourced by
> Cloudera under the Apache
> > 2.0 license.  All committers of the Flume project are
> intimately familiar with
> > the Apache model for open-source development and are
> experienced with
> > working with new contributors.
> >
> > == Homogeneous Developers ==
> >
> > The initial set of committers is from a reduced set of
> organizations.
> > However, we expect that once approved for incubation,
> the project will
> > attract new contributors from diverse organizations
> and will thus grow
> > organically. The participation of developers from
> several different
> > organizations in the mailing list is a strong
> indication for this assertion.
> >
> > == Reliance on Salaried Developers ==
> >
> > It is expected that Flume will be developed on
> salaried and volunteer time,
> > although all of the initial developers will work on it
> mainly on salaried
> > time.
> >
> > == Relationships with Other Apache Products ==
> >
> > Flume depends upon other Apache Projects: Apache
> Hadoop, Apache Log4J,
> > Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple
> Apache Commons
> > components. Its build depends upon Apache Ant and
> Apache Maven.
> >
> > Flume users have created connectors that interact with
> several other Apache
> > projects including Apache HBase and Apache Cassandra.
> >
> > Flume's functionality has some indirect or direct
> overlap with the
> > functionality of Apache Chukwa but has several
> significant architectural
> > differences.  Both systems can be used to collect
> log data to write to
> > HDFS.  However, Chukwa's primary goals are the
> analytic and monitoring
> > aspects of a Hadoop cluster.  Instead of focusing on
> analytics, Flume
> > focuses primarily upon data transport and integration
> with a wide set of
> > data sources and data destinations.  
> Architecturally, Chukwa components are
> > individually and statically configured.  It also
> depends upon Hadoop
> > MapReduce for its core functionality.  In contrast,
> Flume's components are
> > dynamically and centrally configured and do not
> depend directly upon
> > Hadoop MapReduce.  Furthermore, Flume provides a more
> general model for
> > handling data and enables integration with projects
> such as Apache Hive,
> > data stores such as Apache HBase, Apache Cassandra and
> Voldemort, and
> > several Apache Lucene-related projects.
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > We would like Flume to become an Apache project to
> further foster a healthy
> > community of contributors and consumers around the
> project.  Since Flume
> > directly interacts with many Apache Hadoop-related
> projects and solves an
> > important problem for many Hadoop users, residing in
> the Apache Software
> > Foundation will increase interaction with the larger
> community.
> >
> > = Documentation =
> >
> >  * All Flume documentation (User Guide, Developer
> Guide, Cookbook, and
> > Windows Guide) is maintained within Flume sources and
> can be built directly.
> >  * Cloudera provides documentation specific to its
> distribution of Flume at:
> > http://archive.cloudera.com/cdh/3/flume/
> >  * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
> >  * Flume jira at Cloudera: https://issues.cloudera.org/browse/flume
> >
> > = Initial Source =
> >
> >  * https://github.com/cloudera/flume/tree/
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> >  * The initial source is already licensed under the
> Apache License, Version
> > 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
> >
> > == External Dependencies ==
> >
> > The required external dependencies are all Apache
> License or compatible
> > licenses. The components with non-Apache
> licenses are enumerated below:
> >
> >  * org.arabidopsis.ahocorasick : BSD-style
> >
> > Non-Apache build tools that are used by Flume are as
> follows:
> >
> >  * AsciiDoc: GNU GPLv2
> >  * FindBugs: GNU LGPL
> >  * Cobertura: GNU GPLv2
> >  * PMD : BSD-style
> >
> > == Cryptography ==
> >
> > Flume uses standard APIs and tools for SSH and SSL
> communication where
> > necessary.
> >
> > = Required Resources =
> >
> > == Mailing lists ==
> >
> >  * flume-private (with moderated subscriptions)
> >  * flume-dev
> >  * flume-commits
> >  * flume-user
> >
> > == Subversion Directory ==
> >
> > https://svn.apache.org/repos/asf/incubator/flume
> >
> > == Issue Tracking ==
> >
> > JIRA Flume (FLUME)
> >
> > == Other Resources ==
> >
> > The existing code already has unit and integration
> tests so we would like a
> > Hudson instance to run them whenever a new patch is
> submitted. This can be
> > added after project creation.
> >
> > = Initial Committers =
> >
> >  * Andrew Bayer (abayer at cloudera dot com)
> >  * Jonathan Hsieh (jon at cloudera dot com)
> >  * Aaron Kimball (akimball83 at gmail dot com)
> >  * Bruce Mitchener (bruce.mitchener at gmail dot com)
> >  * Arvind Prabhakar (arvind at cloudera dot com)
> >  * Ahmed Radwan (ahmed at cloudera dot com)
> >  * Henry Robinson (henry at cloudera dot com)
> >  * Eric Sammer (esammer at cloudera dot com)
> >
> > = Affiliations =
> >
> >  * Andrew Bayer, Cloudera
> >  * Jonathan Hsieh, Cloudera
> >  * Aaron Kimball, Odiago
> >  * Bruce Mitchener, Independent
> >  * Arvind Prabhakar, Cloudera
> >  * Ahmed Radwan, Cloudera
> >  * Henry Robinson, Cloudera
> >  * Eric Sammer, Cloudera
> >
> >
> > = Sponsors =
> >
> > == Champion ==
> >
> >  * Nigel Daley
> >
> > == Nominated Mentors ==
> >
> >  * Tom White
> >  * Nigel Daley
> >
> > == Sponsoring Entity ==
> >
> >  * Apache Incubator PMC
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // j...@cloudera.com
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
> 
>
