+1 (binding)

On Tue, May 31, 2011 at 11:59 AM, Mark Struberg <strub...@yahoo.de> wrote:
> +1
>
> LieGrue,
> strub
>
> --- On Mon, 5/30/11, Yoav Shapira <yo...@apache.org> wrote:
>
>> From: Yoav Shapira <yo...@apache.org>
>> Subject: Re: [PROPOSAL] Flume for the Apache Incubator
>> To: general@incubator.apache.org
>> Date: Monday, May 30, 2011, 11:18 PM
>> On Fri, May 27, 2011 at 10:18 AM, Jonathan Hsieh <j...@cloudera.com> wrote:
>> > I would like to propose Flume to be an Apache Incubator project. Flume
>> > is a distributed, reliable, and available system for efficiently
>> > collecting, aggregating, and moving large amounts of log data to
>> > scalable data storage systems such as Apache Hadoop's HDFS.
>> >
>> > Here's a link to the proposal in the Incubator wiki
>> > http://wiki.apache.org/incubator/FlumeProposal
>>
>> +1, cool stuff.
>>
>> Yoav
>>
>> >
>> > I've also pasted the initial contents below.
>> >
>> > Thanks!
>> > Jon.
>> >
>> > = Flume - A Distributed Log Collection System =
>> >
>> > == Abstract ==
>> >
>> > Flume is a distributed, reliable, and available system for efficiently
>> > collecting, aggregating, and moving large amounts of log data to
>> > scalable data storage systems such as Apache Hadoop's HDFS.
>> >
>> > == Proposal ==
>> >
>> > Flume is a distributed, reliable, and available system for efficiently
>> > collecting, aggregating, and moving large amounts of log data from many
>> > different sources to a centralized data store. Its main goal is to
>> > deliver data from applications to Hadoop’s HDFS. It has a simple and
>> > flexible architecture for transporting streaming event data via flume
>> > nodes to the data store. It is robust and fault-tolerant with tunable
>> > reliability mechanisms that rely upon many failover and recovery
>> > mechanisms. The system is centrally configured and allows for
>> > intelligent dynamic management.
>> > It uses a simple extensible data model that allows for lightweight
>> > online analytic applications. It provides a pluggable mechanism by
>> > which new sources, destinations, and analytic functions can be
>> > integrated within a Flume pipeline.
>> >
>> > == Background ==
>> >
>> > Flume was initially developed by Cloudera to enable reliable and
>> > simplified collection of log information from many distributed sources.
>> > It was later open-sourced by Cloudera on GitHub as an Apache 2.0
>> > licensed project in June 2010. Since then, Flume has been formally
>> > released five times as versions 0.9.0 (June 2010), 0.9.1 (Aug 2010),
>> > 0.9.1u1 (Oct 2010), 0.9.2 (Nov 2010), and 0.9.3 (Feb 2011). These
>> > releases are also distributed by Cloudera as source and binaries along
>> > with enhancements as part of Cloudera Distribution including Apache
>> > Hadoop (CDH).
>> >
>> > == Rationale ==
>> >
>> > Collecting log information in a data center in a timely, reliable, and
>> > efficient manner is difficult but important because, when aggregated
>> > and analyzed, log information can yield valuable business insights. We
>> > believe that users and operators need a manageable, systematic approach
>> > for log collection that simplifies the creation, the monitoring, and
>> > the administration of reliable log data pipelines. Oftentimes today,
>> > this collection is attempted by periodically shipping data in batches
>> > and by using potentially unreliable and inefficient ad-hoc methods.
>> >
>> > Log data is typically generated in various systems running within a
>> > data center that can range from a few machines to hundreds of machines.
>> > In aggregate, the data acts like a large-volume continuous stream with
>> > contents that can have highly-varied format and highly-varied content.
>> > The volume and variety of raw log data make Apache Hadoop's HDFS file
>> > system an ideal storage location before the eventual analysis.
>> > Unfortunately, HDFS has limitations with regard to durability as well
>> > as scaling limitations when handling a large number of low-bandwidth
>> > connections or small files. Similar technical challenges also arise
>> > when attempting to write data to other data storage services.
>> >
>> > Flume addresses these challenges by providing a reliable, scalable,
>> > manageable, and extensible solution. It uses a streaming design for
>> > capturing and aggregating log information from varied sources in a
>> > distributed environment and has centralized management features for
>> > minimal configuration and management overhead.
>> >
>> > == Initial Goals ==
>> >
>> > Flume is currently in its first major release with a considerable
>> > number of enhancement requests, tasks, and issues recorded towards its
>> > future development. The initial goal of this project will be to
>> > continue to build community in the spirit of the "Apache Way", and to
>> > address the highly requested features and bug-fixes towards the next
>> > dot release.
>> >
>> > Some goals include:
>> > * Standing up a sustaining Apache-based community around the Flume
>> >   codebase.
>> > * Implementing core functionality of a usable highly-available Flume
>> >   master.
>> > * Performance, usability, and robustness improvements.
>> > * Improving the ability to monitor and diagnose problems as data is
>> >   transported.
>> > * Providing a centralized place for contributed connectors and related
>> >   projects.
>> >
>> > = Current Status =
>> >
>> > == Meritocracy ==
>> >
>> > Flume was initially developed by Jonathan Hsieh in July 2009 along with
>> > the development team at Cloudera. Developers external to Cloudera
>> > provided feedback, suggested features and fixes, and implemented
>> > extensions of Flume.
>> > The Cloudera engineering team has since maintained the project with
>> > Jonathan Hsieh, Henry Robinson, and Patrick Hunt dedicated towards its
>> > improvement. Contributors to Flume and its connectors include
>> > developers from different companies and different parts of the world.
>> >
>> > == Community ==
>> >
>> > Flume is currently used by a number of organizations all over the
>> > world. Flume has an active and growing user and developer community
>> > with active participation in
>> > [user|https://groups.google.com/a/cloudera.org/group/flume-user/topics]
>> > and
>> > [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
>> > mailing lists. The users and developers also communicate via IRC on
>> > #flume at irc.freenode.net.
>> >
>> > Since open sourcing the project, there have been over 15 different
>> > people from diverse organizations who have contributed code. During
>> > this period, the project team has hosted open, in-person, quarterly
>> > meetups to discuss new features, new designs, and new use-case stories.
>> >
>> > == Core Developers ==
>> >
>> > The core developers for the Flume project are:
>> > * Andrew Bayer: Andrew has a lot of expertise with build tools,
>> >   specifically Jenkins continuous integration and Maven.
>> > * Jonathan Hsieh: Jonathan designed and implemented much of the
>> >   original code.
>> > * Patrick Hunt: Patrick has improved the web interfaces of Flume
>> >   components and contributed several build quality improvements.
>> > * Bruce Mitchener: Bruce has improved the internal logging
>> >   infrastructure as well as edited significant portions of the Flume
>> >   manual.
>> > * Henry Robinson: Henry has implemented much of the ZooKeeper
>> >   integration, plugin mechanisms, as well as several Flume features and
>> >   bug fixes.
>> > * Eric Sammer: Eric has implemented the Maven build, as well as several
>> >   Flume features and bug fixes.
>> >
>> > All core developers of the Flume project have contributed towards
>> > Hadoop or related Apache projects and are very familiar with Apache
>> > principles and philosophy for community-driven software development.
>> >
>> > == Alignment ==
>> >
>> > Flume complements Hadoop MapReduce, Pig, Hive, and HBase by providing a
>> > robust mechanism to allow log data integration from external systems
>> > for effective analysis. Its design enables efficient integration of
>> > newly ingested data into Hive's data warehouse.
>> >
>> > Flume's architecture is open and easily extensible. This has encouraged
>> > many users to contribute integration plugins to other projects. For
>> > example, several users have contributed connectors to message queuing
>> > and bus services, to several open source data stores, to incremental
>> > search indexes, and to stream analysis engines.
>> >
>> > = Known Risks =
>> >
>> > == Orphaned Products ==
>> >
>> > Flume is already deployed in production at multiple companies and they
>> > are actively participating in feature requests and user-led
>> > discussions. Flume is getting traction with developers and thus the
>> > risks of it being orphaned are minimal.
>> >
>> > == Inexperience with Open Source ==
>> >
>> > All code developed for Flume is open sourced by Cloudera under the
>> > Apache 2.0 license. All committers of the Flume project are intimately
>> > familiar with the Apache model for open-source development and are
>> > experienced with working with new contributors.
>> >
>> > == Homogeneous Developers ==
>> >
>> > The initial set of committers is from a limited set of organizations.
>> > However, we expect that once approved for incubation, the project will
>> > attract new contributors from diverse organizations and will thus grow
>> > organically.
>> > The participation of developers from several different organizations
>> > in the mailing list is a strong indication for this assertion.
>> >
>> > == Reliance on Salaried Developers ==
>> >
>> > It is expected that Flume will be developed on salaried and volunteer
>> > time, although all of the initial developers will work on it mainly on
>> > salaried time.
>> >
>> > == Relationships with Other Apache Products ==
>> >
>> > Flume depends upon other Apache Projects: Apache Hadoop, Apache Log4J,
>> > Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple Apache
>> > Commons components. Its build depends upon Apache Ant and Apache Maven.
>> >
>> > Flume users have created connectors that interact with several other
>> > Apache projects including Apache HBase and Apache Cassandra.
>> >
>> > Flume's functionality has some indirect or direct overlap with the
>> > functionality of Apache Chukwa but has several significant
>> > architectural differences. Both systems can be used to collect log
>> > data to write to HDFS. However, Chukwa's primary goals are the analytic
>> > and monitoring aspects of a Hadoop cluster. Instead of focusing on
>> > analytics, Flume focuses primarily upon data transport and integration
>> > with a wide set of data sources and data destinations.
>> > Architecturally, Chukwa components are individually and statically
>> > configured. It also depends upon Hadoop MapReduce for its core
>> > functionality. In contrast, Flume's components are dynamically and
>> > centrally configured and do not depend directly upon Hadoop MapReduce.
>> > Furthermore, Flume provides a more general model for handling data and
>> > enables integration with projects such as Apache Hive, data stores
>> > such as Apache HBase, Apache Cassandra and Voldemort, and several
>> > Apache Lucene-related projects.
>> >
>> > == An Excessive Fascination with the Apache Brand ==
>> >
>> > We would like Flume to become an Apache project to further foster a
>> > healthy community of contributors and consumers around the project.
>> > Since Flume directly interacts with many Apache Hadoop-related projects
>> > and solves an important problem for many Hadoop users, residing in the
>> > Apache Software Foundation will increase interaction with the larger
>> > community.
>> >
>> > = Documentation =
>> >
>> > * All Flume documentation (User Guide, Developer Guide, Cookbook, and
>> >   Windows Guide) is maintained within Flume sources and can be built
>> >   directly.
>> > * Cloudera provides documentation specific to its distribution of Flume
>> >   at: http://archive.cloudera.com/cdh/3/flume/
>> > * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
>> > * Flume JIRA at Cloudera: https://issues.cloudera.org/browse/flume
>> >
>> > = Initial Source =
>> >
>> > * https://github.com/cloudera/flume/tree/
>> >
>> > == Source and Intellectual Property Submission Plan ==
>> >
>> > * The initial source is already licensed under the Apache License,
>> >   Version 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
>> >
>> > == External Dependencies ==
>> >
>> > The required external dependencies are all under the Apache License or
>> > compatible licenses. The components with non-Apache licenses are
>> > enumerated below:
>> >
>> > * org.arabidopsis.ahocorasick : BSD-style
>> >
>> > Non-Apache build tools that are used by Flume are as follows:
>> >
>> > * AsciiDoc: GNU GPLv2
>> > * FindBugs: GNU LGPL
>> > * Cobertura: GNU GPLv2
>> > * PMD : BSD-style
>> >
>> > == Cryptography ==
>> >
>> > Flume uses standard APIs and tools for SSH and SSL communication where
>> > necessary.
>> >
>> > = Required Resources =
>> >
>> > == Mailing lists ==
>> >
>> > * flume-private (with moderated subscriptions)
>> > * flume-dev
>> > * flume-commits
>> > * flume-user
>> >
>> > == Subversion Directory ==
>> >
>> > https://svn.apache.org/repos/asf/incubator/flume
>> >
>> > == Issue Tracking ==
>> >
>> > JIRA Flume (FLUME)
>> >
>> > == Other Resources ==
>> >
>> > The existing code already has unit and integration tests so we would
>> > like a Hudson instance to run them whenever a new patch is submitted.
>> > This can be added after project creation.
>> >
>> > = Initial Committers =
>> >
>> > * Andrew Bayer (abayer at cloudera dot com)
>> > * Jonathan Hsieh (jon at cloudera dot com)
>> > * Aaron Kimball (akimball83 at gmail dot com)
>> > * Bruce Mitchener (bruce.mitchener at gmail dot com)
>> > * Arvind Prabhakar (arvind at cloudera dot com)
>> > * Ahmed Radwan (ahmed at cloudera dot com)
>> > * Henry Robinson (henry at cloudera dot com)
>> > * Eric Sammer (esammer at cloudera dot com)
>> >
>> > = Affiliations =
>> >
>> > * Andrew Bayer, Cloudera
>> > * Jonathan Hsieh, Cloudera
>> > * Aaron Kimball, Odiago
>> > * Bruce Mitchener, Independent
>> > * Arvind Prabhakar, Cloudera
>> > * Ahmed Radwan, Cloudera
>> > * Henry Robinson, Cloudera
>> > * Eric Sammer, Cloudera
>> >
>> >
>> > = Sponsors =
>> >
>> > == Champion ==
>> >
>> > * Nigel Daley
>> >
>> > == Nominated Mentors ==
>> >
>> > * Tom White
>> > * Nigel Daley
>> >
>> > == Sponsoring Entity ==
>> >
>> > * Apache Incubator PMC
>> >
>> >
>> > --
>> > // Jonathan Hsieh (shay)
>> > // Software Engineer, Cloudera
>> > // j...@cloudera.com
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>>
>
--
Thanks
- Mohammad Nour
  Author of (WebSphere Application Server Community Edition 2.0 User Guide)
  http://www.redbooks.ibm.com/abstracts/sg247585.html
- LinkedIn: http://www.linkedin.com/in/mnour
- Blog: http://tadabborat.blogspot.com
----
"Life is like riding a bicycle. To keep your balance you must keep moving"
- Albert Einstein

"Writing clean code is what you must do in order to call yourself a
professional. There is no reasonable excuse for doing anything less
than your best."
- Clean Code: A Handbook of Agile Software Craftsmanship

"Stay hungry, stay foolish."
- Steve Jobs