+1 on the proposal. Looking forward to the vote.

Nige
On May 28, 2011, at 10:49 PM, Mattmann, Chris A (388J) wrote:
> On May 27, 2011, at 11:40 AM, arv...@cloudera.com wrote:
>
>> Greetings All,
>>
>> We would like to propose Sqoop Project for inclusion in ASF Incubator as a
>> new podling. Sqoop is a tool designed for efficiently transferring bulk data
>> between Apache Hadoop and structured datastores such as relational
>> databases. The complete proposal can be found at:
>>
>> http://wiki.apache.org/incubator/SqoopProposal
>>
>> The initial contents of this proposal are also pasted below for convenience.
>>
>> Thanks and Regards,
>> Arvind Prabhakar
>>
>> = Sqoop - A Data Transfer Tool for Hadoop =
>>
>> == Abstract ==
>>
>> Sqoop is a tool designed for efficiently transferring bulk data between
>> Apache Hadoop and structured datastores such as relational databases. You
>> can use Sqoop to import data from external structured datastores into Hadoop
>> Distributed File System or related systems like Hive and HBase. Conversely,
>> Sqoop can be used to extract data from Hadoop and export it to external
>> structured datastores such as relational databases and enterprise data
>> warehouses.
>>
>> == Proposal ==
>>
>> Hadoop and related systems operate on large volumes of data. Typically this
>> data originates from outside of Hadoop infrastructure and must be
>> provisioned for consumption by Hadoop and related systems for analysis and
>> processing. Sqoop allows fast provisioning of data into Hadoop and related
>> systems by providing a bulk import and export mechanism that enables
>> consumers to effectively use Hadoop for data analysis and processing.
>>
>> == Background ==
>>
>> Sqoop was initially developed by Cloudera to enable the import and export of
>> data between various databases and Hadoop Distributed File System (HDFS).
>> It was provided as a patch to the Hadoop project via the issue
>> [[https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and was
>> maintained as a contrib module to Hadoop from May 2009 to April 2010. In
>> April 2010, Sqoop was removed from Hadoop contrib via
>> [[https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]] and
>> was made available by Cloudera on
>> [[http://github.com/cloudera/sqoop|GitHub]].
>>
>> Since then Sqoop has been maintained by Cloudera as an open source project
>> on GitHub. All code available in Sqoop is open source and made publicly
>> available under the Apache 2 license. During this time Sqoop has been
>> formally released three times as versions 1.0, 1.1 and 1.2.
>>
>> == Rationale ==
>>
>> Hadoop is often used to process data that originated in or is later served
>> by structured data stores such as relational databases, spreadsheets or
>> enterprise data warehouses. Unfortunately, current methods of transferring
>> data are inefficient and ad hoc, often consisting of manual steps specific
>> to the external system. These steps are necessary to help provision this
>> data for consumption by Map-Reduce jobs, or by systems that build on top of
>> Hadoop such as Hive and Pig. The transfer of this data can take a
>> substantial amount of time depending upon its size. An optimal transfer
>> approach that works well with one particular datastore will typically not
>> work as optimally with another datastore due to inherent architectural
>> differences between different datastore implementations. Sqoop addresses
>> this problem by providing connectivity of Hadoop with external systems via
>> pluggable connectors. Specialized connectors are developed for optimal
>> performance for data transfer between Hadoop and target systems.
>>
>> Analyzed and processed data from Hadoop and related systems may also need
>> to be provisioned outside of Hadoop for consumption by business
>> applications.
>> Sqoop allows the export of data from Hadoop to external systems to
>> facilitate its use in other systems. This too, like the import scenario, is
>> implemented via specialized connectors that are built for the purposes of
>> optimal integration between Hadoop and external systems.
>>
>> Connectors can be built for systems that Sqoop does not yet integrate with
>> and thus can be easily incorporated into Sqoop. Connectors allow Sqoop to
>> interface with external systems of different types, ensuring that newer
>> systems can integrate with Hadoop with relative ease and in a consistent
>> manner.
>>
>> Besides allowing integration with other external systems, Sqoop provides
>> tight integration with systems that build on top of Hadoop such as Hive and
>> HBase, thus providing data integration between Hadoop-based systems and
>> external systems in a single step.
>>
>> == Initial Goals ==
>>
>> Sqoop is currently in its first major release with a considerable number of
>> enhancement requests, tasks, and issues logged towards its future
>> development. The initial goal of this project will be to address the highly
>> requested features and bug-fixes towards its next dot release. The key
>> features of interest are the following:
>> * Support for bulk import into Apache HBase.
>> * Allow user to supply password in permission protected file.
>> * Support for pluggable query to help Sqoop identify the metadata
>>   associated with the source or target table definitions.
>> * Allow user to specify custom split semantics for efficient
>>   parallelization of import jobs.
>>
>> = Current Status =
>>
>> == Meritocracy ==
>>
>> Sqoop has been an open source project since its start. It was initially
>> developed by Aaron Kimball in May 2009 along with the development team at
>> Cloudera and supplied as a patch to the Hadoop project.
>> Later it was moved to GitHub as a Cloudera open-source project, where the
>> Cloudera engineering team has since maintained it with Arvind Prabhakar and
>> Ahmed Radwan dedicated to its improvement. Developers external to Cloudera
>> have provided feedback, suggested features and fixes, and implemented
>> extensions of Sqoop since its inception. Contributors to Sqoop include
>> developers from different companies and different parts of the world.
>>
>> == Community ==
>>
>> Sqoop is currently used by a number of organizations all over the world.
>> Sqoop has an active and growing user community with active participation in
>> [[https://groups.google.com/a/cloudera.org/group/sqoop-user/topics|user]]
>> and
>> [[https://groups.google.com/a/cloudera.org/group/sqoop-dev/topics|developer]]
>> mailing lists.
>>
>> == Core Developers ==
>>
>> The core developers for Sqoop project are:
>> * Aaron Kimball: Aaron designed and implemented much of the original code.
>> * Arvind Prabhakar: Has been working on Sqoop features and bug fixes.
>> * Ahmed Radwan: Has been working on Sqoop features and bug fixes.
>> * Jonathan Hsieh: Has started working on Sqoop features and bug fixes.
>> * Other contributors to the project include: Angus He, Brian Muller, Eli
>>   Collins, Guy Le Mar, James Grant, Konstantin Boudnik, Lars Francke, Michael
>>   Hausler, Michael Katzenellenbogen, Pter Happ and Scott Foster.
>>
>> All committers to Sqoop project have contributed towards Hadoop or related
>> Apache projects and are very familiar with Apache principles and philosophy
>> for community driven software development.
>>
>> == Alignment ==
>>
>> Sqoop complements Hadoop Map-Reduce, Pig, Hive and HBase by providing a
>> robust mechanism to allow data integration from external systems for
>> effective data analysis. It integrates with Hive and HBase currently and
>> work is being done to integrate it with Pig.
>>
>> = Known Risks =
>>
>> == Orphaned Products ==
>>
>> Sqoop is already deployed in production at multiple companies and they are
>> actively participating in feature requests and user led discussions. Sqoop
>> is getting traction with developers and thus the risks of it being orphaned
>> are minimal.
>>
>> == Inexperience with Open Source ==
>>
>> All code developed for Sqoop has been open source from the start. The
>> initial part of Sqoop development was done within Hadoop project as a
>> contrib module. Since then it has been maintained as an Apache 2.0 licensed
>> open-source project on GitHub by Cloudera.
>>
>> All committers of Sqoop project are intimately familiar with the Apache
>> model for open-source development and are experienced with working with new
>> contributors. Aaron Kimball, the creator of the project and one of the
>> committers, is also a committer on Apache MapReduce.
>>
>> == Homogeneous Developers ==
>>
>> The initial set of committers is from a small set of organizations. However,
>> we expect that once approved for incubation, the project will attract new
>> contributors from diverse organizations and will thus grow organically. The
>> participation of developers from several different organizations in the
>> mailing list strongly supports this expectation.
>>
>> == Reliance on Salaried Developers ==
>>
>> It is expected that Sqoop will be developed on salaried and volunteer time,
>> although all of the initial developers will work on it mainly on salaried
>> time.
>>
>> == Relationships with Other Apache Products ==
>>
>> Sqoop depends upon other Apache Projects: Hadoop, Hive, HBase, Log4J,
>> multiple Apache Commons components, and build systems like Ant and Maven.
>>
>> == An Excessive Fascination with the Apache Brand ==
>>
>> The reasons for joining Apache are to increase the synergy with other Apache
>> Hadoop related projects and to foster a healthy community of contributors
>> and consumers around the project.
>> This is facilitated by ASF and that is the primary reason we would like
>> Sqoop to become an Apache project.
>>
>> = Documentation =
>>
>> * All Sqoop documentation is maintained within Sqoop sources and can be
>>   built directly.
>> * Sqoop docs: http://archive.cloudera.com/cdh/3/sqoop/
>> * Sqoop wiki at GitHub: https://github.com/cloudera/sqoop/wiki
>> * Sqoop jira at Cloudera: https://issues.cloudera.org/browse/sqoop
>>
>> = Initial Source =
>>
>> * https://github.com/cloudera/sqoop/tree/
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> * The initial source is already Apache 2.0 licensed.
>>
>> == External Dependencies ==
>>
>> The required external dependencies are all under the Apache License or
>> compatible licenses. The components with non-Apache licenses are enumerated
>> below:
>>
>> * HSQLDB: HSQLDB License - a BSD-based license.
>>
>> Non-Apache build tools that are used by Sqoop are as follows:
>>
>> * AsciiDoc: GNU GPLv2
>> * Checkstyle: GNU LGPLv3
>> * FindBugs: GNU LGPL
>> * Cobertura: GNU GPLv2
>>
>> == Cryptography ==
>>
>> Sqoop does not depend upon any cryptography tools or libraries.
>>
>> = Required Resources =
>>
>> == Mailing lists ==
>>
>> * sqoop-private (with moderated subscriptions)
>> * sqoop-dev
>> * sqoop-commits
>> * sqoop-user
>>
>> == Subversion Directory ==
>>
>> https://svn.apache.org/repos/asf/incubator/sqoop
>>
>> == Issue Tracking ==
>>
>> JIRA Sqoop (SQOOP)
>>
>> == Other Resources ==
>>
>> The existing code already has unit and integration tests so we would like a
>> Hudson instance to run them whenever a new patch is submitted. This can be
>> added after project creation.
>>
>> = Initial Committers =
>>
>> * Arvind Prabhakar (arvind at cloudera dot com)
>> * Ahmed Radwan (a dot aboelela at gmail dot com)
>> * Jonathan Hsieh (jon at cloudera dot com)
>> * Aaron Kimball (kimballa at apache dot org)
>> * Greg Cottman (greg dot cottman at quest dot com)
>> * Guy le Mar (guy dot lemar at quest dot com)
>> * Roman Shaposhnik (rvs at cloudera dot com)
>> * Andrew Bayer (andrew at cloudera dot com)
>>
>> A CLA is already on file for Aaron Kimball.
>>
>> = Affiliations =
>>
>> * Arvind Prabhakar, Cloudera
>> * Ahmed Radwan, Cloudera
>> * Jonathan Hsieh, Cloudera
>> * Aaron Kimball, Odiago
>> * Greg Cottman, Quest
>> * Guy le Mar, Quest
>> * Roman Shaposhnik, Cloudera
>> * Andrew Bayer, Cloudera
>>
>> = Sponsors =
>>
>> == Champion ==
>>
>> * Tom White (tomwhite at apache dot org)
>>
>> == Nominated Mentors ==
>>
>> * Patrick Hunt (phunt at apache dot org)
>>
>> == Sponsoring Entity ==
>>
>> * Apache Incubator PMC
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org