Cool project and nice proposal :-)

Tommaso
2011/5/29 Nigel Daley <nda...@mac.com>

> +1 on the proposal. Looking forward to the vote.
>
> Nige
>
> On May 28, 2011, at 10:49 PM, Mattmann, Chris A (388J) wrote:
>
> > On May 27, 2011, at 11:40 AM, arv...@cloudera.com wrote:
> >
> >> Greetings All,
> >>
> >> We would like to propose the Sqoop Project for inclusion in the ASF
> >> Incubator as a new podling. Sqoop is a tool designed for efficiently
> >> transferring bulk data between Apache Hadoop and structured datastores
> >> such as relational databases. The complete proposal can be found at:
> >>
> >> http://wiki.apache.org/incubator/SqoopProposal
> >>
> >> The initial contents of this proposal are also pasted below for
> >> convenience.
> >>
> >> Thanks and Regards,
> >> Arvind Prabhakar
> >>
> >> = Sqoop - A Data Transfer Tool for Hadoop =
> >>
> >> == Abstract ==
> >>
> >> Sqoop is a tool designed for efficiently transferring bulk data between
> >> Apache Hadoop and structured datastores such as relational databases.
> >> You can use Sqoop to import data from external structured datastores
> >> into the Hadoop Distributed File System or related systems like Hive
> >> and HBase. Conversely, Sqoop can be used to extract data from Hadoop
> >> and export it to external structured datastores such as relational
> >> databases and enterprise data warehouses.
> >>
> >> == Proposal ==
> >>
> >> Hadoop and related systems operate on large volumes of data. Typically
> >> this data originates outside of the Hadoop infrastructure and must be
> >> provisioned for consumption by Hadoop and related systems for analysis
> >> and processing. Sqoop allows fast provisioning of data into Hadoop and
> >> related systems by providing a bulk import and export mechanism that
> >> enables consumers to effectively use Hadoop for data analysis and
> >> processing.
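[Editor's note: for readers unfamiliar with the tool, the bulk import and
export mechanism described in the proposal is driven from the command line.
A minimal sketch follows; the host, database, table, user and directory
names are placeholders, and exact flags may vary between Sqoop versions.]

```shell
# Import a table from a relational database into HDFS,
# splitting the transfer across 4 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username someuser -P \
  --table EMPLOYEES \
  --target-dir /user/someuser/employees \
  --num-mappers 4

# Export analysis results from HDFS back into a database table.
sqoop export \
  --connect jdbc:mysql://db.example.com/corp \
  --username someuser -P \
  --table EMPLOYEE_SUMMARY \
  --export-dir /user/someuser/employee_summary
```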
> >>
> >> == Background ==
> >>
> >> Sqoop was initially developed by Cloudera to enable the import and
> >> export of data between various databases and the Hadoop Distributed
> >> File System (HDFS). It was provided as a patch to the Hadoop project
> >> via the issue
> >> [[https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and
> >> was maintained as a contrib module of Hadoop from May 2009 to April
> >> 2010. In April 2010, Sqoop was removed from Hadoop contrib via
> >> [[https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]]
> >> and was made available by Cloudera on
> >> [[http://github.com/cloudera/sqoop|GitHub]].
> >>
> >> Since then Sqoop has been maintained by Cloudera as an open source
> >> project on GitHub. All code available in Sqoop is open source and made
> >> publicly available under the Apache 2 license. During this time Sqoop
> >> has been formally released three times, as versions 1.0, 1.1 and 1.2.
> >>
> >> == Rationale ==
> >>
> >> Hadoop is often used to process data that originates in, or is later
> >> served by, structured data stores such as relational databases,
> >> spreadsheets or enterprise data warehouses. Unfortunately, current
> >> methods of transferring data are inefficient and ad hoc, often
> >> consisting of manual steps specific to the external system. These
> >> steps are necessary to help provision this data for consumption by
> >> Map-Reduce jobs, or by systems that build on top of Hadoop such as
> >> Hive and Pig. The transfer of this data can take a substantial amount
> >> of time depending upon its size. An optimal transfer approach that
> >> works well with one particular datastore will typically not work as
> >> well with another datastore due to inherent architectural differences
> >> between datastore implementations. Sqoop addresses this problem by
> >> providing connectivity between Hadoop and external systems via
> >> pluggable connectors.
> >> Specialized connectors are developed for optimal performance of data
> >> transfer between Hadoop and target systems.
> >>
> >> Analyzed and processed data from Hadoop and related systems may also
> >> need to be provisioned outside of Hadoop for consumption by business
> >> applications. Sqoop allows the export of data from Hadoop to external
> >> systems to facilitate its use in those systems. This too, like the
> >> import scenario, is implemented via specialized connectors that are
> >> built for the purpose of optimal integration between Hadoop and
> >> external systems.
> >>
> >> New connectors can be built for systems that Sqoop does not yet
> >> integrate with, and can be easily incorporated into Sqoop. Connectors
> >> allow Sqoop to interface with external systems of different types,
> >> ensuring that newer systems can integrate with Hadoop with relative
> >> ease and in a consistent manner.
> >>
> >> Besides allowing integration with other external systems, Sqoop
> >> provides tight integration with systems that build on top of Hadoop,
> >> such as Hive and HBase - thus providing data integration between
> >> Hadoop-based systems and external systems in a single step.
> >>
> >> == Initial Goals ==
> >>
> >> Sqoop is currently in its first major release, with a considerable
> >> number of enhancement requests, tasks, and issues logged towards its
> >> future development. The initial goal of this project will be to
> >> address the most highly requested features and bug fixes for its next
> >> dot release. The key features of interest are the following:
> >> * Support for bulk import into Apache HBase.
> >> * Allow the user to supply the password in a permission-protected file.
> >> * Support for pluggable queries to help Sqoop identify the metadata
> >> associated with the source or target table definitions.
> >> * Allow the user to specify custom split semantics for efficient
> >> parallelization of import jobs.
> >>
> >> = Current Status =
> >>
> >> == Meritocracy ==
> >>
> >> Sqoop has been an open source project since its start. It was
> >> initially developed by Aaron Kimball in May 2009, along with the
> >> development team at Cloudera, and supplied as a patch to the Hadoop
> >> project. Later it was moved to GitHub as a Cloudera open-source
> >> project, where the Cloudera engineering team has since maintained it,
> >> with Arvind Prabhakar and Ahmed Radwan dedicated to its improvement.
> >> Developers external to Cloudera have provided feedback, suggested
> >> features and fixes, and implemented extensions of Sqoop since its
> >> inception. Contributors to Sqoop include developers from different
> >> companies and different parts of the world.
> >>
> >> == Community ==
> >>
> >> Sqoop is currently used by a number of organizations all over the
> >> world. Sqoop has an active and growing user community with active
> >> participation in the
> >> [[https://groups.google.com/a/cloudera.org/group/sqoop-user/topics|user]]
> >> and
> >> [[https://groups.google.com/a/cloudera.org/group/sqoop-dev/topics|developer]]
> >> mailing lists.
> >>
> >> == Core Developers ==
> >>
> >> The core developers for the Sqoop project are:
> >> * Aaron Kimball: Aaron designed and implemented much of the original
> >> code.
> >> * Arvind Prabhakar: Has been working on Sqoop features and bug fixes.
> >> * Ahmed Radwan: Has been working on Sqoop features and bug fixes.
> >> * Jonathan Hsieh: Has started working on Sqoop features and bug fixes.
> >> * Other contributors to the project include: Angus He, Brian Muller,
> >> Eli Collins, Guy Le Mar, James Grant, Konstantin Boudnik, Lars
> >> Francke, Michael Hausler, Michael Katzenellenbogen, Péter Happ and
> >> Scott Foster.
> >>
> >> All committers to the Sqoop project have contributed to Hadoop or
> >> related Apache projects and are very familiar with Apache principles
> >> and the philosophy of community-driven software development.
> >>
> >> == Alignment ==
> >>
> >> Sqoop complements Hadoop Map-Reduce, Pig, Hive and HBase by providing
> >> a robust mechanism for integrating data from external systems for
> >> effective data analysis. It currently integrates with Hive and HBase,
> >> and work is being done to integrate it with Pig.
> >>
> >> = Known Risks =
> >>
> >> == Orphaned Products ==
> >>
> >> Sqoop is already deployed in production at multiple companies, which
> >> are actively participating in feature requests and user-led
> >> discussions. Sqoop is getting traction with developers, and thus the
> >> risk of it being orphaned is minimal.
> >>
> >> == Inexperience with Open Source ==
> >>
> >> All code developed for Sqoop has been open source from the start. The
> >> initial part of Sqoop development was done within the Hadoop project
> >> as a contrib module. Since then it has been maintained as an Apache
> >> 2.0 licensed open-source project on GitHub by Cloudera.
> >>
> >> All committers of the Sqoop project are intimately familiar with the
> >> Apache model for open-source development and are experienced in
> >> working with new contributors. Aaron Kimball, the creator of the
> >> project and one of the committers, is also a committer on Apache
> >> MapReduce.
> >>
> >> == Homogeneous Developers ==
> >>
> >> The initial set of committers is from a small set of organizations.
> >> However, we expect that once approved for incubation, the project
> >> will attract new contributors from diverse organizations and will
> >> thus grow organically. The participation of developers from several
> >> different organizations on the mailing list strongly supports this
> >> expectation.
> >>
> >> == Reliance on Salaried Developers ==
> >>
> >> It is expected that Sqoop will be developed on both salaried and
> >> volunteer time, although all of the initial developers will work on
> >> it mainly on salaried time.
> >>
> >> == Relationships with Other Apache Products ==
> >>
> >> Sqoop depends upon other Apache projects: Hadoop, Hive, HBase, Log4J
> >> and multiple Apache Commons components, as well as build systems like
> >> Ant and Maven.
> >>
> >> == An Excessive Fascination with the Apache Brand ==
> >>
> >> The reasons for joining Apache are to increase the synergy with other
> >> Apache Hadoop related projects and to foster a healthy community of
> >> contributors and consumers around the project. This is facilitated by
> >> the ASF, and that is the primary reason we would like Sqoop to become
> >> an Apache project.
> >>
> >> = Documentation =
> >>
> >> * All Sqoop documentation is maintained within the Sqoop sources and
> >> can be built directly.
> >> * Sqoop docs: http://archive.cloudera.com/cdh/3/sqoop/
> >> * Sqoop wiki at GitHub: https://github.com/cloudera/sqoop/wiki
> >> * Sqoop jira at Cloudera: https://issues.cloudera.org/browse/sqoop
> >>
> >> = Initial Source =
> >>
> >> * https://github.com/cloudera/sqoop/tree/
> >>
> >> == Source and Intellectual Property Submission Plan ==
> >>
> >> * The initial source is already Apache 2.0 licensed.
> >>
> >> == External Dependencies ==
> >>
> >> The required external dependencies all carry the Apache License or
> >> compatible licenses. The following components with non-Apache
> >> licenses are enumerated:
> >>
> >> * HSQLDB: HSQLDB License - a BSD-based license.
> >>
> >> Non-Apache build tools that are used by Sqoop are as follows:
> >>
> >> * AsciiDoc: GNU GPLv2
> >> * Checkstyle: GNU LGPLv3
> >> * FindBugs: GNU LGPL
> >> * Cobertura: GNU GPLv2
> >>
> >> == Cryptography ==
> >>
> >> Sqoop does not depend upon any cryptography tools or libraries.
> >>
> >> = Required Resources =
> >>
> >> == Mailing lists ==
> >>
> >> * sqoop-private (with moderated subscriptions)
> >> * sqoop-dev
> >> * sqoop-commits
> >> * sqoop-user
> >>
> >> == Subversion Directory ==
> >>
> >> https://svn.apache.org/repos/asf/incubator/sqoop
> >>
> >> == Issue Tracking ==
> >>
> >> JIRA Sqoop (SQOOP)
> >>
> >> == Other Resources ==
> >>
> >> The existing code already has unit and integration tests, so we would
> >> like a Hudson instance to run them whenever a new patch is submitted.
> >> This can be added after project creation.
> >>
> >> = Initial Committers =
> >>
> >> * Arvind Prabhakar (arvind at cloudera dot com)
> >> * Ahmed Radwan (a dot aboelela at gmail dot com)
> >> * Jonathan Hsieh (jon at cloudera dot com)
> >> * Aaron Kimball (kimballa at apache dot org)
> >> * Greg Cottman (greg dot cottman at quest dot com)
> >> * Guy le Mar (guy dot lemar at quest dot com)
> >> * Roman Shaposhnik (rvs at cloudera dot com)
> >> * Andrew Bayer (andrew at cloudera dot com)
> >>
> >> A CLA is already on file for Aaron Kimball.
> >>
> >> = Affiliations =
> >>
> >> * Arvind Prabhakar, Cloudera
> >> * Ahmed Radwan, Cloudera
> >> * Jonathan Hsieh, Cloudera
> >> * Aaron Kimball, Odiago
> >> * Greg Cottman, Quest
> >> * Guy le Mar, Quest
> >> * Roman Shaposhnik, Cloudera
> >> * Andrew Bayer, Cloudera
> >>
> >> = Sponsors =
> >>
> >> == Champion ==
> >>
> >> * Tom White (tomwhite at apache dot org)
> >>
> >> == Nominated Mentors ==
> >>
> >> * Patrick Hunt (phunt at apache dot org)
> >>
> >> == Sponsoring Entity ==
> >>
> >> * Apache Incubator PMC
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org