Cool project and nice proposal :-)

Tommaso
2011/5/29 Nigel Daley <nda...@mac.com>

> +1 on the proposal. Looking forward to the vote.
>
> Nige
>
> On May 28, 2011, at 10:49 PM, Mattmann, Chris A (388J) wrote:
>
> > On May 27, 2011, at 11:40 AM, arv...@cloudera.com wrote:
> >
> >> Greetings All,
> >>
> >> We would like to propose the Sqoop Project for inclusion in the ASF
> >> Incubator as a new podling. Sqoop is a tool designed for efficiently
> >> transferring bulk data between Apache Hadoop and structured datastores
> >> such as relational databases. The complete proposal can be found at:
> >>
> >> http://wiki.apache.org/incubator/SqoopProposal
> >>
> >> The initial contents of this proposal are also pasted below for
> >> convenience.
> >>
> >> Thanks and Regards,
> >> Arvind Prabhakar
> >>
> >> = Sqoop - A Data Transfer Tool for Hadoop =
> >>
> >> == Abstract ==
> >>
> >> Sqoop is a tool designed for efficiently transferring bulk data between
> >> Apache Hadoop and structured datastores such as relational databases.
> >> You can use Sqoop to import data from external structured datastores
> >> into the Hadoop Distributed File System or related systems like Hive
> >> and HBase. Conversely, Sqoop can be used to extract data from Hadoop
> >> and export it to external structured datastores such as relational
> >> databases and enterprise data warehouses.
> >>
> >> == Proposal ==
> >>
> >> Hadoop and related systems operate on large volumes of data. Typically
> >> this data originates outside of the Hadoop infrastructure and must be
> >> provisioned for consumption by Hadoop and related systems for analysis
> >> and processing. Sqoop allows fast provisioning of data into Hadoop and
> >> related systems by providing a bulk import and export mechanism that
> >> enables consumers to effectively use Hadoop for data analysis and
> >> processing.
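[Editor's note: for readers unfamiliar with the tool, the bulk import and
export mechanism described in the proposal is driven from the command line.
A minimal sketch follows; the host, database, table, user and directory
names are placeholders, and exact flags may vary between Sqoop versions.]

```shell
# Import a table from a relational database into HDFS,
# splitting the transfer across 4 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username someuser -P \
  --table EMPLOYEES \
  --target-dir /user/someuser/employees \
  --num-mappers 4

# Export analysis results from HDFS back into a database table.
sqoop export \
  --connect jdbc:mysql://db.example.com/corp \
  --username someuser -P \
  --table EMPLOYEE_SUMMARY \
  --export-dir /user/someuser/employee_summary
```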
> >>
> >> == Background ==
> >>
> >> Sqoop was initially developed by Cloudera to enable the import and
> >> export of data between various databases and the Hadoop Distributed
> >> File System (HDFS). It was provided as a patch to the Hadoop project
> >> via the issue
> >> [[https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and
> >> was maintained as a contrib module of Hadoop from May 2009 to April
> >> 2010. In April 2010, Sqoop was removed from Hadoop contrib via
> >> [[https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]]
> >> and was made available by Cloudera on
> >> [[http://github.com/cloudera/sqoop|GitHub]].
> >>
> >> Since then Sqoop has been maintained by Cloudera as an open source
> >> project on GitHub. All code available in Sqoop is open source and made
> >> publicly available under the Apache 2 license. During this time Sqoop
> >> has been formally released three times, as versions 1.0, 1.1 and 1.2.
> >>
> >> == Rationale ==
> >>
> >> Hadoop is often used to process data that originates in, or is later
> >> served by, structured data stores such as relational databases,
> >> spreadsheets or enterprise data warehouses. Unfortunately, current
> >> methods of transferring data are inefficient and ad hoc, often
> >> consisting of manual steps specific to the external system. These
> >> steps are necessary to help provision this data for consumption by
> >> Map-Reduce jobs, or by systems that build on top of Hadoop such as
> >> Hive and Pig. The transfer of this data can take a substantial amount
> >> of time depending upon its size. An optimal transfer approach that
> >> works well with one particular datastore will typically not work as
> >> well with another datastore due to inherent architectural differences
> >> between datastore implementations. Sqoop addresses this problem by
> >> providing connectivity between Hadoop and external systems via
> >> pluggable connectors.
> >> Specialized connectors are developed for optimal performance of data
> >> transfer between Hadoop and target systems.
> >>
> >> Analyzed and processed data from Hadoop and related systems may also
> >> need to be provisioned outside of Hadoop for consumption by business
> >> applications. Sqoop allows the export of data from Hadoop to external
> >> systems to facilitate its use in those systems. This too, like the
> >> import scenario, is implemented via specialized connectors that are
> >> built for the purpose of optimal integration between Hadoop and
> >> external systems.
> >>
> >> New connectors can be built for systems that Sqoop does not yet
> >> integrate with, and can be easily incorporated into Sqoop. Connectors
> >> allow Sqoop to interface with external systems of different types,
> >> ensuring that newer systems can integrate with Hadoop with relative
> >> ease and in a consistent manner.
> >>
> >> Besides allowing integration with other external systems, Sqoop
> >> provides tight integration with systems that build on top of Hadoop,
> >> such as Hive and HBase - thus providing data integration between
> >> Hadoop-based systems and external systems in a single step.
> >>
> >> == Initial Goals ==
> >>
> >> Sqoop is currently in its first major release, with a considerable
> >> number of enhancement requests, tasks, and issues logged towards its
> >> future development. The initial goal of this project will be to
> >> address the most highly requested features and bug fixes for its next
> >> dot release. The key features of interest are the following:
> >> * Support for bulk import into Apache HBase.
> >> * Allow the user to supply the password in a permission-protected file.
> >> * Support for pluggable queries to help Sqoop identify the metadata
> >> associated with the source or target table definitions.
> >> * Allow the user to specify custom split semantics for efficient
> >> parallelization of import jobs.
> >>
> >> = Current Status =
> >>
> >> == Meritocracy ==
> >>
> >> Sqoop has been an open source project since its start. It was
> >> initially developed by Aaron Kimball in May 2009, along with the
> >> development team at Cloudera, and supplied as a patch to the Hadoop
> >> project. Later it was moved to GitHub as a Cloudera open-source
> >> project, where the Cloudera engineering team has since maintained it,
> >> with Arvind Prabhakar and Ahmed Radwan dedicated to its improvement.
> >> Developers external to Cloudera have provided feedback, suggested
> >> features and fixes, and implemented extensions of Sqoop since its
> >> inception. Contributors to Sqoop include developers from different
> >> companies and different parts of the world.
> >>
> >> == Community ==
> >>
> >> Sqoop is currently used by a number of organizations all over the
> >> world. Sqoop has an active and growing user community with active
> >> participation in the
> >> [[https://groups.google.com/a/cloudera.org/group/sqoop-user/topics|user]]
> >> and
> >> [[https://groups.google.com/a/cloudera.org/group/sqoop-dev/topics|developer]]
> >> mailing lists.
> >>
> >> == Core Developers ==
> >>
> >> The core developers for the Sqoop project are:
> >> * Aaron Kimball: Aaron designed and implemented much of the original
> >> code.
> >> * Arvind Prabhakar: Has been working on Sqoop features and bug fixes.
> >> * Ahmed Radwan: Has been working on Sqoop features and bug fixes.
> >> * Jonathan Hsieh: Has started working on Sqoop features and bug fixes.
> >> * Other contributors to the project include: Angus He, Brian Muller,
> >> Eli Collins, Guy Le Mar, James Grant, Konstantin Boudnik, Lars
> >> Francke, Michael Hausler, Michael Katzenellenbogen, Péter Happ and
> >> Scott Foster.
> >>
> >> All committers to the Sqoop project have contributed to Hadoop or
> >> related Apache projects and are very familiar with Apache principles
> >> and the philosophy of community-driven software development.
> >>
> >> == Alignment ==
> >>
> >> Sqoop complements Hadoop Map-Reduce, Pig, Hive and HBase by providing
> >> a robust mechanism for integrating data from external systems for
> >> effective data analysis. It currently integrates with Hive and HBase,
> >> and work is being done to integrate it with Pig.
> >>
> >> = Known Risks =
> >>
> >> == Orphaned Products ==
> >>
> >> Sqoop is already deployed in production at multiple companies, which
> >> are actively participating in feature requests and user-led
> >> discussions. Sqoop is getting traction with developers, and thus the
> >> risk of it being orphaned is minimal.
> >>
> >> == Inexperience with Open Source ==
> >>
> >> All code developed for Sqoop has been open source from the start. The
> >> initial part of Sqoop development was done within the Hadoop project
> >> as a contrib module. Since then it has been maintained as an Apache
> >> 2.0 licensed open-source project on GitHub by Cloudera.
> >>
> >> All committers of the Sqoop project are intimately familiar with the
> >> Apache model for open-source development and are experienced in
> >> working with new contributors. Aaron Kimball, the creator of the
> >> project and one of the committers, is also a committer on Apache
> >> MapReduce.
> >>
> >> == Homogeneous Developers ==
> >>
> >> The initial set of committers is from a small set of organizations.
> >> However, we expect that once approved for incubation, the project
> >> will attract new contributors from diverse organizations and will
> >> thus grow organically. The participation of developers from several
> >> different organizations on the mailing list strongly supports this
> >> expectation.
> >>
> >> == Reliance on Salaried Developers ==
> >>
> >> It is expected that Sqoop will be developed on both salaried and
> >> volunteer time, although all of the initial developers will work on
> >> it mainly on salaried time.
> >>
> >> == Relationships with Other Apache Products ==
> >>
> >> Sqoop depends upon other Apache projects: Hadoop, Hive, HBase, Log4J
> >> and multiple Apache Commons components, as well as build systems like
> >> Ant and Maven.
> >>
> >> == An Excessive Fascination with the Apache Brand ==
> >>
> >> The reasons for joining Apache are to increase the synergy with other
> >> Apache Hadoop related projects and to foster a healthy community of
> >> contributors and consumers around the project. This is facilitated by
> >> the ASF, and that is the primary reason we would like Sqoop to become
> >> an Apache project.
> >>
> >> = Documentation =
> >>
> >> * All Sqoop documentation is maintained within the Sqoop sources and
> >> can be built directly.
> >> * Sqoop docs: http://archive.cloudera.com/cdh/3/sqoop/
> >> * Sqoop wiki at GitHub: https://github.com/cloudera/sqoop/wiki
> >> * Sqoop jira at Cloudera: https://issues.cloudera.org/browse/sqoop
> >>
> >> = Initial Source =
> >>
> >> * https://github.com/cloudera/sqoop/tree/
> >>
> >> == Source and Intellectual Property Submission Plan ==
> >>
> >> * The initial source is already Apache 2.0 licensed.
> >>
> >> == External Dependencies ==
> >>
> >> The required external dependencies all carry the Apache License or
> >> compatible licenses. The following components with non-Apache
> >> licenses are enumerated:
> >>
> >> * HSQLDB: HSQLDB License - a BSD-based license.
> >>
> >> Non-Apache build tools that are used by Sqoop are as follows:
> >>
> >> * AsciiDoc: GNU GPLv2
> >> * Checkstyle: GNU LGPLv3
> >> * FindBugs: GNU LGPL
> >> * Cobertura: GNU GPLv2
> >>
> >> == Cryptography ==
> >>
> >> Sqoop does not depend upon any cryptography tools or libraries.
> >>
> >> = Required Resources =
> >>
> >> == Mailing lists ==
> >>
> >> * sqoop-private (with moderated subscriptions)
> >> * sqoop-dev
> >> * sqoop-commits
> >> * sqoop-user
> >>
> >> == Subversion Directory ==
> >>
> >> https://svn.apache.org/repos/asf/incubator/sqoop
> >>
> >> == Issue Tracking ==
> >>
> >> JIRA Sqoop (SQOOP)
> >>
> >> == Other Resources ==
> >>
> >> The existing code already has unit and integration tests, so we would
> >> like a Hudson instance to run them whenever a new patch is submitted.
> >> This can be added after project creation.
> >>
> >> = Initial Committers =
> >>
> >> * Arvind Prabhakar (arvind at cloudera dot com)
> >> * Ahmed Radwan (a dot aboelela at gmail dot com)
> >> * Jonathan Hsieh (jon at cloudera dot com)
> >> * Aaron Kimball (kimballa at apache dot org)
> >> * Greg Cottman (greg dot cottman at quest dot com)
> >> * Guy le Mar (guy dot lemar at quest dot com)
> >> * Roman Shaposhnik (rvs at cloudera dot com)
> >> * Andrew Bayer (andrew at cloudera dot com)
> >>
> >> A CLA is already on file for Aaron Kimball.
> >>
> >> = Affiliations =
> >>
> >> * Arvind Prabhakar, Cloudera
> >> * Ahmed Radwan, Cloudera
> >> * Jonathan Hsieh, Cloudera
> >> * Aaron Kimball, Odiago
> >> * Greg Cottman, Quest
> >> * Guy le Mar, Quest
> >> * Roman Shaposhnik, Cloudera
> >> * Andrew Bayer, Cloudera
> >>
> >> = Sponsors =
> >>
> >> == Champion ==
> >>
> >> * Tom White (tomwhite at apache dot org)
> >>
> >> == Nominated Mentors ==
> >>
> >> * Patrick Hunt (phunt at apache dot org)
> >>
> >> == Sponsoring Entity ==
> >>
> >> * Apache Incubator PMC
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org