Cool project, +1

On Thu, Jun 30, 2011 at 2:23 PM, Arvind Prabhakar <arv...@apache.org> wrote:
> +1 (non-binding)
>
> Thanks,
> Arvind
>
> On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam <misla...@yahoo.com> wrote:
>> Hi All,
>>
>> The discussion about the Oozie proposal is settling down. Therefore, I would
>> like to initiate a vote to accept Oozie as an Apache Incubator project.
>>
>> The latest proposal is pasted at the end, and it can also be found on the
>> wiki:
>>
>> http://wiki.apache.org/incubator/OozieProposal
>>
>>
>> The related discussion thread is at:
>> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>>
>>
>> Please cast your votes:
>>
>> [  ] +1 Accept Oozie for incubation
>> [  ] +0 Indifferent to Oozie incubation
>> [  ] -1 Reject Oozie for incubation
>>
>> This vote will close 72 hours  from now.
>>
>> Regards,
>> Mohammad
>>
>>
>> Abstract
>> Oozie is a server-based workflow scheduling and coordination system to manage
>> data processing jobs for Apache Hadoop™.
>>
>> Proposal
>> Oozie is an extensible, scalable, and reliable system to define, manage,
>> schedule, and execute complex Hadoop workloads via web services. More
>> specifically, this includes:
>>
>>        * XML-based declarative framework to specify a job or a complex
>>          workflow of dependent jobs.
>>        * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
>>          Streaming, Pig, Hive, and custom Java applications.
>>        * Workflow scheduling based on frequency and/or data availability.
>>        * Monitoring capability, automatic retry, and failure handling of jobs.
>>        * Extensible and pluggable architecture to allow arbitrary grid
>>          programming paradigms.
>>        * Authentication, authorization, and capacity-aware load throttling
>>          to allow multi-tenant software as a service.
>>
>> Background
>> Most data processing applications require multiple jobs to achieve their
>> goals, with inherent dependencies among the jobs. A dependency could be
>> sequential, where one job can only start after another job has finished. Or
>> it could be conditional, where the execution of a job depends on the return
>> value or status of another job. In other cases, parallel execution of
>> multiple jobs may be permitted – or desired – to exploit the massive pool of
>> compute nodes provided by Hadoop.
>>
>> These job dependencies are often expressed as a Directed Acyclic Graph, also
>> called a workflow. A node in the workflow is typically a job (a computation
>> on the grid) or another type of action such as an email notification.
>> Computations can be expressed in map/reduce, Pig, Hive, or any other
>> programming paradigm available on the grid. Edges of the graph represent
>> transitions from one node to the next as the execution of a workflow
>> proceeds.
>>
>> Describing a workflow in a declarative way has the advantage of decoupling
>> job dependencies and execution control from application logic. Furthermore,
>> the workflow is modularized into jobs that can be reused within the same
>> workflow or across different workflows. Execution of the workflow is then
>> driven by a runtime system without understanding the application logic of
>> the jobs. This runtime system specializes in reliable and predictable
>> execution: it can retry actions that have failed or invoke a cleanup action
>> after termination of the workflow; it can monitor progress, success, or
>> failure of a workflow, and send appropriate alerts to an administrator. The
>> application developer is relieved from implementing these generic
>> procedures.
>>
>> Furthermore, some applications or workflows need to run at periodic
>> intervals or when dependent data is available. For example, a workflow could
>> be executed every day as soon as output data from the previous 24 instances
>> of another, hourly workflow is available. The workflow coordinator provides
>> such scheduling features, along with prioritization, load balancing, and
>> throttling to optimize utilization of resources in the cluster. This makes
>> it easier to maintain, control, and coordinate complex data applications.
>>
>> Nearly three years ago, a team of Yahoo! developers addressed these critical
>> requirements for Hadoop-based data processing systems by developing a new
>> workflow management and scheduling system called Oozie. While it was
>> initially developed as a Yahoo!-internal project, it was designed and
>> implemented with the intention of open-sourcing. Oozie was released as a
>> GitHub project in early 2010. Oozie is used in production within Yahoo!, and
>> since being open-sourced it has been gaining adoption among external
>> developers.
>>
>> Rationale
>> Commonly, applications that run on Hadoop require multiple Hadoop jobs in
>> order to obtain the desired results. Furthermore, these Hadoop jobs are
>> commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
>> Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs,
>> and shell scripts.
>>
>> Because of this, developers find themselves writing ad-hoc glue programs to
>> combine these Hadoop jobs. These ad-hoc programs are difficult to schedule,
>> manage, monitor, and recover.
>>
>> Workflow management and scheduling is an essential feature for large-scale
>> data processing applications. Each such application could write its own
>> customized solution, but that would incur separate development, operational,
>> and maintenance overhead. Since this is a prevalent use case for data
>> processing, the application developer would surely prefer a generalized
>> solution with little or no such overhead. Oozie addresses the challenge by
>> providing an execution framework to flexibly specify the job dependency,
>> data dependency, and time dependency. In addition, Oozie provides a
>> multi-tenant centralized service and the opportunity to optimize load and
>> utilization while respecting SLAs.
>>
>> Oozie is built on Apache Hadoop™ to schedule jobs related to various Apache
>> projects such as Hadoop, Pig, and Hive. As an Apache open source project,
>> Oozie is expected to attract the larger and more diversified community that
>> currently uses such Apache-sponsored projects. Additionally, users of the
>> Hadoop ecosystem can influence Oozie’s roadmap and contribute to it.
>> Likewise, Oozie, as part of the Apache Hadoop™ ecosystem, will be a great
>> benefit to the current Hadoop/Pig/Hive/HBase/HCatalog community.
>>
>> Current Status
>> Meritocracy
>> Oozie is currently a GitHub-based open source project to which developers
>> from multiple companies contribute. Our intent with this incubator proposal
>> is to further extend this diverse developer community around Oozie following
>> the Apache meritocracy model. We plan to continue to provide adequate
>> support to new developers and to quickly recruit those who make solid
>> contributions to committer status. In addition, Oozie will expect, accept,
>> and work to attract contributions from amateurs as well.
>>
>> Community
>> While an efficient workflow management and scheduling system is critical for
>> large companies with huge data processing in multi-tenant clusters, it is
>> equally necessary for any non-trivial deployment. Different companies are
>> currently using Oozie as a workflow scheduler for Hadoop-based data
>> processing. At Yahoo! it is being used extensively in production clusters to
>> process thousands of jobs. Like the Oozie user community, the Oozie developer
>> community is also very strong. Developers from Yahoo! provided the initial
>> code base, and they are still the most active contributors. In late 2010,
>> developers from Cloudera also started contributing, and currently other
>> companies (e.g., IBM) are beginning to participate.
>>
>> We currently use JIRA for issue tracking, GitHub for code hosting, and Yahoo!
>> Groups for developer and user communications.
>>
>> Core Developers
>> Oozie is currently being designed and developed by four engineers from
>> Yahoo! – Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann.
>> In addition, many outside contributors are actively contributing to design
>> and development. Among them, Alejandro Abdelnur from Cloudera and Chao Wang
>> from IBM are very important contributors. All of these core developers have
>> deep expertise in Hadoop and the Hadoop ecosystem in general.
>>
>> Alignment
>> The ASF is a natural host for Oozie given that it is already the home of
>> Hadoop, Pig, Hive, and other emerging cloud software projects. Oozie was
>> designed to support Hadoop from the beginning in order to solve data
>> processing challenges in Hadoop clusters. Oozie complements the existing
>> Apache cloud computing projects by providing a flexible framework for
>> managing complex data processing tasks.
>>
>> Known Risks
>> Orphaned Products
>> The core developers plan to work full time on the project. There is very
>> little risk of Oozie getting orphaned since large companies like Yahoo! are
>> extensively using it on their production Hadoop clusters. For example, there
>> are nearly 400 Yahoo!-internal Oozie users, and thousands of jobs are
>> processed hourly through Oozie in production. In addition, there are nearly
>> 400 active users (including Yahoo!-internal and external) in the email
>> community, where nearly 15 emails are exchanged per day. Furthermore, there
>> were more than 1500 downloads of the Oozie binary from the GitHub site in
>> the last eight months, and a large number of these downloads were made by
>> other companies such as Cloudera. Oozie has had three major releases and
>> more than 15 patch releases in the last couple of years, which further
>> demonstrates that Oozie is a very active project. We plan to extend and
>> diversify this community further through Apache.
>>
>> Inexperience with Open Source
>> The core developers are all active users and followers of open source. They
>> are already committers and contributors to the Oozie GitHub project. In
>> addition, they are very familiar with Apache principles and philosophy for
>> community-driven software development.
>>
>> Homogeneous Developers
>> The core developers are from Yahoo! as well as from several other
>> corporations, including Cloudera and IBM.
>>
>> Reliance on Salaried Developers
>> Currently, the developers are paid to work on Oozie. Companies like Yahoo!
>> and Cloudera are invested in Oozie as the solution to the workflow
>> management and scheduling problem in Hadoop clusters, and that is not likely
>> to change. In addition, since workflow management is very important for most
>> Hadoop-based data processing, non-salaried developers and researchers from
>> various institutes are expected to contribute to the project.
>>
>> Relationships with Other Apache Products
>> Oozie is based on Apache Hadoop to manage jobs created by different Apache
>> projects such as Hadoop, Pig, and Hive. Users of these products are
>> extensively using Oozie as their workflow scheduler.
>>
>> An Excessive Fascination with the Apache Brand
>> We deeply respect the reputation of Apache and have had great success with
>> other Apache projects such as Pig and HCatalog. We are motivated to expand
>> and increase the adoption and development of Oozie following Apache’s
>> established open source model. We have also given reasons in the Rationale
>> and Alignment sections.
>>
>> Documentation
>> Information about Oozie can be found at http://yahoo.github.com/oozie/. The
>> following links provide more information about Oozie in open source:
>>
>>        * Codebase at GitHub: https://github.com/yahoo/oozie
>>        * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
>>        * Continuous Integration (CI) build:
>>          http://oozie-ci.hadoop.developer.yahoo.net/
>>        * Yahoo user community:
>>          http://tech.groups.yahoo.com/group/Oozie-users/
>> Initial Source
>> Oozie has been under development since 2009 by a team of engineers at
>> Yahoo!. It is currently hosted on GitHub under an Apache license at
>> https://github.com/yahoo/oozie.
>>
>> External Dependencies
>> The required external dependencies are all under the Apache License or
>> compatible licenses. The components with non-Apache licenses are listed
>> below:
>>
>>        * HSQLDB License: HSQLDB
>>        * JDOM license: JDOM
>>        * BSD: Serp
>>        * CDDL v1: jaxb-api, ejb, JAF
>> NOTE: With the exception of HSQLDB and JDOM, which are used directly by
>> Oozie, the listed components are transitive dependencies of other Apache
>> components used by Oozie.
>>
>> Cryptography
>> Oozie supports the Kerberos authentication mechanism to access secured Hadoop
>> services.
>>
>> Required Resources
>> Mailing Lists
>>        * oozie-private for private PMC discussions (with moderated
>>          subscriptions)
>>        * oozie-dev
>>        * oozie-commits
>>        * oozie-user
>> Subversion Directory
>> https://svn.apache.org/repos/asf/incubator/oozie
>> Issue Tracking
>> JIRA Oozie (OOZIE)
>> Other Resources
>> The existing code already has unit tests, so we would like a Hudson instance
>> to run them whenever a new patch is submitted. This can be added after
>> project creation.
>>
>> Initial Committers
>>        * Mohammad K Islam (mislam77 at yahoo  dot com)
>>        * Angelo K Huang (angelohuang at gmail dot com)
>>        * Mayank Bansal (mabansal at gmail dot com)
>>        * Andreas Neumann (neunand at gmail dot com)
>>        * Alejandro Abdelnur (tucu00 at gmail dot com)
>>        * Chao Wang (brookwc at gmail dot com)
>> Affiliations
>>        * Mohammad K Islam (Yahoo!)
>>        * Angelo Huang (Yahoo!)
>>        * Mayank Bansal (Yahoo!)
>>        * Andreas Neumann (Yahoo!)
>>        * Alejandro Abdelnur (Cloudera)
>>        * Chao Wang (IBM)
>> Sponsors
>> Champion
>> Alan Gates
>> Nominated Mentors
>>        * Owen O'Malley (Incubator PMC member)
>>        * Alan Gates (Incubator PMC member)
>>        * Christopher Douglas (Incubator PMC member)
>>        * Devaraj Das (Hadoop PMC member)
>> Sponsoring Entity
>> We are requesting the Incubator to sponsor this project.
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
