Re: [PROPOSAL] Oozie for the Apache Incubator

Mayank Bansal Sun, 26 Jun 2011 17:04:43 -0700

+1

Thanks a lot, team. I look forward to contributing more to the project.


Thanks,
Mayank

On Sun, Jun 26, 2011 at 4:24 PM, Thilina Gunarathne <cset...@gmail.com> wrote:

> +1..  Very interesting stuff..
>
> thanks,
> Thilina
>
> On Sun, Jun 26, 2011 at 7:12 PM, Suresh Marru <sma...@apache.org> wrote:
>
> > Interesting Project. Time permitting, I would like to contribute to the
> > workflow effort
> >
> > --Suresh
> >
> > On Jun 24, 2011, at 3:46 PM, Mohammad Islam wrote:
> >
> > > Hi,
> > >
> > > I would like to propose Oozie to be an Apache Incubator project.
> > > Oozie is a server-based workflow scheduling and coordination system
> > > to manage data processing jobs for Apache Hadoop.
> > >
> > >
> > > Here's a link to the proposal in the Incubator wiki
> > > http://wiki.apache.org/incubator/OozieProposal
> > >
> > >
> > > I've also pasted the initial contents below.
> > >
> > > Regards,
> > >
> > > Mohammad Islam
> > >
> > >
> > > Start of Oozie Proposal
> > >
> > > Abstract
> > > Oozie is a server-based workflow scheduling and coordination system
> > > to manage data processing jobs for Apache Hadoop™.
> > >
> > > Proposal
> > > Oozie is an extensible, scalable and reliable system to define,
> > > manage, schedule, and execute complex Hadoop workloads via web
> > > services. More specifically, this includes:
> > >
> > >       * XML-based declarative framework to specify a job or a complex
> > >         workflow of dependent jobs.
> > >       * Support for different types of jobs such as Hadoop Map-Reduce,
> > >         Pipes, Streaming, Pig, Hive and custom Java applications.
> > >       * Workflow scheduling based on frequency and/or data
> > >         availability.
> > >       * Monitoring capability, automatic retry and failure handling
> > >         of jobs.
> > >       * Extensible and pluggable architecture to allow arbitrary grid
> > >         programming paradigms.
> > >       * Authentication, authorization, and capacity-aware load
> > >         throttling to allow multi-tenant software as a service.
> > >
> > > Background
> > > Most data processing applications require multiple jobs to achieve
> > > their goals, with inherent dependencies among the jobs. A dependency
> > > could be sequential, where one job can only start after another job
> > > has finished. Or it could be conditional, where the execution of a
> > > job depends on the return value or status of another job. In other
> > > cases, parallel execution of multiple jobs may be permitted – or
> > > desired – to exploit the massive pool of compute nodes provided by
> > > Hadoop.
> > >
> > > These job dependencies are often expressed as a Directed Acyclic
> > > Graph, also called a workflow. A node in the workflow is typically a
> > > job (a computation on the grid) or another type of action such as an
> > > email notification. Computations can be expressed in map/reduce, Pig,
> > > Hive or any other programming paradigm available on the grid. Edges
> > > of the graph represent transitions from one node to the next, as the
> > > execution of a workflow proceeds.
> > >
> > > Describing a workflow in a declarative way has the advantage of
> > > decoupling job dependencies and execution control from application
> > > logic. Furthermore, the workflow is modularized into jobs that can be
> > > reused within the same workflow or across different workflows.
> > > Execution of the workflow is then driven by a runtime system that
> > > does not need to understand the application logic of the jobs. This
> > > runtime system specializes in reliable and predictable execution: it
> > > can retry actions that have failed or invoke a cleanup action after
> > > termination of the workflow; it can monitor progress, success, or
> > > failure of a workflow, and send appropriate alerts to an
> > > administrator. The application developer is relieved from
> > > implementing these generic procedures.
> > >
> > > Furthermore, some applications or workflows need to run at periodic
> > > intervals or when dependent data is available. For example, a
> > > workflow could be executed every day as soon as output data from the
> > > previous 24 instances of another, hourly workflow is available. The
> > > workflow coordinator provides such scheduling features, along with
> > > prioritization, load balancing and throttling to optimize utilization
> > > of resources in the cluster. This makes it easier to maintain,
> > > control, and coordinate complex data applications.
> > >
> > > Nearly three years ago, a team of Yahoo! developers addressed these
> > > critical requirements for Hadoop-based data processing systems by
> > > developing a new workflow management and scheduling system called
> > > Oozie. While it was initially developed as a Yahoo!-internal project,
> > > it was designed and implemented with the intention of open-sourcing.
> > > Oozie was released as a GitHub project in early 2010. Oozie is used
> > > in production within Yahoo! and, since it was open-sourced, it has
> > > been gaining adoption with external developers.
> > >
> > > Rationale
> > > Commonly, applications that run on Hadoop require multiple Hadoop
> > > jobs in order to obtain the desired results. Furthermore, these
> > > Hadoop jobs are commonly a combination of Java map-reduce jobs,
> > > Streaming map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive
> > > jobs, HDFS operations, Java programs and shell scripts.
> > >
> > > Because of this, developers find themselves writing ad-hoc glue
> > > programs to combine these Hadoop jobs. These ad-hoc programs are
> > > difficult to schedule, manage, monitor and recover.
> > >
> > > Workflow management and scheduling is an essential feature for
> > > large-scale data processing applications. Each such application
> > > could build its own customized solution, but that would incur
> > > separate development, operational, and maintenance overhead. Since
> > > this is a prevalent use case for data processing, application
> > > developers would prefer a generalized solution with little or no such
> > > overhead. Oozie addresses the challenge by providing an execution
> > > framework to flexibly specify the job dependency, data dependency,
> > > and time dependency. In addition, Oozie provides a multi-tenant
> > > centralized service and the opportunity to optimize load and
> > > utilization while respecting SLAs.
> > >
> > > Oozie is built on Apache Hadoop to schedule jobs related to various
> > > Apache projects such as Hadoop, Pig, and Hive. As an Apache open
> > > source project, Oozie is expected to attract the larger and more
> > > diversified community that currently uses such Apache-sponsored
> > > projects. Additionally, users of the Hadoop ecosystem can influence
> > > Oozie’s roadmap, and contribute to it. Likewise, Oozie, as part of
> > > the Apache Hadoop ecosystem, will be a great benefit to the current
> > > Hadoop/Pig/Hive/HBase/HCatalog community.
> > >
> > > Current Status
> > > Meritocracy
> > > Oozie is currently a GitHub-based open source project where
> > > developers from multiple companies are contributing to the project.
> > > Our intent with this incubator proposal is to further extend this
> > > diverse developer community around Oozie following the Apache
> > > meritocracy model. We plan to continue to provide adequate support to
> > > new developers and to quickly recruit those who make solid
> > > contributions to committer status. In addition, Oozie will expect,
> > > accept, and work to attract contributions from amateurs as well.
> > >
> > > Community
> > > While an efficient workflow management and scheduling system is
> > > critical for large companies with huge data processing in
> > > multi-tenant clusters, it is equally necessary for any non-trivial
> > > deployment. Different companies are currently using Oozie as a
> > > workflow scheduler for Hadoop-based data processing. At Yahoo! it is
> > > being used extensively in production clusters to process thousands of
> > > jobs. Like the Oozie user community, the Oozie developer community is
> > > also very strong. Developers from Yahoo! provided the initial code
> > > base, and they are still the most active contributors. In late 2010,
> > > developers from Cloudera also started contributing, and currently
> > > other companies (e.g., IBM) are beginning to participate.
> > >
> > > We currently use JIRA for issue tracking, GitHub for code hosting
> > > and Yahoo! Groups for developer and user communications.
> > >
> > > Core Developers
> > > Oozie is currently being designed and developed by four engineers
> > > from Yahoo!: Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas
> > > Neumann. In addition, many outside contributors are actively
> > > contributing in design and development. Among them, Alejandro
> > > Abdelnur from Cloudera and Chao Wang from IBM are very important
> > > contributors. All of these core developers have deep expertise in
> > > Hadoop and the Hadoop Ecosystem in general.
> > >
> > > Alignment
> > > The ASF is a natural host for Oozie given that it is already the
> > > home of Hadoop, Pig, Hive, and other emerging cloud software
> > > projects. Oozie was designed to support Hadoop from the beginning in
> > > order to solve data processing challenges in Hadoop clusters. Oozie
> > > complements the existing Apache cloud computing projects by providing
> > > a flexible framework for managing complex data processing tasks.
> > >
> > > Known Risks
> > > Orphaned Products
> > > The core developers plan to work full time on the project. There is
> > > very little risk of Oozie getting orphaned since large companies like
> > > Yahoo! are extensively using it on their production Hadoop clusters.
> > > For example, there are nearly 400 Yahoo! internal Oozie users and
> > > thousands of jobs are processed hourly through Oozie in production.
> > > In addition, there are nearly 400 active users (including Yahoo!
> > > internal and external) in the email community where nearly 15 emails
> > > are exchanged per day. Furthermore, there were more than 1500
> > > downloads of the Oozie binary in the last eight months from the
> > > GitHub site and a large number of downloads were conducted by other
> > > companies such as Cloudera. Oozie has had three major releases and
> > > more than 15 patch releases in the last couple of years, which
> > > further demonstrates that Oozie is a very active project. We plan to
> > > extend and diversify this community further through Apache.
> > >
> > > Inexperience with Open Source
> > > The core developers are all active users and followers of open
> > > source. They are already committers and contributors to the Oozie
> > > GitHub project. In addition, they are very familiar with Apache
> > > principles and philosophy for community-driven software development.
> > >
> > > Homogeneous Developers
> > > The core developers are from Yahoo! as well as from several other
> > > corporations, including Cloudera and IBM.
> > >
> > > Reliance on Salaried Developers
> > > Currently, the developers are paid to do work on Oozie. Companies
> > > like Yahoo! and Cloudera are invested in Oozie as the solution to the
> > > workflow management and scheduling problem in Hadoop clusters, and
> > > that is not likely to change. In addition, since workflow management
> > > is very important for most Hadoop-based data processing, non-salaried
> > > developers and researchers from various institutes are expected to
> > > contribute to the project.
> > >
> > > Relationships with Other Apache Products
> > > Oozie is based on Apache Hadoop to manage jobs created by different
> > > Apache projects such as Hadoop, Pig, and Hive. Users of these
> > > products are extensively using Oozie as their workflow scheduler.
> > >
> > > An Excessive Fascination with the Apache Brand
> > > We deeply respect the reputation of Apache and have had great success
> > > with other Apache projects such as Pig and HCatalog. We are motivated
> > > to expand and increase the adoption and development of Oozie
> > > following Apache’s established open source model. We have also given
> > > reasons in the Rationale and Alignment sections.
> > >
> > > Documentation
> > > Information about Oozie can be found at
> > > http://yahoo.github.com/oozie/. The following links provide more
> > > information about Oozie in open source:
> > >
> > >       * Codebase at GitHub: https://github.com/yahoo/oozie
> > >       * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
> > >       * Continuous Integration (CI) build:
> > >         http://oozie-ci.hadoop.developer.yahoo.net/
> > >       * Yahoo user community:
> > >         http://tech.groups.yahoo.com/group/Oozie-users/
> > > Initial Source
> > > Oozie has been under development since 2009 by a team of engineers
> > > at Yahoo!. It is currently hosted on GitHub under an Apache license
> > > at https://github.com/yahoo/oozie.
> > >
> > > External Dependencies
> > > The required external dependencies are all under the Apache License
> > > or compatible licenses. The components with non-Apache licenses are
> > > enumerated below:
> > >
> > >       * HSQLDB License: HSQLDB
> > >       * JDOM license: JDOM
> > >       * BSD: Serp
> > >       * CDDL v1: jaxb-api, ejb, JAF
> > > NOTE: With the exception of HSQLDB and JDOM, which are directly used
> > > by Oozie, the other listed components are transitive dependencies of
> > > other Apache components used by Oozie.
> > >
> > > Cryptography
> > > Oozie supports the Kerberos authentication mechanism to access
> > > secured Hadoop services.
> > >
> > > Required Resources
> > > Mailing Lists
> > >       * oozie-private for private PMC discussions (with moderated
> > >         subscriptions)
> > >       * oozie-dev
> > >       * oozie-commits
> > >       * oozie-user
> > > Subversion Directory
> > > https://svn.apache.org/repos/asf/incubator/oozie
> > > Issue Tracking
> > > JIRA Oozie (OOZIE)
> > > Other Resources
> > > The existing code already has unit tests, so we would like a Hudson
> > > instance to run them whenever a new patch is submitted. This can be
> > > added after project creation.
> > >
> > > Initial Committers
> > >       * Mohammad K Islam (mislam77 at yahoo dot com)
> > >       * Angelo K Huang (angelohuang at gmail dot com)
> > >       * Mayank Bansal (mabansal at gmail dot com)
> > >       * Andreas Neumann (neunand at gmail dot com)
> > >       * Alejandro Abdelnur (tucu00 at gmail dot com)
> > >       * Chao Wang (brookwc at gmail dot com)
> > > Affiliations
> > >       * Mohammad K Islam (Yahoo!)
> > >       * Angelo Huang (Yahoo!)
> > >       * Mayank Bansal (Yahoo!)
> > >       * Andreas Neumann (Yahoo!)
> > >       * Alejandro Abdelnur (Cloudera)
> > >       * Chao Wang (IBM)
> > > Sponsors
> > > Champion
> > > Alan Gates
> > > Nominated Mentors
> > >       * Owen O'Malley (Incubator PMC member)
> > >       * Alan Gates (Incubator PMC member)
> > >       * Christopher Douglas (Incubator PMC member)
> > >       * Devaraj Das (Hadoop PMC member)
> > > Sponsoring Entity
> > > We are requesting the Incubator to sponsor this project.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org
>



-- 
Thanks and Regards,
Mayank
Cell: 408-718-9370
