Apache OODT now has a workflow plugin that connects to Mesos: http://oodt.apache.org/
Cross-posting this to d...@oodt.apache.org so people like Mike Starch can
chime in.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Zameer Manji <zma...@apache.org>
Reply-To: "dev@aurora.incubator.apache.org" <dev@aurora.incubator.apache.org>
Date: Wednesday, March 11, 2015 at 3:21 PM
To: "dev@aurora.incubator.apache.org" <dev@aurora.incubator.apache.org>
Subject: Re: Data processing pipeline workflow management

>Hey,
>
>This is a great question. See my comments inline below.
>
>On Tue, Mar 10, 2015 at 8:28 AM, Lars Albertsson
><lars.alberts...@gmail.com> wrote:
>
>> We are evaluating Aurora as a workflow management tool for batch
>> processing pipelines. We basically need a tool that regularly runs
>> batch processes that are connected as producers/consumers of data,
>> typically stored in HDFS or S3.
>>
>> The alternative tools would be Azkaban, Luigi, and Oozie, but I am
>> hoping that building something on Aurora would result in a better
>> solution.
>>
>> Does anyone have experience with building workflows on Aurora? How
>> does Twitter handle batch pipelines? Would the approach below make
>> sense, or are there better suggestions? Is there anything related to
>> this on the roadmap, or available inside Twitter only?
>
>As far as I know, you are the first person to consider Aurora for
>workflow management for batch processing. Twitter does not currently
>use Aurora for batch pipelines.
>I'm not aware of the specifics of the design, but Twitter has an
>internal solution for pipelines built on Hadoop/YARN.
>Aurora is currently designed to be a service scheduler, and I'm not
>aware of any future plans to support workflows or batch computation.
>
>> In our case, the batch processes will be a mix of cluster
>> computations with Spark and single-node computations. We want the
>> latter to be scheduled on a farm as well, which is why we are
>> attracted to Mesos. In the text below, I'll call each part of a
>> pipeline a 'step', to avoid confusion with Aurora jobs and tasks.
>>
>> My unordered wishlist is:
>>
>> * Data pipelines consist of DAGs, where steps take one or more
>> inputs and generate one or more outputs.
>>
>> * Independent steps in the DAG execute in parallel, constrained by
>> resources.
>>
>> * Steps can be written in different languages and frameworks, some
>> of them clustered.
>>
>> * The developer code/test/debug cycle is quick, and all functional
>> tests can execute on a laptop.
>>
>> * Developers can test integrated data pipelines, consisting of
>> multiple steps, on laptops.
>>
>> * Steps and their inputs and outputs are parameterised, e.g. by
>> date. A parameterised step is typically independent of other
>> instances of the same step, e.g. joining one day's impression log
>> with user demographics. In some cases, steps depend on the previous
>> day's results, e.g. applying one day's user management operation log
>> to the user dataset from the day before.
>>
>> * Data pipelines are specified in embedded DSL files (e.g. .aurora
>> files), kept close to the business logic code.
>>
>> * Batch steps should be started soon after their input files become
>> available.
>>
>> * Steps should gracefully avoid recomputation when output files
>> already exist.
>>
>> * Backfilling a window back in time, e.g. 30 days, should happen
>> automatically if some earlier steps have failed, or if output files
>> have been deleted manually.
>>
>> * Continuous deployment, in the sense that steps are automatically
>> deployed and scheduled after 'git push'.
>>
>> * Step owners can get an overview of step status and history, and
>> can debug step execution, e.g. by accessing log files.
>>
>> I am aware that no framework will give us everything. It is a matter
>> of how much we need to live without or build ourselves.
>
>Your wishlist looks pretty reasonable for batch computation workflows.
>
>I'm not aware of any batch/workflow Mesos framework. If you want some
>or all of the above features on top of Mesos, I think you would be
>venturing into writing your own framework.
>Aurora doesn't have the concept of a DAG, and it can't make scheduling
>decisions based on job progress or HDFS state.
>
>--
>Zameer Manji
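For concreteness: the "embedded DSL files" wishlist item above maps
onto Aurora's Python-based configuration format, and Aurora's cron
support covers the time-triggered part of the wishlist (though not the
data-triggered part). Below is a minimal sketch of a cron-scheduled
batch step; the job name, command line, cluster, role, and resource
sizes are all invented for illustration.

# daily_join.aurora -- sketch of a cron-scheduled batch step.
# Everything named here (job, binary, cluster, role) is hypothetical.

run_join = Process(
  name = 'run_join',
  # Hypothetical binary that joins one day's impression log with
  # user demographics and writes the result to HDFS.
  cmdline = 'bin/join_impressions --date $(date -d yesterday +%Y-%m-%d)')

join_task = Task(
  name = 'join_task',
  processes = [run_join],
  resources = Resources(cpu = 2.0, ram = 4*GB, disk = 8*GB))

jobs = [Job(
  cluster = 'devcluster',        # assumed cluster name
  role = 'batch',                # assumed role
  environment = 'prod',
  name = 'daily_impression_join',
  task = join_task,
  cron_schedule = '0 2 * * *',   # run daily at 02:00
  cron_collision_policy = 'KILL_EXISTING')]

Note that cron_schedule fires on time alone; it cannot wait for input
files to appear, which matches Zameer's point that Aurora cannot make
scheduling decisions based on HDFS state.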
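Since Luigi is one of the alternatives named in the thread, it is also
worth sketching how several wishlist items look there: DAG
dependencies, date parameterisation, skipping steps whose outputs
already exist, and backfill. The task classes and HDFS paths below are
hypothetical.

import luigi
from luigi.contrib.hdfs import HdfsTarget

class ImpressionLog(luigi.ExternalTask):
    """Upstream data produced outside this pipeline (hypothetical)."""
    date = luigi.DateParameter()

    def output(self):
        return HdfsTarget(self.date.strftime('/logs/impressions/%Y-%m-%d'))

class JoinImpressions(luigi.Task):
    """One parameterised step: join a day's impressions with demographics."""
    date = luigi.DateParameter()

    def requires(self):
        # Declares the DAG edge; Luigi runs whatever is ready in
        # parallel, bounded by the number of workers.
        return ImpressionLog(self.date)

    def output(self):
        # If this target already exists, the step is skipped entirely.
        return HdfsTarget(self.date.strftime('/data/joined/%Y-%m-%d'))

    def run(self):
        with self.input().open('r') as src, self.output().open('w') as dst:
            for line in src:
                dst.write(line)  # placeholder for the real join logic

Backfilling a 30-day window then amounts to scheduling the last 30
dates, e.g. via a wrapper task whose requires() yields JoinImpressions
for each date; Luigi only recomputes the dates whose output targets
are missing.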