Hey,

This is a great question. See my comments inline below.
On Tue, Mar 10, 2015 at 8:28 AM, Lars Albertsson <lars.alberts...@gmail.com> wrote:

> We are evaluating Aurora as a workflow management tool for batch
> processing pipelines. We basically need a tool that regularly runs
> batch processes that are connected as producers/consumers of data,
> typically stored in HDFS or S3.
>
> The alternative tools would be Azkaban, Luigi, and Oozie, but I am
> hoping that something built on Aurora would result in a better
> solution.
>
> Does anyone have experience with building workflows with Aurora? How
> is Twitter handling batch pipelines? Would the approach below make
> sense, or are there better suggestions? Is there anything related to
> this on the roadmap, or available inside Twitter only?

As far as I know, you are the first person to consider Aurora for
workflow management of batch processing. Twitter does not currently use
Aurora for batch pipelines. I'm not aware of the specifics of the
design, but at Twitter there is an internal solution for pipelines
built upon Hadoop/YARN.

Aurora is currently designed to be a service scheduler, and I'm not
aware of any future plans to support workflows or batch computation.

> In our case, the batch processes will be a mix of cluster
> computations with Spark and single-node computations. We want the
> latter to also be scheduled on a farm, which is why we are attracted
> to Mesos. In the text below, I'll call each part of a pipeline a
> 'step', in order to avoid confusion with Aurora jobs and tasks.
>
> My unordered wishlist is:
>
> * Data pipelines consist of DAGs, where steps take one or more
> inputs and generate one or more outputs.
>
> * Independent steps in the DAG execute in parallel, constrained by
> resources.
>
> * Steps can be written in different languages and frameworks, some
> clustered.
>
> * The developer code/test/debug cycle is quick, and all functional
> tests can execute on a laptop.
>
> * Developers can test integrated data pipelines, consisting of
> multiple steps, on laptops.
>
> * Steps and their inputs and outputs are parameterised, e.g. by
> date. A parameterised step is typically independent of other
> instances of the same step, e.g. joining one day's impressions log
> with user demographics. In some cases, steps depend on yesterday's
> results, e.g. applying one day's user management operation log to
> the user dataset from the day before.
>
> * Data pipelines are specified in embedded DSL files (e.g. aurora
> files), kept close to the business logic code.
>
> * Batch steps should be started soon after their input files become
> available.
>
> * Steps should gracefully avoid recomputation when output files
> already exist.
>
> * Backfilling a window back in time, e.g. 30 days, should happen
> automatically if some earlier steps have failed, or if output files
> have been deleted manually.
>
> * Continuous deployment, in the sense that steps are automatically
> deployed and scheduled after 'git push'.
>
> * Step owners can get an overview of step status and history, and
> can debug step execution, e.g. by accessing log files.
>
> I am aware that no framework will give us everything. It is a matter
> of how much we need to live without or build ourselves.

Your wishlist looks pretty reasonable for batch computation workflows.
I'm not aware of any batch/workflow Mesos framework. If you want some
or all of the above features on top of Mesos, I think you would be
venturing into writing your own framework.
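To make that concrete: a .aurora file is an embedded Python DSL, but it
only describes a single job and how to run its processes. A rough
sketch is below (the names, command, resources, and schedule are
invented for illustration, and exact fields may vary between releases):

# my_step.aurora -- hypothetical example, not a real pipeline step
run_step = Process(
  name = 'run_step',
  # could be a spark-submit or any single-node batch command
  cmdline = './run_my_step.sh --date 2015-03-09')

step_task = Task(
  name = 'step_task',
  processes = [run_step],
  resources = Resources(cpu = 2.0, ram = 1024*MB, disk = 2048*MB))

jobs = [
  Job(
    cluster = 'devcluster',
    role = 'pipelines',
    environment = 'devel',
    name = 'my_step',
    task = step_task,
    # Aurora can run a job on a cron schedule, but there is no notion
    # of "run when the inputs exist" or "run after step X finishes".
    cron_schedule = '0 4 * * *')
]

You can attach a cron_schedule to a job, but that is about as far as
the scheduling primitives go; dependency tracking, input availability,
and backfilling would all have to live in whatever you build on top.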
Aurora doesn't have the concept of a DAG, and it can't make scheduling
decisions based on job progress or HDFS state.

--
Zameer Manji