Apache OODT now has a workflow plugin that connects to Mesos: http://oodt.apache.org/
Cross-posting this to d...@oodt.apache.org so people like Mike Starch can
chime in.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Zameer Manji <zma...@apache.org>
Reply-To: "dev@aurora.incubator.apache.org" <dev@aurora.incubator.apache.org>
Date: Wednesday, March 11, 2015 at 3:21 PM
To: "dev@aurora.incubator.apache.org" <dev@aurora.incubator.apache.org>
Subject: Re: Data processing pipeline workflow management

>Hey,
>
>This is a great question. See my comments inline below.
>
>On Tue, Mar 10, 2015 at 8:28 AM, Lars Albertsson
><lars.alberts...@gmail.com> wrote:
>
>> We are evaluating Aurora as a workflow management tool for batch
>> processing pipelines. We basically need a tool that regularly runs
>> batch processes that are connected as producers/consumers of data,
>> typically stored in HDFS or S3.
>>
>> The alternative tools would be Azkaban, Luigi, and Oozie, but I am
>> hoping that building something on Aurora would result in a better
>> solution.
>>
>> Does anyone have experience with building workflows on Aurora? How
>> does Twitter handle batch pipelines? Would the approach below make
>> sense, or are there better suggestions? Is there anything related to
>> this on the roadmap, or available inside Twitter only?
>
>As far as I know, you are the first person to consider Aurora for
>workflow management for batch processing. Twitter does not currently
>use Aurora for batch pipelines.
>I'm not aware of the specifics of the design, but Twitter has an
>internal solution for pipelines built on Hadoop/YARN.
>Aurora is currently designed to be a service scheduler, and I'm not
>aware of any future plans to support workflows or batch computation.
>
>> In our case, the batch processes will be a mix of cluster
>> computations with Spark and single-node computations. We want the
>> latter to be scheduled on a farm as well, which is why we are
>> attracted to Mesos. In the text below, I'll call each part of a
>> pipeline a 'step', to avoid confusion with Aurora jobs and tasks.
>>
>> My unordered wishlist is:
>>
>> * Data pipelines consist of DAGs, where steps take one or more
>> inputs and generate one or more outputs.
>>
>> * Independent steps in the DAG execute in parallel, constrained by
>> resources.
>>
>> * Steps can be written in different languages and frameworks, some
>> of them clustered.
>>
>> * The developer code/test/debug cycle is quick, and all functional
>> tests can execute on a laptop.
>>
>> * Developers can test integrated data pipelines, consisting of
>> multiple steps, on laptops.
>>
>> * Steps and their inputs and outputs are parameterised, e.g. by
>> date. A parameterised step is typically independent of other
>> instances of the same step, e.g. joining one day's impression log
>> with user demographics. In some cases, steps depend on the previous
>> day's results, e.g. applying one day's user management operation log
>> to the user dataset from the day before.
>>
>> * Data pipelines are specified in embedded DSL files (e.g. .aurora
>> files), kept close to the business logic code.
>>
>> * Batch steps should be started soon after their input files become
>> available.
>>
>> * Steps should gracefully avoid recomputation when output files
>> already exist.
>>
>> * Backfilling a window back in time, e.g. 30 days, should happen
>> automatically if some earlier steps have failed, or if output files
>> have been deleted manually.
>>
>> * Continuous deployment, in the sense that steps are automatically
>> deployed and scheduled after 'git push'.
>>
>> * Step owners can get an overview of step status and history, and
>> can debug step execution, e.g. by accessing log files.
>>
>> I am aware that no framework will give us everything. It is a matter
>> of how much we need to live without or build ourselves.
>
>Your wishlist looks pretty reasonable for batch computation workflows.
>
>I'm not aware of any batch/workflow Mesos framework. If you want some
>or all of the above features on top of Mesos, I think you would be
>venturing into writing your own framework.
>Aurora doesn't have the concept of a DAG, and it can't make scheduling
>decisions based on job progress or HDFS state.
>
>--
>Zameer Manji
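For concreteness: the "embedded DSL files" wishlist item above maps
onto Aurora's Python-based configuration format, and Aurora's cron
support covers the time-triggered part of the wishlist (though not the
data-triggered part). Below is a minimal sketch of a cron-scheduled
batch step; the job name, command line, cluster, role, and resource
sizes are all invented for illustration.

# daily_join.aurora -- sketch of a cron-scheduled batch step.
# Everything named here (job, binary, cluster, role) is hypothetical.

run_join = Process(
  name = 'run_join',
  # Hypothetical binary that joins one day's impression log with
  # user demographics and writes the result to HDFS.
  cmdline = 'bin/join_impressions --date $(date -d yesterday +%Y-%m-%d)')

join_task = Task(
  name = 'join_task',
  processes = [run_join],
  resources = Resources(cpu = 2.0, ram = 4*GB, disk = 8*GB))

jobs = [Job(
  cluster = 'devcluster',        # assumed cluster name
  role = 'batch',                # assumed role
  environment = 'prod',
  name = 'daily_impression_join',
  task = join_task,
  cron_schedule = '0 2 * * *',   # run daily at 02:00
  cron_collision_policy = 'KILL_EXISTING')]

Note that cron_schedule fires on time alone; it cannot wait for input
files to appear, which matches Zameer's point that Aurora cannot make
scheduling decisions based on HDFS state.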
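Since Luigi is one of the alternatives named in the thread, it is also
worth sketching how several wishlist items look there: DAG
dependencies, date parameterisation, skipping steps whose outputs
already exist, and backfill. The task classes and HDFS paths below are
hypothetical.

import luigi
from luigi.contrib.hdfs import HdfsTarget

class ImpressionLog(luigi.ExternalTask):
    """Upstream data produced outside this pipeline (hypothetical)."""
    date = luigi.DateParameter()

    def output(self):
        return HdfsTarget(self.date.strftime('/logs/impressions/%Y-%m-%d'))

class JoinImpressions(luigi.Task):
    """One parameterised step: join a day's impressions with demographics."""
    date = luigi.DateParameter()

    def requires(self):
        # Declares the DAG edge; Luigi runs whatever is ready in
        # parallel, bounded by the number of workers.
        return ImpressionLog(self.date)

    def output(self):
        # If this target already exists, the step is skipped entirely.
        return HdfsTarget(self.date.strftime('/data/joined/%Y-%m-%d'))

    def run(self):
        with self.input().open('r') as src, self.output().open('w') as dst:
            for line in src:
                dst.write(line)  # placeholder for the real join logic

Backfilling a 30-day window then amounts to scheduling the last 30
dates, e.g. via a wrapper task whose requires() yields JoinImpressions
for each date; Luigi only recomputes the dates whose output targets
are missing.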