Hi Lars,

Thanks for the brain dump. The points you made about target audience, degree of high availability, and time-based vs. event-based scheduling are all valid and make sense.

In our case, most of our devs are .NET based, so XML- or web-based scheduling is preferred over something written in Java/Scala/Python. Based on my research so far into the workflow managers available today, Azkaban is the easiest to adopt, since it has no hard dependency on Hadoop and makes it simple to onboard and schedule jobs. I was able to install it and execute some Spark workflows in a day. The fact that it is being phased out at LinkedIn is troubling, but I still think it is the best fit for our use case today.
Sent from Outlook

On Sun, Aug 9, 2015 at 4:51 PM -0700, "Lars Albertsson" <lars.alberts...@gmail.com> wrote:

I used to maintain Luigi at Spotify, and got some insight into workflow manager characteristics and production behaviour in the process.

I am evaluating options for my current employer, and the short list is basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The latter is not necessarily more work than adapting an existing tool, since existing managers are typically more or less tied to the technology used by the company that created them.

Are your users primarily developers building pipelines that drive data-intensive products, or are they analysts producing business intelligence? These groups tend to have preferences for different types of tools and interfaces.

I have a love/hate relationship with Luigi, but given your requirements, it is probably the best fit:

* It has support for Spark, and it seems to be used and maintained.

* It has no builtin support for Cassandra, but Cassandra is heavily used at Spotify. IIRC, the code required to support Cassandra targets is more or less trivial. There is no obvious single definition of a dataset in C*, so you'll have to come up with a convention and encode it as a Target subclass. I guess that is why it never made it outside Spotify.

* The open source community is active, and it is well tested in production at multiple sites.

* It is easy to write dependencies, but in a Python DSL. If your users are developers, this is preferable over XML or a web interface. There are always quirks and odd constraints somewhere that require the expressive power of a programming language. It also allows you to create extensions without changing Luigi itself.

* It does not have recurring scheduling built in. Luigi needs a motor to get going, typically cron, installed on a few machines for redundancy. In a typical pipeline scenario, you give output datasets a time parameter, which arranges for a dataset to be produced each hour/day/week/month.

* It supports failure notifications.

Pinball and Airflow have a similar architecture to Luigi, with a single central scheduler and workers that submit and execute jobs. They seem to be more solidly engineered at a glance, but less battle tested outside Pinterest/Airbnb, and they have fewer integrations with the data ecosystem.

Azkaban has a different architecture and user interface, and seems more geared towards data scientists than developers; it has a good UI for controlling jobs, but writing extensions and controlling it programmatically seems more difficult than for Luigi.

All of the tools above are centralised, and the central component can become a bottleneck and a single point of failure. I am not aware of any decentralised open source workflow managers, but you can run multiple instances and shard manually.

Regarding recurring jobs, it is typically undesirable to blindly run jobs at a certain time. If you run jobs, e.g. with cron, and process whatever data is available in your input sources, your jobs become nondeterministic and unreliable. If incoming data is late or missing, your jobs will fail or create artificial skews in the output data, leading to confusing results. Moreover, if jobs fail or have bugs, it will be difficult to rerun them and get predictable results. This is why I don't think Chronos is a meaningful alternative for scheduling data processing.
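To make that pattern concrete, here is a minimal sketch of what the above might look like in Luigi: dependencies expressed in the Python DSL, a time parameter on the output dataset, and a hand-rolled Cassandra target. This is not code from Luigi or Spotify; the CassandraTarget marker convention, the keyspace/table names, the upstream task, and the spark-submit invocation are assumptions made up for the example, and the Cassandra lookup assumes the DataStax Python driver.

import subprocess
import luigi
from cassandra.cluster import Cluster  # DataStax Python driver, assumed installed


class CassandraTarget(luigi.Target):
    # Hypothetical convention: a dataset counts as done when a marker row
    # exists in a dataset_markers table. Luigi only requires exists();
    # the convention itself is something you have to define.
    def __init__(self, keyspace, dataset, date):
        self.keyspace = keyspace
        self.dataset = dataset
        self.date = date

    def exists(self):
        session = Cluster().connect(self.keyspace)
        rows = session.execute(
            "SELECT dataset FROM dataset_markers "
            "WHERE dataset=%s AND day=%s",
            (self.dataset, self.date.isoformat()))
        return len(list(rows)) > 0


class RawEvents(luigi.ExternalTask):
    # Hypothetical upstream dataset produced by another pipeline.
    date = luigi.DateParameter()

    def output(self):
        return CassandraTarget("raw", "events", self.date)


class DailyAggregate(luigi.Task):
    # One instance per day; cron (the "motor") just invokes Luigi regularly.
    date = luigi.DateParameter()

    def requires(self):
        # Dependencies are plain Python: this job runs after RawEvents.
        return RawEvents(date=self.date)

    def output(self):
        return CassandraTarget("analytics", "daily_aggregate", self.date)

    def run(self):
        subprocess.check_call([
            "spark-submit", "--class", "com.example.DailyAggregate",
            "/opt/jobs/daily-aggregate.jar", self.date.isoformat()])

Running something like "luigi --module jobs DailyAggregate --date 2015-08-09" from cron then asks for one dataset per day, and repeated invocations are harmless because the task is skipped once its target already exists.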
There are different strategies on this topic, but IMHO it is easiest to create predictable and reliable pipelines by bucketing incoming data into datasets that you seal off and mark ready for processing, and then using the workflow manager's DAG logic to process data when input datasets are available, rather than at a certain time. If you use Kafka for data collection, Secor can handle this logic for you. (A rough sketch of this pattern is appended after the quoted message below.)

In addition to your requirements, there are IMHO a few more topics one needs to consider:

* How are pipelines tested? I.e. if I change job B below, how can I be sure that the new output does not break A? You need to involve the workflow DAG in testing such scenarios.

* How do you debug jobs and DAG problems? In case of trouble, can you figure out where the job logs are, or why a particular job does not start?

* Do you need high availability for job scheduling? That will require additional components.

This became a bit of a brain dump on the topic. I hope that it is useful. Don't hesitate to get back if I can help.

Regards,

Lars Albertsson

On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule
> Spark jobs on a DataStax Cassandra cluster. Since there are tonnes of
> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to
> check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for are:
>
> 1. First class support for submitting Spark jobs on Cassandra. Not some
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencies using XML or a web interface.
> E.g. job A depends on job B and job C, so run job A after B and C are
> finished. Don't need to write full blown Java applications to specify job
> parameters and dependencies. Should be very simple to use.
> 4. Time based recurrent scheduling. Run the Spark jobs at a given time
> every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on a daily
> basis.
>
> I have looked at Ooyala's Spark Job Server, which seems to be geared towards
> making Spark jobs run faster by sharing contexts between the jobs, but isn't
> a full blown workflow engine per se. A combination of Spark Job Server and a
> workflow engine would be ideal.
>
> Thanks for the inputs
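As referenced above, a rough Luigi-flavoured sketch of the bucketing pattern. The HDFS paths, the _READY/_SUCCESS marker convention, and the task names are assumptions for illustration; the collection side (e.g. Secor draining Kafka) would be what actually seals an hourly bucket and writes the marker.

import luigi
from luigi.contrib.hdfs import HdfsTarget


class SealedEventBucket(luigi.ExternalTask):
    # An hourly bucket of incoming events, written outside this pipeline.
    # The bucket only counts as available once it has been sealed and a
    # _READY marker written; until then, downstream tasks simply wait.
    hour = luigi.DateHourParameter()

    def output(self):
        return HdfsTarget(
            self.hour.strftime("/data/events/%Y/%m/%d/%H/_READY"))


class HourlyReport(luigi.Task):
    # Triggered by dataset availability, not by wall-clock time: the
    # scheduler keeps it pending until the input bucket is sealed.
    hour = luigi.DateHourParameter()

    def requires(self):
        return SealedEventBucket(hour=self.hour)

    def output(self):
        return HdfsTarget(
            self.hour.strftime("/reports/hourly/%Y-%m-%dT%H/_SUCCESS"))

    def run(self):
        # A real job would launch Spark here and write the report before
        # the _SUCCESS marker; rerunning a failed hour is then deterministic,
        # because the input for that hour never changes after sealing.
        with self.output().open("w") as marker:
            marker.write("")

This also makes the testing question above more tractable: a test can seal a small fixture bucket and run the DAG end to end against it.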