Hi Lars,

Thanks for the brain dump. All the points you made about target audience,
degree of high availability, and time-based vs. event-based scheduling are
valid and make sense. In our case, most of our devs are .NET based, so XML
or web-based scheduling is preferred over something written in
Java/Scala/Python. Based on my research so far into the workflow managers
available today, Azkaban is the easiest to adopt since it has no hard
dependency on Hadoop and makes it simple to onboard and schedule jobs. I was
able to install it and execute some Spark workflows in a day. Although the
fact that it is being phased out at LinkedIn is troubling, I think it is the
best fit for our use case today.

On Sun, Aug 9, 2015 at 4:51 PM -0700, "Lars Albertsson" 
<lars.alberts...@gmail.com> wrote:

I used to maintain Luigi at Spotify, and got some insight into workflow
manager characteristics and production behaviour in the process.

I am evaluating options for my current employer, and the short list is
basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
latter is not necessarily more work than adapting an existing tool,
since existing managers are typically more or less tied to the
technology used by the company that created them.

Are your users primarily developers building pipelines that drive
data-intensive products, or are they analysts, producing business
intelligence? These groups tend to have preferences for different
types of tools and interfaces.

I have a love/hate relationship with Luigi, but given your
requirements, it is probably the best fit:

* It has support for Spark, and it seems to be used and maintained.
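
  For example, a Spark job wired into Luigi might look roughly like the
  sketch below, using luigi.contrib.spark's SparkSubmitTask. The jar path,
  entry class and output path are placeholders, and spark-submit settings
  normally come from the [spark] section of your Luigi config:

import luigi
from luigi.contrib.spark import SparkSubmitTask

class TopArtistsSparkJob(SparkSubmitTask):
    # Placeholder artifact and entry point; adjust to your own build.
    app = '/opt/jobs/analytics-assembly.jar'
    entry_class = 'com.example.TopArtists'

    def app_options(self):
        # Arguments passed to the Spark application itself.
        return ['--output', '/data/top_artists']

    def output(self):
        # Luigi considers the job done once this marker exists.
        return luigi.LocalTarget('/data/top_artists/_SUCCESS')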

* It has no built-in support for Cassandra, but Cassandra is heavily
used at Spotify. IIRC, the code required to support Cassandra targets
is more or less trivial. There is no obvious single definition of a
dataset in C*, so you'll have to come up with a convention and encode
it as a Target subclass. I guess that is why it never made it outside
Spotify.
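
  A minimal sketch of such a Target subclass, assuming a convention where
  every completed dataset registers a row in a made-up dataset_registry
  table; the table, keyspace and contact points are illustrative only, not
  anything Luigi or Spotify ships:

import luigi
from cassandra.cluster import Cluster  # DataStax Python driver

class CassandraDatasetTarget(luigi.Target):
    def __init__(self, keyspace, dataset, date):
        self.keyspace = keyspace
        self.dataset = dataset
        self.date = date

    def exists(self):
        # A fresh connection per check is wasteful but keeps the sketch short.
        cluster = Cluster(['cassandra-host'])
        try:
            session = cluster.connect(self.keyspace)
            rows = session.execute(
                "SELECT dataset FROM dataset_registry "
                "WHERE dataset = %s AND date = %s",
                (self.dataset, self.date))
            return len(list(rows)) > 0
        finally:
            cluster.shutdown()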

* The open source community is active and it is well tested in
production at multiple sites.

* It is easy to write dependencies, but in a Python DSL. If your users
are developers, this is preferable to XML or a web interface. There
are always quirks and odd constraints somewhere that require the
expressive power of a programming language. It also allows you to
create extensions without changing Luigi itself.
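
  For illustration, the "job A depends on Job B and Job C" example from
  your mail would look roughly like this in the Python DSL (task names and
  output paths are invented):

import luigi

class JobB(luigi.Task):
    def output(self):
        return luigi.LocalTarget('/data/job_b.out')
    def run(self):
        with self.output().open('w') as f:
            f.write('b done\n')

class JobC(luigi.Task):
    def output(self):
        return luigi.LocalTarget('/data/job_c.out')
    def run(self):
        with self.output().open('w') as f:
            f.write('c done\n')

class JobA(luigi.Task):
    def requires(self):
        # A is only scheduled once B and C have produced their outputs.
        return [JobB(), JobC()]
    def output(self):
        return luigi.LocalTarget('/data/job_a.out')
    def run(self):
        with self.output().open('w') as f:
            f.write('a done\n')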

* It does not have recurring scheduling built in. Luigi needs a motor
to get going, typically cron, installed on a few machines for
redundancy. In a typical pipeline scenario, you give output datasets a
time parameter, which arranges for a dataset to be produced each
hour/day/week/month.
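
  As a sketch of that pattern (the module name, paths and the crontab line
  are assumptions, not anything Luigi prescribes):

import luigi

class DailyReport(luigi.Task):
    # One task instance per date; cron only ever asks for "today's" instance.
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(
            self.date.strftime('/data/reports/%Y-%m-%d.tsv'))

    def run(self):
        with self.output().open('w') as f:
            f.write('report for %s\n' % self.date)

# Hypothetical crontab entry on a couple of machines; rerunning is safe
# because Luigi skips tasks whose output already exists:
# 15 1 * * * luigi --module daily_report DailyReport --date $(date +\%Y-\%m-\%d)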

* It supports failure notifications.


Pinball and Airflow have similar architecture to Luigi, with a single
central scheduler and workers that submit and execute jobs. They seem
to be more solidly engineered at a glance, but less battle tested
outside Pinterest/Airbnb, and they have fewer integrations to the data
ecosystem.

Azkaban has a different architecture and user interface, and seems
more geared towards data scientists than developers; it has a good UI
for controlling jobs, but writing extensions and controlling it
programmatically seems more difficult than for Luigi.

All of the tools above are centralised, and the central component can
become a bottleneck and a single point of failure. I am not aware of
any decentralised open source workflow managers, but you can run
multiple instances and shard manually.

Regarding recurring jobs, it is typically undesirable to blindly run
jobs at a certain time. If you run jobs, e.g. with cron, and process
whatever data is available in your input sources, your jobs become
nondeterministic and unreliable. If incoming data is late or missing,
your jobs will fail or create artificial skews in output data, leading
to confusing results. Moreover, if jobs fail or have bugs, it will be
difficult to rerun them and get predictable results. This is why I
don't think Chronos is a meaningful alternative for scheduling data
processing.

There are different strategies on this topic, but IMHO, it is easiest to
create predictable and reliable pipelines by bucketing incoming data
into datasets that you seal off and mark ready for processing, and
then use the workflow manager's DAG logic to process data when input
datasets are available, rather than at a certain time. If you use
Kafka for data collection, Secor can handle this logic for you.
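
One way to express that in Luigi, assuming incoming data is bucketed per
hour and a _SUCCESS flag marks a sealed bucket (the flag convention and
paths are assumptions; with HDFS or S3 you would swap LocalTarget for the
matching target class):

import luigi

class SealedInputBucket(luigi.ExternalTask):
    # Produced by the collection side (e.g. Secor); Luigi never runs this
    # task, it only checks whether the flag exists.
    hour = luigi.DateHourParameter()

    def output(self):
        return luigi.LocalTarget(
            self.hour.strftime('/data/incoming/%Y/%m/%d/%H/_SUCCESS'))

class ProcessHour(luigi.Task):
    hour = luigi.DateHourParameter()

    def requires(self):
        # Scheduled by data availability, not by the clock.
        return SealedInputBucket(hour=self.hour)

    def output(self):
        return luigi.LocalTarget(
            self.hour.strftime('/data/processed/%Y%m%d%H.out'))

    def run(self):
        # A real pipeline would launch Spark or similar here.
        with self.output().open('w') as f:
            f.write('processed %s\n' % self.hour)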


In addition to your requirements, there are IMHO a few more topics one
needs to consider:
* How are pipelines tested? I.e. if I change job B below, how can I be
sure that the new output does not break A? You need to involve the
workflow DAG in testing such scenarios (a rough sketch follows this list).
* How do you debug jobs and DAG problems? In case of trouble, can you
figure out where the job logs are, or why a particular job does not
start?
* Do you need high availability for job scheduling? That will require
additional components.
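
As a rough illustration of the first point, a small DAG can be exercised
end to end in a test with the local scheduler, with no central daemon
involved. The tasks and paths here are invented; a real test would point
the outputs at a temporary directory:

import luigi

class Upstream(luigi.Task):
    def output(self):
        return luigi.LocalTarget('/tmp/test_upstream.out')
    def run(self):
        with self.output().open('w') as f:
            f.write('1\n')

class Downstream(luigi.Task):
    def requires(self):
        return Upstream()
    def output(self):
        return luigi.LocalTarget('/tmp/test_downstream.out')
    def run(self):
        with self.input().open() as inp, self.output().open('w') as out:
            out.write(inp.read())

def test_pipeline_end_to_end():
    # Runs the whole DAG in-process, so a change in Upstream's output
    # format is caught by assertions on Downstream's output.
    luigi.build([Downstream()], local_scheduler=True)
    with Downstream().output().open() as f:
        assert f.read() == '1\n'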


This became a bit of a brain dump on the topic. I hope that it is
useful. Don't hesitate to get back if I can help.

Regards,

Lars Albertsson



On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone  wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule
> spark jobs on a datastax cassandra cluster. Since there are tonnes of
> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to
> check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not some
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencies using an XML or web interface.
> E.g., job A depends on Job B and Job C, so run Job A after B and C are
> finished. Don't need to write full blown java applications to specify job
> parameters and dependencies. Should be very simple to use.
> 4. Time-based recurring scheduling. Run the spark jobs at a given time
> every hour or day or week or month.
> 5. Job monitoring, alerting on failures, and email notifications on a daily
> basis.
>
> I have looked at Ooyala's spark job server, which seems to be geared towards
> making spark jobs run faster by sharing contexts between the jobs but isn't
> a full blown workflow engine per se. A combination of spark job server and a
> workflow engine would be ideal.
>
> Thanks for the inputs
