Thanks for sharing this, Ben.

I agree Zeppelin is a better fit here, given its tighter Spark integration and
built-in visualizations.

We have pretty much standardized on PySpark, so here's one of the scripts
we use internally to extract %pyspark, %sql and %md paragraphs from a note
into a standalone script (which can then be scheduled in Airflow, for example):
https://github.com/Tagar/stuff/blob/master/znote.py (patches are welcome :-)
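
If it helps to see the shape of it, here's a stripped-down sketch of the same
idea (the real znote.py at the link above differs in the details, and the
note.json layout below is just what Zeppelin currently writes, so double-check
it against your version):

#!/usr/bin/env python
# Minimal sketch only: read a Zeppelin note.json and emit a standalone .py
# built from its %pyspark, %sql and %md paragraphs.
import json
import sys

def extract(note_path, out_path):
    with open(note_path) as f:
        note = json.load(f)

    out = []
    for para in note.get("paragraphs", []):
        text = para.get("text") or ""
        body = text.split("\n", 1)[1] if "\n" in text else ""
        if text.startswith("%pyspark"):
            # PySpark paragraph: drop the interpreter magic, keep the code
            out.append(body)
        elif text.startswith("%sql"):
            # SQL paragraph: wrap the statement in a sqlContext.sql() call
            out.append('sqlContext.sql("""%s""").show()' % body)
        elif text.startswith("%md"):
            # markdown paragraph: keep it as comments for readability
            out.extend("# " + line for line in body.splitlines())
        out.append("")

    with open(out_path, "w") as f:
        f.write("\n".join(out))

if __name__ == "__main__":
    extract(sys.argv[1], sys.argv[2])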

Hope this helps.

PS: In my opinion, adding dependencies between paragraphs wouldn't be that
hard for simple cases, and it could be a first step toward defining a DAG in
Zeppelin directly. It would be really awesome to see this type of
integration in the future.

Otherwise I don't see much value in running a whole note (i.e. a whole
workflow) as a single task in Airflow. In my opinion, each paragraph has to
become its own Airflow task; then it'll be very useful. A rough sketch of
what that could look like is below.
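
Purely as an illustration of the paragraph-per-task idea, something like this
on the Airflow side (the Zeppelin URL, note id and paragraph ids below are
made up, authentication is ignored, and the /api/notebook/run endpoint should
be checked against your Zeppelin version):

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

ZEPPELIN_URL = "http://zeppelin-host:8080"   # hypothetical
NOTE_ID = "2CABC123XY"                       # hypothetical note id

def run_paragraph(paragraph_id, **kwargs):
    # run a single paragraph synchronously via Zeppelin's notebook REST API
    resp = requests.post("%s/api/notebook/run/%s/%s"
                         % (ZEPPELIN_URL, NOTE_ID, paragraph_id))
    resp.raise_for_status()

dag = DAG("zeppelin_note", start_date=datetime(2017, 5, 1),
          schedule_interval="@daily")

extract = PythonOperator(task_id="extract_paragraph", dag=dag,
                         python_callable=run_paragraph,
                         op_kwargs={"paragraph_id": "20170519-000001_1"})
report = PythonOperator(task_id="report_paragraph", dag=dag,
                        python_callable=run_paragraph,
                        op_kwargs={"paragraph_id": "20170519-000002_1"})

# the per-paragraph dependency Zeppelin itself can't express today
extract >> report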


Thanks,
Ruslan


On Fri, May 19, 2017 at 4:55 PM, Ben Vogan <b...@shopkick.com> wrote:

> I do not expect the relationship between DAGs to be described in Zeppelin
> - that would be done in Airflow.  It just seems that Zeppelin is such a
> great tool for a data scientist's workflow that it would be nice if, once
> they are done with the work, the note could be productionized directly.  I
> could envision a couple of scenarios:
>
> 1. Using a zeppelin instance to run the note via the REST API.  The
> instance could be containerized and spun up specifically for a DAG or it
> could be a permanently available one.
> 2. A note could be pulled from git and some part of the Zeppelin engine
> could execute the note without the web UI at all.
>
> I would expect there to be some special operators on the Airflow side for
> executing these.
>
> If the scheduler is pluggable, then it should be possible to create a
> plugin that talks to the Airflow REST API.
>
> I happen to prefer Zeppelin to Jupyter - although I get your point about
> both being Python.  I don't really view that as a problem - most of the big
> data platforms I'm talking to are implemented on the JVM after all.  The
> Python part of Airflow is really just describing what gets run, and it isn't
> hard to run something that isn't written in Python.
>
> On Fri, May 19, 2017 at 2:52 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> We also use both Zeppelin and Airflow.
>>
>> I'm interested in hearing what others are doing here too.
>>
>> Although honestly there might be some challenges:
>> - Airflow expects a DAG structure, while a notebook has a pretty linear
>> structure;
>> - Airflow is Python-based, while Zeppelin is all Java (the REST API might
>> be of help?).
>> Jupyter + Airflow might be a more natural fit to integrate?
>>
>> On top of that, the way we use Zeppelin is mostly for ad-hoc queries,
>> while Airflow is for more finalized workflows, I guess?
>>
>> Thanks for bringing this up.
>>
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Fri, May 19, 2017 at 2:20 PM, Ben Vogan <b...@shopkick.com> wrote:
>>
>>> Hi all,
>>>
>>> We are really enjoying the workflow of interacting with our data via
>>> Zeppelin, but are not sold on using the built-in cron scheduling
>>> capability.  We would like to be able to create more complex DAGs that are
>>> better suited for something like Airflow.  I was curious as to whether
>>> anyone has done an integration of Zeppelin with Airflow.
>>>
>>> Either directly from within Zeppelin, or from the Airflow side.
>>>
>>> Thanks,
>>> --
>>> *BENJAMIN VOGAN* | Data Platform Team Lead
>>>
>>>
>>
>>
>
>
> --
> *BENJAMIN VOGAN* | Data Platform Team Lead
>
>
