Thanks for sharing this, Ben. I agree Zeppelin is a better fit, with tighter integration with Spark and built-in visualizations.

We have pretty much standardized on PySpark, so here's one of the scripts we use internally to extract %pyspark, %sql and %md paragraphs into a standalone script (which can then be scheduled in Airflow, for example): https://github.com/Tagar/stuff/blob/master/znote.py (patches are welcome :-) Hope this helps.

P.S. In my opinion, adding dependencies between paragraphs wouldn't be that hard for simple cases, and could be a first step toward defining a DAG in Zeppelin directly. It would be really awesome to see this type of integration in the future. Otherwise I don't see much value if a whole note / whole workflow runs as a single task in Airflow. In my opinion, each paragraph has to be a task... then it'll be very useful.
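To make the extraction idea above concrete, here's a minimal sketch of that kind of conversion (a simplified illustration, not the actual znote.py code). It assumes a note exported as note.json, where each entry in the "paragraphs" list carries its source in a "text" field that starts with the interpreter binding:

    import json
    import sys

    def note_to_script(note_path, out_path):
        """Turn an exported Zeppelin note (note.json) into a standalone PySpark script."""
        with open(note_path) as f:
            note = json.load(f)
        lines = []
        for p in note.get("paragraphs", []):
            text = (p.get("text") or "").strip()
            if text.startswith("%pyspark"):
                # keep the Python body as-is
                lines.append(text[len("%pyspark"):].strip())
            elif text.startswith("%sql"):
                # wrap SQL paragraphs in a spark.sql() call
                lines.append('spark.sql("""%s""").show()' % text[len("%sql"):].strip())
            elif text.startswith("%md"):
                # keep markdown paragraphs as comments
                lines.extend("# " + l for l in text[len("%md"):].strip().splitlines())
            lines.append("")
        with open(out_path, "w") as f:
            f.write("\n".join(lines))

    if __name__ == "__main__":
        note_to_script(sys.argv[1], sys.argv[2])

Run it as "python note_to_script.py note.json job.py" and have Airflow spark-submit the generated file.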
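And here's a rough sketch of the "each paragraph becomes its own Airflow task" idea, assuming Zeppelin's run-paragraph REST endpoint (POST /api/notebook/run/<noteId>/<paragraphId>) and Airflow's PythonOperator. The host, note id and paragraph ids below are made-up placeholders, and the tasks are chained linearly since Zeppelin doesn't expose paragraph dependencies today:

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    ZEPPELIN_URL = "http://zeppelin-host:8080"   # placeholder Zeppelin server
    NOTE_ID = "2CKEKWY8Z"                        # placeholder note id
    PARAGRAPH_IDS = [                            # placeholder paragraph ids, in run order
        "20170519-000001_1",
        "20170519-000002_1",
    ]

    def run_paragraph(paragraph_id):
        # Run a single paragraph synchronously through the Zeppelin REST API
        # and fail the Airflow task if Zeppelin returns an error status.
        resp = requests.post("%s/api/notebook/run/%s/%s"
                             % (ZEPPELIN_URL, NOTE_ID, paragraph_id))
        resp.raise_for_status()

    dag = DAG("zeppelin_note", start_date=datetime(2017, 5, 19),
              schedule_interval="@daily")

    prev = None
    for pid in PARAGRAPH_IDS:
        task = PythonOperator(task_id="paragraph_" + pid.replace("-", "_"),
                              python_callable=run_paragraph,
                              op_kwargs={"paragraph_id": pid},
                              dag=dag)
        if prev is not None:
            task.set_upstream(prev)   # linear chain; real paragraph dependencies would form a DAG
        prev = task

If a note really is one unit of work, the run-all-paragraphs endpoint (POST /api/notebook/job/<noteId>) could back a single task instead, but as said above that loses most of the value.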
Thanks,
Ruslan

On Fri, May 19, 2017 at 4:55 PM, Ben Vogan <b...@shopkick.com> wrote:

> I do not expect the relationship between DAGs to be described in Zeppelin - that would be done in Airflow. It just seems that Zeppelin is such a great tool for a data scientist's workflow that it would be nice if, once they are done with the work, the note could be productionized directly. I could envision a couple of scenarios:
>
> 1. Using a Zeppelin instance to run the note via the REST API. The instance could be containerized and spun up specifically for a DAG, or it could be a permanently available one.
> 2. A note could be pulled from git and some part of the Zeppelin engine could execute the note without the web UI at all.
>
> I would expect there to be some special operators on the Airflow side for executing these.
>
> If the scheduler is pluggable, then it should be possible to create a plug-in that talks to the Airflow REST API.
>
> I happen to prefer Zeppelin to Jupyter - although I get your point about both being Python. I don't really view that as a problem - most of the big data platforms I'm talking to are implemented on the JVM after all. The Python part of Airflow is really just describing what gets run, and it isn't hard to run something that isn't written in Python.
>
> On Fri, May 19, 2017 at 2:52 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
>
>> We also use both Zeppelin and Airflow.
>>
>> I'm interested in hearing what others are doing here too.
>>
>> Although honestly there might be some challenges:
>> - Airflow expects a DAG structure, while a notebook has a pretty linear structure;
>> - Airflow is Python-based; Zeppelin is all Java (the REST API might be of help?). Jupyter + Airflow might be a more natural fit to integrate?
>>
>> On top of that, the way we use Zeppelin is a lot of ad-hoc queries, while Airflow is for more finalized workflows, I guess?
>>
>> Thanks for bringing this up.
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Fri, May 19, 2017 at 2:20 PM, Ben Vogan <b...@shopkick.com> wrote:
>>
>>> Hi all,
>>>
>>> We are really enjoying the workflow of interacting with our data via Zeppelin, but are not sold on using the built-in cron scheduling capability. We would like to be able to create more complex DAGs that are better suited for something like Airflow. I was curious as to whether anyone has done an integration of Zeppelin with Airflow.
>>>
>>> Either directly from within Zeppelin, or from the Airflow side.
>>>
>>> Thanks,
>>> --
>>> *BENJAMIN VOGAN* | Data Platform Team Lead
>>
>
> --
> *BENJAMIN VOGAN* | Data Platform Team Lead