We have begun experimenting with an Airflow/Zeppelin integration. We use the first paragraph of a note to define the note's dependencies and outputs, its name and owner, and its schedule. Utility functions (in Scala) provide a data catalog for retrieving data sources. These functions return a DataFrame and record the note's dependency on a particular data source, so that a DAG can be constructed between the task that creates the input data and the Zeppelin note. There is also a function that provides a DataFrame writer which captures the outputs a note produces; this registers the note as the source for that data, which is then available in the data catalog for other notebooks to use. The result is that one notebook can depend on data created by another notebook.
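To make that concrete, here is a rough sketch of the shape of those helpers. Ours are actually written in Scala; everything below (read_source, write_output, declare, the in-memory registry) is invented for illustration in Python/PySpark and is not our real API:

# Hypothetical Python/PySpark rendering of the Scala catalog helpers
# described above. Names and storage details are made up for the example.
_inputs = []    # datasets this note reads
_outputs = []   # datasets this note produces

def read_source(spark, dataset):
    """Return the dataset as a DataFrame and record it as a dependency of this note."""
    _inputs.append(dataset)
    return spark.table(dataset)

def write_output(df, dataset):
    """Write the dataset and register this note as its producer in the catalog."""
    _outputs.append(dataset)
    df.write.mode("overwrite").saveAsTable(dataset)

def declare(name, owner, schedule):
    """Roughly what the first paragraph would emit for the DAG generator to pick up."""
    return {"name": name, "owner": owner, "schedule": schedule,
            "inputs": _inputs, "outputs": _outputs}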
An Airflow DAG generator (Python code) queries the Zeppelin notebook server, looking for the results of the first paragraph of each note. It uses those results to construct the DAG between notebooks and generates a ZeppelinNoteOperator for each note, which uses the Zeppelin REST API to execute the note when the scheduler schedules that task (a rough sketch of the generator and operator is at the bottom of this mail).

We've only just started using this, so we don't have a lot of experience with it yet. The biggest caveats so far are:
* there is no mechanism for test cases of note code
* we have to call the notebook server on every iteration of the scheduler / whenever a DAG is initialized; we use the cached results if available, but it still requires a round trip to the Zeppelin notebook server

Regards,
Erik

On Fri, May 19, 2017 at 10:15 PM Ben Vogan <b...@shopkick.com> wrote:

> Thanks for sharing this Ruslan - I will take a look.
>
> I agree that paragraphs can form tasks within a DAG. My point was that
> ideally a DAG could encompass multiple notes, i.e. the completion of one
> note triggers another, and so on to complete an entire chain of dependent
> tasks.
>
> For example, team A has a note that generates data set A*. Teams B & C
> each have notes that depend on A* to generate B* & C* for their specific
> purposes. It doesn't make sense for all of that to have to live in one
> note, but they are all part of a single workflow.
>
> Best,
> --Ben
>
> On Fri, May 19, 2017 at 9:02 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> Thanks for sharing this Ben.
>>
>> I agree Zeppelin is a better fit, with its tighter integration with
>> Spark and built-in visualizations.
>>
>> We have pretty much standardized on pySpark, so here's one of the scripts
>> we use internally to extract %pyspark, %sql and %md paragraphs into a
>> standalone script (that can be scheduled in Airflow, for example):
>> https://github.com/Tagar/stuff/blob/master/znote.py (patches are welcome :-)
>>
>> Hope this helps.
>>
>> ps. In my opinion, adding dependencies between paragraphs wouldn't be
>> that hard for simple cases, and could be a first step toward defining a
>> DAG in Zeppelin directly. It would be really awesome if we see this type
>> of integration in the future.
>>
>> Otherwise I don't see much value if a whole note / whole workflow runs
>> as a single task in Airflow. In my opinion, each paragraph has to be a
>> task... then it'll be very useful.
>>
>> Thanks,
>> Ruslan
>>
>> On Fri, May 19, 2017 at 4:55 PM, Ben Vogan <b...@shopkick.com> wrote:
>>
>>> I do not expect the relationship between DAGs to be described in
>>> Zeppelin - that would be done in Airflow. It just seems that Zeppelin is
>>> such a great tool for a data scientist's workflow that it would be nice
>>> if, once they are done with the work, the note could be productionized
>>> directly. I could envision a couple of scenarios:
>>>
>>> 1. Using a Zeppelin instance to run the note via the REST API. The
>>> instance could be containerized and spun up specifically for a DAG, or it
>>> could be a permanently available one.
>>> 2. A note could be pulled from git and some part of the Zeppelin engine
>>> could execute the note without the web UI at all.
>>>
>>> I would expect on the Airflow side there to be some special operators
>>> for executing these.
>>>
>>> If the scheduler is pluggable then it should be possible to create a
>>> plug-in that talks to the Airflow REST API.
>>>
>>> I happen to prefer Zeppelin to Jupyter - although I get your point about
>>> both being Python.
>>> I don't really view that as a problem - most of the big data platforms
>>> I'm talking to are implemented on the JVM after all. The Python part of
>>> Airflow is really just describing what gets run, and it isn't hard to run
>>> something that isn't written in Python.
>>>
>>> On Fri, May 19, 2017 at 2:52 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
>>> wrote:
>>>
>>>> We also use both Zeppelin and Airflow.
>>>>
>>>> I'm interested in hearing what others are doing here too.
>>>>
>>>> Although honestly there might be some challenges:
>>>> - Airflow expects a DAG structure, while a notebook has a pretty
>>>> linear structure;
>>>> - Airflow is Python-based; Zeppelin is all Java (the REST API might be
>>>> of help?). Jupyter+Airflow might be a more natural fit to integrate?
>>>>
>>>> On top of that, the way we use Zeppelin is a lot of ad-hoc queries,
>>>> while Airflow is for more finalized workflows, I guess?
>>>>
>>>> Thanks for bringing this up.
>>>>
>>>> --
>>>> Ruslan Dautkhanov
>>>>
>>>> On Fri, May 19, 2017 at 2:20 PM, Ben Vogan <b...@shopkick.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We are really enjoying the workflow of interacting with our data via
>>>>> Zeppelin, but are not sold on using the built-in cron scheduling
>>>>> capability. We would like to be able to create more complex DAGs that
>>>>> are better suited for something like Airflow. I was curious as to
>>>>> whether anyone has done an integration of Zeppelin with Airflow.
>>>>>
>>>>> Either directly from within Zeppelin, or from the Airflow side.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> *BENJAMIN VOGAN* | Data Platform Team Lead
>>>>> <http://www.shopkick.com/>
>>>>
>>>
>>> --
>>> *BENJAMIN VOGAN* | Data Platform Team Lead
>>> <http://www.shopkick.com/>
>>
>
> --
> *BENJAMIN VOGAN* | Data Platform Team Lead
> <http://www.shopkick.com/>
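P.S. Here is the rough sketch of the DAG generator and operator mentioned at the top of this mail. It is illustration only, not our production code: ZeppelinNoteOperator, build_dag and the metadata shape are names invented for the example, the endpoint used is Zeppelin's notebook job REST call, and a real version would poll for completion, handle failures, and take the schedule from the note metadata.

# Sketch only. Assumes Zeppelin's REST API for running a note's paragraphs
# and Airflow's BaseOperator/DAG; everything else is invented for illustration.
from datetime import datetime

import requests
from airflow import DAG
from airflow.models import BaseOperator

ZEPPELIN_URL = "http://zeppelin-host:8080"  # placeholder


class ZeppelinNoteOperator(BaseOperator):
    """Run an entire Zeppelin note through the REST API."""

    def __init__(self, note_id, *args, **kwargs):
        super(ZeppelinNoteOperator, self).__init__(*args, **kwargs)
        self.note_id = note_id

    def execute(self, context):
        # POST /api/notebook/job/{noteId} asks Zeppelin to run all paragraphs.
        resp = requests.post("%s/api/notebook/job/%s" % (ZEPPELIN_URL, self.note_id))
        resp.raise_for_status()
        # A real operator would now poll GET /api/notebook/job/{noteId}
        # until every paragraph has finished (or failed).


def build_dag(notes):
    """Build one DAG from metadata scraped from each note's first paragraph.

    notes: {note_id: {"name": ..., "inputs": [...], "outputs": [...]}}
    """
    dag = DAG("zeppelin_notes", start_date=datetime(2017, 5, 1),
              schedule_interval="@daily")
    tasks = {note_id: ZeppelinNoteOperator(task_id=meta["name"],
                                           note_id=note_id, dag=dag)
             for note_id, meta in notes.items()}

    # A note that reads dataset X depends on the note that produces X.
    producers = {out: note_id
                 for note_id, meta in notes.items()
                 for out in meta["outputs"]}
    for note_id, meta in notes.items():
        for dataset in meta["inputs"]:
            if dataset in producers:
                tasks[producers[dataset]] >> tasks[note_id]
    return dag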