We have begun experimenting with an Airflow/Zeppelin integration. We use
the first paragraph of a note to define the note's dependencies, outputs,
name, owner, and schedule. There are utility functions (in Scala)
available that provide a data catalog for retrieving data sources. These
f
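For concreteness, a convention like that could be sketched as follows. This is only an illustration, not our actual format: the key names (name, owner, schedule, depends_on, outputs) and the parse_note_header helper are hypothetical.

```python
# Sketch: parse a note's first paragraph into scheduling metadata.
# The key names used here are hypothetical illustrations, not the
# actual convention described in the message above.

def parse_note_header(text):
    """Parse 'key: value' lines from a note's first paragraph."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("%") or not line or ":" not in line:
            continue  # skip the interpreter magic and non-metadata lines
    # (split only on the first ':' so cron values keep their spaces)
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    # treat dependencies and outputs as comma-separated lists
    for key in ("depends_on", "outputs"):
        if key in meta:
            meta[key] = [v.strip() for v in meta[key].split(",")]
    return meta

header = """%md
name: daily_sessions
owner: data-eng
schedule: 0 4 * * *
depends_on: raw_events, user_dim
outputs: sessions
"""
```

An Airflow-side script could then read this metadata to generate the task and its schedule.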
Thanks for sharing this Ruslan - I will take a look.
I agree that paragraphs can form tasks within a DAG. My point was that
ideally a DAG could encompass multiple notes, i.e. the completion of one
note triggers another, and so on, to complete an entire chain of dependent
tasks.
For example, team A
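Such cross-note chaining amounts to a topological order over note dependencies. A minimal sketch, assuming each note declares which notes it depends on (the note names here are made up):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency graph: note -> notes it depends on.
note_deps = {
    "team_a/ingest": [],
    "team_a/clean": ["team_a/ingest"],
    "team_b/model": ["team_a/clean"],
    "team_b/report": ["team_b/model"],
}

def run_order(deps):
    """Return an order in which the notes can be executed,
    every note appearing after all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

In practice Airflow would own this graph; each node would become a task that triggers the corresponding note.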
Thanks for sharing this Ben.
I agree Zeppelin is a better fit with tighter integration with Spark and
built-in visualizations.
We have pretty much standardized on PySpark, so here's one of the scripts
we use internally
to extract %pyspark, %sql and %md paragraphs into a standalone script (that
ca
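In the same spirit, a minimal version of such an extraction could look like the sketch below. It reads a note's exported JSON, where each paragraph carries its source in a "text" field; the exact schema may vary across Zeppelin versions, and the commenting-out of %sql/%md bodies is just one possible choice.

```python
import json

def extract_script(note_json, keep=("%pyspark", "%sql", "%md")):
    """Turn selected paragraphs of an exported Zeppelin note into
    one standalone script: %pyspark bodies are kept as code, while
    %sql and %md bodies are commented out so the result is plain Python."""
    note = json.loads(note_json)
    chunks = []
    for para in note.get("paragraphs", []):
        text = para.get("text") or ""
        magic, _, body = text.partition("\n")
        magic = magic.strip()
        if magic not in keep:
            continue  # drop %sh, %angular, etc.
        if magic == "%pyspark":
            chunks.append(body)
        else:
            chunks.append("\n".join("# " + line for line in body.splitlines()))
    return "\n\n".join(chunks)

# A tiny hand-made example in the exported-note shape:
note = json.dumps({"paragraphs": [
    {"text": "%md\n## Daily sessions"},
    {"text": "%pyspark\ndf = spark.range(10)"},
    {"text": "%sh\necho skipped"},
]})
```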
I do not expect the relationship between DAGs to be described in Zeppelin;
that would be done in Airflow. It just seems that Zeppelin is such a great
tool for a data scientist's workflow that it would be nice if, once they
are done with the work, the note could be productionized directly. I could
e
We also use both Zeppelin and Airflow.
I'm interested in hearing what others are doing here too.
Although, honestly, there might be some challenges:
- Airflow expects a DAG structure, while a notebook has a pretty linear
structure;
- Airflow is Python-based, while Zeppelin is all Java (the REST API might be of
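That Python-to-Java bridge could go through HTTP. A rough sketch, assuming Zeppelin's notebook REST API exposes a "run all paragraphs" endpoint at /api/notebook/job/{noteId} (check the REST docs for your Zeppelin version; the base URL and note ID here are placeholders):

```python
import urllib.request

ZEPPELIN_URL = "http://localhost:8080"  # placeholder base URL

def run_note_url(note_id, base=ZEPPELIN_URL):
    """URL that asks Zeppelin to run every paragraph of a note."""
    # Assumed endpoint: POST /api/notebook/job/{noteId}; verify
    # against your version's REST API documentation.
    return f"{base}/api/notebook/job/{note_id}"

def run_note(note_id, base=ZEPPELIN_URL):
    """Trigger a note run; this is what an Airflow task would call."""
    req = urllib.request.Request(run_note_url(note_id, base), method="POST")
    with urllib.request.urlopen(req) as resp:  # blocks until Zeppelin replies
        return resp.status
```

Wrapped in a PythonOperator (or a small custom operator), this would let Python-based Airflow drive JVM-based Zeppelin without any language bridge beyond HTTP.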