We have begun experimenting with an Airflow/Zeppelin integration. We use
the first paragraph of a note to define its dependencies and outputs, its
name and owner, and its schedule. There are utility functions (in Scala)
available that provide a data catalog for retrieving data sources. These f
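For concreteness, a rough sketch of how such a header paragraph could be
parsed out of a note's JSON might look like the Python below. The field
names (owner, schedule, depends_on, outputs) are purely hypothetical, and
the real utilities described above are in Scala, so treat this only as an
illustration of the idea.

import json

def read_note_header(note_json_path):
    """Parse `key: value` metadata out of a note's first paragraph.

    Field names (owner, schedule, depends_on, outputs) are hypothetical;
    the actual convention is whatever a team standardizes on.
    """
    with open(note_json_path) as f:
        note = json.load(f)

    first_paragraph = note["paragraphs"][0].get("text", "")
    header = {}
    for line in first_paragraph.splitlines():
        line = line.strip()
        if line.startswith("%") or ":" not in line:
            continue  # skip the interpreter directive (%md etc.) and prose
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    return header

# e.g. {'owner': 'data-eng', 'schedule': '0 6 * * *', 'depends_on': 'raw_ingest'}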
Thanks for sharing this, Ruslan - I will take a look.
I agree that paragraphs can form tasks within a DAG. My point was that
ideally a DAG could encompass multiple notes, i.e. the completion of one
note triggers another, and so on, until an entire chain of dependent
tasks completes.
For example, team A
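A rough sketch of what that could look like on the Airflow side, assuming
each note is run through Zeppelin's notebook REST API (the host, note IDs,
and the response shape assumed here are placeholders; check the REST API
docs for your Zeppelin version):

import time
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

ZEPPELIN = "http://zeppelin:8080"  # hypothetical host

def run_note_and_wait(note_id, poll_seconds=30):
    """Run every paragraph of a note and block until the whole note finishes.

    Assumes POST /api/notebook/job/{noteId} starts the note and a GET on the
    same path reports per-paragraph status; verify against your version.
    """
    requests.post(f"{ZEPPELIN}/api/notebook/job/{note_id}").raise_for_status()
    while True:
        paragraphs = requests.get(f"{ZEPPELIN}/api/notebook/job/{note_id}").json()["body"]
        if any(p["status"] in ("ERROR", "ABORT") for p in paragraphs):
            raise RuntimeError(f"note {note_id} did not finish cleanly")
        if all(p["status"] == "FINISHED" for p in paragraphs):
            return
        time.sleep(poll_seconds)

with DAG(
    dag_id="team_a_then_team_b",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
) as dag:
    # One task per note; the note IDs are placeholders.
    team_a_note = PythonOperator(
        task_id="team_a_note",
        python_callable=run_note_and_wait,
        op_kwargs={"note_id": "2AAAAAAAA"},
    )
    team_b_note = PythonOperator(
        task_id="team_b_note",
        python_callable=run_note_and_wait,
        op_kwargs={"note_id": "2BBBBBBBB"},
    )

    # Team B's note only starts once team A's note has completed.
    team_a_note >> team_b_note

This keeps the note-to-note dependencies entirely in Airflow, which is where
the DAG between teams would live anyway.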
Thanks for sharing this, Ben.
I agree Zeppelin is a better fit, given its tighter integration with Spark
and built-in visualizations.
We have pretty much standardized on PySpark, so here's one of the scripts
we use internally to extract %pyspark, %sql, and %md paragraphs into a
standalone script (that ca
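A simplified sketch of that kind of extraction, assuming the standard
note.json layout (and not the exact internal script mentioned above), might
look like:

import json
import sys

def note_to_script(note_json_path):
    """Flatten a Zeppelin note into a standalone PySpark script.

    %pyspark paragraphs are copied as-is, %sql paragraphs are wrapped in
    spark.sql(...), and %md paragraphs become comments; everything else is
    skipped. A simplified sketch, not the internal script referenced above.
    """
    with open(note_json_path) as f:
        note = json.load(f)

    lines = []
    for paragraph in note.get("paragraphs", []):
        text = paragraph.get("text") or ""
        interpreter, _, body = text.partition("\n")
        interpreter = interpreter.strip()

        if interpreter == "%pyspark":
            lines.append(body)
        elif interpreter == "%sql":
            query = body.strip().replace('"""', r'\"\"\"')  # avoid breaking the literal
            lines.append(f'spark.sql("""{query}""")')
        elif interpreter == "%md":
            lines.extend("# " + md_line for md_line in body.splitlines())
        # other interpreters (%sh, %angular, ...) are ignored in this sketch

    return "\n\n".join(lines)

if __name__ == "__main__":
    print(note_to_script(sys.argv[1]))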
I do not expect the relationship between DAGs to be described in Zeppelin -
that would be done in Airflow. It just seems that Zeppelin is such a great
tool for a data scientist's workflow that it would be nice if, once they
are done with the work, the note could be productionized directly. I could
e
We also use both Zeppelin and Airflow.
I'm interested in hearing what others are doing here too.
Although, honestly, there might be some challenges:
- Airflow expects a DAG structure, while a notebook has a pretty linear
structure;
- Airflow is Python-based; Zeppelin is all Java (the REST API might be of
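On the last point, the Java/Python mismatch mostly disappears over HTTP,
since the notebook REST API just speaks JSON. A minimal sketch from the
Python side (the endpoints, host, and IDs below are assumptions to verify
against the REST API docs for your Zeppelin version):

import requests

ZEPPELIN = "http://zeppelin:8080"  # hypothetical host

# List notes (name + id) so a scheduler can discover what to run.
notes = requests.get(f"{ZEPPELIN}/api/notebook").json()["body"]

# Run a single paragraph asynchronously. Running paragraphs individually is
# also one way around the linear-notebook-vs-DAG mismatch: each paragraph
# can become its own task with arbitrary dependencies between them.
note_id = "2ABCDEFGH"                     # placeholder
paragraph_id = "20170101-000000_1234567"  # placeholder
resp = requests.post(f"{ZEPPELIN}/api/notebook/job/{note_id}/{paragraph_id}")
resp.raise_for_status()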
Hi all,
We are really enjoying the workflow of interacting with our data via
Zeppelin, but we are not sold on the built-in cron scheduling capability.
We would like to be able to create more complex DAGs, which seems better
suited to something like Airflow. I was curious as to whether anyone has