Cloud Composer recently launched a "Data lineage with Dataplex" feature, which generates lineage from DAG/task executions and exports it to Data Lineage (Data Catalog service) for further analysis:
https://cloud.google.com/composer/docs/composer-2/lineage-integration
This feature is currently in the "Preview" state. The current implementation uses Airflow's built-in lineage backend feature, together with methods that extract lineage metadata on task post-execution events.

The general idea was to contribute this to the Airflow community in the following form:
- generalize lineage metadata extraction as a method on each operator, using generic lineage entities
- implement "adapters" to convert the generated metadata to Data Lineage format, OpenLineage format, etc.

Adopting "Airflow OpenLineage" for Composer would mean introducing an additional layer that converts from OpenLineage format to Data Lineage (Data Catalog/Dataplex) format. But this is definitely a possibility.

On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:

> Thank you very much for your input Jarek.
> I am responding in the comments and adding to the doc accordingly.
> I would also love to hear from more stakeholders.
> Thanks to all who provided feedback so far.
> Julien
>
> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> General comment from my side: I think Open Lineage is (and should be even more) a feature of Airflow that expands Airflow's capabilities greatly and opens up the direction we've been all working on - Airflow as a Platform.
>>
>> I think closely integrating it with Open-Lineage goes the same direction (also mentioned in the doc) as Open Telemetry goes, where we might decide to support certain standards in order to expand capabilities of Airflow-as-a-platform and allows to plug-in multiple external solutions that would use the standard API. After Open-Lineage graduated recently to LFAI&Data foundation (I've been watching this happening from far), it is I think the perfect candidate for Airflow to incorporate it. I hope this will help all the players to make use of the extra work necessary by the community to make it "officially supported".
>> I think we have to also get some feedback from the big stakeholders in Airflow - because one thing is to have such a capability, and another is to get it used in all the ways Airflow is used - not only by on-premise/self-hosted users (which is obviously a huge driving factor) but also everywhere where Airflow is exposed by others - Astronomer is obviously on-board. we see some warm words from Amazon (mentioned by Julian), I would love to hear whether the Composer team at Google would be on board in using the open-lineage information exposed this way in their Data Catalog (and likely more) offering. We have Amundsen and others and possibly other stakeholders might want to say something.
>>
>> There is - undoubtedly - an extra effort involved in implementing and keeping it running smoothly (as Julian mentioned, that is the main reason why the Open Lineage community would like to make the integration part of Airflow. But by being smart and integrating it in the way that will allow to plug-it-in into our CI, verification process and making some very clear expectations about what it means for contributors to Airflow to get it running, we can make some initial investment in making it happen and minimise on-going cost, while maximising the gain.
>>
>> And looking at all the above - I am super happy to help with all that to make this easy to "swallow" and integrate well, even if it will take an extra effort, especially that we will have experts from Open Lineage who worked with both Airflow and Open Lineage being the core part of the effort. I am actually super excited - this might be the next-big-thing for Airflow to strengthen its position as an indispensable component of "even more modern data stack".
>>
>> I made my initial comments in the doc, and am looking forward to making it happen :).
>>
>> J.
>>
>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>> >
>> > Dear Airflow Community,
>> > I have been working on a proposal to bring an OpenLineage provider to Airflow.
>> > I am looking for feedback with the goal to post an official AIP.
>> > Please feel free to comment in the doc above.
>> > Thank you,
>> > Julien (OpenLineage project lead)
>> >
>> > For convenience, here is the rationale from the doc:
>> >
>> > Operational lineage collection is a common need to understand dependencies between data pipelines and track end-to-end provenance of data. It enables many use cases from ensuring reliable delivery of data through observability to compliance and cost management.
>> >
>> > Publishing operational lineage is a core Airflow capability to enable troubleshooting and governance.
>> >
>> > OpenLineage is a project part of the LFAI&Data foundation that provides a spec standardizing operational lineage collection and sharing across the data ecosystem. If it provides plugins for popular open source projects, its intent is very similar to OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec for lineage exchange that projects - open source or proprietary - implement.
>> >
>> > Built-in OpenLineage support in Airflow will make it easier and more reliable for Airflow users to publish their operational lineage through the OpenLineage ecosystem.
>> >
>> > The current external plugin maintained in the OpenLineage project depends on Airflow and operators internals and gets broken when changes are made on those. Having a built-in integration ensures a better first class support to expose lineage that gets tested alongside other changes and therefore is more stable.
>> >

--
Eugene