Dear Airflow community,

Following the discussion thread over the past few weeks, I'd like to call a
vote on AIP-53 OpenLineage in Airflow:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow

The discussion thread is linked in the confluence doc if you wish to
consult the history of the conversation. Thank you to all who contributed!

This is my (non-binding!) +1. The vote will last until midnight (UTC) on
Friday 17th February.

Thanks,
Julien

*For reference, the Motivation section in the doc:*

Operational lineage collection is a common need to understand dependencies
between data pipelines and track end-to-end provenance of data. It enables
many use cases, from ensuring reliable delivery of data through
observability to compliance and cost management.

Publishing operational lineage is a core Airflow capability to enable
troubleshooting and governance.

OpenLineage <https://openlineage.io/> is a project of the LF AI & Data
<https://lfaidata.foundation/projects/> foundation that provides a spec
standardizing operational lineage collection and sharing across the data
ecosystem. Although it provides plugins for popular open source projects,
its intent is very similar to that of OpenTelemetry
<https://opentelemetry.io/> (also under the Linux Foundation umbrella): to
remain a spec for lineage exchange that projects - open source or
proprietary - implement.

Built-in OpenLineage support in Airflow will make it easier and more
reliable for Airflow users to publish their operational lineage through the
OpenLineage ecosystem.

The current external plugin maintained in the OpenLineage project depends
on Airflow and operator internals and breaks when those change. A built-in
integration provides first-class support for exposing lineage: it gets
tested alongside other changes and is therefore more stable.

Today, OpenLineage consumers in the ecosystem include: Egeria
<https://egeria-project.org/features/lineage-management/overview/#the-openlineage-standard>
(bank compliance), Marquez <https://marquezproject.ai/> (build your own
metadata platform for compliance for example), Microsoft Purview
<https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/>
(Governance, …), Astro <https://www.astronomer.io/why-openlineage/> (data
observability), Amundsen
<https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor>.
AWS recently blogged about using OpenLineage in the AWS ecosystem
<https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/>.
Other projects are at various levels of progress.

On the producer side, there is support for open source projects like
Airflow, dbt, Spark, Flink, and Great Expectations, and for proprietary
warehouses like Snowflake
<https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/README.md>,
BigQuery, and Redshift
<https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/>
through API integration or SQL parsing.

Examples of users talking about their usage of OpenLineage can be found on
the OpenLineage blog
<https://openlineage.io/blog/openlineage-at-northwestern-mutual/>.

This integration will also stimulate the continued growth of the
OpenLineage ecosystem and create more value for Airflow users.
