We are planning to hold this session next Thursday at 5pm CET / 9am PT. I will send a Zoom link in advance.

Julien
On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote: > Cool. I am looking forward to it :). It would be great to get some > insight from those who attempted to get the lineage working in several > versions of Open Lineage and finally arrived at the current > specs/integration. > > On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem > <jul...@astronomer.io.invalid> wrote: > > > > Thank you Jarek, > > I am happy to organize a zoom presentation about OpenLineage and answer > any question. It is indeed a spec decoupling the data transformation layer > from the Metadata store people are using. Just like OpenTelemetry is for > service metrics/traces. > > Best, > > Julien > > > > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote: > >> > >> And to add a little "parallel" - I think Open Lineage integration > replacing our "generic lineage" is very similar step to the new > "Multi-tenant"-ready authentication interface we are discussing in > https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck > >> > >> Yes - we have a generic authentication interface, but no - it's useless > for the case where multi-tenancy and good level of resource authorization > is needed. It's just far too simplistic and limited. > >> > >> Same with current lineage generic interface - yes, we have it but it's > only useful in a limited set of cases. and if we want to step-it-up we need > to come up with something better (and Open Lineage happens to be one that > has been developed with Airflow in mind and battle tested). > >> > >> J. > >> > >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote: > >>> > >>> Hey Rafał (Eugene, Michal - and others who are looking), > >>> > >>> I think I know where your/Eugen/Michał concerns are coming from. And I > think it would be great if we can talk it over a bit. 
I believe this is - > in parts - quite a misunderstanding of what Open Lineage really is, how > much of an integration it is and what are the reasons why it has been > implemented the way it was implemented in Airflow. > >>> > >>> **Idea**: (Julien - Maybe you can organize it ?): > >>> > >>> Maybe we can have an open-to-everyone presentation/zoom call with > quite some time foreseen to ask questions where you would explain the > community about those integration points (and especially those people who > are worried we are losing something by choosing the OpenLineage > integration). I would love to see such a presentation - specifically > focused on explaining how Open-Lineage is really improving the current > lineage approach and what problems it solves that the existing generic > interface doesn't. > >>> > >>> Just to set the tone and focus for such meeting if we have one: > >>> > >>> For me - when I look at Open Lineage, it is really "this is how > lineage generic interface **should** be done in Airflow". The "generic" > lineage support we have now is very, very basic, I'd even say far too > simplistic. I would even say, it's useless besides a few, very basic use > cases. Simply because there was never a good "receiver" of the information > to cover those cases. > >>> > >>> When you look closely at OpenLineage, it's nothing more than a better > convention of the dictionaries that we send as a metadata, better meta-data > in case of SQL operators (Hooks in the future hopefully), allowing handling > some cases that current lineage simply cannot. Also what open-lineage > integration with Airflow covers better handling of the lifecycle "task" and > "dag" in Airflow to be able to bind lineage data together. That's my > understanding of what we get when we integrate OL in. 
> >>> > >>> I think over the last 2 years Datakin/Astronomer people had worked out > the level of interface that **just works** and if we would like to get the > lineage information from Airflow as useful as it is in OL, we would have to > anyway implement pretty much all of the things they already did. > >>> > >>> I would love (and I think many community members) to take part in such > a call to hear on that particular aspect of the OL integration. > >>> > >>> J. > >>> > >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz < > rafalbieg...@google.com.invalid> wrote: > >>>> > >>>> Hi, > >>>> > >>>> I second/echo the input provided by Eugene and Michal. > >>>> > >>>> In general, Airflow should provide generic interfaces to lineage > backends so it's easy to configure the one preferred by the user. Whether > it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it should > be the user's choice. > >>>> > >>>> We should avoid close integration with any specific lineage backend > due to the reasons already mentioned, i.e. to avoid translations between > lineage backends. Also, we would closely couple one framework (Airflow) > with another one (Open Lineage) - it makes Airflow more complex and less > flexible. Loose coupling between lineage backends and Airflow seems to be > more future-proven. > >>>> > >>>> Regards, Rafal. > >>>> > >>>> > >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem > <jul...@astronomer.io.invalid> wrote: > >>>>> > >>>>> Dear Airflow community, > >>>>> I have transferred the content of the working google doc I shared a > few weeks ago to the Airflow confluence: > >>>>> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow > >>>>> All comments have been answered, I added clarifications to the doc > accordingly and I also added your suggestions to improve the proposal. > >>>>> All that history is linked from the discussion thread link in the > confluence doc if you wish to consult it. 
> >>>>> Thank you all for your feedback and help in the process. > >>>>> Best > >>>>> Julien > >>>>> > >>>>> > >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io> > wrote: > >>>>>> > >>>>>> Thank you for the email Jarek, and Eugene for your suggestions, > >>>>>> I do agree with Jarek's assessment. I don't have very much to add > to his argument, it is very thoughtful! > >>>>>> OpenLineage was started to avoid the cartesian complexity that > Eugene mentions. There's actually that specific illustration in the > OpenLineage doc. > >>>>>> Lineage consumers want to avoid having to understand the lineage > format of each individual observed data transformation layer. And > transformation layers don't want to understand every Metadata store's model > and protocol. > >>>>>> Eugene, about your specific proposal about a global vocabulary of > entities, I think it is a great suggestion. > >>>>>> We can map those entities to Datasets in OpenLineage. The way > OpenLineage models this is by allowing specific facets attached to Dataset. > Facets are pieces of metadata each with their own JsonSchema. > >>>>>> For example a table from a relational database will have a schema > facet when a file in GCS might not. > >>>>>> So I think in Airflow we could have each of the entity classes you > describe be used in the get_openlineage_facets*() API in the Operators. > >>>>>> Each of those classes would know what OpenLineage facets they can > expose. > >>>>>> I'll add a mention in the AIP and I think we can go in more details > in a ticket. > >>>>>> Cheers, > >>>>>> Julien > >>>>>> > >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com> > wrote: > >>>>>>> > >>>>>>> Just a quick personal view on it, Eugene (I bet Julian's answer > will > >>>>>>> be more thoughtful). > >>>>>>> > >>>>>>> I think you are right to the "agnostic" part. But I have one > question > >>>>>>> - what are we considering "agnostic"? 
> >>>>>>> > >>>>>>> There is no "widespread" standard for lineage (yet). Open Lineage > >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to > become > >>>>>>> one. And it's a pretty good candidate: > >>>>>>> > >>>>>>> * designed from grounds-up to be agnostic (Open Lineage was only > >>>>>>> published as an API from day one) > >>>>>>> * as of recently, the ownership and governance of Open Lineage is > with > >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/) which > is > >>>>>>> part of "Linux Foundation Project" - well known and respectful > >>>>>>> foundation that - similarly to the ASF is an umbrella and provides > >>>>>>> governance rules for a big number of well established OSS projects > >>>>>>> > >>>>>>> In essence it is the same approach as we already discussed and > >>>>>>> approved for Open Telemetry (which is governed by CNCF which is in > the > >>>>>>> same league as recognition and governance to LFP) (not yet > implemented > >>>>>>> though). In the case of Open-Telemetry, we decided against > developing > >>>>>>> our "own" existing standard but we opted for one that is out there. > >>>>>>> Yes it is a bit more established and popular than Open Lineage is, > but > >>>>>>> i so wish that we chose and implemented it already (and earlier as > not > >>>>>>> having a standard there - except statsd which is really, really > poor) > >>>>>>> has a great impact on Airflow being just "pluggable" in existing > >>>>>>> solutions for monitoring. (BTW. I hope we implement it soon and I > hear > >>>>>>> (and see) there are attempts to do so). > >>>>>>> > >>>>>>> In the case of Open Lineage, the questions are - is there an > >>>>>>> alternative of the same caliber? Shall we produce our own "agnostic > >>>>>>> standard" for it instead ? Is there a chance the idea of > >>>>>>> "airflow-specific" attributes will catch up and many "consumers" > will > >>>>>>> be writing their own conversions to the way they can consume it? 
> >>>>>>> > >>>>>>> I would really, really try to avoid the pitfalls nicely summarized > >>>>>>> here: https://xkcd.com/927/ > >>>>>>> > >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be > the > >>>>>>> only one supporting Open Lineage. That might happen. Though the > list > >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or > maybe - > >>>>>>> more likely - once Airflow implements it, due to Airflow's > popularity > >>>>>>> and the fact that there is already competition supporting it (e.g. > >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption of > >>>>>>> Open Lineage. My bet is - the latter and for the benefit of the > whole > >>>>>>> ecosystem. I think we have a chance to influence creation of a new, > >>>>>>> important standard. Much less so, I think if we just provide our > own > >>>>>>> custom solution - with lots and lots of work for others to be able > to > >>>>>>> consume it, no time to properly nurture the API and make it easier > to > >>>>>>> implement it (which is undoubtedly what Datakin, Astronomer and now > >>>>>>> LFData & AI run governance main focus is) > >>>>>>> > >>>>>>> Are there other alternatives we should consider ? Do we want to > >>>>>>> develop our own standard (and implement all the integrations from > the > >>>>>>> grounds up) ? > >>>>>>> > >>>>>>> J. > >>>>>>> > >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com> > wrote: > >>>>>>> > > >>>>>>> > Hi Julien. > >>>>>>> > > >>>>>>> > I reviewed the design doc. > >>>>>>> > The general idea looks good to me, but I have some concerns that > I would like to share. > >>>>>>> > > >>>>>>> > If I understand correctly the proposed design is to fill in > "operators" with self-methods to extract lineage metadata from it, and I > agree with the motivation. 
If those are decoupled (in the form of extractors in a separate package) from the operators themselves, then the downside is that (as you mentioned) extractors will be distributed separately and "operators" logic is out of sync with "lineage extraction" logic by design.
> >>>>>>> > Also, knowledge about the internals of an operator spills out of the operator, which is not good at all (at the very least).
> >>>>>>> >
> >>>>>>> > However, if we make every operator expose a method to generate lineage metadata in a specific format, e.g. OpenLineage etc., then we will end up with the cartesian complexity of supporting each backend format in each provider+operator.
> >>>>>>> >
> >>>>>>> > If you say that the goal is that "operators" will always generate OpenLineage format only and each consumer will convert this format to their own internal representation, well, if they do this then this seems like a working approach. But with the assumption that each consumer will support it.
> >>>>>>> >
> >>>>>>> > I think it comes down to the question: is the OpenLineage format popular, complete and proper enough for lineage metadata that every consumer will be convinced to support it? We may also consider issues like a mismatch in lineage feature parity, e.g. OpenLineage supports field-level lineage but the consumer doesn't (or not at the moment), so we would prefer the lineage metadata transferred to the backend to be slightly different in this case.
> >>>>>>> >
> >>>>>>> > What do you think about the idea:
> >>>>>>> > 1. make lineage metadata generated by "operators" agnostic of the specific format, just using entities from a big generic vocabulary of entities, e.g. those created here: https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> >>>>>>> > We would have there e.g.
entities like:
> >>>>>>> > --------------------------------------------------------------------
> >>>>>>> > import attr
> >>>>>>> >
> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >>>>>>> > class PostgresTable:
> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> >>>>>>> >
> >>>>>>> >     host: str = attr.ib()
> >>>>>>> >     port: str = attr.ib()
> >>>>>>> >     database: str = attr.ib()
> >>>>>>> >     schema: str = attr.ib()
> >>>>>>> >     table: str = attr.ib()
> >>>>>>> >
> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >>>>>>> > class GCSEntity:
> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> >>>>>>> >
> >>>>>>> >     bucket: str = attr.ib()
> >>>>>>> >     path: str = attr.ib()
> >>>>>>> >
> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >>>>>>> > class AWSS3Entity:
> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> >>>>>>> >
> >>>>>>> >     bucket: str = attr.ib()
> >>>>>>> >     path: str = attr.ib()
> >>>>>>> > --------------------------------------------------------------------
> >>>>>>> > 2. Implement "adapters" that will act as a bridge between "operators" and backends. Their responsibility will be to convert lineage metadata generated by "operators" to a format understandable by a specific backend.
> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to pass Airflow lineage metadata to the Airflow lineage backend.
> >>>>>>> >
> >>>>>>> > I didn't get exactly the implementation details of your proposed design, but I think maintaining a global vocabulary of entities to use in inlets/outlets of operators is crucial for Airflow, as this could be leveraged to build various features on top of it, like displaying a lineage graph in the Airflow UI (based on XCOM) :)
> >>>>>>> >
> >>>>>>> > Importantly to note, if we decide to send out from Airflow lineage metadata only in OpenLineage format, well, we could then have only one "adapter", OpenLineageAdapter.
But the "adapters" approach leaves us room for adding support for others (following the "pluggable" approach that Airflow is mainly known for and good at).
> >>>>>>> >
> >>>>>>> > All in all:
> >>>>>>> > - global vocabulary of entities used across all "operators" (with all the advantages of it mentioned above)
> >>>>>>> > - "adapters" approach
> >>>>>>> > seem to me the crucial points in the design that make sense to me.
> >>>>>>> >
> >>>>>>> > What do you think about this?
> >>>>>>> >
> >>>>>>> > - Eugene
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
> >>>>>>> >>
> >>>>>>> >> Hello Michał,
> >>>>>>> >> Thank you for your input.
> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about the backend being used to store lineage and is an adapter-like layer.
> >>>>>>> >> OpenLineage exists as a spec specifically for the purpose of avoiding the problem of every lineage consumer having to understand every lineage producer.
> >>>>>>> >> Consumers of lineage want a unified spec for consuming lineage from any data transformation layer like Airflow, Spark, Flink, SQL, Warehouses, ...
> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently of the technology used, so does OpenLineage for lineage.
> >>>>>>> >> Julien
> >>>>>>> >>
> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <michalmod...@google.com> wrote:
> >>>>>>> >>>
> >>>>>>> >>> Hi everyone,
> >>>>>>> >>>
> >>>>>>> >>> As Airflow already supports lineage functionality through pluggable lineage backends, I think OpenLineage and other lineage systems integration should follow this path.
I think more 'native' integration with > OpenLineage (or any other lineage system) in Airflow while maintaining the > generic lineage backend architecture in parallel would make the user > experience less open, troublesome to maintain, and the Airflow architecture > itself more constrained by a logic of a specific system. > >>>>>>> >>> > >>>>>>> >>> I think enriching operators with a generic method exposing > lineage metadata that could be leveraged by lineage backends regardless of > their implementation is a good idea which the Cloud Composer team would > gladly contribute to. I believe the translation of the Airflow metadata > exposed by the operators should be done by lineage backends (or another > adapter-like layer). Tying Airflow operators' development to a specific > lineage system like OpenLineage forces operators' contributors to > understand that system too, which increases both the entry costs and > maintenance costs. I see it as unnecessary coupling. > >>>>>>> >>> > >>>>>>> >>> Best, > >>>>>>> >>> Michal > >>>>>>> >>> > >>>>>>> >>> > >>>>>>> >>> > >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem < > jul...@astronomer.io> wrote: > >>>>>>> >>>> > >>>>>>> >>>> Thank you Eugen, > >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I > think this would work well. > >>>>>>> >>>> Here are the sections in the doc that I think address your > points: > >>>>>>> >>>> - generalize lineage metadata extraction as self-method in > each operator, using generic lineage entities > >>>>>>> >>>> See: OpenLineage support in providers. It describes how each > operator exposes its lineage. > >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data > Lineage format, Open Lineage format, etc. > >>>>>>> >>>> The goal here is each consumer turns from OpenLineage format > to their own internal representation as you are suggesting. 
> >>>>>>> >>>> In the motivation section, towards the end, I link to a few > examples of data catalogs doing just that. > >>>>>>> >>>> > >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev < > eu...@kosteev.com> wrote: > >>>>>>> >>>>> > >>>>>>> >>>>> ++ Michal Modras > >>>>>>> >>>>> > >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev < > eu...@kosteev.com> wrote: > >>>>>>> >>>>>> > >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with > Dataplex" feature which effectively means to generate lineage out of > DAG/task executions and export it to Data Lineage (Data Catalog service) > for further analysis. > >>>>>>> >>>>>> > https://cloud.google.com/composer/docs/composer-2/lineage-integration > >>>>>>> >>>>>> > >>>>>>> >>>>>> This feature is as of now in the "Preview" state. > >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage > backend" feature and methods to extract lineage metadata on task post > execution events. > >>>>>>> >>>>>> > >>>>>>> >>>>>> The general idea was to contribute this to the Airflow > community in a form: > >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in > each operator, using generic lineage entities > >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to > Data Lineage format, Open Lineage format, etc. > >>>>>>> >>>>>> > >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean > to introduce an additional layer of converting from OpenLineage format to > Data Lineage (Data Catalog/Dataplex) format. But this is definitely a > possibility. > >>>>>>> >>>>>> > >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem > <jul...@astronomer.io.invalid> wrote: > >>>>>>> >>>>>>> > >>>>>>> >>>>>>> Thank you very much for your input Jarek. > >>>>>>> >>>>>>> I am responding in the comments and adding to the doc > accordingly. > >>>>>>> >>>>>>> I would also love to hear from more stakeholders. 
> >>>>>>> >>>>>>> Thanks to all who provided feedback so far. > >>>>>>> >>>>>>> Julien > >>>>>>> >>>>>>> > >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk < > ja...@potiuk.com> wrote: > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is > (and should be > >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's > capabilities > >>>>>>> >>>>>>>> greatly and opens up the direction we've been all working > on - Airflow > >>>>>>> >>>>>>>> as a Platform. > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the > same > >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry > goes, where we > >>>>>>> >>>>>>>> might decide to support certain standards in order to > expand > >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to > plug-in multiple > >>>>>>> >>>>>>>> external solutions that would use the standard API. After > Open-Lineage > >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've been > watching this > >>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate > for Airflow > >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players > to make use > >>>>>>> >>>>>>>> of the extra work necessary by the community to make it > "officially > >>>>>>> >>>>>>>> supported". I think we have to also get some feedback > from the big > >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have > such a > >>>>>>> >>>>>>>> capability, and another is to get it used in all the ways > Airflow is > >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is > obviously a > >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is > exposed by > >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. 
we see some > warm words from > >>>>>>> >>>>>>>> Amazon (mentioned by Julian), I would love to hear > whether the > >>>>>>> >>>>>>>> Composer team at Google would be on board in using the > open-lineage > >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and > likely more) > >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other > stakeholders > >>>>>>> >>>>>>>> might want to say something. > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in > implementing and > >>>>>>> >>>>>>>> keeping it running smoothly (as Julian mentioned, that is > the main > >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make > the > >>>>>>> >>>>>>>> integration part of Airflow. But by being smart and > integrating it in > >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI, > verification > >>>>>>> >>>>>>>> process and making some very clear expectations about > what it means > >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can > make some > >>>>>>> >>>>>>>> initial investment in making it happen and minimise > on-going cost, > >>>>>>> >>>>>>>> while maximising the gain. > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help > with all that > >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even > if it will > >>>>>>> >>>>>>>> take an extra effort, especially that we will have > experts from Open > >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage > being the core > >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this > might be the > >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as > an > >>>>>>> >>>>>>>> indispensable component of "even more modern data stack". > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking > forward to > >>>>>>> >>>>>>>> making it happen :). 
> >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> J. > >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem > >>>>>>> >>>>>>>> <jul...@astronomer.io.invalid> wrote: > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > Dear Airflow Community, > >>>>>>> >>>>>>>> > I have been working on a proposal to bring an > OpenLineage provider to Airflow. > >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an > official AIP. > >>>>>>> >>>>>>>> > Please feel free to comment in the doc above. > >>>>>>> >>>>>>>> > Thank you, > >>>>>>> >>>>>>>> > Julien (OpenLineage project lead) > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc: > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > Operational lineage collection is a common need to > understand dependencies between data pipelines and track end-to-end > provenance of data. It enables many use cases from ensuring reliable > delivery of data through observability to compliance and cost management. > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow > capability to enable troubleshooting and governance. > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data > foundation that provides a spec standardizing operational lineage > collection and sharing across the data ecosystem. If it provides plugins > for popular open source projects, its intent is very similar to > OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec > for lineage exchange that projects - open source or proprietary - implement. > >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it > easier and more reliable for Airflow users to publish their operational > lineage through the OpenLineage ecosystem. 
> >>>>>>> >>>>>>>> > > >>>>>>> >>>>>>>> > The current external plugin maintained in the > OpenLineage project depends on Airflow and operators internals and gets > broken when changes are made on those. Having a built-in integration > ensures a better first class support to expose lineage that gets tested > alongside other changes and therefore is more stable. > >>>>>>> >>>>>> > >>>>>>> >>>>>> > >>>>>>> >>>>>> > >>>>>>> >>>>>> -- > >>>>>>> >>>>>> Eugene > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> -- > >>>>>>> >>>>> Eugene > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > -- > >>>>>>> > Eugene >
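
For readers following the entity-vocabulary and adapter discussion in this thread (Eugene's proposal and Julien's reply about mapping those entities to OpenLineage Datasets), a minimal Python sketch of the idea might look like the following. It is only an illustration, not the AIP-53 design and not the OpenLineage client API: standard-library dataclasses stand in for the attrs-based entities, `to_openlineage_dataset` is a hypothetical helper name, plain dicts stand in for real OpenLineage Dataset objects, and the namespace/name layout is an assumption loosely modeled on OpenLineage's dataset naming conventions.

```python
from dataclasses import dataclass
from functools import singledispatch


# Generic, backend-agnostic lineage entities (stand-ins for the attrs
# classes proposed in the thread; field names copied from that proposal).
@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


# Hypothetical "adapter": converts a generic entity into an
# OpenLineage-style dataset description. Other backends could register
# their own converters without touching the operators.
@singledispatch
def to_openlineage_dataset(entity) -> dict:
    raise TypeError(f"no OpenLineage mapping for {type(entity).__name__}")


@to_openlineage_dataset.register
def _(entity: PostgresTable) -> dict:
    return {
        # Assumed layout: namespace identifies the server, name the table.
        "namespace": f"postgres://{entity.host}:{entity.port}",
        "name": f"{entity.database}.{entity.schema}.{entity.table}",
        # Facets (e.g. a schema facet) would be attached here if known.
        "facets": {},
    }


@to_openlineage_dataset.register
def _(entity: GCSEntity) -> dict:
    return {
        "namespace": f"gs://{entity.bucket}",
        "name": entity.path,
        "facets": {},
    }
```

An operator would then only declare entities such as `PostgresTable(...)` as inlets/outlets, and the choice between a single OpenLineage converter or several backend-specific ones stays outside the operator code, which is the trade-off the thread is debating.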