Cool. I am looking forward to it :). It would be great to get some insight from those who attempted to get the lineage working in several versions of Open Lineage and finally arrived at the current specs/integration.
On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem <jul...@astronomer.io.invalid> wrote: > > Thank you Jarek, > I am happy to organize a zoom presentation about OpenLineage and answer any > question. It is indeed a spec decoupling the data transformation layer from > the Metadata store people are using. Just like OpenTelemetry is for service > metrics/traces. > Best, > Julien > > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote: >> >> And to add a little "parallel" - I think Open Lineage integration replacing >> our "generic lineage" is very similar step to the new "Multi-tenant"-ready >> authentication interface we are discussing in >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck >> >> Yes - we have a generic authentication interface, but no - it's useless for >> the case where multi-tenancy and good level of resource authorization is >> needed. It's just far too simplistic and limited. >> >> Same with current lineage generic interface - yes, we have it but it's only >> useful in a limited set of cases. and if we want to step-it-up we need to >> come up with something better (and Open Lineage happens to be one that has >> been developed with Airflow in mind and battle tested). >> >> J. >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote: >>> >>> Hey Rafał (Eugene, Michal - and others who are looking), >>> >>> I think I know where your/Eugen/Michał concerns are coming from. And I >>> think it would be great if we can talk it over a bit. I believe this is - >>> in parts - quite a misunderstanding of what Open Lineage really is, how >>> much of an integration it is and what are the reasons why it has been >>> implemented the way it was implemented in Airflow. 
>>> >>> **Idea**: (Julien - Maybe you can organize it ?): >>> >>> Maybe we can have an open-to-everyone presentation/zoom call with quite >>> some time foreseen to ask questions where you would explain the community >>> about those integration points (and especially those people who are worried >>> we are losing something by choosing the OpenLineage integration). I would >>> love to see such a presentation - specifically focused on explaining how >>> Open-Lineage is really improving the current lineage approach and what >>> problems it solves that the existing generic interface doesn't. >>> >>> Just to set the tone and focus for such meeting if we have one: >>> >>> For me - when I look at Open Lineage, it is really "this is how lineage >>> generic interface **should** be done in Airflow". The "generic" lineage >>> support we have now is very, very basic, I'd even say far too simplistic. I >>> would even say, it's useless besides a few, very basic use cases. Simply >>> because there was never a good "receiver" of the information to cover those >>> cases. >>> >>> When you look closely at OpenLineage, it's nothing more than a better >>> convention of the dictionaries that we send as a metadata, better meta-data >>> in case of SQL operators (Hooks in the future hopefully), allowing handling >>> some cases that current lineage simply cannot. Also what open-lineage >>> integration with Airflow covers better handling of the lifecycle "task" and >>> "dag" in Airflow to be able to bind lineage data together. That's my >>> understanding of what we get when we integrate OL in. >>> >>> I think over the last 2 years Datakin/Astronomer people had worked out the >>> level of interface that **just works** and if we would like to get the >>> lineage information from Airflow as useful as it is in OL, we would have to >>> anyway implement pretty much all of the things they already did. 
>>> >>> I would love (and I think many community members) to take part in such a >>> call to hear on that particular aspect of the OL integration. >>> >>> J. >>> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz >>> <rafalbieg...@google.com.invalid> wrote: >>>> >>>> Hi, >>>> >>>> I second/echo the input provided by Eugene and Michal. >>>> >>>> In general, Airflow should provide generic interfaces to lineage backends >>>> so it's easy to configure the one preferred by the user. Whether it's Open >>>> Lineage, proprietary solution, Dataplex Lineage, etc. it should be the >>>> user's choice. >>>> >>>> We should avoid close integration with any specific lineage backend due to >>>> the reasons already mentioned, i.e. to avoid translations between lineage >>>> backends. Also, we would closely couple one framework (Airflow) with >>>> another one (Open Lineage) - it makes Airflow more complex and less >>>> flexible. Loose coupling between lineage backends and Airflow seems to be >>>> more future-proven. >>>> >>>> Regards, Rafal. >>>> >>>> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem >>>> <jul...@astronomer.io.invalid> wrote: >>>>> >>>>> Dear Airflow community, >>>>> I have transferred the content of the working google doc I shared a few >>>>> weeks ago to the Airflow confluence: >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow >>>>> All comments have been answered, I added clarifications to the doc >>>>> accordingly and I also added your suggestions to improve the proposal. >>>>> All that history is linked from the discussion thread link in the >>>>> confluence doc if you wish to consult it. >>>>> Thank you all for your feedback and help in the process. >>>>> Best >>>>> Julien >>>>> >>>>> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io> >>>>> wrote: >>>>>> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions, >>>>>> I do agree with Jarek's assessment. 
I don't have very much to add to his >>>>>> argument, it is very thoughtful! >>>>>> OpenLineage was started to avoid the cartesian complexity that Eugene >>>>>> mentions. There's actually that specific illustration in the OpenLineage >>>>>> doc. >>>>>> Lineage consumers want to avoid having to understand the lineage format >>>>>> of each individual observed data transformation layer. And >>>>>> transformation layers don't want to understand every Metadata store's >>>>>> model and protocol. >>>>>> Eugene, about your specific proposal about a global vocabulary of >>>>>> entities, I think it is a great suggestion. >>>>>> We can map those entities to Datasets in OpenLineage. The way >>>>>> OpenLineage models this is by allowing specific facets attached to >>>>>> Dataset. Facets are pieces of metadata each with their own JsonSchema. >>>>>> For example a table from a relational database will have a schema facet >>>>>> when a file in GCS might not. >>>>>> So I think in Airflow we could have each of the entity classes you >>>>>> describe be used in the get_openlineage_facets*() API in the Operators. >>>>>> Each of those classes would know what OpenLineage facets they can expose. >>>>>> I'll add a mention in the AIP and I think we can go in more details in a >>>>>> ticket. >>>>>> Cheers, >>>>>> Julien >>>>>> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com> wrote: >>>>>>> >>>>>>> Just a quick personal view on it, Eugene (I bet Julian's answer will >>>>>>> be more thoughtful). >>>>>>> >>>>>>> I think you are right to the "agnostic" part. But I have one question >>>>>>> - what are we considering "agnostic"? >>>>>>> >>>>>>> There is no "widespread" standard for lineage (yet). Open Lineage >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to become >>>>>>> one. 
>>>>>>> And it's a pretty good candidate:
>>>>>>>
>>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
>>>>>>> published as an API from day one)
>>>>>>> * as of recently, the ownership and governance of Open Lineage is with
>>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/), which is
>>>>>>> part of the "Linux Foundation Project" - a well-known and respected
>>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
>>>>>>> governance rules for a large number of well-established OSS projects
>>>>>>>
>>>>>>> In essence it is the same approach as we already discussed and
>>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
>>>>>>> same league as LFP in recognition and governance), though it is not
>>>>>>> yet implemented. In the case of Open-Telemetry, we decided against
>>>>>>> developing our "own" standard and opted for one that is already out
>>>>>>> there. Yes, it is a bit more established and popular than Open Lineage
>>>>>>> is, but I so wish that we had chosen and implemented it already (and
>>>>>>> earlier), as not having a standard there - except statsd, which is
>>>>>>> really, really poor - has a great impact on Airflow being just
>>>>>>> "pluggable" into existing solutions for monitoring. (BTW, I hope we
>>>>>>> implement it soon, and I hear (and see) there are attempts to do so.)
>>>>>>>
>>>>>>> In the case of Open Lineage, the questions are: is there an
>>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
>>>>>>> standard" for it instead? Is there a chance the idea of
>>>>>>> "airflow-specific" attributes will catch on and many "consumers" will
>>>>>>> be writing their own conversions to the way they can consume it?
>>>>>>>
>>>>>>> I would really, really try to avoid the pitfalls nicely summarized
>>>>>>> here: https://xkcd.com/927/
>>>>>>>
>>>>>>> We can of course make a wrong bet and in 2 years Airflow might be the
>>>>>>> only one supporting Open Lineage. That might happen.
Though the list >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe - >>>>>>> more likely - once Airflow implements it, due to Airflow's popularity >>>>>>> and the fact that there is already competition supporting it (e.g. >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption of >>>>>>> Open Lineage. My bet is - the latter and for the benefit of the whole >>>>>>> ecosystem. I think we have a chance to influence creation of a new, >>>>>>> important standard. Much less so, I think if we just provide our own >>>>>>> custom solution - with lots and lots of work for others to be able to >>>>>>> consume it, no time to properly nurture the API and make it easier to >>>>>>> implement it (which is undoubtedly what Datakin, Astronomer and now >>>>>>> LFData & AI run governance main focus is) >>>>>>> >>>>>>> Are there other alternatives we should consider ? Do we want to >>>>>>> develop our own standard (and implement all the integrations from the >>>>>>> grounds up) ? >>>>>>> >>>>>>> J. >>>>>>> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com> >>>>>>> wrote: >>>>>>> > >>>>>>> > Hi Julien. >>>>>>> > >>>>>>> > I reviewed the design doc. >>>>>>> > The general idea looks good to me, but I have some concerns that I >>>>>>> > would like to share. >>>>>>> > >>>>>>> > If I understand correctly the proposed design is to fill in >>>>>>> > "operators" with self-methods to extract lineage metadata from it, >>>>>>> > and I agree with the motivation. If those are decoupled (in a form of >>>>>>> > extractors in separate package) from operators itself, then the >>>>>>> > downsides is that (as you mentioned) - extractors will be distributed >>>>>>> > separately and "operators" logic is out of sync with "lineage >>>>>>> > extraction" logic by design. >>>>>>> > Also knowledge about internals of operator spills out of the operator >>>>>>> > which is not good at all (at the very least). 
>>>>>>> >
>>>>>>> > However, if we make every operator expose a method to generate
>>>>>>> > lineage metadata in a specific format, e.g. OpenLineage etc., then
>>>>>>> > we will end up with the cartesian complexity of supporting each
>>>>>>> > backend format in each provider+operator.
>>>>>>> >
>>>>>>> > If you say that the goal is that "operators" will always generate
>>>>>>> > OpenLineage format only and each consumer will convert this format to
>>>>>>> > their own internal representation, well, if they do this then this
>>>>>>> > seems like a working approach. But only with the assumption that each
>>>>>>> > consumer will support it.
>>>>>>> >
>>>>>>> > I think it comes down to the question: is the OpenLineage format
>>>>>>> > popular, complete and proper enough for lineage metadata that every
>>>>>>> > consumer will be convinced to support it? We may also consider issues
>>>>>>> > like a mismatch in lineage feature parity, e.g. OpenLineage supports
>>>>>>> > field-level lineage but the consumer doesn't support it (or not at
>>>>>>> > the moment), so we would prefer the lineage metadata transferred to
>>>>>>> > the backend to be slightly different in this case.
>>>>>>> >
>>>>>>> > What do you think about the idea:
>>>>>>> > 1. Make lineage metadata generated by "operators" agnostic of
>>>>>>> > the specific format, just using entities from a big generic vocabulary
>>>>>>> > of entities, e.g. created here
>>>>>>> > https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py.
>>>>>>> > We would have there e.g.
>>>>>>> > entities like:
>>>>>>> > --------------------------------------------------------------------
>>>>>>> > import attr
>>>>>>> >
>>>>>>> >
>>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>>> > class PostgresTable:
>>>>>>> >     """Airflow lineage entity representing Postgres table."""
>>>>>>> >
>>>>>>> >     host: str = attr.ib()
>>>>>>> >     port: str = attr.ib()
>>>>>>> >     database: str = attr.ib()
>>>>>>> >     schema: str = attr.ib()
>>>>>>> >     table: str = attr.ib()
>>>>>>> >
>>>>>>> >
>>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>>> > class GCSEntity:
>>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
>>>>>>> >
>>>>>>> >     bucket: str = attr.ib()
>>>>>>> >     path: str = attr.ib()
>>>>>>> >
>>>>>>> >
>>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>>> > class AWSS3Entity:
>>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
>>>>>>> >
>>>>>>> >     bucket: str = attr.ib()
>>>>>>> >     path: str = attr.ib()
>>>>>>> > --------------------------------------------------------------------
>>>>>>> > 2. Implement "adapters" that will act as a bridge between "operators"
>>>>>>> > and backends. Their responsibility will be to convert lineage
>>>>>>> > metadata generated by "operators" to a format understandable by a
>>>>>>> > specific backend.
>>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
>>>>>>> > pass Airflow lineage metadata to the Airflow lineage backend.
>>>>>>> >
>>>>>>> > I didn't get the exact implementation details of your proposed design,
>>>>>>> > but I think maintaining a global vocabulary of entities to use in
>>>>>>> > inlets/outlets of operators is crucial for Airflow, as this could be
>>>>>>> > leveraged to build various features on top of it, like displaying a
>>>>>>> > lineage graph in the Airflow UI (based on XCom) :)
>>>>>>> >
>>>>>>> > It is important to note that if we decide to send out lineage
>>>>>>> > metadata from Airflow only in OpenLineage format, well, we could then
>>>>>>> > have only one "adapter", OpenLineageAdapter.
But the "adapters" approach leaves >>>>>>> > us room for adding support to others (following "pluggable" approach >>>>>>> > as Airflow is mainly known/good about). >>>>>>> > >>>>>>> > All in all: >>>>>>> > - global vocabulary of entities used across all "operators" (with all >>>>>>> > advantages out of it, mentioned above) >>>>>>> > - "adapters" approach >>>>>>> > seems to me crucial points in the design that make sense to me. >>>>>>> > >>>>>>> > What do you think about this? >>>>>>> > >>>>>>> > - Eugene >>>>>>> > >>>>>>> > >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem >>>>>>> > <jul...@astronomer.io.invalid> wrote: >>>>>>> >> >>>>>>> >> Hello Michał, >>>>>>> >> Thank you for your input. >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about >>>>>>> >> the backend being used to store lineage and is an adapter-like layer. >>>>>>> >> OpenLineage exists as the spec specifically for that purpose of >>>>>>> >> avoiding the problem of every lineage consumer having to understand >>>>>>> >> every lineage producer. >>>>>>> >> Consumers of lineage want a unified spec consuming lineage from any >>>>>>> >> data transformation layer like Airflow, Spark, Flink, SQL, >>>>>>> >> Warehouses, ... >>>>>>> >> Just like OpenTelemetry allows consuming traces independently of the >>>>>>> >> technology used, so does OpenLineage for lineage. >>>>>>> >> Julien >>>>>>> >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras >>>>>>> >> <michalmod...@google.com> wrote: >>>>>>> >>> >>>>>>> >>> Hi everyone, >>>>>>> >>> >>>>>>> >>> As Airflow already supports lineage functionality through pluggable >>>>>>> >>> lineage backends, I think OpenLineage and other lineage systems >>>>>>> >>> integration should follow this path. 
I think more 'native' >>>>>>> >>> integration with OpenLineage (or any other lineage system) in >>>>>>> >>> Airflow while maintaining the generic lineage backend architecture >>>>>>> >>> in parallel would make the user experience less open, troublesome >>>>>>> >>> to maintain, and the Airflow architecture itself more constrained >>>>>>> >>> by a logic of a specific system. >>>>>>> >>> >>>>>>> >>> I think enriching operators with a generic method exposing lineage >>>>>>> >>> metadata that could be leveraged by lineage backends regardless of >>>>>>> >>> their implementation is a good idea which the Cloud Composer team >>>>>>> >>> would gladly contribute to. I believe the translation of the >>>>>>> >>> Airflow metadata exposed by the operators should be done by lineage >>>>>>> >>> backends (or another adapter-like layer). Tying Airflow operators' >>>>>>> >>> development to a specific lineage system like OpenLineage forces >>>>>>> >>> operators' contributors to understand that system too, which >>>>>>> >>> increases both the entry costs and maintenance costs. I see it as >>>>>>> >>> unnecessary coupling. >>>>>>> >>> >>>>>>> >>> Best, >>>>>>> >>> Michal >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem >>>>>>> >>> <jul...@astronomer.io> wrote: >>>>>>> >>>> >>>>>>> >>>> Thank you Eugen, >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I think >>>>>>> >>>> this would work well. >>>>>>> >>>> Here are the sections in the doc that I think address your points: >>>>>>> >>>> - generalize lineage metadata extraction as self-method in each >>>>>>> >>>> operator, using generic lineage entities >>>>>>> >>>> See: OpenLineage support in providers. It describes how each >>>>>>> >>>> operator exposes its lineage. >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data >>>>>>> >>>> Lineage format, Open Lineage format, etc. 
>>>>>>> >>>> The goal here is each consumer turns from OpenLineage format to >>>>>>> >>>> their own internal representation as you are suggesting. >>>>>>> >>>> In the motivation section, towards the end, I link to a few >>>>>>> >>>> examples of data catalogs doing just that. >>>>>>> >>>> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eu...@kosteev.com> >>>>>>> >>>> wrote: >>>>>>> >>>>> >>>>>>> >>>>> ++ Michal Modras >>>>>>> >>>>> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <eu...@kosteev.com> >>>>>>> >>>>> wrote: >>>>>>> >>>>>> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with Dataplex" >>>>>>> >>>>>> feature which effectively means to generate lineage out of >>>>>>> >>>>>> DAG/task executions and export it to Data Lineage (Data Catalog >>>>>>> >>>>>> service) for further analysis. >>>>>>> >>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration >>>>>>> >>>>>> >>>>>>> >>>>>> This feature is as of now in the "Preview" state. >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage >>>>>>> >>>>>> backend" feature and methods to extract lineage metadata on task >>>>>>> >>>>>> post execution events. >>>>>>> >>>>>> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow community >>>>>>> >>>>>> in a form: >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in each >>>>>>> >>>>>> operator, using generic lineage entities >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data >>>>>>> >>>>>> Lineage format, Open Lineage format, etc. >>>>>>> >>>>>> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean to >>>>>>> >>>>>> introduce an additional layer of converting from OpenLineage >>>>>>> >>>>>> format to Data Lineage (Data Catalog/Dataplex) format. But this >>>>>>> >>>>>> is definitely a possibility. 
>>>>>>> >>>>>> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem >>>>>>> >>>>>> <jul...@astronomer.io.invalid> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thank you very much for your input Jarek. >>>>>>> >>>>>>> I am responding in the comments and adding to the doc >>>>>>> >>>>>>> accordingly. >>>>>>> >>>>>>> I would also love to hear from more stakeholders. >>>>>>> >>>>>>> Thanks to all who provided feedback so far. >>>>>>> >>>>>>> Julien >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk >>>>>>> >>>>>>> <ja...@potiuk.com> wrote: >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is (and >>>>>>> >>>>>>>> should be >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's >>>>>>> >>>>>>>> capabilities >>>>>>> >>>>>>>> greatly and opens up the direction we've been all working on - >>>>>>> >>>>>>>> Airflow >>>>>>> >>>>>>>> as a Platform. >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the same >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry goes, >>>>>>> >>>>>>>> where we >>>>>>> >>>>>>>> might decide to support certain standards in order to expand >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to plug-in >>>>>>> >>>>>>>> multiple >>>>>>> >>>>>>>> external solutions that would use the standard API. After >>>>>>> >>>>>>>> Open-Lineage >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've been >>>>>>> >>>>>>>> watching this >>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate for >>>>>>> >>>>>>>> Airflow >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players to >>>>>>> >>>>>>>> make use >>>>>>> >>>>>>>> of the extra work necessary by the community to make it >>>>>>> >>>>>>>> "officially >>>>>>> >>>>>>>> supported". 
I think we have to also get some feedback from the >>>>>>> >>>>>>>> big >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have such a >>>>>>> >>>>>>>> capability, and another is to get it used in all the ways >>>>>>> >>>>>>>> Airflow is >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is >>>>>>> >>>>>>>> obviously a >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is >>>>>>> >>>>>>>> exposed by >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. we see some warm >>>>>>> >>>>>>>> words from >>>>>>> >>>>>>>> Amazon (mentioned by Julian), I would love to hear whether the >>>>>>> >>>>>>>> Composer team at Google would be on board in using the >>>>>>> >>>>>>>> open-lineage >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and likely >>>>>>> >>>>>>>> more) >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other >>>>>>> >>>>>>>> stakeholders >>>>>>> >>>>>>>> might want to say something. >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in >>>>>>> >>>>>>>> implementing and >>>>>>> >>>>>>>> keeping it running smoothly (as Julian mentioned, that is the >>>>>>> >>>>>>>> main >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the >>>>>>> >>>>>>>> integration part of Airflow. But by being smart and >>>>>>> >>>>>>>> integrating it in >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI, verification >>>>>>> >>>>>>>> process and making some very clear expectations about what it >>>>>>> >>>>>>>> means >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can make some >>>>>>> >>>>>>>> initial investment in making it happen and minimise on-going >>>>>>> >>>>>>>> cost, >>>>>>> >>>>>>>> while maximising the gain. 
>>>>>>> >>>>>>>> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help with >>>>>>> >>>>>>>> all that >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even if it >>>>>>> >>>>>>>> will >>>>>>> >>>>>>>> take an extra effort, especially that we will have experts >>>>>>> >>>>>>>> from Open >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage being >>>>>>> >>>>>>>> the core >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this might >>>>>>> >>>>>>>> be the >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as an >>>>>>> >>>>>>>> indispensable component of "even more modern data stack". >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking forward >>>>>>> >>>>>>>> to >>>>>>> >>>>>>>> making it happen :). >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> J. >>>>>>> >>>>>>>> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem >>>>>>> >>>>>>>> <jul...@astronomer.io.invalid> wrote: >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > Dear Airflow Community, >>>>>>> >>>>>>>> > I have been working on a proposal to bring an OpenLineage >>>>>>> >>>>>>>> > provider to Airflow. >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an official >>>>>>> >>>>>>>> > AIP. >>>>>>> >>>>>>>> > Please feel free to comment in the doc above. >>>>>>> >>>>>>>> > Thank you, >>>>>>> >>>>>>>> > Julien (OpenLineage project lead) >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc: >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > Operational lineage collection is a common need to >>>>>>> >>>>>>>> > understand dependencies between data pipelines and track >>>>>>> >>>>>>>> > end-to-end provenance of data. It enables many use cases >>>>>>> >>>>>>>> > from ensuring reliable delivery of data through >>>>>>> >>>>>>>> > observability to compliance and cost management. 
>>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow capability >>>>>>> >>>>>>>> > to enable troubleshooting and governance. >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data foundation >>>>>>> >>>>>>>> > that provides a spec standardizing operational lineage >>>>>>> >>>>>>>> > collection and sharing across the data ecosystem. If it >>>>>>> >>>>>>>> > provides plugins for popular open source projects, its >>>>>>> >>>>>>>> > intent is very similar to OpenTelemetry (also under the >>>>>>> >>>>>>>> > Linux Foundation umbrella): to remain a spec for lineage >>>>>>> >>>>>>>> > exchange that projects - open source or proprietary - >>>>>>> >>>>>>>> > implement. >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it easier >>>>>>> >>>>>>>> > and more reliable for Airflow users to publish their >>>>>>> >>>>>>>> > operational lineage through the OpenLineage ecosystem. >>>>>>> >>>>>>>> > >>>>>>> >>>>>>>> > The current external plugin maintained in the OpenLineage >>>>>>> >>>>>>>> > project depends on Airflow and operators internals and gets >>>>>>> >>>>>>>> > broken when changes are made on those. Having a built-in >>>>>>> >>>>>>>> > integration ensures a better first class support to expose >>>>>>> >>>>>>>> > lineage that gets tested alongside other changes and >>>>>>> >>>>>>>> > therefore is more stable. >>>>>>> >>>>>> >>>>>>> >>>>>> >>>>>>> >>>>>> >>>>>>> >>>>>> -- >>>>>>> >>>>>> Eugene >>>>>>> >>>>> >>>>>>> >>>>> >>>>>>> >>>>> >>>>>>> >>>>> -- >>>>>>> >>>>> Eugene >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > -- >>>>>>> > Eugene
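
For readers trying to picture the two positions in this thread side by side, here is a minimal sketch of the "generic entities plus adapter" idea, with the adapter emitting OpenLineage-style datasets (a namespace, a name, and a facets mapping). Everything in it is an assumption made for illustration: plain dataclasses stand in for the attrs classes quoted above, `OpenLineageAdapter`, `convert` and `convert_all` are hypothetical names rather than Airflow or OpenLineage client APIs, the host and bucket values are made up, and the namespace/name layout should be checked against the OpenLineage naming conventions before relying on it.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


# Generic lineage entities, mirroring the vocabulary proposed in the thread.
@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


@dataclass(frozen=True)
class AWSS3Entity:
    bucket: str
    path: str


Entity = Union[PostgresTable, GCSEntity, AWSS3Entity]


class OpenLineageAdapter:
    """Hypothetical adapter: generic entities -> OpenLineage-style dataset dicts."""

    def convert(self, entity: Entity) -> Dict[str, object]:
        # Assumed naming scheme: namespace identifies the data source,
        # name identifies the dataset inside that source.
        if isinstance(entity, PostgresTable):
            namespace = f"postgres://{entity.host}:{entity.port}"
            name = f"{entity.database}.{entity.schema}.{entity.table}"
        elif isinstance(entity, GCSEntity):
            namespace = f"gs://{entity.bucket}"
            name = entity.path
        elif isinstance(entity, AWSS3Entity):
            namespace = f"s3://{entity.bucket}"
            name = entity.path
        else:
            raise TypeError(f"Unsupported lineage entity: {type(entity).__name__}")
        # Facets are left empty here; a table could later carry a schema facet.
        return {"namespace": namespace, "name": name, "facets": {}}

    def convert_all(self, entities: List[Entity]) -> List[Dict[str, object]]:
        return [self.convert(entity) for entity in entities]


# What an operator might declare as its inlets and outlets (made-up values).
inlets = [
    PostgresTable(
        host="db.example.com",
        port="5432",
        database="analytics",
        schema="public",
        table="orders",
    )
]
outlets = [GCSEntity(bucket="reports-bucket", path="daily/orders.csv")]

adapter = OpenLineageAdapter()
inputs = adapter.convert_all(inlets)
outputs = adapter.convert_all(outlets)
```

In this shape a second backend would only need its own adapter class with the same `convert` method, which is the pluggability argued for above; the counter-argument in the thread is that OpenLineage already plays the role of that common format, so each consumer can convert from it on its own side.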