Hello all,
I have to move the OpenLineage presentation to next week. Sorry for the change.
It will be on Friday next week, March 31st, at 5pm CET / 9am PT.
https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
Julien
On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <jul...@astronomer.io> wrote:
> We are planning to do this session next Thursday at 5pm CET 9am PT. I will send a zoom link in advance.
> Julien
>
> On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Cool. I am looking forward to it :). It would be great to get some insight from those who attempted to get the lineage working in several versions of Open Lineage and finally arrived at the current specs/integration.
>>
>> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>> >
>> > Thank you Jarek,
>> > I am happy to organize a zoom presentation about OpenLineage and answer any question. It is indeed a spec decoupling the data transformation layer from the Metadata store people are using. Just like OpenTelemetry is for service metrics/traces.
>> > Best,
>> > Julien
>> >
>> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>
>> >> And to add a little "parallel" - I think Open Lineage integration replacing our "generic lineage" is a very similar step to the new "Multi-tenant"-ready authentication interface we are discussing in https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
>> >>
>> >> Yes - we have a generic authentication interface, but no - it's useless for the case where multi-tenancy and good level of resource authorization is needed. It's just far too simplistic and limited.
>> >>
>> >> Same with current lineage generic interface - yes, we have it but it's only useful in a limited set of cases, and if we want to step-it-up we need to come up with something better (and Open Lineage happens to be one that has been developed with Airflow in mind and battle tested).
>> >>
>> >> J.
>> >>
>> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>
>> >>> Hey Rafał (Eugene, Michal - and others who are looking),
>> >>>
>> >>> I think I know where your/Eugen/Michał concerns are coming from. And I think it would be great if we can talk it over a bit. I believe this is - in parts - quite a misunderstanding of what Open Lineage really is, how much of an integration it is and what are the reasons why it has been implemented the way it was implemented in Airflow.
>> >>>
>> >>> **Idea**: (Julien - Maybe you can organize it ?):
>> >>>
>> >>> Maybe we can have an open-to-everyone presentation/zoom call with quite some time foreseen to ask questions where you would explain to the community about those integration points (and especially those people who are worried we are losing something by choosing the OpenLineage integration). I would love to see such a presentation - specifically focused on explaining how Open-Lineage is really improving the current lineage approach and what problems it solves that the existing generic interface doesn't.
>> >>>
>> >>> Just to set the tone and focus for such meeting if we have one:
>> >>>
>> >>> For me - when I look at Open Lineage, it is really "this is how lineage generic interface **should** be done in Airflow". The "generic" lineage support we have now is very, very basic, I'd even say far too simplistic. I would even say, it's useless besides a few, very basic use cases. Simply because there was never a good "receiver" of the information to cover those cases.
>> >>>
>> >>> When you look closely at OpenLineage, it's nothing more than a better convention of the dictionaries that we send as a metadata, better meta-data in case of SQL operators (Hooks in the future hopefully), allowing handling some cases that current lineage simply cannot. Also what open-lineage integration with Airflow covers is better handling of the lifecycle "task" and "dag" in Airflow to be able to bind lineage data together. That's my understanding of what we get when we integrate OL in.
>> >>>
>> >>> I think over the last 2 years Datakin/Astronomer people had worked out the level of interface that **just works** and if we would like to get the lineage information from Airflow as useful as it is in OL, we would have to anyway implement pretty much all of the things they already did.
>> >>>
>> >>> I would love (and I think many community members) to take part in such a call to hear on that particular aspect of the OL integration.
>> >>>
>> >>> J.
>> >>>
>> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <rafalbieg...@google.com.invalid> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I second/echo the input provided by Eugene and Michal.
>> >>>>
>> >>>> In general, Airflow should provide generic interfaces to lineage backends so it's easy to configure the one preferred by the user. Whether it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it should be the user's choice.
>> >>>>
>> >>>> We should avoid close integration with any specific lineage backend due to the reasons already mentioned, i.e. to avoid translations between lineage backends. Also, we would closely couple one framework (Airflow) with another one (Open Lineage) - it makes Airflow more complex and less flexible. Loose coupling between lineage backends and Airflow seems to be more future-proof.
>> >>>>
>> >>>> Regards, Rafal.
>> >>>>
>> >>>>
>> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>> >>>>>
>> >>>>> Dear Airflow community,
>> >>>>> I have transferred the content of the working google doc I shared a few weeks ago to the Airflow confluence:
>> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>> >>>>> All comments have been answered, I added clarifications to the doc accordingly and I also added your suggestions to improve the proposal.
>> >>>>> All that history is linked from the discussion thread link in the confluence doc if you wish to consult it.
>> >>>>> Thank you all for your feedback and help in the process.
>> >>>>> Best
>> >>>>> Julien
>> >>>>>
>> >>>>>
>> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io> wrote:
>> >>>>>>
>> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
>> >>>>>> I do agree with Jarek's assessment. I don't have very much to add to his argument, it is very thoughtful!
>> >>>>>> OpenLineage was started to avoid the cartesian complexity that Eugene mentions. There's actually that specific illustration in the OpenLineage doc.
>> >>>>>> Lineage consumers want to avoid having to understand the lineage format of each individual observed data transformation layer. And transformation layers don't want to understand every Metadata store's model and protocol.
>> >>>>>> Eugene, about your specific proposal about a global vocabulary of entities, I think it is a great suggestion.
>> >>>>>> We can map those entities to Datasets in OpenLineage. The way OpenLineage models this is by allowing specific facets attached to Dataset. Facets are pieces of metadata each with their own JsonSchema.
>> >>>>>> For example a table from a relational database will have a schema facet when a file in GCS might not.
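To make the facet idea in the message above concrete, here is a minimal sketch using plain Python dictionaries. It is not taken from the thread or from the OpenLineage client library: the `make_dataset` helper, the example host, and the bucket name are hypothetical, and the layout of the `schema` facet (a `fields` list of name/type pairs) is an assumption about how such a facet could look.

```python
from typing import Any, Dict, List, Optional


def make_dataset(
    namespace: str,
    name: str,
    fields: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Build a dataset record; a 'schema' facet is attached only when fields are known."""
    facets: Dict[str, Any] = {}
    if fields is not None:
        # Assumed facet layout: a list of column descriptions.
        facets["schema"] = {"fields": fields}
    return {"namespace": namespace, "name": name, "facets": facets}


# A relational table can describe its columns, so it carries a schema facet.
table = make_dataset(
    "postgres://db.example.com:5432",
    "analytics.public.orders",
    fields=[{"name": "id", "type": "BIGINT"}, {"name": "total", "type": "NUMERIC"}],
)

# A file in GCS may have no known schema, so its facets stay empty.
gcs_file = make_dataset("gs://example-bucket", "exports/orders.csv")
```

Both records share the same shape, so a consumer can handle them uniformly and simply check which facets are present.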
>> >>>>>> So I think in Airflow we could have each of the entity classes you describe be used in the get_openlineage_facets*() API in the Operators.
>> >>>>>> Each of those classes would know what OpenLineage facets they can expose.
>> >>>>>> I'll add a mention in the AIP and I think we can go in more details in a ticket.
>> >>>>>> Cheers,
>> >>>>>> Julien
>> >>>>>>
>> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>>>>>
>> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer will be more thoughtful).
>> >>>>>>>
>> >>>>>>> I think you are right to the "agnostic" part. But I have one question - what are we considering "agnostic"?
>> >>>>>>>
>> >>>>>>> There is no "widespread" standard for lineage (yet). Open Lineage with its donation to Linux Foundation Data & AI is aspiring to become one. And it's a pretty good candidate:
>> >>>>>>>
>> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only published as an API from day one)
>> >>>>>>> * as of recently, the ownership and governance of Open Lineage is with Linux Foundation Data & AI (https://lfaidata.foundation/) which is part of "Linux Foundation Project" - well known and respected foundation that - similarly to the ASF - is an umbrella and provides governance rules for a big number of well established OSS projects
>> >>>>>>>
>> >>>>>>> In essence it is the same approach as we already discussed and approved for Open Telemetry (which is governed by CNCF which is in the same league as recognition and governance to LFP) (not yet implemented though). In the case of Open-Telemetry, we decided against developing our "own" existing standard but we opted for one that is out there.
>> >>>>>>> Yes it is a bit more established and popular than Open Lineage is, but I so wish that we had chosen and implemented it already (and earlier), as not having a standard there - except statsd which is really, really poor - has a great impact on Airflow being just "pluggable" in existing solutions for monitoring. (BTW. I hope we implement it soon and I hear (and see) there are attempts to do so).
>> >>>>>>>
>> >>>>>>> In the case of Open Lineage, the questions are - is there an alternative of the same caliber? Shall we produce our own "agnostic standard" for it instead ? Is there a chance the idea of "airflow-specific" attributes will catch on and many "consumers" will be writing their own conversions to the way they can consume it?
>> >>>>>>>
>> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized here: https://xkcd.com/927/
>> >>>>>>>
>> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be the only one supporting Open Lineage. That might happen. Though the list of "consumers" of Open Lineage is already pretty good IMHO. Or maybe - more likely - once Airflow implements it, due to Airflow's popularity and the fact that there is already competition supporting it (e.g. Amundsen) we will increase the chance of "hockey-stick" adoption of Open Lineage. My bet is - the latter and for the benefit of the whole ecosystem. I think we have a chance to influence creation of a new, important standard. Much less so, I think, if we just provide our own custom solution - with lots and lots of work for others to be able to consume it, no time to properly nurture the API and make it easier to implement it (which is undoubtedly what Datakin, Astronomer and now LFData & AI run governance main focus is)
>> >>>>>>>
>> >>>>>>> Are there other alternatives we should consider ? Do we want to develop our own standard (and implement all the integrations from the ground up) ?
>> >>>>>>>
>> >>>>>>> J.
>> >>>>>>>
>> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com> wrote:
>> >>>>>>> >
>> >>>>>>> > Hi Julien.
>> >>>>>>> >
>> >>>>>>> > I reviewed the design doc.
>> >>>>>>> > The general idea looks good to me, but I have some concerns that I would like to share.
>> >>>>>>> >
>> >>>>>>> > If I understand correctly the proposed design is to fill in "operators" with self-methods to extract lineage metadata from it, and I agree with the motivation. If those are decoupled (in a form of extractors in separate package) from operators itself, then the downside is that (as you mentioned) - extractors will be distributed separately and "operators" logic is out of sync with "lineage extraction" logic by design.
>> >>>>>>> > Also knowledge about internals of operator spills out of the operator which is not good at all (at the very least).
>> >>>>>>> >
>> >>>>>>> > However, if we make every operator expose a method to generate lineage metadata of the specific format, e.g. OpenLineage etc., then we will end up with cartesian complexity of supporting in each provider+operator each backend format.
>> >>>>>>> >
>> >>>>>>> > If you say that the goal is that "operators" will always generate OpenLineage format only and each consumer will convert this format to their own internal representation, well, if they do this then this seems like a working approach. But with the assumption that each consumer will support it.
>> >>>>>>> >
>> >>>>>>> > I think it comes down to the question: is OpenLineage format popular, complete and proper enough for the lineage metadata that every consumer will be convinced to support it. We may also consider issues like mismatch of lineage feature parity, e.g. OpenLineage supports field-level lineage but consumer doesn't support (or not at the moment), so we would prefer lineage metadata transferred to the backend to be slightly different in this case.
>> >>>>>>> >
>> >>>>>>> > What do you think about the idea:
>> >>>>>>> > 1. make lineage metadata generated by "operators" to be agnostic of the specific format, just using entities from big generic vocabulary of entities e.g. created here https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py. We would have there e.g. entities like:
>> >>>>>>> > --------------------------------------------------------------------
>> >>>>>>> > import attr
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class PostgresTable:
>> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
>> >>>>>>> >
>> >>>>>>> >     host: str = attr.ib()
>> >>>>>>> >     port: str = attr.ib()
>> >>>>>>> >     database: str = attr.ib()
>> >>>>>>> >     schema: str = attr.ib()
>> >>>>>>> >     table: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class GCSEntity:
>> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
>> >>>>>>> >
>> >>>>>>> >     bucket: str = attr.ib()
>> >>>>>>> >     path: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class AWSS3Entity:
>> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
>> >>>>>>> >
>> >>>>>>> >     bucket: str = attr.ib()
>> >>>>>>> >     path: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > --------------------------------------------------------------------
>> >>>>>>> > 2. Implement "adapters" that will act as a bridge between "operators" and backends. Their responsibility will be to convert lineage metadata generated by "operators" to a format understandable by specific backend.
>> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to pass Airflow lineage metadata to the Airflow lineage backend.
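The "adapter" in point 2 above could look roughly like the following sketch. It is an illustration only, not code from the thread, the AIP, or Airflow: it uses standard-library dataclasses in place of `attr` so that it runs without extra packages, the function `to_openlineage_dataset` is a hypothetical name, and the namespace/name layout it produces is an assumption about how a backend might identify datasets.

```python
from dataclasses import dataclass
from typing import Dict, Union


# Stand-ins for the generic entities proposed above (dataclasses instead of attr).
@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


@dataclass(frozen=True)
class AWSS3Entity:
    bucket: str
    path: str


Entity = Union[PostgresTable, GCSEntity, AWSS3Entity]


def to_openlineage_dataset(entity: Entity) -> Dict[str, str]:
    """Hypothetical adapter: map a generic entity to a namespace/name pair."""
    if isinstance(entity, PostgresTable):
        return {
            "namespace": f"postgres://{entity.host}:{entity.port}",
            "name": f"{entity.database}.{entity.schema}.{entity.table}",
        }
    if isinstance(entity, GCSEntity):
        return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
    if isinstance(entity, AWSS3Entity):
        return {"namespace": f"s3://{entity.bucket}", "name": entity.path}
    raise TypeError(f"unsupported lineage entity: {type(entity).__name__}")


# Example inlet/outlet pair, converted by the adapter.
inlet = PostgresTable(
    host="db.example.com", port="5432", database="analytics", schema="public", table="orders"
)
outlet = GCSEntity(bucket="example-bucket", path="exports/orders.csv")
converted = [to_openlineage_dataset(e) for e in (inlet, outlet)]
```

Under this arrangement operators only ever produce the generic entities, and supporting another backend means writing one more conversion function rather than touching every operator.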
>> >>>>>>> >
>> >>>>>>> > I didn't get exactly implementation details of your proposed design, but I think maintaining global vocabulary of entities to use in inlets/outlets of operators is crucial for Airflow, as this could be leveraged to build various features on top of it, like displaying lineage graph in Airflow UI (based on XCOM) :)
>> >>>>>>> >
>> >>>>>>> > Important to note, if we decide to send out from Airflow lineage metadata only in OpenLineage format, well, we could then have only one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us room for adding support to others (following "pluggable" approach as Airflow is mainly known/good about).
>> >>>>>>> >
>> >>>>>>> > All in all:
>> >>>>>>> > - global vocabulary of entities used across all "operators" (with all advantages out of it, mentioned above)
>> >>>>>>> > - "adapters" approach
>> >>>>>>> > seem to me crucial points in the design that make sense to me.
>> >>>>>>> >
>> >>>>>>> > What do you think about this?
>> >>>>>>> >
>> >>>>>>> > - Eugene
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>> >>>>>>> >>
>> >>>>>>> >> Hello Michał,
>> >>>>>>> >> Thank you for your input.
>> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about the backend being used to store lineage and is an adapter-like layer.
>> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose of avoiding the problem of every lineage consumer having to understand every lineage producer.
>> >>>>>>> >> Consumers of lineage want a unified spec consuming lineage from any data transformation layer like Airflow, Spark, Flink, SQL, Warehouses, ...
>> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently of the technology used, so does OpenLineage for lineage.
>> >>>>>>> >> Julien
>> >>>>>>> >>
>> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <michalmod...@google.com> wrote:
>> >>>>>>> >>>
>> >>>>>>> >>> Hi everyone,
>> >>>>>>> >>>
>> >>>>>>> >>> As Airflow already supports lineage functionality through pluggable lineage backends, I think OpenLineage and other lineage systems integration should follow this path. I think more 'native' integration with OpenLineage (or any other lineage system) in Airflow while maintaining the generic lineage backend architecture in parallel would make the user experience less open, troublesome to maintain, and the Airflow architecture itself more constrained by a logic of a specific system.
>> >>>>>>> >>>
>> >>>>>>> >>> I think enriching operators with a generic method exposing lineage metadata that could be leveraged by lineage backends regardless of their implementation is a good idea which the Cloud Composer team would gladly contribute to. I believe the translation of the Airflow metadata exposed by the operators should be done by lineage backends (or another adapter-like layer). Tying Airflow operators' development to a specific lineage system like OpenLineage forces operators' contributors to understand that system too, which increases both the entry costs and maintenance costs. I see it as unnecessary coupling.
>> >>>>>>> >>>
>> >>>>>>> >>> Best,
>> >>>>>>> >>> Michal
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <jul...@astronomer.io> wrote:
>> >>>>>>> >>>>
>> >>>>>>> >>>> Thank you Eugen,
>> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I think this would work well.
>> >>>>>>> >>>> Here are the sections in the doc that I think address your points:
>> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in each operator, using generic lineage entities
>> >>>>>>> >>>> See: OpenLineage support in providers. It describes how each operator exposes its lineage.
>> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data Lineage format, Open Lineage format, etc.
>> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage format to their own internal representation as you are suggesting.
>> >>>>>>> >>>> In the motivation section, towards the end, I link to a few examples of data catalogs doing just that.
>> >>>>>>> >>>>
>> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eu...@kosteev.com> wrote:
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> ++ Michal Modras
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <eu...@kosteev.com> wrote:
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with Dataplex" feature which effectively means to generate lineage out of DAG/task executions and export it to Data Lineage (Data Catalog service) for further analysis.
>> >>>>>>> >>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
>> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage backend" feature and methods to extract lineage metadata on task post execution events.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow community in a form:
>> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in each operator, using generic lineage entities
>> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data Lineage format, Open Lineage format, etc.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean to introduce an additional layer of converting from OpenLineage format to Data Lineage (Data Catalog/Dataplex) format. But this is definitely a possibility.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>> >>>>>>> >>>>>>>
>> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
>> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc accordingly.
>> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
>> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
>> >>>>>>> >>>>>>> Julien
>> >>>>>>> >>>>>>>
>> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is (and should be even more) a feature of Airflow that expands Airflow's capabilities greatly and opens up the direction we've been all working on - Airflow as a Platform.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the same direction (also mentioned in the doc) as Open Telemetry goes, where we might decide to support certain standards in order to expand capabilities of Airflow-as-a-platform and allow plugging in multiple external solutions that would use the standard API. After Open-Lineage graduated recently to LFAI&Data foundation (I've been watching this happening from far), it is I think the perfect candidate for Airflow to incorporate it. I hope this will help all the players to make use of the extra work necessary by the community to make it "officially supported". I think we have to also get some feedback from the big stakeholders in Airflow - because one thing is to have such a capability, and another is to get it used in all the ways Airflow is used - not only by on-premise/self-hosted users (which is obviously a huge driving factor) but also everywhere where Airflow is exposed by others - Astronomer is obviously on-board, we see some warm words from Amazon (mentioned by Julien), I would love to hear whether the Composer team at Google would be on board in using the open-lineage information exposed this way in their Data Catalog (and likely more) offering. We have Amundsen and others and possibly other stakeholders might want to say something.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in implementing and keeping it running smoothly (as Julien mentioned, that is the main reason why the Open Lineage community would like to make the integration part of Airflow). But by being smart and integrating it in the way that will allow to plug-it-in into our CI, verification process and making some very clear expectations about what it means for contributors to Airflow to get it running, we can make some initial investment in making it happen and minimise on-going cost, while maximising the gain.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help with all that to make this easy to "swallow" and integrate well, even if it will take an extra effort, especially that we will have experts from Open Lineage who worked with both Airflow and Open Lineage being the core part of the effort. I am actually super excited - this might be the next-big-thing for Airflow to strengthen its position as an indispensable component of "even more modern data stack".
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking forward to making it happen :).
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> J.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Dear Airflow Community,
>> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an OpenLineage provider to Airflow.
>> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an official AIP.
>> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
>> >>>>>>> >>>>>>>> > Thank you,
>> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to understand dependencies between data pipelines and track end-to-end provenance of data. It enables many use cases from ensuring reliable delivery of data through observability to compliance and cost management.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow capability to enable troubleshooting and governance.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data foundation that provides a spec standardizing operational lineage collection and sharing across the data ecosystem. If it provides plugins for popular open source projects, its intent is very similar to OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec for lineage exchange that projects - open source or proprietary - implement.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it easier and more reliable for Airflow users to publish their operational lineage through the OpenLineage ecosystem.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > The current external plugin maintained in the OpenLineage project depends on Airflow and operators internals and gets broken when changes are made on those. Having a built-in integration ensures a better first class support to expose lineage that gets tested alongside other changes and therefore is more stable.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> --
>> >>>>>>> >>>>>> Eugene
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> --
>> >>>>>>> >>>>> Eugene
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > Eugene
>> >