Cool. I am looking forward to it :). It would be great to get some
insight from those who attempted to get the lineage working in several
versions of Open Lineage and finally arrived at the current
specs/integration.

On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
<jul...@astronomer.io.invalid> wrote:
>
> Thank you Jarek,
> I am happy to organize a Zoom presentation about OpenLineage and answer any
> questions. It is indeed a spec decoupling the data transformation layer from
> the Metadata store people are using. Just like OpenTelemetry is for service 
> metrics/traces.
> Best,
> Julien
>
> On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> And to add a little "parallel" - I think Open Lineage integration replacing 
>> our "generic lineage" is a very similar step to the new "Multi-tenant"-ready
>> authentication interface we are discussing in 
>> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
>>
>> Yes - we have a generic authentication interface, but no - it's useless for 
>> the case where multi-tenancy and good level of resource authorization is 
>> needed. It's just far too simplistic and limited.
>>
>> Same with the current generic lineage interface - yes, we have it, but it's only
>> useful in a limited set of cases, and if we want to step it up we need to
>> come up with something better (and Open Lineage happens to be one that has 
>> been developed with Airflow in mind and battle tested).
>>
>> J.
>>
>> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>> Hey Rafał (Eugene, Michał - and others who are looking),
>>>
>>> I think I know where your, Eugene's and Michał's concerns are coming from, and I
>>> think it would be great if we can talk it over a bit.  I believe this is -
>>> in part - quite a misunderstanding of what Open Lineage really is, how
>>> much of an integration it is and what the reasons are for implementing
>>> it the way it was implemented in Airflow.
>>>
>>> **Idea**: (Julien -  Maybe you can organize it ?):
>>>
>>> Maybe we can have an open-to-everyone presentation/zoom call, with plenty
>>> of time set aside for questions, where you would explain those integration
>>> points to the community (and especially to those people who are worried
>>> we are losing something by choosing the OpenLineage integration). I would
>>> love to see such a presentation - specifically focused on explaining how 
>>> Open-Lineage is really improving the current lineage approach and what 
>>> problems it solves that the existing generic interface doesn't.
>>>
>>> Just to set the tone and focus for such meeting if we have one:
>>>
>>> For me - when I look at Open Lineage, it is really "this is how the generic
>>> lineage interface **should** be done in Airflow". The "generic" lineage
>>> support we have now is very, very basic, I'd even say far too simplistic. I
>>> would even say it's useless beyond a few very basic use cases. Simply
>>> because there was never a good "receiver" of the information to cover those 
>>> cases.
>>>
>>> When you look closely at OpenLineage, it's nothing more than a better
>>> convention for the dictionaries that we send as metadata, better metadata
>>> in the case of SQL operators (Hooks in the future, hopefully), allowing us to
>>> handle some cases that the current lineage simply cannot.  Also, the OpenLineage
>>> integration with Airflow covers better handling of the "task" and "dag"
>>> lifecycle in Airflow, to be able to bind lineage data together. That's my
>>> understanding of what we get when we integrate OL in.
>>>
>>> I think over the last 2 years the Datakin/Astronomer people have worked out the
>>> level of interface that **just works**, and if we would like to get lineage
>>> information from Airflow that is as useful as it is in OL, we would have to
>>> implement pretty much all of the things they already did anyway.
>>>
>>> I would love (as, I think, would many community members) to take part in such a
>>> call to hear about that particular aspect of the OL integration.
>>>
>>> J.
>>>
>>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz 
>>> <rafalbieg...@google.com.invalid> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I second/echo the input provided by Eugene and Michal.
>>>>
>>>> In general, Airflow should provide generic interfaces to lineage backends 
>>>> so it's easy to configure the one preferred by the user. Whether it's Open 
>>>> Lineage, proprietary solution, Dataplex Lineage, etc. it should be the 
>>>> user's choice.
>>>>
>>>> We should avoid close integration with any specific lineage backend due to 
>>>> the reasons already mentioned, i.e. to avoid translations between lineage 
>>>> backends. Also, we would closely couple one framework (Airflow) with 
>>>> another one (Open Lineage) - it makes Airflow more complex and less 
>>>> flexible. Loose coupling between lineage backends and Airflow seems to be
>>>> more future-proof.
>>>>
>>>> Regards, Rafal.
>>>>
>>>>
>>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem 
>>>> <jul...@astronomer.io.invalid> wrote:
>>>>>
>>>>> Dear Airflow community,
>>>>> I have transferred the content of the working google doc I shared a few 
>>>>> weeks ago to the Airflow confluence:
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>>>>> All comments have been answered, I added clarifications to the doc 
>>>>> accordingly and I also added your suggestions to improve the proposal.
>>>>> All that history is linked from the discussion thread link in the 
>>>>> confluence doc if you wish to consult it.
>>>>> Thank you all for your feedback and help in the process.
>>>>> Best
>>>>> Julien
>>>>>
>>>>>
>>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io> 
>>>>> wrote:
>>>>>>
>>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
>>>>>> I do agree with Jarek's assessment. I don't have very much to add to his 
>>>>>> argument, it is very thoughtful!
>>>>>> OpenLineage was started to avoid the cartesian complexity that Eugene 
>>>>>> mentions. There's actually that specific illustration in the OpenLineage 
>>>>>> doc.
>>>>>> Lineage consumers want to avoid having to understand the lineage format 
>>>>>> of each individual observed data transformation layer. And 
>>>>>> transformation layers don't want to understand every Metadata store's 
>>>>>> model and protocol.
>>>>>> Eugene, about your specific proposal about a global vocabulary of 
>>>>>> entities, I think it is a great suggestion.
>>>>>> We can map those entities to Datasets in OpenLineage. The way 
>>>>>> OpenLineage models this is by allowing specific facets attached to 
>>>>>> Dataset. Facets are pieces of metadata each with their own JsonSchema.
>>>>>> For example a table from a relational database will have a schema facet 
>>>>>> when a file in GCS might not.
>>>>>> So I think in Airflow we could have each of the entity classes you 
>>>>>> describe be used in the get_openlineage_facets*() API in the Operators.
>>>>>> Each of those classes would know what OpenLineage facets they can expose.
>>>>>> I'll add a mention in the AIP and I think we can go into more detail in a
>>>>>> ticket.
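
The facet mapping described above could look roughly like the following. This is only a sketch, not code from the AIP or from the OpenLineage client: the class names `TableEntity` and `FileEntity`, the `facets()` method and the `to_dataset()` helper are hypothetical, and plain dicts stand in for real OpenLineage dataset and facet objects. It only illustrates the idea that each entity class knows which facets it can expose, so a table carries a schema facet while a file in GCS carries none.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TableEntity:
    """Hypothetical lineage entity for a relational table."""

    namespace: str
    name: str
    # Maps column name -> column type.
    columns: Dict[str, str] = field(default_factory=dict)

    def facets(self) -> dict:
        # A table knows its columns, so it can expose a schema facet.
        return {
            "schema": {
                "fields": [
                    {"name": col, "type": typ}
                    for col, typ in self.columns.items()
                ]
            }
        }


@dataclass
class FileEntity:
    """Hypothetical lineage entity for an object in a bucket."""

    namespace: str
    name: str

    def facets(self) -> dict:
        # A plain file has no schema to report, so it exposes no facets.
        return {}


def to_dataset(entity) -> dict:
    """Build a dataset-like dict (namespace, name, facets) from an entity."""
    return {
        "namespace": entity.namespace,
        "name": entity.name,
        "facets": entity.facets(),
    }


# Example values are made up for illustration.
orders = TableEntity(
    namespace="postgres://db.example:5432",
    name="shop.public.orders",
    columns={"id": "int", "total": "numeric"},
)
export = FileEntity(namespace="gs://example-bucket", name="exports/orders.csv")

print(to_dataset(orders))
print(to_dataset(export))
```

An operator's `get_openlineage_facets*()` method could then return such entities for its inputs and outputs, keeping the facet knowledge in the entity classes rather than in every operator.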
>>>>>> Cheers,
>>>>>> Julien
>>>>>>
>>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>>>
>>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer will
>>>>>>> be more thoughtful).
>>>>>>>
>>>>>>> I think you are right about the "agnostic" part. But I have one question
>>>>>>> - what are we considering "agnostic"?
>>>>>>>
>>>>>>>  There is no "widespread" standard for lineage (yet). Open Lineage,
>>>>>>> with its donation to the LF AI & Data Foundation, is aspiring to become
>>>>>>> one. And it's a pretty good candidate:
>>>>>>>
>>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
>>>>>>> published as an API from day one)
>>>>>>> * as of recently, the ownership and governance of Open Lineage is with
>>>>>>> the LF AI & Data Foundation (https://lfaidata.foundation/), which is
>>>>>>> part of the "Linux Foundation Project" - a well-known and respected
>>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
>>>>>>> governance rules for a large number of well-established OSS projects
>>>>>>>
>>>>>>> In essence it is the same approach as we already discussed and
>>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
>>>>>>> same league as LFP in recognition and governance), though it is not yet
>>>>>>> implemented. In the case of Open-Telemetry, we decided against developing
>>>>>>> our "own" standard and opted for an existing one that is out there.
>>>>>>> Yes, it is a bit more established and popular than Open Lineage is, but
>>>>>>> I so wish that we had chosen and implemented it already (and earlier), as
>>>>>>> not having a standard there - except statsd, which is really, really poor -
>>>>>>> has a great impact on Airflow being just "pluggable" into existing
>>>>>>> solutions for monitoring. (BTW, I hope we implement it soon, and I hear
>>>>>>> (and see) there are attempts to do so.)
>>>>>>>
>>>>>>> In the case of Open Lineage, the questions are - is there an
>>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
>>>>>>> standard" for it instead? Is there a chance the idea of
>>>>>>> "airflow-specific" attributes will catch on and many "consumers" will
>>>>>>> be writing their own conversions so that they can consume it?
>>>>>>>
>>>>>>> I would really, really try to avoid the pitfalls nicely summarized
>>>>>>> here: https://xkcd.com/927/
>>>>>>>
>>>>>>> We can of course make a wrong bet and in 2 years Airflow might be the
>>>>>>> only one supporting Open Lineage. That might happen. Though the list
>>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe -
>>>>>>> more likely - once Airflow implements it, due to Airflow's popularity
>>>>>>> and the fact that there is already competition supporting it (e.g.
>>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption of
>>>>>>> Open Lineage. My bet is -  the latter and for the benefit of the whole
>>>>>>> ecosystem. I think we have a chance to influence creation of a new,
>>>>>>> important standard. Much less so, I think, if we just provide our own
>>>>>>> custom solution - with lots and lots of work for others to be able to
>>>>>>> consume it, and no time to properly nurture the API and make it easier to
>>>>>>> implement (which is undoubtedly the main focus of Datakin, Astronomer and
>>>>>>> now the LF AI & Data-run governance).
>>>>>>>
>>>>>>> Are there other alternatives we should consider ? Do we want to
>>>>>>> develop our own standard (and implement all the integrations from the
>>>>>>> ground up)?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com> 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi Julien.
>>>>>>> >
>>>>>>> > I reviewed the design doc.
>>>>>>> > The general idea looks good to me, but I have some concerns that I 
>>>>>>> > would like to share.
>>>>>>> >
>>>>>>> > If I understand correctly, the proposed design is to fill in
>>>>>>> > "operators" with self-methods to extract lineage metadata from them,
>>>>>>> > and I agree with the motivation. If those are decoupled (in the form of
>>>>>>> > extractors in a separate package) from the operators themselves, then the
>>>>>>> > downside is that (as you mentioned) extractors will be distributed
>>>>>>> > separately and the "operators" logic is out of sync with the "lineage
>>>>>>> > extraction" logic by design.
>>>>>>> > Also, knowledge about the internals of an operator spills out of the
>>>>>>> > operator, which is not good at all (at the very least).
>>>>>>> >
>>>>>>> > However, if we make every operator expose a method to generate
>>>>>>> > lineage metadata in a specific format, e.g. OpenLineage etc., then
>>>>>>> > we will end up with the cartesian complexity of supporting each
>>>>>>> > backend format in each provider+operator.
>>>>>>> >
>>>>>>> > If you say that the goal is that "operators" will always generate 
>>>>>>> > OpenLineage format only and each consumer will convert this format to 
>>>>>>> > their own internal representation, well, if they do this then this 
>>>>>>> > seems like a working approach. But with the assumption that each 
>>>>>>> > consumer will support it.
>>>>>>> >
>>>>>>> > I think it comes down to the question: is the OpenLineage format
>>>>>>> > popular, complete and proper enough for lineage metadata that every
>>>>>>> > consumer will be convinced to support it? We may also consider issues
>>>>>>> > like a mismatch in lineage feature parity, e.g. OpenLineage supports
>>>>>>> > field-level lineage but a consumer doesn't support it (or not at the
>>>>>>> > moment), so we would prefer the lineage metadata transferred to the
>>>>>>> > backend to be slightly different in this case.
>>>>>>> >
>>>>>>> > What do you think about the idea:
>>>>>>> > 1. make lineage metadata generated by "operators" agnostic of
>>>>>>> > the specific format, just using entities from a big generic vocabulary
>>>>>>> > of entities, e.g. created here
>>>>>>> > https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py.
>>>>>>> >  We would have there e.g. entities like:
>>>>>>> > --------------------------------------------------------------------
>>>>>>> > import attr
>>>>>>> >
>>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>>> > class PostgresTable:
>>>>>>> >     """Airflow lineage entity representing Postgres table."""
>>>>>>> >
>>>>>>> >     host: str = attr.ib()
>>>>>>> >     port: str = attr.ib()
>>>>>>> >     database: str = attr.ib()
>>>>>>> >     schema: str = attr.ib()
>>>>>>> >     table: str = attr.ib()
>>>>>>> >
>>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>>> > class GCSEntity:
>>>>>>> >     """Airflow lineage entity representing generic Google Cloud 
>>>>>>> > Storage entity."""
>>>>>>> >
>>>>>>> >     bucket: str = attr.ib()
>>>>>>> >     path: str = attr.ib()
>>>>>>> >
>>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>>>>>>> > class AWSS3Entity:
>>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
>>>>>>> >
>>>>>>> >     bucket: str = attr.ib()
>>>>>>> >     path: str = attr.ib()
>>>>>>> > --------------------------------------------------------------------
>>>>>>> > 2. Implement "adapters" that will act as a bridge between "operators" 
>>>>>>> > and backends. Their responsibility will be to convert lineage 
>>>>>>> > metadata generated by "operators" to a format understandable by a
>>>>>>> > specific backend.
>>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
>>>>>>> > pass Airflow lineage metadata to the Airflow lineage backend.
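
To make the two-step proposal above concrete, here is a minimal sketch of format-agnostic entities plus per-backend adapters. It is not code from Airflow or from the thread: `NamespaceNameAdapter`, `UriAdapter` and `export_lineage` are hypothetical names, the standard-library `dataclasses` module is used instead of `attr` only to keep the example self-contained, and the namespace/name layout is loosely modeled on how OpenLineage identifies datasets (the exact naming scheme here is an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostgresTable:
    """Format-agnostic lineage entity for a Postgres table."""

    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Format-agnostic lineage entity for a Google Cloud Storage object."""

    bucket: str
    path: str


class NamespaceNameAdapter:
    """Hypothetical adapter producing namespace/name dicts."""

    def convert(self, entity) -> dict:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported lineage entity: {type(entity).__name__}")


class UriAdapter:
    """Hypothetical adapter for a backend that wants one URI string."""

    def convert(self, entity) -> str:
        if isinstance(entity, PostgresTable):
            return (
                f"postgres://{entity.host}:{entity.port}/"
                f"{entity.database}/{entity.schema}/{entity.table}"
            )
        if isinstance(entity, GCSEntity):
            return f"gs://{entity.bucket}/{entity.path}"
        raise TypeError(f"unsupported lineage entity: {type(entity).__name__}")


def export_lineage(adapter, inlets, outlets) -> dict:
    """Convert a task's inlets/outlets with whichever adapter is plugged in."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }


# Example values are made up for illustration.
source = PostgresTable(
    host="db.example", port="5432", database="shop", schema="public", table="orders"
)
target = GCSEntity(bucket="example-bucket", path="exports/orders.csv")

print(export_lineage(NamespaceNameAdapter(), [source], [target]))
print(export_lineage(UriAdapter(), [source], [target]))
```

The operators only ever see the entity classes; supporting another backend means adding one adapter class, not touching each operator.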
>>>>>>> >
>>>>>>> > I didn't exactly get the implementation details of your proposed design,
>>>>>>> > but I think maintaining a global vocabulary of entities to use in
>>>>>>> > inlets/outlets of operators is crucial for Airflow, as this could be
>>>>>>> > leveraged to build various features on top of it, like displaying a
>>>>>>> > lineage graph in the Airflow UI (based on XCom) :)
>>>>>>> >
>>>>>>> > Important to note: if we decide to send out lineage metadata from
>>>>>>> > Airflow only in the OpenLineage format, well, we could then have only
>>>>>>> > one "adapter", OpenLineageAdapter. But the "adapters" approach leaves
>>>>>>> > us room for adding support for others (following the "pluggable" approach
>>>>>>> > that Airflow is mainly known for and good at).
>>>>>>> >
>>>>>>> > All in all:
>>>>>>> > - a global vocabulary of entities used across all "operators" (with all
>>>>>>> > the advantages that come with it, mentioned above)
>>>>>>> > - the "adapters" approach
>>>>>>> > seem to me to be the crucial points in the design.
>>>>>>> >
>>>>>>> > What do you think about this?
>>>>>>> >
>>>>>>> > - Eugene
>>>>>>> >
>>>>>>> >
>>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem 
>>>>>>> > <jul...@astronomer.io.invalid> wrote:
>>>>>>> >>
>>>>>>> >> Hello Michał,
>>>>>>> >> Thank you for your input.
>>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about 
>>>>>>> >> the backend being used to store lineage and is an adapter-like layer.
>>>>>>> >> OpenLineage exists as the spec specifically for that purpose of 
>>>>>>> >> avoiding the problem of every lineage consumer having to understand 
>>>>>>> >> every lineage producer.
>>>>>>> >> Consumers of lineage want a unified spec consuming lineage from any 
>>>>>>> >> data transformation layer like Airflow, Spark, Flink, SQL, 
>>>>>>> >> Warehouses, ...
>>>>>>> >> Just like OpenTelemetry allows consuming traces independently of the 
>>>>>>> >> technology used, so does OpenLineage for lineage.
>>>>>>> >> Julien
>>>>>>> >>
>>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras 
>>>>>>> >> <michalmod...@google.com> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi everyone,
>>>>>>> >>>
>>>>>>> >>> As Airflow already supports lineage functionality through pluggable 
>>>>>>> >>> lineage backends, I think OpenLineage and other lineage systems 
>>>>>>> >>> integration should follow this path. I think more 'native' 
>>>>>>> >>> integration with OpenLineage (or any other lineage system) in 
>>>>>>> >>> Airflow while maintaining the generic lineage backend architecture 
>>>>>>> >>> in parallel would make the user experience less open and more
>>>>>>> >>> troublesome to maintain, and the Airflow architecture itself more
>>>>>>> >>> constrained by the logic of a specific system.
>>>>>>> >>>
>>>>>>> >>> I think enriching operators with a generic method exposing lineage 
>>>>>>> >>> metadata that could be leveraged by lineage backends regardless of 
>>>>>>> >>> their implementation is a good idea which the Cloud Composer team 
>>>>>>> >>> would gladly contribute to. I believe the translation of the 
>>>>>>> >>> Airflow metadata exposed by the operators should be done by lineage 
>>>>>>> >>> backends (or another adapter-like layer). Tying Airflow operators' 
>>>>>>> >>> development to a specific lineage system like OpenLineage forces 
>>>>>>> >>> operators' contributors to understand that system too, which 
>>>>>>> >>> increases both the entry costs and maintenance costs. I see it as 
>>>>>>> >>> unnecessary coupling.
>>>>>>> >>>
>>>>>>> >>> Best,
>>>>>>> >>> Michal
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem 
>>>>>>> >>> <jul...@astronomer.io> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Thank you Eugen,
>>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I think 
>>>>>>> >>>> this would work well.
>>>>>>> >>>> Here are the sections in the doc that I think address your points:
>>>>>>> >>>> - generalize lineage metadata extraction as self-method in each 
>>>>>>> >>>> operator, using generic lineage entities
>>>>>>> >>>> See: OpenLineage support in providers. It describes how each 
>>>>>>> >>>> operator exposes its lineage.
>>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data 
>>>>>>> >>>> Lineage format, Open Lineage format, etc.
>>>>>>> >>>> The goal here is that each consumer converts from the OpenLineage
>>>>>>> >>>> format to their own internal representation, as you are suggesting.
>>>>>>> >>>> In the motivation section, towards the end, I link to a few 
>>>>>>> >>>> examples of data catalogs doing just that.
>>>>>>> >>>>
>>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eu...@kosteev.com> 
>>>>>>> >>>> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> ++ Michal Modras
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <eu...@kosteev.com> 
>>>>>>> >>>>> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with Dataplex"
>>>>>>> >>>>>> feature, which effectively means generating lineage out of
>>>>>>> >>>>>> DAG/task executions and exporting it to Data Lineage (Data Catalog
>>>>>>> >>>>>> service) for further analysis.
>>>>>>> >>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>>>>>>> >>>>>>
>>>>>>> >>>>>> This feature is as of now in the "Preview" state.
>>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage 
>>>>>>> >>>>>> backend" feature and methods to extract lineage metadata on task 
>>>>>>> >>>>>> post execution events.
>>>>>>> >>>>>>
>>>>>>> >>>>>> The general idea was to contribute this to the Airflow community 
>>>>>>> >>>>>> in a form:
>>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in each 
>>>>>>> >>>>>> operator, using generic lineage entities
>>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data 
>>>>>>> >>>>>> Lineage format, Open Lineage format, etc.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
>>>>>>> >>>>>> introducing an additional layer converting from the OpenLineage
>>>>>>> >>>>>> format to the Data Lineage (Data Catalog/Dataplex) format. But this
>>>>>>> >>>>>> is definitely a possibility.
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem 
>>>>>>> >>>>>> <jul...@astronomer.io.invalid> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Thank you very much for your input Jarek.
>>>>>>> >>>>>>> I am responding in the comments and adding to the doc 
>>>>>>> >>>>>>> accordingly.
>>>>>>> >>>>>>> I would also love to hear from more stakeholders.
>>>>>>> >>>>>>> Thanks to all who provided feedback so far.
>>>>>>> >>>>>>> Julien
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk 
>>>>>>> >>>>>>> <ja...@potiuk.com> wrote:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is (and 
>>>>>>> >>>>>>>> should be
>>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's 
>>>>>>> >>>>>>>> capabilities
>>>>>>> >>>>>>>> greatly and opens up the direction we've been all working on - 
>>>>>>> >>>>>>>> Airflow
>>>>>>> >>>>>>>> as a Platform.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the same
>>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry goes, 
>>>>>>> >>>>>>>> where we
>>>>>>> >>>>>>>> might decide to support certain standards in order to expand
>>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allow plugging in
>>>>>>> >>>>>>>> multiple
>>>>>>> >>>>>>>> external solutions that would use the standard API. After 
>>>>>>> >>>>>>>> Open-Lineage
>>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been 
>>>>>>> >>>>>>>> watching this
>>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate for 
>>>>>>> >>>>>>>> Airflow
>>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players to 
>>>>>>> >>>>>>>> make use
>>>>>>> >>>>>>>> of the extra work necessary by the community to make it 
>>>>>>> >>>>>>>> "officially
>>>>>>> >>>>>>>> supported". I think we have to also get some feedback from the 
>>>>>>> >>>>>>>> big
>>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have such a
>>>>>>> >>>>>>>> capability, and another is to get it used in all the ways 
>>>>>>> >>>>>>>> Airflow is
>>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is 
>>>>>>> >>>>>>>> obviously a
>>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is 
>>>>>>> >>>>>>>> exposed by
>>>>>>> >>>>>>>> others - Astronomer is obviously on board. We see some warm
>>>>>>> >>>>>>>> words from
>>>>>>> >>>>>>>> Amazon (mentioned by Julien), I would love to hear whether the
>>>>>>> >>>>>>>> Composer team at Google would be on board in using the 
>>>>>>> >>>>>>>> open-lineage
>>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and likely 
>>>>>>> >>>>>>>> more)
>>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other 
>>>>>>> >>>>>>>> stakeholders
>>>>>>> >>>>>>>> might want to say something.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in 
>>>>>>> >>>>>>>> implementing and
>>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that is the
>>>>>>> >>>>>>>> main
>>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the
>>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
>>>>>>> >>>>>>>> integrating it in
>>>>>>> >>>>>>>> a way that will allow plugging it into our CI and verification
>>>>>>> >>>>>>>> process, and setting some very clear expectations about what it
>>>>>>> >>>>>>>> means
>>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can make some
>>>>>>> >>>>>>>> initial investment in making it happen and minimise on-going 
>>>>>>> >>>>>>>> cost,
>>>>>>> >>>>>>>> while maximising the gain.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> And looking at all the above - I am super happy to help with 
>>>>>>> >>>>>>>> all that
>>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even if it 
>>>>>>> >>>>>>>> will
>>>>>>> >>>>>>>> take an extra effort, especially since we will have experts
>>>>>>> >>>>>>>> from Open
>>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage being 
>>>>>>> >>>>>>>> the core
>>>>>>> >>>>>>>> part of the effort. I am actually super excited - this might 
>>>>>>> >>>>>>>> be the
>>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as an
>>>>>>> >>>>>>>> indispensable component of "even more modern data stack".
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking forward 
>>>>>>> >>>>>>>> to
>>>>>>> >>>>>>>> making it happen :).
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> J.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>>>>>>> >>>>>>>> <jul...@astronomer.io.invalid> wrote:
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > Dear Airflow Community,
>>>>>>> >>>>>>>> > I have been working on a proposal to bring an OpenLineage 
>>>>>>> >>>>>>>> > provider to Airflow.
>>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an official 
>>>>>>> >>>>>>>> > AIP.
>>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
>>>>>>> >>>>>>>> > Thank you,
>>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > Operational lineage collection is a common need to 
>>>>>>> >>>>>>>> > understand dependencies between data pipelines and track 
>>>>>>> >>>>>>>> > end-to-end provenance of data. It enables many use cases 
>>>>>>> >>>>>>>> > from ensuring reliable delivery of data through 
>>>>>>> >>>>>>>> > observability to compliance and cost management.
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow capability 
>>>>>>> >>>>>>>> > to enable troubleshooting and governance.
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data foundation 
>>>>>>> >>>>>>>> > that provides a spec standardizing operational lineage 
>>>>>>> >>>>>>>> > collection and sharing across the data ecosystem. While it
>>>>>>> >>>>>>>> > provides plugins for popular open source projects, its 
>>>>>>> >>>>>>>> > intent is very similar to OpenTelemetry (also under the 
>>>>>>> >>>>>>>> > Linux Foundation umbrella): to remain a spec for lineage 
>>>>>>> >>>>>>>> > exchange that projects - open source or proprietary - 
>>>>>>> >>>>>>>> > implement.
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it easier 
>>>>>>> >>>>>>>> > and more reliable for Airflow users to publish their 
>>>>>>> >>>>>>>> > operational lineage through the OpenLineage ecosystem.
>>>>>>> >>>>>>>> >
>>>>>>> >>>>>>>> > The current external plugin maintained in the OpenLineage 
>>>>>>> >>>>>>>> > project depends on Airflow and operators internals and gets 
>>>>>>> >>>>>>>> > broken when changes are made on those. Having a built-in 
>>>>>>> >>>>>>>> > integration ensures better first-class support to expose
>>>>>>> >>>>>>>> > lineage that gets tested alongside other changes and 
>>>>>>> >>>>>>>> > therefore is more stable.
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> --
>>>>>>> >>>>>> Eugene
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> --
>>>>>>> >>>>> Eugene
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Eugene
