Hello all,
I have to move the OpenLineage presentation to next week.
Sorry for the change.
It will be on Friday next week, March 31st, at 5pm CET / 9am PT.
https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
Julien

On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <jul...@astronomer.io> wrote:

> We are planning to do this session next Thursday at 5pm CET 9am PT. I will
> send a zoom link in advance.
> Julien
>
> On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Cool. I am looking forward to it :). It would be great to get some
>> insight from those who attempted to get the lineage working in several
>> versions of Open Lineage and finally arrived at the current
>> specs/integration.
>>
>> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
>> <jul...@astronomer.io.invalid> wrote:
>> >
>> > Thank you Jarek,
>> > I am happy to organize a zoom presentation about OpenLineage and answer
>> any questions. It is indeed a spec decoupling the data transformation layer
>> from the Metadata store people are using. Just like OpenTelemetry is for
>> service metrics/traces.
>> > Best,
>> > Julien
>> >
>> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>
>> >> And to add a little "parallel" - I think Open Lineage integration
>> replacing our "generic lineage" is a very similar step to the new
>> "Multi-tenant"-ready authentication interface we are discussing in
>> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
>> >>
>> >> Yes - we have a generic authentication interface, but no - it's
>> useless for the case where multi-tenancy and a good level of resource
>> authorization is needed. It's just far too simplistic and limited.
>> >>
>> >> Same with current lineage generic interface - yes, we have it but it's
>> only useful in a limited set of cases, and if we want to step it up we need
>> to come up with something better (and Open Lineage happens to be one that
>> has been developed with Airflow in mind and battle tested).
>> >>
>> >> J.
>> >>
>> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>
>> >>> Hey Rafał (Eugene, Michal - and others who are looking),
>> >>>
>> >>> I think I know where your/Eugene's/Michał's concerns are coming from. And
>> I think it would be great if we can talk it over a bit.  I believe this is
>> - in parts - quite a misunderstanding of what Open Lineage really is, how
>> much of an integration it is and what are the reasons why it has been
>> implemented the way it was implemented in Airflow.
>> >>>
>> >>> **Idea**: (Julien -  Maybe you can organize it ?):
>> >>>
>> >>> Maybe we can have an open-to-everyone presentation/zoom call with
>> plenty of time reserved for questions, where you would explain those
>> integration points to the community (and especially to those people who
>> are worried we are losing something by choosing the OpenLineage
>> integration). I would love to see such a presentation - specifically
>> focused on explaining how Open-Lineage is really improving the current
>> lineage approach and what problems it solves that the existing generic
>> interface doesn't.
>> >>>
>> >>> Just to set the tone and focus for such meeting if we have one:
>> >>>
>> >>> For me - when I look at Open Lineage, it is really "this is how
>> lineage generic interface **should** be done in Airflow". The "generic"
>> lineage support we have now is very, very basic, I'd even say far too
>> simplistic. I would even say, it's useless besides a few, very basic use
>> cases. Simply because there was never a good "receiver" of the information
>> to cover those cases.
>> >>>
>> >>> When you look closely at OpenLineage, it's nothing more than a better
>> convention for the dictionaries that we send as metadata, better metadata
>> in the case of SQL operators (Hooks in the future, hopefully), allowing handling
>> of some cases that current lineage simply cannot. Also, the OpenLineage
>> integration with Airflow covers better handling of the "task" and "dag"
>> lifecycle in Airflow, to be able to bind lineage data together. That's my
>> understanding of what we get when we integrate OL in.
>> >>>
>> >>> I think over the last 2 years Datakin/Astronomer people have worked
>> out the level of interface that **just works** and if we would like to get
>> the lineage information from Airflow as useful as it is in OL, we would
>> have to anyway implement pretty much all of the things they already did.
>> >>>
>> >>> I would love (and I think many community members would too) to take part in
>> such a call to hear about that particular aspect of the OL integration.
>> >>>
>> >>> J.
>> >>>
>> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
>> rafalbieg...@google.com.invalid> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I second/echo the input provided by Eugene and Michal.
>> >>>>
>> >>>> In general, Airflow should provide generic interfaces to lineage
>> backends so it's easy to configure the one preferred by the user. Whether
>> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it should
>> be the user's choice.
>> >>>>
>> >>>> We should avoid close integration with any specific lineage backend
>> due to the reasons already mentioned, i.e. to avoid translations between
>> lineage backends. Also, we would closely couple one framework (Airflow)
>> with another one (Open Lineage) - it makes Airflow more complex and less
>> flexible. Loose coupling between lineage backends and Airflow seems to be
>> more future-proof.
>> >>>>
>> >>>> Regards, Rafal.
>> >>>>
>> >>>>
>> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
>> <jul...@astronomer.io.invalid> wrote:
>> >>>>>
>> >>>>> Dear Airflow community,
>> >>>>> I have transferred the content of the working google doc I shared a
>> few weeks ago to the Airflow confluence:
>> >>>>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>> >>>>> All comments have been answered, I added clarifications to the doc
>> accordingly and I also added your suggestions to improve the proposal.
>> >>>>> All that history is linked from the discussion thread link in the
>> confluence doc if you wish to consult it.
>> >>>>> Thank you all for your feedback and help in the process.
>> >>>>> Best
>> >>>>> Julien
>> >>>>>
>> >>>>>
>> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <jul...@astronomer.io>
>> wrote:
>> >>>>>>
>> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
>> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
>> to his argument, it is very thoughtful!
>> >>>>>> OpenLineage was started to avoid the cartesian complexity that
>> Eugene mentions. There's actually that specific illustration in the
>> OpenLineage doc.
>> >>>>>> Lineage consumers want to avoid having to understand the lineage
>> format of each individual observed data transformation layer. And
>> transformation layers don't want to understand every Metadata store's model
>> and protocol.
>> >>>>>> Eugene, about your specific proposal about a global vocabulary of
>> entities, I think it is a great suggestion.
>> >>>>>> We can map those entities to Datasets in OpenLineage. The way
>> OpenLineage models this is by allowing specific facets attached to Dataset.
>> Facets are pieces of metadata each with their own JsonSchema.
>> >>>>>> For example a table from a relational database will have a schema
>> facet when a file in GCS might not.
>> >>>>>> So I think in Airflow we could have each of the entity classes you
>> describe be used in the get_openlineage_facets*() API in the Operators.
>> >>>>>> Each of those classes would know what OpenLineage facets they can
>> expose.
>> >>>>>> I'll add a mention in the AIP and I think we can go into more
>> detail in a ticket.
>> >>>>>> Cheers,
>> >>>>>> Julien
>> >>>>>>
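As a rough illustration of the entity-to-facet mapping Julien describes in the message above, the sketch below shows how an entity class might report which facets it can expose. Everything in it is hypothetical: the `to_openlineage_dataset` method name, the `columns` field and the plain-dict output are assumptions made for illustration, not the AIP-53 or OpenLineage client API, and the namespace/name layout and `schema` facet only loosely follow OpenLineage conventions. Standard-library dataclasses stand in for `attr` to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class PostgresTable:
    """Hypothetical lineage entity for a Postgres table."""

    host: str
    port: str
    database: str
    schema: str
    table: str
    # Assumed extra field: (column name, column type) pairs, if known.
    columns: List[Tuple[str, str]] = field(default_factory=list)

    def to_openlineage_dataset(self) -> Dict[str, Any]:
        """Return a simplified, OpenLineage-style dataset dict (assumed shape)."""
        facets: Dict[str, Any] = {}
        if self.columns:
            # A relational table can expose a schema facet.
            facets["schema"] = {
                "fields": [{"name": n, "type": t} for n, t in self.columns]
            }
        return {
            "namespace": f"postgres://{self.host}:{self.port}",
            "name": f"{self.database}.{self.schema}.{self.table}",
            "facets": facets,
        }


@dataclass
class GCSEntity:
    """Hypothetical lineage entity for a Google Cloud Storage object."""

    bucket: str
    path: str

    def to_openlineage_dataset(self) -> Dict[str, Any]:
        # A plain file has no schema facet to expose.
        return {"namespace": f"gs://{self.bucket}", "name": self.path, "facets": {}}


orders = PostgresTable(
    host="db", port="5432", database="shop", schema="public", table="orders",
    columns=[("id", "integer"), ("total", "numeric")],
)
export = GCSEntity(bucket="exports", path="orders/2023-02-10.csv")
```

Under these assumptions, an operator's `get_openlineage_facets*()` method could simply delegate to whichever entities it lists as inputs and outputs, so the knowledge of which facets apply stays with the entity class.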
>> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
>> will
>> >>>>>>> be more thoughtful).
>> >>>>>>>
>> >>>>>>> I think you are right about the "agnostic" part. But I have one
>> question
>> >>>>>>> - what are we considering "agnostic"?
>> >>>>>>>
>> >>>>>>>  There is no "widespread" standard for lineage (yet). Open Lineage
>> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
>> become
>> >>>>>>> one. And it's a pretty good candidate:
>> >>>>>>>
>> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
>> >>>>>>> published as an API from day one)
>> >>>>>>> * as of recently, the ownership and governance of Open Lineage is
>> with
>> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/)  which
>> is
>> >>>>>>> part of "Linux Foundation Project" - a well-known and respected
>> >>>>>>> foundation that - similarly to the ASF - is an umbrella and provides
>> >>>>>>> governance rules for a big number of well established OSS projects
>> >>>>>>>
>> >>>>>>> In essence it is the same approach as we already discussed and
>> >>>>>>> approved for Open Telemetry (which is governed by CNCF, which is
>> >>>>>>> in the same league of recognition and governance as LFP), though
>> >>>>>>> it is not yet implemented. In the case of Open-Telemetry, we
>> >>>>>>> decided against developing our "own" standard and opted for one
>> >>>>>>> that is already out there. Yes, it is a bit more established and
>> >>>>>>> popular than Open Lineage is, but I so wish that we had chosen and
>> >>>>>>> implemented it already (and earlier), as not having a standard
>> >>>>>>> there (except statsd, which is really, really poor) has a great
>> >>>>>>> impact on Airflow being just "pluggable" into existing solutions
>> >>>>>>> for monitoring. (BTW, I hope we implement it soon, and I hear
>> >>>>>>> (and see) there are attempts to do so.)
>> >>>>>>>
>> >>>>>>> In the case of Open Lineage, the questions are - is there an
>> >>>>>>> alternative of the same caliber? Shall we produce our own
>> "agnostic
>> >>>>>>> standard" for it instead ? Is there a chance the idea of
>> >>>>>>> "airflow-specific" attributes will catch on and many "consumers"
>> will
>> >>>>>>> be writing their own conversions to the way they can consume it?
>> >>>>>>>
>> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized
>> >>>>>>> here: https://xkcd.com/927/
>> >>>>>>>
>> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be
>> the
>> >>>>>>> only one supporting Open Lineage. That might happen. Though the
>> list
>> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
>> maybe -
>> >>>>>>> more likely - once Airflow implements it, due to Airflow's
>> popularity
>> >>>>>>> and the fact that there is already competition supporting it (e.g.
>> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption
>> of
>> >>>>>>> Open Lineage. My bet is -  the latter and for the benefit of the
>> whole
>> >>>>>>> ecosystem. I think we have a chance to influence creation of a
>> new,
>> >>>>>>> important standard. Much less so, I think if we just provide our
>> own
>> >>>>>>> custom solution - with lots and lots of work for others to be
>> able to
>> >>>>>>> consume it, no time to properly nurture the API and make it
>> easier to
>> >>>>>>> implement it (which is undoubtedly the main focus of Datakin,
>> >>>>>>> Astronomer and now the governance run by LF Data & AI).
>> >>>>>>>
>> >>>>>>> Are there other alternatives we should consider ? Do we want to
>> >>>>>>> develop our own standard (and implement all the integrations from
>> >>>>>>> the ground up)?
>> >>>>>>>
>> >>>>>>> J.
>> >>>>>>>
>> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com>
>> wrote:
>> >>>>>>> >
>> >>>>>>> > Hi Julien.
>> >>>>>>> >
>> >>>>>>> > I reviewed the design doc.
>> >>>>>>> > The general idea looks good to me, but I have some concerns
>> that I would like to share.
>> >>>>>>> >
>> >>>>>>> > If I understand correctly, the proposed design is to fill in
>> "operators" with self-methods to extract lineage metadata from them, and I
>> agree with the motivation. If those are decoupled (in the form of extractors
>> in a separate package) from the operators themselves, then the downside is that (as
>> you mentioned) extractors will be distributed separately and the "operators"
>> logic is out of sync with the "lineage extraction" logic by design.
>> >>>>>>> > Also, knowledge about the internals of an operator spills out of the
>> operator, which is not good at all (at the very least).
>> >>>>>>> >
>> >>>>>>> > However, if we make every operator expose a method to
>> generate lineage metadata in a specific format, e.g. OpenLineage etc.,
>> then we will end up with the cartesian complexity of supporting each
>> backend format in each provider+operator.
>> >>>>>>> >
>> >>>>>>> > If you say that the goal is that "operators" will always
>> generate OpenLineage format only and each consumer will convert this format
>> to their own internal representation, well, if they do this then this seems
>> like a working approach. But with the assumption that each consumer will
>> support it.
>> >>>>>>> >
>> >>>>>>> > I think it comes down to the question: is the OpenLineage format
>> popular, complete and proper enough for lineage metadata that every
>> consumer will be convinced to support it? We may also consider issues like
>> a mismatch in lineage feature parity, e.g. OpenLineage supports field-level
>> lineage but a consumer doesn't support it (or not at the moment), so we would
>> prefer lineage metadata transferred to the backend to be slightly different
>> in this case.
>> >>>>>>> >
>> >>>>>>> > What do you think about the idea:
>> >>>>>>> > 1. make lineage metadata generated by "operators"
>> agnostic of the specific format, just using entities from a big generic
>> vocabulary of entities, e.g. created here
>> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py.
>> We would have there e.g. entities like:
>> >>>>>>> >
>> --------------------------------------------------------------------
>> >>>>>>> > import attr
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class PostgresTable:
>> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
>> >>>>>>> >
>> >>>>>>> >     host: str = attr.ib()
>> >>>>>>> >     port: str = attr.ib()
>> >>>>>>> >     database: str = attr.ib()
>> >>>>>>> >     schema: str = attr.ib()
>> >>>>>>> >     table: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class GCSEntity:
>> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud
>> Storage entity."""
>> >>>>>>> >
>> >>>>>>> >     bucket: str = attr.ib()
>> >>>>>>> >     path: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class AWSS3Entity:
>> >>>>>>> >     """Airflow lineage entity representing generic AWS S3
>> entity."""
>> >>>>>>> >
>> >>>>>>> >     bucket: str = attr.ib()
>> >>>>>>> >     path: str = attr.ib()
>> >>>>>>> >
>> --------------------------------------------------------------------
>> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
>> "operators" and backends. Their responsibility will be to convert lineage
>> metadata generated by "operators" to a format understandable by specific
>> backend.
>> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
>> pass Airflow lineage metadata to the Airflow lineage backend.
>> >>>>>>> >
>> >>>>>>> > I didn't exactly get the implementation details of your proposed
>> design, but I think maintaining a global vocabulary of entities to use in
>> inlets/outlets of operators is crucial for Airflow, as this could be
>> leveraged to build various features on top of it, like displaying a lineage
>> graph in the Airflow UI (based on XCom) :)
>> >>>>>>> >
>> >>>>>>> > It is important to note that if we decide to send lineage metadata out of
>> Airflow only in OpenLineage format, well, we could then have only
>> one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us
>> room for adding support for others (following the "pluggable" approach that
>> Airflow is mainly known for/good at).
>> >>>>>>> >
>> >>>>>>> > All in all:
>> >>>>>>> > - global vocabulary of entities used across all "operators"
>> (with all advantages out of it, mentioned above)
>> >>>>>>> > - "adapters" approach
>> >>>>>>> > seem to me to be the crucial points in the design.
>> >>>>>>> >
>> >>>>>>> > What do you think about this?
>> >>>>>>> >
>> >>>>>>> > - Eugene
>> >>>>>>> >
>> >>>>>>> >
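The "adapters" idea from Eugene's message above can be sketched roughly as follows. This is only an illustration under stated assumptions: `LineageAdapter`, `EdgeListAdapter`, `emit` and the output dictionaries are hypothetical names and shapes invented here (only `OpenLineageAdapter`, `GCSEntity` and `AWSS3Entity` are names taken from the message), and nothing below is an existing Airflow or OpenLineage API. Standard-library dataclasses stand in for `attr`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol, Union


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


@dataclass(frozen=True)
class AWSS3Entity:
    bucket: str
    path: str


Entity = Union[GCSEntity, AWSS3Entity]


class LineageAdapter(Protocol):
    """Hypothetical adapter interface: generic entities in, backend format out."""

    def convert(self, inlets: List[Entity], outlets: List[Entity]) -> Dict[str, Any]:
        ...


class OpenLineageAdapter:
    """Converts entities into a simplified, OpenLineage-like structure."""

    _SCHEMES = {GCSEntity: "gs", AWSS3Entity: "s3"}

    def _dataset(self, entity: Entity) -> Dict[str, str]:
        scheme = self._SCHEMES[type(entity)]
        return {"namespace": f"{scheme}://{entity.bucket}", "name": entity.path}

    def convert(self, inlets: List[Entity], outlets: List[Entity]) -> Dict[str, Any]:
        return {
            "inputs": [self._dataset(e) for e in inlets],
            "outputs": [self._dataset(e) for e in outlets],
        }


class EdgeListAdapter:
    """Stand-in for some other backend that only wants source -> target URIs."""

    def _uri(self, entity: Entity) -> str:
        scheme = "gs" if isinstance(entity, GCSEntity) else "s3"
        return f"{scheme}://{entity.bucket}/{entity.path}"

    def convert(self, inlets: List[Entity], outlets: List[Entity]) -> Dict[str, Any]:
        return {"edges": [(self._uri(i), self._uri(o)) for i in inlets for o in outlets]}


def emit(adapter: LineageAdapter, inlets: List[Entity], outlets: List[Entity]) -> Dict[str, Any]:
    """Operators only declare entities; the configured adapter picks the format."""
    return adapter.convert(inlets, outlets)
```

The point of the design is that operators never import a backend-specific format: supporting a new backend means adding one adapter rather than touching every provider. The counter-argument made elsewhere in the thread is that OpenLineage already aims to be that single exchange format, with each consumer doing its own conversion.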
>> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
>> <jul...@astronomer.io.invalid> wrote:
>> >>>>>>> >>
>> >>>>>>> >> Hello Michał,
>> >>>>>>> >> Thank you for your input.
>> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
>> about the backend being used to store lineage and is an adapter-like layer.
>> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose
>> of avoiding the problem of every lineage consumer having to understand
>> every lineage producer.
>> >>>>>>> >> Consumers of lineage want a unified spec for consuming lineage
>> from any data transformation layer like Airflow, Spark, Flink, SQL,
>> Warehouses, ...
>> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
>> of the technology used, so does OpenLineage for lineage.
>> >>>>>>> >> Julien
>> >>>>>>> >>
>> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
>> michalmod...@google.com> wrote:
>> >>>>>>> >>>
>> >>>>>>> >>> Hi everyone,
>> >>>>>>> >>>
>> >>>>>>> >>> As Airflow already supports lineage functionality through
>> pluggable lineage backends, I think OpenLineage and other lineage systems
>> integration should follow this path. I think more 'native' integration with
>> OpenLineage (or any other lineage system) in Airflow while maintaining the
>> generic lineage backend architecture in parallel would make the user
>> experience less open and more troublesome to maintain, and the Airflow architecture
>> itself more constrained by the logic of a specific system.
>> >>>>>>> >>>
>> >>>>>>> >>> I think enriching operators with a generic method exposing
>> lineage metadata that could be leveraged by lineage backends regardless of
>> their implementation is a good idea which the Cloud Composer team would
>> gladly contribute to. I believe the translation of the Airflow metadata
>> exposed by the operators should be done by lineage backends (or another
>> adapter-like layer). Tying Airflow operators' development to a specific
>> lineage system like OpenLineage forces operators' contributors to
>> understand that system too, which increases both the entry costs and
>> maintenance costs. I see it as unnecessary coupling.
>> >>>>>>> >>>
>> >>>>>>> >>> Best,
>> >>>>>>> >>> Michal
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
>> jul...@astronomer.io> wrote:
>> >>>>>>> >>>>
>> >>>>>>> >>>> Thank you Eugen,
>> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I
>> think this would work well.
>> >>>>>>> >>>> Here are the sections in the doc that I think address your
>> points:
>> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
>> each operator, using generic lineage entities
>> >>>>>>> >>>> See: OpenLineage support in providers. It describes how each
>> operator exposes its lineage.
>> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data
>> Lineage format, Open Lineage format, etc.
>> >>>>>>> >>>> The goal here is that each consumer translates from the OpenLineage format
>> to its own internal representation, as you are suggesting.
>> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
>> examples of data catalogs doing just that.
>> >>>>>>> >>>>
>> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
>> eu...@kosteev.com> wrote:
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> ++ Michal Modras
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
>> eu...@kosteev.com> wrote:
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with
>> Dataplex" feature, which effectively means generating lineage from
>> DAG/task executions and exporting it to Data Lineage (Data Catalog service)
>> for further analysis.
>> >>>>>>> >>>>>>
>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
>> >>>>>>> >>>>>> The current implementation uses the built-in "Airflow lineage
>> backend" feature and methods to extract lineage metadata on task
>> post-execution events.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
>> community in the following form:
>> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in
>> each operator, using generic lineage entities
>> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
>> Data Lineage format, Open Lineage format, etc.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
>> introducing an additional layer converting from OpenLineage format to
>> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
>> possibility.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
>> <jul...@astronomer.io.invalid> wrote:
>> >>>>>>> >>>>>>>
>> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
>> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
>> accordingly.
>> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
>> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
>> >>>>>>> >>>>>>> Julien
>> >>>>>>> >>>>>>>
>> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
>> ja...@potiuk.com> wrote:
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
>> (and should be
>> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
>> capabilities
>> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
>> working on - Airflow
>> >>>>>>> >>>>>>>> as a Platform.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes
>> the same
>> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
>> goes, where we
>> >>>>>>> >>>>>>>> might decide to support certain standards in order to
>> expand
>> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allow
>> plugging in multiple
>> >>>>>>> >>>>>>>> external solutions that would use the standard API.
>> After Open-Lineage
>> >>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
>> watching this
>> >>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate
>> for Airflow
>> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players
>> to make use
>> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
>> "officially
>> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
>> from the big
>> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
>> such a
>> >>>>>>> >>>>>>>> capability, and another is to get it used in all the
>> ways Airflow is
>> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which
>> is obviously a
>> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow
>> is exposed by
>> >>>>>>> >>>>>>>> others - Astronomer is obviously on board, we see some
>> warm words from
>> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), and I would love to hear
>> whether the
>> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
>> open-lineage
>> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
>> likely more)
>> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other
>> stakeholders
>> >>>>>>> >>>>>>>> might want to say something.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
>> implementing and
>> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that
>> is the main
>> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make
>> the
>> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
>> integrating it in
>> >>>>>>> >>>>>>>> a way that will allow us to plug it into our CI and
>> verification
>> >>>>>>> >>>>>>>> process, and by setting some very clear expectations about
>> what it means
>> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
>> make some
>> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
>> on-going cost,
>> >>>>>>> >>>>>>>> while maximising the gain.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
>> with all that
>> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
>> if it will
>> >>>>>>> >>>>>>>> take an extra effort, especially since we will have
>> experts from Open
>> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
>> being the core
>> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
>> might be the
>> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as
>> an
>> >>>>>>> >>>>>>>> indispensable component of "even more modern data stack".
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
>> forward to
>> >>>>>>> >>>>>>>> making it happen :).
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> J.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>> >>>>>>> >>>>>>>> <jul...@astronomer.io.invalid> wrote:
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Dear Airflow Community,
>> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
>> OpenLineage provider to Airflow.
>> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
>> official AIP.
>> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
>> >>>>>>> >>>>>>>> > Thank you,
>> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
>> understand dependencies between data pipelines and track end-to-end
>> provenance of data. It enables many use cases from ensuring reliable
>> delivery of data through observability to compliance and cost management.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
>> capability to enable troubleshooting and governance.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > OpenLineage is a project, part of the LFAI&Data
>> foundation, that provides a spec standardizing operational lineage
>> collection and sharing across the data ecosystem. While it provides plugins
>> for popular open source projects, its intent is very similar to
>> OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec
>> for lineage exchange that projects - open source or proprietary - implement.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
>> easier and more reliable for Airflow users to publish their operational
>> lineage through the OpenLineage ecosystem.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > The current external plugin maintained in the
>> OpenLineage project depends on Airflow and operator internals and
>> breaks when those change. Having a built-in integration
>> ensures better, first-class support for exposing lineage, which gets tested
>> alongside other changes and is therefore more stable.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> --
>> >>>>>>> >>>>>> Eugene
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> --
>> >>>>>>> >>>>> Eugene
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > Eugene
>>
>
