Hello everyone,

Just to give some information on the progress and plans in the
OpenTelemetry area.

I just had a talk with Howard, and we are going to work together on an AIP
proposal covering the "whys", "hows", and "whats" of OpenTelemetry for Airflow.

From the POC work Melodie did during her internship, we already have enough
information about OpenTelemetry's "technical capabilities" and the ways it
can be integrated with Airflow. So I think that when we combine the
"product" vision from Howard with my understanding of the internals of
OpenTelemetry and Airflow, we can come up with a good proposal that
might be a great base for discussion and implementation.

We will be working on the proposal together - if anyone would like to join
now, do let us know and we will include you. But I think relatively soon we
will publish an AIP proposal that will start the "real" discussion. I know
there are many people interested, so we might add a dedicated channel in
Slack and maybe run a couple of demos/presentations of the proposal before
we put it up for a vote.

Looking forward to getting this one sorted out. I think together we have a
chance to bring Airflow's telemetry in sync with the state of the art in
telemetry in the third decade of the 21st century ;)

Thanks Melodie for all the investigation and research there! This
internship was a really great start and gave me a lot of confidence in the
next steps we can take.

J.


On Wed, Jan 12, 2022 at 4:21 AM Howard Yoo <howard....@astronomer.io.invalid>
wrote:

> I am very much interested in improving not only the instrumentation by
> using OpenTelemetry, but also in thinking about how we can make the
> existing metrics list better.
>
> For example, perhaps in the future we can provide things like how
> much CPU, memory, and disk I/O a task instance is using, by utilizing
> Python's psutil package as mentioned here in (
> https://stackoverflow.com/questions/16326529/python-get-process-names-cpu-mem-usage-and-peak-mem-usage-in-windows),
> because local task jobs are essentially subprocesses. By utilizing
> OpenTelemetry, we could even collect host metrics and platform metrics
> from outside the boundary of Airflow more easily - and even have them
> collected by the OTEL collector agent at the same time.
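[Editor's note: a minimal stdlib-only sketch of the idea above. psutil (as in the linked Stack Overflow answer) gives richer live sampling per PID; the standard-library `resource` module shown here only reports aggregate usage of waited-for child processes, but needs no extra dependency. The child command is a hypothetical stand-in for a task subprocess.]

```python
import resource
import subprocess
import sys

# Run a short-lived child process as a stand-in for a local task job
# subprocess; subprocess.run() waits for it to finish.
subprocess.run([sys.executable, "-c", "sum(range(10**6))"], check=True)

# RUSAGE_CHILDREN aggregates resource usage of all waited-for children.
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu_seconds = usage.ru_utime + usage.ru_stime  # user + system CPU time
peak_rss = usage.ru_maxrss  # peak resident set size (KiB on Linux, bytes on macOS)
print(f"child CPU: {cpu_seconds:.3f}s, peak RSS: {peak_rss}")
```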
>
> I would be very happy if this internship project could also include
> collecting metrics in addition to tracing, and think about how it can
> be extended to cover more than what's provided out of the box.
>
> - Howard
>
> On 2022/01/10 21:22:51 Jarek Potiuk wrote:
> > > Also, I do have feedback that the current metrics list and what they
> track are not really that useful
> >
> > Fully agree.
> >
> > > (I mean, there is so much that one can do for metrics like operator
> failures and ti failures - since they don’t post any context specific
> information) - so while we may be working with making OpenTelemetry
> available for airflow, we might also investigate and try improvements on
> reviewing these metrics and really verify whether these metrics are
> helpful, and if there can be additional metrics that we can instrument
> while doing this.
> >
> > Oh yeah.
> >
> > > I think when we are designing for the distributed traces on Airflow,
> we should also work on defining what kind of traces would be useful and how
> to come up with better name convention etc. to make things clear and easy
> to understand, etc..
> >
> > Absolutely!  I think we have a very clear "separation" and actually
> > "complementary" work that we should indeed do together!
> >
> > 1) For the "internship project" that we are doing together with
> > Melodie, the focus is more on the engineering side - "how we can
> > easily integrate OpenTelemetry with Airflow" - seamlessly, in a
> > modular fashion, and in a way that will be easy to use and test in a
> > "development environment". It is more about solving all the
> > engineering obstacles of the integration (for example, what we have
> > learned so far is that OpenTelemetry requires some custom code to
> > account for Airflow's "forking" model). It is also about exposing a
> > lot of low-level metrics that are not Airflow-specific (Flask, DB
> > access, etc.) - something that really allows debugging "any"
> > application deployment, not only Airflow. Then we thought about
> > simply adding the "current" metrics that we have in statsd as custom
> > ones.
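[Editor's note: a minimal stdlib-only illustration of the "forking" issue mentioned above - this is not OpenTelemetry code. The daemon worker thread below merely stands in for a span processor's background export thread: threads started in the parent do not survive fork(), which is why a forked child needs its own telemetry re-initialization.]

```python
import os
import threading

# A daemon worker thread stands in for a background export thread
# started at parent-process telemetry setup time.
worker = threading.Thread(target=threading.Event().wait, daemon=True)
worker.start()

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # In the forked child, only the thread that called fork() survives,
    # so the worker thread no longer exists here.
    os.write(w, str(threading.active_count()).encode())
    os._exit(0)
os.close(w)
child_threads = int(os.read(r, 16).decode())
os.waitpid(pid, 0)
print(f"threads in parent: {threading.active_count()}, in child: {child_threads}")
```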
> >
> > 2) And I understand that your focus is more on "how we can actually
> > make a really useful set of Airflow metrics", which ideally complements
> > the "engineering" part - once we get OT fully integrated, we can add
> > not only (or maybe even not at all) the current metrics but, once you
> > help define "better" metrics, we can simply implement them in OT -
> > including some example dashboards etc.
> >
> > Happy to collaborate on that!
> >
> > J.
> >
> >
> > > - Howard
> > >
> >
>
