I am very much interested in how we can improve not only the
instrumentation by using OpenTelemetry, but also how we can make the
existing metrics list better.

For example, perhaps in the future we could report how much CPU, memory,
and disk I/O a task instance is using, by utilizing Python's psutil
package as mentioned here:
https://stackoverflow.com/questions/16326529/python-get-process-names-cpu-mem-usage-and-peak-mem-usage-in-windows
since local task jobs are essentially subprocesses. By utilizing
OpenTelemetry, we could also more easily collect host and platform
metrics that sit outside the boundary of Airflow - and even have them
collected by the OTel Collector agent at the same time.
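To make the psutil idea concrete, here is a minimal sketch of sampling a task subprocess; the function name and the 100 ms sampling window are my own choices, and note that `io_counters()` is not available on every platform (e.g. macOS):

```python
import os

import psutil  # third-party: pip install psutil


def sample_process_usage(pid=None):
    """Sample CPU, memory, and disk I/O for a process (defaults to ourselves),
    e.g. the subprocess a local task job runs the task in."""
    proc = psutil.Process(pid or os.getpid())
    # CPU percentage over a short window; called outside oneshot() so the
    # two cpu_times() readings it compares are not served from the cache.
    usage = {"cpu_percent": proc.cpu_percent(interval=0.1)}
    with proc.oneshot():  # batch the remaining system calls
        usage["rss_bytes"] = proc.memory_info().rss  # resident memory
        try:
            io = proc.io_counters()  # no io_counters() on macOS
            usage["read_bytes"] = io.read_bytes
            usage["write_bytes"] = io.write_bytes
        except (AttributeError, psutil.AccessDenied):
            pass  # disk I/O counters are not available on every platform
    return usage
```

A heartbeat loop could call this with the task's PID and emit the values as gauges.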

I would be very happy if this internship project could also include
collecting metrics in addition to the tracing, and consider how it can
be extended to cover more than what's provided out of the box.
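On the collector side, the host metrics mentioned above need no Airflow code at all. As a hedged sketch, an OTel Collector agent running next to Airflow could scrape them with something like the following config (the receiver scrapers are real, but the exporter choice and interval here are illustrative, not a recommendation):

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [prometheus]
```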

- Howard

On 2022/01/10 21:22:51 Jarek Potiuk wrote:
> > Also, I do have feedback that the current metrics list and what they
> > track are not really that useful
> 
> Fully agree.
> 
> > (I mean, there is so much that one can do for metrics like operator
> > failures and ti failures - since they don't post any context-specific
> > information) - so while we work on making OpenTelemetry available for
> > Airflow, we might also review these metrics, verify whether they are
> > really helpful, and see if there are additional metrics that we can
> > instrument while doing this.
> 
> Oh yeah.
> 
> > I think when we are designing for distributed traces in Airflow, we
> > should also work on defining what kind of traces would be useful, and
> > come up with a better naming convention to make things clear and easy
> > to understand.
> 
> Absolutely!  I think we have a very clear "separation" and actually
> "complementary" work that we should indeed do together!
> 
> 1) From the "internship project" that we do together with Melody, the
> focus is more on the engineering side - "how we can easily integrate
> open-telemetry" with Airflow - seamlessly and in a modular fashion and
> in the way that will be easy to use and test in "development
> environment". It is more about solving all engineering obstacles with
> integration (for example, what we have learned so far is that
> OpenTelemetry requires some custom code to account for the "forking"
> model). It is also about exposing a lot of low-level metrics that are
> not Airflow-specific (flask, db access, etc. - something that really
> allows one to debug "any" application deployment, not only Airflow).
> Then we thought about simply adding the "current" metrics that we have
> in statsd as custom ones.
> 
> 2) And I understand that your focus is more on "how we can actually
> make a really useful set of Airflow metrics", which ideally complements
> the "engineering" part - once we get OT fully integrated we can add
> not only (or maybe even not at all) the current metrics but, once you
> help define "better" metrics, we can simply implement them in OT -
> including some example dashboards etc.
> 
> Happy to collaborate on that!
> 
> J.
> 
> 
> > - Howard
> >
> 
