Actually - maybe even bring it to the state in the 3rd decade of the century ;)
On Thu, Feb 3, 2022 at 5:05 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Hello everyone,
>
> Just to give some information on the progress and plans in the
> Open-Telemetry area.
>
> I just had a talk with Howard, and we are going to work together on the
> AIP proposal covering the "whys", "hows", and "whats" of Open-Telemetry
> for Airflow.
>
> We already have enough information from the POC work done by Melodie
> during the internship regarding the "technical capabilities" of
> OpenTelemetry and the ways it can be integrated with Airflow. I think
> that when we join the "product" vision from Howard with my understanding
> of the internals of OpenTelemetry and Airflow, we can come up with a good
> proposal that will be a solid base for discussion and implementation.
>
> We will be working on the proposal together - if anyone would like to
> join now, do let us know and we will include you. I think we will publish
> an AIP proposal relatively soon, and that will start the "real"
> discussion. I know there are many people interested, so we might add a
> dedicated channel in Slack and maybe run a couple of demos/presentations
> of the proposal before we send it up for voting.
>
> Looking forward to getting this one sorted out. I think together we have
> a chance to bring Airflow telemetry in sync with the state of telemetry
> in the 2nd decade of the XXIst century ;)
>
> Thanks Melodie for all the investigation and research there! This
> internship was a really great start and gave me a lot of confidence in
> the next steps we can take.
>
> J.
>
>
> On Wed, Jan 12, 2022 at 4:21 AM Howard Yoo
> <howard....@astronomer.io.invalid> wrote:
>
>> I am very much interested in how we can improve not only the
>> instrumentation by using OpenTelemetry, but also how we can make the
>> existing metrics list better.
>>
>> For example, perhaps in the future we could provide things like how much
>> CPU, memory, and disk I/O a task instance is using, by utilizing
>> Python's psutil package as mentioned in (
>> https://stackoverflow.com/questions/16326529/python-get-process-names-cpu-mem-usage-and-peak-mem-usage-in-windows),
>> because local task jobs are essentially subprocesses. By utilizing
>> OpenTelemetry, we could more easily collect host metrics and platform
>> metrics that are outside the boundary of Airflow - and even have them
>> collected by the OTEL collector agent at the same time.
>>
>> I would be very happy if this internship project can also include
>> collecting metrics in addition to the tracing, and think about how it
>> can be extended to cover more than what's provided out of the box.
>>
>> - Howard
>>
>> On 2022/01/10 21:22:51 Jarek Potiuk wrote:
>> > > Also, I do have feedback that the current metrics list and what the
>> > > metrics track are not really that useful
>> >
>> > Fully agree.
>> >
>> > > (I mean, there is only so much one can do with metrics like operator
>> > > failures and ti failures, since they don't carry any context-specific
>> > > information) - so while we work on making OpenTelemetry available for
>> > > Airflow, we might also review these metrics, verify whether they are
>> > > really helpful, and see whether there are additional metrics we can
>> > > instrument while doing this.
>> >
>> > Oh yeah.
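
A minimal, purely illustrative sketch of the per-task resource sampling Howard
describes above, assuming the psutil package; the function name and the
returned keys are assumptions, not existing Airflow code:

    import psutil

    def sample_task_process(pid: int) -> dict:
        """Snapshot resource usage of the subprocess running a task instance."""
        proc = psutil.Process(pid)
        mem = proc.memory_info()   # rss / vms in bytes
        io = proc.io_counters()    # not available on every platform (e.g. macOS)
        return {
            "cpu_percent": proc.cpu_percent(interval=0.1),  # CPU over a 100 ms window
            "mem_rss_bytes": mem.rss,
            "io_read_bytes": io.read_bytes,
            "io_write_bytes": io.write_bytes,
        }

Such a snapshot could be taken periodically for the local task job's child
process and reported alongside the host metrics the OTEL collector gathers.
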
>> >
>> > > I think when we are designing for distributed traces in Airflow, we
>> > > should also work on defining what kind of traces would be useful and
>> > > how to come up with a better naming convention etc. to make things
>> > > clear and easy to understand.
>> >
>> > Absolutely! I think we have a very clear "separation" and actually
>> > "complementary" work that we should indeed do together:
>> >
>> > 1) In the "internship project" that we are doing together with Melody,
>> > the focus is more on the engineering side - "how can we easily
>> > integrate open-telemetry with Airflow" - seamlessly, in a modular
>> > fashion, and in a way that will be easy to use and test in a
>> > development environment. It is more about solving all the engineering
>> > obstacles of the integration (for example, what we are learning now is
>> > that OpenTelemetry requires some custom code to account for a "forking"
>> > model). It is also about exposing a lot of low-level metrics that are
>> > not Airflow-specific (Flask, DB access, etc. - something that really
>> > helps debug "any" application deployment, not only Airflow). Then we
>> > thought about simply adding the "current" metrics that we have in
>> > statsd as custom ones.
>> >
>> > 2) And I understand that your focus is more on "how can we actually
>> > make a really useful set of Airflow metrics", which ideally complements
>> > the "engineering" part - once we get OT fully integrated we can add
>> > not only (or maybe even not at all) the current metrics but, once you
>> > help define "better" metrics, we can simply implement them in OT -
>> > including some example dashboards etc.
>> >
>> > Happy to collaborate on that!
>> >
>> > J.
>> >
>> >
>> > > - Howard
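
A minimal sketch of how one of the "current" statsd counters (for example,
ti_failures) could be exposed as a custom metric through the OpenTelemetry
Python SDK; the meter name, counter name, attributes, and the console exporter
are illustrative assumptions, not the actual AIP design:

    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # A real deployment would export to an OTEL collector instead of the console.
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("airflow.poc")  # hypothetical instrumentation name
    ti_failures = meter.create_counter(
        "ti_failures", description="Task instance failures"
    )

    # Record one failure with context that a plain statsd counter cannot carry.
    ti_failures.add(1, attributes={"dag_id": "example_dag", "task_id": "example_task"})

Unlike the flat statsd counter, the attributes make the DAG and task context
queryable in whatever backend the collector ships the metrics to.
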