Re: Re: Re: [DISCUSS] airflow telemetry : improve with open telemetry

melodie ezeani Thu, 03 Feb 2022 08:59:17 -0800

Thanks, Jarek. I'm happy I could help.

On Thu, Feb 3, 2022 at 5:07 PM Jarek Potiuk <ja...@potiuk.com> wrote:


> Actually - maybe even bring it to the state in the 3rd decade of
> the century ;)
>
> On Thu, Feb 3, 2022 at 5:05 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Hello everyone,
>>
>> Just to give some information on the progress and plans in the
>> Open-Telemetry area.
>>
>> I just had a talk with Howard, and we are going to work together on the
>> AIP proposal on the "whys", "hows", and "whats" of OpenTelemetry for
>> Airflow.
>>
>> We already have enough information from the POC work done by Melodie
>> during the internship about the "technical capabilities" of
>> OpenTelemetry and the ways it can be integrated with Airflow. I think
>> that when we combine the "Product" vision from Howard with my
>> understanding of the internals of OpenTelemetry and Airflow, we can come
>> up with a good proposal that might be a great base for discussion and
>> implementation.
>>
>> We will be working on the proposal together - if there is anyone who
>> would like to join now - do let us know and we will bring you in. But I
>> think relatively soon we will publish an AIP proposal that will start the
>> "real" discussion. I know there are many people interested, so we might
>> add a dedicated channel in Slack and maybe run a couple of
>> demos/presentations of the proposal before we send it up for voting.
>>
>> Looking forward to getting this one sorted out. I think we have a chance
>> together to bring Airflow telemetry to a state in sync with the state of
>> telemetry in the 2nd decade of the XXIst century ;)
>>
>> Thanks Melodie for all the investigation and research there! This
>> internship was a really great start and gave me a lot of confidence in the
>> next steps we can take there.
>>
>> J.
>>
>>
>> On Wed, Jan 12, 2022 at 4:21 AM Howard Yoo
>> <howard....@astronomer.io.invalid> wrote:
>>
>>> I am very much interested in how we can improve not only the
>>> instrumentation by using OpenTelemetry, but also the existing metrics
>>> list.
>>>
>>> For example, in the future we could perhaps provide things like how
>>> much CPU, memory, and disk I/O a task instance is using, by utilizing
>>> Python’s psutil package as mentioned here (
>>> https://stackoverflow.com/questions/16326529/python-get-process-names-cpu-mem-usage-and-peak-mem-usage-in-windows),
>>> because local task jobs are essentially subprocesses. By utilizing
>>> OpenTelemetry, we could even more easily collect host metrics and platform
>>> metrics that are outside the boundary of Airflow - and even have them
>>> collected by the OTEL collector agent at the same time.
>>>
>>> I would be very happy if this internship project could also include
>>> collecting metrics in addition to tracing, and think about how it
>>> can be extended to cover more than what’s provided out of the box.
>>>
>>> - Howard
>>>
>>> On 2022/01/10 21:22:51 Jarek Potiuk wrote:
>>> > > Also, I do have feedback that the current metrics list and what the
>>> > > metrics track are not really that useful
>>> >
>>> > Fully agree.
>>> >
>>> > > (I mean, there is only so much that one can do with metrics like
>>> > > operator failures and ti failures, since they don’t post any
>>> > > context-specific information) - so while we are working on making
>>> > > OpenTelemetry available for Airflow, we might also review these
>>> > > metrics, verify whether they are really helpful, and see whether there
>>> > > are additional metrics that we can instrument while doing this.
>>> >
>>> > Oh yeah.
>>> >
>>> > > I think when we are designing distributed traces for Airflow, we
>>> > > should also work on defining what kinds of traces would be useful and
>>> > > how to come up with a better naming convention, to make things clear
>>> > > and easy to understand.
>>> >
>>> > Absolutely!  I think we have a very clear "separation" and actually
>>> > "complementary" work that we should indeed do together!
>>> >
>>> > 1) In the "internship project" that we do together with Melodie, the
>>> > focus is more on the engineering side - "how we can easily integrate
>>> > OpenTelemetry with Airflow" - seamlessly, in a modular fashion, and
>>> > in a way that is easy to use and test in a "development
>>> > environment". It is more about solving all the engineering obstacles
>>> > to integration (for example, what we have learned so far is that
>>> > OpenTelemetry requires some custom code to account for a "forking"
>>> > model). It is also about exposing a lot of low-level metrics that are
>>> > not Airflow-specific (flask, db access etc. - something that really
>>> > allows debugging "any" application deployment, not only Airflow).
>>> > Then we thought about simply adding the "current" metrics that we
>>> > have in statsd as custom ones.
>>> >
>>> > 2) And I understand that your focus is more on "how we can actually make
>>> > a really useful set of Airflow metrics", which ideally complements
>>> > the "engineering" part - once we get OT fully integrated we can add
>>> > not only (or maybe even not at all) the current metrics but, once you
>>> > help define "better" metrics, we can simply implement them in OT -
>>> > including some example dashboards etc.
>>> >
>>> > Happy to collaborate on that!
>>> >
>>> > J.
>>> >
>>> >
>>> > > - Howard
>>> > >
>>> >
>>>
>>
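
On Howard's idea above of reporting CPU and memory per task instance: since local task jobs are essentially subprocesses, the numbers can be read from the OS once the child finishes. Below is a minimal sketch, not Airflow code. It swaps in the standard-library `resource` module (Unix only) for the psutil package mentioned in the thread so that it runs without extra dependencies, and `run_and_measure` is a hypothetical helper name.

```python
import resource
import subprocess
import sys


def run_and_measure(cmd):
    """Run cmd as a child process and report what it consumed.

    Hypothetical helper for illustration only; Unix-only because it
    relies on the `resource` module.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        # CPU seconds spent by children waited for during this call.
        "cpu_user_s": after.ru_utime - before.ru_utime,
        "cpu_system_s": after.ru_stime - before.ru_stime,
        # Peak resident set size over all waited-for children so far
        # (kilobytes on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
    }


# Stand-in for a task: a short Python subprocess doing some CPU work.
usage = run_and_measure([sys.executable, "-c", "sum(range(10**6))"])
print(usage)
```

Values like these could then be attached to whatever metric names the AIP ends up defining; sampling a task while it is still running would need a different approach, for example polling the process by PID.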

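On the "forking" model Jarek mentions above: one general way to handle per-process state after a fork is to register a hook that re-creates it in the child. The sketch below assumes nothing about what the POC actually did. `BufferingExporter` is a hypothetical stand-in for a telemetry exporter, and only the standard-library `os.register_at_fork` is used (Unix only).

```python
import os


class BufferingExporter:
    """Hypothetical stand-in for an exporter that holds per-process state."""

    def __init__(self):
        self._reset()
        # Re-create the per-process state in every forked child.
        os.register_at_fork(after_in_child=self._reset)

    def _reset(self):
        self.owner_pid = os.getpid()
        self.buffer = []

    def record(self, name, value):
        self.buffer.append((name, value))


exporter = BufferingExporter()
exporter.record("parent.metric", 1)

read_fd, write_fd = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: the hook has already dropped the buffer inherited from the parent.
    os.close(read_fd)
    os.write(write_fd, str(len(exporter.buffer)).encode())
    os._exit(0)

# Parent: read what the child saw, then reap it.
os.close(write_fd)
child_buffer_len = int(os.read(read_fd, 16).decode())
os.close(read_fd)
os.waitpid(pid, 0)
print("child buffer length:", child_buffer_len)
print("parent buffer length:", len(exporter.buffer))
```

Without the hook, the child would start with a copy of the parent's pending data and could export it a second time; background threads owned by the parent also do not survive a fork, which is another reason such state usually has to be rebuilt in the child.
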