Actually - maybe even bring it to the state in the 3rd decade of the century ;)
On Thu, Feb 3, 2022 at 5:05 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Hello everyone,
>
> Just to give some information on the progress and plans in the
> Open-Telemetry area.
>
> I just had a talk with Howard, and we are going to work together on the
> AIP proposal covering the "whys", "hows", and "whats" of Open-Telemetry
> for Airflow.
>
> We already have enough information from the POC work done by Melodie
> during the internship regarding the "technical capabilities" of
> OpenTelemetry and the ways it can be integrated with Airflow. I think
> that when we join the "product" vision from Howard with my understanding
> of the internals of OpenTelemetry and Airflow, we can come up with a good
> proposal that will be a solid base for discussion and implementation.
>
> We will be working on the proposal together - if anyone would like to
> join now, do let us know and we will include you. I think we will publish
> an AIP proposal relatively soon, and that will start the "real"
> discussion. I know there are many people interested, so we might add a
> dedicated channel in Slack and maybe run a couple of demos/presentations
> of the proposal before we send it up for voting.
>
> Looking forward to getting this one sorted out. I think together we have
> a chance to bring Airflow telemetry in sync with the state of telemetry
> in the 2nd decade of the XXIst century ;)
>
> Thanks Melodie for all the investigation and research there! This
> internship was a really great start and gave me a lot of confidence in
> the next steps we can take.
>
> J.
>
>
> On Wed, Jan 12, 2022 at 4:21 AM Howard Yoo
> <howard....@astronomer.io.invalid> wrote:
>
>> I am very much interested in how we can improve not only the
>> instrumentation by using OpenTelemetry, but also how we can make the
>> existing metrics list better.
>>
>> For example, perhaps in the future we could provide things like how much
>> CPU, memory, and disk I/O a task instance is using, by utilizing
>> Python's psutil package as mentioned in (
>> https://stackoverflow.com/questions/16326529/python-get-process-names-cpu-mem-usage-and-peak-mem-usage-in-windows),
>> because local task jobs are essentially subprocesses. By utilizing
>> OpenTelemetry, we could more easily collect host metrics and platform
>> metrics that are outside the boundary of Airflow - and even have them
>> collected by the OTEL collector agent at the same time.
>>
>> I would be very happy if this internship project can also include
>> collecting metrics in addition to the tracing, and think about how it
>> can be extended to cover more than what's provided out of the box.
>>
>> - Howard
>>
>> On 2022/01/10 21:22:51 Jarek Potiuk wrote:
>> > > Also, I do have feedback that the current metrics list and what the
>> > > metrics track are not really that useful
>> >
>> > Fully agree.
>> >
>> > > (I mean, there is only so much one can do with metrics like operator
>> > > failures and ti failures, since they don't carry any context-specific
>> > > information) - so while we work on making OpenTelemetry available for
>> > > Airflow, we might also review these metrics, verify whether they are
>> > > really helpful, and see whether there are additional metrics we can
>> > > instrument while doing this.
>> >
>> > Oh yeah.
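
A minimal, purely illustrative sketch of the per-task resource sampling Howard
describes above, assuming the psutil package; the function name and the
returned keys are assumptions, not existing Airflow code:

    import psutil

    def sample_task_process(pid: int) -> dict:
        """Snapshot resource usage of the subprocess running a task instance."""
        proc = psutil.Process(pid)
        mem = proc.memory_info()   # rss / vms in bytes
        io = proc.io_counters()    # not available on every platform (e.g. macOS)
        return {
            "cpu_percent": proc.cpu_percent(interval=0.1),  # CPU over a 100 ms window
            "mem_rss_bytes": mem.rss,
            "io_read_bytes": io.read_bytes,
            "io_write_bytes": io.write_bytes,
        }

Such a snapshot could be taken periodically for the local task job's child
process and reported alongside the host metrics the OTEL collector gathers.
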
>> >
>> > > I think when we are designing for distributed traces in Airflow, we
>> > > should also work on defining what kind of traces would be useful and
>> > > how to come up with a better naming convention etc. to make things
>> > > clear and easy to understand.
>> >
>> > Absolutely! I think we have a very clear "separation" and actually
>> > "complementary" work that we should indeed do together:
>> >
>> > 1) In the "internship project" that we are doing together with Melody,
>> > the focus is more on the engineering side - "how can we easily
>> > integrate open-telemetry with Airflow" - seamlessly, in a modular
>> > fashion, and in a way that will be easy to use and test in a
>> > development environment. It is more about solving all the engineering
>> > obstacles of the integration (for example, what we are learning now is
>> > that OpenTelemetry requires some custom code to account for a "forking"
>> > model). It is also about exposing a lot of low-level metrics that are
>> > not Airflow-specific (Flask, DB access, etc. - something that really
>> > helps debug "any" application deployment, not only Airflow). Then we
>> > thought about simply adding the "current" metrics that we have in
>> > statsd as custom ones.
>> >
>> > 2) And I understand that your focus is more on "how can we actually
>> > make a really useful set of Airflow metrics", which ideally complements
>> > the "engineering" part - once we get OT fully integrated we can add
>> > not only (or maybe even not at all) the current metrics but, once you
>> > help define "better" metrics, we can simply implement them in OT -
>> > including some example dashboards etc.
>> >
>> > Happy to collaborate on that!
>> >
>> > J.
>> >
>> >
>> > > - Howard
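
A minimal sketch of how one of the "current" statsd counters (for example,
ti_failures) could be exposed as a custom metric through the OpenTelemetry
Python SDK; the meter name, counter name, attributes, and the console exporter
are illustrative assumptions, not the actual AIP design:

    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # A real deployment would export to an OTEL collector instead of the console.
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("airflow.poc")  # hypothetical instrumentation name
    ti_failures = meter.create_counter(
        "ti_failures", description="Task instance failures"
    )

    # Record one failure with context that a plain statsd counter cannot carry.
    ti_failures.add(1, attributes={"dag_id": "example_dag", "task_id": "example_task"})

Unlike the flat statsd counter, the attributes make the DAG and task context
queryable in whatever backend the collector ships the metrics to.
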