Hi Elad,

Sure, anything would be great! I’m glad to see there’s some work done already. It looks like the work is centered around traces, but I would also like to look at how we can produce metrics and logs via OpenTelemetry.

Howard

On 2022/01/07 22:37:08 Elad Kalif wrote:
> Hi Howard,
>
> We actually have an Outreachy intern (Melodie) who is working on
> researching how OpenTelemetry can be integrated with Airflow.
> Draft PR for demo: https://github.com/apache/airflow/pull/20677
> This is an initial effort for a POC.
> Maybe you can work together on this?
>
>
> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo <ho...@astronomer.io.invalid>
> wrote:
>
> > Hi all,
> >
> > I’m a staff product manager at Astronomer, and wanted to post this email
> > according to the guide from
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
> > .
> >
> > Currently, the main method to publish telemetry data out of Airflow is
> > through its StatsD implementation:
> > https://github.com/apache/airflow/blob/main/airflow/stats.py , and
> > currently Airflow supports two flavors of StatsD, the original one and
> > Datadog’s DogStatsD implementation.
> >
> > Through this implementation, we have the following list of metrics that
> > other popular monitoring tools can collect, monitor, visualize, and
> > alert on:
> > https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
> >
> >
> > There are a number of limitations in Airflow’s current implementation of
> > its metrics using StatsD.
> > 1. StatsD is based on a simple metrics format that does not support
> > richer contexts. Its metric names contain some of that context (such as
> > dag id, task id, etc.), but this is limited because the context has to
> > be encoded as part of the metric name itself. A better approach would be
> > to attach ‘tags’ to the metrics data to add more context.
> > 2. StatsD also utilizes UDP as its main network protocol, but UDP is
> > simple and does not guarantee reliable transmission of the payload.
> > Moreover, many monitoring tools are moving to more modern protocols
> > such as HTTPS to send out metrics.
> > 3. StatsD does support ‘counter,’ ‘gauge,’ and ‘timer,’ but does not
> > support distributed traces and log ingestion.
> >
> > Due to the above reasons, I have been looking at OpenTelemetry (
> > https://github.com/open-telemetry) as a potential replacement for
> > Airflow’s current telemetry instrumentation. OpenTelemetry is a product
> > of the OpenTracing and OpenCensus projects, and is quickly gaining
> > momentum as a ‘standard’ means of producing and delivering telemetry
> > data: not only metrics, but also distributed traces and logs. The
> > technology is also geared towards better monitoring of cloud-native
> > software. Many monitoring tool vendors support OpenTelemetry (Tanzu,
> > Datadog, Honeycomb, Lightstep, etc.), and OpenTelemetry’s modular
> > architecture is designed to be compatible with existing legacy
> > instrumentation. There are also stable Python SDKs and APIs that make it
> > easy to implement in Airflow.
> >
> > Therefore, I’d like to propose improving the metrics and telemetry
> > capabilities of Airflow by adding configuration and support for
> > OpenTelemetry. While maintaining backward compatibility with the
> > existing StatsD-based metrics, this would also give us the opportunity
> > to base distributed traces and logs on it, making it easier for any
> > OpenTelemetry-compatible tool to monitor Airflow with richer
> > information.
> >
> > If you have been thinking about the need to improve the current metrics
> > capabilities of Airflow, or about standards like OpenTelemetry, please
> > feel free to join the thread and provide any opinions or feedback. I
> > also generally think that we may need to review our current list of
> > metrics and assess whether they are really useful in terms of monitoring
> > and observability of Airflow.
> > There are things that we might want to add to the metrics, such as
> > more executor-related and scheduler-related metrics, as well as
> > operator and even DB and XCom-related metrics, to better assess the
> > health of Airflow and make this information helpful for faster
> > troubleshooting and problem resolution.
> >
> > Thanks and regards,
> > Howard
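
To make limitation 1 from the quoted proposal a bit more concrete, here is a minimal sketch in Python of the difference between packing context into the metric name and attaching it as tags. It only formats the wire lines and sends nothing. The function names, the metric names, and the tag keys are hypothetical and chosen for illustration; the `|#key:value` suffix is the DogStatsD extension, not part of plain StatsD, and none of this is Airflow's actual implementation.

```python
def statsd_line(name, value, metric_type):
    """Format a plain StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def dogstatsd_line(name, value, metric_type, tags):
    """Format a DogStatsD datagram, which appends |#key:value,... tags."""
    line = statsd_line(name, value, metric_type)
    if tags:
        # Sort the tags so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line = f"{line}|#{tag_str}"
    return line


# Plain StatsD: the dag id and task id have to live inside the metric name.
plain = statsd_line("dag.example_dag.example_task.duration", 1250, "ms")

# Tagged: one stable metric name, with the context carried as tags.
tagged = dogstatsd_line(
    "task.duration",
    1250,
    "ms",
    {"dag_id": "example_dag", "task_id": "example_task"},
)

print(plain)
print(tagged)
```

With the tagged form, a backend can group or filter by `dag_id` without parsing metric names, which is roughly the kind of richer context the proposal is asking for.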