Hi all, I’m a staff product manager at Astronomer, and I wanted to post this email following the guide at https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals .
Currently, the main way to publish telemetry data out of Airflow is through its StatsD implementation: https://github.com/apache/airflow/blob/main/airflow/stats.py . Airflow supports two flavors of StatsD: the original implementation and Datadog’s DogStatsD. Through this implementation, the following list of metrics is available for popular monitoring tools to collect, monitor, visualize, and alert on: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html

However, Airflow’s current StatsD-based metrics implementation has a number of limitations:

1. StatsD is based on a simple metric format that does not support richer context. Context such as the DAG id or task id has to be embedded in the metric name itself, which is limiting because of the formatting constraints of metric names. A better approach would be to attach ‘tags’ to the metric data to carry that additional context.

2. StatsD uses UDP as its main network protocol, which is simple but does not guarantee reliable delivery of the payload. Moreover, many monitoring systems are moving to more modern transports, such as HTTPS, to send metrics.

3. StatsD supports ‘counter’, ‘gauge’, and ‘timer’ metric types, but it does not support distributed traces or log ingestion.

For the above reasons, I have been looking at OpenTelemetry (https://github.com/open-telemetry) as a potential replacement for Airflow’s current telemetry instrumentation. OpenTelemetry is the merger of OpenTracing and OpenCensus, and it is quickly gaining momentum as a ‘standard’ way of producing and delivering telemetry data.
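To make the first limitation concrete, here is a minimal sketch (the metric and tag names are hypothetical, not Airflow’s actual ones) contrasting a plain StatsD counter line, where the dag id and task id must be baked into the metric name, with a DogStatsD-style line, where they travel as tags after the payload:

```python
# Plain StatsD wire format: context has to live inside the metric name,
# so every dag/task combination produces a distinct metric name.
def statsd_line(name: str, value: int) -> str:
    return f"{name}:{value}|c"

# DogStatsD-style tagged format: the name stays stable and context rides
# along as '#key:value' tags appended to the datagram.
def dogstatsd_line(name: str, value: int, tags: dict) -> str:
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    return f"{name}:{value}|c|#{tag_str}"

print(statsd_line("ti_successes.example_dag.extract", 1))
# ti_successes.example_dag.extract:1|c
print(dogstatsd_line("ti_successes", 1,
                     {"dag_id": "example_dag", "task_id": "extract"}))
# ti_successes:1|c|#dag_id:example_dag,task_id:extract
```

With tags, a monitoring backend can aggregate or filter one stable metric name by dag_id or task_id, instead of having to pattern-match an open-ended set of metric names.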
It covers not only metrics, but also distributed traces and logs, and the technology is geared towards better monitoring of cloud-native software. Many monitoring vendors support OpenTelemetry (Tanzu, Datadog, Honeycomb, Lightstep, etc.), and OpenTelemetry’s modular architecture is designed to be compatible with existing legacy instrumentation. There are also stable Python SDKs and APIs that would make it straightforward to adopt in Airflow.

Therefore, I’d like to propose improving Airflow’s metrics and telemetry capabilities by adding configuration and support for OpenTelemetry. While maintaining backward compatibility with the existing StatsD-based metrics, this would also open up the opportunity to base distributed traces and logs on it, so that any OpenTelemetry-compatible tool could monitor Airflow with richer information.

If you have been thinking about the need to improve Airflow’s current metrics capabilities, or about standards like OpenTelemetry, please feel free to join the thread and share any opinions or feedback.

I also generally think we may need to review the current list of metrics and assess whether they are really useful for monitoring and observability of Airflow. There are metrics we might want to add, such as more executor-related and scheduler-related metrics, as well as operator, DB, and XCom-related metrics, to better assess the health of Airflow and make this information helpful for faster troubleshooting and problem resolution.

Thanks and regards,
Howard