Hi all, I’m a staff product manager at Astronomer, and I wanted to post this email following the guide at https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals .
Currently, the main way to publish telemetry data out of Airflow is through its StatsD implementation: https://github.com/apache/airflow/blob/main/airflow/stats.py . Airflow supports two flavors of StatsD: the original implementation and Datadog’s DogStatsD. Through this implementation, the following list of metrics is available for popular monitoring tools to collect, monitor, visualize, and alert on: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html

However, Airflow’s current StatsD-based metrics implementation has a number of limitations:

1. StatsD is based on a simple metric format that does not support richer context. Context such as the DAG id or task id has to be embedded in the metric name itself, which is limiting because of the formatting constraints of metric names. A better approach would be to attach ‘tags’ to the metric data to carry that additional context.

2. StatsD uses UDP as its main network protocol, which is simple but does not guarantee reliable delivery of the payload. Moreover, many monitoring systems are moving to more modern transports, such as HTTPS, to send metrics.

3. StatsD supports ‘counter’, ‘gauge’, and ‘timer’ metric types, but it does not support distributed traces or log ingestion.

For the above reasons, I have been looking at OpenTelemetry (https://github.com/open-telemetry) as a potential replacement for Airflow’s current telemetry instrumentation. OpenTelemetry is the merger of OpenTracing and OpenCensus, and it is quickly gaining momentum as a ‘standard’ way of producing and delivering telemetry data.
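To make the first limitation concrete, here is a minimal sketch (the metric and tag names are hypothetical, not Airflow’s actual ones) contrasting a plain StatsD counter line, where the dag id and task id must be baked into the metric name, with a DogStatsD-style line, where they travel as tags after the payload:

```python
# Plain StatsD wire format: context has to live inside the metric name,
# so every dag/task combination produces a distinct metric name.
def statsd_line(name: str, value: int) -> str:
    return f"{name}:{value}|c"

# DogStatsD-style tagged format: the name stays stable and context rides
# along as '#key:value' tags appended to the datagram.
def dogstatsd_line(name: str, value: int, tags: dict) -> str:
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    return f"{name}:{value}|c|#{tag_str}"

print(statsd_line("ti_successes.example_dag.extract", 1))
# ti_successes.example_dag.extract:1|c
print(dogstatsd_line("ti_successes", 1,
                     {"dag_id": "example_dag", "task_id": "extract"}))
# ti_successes:1|c|#dag_id:example_dag,task_id:extract
```

With tags, a monitoring backend can aggregate or filter one stable metric name by dag_id or task_id, instead of having to pattern-match an open-ended set of metric names.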
It covers not only metrics, but also distributed traces and logs, and the technology is geared towards better monitoring of cloud-native software. Many monitoring vendors support OpenTelemetry (Tanzu, Datadog, Honeycomb, Lightstep, etc.), and OpenTelemetry’s modular architecture is designed to be compatible with existing legacy instrumentation. There are also stable Python SDKs and APIs that would make it straightforward to adopt in Airflow.

Therefore, I’d like to propose improving Airflow’s metrics and telemetry capabilities by adding configuration and support for OpenTelemetry. While maintaining backward compatibility with the existing StatsD-based metrics, this would also open up the opportunity to base distributed traces and logs on it, so that any OpenTelemetry-compatible tool could monitor Airflow with richer information.

If you have been thinking about the need to improve Airflow’s current metrics capabilities, or about standards like OpenTelemetry, please feel free to join the thread and share any opinions or feedback.

I also generally think we may need to review the current list of metrics and assess whether they are really useful for monitoring and observability of Airflow. There are metrics we might want to add, such as more executor-related and scheduler-related metrics, as well as operator, DB, and XCom-related metrics, to better assess the health of Airflow and make this information helpful for faster troubleshooting and problem resolution.

Thanks and regards,
Howard