> It looks like every Python instance creates its own counter which is exported to the OTEL Collector. We assume the PeriodicExportingMetricReader is collecting the metrics for every Python instance and exporting them to the OTEL Collector. Then the value is overwritten by the latest exported value. There seems to be no central aggregation in OTEL
I believe this is exactly the issue. The OTel metrics in Airflow weren't designed with multiple pods in mind.

> In the current level of integration attempting to use OTEL looks like "broken".

Maybe a bit strong: it's broken when using multiple pods. But adding some indication to that effect may not be a bad idea until someone has the spoons to make the changes.

- ferruzzi

________________________________
From: Kuettelwesch Marco (XC-DX/ETV5) <marco.kuettelwe...@de.bosch.com.INVALID>
Sent: Tuesday, March 4, 2025 6:16 AM
To: dev@airflow.apache.org
Subject: [EXT] [DISCUSSION] How to improve/fix OTEL integration in Airflow

Hi Airflow community,

TLDR: We attempted to switch from StatsD to OTEL, and the current integration seems not to work or to have conceptual problems. We are looking for expertise or alternatives.

We ran into some issues after switching from StatsD to OTEL for metric export in Airflow, so I want to start a discussion about possible solutions.

We are using a Helm-based, multi-pod K8s deployment with many Celery and Edge workers. To collect more metrics in more detail without using the mapping in StatsD, we decided to switch to OTEL for advantages such as:

1) Metrics are exported with labels, so we can filter and report on multiple axes.
2) mTLS can be used in the OTEL Collector to collect metrics exported by the Edge Worker.
3) It follows the strategic direction in Airflow to replace StatsD with OTEL in the long term.

We looked only at the metrics part of the OTEL implementation, not the traces part, and intended to use the metrics publishing as it exists in Airflow core (currently 2.10.5). After switching to OTEL we ran into the following issues, which are also discussed in https://github.com/apache/airflow/issues/41822. We also consulted Dennis (short private chat) and Howard (in the issue) on this.

1) After starting a DAG containing a MappedOperator with 8 tasks, I saw strange behavior of the "ti.start" counter metric. The tasks were executed on different workers and ran for more than 20 minutes. The metric was reset after some time, and during active DAG runs the counter value increased by varying amounts, sometimes only by 1 even though all 8 tasks had started.

I tried the following to get more details: I exec'd into a worker pod, started a Python instance, and manually executed Stats.incr("testing_counter"). The counter increased every time I ran the command, as expected. Then I started a second Python instance, executed the same command again, and saw a toggling metric: sometimes the value of the first counter, sometimes the value of the second.

It looks like every Python instance creates its own counter, which is exported to the OTEL Collector. We assume the PeriodicExportingMetricReader collects the metrics for every Python instance and exports them to the OTEL Collector, where the value is then overwritten by the latest exported value. There seems to be no central aggregation in OTEL, and the integration relies on Python singletons, which cannot work in a scaled, distributed environment.

Does anyone with OTEL experience have an idea how to solve this issue? It looks like a central instance is required to collect and handle metrics exported from different Python instances.

2) Metrics like "ti.start" and "ti.finish" are exported in the worker context with the labels dag_id and task_id.
The metrics are only available while the task is running. It looks like the metrics are gone once the task has finished, and the metric is then removed from the OTEL Collector after ~5 min. This may be because metrics like "ti.start" are exported in the worker context and the OtelLogger is gone once the worker task has finished. (For a task execution, the reporting process is the forked interpreter, not the worker.) It looks like the OTEL Collector removes metrics after they have not been exported for some time. If a task finishes in less than 1 minute, the metrics are not exported at all. I think that is also the reason why some people complain that they cannot find some metrics after switching to OTEL.

We really would like to see OTEL running in a usable way so that we can benefit from its advantages. This is a call to the devlist for further expertise beyond the existing contributors (as we already briefly discussed with Dennis). At the current level of integration, attempting to use OTEL looks "broken".

Possible options:

a) We review the integration, and hopefully with some OTEL expertise there is a way to make it work. We would also contribute improvements. But converting all existing metrics (Counters/Gauges) to traces seems to be a larger endeavor.
b) We mark the current documentation about the integration with warnings and "experimental" so that others do not expect a working solution.
c) We rethink the OTEL integration.

Looking forward to your opinions and ideas on how to improve the OTEL implementation in Airflow.

Marco
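
To make the toggling counter from issue 1) easier to reason about, here is a toy model in plain Python. It is only a sketch and does not use the OpenTelemetry SDK or Airflow code: ToyBackend and ToyProcess are hypothetical names, and the "last write wins per series" behavior is an assumption that matches what we observed, not a statement about how the OTEL Collector is implemented. The "instance" label stands in for a per-process identifier (in OTel terms this would roughly correspond to a unique resource attribute such as service.instance.id); whether that fits the Airflow integration is untested.

```python
from collections import defaultdict


class ToyBackend:
    """Hypothetical backend that keeps only the latest value per series.

    A series is identified by metric name plus its sorted labels.
    """

    def __init__(self):
        self.series = {}

    def export(self, name, labels, value):
        key = (name, tuple(sorted(labels.items())))
        self.series[key] = value  # last write wins, no aggregation

    def total(self, name):
        # Sum over all series that share the metric name.
        return sum(v for (n, _), v in self.series.items() if n == name)


class ToyProcess:
    """One Python interpreter with its own in-memory cumulative counters."""

    def __init__(self, backend, extra_labels=None):
        self.backend = backend
        self.extra_labels = dict(extra_labels or {})
        self.counters = defaultdict(int)

    def incr(self, name):
        self.counters[name] += 1

    def export(self):
        # Stand-in for one periodic export of the cumulative values.
        for name, value in self.counters.items():
            self.backend.export(name, self.extra_labels, value)


def run(with_instance_label):
    """Two processes increment the same counter, then export A, B, A."""
    backend = ToyBackend()
    labels_a = {"instance": "a"} if with_instance_label else {}
    labels_b = {"instance": "b"} if with_instance_label else {}
    proc_a = ToyProcess(backend, labels_a)
    proc_b = ToyProcess(backend, labels_b)
    for _ in range(5):
        proc_a.incr("testing_counter")
    for _ in range(2):
        proc_b.incr("testing_counter")
    observed = []
    for proc in (proc_a, proc_b, proc_a):
        proc.export()
        observed.append(backend.total("testing_counter"))
    return observed


if __name__ == "__main__":
    print(run(with_instance_label=False))  # [5, 2, 5]: value toggles
    print(run(with_instance_label=True))   # [5, 7, 7]: series can be summed
```

Without a distinguishing label both processes write to the same series, so the visible value jumps between 5 and 2 depending on who exported last. With a per-process label the backend holds two series, and the total of 7 can be computed at query time.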