Re: [DISCUSSION] How to improve/fix OTEL integration in Airflow

Ferruzzi, Dennis Tue, 04 Mar 2025 10:01:32 -0800

>It looks like every python instance creates its own counter which is exported 
>to the OTEL Collector.
>We assume the PeriodicExportingMetricReader is collecting the metrics for every 
>python instance and exporting this to the OTEL collector. Then the value is 
>overwritten by the latest exported value. There seems to be no central 
>aggregation in OTEL


I believe this is exactly the issue.  The Otel metrics in Airflow weren't 
designed with multiple pods in mind.
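
To make the quoted behavior concrete, here is a toy model of it in plain 
Python. It is only a sketch under stated assumptions, not the OpenTelemetry 
API and not Airflow code: ToyCollector, ToyProcess, and the "instance" label 
are hypothetical names, and the collector is assumed to keep just the latest 
cumulative value per (metric name, labels) series. In a real setup the 
distinguishing attribute might be something like a per-process resource 
attribute (for example service.instance.id), but that is an assumption, not 
something verified against Airflow here.

```python
class ToyCollector:
    """Hypothetical collector: keeps the latest value per (name, labels) series."""

    def __init__(self):
        self.series = {}

    def export(self, name, labels, value):
        key = (name, tuple(sorted(labels.items())))
        self.series[key] = value  # identical series: last write wins

    def total(self, name):
        # Sum over all series of this metric, as a backend query could do.
        return sum(v for (n, _), v in self.series.items() if n == name)


class ToyProcess:
    """Hypothetical Python interpreter with its own cumulative counter."""

    def __init__(self, collector, labels):
        self.collector = collector
        self.labels = labels
        self.count = 0

    def incr(self, n=1):
        self.count += n

    def export(self):
        self.collector.export("testing_counter", self.labels, self.count)


# Case 1: two processes export under identical labels and overwrite each other.
shared = ToyCollector()
p1 = ToyProcess(shared, {"pod": "worker-0"})
p2 = ToyProcess(shared, {"pod": "worker-0"})
p1.incr(5)
p2.incr(1)
p1.export()
p2.export()
overwritten_total = shared.total("testing_counter")

# Case 2: a per-process label keeps the series apart, so they can be summed.
distinct = ToyCollector()
q1 = ToyProcess(distinct, {"pod": "worker-0", "instance": "pid-101"})
q2 = ToyProcess(distinct, {"pod": "worker-0", "instance": "pid-102"})
q1.incr(5)
q2.incr(1)
q1.export()
q2.export()
summed_total = distinct.total("testing_counter")
```

In this model the first case reports whichever process exported last, which 
matches the toggling value described in the original message, while the 
second case keeps one series per process and leaves the aggregation to the 
query side.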

> In the current level of integration attempting to use OTEL looks like 
> "broken".

Maybe a bit strong; it's broken when using multiple pods, but adding some 
indication to that effect may not be a bad idea until someone has the spoons to 
make the changes.


 - ferruzzi


________________________________
From: Kuettelwesch Marco (XC-DX/ETV5) <marco.kuettelwe...@de.bosch.com.INVALID>
Sent: Tuesday, March 4, 2025 6:16 AM
To: dev@airflow.apache.org
Subject: [EXT] [DISCUSSION] How to improve/fix OTEL integration in Airflow

Hi Airflow community,

TLDR: We attempted to switch from StatsD to OTEL, and it seems the current 
integration is not working or has conceptual problems. Seeking expertise or 
alternatives.

We ran into some issues after switching from StatsD to OTEL for metric export 
in Airflow, so I want to start a discussion about possible solutions.
We are using a K8s Helm-powered multi-Pod deployment with many Celery and Edge 
workers. To collect more metrics in more detail without using the mapping in 
StatsD, we decided to use OTEL for advantages such as:
1) Metrics are exported with labels (so we can filter and report on multiple 
axes).
2) mTLS can be used in the OTEL Collector to collect metrics exported by the 
Edge Worker.
3) It follows the strategic direction in Airflow to replace StatsD with OTEL 
in the long term.

We checked only the metrics part of the OTEL implementation, not the trace 
part, and intended to use the metrics publishing as it is in Airflow core 
(currently 2.10.5).
After switching to OTEL we ran into the following issues, which are also 
discussed in https://github.com/apache/airflow/issues/41822. We also consulted 
Dennis (short private chat) and Howard (in the issue) on this:
1) After starting a DAG including a MappedOperator with 8 tasks, I saw strange 
behavior of the "ti.start" counter metric. The tasks were executed on different 
workers and ran for more than 20 minutes. The metric was reset after some time, 
and during active DAG runs the counter value increased by varying amounts, 
sometimes only by 1 even though all 8 tasks had started.
To get more details, I exec'd into a worker pod, started a Python instance, and 
manually executed the Stats.incr("testing_counter") command. The counter 
increased every time I ran the command, as expected. Then I started a second 
Python instance, executed the same command again, and saw a toggling metric: 
sometimes the value of the first counter and sometimes the value of the second 
counter.
It looks like every Python instance creates its own counter, which is exported 
to the OTEL Collector.
We assume the PeriodicExportingMetricReader collects the metrics for every 
Python instance and exports them to the OTEL Collector. Then the value is 
overwritten by the latest exported value. There seems to be no central 
aggregation in OTEL, and the integration relies on Python singletons, which 
cannot work in a scaled distributed environment.
Does anyone with OTEL experience have an idea how to solve this issue? It looks 
like a central instance is required to collect and handle metrics exported 
from different Python instances.
2) Metrics like "ti.start" and "ti.finish" are exported in the worker context 
with the labels dag_id and task_id. The metrics are only available while the 
task is running. It looks like the metrics are gone after the task finishes, 
and the metric is then removed from the OTEL Collector after ~5 min. This may 
be because metrics like "ti.start" are exported in the worker context and the 
OtelLogger is gone once the worker task has finished. (For a task execution, 
the process reports from the forked interpreter, not the worker.)
It looks like the OTEL Collector removes metrics once they have not been 
exported for some time. If a task finishes in less than 1 minute, the metrics 
are not exported at all. I think that is also the reason why some people 
complain that they cannot find some metrics after switching to OTEL.
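
The short-task case in 2) can be sketched with a small toy model. This is an 
illustration under assumptions, not the OpenTelemetry SDK: ToyPeriodicReader 
and run_task are hypothetical names, the 60 second interval is chosen to match 
the "less than 1 minute" observation above, and the reader is assumed to 
export only on interval ticks unless it is flushed explicitly. Whether the 
forked task process in Airflow flushes before it exits is not established in 
this thread.

```python
class ToyPeriodicReader:
    """Hypothetical reader: exports pending data only on interval ticks."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.next_tick = interval_s
        self.pending = 0   # recorded but not yet exported
        self.exported = 0  # what the collector has actually received

    def record(self, n=1):
        self.pending += n

    def advance_to(self, now_s):
        # Export once for every interval boundary reached by now_s.
        while now_s >= self.next_tick:
            self.flush()
            self.next_tick += self.interval_s

    def flush(self):
        self.exported += self.pending
        self.pending = 0


def run_task(duration_s, flush_on_exit, interval_s=60):
    """Simulate one task process; return how many data points were exported."""
    reader = ToyPeriodicReader(interval_s)
    reader.record()                # stands in for ti.start
    reader.advance_to(duration_s)  # task runs, periodic exports may fire
    reader.record()                # stands in for ti.finish
    if flush_on_exit:
        reader.flush()             # explicit flush before the process exits
    return reader.exported


short_without_flush = run_task(20, flush_on_exit=False)
short_with_flush = run_task(20, flush_on_exit=True)
long_without_flush = run_task(90, flush_on_exit=False)
```

In this model a 20 second task exports nothing unless it flushes on exit, and 
a 90 second task without a flush still loses the measurement recorded after 
the last tick.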

We would really like to see OTEL running in a usable way so we can use these 
advantages.
This is a call to the devlist for further expertise outside of the existing 
contributors (as we already briefly discussed with Dennis). At the current 
level of integration, attempting to use OTEL looks "broken". 
Possible options:
a) We review the integration, and hopefully with some OTEL expertise there is 
an option to make it work. We would also contribute improvements. But 
converting all existing metrics (Counters/Gauges) to traces seems to be a 
larger endeavor.
b) We mark the current documentation about the integration with warnings and 
"experimental" so that others do not expect a working solution.
c) We rethink the OTEL integration.


Looking forward to your opinions and ideas regarding the OTEL implementation 
in Airflow to improve the situation.

Marco
