[Airflow 3 call followup] StatsD/OTel metric naming

Ferruzzi, Dennis Thu, 22 Aug 2024 10:02:07 -0700

Following up on the discussion in the call this morning.  We were running tight 
on time and I didn't want to derail the remaining conversation, but wanted to 
follow up here.   There will be a separate thread on whether we intend to keep 
StatsD, either as a default or the secondary, or remove it entirely.  That's a 
whole other discussion and someone else will be starting that thread.  But I 
wanted to cover a related aspect of that which would likely derail that 
conversation and may be useful before that discussion.


Context:

When I added the OTel Metrics support, I ran into an issue where OTel metric 
names have a much shorter character limit than StatsD metric names.   This is 
at least in part because OTel supports tags and StatsD does not (out of the 
box).  So any metric which was being emitted in StatsD but whose name was too 
long for OTel is getting double-emitted under the old (long) name and the new 
(truncated, but with tags) name side by side.

Example:
The StatsD metric name 
local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code> has several 
values embedded in it which can each be an unknown length.  OTel can't emit 
that name but can emit metric local_task_job.task_exit with the job_id, dag_id, 
task_id, and return_code values attached as a dict called 'tags'.  This means 
that there are a number of places throughout the code where we are creating 
multiple metrics for the same thing with slightly different names.

Proposed solution:

If we decide to drop StatsD entirely, then there is no issue, we just go 
through and remove the longer names wherever this happened.  If we choose to 
keep StatsD support, then my proposed solution is to add an abstract method to 
BaseStatsLogger called something like "build_name()" which is implemented in 
each of the metrics loggers (StatsD, OTel, and DataDog).   In the case of OTel 
(and DataDog?  I'm not sure on this one and would need to confirm), it would 
just return the name as provided and the metric can be emitted with tags.  In 
the StatsD class, it would return the name with any tags concatenated onto it; 
something like `return f"{self.name}_{"_".join([f"{key}_{value}" for key, value 
in self.tags.items()])}`.  Then we can keep support for both options with 
predictable names (python dicts have been ordered since 3.(6? 8?) so the names 
should always parse out predictably) and simplify the codebase accordingly.

The main down side is that it would almost certainly break at least some 
existing StatsD names, but maybe with 3.0 that is less of a concern.

One positive side effect of doing it this way is that every metric will be 
created with tags so if we add a mandatory "description" field to the metric 
object then we could auto-generate the table on the metrics doc page. [1]


[1] 
https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html


 - ferruzzi

[Airflow 3 call followup] StatsD/OTel metric naming

Reply via email to