dannyl1u opened a new issue, #42881:
URL: https://github.com/apache/airflow/issues/42881

   ### Description
   
   From @ferruzzi:
   
   > Currently when you add a new metric to the codebase, you must also 
manually update the docs page.  The docs page inevitably gets out of date and 
misses some details.  We want an automated system to generate the docs page 
based on the actual metrics.  There are also known instances where the same 
metric is being created and emitted in more than one place, causing duplicate 
data.  These will have to be fixed manually, and an automated check might 
(stretch goal?) include checking for identical or "too similar" names 
while collecting the names for the docs page.
   
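   As a rough illustration of the stretch goal above, the sketch below flags
exact duplicates and near-duplicate names in a list of collected metric names.
It uses Python's standard `difflib`; the helper names, the 0.9 threshold, and
the sample names are assumptions made for illustration, not part of Airflow.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicates(names):
    """Return the names that were collected more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def find_similar(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio meets the threshold."""
    unique = sorted(set(names))
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]


# Made-up sample input; a real collector would gather these from the codebase.
collected = [
    "scheduler.heartbeat",
    "scheduler_heartbeat",
    "dag_processing.total_parse_time",
    "scheduler.heartbeat",
    "pool.open_slots",
]

duplicates = find_duplicates(collected)
similar = find_similar(collected)
```

   A docs-generation step could fail or warn when either list is non-empty; the
right threshold would need tuning against the real metric names.
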
   > Phase 1
   > Situation:
   > We support multiple different Metrics backends [0].  The two main ones are 
StatsD and OpenTelemetry.  This is managed through an interface class [1] which 
is implemented for each backend (examples:  StatsD[2] and OTel[3]).   StatsD 
was the only supported version well into Airflow 2.x and the entire codebase 
was designed with StatsD in mind so it was a good chunk of work to abstract it 
out and there are a few remaining tasks to perfect the new implementation.
   > Task 1:
   > StatsD has a name length limit of around 300 characters.  OTel limits 
names to 34 characters, but allows tagging.  Our temporary solution was to emit 
almost everything twice, once in the long format for StatsD and again in the 
short format with tags for OTel.  We also had to add code [4] to make sure the 
name is safe for OTel, and other hacks to make it work.
   > The first task in this project is to understand the difference in how the 
two implementations handle their names and then add a "get_name" method to the 
interface: `def get_name(metric_name: str, tags: dict[str, str])`.  In the 
statsd_logger [2] implementation it will concatenate the tags onto the name and 
in the OTel implementation it will just return the name.
   > Once that is implemented, it can be used in the various emit methods 
(incr, decr, etc.) instead of all the name validation code.  The next step is to 
search the code for places where we emit things more than once and clean them up.
   > Example:
   > You can see an example in local_task_job_runner [5].  We emit 
`local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>` for StatsD, 
but that results in a name too long for OTel, so we also emit 
`local_task_job.task_exit` with the same values passed as tags.  The 
name validation method [4] in the OTel implementation catches the one that is 
too long and just swallows it.  What we should do instead is pass incr() the 
name and the tags and let StatsD and OTel handle them accordingly.
   > [0] 
https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions
   > [1] 
https://github.com/apache/airflow/blob/main/airflow/metrics/base_stats_logger.py
   > [2] 
https://github.com/apache/airflow/blob/main/airflow/metrics/statsd_logger.py
   > [3] 
https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py
   > [4] 
https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py#L128
   > [5] 
https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job_runner.py#L352
   
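   To make the proposed "get_name" idea concrete, here is a minimal sketch of
how a single incr() call with a name and tags could be handled by each backend.
The class names, the `emitted` list, and the rule for joining tag values with
dots are hypothetical stand-ins, not Airflow's actual loggers or the final
design.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SketchLogger(ABC):
    """Hypothetical stand-in for the stats logger interface."""

    def __init__(self) -> None:
        # Records (name, tags) pairs instead of talking to a real backend.
        self.emitted: list[tuple[str, dict[str, str]]] = []

    @abstractmethod
    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str:
        """Return the backend-specific name for this metric."""

    @abstractmethod
    def incr(self, metric_name: str, tags: dict[str, str] | None = None) -> None:
        """Emit one increment using the backend-specific name."""


class SketchStatsdLogger(SketchLogger):
    def get_name(self, metric_name, tags=None):
        # Assumption: tag values are appended in insertion order, dot-separated.
        if not tags:
            return metric_name
        return ".".join([metric_name, *map(str, tags.values())])

    def incr(self, metric_name, tags=None):
        # The tags are folded into the name, so none are sent separately.
        self.emitted.append((self.get_name(metric_name, tags), {}))


class SketchOtelLogger(SketchLogger):
    def get_name(self, metric_name, tags=None):
        # The short name is kept as-is; tags travel alongside it.
        return metric_name

    def incr(self, metric_name, tags=None):
        self.emitted.append((self.get_name(metric_name, tags), dict(tags or {})))


# The caller emits once; each backend decides how to name the metric.
tags = {"job_id": "42", "dag_id": "my_dag", "task_id": "my_task", "return_code": "0"}
statsd, otel = SketchStatsdLogger(), SketchOtelLogger()
for logger in (statsd, otel):
    logger.incr("local_task_job.task_exit", tags=tags)
```

   With this shape the call site in local_task_job_runner would need only one
incr() call, and the OTel name validation would no longer have to swallow the
long StatsD-style name.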
   
   ### Use case/motivation
   
   From @ferruzzi:
   
   > Currently when you add a new metric to the codebase, you must also 
manually update the docs page.  The docs page inevitably gets out of date and 
misses some details.  We want an automated system to generate the docs page 
based on the actual metrics.  There are also known instances where the same 
metric is being created and emitted in more than one place, causing duplicate 
data.  These will have to be fixed manually, and an automated check might 
(stretch goal?) include checking for identical or "too similar" names 
while collecting the names for the docs page.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

