sohurdc created FLINK-38584:
-------------------------------

             Summary: Support checkpoint external path as Prometheus info-style 
metric
                 Key: FLINK-38584
                 URL: https://issues.apache.org/jira/browse/FLINK-38584
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Metrics
    Affects Versions: 1.17.0, 1.13
            Reporter: sohurdc


h2. Problem Statement

Currently, the lastCheckpointExternalPath metric in Flink is exported as a 
Gauge with the checkpoint path as its string value. This approach has several 
limitations:
 # Incompatible with Prometheus/VictoriaMetrics: Time-series databases like 
Prometheus and VictoriaMetrics only support numeric values, making it 
impossible to store checkpoint paths without using additional storage solutions 
like InfluxDB.
 # Limited Observability: Users cannot easily correlate checkpoint paths with 
other checkpoint metrics (size, duration, etc.) in their monitoring dashboards.
 # Workaround Required: Currently, users need to set up separate storage 
systems (e.g., InfluxDB) just to track checkpoint paths, increasing operational 
complexity.

h2. Proposed Solution

Export {{lastCheckpointExternalPath}} as a Prometheus info-style metric:
 * Metric name: {{lastCheckpointExternalPath_info}}
 * Value: always 1.0 (following Prometheus convention)
 * Checkpoint path: stored in a {{path}} label

This approach:
 * ✅ Is compatible with Prometheus/VictoriaMetrics
 * ✅ Follows Prometheus best practices for string-valued metrics (similar to 
{{node_uname_info}})
 * ✅ Enables joining with other metrics via PromQL
 * ✅ Introduces no breaking changes to existing metrics
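As a rough illustration of the proposed output, the sketch below renders one info-style sample in the Prometheus text exposition format. It is not Flink's reporter code: the helper names {{escape_label_value}} and {{format_info_metric}}, and the label values {{abc123}}, {{jm-0}} and the HDFS path, are hypothetical and chosen only for this example.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    # Backslash must be escaped first so the later escapes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_info_metric(name: str, labels: dict[str, str]) -> str:
    """Render one info-style sample: constant value 1.0, data carried in labels."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{rendered}}} 1.0"


# Hypothetical label values, for illustration only.
line = format_info_metric(
    "flink_jobmanager_job_lastCheckpointExternalPath_info",
    {"job_id": "abc123", "host": "jm-0", "path": "hdfs://namenode/checkpoints/chk-42"},
)
print(line)
```

The checkpoint path travels as a label value, so it must be escaped like any other label; the sample value itself stays constant.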

h2. Example Output

Before:

flink_jobmanager_job_lastCheckpointExternalPath{job_id="...",host="..."} "hdfs://..."

❌ Not supported by Prometheus, so the value is converted to:

flink_jobmanager_job_lastCheckpointExternalPath{job_id="...",host="..."} 0

which loses its real meaning.

After:

flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",host="...",path="hdfs://..."} 1.0

✅ Fully compatible with Prometheus

h2. Use Cases
 # *Dashboard Visualization*: Join the checkpoint path with other metrics
{{flink_jobmanager_job_lastCheckpointSize}}
{{  * on(job_id) group_left(path)}}
{{  flink_jobmanager_job_lastCheckpointExternalPath_info}}
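 # *Alerting*: Detect checkpoint path changes (the sample value is always 1 and a new path creates a new series, so count the distinct series per job in the window)
{{count by (job_id) (count_over_time(flink_jobmanager_job_lastCheckpointExternalPath_info[5m])) > 1}}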
 # *Metadata Extraction*: Extract the path for external systems via the 
Prometheus API
{{result['metric']['path']  # Get checkpoint path value}}
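The metadata-extraction use case could look roughly like the sketch below, which reads the {{path}} label out of a Prometheus HTTP API instant-query response ({{/api/v1/query}}). The function name {{extract_checkpoint_paths}} and the sample payload (job ID, host, path, timestamp) are hypothetical; only the response envelope ({{status}}, {{data.result[].metric}}) follows the Prometheus API format.

```python
import json


def extract_checkpoint_paths(response: dict) -> dict[str, str]:
    """Map job_id -> checkpoint path from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    paths = {}
    for result in response["data"]["result"]:
        metric = result["metric"]
        # The sample value is always "1"; the payload of interest is the label.
        paths[metric["job_id"]] = metric["path"]
    return paths


# Hypothetical response body, shaped like a Prometheus instant-vector result.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "flink_jobmanager_job_lastCheckpointExternalPath_info",
          "job_id": "abc123",
          "host": "jm-0",
          "path": "hdfs://namenode/checkpoints/chk-42"
        },
        "value": [1700000000.0, "1"]
      }
    ]
  }
}
""")

paths = extract_checkpoint_paths(sample)
print(paths)
```

In practice the response would come from an HTTP GET against the Prometheus server, with the metric name as the {{query}} parameter.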

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
