sohurdc created FLINK-38584:
-------------------------------
Summary: Support checkpoint external path as Prometheus info-style
metric
Key: FLINK-38584
URL: https://issues.apache.org/jira/browse/FLINK-38584
Project: Flink
Issue Type: Improvement
Components: Runtime / Metrics
Affects Versions: 1.17.0, 1.13
Reporter: sohurdc
h2. Problem Statement
Currently, the lastCheckpointExternalPath metric in Flink is exported as a
Gauge with the checkpoint path as its string value. This approach has several
limitations:
# Incompatible with Prometheus/VictoriaMetrics: Time-series databases like
Prometheus and VictoriaMetrics only support numeric values, making it
impossible to store checkpoint paths without using additional storage solutions
like InfluxDB.
# Limited Observability: Users cannot easily correlate checkpoint paths with
other checkpoint metrics (size, duration, etc.) in their monitoring dashboards.
# Workaround Required: Currently, users need to set up separate storage
systems (e.g., InfluxDB) just to track checkpoint paths, increasing operational
complexity.
h2. Proposed Solution
Export lastCheckpointExternalPath as a Prometheus info-style metric:
* Metric name: {{lastCheckpointExternalPath_info}}
* Value: always 1.0 (following Prometheus convention)
* Checkpoint path: stored in a {{path}} label
This approach:
* ✅ Is compatible with Prometheus/VictoriaMetrics
* ✅ Follows Prometheus best practices for string-value metrics (similar to
{{node_uname_info}})
* ✅ Enables joining with other metrics via PromQL
* ✅ Introduces no breaking changes to existing metrics
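As a rough illustration of the proposed shape (not the reporter's actual implementation), the following Python sketch renders one info-style sample in the Prometheus text exposition format. The helper names and the sample label values ({{abc123}}, the {{hdfs://namenode/...}} path) are hypothetical and chosen only for the example.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_info_metric(name: str, labels: dict[str, str]) -> str:
    """Render one info-style sample: the labels carry the data, the value is 1.0."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}_info{{{rendered}}} 1.0"


# Hypothetical label values, for illustration only.
line = render_info_metric(
    "flink_jobmanager_job_lastCheckpointExternalPath",
    {"job_id": "abc123", "path": "hdfs://namenode/checkpoints/chk-42"},
)
print(line)
```

The string lives entirely in the {{path}} label, so the sample value can stay a constant that any numeric-only backend accepts.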
h2. Example Output
Before:
flink_jobmanager_job_lastCheckpointExternalPath\{job_id="...",host="..."} "hdfs://..."
❌ Not supported by Prometheus, so the value is converted to 0:
flink_jobmanager_job_lastCheckpointExternalPath\{job_id="...",host="..."} 0
and the metric loses its real meaning.
After:
flink_jobmanager_job_lastCheckpointExternalPath_info\{job_id="...",host="...",path="hdfs://..."} 1.0
✅ Fully compatible with Prometheus
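The "Before" behaviour can be pictured with a small Python sketch of a numeric-only export path. This is an illustration of the fallback described above, not Flink's reporter code; the function name is hypothetical, and the boolean handling is an assumption added for completeness.

```python
def gauge_to_sample(value: object) -> float:
    """Map a gauge value to a numeric sample; non-numeric values fall back to 0."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return 1.0 if value else 0.0  # assumption: booleans map to 1/0
    if isinstance(value, (int, float)):
        return float(value)
    return 0.0  # a string such as a checkpoint path loses its content here


# Hypothetical path value, for illustration only.
print(gauge_to_sample("hdfs://namenode/checkpoints/chk-42"))
```

Whatever the path is, the exported sample is the same 0, which is why moving the path into a label is needed to preserve it.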
h2. Use Cases
# *Dashboard Visualization*: Join checkpoint path with other metrics
{{flink_jobmanager_job_lastCheckpointSize}}
{{  * on(job_id) group_left(path)}}
{{  flink_jobmanager_job_lastCheckpointExternalPath_info}}
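# *Alerting*: Detect checkpoint path changes. Because the value is always 1, a new path appears as a new series rather than as a changed sample value, so count the distinct series seen in the window:
{{count by (job_id) (count_over_time(flink_jobmanager_job_lastCheckpointExternalPath_info[5m])) > 1}}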
# *Metadata Extraction*: Extract path for external systems via Prometheus API
{{result['metric']['path']  # Get checkpoint path value}}
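The metadata-extraction use case could look roughly like the Python sketch below, which reads the {{path}} label out of a Prometheus instant-query response body. It parses an already-fetched JSON string rather than calling a server; the function name and the sample payload (job ID, path, timestamp) are hypothetical.

```python
import json


def paths_by_job(response_body: str) -> dict[str, str]:
    """Return {job_id: path} from a Prometheus /api/v1/query vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        result["metric"]["job_id"]: result["metric"]["path"]
        for result in payload["data"]["result"]
        # Skip series that lack either label instead of raising KeyError.
        if "job_id" in result["metric"] and "path" in result["metric"]
    }


# Hypothetical response body, shaped like an instant-query vector result.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {
                "__name__": "flink_jobmanager_job_lastCheckpointExternalPath_info",
                "job_id": "abc123",
                "path": "hdfs://namenode/checkpoints/chk-42",
            },
            "value": [1700000000, "1"],
        }],
    },
})
print(paths_by_job(sample_body))
```

If several path series exist for one job at query time, this sketch keeps only the last one it encounters; a real consumer may want to pick by timestamp instead.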
--
This message was sent by Atlassian Jira
(v8.20.10#820010)