[ https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-21309:
-----------------------------------
    Labels: auto-deprioritized-major stale-minor  (was: auto-deprioritized-major)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor but is unassigned and neither itself nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.

> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -------------------------------------------------------------------------
>
> Key: FLINK-21309
> URL: https://issues.apache.org/jira/browse/FLINK-21309
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
> Affects Versions: 1.9.0, 1.10.0, 1.11.0
> Environment: 1. Components :
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics Configuration in flink-conf.yaml :
> {code:java}
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
> Reporter: jiguodai
> Priority: Minor
> Labels: auto-deprioritized-major, stale-minor
> Attachments: image-2021-02-05-21-07-42-292.png
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> When a Flink job runs on YARN, the metrics of the jobmanager and the
> taskmanagers overwrite each other. At one moment the pushgateway web UI
> shows only jobmanager metrics, and a moment later it shows only taskmanager
> metrics; the two kinds of metrics appear alternately.
> A taskmanager metric on Grafana intermittently looks like the graph below
> (the taskmanager metric disappears from Grafana whenever jobmanager metrics
> overwrite taskmanager metrics):
> !image-2021-02-05-21-07-42-292.png!
> The root cause is that the Flink PrometheusPushGatewayReporter uses PUT
> instead of POST to push metrics to the pushgateway. In addition, the
> taskmanagers and the jobmanager use the same jobName (the only grouping key),
> which we configured in flink-conf.yaml.
> Although the REST URLs are the same, as shown below,
> {code:java}
> /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}
> {code}
> PUT and POST produce different results:
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method but only metrics with the same name
> as the newly pushed metrics are replaced.
> For these reasons, it is better to use POST to push metrics to the
> pushgateway, so that jobmanager metrics and taskmanager metrics do not
> overwrite each other and the graphs on Grafana are continuous. You might
> suggest setting
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml, so that the jobName on each node gets a random suffix and
> metrics no longer overwrite each other. However, most users filter on jobName
> in PromQL, and matching the jobName with a regular expression slows down data
> retrieval in Prometheus.
> Every time somebody asks on the Flink mailing list why metrics on Grafana are
> discontinuous, I tell them to change the push method from PUT to POST and
> then repackage the flink-metrics-prometheus module. So why don't we solve the
> problem permanently now? I sincerely hope to have the chance to fix it.
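
The difference between PUT and POST described above can be illustrated with a small in-memory model. This is only a sketch of the grouping semantics, not the real Pushgateway or the Flink reporter: the class `ToyPushgateway` and the metric names `jobmanager_uptime` and `taskmanager_heap_used` are hypothetical, and the only grouping key assumed is the job name `myjob` from the configuration above.

```python
from typing import Dict


class ToyPushgateway:
    """Hypothetical in-memory stand-in for Pushgateway grouping semantics."""

    def __init__(self) -> None:
        # grouping key (here: just the job name) -> {metric name: value}
        self.groups: Dict[str, Dict[str, float]] = {}

    def put(self, job: str, metrics: Dict[str, float]) -> None:
        # PUT: every metric in the group is replaced by the pushed metrics.
        self.groups[job] = dict(metrics)

    def post(self, job: str, metrics: Dict[str, float]) -> None:
        # POST: only metrics with the same name as a pushed metric are replaced.
        self.groups.setdefault(job, {}).update(metrics)


# Hypothetical metrics pushed by the jobmanager and by a taskmanager.
JM_METRICS = {"jobmanager_uptime": 120.0}
TM_METRICS = {"taskmanager_heap_used": 512.0}

# Both processes push under the same job name with PUT.
put_gw = ToyPushgateway()
put_gw.put("myjob", JM_METRICS)
put_gw.put("myjob", TM_METRICS)
put_result = sorted(put_gw.groups["myjob"])

# The same pushes with POST.
post_gw = ToyPushgateway()
post_gw.post("myjob", JM_METRICS)
post_gw.post("myjob", TM_METRICS)
post_result = sorted(post_gw.groups["myjob"])

print(put_result)   # only the metrics from the last push remain
print(post_result)  # metrics from both pushes coexist
```

In this model the last PUT wins for the whole group, which matches the alternating behaviour reported above, while POST leaves metrics with other names untouched.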
> related links :
> [https://github.com/prometheus/pushgateway#put-method]
> [https://github.com/prometheus/pushgateway/issues/308]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)