[ https://issues.apache.org/jira/browse/FLINK-22664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344504#comment-17344504 ]
Guokuai Huang commented on FLINK-22664:
---------------------------------------

This issue is related to: https://issues.apache.org/jira/browse/FLINK-9665

> Task metrics are not properly unregistered during region failover
> -----------------------------------------------------------------
>
>                 Key: FLINK-22664
>                 URL: https://issues.apache.org/jira/browse/FLINK-22664
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Guokuai Huang
>            Priority: Major
>         Attachments: Screen Shot 2021-05-14 at 2.51.04 PM.png, Screen Shot 2021-05-14 at 5.40.22 PM.png
>
>
> In the current implementation of AbstractPrometheusReporter, metrics with the
> same scopedMetricName share the same metric Collector. A HashMap named
> collectorsWithCountByMetricName records the reference counter of each
> Collector. A Collector is unregistered only when its reference counter
> drops to 0.
> Suppose we have a Flink job with a single chained operator, and *execution
> failover-strategy is set to region.*
> !Screen Shot 2021-05-14 at 2.51.04 PM.png!
> The following figure compares the number of metrics after region failover
> when this job runs on 2 TaskManagers with 1 slot/TM and on 1 TaskManager
> with 2 slots/TM.
> Each inflection point on the graph represents a region failover. *For a
> TaskManager with multiple tasks (slots), the number of metrics increases after
> each region failover.*
> This is a case I deliberately constructed to illustrate the problem.
> The TaskManager only needs to restart some of its tasks during each region
> failover, that is to say, *the reference counter of the task's metric Collector
> never drops to 0, so the metric Collector is never unregistered.*
> This problem puts a lot of pressure on our Prometheus instance. Please see
> whether there is a good solution.
> !Screen Shot 2021-05-14 at 5.40.22 PM.png!
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
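
The reference-counting behaviour described in the report can be illustrated with a small stand-alone model. This is only a sketch, not Flink's or Prometheus's actual code. The class CollectorRefCountSketch, the FakeCollector stand-in, the metric name "task_metric" and the label strings are all hypothetical. It assumes, as the report describes, that the series exported for a task stay on the shared Collector until the Collector itself is unregistered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified model of a reference-counted collector map (hypothetical, not Flink source). */
class CollectorRefCountSketch {

    /** Stand-in for a Prometheus Collector: holds one exported series per label set. */
    private static final class FakeCollector {
        final Set<String> series = new HashSet<>();
    }

    // Together these two maps play the role of collectorsWithCountByMetricName.
    private final Map<String, FakeCollector> collectors = new HashMap<>();
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** A task registers a metric: reuse the shared collector and bump its counter. */
    void addMetric(String scopedMetricName, String labels) {
        collectors.computeIfAbsent(scopedMetricName, k -> new FakeCollector()).series.add(labels);
        refCounts.merge(scopedMetricName, 1, Integer::sum);
    }

    /**
     * A task unregisters a metric: only the counter is decremented.
     * The labels argument is deliberately unused (assumption from the report):
     * the series is dropped only when the whole collector is unregistered.
     */
    void removeMetric(String scopedMetricName, String labels) {
        int remaining = refCounts.merge(scopedMetricName, -1, Integer::sum);
        if (remaining == 0) {
            collectors.remove(scopedMetricName);
            refCounts.remove(scopedMetricName);
        }
    }

    /** Total number of series this model would still export. */
    int exportedSeries() {
        return collectors.values().stream().mapToInt(c -> c.series.size()).sum();
    }

    /** Runs one TaskManager with the given slot count; each failover restarts subtask 0 only. */
    static int seriesAfterFailovers(int slots, int failovers) {
        CollectorRefCountSketch reporter = new CollectorRefCountSketch();
        String name = "task_metric"; // hypothetical scoped metric name
        for (int subtask = 0; subtask < slots; subtask++) {
            reporter.addMetric(name, "subtask=" + subtask + ",attempt=0");
        }
        for (int attempt = 0; attempt < failovers; attempt++) {
            reporter.removeMetric(name, "subtask=0,attempt=" + attempt);
            reporter.addMetric(name, "subtask=0,attempt=" + (attempt + 1));
        }
        return reporter.exportedSeries();
    }

    public static void main(String[] args) {
        System.out.println("1 slot/TM:  " + seriesAfterFailovers(1, 3));
        System.out.println("2 slots/TM: " + seriesAfterFailovers(2, 3));
    }
}
```

In this model the single-slot TaskManager stays at 1 series, because every restart drops the counter to 0 and the collector is discarded. The two-slot TaskManager grows by one stale series per failover (5 after three failovers), because the task that keeps running holds the counter above 0.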