[ https://issues.apache.org/jira/browse/FLINK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646157#comment-16646157 ]
Florian Schmidt commented on FLINK-10521:
-----------------------------------------

[~till.rohrmann] I attached the debug logs. Turning on debug log level in Flink shows

{code:java}
2018-10-11 08:47:19,101 DEBUG org.apache.flink.runtime.metrics.dump.MetricDumpSerialization - Failed to serialize histogram.
java.util.ConcurrentModificationException
... (this one has a stacktrace)
{code}

and

{code:java}
2018-10-11 08:47:07,174 DEBUG org.apache.flink.runtime.metrics.dump.MetricDumpSerialization - Failed to serialize histogram.
java.lang.ArrayIndexOutOfBoundsException
(this one does not have a stacktrace)
{code}

so it really looks like my sketched-together Histogram implementation is at fault here. Would you still consider it a bug in Flink's metrics system that no metrics at all are reported if a single metric is implemented in a way that might throw an exception? If so, I can change the description of the issue to reflect what we have found out so far. Otherwise, if this is expected behaviour, feel free to close this issue. (A minimal sketch of a thread-safe Histogram implementation is included after the quoted issue description below.)

> TaskManager metrics are not reported to prometheus after running a job
> ----------------------------------------------------------------------
>
> Key: FLINK-10521
> URL: https://issues.apache.org/jira/browse/FLINK-10521
> Project: Flink
> Issue Type: Bug
> Components: Metrics
> Affects Versions: 1.6.1
> Environment: Flink 1.6.1 cluster with one taskmanager and one jobmanager, prometheus and grafana, all started in a local docker environment.
> See sample project at: https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
> Reporter: Florian Schmidt
> Priority: Major
> Attachments: Screenshot 2018-10-10 at 11.32.59.png, prometheus.log, taskmanager.log
>
> Update: This only seems to happen when my custom (admittedly poorly implemented) Histogram is enabled. Still, I think one poorly implemented metric should not bring down the whole metrics system.
> --
> I'm using prometheus to collect the metrics from Flink, and I noticed that shortly after running a job, metrics from the taskmanager stop being reported most of the time.
> Looking at the prometheus logs I can see that requests to taskmanager:9249/metrics succeed while no job is running, but after starting a job those requests return an empty response with increasing frequency, until at some point most of the requests are not successful anymore. I was able to verify this by running `curl localhost:9249/metrics` inside the taskmanager container, where more often than not the response was empty instead of containing the expected metrics.
> In the attached image you can see that occasionally some requests still succeed, but there are big gaps in between. Eventually they stop succeeding completely. The prometheus scrape interval is set to 1s.
> !Screenshot 2018-10-10 at 11.32.59.png!
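For reference, below is a minimal sketch of what a concurrency-safe custom Histogram could look like, assuming the observed ConcurrentModificationException / ArrayIndexOutOfBoundsException come from the metrics reporter thread reading the histogram's backing collection while the task thread is still updating it. The class name SimpleSlidingHistogram and the sliding-window behaviour are illustrative and not taken from the attached project; only the org.apache.flink.metrics.Histogram and HistogramStatistics types are Flink's own API.

{code:java}
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.HistogramStatistics;

// Illustrative sketch only: a sliding-window histogram whose statistics are
// computed from an immutable snapshot taken under a lock, so the reporter
// thread never iterates a collection that update() is mutating concurrently.
public class SimpleSlidingHistogram implements Histogram {

    private final int windowSize;
    private final Deque<Long> values = new ArrayDeque<>();
    private long count;

    public SimpleSlidingHistogram(int windowSize) {
        this.windowSize = windowSize;
    }

    @Override
    public synchronized void update(long value) {
        count++;
        values.addLast(value);
        if (values.size() > windowSize) {
            values.removeFirst();
        }
    }

    @Override
    public synchronized long getCount() {
        return count;
    }

    @Override
    public synchronized HistogramStatistics getStatistics() {
        // Copy the window while holding the lock; the reporter then works on a
        // private, sorted array and cannot observe concurrent modifications.
        long[] snapshot = values.stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(snapshot);
        return new SnapshotStatistics(snapshot);
    }

    private static final class SnapshotStatistics extends HistogramStatistics {
        private final long[] sorted; // sorted copy, safe to read from any thread

        SnapshotStatistics(long[] sorted) {
            this.sorted = sorted;
        }

        @Override
        public double getQuantile(double quantile) {
            if (sorted.length == 0) {
                return 0.0;
            }
            int index = (int) Math.ceil(quantile * sorted.length) - 1;
            return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
        }

        @Override
        public long[] getValues() {
            return sorted;
        }

        @Override
        public int size() {
            return sorted.length;
        }

        @Override
        public double getMean() {
            return Arrays.stream(sorted).average().orElse(0.0);
        }

        @Override
        public double getStdDev() {
            double mean = getMean();
            double sumSq = 0.0;
            for (long v : sorted) {
                sumSq += (v - mean) * (v - mean);
            }
            return sorted.length == 0 ? 0.0 : Math.sqrt(sumSq / sorted.length);
        }

        @Override
        public long getMax() {
            return sorted.length == 0 ? 0 : sorted[sorted.length - 1];
        }

        @Override
        public long getMin() {
            return sorted.length == 0 ? 0 : sorted[0];
        }
    }
}
{code}

The key design point in the sketch is that getStatistics() hands the reporter a snapshot rather than a live view of the window; whether Flink should additionally isolate a throwing metric instead of dropping the whole dump is the question raised in the comment above.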