Florian Schmidt created FLINK-10521:
---------------------------------------

             Summary: TaskManager metrics are not reported to prometheus after 
running a job
                 Key: FLINK-10521
                 URL: https://issues.apache.org/jira/browse/FLINK-10521
             Project: Flink
          Issue Type: Bug
          Components: Metrics
    Affects Versions: 1.6.1
         Environment: Flink 1.6.1 cluster with one taskmanager and one 
jobmanager, prometheus and grafana, all started in a local docker environment.

See sample project at: 
https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
            Reporter: Florian Schmidt
         Attachments: Screenshot 2018-10-10 at 11.16.22.png

 

I'm using Prometheus to collect metrics from Flink, and I noticed that 
shortly after a job starts running, metrics from the taskmanager stop being 
reported most of the time.

Looking at the Prometheus logs, I can see that requests to 
taskmanager:9249/metrics succeed when no job is running. After a job starts, 
those requests return an empty response with increasing frequency, until at 
some point most of them fail. I was able to verify this by running 
`curl localhost:9249/metrics` inside the taskmanager container, where more 
often than not the response was empty instead of containing the expected 
metrics.
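
To put a number on "more often than not", the manual curl check could be repeated in a loop. The following is only a rough sketch, not part of the sample project: the helper names (`fetch_body`, `empty_fraction`, `probe`), the attempt count and the timeout are assumptions chosen for illustration, and a failed request is counted as an empty response.

```python
import http.client
import time
import urllib.request


def fetch_body(url, timeout=2.0):
    """Return the response body as bytes, or b"" if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (OSError, http.client.HTTPException):
        # Assumption: connection errors and malformed/empty HTTP replies
        # are treated the same as an empty metrics response.
        return b""


def empty_fraction(bodies):
    """Fraction of responses with no content (empty or whitespace-only)."""
    if not bodies:
        return 0.0
    empty = sum(1 for body in bodies if not body.strip())
    return empty / len(bodies)


def probe(url, attempts=30, interval=1.0):
    """Poll `url` `attempts` times, `interval` seconds apart,
    and return the fraction of empty responses."""
    bodies = []
    for _ in range(attempts):
        bodies.append(fetch_body(url))
        time.sleep(interval)
    return empty_fraction(bodies)
```

Calling `probe("http://localhost:9249/metrics")` inside the taskmanager container, once with no job running and once after starting a job, would give two fractions that can be compared directly.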

In the attached image you can see that some requests occasionally succeed, but 
with large gaps in between. The Prometheus scrape interval is set to 1s.

!Screenshot 2018-10-10 at 11.16.22.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
