Hello,

I am running some tests with Flink 1.11.1 and have noticed something
strange, possibly wrong, with the exported metrics.

My configuration looks like this:

metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
metrics.reporter.graphite.host: graphite
metrics.reporter.graphite.port: 8080
metrics.reporter.graphite.protocol: tcp
metrics.reporter.graphite.interval: 10 SECONDS

This should send metrics to Graphite every 10 seconds.

That works with low parallelism (e.g. <= 20): we get all metrics, all the
time, every 10 seconds.
However, when I scale my job to a parallelism of 200 or more, the metrics
are no longer sent every 10 seconds. Sometimes they are missing for up to 3
reporting cycles.
I have had a brief look at the code here:
https://github.com/apache/flink/blob/release-1.11.1/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java#L107-L144
and it looks like reporting runs on its own separate thread. My first guess
had been that too much work was being done on one thread.
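
To illustrate that guess, here is a minimal standalone sketch (not Flink
code). It assumes the reporter task is scheduled with a fixed delay on a
single-threaded executor, which I have not verified beyond a brief reading
of the linked code. The class name FixedDelayDrift, the method measureGaps
and the sleep that stands in for a slow report are all hypothetical. Under
that assumption, the time between two reports is roughly the configured
interval plus the duration of the report itself, so a slow report would
stretch the effective period:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
final class FixedDelayDrift {

    // Runs a fake "report" that takes reportMs on a single-threaded scheduler
    // and returns the gaps in milliseconds between consecutive report starts.
    static List<Long> measureGaps(long intervalMs, long reportMs, int reports) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        List<Long> starts = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(reports);
        try {
            executor.scheduleWithFixedDelay(() -> {
                starts.add(System.nanoTime());
                try {
                    // Stand-in for serializing and sending all metrics (assumption).
                    Thread.sleep(reportMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
            done.await(); // wait until the requested number of reports has run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reports", e);
        } finally {
            executor.shutdownNow();
        }
        List<Long> gaps = new ArrayList<>();
        synchronized (starts) {
            for (int i = 1; i < reports; i++) {
                gaps.add((starts.get(i) - starts.get(i - 1)) / 1_000_000);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Interval 100 ms in both runs; only the report duration differs.
        System.out.println("fast report gaps (ms): " + measureGaps(100, 10, 4));
        System.out.println("slow report gaps (ms): " + measureGaps(100, 250, 4));
    }
}
```

If something like this is what happens at high parallelism, the gaps in
Graphite would grow with the number of metrics instead of staying at the
configured interval.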

I have tried lowering the reporting interval from 10 SECONDS to 6-7
SECONDS, but metrics are still missing in that case. This happens even for
simple jobs such as "source -> map -> sink" when they run with higher
parallelism.

What can I do to debug this further or make it work? Has anyone come across
this before?

Regards,
Nikola Hrusov
