Thank you for your feedback :)
Regarding names:
The Dumper does not create a MetricSnapshot. The Dumper creates a
list of key-value pairs; metric_name:value.
A (single) MetricSnapshot exists in the WebRuntimeMonitor, into
which the dumped list is inserted.
So the dumper creates a snapshot but not a MetricSnapshot, and the
WebRuntimeMonitor contains a MetricSnapshot which isn't really a
snapshot but more a storage.
The naming isn't the best.
I'm not sure if "Service" really fits the bill; I associate a
service with a separate thread running in the background.
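To make the distinction between the two concrete, here is a minimal sketch. All class and method names in it are hypothetical and are not the prototype's API; values are kept as strings only to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not the prototype.
class DumpSketch {
    /** One dumped entry: metric_name:value. */
    static final class Entry {
        final String name;
        final String value;

        Entry(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Dumper side: produces a plain list of key-value pairs. */
    static List<Entry> dump(Map<String, Object> currentValues) {
        List<Entry> out = new ArrayList<>();
        for (Map.Entry<String, Object> e : currentValues.entrySet()) {
            out.add(new Entry(e.getKey(), String.valueOf(e.getValue())));
        }
        return out;
    }

    /** Monitor side: one long-lived store into which each dump is inserted. */
    static final class MetricStore {
        private final Map<String, String> values = new LinkedHashMap<>();

        void insert(List<Entry> dump) {
            for (Entry e : dump) {
                values.put(e.name, e.value); // newer dumps overwrite older values
            }
        }

        String get(String name) {
            return values.get(name);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> current = new LinkedHashMap<>();
        current.put("numRecordsIn", 42L);

        MetricStore store = new MetricStore();
        store.insert(dump(current));
        System.out.println(store.get("numRecordsIn"));
    }
}
```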
Regarding merging of metrics:
We are not merging any metrics right now. While Counters are easy to
merge, for Gauges we may have to let the user choose in the
WebInterface how they should be aggregated.
This is /not really/ a problem; in the sense that we don't have
different versions overwriting each other:
* JM/TM metrics don't have to be merged
* task metrics can be kept on a per subtask/operator level for now
(the prototype exposes them as
"<subtask_index>_<operator_name>_<metric_name>")
* job metrics are currently only gathered on the JM; so no merging
here either
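As a rough illustration of the per-subtask naming, and of why Counters would be easy to merge later, consider the sketch below. The class, the helper names and the sample operator names are assumptions made for the example; only the key format is taken from the prototype.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the key format comes from the prototype.
class SubtaskMetricKeys {
    /** Builds "<subtask_index>_<operator_name>_<metric_name>". */
    static String key(int subtaskIndex, String operatorName, String metricName) {
        return subtaskIndex + "_" + operatorName + "_" + metricName;
    }

    /** Sums one counter over all subtasks of one operator. */
    static long sumCounter(Map<String, Long> perSubtask, String operatorName, String metricName) {
        String suffix = operatorName + "_" + metricName;
        long sum = 0;
        for (Map.Entry<String, Long> e : perSubtask.entrySet()) {
            String k = e.getKey();
            // Everything before the first '_' is treated as the subtask index.
            int sep = k.indexOf('_');
            if (sep > 0 && k.substring(sep + 1).equals(suffix)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put(key(0, "Map", "numRecordsIn"), 10L);
        metrics.put(key(1, "Map", "numRecordsIn"), 32L);
        metrics.put(key(0, "Sink", "numRecordsIn"), 5L);
        System.out.println(sumCounter(metrics, "Map", "numRecordsIn")); // prints 42
    }
}
```

A Gauge has no equally obvious merge function (sum, max, average, ...), which is why that choice may have to be left to the user.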
Regarding transfer:
Should we transfer numbers as numbers, or also as strings? I'm
concerned about the efficiency of the whole thing; if we send some
metrics as strings and some as numbers we have to decide for every
metric which option we should take. That's why I was wondering
whether to send everything as objects or everything as strings.
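For comparison, the "everything as strings" variant could look roughly like this. It is only a sketch with hypothetical names: the sender never has to decide per metric, and the receiver tries to recover a number where one is present.

```java
// Hypothetical sketch of the "everything as strings" option.
class StringTransferSketch {
    /** Sender side: every metric value goes over the wire as a string. */
    static String encode(Object metricValue) {
        return String.valueOf(metricValue);
    }

    /** Receiver side: returns the numeric value, or null if the string is not a number. */
    static Double tryParseNumber(String transferred) {
        try {
            return Double.valueOf(transferred);
        } catch (NumberFormatException e) {
            return null; // e.g. a Gauge that reports a non-numeric value
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseNumber(encode(42L)));       // prints 42.0
        System.out.println(tryParseNumber(encode("RUNNING"))); // prints null
    }
}
```

The cost is an extra format/parse step for every numeric metric, which is the efficiency concern mentioned above.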
Regarding traversal of groups:
Yes, we would save on startup/teardown time if we traversed the
groups instead. However, the dumping itself would likely become more
expensive this way; and since this is done by the TaskManager thread
I wanted to keep it as simple as possible.
Also, there is currently no way to access the metrics contained in a
group. We would have to add another method to the
AbstractMetricGroup, which I would prefer not to do as it can lead
to concurrency issues during teardown.
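The trade-off between the two approaches can be sketched as follows. Everything here is hypothetical: in particular, the Group class exposes its metrics directly, which the real AbstractMetricGroup does not, and metrics are reduced to a LongSupplier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch contrasting the two ways of collecting values.
class DumpStrategies {
    /** Stand-in for a metric group that (unlike the real one) exposes its metrics. */
    static final class Group {
        final Map<String, LongSupplier> metrics = new LinkedHashMap<>();
        final List<Group> children = new ArrayList<>();
    }

    /** Registration: extra work at startup/teardown, but a dump is one flat loop. */
    static final class RegisteringDumper {
        private final Map<String, LongSupplier> registered = new LinkedHashMap<>();

        void register(String name, LongSupplier metric) {
            registered.put(name, metric);
        }

        void unregister(String name) {
            registered.remove(name);
        }

        Map<String, Long> dump() {
            Map<String, Long> out = new LinkedHashMap<>();
            for (Map.Entry<String, LongSupplier> e : registered.entrySet()) {
                out.put(e.getKey(), e.getValue().getAsLong());
            }
            return out;
        }
    }

    /** Traversal: nothing to register, but every dump walks the whole tree. */
    static Map<String, Long> dumpByTraversal(Group root) {
        Map<String, Long> out = new LinkedHashMap<>();
        collect(root, out);
        return out;
    }

    private static void collect(Group group, Map<String, Long> out) {
        for (Map.Entry<String, LongSupplier> e : group.metrics.entrySet()) {
            out.put(e.getKey(), e.getValue().getAsLong());
        }
        for (Group child : group.children) {
            collect(child, out);
        }
    }
}
```

The sketch ignores synchronization entirely; a traversal that runs while groups are being torn down is exactly where the concurrency issues would show up.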
On 02.08.2016 15:05, Till Rohrmann wrote:
The metrics transfer design document looks good to me. Thanks for your work,
Chesnay :-)
I think the benefit of registering the metrics at the MetricDumper is that
we don't have to walk through the hierarchy of metric groups to collect the
metric values. Indeed, this comes with increased costs at start-up. But I'm
not sure what the concrete impact on job performance is in these cases.
Cheers,
Till
On Tue, Aug 2, 2016 at 8:34 PM, Stephan Ewen <se...@apache.org> wrote:
Hi!
Thanks for writing this up. I think it looks quite reasonable (I hope I
understood that design correctly).
There is one point of confusion left for me, though: the MetricDumper and
the MetricSnapshot. I think it is just the names that confuse me here.
It looks like they define a way to query the metrics in the Metric Registry
in a standard schema (independent of the scope formats).
Should the "dumper" maybe be called "MetricsQueryService" or so? (The query
service returns a MetricSnapshot, if I understand correctly.)
It would be great if the "query service" would not need metrics to be
registered - saves us some effort during startup / teardown. It looks
as if the query service could just use the root-most component metric
groups to walk the tree of whatever metric is currently there and put it
into the current snapshot.
One open question that I have is: How do you know how to merge the metrics
from the subtasks, for example in case you want a metric across subtasks?
In general, not transferring objects (only strings / numbers) would be
preferable, because the WebMonitor may run in an environment where no
user-code classloader can be used.
It may run in the dispatcher (which must be trusted and cannot execute user
code).
Greetings,
Stephan
On Thu, Jul 28, 2016 at 3:12 PM, Chesnay Schepler <ches...@apache.org>
wrote:
Hello,
I just created a new FLIP which aims at exposing our metrics to the
WebInterface.
https://cwiki.apache.org/confluence/display/FLINK/FLIP-7%3A+Expose+metrics+to+WebInterface
Looking forward to feedback :)
Regards,
Chesnay Schepler