GitHub user zentol opened a pull request: https://github.com/apache/flink/pull/1947
[FLINK-1502] Basic Metric System

This PR is a preview of the new metric system. It is not complete because

* there is no documentation for the website
* a few smaller parts also don't have code documentation
* I haven't tried out the Ganglia/StatsD reporters yet

In general though it works, and it is now time to gather some feedback.

The PR is organized into several commits to give it some structure, generally divided by which part of the system they expose the metric system to. Note that the last commit, "Metric Usage Examples", is not technically part of the PR but showcases the usage. The division was done very simply, so some changes may technically belong to several commits.

## General overview

A user can access a system-provided MetricGroup to register a Metric, which is stored in a MetricRegistry and forwarded regularly to a Reporter, which communicates it to an external system.

## MetricGroups

MetricGroups are the user-facing part of the system. They are a nested data structure, containing other groups and metrics, that allows registering metrics with Flink while organizing them in a hierarchy.

For example, every TaskManager has a MetricGroup, and for every task that is deployed a new sub-group for that task is added. This task-specific group is propagated through the task stack, with new groups/metrics being added along the way. Within a UDF, the operator MetricGroup is accessed through the RuntimeContext.

## Metrics

Metrics are the objects used to measure something. Metrics include

* Gauges, which measure a value on demand
* Meters, which measure the rate/count of events
* Histograms, which measure the distribution of long values
* Counters, which count occurrences
* Timers, which measure the rate of calls and the distribution of execution time for a given piece of code

Under the hood we use the Metrics from the Dropwizard library.
In order to ensure interface stability, and to give us the option to reimplement things without breaking everything, they (and other classes) are wrapped to match our interfaces.

## Reporters

Reporters are the components that communicate the Metrics to the outside world. With this PR we allow exporting Metrics via JMX (default), Graphite, Ganglia and StatsD. The interval at which they report is configurable. Similarly to Metrics, we partially use reporters from the Dropwizard library (Graphite, Ganglia), again wrapped to match our interfaces.

Reporters are configured via flink-conf.yaml. An example configuration might look like this:

    metrics.reporter.class: org.apache.flink.metrics.GraphiteReporter
    metrics.reporter.arguments: --host localhost --port 8080
    metrics.reporter.interval: 30 SECONDS

Reporters are instantiated generically and configured with a Configuration containing the parsed arguments.

All reporters other than the JMX reporter are not part of the distribution and have to be added to the classpath manually (usually by putting the jar into /lib).

JMX uses port 9010 by default. This can be configured by setting the metrics.jmx.port property in flink-conf.yaml.

## Registry

The registry is essentially just a connection between all MetricGroups and the Reporter.
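
To make the flow from the overview concrete (group, metric, registry, reporter), here is a minimal sketch in plain Java. It does not use the classes from this PR or from Dropwizard. Every name in it (SketchGroup, SketchRegistry, SketchReporter, SketchCounter, SketchGauge) is hypothetical, and the dot-separated naming scheme is an assumption made for illustration only. It also reports once on demand, whereas the real system forwards metrics at a configured interval.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Entry point of the sketch: builds a group hierarchy, registers metrics, reports once. */
final class MetricFlowSketch {
    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();

        // TaskManager group -> task sub-group -> operator sub-group.
        SketchGroup taskManager = new SketchGroup(registry, "taskmanager");
        SketchGroup operator = taskManager.addGroup("myTask").addGroup("myOperator");

        SketchCounter records = operator.counter("recordsProcessed");
        records.inc();
        records.inc();
        operator.gauge("queueLength", () -> 7);

        // A reporter that simply prints each metric; a real one would talk to an external system.
        registry.reportTo(snapshot ->
                snapshot.forEach((name, value) -> System.out.println(name + " = " + value)));
    }
}

/** Hypothetical metric abstraction: anything that can produce a current value. */
interface SketchMetric {
    Object value();
}

/** Counts events. */
final class SketchCounter implements SketchMetric {
    private final AtomicLong count = new AtomicLong();

    void inc() {
        count.incrementAndGet();
    }

    @Override
    public Object value() {
        return count.get();
    }
}

/** Computes its value on demand from a supplier. */
final class SketchGauge implements SketchMetric {
    private final Supplier<?> supplier;

    SketchGauge(Supplier<?> supplier) {
        this.supplier = supplier;
    }

    @Override
    public Object value() {
        return supplier.get();
    }
}

/** Receives a snapshot of all metric values. */
interface SketchReporter {
    void report(Map<String, Object> snapshot);
}

/** Connects all groups to the reporter by holding every registered metric. */
final class SketchRegistry {
    private final Map<String, SketchMetric> metrics = new LinkedHashMap<>();

    void register(String fullName, SketchMetric metric) {
        metrics.put(fullName, metric);
    }

    /** Reads the current value of every metric, in registration order. */
    Map<String, Object> snapshot() {
        Map<String, Object> values = new LinkedHashMap<>();
        metrics.forEach((name, metric) -> values.put(name, metric.value()));
        return values;
    }

    void reportTo(SketchReporter reporter) {
        reporter.report(snapshot());
    }
}

/** User-facing handle: creates sub-groups and registers metrics under its scope. */
final class SketchGroup {
    private final SketchRegistry registry;
    private final String scope;

    SketchGroup(SketchRegistry registry, String scope) {
        this.registry = registry;
        this.scope = scope;
    }

    SketchGroup addGroup(String name) {
        return new SketchGroup(registry, scope + "." + name);
    }

    SketchCounter counter(String name) {
        SketchCounter counter = new SketchCounter();
        registry.register(scope + "." + name, counter);
        return counter;
    }

    void gauge(String name, Supplier<?> supplier) {
        registry.register(scope + "." + name, new SketchGauge(supplier));
    }
}
```

In this sketch the group only builds names and delegates to the registry, which mirrors the description of the registry as the connection between groups and the reporter. How the actual classes split these responsibilities may differ.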
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink metrics_v2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1947.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1947

----
commit b90b53cd73824389b41978f0113ca0c6d3da1422
Author: zentol <ches...@apache.org>
Date:   2016-04-15T13:57:14Z

    Add basic metric structures

    -add dropwizard dependency to flink-core
    -add metric wrappers
    -add metric groups/category organization
    -add metric registry

commit 45e6e123d37a8fba1bf76386a84436e8fb04a9fa
Author: zentol <ches...@apache.org>
Date:   2016-04-19T11:28:28Z

    Graphite/Ganglia/StatsD Reporters

commit e634060d83f2b475e954c67424ba39e3ffd92b6b
Author: zentol <ches...@apache.org>
Date:   2016-04-13T16:47:04Z

    Task Integration

    -included job name in TaskDeploymentDescriptor
    -enabled remote JMX for TaskManager
    -added TaskManager status metrics

commit 20ca6c3b19690e08335e31fcf3377f4a511e9b00
Author: zentol <ches...@apache.org>
Date:   2016-04-13T14:50:16Z

    Environment Integration

    -add MetricGroup field to environment
    -primary location to retrieve tm/task/subtask keyed metricgroup

commit e8eed4d27361ea311dbf9e9694cca70633d5b54e
Author: zentol <ches...@apache.org>
Date:   2016-04-13T14:23:54Z

    IO Metrics Integration

    -add metrics for records/bytes read/written

commit f47161db1804909f46520844d23a4e3148387f7b
Author: zentol <ches...@apache.org>
Date:   2016-04-14T10:02:51Z

    Streaming Operator Integration

commit c0c2d967dd53ceac966af4b7400982de5e53a272
Author: zentol <ches...@apache.org>
Date:   2016-04-13T15:17:15Z

    Batch Operator Integration

    -add getMetricGroup() method to TaskContext for driver access
    -add MetricGroup field to ChainedDriver for chained driver access

commit fa7a8947bde42333748ae02d7c02023f89d20e41
Author: zentol <ches...@apache.org>
Date:   2016-04-13T14:51:46Z

    Context Integration

    -add getMetricGroup() method to udf-context for udf/IO-format access

commit 9082d0697ad7f5c9146d77c932eb551eabba40ac
Author: zentol <ches...@apache.org>
Date:   2016-04-13T14:58:38Z

    Metric Usage Examples

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---