GitHub user tammymendt opened a pull request: https://github.com/apache/flink/pull/605
    [FLINK-1297] Added OperatorStatsAccumulator for tracking operator related stats

    The accumulator tracks min and max values, and estimates for count distinct and heavy hitters. The count distinct algorithms are Linear Counting and HyperLogLog, both from an imported library (clearspring). The heavy hitters algorithms are Lossy Counting (Manku et al. 2002) and Count Min Sketch (Cormode 2005). The heavy hitters algorithms are implemented in the statistics package in flink-core. By default, the accumulator currently uses only Linear Counting for count distinct and Lossy Counting for heavy hitters.

    Unlike the other accumulators, this accumulator does not track only the globally merged value. It additionally tracks an array of local statistics which have been collected at each subtask of a task. It does this by wrapping an extra class called OperatorStatisticsResult, which holds the local and global accumulated results. The idea is to be able to track statistics of the data processed in subtasks, so that they can be used to reason about partitioning strategies.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tammymendt/flink FLINK-1297-v2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #605

----
commit f365ccd92b513f10d0ba2d1a84b210d36060947c
Author: Tamara Mendt <tammyme...@gmail.com>
Date:   2015-04-16T09:25:16Z

[FLINK-1297] Added an accumulator called OperatorStatsAccumulator capable of tracking min, max and estimates for count distinct and heavy hitters. The count distinct algorithms are Linear Counting and HyperLogLog, both from an imported library from clearspring. The heavy hitters algorithms are Lossy Counting (Manku et al. 2002) and one based on Count Min Sketch (Cormode 2005).
The heavy hitters algorithms are implemented in the statistics package in flink-core. The accumulator tracks not only the globally merged value, but also an array of local statistics which have been collected at each subtask of a task. It does this using an extra class called OperatorStatisticsResult.

----
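
For readers unfamiliar with Lossy Counting, the heavy hitters default named in the description, the following is a minimal Java sketch of the textbook algorithm. It is not the code from this pull request: the class name LossyCountingSketch, its method names, and the restriction to String items are all assumptions made for illustration. The stream is split into buckets of width ceil(1/epsilon), and at each bucket boundary entries that cannot be frequent are evicted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of flink-core or this pull request.
final class LossyCountingSketch {
    private final double epsilon;   // maximum relative undercount error
    private final int bucketWidth;  // ceil(1 / epsilon)
    private long seen = 0;          // number of items processed so far
    // item -> {observed count, maximum possible undercount (delta)}
    private final Map<String, long[]> entries = new HashMap<>();

    LossyCountingSketch(double epsilon) {
        if (epsilon <= 0.0 || epsilon >= 1.0) {
            throw new IllegalArgumentException("epsilon must be in (0, 1)");
        }
        this.epsilon = epsilon;
        this.bucketWidth = (int) Math.ceil(1.0 / epsilon);
    }

    void add(String item) {
        seen++;
        // Id of the bucket the current item falls into: ceil(seen / bucketWidth).
        final long bucket = (seen + bucketWidth - 1) / bucketWidth;
        long[] entry = entries.get(item);
        if (entry != null) {
            entry[0]++;
        } else {
            // A new item may have been evicted earlier, at most (bucket - 1) times.
            entries.put(item, new long[] {1, bucket - 1});
        }
        // At a bucket boundary, evict entries that cannot be frequent.
        if (seen % bucketWidth == 0) {
            entries.values().removeIf(e -> e[0] + e[1] <= bucket);
        }
    }

    // Returns items whose observed count is at least (support - epsilon) * seen.
    Map<String, Long> heavyHitters(double support) {
        Map<String, Long> result = new HashMap<>();
        double threshold = (support - epsilon) * seen;
        for (Map.Entry<String, long[]> e : entries.entrySet()) {
            if (e.getValue()[0] >= threshold) {
                result.put(e.getKey(), e.getValue()[0]);
            }
        }
        return result;
    }
}
```

A caller would create one sketch per subtask, call add for every record, and query heavyHitters with the desired support fraction. Observed counts undercount the true frequency by at most epsilon times the number of items seen, which is why the query threshold is lowered by epsilon.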