[ https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497903#comment-14497903 ]
ASF GitHub Bot commented on FLINK-1297:
---------------------------------------

GitHub user tammymendt opened a pull request:

    https://github.com/apache/flink/pull/605

    [FLINK-1297] Added OperatorStatsAccumulator for tracking operator related stats

    The accumulator tracks min and max values, and estimates for count distinct and heavy hitters.

    The count distinct algorithms are Linear Counting and HyperLogLog, both from an imported library (clearspring). The heavy hitters algorithms are Lossy Counting (Manku et al., 2002) and Count Min Sketch (Cormode, 2005). The heavy hitters algorithms are implemented in the statistics package in flink-core.

    The accumulator currently defaults to Linear Counting for count distinct and to Lossy Counting for heavy hitters.

    Unlike the other accumulators, this one does not track only the globally merged value. It additionally tracks an array of local statistics collected at each subtask of a task. It does this by wrapping an extra class, OperatorStatisticsResult, which holds the local and global accumulated results. The idea is to track statistics of the data processed in each subtask, so that they can be used to reason about partitioning strategies.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tammymendt/flink FLINK-1297-v2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #605

----
commit f365ccd92b513f10d0ba2d1a84b210d36060947c
Author: Tamara Mendt <tammyme...@gmail.com>
Date:   2015-04-16T09:25:16Z

    [FLINK-1297] Added an accumulator called OperatorStatsAccumulator capable of tracking min, max and estimates for count distinct and heavy hitters.

    The count distinct algorithms are Linear Counting and HyperLogLog, both from an imported library from clearspring. The heavy hitters algorithms are Lossy Counting (Manku et al., 2002) and one based on Count Min Sketch (Cormode, 2005). The heavy hitters algorithms are implemented in the statistics package in flink-core. The accumulator tracks not only the globally merged value but also an array of local statistics collected at each subtask of a task. It does this using an extra class called OperatorStatisticsResult.

----

> Add support for tracking statistics of intermediate results
> -----------------------------------------------------------
>
>                 Key: FLINK-1297
>                 URL: https://issues.apache.org/jira/browse/FLINK-1297
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: Alexander Alexandrov
>            Assignee: Alexander Alexandrov
>             Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the
> runtime code with a statistics facility that collects the required
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic
> statistics for the (intermediate) result of dataflows (e.g. min, max, count,
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback from the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have
> some sort of detailed sketch about the key distribution of an intermediate
> result. I am not sure whether a simple histogram is the most effective way to
> go. Maybe somebody would propose another lightweight sketch that provides
> better accuracy.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
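
For readers following the pull request description, the idea of keeping per-subtask statistics next to the globally merged value can be illustrated with a minimal sketch. This is not the code from the pull request: the class names SubtaskStatsSketch and StatsResultSketch, their fields, and their methods are hypothetical, and only exact min, max and count are tracked here (no count distinct or heavy hitters estimates).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical statistics collected by a single subtask (assumption, not Flink API). */
class SubtaskStatsSketch {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long count = 0;

    /** Updates the statistics with one observed value. */
    void add(long value) {
        min = Math.min(min, value);
        max = Math.max(max, value);
        count++;
    }

    /** Folds the statistics of another instance into this one. */
    void merge(SubtaskStatsSketch other) {
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        count += other.count;
    }

    /** Returns an independent copy, so later updates do not leak into stored results. */
    SubtaskStatsSketch copy() {
        SubtaskStatsSketch c = new SubtaskStatsSketch();
        c.min = min;
        c.max = max;
        c.count = count;
        return c;
    }
}

/** Hypothetical holder for the local (per-subtask) and global (merged) statistics. */
class StatsResultSketch {
    final List<SubtaskStatsSketch> local = new ArrayList<>();
    final SubtaskStatsSketch global = new SubtaskStatsSketch();

    /** Stores one subtask's statistics and folds them into the global value. */
    void addSubtask(SubtaskStatsSketch stats) {
        local.add(stats.copy());
        global.merge(stats);
    }

    /** Merges another result: local entries are concatenated, global values combined. */
    void merge(StatsResultSketch other) {
        for (SubtaskStatsSketch s : other.local) {
            local.add(s.copy());
        }
        global.merge(other.global);
    }
}

class LocalGlobalStatsDemo {
    public static void main(String[] args) {
        SubtaskStatsSketch first = new SubtaskStatsSketch();
        first.add(1);
        first.add(2);
        first.add(3);

        SubtaskStatsSketch second = new SubtaskStatsSketch();
        second.add(10);

        StatsResultSketch result = new StatsResultSketch();
        result.addSubtask(first);
        result.addSubtask(second);

        System.out.println("global min=" + result.global.min
                + " max=" + result.global.max
                + " count=" + result.global.count);
        // Comparing per-subtask counts is one way to spot a skewed partitioning.
        for (int i = 0; i < result.local.size(); i++) {
            System.out.println("subtask " + i + " count=" + result.local.get(i).count);
        }
    }
}
```

In this sketch the per-subtask counts (3 and 1) stay visible after merging, which is the kind of information the description says could be used to reason about partitioning strategies.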