[ 
https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497903#comment-14497903
 ] 

ASF GitHub Bot commented on FLINK-1297:
---------------------------------------

GitHub user tammymendt opened a pull request:

    https://github.com/apache/flink/pull/605

    [FLINK-1297] Added OperatorStatsAccumulator for tracking operator related 
stats

    The accumulator tracks min and max values, and estimates for count distinct 
and heavy hitters.
    
    The count distinct algorithms are Linear Counting and HyperLogLog, both 
from an imported library (clearspring).
    
    The heavy hitters algorithms are Lossy counting (Manku et.al 2002) and 
Count Min Sketch (Cormode 2005).
    
    The heavy hitters algorithms are implemented in the statistics package in 
flink-core.
    
    The accumulator currently only uses Linear Counting as default for count 
distinct and Lossy Counting as default for heavy hitters. 
    
    The accumulator does not only track the globally merged value the way the 
other accumulators do. It additionally tracks an array of local statistics 
which have been collected at each subtask of a task. It does this by wrapping 
an extra class called OperatorStatisticsResult which holds the local and global 
accumulated results. The idea of this is to be able to track statistics of data 
processed in subtasks, so that they can be used to reason about partitioning 
strategies.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tammymendt/flink FLINK-1297-v2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #605
    
----
commit f365ccd92b513f10d0ba2d1a84b210d36060947c
Author: Tamara Mendt <tammyme...@gmail.com>
Date:   2015-04-16T09:25:16Z

    [FLINK-1297] Added an accumulator called OperatorStatsAccumulator capable 
of tracking min, max and estimates for count distinct and heavy hitters.
    
    The count distinct algorithms are Linear Counting and HyperLogLog, both 
from an imported library from clearspring.
    
    The heavy hitters algorithms are Lossy counting (Manku et.al 2002) and one 
based on Count Min Sketch (Cormode 2005).
    
    The heavy hitters algorithms are implemented in the statistics package in 
flink-core.
    
    The accumulator does not only track the globally merged value, but tracks 
an array of local statistics which have been collected at each subtask of a 
task. It does this using an extra class called OperatorStatisticsResult

----


> Add support for tracking statistics of intermediate results
> -----------------------------------------------------------
>
>                 Key: FLINK-1297
>                 URL: https://issues.apache.org/jira/browse/FLINK-1297
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: Alexander Alexandrov
>            Assignee: Alexander Alexandrov
>             Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack 
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the 
> runtime code with a statistics facility that collects the required 
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic 
> statistics for the (intermediate) result of dataflows (e.g. min, max, count, 
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback form the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have 
> some sort of detailed sketch about the key distribution of an intermediate 
> result. I am not sure whether a simple histogram is the most effective way to 
> go. Maybe somebody would propose another lightweight sketch that provides 
> better accuracy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to