[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284467#comment-15284467 ]
Gabor Gevay commented on FLINK-2147:
------------------------------------

In my opinion, the semantics should be to calculate the statistic for each window separately. When to emit is handled by the triggers (as with other windowing calculations in Flink). Note that the windows can be quite large, such as weekly or monthly.

I think that a statistic about the entire stream is rarely what the user actually wants. Flink programs are designed to run for a long time, possibly indefinitely, and the starting point of a stream is just when the user happened to start the Flink program, which might have no real semantic meaning if the program is analyzing some external system.

> Approximate calculation of frequencies in data streams
> ------------------------------------------------------
>
>                 Key: FLINK-2147
>                 URL: https://issues.apache.org/jira/browse/FLINK-2147
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Gabor Gevay
>              Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track
> of the frequencies of elements in a data stream. It is described by Cormode
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but
> here the user can specify a threshold below which she is not interested in
> the frequency of an element. The error bounds also differ from those of the
> Count-Min sketch algorithm.
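
To make the discussion concrete, here is a minimal Python sketch of a Count-Min sketch with a merge step, along the lines of the distributed variant mentioned in the issue description. It is only an illustration under stated assumptions, not a proposed Flink API: the class and method names are hypothetical, the salted BLAKE2b hashing stands in for the pairwise-independent hash functions the paper assumes, and the width/depth sizing follows the usual epsilon/delta formulas. Under the per-window semantics, one such sketch would be kept per window, and partial sketches built in parallel for the same window would be merged before the trigger emits.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Usual sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        # Assumption: one salted BLAKE2b hash per row, used here as a
        # stand-in for a family of pairwise-independent hash functions.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches with identical dimensions and hash functions can be
        # combined by adding their counter tables element-wise.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions differ")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Two partial sketches for the same window, e.g. from parallel subtasks.
left, right = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a"]:
    left.add(word)
for word in ["a", "c"]:
    right.add(word)
left.merge(right)
print(left.estimate("a"))  # at least 3, the true count of "a"
```

Starting from a fresh sketch for every window keeps the estimates tied to that window only, which matches the semantics argued for in the comment above.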