[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955147#comment-15955147 ]
Aljoscha Krettek commented on FLINK-2147:
-----------------------------------------

Yes, but what I'm saying is that it is not easy to deal with these task-local states when you change parallelism. For example, assume that you have parallelism 3, so you have three task-local states. Now the parallelism is changed to 2. How do you redistribute the sketch state? Keep in mind that Flink uses a (more or less) fixed partitioner for deciding where to send keyed elements. We have this to ensure that elements go to the parallel operator that is responsible for a key and that has the correct state.

The reverse problem is even harder, I think, for example when you want to scale from parallelism 1 to a higher parallelism.

> Approximate calculation of frequencies in data streams
> ------------------------------------------------------
>
>                 Key: FLINK-2147
>                 URL: https://issues.apache.org/jira/browse/FLINK-2147
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataStream API
>            Reporter: Gabor Gevay
>              Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track
> of the frequencies of elements in a data stream. It is described by Cormode
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but
> here the user can specify a threshold below which she is not interested in
> the frequency of an element. The error bounds also differ from those of the
> Count-Min sketch algorithm.
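
To make the redistribution question concrete, here is a minimal sketch of a Count-Min sketch with a merge operation. It is not Flink code and not taken from the issue: the class name `CountMinSketch`, its constructor parameters (`depth`, `width`, `seed`) and the hash construction are assumptions chosen for illustration. It relies only on the fact that two sketches built with the same hash functions can be combined by adding their counter arrays cell by cell.

```java
import java.util.Random;

/** Illustrative Count-Min sketch (hypothetical class, not a Flink API). */
public class CountMinSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1

    private final int depth;   // number of hash rows
    private final int width;   // counters per row
    private final long seed;   // determines the hash functions
    private final long[] a;
    private final long[] b;
    private final long[][] counts;

    public CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.seed = seed;
        this.a = new long[depth];
        this.b = new long[depth];
        this.counts = new long[depth][width];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = rnd.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    // Maps an item to a counter index in the given row.
    private int bucket(int row, Object item) {
        long x = item.hashCode() & 0x7fffffffL; // non-negative 31-bit value
        return (int) (((a[row] * x + b[row]) % PRIME) % width);
    }

    /** Adds a non-negative count for the item. */
    public void add(Object item, long count) {
        for (int i = 0; i < depth; i++) {
            counts[i][bucket(i, item)] += count;
        }
    }

    /** Returns an estimate that never undercounts the true frequency. */
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counts[i][bucket(i, item)]);
        }
        return min;
    }

    /** Adds the other sketch's counters into this one, cell by cell. */
    public void merge(CountMinSketch other) {
        if (other.depth != depth || other.width != width || other.seed != seed) {
            throw new IllegalArgumentException("sketches must share dimensions and hash functions");
        }
        for (int i = 0; i < depth; i++) {
            for (int j = 0; j < width; j++) {
                counts[i][j] += other.counts[i][j];
            }
        }
    }

    public static void main(String[] args) {
        // Three task-local sketches, as with parallelism 3.
        CountMinSketch s1 = new CountMinSketch(4, 64, 42L);
        CountMinSketch s2 = new CountMinSketch(4, 64, 42L);
        CountMinSketch s3 = new CountMinSketch(4, 64, 42L);
        s1.add("x", 3);
        s2.add("x", 2);
        s3.add("y", 5);
        // Fold everything into one sketch.
        s1.merge(s2);
        s1.merge(s3);
        System.out.println("estimate(x) = " + s1.estimate("x"));
    }
}
```

Merging covers only part of the problem raised above. Folding sketches together is straightforward, but a sketch cannot be split by key, because each counter mixes contributions from many keys. Going from one sketch to several, or making a merged sketch line up with a fixed key partitioner, is therefore not addressed by this example.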