[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284406#comment-15284406 ]
Gabor Gevay commented on FLINK-2147: ------------------------------------ Yes, I think it would be very good if you worked on this, and starting with a design document is a good idea. However, if you haven't yet contributed to Flink, then I would suggest starting with some simpler tasks first. Note, that the hard part here will not be implementing the algorithm that is described in the linked paper, but figuring out how it fits into the API and internals of Flink. (This was a sub-task of my Google Summer of Code project last summer, but then I haven't really worked on this, because the streaming API was changing a lot at that time. I will convert this sub-task to a stand-alone issue now.) > Approximate calculation of frequencies in data streams > ------------------------------------------------------ > > Key: FLINK-2147 > URL: https://issues.apache.org/jira/browse/FLINK-2147 > Project: Flink > Issue Type: Sub-task > Components: Streaming > Reporter: Gabor Gevay > Priority: Minor > Labels: statistics > > Count-Min sketch is a hashing-based algorithm for approximately keeping track > of the frequencies of elements in a data stream. It is described by Cormode > et al. in the following paper: > http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf > Note that this algorithm can be conveniently implemented in a distributed > way, as described in section 3.2 of the paper. > The paper > http://www.vldb.org/conf/2002/S10P03.pdf > also describes algorithms for approximately keeping track of frequencies, but > here the user can specify a threshold below which she is not interested in > the frequency of an element. The error-bounds are also different than the > Count-min sketch algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)