[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

Gabor Gevay (JIRA) Mon, 16 May 2016 04:42:27 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284406#comment-15284406
 ]


Gabor Gevay commented on FLINK-2147:
------------------------------------

Yes, I think it would be very good if you worked on this, and starting with a 
design document is a good idea. However, if you haven't yet contributed to 
Flink, then I would suggest starting with some simpler tasks first. Note, that 
the hard part here will not be implementing the algorithm that is described in 
the linked paper, but figuring out how it fits into the API and internals of 
Flink.

(This was a sub-task of my Google Summer of Code project last summer, but then 
I haven't really worked on this, because the streaming API was changing a lot 
at that time. I will convert this sub-task to a stand-alone issue now.)

> Approximate calculation of frequencies in data streams
> ------------------------------------------------------
>
>                 Key: FLINK-2147
>                 URL: https://issues.apache.org/jira/browse/FLINK-2147
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Gabor Gevay
>            Priority: Minor
>              Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

Reply via email to