[jira] [Commented] (FLINK-7465) Add build-in BloomFilterCount on TableAPI&SQL

Fabian Hueske (JIRA) Tue, 22 Aug 2017 14:23:34 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137425#comment-16137425
 ]


Fabian Hueske commented on FLINK-7465:
--------------------------------------

I'm sorry, I confused count-min sketches (for approximate group counts) and 
HyperLogLog (for approximate distinct counts). 

I assume the goal of the BloomFilterCount function is to (approximately) count 
the number of distinct values. In contrast to HyperLogLog, Bloom filters are 
not specifically designed for approximate distinct counting but for approximate 
membership testing. AFAIK, Bloom filters should be more precise for low 
distinct cardinalities, but HyperLogLog should provide much better results for 
larger cardinalities.
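
To make the precision argument concrete, here is a minimal sketch of how a distinct count could be read off a Bloom filter, assuming the usual fill-ratio estimate n ~= -(m / k) * ln(1 - x / m), where x is the number of set bits. The class and method names are hypothetical and do not come from Flink or from this issue. As the filter fills up, x approaches m and the estimate becomes unstable, which is one reason larger cardinalities are a weak spot.

```java
/** Hypothetical helper for illustration; not part of Flink or Hive. */
final class BloomCardinality {
    private BloomCardinality() {}

    /**
     * Estimates the number of distinct inserted elements from the fill ratio:
     * n ~= -(m / k) * ln(1 - setBits / m).
     *
     * @param m       number of bits in the filter
     * @param k       number of hash functions
     * @param setBits number of bits currently set to 1
     */
    static double estimate(long m, int k, long setBits) {
        if (m <= 0 || k <= 0 || setBits < 0 || setBits > m) {
            throw new IllegalArgumentException("invalid Bloom filter parameters");
        }
        if (setBits == m) {
            // A saturated filter carries no usable information about the count.
            return Double.POSITIVE_INFINITY;
        }
        // log1p(-r) computes ln(1 - r) accurately for small fill ratios r.
        return -((double) m / k) * Math.log1p(-(double) setBits / m);
    }
}
```

For a sparsely filled filter the estimate stays close to setBits / k, so the error is small. Close to saturation, a single extra bit changes the result a lot.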

IMO, [~jark]'s idea to split the bitmask into multiple long values is pretty 
nice. OTOH, multiple RocksDB point lookups might also be more expensive than a 
single lookup with a larger serialization payload (the deserialization logic for 
byte arrays shouldn't be very costly).
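
A rough sketch of the split-bitmask layout, to show where the trade-off comes from. It assumes the bitmask is stored as 64-bit words keyed by word index. A plain HashMap stands in for keyed map state here, and the class name is hypothetical. Setting a bit touches exactly one word, while counting the set bits has to read every stored word.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; the HashMap is a stand-in for keyed map state. */
final class ChunkedBitmask {
    // word index -> 64-bit word; absent entries are treated as all zeros
    private final Map<Integer, Long> words = new HashMap<>();
    private final int numBits;

    ChunkedBitmask(int numBits) {
        if (numBits <= 0) {
            throw new IllegalArgumentException("numBits must be positive");
        }
        this.numBits = numBits;
    }

    /** Sets one bit; reads and writes exactly one word (one point lookup). */
    void set(int bitIndex) {
        checkIndex(bitIndex);
        long mask = 1L << (bitIndex & 63);
        words.merge(bitIndex >>> 6, mask, (a, b) -> a | b);
    }

    /** Tests one bit; reads exactly one word. */
    boolean get(int bitIndex) {
        checkIndex(bitIndex);
        long word = words.getOrDefault(bitIndex >>> 6, 0L);
        return (word & (1L << (bitIndex & 63))) != 0;
    }

    /** Counts the set bits; has to visit every stored word. */
    long cardinality() {
        long count = 0;
        for (long word : words.values()) {
            count += Long.bitCount(word);
        }
        return count;
    }

    /** Number of words that have been materialized so far. */
    int storedWords() {
        return words.size();
    }

    private void checkIndex(int bitIndex) {
        if (bitIndex < 0 || bitIndex >= numBits) {
            throw new IndexOutOfBoundsException("bit index out of range: " + bitIndex);
        }
    }
}
```

With k hash functions, one insert into a Bloom filter built on this layout can cause up to k separate word lookups, whereas a single byte-array value needs one lookup but deserializes the whole bitmask.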

> Add build-in BloomFilterCount on TableAPI&SQL
> ---------------------------------------------
>
>                 Key: FLINK-7465
>                 URL: https://issues.apache.org/jira/browse/FLINK-7465
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: sunjincheng
>            Assignee: sunjincheng
>         Attachments: bloomfilter.png
>
>
> In this JIRA, use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch is as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I would appreciate it if you could give me some advice. :-)
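
The add and query steps from the quoted description can be sketched as follows. This is a minimal illustration, not the implementation proposed in the issue and not the Hive class linked above. It assumes the k positions are derived from two base hashes (double hashing) and that the bit array is backed by a long[]. The class name and the mixing function are assumptions made for this example.

```java
/** Hypothetical minimal Bloom filter for illustration only. */
final class SimpleBloomFilter {
    private final long[] bits; // bit array of m bits, all initially 0
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.m = m;
        this.k = k;
        this.bits = new long[(m + 63) >>> 6];
    }

    /** Adds an element by setting the bits at all k positions. */
    void add(Object element) {
        long h = mix(element.hashCode());
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < k; i++) {
            int pos = position(h1, h2, i);
            bits[pos >>> 6] |= 1L << (pos & 63);
        }
    }

    /**
     * Returns false if the element is definitely not in the set,
     * true if it is in the set or the result is a false positive.
     */
    boolean mightContain(Object element) {
        long h = mix(element.hashCode());
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < k; i++) {
            int pos = position(h1, h2, i);
            if ((bits[pos >>> 6] & (1L << (pos & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    // i-th position via double hashing: (h1 + i * h2) mod m, kept non-negative
    private int position(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, m);
    }

    // 64-bit mixing step in the style of SplitMix64 (an assumption of this
    // sketch), used to spread the 32-bit hashCode over two base hashes
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}
```

With m = 18 and k = 3 as in the figure, adding x, y and z guarantees that all three are reported as possibly present. Whether w is rejected depends on where its three positions land.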



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
