[ https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156251#comment-16156251 ]
ASF GitHub Bot commented on FLINK-7465:
---------------------------------------

GitHub user sunjincheng121 opened a pull request:

    https://github.com/apache/flink/pull/4652

    [FLINK-7465][table]Add cardinality count for tableAPI and SQL.

    ## What is the purpose of the change

    In this PR, we want to add CARDINALITY_COUNT for the Table API and SQL,
    using the `HyperLogLog` algorithm. The implementation follows the
    HyperLogLog (HLL) algorithm from this paper:
    http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
    There are improved algorithms, such as HyperLogLog++ and HyperBitBit, but
    `HyperLogLog` is a classic algorithm that has been widely verified, so I
    chose it for the first version of the cardinality count. We can improve
    the algorithm later if needed.

    ## Brief change log

      - *Add a Java implementation of `HyperLogLog` (based on stream-lib)*
      - *Add MurmurHash (see http://murmurhash.googlepages.com/)*
      - *Add the built-in `CardinalityCountAggFunction`*
      - *Add test cases for validation*
      - *Add documentation for the Table API & SQL*

    ## Verifying this change

    This change added tests and can be verified as follows:

      - *Added SQL/Table API integration tests for `cardinality_count`*
      - *Added the `CardinalityCountAggFunctionTest` test case to verify the aggregation logic*

    ## Does this pull request potentially affect one of the following parts:

      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)

    ## Documentation

      - Does this pull request introduce a new feature? (yes)
      - If yes, how is the feature documented?
(docs / JavaDocs)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sunjincheng121/flink FLINK-7465-PR

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4652.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4652

----
commit bc1166ad88538bdcdd6df685c750359aadff3950
Author: 金竹 <jincheng.su...@alibaba-inc.com>
Date:   2017-09-05T10:21:10Z

    [FLINK-7465][table]Add cardinality count for tableAPI and SQL.

----

> Add build-in BloomFilterCount on TableAPI&SQL
> ---------------------------------------------
>
>                 Key: FLINK-7465
>                 URL: https://issues.apache.org/jira/browse/FLINK-7465
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: sunjincheng
>            Assignee: sunjincheng
>         Attachments: bloomfilter.png
>
>
> In this JIRA, we use a BloomFilter to implement counting functions.
> BloomFilter algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also
> be k different hash functions defined, each of which maps or hashes some set
> element to one of the m array positions, generating a uniform random
> distribution. Typically, k is a constant, much smaller than m, which is
> proportional to the number of elements to be added; the precise choice of k
> and the constant of proportionality of m are determined by the intended false
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of
> the k hash functions to get k array positions. If any of the bits at these
> positions is 0, the element is definitely not in the set; if it were, then
> all the bits would have been set to 1 when it was inserted.
> If all are 1, then either the element is in the set, or the bits have by
> chance been set to 1 during the insertion of other elements, resulting in a
> false positive.
> The attached figure shows an example of a Bloom filter representing the set
> {x, y, z}. The colored arrows show the positions in the bit array that each
> set element is mapped to. The element w is not in the set {x, y, z}, because
> it hashes to one bit-array position containing 0. For this figure, m = 18 and
> k = 3. The sketch is as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr], I would appreciate it if you could give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
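
The add and query steps in the issue description can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request or from Hive's `BloomFilter.java`: the class name `SimpleBloomFilter` and its methods are hypothetical, and instead of k independent hash functions it derives the k positions from two seeded FNV-1a hashes (double hashing), which is a substituted technique.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative sketch only; names and hashing scheme are assumptions,
// not the Flink or Hive implementation.
final class SimpleBloomFilter {
    private final BitSet bits; // bit array of m bits, all initially 0
    private final int m;       // number of bits
    private final int k;       // number of hash positions per element

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    /** Sets the k positions of the element; returns true if any bit went from 0 to 1. */
    boolean add(String element) {
        boolean changed = false;
        for (int pos : positions(element)) {
            if (!bits.get(pos)) {
                bits.set(pos);
                changed = true;
            }
        }
        return changed;
    }

    /** Returns false if the element is definitely absent, true if it may be present. */
    boolean mightContain(String element) {
        for (int pos : positions(element)) {
            if (!bits.get(pos)) {
                return false; // a 0 bit proves the element was never added
            }
        }
        return true; // all bits are 1: present, or a false positive
    }

    // Assumption: simulate k hash functions as h1 + i * h2 (mod m).
    private int[] positions(String element) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5); // standard FNV-1a offset basis
        int h2 = fnv1a(data, 0x9747B28C); // arbitrary second seed
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, m);
        }
        return result;
    }

    // 32-bit FNV-1a with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193; // FNV prime 16777619
        }
        return hash;
    }
}
```

With the figure's parameters, `new SimpleBloomFilter(18, 3)` followed by `add("x")`, `add("y")`, `add("z")` guarantees that `mightContain` returns true for those three elements, while a query for `"w"` returns false only if at least one of its positions is still 0. The issue does not say how the counting function uses the filter; one possible reading, offered here as an assumption, is to increment a counter whenever `add` returns true, which undercounts distinct elements whenever a false positive occurs.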