[ https://issues.apache.org/jira/browse/FLINK-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698306#comment-16698306 ]
Stephan Ewen commented on FLINK-10993:
--------------------------------------

I think this could be an interesting feature, but it looks like there are many things involved here:

  - bloom filter needs to be built first (parallel with merging or serial), then made available for probing
  - what would the distributed execution look like
  - embedding in the API

I think this should be a FLIP, given all these non-trivial questions, and I would additionally seek support from a committer with DataStream / TableAPI experience to help shepherd this before diving in.

> Bring bloomfilter as a public API
> ---------------------------------
>
>                 Key: FLINK-10993
>                 URL: https://issues.apache.org/jira/browse/FLINK-10993
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataStream API
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>
> Flink internally provides an implementation of BloomFilter, but only for
> internal optimization, and does not provide APIs for public access.
> Here is an earlier discussion on the user mailing list:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Bloom-filter-in-Flink-td10608.html
> Considering that many users have the need to "determine duplicates" in
> streaming computing, I think it would make sense to provide such an API.
> In addition, Spark has provided BloomFilter as a public API:
> {code:java}
> val bf = df.stat.bloomFilter("dd",dataLen,0.01)
> val rightNum = rdd.map(x=>(x.toInt,bf.mightContainString(x)))
> {code}
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)