The change to make the CountingBloomFilter an interface is in this PR [1]. This renames the existing CountingBloomFilter to MapCountingBloomFilter. I’ve left this alone and just done my work on new files:
- CountingBloomFilter: An interface to extend BloomFilter - ArrayCountingBloomFilter: An implementation using a backing array Have a look and see what you think. To implement the CountingBloomFilter you have to state the policy for handling any negative counts that may be generated by incorrect usage. For this implementation the policy is that negative counts are left alone and the filter is invalid. In most realistic use cases the generation of negative counts should not happen. It means that you are using the filter incorrectly by removing something you never added. So I don’t think there is a need to police this event strictly with the associated overhead of doing so. There was previous discussion about the limitations of using an array with regard to how many bits the filter can have. I’ve explained this limitation in the javadoc header for ArrayCountingBloomFilter. However it seems to apply to the entire API given that it is based around bit indexes as 32-bit integers. So the maximum size for any Bloom filter in this API is 2^31 - 1. I note that a BitSet only supports a size up to this level. You could store 64 times more than this using the long[] storage representation. But then you would have to address them using a long for the index. Thus this warning in the ArrayCountingBloomFilter should probably be taken out and put in the package-info somewhere. The new renamed MapCountingBloomFilter could either be updated to implement the CountingBloomFilter interface or removed from the API. Other changes to the existing API should follow after this (i.e. returning iterators/splitators of indexes, returning booleans after merge operations, etc.) One detail is I have removed the toString() implementation which concatenated all indexes and their counts. This may be helpful for debugging but printing out all the set indexes in a big filter could overflow a StringBuilder buffer! I think that all the toString() overrides should be removed or changed to a simple summary such as the filter shape and the current cardinality. If you want a set notation for the current state of the filter (e.g. like BitSet.toString()) then a helper class could be provided that takes the filter or counting filter and extracts the string representation: BloomFilterConverter static Appendable toAppendable(BloomFilter bf, Appendable out); BloomFilter bf = … // set bits 1,2,3,10 BloomFilterConverter.toAppendable(bf, new StringBuilder()).toString() "{1,2,5,10}” [1] https://github.com/apache/commons-collections/pull/137 <https://github.com/apache/commons-collections/pull/137>