I am refactoring some of the code to make the Builder and Hash classes enclosed in the ProtoBloomFilter class. I have also add the proxy serializer as noted in the effective java talk. This simplifies the classes significantly. I will have unit tests and fix the javadocs before the next push.
Claude On Mon, Sep 16, 2019 at 2:01 AM Gary Gregory <garydgreg...@gmail.com> wrote: > On Sun, Sep 15, 2019 at 8:17 PM sebb <seb...@gmail.com> wrote: > > > On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <gillese...@gmail.com> > > wrote: > > > > > > Hi. > > > > > > Le sam. 14 sept. 2019 à 08:15, Claude Warren <cla...@xenei.com> a > écrit > > : > > > > > > > > @Gilles > > > > > > > > I am happy to rename the package without the plural if that is the > > > > standard, I will also fix the indent issue. Is there a definition > > that can > > > > be quickly imported into Eclipse to do the proper formatting? > > > > > > Hopefully, someone else can answer that one. > > > > > > > I am adding/updating all comments in the code. > > > > > > > > FilterConfig contains a main method to make it easy to assess how the > > > > various settings influence the length and number of functions in the > > > > filter. This could go into another class or a Utility class but > > neither of > > > > those seem like a better solution. > > > > > > I don't think there should be a "main" method in a library. > > > Also, just printing to stdout does not seem very flexible. > > > Perhaps define a "toString()" method? > > > > > > > I agree with your assessment of final and immutable for methods, > > variables > > > > and classes. > > > > > > > > I am removing the FilterConfig constructor for BloomFilter and > > ensuring the > > > > BloomFilter makes a defensive copy. > > > > > > > > The difference between match() and equals() is that equals() is a bit > > > > vector equality and match is the standard bloom filter comparison T&C > > = T. > > > > Documentation is being added to the match() and inverseMatch() to > > explain > > > > this. > > > > > > > > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed). > > It > > > > identifies how many bytes the FilterConfig will use when written to a > > > > DataOuptut or read from a DataInput. > > > > > > Why is it necessary to know such internal details? > > > Anyways, I think that persistence is best left to higher-level code. > > > > > > > > > > > I think the Utils class can be removed once the other code is moved > to > > > > Java8 streams and not extended iterators. > > > > > > > > Hash function is changed to Objects.hash() > > > > > > > > Constructors are/will be private where factories make sense. For > some > > the > > > > simple constructor makes more sense. (e.g. BloomFilter(BitSet) ) > > > > > > See > > > > https://www.baeldung.com/java-constructors-vs-static-factory-methods > > > and > > > https://blog.joda.org/2014/03/valjos-value-java-objects.html > > > > > > > I have no problem removing the Serializable on FilterConfig, but can > > you > > > > explain (or point to a post) why it should be avoided? > > > > > > Same as above (not prejudging how application want to handle > > serialization). > > > > It's difficult to code Serializable classes properly, and just as hard > > to maintain them. > > Also the serialised format becomes part of the API making changes harder. > > > > We should not promise what we likely cannot continue to deliver, > > especially since it may not be needed. > > > > Serializable classes are also a target for code exploits. > > > > +1 to removing Serializable support in its current implementation. If > serialization is a requirement, then it should be implemented using a > Serialization Proxy a la J. Block: > > https://www.youtube.com/watch?v=V1vQf4qyMXg&feature=youtu.be&t=56m12s > > Gary > > > > > > The update/build methods are being documented and an explanation of > the > > > > difference included. > > > > > > Am I wrong that the "build(...)" methods only spare the typing of 2 > > > characters? > > > If so, it's not worth the increase of the API size. > > > > > > > > > > > getLog() is also being documented > > > > > > > > The asBitSet() method was to extract the bitset from the bloom filter > > so > > > > that it could be stored otherwise used with BitSet based processes. > > The > > > > name of the method is being chagned to getBitSet() and it makes a > > defensive > > > > copy. > > > > > > > > The ProtoBloomFilter is not a factory. It is really a bean and there > > are > > > > times when it makes sense to use the ProtoBloomFilter. The name is > > > > important as it identifies what it is. It contains the necessary > > > > information to build any configuration of BloomFilter. There is a > > proposal > > > > to use ProtoBloomFilters in service calls. It could be defined as an > > > > interface with a concrete implementation in the > > ProtoBloomFilterBuilder. > > > > > > > > As an example of ProtoBloomFilter usage. Say you have a Person > object > > that > > > > has the user's first name, last name, eye color, hair color. > > > > > > > > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to > > > > update() followed by a build() to build the protoBloomFilter. > > > > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( > "Claude" > > > > ).update( "Warren" ).update( "brown" ).update( "silver" ).build(); > > > > > > I'm still missing why it cannot be a factory, but maybe it is a > > > question of design habits. What about the following: > > > > > > public class BloomFilter { > > > private final BitSet bitSet; > > > > > > private BloomFilter(BitSet bs) { > > > bitSet = (BitSet) (bs.clone()); > > > } > > > > > > /** Factory method. (cf. "ValJO") */ > > > public BloomFilter of(BitSet bs) { > > > return new BloomFilter(bs); > > > } > > > > > > // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic. > > > public static class Builder { > > > private final Set<Hash> hashes = new HashSet<>(); > > > > > > private Builder() {} > > > > > > /** Factory method(s). */ > > > public Builder with(Builder other) { > > > hashes.addAll(other.hashes); > > > } > > > public Builder with(ByteBuffer b) { /* ... */ } > > > // etc. > > > > > > /** Factory (converter). */ > > > public BloomFilter create(FilterConfig cfg) { > > > final BitSet set = new BitSet(cfg.getNumberOfBits()); > > > for (Hash hash : hashes) { > > > hash.populate(set, cfg); > > > } > > > return BloomFilter.of(set); > > > } > > > } > > > } > > > > > > Perhaps I'm taking too many shortcuts... But, as pointed out by > > > Gary, more coverage is required; and unit tests will help figure out > > > the leanest design. > > > > > > > > > > > I can create a bloom filter configuration that has 1x10^7 items and > > > > accepts 1 / 1x10^6 collisions. > > > > FilterConfig config1 = new FilterConfig( 10000000, 1000000) > > > > > > > > and build the BloomFilter > > > > BloomFilter filter1 = proto.create( config1 ) > > > > > > > > I can use that filter to determine if the person is in a collection > of > > > > 1x10^7 items. The implementation of that collection could be > multiple > > > > buckets of 1x10^4 items with a collision rate of 5x10^3 items. To > > check > > > > that I create an appropriate FilterConfig > > > > FilterConfig config2 = new FilterConfig( 10000, 5000 ) > > > > > > > > and then create the proper bloom filter > > > > BloomFilter filter2 = proto.create( config2 ) > > > > > > > > > > With my above attempt at an alternative API, we'd have (untested): > > > > > > BloomFilter.Builder proto = BloomFilter > > > .with("Claude") > > > .with("Warren") > > > .with("brown"). > > > .with("silver"); > > > > > > BloomFilter filter1 = proto.create(config1); > > > BloomFilter filter2 = proto.create(config2); > > > > > > > > > IIUC, in your current implementation of "ProtoBloomFilter", there > > > is a side-effect of "build()" (i.e. the call to "clear()"); thus > calling > > > "build()" twice will not yield the same result. I don't think that's > > > a desirable feature. > > > > > > Regards, > > > Gilles > > > > > > > Claude > > > > > > > > > > > > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski < > gillese...@gmail.com> > > > > wrote: > > > > > > > > > > > [...] > > > > > > > > > > > > Gilles, > > > > > > > > > > > > Take a look at the PR, I added comments to allow the PR to have 0 > > deps. > > > > > > > > > > > > Gary > > > > > > > > > > > > > > > > > > > > > > IMO, the package should be named "bloomfilter" (without "s"). > > > > > Naming seems inconsistent in [Collections]: Some (package) > > > > > names are singular, other plural. > > > > > > > > > > * Indent must be 4 spaces. > > > > > * All fields and methods must be commented. > > > > > * "FilterConfig" contains a "main" method. > > > > > * Local variables and fields must be declared "final" whenever > > constant. > > > > > * Some methods are declared "final", others not. > > > > > * "BloomFilter" should be made immutable. > > > > > * "BloomFilter" could do without a second constructor: > "FilterConfig" > > > > > should have a method that returns a "BitSet". > > > > > * "BloomFilter" must make a defensive copy of its "bitSet" > argument. > > > > > * What's the difference between "match" and "equals"? > > > > > * Why is "STORED_SIZE" public? > > > > > * "Utils" classes should be avoided. > > > > > * Hash function should be imported/shaded from "Commons Codec" > > > > > (and defined there if missing). > > > > > * Constructor(s) should be private (instantiation through factory > > > > > methods (cf. ValJO). > > > > > * "Serializable" should be avoided. > > > > > * The "update" methods are missing explanations (or reference) > about > > > > > their purpose. Also, it seems that "update" and "build" are > > redundant. > > > > > * Does "getLog" return a standard value? If so, the reference > should > > > > > appear in the Javadoc (not just as code comment). > > > > > * What is the purpose of method "asBitSet()"? > > > > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory". > > > > > "ProtoBloomFilterBuilder" should be a static inner class of that > > > > > factory. > > > > > > > > > > Regards, > > > > > Gilles > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren