On Tue, 30 Apr 2024 at 14:45, Gary D. Gregory <ggreg...@apache.org> wrote:
> Hi Claude, > > Thank you for the detailed reply :-) A few comments below. > > On 2024/04/30 06:29:38 Claude Warren wrote: > > I will see if I can clarify the javadocs and make things clearer. > > > > What I think I specifically heard is: > > > > - Be clear that producers are fast fail iterators with predicate > tests. > > - Rename CellConsumer to CellPredicate (?) > > Agreed (as suggested by Albert) > > > - The semantic nomenclature: > > - Bitmaps are arrays of bits not a BitMap object. > > - Indexes are ints and not an instance of a Collection object. > > - Cells are pairs of ints representing an index and a value. They > > are not Pair<> objects. > > - Producers iterate over collections of the object (Bitmap, Index, > > Cell) applying a predicate to do work and stop the iteration early > if > > necessary. They are carriers/transporters of Bloom filter enabled > bits. > > They allow us to query the contents of the Bloom filter in an > > implementation agnostic way. > > As you say naming is hard. The above is a great example and a good > exercise I've gone through at work and in other FOSS projects: "Producers > iterate over collections of the object...". In general when I see or write > a Javadoc of the form "Foo bars" or "Runners walk" or "Walkers run", you > get the idea ;-) I know that either the class (or method) name is bad or > the Javadoc/documentation is bad; not _wrong_, just bad in the sense that > it's confusing (to me). > > I am not advocating for a specific change ATM but I want to discuss the > option because it is possible the current name is not as good as it could > be. It could end up as an acceptable compromise if we cannot use more Java > friendly terms though. > > Whenever I see a class that implements a "forEach"-kind of method, I think > "Iterable". > Here we should think "Collection", or generally more than 1. In the Java sense an Iterable is something you can walk through to the end, possibly removing elements as you go using the Iterator interface. We would not require supporting removal, and we want to control a short-circuit. We could make this distinction by including it in the name: forEachUntil(Predicate ...), forEachUnless, ... > > Note the difference with "Iterator", and I had to lookup the difference > since the former implements "forEach" and the later "forEachRemaining"! > "Iterable" is also a factory of "Iterator"s. > > Should the Producers ever be implementations of Iterable or Iterator? > Right now, the answer is no because of the short-circuit aspect of using a > predicate. I'm not using the term fail-fast here because I don't think of > the iteration being in error (please tell me if I'm wrong). > > If not iterable, then we should not use that name as part of the class > name. Generally, the short-circuit aspect of Producers do not make a bad > candidates for implementations of Iterable since it can throw (unchecked) > exceptions. Different for call sites granted, but I'm just mentioning it > for fun. > I already mentioned throwing runtime exceptions for the short-circuit functionality, and that it was ruled out on the basis of performance (given a lot of short-circuiting is expected) and convenience for the caller. I don't think we should go there. Design the API for the intended purpose, and not push it into a box that is easily recognisable. > > So maybe there's nothing to do. I just want to be clear about it. For > example, I think of "factory" and "producer" as synonyms but in this case, > this is not a traditional application of the factory pattern. > > As an aside I can see that Producers would not be Streams out of the box > because Stream#filter(Predicate) filters but does not short-circuit > iteration. While Stream#findAny() and #findFirst() don't fit the > short-circuit bill, we could implement a #findLast() in a Stream > implementation. What I do not know is if Streams otherwise fit the bill of > Bloom filters. > In the case of small loops with few instructions per loop, the overhead of Streams is significant. Unfortunately we don't have any performance tests for this library, but I wouldn't change to Streams without knowing it does not impact performance. Performance is a key feature of Bloom filters. Otherwise you can achieve some of their functionality with conventional collections. > > > > > Does that basically cover the confusion? If there are better terms, > let's > > hash them out now before I update the javadocs. > > > > As an aside, Cells and Bitmaps are referenced in the literature. For the > > most part the rest is made up out of whole cloth. So we could change > > "Producer" to something else but we would need a good name. > > We have a class called BitMap and methods that use "BitMap" in the same > but I think I am more comfortable with the term reuse now. > > The question that remains is must it be public? Since the Javadoc mentions > it is about indices and bit positions, could all these methods be moved to > the package-private IndexUtils? My concern is to reduce the public and > protected API surface we will have to support and keep for the lifetime of > the 4.x code base. > It was originally package-private and Claude requested it be made public for general re-use. We could use the name BitMaps, in keeping with the pluralisation of classes seen in the JDK when working with elements of a given type: Executors, Collections, etc. Alex