Hi Gary,

I am in favour of using nomenclature and patterns that will be familiar to
a Java developer. But only if they match the familiar JDK use patterns. The
Bloom filter package has some atypical use patterns that have driven the
current API to where it is. I'll try and describe these below.

On Sun, 28 Apr 2024 at 14:16, Gary Gregory <garydgreg...@gmail.com> wrote:

> Hi Clause, Albert, and all,
>
> Since the introduction of lambdas in Java 8, Java has a well-defined
> terminology around the classic producer-consumer paradigm but (for
> reasons unknown to me) realized in the functional interfaces *Supplier
> and *Consumer. In addition, as of Java 5, we have the Iterable
> interface.
>
> In our new Bloom filter package we have new interfaces called
> *Producer (as opposed to *Supplier), where some of these new
> interfaces are formally annotated with @FunctionalInterface and some
> not (for example, BloomFilterProducer).
>
> My question is: Why call these "Producers" instead of "Suppliers"? Is
> the formal Bloom filter literature tied to the "Producer" terminology
> in a way that would make adapting to the Java term confusing? I know I
> brought up a similar topic recently, but I would like to revisit it
> now that I've started to read Claude's blog drafts. Even without
> making the current "Producers" formal suppliers by extending Supplier,
> would it be worth using the Java terminology?
>

Claude is familiar with the literature and can comment on that. I would
defer to the literature if it is a common term.

There is one notable distinction to JDK suppliers. Suppliers only supply 1
element and must be repeatedly called to generate more. The Producers in
the BloomFilter package will supply multiple values. They are invoked using
a forEach pattern with the intention of supplying all the elements to a
predicate, not a consumer. If any of those elements is rejected by the
predicate then the rest of the elements are not supplied. So this is a
fail-fast bulk supplier.


>
> My second observation is that some might neither be "Producers" or
> "Suppliers" but instead be extensions of Iterable. For example,
> BitMapProducer is not a factory for instances of BitMap; the BitMap
> does not appear in the signatures of BitMapProducer methods. From a
> strict Java POV, this is (slightly) perplexing.
>

Iterable was suggested in an earlier API, particular for the IndexProducer.
IIRC it was rejected on the basis of simplifying the code for the caller in
the fail-fast case. Otherwise every user of the iterator must implement
fail-fast loops over the elements. There may have been other reasons so it
could be worth a check in the mailing list archives. It would require going
back a few years but it was discussed on the dev list.

The term BitMap refers to a long that holds 64-consecutive indices as
either present or absent. You can consider the sequential bitmaps
containing all indices from [0, n) as the serialized state of a Bloom
filter with n bits. This is essentially a BitSet as you can see from the
SimpleBloomFilter implementation. This originally wrapped a BitSet; it was
converted to directly implement the required read/write bit functionality
on the grounds of performance (no memory reallocation; no index checks).

We do not have a BitMap class since we use a long primitive. A rename would
be to LongProducer causing a name clash with the JDK. Renaming to something
else is possible but I believe BitMap is a term from the literature.


>
> Instead (forgetting the class name issue for now), we could have:
>
> @FunctionalInterface
> public interface BitMapProducer extends Iterable<LongPredicate> {...}
>
> Which would let implementations define:
>
> Iterator<LongPredicate> iterator();
>
> Instead of:
>
> boolean forEachBitMap(LongPredicate predicate);
>

The BitMapProducer is not iterating LongPredicates. It is iterating longs
to be accepted by a single LongPredicate. The boolean return allows
signalling to stop the forEach loop. There is no primitive specialisation
of Iterator for long. There is a Spliterator.OfLong but that bundles some
other API that we do not wish to support, namely parallel streaming via
split and the ability to advance element by element (tryAdvance). Currently
we only implement the equivalent of the forEachRemaining pattern from
Spliterator. That accepts a consumer and so fail-fast would be done via
raising a runtime exception. Given that fail-fast is a key feature of a
Bloom filter then we do not want this to be implemented via exceptions.

The primary use case for fail-fast is to stop as soon as a bit index is
found, or not found (case dependent). Consider a Bloom filter that has 20
indices per hashed item. You have populated the filter with items, each has
20 random indices. You then check if a new item is not contained in the
filter by creating indices for the new item with your hash function and
checking each index against those already in the filter. If your new
element has an index not in the filter, then you have not seen this element
before. This process can be done by creating all the indices by hashing as
the first step, then comparing each to the filter. Or you can generate
indices lazily, only creating the next index if the current one is present
in the filter. This requires fail-fast index iteration and dynamic index
generation by the hasher. You can walk through this looking at
the EnhancedDoubleHasher. This is what takes a byte[] (capturing
information on the item) and creates indices. But it only creates indices
until it is triggered to stop.

Note that the filter can only tell you with 100% confidence that an item
has not been observed. It cannot tell you with 100% confidence it has been
observed, as the random indices created by hashing the item may all be
present in the collective indices from all items added to the filter. The
Shape class has the standard computations to create a filter with enough
bits to control the false-positive rate for the expected number of items
you wish to store.


>
> Same comment for IndexProducer.
> Same comment for BloomFilterProducer.
> Is this too much Java-ness?
>
> CellConsumer looks like a Predicate, not a traditional Java *Consumer.
> We have a specialization called LongBiPredicate so I propose we rename
> and extract CellConsumer as IntBiPredicate.
>

Cell is a term from the literature. A cell is an index and a count. So
although you are accepting two int values, they are actually only one cell.
A rename to CellPredicate would be more appropriate to imitate
Predicate<Cell> where the Cell is a pair of int primitives.

In contrast the LongBiPredicate is used to accept two BitMaps that have
been paired by iterating over two filters with the same shape. This is not
an extensible design as you can only iterate over a pair of filters, and
not n filters of the same shape. The use case is in standard set operations
such as union and intersect which are used to generate metrics used in the
literature (see the SetOperations class). This could be renamed to
BitMapBiPredicate. However I think the current name was chosen to be in
keeping with the JDK j.u.function package.

Hope that helps set some background.

Alex



>
> TY!
> Gary
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

Reply via email to