On Thu, Sep 28, 2017 at 3:32 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>> I have signed up as a reviewer of this patch, and I have looked at the
>> bloom filter implementation for now. This is the kind of facility that
>> people have asked for on this list for many years.
>>
>> One first thing striking me is that there is no test for this
>> implementation, which would be a base stone for other things; it would
>> be nice to validate that things are working properly before moving on
>> with 0002, and 0001 is a feature on its own. I don't think that it
>> would be complicated to have a small module in src/test/modules which
>> plugs in a couple of SQL functions on top of bloomfilter.h.
>
> 0001 is just a library utility. None of the backend lib utilities
> (like HyperLogLog, discrete knapsack, etc.) have dedicated test
> frameworks. Coding up an SQL interface for these things is a
> non-trivial project in itself.
>
> I have tested the implementation myself, but that was something that
> used C code.
>
> Can you tell me what you have in mind here in detail? For example,
> should there be a custom datatype that encapsulates the current state
> of the bloom filter? Or should there be an aggregate function that
> takes a containment argument which is tested at the end?
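Neither needs to be fancy. To make it concrete, the C side of such a
module could look roughly like the sketch below. Every bloom_* name and
signature in it is an assumption about what 0001 exposes (I am only
going from the snippets quoted in this thread), so please take it as a
sketch and nothing more:

/* Hypothetical src/test/modules/test_bloomfilter/test_bloomfilter.c */
#include "postgres.h"

#include "fmgr.h"
#include "lib/bloomfilter.h"	/* assumed header name from 0001 */
#include "miscadmin.h"			/* for work_mem */
#include "utils/memutils.h"		/* for TopMemoryContext */

PG_MODULE_MAGIC;

/* One filter per backend, long-lived so it survives across statements */
static bloom_filter *test_filter = NULL;

PG_FUNCTION_INFO_V1(test_bloom_init);
PG_FUNCTION_INFO_V1(test_bloom_add);
PG_FUNCTION_INFO_V1(test_bloom_contains);
PG_FUNCTION_INFO_V1(test_bloom_free);

Datum
test_bloom_init(PG_FUNCTION_ARGS)
{
	int64		total_elems = PG_GETARG_INT64(0);
	MemoryContext oldcxt;

	if (test_filter != NULL)
		elog(ERROR, "bloom filter already initialized");

	/* Allocate in TopMemoryContext so the filter outlives the query */
	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
	/* assumed signature: estimated element count, memory budget, seed */
	test_filter = bloom_init(total_elems, work_mem, 0);
	MemoryContextSwitchTo(oldcxt);

	PG_RETURN_VOID();
}

Datum
test_bloom_add(PG_FUNCTION_ARGS)
{
	int64		elem = PG_GETARG_INT64(0);

	bloom_add_element(test_filter, (unsigned char *) &elem, sizeof(elem));
	PG_RETURN_VOID();
}

Datum
test_bloom_contains(PG_FUNCTION_ARGS)
{
	int64		elem = PG_GETARG_INT64(0);

	/* bloom_query() is a stand-in name for whatever the membership test is */
	PG_RETURN_BOOL(bloom_query(test_filter, (unsigned char *) &elem,
							   sizeof(elem)));
}

Datum
test_bloom_free(PG_FUNCTION_ARGS)
{
	/* bloom_free() is again an assumed name */
	if (test_filter != NULL)
		bloom_free(test_filter);
	test_filter = NULL;
	PG_RETURN_VOID();
}

The SQL side would just be the usual CREATE FUNCTION ... LANGUAGE C
declarations, plus a small regression script that checks for false
negatives (there must be none) and eyeballs the false positive rate.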
That could be something very simple:
- A function to initialize a bloom filter in a session context, taking
  an expected number of elements as input, working for example on
  integers.
- A function to add an integer element to it.
- A function to query whether an integer may be in the filter or not.
- A function to free it.
The point is just to stress this code; I don't think it is that large a
project.

>> +#define MAX_HASH_FUNCS 10
>> Being able to define the number of hash functions used at
>> initialization would be nicer. Usually this number is kept way lower
>> than the number of elements to check as part of a set, but I see no
>> reason not to allow people to play with this API in a more extended
>> way. You can then let your module decide what it wants to use.
>
> The number of hash functions used is itself a function of the amount
> of memory available, and an estimate of the overall size of the set
> made ahead of time (in the case of amcheck, this is
> pg_class.reltuples). The typical interface is that the caller either
> specifies the amount of memory, or the required false positive rate
> (either way, they must also provide that estimate).

+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
The estimate is from wikipedia-sensei (a quick sketch of the formula is
further down in this mail):
https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
Being able to enforce that number would be nice, though perhaps not
mandatory.

> The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we
> never actually use more than that many hash functions in practice,
> given the limitations on the total amount of memory you can use
> (128MB). The formula for determining the optimum number of hash
> functions is pretty standard stuff.

Hm... OK. That could be a default... Not really convinced though.

>> +/*
>> + * Hash function is taken from sdbm, a public-domain reimplementation of the
>> + * ndbm database library.
>> + */
>> Reference link?
>
> http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf

That would be nicer if added in the code :)

> I'm probably going to end up using murmurhash32() instead of the sdbm
> hash function anyway, now that Andres has exposed it in a header file.
> This probably won't matter in the next version.

Yeah, that may be a good idea in the end.

>> + bitset_bytes = bitset_bits / BITS_PER_BYTE;
>> + filter->bitset = palloc0(bitset_bytes);
>> + filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
>> + filter->seed = seed;
>> I think that doing the allocation within the initialization phase is a
>> mistake. Being able to use a DSA would be nice, and I predict cases
>> where a module may want a bloom filter that persists across multiple
>> transactions, so the allocation should be able to live across memory
>> contexts.
>
> Why not just switch memory context before calling? Again, other
> comparable utilities don't provide this in their interface.
>
> As for DSM, I think that that can come later, and can be written by
> somebody closer to that problem. There can be more than one
> initialization function.

I don't completely disagree with that: there could be multiple
initialization functions. Still, an advantage of designing things right
from the beginning, with a correct set of APIs, is that we don't need to
bolt on extra things later that would then bother module maintainers. I
would think that this utility interface should be minimal and portable
so that it stays maintainable over the long term.
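Going back to optimal_k() for a moment: the Wikipedia formula referred
to above is just k = (m / n) * ln 2, with m the number of bits in the
filter and n the expected number of elements. A sketch of what that
boils down to (the actual rounding and clamping in the patch may
differ):

#include "postgres.h"	/* for Max()/Min() and the integer typedefs */
#include <math.h>

#define MAX_HASH_FUNCS 10

/* k = (m / n) * ln(2), rounded to the nearest integer and clamped */
static int
optimal_k_sketch(uint64 bitset_bits, int64 total_elems)
{
	int		k = (int) rint((double) bitset_bits / total_elems *
						   0.693147180559945);

	return Max(1, Min(k, MAX_HASH_FUNCS));
}

For example, the classic 1% false-positive sizing of roughly 9.6 bits
per element gives k = 7 with this formula, so a cap of 10 would only be
hit with fairly generous bits-per-element ratios.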
>> What I think you should do instead to
>> make this bloom implementation more modular is to let the caller give
>> a pointer to a memory area as well as its size. Then what bloom_init
>> should do is to just initialize this area of memory with zeros. This
>> approach would give a lot of freedom. Not linking a bloom definition
>> to work_mem would be nice as well.
>
> The implementation is probably always going to be bound in size by
> work_mem in practice, like tuplesort and tuplestore. I would say that
> that's a natural fit.

Hm...

>> + hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
>> I am wondering if it would make sense to be able to enforce the hash
>> function being used. The default does not look bad to me though, so we
>> could live with that.
>
> I prefer to keep things simple. I'm not aware of any use case that
> calls for the user to use a custom hash function. That said, I could
> believe that someone would want to use their own hash value for each
> bloom_add_element(), when they have one close at hand anyway -- much
> like addHyperLogLog(). Again, that seems like work for the ultimate
> consumer of that functionality. It's a trivial tweak that can happen
> later.

Yeah, perhaps most people would be satisfied with having only a default.
--
Michael


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers