Hi,

On 2018-03-26 20:20:57 -0700, Peter Geoghegan wrote:
> From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <p...@bowt.ie>
> Date: Thu, 24 Aug 2017 20:58:21 -0700
> Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
>
> A Bloom filter is a space-efficient, probabilistic data structure that
> can be used to test set membership.  Callers will sometimes incur false
> positives, but never false negatives.  The rate of false positives is a
> function of the total number of elements and the amount of memory
> available for the Bloom filter.
>
> Two classic applications of Bloom filters are cache filtering, and data
> synchronization testing.  Any user of Bloom filters must accept the
> possibility of false positives as a cost worth paying for the benefit in
> space efficiency.
>
> This commit adds a test harness extension module, test_bloomfilter.  It
> can be used to get a sense of how the Bloom filter implementation
> performs under varying conditions.
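For anyone following along, the structure the commit message describes
boils down to roughly the toy sketch below. All names here are made up;
this illustrates the concept only, and is not the patch's code:

/*
 * Toy Bloom filter sketch, for illustration only.  k hash functions each
 * map an element to one bit in a power-of-two-sized bitset; an element is
 * "possibly present" iff all k of its bits are set.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_K	3			/* number of hash functions */

typedef struct sketch_filter
{
	uint64_t	nbits;			/* bitset size in bits; a power of two */
	uint8_t	   *bitset;
} sketch_filter;

/* seeded FNV-1a; toy quality, stands in for a real hash function */
static uint32_t
sketch_hash(const unsigned char *elem, size_t len, uint32_t seed)
{
	uint32_t	h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++)
	{
		h ^= elem[i];
		h *= 16777619u;
	}
	return h;
}

static void
sketch_add(sketch_filter *f, const unsigned char *elem, size_t len)
{
	for (uint32_t i = 0; i < SKETCH_K; i++)
	{
		uint64_t	bit = sketch_hash(elem, len, i) & (f->nbits - 1);

		f->bitset[bit >> 3] |= (uint8_t) (1 << (bit & 7));
	}
}

/* true means "definitely not in set"; false means "possibly in set" */
static bool
sketch_lacks(const sketch_filter *f, const unsigned char *elem, size_t len)
{
	for (uint32_t i = 0; i < SKETCH_K; i++)
	{
		uint64_t	bit = sketch_hash(elem, len, i) & (f->nbits - 1);

		if ((f->bitset[bit >> 3] & (1 << (bit & 7))) == 0)
			return true;
	}
	return false;
}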
Maybe add a short paragraph explaining what this'll be used for soon.

> @@ -12,7 +12,7 @@ subdir = src/backend/lib
>  top_builddir = ../../..
>  include $(top_builddir)/src/Makefile.global
>
> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
> -	knapsack.o pairingheap.o rbtree.o stringinfo.o
> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
> +	ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o

*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per line. That would make
things like this easier to view, and conflicts easier to resolve.

> --- /dev/null
> +++ b/src/backend/lib/bloomfilter.c
> @@ -0,0 +1,304 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.c
> + *		Space-efficient set membership testing
> + *
> + * A Bloom filter is a probabilistic data structure that is used to test an
> + * element's membership of a set.

s/of/in/?

> False positives are possible, but false
> + * negatives are not; a test of membership of the set returns either "possibly
> + * in set" or "definitely not in set".  This can be very space efficient when
> + * individual elements are larger than a few bytes, because elements are hashed
> + * in order to set bits in the Bloom filter bitset.

The second half of this paragraph isn't easy to understand.

> + * Elements can be added to the set, but not removed.  The more elements that
> + * are added, the larger the probability of false positives.  Caller must hint
> + * an estimated total size of the set when its Bloom filter is initialized.
> + * This is used to balance the use of memory against the final false positive
> + * rate.

s/its Bloom/the Bloom/?

> + * The implementation is well suited to data synchronization problems between
> + * unordered sets, especially where predictable performance is important and
> + * some false positives are acceptable.

I'm not finding "data synchronization" very descriptive. It makes me think
of multi-threaded races and such.

> +/*
> + * Create Bloom filter in caller's memory context.  This should get a false
> + * positive rate of between 1% and 2% when bitset is not constrained by memory.

s/should/aims at/?

> + * total_elems is an estimate of the final size of the set.  It ought to be
> + * approximately correct, but we can cope well with it being off by perhaps a
> + * factor of five or more.  See "Bloom Filters in Probabilistic Verification"
> + * (Dillinger & Manolios, 2004) for details of why this is the case.

I'd simplify the language here. Replace "ought" with "should" at the very
least, and "we" with "the bloom filter" or similar?

> + * bloom_work_mem is sized in KB, in line with the general work_mem convention.
> + * This determines the size of the underlying bitset (trivial bookkeeping space
> + * isn't counted).  The bitset is always sized as a power-of-two number of
> + * bits, and the largest possible bitset is 512MB.  The implementation rounds
> + * down as needed.

"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?

> +bloom_filter *
> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
> +{
> +	bloom_filter *filter;
> +	int			bloom_power;
> +	uint64		bitset_bytes;
> +	uint64		bitset_bits;
> +
> +	/*
> +	 * Aim for two bytes per element; this is sufficient to get a false
> +	 * positive rate below 1%, independent of the size of the bitset or total
> +	 * number of elements.  Also, if rounding down the size of the bitset to
> +	 * the next lowest power of two turns out to be a significant drop, the
> +	 * false positive rate still won't exceed 2% in almost all cases.
> +	 */
> +	bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
> +	/* Minimum allowable size is 1MB */
> +	bitset_bytes = Max(1024L * 1024L, bitset_bytes);

Some upcasting might be advisable, to avoid dangers of overflows?
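Something along these lines is what I mean (untested sketch, reusing the
patch's variable names):

	/*
	 * Untested sketch: do the arithmetic in 64 bits from the start, so
	 * that a large bloom_work_mem (an int, in KB) can't overflow "long",
	 * which is only 32 bits wide on some platforms (e.g. 64-bit Windows).
	 */
	bitset_bytes = Min((uint64) bloom_work_mem * UINT64CONST(1024),
					   (uint64) total_elems * 2);
	/* Minimum allowable size is 1MB */
	bitset_bytes = Max(UINT64CONST(1024) * 1024, bitset_bytes);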
> +/*
> + * Generate k hash values for element.
> + *
> + * Caller passes array, which is filled-in with k values determined by hashing
> + * caller's element.
> + *
> + * Only 2 real independent hash functions are actually used to support an
> + * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
> + * used to make this work.  The main reason we prefer enhanced double hashing
> + * to classic double hashing is that the latter has an issue with collisions
> + * when using power-of-two sized bitsets.  See Dillinger & Manolios for full
> + * details.
> + */
> +static void
> +k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
> +{
> +	uint64		hash;
> +	uint32		x, y;
> +	uint64		m;
> +	int			i;
> +
> +	/* Use 64-bit hashing to get two independent 32-bit hashes */
> +	hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));

Hm. Is that smart given how some hash functions are defined? E.g. for int8
the higher bits aren't really that independent for small values:

Datum
hashint8(PG_FUNCTION_ARGS)
{
	/*
	 * The idea here is to produce a hash value compatible with the values
	 * produced by hashint4 and hashint2 for logically equal inputs; this is
	 * necessary to support cross-type hash joins across these input types.
	 * Since all three types are signed, we can xor the high half of the int8
	 * value if the sign is positive, or the complement of the high half when
	 * the sign is negative.
	 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32(lohalf);
}

Datum
hashint8extended(PG_FUNCTION_ARGS)
{
	/* Same approach as hashint8 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32_extended(lohalf, PG_GETARG_INT64(1));
}
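That concern matters doubly here, because enhanced double hashing derives
all k probe positions from just those two 32-bit halves, so any correlation
between them hurts every derived hash. A simplified sketch of the scheme
(mine, not the patch's exact code):

/*
 * Simplified sketch of enhanced double hashing (Dillinger & Manolios).
 * Every one of the k probe positions is a function of just x and y, the
 * two 32-bit halves of a single 64-bit hash.  m (the bitset size in bits)
 * is assumed to be a power of two, so AND implements the modulo.
 */
static void
k_hashes_sketch(uint32 *hashes, int k, uint32 x, uint32 y, uint64 m)
{
	int			i;

	x = (uint32) (x & (m - 1));
	y = (uint32) (y & (m - 1));

	hashes[0] = x;
	for (i = 1; i < k; i++)
	{
		x = (uint32) ((x + y) & (m - 1));	/* classic double hashing step */
		y = (uint32) ((y + i) & (m - 1));	/* the "enhanced" increment */
		hashes[i] = x;
	}
}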
> +/*
> + * Calculate "val MOD m" inexpensively.
> + *
> + * Assumes that m (which is bitset size) is a power-of-two.
> + *
> + * Using a power-of-two number of bits for bitset size allows us to use bitwise
> + * AND operations to calculate the modulo of a hash value.  It's also a simple
> + * way of avoiding the modulo bias effect.
> + */
> +static inline uint32
> +mod_m(uint32 val, uint64 m)
> +{
> +	Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
> +	Assert(((m - 1) & m) == 0);
> +
> +	return val & (m - 1);
> +}

What's up with the two different widths here?

> @@ -0,0 +1,27 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.h
> + *		Space-efficient set membership testing
> + *
> + * Copyright (c) 2018, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *		src/include/lib/bloomfilter.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef _BLOOMFILTER_H_
> +#define _BLOOMFILTER_H_

Names starting with an underscore followed by an uppercase letter are
reserved. Yes, we have some already. No, we shouldn't introduce further
ones (plain BLOOMFILTER_H would do).

> From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <p...@bowt.ie>
> Date: Tue, 2 May 2017 00:19:24 -0700
> Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
>
> Add a new, optional capability to bt_index_check() and
> bt_index_parent_check(): callers can check that each heap tuple that
> ought to have an index entry does in fact have one. This happens at the
> end of the existing verification checks.

And here we get back to why I thought last year that this interface is
bad. Now this really can't properly be described as a pure index check
anymore, and we need to add the same functionality to multiple functions.

> +--
> +-- bt_index_check()
> +--
> +DROP FUNCTION bt_index_check(regclass);
> +CREATE FUNCTION bt_index_check(index regclass,
> +    heapallindexed boolean DEFAULT false)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT PARALLEL RESTRICTED;

This breaks functions et al. referencing the existing function. Also, I
don't quite recall the rules: don't we have to drop the function from the
extension first?

>  /*
> - * bt_index_check(index regclass)
> + * bt_index_check(index regclass, heapallindexed boolean)
>  *
>  * Verify integrity of B-Tree index.
>  *
>  * Acquires AccessShareLock on heap & index relations. Does not consider
> - * invariants that exist between parent/child pages.
> + * invariants that exist between parent/child pages.  Optionally verifies
> + * that heap does not contain any unindexed or incorrectly indexed tuples.
>  */
>  Datum
>  bt_index_check(PG_FUNCTION_ARGS)
>  {
>  	Oid			indrelid = PG_GETARG_OID(0);
> +	bool		heapallindexed = false;
>
> -	bt_index_check_internal(indrelid, false);
> +	if (PG_NARGS() == 2)
> +		heapallindexed = PG_GETARG_BOOL(1);
> +
> +	bt_index_check_internal(indrelid, false, heapallindexed);
>
>  	PG_RETURN_VOID();
>  }

Given the PG_NARGS() checks I don't understand why you don't just create a
separate two-argument SQL function above? If you rely on the default value
you don't need it anyway?
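I.e. something like this ought to suffice (untested sketch; it assumes the
old one-argument signature really is gone and the SQL-level DEFAULT always
supplies the second argument):

Datum
bt_index_check(PG_FUNCTION_ARGS)
{
	Oid			indrelid = PG_GETARG_OID(0);
	bool		heapallindexed = PG_GETARG_BOOL(1);

	/* the DEFAULT guarantees two arguments, so no PG_NARGS() test needed */
	bt_index_check_internal(indrelid, false, heapallindexed);

	PG_RETURN_VOID();
}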
> +	/*
> +	 * * Heap contains unindexed/malformed tuples check *
> +	 */

I'd reorder this to "Check whether heap contains ...".

> +	if (state->heapallindexed)
> +	{
> +		IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
> +		HeapScanDesc scan;
> +
> +		/*
> +		 * Create our own scan for IndexBuildHeapScan(), like a parallel index
> +		 * build.  We do things this way because it lets us use the MVCC
> +		 * snapshot we acquired before index fingerprinting began in the
> +		 * !readonly case.
> +		 */

I'd shorten the middle part out, so it's "IndexBuildHeapScan(), so we can
register an MVCC snapshot acquired before..."

> +		scan = heap_beginscan_strat(state->heaprel, /* relation */
> +									snapshot,	/* snapshot */
> +									0,	/* number of keys */
> +									NULL,	/* scan key */
> +									true,	/* buffer access strategy OK */
> +									true);	/* syncscan OK? */
> +
> +		/*
> +		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
> +		 * behaves when only AccessShareLock held.  This is really only needed
> +		 * to prevent confusion within IndexBuildHeapScan() about how to
> +		 * interpret the state we pass.
> +		 */
> +		indexinfo->ii_Concurrent = !state->readonly;

That's not very descriptive.

> +			/* Fingerprint leaf page tuples (those that point to the heap) */
> +			if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
> +				bloom_add_element(state->filter, (unsigned char *) itup,
> +								  IndexTupleSize(itup));

So, if we didn't use IndexBuildHeapScan(), we could also check whether dead
entries at least have a corresponding item on the page, right? I'm not
asking you to change it, mostly curious.

> +/*
> + * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
> + * all the entries that definitely should have been observed in leaf pages of
> + * the target index (that is, all IndexTuples that were fingerprinted by our
> + * Bloom filter).  All heapallindexed checks occur here.

The last sentence isn't entirely true ;), given that we check for the bloom
insertion above. s/checks/verification/?

We should be able to get this into v11...

Greetings,

Andres Freund