Hi,

On 2018-03-26 20:20:57 -0700, Peter Geoghegan wrote:
> From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <p...@bowt.ie>
> Date: Thu, 24 Aug 2017 20:58:21 -0700
> Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
>
> A Bloom filter is a space-efficient, probabilistic data structure that
> can be used to test set membership.  Callers will sometimes incur false
> positives, but never false negatives.  The rate of false positives is a
> function of the total number of elements and the amount of memory
> available for the Bloom filter.
>
> Two classic applications of Bloom filters are cache filtering, and data
> synchronization testing.  Any user of Bloom filters must accept the
> possibility of false positives as a cost worth paying for the benefit in
> space efficiency.
>
> This commit adds a test harness extension module, test_bloomfilter.  It
> can be used to get a sense of how the Bloom filter implementation
> performs under varying conditions.
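For anyone following along, the structure the commit message describes
boils down to roughly the toy sketch below. All names here are made up;
this illustrates the concept only, and is not the patch's code:

/*
 * Toy Bloom filter sketch, for illustration only.  k hash functions each
 * map an element to one bit in a power-of-two-sized bitset; an element is
 * "possibly present" iff all k of its bits are set.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_K	3			/* number of hash functions */

typedef struct sketch_filter
{
	uint64_t	nbits;			/* bitset size in bits; a power of two */
	uint8_t	   *bitset;
} sketch_filter;

/* seeded FNV-1a; toy quality, stands in for a real hash function */
static uint32_t
sketch_hash(const unsigned char *elem, size_t len, uint32_t seed)
{
	uint32_t	h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++)
	{
		h ^= elem[i];
		h *= 16777619u;
	}
	return h;
}

static void
sketch_add(sketch_filter *f, const unsigned char *elem, size_t len)
{
	for (uint32_t i = 0; i < SKETCH_K; i++)
	{
		uint64_t	bit = sketch_hash(elem, len, i) & (f->nbits - 1);

		f->bitset[bit >> 3] |= (uint8_t) (1 << (bit & 7));
	}
}

/* true means "definitely not in set"; false means "possibly in set" */
static bool
sketch_lacks(const sketch_filter *f, const unsigned char *elem, size_t len)
{
	for (uint32_t i = 0; i < SKETCH_K; i++)
	{
		uint64_t	bit = sketch_hash(elem, len, i) & (f->nbits - 1);

		if ((f->bitset[bit >> 3] & (1 << (bit & 7))) == 0)
			return true;
	}
	return false;
}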
Maybe add a short paragraph explaining what this'll be used for soon.

> @@ -12,7 +12,7 @@ subdir = src/backend/lib
>  top_builddir = ../../..
>  include $(top_builddir)/src/Makefile.global
>
> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
> -	knapsack.o pairingheap.o rbtree.o stringinfo.o
> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
> +	ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o

*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per line. That would make
things like this easier to view, and conflicts easier to resolve.

> --- /dev/null
> +++ b/src/backend/lib/bloomfilter.c
> @@ -0,0 +1,304 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.c
> + *		Space-efficient set membership testing
> + *
> + * A Bloom filter is a probabilistic data structure that is used to test an
> + * element's membership of a set.

s/of/in/?

> False positives are possible, but false
> + * negatives are not; a test of membership of the set returns either "possibly
> + * in set" or "definitely not in set".  This can be very space efficient when
> + * individual elements are larger than a few bytes, because elements are hashed
> + * in order to set bits in the Bloom filter bitset.

The second half of this paragraph isn't easy to understand.

> + * Elements can be added to the set, but not removed.  The more elements that
> + * are added, the larger the probability of false positives.  Caller must hint
> + * an estimated total size of the set when its Bloom filter is initialized.
> + * This is used to balance the use of memory against the final false positive
> + * rate.

s/its Bloom/the Bloom/?

> + * The implementation is well suited to data synchronization problems between
> + * unordered sets, especially where predictable performance is important and
> + * some false positives are acceptable.

I'm not finding "data synchronization" very descriptive. It makes me think
of multi-threaded races and such.

> +/*
> + * Create Bloom filter in caller's memory context.  This should get a false
> + * positive rate of between 1% and 2% when bitset is not constrained by memory.

s/should/aims at/?

> + * total_elems is an estimate of the final size of the set.  It ought to be
> + * approximately correct, but we can cope well with it being off by perhaps a
> + * factor of five or more.  See "Bloom Filters in Probabilistic Verification"
> + * (Dillinger & Manolios, 2004) for details of why this is the case.

I'd simplify the language here. Replace "ought" with "should" at the very
least, and "we" with "the bloom filter" or similar?

> + * bloom_work_mem is sized in KB, in line with the general work_mem convention.
> + * This determines the size of the underlying bitset (trivial bookkeeping space
> + * isn't counted).  The bitset is always sized as a power-of-two number of
> + * bits, and the largest possible bitset is 512MB.  The implementation rounds
> + * down as needed.

"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?

> +bloom_filter *
> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
> +{
> +	bloom_filter *filter;
> +	int			bloom_power;
> +	uint64		bitset_bytes;
> +	uint64		bitset_bits;
> +
> +	/*
> +	 * Aim for two bytes per element; this is sufficient to get a false
> +	 * positive rate below 1%, independent of the size of the bitset or total
> +	 * number of elements.  Also, if rounding down the size of the bitset to
> +	 * the next lowest power of two turns out to be a significant drop, the
> +	 * false positive rate still won't exceed 2% in almost all cases.
> +	 */
> +	bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
> +	/* Minimum allowable size is 1MB */
> +	bitset_bytes = Max(1024L * 1024L, bitset_bytes);

Some upcasting might be advisable, to avoid dangers of overflows?
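Something along these lines is what I mean (untested sketch, reusing the
patch's variable names):

	/*
	 * Untested sketch: do the arithmetic in 64 bits from the start, so
	 * that a large bloom_work_mem (an int, in KB) can't overflow "long",
	 * which is only 32 bits wide on some platforms (e.g. 64-bit Windows).
	 */
	bitset_bytes = Min((uint64) bloom_work_mem * UINT64CONST(1024),
					   (uint64) total_elems * 2);
	/* Minimum allowable size is 1MB */
	bitset_bytes = Max(UINT64CONST(1024) * 1024, bitset_bytes);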
> +/*
> + * Generate k hash values for element.
> + *
> + * Caller passes array, which is filled-in with k values determined by hashing
> + * caller's element.
> + *
> + * Only 2 real independent hash functions are actually used to support an
> + * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
> + * used to make this work.  The main reason we prefer enhanced double hashing
> + * to classic double hashing is that the latter has an issue with collisions
> + * when using power-of-two sized bitsets.  See Dillinger & Manolios for full
> + * details.
> + */
> +static void
> +k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
> +{
> +	uint64		hash;
> +	uint32		x, y;
> +	uint64		m;
> +	int			i;
> +
> +	/* Use 64-bit hashing to get two independent 32-bit hashes */
> +	hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));

Hm. Is that smart given how some hash functions are defined? E.g. for int8
the higher bits aren't really that independent for small values:

Datum
hashint8(PG_FUNCTION_ARGS)
{
	/*
	 * The idea here is to produce a hash value compatible with the values
	 * produced by hashint4 and hashint2 for logically equal inputs; this is
	 * necessary to support cross-type hash joins across these input types.
	 * Since all three types are signed, we can xor the high half of the int8
	 * value if the sign is positive, or the complement of the high half when
	 * the sign is negative.
	 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32(lohalf);
}

Datum
hashint8extended(PG_FUNCTION_ARGS)
{
	/* Same approach as hashint8 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32_extended(lohalf, PG_GETARG_INT64(1));
}
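That concern matters doubly here, because enhanced double hashing derives
all k probe positions from just those two 32-bit halves, so any correlation
between them hurts every derived hash. A simplified sketch of the scheme
(mine, not the patch's exact code):

/*
 * Simplified sketch of enhanced double hashing (Dillinger & Manolios).
 * Every one of the k probe positions is a function of just x and y, the
 * two 32-bit halves of a single 64-bit hash.  m (the bitset size in bits)
 * is assumed to be a power of two, so AND implements the modulo.
 */
static void
k_hashes_sketch(uint32 *hashes, int k, uint32 x, uint32 y, uint64 m)
{
	int			i;

	x = (uint32) (x & (m - 1));
	y = (uint32) (y & (m - 1));

	hashes[0] = x;
	for (i = 1; i < k; i++)
	{
		x = (uint32) ((x + y) & (m - 1));	/* classic double hashing step */
		y = (uint32) ((y + i) & (m - 1));	/* the "enhanced" increment */
		hashes[i] = x;
	}
}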
> +/*
> + * Calculate "val MOD m" inexpensively.
> + *
> + * Assumes that m (which is bitset size) is a power-of-two.
> + *
> + * Using a power-of-two number of bits for bitset size allows us to use bitwise
> + * AND operations to calculate the modulo of a hash value.  It's also a simple
> + * way of avoiding the modulo bias effect.
> + */
> +static inline uint32
> +mod_m(uint32 val, uint64 m)
> +{
> +	Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
> +	Assert(((m - 1) & m) == 0);
> +
> +	return val & (m - 1);
> +}

What's up with the two different widths here?

> @@ -0,0 +1,27 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.h
> + *		Space-efficient set membership testing
> + *
> + * Copyright (c) 2018, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *		src/include/lib/bloomfilter.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef _BLOOMFILTER_H_
> +#define _BLOOMFILTER_H_

Names starting with an underscore followed by an uppercase letter are
reserved. Yes, we have some already. No, we shouldn't introduce further
ones (plain BLOOMFILTER_H would do).

> From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <p...@bowt.ie>
> Date: Tue, 2 May 2017 00:19:24 -0700
> Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
>
> Add a new, optional capability to bt_index_check() and
> bt_index_parent_check(): callers can check that each heap tuple that
> ought to have an index entry does in fact have one. This happens at the
> end of the existing verification checks.

And here we get back to why I thought last year that this interface is
bad. Now this really can't properly be described as a pure index check
anymore, and we need to add the same functionality to multiple functions.

> +--
> +-- bt_index_check()
> +--
> +DROP FUNCTION bt_index_check(regclass);
> +CREATE FUNCTION bt_index_check(index regclass,
> +    heapallindexed boolean DEFAULT false)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT PARALLEL RESTRICTED;

This breaks functions et al. referencing the existing function. Also, I
don't quite recall the rules: don't we have to drop the function from the
extension first?

>  /*
> - * bt_index_check(index regclass)
> + * bt_index_check(index regclass, heapallindexed boolean)
>  *
>  * Verify integrity of B-Tree index.
>  *
>  * Acquires AccessShareLock on heap & index relations. Does not consider
> - * invariants that exist between parent/child pages.
> + * invariants that exist between parent/child pages.  Optionally verifies
> + * that heap does not contain any unindexed or incorrectly indexed tuples.
>  */
>  Datum
>  bt_index_check(PG_FUNCTION_ARGS)
>  {
>  	Oid			indrelid = PG_GETARG_OID(0);
> +	bool		heapallindexed = false;
>
> -	bt_index_check_internal(indrelid, false);
> +	if (PG_NARGS() == 2)
> +		heapallindexed = PG_GETARG_BOOL(1);
> +
> +	bt_index_check_internal(indrelid, false, heapallindexed);
>
>  	PG_RETURN_VOID();
>  }

Given the PG_NARGS() checks I don't understand why you don't just create a
separate two-argument SQL function above? If you rely on the default value
you don't need it anyway?
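I.e. something like this ought to suffice (untested sketch; it assumes the
old one-argument signature really is gone and the SQL-level DEFAULT always
supplies the second argument):

Datum
bt_index_check(PG_FUNCTION_ARGS)
{
	Oid			indrelid = PG_GETARG_OID(0);
	bool		heapallindexed = PG_GETARG_BOOL(1);

	/* the DEFAULT guarantees two arguments, so no PG_NARGS() test needed */
	bt_index_check_internal(indrelid, false, heapallindexed);

	PG_RETURN_VOID();
}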
> +	/*
> +	 * * Heap contains unindexed/malformed tuples check *
> +	 */

I'd reorder this to "Check whether heap contains ...".

> +	if (state->heapallindexed)
> +	{
> +		IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
> +		HeapScanDesc scan;
> +
> +		/*
> +		 * Create our own scan for IndexBuildHeapScan(), like a parallel index
> +		 * build.  We do things this way because it lets us use the MVCC
> +		 * snapshot we acquired before index fingerprinting began in the
> +		 * !readonly case.
> +		 */

I'd shorten the middle part out, so it's "IndexBuildHeapScan(), so we can
register an MVCC snapshot acquired before..."

> +		scan = heap_beginscan_strat(state->heaprel, /* relation */
> +									snapshot,	/* snapshot */
> +									0,	/* number of keys */
> +									NULL,	/* scan key */
> +									true,	/* buffer access strategy OK */
> +									true);	/* syncscan OK? */
> +
> +		/*
> +		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
> +		 * behaves when only AccessShareLock held.  This is really only needed
> +		 * to prevent confusion within IndexBuildHeapScan() about how to
> +		 * interpret the state we pass.
> +		 */
> +		indexinfo->ii_Concurrent = !state->readonly;

That's not very descriptive.

> +			/* Fingerprint leaf page tuples (those that point to the heap) */
> +			if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
> +				bloom_add_element(state->filter, (unsigned char *) itup,
> +								  IndexTupleSize(itup));

So, if we didn't use IndexBuildHeapScan(), we could also check whether dead
entries at least have a corresponding item on the page, right? I'm not
asking you to change it, mostly curious.

> +/*
> + * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
> + * all the entries that definitely should have been observed in leaf pages of
> + * the target index (that is, all IndexTuples that were fingerprinted by our
> + * Bloom filter).  All heapallindexed checks occur here.

The last sentence isn't entirely true ;), given that we check for the bloom
insertion above. s/checks/verification/?

We should be able to get this into v11...

Greetings,

Andres Freund