Re: [HACKERS] amcheck (B-Tree integrity checking tool)

Thomas Munro Wed, 23 Nov 2016 14:48:06 -0800

On Fri, Nov 18, 2016 at 12:17 PM, Peter Geoghegan <p...@heroku.com> wrote:
> So, the Btree you show is supposed to be good? You've left it up to me
> to draw T2, the corrupt version? If so, here is a drawing of T2 for
> the sake of clarity:
>
>       Btree   =        [a|b]
>                        /   \
>                       /     \
>                      /       \
>                    [a]-------[b]
>                     |         |
>                     |         |
>                    [c]-------[d]


Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted.  Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

>> Now c and b have been switched around.  Won't amcheck still pass since
>> we only verify immediate downlinks and siblings?  Yet searches for b
>> will take a wrong turn at the root.  Do I have this right?  I wonder
>> if there is a way to verify that each page's high key < the 'next' key
>> for all ancestors.
>
> I don't think you have it right, because your argument is predicated
> on classic L&Y, not the Postgres variant. The high key is literally
> guaranteed to be *equal* to the initial first value of the right half
> of a split page post-split (at least initially), which is where the
> new downlink for the new right half comes from in turn, so it will be
> exactly equal as well. There is a comment which says all this within
> _bt_insert_parent(), actually.
>
> In general, I think it's pretty bad that we're so loose about
> representing the key space in downlinks compared to classic L&Y,
> because it probably makes index bloat a lot worse in internal pages,
> messing up the key space long term. As long as we're doing that,
> though, we clearly cannot usefully "verify that each page's high key <
> the 'next' key for all ancestors", because we deviate from L&Y in a
> way that makes that check always fail. It's okay that it always fails
> for us (it doesn't actually indicate corruption), because of our
> duplicate-aware index scan logic (change to invariant).
>
> In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
> cousin problem a non-issue, because we don't follow downlinks that are
> equal to our scankey -- we go to their immediate left and start
> *there*, in order to support duplicates in leaf pages. Index scans for
> 'b' work for both T1 and T2 in Postgres, because we refuse to follow
> the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
> would actually like to change things to make the invariant the classic
> L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
> suffix truncation in internal pages work.
>
> So, we don't have the cousin problem, but since I wish for us to adopt
> the stricter classic L&Y invariant at some point, you could say that I
> also wished that we had the cousin problem.

Thank you for your patient explanation!  Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached.  amcheck does detect the violation of the high key
invariant.

> Confused yet? :-)

Yes.  But it's getting better slowly :-)

-- 
Thomas Munro
http://www.enterprisedb.com

create extension if not exists amcheck;
create extension if not exists pageinspect;

drop table if exists my_table;
drop type if exists my_type;

-- define an enum ordered a, b, c, d, but do it in a funky way that results in
-- odd OIDs that we can later mess around with
create type my_type as enum ('d');
alter type my_type add value 'c' before 'd';
alter type my_type add value 'b' before 'c';
alter type my_type add value 'a' before 'b';

create table my_table (my_column my_type, counter serial);
create index my_index on my_table(my_column);

create or replace function pretty_print_btree_recurse(pageno int,
                                                      height int,
                                                      level int)
returns void language plpgsql as
$$
declare
  v_spaces text := rpad('', (height - level) * 2, ' ');
  v_ctid text;
  v_subpageno int;
  v_key_hex text;
  v_key_oid oid;
  v_key_label text;
  v_page_description text;
  v_item int;
begin
  if height = level then
    v_page_description := 'root';
  elsif level = 0 then
    v_page_description := 'leaf';
  else
    v_page_description := 'intl';
  end if;
  for v_ctid, v_key_hex, v_item in
  select ctid::text, data, itemoffset
  from bt_page_items('my_index', pageno)
  loop
    if v_ctid like '(4294967295,%' then
      continue;
    end if;
    v_subpageno := (v_ctid::point)[0];
    v_key_oid := ('x' || substring(v_key_hex, 1, 2))::bit(8)::int +
                 (('x' || substring(v_key_hex, 4, 2))::bit(8)::int << 8);
    v_key_label := coalesce((select enumlabel from pg_enum where oid = 
v_key_oid::oid), '-inf');
    if level = 0 then
      raise info '% % page % item %: [%] -> heap tuple', v_spaces, 
v_page_description, pageno, v_item - 1, v_key_label;
    else
      raise info '% % page % item %: [%] -> page %', v_spaces, 
v_page_description, pageno, v_item, v_key_label, v_subpageno;
    end if;
    if level > 0 then
      perform pretty_print_btree_recurse(v_subpageno, height, level - 1);
    end if;
  end loop;
end;
$$;

create or replace function pretty_print_btree() returns void language plpgsql as
$$
declare
  v_root int;
  v_height int;
begin
  select root, level into v_root, v_height from bt_metap('my_index');
  if v_root = 0 then
    raise info 'empty btree';
  else
    perform pretty_print_btree_recurse(v_root, v_height, v_height);
  end if;
end;
$$;

truncate my_table;
insert into my_table (my_column) select 'a' from generate_series(1, 64000);
insert into my_table (my_column) select 'c' from generate_series(1, 64000);
insert into my_table (my_column) select 'b' from generate_series(1, 64000);
insert into my_table (my_column) select 'd' from generate_series(1, 64000);

delete from my_table where my_column = 'a' and counter > (select min(counter) 
from my_table where my_column = 'a');

delete from my_table where my_column = 'b' and counter > (select min(counter) 
from my_table where my_column = 'b');

delete from my_table where my_column = 'c' and counter > (select min(counter) 
from my_table where my_column = 'c');

delete from my_table where my_column = 'd' and counter > (select min(counter) 
from my_table where my_column = 'd');

vacuum my_table;

-- now we have a btree that looks like this:
--     Btree   =        [a|c]
--                      /   \
--                     /     \
--                    /       \
--                  [a]-------[c]
--                   |         |
--                   |         |
--                  [b]-------[d]

select pretty_print_btree();


set seq_page_cost to 100;
set enable_bitmapscan to false;

-- we can find b by index scan
select * from my_table where my_column = 'b';

-- now mess with the sort order of the enum, as a cheap hack way to simulate
-- a collation order change affecting a text field.  this switches the order
-- of 'b' and 'c' around in the enum

update pg_enum set enumsortorder = 999 where enumtypid = 'my_type'::regtype and 
enumlabel = 'c';
update pg_enum set enumsortorder = 0 where enumtypid = 'my_type'::regtype and 
enumlabel = 'b';
update pg_enum set enumsortorder = -1 where enumtypid = 'my_type'::regtype and 
enumlabel = 'c';

-- log out and back in again to nuke syscache


-- now we can't find 'b' by index scan anymore!
set seq_page_cost to 100;
set enable_bitmapscan to false;

select * from my_table where my_column = 'b';

-- let's see what amcheck says about that

select bt_index_check('my_index');
ERROR:  high key invariant violated for index "my_index"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

Reply via email to