On Sat, Mar 13, 2021 at 10:35 AM Mark Dilger <mark.dil...@enterprisedb.com> wrote: > The testing strategy I'm using is to corrupt heap and btree pages in schema > "public" of the "regression" database created by `make installcheck`, by > overwriting random bytes in randomly selected locations on pages after the > page header. Then I run `pg_amcheck regression` and see if anything > segfaults. Doing this repeatedly, with random bytes and locations within > files not the same from one run to the next, I can find the locations of > segfaults that are particularly common.
That seems like a reasonable starting point to me. > The first common problem [1] happens at verify_nbtree.c:1422 when a > corruption report is being generated. The generation does not seem entirely > safe, and the problematic bit can be avoided > The temPointerGetBlockNumberNoCheck(tid) seems to be unsafe here. I think that the best way to further harden verify_nbtree.c at this point is to do the most basic validation of IndexTuples in our own new variant of a core bufpage.h macro: PageGetItemCareful(). In other words, we should invent the IndexTuple equivalent of the existing PageGetItemIdCareful() function (which already does the same thing for item pointers), and then replace all current PageGetItem() calls with PageGetItemCareful() calls -- we ban raw PageGetItem() calls from verify_nbtree.c forever. Here is some of the stuff that could go in PageGetItemCareful(), just off the top of my head: * The existing "if (tupsize != ItemIdGetLength(itemid))" check at the top of bt_target_page_check() -- make sure the length from the caller's line pointer agrees with IndexTupleSize(). * Basic validation against the index's tuple descriptor -- in particular, that varlena headers are basically sane, and that the apparent range of datums is safely within the space on the page for the tuple. * Similarly, BTreeTupleGetHeapTID() should not be able to return a pointer that doesn't actually point somewhere inside the space that the target page has for the IndexTuple. * No external TOAST pointers, since this is an index AM, and so doesn't allow that. In general this kind of very basic validation should be pushed down to the lowest level code, so that it detects the problem as early as possible, before slightly higher level code has the opportunity to run. Higher level code is always going to be at risk of making assumptions about the data not being corrupt, because there is so much more of it, and also because it tends to roughly look like idiomatic AM code. -- Peter Geoghegan