On Tue, May 12, 2020 at 7:07 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
> Thank you yet again for reviewing. I really appreciate the feedback!

Happy to help. It's important work.

> Ok, I take your point that the code cannot soldier on after the first error
> is returned. I'll change that for v6 of the patch, moving on to the next
> relation after hitting the first corruption in any particular index. Do you
> mind that I refactored the code to return the error rather than ereporting?

try/catch seems like the way to do it. Not all amcheck errors come
from amcheck itself -- some are raised by backend code, and are known
to appear in amcheck from time to time. I'm thinking in particular of
the table_index_build_scan()/heapam_index_build_range_scan() errors,
as well as the errors from _bt_checkpage().

> Yes, I agree that reindexing is the most sensible remedy. I certainly have
> no plans to implement some pg_fsck_index type tool. Even for tables, I'm not
> interested in creating such a tool. I just want a good tool for finding out
> what the nature of the corruption is, as that might make it easier to debug
> what went wrong. It's not just for debugging production systems, but also
> for chasing down problems in half-baked code prior to release.

All good goals.

> * check_tuphdr_xids
> The point is that when checking the table for corruption I avoid calling
> anything that might assert (or segfault, or whatever).

I don't think that you can expect to avoid assertion failures in
general. I'll stick with your example. You're calling
TransactionIdDidCommit() from check_tuphdr_xids(), which will
interrogate the commit log and pg_subtrans. It's just not under your
control. I'm sure that you could get an assertion failure somewhere in
there, and even if you couldn't, that could change at any time. You've
quasi-duplicated some sensitive code to do that much, which seems
excessive. But it's also not enough.

> I'm starting to infer from your comments that you see the landmine detection
> vehicle as also driving at high speed, detecting landmines on occasion by
> seeing them first, but frequently by failing to see them and just blowing up.

That's not it. I would certainly prefer it if the landmine detector
didn't blow up. Not having that happen is a goal I share -- that's why
PageGetItemIdCareful() exists. But not at any cost, especially not
when "blow up" means an assertion failure that users won't actually
see in production. Avoiding assertion failures like the one you showed
is likely to have a high cost (removing defensive asserts in low-level
access method code) for a low benefit. Any attempt to make the checker
throw an error message rather than blow up needs to be assessed
pragmatically, on a case-by-case basis.

> One of the delays in submitting the most recent version of the patch is that
> I was having trouble creating a reliable, portable btree corrupting
> regression test.

To be clear, I think that corrupting data is very helpful with ad-hoc
testing during development.

> I did however address (some?) issues that you and others mentioned about the
> table corrupting regression test. Perhaps there are remaining issues that
> will show up on machines with different endianness than I have thus far
> tested, but I don't see that they will be insurmountable. Are you
> fundamentally opposed to that test framework?

I haven't thought about it enough just yet, but I am certainly
suspicious of it.

--
Peter Geoghegan
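
PS: To make the try/catch suggestion concrete, here is a rough,
untested sketch of the shape I have in mind, using the backend's
PG_TRY()/PG_CATCH() machinery. check_one_index_guts() is a
hypothetical stand-in for whatever actually drives verification of a
single index:

    #include "postgres.h"
    #include "utils/rel.h"

    /* hypothetical worker; may raise ERROR, ours or the backend's */
    extern void check_one_index_guts(Relation indrel);

    static bool
    check_one_index(Relation indrel)
    {
        MemoryContext oldcontext = CurrentMemoryContext;
        bool          ok = true;

        PG_TRY();
        {
            check_one_index_guts(indrel);
        }
        PG_CATCH();
        {
            ErrorData  *errdata;

            /* copy the error out of ErrorContext before clearing it */
            MemoryContextSwitchTo(oldcontext);
            errdata = CopyErrorData();
            FlushErrorState();

            /* record errdata->message against this index, move on */
            ok = false;
        }
        PG_END_TRY();

        return ok;
    }

A real version would have to be more careful than this, of course --
continuing after an arbitrary ERROR generally requires a
subtransaction to put the backend back into a sane state, much as
PL/pgSQL's exception blocks do.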
Happy to help. It's important work. > Ok, I take your point that the code cannot soldier on after the first error > is returned. I'll change that for v6 of the patch, moving on to the next > relation after hitting the first corruption in any particular index. Do you > mind that I refactored the code to return the error rather than ereporting? try/catch seems like the way to do it. Not all amcheck errors come from amcheck -- some are things that the backend code does, that are known to appear in amcheck from time to time. I'm thinking in particular of the table_index_build_scan()/heapam_index_build_range_scan() errors, as well as the errors from _bt_checkpage(). > Yes, I agree that reindexing is the most sensible remedy. I certainly have > no plans to implement some pg_fsck_index type tool. Even for tables, I'm not > interested in creating such a tool. I just want a good tool for finding out > what the nature of the corruption is, as that might make it easier to debug > what went wrong. It's not just for debugging production systems, but also > for chasing down problems in half-baked code prior to release. All good goals. > * check_tuphdr_xids > The point is that when checking the table for corruption I avoid calling > anything that might assert (or segfault, or whatever). I don't think that you can expect to avoid assertion failures in general. I'll stick with your example. You're calling TransactionIdDidCommit() from check_tuphdr_xids(), which will interrogate the commit log and pg_subtrans. It's just not under your control. I'm sure that you could get an assertion failure somewhere in there, and even if you couldn't that could change at any time. You've quasi-duplicated some sensitive code to do that much, which seems excessive. But it's also not enough. > I'm starting to infer from your comments that you see the landmine detection > vehicle as also driving at high speed, detecting landmines on occasion by > seeing them first, but frequently by failing to see them and just blowing up. That's not it. I would certainly prefer if the landmine detector didn't blow up. Not having that happen is certainly a goal I share -- that's why PageGetItemIdCareful() exists. But not at any cost, especially not when "blow up" means an assertion failure that users won't actually see in production. Avoiding assertion failures like the one you showed is likely to have a high cost (removing defensive asserts in low level access method code) for a low benefit. Any attempt to avoid having the checker itself blow up rather than throw an error message needs to be assessed pragmatically, on a case-by-case basis. > One of the delays in submitting the most recent version of the patch is that > I was having trouble creating a reliable, portable btree corrupting > regression test. To be clear, I think that corrupting data is very helpful with ad-hoc testing during development. > I did however address (some?) issues that you and others mentioned about the > table corrupting regression test. Perhaps there are remaining issues that > will show up on machines with different endianness than I have thus far > tested, but I don't see that they will be insurmountable. Are you > fundamentally opposed to that test framework? I haven't thought about it enough just yet, but I am certainly suspicious of it. -- Peter Geoghegan