On Tue, May 12, 2020 at 7:07 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
> Thank you yet again for reviewing. I really appreciate the feedback!

Happy to help. It's important work.

> Ok, I take your point that the code cannot soldier on after the first error
> is returned. I'll change that for v6 of the patch, moving on to the next
> relation after hitting the first corruption in any particular index. Do you
> mind that I refactored the code to return the error rather than ereporting?

try/catch seems like the way to do it. Not all amcheck errors come
from amcheck itself -- some are raised by backend code, and are known
to appear in amcheck from time to time. I'm thinking in particular of
the table_index_build_scan()/heapam_index_build_range_scan() errors,
as well as the errors from _bt_checkpage().

> Yes, I agree that reindexing is the most sensible remedy. I certainly have
> no plans to implement some pg_fsck_index type tool. Even for tables, I'm not
> interested in creating such a tool. I just want a good tool for finding out
> what the nature of the corruption is, as that might make it easier to debug
> what went wrong. It's not just for debugging production systems, but also
> for chasing down problems in half-baked code prior to release.

All good goals.

> * check_tuphdr_xids
> The point is that when checking the table for corruption I avoid calling
> anything that might assert (or segfault, or whatever).

I don't think that you can expect to avoid assertion failures in
general. I'll stick with your example. You're calling
TransactionIdDidCommit() from check_tuphdr_xids(), which will
interrogate the commit log and pg_subtrans. It's just not under your
control. I'm sure that you could get an assertion failure somewhere in
there, and even if you couldn't, that could change at any time. You've
quasi-duplicated some sensitive code to do that much, which seems
excessive. But it's also not enough.

> I'm starting to infer from your comments that you see the landmine detection
> vehicle as also driving at high speed, detecting landmines on occasion by
> seeing them first, but frequently by failing to see them and just blowing up.

That's not it. I would certainly prefer it if the landmine detector
didn't blow up. Not having that happen is a goal I share -- that's why
PageGetItemIdCareful() exists. But not at any cost, especially not
when "blow up" means an assertion failure that users won't actually
see in production. Avoiding assertion failures like the one you showed
is likely to have a high cost (removing defensive asserts in low-level
access method code) for a low benefit. Any attempt to make the checker
throw an error message rather than blow up needs to be assessed
pragmatically, on a case-by-case basis.

> One of the delays in submitting the most recent version of the patch is that
> I was having trouble creating a reliable, portable btree corrupting
> regression test.

To be clear, I think that corrupting data is very helpful with ad-hoc
testing during development.

> I did however address (some?) issues that you and others mentioned about the
> table corrupting regression test. Perhaps there are remaining issues that
> will show up on machines with different endianness than I have thus far
> tested, but I don't see that they will be insurmountable. Are you
> fundamentally opposed to that test framework?

I haven't thought about it enough just yet, but I am certainly
suspicious of it.

--
Peter Geoghegan
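
PS: To make the try/catch suggestion concrete, here is a rough,
untested sketch of the shape I have in mind, using the backend's
PG_TRY()/PG_CATCH() machinery. check_one_index_guts() is a
hypothetical stand-in for whatever actually drives verification of a
single index:

    #include "postgres.h"
    #include "utils/rel.h"

    /* hypothetical worker; may raise ERROR, ours or the backend's */
    extern void check_one_index_guts(Relation indrel);

    static bool
    check_one_index(Relation indrel)
    {
        MemoryContext oldcontext = CurrentMemoryContext;
        bool          ok = true;

        PG_TRY();
        {
            check_one_index_guts(indrel);
        }
        PG_CATCH();
        {
            ErrorData  *errdata;

            /* copy the error out of ErrorContext before clearing it */
            MemoryContextSwitchTo(oldcontext);
            errdata = CopyErrorData();
            FlushErrorState();

            /* record errdata->message against this index, move on */
            ok = false;
        }
        PG_END_TRY();

        return ok;
    }

A real version would have to be more careful than this, of course --
continuing after an arbitrary ERROR generally requires a
subtransaction to put the backend back into a sane state, much as
PL/pgSQL's exception blocks do.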
Happy to help. It's important work. > Ok, I take your point that the code cannot soldier on after the first error > is returned. I'll change that for v6 of the patch, moving on to the next > relation after hitting the first corruption in any particular index. Do you > mind that I refactored the code to return the error rather than ereporting? try/catch seems like the way to do it. Not all amcheck errors come from amcheck -- some are things that the backend code does, that are known to appear in amcheck from time to time. I'm thinking in particular of the table_index_build_scan()/heapam_index_build_range_scan() errors, as well as the errors from _bt_checkpage(). > Yes, I agree that reindexing is the most sensible remedy. I certainly have > no plans to implement some pg_fsck_index type tool. Even for tables, I'm not > interested in creating such a tool. I just want a good tool for finding out > what the nature of the corruption is, as that might make it easier to debug > what went wrong. It's not just for debugging production systems, but also > for chasing down problems in half-baked code prior to release. All good goals. > * check_tuphdr_xids > The point is that when checking the table for corruption I avoid calling > anything that might assert (or segfault, or whatever). I don't think that you can expect to avoid assertion failures in general. I'll stick with your example. You're calling TransactionIdDidCommit() from check_tuphdr_xids(), which will interrogate the commit log and pg_subtrans. It's just not under your control. I'm sure that you could get an assertion failure somewhere in there, and even if you couldn't that could change at any time. You've quasi-duplicated some sensitive code to do that much, which seems excessive. But it's also not enough. > I'm starting to infer from your comments that you see the landmine detection > vehicle as also driving at high speed, detecting landmines on occasion by > seeing them first, but frequently by failing to see them and just blowing up. That's not it. I would certainly prefer if the landmine detector didn't blow up. Not having that happen is certainly a goal I share -- that's why PageGetItemIdCareful() exists. But not at any cost, especially not when "blow up" means an assertion failure that users won't actually see in production. Avoiding assertion failures like the one you showed is likely to have a high cost (removing defensive asserts in low level access method code) for a low benefit. Any attempt to avoid having the checker itself blow up rather than throw an error message needs to be assessed pragmatically, on a case-by-case basis. > One of the delays in submitting the most recent version of the patch is that > I was having trouble creating a reliable, portable btree corrupting > regression test. To be clear, I think that corrupting data is very helpful with ad-hoc testing during development. > I did however address (some?) issues that you and others mentioned about the > table corrupting regression test. Perhaps there are remaining issues that > will show up on machines with different endianness than I have thus far > tested, but I don't see that they will be insurmountable. Are you > fundamentally opposed to that test framework? I haven't thought about it enough just yet, but I am certainly suspicious of it. -- Peter Geoghegan