On Mon, Aug 3, 2020 at 12:00 AM Peter Geoghegan <p...@bowt.ie> wrote:
> Moving on with verification of the same index in the event of B-Tree
> index corruption is a categorical mistake. verify_nbtree.c was simply
> not designed to work that way.
>
> You were determined to avoid allowing any behavior that can result in
> a backend crash in the event of corruption, but this design will
> defeat various measures I took to avoid crashing with corrupt data
> (e.g. in commit a9ce839a313).
>
> What's the point in not just giving up on the index (though not
> necessarily the table or other indexes) at the first sign of trouble,
> anyway? It makes sense for the heap structure, but not for indexes.
I agree that there's a serious design problem with Mark's patch in this regard, but I disagree that the effort is pointless on its own terms. You're basically postulating that users don't care how corrupt their index is: whether there's one problem or one million problems, it's all the same. If the user presents an index with one million problems and we tell them about one of them, we've done our job and can go home.

This doesn't match my experience. When an EDB customer reports corruption, typically one of the first things I want to understand is how widespread the problem is. This same issue came up on the thread about relfrozenxid/relminmxid corruption. If you've got a table with one or two rows where tuple.xmin < relfrozenxid, that's a different kind of problem than if 50% of the tuples in the table have tuple.xmin < relfrozenxid; the latter might well indicate that relfrozenxid value itself is garbage, while the former indicates that a few tuples slipped through the cracks somehow. If you're contemplating a recovery strategy like "nuke the affected tuples from orbit," you really need to understand which of those cases you've got.

Granted, this is a bit less important with indexes, because in most cases you're just going to REINDEX. But, even there, the question is not entirely academic. For instance, consider the case of a user whose database crashes and then fails to restart because WAL replay fails. Typically, there is little option here but to run pg_resetwal. At this point, you know that there is some damage, but you don't know how bad it is. If there was little system activity at the time of the crash, there may be only a handful of problems with the database. If there was a heavy OLTP workload running at the time of the crash, with a long checkpoint interval, the problems may be widespread. If the user has done this repeatedly before bothering to contact support, which is more common than you might suppose, the damage may be extremely widespread.

Now, you could argue (and not unreasonably) that in any case after something like this happens even once, the user ought to dump and restore to get back to a known good state. However, when the cluster is 10TB in size and there's a $100,000 financial loss for every hour of downtime, the question naturally arises of how urgent that dump and restore is. Can we wait until our next maintenance window? Can we at least wait until off hours? Being able to tell the user whether they've got a tiny bit of corruption or a whole truckload of corruption can enable them to make better decisions in such cases, or at least more educated ones.

Now, again, just replacing ereport(ERROR, ...) with something else that does not abort the rest of the checks is clearly not OK. I don't endorse that approach, or anything like it. But neither do I accept the argument that it would be useless to report all the errors even if we could do so safely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
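
P.S. To make the "how widespread is it" question concrete for the xmin-vs-relfrozenxid case above, here is a rough sketch of the sort of ad-hoc query one might run. It is not part of amcheck or Mark's patch; 'some_table' is just a placeholder, and because legitimately frozen tuples can still carry an xmin older than relfrozenxid (and age() reports INT_MAX for permanent XIDs), the count is only a coarse upper bound on the number of suspect tuples, not a real corruption check:

    -- Coarse gauge of how many tuples carry an xmin that appears to precede
    -- the table's relfrozenxid.  Frozen tuples can legitimately show up here,
    -- so expect false positives; this also scans the whole table.
    SELECT count(*) AS suspect_tuples
    FROM some_table
    WHERE age(xmin) > (SELECT age(relfrozenxid)
                       FROM pg_class
                       WHERE oid = 'some_table'::regclass);

A handful of hits on a big table suggests a few tuples slipped through the cracks; a huge fraction of the table suggests the relfrozenxid value itself is suspect.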