On Mon, Aug 3, 2020 at 12:00 AM Peter Geoghegan <p...@bowt.ie> wrote:
> Moving on with verification of the same index in the event of B-Tree
> index corruption is a categorical mistake. verify_nbtree.c was simply
> not designed to work that way.
>
> You were determined to avoid allowing any behavior that can result in
> a backend crash in the event of corruption, but this design will
> defeat various measures I took to avoid crashing with corrupt data
> (e.g. in commit a9ce839a313).
>
> What's the point in not just giving up on the index (though not
> necessarily the table or other indexes) at the first sign of trouble,
> anyway? It makes sense for the heap structure, but not for indexes.
I agree that there's a serious design problem with Mark's patch in this regard, but I disagree that the effort is pointless on its own terms. You're basically postulating that users don't care how corrupt their index is: whether there's one problem or one million problems, it's all the same. If the user presents an index with one million problems and we tell them about one of them, we've done our job and can go home.

This doesn't match my experience. When an EDB customer reports corruption, typically one of the first things I want to understand is how widespread the problem is. This same issue came up on the thread about relfrozenxid/relminmxid corruption. If you've got a table with one or two rows where tuple.xmin < relfrozenxid, that's a different kind of problem than if 50% of the tuples in the table have tuple.xmin < relfrozenxid; the latter might well indicate that relfrozenxid value itself is garbage, while the former indicates that a few tuples slipped through the cracks somehow. If you're contemplating a recovery strategy like "nuke the affected tuples from orbit," you really need to understand which of those cases you've got.

Granted, this is a bit less important with indexes, because in most cases you're just going to REINDEX. But, even there, the question is not entirely academic. For instance, consider the case of a user whose database crashes and then fails to restart because WAL replay fails. Typically, there is little option here but to run pg_resetwal. At this point, you know that there is some damage, but you don't know how bad it is. If there was little system activity at the time of the crash, there may be only a handful of problems with the database. If there was a heavy OLTP workload running at the time of the crash, with a long checkpoint interval, the problems may be widespread. If the user has done this repeatedly before bothering to contact support, which is more common than you might suppose, the damage may be extremely widespread.

Now, you could argue (and not unreasonably) that in any case after something like this happens even once, the user ought to dump and restore to get back to a known good state. However, when the cluster is 10TB in size and there's a $100,000 financial loss for every hour of downtime, the question naturally arises of how urgent that dump and restore is. Can we wait until our next maintenance window? Can we at least wait until off hours? Being able to tell the user whether they've got a tiny bit of corruption or a whole truckload of corruption can enable them to make better decisions in such cases, or at least more educated ones.

Now, again, just replacing ereport(ERROR, ...) with something else that does not abort the rest of the checks is clearly not OK. I don't endorse that approach, or anything like it. But neither do I accept the argument that it would be useless to report all the errors even if we could do so safely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
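
P.S. To make the "how widespread is it" question concrete for the xmin-vs-relfrozenxid case above, here is a rough sketch of the sort of ad-hoc query one might run. It is not part of amcheck or Mark's patch; 'some_table' is just a placeholder, and because legitimately frozen tuples can still carry an xmin older than relfrozenxid (and age() reports INT_MAX for permanent XIDs), the count is only a coarse upper bound on the number of suspect tuples, not a real corruption check:

    -- Coarse gauge of how many tuples carry an xmin that appears to precede
    -- the table's relfrozenxid.  Frozen tuples can legitimately show up here,
    -- so expect false positives; this also scans the whole table.
    SELECT count(*) AS suspect_tuples
    FROM some_table
    WHERE age(xmin) > (SELECT age(relfrozenxid)
                       FROM pg_class
                       WHERE oid = 'some_table'::regclass);

A handful of hits on a big table suggests a few tuples slipped through the cracks; a huge fraction of the table suggests the relfrozenxid value itself is suspect.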