> On Jul 30, 2020, at 2:00 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger
> <mark.dil...@enterprisedb.com> wrote:
>>> Maybe I'm just being dense here -- exactly what problem are you worried
>>> about?
>>
>> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax
>> committed via TransactionIdDidCommit. I am worried about concurrent
>> truncation of clog entries causing I/O errors on SLRU lookup when performing
>> that check. The three strategies I had for dealing with that were taking
>> the XactTruncationLock (formerly known as CLogTruncationLock, for those
>> reading this thread from the beginning), locking out vacuum, and the idea
>> upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being
>> dense and don't need to worry about this. But I haven't convinced myself of
>> that, yet.
>
> I don't get it. If you've already checked that the XIDs are >=
> relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be
> a problem. It could be, if CLOG is hosed, which is possible, because
> if the table is corrupted, why shouldn't CLOG also be corrupted? But
> I'm not sure that's what your concern is here.

No, that wasn't my concern. I was thinking about CLOG entries disappearing
during the scan as a consequence of concurrent vacuums, and the effect that
would have on the validity of the cached [relfrozenxid..next_valid_xid] range.
In the absence of corruption, I don't immediately see how this would cause any
problems. But for a corrupt table, I'm less certain how it would play out.
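
To make that concrete, here's a rough sketch of the range check I have in
mind (the function and variable names are hypothetical, and this is not the
actual verify_heapam coding):

#include "postgres.h"
#include "access/transam.h"

/*
 * Hypothetical sketch: test an xid against the cached
 * [relfrozenxid..next_valid_xid] range before consulting clog.  The hazard
 * is that a concurrent vacuum can advance relfrozenxid and truncate clog
 * after the lower bound was cached, so an xid that passes this check may
 * nonetheless be gone from clog by the time TransactionIdDidCommit()
 * looks it up.
 */
static bool
xid_in_cached_range(TransactionId xid,
                    TransactionId cached_relfrozenxid,
                    TransactionId cached_next_xid)
{
    return TransactionIdIsNormal(xid) &&
        !TransactionIdPrecedes(xid, cached_relfrozenxid) &&
        TransactionIdPrecedesOrEquals(xid, cached_next_xid);
}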

The kind of scenario I'm worried about may not be possible in practice. I
think it would depend on how vacuum behaves when scanning a table that is
corrupt in some way vacuum doesn't notice, and whether vacuum could finish
scanning it with the false belief that it has frozen all tuples with xids
less than some cutoff.

I thought it would be safer if that kind of thing were not happening during
verify_heapam's scan of the table. Even if a careful analysis proved it was
not an issue with the current coding of vacuum, I don't think there is any
coding convention requiring future versions of vacuum to be hardened against
corruption, so I don't see how I can rely on vacuum not causing such problems.

I don't think this is necessarily a too-rare-to-care-about concern,
either. If corruption across multiple tables prevents autovacuum from
succeeding, and the DBA doesn't get involved in scanning tables for corruption
until the lack of successful vacuums impacts the production system, I imagine
you could end up with vacuums repeatedly happening (or trying to happen) around
the time the DBA is trying to fix tables, or perhaps drop them, or whatever,
using verify_heapam for guidance on which tables are corrupted.

Anyway, that's what I was thinking. I was imagining that calling
TransactionIdDidCommit might keep crashing the backend while the DBA is trying
to find and fix corruption, and that could get really annoying.
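
For reference, the XactTruncationLock strategy I mentioned upthread might
look roughly like this (a sketch only; the function name is hypothetical
and this is not the actual patch):

#include "postgres.h"
#include "access/transam.h"
#include "storage/lwlock.h"

/*
 * Hold XactTruncationLock in shared mode so that a concurrent vacuum
 * cannot truncate clog while TransactionIdDidCommit() performs its SLRU
 * lookup.  A caller would still need to verify the xid against the
 * currently valid clog range, since a truncation may already have
 * happened before the lock was acquired.
 */
static bool
xid_committed_without_truncation_hazard(TransactionId xid)
{
    bool        committed;

    LWLockAcquire(XactTruncationLock, LW_SHARED);
    committed = TransactionIdDidCommit(xid);
    LWLockRelease(XactTruncationLock);

    return committed;
}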

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company