On 11/11/12 2:56 PM, Jeff Davis wrote:
> We could have a separate utility, pg_checksums, that can
> alter the state and/or do an offline verification. And initdb would take
> an option that would start everything out fully protected with
> checksums.

Adding an initdb option to start everything out checksummed seems like an uncontroversial first thing to have available, and a reasonable 9.3 target to aim at even if per-table upgrading gets bogged down in details. I'll argue below that the area between initdb-time conversion and per-table upgrades is fundamentally uncertain and therefore not worth chasing after, for reasons you already started to outline. There's not much useful middle ground there.

Won't a pg_checksums program just grow until it looks like a limited version of vacuum though? It's going to iterate over most of the table; it needs the same cost controls as autovacuum (and to respect the load of concurrent autovacuum work) to keep I/O under control; and those cost control values might change if there's a SIGHUP to reload parameters. It looks so much like vacuum that I think there needs to be a really compelling reason to split it into something new. Why can't this be yet another autovacuum worker that does its thing?

> In order to get to the fully-protected state, you still need to
> somehow make sure that all of the old data is checksummed.
>
> And the "fully protected" state is important in my opinion, because
> otherwise we aren't protected against corrupt page headers that say
> they have no checksum (even when it really should have a checksum).

I think it's useful to step back for a minute and consider the larger uncertainty an existing relation has, which amplifies just how ugly this situation is. The best guarantee I think online checksumming can offer is to tell the user "after transaction id X, all new data in relation R is known to be checksummed". Unless you do this at initdb time, any conversion case is going to have the possibility that a page was corrupted before you got to it--whether you're adding the checksum as part of a "let's add them while we're writing anyway" page update, or because the conversion tool is reaching that page.

That's why I don't think anyone will find online conversion really useful until they've done a full sweep updating the old pages. And if you accept that, a flexible checksum upgrade utility, one that coordinates its activity and costs with autovacuum, becomes a must.

One of the really common cases I was expecting here is that conversions are done by kicking off a slow background VACUUM CHECKSUM job that might run in pieces. I was thinking of an approach like this (there's a rough code sketch after the outline):

-Initialize a last_checked_block value for each table
-Loop:
--Grab the next block after the last checked one
--When on the last block of the relation, grab an exclusive lock to protect against race conditions with extension
--If it's marked as checksummed and the checksum matches, skip it
---Otherwise, add a checksum and write it out
--When that succeeds, update last_checked_block
--If that was the last block, save some state saying the whole table is checksummed
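
To make the shape of that loop concrete, here's a rough C sketch of one pass, and nothing more than a sketch. The buffer-manager, lock-manager, and delay calls (ReadBufferExtended, LockBuffer, MarkBufferDirty, LockRelationForExtension, vacuum_delay_point) are existing backend functions a worker like this would presumably lean on; page_needs_checksum(), page_set_checksum(), and checksum_save_progress() are made-up placeholder names for pieces that don't exist yet, and WAL logging is hand-waved entirely:

    #include "postgres.h"
    #include "commands/vacuum.h"
    #include "storage/bufmgr.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    /*
     * One conversion pass over rel, resuming at last_checked_block.
     * page_needs_checksum(), page_set_checksum() and
     * checksum_save_progress() are hypothetical helpers.
     */
    static void
    checksum_relation_pass(Relation rel, BlockNumber last_checked_block)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        BlockNumber blkno;

        for (blkno = last_checked_block; blkno < nblocks; blkno++)
        {
            Buffer      buf;
            Page        page;
            bool        last = (blkno == nblocks - 1);

            /* Respect the same cost-based throttling vacuum uses */
            vacuum_delay_point();

            /* On the last block, guard against a concurrent extension */
            if (last)
                LockRelationForExtension(rel, ExclusiveLock);

            buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                     RBM_NORMAL, NULL);
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            page = BufferGetPage(buf);

            /* Skip pages already flagged with a matching checksum */
            if (page_needs_checksum(page))          /* hypothetical */
            {
                page_set_checksum(page);            /* hypothetical */
                MarkBufferDirty(buf);
                /* WAL logging / forcing the write is hand-waved here */
            }

            UnlockReleaseBuffer(buf);

            if (last)
                UnlockRelationForExtension(rel, ExclusiveLock);

            /* Persist forward progress so a restart resumes here */
            checksum_save_progress(rel, blkno);     /* hypothetical */
        }
    }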

With that logic, there is at least a forward-moving pointer that removes the uncertainty around whether pages have been updated or not. It will keep going usefully if interrupted, too. One obvious way this can fail is if:

1) A late page in the relation is updated and a checksummed page written
2) The page is corrupted such that the "is this checksummed?" bits are not consistent anymore, along with other damage to it
3) The conversion process gets to this page eventually
4) The corruption of (2) isn't detected

But I think this possibility--that a page might get quietly corrupted after being checked once, while the relation as a whole is still being checked--is both impossible to remove and a red herring. How do we know that this page of the relation wasn't corrupted on disk before we even started? We don't, and we can't.

The only guarantee I see that we can give for online upgrades is that after a VACUUM CHECKSUM sweep is done, and every page is known to both have a valid checksum on it and have its checksum bits set, *then* any page that doesn't have both the bits set and a matching checksum is garbage. Until reaching that point, any old data is suspect. Operating in a "we'll convert on write but never convert old pages" mode can't provide any useful guarantee about data integrity that I can see. As you say, on that path you never gain the ability to tell pages that were checksummed but have since been corrupted from ones that were corrupt all along.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

