On 11/11/12 2:56 PM, Jeff Davis wrote:
> We could have a separate utility, pg_checksums, that can
> alter the state and/or do an offline verification. And initdb would take
> an option that would start everything out fully protected with
> checksums.

Adding an initdb option to start everything out checksummed seems like an uncontroversial first thing to have available, and a reasonable 9.3 target to aim at even if per-table upgrading gets bogged down in details. I'll argue below that the area between initdb-time conversion and per-table upgrades is fundamentally uncertain and therefore not worth chasing after, for reasons you already started to outline. There's not much useful middle ground there.

Won't a pg_checksums program just grow until it looks like a limited version of vacuum though? It's going to iterate over most of the table; it needs the same cost controls as autovacuum (and to respect the load of concurrent autovacuum work) to keep I/O under control; and those cost control values might change if there's a SIGHUP to reload parameters. It looks so much like vacuum that I think there needs to be a really compelling reason to split it into something new. Why can't this be yet another autovacuum worker that does its thing?

> In order to get to the fully-protected state, you still need to
> somehow make sure that all of the old data is checksummed.
>
> And the "fully protected" state is important in my opinion, because
> otherwise we aren't protected against corrupt page headers that say
> they have no checksum (even when it really should have a checksum).

I think it's useful to step back for a minute and consider the larger uncertainty an existing relation has, which amplifies just how ugly this situation is. The best guarantee I think online checksumming can offer is to tell the user "after transaction id X, all new data in relation R is known to be checksummed". Unless you do this at initdb time, any conversion case is going to have the possibility that a page was corrupted before you got to it--whether you're adding the checksum as part of a "let's add them while we're writing anyway" page update, or because the conversion tool is reaching that page.

That's why I don't think anyone will find online conversion really useful until they've done a full sweep updating the old pages. And if you accept that, a flexible checksum upgrade utility, one that coordinates its activity and costs with autovacuum, becomes a must.

One of the really common cases I was expecting here is that conversions are done by kicking off a slow background VACUUM CHECKSUM job that might run in pieces. I was thinking of an approach like this (there's a rough code sketch after the outline):

-Initialize a last_checked_block value for each table
-Loop:
--Grab the next block after the last checked one
--When on the last block of the relation, grab an exclusive lock to protect against race conditions with extension
--If it's marked as checksummed and the checksum matches, skip it
---Otherwise, add a checksum and write it out
--When that succeeds, update last_checked_block
--If that was the last block, save some state saying the whole table is checksummed
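
To make the shape of that loop concrete, here's a rough C sketch of one pass, and nothing more than a sketch. The buffer-manager, lock-manager, and delay calls (ReadBufferExtended, LockBuffer, MarkBufferDirty, LockRelationForExtension, vacuum_delay_point) are existing backend functions a worker like this would presumably lean on; page_needs_checksum(), page_set_checksum(), and checksum_save_progress() are made-up placeholder names for pieces that don't exist yet, and WAL logging is hand-waved entirely:

    #include "postgres.h"
    #include "commands/vacuum.h"
    #include "storage/bufmgr.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    /*
     * One conversion pass over rel, resuming at last_checked_block.
     * page_needs_checksum(), page_set_checksum() and
     * checksum_save_progress() are hypothetical helpers.
     */
    static void
    checksum_relation_pass(Relation rel, BlockNumber last_checked_block)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        BlockNumber blkno;

        for (blkno = last_checked_block; blkno < nblocks; blkno++)
        {
            Buffer      buf;
            Page        page;
            bool        last = (blkno == nblocks - 1);

            /* Respect the same cost-based throttling vacuum uses */
            vacuum_delay_point();

            /* On the last block, guard against a concurrent extension */
            if (last)
                LockRelationForExtension(rel, ExclusiveLock);

            buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                     RBM_NORMAL, NULL);
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            page = BufferGetPage(buf);

            /* Skip pages already flagged with a matching checksum */
            if (page_needs_checksum(page))          /* hypothetical */
            {
                page_set_checksum(page);            /* hypothetical */
                MarkBufferDirty(buf);
                /* WAL logging / forcing the write is hand-waved here */
            }

            UnlockReleaseBuffer(buf);

            if (last)
                UnlockRelationForExtension(rel, ExclusiveLock);

            /* Persist forward progress so a restart resumes here */
            checksum_save_progress(rel, blkno);     /* hypothetical */
        }
    }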

With that logic, there is at least a forward-moving pointer that removes the uncertainty around whether pages have been updated or not. It will keep going usefully if interrupted, too. One obvious way this can fail is if:

1) A late page in the relation is updated and a checksummed page written
2) The page is corrupted such that the "is this checksummed?" bits are not consistent anymore, along with other damage to it
3) The conversion process gets to this page eventually
4) The corruption of (2) isn't detected

But I think this possibility--that a page might get quietly corrupted after being checked once, while the relation as a whole is still being checked--is both impossible to remove and a red herring. How do we know that this page of the relation wasn't corrupted on disk before we even started? We don't, and we can't.

The only guarantee I see that we can give for online upgrades is that after a VACUUM CHECKSUM sweep is done, and every page is known to both have a valid checksum on it and have its checksum bits set, *then* any page that doesn't have both the bits set and a matching checksum is garbage. Until reaching that point, any old data is suspect. Operating in a "we'll convert on write but never convert old pages" mode can't provide any useful guarantee about data integrity that I can see. As you say, on that path you never gain the ability to tell pages that were checksummed but have since been corrupted from ones that were corrupt all along.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

