Re: better page-level checksums

Robert Haas Wed, 15 Jun 2022 13:28:15 -0700

On Tue, Jun 14, 2022 at 10:30 PM Peter Geoghegan <[email protected]> wrote:
> Basically I think that this is giving up rather a lot. For example,
> isn't it possible that we'd have corruption that could be a bug in
> either the checksum code, or in recovery?
>
> I'd feel a lot better about it if there was some sense of both the
> costs and the benefits.


I think that, if and when we get TDE (transparent data encryption),
debuggability is likely to be a huge issue. Something will go wrong
for someone at some point, and
when it does, what they'll have is a supposedly-encrypted page that
cannot be decrypted, and it will be totally unclear what has gone
wrong. Did the page get corrupted on disk by a random bit flip? Is
there a bug in the algorithm? Torn page? As things stand today, when a
page gets corrupted, a human being can look at the page and make an
educated guess about what has gone wrong and whether PostgreSQL or
some other system is to blame, and if it's PostgreSQL, perhaps have
some ideas as to where to look for the bug. If the pages are
encrypted, that's a lot harder. I think what will happen, depending on
the encryption mode, is probably that either (a) the page will decrypt
to complete garbage or (b) the page will fail some kind of
verification and you won't be able to decrypt it at all. Either way,
you won't be able to infer anything about what caused the problem. All
you'll know is that something is wrong. That sucks - a lot - and I
don't have a lot of good ideas as to what can be done about it. The
idea that an encrypted page is unintelligible and that small changes
to either the encrypted or unencrypted data should result in large
changes to the other is intrinsic to the nature of encryption. It's
more or less un-debuggable by design.

With extended checksums, I don't think the issues are anywhere near as
bad. I'm not deeply opposed to setting a page-level flag, but I expect
the benefit to be nominal. A human being looking at the page isn't going to
have a ton of trouble figuring out whether or not the extended
checksum is present unless the page is horribly, horribly garbled, and
even if that happens, will debugging that problem really be any worse
than debugging a horribly, horribly garbled page today? I don't think
so. I likewise expect that pg_filedump could use heuristics to figure
out what's going on just by looking at the page, even if no external
information is available. You are probably right when you say that
there's no need to be so parsimonious with pd_flags space as all that,
but I believe that if we did decide to set no bit in pd_flags, whoever
maintains pg_filedump these days would not have huge difficulty
inventing a suitable heuristic. A page with an extended checksum is
basically still an intelligible page, and we shouldn't understate the
value of that.
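
To make the heuristic idea a bit more concrete, here is a rough sketch in
Python of the kind of check a tool could run on a raw page. It is not taken
from pg_filedump. It parses the standard 24-byte page header and checks
that the offsets are ordered sensibly. The function names, the little-endian
byte order, the 8192-byte block size, and above all the notion that an
extended checksum would show up as a few reserved bytes at the end of the
page are assumptions made for illustration only.

```python
import struct

BLCKSZ = 8192              # assumed block size (the PostgreSQL default)
HEADER_FMT = "<IIHHHHHHI"  # assumes the page was written on a little-endian machine
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 24 bytes


def parse_page_header(page: bytes) -> dict:
    """Unpack the fixed-size page header fields from a raw page image."""
    (xlogid, xrecoff, checksum, flags, lower, upper, special,
     pagesize_version, prune_xid) = struct.unpack_from(HEADER_FMT, page, 0)
    return {
        "pd_lsn": (xlogid << 32) | xrecoff,
        "pd_checksum": checksum,
        "pd_flags": flags,
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        "page_size": pagesize_version & 0xFF00,
        "layout_version": pagesize_version & 0x00FF,
        "pd_prune_xid": prune_xid,
    }


def header_looks_sane(h: dict, blcksz: int = BLCKSZ) -> bool:
    """Hypothetical heuristic: offsets must be ordered and fit in the page."""
    return (h["page_size"] == blcksz
            and HEADER_SIZE <= h["pd_lower"] <= h["pd_upper"]
            <= h["pd_special"] <= blcksz)


def trailing_space(h: dict, blcksz: int = BLCKSZ) -> int:
    """Bytes between pd_special and the end of the page."""
    return blcksz - h["pd_special"]
```

If the header looks sane, a tool could compare trailing_space() against what
it expects for that kind of page (a plain heap page normally has none) and
treat a small unexplained remainder as a hint that something extra, such as
an extended checksum, is stored there. If the header does not look sane, the
page is garbled badly enough that a flag bit could not be trusted either.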

-- 
Robert Haas
EDB: http://www.enterprisedb.com
