Re: better page-level checksums

Robert Haas Fri, 10 Jun 2022 06:59:09 -0700

On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent
<boekewurm+postg...@gmail.com> wrote:
> Why so? We already dole out per-page space in 4-byte increments
> through pd_linp, and I see no reason why we can't reserve some line
> pointers for per-page metadata if we decide that we need extra
> per-page ~overhead~ metadata.


Hmm, that's an interesting approach. I was thinking that putting data
after the PageHeaderData struct would be a non-starter because the
code that looks up a line pointer by index is currently just
multiply-and-add and complicating it seems bad for performance.
However, if we treated the space there as overlapping the line pointer
array and making some line pointers unusable rather than something
inserted prior to the line pointer array, we could avoid that. I still
think it would be kind of complicated, though, because we'd have to
find every bit of code that loops over the line pointer array or
accesses it by index and make sure that it doesn't try to access the
low-numbered line pointers.

> Isn't the goal of a checksum to find - and where possible, correct -
> bit flips and other broken pages? I would suggest not to use
> cryptographic hash functions for that, as those are rarely
> error-correcting.

I wasn't thinking of trying to do error correction, just error
detection. See also my earlier reply to Peter Geoghegan.

> Isn't that expected for most of those places? With the current
> bufpage.h description of Page, it seems obvious that all bytes on a
> page except those in the "hole" and those in the page header are under
> full control of the AM. Of course AMs will pre-calculate limits and
> offsets during compilation, that saves recalculation cycles and/or
> cache lines with constants to keep in L1.

Yep.

> Can't we add some extra fork that stores this extra per-page
> information, and contains this extra metadata in a double-buffered
> format, so that both before the actual page is written the metadata
> too is written to disk, while the old metadata is available too for
> recovery purposes. This allows us to maintain the current format with
> its low per-page overhead, and only have extra overhead (up to 2x
> writes for each page, but the writes for these metadata pages need not
> be BLCKSZ in size) for those that opt-in to the more computationally
> expensive features of larger checksums, nonces, and/or other non-AM
> per-page ~overhead~ metadata.

It's not impossible, I'm sure, but it doesn't seem very appealing to
me. Those extra reads and writes could be expensive, and there's no
place to cleanly integrate them into the code structure. A function
like PageIsVerified() -- which is where we currently validate
checksums -- only gets the page. It can't go off and read some other
page from disk to perform the checksum calculation.

I'm not exactly sure what you have in mind when you say that the
writes need not be BLCKSZ in size. Technically I guess that's true,
but then the checksums have to be crash safe, or they're not much
good. If they're not part of the page, how do they get updated in a
way that makes them crash safe? I guess it could be done: every time
we write a FPW, enlarge the page image by the number of bytes that are
stored in this location. When replaying an FPW, update those bytes
too. And every time we read or write a page, also read or write those
bytes. In essence, we'd be deciding that pages are 8192+n bytes, but
the last n bytes are stored in a different file - and, in memory, a
different buffer pool. I think that would be hugely invasive and
unpleasant to make work and I think the performance would be poor,
too.

> I'd prefer if we didn't change the way pages are presented to AMs.
> Currently, it is clear what area is available to you if you write an
> AM that uses the bufpage APIs. Changing the page format to have the
> buffer manager also touch / reserve space in the special areas seems
> like a break of abstraction: Quoting from bufpage.h:
>
>  * AM-generic per-page information is kept in PageHeaderData.
>  *
>  * AM-specific per-page data (if any) is kept in the area marked "special
>  * space"; each AM has an "opaque" structure defined somewhere that is
>  * stored as the page trailer.  an access method should always
>  * initialize its pages with PageInit and then set its own opaque
>  * fields.
>
> I'd rather we keep this contract: am-generic stuff belongs in
> PageHeaderData, with the rest of the page fully available for the AM
> to use (including the special area).

I don't think that changing the contract has to mean that it becomes
unclear what the contract is. And you can't improve any system without
changing some stuff. But you certainly don't have to like my ideas or
anything....

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: better page-level checksums

Reply via email to