On Thu, Aug 2, 2018 at 1:20 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> On 2018-Aug-02, Thomas Munro wrote:
>> PostgreSQL only requires atomic writes of 512 bytes (see
>> PG_CONTROL_MAX_SAFE_SIZE), the traditional sector size for disks made
>> approximately 1980-2010, though as far as I know spinning disks made
>> this decade use 4KB sectors, and for SSDs there is more variation.  I
>> suppose the theory for torn SLRU page safety today is that the
>> existing SLRU users all have fully independent values that don't cross
>> sector boundaries, so torn writes can't corrupt them.
>
> Hmm, I wonder if this is true for multixact/members.  I think it's not
> true for either 4kB sectors nor for 512 byte sectors.

Hmm, right, the set of members can span sectors.  Let me try that
again.  You can cross sector boundaries, but only if you don't require
any kind of multi-sector consistency during replay.

I think the important property for correct operation without full-page
writes (FPWs) is that you can't read data from the page itself in order
to redo writes to the page.  That rules out whole-page checksum
verification, and probably requires "physical" addressing.  By physical
addressing I mean, for example, that the WAL record that writes member
data must know exactly where to put it on the page, without consulting
the page header or item pointers that refer to data that can move
around within the page ("logical" intra-page addressing).

We make the page consistent incrementally: each WAL record that writes
new members into a page is concerned with a specific physical part of
the page, identified by offset, and doesn't care about the rest, and no
one should ever try to read any part of the page that hasn't already
been made consistent.  This seems OK.

Another way to say it is that FPWs are physical logging of whole pages
(they say how to set every single bit), and WAL for multixacts is a bit
like physical logging of smaller regions of the page.  Physical logging
doesn't suffer from torn pages, as long as readers also look things up
by physical address and never try to read areas of the page that
haven't been written to yet.  If you want page-level checksums, though,
the incremental approach won't work.

--
Thomas Munro
http://www.enterprisedb.com
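
To make the physical vs. logical distinction above concrete, here is a
toy Python sketch.  It is not PostgreSQL code: the page layout, the
2-byte "next free offset" header, and the names torn_write,
redo_physical and redo_logical are all hypothetical, chosen only for
illustration.  It simulates a crash where only some 512-byte sectors of
a page reached disk, then replays two styles of redo record.

```python
PAGE_SIZE = 8192
SECTOR_SIZE = 512
HEADER_SIZE = 2  # hypothetical header: 2-byte "next free offset"


def torn_write(old_page, new_page, sectors_written):
    """On-disk image after a crash: only the listed sectors are new."""
    disk = bytearray(old_page)
    for s in sectors_written:
        start = s * SECTOR_SIZE
        disk[start:start + SECTOR_SIZE] = new_page[start:start + SECTOR_SIZE]
    return disk


def redo_physical(page, wal):
    """Each record is (offset, data); redo never reads the page."""
    page = bytearray(page)
    for offset, data in wal:
        page[offset:offset + len(data)] = data
    return page


def redo_logical(page, wal):
    """Each record is just data; redo reads the header to find the spot."""
    page = bytearray(page)
    for data in wal:
        free = int.from_bytes(page[0:HEADER_SIZE], "little")
        page[free:free + len(data)] = data
        page[0:HEADER_SIZE] = (free + len(data)).to_bytes(HEADER_SIZE, "little")
    return page


# Physical addressing: the written region straddles sectors 0 and 1.
old_page = bytearray(PAGE_SIZE)
physical_wal = [(500, b"M" * 24), (524, b"N" * 8)]
intended = redo_physical(old_page, physical_wal)
torn = torn_write(old_page, intended, sectors_written={1})
recovered = redo_physical(torn, physical_wal)
assert torn != intended        # the page really was torn
assert recovered == intended   # replay repairs it without reading it

# Logical addressing: header lives in sector 0, data lands in sector 1.
old_logical = bytearray(PAGE_SIZE)
old_logical[0:HEADER_SIZE] = (600).to_bytes(HEADER_SIZE, "little")
logical_wal = [b"M" * 24, b"N" * 8]
intended_logical = redo_logical(old_logical, logical_wal)
torn_logical = torn_write(old_logical, intended_logical, sectors_written={0})
bad = redo_logical(torn_logical, logical_wal)
assert bad != intended_logical  # redo trusted a header from the future
```

In the second case the header sector made it to disk but the data
sector did not, so redo appends after an offset that already accounts
for the lost data.  That is the kind of "reading the page in order to
redo writes to the page" that would need FPWs to be safe.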