On 01/27/2018 05:01 AM, Bruce Momjian wrote:
> On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>>
>> ...
>>
>> FWIW even if it's not safe in general, it would be useful to
>> understand what the requirements are to make it work. I mean,
>> conditions that need to be met on various levels (sector size of
>> the storage device, page size of the file system, filesystem
>> alignment, ...).
>
> I think you are fine as soon as the data arrives at the durable
> storage, assuming the data can't be partially written to durable
> storage. I was thinking more of a case where you have a file system,
> a RAID card without a BBU, and then magnetic disks. In that case,
> even if the file system were to write in 4k chunks, the RAID
> controller would also need to do the same, and with the same
> alignment. Of course, that's probably a silly example since there is
> probably no way to atomically write 4k to a magnetic disk.
>
> Actually, what happens if a 4k write is being written to an SSD and
> the server crashes? Is the entire write discarded?
>

AFAIK it's not possible to end up with a partial write, and in
particular not one that contains a mix of old and new data. That's
because SSDs can't overwrite a block without erasing it first. So the
write should either succeed or fail as a whole, depending on when
exactly the server crashes (it might be right before confirming the
flush back to the client, for example).

That assumes the drive has 4kB sectors (internal pages), and it
describes drives with a volatile write cache that support write
barriers and cache flushes. On drives with a non-volatile write cache
(so with a battery/capacitor) the write should always succeed and
never get discarded.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
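
P.S. To make the "aligned 4kB write plus cache flush" case above concrete, here is a minimal sketch in Python. The names PAGE_SIZE, write_page and read_page are hypothetical and only serve the illustration; nothing here is taken from PostgreSQL. It assumes a POSIX system and a 4kB page. It only aligns the write within the file: whether that maps to an aligned, atomic write on the device still depends on the filesystem, partition alignment, any RAID controller, and the drive itself, as discussed above.

```python
import os

PAGE_SIZE = 4096  # assumed sector / internal page size (hypothetical constant)


def write_page(path, page_no, payload):
    """Write exactly one page at a PAGE_SIZE-aligned offset, then flush it."""
    if len(payload) != PAGE_SIZE:
        raise ValueError("payload must be exactly PAGE_SIZE bytes")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # The offset is a multiple of PAGE_SIZE, so the write never
        # straddles two pages of the file.
        written = os.pwrite(fd, payload, page_no * PAGE_SIZE)
        if written != PAGE_SIZE:
            raise OSError("short write")
        # Request a flush to stable storage. This is only as good as the
        # drive's handling of cache flushes.
        os.fsync(fd)
    finally:
        os.close(fd)


def read_page(path, page_no):
    """Read one page back from its aligned offset."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)
    finally:
        os.close(fd)
```

The call returns only after fsync() does, which corresponds to the "confirming the flush back to the client" step: a crash before that point may leave the page either old or new, and the application cannot tell which until it reads it back.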