Hi,

On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> If you create an empty table, it is not fsync'd. As soon as you insert a row
> to it, register_dirty_segment() gets called, and after that, the next
> checkpoint will fsync it. But before that, the creation itself is never
> fsync'd. That's obviously not great.
Good catch.


> The lack of an fsync is a bit hard to prove because it requires a hardware
> failure, or a simulation of it, and can be affected by filesystem options
> too.

I've been thinking that we need a validation layer for fsyncs. This is too
hard to get right without testing, and crash testing is unlikely to catch
problems quickly enough / is too resource intensive.

My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as the key and the current LSN as
the value. We'd delete files from it when they're fsynced. At each checkpoint
we'd go through the table and check whether there are any files from before
the redo pointer that aren't yet fsynced. All of this in assert builds only,
of course.


> 1. Create a VM with two virtual disks. Use ext4, with 'data=writeback'
> option (I'm not sure if that's required). Install PostgreSQL on one of the
> virtual disks.
>
> 2. Start the server, and create a tablespace on the other disk:
>
> CREATE TABLESPACE foospc LOCATION '/data/heikki';
>
> 3. Do this:
>
> CREATE TABLE foo (i int) TABLESPACE foospc;
> CHECKPOINT;
>
> 4. Immediately after that, kill the VM. I used:
>
> killall -9 qemu-system-x86_64
>
> 5. Restart the VM, restart PostgreSQL. Now when you try to use the table,
> you get an error:
>
> postgres=# select * from crashtest ;
> ERROR: could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": No
> such file or directory

I think it might be good to have a bunch of in-core scripting to do this kind
of thing. I think we can get lower iteration times by using Linux
device-mapper targets to refuse writes.


> I was not able to reproduce this without the tablespace on a different
> virtual disk, I presume because ext4 orders the writes so that the
> checkpoint implicitly always flushes the creation of the file to disk.
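To make the validation-layer idea above concrete, here's a toy Python model of just the bookkeeping (not PostgreSQL code; class and method names are invented for illustration). Files are recorded when created/dirtied, forgotten when fsynced, and a checkpoint-time check reports anything dirtied before the redo pointer that was never synced:

```python
class FsyncValidator:
    """Toy model of the proposed fsync-validation hash table."""

    def __init__(self):
        # filename -> LSN at which the file was first created/dirtied
        self.pending = {}

    def record_dirty(self, path, lsn):
        # Keep the oldest un-fsynced LSN for the file.
        self.pending.setdefault(path, lsn)

    def record_fsync(self, path):
        # File is durable; stop tracking it.
        self.pending.pop(path, None)

    def check_checkpoint(self, redo_lsn):
        # Anything dirtied before the redo pointer must already be
        # fsynced; return violations (the real thing would Assert).
        return [p for p, lsn in self.pending.items() if lsn < redo_lsn]


# A file created before the redo pointer but never fsynced is flagged;
# an fsynced one is not.
v = FsyncValidator()
v.record_dirty("base/5/16384", lsn=100)
v.record_dirty("pg_tblspc/81921/5/98304", lsn=120)
v.record_fsync("base/5/16384")
print(v.check_checkpoint(redo_lsn=200))  # -> ['pg_tblspc/81921/5/98304']
```

In the bug at hand, mdcreate() never calls record_dirty() at all, so a real validation layer would also need a hook at file creation, not just at write time.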
It's likely that the control file sync at the end of a checkpoint has the
side effect of also forcing the file creation to be durable if it's on the
same filesystem (it'd not make the file contents durable, but there are no
contents here, so ...).


> I tried data=writeback but it didn't make a difference. But with a separate
> disk, it happens every time.

My understanding is that data=writeback just stops a journal flush of
filesystem metadata from also flushing all previously dirtied data. That's
often good, because it avoids a lot of unnecessary writes. Even with
data=writeback, prior filesystem journal contents are still synced to disk.


> I think the simplest fix is to call register_dirty_segment() from
> mdcreate(). As in the attached. Thoughts?

> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> index d26c915f90e..2dfd80ca66b 100644
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -225,6 +225,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
>  	mdfd = &reln->md_seg_fds[forkNum][0];
>  	mdfd->mdfd_vfd = fd;
>  	mdfd->mdfd_segno = 0;
> +
> +	if (!SmgrIsTemp(reln))
> +		register_dirty_segment(reln, forkNum, mdfd);

I think we might want different handling for unlogged tables and for
relations that are synced to disk in other ways (e.g. index builds). But I've
not re-read the relevant code.

Greetings,

Andres Freund