Hi,

On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> If you create an empty table, it is not fsync'd. As soon as you insert a row
> to it, register_dirty_segment() gets called, and after that, the next
> checkpoint will fsync it. But before that, the creation itself is never
> fsync'd. That's obviously not great.
Good catch.


> The lack of an fsync is a bit hard to prove because it requires a hardware
> failure, or a simulation of it, and can be affected by filesystem options
> too.

I've been thinking that we need a validation layer for fsyncs. This is too
hard to get right without testing, and crash testing is unlikely to catch
problems quickly enough / is too resource intensive.

My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as the key and the current LSN as
the value. We'd delete files from it when they're fsynced. At each checkpoint
we'd go through the table and check whether there are any files from before
the redo pointer that aren't yet fsynced. All of this in assert builds only,
of course.


> 1. Create a VM with two virtual disks. Use ext4, with 'data=writeback'
> option (I'm not sure if that's required). Install PostgreSQL on one of the
> virtual disks.
>
> 2. Start the server, and create a tablespace on the other disk:
>
> CREATE TABLESPACE foospc LOCATION '/data/heikki';
>
> 3. Do this:
>
> CREATE TABLE foo (i int) TABLESPACE foospc;
> CHECKPOINT;
>
> 4. Immediately after that, kill the VM. I used:
>
> killall -9 qemu-system-x86_64
>
> 5. Restart the VM, restart PostgreSQL. Now when you try to use the table,
> you get an error:
>
> postgres=# select * from crashtest ;
> ERROR: could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": No
> such file or directory

I think it might be good to have a bunch of in-core scripting to do this kind
of thing. I think we can get lower iteration times by using Linux
device-mapper targets to refuse writes.


> I was not able to reproduce this without the tablespace on a different
> virtual disk, I presume because ext4 orders the writes so that the
> checkpoint implicitly always flushes the creation of the file to disk.
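To make the validation-layer idea above concrete, here's a toy Python model of just the bookkeeping (not PostgreSQL code; class and method names are invented for illustration). Files are recorded when created/dirtied, forgotten when fsynced, and a checkpoint-time check reports anything dirtied before the redo pointer that was never synced:

```python
class FsyncValidator:
    """Toy model of the proposed fsync-validation hash table."""

    def __init__(self):
        # filename -> LSN at which the file was first created/dirtied
        self.pending = {}

    def record_dirty(self, path, lsn):
        # Keep the oldest un-fsynced LSN for the file.
        self.pending.setdefault(path, lsn)

    def record_fsync(self, path):
        # File is durable; stop tracking it.
        self.pending.pop(path, None)

    def check_checkpoint(self, redo_lsn):
        # Anything dirtied before the redo pointer must already be
        # fsynced; return violations (the real thing would Assert).
        return [p for p, lsn in self.pending.items() if lsn < redo_lsn]


# A file created before the redo pointer but never fsynced is flagged;
# an fsynced one is not.
v = FsyncValidator()
v.record_dirty("base/5/16384", lsn=100)
v.record_dirty("pg_tblspc/81921/5/98304", lsn=120)
v.record_fsync("base/5/16384")
print(v.check_checkpoint(redo_lsn=200))  # -> ['pg_tblspc/81921/5/98304']
```

In the bug at hand, mdcreate() never calls record_dirty() at all, so a real validation layer would also need a hook at file creation, not just at write time.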
It's likely that the control file sync at the end of a checkpoint has the
side effect of also forcing the file creation to be durable if it's on the
same filesystem (it'd not make the file contents durable, but there are no
contents here, so ...).


> I tried data=writeback but it didn't make a difference. But with a separate
> disk, it happens every time.

My understanding is that data=writeback just stops a journal flush of
filesystem metadata from also flushing all previously dirtied data. That's
often good, because it avoids a lot of unnecessary writes. Even with
data=writeback, prior filesystem journal contents are still synced to disk.


> I think the simplest fix is to call register_dirty_segment() from
> mdcreate(). As in the attached. Thoughts?

> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> index d26c915f90e..2dfd80ca66b 100644
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -225,6 +225,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
>  	mdfd = &reln->md_seg_fds[forkNum][0];
>  	mdfd->mdfd_vfd = fd;
>  	mdfd->mdfd_segno = 0;
> +
> +	if (!SmgrIsTemp(reln))
> +		register_dirty_segment(reln, forkNum, mdfd);

I think we might want different handling for unlogged tables and for
relations that are synced to disk in other ways (e.g. index builds). But I've
not re-read the relevant code.

Greetings,

Andres Freund