If you create an empty table, its relation file is not fsync'd. As soon as you insert a row into it, register_dirty_segment() gets called, and after that, the next checkpoint will fsync it. But until then, the creation itself is never fsync'd. That's obviously not great.
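To make the point concrete outside of PostgreSQL, here is a minimal standalone sketch in plain POSIX C of creating an empty file and explicitly fsync'ing it afterwards. The function name create_empty_file_synced() is made up for illustration and is not part of md.c. It only shows the basic idea that open(O_CREAT) alone gives no durability guarantee until someone calls fsync() on the file.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: create an empty file and
 * fsync it before returning. Returns 0 on success, -1 on failure.
 */
int
create_empty_file_synced(const char *path)
{
	int			fd;

	/* Create the file; at this point nothing forces it to disk. */
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* Explicitly ask the kernel to flush the new file to storage. */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}

	return close(fd);
}
```

In the server the fsync is not issued at creation time like this; it is queued and carried out by the next checkpoint, which is the part that is missing for empty relation files.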

The lack of an fsync is a bit hard to prove because it requires a hardware failure, or a simulation of it, and can be affected by filesystem options too. But I was able to demonstrate a problem with these steps:

1. Create a VM with two virtual disks. Use ext4, with 'data=writeback' option (I'm not sure if that's required). Install PostgreSQL on one of the virtual disks.

2. Start the server, and create a tablespace on the other disk:

CREATE TABLESPACE foospc LOCATION '/data/heikki';

3. Do this:

CREATE TABLE crashtest (i int) TABLESPACE foospc;
CHECKPOINT;

4. Immediately after that, kill the VM. I used:

killall -9 qemu-system-x86_64

5. Restart the VM, restart PostgreSQL. Now when you try to use the table, you get an error:

postgres=# select * from crashtest ;
ERROR: could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": No such file or directory

I was not able to reproduce this without the tablespace on a different virtual disk, I presume because ext4 orders the writes so that the checkpoint implicitly always flushes the creation of the file to disk. I tried data=writeback but it didn't make a difference. But with a separate disk, it happens every time.

I think the simplest fix is to call register_dirty_segment() from mdcreate(). As in the attached. Thoughts?
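
To show why registering the request at creation time matters, here is a toy model in C. It is not PostgreSQL code: all of the names (ToyFile, toy_create, toy_checkpoint, toy_crash) are hypothetical, and it pessimistically assumes that any file not fsync'd before the crash is lost, whereas in reality it only may be lost.

```c
#include <stdbool.h>

#define TOY_MAX_FILES 8

/* Toy model only; every name here is hypothetical. */
typedef struct ToyFile
{
	bool		exists;			/* visible to the running system */
	bool		durable;		/* would survive a power loss */
	bool		sync_requested; /* queued for fsync at next checkpoint */
} ToyFile;

static ToyFile toy_files[TOY_MAX_FILES];

/* Create a file, optionally registering an fsync request for it. */
void
toy_create(int id, bool register_sync)
{
	toy_files[id].exists = true;
	toy_files[id].durable = false;
	toy_files[id].sync_requested = register_sync;
}

/* A checkpoint fsyncs only the files that have a pending request. */
void
toy_checkpoint(void)
{
	for (int i = 0; i < TOY_MAX_FILES; i++)
	{
		if (toy_files[i].exists && toy_files[i].sync_requested)
		{
			toy_files[i].durable = true;
			toy_files[i].sync_requested = false;
		}
	}
}

/* Pessimistic crash: everything that was not made durable disappears. */
void
toy_crash(void)
{
	for (int i = 0; i < TOY_MAX_FILES; i++)
	{
		if (!toy_files[i].durable)
			toy_files[i].exists = false;
		toy_files[i].sync_requested = false;
	}
}

bool
toy_file_exists(int id)
{
	return toy_files[id].exists;
}
```

Creating one file with register_sync = false and another with register_sync = true, then running toy_checkpoint() followed by toy_crash(), leaves only the second file in place. That corresponds to the behavior before and after the proposed change to mdcreate().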

- Heikki
From 075b59f4c30c0a9de66fe72ce49ca40b5a9388f9 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Thu, 27 Jan 2022 19:52:37 +0200
Subject: [PATCH 1/1] Ensure that creation of an empty relfile is fsync'd at
 checkpoint.

If you create a table and don't insert any data into it, the relation file
is never fsync'd. You don't lose data, because an empty table doesn't have
any data to begin with, but if you crash and lose the file, subsequent
operations on the table will fail with a "could not open file" error.

To fix, register an fsync request in mdcreate(), like we do for mdwrite().
---
 src/backend/storage/smgr/md.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d26c915f90e..2dfd80ca66b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -225,6 +225,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 	mdfd = &reln->md_seg_fds[forkNum][0];
 	mdfd->mdfd_vfd = fd;
 	mdfd->mdfd_segno = 0;
+
+	if (!SmgrIsTemp(reln))
+		register_dirty_segment(reln, forkNum, mdfd);
 }
 
 /*
-- 
2.30.2
