Re: fdatasync performance problem with large number of DB files

Fujii Masao Tue, 16 Mar 2021 01:29:25 -0700



On 2021/03/16 8:15, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <gu...@vmware.com> wrote:

By the way, there is a common case in which we could skip the fsync: a standby
generated by pg_rewind/pg_basebackup that has already been fsync-ed.
The state of such a standby is surely not
DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the pgdata directory is fsync-ed
again during startup when those pg instances are started. We could ask users
not to fsync during pg_rewind and pg_basebackup, but we probably want to fsync
just some files in pg_rewind (see [1]), so would it be better to let the
startup process skip the unnecessary fsync? As for the solution, using a GUC or
writing something into a file such as backup_label(?) does not seem to be a
good idea, since:
1. With a GUC, we still expect an fsync after real crash recovery, so we would
   need to reset the GUC and also specify pgoptions in the pg_ctl command.
2. We could write some hint information to a file such as backup_label(?) in
   pg_rewind/pg_basebackup, but people might copy the pgdata directory, and
   then we would still need the fsync.
The only simple solution I can think of is to let the user touch a file as a
hint to startup before starting the pg instance.


As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up).  It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough.  Hmm.  What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata?  Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives.  Or something like that...
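
To make the fingerprint idea above concrete, here is a minimal sketch in Python. It is only an illustration of the thought experiment, not anything that exists in PostgreSQL: the file name sync_done.signal, the JSON layout, the flag values, and all function names are assumptions made up for this example. The fingerprint hashes the checkpoint LSN, hostname, MAC address, and pgdata path, and any mismatch or unreadable file falls back to the conservative answer (do the sync).

```python
import hashlib
import json
import os
import socket
import uuid

# Hypothetical file name; PostgreSQL has no such signal file.
SIGNAL_FILE = "sync_done.signal"


def make_fingerprint(checkpoint_lsn, pgdata):
    """Hash the identity of this copy: checkpoint LSN, host, MAC, pgdata path."""
    parts = [
        checkpoint_lsn,
        socket.gethostname(),
        format(uuid.getnode(), "012x"),  # hardware address as seen by Python
        os.path.realpath(pgdata),
    ]
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()


def write_signal(pgdata, checkpoint_lsn, synced):
    """Record what is known to be durable.

    `synced` is one of the illustrative flags 'wal',
    'pgdata_to_checkpoint', or 'all'.
    """
    payload = {
        "fingerprint": make_fingerprint(checkpoint_lsn, pgdata),
        "synced": synced,
    }
    with open(os.path.join(pgdata, SIGNAL_FILE), "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # the hint itself must be durable


def can_skip_sync(pgdata, checkpoint_lsn):
    """Return True only on an exact fingerprint match with everything synced."""
    try:
        with open(os.path.join(pgdata, SIGNAL_FILE)) as f:
            payload = json.load(f)
    except (OSError, ValueError):
        return False  # missing or corrupt file: be conservative
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("fingerprint") == make_fingerprint(checkpoint_lsn, pgdata)
        and payload.get("synced") == "all"
    )
```

Because the pgdata path is part of the hash, a signal file copied along with the data directory to another location no longer matches, which is the invalidation property the paragraph asks for. Whether these inputs are enough to rule out false positives is exactly the open question raised above.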

I'm not too keen to invent clever new schemes for PG14, though.  This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities.  Is
that useful enough?  In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.
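
For readers who want to see why the per-file approach is expensive, the following Python sketch walks a directory tree and fsyncs every file, counting how many files had to be opened. It is an illustration only, not PostgreSQL's implementation; the function name fsync_tree is made up, and it assumes a POSIX system where a directory can be opened read-only. A syncfs-based approach would replace the whole loop with one call per filesystem; the Python standard library does not expose syncfs directly, so only the per-file variant is shown.

```python
import os


def fsync_tree(root):
    """fsync every file under root and return how many files were opened.

    Each file costs one open(), one fsync() and one close(), which is
    what hurts when there are very many uncached files on slow storage.
    """
    files_synced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            files_synced += 1
        # Also sync the directory so that its entries are durable.
        # Best effort: some filesystems reject fsync on a directory.
        try:
            dfd = os.open(dirpath, os.O_RDONLY)
        except OSError:
            continue
        try:
            os.fsync(dfd)
        except OSError:
            pass
        finally:
            os.close(dfd)
    return files_synced
```

The return value grows linearly with the number of files, whereas the cost of a single filesystem-wide sync depends on the amount of dirty data rather than on the file count.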

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea.  We already have a
running-with-scissors mode you could use for that: fsync=off.


I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

They can skip that sync with fsync=off. But if they want to skip only that
startup sync and have subsequent recovery (or the standby server) run with
fsync=on, they would need to shut down the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be simple and helpful for this case.
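
The decision being discussed can be summarized in a small Python sketch. This is a simplified model of the behavior described in this thread, not the server's code: the function name needs_startup_sync is invented for illustration, and the value "none" for sync_after_crash is only the proposal made here.

```python
# Control-file states after which no startup sync is performed,
# as described in this thread.
CLEAN_STATES = {"DB_SHUTDOWNED", "DB_SHUTDOWNED_IN_RECOVERY"}


def needs_startup_sync(state, fsync_enabled, sync_after_crash="fsync"):
    """Return True if the data directory would be synced at startup.

    state            -- cluster state recorded in the control file
    fsync_enabled    -- value of the fsync setting
    sync_after_crash -- 'fsync', 'syncfs', or the proposed 'none'
    """
    if state in CLEAN_STATES:
        return False  # clean shutdown: nothing to re-sync
    if not fsync_enabled:
        return False  # fsync=off skips the startup sync as well
    return sync_after_crash != "none"
```

With fsync=off the first start skips the sync, and after a shutdown the state is DB_SHUTDOWNED_IN_RECOVERY, so the second start with fsync=on skips it too; the proposed setting would give the same result in a single start:

```python
needs_startup_sync("DB_IN_ARCHIVE_RECOVERY", fsync_enabled=False)
needs_startup_sync("DB_SHUTDOWNED_IN_RECOVERY", fsync_enabled=True)
needs_startup_sync("DB_IN_ARCHIVE_RECOVERY", True, sync_after_crash="none")
```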

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
