On Tue, 12 May 2020, 08:42 Paul Guo, <p...@pivotal.io> wrote:

> Hello hackers,
>
> 1. StartupXLOG() does fsync on the whole data directory early in the crash
> recovery. I'm wondering if we could skip some directories (at least the
> pg_log/, table directories) since wal, etc could ensure consistency. Here
> is the related code.
>
>       if (ControlFile->state != DB_SHUTDOWNED &&
>           ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
>       {
>           RemoveTempXlogFiles();
>           SyncDataDirectory();
>       }
>

This would actually be a good candidate for a thread pool. Dispatch sync
requests and don't wait. Come back later when they're done.

I'm not sure that's feasible, though, given that pretty much none of the Pg
APIs are thread safe: no palloc, no elog/ereport, etc. And I don't think
we're ready to run bgworkers or use shm_mq etc. at that stage either.
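
To make the dispatch-and-come-back-later idea concrete, here is a rough
sketch using plain pthreads and libc only (no palloc, no elog). All the names
in it (SyncPool, sync_pool_start, sync_pool_finish, SYNC_WORKERS) are
hypothetical and are not PostgreSQL APIs; it is only meant to show the shape,
not how SyncDataDirectory() would really be restructured.

```c
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define SYNC_WORKERS 4          /* hypothetical pool size */

/* Shared work queue: a fixed array of paths plus a cursor. */
typedef struct SyncQueue
{
    const char **paths;
    int          npaths;
    int          next;          /* index of the next unclaimed path */
    int          failures;      /* paths that could not be synced */
    pthread_mutex_t lock;
} SyncQueue;

typedef struct SyncPool
{
    SyncQueue   q;
    pthread_t   threads[SYNC_WORKERS];
    int         nthreads;
} SyncPool;

/* Worker: claim one path at a time and fsync it. Uses libc calls only. */
static void *
sync_worker(void *arg)
{
    SyncQueue  *q = arg;

    for (;;)
    {
        int         idx, fd, ok;

        pthread_mutex_lock(&q->lock);
        idx = (q->next < q->npaths) ? q->next++ : -1;
        pthread_mutex_unlock(&q->lock);
        if (idx < 0)
            break;

        /* Assumption: fsync on a read-only fd works (true on Linux). */
        fd = open(q->paths[idx], O_RDONLY);
        ok = (fd >= 0 && fsync(fd) == 0);
        if (fd >= 0)
            close(fd);

        if (!ok)
        {
            pthread_mutex_lock(&q->lock);
            q->failures++;
            pthread_mutex_unlock(&q->lock);
        }
    }
    return NULL;
}

/* Dispatch: start the workers and return without waiting for them. */
static int
sync_pool_start(SyncPool *pool, const char **paths, int npaths)
{
    int         i;

    pool->q.paths = paths;
    pool->q.npaths = npaths;
    pool->q.next = 0;
    pool->q.failures = 0;
    pool->nthreads = 0;
    if (pthread_mutex_init(&pool->q.lock, NULL) != 0)
        return -1;

    for (i = 0; i < SYNC_WORKERS; i++)
    {
        if (pthread_create(&pool->threads[pool->nthreads], NULL,
                           sync_worker, &pool->q) == 0)
            pool->nthreads++;
    }
    if (pool->nthreads == 0)
    {
        pthread_mutex_destroy(&pool->q.lock);
        return -1;
    }
    return 0;
}

/* Come back later: wait for the workers, return the number of failures. */
static int
sync_pool_finish(SyncPool *pool)
{
    int         i;

    for (i = 0; i < pool->nthreads; i++)
        pthread_join(pool->threads[i], NULL);
    pthread_mutex_destroy(&pool->q.lock);
    return pool->q.failures;
}
```

The caller would do other startup work between sync_pool_start() and
sync_pool_finish(), and do any error reporting from the main thread once the
workers have been joined.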

Of course if OSes would provide asynchronous IO interfaces that weren't
utterly vile and broken, we wouldn't have to worry...


>
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and
> does fsync. It might be good to do fsync for all files once after writing
> them, given the kernel is able to do asynchronous flush when writing those
> file contents. If the TwoPhaseState->numPrepXacts is large we could do
> batching to avoid the fd resource limit. I did not test them yet but this
> should be able to speed up checkpoint/restartpoint a bit.
>

I seem to recall some hints we can set on an FD or mmapped range that
encourage dirty buffers to be written without blocking us, too. I'll have
to look them up...
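
As a guess at the kind of hint meant here: on Linux, sync_file_range() with
SYNC_FILE_RANGE_WRITE starts writeback without waiting for it, and
posix_fadvise() with POSIX_FADV_DONTNEED has a similar effect there. The
sketch below combines such a hint with the write-everything-then-fsync
batching from the quoted text. The helper names and MAX_BATCH are
hypothetical, and none of this has been benchmarked.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux/glibc */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_BATCH 64            /* hypothetical cap, stands in for the fd limit */

/* Ask the kernel to start writing back fd's dirty pages without waiting. */
static void
hint_writeback(int fd)
{
#if defined(SYNC_FILE_RANGE_WRITE)
    /* offset 0 with nbytes 0 means "through end of file" */
    (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    (void) fd;                  /* no hint available: rely on the later fsync */
#endif
}

/*
 * Write the same payload to n files, hinting writeback after each write,
 * then fsync them all in a second pass. Returns 0 on success, -1 on error.
 */
static int
write_batch_then_sync(const char **paths, int n, const void *data, size_t len)
{
    int         fds[MAX_BATCH];
    int         opened = 0;
    int         rc = 0;
    int         i;

    if (n > MAX_BATCH)
        return -1;

    /* Pass 1: write each file and nudge the kernel to start flushing it. */
    for (i = 0; i < n; i++)
    {
        int         fd = open(paths[i], O_CREAT | O_TRUNC | O_WRONLY, 0600);

        if (fd < 0)
        {
            rc = -1;
            break;
        }
        fds[opened++] = fd;
        if (write(fd, data, len) != (ssize_t) len)
        {
            rc = -1;
            break;
        }
        hint_writeback(fd);
    }

    /* Pass 2: fsync only now, by which time much may already be on disk. */
    for (i = 0; i < opened; i++)
    {
        if (rc == 0 && fsync(fds[i]) != 0)
            rc = -1;
        close(fds[i]);
    }
    return rc;
}
```

The hint is advisory only; the fsync() in the second pass is still what
provides the durability guarantee.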


>
