Hi,

On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> One of our PG12 instances was in crash recovery for an embarrassingly long
> time after hitting ENOSPC. (Note, I first started writing this mail 10
> months ago while running PG11, after having the same experience after OOM.)
> Running linux.
>
> As I understand, the first thing that happens is syncing every file in the
> data dir, like in initdb --sync. These instances were both 5+TB on zfs, with
> compression, so that's slow, but tolerable, and at least understandable, and
> with visible progress in ps.
>
> The 2nd stage replays WAL. strace shows it's occasionally running
> sync_file_range, and I think recovery might've been several times faster if
> we'd just dumped the data at the OS ASAP, fsync once per file. In fact, I've
> just kill -9'd the recovery process and edited the config to disable this
> lest it spend all night in recovery.
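The setting being disabled here is presumably checkpoint_flush_after; as an illustrative sketch (my wording, not the original report's), the config change described would look something like:

    # postgresql.conf: stop issuing sync_file_range() write-back hints during
    # checkpoints (including the end-of-recovery checkpoint).  The default on
    # Linux is 256kB; 0 disables the hints entirely.
    checkpoint_flush_after = 0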
I'm not quite sure what you mean here with "fsync once per file". The
sync_file_range call doesn't actually issue an fsync, even if it sounds like
it. In the case of checkpointing, what it basically does is ask the kernel to
please start writing data back immediately, instead of waiting until the
fsyncs at the absolute end of the checkpoint. IOW, the data is going to be
written back *anyway* in short order. It's possible that ZFS's compression
just does broken things here, I don't know.

> That GUC is intended to reduce latency spikes caused by checkpoint fsync.
> But I think limiting to default 256kB between syncs is too limiting during
> recovery, and at that point it's better to optimize for throughput anyway,
> since no other backends are running (in that instance) and cannot run until
> recovery finishes.

I don't think that'd be good by default - in my experience the stalls caused
by the kernel writing back massive amounts of data at once are also
problematic during recovery (and can lead to much higher %sys too). You get
the pattern of the fsync at the end taking forever, while IO is idle before
that. And you'd get the latency spikes once recovery is over too.

> At least, if this setting is going to apply during recovery, the
> documentation should mention it (it's a "recovery checkpoint")

That makes sense.

> See also
> 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).

FWIW, I still think this is the wrong default, and that it causes our users
harm. It only makes sense because the reverse was the default before. But it's
easy to see quite massive stalls even on fast NVMe SSDs (as in tens of seconds
with no transactions committing, in an OLTP workload). Nor do I think it is
really comparable with the checkpointing setting, because there we *know* that
we're about to fsync the file, whereas in the backend case we might just use
the fs page cache as an extension of shared buffers.

Greetings,

Andres Freund
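For illustration only (a minimal sketch, not PostgreSQL's actual pg_flush_data() implementation, and hint_writeback is a made-up name): on Linux the write-back hint described above amounts to sync_file_range() with SYNC_FILE_RANGE_WRITE, which initiates write-out and returns immediately, with none of fsync()'s durability guarantees:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Illustrative helper: after writing a range of a file, ask the kernel
     * to begin writing it back now.  This does not wait for completion and
     * does not make the data durable; the fsync() at the end of the
     * checkpoint still happens, it just finds far less dirty data to flush.
     */
    static int
    hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

An fsync(fd) at the end is still what provides durability; the hint just spreads the write-out over time instead of leaving it all for that final fsync.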