Hi,

On 2020-01-18 14:11:12 -0600, Justin Pryzby wrote:
> On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> > On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > > One of our PG12 instances was in crash recovery for an embarrassingly
> > > long time after hitting ENOSPC. (Note, I first started writing this
> > > mail 10 months ago while running PG11 after having same experience
> > > after OOM). Running linux.
> > >
> > > As I understand, the first thing that happens is syncing every file in
> > > the data dir, like in initdb --sync. These instances were both 5+TB on
> > > zfs, with compression, so that's slow, but tolerable, and at least
> > > understandable, and with visible progress in ps.
> > >
> > > The 2nd stage replays WAL. strace shows it's occasionally running
> > > sync_file_range, and I think recovery might've been several times
> > > faster if we'd just dumped the data at the OS ASAP, fsync once per
> > > file. In fact, I've just kill -9 the recovery process and edited the
> > > config to disable this lest it spend all night in recovery.
> >
> > I'm not quite sure what you mean here with "fsync once per file". The
> > sync_file_range doesn't actually issue an fsync, even if it sounds like
> > it.
>
> I mean if we didn't call sync_file_range() and instead let the kernel handle
> the writes and then fsync() at end of checkpoint, which happens in any case.

Yea, but then more writes have to be done at the end, instead of in
parallel with other work during checkpointing. The kernel will often end
up starting to write back buffers before that - but without much concern
for locality, so it'll be a lot more random writes.

> > > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
> >
> > FWIW, I still think this is the wrong default, and that it causes our
> > users harm.
>
> I have no opinion about the default, but the maximum seems low, as a maximum.
> Why not INT_MAX, like wal_writer_flush_after ?

Because it requires a static memory allocation, and that'd not be all
that trivial to change (we may be in a critical section, so can't
allocate). And issuing them in a larger batch will often stall within
the kernel, anyway - there's a limited number of writes the kernel can
have in progress at once. We could make it a PGC_POSTMASTER variable,
and allocate at server start, but that seems like a cure worse than the
disease.

wal_writer_flush_after doesn't have that concern, because it works
differently.

Greetings,

Andres Freund
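
To make the two points above concrete, here is a minimal, Linux-only sketch
of the general pattern: writeback hints are collected in a fixed-size array
(so nothing is allocated while queuing), handed to the kernel with
sync_file_range(SYNC_FILE_RANGE_WRITE) once the array fills up, and the
file is still made durable by a single fsync() at the end. This is not
PostgreSQL's actual code. FlushQueue, PendingRange, MAX_PENDING,
flush_queue_add(), flush_queue_issue() and write_blocks_then_fsync() are
hypothetical names chosen for illustration, and the cap of 32 is arbitrary.

```c
#define _GNU_SOURCE            /* exposes sync_file_range() on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE  8192       /* assumption: PostgreSQL's default block size */
#define MAX_PENDING 32         /* hypothetical cap, fixed at compile time */

typedef struct PendingRange
{
    int   fd;
    off_t offset;
    off_t nbytes;
} PendingRange;

/* Fixed-capacity queue: queuing a range never needs a heap allocation. */
typedef struct FlushQueue
{
    int          npending;
    PendingRange ranges[MAX_PENDING];
} FlushQueue;

/*
 * Ask the kernel to start writing back every queued range.
 * SYNC_FILE_RANGE_WRITE only initiates writeback; it does not wait for
 * completion and gives no durability guarantee, so errors are ignored here
 * and the call is treated as a pure hint. Returns the number of ranges hinted.
 */
int
flush_queue_issue(FlushQueue *q)
{
    int hinted = q->npending;

    for (int i = 0; i < q->npending; i++)
        (void) sync_file_range(q->ranges[i].fd, q->ranges[i].offset,
                               q->ranges[i].nbytes, SYNC_FILE_RANGE_WRITE);
    q->npending = 0;
    return hinted;
}

/*
 * Remember one written range. When the queue is full, issue the whole
 * batch. Returns the number of ranges hinted by this call (0 if the range
 * was only queued).
 */
int
flush_queue_add(FlushQueue *q, int fd, off_t offset, off_t nbytes)
{
    assert(q->npending < MAX_PENDING);

    q->ranges[q->npending].fd = fd;
    q->ranges[q->npending].offset = offset;
    q->ranges[q->npending].nbytes = nbytes;
    q->npending++;

    if (q->npending == MAX_PENDING)
        return flush_queue_issue(q);
    return 0;
}

/*
 * Write nblocks blocks to fd, hinting writeback in batches along the way,
 * then fsync() once at the end. Returns 0 on success, -1 on error.
 */
int
write_blocks_then_fsync(int fd, int nblocks)
{
    FlushQueue queue = {0};    /* lives on the stack, no malloc() */
    char       block[BLOCK_SIZE];

    memset(block, 'x', sizeof(block));
    for (int i = 0; i < nblocks; i++)
    {
        off_t offset = (off_t) i * BLOCK_SIZE;

        if (pwrite(fd, block, sizeof(block), offset) != (ssize_t) sizeof(block))
            return -1;
        flush_queue_add(&queue, fd, offset, BLOCK_SIZE);
    }
    flush_queue_issue(&queue); /* hint whatever is still queued */
    return fsync(fd);          /* durability still comes from fsync() */
}
```

Dropping the two flush_queue_* calls from write_blocks_then_fsync() gives
the alternative Justin describes: the kernel decides on its own when to
write back, and fsync() has to push out whatever is still dirty at the end.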