Re: should crash recovery ignore checkpoint_flush_after ?

Andres Freund Sat, 18 Jan 2020 15:22:48 -0800

Hi,

On 2020-01-18 14:11:12 -0600, Justin Pryzby wrote:
> On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> > On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > > One of our PG12 instances was in crash recovery for an embarrassingly long
> > > time after hitting ENOSPC.  (Note, I first started writing this mail 10
> > > months ago while running PG11, after having the same experience after OOM.)
> > > Running Linux.
> > > 
> > > As I understand, the first thing that happens is syncing every file in the
> > > data dir, like in initdb --sync.  These instances were both 5+TB on zfs,
> > > with compression, so that's slow, but tolerable, and at least
> > > understandable, and with visible progress in ps.
> > >
> > > The 2nd stage replays WAL.  strace shows it's occasionally running
> > > sync_file_range, and I think recovery might've been several times faster if
> > > we'd just dumped the data at the OS ASAP, fsync once per file.  In fact,
> > > I've just kill -9'd the recovery process and edited the config to disable
> > > this lest it spend all night in recovery.
> > 
> > I'm not quite sure what you mean here with "fsync once per file". The
> > sync_file_range doesn't actually issue an fsync, even if it sounds like it.
> 
> I mean if we didn't call sync_file_range() and instead let the kernel handle
> the writes and then fsync() at end of checkpoint, which happens in any
> case.


Yea, but then more writes have to be done at the end, instead of in
parallel with other work during checkpointing. The kernel will often end
up starting to write back buffers before that - but without much concern
for locality, so it'll be a lot more random writes.
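
As a rough illustration of the locality point, here is a toy Python model. It
is not PostgreSQL code, and coalesce() and count_seeks() are names made up for
this sketch. It assumes that a write which does not directly follow the
previous block costs one "seek", and compares writing dirty blocks in arrival
order against handing them over sorted and merged into runs.

```python
def coalesce(blocks):
    """Merge block numbers into sorted (start, length) runs of neighbours."""
    runs = []
    for block in sorted(set(blocks)):
        if runs and runs[-1][0] + runs[-1][1] == block:
            # Block directly follows the current run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap before this block: start a new run.
            runs.append((block, 1))
    return runs


def count_seeks(order):
    """Count writes that do not directly follow the previously written block."""
    seeks = 0
    prev = None
    for block in order:
        if prev is None or block != prev + 1:
            seeks += 1
        prev = block
    return seeks


if __name__ == "__main__":
    dirty = [5, 3, 4, 10, 11, 20]      # arrival order, hypothetical block numbers
    print(count_seeks(dirty))          # cost in arrival order
    print(count_seeks(sorted(dirty)))  # cost when written in sorted order
    print(coalesce(dirty))             # runs a sorted writer could issue
```

With those six blocks the sorted order needs one seek per run, while arrival
order pays an extra seek for the out-of-place block. On a real checkpoint with
far more scattered blocks the gap would be much larger.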



> > > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
> > 
> > FWIW, I still think this is the wrong default, and that it causes our
> > users harm.
> 
> I have no opinion about the default, but the maximum seems low, as a maximum.
> Why not INT_MAX, like wal_writer_flush_after ?

Because it requires a static memory allocation, and that'd not be all
that trivial to change (we may be in a critical section, so can't
allocate). And issuing the flush requests in a larger batch will often
stall within the kernel anyway - there's a limited number of writes the
kernel can have in progress at once. We could make it a PGC_POSTMASTER
variable and allocate at server start, but that seems like a cure worse
than the disease.

wal_writer_flush_after doesn't have that concern, because it works
differently.
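
To sketch why the static allocation caps the setting, here is a toy Python
model. It is not the actual implementation: the PendingWritebacks class, its
fields, and the capacity used below are all assumptions made for illustration.
Python allocates freely behind the scenes, so this only models the shape of
the constraint. The slot array is sized once up front, and the flush_after
value can never be larger than that array.

```python
class PendingWritebacks:
    """Fixed-capacity buffer of dirty block numbers (illustrative model only)."""

    def __init__(self, capacity, flush_after):
        # The setting cannot exceed the preallocated number of slots.
        if not 0 <= flush_after <= capacity:
            raise ValueError("flush_after must be between 0 and capacity")
        self.slots = [0] * capacity  # sized once, never grown afterwards
        self.used = 0
        self.flush_after = flush_after
        self.issued = []             # (start, length) ranges handed to the "kernel"

    def schedule(self, block):
        """Remember a written block; issue the batch once flush_after is reached."""
        if self.flush_after == 0:    # 0 means writeback hints are disabled
            return
        self.slots[self.used] = block
        self.used += 1
        if self.used >= self.flush_after:
            self.issue()

    def issue(self):
        """Sort the pending blocks, merge neighbours, and record the ranges."""
        start, length = None, 0
        for block in sorted(self.slots[:self.used]):
            if start is not None and block < start + length:
                continue             # duplicate of a block already in the run
            if start is not None and block == start + length:
                length += 1          # extends the current run
            else:
                if start is not None:
                    self.issued.append((start, length))
                start, length = block, 1
        if start is not None:
            self.issued.append((start, length))
        self.used = 0                # slots are reused, not reallocated


if __name__ == "__main__":
    queue = PendingWritebacks(capacity=4, flush_after=3)
    for block in (7, 5, 6, 9):
        queue.schedule(block)
    queue.issue()                    # flush whatever is left at the end
    print(queue.issued)
```

Raising the limit in this model means enlarging the slot array, which is the
part that would have to happen at server start rather than at the point where
the blocks are being written.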

Greetings,

Andres Freund
