My buildfarm animal dromedary ran out of disk space yesterday, which
I found rather surprising because the last time I'd looked it had
tens of GB to spare.  On investigation, the problem was lots and lots
of core images in /cores, which is where macOS drops them (by default
at least).  It looked like I was getting one new core image per
buildfarm run, even successful runs.  Even odder, they mostly seemed
to be images from /bin/cp, not Postgres.

The mechanism turns out to be this: the
src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
down its replica server with an immediate-mode stop, which causes
that postmaster to shut down all its children with SIGQUIT, and
in particular that signal propagates to a "cp" command that the
archiver process is executing.  The "cp" is unsurprisingly running
with default SIGQUIT handling, which per the signal man page
includes dumping core.
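
To make the mechanism concrete, here is a minimal sketch of it outside
Postgres.  It is not the postmaster's code: the function name
sigquit_child and the use of "sleep 60" as a stand-in for the archiver's
"cp" are assumptions made purely for illustration.  The child is exec'd
with default SIGQUIT handling, the parent sends SIGQUIT, and the wait
status tells us which signal ended the child and whether the kernel
flagged a core dump.

```python
import os
import resource
import signal


def sigquit_child(core_limit=None):
    """Fork, exec `sleep 60`, send it SIGQUIT; return (signal, core_flag)."""
    # os.pipe() fds are close-on-exec, so the parent sees EOF once the
    # child's exec has succeeded (or the child has exited).
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: hypothetical stand-in for the archiver's "cp".
        try:
            os.close(read_fd)
            if core_limit is not None:
                # Optional cap, like "ulimit -c"; 0 asks for no core file.
                resource.setrlimit(resource.RLIMIT_CORE,
                                   (core_limit, core_limit))
            signal.signal(signal.SIGQUIT, signal.SIG_DFL)
            os.execvp("sleep", ["sleep", "60"])
        finally:
            os._exit(127)  # only reached if setup or exec failed

    os.close(write_fd)
    os.read(read_fd, 1)  # blocks until the child has exec'd
    os.close(read_fd)

    os.kill(pid, signal.SIGQUIT)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return os.WTERMSIG(status), os.WCOREDUMP(status)
    return None, False


if __name__ == "__main__":
    sig, dumped = sigquit_child()
    print("terminated by signal", sig, "core flag:", dumped)
```

Whether the core flag comes back true depends on the platform's core
limit and core-handling configuration, so the sketch reports it rather
than assuming a value.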

This makes me wonder whether we shouldn't be using some other signal
to shut down archiver subprocesses.  It's not real cool if we're
spewing cores all over the place.  Admittedly, production servers
are likely running with "ulimit -c 0" on most modern platforms,
so this might not be a huge problem in the field; but accumulation
of core files could be a problem anywhere that's configured to allow
server core dumps.
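
For reference, the effect of "ulimit -c 0" can be reproduced for a single
child process by lowering RLIMIT_CORE between fork and exec.  The sketch
below is an illustration only; the helper names are made up, and it
assumes the platform's "sh" supports "ulimit -c" (common, though not
required by POSIX).  It starts a shell with the limit forced to zero and
asks the shell what limit it sees.

```python
import resource
import subprocess


def disable_core_dumps():
    """Runs in the child just before exec: equivalent of "ulimit -c 0"."""
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))


def core_limit_seen_by_child():
    """Start a shell with core dumps disabled and ask it for its limit."""
    result = subprocess.run(
        ["sh", "-c", "ulimit -c"],
        preexec_fn=disable_core_dumps,  # hypothetical helper defined above
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("child core limit:", core_limit_seen_by_child())
```

The limit is inherited across exec, so anything the child runs in turn
is covered as well.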

I suspect the reason we've not noticed this in the buildfarm is that most
of the platforms there are configured to dump core into the data directory,
where it'll be thrown away when we clean up after the run.  But aside
from macOS, the machines running recent systemd releases are likely
accumulating cores somewhere behind the scenes, since systemd has seen
fit to insert itself into core-handling along with everything else.

Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
non-Postgres child processes.  But redesigning the system's signal
handling to make that possible seems like a bit of a mess.
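
The appeal of SIGINT is that its default action is plain termination,
whereas SIGQUIT's default action is termination plus a core dump.  As a
rough sketch of the difference from the parent's side (again using
"sleep 60" as a hypothetical stand-in for an external command, and a
made-up helper name), the following stops a child with a chosen signal
and returns how it exited:

```python
import signal
import subprocess


def stop_child_with(sig):
    """Start `sleep 60`, stop it with `sig`, and return its exit status."""
    proc = subprocess.Popen(
        ["sleep", "60"],
        # Make sure the child has the default disposition for `sig`,
        # even if the parent process happens to ignore it.
        preexec_fn=lambda: signal.signal(sig, signal.SIG_DFL),
    )
    proc.send_signal(sig)
    # A negative return code -N means "terminated by signal N".
    return proc.wait(timeout=10)


if __name__ == "__main__":
    print("SIGINT exit status:", stop_child_with(signal.SIGINT))
```

This says nothing about how the postmaster would tell Postgres children
apart from non-Postgres ones, which is the messy part.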

Thoughts?

                        regards, tom lane

