On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pry...@telsasoft.com> wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.

I meant it wasn't a trivial error on my part of failing to drop the previously
loaded DB instance.

It occurred to me to check inodes, which can also cause ENOSPC.  This is mkfs -T
largefile, so running out of inodes is not an impossibility.  But it seems an
unlikely culprit, unless something made tens of thousands of (small) files.

[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem                Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres  65536  5605 59931    9% /var/lib/pgsql

> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different.  Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log?  You mentioned that you couldn't see any signs of the OOM
> killer.  Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?

$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
        Maximum resident set size (kbytes): 4276

$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB

I wouldn't burn too much more time on it until I can reproduce it.  The
failures were all during pg_restore, so checkpointer would've been very busy.
It seems possible for it to notice ENOSPC before the workers, which would be
fsyncing WAL, whereas checkpointer is fsyncing data.

> Admittedly it is quite hard for to distinguish between a web browser
> and a program designed to eat memory as fast as possible...

Browsers making lots of progress here but still clearly 2nd place.

Justin
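
PS. On the inode point: ENOSPC can come from running out of either data blocks
or inodes, so it may be worth checking both in one go right after a failure.
A minimal sketch of that check using Python's os.statvfs; the function name
space_report and the default path are hypothetical, picked only for this
illustration, and nothing like this was actually run here.

```python
import os
import sys


def space_report(path):
    """Return (free_block_pct, free_inode_pct) for the filesystem holding path.

    Either value is None when the filesystem reports a total of zero
    (e.g. some pseudo-filesystems, or filesystems without a fixed inode table).
    """
    st = os.statvfs(path)
    # f_bavail / f_favail count what is available to unprivileged users.
    blocks = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else None
    inodes = 100.0 * st.f_favail / st.f_files if st.f_files else None
    return blocks, inodes


if __name__ == "__main__":
    # Hypothetical default; pass the data directory as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    blocks, inodes = space_report(target)
    print("free blocks %:", blocks)
    print("free inodes %:", inodes)
```

Either number getting close to zero would be a candidate explanation for
ENOSPC; df and df -i report the same information from the shell.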