On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pry...@telsasoft.com> wrote:
> > #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> >         edata = <value optimized out>
>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno.
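
For reference, the session against the core went roughly like this (the
binary and core file paths below are placeholders, not the real ones):

    $ gdb /path/to/postgres /path/to/core
    (gdb) frame 2
    (gdb) print *edata
    (gdb) print edata->saved_errno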
As you saw... unless someone knows a trick, it's "optimized out".

> Could it have been fleetingly full due to some other thing happening on the
> system that rapidly expands and contracts?

It's not impossible, especially while loading data, and data_dir is only 64GB;
it may have happened that way sometimes, but it's hard to believe that's been
the case 5-10 times now.  When loading old/historic data, as long as I remember
to drop the previously loaded database first, there should be ~40GB free on
data_dir and no clients connected other than pg_restore.

$ df -h /var/lib/pgsql
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/data-postgres   64G   26G   38G  41% /var/lib/pgsql

> | ereport(PANIC,
> |         (errcode_for_file_access(),
> |          errmsg("could not write to file \"%s\": %m",
> |                 tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is
> ext4 plus zfs tablespace on top of LVM on top of vmware thin volume.

I realized this probably is *not* an issue with zfs, since it's failing to log
(for one reason or another) to /var/lib/pgsql, which is ext4.

> Perhaps it would be clearer what's going on if you could put the PostgreSQL
> log onto a different filesystem, so we get a better chance of collecting
> evidence?

I didn't mention it, but last weekend I'd left a loop around the restore
process running overnight, and had convinced myself the issue hadn't recurred
since their faulty blade was taken out of service...

My plan was to leave the server running in the foreground with
logging_collector=no (rough sketch at the end of this mail), which I hope is
enough, unless logging is itself somehow implicated.  I'm trying to stress
test that way now.

> But then... the parallel leader process was apparently able
> to log something -- maybe it was just lucky, but you said this
> happened this way more than once.  I'm wondering how it could be that
> you got some kind of IO failure and weren't able to log the PANIC
> message AND your postmaster was killed, and you were able to log a
> message about that.  Perhaps we're looking at evidence from two
> unrelated failures.

The messages from the parallel leader (building indices) were visible to the
client, not in the server log: I was loading their data, and the errors showed
up when pg_restore failed.

On Wed, Jul 24, 2019 at 09:10:41AM +1200, Thomas Munro wrote:
> Just by the way, parallelism in CREATE INDEX is controlled by
> max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

Thank you.

Justin
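
PS: the foreground run mentioned above is essentially this (the binary and
data directory paths are placeholders; the point is that stderr is captured
on a different filesystem than the data dir):

    $ /path/to/postgres -D /path/to/data -c logging_collector=no \
          > /tmp/postgres-foreground.log 2>&1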