On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).
Note, the ALTER was (I think) building an index in a parallel process:

2019-11-15 22:16:08.752-05 | 5976  | terminating connection because of crash of another server process
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096

FYI, that table is *currently* (5 days later):

ts=# \dti+ child.eric_enodeb_cell_201811*
 child | eric_enodeb_cell_201811          | table | telsasoft |                         | 2595 MB |
 child | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811 | 120 kB  |
 child | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811 | 16 MB   |

I don't know if that table is likely to be the one with relfilenode 1731839470 (but it certainly wasn't its index), or if that was maybe a table (or index) from an earlier ALTER; a query sketch for mapping a filenode back to a relation is at the end of this mail.  I tentatively think we wouldn't have had any other tables being dropped, partitions pruned, or maintenance commands running.

Checkpoint logs for good measure:

2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time

I was trying to reproduce what was happening:

set -x
psql postgres -txc "DROP TABLE IF EXISTS t" \
        -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"
echo "begin; SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot &
sleep 3
snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`
PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT
pg_dump -d postgres -t t --snap="$snap" |head -44

Under v12, with or without the CHECKPOINT command, it fails:
|pg_dump: error: query failed: ERROR: cache lookup failed for index 0

But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
|pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391

With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be needed, or why it would behave differently (or if it's related to this crash).
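For the filenode question above: if the relation behind base/16491/1731839470 still exists, something like the sketch below should map the filenode back to a name.  This is only a sketch: it assumes it's run in the database whose OID is 16491 (the directory in the failing path), and pg_filenode_relation()'s first argument is 0 for the database's default tablespace.  Neither query returns anything if the file belonged to a relation that has since been dropped or rewritten, since pg_class only records the current filenode.

-- map a filenode to the relation currently using it
SELECT oid::regclass AS rel, relkind
FROM pg_class
WHERE relfilenode = 1731839470;

-- or the same thing via the helper function (0 = default tablespace)
SELECT pg_filenode_relation(0, 1731839470);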