In connection with a nearby thread, I tried to run the subscription test suite in a CLOBBER_CACHE_ALWAYS build. I soon found that I had to increase wal_receiver_timeout, but after doing this:
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 1488bff..5fe6810 100644 *** a/src/test/perl/PostgresNode.pm --- b/src/test/perl/PostgresNode.pm *************** sub init *** 447,452 **** --- 447,453 ---- print $conf "log_statement = all\n"; print $conf "log_replication_commands = on\n"; print $conf "wal_retrieve_retry_interval = '500ms'\n"; + print $conf "wal_receiver_timeout = '10min'\n"; # If a setting tends to affect whether tests pass or fail, print it after # TEMP_CONFIG. Otherwise, print it before TEMP_CONFIG, thereby permitting I let it run overnight, and came back to find that it was stuck at [03:02:15] t/013_partition.pl ................. 19/51 and had been for circa eight hours, where extrapolation from other tests said it shouldn't take much over half an hour. Investigation found that the subscriber was repeatedly failing like this: 2020-09-14 11:05:26.483 EDT [1030506] LOG: logical replication apply worker for subscription "sub1" has started 2020-09-14 11:05:27.139 EDT [1030506] ERROR: cache lookup failed for relation 0 2020-09-14 11:05:27.140 EDT [947156] LOG: background worker "logical replication worker" (PID 1030506) exited with exit code 1 2020-09-14 11:05:27.571 EDT [1030509] LOG: logical replication apply worker for subscription "sub1" has started 2020-09-14 11:05:28.227 EDT [1030509] ERROR: cache lookup failed for relation 0 2020-09-14 11:05:28.228 EDT [947156] LOG: background worker "logical replication worker" (PID 1030509) exited with exit code 1 The publisher's log shows no sign of distress: 2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG: received replication command: IDENTIFY_SYSTEM 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"') 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG: starting logical decoding for slot "sub1" 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL: Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08. 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG: logical decoding found consistent point at 0/163CF08 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL: There are no running transactions. 2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG: received replication command: IDENTIFY_SYSTEM 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"') 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: starting logical decoding for slot "sub1" 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL: Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08. 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG: logical decoding found consistent point at 0/163CF08 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL: There are no running transactions. I do not have time today to chase this further, but somebody should. More generally, this seems like good evidence that we really oughta have a buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS not just the core tests. That seems impossibly expensive, but I realized while watching these tests that a ridiculous fraction of the runtime is being spent in repeated initdb calls. On my machine, initdb takes about 12 minutes under CCA, so doing it separately for publisher and subscriber means 24 minutes, which compares not at all favorably to the circa-half-an-hour total runtime of each of the subscription test scripts. We're surely not learning anything after the first CCA run of initdb, so if we could find a way to skip that overhead for later runs, it'd make a huge difference in the practicality of running these tests under CCA. I recall having worked on a patch to make the regression tests run initdb just once, creating a template directory tree, and then "cp -a" that into place for each test. I did not finish it, because it wasn't showing a lot of advantage in a normal test run, but maybe the idea could be resurrected for CCA and other slow builds. Another idea is to make CCA a little more dynamic, say allow it to be suppressed through an environment variable setting, and then use that to speed up per-test initdbs. regards, tom lane