[HACKERS] Still another race condition in recovery TAP tests

Tom Lane Fri, 08 Sep 2017 19:33:06 -0700

In a moment of idleness I tried to run the TAP tests on pademelon,
which is a mighty old and slow machine.  Behold,
src/test/recovery/t/010_logical_decoding_timelines.pl fell over,
with the relevant section of its log contents being:


# testing logical timeline following with a filesystem-level copy
# Taking filesystem backup b1 from node "master"
# pg_start_backup: 0/2000028
could not opendir(/home/postgres/pgsql/src/test/recovery/tmp_check/t_010_logical_decoding_timelines_master_data/pgdata/pg_wal/archive_status/000000010000000000000001.ready): No such file or directory at ../../../src/test/perl//RecursiveCopy.pm line 115.
### Stopping node "master" using mode immediate

The postmaster log has this relevant entry:

2017-09-08 22:03:22.917 EDT [19160] DEBUG:  archived write-ahead log file "000000010000000000000001"

It looks to me like the archiver removed 000000010000000000000001.ready
between where RecursiveCopy.pm checks that $srcpath is a regular file
or directory (line 95) and where it rechecks whether it's a regular
file (line 105).  Then the "-f" test on line 105 fails, allowing it to
fall through to the its-a-directory path, and unsurprisingly the opendir
at line 115 fails with the above symptom.
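To make the sequence concrete, here is a minimal sketch of the same check-then-recheck pattern. It is written in Python purely for illustration and is not the actual Perl code in RecursiveCopy.pm; the function name and the between_checks hook (which stands in for the archiver running concurrently) are hypothetical.

```python
import os
import tempfile


def classify_then_open(path, between_checks=lambda: None):
    """Mimic the two-step check described above (illustrative only)."""
    # First check: the path must be a regular file or a directory.
    if not (os.path.isfile(path) or os.path.isdir(path)):
        raise ValueError(f"unsupported file type: {path}")

    # Hypothetical hook: anything that runs here models a concurrent process.
    between_checks()

    # Second check: a regular file would be copied (the copy itself is elided).
    if os.path.isfile(path):
        return "copied as file"

    # Otherwise the code assumes a directory; this fails if the path vanished.
    os.listdir(path)
    return "recursed as directory"


with tempfile.TemporaryDirectory() as tmp:
    ready = os.path.join(tmp, "000000010000000000000001.ready")
    open(ready, "w").close()

    # Without interference the file is handled as a regular file.
    undisturbed = classify_then_open(ready)

    # Removing the file between the two checks reproduces the failure mode.
    try:
        classify_then_open(ready, between_checks=lambda: os.remove(ready))
        outcome = "no error"
    except FileNotFoundError:
        outcome = "FileNotFoundError"
```

The second call fails in the directory branch even though the path was never a directory, which matches the opendir symptom in the log.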

In short, RecursiveCopy.pm is woefully unprepared for the idea that the
source directory tree might be changing underneath it.

I'm not real sure if the appropriate answer to this is "we need to fix
RecursiveCopy" or "we need to fix the callers to not call RecursiveCopy
until the source directory is stable".  Thoughts?
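For the first option, one possible shape is a copy routine that treats "the source went away" as a non-error. The following is only a sketch under that assumption, in Python rather than Perl; copy_tree_tolerant is a hypothetical name, and symlinks and special files are ignored for brevity.

```python
import os
import shutil
import tempfile


def copy_tree_tolerant(src, dst):
    """Copy src to dst, skipping entries that vanish while the copy runs."""
    try:
        entries = os.listdir(src)
    except NotADirectoryError:
        entries = None  # src is a regular file
    except FileNotFoundError:
        return  # src vanished before we looked at it; skip it

    if entries is None:
        try:
            shutil.copyfile(src, dst)
        except FileNotFoundError:
            pass  # src vanished between the listing and the copy; skip it
        return

    os.makedirs(dst, exist_ok=True)
    for name in entries:
        copy_tree_tolerant(os.path.join(src, name), os.path.join(dst, name))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src, "archive_status"))
    open(os.path.join(src, "archive_status", "x.done"), "w").close()

    copy_tree_tolerant(src, os.path.join(tmp, "dst"))
    copied = sorted(os.listdir(os.path.join(tmp, "dst", "archive_status")))

    # A source path that no longer exists is skipped rather than raising.
    copy_tree_tolerant(os.path.join(src, "gone.ready"), os.path.join(tmp, "gone"))
    gone_created = os.path.exists(os.path.join(tmp, "gone"))
```

The sketch attempts each operation and handles the failure, instead of testing the file type first and acting on a possibly stale answer. Whether silently skipping vanished files is acceptable for every caller is a separate question.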

(I do kinda wonder why we rolled our own RecursiveCopy; surely there's
a better implementation in CPAN?)

                        regards, tom lane


