Random pg_upgrade test failure on drongo

Hayato Kuroda (Fujitsu) Wed, 08 Nov 2023 00:13:31 -0800

Dear hackers,

While monitoring the buildfarm, I found that drongo failed the test
pg_upgrade/003_logical_slots [1].
Strangely, the test passed in the next run. I'm not yet sure of the reason,
but I will keep an eye on it and investigate if it happens again.


I think this failure is not related to our logical slots work, even though the
failing test was 003_logical_slots.pl. For details, please see the latter part.

For further investigation, the server log from the upgrade may be needed. It is
in the data directory, so the BF system does not upload it. I may need
additional information if the test fails again.

# Analysis of failure

According to the output, pg_upgrade seems to have failed while restoring objects
to the new cluster [2].

At the code level, the pg_upgrade command failed in exec_prog().
In that function, pg_restore was being executed for database "postgres".
Below is a brief call stack. Note that pg_restore is not used for migrating
logical replication slots; that is done by the pg_upgrade binary itself. Also,
the migration is done after all objects are copied, not in create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- Since -j option is not specified, it is just a wrapper
create_new_objects()
main()
```

In exec_prog(), system() was called but returned a non-zero value.
The documentation says that system() returns the value returned by the command
interpreter when the input is not NULL [3]. Unfortunately, the current code does
not output the return code. Also, the BF system does not upload the data
directory for failed tests. Therefore, I could not get more information for the
investigation.
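
For illustration, below is a minimal sketch of how the return value of system()
could be reported instead of being discarded. This is not the actual
exec_prog() code; the helper name run_and_report() and the message format are
hypothetical. On Windows the value returned by system() is taken as the command
interpreter's return value, as described in [3]; on POSIX systems it is a wait
status that has to be decoded first.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifndef _WIN32
#include <sys/wait.h>
#endif

/*
 * Hypothetical helper, not part of pg_upgrade: run a command via system()
 * and report the result when it is non-zero.
 *
 * Returns the command's exit code, or -1 if it could not be determined.
 */
static int
run_and_report(const char *cmd)
{
	int			rc;
	int			exit_code;

	/* system(NULL) only probes for a command interpreter; reject it here */
	assert(cmd != NULL);

	rc = system(cmd);
	if (rc == -1)
	{
		/* the command interpreter could not be started at all */
		perror("system");
		return -1;
	}

#ifdef _WIN32
	/* on Windows, system() returns the interpreter's return value */
	exit_code = rc;
#else
	/* on POSIX, system() returns a wait status that must be decoded */
	if (!WIFEXITED(rc))
	{
		fprintf(stderr, "command \"%s\" terminated abnormally (status %d)\n",
				cmd, rc);
		return -1;
	}
	exit_code = WEXITSTATUS(rc);
#endif

	if (exit_code != 0)
		fprintf(stderr, "command \"%s\" failed: system() returned %d, exit code %d\n",
				cmd, rc, exit_code);

	return exit_code;
}
```

With something like this, a failing command would leave at least its exit code
in the test output, even when the data directory is not uploaded.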

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2023-11-07%2013%3A43%3A23&stg=pg_upgrade-check
[2]:
```
...
# No postmaster PID for node "oldpub"
# Running: pg_upgrade --no-sync -d C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -b C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -s 127.0.0.1 -p 54813 -P 54814 --copy
Performing Consistency Checks
...
Setting frozenxid and minmxid counters in new cluster         ok
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster                 
*failure*

Consult the last few lines of "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log" for
the probable cause of the failure.
Failure, exiting
[14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster
[14:23:26.632](0.000s) #   Failed test 'run of pg_upgrade of old cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 170.
### Starting node "newpub"
# Running: pg_ctl -w -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log -o --cluster-name=newpub start
waiting for server to start.... done
server started
# Postmaster PID for node "newpub" is 4604
[14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster
[14:23:28.398](0.001s) #   Failed test 'check the slot exists on new cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 176.
[14:23:28.399](0.000s) #          got: ''
#     expected: 'regress_sub|t'
...
```
[3]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
