windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

Andres Freund Tue, 07 Feb 2023 17:29:09 -0800

Hi,

A recent cfbot run caused CI on windows to crash - on a patch that could not
conceivably cause this issue:
  https://cirrus-ci.com/task/5646021133336576
The patch is just:
  https://github.com/postgresql-cfbot/postgresql/commit/dbd4afa6e7583c036b86abe2e3d27b508d335c2b


regression.diffs: 
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/regression.diffs
postmaster.log: 
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/log/postmaster.log
crash info: 
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/crashlog/crashlog-postgres.exe_1af0_2023-02-08_00-53-23-997.txt

00000085`f03ffa40 00007ff6`fd89faa8     ucrtbased!abort(void)+0x5a 
[minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
00000085`f03ffa80 00007ff6`fd6474dc     postgres!ExceptionalCondition(
                        char * conditionName = 0x00007ff6`fdd03ca8 
"PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED",
                        char * fileName = 0x00007ff6`fdd03c80 
"../src/backend/storage/ipc/pmsignal.c",
                        int lineNumber = 0n329)+0x78 
[c:\cirrus\src\backend\utils\error\assert.c @ 67]
00000085`f03ffac0 00007ff6`fd676eff     
postgres!MarkPostmasterChildActive(void)+0x7c 
[c:\cirrus\src\backend\storage\ipc\pmsignal.c @ 329]
00000085`f03ffb00 00007ff6`fd59aa3a     postgres!InitProcess(void)+0x2ef 
[c:\cirrus\src\backend\storage\lmgr\proc.c @ 375]
00000085`f03ffb60 00007ff6`fd467689     postgres!SubPostmasterMain(
                        int argc = 0n3,
                        char ** argv = 0x000001c6`f3814e80)+0x33a 
[c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
00000085`f03ffd90 00007ff6`fda0e1c9     postgres!main(
                        int argc = 0n3,
                        char ** argv = 0x000001c6`f3814e80)+0x2f9 
[c:\cirrus\src\backend\main\main.c @ 192]

So, somehow we ended up with a pmsignal slot for a new backend that's not
currently in PM_CHILD_ASSIGNED state.
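
To make the failed invariant concrete, here is a minimal standalone model of
the slot lifecycle. It is only a sketch, not the pmsignal.c source: the state
names follow the assertion text, but NUM_SLOTS, child_flags, assign_slot,
mark_child_active and release_slot are hypothetical names, and the function
returns false where the real code would trip the Assert.

```c
#include <assert.h>
#include <stdbool.h>

/* Slot states; names follow the assertion text, values are illustrative. */
typedef enum
{
    PM_CHILD_UNUSED,    /* 0, so a zero-initialized array starts unused */
    PM_CHILD_ASSIGNED,
    PM_CHILD_ACTIVE
} ChildSlotState;

#define NUM_SLOTS 4     /* hypothetical size for the model */

static ChildSlotState child_flags[NUM_SLOTS];

/*
 * Postmaster side: reserve a free slot before launching a child.
 * Returns a 1-based slot number, or 0 if no slot is free.
 */
static int
assign_slot(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
    {
        if (child_flags[i] == PM_CHILD_UNUSED)
        {
            child_flags[i] = PM_CHILD_ASSIGNED;
            return i + 1;
        }
    }
    return 0;
}

/*
 * Child side: claim the slot handed over by the postmaster.
 * Returns false if the slot is not in the ASSIGNED state, which is
 * the condition the failing assertion checks for.
 */
static bool
mark_child_active(int slot)
{
    assert(slot > 0 && slot <= NUM_SLOTS);
    if (child_flags[slot - 1] != PM_CHILD_ASSIGNED)
        return false;
    child_flags[slot - 1] = PM_CHILD_ACTIVE;
    return true;
}

/* Postmaster side: free the slot once the child is gone. */
static void
release_slot(int slot)
{
    assert(slot > 0 && slot <= NUM_SLOTS);
    child_flags[slot - 1] = PM_CHILD_UNUSED;
}
```

In this model, mark_child_active() returns false if the slot was released or
already activated before the child got to it. Which of those, if either,
happened in the CI run is not something the crash output tells us.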


Obviously the first idea is to wonder whether this is a problem introduced as
part of the recent postmaster-latchification work.


At first I thought we were failing to terminate running processes, due to the
following output:

parallel group (20 tests):  name char txid text varchar enum float8 regproc 
int2 boolean bit oid pg_lsn int8 int4 float4 uuid rangetypes numeric money
     boolean                      ... ok          684 ms
     char                         ... ok          517 ms
     name                         ... ok          354 ms
     varchar                      ... ok          604 ms
     text                         ... ok          603 ms
     int2                         ... ok          676 ms
     int4                         ... ok          818 ms
     int8                         ... ok          779 ms
     oid                          ... ok          720 ms
     float4                       ... ok          823 ms
     float8                       ... ok          628 ms
     bit                          ... ok          666 ms
     numeric                      ... ok         1132 ms
     txid                         ... ok          497 ms
     uuid                         ... ok          818 ms
     enum                         ... ok          619 ms
     money                        ... FAILED (test process exited with exit code 2)     7337 ms
     rangetypes                   ... ok          813 ms
     pg_lsn                       ... ok          762 ms
     regproc                      ... ok          632 ms


But now I realize that the reason none of the other tests failed is that the
crash took a long time, presumably due to the debugger creating the above
information, so the other tests had time to finish.


2023-02-08 00:53:20.257 GMT client backend[4584] pg_regress/rangetypes 
STATEMENT:  select '-[a,z)'::textrange;
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), 
File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 5948
[ quite a few lines ]
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  server process (PID 5948) was 
terminated by exception 0xC0000354
2023-02-08 00:53:27.420 GMT postmaster[872] HINT:  See C include file 
"ntstatus.h" for a description of the hexadecimal value.
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  terminating any other active 
server processes
2023-02-08 00:53:27.434 GMT postmaster[872] LOG:  all server processes 
terminated; reinitializing
2023-02-08 00:53:27.459 GMT startup[5800] LOG:  database system was 
interrupted; last known up at 2023-02-08 00:53:19 GMT
2023-02-08 00:53:27.459 GMT startup[5800] LOG:  database system was not 
properly shut down; automatic recovery in progress
2023-02-08 00:53:27.462 GMT startup[5800] LOG:  redo starts at 0/20DCF08
2023-02-08 00:53:27.484 GMT startup[5800] LOG:  could not stat file 
"pg_tblspc/16502": No such file or directory
2023-02-08 00:53:27.484 GMT startup[5800] CONTEXT:  WAL redo at 0/20DCFB8 for 
Tablespace/DROP: 16502
2023-02-08 00:53:27.614 GMT startup[5800] LOG:  invalid record length at 
0/25353E8: wanted 24, got 0
2023-02-08 00:53:27.614 GMT startup[5800] LOG:  redo done at 0/2534FE0 system 
usage: CPU: user: 0.04 s, system: 0.04 s, elapsed: 0.15 s


Nevertheless, clearly this state should never be reached.
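
As an aside on the "terminated by exception 0xC0000354" line in the postmaster
log above: the value is an NTSTATUS code, and its bit fields can be split out
before looking it up in ntstatus.h. The sketch below assumes the documented
NTSTATUS layout (severity in bits 31-30, customer flag in bit 29, facility in
bits 27-16, code in bits 15-0); NtStatusFields, decode_ntstatus and
severity_name are hypothetical helper names, not anything from PostgreSQL or
the Windows SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded pieces of a 32-bit NTSTATUS value. */
typedef struct
{
    unsigned    severity;   /* 0 success, 1 informational, 2 warning, 3 error */
    unsigned    customer;   /* 1 if customer-defined, 0 if Microsoft-defined */
    unsigned    facility;   /* 12-bit facility number */
    unsigned    code;       /* 16-bit status code within the facility */
} NtStatusFields;

/* Split an NTSTATUS value into its bit fields. */
static NtStatusFields
decode_ntstatus(uint32_t status)
{
    NtStatusFields f;

    f.severity = (status >> 30) & 0x3;
    f.customer = (status >> 29) & 0x1;
    f.facility = (status >> 16) & 0xFFF;
    f.code = status & 0xFFFF;
    return f;
}

/* Human-readable name for the two-bit severity field. */
static const char *
severity_name(unsigned severity)
{
    static const char *const names[] = {
        "success", "informational", "warning", "error"
    };

    assert(severity < 4);
    return names[severity];
}
```

For 0xC0000354 this yields error severity, facility 0, and code 0x354.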

Greetings,

Andres Freund
