On 8/30/21 10:36 AM, hubert depesz lubaczewski wrote:
Anyway - it's 12.6 on aarch64. A couple of days ago a replication slot
was started, and now it seems to be stuck.

#0  hash_seq_search (status=status@entry=0xffffdd90f380) at 
./build/../src/backend/utils/hash/dynahash.c:1448
#1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized 
out>, relid=105496194) at ./build/../src/backend/utils/cache/relfilenodemap.c:64
#2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) 
at ./build/../src/backend/utils/cache/inval.c:595
#3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, 
txn=0xaaaac326b998 <formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) 
at ./build/../src/backend/replication/logical/reorderbuffer.c:2149
#4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, 
commit_lsn=187650393290540, end_lsn=<optimized out>, 
commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0, 
origin_lsn=origin_lsn@entry=0) at 
./build/../src/backend/replication/logical/reorderbuffer.c:1770
#5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, 
buf=0xffffdd90f960, ctx=0xaaaaf5d396a0) at 
./build/../src/backend/replication/logical/decode.c:640
#6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, 
buf@entry=0xffffdd90f9c0) at 
./build/../src/backend/replication/logical/decode.c:248
#7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, 
record=0xaaaaf5d39938) at 
./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaac2ecfdfc in XLogSendLogical () at 
./build/../src/backend/replication/walsender.c:2840
#9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 
<XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2189
#10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at 
./build/../src/backend/replication/walsender.c:1133
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc 
LOGICAL 1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at 
./build/../src/backend/replication/walsender.c:1549
#12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, 
argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>, username=<optimized out>) at 
./build/../src/backend/tcop/postgres.c:4257
#13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) 
at ./build/../src/backend/postmaster/postmaster.c:4484
#14 BackendStartup (port=0xaaaaf5c68070) at 
./build/../src/backend/postmaster/postmaster.c:4167
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized 
out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at 
./build/../src/backend/main/main.c:228

The thing is - I can't close it with pg_terminate_backend(), and I'd
rather not kill -9, as that will, I think, close all other connections,
and this is a prod server.

Still, it makes me ask: why does Pg end up in such a place, where it
doesn't do any syscalls, doesn't respond to pg_terminate_backend(), and
is using 100% of CPU?
src/backend/utils/hash/dynahash.c:1448 is in the middle of a while loop, which is apparently not exiting.

There is no check for interrupts in there, and it is a fairly tight loop, which would explain both symptoms.
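
Roughly, the pattern in the RelfilenodeMapInvalidateCallback() frame above looks like the sketch below. It is a simplified illustration with made-up names (SketchEntry, SketchHash, sketch_invalidate_callback), not the actual relfilenodemap.c code, but it shows why a scan that never finishes would neither yield the CPU nor notice a termination request:

/* Simplified sketch, not verbatim PostgreSQL source */
#include "postgres.h"
#include "utils/hsearch.h"

/* hypothetical cache entry, standing in for RelfilenodeMapEntry */
typedef struct SketchEntry
{
    Oid         key;        /* hash key */
    Oid         relid;      /* cached relation OID */
} SketchEntry;

static HTAB *SketchHash = NULL;   /* assume built elsewhere via hash_create() */

static void
sketch_invalidate_callback(Datum arg, Oid relid)
{
    HASH_SEQ_STATUS status;
    SketchEntry *entry;

    hash_seq_init(&status, SketchHash);

    /*
     * Tight scan over every bucket.  Neither this loop nor hash_seq_search()
     * calls CHECK_FOR_INTERRUPTS().  pg_terminate_backend() only sends
     * SIGTERM, which sets ProcDiePending/InterruptPending; those flags are
     * acted on at the next CHECK_FOR_INTERRUPTS().  So if the scan never
     * terminates (say, a corrupted bucket chain), the backend spins in
     * userspace at 100% CPU, makes no syscalls, and never notices the
     * termination request.
     */
    while ((entry = (SketchEntry *) hash_seq_search(&status)) != NULL)
    {
        if (relid == InvalidOid || entry->relid == relid)
        {
            if (hash_search(SketchHash, (void *) &entry->key,
                            HASH_REMOVE, NULL) == NULL)
                elog(ERROR, "hash table corrupted");
        }
    }
}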

As to how it got that way, I have to assume data corruption or a bug of some sort. I would repost the details to hackers for better visibility.

Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

