On Thursday, May 20, 2021 9:59 PM Amit Langote <amitlangot...@gmail.com> wrote: > Here are updated/divided patches. Thanks for your updates.
But, I've detected segmentation faults caused by the patch, which can happen during 100_bugs.pl in src/test/subscription. This happens more than one in ten times. This problem would be a timing issue and has been introduced by v3 already. I used v5 for HEAD also and reproduced this failure, while OSS HEAD doesn't reproduce this, even when I executed 100_bugs.pl 200 times in a tight loop. I aligned the commit id 4f586fe2 for all check. Below logs are ones I got from v3. * The message of the failure during TAP test. # Postmaster PID for node "twoways" is 5015 Waiting for replication conn testsub's replay_lsn to pass pg_current_wal_lsn() on twoways # poll_query_until timed out executing this query: # SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'testsub'; # expecting this output: # t # last actual query output: # # with stderr: # psql: error: connection to server on socket "/tmp/cs8dhFOtZZ/.s.PGSQL.59345" failed: No such file or directory # Is the server running locally and accepting connections on that socket? timed out waiting for catchup at t/100_bugs.pl line 148. The failure produces core file and its back trace is below. My first guess of the cause is that between the timing to get an entry from hash_search() in get_rel_sync_entry() and to set the map by convert_tuples_by_name() in maybe_send_schema(), we had invalidation message, which tries to free unset descs in the entry ? * core file backtrace Core was generated by `postgres: twoways: walsender k5user [local] START_REPLICATION '. Program terminated with signal 11, Segmentation fault. #0 0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225 1225 FreeTupleDesc(entry->map->indesc); Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libstdc++-4.8.5-44.el7.x86_64 (gdb) bt #0 0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225 #1 0x0000000000ae21f0 in LocalExecuteInvalidationMessage (msg=0x21d1de8) at inval.c:601 #2 0x00000000008dbd6e in ReorderBufferExecuteInvalidations (nmsgs=4, msgs=0x21d1db8) at reorderbuffer.c:3232 #3 0x00000000008da70a in ReorderBufferProcessTXN (rb=0x21d1a40, txn=0x2210b58, commit_lsn=25569096, snapshot_now=0x21d1e10, command_id=1, streaming=false) at reorderbuffer.c:2294 #4 0x00000000008dae56 in ReorderBufferReplay (txn=0x2210b58, rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0, origin_lsn=0) at reorderbuffer.c:2591 #5 0x00000000008daede in ReorderBufferCommit (rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0, origin_lsn=0) at reorderbuffer.c:2615 #6 0x00000000008cae06 in DecodeCommit (ctx=0x21e6880, buf=0x7fffb9383db0, parsed=0x7fffb9383c10, xid=748, two_phase=false) at decode.c:744 #7 0x00000000008ca1ed in DecodeXactOp (ctx=0x21e6880, buf=0x7fffb9383db0) at decode.c:278 #8 0x00000000008c9e76 in LogicalDecodingProcessRecord (ctx=0x21e6880, record=0x21e6c80) at decode.c:142 #9 0x0000000000901fcc in XLogSendLogical () at walsender.c:2876 #10 0x0000000000901327 in WalSndLoop (send_data=0x901f30 <XLogSendLogical>) at walsender.c:2306 #11 0x00000000008ffd5f in StartLogicalReplication (cmd=0x219aff8) at walsender.c:1206 #12 0x00000000009006ae in exec_replication_command ( cmd_string=0x2123c20 "START_REPLICATION SLOT \"pg_16400_sync_16392_6964617299612181363\" LOGICAL 0/182D058 (proto_version '2', publication_names '\"testpub\"')") at walsender.c:1646 #13 0x000000000096ffae in PostgresMain (argc=1, argv=0x7fffb93840d0, dbname=0x214ef18 "d1", username=0x214eef8 "k5user") at postgres.c:4482 I'll update when I get more information. Best Regards, Takamichi Osumi