Hello hackers,

As a recent buildfarm test failure on olingo (which builds postgres with
-O1 and address sanitizer) [1] shows:
...
[23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on 
standby
[23:12:02.130](0.003s) #   Failed test 'snapshot conflict: stats show conflict 
on standby'
#   at 
/home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl
 line 332.
[23:12:02.130](0.000s) #          got: '2'
#     expected: '1'
...
[23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in 
pg_stat_database
[23:12:06.887](0.040s) #   Failed test '5 recovery conflicts shown in 
pg_stat_database'
#   at 
/home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl
 line 286.
[23:12:06.887](0.000s) #          got: '6'
#     expected: '5'
Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary
done
...

pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log:
2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ;
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still waiting after 15.289 ms: recovery conflict on snapshot
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting 
process: 1299981.
2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT:  WAL redo at 0/344F468 for Heap2/PRUNE_VACUUM_SCAN: snapshotConflictHorizon: 746, isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{ xmax: 0, infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, { xmax: 0, infomask: 11008, infomask2: 32770, ntuples: 18, offsets: [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }], redirected: [23->41, 24->42, 25->43, 26->44, 27->45, 28->46, 29->47, 30->48, 31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57, 40->58]; blkref #0: rel 1663/16385/16386, blk 0
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating 
connection due to conflict with recovery
2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL:  User query might have needed to see row versions that must be removed. 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
vvv an important difference from a successful test run
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not 
send data to client: Broken pipe
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating 
connection due to conflict with recovery
2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL:  User query might have needed to see row versions that must be removed. 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
^^^

test  031_recovery_conflict may fail in the following scenario:

031_recovery_conflict.pl:
    executes  a query, which produces a conflict:
    ## RECOVERY CONFLICT 2: Snapshot conflict
    ...
    $psql_standby->query_safe(...)

startup process:
        detects a snapshot conflict and sends
        PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
        (ResolveRecoveryConflictWithVirtualXIDs ->
        CancelVirtualTransaction) to the client backend

client backend:
            receives and processes the signal:
            HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
            ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
            ProcessRecoveryConflictInterrupt,

            reports the recovery conflict:
            pgstat_report_recovery_conflict(reason);

            and reports the error:
            ereport(FATAL, ... "terminating connection due to conflict with
            recovery" ...)
            sends the message to the server log:
            errfinish -> EmitErrorReport -> send_message_to_server_log

031_recovery_conflict.pl:
    # finds the message in the log and resets psql connection:
    check_conflict_log(
      "User query might have needed to see row versions that must
       be removed");
    $psql_standby->reconnect_and_clear();

startup process:
        keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client
        backend in a loop inside ResolveRecoveryConflictWithVirtualXIDs

client backend:
            tries to send the message to the client:
            send_message_to_frontend -> socket_flush ->
            internal_flush_buffer,
            gets an error (EPIPE) from secure_write, and calls
            ereport(COMMERROR,
              (errcode_for_socket_access(),
               errmsg("could not send data to client: %m")));

           receives the following PROCSIG_RECOVERY_CONFLICT_SNAPSHOT
           signal and processes it the same way:

           HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt ->
           ProcessInterrupts -> ProcessRecoveryConflictInterrupts ->
           ProcessRecoveryConflictInterrupt,

           reports the recovery conflict:
           pgstat_report_recovery_conflict(reason);
           // now the conflict is counted twice

           and reports the error:
           ereport(FATAL, ... "terminating connection due to conflict with
           recovery" ...)
           sends the message to the server log:
           errfinish -> EmitErrorReport -> send_message_to_server_log

031_recovery_conflict.pl:
    calls
    check_conflict_stat("snapshot");
    and gets 2 instead of 1.

The patch adding delays to reproduce the issue is attached.

With the patch applied, I run the test (against an "-O0" build) in a loop:
for i in `seq 20`; do echo "I $i"; make check -s -C \
src/test/recovery/ PROVE_TESTS="t/031*"; grep ' not ok 6 ' \
src/test/recovery/tmp_check/log/regress_log_031_recovery_conflict && break;
done

and get exactly the same failure as on olingo:
I 1
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl .. 6/?
#   Failed test 'snapshot conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 333.
#          got: '2'
#     expected: '1'
t/031_recovery_conflict.pl .. 13/?
#   Failed test '5 recovery conflicts shown in pg_stat_database'
#   at t/031_recovery_conflict.pl line 287.
#          got: '6'
#     expected: '5'
# Looks like you failed 2 tests of 18.
t/031_recovery_conflict.pl .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/18 subtests

Test Summary Report
-------------------
t/031_recovery_conflict.pl (Wstat: 512 Tests: 18 Failed: 2)
  Failed tests:  6, 17
  Non-zero exit status: 2

(Similar failures can be seen with other sections of 031_recovery_conflict.)

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-05-15%2023%3A03%3A30

Best regards,
Alexander
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index d91a85cb2d..303301ddc6 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -465,6 +465,8 @@ errstart(int elevel, const char *domain)
 	return true;
 }
 
+static bool do_sleep = false;
+
 /*
  * errfinish --- end an error-reporting cycle
  *
@@ -542,6 +544,12 @@ errfinish(const char *filename, int lineno, const char *funcname)
 		PG_RE_THROW();
 	}
 
+if (elevel == FATAL && rand() % 3 == 0)
+{
+long r = 0; for (long i = 0; i < 10000000L; i++) r += i % 2;
+fprintf(stderr, "!!!errfinish [%d]| r: %ld,  edata->funcname: %s, sleep() before EmitErrorReport()\n", getpid(), r, edata->funcname);
+do_sleep = true;
+}
 	/* Emit the message to the right places */
 	EmitErrorReport();
 
@@ -1710,6 +1718,12 @@ EmitErrorReport(void)
 	if (edata->output_to_server)
 		send_message_to_server_log(edata);
 
+if (do_sleep)
+{
+long r = 0; for (long i = 0; i < 30000000L; i++) r += i % 2;
+fprintf(stderr, "!!!EmitErrorReport [%d]| r: %ld, sleep() before send_message_to_frontend()...\n", getpid(), r);
+do_sleep = false;
+}
 	/* Send to client, if enabled */
 	if (edata->output_to_client)
 		send_message_to_frontend(edata);
diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index d87efa823f..a59bd13c38 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -153,6 +153,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 check_conflict_log(
 	"User query might have needed to see row versions that must be removed");
 $psql_standby->reconnect_and_clear();
+sleep(1);
 check_conflict_stat("snapshot");
 
 

Reply via email to