On Thu, 25 Sep 2025 02:19:27 +0900 Fujii Masao <masao.fu...@gmail.com> wrote:
> On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda
> <ikedarinta...@oss.nttdata.com> wrote:
> > I think the issue is a new bug because we have transitioned to CSTATE_ABORT
> > immediately after queries failed, without executing discardUntilSync().

Agreed.

> If so, the fix in discardUntilSync() should be part of the continue-on-error
> patch. The assertion failure fix should be a separate patch, since only
> that needs to be backpatched (the failure can also occur in back branches).

+1

> > I've attached a patch that fixes the assertion error. The content of v1
> > patch by Mr. Nagata is also included. I would appreciate it if you review
> > my patch.

> +		switch (PQresultStatus(res))
> +		{
> +			case PGRES_PIPELINE_SYNC:
> +				received_sync = true;
>
> In the PGRES_PIPELINE_SYNC case, PQclear(res) isn't called but should be.
>
> +			case PGRES_FATAL_ERROR:
> +				PQclear(res);
> +				goto done;
>
> I don't think we need goto here. How about this instead?
>
> -----------------------
> @@ -3525,11 +3525,18 @@ discardUntilSync(CState *st)
>  			 * results have been discarded.
>  			 */
>  			st->num_syncs = 0;
> -			PQclear(res);
>  			break;
>  		}
> +		/*
> +		 * Stop receiving further results if PGRES_FATAL_ERROR is returned
> +		 * (e.g., due to a connection failure) before PGRES_PIPELINE_SYNC,
> +		 * since PGRES_PIPELINE_SYNC will never arrive.
> +		 */
> +		else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
> +			break;
>  		PQclear(res);
>  	}
> +	PQclear(res);
>
>  	/* exit pipeline */
>  	if (PQexitPipelineMode(st->con) != 1)
> -----------------------

I think Fujii-san's version is better, because Ikeda-san's version doesn't
consider the case where PGRES_PIPELINE_SYNC is followed by another one.
In that situation, the loop would terminate without getting NULL, which
would cause an error in PQexitPipelineMode().

However, I would like to suggest an alternative solution: checking the
connection status when readCommandResponse() returns false.
This seems more straightforward, since the cause of the error can be
investigated immediately after it is detected.

@@ -4024,7 +4043,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 					if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
 						st->state = CSTATE_END_COMMAND;
 				}
-				else if (canRetryError(st->estatus))
+				else if (PQstatus(st->con) == CONNECTION_BAD)
+					st->state = CSTATE_ABORTED;
+				else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+						 canRetryError(st->estatus))
 					st->state = CSTATE_ERROR;
 				else
 					st->state = CSTATE_ABORTED;

What do you think?

Additionally, I noticed that in pipeline mode, the error message reported
in readCommandResponse() is lost, because it is reset when PQgetResult()
returns NULL to indicate the end of query processing. For example:

  pgbench: client 0 got an error in command 3 (SQL) of script 0;
  pgbench: client 1 got an error in command 3 (SQL) of script 0;

This can be fixed by saving the previous error message and using it for
the report. After the fix:

  pgbench: client 0 got an error in command 3 (SQL) of script 0; FATAL:  terminating connection due to administrator command

I've attached updated patches. 0001 fixes the assertion failure in
commandError() and the lost error message in readCommandResponse().
0002 is the previous 0001 that adds --continue-on-error, including the
fix to handle connection termination errors. 0003 improves error
messages for errors that cause client abortion.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nag...@sraoss.co.jp>
>From e6b4022ec06f97a1ed100de9aca9eebd5fd4bc02 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v12 3/3] Improve error messages for errors that cause client
 abortion

This commit modifies relevant error messages to explicitly indicate
that the client was aborted. As part of this change, pg_log_error was
replaced with commandFailed().
---
 src/bin/pgbench/pgbench.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 36d15c95f3e..43450b4b54a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3318,8 +3318,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 			case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
 				if (is_last && meta == META_GSET)
 				{
-					pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
-								 st->id, st->use_file, st->command, qrynum, 0);
+					commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
 					st->estatus = ESTATUS_META_COMMAND_ERROR;
 					goto error;
 				}
@@ -3333,8 +3332,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 				if (meta == META_GSET && ntuples != 1)
 				{
 					/* under \gset, report the error */
-					pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
-								 st->id, st->use_file, st->command, qrynum, PQntuples(res));
+					commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
 					st->estatus = ESTATUS_META_COMMAND_ERROR;
 					goto error;
 				}
@@ -3348,18 +3346,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 					for (int fld = 0; fld < PQnfields(res); fld++)
 					{
 						char	   *varname = PQfname(res, fld);
+						char	   *cmd = (meta == META_ASET ? "aset" : "gset");
 
 						/* allocate varname only if necessary, freed below */
 						if (*varprefix != '\0')
 							varname = psprintf("%s%s", varprefix, varname);
 
 						/* store last row result as a string */
-						if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+						if (!putVariable(&st->variables, cmd, varname,
 										 PQgetvalue(res, ntuples - 1, fld)))
 						{
 							/* internal error */
-							pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
-										 st->id, st->use_file, st->command, qrynum, varname);
+							commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
 							st->estatus = ESTATUS_META_COMMAND_ERROR;
 							goto error;
 						}
@@ -3394,8 +3392,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 
 			default:
 				/* anything else is unexpected */
-				pg_log_error("client %d script %d aborted in command %d query %d: %s",
-							 st->id, st->use_file, st->command, qrynum, errmsg);
+				commandFailed(st, "SQL", errmsg);
 				goto error;
 		}
-- 
2.43.0
>From b8f05f605176232bf0aa1eaf8a2783c17059a39a Mon Sep 17 00:00:00 2001
From: Fujii Masao <fu...@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v12 2/3] Add --continue-on-error option

When this option is set, the client rolls back the failed transaction
and starts a new one when a transaction fails for a reason other than
a deadlock or serialization failure.
---
 doc/src/sgml/ref/pgbench.sgml                | 64 ++++++++++++++++----
 src/bin/pgbench/pgbench.c                    | 59 +++++++++++++++---
 src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
 3 files changed, 126 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..828ce0d90cf 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
   and number of transactions per client); these will be equal unless the run
   failed before completion or some SQL command(s) failed.  (In
   <option>-T</option> mode, only the actual number of transactions is printed.)
-  The next line reports the number of failed transactions due to
-  serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
-  for more information).
+  The next line reports the number of failed transactions (see
+  <xref linkend="failures-and-retries"/> for more information).
   The last line reports the number of transactions per second.
  </para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
        <listitem>
         <para>deadlock failures;</para>
        </listitem>
+       <listitem>
+        <para>other failures;</para>
+       </listitem>
       </itemizedlist>
       See <xref linkend="failures-and-retries"/> for more information.
      </para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
      </listitem>
     </varlistentry>
 
+    <varlistentry id="pgbench-option-continue-on-error">
+     <term><option>--continue-on-error</option></term>
+     <listitem>
+      <para>
+       Allows clients to continue their run even if an SQL statement fails due
+       to errors other than serialization or deadlock.  Unlike serialization
+       and deadlock failures, clients do not retry the same transaction but
+       start a new one.  This option is useful when your custom script may
+       raise errors for some reason, such as a unique constraint violation.
+       Without this option, the client is aborted after such errors.
+      </para>
+      <para>
+       Note that serialization and deadlock failures never cause the client
+       to be aborted, even after the client retries
+       <option>--max-tries</option> times by default, so they are not
+       affected by this option.  See
+       <xref linkend="failures-and-retries"/> for more information.
+      </para>
+     </listitem>
+    </varlistentry>
+
    </variablelist>
  </para>
@@ -2409,8 +2431,8 @@ END;
    will be reported as <literal>failed</literal>.  If you use the
    <option>--failures-detailed</option> option, the
    <replaceable>time</replaceable> of the failed transaction will be reported as
-   <literal>serialization</literal> or
-   <literal>deadlock</literal> depending on the type of failure (see
+   <literal>serialization</literal>, <literal>deadlock</literal>, or
+   <literal>other</literal> depending on the type of failure (see
    <xref linkend="failures-and-retries"/> for more information).
  </para>
@@ -2638,6 +2660,16 @@ END;
        </para>
       </listitem>
      </varlistentry>
+
+     <varlistentry>
+      <term><replaceable>other_sql_failures</replaceable></term>
+      <listitem>
+       <para>
+        number of transactions that got an SQL error
+        (zero unless <option>--failures-detailed</option> is specified)
+       </para>
+      </listitem>
+     </varlistentry>
    </variablelist>
  </para>
@@ -2646,8 +2678,8 @@ END;
 <screen>
 <userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
 </para>
@@ -2851,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
  <para>
   A client's run is aborted in case of a serious error; for example, the
   connection with the database server was lost or the end of script was reached
-  without completing the last transaction. In addition, if execution of an SQL
-  or meta command fails for reasons other than serialization or deadlock errors,
-  the client is aborted. Otherwise, if an SQL command fails with serialization or
-  deadlock errors, the client is not aborted. In such cases, the current
+  without completing the last transaction. The client also aborts
+  if a meta command fails, or if an SQL command fails for reasons other than
+  serialization or deadlock errors when <option>--continue-on-error</option>
+  is not specified. With <option>--continue-on-error</option>,
+  the client does not abort on such SQL errors and instead proceeds to
+  the next transaction. These cases are reported as
+  <literal>other failures</literal> in the output. If the error occurs
+  in a meta command, however, the client still aborts even when this option
+  is specified.
+ </para>
+ <para>
+  If an SQL command fails due to serialization or deadlock errors, the
+  client does not abort, regardless of whether
+  <option>--continue-on-error</option> is used. Instead, the current
   transaction is rolled back, which also includes setting the client variables
   as they were before the run of this transaction (it is assumed that one
   transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f25a2e20e70..36d15c95f3e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
  * directly successful transactions (they were successfully completed on
  * the first try).
  *
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
  *
+ * Without --continue-on-error:
  * failed (the number of failed transactions) =
  *   'serialization_failures' (they got a serialization error and were not
  *                             successfully retried) +
  *   'deadlock_failures' (they got a deadlock error and were not
  *                        successfully retried).
  *
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ *   'serialization_failures' + 'deadlock_failures' +
+ *   'other_sql_failures' (they got some other SQL error; the transaction was
+ *                         not retried and counted as failed due to
+ *                         --continue-on-error).
+ *
  * If the transaction was retried after a serialization or a deadlock
  * error this does not guarantee that this retry was successful.  Thus
@@ -440,6 +448,11 @@ typedef struct StatsData
 	int64		deadlock_failures;	/* number of transactions that were not
 									 * successfully retried after a deadlock
 									 * error */
+	int64		other_sql_failures; /* number of failed transactions for
+									 * reasons other than
+									 * serialization/deadlock failure, which
+									 * is counted if --continue-on-error is
+									 * specified */
 	SimpleStats latency;
 	SimpleStats lag;
 } StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
 static bool verbose_errors = false; /* print verbose messages of all errors */
 
 static bool exit_on_abort = false;	/* exit when any client is aborted */
+static bool continue_on_error = false;	/* continue after errors */
 
 /* Builtin test scripts */
 typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
 		   "  --log-prefix=PREFIX      prefix for transaction time log file\n"
 		   "                           (default: \"pgbench_log\")\n"
 		   "  --max-tries=NUM          max number of tries to run transaction (default: 1)\n"
+		   "  --continue-on-error      continue running after an SQL error\n"
 		   "  --progress-timestamp     use Unix epoch timestamps for progress\n"
 		   "  --random-seed=SEED       set random seed (\"time\", \"rand\", integer)\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
 	sd->retried = 0;
 	sd->serialization_failures = 0;
 	sd->deadlock_failures = 0;
+	sd->other_sql_failures = 0;
 	initSimpleStats(&sd->latency);
 	initSimpleStats(&sd->lag);
 }
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
 		case ESTATUS_DEADLOCK_ERROR:
 			stats->deadlock_failures++;
 			break;
+		case ESTATUS_OTHER_SQL_ERROR:
+			stats->other_sql_failures++;
+			break;
 		default:
 			/* internal error which should never occur */
 			pg_fatal("unexpected error status: %d", estatus);
@@ -3365,7 +3384,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 			case PGRES_FATAL_ERROR:
 				st->estatus = getSQLErrorStatus(PQresultErrorField(res,
 																   PG_DIAG_SQLSTATE));
-				if (canRetryError(st->estatus))
+				if (continue_on_error || canRetryError(st->estatus))
 				{
 					if (verbose_errors)
 						commandError(st, errmsg);
@@ -4029,7 +4048,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 					if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
 						st->state = CSTATE_END_COMMAND;
 				}
-				else if (canRetryError(st->estatus))
+				else if (PQstatus(st->con) == CONNECTION_BAD)
+					st->state = CSTATE_ABORTED;
+				else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+						 canRetryError(st->estatus))
 					st->state = CSTATE_ERROR;
 				else
 					st->state = CSTATE_ABORTED;
@@ -4550,7 +4572,8 @@ static int64
 getFailures(const StatsData *stats)
 {
 	return (stats->serialization_failures +
-			stats->deadlock_failures);
+			stats->deadlock_failures +
+			stats->other_sql_failures);
 }
 
 /*
@@ -4570,6 +4593,8 @@ getResultString(bool skipped, EStatus estatus)
 			return "serialization";
 		case ESTATUS_DEADLOCK_ERROR:
 			return "deadlock";
+		case ESTATUS_OTHER_SQL_ERROR:
+			return "other";
 		default:
 			/* internal error which should never occur */
 			pg_fatal("unexpected error status: %d", estatus);
@@ -4625,6 +4650,7 @@ doLog(TState *thread, CState *st,
 			int64		skipped = 0;
 			int64		serialization_failures = 0;
 			int64		deadlock_failures = 0;
+			int64		other_sql_failures = 0;
 			int64		retried = 0;
 			int64		retries = 0;
@@ -4665,10 +4691,12 @@ doLog(TState *thread, CState *st,
 			{
 				serialization_failures = agg->serialization_failures;
 				deadlock_failures = agg->deadlock_failures;
+				other_sql_failures = agg->other_sql_failures;
 			}
-			fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+			fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
 					serialization_failures,
-					deadlock_failures);
+					deadlock_failures,
+					other_sql_failures);
 
 			fputc('\n', logfile);
@@ -6307,6 +6335,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 		cur.serialization_failures +=
 			threads[i].stats.serialization_failures;
 		cur.deadlock_failures += threads[i].stats.deadlock_failures;
+		cur.other_sql_failures += threads[i].stats.other_sql_failures;
 	}
 
 	/* we count only actually executed transactions */
@@ -6449,7 +6478,8 @@ printResults(StatsData *total,
 	/*
 	 * Remaining stats are nonsensical if we failed to execute any xacts due
-	 * to others than serialization or deadlock errors
+	 * to other than serialization or deadlock errors and --continue-on-error
+	 * is not set.
 	 */
 	if (total_cnt <= 0)
 		return;
@@ -6465,6 +6495,9 @@ printResults(StatsData *total,
 		printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
 			   total->deadlock_failures,
 			   100.0 * total->deadlock_failures / total_cnt);
+		printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+			   total->other_sql_failures,
+			   100.0 * total->other_sql_failures / total_cnt);
 	}
 
 	/* it can be non-zero only if max_tries is not equal to one */
@@ -6568,6 +6601,10 @@ printResults(StatsData *total,
 					   sstats->deadlock_failures,
 					   (100.0 * sstats->deadlock_failures /
 						script_total_cnt));
+				printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+					   sstats->other_sql_failures,
+					   (100.0 * sstats->other_sql_failures /
+						script_total_cnt));
 			}
 
 /*
@@ -6727,6 +6764,7 @@ main(int argc, char **argv)
 		{"verbose-errors", no_argument, NULL, 15},
 		{"exit-on-abort", no_argument, NULL, 16},
 		{"debug", no_argument, NULL, 17},
+		{"continue-on-error", no_argument, NULL, 18},
 		{NULL, 0, NULL, 0}
 	};
@@ -7080,6 +7118,10 @@ main(int argc, char **argv)
 			case 17:			/* debug */
 				pg_logging_increase_verbosity();
 				break;
+			case 18:			/* continue-on-error */
+				benchmarking_option_set = true;
+				continue_on_error = true;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.",
 								  progname);
@@ -7435,6 +7477,7 @@ main(int argc, char **argv)
 		stats.retried += thread->stats.retried;
 		stats.serialization_failures +=
 			thread->stats.serialization_failures;
 		stats.deadlock_failures += thread->stats.deadlock_failures;
+		stats.other_sql_failures += thread->stats.other_sql_failures;
 		latency_late += thread->latency_late;
 		conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
 # Clean up
 $node->safe_psql('postgres', 'DROP TABLE counter;');
 
+# Test --continue-on-error
+$node->safe_psql('postgres',
+	'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+	'-n -t 10 --continue-on-error --failures-detailed',
+	0,
+	[
+		qr{processed: 1/10\b},
+		qr{other failures: 9\b}
+	],
+	[],
+	'test --continue-on-error',
+	{
+		'001_continue_on_error' => q{
+			INSERT INTO unique_table VALUES(0);
+		}
+	});
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
 # done
 $node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
 $node->stop;
-- 
2.43.0
>From d141ea4422d76b59021e2c25ad378bfd12d97651 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Wed, 24 Sep 2025 22:23:25 +0900
Subject: [PATCH v12 1/3] Fix assertion failure and verbose messages in
 pipeline mode

commandError() is called to report errors when they can be retried,
and it previously assumed that errors are always detected during SQL
command execution. However, in pipeline mode, an error may also be
detected when a \endpipeline meta-command is executed. This caused an
assertion failure. To fix this, it is now assumed that errors can also
be detected in this case.

Additionally, in pipeline mode, the error message reported in
readCommandResponse() was lost, because it was reset when PQgetResult()
returned NULL to indicate the end of query processing. To fix this,
save the previous error message and use it for reporting.
---
 src/bin/pgbench/pgbench.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3cafd88ac53..f25a2e20e70 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3059,7 +3059,13 @@ commandFailed(CState *st, const char *cmd, const char *message)
 static void
 commandError(CState *st, const char *message)
 {
-	Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
+	/*
+	 * Errors should only be detected during an SQL command or the
+	 * \endpipeline meta command. Any other case triggers an assertion
+	 * failure.
+	 */
+	Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND ||
+		   sql_script[st->use_file].commands[st->command]->meta == META_ENDPIPELINE);
+
 	pg_log_info("client %d got an error in command %d (SQL) of script %d; %s",
 				st->id, st->command, st->use_file, message);
 }
@@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 	PGresult   *res;
 	PGresult   *next_res;
 	int			qrynum = 0;
+	char	   *errmsg;
 
 	/*
 	 * varprefix should be set only with \gset or \aset, and \endpipeline and
@@ -3280,6 +3287,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 	{
 		bool		is_last;
 
+		errmsg = pg_strdup(PQerrorMessage(st->con));
+
 		/* peek at the next result to know whether the current is last */
 		next_res = PQgetResult(st->con);
 		is_last = (next_res == NULL);
@@ -3349,7 +3358,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 				st->num_syncs--;
 				if (st->num_syncs == 0 && PQexitPipelineMode(st->con) != 1)
 					pg_log_error("client %d failed to exit pipeline mode: %s", st->id,
-								 PQerrorMessage(st->con));
+								 errmsg);
 				break;
 
 			case PGRES_NONFATAL_ERROR:
@@ -3359,7 +3368,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 				if (canRetryError(st->estatus))
 				{
 					if (verbose_errors)
-						commandError(st, PQerrorMessage(st->con));
+						commandError(st, errmsg);
 					goto error;
 				}
 				/* fall through */
@@ -3367,14 +3376,14 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 			default:
 				/* anything else is unexpected */
 				pg_log_error("client %d script %d aborted in command %d query %d: %s",
-							 st->id, st->use_file, st->command, qrynum,
-							 PQerrorMessage(st->con));
+							 st->id, st->use_file, st->command, qrynum, errmsg);
 				goto error;
 		}
 
 		PQclear(res);
 		qrynum++;
 		res = next_res;
+		pg_free(errmsg);
 	}
 
 	if (qrynum == 0)
-- 
2.43.0