Re: Fix around conn_duration in pgbench
Hello Fujii-san, On Wed, 28 Jul 2021 00:20:21 +0900 Fujii Masao wrote: > > > On 2021/07/27 11:02, Yugo NAGATA wrote: > > Hello Fujii-san, > > > > Thank you for looking at it. > > > > On Tue, 27 Jul 2021 03:04:35 +0900 > > Fujii Masao wrote: > + * Per-thread last disconnection time is not > measured because it > + * is already done when the transaction > successfully finished. > + * Also, we don't need it when the thread is > aborted because we > + * can't report complete results anyway in such > cases. > > What about commenting a bit more explicitly like the following? > > > In CSTATE_FINISHED state, this disconnect_all() is no-op under -C/--connect > because all the connections that this thread established should have already > been closed at the end of transactions. So we don't need to measure the > disconnection delays here. > > In CSTATE_ABORTED state, the measurement is no longer necessary because we > cannot report complete results anyways in this case. > Thank you for the suggestion. I updated the comment. > > > >> - /* no connection delay to record */ > >> - thread->conn_duration = 0; > >> + /* connection delay is measured globally between the barriers */ > >> > >> This comment is really correct? I was thinking that the measurement is not > >> necessary here because this is the case where -C option is not specified. > > > > This comment means that, when -C is not specified, the connection delay is > > measured between the barrier point where the benchmark starts > > > > /* READY */ > > THREAD_BARRIER_WAIT(&barrier); > > > > and the barrier point where all the thread finish making initial > > connections. > > > > /* GO */ > > THREAD_BARRIER_WAIT(&barrier); > > Ok, so you're commenting about the initial connection delay that's > measured when -C is not specified. But I'm not sure if this comment > here is really helpful. Seem rather confusing?? Ok. I removed this comment. > I found another disconnect_all(). > > /* XXX should this be connection time? */ > disconnect_all(state, nclients); > > The measurement is also not necessary here. > So the above comment should be removed or updated? I think this disconnect_all will be a no-op because all connections should be already closed in threadRun(), but I left it just to be sure that connections are all cleaned-up. I updated the comment for explaining above. I attached the updated patch. Could you please look at this? Regards, Yugo Nagata -- Yugo NAGATA diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 364b5a2e47..bf9649251b 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -3545,8 +3545,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) if (is_connect) { + pg_time_usec_t start = now; + + pg_time_now_lazy(&start); finishCon(st); - now = 0; + now = pg_time_now(); + thread->conn_duration += now - start; } if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded) @@ -3570,6 +3574,17 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) */ case CSTATE_ABORTED: case CSTATE_FINISHED: +/* + * In CSTATE_FINISHED state, this finishCon() is a no-op + * under -C/--connect because all the connections that this + * thread established should have already been closed at the end + * of transactions. So we don't need to measure the disconnection + * delays here. + * + * In CSTATE_ABORTED state, the measurement is no longer + * necessary because we cannot report complete results anyways + * in this case. 
+ */
 finishCon(st);
 return;
 }
@@ -6550,7 +6565,11 @@ main(int argc, char **argv)
 bench_start = thread->bench_start;
 }
- /* XXX should this be connection time? */
+ /*
+ * All connections should be already closed in threadRun(), so this
+ * disconnect_all() will be a no-op, but clean up the connections just
+ * to be sure. We don't need to measure the disconnection delays here.
+ */
 disconnect_all(state, nclients);
 /*
@@ -6615,6 +6634,7 @@ threadRun(void *arg)
 thread_start = pg_time_now();
 thread->started_time = thread_start;
+ thread->conn_duration = 0;
 last_report = thread_start;
 next_report = last_report + (int64) 100 * progress;
@@ -6638,14 +6658,6 @@ threadRun(void *arg)
 goto done;
 }
 }
-
- /* compute connection delay */
- thread->conn_duration = pg_time_now() - thread->started_time;
- }
- else
- {
- /* no connection delay to record */
- thread->conn_duration = 0;
 }
 /* GO */
@@ -6846,9 +6858,7 @@ threadRun(void *arg)
 }
 done:
- start = pg_time_now();
 disconnect_all(state, nstate);
- thread->conn_dur
Re: proposal: possibility to read dumped table's name from file
Hi

On Tue, 13 Jul 2021 at 01:16, Tom Lane wrote:
> Alvaro Herrera writes:
> > [1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed
> > plus lines starting with # are comments, seems plenty. Any line not
> > following that format would cause an error to be thrown.
>
> I'd like to see some kind of keyword on each line, so that we could extend
> the command set by adding new keywords. As this stands, I fear we'd end
> up using random punctuation characters in place of [+-], which seems
> pretty horrid from a readability standpoint.
>
> I think that this file format should be designed with an eye to allowing
> every, or at least most, pg_dump options to be written in the file rather
> than on the command line. I don't say we have to *implement* that right
> now; but if the format spec is incapable of being extended to meet
> requests like that one, I think we'll regret it. This line of thought
> suggests that the initial commands ought to match the existing
> include/exclude switches, at least approximately.
>
> Hence I suggest
>
> include table PATTERN
> exclude table PATTERN
>
> which ends up being the above but with words not [+-].

Here is an updated implementation of the filter file, implementing the
syntax you proposed.

Regards

Pavel

> regards, tom lane

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7682226b99..d0459b385e 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -789,6 +789,56 @@ PostgreSQL documentation
+
+ --filter=filename
+
+
+Read objects filters from the specified file.
+If you use "-" as a filename, the filters are read from stdin.
+The lines of this file must have the following format:
+
+(include|exclude)[table|schema|foreign_data|data] objectname
+
+
+
+
+The first keyword specifies whether the object is to be included
+or excluded, and the second keyword specifies the type of object
+to be filtered:
+table (table),
+schema (schema),
+foreign_data (foreign server),
+data (table data).
+
+
+
+With the following filter file, the dump would include table
+mytable1 and data from foreign tables of
+some_foreign_server foreign server, but exclude data
+from table mytable2.
+
+include table mytable1
+include foreign_data some_foreign_server
+exclude table mytable2
+
+
+
+
+The lines starting with symbol # are ignored.
+Previous white chars (spaces, tabs) are not allowed. These
+lines can be used for comments, notes. Empty lines are ignored too.
+
+
+
+The --filter option works just like the other
+options to include or exclude tables, schemas, table data, or foreign
+tables, and both forms may be combined. Note that there are no options
+to exclude a specific foreign table or to include a specific table's
+data.
+ + + + --if-exists diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 90ac445bcd..ba4c425ee6 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -55,10 +55,12 @@ #include "catalog/pg_trigger_d.h" #include "catalog/pg_type_d.h" #include "common/connect.h" +#include "common/string.h" #include "dumputils.h" #include "fe_utils/option_utils.h" #include "fe_utils/string_utils.h" #include "getopt_long.h" +#include "lib/stringinfo.h" #include "libpq/libpq-fs.h" #include "parallel.h" #include "pg_backup_db.h" @@ -308,7 +310,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions, static char *get_synchronized_snapshot(Archive *fout); static void setupDumpWorker(Archive *AHX); static TableInfo *getRootTableInfo(const TableInfo *tbinfo); - +static void read_patterns_from_file(char *filename, DumpOptions *dopt); int main(int argc, char **argv) @@ -380,6 +382,7 @@ main(int argc, char **argv) {"enable-row-security", no_argument, &dopt.enable_row_security, 1}, {"exclude-table-data", required_argument, NULL, 4}, {"extra-float-digits", required_argument, NULL, 8}, + {"filter", required_argument, NULL, 12}, {"if-exists", no_argument, &dopt.if_exists, 1}, {"inserts", no_argument, NULL, 9}, {"lock-wait-timeout", required_argument, NULL, 2}, @@ -613,6 +616,10 @@ main(int argc, char **argv) optarg); break; + case 12: /* filter implementation */ +read_patterns_from_file(optarg, &dopt); +break; + default: fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname); exit_nicely(1); @@ -1038,6 +1045,8 @@ help(const char *progname) " access to)\n")); printf(_(" --exclude-table-data=PATTERN do NOT dump data for the specified table(s)\n")); printf(_("
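For illustration, here is a minimal standalone sketch of parsing the line format described above. It is not Pavel's actual read_patterns_from_file(), which additionally has to feed pg_dump's internal pattern lists and do proper error reporting; all names here are illustrative:

#include <stdio.h>
#include <string.h>

/* Parse one filter line of the form "(include|exclude) OBJTYPE PATTERN". */
static int
parse_filter_line(char *line, int lineno)
{
	char	   *kw,
			   *type,
			   *pattern;

	/* Empty lines and lines starting with # are ignored. */
	if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
		return 0;

	kw = strtok(line, " \t\n");
	type = strtok(NULL, " \t\n");
	pattern = strtok(NULL, "\n");

	if (kw == NULL || type == NULL || pattern == NULL ||
		(strcmp(kw, "include") != 0 && strcmp(kw, "exclude") != 0))
	{
		fprintf(stderr, "invalid filter line %d\n", lineno);
		return -1;
	}

	/* "table", "schema", "foreign_data" or "data" choose the target list. */
	printf("%s %s: %s\n", kw, type, pattern);
	return 0;
}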
Problem about postponing gathering partial paths for topmost scan/join rel
Hi all,

In commit 3f90ec85 we are trying to postpone gathering partial paths for
topmost scan/join rel until we know the final targetlist, in order to allow
more accurate costing of parallel paths. We do this by the following code
snippet in standard_join_search:

+ /*
+ * Except for the topmost scan/join rel, consider gathering
+ * partial paths. We'll do the same for the topmost scan/join rel
+ * once we know the final targetlist (see grouping_planner).
+ */
+ if (lev < levels_needed)
+ generate_gather_paths(root, rel, false);

This change may cause a problem if the joinlist contains sub-joinlist nodes,
in which case 'lev == levels_needed' does not necessarily imply it's the
topmost for the final scan/join rel. It may only be the topmost scan/join
rel for the subproblem. And then we would miss the Gather paths for this
subproblem. It can be illustrated with the query below:

create table foo(i int, j int);
insert into foo select i, i from generate_series(1,5) i;
analyze foo;
set max_parallel_workers_per_gather to 4;
set parallel_setup_cost to 0;
set parallel_tuple_cost to 0;
set min_parallel_table_scan_size to 0;

# explain (costs off) select * from foo a join foo b on a.i = b.i full join foo c on b.i = c.i;
              QUERY PLAN
 Hash Full Join
   Hash Cond: (b.i = c.i)
   ->  Hash Join
         Hash Cond: (a.i = b.i)
         ->  Gather
               Workers Planned: 4
               ->  Parallel Seq Scan on foo a
         ->  Hash
               ->  Gather
                     Workers Planned: 4
                     ->  Parallel Seq Scan on foo b
   ->  Hash
         ->  Gather
               Workers Planned: 4
               ->  Parallel Seq Scan on foo c
(15 rows)

Please note how we do the join for rel a and b. We run Gather above the
parallel scan and then do the join above the Gather. These two base rels
are grouped in a subproblem because of the FULL JOIN. And due to the
mentioned code change, we are unable to gather partial paths for their
joinrel. If we can somehow fix this problem, then we would be able to do
better planning by running parallel join first and then doing Gather above
the join.

   ->  Gather
         Workers Planned: 4
         ->  Parallel Hash Join
               Hash Cond: (a.i = b.i)
               ->  Parallel Seq Scan on foo a
               ->  Parallel Hash
                     ->  Parallel Seq Scan on foo b

To fix this problem, I'm thinking we can leverage 'root->all_baserels' to
tell if we are at the topmost scan/join rel, something like:

--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3041,7 +3041,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 * partial paths. We'll do the same for the topmost scan/join rel
 * once we know the final targetlist (see grouping_planner).
 */
- if (lev < levels_needed)
+ if (!bms_equal(rel->relids, root->all_baserels))
 generate_useful_gather_paths(root, rel, false);

Any thoughts?

Thanks
Richard
Re: Have I found an interval arithmetic bug?
On Wed, 28 Jul 2021 at 00:08, John W Higgins wrote: > > It's nice to envision all forms of fancy calculations. But the fact is that > > '1.5 month'::interval * 2 != '3 month"::interval > That's not exactly true. Even without the patch: SELECT '1.5 month'::interval * 2 AS product, '3 month'::interval AS expected, justify_interval('1.5 month'::interval * 2) AS justified_product, '1.5 month'::interval * 2 = '3 month'::interval AS equal; product | expected | justified_product | equal +--+---+--- 2 mons 30 days | 3 mons | 3 mons| t (1 row) So it's equal even without calling justify_interval() on the result. FWIW, I remain of the opinion that the interval literal code should just spill down to lower units in all cases, just like the multiplication and division code, so that the results are consistent (barring floating point rounding errors) and explainable. Regards, Dean
Re: proposal: enhancing plpgsql debug API - returns text value of variable content
pá 23. 7. 2021 v 10:47 odesílatel Pavel Stehule napsal: > > > pá 23. 7. 2021 v 10:30 odesílatel Aleksander Alekseev < > aleksan...@timescale.com> napsal: > >> Hi Pavel, >> >> > I know it. Attached patch try to fix this issue >> > >> > I merged you patch (thank you) >> >> Thanks! I did some more minor changes, mostly in the comments. See the >> attached patch. Other than that it looks OK. I think it's Ready for >> Committer now. >> > > looks well, > > thank you very much > > Pavel > rebase Pavel > >> -- >> Best regards, >> Aleksander Alekseev >> > diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index 14bbe12da5..e0e68a2592 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -4059,6 +4059,8 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate, { (*plpgsql_plugin_ptr)->error_callback = plpgsql_exec_error_callback; (*plpgsql_plugin_ptr)->assign_expr = exec_assign_expr; + (*plpgsql_plugin_ptr)->eval_datum = exec_eval_datum; + (*plpgsql_plugin_ptr)->cast_value = do_cast_value; if ((*plpgsql_plugin_ptr)->func_setup) ((*plpgsql_plugin_ptr)->func_setup) (estate, func); diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y index 0f6a5b30b1..abe7d63f78 100644 --- a/src/pl/plpgsql/src/pl_gram.y +++ b/src/pl/plpgsql/src/pl_gram.y @@ -415,6 +415,7 @@ pl_block : decl_sect K_BEGIN proc_sect exception_sect K_END opt_label new->cmd_type = PLPGSQL_STMT_BLOCK; new->lineno = plpgsql_location_to_lineno(@2); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); new->label = $1.label; new->n_initvars = $1.n_initvars; new->initvarnos = $1.initvarnos; @@ -907,7 +908,8 @@ stmt_perform : K_PERFORM new = palloc0(sizeof(PLpgSQL_stmt_perform)); new->cmd_type = PLPGSQL_STMT_PERFORM; new->lineno = plpgsql_location_to_lineno(@1); - new->stmtid = ++plpgsql_curr_compile->nstatements; + new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); plpgsql_push_back_token(K_PERFORM); /* @@ -943,6 +945,7 @@ stmt_call : K_CALL new->cmd_type = PLPGSQL_STMT_CALL; new->lineno = plpgsql_location_to_lineno(@1); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); plpgsql_push_back_token(K_CALL); new->expr = read_sql_stmt(); new->is_call = true; @@ -962,6 +965,7 @@ stmt_call : K_CALL new->cmd_type = PLPGSQL_STMT_CALL; new->lineno = plpgsql_location_to_lineno(@1); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); plpgsql_push_back_token(K_DO); new->expr = read_sql_stmt(); new->is_call = false; @@ -1000,7 +1004,8 @@ stmt_assign : T_DATUM new = palloc0(sizeof(PLpgSQL_stmt_assign)); new->cmd_type = PLPGSQL_STMT_ASSIGN; new->lineno = plpgsql_location_to_lineno(@1); - new->stmtid = ++plpgsql_curr_compile->nstatements; + new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); new->varno = $1.datum->dno; /* Push back the head name to include it in the stmt */ plpgsql_push_back_token(T_DATUM); @@ -1022,6 +1027,7 @@ stmt_getdiag : K_GET getdiag_area_opt K_DIAGNOSTICS getdiag_list ';' new->cmd_type = PLPGSQL_STMT_GETDIAG; new->lineno = plpgsql_location_to_lineno(@1); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); new->is_stacked = $2; new->diag_items = $4; @@ -1194,6 +1200,7 @@ stmt_if : K_IF expr_until_then proc_sect stmt_elsifs stmt_else K_END K_IF ';' new->cmd_type = PLPGSQL_STMT_IF; new->lineno = plpgsql_location_to_lineno(@1); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); 
new->cond = $2; new->then_body = $3; new->elsif_list = $4; @@ -1299,6 +1306,7 @@ stmt_loop : opt_loop_label K_LOOP loop_body new->cmd_type = PLPGSQL_STMT_LOOP; new->lineno = plpgsql_location_to_lineno(@2); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); new->label = $1; new->body = $3.stmts; @@ -1317,6 +1325,7 @@ stmt_while : opt_loop_label K_WHILE expr_until_loop loop_body new->cmd_type = PLPGSQL_STMT_WHILE; new->lineno = plpgsql_location_to_lineno(@2); new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); new->label = $1; new->cond = $3; new->body = $4.stmts; @@ -1381,6 +1390,7 @@ for_control : for_variable K_IN new = palloc0(sizeof(PLpgSQL_stmt_dynfors)); new->cmd_type = PLPGSQL_STMT_DYNFORS; new->stmtid = ++plpgsql_curr_compile->nstatements; + new->ns = plpgsql_ns_top(); if ($1.row) { new->var = (P
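To see what the two new plugin hooks enable, here is a rough sketch of how a debugger plugin might render a variable's content as text. The hook signatures are assumed to match exec_eval_datum() and do_cast_value() in pl_exec.c, and the helper name is invented; treat this as a sketch, not the committed interface:

/*
 * Sketch: render a PL/pgSQL variable as text via the eval_datum and
 * cast_value callbacks the patch exposes to plugins.
 */
static char *
variable_get_text(PLpgSQL_execstate *estate, PLpgSQL_datum *datum)
{
	Oid			typoid;
	int32		typmod;
	Datum		value;
	bool		isnull;

	(*plpgsql_plugin_ptr)->eval_datum(estate, datum,
									  &typoid, &typmod, &value, &isnull);
	if (isnull)
		return pstrdup("NULL");

	/* Cast whatever type the variable holds to text for display. */
	value = (*plpgsql_plugin_ptr)->cast_value(estate, value, &isnull,
											  typoid, typmod,
											  TEXTOID, -1);
	return TextDatumGetCString(value);
}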
Re: needless complexity in StartupXLOG
At Tue, 27 Jul 2021 11:03:14 -0400, Robert Haas wrote in
> On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi wrote:
> > FWIW, by the way, I complained that the variable name "promoted" is a
> > bit confusing. Its old name was fast_promoted, which means that fast
> > promotion is being *requested* and ongoing. On the other hand the
> > current name "promoted" still means "(fast=non-fallback) promotion is
> > ongoing", so there was a conversation as follows.
> >
> > https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com
>
> I agree - that variable name is also not great. I am open to making
> improvements in that area and in others that have been suggested on
> this thread, but my immediate goal is to figure out whether anyone
> objects to me committing the posted patch. If nobody comes up with a
> reason why it's a bad idea in the next few days, I'll plan to move
> ahead with it.

That's fine with me. I still haven't found a way to lose the last
checkpoint due to software failure. Repeated promotion without having new
checkpoints is safe as expected. We don't remove WAL files unless a
checkpoint completes, and a checkpoint preserves segments back to the one
containing its redo point.

In short, I'm for it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: when the startup process doesn't (logging startup delays)
> > > I saw that you fixed it by calling InitStartupProgress() after the > > > walkdir() > > > calls which do pre_sync_fname. So then walkdir is calling > > > LogStartupProgress(STARTUP_PROCESS_OP_FSYNC) even when it's not doing > > > fsync, > > > and then LogStartupProgress() is returning because !AmStartupProcess(). > > > > I don't think walkdir() has any business calling LogStartupProgress() > > at all. It's supposed to be a generic function, not one that is only > > available to be called from the startup process, or has different > > behavior there. From my point of view, the right thing is to put the > > logging calls into the particular callbacks that SyncDataDirectory() > > uses. > > You're right - this is better. I also agree that this is the better place to do it. Hence modified the patch accordingly. The condition "!AmStartupProcess()" is added to differentiate whether the call is done from a startup process or some other process. Actually StartupXLOG() gets called in 2 places. one in StartupProcessMain() and the other in InitPostgres(). As the logging of the startup progress is required only during the startup process and not in the other cases, so added the condition to confirm the call is from the startup process. I did not find any other way to differentiate. Kindly let me know if there is any other better approach to do this. > > > Maybe I'm missing something here, but I don't understand the purpose > > > of this. You can always combine two functions into one, but it's only > > > worth doing if you end up with less code, which doesn't seem to be the > > > case here. > > > > 4 files changed, 39 insertions(+), 71 deletions(-) > > Hmm. I don't completely hate that, but I don't love it either. I don't > think the result is any easier to understand than the original, and in > some ways it's worse. In particular, you've flattened the separate > comments for each of the individual functions into a single-line > comment that is more generic than the comments it replaced. You could > fix that and you'd still be slightly ahead on LOC, but I'm not > convinced that it's actually a real win. As per my understanding there are no changes required wrt this. Hence not done any changes. Please find the updated patch attached. Kindly share your comments if any. On Mon, Jul 26, 2021 at 10:41 PM Robert Haas wrote: > > On Mon, Jul 26, 2021 at 11:30 AM Justin Pryzby wrote: > > > Maybe I'm missing something here, but I don't understand the purpose > > > of this. You can always combine two functions into one, but it's only > > > worth doing if you end up with less code, which doesn't seem to be the > > > case here. > > > > 4 files changed, 39 insertions(+), 71 deletions(-) > > Hmm. I don't completely hate that, but I don't love it either. I don't > think the result is any easier to understand than the original, and in > some ways it's worse. In particular, you've flattened the separate > comments for each of the individual functions into a single-line > comment that is more generic than the comments it replaced. You could > fix that and you'd still be slightly ahead on LOC, but I'm not > convinced that it's actually a real win. > > > > ... but I'm not exactly sure how to get there from here. Having only > > > LogStartupProgress() but having it do a giant if-statement to figure > > > out whether we're mid-phase or end-of-phase does not seem like the > > > right approach. > > > > I used a bool arg and negation to handle within a single switch. 
Maybe it's > > cleaner to use a separate enum value for each DONE, and set a local done > > flag. > > If we're going to go the route of combining the functions, I agree > that a Boolean is the way to go; I think we have some existing > precedent for 'bool finished' rather than 'bool done'. > > But I kind of wonder if we should have an enum in the first place. It > feels like we've got code in a bunch of places that just exists to > decide which enum value to use, and then code someplace else that > turns around and decides which message to produce. That would be > sensible if we were using the same enum values in lots of places, but > that's not the case here. So suppose we just moved the messages to the > places where we're now calling LogStartupProgress() or > LogStartupProgressComplete()? So something like this: > > if (should_report_startup_progress()) > ereport(LOG, > (errmsg("syncing data directory (syncfs), elapsed > time: %ld.%02d s, current path: %s", >secs, (usecs / 1), path))); > > Well, that doesn't quite work, because "secs" and "usecs" aren't going > to exist in the caller. We can fix that easily enough: let's just make > should_report_startup_progress take arguments long *secs, int *usecs, > and bool done. Then the above becomes: > > if (should_report_startup_progress(&secs, &usecs, false)) > ereport(LOG, > (errmsg("syncing data directory (syncfs), elapsed > time: %ld.%02d
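To make the proposed helper concrete, here is one possible shape for should_report_startup_progress() under Robert's description. The timer variable and phase-start bookkeeping are assumptions; only the signature comes from the mail above:

/*
 * Return true when a progress message is due, filling in the elapsed
 * time since the current phase started.  A sketch, not committed code.
 */
static TimestampTz startup_progress_phase_start;
static volatile sig_atomic_t startup_progress_timer_expired;

bool
should_report_startup_progress(long *secs, int *usecs, bool finished)
{
	/* Mid-phase reports are gated by an interval timer re-armed elsewhere. */
	if (!finished && !startup_progress_timer_expired)
		return false;

	TimestampDifference(startup_progress_phase_start,
						GetCurrentTimestamp(), secs, usecs);
	startup_progress_timer_expired = false;
	return true;
}

The caller would then print "%ld.%02d s" using usecs / 10000 to get hundredths of a second.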
Re: a thinko in b676ac443b6
On 7/28/21 3:15 AM, Amit Langote wrote: > On Wed, Jul 28, 2021 at 1:07 AM Tomas Vondra > wrote: >> On 7/27/21 4:28 AM, Amit Langote wrote: >>> I think it can be incorrect to use the same TupleDesc for both the >>> slots in ri_Slots (for ready-to-be-inserted tuples) and ri_PlanSlots >>> (for subplan output tuples). Especially if you consider what we did >>> in 86dc90056df that was committed into v14. In that commit, we >>> changed the way a subplan under ModifyTable produces its output for an >>> UPDATE statement. Previously, it would produce a tuple matching the >>> target table's TupleDesc exactly (plus any junk columns), but now it >>> produces only a partial tuple containing the values for the changed >>> columns. >>> >>> So it's better to revert to using planSlot->tts_tupleDescriptor for >>> the slots in ri_PlanSlots. Attached a patch to do so. >> >> Yeah, this seems like a clear mistake - thanks for noticing it! Clearly >> no regression test triggered the issue, so I wonder what's the best way >> to test it - any idea what would the test need to do? > > Ah, I should've mentioned that this is only a problem if the original > query is an UPDATE. With v14, only INSERTs can use batching and the > subplan does output a tuple matching the target table's TupleDesc in > their case, so the code seems to work fine. > > As I said, I noticed a problem when rebasing my patch to allow > cross-partition UPDATEs to use batching for the inserts that are > performed internally to implement such UPDATEs. The exact problem I > noticed is that the following Assert tts_virtual_copyslot() (via > ExecCopySlot called with an ri_PlanSlots[] entry) failed: > > Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts); > > srcdesc in this case is a slot in ri_PlanSlots[] initialized with the > target table's TupleDesc (the "thinko") and dstslot is the slot that > holds subplan's output tuple ('planSlot' passed to ExecInsert). As I > described in my previous email, dstslot's TupleDesc can be narrower > than the target table's TupleDesc in the case of an UPDATE, so the > Assert can fail in theory. > >> I did some quick experiments with batched INSERTs with RETURNING clauses >> and/or subplans, but I haven't succeeded in triggering the issue :-( > > Yeah, no way to trigger this except UPDATEs. It still seems like a > good idea to fix this in v14. > OK, thanks for the explanation. So it's benign in v14, but I agree it's better to fix it there too. I'll get this sorted/pushed. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
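In code terms, the fix Amit describes presumably boils down to allocating the ri_PlanSlots[] entry from the subplan output's descriptor rather than the target table's. A sketch of the idea, not the exact committed hunk:

/*
 * Use the subplan output's tuple descriptor, which for UPDATEs can be
 * narrower than the target table's TupleDesc.
 */
resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] =
	MakeSingleTupleTableSlot(planSlot->tts_tupleDescriptor,
							 planSlot->tts_ops);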
Re: .ready and .done files considered harmful
Hi,

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()?

As of now, shared memory is not attached to the archiver, so the archiver
cannot access ThisTimeLineID or a flag in shared memory.

if (strcmp(argv[1], "--forkbackend") == 0 ||
    strcmp(argv[1], "--forkavlauncher") == 0 ||
    strcmp(argv[1], "--forkavworker") == 0 ||
    strcmp(argv[1], "--forkboot") == 0 ||
    strncmp(argv[1], "--forkbgworker=", 15) == 0)
    PGSharedMemoryReAttach();
else
    PGSharedMemoryNoReAttach();

This is the reason we have thought of sending a notification to the archiver
when there is a timeline switch. Should we consider attaching shared memory
to the archiver process, or explore the notification mechanism further, to
avoid using SIGINT?

Thanks,
Dipesh
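Assuming the archiver can be attached to shared memory, the accessor suggested above might look like the sketch below. That the current timeline lives in XLogCtl and is protected by info_lck are assumptions about xlog.c internals:

/* Sketch of the suggested accessor; not a committed interface. */
TimeLineID
GetCurrentTimeLineID(void)
{
	TimeLineID	tli;

	SpinLockAcquire(&XLogCtl->info_lck);
	tli = XLogCtl->ThisTimeLineID;
	SpinLockRelease(&XLogCtl->info_lck);

	return tli;
}

The archiver could then cache the last value it saw and notice a timeline switch on its next cycle, with no signal at all.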
Re: Added schema level support for publication.
On Tue, Jul 27, 2021 at 5:11 AM Greg Nancarrow wrote: > > On Mon, Jul 26, 2021 at 3:21 PM vignesh C wrote: > > > > Thanks for the comment, this is modified in the v15 patch attached. > > > > I have several minor review comments. > > (1) src/backend/catalog/objectaddress.c > Should start comment sentences with an uppercase letter, for consistency. > > + /* fetch publication name and schema oid from input list */ > > I also notice that some 1-sentence comments end with "." (full-stop) > and others don't. It seems to alternate all over the place, and so is > quite noticeable. > Unfortunately, it already seems to be like this in much of the code > that this patch patches. > Ideally (at least my personal preference is) 1-sentence comments > should not end with a ".". Modified. > (2) src/backend/catalog/pg_publication.c > errdetail message > > I think the following should say "Temporary schemas ..." (since the > existing error message for tables says "System tables cannot be added > to publications."). > > + errdetail("Temporary schema cannot be added to publications."))); > Modified. > (3) src/backend/commands/publicationcmds.c > PublicationAddTables > > I think that the Assert below is not correct (i.e. not restrictive enough). > Although the condition passes, it is allowing, for example, > stmt->for_all_tables==true if stmt->schemas==NIL, and that doesn't > seem to be correct. > I suggest the following change: > > BEFORE: > + Assert(!stmt || !stmt->for_all_tables || !stmt->schemas); > AFTER: > + Assert(!stmt || (!stmt->for_all_tables && !stmt->schemas)); > Modified. > (4) src/backend/commands/publicationcmds.c > PublicationAddSchemas > > Similarly, I think that the Assert below is not restrictive enough, > and think it should be changed: > > BEFORE: > + Assert(!stmt || !stmt->for_all_tables || !stmt->tables); > AFTER: > + Assert(!stmt || (!stmt->for_all_tables && !stmt->tables)); > Modified. > (5) src/bin/pg_dump/common.c > > Spelling mistake. > > BEFORE: > + pg_log_info("reading publciation schemas"); > AFTER: > + pg_log_info("reading publication schemas"); Modified. Thanks for the comments, the comments are fixed in the v16 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm2LgV5XcLF80rJ60NwnjKpZj%3D%3DLxJpO4W2TG2G5XmUtDA%40mail.gmail.com Regards, Vignesh
Re: pg_receivewal starting position
On Wednesday, 28 July 2021 at 08:22:30 CEST, Kyotaro Horiguchi wrote:
> At Tue, 27 Jul 2021 07:50:39 +0200, Ronan Dunklau wrote in
> > Hello,
> >
> > I've noticed that pg_receivewal logic for deciding which LSN to start
> > streaming at consists of:
> > - looking up the latest WAL file in our destination folder, and resume from here
> > - if there isn't, use the current flush location instead.
> >
> > This behaviour surprised me when using it with a replication slot: I was
> > expecting it to start streaming at the last flushed location from the
> > replication slot instead. If you consider a backup tool which will take
> > pg_receivewal's output and transfer it somewhere else, using the
> > replication slot position would be the easiest way to ensure we don't
> > miss WAL files.
> >
> > Does that make sense ?
> >
> > I don't know if it should be the default, toggled by a command line flag,
> > or if we even should let the user provide a LSN.
>
> *I* think it is completely reasonable (or at least convenient or less
> astonishing) that pg_receivewal starts from the restart_lsn of the
> replication slot to use. The tool already decides the clean-start LSN
> a bit unusual way. And it seems to me that proposed behavior can be
> the default when -S is specified.

As of now we can't get the replication slot's restart_lsn over a replication
connection AFAIK. This implies that the patch could require the user to
specify a maintenance-db parameter, and we would use that if provided to
fetch the replication slot info, or fall back to the previous behaviour. I
don't really like this approach, as the behaviour changing depending on
whether we supply a maintenance-db parameter or not is error-prone for the
user.

Another option would be to add a new replication command (for example
ACQUIRE_REPLICATION_SLOT) to set the replication slot as the current one,
and return some info about it (restart_lsn at least for a physical slot).

I don't see any reason not to make it work for logical replication
connections / slots, but it wouldn't be that useful since we can query the
database in that case.

Acquiring the replication slot instead of just reading it would make sure
that no other process could start replication between the time we read the
restart_lsn and when we issue START_REPLICATION. START_REPLICATION could
then check if we already have a replication slot, and ensure it is the same
one as the one we're trying to use.

From pg_receivewal's point of view, this would amount to:

- check if we currently have WAL in the target directory.
- if we do, proceed as currently done, by computing the start LSN and
  timeline from the last archived WAL.
- if we don't, and we have a slot, run ACQUIRE_REPLICATION_SLOT. Use the
  restart_lsn as the start LSN if there is one, and don't provide a timeline.
- if we still don't have a start LSN, fall back to using the current server
  WAL position as is done now.

What do you think? Which information should we provide about the slot?

--
Ronan Dunklau
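Ronan's decision ladder could be sketched around pg_receivewal's existing FindStreamingStart(); GetSlotRestartLsn() below stands in for whatever the hypothetical ACQUIRE_REPLICATION_SLOT command would return:

/* Sketch of the proposed start-position logic; names are assumptions. */
static XLogRecPtr
decide_start_lsn(uint32 *timeline)
{
	/* 1. Resume from the newest WAL already in the target directory. */
	XLogRecPtr	lsn = FindStreamingStart(timeline);

	/* 2. Otherwise use the slot's restart_lsn, if a slot was given. */
	if (XLogRecPtrIsInvalid(lsn) && replication_slot != NULL)
		lsn = GetSlotRestartLsn(replication_slot);

	/* 3. Otherwise the caller falls back to the server's current position. */
	return lsn;
}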
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 28, 2021 at 2:26 AM Robert Haas wrote: > > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas wrote: > > My 0003 is where I see some lingering problems. It creates > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't > > need the xlogreader. But it doesn't really solve the problem of how > > checkpointer.c would be able to call this function with proper > > arguments. It is at least better in not needing two arguments to > > decide what to do, but how is checkpointer.c supposed to know what to > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know > > what to pass for EndOfLogTLI and EndOfLog? > > On further study, I found another problem: the way my patch set leaves > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which > will not be correctly initialized in any process other than the > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog) > would just be skipped. Your 0001 seems to have the same problem. You > added Assert(AmStartupProcess()) to the inside of the if > (ArchiveRecoveryRequested) block, but that doesn't fix anything. > Outside the startup process, ArchiveRecoveryRequested will always be > false, but the point is that the associated stuff should be done if > ArchiveRecoveryRequested would have been true in the startup process. > Both of our patch sets leave things in a state where that would never > happen, which is not good. Unless I'm missing something, it seems like > maybe you didn't test your patches to verify that, when the > XLogAcceptWrites() call comes from the checkpointer, all the same > things happen that would have happened had it been called from the > startup process. That would be a really good thing to have tested > before posting your patches. > My bad, I am extremely sorry about that. I usually do test my patches, but somehow I failed to test this change due to manually testing the whole ASRO feature and hurrying in posting the newest version. I will try to be more careful next time. > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly, > several TLIs stored in XLogCtl. None of them seem to be precisely the > same thing as EndLogTLI, but I am hoping that replayEndTLI is close > enough. I found out pretty quickly through testing that replayEndTLI > isn't always valid -- it ends up 0 if we don't enter recovery. That's > not really a problem, though, because we only need it to be valid if > ArchiveRecoveryRequested. The code that initializes and updates it > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested > = true will force InRecovery = true. So it looks to me like > replayEndTLI will always be initialized in the cases where we need a > value. It's not yet entirely clear to me if it has to have the same > value as EndOfLogTLI. I find this code comment quite mysterious: > > /* > * EndOfLogTLI is the TLI in the filename of the XLOG segment containing > * the end-of-log. It could be different from the timeline that EndOfLog > * nominally belongs to, if there was a timeline switch in that segment, > * and we were reading the old WAL from a segment belonging to a higher > * timeline. > */ > EndOfLogTLI = xlogreader->seg.ws_tli; > > The thing is, if we were reading old WAL from a segment belonging to a > higher timeline, wouldn't we have switched to that new timeline? AFAIUC, by browsing the code, yes, we are switching to the new timeline. 
Along with lastReplayedTLI, lastReplayedEndRecPtr is also the same as the
EndOfLog that we need when ArchiveRecoveryRequested is true. I went through
the original commit 7cbee7c0a1db and the thread [1] but didn't find any
related discussion of that.

> Suppose we want WAL segment 246 from TLI 1, but we don't have that
> segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> the TLI 2 version, we'd need to have TLI 2 in the history of the
> recovery_target_timeline. And if that is the case, then we would have
> to replay through the record where the timeline changes. And if we do
> that, then the discrepancy postulated by the comment cannot still
> exist by the time we reach this code, because this code is only
> reached after we finish WAL redo. So I'm baffled as to how this can
> happen, but considering how many cases there are in this code, I sure
> can't promise that it doesn't. The fact that we have few tests for any
> of this doesn't help either.

I am not an expert in this area, but I will try to spend some more time on
understanding and testing this.

[1] postgr.es/m/555dd101.7080...@iki.fi

Regards,
Amul
Re: perlcritic: prohibit map and grep in void context
On 7/27/21 12:06 PM, Dagfinn Ilmari Mannsåker wrote: > Hi hackers, > > In the patches for improving the MSVC build process, I noticed a use of > `map` in void context. This is considered bad form, and has a > perlcritic policy forbidding it: > https://metacpan.org/pod/Perl::Critic::Policy::BuiltinFunctions::ProhibitVoidMap. > > Attached is a patch that increases severity of that and the > corresponding `grep` policy to 5 to enable it in our perlcritic policy, > and fixes the one use that had already snuck in. > Personally I'm OK with it, but previous attempts to enforce perlcritic policies have met with a less than warm reception, and we had to back off. Maybe this one will fare better. I keep the buildfarm code perlcritic compliant down to severity 3 with a handful of exceptions. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
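For readers unfamiliar with the policy, the forbidden pattern and its idiomatic rewrite look like this (clean_file is just an illustrative name):

# Flagged by ProhibitVoidMap: map builds a result list that is thrown away.
map { clean_file($_) } @files;

# Idiomatic equivalent when only the side effect is wanted.
clean_file($_) foreach @files;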
Re: Out-of-memory error reports in libpq
On 7/27/21 6:40 PM, Tom Lane wrote: > While cleaning out dead branches in my git repo, I came across an > early draft of what eventually became commit ffa2e4670 ("In libpq, > always append new error messages to conn->errorMessage"). I realized > that it contained a good idea that had gotten lost on the way to that > commit. Namely, let's reduce all of the 60-or-so "out of memory" > reports in libpq to calls to a common subroutine, and then let's teach > the common subroutine a recovery strategy for the not-unlikely > possibility that it fails to append the "out of memory" string to > conn->errorMessage. That recovery strategy of course is to reset the > errorMessage buffer to empty, hopefully regaining some space. We lose > whatever we'd had in the buffer before, but we have a better chance of > the "out of memory" message making its way to the user. > > The first half of that just saves a few hundred bytes of repetitive > coding. However, I think that the addition of recovery logic is > important for robustness, because as things stand libpq may be > worse off than before for OOM handling. Before ffa2e4670, almost > all of these call sites did printfPQExpBuffer(..., "out of memory"). > That would automatically clear the message buffer to empty, and > thereby be sure to report the out-of-memory failure if at all > possible. Now we might fail to report the thing that the user > really needs to know to make sense of what happened. > > Therefore, I feel like this was an oversight in ffa2e4670, > and we ought to back-patch the attached into v14. > > cc'ing the RMT in case they wish to object. > > I'm honored you've confused me with Alvaro :-) This seems sensible, and we certainly shouldn't be worse off than before, so let's do it. I'm fine with having two functions for call simplicity, but I don't feel strongly about it. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
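The recovery strategy Tom describes could be as simple as the following sketch, built from the existing pqexpbuffer.h primitives; the subroutine name is illustrative:

/* Sketch of the common out-of-memory reporting subroutine. */
static void
pqReportOOM(PGconn *conn)
{
	appendPQExpBufferStr(&conn->errorMessage,
						 libpq_gettext("out of memory\n"));

	if (PQExpBufferBroken(&conn->errorMessage))
	{
		/*
		 * Appending failed for lack of memory.  Reset the buffer to empty,
		 * sacrificing its prior contents so that the OOM report itself fits.
		 */
		resetPQExpBuffer(&conn->errorMessage);
		appendPQExpBufferStr(&conn->errorMessage,
							 libpq_gettext("out of memory\n"));
	}
}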
Re: perlcritic: prohibit map and grep in void context
> On 28 Jul 2021, at 13:10, Andrew Dunstan wrote: > > On 7/27/21 12:06 PM, Dagfinn Ilmari Mannsåker wrote: >> Hi hackers, >> >> In the patches for improving the MSVC build process, I noticed a use of >> `map` in void context. This is considered bad form, and has a >> perlcritic policy forbidding it: >> https://metacpan.org/pod/Perl::Critic::Policy::BuiltinFunctions::ProhibitVoidMap. >> >> Attached is a patch that increases severity of that and the >> corresponding `grep` policy to 5 to enable it in our perlcritic policy, >> and fixes the one use that had already snuck in. > > Personally I'm OK with it, but previous attempts to enforce perlcritic > policies have met with a less than warm reception, and we had to back > off. Maybe this one will fare better. I'm fine with increasing this policy, but I don't have strong feelings. If we feel the perlcritic policy change is too much, I would still prefer to go ahead with the map rewrite part of the patch though. -- Daniel Gustafsson https://vmware.com/
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 28, 2021 at 4:37 PM Amul Sul wrote: > > On Wed, Jul 28, 2021 at 2:26 AM Robert Haas wrote: > > > > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas wrote: > > > My 0003 is where I see some lingering problems. It creates > > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't > > > need the xlogreader. But it doesn't really solve the problem of how > > > checkpointer.c would be able to call this function with proper > > > arguments. It is at least better in not needing two arguments to > > > decide what to do, but how is checkpointer.c supposed to know what to > > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know > > > what to pass for EndOfLogTLI and EndOfLog? > > > > On further study, I found another problem: the way my patch set leaves > > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which > > will not be correctly initialized in any process other than the > > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog) > > would just be skipped. Your 0001 seems to have the same problem. You > > added Assert(AmStartupProcess()) to the inside of the if > > (ArchiveRecoveryRequested) block, but that doesn't fix anything. > > Outside the startup process, ArchiveRecoveryRequested will always be > > false, but the point is that the associated stuff should be done if > > ArchiveRecoveryRequested would have been true in the startup process. > > Both of our patch sets leave things in a state where that would never > > happen, which is not good. Unless I'm missing something, it seems like > > maybe you didn't test your patches to verify that, when the > > XLogAcceptWrites() call comes from the checkpointer, all the same > > things happen that would have happened had it been called from the > > startup process. That would be a really good thing to have tested > > before posting your patches. > > > > My bad, I am extremely sorry about that. I usually do test my patches, > but somehow I failed to test this change due to manually testing the > whole ASRO feature and hurrying in posting the newest version. > > I will try to be more careful next time. > I was too worried about how I could miss that & after thinking more about that, I realized that the operation for ArchiveRecoveryRequested is never going to be skipped in the startup process and that never left for the checkpoint process to do that later. That is the reason that assert was added there. When ArchiveRecoveryRequested, the server will no longer be in the wal prohibited mode, we implicitly change the state to wal-permitted. Here is the snip from the 0003 patch: @@ -6614,13 +6629,30 @@ StartupXLOG(void) (errmsg("starting archive recovery"))); } - /* - * Take ownership of the wakeup latch if we're going to sleep during - * recovery. - */ if (ArchiveRecoveryRequested) + { + /* + * Take ownership of the wakeup latch if we're going to sleep during + * recovery. + */ OwnLatch(&XLogCtl->recoveryWakeupLatch); + /* + * Since archive recovery is requested, we cannot be in a wal prohibited + * state. + */ + if (ControlFile->wal_prohibited) + { + /* No need to hold ControlFileLock yet, we aren't up far enough */ + ControlFile->wal_prohibited = false; + ControlFile->time = (pg_time_t) time(NULL); + UpdateControlFile(); + + ereport(LOG, + (errmsg("clearing WAL prohibition because the system is in archive recovery"))); + } + } + > > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly, > > several TLIs stored in XLogCtl. 
None of them seem to be precisely the > > same thing as EndLogTLI, but I am hoping that replayEndTLI is close > > enough. I found out pretty quickly through testing that replayEndTLI > > isn't always valid -- it ends up 0 if we don't enter recovery. That's > > not really a problem, though, because we only need it to be valid if > > ArchiveRecoveryRequested. The code that initializes and updates it > > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested > > = true will force InRecovery = true. So it looks to me like > > replayEndTLI will always be initialized in the cases where we need a > > value. It's not yet entirely clear to me if it has to have the same > > value as EndOfLogTLI. I find this code comment quite mysterious: > > > > /* > > * EndOfLogTLI is the TLI in the filename of the XLOG segment containing > > * the end-of-log. It could be different from the timeline that EndOfLog > > * nominally belongs to, if there was a timeline switch in that segment, > > * and we were reading the old WAL from a segment belonging to a higher > > * timeline. > > */ > > EndOfLogTLI = xlogreader->seg.ws_tli; > > > > The thing is, if we were reading old WAL from a segment belonging to a > > higher timeline, wouldn't we have switched to that new timeline? > > AFAIUC, by browsing the code, yes, we are switching to the new > timeline. Along with lastReplayedTLI, lastReplayedEndRecPtr is also > the same as the
Re: RFC: Logging plan of the running query
On 2021-07-28 03:45, Pavel Stehule wrote:
> On Tue, 27 Jul 2021 at 20:34, Fujii Masao wrote:
>> On 2021/07/09 14:05, torikoshia wrote:
>>> On 2021-07-02 23:21, Bharath Rupireddy wrote:
>>>> On Tue, Jun 22, 2021 at 8:00 AM torikoshia wrote:
>>>>> Updated the patch.
>>>> Thanks for the patch. Here are some comments on the v4 patch:
>>> Thanks for your comments and suggestions! I agree with you and updated
>>> the patch.
>>> On Thu, Jul 1, 2021 at 3:34 PM Fujii Masao wrote:
>>>> DO $$ BEGIN PERFORM pg_sleep(100); END$$;
>>>> When I called pg_log_current_query_plan() to send the signal to the
>>>> backend executing the above query, I got the following log message. I
>>>> think that this is not expected message. I guess this issue happened
>>>> because the information about query text and plan is retrieved from
>>>> ActivePortal. If this understanding is right, ISTM that we should
>>>> implement new mechanism so that we can retrieve those information even
>>>> while nested query is being executed.
>>> I'm now working on this comment.
>> One idea is to define new global pointer, e.g., "QueryDesc
>> *ActiveQueryDesc;". This global pointer is set to queryDesc in
>> ExecutorRun() (also maybe ExecutorStart()). And this is reset to NULL in
>> ExecutorEnd() and when an error is thrown. Then
>> ProcessLogCurrentPlanInterrupt() can get the plan of the currently
>> running query from that global pointer instead of ActivePortal, and log
>> it. Thought?
> That cannot work: there can be many nested queries, and at the end you
> cannot simply reset the pointer to NULL; you have to restore the pointer
> to the outer query.

Thanks for your comment!

I'm wondering if we can avoid this problem by saving one outer-level
QueryDesc in addition to the current one. I'm going to try it.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
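The nesting problem Pavel points out is the usual save-and-restore one. Here is a sketch of how the proposed global could be maintained across nested executor calls, restoring the outer pointer on both normal and error exit; the global is from the proposal above, not committed code:

QueryDesc  *ActiveQueryDesc = NULL;

void
ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
			uint64 count, bool execute_once)
{
	QueryDesc  *save_ActiveQueryDesc = ActiveQueryDesc;

	ActiveQueryDesc = queryDesc;

	PG_TRY();
	{
		if (ExecutorRun_hook)
			(*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
		else
			standard_ExecutorRun(queryDesc, direction, count, execute_once);
	}
	PG_FINALLY();
	{
		/* Restore the outer query's descriptor rather than resetting to NULL. */
		ActiveQueryDesc = save_ActiveQueryDesc;
	}
	PG_END_TRY();
}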
Re: Reduce the number of special cases to build contrib modules on windows
On Wed, 28 Jul 2021 at 03:52, Dagfinn Ilmari Mannsåker wrote: > > Alvaro Herrera writes: > > I think using the return value of grep as a boolean is confusing. It > > seems more legible to compare to 0. So instead of this: > > > > + if (! grep { $_ eq $ref} @{ $self->{references} }) > > + { > > + push @{ $self->{references} }, $ref; > > + } > > > > use something like: > > > > + if (grep { $_ eq $ref} @{ $self->{references} } == 0) > > I disagree. Using grep in boolean context is perfectly idiomatic perl. > What would be more idiomatic is List::Util::any, but that's not availble > without upgrading List::Util from CPAN on Perls older than 5.20, so we > can't use that. Ok, if the grep stuff is ok as is with the boolean comparison then I'd say 0002 and 0003 of the attached are ok to go. I pushed the v9 0001 and 0005 patch after adjusting the AddFile($self, ...) to become $self->AddFile(...) I've adjusted the attached 0001 patch (previously 0002) to define LOWER_NODE in ltree.h as mentioned by Tom. 0004 still needs work. Thanks for all the reviews. David v10-0001-Adjust-MSVC-build-scripts-to-parse-Makefiles-for.patch Description: Binary data v10-0002-Make-the-includes-field-an-array-in-MSVC-build-s.patch Description: Binary data v10-0003-Don-t-duplicate-references-and-libraries-in-MSVC.patch Description: Binary data v10-0004-Remove-some-special-cases-from-MSVC-build-script.patch Description: Binary data
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, July 27, 2021 3:59 PM, Ajin Cherian wrote:
> On Thu, Jul 8, 2021 at 4:55 PM osumi.takami...@fujitsu.com wrote:
> > Attached file is the POC patch for this.
> > Current design is to save failed stats data in the ReplicationSlot struct.
> > This is because after the error, I'm not able to access the ReorderBuffer
> > object. Thus, I chose the object I can interact with at the
> > ReplicationSlotRelease timing.
>
> I think this is a good idea to capture the failed replication stats.
> But I'm wondering how you are deciding if the replication failed or not?
> Not all cases of ReplicationSlotRelease are due to a failure. It could also
> be due to a planned dropping of subscription or disable of subscription. I
> have not tested this but won't the failed stats be updated in this case as
> well? Is that correct?

Yes, what you said is true. Currently, when I run DROP SUBSCRIPTION or ALTER
SUBSCRIPTION ... DISABLE, failed stats values are added to
pg_stat_replication_slots unintentionally, if they have leftover values.

This is because all those commands, like the subscriber apply failure caused
by a duplication error, have the publisher get the 'X' message at
ProcessRepliesIfAny() and go into the path that calls
ReplicationSlotRelease(). Also, other events such as a server stop call the
same function in the end, which leads to a situation where, after a server
restart, the failed stats values are added to the existing (successful)
stats values.

Accordingly, I need to change the patch to handle those situations. Thank you.

Best Regards,
	Takamichi Osumi
Re: Showing applied extended statistics in explain
On Tue, Jul 27, 2021 at 4:50 PM Tom Lane wrote: > TBH I do not agree that this is a great idea. I think it's inevitably > exposing a lot of unstable internal planner details. Well, that is a risk, but I think I like the alternative even less. Imagine if we had a CREATE INDEX command but no way -- other than running queries and noticing how long they seem to take -- to tell whether it was being used. That would suck, a lot, and this seems to me to be exactly the same. A user who creates a statistics object has a legitimate interest in finding out whether that object is doing anything to a given query that they happen to care about. > I like even less > the aspect that this means a lot of information has to be added to the > finished plan in case it's needed for EXPLAIN. Aside from the sheer > cost of copying that data around, what happens if for example somebody > drops a statistic object between the time of planning and the EXPLAIN? > Are we going to start keeping locks on those objects for the lifetime > of the plans? I don't understand the premise of this question. We don't keep locks on tables or indexes involved in a plan for the lifetime of a plan, or on functions or any other kind of object either. We instead arrange to invalidate the plan if those objects are modified or dropped. Why would we not use the same approach here? -- Robert Haas EDB: http://www.enterprisedb.com
Re: .ready and .done files considered harmful
On Wed, Jul 28, 2021 at 6:48 AM Dipesh Pandit wrote: > As of now shared memory is not attached to the archiver. Archiver cannot > access ThisTimeLineID or a flag available in shared memory. If that is true, why are there functions PgArchShmemSize() and PgArchShmemInit(), and how does this statement in PgArchiverMain() manage not to core dump? /* * Advertise our pgprocno so that backends can use our latch to wake us up * while we're sleeping. */ PgArch->pgprocno = MyProc->pgprocno; I think what you are saying is true before v14, but not in v14 and master. -- Robert Haas EDB: http://www.enterprisedb.com
fixing pg_basebackup tests for modern Windows/msys2
While looking into issues with fairywren and pg_basebackup tests, I created a similar environment but with more modern Windows / msys2. Before it even got to the test that failed on fairywren it failed the first TAP test for a variety of reasons, all connected to TestLib::perl2host. First, this function is in some cases returning paths for directories with trailing slashes and or embedded double slashes. Both of these can cause problems, especially when written to a tablespace map file. Also, the cygpath invocation is returning a path with backslashes whereas "pwd -W' returns a path with forward slashes. So the first attached patch rectifies these problems. It fixes issues with doubles and trailing slashes and makes cygpath return a path with forward slashes just like the non-cygpath branch. However, there is another problem, which is that if called on a path that includes a symlink, on the test platform I set up it actually resolves that link rather than just following it. The end result is that the use of a shorter path via a symlink is effectively defeated. I haven't found any way to stop this behaviour. The second patch therefore adjusts the test to avoid calling perl2host on such a path. It just calls perl2host on the symlink's parent, and thereafter uses that result. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com >From eee31e6fd8f183ae0d836dfc131537912979d954 Mon Sep 17 00:00:00 2001 From: Andrew Dunstan Date: Mon, 26 Jul 2021 09:34:59 -0400 Subject: [PATCH 1/2] Make TestLib::perl2host more consistent and robust Sometimes cygpath has been observed to return a path with a trailing slash. That can cause problems, Also, make "cygpath" usage consistent with "pwd -W" with respect to the use of forward slashes. Backpatch to release 14 where the current code was introduced. --- src/test/perl/TestLib.pm | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index 15572abbea..cbab1587cc 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -304,6 +304,8 @@ except for the case of Perl=msys and host=mingw32. The subject need not exist, but its parent or grandparent directory must exist unless cygpath is available. +The returned path uses forward slashes but has no trailing slash. + =cut sub perl2host @@ -313,10 +315,11 @@ sub perl2host if ($is_msys2) { # get absolute, windows type path - my $path = qx{cygpath -a -w "$subject"}; + my $path = qx{cygpath -a -m "$subject"}; if (!$?) { chomp $path; + $path =~ s!/$!!; return $path if $path; } # fall through if this didn't work. @@ -342,6 +345,7 @@ sub perl2host # this odd way of calling 'pwd -W' is the only way that seems to work. my $dir = qx{sh -c "pwd -W"}; chomp $dir; + $dir =~ s!/$!!; chdir $here; return $dir . $leaf; } -- 2.25.4 >From c50c319df2fba24779861f410c5dda8303e80a28 Mon Sep 17 00:00:00 2001 From: Andrew Dunstan Date: Wed, 28 Jul 2021 07:29:38 -0400 Subject: [PATCH 2/2] Avoid calling TestLib::perl2host on a symlinked directory Certain versions of msys2/Windows have been observed to resolve symlinks in perl2host rather than just follow them. This defeats using a symlinked shorter path to a longer path, and makes certain tests fail. We therefore call perl2host on the parent directory of the symlink and thereafter just use that result. Apply to release 14 where the problem has been observed. 
--- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index 74f8c2c739..bde31b3c03 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -238,11 +238,13 @@ $node->start; # to our physical temp location. That way we can use shorter names # for the tablespace directories, which hopefully won't run afoul of # the 99 character length limit. -my $shorter_tempdir = TestLib::tempdir_short . "/tempdir"; +my $sys_tempdir = TestLib::tempdir_short; +my $real_sys_tempdir = TestLib::perl2host($sys_tempdir) . "/tempdir"; +my $shorter_tempdir = $sys_tempdir . "/tempdir"; dir_symlink "$tempdir", $shorter_tempdir; mkdir "$tempdir/tblspc1"; -my $realTsDir= TestLib::perl2host("$shorter_tempdir/tblspc1"); +my $realTsDir= "$real_sys_tempdir/tblspc1"; my $real_tempdir = TestLib::perl2host($tempdir); $node->safe_psql('postgres', "CREATE TABLESPACE tblspc1 LOCATION '$realTsDir';"); @@ -275,7 +277,7 @@ SKIP: # Recover tablespace into a new directory (not where it was!) my $repTsDir = "$tempdir/tblspc1replica"; - my $realRepTsDir = TestLib::perl2host("$shorter_tempdir/tblspc1replica"); + my $realRepTsDir = "$real_sys_tempdir/tblspc1replica"; mkdir $repTsDir; TestLib::system_or_bail($tar, 'xf', $tblspc_tars[0], '-C', $repTsDir); @@ -390,7 +392,7 @@ ok( -d "$te
Re: when the startup process doesn't (logging startup delays)
On Wed, Jul 28, 2021 at 5:24 AM Nitin Jadhav wrote: > I also agree that this is the better place to do it. Hence modified > the patch accordingly. The condition "!AmStartupProcess()" is added to > differentiate whether the call is done from a startup process or some > other process. Actually StartupXLOG() gets called in 2 places: one in > StartupProcessMain() and the other in InitPostgres(). As the logging > of the startup progress is required only during the startup process > and not in the other cases, The InitPostgres() case occurs when the server is started in bootstrap mode (during initdb) or in single-user mode (postgres --single). I do not see any reason why we shouldn't produce progress messages in at least the latter case. I suspect that someone who is in the rather desperate scenario of having to use single-user mode would really like to know how long the server is going to take to start up. Perhaps during initdb we don't want messages, but I'm not sure that we need to do anything about that here. None of the messages that the server normally produces show up when you run initdb, so I guess they are getting redirected to /dev/null or something. So I don't think that using AmStartupProcess() for this purpose is right. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit namespace in post-copy output
I took a look at this. I agree with the reviewer that it's a good change. The output from multiple jobs in vacuumdb is clearly easier to parse with this since the initial LOG and later DETAIL can be interleaved with other relations of the same name in other namespaces. + get_namespace_name(RelationGetNamespace(OldHeap)), Since get_namespace_name() returns a palloced string, this will lead to a 2x leak of the namespace length as opposed to the 1x of today. While hardly a big deal, it seems prudent to cap this by storing the returned string locally now that we need it twice. I've updated the patch with this, see the attached v2. Barring objections I will go ahead with this. -- Daniel Gustafsson https://vmware.com/ v2-0001-Emit-namespace-in-the-post-copy-errmsg.patch Description: Binary data
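For illustration, the v2 change amounts to something like this, assuming both messages live in the same function (the errmsg wording is indicative only):

    /* Fetch the namespace name once; reuse the palloc'd string in both messages. */
    char       *nspname = get_namespace_name(RelationGetNamespace(OldHeap));

    ereport(elevel,
            (errmsg("vacuuming \"%s.%s\"",
                    nspname, RelationGetRelationName(OldHeap))));

    /* ... copy phase runs here ... */

    ereport(elevel,
            (errmsg("\"%s.%s\": found %.0f removable, %.0f nonremovable row versions",
                    nspname, RelationGetRelationName(OldHeap),
                    tups_vacuumed, num_tuples)));
    pfree(nspname);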
Re: Have I found an interval arithmetic bug?
On Wed, Jul 28, 2021 at 12:42 AM Dean Rasheed wrote:
> On Wed, 28 Jul 2021 at 00:08, John W Higgins wrote:
> >
> > It's nice to envision all forms of fancy calculations. But the fact is that
> >
> > '1.5 month'::interval * 2 != '3 month'::interval
> >
>
> That's not exactly true. Even without the patch:
>
> SELECT '1.5 month'::interval * 2 AS product,
>        '3 month'::interval AS expected,
>        justify_interval('1.5 month'::interval * 2) AS justified_product,
>        '1.5 month'::interval * 2 = '3 month'::interval AS equal;
>
>     product     | expected | justified_product | equal
> ----------------+----------+-------------------+-------
>  2 mons 30 days | 3 mons   | 3 mons            | t
> (1 row)

That's viewing something via the mechanism that is incorrectly (technically speaking) doing the work in the first place. It believes they are the same - but they are clearly not when actually used.

select '1/1/2001'::date + (interval '3 month');
      ?column?
---------------------
 2001-04-01 00:00:00
(1 row)

vs

select '1/1/2001'::date + (interval '1.5 month' * 2);
      ?column?
---------------------
 2001-03-31 00:00:00
(1 row)

That's the flaw in this entire body of work - we keep taking fractional amounts, doing round-offs, and then trying to add or multiply the pieces back, ending up with weird floating-point-math-style errors. That's not to complain about it - but we shouldn't be looking at edge cases with things like 1 month * 1.234 when 1.5 months * 2 doesn't work properly.

John

P.S. Finally we have items like this

select '12/1/2001'::date + (interval '1.5 months' * 2);
      ?column?
---------------------
 2002-03-03 00:00:00
(1 row)

postgres=# select '1/1/2001'::date + (interval '1.5 months' * 2);
      ?column?
---------------------
 2001-03-31 00:00:00
(1 row)

Which only has a 28-day gap because of the length of February - clearly this is not working quite right.
Re: Some code cleanup for pgbench and pg_verifybackup
On Tue, Jul 27, 2021 at 11:18 PM Michael Paquier wrote: > I have been looking at that. Here are some findings: > - In pg_archivecleanup, CleanupPriorWALFiles() does not bother > exit()'ing with an error code. Shouldn't we worry about reporting > that correctly? It seems strange to not inform users about paths that > would be broken, as that could bloat the archives without one knowing > about it. > - In pg_basebackup.c, ReceiveTarAndUnpackCopyChunk() would not > exit() when failing to change permissions for non-WIN32. > - pg_recvlogical is missing a failure handling for fstat(), as of > 5c0de38. > - pg_verifybackup is in the wrong, as mentioned upthread. > > Thoughts? All that does not seem to enter into the category of things > worth back-patching, except for pg_verifybackup. I think all of these are reasonable fixes. In the case of pg_basebackup, a chmod() failure doesn't necessarily oblige us to give up and exit; we could presumably still write the data. But it may be best to give up and exit. The other cases appear to be clear bugs. -- Robert Haas EDB: http://www.enterprisedb.com
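A sketch of the pg_basebackup piece, treating the chmod() failure as fatal (variable names are illustrative, not the actual ones in ReceiveTarAndUnpackCopyChunk()):

    #ifndef WIN32
        if (chmod(filename, (mode_t) filemode) != 0)
        {
            /* we could still write the data, but give up and exit instead */
            pg_log_error("could not set permissions on file \"%s\": %m",
                         filename);
            exit(1);
        }
    #endif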
Re: Case expression pushdown
On 26/07/2021 at 18:03, Alexander Pyhalov wrote: Tom Lane wrote on 2021-07-26 18:18: Alexander Pyhalov writes: [ 0001-Allow-pushing-CASE-expression-to-foreign-server-v6.patch ] This doesn't compile cleanly: deparse.c: In function 'foreign_expr_walker.isra.4': deparse.c:920:8: warning: 'collation' may be used uninitialized in this function [-Wmaybe-uninitialized] if (collation != outer_cxt->collation) ^ deparse.c:914:3: warning: 'state' may be used uninitialized in this function [-Wmaybe-uninitialized] switch (state) ^~ These uninitialized variables very likely explain the fact that it fails regression tests, both for me and for the cfbot. Even if this weren't an outright bug, we don't tolerate code that produces warnings on common compilers. regards, tom lane Hi. Of course, this is a patch issue. I don't understand how I overlooked this. Rebased on master and fixed it. Tests are passing here (but they also passed for the previous patch version). What exact tests are failing? I confirm that there is no compilation warning and all regression tests pass successfully for the v7 patch. I have not checked the previous patch, but this one doesn't fail on cfbot either. Best regards, -- Gilles Darold
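The fix is essentially giving both variables definite initial values so that no code path can reach them uninitialized, e.g. (types as used in postgres_fdw's deparse.c):

    /* silence -Wmaybe-uninitialized and make the defaults explicit */
    FDWCollateState state = FDW_COLLATE_NONE;
    Oid         collation = InvalidOid;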
Re: CREATE SEQUENCE with RESTART option
On Wed, Jul 28, 2021 at 11:50 AM Michael Paquier wrote: > > On Mon, Jul 26, 2021 at 04:57:53PM +0900, Michael Paquier wrote: > > FWIW, like Ashutosh upthread, my vote would be to do nothing here in > > terms of behavior changes as this is just breaking a behavior for the > > sake of breaking it, so there are chances that this is going to piss > > off some users who relied accidentally on the existing behavior. > > In short, I would be tempted with something like the attached, that > documents RESTART in CREATE SEQUENCE, while describing its behavior > according to START. In terms of regression tests, there is already a > lot in this area with ALTER SEQUENCE, but I think that having two > tests makes sense for CREATE SEQUENCE: one for RESTART without a > value and one with, where both explicitly set START. > > Thoughts? -1. IMHO, this is something creating more confusion for the user. We say that we allow both START and RESTART, and that RESTART is accepted as a consequence of our internal option handling in gram.y. Instead, I recommend throwing errorConflictingDefElem or errmsg("START and RESTART are mutually exclusive options"). We do throw these errors in a lot of other places for various options. Others may have better thoughts though. Regards, Bharath Rupireddy.
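A rough sketch of the suggested check, placed wherever the CREATE SEQUENCE options have been collected (the variable names are assumptions):

    /* reject the combination outright instead of silently preferring one */
    if (start_value != NULL && restart_value != NULL)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("START and RESTART are mutually exclusive options")));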
Re: Out-of-memory error reports in libpq
Michael Paquier writes: > On Tue, Jul 27, 2021 at 10:31:25PM -0400, Tom Lane wrote: >> Yeah, there are half a dozen places that currently print something >> more specific than "out of memory". I judged that the value of this >> was not worth the complexity it'd add to support it in this scheme. >> Different opinions welcome of course. > I don't mind either that this removes a bit of context. For > unlikely-going-to-happen errors that's not worth the extra translation > cost. Yeah, the extra translatable strings are the main concrete cost of keeping this behavior. But I'm dubious that labeling a small number of the possible OOM points is worth anything, especially if they're not providing the failed allocation request size. You can't tell if that request was unreasonable or if it was just an unlucky victim of bloat elsewhere. Unifying the reports into a common function could be a starting point for more consistent/detailed OOM reports, if anyone cared to work on that. (I hasten to add that I don't.) > + pqReportOOMBuffer(&conn->errorMessage); > Not much a fan of having two routines to do this job though. I would > vote for keeping the one named pqReportOOM() with PQExpBuffer as > argument. Here I've got to disagree. We do need the form with a PQExpBuffer argument, because there are some places where that isn't a pointer to a PGconn's errorMessage. But the large majority of the calls are "pqReportOOM(conn)", and I think having to write that as "pqReportOOM(&conn->errorMessage)" is fairly ugly and perhaps error-prone. I'm not wedded to the name "pqReportOOMBuffer" though --- maybe there's some better name for that one? regards, tom lane
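Putting the two routines from this exchange side by side, roughly (the recovery logic for when even the append fails is elided):

    /* common case: report OOM into a PGconn's errorMessage */
    void
    pqReportOOM(PGconn *conn)
    {
        pqReportOOMBuffer(&conn->errorMessage);
    }

    /* general case: report OOM into any PQExpBuffer */
    void
    pqReportOOMBuffer(PQExpBuffer errorMessage)
    {
        appendPQExpBufferStr(errorMessage,
                             libpq_gettext("out of memory\n"));
    }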
Re: [Proposal] Global temporary tables
Hi Wenjing, would you please rebase the code? Thank you very much. Tony

The new status of this patch is: Waiting on Author
Re: Have I found an interval arithmetic bug?
On Wed, Jul 28, 2021 at 08:42:31AM +0100, Dean Rasheed wrote: > On Wed, 28 Jul 2021 at 00:08, John W Higgins wrote: > > > > It's nice to envision all forms of fancy calculations. But the fact is that > > > > '1.5 month'::interval * 2 != '3 month'::interval > > > > That's not exactly true. Even without the patch: > > SELECT '1.5 month'::interval * 2 AS product, > '3 month'::interval AS expected, > justify_interval('1.5 month'::interval * 2) AS justified_product, > '1.5 month'::interval * 2 = '3 month'::interval AS equal; > > product | expected | justified_product | equal > ----------------+----------+-------------------+------- > 2 mons 30 days | 3 mons | 3 mons | t > (1 row) > > So it's equal even without calling justify_interval() on the result. > > FWIW, I remain of the opinion that the interval literal code should > just spill down to lower units in all cases, just like the > multiplication and division code, so that the results are consistent > (barring floating point rounding errors) and explainable. Here is a more minimal patch that doesn't change the spill-down units at all, but merely documents it, and changes the spilldown to months to round instead of truncate. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion. diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 453115f942..bd938a675a 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -2840,15 +2840,17 @@ P years-months- - In the verbose input format, and in some fields of the more compact - input formats, field values can have fractional parts; for example - '1.5 week' or '01:02:03.45'. Such input is - converted to the appropriate number of months, days, and seconds - for storage. When this would result in a fractional number of - months or days, the fraction is added to the lower-order fields - using the conversion factors 1 month = 30 days and 1 day = 24 hours. - For example, '1.5 month' becomes 1 month and 15 days. - Only seconds will ever be shown as fractional on output. + Field values can have fractional parts; for example, '1.5 + weeks' or '01:02:03.45'. However, + because interval internally stores only three integer units (months, + days, seconds), fractional units must be spilled to smaller units. + For example, because months are approximated to equal 30 days, + fractional values of units greater than months are rounded to the + nearest integer number of months. Fractional units of months or less + are computed to be an integer number of days and seconds, assuming + 24 hours per day. For example, '1.5 months' + becomes 1 month 15 days. Only seconds will ever + be shown as fractional on output. 
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index 54ae632de2..cb3fa85892 100644 --- a/src/backend/utils/adt/datetime.c +++ b/src/backend/utils/adt/datetime.c @@ -3306,29 +3306,25 @@ DecodeInterval(char **field, int *ftype, int nf, int range, case DTK_YEAR: tm->tm_year += val; - if (fval != 0) - tm->tm_mon += fval * MONTHS_PER_YEAR; + tm->tm_mon += rint(fval * MONTHS_PER_YEAR); tmask = DTK_M(YEAR); break; case DTK_DECADE: tm->tm_year += val * 10; - if (fval != 0) - tm->tm_mon += fval * MONTHS_PER_YEAR * 10; + tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 10); tmask = DTK_M(DECADE); break; case DTK_CENTURY: tm->tm_year += val * 100; - if (fval != 0) - tm->tm_mon += fval * MONTHS_PER_YEAR * 100; + tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 100); tmask = DTK_M(CENTURY); break; case DTK_MILLENNIUM: tm->tm_year += val * 1000; - if (fval != 0) - tm->tm_mon += fval * MONTHS_PER_YEAR * 1000; + tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 1000); tmask = DTK_M(MILLENNIUM); break; @@ -3565,7 +3561,7 @@ DecodeISO8601Interval(char *str, { case 'Y': tm->tm_year += val; - tm->tm_mon += (fval * MONTHS_PER_YEAR); + tm->tm_mon += rint(fval * MONTHS_PER_YEAR); break; case 'M': tm->tm_mon += val; @@ -3601,7 +3597,7 @@ DecodeISO8601Interval(char *str, return DTERR_BAD_FORMAT; tm->tm_year += val; - tm->tm_mon += (fval * MONTHS_PER_YEAR); + tm->tm_mon += rint(fval * MONTHS_PER_YEAR); if (unit == '\0') return 0; if (unit == 'T') diff --git a/src/interfaces/ecpg/pgtypeslib/interval.c b/src/interfaces/ecpg/pgtypeslib/interval.c index 02b3c47223..a7e530cb5d 100644 --- a/src/interfaces/ecpg/pgtypeslib/interval.c +++ b/src/interfaces/ecpg/pgtypeslib/interval.c @@ -155,7 +155,7 @@ DecodeISO8601Interval(char *str, { case 'Y': tm
Re: when the startup process doesn't (logging startup delays)
On Wed, Jul 28, 2021 at 7:02 PM Robert Haas wrote: > > On Wed, Jul 28, 2021 at 5:24 AM Nitin Jadhav > wrote: > > I also agree that this is the better place to do it. Hence modified > > the patch accordingly. The condition "!AmStartupProcess()" is added to > > differentiate whether the call is done from a startup process or some > > other process. Actually StartupXLOG() gets called in 2 places: one in > > StartupProcessMain() and the other in InitPostgres(). As the logging > > of the startup progress is required only during the startup process > > and not in the other cases, > > The InitPostgres() case occurs when the server is started in bootstrap > mode (during initdb) or in single-user mode (postgres --single). I do > not see any reason why we shouldn't produce progress messages in at > least the latter case. I suspect that someone who is in the rather > desperate scenario of having to use single-user mode would really like > to know how long the server is going to take to start up. +1 to emit the same log messages in single-user mode and basically whoever calls StartupXLOG. Do we need to adjust the GUC parameter log_startup_progress_interval (a reasonable value) so that the logs are generated by default? > Perhaps during initdb we don't want messages, but I'm not sure that we > need to do anything about that here. None of the messages that the > server normally produces show up when you run initdb, so I guess they > are getting redirected to /dev/null or something. I enabled the below log message in XLogFlush and ran initdb, it is printing these logs onto the stdout, looks like the logs have not been redirected to /dev/null in initdb/bootstrap mode. #ifdef WAL_DEBUG if (XLOG_DEBUG) elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X", LSN_FORMAT_ARGS(record), LSN_FORMAT_ARGS(LogwrtResult.Write), LSN_FORMAT_ARGS(LogwrtResult.Flush)); #endif So, even in bootstrap mode, can we use the above #ifdef WAL_DEBUG and XLOG_DEBUG to print those logs? It looks like the problem with these macros is that they are not settable by normal users in the production environment, but only by the developers. I'm not sure how much it is helpful to print the startup process progress logs in bootstrap mode. Maybe we can use the IsBootstrapProcessingMode macro to disable these logs in the bootstrap mode. > So I don't think that using AmStartupProcess() for this purpose is right. Agree. Regards, Bharath Rupireddy.
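A minimal sketch of that suppression, assuming a reporting helper along the lines the patch adds (the function name is illustrative):

    static void
    report_startup_progress(const char *msg)
    {
        /* stay quiet during initdb, as discussed */
        if (IsBootstrapProcessingMode())
            return;

        ereport(LOG, (errmsg_internal("%s", msg)));
    }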
Re: Minimal logical decoding on standbys
On 2021-Jul-27, Drouvot, Bertrand wrote: > diff --git a/src/backend/utils/cache/lsyscache.c > b/src/backend/utils/cache/lsyscache.c > +bool > +get_rel_logical_catalog(Oid relid) > +{ > + bool res; > + Relation rel; > + > + /* assume previously locked */ > + rel = table_open(relid, NoLock); > + res = RelationIsAccessibleInLogicalDecoding(rel); > + table_close(rel, NoLock); > + > + return res; > +} So RelationIsAccessibleInLogicalDecoding() does a cheap check for wal_level which can be done without opening the table; I think this function should be rearranged to avoid doing that when not needed. Also, putting this function in lsyscache.c seems somewhat wrong since it's not merely accessing the system caches ... I think it would be better to move this elsewhere (relcache.c, proto in relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and short-circuit for the check that can be done before opening the table. At least the GiST code appears to be able to call this several times per vacuum run, so it makes sense to short-circuit it for the fast case. ... though looking at the GiST code again I wonder if it would be more sensible to just stash the table's Relation pointer somewhere in the context structs instead of opening and closing it time and again. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Investigación es lo que hago cuando no sé lo que estoy haciendo" (Wernher von Braun)
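A sketch of the proposed rearrangement, assuming the cheap check inside RelationIsAccessibleInLogicalDecoding() is the XLogLogicalInfoActive() test:

    bool
    RelationIdIsAccessibleInLogicalDecoding(Oid relid)
    {
        bool        res;
        Relation    rel;

        /* short-circuit before paying for table_open() */
        if (!XLogLogicalInfoActive())
            return false;

        /* assume previously locked */
        rel = table_open(relid, NoLock);
        res = RelationIsAccessibleInLogicalDecoding(rel);
        table_close(rel, NoLock);

        return res;
    }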
Re: Have I found an interval arithmetic bug?
On Tue, Jul 27, 2021 at 4:02 PM Tom Lane wrote: > What I think we have consensus on is that interval_in is doing the > wrong thing in a particular corner case. I have heard nobody but > you suggesting that we should start undertaking behavioral changes > in other interval functions, and I don't believe that that's a good > road to start going down. These behaviors have stood for many years. > Moreover, since the whole thing is by definition operating with > inadequate information, it is inevitable that for every case you > make better there will be another one you make worse. I agree that we need to be really conservative here. I think Tom is right that if we start changing behaviors that "seem wrong," we will probably make some things better and other things worse. The overall amount of stuff that "seems wrong" will probably not go down, but a lot of people's applications will break when they try to upgrade to v15. That's not going to be a win overall. I think a lot of the discussion on this thread consists of people hoping for things that are not very realistic. The interval type represents the number of months as an integer, and the number of days as an integer. That means that an interval like '0.7 months' does not really exist. If you ask for that interval what you get is actually 21 days, which is a reasonable approximation of 0.7 months but not the same thing, except in April, June, September, and November. So when you then say that you want 0.7 months + 0.3 months to equal 1.0 months, what you're really requesting is that 21 days + 9 days = 1 month. That system has been tried in the past, but it was abandoned roughly around the time of Julius Caesar for the very good reason that the orbital period of the earth about the sun is noticeably greater than 360 days. It would be entirely possible to design a data type that could represent such values more exactly. A data type that had a representation similar to interval but with double values for the numbers of years and months would be able to compute 0.7 months + 0.3 months and get 1.0 months with no problem. If we were doing this over again, I would argue that, with this on-disk representation, 0.7 months ought to be rejected as invalid input, because it's generally not a good idea to have a data type that silently converts a value into a different value that is not equivalent for all purposes. It is confusing and causes people to expect behavior different from what they will actually get. Now, it seems far too late to consider such a change at this point ... and it is also no good considering a change to the on-disk representation of the existing data type at this point ... but it is also no good pretending like we have a floating-point representation of months and days when we actually do not. -- Robert Haas EDB: http://www.enterprisedb.com
Re: when the startup process doesn't (logging startup delays)
On Wed, Jul 28, 2021 at 11:25 AM Bharath Rupireddy wrote:
> > Perhaps during initdb we don't want messages, but I'm not sure that we
> > need to do anything about that here. None of the messages that the
> > server normally produces show up when you run initdb, so I guess they
> > are getting redirected to /dev/null or something.
>
> I enabled the below log message in XLogFlush and ran initdb, it is
> printing these logs onto the stdout, looks like the logs have not been
> redirected to /dev/null in initdb/bootstrap mode.
>
> #ifdef WAL_DEBUG
>     if (XLOG_DEBUG)
>         elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
>              LSN_FORMAT_ARGS(record),
>              LSN_FORMAT_ARGS(LogwrtResult.Write),
>              LSN_FORMAT_ARGS(LogwrtResult.Flush));
> #endif
>
> So, even in bootstrap mode, can we use the above #ifdef WAL_DEBUG and
> XLOG_DEBUG to print those logs? It looks like the problem with these
> macros is that they are not settable by normal users in the production
> environment, but only by the developers. I'm not sure how much it is
> helpful to print the startup process progress logs in bootstrap mode.
> Maybe we can use the IsBootstrapProcessingMode macro to disable these
> logs in the bootstrap mode.

I don't think we should drag XLOG_DEBUG into this. That's a debugging facility that isn't really relevant to the topic at hand. The point is the server normally prints a bunch of messages that we don't see in bootstrap mode. For example:

[rhaas pgsql]$ postgres
2021-07-28 11:32:33.824 EDT [36801] LOG:  starting PostgreSQL 15devel on x86_64-apple-darwin19.6.0, compiled by clang version 5.0.2 (tags/RELEASE_502/final), 64-bit
2021-07-28 11:32:33.825 EDT [36801] LOG:  listening on IPv6 address "::1", port 5432
2021-07-28 11:32:33.825 EDT [36801] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2021-07-28 11:32:33.826 EDT [36801] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-07-28 11:32:33.846 EDT [36802] LOG:  database system was shut down at 2021-07-28 11:32:32 EDT
2021-07-28 11:32:33.857 EDT [36801] LOG:  database system is ready to accept connections

None of that shows up when you run initdb. Taking a fast look at the code, I don't think the reasons are the same for all of those messages. Some of the code isn't reached, whereas e.g. "database system was shut down at 2021-07-28 11:32:32 EDT" is special-cased. I'm not sure right off the top of my head what this code should do, but ideally it looks something like one of the cases we've already got.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On Wed, Jul 28, 2021 at 8:12 AM Kyotaro Horiguchi wrote: > > Hello. > > At Tue, 27 Jul 2021 11:04:09 +0530, Bharath Rupireddy > wrote in > > On Mon, Jul 26, 2021 at 11:03 PM Bossart, Nathan > > wrote: > > PSA v3 patch. > > I have some comments. > > - No harm, but it's pointless to feed MyLatch to WaitLatch when > WL_LATCH_SET is not set (or rather misleading). +1. I can send NULL to WaitLatch. > - It seems that the additional wait-event is effectively useless for > most of the processes. Considering that this feature is for debugging > purpose, it'd be better to use ps display instead (or additionally) > if we want to see the wait event anywhere. Hm. That's a good idea to show up in the ps display. > The events of autovacuum workers can be seen in pg_stat_activity properly. > > For client-backends, that state cannot be seen in > pg_stat_activity. That seems inevitable since backends aren't > allocated a PGPROC entry yet at that time. (So the wait event is set > to local memory as a safety measure in this case.) On the other hand, > I find it inconvenient that the ps display is shown as just "postgres" > while in that state. I think we can show "postgres: preauth waiting" > or something. (It is shown as "postgres: username dbname [conn] > initializing" while PostAuthDelay) Hm. Isn't it better to show something like below in the ps display? for pre_auth_delay: "postgres: pre auth delay" for post_auth_delay: "postgres: <> post auth delay" But, I'm not sure whether this ps display thing will add any value to the end user who can't always see the ps display. So, how about having both, i.e. ps display (useful for pre auth delay cases) and wait event (useful for post auth delay)? Regards, Bharath Rupireddy.
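A sketch of how the two could combine, with WAIT_EVENT_PRE_AUTH_DELAY standing in for the hypothetical new wait event and the ps wording purely illustrative:

    if (PreAuthDelay > 0)
    {
        set_ps_display("pre-auth delay");

        /* no PGPROC latch is usable yet, so pass NULL and wait on timeout only */
        (void) WaitLatch(NULL,
                         WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         PreAuthDelay * 1000L,
                         WAIT_EVENT_PRE_AUTH_DELAY);
    }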
Re: Have I found an interval arithmetic bug?
Robert Haas writes: > If we were doing this over again, I would argue that, with this > on-disk representation, 0.7 months ought to be rejected as invalid > input, because it's generally not a good idea to have a data type that > silently converts a value into a different value that is not > equivalent for all purposes. It is confusing and causes people to > expect behavior different from what they will actually get. Now, it > seems far too late to consider such a change at this point ... and it > is also no good considering a change to the on-disk representation of > the existing data type at this point ... but it is also no good > pretending like we have a floating-point representation of months and > days when we actually do not. You know, I was thinking exactly that thing earlier. Changing the on-disk representation is certainly a nonstarter, but the problem here stems from expecting interval_in to do something sane with inputs that do not correspond to any representable value. I do not think we have any other datatypes where we expect the input function to make choices like that. Is it really too late to say "that was a damfool idea" and drop fractional years/months/etc from interval_in's lexicon? By definition, this will not create any dump/reload problems, because interval_out will never emit any such thing. It will break applications that are expecting such syntax to do something sane. But that expectation is fundamentally not meetable, so maybe we should just make a clean break. (Without back-patching it, of course.) I'm not entirely sure about whether to reject fractional days, though. Converting those on the assumption of 1 day == 24 hours is not quite theoretically right. But it's unsurprising, which is not something we can say about fractions of the larger units. So maybe it's OK to continue accepting that. regards, tom lane
Re: truncating timestamps on arbitrary intervals
On Wed, Jul 28, 2021 at 12:15 AM Michael Paquier wrote: > > On Tue, Jul 27, 2021 at 12:05:51PM -0400, John Naylor wrote: > > Concretely, I propose to push the attached on master and v14. Since we're > > in beta 2 and this thread might not get much attention, I've CC'd the RMT. > > (It looks like gmail has messed up a bit the format of your last > message.) Hmm, it looks fine in the archives. > Hmm. The docs say also the following thing, but your patch does not > reflect that anymore: > "Negative intervals are allowed and are treated the same as positive > intervals." I'd forgotten that was documented based on incomplete information, thanks for looking! Pushed with that fixed. -- John Naylor EDB: http://www.enterprisedb.com
Re: Out-of-memory error reports in libpq
On 7/28/21 11:02 AM, Tom Lane wrote: > > Here I've got to disagree. We do need the form with a PQExpBuffer > argument, because there are some places where that isn't a pointer > to a PGconn's errorMessage. But the large majority of the calls > are "pqReportOOM(conn)", and I think having to write that as > "pqReportOOM(&conn->errorMessage)" is fairly ugly and perhaps > error-prone. > > I'm not wedded to the name "pqReportOOMBuffer" though --- maybe > there's some better name for that one? > > Is it worth making the first one a macro? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Have I found an interval arithmetic bug?
On Wed, Jul 28, 2021 at 11:52 AM Tom Lane wrote: > You know, I was thinking exactly that thing earlier. Changing the > on-disk representation is certainly a nonstarter, but the problem > here stems from expecting interval_in to do something sane with inputs > that do not correspond to any representable value. I do not think we > have any other datatypes where we expect the input function to make > choices like that. It's not exactly the same issue, but the input function whose behavior most regularly trips people up is bytea, because they try something like 'x'::bytea and it seems to DWTW, so then they try it on all their data and discover that, for example, '\'::bytea fails outright, or that ''::bytea = '\x'::bytea, contrary to expectations. People often seem to think that casting to bytea should work like convert_to(), but it doesn't. As in the case at hand, byteain() has to guess whether the input is intended to be the 'hex' or 'escape' format, and because the 'escape' format looks a lot like plain old text, confusion ensues. Now, guessing between two input formats that are both legal for the data type is not exactly the same as guessing what to do with a value that's not directly representable, but it has the same ultimate effect, i.e. the human beings perceive the system as buggy. A case that is perhaps more theoretically similar to the instance at hand is rounding during the construction of floating point values. My system thinks '1.00000000000000001'::float = '1'::float, so in that case, as in this one, we've decided that it's OK to build an inexact representation of the input value. I don't really see what can be done about this considering that the textual representation uses base 10 and the internal representation uses base 2, but I think this doesn't cause us as many problems in practice because people understand how it works, which doesn't seem to be the case with the interval data type, at least if this thread is any indication. I am dubious that it's worth the pain of making the input function reject cases involving fractional units. It's true that some people here aren't happy with the current behavior, but they may be no happier if we reject those cases with an error, and other people may then be unhappy too. I think your previous idea was the best one so far: fix the input function so that 'X years Y months' and 'Y months X years' always produce the same answer, and call it good. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Why don't update minimum recovery point in xact_redo_abort
On 2021/07/28 12:44, Fujii Masao wrote:
> On 2021/07/28 6:25, Heikki Linnakangas wrote:
>> On 27/07/2021 19:49, Fujii Masao wrote:
>>> Anyway I attached the patch that changes only xact_redo_abort() so that it calls XLogFlush() to update min recovery point.
>>
>> Looks good to me, thanks! FWIW, I used the attached script to reproduce this.
>
> Thanks for the review! Barring any objection, I will commit the patch and back-patch it to all supported versions.

Pushed. Thanks!

Regards,

-- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
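The shape of the fix mirrors what xact_redo_commit() already does for commit records (the exact placement inside xact_redo_abort() is assumed here):

    if (parsed->nrels > 0)
    {
        /*
         * Advance the minimum recovery point before unlinking files, so the
         * standby cannot end up recoverable to a state where the files are
         * gone but the abort record is not yet replayed.
         */
        XLogFlush(lsn);

        /* ... then drop the relation files ... */
    }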
Re: Out-of-memory error reports in libpq
Andrew Dunstan writes: > Is it worth making the first one a macro? It'd be the same from a source-code perspective, but probably a shade bulkier in terms of object code. regards, tom lane
Re: CREATE SEQUENCE with RESTART option
On 2021/07/28 23:53, Bharath Rupireddy wrote:
> On Wed, Jul 28, 2021 at 11:50 AM Michael Paquier wrote:
>> On Mon, Jul 26, 2021 at 04:57:53PM +0900, Michael Paquier wrote:
>>> FWIW, like Ashutosh upthread, my vote would be to do nothing here in terms of behavior changes as this is just breaking a behavior for the sake of breaking it, so there are chances that this is going to piss off some users who relied accidentally on the existing behavior.
>>
>> In short, I would be tempted with something like the attached, that documents RESTART in CREATE SEQUENCE, while describing its behavior according to START. In terms of regression tests, there is already a lot in this area with ALTER SEQUENCE, but I think that having two tests makes sense for CREATE SEQUENCE: one for RESTART without a value and one with, where both explicitly set START.
>>
>> Thoughts?
>
> -1. IMHO, this is something creating more confusion for the user. We say that we allow both START and RESTART, and that RESTART is accepted as a consequence of our internal option handling in gram.y. Instead, I recommend throwing errorConflictingDefElem or errmsg("START and RESTART are mutually exclusive options"). We do throw these errors in a lot of other places for various options. Others may have better thoughts though.

Per docs, CREATE SEQUENCE conforms to the SQL standard, with some exceptions. So I'd agree with Michael if CREATE SEQUENCE with RESTART also conforms to the SQL standard, but I'd agree with Bharath otherwise.

Regards,

-- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Have I found an interval arithmetic bug?
Robert Haas writes: > On Wed, Jul 28, 2021 at 11:52 AM Tom Lane wrote: >> You know, I was thinking exactly that thing earlier. Changing the >> on-disk representation is certainly a nonstarter, but the problem >> here stems from expecting interval_in to do something sane with inputs >> that do not correspond to any representable value. I do not think we >> have any other datatypes where we expect the input function to make >> choices like that. > A case that is perhaps more theoretically similar to the instance at > hand is rounding during the construction of floating point values. My > system thinks '1.00000000000000001'::float = '1'::float, so > in that case, as in this one, we've decided that it's OK to build an > inexact representation of the input value. Fair point, but you decided when you chose to use float that you don't care about the differences between numbers that only differ at the seventeenth or so decimal place. (Maybe, if you don't understand what float is, you didn't make that choice intentionally ... but that's a documentation issue not a code shortcoming.) However, it's fairly hard to believe that somebody who writes '1.4 years'::interval doesn't care about the 0.4 year. The fact that we silently convert that to, effectively, 1.... years seems like a bigger roundoff error than one would expect. > I am dubious that it's worth the pain of making the input function > reject cases involving fractional units. It's true that some people > here aren't happy with the current behavior, but they may be no happier > if we reject those cases with an error, and other people may then be > unhappy too. Maybe. A possible compromise is to accept only exactly-representable fractions. Then, for instance, we'd take 1.5 years (resulting in 18 months) but not 1.4 years. Now, this might fall foul of your point about not wanting to mislead people into expecting the system to do things it can't; but I'd argue that the existing behavior misleads them much more. > I think your previous idea was the best one so far: fix > the input function so that 'X years Y months' and 'Y months X years' > always produce the same answer, and call it good. That would clearly be a bug fix. I'm just troubled that there are still behaviors that people will see as bugs. regards, tom lane
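In code terms, the compromise sketched above could look roughly like this inside DecodeInterval() (the variable names come from the existing code; the check itself is hypothetical):

    double      months = fval * MONTHS_PER_YEAR;

    /* accept 1.5 years (exactly 18 months) but reject 1.4 years (16.8 months) */
    if (months != rint(months))
        return DTERR_BAD_FORMAT;
    tm->tm_mon += (int) months;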
Re: CREATE SEQUENCE with RESTART option
Fujii Masao writes: > On 2021/07/28 23:53, Bharath Rupireddy wrote: >> -1. IMHO, this is something creating more confusion for the user. We >> say that we allow both START and RESTART, and that RESTART is accepted as a >> consequence of our internal option handling in gram.y. Instead, I >> recommend throwing errorConflictingDefElem or errmsg("START and >> RESTART are mutually exclusive options"). We do throw these errors in >> a lot of other places for various options. Others may have better >> thoughts though. > Per docs, CREATE SEQUENCE conforms to the SQL standard, with some exceptions. > So I'd agree with Michael if CREATE SEQUENCE with RESTART also conforms to > the SQL standard, but I'd agree with Bharath otherwise. I do not see any RESTART option in SQL:2021 11.72. Since we don't document it either, there's really no expectation that anyone would use it. I don't particularly think that we should document it, so I'm with Michael that we don't need to do anything. This is hardly the only undocumented corner case in PG. regards, tom lane
Re: Fix around conn_duration in pgbench
On 2021/07/28 16:15, Yugo NAGATA wrote: I found another disconnect_all(). /* XXX should this be connection time? */ disconnect_all(state, nclients); The measurement is also not necessary here. So the above comment should be removed or updated? I think this disconnect_all will be a no-op because all connections should be already closed in threadRun(), but I left it just to be sure that connections are all cleaned-up. I updated the comment for explaining above. I attached the updated patch. Could you please look at this? Thanks for updating the patch! LGTM. Barring any objection, I will commit the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Have I found an interval arithmetic bug?
On Wed, Jul 28, 2021 at 1:05 PM Tom Lane wrote: > Fair point, but you decided when you chose to use float that you don't > care about the differences between numbers that only differ at the > seventeenth or so decimal place. (Maybe, if you don't understand what > float is, you didn't make that choice intentionally ... but that's > a documentation issue not a code shortcoming.) However, it's fairly > hard to believe that somebody who writes '1.4 years'::interval doesn't > care about the 0.4 year. The fact that we silently convert that to, > effectively, 1.... years seems like a bigger roundoff error > than one would expect. Yeah, that's definitely a fair point! > > I am dubious that it's worth the pain of making the input function > > reject cases involving fractional units. It's true that some people > > here aren't happy with the current behavior, but they may no happier > > if we reject those cases with an error, and other people may then be > > unhappy too. > > Maybe. A possible compromise is to accept only exactly-representable > fractions. Then, for instance, we'd take 1.5 years (resulting in 18 > months) but not 1.4 years. Now, this might fall foul of your point about > not wanting to mislead people into expecting the system to do things it > can't; but I'd argue that the existing behavior misleads them much more. Well, let's see what other people think. > > I think your previous idea was the best one so far: fix > > the input function so that 'X years Y months' and 'Y months X years' > > always produce the same answer, and call it good. > > That would clearly be a bug fix. I'm just troubled that there are > still behaviors that people will see as bugs. That's a reasonable thing to be troubled about, but the date and time related datatypes have so many odd and crufty behaviors that I have a hard time believing that there's another possible outcome. If somebody showed up today and proposed a new data type and told us that the way to format values of that data type was to say to_char(my_value, alphabet_soup) I think they would not be taken very seriously. A lot of this code, and the associated interfaces, date back to a time when PostgreSQL was far more primitive than today, and when databases in general were as well. At least we didn't end up with a datatype called varchar2 ... or not yet, anyway. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Asynchronous and "direct" IO support for PostgreSQL.
On Tue, Feb 23, 2021 at 5:04 AM Andres Freund wrote: > > ## AIO API overview > > The main steps to use AIO (without higher level helpers) are: > > 1) acquire an "unused" AIO: pgaio_io_get() > > 2) start some IO, this is done by functions like > pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal) > > The (read|write|fsync|flush_range) indicates the operation, whereas > (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled. > > (see below for more details about this design choice - it might or might not be > right) > > 3) optionally: assign a backend-local completion callback to the IO > (pgaio_io_on_completion_local()) > > 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be > put on a per-backend list of pending IOs. The pending IOs can be explicitly > flushed via pgaio_submit_pending(), but will also be submitted if the > pending list gets to be too large, or if the current backend waits for the > IO. > > There are two main reasons not to submit the IO immediately: > - If adjacent, we can merge several IOs into one "kernel level" IO during > submission. Larger IOs are considerably more efficient. > - Several AIO APIs allow to submit a batch of IOs in one system call. > > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current > backend. When other backends may need to wait for an IO to finish, > pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a > BufferDesc), which can be waited for using pgaio_io_wait_ref(). > > 6) Process the results of the request. If a callback was registered in 3), > this isn't always necessary. The results of AIO can be accessed using > pgaio_io_result() which returns an integer where negative numbers are > -errno, and positive numbers are the [partial] success conditions > (e.g. potentially indicating a short read). > > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for > another operation (pgaio_io_recycle()) > > > Most places that want to use AIO shouldn't themselves need to care about > managing the number of writes in flight, or the readahead distance. To help > with that there are two helper utilities, a "streaming read" and a "streaming > write". > > The "streaming read" helper uses a callback to determine which blocks to > prefetch - that allows to do readahead in a sequential fashion but importantly > also allows to asynchronously "read ahead" non-sequential blocks. > > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map > to figure out which block needs to be read next. Similarly lazy_vacuum_heap() > uses the tids in LVDeadTuples to figure out which blocks are going to be > needed. Here's the latter as an example: > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906 > Attached is a patch on top of the AIO branch which does bitmapheapscan prefetching using the PgStreamingRead helper already used by sequential scan and vacuum on the AIO branch. The prefetch iterator is removed and the main iterator in the BitmapHeapScanState node is now used by the PgStreamingRead helper. Some notes about the code: Each IO will have its own TBMIterateResult allocated and returned by the PgStreamingRead helper and freed later by heapam_scan_bitmap_next_block() before requesting the next block. Previously it was allocated once and saved in the TBMIterator in the BitmapHeapScanState node and reused. 
Because of this, the table AM API routine, table_scan_bitmap_next_block() now defines the TBMIterateResult as an output parameter. The PgStreamingRead helper pgsr_private parameter for BitmapHeapScan is now the actual BitmapHeapScanState node. It needed access to the iterator, the heap scan descriptor, and a few fields in the BitmapHeapScanState node that could be moved elsewhere or duplicated (visibility map buffer and can_skip_fetch, for example). So, it is possible to either create a new struct or move fields around to avoid this--but, I'm not sure if that would actually be better. Because the PgStreamingReadHelper needs to be set up with the BitmapHeapScanState node but also needs some table AM specific functions, I thought it made more sense to initialize it using a new table AM API routine. Instead of fully implementing that I just wrote a wrapper function, table_bitmap_scan_setup() which just calls bitmapheap_pgsr_alloc() to socialize the idea before implementing it. I haven't made the GIN code reasonable yet either (it uses the TID bitmap functions that I've changed). There are various TODOs in the code posing questions both to the reviewer and myself for future versions of the patch. Oh, also, I haven't updated the failing partition_prune regression test because I haven't had a chance to look at the EXPLAIN code which adds the text which is not being produced to see if it is
Re: Have I found an interval arithmetic bug?
Robert Haas writes: > On Wed, Jul 28, 2021 at 1:05 PM Tom Lane wrote: >> That would clearly be a bug fix. I'm just troubled that there are >> still behaviors that people will see as bugs. > That's a reasonable thing to be troubled about, but the date and time > related datatypes have so many odd and crufty behaviors that I have a > hard time believing that there's another possible outcome. There's surely a ton of cruft there, but I think most of it stems from western civilization's received rules for timekeeping, which we can do little about. But the fact that interval_in accepts '1.4 years' when it cannot do anything very reasonable with that input is entirely self-inflicted damage. BTW, I don't have a problem with the "interval * float8" operator doing equally strange things, because if you don't like what it does you can always write your own multiplication function that you like better. There can be only one interval_in, though, so I don't think it should be misrepresenting the fundamental capabilities of the datatype. regards, tom lane
Re: Asynchronous and "direct" IO support for PostgreSQL.
Hi, On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote: > Attached is a patch on top of the AIO branch which does bitmapheapscan > prefetching using the PgStreamingRead helper already used by sequential > scan and vacuum on the AIO branch. > > The prefetch iterator is removed and the main iterator in the > BitmapHeapScanState node is now used by the PgStreamingRead helper. Cool! I'm heartened to see "12 files changed, 272 insertions(+), 495 deletions(-)" It's worth calling out that this fixes some abstraction leakiness around tableam too... > Each IO will have its own TBMIterateResult allocated and returned by the > PgStreamingRead helper and freed later by > heapam_scan_bitmap_next_block() before requesting the next block. > Previously it was allocated once and saved in the TBMIterator in the > BitmapHeapScanState node and reused. Because of this, the table AM API > routine, table_scan_bitmap_next_block() now defines the TBMIterateResult > as an output parameter. > > I haven't made the GIN code reasonable yet either (it uses the TID > bitmap functions that I've changed). I don't quite understand the need to change the tidbitmap interface, or maybe rather I'm not convinced that pessimistically preallocating space is a good idea? > I don't see a need for it right now. If you wanted you > Because the PgStreamingReadHelper needs to be set up with the > BitmapHeapScanState node but also needs some table AM specific > functions, I thought it made more sense to initialize it using a new > table AM API routine. Instead of fully implementing that I just wrote a > wrapper function, table_bitmap_scan_setup() which just calls > bitmapheap_pgsr_alloc() to socialize the idea before implementing it. That makes sense. > static bool > heapam_scan_bitmap_next_block(TableScanDesc scan, > - TBMIterateResult > *tbmres) > + TBMIterateResult **tbmres) ISTM that we possibly shouldn't expose the TBMIterateResult outside of the AM after this change? It feels somewhat like an implementation detail now. It seems somewhat odd to expose a ** to set a pointer that nodeBitmapHeapscan.c then doesn't really deal with itself. > @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm) >* Create the TBMIterator struct, with enough trailing space to serve > the >* needs of the TBMIterateResult sub-struct. >*/ > - iterator = (TBMIterator *) palloc(sizeof(TBMIterator) + > - > MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)); > + iterator = (TBMIterator *) palloc(sizeof(TBMIterator)); > iterator->tbm = tbm; Hm? > diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h > index 9a07f06b9f..8e1aa48827 100644 > --- a/src/include/storage/aio.h > +++ b/src/include/storage/aio.h > @@ -39,7 +39,7 @@ typedef enum IoMethod > } IoMethod; > > /* We'll default to bgworker. */ > -#define DEFAULT_IO_METHOD IOMETHOD_WORKER > +#define DEFAULT_IO_METHOD IOMETHOD_IO_URING I agree with the sentiment, but ... :) Greetings, Andres Freund
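For orientation, the bare call sequence from the API overview quoted in this thread looks roughly like this; the handle type and the argument lists are assumptions, since only the function names appear in the overview:

    PgAioInProgress *io = pgaio_io_get();       /* 1) acquire an unused AIO */

    pgaio_io_start_read_smgr(io /* , target... */);     /* 2) queue the IO */
    pgaio_io_on_completion_local(io /* , callback */);  /* 3) optional */

    pgaio_submit_pending();                     /* 4) flush pending IOs to the kernel */
    pgaio_io_wait(io);                          /* 5) wait for completion */

    int res = pgaio_io_result(io);              /* 6) negative means -errno */
    if (res < 0)
        elog(ERROR, "AIO failed: %s", strerror(-res));

    pgaio_io_release(io);                       /* 7) release ownership */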
Re: speed up verifying UTF-8
I wrote: > On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > > > >+ utf8_advance(s, state, len); > > >+ > > >+ /* > > >+ * If we saw an error during the loop, let the caller handle it. We treat > > >+ * all other states as success. > > >+ */ > > >+ if (state == ERR) > > >+ return 0; > > > > Did you mean state = utf8_advance(s, state, len); there? (reassign state variable) > > Yep, that's a bug, thanks for catching! Fixed in v21, with a regression test added. Also, utf8_advance() now directly changes state by a passed pointer rather than returning a value. Some cosmetic changes: s/valid_bytes/non_error_bytes/ since the former is kind of misleading now. Some other var name and symbol changes. In my first DFA experiment, ASC conflicted with the parser or scanner somehow, but it doesn't here, so it's clearer to use this. Rewrote a lot of comments about the state machine and regression tests. -- John Naylor EDB: http://www.enterprisedb.com v21-0001-Add-fast-paths-for-validating-UTF-8-text.patch Description: Binary data
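The bug and its fix boil down to this difference (taken from the description above):

    /* v20, buggy: the advanced state was computed and then thrown away */
    utf8_advance(s, state, len);

    /* v21: the state is updated through a pointer, then checked */
    utf8_advance(s, &state, len);
    if (state == ERR)
        return 0;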
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On Wed, Jul 28, 2021 at 09:10:35PM +0530, Bharath Rupireddy wrote: > On Wed, Jul 28, 2021 at 8:12 AM Kyotaro Horiguchi > wrote: > > > > - It seems that the additional wait-event is effectively useless for > > most of the processes. Considering that this feature is for debugging > > purpose, it'd be better to use ps display instead (or additionally) > > if we want to see the wait event anywhere. > > Hm. That's a good idea to show up in the ps display. Keep in mind that ps is apparently so expensive under windows that the GUC defaults to off. The admin can leave the ps display off, but I wonder if it's of any concern that something so expensive can be caused by an unauthenticated connection. > > The events of autovacuum workers can be seen in pg_stat_activity properly. > > > > For client-backends, that state cannot be seen in > > pg_stat_activity. That seems inevitable since backends aren't > > allocated a PGPROC entry yet at that time. (So the wait event is set > > to local memory as a safety measure in this case.) On the other hand, > > I find it inconvenient that the ps display is shown as just "postgres" > > while in that state. I think we can show "postgres: preauth waiting" > > or something. (It is shown as "postgres: username dbname [conn] > > initializing" while PostAuthDelay) > > Hm. Is n't it better to show something like below in the ps display? > for pre_auth_delay: "postgres: pre auth delay" > for post_auth_delay: "postgres: <> post auth delay" > > But, I'm not sure whether this ps display thing will add any value to > the end user who always can't see the ps display. So, how about having > both i.e. ps display (useful for pre auth delay cases) and wait event > (useful for post auth delay)?
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
Justin Pryzby writes: > On Wed, Jul 28, 2021 at 09:10:35PM +0530, Bharath Rupireddy wrote: >> Hm. That's a good idea to show up in the ps display. > Keep in mind that ps is apparently so expensive under windows that the GUC > defaults to off. > The admin can leave the ps display off, but I wonder if it's of any concern > that something so expensive can be caused by an unauthenticated connection. I'm detecting a certain amount of lily-gilding here. Neither of these delays are meant for anything except debugging purposes, and nobody as far as I've heard has ever expressed great concern about identifying which process they need to attach to for that purpose. So I think it is a *complete* waste of time to add any cycles to connection startup to make these delays more visible. I follow the idea of using WaitLatch to ensure that the delays are interruptible by postmaster signals, but even that isn't worth a lot given the expected use of these things. I don't see a need to expend any extra effort on wait-reporting. regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2021-07-25 19:07:18 +0300, Yura Sokolov wrote:
> I've dreamed of writing a more compact structure for vacuum for three years, but life didn't give me the time to.
>
> Let me join the friendly competition.
>
> I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.

My concern with several of the proposals in this thread is that they over-optimize for this specific case. It's not actually that crucial to have a crazily optimized vacuum dead tid storage datatype. Having something more general that also performs reasonably for the dead tuple storage, but also performs well in a number of other cases, makes a lot more sense to me.

> (Bad radix result probably due to smaller cache in notebook's CPU ?)

Probably largely due to the node dispatch. a) For some reason gcc likes jump tables too much, I get better numbers when disabling those. b) The node type dispatch should be stuffed into the low bits of the pointer.

> select prepare(100, 2, 100, 1);
>
>          attach    size   shuffled
> array      6ms     12MB     53.42s
> intset    23ms     16MB     54.99s
> rtbm     115ms     38MB      8.19s
> tbm      186ms    100MB      8.37s
> vtbm     105ms     59MB      9.08s
> radix     64ms     42MB     10.41s
> svtm      73ms     10MB      7.49s
>
> Test4
>
> select prepare(100, 100, 1, 1);
>
>          attach    size   shuffled
> array    304ms    600MB     75.12s
> intset   775ms     98MB     47.49s
> rtbm     356ms     38MB      4.11s
> tbm      539ms    100MB      4.20s
> vtbm     493ms     42MB      4.44s
> radix    263ms     42MB      6.05s
> svtm     360ms      8MB      3.49s
>
> Therefore the Specialized Vacuum Tid Map always consumes the least memory and is usually faster.

Impressive.

Greetings,

Andres Freund
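Idea (b), stuffing the node type into the pointer's low bits, could look something like this sketch (all names made up):

    typedef uintptr_t rt_node_ref;

    /* 8-byte-aligned allocations leave the low 3 bits free for a type tag */
    #define RT_NODE_KIND_MASK   ((uintptr_t) 0x7)

    static inline int
    rt_node_kind(rt_node_ref ref)
    {
        return (int) (ref & RT_NODE_KIND_MASK);
    }

    static inline void *
    rt_node_ptr(rt_node_ref ref)
    {
        return (void *) (ref & ~RT_NODE_KIND_MASK);
    }

This avoids loading a node header just to dispatch, at the cost of a mask on every access.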
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote: > Apart from performance and memory usage points of view, we also need > to consider the reusability of the code. When I started this thread, I > thought the best data structure would be the one optimized for > vacuum's dead tuple storage. However, if we can use a data structure > that can also be used in general, we can use it also for other > purposes. Moreover, if it's too optimized for the current TID system > (32 bits block number, 16 bits offset number, maximum block/offset > number, etc.) it may become a blocker for future changes. Indeed. > In that sense, radix tree also seems good since it can also be used in > gist vacuum as a replacement for intset, or a replacement for hash > table for shared buffer as discussed before. Are there any other use > cases? Yes, I think there are. Whenever there is some spatial locality it has a decent chance of winning over a hash table, and it will most of the time win over ordered datastructures like rbtrees (which perform very poorly due to the number of branches and pointer dispatches). There are plenty of hashtables in PG, e.g. for caches, locks, etc., that have a medium-high degree of locality, so I'd expect a few potential uses. When adding "tree compression" (i.e. skip inner nodes that have a single incoming & outgoing node) radix trees can even deal quite performantly with variable width keys. > On the other hand, I'm concerned that radix tree would be > over-engineering in terms of vacuum's dead tuples storage since the > dead tuple storage is static data and requires only lookup operation, > so if we want to use radix tree as dead tuple storage, I'd like to see > further use cases. I don't think we should rely on the read-only-ness. It seems pretty clear that we'd want parallel dead-tuple scans at a point not too far into the future? Greetings, Andres Freund
Re: needless complexity in StartupXLOG
Hi, On 2021-07-26 12:12:53 -0400, Robert Haas wrote: > My first thought was that we should do the check unconditionally, > rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and > ERROR if it fails. But then I wondered what the point of that would be > exactly. If we have such a bug -- and to the best of my knowledge > there's no evidence that we do -- there's no particular reason it > should only happen at the end of recovery. It could happen any time > the system -- or the user, or malicious space aliens -- remove files > from pg_wal, and we have no real idea about the timing of malicious > space alien activity, so doing the test here rather than anywhere else > just seems like a shot in the dark. Yea. The history of that code being added doesn't suggest that there was a concrete issue being addressed, from what I can tell. > So at the moment I am leaning toward the view that we should just > remove this check entirely, as in the attached, proposed patch. +1 > Really, I think we should consider going further. If it's safe to > write an end-of-recovery record rather than a checkpoint, why not do > so all the time? +many. The current split doesn't make much sense. For one, it often is a huge issue if crash recovery takes a long time - why should we incur the cost that we are OK avoiding during promotions? For another, end-of-recovery is a crucial path for correctness, reducing the number of non-trivial paths is good. > Imagine if instead of > all the hairy logic we have now we just replaced this whole if > (IsInRecovery) stanza with this: > > if (InRecovery) > CreateEndOfRecoveryRecord(); > > That would be WAY easier to reason about than the rat's nest we have > here today. Now, I am not sure what it would take to get there, but I > think that is the direction we ought to be heading. What are we going to do in the single user ([1]) case in this awesome future? I guess we could just not create a checkpoint until single user mode is shut down / creates a checkpoint for other reasons? Greetings, Andres Freund [1] I really wish somebody had the energy to just remove single user and bootstrap modes. The degree to which they increase complexity in the rest of the system is entirely unreasonable. There's not actually any reason bootstrapping can't happen with checkpointer et al running, it's just autovacuum that'd need to be blocked.
Re: Out-of-memory error reports in libpq
Hi, On 2021-07-27 18:40:48 -0400, Tom Lane wrote: > The first half of that just saves a few hundred bytes of repetitive > coding. However, I think that the addition of recovery logic is > important for robustness, because as things stand libpq may be > worse off than before for OOM handling. Agreed. > Before ffa2e4670, almost all of these call sites did > printfPQExpBuffer(..., "out of memory"). That would automatically > clear the message buffer to empty, and thereby be sure to report the > out-of-memory failure if at all possible. Now we might fail to report > the thing that the user really needs to know to make sense of what > happened. Hm. It seems we should be able to guarantee that the recovery path can print something, at least in the PGconn case. Is it perhaps worth pre-sizing PGConn->errorMessage so it'd fit an error like this? But perhaps that's more effort than it's worth. > +void > +pqReportOOMBuffer(PQExpBuffer errorMessage) > +{ > + const char *msg = libpq_gettext("out of memory\n"); I should probably know this, but I don't. Nor did I quickly find an answer. I assume gettext() reliably and reasonably deals with OOM? Looking in the gettext code I'm again scared by the fact that it takes locks during gettext (because of stuff like erroring out of signal handlers, not OOMs). It does look like it tries to always return the original string in case of OOM. Although the code is quite maze-like, so it's not easy to tell.. Greetings, Andres Freund
Re: alter table set TABLE ACCESS METHOD
On Wed, 2021-07-28 at 14:02 +0900, Michael Paquier wrote:
> Arg, sorry about that! I was unclear what the situation of the patch
> was.

No problem, race condition ;-)

> Right. Isn't that an older issue though? A rewrite involved after a
> change of relpersistence does not call the hook either. It looks to
> me that this should be after finish_heap_swap() to match with
> ATExecSetTableSpace() in ATRewriteTables(). The only known user of
> object_access_hook in the core code is sepgsql, so this would
> involve a change of behavior. And I don't recall any backpatching
> that added a post-alter hook.

Sounds like it should be a different patch. Thank you.

> > Also, I agree with Justin that it should fail when there are
> > multiple SET ACCESS METHOD subcommands consistently, regardless of
> > whether one is a no-op, and it should probably throw a syntax error
> > to match SET TABLESPACE.
>
> Hmm. Okay.
>
> > Minor nit: in tab-complete.c, why does it say "<sth>"? Is that just
> > a typo or is there a reason it's different from everything else,
> > which uses ""? And what does "sth" mean anyway?
>
> "Something". That should be "" to be consistent with the area.

These two issues are pretty minor.

Regards,
Jeff Davis
Re: [PATCH] test/ssl: rework the sslfiles Makefile target
On Wed, 2021-07-28 at 00:24 +0200, Daniel Gustafsson wrote:
> > On 4 Mar 2021, at 01:03, Jacob Champion wrote:
> > Andrew pointed out elsewhere [1] that it's pretty difficult to add
> > new certificates to the test/ssl suite without blowing away the
> > current state and starting over. I needed new cases for the NSS
> > backend work, and ran into the same pain, so here is my attempt to
> > improve the situation.
>
> Thanks for working on this, I second the pain cited. I've just started
> to look at this, so only a few comments thus far.
>
> > The unused server-ss certificate has been removed entirely.
>
> Nice catch, this seems to have been unused since the original import
> of the SSL test suite. To cut down the scope of the patch (even if
> only a small bit) I propose to apply this separately first, as per the
> attached.

LGTM.

> > - Serial number collisions are less likely, thanks to Andrew's idea
> >   to use the current clock time as the initial serial number in a
> >   series.
>
> +my $serialno = `openssl x509 -serial -noout -in ssl/client.crt`;
> +$serialno =~ s/^serial=//;
> +$serialno = hex($serialno); # OpenSSL prints serial numbers in hexadecimal
>
> Will that work on Windows? We don't currently require the openssl
> binary to be in PATH unless one wants to rebuild sslfiles (which it is
> quite likely to be, but there should at least be error handling
> covering when it's not).

Hm, that's a good point. It should be easy enough for me to add a
fallback if the invocation fails; I'll give it a shot tomorrow.

> > - I am making _heavy_ use of GNU Make-isms, which does not improve
> >   long-term maintainability.
>
> GNU Make is already a requirement, I don't see this shifting the
> needle in any direction.

As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm
happy.

Thanks!
--Jacob
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On 7/28/21, 11:32 AM, "Tom Lane" wrote: > I'm detecting a certain amount of lily-gilding here. Neither of these > delays are meant for anything except debugging purposes, and nobody as > far as I've heard has ever expressed great concern about identifying > which process they need to attach to for that purpose. So I think it > is a *complete* waste of time to add any cycles to connection startup > to make these delays more visible. > > I follow the idea of using WaitLatch to ensure that the delays are > interruptible by postmaster signals, but even that isn't worth a > lot given the expected use of these things. I don't see a need to > expend any extra effort on wait-reporting. +1. The proposed patch doesn't make the delay visibility any worse than what's already there. Nathan
Re: [PATCH] test/ssl: rework the sslfiles Makefile target
On 7/28/21 4:10 PM, Jacob Champion wrote: > >>> - I am making _heavy_ use of GNU Make-isms, which does not improve >>> long-term maintainability. >> GNU Make is already a requirement, I don't see this shifting the needle in >> any >> direction. > As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm > happy. > We don't currently have any, and so many of us (including me) will have to learn to understand it. But that's not to say it's unacceptable. If there's no new infrastructure requirement then I'm OK with it. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Out-of-memory error reports in libpq
Andres Freund writes: > I should probably know this, but I don't. Nor did I quickly find an answer. I > assume gettext() reliably and reasonably deals with OOM? I've always assumed that their fallback in cases of OOM, can't read the message file, yadda yadda is to return the original string. I admit I haven't gone and checked their code, but it'd be unbelievably stupid to do otherwise. > Looking in the gettext code I'm again scared by the fact that it takes locks > during gettext (because of stuff like erroring out of signal handlers, not > OOMs). Hm. regards, tom lane
Re: [PATCH] test/ssl: rework the sslfiles Makefile target
Jacob Champion writes: > On Wed, 2021-07-28 at 00:24 +0200, Daniel Gustafsson wrote: >> GNU Make is already a requirement, I don't see this shifting the needle in >> any >> direction. Um ... the existing requirement is for gmake 3.80 or newer; if you want to use newer features we'd have to have a discussion about whether it's worthwhile to move that goalpost. > As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm > happy. After reading the gmake docs about that, I'd have to say it's likely to be next door to unmaintainable. Do we really have to be that cute? And, AFAICT, it's not in 3.80. regards, tom lane
Re: [PATCH] test/ssl: rework the sslfiles Makefile target
> On 28 Jul 2021, at 23:02, Tom Lane wrote: > Jacob Champion writes: >> As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm >> happy. > > After reading the gmake docs about that, I'd have to say it's likely to be > next > door to unmaintainable. Personally, I don’t think it’s that bad, but mileage varies. It’s obviously a show-stopper if maintainers don’t feel comfortable with it. > And, AFAICT, it's not in 3.80. That however, is a very good point that I missed. I think it’s a good tool, but probably not enough to bump the requirement. -- Daniel Gustafsson https://vmware.com/
Re: Out-of-memory error reports in libpq
Andres Freund writes: > Hm. It seems we should be able to guarantee that the recovery path can print > something, at least in the PGconn case. Is it perhaps worth pre-sizing > PGConn->errorMessage so it'd fit an error like this? Forgot to address this. Right now, the normal situation is that PGConn->errorMessage is "pre sized" to 256 bytes, because that's what pqexpbuffer.c does for all PQExpBuffers. So unless you've overrun that, the question is moot. If you have, and you got an OOM in trying to expand the PQExpBuffer, then pqexpbuffer.c will release what it has and substitute the "oom_buffer" empty string. If you're really unlucky you might then not be able to allocate another 256-byte buffer, in which case we end up with an empty-string result. I don't think it's probable, but in a multithread program it could happen. > But perhaps that's more effort than it's worth. Yeah. I considered changing things so that oom_buffer contains "out of memory\n" rather than an empty string, but I'm afraid that that's making unsupportable assumptions about what PQExpBuffers are used for. For now, I'm content if it's not worse than v13. We've not heard a lot of complaints in this area. regards, tom lane
Re: Replace l337sp34k in comments.
On Wed, Jul 28, 2021 at 11:32 AM Michael Paquier wrote: > > On Wed, Jul 28, 2021 at 09:39:02AM +1000, Peter Smith wrote: > > IMO the PG code comments are not an appropriate place for leetspeak > > creativity. > > > > PSA a patch to replace a few examples that I recently noticed. > > > > "up2date" --> "up-to-date" > > Agreed that this is a bit cleaner to read, so done. Just note that > pgindent has been complaining about the format of some of the updated > comments. Thanks for pushing! BTW, the commit comment [1] attributes most of these to a recent patch, but I think that is mistaken. AFAIK they are from when the file was first introduced 8 years ago [2]. -- [1] https://github.com/postgres/postgres/commit/7b7fbe1e8bb4b2a244d1faa618789db411316e55 [2] https://github.com/postgres/postgres/commit/b89e151054a05f0f6d356ca52e3b725dd0505e53#diff-034b6d4eaf36425e75d7a7087d09bd6c734dd9ea8398533559d537d13b6b9197 Kind Regards, Peter Smith. Fujitsu Australia
Re: CREATE SEQUENCE with RESTART option
On Wed, Jul 28, 2021 at 01:16:19PM -0400, Tom Lane wrote:
> I do not see any RESTART option in SQL:2021 11.72 <sequence generator
> definition>. Since we don't document it either, there's really no
> expectation that anyone would use it.

Okay, good point. I was not aware of that.

> I don't particularly think that we should document it, so I'm with
> Michael that we don't need to do anything. This is hardly the only
> undocumented corner case in PG.

More regression tests for CREATE SEQUENCE may be interesting, but
that's not an issue either with the ones for ALTER SEQUENCE. Let's drop
the patch and move on.
--
Michael
Re: Out-of-memory error reports in libpq
I wrote:
> Andres Freund writes:
>> Hm. It seems we should be able to guarantee that the recovery path can
>> print something, at least in the PGconn case. Is it perhaps worth
>> pre-sizing PGConn->errorMessage so it'd fit an error like this?
>> But perhaps that's more effort than it's worth.

> Yeah. I considered changing things so that oom_buffer contains
> "out of memory\n" rather than an empty string, but I'm afraid
> that that's making unsupportable assumptions about what PQExpBuffers
> are used for.

Actually, wait a minute. There are only a couple of places that ever
read out the value of conn->errorMessage, so let's make those places
responsible for dealing with OOM scenarios. That leads to a nicely
small patch, as attached, and it looks to me like it makes us quite
bulletproof against such scenarios.

It might still be worth doing the "pqReportOOM" changes to save a
few bytes of code space, but I'm less excited about that now.

regards, tom lane

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e950b41374..49eec3e835 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -6739,6 +6739,14 @@ PQerrorMessage(const PGconn *conn)
 	if (!conn)
 		return libpq_gettext("connection pointer is NULL\n");
 
+	/*
+	 * The errorMessage buffer might be marked "broken" due to having
+	 * previously failed to allocate enough memory for the message. In that
+	 * case, tell the application we ran out of memory.
+	 */
+	if (PQExpBufferBroken(&conn->errorMessage))
+		return libpq_gettext("out of memory\n");
+
 	return conn->errorMessage.data;
 }
 
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index aca81890bb..87f348f3dc 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -191,7 +191,7 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
 				/* non-error cases */
 				break;
 			default:
-				pqSetResultError(result, conn->errorMessage.data);
+				pqSetResultError(result, &conn->errorMessage);
 				break;
 		}
 
@@ -662,14 +662,28 @@ pqResultStrdup(PGresult *res, const char *str)
  * assign a new error message to a PGresult
  */
 void
-pqSetResultError(PGresult *res, const char *msg)
+pqSetResultError(PGresult *res, PQExpBuffer errorMessage)
 {
+	char	   *msg;
+
 	if (!res)
 		return;
-	if (msg && *msg)
-		res->errMsg = pqResultStrdup(res, msg);
+
+	/*
+	 * We handle two OOM scenarios here. The errorMessage buffer might be
+	 * marked "broken" due to having previously failed to allocate enough
+	 * memory for the message, or it might be fine but pqResultStrdup fails
+	 * and returns NULL. In either case, just make res->errMsg point directly
+	 * at a constant "out of memory" string.
+	 */
+	if (!PQExpBufferBroken(errorMessage))
+		msg = pqResultStrdup(res, errorMessage->data);
+	else
+		msg = NULL;
+	if (msg)
+		res->errMsg = msg;
 	else
-		res->errMsg = NULL;
+		res->errMsg = libpq_gettext("out of memory\n");
 }
 
 /*
@@ -2122,7 +2136,7 @@ PQgetResult(PGconn *conn)
 					appendPQExpBuffer(&conn->errorMessage,
 									  libpq_gettext("PGEventProc \"%s\" failed during PGEVT_RESULTCREATE event\n"),
 									  res->events[i].name);
-					pqSetResultError(res, conn->errorMessage.data);
+					pqSetResultError(res, &conn->errorMessage);
 					res->resultStatus = PGRES_FATAL_ERROR;
 					break;
 				}
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index e9f214b61b..490458adef 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -637,7 +637,7 @@ extern pgthreadlock_t pg_g_threadlock;
 
 /* === in fe-exec.c === */
 
-extern void pqSetResultError(PGresult *res, const char *msg);
+extern void pqSetResultError(PGresult *res, PQExpBuffer errorMessage);
 extern void *pqResultAlloc(PGresult *res, size_t nBytes, bool isBinary);
 extern char *pqResultStrdup(PGresult *res, const char *str);
 extern void pqClearAsyncResult(PGconn *conn);
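The practical upshot for client code can be shown with a minimal sketch
(ordinary libpq usage, not part of the patch): PQerrorMessage() now
degrades to a constant "out of memory" string rather than an empty one,
so the usual error-reporting idiom keeps working under memory pressure.

    #include <stdio.h>
    #include <libpq-fe.h>

    static void
    run_query(PGconn *conn)
    {
        PGresult   *res = PQexec(conn, "SELECT 1");

        if (res == NULL || PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            /* never NULL; yields "out of memory\n" if the buffer broke */
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        }
        PQclear(res);
    }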
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On Wed, Jul 28, 2021 at 08:28:12PM +0000, Bossart, Nathan wrote:
> On 7/28/21, 11:32 AM, "Tom Lane" wrote:
>> I follow the idea of using WaitLatch to ensure that the delays are
>> interruptible by postmaster signals, but even that isn't worth a
>> lot given the expected use of these things. I don't see a need to
>> expend any extra effort on wait-reporting.
>
> +1. The proposed patch doesn't make the delay visibility any worse
> than what's already there.

Agreed to just drop the patch (my opinion about this patch is
unchanged). Not to mention that wait events are not available at SQL
level at this stage yet.
--
Michael
Re: Emit namespace in post-copy output
On Wed, Jul 28, 2021 at 04:15:19PM +0200, Daniel Gustafsson wrote:
> Since get_namespace_name() returns a palloced string, this will lead
> to a 2x leak of the namespace length as opposed to the 1x of today.
> While hardly a big deal, it seems prudent to cap this by storing the
> returned string locally now that we need it twice.

I don't think this matters much. A quick read of the code shows that
this memory should be allocated within the transaction context running
CLUSTER/VACUUM FULL, and that gets free'd in the internal calls of
CommitTransactionCommand().
--
Michael
Re: Out-of-memory error reports in libpq
On Wed, Jul 28, 2021 at 9:25 PM Tom Lane wrote:
> I wrote:
>> Andres Freund writes:
>>> Hm. It seems we should be able to guarantee that the recovery path
>>> can print something, at least in the PGconn case. Is it perhaps worth
>>> pre-sizing PGConn->errorMessage so it'd fit an error like this?
>>> But perhaps that's more effort than it's worth.
>
>> Yeah. I considered changing things so that oom_buffer contains
>> "out of memory\n" rather than an empty string, but I'm afraid
>> that that's making unsupportable assumptions about what PQExpBuffers
>> are used for.
>
> Actually, wait a minute. There are only a couple of places that ever
> read out the value of conn->errorMessage, so let's make those places
> responsible for dealing with OOM scenarios. That leads to a nicely
> small patch, as attached, and it looks to me like it makes us quite
> bulletproof against such scenarios.
>
> It might still be worth doing the "pqReportOOM" changes to save a
> few bytes of code space, but I'm less excited about that now.

IMO, I think that "char *msg" is unnecessary, isn't it?

+	if (!PQExpBufferBroken(errorMessage))
+		res->errMsg = pqResultStrdup(res, errorMessage->data);
 	else
-		res->errMsg = NULL;
+		res->errMsg = libpq_gettext("out of memory\n");

> regards, tom lane
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Jul 8, 2021 at 3:55 PM osumi.takami...@fujitsu.com wrote: > > Hello, hackers > > > When the current HEAD fails during logical decoding, the failure > increments txns count in pg_stat_replication_slots - [1] and adds > the transaction size to the sum of bytes in the same repeatedly > on the publisher, until the problem is solved. > One of the good examples is duplication error on the subscriber side > and this applies to both streaming and spill cases as well. > > This update prevents users from grasping the exact number and size of > successful and unsuccessful transactions. Accordingly, we need to > have new columns of failed transactions that will work to differentiate > both of them for all types, which means spill, streaming and normal > transactions. This will help users to measure the exact status of > logical replication. Could you please elaborate on use cases of the proposed statistics? For example, the current statistics on pg_replication_slots can be used for tuning logical_decoding_work_mem as well as inferring the total amount of bytes passed to the output plugin. How will the user use those statistics? Also, if we want the stats of successful transactions why don't we show the stats of successful transactions in the view instead of ones of failed transactions? > > Attached file is the POC patch for this. > Current design is to save failed stats data in the ReplicationSlot struct. > This is because after the error, I'm not able to access the ReorderBuffer > object. > Thus, I chose the object where I can interact with at the > ReplicationSlotRelease timing. When discussing the pg_stat_replication_slots view, there was an idea to store the slot statistics on ReplicationSlot struct. But the idea was rejected mainly because the struct is on the shared buffer[1]. If we store those counts on ReplicationSlot struct it increases the usage of shared memory. And those statistics are used only by logical slots and don’t necessarily need to be shared among the server processes. Moreover, if we want to add more statistics on the view in the future, it further increases the usage of shared memory. If we want to track the stats of successful transactions, I think it's easier to track them on the subscriber side rather than the publisher side. We can increase counters when applying [stream]commit/abort logical changes on the subscriber. Regards, [1] https://www.postgresql.org/message-id/CAA4eK1Kuj%2B3G59hh3wu86f4mmpQLpah_mGv2-wfAPyn%2BzT%3DP4A%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: pg_upgrade does not upgrade pg_stat_statements properly
On Thu, Jul 15, 2021 at 08:15:57AM -0700, David G. Johnston wrote:
> On Thursday, July 15, 2021, David G. Johnston wrote:
> My uncertainty revolves around core extensions since it seems odd to
> tell the user to overwrite them with versions from an older version of
> PostgreSQL.
>
> Ok. Just re-read the docs a third time…no uncertainty regarding
> contrib now…following the first part of the instructions means that
> before one could re-run create extension they would need to restore
> the original contrib library files to avoid the new extension code
> using the old library. So that whole part about recreation is
> inconsistent with the existing unchanged text.

I came up with the attached patch.

--
Bruce Momjian  https://momjian.us
EDB            https://enterprisedb.com

If only the physical world exists, free will is an illusion.

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index a83c63cd98..c042300c8c 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -306,8 +306,9 @@ make prefix=/usr/local/pgsql.new install
     into the new cluster, e.g.,
     pgcrypto.so, whether they are from contrib
     or some other source. Do not install the schema definitions, e.g.,
-    CREATE EXTENSION pgcrypto, because these will be upgraded
-    from the old cluster.
+    CREATE EXTENSION pgcrypto, because these will be copied
+    from the old cluster. (Consider upgrading the extensions later
+    using ALTER EXTENSION ... UPDATE.)
     Also, any custom full text search files (dictionary, synonym,
     thesaurus, stop words) must also be copied to the new cluster.
Re: Slightly improve initdb --sync-only option's help message
On Mon, Jul 26, 2021 at 11:05 AM Bossart, Nathan wrote:
> Here are my suggestions in patch form.

+	printf(_("  -S, --sync-only           safely write all database files to disk and exit\n"));

Not your patch's fault, but the word "write" does not seem to convey
the true intent of the option, because generally a "write" operation is
still limited to dirtying the OS buffers, and does not guarantee
sync-to-disk. It'd be better if the help message said either "flush all
database files to disk and exit", or "sync all database files to disk
and exit".

Best regards,
--
Gurjeet Singh
http://gurjeet.singh.im/
Re: Out-of-memory error reports in libpq
Ranier Vilela writes: > IMO, I think that "char *msg" is unnecessary, isn't it? > + if (!PQExpBufferBroken(errorMessage)) > + res->errMsg = pqResultStrdup(res, errorMessage->data); > else > - res->errMsg = NULL; > + res->errMsg = libpq_gettext("out of memory\n"); Please read the comment. regards, tom lane
Re: pgbench bug candidate: negative "initial connection time"
Hello, On Fri, 18 Jun 2021 15:58:48 +0200 (CEST) Fabien COELHO wrote: > Attached Yugo-san patch with some updates discussed in the previous mails, > so as to move things along. I attached the patch rebased to a change due to 856de3b39cf. Regards, Yugo Nagata -- Yugo NAGATA diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 55d14604c0..e460dc7e5b 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -3181,6 +3181,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) if ((st->con = doConnect()) == NULL) { + /* as the bench is already running, we do not abort the process */ pg_log_error("client %d aborted while establishing connection", st->id); st->state = CSTATE_ABORTED; break; @@ -4418,7 +4419,10 @@ runInitSteps(const char *initialize_steps) initPQExpBuffer(&stats); if ((con = doConnect()) == NULL) + { + pg_log_fatal("connection for initialization failed"); exit(1); + } setup_cancel_handler(NULL); SetCancelConn(con); @@ -6361,7 +6365,10 @@ main(int argc, char **argv) /* opening connection... */ con = doConnect(); if (con == NULL) + { + pg_log_fatal("setup connection failed"); exit(1); + } /* report pgbench and server versions */ printVersion(con); @@ -6515,7 +6522,7 @@ main(int argc, char **argv) #endif /* ENABLE_THREAD_SAFETY */ for (int j = 0; j < thread->nstate; j++) - if (thread->state[j].state == CSTATE_ABORTED) + if (thread->state[j].state != CSTATE_FINISHED) exit_code = 2; /* aggregate thread level stats */ @@ -6583,7 +6590,7 @@ threadRun(void *arg) if (thread->logfile == NULL) { pg_log_fatal("could not open logfile \"%s\": %m", logpath); - goto done; + exit(1); } } @@ -6607,16 +6614,10 @@ threadRun(void *arg) { if ((state[i].con = doConnect()) == NULL) { -/* - * On connection failure, we meet the barrier here in place of - * GO before proceeding to the "done" path which will cleanup, - * so as to avoid locking the process. - * - * It is unclear whether it is worth doing anything rather - * than coldly exiting with an error message. - */ -THREAD_BARRIER_WAIT(&barrier); -goto done; +/* coldly abort on initial connection failure */ +pg_log_fatal("cannot create connection for client %d", + state[i].id); +exit(1); } } @@ -6749,7 +6750,7 @@ threadRun(void *arg) continue; } /* must be something wrong */ -pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD); +pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD); goto done; } }
Re: pg_upgrade does not upgrade pg_stat_statements properly
On Wed, Jul 28, 2021 at 6:52 PM Bruce Momjian wrote:
> I came up with the attached patch.

Thank you. It is an improvement, but I think more could be done here
(not exactly sure what - though removing the "copy binaries for contrib
modules from the old server" part seems like a decent second step.)

I'm not sure it really needs a parenthetical, and I personally dislike
using "Consider" to start the sentence. "Bringing extensions up to the
newest version available on the new server can be done later using
ALTER EXTENSION ... UPDATE (after ensuring the correct binaries are
installed)."

David J.
Re: needless complexity in StartupXLOG
On Thu, Jul 29, 2021 at 3:28 AM Andres Freund wrote: > > [1] I really wish somebody had the energy to just remove single user and > bootstrap modes. The degree to which they increase complexity in the rest of > the system is entirely unreasonable. There's not actually any reason > bootstrapping can't happen with checkpointer et al running, it's just > autovacuum that'd need to be blocked. Any objection to adding an entry for that in the wiki TODO?
[postgres_fdw] add local pid to fallback_application_name
Hi Hackers,

I propose adding trackable information to postgres_fdw, in order to
track remote queries correctly.

## Background and motivation

Currently postgres_fdw connects to remote servers by using
connect_pg_server(). However, the function just calls
PQconnectdbParams() with fallback_application_name = "postgres_fdw."
Therefore, if two or more servers connect to one data server and two
queries arrive at the data server, the database administrator cannot
determine which queries came from which server. This problem prevents
some workload analysis because it cannot track the flow of queries.

## Implementation

I just added the local backend's pid to fallback_application_name. This
is the key for searching and matching the two logs. In order to use the
feature and track remote transactions, the user must add backend-pid
and application_name to log_line_prefix, like

```
log_line_prefix = '%m [%p] [%a] '
```

Here is an output example. Assume that the remote server has a table
"foo," and the local server imports the schema. When the local server
executes a foreign scan, the following line is output in the local's
logfile.

```
2021-07-29 03:18:50.630 UTC [21572] [psql] LOG:  duration: 23.366 ms  statement: select * from foo;
```

And in the remote's one, the following line appeared.

```
2021-07-29 03:18:50.628 UTC [21573] [postgres_fdw for remote PID: 21572] LOG:  duration: 0.615 ms  parse : DECLARE c1 CURSOR FOR SELECT id FROM public.foo
```

The two lines have the same pid, so we can track the log and analyze
workloads correctly.

I will write docs later, but for now I want you to review the
motivation and implementation.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
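A minimal sketch of the idea (illustrative only; the variable names are
assumptions based on the log sample above, not necessarily what the
attached patch does) is to compose the value from MyProcPid before
connect_pg_server() hands the keyword array to PQconnectdbParams():

```
/* illustrative only; the real patch may differ */
static char fallback_app_name[64];

snprintf(fallback_app_name, sizeof(fallback_app_name),
         "postgres_fdw for remote PID: %d", MyProcPid);

/* added alongside the other connection keywords */
keywords[n] = "fallback_application_name";
values[n] = fallback_app_name;
n++;
```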
Re: pg_receivewal starting position
At Wed, 28 Jul 2021 12:57:39 +0200, Ronan Dunklau wrote in
> On Wednesday, July 28, 2021 08:22:30 CEST Kyotaro Horiguchi wrote:
> > At Tue, 27 Jul 2021 07:50:39 +0200, Ronan Dunklau wrote in
> > > I don't know if it should be the default, toggled by a command line
> > > flag, or if we even should let the user provide a LSN.
> >
> > *I* think it is completely reasonable (or at least convenient or less
> > astonishing) that pg_receivewal starts from the restart_lsn of the
> > replication slot to use. The tool already decides the clean-start LSN
> > in a somewhat unusual way. And it seems to me that the proposed
> > behavior can be the default when -S is specified.
>
> As of now we can't get the replication slot's restart_lsn with a
> replication connection AFAIK.
>
> This implies that the patch could require the user to specify a
> maintenance-db parameter, and we would use that if provided to fetch
> the replication slot info, or fall back to the previous behaviour. I
> don't really like this approach, as the behaviour changing whether we
> supply a maintenance-db parameter or not is error-prone for the user.
>
> Another option would be to add a new replication command (for example
> ACQUIRE_REPLICATION_SLOT) to set the replication slot as the current
> one, and return some info about it (restart_lsn at least for a
> physical slot).

I hadn't thought it through in detail. But I forgot that ordinary SQL
commands are prohibited on a physical replication connection. So we
need a new replication command, but that's not a big deal.

> I don't see any reason not to make it work for logical replication
> connections / slots, but it wouldn't be that useful since we can query
> the database in that case.

Ordinary SQL queries are usable on a logical replication slot, so I'm
not sure how a logical replication connection would use the command.
However, like you, I wouldn't bother restricting the command to
physical replication, but perhaps the new command should return the
slot type.

> Acquiring the replication slot instead of just reading it would make
> sure that no other process could start the replication between the
> time we read the restart_lsn and when we issue START_REPLICATION.
> START_REPLICATION could then check if we already have a replication
> slot, and ensure it is the same one as the one we're trying to use.

I'm not sure it's worth adding complexity for such strictness.
START_REPLICATION safely fails if someone steals the slot meanwhile.
In the first place there's no means to protect a slot from others
while idle. One possible problem is the case where START_REPLICATION
successfully acquires the slot after the new command failed. But that
case doesn't seem worse than the case where someone advances the slot
in our absence. So I think READ_REPLICATION_SLOT is sufficient.

> From pg_receivewal's point of view, this would amount to:
>
> - check if we currently have wal in the target directory.
>   - if we do, proceed as currently done, by computing the start lsn
>     and timeline from the last archived wal
> - if we don't, and we have a slot, run ACQUIRE_REPLICATION_SLOT. Use
>   the restart_lsn as the start lsn if there is one, and don't provide
>   a timeline
> - if we still don't have a start_lsn, fall back to using the current
>   server wal position as is done.

That's pretty much it.

> What do you think? Which information should we provide about the slot?

We need the timeline id to start with when using restart_lsn. The
current timeline can be used in most cases, but there's a case where
the LSN is historical.

pg_receivewal doesn't send a replication status report when a segment
is finished. So after pg_receivewal stops just after a segment is
finished, the slot stays at the beginning of the last segment. Thus
next time it will start from there, creating a duplicate segment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
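A rough sketch of the start-position logic described above, as it might
look inside pg_receivewal (FindStreamingStart() and RunIdentifySystem()
exist today; GetSlotInformation() and the new replication command are
assumptions at this stage):

    XLogRecPtr  start_lsn = InvalidXLogRecPtr;
    TimeLineID  start_tli = 0;

    /* 1. prefer the position computed from segments already on disk */
    start_lsn = FindStreamingStart(&start_tli);

    /* 2. otherwise, if a slot was given, ask the server for its restart_lsn */
    if (start_lsn == InvalidXLogRecPtr && replication_slot != NULL)
        GetSlotInformation(conn, replication_slot, &start_lsn, &start_tli);

    /* 3. otherwise fall back to the server's current WAL position */
    if (start_lsn == InvalidXLogRecPtr &&
        !RunIdentifySystem(conn, NULL, &start_tli, &start_lsn, NULL))
        exit(1);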
Re: needless complexity in StartupXLOG
On Mon, Jul 26, 2021 at 9:43 PM Robert Haas wrote:
>
> StartupXLOG() has code beginning around line 7900 of xlog.c that
> decides, at the end of recovery, between four possible courses of
> action. It either writes an end-of-recovery record, or requests a
> checkpoint, or creates a checkpoint, or does nothing, depending on the
> value of 3 flag variables, and also on whether we're still able to
> read the last checkpoint record:
>
>     checkPointLoc = ControlFile->checkPoint;
>
>     /*
>      * Confirm the last checkpoint is available for us to recover
>      * from if we fail.
>      */
>     record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
>     if (record != NULL)
>     {
>         promoted = true;
>
> It seems to me that ReadCheckpointRecord() should never fail here. It
> should always be true, even when we're not in recovery, that the last
> checkpoint record is readable. If there's ever a situation where that
> is not true, even for an instant, then a crash at that point will be
> unrecoverable. Now, one way that it could happen is if we've got a bug
> in the code someplace that removes WAL segments too soon. However, if
> we have such a bug, we should fix it. What this code does is says "oh,
> I guess we removed the old checkpoint too soon, no big deal, let's
> just be more aggressive about getting the next one done," which I do
> not think is the right response. Aside from a bug, the only other way
> I can see it happening is if someone is manually removing WAL segments
> as the server is running through recovery, perhaps as some last-ditch
> play to avoid running out of disk space. I don't think the server
> needs to have - or should have - code to cater to such crazy
> scenarios. Therefore I think that this check, at least in its current
> form, is not sensible.
>
> My first thought was that we should do the check unconditionally,
> rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
> ERROR if it fails. But then I wondered what the point of that would be
> exactly. If we have such a bug -- and to the best of my knowledge
> there's no evidence that we do -- there's no particular reason it
> should only happen at the end of recovery. It could happen any time
> the system -- or the user, or malicious space aliens -- remove files
> from pg_wal, and we have no real idea about the timing of malicious
> space alien activity, so doing the test here rather than anywhere else
> just seems like a shot in the dark. Perhaps the most logical place
> would be to move it to CreateCheckPoint() just after we remove old
> xlog files, but we don't have the xlogreader there, and anyway I don't
> see how it's really going to help. What bug do we expect to catch by
> removing files we think we don't need and then checking that we didn't
> remove the files we think we do need? That seems more like grasping at
> straws than a serious attempt to make things work better.
>
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.

Can we have an elog() fatal error or warning to make sure that the last
checkpoint is still readable? The case where the user (knowingly or
unknowingly) or some buggy code has removed the WAL file containing the
last checkpoint is still possible, and if it happens we would have a
hard time tracking it down once we start seeing further unexpected
behavior because of it. Thoughts?

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.
>
> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
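If the warning idea suggested above were adopted, a minimal sketch
(placement and wording are assumptions, not a concrete proposal) could
keep the existing re-read of the checkpoint record and merely complain
when it fails:

    record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
    if (record == NULL)
        ereport(WARNING,
                (errmsg("checkpoint record at %X/%X is no longer readable",
                        LSN_FORMAT_ARGS(checkPointLoc))));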
Re: alter table set TABLE ACCESS METHOD
On Wed, Jul 28, 2021 at 01:05:10PM -0700, Jeff Davis wrote:
> On Wed, 2021-07-28 at 14:02 +0900, Michael Paquier wrote:
>> Right. Isn't that an older issue though? A rewrite involved after a
>> change of relpersistence does not call the hook either. It looks to
>> me that this should be after finish_heap_swap() to match with
>> ATExecSetTableSpace() in ATRewriteTables(). The only known user of
>> object_access_hook in the core code is sepgsql, so this would
>> involve a change of behavior. And I don't recall any backpatching
>> that added a post-alter hook.
>
> Sounds like it should be a different patch.

Thank you. Doing any checks around the hooks of objectaccess.h is very
annoying, because we have no modules to check after them easily except
sepgsql. Anyway, I have been checking that, with the hack-ish module
attached, and tracked down that swap_relation_files() calls
InvokeObjectPostAlterHookArg() already (as you already spotted?), but
that's an internal change when it comes to SET LOGGED/UNLOGGED/ACCESS
METHOD :(

Attached is a small module I have used for those tests, for reference.
It passes on HEAD, and with the patch attached you can see the extra
entries.

>>> Also, I agree with Justin that it should fail when there are
>>> multiple SET ACCESS METHOD subcommands consistently, regardless of
>>> whether one is a no-op, and it should probably throw a syntax error
>>> to match SET TABLESPACE.
>>
>> Hmm. Okay.

I'd still disagree with that. One example is SET LOGGED that would not
fail when repeated multiple times. Also, tracking down if a SET ACCESS
METHOD subcommand has been passed down requires an extra field in each
tab entry, so I am not sure that this is worth the extra complication.
I can see benefits and advantages one way or the other, and I would
tend to keep the code as simple as possible, as we never store
InvalidOid for a table AM. Anyway, I won't fight the majority either.

>>> Minor nit: in tab-complete.c, why does it say "<sth>"? Is that just
>>> a typo or is there a reason it's different from everything else,
>>> which uses ""? And what does "sth" mean anyway?
>>
>> "Something". That should be "" to be consistent with the area.
>
> These two issues are pretty minor.

Fixed this one, while not forgetting about it. Thanks.
--
Michael

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fcd778c62a..b18de38e73 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5475,6 +5475,8 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
 											   RecentXmin,
 											   ReadNextMultiXactId(),
 											   persistence);
+
+			InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
 		}
 		else
 		{