Re: Postgres picks suboptimal index after building of an extended statistics
On 8/12/21 4:26 AM, Tomas Vondra wrote:
> On 8/11/21 2:48 AM, Peter Geoghegan wrote:
>> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov wrote:
>>> Ivan Frolkov reported a problem with choosing a non-optimal index
>>> during query optimization. This problem appeared after building
>>> extended statistics.
>>
>> Any thoughts on this, Tomas?
>
> Thanks for reminding me, I missed / forgot about this thread.
>
> I agree the current behavior is unfortunate, but I'm not convinced the
> proposed patch is fixing the right place - doesn't this mean the index
> costing won't match the row estimates displayed by EXPLAIN?

I think it is not a problem. In EXPLAIN you will see only 1 row
with or without this patch.

> I wonder if we should teach clauselist_selectivity about UNIQUE indexes,
> and improve the cardinality estimates directly, not just costing for
> index scans.

This idea looks better. I will try to implement it.

> Also, is it correct that the patch calculates num_sa_scans only when
> (numIndexTuples >= 0.0)?

Thanks, fixed.

--
regards,
Andrey Lepikhov
Postgres Professional

>From 8a4ad08d61a5d14a45ef5e182f002e918f0eaccc Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov
Date: Wed, 23 Jun 2021 12:05:24 +0500
Subject: [PATCH] In the case of a unique one-row btree index scan only one
 row can be returned. In the genericcostestimate() routine we must arrange
 the index selectivity value in accordance with this knowledge.

---
 src/backend/utils/adt/selfuncs.c | 78
 1 file changed, 48 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0c8c05f6c2..9538c4a5b4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6369,15 +6369,9 @@ genericcostestimate(PlannerInfo *root,
 	double		num_scans;
 	double		qual_op_cost;
 	double		qual_arg_cost;
-	List	   *selectivityQuals;
 	ListCell   *l;
 
-	/*
-	 * If the index is partial, AND the index predicate with the explicitly
-	 * given indexquals to produce a more accurate idea of the index
-	 * selectivity.
-	 */
-	selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+	numIndexTuples = costs->numIndexTuples;
 
 	/*
 	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
@@ -6398,36 +6392,59 @@ genericcostestimate(PlannerInfo *root,
 		}
 	}
 
-	/* Estimate the fraction of main-table tuples that will be visited */
-	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
-											  index->rel->relid,
-											  JOIN_INNER,
-											  NULL);
+	if (numIndexTuples >= 0.0)
+	{
+		List	   *selectivityQuals;
 
-	/*
-	 * If caller didn't give us an estimate, estimate the number of index
-	 * tuples that will be visited.  We do it in this rather peculiar-looking
-	 * way in order to get the right answer for partial indexes.
-	 */
-	numIndexTuples = costs->numIndexTuples;
-	if (numIndexTuples <= 0.0)
+		/*
+		 * If the index is partial, AND the index predicate with the explicitly
+		 * given indexquals to produce a more accurate idea of the index
+		 * selectivity.
+		 */
+		selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+
+		/* Estimate the fraction of main-table tuples that will be visited */
+		indexSelectivity = clauselist_selectivity(root, selectivityQuals,
+												  index->rel->relid,
+												  JOIN_INNER,
+												  NULL);
+
+		/*
+		 * If caller didn't give us an estimate, estimate the number of index
+		 * tuples that will be visited.  We do it in this rather peculiar-looking
+		 * way in order to get the right answer for partial indexes.
+		 */
+		if (numIndexTuples == 0.0)
+		{
+			numIndexTuples = indexSelectivity * index->rel->tuples;
+
+			/*
+			 * The above calculation counts all the tuples visited across all
+			 * scans induced by ScalarArrayOpExpr nodes. We want to consider the
+			 * average per-indexscan number, so adjust. This is a handy place to
+			 * round to integer, too. (If caller supplied tuple estimate, it's
+			 * responsible for handling these considerations.)
+			 */
+			numIndexTuples = rint(numIndexTuples / num_sa_scans);
+		}
+	}
+	else
 	{
-		numIndexTuples = indexSelectivity * index->rel->tuples;
+		Assert(numIndexTuples == -1.0);
 
 		/*
-		 * The above calculation counts all the tuples visited across all
-		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
-		 * average per-indexscan number, so adjust.  This is a handy place to
-		 * round to integer, too.  (If caller supplied tuple estimate, it's
-		 * responsible for handling these considerations.)
+		 * Unique one row scan can select no more than one row. It needs to
+		 * manually set the selectivity of the index. The value of numIndexTuples
+		 * will be corrected later.
 		 */
-		numIndexTuples = rint(numIndexTuples / num_sa_scans);
+		indexSelectivity = 1.0 / index->rel->tuples;
 	}
 
 	/*
	 * We can bound the number of tuples by the index size in any case. Also,
Re: Default to TIMESTAMP WITH TIME ZONE?
On 13/08/21 5:14 pm, Greg Stark wrote:
> I think having a GUC to change to a different set of semantics is not
> workable. However that doesn't mean we can't do anything. We could have
> a GUC that simply disallows creating columns of type timestamp without
> tz. That could print an error with a hint suggesting you can change the
> variable if you really want to allow them. I still would hesitate to
> make it the default but admins could set it to help their developers.

[...]

I always use the tz version, except when I forget. Initially, I was
totally unaware of the need & usefulness of storing time with a time
zone. So I would use the GUC.

I suspect that it is extremely rare for one not to want the TZ option,
which suggests to me that using TZ should be the default.

Cheers,
Gavin
Re: Shared memory size computation oversight?
On 12.08.21 16:18, Julien Rouhaud wrote:
> On Thu, Aug 12, 2021 at 9:34 PM Peter Eisentraut wrote:
>> On 27.02.21 09:08, Julien Rouhaud wrote:
>>> PFA a patch that fixes pg_prewarm and pg_stat_statements explicit
>>> alignment to CACHELINEALIGN, and also updates the alignment in
>>> hash_estimate_size() to what I think ShmemInitHash will actually
>>> consume.
>>
>> For extensions, wouldn't it make things better if
>> RequestAddinShmemSpace() applied CACHELINEALIGN() to its argument?
>> If you consider the typical sequence of RequestAddinShmemSpace(mysize())
>> and later ShmemInitStruct(..., mysize(), ...),
>
> But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would
> only really work for that specific usage? If an extension does multiple
> allocations you can't rely on correct results.

I think you can do different things here to create inconsistent results,
but I think there should be one common, standard, normal, straightforward
way to get a correct result.

>> Btw., I think your patch was wrong to apply CACHELINEALIGN() to
>> intermediate results rather than at the end.
>
> I'm not sure why you think it's wrong. It's ShmemInitStruct() that will
> apply the padding, so if the extension calls it multiple times (whether
> directly or indirectly), then the padding will be added multiple times.
> Which means that in theory the extension should account for it multiple
> times in the amount of memory it's requesting.

Yeah, in that case it's probably rather the case that there is one
CACHELINEALIGN() too few, since pg_stat_statements does two separate
shmem allocations.
Re: [bug] Logical Decoding of relation rewrite with toast does not reset toast_hash
On Fri, Aug 13, 2021 at 11:47 AM Drouvot, Bertrand wrote: > > On 8/12/21 1:00 PM, Amit Kapila wrote: > >> > >> - sets "relwrewrite" for the toast. > >> > > --- a/src/backend/commands/tablecmds.c > > +++ b/src/backend/commands/tablecmds.c > > @@ -3861,6 +3861,10 @@ RenameRelationInternal(Oid myrelid, const char > > *newrelname, bool is_internal, bo > >*/ > >namestrcpy(&(relform->relname), newrelname); > > > > + /* reset relrewrite for toast */ > > + if (relform->relkind == RELKIND_TOASTVALUE) > > + relform->relrewrite = InvalidOid; > > + > > > > I find this change quite ad-hoc. I think this API is quite generic to > > make such a change. I see two ways for this (a) pass a new bool flag > > (reset_toast_rewrite) in this API and then make this change, (b) in > > the specific place where we need this, change relrewrite separately > > via a new API. > > > > I would prefer (b) in the ideal case, but I understand it would be an > > additional cost, so maybe (a) is also okay. What do you people think? > > I would prefer a) because: > > - b) would need to update the exact same tuple one more time (means > doing more or less the same work: open relation, search for the tuple, > update the tuple...) > > - a) would still give the ability for someone reading the code to > understand where the relrewrite reset is needed (as opposed to the way > the patch is currently written) > > - finish_heap_swap() with swap_toast_by_content set to false, is the > only place where we currently need to reset explicitly relrewrite (so > that we would have the new API produced by b) being called only at that > place). > > - That means that b) would be only for code readability but at the price > of extra cost. > Anybody else would like to weigh in? I think it is good if few others also share their opinion as we need to backpatch this bug-fix. -- With Regards, Amit Kapila.
Re: Bug in huge simplehash
Hi, On 2021-08-10 11:52:59 +0300, Yura Sokolov wrote: > - sizemask is set only in SH_COMPUTE_PARAMETERS . And it is set in this way: > > /* now set size */ > tb->size = size; > > if (tb->size == SH_MAX_SIZE) > tb->sizemask = 0; > else > tb->sizemask = tb->size - 1; > > that means, when we are resizing to SH_MAX_SIZE, sizemask becomes zero. I think that was intended to be ~0. > Ahh... ok, patch is updated to fix this as well. Any chance you'd write a test for simplehash with such huge amount of values? It'd require a small bit of trickery to be practical. On systems with MAP_NORESERVE it should be feasible. > static inline void > -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize) > +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize) > { > uint64 size; > > @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize) > > /* now set size */ > tb->size = size; > - > - if (tb->size == SH_MAX_SIZE) > - tb->sizemask = 0; > - else > - tb->sizemask = tb->size - 1; > + tb->sizemask = (uint32)(size - 1); ISTM using ~0 would be nicer here? Greetings, Andres Freund
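For context on why a zero sizemask misbehaves: bucket selection in
simplehash.h masks the hash with sizemask (roughly the code below), so a
mask of 0 funnels every key into bucket 0, while size - 1 at SH_MAX_SIZE
is equivalent to all bits set, which is what the patch effectively restores.

    /* paraphrased from simplehash.h; illustration only */
    static inline uint32
    SH_INITIAL_BUCKET(SH_TYPE *tb, uint32 hash)
    {
        return hash & tb->sizemask;
    }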
Re: Autovacuum on partitioned table (autoanalyze)
Hi, On 2021-08-11 18:33:07 -0400, Alvaro Herrera wrote: > After thinking about the described issues for a while, my proposal is to > completely revamp the way this feature works. See below. > > Now, the proposal seems awfully invasive, but it's *the* way I see to > avoid the pgstat traffic. For pg14, maybe we can live with it, and just > use the smaller patches that Horiguchi-san and I have posted, which > solve the other issues; also, Euler Taveira suggested that we could add > a reloption to turn the feature off completely for some tables (maybe > make it off by default and have a reloption to turn it on for specific > partition hierarchies), so that it doesn't cause unduly pain for people > with large partitioning hierarchies. I think we should revert the changes for 14 - to me the feature clearly isn't mature enough to be released. > * PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is > the OID of a single partitioned ancestor, to which the changed-tuple > counts are propagated up. > Normally this is the topmost ancestor; but if the user wishes some > intermediate ancestor to receive the counts they can use > ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on). > > * Corollary 1: for the normal case of single-level partitioning, the > parent partitioned table behaves as currently. > > * Corollary 2: for multi-level partitioning with no especially > configured intermediate ancestors, only the leaf partitions and the > top-level partitioned table will be analyzed. Intermediate ancestors > are ignored by autovacuum. > > * Corollary 3: for multi-level partitioning with some intermediate > ancestor(s) marked as autovacuum_enabled=on, that ancestor will > receive all the counts from all of its partitions, so it will get > analyzed itself; and it'll also forward those counts up to its > report-ancestor. This seems awfully confusing to me. One fundamental issue here is that we separately build stats for partitioned tables and partitions. Can we instead tackle this by reusing the stats for partitions for the inheritance stats? I think it's a bit easier to do that for partitioned tables than for old school inheritance roots, because there's no other rows in partitioned tables. Greetings, Andres Freund
Re: Bug in huge simplehash
Andres Freund wrote on 2021-08-13 12:21:
> Hi,
>
> On 2021-08-10 11:52:59 +0300, Yura Sokolov wrote:
>> - sizemask is set only in SH_COMPUTE_PARAMETERS. And it is set in this way:
>>
>>     /* now set size */
>>     tb->size = size;
>>
>>     if (tb->size == SH_MAX_SIZE)
>>         tb->sizemask = 0;
>>     else
>>         tb->sizemask = tb->size - 1;
>>
>> that means, when we are resizing to SH_MAX_SIZE, sizemask becomes zero.
>
> I think that was intended to be ~0.

I believe so.

>> Ahh... ok, patch is updated to fix this as well.
>
> Any chance you'd write a test for simplehash with such huge amount of
> values? It'd require a small bit of trickery to be practical. On systems
> with MAP_NORESERVE it should be feasible.

Which way should C structures be tested in postgres?
dynahash/simplehash - do they have any tests currently?
I'll do it if a hint is given.

>>  static inline void
>> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>>  {
>>  	uint64		size;
>>
>> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>>
>>  	/* now set size */
>>  	tb->size = size;
>> -
>> -	if (tb->size == SH_MAX_SIZE)
>> -		tb->sizemask = 0;
>> -	else
>> -		tb->sizemask = tb->size - 1;
>> +	tb->sizemask = (uint32)(size - 1);
>
> ISTM using ~0 would be nicer here?

I don't think so.
To be rigid it should be `~(uint32)0`.
But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`
that landed with the patch, so why is the `if` needed?

> Greetings,
>
> Andres Freund
Re: [PATCH] Hooks at XactCommand level
Hi, On 2021-08-10 10:12:26 +0200, Gilles Darold wrote: > Sorry for the response delay. I have though about adding this odd hook to be > able to implement this feature through an extension because I don't think > this is something that should be implemented in core. There were also > patches proposals which were all rejected. > > We usually implement the feature at client side which is imo enough for the > use cases. But the problem is that this a catastrophe in term of > performances. I have done a small benchmark to illustrate the problem. This > is a single process client on the same host than the PG backend. > > For 10,000 tuples inserted with 50% of failures and rollback at statement > level handled at client side: > > Expected: 5001, Count: 5001 > DML insert took: 13 wallclock secs ( 0.53 usr + 0.94 sys = 1.47 > CPU) Something seems off here. This suggests every insert took 2.6ms. That seems awfully long, unless your network latency is substantial. I did a quick test implementing this in the naive-most way in pgbench, and I get better times - and there's *lots* of room for improvement. I used a pgbench script that sent the following: BEGIN; SAVEPOINT insert_fail; INSERT INTO testinsert(data) VALUES (1); ROLLBACK TO SAVEPOINT insert_fail; SAVEPOINT insert_success; INSERT INTO testinsert(data) VALUES (1); RELEASE SAVEPOINT insert_success; {repeat 5 times} COMMIT; I.e. 5 failing and 5 succeeding insertions wrapped in one transaction. I get >2500 tps, i.e. > 25k rows/sec. And it's not hard to optimize that further - the {ROLLBACK TO,RELEASE} SAVEPOINT; SAVEPOINT; INSERT can be sent in one roundtrip. That gets me to somewhere around 40k rows/sec. BEGIN; \startpipeline SAVEPOINT insert_fail; INSERT INTO testinsert(data) VALUES (1); \endpipeline \startpipeline ROLLBACK TO SAVEPOINT insert_fail; SAVEPOINT insert_success; INSERT INTO testinsert(data) VALUES (1); \endpipeline \startpipeline RELEASE SAVEPOINT insert_success; SAVEPOINT insert_fail; INSERT INTO testinsert(data) VALUES (1); \endpipeline \startpipeline ROLLBACK TO SAVEPOINT insert_fail; SAVEPOINT insert_success; INSERT INTO testinsert(data) VALUES (1); \endpipeline {repeat last two blocks three times} COMMIT; Greetings, Andres Freund
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi, On 2021-08-11 16:11:34 -0400, Melanie Plageman wrote: > On Tue, Aug 3, 2021 at 2:13 PM Andres Freund wrote: > > > Also, I'm unsure how writing the buffer action stats out in > > > pgstat_write_statsfiles() will work, since I think that backends can > > > update their buffer action stats after we would have already persisted > > > the data from the BufferActionStatsArray -- causing us to lose those > > > updates. > > > > I was thinking it'd work differently. Whenever a connection ends, it reports > > its data up to pgstats.c (otherwise we'd loose those stats). By the time > > shutdown happens, they all need to have already have reported their stats - > > so > > we don't need to do anything to get the data to pgstats.c during shutdown > > time. > > > > When you say "whenever a connection ends", what part of the code are you > referring to specifically? pgstat_beshutdown_hook() > Also, when you say "shutdown", do you mean a backend shutting down or > all backends shutting down (including postmaster) -- like pg_ctl stop? Admittedly our language is very imprecise around this :(. What I meant is that backends would report their own stats up to the stats collector when the connection ends (in pgstat_beshutdown_hook()). That means that when the whole server (pgstat and then postmaster, potentially via pg_ctl stop) shuts down, all the per-connection stats have already been reported up to pgstat. > > > diff --git a/src/backend/catalog/system_views.sql > > > b/src/backend/catalog/system_views.sql > > > index 55f6e3711d..96cac0a74e 100644 > > > --- a/src/backend/catalog/system_views.sql > > > +++ b/src/backend/catalog/system_views.sql > > > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS > > > pg_stat_get_bgwriter_buf_written_checkpoints() AS > > > buffers_checkpoint, > > > pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, > > > pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, > > > -pg_stat_get_buf_written_backend() AS buffers_backend, > > > -pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, > > > -pg_stat_get_buf_alloc() AS buffers_alloc, > > > pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; > > > > Material for a separate patch, not this. But if we're going to break > > monitoring queries anyway, I think we should consider also renaming > > maxwritten_clean (and perhaps a few others), because nobody understands what > > that is supposed to mean. > Do you mean I shouldn't remove anything from the pg_stat_bgwriter view? No - I just meant that now that we're breaking pg_stat_bgwriter queries, we should also rename the columns to be easier to understand. But that it should be a separate patch / commit... > > > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 > > > *num_buf_alloc) > > >*/ > > > *complete_passes += nextVictimBuffer / NBuffers; > > > } > > > - > > > - if (num_buf_alloc) > > > - { > > > - *num_buf_alloc = > > > pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0); > > > - } > > > SpinLockRelease(&StrategyControl->buffer_strategy_lock); > > > return result; > > > } > > > > Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I > > suspect this patch shouldn't get rid of numBufferAllocs at the same time as > > overhauling the stats stuff. Perhaps we don't need both - but it's not > > obvious > > that that's the case / how we can make that work. 
> > > > > > I initially meant to add a function to the patch like > pg_stat_get_buffer_actions() but which took a BufferActionType and > BackendType as parameters and returned a single value which is the > number of buffer action types of that type for that type of backend. > > let's say I defined it like this: > uint64 > pg_stat_get_backend_buffer_actions_stats(BackendType backend_type, > BufferActionType ba_type) > > Then, I intended to use that in StrategySyncStart() to set num_buf_alloc > by subtracting the value of StrategyControl->numBufferAllocs from the > value returned by pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER, > BA_Alloc), val, then adding that value, val, to > StrategyControl->numBufferAllocs. I don't think you could restrict this to B_BG_WRITER? The whole point of this logic is that bgwriter uses the stats for *all* backends to get the "usage rate" for buffers, which it then uses to control how many buffers to clean. > I think that would have the same behavior as current, though I'm not > sure if the performance would end up being better or worse. It wouldn't > be atomically incrementing StrategyControl->numBufferAllocs, but it > would do a few additional atomic operations in StrategySyncStart() than > before. Also, we would do all the work done by > pg_stat_get_buffer_actions() in StrategySyncStart(). I think it'd be better to separate changing the
Re: Bug in huge simplehash
Hi, On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote: > Andres Freund писал 2021-08-13 12:21: > > Any chance you'd write a test for simplehash with such huge amount of > > values? It'd require a small bit of trickery to be practical. On systems > > with MAP_NORESERVE it should be feasible. > > Which way C structures should be tested in postgres? > dynahash/simplhash - do they have any tests currently? > I'll do if hint is given. We don't have a great way right now :(. I think the best is to have a SQL callable function that uses some API. See e.g. test_atomic_ops() et al in src/test/regress/regress.c > > > static inline void > > > -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize) > > > +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize) > > > { > > > uint64 size; > > > > > > @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 > > > newsize) > > > > > > /* now set size */ > > > tb->size = size; > > > - > > > - if (tb->size == SH_MAX_SIZE) > > > - tb->sizemask = 0; > > > - else > > > - tb->sizemask = tb->size - 1; > > > + tb->sizemask = (uint32)(size - 1); > > > > ISTM using ~0 would be nicer here? > > I don't think so. > To be rigid it should be `~(uint32)0`. > But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)` > that is landed with patch, therefore why `if` is needed? Personally I find it more obvious to understand the intended behaviour with ~0 (i.e. all bits set) than with a width truncation. Greetings, Andres Freund
Re: logical replication empty transactions
On Mon, Aug 2, 2021 at 7:20 PM Amit Kapila wrote: > > On Fri, Jul 23, 2021 at 3:39 PM Ajin Cherian wrote: > > > > Let's first split the patch for prepared and non-prepared cases as > that will help to focus on each of them separately. BTW, why haven't > you considered implementing point 1b as explained by Andres in his > email [1]? I think we can send a keepalive message in case of > synchronous replication when we skip an empty transaction, otherwise, > it might delay in responding to transactions synchronous_commit mode. > I think in the tests done in the thread, it might not have been shown > because we are already sending keepalives too frequently. But what if > someone disables wal_sender_timeout or kept it to a very large value? > See WalSndKeepaliveIfNecessary. The other thing you might want to look > at is if the reason for frequent keepalives is the same as described > in the email [2]. > I have tried to address the comment here by modifying the ctx->update_progress callback function (WalSndUpdateProgress) provided for plugins. I have added an option by which the callback can specify if it wants to send keep_alives. And when the callback is called with that option set, walsender updates a flag force_keep_alive_syncrep. The Walsender in the WalSndWaitForWal for loop, checks this flag and if synchronous replication is enabled, then sends a keep alive. Currently this logic is added as an else to the current logic that is already there in WalSndWaitForWal, which is probably considered unnecessary and a source of the keep alive flood that you talked about. So, I can change that according to how that fix shapes up there. I have also added an extern function in syncrep.c that makes it possible for walsender to query if synchronous replication is turned on. The reason I had to turn on a flag and rely on the WalSndWaitForWal to send the keep alive in its next iteration is because I tried doing this directly when a commit is skipped but it didn't work. The reason for this is that when the commit is being decoded the sentptr at the moment is at the commit LSN and the keep alive will be sent for the commit LSN but the syncrep wait is waiting for end_lsn of the transaction which is the next LSN. So, sending a keep alive at the moment the commit is decoded doesn't seem to solve the problem of the waiting synchronous reply. > Few other miscellaneous comments: > 1. > static void > pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > - XLogRecPtr commit_lsn) > + XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn, > + TimestampTz prepare_time) > { > + PGOutputTxnData*txndata = (PGOutputTxnData *) > txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > + /* > + * If the BEGIN PREPARE was not yet sent, then it means there were no > + * relevant changes encountered, so we can skip the COMMIT PREPARED > + * message too. > + */ > + if (txndata) > + { > + bool skip = !txndata->sent_begin_txn; > + pfree(txndata); > + txn->output_plugin_private = NULL; > > How is this supposed to work after the restart when prepared is sent > before the restart and we are just sending commit_prepared after > restart? Won't this lead to sending commit_prepared even when the > corresponding prepare is not sent? Can we think of a better way to > deal with this? > I have tried to resolve this by adding logic in worker,c to silently ignore spurious commit_prepareds. 
But this change required checking if the prepare exists on the subscriber before attempting the commit_prepared but the current API that checks this requires prepare time and transaction end_lsn. But for this I had to change the protocol of commit_prepared, and I understand that this would break backward compatibility between subscriber and publisher (you have raised this issue as well). I am not sure how else to handle this, let me know if you have any other ideas. One option could be to have another API to check if the prepare exists on the subscriber with the prepared 'gid' alone, without checking prepare_time or end_lsn. Let me know if this idea works. I have left out the patch 0002 for prepared transactions until we arrive at a decision on how to address the above issue. Peter, I have also addressed the comments you've raised on patch 0001, please have a look and confirm. Regards, Ajin Cherian Fujitsu Australia. v13-0001-Skip-empty-transactions-for-logical-replication.patch Description: Binary data
Re: Bug in huge simplehash
On Fri, Aug 13, 2021 at 07:15, Andres Freund wrote:
> Hi,
>
> On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote:
>> Andres Freund wrote on 2021-08-13 12:21:
>>> Any chance you'd write a test for simplehash with such huge amount of
>>> values? It'd require a small bit of trickery to be practical. On systems
>>> with MAP_NORESERVE it should be feasible.
>>
>> Which way should C structures be tested in postgres?
>> dynahash/simplehash - do they have any tests currently?
>> I'll do it if a hint is given.
>
> We don't have a great way right now :(. I think the best is to have a
> SQL callable function that uses some API. See e.g. test_atomic_ops() et
> al in src/test/regress/regress.c
>
>>>>  static inline void
>>>> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>>>> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>>>>  {
>>>>  	uint64		size;
>>>>
>>>> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>>>>
>>>>  	/* now set size */
>>>>  	tb->size = size;
>>>> -
>>>> -	if (tb->size == SH_MAX_SIZE)
>>>> -		tb->sizemask = 0;
>>>> -	else
>>>> -		tb->sizemask = tb->size - 1;
>>>> +	tb->sizemask = (uint32)(size - 1);
>>>
>>> ISTM using ~0 would be nicer here?
>>
>> I don't think so.
>> To be rigid it should be `~(uint32)0`.
>> But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`
>> that landed with the patch, so why is the `if` needed?
>
> Personally I find it more obvious to understand the intended behaviour
> with ~0 (i.e. all bits set) than with a width truncation.

https://godbolt.org/z/57WcjKqMj
The generated code is identical.

regards,
Ranier Vilela
Re: Bug in huge simplehash
Ranier Vilela wrote on 2021-08-13 14:12:
> On Fri, Aug 13, 2021 at 07:15, Andres Freund wrote:
>> Hi,
>>
>> On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote:
>>> Andres Freund wrote on 2021-08-13 12:21:
>>>> Any chance you'd write a test for simplehash with such huge amount of
>>>> values? It'd require a small bit of trickery to be practical. On systems
>>>> with MAP_NORESERVE it should be feasible.
>>>
>>> Which way should C structures be tested in postgres?
>>> dynahash/simplehash - do they have any tests currently?
>>> I'll do it if a hint is given.
>>
>> We don't have a great way right now :(. I think the best is to have a
>> SQL callable function that uses some API. See e.g. test_atomic_ops() et
>> al in src/test/regress/regress.c
>>
>>>>>  static inline void
>>>>> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>>>>> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>>>>>  {
>>>>>  	uint64		size;
>>>>>
>>>>> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
>>>>>
>>>>>  	/* now set size */
>>>>>  	tb->size = size;
>>>>> -
>>>>> -	if (tb->size == SH_MAX_SIZE)
>>>>> -		tb->sizemask = 0;
>>>>> -	else
>>>>> -		tb->sizemask = tb->size - 1;
>>>>> +	tb->sizemask = (uint32)(size - 1);
>>>>
>>>> ISTM using ~0 would be nicer here?
>>>
>>> I don't think so.
>>> To be rigid it should be `~(uint32)0`.
>>> But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`
>>> that landed with the patch, so why is the `if` needed?
>>
>> Personally I find it more obvious to understand the intended behaviour
>> with ~0 (i.e. all bits set) than with a width truncation.
>
> https://godbolt.org/z/57WcjKqMj
> The generated code is identical.

I believe you mean https://godbolt.org/z/qWzE1ePTE

> regards,
> Ranier Vilela
Re: Bug in huge simplehash
On 2021-08-13 14:40:08 +0300, Yura Sokolov wrote: > Ranier Vilela писал 2021-08-13 14:12: > > Em sex., 13 de ago. de 2021 às 07:15, Andres Freund > > escreveu: > > > Personally I find it more obvious to understand the intended > > > behaviour > > > with ~0 (i.e. all bits set) than with a width truncation. > > > > https://godbolt.org/z/57WcjKqMj > > The generated code is identical. > > I believe, you mean https://godbolt.org/z/qWzE1ePTE I don't think the choice of instructions matters. This is called around creation and resizing - the other costs are several orders of magnitude more expensive than determining the mask.
Re: Shared memory size computation oversight?
On Fri, Aug 13, 2021 at 10:52:50AM +0200, Peter Eisentraut wrote: > On 12.08.21 16:18, Julien Rouhaud wrote: > > > > But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would > > only really work for that specific usage only? If an extension does > > multiple allocations you can't rely on correct results. > > I think you can do different things here to create inconsistent results, but > I think there should be one common, standard, normal, straightforward way to > get a correct result. Unless I'm missing something, the standard and straightforward way to get a correct result is to account for padding bytes in the C code, as it's currently done in all contrib modules. The issue is that those modules aren't using the correct alignment anymore. > > > > Btw., I think your patch was wrong to apply CACHELINEALIGN() to > > > intermediate results rather than at the end. > > > > I'm not sure why you think it's wrong. It's ShmemInitStruct() that > > will apply the padding, so if the extension calls it multiple times > > (whether directly or indirectly), then the padding will be added > > multiple times. Which means that in theory the extension should > > account for it multiple time in the amount of memory it's requesting. > > Yeah, in that case it's probably rather the case that there is one > CACHELINEALIGN() too few, since pg_stat_statements does two separate shmem > allocations. I still don't get it. Aligning the total shmem size is totally different from properly padding all single allocation sizes, and is almost never the right answer. Using a naive example, if your extension needs to ShmemInitStruct() twice 64B, postgres will consume 2*128B. But if you only rely on RequestAddinShmemSpace() to add a CACHELINEALIGN(), then no padding at all will be added, and you'll then be not one but two CACHELINEALIGN() too few. But again, the real issue is not the CACHELINEALIGN() roundings, as those have a more or less stable size and are already accounted for in the extra 100kB, but the dynahash size estimation which seems to be increasingly off as the number of entries grows.
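To make the padding-per-allocation point concrete, here is a minimal sketch
of how an extension that later makes two separate ShmemInitStruct() calls
might size its request (MyExtSharedState and my_ext_memsize are
hypothetical names, not from any existing extension):

    #include "postgres.h"
    #include "storage/ipc.h"
    #include "storage/shmem.h"

    typedef struct MyExtSharedState { int counter; } MyExtSharedState;

    static Size
    my_ext_memsize(void)
    {
        Size        size = 0;

        /* pad each allocation, since ShmemInitStruct() pads each one */
        size = add_size(size, CACHELINEALIGN(sizeof(MyExtSharedState)));
        size = add_size(size, CACHELINEALIGN(64));  /* a second, 64-byte struct */

        return size;
    }

    /* in _PG_init(): RequestAddinShmemSpace(my_ext_memsize()); */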
Re: [PATCH] Hooks at XactCommand level
Le 13/08/2021 à 11:58, Andres Freund a écrit : > Hi, > > On 2021-08-10 10:12:26 +0200, Gilles Darold wrote: >> Sorry for the response delay. I have though about adding this odd hook to be >> able to implement this feature through an extension because I don't think >> this is something that should be implemented in core. There were also >> patches proposals which were all rejected. >> >> We usually implement the feature at client side which is imo enough for the >> use cases. But the problem is that this a catastrophe in term of >> performances. I have done a small benchmark to illustrate the problem. This >> is a single process client on the same host than the PG backend. >> >> For 10,000 tuples inserted with 50% of failures and rollback at statement >> level handled at client side: >> >> Expected: 5001, Count: 5001 >> DML insert took: 13 wallclock secs ( 0.53 usr + 0.94 sys = 1.47 >> CPU) > Something seems off here. This suggests every insert took 2.6ms. That > seems awfully long, unless your network latency is substantial. I did a > quick test implementing this in the naive-most way in pgbench, and I get > better times - and there's *lots* of room for improvement. > > I used a pgbench script that sent the following: > BEGIN; > SAVEPOINT insert_fail; > INSERT INTO testinsert(data) VALUES (1); > ROLLBACK TO SAVEPOINT insert_fail; > SAVEPOINT insert_success; > INSERT INTO testinsert(data) VALUES (1); > RELEASE SAVEPOINT insert_success; > {repeat 5 times} > COMMIT; > > I.e. 5 failing and 5 succeeding insertions wrapped in one transaction. I > get >2500 tps, i.e. > 25k rows/sec. And it's not hard to optimize that > further - the {ROLLBACK TO,RELEASE} SAVEPOINT; SAVEPOINT; INSERT can be > sent in one roundtrip. That gets me to somewhere around 40k rows/sec. > > > BEGIN; > > \startpipeline > SAVEPOINT insert_fail; > INSERT INTO testinsert(data) VALUES (1); > \endpipeline > > \startpipeline > ROLLBACK TO SAVEPOINT insert_fail; > SAVEPOINT insert_success; > INSERT INTO testinsert(data) VALUES (1); > \endpipeline > > \startpipeline > RELEASE SAVEPOINT insert_success; > SAVEPOINT insert_fail; > INSERT INTO testinsert(data) VALUES (1); > \endpipeline > > \startpipeline > ROLLBACK TO SAVEPOINT insert_fail; > SAVEPOINT insert_success; > INSERT INTO testinsert(data) VALUES (1); > \endpipeline > > {repeat last two blocks three times} > COMMIT; > > Greetings, > > Andres Freund I have written a Perl script to mimic what I have found in an Oracle batch script to import data in a table. I had this use case in a recent migration the only difference is that the batch was written in Java. $dbh->do("BEGIN") or die "FATAL: " . $dbh->errstr . "\n"; my $start = new Benchmark; my $sth = $dbh->prepare("INSERT INTO t1 VALUES (?, ?)"); exit 1 if (not defined $sth); for (my $i = 0; $i <= 1; $i++) { $dbh->do("SAVEPOINT foo") or die "FATAL: " . $dbh->errstr . "\n"; # Generate a duplicate key each two row inserted my $val = $i; $val = $i-1 if ($i % 2 != 0); unless ($sth->execute($val, 'insert '.$i)) { $dbh->do("ROLLBACK TO foo") or die "FATAL: " . $dbh->errstr . "\n"; } else { $dbh->do("RELEASE foo") or die "FATAL: " . $dbh->errstr . "\n"; } } $sth->finish(); my $end = new Benchmark; $dbh->do("COMMIT;"); my $td = timediff($end, $start); print "DML insert took: " . timestr($td) . "\n"; The timing reported are from my personal computer, there is no network latency, it uses localhost. 
Anyway, the objective was not to benchmark DML throughput but the overhead
of statement-level rollback handled at the client side versus the server
side. I guess that you might see the same speed gain, around x3 to x5 or
more depending on the number of tuples?

The full script can be found here:
https://github.com/darold/pg_statement_rollbackv2/blob/main/test/batch_script_example.pl

Cheers,
--
Gilles Darold
Re: SI messages sent when excuting ROLLBACK PREPARED command
On Wed, Aug 11, 2021 at 03:14:11PM +0900, Michael Paquier wrote:
> I would just tweak the comment block at the top of what's being
> changed, as per the attached. Please let me know if there are any
> objections.

And applied as of 710796f.
--
Michael
Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS
On 2021-08-05 19:56:49 -0700, Andres Freund wrote: > Done in the attached patch. I don't think we need to add more to the docs than > the flag being required? Pushed that patch now. If we want further additions to the docs we can do so separately.
Re: CI/windows docker vs "am a service" autodetection on windows
Hi, Magnus, Michael, Anyone - I'd appreciate a look. On 2021-03-05 12:55:37 -0800, Andres Freund wrote: > > After fighting with a windows VM for a bit (ugh), it turns out that yes, > > there is stderr, but that fileno(stderr) returns -2, and > > GetStdHandle(STD_ERROR_HANDLE) returns NULL (not INVALID_HANDLE_VALUE). > > > > The complexity however is that while that's true for pg_ctl within > > pgwin32_ServiceMain: > > checking what stderr=7FF8687DFCB0 is (handle: 0, fileno=-2) > > but not for postmaster or backends > > WARNING: 01000: checking what stderr=7FF880F5FCB0 is (handle: 92, > > fileno=2) > > > > which makes sense in a way, because we don't tell CreateProcessAsUser() > > that it should pass stdin/out/err down (which then seems to magically > > get access to the "topmost" console applications output - damn, this > > stuff is weird). > > That part is not too hard to address - it seems we only need to do that > in pg_ctl pgwin32_doRunAsService(). It seems that the > stdin/stderr/stdout being set to invalid will then be passed down to > postmaster children. > > https://docs.microsoft.com/en-us/windows/console/getstdhandle > "If an application does not have associated standard handles, such as a > service running on an interactive desktop, and has not redirected them, > the return value is NULL." > > There does seem to be some difference between what services get as std* > - GetStdHandle() returns NULL, and what explicitly passing down invalid > handles to postmaster does - GetStdHandle() returns > INVALID_HANDLE_VALUE. But passing down NULL rather than > INVALID_HANDLE_VALUE to postmaster seems to lead to postmaster > re-opening console buffers. > > Patch attached. > I'd like to commit something to address this issue to master soon - it > allows us to run a lot more tests in cirrus-ci... But probably not > backpatch it [yet] - there've not really been field complaints, and who > knows if there end up being some unintentional side-effects... Because it'd allow us to run more tests as part of cfbot and other CI efforts, I'd like to push this forward. So I'm planning to commit this to master soon-ish, unless somebody wants to take this over? I'm really not a windows person... Greetings, Andres Freund
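For reference, the GetStdHandle() distinction discussed above (NULL when no
standard handles are associated, INVALID_HANDLE_VALUE when invalid handles
were explicitly passed down) can be probed with a few lines; a minimal
sketch, with a hypothetical helper name:

    #include <windows.h>
    #include <stdbool.h>

    /*
     * True if this process has a usable stderr handle.  Services with no
     * associated standard handles get NULL; children that were handed
     * explicitly-invalid handles get INVALID_HANDLE_VALUE.
     */
    static bool
    have_usable_stderr(void)
    {
        HANDLE h = GetStdHandle(STD_ERROR_HANDLE);

        return (h != NULL && h != INVALID_HANDLE_VALUE);
    }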
Proposal: More structured logging
Hello,

The logging system already captures a lot of information in the ErrorData.
But at present there is no way for log message authors to include more
metadata about the log entry outside of the log message itself. For example,
including the extension name, which can be useful for filtering /
dispatching log messages. This can be done in the log message itself, but
that requires specialized parsing in the log output.

Even though, among the available loggers in core, only the csv logger would
be able to make sense of it, I feel it would be beneficial to add a tagging
system to logs, by adding different (tag, value) combinations to a log
entry, with an API similar to what we do for other ErrorData fields:

ereport(NOTICE,
        (errmsg("My log message")),
        (errtag("EMITTER", "MYEXTENSION")),
        (errtag("MSG-ID", "%d", error_message_id))
);

Please find attached a very small POC patch to better demonstrate what I
propose.

Third-party logging hooks could then exploit those values to output them
correctly. For example the json logger written by Michael Paquier here:
https://github.com/michaelpq/pg_plugins/tree/master/jsonlog, or the
seemingly-abandoned journald hook here: https://github.com/intgr/pg_journal
could add that information in a structured way. I think the pgaudit
extension (or something similar) could also make good use of such a feature
instead of dumping a CSV entry in the log file.

As for Postgres' own log messages, I'm sure we could find practical use
cases for tagging messages like this.

On a related note, the only structured logger we have in-core is the CSV
logger. Many users nowadays end up feeding the logs to journald, either by
capturing the stderr output with systemd, or by having syslog implemented by
journald itself. Would there be any interest in having native journald
support as a logging_destination?

Best regards,
--
Ronan Dunklau

>From 270ffc5ed2fbe5f5076bddee14c5fb3555b87e4f Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Fri, 13 Aug 2021 15:03:18 +0200
Subject: [PATCH v1 1/2] Add ErrorTag support

---
 src/backend/utils/error/elog.c | 48 ++
 src/include/utils/elog.h       | 10 +++
 2 files changed, 58 insertions(+)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index a3e1c59a82..5b9b1b8a72 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -465,6 +465,7 @@ errstart(int elevel, const char *domain)
 	edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION;
 	/* errno is saved here so that error parameter eval can't change it */
 	edata->saved_errno = errno;
+	edata->tags = NIL;
 
 	/*
 	 * Any allocations for this error state level should go into ErrorContext
@@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname)
 	int			elevel;
 	MemoryContext oldcontext;
 	ErrorContextCallback *econtext;
+	ListCell   *lc;
 
 	recursion_depth++;
 	CHECK_STACK_DEPTH();
@@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname)
 		pfree(edata->constraint_name);
 	if (edata->internalquery)
 		pfree(edata->internalquery);
+	/* Every tag should have been palloc'ed */
+	if (edata->tags != NIL)
+	{
+		foreach(lc, edata->tags)
+		{
+			ErrorTag   *tag = (ErrorTag *) lfirst(lc);
+			pfree(tag->tagvalue);
+			pfree(tag);
+		}
+		pfree(edata->tags);
+	}
 	errordata_stack_depth--;
 
 	/* Exit error-handling context */
@@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural,
 	return 0;					/* return value does not matter */
 }
 
+int
+errtag(const char *tag, const char *fmt_value,...)
+{
+	ErrorData  *edata = &errordata[errordata_stack_depth];
+	ErrorTag   *etag;
+	MemoryContext oldcontext;
+	StringInfoData buf;
+
+	recursion_depth++;
+	CHECK_STACK_DEPTH();
+	oldcontext = MemoryContextSwitchTo(edata->assoc_context);
+	etag = palloc(sizeof(ErrorTag));
+	etag->tagname = tag;
+	initStringInfo(&buf);
+	for (;;)
+	{
+		va_list		args;
+		int			needed;
+
+		errno = edata->saved_errno;
+		va_start(args, fmt_value);
+		needed = appendStringInfoVA(&buf, fmt_value, args);
+		va_end(args);
+		if (needed == 0)
+			break;
+		enlargeStringInfo(&buf, needed);
+	}
+	etag->tagvalue = pstrdup(buf.data);
+	edata->tags = lappend(edata->tags, etag);
+	pfree(buf.data);
+	MemoryContextSwitchTo(oldcontext);
+	recursion_depth--;
+	return 0;
+}
+
 /*
  * errcontext_msg --- add a context error message text to the current error
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index f53607e12e..1c490d1b11 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -15,6 +15,7 @@
 #define ELOG_H
 
 #include <setjmp.h>
+#include "nodes/pg_list.h"
 
 /* Error level codes */
 #define DEBUG5		10			/* Debugging messages, in categories of
@@ -193,6 +194,8 @@ extern int	errhint(const char *fmt,...) pg_attribute_printf(1, 2);
 extern int	errhint_plural(const char *fmt_singular, const char *fmt_plural,
 						   unsigned
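To sketch how a third-party hook might consume these tags (names here are
illustrative only; only edata->tags, the ErrorTag tagname/tagvalue fields
from the patch, and the existing edata->message field are assumed):

    /* hypothetical emit_log_hook forwarding tags to a structured sink */
    static void
    tagged_emit_log_hook(ErrorData *edata)
    {
        StringInfoData buf;
        ListCell   *lc;

        initStringInfo(&buf);
        appendStringInfo(&buf, "message=%s", edata->message);

        foreach(lc, edata->tags)
        {
            ErrorTag   *tag = (ErrorTag *) lfirst(lc);

            appendStringInfo(&buf, " %s=%s", tag->tagname, tag->tagvalue);
        }

        /* hand buf.data to json/journald/whatever output the hook targets */
        pfree(buf.data);
    }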
Re: Default to TIMESTAMP WITH TIME ZONE?
On Fri, Aug 13, 2021 at 12:33 AM Gavin Flower wrote: > > I always use the tz version, except when I forget. I find it nearly impossible for me to forget how this works. But that is probably because I just pretend that the standard, multi-word, data types don't even exist. It's not that "timestamp" defaults to "WITHOUT TIME ZONE" but rather that there are only two types in existence: timestamp and timestamptz. It's self-evident which one doesn't handle time zones. > Suspect that it is extremely rare when one would not want to use the TZ > option, which suggests to me that using TZ should be the default. > > I would agree, but at this point I'd leave it up to documentation and educational materials to teach/encourage, not runtime behaviors. David J.
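A quick illustration of the distinction (any session with TimeZone set to
UTC behaves this way):

    SET timezone = 'UTC';
    SELECT '2021-08-13 10:00:00+02'::timestamp;    -- 2021-08-13 10:00:00 (offset discarded)
    SELECT '2021-08-13 10:00:00+02'::timestamptz;  -- 2021-08-13 08:00:00+00 (normalized)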
Re: pg_settings.pending_restart not set when line removed
Hi Alvaro,

On Tue, 27 Jul 2021 at 22:17, Alvaro Herrera wrote:
> I tested this -- it works correctly AFAICS.

Nope, IMO it doesn't work correctly. Let's say we have recovery_target = ''
in the config:

localhost/postgres=# select name, setting, setting is null, pending_restart
from pg_settings where name = 'recovery_target';
      name       │ setting │ ?column? │ pending_restart
─────────────────┼─────────┼──────────┼─────────────────
 recovery_target │         │ f        │ f
(1 row)

After that we remove it from the config and call pg_ctl reload. It sets
pending_restart:

localhost/postgres=# select name, setting, setting is null, pending_restart
from pg_settings where name = 'recovery_target';
      name       │ setting │ ?column? │ pending_restart
─────────────────┼─────────┼──────────┼─────────────────
 recovery_target │         │ f        │ t
(1 row)

IMO this is totally wrong, because the actual value didn't change: it was an
empty string in the config and now it remains an empty string due to the
default value in guc.c.

Regards,
--
Alexander Kukushkin
Re: pg_settings.pending_restart not set when line removed
Alexander Kukushkin writes: > IMO is totally wrong, because the actual value didn't change: it was an > empty string in the config and now it remains an empty string due to the > default value in the guc.c I can't get very excited about that. The existing message about "parameter \"%s\" cannot be changed without restarting the server" was emitted without regard to that fine point, and nobody has complained about it. regards, tom lane
Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o
On Thu, Aug 12, 2021 at 6:18 PM Andres Freund wrote: > > Hi, > > On 2021-08-12 15:06:23 +0530, Amit Kapila wrote: > > On Thu, Aug 12, 2021 at 1:52 PM Andres Freund wrote: > > > I'm not so sure. Why does sharedfileset have its own proc exit hook in the > > > first place? ISTM that this should be dealt with using resowners, rathers > > > than > > > a sharedfileset specific mechanism? > > > > > > The underlying temporary files need to be closed at xact end but need > > to survive across transactions. > > Why do they need to be closed at xact end? To avoid wasting memory due to too > many buffered files? > Yes. > > > These are registered with the resource owner via > > PathNameOpenTemporaryFile/PathNameCreateTemporaryFile and then closed > > at xact end. So, we need a way to remove the files used by the process > > (apply worker in this particular case) before process exit and used > > this proc_exit hook (possibly on the lines of AtProcExit_Files). > > What I'm wondering is why it is a good idea to have a SharedFileSet specific > cleanup mechanism. One that only operates on process lifetime level, rather > than something more granular. I get that the of the files here needs to be > longer than a transaction, but that can easily be addressed by having a longer > lived resource owner. > > Process lifetime may work well for the current worker.c, but even there it > doesn't seem optimal. One e.g. could easily imagine that we'd want to handle > connection errors or configuration changes without restarting the worker, in > which case process lifetime obviously isn't a good idea anymore. > I don't deny that we can't make this at a more granular level. IIRC, at that time, we tried to follow AtProcExit_Files which cleans up temp files at proc exit and we needed something similar for temporary files used via SharedFileSet. I think we can extend this API but I guess it is better to then do it for dsm-based as well so that these get tracked via resowner. > > I think SharedFileSetInit() needs a comment explaining that it needs to be > called in a process-lifetime memory context if used without dsm > segments. > We already have a comment about proc_exit clean up of files but will extend that a bit about memory context. -- With Regards, Amit Kapila.
Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o
On Thu, Aug 12, 2021 at 6:24 PM Andres Freund wrote: > > On 2021-08-12 05:48:19 -0700, Andres Freund wrote: > > I think SharedFileSetInit() needs a comment explaining that it needs to be > > called in a process-lifetime memory context if used without dsm > > segments. Because otherwise SharedFileSetDeleteOnProcExit() will access > > already freed memory (both for filesetlist and the SharedFileSet itself). > > Oh. And I think it's not ok that SharedFileSetDeleteAll() unconditionally does > SharedFileSetUnregister(). SharedFileSetUnregister() asserts out if there's no > match, but DSM based sets are never entered into filesetlist. So one cannot > have a non-DSM and DSM set coexisting. Which seems surprising. > Oops, it should be allowed to have both non-DSM and DSM set coexisting. I think we can remove Assert from SharedFileSetUnregister(). The other way could be to pass a parameter to SharedFileSetDeleteAll() to tell whether to unregister or not. -- With Regards, Amit Kapila.
RE: Multiple Postgres process are running in background
Hi,

We are using Postgres 10 (single client) and observed that there are
multiple PostgreSQL server processes running in the background.
Why are these additional processes created, or is this expected behavior?

Regards,
RamCharan
[PATCH] Native spinlock support on RISC-V
Hello,

The attached patch adds native spinlock support to PostgreSQL on RISC-V
systems. As suspected by Richard W.M. Jones of Red Hat back in 2016, the
__sync_lock_test_and_set() approach applied on arm and arm64 works here as
well.

Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV
Starlight beta board) - builds and installs fine, all tests pass. From what
I can see in the gcc documentation this should in theory work on rv32 (and
possibly rv128) as well, therefore the patch as it stands covers all RISC-V
systems (i.e. it doesn't check the value of __riscv_xlen) - but I haven't
confirmed this experimentally.

--
MS

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -315,12 +315,12 @@
 #endif	 /* __ia64__ || __ia64 */
 
 /*
- * On ARM and ARM64, we use __sync_lock_test_and_set(int *, int) if available.
+ * On ARM, ARM64 and RISC-V, we use __sync_lock_test_and_set(int *, int) if available.
  *
  * We use the int-width variant of the builtin because it works on more chips
  * than other widths.
  */
-#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
+#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64) || defined(__riscv)
 
 #ifdef HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
@@ -337,7 +337,7 @@
 #define S_UNLOCK(lock)	__sync_lock_release(lock)
 
 #endif	 /* HAVE_GCC__SYNC_INT32_TAS */
-#endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 */
+#endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 || __riscv */
 
 
 /* S/390 and S/390x Linux (32- and 64-bit zSeries) */
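For reference, the block that the widened #if enables is (roughly) the
existing int-width test-and-set implementation already used on ARM, which
RISC-V would now share:

    #ifdef HAVE_GCC__SYNC_INT32_TAS
    #define HAS_TEST_AND_SET

    #define TAS(lock) tas(lock)

    typedef int slock_t;

    static __inline__ int
    tas(volatile slock_t *lock)
    {
        /* returns the previous value; non-zero means the lock was held */
        return __sync_lock_test_and_set(lock, 1);
    }

    #define S_UNLOCK(lock) __sync_lock_release(lock)
    #endif   /* HAVE_GCC__SYNC_INT32_TAS */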
Re: [PATCH] Native spinlock support on RISC-V
Marek Szuba writes: > Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV > Starlight beta board) - builds and installs fine, all tests pass. Cool ... I had hoped to acquire one of those myself, but I didn't make the cut. regards, tom lane
Re: [PATCH] Native spinlock support on RISC-V
Hi, On 2021-08-13 11:09:04 -0400, Tom Lane wrote: > Marek Szuba writes: > > Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV > > Starlight beta board) - builds and installs fine, all tests pass. Seems like a good idea to me. > Cool ... I had hoped to acquire one of those myself, but I didn't > make the cut. Should we backpatch this? It's not like we're going to break existing risc-v systems by enabling spinlock support... Greetings, Andres Freund
Re: Default to TIMESTAMP WITH TIME ZONE?
On Fri, Aug 13, 2021 at 07:09:15AM -0700, David G. Johnston wrote:
> On Fri, Aug 13, 2021 at 12:33 AM Gavin Flower wrote:
>
>> I always use the tz version, except when I forget.
>
> I find it nearly impossible for me to forget how this works. But that is
> probably because I just pretend that the standard, multi-word, data types
> don't even exist. It's not that "timestamp" defaults to "WITHOUT TIME ZONE"
> but rather that there are only two types in existence: timestamp and
> timestamptz. It's self-evident which one doesn't handle time zones.
>
>> Suspect that it is extremely rare when one would not want to use the TZ
>> option, which suggests to me that using TZ should be the default.
>
> I would agree, but at this point I'd leave it up to documentation and
> educational materials to teach/encourage, not runtime behaviors.

Yes, only using 'timestamptz' does fix it.

--
Bruce Momjian  https://momjian.us
EDB            https://enterprisedb.com

If only the physical world exists, free will is an illusion.
Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o
Hi, (dropping -committers to avoid moderation stalls due xposting to multiple lists - I find that more annoying than helpful) On 2021-08-13 14:38:37 +0530, Amit Kapila wrote: > > What I'm wondering is why it is a good idea to have a SharedFileSet specific > > cleanup mechanism. One that only operates on process lifetime level, rather > > than something more granular. I get that the of the files here needs to be > > longer than a transaction, but that can easily be addressed by having a > > longer > > lived resource owner. > > > > Process lifetime may work well for the current worker.c, but even there it > > doesn't seem optimal. One e.g. could easily imagine that we'd want to handle > > connection errors or configuration changes without restarting the worker, in > > which case process lifetime obviously isn't a good idea anymore. > > > > I don't deny that we can't make this at a more granular level. IIRC, > at that time, we tried to follow AtProcExit_Files which cleans up temp > files at proc exit and we needed something similar for temporary files > used via SharedFileSet. The comparison to AtProcExit_Files isn't convincing to me - normally temp files are cleaned up long before that via resowner cleanup. To me the reuse of SharedFileSet for worker.c as executed seems like a bad design. As things stand there's little code shared between dsm/non-dsm shared file sets. The cleanup semantics are entirely different. Several functions don't work if used on the "wrong kind" of set (e.g. SharedFileSetAttach()). > I think we can extend this API but I guess it is better to then do it > for dsm-based as well so that these get tracked via resowner. DSM segments are resowner managed already, so it's not obvious that that'd buy us much? Although I guess there could be a few edge cases that'd look cleaner, because we could reliably trigger cleanup in the leader instead of relying on dsm detach hooks + refcounts to manage when a set is physically deleted? Greetings, Andres Freund
Re: [PATCH] Native spinlock support on RISC-V
Andres Freund writes: > On 2021-08-13 11:09:04 -0400, Tom Lane wrote: >> Marek Szuba writes: >>> Tested against PostgreSQL 13.3 on a physical rv64gc system (BeagleV >>> Starlight beta board) - builds and installs fine, all tests pass. > Should we backpatch this? It's not like we're going to break existing > risc-v systems by enabling spinlock support... Yeah, why not? If you were building with --disable-spinlocks before, this shouldn't change anything for you. (I haven't actually looked at the patch, mind you, but in principle it shouldn't break anything that worked before.) regards, tom lane
[PATCH] Error out if SKIP LOCKED and WITH TIES are both specified
Both bugs #16676[1] and #17141[2] illustrate that the combination of SKIP LOCKED and FETCH FIRST WITH TIES breaks expectations when it comes to rows returned to other sessions accessing the same row. Since this situation is detectable from the syntax and hard to fix otherwise, forbid it for now, with the potential to fix it in the future. [1] https://www.postgresql.org/message-id/16676-fd62c3c835880da6%40postgresql.org [2] https://www.postgresql.org/message-id/17141-913d78b9675aac8e%40postgresql.org Proposed backpatch to 13. 0001-Error-out-if-SKIP-LOCKED-and-WITH-TIES-are-both-spec.patch Description: Binary data
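For reference, the combination being rejected looks roughly like the following (the job-queue table is hypothetical, not taken from either bug report):

    SELECT *
      FROM jobs
     ORDER BY priority DESC
     FETCH FIRST 5 ROWS WITH TIES
       FOR UPDATE SKIP LOCKED;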
Re: Default to TIMESTAMP WITH TIME ZONE?
On 8/12/21 7:25 PM, Simon Riggs wrote: > I thought we found that changing behavior via GUC usually ends badly. >> Yeah. Changing from SQL-spec to not-SQL-spec behavior is going to be >> one tough sell to begin with, even without the point that that's been >> our behavior for over two decades. But proposing to do it via a GUC >> is just not-even-worth-discussing territory. That would force every >> wannabe-portable program to cope with both behaviors; which would >> end up meaning that not only do you still have to take care to write >> WITH TIME ZONE when you want that, but *also* you'd have to be sure >> to write WITHOUT TIME ZONE when you want that. In short, the worst >> of both worlds. > All of which I agree with, but this wasn't a cute idea of mine, this > was what our users have requested because of the extreme annoyance > caused by the current behavior. What do other DBMSs do? This strikes me as primarily an education issue (I did a webinar on it not that long ago). If you want to protect against people using tz-less timestamp, maybe an event trigger would be a solution, although maybe that's using a sledgehammer to crack a nut. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
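A rough sketch of the event-trigger idea mentioned above, assuming one wants to reject any newly created or altered table column of type timestamp without time zone (all object names are illustrative, not part of any proposal):

    CREATE FUNCTION reject_tz_less_timestamp() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        obj record;
    BEGIN
        FOR obj IN
            SELECT * FROM pg_event_trigger_ddl_commands() WHERE object_type = 'table'
        LOOP
            -- look for any live column of type "timestamp without time zone"
            IF EXISTS (SELECT 1 FROM pg_attribute
                       WHERE attrelid = obj.objid
                         AND attnum > 0
                         AND NOT attisdropped
                         AND atttypid = 'timestamp'::regtype) THEN
                RAISE EXCEPTION 'use timestamptz instead of timestamp in %',
                                obj.object_identity;
            END IF;
        END LOOP;
    END;
    $$;

    CREATE EVENT TRIGGER no_tz_less_timestamp ON ddl_command_end
        WHEN TAG IN ('CREATE TABLE', 'ALTER TABLE', 'CREATE TABLE AS')
        EXECUTE FUNCTION reject_tz_less_timestamp();

This only guards DDL going forward; existing columns would need a separate catalog sweep.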
Re: Default to TIMESTAMP WITH TIME ZONE?
On Fri, 13 Aug 2021 at 17:23, Andrew Dunstan wrote: > What do other DBMSs do? I think MySQL defaults to WITH TIME ZONE, not sure, but I would bet a few others follow the standard. > This strikes me as primarily an education issue > (I did a webinar on it not that long ago) Yes, agreed. > If you want to protect against people using tz-less timestamp, maybe an > event trigger would be a solution, although maybe that's using a > sledgehammer to crack a nut. If you know about the issue, I guess you can fix it in lots of different ways. The issue is about those who don't know, and my patch didn't help them either. The only hope is to eventually change the default, so probably the best thing is to apply pressure via the SQL Std process. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Multiple Postgres process are running in background
On Fri, Aug 13, 2021 at 11:55 AM Ram Charan Kallem <ramcharan.kal...@non.se.com> wrote: > Hi, > > > > We are using Postgres 10 (single client) and observed that there are > multiple PostgreSQL server processes running in the background. > > Why are these additional processes created, or is this expected behavior? > This is normal and expected behavior. pgsql-hackers@ is not an appropriate place for such questions. regards, Ranier Vilela
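For anyone finding this thread later: the extra processes are the server's normal auxiliary processes (checkpointer, WAL writer, autovacuum launcher, and so on). One way to see what each of them is, from any connected session:

    -- backend_type is available in Postgres 10 and later
    SELECT pid, backend_type, state
      FROM pg_stat_activity;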
Re: Default to TIMESTAMP WITH TIME ZONE?
On Fri, Aug 13, 2021 at 9:28 AM Simon Riggs wrote: > > The only hope is to eventually change the default, so probably > the best thing is to apply pressure via the SQL Std process. > > Then there is no hope, because this makes the situation worse. If anything, I'd suggest the SQL standard should probably just admit this "default behavior of timestamp" is a bad idea and deprecate its existence. IOW, the only two standard-conforming syntaxes are the explicit WITH/WITHOUT TIME ZONE ones. Any database implementation that implements "timestamp" as a type alias is doing so in an implementation-dependent way. Code that wants to be portable and standard-conforming needs to use the explicit types. David J.
Re: Default to TIMESTAMP WITH TIME ZONE?
"David G. Johnston" writes: > On Fri, Aug 13, 2021 at 9:28 AM Simon Riggs > wrote: >> The only hope is to eventually change the default, so probably >> the best thing is to apply pressure via the SQL Std process. > Then there is no hope because this makes the situation worse. Agreed; the points I made upthread are just as valid if the change is made in the standard. But I'd be astonished if the SQL committee would consider such a change anyway. The one thing I could potentially see us doing is more strongly encouraging the use of the names "timestamp" and "timestamptz", up to and including changing what format_type() et al. put out. Yeah, "timestamptz" is not standard, but so what? At least it's not actually *contrary* to the standard, as the original proposal here is. (Also, while I hate to bring it up in this context, our timestamptz data type is not really compatible with the spec in the first place, so that a case could be made that this behavior is more honest/spec-compatible than what we do today.) If you wanted to be even more in people's faces about it, you could print the type names as "timestamptz" and "timestamp without time zone". regards, tom lane
Re: [PATCH] Native spinlock support on RISC-V
I wrote: > Andres Freund writes: >> Should we backpatch this? It's not like we're going to break existing >> risc-v systems by enabling spinlock support... > Yeah, why not? If you were building with --disable-spinlocks before, > this shouldn't change anything for you. > (I haven't actually looked at the patch, mind you, but in principle > it shouldn't break anything that worked before.) I now have looked at the patch, and it seems good as far as it goes, but I wonder whether some effort ought to be expended in src/include/port/atomics/. regards, tom lane
Re: [PATCH] Native spinlock support on RISC-V
Hi, On Fri, Aug 13, 2021, at 19:25, Tom Lane wrote: > I wrote: > > Andres Freund writes: > >> Should we backpatch this? It's not like we're going to break existing > >> risc-v systems by enabling spinlock support... > > > Yeah, why not? If you were building with --disable-spinlocks before, > > this shouldn't change anything for you. > > (I haven't actually looked at the patch, mind you, but in principle > > it shouldn't break anything that worked before.) > > I now have looked at the patch, and it seems good as far as it goes, > but I wonder whether some effort ought to be expended in > src/include/port/atomics/. That should automatically pick up the intrinsic. I think we should do the same on modern compilers for spinlocks, but that's a separate discussion I guess. Andres
Re: [PATCH] Native spinlock support on RISC-V
"Andres Freund" writes: > On Fri, Aug 13, 2021, at 19:25, Tom Lane wrote: >> I now have looked at the patch, and it seems good as far as it goes, >> but I wonder whether some effort ought to be expended in >> src/include/port/atomics/. > That should automatically pick up the intrinsic. I think we should do the > same on modern compilers for spinlocks, but that's a separate discussion I > guess. I was looking at the comment in atomics.h about * Provide a full fallback of the pg_*_barrier(), pg_atomic**_flag and * pg_atomic_* APIs for platforms without sufficient spinlock and/or atomics * support. In the case of spinlock backed atomics the emulation is expected * to be efficient, although less so than native atomics support. so it seems like someday we might want to expend some effort on native atomics. I agree that that day need not be today, though. This patch seems sufficient until we get to the point of (at least) having some RISC-V in the buildfarm. regards, tom lane
Re: [PATCH] Native spinlock support on RISC-V
Hi, On 2021-08-13 13:37:02 -0400, Tom Lane wrote: > "Andres Freund" writes: > > On Fri, Aug 13, 2021, at 19:25, Tom Lane wrote: > >> I now have looked at the patch, and it seems good as far as it goes, > >> but I wonder whether some effort ought to be expended in > >> src/include/port/atomics/. > > > That should automatically pick up the intrinsic. I think we should do the > > same on modern compilers for spinlocks, but that's a separate discussion I > > guess. > > I was looking at the comment in atomics.h about > > * Provide a full fallback of the pg_*_barrier(), pg_atomic**_flag and > * pg_atomic_* APIs for platforms without sufficient spinlock and/or atomics > * support. In the case of spinlock backed atomics the emulation is expected > * to be efficient, although less so than native atomics support. > > so it seems like someday we might want to expend some effort on native > atomics. I agree that that day need not be today, though. This patch > seems sufficient until we get to the point of (at least) having some > RISC-V in the buildfarm. For gcc-compatible compilers the relevant comments would be * There exist generic, hardware independent, implementations for several * compilers which might be sufficient, although possibly not optimal, for a * new platform. If no such generic implementation is available spinlocks (or * even OS provided semaphores) will be used to implement the API. and /* * Compiler specific, but architecture independent implementations. * * Provide architecture independent implementations of the atomic * facilities. At the very least compiler barriers should be provided, but a * full implementation of * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier() * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32() * using compiler intrinsics are a good idea. */ Gcc's intrinsics are pretty good these days (and if not, a *lot* of projects just outright break). What's more, it turns out that using intrinsics with compilers of the last ~5 years often generates *better* code than inline assembly (e.g. because the compiler can utilize condition codes more effectively and has more detailed information about clobbers). So for new platforms that'll only be supported by new compilers it doesn't really make sense to add inline assembler imo. Greetings, Andres Freund
Re: [PATCH] Native spinlock support on RISC-V
Andres Freund writes: > On 2021-08-13 13:37:02 -0400, Tom Lane wrote: >> so it seems like someday we might want to expend some effort on native >> atomics. I agree that that day need not be today, though. This patch >> seems sufficient until we get to the point of (at least) having some >> RISC-V in the buildfarm. > Gcc's intriniscs are pretty good these days (and if not, a *lot* of > projects just outright break). > What's more, It turns out that using intrinsics with compilers of the > last ~5 years often generates *better* code than inline assembly > (e.g. because the compiler can utilize condition codes more effectively > and has more detailed information about clobbers). So for new platforms > that'll only have support by new compilers it doesn't really make sense > to add inline assembler imo. I didn't say it had to be __asm blocks ;-). I was thinking more of the sort of stuff we have in e.g. arch-arm.h and arch-ia64.h, where we know some things about what is efficient or less efficient on a particular architecture. gcc will do its best to provide implementations of its builtins, but that doesn't mean that using a particular one of them is always the most sane thing to do. But anyway, that seems like minor optimization that maybe someday somebody will be motivated to do. We're on the same page about this being enough for today. I did not like confusing the RISC-V case with the ARM case: duplicating the code block seems better, in case there's ever reason to add arch-specific stuff like SPIN_DELAY. So I split it off to its own code block and pushed it. regards, tom lane
Re: Default to TIMESTAMP WITH TIME ZONE?
On Friday, August 13, 2021, Tom Lane wrote: > > The one thing I could potentially see us doing is more strongly > encouraging the use of the names "timestamp" and "timestamptz", > up to and including changing what format_type() et al. put out. +1. Having the canonical form be timestamptz would make pretending the “with time zone” version doesn’t exist much easier. David J.
Re: Autovacuum on partitioned table (autoanalyze)
Here is a proposal for 14. This patch has four main changes: * The mod counts are only propagated to the topmost parent, not to each ancestor. This means that we'll only analyze the topmost partitioned table and not each intermediate partitioned table; seems a good compromise to avoid sampling all partitions multiple times per round. * One pgstat message is sent containing many partition/parent pairs, not just one. This reduces the number of messages sent. 123 partitions fit in one message (messages are 1000 bytes). This is done once per autovacuum worker run, so it shouldn't be too bad. * There's a sleep between sending the message and re-reading stats. It would be great to have a mechanism by which pgstat collector says "I've received and processed up to this point", but we don't have that; what we can do is sleep PGSTAT_STAT_INTERVAL and then reread the file, so we're certain that the file we read is at least as new as that time. This is far longer than it takes to process the messages. Note that if the messages do take longer than that to be processed by the collector, it's not a big loss anyway; those tables will be processed by the next autovacuum run. * I changed vacuum_expand_rel to put the main-rel OID at the end. (This is a variation of Horiguchi-san proposed patch; instead of making the complete list be in the opposite order, it's just that one OID that appears at the other end). This has the same effect as his patch: any error reports thrown by vacuum/analyze mention the first partition rather than the main table. This part is in 0002 and I'm not totally convinced it's a sane idea. Minor changes: * I reduced autovacuum from three passes over pg_class to two passes, per your observation that we can acquire toast association together with processing partitions, and then use that in the second pass to collect everything. * I moved the catalog-accessing code to partition.c, so we don't need to have pgstat.c doing it. Some doc changes are pending, and some more commentary in parts of the code, but I think this is much more sensible. I do lament the lack of a syscache for pg_inherits.From 3e904de5f15cfc69692ad2aea64c0034445d957e Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 10 Aug 2021 13:05:59 -0400 Subject: [PATCH 1/2] Propagate counts up only to topmost ancestor Ignore intermediate partitions, to avoid redundant sampling of partitions. If needed, those intermediate partitions can be analyzed manually. --- src/backend/catalog/partition.c | 53 +++ src/backend/commands/analyze.c | 3 +- src/backend/postmaster/autovacuum.c | 222 ++-- src/backend/postmaster/pgstat.c | 60 src/include/catalog/partition.h | 1 + src/include/pgstat.h| 21 ++- 6 files changed, 211 insertions(+), 149 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 790f4ccb92..017d5ba5a2 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -26,6 +26,7 @@ #include "nodes/makefuncs.h" #include "optimizer/optimizer.h" #include "partitioning/partbounds.h" +#include "pgstat.h" #include "rewrite/rewriteManip.h" #include "utils/fmgroids.h" #include "utils/partcache.h" @@ -166,6 +167,58 @@ get_partition_ancestors_worker(Relation inhRel, Oid relid, List **ancestors) get_partition_ancestors_worker(inhRel, parentOid, ancestors); } +/* + * Inform pgstats collector about the topmost ancestor of + * each of the given partitions. 
+ */ +void +partition_analyze_report_ancestors(List *partitions) +{ + List *report_parts = NIL; + List *ancestors = NIL; + Relation inhRel; + ListCell *lc; + + inhRel = table_open(InheritsRelationId, AccessShareLock); + + /* + * Search pg_inherits for the topmost ancestor of each given partition, + * and if found, store both their OIDs in lists. + * + * By the end of this loop, partitions and ancestors are lists to be + * read in parallel, where the i'th element of ancestors is the topmost + * ancestor of the i'th element of partitions. + */ + foreach(lc, partitions) + { + Oid partition_id = lfirst_oid(lc); + Oid cur_relid; + + cur_relid = partition_id; + for (;;) + { + bool detach_pending; + Oid parent_relid; + + parent_relid = get_partition_parent_worker(inhRel, cur_relid, + &detach_pending); + if ((!OidIsValid(parent_relid) || detach_pending) && +cur_relid != partition_id) + { +report_parts = lappend_oid(report_parts, partition_id); +ancestors = lappend_oid(ancestors, cur_relid); +break; + } + + cur_relid = parent_relid; + } + } + + table_close(inhRel, AccessShareLock); + + pgstat_report_anl_ancestors(report_parts, ancestors); +} + /* * index_get_partition * Return the OID of index of the given partition that is a child diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 0099a04bbe..c930e3e3cd 100644
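To illustrate the behavior change described in the proposal above with a concrete (hypothetical) three-level hierarchy: row changes in a leaf partition are credited to the topmost parent only, and an intermediate partitioned table can still be analyzed by hand:

    CREATE TABLE m (ts timestamptz, region text, v int) PARTITION BY LIST (region);
    CREATE TABLE m_eu PARTITION OF m FOR VALUES IN ('eu') PARTITION BY RANGE (ts);
    CREATE TABLE m_eu_2021 PARTITION OF m_eu
        FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

    -- autoanalyze would consider only "m"; the intermediate level is opt-in:
    ANALYZE m_eu;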
Re: [UNVERIFIED SENDER] Re: [bug] Logical Decoding of relation rewrite with toast does not reset toast_hash
Hi Drouvot, I don't see extra data in your output and it looks like your copy/paste is missing some content, no? On my side, that looks good and here is what I get with the patch applied: I ran the test again, now I got the same output as yours, and it looks good to me. (The issue I mentioned in my previous email was caused by my console output.) Thank you, -- David Software Engineer Highgo Software Inc. (Canada) www.highgo.ca
Re: Changes to recovery_min_apply_delay are ignored while waiting for delay
Hey Michael, Really appreciate the review! On Wed, Aug 11, 2021 at 12:40 AM Michael Paquier wrote: > Agreed that the current behavior is confusing. As you are using the > commit timestamp for the comparison, this is right. One small-ish > comment I have about the code is that we should mention > recovery_min_apply_delay for HandleStartupProcInterrupts(), and not > only the trigger file location. Cool, updated. > +# First, set the delay for the next commit to some obscenely high value. > It has no need to be obscenely high, just high enough to give the time > to slow machines to catch the wait event lookup done. So this could > use better words to explain this choice. Sounds good. Done. > Anyway, it seems to me that something is incorrect in this new test > (manual tests pass here, the TAP test is off). One thing we need to > make sure for any new test added is that it correctly breaks if the > fix proposed is *not* in place. And as far as I can see, the test > passes even if the recalculation of delayUntil is done within the loop > that reloads the configuration. The thing may be a bit tricky to make > reliable as the parameter reloading may cause wait_event to not be > RecoveryApplyDelay, so we should have at least a check based on a scan > of pg_stat_replication.replay_lsn on the primary. Perhaps you have > an other idea? Hmm, please see attached v4 which just shifts the test to the middle, like v1. When I run the test without the code change, the test hangs as expected in: # Now the standby should catch up. $node_primary->wait_for_catchup('standby', 'replay'); and passes with the code change, as expected. I can't explain why the test doesn't freeze up in v3 in wait_for_catchup() at the end. As for checking on the wait event, since we only signal the standby after performing the check, that should be fine. Nothing else would be sending a SIGHUP before the check. Is that assumption correct? > When using wait_for_catchup(), I would recommend to list "replay" for > this test, even if that's the default mode, to make clear what the > intention is. Done. Regards, Soumyadeep (VMware) From 824076a977af0e40da1c7eb9e4aeac9a8e81a7ee Mon Sep 17 00:00:00 2001 From: Soumyadeep Chakraborty Date: Mon, 2 Aug 2021 20:50:37 -0700 Subject: [PATCH v4 1/1] Refresh delayUntil in recovery delay loop This ensures that the wait interval in the loop is correctly recalculated with the updated value of recovery_min_apply_delay. Now, if one changes the GUC while we are waiting, those changes will take effect. Practical applications include instantly cancelling a long wait time by setting recovery_min_apply_delay to 0. This can be useful in tests. 
--- src/backend/access/transam/xlog.c | 13 +--- src/test/recovery/t/005_replay_delay.pl | 41 ++--- 2 files changed, 46 insertions(+), 8 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d0ec6a834be..74ad7ff905b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record) if (!getRecordTimestamp(record, &xtime)) return false; - delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay); - /* * Exit without arming the latch if it's already past time to apply this * record */ + delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay); msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil); if (msecs <= 0) return false; @@ -6248,14 +6247,20 @@ recoveryApplyDelay(XLogReaderState *record) { ResetLatch(&XLogCtl->recoveryWakeupLatch); - /* might change the trigger file's location */ + /* might change the trigger file's location and recovery_min_apply_delay */ HandleStartupProcInterrupts(); if (CheckForStandbyTrigger()) break; /* - * Wait for difference between GetCurrentTimestamp() and delayUntil + * Recalculate delayUntil as recovery_min_apply_delay could have changed + * since last time. + */ + delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay); + + /* + * Wait for difference between GetCurrentTimestamp() and delayUntil. */ msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil); diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl index 0b56380e0a7..ad8093df41a 100644 --- a/src/test/recovery/t/005_replay_delay.pl +++ b/src/test/recovery/t/005_replay_delay.pl @@ -7,7 +7,7 @@ use warnings; use PostgresNode; use TestLib; -use Test::More tests => 3; +use Test::More tests => 4; # Initialize primary node my $node_primary = PostgresNode->new('primary'); @@ -56,6 +56,39 @@ $node_standby->poll_query_until('postgres', ok(time() - $primary_insert_time >= $delay, "standby applies WAL only after replication delay"); +# Now test to see if updates to recovery_min_apply_delay apply
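For context, the "instantly cancelling a long wait" case from the commit message boils down to running the following on the delayed standby; with the recalculation in place, the startup process picks up the new value on the next loop iteration instead of sleeping out the originally computed delay:

    ALTER SYSTEM SET recovery_min_apply_delay = 0;
    SELECT pg_reload_conf();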