Re: Remove a unused argument from qual_is_pushdown_safe
On Mon, Nov 28, 2022 at 3:40 PM Michael Paquier wrote: > > On Fri, Nov 25, 2022 at 04:05:13PM +0800, Richard Guo wrote: > >> I wonder if we need to revise the comment atop qual_is_pushdown_safe() > >> too which says > >> > >> * rinfo is a restriction clause applying to the given subquery (whose > RTE > >> * has index rti in the parent query). > >> > >> since there is no 'given subquery' after we remove it from the params. > > I was thinking about this point, and it seems to me that we could just > do s/the given subquery/a subquery/. But perhaps you have a different > view on the matter? I think the new wording is good. Thanks for the change. Thanks Richard
RE: Avoid streaming the transaction which are skipped (in corner cases)
On Sun, Nov 27, 2022 1:33 PM Dilip Kumar wrote: > > On Sat, Nov 26, 2022 at 12:15 PM Amit Kapila > wrote: > > > > On Fri, Nov 25, 2022 at 5:38 PM Amit Kapila > wrote: > > > > > > On Fri, Nov 25, 2022 at 1:35 PM Dilip Kumar > wrote: > > > > > > > > During DecodeCommit() for skipping a transaction we use ReadRecPtr to > > > > check whether to skip this transaction or not. Whereas in > > > > ReorderBufferCanStartStreaming() we use EndRecPtr to check whether > to > > > > stream or not. Generally it will not create a problem but if the > > > > commit record itself is adding some changes to the transaction(e.g. > > > > snapshot) and if the "start_decoding_at" is in between ReadRecPtr and > > > > EndRecPtr then streaming will decide to stream the transaction where > > > > as DecodeCommit will decide to skip it. And for handling this case in > > > > ReorderBufferForget() we call stream_abort(). > > > > > > > > > > The other cases are probably where we don't have FilterByOrigin or > > > dbid check, for example, > XLOG_HEAP2_NEW_CID/XLOG_XACT_INVALIDATIONS. > > > We anyway actually don't send anything for such cases except empty > > > start/stop messages. Can we add some flag to txn which says that there > > > is at least one change like DML that we want to stream? > > > > > > > We can probably think of using txn_flags for this purpose. > > In the attached patch I have used txn_flags to identify whether it has > any streamable change or not and the transaction will not be selected > for streaming unless it has at least one streamable change. > Thanks for your patch. I saw that the patch added a check when selecting largest transaction, but in addition to ReorderBufferCheckMemoryLimit(), the transaction can also be streamed in ReorderBufferProcessPartialChange(). Should we add the check in this function, too? 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9a58c4bfb9..108737b02f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -768,7 +768,8 @@ ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 */
 	if (ReorderBufferCanStartStreaming(rb) &&
 		!(rbtxn_has_partial_change(toptxn)) &&
-		rbtxn_is_serialized(txn))
+		rbtxn_is_serialized(txn) &&
+		rbtxn_has_streamable_change(txn))
 		ReorderBufferStreamTXN(rb, toptxn);
 }

Regards,
Shi yu
Re: Logical Replication Custom Column Expression
On Fri, Nov 25, 2022 at 9:43 PM Stavros Koureas wrote: > > Yes, if the property is on the subscription side then it should be applied > for all the tables that the connected publication is exposing. > So if the property is enabled you should be sure that this origin column > exists to all of the tables that the publication is exposing... > > Sure this is the complete idea, that the subscriber should match the PK of > origin, > As the subscription table will contain same key values from different > origins, for example: > > For publisher1 database table > id pk integer | value character varying > 1 | testA1 > 2 | testA2 > > For publisher2 database table > id pk integer | value character varying > 1 | testB1 > 2 | testB2 > > For subscriber database table > origin pk character varying | id pk integer | value character varying > publisher1 | 1 | testA1 > publisher1 | 2 | testA2 > publisher2 | 1 | testB1 > publisher2 | 2 | testB2 > > All statements INSERT, UPDATE, DELETE should always include the predicate of > the origin. > This sounds similar to what I had posted [1] although I was saying the generated column value might be the *subscriber* name, not the origin publisher name. (where are you getting that value from -- somehow from the subscriptions' CONNECTION dbname?) Anyway, regardless of the details, please note -- my idea was really intended just as a discussion starting point to demonstrate that required functionality might be achieved using a simpler syntax than what had been previously suggested. But in practice there may be some problems with this approach -- e.g. how will the initial tablesync COPY efficiently assign these subscriber name column values? -- [1] https://www.postgresql.org/message-id/CAHut%2BPuZowXd7Aa7t0nqjP6afHMwJarngzeMq%2BQP0vE2KKLOgQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia.
Failed Assert while pgstat_unlink_relation
Hi,

While reviewing/testing one of the patches I found the following Assert:
#0 0x55c6312ba639 in pgstat_unlink_relation (rel=0x7fb73bcbac58) at pgstat_relation.c:161
#1 0x55c6312bbb5a in pgstat_relation_delete_pending_cb (entry_ref=0x55c6335563a8) at pgstat_relation.c:839
#2 0x55c6312b72bc in pgstat_delete_pending_entry (entry_ref=0x55c6335563a8) at pgstat.c:1124
#3 0x55c6312be3f1 in pgstat_release_entry_ref (key=..., entry_ref=0x55c6335563a8, discard_pending=true) at pgstat_shmem.c:523
#4 0x55c6312bee9a in pgstat_drop_entry (kind=PGSTAT_KIND_RELATION, dboid=5, objoid=40960) at pgstat_shmem.c:867
#5 0x55c6312c034a in AtEOXact_PgStat_DroppedStats (xact_state=0x55c6334baac8, isCommit=false) at pgstat_xact.c:97
#6 0x55c6312c0240 in AtEOXact_PgStat (isCommit=false, parallel=false) at pgstat_xact.c:55
#7 0x55c630df8bee in AbortTransaction () at xact.c:2861
#8 0x55c630df94fd in AbortCurrentTransaction () at xact.c:3352

I could reproduce this issue with the following steps:
create table t1(c1 int);
BEGIN;
CREATE TABLE t (c1 int);
CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1;
CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1;

Regards,
Vignesh
Re: Understanding, testing and improving our Windows filesystem code
On Mon, Nov 28, 2022 at 8:58 PM Ian Lawrence Barwick wrote: > For my understanding, does this entry supersede the proposal in > https://commitfest.postgresql.org/40/3347/ ? I think so (Victor hasn't commented). Patch 0004 derives from Victor's patch and has him as primary author still, but I made some changes: * remove obsolete version check code * provide fallback code for systems where it doesn't work (after some research to determine that there are such systems, and what they do) * test that it's really more POSIX-like and demonstrate what that means (building on 0003) Patch 0003 is a set of file system semantics tests that work on Unix, but also exercise those src/port/*.c wrappers on Windows and show differences from Unix semantics. Some of these tests also verify various bugfixes already committed, so they've been pretty useful to me already even though they aren't in the tree yet. Patches 0001 and 0002 are generic, unrelated to this Windows stuff, and provide a simple way to write unit tests for small bits of C code without a whole PostgreSQL server. That's something that has been proposed in the abstract many times before by many people. Here I've tried to be minimalist about it, just what I needed for the higher-numbered patches, building on existing technologies (TAP).
Re: Report roles in pg_upgrade pg_ prefix check
> On 28 Nov 2022, at 02:18, Michael Paquier wrote:
>
> On Thu, Nov 24, 2022 at 12:31:09PM +0100, Daniel Gustafsson wrote:
>> Looking at a recent pg_upgrade thread I happened to notice that the check for
>> roles with a pg_ prefix only reports the error, not the roles it found. Other
>> similar checks where the user is expected to alter the old cluster typically
>> report the found objects in a textfile. The attached adds reporting to make
>> that class of checks consistent (the check for prepared transactions, which
>> also isn't reporting, is different IMO as it doesn't expect ALTER commands).
>>
>> As this check is only executed against the old cluster, the patch removes the
>> check when printing the error.
>
> +1. A backpatch would be nice, though not strictly mandatory as
> that's not a bug fix.

Yeah, it doesn't really qualify since this is not a bugfix.

> + ntups = PQntuples(res);
> + i_rolname = PQfnumber(res, "rolname");
>
> Would it be worth adding the OID on top of the role name in the
> generated report? That would be a free meal.

We are a bit inconsistent in how much detail we include in the report textfiles, so we could do that without breaking any consistency in reporting. Looking at other checks, the below format would match what we already do fairly well:

(oid=)

Done in the attached.

--
Daniel Gustafsson
https://vmware.com/

v2-0001-Report-incompatible-roles-in-pg_upgrade-checking.patch
Description: Binary data
Failed Assert in pgstat_assoc_relation
Hi,

While reviewing/testing one of the patches I found the following Assert:
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139624429171648) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=139624429171648) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139624429171648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x7efcda6e3476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x7efcda6c97f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x5590bf283139 in ExceptionalCondition (conditionName=0x5590bf468170 "rel->pgstat_info->relation == NULL", fileName=0x5590bf46812b "pgstat_relation.c", lineNumber=143) at assert.c:66
#6 0x5590bf0ce5f8 in pgstat_assoc_relation (rel=0x7efcce996a48) at pgstat_relation.c:143
#7 0x5590beb83046 in initscan (scan=0x5590bfbf4af8, key=0x0, keep_startblock=false) at heapam.c:343
#8 0x5590beb8466f in heap_beginscan (relation=0x7efcce996a48, snapshot=0x5590bfb5a520, nkeys=0, key=0x0, parallel_scan=0x0, flags=449) at heapam.c:1223
#9 0x5590bf02af39 in table_beginscan (rel=0x7efcce996a48, snapshot=0x5590bfb5a520, nkeys=0, key=0x0) at ../../../src/include/access/tableam.h:891
#10 0x5590bf02bf8a in DefineQueryRewrite (rulename=0x5590bfb281d0 "_RETURN", event_relid=16387, event_qual=0x0, event_type=CMD_SELECT, is_instead=true, replace=false, action=0x5590bfbf4aa8) at rewriteDefine.c:447
#11 0x5590bf02b5ab in DefineRule (stmt=0x5590bfb285c0, queryString=0x5590bfb277a8 "CREATE RULE \"_RETURN\" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1;") at rewriteDefine.c:213

I could reproduce this issue with the following steps:
create table t1(c int);
BEGIN;
CREATE TABLE t (c int);
SAVEPOINT q;
CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1;
select * from t;
ROLLBACK TO q;
CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1;
ROLLBACK;

Regards,
Vignesh
Re: Avoid streaming the transaction which are skipped (in corner cases)
On Mon, Nov 28, 2022 at 1:46 PM shiy.f...@fujitsu.com wrote: > > Thanks for your patch. > > I saw that the patch added a check when selecting largest transaction, but in > addition to ReorderBufferCheckMemoryLimit(), the transaction can also be > streamed in ReorderBufferProcessPartialChange(). Should we add the check in > this function, too? > > diff --git a/src/backend/replication/logical/reorderbuffer.c > b/src/backend/replication/logical/reorderbuffer.c > index 9a58c4bfb9..108737b02f 100644 > --- a/src/backend/replication/logical/reorderbuffer.c > +++ b/src/backend/replication/logical/reorderbuffer.c > @@ -768,7 +768,8 @@ ReorderBufferProcessPartialChange(ReorderBuffer *rb, > ReorderBufferTXN *txn, > */ > if (ReorderBufferCanStartStreaming(rb) && > !(rbtxn_has_partial_change(toptxn)) && > - rbtxn_is_serialized(txn)) > + rbtxn_is_serialized(txn) && > + rbtxn_has_streamable_change(txn)) > ReorderBufferStreamTXN(rb, toptxn); > } You are right we need this in ReorderBufferProcessPartialChange() as well. I will fix this in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Bug in row_number() optimization
On 28.11.2022 03:23, David Rowley wrote:
> On Sat, 26 Nov 2022 at 05:19, Tom Lane wrote:
>> Sergey Shinderuk writes:
>>> What about user-defined operators? I created my own <= operator for
>>> int8 which returns true on null input, and put it in a btree operator
>>> class. Admittedly, it's weird that (null <= 1) evaluates to true. But
>>> does it violate the contract of the btree operator class or something?
>>> Didn't find a clear answer in the docs.
>>
>> It's pretty unlikely that this would work during an actual index scan.
>> I'm fairly sure that btree (and other index AMs) have hard-wired
>> assumptions that index operators are strict.
>
> If we're worried about that then we could just restrict this
> optimization to only work with strict quals.

Not sure this is necessary if btree operators must be strict anyway.

> The proposal to copy the datums into the query context does not seem to
> me to be a good idea. If there are a large number of partitions then it
> sounds like we'll leak lots of memory. We could invent some partition
> context that we reset after each partition, but that's probably more
> complexity than it would be worth.

Ah, good point.

> I've attached a draft patch to move the code to nullify the aggregate
> results so that's only done once per partition and adjusted the planner
> to limit this to strict quals.

Not quite sure that we don't need to do anything for the
WINDOWAGG_PASSTHROUGH_STRICT case. Although we won't return any more
tuples for the current partition, we still call ExecProject with
dangling pointers. Is it okay?

+ if (!func_strict(opexpr->opfuncid))
+ return false;

Should return true instead?

--
Sergey Shinderuk
https://postgrespro.com/
[PATCH] Check snapshot argument of index_beginscan and family
Hi hackers,

A colleague of mine (cc'ed) reported that he was able to pass a NULL snapshot to index_beginscan() and it even worked to a certain degree.

I took my toy extension [1] and replaced the argument with NULL as an experiment:

```
eax=# CREATE EXTENSION experiment;
CREATE EXTENSION
eax=# SELECT phonebook_lookup_index('Alice');
 phonebook_lookup_index
------------------------
                     -1
(1 row)

eax=# SELECT phonebook_insert('Bob', 456);
 phonebook_insert
------------------
                1
(1 row)

eax=# SELECT phonebook_lookup_index('Alice');
 phonebook_lookup_index
------------------------
                     -1
(1 row)

eax=# SELECT phonebook_insert('Alice', 123);
 phonebook_insert
------------------
                2
(1 row)

eax=# SELECT phonebook_lookup_index('Alice');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```

So evidently it really works as long as the index doesn't find any matching rows.

This could be really confusing for the extension authors so here is a patch that adds corresponding Asserts().

[1]: https://github.com/afiskon/postgresql-extensions/tree/main/005-table-access

--
Best regards,
Aleksander Alekseev

v1-0001-Check-snapshot-argument-of-index_beginscan-and-fa.patch
Description: Binary data
Re: [PATCH] Check snapshot argument of index_beginscan and family
Hi, Alexander!

> A colleague of mine (cc'ed) reported that he was able to pass a NULL
> snapshot to index_beginscan() and it even worked to a certain degree.
>
> I took my toy extension [1] and replaced the argument with NULL as an
> experiment:
>
> ```
> eax=# CREATE EXTENSION experiment;
> CREATE EXTENSION
> eax=# SELECT phonebook_lookup_index('Alice');
>  phonebook_lookup_index
> ------------------------
>                      -1
> (1 row)
>
> eax=# SELECT phonebook_insert('Bob', 456);
>  phonebook_insert
> ------------------
>                 1
> (1 row)
>
> eax=# SELECT phonebook_lookup_index('Alice');
>  phonebook_lookup_index
> ------------------------
>                      -1
> (1 row)
>
> eax=# SELECT phonebook_insert('Alice', 123);
>  phonebook_insert
> ------------------
>                 2
> (1 row)
>
> eax=# SELECT phonebook_lookup_index('Alice');
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> ```
>
> So evidently it really works as long as the index doesn't find any
> matching rows.
>
> This could be really confusing for the extension authors so here is a
> patch that adds corresponding Asserts().
>
> [1]: https://github.com/afiskon/postgresql-extensions/tree/main/005-table-access

I think it's a nice catch and worth fixing. The one thing I don't agree with is using asserts to handle the error that can appear, because most probably the server is built with assertions off, and in that case there will still be a crash. I'd do this with report ERROR. Otherwise, the patch looks right and worth committing.

Kind regards,
Pavel Borisov.
Re: Failed Assert in pgstat_assoc_relation
On Mon, Nov 28, 2022 at 8:10 PM vignesh C wrote: > > Hi, > > While reviewing/testing one of the patches I found the following Assert: > #0 __pthread_kill_implementation (no_tid=0, signo=6, > threadid=139624429171648) at ./nptl/pthread_kill.c:44 > #1 __pthread_kill_internal (signo=6, threadid=139624429171648) at > ./nptl/pthread_kill.c:78 > #2 __GI___pthread_kill (threadid=139624429171648, > signo=signo@entry=6) at ./nptl/pthread_kill.c:89 > #3 0x7efcda6e3476 in __GI_raise (sig=sig@entry=6) at > ../sysdeps/posix/raise.c:26 > #4 0x7efcda6c97f3 in __GI_abort () at ./stdlib/abort.c:79 > #5 0x5590bf283139 in ExceptionalCondition > (conditionName=0x5590bf468170 "rel->pgstat_info->relation == NULL", > fileName=0x5590bf46812b "pgstat_relation.c", lineNumber=143) at > assert.c:66 > #6 0x5590bf0ce5f8 in pgstat_assoc_relation (rel=0x7efcce996a48) > at pgstat_relation.c:143 > #7 0x5590beb83046 in initscan (scan=0x5590bfbf4af8, key=0x0, > keep_startblock=false) at heapam.c:343 > #8 0x5590beb8466f in heap_beginscan (relation=0x7efcce996a48, > snapshot=0x5590bfb5a520, nkeys=0, key=0x0, parallel_scan=0x0, > flags=449) at heapam.c:1223 > #9 0x5590bf02af39 in table_beginscan (rel=0x7efcce996a48, > snapshot=0x5590bfb5a520, nkeys=0, key=0x0) at > ../../../src/include/access/tableam.h:891 > #10 0x5590bf02bf8a in DefineQueryRewrite (rulename=0x5590bfb281d0 > "_RETURN", event_relid=16387, event_qual=0x0, event_type=CMD_SELECT, > is_instead=true, replace=false, action=0x5590bfbf4aa8) > at rewriteDefine.c:447 > #11 0x5590bf02b5ab in DefineRule (stmt=0x5590bfb285c0, > queryString=0x5590bfb277a8 "CREATE RULE \"_RETURN\" AS ON SELECT TO t > DO INSTEAD SELECT * FROM t1;") at rewriteDefine.c:213 > > I could reproduce this issue with the following steps: > create table t1(c int); > BEGIN; > CREATE TABLE t (c int); > SAVEPOINT q; > CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1; > select * from t; > ROLLBACK TO q; > CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * 
FROM t1;
> ROLLBACK;
>
> Regards,
> Vignesh

I think what is happening here is that the previous relation is not unlinked when pgstat_init_relation() is called, because the relation is now a view, and for relations without storage the relation is not unlinked in pgstat_init_relation():

void
pgstat_init_relation(Relation rel)
{
	char		relkind = rel->rd_rel->relkind;

	/*
	 * We only count stats for relations with storage and partitioned tables
	 */
	if (!RELKIND_HAS_STORAGE(relkind) && relkind != RELKIND_PARTITIONED_TABLE)
	{
		rel->pgstat_enabled = false;
		rel->pgstat_info = NULL;
		return;
	}

There is logic in DefineQueryRewrite() which converts a relation to a view when you create such a rule, as the test case does. So initially the relation had storage and pgstat_info was linked; then the table was converted to a view, but on init the previous relation was not unlinked, and when it later tries to link a new relation, the assert fails, saying a previous relation is already linked to pgstat_info.

I have made a small patch with a fix, but I am not sure if this is the right way to fix this.

regards,
Ajin Cherian
Fujitsu Australia

pgstat_assoc_fix_for_views.patch
Description: Binary data
Re: Another multi-row VALUES bug
On Wed, 23 Nov 2022 at 18:56, Tom Lane wrote: > > I wonder if somehow we could just make one pass over > all the VALUES RTEs, and process each one as needed? The problem > is to identify the relevant target relation, I guess. > I have been thinking about that some more, but I think it would be pretty difficult to achieve. Part of the problem is that the targetlist processing and VALUES RTE processing are quite closely coupled (because of things like GENERATED ALWAYS columns). Both rewriteTargetListIU() and rewriteValuesRTE() rely on being passed the VALUES RTE that the targetlist is reading from, and rewriteValuesRTE() then relies on extra information returned by rewriteTargetListIU(). Also, there's the way that DEFAULTs from updatable views work, which means that the DEFAULTs in a VALUES RTE won't necessarily all come from the same target relation. So I think it would be much harder to do the VALUES RTE processing anywhere other than where it's being done right now, and even if it could be done elsewhere, it would be a very invasive change, and therefore hard to back-patch. That, of course, leaves the problem of identifying the right VALUES RTE to process. A different way to do this, without relying on the contents of the targetlist, is to note that, while processing a product query, what we really want to do is ignore any VALUES RTEs from the original query, since they will have already been processed. There should then never be more than one VALUES RTE left to process -- the one from the rule action. This can be done by exploiting the fact that in product queries, the rtable always consists of the rtable from the original query followed by the rtable from the rule action, so we just need to ignore the right number of RTEs at the start of the rtable. 
Of course that would break if we ever changed the way rewriteRuleAction() worked, but at least it only depends on that one other place in the code, which has been stable for a long time, so the risk of future breakage seems managable. Regards, Dean diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c new file mode 100644 index fb0c687..1e3efbb --- a/src/backend/rewrite/rewriteHandler.c +++ b/src/backend/rewrite/rewriteHandler.c @@ -418,6 +418,10 @@ rewriteRuleAction(Query *parsetree, * NOTE: because planner will destructively alter rtable, we must ensure * that rule action's rtable is separate and shares no substructure with * the main rtable. Hence do a deep copy here. + * + * Note also that RewriteQuery() relies on the fact that RT entries from + * the original query appear at the start of the expanded rtable, so + * beware of changing this. */ sub_action->rtable = list_concat(copyObject(parsetree->rtable), sub_action->rtable); @@ -3622,9 +3626,13 @@ rewriteTargetView(Query *parsetree, Rela * * rewrite_events is a list of open query-rewrite actions, so we can detect * infinite recursion. + * + * orig_rt_length is the length of the originating query's rtable, for product + * queries created by fireRules(), and 0 otherwise. This is used to skip any + * already-processed VALUES RTEs from the original query. 
*/ static List * -RewriteQuery(Query *parsetree, List *rewrite_events) +RewriteQuery(Query *parsetree, List *rewrite_events, int orig_rt_length) { CmdType event = parsetree->commandType; bool instead = false; @@ -3648,7 +3656,7 @@ RewriteQuery(Query *parsetree, List *rew if (ctequery->commandType == CMD_SELECT) continue; - newstuff = RewriteQuery(ctequery, rewrite_events); + newstuff = RewriteQuery(ctequery, rewrite_events, 0); /* * Currently we can only handle unconditional, single-statement DO @@ -3722,6 +3730,7 @@ RewriteQuery(Query *parsetree, List *rew RangeTblEntry *rt_entry; Relation rt_entry_relation; List *locks; + int product_orig_rt_length; List *product_queries; bool hasUpdate = false; int values_rte_index = 0; @@ -3743,23 +3752,30 @@ RewriteQuery(Query *parsetree, List *rew */ if (event == CMD_INSERT) { + ListCell *lc2; RangeTblEntry *values_rte = NULL; /* - * If it's an INSERT ... VALUES (...), (...), ... there will be a - * single RTE for the VALUES targetlists. + * Test if it's a multi-row INSERT ... VALUES (...), (...), ... by + * looking for a VALUES RTE in the fromlist. For product queries, + * we must ignore any already-processed VALUES RTEs from the + * original query. These appear at the start of the rangetable. */ - if (list_length(parsetree->jointree->fromlist) == 1) + foreach(lc2, parsetree->jointree->fromlist) { -RangeTblRef *rtr = (RangeTblRef *) linitial(parsetree->jointree->fromlist); +RangeTblRef *rtr = (RangeTblRef *) lfirst(lc2); -if (IsA(rtr, RangeTblRef)) +if (IsA(rtr, RangeTblRef) && rtr->rtindex > orig_rt_length) { RangeTblEntry *rte = rt_fetch(rtr->rtindex, parsetree->rtable);
Re: [PATCH] Check snapshot argument of index_beginscan and family
Hi Pavel,

Thanks for the feedback!

> I think it's a nice catch and worth fixing. The one thing I don't
> agree with is using asserts for handling the error that can appear
> because most probably the server is built with assertions off and in
> this case, there still will be a crash in this case. I'd do this with
> report ERROR. Otherwise, the patch looks right and worth committing.

Normally a developer is not supposed to pass NULLs there, so I figured having extra if's in the release builds isn't worth it. Personally I wouldn't mind using ereport(), but my intuition tells me that the committers are not going to approve this :)

Let's see what the rest of the community thinks.

--
Best regards,
Aleksander Alekseev
Re: Introduce a new view for checkpointer related stats
On Sat, Nov 26, 2022 at 4:32 AM Andres Freund wrote: > Thanks Andres for reviewing. > > May I know what it means to deprecate pg_stat_bgwriter columns? > > Add a note to the docs saying that the columns will be removed. Done. > > Are > > you suggesting to add deprecation warnings to corresponding functions > > pg_stat_get_bgwriter_buf_written_clean(), > > pg_stat_get_bgwriter_maxwritten_clean(), pg_stat_get_buf_alloc() and > > pg_stat_get_bgwriter_stat_reset_time() and in the docs? > > I'm thinking of the checkpoint related columns in pg_stat_bgwriter. Added note in the docs alongside each deprecated pg_stat_bgwriter's checkpoint related column. > If we move, rather than duplicate, the pg_stat_bgwriter columns to > pg_stat_checkpointer, everyone will have to update their monitoring scripts > when upgrading and will need to add version dependency if they monitor > multiple versions. If we instead keep the duplicated columns in > pg_stat_bgwriter for 5 years, users can reloy on pg_stat_checkpointer in all > supported versions. Agree. However, it's a bit difficult to keep track of deprecated things and come back after 5 years to clean them up unless "some" postgres hacker/developer/user notices it again. Perhaps, adding a separate section, say 'Deprecated and To-Be-Removed, in https://wiki.postgresql.org/wiki/Main_Page is a good idea. > To be clear, it isn't a very heavy burden for users to make these > adjustments. But if it only costs us a few lines to keep the old columns for a > bit, that seems worth it. Yes. I'm attaching the v3 patch for further review. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From ec09bfcde836cb23f090fd6088b89f0fb9a68000 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Mon, 28 Nov 2022 10:07:55 + Subject: [PATCH v3] Introduce a new view for checkpointer related stats pg_stat_bgwriter view currently reports checkpointer stats as well. 
It is that way because historically checkpointer was part of bgwriter until the commits 806a2ae and bf405ba, that went into PG 9.2, separated them out. It is time for us to separate checkpointer stats to its own view called pg_stat_checkpointer. For now, we deprecate the use of these checkpointer related fields under pg_stat_bgwriter view for smoother transitioning. Eventually, we need to remove them from pg_stat_bgwriter view. Bump catalog version. Author: Bharath Rupireddy Reviewed-by: Andres Freund, Bertrand Drouvot Discussion: https://www.postgresql.org/message-id/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com --- doc/src/sgml/monitoring.sgml | 165 +- src/backend/catalog/system_views.sql | 17 +- .../utils/activity/pgstat_checkpointer.c | 1 + src/backend/utils/adt/pgstatfuncs.c | 19 +- src/include/catalog/pg_proc.dat | 22 ++- src/include/pgstat.h | 1 + src/test/recovery/t/029_stats_restart.pl | 6 +- src/test/regress/expected/rules.out | 14 +- src/test/regress/expected/stats.out | 21 ++- src/test/regress/sql/stats.sql| 12 +- 10 files changed, 236 insertions(+), 42 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5579b8b9e0..c69d3b21c0 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser + + pg_stat_checkpointerpg_stat_checkpointer + One row only, showing statistics about the + checkpointer process's activity. See + + pg_stat_checkpointer for details. + + + pg_stat_walpg_stat_wal One row only, showing statistics about WAL activity. See @@ -3654,7 +3663,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i checkpoints_timed bigint - Number of scheduled checkpoints that have been performed + Number of scheduled checkpoints that have been performed. 
Use of this + field under pg_stat_bgwriter view has been + deprecated, because the field actually shows checkpointer process + activity and is moved to a new view called + pg_stat_checkpointer. @@ -3663,7 +3676,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i checkpoints_req bigint - Number of requested checkpoints that have been performed + Number of requested checkpoints that have been performed. Use of this + field under pg_stat_bgwriter view has been + deprecated, because the field actually shows checkpointer process + activity and is moved to a new view called + pg_stat_checkpointer. @@ -3673,7 +3690,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_a
Re: Perform streaming logical transactions by background workers and parallel apply
On Sun, Nov 27, 2022 at 9:43 AM houzj.f...@fujitsu.com wrote: > > Attach the new version patch which addressed all comments so far. > Few comments on v52-0001* 1. pa_free_worker() { ... + /* Free the worker information if the worker exited cleanly. */ + if (!winfo->error_mq_handle) + { + pa_free_worker_info(winfo); + + if (winfo->in_use && + !hash_search(ParallelApplyWorkersHash, &xid, HASH_REMOVE, NULL)) + elog(ERROR, "hash table corrupted"); pa_free_worker_info() pfrees the winfo, so how is it legal to winfo->in_use in the above check? Also, why is this check (!winfo->error_mq_handle) required in the first place in the patch? The worker exits cleanly only when the leader apply worker sends a SIGINT signal and in that case, we already detach from the error queue and clean up other worker information. 2. +HandleParallelApplyMessages(void) +{ ... ... + foreach(lc, ParallelApplyWorkersList) + { + shm_mq_result res; + Size nbytes; + void*data; + ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc); + + if (!winfo->error_mq_handle) + continue; Similar to the previous comment, it is not clear whether we need this check. If required, can we add a comment to indicate the case where it happens to be true? Note, there is a similar check for winfo->error_mq_handle in pa_wait_for_xact_state(). Please add some comments if that is required. 3. Why is there apply_worker_clean_exit() at the end of ParallelApplyWorkerMain()? Normally either the leader worker stops parallel apply, or parallel apply gets stopped because of a parameter change, or exits because of error, and in none of those cases it can hit this code path unless I am missing something. Additionally, I think in LogicalParallelApplyLoop, we will never receive zero-length messages so that is also wrong and should be converted to elog(ERROR,..). 4. 
I think in logicalrep_worker_detach(), we should detach from the shm error queue so that the parallel apply worker won't try to send a termination message back to the leader worker. 5. pa_send_data() { ... + if (startTime == 0) + startTime = GetCurrentTimestamp(); ... What is the use of getting the current timestamp before waitlatch logic, if it is not used before that? It seems that is for the time logic to look correct. We can probably reduce the 10s interval to 9s for that. In this function, we need to add some comments to indicate why the current logic is used, and also probably we can refer to the comments atop this file. 6. I think it will be better if we keep stream_apply_worker local to applyparallelworker.c by exposing functions to cache/resetting the required info. 7. Apart from the above, I have made a few changes in the comments and some miscellaneous cosmetic changes in the attached. Kindly include these in the next version unless you see a problem with any change. -- With Regards, Amit Kapila. changes_amit_v52_1.patch Description: Binary data
Re: Logical Replication Custom Column Expression
Sure, I understand, and I also do not have good knowledge of what else could be influenced by such a change. Using the subscriber name as the column value has no benefit for this idea of merging multiple upstreams with the same primary keys; later you mention the "connection dbname", which could be a possibility. I do not fully understand the part "how will the initial tablesync COPY efficiently assign these subscriber name column values?" Why is it difficult, during the initial sync, to put the same value on all rows from the same origin? On Mon, Nov 28, 2022 at 10:16 AM, Peter Smith wrote: > On Fri, Nov 25, 2022 at 9:43 PM Stavros Koureas > wrote: > > > > Yes, if the property is on the subscription side then it should be > applied for all the tables that the connected publication is exposing. > > So if the property is enabled you should be sure that this origin column > exists to all of the tables that the publication is exposing... > > > > Sure this is the complete idea, that the subscriber should match the PK > of origin, > > As the subscription table will contain same key values from different > origins, for example: > > > > For publisher1 database table > > id pk integer | value character varying > > 1 | testA1 > > 2 | testA2 > > > > For publisher2 database table > > id pk integer | value character varying > > 1 | testB1 > > 2 | testB2 > > > > For subscriber database table > > origin pk character varying | id pk integer | value character varying > > publisher1 | 1 | testA1 > > publisher1 | 2 | testA2 > > publisher2 | 1 | testB1 > > publisher2 | 2 | testB2 > > > > All statements INSERT, UPDATE, DELETE should always include the > predicate of the origin. > > > > This sounds similar to what I had posted [1] although I was saying the > generated column value might be the *subscriber* name, not the origin > publisher name. (where are you getting that value from -- somehow from > the subscriptions' CONNECTION dbname?) 
> > Anyway, regardless of the details, please note -- my idea was really > intended just as a discussion starting point to demonstrate that > required functionality might be achieved using a simpler syntax than > what had been previously suggested. But in practice there may be some > problems with this approach -- e.g. how will the initial tablesync > COPY efficiently assign these subscriber name column values? > > -- > [1] > https://www.postgresql.org/message-id/CAHut%2BPuZowXd7Aa7t0nqjP6afHMwJarngzeMq%2BQP0vE2KKLOgQ%40mail.gmail.com > > Kind Regards, > Peter Smith. > Fujitsu Australia. >
Re: Reducing power consumption on idle servers
On Sun, Nov 27, 2022 at 6:57 PM Thomas Munro wrote: > The main documentation of pg_promote() etc now has "The parameter > promote_trigger_file has been removed" in the > places where the GUC was previously mentioned. Perhaps we should just > remove the mentions completely (it's somehow either too much and not > enough information), but I'm OK with leaving those in for now until > some better place exists for historical notes. I think we should remove those mentions. Otherwise the documentation just collects mentions of an increasing number of things that are no longer relevant. And, as you say, we can't presume that future readers would know what promote_trigger_file even is. -- Robert Haas EDB: http://www.enterprisedb.com
Re: O(n) tasks cause lengthy startups and checkpoints
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart wrote: > > Thanks for taking a look! > > On Thu, Nov 24, 2022 at 05:31:02PM +, Simon Riggs wrote: > > * not sure I believe that everything it does can always be aborted out > > of and shutdown - to achieve that you will need a > > CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least > > I did something like this earlier, but was advised to simply let the > functions finish as usual during shutdown [0]. I think this is what the > checkpointer process does today, anyway. If we say "The custodian is not an essential process and can shutdown quickly when requested.", and yet we know it's not true in all cases, then that will lead to misunderstandings and bugs. If we perform a restart and the custodian is performing extra work that delays shutdown, then it also delays restart. Given the title of the thread, we should be looking to improve that, or at least know it occurred. > > * not sure why you want immediate execution of custodian tasks - I > > feel supporting two modes will be a lot harder. For me, I would run > > locally when !IsUnderPostmaster and also in an Assert build, so we can > > test it works right - i.e. running in its own process is just a > > production optimization for performance (which is the stated reason > > for having this) > > I added this because 0004 involves requesting a task from the postmaster, > so checking for IsUnderPostmaster doesn't work. Those tasks would always > run immediately. However, we could use IsPostmasterEnvironment instead, > which would allow us to remove the "immediate" argument. I did it this way > in v14. Thanks > > 0005 seems good from what I know > > * There is no check to see if it worked in any sane time > > What did you have in mind? Should the custodian begin emitting WARNINGs > after a while? I think it might be useful if it logged anything that took an "extended period", TBD. Maybe that is already covered by startup process logging. 
Please tell me that still works? > > Rather than explicitly use DEBUG1 everywhere I would have an > > #define CUSTODIAN_LOG_LEVEL LOG > > so we can run with it in LOG mode and then set it to DEBUG1 with a one > line change in a later phase of Beta > > I can create a separate patch for this, but I don't think I've ever seen > this sort of thing before. Much of recovery is coded that way, for the same reason. > Is the idea just to help with debugging during > the development phase? "Just", yes. Tests would be desirable also, under src/test/modules. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, Nov 28, 2022 at 12:49 PM Peter Smith wrote: > ... > > 17. > @@ -388,10 +401,9 @@ static inline void cleanup_subxact_info(void); > /* > * Serialize and deserialize changes for a toplevel transaction. > */ > -static void stream_cleanup_files(Oid subid, TransactionId xid); > static void stream_open_file(Oid subid, TransactionId xid, > bool first_segment); > -static void stream_write_change(char action, StringInfo s); > +static void stream_write_message(TransactionId xid, char action, StringInfo > s); > static void stream_close_file(void); > > 17a. > > I felt just saying "file/files" is too vague. All the references to > the file should be consistent, so IMO everything would be better named > like: > > "stream_cleanup_files" -> "stream_msg_spoolfile_cleanup()" > "stream_open_file" -> "stream_msg_spoolfile_open()" > "stream_close_file" -> "stream_msg_spoolfile_close()" > "stream_write_message" -> "stream_msg_spoolfile_write_msg()" > > ~ > > 17b. > IMO there is not enough distinction here between function names > stream_write_message and stream_write_change. e.g. You cannot really > tell from their names what might be the difference. > > ~~~ > I think the only new function needed by this patch is stream_write_message, so I don't see why to change all the others for that. I see two possibilities to make the name better: (a) name the function stream_open_and_write_change, or (b) pass a new argument (boolean open) to stream_write_change ... > > src/include/replication/worker_internal.h > > 33. LeaderFileSetState > > +/* State of fileset in leader apply worker. */ > +typedef enum LeaderFileSetState > +{ > + LEADER_FILESET_UNKNOWN, > + LEADER_FILESET_BUSY, > + LEADER_FILESET_ACCESSIBLE > +} LeaderFileSetState; > > 33a. > > Missing from typedefs.list? > > ~ > > 33b. > > I thought some more explanatory comments for the meaning of > BUSY/ACCESSIBLE should be here. > > ~ > > 33c. > > READY might be a better value than ACCESSIBLE > > ~ > > 33d. 
> I'm not sure what usefulness the "LEADER_" and "Leader" prefixes > give here. Maybe a name like PartialFileSetState is more meaningful? > > e.g. like this? > > typedef enum PartialFileSetState > { > FS_UNKNOWN, > FS_BUSY, > FS_READY > } PartialFileSetState; > > ~ > All your suggestions in this point look good to me. > > ~~~ > > > 35. globals > > /* > + * Indicates whether the leader apply worker needs to serialize the > + * remaining changes to disk due to timeout when sending data to the > + * parallel apply worker. > + */ > + bool serialize_changes; > > 35a. > I wonder if the comment would be better to also mention "via shared memory". > > SUGGESTION > > Indicates whether the leader apply worker needs to serialize the > remaining changes to disk due to timeout when attempting to send data > to the parallel apply worker via shared memory. > > ~ > I think the comment should say " .. the leader apply worker serialized remaining changes ..." > 35b. > I wonder if a more informative variable name might be > serialize_remaining_changes? > I think this needlessly makes the variable name long. -- With Regards, Amit Kapila.
Re: TAP output format in pg_regress
> On 27 Nov 2022, at 11:22, Nikolay Shaplov wrote: > In a message of Saturday, 26 November 2022, 23:35:45 MSK, Daniel Gustafsson wrote: > I would suggest using the word immediate instead of noatexit. This would make the > code much more sensible to me. I think noatexit is clearer since the codepath is specifically to avoid any registered atexit functions. The point of this function is to be able to call bail while in a function registered with atexit() so I think the current name is better. > I've also rewritten the comment, the way I would understand it better, if I > read it for the first time. I am not sure about my English, but the key features > there are: > > - "terminate process" instead of "exit". Too many exits in the sentence, > better to use synonyms wherever possible. Sure, I can do that before pushing if the current version of the patch is acceptable. >> That would if so make the output something like the below. Personally I >> think the "test" prefix adds little value since everything printed are test >> suites, and we are already today using indentation for grouping parallel >> tests. > > So this extra offset indicates that the test is being included in a parallel > group? Guess it's not really obvious... Grouping parallel tests via an initial list of tests and then indenting each test with whitespace was committed 22 years ago. While there might be better ways to do this, the lack of complaints so far at least seems to indicate that it isn't all too terrible. > Theoretically TAP 14 has subtests and these parallel tests look like > subtests... but TAP 14 is not supported by modern harnesses.. Parallel tests aren't subtests though, they are single top-level tests which run in parallel to each other. -- Daniel Gustafsson https://vmware.com/
Re: Non-decimal integer literals
On 23.11.22 17:25, Dean Rasheed wrote: Taking a quick look, I noticed that there are no tests for negative values handled in the parser. Giving that a spin shows that make_const() fails to correctly identify the base of negative non-decimal integers in the T_Float case, causing it to fall through to numeric_in() and fail: Fixed in new patch. From 2d7f41981187df904e3d985f2770d9b5c26e9d7c Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 28 Nov 2022 09:24:20 +0100 Subject: [PATCH v11] Non-decimal integer literals Add support for hexadecimal, octal, and binary integer literals: 0x42F 0o273 0b100101 per SQL:202x draft. This adds support in the lexer as well as in the integer type input functions. Discussion: https://www.postgresql.org/message-id/flat/b239564c-cad0-b23e-c57e-166d883cb...@enterprisedb.com --- doc/src/sgml/syntax.sgml | 34 src/backend/catalog/information_schema.sql | 6 +- src/backend/catalog/sql_features.txt | 1 + src/backend/parser/parse_node.c| 37 +++- src/backend/parser/scan.l | 101 --- src/backend/utils/adt/numutils.c | 170 -- src/fe_utils/psqlscan.l| 78 +++-- src/interfaces/ecpg/preproc/pgc.l | 106 ++- src/test/regress/expected/int2.out | 80 + src/test/regress/expected/int4.out | 80 + src/test/regress/expected/int8.out | 80 + src/test/regress/expected/numerology.out | 193 - src/test/regress/sql/int2.sql | 22 +++ src/test/regress/sql/int4.sql | 22 +++ src/test/regress/sql/int8.sql | 22 +++ src/test/regress/sql/numerology.sql| 51 +- 16 files changed, 974 insertions(+), 109 deletions(-) diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index 93ad71737f51..956182e7c6a8 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -694,6 +694,40 @@ Numeric Constants + + Additionally, non-decimal integer constants can be used in these forms: + +0xhexdigits +0ooctdigits +0bbindigits + + hexdigits is one or more hexadecimal digits + (0-9, A-F), octdigits is one or more octal + digits (0-7), bindigits is one or more binary + 
digits (0 or 1). Hexadecimal digits and the radix prefixes can be in + upper or lower case. Note that only integers can have non-decimal forms, + not numbers with fractional parts. + + + + These are some examples of this: +0b100101 +0B10011001 +0o273 +0O755 +0x42f +0X + + + + + + Nondecimal integer constants are currently only supported in the range + of the bigint type (see ). + + + integer bigint diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql index 18725a02d1fb..95c27a625e7e 100644 --- a/src/backend/catalog/information_schema.sql +++ b/src/backend/catalog/information_schema.sql @@ -119,7 +119,7 @@ CREATE FUNCTION _pg_numeric_precision(typid oid, typmod int4) RETURNS integer WHEN 1700 /*numeric*/ THEN CASE WHEN $2 = -1 THEN null - ELSE (($2 - 4) >> 16) & 65535 + ELSE (($2 - 4) >> 16) & 0xFFFF END WHEN 700 /*float4*/ THEN 24 /*FLT_MANT_DIG*/ WHEN 701 /*float8*/ THEN 53 /*DBL_MANT_DIG*/ @@ -147,7 +147,7 @@ CREATE FUNCTION _pg_numeric_scale(typid oid, typmod int4) RETURNS integer WHEN $1 IN (1700) THEN CASE WHEN $2 = -1 THEN null - ELSE ($2 - 4) & 65535 + ELSE ($2 - 4) & 0xFFFF END ELSE null END; @@ -163,7 +163,7 @@ CREATE FUNCTION _pg_datetime_precision(typid oid, typmod int4) RETURNS integer WHEN $1 IN (1083, 1114, 1184, 1266) /* time, timestamp, same + tz */ THEN CASE WHEN $2 < 0 THEN 6 ELSE $2 END WHEN $1 IN (1186) /* interval */ - THEN CASE WHEN $2 < 0 OR $2 & 65535 = 65535 THEN 6 ELSE $2 & 65535 END + THEN CASE WHEN $2 < 0 OR $2 & 0xFFFF = 0xFFFF THEN 6 ELSE $2 & 0xFFFF END ELSE null END; diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index 8704a42b60a8..abad216b7ee4 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -527,6 +527,7 @@ T652 SQL-dynamic statements in SQL routines NO T653 SQL-schema statements in external routines YES T654 SQL-dynamic statements in external routines NO T655 Cyclically dependent routines YES +T661 Non-decimal 
integer literals YES SQL:202x draft T811 Basic SQL/JSON constructor functions NO T812 SQL/JSON: JSON_OBJECTAGG
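The three documented literal forms are easy to illustrate outside the lexer. The sketch below (Python, whose 0x/0o/0b prefixes happen to match the SQL:202x draft forms; the regex and function name are illustrative, not part of the patch) validates a literal against the documented grammar and converts it:

```python
import re

# Forms documented above: 0x hex, 0o octal, 0b binary; the prefix and the
# hex digits may be upper or lower case, and at least one digit must follow.
NON_DECIMAL = re.compile(r'^0(?:[xX][0-9A-Fa-f]+|[oO][0-7]+|[bB][01]+)$')

def parse_non_decimal(literal: str) -> int:
    """Parse a non-decimal integer literal of the documented forms."""
    if not NON_DECIMAL.match(literal):
        raise ValueError(f"invalid non-decimal literal: {literal!r}")
    # Python's int(s, 0) understands the same radix prefixes.
    return int(literal, 0)
```

For example, parse_non_decimal("0x42F") yields 1071 and parse_non_decimal("0b100101") yields 37, matching what the patched integer input functions would accept.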
Re: Reducing power consumption on idle servers
Robert Haas writes: > I think we should remove those mentions. Otherwise the documentation > just collects mentions of an increasing number of things that are no > longer relevant. Yeah, I think the same. There will be a release-note entry, and I don't object to having something about it in appendix-obsolete.sgml; but we shouldn't clutter the main docs with it. regards, tom lane
Re: Remove a unused argument from qual_is_pushdown_safe
Michael Paquier writes: > On Mon, Nov 28, 2022 at 11:54:45AM +0900, Michael Paquier wrote: >> On Fri, Nov 25, 2022 at 04:05:13PM +0800, Richard Guo wrote: > I wonder if we need to revise the comment atop qual_is_pushdown_safe() > too which says > > * rinfo is a restriction clause applying to the given subquery (whose RTE > * has index rti in the parent query). > > since there is no 'given subquery' after we remove it from the params. > I was thinking about this point, and it seems to me that we could just > do s/the given subquery/a subquery/. But perhaps you have a different > view on the matter? My viewpoint is that this change is misguided. Even if the current coding of qual_is_pushdown_safe doesn't happen to reference the subquery, it might need to in future. regards, tom lane
Re: Remove a unused argument from qual_is_pushdown_safe
> On 28 Nov 2022, at 15:15, Tom Lane wrote: > My viewpoint is that this change is misguided. Even if the current > coding of qual_is_pushdown_safe doesn't happen to reference the > subquery, it might need to in future. If I understand the code correctly the variable has some value in terms of "documenting the code" (for lack of better terminology), and I would assume virtually every modern compiler to figure out it's not needed. -- Daniel Gustafsson https://vmware.com/
Bug in wait time when waiting on nested subtransaction
Issue in current XactLockTableWait, starting with 3c27944fb2141, discovered while reviewing https://commitfest.postgresql.org/40/3806/ Test demonstrating the problem is 001-isotest-tuplelock-subxact.v1.patch A narrative description of the issue follows: session1 - requests multiple nested subtransactions like this: BEGIN; ... SAVEPOINT subxid1; ... SAVEPOINT subxid2; ... If another session2 sees an xid from subxid2, it waits. If subxid2 aborts, then session2 sees the abort and can continue processing normally. However, if subxid2 subcommits, then the lock wait moves from subxid2 to the topxid. If subxid1 subsequently aborts, it will also abort subxid2, but session2 waiting for subxid2 to complete doesn't see this and waits for topxid instead. Which means that it waits longer than it should, and later arriving lock waiters may actually get served first. So it's a bug, but not too awful, since in many cases people don't use nested subtransactions, and if they do use SAVEPOINTs, they don't often close them using RELEASE. And in most cases the extra wait time is not very long, hence why nobody ever reported this issue. Thanks to Robert Haas and Julien Tachoires for asking the question "are you sure the existing coding is correct?". You were both right; it is not. How to fix? Correct lock wait can be achieved while a subxid is running if we do either * a lock table entry for the subxid OR * a subtrans entry that points to its immediate parent So we have these options 1. Removing the XactLockTableDelete() call in CommitSubTransaction(). That releases lock waiters earlier than expected, which requires pushups in XactLockTableWait() to cope with that (which are currently broken). Why do we do it? To save shmem space in the lock table should anyone want to run a transaction that contains thousands of subtransactions, or more. 
So option (1) alone would eventually cause us to run out of space in the lock table and a transaction would receive ERRORs rather than be allowed to run for a long time. 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so we go up the levels one by one as we did before. However, (2) causes huge subtrans contention and if we implemented that and backpatched it the performance issues could be significant. So my feeling is that if we do (2) then we should not backpatch it. So both (1) and (2) have issues. The main result from patch https://commitfest.postgresql.org/40/3806/ is that having subtrans point direct to topxid is very good for performance in XidIsInMVCCSnapshot(), and presumably other places also. This bug prevents the simple application of a patch to improve performance. So now we need a stronger mix of ideas to both resolve the bug and fix the subtrans contention issue in HEAD. My preferred solution would be a mix of the above, call it option (3) 3. a) Include the lock table entry for the first 64 subtransactions only, so that we limit shmem. For those first 64 entries, have the subtrans point direct to top, since this makes a subtrans lookup into an O(1) operation, which is important for performance of later actions. b) For any subtransactions after first 64, delete the subxid lock on subcommit, to save shmem, but make subtrans point to the immediate parent (only), not the topxid. That fixes the bug, but causes performance problems in XidInMVCCSnapshot() and others, so we also do c) and d) c) At top level commit, go back and alter subtrans again for subxids so now it points to the topxid, so that we avoid O(N) behavior in XidInMVCCSnapshot() and other callers. Additional cost for longer transactions, but it saves considerable cost in later callers that need to call GetTopmostTransaction. d) Optimize SubTransGetTopmostTransaction() so it retrieves entries page-at-a-time. 
This will reduce the contention of repeatedly re-visiting the same page(s) and ensure that a page is less often paged out when we are still using it. Thoughts? -- Simon Riggshttp://www.EnterpriseDB.com/ 001-isotest-tuplelock-subxact.v1.patch Description: Binary data
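To see why the direct-to-topxid links in option (3) matter for lookup cost, here is a toy model (not the real SLRU-based subtrans code; the names and XID values are invented) contrasting the per-level parent walk of option (2) with an O(1) direct link:

```python
def topmost_via_parents(xid, parent_of):
    """Walk up one parent link at a time, counting lookups.

    With per-level links (option 2) the cost grows with nesting depth;
    with links straight to the topxid (option 3a) it is a single lookup.
    """
    steps = 0
    while xid in parent_of:
        xid = parent_of[xid]
        steps += 1
    return xid, steps

# A topxid 100 with 64 nested subtransactions 101..164, chained per level.
chain = {xid: xid - 1 for xid in range(101, 165)}

# Option (3a): every subtrans entry points straight at the topxid.
direct = {xid: 100 for xid in range(101, 165)}
```

Resolving the deepest subxid (164) costs 64 lookups through the chain but only one through the direct map, which is the performance benefit the thread attributes to having subtrans point directly to the top.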
[PATCH] Infinite loop while acquiring new TOAST Oid
Hi hackers! While working on Pluggable TOAST we've detected defective behavior on tables with large amounts of TOASTed data - queries freeze and the DB stalls. Further investigation led us to the loop around the GetNewOidWithIndex function call - when all available Oids already exist in the related TOAST table, this loop continues infinitely. The data type used for the value ID is uint32, an unsigned int with a maximum value of 4294967295, which allows at most 4294967295 records in a TOAST table. That is not a very large number for modern databases, and this is a major problem for production systems. A quick fix for this problem is limiting the GetNewOidWithIndex loop to some reasonable number of iterations, defined by a related macro, and returning an error if there is still no available Oid. Please check the attached patch; any feedback is appreciated. -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/ 0001_infinite_new_toast_oid_v1.patch Description: Binary data
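The proposed bounded retry can be modeled in miniature like this (a sketch, not the patch itself: the max_attempts cap and the is_used callback stand in for the real macro and for the index probe inside GetNewOidWithIndex):

```python
class OidExhaustedError(Exception):
    """Raised when no free OID is found within the attempt cap."""

def get_new_oid(is_used, start, max_attempts):
    """Scan candidate OIDs, wrapping around the 32-bit space, but give up
    after max_attempts instead of looping forever."""
    oid = start
    for _ in range(max_attempts):
        # OID 0 (InvalidOid) is never handed out; skip it on wraparound.
        if oid != 0 and not is_used(oid):
            return oid
        oid = (oid + 1) & 0xFFFFFFFF
    raise OidExhaustedError(f"no free OID after {max_attempts} attempts")
```

The point of the cap is exactly the behavior change described above: a full (or pathologically dense) OID space produces a clean error instead of an unbounded loop that stalls the backend.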
Re: fixing CREATEROLE
On Thu, Nov 24, 2022 at 2:41 AM wrote: > INHERITCREATEDROLES and SETCREATEDROLES behave much like DEFAULT > PRIVILEGES. What about something like: > > ALTER DEFAULT PRIVILEGES FOR alice > GRANT TO alice WITH INHERIT FALSE, SET TRUE, ADMIN TRUE > > The "abbreviated grant" is very much abbreviated, because the original > syntax GRANT a TO b is already quite short to begin with, i.e. there is > no ON ROLE or something like that in it. > > The initial DEFAULT privilege would be INHERIT FALSE, SET FALSE, ADMIN > TRUE, I guess? I don't know if changing the syntax from A to B is really getting us anywhere. I generally agree that the ALTER DEFAULT PRIVILEGES syntax looks nicer than the CREATE/ALTER ROLE syntax, but I'm not sure that's a sufficient reason to move the control over this behavior to ALTER DEFAULT PRIVILEGES. One thing to consider is that, as I've designed this, whether or not ADMIN is included in the grant is non-negotiable. I am, at least at present, inclined to think that was the right call, partly because Mark Dilger expressed a lot of concern about the CREATEROLE user losing control over the role they'd just created, and allowing ADMIN to be turned off would have exactly that effect. Plus a grant with INHERIT FALSE, SET FALSE, ADMIN FALSE would end up being almost identical to no grant at all, which seems pointless. Basically, without ADMIN, the implicit GRANT fails to accomplish its intended purpose, so I don't like having that as a possibility. The other thing that's a little weird about the syntax which you propose is that it's not obviously related to CREATE ROLE. The intent of the patch as implemented is to allow control over only the implicit GRANT that is created when a new role is created, not all grants that might be created by or to a particular user. Setting defaults for all grants doesn't seem like a particularly good idea to me, but it's definitely a different idea than what the patch proposes to do. 
I did spend some time thinking about trying to tie this to the CREATEROLE syntax itself. For example, instead of CREATE ROLE alice CREATEROLE INHERITCREATEDROLES SETCREATEDROLES you could write CREATE ROLE alice CREATEROLE WITH (INHERIT TRUE, SET TRUE) or something like this. That would avoid introducing new, lengthy keywords that are just concatenations of other English words, a kind of syntax that doesn't look particularly nice to me and probably is less friendly to non-English speakers as well. I didn't do it that way because the parser support would be more complex, but I could. CREATEROLE would have to become a keyword again, but that's not a catastrophe. Another idea would be to break the CREATEROLE stuff off from CREATE ROLE entirely and put it all into GRANT. You could imagine syntax like GRANT CREATEROLE (or CREATE ROLE?) TO role_specification WITH (INHERIT TRUE/FALSE, SET TRUE/FALSE). There are a few potential issues with this direction. One, if we did this, then CREATEROLE probably ought to become inheritable, because that's the way grants work in general, and this likely shouldn't be an exception, but this would be a behavior change. However, if it is the consensus that such a behavior change would be an improvement, that might be OK. Two, I wonder what we'd do about the GRANTED BY role_specification clause. We could leave it out, but that would be asymmetric with other GRANT commands. We could also support it and record that information and make this work more like other cases, including, I suppose, the possibility of dependent grants. We'd have to think about what that means exactly. If you revoke CREATEROLE from someone who has granted CREATEROLE to others, I suppose that's a clear dependent grant and needs to be recursively revoked. But what about the implicit grants that were created because the person had CREATEROLE? Are those also dependent grants? And what about the roles themselves? 
Should revoking CREATEROLE drop the roles that the user in question created? That gets complicated, because those roles might own objects. That's scary, because you might not expect revoking a role permission to result in tables getting dropped. It's also problematic, because those tables might be in some other database where they are inaccessible to the current session. All in all I'm inclined to think that recursing to the roles themselves is a bad plan, but it's debatable. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing another gen_node_support.pl special case
On 27.11.22 02:39, Tom Lane wrote: I got confused about how we were managing EquivalenceClass pointers in the copy/equal infrastructure, and it took me awhile to remember that the reason it works is that gen_node_support.pl has hard-wired knowledge about that. I think that's something we'd be best off dropping in favor of explicit annotations on affected fields. Hence, I propose the attached. This results in zero change in the generated copy/equal code. I suppose the question is whether this behavior is something that is a property of the EquivalenceClass type as such or something that is specific to each individual field.
Re: Removing another gen_node_support.pl special case
Peter Eisentraut writes: > On 27.11.22 02:39, Tom Lane wrote: >> I got confused about how we were managing EquivalenceClass pointers >> in the copy/equal infrastructure, and it took me awhile to remember >> that the reason it works is that gen_node_support.pl has hard-wired >> knowledge about that. I think that's something we'd be best off >> dropping in favor of explicit annotations on affected fields. >> Hence, I propose the attached. This results in zero change in >> the generated copy/equal code. > I suppose the question is whether this behavior is something that is a > property of the EquivalenceClass type as such or something that is > specific to each individual field. That's an interesting point, but what I'm on about is that I don't want the behavior buried in gen_node_support.pl. I think there's a reasonable argument to be made that equal_as_scalar *is* a field-level property not a node-level property. I agree that for the copy case you could argue it differently, and I also agree that it seems error-prone to have to remember to label fields this way. I notice that EquivalenceClass is already marked as no_copy_equal, which means that gen_node_support.pl can know that emitting a recursive node-copy or node-compare request is a bad idea. What do you think of using the patch as it stands, plus a cross-check that we don't emit COPY_NODE_FIELD or COMPARE_NODE_FIELD if the target node type is no_copy or no_equal? This is different from just silently applying scalar copy/equal, in that (a) it's visibly under the programmer's control, and (b) it's not hard to imagine wanting to use other solutions such as copy_as(NULL). (More generally, I suspect that there are other useful cross-checks gen_node_support.pl could be making. I had a to-do item to think about that, but it didn't get to the top of the list yet.) regards, tom lane
Re: Add 64-bit XIDs into PostgreSQL 15
On Sat, Nov 26, 2022 at 4:08 AM Chris Travers wrote: > I didn't see any changes to pg_upgrade to make this change possible on > upgrade. Is that also outside of the scope of your patch set? If so how is > that continuity supposed to be ensured? The scheme is documented in their 0006 patch, in a README.XID file. I'm not entirely confident that it's the best design and have argued against it in the past, but it's not crazy. More generally, while I think there's plenty of stuff to be concerned about in this patch set and while I'm somewhat skeptical about the likelihood of its getting or staying committed, I can't really understand your concerns in particular. The thrust of your concern seems to be that if we allow people to get further behind, recovery will be more difficult. I'm not sure I see the problem. Suppose that we adopt this proposal and that it is bug-free. Now, consider a user who gets 8 billion XIDs behind. They probably have to vacuum pretty much every page in the database to do that, or at least every page in the tables that haven't been vacuumed recently. But that would likely also be true if they were 800 million XIDs behind, as is possible today. The effort to catch up doesn't increase linearly with how far behind you are, and is always bounded by the DB size. It is true that if the table is progressively bloating, it is likely to be more bloated by the time you are 8 billion XIDs behind than it was when you were 800 million XIDs behind. I don't see that as a very good reason not to adopt this patch, because you can bloat the table by an arbitrarily large amount while consuming only a small number of XIDs, even just 1 XID. Protecting against bloat is good, but shutting down the database when the XID age reaches a certain value is not a particularly effective way of doing that, so saying that we'll be hurting people by not shutting down the database at the point where we do so today doesn't ring true to me. 
I think that most people who get to the point of wraparound shutdown have workloads where bloat isn't a huge issue, because those who do start having problems with the bloat way before they run out of XIDs. It would be entirely possible to add a parameter to the system that says "hey, you know we can keep running even if we're a shazillion XIDs behind, but instead shut down when we are behind by this number of XIDs." Then, if somebody wants to force an automatic shutdown at that point, they could, and I think that then the scenario you're worried about just can't happen any more. But isn't that a little bit silly? You could also just monitor how far behind you are and page the DBA when you get behind by more than a certain number of XIDs. Then, you wouldn't be risking a shutdown, and you'd still be able to stay on top of the XID ages of your tables. Philosophically, I disagree with the idea of shutting down the database completely in any situation in which a reasonable alternative exists. Losing read and write availability is really bad, and I don't think it's what users want. I think that most users want the database to degrade gracefully when things do not go according to plan. Ideally, they'd like everything to Just Work, but reasonable users understand that sometimes there are going to be problems, and in my experience, what makes them happy is when the database acts to contain the scope of the problem so that it affects their workload as little as possible, rather than acting to magnify the problem so that it impacts their workload as much as possible. This patch, implementation and design concerns to one side, does that. I don't believe there's a single right answer to the question of what to do about vacuum falling behind, and I think it's worth exploring multiple avenues to improve the situation. 
You can have vacuum never run on a table at all, say because all of the workers are busy elsewhere, or because the table is locked until the heat death of the universe. You can have vacuum run on a table but too slowly to do any good, because of the vacuum cost delay mechanism. You can have vacuum run and finish but do little good because of prepared transactions or replication slots or long-running queries. It's reasonable to think about what kinds of steps might help in those different scenarios, and especially to think about what kind of steps might help in multiple cases. We should do that. But, I don't think any of that means that we can ignore the need for some kind of expansion of the XID space forever. Computers are getting faster. It's already possible to burn through the XID space in hours, and the number of hours is going to go down over time and maybe eventually the right unit will be minutes, or even seconds. Sometime before then, we need to do something to make the runway bigger, or else just give up on PostgreSQL being a relevant piece of software. Perhaps the thing we need to do is not exactly this, but if not, it's probably a sibling or cousin of this. -- Rober
Re: predefined role(s) for VACUUM and ANALYZE
On 2022-11-23 We 18:54, Nathan Bossart wrote: > On Wed, Nov 23, 2022 at 02:56:28PM -0500, Andrew Dunstan wrote: >> I have committed the first couple of these to get them out of the way. > Thanks! > >> But I think we need a bit of cleanup in the next patch. >> vacuum_is_relation_owner() looks like it's now rather misnamed. Maybe >> vacuum_is_permitted_for_relation()? Also I think we need a more thorough >> reworking of the comments around line 566. And I think we need a more >> detailed explanation of why the change in vacuum_rel is ok, and if it is >> OK we should adjust the head comment on the function. >> >> In any case I think this comment would be better English with "might" >> instead of "may": >> >> /* user may have the ANALYZE privilege */ > I've attempted to address all your feedback in v13. Please let me know if > anything needs further reworking. Thanks, pushed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Bug in wait time when waiting on nested subtransaction
On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs wrote: > So we have these options > > 1. Removing the XactLockTableDelete() call in CommitSubTransaction(). > That releases lock waiters earlier than expected, which requires > pushups in XactLockTableWait() to cope with that (which are currently > broken). Why do we do it? To save shmem space in the lock table should > anyone want to run a transaction that contains thousands of > subtransactions, or more. So option (1) alone would eventually cause > us to run out of space in the lock table and a transaction would > receive ERRORs rather than be allowed to run for a long time. This seems unprincipled to me. The point of having a lock on the subtransaction in the lock table is so that people can wait for the subtransaction to end. If we don't remove the lock table entry at subtransaction end, then that lock table entry doesn't serve that purpose any more. > 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so > we go up the levels one by one as we did before. However, (2) causes > huge subtrans contention and if we implemented that and backpatched it > the performance issues could be significant. So my feeling is that if > we do (2) then we should not backpatch it. What I find suspicious about the coding of this function is that it doesn't distinguish between commits and aborts at all. Like, if a subtransaction commits, the changes don't become globally visible until the parent transaction also commits. If a subtransaction aborts, though, what happens to the top-level XID doesn't seem to matter at all. The comment seems to agree: * Note that this does the right thing for subtransactions: if we wait on a * subtransaction, we will exit as soon as it aborts or its top parent commits. 
I feel like what I'd expect to see given this comment is code which (1) waits until the supplied XID is no longer running, (2) checks whether the XID aborted, and if so return at once, and (3) otherwise recurse to the parent XID. But the code doesn't do that. Perhaps that's not actually the right thing to do, since it seems like a big behavior change, but then I don't understand the comment. Incidentally, one possible optimization here to try to reduce locking traffic would be to try waiting for the top-parent first using a conditional lock acquisition. If that works, cool. If not, go back around and work up the tree level by level. Since that path would only be taken in the unhappy case where we're probably going to have to wait anyway, the cost probably wouldn't be too bad. > My preferred solution would be a mix of the above, call it option (3) > > 3. > a) Include the lock table entry for the first 64 subtransactions only, > so that we limit shmem. For those first 64 entries, have the subtrans > point direct to top, since this makes a subtrans lookup into an O(1) > operation, which is important for performance of later actions. > > b) For any subtransactions after first 64, delete the subxid lock on > subcommit, to save shmem, but make subtrans point to the immediate > parent (only), not the topxid. That fixes the bug, but causes > performance problems in XidInMVCCSnapshot() and others, so we also do > c) and d) > > c) At top level commit, go back and alter subtrans again for subxids > so now it points to the topxid, so that we avoid O(N) behavior in > XidInMVCCSnapshot() and other callers. Additional cost for longer > transactions, but it saves considerable cost in later callers that > need to call GetTopmostTransaction. > > d) Optimize SubTransGetTopmostTransaction() so it retrieves entries > page-at-a-time. This will reduce the contention of repeatedly > re-visiting the same page(s) and ensure that a page is less often > paged out when we are still using it. 
I'm honestly not very sanguine about changing pg_subtrans to point straight to the topmost XID. It's only OK to do that if there's absolutely nothing that needs to know the full tree structure, and the present case looks like an instance where we would like to have the full tree structure. I would not be surprised if there are others. That said, it seems a lot less likely that this would be an issue once the top-level transaction is no longer running. At that point, all the subtransactions are no longer running either: they either committed or they rolled back, and I can't see a reason why any code should care about anything other than which of those two things happened. So I think your idea in (c) might have some possibilities. You could also flip that idea around and have readers replace immediate parent pointers with top-parent pointers opportunistically, but I'm not sure that's actually any better. As you present it in (c) above, there's a risk of going back and updating CLOG state that no one will ever look at. But if you flipped it around and did it on the read side, then you'd have the risk of a bunch of backends trying to do it at the same time. I'm not sure whether either of those things
Re: Introduce a new view for checkpointer related stats
On Tue, Nov 22, 2022 at 3:53 PM Andres Freund wrote: > I think we should consider deprecating the pg_stat_bgwriter columns but > leaving them in place for a few years. New stuff should only be added to > pg_stat_checkpointer, but we don't need to break old monitoring queries. I vote to just remove them. I think that most people won't update their queries until they are forced to do so. I don't think it matters very much when we force them to do that. Our track record in following through on deprecations is pretty bad too, which is another consideration. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Decouple last important WAL record LSN from WAL insert locks
Hi, On 2022-11-28 11:42:19 +0530, Bharath Rupireddy wrote: > On Sun, Nov 27, 2022 at 2:43 AM Andres Freund wrote: > > On 2022-11-23 19:12:07 +0530, Bharath Rupireddy wrote: > > > While working on something else, I noticed that each WAL insert lock > > > tracks its own last important WAL record's LSN (lastImportantAt) and > > > both the bgwriter and checkpointer later computes the max > > > value/server-wide last important WAL record's LSN via > > > GetLastImportantRecPtr(). While doing so, each WAL insertion lock is > > > acquired in exclusive mode in a for loop. This seems like too much > > > overhead to me. > > > > GetLastImportantRecPtr() should be a very rare operation, so it's fine for > > it > > to be expensive. The important thing is for the maintenance of the > > underlying > > data to be very cheap. > > > > > I quickly coded a patch (attached herewith) that > > > tracks the server-wide last important WAL record's LSN in > > > XLogCtlInsert (lastImportantPos) protected with a spinlock and gets > > > rid of lastImportantAt from each WAL insert lock. > > > > That strikes me as a very bad idea. It adds another point of contention to a > > very very hot code path, to make a very rare code path cheaper. > > Thanks for the response. I agree that GetLastImportantRecPtr() gets > called rarely, however, what concerns me is that it's taking all the > WAL insertion locks when it gets called. So what? It's far from the only operation doing so. And in contrast to most of the other places (c.f. WALInsertLockAcquireExclusive()) it only takes one of them at a time. > Is tracking lastImportantPos as pg_atomic_uint64 in XLogCtlInsert any > better than an explicit spinlock? I think it's better on platforms > where atomics are supported, however, it boils down to using a spin > lock on the platforms where atomics aren't supported. A central atomic in XLogCtlInsert would be better than a spinlock protected variable, but still bad. 
We do *not* want to have more central state that needs to be manipulated, we want *less*. If we wanted to optimize this - and I haven't seen any evidence it's worth doing so - we should just optimize the lock acquisitions in GetLastImportantRecPtr() away. *Without* centralizing the state. Greetings, Andres Freund
Re: TAP output format in pg_regress
On 2022-11-28 14:13:16 +0100, Daniel Gustafsson wrote: > > On 27 Nov 2022, at 11:22, Nikolay Shaplov wrote: > > In a message of Saturday, 26 November 2022 at 23:35:45 MSK, Daniel > > Gustafsson wrote: > > > I would suggest to use the word immediate instead of noatexit. This will make the > > code much more sensible for me. > I think noatexit is clearer since the codepath is specifically to avoid any > registered atexit functions. The point of this function is to be able to call > bail while in a function registered with atexit() so I think the current name > is better. +1
Re: O(n) tasks cause lengthy startups and checkpoints
On 2022-11-28 13:08:57 +, Simon Riggs wrote: > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart wrote: > > > Rather than explicitly use DEBUG1 everywhere I would have an > > > #define CUSTODIAN_LOG_LEVEL LOG > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one > > > line change in a later phase of Beta > > > > I can create a separate patch for this, but I don't think I've ever seen > > this sort of thing before. > > Much of recovery is coded that way, for the same reason. I think that's not a good thing to copy without a lot more justification than "some old code also does it that way". It's sometimes justified, but also makes code harder to read (one doesn't know what it does without looking up the #define, line length).
Re: O(n) tasks cause lengthy startups and checkpoints
On Mon, Nov 28, 2022 at 1:31 PM Andres Freund wrote: > On 2022-11-28 13:08:57 +, Simon Riggs wrote: > > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart > > wrote: > > > > Rather than explicitly use DEBUG1 everywhere I would have an > > > > #define CUSTODIAN_LOG_LEVEL LOG > > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one > > > > line change in a later phase of Beta > > > > > > I can create a separate patch for this, but I don't think I've ever seen > > > this sort of thing before. > > > > Much of recovery is coded that way, for the same reason. > > I think that's not a good thing to copy without a lot more justification than > "some old code also does it that way". It's sometimes justified, but also > makes code harder to read (one doesn't know what it does without looking up > the #define, line length). Yeah. If people need some of the log messages at a higher level during development, they can patch their own copies. I think there might be some argument for having a facility that lets you pick subsystems or even individual messages that you want to trace and pump up the log level for just those call sites. But I don't know exactly what that would look like, and I don't think inventing one-off mechanisms for particular cases is a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Failed Assert in pgstat_assoc_relation
vignesh C writes: > I could reproduce this issue with the following steps: > create table t1(c int); > BEGIN; > CREATE TABLE t (c int); > SAVEPOINT q; > CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1; > select * from t; > ROLLBACK TO q; > CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1; > ROLLBACK; Uh-huh. I've not bothered to trace this in detail, but presumably what is happening is that the first CREATE RULE converts the table to a view, and then the ROLLBACK undoes that so far as the catalogs are concerned, but probably doesn't undo related pg_stats state changes fully. Then we're in a bad state that will cause problems. (It still crashes if you replace the second CREATE RULE with "select * from t".) As far as HEAD is concerned, maybe it's time to nuke the whole convert-table-to-view kluge entirely? Only pg_dump older than 9.4 will emit such code, so we're really about out of reasons to keep on maintaining it. However, I'm not sure that removing that code in v15 will fly, so maybe we need to make the new pg_stats code a little more robust against the possibility of a relkind change. regards, tom lane
Re: Failed Assert in pgstat_assoc_relation
Hi, On 2022-11-28 13:37:16 -0500, Tom Lane wrote: > vignesh C writes: > > I could reproduce this issue with the following steps: > > create table t1(c int); > > BEGIN; > > CREATE TABLE t (c int); > > SAVEPOINT q; > > CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1; > > select * from t; > > ROLLBACK TO q; > > CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD SELECT * FROM t1; > > ROLLBACK; > > Uh-huh. I've not bothered to trace this in detail, but presumably > what is happening is that the first CREATE RULE converts the table > to a view, and then the ROLLBACK undoes that so far as the catalogs > are concerned, but probably doesn't undo related pg_stats state > changes fully. Then we're in a bad state that will cause problems. > (It still crashes if you replace the second CREATE RULE with > "select * from t".) Yea. I haven't yet fully traced through this, but presumably relcache inval doesn't fix this because we don't want to lose pending stats after DDL. Perhaps we need to add a rule about not swapping pgstat* in RelationClearRelation() when relkind changes? > As far as HEAD is concerned, maybe it's time to nuke the whole > convert-table-to-view kluge entirely? Only pg_dump older than > 9.4 will emit such code, so we're really about out of reasons > to keep on maintaining it. Sounds good to me. > However, I'm not sure that removing that code in v15 will fly, Agreed, at the very least that'd increase memory usage. > so maybe we need to make the new pg_stats code a little more > robust against the possibility of a relkind change. Possibly via the relcache code. Greetings, Andres Freund
Re: Another multi-row VALUES bug
Dean Rasheed writes: > A different way to do this, without relying on the contents of the > targetlist, is to note that, while processing a product query, what we > really want to do is ignore any VALUES RTEs from the original query, > since they will have already been processed. There should then never > be more than one VALUES RTE left to process -- the one from the rule > action. > This can be done by exploiting the fact that in product queries, the > rtable always consists of the rtable from the original query followed > by the rtable from the rule action, so we just need to ignore the > right number of RTEs at the start of the rtable. Of course that would > break if we ever changed the way rewriteRuleAction() worked, but at > least it only depends on that one other place in the code, which has > been stable for a long time, so the risk of future breakage seems > managable. This looks like a good solution. I didn't actually test the patch, but it passes an eyeball check. I like the fact that we can verify that we find only one candidate VALUES RTE. regards, tom lane
Re: predefined role(s) for VACUUM and ANALYZE
On Mon, Nov 28, 2022 at 12:13:13PM -0500, Andrew Dunstan wrote: > pushed. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Bug in wait time when waiting on nested subtransaction
On 2022-Nov-28, Simon Riggs wrote: > A narrative description of the issue follows: > session1 - requests multiple nested subtransactions like this: > BEGIN; ... > SAVEPOINT subxid1; ... > SAVEPOINT subxid2; ... > However, if subxid2 subcommits, then the lock wait moves from subxid2 > to the topxid. Hmm, do we really do that? Seems very strange .. it sounds to me like the lock should have been transferred to subxid1 (which is subxid2's parent), not to the top-level Xid. Maybe what the user wanted was to release subxid1 before establishing subxid2? Or do they want to continue to be able to rollback to subxid1 after establishing subxid2? (but why?) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: fixing CREATEROLE
Robert Haas: I don't know if changing the syntax from A to B is really getting us anywhere. I generally agree that the ALTER DEFAULT PRIVILEGES syntax looks nicer than the CREATE/ALTER ROLE syntax, but I'm not sure that's a sufficient reason to move the control over this behavior to ALTER DEFAULT PRIVILEGES. The list of role attributes can currently be roughly divided into the following categories: - Settings with role-specific values: CONNECTION LIMIT, PASSWORD, VALID UNTIL. It's hard to imagine storing them anywhere else, because they need to have a different value for each role. Those are not just "flags" like all the other attributes. - Two special attributes in INHERIT and BYPASSRLS regarding security/privileges. Those were invented because there was no other syntax to do the same thing. Those could be interpreted as privileges to do something, too - but lacking the ability to do that explicitly. There is no SET BYPASSRLS ON/OFF or SET INHERIT ON/OFF. Of course the INHERIT case is now a bit different, because there is the inherit grant option you introduced. - Cluster-wide privileges: SUPERUSER, CREATEDB, CREATEROLE, LOGIN, REPLICATION. Those can't be granted on some kind of object, because there is no such global object. You could imagine inventing some kind of global CLUSTER object and doing something like GRANT SUPERUSER ON CLUSTER TO alice; instead. Turning those into role attributes was the choice made instead. Most likely it would have been only a syntactic difference anyway: Even if there was something like GRANT .. ON CLUSTER, you'd most likely implement that as... storing those grants as role attributes. Your patch is introducing a new category of role attributes - those that affect default behavior. But there is already a way to express this right now, and that's ALTER DEFAULT PRIVILEGES in this case. Imho, the question asked should not be "why change from syntax A to B?" 
but rather: Why introduce a new category of role attributes, when there is a way to express the same concept already? I can't see any compelling reason for that, yet. Or not "yet", but rather "anymore". If I understood and remember correctly, you implemented it in a way that a user could not change those new attributes on their own role. This is in fact different to how ALTER DEFAULT PRIVILEGES works, so you could have made an argument that this was better expressed as role attributes. But I think this was asked and agreed on to act differently, so that the user can change this default behavior of what happens when they create a role for themselves. And now this reason is gone - there is no reason NOT to implement it as DEFAULT PRIVILEGES. One thing to consider is that, as I've designed this, whether or not ADMIN is included in the grant is non-negotiable. I am, at least at present, inclined to think that was the right call, partly because Mark Dilger expressed a lot of concern about the CREATEROLE user losing control over the role they'd just created, and allowing ADMIN to be turned off would have exactly that effect. Plus a grant with INHERIT FALSE, SET FALSE, ADMIN FALSE would end up being almost identical to no grant at all, which seems pointless. Basically, without ADMIN, the implicit GRANT fails to accomplish its intended purpose, so I don't like having that as a possibility. With how you implemented it right now, is it possible to do the following? CREATE ROLE alice; REVOKE ADMIN OPTION FOR alice FROM CURRENT_USER; If the answer is yes, then there is no reason to allow a user to set a shortcut for SET and INHERIT, but not for ADMIN. If the answer is no, then you could just not allow specifying the ADMIN option in the ALTER DEFAULT PRIVILEGES statement and always force it to be TRUE. The other thing that's a little weird about the syntax which you propose is that it's not obviously related to CREATE ROLE. 
The intent of the patch as implemented is to allow control over only the implicit GRANT that is created when a new role is created, not all grants that might be created by or to a particular user. Setting defaults for all grants doesn't seem like a particularly good idea to me, but it's definitely a different idea than what the patch proposes to do. Before I proposed that I was confused for a moment about this, too - but it turns out to be wrong. ALTER DEFAULT PRIVILEGES in general works as: When object A is created, issue a GRANT ON A automatically. In my proposal, the "object" is not the GRANT of that role. It's the role itself. So the default privileges express what should happen when the role is created. The default privileges would NOT affect a regular GRANT role TO role_spec command. They only run that command when a role is created. I did spend some time thinking about trying to tie this to the CREATEROLE syntax itself. For example, instead of CREATE ROLE alice CREATEROLE INHERITCREATEDROLES SETCREATEDROLES you could write CREATE ROLE alic
Re: Add tracking of backend memory allocated to pg_stat_activity
On 2022-11-26 22:10:06 -0500, Reid Thompson wrote: >- zero allocated bytes after fork to avoid double counting postmaster > allocations I still don't understand this - postmaster shouldn't be counted at all. It doesn't have a PgBackendStatus. There simply shouldn't be any tracked allocations from it. Greetings, Andres Freund
Re: TAP output format in pg_regress
In a message of Monday, 28 November 2022 at 21:28:48 MSK, Andres Freund wrote: > On 2022-11-28 14:13:16 +0100, Daniel Gustafsson wrote: > > > On 27 Nov 2022, at 11:22, Nikolay Shaplov wrote: > > > In a message of Saturday, 26 November 2022 at 23:35:45 MSK, Daniel > > > Gustafsson wrote: > > > > > > I would suggest to use the word immediate instead of noatexit. This will make > > > the > > > code much more sensible for me. > > > > I think noatexit is clearer since the codepath is specifically to avoid > > any > > registered atexit functions. The point of this function is to be able to > > call bail while in a function registered with atexit() so I think the > > current name is better. > > +1 Ok, then I would not insist on it. > > So this extra offset indicates that test is being included into parallel > > group? Guess it's not really obvious... > > Grouping parallel tests via an initial list of test and then indenting each > test with whitespace was committed 22 years ago. While there might be > better > ways to do this, the lack of complaints so far at least seems to indicate > that it isn't all too terrible. Ok, it was there for 22 years, it will do no harm if it is left this way for some time :-) From my reviewer's point of view the patch is ready for commit. Thank you for your patience :-) PS Should I change commitfest status? -- Nikolay Shaplov aka Nataraj Fuzzing Engineer at Postgres Professional Matrix IM: @dhyan:nataraj.su
Re: Collation version tracking for macOS
On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro wrote: > OK. Time for a new list of the various models we've discussed so far: > > 1. search-by-collversion: We introduce no new "library version" > concept to COLLATION and DATABASE object and little or no new syntax. > > 2. lib-version-in-providers: We introduce a separate provider value > for each ICU version, for example ICU63, plus an unversioned ICU like > today. > > 3. lib-version-in-attributes: We introduce daticuversion (alongside > datcollversion) and collicuversion (alongside collversion). Similar > to the above, but it's a separate property and the provider is always > ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change > ICU_VERSION. > > 4. lib-version-in-locale: "63:en" from earlier versions. That was > mostly a strawman proposal to avoid getting bogged down in > syntax/catalogue/model change discussions while trying to prove that > dlopen would even work. It doesn't sound like anyone really likes > this. > > 5. lib-version-in-collversion: We didn't explicitly discuss this > before, but you hinted at it: we could just use u_getVersion() in > [dat]collversion. I'd like to vote against #3 at least in the form that's described here. If we had three more libraries providing collations, it's likely that they would need versioning, too. So if we add an explicit notion of provider version, then it ought not to be specific to libicu. I think it's OK to decide that different library versions are different providers (your option #2), or that they are the same provider but give rise to different collations (your option #4), or that there can be multiple version of each collation which are distinguished by some additional provider version field (your #3 made more generic). I don't really understand #1 or #5 well enough to have an educated opinion, but I do think that #1 seems a bit magical. 
It hopes that the combination of a collation name and a datcollversion will be sufficient to find exactly one matching collation in a list of provided libraries. The advantage of that, as I understand it, is that if you do something to your system that causes the number of matches to go from one to zero, you can just throw another library on the pile and get the number back up to one. Woohoo! But there's a part of me that worries: what if the number goes up to two, and they're not all the same? Probably that's something that shouldn't happen, but if it does then I think there's kind of no way to fix it. With the other options, if there's some way to jigger the catalog state to match what you want to happen, you can always repair the situation somehow, because the library to be used for each collation is explicitly specified in some way, and you just have to get it to match what you want to have happen. I don't know too much about this, though, so I might have it all wrong. -- Robert Haas EDB: http://www.enterprisedb.com
Re: TAP output format in pg_regress
> On 28 Nov 2022, at 20:02, Nikolay Shaplov wrote: > From my reviewer's point of view patch is ready for commit. > > Thank you for your patience :-) Thanks for review. The attached tweaks a few comments and attempts to address the compiler warning error in the CFBot CI. Not sure I entirely agree with the compiler there but here is an attempt to work around it at least (by always copying the va_list for the logfile). It also adds a missing va_end for the logfile va_list. I hope this is close to a final version of this patch (commitmessage needs a bit of work though). > PS Should I change commitfest status? Sure, go ahead. -- Daniel Gustafsson https://vmware.com/ v14-0001-Change-pg_regress-output-format-to-be-TAP-compli.patch Description: Binary data
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 11:57 AM wrote: > Robert Haas: > > I don't know if changing the syntax from A to B is really getting us > > anywhere. I generally agree that the ALTER DEFAULT PRIVILEGES syntax > > looks nicer than the CREATE/ALTER ROLE syntax, but I'm not sure that's > > a sufficient reason to move the control over this behavior to ALTER > > DEFAULT PRIVILEGES. > > Your patch is introducing a new category of role attributes - those that > are affecting default behavior. But there is already a way to express > this right now, and that's ALTER DEFAULT PRIVILEGES in this case. I do not like ALTER DEFAULT PRIVILEGES (ADP) for this. I don't really like defaults, period, for this. The role doing the creation and the role being created are both in scope when the command is executed and if anything it is the role doing the creation that is receiving the privileges not the role being created. For ADP, the role being created gets the privileges and it is objects not in the scope of the executed command that are being affected. > > One thing to consider is that, as I've designed > > this, whether or not ADMIN is included in the grant is non-negotiable. > > I am, at least at present, inclined to think that was the right call, > > partly because Mark Dilger expressed a lot of concern about the > > CREATEROLE user losing control over the role they'd just created, and > > allowing ADMIN to be turned off would have exactly that effect. Plus a > > grant with INHERIT FALSE, SET FALSE, ADMIN FALSE would end up being > > almost identical to no grant at all, which seems pointless. Basically, > > without ADMIN, the implicit GRANT fails to accomplish its intended > > purpose, so I don't like having that as a possibility. > > With how you implemented it right now, is it possible to do the following? 
>
> CREATE ROLE alice;
> REVOKE ADMIN OPTION FOR alice FROM CURRENT_USER;
>
> If the answer is yes, then there is no reason to allow a user to set a shortcut for SET and INHERIT, but not for ADMIN.
>
> If the answer is no, then you could just not allow specifying the ADMIN option in the ALTER DEFAULT PRIVILEGES statement and always force it to be TRUE.

A prior email described that the creation of a role by a CREATEROLE role results in the necessary creation of an ADMIN grant from the creator to the new role, granted by the bootstrap superuser (or, possibly, whichever superuser granted CREATEROLE). That REVOKE will not work, as there would be no existing "grant by current_user over alice granted by current_user" immediately after current_user creates alice.

> Or we could just decide not to, because is it really that hard to just issue a GRANT statement immediately after CREATE ROLE, when you want to have SET or INHERIT options on that role?

The answer to that question was "yes it is too hard" a while back, and as such DEFAULT PRIVILEGES were introduced.

A quick tally of the thread so far:

No Defaults needed: David J., Mark?, Tom?
Defaults needed - attached to role directly: Robert
Defaults needed - defined within Default Privileges: Walther?

The capability itself seems orthogonal to the rest of the patch. I think we can "Fix CREATEROLE" without any feature regarding optional default behaviors, and would suggest this patch be so limited and that another thread be started for discussion of how a default-specifying mechanism should look (assuming one is wanted at all). Let's not let a usability debate distract us from fixing a real problem.

David J.
Re: Bug in wait time when waiting on nested subtransaction
On Mon, 28 Nov 2022 at 18:53, Alvaro Herrera wrote:
> On 2022-Nov-28, Simon Riggs wrote:
>
> > A narrative description of the issue follows:
> > session1 - requests multiple nested subtransactions like this:
> > BEGIN; ...
> > SAVEPOINT subxid1; ...
> > SAVEPOINT subxid2; ...
>
> > However, if subxid2 subcommits, then the lock wait moves from subxid2 to the topxid.
>
> Hmm, do we really do that? Seems very strange .. it sounds to me like the lock should have been transferred to subxid1 (which is subxid2's parent), not to the top-level Xid.

Correct; that is exactly what I'm saying, and why we have a bug since 3c27944fb2141.

> Maybe what the user wanted was to release subxid1 before establishing subxid2? Or do they want to continue to be able to rollback to subxid1 after establishing subxid2? (but why?)

This isn't a description of a user's actions, it is a script that illustrates the bug in XactLockTableWait(). Perhaps a better example would be nested code blocks with EXCEPTION clauses where the outer block fails, e.g.

DO $$
BEGIN
  PERFORM 1;
  BEGIN
    PERFORM 1;
  EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 's2';
  END;
  RAISE division_by_zero;  -- now back in outer subxact, which now fails
EXCEPTION WHEN OTHERS THEN
  RAISE NOTICE 's1';
END;
$$;

Of course, debugging this is harder since there is no way to return the current subxid in SQL.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 1:56 PM wrote:
> And now this reason is gone - there is no reason NOT to implement it as DEFAULT PRIVILEGES.

I think there is, and it's this, which you wrote further down:

> In my proposal, the "object" is not the GRANT of that role. It's the role itself. So the default privileges express what should happen when the role is created. The default privileges would NOT affect a regular GRANT role TO role_spec command. They only run that command when a role is created.

I agree that this is what you are proposing, but it is not what your proposed syntax says. Your proposed syntax simply says ALTER DEFAULT PRIVILEGES .. GRANT. Users who read that are going to think it controls the default behavior for all grants, because that's what the syntax says. If the proposed syntax mentioned CREATE ROLE someplace, maybe that would have some potential. A proposal to make a command that controls CREATE ROLE and only CREATE ROLE, and mentions neither CREATE nor ROLE anywhere in the syntax, is never going to be acceptable.

> With how you implemented it right now, is it possible to do the following?
>
> CREATE ROLE alice;
> REVOKE ADMIN OPTION FOR alice FROM CURRENT_USER;
>
> If the answer is yes, then there is no reason to allow a user to set a shortcut for SET and INHERIT, but not for ADMIN.
>
> If the answer is no, then you could just not allow specifying the ADMIN option in the ALTER DEFAULT PRIVILEGES statement and always force it to be TRUE.

It's no. Well, OK, you can do it, but it doesn't revoke anything, because you can only revoke your own grant, not the bootstrap superuser's grant.

> attributes vs. default privileges. Or we could just decide not to, because is it really that hard to just issue a GRANT statement immediately after CREATE ROLE, when you want to have SET or INHERIT options on that role?
It's not difficult in the sense that climbing Mount Everest is difficult, but it makes the user experience as a CREATEROLE non-superuser quite noticeably different from being a superuser. Having a way to paper over such differences is, in my opinion, an important usability feature.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: fixing CREATEROLE
David G. Johnston:
> A quick tally of the thread so far:
>
> No Defaults needed: David J., Mark?, Tom?
> Defaults needed - attached to role directly: Robert
> Defaults needed - defined within Default Privileges: Walther?

s/Walther/Wolfgang

> The capability itself seems orthogonal to the rest of the patch to track these details better. I think we can "Fix CREATEROLE" without any feature regarding optional default behaviors and would suggest this patch be so limited and that another thread be started for discussion of (assuming a default specifying mechanism is wanted overall) how it should look. Let's not let a usability debate distract us from fixing a real problem.

+1

I didn't argue for whether defaults are needed in this case or not. I just said that ADP is better for defaults than role attributes are. Or the other way around: I think role attributes are not a good way to express those.

Personally, I'm in the No Defaults needed camp, too.

Best, Wolfgang
Re: Bug in wait time when waiting on nested subtransaction
On Mon, 28 Nov 2022 at 17:38, Robert Haas wrote:
> On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs wrote:
> > So we have these options
> >
> > 1. Removing the XactLockTableDelete() call in CommitSubTransaction(). That releases lock waiters earlier than expected, which requires pushups in XactLockTableWait() to cope with that (which are currently broken). Why do we do it? To save shmem space in the lock table should anyone want to run a transaction that contains thousands of subtransactions, or more. So option (1) alone would eventually cause us to run out of space in the lock table and a transaction would receive ERRORs rather than be allowed to run for a long time.
>
> This seems unprincipled to me. The point of having a lock on the subtransaction in the lock table is so that people can wait for the subtransaction to end. If we don't remove the lock table entry at subtransaction end, then that lock table entry doesn't serve that purpose any more.

An easy point to confuse: "subtransaction to end". The subtransaction is "still running" to other backends even AFTER it has been subcommitted, but its state now transfers to the parent.

So the subtransaction doesn't cease running until it aborts, one of its parents aborts, or the top level commits. The subxid lock should, on principle, exist until one of those events occurs. It doesn't, but that is an optimization, for the stated reason. (All of the above is current behavior.)

> > 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so we go up the levels one by one as we did before. However, (2) causes huge subtrans contention, and if we implemented that and backpatched it the performance issues could be significant. So my feeling is that if we do (2) then we should not backpatch it.
>
> What I find suspicious about the coding of this function is that it doesn't distinguish between commits and aborts at all.
> Like, if a subtransaction commits, the changes don't become globally visible until the parent transaction also commits. If a subtransaction aborts, though, what happens to the top-level XID doesn't seem to matter at all. The comment seems to agree:
>
> * Note that this does the right thing for subtransactions: if we wait on a
> * subtransaction, we will exit as soon as it aborts or its top parent commits.
>
> I feel like what I'd expect to see given this comment is code which (1) waits until the supplied XID is no longer running, (2) checks whether the XID aborted, and if so return at once, and (3) otherwise recurse to the parent XID. But the code doesn't do that. Perhaps that's not actually the right thing to do, since it seems like a big behavior change, but then I don't understand the comment.

As I mention above, the correct behavior is that the subxact doesn't cease running until it aborts, one of its parents aborts, or the top level commits. Which is slightly different from the comment, which may explain why the bug exists.

> Incidentally, one possible optimization here to try to reduce locking traffic would be to try waiting for the top-parent first using a conditional lock acquisition. If that works, cool. If not, go back around and work up the tree level by level. Since that path would only be taken in the unhappy case where we're probably going to have to wait anyway, the cost probably wouldn't be too bad.

That sounds like a potential bug fix (not just an optimization).

> > My preferred solution would be a mix of the above, call it option (3)
> >
> > 3.
> > a) Include the lock table entry for the first 64 subtransactions only, so that we limit shmem. For those first 64 entries, have the subtrans point direct to top, since this makes a subtrans lookup into an O(1) operation, which is important for performance of later actions.
> > b) For any subtransactions after the first 64, delete the subxid lock on subcommit, to save shmem, but make subtrans point to the immediate parent (only), not the topxid. That fixes the bug, but causes performance problems in XidInMVCCSnapshot() and others, so we also do c) and d)
> >
> > c) At top level commit, go back and alter subtrans again for subxids so now it points to the topxid, so that we avoid O(N) behavior in XidInMVCCSnapshot() and other callers. Additional cost for longer transactions, but it saves considerable cost in later callers that need to call GetTopmostTransaction.
> >
> > d) Optimize SubTransGetTopmostTransaction() so it retrieves entries page-at-a-time. This will reduce the contention of repeatedly re-visiting the same page(s) and ensure that a page is less often paged out when we are still using it.
>
> I'm honestly not very sanguine about changing pg_subtrans to point straight to the topmost XID. It's only OK to do that if there's absolutely nothing that needs to know the full tree structure, and the present case looks like an instance w
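The level-by-level rule being debated here (a subxact keeps "running" until it aborts, an ancestor aborts, or the top level commits) can be sketched as a toy model. This is deliberately not PostgreSQL code — the names and data structures are invented for illustration only — but it shows why an abort at any level should end the wait, while commits must propagate all the way to the top:

```python
# Toy model (not the real XactLockTableWait()) of the level-by-level wait
# logic: once an xid stops running, an abort ends the wait immediately,
# while a subcommit transfers the waiter to the immediate parent.

ABORTED, COMMITTED = "aborted", "committed"

def wait_for_xact(xid, parent, outcome):
    """Return the xid whose fate finally unblocks the waiter.

    parent:  dict mapping a subxid to its immediate parent xid
    outcome: dict mapping each finished xid to ABORTED or COMMITTED
    """
    while True:
        # In the real code this step is a lock wait; in this toy model
        # every xid has already finished.
        if outcome[xid] == ABORTED:
            return xid            # an abort anywhere ends the wait
        if xid not in parent:
            return xid            # top-level commit: changes now visible
        xid = parent[xid]         # subcommit: fate transfers to the parent

# subxid2 (102) subcommitted into subxid1 (101), which subcommitted into
# the top-level xact (100), which then committed:
parent = {102: 101, 101: 100}
outcome = {102: COMMITTED, 101: COMMITTED, 100: COMMITTED}
print(wait_for_xact(102, parent, outcome))  # -> 100

# Same tree, but subxid1 rolled back: the wait ends at 101, not 100.
outcome = {102: COMMITTED, 101: ABORTED, 100: COMMITTED}
print(wait_for_xact(102, parent, outcome))  # -> 101
```

The bug under discussion corresponds to jumping straight from 102 to the topmost xid 100 instead of walking through 101, which loses the chance to notice 101's abort.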
Re: Bug in wait time when waiting on nested subtransaction
On Mon, Nov 28, 2022 at 2:45 PM Simon Riggs wrote:
> An easy point to confuse: "subtransaction to end". The subtransaction is "still running" to other backends even AFTER it has been subcommitted, but its state now transfers to the parent.
>
> So the subtransaction doesn't cease running until it aborts, one of its parents aborts, or the top level commits. The subxid lock should, on principle, exist until one of those events occurs. It doesn't, but that is an optimization, for the stated reason.

That's not what "running" means to me. Running means it's started and hasn't yet committed or rolled back.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 12:42 PM wrote:
> David G. Johnston:
> > A quick tally of the thread so far:
> >
> > No Defaults needed: David J., Mark?, Tom?
> > Defaults needed - attached to role directly: Robert
> > Defaults needed - defined within Default Privileges: Walther?
>
> s/Walther/Wolfgang

Sorry 'bout that, I was just reading the To: line in my email reply.

> Personally, I'm in the No Defaults needed camp, too.

I kinda thought so from your final comments, thanks for clarifying.

David J.
Re: fixing CREATEROLE
Robert Haas:
> > In my proposal, the "object" is not the GRANT of that role. It's the role itself. So the default privileges express what should happen when the role is created. The default privileges would NOT affect a regular GRANT role TO role_spec command. They only run that command when a role is created.
>
> I agree that this is what you are proposing, but it is not what your proposed syntax says. Your proposed syntax simply says ALTER DEFAULT PRIVILEGES .. GRANT. Users who read that are going to think it controls the default behavior for all grants, because that's what the syntax says. If the proposed syntax mentioned CREATE ROLE someplace, maybe that would have some potential. A proposal to make a command that controls CREATE ROLE and only CREATE ROLE and mentions neither CREATE nor ROLE anywhere in the syntax is never going to be acceptable.

Yes, I agree - the abbreviated GRANT syntax is confusing/misleading in that case. Consistent with the other syntaxes, but easily confused nonetheless.

> It's no. Well, OK, you can do it, but it doesn't revoke anything, because you can only revoke your own grant, not the bootstrap superuser's grant.

Ah, I see. I didn't get that difference regarding the bootstrap superuser so far. So in that sense, the ADP GRANT would be an additional GRANT, issued by the user that created the role, in addition to the bootstrap superuser's grant.

You can't revoke the bootstrap superuser's grant - but you can't modify it either. And there is no need to add SET or INHERIT to the bootstrap superuser's grant, because you can grant the role yourself again, with those options. I think it would be very strange to have a default for that bootstrap superuser's grant. Or rather: a different default than the minimum required - and that's just ADMIN, not SET, not INHERIT. When you have the minimum, you can always choose to grant SET and INHERIT later on yourself - and revoke it, too!
But when the SET and INHERIT are on the bootstrap superuser's grant, there is no way for you to revoke SET or INHERIT on that grant later. Why should the superuser, who gave you CREATEROLE, insist on you having SET or INHERIT forever and disallow revoking it from yourself?

Best, Wolfgang
Re: Failed Assert in pgstat_assoc_relation
Andres Freund writes:
> On 2022-11-28 13:37:16 -0500, Tom Lane wrote:
>> As far as HEAD is concerned, maybe it's time to nuke the whole
>> convert-table-to-view kluge entirely? Only pg_dump older than
>> 9.4 will emit such code, so we're really about out of reasons
>> to keep on maintaining it.

> Sounds good to me.

Here's a draft patch for that. If we apply this to HEAD then we only need that klugery in relcache for v15.

regards, tom lane

diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index be3d17c852..3c9459b648 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -280,14 +280,16 @@
 Views in PostgreSQL are implemented
-using the rule system. In fact, there is essentially no difference
-between:
+using the rule system. A view is basically an empty table (having no
+actual storage) with an ON SELECT DO INSTEAD rule.
+Conventionally, that rule is named _RETURN.
+So a view like

 CREATE VIEW myview AS SELECT * FROM mytab;

-compared against the two commands:
+is very nearly the same thing as

 CREATE TABLE myview (same column list as mytab);
@@ -295,13 +297,17 @@
 CREATE RULE "_RETURN" AS ON SELECT TO myview DO INSTEAD
 SELECT * FROM mytab;

-because this is exactly what the CREATE VIEW
-command does internally. This has some side effects. One of them
-is that the information about a view in the
-PostgreSQL system catalogs is exactly
-the same as it is for a table. So for the parser, there is
-absolutely no difference between a table and a view. They are the
-same thing: relations.
+although you can't actually write that, because tables are not
+allowed to have ON SELECT rules.
+
+
+A view can also have other kinds of DO INSTEAD
+rules, allowing INSERT, UPDATE,
+or DELETE commands to be performed on the view
+despite its lack of underlying storage.
+This is discussed further below, in
+.
@@ -,7 +1117,7 @@
 SELECT word FROM words ORDER BY word <-> 'caterpiler' LIMIT 10;

 Rules that are defined on INSERT,
 UPDATE, and DELETE are
 significantly different from the view rules
-described in the previous section. First, their CREATE
+described in the previous sections. First, their CREATE
 RULE command allows more:

diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index db45d8a08b..8ac2c81b06 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -239,7 +239,6 @@ DefineQueryRewrite(const char *rulename,
 	Relation	event_relation;
 	ListCell   *l;
 	Query	   *query;
-	bool		RelisBecomingView = false;
 	Oid			ruleId = InvalidOid;
 	ObjectAddress address;
@@ -311,7 +310,18 @@ DefineQueryRewrite(const char *rulename,
 	/*
 	 * Rules ON SELECT are restricted to view definitions
 	 *
-	 * So there cannot be INSTEAD NOTHING, ...
+	 * So this had better be a view, ...
+	 */
+	if (event_relation->rd_rel->relkind != RELKIND_VIEW &&
+		event_relation->rd_rel->relkind != RELKIND_MATVIEW)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" cannot have ON SELECT rules",
+						RelationGetRelationName(event_relation)),
+				 errdetail_relkind_not_supported(event_relation->rd_rel->relkind)));
+
+	/*
+	 * ... there cannot be INSTEAD NOTHING, ...
 	 */
 	if (action == NIL)
 		ereport(ERROR,
@@ -407,93 +417,6 @@ DefineQueryRewrite(const char *rulename,
 						ViewSelectRuleName)));
 		rulename = pstrdup(ViewSelectRuleName);
 	}
-
-	/*
-	 * Are we converting a relation to a view?
-	 *
-	 * If so, check that the relation is empty because the storage for the
-	 * relation is going to be deleted.  Also insist that the rel not be
-	 * involved in partitioning, nor have any triggers, indexes, child or
-	 * parent tables, RLS policies, or RLS enabled.  (Note: some of these
-	 * tests are too strict, because they will reject relations that once
-	 * had such but don't anymore.
-	 * But we don't really care, because this
-	 * whole business of converting relations to views is just an obsolete
-	 * kluge to allow dump/reload of views that participate in circular
-	 * dependencies.)
-	 */
-	if (event_relation->rd_rel->relkind != RELKIND_VIEW &&
-		event_relation->rd_rel->relkind != RELKIND_MATVIEW)
-	{
-		TableScanDesc scanDesc;
-		Snapshot	snapshot;
-		TupleTableSlot *slot;
-
-		if (event_relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("cannot convert partitioned table \"%s\" to a view",
-							RelationGetRelationName(event_relation))));
-
-		/* only case left: */
-		Assert(event_relation->rd_rel->relkind == RELKIND_RELATION);
-
-		if (event_relation->rd_rel->relispartition)
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("cannot convert partition \"%s\"
Re: Bug in wait time when waiting on nested subtransaction
Robert Haas writes:
> That's not what "running" means to me. Running means it's started and hasn't yet committed or rolled back.

A subxact definitely can't be considered committed until its topmost parent commits. However, it could be known to be rolled back before its parent. IIUC, the current situation is that we don't take advantage of the latter case but just wait for the topmost parent.

One thing we need to be pretty careful of here is to not break the promise of atomic commit. At topmost commit, all subxacts must appear committed simultaneously. It's not quite clear to me whether we need a similar guarantee in the rollback case. It seems like we shouldn't, but maybe I'm missing something, in which case maybe the current behavior is correct?

regards, tom lane
Re: fixing CREATEROLE
> On Nov 28, 2022, at 11:34 AM, David G. Johnston wrote:
>
> No Defaults needed: David J., Mark?, Tom?

As Robert has the patch organized, I think defaults are needed, but I see that as a strike against the patch.

> Defaults needed - attached to role directly: Robert
> Defaults needed - defined within Default Privileges: Walther?
>
> The capability itself seems orthogonal to the rest of the patch to track these details better. I think we can "Fix CREATEROLE" without any feature regarding optional default behaviors and would suggest this patch be so limited and that another thread be started for discussion of (assuming a default specifying mechanism is wanted overall) how it should look. Let's not let a usability debate distract us from fixing a real problem.

In Robert's initial email, he wrote, "It seems to me that the root of any fix in this area must be to change the rule that CREATEROLE can administer any role whatsoever." The obvious way to fix that is to revoke that rule and instead automatically grant ADMIN OPTION to a creator over any role they create. That's problematic, though, because as things stand, ADMIN OPTION is granted to somebody by granting them membership in the administered role WITH ADMIN OPTION, so membership in the role and administration of the role are conflated.

Robert's patch tries to deal with the (possibly unwanted) role membership by setting up defaults to mitigate the effects, but that is more confusing to me than just de-conflating role membership from role administration, and giving role creators administration over roles they create, without in so doing giving them role membership. I don't recall enough details about how hard it is to de-conflate role membership from role administration, and maybe that's a non-starter for reasons I don't recall at the moment. I expect Robert has already contemplated that idea and instead proposed this patch for good reasons. Robert?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: fixing CREATEROLE
Mark Dilger:
> Robert's patch tries to deal with the (possibly unwanted) role membership by setting up defaults to mitigate the effects, but that is more confusing to me than just de-conflating role membership from role administration, and giving role creators administration over roles they create, without in so doing giving them role membership. I don't recall enough details about how hard it is to de-conflate role membership from role administration, and maybe that's a non-starter for reasons I don't recall at the moment.

Isn't this just GRANT .. WITH SET FALSE, INHERIT FALSE, ADMIN TRUE? That should allow role administration, without actually granting membership in that role, yet, right?

Best, Wolfgang
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 3:02 PM Mark Dilger wrote:
> Robert's patch tries to deal with the (possibly unwanted) role membership by setting up defaults to mitigate the effects, but that is more confusing to me than just de-conflating role membership from role administration, and giving role creators administration over roles they create, without in so doing giving them role membership. I don't recall enough details about how hard it is to de-conflate role membership from role administration, and maybe that's a non-starter for reasons I don't recall at the moment. I expect Robert has already contemplated that idea and instead proposed this patch for good reasons. Robert?

"De-conflating role membership from role administration" isn't really a specific proposal that someone can go out and implement. You have to make some decision about *how* you are going to separate those concepts. And that's what I did: I made INHERIT and SET into grant-level options. That allows you to give someone access to the privileges of a role without the ability to administer it (at least one of INHERIT and SET true, and ADMIN false), or the ability to administer a role without having any direct access to its privileges (INHERIT FALSE, SET FALSE, ADMIN TRUE). I don't see that we can, or need to, separate things any more than that.

You can argue that a grant with INHERIT FALSE, SET FALSE, ADMIN TRUE still grants membership, and I think formally that's true, but I also think it's just picking something to bicker about. The need isn't to separate membership per se from administration. It's to separate privilege inheritance and the ability to SET ROLE from role administration. And I've done that.

I strongly disagree with the idea that the ability for users to control defaults here isn't needed. You can set a default tablespace for your database, and a default tablespace for your session, and a default tablespace for new partitions of an existing partitioned table.
You can set default privileges for every type of object you can create, and a default search path to find objects in the database. You can set defaults for all of your connection parameters to the database using environment variables, and the default data directory for commands that need one. You can set defaults for all of your psql settings in ~/.psqlrc. You can set defaults for the character sets, locales and collations of new databases. You can set the default version of an extension in the control file, so that the user doesn't have to specify a version. And so on and so on.

There's absolutely scads of things for which it is useful to be able to set defaults and for which we give people the ability to set defaults, and I don't think anyone is making a real argument for why that isn't also true here. The argument that has been made is essentially that you could get by without it, but that's true of *every* default. Yet we keep adding the ability to set defaults for new things, and to set the defaults for existing things in new ways, and there's a very good reason for that: it's extremely convenient. And that's true here, too.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: fixing CREATEROLE
> On Nov 28, 2022, at 12:08 PM, walt...@technowledgy.de wrote:
>
> Isn't this just GRANT .. WITH SET FALSE, INHERIT FALSE, ADMIN TRUE? That should allow role administration, without actually granting membership in that role, yet, right?

Can you clarify what you mean here? Are you inventing a new syntax?

+GRANT bob TO alice WITH SET FALSE, INHERIT FALSE, ADMIN TRUE;
+ERROR:  syntax error at or near "SET"
+LINE 1: GRANT bob TO alice WITH SET FALSE, INHERIT FALSE, ADMIN TRUE...

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi,

On 2022-11-28 18:34:20 +0300, Nikita Malakhov wrote:
> While working on Pluggable TOAST we've detected a defective behavior on tables with large amounts of TOASTed data - queries freeze and DB stalls. Further investigation led us to the loop with GetNewOidWithIndex function call - when all available Oid already exist in the related TOAST table this loop continues infinitely. Data type used for value ID is the UINT32, which is unsigned int and has a maximum value of *4294967295* which allows maximum 4294967295 records in the TOAST table. It is not a very big amount for modern databases and is a major problem for production systems.

I don't think the absolute number is the main issue - by default external toasting will happen only for bigger datums. 4 billion external datums typically use a lot of space.

If you hit this easily with your patch, then you likely broke the conditions under which external toasting happens.

IMO the big issue is the global oid counter making it much easier to hit oid wraparound. Due to that we end up assigning oids that conflict with existing toast oids much sooner than 4 billion toasted datums.

I think the first step to improve the situation is to not use a global oid counter for toasted values. One way to do that would be to use the sequence code to do oid assignment, but we likely can find a more efficient representation.

Eventually we should do the obvious thing and make toast ids 64bit wide - to combat the space usage we likely should switch to representing the ids as variable width integers or such, otherwise the space increase would likely be prohibitive.

> Quick fix for this problem is limiting GetNewOidWithIndex loops to some reasonable amount defined by a related macro and returning an error if there is still no available Oid. Please check the attached patch; any feedback is appreciated.

This feels like the wrong spot to tackle the issue.
For one, most of the looping will be in GetNewOidWithIndex(), so limiting looping in toast_save_datum() won't help much. For another, if the limiting were in the right place, it'd break currently working cases: due to oid wraparound it's pretty easy to hit "ranges" of allocated oids without even getting close to 2^32 toasted datums.

Greetings,

Andres Freund
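To make the space argument concrete, here is a minimal sketch of one way "variable width integers" could work — an unsigned LEB128-style encoding with 7 payload bits per byte and the high bit as a continuation flag. This is only an illustration of the general technique (in Python rather than C, and not a proposal for an actual on-disk format):

```python
# Hedged sketch of a variable-width encoding for 64-bit ids: small ids
# stay small on disk, while the full 64-bit range remains representable.

def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, LSB group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode the encoding produced by encode_varint()."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")

# An id below 2^28 costs at most 4 bytes -- no wider than today's 4-byte
# Oid -- while ids up to 2^64 - 1 fit in at most 10 bytes:
assert len(encode_varint(100)) == 1
assert len(encode_varint(2**28 - 1)) == 4
assert decode_varint(encode_varint(2**63)) == 2**63
```

Under this kind of scheme the common case (modest per-table id counters) pays little or no space penalty relative to a fixed 4-byte id, which is the trade-off being alluded to above.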
Re: fixing CREATEROLE
> On Nov 28, 2022, at 12:33 PM, Mark Dilger wrote:
>
>> Isn't this just GRANT .. WITH SET FALSE, INHERIT FALSE, ADMIN TRUE? That should allow role administration, without actually granting membership in that role, yet, right?
>
> Can you clarify what you mean here? Are you inventing a new syntax?

Nevermind. After reading Robert's email, it's clear enough what you mean here.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi!

We've already encountered this issue on large production databases, and 4 billion rows is not so much for modern databases, so this issue already arises from time to time and would arise more and more often. I agree that the global oid counter is the main issue, and a better solution would be local counters with a larger datatype for the value id. This is the right way to solve this issue, although it would take some time. As I understand it, the global counter was chosen because it looked like the fastest way of getting a unique ID. Ok, I'll prepare a patch with it.

> Due to that we end up assigning oids that conflict with existing toast oids much sooner than 4 billion toasted datums.

Just a note: the global oid is checked against the related TOAST table only, so equal oids in different TOAST tables would not collide.

> Eventually we should do the obvious thing and make toast ids 64bit wide - to combat the space usage we likely should switch to representing the ids as variable width integers or such, otherwise the space increase would likely be prohibitive.

I'm already working on it, but I thought that a 64-bit value ID won't be easily accepted by the community. I'd be very thankful for any advice on this.

On Mon, Nov 28, 2022 at 11:36 PM Andres Freund wrote:
> Hi,
>
> On 2022-11-28 18:34:20 +0300, Nikita Malakhov wrote:
> > While working on Pluggable TOAST we've detected a defective behavior on tables with large amounts of TOASTed data - queries freeze and DB stalls. Further investigation led us to the loop with GetNewOidWithIndex function call - when all available Oid already exist in the related TOAST table this loop continues infinitely. Data type used for value ID is the UINT32, which is unsigned int and has a maximum value of *4294967295* which allows maximum 4294967295 records in the TOAST table. It is not a very big amount for modern databases and is the major problem for productive systems.
>
> I don't think the absolute number is the main issue - by default external toasting will happen only for bigger datums. 4 billion external datums typically use a lot of space.
>
> If you hit this easily with your patch, then you likely broke the conditions under which external toasting happens.
>
> IMO the big issue is the global oid counter making it much easier to hit oid wraparound. Due to that we end up assigning oids that conflict with existing toast oids much sooner than 4 billion toasted datums.
>
> I think the first step to improve the situation is to not use a global oid counter for toasted values. One way to do that would be to use the sequence code to do oid assignment, but we likely can find a more efficient representation.
>
> Eventually we should do the obvious thing and make toast ids 64bit wide - to combat the space usage we likely should switch to representing the ids as variable width integers or such, otherwise the space increase would likely be prohibitive.
>
> > Quick fix for this problem is limiting GetNewOidWithIndex loops to some reasonable amount defined by related macro and returning error if there is still no available Oid. Please check attached patch, any feedback is appreciated.
>
> This feels like the wrong spot to tackle the issue. For one, most of the looping will be in GetNewOidWithIndex(), so limiting looping in toast_save_datum() won't help much. For another, if the limiting were in the right place, it'd break currently working cases. Due to oid wraparound it's pretty easy to hit "ranges" of allocated oids, without even getting close to 2^32 toasted datums.
>
> Greetings,
>
> Andres Freund

--
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Sun, Nov 27, 2022 at 11:26:50AM -0800, Andrey Borodin wrote: > Some funny stuff. If a user tries to cancel a non-replicated transaction > Azure Postgres will answer: "user requested cancel while waiting for > synchronous replication ack. The COMMIT record has already flushed to > WAL locally and might not have been replicatead to the standby. We > must wait here." > AWS RDS will answer: "ignoring request to cancel wait for synchronous > replication" > Yandex Managed Postgres will answer: "canceling wait for synchronous > replication due requested, but cancelation is not allowed. The > transaction has already committed locally and might not have been > replicated to the standby. We must wait here." > > So, for many services providing Postgres as a service it's only a > matter of wording. Wow, you are telling me all three cloud vendors changed how query cancel behaves on an unresponsive synchronous replica? That is certainly a testament that the community needs to change or at least review our behavior. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Mon, Nov 28, 2022 at 12:03:06PM +0530, Bharath Rupireddy wrote: > Thanks for verifying the behaviour. And many thanks for an off-list chat. > > FWIW, I'm planning to prepare a patch as per the below idea, which is > something similar to the initial proposal in this thread. Meanwhile, > thoughts are welcome. > > 1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for > sync replication acknowledgement. > 2. Process proc die immediately when a backend is waiting for sync > replication acknowledgement, as it does today; however, upon restart, > don't open up for business (don't accept read-only connections) > unless the sync standbys have caught up. You can prepare a patch, but it is unlikely to get much interest until you get agreement on what the behavior should be. The optimal order of developer actions is: Desirability -> Design -> Implement -> Test -> Review -> Commit https://wiki.postgresql.org/wiki/Todo#Development_Process Telling us what other cloud vendors do is not sufficient. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi, On 2022-11-28 23:54:53 +0300, Nikita Malakhov wrote: > We've already encountered this issue on large production databases, and > 4 billion rows is not so much for modern bases, so this issue already arises > from time to time and would arise more and more often. Was the issue that you exceeded 4 billion toasted datums, or that assignment took a long time? How many toast datums did you actually have? Was this due to very wide rows leading to even small datums getting toasted? Greetings, Andres Freund
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Andres Freund writes: > I think the first step to improve the situation is to not use a global oid > counter for toasted values. One way to do that would be to use the sequence > code to do oid assignment, but we likely can find a more efficient > representation. I don't particularly buy that, because the only real fix is this: > Eventually we should do the obvious thing and make toast ids 64bit wide and if we do that there'll be no need to worry about multiple counters. > - to > combat the space usage we likely should switch to representing the ids as > variable width integers or such, otherwise the space increase would likely be > prohibitive. And I don't buy that either. An extra 4 bytes with a 2K payload is not "prohibitive", it's more like "down in the noise". I think if we switch to int8 keys and widen the global OID counter to 8 bytes (using just the low 4 bytes for other purposes), we'll have a perfectly fine solution. There is still plenty of work to be done under that plan, because of the need to maintain backward compatibility for existing TOAST tables --- and maybe people would want an option to keep on using them, for non-enormous tables? If we add additional work on top of that, it'll just mean that it will take longer to have any solution. >> Quick fix for this problem is limiting GetNewOidWithIndex loops to some >> reasonable amount defined by related macro and returning error if there is >> still no available Oid. Please check attached patch, any feedback is >> appreciated. > This feels like the wrong spot to tackle the issue. Yeah, that is completely horrid. It does not remove the existing failure mode, just changes it to have worse consequences. regards, tom lane
Re: Failed Assert in pgstat_assoc_relation
Hi, On 2022-11-28 10:50:13 -0800, Andres Freund wrote: > On 2022-11-28 13:37:16 -0500, Tom Lane wrote: > > Uh-huh. I've not bothered to trace this in detail, but presumably > > what is happening is that the first CREATE RULE converts the table > > to a view, and then the ROLLBACK undoes that so far as the catalogs > > are concerned, but probably doesn't undo related pg_stats state > > changes fully. Then we're in a bad state that will cause problems. > > (It still crashes if you replace the second CREATE RULE with > > "select * from t".) > > Yea. I haven't yet fully traced through this, but presumably relcache inval > doesn't fix this because we don't want to lose pending stats after DDL. > > Perhaps we need to add a rule about not swapping pgstat* in > RelationClearRelation() when relkind changes? Something like the attached. Still needs a bit of polish, e.g. adding the test case from above. I'm a bit uncomfortable adding a function call below * Perform swapping of the relcache entry contents. Within this * process the old entry is momentarily invalid, so there *must* be no * possibility of CHECK_FOR_INTERRUPTS within this sequence. Do it in * all-in-line code for safety. but it's not the first, see MemoryContextSetParent().
Greetings, Andres Freund

diff --git i/src/backend/utils/activity/pgstat_relation.c w/src/backend/utils/activity/pgstat_relation.c
index f92e16e7af8..1158e09c24f 100644
--- i/src/backend/utils/activity/pgstat_relation.c
+++ w/src/backend/utils/activity/pgstat_relation.c
@@ -158,7 +158,7 @@ pgstat_unlink_relation(Relation rel)
 		return;
 
 	/* link sanity check */
-	Assert(rel->pgstat_info->relation == rel);
+	Assert(rel->pgstat_info->relation->rd_rel->oid == rel->rd_rel->oid);
 	rel->pgstat_info->relation = NULL;
 	rel->pgstat_info = NULL;
 }
diff --git i/src/backend/utils/cache/relcache.c w/src/backend/utils/cache/relcache.c
index bd6cd4e47b5..69237025c20 100644
--- i/src/backend/utils/cache/relcache.c
+++ w/src/backend/utils/cache/relcache.c
@@ -2626,6 +2626,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 		bool		keep_rules;
 		bool		keep_policies;
 		bool		keep_partkey;
+		bool		keep_pgstats;
 
 		/* Build temporary entry, but don't link it into hashtable */
 		newrel = RelationBuildDesc(save_relid, false);
@@ -2666,6 +2667,8 @@
 		keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
 		/* partkey is immutable once set up, so we can always keep it */
 		keep_partkey = (relation->rd_partkey != NULL);
+		/* keep stats, except when converting tables into views etc */
+		keep_pgstats = relation->rd_rel->relkind == newrel->rd_rel->relkind;
 
 		/*
 		 * Perform swapping of the relcache entry contents.  Within this
@@ -2720,9 +2723,16 @@ RelationClearRelation(Relation relation, bool rebuild)
 		SWAPFIELD(RowSecurityDesc *, rd_rsdesc);
 		/* toast OID override must be preserved */
 		SWAPFIELD(Oid, rd_toastoid);
+
 		/* pgstat_info / enabled must be preserved */
-		SWAPFIELD(struct PgStat_TableStatus *, pgstat_info);
-		SWAPFIELD(bool, pgstat_enabled);
+		if (keep_pgstats)
+		{
+			SWAPFIELD(struct PgStat_TableStatus *, pgstat_info);
+			SWAPFIELD(bool, pgstat_enabled);
+		}
+		else
+			pgstat_unlink_relation(relation);
+
 		/* preserve old partition key if we have one */
 		if (keep_partkey)
 		{
Re: Add 64-bit XIDs into PostgreSQL 15
On Mon, Nov 28, 2022 at 8:53 AM Robert Haas wrote: > It is true that if the table is progressively bloating, it is likely > to be more bloated by the time you are 8 billion XIDs behind than it > was when you were 800 million XIDs behind. I don't see that as a very > good reason not to adopt this patch, because you can bloat the table > by an arbitrarily large amount while consuming only a small number of > XiDs, even just 1 XID. Protecting against bloat is good, but shutting > down the database when the XID age reaches a certain value is not a > particularly effective way of doing that, so saying that we'll be > hurting people by not shutting down the database at the point where we > do so today doesn't ring true to me. I can't speak for Chris, but I think that almost everybody will agree on this much, without really having to think about it. It's easy to see that having more XID space is, in general, strictly a good thing. If there was a low risk way of getting that benefit, then I'd be in favor of it. Here's the problem that I see with this patch: I don't think that the risks are commensurate with the benefits. I can imagine being in favor of an even more invasive patch that (say) totally removes the concept of freezing, but that would have to be a very different sort of design. > Philosophically, I disagree with the idea of shutting down the > database completely in any situation in which a reasonable alternative > exists. Losing read and write availability is really bad, and I don't > think it's what users want. At a certain point it may make more sense to activate XidStopLimit protections (which will only prevent new XID allocations) instead of getting further behind on freezing, even in a world where we're never strictly obligated to activate XidStopLimit. It may in fact be the lesser evil, even with 64-bit XIDs -- because we still have to freeze, and the factors that drive when and how we freeze mostly aren't changed. 
Fundamentally, when we're falling behind on freezing, at a certain point we can expect to keep falling behind -- unless some kind of major shift happens. That's just how freezing works, with or without 64-bit XIDs/MXIDs. If VACUUM isn't keeping up with the allocation of transactions, then the system is probably misconfigured in some way. We should do our best to signal this as early and as frequently as possible, and we should mitigate specific hazards (e.g. old temp tables) if at all possible. We should activate the failsafe (which, incidentally, the patch just removes) when things really start to look dicey. These mitigations may be very effective, but in the final analysis they don't address the fundamental weakness in freezing. Granted, the specifics of the current XidStopLimit mechanism are unlikely to directly carry over to 64-bit XIDs. XidStopLimit is structured in a way that doesn't actually consider freeze debt in units like unfrozen pages. Like Chris, I just don't see why the patch obviates the need for something like XidStopLimit, since the patch doesn't remove freezing. An improved XidStopLimit mechanism might even end up kicking in *before* the oldest relfrozenxid reached 2 billion XIDs, depending on the specifics. Removing the failsafe mechanism seems misguided to me for similar reasons. I recently learned that Amazon RDS has set a *lower* vacuum_failsafe_age default than the standard default (lowering it from the standard 1.6 billion to only 1.2 billion on RDS). This decision predates my joining AWS. It seems as if practical experience has shown that allowing any table's age(relfrozenxid) to get too far past a billion is not a good idea. At least it's not a good idea on modern Postgres versions, which have the freeze map. We really shouldn't have to rely on having billions of XIDs available in the first place -- XID space isn't really a fungible commodity.
It's much more important to recognize that something (some specific thing) just isn't working as designed, which in general could be pretty far removed from freezing. For example, index corruption could do it (at least without the failsafe). Some kind of autovacuum starvation could do it. It's almost always more complicated than "not enough available XID space". -- Peter Geoghegan
Re: Bug in wait time when waiting on nested subtransaction
On Mon, Nov 28, 2022 at 3:01 PM Tom Lane wrote: > One thing we need to be pretty careful of here is to not break the > promise of atomic commit. At topmost commit, all subxacts must > appear committed simultaneously. It's not quite clear to me whether > we need a similar guarantee in the rollback case. It seems like > we shouldn't, but maybe I'm missing something, in which case maybe > the current behavior is correct? AFAICS, the behavior that Simon is complaining about here is an exception to the way things work in general, and therefore his complaint is well-founded. AbortSubTransaction() arranges to release resources, whereas CommitSubTransaction() arranges to have them transferred to the parent transaction. This isn't necessarily obvious from looking at those functions, but if you drill down into the AtEO(Sub)Xact functions that they call, that's what happens in nearly all cases. Given that subtransaction abort releases resources immediately, it seems pretty fair to wonder what the value is in waiting for its parent or the topmost transaction. I don't see how that can be necessary for correctness. The commit to which Simon points (3c27944fb2141) seems to have inadvertently changed the behavior while trying to fix a bug and improve performance. I remember being a bit skeptical about that fix at the time. Back in the day, you couldn't XactLockTableWait() unless you knew that the transaction had already started. That commit tried to make it so that you could XactLockTableWait() earlier, because that's apparently something that logical replication needs to do. But that is a major redefinition of the charter of that function, and I am wondering whether it was a mistake to fold together the thing that we need in normal cases (which is to wait for a transaction we know has started and may not have finished) with the thing we need in the logical decoding case (which apparently has different requirements).
Maybe we should have solved that problem by finding a way to wait for the transaction to start, and then afterwards wait for it to end. Or maybe we should have altogether different entrypoints for the two requirements. Or maybe using one function is fine but we just need it to be more correct. I'm not really sure. In short, I think Simon is right that there's a problem and right about which commit caused it, but I'm not sure what I think we ought to do about it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 1:28 PM Robert Haas wrote: > On Mon, Nov 28, 2022 at 3:02 PM Mark Dilger > wrote: > > You can argue that a grant with INHERIT FALSE, SET FALSE, ADMIN TRUE > still grants membership, and I think formally that's true, but I also > think it's just picking something to bicker about. The need isn't to > separate membership per se from administration. It's to separate > privilege inheritance and the ability to SET ROLE from role > administration. And I've done that. > We seem to now be in agreement on this design choice, and the related bit about the bootstrap superuser being granted admin on roles newly created by the createrole user. This seems like a patch in its own right. It still leaves open the default membership behavior as well as whether we want to rework the attributes into predefined roles. > I strongly disagree with the idea that the ability for users to > control defaults here isn't needed. That's fine, but are you saying it is infeasible (or simply undesirable) to separate the parts about handling defaults from the parts that define how the system works with a given set of permissions, and from the one implementation detail of having the bootstrap superuser automatically granted admin on any roles a createrole user creates? If you and others feel strongly about defaults, I'm sure that the suggested other thread focused on them will get attention and be committed in a timely manner. But the system will work, and not be broken, if that gets stalled, and defaults could be added in later. David J.
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi, On 2022-11-28 16:04:12 -0500, Tom Lane wrote: > Andres Freund writes: > > - to > > combat the space usage we likely should switch to representing the ids as > > variable width integers or such, otherwise the space increase would likely > > be > > prohibitive. > > And I don't buy that either. An extra 4 bytes with a 2K payload is not > "prohibitive", it's more like "down in the noise". The space usage for the toast relation + index itself is indeed irrelevant. Where it's not "down in the noise" is in struct varatt_external, i.e. references to external toast datums. The size of that already is an issue. Greetings, Andres Freund
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi, Andres Freund writes: >Was the issue that you exceeded 4 billion toasted datums, or that assignment >took a long time? How many toast datums did you actually have? Was this due to >very wide rows leading to even small datums getting toasted? Yep, we had 4 billion toasted datums. Note that currently a relation has a single TOAST table for all toastable attributes, so it is not that difficult to get to 4 billion toasted values. >I think if we switch to int8 keys and widen the global OID counter to 8 >bytes (using just the low 4 bytes for other purposes), we'll have a >perfectly fine solution. There is still plenty of work to be done under >that plan, because of the need to maintain backward compatibility for >existing TOAST tables --- and maybe people would want an option to keep on >using them, for non-enormous tables? If we add additional work on top of >that, it'll just mean that it will take longer to have any solution. I agree, but: 1) The global OID counter is used not only for TOAST, so there would be a lot of places where the short counter (the low part of the 64-bit OID, if we go with that) is used; 2) Upgrading to a 64-bit ID would require re-toasting old TOAST tables. Or is there some way to distinguish old tables from new ones? But I don't see any reason to keep an old short ID as an option. ... >Yeah, that is completely horrid. It does not remove the existing failure >mode, just changes it to have worse consequences. On Tue, Nov 29, 2022 at 12:04 AM Tom Lane wrote: > Andres Freund writes: > > I think the first step to improve the situation is to not use a global > oid > > counter for toasted values. One way to do that would be to use the > sequence > > code to do oid assignment, but we likely can find a more efficient > > representation. > > I don't particularly buy that, because the only real fix is this: > > > Eventually we should do the obvious thing and make toast ids 64bit wide > > and if we do that there'll be no need to worry about multiple counters.
> > > - to > > combat the space usage we likely should switch to representing the ids as > > variable width integers or such, otherwise the space increase would > likely be > > prohibitive. > > And I don't buy that either. An extra 4 bytes with a 2K payload is not > "prohibitive", it's more like "down in the noise". > > I think if we switch to int8 keys and widen the global OID counter to 8 > bytes (using just the low 4 bytes for other purposes), we'll have a > perfectly fine solution. There is still plenty of work to be done under > that plan, because of the need to maintain backward compatibility for > existing TOAST tables --- and maybe people would want an option to keep on > using them, for non-enormous tables? If we add additional work on top of > that, it'll just mean that it will take longer to have any solution. > > >> Quick fix for this problem is limiting GetNewOidWithIndex loops to some > >> reasonable amount defined by related macro and returning error if there > is > >> still no available Oid. Please check attached patch, any feedback is > >> appreciated. > > > This feels like the wrong spot to tackle the issue. > > Yeah, that is completely horrid. It does not remove the existing failure > mode, just changes it to have worse consequences. > > regards, tom lane > -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
Re: Patch: Global Unique Index
On Fri, Nov 25, 2022 at 05:03:06PM -0800, David Zhang wrote: > Hi Bruce, > > Thank you for helping review the patches in such detail. > > On 2022-11-25 9:48 a.m., Bruce Momjian wrote: > > Looking at the patch, I am unclear how the patch prevents concurrent > duplicate value insertion during the partitioned index checking. I am > actually not sure how that can be done without locking all indexes or > inserting placeholder entries in all indexes. (Yeah, that sounds bad, > unless I am missing something.) > > For the uniqueness check across all partitions, we tried to follow the > implementation of the uniqueness check on a single partition, and added a loop > to check uniqueness on other partitions after the index tuple has been inserted > into the current index partition but before this index tuple has been made > visible. The uniqueness check will wait in `XactLockTableWait` if there is a valid > transaction in progress, and performs the uniqueness check again after the > in-progress transaction has finished. I can't see why this wouldn't work, but I also can't think of any cases where we do this in our code already, so it will need careful consideration. We kind of do this for UPDATE and unique key conflicts, but only for a single index entry, where we peek and sleep on pending changes, but not across indexes. > I am not quite sure if this is a proper way to deal with a deadlock in this > case. We would be grateful if someone could help provide some cases/methods > to verify this cross-partition uniqueness. I assume you are using our existing deadlock detection code, and just sleeping in various indexes and expecting deadlock detection to happen. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: Slow standby snapshot
Michail Nikolaev writes: >> * when idle - since we have time to do it when that happens, which >> happens often since most workloads are bursty > I have added getting of ProcArrayLock for this case. That seems like a fairly bad idea: it will add extra contention on ProcArrayLock, and I see no real strong argument that the path can't get traversed often enough for that to matter. It would likely be better for KnownAssignedXidsCompress to obtain the lock for itself, only after it knows there is something worth doing. (This ties back to previous discussion: the comment claiming it's safe to read head/tail because we have the lock is misguided. It's safe because we're the only process that changes them. So we can make the heuristic decision before acquiring lock.) While you could use the "reason" code to decide whether you need to take the lock, it might be better to add a separate boolean argument specifying whether the caller already has the lock. Beyond that, I don't see any issues except cosmetic ones. > Also, I think while we still in the context, it is good to add: > * Simon's optimization (1) for KnownAssignedXidsRemoveTree (it is > simple and effective for some situations without any seen drawbacks) > * Maybe my patch (2) for replacing known_assigned_xids_lck with memory > barrier? Doesn't seem like we have any hard evidence in favor of either of those being worth doing. We especially haven't any evidence that they'd still be worth doing after this patch. I'd be willing to make the memory barrier change anyway, because that seems like a simple change that can't hurt. I'm less enthused about the patch at (1) because of the complexity it adds. regards, tom lane
Re: Add 64-bit XIDs into PostgreSQL 15
On Mon, Nov 28, 2022 at 4:09 PM Peter Geoghegan wrote: > Granted, the specifics of the current XidStopLimit mechanism are > unlikely to directly carry over to 64-bit XIDs. XidStopLimit is > structured in a way that doesn't actually consider freeze debt in > units like unfrozen pages. Like Chris, I just don't see why the patch > obviates the need for something like XidStopLimit, since the patch > doesn't remove freezing. An improved XidStopLimit mechanism might even > end up kicking in *before* the oldest relfrozenxid reached 2 billion > XIDs, depending on the specifics. What is the purpose of using 64-bit XIDs, if not to avoid having to stop the world when we run short of XIDs? I'd say that if this patch, or any patch with broadly similar goals, fails to remove xidStopLimit, it might as well not exist. xidStopLimit is not a general defense against falling too far behind on freezing, and in general, there is no reason to think that we need such a defense. xidStopLimit is a defense against reusing the same XID and thus causing irreversible database corruption. When that possibility no longer exists, it has outlived its usefulness and we should be absolutely delighted to bid it adieu. It seems like you and Chris are proposing the moral equivalent of paying off your home loan but still sending a check to the mortgage company every month just to be sure they don't get mad. -- Robert Haas EDB: http://www.enterprisedb.com
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian wrote: > > You can prepare a patch, but it unlikely to get much interest until you > get agreement on what the behavior should be. We discussed the approach on 2020's Unconference [0]. And there kind of was an agreement. Then I made a presentation on FOSDEM with all the details [1]. The patch had been on commitfest since 2019 [2]. There were reviewers in the CF entry, and we kind of had an agreement. Jeff Davis proposed a similar patch [3]. And we certainly agreed about cancels. And now Bharath is proposing the same. We have the interest and agreement. Best regards, Andrey Borodin. [0] https://wiki.postgresql.org/wiki/PgCon_2020_Developer_Unconference/Edge_cases_of_synchronous_replication_in_HA_solutions [1] https://archive.fosdem.org/2021/schedule/event/postgresql_caveats_of_replication/attachments/slides/4365/export/events/attachments/postgresql_caveats_of_replication/slides/4365/sides.pdf [2] https://commitfest.postgresql.org/26/2402/ [3] https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel%40j-davis.com#415dc2f7d41b8a251b419256407bb64d
Re: Failed Assert in pgstat_assoc_relation
Andres Freund writes: > Something like the attached. Still needs a bit of polish, e.g. adding the test > case from above. > I'm a bit uncomfortable adding a function call below > * Perform swapping of the relcache entry contents. Within this > * process the old entry is momentarily invalid, so there *must* be no > * possibility of CHECK_FOR_INTERRUPTS within this sequence. Do it in > * all-in-line code for safety. Ugh. I don't know what pgstat_unlink_relation does, but assuming that it can never throw an error seems like a pretty bad idea, especially when you aren't adding that to its API spec (contrast the comments for MemoryContextSetParent). Can't that part be done outside the critical section? regards, tom lane
Re: CF 2022-11: entries "Ready for Committer" with recent activity
On Sun, Nov 27, 2022 at 01:29:18PM +0900, Ian Lawrence Barwick wrote: > Transaction Management docs (2022-11-23) > > > - https://commitfest.postgresql.org/40/3899/ > - > https://www.postgresql.org/message-id/flat/canbhv-e_iy9fmrerxrch8tztyenpfo72hf_xd2hldppva4d...@mail.gmail.com > > Seems committable. I plan to commit this in a few hours. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Andres Freund writes: > On 2022-11-28 16:04:12 -0500, Tom Lane wrote: >> And I don't buy that either. An extra 4 bytes with a 2K payload is not >> "prohibitive", it's more like "down in the noise". > The space usage for the the the toast relation + index itself is indeed > irrelevant. Where it's not "down in the noise" is in struct varatt_external, > i.e. references to external toast datums. Ah, gotcha. Yes, the size of varatt_external is a problem. regards, tom lane
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi, On 2022-11-29 00:24:49 +0300, Nikita Malakhov wrote: > 2) Upgrading to 64-bit id would require re-toasting old TOAST tables. Or > there is some way to distinguish old tables from new ones? The catalog / relcache entry should suffice to differentiate between the two. Greetings, Andres Freund
Re: Add 64-bit XIDs into PostgreSQL 15
On Mon, Nov 28, 2022 at 04:30:22PM -0500, Robert Haas wrote: > What is the purpose of using 64-bit XIDs, if not to avoid having to > stop the world when we run short of XIDs? > > I'd say that if this patch, or any patch with broadly similar goals, > fails to remove xidStopLimit, it might as well not exist. > > xidStopLimit is not a general defense against falling too far behind > on freezing, and in general, there is no reason to think that we need > such a defense. xidStopLimit is a defense against reusing the same XID > and thus causing irreversible database corruption. When that > possibility no longer exists, it has outlived its usefulness and we > should be absolutely delighted to bid it adieu. > > It seems like you and Chris are proposing the moral equivalent of > paying off your home loan but still sending a check to the mortgage > company every month just to be sure they don't get mad. I think the problem is that we still have bloat with 64-bit XIDs, specifically pg_xact and pg_multixact files. Yes, that bloat is less serious, but it is still an issue worth reporting in the server logs, though not serious enough to stop the server from accepting write queries. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: Add 64-bit XIDs into PostgreSQL 15
On Mon, Nov 28, 2022 at 1:30 PM Robert Haas wrote: > What is the purpose of using 64-bit XIDs, if not to avoid having to > stop the world when we run short of XIDs? I agree that the xidStopLimit mechanism was designed with the specific goal of preventing "true" physical XID wraparound that results in wrong answers to queries. It was not written with the intention of limiting the accumulation of freeze debt, which would have to use units like unfrozen heap pages to make any sense. That is a historic fact -- no question. I think that it is nevertheless quite possible that just refusing to allocate any more XIDs could make sense with 64-bit XIDs, where we don't strictly have to in order to keep the system fully functional. That might still be the lesser evil, in that other world. The cutoff itself would depend on many workload details, I suppose. Imagine if we actually had 64-bit XIDs -- let's assume for a moment that it's a done deal. This raises a somewhat awkward question: do you just let the system get further and further behind on freezing, forever? We can all agree that 2 billion XIDs is very likely the wrong time to start refusing new XIDs -- because it just isn't designed with debt in mind. But what's the right time, if any? How much debt is too much? At the very least these seem like questions that deserve serious consideration. > I'd say that if this patch, or any patch with broadly similar goals, > fails to remove xidStopLimit, it might as well not exist. Well, it could in principle be driven by lots of different kinds of information, and make better decisions by actually targeting freeze debt in some way. An "enhanced version of xidStopLimit with 64-bit XIDs" could kick in far, far later than it currently would. Obviously that has some value. I'm not claiming to know how to build this "better xidStopLimit mechanism", by the way. I'm not seriously proposing it. Mostly I'm just saying that the question "where do you draw the line if not at 2 billion XIDs?"
is a very pertinent question. It is not a question that is made any less valid by the fact that we already know that 2 billion XIDs is pretty far from optimal in almost all cases. Having some limit based on something is likely to be more effective than having no limit based on nothing. Admittedly this argument works a lot better with the failsafe than it does with xidStopLimit. Both are removed by the patch. -- Peter Geoghegan
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Mon, Nov 28, 2022 at 01:31:39PM -0800, Andrey Borodin wrote: > On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian wrote: > > > > You can prepare a patch, but it unlikely to get much interest until you > > get agreement on what the behavior should be. > > We discussed the approach on 2020's Unconference [0]. And there kind > of was an agreement. > Then I made a presentation on FOSDEM with all the details [1]. > The patch had been on commitfest since 2019 [2]. There were reviewers > in the CF entry, and we kind of had an agreement. > Jeff Davis proposed a similar patch [3]. And we certainly agreed about > cancels. > And now Bharath is proposing the same. > > We have the interest and agreement. Okay, I was not aware we had such broad agreement. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Embrace your flaws. They make you human, rather than perfect, which you will never be.
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 4:19 PM David G. Johnston wrote: > That's fine, but are you saying this patch is incapable (or simply > undesirable) of having the parts about handling defaults separated out from > the parts that define how the system works with a given set of permissions; > and the one implementation detail of having the bootstrap superuser > automatically grant admin to any roles a createuser role creates? If you and > others feel strongly about defaults I'm sure that the suggested other thread > focused on that will get attention and be committed in a timely manner. But > the system will work, and not be broken, if that got stalled, and it could be > added in later. The topics are so closely intertwined that I don't believe that trying to have separate discussions will be useful or productive. There's no hope of anybody understanding 0004 or having an educated opinion about it without first understanding the earlier patches, and there's no requirement that someone has to review 0004, or like it, just because they review or like 0001-0003. But so far nobody has actually reviewed anything, and all that's happened is people have complained about 0004 for reasons which in my opinion are pretty nebulous and largely ignore the factors that caused it to exist in the first place. We had about 400 emails during the last release cycle arguing about a whole bunch of topics related to user management, and it became absolutely crystal clear in that discussion that Stephen Frost and David Steele wanted to have roles that could create other roles but not immediately be able to access their privileges. Mark and I, on the other hand, wanted to have roles that could create other roles WITH immediate access to their privileges. That argument was probably the main thing that derailed that entire patch set, which represented months of work by Mark. 
Now, I have come up with a competing patch set that for the price of 100 lines of code and a couple of slightly ugly option names can do either thing. So Stephen and David and any like-minded users can have what they want, and Mark and I and any like-minded users can have what we want. And the result is that I've got like five people, some of whom participated in those discussions, showing up to say "hey, we don't need the ability to set defaults." Well, if that's the case, then why did we have hundreds and hundreds of emails within the last 12 months arguing about which way it should work? -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Andres Freund writes: > On 2022-11-29 00:24:49 +0300, Nikita Malakhov wrote: >> 2) Upgrading to 64-bit id would require re-toasting old TOAST tables. Or >> there is some way to distinguish old tables from new ones? > The catalog / relcache entry should suffice to differentiate between the two. Yeah, you could easily look at the datatype of the first attribute (in either the TOAST table or its index) to determine what to do. As I said before, I think there's a decent argument that some people will want the option to stay with 4-byte TOAST OIDs indefinitely, at least for smaller tables. So even without the fact that forced conversions would be horridly expensive, we'll need to continue support for both forms of TOAST table. regards, tom lane
Re: Add 64-bit XIDs into PostgreSQL 15
On Mon, Nov 28, 2022 at 1:52 PM Bruce Momjian wrote: > I think the problem is that we still have bloat with 64-bit XIDs, > specifically pg_xact and pg_multixact files. Yes, that bloat is less > serious, but it is still an issue worth reporting in the server logs, > though not serious enough to stop the server from write queries. That's definitely a big part of it. Again, I don't believe that the idea is fundamentally without merit. Just that it's not worth it, given that having more XID space is very much not something that I think fixes most of the problems. And given the real risk of serious bugs with something this invasive. I believe that it would be more useful to focus on just not getting into trouble in the first place, as well as on mitigating specific problems that lead to the system reaching xidStopLimit in practice. I don't think that there is any good reason to allow datfrozenxid to go past about a billion. When it does, the interesting questions are questions about what went wrong, and how that specific failure can be mitigated in a fairly direct way. We've already used way too much "XID space runway", so why should using even more help? It might, I suppose, but it almost seems like a distraction to me, as somebody that wants to make things better for users in general. As long as the system continues to misbehave (in whatever way it happens to be misbehaving), why should any amount of XID space ever be enough? I think that we'll be able to get rid of freezing in a few years time. But as long as we have freezing, we have these problems. -- Peter Geoghegan
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi, On 2022-11-28 16:57:53 -0500, Tom Lane wrote: > As I said before, I think there's a decent argument that some people > will want the option to stay with 4-byte TOAST OIDs indefinitely, > at least for smaller tables. I think we'll need to do something about the width of varatt_external to make the conversion to 64bit toast oids viable - and if we do, I don't think there's a decent argument for staying with 4 byte toast OIDs. I think the varatt_external equivalent would end up being smaller in just about all cases. And as you said earlier, the increased overhead inside the toast table / index is not relevant compared to the size of toasted datums. Greetings, Andres Freund
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Hi, I'll check that tomorrow. If it is so then there won't be a problem keeping old tables without re-toasting. On Tue, Nov 29, 2022 at 1:10 AM Andres Freund wrote: > Hi, > > On 2022-11-28 16:57:53 -0500, Tom Lane wrote: > > As I said before, I think there's a decent argument that some people > > will want the option to stay with 4-byte TOAST OIDs indefinitely, > > at least for smaller tables. > > I think we'll need to do something about the width of varatt_external to > make > the conversion to 64bit toast oids viable - and if we do, I don't think > there's a decent argument for staying with 4 byte toast OIDs. I think the > varatt_external equivalent would end up being smaller in just about all > cases. > And as you said earlier, the increased overhead inside the toast table / > index > is not relevant compared to the size of toasted datums. > > Greetings, > > Andres Freund > -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
Andres Freund writes: > On 2022-11-28 16:57:53 -0500, Tom Lane wrote: >> As I said before, I think there's a decent argument that some people >> will want the option to stay with 4-byte TOAST OIDs indefinitely, >> at least for smaller tables. > And as you said earlier, the increased overhead inside the toast table / index > is not relevant compared to the size of toasted datums. Perhaps not. > I think we'll need to do something about the width of varatt_external to make > the conversion to 64bit toast oids viable - and if we do, I don't think > there's a decent argument for staying with 4 byte toast OIDs. I think the > varatt_external equivalent would end up being smaller in just about all cases. I agree that we can't simply widen varatt_external to use 8 bytes for the toast ID in all cases. Also, I now get the point about avoiding use of globally assigned OIDs here: if the counter starts from zero for each table, then a variable-width varatt_external could actually be smaller than currently for many cases. However, that bit is somewhat orthogonal, and it's certainly not required for fixing the basic problem. So it seems like the plan of attack ought to be: 1. Invent a new form or forms of varatt_external to allow different widths of the toast ID. Use the narrowest width possible for any given ID value. 2. Allow TOAST tables/indexes to store either 4-byte or 8-byte IDs. (Conversion could be done as a side effect of table-rewrite operations, perhaps.) 3. Localize ID selection so that tables can have small toast IDs even when other tables have many IDs. (Optional, could do later.) regards, tom lane
Re: [PATCH] Infinite loop while acquiring new TOAST Oid
I've missed the earlier point that "> 4 billion external datums > typically use a lot of space". Not quite so: it's 8 TB at the minimal size of toasted data (about 2 KB). In my practice, tables with more than 5 billion rows are nothing out of the ordinary (highly loaded databases with large amounts of data in use). On Tue, Nov 29, 2022 at 1:12 AM Nikita Malakhov wrote: > Hi, > > I'll check that tomorrow. If it is so then there won't be a problem keeping > old tables without re-toasting. > > On Tue, Nov 29, 2022 at 1:10 AM Andres Freund wrote: > >> Hi, >> >> On 2022-11-28 16:57:53 -0500, Tom Lane wrote: >> > As I said before, I think there's a decent argument that some people >> > will want the option to stay with 4-byte TOAST OIDs indefinitely, >> > at least for smaller tables. >> >> I think we'll need to do something about the width of varatt_external to >> make >> the conversion to 64bit toast oids viable - and if we do, I don't think >> there's a decent argument for staying with 4 byte toast OIDs. I think the >> varatt_external equivalent would end up being smaller in just about all >> cases. >> And as you said earlier, the increased overhead inside the toast table / >> index >> is not relevant compared to the size of toasted datums. >> >> Greetings, >> >> Andres Freund >> > > > -- > Regards, > Nikita Malakhov > Postgres Professional > https://postgrespro.ru/ > -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas
Bruce Momjian writes: > Uh, XTS doesn't require a nonce, so why are talking about nonces in this > thread? Because some other proposals do, or could, require a per-page nonce. After looking through this thread, I side with Robert: we should reject the remainder of this patch. It gives up page layout flexibility that we might want back someday. Moreover, I didn't see any hard evidence offered that there's any actual performance gain, let alone such a compelling gain that we should give up that flexibility for it. regards, tom lane
Re: Add 64-bit XIDs into PostgreSQL 15
On Mon, Nov 28, 2022 at 1:52 PM Peter Geoghegan wrote: > I'm not claiming to know how to build this "better xidStopLimit > mechanism", by the way. I'm not seriously proposing it. Mostly I'm > just saying that the question "where do you draw the line if not at 2 > billion XIDs?" is a very pertinent question. It is not a question that > is made any less valid by the fact that we already know that 2 billion > XIDs is pretty far from optimal in almost all cases. Having some limit > based on something is likely to be more effective than having no limit > based on nothing. > > Admittedly this argument works a lot better with the failsafe than it > does with xidStopLimit. Both are removed by the patch. Come to think of it, if you were going to do something like this it would probably work by throttling XID allocations, with a gradual ramp-up. It would rarely get to the point that the system refused to allocate XIDs completely. It's not fundamentally unreasonable to force the application to live within its means by throttling. Feedback that slows down the rate of writes is much more common in the LSM tree world, within systems like MyRocks [1], where the risk of the system being destabilized by debt is more pressing. As I said, I don't think that this is a particularly promising way of addressing problems with Postgres XID space exhaustion, since I believe that the underlying issue isn't usually a simple lack of "XID space slack capacity". But if you assume that I'm wrong here (if you assume that we very often don't have the ability to freeze lazily enough), then ISTM that throttling or feedback to stall new writes is a very reasonable option. In fact, it's practically mandatory. [1] https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit#slide=id.g8839c9d71b_0_79 -- Peter Geoghegan
Re: Reducing power consumption on idle servers
I found some more comments and some documentation to word-smith very lightly, and pushed. The comments were stray references to the trigger file. It's a little confusing because the remaining mechanism also uses a file, but it uses a signal first, so it seems better to refer to promotion requests rather than files. /me wonders how many idle servers there are in the world
Re: Report roles in pg_upgrade pg_ prefix check
On Mon, Nov 28, 2022 at 09:58:46AM +0100, Daniel Gustafsson wrote: > We are a bit inconsistent in how much details we include in the report > textfiles, so could do that without breaking any consistency in reporting. > Looking at other checks, the below format would match what we already do > fairly > well: > > (oid=) > > Done in the attached. WFM. Thanks! -- Michael
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 2:55 PM Robert Haas wrote: > On Mon, Nov 28, 2022 at 4:19 PM David G. Johnston > wrote: > > That's fine, but are you saying this patch is incapable (or simply > undesirable) of having the parts about handling defaults separated out from > the parts that define how the system works with a given set of permissions; > and the one implementation detail of having the bootstrap superuser > automatically grant admin to any roles a createuser role creates? If you > and others feel strongly about defaults I'm sure that the suggested other > thread focused on that will get attention and be committed in a timely > manner. But the system will work, and not be broken, if that got stalled, > and it could be added in later. > > The topics are so closely intertwined that I don't believe that trying > to have separate discussions will be useful or productive. There's no > hope of anybody understanding 0004 or having an educated opinion about > it without first understanding the earlier patches, and there's no > requirement that someone has to review 0004, or like it, just because > they review or like 0001-0003. > > But so far nobody has actually reviewed anything > Well, if that's the case, then why > did we have hundreds and hundreds of emails within the last 12 months > arguing about which way it should work? > > When y'all come to some final conclusion on how you want the defaults to look, come tell the rest of us. You already have 4 people debating the matter; I don't really see the point of adding more voices to that cacophony. As you noted, voicing an opinion about 0004 is optional. I'll reiterate my review from before, with a bit more confidence this time. 0001-0003 implements a desirable behavior change. In order for someone to make some other role a member in some third role, that someone must have admin privileges on both other roles. CREATEROLE is not exempt from this rule. 
A user with CREATEROLE will, upon creating a new role, be granted admin privilege on that role by the bootstrap superuser. The consequence of 0001-0003 in the current environment is that since the newly created CREATEROLE user will not have admin rights on any existing roles in the cluster, while they can create new roles in the system they are unable to grant those new roles membership in any other roles not also created by them. The ability to assign attributes to newly created roles is unaffected. As a unit of work, those are "ready-to-commit" for me. I'll leave it to you and others to judge the technical quality of the patch and finishing up the FIXMEs that have been noted. Desirable follow-on patches include: 1) Automatically install an additional membership grant, with the CREATEROLE user as the grantor, specifying INHERIT OR SET as TRUE (I personally favor attaching these to ALTER ROLE, modifiable only by oneself) 2) Convert Attributes into default roles David J.
Re: O(n) tasks cause lengthy startups and checkpoints
Okay, here is a new patch set. 0004 adds logic to prevent custodian tasks from delaying shutdown. I haven't added any logging for long-running tasks yet. Tasks might ordinarily take a while, so such logs wouldn't necessarily indicate something is wrong. Perhaps we could add a GUC for the amount of time to wait before logging. This feature would be off by default. Another option could be to create a log_custodian GUC that causes tasks to be logged when completed, similar to log_checkpoints. Thoughts? On Mon, Nov 28, 2022 at 01:37:01PM -0500, Robert Haas wrote: > On Mon, Nov 28, 2022 at 1:31 PM Andres Freund wrote: >> On 2022-11-28 13:08:57 +, Simon Riggs wrote: >> > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart >> > wrote: >> > > > Rather than explicitly use DEBUG1 everywhere I would have an >> > > > #define CUSTODIAN_LOG_LEVEL LOG >> > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one >> > > > line change in a later phase of Beta >> > > >> > > I can create a separate patch for this, but I don't think I've ever seen >> > > this sort of thing before. >> > >> > Much of recovery is coded that way, for the same reason. >> >> I think that's not a good thing to copy without a lot more justification than >> "some old code also does it that way". It's sometimes justified, but also >> makes code harder to read (one doesn't know what it does without looking up >> the #define, line length). > > Yeah. If people need some of the log messages at a higher level during > development, they can patch their own copies. > > I think there might be some argument for having a facility that lets > you pick subsystems or even individual messages that you want to trace > and pump up the log level for just those call sites. But I don't know > exactly what that would look like, and I don't think inventing one-off > mechanisms for particular cases is a good idea. Given this discussion, I haven't made any changes to the logging in the new patch set. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7fa5c047781dddedb1f9c5a4e96622a23c0c0835 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 5 Jan 2022 19:24:22 + Subject: [PATCH v15 1/4] Introduce custodian. The custodian process is a new auxiliary process that is intended to help offload tasks could otherwise delay startup and checkpointing. This commit simply adds the new process; it does not yet do anything useful. --- src/backend/postmaster/Makefile | 1 + src/backend/postmaster/auxprocess.c | 8 + src/backend/postmaster/custodian.c | 382 src/backend/postmaster/meson.build | 1 + src/backend/postmaster/postmaster.c | 38 ++- src/backend/storage/ipc/ipci.c | 3 + src/backend/storage/lmgr/proc.c | 1 + src/backend/utils/activity/wait_event.c | 3 + src/backend/utils/init/miscinit.c | 3 + src/include/miscadmin.h | 3 + src/include/postmaster/custodian.h | 32 ++ src/include/storage/proc.h | 11 +- src/include/utils/wait_event.h | 1 + 13 files changed, 482 insertions(+), 5 deletions(-) create mode 100644 src/backend/postmaster/custodian.c create mode 100644 src/include/postmaster/custodian.h diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile index 3a794e54d6..e1e1d1123f 100644 --- a/src/backend/postmaster/Makefile +++ b/src/backend/postmaster/Makefile @@ -18,6 +18,7 @@ OBJS = \ bgworker.o \ bgwriter.o \ checkpointer.o \ + custodian.o \ fork_process.o \ interrupt.o \ pgarch.o \ diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c index 7765d1c83d..c275271c95 100644 --- a/src/backend/postmaster/auxprocess.c +++ b/src/backend/postmaster/auxprocess.c @@ -20,6 +20,7 @@ #include "pgstat.h" #include "postmaster/auxprocess.h" #include "postmaster/bgwriter.h" +#include "postmaster/custodian.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/walreceiver.h" @@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype) case CheckpointerProcess: MyBackendType = 
B_CHECKPOINTER; break; + case CustodianProcess: + MyBackendType = B_CUSTODIAN; + break; case WalWriterProcess: MyBackendType = B_WAL_WRITER; break; @@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype) CheckpointerMain(); proc_exit(1); + case CustodianProcess: + CustodianMain(); + proc_exit(1); + case WalWriterProcess: WalWriterMain(); proc_exit(1); diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c new file mode 100644 index 00..a94381bc21 --- /dev/null +++ b/src/backend/postmaster/custodian.c @@ -0,0 +1,382 @@ +/*- + * + * custodian.c + * + * The custodian process handles a variety of non-critical tasks that might + * otherwise delay s
Re: fixing CREATEROLE
On Mon, Nov 28, 2022 at 4:55 PM Robert Haas wrote: > But so far nobody has actually reviewed anything, ... Actually this isn't true. Mark did review. Thanks, Mark. -- Robert Haas EDB: http://www.enterprisedb.com