Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On 2022-Sep-19, Bharath Rupireddy wrote: > We have a bunch of messages [1] that have an offset, but not LSN in > the error message. Firstly, is there an easiest way to figure out LSN > from offset reported in the error messages? If not, is adding LSN to > these messages along with offset a good idea? Of course, we can't just > convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but > something meaningful like reporting the LSN of the page that we are > reading-in or writing-out etc. Maybe adding an errcontext() somewhere that reports the LSN would be appropriate. For example, the page_read() callbacks have the LSN readily available, so the ones in the backend could install the errcontext callback; or perhaps ReadPageInternal can do it #ifndef FRONTEND. Not sure which of those options is best, but either of them sounds better than sticking the LSN in a lower-level routine that doesn't necessarily have the info already. -- Álvaro Herrera    Breisgau, Deutschland — https://www.EnterpriseDB.com/
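For illustration, a minimal sketch of the kind of errcontext callback being suggested here. The callback-argument struct and the function names are made up for the example; only errcontext(), ErrorContextCallback, XLogRecPtr and LSN_FORMAT_ARGS() are existing PostgreSQL infrastructure.

```c
/* Hypothetical context passed to the callback; not an existing struct. */
typedef struct WalReadErrCtx
{
	XLogRecPtr	targetPagePtr;	/* LSN of the WAL page being read */
} WalReadErrCtx;

static void
wal_read_error_callback(void *arg)
{
	WalReadErrCtx *ctx = (WalReadErrCtx *) arg;

	/* adds "CONTEXT: while reading WAL page ..." to any error raised */
	errcontext("while reading WAL page beginning at %X/%X",
			   LSN_FORMAT_ARGS(ctx->targetPagePtr));
}

/*
 * Installed around the read in the usual way:
 *
 *	errcallback.callback = wal_read_error_callback;
 *	errcallback.arg = &myctx;
 *	errcallback.previous = error_context_stack;
 *	error_context_stack = &errcallback;
 *	... read the page ...
 *	error_context_stack = errcallback.previous;
 */
```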
Re: Add 64-bit XIDs into PostgreSQL 15
Hi, Is this patch targeted at PG16 now? I want to have a look at these patches, but applying them on master failed:

Applying: Use 64-bit numbering of SLRU pages.
Applying: Use 64-bit format to output XIDs
Applying: Use 64-bit FullTransactionId instead of Epoch:xid
Applying: Use 64-bit pages representation in SLRU callers.
error: patch failed: src/backend/access/transam/multixact.c:1228
error: src/backend/access/transam/multixact.c: patch does not apply
Patch failed at 0004 Use 64-bit pages representation in SLRU callers.
hint: Use 'git am --show-current-patch=diff' to see the failed patch

Regards, Zhang Mingli
Re: missing indexes in indexlist with partitioned tables
Thank you for having a look at the patch. On Tue, 20 Sept 2022 at 18:41, Amit Langote wrote: > Agreed, though the patch's changes to tests does not seem to have to > do with join removal? I don't really understand what the test changes > are all about. I wonder why the patch doesn't instead add the test > case that Arne showed in the file he attached with [1]. > [1] > https://www.postgresql.org/message-id/2641568c18de40e8b1528fc9d4d80127%40index.de I adjusted a test in partition_join.sql to add an additional column to the fract_t table. Before the change that table only had a single column and due to the query's join condition being USING(id), none of the columns from the left joined table were being used. That resulted in the updated code performing a left join removal as it was passing the checks for no columns being used in the left joined table in analyzejoins.c. The test in partition_join.sql claims to be testing "partitionwise join with fractional paths", so I thought we'd better not have a query that the planner removes the join when we're meant to be testing joins. It probably wouldn't hurt to have a new test to ensure left join removals work with a partitioned table. That should go in join.sql along with the other join removal tests. I didn't study Arne's patch to see what test he added. I was only interested in writing enough code so I could check there was no good reason not to add the partitioned index into RelOptInfo.indexlist. Arne sent me an off-list message to say he's planning on working on the patch that uses the existing field instead of the new one he originally added. Let's hold off for that patch. David
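For reference, a minimal, made-up illustration of the general left-join-removal case described above (plain tables; the patch under discussion extends this to partitioned tables by exposing their indexes in RelOptInfo.indexlist):

```sql
CREATE TABLE t1 (id int PRIMARY KEY, payload text);
CREATE TABLE t2 (id int PRIMARY KEY);

-- only t1's columns appear in the target list and t2.id is provably
-- unique, so this LEFT JOIN contributes nothing and the planner can
-- remove it
EXPLAIN SELECT t1.* FROM t1 LEFT JOIN t2 USING (id);
```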
Re: missing indexes in indexlist with partitioned tables
On Tue, Sep 20, 2022 at 4:53 PM David Rowley wrote: > Thank you for having a look at the patch. > > On Tue, 20 Sept 2022 at 18:41, Amit Langote wrote: > > Agreed, though the patch's changes to tests does not seem to have to > > do with join removal? I don't really understand what the test changes > > are all about. I wonder why the patch doesn't instead add the test > > case that Arne showed in the file he attached with [1]. > > [1] > > https://www.postgresql.org/message-id/2641568c18de40e8b1528fc9d4d80127%40index.de > > I adjusted a test in partition_join.sql to add an additional column to > the fract_t table. Before the change that table only had a single > column and due to the query's join condition being USING(id), none of > the columns from the left joined table were being used. That resulted > in the updated code performing a left join removal as it was passing > the checks for no columns being used in the left joined table in > analyzejoins.c. The test in partition_join.sql claims to be testing > "partitionwise join with fractional paths", so I thought we'd better > not have a query that the planner removes the join when we're meant to > be testing joins. Ah, got it, thanks for the explanation. > It probably wouldn't hurt to have a new test to ensure left join > removals work with a partitioned table. That should go in join.sql > along with the other join removal tests. Makes sense. > I didn't study Arne's patch > to see what test he added. I was only interested in writing enough > code so I could check there was no good reason not to add the > partitioned index into RelOptInfo.indexlist. Arne sent me an off-list > message to say he's planning on working on the patch that uses the > existing field instead of the new one he originally added. Let's hold > off for that patch. Ok, sure. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: Add 64-bit XIDs into PostgreSQL 15
Hi,

With these patches, it seems that we don't need to handle wraparound in GetNextLocalTransactionId() either, as LocalTransactionId is uint64 now.

```
LocalTransactionId
GetNextLocalTransactionId(void)
{
	LocalTransactionId result;

	/* loop to avoid returning InvalidLocalTransactionId at wraparound */
	do
	{
		result = nextLocalTransactionId++;
	} while (!LocalTransactionIdIsValid(result));

	return result;
}
```

Regards, Zhang Mingli
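In other words, with a 64-bit counter the loop could presumably be reduced to something like the sketch below. This is just an illustration of the suggestion, not a reviewed change; it assumes nextLocalTransactionId is never initialized to InvalidLocalTransactionId, since without the loop nothing would skip over the invalid value.

```c
LocalTransactionId
GetNextLocalTransactionId(void)
{
	/*
	 * With a 64-bit LocalTransactionId, wraparound to
	 * InvalidLocalTransactionId is no longer a practical concern.
	 */
	return nextLocalTransactionId++;
}
```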
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Sep 16, 2022 at 4:54 PM John Naylor wrote: > > On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada wrote: > > > > On Mon, Aug 15, 2022 at 10:39 PM John Naylor > > wrote: > > > > > > bool, buth = and <=. Should be pretty close. Also, i believe if you > > > left this for last as a possible refactoring, it might save some work. > > v6 demonstrates why this should have been put off towards the end. (more > below) > > > > In any case, I'll take a look at the latest patch next month. > > Since the CF entry said "Needs Review", I began looking at v5 again > this week. Hopefully not too much has changed, but in the future I > strongly recommend setting to "Waiting on Author" if a new version is > forthcoming. I realize many here share updated patches at any time, > but I'd like to discourage the practice especially for large patches. Understood. Sorry for the inconveniences. > > > I've updated the radix tree patch. It's now separated into two patches. > > > > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find > > better names) that are similar to the pg_lfind8() family but they > > return the index of the key in the vector instead of true/false. The > > patch includes regression tests. > > I don't want to do a full review of this just yet, but I'll just point > out some problems from a quick glance. > > +/* > + * Return the index of the first element in the vector that is greater than > + * or eual to the given scalar. Return sizeof(Vector8) if there is no such > + * element. > > That's a bizarre API to indicate non-existence. > > + * > + * Note that this function assumes the elements in the vector are sorted. > + */ > > That is *completely* unacceptable for a general-purpose function. > > +#else /* USE_NO_SIMD */ > + Vector8 r = 0; > + uint8 *rp = (uint8 *) &r; > + > + for (Size i = 0; i < sizeof(Vector8); i++) > + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0; > > I don't think we should try to force the non-simd case to adopt the > special semantics of vector comparisons. It's much easier to just use > the same logic as the assert builds. > > +#ifdef USE_SSE2 > + return (uint32) _mm_movemask_epi8(v); > +#elif defined(USE_NEON) > + static const uint8 mask[16] = { > +1 << 0, 1 << 1, 1 << 2, 1 << 3, > +1 << 4, 1 << 5, 1 << 6, 1 << 7, > +1 << 0, 1 << 1, 1 << 2, 1 << 3, > +1 << 4, 1 << 5, 1 << 6, 1 << 7, > + }; > + > +uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) > vshrq_n_s8(v, 7)); > +uint8x16_t maskedhi = vextq_u8(masked, masked, 8); > + > +return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi)); > > For Arm, we need to be careful here. This article goes into a lot of > detail for this situation: > > https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon > > Here again, I'd rather put this off and focus on getting the "large > details" in good enough shape so we can got towards integrating with > vacuum. Thank you for the comments! These above comments are addressed by Nathan in a newly derived thread. I'll work on the patch. I'll consider how to integrate with vacuum as the next step. One concern for me is how to limit the memory usage to maintenance_work_mem. Unlike using a flat array, memory space for adding one TID varies depending on the situation. If we want strictly not to allow using memory more than maintenance_work_mem, probably we need to estimate the memory consumption in a conservative way. 
Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
I've gone through the v11-0001 patch in more detail. Here are some more review comments (nothing functional I think - mostly just wording) == 1. src/backend/executor/execReplication.c - build_replindex_scan_key - * This is not generic routine, it expects the idxrel to be replication - * identity of a rel and meet all limitations associated with that. + * This is not generic routine, it expects the idxrel to be an index + * that planner would choose if the searchslot includes all the columns + * (e.g., REPLICA IDENTITY FULL on the source). */ -static bool +static int build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel, TupleTableSlot *searchslot) (I know this is not caused by your patch but maybe fix it at the same time?) "This is not generic routine, it expects..." -> "This is not a generic routine - it expects..." ~~~ 2. + IndexInfo *indexInfo PG_USED_FOR_ASSERTS_ONLY; + + /* + * This attribute is an expression, and + * SuitablePathsForRepIdentFull() was called earlier while the + * index for subscriber is selected. There, the indexes comprising + * *only* expressions have already been eliminated. + * + * We sanity check this now. + */ +#ifdef USE_ASSERT_CHECKING + indexInfo = BuildIndexInfo(idxrel); + Assert(!IndexOnlyOnExpression(indexInfo)); +#endif 2a. "while the index for subscriber is selected..." -> "when the index for the subscriber was selected...” ~ 2b. Because there is only one declaration in this code block you could simplify this a bit if you wanted to. SUGGESTION /* * This attribute is an expression, and * SuitablePathsForRepIdentFull() was called earlier while the * index for subscriber is selected. There, the indexes comprising * *only* expressions have already been eliminated. * * We sanity check this now. */ #ifdef USE_ASSERT_CHECKING IndexInfo *indexInfo = BuildIndexInfo(idxrel); Assert(!IndexOnlyOnExpression(indexInfo)); #endif ~~~ 3. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex + /* Start an index scan. */ + scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0); retry: found = false; It might be better to have a blank line before that ‘retry’ label, like in the original code. == 4. src/backend/replication/logical/relation.c +/* see LogicalRepPartMapEntry for details in logicalrelation.h */ static HTAB *LogicalRepPartMap = NULL; Personally, I'd word that something like: "/* For LogicalRepPartMap details see LogicalRepPartMapEntry in logicalrelation.h */" but YMMV. ~~~ 5. src/backend/replication/logical/relation.c - GenerateDummySelectPlannerInfoForRelation +/* + * This is not a generic function, helper function for + * GetCheapestReplicaIdentityFullPath. The function creates + * a dummy PlannerInfo for the given relationId as if the + * relation is queried with SELECT command. + */ +static PlannerInfo * +GenerateDummySelectPlannerInfoForRelation(Oid relationId) (mentioned this one in my previous post) "This is not a generic function, helper function" -> "This is not a generic function. It is a helper function" ~~~ 6. src/backend/replication/logical/relation.c - GetCheapestReplicaIdentityFullPath +/* + * Generate all the possible paths for the given subscriber relation, + * for the cases that the source relation is replicated via REPLICA + * IDENTITY FULL. The function returns the cheapest Path among the + * eligible paths, see SuitablePathsForRepIdentFull(). + * + * The function guarantees to return a path, because it adds sequential + * scan path if needed. 
+ * + * The function assumes that all the columns will be provided during + * the execution phase, given that REPLICA IDENTITY FULL guarantees + * that. + */ +static Path * +GetCheapestReplicaIdentityFullPath(Relation localrel) "for the cases that..." -> "for cases where..." ~~~ 7. + /* + * Currently it is not possible for planner to pick a partial index or + * indexes only on expressions. We still want to be explicit and eliminate + * such paths proactively. "for planner..." -> "for the planner..." == 8. .../subscription/t/032_subscribe_use_index.pl - general 8a. (remove the 'we') "# we wait until..." -> "# wait until..." X many occurrences ~ 8b. (remove the 'we') "# we show that..." -> “# show that..." X many occurrences ~~~ 9. There is inconsistent wording for some of your test case start/end comments 9a. e.g. start: SUBSCRIPTION USES INDEX UPDATEs MULTIPLE ROWS end: SUBSCRIPTION USES INDEX MODIFIES MULTIPLE ROWS ~ 9b. e.g. start: SUBSCRIPTION USES INDEX WITH MULTIPLE COLUMNS end: SUBSCRIPTION USES INDEX MODIFIES MULTIPLE ROWS ~~~ 10. I did not really understand the point of having special subscription names tap_sub_rep_full_0 tap_sub_rep_full_2 tap_sub_rep_full_3 tap_sub_rep_full_4 etc... Since you drop/recreate these for each test case can't they just be called 'tap_sub_rep_full'? ~~~ 11. SUBSCRIPTION USES INDEX WITH MULTIPLE COLUMNS +# updates 200 rows +$node_publisher->safe_psql('postgres', + "D
Re: Add common function ReplicationOriginName.
Hi Amit, > I think it is better to use Size. Even though, it may not fail now as > the size of names for origin will always be much lesser but it is > better if we are consistent. If we agree with this, then as a first > patch, we can make it to use Size in existing places and then > introduce this new function. OK, here is the updated patchset. * 0001 replaces int's with Size's in the existing code * 0002 applies Peter's patch on top of 0001 -- Best regards, Aleksander Alekseev v2-0001-Pass-Size-as-a-2nd-argument-for-snprintf-in-table.patch Description: Binary data v2-0002-Add-common-function-ReplicationOriginName.patch Description: Binary data
Re: Summary function for pg_buffercache
Hi, Also I suggest changing the names of the columns in order to make them > consistent with the rest of the system. If you consider pg_stat_activity > and family [1] you will notice that the columns are named > (entity)_(property), e.g. backend_xid, backend_type, client_addr, etc. So > instead of used_buffers and unused_buffers the naming should be > buffers_used and buffers_unused. > > [1]: https://www.postgresql.org/docs/current/monitoring-stats.html I changed these names and updated the patch. However I have somewhat mixed feelings about avg_usagecount. Generally >> AVG() is a relatively useless methric for monitoring. What if the user >> wants MIN(), MAX() or let's say a 99th percentile? I suggest splitting it >> into usagecount_min, usagecount_max and usagecount_sum. AVG() can be >> derived as usercount_sum / used_buffers. >> > > Won't be usagecount_max almost always 5 as "BM_MAX_USAGE_COUNT" set to 5 > in buf_internals.h? I'm not sure about how much usagecount_min would add > either. > A usagecount is always an integer between 0 and 5, it's not > something unbounded. I think the 99th percentile would be much better than > average if strong outlier values could occur. But in this case, I feel like > an average value would be sufficiently useful as well. > usagecount_sum would actually be useful since average can be derived from > it. If you think that the sum of usagecounts has a meaning just by itself, > it makes sense to include it. Otherwise, wouldn't showing directly averaged > value be more useful? > Aleksander, do you still think the average usagecount is a bit useless? Or does it make sense to you to keep it like this? > I suggest we focus on saving the memory first and then think about the > > performance, if necessary. > > Personally I think the locks part is at least as important - it's what > makes > the production impact higher. > I agree that it's important due to its high impact. I'm not sure how to avoid any undefined behaviour without locks though. Even with locks, performance is much better. But is it good enough for production? Thanks, Melih v6-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
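For reference, a usage sketch of the function under discussion, assuming the renamed columns mentioned above (buffers_used, buffers_unused) plus an averaged usage count make it into the final version — the exact column set may differ in later patch revisions:

```sql
CREATE EXTENSION pg_buffercache;

-- one summary row for all of shared_buffers, instead of one row per buffer
SELECT buffers_used, buffers_unused, usagecount_avg
FROM pg_buffercache_summary();
```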
Re: About displaying NestLoopParam
On Fri, Sep 16, 2022 at 5:59 PM Richard Guo wrote: > I'm not saying this is a bug, but just curious why param 0 cannot be > displayed as the referenced expression. And I find the reason is that in > function find_param_referent(), we have the 'in_same_plan_level' flag > controlling that if we have emerged from a subplan, i.e. not the same > plan level any more, we would not look further for the matching > NestLoopParam. Param 0 suits this situation. > > And there is a comment there also saying, > > /* > * NestLoops transmit params to their inner child only; also, once > * we've crawled up out of a subplan, this couldn't possibly be > * the right match. > */ > After thinking about this some more, I still don't see a reason why we cannot display NestLoopParam after we've emerged from a subplan. It seems these params come from a parameterized subqueryscan and their values are supplied by an upper nestloop. These params should have been processed in process_subquery_nestloop_params(), where we just add the PlannerParamItem entries to root->curOuterParams, in the form of NestLoopParam, using the same PARAM_EXEC slots. So I propose the attached patch to remove the 'in_same_plan_level' flag so that we can display NestLoopParam across subplans. Please correct me if I'm wrong. Thanks Richard v1-0001-Display-NestLoopParam-across-subplan.patch Description: Binary data
Re: HOT chain validation in verify_heapam()
On Mon, Sep 19, 2022 at 10:04 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: > Hi Himanshu, > > > I have changed this in the attached patch. > > If it's not too much trouble could you please base your changes on v4 > that I submitted? I put some effort into writing a proper commit > message, editing the comments, etc. The easiest way of doing this is > using `git am` and `git format-patch`. > > Please find it attached. -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com From 3f3290ffb857117385f79f85aa599c588340924b Mon Sep 17 00:00:00 2001 From: Himanshu Upadhyaya Date: Tue, 20 Sep 2022 14:23:31 +0530 Subject: [PATCH v5] Implement HOT chain validation in verify_heapam() Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev Discussion: https://postgr.es/m/CAPF61jBBR2-iE-EmN_9v0hcQEfyz_17e5Lbb0%2Bu2%3D9ukA9sWmQ%40mail.gmail.com --- contrib/amcheck/verify_heapam.c | 214 1 file changed, 214 insertions(+) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index c875f3e5a2..e90e1ef2f3 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS) for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++) { OffsetNumber maxoff; + OffsetNumber predecessor[MaxOffsetNumber] = {0}; + OffsetNumber successor[MaxOffsetNumber] = {0}; + bool lp_valid[MaxOffsetNumber] = {false}; CHECK_FOR_INTERRUPTS(); @@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS) for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff; ctx.offnum = OffsetNumberNext(ctx.offnum)) { + OffsetNumber nextoffnum; + ctx.itemid = PageGetItemId(ctx.page, ctx.offnum); /* Skip over unused/dead line pointers */ @@ -469,6 +474,12 @@ verify_heapam(PG_FUNCTION_ARGS) report_corruption(&ctx, psprintf("line pointer redirection to unused item at offset %u", (unsigned) rdoffnum)); + +/* + * make entry in successor array, redirected tuple will be + * validated at the time when we loop over successor array + */ +successor[ctx.offnum] = rdoffnum; continue; } @@ -504,9 +515,205 @@ verify_heapam(PG_FUNCTION_ARGS) /* It should be safe to examine the tuple's header, at least */ ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid); ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr); + lp_valid[ctx.offnum] = true; /* Ok, ready to check this next tuple */ check_tuple(&ctx); + + /* + * Add the data to the successor array. It will be used later to + * generate the predecessor array. + * + * We need to access the tuple's header to populate the + * predecessor array. However the tuple is not necessarily sanity + * checked yet so delaying construction of predecessor array until + * all tuples are sanity checked. + */ + nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid); + if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno && +nextoffnum != ctx.offnum) + { +successor[ctx.offnum] = nextoffnum; + } + } + + /* + * Loop over offset and populate predecessor array from all entries + * that are present in successor array. 
+ */ + ctx.attnum = -1; + for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff; + ctx.offnum = OffsetNumberNext(ctx.offnum)) + { + ItemId curr_lp; + ItemId next_lp; + HeapTupleHeader curr_htup; + HeapTupleHeader next_htup; + TransactionId curr_xmax; + TransactionId next_xmin; + + OffsetNumber nextoffnum = successor[ctx.offnum]; + + curr_lp = PageGetItemId(ctx.page, ctx.offnum); + if (nextoffnum == 0) + { +/* + * This is either the last updated tuple in the chain or a + * corruption raised for this tuple. + */ +continue; + } + if (ItemIdIsRedirected(curr_lp) && lp_valid[nextoffnum]) + { +next_lp = PageGetItemId(ctx.page, nextoffnum); +next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);; +if (!HeapTupleHeaderIsHeapOnly(next_htup)) +{ + report_corruption(&ctx, + psprintf("redirected tuple at line pointer offset %u is not heap only tuple", + (unsigned) nextoffnum)); +} +if ((next_htup->t_infomask & HEAP_UPDATED) == 0) +{ + report_corruption(&ctx, + psprintf("redirected tuple at line pointer offset %u is not heap updated tuple", + (unsigned) nextoffnum)); +} +continue; + } + + /* + * Add a line pointer offset to the predecessor array. 1) If xmax + * is matching with xmin of next tuple (reaching via its t_ctid). + * 2) If the next tuple is in the same page. Raise corruption if + * we have two tuples having the same predecessor. + * + * We add the offset to the predecessor array irrespective of the + * transaction (t_xmin) status. We will do validation related to + * the transaction status (and
Re: Add common function ReplicationOriginName.
On Tue, Sep 20, 2022 at 6:36 PM Aleksander Alekseev wrote: > > Hi Amit, > > > I think it is better to use Size. Even though, it may not fail now as > > the size of names for origin will always be much lesser but it is > > better if we are consistent. If we agree with this, then as a first > > patch, we can make it to use Size in existing places and then > > introduce this new function. > > OK, here is the updated patchset. > > * 0001 replaces int's with Size's in the existing code > * 0002 applies Peter's patch on top of 0001 > LGTM. Thanks! - Kind Regards, Peter Smith. Fujitsu Australia
Re: Add 64-bit XIDs into PostgreSQL 15
On Tue, Sep 20, 2022 at 03:37:47PM +0800, Zhang Mingli wrote: > I want to have a look at these patches, but apply on master failed: Yeah, it's likely to break every week or more often. You have a few options: 0) resolve the conflict yourself; 1) apply the patch to the commit that the authors sent it against, or some commit before the conflicting file(s) were changed in master. Like maybe "git checkout -b 64bitxids f66d997fd". 2) Use the last patch that cfbot successfully created. You can read the patch on github's web interface, or add cfbot's user as a remote to use the patch locally for review and/or compilation. Something like "git remote add cfbot https://github.com/postgresql-cfbot/postgresql; git fetch cfbot commitfest/39/3594; git checkout -b 64bitxids cfbot/commitfest/39/3594". (Unfortunately, cfbot currently squishes the patch series into a single commit and loses the commit message). You could also check the git link in the commitfest, to see if the author has already rebased it, but haven't yet mailed the rebased patch to the list. In this case, that's not true, but you could probably use the author's branch on github, too. https://commitfest.postgresql.org/39/3594/ -- Justin
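For readability, option 2 above as a copy-pasteable sequence (these are exactly the commands given in the text):

```sh
git remote add cfbot https://github.com/postgresql-cfbot/postgresql
git fetch cfbot commitfest/39/3594
git checkout -b 64bitxids cfbot/commitfest/39/3594
```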
Re: why can't a table be part of the same publication as its schema
On 2022-Sep-20, Amit Kapila wrote: > On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera > wrote: > > This seems a pretty arbitrary restriction. It feels like you're adding > > this restriction precisely so that you don't have to write the code to > > reject the ALTER .. SET SCHEMA if an incompatible configuration is > > detected. But we already have such checks in several cases, so I don't > > see why this one does not seem a good idea. > > > I agree that we have such checks at other places as well and one > somewhat similar is in ATPrepChangePersistence(). > > ATPrepChangePersistence() > { > ... > ... > /* > * Check that the table is not part of any publication when changing to > * UNLOGGED, as UNLOGGED tables can't be published. > */ Right, I think this is a sensible approach. > However, another angle to look at it is that we try to avoid adding > restrictions in other DDL commands for defined publications. Well, it makes sense to avoid restrictions wherever possible. But here, the consequence is that you end up with a restriction in the publication definition that is not very sensible. Imagine if you said "you can't add schema S because it contains an unlogged table". It's absurd. Maybe this can be relaxed in a future release, but it's quite odd. > The intention was to be in sync with FOR ALL TABLES. I guess we can change both (FOR ALL TABLES and IN SCHEMA) later. -- Álvaro Herrera    Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Add 64-bit XIDs into PostgreSQL 15
Hi, On Sep 20, 2022, 17:26 +0800, Justin Pryzby , wrote: > On Tue, Sep 20, 2022 at 03:37:47PM +0800, Zhang Mingli wrote: > > I want to have a look at these patches, but apply on master failed: > > Yeah, it's likely to break every week or more often. > > You have a few options: > > 0) resolve the conflict yourself; > > 1) apply the patch to the commit that the authors sent it against, or > some commit before the conflicting file(s) were changed in master. Like > maybe "git checkout -b 64bitxids f66d997fd". > > 2) Use the last patch that cfbot successfully created. You can read the > patch on github's web interface, or add cfbot's user as a remote to use > the patch locally for review and/or compilation. Something like "git > remote add cfbot https://github.com/postgresql-cfbot/postgresql; git > fetch cfbot commitfest/39/3594; git checkout -b 64bitxids > cfbot/commitfest/39/3594". (Unfortunately, cfbot currently squishes the > patch series into a single commit and loses the commit message). > > You could also check the git link in the commitfest, to see if the > author has already rebased it, but haven't yet mailed the rebased patch > to the list. In this case, that's not true, but you could probably use > the author's branch on github, too. > https://commitfest.postgresql.org/39/3594/ > > -- > Justin Got it, thanks. Regards, Zhang Mingli
Re: Summary function for pg_buffercache
Hi Melih, > I changed these names and updated the patch. Thanks for the updated patch! > Aleksander, do you still think the average usagecount is a bit useless? Or > does it make sense to you to keep it like this? I don't mind keeping the average. > I'm not sure how to avoid any undefined behaviour without locks though. > Even with locks, performance is much better. But is it good enough for > production? Potentially you could avoid taking locks by utilizing atomic operations and lock-free algorithms. But these algorithms are typically error-prone and do not always produce faster code than the lock-based ones. I'm pretty confident this is out of scope of this particular patch. The patch v6 had several defects: * Trailing whitespaces (can be checked by applying the patch with `git am`) * Wrong code formatting (can be fixed with pgindent) * Several empty lines were removed that are not related to the proposed change (can be seen with `git diff`) * An unlikely division by zero if buffers_used = 0 * Missing part of the commit message added in v4 Here is a corrected patch v7. To me it seems to be in pretty good shape, unless cfbot and/or other hackers report any issues. -- Best regards, Aleksander Alekseev v7-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: ICU for global collation
On 17.09.22 10:33, Marina Polyakova wrote: Thanks to Kyotaro Horiguchi review we found out that there're interesting cases due to the order of some ICU checks: 1. ICU locale vs supported encoding: 1.1. On 2022-09-15 09:52, Kyotaro Horiguchi wrote: If I executed initdb as follows, I would be told to specify --icu-locale option. $ initdb --encoding sql-ascii --locale-provider icu hoge ... initdb: error: ICU locale must be specified However, when I reran the command, it complains about incompatible encoding this time. I think it's more user-friendly to check for the encoding compatibility before the check for missing --icu-locale option. This a valid point, but it would require quite a bit of work to move all those checks around and re-verify the result, so I don't want to do it in PG15. 1.2. (ok?) $ initdb --encoding sql-ascii --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen $ initdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge ... initdb: error: encoding mismatch initdb: detail: The encoding you selected (SQL_ASCII) is not supported with the ICU provider. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. $ createdb --encoding sql-ascii --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU $ createdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge createdb: error: database creation failed: ERROR: encoding "SQL_ASCII" is not supported with ICU provider I don't see a problem here. 2. For builds without ICU: 2.1. $ initdb --locale-provider icu hoge ... initdb: error: ICU locale must be specified $ initdb --locale-provider icu --icu-locale en-US hoge ... initdb: error: ICU is not supported in this build $ createdb --locale-provider icu hoge createdb: error: database creation failed: ERROR: ICU locale must be specified $ createdb --locale-provider icu --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build IMO, it would be more user-friendly to inform an unsupported build in the first runs too.. Again, this would require reorganizing a bunch of code to get some cosmetic benefit, which isn't a good idea now for PG15. 2.2. (ok?) 2.3. same here 3. The locale provider is ICU, but it has not yet been set from the template database: $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && createdb --icu-locale ru-RU --template template0 mydb ... createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU Please see attached patch for a fix. Does that work for you?From 5bb07e390baf8a471f7d30582469268c8a6d71ca Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Sep 2022 05:50:41 -0400 Subject: [PATCH] Improve ICU option handling in CREATE DATABASE We check that the ICU locale is only specified if the ICU locale provider is selected. But we did that too early. We need to wait until we load the settings of the template database, since that could also set what the locale provider is. 
Reported-by: Marina Polyakova Discussion: https://www.postgresql.org/message-id/9ba4cd1ea6ed6b7b15c0ff15e6f54...@postgrespro.ru --- src/backend/commands/dbcommands.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index cba929b7f998..5dfec5c6b056 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -907,10 +907,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) errmsg("unrecognized locale provider: %s", locproviderstr))); } - if (diculocale && dblocprovider != COLLPROVIDER_ICU) - ereport(ERROR, - (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), -errmsg("ICU locale cannot be specified unless locale provider is ICU"))); if (distemplate && distemplate->arg) dbistemplate = defGetBoolean(distemplate); if (dallowconnections && dallowconnections->arg) @@ -1050,6 +1046,13 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) check_icu_locale(dbiculocale); } + else + { + if (dbiculocale) + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), +errmsg("ICU locale cannot be specified unless locale provider is ICU"))); + } /* * Check that the new encoding and locale settings match the source -- 2.37.3
Re: Proposal to use JSON for Postgres Parser format
On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote: > Peter Geoghegan writes: > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote: > >> Our existing format is certainly not great on those metrics, but > >> I do not see how "let's use JSON!" is a route to improvement. > > > The existing format was designed with developer convenience as a goal, > > though -- despite my complaints, and in spite of your objections. > > As Munro adduces nearby, it'd be a stretch to conclude that the current > format was designed with any Postgres-related goals in mind at all. > I think he's right that it's a variant of some Lisp-y dump format that's > probably far hoarier than even Berkeley Postgres. > > > If it didn't have to be easy (or even practical) for developers to > > directly work with the output format, then presumably the format used > > internally could be replaced with something lower level and faster. So > > it seems like the two goals (developer ergonomics and faster > > interchange format for users) might actually be complementary. > > I think the principal mistake in what we have now is that the storage > format is identical to the "developer friendly" text format (plus or > minus some whitespace). First we need to separate those. We could > have more than one equivalent text format perhaps, and I don't have > any strong objection to basing the text format (or one of them) on > JSON. +1 for considering storage format and text format separately. Let's consider what our criteria could be for the storage format. 1) Storage effectiveness (shorter is better) and serialization/deserialization effectiveness (faster is better). On this criterion, the custom binary format looks perfect. 2) Robustness in the case of corruption. It seems much easier to detect the data corruption and possibly make some partial manual recovery for textual format. 3) Standartness. It's better to use something known worldwide or at least used in other parts of PostgreSQL than something completely custom. From this perspective, JSON/JSONB is better than custom things. -- Regards, Alexander Korotkov
Re: Proposal to use JSON for Postgres Parser format
On Tue, Sep 20, 2022 at 1:00 PM Alexander Korotkov wrote: > On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote: > > Peter Geoghegan writes: > > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote: > > >> Our existing format is certainly not great on those metrics, but > > >> I do not see how "let's use JSON!" is a route to improvement. > > > > > The existing format was designed with developer convenience as a goal, > > > though -- despite my complaints, and in spite of your objections. > > > > As Munro adduces nearby, it'd be a stretch to conclude that the current > > format was designed with any Postgres-related goals in mind at all. > > I think he's right that it's a variant of some Lisp-y dump format that's > > probably far hoarier than even Berkeley Postgres. > > > > > If it didn't have to be easy (or even practical) for developers to > > > directly work with the output format, then presumably the format used > > > internally could be replaced with something lower level and faster. So > > > it seems like the two goals (developer ergonomics and faster > > > interchange format for users) might actually be complementary. > > > > I think the principal mistake in what we have now is that the storage > > format is identical to the "developer friendly" text format (plus or > > minus some whitespace). First we need to separate those. We could > > have more than one equivalent text format perhaps, and I don't have > > any strong objection to basing the text format (or one of them) on > > JSON. > > +1 for considering storage format and text format separately. > > Let's consider what our criteria could be for the storage format. > > 1) Storage effectiveness (shorter is better) and > serialization/deserialization effectiveness (faster is better). On > this criterion, the custom binary format looks perfect. > 2) Robustness in the case of corruption. It seems much easier to > detect the data corruption and possibly make some partial manual > recovery for textual format. > 3) Standartness. It's better to use something known worldwide or at > least used in other parts of PostgreSQL than something completely > custom. From this perspective, JSON/JSONB is better than custom > things. (sorry, I've accidentally cut the last paragraph from the message) It seems that there is no perfect fit for this multi-criteria optimization, and we should pick what is more important. Any thoughts? -- Regards, Alexander Korotkov
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Peter, Thanks for the quick response. > 1. Commit message > > It looks like some small mistake happened. You wrote [1] that my > previous review comments about the commit message were fixed, but it > seems the v11 commit message is unchanged since v10. > > Oops, yes you are right, I forgot to push commit message changes. I'll incorporate all these suggestions on v12. > == > > 2. src/backend/replication/logical/relation.c - > GenerateDummySelectPlannerInfoForRelation > > +/* > + * This is not a generic function, helper function for > + * GetCheapestReplicaIdentityFullPath. The function creates > + * a dummy PlannerInfo for the given relationId as if the > + * relation is queried with SELECT command. > + */ > +static PlannerInfo * > +GenerateDummySelectPlannerInfoForRelation(Oid relationId) > > "generic function, helper function" -> "generic function. It is a > helper function" > > Fixed. I'll attach the changes in the next email with v12. Thanks, Onder
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Peter, > > 1. src/backend/executor/execReplication.c - build_replindex_scan_key > > - * This is not generic routine, it expects the idxrel to be replication > - * identity of a rel and meet all limitations associated with that. > + * This is not generic routine, it expects the idxrel to be an index > + * that planner would choose if the searchslot includes all the columns > + * (e.g., REPLICA IDENTITY FULL on the source). > */ > -static bool > +static int > build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel, > TupleTableSlot *searchslot) > > > (I know this is not caused by your patch but maybe fix it at the same > time?) > > "This is not generic routine, it expects..." -> "This is not a generic > routine - it expects..." > > Fixed > > 2. > > + IndexInfo *indexInfo PG_USED_FOR_ASSERTS_ONLY; > + > + /* > + * This attribute is an expression, and > + * SuitablePathsForRepIdentFull() was called earlier while the > + * index for subscriber is selected. There, the indexes comprising > + * *only* expressions have already been eliminated. > + * > + * We sanity check this now. > + */ > +#ifdef USE_ASSERT_CHECKING > + indexInfo = BuildIndexInfo(idxrel); > + Assert(!IndexOnlyOnExpression(indexInfo)); > +#endif > > 2a. > "while the index for subscriber is selected..." -> "when the index for > the subscriber was selected...” > > fixed > ~ > > 2b. > Because there is only one declaration in this code block you could > simplify this a bit if you wanted to. > > SUGGESTION > /* > * This attribute is an expression, and > * SuitablePathsForRepIdentFull() was called earlier while the > * index for subscriber is selected. There, the indexes comprising > * *only* expressions have already been eliminated. > * > * We sanity check this now. > */ > #ifdef USE_ASSERT_CHECKING > IndexInfo *indexInfo = BuildIndexInfo(idxrel); > Assert(!IndexOnlyOnExpression(indexInfo)); > #endif > > Makes sense, no reason to declare above > ~~~ > > 3. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex > > + /* Start an index scan. */ > + scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0); > retry: > found = false; > > It might be better to have a blank line before that ‘retry’ label, > like in the original code. > agreed, fixed > > == > > 4. src/backend/replication/logical/relation.c > > +/* see LogicalRepPartMapEntry for details in logicalrelation.h */ > static HTAB *LogicalRepPartMap = NULL; > > Personally, I'd word that something like: > "/* For LogicalRepPartMap details see LogicalRepPartMapEntry in > logicalrelation.h */" > > but YMMV. > I also don't have any strong opinions on that, updated to your suggestion. > > ~~~ > > 5. src/backend/replication/logical/relation.c - > GenerateDummySelectPlannerInfoForRelation > > +/* > + * This is not a generic function, helper function for > + * GetCheapestReplicaIdentityFullPath. The function creates > + * a dummy PlannerInfo for the given relationId as if the > + * relation is queried with SELECT command. > + */ > +static PlannerInfo * > +GenerateDummySelectPlannerInfoForRelation(Oid relationId) > > (mentioned this one in my previous post) > > "This is not a generic function, helper function" -> "This is not a > generic function. It is a helper function" > Yes, applied. > > ~~~ > > 6. src/backend/replication/logical/relation.c - > GetCheapestReplicaIdentityFullPath > > +/* > + * Generate all the possible paths for the given subscriber relation, > + * for the cases that the source relation is replicated via REPLICA > + * IDENTITY FULL. 
The function returns the cheapest Path among the > + * eligible paths, see SuitablePathsForRepIdentFull(). > + * > + * The function guarantees to return a path, because it adds sequential > + * scan path if needed. > + * > + * The function assumes that all the columns will be provided during > + * the execution phase, given that REPLICA IDENTITY FULL guarantees > + * that. > + */ > +static Path * > +GetCheapestReplicaIdentityFullPath(Relation localrel) > > > "for the cases that..." -> "for cases where..." > > sounds good > ~~~ > > 7. > > + /* > + * Currently it is not possible for planner to pick a partial index or > + * indexes only on expressions. We still want to be explicit and eliminate > + * such paths proactively. > > "for planner..." -> "for the planner..." > fixed > > == > > 8. .../subscription/t/032_subscribe_use_index.pl - general > > 8a. > (remove the 'we') > "# we wait until..." -> "# wait until..." X many occurrences > > ~ > > 8b. > (remove the 'we') > "# we show that..." -> “# show that..." X many occurrences > Ok, removed all "we"s in the test > > ~~~ > > 9. > > There is inconsistent wording for some of your test case start/end comments > > 9a. > e.g. > start: SUBSCRIPTION USES INDEX UPDATEs MULTIPLE ROWS > end: SUBSCRIPTION USES INDEX MODIFIES MULTIPLE ROWS > > ~ > > 9b. > e.g. > start: SUBSCRIPTION USES INDEX WITH MULTIPLE COLUMNS > end: SUBSCRIPTION USES INDEX M
Re: pg_create_logical_replication_slot argument incongruency
On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote: > OK, patch only for the docs attached. Thanks, applied. -- Michael
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Robert Haas wrote:

> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera wrote:
> > 55489 16devel 1776237=# \d tab
> >                 Table "public.tab"
> >  Column │  Type   │ Collation │ Nullable │ Default
> > ────────┼─────────┼───────────┼──────────┼─────────
> >  a      │ integer │           │ not null │
> > Check constraints:
> >     "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.

Perhaps that can be solved by displaying the constraint name in the table:

55489 16devel 1776237=# \d tab
                    Table "public.tab"
 Column │  Type   │ Collation │    Nullable    │ Default
────────┼─────────┼───────────┼────────────────┼─────────
 a      │ integer │           │ tab_a_not_null │

(Perhaps we can change the column title also, not sure.)

--
Álvaro Herrera    Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've been told never to return" (www.actsofgord.com)
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Isaac Morland wrote: > I thought I saw some discussion about the SQL standard saying that there is > a difference between putting NOT NULL in a column definition, and CHECK > (column_name NOT NULL). So if we're going to take this seriously, Was it Peter E.'s reply to this thread? https://postgr.es/m/bac841ed-b86d-e3c2-030d-02a3db067...@enterprisedb.com because if it wasn't there, then I would like to know what your source is. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end." (2nd Commandment for C programmers)
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Matthias van de Meent wrote: > I'm not sure on the 'good' part of this alternative, but we could go > with a single row-based IS NOT NULL to reduce such clutter, utilizing > the `ROW() IS NOT NULL` requirement of a row only matching IS NOT NULL > when all attributes are also IS NOT NULL: > > Check constraints: > "tab_notnull_check" CHECK (ROW(a, b, c, d, e) IS NOT NULL) There's no way to mark this NOT VALID individually or validate it afterwards, though. > But the performance of repeated row-casting would probably not be as > good as our current NULL checks The NULL checks would still be mostly done by the attnotnull checks internally, so there shouldn't be too much of a difference. .. though I'm now wondering if there's additional overhead from checking the constraint twice on each row: first the attnotnull bit, then the CHECK itself. Hmm. That's probably quite bad. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Summary function for pg_buffercache
Hi hackers, > Here is a corrected patch v7. To me it seems to be in pretty good > shape, unless cfbot and/or other hackers will report any issues. There was a missing empty line in pg_buffercache.out which made the tests fail. Here is a corrected v8 patch. -- Best regards, Aleksander Alekseev v8-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: Summary function for pg_buffercache
On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev wrote: > There was a missing empty line in pg_buffercache.out which made the > tests fail. Here is a corrected v8 patch. I was just sending a corrected patch without the missing line. Thanks a lot for all these reviews and the corrected patch. Best, Melih
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
Hi, On 9/20/22 6:30 AM, Michael Paquier wrote: On Tue, Sep 20, 2022 at 12:09:33AM -0400, Tom Lane wrote: You have to assume that somebody (a) has a role or DB name starting with slash, (b) has an explicit reference to that name in their pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't notice that things are misbehaving until after some hacker manages to break into their installation on the strength of the misbehaving entry. OK, I'll grant that the probability of (c) is depressingly close to unity; but each of the other steps seems quite low probability. All four of them happening in one installation is something I doubt will happen. It is the kind of things that could blow up as a CVE and some bad PR for the project, so I cannot get excited about enforcing this new rule in an authentication file (aka before a role is authenticated) while we are talking about 3~4 code paths (?) that would need an extra check to make sure that no instances have such object names. I also have the feeling that having (a), (b) and (d) is low probability. That said, if the user "/john" already exists and has an hba entry, then this entry will still match with the patch. The problem is that all the users that contain "john" would also now match. But things get worse if, say, /a is an existing user and hba entry, as the entry would match any user that contains "a" with the patch. I assume (maybe I should not) that if objects starting with / already exist, there are very good reasons behind them. Then I don't think that preventing their creation in the DDL would help (quite the contrary for the ones that really need them). It looks to me that adding a GUC (off by default) to enable/disable the regexp usage in the hba could be a fair compromise. It won't block any creation starting with a / and won't open more doors (if such objects exist) by default. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
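To make the concern concrete, here is a hypothetical pg_hba.conf fragment under the patch's proposed leading-slash-means-regex syntax (the role name and network below are made up):

```
# today: matches only a role literally named "/john"
# with the patch: read as the regex "john", so "john", "johnny",
# "upjohn", ... would all match this entry unless the pattern is anchored
host    all    /john    192.168.0.0/24    scram-sha-256
```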
Re: pg_create_logical_replication_slot argument incongruency
Thank you! On Tue, 20 Sep 2022 at 12:29, Michael Paquier wrote: > On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote: > > OK, patch only for the docs attached. > > Thanks, applied. > -- > Michael >
Re: Proposal to use JSON for Postgres Parser format
On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov wrote: > On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote: > > Peter Geoghegan writes: > > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote: > > >> Our existing format is certainly not great on those metrics, but > > >> I do not see how "let's use JSON!" is a route to improvement. > > > > > The existing format was designed with developer convenience as a goal, > > > though -- despite my complaints, and in spite of your objections. > > > > As Munro adduces nearby, it'd be a stretch to conclude that the current > > format was designed with any Postgres-related goals in mind at all. > > I think he's right that it's a variant of some Lisp-y dump format that's > > probably far hoarier than even Berkeley Postgres. > > > > > If it didn't have to be easy (or even practical) for developers to > > > directly work with the output format, then presumably the format used > > > internally could be replaced with something lower level and faster. So > > > it seems like the two goals (developer ergonomics and faster > > > interchange format for users) might actually be complementary. > > > > I think the principal mistake in what we have now is that the storage > > format is identical to the "developer friendly" text format (plus or > > minus some whitespace). First we need to separate those. We could > > have more than one equivalent text format perhaps, and I don't have > > any strong objection to basing the text format (or one of them) on > > JSON. > > +1 for considering storage format and text format separately. > > Let's consider what our criteria could be for the storage format. > > 1) Storage effectiveness (shorter is better) and > serialization/deserialization effectiveness (faster is better). On > this criterion, the custom binary format looks perfect. > 2) Robustness in the case of corruption. It seems much easier to > detect the data corruption and possibly make some partial manual > recovery for textual format. > 3) Standartness. It's better to use something known worldwide or at > least used in other parts of PostgreSQL than something completely > custom. From this perspective, JSON/JSONB is better than custom > things. Allow me to add: compressability In the thread surrounding [0] there were complaints about the size of catalogs, and specifically the template database. Significant parts of that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which consists mostly of serialized Nodes. If we're going to replace our current NodeToText infrastructure, we'd better know we can effectively compress this data. In that same thread, I also suggested that we could try to not emit a Node's fields if they contain their default values while serializing; such as the common `:location -1` or `:mynodefield <>`. Those fields still take up space in the format, while conveying no interesting information (the absense of that field in the struct definition would convey the same). It would be useful if this new serialized format would allow us to do similar tricks cheaply. As for JSON vs JSONB for storage: I'm fairly certain that JSONB is less compact than JSON (without taking compression into the picture) due to the 4-byte guaranteed overhead for each jsonb element; while for JSON that is only 2 bytes for each (up to 3 when you consider separators, plus potential extra overhead for escaped values that are unlikely to appear our catalogs). 
Some numbers can be stored more efficiently in JSONB, but only large numbers and small fractions that we're unlikely to hit in system views: a back-of-the-envelope calculation puts the cutoff point of efficient storage between strings-of-decimals and Numeric at >10^12, < -10^11, or very precise fractional values. Kind regards, Matthias van de Meent [0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com
Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions
On Mon, Sep 19, 2022 at 8:19 PM Ashutosh Sharma wrote: > > Hi All, > > Currently, we have pg_current_wal_insert_lsn and pg_walfile_name sql > functions which gives us information about the next wal insert > location and the WAL file that the next wal insert location belongs > to. Can we have a binary version of these sql functions? +1 for the idea in general. As said, pg_waldump seems to be the right candidate. I think we want the lsn of the last WAL record and its info and the WAL file name given an input data directory or just the pg_wal directory or any directory where WAL files are located. For instance, one can use this on an archive location containing archived WAL files or on a node where pg_receivewal is receiving WAL files. Am I missing any other use-cases? pg_waldump currently can't understand compressed and partial files. I think that we need to fix this as well. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
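Purely as a strawman for what such a binary interface could look like — the option name below is hypothetical and does not exist in pg_waldump today:

```sh
# hypothetical invocation; not an existing pg_waldump option
pg_waldump --last-lsn -p /path/to/wal/directory
# imagined output: the LSN of the last WAL record found, its rmgr/info,
# and the name of the WAL file containing it
```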
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Tue, Sep 20, 2022 at 7:20 AM Michael Paquier wrote: > > The main regression test suite should not include direct calls to > pg_backup_start() or pg_backup_stop() as these depend on wal_level, > and we spend a certain amount of resources in keeping the tests a > maximum portable across such configurations, wal_level = minimal being > one of them. One portable example is in 001_stream_rep.pl. Understood. > > That's a good idea. I'm marking a flag if the label name overflows (> > > MAXPGPATH), later using it in do_pg_backup_start() to error out. We > > could've thrown error directly, but that changes the behaviour - right > > now, first " > > wal_level must be set to \"replica\" or \"logical\" at server start." > > gets emitted and then label name overflow error - I don't want to do > > that. > > - if (strlen(backupidstr) > MAXPGPATH) > + if (state->name_overflowed == true) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("backup label too long (max %d bytes)", > It does not strike me as a huge issue to force a truncation of such > backup label names. 1024 is large enough for basically all users, > in my opinion. Or you could just truncate in the internal logic, but > still complain at MAXPGPATH - 1 as the last byte would be for the zero > termination. In short, there is no need to complicate things with > name_overflowed. We currently allow MAXPGPATH bytes of label name excluding null termination. I don't want to change this behaviour. In the attached v9 patch, I'm truncating the larger names to MAXPGPATH + 1 bytes in backup state (one extra byte for representing that the name has overflown, and another extra byte for null termination). > + StringInfo backup_label; /* backup_label file contents */ > + StringInfo tablespace_map; /* tablespace_map file contents */ > + StringInfo history_file; /* history file contents */ > IMV, repeating a point I already made once upthread, BackupState > should hold none of these. Let's just generate the contents of these > files in the contexts where they are needed, making BackupState > something to rely on to build them in the code paths where they are > necessary. This is going to make the reasoning around the memory > contexts where each one of them is stored much easier and reduce the > changes of bugs in the long-term. I've separated out these variables from the backup state, please see the attached v9 patch. > > I've also taken help of the error callback mechanism to clean up the > > allocated memory in case of a failure. For do_pg_abort_backup() cases, > > I think we don't need to clean the memory because that gets called on > > proc exit (before_shmem_exit()). > > Memory could still bloat while the process running the SQL functions > is running depending on the error code path, anyway. I didn't get your point. Can you please elaborate it? I think adding error callbacks at the right places would free up the memory for us. Please note that we already are causing memory leaks on HEAD today. I addressed the above review comments. I also changed a wrong comment [1] that lies before pg_backup_start() even after the removal of exclusive backup. I'm attaching v9 patch set herewith, 0001 - refactors the backup code with backup state, 0002 - adds error callbacks to clean up the memory allocated for backup variables. Please review them further. [1] * Essentially what this does is to create a backup label file in $PGDATA, * where it will be archived as part of the backup dump. 
The label file * contains the user-supplied label string (typically this would be used * to tell where the backup dump will be stored) and the starting time and * starting WAL location for the dump. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From ab1e86ac7fb75a2d2219c7681ead40faf8c01446 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 20 Sep 2022 10:04:29 + Subject: [PATCH v9] Refactor backup related code Refactor backup related code, advantages of doing so are following: 1) backup state is more structured now - all in a single structure, callers can create backup_label contents whenever required, either during the pg_backup_start or the pg_backup_stop or in between. 2) no parsing of backup_label file lines now, no error checking for invalid parsing. 3) backup_label and history file contents have most of the things in common, they can now be created within a single function. 4) makes backup related code extensible and readable. This introduces new source files xlogbackup.c/.h for backup related code and adds the new code in there. The xlog.c file has already grown to ~9000 LOC (as of this time). Eventually, we would want to move all the backup related code from xlog.c, xlogfuncs.c, elsewhere to here. Author: Bharath Rupireddy Reviewed-by: Michael Paquier Reviewed-by: Fujii Masao Discussion: ht
Re: why can't a table be part of the same publication as its schema
On Tue, Sep 20, 2022 at 2:57 PM Alvaro Herrera wrote: > > On 2022-Sep-20, Amit Kapila wrote: > > > On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera > > wrote: > > > > This seems a pretty arbitrary restriction. It feels like you're adding > > > this restriction precisely so that you don't have to write the code to > > > reject the ALTER .. SET SCHEMA if an incompatible configuration is > > > detected. But we already have such checks in several cases, so I don't > > > see why this one does not seem a good idea. > > > > > I agree that we have such checks at other places as well and one > > somewhat similar is in ATPrepChangePersistence(). > > > > ATPrepChangePersistence() > > { > > ... > > ... > > /* > > * Check that the table is not part of any publication when changing to > > * UNLOGGED, as UNLOGGED tables can't be published. > > */ > > Right, I think this is a sensible approach. > > > However, another angle to look at it is that we try to avoid adding > > restrictions in other DDL commands for defined publications. > > Well, it makes sense to avoid restrictions wherever possible. But here, > the consequence is that you end up with a restriction in the publication > definition that is not very sensible. Imagine if you said "you can't > add schema S because it contains an unlogged table". It's absurd. > > Maybe this can be relaxed in a future release, but it's quite odd. > Yeah, we can relax it in a future release based on some field experience, or maybe we can keep the current restriction of not allowing to add a table when the schema of the table is part of the same publication and try to relax that in a future release based on field experience. > > The intention was to be in sync with FOR ALL TABLES. > > I guess we can change both (FOR ALL TABLES and IN SCHEMA) later. > That sounds reasonable. -- With Regards, Amit Kapila.
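For concreteness, this is roughly the kind of command sequence the thread is debating (a sketch only; object names are illustrative, and whether the last statement is rejected is exactly the question under discussion):

```sql
CREATE SCHEMA sales;
CREATE TABLE sales.orders (id int, note text);
CREATE PUBLICATION pub_sales FOR ALL TABLES IN SCHEMA sales;
-- Adding a table (here with a column list) whose schema is already published
ALTER PUBLICATION pub_sales ADD TABLE sales.orders (id);
```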
Re: Support pg_attribute_aligned and noreturn in MSVC
On Mon, Sep 19, 2022 at 11:21 PM Michael Paquier wrote: > > On Mon, Sep 19, 2022 at 08:51:37PM -0400, James Coleman wrote: > > Yes, fixed. > > The CF bot is failing compilation on Windows: > http://commitfest.cputube.org/james-coleman.html > https://api.cirrus-ci.com/v1/task/5376566577332224/logs/build.log > > There is something going on with noreturn() after applying it to > elog.h: > 01:11:00.468] c:\cirrus\src\include\utils\elog.h(410,45): error C2085: > 'ThrowErrorData': not in formal parameter list (compiling source file > src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj] > [01:11:00.468] c:\cirrus\src\include\mb\pg_wchar.h(701,80): error > C2085: 'pgwin32_message_to_UTF16': not in formal parameter list > (compiling source file src/common/encnames.c) > [c:\cirrus\libpgcommon.vcxproj] > [01:11:00.468] c:\cirrus\src\include\utils\elog.h(411,54): error > C2085: 'pg_re_throw': not in formal parameter list (compiling source > file src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj] > > align() seems to look fine, at least. I'd be tempted to apply the > align part first, as that's independently useful for itemptr.h. I don't have access to a Windows machine for testing, but re-reading the documentation it looks like the issue is that our noreturn macro is used after the definition while the MSVC equivalent is used before. I've removed that for now (and commented about it); it's not as valuable anyway since it's mostly an indicator for code analysis (human and machine). James Coleman v3-0001-Support-pg_attribute_aligned-in-MSVC.patch Description: Binary data
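To illustrate the placement difference James describes, here is a minimal sketch (the GCC-style line follows the existing pg_attribute_noreturn() convention in elog.h; the MSVC line shows the standard __declspec spelling, which is why a drop-in macro at the same position does not compile):

```c
/* GCC/Clang style used today: the attribute macro follows the parameter list */
extern void pg_re_throw(void) pg_attribute_noreturn();

/* MSVC requires its equivalent before the declarator */
__declspec(noreturn) void pg_re_throw(void);
```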
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Sep 20, 2022 at 9:02 AM Nathan Bossart wrote: > > On Mon, Sep 19, 2022 at 03:16:42PM -0700, Nathan Bossart wrote: > > It seems like you want the opposite of pg_walfile_name_offset(). Perhaps > > we could add a function like pg_walfile_offset_lsn() that accepts a WAL > > file name and byte offset and returns the LSN. > > Like so... Yeah, something like this will be handy for sure, but I'm not sure if we want this to be in core. Let's hear from others. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Summary function for pg_buffercache
Hi, Seems like the cfbot tests are passing now: https://cirrus-ci.com/build/4727923671302144 Best, Melih On Tue, 20 Sep 2022 at 14:00, Melih Mutlu wrote: > On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev wrote: > >> There was a missing empty line in pg_buffercache.out which made the >> tests fail. Here is a corrected v8 patch. >> > > I was just sending a corrected patch without the missing line. > > Thanks a lot for all these reviews and the corrected patch. > > Best, > Melih >
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Sep 20, 2022 at 12:57 PM Alvaro Herrera wrote: > > On 2022-Sep-19, Bharath Rupireddy wrote: > > > We have a bunch of messages [1] that have an offset, but not LSN in > > the error message. Firstly, is there an easiest way to figure out LSN > > from offset reported in the error messages? If not, is adding LSN to > > these messages along with offset a good idea? Of course, we can't just > > convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but > > something meaningful like reporting the LSN of the page that we are > > reading-in or writing-out etc. > > Maybe add errcontext() somewhere that reports the LSN would be > appropriate. For example, the page_read() callbacks have the LSN > readily available, so the ones in backend could install the errcontext > callback; or perhaps ReadPageInternal can do it #ifndef FRONTEND. Not > sure what is best of those options, but either of those sounds better > than sticking the LSN in a lower-level routine that doesn't necessarily > have the info already. All of the error messages [1] have the LSN from which offset was calculated, I think we can just append that to the error messages (something like " offset %u, LSN %X/%X: %m") and not complicate it. Thoughts? [1] errmsg("could not read from WAL segment %s, offset %u: %m", errmsg("could not read from WAL segment %s, offset %u: %m", errmsg("could not write to log file %s " "at offset %u, length %zu: %m", errmsg("unexpected timeline ID %u in WAL segment %s, offset %u", errmsg("could not read from WAL segment %s, offset %u: read %d of %zu", pg_log_error("received write-ahead log record for offset %u with no file open", "invalid magic number %04X in WAL segment %s, offset %u", "invalid info bits %04X in WAL segment %s, offset %u", "invalid info bits %04X in WAL segment %s, offset %u", "unexpected pageaddr %X/%X in WAL segment %s, offset %u", "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u", -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
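As a hypothetical sketch of that wording, assuming the caller already has the page's start LSN in hand (variable names here are assumptions, not taken from an actual call site):

```c
/* Sketch only: xlogfname, offset and targetPagePtr are illustrative names */
ereport(ERROR,
		(errcode_for_file_access(),
		 errmsg("could not read from WAL segment %s, offset %u, LSN %X/%X: %m",
				xlogfname, offset,
				LSN_FORMAT_ARGS(targetPagePtr))));
```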
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
On Mon, Sep 19, 2022 at 4:42 PM Dmitry Koval wrote: > Thanks for comments and advice! > I thought about this problem and discussed about it with colleagues. > Unfortunately, I don't know of a good general solution. Yeah, me neither. > But for specific situation like this (certain partition is not changing) > we can add CONCURRENTLY modifier. > Our DDL query can be like > > ALTER TABLE...SPLIT PARTITION [CONCURRENTLY]; > > With CONCURRENTLY modifier we can lock partitioned table in > ShareUpdateExclusiveLock mode and split partition - in > AccessExclusiveLock mode. So we don't lock partitioned table in > AccessExclusiveLock mode and can modify other partitions during SPLIT > operation (except split partition). > If smb try to modify split partition, he will receive error "relation > does not exist" at end of operation (because split partition will be drop). I think that a built-in DDL command can't really assume that the user won't modify anything. You'd have to take a ShareLock. But you might be able to have a CONCURRENTLY variant of the command that does the same kind of multi-transaction thing as, e.g., CREATE INDEX CONCURRENTLY. You would probably have to be quite careful about race conditions (e.g. you commit the first transaction and before you start the second one, someone drops or detaches the partition you were planning to merge or split). Might take some thought, but feels possibly doable. I've never been excited enough about this kind of thing to want to put a lot of energy into engineering it, because doing it "manually" feels so much nicer to me, and doubly so given that we now have ATTACH CONCURRENTLY and DETACH CONCURRENTLY, but it does seem like a thing some people would probably use and value. -- Robert Haas EDB: http://www.enterprisedb.com
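For contrast, the "manual" route alluded to above already works with reduced locking today, roughly along these lines (a sketch; table, column, and bound values are illustrative, and only one of the two new halves is shown):

```sql
-- Detach without blocking concurrent use of the parent (runs in two
-- transactions, much like CREATE INDEX CONCURRENTLY)
ALTER TABLE measurements DETACH PARTITION meas_2022 CONCURRENTLY;

-- Build the replacement partition separately, then attach it; ATTACH only
-- needs SHARE UPDATE EXCLUSIVE on the parent
CREATE TABLE meas_2022_h1 (LIKE measurements);
INSERT INTO meas_2022_h1 SELECT * FROM meas_2022 WHERE ts < '2022-07-01';
ALTER TABLE meas_2022_h1
    ADD CONSTRAINT meas_2022_h1_bound
    CHECK (ts >= '2022-01-01' AND ts < '2022-07-01');  -- avoids a validation scan
ALTER TABLE measurements ATTACH PARTITION meas_2022_h1
    FOR VALUES FROM ('2022-01-01') TO ('2022-07-01');
```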
Re: Summary function for pg_buffercache
Hi, Correct me if I’m wrong. The doc says we don’t take locks during pg_buffercache_summary, but I see locks taken in the v8 patch; is that right? ``` Similar to pg_buffercache_pages function pg_buffercache_summary doesn't take buffer manager locks, thus the result is not consistent across all buffers. This is intentional. The purpose of this function is to provide a general idea about the state of shared buffers as fast as possible. Additionally, pg_buffercache_summary allocates much less memory. ``` Regards, Zhang Mingli On Sep 20, 2022, 20:10 +0800, Melih Mutlu wrote: > Hi, > > Seems like the cfbot tests are passing now: > https://cirrus-ci.com/build/4727923671302144 > > Best, > Melih > > > On Tue, 20 Sep 2022 at 14:00, Melih Mutlu > > wrote: > > > On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev > > > wrote: > > > > > There was a missing empty line in pg_buffercache.out which made the > > > > > tests fail. Here is a corrected v8 patch. > > > > > > > > I was just sending a corrected patch without the missing line. > > > > > > > > Thanks a lot for all these reviews and the corrected patch. > > > > > > > > Best, > > > > Melih
Re: Summary function for pg_buffercache
Hi Zhang, > The doc says we don’t take lock during pg_buffercache_summary, but I see > locks in the v8 patch, Isn’t it? > > ``` > Similar to pg_buffercache_pages function > pg_buffercache_summary doesn't take buffer manager > locks [...] > ``` Correct, the procedure doesn't take the locks of the buffer manager. It does take the locks of every individual buffer. I agree that the text is somewhat confusing, but it is consistent with the current description of pg_buffercache [1]. I think this is a problem worth addressing but it also seems to be out of scope of the proposed patch. [1]: https://www.postgresql.org/docs/current/pgbuffercache.html -- Best regards, Aleksander Alekseev
Re: Summary function for pg_buffercache
Hi, Regards, Zhang Mingli On Sep 20, 2022, 20:43 +0800, Aleksander Alekseev , wrote: > > Correct, the procedure doesn't take the locks of the buffer manager. > It does take the locks of every individual buffer. Ah, now I get it, thanks.
Re: Summary function for pg_buffercache
Hi Zhang, Those are two different locks. The locks taken in the patch are buffer header locks. Such a lock covers only the current buffer and makes that particular buffer's info consistent within itself. However, the lock mentioned in the doc is the buffer manager lock, which would prevent changes on any buffer while it's held. pg_buffercache_summary (and pg_buffercache_pages) does not hold the buffer manager lock, so consistency across all buffers is not guaranteed. For pg_buffercache_pages, self-consistent buffer information is useful since it shows each buffer separately. For pg_buffercache_summary, even self-consistency may not matter much since results are aggregated and we can't see individual buffer information. Consistency across all buffers is also not a concern since its purpose is to give an overall idea about the state of buffers. I see that these two different locks in the same context can be confusing. I hope it is a bit clearer now. Best, Melih
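For reference, the intended usage is simply something like the following (assuming the v8 patch is installed; the exact output columns may still change during review):

```sql
CREATE EXTENSION pg_buffercache;
-- One aggregated row instead of one row per buffer
SELECT * FROM pg_buffercache_summary();
```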
Re: Summary function for pg_buffercache
Hi, On Sep 20, 2022, 20:49 +0800, Melih Mutlu , wrote: > Hi Zhang, > > Those are two different locks. > The locks that are taken in the patch are for buffer headers. This locks only > the current buffer and makes that particular buffer's info consistent within > itself. > > However, the lock mentioned in the doc is for buffer manager which would > prevent changes on any buffer if it's held. > pg_buffercache_summary (and pg_buffercache_pages) does not hold buffer > manager lock. Therefore, consistency across all buffers is not guaranteed. > > For pg_buffercache_pages, self-consistent buffer information is useful since > it shows each buffer separately. > > For pg_buffercache_summary, even self-consistency may not matter much since > results are aggregated and we can't see individual buffer information. > Consistency across all buffers is also not a concern since its purpose is to > give an overall idea about the state of buffers. > > I see that these two different locks in the same context can be confusing. I > hope it is a bit more clear now. > > Best, > Melih Thanks for your explanation, LGTM.
Re: HOT chain validation in verify_heapam()
On Tue, Sep 20, 2022 at 5:00 AM Himanshu Upadhyaya wrote: > Please find it attached. This patch still has no test cases. Just as we have test cases for the existing corruption checks, we should have test cases for these new corruption checks, showing cases where they actually fire. I think I would be inclined to set lp_valid[x] = true in both the redirected and non-redirected case, and then have the very first thing that the second loop does be if (nextoffnum == 0 || !lp_valid[ctx.offnum]) continue. I think that would be more clear about the intent to ignore line pointers that failed validation. Also, if you did it that way, then you could catch the case of a redirected line pointer pointing to another redirected line pointer, which is a corruption condition that the current code does not appear to check. +/* + * Validation via the predecessor array. 1) If the predecessor's + * xmin is aborted or in progress, the current tuples xmin should + * be aborted or in progress respectively. Also both xmin's must + * be equal. 2) If the predecessor's xmin is not frozen, then + * current tuple's shouldn't be either. 3) If the predecessor's + * xmin is equal to the current tuple's xmin, the current tuple's + * cmin should be greater than the predecessor's cmin. 4) If the + * current tuple is not HOT then its predecessor's tuple must not + * be HEAP_HOT_UPDATED. 5) If the current tuple is HOT then its + * predecessor's tuple must be HEAP_HOT_UPDATED. + */ This comment needs to be split up into pieces and the pieces need to be moved closer to the tests to which they correspond. + psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is not frozen", Shouldn't this say "which is frozen"? + * Not a corruption if current tuple is updated/deleted by a + * different transaction, means t_cid will point to cmax (that is + * command id of deleting transaction) and cid of predecessor not + * necessarily will be smaller than cid of current tuple. t_cid I think that the next person who reads this code is likely to understand that the CIDs of different transactions are numerically unrelated. What's less obvious is that if the XID is the same, the newer update must have a higher CID. + * can hold combo command id but we are not worrying here since + * combo command id of the next updated tuple (if present) must be + * greater than combo command id of the current tuple. So here we + * are not checking HEAP_COMBOCID flag and simply doing t_cid + * comparison. I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message claiming that the CID has a certain value when that's actually a combo CID is misleading, so at least a different message wording is needed in such cases. But it's also not clear to me that the newer update has to have a higher combo CID, because combo CIDs can be reused. If you have multiple cursors open in the same transaction, the updates can be interleaved, and it seems to me that it might be possible for an older CID to have created a certain combo CID after a newer CID, and then both cursors could update the same page in succession and end up with combo CIDs out of numerical order. Unless somebody can make a convincing argument that this isn't possible, I think we should just skip this check for cases where either tuple has a combo CID. 
+if (TransactionIdEquals(pred_xmin, curr_xmin) && +(TransactionIdEquals(curr_xmin, curr_xmax) || + !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin) I don't understand the reason for the middle part of this condition -- TransactionIdEquals(curr_xmin, curr_xmax) || !TransactionIdIsValid(curr_xmax). I suppose the comment is meant to explain this, but I still don't get it. If a tuple with XMIN 12345 CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's corruption, regardless of what the XMAX of the second tuple may happen to be. +if (HeapTupleHeaderIsHeapOnly(curr_htup) && +!HeapTupleHeaderIsHotUpdated(pred_htup)) +if (!HeapTupleHeaderIsHeapOnly(curr_htup) && +HeapTupleHeaderIsHotUpdated(pred_htup)) I think it would be slightly clearer to write these tests the other way around i.e. check the previous tuple's state first. +if (!TransactionIdIsValid(curr_xmax) && HeapTupleHeaderIsHotUpdated(tuphdr)) +{ +report_corruption(ctx, + psprintf("tuple has been updated, but xmax is 0")); +result = false; +} I guess this message needs to say "tuple has been HOT updated, but xmax is 0" or something like that. -- Robert Haas EDB: http://www.enterprisedb.com
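One way the combo-CID suggestion could look in code, as a sketch only (variable names are reused from the quoted patch, HEAP_COMBOCID is the existing infomask bit; this is not the committed check):

```c
/*
 * Sketch: if either tuple's t_cid is actually a combo CID, its numeric
 * value cannot be compared meaningfully, so skip the cmin-ordering check
 * rather than report a bogus corruption.
 */
if ((pred_htup->t_infomask & HEAP_COMBOCID) == 0 &&
    (curr_htup->t_infomask & HEAP_COMBOCID) == 0 &&
    TransactionIdEquals(pred_xmin, curr_xmin) &&
    pred_cmin >= curr_cmin)
    report_corruption(ctx,
                      psprintf("tuple with cmin %u was updated to produce a tuple at offset %u with cmin %u",
                               pred_cmin, (unsigned) ctx.offnum, curr_cmin));
```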
Re: why can't a table be part of the same publication as its schema
On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz wrote: > For #1 (allowing calls that have schema/table overlap...), there appears > to be both a patch that allows this (reversing[8]), and a suggestion for > dealing with a corner-case that is reasonable, i.e. disallowing adding > schemas to a publication when specifying column-lists. Do we think we > can have consensus on this prior to the RC1 freeze? I am not sure whether we can or should rush a fix in that fast, but I agree with this direction. > For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on > the original thread[1][3][4][5][7]. I thought Tom's proposal on the > syntax[3] was reasonable as it "future proofs" for when we allow other > schema-scoped objects to be published and give control over which ones > can be published. All right, well, I still don't like it and think it's confusing, but perhaps I'm in the minority. > I don't think we should change this behavior that's already in logical > replication. While I understand the reasons why "GRANT ... ALL TABLES IN > SCHEMA" has a different behavior (i.e. it's not applied to future > objects) and do not advocate to change it, I have personally been > affected where I thought a permission would be applied to all future > objects, only to discover otherwise. I believe it's more intuitive to > think that "ALL" applies to "everything, always." Nah, there's room for multiple behaviors here. It's reasonable to want to add all the tables currently in the schema to a publication (or grant permissions on them) and it's reasonable to want to include all current and future tables in the schema in a publication (or grant permissions on them) too. The reason I don't like the ALL TABLES IN SCHEMA syntax is that it sounds like the former, but actually is the latter. Based on your link to the email from Tom, I understand now the reason why it's like that, but it's still counterintuitive to me. > For #3 (more superuser-only), in general I do agree that we shouldn't be > adding more of these. However, we have in this release, and not just to > this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I > think it's easier for us to "relax" privileges (e.g. w/predefined roles) > than to make something "superuser-only" in the future, so I don't see > this being a blocker for v15. The feature will continue to work for > users even if we remove "superuser-only" in the future. Yeah, this is clearly not a release blocker, I think. -- Robert Haas EDB: http://www.enterprisedb.com
Re: On login trigger: take three
On 02.09.2022 18:36, Daniel Gustafsson wrote: This had bitrotted a fair bit, attached is a rebase along with (mostly) documentation fixes. 0001 adds a generic GUC for ignoring event triggers and 0002 adds the login event with event trigger support, and hooks it up to the GUC such that broken triggers wont require single-user mode. Moving the CF entry back to Needs Review. Hello! There is a race around setting and clearing of dathasloginevt. Steps to reproduce: 1. Create a trigger: CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$ BEGIN RAISE NOTICE 'You are welcome!'; END; $$ LANGUAGE plpgsql; CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE on_login_proc(); 2. Then drop it, but don't start new sessions: DROP EVENT TRIGGER on_login_trigger; 3. Create another trigger, but don't commit yet: BEGIN; CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE on_login_proc(); 4. Start a new session. This clears dathasloginevt. 5. Commit the transaction: COMMIT; Now we have a trigger, but it doesn't fire, because dathasloginevt=false. If two sessions create triggers concurrently, one of them will fail. Steps: 1. In the first session, start a transaction and create a trigger: BEGIN; CREATE EVENT TRIGGER on_login_trigger1 ON login EXECUTE PROCEDURE on_login_proc(); 2. In the second session, create another trigger (this query blocks): CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE on_login_proc(); 3. Commit in the first session: COMMIT; The second session fails: postgres=# CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE on_login_proc(); ERROR: tuple concurrently updated What else bothers me is that login triggers execute in an environment under user control and one has to be very careful. The example trigger from the documentation +DECLARE + hour integer = EXTRACT('hour' FROM current_time); + rec boolean; +BEGIN +-- 1. Forbid logging in between 2AM and 4AM. +IF hour BETWEEN 2 AND 4 THEN + RAISE EXCEPTION 'Login forbidden'; +END IF; can be bypassed with PGOPTIONS='-c timezone=...'. Probably this is nothing new and concerns any SECURITY DEFINER function, but still... Best regards, -- Sergey Shinderukhttps://postgrespro.com/
Re: why can't a table be part of the same publication as its schema
On 2022-Sep-13, Kyotaro Horiguchi wrote: > At Mon, 12 Sep 2022 04:26:48 +, "houzj.f...@fujitsu.com" > wrote in > > I feel we'd better compare the syntax with the existing publication command: > > FOR ALL TABLES. If you create a publication FOR ALL TABLES, it means > > publishing > > all the tables in the database *including* tables created in the future. I > > think both the syntax and meaning of ALL TABLES IN SCHEMA are consistent > > with > > the existing FOR ALL TABLES. > > IMHO, I feel closer to Robert. "ALL TABLES IN SCHEMA" sounds like the > concrete tables at the time of invocation. While I agree that it is > not directly comparable to GRANT, What if we remove the ALL keyword from there? That would leave us with "FOR TABLES IN SCHEMA", which seems to better convey that it doesn't restrict to current tables in there. > but if I see "ALTER PUBLICATION p1 ADD SCHEMA s1", I automatically > translate that into "all tables in the schema s1 at the time of using > this publication". ... but that translation is wrong if replication supports other kinds of objects, as it inevitably will in the near future. Clearly the fact that we spell out TABLES there is important. When we add support for sequences, we could have combinations ADD [ALL] TABLES IN SCHEMA s ADD [ALL] SEQUENCES IN SCHEMA s ADD [ALL] TABLES AND SEQUENCES IN SCHEMA s and at that point, the unadorned ADD SCHEMA one will become ambiguous. > At least, it would cause less confusion when it were "ALT PUB p1 DROP > SCEMA s1" aginst "DROP ALL TABLES IN SCHEMA s1". I'm not sure what you mean here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "En las profundidades de nuestro inconsciente hay una obsesiva necesidad de un universo lógico y coherente. Pero el universo real se halla siempre un paso más allá de la lógica" (Irulan)
Re: On login trigger: take three
> On 20 Sep 2022, at 15:43, Sergey Shinderuk wrote: > > On 02.09.2022 18:36, Daniel Gustafsson wrote: >> This had bitrotted a fair bit, attached is a rebase along with (mostly) >> documentation fixes. 0001 adds a generic GUC for ignoring event triggers and >> 0002 adds the login event with event trigger support, and hooks it up to the >> GUC such that broken triggers won't require single-user mode. Moving the CF >> entry back to Needs Review. > There is a race around setting and clearing of dathasloginevt. Thanks for the report! The whole dathasloginevt mechanism is IMO too clunky and inelegant to go ahead with; I wouldn't be surprised if there are other bugs lurking there. Since the original authors seem to have retired from the patch (I've only rebased it to try and help), I am inclined to mark it as returned with feedback. -- Daniel Gustafsson https://vmware.com/
Re: cataloguing NOT NULL constraints
On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera wrote: The NULL checks would still be mostly done by the attnotnull checks > internally, so there shouldn't be too much of a difference. > > .. though I'm now wondering if there's additional overhead from checking > the constraint twice on each row: first the attnotnull bit, then the > CHECK itself. Hmm. That's probably quite bad. > Another reason to treat NOT NULL-implementing constraints differently. My thinking is that pg_constraint entries for NOT NULL columns are mostly an implementation detail. I've certainly never cared whether I had an actual constraint corresponding to my NOT NULL columns. So I think marking them as such, or a different contype, and excluding them from \d+ display, probably makes sense. Just need to deal with the issue of trying to create a constraint and having its name conflict with a NOT NULL constraint. Could it work to reserve [field name]_notnull for NOT NULL-implementing constraints? I'd be worried about what happens with field renames; renaming the constraint automatically seems a bit weird, but maybe…
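To make the naming concern concrete, a hypothetical sequence (the "<column>_notnull" naming scheme is an assumption for illustration, not what the patch actually does):

```sql
-- Suppose NOT NULL-implementing constraints were auto-named "<column>_notnull"
CREATE TABLE t (a int NOT NULL);                        -- would create "a_notnull"
ALTER TABLE t ADD CONSTRAINT a_notnull CHECK (a > 0);   -- now a name conflict
ALTER TABLE t RENAME COLUMN a TO b;                     -- does "a_notnull" get renamed too?
```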
Re: making relfilenodes 56 bits
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar wrote: > > On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar wrote: > > > On a separate note, while reviewing the latest patch I see there is some > > risk of using the unflushed relfilenumber in GetNewRelFileNumber() > > function. Basically, in the current code, the flushing logic is tightly > > coupled with the logging new relfilenumber logic and that might not work > > with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD. So the idea > > is we need to keep the flushing logic separate from the logging, I am > > working on the idea and I will post the patch soon. > > I have fixed the issue, so now we will track nextRelFileNumber, > loggedRelFileNumber and flushedRelFileNumber. So whenever > nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the > loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more > relfilenumbers. And whenever nextRelFileNumber reaches the > flushedRelFileNumber then we will do XlogFlush for WAL upto the last > loggedRelFileNumber. Ideally flushedRelFileNumber should always be > VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can > avoid tracking the flushedRelFileNumber. But I feel keeping track of > the flushedRelFileNumber looks cleaner and easier to understand. For > more details refer to the code in GetNewRelFileNumber(). > Here are a few minor suggestions I came across while reading this patch, might be useful: +#ifdef USE_ASSERT_CHECKING + + { Unnecessary space after USE_ASSERT_CHECKING. -- + return InvalidRelFileNumber;/* placate compiler */ I don't think we needed this after the error on the latest branches. -- + LWLockAcquire(RelFileNumberGenLock, LW_SHARED); + if (shutdown) + checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; + else + checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber; + + LWLockRelease(RelFileNumberGenLock); This is done for the good reason, I think, it should have a comment describing different checkPoint.nextRelFileNumber assignment need and crash recovery perspective. -- +#define SizeOfRelFileLocatorBackend \ + (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId)) Can append empty parenthesis "()" to the macro name, to look like a function call at use or change the macro name to uppercase? -- + if (val < 0 || val > MAX_RELFILENUMBER) .. if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ How about adding a macro for this condition as RelFileNumberIsValid()? We can replace all the checks referring to MAX_RELFILENUMBER with this. Regards, Amul
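One possible shape of the macro suggested in that last point, as a sketch (the name follows the review comment and MAX_RELFILENUMBER comes from the patch under review; the actual patch may choose differently):

```c
/* Sketch of the suggested range-check macro; not the patch's definition */
#define RelFileNumberIsValid(relfilenumber) \
	((relfilenumber) >= 0 && (relfilenumber) <= MAX_RELFILENUMBER)
```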
Re: cataloguing NOT NULL constraints
On 2022-Sep-20, Isaac Morland wrote: > On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera > wrote: > > > .. though I'm now wondering if there's additional overhead from checking > > the constraint twice on each row: first the attnotnull bit, then the > > CHECK itself. Hmm. That's probably quite bad. > > Another reason to treat NOT NULL-implementing constraints differently. Yeah. > My thinking is that pg_constraint entries for NOT NULL columns are mostly > an implementation detail. I've certainly never cared whether I had an > actual constraint corresponding to my NOT NULL columns. Naturally, all catalog entries are implementation details; a user never really cares if an entry exists or not, only that the desired semantics are provided. In this case, we want the constraint row because it gives us some additional features, such as the ability to mark NOT NULL constraints NOT VALID and validate them later, which is a useful thing to do in large production databases. We have some hacks to provide part of that functionality using straight CHECK constraints, but you cannot cleanly get the `attnotnull` flag set for a column (which means it's hard to add a primary key, for example). It is also supposed to fix some inconsistencies such as disallowing removal of a constraint on a table when it is implied by a constraint on an ancestor table. Right now we have ad-hoc protections for partitions, but we don't do that for legacy inheritance. That said, the patch I posted for this ~10 years ago used a separate contype and was simpler than what I ended up with now, but amusingly enough it was returned at the time with the argument that it would be better to treat them as normal CHECK constraints; so I want to be very sure that we're not just going around in circles. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
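As an illustration of what the catalog entries are meant to enable, today's CHECK-based workaround looks roughly like this (a sketch; the point of the patch is to get the same two-step validation for NOT NULL itself, with attnotnull set properly):

```sql
-- Existing workaround: add the constraint without scanning the table ...
ALTER TABLE big_table ADD CONSTRAINT a_not_null CHECK (a IS NOT NULL) NOT VALID;
-- ... then validate later, taking only a weak lock during the scan
ALTER TABLE big_table VALIDATE CONSTRAINT a_not_null;
```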
Re: RFC: Logging plan of the running query
On 2022-09-19 17:47, Алена Рыбакина wrote: Thanks for your review and comments! Hi, I'm sorry,if this message is duplicated previous this one, but I'm not sure that the previous message is sent correctly. I sent it from email address a.rybak...@postgrespro.ru and I couldn't send this one email from those address. I've successfully received your mail from both a.rybak...@postgrespro.ru and lena.riback...@yandex.ru. I like idea to create patch for logging query plan. After reviewing this code and notice some moments and I'd rather ask you some questions. Firstly, I suggest some editing in the comment of commit. I think, it is turned out the more laconic and the same clear. I wrote it below since I can't think of any other way to add it. ``` Currently, we have to wait for finishing of the query execution to check its plan. This is not so convenient in investigation long-running queries on production environments where we cannot use debuggers. To improve this situation there is proposed the patch containing the pg_log_query_plan() function which request to log plan of the specified backend process. By default, only superusers are allowed to request log of the plan otherwise allowing any users to issue this request could create cause lots of log messages and it can lead to denial of service. At the next requesting CHECK_FOR_INTERRUPTS(), the target backend logs its plan at LOG_SERVER_ONLY level and therefore this plan will appear in the server log only, not to be sent to the client. Thanks, I have incorporated your comments. Since the latter part of the original message comes from the commit message of pg_log_backend_memory_contexts(43620e328617c), so I left it as it was for consistency. Secondly, I have question about deleting USE_ASSERT_CHECKING in lock.h? It supposed to have been checked in another placed of the code by matching values. I am worry about skipping errors due to untesting with assert option in the places where it (GetLockMethodLocalHash) participates and we won't able to get core file in segfault cases. I might not understand something, then can you please explain to me? Since GetLockMethodLocalHash() is only used for assertions, this is only defined when USE_ASSERT_CHECKING is enabled. This patch uses GetLockMethodLocalHash() not only for the assertion purpose, so I removed "ifdef USE_ASSERT_CHECKING" for this function. I belive it does not lead to skip errors. Thirdly, I have incomprehension of the point why save_ActiveQueryDesc is declared in the pquery.h? I am seemed to save_ActiveQueryDesc to be used in an once time in the ExecutorRun function and its declaration superfluous. I added it in the attached patch. Exactly. Fourthly, it seems to me there are not enough explanatory comments in the code. I also added them in the attached patch. Thanks! | + /* | +* Save value of ActiveQueryDesc before having called | +* ExecutorRun_hook function due to having reset by | +* AbortSubTransaction. | +*/ | + | save_ActiveQueryDesc = ActiveQueryDesc; | ActiveQueryDesc = queryDesc; | | @@ -314,6 +320,7 @@ ExecutorRun(QueryDesc *queryDesc, | else | standard_ExecutorRun(queryDesc, direction, count, execute_once); | | + /* We set the actual value of ActiveQueryDesc */ | ActiveQueryDesc = save_ActiveQueryDesc; Since these processes are needed for nested queries, not only for AbortSubTransaction[1], added comments on it. | +/* Function to handle the signal to output the query plan. 
*/ | extern void HandleLogQueryPlanInterrupt(void); I feel this comment is unnecessary since the explanation of HandleLogQueryPlanInterrupt() is written in explain.c and no functions in explain.h have comments in it. Lastly, I have incomprehension about handling signals since have been unused it before. Could another signal disabled calling this signal to log query plan? I noticed this signal to be checked the latest in procsignal_sigusr1_handler function. Are you concerned that one signal will not be processed when multiple signals are sent in succession? AFAIU both of them are processed since SendProcSignal flags ps_signalFlags for each signal. ``` SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId) { volatile ProcSignalSlot *slot; ...(snip)... 278 if (slot->pss_pid == pid) 279 { 280 /* Atomically set the proper flag */ 281 slot->pss_signalFlags[reason] = true; 282 /* Send signal */ 283 return kill(pid, SIGUSR1); ``` Comments of ProcSignalReason also say 'We can cope with concurrent signals for different reasons'. ```C /* * Reasons for signaling a Postgres child process (a backend or an auxiliary * process, like checkpointer). We can cope with concurrent signals for different * reasons. However, if the same reason is signaled multiple times in quick * succession, the process is likely to observe only one notification of
Re: why can't a table be part of the same publication as its schema
> On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz wrote: > > "When a partitioned table is added to a publication, all of its existing and > future partitions are implicitly considered to be part of the > publication."[10] > > Additionally, this is the behavior that is already present in "FOR ALL > TABLES": > > "Marks the publication as one that replicates changes for all tables in the > database, including tables created in the future."[10] > > I don't think we should change this behavior that's already in logical > replication. The existing behavior in logical replication doesn't have any "IN SCHEMA" qualifiers. > While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a > different behavior (i.e. it's not applied to future objects) and do not > advocate to change it, I have personally been affected where I thought a > permission would be applied to all future objects, only to discover > otherwise. I believe it's more intuitive to think that "ALL" applies to > "everything, always." The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means. In GRANT it means "currently in schema, computed now." We are about to create confusion by adding the "IN SCHEMA" phrase to publication commands meaning "later in schema, computed then." A user who diligently consults the documentation for one command to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal to use JSON for Postgres Parser format
On 2022-Sep-20, Matthias van de Meent wrote: > Allow me to add: compressibility > > In the thread surrounding [0] there were complaints about the size of > catalogs, and specifically the template database. Significant parts of > that (688kB of 8080kB in a fresh PG14 database) are in pg_rewrite, which > consists mostly of serialized Nodes. If we're going to replace our > current NodeToText infrastructure, we'd better know we can effectively > compress this data. True. Currently, the largest ev_action values compress pretty well. I think if we wanted this to be more succinct, we would have to invent some binary format -- perhaps something like Protocol Buffers: it'd be stored in the binary format in catalogs, but for output it would be converted into something easy to read (we already do this for pg_statistic_ext_data for example). We'd probably lose compressibility, but that'd be okay because the binary format would already remove most of the redundancy by nature. Do we want to go there? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Java is clearly an example of money oriented programming" (A. Stepanov)
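For anyone who wants to reproduce the measurement being discussed, something along these lines gives a rough idea (numbers vary by version and build):

```sql
-- Total on-disk size of pg_rewrite, and the serialized-node portion of it
SELECT pg_size_pretty(pg_total_relation_size('pg_rewrite'));
SELECT pg_size_pretty(sum(pg_column_size(ev_action))::bigint) FROM pg_rewrite;
```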
Re: [PATCH]Feature improvement for MERGE tab completion
On 2022-Sep-18, Fujii Masao wrote: > The tab-completion code for MERGE was added in the middle of that for LOCK TABLE. > This seems to be an oversight in the commit that originally supported > tab-completion > for MERGE. I fixed this issue. Argh, thanks. > "MERGE" was tab-completed just after "EXPLAIN" or "EXPLAIN ANALYZE", etc. > Since "INTO" always follows "MERGE", it's better to complete with "MERGE INTO" > there. I replaced "MERGE" with "MERGE INTO" in those tab-completions. OK, that would be similar to REFRESH MATERIALIZED VIEW. The rules starting at line 4111 make me a bit nervous, since nothing restricts them to operating only on MERGE lines. I don't think it's a real problem since USING is not terribly common anyway. Likewise for the ones with WHEN [NOT] MATCHED. I kinda wish we had a way to search for stuff like "keyword MERGE appears earlier in the command", but we don't have that. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
Re: Tree-walker callbacks vs -Wdeprecated-non-prototype
I wrote: > (That verbiage is from the gcc manual; clang seems to act the same > except that -Wcast-function-type is selected by -Wall, or perhaps is > even on by default.) Nah, scratch that: the reason -Wcast-function-type is on is that we explicitly enable it, and have done so since de8feb1f3 (v14). I did not happen to see this warning with gcc because the test runs I made with this patch already had c35ba141d, whereas I did my clang test on another machine that wasn't quite up to HEAD. So we should have good warning coverage for bogus walker signatures on both compilers. regards, tom lane
Re: A question about wording in messages
On 2022-Sep-14, Kyotaro Horiguchi wrote: > At Tue, 13 Sep 2022 22:38:46 -0400, Tom Lane wrote in > > Kyotaro Horiguchi writes: > > > I saw the following message recently modified. > > >> This controls the maximum distance we can read ahead in the WAL to > > >> prefetch referenced data blocks. > > > Maybe the "we" means "PostgreSQL program and you" but I see it > > > somewhat out of place. > > > > +1, I saw that today and thought it was outside our usual style. > > The whole thing is awfully verbose for a GUC description, too. > > Maybe > > > > "Maximum distance to read ahead in WAL to prefetch data blocks." I failed to notice this issue. I agree it's unusual and +1 for changing it. > It seems to sufficiently work for average users and rather easy to > read, but it looks a short description. > So, taking the middle of them, how about the following? > > Short: Buffer size for reading ahead in the WAL during recovery. > Extra: This controls the maximum distance to read ahead in WAL to prefetch > data blocks." But why do we care that it's short? We don't need it to be long .. we only need it to explain what it needs to explain. After spending way too much time editing this line, I ended up with exactly what Tom proposed, so +1 for his version. I think "This controls" adds nothing very useful, and we don't have it anywhere else, except tcp_keepalives_count from where I also propose to remove it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans) >From 5b8bf15ed5d9f1a21150039da33a557f027640d4 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 20 Sep 2022 18:19:34 +0200 Subject: [PATCH] fix some GUC strings --- src/backend/utils/misc/guc_tables.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 550e95056c..ab3140ff3a 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2591,7 +2591,7 @@ struct config_int ConfigureNamesInt[] = { {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_RECOVERY, gettext_noop("Buffer size for reading ahead in the WAL during recovery."), - gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced data blocks."), + gettext_noop("Maximum distance to read ahead in the WAL to prefetch referenced data blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size, @@ -3248,7 +3248,7 @@ struct config_int ConfigureNamesInt[] = { {"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Maximum number of TCP keepalive retransmits."), - gettext_noop("This controls the number of consecutive keepalive retransmits that can be " + gettext_noop("Number of consecutive keepalive retransmits that can be " "lost before a connection is considered dead. A value of 0 uses the " "system default."), }, -- 2.30.2
default sorting behavior for index
Hi, I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t. constraints: if (indclass->values[i] != defopclass || attform->attcollation != index_rel->rd_indcollation[i] || attoptions != (Datum) 0 || index_rel->rd_indoption[i] != 0) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("index \"%s\" column number %d does not have default sorting behavior", index_name, i + 1), errdetail("Cannot create a primary key or unique constraint using such an index."), It seems this check first came in via `Indexes with INCLUDE columns and their support in B-tree`. If the index has DESC sort order, why can't it be used to back a constraint? A concrete example would help me understand this. Thanks
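A concrete case that reaches this check, in case it helps (a sketch against a stock build; the DESC ordering sets rd_indoption for the column, which trips the error):

```sql
CREATE TABLE t (a int NOT NULL);
CREATE UNIQUE INDEX t_a_desc_idx ON t (a DESC);
ALTER TABLE t ADD CONSTRAINT t_a_key UNIQUE USING INDEX t_a_desc_idx;
-- ERROR:  index "t_a_desc_idx" column number 1 does not have default sorting behavior
-- DETAIL:  Cannot create a primary key or unique constraint using such an index.
```

One plausible reason for the restriction (not stated in the code itself): the constraint syntax has no way to carry a non-default ordering, opclass, or options, so a constraint recreated from a dump could not rebuild an equivalent index.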
Re: cataloguing NOT NULL constraints
On Tue, Sep 20, 2022 at 10:39 AM Alvaro Herrera wrote: > That said, the patch I posted for this ~10 years ago used a separate > contype and was simpler than what I ended up with now, but amusingly > enough it was returned at the time with the argument that it would be > better to treat them as normal CHECK constraints; so I want to be very > sure that we're not just going around in circles. I don't have an intrinsic view on whether we ought to have one contype or two, but I think this comment of yours from a few messages ago is right on point: ".. though I'm now wondering if there's additional overhead from checking the constraint twice on each row: first the attnotnull bit, then the CHECK itself. Hmm. That's probably quite bad." For that exact reason, it seems absolutely necessary to be able to somehow identify the "redundant" check constraints, so that we don't pay the expense of revalidating them. Another contype would be one way of identifying such constraints, but it could probably be done in other ways, too. Perhaps it could even be discovered dynamically, like when we build a relcache entry. I actually have no idea what design is best. I am a little confused as to why we want to do this, though. While we're on the topic of what is more complicated and simpler, what functionality do we get out of adding all of these additional catalog entries that then have to be linked back to the corresponding attnotnull markings? And couldn't we get that functionality in some much simpler way? Like, if you want to track whether the NOT NULL constraint has been validated, we could add an attnotnullvalidated column, or probably better, change the existing attnotnull column to a character used as an enum, or maybe an integer bit-field of some kind. I'm not trying to blow up your patch with dynamite or anything, but I'm a little suspicious that this may be one of those cases where pgsql-hackers discussed turns a complicated project into an even more complicated project to no real benefit. One thing that I don't particularly like about this whole design is that it feels like it creates a bunch of catalog bloat. Now all of the attnotnull flags also generate additional pg_constraint rows. The catalogs in the default install will be bigger than before, and the catalogs after user tables are created will be more bigger. If we get some nifty benefit out of all that, cool! But if we're just anti-optimizing the catalog storage out of some feeling that the result will be intellectually purer than some competing design, maybe we should reconsider. It's not stupid to optimize for common special cases, and making a column as NOT NULL is probably at least one and maybe several orders of magnitude more common than putting some other CHECK constraint on it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: making relfilenodes 56 bits
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar wrote: > [ new patch ] +typedef pg_int64 RelFileNumber; This seems really random to me. First, why isn't this an unsigned type? OID is unsigned and I don't see a reason to change to a signed type. But even if we were going to change to a signed type, why pg_int64? That is declared like this: /* Define a signed 64-bit integer type for use in client API declarations. */ typedef PG_INT64_TYPE pg_int64; Surely this is not a client API declaration Note that if we change this a lot of references to INT64_FORMAT will need to become UINT64_FORMAT. I think we should use int64 at the SQL level, because we don't have an unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits. So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or similar. But internally I think using unsigned is cleaner. + * RelFileNumber is unique within a cluster. Not really, because of CREATE DATABASE. Probably just drop this line. Or else expand it: we never assign the same RelFileNumber twice within the lifetime of the same cluster, but there can be multiple relations with the same RelFileNumber e.g. because CREATE DATABASE duplicates the RelFileNumber values from the template database. But maybe we don't need this here, as it's already explained in relfilelocator.h. +ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS); Why not declare ret as ForkNumber instead of casting twice? +uint64 relnum; + +Assert(relnumber <= MAX_RELFILENUMBER); +Assert(forknum <= MAX_FORKNUM); + +relnum = relnumber; Perhaps it'd be better to write uint64 relnum = relnumber instead of initializing on a separate line. +#define RELNUMBERCHARS 20 /* max chars printed by %llu */ Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if there's some reason to stick with a signed type). +elog(ERROR, "relfilenumber is out of bound"); It would have to be "out of bounds", with an "s". But maybe "is too large" would be better. +nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; +loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber; +flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber; Maybe it would be a good idea to asset that next <= flushed and flushed <= logged? +#ifdef USE_ASSERT_CHECKING + +{ +RelFileLocatorBackend rlocator; +char *rpath; Let's add a comment here, like "Because the RelFileNumber counter only ever increases and never wraps around, it should be impossible for the newly-allocated RelFileNumber to already be in use. But, if Asserts are enabled, double check that there's no main-fork relation file with the new RelFileNumber already on disk." +elog(ERROR, "cannot forward RelFileNumber during recovery"); forward -> set (or advance) +if (relnumber >= ShmemVariableCache->loggedRelFileNumber) It probably doesn't make any difference, but to me it seems better to test flushedRelFileNumber rather than logRelFileNumber here. What do you think? /* * We set up the lockRelId in case anything tries to lock the dummy - * relation. Note that this is fairly bogus since relNumber may be - * different from the relation's OID. It shouldn't really matter though. - * In recovery, we are running by ourselves and can't have any lock - * conflicts. While syncing, we already hold AccessExclusiveLock. + * relation. Note we are setting relId to just FirstNormalObjectId which + * is completely bogus. It shouldn't really matter though. In recovery, + * we are running by ourselves and can't have any lock conflicts. 
While + * syncing, we already hold AccessExclusiveLock. */ rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid; -rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber; +rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId; Boy, this makes me uncomfortable. The existing logic is pretty bogus, and we're replacing it with some other bogus thing. Do we know whether anything actually does try to use this for locking? One notable difference between the existing logic and your change is that, with the existing logic, we use a bogus value that will differ from one relation to the next, whereas with this change, it will always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId = (Oid) rlocator.relNumber would be a more natural adaptation? +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\ +do {\ +if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ +ereport(ERROR, \ +errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ +errmsg("relfilenumber %lld is out of range",\ +(long long) (relfilenumber))); \ +} while (0) Here, you take the approach of casting the relfilenumber to
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On Mon, Sep 19, 2022 at 9:09 PM Tom Lane wrote: > You have to assume that somebody (a) has a role or DB name starting > with slash, (b) has an explicit reference to that name in their > pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't > notice that things are misbehaving until after some hacker manages > to break into their installation on the strength of the misbehaving > entry. OK, I'll grant that the probability of (c) is depressingly > close to unity; but each of the other steps seems quite low probability. > All four of them happening in one installation is something I doubt > will happen. I can't argue with (a) or (b), but (d) seems decently likely to me. If your normal user base consists of people who are authorized to access your system, what clues would you have that your HBA is silently failing open? --Jacob
Re: A question about wording in messages
On 2022-Sep-16, Amit Kapila wrote: > We only want to commit the changes to 15 (a) if those fix a problem > introduced in 15, or (b) those are for a bug fix. I think the error > message improvements fall into none of those categories, we can map it > to (b) but I feel those are an improvement in the current messages and > don't seem critical to me. IMO at least the GUC one does fix a problem related to the wording of a user-visible message, which also flows into the translations. I prefer to have that one fixed in 15 also. The other messages (errors) don't seem very interesting because they're not as visible, so I don't care if those are not backpatched. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "This is a foot just waiting to be shot" (Andrew Dunstan)
Auto explain after query timeout
Hopefully I'm not missing something obvious, but as far as I know there's no way to configure auto_explain to log a query's plan when statement_timeout fires. I'd like to look into this at some point, but I'm wondering if anyone has thought about it before, and, if so, is there any obvious impediment to doing so? Thanks, James Coleman
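For context, a minimal sketch of the behavior today (settings are illustrative): even with auto_explain loaded, a query cancelled by statement_timeout leaves no plan in the log, since as far as I can tell auto_explain only logs from ExecutorEnd, which a cancelled query never reaches.

```sql
-- Illustrative session setup; the threshold values are arbitrary
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';
SET statement_timeout = '10s';
-- This is cancelled after 10s, and no plan reaches the server log
SELECT pg_sleep(30);
```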
Tighten pg_get_object_address argument checking
For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the array length of the second argument, but not of the first argument. If the first argument was too long, it would just silently ignore everything but the first argument. Fix that by checking the length of the first argument as well. I wouldn't be surprised if there were more holes like this in this area. I just happened to find these while working on something related.From eb80c87a083464160a1436e5f983df840b282085 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Sep 2022 13:37:27 -0400 Subject: [PATCH] Tighten pg_get_object_address argument checking For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the array length of the second argument, but not of the first argument. If the first argument was too long, it would just silently ignore everything but the first argument. Fix that by checking the length of the first argument as well. --- src/backend/catalog/objectaddress.c | 10 -- src/test/regress/expected/object_address.out | 16 +++- src/test/regress/sql/object_address.sql | 2 +- 3 files changed, 20 insertions(+), 8 deletions(-) diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index 8377b4f7d4d1..27616ac2ad26 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -2239,10 +2239,16 @@ pg_get_object_address(PG_FUNCTION_ARGS) */ switch (type) { + case OBJECT_PUBLICATION_NAMESPACE: + case OBJECT_USER_MAPPING: + if (list_length(name) != 1) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("name list length must be exactly %d", 1))); + /* fall through to check args length */ + /* FALLTHROUGH */ case OBJECT_DOMCONSTRAINT: case OBJECT_CAST: - case OBJECT_USER_MAPPING: - case OBJECT_PUBLICATION_NAMESPACE: case OBJECT_PUBLICATION_REL: case OBJECT_DEFACL: case OBJECT_TRANSFORM: diff --git a/src/test/regress/expected/object_address.out b/src/test/regress/expected/object_address.out index 4117fc27c9a5..cbb99c7b9f94 100644 --- a/src/test/regress/expected/object_address.out +++ b/src/test/regress/expected/object_address.out @@ -105,7 +105,7 @@ BEGIN ('text search template'), ('text search configuration'), ('policy'), ('user mapping'), ('default acl'), ('transform'), ('operator of access method'), ('function of access method'), - ('publication relation') + ('publication namespace'), ('publication relation') LOOP FOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}') LOOP @@ -285,10 +285,10 @@ WARNING: error for policy,{eins,zwei,drei},{}: schema "eins" does not exist WARNING: error for policy,{eins,zwei,drei},{integer}: schema "eins" does not exist WARNING: error for user mapping,{eins},{}: argument list length must be exactly 1 WARNING: error for user mapping,{eins},{integer}: user mapping for user "eins" on server "integer" does not exist -WARNING: error for user mapping,{addr_nsp,zwei},{}: argument list length must be exactly 1 -WARNING: error for user mapping,{addr_nsp,zwei},{integer}: user mapping for user "addr_nsp" on server "integer" does not exist -WARNING: error for user mapping,{eins,zwei,drei},{}: argument list length must be exactly 1 -WARNING: error for user mapping,{eins,zwei,drei},{integer}: user mapping for user "eins" on server "integer" does not exist +WARNING: error for user mapping,{addr_nsp,zwei},{}: name list length must be exactly 1 +WARNING: error 
for user mapping,{addr_nsp,zwei},{integer}: name list length must be exactly 1 +WARNING: error for user mapping,{eins,zwei,drei},{}: name list length must be exactly 1 +WARNING: error for user mapping,{eins,zwei,drei},{integer}: name list length must be exactly 1 WARNING: error for default acl,{eins},{}: argument list length must be exactly 1 WARNING: error for default acl,{eins},{integer}: unrecognized default ACL object type "i" WARNING: error for default acl,{addr_nsp,zwei},{}: argument list length must be exactly 1 @@ -313,6 +313,12 @@ WARNING: error for function of access method,{addr_nsp,zwei},{}: name list leng WARNING: error for function of access method,{addr_nsp,zwei},{integer}: name list length must be at least 3 WARNING: error for function of access method,{eins,zwei,drei},{}: argument list length must be exactly 2 WARNING: error for function of access meth
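To illustrate how the tightened check surfaces to a caller, here is a sketch based on the regression output above (the user-mapping and server names are hypothetical):

SELECT * FROM pg_get_object_address('user mapping', '{alice,bogus}', '{some_server}');
-- previously the extra element "bogus" was silently ignored; with this fix the
-- call now errors out up front with: name list length must be exactly 1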
Re: CFM Manager
On Thu, Sep 8, 2022 at 2:34 PM Jacob Champion wrote: > I still have yet to update the section "5 to 7 days before end of CF" > and onward. Well, I've saved the hardest part for last... Ibrar, Hamid, have the checklist rewrites been helpful so far? Are you planning on doing an (optional!) triage, and if so, are there any pieces in particular you'd like me to document? --Jacob
Re: oat_post_create expected behavior
On Tue, 2022-08-02 at 13:30 -0700, Mary Xu wrote: > > Right, same thing I'm saying. I also think we should discourage > > people from doing cowboy CCIs inside their OAT hooks, because that > > makes the testability problem even worse. Maybe that means we > > need to uniformly move the CREATE hooks to after a system-provided > > CCI, but I've not thought hard about the implications of that. > > I like this approach, however, I am relatively new to the PG scene > and > am not sure how or what I should look into in terms of the > implications that Tom mentioned. Are there any tips? What should be > the next course of action here? I could update my patch to always > call > CCI before the create hooks. I didn't see a clear consensus that we should call OAT_POST_CREATE after CCI, so I went ahead and updated the comment. We can always update the behavior later when we do have consensus, but until that time, at least the comment will be more helpful. If you are satisfied you can mark the CF issue as "committed", or you can leave it open if you think it's still unresolved. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Support tls-exporter as channel binding for TLSv1.3
On Mon, Sep 19, 2022 at 5:54 PM Michael Paquier wrote: > X509_get_signature_nid() has been introduced in 1.0.2. > SSL_export_keying_material() is older than that, being present since > 1.0.1. Considering the fact that we want to always have > tls-server-end-point as default, it seems to me that we should always > publish SCRAM-SHA-256-PLUS and support channel binding when building > with OpenSSL >= 1.0.2. The same can be said about the areas where we > have HAVE_BE_TLS_GET_[PEER_]CERTIFICATE_HASH. Should we advertise support even if the client is too old to provide an extended master secret? > I was planning to continue working on this patch based on your latest > review. Anyway, as that's originally your code, I am fine to let you > take the lead here. Just let me know which way you prefer, and I'll > stick to it :) Well, I'm working on a next version, but it's ballooning in complexity as I try to navigate the fix for OpenSSL 1.0.1 (which is currently failing the tests, unsurprisingly). You mentioned not wanting to add maintenance burden for 1.0.1, and I'm curious to see whether the approach you have in mind would be easier than what mine is turning out to be. Maybe we can compare notes? --Jacob
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: > I have gone through the thread, and I'd agree with getting more > granularity when it comes to assigning ACLs to relations rather than > just an on/off switch for the objects of a given type would be nice. > I've been looking at the whole use of AclMode and AclItem in the code, > and I don't quite see why a larger size could have a noticeable > impact. There are a few things that could handle a large number of > AclItems, though, say for array operations like aclupdate(). These > could be easily checked with some micro-benchmarking or some SQL > queries that emulate a large number of items in aclitem[] arrays. I performed a few quick tests with a couple thousand ACLs on my laptop, and I'm consistently seeing a 4.3% regression. > Any impact for the column sizes of the catalogs holding ACL > information? Just asking while browsing the patch set. Since each aclitem requires 16 bytes instead of 12, I assume so. However, in my testing, I hit a "row is too big" error with the same number of aclitems in a pg_class row before and after the change. I might be missing something in my patch, or maybe I am misunderstanding how arrays of aclitems are stored on disk. > Some comments in utils/acl.h need a refresh as the number of lower and > upper bits looked at from ai_privs changes. Oops, I missed that one. I fixed it in the attached patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 15e9a60b990070d36f4e1f749cbfbb638b37a666 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 7 Sep 2022 22:25:29 -0700 Subject: [PATCH v6 1/4] Change AclMode from a uint32 to a uint64. --- src/backend/nodes/outfuncs.c| 2 +- src/bin/pg_upgrade/check.c | 35 + src/include/catalog/pg_type.dat | 2 +- src/include/nodes/parsenodes.h | 6 +++--- src/include/utils/acl.h | 28 +- 5 files changed, 54 insertions(+), 19 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 60610e3a4b..dbb9ad52da 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -528,7 +528,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node) WRITE_BOOL_FIELD(lateral); WRITE_BOOL_FIELD(inh); WRITE_BOOL_FIELD(inFromCl); - WRITE_UINT_FIELD(requiredPerms); + WRITE_UINT64_FIELD(requiredPerms); WRITE_OID_FIELD(checkAsUser); WRITE_BITMAPSET_FIELD(selectedCols); WRITE_BITMAPSET_FIELD(insertedCols); diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index f1bc1e6886..615a53a864 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster); static void check_for_tables_with_oids(ClusterInfo *cluster); static void check_for_composite_data_type_usage(ClusterInfo *cluster); static void check_for_reg_data_type_usage(ClusterInfo *cluster); +static void check_for_aclitem_data_type_usage(ClusterInfo *cluster); static void check_for_jsonb_9_4_usage(ClusterInfo *cluster); static void check_for_pg_role_prefix(ClusterInfo *cluster); static void check_for_new_tablespace_dir(ClusterInfo *new_cluster); @@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check) check_for_reg_data_type_usage(&old_cluster); check_for_isn_and_int8_passing_mismatch(&old_cluster); + /* + * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk + * format for existing data. 
+ */ + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) + check_for_aclitem_data_type_usage(&old_cluster); + /* * PG 14 changed the function signature of encoding conversion functions. * Conversions from older versions cannot be upgraded automatically @@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster) check_ok(); } +/* + * check_for_aclitem_data_type_usage + * + * aclitem changed its storage format in 16, so check for it. + */ +static void +check_for_aclitem_data_type_usage(ClusterInfo *cluster) +{ + char output_path[MAXPGPATH]; + + prep_status("Checking for incompatible aclitem data type in user tables"); + + snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt"); + + if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path)) + { + pg_log(PG_REPORT, "fatal"); + pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n" + "The internal format of \"aclitem\" changed in PostgreSQL version 16\n" + "so this cluster cannot currently be upgraded. You can drop the\n" + "problem columns and restart the upgrade. A list of the problem\n" + "columns is in the file:\n" + "%s", output_path); + } + else + check_ok(); +} /* * check_for_jsonb_9_4_usage() diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat index df45879463..484dec39e8 100644 --- a/src/include/catalog/pg_type.dat +
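For reference, the kind of SQL-level emulation of a large aclitem[] array that Michael suggests could look roughly like this (a sketch only; the table and role names are made up):

CREATE TABLE acl_bench (i int);
DO $$
BEGIN
    FOR n IN 1..2000 LOOP
        EXECUTE format('CREATE ROLE acl_bench_role_%s', n);
        EXECUTE format('GRANT SELECT ON acl_bench TO acl_bench_role_%s', n);
    END LOOP;
END $$;

Each GRANT rewrites the relacl array, so timing a run like this before and after the patch is one way to approximate the per-aclitem overhead discussed above.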
Re: Auto explain after query timeout
On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote: > Hopefully I'm not missing something obvious, but as far as I know > there's no way to configure auto explain to work fire > statement_timeout fires. I believe you're correct. > I'd like to look into this at some point, but I'm wondering if anyone > has thought about it before, and, if so, is there any obvious > impediment to doing so? This would be a neat feature. Since the changes would be fairly localized to the contrib module, this would be a great first patch for someone new to contributing. This can be exposed as a new GUC auto_explain.log_on_statement_timeout. I wish our conventions allowed for creating hierarchies of GUC parameters, e.g. auto_explain.when.statement_timeout. For someone who would like to achieve this in the field today, I believe they can set auto_explain.log_min_duration equal to, or less than, statement_timeout. Best regards, Gurjeet http://Gurje.et
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
Here is a rebased patch for cfbot. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9ae1f5409ddeee17b99a9565247da885cbb86be9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 22 Mar 2022 15:35:34 -0700 Subject: [PATCH v6 1/1] Disallow setting XMAX_COMMITTED and XMAX_LOCK_ONLY together. Even though Postgres doesn't set both the XMAX_COMMITTED and XMAX_LOCK_ONLY infomask bits on the same tuple header, the heap visibility logic still accepts it. However, other functions like compute_new_xmax_infomask() seem to assume that this bit pattern is not possible. This change marks this bit pattern as disallowed, removes the heap visibility logic that handles it, and adds checks like those for other disallowed infomask bit combinations (e.g., XMAX_COMMITTED and XMAX_IS_MULTI). Besides simplifying the heap visibility logic a bit, this change aims to reduce ambiguity about the legal tuple header states. Note that this change also disallows XMAX_COMMITTED together with the special pre-v9.3 locked-only bit pattern that HEAP_XMAX_IS_LOCKED_ONLY checks for. This locked-only bit pattern may still be present on servers pg_upgraded from pre-v9.3 versions. --- contrib/amcheck/verify_heapam.c | 19 src/backend/access/heap/README.tuplock | 2 +- src/backend/access/heap/heapam.c| 10 +++ src/backend/access/heap/heapam_visibility.c | 97 ++--- src/backend/access/heap/hio.c | 2 + 5 files changed, 96 insertions(+), 34 deletions(-) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index d33f33f170..f74f88afc5 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -663,6 +663,25 @@ check_tuple_header(HeapCheckContext *ctx) */ } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)) + { + report_corruption(ctx, + pstrdup("locked-only should not be marked committed")); + + /* As above, do not skip further checks. */ + } + + /* also check for pre-v9.3 lock-only bit pattern */ + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask)) + { + report_corruption(ctx, + pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed")); + + /* As above, do not skip further checks. */ + } + if (infomask & HEAP_HASNULL) expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts)); else diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8baf0..4e565e60ab 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -152,4 +152,4 @@ The following infomask bits are applicable: whether the XMAX is a TransactionId or a MultiXactId. We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit -is set. +or the HEAP_XMAX_LOCK_ONLY bit is set. diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index eb811d751e..616df576c3 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -5115,6 +5115,16 @@ l5: MultiXactStatus status; MultiXactStatus new_status; + /* + * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be + * set in a tuple header, so cross-check. + * + * Note that this also checks for the special locked-only bit pattern + * that may be present on servers that were pg_upgraded from pre-v9.3 + * versions. 
+ */ + Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask)); + if (old_infomask2 & HEAP_KEYS_UPDATED) status = MultiXactStatusUpdate; else diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index 6e33d1c881..e6ee8ff1fa 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -78,6 +78,31 @@ #include "utils/snapmgr.h" +/* + * InfomaskAssertions() + * + * Checks for invalid infomask bit patterns. + */ +static inline void +InfomaskAssertions(HeapTupleHeader tuple) +{ + /* + * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header. + * + * Note that this also checks for the special locked-only bit pattern that + * may be present on servers that were pg_upgraded from pre-v9.3 versions. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))); + + /* + * XMAX_COMMITTED and XMAX_IS_MULTI cannot be be set in a tuple header. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + (tuple->t_infomask & HEAP_XMAX_IS_MULTI))); +} + + /* * SetHintBits() * @@ -113,6 +138,8 @@ static inline void SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { + InfomaskAssertions(tuple); + if (TransactionIdIsValid(xid)) { /* NB: xid must be known committed here!
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 2:12 PM Gurjeet wrote: > > On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote: > > Hopefully I'm not missing something obvious, but as far as I know > > there's no way to configure auto explain to work fire > > statement_timeout fires. > > I believe you're correct. > > > I'd like to look into this at some point, but I'm wondering if anyone > > has thought about it before, and, if so, is there any obvious > > impediment to doing so? > > This would be a neat feature. Since the changes would be fairly > localized to the contrib module, this would be a great first patch for > someone new to contributing. > > This can be exposed at a new GUC auto_explain.log_on_statement_timeout. > I wish our conventions allowed for creating hierarchies of GUC > parameters, e.g. auto_explain.when.statmeent_timeout. > > For someone who would like to achieve this in the field today, I believe > they can set auto_explain.log_min_duration equal to, or less than, > statement_timeout. Either I'm missing something (and/or this was fixed in a later PG version), but I don't think this is how it works. We have this specific problem now: we set auto_explain.log_min_duration to 200 (ms) and statement_timeout set to 30s, but when a statement times out we do not get the plan logged with auto-explain. James Coleman
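For concreteness, the configuration being described is essentially the following (values taken from the report above), and with it a statement cancelled by the timeout still produces no auto_explain output:

LOAD 'auto_explain';                        -- unless already preloaded
SET auto_explain.log_min_duration = 200;    -- milliseconds
SET statement_timeout = '30s';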
Re: Support tls-exporter as channel binding for TLSv1.3
On Tue, Sep 20, 2022 at 11:01 AM Jacob Champion wrote: > Well, I'm working on a next version, but it's ballooning in complexity > as I try to navigate the fix for OpenSSL 1.0.1 (which is currently > failing the tests, unsurprisingly). To be more specific: I think I'm hitting the case that Heikki pointed out several years ago [1]: > The problematic case is when e.g. the server > only supports tls-unique and the client only supports > tls-server-end-point. What we would (usually) like to happen, is to fall > back to not using channel binding. But it's not clear how to make that > work, and still protect from downgrade attacks. The problem was deferred when tls-unique was removed. We might have to actually solve it now. bcc: Heikki, in case he would like to weigh in. --Jacob [1] https://www.postgresql.org/message-id/ec787074-2305-c6f4-86aa-6902f98485a4%40iki.fi
Re: A question about wording in messages
Alvaro Herrera writes: > After spending way too much time editing this line, I ended up with > exactly what Tom proposed, so +1 for his version. I think "This > controls" adds nothing very useful, and we don't have it anywhere else, > except tcp_keepalives_count from where I also propose to remove it. LGTM. regards, tom lane
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 2:35 PM James Coleman wrote: > Either I'm missing something (and/or this was fixed in a later PG > version), but I don't think this is how it works. We have this > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > and statement_timeout set to 30s, but when a statement times out we do > not get the plan logged with auto-explain. I think you're correct. auto_explain uses the ExecutorEnd hook, but that hook will not be fired in the case of an error. Indeed, if an error has already been thrown, it would be unsafe to try to auto-explain anything. For instance -- and this is just one problem of probably many -- ExplainTargetRel() performs catalog lookups, which is not OK in a failed transaction. To make this work, I think you'd need a hook that fires just BEFORE the error is thrown. However, previous attempts to introduce hooks into ProcessInterrupts() have not met with a warm response from Tom, so it might be a tough sell. And maybe for good reason. I see at least two problems. The first is that explaining a query is a pretty complicated operation that involves catalog lookups and lots of complicated stuff, and I don't think that it would be safe to do all of that at any arbitrary point in the code where ProcessInterrupts() happened to fire. What if one of the syscache lookups misses the cache and we have to open the underlying catalog? Then AcceptInvalidationMessages() will fire, but we don't currently assume that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if the running query has called a user-defined function or procedure which is running DDL which is in the middle of changing catalog state for some relation involved in the query at the moment that the statement timeout arrives? I feel like there are big problems here. The other problem I see is that ProcessInterrupts() is our mechanism for allowing people to escape from queries that would otherwise run forever by hitting ^C. But what if the explain code goes crazy and itself needs to be interrupted, when we're already inside ProcessInterrupts()? Maybe that would work out OK... or maybe it wouldn't. I'm not really sure. But it's another reason to be very, very cautious about putting complicated logic inside ProcessInterrupts(). -- Robert Haas EDB: http://www.enterprisedb.com
Re: Auto explain after query timeout
On Tue Sep 20, 2022 at 11:34 AM PDT, James Coleman wrote: > On Tue, Sep 20, 2022 at 2:12 PM Gurjeet wrote: > > > > For someone who would like to achieve this in the field today, I believe > > they can set auto_explain.log_min_duration equal to, or less than, > > statement_timeout. > > Either I'm missing something (and/or this was fixed in a later PG > version), but I don't think this is how it works. We have this > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > and statement_timeout set to 30s, but when a statement times out we do > not get the plan logged with auto-explain. My DBA skills are rusty, so I'll take your word for it. If this is the current behaviour, I'm inclined to treat this as a bug, or at least a desirable improvement, and see if auto_explain can be improved to emit the plan on statement_timeout. From what I understand, the behaviour of auto_explain is that it waits for the query to finish before it emits the plan. That plan would be very useful for diagnosing long-running queries that are still running. Many a time, you encounter such queries in production workloads, and reproducing such a scenario later on is either undesirable, expensive, or even impossible. So being able to see the plan of a query that has crossed auto_explain.log_min_duration as soon as possible would be highly desirable. Best regards, Gurjeet http://Gurje.et
Support logical replication of large objects
Hello, I’m working on a patch to support logical replication of large objects (LOBs). This is a useful feature when a database in logical replication has lots of tables, functions and other objects that change over time, such as during an online cross-major-version upgrade. As an example, this lets users replicate large objects between different PostgreSQL major versions. The topic was previously discussed in [1]. Moreover, we need to address the following three challenges. I have worked on some designs and would appreciate feedback:

1. Replication of the change stream of LOBs
My suggestion is that we add a check when any LOB function or API is called and executed in the backend, and then log a simple SQL command in WAL to do the replication. Take lo_unlink as an example[2]: we can create a “simple” SQL command like SELECT lo_unlink(); and log it in WAL, so that the replica only needs to replay that “simple” SQL command and unlink the LOBs accordingly.
Pros:
a. We do not need to add much additional functionality.
b. For most of the LOB-related APIs, we don’t need to touch whole LOBs, except for the creation of LOBs.
Cons:
a. For the creation of LOBs, we may need to split the whole LOB content into WAL, which will increase the volume of WAL and replicated writes dramatically. This could be avoided if we can make sure the original file is publicly shared, such as a URL from cloud storage, or already exists on the replica host as well.

2. Initializing replication of LOBs
When a subscription is established, LOBs in the source should be replicated even if they do not exist on the replica. Here are two options to approach this problem:
Option 1: Support LOB-related system catalogs in logical replication. We can make an exception at this line[3] in the check_publication_add_schema function.
Pros: All required LOBs can be transferred to the replica.
Cons: There is currently no support for logical replication of system catalogs.
Option 2: Run a function or a script from the source instance when it detects that logical replication has been established. The point is that we can ask the source to replicate whole LOBs when a new subscription is created. Maybe we can copy the LOB-related system catalogs, replicate the copy to the replica, and then restore the original LOBs on the replica from that copy.
Cons: This will increase the volume of WAL and replicated writes dramatically.
I currently do not have a preference for either option. I would like to hear whether others have thoughts on how we could approach this.

3. OID conflicts
A common case is that the OID we want to publish is already used on the subscriber.
Option 1 (my recommendation): Create or update an existing system catalog to map the OID when a conflict happens. Maybe we could add another column, say mapping_oid, to the system catalog pg_largeobject_metadata on the replica. When the replica detects that the OID the source is replicating (e.g. 16400) is already used on the replica, the replica could store 16400 as mapping_oid and create a new OID (e.g. 16500) to be used locally, so whatever operation is done on 16400 on the source only needs to be performed on 16500 on the replica.
Cons: We would need to add additional columns to the system catalog.
Option 2: Raise an error message on the replica and let the user handle it manually.
Cons: This is not user-friendly.
Please let me know your thoughts.
Borui Yang Amazon RDS/Aurora for PostgreSQL [1] https://www.postgresql.org/message-id/VisenaEmail.16.35d9e854e3626e81.15cd0de93df%40tc7-visena [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/libpq/be-fsstubs.c#l312 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/catalog/pg_publication.c#l98
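To make point 1 above concrete, the idea is roughly this (a sketch, reusing the example OID from the OID-conflict discussion): when a backend on the publisher executes

SELECT lo_unlink(16400);

the server would WAL-log an equivalent “simple” statement for the subscriber to replay, rather than shipping the large object's contents through the change stream.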
Re: why can't a table be part of the same publication as its schema
On 9/20/22 10:55 AM, Mark Dilger wrote: On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz wrote: "When a partitioned table is added to a publication, all of its existing and future partitions are implicitly considered to be part of the publication."[10] Additionally, this is the behavior that is already present in "FOR ALL TABLES": "Marks the publication as one that replicates changes for all tables in the database, including tables created in the future."[10] I don't think we should change this behavior that's already in logical replication. The existing behavior in logical replication doesn't have any "IN SCHEMA" qualifiers. This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This was discussed multiple times on the original thread[1]. While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. it's not applied to future objects) and do not advocate to change it, I have personally been affected where I thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more intuitive to think that "ALL" applies to "everything, always." The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means. In GRANT it means "currently in schema, computed now." We are about to create confusion by adding the "IN SCHEMA" phrase to publication commands meaning "later in schema, computed then." A user who diligently consults the documentation for one command to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command. I tried to diligently read the sections where we talk about granting + privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed it, and I read through it twice, it does not explicitly state whether or not "GRANT" applies to all objects at only that given moment, or to future objects of that type which are created in that schema. Maybe the behavior is implied or is part of the standard, but it's not currently documented. We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, but we don't give any indication as to why. (This is also to say we should document in GRANT that ALL * IN SCHEMA does not apply to future objects; if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) I understand there is a risk of confusion of the similar grammar across commands, but the current command in logical replication has this is building on the existing behavior. Thanks, Jonathan [1] https://www.postgresql.org/message-id/flat/CALDaNm0OANxuJ6RXqwZsM1MSY4s19nuH3734j4a72etDwvBETQ%40mail.gmail.com [2] https://www.postgresql.org/docs/current/sql-grant.html [3] https://www.postgresql.org/docs/current/ddl-priv.html OpenPGP_signature Description: OpenPGP digital signature
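For anyone following along, the distinction being discussed is the one between these two commands (hypothetical role and schema names):

-- applies only to the tables that exist in schema foo when the command runs:
GRANT SELECT ON ALL TABLES IN SCHEMA foo TO bob;
-- needed to also cover tables created in foo later (and note that default
-- privileges only affect objects later created by the role(s) they are set for):
ALTER DEFAULT PRIVILEGES IN SCHEMA foo GRANT SELECT ON TABLES TO bob;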
Re: why can't a table be part of the same publication as its schema
(RMT hat on, unless otherwise noted) On 9/20/22 9:42 AM, Robert Haas wrote: On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz wrote: For #1 (allowing calls that have schema/table overlap...), there appears to be both a patch that allows this (reversing[8]), and a suggestion for dealing with a corner-case that is reasonable, i.e. disallowing adding schemas to a publication when specifying column-lists. Do we think we can have consensus on this prior to the RC1 freeze? I am not sure whether we can or should rush a fix in that fast, but I agree with this direction. The RMT met today to discuss this. We did agree that the above is an open item that should be resolved before this release. While it is an accepted pattern for us to "ERROR" on unsupported behavior and then later introduce said behavior, we do agree with Peter's original post in this thread and would like it resolved. As for the state of the fix, the patch has been iterated on and Amit felt ready to commit it[1]. We do want to hear how others feel about this, but the folks behind this feature have been working on this patch since this was reported. For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on the original thread[1][3][4][5][7]. I thought Tom's proposal on the syntax[3] was reasonable as it "future proofs" for when we allow other schema-scoped objects to be published and give control over which ones can be published. All right, well, I still don't like it and think it's confusing, but perhaps I'm in the minority. The RMT discussed this as well. The RMT feels that there should not be any changes to syntax/behavior for this release. This doesn't preclude future work in this area (e.g. having a toggle for "all future behavior"), but based on all the discussions and existing behavior in this feature, we do not see a need to make changes or delay the release on this. I don't think we should change this behavior that's already in logical replication. While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. it's not applied to future objects) and do not advocate to change it, I have personally been affected where I thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more intuitive to think that "ALL" applies to "everything, always." Nah, there's room for multiple behaviors here. It's reasonable to want to add all the tables currently in the schema to a publication (or grant permissions on them) and it's reasonable to want to include all current and future tables in the schema in a publication (or grant permissions on them) too. The reason I don't like the ALL TABLES IN SCHEMA syntax is that it sounds like the former, but actually is the latter. Based on your link to the email from Tom, I understand now the reason why it's like that, but it's still counterintuitive to me. I understand your view on "multiple behaviors" and I do agree with your reasoning. I still think we should leave this as is, but perhaps this opens up an option we add later to specify the behavior. For #3 (more superuser-only), in general I do agree that we shouldn't be adding more of these. However, we have in this release, and not just to this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I think it's easier for us to "relax" privileges (e.g. w/predefined roles) than to make something "superuser-only" in the future, so I don't see this being a blocker for v15. 
The feature will continue to work for users even if we remove "superuser-only" in the future. Yeah, this is clearly not a release blocker, I think. The RMT concurs. We do recommend future work on "relaxing" the superuser-only requirement. Thanks, Jonathan [1] https://www.postgresql.org/message-id/CAA4eK1LDhoBM8K5uVme8PZ%2BkxNOfVpRh%3DoO42JtFdqBgBuj1bA%40mail.gmail.com OpenPGP_signature Description: OpenPGP digital signature
Re: why can't a table be part of the same publication as its schema
> On Sep 20, 2022, at 12:36 PM, Jonathan S. Katz wrote: > > This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This > was discussed multiple times on the original thread[1]. Yes, nobody is debating that as far as I can see. And I do take your point that this stuff was discussed in other threads quite a while back. > I tried to diligently read the sections where we talk about granting + > privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed > it, and I read through it twice, it does not explicitly state whether or not > "GRANT" applies to all objects at only that given moment, or to future > objects of that type which are created in that schema. Maybe the behavior is > implied or is part of the standard, but it's not currently documented. Interesting. Thanks for that bit of research. > We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, > but we don't give any indication as to why. > > (This is also to say we should document in GRANT that ALL * IN SCHEMA does > not apply to future objects; Yes, I agree this should be documented. > if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) > > I understand there is a risk of confusion of the similar grammar across > commands, but the current command in logical replication has this is building > on the existing behavior. I don't complain that it is building on the existing behavior. I'm *only* concerned about the keywords we're using for this. Consider the following:

-- AS ADMIN
CREATE USER bob NOSUPERUSER;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
SET ROLE bob;
CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA option is reserved to superusers. But we agreed that was a stop-gap solution that we'd potentially loosen in the future. Certainly we'll need wiggle room in the syntax to perform that loosening:

--- Must be superuser for this in pg15, and in subsequent releases.
CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;

--- Not supported in pg15, but reserved for some future pg versions to allow
--- non-superusers to create publications on tables currently in schema foo,
--- assuming they have sufficient privileges on those tables
CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

Doing it this way makes the syntax consistent between the GRANT...TO bob and the CREATE PUBLICATION bobs_pub. Surely this makes more sense? I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to another database that uses that keyword for what I think is a similar purpose. We should choose *something* for this, though, if we want things to be rational going forward. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making C function declaration parameter names consistent with corresponding definition names
On Mon, Sep 19, 2022 at 11:36 PM Peter Geoghegan wrote: > Attached revision v4 fixes those pg_dump patch items. > > It also breaks out the ecpg changes into their own patch. I pushed much of this just now. All that remains to bring the entire codebase into compliance is the ecpg patch and the pg_dump patch. Those two areas are relatively tricky. But it's now unlikely that I'll need to push a commit that makes relatively many CF patches stop applying against HEAD -- that part is over. Once we're done with ecpg and pg_dump, we can talk about the actual practicalities of formally adopting a project policy on consistent parameter names. I mostly use clang-tidy via my editor's support for the clangd language server -- clang-tidy is primarily a linter, so it isn't necessarily run in bulk all that often. I'll need to come up with instructions for running clang-tidy from the command line that are easy to follow. I've found that the run_clang_tidy script (AKA run-clang-tidy.py) works, but the whole experience feels cobbled together. I think that we really need something like a build target for this -- something comparable to what we do to support GCOV. That would also allow us to use additional clang-tidy checks, which might be useful. We might even find it useful to come up with some novel check of our own. Apparently it's not all that difficult to write one from scratch, to implement custom rules. There are already custom rules for big open source projects such as the Linux Kernel, Chromium, and LLVM itself. -- Peter Geoghegan
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 3:06 PM Robert Haas wrote: > > On Tue, Sep 20, 2022 at 2:35 PM James Coleman wrote: > > Either I'm missing something (and/or this was fixed in a later PG > > version), but I don't think this is how it works. We have this > > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > > and statement_timeout set to 30s, but when a statement times out we do > > not get the plan logged with auto-explain. > > I think you're correct. auto_explain uses the ExecutorEnd hook, but > that hook will not be fired in the case of an error. Indeed, if an > error has already been thrown, it would be unsafe to try to > auto-explain anything. For instance -- and this is just one problem of > probably many -- ExplainTargetRel() performs catalog lookups, which is > not OK in a failed transaction. > > To make this work, I think you'd need a hook that fires just BEFORE > the error is thrown. However, previous attempts to introduce hooks > into ProcessInterrupts() have not met with a wam response from Tom, so > it might be a tough sell. And maybe for good reason. I see at least > two problems. The first is that explaining a query is a pretty > complicated operation that involves catalog lookups and lots of > complicated stuff, and I don't think that it would be safe to do all > of that at any arbitrary point in the code where ProcessInterrupts() > happened to fire. What if one of the syscache lookups misses the cache > and we have to open the underlying catalog? Then > AcceptInvalidationMessages() will fire, but we don't currently assume > that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if > the running query has called a user-defined function or procedure > which is running DDL which is in the middle of changing catalog state > for some relation involved in the query at the moment that the > statement timeout arrives? I feel like there are big problems here. > > The other problem I see is that ProcessInterrupts() is our mechanism > for allowing people to escape from queries that would otherwise run > forever by hitting ^C. But what if the explain code goes crazy and > itself needs to be interrupted, when we're already inside > ProcessInterrupts()? Maybe that would work out OK... or maybe it > wouldn't. I'm not really sure. But it's another reason to be very, > very cautious about putting complicated logic inside > ProcessInterrupts(). This is exactly the kind of background I was hoping someone would provide; thank you, Robert. It seems like one could imagine addressing all of these by having one of: - A safe explain (e.g., disallow catalog access) that is potentially missing information. - A safe way to interrupt queries such as "safe shutdown" of a node (e.g., a seq scan could stop returning tuples early) and allow a configurable buffer of time after the statement timeout before firing a hard abort of the query (and transaction). Both of those seem like a significant amount of work. Alternatively I wonder if it's possible (this would maybe assume no catalog changes in the current transaction -- or at least none that would be referenced by the current query) to open a new transaction (with the same horizon information) and duplicate the plan over to that transaction and run the explain there. This way you do it *after* the error is raised. That's some serious spit-balling -- I'm not saying that's doable, just trying to imagine how one might comprehensively address the concerns. Does any of that seem at all like a path you could imagine being fruitful? 
Thanks, James Coleman
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Mon, Sep 12, 2022 at 2:01 PM Peter Geoghegan wrote: > I'd like to talk about one such technique on this thread. The attached > WIP patch reduces the size of xl_heap_freeze_page records by applying > a simple deduplication process. Attached is v2, which I'm just posting to keep CFTester happy. No real changes here. -- Peter Geoghegan From c555f716a79c7de1f26a5e1e87265f2702b980ee Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 30 Jul 2022 10:47:59 -0700 Subject: [PATCH v2] Shrink freeze WAL records via deduplication. --- src/include/access/heapam.h| 25 ++ src/include/access/heapam_xlog.h | 34 +-- src/backend/access/heap/heapam.c | 310 - src/backend/access/heap/vacuumlazy.c | 39 +--- src/backend/access/rmgrdesc/heapdesc.c | 4 +- 5 files changed, 289 insertions(+), 123 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 9dab35551..579ef32b0 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -99,6 +99,19 @@ typedef enum HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ } HTSV_Result; +/* heap_prepare_freeze_tuple state describing how to freeze a tuple */ +typedef struct HeapFreezeTuple +{ + /* Fields describing how to process tuple */ + uint16 t_infomask2; + uint16 t_infomask; + TransactionId xmax; + uint8 frzflags; + + /* Page offset number for tuple */ + OffsetNumber offset; +} HeapFreezeTuple; + /* * function prototypes for heap access method * @@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer, struct TM_FailureData *tmfd); extern void heap_inplace_update(Relation relation, HeapTuple tuple); +extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, + TransactionId relfrozenxid, + TransactionId relminmxid, + TransactionId cutoff_xid, + TransactionId cutoff_multi, + HeapFreezeTuple *frz, + bool *totally_frozen, + TransactionId *relfrozenxid_out, + MultiXactId *relminmxid_out); +extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer, + TransactionId latestRemovedXid, + HeapFreezeTuple *tuples, int ntuples); extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId relfrozenxid, TransactionId relminmxid, TransactionId cutoff_xid, TransactionId cutoff_multi); diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index 34220d93c..4e3a023eb 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -315,34 +315,36 @@ typedef struct xl_heap_inplace /* * This struct represents a 'freeze plan', which is what we need to know about - * a single tuple being frozen during vacuum. + * a group of tuples being frozen from the same page. */ /* 0x01 was XLH_FREEZE_XMIN */ #define XLH_FREEZE_XVAC 0x02 #define XLH_INVALID_XVAC 0x04 -typedef struct xl_heap_freeze_tuple +typedef struct xl_heap_freeze_plan { TransactionId xmax; - OffsetNumber offset; + uint16 ntuples; uint16 t_infomask2; uint16 t_infomask; uint8 frzflags; -} xl_heap_freeze_tuple; +} xl_heap_freeze_plan; /* * This is what we need to know about a block being frozen during vacuum * - * Backup block 0's data contains an array of xl_heap_freeze_tuple structs, - * one for each tuple. + * Backup block 0's data contains an array of xl_heap_freeze_plan structs. 
*/ typedef struct xl_heap_freeze_page { - TransactionId cutoff_xid; - uint16 ntuples; + TransactionId latestRemovedXid; + uint16 nplans; + + /* FREEZE PLANS FOLLOW */ + /* OFFSET NUMBERS FOLLOW */ } xl_heap_freeze_page; -#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16)) +#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16)) /* * This is what we need to know about setting a visibility map bit @@ -401,20 +403,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record); extern const char *heap2_identify(uint8 info); extern void heap_xlog_logical_rewrite(XLogReaderState *r); -extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, - TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples, - int ntuples); -extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, - TransactionId relfrozenxid, - TransactionId relminmxid, - TransactionId cutoff_xid, - TransactionId cutoff_multi, - xl_heap_freeze_tuple *frz, - bool *totally_frozen, - TransactionId *relfrozenxid_out, - MultiXactId *relminmxid_out); -extern void heap_execute_freeze_tuple(HeapTupleHeader tuple, - xl_heap_freeze_tuple *frz); extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer, TransactionId cutoff_xid, u
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 5:08 PM James Coleman wrote: > - A safe explain (e.g., disallow catalog access) that is potentially > missing information. This would be pretty useless I think, because you'd be missing all relation names. > - A safe way to interrupt queries such as "safe shutdown" of a node > (e.g., a seq scan could stop returning tuples early) and allow a > configurable buffer of time after the statement timeout before firing > a hard abort of the query (and transaction). This might be useful, but it seems extremely difficult to get working. You'd not only have to design the safe shutdown mechanism itself, but also find a way to safely engage it at the right times. > Alternatively I wonder if it's possible (this would maybe assume no > catalog changes in the current transaction -- or at least none that > would be referenced by the current query) to open a new transaction > (with the same horizon information) and duplicate the plan over to > that transaction and run the explain there. This way you do it *after* > the error is raised. That's some serious spit-balling -- I'm not > saying that's doable, just trying to imagine how one might > comprehensively address the concerns. Doesn't work, because the new transaction's snapshot wouldn't be the same as that of the old one. Imagine that you create a table and run a query on it in the same transaction. Then you migrate the plan tree to a new transaction and try to find out the table name. But in the new transaction, that table doesn't exist: it was destroyed by the previous rollback. Honestly I have no very good ideas about how to create the feature you want here. I guess the only thing I can think of is to separate the EXPLAIN process into two phases: a first phase that runs when the plan tree is set up and gathers all of the information that we might need later, like relation names, and then a second phase that runs later when you want to generate the output and does nothing that can fail, or at least no database access: maybe it's allowed to allocate memory, for example. But that sounds like a big and perhaps painful refactoring exercise, and I can imagine that there might be reasons why it doesn't work out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: > I'm attaching v5 patch-set. I've addressed review comments received so > far and fixed a compiler warning that CF bot complained about. > > Please review it further. 0001 looks reasonable to me. +errno = 0; +rc = pg_pwritev_zeros(fd, pad_to_size); Do we need to reset errno? pg_pwritev_zeros() claims to set errno appropriately. +/* + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if + * writing more bytes per pg_pwritev_with_retry() call is proven to be more + * performant. + */ +#define PWRITEV_BLCKSZ XLOG_BLCKSZ This seems like something we should sort out now instead of leaving as future work. Given your recent note, I think we should just use XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance findings with different buffer sizes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote: > On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: >> I'm attaching v5 patch-set. I've addressed review comments received so >> far and fixed a compiler warning that CF bot complained about. >> >> Please review it further. > > 0001 looks reasonable to me. > > +errno = 0; > +rc = pg_pwritev_zeros(fd, pad_to_size); > > Do we need to reset errno? pg_pwritev_zeros() claims to set errno > appropriately. > > +/* > + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if > + * writing more bytes per pg_pwritev_with_retry() call is proven to be more > + * performant. > + */ > +#define PWRITEV_BLCKSZ XLOG_BLCKSZ > > This seems like something we should sort out now instead of leaving as > future work. Given your recent note, I think we should just use > XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance > findings with different buffer sizes. I also noticed that the latest patch set no longer applies, so I've marked the commitfest entry as waiting-on-author. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PoC] Federated Authn/z with OAUTHBEARER
Hi Mahendrakar, thanks for your interest and for the patch! On Mon, Sep 19, 2022 at 10:03 PM mahendrakar s wrote: > The changes for each component are summarized below. > > 1. Provider-specific extension: > Each OAuth provider implements their own token validator as an > extension. Extension registers an OAuth provider hook which is matched > to a line in the HBA file. How easy is it to write a Bearer validator using C? My limited understanding was that most providers were publishing libraries in higher-level languages. Along those lines, sample validators will need to be provided, both to help in review and to get the pytest suite green again. (And coverage for the new code is important, too.) > 2. Add support to pass on the OAuth bearer token. In this > obtaining the bearer token is left to 3rd party application or user. > > ./psql -U -d 'dbname=postgres > oauth_client_id= oauth_bearer_token= This hurts, but I think people are definitely going to ask for it, given the frightening practice of copy-pasting these (incredibly sensitive secret) tokens all over the place... Ideally I'd like to implement sender constraints for the Bearer token, to *prevent* copy-pasting (or, you know, outright theft). But I'm not sure that sender constraints are well-implemented yet for the major providers. > 3. HBA: An additional param ‘provider’ is added for the oauth method. > Defining "oauth" as method + passing provider, issuer endpoint > and expected audience > > * * * * oauth provider= > issuer= scope= Naming aside (this conflicts with Samay's previous proposal, I think), I have concerns about the implementation. There's this code: > + if (oauth_provider && oauth_provider->name) > + { > + ereport(ERROR, > + (errmsg("OAuth provider \"%s\" is already > loaded.", > + oauth_provider->name))); > + } which appears to prevent loading more than one global provider. But there's also code that deals with a provider list? (Again, it'd help to have test code covering the new stuff.) >b) libpq optionally compiled for the clients which > explicitly need libpq to orchestrate OAuth communication with the > issuer (it depends heavily on 3rd party library iddawc as Jacob > already pointed out. The library seems to be supporting all the OAuth > flows.) Speaking of iddawc, I don't think it's a dependency we should choose to rely on. For all the code that it has, it doesn't seem to provide compatibility with several real-world providers. Google, for one, chose not to follow the IETF spec it helped author, and iddawc doesn't support its flavor of Device Authorization. At another point, I think iddawc tried to decode Azure's Bearer tokens, which is incorrect... I haven't been able to check if those problems have been fixed in a recent version, but if we're going to tie ourselves to a huge dependency, I'd at least like to believe that said dependency is battle-tested and solid, and personally I don't feel like iddawc is. > - auth_method = I_TOKEN_AUTH_METHOD_NONE; > - if (conn->oauth_client_secret && *conn->oauth_client_secret) > - auth_method = I_TOKEN_AUTH_METHOD_SECRET_BASIC; This code got moved, but I'm not sure why? It doesn't appear to have made a change to the logic. > + if (conn->oauth_client_secret && *conn->oauth_client_secret) > + { > + session_response_type = I_RESPONSE_TYPE_CLIENT_CREDENTIALS; > + } Is this an Azure-specific requirement? Ideally a public client (which psql is) shouldn't have to provide a secret to begin with, if I understand that bit of the protocol correctly. 
I think Google also required provider-specific changes in this part of the code, and unfortunately I don't think they looked the same as yours. We'll have to figure all that out... Standards are great; everyone has one of their own. :) Thanks, --Jacob
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: >> Any impact for the column sizes of the catalogs holding ACL >> information? Just asking while browsing the patch set. > > Since each aclitem requires 16 bytes instead of 12, I assume so. However, > in my testing, I hit a "row is too big" error with the same number of > aclitems in a pg_class row before and after the change. I might be missing > something in my patch, or maybe I am misunderstanding how arrays of > aclitems are stored on disk. Ah, it looks like relacl is compressed. The column is marked "extended," but pg_class doesn't appear to have a TOAST table, so presumably no out-of-line storage can be used. I found a couple of threads about this [0] [1] [2]. [0] https://postgr.es/m/17245.964897719%40sss.pgh.pa.us [1] https://postgr.es/m/200309040531.h845ViP05881%40candle.pha.pa.us [2] https://postgr.es/m/29061.1265327626%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
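For what it's worth, both observations can be checked directly from SQL, e.g.:

SELECT attstorage FROM pg_attribute
 WHERE attrelid = 'pg_class'::regclass AND attname = 'relacl';   -- 'x', i.e. extended
SELECT reltoastrelid FROM pg_class
 WHERE oid = 'pg_class'::regclass;                               -- 0, i.e. no TOAST table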
Re: default sorting behavior for index
Zhihong Yu writes: > I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t. > constraint: > ... > If the index has DESC sorting order, why it cannot be used to back a > constraint ? > Some concrete sample would help me understand this. Please read the nearby comments, particularly * Insist on default opclass, collation, and sort options. * While the index would still work as a constraint with * non-default settings, it might not provide exactly the same * uniqueness semantics as you'd get from a normally-created * constraint; and there's also the dump/reload problem * mentioned above. The "mentioned above" refers to this: * Insist on it being a btree. That's the only kind that supports * uniqueness at the moment anyway; but we must have an index that * exactly matches what you'd get from plain ADD CONSTRAINT syntax, * else dump and reload will produce a different index (breaking * pg_upgrade in particular). The concern about whether the uniqueness semantics are the same probably mostly applies to just the opclass and collation properties. However, rd_indoption contains AM-specific options, and we have little ability to be sure in this code exactly what those bits might do. In any case we'd definitely have a risk of things breaking during pg_upgrade if we ignore rd_indoption. regards, tom lane
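A quick way to see the restriction in action (a sketch; exact error wording aside):

CREATE TABLE t (a int);
CREATE UNIQUE INDEX t_a_desc ON t (a DESC);
ALTER TABLE t ADD CONSTRAINT t_a_unique UNIQUE USING INDEX t_a_desc;
-- rejected: the index does not use the default sort options, so it cannot
-- back the unique constraint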
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: >> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: >>> Any impact for the column sizes of the catalogs holding ACL >>> information? Just asking while browsing the patch set. >> >> Since each aclitem requires 16 bytes instead of 12, I assume so. However, >> in my testing, I hit a "row is too big" error with the same number of >> aclitems in a pg_class row before and after the change. I might be missing >> something in my patch, or maybe I am misunderstanding how arrays of >> aclitems are stored on disk. > > Ah, it looks like relacl is compressed. The column is marked "extended," > but pg_class doesn't appear to have a TOAST table, so presumably no > out-of-line storage can be used. I found a couple of threads about this > [0] [1] [2]. I suppose there is some risk that folks with really long aclitem arrays might be unable to pg_upgrade to a version with uint64 AclModes, but I suspect that risk is limited to extreme cases (i.e., multiple thousands of aclitems). I'm not sure whether that's worth worrying about too much. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [RFC] building postgres with meson - v13
Hi, On 2022-09-19 19:16:30 -0700, Andres Freund wrote: > To have some initial "translation" for other developers I've started a wiki > page with a translation table. Still very WIP: > https://wiki.postgresql.org/wiki/Meson > > For now, a bit of polishing aside, I'm just planning to add a minimal > explanation of what's happening, and a reference to this thread. I added installation instructions for meson for a bunch of platforms, but failed to figure out how to do so in a rhel9 container. I don't have a rhel subscription, and apparently the repos with developer tools now require a subscription. Great way to make it easy for projects to test anything on RHEL. Greetings, Andres Freund
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Onder, Thanks for addressing all my previous feedback. I checked the latest v12-0001, and have no more comments at this time. One last thing - do you think there is any need to mention this behaviour in the pgdocs, or is it OK just to be a hidden performance improvement? -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Summary function for pg_buffercache
Hi,

On 2022-09-20 12:45:24 +0300, Aleksander Alekseev wrote:
> > I'm not sure how to avoid any undefined behaviour without locks though.
> > Even with locks, performance is much better. But is it good enough for
> > production?
>
> Potentially you could avoid taking locks by utilizing atomic
> operations and lock-free algorithms. But these algorithms are
> typically error-prone and not always produce a faster code than the
> lock-based ones. I'm pretty confident this is out of scope of this
> particular patch.

Why would you need lockfree operations? All you need to do is to read BufferDesc->state into a local variable and then make decisions based on that?

> +	for (int i = 0; i < NBuffers; i++)
> +	{
> +		BufferDesc *bufHdr;
> +		uint32		buf_state;
> +
> +		bufHdr = GetBufferDescriptor(i);
> +
> +		/* Lock each buffer header before inspecting. */
> +		buf_state = LockBufHdr(bufHdr);
> +
> +		/* Invalid RelFileNumber means the buffer is unused */
> +		if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag)))
> +		{
> +			buffers_used++;
> +			usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);
> +
> +			if (buf_state & BM_DIRTY)
> +				buffers_dirty++;
> +		}
> +		else
> +			buffers_unused++;
> +
> +		if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
> +			buffers_pinned++;
> +
> +		UnlockBufHdr(bufHdr, buf_state);
> +	}

I.e. instead of locking the buffer header as done above, this could just do something along these lines:

	BufferDesc *bufHdr;
	uint32		buf_state;

	bufHdr = GetBufferDescriptor(i);

	buf_state = pg_atomic_read_u32(&bufHdr->state);

	if (buf_state & BM_VALID)
	{
		buffers_used++;
		usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);

		if (buf_state & BM_DIRTY)
			buffers_dirty++;
	}
	else
		buffers_unused++;

	if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
		buffers_pinned++;

Without a memory barrier you can get very slightly "out-of-date" values of the state, but that's fine in this case.

Greetings, Andres Freund
Re: Support pg_attribute_aligned and noreturn in MSVC
On Tue, Sep 20, 2022 at 08:01:20AM -0400, James Coleman wrote: > I don't have access to a Windows machine for testing, but re-reading > the documentation it looks like the issue is that our noreturn macro > is used after the definition while the MSVC equivalent is used before. A CI setup would do the job, for example; see src/tools/ci/README, which explains how to set things up. > I've removed that for now (and commented about it); it's not as > valuable anyway since it's mostly an indicator for code analysis > (human and machine). Except for the fact that the patch missed defining pg_attribute_noreturn() in the MSVC branch, this looks fine to me. I have been looking at what you meant with packing, and I can see the business with PACK(), something actually doable with gcc. That's a first step, at least, and I see no reason not to do it, so applied. -- Michael signature.asc Description: PGP signature
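For readers without a Windows box handy, the placement difference being discussed looks roughly like this (a sketch only, not the committed macros):

    /* GCC/Clang accept the attribute trailing the declaration, which is
     * how the existing macro is used, e.g.:
     *     extern void proc_exit(int code) pg_attribute_noreturn();
     */
    #define pg_attribute_noreturn() __attribute__((noreturn))

    /*
     * MSVC's equivalent must come before the declaration, so the trailing
     * placement cannot simply be mapped onto it:
     *     __declspec(noreturn) void proc_exit(int code);
     */

    /* Alignment, the part that was applied, maps more directly: */
    #if defined(_MSC_VER)
    #define pg_attribute_aligned(a) __declspec(align(a))
    #else
    #define pg_attribute_aligned(a) __attribute__((aligned(a)))
    #endif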
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 04:50:10PM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote: >> On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: >>> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: Any impact for the column sizes of the catalogs holding ACL information? Just asking while browsing the patch set. >>> >>> Since each aclitem requires 16 bytes instead of 12, I assume so. However, >>> in my testing, I hit a "row is too big" error with the same number of >>> aclitems in a pg_class row before and after the change. I might be missing >>> something in my patch, or maybe I am misunderstanding how arrays of >>> aclitems are stored on disk. >> >> Ah, it looks like relacl is compressed. The column is marked "extended," >> but pg_class doesn't appear to have a TOAST table, so presumably no >> out-of-line storage can be used. I found a couple of threads about this >> [0] [1] [2]. Adding a toast table to pg_class has been a sensitive topic over the years. Based on my recollection of the events, there were worries about the potential cross-dependencies with pg_class and pg_attribute that this would create. > I suppose there is some risk that folks with really long aclitem arrays > might be unable to pg_upgrade to a version with uint64 AclModes, but I > suspect that risk is limited to extreme cases (i.e., multiple thousands of > aclitems). I'm not sure whether that's worth worrying about too much. Did you just run an aclupdate()? 4% for aclitem[] sounds like quite a number to me :/ It may be worth looking at if these operations could be locally optimized more, as well. I'd like to think that we could live with that to free up enough bits in AclItems for the next 20 years, anyway. Any opinions? For the column sizes of the catalogs, I was wondering about how pg_column_size() changes when they hold ACL information. Unoptimized alignment could cause an unnecessary increase in the structure sizes, so the addition of new fields or changes in object size could have unexpected side effects. -- Michael signature.asc Description: PGP signature
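For the pg_column_size() question, something like the following could be used to watch how relacl grows with the number of entries, before and after widening AclMode (just a sketch; role and table names are made up):

    CREATE TABLE acl_bloat_test (a int);

    DO $$
    BEGIN
        FOR i IN 1..1000 LOOP
            EXECUTE format('CREATE ROLE acl_test_role_%s', i);
            EXECUTE format('GRANT SELECT ON acl_bloat_test TO acl_test_role_%s', i);
        END LOOP;
    END $$;

    -- stored (possibly compressed) size of the ACL column vs. entry count
    SELECT cardinality(relacl) AS aclitems,
           pg_column_size(relacl) AS stored_bytes
      FROM pg_class
     WHERE relname = 'acl_bloat_test';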
Re: Hash index build performance tweak from sorting
On Tue, 2 Aug 2022 at 03:37, Simon Riggs wrote:
> Using the above test case, I'm getting a further 4-7% improvement on
> already committed code with the attached patch, which follows your
> proposal.
>
> The patch passes info via a state object, useful to avoid API churn in
> later patches.

Hi Simon,

I took this patch for a spin and saw a 2.5% performance increase using the random INT test that Tom posted. The index took an average of 7227.47 milliseconds on master and 7045.05 with the patch applied.

On making a pass of the changes, I noted down a few things.

1. In _h_spoolinit() the HSpool is allocated with palloc and then you're setting the istate field to a pointer to the HashInsertState which is allocated on the stack by the only calling function (hashbuild()). Looking at hashbuild(), it looks like the return value of _h_spoolinit is never put anywhere to make it available outside of the function, so it does not seem like there is an actual bug there. However, it just seems like a bug waiting to happen. If _h_spoolinit() is pallocing memory, then we really shouldn't be setting pointer fields in that memory to point to something on the stack. It might be nicer if the istate field in HSpool was a HashInsertStateData and _h_spoolinit() just memcpy'd the contents of that parameter. That would make HSpool 4 bytes smaller and save additional pointer dereferences in _hash_doinsert().

2. There are quite a few casts that are not required. e.g:

   _hash_doinsert(rel, itup, heapRel, (HashInsertState) &istate);
   buildstate.spool = _h_spoolinit(heap, index, num_buckets, (HashInsertState) &insertstate);
   buildstate.istate = (HashInsertState) &insertstate;

This is just my opinion, but I don't really see the value in having a typedef for a pointer to HashInsertStateData. I can understand that if the struct was local to a .c file, but you've got the struct and pointer typedef in the same header. I understand we often do this in the code, but I feel like we do it less often in newer code. e.g we do it in aset.c but not generation.c (which is much newer than aset.c). My personal preference would be just to name the struct HashInsertState and have no extra pointer typedefs.

3. Just a minor nitpick. Line wraps at 80 chars. You're doing this sometimes but not others. This seems just to be due to the additional function parameters that have been added.

4. I added the following Assert to _hash_pgaddtup() as I expected the itup_off to be set to the same thing before and after this change. I see the Assert is failing in the regression tests.

   Assert(PageGetMaxOffsetNumber(page) + 1 ==
          _hash_binsearch(page, _hash_get_indextuple_hashkey(itup)));

I think this is because _hash_binsearch() returns the offset with the first tuple with the given hashkey, so if there are duplicate hashkey values then it looks like PageAddItemExtended() will set needshuffle and memmove() the existing item(s) up one slot. I don't know this hash index building code very well, but I wonder if it's worth having another version of _hash_binsearch() that can be used to make _hash_pgaddtup() put any duplicate hashkeys after the existing ones rather than before and shuffle the others up? It sounds like that might speed up normal insertions when there are many duplicate values to hash. I wonder if this might be the reason the random INT test didn't come out as good as your original test which had unique values. The unique values test would do less shuffling during PageAddItemExtended(). If so, that implies that skipping the binary search is only part of the gains here and that not shuffling tuples accounts for quite a bit of the gain you're seeing. If so, then it would be good to not have to shuffle duplicate hashkey tuples up in the page during normal insertions as well as when building the index. In any case, it would be nice to have some way to assert that we don't accidentally pass sorted==true to _hash_pgaddtup() when there's an existing item on the page with a higher hash value. Maybe we could just look at the hash value of the last tuple on the page and ensure it's <= to the current one?

5. I think it would be nicer to move the insertstate.sorted = false; into the else branch in hashbuild(). However, you might have to do that anyway if you were to do what I mentioned in #1.

David
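The check suggested in point 4 above could look something like this (an untested sketch that assumes the patch's sorted flag is visible inside _hash_pgaddtup()):

    #ifdef USE_ASSERT_CHECKING
    /* If the caller claims pre-sorted input, appending must be correct:
     * the incoming hash key may not be smaller than the page's last key. */
    if (sorted)
    {
        OffsetNumber	maxoff = PageGetMaxOffsetNumber(page);

        if (maxoff >= FirstOffsetNumber)
        {
            IndexTuple	lasttup = (IndexTuple)
                PageGetItem(page, PageGetItemId(page, maxoff));

            Assert(_hash_get_indextuple_hashkey(lasttup) <=
                   _hash_get_indextuple_hashkey(itup));
        }
    }
    #endif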
Re: Perform streaming logical transactions by background workers and parallel apply
FYI - The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch
error: patch failed: src/include/replication/logicalproto.h:246
error: src/include/replication/logicalproto.h: patch does not apply

--
[1] https://github.com/postgres/postgres/commit/bfcf1b34805f70df48eedeec237230d0cc1154a6

Kind Regards, Peter Smith. Fujitsu Australia
RE: Perform streaming logical transactions by background workers and parallel apply
> FYI -
>
> The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].
>
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch
> error: patch failed: src/include/replication/logicalproto.h:246
> error: src/include/replication/logicalproto.h: patch does not apply

Thanks for the kind reminder. I rebased the patch set and attached it in [1].

[1] - https://www.postgresql.org/message-id/OS3PR01MB6275298521AE1BBEF5A055EE9E4F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards, Wang wei
Re: why can't a table be part of the same publication as its schema
[personal views, not RMT]

On 9/20/22 4:06 PM, Mark Dilger wrote:
> I don't complain that it is building on the existing behavior. I'm *only*
> concerned about the keywords we're using for this. Consider the following:
>
>     -- AS ADMIN
>     CREATE USER bob NOSUPERUSER;
>     GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
>     SET ROLE bob;
>     CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;
>
> We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA
> option is reserved to superusers. But we agreed that was a stop-gap solution
> that we'd potentially loosen in the future. Certainly we'll need wiggle room
> in the syntax to perform that loosening:
>
>     --- Must be superuser for this in pg15, and in subsequent releases.
>     CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;
>
>     --- Not supported in pg15, but reserved for some future pg versions to allow
>     --- non-superusers to create publications on tables currently in schema foo,
>     --- assuming they have sufficient privileges on those tables
>     CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;
>
> Doing it this way makes the syntax consistent between the GRANT...TO bob and
> the CREATE PUBLICATION bobs_pub. Surely this makes more sense?

When you put it that way, I see your point. However, for the lesser-privileged user, will the behavior be that it will continue to add all future tables in a schema to the publication so long as they have sufficient privileges on those tables? Or would that mirror the current behavior with GRANT? While I understand it makes it consistent, the one concern I raise is that it means the less privileged user could have a less convenient user experience than the privileged user. Perhaps that's OK, but worth noting.

> I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to
> another database that uses that keyword for what I think is a similar purpose.

I did try doing research on this beforehand, but hadn't thought to incorporate "future" into my searches. Doing so, I probably found the same database that you did that used the "FUTURE" word for adding permissions to future objects (and this is fresh, as the docs for it were published last week). That's definitely interesting. I did see some notes on a legacy database system that offered similar advice to what we do for GRANT if you're not using ALTER DEFAULT PRIVILEGES.

> We should choose *something* for this, though, if we want things to be
> rational going forward.

That all said, while I understand your point and am open to the suggestion on "FUTURE", I'm not convinced on the syntax change. But I'll sleep on it.

Jonathan

OpenPGP_signature Description: OpenPGP digital signature
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
> One last thing - do you think there is any need to mention this
> behaviour in the pgdocs, or is it OK just to be a hidden performance
> improvement?

FYI - let me share my opinion. We have the following sentence in logical-replication.sgml:

```
...
If the table does not have any suitable key, then it can be set to replica
identity full, which means the entire row becomes the key. This, however, is
very inefficient and should only be used as a fallback if no other solution
is possible.
...
```

Here the wording "very inefficient" may suggest that sequential scans will be executed every time, so I think a short description of the new behaviour could be added around here.

Best Regards, Hayato Kuroda FUJITSU LIMITED
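As a concrete illustration that could sit next to that sentence (only a sketch of the idea; the table name is just an example):

    -- fallback when the published table has no suitable key
    ALTER TABLE pgbench_history REPLICA IDENTITY FULL;

    -- with the patch, the subscriber may use a suitable index on the target
    -- table for UPDATE/DELETE lookups instead of a per-row sequential scan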
Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
On Fri, Aug 19, 2022 at 5:27 PM Bharath Rupireddy wrote: > > Please review the attached v6 patch. I'm attaching the v7 patch rebased on to the latest HEAD. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 5ccee7e8ed82952bad6ef410176952aec2733e4f Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 21 Sep 2022 01:44:44 + Subject: [PATCH v7] Make WAL file initialization by pg_receivewal atomic While initializing WAL files with zeros (i.e. padding) the pg_receivewal can fail for any reason, for instance, no space left on device or host server crash etc., which may leave partially padded WAL files due to which the pg_receivewal won't come up after the hostserver restart. The error it emits is "error: write-ahead log file "foo.partial" has x bytes, should be 0 or y" (where y is size of WAL file typically and x < y). To fix this, one needs to manually intervene and delete the partially padded WAL file which requires an internal understanding of what the issue is and often a time-consuming process in production environments. The above problem can also exist for pg_basebackup as it uses the same walmethods.c function for initialization. To address the above problem, make WAL file initialization atomic i.e. first create a temporary file, pad it and then rename it to the target file. This is similar to what postgres does to make WAL file initialization atomic. Authors: Bharath Rupireddy, SATYANARAYANA NARLAPURAM Reviewed-by: Cary Huang, mahendrakar s Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://www.postgresql.org/message-id/CAHg%2BQDcVUss6ocOmbLbV5f4YeGLhOCt%2B1x2yLNfG2H_eDwU8tw%40mail.gmail.com --- src/bin/pg_basebackup/pg_receivewal.c | 22 ++ src/bin/pg_basebackup/walmethods.c| 96 --- 2 files changed, 109 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c index f98ec557db..1cbc992e67 100644 --- a/src/bin/pg_basebackup/pg_receivewal.c +++ b/src/bin/pg_basebackup/pg_receivewal.c @@ -237,6 +237,28 @@ is_xlogfilename(const char *filename, bool *ispartial, return true; } + /* + * File looks like a temporary partial uncompressed WAL file that was + * leftover during pre-padding phase from a previous crash, delete it now + * as we don't need it. 
+ */ + if (fname_len == XLOG_FNAME_LEN + strlen(".partial.tmp") && + strcmp(filename + XLOG_FNAME_LEN, ".partial.tmp") == 0) + { + /* 12 is length of string ".partial.tmp" */ + char path[MAXPGPATH + XLOG_FNAME_LEN + 12]; + + snprintf(path, sizeof(path), "%s/%s", basedir, filename); + + if (unlink(path) < 0) + pg_log_error("could not remove file \"%s\"", path); + + if (verbose) + pg_log_info("removed file \"%s\"", path); + + return false; + } + /* File does not look like something we know */ return false; } diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..4d34d4346a 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -118,7 +118,10 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, const char *temp_suffix, size_t pad_to_size) { DirectoryMethodData *dir_data = (DirectoryMethodData *) wwmethod; - char tmppath[MAXPGPATH]; + char targetpath[MAXPGPATH]; + char tmpsuffixpath[MAXPGPATH]; + bool shouldcreatetempfile = false; + int flags; char *filename; int fd; DirectoryMethodFile *f; @@ -134,7 +137,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, clear_error(wwmethod); filename = dir_get_file_name(wwmethod, pathname, temp_suffix); - snprintf(tmppath, sizeof(tmppath), "%s/%s", + snprintf(targetpath, sizeof(targetpath), "%s/%s", dir_data->basedir, filename); pg_free(filename); @@ -143,12 +146,57 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, * the file descriptor is important for dir_sync() method as gzflush() * does not do any system calls to fsync() to make changes permanent on * disk. + * + * Create a temporary file for padding and no compression cases to ensure + * a fully initialized file is created. */ - fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, pg_file_create_mode); + if (pad_to_size > 0 && + wwmethod->compression_algorithm == PG_COMPRESSION_NONE) + { + shouldcreatetempfile = true; + flags = O_WRONLY | PG_BINARY; + } + else + { + shouldcreatetempfile = false; + flags = O_WRONLY | O_CREAT | PG_BINARY; + } + + fd = open(targetpath, flags, pg_file_create_mode); if (fd < 0) { - wwmethod->lasterrno = errno; - return NULL; + if (errno == ENOENT && shouldcreatetempfile) + { + /* + * Actual file doesn't exist. Now, create a temporary file pad it + * and rename to the target file. The temporary file may exist from + * the last failed attempt (failed during partial padding or + * renaming or some ot