Re: Adding comments to help understand psql hidden queries
Hi! I have read the discussion and would like to share my humble opinion. I believe that a visually appealing way to display the output on the screen is to keep the line of asterisks and the description line symmetrical in length. Someone looking at the screen tends to focus on symmetrical details, so the description's string length should serve as the basis for the calculation. If the description length is an even number, the formula is ((description length − 7) / 2), 7 being the length of the string ' QUERY '; placing that many asterisks on both sides of ' QUERY ' ensures balance. If the description length is an odd number, place ((description length − 7) / 2) asterisks on the right side and (((description length − 7) / 2) + 1) asterisks on the left side. This method does not always result in a perfectly symmetric number of asterisks, but it provides a more visually aligned appearance. At the end of the SQL code, we should also include a line terminator of the same length as the description. The format looks like this:

/** QUERY ***/
/* Get information about row-level policies */
SELECT pol.polname, pol.polpermissive,
  CASE WHEN pol.polroles = '{0}' THEN NULL
       ELSE pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles where oid = any (pol.polroles) order by 1),',')
  END,
  pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
  pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
  CASE pol.polcmd
       WHEN 'r' THEN 'SELECT'
       WHEN 'a' THEN 'INSERT'
       WHEN 'w' THEN 'UPDATE'
       WHEN 'd' THEN 'DELETE'
  END AS cmd
FROM pg_catalog.pg_policy pol
WHERE pol.polrelid = '134384'
ORDER BY 1;

Regards, Maiquel.
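[Editor's note] Below is a minimal, self-contained C sketch of the padding rule described above; print_query_banner is a hypothetical helper name, shown for illustration only and not part of any posted patch:

#include <stdio.h>
#include <string.h>

/*
 * Distribute (description length - 7) asterisks around the 7-character
 * string " QUERY ", putting the extra asterisk on the left side when the
 * description length is odd.  Hypothetical helper, for illustration only.
 */
static void
print_query_banner(const char *description)
{
    int     desc_len = (int) strlen(description);
    int     fill = desc_len - 7;            /* 7 == strlen(" QUERY ") */
    int     right = fill / 2;
    int     left = (desc_len % 2 == 0) ? right : right + 1;

    putchar('/');
    for (int i = 0; i < left; i++)
        putchar('*');
    fputs(" QUERY ", stdout);
    for (int i = 0; i < right; i++)
        putchar('*');
    puts("/");
}

int
main(void)
{
    print_query_banner("/* Get information about row-level policies */");
    return 0;
}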
Re: Statistics Import and Export
On Mon, Mar 31, 2025 at 11:11:47AM -0400, Corey Huinker wrote: > In light of v11-0001 being committed as 4694aedf63bf, I've rebased the > remaining patches. I spent the day preparing these for commit. A few notes: * I've added a new prerequisite patch that skips the second WriteToc() call for custom-format dumps that do not include data. After some testing and code analysis, I haven't identified any examples where this produces different output. This doesn't help much on its own, but it will become rather important when we move the attribute statistics queries to happen within WriteToc() in 0002. * I was a little worried about the correctness of 0002 for dumps that run the attribute statistics queries twice, but I couldn't identify any problems here either. * I removed a lot of miscellaneous refactoring that seemed unnecessary for these patches. Let's move that to another patch set and keep these as simple as possible. * I made a small adjustment to the TOC scan restarting logic in fetchAttributeStats(). Specifically, we now only allow the scan to restart once for custom-format dumps that include data. * While these patches help decrease pg_dump's memory footprint, I believe pg_restore still reads the entire TOC into memory. That's not this patch set's problem, but I think it's still an important consideration for the bigger picture. Regarding whether pg_dump should dump statistics by default, my current thinking is that it shouldn't, but I think we _should_ have pg_upgrade dump/restore statistics by default because that is arguably the most important use-case. This is more a gut feeling than anything, so I reserve the right to change my opinion. My goal is to commit the attached patches on Friday morning, but of course that is subject to change based on any feedback or objections that emerge in the meantime. -- nathan >From 0256405dba245baa90c3fe35ee69e3cb1ce6253e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 31 Mar 2025 10:44:26 -0500 Subject: [PATCH v12n 1/3] Skip second WriteToc() for custom-format dumps without data. Presently, "pg_dump --format=custom" calls WriteToc() twice. The second call is intended to update the data offset information, which allegedly makes parallel pg_restore significantly faster. However, if we aren't dumping any data, this step accomplishes nothing and can be skipped. This is a preparatory optimization for a follow-up commit that will move the queries for per-attribute statistics to WriteToc() to save memory. Discussion: https://postgr.es/m/Z9c1rbzZegYQTOQE%40nathan --- src/bin/pg_dump/pg_backup_custom.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c index e44b887eb29..b971e3bc16e 100644 --- a/src/bin/pg_dump/pg_backup_custom.c +++ b/src/bin/pg_dump/pg_backup_custom.c @@ -755,9 +755,11 @@ _CloseArchive(ArchiveHandle *AH) * If possible, re-write the TOC in order to update the data offset * information. This is not essential, as pg_restore can cope in most * cases without it; but it can make pg_restore significantly faster -* in some situations (especially parallel restore). +* in some situations (especially parallel restore). We can skip this +* step if we're not dumping any data. 
*/ - if (ctx->hasSeek && + if (AH->public.dopt->dumpData && + ctx->hasSeek && fseeko(AH->FH, tpos, SEEK_SET) == 0) WriteToc(AH); } -- 2.39.5 (Apple Git-154) >From 6e41a1b2a175f7e9a859429e57c2ffb17ec9051d Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 31 Mar 2025 14:53:11 -0500 Subject: [PATCH v12n 2/3] pg_dump: Reduce memory usage of dumps with statistics. Right now, pg_dump stores all generated commands for statistics in memory. These commands can be quite large and therefore can significantly increase pg_dump's memory footprint. To fix, wait until we are about to write out the commands before generating them, and be sure to free the commands after writing. This is implemented via a new defnDumper callback that works much like the dataDumper one but is specially designed for TOC entries. One drawback of this change is that custom dumps that include data will run the statistics queries twice. However, a follow-up commit will add batching for these queries that our testing indicates should greatly improve dump speed (even when compared to pg_dump without this commit). Author: Corey Huinker Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com --- src/bin/pg_dump/pg_backup.h | 1 + src/bin/pg_dump/pg_backup_archiver.c | 35 ++ src/bin/pg_dump/pg_backup_archiver.h | 5 src/bin/pg_dump/pg_dump.c
Re: explain analyze rows=%.0f
On Mon, Mar 31, 2025 at 1:49 PM Tom Lane wrote: > Robert Haas writes: > > But why isn't it just as valuable to have two decimal places for the > > estimate? I theorize that the cases that are really a problem here are > > those where the row count estimate is between 0 and 1 per row, and > > rounding to an integer loses all precision. > > Currently, the planner rounds *all* rowcount estimates to integers > (cf. clamp_row_est()). Maybe it'd be appropriate to rethink that, > but it's not just a matter of changing EXPLAIN's print format. Oh, right. I've never really understood why we round off to integers, but the fact that we don't allow row counts < 1 feels like something pretty important. My intuition is that it probably helps a lot more than it hurts, too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: RFC: Logging plan of the running query
On Fri, Mar 21, 2025 at 8:40 AM torikoshia wrote: > Rebased it again. Hi, I apologize for not having noticed this thread sooner. I just became aware of it as a result of a discussion in the hacking Discord server. I think this has got a lot of overlap with the progressive EXPLAIN patch from Rafael Castro, which I have been reviewing, but I am not sure why we have two different patch sets here. Why is that? Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning
On Mon, Mar 31, 2025 at 8:27 AM David Rowley wrote: > > On Sat, 29 Mar 2025 at 05:46, Ashutosh Bapat > wrote: > > PFA patches. 0001 and 0002 are the same as the previous set. 0003 > > changes the initial hash table size to the length of ec_derives. > > I'm just not following the logic in making it the length of the > ec_derives List. If you have 32 buckets and try to insert 32 elements, > you're guaranteed to need a resize after inserting 28 elements. See > the grow_threshold logic in SH_UPDATE_PARAMETERS(). The point of making > it 64 was to ensure the table is never unnecessarily sparse and to > also ensure we make it at least big enough for the minimum number of > ec_derives that we're about to insert. If I am reading SH_CREATE correctly, it will initially set the size to 32/.9 = 36 and in SH_UPDATE_PARAMETERS will set grow_threshold to 36 * .9 = 32. So it will leave some room even after inserting the initial elements. But looking at SH_INSERT_HASH_INTERNAL(), it will soon expand the table even if there's space. We certainly need a size much more than 32. 32 is an arbitrary/empirical threshold to create a hash table. Using that threshold as the initial hash table size means the table size would be arbitrary too. Using twice the length of ec_derives_list seems more reasonable since the length will decide the initial number of entries. > > Looking more closely at the patch's ec_add_clause_to_derives_hash() > function, I see you're actually making two hash table entries for each > RestrictInfo, so without any em_is_const members, you'll insert 64 > entries into the hash table with an ec_derives list of 32, in which > case 64 buckets isn't enough and the table will end up growing to 128 > elements. > Yes, that's right. > I think you'd be better off coming up with some logic like putting the > lowest pointer value'd EM first in the key and ensure that all lookups > do that too by wrapping the lookups in some helper function. That'll > halve the number of hash table entries at the cost of some very cheap > comparisons. That's a good suggestion. I searched for C standard documentation which specifies that pointer comparison, especially inequality, is stable and safe to use, but I didn't find any. While according to the C standard the result of comparison between pointers within the same array or struct is specified, comparison between pointers from two different objects is unspecified. The existing code relies on the EM pointers being stable and also relies on equality between them being stable. It has withstood the test of time and a variety of compilers, hence I think it should be safe to rely on pointer comparisons being stable. But since I didn't find any documentation which confirms it, I have left those changes as a separate patch. Some internet sources discussing pointer comparison can be found at [1], [2] (which mentions the C standard but doesn't provide a link).

PFA the next patchset:
0001, 0002 are the same as in the previous set.
0003 changes the initial hash table size to length(ec->ec_derives_list) * 4 to accommodate commuted entries.
0004 uses canonical keys per your suggestion and also reduces the initial hash table size to length(ec->ec_derives_list) * 2.

When the number of initial elements to be added to the hash table is known, I see users of simplehash.h using that as an estimate rather than the nearest power of two. Hence I'm following the same logic here.
[1] https://stackoverflow.com/questions/59516346/how-does-pointer-comparison-work-in-c-is-it-ok-to-compare-pointers-that-dont-p [2] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Pointer-Comparison.html -- Best Wishes, Ashutosh Bapat From ccb72663e55f5ffe67595366e4a110685f647fc6 Mon Sep 17 00:00:00 2001 From: Amit Langote Date: Mon, 24 Mar 2025 21:17:46 +0900 Subject: [PATCH 1/4] Add assertion to verify derived clause has constant RHS find_derived_clause_for_ec_member() searches for a previously-derived clause that equates a non-constant EquivalenceMember to a constant. It is only called for EquivalenceClasses with ec_has_const set, and with a non-constant member as the target. The matched clause is expected to have the non-constant member on the left-hand side and the constant EquivalenceMember on the right. Assert that the RHS is indeed a constant, to catch violations of this structure and enforce assumptions made by generate_base_implied_equalities_const(). Author: Ashutosh Bapat Discussion: https://postgr.es/m/caexhw5scmxyfrqofe6odmbiw2rnvbemeeca-p4w_cyueiku...@mail.gmail.com --- src/backend/optimizer/path/equivclass.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 0f9ecf5ee8b..493a95d26cc 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -2664,7 +2664,10 @@ find_derived_clause_for_ec_member(EquivalenceClass *ec, * members on the left side of derived clauses. */
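[Editor's note] For illustration of the canonical-key idea discussed above (lowest pointer value first, so a clause and its commuted form share one hash entry), here is a hedged, standalone C sketch using generic void pointers and made-up names; it is not the actual patch code:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * Canonicalized pointer-pair key: (a, b) and (b, a) produce the same key,
 * so a derived clause and its commuted form need only one hash entry.
 * Ordering unrelated pointers relies on the same "pointer comparison is
 * stable in practice" assumption discussed above; uintptr_t keeps the
 * comparison well-defined for the compiler.
 */
typedef struct PairKey
{
    const void *lo;             /* member with the smaller address */
    const void *hi;             /* member with the larger address */
} PairKey;

static PairKey
make_pair_key(const void *a, const void *b)
{
    PairKey     key;

    if ((uintptr_t) a <= (uintptr_t) b)
    {
        key.lo = a;
        key.hi = b;
    }
    else
    {
        key.lo = b;
        key.hi = a;
    }
    return key;
}

int
main(void)
{
    int         x, y;
    PairKey     k1 = make_pair_key(&x, &y);
    PairKey     k2 = make_pair_key(&y, &x);
    bool        same = (k1.lo == k2.lo && k1.hi == k2.hi);

    /* Both argument orders yield the same key, so one entry suffices. */
    printf("%s\n", same ? "same key" : "different keys");
    return 0;
}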
Re: Proposal: Progressive explain
Hello again, > ERROR: could not attach to dynamic shared area In addition to that refactoring issue, the current patch had a race condition where pg_stat_progress_explain could access the DSA of a process running a query that gets aborted. While discussing with Robert, we agreed that it would be wiser to take a step back and change the strategy used to share progressive explain data in shared memory. Instead of using per-backend DSAs shared via a hash structure, I now define a dsa_pointer and an LWLock in each backend's PGPROC. A global DSA is created by the first backend that attempts to use the progressive explain feature. After the DSA is created, subsequent uses of the feature will just allocate memory there and reference it via PGPROC's dsa_pointer. This solves the race condition reported by Torikoshi and improves concurrency performance, as we no longer have a global LWLock controlling shared memory access, but a per-backend LWLock. I performed the same tests done by Torikoshi and it looks like we are good now, even with more frequent inspections of pg_stat_progress_explain (\watch 0.01). Rafael. v15-0001-Proposal-for-progressive-explains.patch Description: Binary data
Re: Test to dump and restore objects left behind by regression
On 2025-Mar-31, Daniel Gustafsson wrote: > Given where we are in the cycle, it seems to make sense to stick to using the > schedule we already have rather than invent a new process for generating it, > and work on that for 19? No objections to that. I'll see about getting this committed during my morning today, so that I have plenty of time to watch the buildfarm. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ Are you not unsure you want to delete Firefox? [Not unsure] [Not not unsure][Cancel] http://smylers.hates-software.com/2008/01/03/566e45b2.html
Re: Reduce "Var IS [NOT] NULL" quals during constant folding
On Tue, Apr 1, 2025 at 1:55 AM Robert Haas wrote: > As a general principle, I have found that it's usually a sign that > something has been designed poorly when you find yourself wanting to > open a relation, get exactly one piece of information, and close the > relation again. That is why, today, all the information that the > planner needs about a particular relation is retrieved by > get_relation_info(). We do not just wander around doing random catalog > lookups wherever we need some critical detail. This patch increases > the number of places where we fetch relation data from 1 to 2, but > it's still the case that almost everything happens in > get_relation_info(), and there's now just exactly this 1 thing that is > done in a different place. That doesn't seem especially nice. I > thought the idea was going to be to move get_relation_info() to an > earlier stage, not split one thing out of it while leaving everything > else the same. I initially considered moving get_relation_info() to an earlier stage, where we would collect all the per-relation data by relation OID. Later, when building the RelOptInfos, we could link them to the per-relation-OID data. However, I gave up this idea because I realized it would require retrieving a whole bundle of catalog information that isn't needed until after the RelOptInfos are built, such as max_attr, pages, tuples, reltablespace, parallel_workers, extended statistics, etc. And we may also need to create the IndexOptInfos for the relation's indexes. It seems to me that it's not a trivial task to move get_relation_info() before building the RelOptInfos, and more importantly, it's unnecessary most of the time. In other words, of the many pieces of catalog information we need to retrieve for a relation, only a small portion is needed at an early stage. As far as I can see, this small portion includes: * relhassubclass, which we retrieve immediately after we have finished adding rangetable entries. * attgenerated, which we retrieve in expand_virtual_generated_columns. * attnotnull, as discussed here, which should ideally be retrieved before pull_up_sublinks. My idea is to retrieve only this small portion at an early stage, and still defer collecting the majority of the catalog information until building the RelOptInfos. It might be possible to optimize by retrieving this small portion in one place instead of three, but moving the entire get_relation_info() to an earlier stage doesn't seem like a good idea to me. It could be argued that the separation of catalog information collection isn't very great, but it seems this isn't something new in this patch. So I respectfully disagree with your statement that "all the information that the planner needs about a particular relation is retrieved by get_relation_info()", and that "there's now just exactly this 1 thing that is done in a different place". For instance, relhassubclass and attgenerated are retrieved before expression preprocessing, a relation's constraint expressions are retrieved when setting the relation's size estimates, and more. Thanks Richard
Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.
On Tue, 1 Apr 2025 at 04:40, Christoph Berg wrote: > - Storage: Disk Maximum Storage: NkB > + Storage: Memory Maximum Storage: NkB > -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N > loops=N) We'll probably just need to bump that 2000 row count to something a bit more for 32-bit. Any chance you could share the output of: explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2000) a(n); Could you maybe also do a binary search for the number of rows where it goes to disk by adjusting the 2000 up in some increments until the Storage method is disk? (Not that I think we should set it to the minimum, but it would be good to not set it too much higher than we need to) David
Re: Truncate logs by max_log_size
On 2025/02/03 19:31, Jim Jones wrote: Hi Kirill On 31.01.25 11:46, Kirill Gavrilov wrote: Sorry for the long silence. I fixed the indentation and a trailing whitespace. Should look fine now. The patch applies cleanly, the documentation is clear, and all tests pass. It is possible to change this new parameter session-wise, which is nice! postgres=# SET max_log_size TO 7; SET postgres=# SHOW max_log_size; max_log_size -- 7B (1 row) The default value now is clear and it corresponds to the value set on postgresql.conf: #max_log_size = 0 # max size of logged statement postgres=# SHOW max_log_size; max_log_size -- 0 (1 row) Logs are truncated as expected: postgres=# SET max_log_size TO 6; SET postgres=# SELECT length('CALL xyz;'); length 9 (1 row) postgres=# CALL xyz; ERROR: syntax error at or near ";" LINE 1: CALL xyz; log entry: 2025-02-03 10:58:19.975 CET [123945] ERROR: syntax error at or near ";" at character 9 2025-02-03 10:58:19.975 CET [123945] STATEMENT: CALL x The issue with log entry sizes for queries containing special characters was resolved by setting the unit to bytes. Overall, everythingLGTM. The new status of this patch is: Ready for Committer Since this patch is marked as ready-for-committer, I started reviewing it. Basically I like the proposed idea. When I set log_statement to 'all', max_log_size to 3, and ran "SELECT 1/0", only the first three bytes of the query were logged by log_statement. However, no query statement appeared under STATEMENT, as shown below. Is this a bug? -- =# SET log_statement TO 'all'; =# SET max_log_size TO 3; =# SELECT 1/0; LOG: statement: SEL ERROR: division by zero STATEMENT: -- When log_replication_commands is enabled, replication commands are logged. Should max_log_size apply to these logs as well to prevent excessively large commands from being logged in full? The parameter name max_log_size seems misleading. It sounds like it controls the maximum log file size. Would a name like log_statement_max_length be clearer? The functions like exec_parse_message(), exec_bind_message(), and exec_execute_message() may log query statements (e.g., via log_min_duration_statement), but the patch doesn't seem to update them to consider max_log_size. Queries can also be logged in the CONTEXT line, such as when running "DO $$ BEGIN SELECT 1/0; END $$;", but max_log_size doesn't seem to apply in this case. There might be other cases where queries are logged, but the patch doesn't handle them. I'm not sure we can identify and address all of them before the feature freeze. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: SQLFunctionCache and generic plans
Alexander Pyhalov writes: > I've looked through it and made some tests, including ones which > caused me to create separate context for planing. Was a bit worried > that it has gone, but now, as fcache->fcontext is deleted in the end > of function execution, I don't see leaks, which were the initial > reason for introducing it. Yeah. As it's set up in v10, we do parsing work in the caller's context (which is expected to be short-lived) when creating or recreating the long-lived cache entry. However, planning work (if needed) is done in the fcontext, since that will happen within init_execution_state which is called after fmgr_sql has switched into the fcontext. I thought about switching back to the caller's context but decided that it wouldn't really be worth the trouble. For a non-SRF there's no meaningful difference anyway. For a SRF, it'd mean that planning cruft survives till end of execution of the SRF rather than possibly going away when the first row is returned. But we aren't looping: any given query within the SRF is planned only once during an execution of the function. So there's no possibility of indefinite accumulation of leakage. If we wanted to improve that, my inclination would be to try to not switch into the fcontext for the whole of fmgr_sql, but only use it explicitly for allocations that need to survive. But I don't think that'll save much, so I think any such change would be best left for later. The patch is big enough already. regards, tom lane
Re: Make COPY format extendable: Extract COPY TO format implementations
On Mon, Mar 31, 2025 at 11:52 AM Masahiko Sawada wrote: > On Sat, Mar 29, 2025 at 9:49 AM David G. Johnston > wrote: > > > > On Wed, Mar 26, 2025 at 8:28 PM Sutou Kouhei wrote: > >> > >> > >> The attached v39 patch set uses the followings: > >> > >> 0001: Create copyto_internal.h and change COPY_XXX to > >> COPY_SOURCE_XXX and COPY_DEST_XXX accordingly. > >> (Same as 1. in your suggestion) > >> 0002: Support custom format for both COPY TO and COPY FROM. > >> (Same as 2. in your suggestion) > >> 0003: Expose necessary helper functions such as CopySendEndOfRow() > >> and add CopyFromSkipErrorRow(). > >> (3. + 4. in your suggestion) > >> 0004: Define handler functions for built-in formats. > >> (Not included in your suggestion) > >> 0005: Documentation. (WIP) > >> (Same as 5. in your suggestion) > >> > > > > I prefer keeping 0002 and 0004 separate. In particular, keeping the > design choice of "unqualified internal format names ignore search_path" > should stand out as its own commit. > > What is the point of having separate commits for already-agreed design > choices? I guess that it would make it easier to revert that decision. > But I think it makes more sense that if we agree with "unqualified > internal format names ignore search_path" the original commit includes > that decision and describes it in the commit message. If we want to > change that design based on the discussion later on, we can have a > separate commit that makes that change and has the link to the > discussion. > Fair. Comment withdrawn. Though I was referring to the WIP patches; I figured the final patch would squash this all together in any case. David J.
[PATCH] Fix potential overflow in binary search mid calculation
Dear PostgreSQL Developers, I have identified a potential integer overflow issue in the binary search implementation within the DSA size class lookup code. Issue Description In the current implementation, the calculation of mid is performed as: uint16 mid = (max + min) / 2; Since both max and min are of type uint16, adding them together may exceed 65535, leading to an overflow and incorrect behavior in the binary search logic. This could result in incorrect indexing into the dsa_size_classes array. Proposed Fix To prevent this overflow, we should use the alternative calculation method: uint16 mid = min + (max - min) / 2; This approach ensures that (max - min) does not exceed 65535, preventing the addition from overflowing while still correctly computing the middle index. Patch A patch implementing this fix is attached. 0001-Fix-potential-overflow-in-binary-search-mid-calculat.patch Description: Binary data
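[Editor's note] For illustration, a standalone C sketch of the overflow-safe midpoint idiom in a generic lower-bound binary search (not the dsa.c lookup code itself): lo + (hi - lo) / 2 never exceeds hi, whereas (lo + hi) / 2 can wrap when the sum does not fit in the index type.

#include <stdio.h>
#include <stddef.h>

/*
 * Generic lower-bound binary search illustrating the overflow-safe
 * midpoint calculation.  Standalone sketch, for illustration only.
 */
static size_t
lower_bound(const int *array, size_t nmemb, int key)
{
    size_t      lo = 0;
    size_t      hi = nmemb;     /* half-open search range [lo, hi) */

    while (lo < hi)
    {
        /* lo + (hi - lo) / 2 cannot exceed hi, unlike (lo + hi) / 2 */
        size_t      mid = lo + (hi - lo) / 2;

        if (array[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                  /* index of first element >= key */
}

int
main(void)
{
    int         vals[] = {1, 3, 5, 7, 9};

    printf("%zu\n", lower_bound(vals, 5, 7));   /* prints 3 */
    return 0;
}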
Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).
On Mon, Mar 31, 2025 at 8:13 PM jian he wrote: > On Mon, Mar 31, 2025 at 11:27 PM Fujii Masao > > > > > The copy.sgml documentation should clarify that COPY TO can > > be used with a materialized view only if it is populated. > > > "COPY TO can be used only with plain tables, not views, and does not > copy rows from child tables or child partitions" > i changed it to > "COPY TO can be used with plain tables and materialized views, not > regular views, and does not copy rows from child tables or child > partitions" > > Another alternative wording I came up with: > "COPY TO can only be used with plain tables and materialized views, > not regular views. It also does not copy rows from child tables or > child partitions." > > Those don't address the "populated" aspect of the materialized view. I'm unclear ATM (can check later if needed) if populated means "non-empty" or simply "had refresh executed at least once on it". i.e., if the refresh produces zero rows stored in the MV is it still populated? I'm guessing yes; and this only pertains to "WITH NO DATA", which itself already calls out that "...and cannot be queried until RMV is used". I find it of marginal usefulness to bring that distinction over to COPY TO absent people showing up confused about the error message, which likely will be quite rare. That said I'd probably settle with: COPY TO can only be used with plain tables and populated materialized views. It does not copy rows from child tables or child partitions (i.e., copy table to copies the same rows as select * from only table). The syntax COPY (select * from table) TO ... can be used to dump all of the rows in an inheritance hierarchy, partitioned table, or foreign table; as well as ordinary view results. Curious about sequences; no way to name an index here. I'm second-guessing why "composite type" shows up in the glossary under "Relation"...though that originally came up IIRC discussing namespaces. David J.
Re: add function argument name to substring and substr
On Mon, Mar 31, 2025 at 9:12 PM jian he wrote: > On Tue, Apr 1, 2025 at 6:22 AM David G. Johnston > wrote: > > > > On Tue, Mar 18, 2025 at 9:04 PM jian he > wrote: > >> > >> > >> new patch attached. > >> > > > > I've done v4 with a delta patch. > > your v4-0001-v3-0001-substring.patch is not the same as my > v3-0001-add-argument-name-to-function-substring-and-subst.patch > > Sorry about that. v5 attached. Confirmed with diff the v3 and v5 0001 so we should be good. David J. From 507064e643b54d2bebd7689acd3dde60d230e328 Mon Sep 17 00:00:00 2001 From: "David G. Johnston" Date: Mon, 31 Mar 2025 21:36:42 -0700 Subject: [PATCH 1/2] v3-0001 substring --- doc/src/sgml/func.sgml | 111 +-- src/backend/catalog/system_functions.sql | 2 +- src/include/catalog/pg_proc.dat | 12 +++ 3 files changed, 118 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index c642f1ea4e..a3569995f1 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3806,6 +3806,58 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in + + + +substring ( string text, pattern text ) +text + + +Extracts the first substring matching POSIX regular expression; see +. + + +substring('Thomas', '...$') +mas + + + + + +substring ( string text, pattern text, escape_character text) +text + + +Extracts the first substring matching SQL regular expression; +see . + + +substring('Thomas', '%#"o_a#"_', '#') +oma + + + + + + + substring + +substring ( string text, start integer , count integer ) +text + + +Extracts the substring of string starting at +the start'th character, +and stopping after count characters if that is +specified. + + + +substring('Thomas', 2, 3) +hom + + + @@ -4811,6 +4863,27 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three'); \x5678 + + + + + substring + +substring ( bytes bytea, start integer , count integer ) +bytea + + +Extracts the substring of bytes starting at +the start'th byte, +and stopping after count bytes if that is +specified. + + +substring('\x1234567890'::bytea, 3, 2) +\x5678 + + + @@ -5353,6 +5426,26 @@ cast(-1234 as bytea) \xfb2e + + + + substring + +substring ( bits bit, start integer , count integer ) +bit + + +Extracts the substring of bits starting at +the start'th bit, +and stopping after count bits if that is +specified. + + +substring(B'11001011', 3, 2) +00 + + + @@ -5816,7 +5909,7 @@ substring(string from pattern or as a plain three-argument function: -substring(string, pattern, escape-character) +substring(string, pattern, escape_character) As with SIMILAR TO, the specified pattern must match the entire data string, or else the @@ -6020,11 +6113,17 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL - The substring function with two parameters, - substring(string from - pattern), provides extraction of a - substring - that matches a POSIX regular expression pattern. It returns null if + The substring function with two parameters provides extraction of a + substring that matches a POSIX regular expression pattern. + It has syntax: + +substring(string from pattern) + + It can also written as a plain two-argument function: + +substring(string, pattern) + + It returns null if there is no match, otherwise the first portion of the text that matched the pattern. 
But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql index 566f308e44..5ea9d786b6 100644 --- a/src/backend/catalog/system_functions.sql +++ b/src/backend/catalog/system_functions.sql @@ -42,7 +42,7 @@ CREATE OR REPLACE FUNCTION rpad(text, integer) IMMUTABLE PARALLEL SAFE STRICT COST 1 RETURN rpad($1, $2, ' '); -CREATE OR REPLACE FUNCTION "substring"(text, text, text) +CREATE OR REPLACE FUNCTION "substring"(string text, pattern text, escape_character text) RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT COST 1 diff --git a/src/include/catalog/pg_proc.dat b/src/inc
tzdata 2025b
Hi all, tzdata 2025b has been released on 3/22[1]. Do we need to update the tzdata.zi file on HEAD and backbranches? Regards, [1] https://data.iana.org/time-zones/tzdb/NEWS -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: pgsql: Add support for OAUTHBEARER SASL mechanism
On Mon, Mar 31, 2025 at 2:54 PM Christoph Berg wrote: > > > Add support for OAUTHBEARER SASL mechanism > > Debian still has this experimental port with a GNU userland and a > FreeBSD kernel called kfreebsd. I don't expect anyone to particularly > care about it, but it found an actual bug: > > /build/reproducible-path/postgresql-18-18~~devel.20250331/build/../src/interfaces/libpq/fe-auth-oauth-curl.c: > In function ‘register_socket’: > /build/reproducible-path/postgresql-18-18~~devel.20250331/build/../src/interfaces/libpq/fe-auth-oauth-curl.c:1317:20: > error: ‘actx’ undeclared (first use in this function); did you mean ‘ctx’? > 1317 | actx_error(actx, "libpq does not support multiplexer sockets > on this platform"); > |^~~~ > > This should not be a compile-time error; actx is not defined outside > the #ifdef blocks there: Ah, sorry about that. Thank you for reporting it! (That means that Windows builds --with-libcurl are similarly broken, I think. Not that Windows packagers will want to use --with-libcurl -- it doesn't do anything -- but it should build.) I don't have hurd-amd64 to test, but I'm working on a patch that will build and pass tests if I manually munge pg_config.h. We were skipping the useless tests via a $windows_os check; I think I should use check_pg_config() instead. We could change how this works a bit for the proposed libpq-oauth.so plugin, and only build it if we have a workable implementation. I do like having these other platforms compile the Curl code, though, since we'd prefer to keep the build clean for a future Windows implementation... --Jacob
Re: Replace IN VALUES with ANY in WHERE clauses during optimization
Hi, Alexander! On 30.03.2025 00:59, Alexander Korotkov wrote: Hi, Alena! On Sat, Mar 29, 2025 at 9:03 PM Alena Rybakina wrote: On 29.03.2025 14:03, Alexander Korotkov wrote: One thing I have to fix: we must do IncrementVarSublevelsUp() unconditionally for all expressions as Vars could be deeper inside. Yes, I'm looking at it too, I've just understood that it was needed for subqueries - they can contain var elements which needs decrease the sublevel parameter. for example for the query: EXPLAIN (COSTS OFF) SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c WHERE c.unique2 = t.unique1))::integer)); We are interested in this element: ((2 IN (SELECT unique2 FROM onek c WHERE c.unique2 = t.unique1)) It is funcexpr object with RabgeTblEntry variable. I highlighted WARNING: 1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 1 :funccollid 0 :inputcollid 0 :args ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM :paramkind 2 :paramid 1 :paramtype 23 :paramtypmod -1 :paramcollid 0 :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasTargetSRFs false :hasSubLinks false :hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :hasGroupRTE false :isReturn false :cteList <> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c :colnames <>} :eref {ALIAS :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" "hundred" "thousand" "twothousand" "fivethous" "tenthous" "odd" "even" "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true :relkind r :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl true :securityQuals <>}) :rteperminfos ({RTEPERMISSIONINFO :relid 32795 :inh true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9) :insertedCols (b) :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) :quals {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 2 :varreturningtype 0 :varnosyn 1 :varattnosyn 1 :location -1}) :location -1}} :mergeActionList <> :mergeTargetRelation 0 :mergeJoinCondition <> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 :resname unique2 :ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk false}) :override 0 :onConflict <> :returningOldAlias <> :returningNewAlias <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <> :limitCount <> :limitOption 0 :rowMarks <> :setOperations <> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} :location -1}) :location -1} I highlighted in bold the var we need - since it is in a subquery in the in 
expression will be flattened, all elements contained in it should decrease the level number by one, since they will belong to the subtree located above it. Because of that condition, this did not happen. I generally agree with you that it is better to remove that condition. The function IncrementVarSublevelsUp essentially goes through the structures below and will decrease the level of only the vars for which this needs to be done, and the condition with 1 will protect us from touching those vars that should not. So the varlevelsup for this var should be 1. I am currently investigating whether this transformation will be fair for all cases; I have not found any problems yet. Thank you for your feedback. I appreciate you're also looking for the potential problems. On thing to highlight: doing IncrementVarSublevelsUp() unconditionally is required not just for subqueries. Consider the following example. SELECT * FROM t WHERE val1 IN (VALUES (val2), (val2 +1)); The second value contain Var, which needs IncrementVarSublevelsUp(), but the top node is OpExpr. Yes, I agree with that - this is precisely why we need to call IncrementVarSublevelsUp() unconditionally for all types. As you mentioned earlier, Var nodes can be nested more deeply, and skipping this step could lead to incorrect behavior in those cases. So, now it works fine) Thank you for an example. I analyzed this transfor
Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.
Re: David Rowley > Any chance you could share the output of: > > explain (analyze,buffers off,costs off) select sum(n) over() from > generate_series(1,2000) a(n); PostgreSQL 18devel on x86-linux, compiled by gcc-14.2.0, 32-bit =# explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2000) a(n); QUERY PLAN ── WindowAgg (actual time=1.248..1.731 rows=2000.00 loops=1) Window: w1 AS () Storage: Memory Maximum Storage: 63kB -> Function Scan on generate_series a (actual time=0.301..0.536 rows=2000.00 loops=1) Planning Time: 0.066 ms Execution Time: 1.913 ms (6 rows) > Could you maybe also do a binary search for the number of rows where > it goes to disk by adjusting the 2000 up in some increments until the > Storage method is disk? (Not that I think we should set it to the > minimum, but it would be good to not set it too much higher than we > need to) The test has a `set work_mem = 64;` which I used here: =# explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2047) a(n); QUERY PLAN ── WindowAgg (actual time=1.037..1.429 rows=2047.00 loops=1) Window: w1 AS () Storage: Memory Maximum Storage: 64kB -> Function Scan on generate_series a (actual time=0.262..0.457 rows=2047.00 loops=1) Planning Time: 0.058 ms Execution Time: 1.594 ms (6 rows) =# explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2048) a(n); QUERY PLAN ── WindowAgg (actual time=2.073..2.686 rows=2048.00 loops=1) Window: w1 AS () Storage: Disk Maximum Storage: 65kB -> Function Scan on generate_series a (actual time=0.624..1.064 rows=2048.00 loops=1) Planning Time: 0.064 ms Execution Time: 2.934 ms (6 rows) (With the default work_mem, the tipping point is around 149500) Christoph
Re: explain analyze rows=%.0f
On 31.03.2025 22:09, Robert Haas wrote: Oh, right. I've never really understood why we round off to integers, but the fact that we don't allow row counts < 1 feels like something pretty important. My intuition is that it probably helps a lot more than it hurts, too. We definitely shouldn’t remove the row counts < 1 check, since there are many places in the planner where we divide by rows. This mechanism was added specifically to prevent division by zero. Also, allowing rows estimates below 1 can sometimes make the planner overly optimistic, leading it to prefer cheaper-looking plans that may not perform well in practice. For example, choosing a Nested Loop instead of a more appropriate Hash Join. Allowing fractional rows > 1 might help improve planner accuracy in some cases, but this needs further study to fully understand the impact. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Periodic FSM vacuum doesn't happen in one-pass strategy vacuum.
Hi, With commit c120550edb86, If we got the cleanup lock on the page, lazy_scan_prune() marks dead item IDs directly to LP_UNUSED. So the following check with has_lpdead_items made the periodic FSM vacuum in the one-pass strategy vacuum no longer being triggered: if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items && blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) { FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno); next_fsm_block_to_vacuum = blkno; } Before c120550edb86, since we marked dead item IDs to LP_DEAD once even in the one-pass strategy vacuum, we used to call lazy_vacuum_heap_page() to vacuum the page and to call FreeSpaceMapVacuum() periodically, so we had that check. I've attached a patch to fix it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.
From: David Rowley Subject: Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN. Date: Tue, 1 Apr 2025 11:09:11 +1300 Message-ID: > On Tue, 1 Apr 2025 at 09:40, Christoph Berg wrote: >> =# explain (analyze,buffers off,costs off) select sum(n) over() from >> generate_series(1,2048) a(n); >> QUERY PLAN >> ── >> WindowAgg (actual time=2.073..2.686 rows=2048.00 loops=1) >>Window: w1 AS () >>Storage: Disk Maximum Storage: 65kB > > Thank you for testing that. I've just pushed a patch to bump it up to 2500. > > I suspect the buildfarm didn't catch this due to the tuplestore > consuming enough memory in MEMORY_CONTEXT_CHECKING builds. David, Christoph, Thank you for fixing this! -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: speedup COPY TO for partitioned table.
On Tue, 1 Apr 2025 at 06:31, jian he wrote: > > On Mon, Mar 31, 2025 at 4:05 PM Kirill Reshke wrote: > > Thanks for doing the benchmark. Few comments to improve the comments, error message and remove redundant assignment: 1) How about we change below: /* * partition's rowtype might differ from the root table's. We must * convert it back to the root table's rowtype as we are export * partitioned table data here. */ To: /* * A partition's row type might differ from the root table's. * Since we're exporting partitioned table data, we must * convert it back to the root table's row type. */ 2) How about we change below: + if (relkind == RELKIND_FOREIGN_TABLE) + ereport(ERROR, + errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("cannot copy from foreign table \"%s\"", + RelationGetRelationName(rel)), + errdetail("partition \"%s\" is a foreign table", RelationGetRelationName(rel)), + errhint("Try the COPY (SELECT ...) TO variant.")); To: + if (relkind == RELKIND_FOREIGN_TABLE) + ereport(ERROR, + errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("cannot copy from a partitioned table having foreign table partition \"%s\"", + RelationGetRelationName(rel)), + errdetail("partition \"%s\" is a foreign table", RelationGetRelationName(rel)), + errhint("Try the COPY (SELECT ...) TO variant.")); 3) How about we change below: /* * rel: the relation to be copied to. * root_rel: if not NULL, then the COPY partitioned relation to destination. * processed: number of tuples processed. */ To: /* * rel: the relation from which data will be copied. * root_rel: If not NULL, indicates that rel's row type must be * converted to root_rel's row type. * processed: number of tuples processed. */ 4) You can initialize processed to 0 along with declaration in DoCopyTo function and remove the below: + if (cstate->rel && cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { ... ... processed = 0; - while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot)) ... ... - - ExecDropSingleTupleTableSlot(slot); - table_endscan(scandesc); + } + else if (cstate->rel) + { + processed = 0; + CopyThisRelTo(cstate, cstate->rel, NULL, &processed); } Regards, Vignesh
Re: explain analyze rows=%.0f
On 3/31/25 19:35, Robert Haas wrote: > But why isn't it just as valuable to have two decimal places for the > estimate? I theorize that the cases that are really a problem here are > those where the row count estimate is between 0 and 1 per row, and > rounding to an integer loses all precision. The issues I have seen were about a row count of zero - we lose understanding of what job the executor has done in the node - loops = 1000 doesn't tell us any helpful information in that case. Maybe in the next version we replace the two-decimal-places rule with two significant digits. At least that removes boring xxx.00 numbers from EXPLAIN. -- regards, Andrei Lepikhov
Re: Non-text mode for pg_dumpall
On Mon, 31 Mar 2025 at 23:43, Álvaro Herrera wrote: > > Hi > > FWIW I don't think the on_exit_nicely business is in final shape just > yet. We're doing something super strange and novel about keeping track > of an array index, so that we can modify it later. Or something like > that, I think? That doesn't sound all that nice to me. Elsewhere it > was suggested that we need some way to keep track of the list of things > that need cleanup (a list of connections IIRC?) -- perhaps in a > thread-local variable or a global or something -- and we install the > cleanup function once, and that reads from the variable. The program > can add things to the list, or remove them, at will; and we don't need > to modify the cleanup function in any way. > > -- > Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Thanks Álvaro for the feedback. I removed the old handling of on_exit_nicely_list from the last patch set and added one simple function that just updates the archive handle in shutdown_info (shutdown_info.AHX = AHX;). For the first database, we add an entry into the on_exit_nicely_list array, and for the remaining databases, we only update shutdown_info, as we have already closed the connection for the previous database. With this fix, we do not touch the on_exit_nicely_list entry for each database. Here, I am attaching updated patches. -- Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com v25_0002-add-new-list-type-simple_oid_string_list-to-fe-utils.patch Description: Binary data v25_0001-Move-common-pg_dump-code-related-to-connections.patch Description: Binary data v25_0004-update-AX-handle-for-each-database-for-cleanup.patch Description: Binary data v25_0003-pg_dumpall-with-directory-tar-custom-format.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu) wrote: > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > I suspect that this can happen in PG17 as well, but I need to think > > more about it to make a reproducible test case. > > After further analysis, I was able to reproduce the same issue [1] in > PG 17. > > However, since the proposed fix requires catalog changes and the issue is not > a > security risk significant enough to justify changing the catalog in back > branches, we cannot back-patch the same solution. > Agreed. In the past, as in commit b6e39ca92e, we have backported a catalog-modifying commit, but that is for a CVE. Users need to follow manual steps as explained in 9.6.4 release notes [1], which would be cumbersome for them. This is not a security issue, so we shouldn't backpatch a catalog modifying commit following past. > Following off-list > discussions with Amit and Kuroda-san, we are considering disallowing enabling > failover and two-phase decoding together for a replication slot, as suggested > in attachment 0002. > > Another idea considered is to prevent the slot that enables two-phase decoding > from being synced to standby. IOW, this means displaying the failover field as > false in the view, if there is any possibility that transactions prepared > before the two_phase_at position exist (e.g., if restart_lsn is less than > two_phase_at). However, implementing this change would require additional > explanations to users for this new behavior, which seems tricky. > I find it tricky to explain to users. We need to say that sometimes the slots won't be synced even if the failover is set to true. Users can verify that by checking slot properties on the publisher. Also, on the subscriber, the failover flag in the subscription may still be true unless we do more engineering to make it false. So, I prefer to simply disallow setting failover and two_phase together. We need to recommend to users in release notes for 17 that they need to disable failover for subscriptions where two_phase is enabled or re-create the subscriptions with two_phase=false and failover=true. Users may not like it, but I think it is better than getting a complaint that after promotion of standby the data to subscriber is not getting replicated. [1] - https://www.postgresql.org/docs/9.6/release-9-6-4.html -- With Regards, Amit Kapila.
Re: Periodic FSM vacuum doesn't happen in one-pass strategy vacuum.
On Mon, Mar 31, 2025 at 3:12 PM Melanie Plageman wrote: > > On Mon, Mar 31, 2025 at 6:03 PM Masahiko Sawada wrote: > > > > With commit c120550edb86, If we got the cleanup lock on the page, > > lazy_scan_prune() marks dead item IDs directly to LP_UNUSED. So the > > following check with has_lpdead_items made the periodic FSM vacuum in > > the one-pass strategy vacuum no longer being triggered: > > > > if (got_cleanup_lock && vacrel->nindexes == 0 && > > has_lpdead_items && > > blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) > > { > > FreeSpaceMapVacuumRange(vacrel->rel, > > next_fsm_block_to_vacuum, > > blkno); > > next_fsm_block_to_vacuum = blkno; > > } > > > > Before c120550edb86, since we marked dead item IDs to LP_DEAD once > > even in the one-pass strategy vacuum, we used to call > > lazy_vacuum_heap_page() to vacuum the page and to call > > FreeSpaceMapVacuum() periodically, so we had that check. > > Whoops, yea, I had a feeling that something wasn't right here when I > saw that thread a couple weeks ago about FSM vacuuming taking forever > at the end of a vacuum and then I remember looking at the code and > thinking why is VACUUM_FSM_EVERY_PAGES named that if it only is used > when there are no indexes or a multi-pass vacuum. > > I even made a comment in [1] that we would only do FSM_EVERY_PAGES > when we have multi-pass vacuum -- which is even less after 17. > > Isn't it ironic that I actually broke it. > > > I've attached a patch to fix it. > > Looks like you forgot Oops, attached now. Looking the places related to VACUUM_FSM_EVERY_PAGES further, the comment atop of vacuumlazy.c seems incorrect: * In between phases, vacuum updates the freespace map (every * VACUUM_FSM_EVERY_PAGES). IIUC the context is two-pass strategy vacuum (i.e., the table has indexes) and VACUUM_FSM_EVERY_PAGES is ignored in this case. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_periodic_fsm_vacuum.patch Description: Binary data
Re: [PATCH] Fix potential overflow in binary search mid calculation
Jianghua Yang 于2025年4月1日周二 04:29写道: > Dear PostgreSQL Developers, > > I have identified a potential integer overflow issue in the binary search > implementation within the DSA size class lookup code. > Issue Description > > In the current implementation, the calculation of mid is performed as: > > uint16 mid = (max + min) / 2; > > Since both max and min are of type uint16, adding them together may exceed > 65535, leading to an overflow and incorrect behavior in the binary > search logic. This could result in incorrect indexing into the > dsa_size_classes array. > The value of min is from the array dsa_size_class_map. The max value in dsa_size_class_map[] is 25. The value of max is the length of dsa_size_classes[], which is not too large. It will not happen that (max + min) exceeds 65535. > Proposed Fix > > To prevent this overflow, we should use the alternative calculation method: > > uint16 mid = min + (max - min) / 2; > > This approach ensures that (max - min) does not exceed 65535, preventing > the addition from overflowing while still correctly computing the middle > index. > Patch > > A patch implementing this fix is attached. > -- Thanks, Tender Wang
Re: Replace IN VALUES with ANY in WHERE clauses during optimization
Hi, Alena! On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina wrote: > Yes, I agree with that - this is precisely why we need to call IncrementVarSublevelsUp() unconditionally for all types. > > As you mentioned earlier, Var nodes can be nested more deeply, and skipping this step could lead to incorrect behavior in those cases. So, now it works fine) > > Thank you for an example. > > I analyzed this transformation with various types of values that might be used in conditions. > > First, I verified whether the change would affect semantics, especially in the presence of NULL elements. The only notable behavior I observed was > the coercion of NULL to an integer type. However, this behavior remains the same even without our transformation, so everything is fine. Thank you for your experiments! I've also rechecked we don't sacrifice lazy evaluation. But it appears we don't have one anyway. CREATE FUNCTION my_func() RETURNS text AS $$ BEGIN RAISE NOTICE 'notice'; RETURN 'b'; END; $$ LANGUAGE 'plpgsql'; # create table test (val text); # insert into test values ('a'); # explain analyze select * from test where val in (VALUES ('a'), (my_func())); NOTICE: notice QUERY PLAN --- Hash Semi Join (cost=0.05..21.26 rows=9 width=64) (actual time=0.178..0.183 rows=1.00 loops=1) Hash Cond: (test.val = ("*VALUES*".column1)::text) Buffers: shared hit=1 -> Seq Scan on test (cost=0.00..18.80 rows=880 width=64) (actual time=0.045..0.048 rows=1.00 loops=1) Buffers: shared hit=1 -> Hash (cost=0.03..0.03 rows=2 width=32) (actual time=0.111..0.112 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=32) (actual time=0.004..0.065 rows=2.00 loops=1) Planning Time: 0.250 ms Execution Time: 0.267 ms (10 rows) -- Regards, Alexander Korotkov Supabase
Re: add function argument name to substring and substr
On Tue, Apr 1, 2025 at 6:22 AM David G. Johnston wrote: > > On Tue, Mar 18, 2025 at 9:04 PM jian he wrote: >> >> >> new patch attached. >> > > I've done v4 with a delta patch. > > Decided to standardize on calling the SQL Similar To regular expression > escape replaceable "escape" everywhere. > > Instead of fully documenting the obsolete syntax I added a note explaining > the keyword choice difference. Removed mention of it completely from the > Pattern Matching portion of the documentation - that section has enough going > on. > > I also add "Same as" references for the two pairs of entries. Not married to > them but they do seem warranted; having Pattern Matching be required reading > to make that connection seems undesirable. >
I cannot build docs based on your v4-0001. Your v4-0001-v3-0001-substring.patch is not the same as my v3-0001-add-argument-name-to-function-substring-and-subst.patch, for example: - + substring ( string text FROM pattern text FOR escape text ) -text +text
I tried:
git am v3-0001-add-argument-name-to-function-substring-and-subst.patch
patch -p1 < v4-0002-v3-0002-delta.patch
Then there are several places that differ, and it doesn't seem easy to resolve the difference.
Can you make sure v4-0001-v3-0001-substring.patch is the same as v3-0001-add-argument-name-to-function-substring-and-subst.patch? Then I can review your delta patch.
Re: Statistics Import and Export
On Mon, Mar 31, 2025 at 10:33 PM Nathan Bossart wrote: > On Mon, Mar 31, 2025 at 11:11:47AM -0400, Corey Huinker wrote: > Regarding whether pg_dump should dump statistics by default, my current > thinking is that it shouldn't, but I think we _should_ have pg_upgrade > dump/restore statistics by default because that is arguably the most > important use-case. This is more a gut feeling than anything, so I reserve > the right to change my opinion. > I did some mental exercises on a number of different use cases and scenarios (pagila work, pgextractor type stuff, backups, etc...) and I couldn't come up with any strong arguments against including the stats by default, generally because I think when your process needs to care about the output of pg_dump, it seems like most cases require enough specificity that this wouldn't actually break that. Still, I am sympathetic to Greg's earlier concerns on the topic, but would also agree it seems like a clear win for pg_upgrade, so I think our gut feelings might actually be aligned on this one ;-) Robert Treat https://xzilla.net
Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).
On Mon, Mar 31, 2025 at 11:27 PM Fujii Masao wrote: > > Regarding the patch, here are some review comments: > > + errmsg("cannot copy from > materialized view when the materialized view is not populated"), > > How about including the object name for consistency with > other error messages in BeginCopyTo(), like this? > > errmsg("cannot copy from unpopulated materialized view \"%s\"", >RelationGetRelationName(rel)), > > > + errhint("Use the REFRESH > MATERIALIZED VIEW command populate the materialized view first.")); > > There seems to be a missing "to" just after "command". > Should it be "Use the REFRESH MATERIALIZED VIEW command to > populate the materialized view first."? Or we could simplify > the hint to match what SELECT on an unpopulated materialized > view logs: "Use the REFRESH MATERIALIZED VIEW command.". > based on your suggestion, i changed it to: if (!RelationIsPopulated(rel)) ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot copy from unpopulated materialized view \"%s\"", RelationGetRelationName(rel)), errhint("Use the REFRESH MATERIALIZED VIEW command to populate the materialized view first.")); > > The copy.sgml documentation should clarify that COPY TO can > be used with a materialized view only if it is populated. > "COPY TO can be used only with plain tables, not views, and does not copy rows from child tables or child partitions" i changed it to "COPY TO can be used with plain tables and materialized views, not regular views, and does not copy rows from child tables or child partitions" Another alternative wording I came up with: "COPY TO can only be used with plain tables and materialized views, not regular views. It also does not copy rows from child tables or child partitions." > > Wouldn't it be beneficial to add a regression test to check > whether COPY matview TO works as expected? sure. From 3e404817827a58721cf8966080492f1254ea06cb Mon Sep 17 00:00:00 2001 From: jian he Date: Tue, 1 Apr 2025 11:11:32 +0800 Subject: [PATCH v2 1/1] COPY materialized_view TO generally `COPY table TO` is faster than `COPY (query)`. since populated materialized view have physical storage, so this can use table_beginscan, table_endscan to scan a table. context: https://postgr.es/m/8967.1353167...@sss.pgh.pa.us context: https://www.postgresql.org/message-id/flat/20121116162558.90150%40gmx.com discussion: https://postgr.es/m/CACJufxHVxnyRYy67hiPePNCPwVBMzhTQ6FaL9_Te5On9udG=y...@mail.gmail.com commitfest entry: https://commitfest.postgresql.org/patch/5533/ --- doc/src/sgml/ref/copy.sgml | 4 ++-- src/backend/commands/copyto.c | 13 - src/test/regress/expected/copy2.out | 11 +++ src/test/regress/sql/copy2.sql | 8 4 files changed, 29 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index df093da97c5..c3107488c81 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -520,8 +520,8 @@ COPY count Notes -COPY TO can be used only with plain -tables, not views, and does not copy rows from child tables +COPY TO can be used with plain +tables and materialized views, not regular views, and does not copy rows from child tables or child partitions. For example, COPY table TO copies the same rows as SELECT * FROM ONLY rd_rel->relkind == RELKIND_MATVIEW) - ereport(ERROR, - (errcode(ERRCODE_WRONG_OBJECT_TYPE), - errmsg("cannot copy from materialized view \"%s\"", - RelationGetRelationName(rel)), - errhint("Try the COPY (SELECT ...) 
TO variant."))); + { + if (!RelationIsPopulated(rel)) +ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot copy from unpopulated materialized view \"%s\"", +RelationGetRelationName(rel)), + errhint("Use the REFRESH MATERIALIZED VIEW command to populate the materialized view first.")); + } else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out index 64ea33aeae8..f7aa55e1691 100644 --- a/src/test/regress/expected/copy2.out +++ b/src/test/regress/expected/copy2.out @@ -929,3 +929,14 @@ truncate copy_default; -- DEFAULT cannot be used in COPY TO copy (select 1 as test) TO stdout with (default '\D'); ERROR: COPY DEFAULT cannot be used with COPY TO +-- COPY TO with materialized view +CREATE MATERIALIZED VIEW matview1 AS SELECT 1 as id; +CREATE MATERIALIZED VIEW matview2 AS SELECT 1 as id WITH NO DATA; +copy matview1(id) TO stdout with (header); +id +1 +copy matview2 TO stdout with (header); +ERROR: cannot copy from unpopulated
Re: speedup COPY TO for partitioned table.
On Mon, Mar 31, 2025 at 4:05 PM Kirill Reshke wrote: > > Hi! > I reviewed v7. Maybe we should add a multi-level partitioning case > into copy2.sql regression test? > sure. > I also did quick benchmarking for this patch: > > DDL > > create table ppp(i int) partition by range (i); > > genddl.sh: > > for i in `seq 0 200`; do echo "create table p$i partition of ppp for > values from ( $((10 * i)) ) to ( $((10 * (i + 1))) ); "; done > > === insert data data: > insert into ppp select i / 1000 from generate_series(0, 200)i; > > === results: > > > for 201 rows speedup is 1.40 times : 902.604 ms (patches) vs > 1270.648 ms (unpatched) > > for 402 rows speedup is 1.20 times : 1921.724 ms (patches) vs > 2343.393 ms (unpatched) > > for 804 rows speedup is 1.10 times : 3932.361 ms (patches) vs > 4358.489ms (unpatched) > > So, this patch indeed speeds up some cases, but with larger tables > speedup becomes negligible. > Thanks for doing the benchmark. From b27371ca4ff132e7d2803406f9e3f371c51c96df Mon Sep 17 00:00:00 2001 From: jian he Date: Tue, 1 Apr 2025 08:56:22 +0800 Subject: [PATCH v8 1/1] support COPY partitioned_table TO CREATE TABLE pp (id INT, val int ) PARTITION BY RANGE (id); create table pp_1 (val int, id int); create table pp_2 (val int, id int); ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5); ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10); insert into pp select g, 10 + g from generate_series(1,9) g; copy pp to stdout(header); the above case is much slower (around 25% in some case) than ``COPY (select * from pp) to stdout(header);``, because of column remaping. but this is still a new feature, since master does not support ``COPY (partitioned_table)``. reivewed by: vignesh C reivewed by: David Rowley reivewed by: Melih Mutlu reivewed by: Kirill Reshke discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=i...@mail.gmail.com commitfest entry: https://commitfest.postgresql.org/patch/5467/ --- doc/src/sgml/ref/copy.sgml | 8 +- src/backend/commands/copyto.c | 143 ++-- src/test/regress/expected/copy2.out | 20 src/test/regress/sql/copy2.sql | 17 4 files changed, 157 insertions(+), 31 deletions(-) diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index df093da97c5..f86e0b7ec35 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -521,15 +521,15 @@ COPY count COPY TO can be used only with plain -tables, not views, and does not copy rows from child tables -or child partitions. For example, COPY COPY TO can be used with partitioned tables. +For example, in a table inheritance hierarchy, COPY table TO copies the same rows as SELECT * FROM ONLY table. The syntax COPY (SELECT * FROM table) TO ... can be used to -dump all of the rows in an inheritance hierarchy, partitioned table, -or view. +dump all of the rows in an inheritance hierarchy, or view. diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index 84a3f3879a8..6fc940bddbc 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -19,6 +19,8 @@ #include #include "access/tableam.h" +#include "access/table.h" +#include "catalog/pg_inherits.h" #include "commands/copyapi.h" #include "commands/progress.h" #include "executor/execdesc.h" @@ -82,6 +84,7 @@ typedef struct CopyToStateData List *attnumlist; /* integer list of attnums to copy */ char *filename; /* filename, or NULL for STDOUT */ bool is_program; /* is 'filename' a program to popen? 
*/ + List *partitions; /* oid list of partition oid for copy to */ copy_data_dest_cb data_dest_cb; /* function for writing data */ CopyFormatOptions opts; @@ -116,6 +119,8 @@ static void CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot); static void CopyAttributeOutText(CopyToState cstate, const char *string); static void CopyAttributeOutCSV(CopyToState cstate, const char *string, bool use_quote); +static void CopyThisRelTo(CopyToState cstate, Relation rel, + Relation root_rel, uint64 *processed); /* built-in format-specific routines */ static void CopyToTextLikeStart(CopyToState cstate, TupleDesc tupDesc); @@ -643,6 +648,8 @@ BeginCopyTo(ParseState *pstate, PROGRESS_COPY_COMMAND_TO, 0 }; + List *children = NIL; + List *scan_oids = NIL; if (rel != NULL && rel->rd_rel->relkind != RELKIND_RELATION) { @@ -670,11 +677,35 @@ BeginCopyTo(ParseState *pstate, errmsg("cannot copy from sequence \"%s\"", RelationGetRelationName(rel; else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) - ereport(ERROR, - (errcode(ERRCODE_WRONG_OBJECT_TYPE), - errmsg("cannot copy from partitioned table \"%s\"", - RelationGetRelationName(rel)), - errhint("Try the COPY (SELECT ...) TO variant."))); + { + children = find_all_
Re: [PATCH] PGSERVICEFILE as part of a normal connection string
Hi, I am working on a feature adjacent to the connection service functionality and noticed some issues with the tests introduced in this thread. Basically they incorrectly invoke the append perl function by passing multiple strings to append when the function only takes one string to append. This caused the generated service files to not actually contain any connection parameters. The tests were only passing because the connect_ok perl function set the connection parameters as environment variables which covered up the misformed connection service file. The attached patch is much more strict in that it creates a dummy database that is not started and passes all queries though that and tests that the connection service file correctly overrides the environment variables set by the dummy databases' query functions Thanks, Andrew Jackson On Mon, Mar 31, 2025, 4:01 PM Ryo Kanbayashi wrote: > On Sat, Mar 22, 2025 at 4:46 PM Michael Paquier > wrote: > > > > On Thu, Mar 20, 2025 at 06:16:44PM +0900, Ryo Kanbayashi wrote: > > > Sorry, I found a miss on 006_service.pl. > > > Fixed patch is attached... > > > > Please note that the commit fest app needs all the patches of a a set > > to be posted in the same message. In this case, v2-0001 is not going > > to get automatic test coverage. > > > > Your patch naming policy is also a bit confusing. I would suggest to > > use `git format-patch -vN -2`, where N is your version number. 0001 > > would be the new tests for service files, and 0002 the new feature, > > with its own tests. > > All right. > I attached patches generated with your suggested command :) > > > +if ($windows_os) { > > + > > +# Windows: use CRLF > > +print $fh "[my_srv]", "\r\n"; > > +print $fh join( "\r\n", split( ' ', $node->connstr ) ), "\r\n"; > > +} > > +else { > > +# Non-Windows: use LF > > +print $fh "[my_srv]", "\n"; > > +print $fh join( "\n", split( ' ', $node->connstr ) ), "\n"; > > +} > > +close $fh; > > > > That's duplicated. Let's perhaps use a $newline variable and print > > into the file using the $newline? > > OK. > I reflected above comment. > > > Question: you are doing things this way in the test because fgets() is > > what is used by libpq to retrieve the lines of the service file, is > > that right? > > No. I'm doing above way simply because line ending code of service file > wrote by users may become CRLF in Windows platform. > > > Please note that the CI is failing. It seems to me that you are > > missing a done_testing() at the end of the script. If you have a > > github account, I'd suggest to set up a CI in your own fork of > > Postgres, this is really helpful to double-check the correctness of a > > patch before posting it to the lists, and saves in round trips between > > author and reviewer. Please see src/tools/ci/README in the code tree > > for details. > > Sorry. > I'm using Cirrus CI with GitHub and I checked passing the CI. > But there were misses when I created patch files... > > > +# Copyright (c) 2023-2024, PostgreSQL Global Development Group > > > > These dates are incorrect. Should be 2025, as it's a new file. > > OK. > > > +++ b/src/interfaces/libpq/t/007_servicefile_opt.pl > > @@ -0,0 +1,100 @@ > > +# Copyright (c) 2023-2024, PostgreSQL Global Development Group > > > > Incorrect date again in the second path with the new feature. I'd > > suggest to merge all the tests in a single script, with only one node > > initialized and started. > > OK. 
> Additional test scripts have been merged to a single script ^^ b > > --- > Great regards, > Ryo Kanbayashi > From 0d732ee8edbb16132b95f35775388528fa9003d8 Mon Sep 17 00:00:00 2001 From: AndrewJackson Date: Mon, 31 Mar 2025 15:18:48 -0500 Subject: [PATCH] libpq: Fix TAP tests for service files and names This commit builds on the tests that were added in a prior commit that tests the connection service file functionality. The tests are fixed to correctly invoke the append_to_file subroutine which only takes one argument and not multiple. The test also runs the statements through a dummy database to ensure that the options from the service file are actually being picked up and just passing because they are defaulting to the connection environment variables that are set in the connect_ok function. Author: Andrew Jackson --- src/interfaces/libpq/t/006_service.pl | 38 ++- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/src/interfaces/libpq/t/006_service.pl b/src/interfaces/libpq/t/006_service.pl index d3ecfa6b6fc..e0d18d72359 100644 --- a/src/interfaces/libpq/t/006_service.pl +++ b/src/interfaces/libpq/t/006_service.pl @@ -9,6 +9,9 @@ use Test::More; # This tests scenarios related to the service name and the service file, # for the connection options and their environment variables. +my $dummy_node = PostgreSQL::Test::Cluster->new('dummy_node'); +$dummy_node->init; + my $node =
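As a side note on the earlier exchange about the duplicated print blocks: the $newline refactoring Michael suggested could look roughly like this (untested sketch, reusing $windows_os, $fh and $node exactly as in the quoted test):

    my $newline = $windows_os ? "\r\n" : "\n";
    print $fh "[my_srv]", $newline;
    print $fh join($newline, split(' ', $node->connstr)), $newline;
    close $fh;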
Re: pgsql: Add support for OAUTHBEARER SASL mechanism
Re: To Daniel Gustafsson > > Add support for OAUTHBEARER SASL mechanism > > Debian still has this experimental port with a GNU userland and a > FreeBSD kernel called kfreebsd. Sorry this part was nonsense, kfreebsd was actually terminated and obviously I didn't even read the port's name. The failing port (still experimental and care-is-optional) is hurd-amd64. The bug is the same, though. Christoph
Re: Using read stream in autoprewarm
Hi, Thank you for looking into this! On Mon, 31 Mar 2025 at 17:42, Melanie Plageman wrote: > > On Sun, Mar 30, 2025 at 10:01 AM Nazir Bilal Yavuz wrote: > > > > > Some review feedback on your v4: I don't think we need the > > > rs_have_free_buffer per_buffer_data. We can just check > > > have_free_buffers() in both the callback and main loop in > > > autoprewarm_database_main(). I also think you want a comment about why > > > first_block is needed. And I think you need to guard the > > > read_stream_end() after the loop -- what if we never made a read > > > stream because we errored out for all the block's relations or > > > something? > > > > All of these are addressed. One extra thing I noticed is we were not > > checking if blocknum < number_of_block_in_relation at the first_block > > case in the stream callback, this is fixed now. > > I'm wondering why you need to check if have_free_buffer() in the else > branch after getting the buffer from the read stream API. Can't you > just put it back in the for loop condition? Seems like it would have > the same effect. > > - for (; pos < apw_state->prewarm_stop_idx; pos++) > + for (; pos < apw_state->prewarm_stop_idx && have_free_buffer(); pos++) > { > BlockInfoRecord *blk = &block_info[pos]; > Buffer buf; > @@ -640,9 +640,6 @@ autoprewarm_database_main(Datum main_arg) > apw_state->prewarmed_blocks++; > ReleaseBuffer(buf); > } > - /* There are no free buffers left in shared buffers, break the loop. > */ > - else if (!have_free_buffer()) > - break; > You are right, done. Attached as v6. > Looking at the code some more, I feel stuck on my original point about > incrementing the position in two places. > AFAICT, you are going to end up going through the array twice with this > design. > Because you don't set the pos variable in autoprewarm_database_main() > from the p->pos variable which the read stream callback is > incrementing, if the read stream callback increments p->pos it a few > positions yielding those blocks to the read stream machinery to read, > you are then going to iterate over those positions in the array again > in the autoprewarm_database_main() loop. > > I think you can get around this by setting pos from p->pos in > autoprewarm_database_main() after read_stream_next_buffer(). Or by > using p->pos in the loop in autoprewarm_database_main() (which is > basically what Andres suggested). I'm not sure, though, if this has > any problems. Like not closing a relation in the right place. I worked on an alternative approach, I refactored code a bit. It does not traverse the list two times and I think the code is more suitable to use read streams now. I simply get how many blocks are processed by read streams and move the list forward by this number, so the actual loop skips these blocks. This approach is attached with 'alternative' prefix. -- Regards, Nazir Bilal Yavuz Microsoft v6-0001-Optimize-autoprewarm-with-read-streams.patch Description: Binary data v6-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch Description: Binary data alternative_0001-Optimize-autoprewarm-with-read-streams.patch Description: Binary data alternative_0002-Count-free-buffers-at-the-start-of-the-autoprewarm.patch Description: Binary data
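To make the termination logic being discussed concrete, the main loop essentially keeps pulling buffers from the stream until either the stream is exhausted or shared buffers run out of free slots. A hand-written sketch against the read stream API, eliding the relation/fork switching and error handling that the real patch has to deal with:

    while (have_free_buffer())
    {
        Buffer  buf = read_stream_next_buffer(stream, NULL);

        if (!BufferIsValid(buf))
            break;              /* callback yielded no more blocks */

        apw_state->prewarmed_blocks++;
        ReleaseBuffer(buf);
    }
    read_stream_end(stream);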
Re: Periodic FSM vacuum doesn't happen in one-pass strategy vacuum.
On Mon, Mar 31, 2025 at 6:03 PM Masahiko Sawada wrote: > > With commit c120550edb86, If we got the cleanup lock on the page, > lazy_scan_prune() marks dead item IDs directly to LP_UNUSED. So the > following check with has_lpdead_items made the periodic FSM vacuum in > the one-pass strategy vacuum no longer being triggered: > > if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items > && > blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) > { > FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, > blkno); > next_fsm_block_to_vacuum = blkno; > } > > Before c120550edb86, since we marked dead item IDs to LP_DEAD once > even in the one-pass strategy vacuum, we used to call > lazy_vacuum_heap_page() to vacuum the page and to call > FreeSpaceMapVacuum() periodically, so we had that check. Whoops, yea, I had a feeling that something wasn't right here when I saw that thread a couple weeks ago about FSM vacuuming taking forever at the end of a vacuum and then I remember looking at the code and thinking why is VACUUM_FSM_EVERY_PAGES named that if it only is used when there are no indexes or a multi-pass vacuum. I even made a comment in [1] that we would only do FSM_EVERY_PAGES when we have multi-pass vacuum -- which is even less after 17. Isn't it ironic that I actually broke it. > I've attached a patch to fix it. Looks like you forgot - Melanie [1] https://www.postgresql.org/message-id/CAAKRu_aXqoj2Vfqu3yzscsTX%3D2nPQ4y-aadCNz6nJP_o12GyxA%40mail.gmail.com
Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.
On Tue, 1 Apr 2025 at 09:40, Christoph Berg wrote: > =# explain (analyze,buffers off,costs off) select sum(n) over() from > generate_series(1,2048) a(n); > QUERY PLAN > ── > WindowAgg (actual time=2.073..2.686 rows=2048.00 loops=1) >Window: w1 AS () >Storage: Disk Maximum Storage: 65kB Thank you for testing that. I've just pushed a patch to bump it up to 2500. I suspect the buildfarm didn't catch this due to the tuplestore consuming enough memory in MEMORY_CONTEXT_CHECKING builds. David
Re: Test to dump and restore objects left behind by regression
> On 28 Mar 2025, at 19:12, Alvaro Herrera wrote: > > On 2025-Mar-28, Tom Lane wrote: > >> I think instead of going this direction, we really need to create a >> separately-purposed script that simply creates "one of everything" >> without doing anything else (except maybe loading a little data). >> I believe it'd be a lot easier to remember to add to that when >> inventing new SQL than to remember to leave something behind from the >> core regression tests. This would also be far faster to run than any >> approach that involves picking a random subset of the core test >> scripts. > > FWIW this sounds closely related to what I tried to do with > src/test/modules/test_ddl_deparse; it's currently incomplete, but maybe > we can use that as a starting point. Given where we are in the cycle, it seems to make sense to stick to using the schedule we already have rather than invent a new process for generating it, and work on that for 19? -- Daniel Gustafsson
Re: add function argument name to substring and substr
On Tue, Mar 18, 2025 at 9:04 PM jian he wrote: > > new patch attached. > > I've done v4 with a delta patch. Decided to standardize on calling the SQL Similar To regular expression escape replaceable "escape" everywhere. Instead of fully documenting the obsolete syntax I added a note explaining the keyword choice difference. Removed mention of it completely from the Pattern Matching portion of the documentation - that section has enough going on. I also add "Same as" references for the two pairs of entries. Not married to them but they do seem warranted; having Pattern Matching be required reading to make that connection seems undesirable. David J. From b2f64615da9522427a2e2662b1d060ffed97088c Mon Sep 17 00:00:00 2001 From: "David G. Johnston" Date: Mon, 31 Mar 2025 14:32:12 -0700 Subject: [PATCH 1/2] v3 0001 substring --- doc/src/sgml/func.sgml | 115 +-- src/backend/catalog/system_functions.sql | 2 +- src/include/catalog/pg_proc.dat | 12 +++ 3 files changed, 120 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index c642f1ea4e..e4c95f1e88 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -2850,9 +2850,9 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in substring ( string text SIMILAR pattern text ESCAPE escape text ) text - + substring ( string text FROM pattern text FOR escape text ) -text +text Extracts the first substring matching SQL regular expression; @@ -3806,6 +3806,58 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in + + + +substring ( string text, pattern text ) +text + + +Extracts the first substring matching POSIX regular expression; see +. + + +substring('Thomas', '...$') +mas + + + + + +substring ( string text, pattern text, escape_character text) +text + + +Extracts the first substring matching SQL regular expression; +see . + + +substring('Thomas', '%#"o_a#"_', '#') +oma + + + + + + + substring + +substring ( string text, start integer , count integer ) +text + + +Extracts the substring of string starting at +the start'th character, +and stopping after count characters if that is +specified. + + + +substring('Thomas', 2, 3) +hom + + + @@ -4811,6 +4863,27 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three'); \x5678 + + + + + substring + +substring ( bytes bytea, start integer , count integer ) +bytea + + +Extracts the substring of bytes starting at +the start'th byte, +and stopping after count bytes if that is +specified. + + +substring('\x1234567890'::bytea, 3, 2) +\x5678 + + + @@ -5353,6 +5426,26 @@ cast(-1234 as bytea) \xfb2e + + + + substring + +substring ( bits bit, start integer , count integer ) +bit + + +Extracts the substring of bits starting at +the start'th bit, +and stopping after count bits if that is +specified. + + +substring(B'11001011', 3, 2) +00 + + + @@ -5816,7 +5909,7 @@ substring(string from pattern or as a plain three-argument function: -substring(string, pattern, escape-character) +substring(string, pattern, escape_character) As with SIMILAR TO, the specified pattern must match the entire data string, or else the @@ -6020,11 +6113,17 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL - The substring function with two parameters, - substring(string from - pattern), provides extraction of a - substring - that matches a POSIX regular expression pattern. It returns null if + The substring function with two parameters provides extraction of a + substring that matches a POSIX regular expression pattern. 
+ It has syntax: + +substring(string from pattern) + + It can also written as a plain two-argument function: + +substring(string, pattern) + + It returns null if there is no match, otherwise the first portion of the text that matched the pattern. But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the diff --git a/src/backend/catalog/system_functions.s
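For reference, the behaviour documented in the hunks above boils down to the following (results taken from the examples in the patch; the named-notation call is hypothetical and assumes the parameter ends up being called "pattern" in pg_proc.dat):

    SELECT substring('Thomas', 2, 3);               -- hom
    SELECT substring('Thomas', '...$');             -- mas  (POSIX regular expression form)
    SELECT substring('Thomas', '%#"o_a#"_', '#');   -- oma  (SQL regular expression form)

    -- the point of adding argument names is to also allow named notation, e.g.:
    SELECT substring('Thomas', pattern => '...$');  -- mas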
Re: tzdata 2025b
Masahiko Sawada writes: > tzdata 2025b has been released on 3/22[1]. Do we need to update the > tzdata.zi file on HEAD and backbranches? Yup, eventually, but I don't normally worry about it until we are approaching a release date. tzdata changes often come in bunches around the spring and fall equinoxes, which is when governments tend to rush out DST changes without thinking about lead times :-(. So it's entirely possible that 2025b will already be obsolete by May. (See for example ecbac3e6e and d8fc45bd0, when I thought I'd waited long enough and was wrong.) The situation might be different with a tzdata release that affects our regression test results, but I don't believe this one does. regards, tom lane
Re: RFC: Logging plan of the running query
On 2025-04-01 04:24, Robert Haas wrote: On Fri, Mar 21, 2025 at 8:40 AM torikoshia wrote: Rebased it again. Hi, I apologize for not having noticed this thread sooner. Thank you for your checking! No worries. I just became aware of it as a result of a discussion in the hacking Discord server. I think this has got a lot over overlap with the progressive EXPLAIN patch from Rafael Castro, which I have been reviewing, but I am not sure why we have two different patch sets here. Why is that? This thread proposes a feature to log the results of a plain EXPLAIN (without ANALYZE). From Rafael Castro's presentation materials at pgconf.eu 2024 [1], I understand that he built a new patch on top of this one, adding functionality to track execution progress and display results in a view. He also mentioned that he used the way of wrapping plan nodes from this patch during the development to solve a problem[2]. I think that's why there are overlaps between the two. Previously, Rafael proposed a patch in this thread that added execution progress tracking. However, I felt that expanding the scope could make it harder to get the patch reviewed or committed. So, I suggested first completing a feature that only retrieves the execution plan of a running query, and then developing execution progress tracking afterward[3]. As far as I remember, he did not respond to this thread after my suggestion but instead started a new thread later[4]. Based on that, I assume he would not have agreed with my proposed approach. Rafael, please let us know if there are any misunderstandings in the above. In this thread, I aimed to output the plan without requiring prior configuration if possible. However, based on your comment on Rafael's thread[5], it seems that this approach would be difficult: One way in which this proposal seems safer than previous proposals is that previous proposals have involved session A poking session B and trying to get session B to emit an EXPLAIN on the fly with no prior setup. That would be very useful, but I think it's more difficult and more risky than this proposal, where all the configuration happens in the session that is going to emit the EXPLAIN output. I was considering whether to introduce a GUC in this patch to allow for prior setup before outputting the plan or to switch to Rafael's patch after reviewing its details. However, since there isn’t much time left before the feature freeze, if you have already reviewed Rafael's patch and there is a chance it could be committed, it would be better to focus on that. [1] https://www.youtube.com/watch?v=6ahTb-7C05c (mentions this patch after the 12-minute mark) [2] https://www.postgresql.org/message-id/CAG0ozMo30smtXXOR8bSCbhaZAQHo4%3DezerLitpERk85Q0ga%2BFw%40mail.gmail.com [3] https://www.postgresql.org/message-id/c161b5e7e1888eb9c9eb182a7d9dcf89%40oss.nttdata.com [4] https://www.postgresql.org/message-id/flat/CAG0ozMo30smtXXOR8bSCbhaZAQHo4%3DezerLitpERk85Q0ga%2BFw%40mail.gmail.com#4b261994a0d44649ded6e1434a00e134 [5] https://www.postgresql.org/message-id/CA%2BTgmoaD985%2BVLwR93c8PjSaoBqxw72Eu7pfBJcArzhjJ71aRw%40mail.gmail.com -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Deadlock detected while executing concurrent insert queries on same table
Hello community, I recently encountered a deadlock in PostgreSQL while performing concurrent INSERT statements on the same table in two separate sessions. The error message explicitly mentions that the deadlock occurred while inserting an index tuple. There were no explicit transactions (BEGIN/COMMIT); the inserts were executed as standalone statements.

*PG version:* 11.4

Here's the *table definition*:

                              Table "auditlog"
     Column     |            Type             | Collation | Nullable | Default
----------------+-----------------------------+-----------+----------+---------
 id             | bigint                      |           | not null |
 audit_logid    | bigint                      |           | not null |
 recordid       | bigint                      |           |          |
 recordname     | text                        |           |          |
 module         | character varying(50)       |           | not null |
 actioninfo     | citext                      |           | not null |
 relatedid      | bigint                      |           |          |
 related_name   | character varying(255)      |           |          |
 accountid      | bigint                      |           |          |
 accountname    | character varying(255)      |           |          |
 doneby         | character varying(255)      |           | not null |
 userid         | bigint                      |           |          |
 audit_time     | timestamp without time zone |           | not null |
 isauditlogdata | boolean                     |           | not null |
 details        | citext                      |           |          |
 auditcategory  | integer                     |           | not null |
 operationtype  | integer                     |           | not null |
 source         | integer                     |           | not null |
Indexes:
    "auditlog_pkey" PRIMARY KEY, btree (id, audit_time, audit_logid)
    "auditlog_idx1" btree (recordid)
    "auditlog_idx2" btree (audit_logid DESC)
    "auditlog_idx3" btree (userid)
    "auditlog_idx4" btree (relatedid)
    "auditlog_idx5" gist (actioninfo gist_trgm_ops)

and the exact *error message* from the logs:

*ERROR*: INSERT failed, ERROR: deadlock detected
DETAIL: Process 3841267 waits for ShareLock on transaction 185820512; blocked by process 3841268. Process 3841268 waits for ShareLock on transaction 185820513; blocked by process 3841267.
HINT: See server log for query details.
CONTEXT: while inserting index tuple (31889,32) in relation "auditlog"

*Insert Query1:*
INSERT INTO auditlog (ID,AUDIT_LOGID,RECORDID,RECORDNAME,MODULE,ACTIONINFO,RELATEDID,RELATED_NAME,ACCOUNTID,ACCOUNTNAME,DONEBY,USERID,AUDIT_TIME,ISAUDITLOGDATA,DETAILS,AUDITCATEGORY,OPERATIONTYPE,SOURCE) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18)

*Insert Query2:*
INSERT INTO auditlog (id, audit_logid, recordid, recordname, module, actioninfo, relatedid, related_name, accountid, accountname, doneby, userid, audit_time, isauditlogdata, details, auditcategory, operationtype, source) VALUES ('67016721806'::bigint, '38976328846849'::bigint, NULL::bigint, NULL::text, 'Tasks'::character varying(50), 'Deleted'::citext, NULL::bigint, NULL::character varying(255), NULL::bigint, NULL::character varying(255), 'Technologies'::character varying(255), '3470005430253334'::bigint, '2024-03-24 14:39:06'::timestamp without time zone, true, NULL::citext, 0, 11, 20)

Could this be a bug, or is it expected behaviour under certain conditions? I was unable to reproduce this issue again. Any insights or guidance on how to analyse this further would be greatly appreciated.

Regards,
Sri Keerthi.
Re: support virtual generated column not null constraint
On Fri, Mar 28, 2025 at 10:06 PM Peter Eisentraut wrote: > > On 24.03.25 04:26, jian he wrote: > > rebase, and some minor code comments change. > > I have committed this. > In an earlier thread, I also posted a patch for supporting virtual generated columns over domain type. The code is somehow similar, so I post the remaining patch to this thread. v7-0001 we need to compute the generation expression for the domain with constraints, thus rename ExecComputeStoredGenerated to ExecComputeGenerated. v7-0002 soft error variant of ExecPrepareExpr, ExecInitExpr. for soft error processing of CoerceToDomain. we don't want error messages like "value for domain d2 violates check constraint "d2_check"" while validating existing domain data, we want something like: +ERROR: column "b" of table "gtest24" contains values that violate the new constraint v7-0003 supports virtual generation columns over domain type. If the domain has constraints then we need to compute the virtual generation expression, that happens mainly within ExecComputeGenerated. if a virtual generation column type is domain_with_constraint, then ALTER DOMAIN ADD CONSTRAINTS need to revalidate these virtual generation expressions again. so in validateDomainCheckConstraint, validateDomainNotNullConstraint We need to fetch the generation expression (build_generation_expression), compile the generation expression (ExecPrepareExprSafe), and evaluate it (ExecEvalExprSwitchContext). I also posted patch summary earlier at [1] [1] https://postgr.es/m/cacjufxht4r1oabzequvjb6dhrig7k-qsr6lee53q_nli9pf...@mail.gmail.com From 346dbce94c008a1ec9447ca65c7b7fcd0f8428d6 Mon Sep 17 00:00:00 2001 From: jian he Date: Mon, 31 Mar 2025 15:02:46 +0800 Subject: [PATCH v7 3/3] domain over virtual generated column. domains that don't have constraints work just fine. domain constraints check happen within ExecComputeGenerated. we compute the virtual generated column in ExecComputeGenerated and discarded the value. if value cannot coerce to domain then we will error out. domain_with_constaint can be element type of range, multirange, array, composite. to be bulletproof, if this column type is not type of TYPTYPE_BASE or it's an array type, then we do actually compute the generated expression. This can be have negative performance for INSERTs The following two query shows in pg_catalog, most of the system type is TYPTYPE_BASE. So I guess this compromise is fine. select count(*),typtype from pg_type where typnamespace = 11 group by 2; select typname, typrelid, pc.reltype, pc.oid, pt.oid from pg_type pt join pg_class pc on pc.oid = pt.typrelid where pt.typnamespace = 11 and pt.typtype = 'c' and pc.reltype = 0; ALTER DOMAIN ADD CONSTRAINT variant is supported. DOMAIN with default values are supported. but virtual generated column already have generated expression, so domain default expression doesn't matter. ALTER TABLE ADD VIRTUAL GENERATED COLUMN. need table scan to verify new column generation expression value satisfied the domain constraints. but no need table rewrite! 
discussion: https://postgr.es/m/cacjufxharqysbdkwfmvk+d1tphqwwtxwn15cmuuatyx3xhq...@mail.gmail.com --- src/backend/catalog/heap.c| 13 +- src/backend/commands/copyfrom.c | 5 +- src/backend/commands/tablecmds.c | 48 +++- src/backend/commands/typecmds.c | 209 +++--- src/backend/executor/nodeModifyTable.c| 88 ++-- src/include/catalog/heap.h| 1 - src/test/regress/expected/fast_default.out| 4 + .../regress/expected/generated_virtual.out| 65 +- src/test/regress/sql/fast_default.sql | 4 + src/test/regress/sql/generated_virtual.sql| 47 +++- 10 files changed, 398 insertions(+), 86 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index b807ab8..5519690d85c 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -508,7 +508,7 @@ CheckAttributeNamesTypes(TupleDesc tupdesc, char relkind, TupleDescAttr(tupdesc, i)->atttypid, TupleDescAttr(tupdesc, i)->attcollation, NIL, /* assume we're creating a new rowtype */ - flags | (TupleDescAttr(tupdesc, i)->attgenerated == ATTRIBUTE_GENERATED_VIRTUAL ? CHKATYPE_IS_VIRTUAL : 0)); + flags); } } @@ -583,17 +583,6 @@ CheckAttributeType(const char *attname, } else if (att_typtype == TYPTYPE_DOMAIN) { - /* - * Prevent virtual generated columns from having a domain type. We - * would have to enforce domain constraints when columns underlying - * the generated column change. This could possibly be implemented, - * but it's not. - */ - if (flags & CHKATYPE_IS_VIRTUAL) - ereport(ERROR, - errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("virtual generated column \"%s\" cannot have a domain type", attname)); - /* * If it's a domain, recurse to check its base type. */ diff --git a/src
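To illustrate what v7-0003 is aiming at, a hypothetical usage example matching the behaviour described above (not taken from the patch's regression tests):

    CREATE DOMAIN posint AS int CHECK (VALUE > 0);

    CREATE TABLE gtest_dom (
        a int,
        b posint GENERATED ALWAYS AS (a * 2) VIRTUAL
    );

    INSERT INTO gtest_dom (a) VALUES (1);    -- ok: computed value 2 satisfies posint
    INSERT INTO gtest_dom (a) VALUES (-1);   -- should fail: computed value -2 violates the domain check

    -- adding a domain constraint must recheck the virtual generation expression
    -- against existing rows (a table scan, but no table rewrite):
    ALTER DOMAIN posint ADD CHECK (VALUE < 100);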
Re: Windows: openssl & gssapi dislike each other
On Wed, 26 Mar 2025 at 16:32, Daniel Gustafsson wrote: > > On 26 Mar 2025, at 17:15, Tom Lane wrote: > > > > Daniel Gustafsson writes: > >> Thanks for review! Pushed after making the above changes and taking it > for > >> another CI run. > > > > CF entry should be marked closed no? > > Yep, just wanted to allow some time to see how the buildfarm liked it > before > closing. Done now, thanks for the reminder. > Thank you for dealing with this! -- Dave Page pgAdmin: https://www.pgadmin.org PostgreSQL: https://www.postgresql.org pgEdge: https://www.pgedge.com
Re: Memoize ANTI and SEMI JOIN inner
On Mon, Mar 31, 2025 at 6:46 PM Andrei Lepikhov wrote: > On 3/31/25 11:03, Richard Guo wrote: > > I reviewed this patch and I have some concerns about the following > > code: > > > > if (extra->inner_unique && > > (inner_path->param_info == NULL || > > bms_num_members(inner_path->param_info->ppi_serials) < > > list_length(extra->restrictlist))) > > return NULL; > > > > I understand that this check is used to ensure that the entire join > > condition is parameterized in the case of unique joins, so that we can > > safely mark the cache entry as complete after reading the first tuple. > > However, ppi_clauses includes join clauses available from all outer > > rels, not just the current outer rel, while extra->restrictlist only > > includes the restriction clauses for the current join. This means the > > check could pass even if a restriction clause isn't parameterized, as > > long as another join clause, which doesn't belong to the current join, > > is included in ppi_clauses. > Initially, I had the same concern. But if ppi_clauses contains a qual, > it should refer to this join and, as a result, be in the > extra->restrictlist, isn't it? Hmm, I don't think so. As I mentioned upthread, ppi_clauses includes join clauses available from all outer rels, not just the current one. So a clause included in ppi_clauses is not necessarily included in extra->restrictlist. As an example, consider create table t (a int, b int); explain (costs off) select * from t t1 join t t2 join lateral (select *, t1.a+t2.a as x from t t3 offset 0) t3 on t2.a = t3.a on t1.b = t3.b; QUERY PLAN - Nested Loop -> Seq Scan on t t2 -> Nested Loop -> Seq Scan on t t1 -> Subquery Scan on t3 Filter: ((t2.a = t3.a) AND (t1.b = t3.b)) -> Seq Scan on t t3_1 (7 rows) t3's ppi_clauses includes "t2.a = t3.a" and "t1.b = t3.b", while t1/t3 join's restrictlist only includes "t1.b = t3.b". Thanks Richard
Re: Test to dump and restore objects left behind by regression
On Mon, Mar 31, 2025 at 5:07 PM Ashutosh Bapat wrote: > The bug related to materialized views has been fixed and now the test passes even if we compare statistics from dumped and restored databases. Hence removing 0003. In the attached patchset I have also addressed Vignesh's below comment On Thu, Mar 27, 2025 at 10:01 PM Ashutosh Bapat wrote: > > On Thu, Mar 27, 2025 at 6:01 PM vignesh C wrote: > > > > 2) Should "`" be ' or " here, we generally use "`" to enclose commands: > > +# It is expected that regression tests, which create `regression` > > database, are > > +# run on `src_node`, which in turn, is left in running state. A fresh node > > is > > +# created using given `node_params`, which are expected to be the > > same ones used > > +# to create `src_node`, so as to avoid any differences in the databases. > > Looking at prologues or some other functions, I see that we don't add > any decoration around the name of the argument. Hence dropped `` > altogether. Will post it with the next set of patches. -- Best Wishes, Ashutosh Bapat From 5ef4a15bf229d104028eac3a046636453e1e05fc Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Mon, 24 Mar 2025 11:21:12 +0530 Subject: [PATCH 2/2] Use only one format and make the test run default According to Alvaro (and I agree with him), the test should be run by default. Otherwise we get to know about a bug only after buildfarm animal where it's enabled reports a failure. Further testing only one format may suffice; since all the formats have shown the same bugs till now. If we use --directory format we can use -j which reduces the time taken by dump/restore test by about 12%. This patch removes PG_TEST_EXTRA option as well as runs the test only in directory format with parallelism enabled. Note for committer: If we decide to accept this change, it should be merged with the previous commit. --- doc/src/sgml/regress.sgml | 12 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 76 +- 2 files changed, 25 insertions(+), 63 deletions(-) diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml index 237b974b3ab..0e5e8e8f309 100644 --- a/doc/src/sgml/regress.sgml +++ b/doc/src/sgml/regress.sgml @@ -357,18 +357,6 @@ make check-world PG_TEST_EXTRA='kerberos ldap ssl load_balance libpq_encryption' - - - regress_dump_test - - - When enabled, src/bin/pg_upgrade/t/002_pg_upgrade.pl - tests dump and restore of regression database left behind by the - regression run. Not enabled by default because it is time and resource - consuming. - - - Tests for features that are not supported by the current build diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl index 8d22d538529..f7d5b96ecd2 100644 --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl @@ -268,16 +268,9 @@ else # should be done in a separate TAP test, but doing it here saves us one full # regression run. # - # This step takes several extra seconds and some extra disk space, so - # requires an opt-in with the PG_TEST_EXTRA environment variable. - # # Do this while the old cluster is running before it is shut down by the # upgrade test. - if ( $ENV{PG_TEST_EXTRA} - && $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/) - { - test_regression_dump_restore($oldnode, %node_params); - } + test_regression_dump_restore($oldnode, %node_params); } # Initialize a new node for the upgrade. 
@@ -590,53 +583,34 @@ sub test_regression_dump_restore $dst_node->append_conf('postgresql.conf', 'autovacuum = off'); $dst_node->start; - # Test all formats one by one. - for my $format ('plain', 'tar', 'directory', 'custom') - { - my $dump_file = "$tempdir/regression_dump.$format"; - my $restored_db = 'regression_' . $format; - - # Use --create in dump and restore commands so that the restored - # database has the same configurable variable settings as the original - # database and the plain dumps taken for comparsion do not differ - # because of locale changes. Additionally this provides test coverage - # for --create option. - $src_node->command_ok( - [ -'pg_dump', "-F$format", '--no-sync', -'-d', $src_node->connstr('regression'), -'--create', '-f', $dump_file - ], - "pg_dump on source instance in $format format"); + my $dump_file = "$tempdir/regression.dump"; - my @restore_command; - if ($format eq 'plain') - { - # Restore dump in "plain" format using `psql`. - @restore_command = [ 'psql', '-d', 'postgres', '-f', $dump_file ]; - } - else - { - @restore_command = [ -'pg_restore', '--create', -'-d', 'postgres', $dump_file - ]; - } - $dst_node->command_ok(@restore_command, - "restored dump taken in $format format on destination instance"); + # Use --create in dump and restore commands so that the restored database + # has the same configurable variable set
Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
> We’ve prepared two test patches on top of current master to address both issues: > ... > * 0002-Implement-batching-for-cascade-replication.patch – test patch to implement possible batching approach in xlogreceiver.c with timer. Currently it uses GUC variables to allow testing of different batch sizes and timeout values. I've played with the second patch a little more and made some adjustments to it: 1. Setup timer only if we actually have applied messages, which are (potentially) not yet signaled to walsenders. 2. Notify logical walsenders without delay if time line has changed. Modified patch is attached. Thanks, Alexey From c691724332dd5a78b98d606188383d6b37c98021 Mon Sep 17 00:00:00 2001 From: Rustam Khamidullin Date: Fri, 14 Mar 2025 18:18:34 +0700 Subject: [PATCH 2/2] Implement batching for WAL records notification during cascade replication Currently standby server notifies walsenders after applying of each WAL record during cascade replication. This creates a bottleneck in case of large number of sender processes during WalSndWakeup invocation. This change introduces batching for such notifications, which are now sent either after certain number of applied records or specified time interval (whichever comes first). Co-authored-by: Alexey Makhmutov --- src/backend/access/transam/xlogrecovery.c | 61 ++- src/backend/postmaster/startup.c | 1 + src/backend/utils/misc/guc_tables.c | 22 src/include/access/xlogrecovery.h | 4 ++ src/include/utils/timeout.h | 1 + 5 files changed, 87 insertions(+), 2 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 0aa3ab59085..3e3e43d5888 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -64,6 +64,7 @@ #include "utils/pg_lsn.h" #include "utils/ps_status.h" #include "utils/pg_rusage.h" +#include "utils/timeout.h" /* Unsupported old recovery command file names (relative to $PGDATA) */ #define RECOVERY_COMMAND_FILE "recovery.conf" @@ -147,6 +148,8 @@ bool InArchiveRecovery = false; static bool StandbyModeRequested = false; bool StandbyMode = false; +#define StandbyWithCascadeReplication() (AmStartupProcess() && StandbyMode && AllowCascadeReplication()) + /* was a signal file present at startup? */ static bool standby_signal_file_found = false; static bool recovery_signal_file_found = false; @@ -298,6 +301,14 @@ bool reachedConsistency = false; static char *replay_image_masked = NULL; static char *primary_image_masked = NULL; +/* Maximum number of applied records in batch before notifying walsender during cascade replication */ +int cascadeReplicationMaxBatchSize; + +/* Maximum batching delay before notifying walsender during cascade replication */ +int cascadeReplicationMaxBatchDelay; + +/* Counter for collected records during cascade replication */ +static volatile sig_atomic_t appliedRecords = 0; /* * Shared-memory state for WAL recovery. 
@@ -1839,6 +1850,17 @@ PerformWalRecovery(void) * end of main redo apply loop */ + /* Ensure that notification for batched messages is sent */ + if (StandbyWithCascadeReplication() && +cascadeReplicationMaxBatchSize > 1 && +appliedRecords > 0) + { + if (cascadeReplicationMaxBatchDelay > 0) +disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false); + appliedRecords = 0; + WalSndWakeup(false, true); + } + if (reachedRecoveryTarget) { if (!reachedConsistency) @@ -2037,8 +2059,30 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl *be created otherwise) * -- */ - if (AllowCascadeReplication()) - WalSndWakeup(switchedTLI, true); + + if (StandbyWithCascadeReplication()) + { + if (cascadeReplicationMaxBatchSize <= 1) + WalSndWakeup(switchedTLI, true); + else + { + bool batchFlushRequired = ++appliedRecords >= + cascadeReplicationMaxBatchSize; + if (batchFlushRequired) + { +appliedRecords = 0; +if (cascadeReplicationMaxBatchDelay > 0) + disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false); + } + + WalSndWakeup(switchedTLI, batchFlushRequired); + + /* Setup timeout to limit maximum delay for notifications */ + if (appliedRecords == 1 && cascadeReplicationMaxBatchDelay > 0) +enable_timeout_after(STANDBY_CASCADE_WAL_SEND_TIMEOUT, + cascadeReplicationMaxBatchDelay); + } + } /* * If rm_redo called XLogRequestWalReceiverReply, then we wake up the @@ -5064,3 +5108,16 @@ assign_recovery_target_xid(const char *newval, void *extra) else recoveryTarget = RECOVERY_TARGET_UNSET; } + +/* + * Timer handler for batch notifications in cascade replication + */ +void +StandbyWalSendTimeoutHandler(void) +{ + if (appliedRecords > 0) + { + appliedRecords = 0; + WalSndWakeup(false, true); + } +} diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c in
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits
Hi, On Mon, Mar 24, 2025 at 08:41:20AM +0900, Michael Paquier wrote: > On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote: > > Hi, > > Moving the other two provides a more complete view of the settings. For > > newcomers(like me) to the codebase, seeing all three related values in one > > place helps avoid a narrow view of the settings. > > > > But I am not sure that I understand the cons of this well. > > While I don't disagree with the use of a hardcoded interval of time to > control timing the flush of the WAL sender stats, do we really need to > rely on the timing defined by pgstat.c? No but I thought it could make sense. > Wouldn't it be simpler to > assign one in walsender.c and pick up a different, perhaps higher, > value? I don't have a strong opinion on it so done as suggested above in the attached. I think that the 1s value is fine because: 1. it is consistent with PGSTAT_MIN_INTERVAL and 2. it already needs that the sender is caught up or has pending data to send (means it could be higher than 1s already). That said, I don't think that would hurt if you think of a higher value. > At the end the timestamp calculations are free because we can rely on > the existing call of GetCurrentTimestamp() for the physical WAL > senders to get an idea of the current time. Yup > For the logical WAL > senders, perhaps we'd better do the reports in WalSndWaitForWal(), > actually. There is also a call to GetCurrentTimestamp() that we can > rely on in this path. I think it's better to flush the stats in a shared code path. I think it's easier to maintain and that there is no differences between logical and physical walsenders that would justify to flush the stats in specific code paths. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 1102b1bccaa5ff88b4045a4e8751e43094e946e5 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Tue, 25 Feb 2025 10:18:05 + Subject: [PATCH v5] Flush the IO statistics of active walsenders The walsender does not flush its IO statistics until it exits. The issue is there since pg_stat_io has been introduced in a9c70b46dbe. This commits: 1. ensures it does not wait to exit to flush its IO statistics 2. flush its IO statistics periodically to not overload the walsender 3. adds a test for a physical walsender (a logical walsender had the same issue but the fix is in the same code path) --- src/backend/replication/walsender.c | 63 +++ src/test/recovery/t/001_stream_rep.pl | 16 +++ 2 files changed, 61 insertions(+), 18 deletions(-) 76.6% src/backend/replication/ 23.3% src/test/recovery/t/ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 1028919aecb..9eed37b5de9 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -91,10 +91,14 @@ #include "utils/guc.h" #include "utils/memutils.h" #include "utils/pg_lsn.h" +#include "utils/pgstat_internal.h" #include "utils/ps_status.h" #include "utils/timeout.h" #include "utils/timestamp.h" +/* Minimum interval walsender IO stats flushes */ +#define MIN_IOSTATS_FLUSH_INTERVAL 1000 + /* * Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ. * @@ -2742,6 +2746,8 @@ WalSndCheckTimeOut(void) static void WalSndLoop(WalSndSendDataCallback send_data) { + TimestampTz last_flush = 0; + /* * Initialize the last reply timestamp. That enables timeout processing * from hereon. 
@@ -2836,30 +2842,51 @@ WalSndLoop(WalSndSendDataCallback send_data) * WalSndWaitForWal() handle any other blocking; idle receivers need * its additional actions. For physical replication, also block if * caught up; its send_data does not block. + * + * When the WAL sender is caught up or has pending data to send, we + * also periodically report I/O statistics. It's done periodically to + * not overload the WAL sender. */ - if ((WalSndCaughtUp && send_data != XLogSendLogical && - !streamingDoneSending) || - pq_is_send_pending()) + if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending()) { - long sleeptime; - int wakeEvents; + TimestampTz now; - if (!streamingDoneReceiving) -wakeEvents = WL_SOCKET_READABLE; - else -wakeEvents = 0; + now = GetCurrentTimestamp(); - /* - * Use fresh timestamp, not last_processing, to reduce the chance - * of reaching wal_sender_timeout before sending a keepalive. - */ - sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); + if (TimestampDifferenceExceeds(last_flush, now, MIN_IOSTATS_FLUSH_INTERVAL)) + { +/* + * Report IO statistics + */ +pgstat_flush_io(false); +(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO); +last_flush = now; + } - if (pq_is_send_pending()) -wakeEvents |= WL_SOCKET_WRITEABLE; + if (send_data != XLogSendLogical || pq_is_se
Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
On 2025-Mar-31, jian he wrote: > hi. > in notnull-notvalid.patch > > + if (coninfo->contype == 'c') > + keyword = "CHECK CONSTRAINT"; > + else > + keyword = "INVALID NOT NULL CONSTRAINT"; > we have a new TocEntry->desc kind. Yeah, I wasn't sure that this change made much actual sense. I think it may be better to stick with just CONSTRAINT. There probably isn't sufficient reason to have a different ->desc value. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Thread-safe nl_langinfo() and localeconv()
Peter Eisentraut writes: > I'm not sure what to do with this. If setlocale() and newlocale() > indeed behave differently in what set of locale names they accept, then > technically we ought to test both of them, since we do use both of them > later on. Or maybe we push on with the effort to get rid of setlocale() > calls and then just worry about testing newlocale() (as this patch > does). But right now, if newlocale() is more permissive, then we could > accept locale names that will later fail setlocale() calls, which might > be a problem. I think the clear answer is "let's stop using setlocale(), and then not have to worry about any behavioral differences". regards, tom lane
Re: [PoC] Federated Authn/z with OAUTHBEARER
Re: Andres Freund > > Yes. Also, none of this has addressed my complaint about the extent > > of the build and install dependencies. Yes, simply not selecting > > --with-libcurl removes the problem ... but most packagers are under > > very heavy pressure to enable all features of a package. And this feature is kind of only useful if it's available everywhere. If only half of your clients are able to use SSO, you'd probably stick with passwords anyway. So it needs to be enabled by default. > How about we provide the current libpq.so without linking to curl and also a > libpq-oauth.so that has curl support? If we do it right libpq-oauth.so would > itself link to libpq.so, making libpq-oauth.so a fairly small library. > > That way packagers can split libpq-oauth.so into a separate package, while > still just building once. That's definitely a good plan. The blast radius of the build dependencies isn't really a problem; the install/run-time one is. Perhaps we could do the same with libldap and libgssapi? (Though admittedly I have never seen any complaints or nagging questions from security people about these.) Christoph
Re: Add a function to get the version of installed extension
On Sun, 2 Feb 2025 00:15:25 -0500 Tom Lane wrote: > Yugo Nagata writes: > > This might be avoidable if the binary is made carefully to check the > > existing > > of objects, but I think it is useful if an extension module function can > > check > > the current extension version. So, I would like to propose a new function to > > return the current extension version, get_extension_version. I've attached a > > patch. > > While I don't say this is a bad idea, I do say you've not made a very > good case for it. How would an extension know its own OID in order > to call the function? If it did manage to call the function, what > exactly would it do with the result, and how would that be easier than > maintaining backwards compatibility with its old SQL definitions? > We've not found the backwards-compatibility requirement to be hugely > onerous in our contrib extensions. > > A concrete example use-case would make this a lot more convincing. The OID of an extension can be obtained from its name using get_extension_oid(), because the name is unique within a database. But, on second thought, this might not be so useful, because the version of an extension is just text and is not comparable as a number, so it is hard to say "if the version is 10.0 or later" and so on. I'll withdraw this proposal for now until I have a good example use-case. Regards, Yugo Nagata -- Yugo NAGATA
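For reference, the lookup discussed above can also be done in plain SQL against pg_extension; a minimal sketch (the extension name is just a placeholder):

SELECT oid, extversion
FROM pg_catalog.pg_extension
WHERE extname = 'pg_stat_statements';

-- extversion is plain text, so '10.0' sorts before '9.1' under text comparison,
-- which is why an "is the version 10.0 or later" check needs a version-aware comparison.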
Re: Allow ILIKE forward matching to use btree index
On Thu, 16 Jan 2025 09:41:35 -0800 Jeff Davis wrote: > On Thu, 2025-01-16 at 14:53 +0900, Yugo NAGATA wrote: > > Instead of generating complete patterns considering every case- > > varying characters, > > two clauses considering only the first case-varying character are > > generated. > > Did you consider other approaches that integrate more deeply into the > indexing infrastructure? This feels almost like a skip scan, which > Peter Geoghegan is working on. If you model the disjunctions as skips, > and provide the right API that the index AM can use, then there would > be no limit. > > For example: > > start scanning at '123FOO' > when you encounter '123FOP' skip to '123FOo' > continue scanning > when you encounter '123FOp' skip to '123FoO' > ... It seems true that there is a similarity between ILIKE search and skip scan, although, to my understanding, skip scan is for multi-column indexes rather than text. In any case, I have no concrete idea of how the skip scan infrastructure could be used to improve ILIKE. > Also, I'm working on better Unicode support to handle multiple case > variants in patterns. For instance, the Greek letter Sigma has three > case variants (one upper and two lower). What's the right API so that > the index AM knows which case variants will sort first, so that the > skips don't go backward? I don't have an idea for handling letters with multiple case variants in my patch, either; I overlooked such patterns entirely. So, I'll withdraw the proposal for now. Regards, Yugo Nagata -- Yugo NAGATA
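To illustrate the gap the withdrawn patch was aiming at, a minimal sketch of the current behavior (table and index names are made up; text_pattern_ops is what makes prefix derivation applicable for LIKE):

CREATE TABLE ilike_demo (s text);
CREATE INDEX ilike_demo_s_idx ON ilike_demo (s text_pattern_ops);
EXPLAIN (COSTS OFF) SELECT * FROM ilike_demo WHERE s LIKE '123foo%';
-- with enough rows, the planner can derive btree bounds from the constant prefix
EXPLAIN (COSTS OFF) SELECT * FROM ilike_demo WHERE s ILIKE '123foo%';
-- no such derivation exists for ILIKE today, so only a plain filter is expected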
Question about comments on simple_heap_update
Hi,

While working on [1], I found that the internal error "tuple concurrently updated" is raised by simple_heap_update and other similar functions, and that the comments on them say "Any failure is reported via ereport()". However, I could not understand the intention of these comments, because I suppose unexpected errors are usually reported via elog() rather than ereport(), and in fact elog() is what these functions use. I wonder whether this statement should be fixed as in the attached patch, or could be removed to reduce confusion. Maybe I am just missing something, though.

[1] https://www.postgresql.org/message-id/flat/20250331200057.00a62760966a821d484ea904%40sraoss.co.jp -- Yugo Nagata diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 6e433db039e..f47b878f559 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3207,7 +3207,7 @@ l1: * This routine may be used to delete a tuple when concurrent updates of * the target tuple are not expected (for example, because we have a lock * on the relation associated with the tuple). Any failure is reported - * via ereport(). + * via elog(). */ void simple_heap_delete(Relation relation, ItemPointer tid) @@ -4495,7 +4495,7 @@ HeapDetermineColumnsInfo(Relation relation, * This routine may be used to update a tuple when concurrent updates of * the target tuple are not expected (for example, because we have a lock * on the relation associated with the tuple). Any failure is reported - * via ereport(). + * via elog(). */ void simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup, diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c index a56c5eceb14..eb09272865e 100644 --- a/src/backend/access/table/tableam.c +++ b/src/backend/access/table/tableam.c @@ -285,7 +285,7 @@ simple_table_tuple_insert(Relation rel, TupleTableSlot *slot) * This routine may be used to delete a tuple when concurrent updates of * the target tuple are not expected (for example, because we have a lock * on the relation associated with the tuple). Any failure is reported - * via ereport(). + * via elog(). */ void simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot) @@ -330,7 +330,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot) * This routine may be used to update a tuple when concurrent updates of * the target tuple are not expected (for example, because we have a lock * on the relation associated with the tuple). Any failure is reported - * via ereport(). + * via elog(). */ void simple_table_tuple_update(Relation rel, ItemPointer otid,
Re: Improve monitoring of shared memory allocations
On 3/31/25 01:01, Rahila Syed wrote: > ... > > > > I will improve the comment in the next version. > > > > OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not > really used except for this bit: > > + if (init_size > nelem_alloc) > + element_alloc = false; > > Can't we determine before calling the function, to make it a bit less > confusing? > > > Yes, we could determine whether the pre-allocated elements are zero before > calling the function, I have fixed it accordingly in the attached 0001 > patch. > Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've > passed this information as a boolean variable-initial_elems. If it is > false, > no elements are pre-allocated. > > Please find attached the v7-series, which incorporates your review patches > and addresses a few remaining comments. > I think it's almost committable. Attached is v8 with some minor review adjustments, and updated commit messages. Please read through those and feel free to suggest changes. I still found the hash_get_init_size() comment unclear, and it also referenced init_size, which is no longer relevant. I improved the comment a bit (I find it useful to mimic comments of nearby functions, so I did that too here). The "initial_elems" name was a bit confusing, as it seemed to suggest "number of elements", but it's a simple flag. So I renamed it to "prealloc", which seems clearer to me. I also tweaked (reordered/reformatted) the conditions a bit. For the other patch, I realized we can simply MemSet() the whole chunk, instead of resetting the individual parts. regards -- Tomas Vondra From 1a82ac7407eb88bc085694519739115f718cc59a Mon Sep 17 00:00:00 2001 From: Rahila Syed Date: Mon, 31 Mar 2025 03:23:20 +0530 Subject: [PATCH v8 1/5] Improve acounting for memory used by shared hash tables pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(), but for shared hash tables that covered only the header and hash directory. The remaining parts (segments and buckets) were allocated later using ShmemAlloc(), which does not update the shmem accounting. Thus, these allocations were not shown in pg_shmem_allocations. This commit improves the situation by allocating all the hash table parts at once, using a single ShmemInitStruct() call. This way the ShmemIndex entries (and thus pg_shmem_allocations) better reflect the proper size of the hash table. This affects allocations for private (non-shared) hash tables too, as the hash_create() code is shared. For non-shared tables this however makes no practical difference. This changes the alignment a bit. ShmemAlloc() aligns the chunks using CACHELINEALIGN(), which means some parts (header, directory, segments) were aligned this way. Allocating all parts as a single chunks removes this (implicit) alignment. We've considered adding explicit alignment, but we've decided not to - it seems to be merely a coincidence due to using the ShmemAlloc() API, not due to necessity. 
Author: Rahila Syed Reviewed-by: Andres Freund Reviewed-by: Nazir Bilal Yavuz Reviewed-by: Tomas Vondra Discussion: https://postgr.es/m/cah2l28vhzrankszhqz7dexurxkncxfirnuw68zd7+hvaqas...@mail.gmail.com --- src/backend/storage/ipc/shmem.c | 4 +- src/backend/utils/hash/dynahash.c | 283 +++--- src/include/utils/hsearch.h | 3 +- 3 files changed, 222 insertions(+), 68 deletions(-) diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c index 895a43fb39e..3c030d5743d 100644 --- a/src/backend/storage/ipc/shmem.c +++ b/src/backend/storage/ipc/shmem.c @@ -73,6 +73,7 @@ #include "storage/shmem.h" #include "storage/spin.h" #include "utils/builtins.h" +#include "utils/dynahash.h" static void *ShmemAllocRaw(Size size, Size *allocated_size); @@ -346,7 +347,8 @@ ShmemInitHash(const char *name, /* table string name for shmem index */ /* look it up in the shmem index */ location = ShmemInitStruct(name, - hash_get_shared_size(infoP, hash_flags), + hash_get_init_size(infoP, hash_flags, + true, init_size), &found); /* diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 3f25929f2d8..3dede9caa5d 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -260,12 +260,36 @@ static long hash_accesses, hash_expansions; #endif +#define HASH_DIRECTORY_PTR(hashp) \ + (((char *) (hashp)->hctl) + sizeof(HASHHDR)) + +#define HASH_SEGMENT_OFFSET(hctl, idx) \ + (sizeof(HASHHDR) + \ + ((hctl)->dsize * sizeof(HASHSEGMENT)) + \ + ((hctl)->ssize * (idx) * sizeof(HASHBUCKET))) + +#define HASH_SEGMENT_PTR(hashp, idx) \ + ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx))) + +#define HASH_SEGMENT_SIZE(hashp) \ + ((hashp)->ssize * sizeof(HASHBUCKET)) + +#define HASH_ELEMENTS_PTR(hashp, nsegs) \ + ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)-
Re: encode/decode support for base64url
Hi Florents, > Here's a v3 with some (hopefully) better test cases. Thanks for the new version of the patch. ``` +encoded_len = pg_base64_encode(src, len, dst); + +/* Convert Base64 to Base64URL */ +for (uint64 i = 0; i < encoded_len; i++) { +if (dst[i] == '+') +dst[i] = '-'; +else if (dst[i] == '/') +dst[i] = '_'; +} ``` Although it is a possible implementation, wouldn't it be better to parametrize pg_base64_encode instead of traversing the string twice? Same for pg_base64_decode. You can refactor pg_base64_encode and make it a wrapper for pg_base64_encode_impl if needed. ``` +-- Flaghsip Test case against base64. +-- Notice the = padding removed at the end and special chars. +SELECT encode('\x69b73eff', 'base64'); -- Expected: abc+/w== + encode +-- + abc+/w== +(1 row) + +SELECT encode('\x69b73eff', 'base64url'); -- Expected: abc-_w + encode + + abc-_w +(1 row) ``` I get the idea, but calling base64 is redundant IMO. It only takes several CPU cycles during every test run without much value. I suggest removing it and testing corner cases for base64url instead, which is missing at the moment. Particularly there should be tests for encoding/decoding strings of 0/1/2/3/4 characters and making sure that decode(encode(x)) = x, always. On top of that you should cover with tests the cases of invalid output for decode(). -- Best regards, Aleksander Alekseev
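A sketch of the corner-case round-trip tests suggested above, assuming the patch's 'base64url' encoding name (inputs of length 0 through 4 exercise every padding case):

SELECT b, decode(encode(b, 'base64url'), 'base64url') = b AS roundtrip_ok
FROM (VALUES ('\x'::bytea), ('\x01'::bytea), ('\x0102'::bytea),
             ('\x010203'::bytea), ('\x01020304'::bytea)) AS v(b);

-- decode() on input outside the base64url alphabet is also worth pinning down, e.g.:
SELECT decode('ab+/', 'base64url');  -- should '+' and '/' be rejected here?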
Re: per backend WAL statistics
Hi, On Sat, Mar 29, 2025 at 07:14:16AM +0900, Michael Paquier wrote: > On Fri, Mar 28, 2025 at 09:00:00PM +0200, Alexander Lakhin wrote: > > Please try the following query: > > BEGIN; > > SET LOCAL stats_fetch_consistency = snapshot; > > SELECT * FROM pg_stat_get_backend_wal(pg_backend_pid()); Thanks for the report! I'm able to reproduce it on my side. The issue can also be triggered with pg_stat_get_backend_io(). The issue is that in pgstat_fetch_stat_backend_by_pid() (and with stats_fetch_consistency set to snapshot) a call to pgstat_clear_backend_activity_snapshot() is done: #0 __memset_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:250 #1 0x01833bf2 in wipe_mem (ptr=0x63218800, size=80800) at ../../../../src/include/utils/memdebug.h:42 #2 0x01834c51 in AllocSetReset (context=0x61903c80) at aset.c:586 #3 0x0184f32d in MemoryContextResetOnly (context=0x61903c80) at mcxt.c:419 #4 0x01834ede in AllocSetDelete (context=0x61903c80) at aset.c:636 #5 0x0184f79b in MemoryContextDeleteOnly (context=0x61903c80) at mcxt.c:528 #6 0x0184f5a9 in MemoryContextDelete (context=0x61903c80) at mcxt.c:482 #7 0x01361e84 in pgstat_clear_backend_activity_snapshot () at backend_status.c:541 #8 0x01367f08 in pgstat_clear_snapshot () at pgstat.c:943 #9 0x01368ac3 in pgstat_prep_snapshot () at pgstat.c:1121 #10 0x013680b9 in pgstat_fetch_entry (kind=6, dboid=0, objid=0) at pgstat.c:961 #11 0x0136dd05 in pgstat_fetch_stat_backend (procNumber=0) at pgstat_backend.c:94 #12 0x0136de7d in pgstat_fetch_stat_backend_by_pid (pid=3294022, bktype=0x0) at pgstat_backend.c:136 *before* we check for "beentry->st_procpid != pid". I think we can simply move the pgstat_fetch_stat_backend() call at the end of pgstat_fetch_stat_backend_by_pid(), like in the attached. With this in place the issue is fixed on my side. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 1605f513ad691b463baacc00e3c305655525ea07 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 31 Mar 2025 07:02:34 + Subject: [PATCH v1] Fix heap-use-after-free in pgstat_fetch_stat_backend_by_pid() With stats_fetch_consistency set to snapshot the beentry is reset during the pgstat_fetch_stat_backend() call. So moving this call at the end of pgstat_fetch_stat_backend_by_pid(). Reported-by: Alexander Lakhin --- src/backend/utils/activity/pgstat_backend.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) 100.0% src/backend/utils/activity/ diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c index 187c5c76e1e..ec95c302af8 100644 --- a/src/backend/utils/activity/pgstat_backend.c +++ b/src/backend/utils/activity/pgstat_backend.c @@ -133,10 +133,6 @@ pgstat_fetch_stat_backend_by_pid(int pid, BackendType *bktype) if (!pgstat_tracks_backend_bktype(beentry->st_backendType)) return NULL; - backend_stats = pgstat_fetch_stat_backend(procNumber); - if (!backend_stats) - return NULL; - /* if PID does not match, leave */ if (beentry->st_procpid != pid) return NULL; @@ -144,6 +140,10 @@ pgstat_fetch_stat_backend_by_pid(int pid, BackendType *bktype) if (bktype) *bktype = beentry->st_backendType; + backend_stats = pgstat_fetch_stat_backend(procNumber); + if (!backend_stats) + return NULL; + return backend_stats; } -- 2.34.1
Re: Memoize ANTI and SEMI JOIN inner
On 3/31/25 05:33, David Rowley wrote: I can't say definitively that we won't find a reason in the future that we should set inner_unique for SEMI/ANTI joins, so I don't follow the need for the Assert. Maybe you're seeing something that I'm not. What do you think will break if both flags are true?

I considered targeting PG 19 and the July Commitfest for this feature, but currently, I don't see any further development needed here. What are your thoughts, David? Does it make sense to commit this to PG 18? I have no reason to rush the process, but this feature seems beneficial in practice. When the internal structure is a bushy join tree, producing even a single tuple can be costly. SQL Server's Spool node addresses this issue, and people are sometimes confused when they see specific queries degrade during such a migration. -- regards, Andrei Lepikhov
Re: speedup COPY TO for partitioned table.
Hi! I reviewed v7. Maybe we should add a multi-level partitioning case into copy2.sql regression test? I also did quick benchmarking for this patch: DDL create table ppp(i int) partition by range (i); genddl.sh: for i in `seq 0 200`; do echo "create table p$i partition of ppp for values from ( $((10 * i)) ) to ( $((10 * (i + 1))) ); "; done === insert data data: insert into ppp select i / 1000 from generate_series(0, 200)i; === results: for 201 rows speedup is 1.40 times : 902.604 ms (patches) vs 1270.648 ms (unpatched) for 402 rows speedup is 1.20 times : 1921.724 ms (patches) vs 2343.393 ms (unpatched) for 804 rows speedup is 1.10 times : 3932.361 ms (patches) vs 4358.489ms (unpatched) So, this patch indeed speeds up some cases, but with larger tables speedup becomes negligible. -- Best regards, Kirill Reshke
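A sketch of the multi-level partitioning case that could go into copy2.sql (all names are placeholders, and COPY directly on the partitioned table assumes the patched behavior):

CREATE TABLE copy_mlp (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE copy_mlp_1 PARTITION OF copy_mlp FOR VALUES FROM (0) TO (10) PARTITION BY HASH (b);
CREATE TABLE copy_mlp_1_0 PARTITION OF copy_mlp_1 FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE copy_mlp_1_1 PARTITION OF copy_mlp_1 FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO copy_mlp SELECT g % 10, g FROM generate_series(1, 100) AS g;
COPY copy_mlp TO stdout;  -- should stream all 100 rows through both partitioning levels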
Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
On Sat, Mar 29, 2025 at 2:42 AM Alvaro Herrera wrote: > > On 2025-Mar-28, jian he wrote: > > > i think your patch messed up with pg_constraint.conislocal. > > for example: > > > > CREATE TABLE parted (id bigint default 1,id_abc bigint) PARTITION BY LIST > > (id); > > alter TABLE parted add CONSTRAINT dummy_constr not null id not valid; > > CREATE TABLE parted_1 (id bigint default 1,id_abc bigint); > > alter TABLE parted_1 add CONSTRAINT dummy_constr not null id; > > ALTER TABLE parted ATTACH PARTITION parted_1 FOR VALUES IN ('1'); > > > > select conrelid::regclass, conname, conislocal > > from pg_constraint where conname = 'dummy_constr'; > > > > conrelid | conname| conislocal > > --+--+ > > parted | dummy_constr | t > > parted_1 | dummy_constr | f > > (2 rows) > > > > > > if you do pg_dump, and execute the pg_dump output > > pg_dump --no-statistics --clean --table-and-children=*parted* > > --no-owner --verbose --column-inserts --file=dump.sql --no-acl > > > > select conrelid::regclass, conname, conislocal > > from pg_constraint where conname = 'dummy_constr'; > > output is > > > > conrelid | conname| conislocal > > --+--+ > > parted | dummy_constr | t > > parted_1 | dummy_constr | t > > (2 rows) > > Interesting. Yeah, I removed the code you had there because it was > super weird, had no comments, and removing it had zero effect (no tests > failed), so I thought it was useless. But apparently something is going > on here that's not what we want. > my change in AdjustNotNullInheritance is copied from MergeConstraintsIntoExisting `` /* * OK, bump the child constraint's inheritance count. (If we fail * later on, this change will just roll back.) */ child_copy = heap_copytuple(child_tuple); child_con = (Form_pg_constraint) GETSTRUCT(child_copy); if (pg_add_s16_overflow(child_con->coninhcount, 1, &child_con->coninhcount)) ereport(ERROR, errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("too many inheritance parents")); /* * In case of partitions, an inherited constraint must be * inherited only once since it cannot have multiple parents and * it is never considered local. */ if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { Assert(child_con->coninhcount == 1); child_con->conislocal = false; } `` if you look at MergeConstraintsIntoExisting, then it won't be weird. AdjustNotNullInheritance is kind of doing the same thing as MergeConstraintsIntoExisting, I think.
Re: Partition pruning on parameters grouped into an array does not prune properly
On 3/27/25 01:58, David Rowley wrote: I suspect the fix for this might be a bit invasive to backpatch. Maybe it's something we can give a bit more clear thought to after the freeze is over. One more issue I think may be addressed (or just considered) here is the following: CREATE TABLE parted (a int, b int) PARTITION BY RANGE (a, b); CREATE TABLE part1 PARTITION OF parted FOR VALUES FROM (1, 1) TO (1, 10); CREATE TABLE part2 PARTITION OF parted FOR VALUES FROM (2, 1) TO (2, 10); INSERT INTO parted (VALUES (1, 2)); INSERT INTO parted VALUES (2, 2); EXPLAIN (COSTS OFF) SELECT * FROM parted WHERE a > 1 AND b < 1; EXPLAIN (COSTS OFF) SELECT * FROM parted WHERE a > 1 AND b > 10; /* Seq Scan on part2 parted Filter: ((a > 1) AND (b < 1)) Seq Scan on part2 parted Filter: ((a > 1) AND (b > 10)) */ I think partition part2 could be pruned like in the following example: EXPLAIN (COSTS OFF) SELECT * FROM parted WHERE a > 2 AND b > 10; /* Result One-Time Filter: false */ -- regards, Andrei Lepikhov
Re: Draft for basic NUMA observability
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund wrote: > > Hi, Hi Andres, > On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > >setup_additional_packages_script: | > > -#apt-get update > > -#DEBIAN_FRONTEND=noninteractive apt-get -y install ... > > +apt-get update > > +DEBIAN_FRONTEND=noninteractive apt-get -y install \ > > + libnuma1 \ > > + libnuma-dev > > I think libnuma is installed on the relevant platforms, so you shouldn't need > to install it manually. Fixed. Right, you mentioned this earlier, I just didnt know when it went online. > > +# > > +# libnuma > > +# > > +AC_MSG_CHECKING([whether to build with libnuma support]) > > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], > > Most other dependencies say "build with libxyz ..." Done. > > + * pg_numa.h [..] > > +#include "c.h" > > +#include "postgres.h" > > Headers should never include either of those headers. Nor should .c files > include both. Fixed, huh, I've found explanation: https://www.postgresql.org/message-id/11634.1488932...@sss.pgh.pa.us > > @@ -200,6 +200,8 @@ pgxs_empty = [ > > > >'ICU_LIBS', > > > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > > + > >'LIBURING_CFLAGS', 'LIBURING_LIBS', > > ] > > Maybe I am missing something, but are you actually defining and using those > LIBNUMA_* vars anywhere? OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)` in configure.ac that would set those *FLAGS. I'm little bit loss dependent in how to gurantee that meson is synced with autoconf as per project requirements - trying to use past commits as reference, but I still could get something wrong here (especially in src/Makefile.global.in) > > +Size > > +pg_numa_get_pagesize(void) [..] > > Should this have a comment or an assertion that it can only be used after > shared memory startup? Because before that huge_pages_status won't be > meaningful? Added both. > > +#ifndef FRONTEND > > +/* > > + * XXX: not really tested as there is no way to trigger this in our > > + * current usage of libnuma. > > + * > > + * The libnuma built-in code can be seen here: > > + * https://github.com/numactl/numactl/blob/master/libnuma.c > > + * > > + */ > > +void > > +numa_warn(int num, char *fmt,...) [..] > > I think you would at least have to hold interrupts across this, as > ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump > out of libnuma in case an interrupt has arrived. On request by Alvaro I've removed it as that code is simply unreachable/untestable, but point taken - I'm planning to re-add this with holding interrupting in future when we start using proper numa_interleave() one day. Anyway, please let me know if you want still to keep it as deadcode. BTW for context , why this is deadcode is explained in the latter part of [1] message (TL;DR; unless we use pining/numa_interleave/local_alloc() we probably never reach that warnings/error handlers). "Another question without an easy answer as I never hit this error from numa_move_pages(), one gets invalid stuff in *os_pages_status instead. BUT!: most of our patch just uses things that cannot fail as per libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's hard to unmount it) and then still most of numactl stuff works as euid != 0, but numactl --hardware gets at least "libnuma: Warning: Cannot parse distance information in sysfs: Permission denied" or same story with numactl -C 678 date. So unless we start way more heavy use of libnuma (not just for observability) there's like no point in that right now (?) 
Contrary to that: we can do just do variadic elog() for that, I've put some code, but no idea if that works fine..." > > +Size > > +pg_numa_get_pagesize(void) [..] > > I would probably implement this once, outside of the big ifdef, with one more > ifdef inside, given that you're sharing the same implementation. Done. > > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > > From: Jakub Wartak > > Date: Wed, 19 Mar 2025 09:34:56 +0100 > > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > > primary function to separate functions. > > > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > > and get_buffercache_tuple() that help fill result tuplestores based on the > > buffercache contents. This will be used in a follow-up commit that > > implements > > NUMA observability in pg_buffercache. > > > > Author: Jakub Wartak > > Reviewed-by: Bertrand Drouvot > > Discussion: > > https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > > --- > > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++ > > 1 file changed, 178 insertions(+), 139 deletions(-) > > > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c > > b/contrib/pg_buffercache/pg_buffercache_pages.c > > index 62602af1775..86e0b8afe01 100644 > > --- a/contrib/pg_bufferca
Re: Draft for basic NUMA observability
On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera wrote: > Hello Good morning :) > I think you should remove numa_warn() and numa_error() from 0001. > AFAICS they are dead code (even with all your patches applied), and > furthermore would get you in trouble regarding memory allocation because > src/port is not allowed to use palloc et al. If you wanted to keep them > you'd have to have them in src/common, but looking at the rest of the > code in that patch, ISTM src/port is the right place for it. If in the > future you discover that you do need numa_warn(), you can create a > src/common/ file for it then. Understood, trimmed it out from the patch. I'm going to respond also within minutes to Andres' review and I'm going to post a new version (v17) there. > Is pg_buffercache really the best place for these NUMA introspection > routines? I'm not saying that it isn't, maybe we're okay with that > (particularly if we can avoid duplicated code), but it seems a bit weird > to me. I think it is, because as I understand, Andres wanted to have observability per single database *page* and to avoid code duplication we are just putting it there (it's natural fit). Imagine looking up an 8kB root btree memory page being hit hard from CPUs on other NUMA nodes (this just gives ability to see that, but you could of course also get aggregation to get e.g. NUMA node balance for single relation and so on). -J.
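As a rough illustration of the per-relation aggregation mentioned above; the column names (bufferid, node_id) are only assumptions about the proposed pg_buffercache_numa view, not its final shape:

SELECT c.relname, n.node_id, count(*) AS buffers
FROM pg_buffercache_numa AS n
JOIN pg_buffercache AS b USING (bufferid)
JOIN pg_class AS c ON c.relfilenode = b.relfilenode
WHERE c.relname = 'my_btree_index'   -- hypothetical index name
GROUP BY c.relname, n.node_id
ORDER BY c.relname, n.node_id;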
Re: Memoize ANTI and SEMI JOIN inner
On Mon, 31 Mar 2025 at 22:03, Richard Guo wrote: > I reviewed this patch and I have some concerns about the following > code: > > if (extra->inner_unique && > (inner_path->param_info == NULL || > bms_num_members(inner_path->param_info->ppi_serials) < > list_length(extra->restrictlist))) > return NULL; > > I understand that this check is used to ensure that the entire join > condition is parameterized in the case of unique joins, so that we can > safely mark the cache entry as complete after reading the first tuple. > However, ppi_clauses includes join clauses available from all outer > rels, not just the current outer rel, while extra->restrictlist only > includes the restriction clauses for the current join. This means the > check could pass even if a restriction clause isn't parameterized, as > long as another join clause, which doesn't belong to the current join, > is included in ppi_clauses. Shouldn't you be more concerned about master here than the patch given that the code you pasted is from master? David
Re: Memoize ANTI and SEMI JOIN inner
On Mon, Mar 31, 2025 at 6:39 PM David Rowley wrote: > On Mon, 31 Mar 2025 at 22:03, Richard Guo wrote: > > I reviewed this patch and I have some concerns about the following > > code: > > > > if (extra->inner_unique && > > (inner_path->param_info == NULL || > > bms_num_members(inner_path->param_info->ppi_serials) < > > list_length(extra->restrictlist))) > > return NULL; > > > > I understand that this check is used to ensure that the entire join > > condition is parameterized in the case of unique joins, so that we can > > safely mark the cache entry as complete after reading the first tuple. > > However, ppi_clauses includes join clauses available from all outer > > rels, not just the current outer rel, while extra->restrictlist only > > includes the restriction clauses for the current join. This means the > > check could pass even if a restriction clause isn't parameterized, as > > long as another join clause, which doesn't belong to the current join, > > is included in ppi_clauses. > > Shouldn't you be more concerned about master here than the patch given > that the code you pasted is from master? Right. This code is from the master branch, not the patch. It caught my attention while I was reviewing the patch and noticed it modifies this code. Thanks Richard
Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).
On 2025/03/30 5:01, Andrew Dunstan wrote: On 2025-03-29 Sa 2:58 PM, David G. Johnston wrote: On Sat, Mar 29, 2025 at 11:56 AM Andrew Dunstan wrote: I don't believe that the premise supports the conclusion. Regardless, I do support this patch and probably any similar ones proposed in the future. Do you have an opinion on that? In principle I think it would be good to have COPY materialized_view TO ... I haven't found any reasons to object to this patch for now, so I have no objections to this change. Regarding the patch, here are some review comments: + errmsg("cannot copy from materialized view when the materialized view is not populated"), How about including the object name for consistency with other error messages in BeginCopyTo(), like this? errmsg("cannot copy from unpopulated materialized view \"%s\"", RelationGetRelationName(rel)), + errhint("Use the REFRESH MATERIALIZED VIEW command populate the materialized view first.")); There seems to be a missing "to" just after "command". Should it be "Use the REFRESH MATERIALIZED VIEW command to populate the materialized view first."? Or we could simplify the hint to match what SELECT on an unpopulated materialized view logs: "Use the REFRESH MATERIALIZED VIEW command.". The copy.sgml documentation should clarify that COPY TO can be used with a materialized view only if it is populated. Wouldn't it be beneficial to add a regression test to check whether COPY matview TO works as expected? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
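A possible shape for such a regression test, assuming the patched COPY TO support for materialized views (names are placeholders):

CREATE MATERIALIZED VIEW copytest_mv AS SELECT 1 AS x WITH NO DATA;
COPY copytest_mv TO stdout;            -- should fail: materialized view is not populated
REFRESH MATERIALIZED VIEW copytest_mv;
COPY copytest_mv TO stdout;            -- should succeed and emit one row
DROP MATERIALIZED VIEW copytest_mv;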
Re: [PATCH] Split varlena.c into varlena.c and bytea.c
> The proposed patch extracts the code dealing with bytea from varlena.c > into a separate file, as proposed previously [1]. It merely changes > the location of the existing functions. There are no other changes. Rebased. -- Best regards, Aleksander Alekseev From 86776455a7b82f8cad6b984d3ddd67fc8f0e3258 Mon Sep 17 00:00:00 2001 From: Aleksander Alekseev Date: Tue, 25 Mar 2025 13:01:31 +0300 Subject: [PATCH v2] Split varlena.c into varlena.c and bytea.c Move most of the code dealing with bytea into a separate file. Aleksander Alekseev, reviewed by TODO FIXME Discussion: https://postgr.es/m/CAJ7c6TMPVPJ5DL447zDz5ydctB8OmuviURtSwd=PHCRFEPDEAQ@mail.gmail.com --- src/backend/utils/adt/Makefile|1 + src/backend/utils/adt/bytea.c | 1147 +++ src/backend/utils/adt/meson.build |1 + src/backend/utils/adt/varlena.c | 1214 ++--- 4 files changed, 1207 insertions(+), 1156 deletions(-) create mode 100644 src/backend/utils/adt/bytea.c diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 35e8c01aab9..d7b1902a070 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -23,6 +23,7 @@ OBJS = \ arrayutils.o \ ascii.o \ bool.o \ + bytea.o \ cash.o \ char.o \ cryptohashfuncs.o \ diff --git a/src/backend/utils/adt/bytea.c b/src/backend/utils/adt/bytea.c new file mode 100644 index 000..0b3acc7b300 --- /dev/null +++ b/src/backend/utils/adt/bytea.c @@ -0,0 +1,1147 @@ +/*- + * + * bytea.c + * Functions for the bytea type. + * + * Portions Copyright (c) 2025, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * src/backend/utils/adt/bytea.c + * + *- + */ + +#include "postgres.h" + +#include "access/detoast.h" +#include "catalog/pg_collation.h" +#include "common/int.h" +#include "funcapi.h" +#include "libpq/pqformat.h" +#include "utils/builtins.h" +#include "utils/bytea.h" +#include "utils/varlena.h" + +/* GUC variable */ +int bytea_output = BYTEA_OUTPUT_HEX; + +static StringInfo + makeStringAggState(FunctionCallInfo fcinfo); +static bytea *bytea_catenate(bytea *t1, bytea *t2); +static bytea *bytea_substring(Datum str, int S, int L, bool length_not_specified); +static bytea *bytea_overlay(bytea *t1, bytea *t2, int sp, int sl); + + +/* subroutine to initialize state */ +static StringInfo +makeStringAggState(FunctionCallInfo fcinfo) +{ + StringInfo state; + MemoryContext aggcontext; + MemoryContext oldcontext; + + if (!AggCheckCallContext(fcinfo, &aggcontext)) + { + /* cannot be called directly because of internal-type argument */ + elog(ERROR, "bytea_string_agg_transfn called in non-aggregate context"); + } + + /* + * Create state in aggregate context. It'll stay there across subsequent + * calls. + */ + oldcontext = MemoryContextSwitchTo(aggcontext); + state = makeStringInfo(); + MemoryContextSwitchTo(oldcontext); + + return state; +} + +/* + * bytea_catenate + * Guts of byteacat(), broken out so it can be used by other functions + * + * Arguments can be in short-header form, but not compressed or out-of-line + */ +static bytea * +bytea_catenate(bytea *t1, bytea *t2) +{ + bytea *result; + int len1, +len2, +len; + char *ptr; + + len1 = VARSIZE_ANY_EXHDR(t1); + len2 = VARSIZE_ANY_EXHDR(t2); + + /* paranoia ... probably should throw error instead? */ + if (len1 < 0) + len1 = 0; + if (len2 < 0) + len2 = 0; + + len = len1 + len2 + VARHDRSZ; + result = (bytea *) palloc(len); + + /* Set size of result string... */ + SET_VARSIZE(result, len); + + /* Fill data field of result string... 
*/ + ptr = VARDATA(result); + if (len1 > 0) + memcpy(ptr, VARDATA_ANY(t1), len1); + if (len2 > 0) + memcpy(ptr + len1, VARDATA_ANY(t2), len2); + + return result; +} + +#define PG_STR_GET_BYTEA(str_) \ + DatumGetByteaPP(DirectFunctionCall1(byteain, CStringGetDatum(str_))) + +static bytea * +bytea_substring(Datum str, +int S, +int L, +bool length_not_specified) +{ + int32 S1;/* adjusted start position */ + int32 L1;/* adjusted substring length */ + int32 E;/* end position */ + + /* + * The logic here should generally match text_substring(). + */ + S1 = Max(S, 1); + + if (length_not_specified) + { + /* + * Not passed a length - DatumGetByteaPSlice() grabs everything to the + * end of the string if we pass it a negative value for length. + */ + L1 = -1; + } + else if (L < 0) + { + /* SQL99 says to throw an error for E < S, i.e., negative length */ + ereport(ERROR, +(errcode(ERRCODE_SUBSTRING_ERROR), + errmsg("negative substring length not allowed"))); + L1 = -1;/* silence stupider compilers */ + } + else if (pg_add_s32_overflow(S, L, &E)) + { + /* + * L could be large enough for S + L to overflow, in which case the + * substring must run to end of string. + */ + L1 =
Re: Add partial :-variable expansion to psql \copy
> > Anyway, my feeling about it is that \copy parsing is a huge hack > right now, and I'd rather see it become less of a hack, that is > more like other psql commands, instead of getting even hackier. > I wasn't as horrified as Tom, but it did have the feeling of it solving half the problem. We can already do this COPY (SELECT :foo FROM :bar WHERE :condition) TO STDOUT \g :"myfilename" So it seems that what we need is a good way to pipe local data to a standard COPY command, which is then free to use the existing variable interpolations. If we could do this: COPY :"myschema".:"mytable" FROM STDIN \g < :"myfilename" that would fit our patterns most cleanly, but we would probably create a parsing hassle for ourselves if we ever wanted to mix pipe-to with pipe-from. It would also require checking on every command, when uploaded \copy commands make up a very small percentage of commands issued. So I don't think there's a good way around the asymmetry of COPY TO being a regular \g-able command, whereas COPY FROM will always require some other send-command. Perhaps we create a new command \copyfrom: COPY :"myschema".:"mytable" :options FROM STDIN \copyfrom :"myfilename" COPY :"myschema".:"mytable" :options FROM STDIN \copyfrom :"my_complex_command" | If we had something like that we might be able to replace all existing uses of \copy.
Re: Change log level for notifying hot standby is waiting non-overflowed snapshot
On 2025/03/31 22:45, Yugo Nagata wrote: I prefer this approach clarifying that consistency and subtransaction overflow are separate concepts in the documentation. Here are minor comments on the patch: Thanks for the review! - case CAC_NOTCONSISTENT: - if (EnableHotStandby) + case CAC_NOTHOTSTANDBY: + if (!EnableHotStandby) ereport(FATAL, (errcode(ERRCODE_CANNOT_CONNECT_NOW), errmsg("the database system is not yet accepting connections"), -errdetail("Consistent recovery state has not been yet reached."))); +errdetail("Hot standby mode is disabled."))); + else if (reachedConsistency) + ereport(FATAL, + (errcode(ERRCODE_CANNOT_CONNECT_NOW), +errmsg("the database system is not accepting connections"), +errdetail("Recovery snapshot is not yet ready for hot standby."), +errhint("To enable hot standby, close write transactions with more than %d subtransactions on the primary server.", +PGPROC_MAX_CACHED_SUBXIDS))); else ereport(FATAL, (errcode(ERRCODE_CANNOT_CONNECT_NOW), errmsg("the database system is not accepting connections"), -errdetail("Hot standby mode is disabled."))); +errdetail("Consistent recovery state has not been yet reached."))); The message says "the database system is not yet accepting connections" when "Hot standby mode is disabled". I think "yet" is not necessary in this case. Otherwise, when "Recovery snapshot is not yet ready for hot standby" or "Consistent recovery state has not been yet reached", it seems better to use "yet" I may have unintentionally modified the error message. I fixed the patch as suggested. Please check the latest patch I posted earlier in response to Torikoshi-san. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
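For reproducing the state the new message describes, a sketch of a suboverflowed write transaction on the primary (this assumes the PGPROC_MAX_CACHED_SUBXIDS limit of 64; each plpgsql exception block that performs a write becomes a subtransaction with its own XID):

CREATE TABLE subxact_demo (i int);
BEGIN;
DO $$
BEGIN
  FOR i IN 1..70 LOOP
    BEGIN
      INSERT INTO subxact_demo VALUES (i);
    EXCEPTION WHEN OTHERS THEN NULL;
    END;
  END LOOP;
END $$;
-- keep this transaction open; a standby starting up now has to wait for it to
-- finish before hot standby connections are allowed, which is when the new
-- message and hint apply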
Re: general purpose array_sort
Junwang Zhao writes: > On Mon, Mar 31, 2025 at 5:58 AM Tom Lane wrote: >> In v18, it's somewhat annoying that the typcache doesn't cache >> the typarray field; we would not need a separate get_array_type() >> lookup if it did. I doubt there is any real reason for that except >> that pg_type.typarray didn't exist when the typcache was invented. >> So I'm tempted to add it. But I looked at existing callers of >> get_array_type() and none of them are adjacent to typcache lookups, >> so only array_sort would be helped immediately. I left it alone >> for the moment; wonder if anyone else has an opinion? > The need for `elmtyp` and `array_type` here because a column can > have arrays with varying dimensions. Maybe other callers don't share > this behavior? Maybe. I think some of what's going on here is that because for a long time we only had pg_type.typelem and not pg_type.typarray, code was written to not need to look up the array type if at all possible. So there are simply not that many users. Anyway it seems really cheap to add this field to the typcache now. Attached 0001 is the same as v18, and then 0002 is the proposed addition to typcache. regards, tom lane From 21bfc6f86a767a0ef774dbaf9b3f3b6168c15a27 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Mon, 31 Mar 2025 12:52:00 -0400 Subject: [PATCH v19 1/2] Introduce a SQL-callable function array_sort(anyarray). Create a function that will sort the elements of an array according to the element type's sort order. If the array has more than one dimension, the sub-arrays of the first dimension are sorted per normal array-comparison rules, leaving their contents alone. Author: Junwang Zhao Co-authored-by: Jian He Reviewed-by: Aleksander Alekseev Discussion: https://postgr.es/m/caeg8a3j41a4dpw_-f94ff-jprxyxw-gfsgogotkcjs9lvfe...@mail.gmail.com --- doc/src/sgml/func.sgml| 36 src/backend/utils/adt/array_userfuncs.c | 183 ++ src/include/catalog/pg_proc.dat | 12 ++ src/test/regress/expected/arrays.out | 142 ++ .../regress/expected/collate.icu.utf8.out | 13 ++ src/test/regress/sql/arrays.sql | 36 src/test/regress/sql/collate.icu.utf8.sql | 4 + src/tools/pgindent/typedefs.list | 1 + 8 files changed, 427 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5bf6656deca..2129d027398 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20741,6 +20741,42 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sort + +array_sort ( + array anyarray + , descending boolean + , nulls_first boolean + ) +anyarray + + +Sorts the first dimension of the array. +The sort order is determined by the default sort ordering of the +array's element type; however, if the element type is collatable, +the collation to use can be forced by adding +a COLLATE clause to +the array argument. + + +If descending is true then sort in +descending order, otherwise ascending order. If omitted, the +default is ascending order. +If nulls_first is true then nulls appear +before non-null values, otherwise nulls appear after non-null +values. +If omitted, nulls_first is taken to have +the same value as descending. 
+ + +array_sort(ARRAY[[2,4],[2,1],[6,5]]) +{{2,1},{2,4},{6,5}} + + + diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index 2aae2f8ed93..2a8ea974029 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -12,16 +12,19 @@ */ #include "postgres.h" +#include "catalog/pg_operator_d.h" #include "catalog/pg_type.h" #include "common/int.h" #include "common/pg_prng.h" #include "libpq/pqformat.h" +#include "miscadmin.h" #include "nodes/supportnodes.h" #include "port/pg_bitutils.h" #include "utils/array.h" #include "utils/builtins.h" #include "utils/datum.h" #include "utils/lsyscache.h" +#include "utils/tuplesort.h" #include "utils/typcache.h" /* @@ -43,6 +46,18 @@ typedef struct DeserialIOData Oid typioparam; } DeserialIOData; +/* + * ArraySortCachedInfo + * Used for caching catalog data in array_sort + */ +typedef struct ArraySortCachedInfo +{ + ArrayMetaState array_meta; /* metadata for array_create_iterator */ + Oid elem_lt_opr; /* "<" operator for element type */ + Oid elem_gt_opr; /* ">" operator for element type */ + Oid array_type; /* pg_type OID of array type */ +} ArraySortCachedInfo; + static Datum array_position_common(FunctionCallInfo fcinfo); @@ -1858,3 +1873,171 @@ array_reverse(PG_FUNCTION_ARGS) PG_RETURN_ARRAYTYPE_P(res
Re: Non-text mode for pg_dumpall
On 2025-03-31 Mo 1:16 PM, Andrew Dunstan wrote: Thanks. Here are patches that contain (my version of) all the cleanups. With this I get a clean restore run in my test case with no error messages. This time with patches cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com From 6286701ff360ccb8c105fa5aa0a8f9bba3b1d1d7 Mon Sep 17 00:00:00 2001 From: Mahendra Singh Thalor Date: Wed, 19 Mar 2025 01:18:46 +0530 Subject: [PATCH v20250331 1/3] Move common pg_dump code related to connections to a new file ConnectDatabase is used by pg_dumpall, pg_restore and pg_dump so move common code to new file. new file name: connectdb.c Author:Mahendra Singh Thalor --- src/bin/pg_dump/Makefile | 5 +- src/bin/pg_dump/connectdb.c | 294 +++ src/bin/pg_dump/connectdb.h | 26 +++ src/bin/pg_dump/meson.build | 1 + src/bin/pg_dump/pg_backup.h | 6 +- src/bin/pg_dump/pg_backup_archiver.c | 6 +- src/bin/pg_dump/pg_backup_db.c | 79 +-- src/bin/pg_dump/pg_dump.c| 2 +- src/bin/pg_dump/pg_dumpall.c | 278 + 9 files changed, 352 insertions(+), 345 deletions(-) create mode 100644 src/bin/pg_dump/connectdb.c create mode 100644 src/bin/pg_dump/connectdb.h diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile index 233ad15ca75..fa795883e9f 100644 --- a/src/bin/pg_dump/Makefile +++ b/src/bin/pg_dump/Makefile @@ -31,6 +31,7 @@ OBJS = \ compress_lz4.o \ compress_none.o \ compress_zstd.o \ + connectdb.o \ dumputils.o \ filter.o \ parallel.o \ @@ -50,8 +51,8 @@ pg_dump: pg_dump.o common.o pg_dump_sort.o $(OBJS) | submake-libpq submake-libpg pg_restore: pg_restore.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils $(CC) $(CFLAGS) pg_restore.o $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) -pg_dumpall: pg_dumpall.o dumputils.o filter.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils - $(CC) $(CFLAGS) pg_dumpall.o dumputils.o filter.o $(WIN32RES) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) +pg_dumpall: pg_dumpall.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils + $(CC) $(CFLAGS) pg_dumpall.o $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) install: all installdirs $(INSTALL_PROGRAM) pg_dump$(X) '$(DESTDIR)$(bindir)'/pg_dump$(X) diff --git a/src/bin/pg_dump/connectdb.c b/src/bin/pg_dump/connectdb.c new file mode 100644 index 000..9e593b70e81 --- /dev/null +++ b/src/bin/pg_dump/connectdb.c @@ -0,0 +1,294 @@ +/*- + * + * connectdb.c + *This is a common file connection to the database. + * + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + *src/bin/pg_dump/connectdb.c + * + *- + */ + +#include "postgres_fe.h" + +#include "common/connect.h" +#include "common/logging.h" +#include "common/string.h" +#include "connectdb.h" +#include "dumputils.h" +#include "fe_utils/string_utils.h" + +static char *constructConnStr(const char **keywords, const char **values); + +/* + * ConnectDatabase + * + * Make a database connection with the given parameters. An + * interactive password prompt is automatically issued if required. + * + * If fail_on_error is false, we return NULL without printing any message + * on failure, but preserve any prompted password for the next try. + * + * On success, the 'connstr' is set to a connection string containing + * the options used and 'server_version' is set to version so that caller + * can use them. 
+ */ +PGconn * +ConnectDatabase(const char *dbname, const char *connection_string, +const char *pghost, const char *pgport, const char *pguser, +trivalue prompt_password, bool fail_on_error, const char *progname, +const char **connstr, int *server_version, char *password, +char *override_dbname) +{ + PGconn *conn; + bool new_pass; + const char *remoteversion_str; + int my_version; + const char **keywords = NULL; + const char **values = NULL; + PQconninfoOption *conn_opts = NULL; + int server_version_temp; + + if (prompt_password == TRI_YES && !password) + password = simple_prompt("Password: ", false); + + /* + * Start the connection. Loop until we have a password if requested by + * backend. + */ + do + { + int argcount = 8; + PQconninfoOption *conn_opt; + char *err_msg = NULL; + int i = 0; + + free(keywords); + free(values); + PQconninfoFree(conn_opts); + + /* + * Merge the connection info inputs given in form of connection string + * and other options. Explicitly discard any dbname value in the + * connection string; otherwise, PQconnectdbParams() would interpret + * that value as being itself a connection string. + */ + if (connecti
Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
On Fri, Mar 28, 2025 at 2:42 PM Alvaro Herrera wrote: > Maybe the real problem here is that making the (valid) child constraint > no longer local when the parent constraint is not valid is not sensible, > precisely because pg_dump won't be able to produce good output. That > sounds more workable to me ... except that we'd have to ensure that > validating the parent constraint would turn the child constraints as not > local anymore, which might be a bit weird. But maybe not weirder than > the other approach. It seems like a bad idea to make conislocal and coninhcount have anything to do with whether the constraint is valid. We need those to mean what they have traditionally meant just to make the correct things happen when the constraint is dropped, either directly from the child or at some ancestor of that child. If something is dropped at an ancestor table, we cascade down the tree and decrement coninhcount. If something is dropped at a child table, we can set conislocal=false. The constraint goes away, IIUC, when coninhcount=0 and conislocal=false, which directly corresponds to "this constraint no longer has a remaining definition either locally or by inheritance". I don't see how you can change much of anything here without breaking the existing structure. Validity could have separate tracking of some sort, possibly as elaborate as convalidlocal and convalidinhcount, but I don't think it can get away with redefining the tracking that we already have for existence. -- Robert Haas EDB: http://www.enterprisedb.com
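A small SQL sketch of the existing tracking described above, using a CHECK constraint for illustration (this is current behavior, not anything new):

CREATE TABLE constr_parent (a int, CONSTRAINT a_pos CHECK (a > 0));
CREATE TABLE constr_child () INHERITS (constr_parent);
-- redeclaring it locally on the child merges the definitions: conislocal = true, coninhcount = 1
ALTER TABLE constr_child ADD CONSTRAINT a_pos CHECK (a > 0);
-- dropping it on the parent decrements coninhcount to 0, but conislocal keeps it alive on the child
ALTER TABLE constr_parent DROP CONSTRAINT a_pos;
SELECT conname, conislocal, coninhcount
FROM pg_constraint WHERE conrelid = 'constr_child'::regclass;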
Re: Statistics Import and Export
> > The second is that the pg_upgrade test (when run with >> olddump/oldinstall) compares the before and after dumps, and if the >> "before" version is 17, then it will not have the relallfrozen argument >> to pg_restore_relation_stats. We might need a filtering step in >> adjust_new_dumpfile? >> > > That sounds trickier. > Narrator: It was not trickier. In light of v11-0001 being committed as 4694aedf63bf, I've rebased the remaining patches. From 607984bdcc91fa31fb7a12e9b24fb8704aa14975 Mon Sep 17 00:00:00 2001 From: Corey Huinker Date: Fri, 14 Mar 2025 01:06:19 -0400 Subject: [PATCH v12 1/3] Introduce CreateStmtPtr. CreateStmtPtr is a function pointer that can replace the createStmt/defn parameter. This is useful in situations where the amount of text generated for a definition is so large that it is undesirable to hold many such objects in memory at the same time. Using functions of this type, the text created is then immediately written out to the appropriate file for the given dump format. --- src/bin/pg_dump/pg_backup.h | 2 + src/bin/pg_dump/pg_backup_archiver.c | 22 ++- src/bin/pg_dump/pg_backup_archiver.h | 7 + src/bin/pg_dump/pg_dump.c| 229 +++ 4 files changed, 158 insertions(+), 102 deletions(-) diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 658986de6f8..fdcccd64a70 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -289,6 +289,8 @@ typedef int (*DataDumperPtr) (Archive *AH, const void *userArg); typedef void (*SetupWorkerPtrType) (Archive *AH); +typedef char *(*CreateStmtPtr) (Archive *AH, const void *userArg); + /* * Main archiver interface. */ diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 1d131e5a57d..1b4c62fd7d7 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -1265,6 +1265,9 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId, newToc->dataDumper = opts->dumpFn; newToc->dataDumperArg = opts->dumpArg; newToc->hadDumper = opts->dumpFn ? true : false; + newToc->createDumper = opts->createFn; + newToc->createDumperArg = opts->createArg; + newToc->hadCreateDumper = opts->createFn ? 
true : false; newToc->formatData = NULL; newToc->dataLength = 0; @@ -2621,7 +2624,17 @@ WriteToc(ArchiveHandle *AH) WriteStr(AH, te->tag); WriteStr(AH, te->desc); WriteInt(AH, te->section); - WriteStr(AH, te->defn); + + if (te->hadCreateDumper) + { + char *defn = te->createDumper((Archive *) AH, te->createDumperArg); + + WriteStr(AH, defn); + pg_free(defn); + } + else + WriteStr(AH, te->defn); + WriteStr(AH, te->dropStmt); WriteStr(AH, te->copyStmt); WriteStr(AH, te->namespace); @@ -3877,6 +3890,13 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, const char *pfx) { IssueACLPerBlob(AH, te); } + else if (te->hadCreateDumper) + { + char *ptr = te->createDumper((Archive *) AH, te->createDumperArg); + + ahwrite(ptr, 1, strlen(ptr), AH); + pg_free(ptr); + } else if (te->defn && strlen(te->defn) > 0) { ahprintf(AH, "%s\n\n", te->defn); diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index a2064f471ed..e68db633995 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -368,6 +368,11 @@ struct _tocEntry const void *dataDumperArg; /* Arg for above routine */ void *formatData; /* TOC Entry data specific to file format */ + CreateStmtPtr createDumper; /* Routine for create statement creation */ + const void *createDumperArg; /* arg for the above routine */ + bool hadCreateDumper; /* Archiver was passed a create statement + * routine */ + /* working state while dumping/restoring */ pgoff_t dataLength; /* item's data size; 0 if none or unknown */ int reqs; /* do we need schema and/or data of object @@ -407,6 +412,8 @@ typedef struct _archiveOpts int nDeps; DataDumperPtr dumpFn; const void *dumpArg; + CreateStmtPtr createFn; + const void *createArg; } ArchiveOpts; #define ARCHIVE_OPTS(...) &(ArchiveOpts){__VA_ARGS__} /* Called to add a TOC entry */ diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 4ca34be230c..cc195d6cd9e 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -10560,42 +10560,44 @@ appendNamedArgument(PQExpBuffer out, Archive *fout, const char *argname, } /* - * dumpRelationStats -- + * printDumpRelationStats -- * - * Dump command to import stats into the relation on the new database. + * Generate the SQL statements needed to restore a relation's statistics. */ -static void -dumpRelationStats(Archive *fout, const RelStatsInfo *rsinfo) +static char * +printRelationStats(Archive *fout, const void *userArg) { + const RelStatsInfo *rsinfo = (RelStatsInfo *) userArg; const DumpableObject *dobj = &rsinfo->dobj; + + PQExpBufferData query; + PQExpBuffe
Re: Draft for basic NUMA observability
Hi, On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote: > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund wrote: > > > > > +Size > > > +pg_numa_get_pagesize(void) > [..] > > > > Should this have a comment or an assertion that it can only be used after > > shared memory startup? Because before that huge_pages_status won't be > > meaningful? > > Added both. Thanks for the updated version! + Assert(IsUnderPostmaster); I wonder if that would make more sense to add an assertion on huge_pages_status and HUGE_PAGES_UNKNOWN instead (more or less as it is done in CreateSharedMemoryAndSemaphores()). === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch Once applied I can see mention to pg_buffercache_numa_pages() while it only comes in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch. I think pg_buffercache_numa_pages() should not be mentioned before it's actually implemented. === 1 + bufRecord->isvalid == false) { int i; - funcctx = SRF_FIRSTCALL_INIT(); - - /* Switch context when allocating stuff to be used in later calls */ - oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); - - /* Create a user function context for cross-call persistence */ - fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext)); + for (i = 1; i <= 9; i++) + nulls[i] = true; "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM is increased to 10). In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM). That could also make sense to remove the loop and use memset() that way: " memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool)); " instead. It's done that way in some other places (hbafuncs.c for example). === 2 - if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM) - TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", - INT4OID, -1, 0); + if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1) + TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", + INT4OID, -1, 0); I think we should not change the "expected_tupledesc->natts" check here until v17-0003 is applied. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
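A standalone sketch of the array-bound point in comment 1 above (plain C, not the patch itself): with NUM_BUFFERCACHE_PAGES_ELEM still 9 in v17-0002, the posted "i <= 9" loop writes one slot past the end of nulls[], while either the corrected bound or the suggested memset() form stays inside the array.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_BUFFERCACHE_PAGES_ELEM 9	/* value before v17-0003 bumps it to 10 */

int
main(void)
{
	bool		nulls[NUM_BUFFERCACHE_PAGES_ELEM] = {false};

	/* Corrected loop bound: touches nulls[1] .. nulls[8] only. */
	for (int i = 1; i < NUM_BUFFERCACHE_PAGES_ELEM; i++)
		nulls[i] = true;

	/* Equivalent memset() form suggested above (as done in hbafuncs.c). */
	memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));

	return 0;
}
```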
Re: Prevent internal error at concurrent CREATE OR REPLACE FUNCTION
On Mon, 31 Mar 2025 20:00:57 +0900 Yugo Nagata wrote: > Hi, > > I found that multiple sessions concurrently execute CREATE OR REPLACE FUNCTION > for a same function, the error "tuple concurrently updated" is raised. This is > an internal error output by elog, also the message is not user-friendly. > > I've attached a patch to prevent this internal error by locking an exclusive > lock before the command and get the read tuple after acquiring the lock. > Also, if the function has been removed during the lock waiting, the new entry > is created. I also found the same error is raised when concurrent ALTER FUNCTION commands are executed. I've added a patch to fix this in the similar way. Regards, Yugo Nagata -- Yugo Nagata >From e53f318dd42f909011227d3b11e8345ab7cc5641 Mon Sep 17 00:00:00 2001 From: Yugo Nagata Date: Mon, 31 Mar 2025 20:17:07 +0900 Subject: [PATCH 2/2] Prevent internal error at concurrent ALTER FUNCTION --- src/backend/commands/functioncmds.c | 17 + 1 file changed, 17 insertions(+) diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c index b9fd7683abb..3b9d9bb098c 100644 --- a/src/backend/commands/functioncmds.c +++ b/src/backend/commands/functioncmds.c @@ -61,6 +61,7 @@ #include "parser/parse_func.h" #include "parser/parse_type.h" #include "pgstat.h" +#include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" #include "utils/acl.h" @@ -1384,6 +1385,22 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt) if (!HeapTupleIsValid(tup)) /* should not happen */ elog(ERROR, "cache lookup failed for function %u", funcOid); + /* Lock the function so nobody else can do anything with it. */ + LockDatabaseObject(ProcedureRelationId, funcOid, 0, AccessExclusiveLock); + + /* + * It is possible that by the time we acquire the lock on function, + * concurrent DDL has removed it. We can test this by checking the + * existence of function. We get the tuple again to avoid the risk + * of function definition getting changed. + */ + tup = SearchSysCacheCopy1(PROCOID, ObjectIdGetDatum(funcOid)); + if (!HeapTupleIsValid(tup)) + ereport(ERROR, +errcode(ERRCODE_UNDEFINED_OBJECT), +errmsg("function \"%s\" does not exist", + NameListToString(stmt->func->objname))); + procForm = (Form_pg_proc) GETSTRUCT(tup); /* Permission check: must own function */ -- 2.34.1 >From 49c890eb8cb115586e0e92294fb2fec60d80b8be Mon Sep 17 00:00:00 2001 From: Yugo Nagata Date: Mon, 31 Mar 2025 18:46:58 +0900 Subject: [PATCH 1/2] Prevent internal error at concurrent CREATE OR REPLACE FUNCTION --- src/backend/catalog/pg_proc.c | 40 --- 1 file changed, 32 insertions(+), 8 deletions(-) diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c index fe0490259e9..cfd2e08ea23 100644 --- a/src/backend/catalog/pg_proc.c +++ b/src/backend/catalog/pg_proc.c @@ -35,6 +35,7 @@ #include "parser/parse_coerce.h" #include "pgstat.h" #include "rewrite/rewriteHandler.h" +#include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "utils/acl.h" @@ -355,24 +356,45 @@ ProcedureCreate(const char *procedureName, tupDesc = RelationGetDescr(rel); /* Check for pre-existing definition */ - oldtup = SearchSysCache3(PROCNAMEARGSNSP, - PointerGetDatum(procedureName), - PointerGetDatum(parameterTypes), - ObjectIdGetDatum(procNamespace)); + oldtup = SearchSysCacheCopy3(PROCNAMEARGSNSP, + PointerGetDatum(procedureName), + PointerGetDatum(parameterTypes), + ObjectIdGetDatum(procNamespace)); if (HeapTupleIsValid(oldtup)) { /* There is one; okay to replace it? 
*/ Form_pg_proc oldproc = (Form_pg_proc) GETSTRUCT(oldtup); - Datum proargnames; - bool isnull; - const char *dropcmd; if (!replace) ereport(ERROR, (errcode(ERRCODE_DUPLICATE_FUNCTION), errmsg("function \"%s\" already exists with same argument types", procedureName))); + + /* Lock the function so nobody else can do anything with it. */ + LockDatabaseObject(ProcedureRelationId, oldproc->oid, 0, AccessExclusiveLock); + + /* + * It is possible that by the time we acquire the lock on function, + * concurrent DDL has removed it. We can test this by checking the + * existence of function. We get the tuple again to avoid the risk + * of function definition getting changed. + */ + oldtup = SearchSysCacheCopy3(PROCNAMEARGSNSP, + PointerGetDatum(procedureName), + PointerGetDatum(parameterTypes), + ObjectIdGetDatum(procNamespace)); + } + + if (HeapTupleIsValid(oldtup)) + { + + Form_pg_proc oldproc = (Form_pg_proc) GETSTRUCT(oldtup); + Datum proargnames; + boolisnull; + const char *dropcmd; + if (!object_ownercheck(ProcedureRelationId, oldproc->oid, proowner)) aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_FUNCTION, procedureName); @@ -553,11 +5
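Distilled from the two patches above (a condensed sketch, not a literal hunk): both fixes apply the same lock-then-recheck pattern — look the function up, take a lock on it, then look it up again before using the tuple, since it may have been dropped or redefined while we waited for the lock.

```c
	/* Sketch: serialize concurrent CREATE OR REPLACE / ALTER / DROP FUNCTION. */
	oldtup = SearchSysCacheCopy3(PROCNAMEARGSNSP,
								 PointerGetDatum(procedureName),
								 PointerGetDatum(parameterTypes),
								 ObjectIdGetDatum(procNamespace));
	if (HeapTupleIsValid(oldtup))
	{
		Form_pg_proc oldproc = (Form_pg_proc) GETSTRUCT(oldtup);

		/* Block other sessions from touching this function until commit. */
		LockDatabaseObject(ProcedureRelationId, oldproc->oid, 0,
						   AccessExclusiveLock);

		/*
		 * Re-fetch after the lock wait: concurrent DDL may have removed or
		 * replaced the function, so the first tuple may be stale.
		 */
		oldtup = SearchSysCacheCopy3(PROCNAMEARGSNSP,
									 PointerGetDatum(procedureName),
									 PointerGetDatum(parameterTypes),
									 ObjectIdGetDatum(procNamespace));
	}
```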
Re: Orphaned users in PG16 and above can only be managed by Superusers
On Mon, Mar 24, 2025 at 2:18 AM Ashutosh Sharma wrote: > Thank you, Robert and Tom, for sharing your valuable insights, and > apologies for the slight delay in my response. From the discussion, > what I understand is that we aim to extend the current DROP ROLE > syntax to include the CASCADE/RESTRICT option, which has been > introduced in the latest SQL standard, as mentioned by Tom. However, > as Robert pointed out, implementing the CASCADE option for handling > dependent objects that span multiple databases is not feasible at the > moment. The RESTRICT option, as I understand it, is already the > default behavior. Therefore, we will proceed with implementing the > CASCADE/RESTRICT syntax specifically for handling dependent roles, > rather than the dependent database objects like tables, views, etc., > which can span multiple databases. I would be fairly strongly opposed to implementing RESTRICT and CASCADE but having them only apply to role grants and not other database objects. I'm not entirely certain what the right thing to do is here, but I don't think it's that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add pg_buffercache_evict_all() and pg_buffercache_mark_dirty[_all]() functions
Hi, On Mon, 31 Mar 2025 at 16:37, Aidar Imamov wrote: > > Hi! > > I've reviewed the latest version of the patches and found a few things > worth > discussing. This is probably my final feedback on the patches at this > point. > Maybe Joseph has something to add. Thank you so much for the reviews, these were very helpful! > After this discussion, I think it would be helpful if one of the more > experienced > hackers could take a look at the overall picture (perhaps we should set > the > status "Ready for Committer"? To be honest, I'm not sure what step that > status > should be set at). I agree. I will mark this as 'ready for committer' after writing the tests. > pg_buffercache--1.5--1.6.sql: > It seems that there is no need for the OR REPLACE clause after the > pg_buffercache_evict() function has been dropped. > > Maybe we should remove the RETURNS clause from function declarations > that have > OUT parameters? > Doc: "When there are OUT or INOUT parameters, the RETURNS clause can be > omitted" You are right, both are done. > pg_buffercache_evict_relation(): > The function is marked as STRICT so I think we do not need for redundant > check: > > if (PG_ARGISNULL(0)) > > ereport(ERROR, > >(errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("relation cannot be null"))); > Doc: "returns null whenever any of its arguments are null", "the > function is not > executed when there are null arguments;". Done. > EvictUnpinnedBuffer(): > Maybe edit this comment a little (just not to repeat the sentences): > > buf_state is used to check if the buffer header lock has been acquired > > before > > calling this function. If buf_state has a non-zero value, it means that > > the buffer > > header has already been locked. This information is useful for evicting > > specific > > relation's buffers, as the buffers headers need to be locked before > > this function > > can be called with such a intention. This is better, done. > EvictUnpinnedBuffer() & EvictRelUnpinnedBuffers(): > Why did you decide to use the assert to check for NULLs? > I understand that currently, the use of these functions is limited to a > specific set > of calls through the pg_buffercache interface. However, this may not > always be the > case. Wouldn't it be better to allow users to choose whether or not they > want to > receive additional information? For example, you could simply ignore any > arguments > that are passed as NULL. I do not have a strong opinion on this. I agree that these functions can be used somewhere else later but it is easier to ignore these on the return instead of handling NULLs in the functions. If we want to consider the NULL possibility in the EvictUnpinnedBuffer() & EvictRelUnpinnedBuffers(), then we need to write additional if clauses. I think this increases the complexity unnecessarily. > Additionally, I believe it would be beneficial to include some > regression tests to > check at least the following cases: return type, being a superuser, bad > buffer id, > local buffers case. You are right, I will try to write the tests but I am sharing the new version without tests to speed things up. > Also, there's a little thing about declaring functions as PARALLEL SAFE. > To be honest, > I don't really have any solid arguments against it. I just have some > doubts. For > example, how will it work if the plan is split up and we try to work on > an object in > one part, while the other part of the plan evicts the pages of that > object or marks > them as dirty... I can't really say for sure about that. 
> And in that context, I'd suggest referring to that awesome statement in the documentation: "If in doubt, functions should be labeled as UNSAFE, which is the default." You may be right. I thought they were parallel safe since we acquire the buffer header lock for each buffer, but now I have the same doubts as you. I want to hear other people's opinions on that before labeling them as UNSAFE. v5 is attached. -- Regards, Nazir Bilal Yavuz Microsoft v5-0001-Add-pg_buffercache_evict_-relation-all-functions-.patch Description: Binary data v5-0002-Return-buffer-flushed-information-in-pg_buffercac.patch Description: Binary data v5-0003-Add-pg_buffercache_mark_dirty-_all-functions-for-.patch Description: Binary data
Re: Make COPY format extendable: Extract COPY TO format implementations
On Wed, Mar 26, 2025 at 8:28 PM Sutou Kouhei wrote: > > > --- > > +static const CopyFromRoutine CopyFromRoutineTestCopyFormat = { > > +.type = T_CopyFromRoutine, > > +.CopyFromInFunc = CopyFromInFunc, > > +.CopyFromStart = CopyFromStart, > > +.CopyFromOneRow = CopyFromOneRow, > > +.CopyFromEnd = CopyFromEnd, > > +}; > > > > In trying to document the current API I'm strongly disliking it. Namely, the amount of internal code an extension needs to care about/copy-paste to create a working handler. Presently, pg_type defines text and binary I/O routines and the text/csv formats use the text I/O while binary uses binary I/O - for all attributes. The CopyFromInFunc API allows for each attribute to somehow have its I/O format individualized. But I don't see how that is practical or useful, and it adds burden on API users. I suggest we remove both .CopyFromInFunc and .CopyFromStart/End and add a property to CopyFromRoutine (.ioMode?) with values of either Copy_IO_Text or Copy_IO_Binary and then just branch to either: CopyFromTextLikeInFunc & CopyFromTextLikeStart/End or CopyFromBinaryInFunc & CopyFromStart/End So, in effect, the only method an extension needs to write is converting to/from the 'serialized' form to the text/binary form (text being near unanimous). In a similar manner, the amount of boilerplate within CopyFromOneRow seems undesirable from an API perspective. cstate->cur_attname = NameStr(att->attname); cstate->cur_attval = string; if (string != NULL) nulls[m] = false; if (cstate->defaults[m]) { /* We must have switched into the per-tuple memory context */ Assert(econtext != NULL); Assert(CurrentMemoryContext == econtext->ecxt_per_tuple_memory); values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]); } /* * If ON_ERROR is specified with IGNORE, skip rows with soft errors */ else if (!InputFunctionCallSafe(&in_functions[m], string, typioparams[m], att->atttypmod, (Node *) cstate->escontext, &values[m])) { CopyFromSkipErrorRow(cstate); return true; } cstate->cur_attname = NULL; cstate->cur_attval = NULL; It seems to me that CopyFromOneRow could simply produce a *string collection, one cell per attribute, and NextCopyFrom could do all of the above on a for-loop over *string The pg_type and pg_proc catalogs are not extensible so the API can and should be limited to producing the, usually text, values that are ready to be passed into the text I/O routines and all the work to find and use types and functions left in the template code. I haven't looked at COPY TO but I am expecting much the same. The API should simply receive the content of the type I/O output routine (binary or text as it dictates) for each output attribute, by row, and be expected to take those values and produce a final output from them. David J.
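To make the proposal above easier to picture, here is a rough sketch of the slimmed-down handler struct being argued for. This is purely hypothetical — none of these fields or the enum exist in the posted patches; Copy_IO_Text and Copy_IO_Binary are just the names floated in the message.

```c
/* Hypothetical shape of the reduced COPY FROM handler API. */
typedef enum CopyIOMode
{
	Copy_IO_Text,				/* attribute values go through text input functions */
	Copy_IO_Binary				/* attribute values go through binary input functions */
} CopyIOMode;

typedef struct CopyFromRoutine
{
	NodeTag		type;

	/*
	 * Replaces CopyFromInFunc and CopyFromStart/End: core code picks the
	 * existing text-like or binary setup based on this flag.
	 */
	CopyIOMode	ioMode;

	/*
	 * Parse one row of the external format into one string (or NULL) per
	 * attribute; NextCopyFrom keeps the defaults/soft-error/input-function
	 * loop to itself.
	 */
	bool		(*CopyFromOneRow) (CopyFromState cstate, char **field_strings);
} CopyFromRoutine;
```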
Re: Non-text mode for pg_dumpall
Hi, FWIW I don't think the on_exit_nicely business is in final shape just yet. We're doing something super strange and novel about keeping track of an array index, so that we can modify it later. Or something like that, I think? That doesn't sound all that nice to me. Elsewhere it was suggested that we need some way to keep track of the list of things that need cleanup (a list of connections IIRC?) -- perhaps in a thread-local variable or a global or something -- and we install the cleanup function once, and that reads from the variable. The program can add things to the list, or remove them, at will; and we don't need to modify the cleanup function in any way. -- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
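A minimal sketch of the shape being suggested above (names invented here, not taken from the patch, and assuming pg_dump's on_exit_nicely_callback signature of (int code, void *arg)): the cleanup callback is installed once and simply walks a global list that the rest of the program is free to grow or shrink.

```c
#include "postgres_fe.h"
#include "libpq-fe.h"

typedef struct CleanupConn
{
	PGconn	   *conn;
	struct CleanupConn *next;
} CleanupConn;

/* Global (or thread-local) list; add/remove entries anywhere, any time. */
static CleanupConn *cleanup_conns = NULL;

/* Assumes the on_exit_nicely_callback signature (int code, void *arg). */
static void
shutdown_connections(int code, void *arg)
{
	for (CleanupConn *c = cleanup_conns; c != NULL; c = c->next)
	{
		if (c->conn != NULL)
			PQfinish(c->conn);
	}
}

/*
 * Installed exactly once, early in main():
 *		on_exit_nicely(shutdown_connections, NULL);
 * After that, the callback never needs to be modified.
 */
```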
Re: Memoize ANTI and SEMI JOIN inner
On 3/31/25 12:18, Richard Guo wrote: On Mon, Mar 31, 2025 at 6:46 PM Andrei Lepikhov wrote: Nested Loop -> Seq Scan on t t2 -> Nested Loop -> Seq Scan on t t1 -> Subquery Scan on t3 Filter: ((t2.a = t3.a) AND (t1.b = t3.b)) -> Seq Scan on t t3_1 (7 rows) t3's ppi_clauses includes "t2.a = t3.a" and "t1.b = t3.b", while t1/t3 join's restrictlist only includes "t1.b = t3.b". I attempted to make your query a bit closer to our case: SET enable_mergejoin = f; SET enable_hashjoin = f; SET enable_material = f; CREATE INDEX ON t(a); explain (costs off) select * from t t1 join t t2 on EXISTS (select *, t1.a+t2.a as x from t t3 WHERE t2.a = t3.a AND t1.b = t3.b); and I still don't get that case. As far as I can see, the ANTI/SEMI join is simply transformed into a regular join, and it is still not the case in question. Could you be more specific? -- regards, Andrei Lepikhov
Re: per backend WAL statistics
> On Mar 31, 2025, at 16:42, Bertrand Drouvot > wrote: > I think we can simply move the pgstat_fetch_stat_backend() call at the end > of pgstat_fetch_stat_backend_by_pid(), like in the attached. With this in > place > the issue is fixed on my side. Thanks for the patch, I’ll check all that next week. -- Michael
Re: SQLFunctionCache and generic plans
Hi. Tom Lane wrote on 2025-03-30 19:10: I spent some time reading and reworking this code, and have arrived at a patch set that I'm pretty happy with. I'm not sure it's quite committable but it's close: 0005: This extracts the RLS test case you had and commits it with the old non-failing behavior, just so that we can see that the new code does it differently. (I didn't adopt your test from rules.sql, because AFAICS it works the same with or without this patch set. What was the point of that one again?) The test was introduced after I failed to handle the case where execution_state is NULL in the fcache->func_state list because the statement was completely removed by an INSTEAD rule. After finding this issue, I added a test to cover it. Not sure if it should be preserved. 0006: The guts of the patch. I couldn't break this down any further. One big difference from what you had is that there is only one path of control: we always use the plan cache. The hack you had to not use it for triggers was only needed because you didn't include the right cache key items to distinguish different trigger usages, but the code coming from plpgsql has that right. Yes, now it looks much more consistent. Still going through it. Will do some additional testing here. So far I have checked all known corner cases and found no issues. Also, the memory management is done a bit differently. The "fcontext" memory context holding the SQLFunctionCache struct is now discarded at the end of each execution of the SQL function, which considerably alleviates worries about leaking memory there. I invented a small "SQLFunctionLink" struct that is what fn_extra points at, and it survives as long as the FmgrInfo does, so that's what saves us from redoing hash key computations in most cases. I've looked through it and run some tests, including the ones that originally caused me to create a separate context for planning. I was a bit worried that this context is gone, but now that fcache->fcontext is deleted at the end of function execution, I don't see the leaks that were the initial reason for introducing it. -- Best regards, Alexander Pyhalov, Postgres Professional
Re: Get rid of WALBufMappingLock
Good day, 14.03.2025 17:30, Tomas Vondra wrote: > Hi, > > I've briefly looked at this patch this week, and done a bit of testing. > I don't have any comments about the correctness - it does seem correct > to me and I haven't noticed any crashes/issues, but I'm not familiar > with the WALBufMappingLock enough to have insightful opinions. > > I have however decided to do a bit of benchmarking, to better understand > the possible benefits of the change. I happen to have access to an Azure > machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that > can do ~1.5GB/s. > > The benchmark script (attached) uses the workload mentioned by Andres > some time ago [1] > >SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE)); > > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated > results look like this (this is throughput): > >| 8 | 64| 1024 > clients | master patched | master patched | master patched > - > 1 | 11864 12035 |7419 7345 | 968 940 > 4 | 26311 26919 | 12414 12308 |1304 1293 > 8 | 38742 39651 | 14316 14539 |1348 1348 >16 | 57299 59917 | 15405 15871 |1304 1279 >32 | 74857 82598 | 17589 17126 |1233 1233 >48 | 87596 95495 | 18616 18160 |1199 1227 >64 | 89982 97715 | 19033 18910 |1196 1221 >96 | 92853103448 | 19694 19706 |1190 1210 > 128 | 95392103324 | 20085 19873 |1188 1213 > 160 | 94933102236 | 20227 20323 |1180 1214 > 196 | 95933103341 | 20448 20513 |1188 1199 > > To put this into a perspective, this throughput relative to master: > > clients | 8 64 1024 > -- > 1 | 101% 99% 97% > 4 | 102% 99% 99% > 8 | 102%102% 100% >16 | 105%103% 98% >32 | 110% 97% 100% >48 | 109% 98% 102% >64 | 109% 99% 102% >96 | 111%100% 102% > 128 | 108% 99% 102% > 160 | 108%100% 103% > 196 | 108%100% 101% > > That does not seem like a huge improvement :-( Yes, there's 1-10% > speedup for the small (8K) size, but for larger chunks it's a wash. > > Looking at the pgbench progress, I noticed stuff like this: > > ... > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed > ... > > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny > things (it's a virtual SSD, I'm not sure what model is that behind the > curtains). So I decided to try running the benchmark on tmpfs, to get > the storage out of the way and get the "best case" results. 
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the > output above), and the comparison looks like this: > >| 8 | 64| 1024 > clients | masterpatched | master patched | master patched > -|-|| > 1 | 32449 32032 | 19289 20344 | 3108 3081 > 4 | 68779 69256 | 24585 29912 | 2915 3449 > 8 | 79787 100655 | 28217 39217 | 3182 4086 >16 | 113024 148968 | 42969 62083 | 5134 5712 >32 | 125884 170678 | 44256 71183 | 4910 5447 >48 | 125571 166695 | 44693 76411 | 4717 5215 >64 | 122096 160470 | 42749 83754 | 4631 5103 >96 | 120170 154145 | 42696 86529 | 4556 5020 > 128 | 119204 152977 | 4
Re: Fix 035_standby_logical_decoding.pl race conditions
Hi Kuroda-san and Amit, On Fri, Mar 28, 2025 at 09:02:29AM +, Hayato Kuroda (Fujitsu) wrote: > Dear Amit, > > > > > Right, I think this is a better idea. I like it too and the bonus point is that this injection point can be used in more tests (more use cases). A few comments: About v2-0001-Stabilize === 1 s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/? === 2 (Nit) /* For testing slot invalidation due to the conflict */ Not sure "due to the conflict" is needed. About PG17-v2-0001 === 3 The commit message still mentions injection point. === 4 -# Note that pg_current_snapshot() is used to get the horizon. It does -# not generate a Transaction/COMMIT WAL record, decreasing the risk of -# seeing a xl_running_xacts that would advance an active replication slot's -# catalog_xmin. Advancing the active replication slot's catalog_xmin -# would break some tests that expect the active slot to conflict with -# the catalog xmin horizon. I'd be tempted to not remove this comment but reword it a bit instead. Something like? # Note that pg_current_snapshot() is used to get the horizon. It does # not generate a Transaction/COMMIT WAL record, decreasing the risk of # seeing a xl_running_xacts that would advance an active replication slot's # catalog_xmin. Advancing the active replication slot's catalog_xmin # would break some tests that expect the active slot to conflict with # the catalog xmin horizon. We ensure that active replication slots are not # created for tests that might produce this race condition though. === 5 The invalidation checks for active slots are kept for the wal_level case. Also the active slots are still created to test that logical decoding on the standby behaves correctly, when no conflict is expected and for the promotion. The above makes sense to me. === 6 (Nit) In drop_logical_slots(), s/needs_active_slot/drop_active_slot/? === 7 (Nit) In check_slots_conflict_reason(), s/needs_active_slot/checks_active_slot/? About PG16-v2-0001 Same as for PG17-v2-0001. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Add partial :-variable expansion to psql \copy
Hello, I've been bitten by psql's \copy lacking variable expansion, in a limited-access docker-inside-VM context where COPY is not a viable option and hardwired names are not desirable. The attached patch allows \copy to use variables' values in place of table and file names:
```psql
\set table 'some table'
\set input 'some file name.csv'
\copy :"table" from :'input' with (format csv)
```
-- Fabien.
Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
hi. in notnull-notvalid.patch + if (coninfo->contype == 'c') + keyword = "CHECK CONSTRAINT"; + else + keyword = "INVALID NOT NULL CONSTRAINT"; we have a new TocEntry->desc kind. so the following related code within src/bin/pg_dump also needs change if (strcmp(te->desc, "CONSTRAINT") == 0 || strcmp(te->desc, "CHECK CONSTRAINT") == 0 || strcmp(te->desc, "FK CONSTRAINT") == 0) strcpy(buffer, "DROP CONSTRAINT"); else snprintf(buffer, sizeof(buffer), "DROP %s", te->desc); else if (strcmp(te->desc, "CONSTRAINT") == 0 || strcmp(te->desc, "CHECK CONSTRAINT") == 0 || strcmp(te->desc, "FK CONSTRAINT") == 0 || strcmp(te->desc, "INDEX") == 0 || strcmp(te->desc, "RULE") == 0 || strcmp(te->desc, "TRIGGER") == 0) te->section = SECTION_POST_DATA; /* these object types don't have separate owners */ else if (strcmp(type, "CAST") == 0 || strcmp(type, "CHECK CONSTRAINT") == 0 || strcmp(type, "CONSTRAINT") == 0 || strcmp(type, "DATABASE PROPERTIES") == 0 || strcmp(type, "DEFAULT") == 0 || strcmp(type, "FK CONSTRAINT") == 0 || strcmp(type, "INDEX") == 0 || strcmp(type, "RULE") == 0 || strcmp(type, "TRIGGER") == 0 || strcmp(type, "ROW SECURITY") == 0 || strcmp(type, "POLICY") == 0 || strcmp(type, "USER MAPPING") == 0) { /* do nothing */ }
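For illustration, the second of the three call sites quoted above would presumably grow one more branch for the new desc kind. A sketch of the sort of change being requested (not an actual hunk from the patch):

```c
	else if (strcmp(te->desc, "CONSTRAINT") == 0 ||
			 strcmp(te->desc, "CHECK CONSTRAINT") == 0 ||
			 strcmp(te->desc, "FK CONSTRAINT") == 0 ||
			 strcmp(te->desc, "INVALID NOT NULL CONSTRAINT") == 0 ||
			 strcmp(te->desc, "INDEX") == 0 ||
			 strcmp(te->desc, "RULE") == 0 ||
			 strcmp(te->desc, "TRIGGER") == 0)
		te->section = SECTION_POST_DATA;
```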
Re: general purpose array_sort
Hi Tom, On Mon, Mar 31, 2025 at 5:58 AM Tom Lane wrote: > > I spent some time looking at the v17 patchset. There were some pretty > strange things in it --- why were some of the variants of array_sort() > marked as volatile, for example? I think this was due to some copy-paste of the code nearby. > But the two things I'd like to > suggest functionality-wise are: > > * The second argument of the variants with booleans should be defined > as true=descending, not true=ascending. It seems a little odd to me > for the default of a boolean option not to be "false". Also, then > you don't need an inversion between the second and third arguments. > I'm not dead set on this but it just seems a little cleaner. > Agreed. > * I see that the code is set up to detect an unsortable input type > before it takes the fast exit for "no sort required". I think this > is poor engineering: we ought to make the fast path as fast as > possible. The can't-sort case is so rare in real-world usage that > I do not think it matters if the error isn't thrown by every possible > call. Besides which, it is inconsistent anyway: consider > SELECT array_sort(NULL::xid[]); > which will not error because it will never reach the C code. Why's > that okay but delivering an answer for "array_sort('{1}'::xid[])" > is not? I think "throw error only if we must sort and cannot" is > a perfectly fine definition. Agreed. > > At the code level, I didn't like the way that the multiple entry > points were set up. I think it's generally cleaner code to have > a worker function with plain C call and return coding and make > all the SQL-visible functions be wrappers around that. Also the > caching mechanism was overcomplicated, in particular because we > do not need a cache lookup to know which sort operators apply to > arrays. Agreed, your refactor made the code cleaner. > > So all that leads me to v18 attached. (I merged the two patches > into one, didn't see much value in splitting them.) > > In v18, it's somewhat annoying that the typcache doesn't cache > the typarray field; we would not need a separate get_array_type() > lookup if it did. I doubt there is any real reason for that except > that pg_type.typarray didn't exist when the typcache was invented. > So I'm tempted to add it. But I looked at existing callers of > get_array_type() and none of them are adjacent to typcache lookups, > so only array_sort would be helped immediately. I left it alone > for the moment; wonder if anyone else has an opinion? The need for `elmtyp` and `array_type` here because a column can have arrays with varying dimensions. Maybe other callers don't share this behavior? > > regards, tom lane > -- Regards Junwang Zhao
Re: Test to dump and restore objects left behind by regression
On Fri, Mar 28, 2025 at 11:43 PM Alvaro Herrera wrote: > > On 2025-Mar-28, Tom Lane wrote: > > > I think instead of going this direction, we really need to create a > > separately-purposed script that simply creates "one of everything" > > without doing anything else (except maybe loading a little data). > > I believe it'd be a lot easier to remember to add to that when > > inventing new SQL than to remember to leave something behind from the > > core regression tests. This would also be far faster to run than any > > approach that involves picking a random subset of the core test > > scripts. It's easier to remember to do something or not do something in the same file than in some other file. I find it hard to believe that introducing another set of SQL files somewhere far from regress would make this problem easier. The number of states in which objects can be left behind in the regress/sql is very large - and maintaining that 1:1 in some other set of scripts is impossible unless it's automated. > FWIW this sounds closely related to what I tried to do with > src/test/modules/test_ddl_deparse; it's currently incomplete, but maybe > we can use that as a starting point. create_table.sql in test_ddl_deparse has only one statement creating an inheritance table whereas there are dozens of different states of parent/child tables created by regress. It will require a lot of work to bridge the gap between regress_ddl_deparse and regress and more work to maintain it. I might be missing something in your ideas. IMO, whatever we do it should rely on a single set of files. One possible way could be to break the existing files into three files each, containing DDL, DML and queries from those files respectively and create three schedules DDL, DML and queries containing the respective files. These schedules will be run as required. Standard regression run runs all the three schedules one by one. But 002_pg_upgrade will run DDL and DML on the source database and run queries on target - thus checking sanity of the dump/restore or pg_upgrade beyond just the dump comparison. 027_stream_regress might run DDL, DML on the source server and queries on the target. But that too is easier said than done for: 1. Our tests mix all three kinds of statements and also rely on the order in which they are run. It will require some significant effort to carefully separate the statements. 2. With the new set of files backpatching would become hard. -- Best Wishes, Ashutosh Bapat
Add comments about fire_triggers argument in ri_triggers.c
Hi, SPI_execute_snapshot() has an argument called "fire_triggers". If this is false, AFTER triggers are postponed to the end of the query. It is true in the normal case, but set to false in RI triggers. This was introduced by 9cb84097623e in 2007. The aim is to fire check triggers after all RI updates on the same row are complete. However, I cannot find an explanation of "why this is required" in the codebase. Therefore, I've attached a patch that adds comments in ri_triggers.c explaining why fire_triggers is set to false. SPI_execute_snapshot() is used in a few places in ri_triggers.c, but I added the comments only in ri_PerformCheck() because SPI_execute_snapshot() is used only for SELECT queries in the other places. Therefore, I suspect fire_triggers does not need to be false in those places, but I left them as is. Regards, Yugo Nagata -- Yugo Nagata diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index c4ff18ce65e..45d7e6268e4 100644 --- a/src/backend/utils/adt/ri_triggers.c +++ b/src/backend/utils/adt/ri_triggers.c @@ -2580,7 +2580,13 @@ ri_PerformCheck(const RI_ConstraintInfo *riinfo, save_sec_context | SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS); - /* Finally we can run the query. */ + /* + * Finally we can run the query. + * + * Set fire_triggers to false so that AFTER triggers run at the end of + * the query. This ensures check triggers fire after all RI updates on + * the same row are complete. + */ spi_result = SPI_execute_snapshot(qplan, vals, nulls, test_snapshot, crosscheck_snapshot,
Re: Proposal: Progressive explain
Thanks for this valuable testing. I think this is actually a really good idea for how to test something like this, because the regression tests contain lots of different queries that do lots of weird things. On Sun, Mar 30, 2025 at 8:23 PM torikoshia wrote: > I haven't looked into the code yet, but when I ran below commands during > make installcheck, there was an error and an assertion failure > >=# select * from pg_stat_progress_explain; >=# \watch 0.1 > > ERROR: could not attach to dynamic shared area This seems like a race condition. Probably some backend's dsa_area went away between the time we got a pointer to it and the time we actually attached to it. We should be able to find some way of handling this without an error, like treating the case where the DSA area is missing the same as the case where there was no DSA pointer in the first place. However, this is also making me wonder if we shouldn't be using one DSA shared by all backends rather than a separate DSA area for every backend. That would require more care to avoid leaks, but I'm not sure that it's a good idea to be creating and destroying a DSA area for every single query. But I'm not 100% sure that's a problem. >TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: > "ruleutils.c", Line: 8802, PID: 73180 >TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: > "ruleutils.c", Line: 8802, PID: 73181 I wonder what is happening here. One systemic danger of the patch is that there can be a difference between what must be true at the *end* of a query and what must be true *during* a query. Anything that can't happen at the end but can happen in the middle is something that the patch will need to do something about in order to work properly. But I don't see how that can explain this failure, because AFAIR the patch just prints the same things that would have been printed by any other EXPLAIN, with the same stack of namespaces. It seems possible that this is a pre-existing bug: the regression tests might contain a query that would cause EXPLAIN to fail, but because the regression tests don't actually EXPLAIN that query, no failure occurs. But it could also be something else; for example, maybe this patch is trying to EXPLAIN something that couldn't be used with a regular EXPLAIN for some reason. Or maybe the patch doesn't actually succeed in doing the EXPLAIN with the correct namespace stack in all cases. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Change log level for notifying hot standby is waiting non-overflowed snapshot
On 2025/03/31 22:44, torikoshia wrote: Here are some comments on the documentation. Thanks for the review! The following description in high-availability.sgml also seems to misuse the word 'consistent': When the parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. Since this is part of the "User's Overview" section, it may not be appropriate to include too much detail. How about rewording it to avoid using 'consistent', for example: When the parameter is set to true on a standby server, it will begin accepting connections once it is ready. "once it is ready." feels too vague to me. How about using "once recovery has brought the system to a consistent state and be ready for hot standby." instead? + delaying accepting read-only connections. To enable hot standby, + a long-lived write transaction with more than 64 subtransactions + needs to be closed on the primary. Is it better to use 'transactions' in the plural form rather than as a nominal? - There may be more than one such transaction. - The below also uses the plural form. - The newly added message also uses the plural form: + errhint("To enable hot standby, close write transactions with more than %d subtransactions on the primary server." What do you think? I'm not sure whether the plural form is better here, but I've updated the patch as suggested. Attached is the revised version. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION From 2c83ae035c8093498b78243095180fa906a39c4a Mon Sep 17 00:00:00 2001 From: Fujii Masao Date: Tue, 1 Apr 2025 00:47:58 +0900 Subject: [PATCH v6] Improve error message when standby does accept connections. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Even after reaching the minimum recovery point, if there are long-lived write transactions with 64 subtransactions on the primary, the recovery snapshot may not yet be ready for hot standby, delaying read-only connections on the standby. Previously, when read-only connections were not accepted due to this condition, the following error message was logged: FATAL: the database system is not yet accepting connections DETAIL: Consistent recovery state has not been yet reached. This DETAIL message was misleading because the following message was already logged in this case: LOG: consistent recovery state reached This contradiction, i.e., indicating that the recovery state was consistent while also stating it wasn’t, caused confusion. This commit improves the error message to better reflect the actual state: FATAL: the database system is not yet accepting connections DETAIL: Recovery snapshot is not yet ready for hot standby. HINT: To enable hot standby, close write transactions with more than 64 subtransactions on the primary server. To implement this, the commit introduces a new postmaster signal, PMSIGNAL_RECOVERY_CONSISTENT. When the startup process reaches a consistent recovery state, it sends this signal to the postmaster, allowing it to correctly recognize that state. Since this is not a clear bug, the change is applied only to the master branch and is not back-patched. 
Author: Atsushi Torikoshi Co-authored-by: Fujii Masao Reviewed-by: Yugo Nagata Discussion: https://postgr.es/m/02db8cd8e1f527a8b999b94a4bee3...@oss.nttdata.com --- doc/src/sgml/high-availability.sgml | 12 src/backend/access/transam/xlogrecovery.c | 6 ++ src/backend/postmaster/postmaster.c | 12 +--- src/backend/tcop/backend_startup.c| 18 +- src/include/storage/pmsignal.h| 1 + src/include/tcop/backend_startup.h| 2 +- 6 files changed, 38 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index acf3ac0601d..b47d8b4106e 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -1535,7 +1535,8 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)' When the parameter is set to true on a standby server, it will begin accepting connections once the recovery has -brought the system to a consistent state. All such connections are +brought the system to a consistent state and be ready for hot standby. +All such connections are strictly read-only; not even temporary tables may be written. @@ -1974,9 +1975,12 @@ LOG: database system is ready to accept read-only connections Consistency information is recorded once per checkpoint on the primary. It is not possible to enable hot standby when reading WAL written during a period when wal_level was not set to -replica or logical on the primary. Reaching -a consistent state can also be delayed in the presence of both of the
Re: Reduce "Var IS [NOT] NULL" quals during constant folding
On Thu, Mar 27, 2025 at 10:08 AM Richard Guo wrote: > On Thu, Mar 27, 2025 at 10:53 PM Tom Lane wrote: > > Richard Guo writes: > > > I'm planning to push this patch soon, barring any objections. > > > FWIW, I have not reviewed it at all. > > Oh, sorry. I'll hold off on pushing it. As a general point, non-trivial patches should really get some substantive review before they are pushed. Please don't be in a rush to commit. It is very common for the time from when a patch is first posted to commit to be 3-6 months even for committers. Posting a brand-new feature patch on March 21st and then pressing to commit on March 27th is really not something you should be doing. I think it's particularly inappropriate here where you actually got a review that pointed out a serious design problem and then had to change the design. If you didn't get it right on the first try, you shouldn't be too confident that you did it perfectly the second time, either. I took a look at this today and I'm not entirely comfortable with this: + rel = table_open(rte->relid, NoLock); + + /* Record NOT NULL columns for this relation. */ + get_relation_notnullatts(rel, rte); + + table_close(rel, NoLock); As a general principle, I have found that it's usually a sign that something has been designed poorly when you find yourself wanting to open a relation, get exactly one piece of information, and close the relation again. That is why, today, all the information that the planner needs about a particular relation is retrieved by get_relation_info(). We do not just wander around doing random catalog lookups wherever we need some critical detail. This patch increases the number of places where we fetch relation data from 1 to 2, but it's still the case that almost everything happens in get_relation_info(), and there's now just exactly this 1 thing that is done in a different place. That doesn't seem especially nice. I thought the idea was going to be to move get_relation_info() to an earlier stage, not split one thing out of it while leaving everything else the same. -- Robert Haas EDB: http://www.enterprisedb.com
Re: explain analyze rows=%.0f
On Tue, Mar 11, 2025 at 5:58 AM Ilia Evdokimov wrote: > In the stats_ext regression test, there is a function > check_estimated_rows that returns actual rows as an integer. Currently, > none of the test cases produce a non-zero fractional part in actual rows. > > The question is: should this function be modified to return a fractional > number instead of an integer? > > Personally, I don’t see much value in doing so, because the purpose of > the test is to compare the estimated row count with the actual number of > rows. It is also unlikely that there will be test cases where loops > 1, > and the presence of a fractional part does not change the essence of the > test. > > What do you think? I suppose the way this is currently coded, it will have the effect of extracting the integer part of the row estimate. That doesn't really seem like a bad idea to me, so I'm not sure it's worth trying to adjust anything here. One thing that I just noticed, though, is that we added two decimal places to the actual row count, but not the estimated row count. So now we get stuff like this: -> Seq Scan on pgbench_branches b (cost=0.00..1.10 rows=10 width=364) (actual time=0.007..0.009 rows=10.00 loops=1) But why isn't it just as valuable to have two decimal places for the estimate? I theorize that the cases that are really a problem here are those where the row count estimate is between 0 and 1 per row, and rounding to an integer loses all precision. -- Robert Haas EDB: http://www.enterprisedb.com
Re: explain analyze rows=%.0f
Robert Haas writes: > But why isn't it just as valuable to have two decimal places for the > estimate? I theorize that the cases that are really a problem here are > those where the row count estimate is between 0 and 1 per row, and > rounding to an integer loses all precision. Currently, the planner rounds *all* rowcount estimates to integers (cf. clamp_row_est()). Maybe it'd be appropriate to rethink that, but it's not just a matter of changing EXPLAIN's print format. regards, tom lane
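For reference, clamp_row_est() in costsize.c currently looks roughly like this (paraphrased from memory and simplified — the infinity/NaN clamping is omitted), which is why row estimates never carry a fractional part today:

```c
double
clamp_row_est(double nrows)
{
	/* Force the estimate to at least one row... */
	if (nrows <= 1.0)
		nrows = 1.0;
	else
		/* ...and round everything else to an integer. */
		nrows = rint(nrows);

	return nrows;
}
```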
Re: Proposal - Allow extensions to set a Plan Identifier
On Tue, Mar 25, 2025 at 8:12 PM Sami Imseih wrote: > So this comes down to forking the Postgres code to do the job. What I >> had in mind was a slightly different flow, where we would be able to >> push custom node attributes between the header parsing and the >> generation of the extension code. If we do that, there would be no >> need to fork the Postgres code: extensions could force the definitions >> they want to the nodes they want to handle, as long as these do not >> need to touch the in-core query jumble logic, of course. > > > Allowing extensions to generate code for custom node attributes could be > taken up in 19. What about for 18 we think about exposing AppendJumble, > JUMBLE_FIELD, JUMBLE_FIELD_SINGLE and JUMBLE_STRING? > +1, I think exposing the node jumbling logic + ability to jumble values for extensions would make a big difference to get into 18, specifically for allowing extensions to do plan ID calculations efficiently. Attached a rebased version that accounts for the changes in f31aad9b0 and ad9a23bc4, and also makes AppendJumble* available, as well as two helper defines JUMBLE_VALUE and JUMBLE_VALUE_STRING. These are intended to work on values, not nodes, because I think that requiring the caller to have a local "expr" variable doesn't seem like a good API. I've also intentionally reduced the API surface area and not allowed initializing a jumble state that records constant locations / not exported RecordConstLocations. I think the utility of that seems less clear (possibly out-of-core queryid customization with a future extension hook in jumbleNode) and requires more refinement. Thoughts? Thanks, Lukas -- Lukas Fittl v2-0001-Allow-plugins-to-Jumble-an-expression.patch Description: Binary data
Re: Statistics Import and Export
On Thu, Feb 27, 2025 at 10:43 PM Greg Sabino Mullane wrote: > I know I'm coming late to this, but I would like us to rethink having > statistics dumped by default. +1. I think I said this before, but I don't think it's correct to regard the statistics as part of the database. It's great for pg_upgrade to preserve them, but I think doing so for a regular dump should be opt-in. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Automatic client certificate selection support for libpq v1
Wow, blast from the past. First, I no longer have this use case. Second, the PGSSLCERT, PGSSLKEY, and other relevant environmental variables should make most clients able to be "fairly easily" switched from one server to another. Third, the only real use case where this feature would be critical is a client which needs to have connections to two different PostgreSQL servers at the same time. Those applications are likely fairly rare and doing custom programming to support different filenames would likely be warranted. However, I still think that this feature would make it easier for users often connect to multiple servers from different administrative/security domains. Given the lack of "me too" or "+1" posts over the past 16 years, I suspect there may be features with higher user benefit. I would not cry if it gets removed. Thanks, -Seth Robertson From: "Robin Haberkorn" Date: Mon, 31 Mar 2025 13:57:55 +0300 To: "Seth Robertson" Subject: Re: [PATCH] Automatic client certificate selection support for libpq v1 Hello everybody! I was investigating the item "Allow automatic selection of SSL client certificates from a certificate store" from the Todo list [1]. If I understand Seth Robertson in his initial mail correctly, he wanted libpq to select client certificates automatically based on the host while storing several client certificates and keys in single crt and key files. Client certs are currently loaded with SSL_CTX_use_certificate_chain_file() in src/interfaces/libpq/fe-secure-openssl.c. While it is theoretically possible to implement host-specific logic using SSL_CTX_set_client_cert_cb(), I don't think you could store all the possible certificates in a single file as you might actually already need several certificates for a single host in the form of a certificate chain. As was pointed out in the thread back then, you would have to implement something like a certificate wallet/store from scratch. At the most, Seth's initial patch could be improved by looking up per-host client certificate/key files based on the host-reported server name (SSL_get_servername()), which should be more reliable when working with host aliases. But then on the other hand, there are sslcert/sslkey connection parameters since 8.4, which Seth was apparently aware of. As far as I understand, he just wanted this patch for 8.3 as well and he didn't want to update all of his existing programs. Considering that his initial mail was written 16 years ago, I don't think this is a valid argument anymore. It should be possible to adapt programs easily, e.g. by accepting "postgresql://" URIs instead of domains and manually choosing appropriate certificate/key filenames. In my opinion there is little than can and should be done in Postgres at this point. Or does anybody think, that a server-name-based certificate file selection feature should still be implemented? If yes, I would be happy to take care of it. If not, I would suggest to remove this item from the agenda/wiki. Best regards, Robin Haberkorn [1]: http://archives.postgresql.org/pgsql-hackers/2009-05/msg00406.php -- Robin Haberkorn Senior Software Engineer Tel.: +49 157 85228715 E-Mail: haberk...@b1-systems.de B1 Systems GmbH Osterfeldstraße 7 / 85088 Vohburg / https://www.b1-systems.de GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt, HRB 3537
Re: Using read stream in autoprewarm
Hi, On Mon, 31 Mar 2025 at 21:15, Melanie Plageman wrote: > > On Mon, Mar 31, 2025 at 12:27 PM Nazir Bilal Yavuz wrote: > > > > I worked on an alternative approach, I refactored code a bit. It does > > not traverse the list two times and I think the code is more suitable > > to use read streams now. I simply get how many blocks are processed by > > read streams and move the list forward by this number, so the actual > > loop skips these blocks. This approach is attached with 'alternative' > > prefix. > > I am leaning toward the refactored approach because I don't think we > want to go through the array twice and I think it is hard to get it > right with incrementing p.pos in both places and being sure we > correctly close the relation etc. I liked it more as well. > Looking at your alternative approach, I don't see how the innermost > while loop in autoprewarm_database_main() is correct > > /* Check whether blocknum is valid and within fork file size. */ > while (cur_filenumber == blk->filenumber && >blk->blocknum >= nblocks_in_fork) > { > /* Move to next forknum. */ > pos++; > continue; > } > > Won't this just infinitely loop? Oops, you are right. It should be an if statement instead of a while loop, fixed now. Also, moved 'dsm_detach(seg);' to its previous place to reduce diff. Attached as v7. Do you think that I should continue to attach both approaches? -- Regards, Nazir Bilal Yavuz Microsoft v7-0001-Optimize-autoprewarm-with-read-streams.patch Description: Binary data v7-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch Description: Binary data
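For clarity, the corrected skip logic discussed above would read roughly like this (a sketch, not the exact v7 hunk): an if test rather than a while, so control falls back to the outer loop instead of spinning.

```c
	/* Check whether blocknum is valid and within fork file size. */
	if (cur_filenumber == blk->filenumber &&
		blk->blocknum >= nblocks_in_fork)
	{
		/* Block is past the end of this fork; skip it and move on. */
		pos++;
		continue;
	}
```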