Re: Adding comments to help understand psql hidden queries

2025-03-31 Thread Maiquel Grassi
Hi!
I have read the discussion and would like
to share my humble opinion. I believe that
a visually appealing way to display the
output on the screen is to ensure that the
line of asterisks and the description line
have the same length. I imagine someone
looking at the screen and focusing on
symmetrical details. Therefore, the string
length should serve as the basis for the
calculation. If the description length is an
even number, then the formula would be:

((description length − 7) / 2)

Placing that many asterisks on both sides of
the string ' QUERY ' ensures balance.
If the description length is an odd number,
then place:

((description length − 7) / 2)

asterisks on the right side and:

(((description length − 7) / 2) + 1)

asterisks on the left side.

This method does not always result in a perfectly
symmetric number of asterisks, but it provides a
more visually aligned appearance. At the end of
the SQL code, we should also include a line
terminator of the same length as the
description. The format looks like this:

/** QUERY ***/
/* Get information about row-level policies */
SELECT pol.polname, pol.polpermissive,
CASE WHEN pol.polroles = '{0}' THEN NULL ELSE
pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles
where oid = any (pol.polroles) order by 1),',') END,
pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
CASE pol.polcmd
WHEN 'r' THEN 'SELECT'
WHEN 'a' THEN 'INSERT'
WHEN 'w' THEN 'UPDATE'
WHEN 'd' THEN 'DELETE'
END AS cmd
FROM pg_catalog.pg_policy pol
WHERE pol.polrelid = '134384' ORDER BY 1;
//
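
To make the arithmetic concrete, a minimal standalone C sketch of the
centering idea (just an illustration written for this message, not code
from the patch, and ignoring the surrounding /* and */ for simplicity)
could look like this:

#include <stdio.h>
#include <string.h>

/* Print a banner of asterisks around " QUERY " sized to match the
 * description line; the left side gets the extra asterisk when the
 * remaining width is odd. */
static void
print_query_banner(const char *description)
{
    int     total = (int) strlen(description);
    int     core = (int) strlen(" QUERY ");   /* 7 characters */
    int     right = (total - core) / 2;
    int     left = total - core - right;

    for (int i = 0; i < left; i++)
        putchar('*');
    fputs(" QUERY ", stdout);
    for (int i = 0; i < right; i++)
        putchar('*');
    putchar('\n');
}

int
main(void)
{
    print_query_banner("/* Get information about row-level policies */");
    return 0;
}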

Regards,
Maiquel.


Re: Statistics Import and Export

2025-03-31 Thread Nathan Bossart
On Mon, Mar 31, 2025 at 11:11:47AM -0400, Corey Huinker wrote:
> In light of v11-0001 being committed as 4694aedf63bf, I've rebased the
> remaining patches.

I spent the day preparing these for commit.  A few notes:

* I've added a new prerequisite patch that skips the second WriteToc() call
  for custom-format dumps that do not include data.  After some testing and
  code analysis, I haven't identified any examples where this produces
  different output.  This doesn't help much on its own, but it will become
  rather important when we move the attribute statistics queries to happen
  within WriteToc() in 0002.

* I was a little worried about the correctness of 0002 for dumps that run
  the attribute statistics queries twice, but I couldn't identify any
  problems here either.

* I removed a lot of miscellaneous refactoring that seemed unnecessary for
  these patches.  Let's move that to another patch set and keep these as
  simple as possible.

* I made a small adjustment to the TOC scan restarting logic in
  fetchAttributeStats().  Specifically, we now only allow the scan to
  restart once for custom-format dumps that include data.

* While these patches help decrease pg_dump's memory footprint, I believe
  pg_restore still reads the entire TOC into memory.  That's not this patch
  set's problem, but I think it's still an important consideration for the
  bigger picture.

Regarding whether pg_dump should dump statistics by default, my current
thinking is that it shouldn't, but I think we _should_ have pg_upgrade
dump/restore statistics by default because that is arguably the most
important use-case.  This is more a gut feeling than anything, so I reserve
the right to change my opinion.

My goal is to commit the attached patches on Friday morning, but of course
that is subject to change based on any feedback or objections that emerge
in the meantime.

-- 
nathan
>From 0256405dba245baa90c3fe35ee69e3cb1ce6253e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 31 Mar 2025 10:44:26 -0500
Subject: [PATCH v12n 1/3] Skip second WriteToc() for custom-format dumps
 without data.

Presently, "pg_dump --format=custom" calls WriteToc() twice.  The
second call is intended to update the data offset information,
which allegedly makes parallel pg_restore significantly faster.
However, if we aren't dumping any data, this step accomplishes
nothing and can be skipped.  This is a preparatory optimization for
a follow-up commit that will move the queries for per-attribute
statistics to WriteToc() to save memory.

Discussion: https://postgr.es/m/Z9c1rbzZegYQTOQE%40nathan
---
 src/bin/pg_dump/pg_backup_custom.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb29..b971e3bc16e 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -755,9 +755,11 @@ _CloseArchive(ArchiveHandle *AH)
      * If possible, re-write the TOC in order to update the data offset
      * information.  This is not essential, as pg_restore can cope in most
      * cases without it; but it can make pg_restore significantly faster
-     * in some situations (especially parallel restore).
+     * in some situations (especially parallel restore).  We can skip this
+     * step if we're not dumping any data.
      */
-    if (ctx->hasSeek &&
+    if (AH->public.dopt->dumpData &&
+        ctx->hasSeek &&
         fseeko(AH->FH, tpos, SEEK_SET) == 0)
         WriteToc(AH);
     }
-- 
2.39.5 (Apple Git-154)

>From 6e41a1b2a175f7e9a859429e57c2ffb17ec9051d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 31 Mar 2025 14:53:11 -0500
Subject: [PATCH v12n 2/3] pg_dump: Reduce memory usage of dumps with
 statistics.

Right now, pg_dump stores all generated commands for statistics in
memory.  These commands can be quite large and therefore can
significantly increase pg_dump's memory footprint.  To fix, wait
until we are about to write out the commands before generating
them, and be sure to free the commands after writing.  This is
implemented via a new defnDumper callback that works much like the
dataDumper one but is specially designed for TOC entries.

One drawback of this change is that custom dumps that include data
will run the statistics queries twice.  However, a follow-up commit
will add batching for these queries that our testing indicates
should greatly improve dump speed (even when compared to pg_dump
without this commit).

Author: Corey Huinker 
Discussion: 
https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup.h  |  1 +
 src/bin/pg_dump/pg_backup_archiver.c | 35 ++
 src/bin/pg_dump/pg_backup_archiver.h |  5 
 src/bin/pg_dump/pg_dump.c  


Re: explain analyze rows=%.0f

2025-03-31 Thread Robert Haas
On Mon, Mar 31, 2025 at 1:49 PM Tom Lane  wrote:
> Robert Haas  writes:
> > But why isn't it just as valuable to have two decimal places for the
> > estimate? I theorize that the cases that are really a problem here are
> > those where the row count estimate is between 0 and 1 per row, and
> > rounding to an integer loses all precision.
>
> Currently, the planner rounds *all* rowcount estimates to integers
> (cf. clamp_row_est()).  Maybe it'd be appropriate to rethink that,
> but it's not just a matter of changing EXPLAIN's print format.

Oh, right. I've never really understood why we round off to integers,
but the fact that we don't allow row counts < 1 feels like something
pretty important. My intuition is that it probably helps a lot more
than it hurts, too.
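
For reference, the clamping Tom mentions looks roughly like this (a
paraphrase of clamp_row_est() in costsize.c from memory; see the source
for the authoritative version):

#include <math.h>

#define MAXIMUM_ROWCOUNT 1e100      /* upper sanity bound on estimates */

/* Estimates are forced to be finite, at least 1, and whole numbers. */
static double
clamp_row_est_sketch(double nrows)
{
    if (nrows > MAXIMUM_ROWCOUNT || isnan(nrows))
        nrows = MAXIMUM_ROWCOUNT;
    else if (nrows <= 1.0)
        nrows = 1.0;            /* the "never below 1" rule discussed here */
    else
        nrows = rint(nrows);    /* round to the nearest integer */

    return nrows;
}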

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: RFC: Logging plan of the running query

2025-03-31 Thread Robert Haas
On Fri, Mar 21, 2025 at 8:40 AM torikoshia  wrote:
> Rebased it again.

Hi,

I apologize for not having noticed this thread sooner. I just became
aware of it as a result of a discussion in the hacking Discord server.
I think this has got a lot of overlap with the progressive EXPLAIN
patch from Rafael Castro, which I have been reviewing, but I am not
sure why we have two different patch sets here. Why is that?

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2025-03-31 Thread Ashutosh Bapat
On Mon, Mar 31, 2025 at 8:27 AM David Rowley  wrote:
>
> On Sat, 29 Mar 2025 at 05:46, Ashutosh Bapat
>  wrote:
> > PFA patches. 0001 and 0002 are the same as the previous set. 0003
> > changes the initial hash table size to the length of ec_derives.
>
> I'm just not following the logic in making it the length of the
> ec_derives List. If you have 32 buckets and try to insert 32 elements,
> you're guaranteed to need a resize after inserting 28 elements. See
> the grow_threshold logic in SH_UPDATE_PARAMETERS(). The point of making
> it 64 was to ensure the table is never unnecessarily sparse and to
> also ensure we make it at least big enough for the minimum number of
> ec_derives that we're about to insert.

If I am reading SH_CREATE correctly, it will initially set the size to
32/.9 = 36 and in SH_UPDATE_PARAMETERS will set grow_threshold to 36 *
.9 = 32. So it will leave some room even after inserting the initial
elements. But looking at SH_INSERT_HASH_INTERNAL(), it will soon
expand the table even if there's space. We certainly need a size much
larger than 32. 32 is an arbitrary/empirical threshold for creating a hash
table, so using that threshold as the initial hash table size would make
the table size arbitrary too. Using twice the length of
ec_derives_list seems more reasonable, since that length decides the
initial number of entries.

>
> Looking more closely at the patch's ec_add_clause_to_derives_hash()
> function, I see you're actually making two hash table entries for each
> RestrictInfo, so without any em_is_const members, you'll insert 64
> entries into the hash table with an ec_derives list of 32, in which
> case 64 buckets isn't enough and the table will end up growing to 128
> elements.
>

Yes, that's right.

> I think you'd be better off coming up with some logic like putting the
> lowest pointer value'd EM first in the key and ensure that all lookups
> do that too by wrapping the lookups in some helper function. That'll
> half the number of hash table entries at the cost of some very cheap
> comparisons.

That's a good suggestion. I searched for C standard documentation
specifying that pointer comparison, especially inequality, is stable
and safe to use, but I didn't find any. According to the C standard,
the result of comparing pointers within the same array or struct is
specified, but comparison between pointers to two different objects is
unspecified. The existing code relies on the EM pointers being stable
and also relies on equality between them being stable. It has withstood
the test of time and a variety of compilers, hence I think it should be
safe to rely on pointer comparisons being stable. But since I didn't
find any documentation confirming this, I have left those changes as a
separate patch. Some internet sources discussing pointer comparison can
be found at [1] and [2] (the latter mentions the C standard but doesn't
provide a link).
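
For clarity, the canonical-key idea could be sketched roughly like this
(names are illustrative, not the ones used in the attached patches):

#include "nodes/pathnodes.h"

/* Order the two EquivalenceMember pointers so the lower address always
 * comes first; both insertion and lookup build the key this way, so each
 * derived clause needs only one hash entry.  Relational comparison of
 * pointers to unrelated objects is what the C standard leaves unspecified,
 * which is why that change is kept as a separate patch (0004). */
typedef struct ECDerivesKey
{
    EquivalenceMember *em1;     /* lower pointer value */
    EquivalenceMember *em2;     /* higher pointer value */
} ECDerivesKey;

static inline ECDerivesKey
make_ec_derives_key(EquivalenceMember *a, EquivalenceMember *b)
{
    ECDerivesKey key;

    if (a <= b)
    {
        key.em1 = a;
        key.em2 = b;
    }
    else
    {
        key.em1 = b;
        key.em2 = a;
    }
    return key;
}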

PFA the next patchset.

0001 and 0002 are the same as in the previous set.
0003 changes the initial hash table size to
length(ec->ec_derives_list) * 4 to accommodate commuted entries.
0004 uses canonical keys per your suggestion and also reduces the
initial hash table size to length(ec->ec_derives_list) * 2.

When the number of initial elements to be added to the hash table is
known, I see users of simplehash.h using that as an estimate rather
than the nearest power of two. Hence following the logic here.

[1] 
https://stackoverflow.com/questions/59516346/how-does-pointer-comparison-work-in-c-is-it-ok-to-compare-pointers-that-dont-p
[2] 
https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Pointer-Comparison.html

-- 
Best Wishes,
Ashutosh Bapat
From ccb72663e55f5ffe67595366e4a110685f647fc6 Mon Sep 17 00:00:00 2001
From: Amit Langote 
Date: Mon, 24 Mar 2025 21:17:46 +0900
Subject: [PATCH 1/4] Add assertion to verify derived clause has constant RHS

find_derived_clause_for_ec_member() searches for a previously-derived
clause that equates a non-constant EquivalenceMember to a constant.
It is only called for EquivalenceClasses with ec_has_const set, and
with a non-constant member as the target.

The matched clause is expected to have the non-constant member on the
left-hand side and the constant EquivalenceMember on the right.

Assert that the RHS is indeed a constant, to catch violations of this
structure and enforce assumptions made by
generate_base_implied_equalities_const().

Author: Ashutosh Bapat 
Discussion: https://postgr.es/m/caexhw5scmxyfrqofe6odmbiw2rnvbemeeca-p4w_cyueiku...@mail.gmail.com
---
 src/backend/optimizer/path/equivclass.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 0f9ecf5ee8b..493a95d26cc 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2664,7 +2664,10 @@ find_derived_clause_for_ec_member(EquivalenceClass *ec,
 		 * members on the left side of derived clauses.
 		 */

Re: Proposal: Progressive explain

2025-03-31 Thread Rafael Thofehrn Castro
Hello again,

> ERROR:  could not attach to dynamic shared area

In addition to that refactoring issue, the current patch had a race
condition in pg_stat_progress_explain when accessing the DSA of a
process running a query that gets aborted.

While discussing with Robert we agreed that it would be wiser to take
a step back and change the strategy used to share progressive explain
data in shared memory.

Instead of using per-backend DSAs shared via a hash structure, I now
define a dsa_pointer and an LWLock in each backend's PGPROC.

A global DSA is created by the first backend that attempts to use
the progressive explain feature. After the DSA is created, subsequent
uses of the feature will just allocate memory there and reference
via PGPROC's dsa_pointer.

This solves the race condition reported by Torikoshi and improves
concurrency, as we no longer have a single global LWLock controlling
shared memory access, but a per-backend LWLock.
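
Roughly, the per-backend state described above amounts to something like
this (field and type names here are illustrative only, not necessarily
what the patch uses):

#include "storage/lwlock.h"
#include "utils/dsa.h"

/* Illustrative per-backend progressive-explain state; the patch itself
 * adds the fields to PGPROC rather than defining a separate struct. */
typedef struct ProgressiveExplainState
{
    dsa_pointer explain_data;   /* allocation in the single global DSA,
                                 * or InvalidDsaPointer when unused */
    LWLock      lock;           /* per-backend lock taken by readers of
                                 * pg_stat_progress_explain, replacing a
                                 * single global LWLock */
} ProgressiveExplainState;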

I performed the same tests done by Torikoshi and it looks like we are
good now, even with more frequent inspection of pg_stat_progress_explain
(\watch 0.01).

Rafael.


v15-0001-Proposal-for-progressive-explains.patch
Description: Binary data


Re: Test to dump and restore objects left behind by regression

2025-03-31 Thread Alvaro Herrera
On 2025-Mar-31, Daniel Gustafsson wrote:

> Given where we are in the cycle, it seems to make sense to stick to using the
> schedule we already have rather than invent a new process for generating it,
> and work on that for 19?

No objections to that.  I'll see about getting this committed during my
morning today, so that I have plenty of time to watch the buildfarm.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
 Are you not unsure you want to delete Firefox?
   [Not unsure] [Not not unsure][Cancel]
   http://smylers.hates-software.com/2008/01/03/566e45b2.html




Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-03-31 Thread Richard Guo
On Tue, Apr 1, 2025 at 1:55 AM Robert Haas  wrote:
> As a general principle, I have found that it's usually a sign that
> something has been designed poorly when you find yourself wanting to
> open a relation, get exactly one piece of information, and close the
> relation again. That is why, today, all the information that the
> planner needs about a particular relation is retrieved by
> get_relation_info(). We do not just wander around doing random catalog
> lookups wherever we need some critical detail. This patch increases
> the number of places where we fetch relation data from 1 to 2, but
> it's still the case that almost everything happens in
> get_relation_info(), and there's now just exactly this 1 thing that is
> done in a different place. That doesn't seem especially nice. I
> thought the idea was going to be to move get_relation_info() to an
> earlier stage, not split one thing out of it while leaving everything
> else the same.

I initially considered moving get_relation_info() to an earlier stage,
where we would collect all the per-relation data by relation OID.
Later, when building the RelOptInfos, we could link them to the
per-relation-OID data.

However, I gave up this idea because I realized it would require
retrieving a whole bundle of catalog information that isn't needed
until after the RelOptInfos are built, such as max_attr, pages,
tuples, reltablespace, parallel_workers, extended statistics, etc.
And we may also need to create the IndexOptInfos for the relation's
indexes.  It seems to me that it's not a trivial task to move
get_relation_info() before building the RelOptInfos, and more
importantly, it's unnecessary most of the time.

In other words, of the many pieces of catalog information we need to
retrieve for a relation, only a small portion is needed at an early
stage.  As far as I can see, this small portion includes:

* relhassubclass, which we retrieve immediately after we have finished
adding rangetable entries.

* attgenerated, which we retrieve in expand_virtual_generated_columns.

* attnotnull, as discussed here, which should ideally be retrieved
before pull_up_sublinks.

My idea is to retrieve only this small portion at an early stage, and
still defer collecting the majority of the catalog information until
building the RelOptInfos.  It might be possible to optimize by
retrieving this small portion in one place instead of three, but
moving the entire get_relation_info() to an earlier stage doesn't seem
like a good idea to me.

It could be argued that the separation of catalog information
collection isn't very great, but it seems this isn't something new in
this patch.  So I respectfully disagree with your statement that "all
the information that the planner needs about a particular relation is
retrieved by get_relation_info()", and that "there's now just exactly
this 1 thing that is done in a different place".  For instance,
relhassubclass and attgenerated are retrieved before expression
preprocessing, a relation's constraint expressions are retrieved when
setting the relation's size estimates, and more.

Thanks
Richard




Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.

2025-03-31 Thread David Rowley
On Tue, 1 Apr 2025 at 04:40, Christoph Berg  wrote:
> -   Storage: Disk  Maximum Storage: NkB
> +   Storage: Memory  Maximum Storage: NkB
> ->  Function Scan on generate_series a (actual time=N.N..N.N rows=N.N 
> loops=N)

We'll probably just need to bump that 2000 row count to something a
bit more for 32-bit.

Any chance you could share the output of:

explain (analyze,buffers off,costs off) select sum(n) over() from
generate_series(1,2000) a(n);

Could you maybe also do a binary search for the number of rows where
it goes to disk by adjusting the 2000 up in some increments until the
Storage method is disk? (Not that I think we should set it to the
minimum, but it would be good to not set it too much higher than we
need to)

David




Re: Truncate logs by max_log_size

2025-03-31 Thread Fujii Masao




On 2025/02/03 19:31, Jim Jones wrote:

Hi Kirill

On 31.01.25 11:46, Kirill Gavrilov wrote:

Sorry for the long silence.  I fixed the indentation and a trailing
whitespace. Should look fine now.



The patch applies cleanly, the documentation is clear, and all tests pass.

It is possible to change this new parameter session-wise, which is nice!

postgres=# SET max_log_size TO 7;
SET
postgres=# SHOW max_log_size;
 max_log_size
--------------
 7B
(1 row)


The default value now is clear and it corresponds to the value set on
postgresql.conf:

#max_log_size = 0   # max size of logged statement

postgres=# SHOW max_log_size;
 max_log_size
--------------
 0
(1 row)


Logs are truncated as expected:

postgres=# SET max_log_size TO 6;
SET
postgres=# SELECT length('CALL xyz;');
 length
--------
      9
(1 row)

postgres=# CALL xyz;
ERROR:  syntax error at or near ";"
LINE 1: CALL xyz;


log entry:

2025-02-03 10:58:19.975 CET [123945] ERROR:  syntax error at or near ";"
at character 9
2025-02-03 10:58:19.975 CET [123945] STATEMENT:  CALL x


The issue with log entry sizes for queries containing special characters
was resolved by setting the unit to bytes.

Overall, everything LGTM.

The new status of this patch is: Ready for Committer


Since this patch is marked as ready-for-committer, I started reviewing it.
Basically I like the proposed idea.


When I set log_statement to 'all', max_log_size to 3, and ran "SELECT 1/0",
only the first three bytes of the query were logged by log_statement.
However, no query statement appeared under STATEMENT, as shown below.
Is this a bug?

--
=# SET log_statement TO 'all';
=# SET max_log_size TO 3;
=# SELECT 1/0;
LOG:  statement: SEL
ERROR:  division by zero
STATEMENT:
--


When log_replication_commands is enabled, replication commands are logged.
Should max_log_size apply to these logs as well to prevent excessively
large commands from being logged in full?


The parameter name max_log_size seems misleading. It sounds like
it controls the maximum log file size. Would a name like
log_statement_max_length be clearer?


Functions like exec_parse_message(), exec_bind_message(),
and exec_execute_message() may log query statements (e.g., via
log_min_duration_statement), but the patch doesn't seem to
update them to consider max_log_size.


Queries can also be logged in the CONTEXT line, such as when running
"DO $$ BEGIN SELECT 1/0; END $$;", but max_log_size doesn't seem to
apply in this case.


There might be other cases where queries are logged, but the patch
doesn't handle them. I'm not sure we can identify and address all of
them before the feature freeze.
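
For whatever logging sites end up being covered, the byte-wise truncation
itself might look roughly like this (a sketch only, using the GUC name
from this thread; the actual patch may structure it differently):

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Clip a statement to max_log_size bytes before logging, using
 * pg_mbcliplen() so a multibyte character is never cut in half. */
static char *
truncate_statement_for_log(const char *stmt, int max_log_size)
{
    int         len = strlen(stmt);

    if (max_log_size > 0 && len > max_log_size)
        len = pg_mbcliplen(stmt, len, max_log_size);

    return pnstrdup(stmt, len);
}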

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: SQLFunctionCache and generic plans

2025-03-31 Thread Tom Lane
Alexander Pyhalov  writes:
> I've looked through it and made some tests, including ones which
> caused me to create a separate context for planning. Was a bit worried
> that it has gone, but now, as fcache->fcontext is deleted at the end
> of function execution, I don't see the leaks which were the initial
> reason for introducing it.

Yeah.  As it's set up in v10, we do parsing work in the caller's
context (which is expected to be short-lived) when creating or
recreating the long-lived cache entry.  However, planning work
(if needed) is done in the fcontext, since that will happen within
init_execution_state which is called after fmgr_sql has switched into
the fcontext.  I thought about switching back to the caller's context
but decided that it wouldn't really be worth the trouble.  For a
non-SRF there's no meaningful difference anyway.  For a SRF, it'd mean
that planning cruft survives till end of execution of the SRF rather
than possibly going away when the first row is returned.  But we
aren't looping: any given query within the SRF is planned only once
during an execution of the function.  So there's no possibility of
indefinite accumulation of leakage.

If we wanted to improve that, my inclination would be to try to
not switch into the fcontext for the whole of fmgr_sql, but only
use it explicitly for allocations that need to survive.  But I
don't think that'll save much, so I think any such change would
be best left for later.  The patch is big enough already.

regards, tom lane




Re: Make COPY format extendable: Extract COPY TO format implementations

2025-03-31 Thread David G. Johnston
On Mon, Mar 31, 2025 at 11:52 AM Masahiko Sawada 
wrote:

> On Sat, Mar 29, 2025 at 9:49 AM David G. Johnston
>  wrote:
> >
> > On Wed, Mar 26, 2025 at 8:28 PM Sutou Kouhei  wrote:
> >>
> >>
> >> The attached v39 patch set uses the followings:
> >>
> >> 0001: Create copyto_internal.h and change COPY_XXX to
> >>   COPY_SOURCE_XXX and COPY_DEST_XXX accordingly.
> >>   (Same as 1. in your suggestion)
> >> 0002: Support custom format for both COPY TO and COPY FROM.
> >>   (Same as 2. in your suggestion)
> >> 0003: Expose necessary helper functions such as CopySendEndOfRow()
> >>   and add CopyFromSkipErrorRow().
> >>   (3. + 4. in your suggestion)
> >> 0004: Define handler functions for built-in formats.
> >>   (Not included in your suggestion)
> >> 0005: Documentation. (WIP)
> >>   (Same as 5. in your suggestion)
> >>
> >
> > I prefer keeping 0002 and 0004 separate.  In particular, keeping the
> design choice of "unqualified internal format names ignore search_path"
> should stand out as its own commit.
>
> What is the point of having separate commits for already-agreed design
> choices? I guess that it would make it easier to revert that decision.
> But I think it makes more sense that if we agree with "unqualified
> internal format names ignore search_path" the original commit includes
> that decision and describes it in the commit message. If we want to
> change that design based on the discussion later on, we can have a
> separate commit that makes that change and has the link to the
> discussion.
>

Fair.  Comment withdrawn.  Though I was referring to the WIP patches; I
figured the final patch would squash this all together in any case.

David J.


[PATCH] Fix potential overflow in binary search mid calculation

2025-03-31 Thread Jianghua Yang
Dear PostgreSQL Developers,

I have identified a potential integer overflow issue in the binary search
implementation within the DSA size class lookup code.
Issue Description

In the current implementation, the calculation of mid is performed as:

uint16 mid = (max + min) / 2;

Since both max and min are of type uint16, adding them together may exceed
65535, leading to an overflow and incorrect behavior in the binary search
logic. This could result in incorrect indexing into the dsa_size_classes
array.
Proposed Fix

To prevent this overflow, we should use the alternative calculation method:

uint16 mid = min + (max - min) / 2;

This approach ensures that (max - min) does not exceed 65535, preventing
the addition from overflowing while still correctly computing the middle
index.
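
As a generic illustration of the overflow-safe form (a standalone sketch,
not the dsa_size_classes lookup itself):

#include <stddef.h>

/* Lower-bound binary search over a sorted array.  The midpoint is
 * computed as min + (max - min) / 2, so the intermediate value never
 * exceeds max no matter how large the index values get. */
static size_t
lower_bound(const int *arr, size_t n, int key)
{
    size_t      min = 0;
    size_t      max = n;        /* search within [min, max) */

    while (min < max)
    {
        size_t      mid = min + (max - min) / 2;

        if (arr[mid] < key)
            min = mid + 1;
        else
            max = mid;
    }
    return min;                 /* index of the first element >= key */
}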
Patch

A patch implementing this fix is attached.


0001-Fix-potential-overflow-in-binary-search-mid-calculat.patch
Description: Binary data


Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).

2025-03-31 Thread David G. Johnston
On Mon, Mar 31, 2025 at 8:13 PM jian he  wrote:

> On Mon, Mar 31, 2025 at 11:27 PM Fujii Masao
>
> >
> > The copy.sgml documentation should clarify that COPY TO can
> > be used with a materialized view only if it is populated.
> >
> "COPY TO can be used only with plain tables, not views, and does not
> copy rows from child tables or child partitions"
> i changed it to
> "COPY TO can be used with plain tables and materialized views, not
> regular views, and does not copy rows from child tables or child
> partitions"
>
> Another alternative wording I came up with:
> "COPY TO can only be used with plain tables and materialized views,
> not regular views. It also does not copy rows from child tables or
> child partitions."
>
>
Those don't address the "populated" aspect of the materialized view.

I'm unclear ATM (can check later if needed) if populated means "non-empty"
or simply "had refresh executed at least once on it".  i.e., if the refresh
produces zero rows stored in the MV is it still populated?  I'm guessing
yes; and this only pertains to "WITH NO DATA", which itself already calls
out that "...and cannot be queried until RMV is used".  I find it of
marginal usefulness to bring that distinction over to COPY TO absent people
showing up confused about the error message, which likely will be quite
rare.  That said I'd probably settle with:

COPY TO can only be used with plain tables and populated
materialized views. It does not copy rows from child tables
or child partitions (i.e., copy table to copies the same rows as
select * from only table). The syntax COPY (select * from table) TO ...
can be used to dump all of the rows in an inheritance hierarchy,
partitioned table, or foreign table; as well as ordinary view results.

Curious about sequences; no way to name an index here.

I'm second-guessing why "composite type" shows up in the glossary under
"Relation"...though that originally came up IIRC discussing namespaces.

David J.


Re: add function argument name to substring and substr

2025-03-31 Thread David G. Johnston
On Mon, Mar 31, 2025 at 9:12 PM jian he  wrote:

> On Tue, Apr 1, 2025 at 6:22 AM David G. Johnston
>  wrote:
> >
> > On Tue, Mar 18, 2025 at 9:04 PM jian he 
> wrote:
> >>
> >>
> >> new patch attached.
> >>
> >
> > I've done v4 with a delta patch.
>
> your v4-0001-v3-0001-substring.patch is not the same as my
> v3-0001-add-argument-name-to-function-substring-and-subst.patch
>
>
Sorry about that.  v5 attached.  Confirmed with diff the v3 and v5 0001 so
we should be good.

David J.
From 507064e643b54d2bebd7689acd3dde60d230e328 Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Mon, 31 Mar 2025 21:36:42 -0700
Subject: [PATCH 1/2] v3-0001 substring

---
 doc/src/sgml/func.sgml   | 111 +--
 src/backend/catalog/system_functions.sql |   2 +-
 src/include/catalog/pg_proc.dat  |  12 +++
 3 files changed, 118 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c642f1ea4e..a3569995f1 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3806,6 +3806,58 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in

   
 
+
+  
+   
+substring ( string text, pattern text )
+text
+   
+   
+Extracts the first substring matching POSIX regular expression; see
+.
+   
+   
+substring('Thomas', '...$')
+mas
+   
+  
+
+  
+   
+substring ( string text, pattern text, escape_character  text)
+text
+   
+   
+Extracts the first substring matching SQL regular expression;
+see .
+   
+   
+substring('Thomas', '%#"o_a#"_', '#')
+oma
+   
+  
+
+  
+   
+
+ substring
+
+substring ( string text, start integer , count integer  )
+text
+   
+   
+Extracts the substring of string starting at
+the start'th character,
+and stopping after count characters if that is
+specified.
+   
+
+   
+substring('Thomas', 2, 3)
+hom
+   
+  
+
   

 
@@ -4811,6 +4863,27 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 \x5678

   
+
+  
+   
+
+ substring
+
+substring ( bytes bytea, start integer , count integer  )
+bytea
+   
+   
+Extracts the substring of bytes starting at
+the start'th byte,
+and stopping after count bytes if that is
+specified.
+   
+   
+substring('\x1234567890'::bytea, 3, 2)
+\x5678
+   
+  
+
 

   
@@ -5353,6 +5426,26 @@ cast(-1234 as bytea)   \xfb2e

   
 
+  
+   
+
+ substring
+
+substring ( bits bit, start integer , count integer  )
+bit
+   
+   
+Extracts the substring of bits starting at
+the start'th bit,
+and stopping after count bits if that is
+specified.
+   
+   
+substring(B'11001011', 3, 2)
+00
+   
+  
+
   

 
@@ -5816,7 +5909,7 @@ substring(string from pattern
 or as a plain three-argument function:
 
-substring(string, pattern, escape-character)
+substring(string, pattern, escape_character)
 
 As with SIMILAR TO, the
 specified pattern must match the entire data string, or else the
@@ -6020,11 +6113,17 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL
 
 
- The substring function with two parameters,
- substring(string from
- pattern), provides extraction of a
- substring
- that matches a POSIX regular expression pattern.  It returns null if
+ The substring function with two parameters provides extraction of a
+ substring that matches a POSIX regular expression pattern.
+ It has syntax:
+
+substring(string from pattern)
+
+ It can also written as a plain two-argument function:
+
+substring(string, pattern)
+
+ It returns null if
  there is no match, otherwise the first portion of the text that matched the
  pattern.  But if the pattern contains any parentheses, the portion
  of the text that matched the first parenthesized subexpression (the
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 566f308e44..5ea9d786b6 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -42,7 +42,7 @@ CREATE OR REPLACE FUNCTION rpad(text, integer)
  IMMUTABLE PARALLEL SAFE STRICT COST 1
 RETURN rpad($1, $2, ' ');
 
-CREATE OR REPLACE FUNCTION "substring"(text, text, text)
+CREATE OR REPLACE FUNCTION "substring"(string text, pattern text, escape_character text)
  RETURNS text
  LANGUAGE sql
  IMMUTABLE PARALLEL SAFE STRICT COST 1
diff --git a/src/include/catalog/pg_proc.dat b/src/inc

tzdata 2025b

2025-03-31 Thread Masahiko Sawada
Hi all,

tzdata 2025b was released on 3/22 [1]. Do we need to update the
tzdata.zi file on HEAD and the back branches?

Regards,

[1] https://data.iana.org/time-zones/tzdb/NEWS

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Add support for OAUTHBEARER SASL mechanism

2025-03-31 Thread Jacob Champion
On Mon, Mar 31, 2025 at 2:54 PM Christoph Berg  wrote:
>
> > Add support for OAUTHBEARER SASL mechanism
>
> Debian still has this experimental port with a GNU userland and a
> FreeBSD kernel called kfreebsd. I don't expect anyone to particularly
> care about it, but it found an actual bug:
>
> /build/reproducible-path/postgresql-18-18~~devel.20250331/build/../src/interfaces/libpq/fe-auth-oauth-curl.c:
>  In function ‘register_socket’:
> /build/reproducible-path/postgresql-18-18~~devel.20250331/build/../src/interfaces/libpq/fe-auth-oauth-curl.c:1317:20:
>  error: ‘actx’ undeclared (first use in this function); did you mean ‘ctx’?
>  1317 | actx_error(actx, "libpq does not support multiplexer sockets 
> on this platform");
>   |^~~~
>
> This should not be a compile-time error; actx is not defined outside
> the #ifdef blocks there:

Ah, sorry about that. Thank you for reporting it!

(That means that Windows builds --with-libcurl are similarly broken, I
think. Not that Windows packagers will want to use --with-libcurl --
it doesn't do anything -- but it should build.)

I don't have hurd-amd64 to test, but I'm working on a patch that will
build and pass tests if I manually munge pg_config.h. We were skipping
the useless tests via a $windows_os check; I think I should use
check_pg_config() instead.

We could change how this works a bit for the proposed libpq-oauth.so
plugin, and only build it if we have a workable implementation. I do
like having these other platforms compile the Curl code, though, since
we'd prefer to keep the build clean for a future Windows
implementation...

--Jacob




Re: Replace IN VALUES with ANY in WHERE clauses during optimization

2025-03-31 Thread Alena Rybakina

Hi, Alexander!

On 30.03.2025 00:59, Alexander Korotkov wrote:

Hi, Alena!

On Sat, Mar 29, 2025 at 9:03 PM Alena Rybakina
  wrote:

On 29.03.2025 14:03, Alexander Korotkov wrote:

One thing I have to fix: we must do
IncrementVarSublevelsUp() unconditionally for all expressions as Vars
could be deeper inside.

Yes, I'm looking at it too. I've just understood that it was needed for
subqueries - they can contain Var elements which need the sublevel
parameter decreased.

for example for the query:

EXPLAIN (COSTS OFF)
SELECT ten FROM onek t
WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c
   WHERE c.unique2 = t.unique1))::integer));

We are interested in this element: ((2 IN (SELECT unique2 FROM onek c  WHERE 
c.unique2 = t.unique1))

It is a FuncExpr object with a RangeTblEntry variable. I highlighted it below:

WARNING:  1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 1 :funccollid 0 :inputcollid 0 :args ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false 
:location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM :paramkind 2 :paramid 1 :paramtype 23 :paramtypmod -1 :paramcollid 0 :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasTargetSRFs false :hasSubLinks false :hasDistinctOn false 
:hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :hasGroupRTE false :isReturn false :cteList <> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c :colnames <>} :eref {ALIAS :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" "hundred" 
"thousand" "twothousand" "fivethous" "tenthous" "odd" "even" "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true :relkind r :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl true :securityQuals <>}) :rteperminfos ({RTEPERMISSIONINFO :relid 32795 
:inh true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9) :insertedCols (b) :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) :quals {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 
:varnosyn 1 :varattnosyn 2 :location -1} {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 2 :varreturningtype 0 :varnosyn 1 :varattnosyn 1 :location -1}) :location -1}} :mergeActionList <> :mergeTargetRelation 0 :mergeJoinCondition <> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 
:varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 :resname unique2 :ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk false}) :override 0 :onConflict <> :returningOldAlias <> :returningNewAlias <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <> 
:havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <> :limitCount <> :limitOption 0 :rowMarks <> :setOperations <> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} :location -1}) :location -1}


I highlighted in bold the Var we need. Since it is in a subquery inside the IN
expression that will be flattened, all elements contained in it should have their
level number decreased by one, since they will belong to the subtree located
above it. Because of that condition, this did not happen.

I generally agree with you that it is better to remove that condition. The
function IncrementVarSublevelsUp essentially walks the structures below and
decreases the level of only the Vars for which this needs to be done, and the
condition with 1 protects us from touching those Vars that should not be
changed. So the varlevelsup for this Var should be 1.

I am currently investigating whether this transformation is correct for all
cases; I have not found any problems yet.

Thank you for your feedback.  I appreciate you're also looking for the
potential problems.  One thing to highlight: doing
IncrementVarSublevelsUp() unconditionally is required not just for
subqueries.  Consider the following example.

SELECT * FROM t WHERE val1 IN (VALUES (val2), (val2 +1));

The second value contains a Var, which needs IncrementVarSublevelsUp(),
but the top node is OpExpr.
Yes, I agree with that - this is precisely why we need to call 
IncrementVarSublevelsUp() unconditionally for all types.


As you mentioned earlier, Var nodes can be nested more deeply, and 
skipping this step could lead to incorrect behavior in those cases. So, 
now it works fine)


Thank you for an example.


I analyzed this transfor

Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.

2025-03-31 Thread Christoph Berg
Re: David Rowley
> Any chance you could share the output of:
> 
> explain (analyze,buffers off,costs off) select sum(n) over() from
> generate_series(1,2000) a(n);

PostgreSQL 18devel on x86-linux, compiled by gcc-14.2.0, 32-bit

=# explain (analyze,buffers off,costs off) select sum(n) over() from 
generate_series(1,2000) a(n);
QUERY PLAN  
  
──
 WindowAgg (actual time=1.248..1.731 rows=2000.00 loops=1)
   Window: w1 AS ()
   Storage: Memory  Maximum Storage: 63kB
   ->  Function Scan on generate_series a (actual time=0.301..0.536 
rows=2000.00 loops=1)
 Planning Time: 0.066 ms
 Execution Time: 1.913 ms
(6 rows)


> Could you maybe also do a binary search for the number of rows where
> it goes to disk by adjusting the 2000 up in some increments until the
> Storage method is disk? (Not that I think we should set it to the
> minimum, but it would be good to not set it too much higher than we
> need to)

The test has a `set work_mem = 64;` which I used here:

=# explain (analyze,buffers off,costs off) select sum(n) over() from 
generate_series(1,2047) a(n);
QUERY PLAN
──
 WindowAgg (actual time=1.037..1.429 rows=2047.00 loops=1)
   Window: w1 AS ()
   Storage: Memory  Maximum Storage: 64kB
   ->  Function Scan on generate_series a (actual time=0.262..0.457 
rows=2047.00 loops=1)
 Planning Time: 0.058 ms
 Execution Time: 1.594 ms
(6 rows)

=# explain (analyze,buffers off,costs off) select sum(n) over() from 
generate_series(1,2048) a(n);
QUERY PLAN
──
 WindowAgg (actual time=2.073..2.686 rows=2048.00 loops=1)
   Window: w1 AS ()
   Storage: Disk  Maximum Storage: 65kB
   ->  Function Scan on generate_series a (actual time=0.624..1.064 
rows=2048.00 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 2.934 ms
(6 rows)

(With the default work_mem, the tipping point is around 149500)

Christoph




Re: explain analyze rows=%.0f

2025-03-31 Thread Ilia Evdokimov


On 31.03.2025 22:09, Robert Haas wrote:

Oh, right. I've never really understood why we round off to integers,
but the fact that we don't allow row counts < 1 feels like something
pretty important. My intuition is that it probably helps a lot more
than it hurts, too.



We definitely shouldn’t remove the row counts < 1 check, since there are 
many places in the planner where we divide by rows. This mechanism was 
added specifically to prevent division by zero. Also, allowing row
estimates below 1 can sometimes make the planner overly optimistic, 
leading it to prefer cheaper-looking plans that may not perform well in 
practice. For example, choosing a Nested Loop instead of a more 
appropriate Hash Join.


Allowing fractional rows > 1 might help improve planner accuracy in some 
cases, but this needs further study to fully understand the impact.


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


Periodic FSM vacuum doesn't happen in one-pass strategy vacuum.

2025-03-31 Thread Masahiko Sawada
Hi,

With commit c120550edb86, If we got the cleanup lock on the page,
lazy_scan_prune() marks dead item IDs directly to LP_UNUSED. So the
following check with has_lpdead_items made the periodic FSM vacuum in
the one-pass strategy vacuum no longer being triggered:

if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items &&
blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
{
FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
blkno);
next_fsm_block_to_vacuum = blkno;
}

Before c120550edb86, since we marked dead item IDs to LP_DEAD once
even in the one-pass strategy vacuum, we used to call
lazy_vacuum_heap_page() to vacuum the page and to call
FreeSpaceMapVacuum() periodically, so we had that check.

I've attached a patch to fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.

2025-03-31 Thread Tatsuo Ishii
From: David Rowley 
Subject: Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.
Date: Tue, 1 Apr 2025 11:09:11 +1300
Message-ID: 

> On Tue, 1 Apr 2025 at 09:40, Christoph Berg  wrote:
>> =# explain (analyze,buffers off,costs off) select sum(n) over() from 
>> generate_series(1,2048) a(n);
>> QUERY PLAN
>> ──
>>  WindowAgg (actual time=2.073..2.686 rows=2048.00 loops=1)
>>Window: w1 AS ()
>>Storage: Disk  Maximum Storage: 65kB
> 
> Thank you for testing that. I've just pushed a patch to bump it up to 2500.
> 
> I suspect the buildfarm didn't catch this due to the tuplestore
> consuming enough memory in MEMORY_CONTEXT_CHECKING builds.

David,
Christoph,

Thank you for fixing this!
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: speedup COPY TO for partitioned table.

2025-03-31 Thread vignesh C
On Tue, 1 Apr 2025 at 06:31, jian he  wrote:
>
> On Mon, Mar 31, 2025 at 4:05 PM Kirill Reshke  wrote:
>
> Thanks for doing the benchmark.

A few comments to improve the code comments and error messages, and to
remove a redundant assignment:
1) How about we change below:
/*
 * partition's rowtype might differ from the root table's.  We must
 * convert it back to the root table's rowtype as we are export
 * partitioned table data here.
*/
To:
/*
 * A partition's row type might differ from the root table's.
 * Since we're exporting partitioned table data, we must
 * convert it back to the root table's row type.
 */

2) How about we change below:
+   if (relkind == RELKIND_FOREIGN_TABLE)
+       ereport(ERROR,
+               errcode(ERRCODE_WRONG_OBJECT_TYPE),
+               errmsg("cannot copy from foreign table \"%s\"",
+                      RelationGetRelationName(rel)),
+               errdetail("partition \"%s\" is a foreign table",
+                         RelationGetRelationName(rel)),
+               errhint("Try the COPY (SELECT ...) TO variant."));

To:
+   if (relkind == RELKIND_FOREIGN_TABLE)
+       ereport(ERROR,
+               errcode(ERRCODE_WRONG_OBJECT_TYPE),
+               errmsg("cannot copy from a partitioned table having foreign table partition \"%s\"",
+                      RelationGetRelationName(rel)),
+               errdetail("partition \"%s\" is a foreign table",
+                         RelationGetRelationName(rel)),
+               errhint("Try the COPY (SELECT ...) TO variant."));

3) How about we change below:
/*
 * rel: the relation to be copied to.
 * root_rel: if not NULL, then the COPY partitioned relation to destination.
 * processed: number of tuples processed.
*/
To:
/*
 * rel: the relation from which data will be copied.
 * root_rel: If not NULL, indicates that rel's row type must be
 *   converted to root_rel's row type.
 * processed: number of tuples processed.
 */

4) You can initialize processed to 0 along with its declaration in the
DoCopyTo function and remove the below:
+   if (cstate->rel && cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
...
...
        processed = 0;
-       while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot))
...
...
-
-       ExecDropSingleTupleTableSlot(slot);
-       table_endscan(scandesc);
+   }
+   else if (cstate->rel)
+   {
+       processed = 0;
+       CopyThisRelTo(cstate, cstate->rel, NULL, &processed);
    }

Regards,
Vignesh




Re: explain analyze rows=%.0f

2025-03-31 Thread Andrei Lepikhov

On 3/31/25 19:35, Robert Haas wrote:

But why isn't it just as valuable to have two decimal places for the
estimate? I theorize that the cases that are really a problem here are
those where the row count estimate is between 0 and 1 per row, and
rounding to an integer loses all precision.
The issues I've seen were about a zero row count - we lose any
understanding of the job the executor has done in the node; loops
= 1000 doesn't tell us anything helpful in that case.
Maybe in the next version we replace the two-decimal-places rule
with two significant digits. At least that removes the boring xxx.00
numbers from EXPLAIN.


--
regards, Andrei Lepikhov




Re: Non-text mode for pg_dumpall

2025-03-31 Thread Mahendra Singh Thalor
On Mon, 31 Mar 2025 at 23:43, Álvaro Herrera  wrote:
>
> Hi
>
> FWIW I don't think the on_exit_nicely business is in final shape just
> yet.  We're doing something super strange and novel about keeping track
> of an array index, so that we can modify it later.  Or something like
> that, I think?  That doesn't sound all that nice to me.  Elsewhere it
> was suggested that we need some way to keep track of the list of things
> that need cleanup (a list of connections IIRC?) -- perhaps in a
> thread-local variable or a global or something -- and we install the
> cleanup function once, and that reads from the variable.  The program
> can add things to the list, or remove them, at will; and we don't need
> to modify the cleanup function in any way.
>
> --
> Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/

Thanks Álvaro for the feedback.

I removed the old handling of on_exit_nicely_list from the last patch
set and added one simple function that just updates the archive handle in
shutdown_info.  (shutdown_info.AHX = AHX;)

For the first database, we add an entry to the on_exit_nicely_list array;
for the remaining databases, we only update shutdown_info, as we have
already closed the connection for the previous database. With this fix,
we no longer touch the on_exit_nicely_list entry for each database.

Here, I am attaching updated patches.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


v25_0002-add-new-list-type-simple_oid_string_list-to-fe-utils.patch
Description: Binary data


v25_0001-Move-common-pg_dump-code-related-to-connections.patch
Description: Binary data


v25_0004-update-AX-handle-for-each-database-for-cleanup.patch
Description: Binary data


v25_0003-pg_dumpall-with-directory-tar-custom-format.patch
Description: Binary data


Re: Fix slot synchronization with two_phase decoding enabled

2025-03-31 Thread Amit Kapila
On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote:
>
> >
> > I suspect that this can happen in PG17 as well, but I need to think
> > more about it to make a reproducible test case.
>
> After further analysis, I was able to reproduce the same issue [1] in
> PG 17.
>
> However, since the proposed fix requires catalog changes and the issue is not 
> a
> security risk significant enough to justify changing the catalog in back
> branches, we cannot back-patch the same solution.
>

Agreed. In the past, as in commit b6e39ca92e, we have backported a
catalog-modifying commit, but that was for a CVE. Users need to follow
manual steps as explained in the 9.6.4 release notes [1], which would be
cumbersome for them. This is not a security issue, so we shouldn't
backpatch a catalog-modifying commit, following past practice.

> Following off-list
> discussions with Amit and Kuroda-san, we are considering disallowing enabling
> failover and two-phase decoding together for a replication slot, as suggested
> in attachment 0002.
>
> Another idea considered is to prevent the slot that enables two-phase decoding
> from being synced to standby. IOW, this means displaying the failover field as
> false in the view, if there is any possibility that transactions prepared
> before the two_phase_at position exist (e.g., if restart_lsn is less than
> two_phase_at). However, implementing this change would require additional
> explanations to users for this new behavior, which seems tricky.
>

I find it tricky to explain to users. We need to say that sometimes
the slots won't be synced even if the failover is set to true. Users
can verify that by checking slot properties on the publisher. Also, on
the subscriber, the failover flag in the subscription may still be
true unless we do more engineering to make it false. So, I prefer to
simply disallow setting failover and two_phase together. We need to
recommend to users in release notes for 17 that they need to disable
failover for subscriptions where two_phase is enabled or re-create the
subscriptions with two_phase=false and failover=true. Users may not
like it, but I think it is better than getting a complaint that after
promotion of standby the data to subscriber is not getting replicated.

[1] - https://www.postgresql.org/docs/9.6/release-9-6-4.html

-- 
With Regards,
Amit Kapila.




Re: Periodic FSM vacuum doesn't happen in one-pass strategy vacuum.

2025-03-31 Thread Masahiko Sawada
On Mon, Mar 31, 2025 at 3:12 PM Melanie Plageman
 wrote:
>
> On Mon, Mar 31, 2025 at 6:03 PM Masahiko Sawada  wrote:
> >
> > With commit c120550edb86, If we got the cleanup lock on the page,
> > lazy_scan_prune() marks dead item IDs directly to LP_UNUSED. So the
> > following check with has_lpdead_items made the periodic FSM vacuum in
> > the one-pass strategy vacuum no longer being triggered:
> >
> > if (got_cleanup_lock && vacrel->nindexes == 0 && 
> > has_lpdead_items &&
> > blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
> > {
> > FreeSpaceMapVacuumRange(vacrel->rel, 
> > next_fsm_block_to_vacuum,
> > blkno);
> > next_fsm_block_to_vacuum = blkno;
> > }
> >
> > Before c120550edb86, since we marked dead item IDs to LP_DEAD once
> > even in the one-pass strategy vacuum, we used to call
> > lazy_vacuum_heap_page() to vacuum the page and to call
> > FreeSpaceMapVacuum() periodically, so we had that check.
>
> Whoops, yea, I had a feeling that something wasn't right here when I
> saw that thread a couple weeks ago about FSM vacuuming taking forever
> at the end of a vacuum and then I remember looking at the code and
> thinking why is VACUUM_FSM_EVERY_PAGES named that if it only is used
> when there are no indexes or a multi-pass vacuum.
>
> I even made a comment in [1] that we would only do FSM_EVERY_PAGES
> when we have multi-pass vacuum -- which is even less after 17.
>
> Isn't it ironic that I actually broke it.
>
> > I've attached a patch to fix it.
>
> Looks like you forgot

Oops, attached now.

Looking further at the places related to VACUUM_FSM_EVERY_PAGES, the
comment atop vacuumlazy.c seems incorrect:

* In between phases, vacuum updates the freespace map (every
* VACUUM_FSM_EVERY_PAGES).

IIUC the context is two-pass strategy vacuum (i.e., the table has
indexes) and VACUUM_FSM_EVERY_PAGES is ignored in this case.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_periodic_fsm_vacuum.patch
Description: Binary data


Re: [PATCH] Fix potential overflow in binary search mid calculation

2025-03-31 Thread Tender Wang
Jianghua Yang  于2025年4月1日周二 04:29写道:

> Dear PostgreSQL Developers,
>
> I have identified a potential integer overflow issue in the binary search
> implementation within the DSA size class lookup code.
> Issue Description
>
> In the current implementation, the calculation of mid is performed as:
>
> uint16 mid = (max + min) / 2;
>
> Since both max and min are of type uint16, adding them together may exceed
>  65535, leading to an overflow and incorrect behavior in the binary
> search logic. This could result in incorrect indexing into the
> dsa_size_classes array.
>
The value of min is from the array dsa_size_class_map. The max value in
dsa_size_class_map[] is 25.
The value of max is the length of dsa_size_classes[], which is not too
large.
It will not happen that (max + min) exceeds 65535.


> Proposed Fix
>
> To prevent this overflow, we should use the alternative calculation method:
>
> uint16 mid = min + (max - min) / 2;
>
> This approach ensures that (max - min) does not exceed 65535, preventing
> the addition from overflowing while still correctly computing the middle
> index.
> Patch
>
> A patch implementing this fix is attached.
>


-- 
Thanks, Tender Wang


Re: Replace IN VALUES with ANY in WHERE clauses during optimization

2025-03-31 Thread Alexander Korotkov
Hi, Alena!

On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina 
wrote:
> Yes, I agree with that - this is precisely why we need to call
IncrementVarSublevelsUp() unconditionally for all types.
>
> As you mentioned earlier, Var nodes can be nested more deeply, and
skipping this step could lead to incorrect behavior in those cases. So, now
it works fine)
>
> Thank you for an example.
>
> I analyzed this transformation with various types of values that might be
used in conditions.
>
> First, I verified whether the change would affect semantics, especially
in the presence of NULL elements. The only notable behavior I observed was
> the coercion of NULL to an integer type. However, this behavior remains
the same even without our transformation, so everything is fine.

Thank you for your experiments!  I've also rechecked we don't sacrifice
lazy evaluation.  But it appears we don't have one anyway.

CREATE FUNCTION my_func() RETURNS text AS $$
BEGIN
RAISE NOTICE 'notice';
RETURN 'b';
END;
$$ LANGUAGE 'plpgsql';

# create table test (val text);
# insert into test values ('a');
# explain analyze select * from test where val in (VALUES ('a'),
(my_func()));
NOTICE:  notice
  QUERY PLAN
---
 Hash Semi Join  (cost=0.05..21.26 rows=9 width=64) (actual
time=0.178..0.183 rows=1.00 loops=1)
   Hash Cond: (test.val = ("*VALUES*".column1)::text)
   Buffers: shared hit=1
   ->  Seq Scan on test  (cost=0.00..18.80 rows=880 width=64) (actual
time=0.045..0.048 rows=1.00 loops=1)
 Buffers: shared hit=1
   ->  Hash  (cost=0.03..0.03 rows=2 width=32) (actual time=0.111..0.112
rows=2.00 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 9kB
 ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
(actual time=0.004..0.065 rows=2.00 loops=1)
 Planning Time: 0.250 ms
 Execution Time: 0.267 ms
(10 rows)
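
For readers following along, the transformation under discussion effectively
rewrites the VALUES-based IN list into a "= ANY(...)" condition (a
ScalarArrayOpExpr). An illustrative sketch reusing the table from the example
above, showing the hand-written equivalent rather than the patch's actual
output:

-- original form, as written by the user
explain (costs off) select * from test where val in (VALUES ('a'), ('b'));

-- roughly what the optimization is meant to produce
explain (costs off) select * from test where val = ANY (ARRAY['a', 'b']);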

--
Regards,
Alexander Korotkov
Supabase


Re: add function argument name to substring and substr

2025-03-31 Thread jian he
On Tue, Apr 1, 2025 at 6:22 AM David G. Johnston
 wrote:
>
> On Tue, Mar 18, 2025 at 9:04 PM jian he  wrote:
>>
>>
>> new patch attached.
>>
>
> I've done v4 with a delta patch.
>
> Decided to standardize on calling the SQL Similar To regular expression 
> escape replaceable "escape" everywhere.
>
> Instead of fully documenting the obsolete syntax I added a note explaining 
> the keyword choice difference.  Removed mention of it completely from the 
> Pattern Matching portion of the documentation - that section has enough going 
> on.
>
> I also add "Same as" references for the two pairs of entries.  Not married to 
> them but they do seem warranted; having Pattern Matching be required reading 
> to make that connection seems undesirable.
>

can not build docs based on your v4-0001.

your v4-0001-v3-0001-substring.patch is not the same as my
v3-0001-add-argument-name-to-function-substring-and-subst.patch

for example:
-   
+   
 substring (
string text
FROM pattern
text FOR escape
text )
-text
+text

because I tried
git am v3-0001-add-argument-name-to-function-substring-and-subst.patch.
patch -p1 < v4-0002-v3-0002-delta.patch
Then there are several places that differ, it doesn't seem easy to
resolve the difference.



Can you make sure v4-0001-v3-0001-substring.patch the same as
v3-0001-add-argument-name-to-function-substring-and-subst.patch,
then I can review your delta patch.




Re: Statistics Import and Export

2025-03-31 Thread Robert Treat
On Mon, Mar 31, 2025 at 10:33 PM Nathan Bossart
 wrote:
> On Mon, Mar 31, 2025 at 11:11:47AM -0400, Corey Huinker wrote:
> Regarding whether pg_dump should dump statistics by default, my current
> thinking is that it shouldn't, but I think we _should_ have pg_upgrade
> dump/restore statistics by default because that is arguably the most
> important use-case.  This is more a gut feeling than anything, so I reserve
> the right to change my opinion.
>

I did some mental exercises on a number of different use cases and
scenarios (pagila work, pgextractor-type stuff, backups, etc.) and I
couldn't come up with any strong arguments against including the stats
by default, mainly because when a process needs to care about the output
of pg_dump, most cases already require enough specificity that this
wouldn't actually break anything.

Still, I am sympathetic to Greg's earlier concerns on the topic, but
would also agree it seems like a clear win for pg_upgrade, so I think
our gut feelings might actually be aligned on this one ;-)


Robert Treat
https://xzilla.net




Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).

2025-03-31 Thread jian he
On Mon, Mar 31, 2025 at 11:27 PM Fujii Masao
 wrote:
>
> Regarding the patch, here are some review comments:
>
> +   errmsg("cannot copy from 
> materialized view when the materialized view is not populated"),
>
> How about including the object name for consistency with
> other error messages in BeginCopyTo(), like this?
>
> errmsg("cannot copy from unpopulated materialized view \"%s\"",
>RelationGetRelationName(rel)),
>
>
> +   errhint("Use the REFRESH 
> MATERIALIZED VIEW command populate the materialized view first."));
>
> There seems to be a missing "to" just after "command".
> Should it be "Use the REFRESH MATERIALIZED VIEW command to
> populate the materialized view first."? Or we could simplify
> the hint to match what SELECT on an unpopulated materialized
> view logs: "Use the REFRESH MATERIALIZED VIEW command.".
>
based on your suggestion, i changed it to:

if (!RelationIsPopulated(rel))
ereport(ERROR,
errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot copy from unpopulated
materialized view \"%s\"",
RelationGetRelationName(rel)),
errhint("Use the REFRESH MATERIALIZED VIEW
command to populate the materialized view first."));


>
> The copy.sgml documentation should clarify that COPY TO can
> be used with a materialized view only if it is populated.
>
"COPY TO can be used only with plain tables, not views, and does not
copy rows from child tables or child partitions"
i changed it to
"COPY TO can be used with plain tables and materialized views, not
regular views, and does not copy rows from child tables or child
partitions"

Another alternative wording I came up with:
"COPY TO can only be used with plain tables and materialized views,
not regular views. It also does not copy rows from child tables or
child partitions."
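
To make the documented behaviour concrete, the intended semantics are roughly
as follows (an illustrative sketch; the matview names here are made up, the
regression test in the attached patch uses matview1/matview2):

CREATE MATERIALIZED VIEW mv_populated AS SELECT 1 AS id;
CREATE MATERIALIZED VIEW mv_unpopulated AS SELECT 1 AS id WITH NO DATA;

COPY mv_populated TO STDOUT;                  -- allowed with the patch
COPY mv_unpopulated TO STDOUT;                -- error: unpopulated materialized view
COPY (SELECT * FROM mv_populated) TO STDOUT;  -- pre-patch workaround, still works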


>
> Wouldn't it be beneficial to add a regression test to check
> whether COPY matview TO works as expected?
sure.
From 3e404817827a58721cf8966080492f1254ea06cb Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 1 Apr 2025 11:11:32 +0800
Subject: [PATCH v2 1/1] COPY materialized_view TO

generally `COPY table TO` is faster than `COPY (query)`.
since populated materialized view have physical storage, so
this can use table_beginscan, table_endscan to scan a table.

context: https://postgr.es/m/8967.1353167...@sss.pgh.pa.us
context: https://www.postgresql.org/message-id/flat/20121116162558.90150%40gmx.com
discussion: https://postgr.es/m/CACJufxHVxnyRYy67hiPePNCPwVBMzhTQ6FaL9_Te5On9udG=y...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5533/
---
 doc/src/sgml/ref/copy.sgml  |  4 ++--
 src/backend/commands/copyto.c   | 13 -
 src/test/regress/expected/copy2.out | 11 +++
 src/test/regress/sql/copy2.sql  |  8 
 4 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index df093da97c5..c3107488c81 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -520,8 +520,8 @@ COPY count
   Notes
 

-COPY TO can be used only with plain
-tables, not views, and does not copy rows from child tables
+COPY TO can be used with plain
+tables and materialized views, not regular views, and does not copy rows from child tables
 or child partitions.  For example, COPY table TO copies
 the same rows as SELECT * FROM ONLY rd_rel->relkind == RELKIND_MATVIEW)
-			ereport(ERROR,
-	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-	 errmsg("cannot copy from materialized view \"%s\"",
-			RelationGetRelationName(rel)),
-	 errhint("Try the COPY (SELECT ...) TO variant.")));
+		{
+			if (!RelationIsPopulated(rel))
+ereport(ERROR,
+		errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		errmsg("cannot copy from unpopulated materialized view \"%s\"",
+RelationGetRelationName(rel)),
+		errhint("Use the REFRESH MATERIALIZED VIEW command to populate the materialized view first."));
+		}
 		else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 			ereport(ERROR,
 	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 64ea33aeae8..f7aa55e1691 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -929,3 +929,14 @@ truncate copy_default;
 -- DEFAULT cannot be used in COPY TO
 copy (select 1 as test) TO stdout with (default '\D');
 ERROR:  COPY DEFAULT cannot be used with COPY TO
+-- COPY TO with materialized view
+CREATE MATERIALIZED VIEW matview1 AS SELECT 1 as id;
+CREATE MATERIALIZED VIEW matview2 AS SELECT 1 as id WITH NO DATA;
+copy matview1(id) TO stdout with (header);
+id
+1
+copy matview2 TO stdout with (header);
+ERROR:  cannot copy from unpopulated 

Re: speedup COPY TO for partitioned table.

2025-03-31 Thread jian he
On Mon, Mar 31, 2025 at 4:05 PM Kirill Reshke  wrote:
>
> Hi!
> I reviewed v7. Maybe we should add a multi-level partitioning case
> into copy2.sql regression test?
>

sure.

> I also did quick benchmarking for this patch:
>
>  DDL
>
>  create table ppp(i int) partition by range (i);
>
> genddl.sh:
>
> for i in `seq 0 200`; do echo "create table p$i partition of ppp for
> values from ( $((10 * i)) ) to ( $((10 * (i + 1))) ); "; done
>
> === insert data data:
> insert into ppp select i / 1000  from generate_series(0, 200)i;
>
> === results:
>
>
> for 201 rows speedup is 1.40 times : 902.604 ms (patches) vs
> 1270.648 ms (unpatched)
>
> for 402 rows speedup is 1.20 times : 1921.724 ms (patches) vs
> 2343.393 ms (unpatched)
>
> for 804 rows speedup is 1.10 times : 3932.361 ms (patches) vs
> 4358.489ms (unpatched)
>
> So, this patch indeed speeds up some cases, but with larger tables
> speedup becomes negligible.
>

Thanks for doing the benchmark.
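
As a point of reference for the benchmark above: on master the partitioned
table has to go through the query form, while the patch accepts the plain
form directly (sketch using the ppp table from the benchmark setup):

COPY (SELECT * FROM ppp) TO STDOUT;   -- works on master
COPY ppp TO STDOUT;                   -- rejected on master (with a hint to use
                                      -- the query form); supported with the patch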
From b27371ca4ff132e7d2803406f9e3f371c51c96df Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 1 Apr 2025 08:56:22 +0800
Subject: [PATCH v8 1/1] support COPY partitioned_table TO

CREATE TABLE pp (id  INT, val int ) PARTITION BY RANGE (id);
create table pp_1 (val int, id int);
create table pp_2 (val int, id int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10);
insert into pp select g, 10 + g from generate_series(1,9) g;
copy pp to stdout(header);

The above case is much slower (around 25% in some cases) than
``COPY (select * from pp) to stdout(header);``,
because of column remapping.  But this is still a new
feature, since master does not support ``COPY partitioned_table TO``.

Reviewed-by: vignesh C 
Reviewed-by: David Rowley 
Reviewed-by: Melih Mutlu 
Reviewed-by: Kirill Reshke 

discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=i...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5467/
---
 doc/src/sgml/ref/copy.sgml  |   8 +-
 src/backend/commands/copyto.c   | 143 ++--
 src/test/regress/expected/copy2.out |  20 
 src/test/regress/sql/copy2.sql  |  17 
 4 files changed, 157 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index df093da97c5..f86e0b7ec35 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -521,15 +521,15 @@ COPY count
 

 COPY TO can be used only with plain
-tables, not views, and does not copy rows from child tables
-or child partitions.  For example, COPY COPY TO can be used with partitioned tables.
+For example, in a table inheritance hierarchy, COPY table TO copies
 the same rows as SELECT * FROM ONLY table.
 The syntax COPY (SELECT * FROM table) TO ... can be used to
-dump all of the rows in an inheritance hierarchy, partitioned table,
-or view.
+dump all of the rows in an inheritance hierarchy, or view.

 

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 84a3f3879a8..6fc940bddbc 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -19,6 +19,8 @@
 #include 
 
 #include "access/tableam.h"
+#include "access/table.h"
+#include "catalog/pg_inherits.h"
 #include "commands/copyapi.h"
 #include "commands/progress.h"
 #include "executor/execdesc.h"
@@ -82,6 +84,7 @@ typedef struct CopyToStateData
 	List	   *attnumlist;		/* integer list of attnums to copy */
 	char	   *filename;		/* filename, or NULL for STDOUT */
 	bool		is_program;		/* is 'filename' a program to popen? */
+	List	   *partitions;		/* oid list of partition oid for copy to */
 	copy_data_dest_cb data_dest_cb; /* function for writing data */
 
 	CopyFormatOptions opts;
@@ -116,6 +119,8 @@ static void CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot);
 static void CopyAttributeOutText(CopyToState cstate, const char *string);
 static void CopyAttributeOutCSV(CopyToState cstate, const char *string,
 bool use_quote);
+static void CopyThisRelTo(CopyToState cstate, Relation rel,
+		  Relation root_rel, uint64 *processed);
 
 /* built-in format-specific routines */
 static void CopyToTextLikeStart(CopyToState cstate, TupleDesc tupDesc);
@@ -643,6 +648,8 @@ BeginCopyTo(ParseState *pstate,
 		PROGRESS_COPY_COMMAND_TO,
 		0
 	};
+	List	   *children = NIL;
+	List	   *scan_oids = NIL;
 
 	if (rel != NULL && rel->rd_rel->relkind != RELKIND_RELATION)
 	{
@@ -670,11 +677,35 @@ BeginCopyTo(ParseState *pstate,
 	 errmsg("cannot copy from sequence \"%s\"",
 			RelationGetRelationName(rel;
 		else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-	 errmsg("cannot copy from partitioned table \"%s\"",
-			RelationGetRelationName(rel)),
-	 errhint("Try the COPY (SELECT ...) TO variant.")));
+		{
+			children = find_all_

Re: [PATCH] PGSERVICEFILE as part of a normal connection string

2025-03-31 Thread Andrew Jackson
Hi,

I am working on a feature adjacent to the connection service functionality
and noticed some issues with the tests introduced in this thread. Basically,
they incorrectly invoke the append_to_file Perl function, passing it multiple
strings when it only takes a single string to append. This
caused the generated service files to not actually contain any connection
parameters. The tests were only passing because the connect_ok Perl
function set the connection parameters as environment variables, which
covered up the malformed connection service files.

The attached patch is much stricter: it creates a dummy node that is never
started, routes all queries through it, and tests that the connection
service file correctly overrides the environment variables set by the
dummy node's query functions.

Thanks,
Andrew Jackson

On Mon, Mar 31, 2025, 4:01 PM Ryo Kanbayashi 
wrote:

> On Sat, Mar 22, 2025 at 4:46 PM Michael Paquier 
> wrote:
> >
> > On Thu, Mar 20, 2025 at 06:16:44PM +0900, Ryo Kanbayashi wrote:
> > > Sorry, I found a miss on 006_service.pl.
> > > Fixed patch is attached...
> >
> > Please note that the commit fest app needs all the patches of a a set
> > to be posted in the same message.  In this case, v2-0001 is not going
> > to get automatic test coverage.
> >
> > Your patch naming policy is also a bit confusing.  I would suggest to
> > use `git format-patch -vN -2`, where N is your version number.  0001
> > would be the new tests for service files, and 0002 the new feature,
> > with its own tests.
>
> All right.
> I attached patches generated with your suggested command :)
>
> > +if ($windows_os) {
> > +
> > +# Windows: use CRLF
> > +print $fh "[my_srv]",   "\r\n";
> > +print $fh join( "\r\n", split( ' ', $node->connstr ) ), "\r\n";
> > +}
> > +else {
> > +# Non-Windows: use LF
> > +print $fh "[my_srv]", "\n";
> > +print $fh join( "\n", split( ' ', $node->connstr ) ), "\n";
> > +}
> > +close $fh;
> >
> > That's duplicated.  Let's perhaps use a $newline variable and print
> > into the file using the $newline?
>
> OK.
> I reflected above comment.
>
> > Question: you are doing things this way in the test because fgets() is
> > what is used by libpq to retrieve the lines of the service file, is
> > that right?
>
> No. I'm doing above way simply because line ending code of service file
> wrote by users may become CRLF in Windows platform.
>
> > Please note that the CI is failing.  It seems to me that you are
> > missing a done_testing() at the end of the script.  If you have a
> > github account, I'd suggest to set up a CI in your own fork of
> > Postgres, this is really helpful to double-check the correctness of a
> > patch before posting it to the lists, and saves in round trips between
> > author and reviewer.  Please see src/tools/ci/README in the code tree
> > for details.
>
> Sorry.
> I'm using Cirrus CI with GitHub and I checked passing the CI.
> But there were misses when I created patch files...
>
> > +# Copyright (c) 2023-2024, PostgreSQL Global Development Group
> >
> > These dates are incorrect.  Should be 2025, as it's a new file.
>
> OK.
>
> > +++ b/src/interfaces/libpq/t/007_servicefile_opt.pl
> > @@ -0,0 +1,100 @@
> > +# Copyright (c) 2023-2024, PostgreSQL Global Development Group
> >
> > Incorrect date again in the second path with the new feature.  I'd
> > suggest to merge all the tests in a single script, with only one node
> > initialized and started.
>
> OK.
> Additional test scripts have been merged to a single script ^^ b
>
> ---
> Great regards,
> Ryo Kanbayashi
>
From 0d732ee8edbb16132b95f35775388528fa9003d8 Mon Sep 17 00:00:00 2001
From: AndrewJackson 
Date: Mon, 31 Mar 2025 15:18:48 -0500
Subject: [PATCH] libpq: Fix TAP tests for service files and names

This commit builds on the tests that were added in a prior commit
that tests the connection service file functionality. The tests are
fixed to correctly invoke the append_to_file subroutine which only
takes one argument and not multiple. The test also runs the statements
through a dummy database to ensure that the options from the service
file are actually being picked up and just passing because they are
defaulting to the connection environment variables that are set in
the connect_ok function.

Author: Andrew Jackson 
---
 src/interfaces/libpq/t/006_service.pl | 38 ++-
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/src/interfaces/libpq/t/006_service.pl b/src/interfaces/libpq/t/006_service.pl
index d3ecfa6b6fc..e0d18d72359 100644
--- a/src/interfaces/libpq/t/006_service.pl
+++ b/src/interfaces/libpq/t/006_service.pl
@@ -9,6 +9,9 @@ use Test::More;
 # This tests scenarios related to the service name and the service file,
 # for the connection options and their environment variables.

+my $dummy_node = PostgreSQL::Test::Cluster->new('dummy_node');
+$dummy_node->init;
+
 my $node =

Re: pgsql: Add support for OAUTHBEARER SASL mechanism

2025-03-31 Thread Christoph Berg
Re: To Daniel Gustafsson
> > Add support for OAUTHBEARER SASL mechanism
> 
> Debian still has this experimental port with a GNU userland and a
> FreeBSD kernel called kfreebsd.

Sorry, this part was nonsense: kfreebsd was actually terminated, and
obviously I didn't even read the port's name. The failing port (still
experimental and care-is-optional) is hurd-amd64.

The bug is the same, though.

Christoph




Re: Using read stream in autoprewarm

2025-03-31 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Mon, 31 Mar 2025 at 17:42, Melanie Plageman
 wrote:
>
> On Sun, Mar 30, 2025 at 10:01 AM Nazir Bilal Yavuz  wrote:
> >
> > > Some review feedback on your v4: I don't think we need the
> > > rs_have_free_buffer per_buffer_data. We can just check
> > > have_free_buffers() in both the callback and main loop in
> > > autoprewarm_database_main(). I also think you want a comment about why
> > > first_block is needed. And I think you need to guard the
> > > read_stream_end() after the loop -- what if we never made a read
> > > stream because we errored out for all the block's relations or
> > > something?
> >
> > All of these are addressed. One extra thing I noticed is we were not
> > checking if blocknum < number_of_block_in_relation at the first_block
> > case in the stream callback, this is fixed now.
>
> I'm wondering why you need to check if have_free_buffer() in the else
> branch after getting the buffer from the read stream API. Can't you
> just put it back in the for loop condition? Seems like it would have
> the same effect.
>
> -   for (; pos < apw_state->prewarm_stop_idx; pos++)
> +   for (; pos < apw_state->prewarm_stop_idx && have_free_buffer(); pos++)
> {
> BlockInfoRecord *blk = &block_info[pos];
> Buffer  buf;
> @@ -640,9 +640,6 @@ autoprewarm_database_main(Datum main_arg)
> apw_state->prewarmed_blocks++;
> ReleaseBuffer(buf);
> }
> -   /* There are no free buffers left in shared buffers, break the loop. 
> */
> -   else if (!have_free_buffer())
> -   break;
>

You are right, done. Attached as v6.

> Looking at the code some more, I feel stuck on my original point about
> incrementing the position in two places.
> AFAICT, you are going to end up going through the array twice with this 
> design.
> Because you don't set the pos variable in autoprewarm_database_main()
> from the p->pos variable which the read stream callback is
> incrementing, if the read stream callback increments p->pos it a few
> positions yielding those blocks to the read stream machinery to read,
> you are then going to iterate over those positions in the array again
> in the autoprewarm_database_main() loop.
>
> I think you can get around this by setting pos from p->pos in
> autoprewarm_database_main() after read_stream_next_buffer(). Or by
> using p->pos in the loop in autoprewarm_database_main() (which is
> basically what Andres suggested). I'm not sure, though, if this has
> any problems. Like not closing a relation in the right place.

I worked on an alternative approach and refactored the code a bit. It does
not traverse the list twice, and I think the code is now better suited
to using read streams. I simply get how many blocks have been processed by
the read stream and move the list forward by that number, so the actual
loop skips those blocks. This approach is attached with the 'alternative'
prefix.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft


v6-0001-Optimize-autoprewarm-with-read-streams.patch
Description: Binary data


v6-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch
Description: Binary data


alternative_0001-Optimize-autoprewarm-with-read-streams.patch
Description: Binary data


alternative_0002-Count-free-buffers-at-the-start-of-the-autoprewarm.patch
Description: Binary data


Re: Periodic FSM vacuum doesn't happen in one-pass strategy vacuum.

2025-03-31 Thread Melanie Plageman
On Mon, Mar 31, 2025 at 6:03 PM Masahiko Sawada  wrote:
>
> With commit c120550edb86, If we got the cleanup lock on the page,
> lazy_scan_prune() marks dead item IDs directly to LP_UNUSED. So the
> following check with has_lpdead_items made the periodic FSM vacuum in
> the one-pass strategy vacuum no longer being triggered:
>
> if (got_cleanup_lock && vacrel->nindexes == 0 && has_lpdead_items 
> &&
> blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
> {
> FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
> blkno);
> next_fsm_block_to_vacuum = blkno;
> }
>
> Before c120550edb86, since we marked dead item IDs to LP_DEAD once
> even in the one-pass strategy vacuum, we used to call
> lazy_vacuum_heap_page() to vacuum the page and to call
> FreeSpaceMapVacuum() periodically, so we had that check.

Whoops, yea, I had a feeling that something wasn't right here when I
saw that thread a couple weeks ago about FSM vacuuming taking forever
at the end of a vacuum and then I remember looking at the code and
thinking why is VACUUM_FSM_EVERY_PAGES named that if it only is used
when there are no indexes or a multi-pass vacuum.

I even made a comment in [1] that we would only do FSM_EVERY_PAGES
when we have multi-pass vacuum -- which is even less after 17.

Isn't it ironic that I actually broke it.

> I've attached a patch to fix it.

Looks like you forgot

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_aXqoj2Vfqu3yzscsTX%3D2nPQ4y-aadCNz6nJP_o12GyxA%40mail.gmail.com




Re: pgsql: Add memory/disk usage for Window aggregate nodes in EXPLAIN.

2025-03-31 Thread David Rowley
On Tue, 1 Apr 2025 at 09:40, Christoph Berg  wrote:
> =# explain (analyze,buffers off,costs off) select sum(n) over() from 
> generate_series(1,2048) a(n);
> QUERY PLAN
> ──
>  WindowAgg (actual time=2.073..2.686 rows=2048.00 loops=1)
>Window: w1 AS ()
>Storage: Disk  Maximum Storage: 65kB

Thank you for testing that. I've just pushed a patch to bump it up to 2500.

I suspect the buildfarm didn't catch this due to the tuplestore
consuming enough memory in MEMORY_CONTEXT_CHECKING builds.

David


Re: Test to dump and restore objects left behind by regression

2025-03-31 Thread Daniel Gustafsson
> On 28 Mar 2025, at 19:12, Alvaro Herrera  wrote:
> 
> On 2025-Mar-28, Tom Lane wrote:
> 
>> I think instead of going this direction, we really need to create a
>> separately-purposed script that simply creates "one of everything"
>> without doing anything else (except maybe loading a little data).
>> I believe it'd be a lot easier to remember to add to that when
>> inventing new SQL than to remember to leave something behind from the
>> core regression tests.  This would also be far faster to run than any
>> approach that involves picking a random subset of the core test
>> scripts.
> 
> FWIW this sounds closely related to what I tried to do with
> src/test/modules/test_ddl_deparse; it's currently incomplete, but maybe
> we can use that as a starting point.

Given where we are in the cycle, it seems to make sense to stick to using the
schedule we already have rather than invent a new process for generating it,
and work on that for 19?

--
Daniel Gustafsson





Re: add function argument name to substring and substr

2025-03-31 Thread David G. Johnston
On Tue, Mar 18, 2025 at 9:04 PM jian he  wrote:

>
> new patch attached.
>
>
I've done v4 with a delta patch.

Decided to standardize on calling the SQL Similar To regular expression
escape replaceable "escape" everywhere.

Instead of fully documenting the obsolete syntax I added a note explaining
the keyword choice difference.  Removed mention of it completely from the
Pattern Matching portion of the documentation - that section has enough
going on.

I also add "Same as" references for the two pairs of entries.  Not married
to them but they do seem warranted; having Pattern Matching be required
reading to make that connection seems undesirable.

David J.
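
For reference, the calls being documented look like this in their plain
function spelling (examples and expected results taken from the entries added
in the patch below):

SELECT substring('Thomas', 2, 3);              -- hom
SELECT substring('Thomas', '...$');            -- mas
SELECT substring('Thomas', '%#"o_a#"_', '#');  -- oma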
From b2f64615da9522427a2e2662b1d060ffed97088c Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Mon, 31 Mar 2025 14:32:12 -0700
Subject: [PATCH 1/2] v3 0001 substring

---
 doc/src/sgml/func.sgml   | 115 +--
 src/backend/catalog/system_functions.sql |   2 +-
 src/include/catalog/pg_proc.dat  |  12 +++
 3 files changed, 120 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c642f1ea4e..e4c95f1e88 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -2850,9 +2850,9 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
 substring ( string text SIMILAR pattern text ESCAPE escape text )
 text

-   
+   
 substring ( string text FROM pattern text FOR escape text )
-text
+text


 Extracts the first substring matching SQL regular expression;
@@ -3806,6 +3806,58 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in

   
 
+
+  
+   
+substring ( string text, pattern text )
+text
+   
+   
+Extracts the first substring matching POSIX regular expression; see
+.
+   
+   
+substring('Thomas', '...$')
+mas
+   
+  
+
+  
+   
+substring ( string text, pattern text, escape_character  text)
+text
+   
+   
+Extracts the first substring matching SQL regular expression;
+see .
+   
+   
+substring('Thomas', '%#"o_a#"_', '#')
+oma
+   
+  
+
+  
+   
+
+ substring
+
+substring ( string text, start integer , count integer  )
+text
+   
+   
+Extracts the substring of string starting at
+the start'th character,
+and stopping after count characters if that is
+specified.
+   
+
+   
+substring('Thomas', 2, 3)
+hom
+   
+  
+
   

 
@@ -4811,6 +4863,27 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 \x5678

   
+
+  
+   
+
+ substring
+
+substring ( bytes bytea, start integer , count integer  )
+bytea
+   
+   
+Extracts the substring of bytes starting at
+the start'th byte,
+and stopping after count bytes if that is
+specified.
+   
+   
+substring('\x1234567890'::bytea, 3, 2)
+\x5678
+   
+  
+
 

   
@@ -5353,6 +5426,26 @@ cast(-1234 as bytea)   \xfb2e

   
 
+  
+   
+
+ substring
+
+substring ( bits bit, start integer , count integer  )
+bit
+   
+   
+Extracts the substring of bits starting at
+the start'th bit,
+and stopping after count bits if that is
+specified.
+   
+   
+substring(B'11001011', 3, 2)
+00
+   
+  
+
   

 
@@ -5816,7 +5909,7 @@ substring(string from pattern
 or as a plain three-argument function:
 
-substring(string, pattern, escape-character)
+substring(string, pattern, escape_character)
 
 As with SIMILAR TO, the
 specified pattern must match the entire data string, or else the
@@ -6020,11 +6113,17 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL
 
 
- The substring function with two parameters,
- substring(string from
- pattern), provides extraction of a
- substring
- that matches a POSIX regular expression pattern.  It returns null if
+ The substring function with two parameters provides extraction of a
+ substring that matches a POSIX regular expression pattern.
+ It has syntax:
+
+substring(string from pattern)
+
+ It can also written as a plain two-argument function:
+
+substring(string, pattern)
+
+ It returns null if
  there is no match, otherwise the first portion of the text that matched the
  pattern.  But if the pattern contains any parentheses, the portion
  of the text that matched the first parenthesized subexpression (the
diff --git a/src/backend/catalog/system_functions.s

Re: tzdata 2025b

2025-03-31 Thread Tom Lane
Masahiko Sawada  writes:
> tzdata 2025b has been released on 3/22[1]. Do we need to update the
> tzdata.zi file on HEAD and backbranches?

Yup, eventually, but I don't normally worry about it until we are
approaching a release date.  tzdata changes often come in bunches
around the spring and fall equinoxes, which is when governments
tend to rush out DST changes without thinking about lead times :-(.
So it's entirely possible that 2025b will already be obsolete by May.
(See for example ecbac3e6e and d8fc45bd0, when I thought I'd waited
long enough and was wrong.)

The situation might be different with a tzdata release that affects
our regression test results, but I don't believe this one does.

regards, tom lane




Re: RFC: Logging plan of the running query

2025-03-31 Thread torikoshia

On 2025-04-01 04:24, Robert Haas wrote:
On Fri, Mar 21, 2025 at 8:40 AM torikoshia  
wrote:

Rebased it again.


Hi,

I apologize for not having noticed this thread sooner.


Thank you for your checking! No worries.


I just became
aware of it as a result of a discussion in the hacking Discord server.



I think this has got a lot of overlap with the progressive EXPLAIN
patch from Rafael Castro, which I have been reviewing, but I am not
sure why we have two different patch sets here. Why is that?


This thread proposes a feature to log the results of a plain EXPLAIN 
(without ANALYZE). From Rafael Castro's presentation materials at 
pgconf.eu 2024 [1], I understand that he built a new patch on top of 
this one, adding functionality to track execution progress and display 
results in a view.
He also mentioned that he used this patch's way of wrapping plan nodes 
during development to solve a problem [2].

I think that's why there are overlaps between the two.

Previously, Rafael proposed a patch in this thread that added execution 
progress tracking. However, I felt that expanding the scope could make 
it harder to get the patch reviewed or committed. So, I suggested first 
completing a feature that only retrieves the execution plan of a running 
query, and then developing execution progress tracking afterward[3].


As far as I remember, he did not respond to this thread after my 
suggestion but instead started a new thread later[4]. Based on that, I 
assume he would not have agreed with my proposed approach.


Rafael, please let us know if there are any misunderstandings in the 
above.



In this thread, I aimed to output the plan without requiring prior 
configuration if possible. However, based on your comment on Rafael's 
thread[5], it seems that this approach would be difficult:


  One way in which this proposal seems safer than previous proposals is
  that previous proposals have involved session A poking session B and
  trying to get session B to emit an EXPLAIN on the fly with no prior
  setup. That would be very useful, but I think it's more difficult and
  more risky than this proposal, where all the configuration happens in
  the session that is going to emit the EXPLAIN output.

I was considering whether to introduce a GUC in this patch to allow for 
prior setup before outputting the plan or to switch to Rafael's patch 
after reviewing its details. However, since there isn’t much time left 
before the feature freeze, if you have already reviewed Rafael's patch 
and there is a chance it could be committed, it would be better to focus 
on that.



[1] https://www.youtube.com/watch?v=6ahTb-7C05c (mentions this patch 
after the 12-minute mark)
[2] 
https://www.postgresql.org/message-id/CAG0ozMo30smtXXOR8bSCbhaZAQHo4%3DezerLitpERk85Q0ga%2BFw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/c161b5e7e1888eb9c9eb182a7d9dcf89%40oss.nttdata.com
[4] 
https://www.postgresql.org/message-id/flat/CAG0ozMo30smtXXOR8bSCbhaZAQHo4%3DezerLitpERk85Q0ga%2BFw%40mail.gmail.com#4b261994a0d44649ded6e1434a00e134
[5] 
https://www.postgresql.org/message-id/CA%2BTgmoaD985%2BVLwR93c8PjSaoBqxw72Eu7pfBJcArzhjJ71aRw%40mail.gmail.com


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.




Deadlock detected while executing concurrent insert queries on same table

2025-03-31 Thread Sri Keerthi
Hello community,


I recently encountered a deadlock in PostgreSQL while performing concurrent
INSERT statements on the same table from two separate sessions.

The error message explicitly mentions that the deadlock occurred while
inserting an index tuple.

There were no explicit transactions (BEGIN/COMMIT); the inserts were
executed as standalone statements.

*PG version : 11.4*


Here’s the *table definition* :


Table "auditlog"

     Column     |            Type             | Collation | Nullable | Default
----------------+-----------------------------+-----------+----------+---------
 id             | bigint                      |           | not null |
 audit_logid    | bigint                      |           | not null |
 recordid       | bigint                      |           |          |
 recordname     | text                        |           |          |
 module         | character varying(50)       |           | not null |
 actioninfo     | citext                      |           | not null |
 relatedid      | bigint                      |           |          |
 related_name   | character varying(255)      |           |          |
 accountid      | bigint                      |           |          |
 accountname    | character varying(255)      |           |          |
 doneby         | character varying(255)      |           | not null |
 userid         | bigint                      |           |          |
 audit_time     | timestamp without time zone |           | not null |
 isauditlogdata | boolean                     |           | not null |
 details        | citext                      |           |          |
 auditcategory  | integer                     |           | not null |
 operationtype  | integer                     |           | not null |
 source         | integer                     |           | not null |

Indexes:
"auditlog_pkey" PRIMARY KEY, btree (id, audit_time, audit_logid)
"auditlog_idx1" btree (recordid)
"auditlog_idx2" btree (audit_logid DESC)
"auditlog_idx3" btree (userid)
"auditlog_idx4" btree (relatedid)
"auditlog_idx5" gist (actioninfo gist_trgm_ops)


 and exact *error message* from the logs :


*ERROR*:  INSERT failed, ERROR:  deadlock detected

DETAIL:  Process 3841267 waits for ShareLock on transaction
185820512; blocked by process 3841268.

Process 3841268 waits for ShareLock on transaction 185820513;
blocked by process 3841267.

HINT:  See server log for query details.

CONTEXT:  while inserting index tuple (31889,32) in relation
“auditlog”

*Insert Query1 :*
INSERT INTO
auditlog 
(ID,AUDIT_LOGID,RECORDID,RECORDNAME,MODULE,ACTIONINFO,RELATEDID,RELATED_NAME,ACCOUNTID,ACCOUNTNAME,DONEBY,USERID,AUDIT_TIME,ISAUDITLOGDATA,DETAILS,AUDITCATEGORY,OPERATIONTYPE,SOURCE)
VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18)

*Insert Query2 :*
INSERT INTO auditlog (id, audit_logid, recordid, recordname, module,
actioninfo, relatedid, related_name, accountid, accountname, doneby,
userid, audit_time, isauditlogdata, details, auditcategory, operationtype,
source) VALUES ('67016721806'::bigint, '38976328846849'::bigint,
NULL::bigint, NULL::text, 'Tasks'::character varying(50),
'Deleted'::citext, NULL::bigint, NULL::character varying(255),
NULL::bigint, NULL::character varying(255), 'Technologies'::character
varying(255), '3470005430253334'::bigint, '2024-03-24 14:39:06'::timestamp
without time zone, true, NULL::citext, 0, 11, 20)


Could this be a bug, or is it expected behaviour under certain conditions?

I was unable to reproduce this issue again. Any insights or guidance on how
to analyse this further would be greatly appreciated.
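
(For context, one generic pattern that produces a "deadlock detected ... while
inserting index tuple" error is two transactions each inserting a key that
conflicts with an uncommitted row of the other, in opposite order, so that each
ends up waiting for the other's transaction ID. The sketch below uses explicit
transactions and a made-up single-column primary key just to make the ordering
visible; it is not a claim about what happened here, where each INSERT was a
standalone statement.)

-- illustrative table with a unique index
CREATE TABLE t (id int PRIMARY KEY);

-- session 1                      -- session 2
BEGIN;
INSERT INTO t VALUES (1);
                                  BEGIN;
                                  INSERT INTO t VALUES (2);
INSERT INTO t VALUES (2);         -- session 1 waits for session 2's xid
                                  INSERT INTO t VALUES (1);
                                  -- deadlock detected in one of the sessions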



Regards,

Sri Keerthi.


Re: support virtual generated column not null constraint

2025-03-31 Thread jian he
On Fri, Mar 28, 2025 at 10:06 PM Peter Eisentraut  wrote:
>
> On 24.03.25 04:26, jian he wrote:
> > rebase, and some minor code comments change.
>
> I have committed this.
>
In an earlier thread, I also posted a patch for supporting virtual
generated columns over domain types.
The code is somewhat similar,
so I am posting the remaining patches to this thread.


v7-0001
We need to compute the generation expression for domains with constraints,
thus ExecComputeStoredGenerated is renamed to ExecComputeGenerated.


v7-0002
Soft error variants of ExecPrepareExpr and ExecInitExpr,
for soft error processing of CoerceToDomain.
While validating existing domain data, we don't want error messages like
"value for domain d2 violates check constraint "d2_check"";
we want something like:
+ERROR:  column "b" of table "gtest24" contains values that violate
the new constraint



v7-0003
Supports virtual generated columns over domain types.
If the domain has constraints then we need to compute the virtual
generation expression;
that happens mainly within ExecComputeGenerated.

If a virtual generated column's type is a domain with constraints, then
ALTER DOMAIN ADD CONSTRAINT needs to revalidate these virtual
generation expressions again.
So in validateDomainCheckConstraint and validateDomainNotNullConstraint
we need to fetch the generation expression (build_generation_expression),
compile the generation expression (ExecPrepareExprSafe),
and evaluate it (ExecEvalExprSwitchContext).


I also posted patch summary earlier at [1]
[1] 
https://postgr.es/m/cacjufxht4r1oabzequvjb6dhrig7k-qsr6lee53q_nli9pf...@mail.gmail.com
From 346dbce94c008a1ec9447ca65c7b7fcd0f8428d6 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 31 Mar 2025 15:02:46 +0800
Subject: [PATCH v7 3/3] domain over virtual generated column.

Domains that don't have constraints work just fine.
Domain constraint checks happen within ExecComputeGenerated:
we compute the virtual generated column in ExecComputeGenerated and discard the value.
If the value cannot be coerced to the domain then we error out.

domain_with_constraint can be the element type of a range, multirange, array, or composite.
To be bulletproof, if the column type is not of TYPTYPE_BASE or it is an array type,
then we do actually compute the generated expression.
This can have a negative performance impact for INSERTs.

The following two queries show that in pg_catalog most system types are TYPTYPE_BASE,
so I guess this compromise is fine.

select count(*),typtype from pg_type where typnamespace = 11 group by 2;
select typname, typrelid, pc.reltype, pc.oid, pt.oid
from pg_type pt join pg_class pc on pc.oid = pt.typrelid
where pt.typnamespace = 11 and pt.typtype = 'c' and pc.reltype = 0;

ALTER DOMAIN ADD CONSTRAINT variant is supported.
Domains with default values are supported, but a virtual generated column already has
a generation expression, so the domain default expression doesn't matter.

ALTER TABLE ADD VIRTUAL GENERATED COLUMN needs a table scan to verify
that the new column's generation expression values
satisfy the domain constraints,
but no table rewrite is needed!

discussion: https://postgr.es/m/cacjufxharqysbdkwfmvk+d1tphqwwtxwn15cmuuatyx3xhq...@mail.gmail.com
---
 src/backend/catalog/heap.c|  13 +-
 src/backend/commands/copyfrom.c   |   5 +-
 src/backend/commands/tablecmds.c  |  48 +++-
 src/backend/commands/typecmds.c   | 209 +++---
 src/backend/executor/nodeModifyTable.c|  88 ++--
 src/include/catalog/heap.h|   1 -
 src/test/regress/expected/fast_default.out|   4 +
 .../regress/expected/generated_virtual.out|  65 +-
 src/test/regress/sql/fast_default.sql |   4 +
 src/test/regress/sql/generated_virtual.sql|  47 +++-
 10 files changed, 398 insertions(+), 86 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index b807ab8..5519690d85c 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -508,7 +508,7 @@ CheckAttributeNamesTypes(TupleDesc tupdesc, char relkind,
 		   TupleDescAttr(tupdesc, i)->atttypid,
 		   TupleDescAttr(tupdesc, i)->attcollation,
 		   NIL, /* assume we're creating a new rowtype */
-		   flags | (TupleDescAttr(tupdesc, i)->attgenerated == ATTRIBUTE_GENERATED_VIRTUAL ? CHKATYPE_IS_VIRTUAL : 0));
+		   flags);
 	}
 }
 
@@ -583,17 +583,6 @@ CheckAttributeType(const char *attname,
 	}
 	else if (att_typtype == TYPTYPE_DOMAIN)
 	{
-		/*
-		 * Prevent virtual generated columns from having a domain type.  We
-		 * would have to enforce domain constraints when columns underlying
-		 * the generated column change.  This could possibly be implemented,
-		 * but it's not.
-		 */
-		if (flags & CHKATYPE_IS_VIRTUAL)
-			ereport(ERROR,
-	errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	errmsg("virtual generated column \"%s\" cannot have a domain type", attname));
-
 		/*
 		 * If it's a domain, recurse to check its base type.
 		 */
diff --git a/src

Re: Windows: openssl & gssapi dislike each other

2025-03-31 Thread Dave Page
On Wed, 26 Mar 2025 at 16:32, Daniel Gustafsson  wrote:

> > On 26 Mar 2025, at 17:15, Tom Lane  wrote:
> >
> > Daniel Gustafsson  writes:
> >> Thanks for review!  Pushed after making the above changes and taking it
> for
> >> another CI run.
> >
> > CF entry should be marked closed no?
>
> Yep, just wanted to allow some time to see how the buildfarm liked it
> before
> closing.  Done now, thanks for the reminder.
>

Thank you for dealing with this!

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
pgEdge: https://www.pgedge.com


Re: Memoize ANTI and SEMI JOIN inner

2025-03-31 Thread Richard Guo
On Mon, Mar 31, 2025 at 6:46 PM Andrei Lepikhov  wrote:
> On 3/31/25 11:03, Richard Guo wrote:
> > I reviewed this patch and I have some concerns about the following
> > code:
> >
> >  if (extra->inner_unique &&
> >  (inner_path->param_info == NULL ||
> >   bms_num_members(inner_path->param_info->ppi_serials) <
> >   list_length(extra->restrictlist)))
> >  return NULL;
> >
> > I understand that this check is used to ensure that the entire join
> > condition is parameterized in the case of unique joins, so that we can
> > safely mark the cache entry as complete after reading the first tuple.
> > However, ppi_clauses includes join clauses available from all outer
> > rels, not just the current outer rel, while extra->restrictlist only
> > includes the restriction clauses for the current join.  This means the
> > check could pass even if a restriction clause isn't parameterized, as
> > long as another join clause, which doesn't belong to the current join,
> > is included in ppi_clauses.

> Initially, I had the same concern. But if ppi_clauses contains a qual,
> it should refer to this join and, as a result, be in the
> extra->restrictlist, isn't it?

Hmm, I don't think so.  As I mentioned upthread, ppi_clauses includes
join clauses available from all outer rels, not just the current one.
So a clause included in ppi_clauses is not necessarily included in
extra->restrictlist.  As an example, consider

create table t (a int, b int);

explain (costs off)
select * from t t1 join t t2 join
lateral (select *, t1.a+t2.a as x from t t3 offset 0) t3
on t2.a = t3.a
  on t1.b = t3.b;
   QUERY PLAN
-
 Nested Loop
   ->  Seq Scan on t t2
   ->  Nested Loop
 ->  Seq Scan on t t1
 ->  Subquery Scan on t3
   Filter: ((t2.a = t3.a) AND (t1.b = t3.b))
   ->  Seq Scan on t t3_1
(7 rows)

t3's ppi_clauses includes "t2.a = t3.a" and "t1.b = t3.b", while t1/t3
join's restrictlist only includes "t1.b = t3.b".

Thanks
Richard




Re: Test to dump and restore objects left behind by regression

2025-03-31 Thread Ashutosh Bapat
On Mon, Mar 31, 2025 at 5:07 PM Ashutosh Bapat
 wrote:
>
The bug related to materialized views has been fixed, and now the test
passes even when we compare statistics from the dumped and restored
databases. Hence I am removing 0003. In the attached patchset I have also
addressed Vignesh's comment below.

On Thu, Mar 27, 2025 at 10:01 PM Ashutosh Bapat
 wrote:
>
> On Thu, Mar 27, 2025 at 6:01 PM vignesh C  wrote:
> >
> > 2) Should "`" be ' or " here, we   generally use "`" to enclose commands:
> > +# It is expected that regression tests, which create `regression` 
> > database, are
> > +# run on `src_node`, which in turn, is left in running state. A fresh node 
> > is
> > +# created using given `node_params`, which are expected to be the
> > same ones used
> > +# to create `src_node`, so as to avoid any differences in the databases.
>
> Looking at prologues or some other functions, I see that we don't add
> any decoration around the name of the argument. Hence dropped ``
> altogether. Will post it with the next set of patches.

-- 
Best Wishes,
Ashutosh Bapat
From 5ef4a15bf229d104028eac3a046636453e1e05fc Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Mon, 24 Mar 2025 11:21:12 +0530
Subject: [PATCH 2/2] Use only one format and make the test run default

According to Alvaro (and I agree with him), the test should be run by
default. Otherwise we only learn about a bug after a buildfarm
animal where it's enabled reports a failure. Further, testing only one
format may suffice, since all the formats have shown the same bugs so
far.

If we use --directory format we can use -j which reduces the time taken
by dump/restore test by about 12%.

This patch removes PG_TEST_EXTRA option as well as runs the test only in
directory format with parallelism enabled.

Note for committer: If we decide to accept this change, it should be
merged with the previous commit.
---
 doc/src/sgml/regress.sgml  | 12 
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 76 +-
 2 files changed, 25 insertions(+), 63 deletions(-)

diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index 237b974b3ab..0e5e8e8f309 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -357,18 +357,6 @@ make check-world PG_TEST_EXTRA='kerberos ldap ssl load_balance libpq_encryption'
   
  
 
-
-
- regress_dump_test
- 
-  
-   When enabled, src/bin/pg_upgrade/t/002_pg_upgrade.pl
-   tests dump and restore of regression database left behind by the
-   regression run. Not enabled by default because it is time and resource
-   consuming.
-  
- 
-

 
Tests for features that are not supported by the current build
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 8d22d538529..f7d5b96ecd2 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -268,16 +268,9 @@ else
 	# should be done in a separate TAP test, but doing it here saves us one full
 	# regression run.
 	#
-	# This step takes several extra seconds and some extra disk space, so
-	# requires an opt-in with the PG_TEST_EXTRA environment variable.
-	#
 	# Do this while the old cluster is running before it is shut down by the
 	# upgrade test.
-	if (   $ENV{PG_TEST_EXTRA}
-		&& $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/)
-	{
-		test_regression_dump_restore($oldnode, %node_params);
-	}
+	test_regression_dump_restore($oldnode, %node_params);
 }
 
 # Initialize a new node for the upgrade.
@@ -590,53 +583,34 @@ sub test_regression_dump_restore
 	$dst_node->append_conf('postgresql.conf', 'autovacuum = off');
 	$dst_node->start;
 
-	# Test all formats one by one.
-	for my $format ('plain', 'tar', 'directory', 'custom')
-	{
-		my $dump_file = "$tempdir/regression_dump.$format";
-		my $restored_db = 'regression_' . $format;
-
-		# Use --create in dump and restore commands so that the restored
-		# database has the same configurable variable settings as the original
-		# database and the plain dumps taken for comparsion do not differ
-		# because of locale changes. Additionally this provides test coverage
-		# for --create option.
-		$src_node->command_ok(
-			[
-'pg_dump', "-F$format", '--no-sync',
-'-d', $src_node->connstr('regression'),
-'--create', '-f', $dump_file
-			],
-			"pg_dump on source instance in $format format");
+	my $dump_file = "$tempdir/regression.dump";
 
-		my @restore_command;
-		if ($format eq 'plain')
-		{
-			# Restore dump in "plain" format using `psql`.
-			@restore_command = [ 'psql', '-d', 'postgres', '-f', $dump_file ];
-		}
-		else
-		{
-			@restore_command = [
-'pg_restore', '--create',
-'-d', 'postgres', $dump_file
-			];
-		}
-		$dst_node->command_ok(@restore_command,
-			"restored dump taken in $format format on destination instance");
+	# Use --create in dump and restore commands so that the restored database
+	# has the same configurable variable set

Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues

2025-03-31 Thread Alexey Makhmutov
> We’ve prepared two test patches on top of current master to address 
both issues:

> ...
> * 0002-Implement-batching-for-cascade-replication.patch – test patch 
to implement possible batching approach in xlogreceiver.c with timer. 
Currently it uses GUC variables to allow testing of different batch 
sizes and timeout values.


I've played with the second patch a little more and made some 
adjustments to it:
1. Set up the timer only if we have actually applied records which are 
(potentially) not yet signaled to walsenders.

2. Notify logical walsenders without delay if the timeline has changed.

Modified patch is attached.

Thanks,
Alexey
From c691724332dd5a78b98d606188383d6b37c98021 Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin 
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each WAL
record during cascade replication. This creates a bottleneck in WalSndWakeup
when there is a large number of sender processes. This change
introduces batching for such notifications, which are now sent either after a
certain number of applied records or after a specified time interval
(whichever comes first).

Co-authored-by: Alexey Makhmutov 
---
 src/backend/access/transam/xlogrecovery.c | 61 ++-
 src/backend/postmaster/startup.c  |  1 +
 src/backend/utils/misc/guc_tables.c   | 22 
 src/include/access/xlogrecovery.h |  4 ++
 src/include/utils/timeout.h   |  1 +
 5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..3e3e43d5888 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -64,6 +64,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
@@ -147,6 +148,8 @@ bool		InArchiveRecovery = false;
 static bool StandbyModeRequested = false;
 bool		StandbyMode = false;
 
+#define StandbyWithCascadeReplication() (AmStartupProcess() && StandbyMode && AllowCascadeReplication())
+
 /* was a signal file present at startup? */
 static bool standby_signal_file_found = false;
 static bool recovery_signal_file_found = false;
@@ -298,6 +301,14 @@ bool		reachedConsistency = false;
 static char *replay_image_masked = NULL;
 static char *primary_image_masked = NULL;
 
+/* Maximum number of applied records in batch before notifying walsender during cascade replication */
+int cascadeReplicationMaxBatchSize;
+
+/* Maximum batching delay before notifying walsender during cascade replication */
+int cascadeReplicationMaxBatchDelay;
+
+/* Counter for collected records during cascade replication */
+static volatile sig_atomic_t appliedRecords = 0;
 
 /*
  * Shared-memory state for WAL recovery.
@@ -1839,6 +1850,17 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Ensure that notification for batched messages is sent */
+		if (StandbyWithCascadeReplication() &&
+cascadeReplicationMaxBatchSize > 1 &&
+appliedRecords > 0)
+		{
+			if (cascadeReplicationMaxBatchDelay > 0)
+disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
+			appliedRecords = 0;
+			WalSndWakeup(false, true);
+		}
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -2037,8 +2059,30 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	 *be created otherwise)
 	 * --
 	 */
-	if (AllowCascadeReplication())
-		WalSndWakeup(switchedTLI, true);
+
+	if (StandbyWithCascadeReplication())
+	{
+		if (cascadeReplicationMaxBatchSize <= 1)
+			WalSndWakeup(switchedTLI, true);
+		else
+		{
+			bool batchFlushRequired = ++appliedRecords >=
+	cascadeReplicationMaxBatchSize;
+			if (batchFlushRequired)
+			{
+appliedRecords = 0;
+if (cascadeReplicationMaxBatchDelay > 0)
+	disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
+			}
+
+			WalSndWakeup(switchedTLI, batchFlushRequired);
+
+			/* Setup timeout to limit maximum delay for notifications */
+			if (appliedRecords == 1 && cascadeReplicationMaxBatchDelay > 0)
+enable_timeout_after(STANDBY_CASCADE_WAL_SEND_TIMEOUT,
+		cascadeReplicationMaxBatchDelay);
+			}
+	}
 
 	/*
 	 * If rm_redo called XLogRequestWalReceiverReply, then we wake up the
@@ -5064,3 +5108,16 @@ assign_recovery_target_xid(const char *newval, void *extra)
 	else
 		recoveryTarget = RECOVERY_TARGET_UNSET;
 }
+
+/*
+ * Timer handler for batch notifications in cascade replication
+ */
+void
+StandbyWalSendTimeoutHandler(void)
+{
+	if (appliedRecords > 0)
+	{
+		appliedRecords = 0;
+		WalSndWakeup(false, true);
+	}
+}
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
in

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

2025-03-31 Thread Bertrand Drouvot
Hi,

On Mon, Mar 24, 2025 at 08:41:20AM +0900, Michael Paquier wrote:
> On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:
> > Hi,
> > Moving the other two provides a more complete view of the settings. For
> > newcomers(like me) to the codebase, seeing all three related values in one
> > place helps avoid a narrow view of the settings.
> > 
> > But I am not sure that I understand the cons of this well.
> 
> While I don't disagree with the use of a hardcoded interval of time to
> control timing the flush of the WAL sender stats, do we really need to
> rely on the timing defined by pgstat.c?

No but I thought it could make sense.

> Wouldn't it be simpler to
> assign one in walsender.c and pick up a different, perhaps higher,
> value?

I don't have a strong opinion on it so done as suggested above in the attached.

I think that the 1s value is fine because: 1. it is consistent with
PGSTAT_MIN_INTERVAL, and 2. the flush already requires that the sender is
caught up or has pending data to send (meaning the effective interval can
already be longer than 1s). That said, I don't think a higher value would hurt
if you prefer one.

> At the end the timestamp calculations are free because we can rely on
> the existing call of GetCurrentTimestamp() for the physical WAL
> senders to get an idea of the current time.

Yup

> For the logical WAL
> senders, perhaps we'd better do the reports in WalSndWaitForWal(),
> actually.  There is also a call to GetCurrentTimestamp() that we can
> rely on in this path.

I think it's better to flush the stats in a shared code path. It's easier to
maintain, and there are no differences between logical and physical walsenders
that would justify flushing the stats in separate code paths.
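
As an illustration (not from the original message), once the patch is in, the
effect should be visible from another session while a walsender is still
running, e.g.:

  SELECT backend_type, object, context, reads, writes
    FROM pg_stat_io
   WHERE backend_type = 'walsender';

Before the fix these counters only advanced after the walsender exited.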

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 1102b1bccaa5ff88b4045a4e8751e43094e946e5 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 25 Feb 2025 10:18:05 +
Subject: [PATCH v5] Flush the IO statistics of active walsenders

The walsender does not flush its IO statistics until it exits.
The issue is there since pg_stat_io has been introduced in a9c70b46dbe.
This commits:

1. ensures it does not wait to exit to flush its IO statistics
2. flush its IO statistics periodically to not overload the walsender
3. adds a test for a physical walsender (a logical walsender had the same issue
but the fix is in the same code path)
---
 src/backend/replication/walsender.c   | 63 +++
 src/test/recovery/t/001_stream_rep.pl | 16 +++
 2 files changed, 61 insertions(+), 18 deletions(-)
  76.6% src/backend/replication/
  23.3% src/test/recovery/t/

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..9eed37b5de9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,10 +91,14 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
+#include "utils/pgstat_internal.h"
 #include "utils/ps_status.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
+/* Minimum interval walsender IO stats flushes */
+#define MIN_IOSTATS_FLUSH_INTERVAL 1000
+
 /*
  * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
  *
@@ -2742,6 +2746,8 @@ WalSndCheckTimeOut(void)
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
+	TimestampTz last_flush = 0;
+
 	/*
 	 * Initialize the last reply timestamp. That enables timeout processing
 	 * from hereon.
@@ -2836,30 +2842,51 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * WalSndWaitForWal() handle any other blocking; idle receivers need
 		 * its additional actions.  For physical replication, also block if
 		 * caught up; its send_data does not block.
+		 *
+		 * When the WAL sender is caught up or has pending data to send, we
+		 * also periodically report I/O statistics. It's done periodically to
+		 * not overload the WAL sender.
 		 */
-		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
-			 !streamingDoneSending) ||
-			pq_is_send_pending())
+		if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
 		{
-			long		sleeptime;
-			int			wakeEvents;
+			TimestampTz now;
 
-			if (!streamingDoneReceiving)
-wakeEvents = WL_SOCKET_READABLE;
-			else
-wakeEvents = 0;
+			now = GetCurrentTimestamp();
 
-			/*
-			 * Use fresh timestamp, not last_processing, to reduce the chance
-			 * of reaching wal_sender_timeout before sending a keepalive.
-			 */
-			sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+			if (TimestampDifferenceExceeds(last_flush, now, MIN_IOSTATS_FLUSH_INTERVAL))
+			{
+/*
+ * Report IO statistics
+ */
+pgstat_flush_io(false);
+(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+last_flush = now;
+			}
 
-			if (pq_is_send_pending())
-wakeEvents |= WL_SOCKET_WRITEABLE;
+			if (send_data != XLogSendLogical || pq_is_se

Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-31 Thread Alvaro Herrera
On 2025-Mar-31, jian he wrote:

> hi.
> in notnull-notvalid.patch
> 
> + if (coninfo->contype == 'c')
> + keyword = "CHECK CONSTRAINT";
> + else
> + keyword = "INVALID NOT NULL CONSTRAINT";
> we have a new TocEntry->desc kind.

Yeah, I wasn't sure that this change made much actual sense.  I think it
may be better to stick with just CONSTRAINT.  There probably isn't
sufficient reason to have a different ->desc value.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Thread-safe nl_langinfo() and localeconv()

2025-03-31 Thread Tom Lane
Peter Eisentraut  writes:
> I'm not sure what to do with this.  If setlocale() and newlocale() 
> indeed behave differently in what set of locale names they accept, then 
> technically we ought to test both of them, since we do use both of them 
> later on.  Or maybe we push on with the effort to get rid of setlocale() 
> calls and then just worry about testing newlocale() (as this patch 
> does).  But right now, if newlocale() is more permissive, then we could 
> accept locale names that will later fail setlocale() calls, which might 
> be a problem.

I think the clear answer is "let's stop using setlocale(), and then
not have to worry about any behavioral differences".

regards, tom lane




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-03-31 Thread Christoph Berg
Re: Andres Freund
> > Yes.  Also, none of this has addressed my complaint about the extent
> > of the build and install dependencies.  Yes, simply not selecting
> > --with-libcurl removes the problem ... but most packagers are under
> > very heavy pressure to enable all features of a package.

And this feature is really only useful if it's available everywhere. If
only half of your clients are able to use SSO, you'd probably stick
with passwords anyway. So it needs to be enabled by default.

> How about we provide the current libpq.so without linking to curl and also a
> libpq-oauth.so that has curl support? If we do it right libpq-oauth.so would
> itself link to libpq.so, making libpq-oauth.so a fairly small library.
> 
> That way packagers can split libpq-oauth.so into a separate package, while
> still just building once.

That's definitely a good plan. The blast radius of build dependencies
isn't really a problem, the install/run-time is.

Perhaps we could do the same with libldap and libgssapi? (Though
admittedly I have never seen any complaints or nagging questions from
security people about these.)

Christoph




Re: Add a function to get the version of installed extension

2025-03-31 Thread Yugo NAGATA
On Sun, 2 Feb 2025 00:15:25 -0500
Tom Lane  wrote:

> Yugo Nagata  writes:
> > This might be avoidable if the binary is made carefully to check the 
> > existing
> > of objects, but I think it is useful if an extension module function can 
> > check
> > the current extension version. So, I would like to propose a new function to
> > return the current extension version, get_extension_version. I've attached a
> > patch.
> 
> While I don't say this is a bad idea, I do say you've not made a very
> good case for it.  How would an extension know its own OID in order
> to call the function?  If it did manage to call the function, what
> exactly would it do with the result, and how would that be easier than
> maintaining backwards compatibility with its old SQL definitions?
> We've not found the backwards-compatibility requirement to be hugely
> onerous in our contrib extensions.
> 
> A concrete example use-case would make this a lot more convincing.

The OID of an extension can be obtained from its name using get_extension_oid(),
because the name is unique within a database. But, on second thought, this
might not be so useful, because the version of an extension is just text and
is not comparable as a number, so it is hard to express "if the version is
10.0 or later" and so on.
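
For context (an illustrative query, not from the original message; the
extension name is just an example), the installed version is already exposed
as text in the catalog, which is why a plain numeric comparison doesn't work:

  SELECT extversion FROM pg_extension WHERE extname = 'pg_stat_statements';

The result is a text value such as '1.11', with no numeric ordering guarantees.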

I'll withdraw this proposal for now until I get a good example use-case.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Allow ILIKE forward matching to use btree index

2025-03-31 Thread Yugo NAGATA
On Thu, 16 Jan 2025 09:41:35 -0800
Jeff Davis  wrote:

> On Thu, 2025-01-16 at 14:53 +0900, Yugo NAGATA wrote:
> > Instead of generating complete patterns considering every case-
> > varying characters,
> > two clauses considering only the first case-varying character are
> > generated.
> 
> Did you consider other approaches that integrate more deeply into the
> indexing infrastructure? This feels almost like a skip scan, which
> Petter Geoghegan is working on. If you model the disjunctions as skips,
> and provide the right API that the index AM can use, then there would
> be no limit.
> 
> For example:
> 
> start scanning at '123FOO'
> when you encounter '123FOP' skip to '123FOo' 
> continue scanning
> when you encounter '123FOp' skip to '123FoO'
> ...

That seems true; there is a similarity between ILIKE search and skip scan,
although, in my understanding, skip scan is for multi-column indexes rather
than text. Anyway, I have no concrete idea how the skip scan infrastructure
could be used to improve ILIKE.
 
> Also, I'm working on better Unicode support to handle multiple case
> variants in patterns. For instance, the Greek letter Sigma has three
> case variants (one upper and two lower). What's the right API so that
> the index AM knows which case variants will sort first, so that the
> skips don't go backward?

I don't have an idea for handling letters with multiple case variants in my
patch either; I overlooked such patterns entirely. So, I'll withdraw the
proposal for now.
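
For context, the kind of query the proposal targeted looks like this
(illustrative only; table and column names are made up):

  CREATE INDEX ON t (col text_pattern_ops);
  EXPLAIN SELECT * FROM t WHERE col ILIKE '123foo%';

With LIKE '123foo%' the btree index can be used for forward matching, but with
ILIKE the planner generally cannot use it, which is what the patch tried to
address.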

Regards,
Yugo Nagata


-- 
Yugo NAGATA 




Question about comments on simple_heap_update

2025-03-31 Thread Yugo Nagata
Hi,

While working on [1], I found that the internal error "tuple concurrently updated"
is raised by simple_heap_update and other similar functions, and the comments
on them say "Any failure is reported via ereport()". However, I could not
understand the intention of these comments, because I suppose unexpected errors
are usually reported via elog(), not ereport(), and in fact elog() is what these
functions use.

I wonder whether this statement should be fixed as in the attached patch, or
just removed to reduce confusion. Maybe I am just missing something, though.
 

[1] 
https://www.postgresql.org/message-id/flat/20250331200057.00a62760966a821d484ea904%40sraoss.co.jp

-- 
Yugo Nagata 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6e433db039e..f47b878f559 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3207,7 +3207,7 @@ l1:
  * This routine may be used to delete a tuple when concurrent updates of
  * the target tuple are not expected (for example, because we have a lock
  * on the relation associated with the tuple).  Any failure is reported
- * via ereport().
+ * via elog().
  */
 void
 simple_heap_delete(Relation relation, ItemPointer tid)
@@ -4495,7 +4495,7 @@ HeapDetermineColumnsInfo(Relation relation,
  * This routine may be used to update a tuple when concurrent updates of
  * the target tuple are not expected (for example, because we have a lock
  * on the relation associated with the tuple).  Any failure is reported
- * via ereport().
+ * via elog().
  */
 void
 simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..eb09272865e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -285,7 +285,7 @@ simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
  * This routine may be used to delete a tuple when concurrent updates of
  * the target tuple are not expected (for example, because we have a lock
  * on the relation associated with the tuple).  Any failure is reported
- * via ereport().
+ * via elog().
  */
 void
 simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
@@ -330,7 +330,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
  * This routine may be used to update a tuple when concurrent updates of
  * the target tuple are not expected (for example, because we have a lock
  * on the relation associated with the tuple).  Any failure is reported
- * via ereport().
+ * via elog().
  */
 void
 simple_table_tuple_update(Relation rel, ItemPointer otid,


Re: Improve monitoring of shared memory allocations

2025-03-31 Thread Tomas Vondra
On 3/31/25 01:01, Rahila Syed wrote:
> ...
> 
> 
> > I will improve the comment in the next version.
> >
> 
> OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
> really used except for this bit:
> 
> +    if (init_size > nelem_alloc)
> +        element_alloc = false;
> 
> Can't we determine before calling the function, to make it a bit less
> confusing?
> 
> 
> Yes, we could determine whether the pre-allocated elements are zero before
> calling the function, I have fixed it accordingly in the attached 0001
> patch. 
> Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've
> passed this information as a boolean variable-initial_elems. If it is
> false,
> no elements are pre-allocated.
> 
> Please find attached the v7-series, which incorporates your review patches
> and addresses a few remaining comments.
> 

I think it's almost committable. Attached is v8 with some minor review
adjustments, and updated commit messages. Please read through those and
feel free to suggest changes.

I still found the hash_get_init_size() comment unclear, and it also
referenced init_size, which is no longer relevant. I improved the
comment a bit (I find it useful to mimic comments of nearby functions,
so I did that too here). The "initial_elems" name was a bit confusing,
as it seemed to suggest "number of elements", but it's a simple flag. So
I renamed it to "prealloc", which seems clearer to me. I also tweaked
(reordered/reformatted) the conditions a bit.

For the other patch, I realized we can simply MemSet() the whole chunk,
instead of resetting the individual parts.
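
For reference (an illustrative query, not part of the patch), the view whose
accounting this improves can be inspected like this; after the change, the full
size of a shared hash table (header, directory, segments, buckets) should be
reflected in allocated_size:

  SELECT name, size, allocated_size
    FROM pg_shmem_allocations
   ORDER BY allocated_size DESC
   LIMIT 5;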

regards

-- 
Tomas Vondra
From 1a82ac7407eb88bc085694519739115f718cc59a Mon Sep 17 00:00:00 2001
From: Rahila Syed 
Date: Mon, 31 Mar 2025 03:23:20 +0530
Subject: [PATCH v8 1/5] Improve acounting for memory used by shared hash
 tables

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory.  The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.

This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.

This affects allocations for private (non-shared) hash tables too, as
the hash_create() code is shared. For non-shared tables this however
makes no practical difference.

This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunks removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.

Author: Rahila Syed 
Reviewed-by: Andres Freund 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Tomas Vondra 
Discussion: https://postgr.es/m/cah2l28vhzrankszhqz7dexurxkncxfirnuw68zd7+hvaqas...@mail.gmail.com
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 283 +++---
 src/include/utils/hsearch.h   |   3 +-
 3 files changed, 222 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..3c030d5743d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-			   hash_get_shared_size(infoP, hash_flags),
+			   hash_get_init_size(infoP, hash_flags,
+  true, init_size),
 			   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..3dede9caa5d 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)-

Re: encode/decode support for base64url

2025-03-31 Thread Aleksander Alekseev
Hi Florents,

> Here's a v3 with some (hopefully) better test cases.

Thanks for the new version of the patch.

```
+encoded_len = pg_base64_encode(src, len, dst);
+
+/* Convert Base64 to Base64URL */
+for (uint64 i = 0; i < encoded_len; i++) {
+if (dst[i] == '+')
+dst[i] = '-';
+else if (dst[i] == '/')
+dst[i] = '_';
+}
```

Although it is a possible implementation, wouldn't it be better to
parametrize pg_base64_encode instead of traversing the string twice?
Same for pg_base64_decode. You can refactor pg_base64_encode and make
it a wrapper for pg_base64_encode_impl if needed.

```
+-- Flaghsip Test case against base64.
+-- Notice the = padding removed at the end and special chars.
+SELECT encode('\x69b73eff', 'base64');  -- Expected: abc+/w==
+  encode
+--
+ abc+/w==
+(1 row)
+
+SELECT encode('\x69b73eff', 'base64url');  -- Expected: abc-_w
+ encode
+
+ abc-_w
+(1 row)
```

I get the idea, but calling base64 here is redundant IMO. It only burns a few
CPU cycles on every test run without adding much value. I suggest removing it
and testing corner cases for base64url instead, which are missing at the
moment. In particular, there should be tests for encoding/decoding strings of
0/1/2/3/4 characters, making sure that decode(encode(x)) = x always holds. On
top of that, you should cover the cases of invalid input to decode() with
tests.
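
A rough sketch of that corner-case coverage (assuming the base64url patch is
applied; the byte string is arbitrary):

  -- round-trip inputs of length 0..4 bytes
  SELECT n AS len,
         decode(encode(substr('\x0011223344'::bytea, 1, n), 'base64url'),
                'base64url') = substr('\x0011223344'::bytea, 1, n) AS roundtrip_ok
    FROM generate_series(0, 4) AS n;

plus a few negative tests feeding clearly invalid base64url strings to
decode() and checking that they error out.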

-- 
Best regards,
Aleksander Alekseev




Re: per backend WAL statistics

2025-03-31 Thread Bertrand Drouvot
Hi,

On Sat, Mar 29, 2025 at 07:14:16AM +0900, Michael Paquier wrote:
> On Fri, Mar 28, 2025 at 09:00:00PM +0200, Alexander Lakhin wrote:
> > Please try the following query:
> > BEGIN;
> > SET LOCAL stats_fetch_consistency = snapshot;
> > SELECT * FROM pg_stat_get_backend_wal(pg_backend_pid());

Thanks for the report! I'm able to reproduce it on my side. The issue can
also be triggered with pg_stat_get_backend_io().

The issue is that in pgstat_fetch_stat_backend_by_pid() (and with
stats_fetch_consistency set to snapshot) a call to
pgstat_clear_backend_activity_snapshot() is done:

#0  __memset_evex_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:250
#1  0x01833bf2 in wipe_mem (ptr=0x63218800, size=80800) at 
../../../../src/include/utils/memdebug.h:42
#2  0x01834c51 in AllocSetReset (context=0x61903c80) at aset.c:586
#3  0x0184f32d in MemoryContextResetOnly (context=0x61903c80) at 
mcxt.c:419
#4  0x01834ede in AllocSetDelete (context=0x61903c80) at aset.c:636
#5  0x0184f79b in MemoryContextDeleteOnly (context=0x61903c80) at 
mcxt.c:528
#6  0x0184f5a9 in MemoryContextDelete (context=0x61903c80) at 
mcxt.c:482
#7  0x01361e84 in pgstat_clear_backend_activity_snapshot () at 
backend_status.c:541
#8  0x01367f08 in pgstat_clear_snapshot () at pgstat.c:943
#9  0x01368ac3 in pgstat_prep_snapshot () at pgstat.c:1121
#10 0x013680b9 in pgstat_fetch_entry (kind=6, dboid=0, objid=0) at 
pgstat.c:961
#11 0x0136dd05 in pgstat_fetch_stat_backend (procNumber=0) at 
pgstat_backend.c:94
#12 0x0136de7d in pgstat_fetch_stat_backend_by_pid (pid=3294022, 
bktype=0x0) at pgstat_backend.c:136

*before* we check for "beentry->st_procpid != pid".

I think we can simply move the pgstat_fetch_stat_backend() call to the end
of pgstat_fetch_stat_backend_by_pid(), as in the attached. With this in place
the issue is fixed on my side.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 1605f513ad691b463baacc00e3c305655525ea07 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 31 Mar 2025 07:02:34 +
Subject: [PATCH v1] Fix heap-use-after-free in
 pgstat_fetch_stat_backend_by_pid()

With stats_fetch_consistency set to snapshot the beentry is reset during
the pgstat_fetch_stat_backend() call. So moving this call at the end of
pgstat_fetch_stat_backend_by_pid().

Reported-by: Alexander Lakhin 
---
 src/backend/utils/activity/pgstat_backend.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 187c5c76e1e..ec95c302af8 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -133,10 +133,6 @@ pgstat_fetch_stat_backend_by_pid(int pid, BackendType *bktype)
 	if (!pgstat_tracks_backend_bktype(beentry->st_backendType))
 		return NULL;
 
-	backend_stats = pgstat_fetch_stat_backend(procNumber);
-	if (!backend_stats)
-		return NULL;
-
 	/* if PID does not match, leave */
 	if (beentry->st_procpid != pid)
 		return NULL;
@@ -144,6 +140,10 @@ pgstat_fetch_stat_backend_by_pid(int pid, BackendType *bktype)
 	if (bktype)
 		*bktype = beentry->st_backendType;
 
+	backend_stats = pgstat_fetch_stat_backend(procNumber);
+	if (!backend_stats)
+		return NULL;
+
 	return backend_stats;
 }
 
-- 
2.34.1



Re: Memoize ANTI and SEMI JOIN inner

2025-03-31 Thread Andrei Lepikhov

On 3/31/25 05:33, David Rowley wrote:

> I can't say definitively that we won't find a reason in the future
> that we should set inner_unique for SEMI/ANTI joins, so I don't follow
> the need for the Assert.
>
> Maybe you're seeing something that I'm not. What do you think will
> break if both flags are true?

I considered targeting PG 19 and the July Commitfest for this feature, but
currently I don't see any further development needed here. What are your
thoughts, David? Does it make sense to commit this to PG 18?


I have no reason to rush the process, but this feature seems beneficial in
practice. When the inner input is a bushy join tree, producing even a single
tuple can be costly. SQL Server's Spool node addresses this issue, and people
are sometimes confused when they see specific queries degrade during a
migration.


--
regards, Andrei Lepikhov




Re: speedup COPY TO for partitioned table.

2025-03-31 Thread Kirill Reshke
Hi!
I reviewed v7. Maybe we should add a multi-level partitioning case
into copy2.sql regression test?
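Something along these lines, perhaps (a sketch only; table names are invented,
and it assumes the patch allows COPY on a partitioned table directly):

  CREATE TABLE copy_mlp (a int, b int) PARTITION BY RANGE (a);
  CREATE TABLE copy_mlp_1 PARTITION OF copy_mlp
    FOR VALUES FROM (0) TO (10) PARTITION BY HASH (b);
  CREATE TABLE copy_mlp_1_0 PARTITION OF copy_mlp_1
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
  CREATE TABLE copy_mlp_1_1 PARTITION OF copy_mlp_1
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);
  INSERT INTO copy_mlp SELECT i, i FROM generate_series(0, 9) i;
  COPY copy_mlp TO stdout;  -- should emit the rows of all leaf partitions
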

I also did quick benchmarking for this patch:

=== DDL:

 create table ppp(i int) partition by range (i);

genddl.sh:

for i in `seq 0 200`; do echo "create table p$i partition of ppp for
values from ( $((10 * i)) ) to ( $((10 * (i + 1))) ); "; done

=== insert data:
insert into ppp select i / 1000  from generate_series(0, 200)i;

=== results:


for 201 rows speedup is 1.40 times : 902.604 ms (patched) vs
1270.648 ms (unpatched)

for 402 rows speedup is 1.20 times : 1921.724 ms (patched) vs
2343.393 ms (unpatched)

for 804 rows speedup is 1.10 times : 3932.361 ms (patched) vs
4358.489 ms (unpatched)

So, this patch indeed speeds up some cases, but with larger tables the
speedup becomes negligible.

-- 
Best regards,
Kirill Reshke




Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-31 Thread jian he
On Sat, Mar 29, 2025 at 2:42 AM Alvaro Herrera  wrote:
>
> On 2025-Mar-28, jian he wrote:
>
> > i think your patch messed up with pg_constraint.conislocal.
> > for example:
> >
> > CREATE TABLE parted (id bigint default 1,id_abc bigint) PARTITION BY LIST 
> > (id);
> > alter TABLE parted add CONSTRAINT dummy_constr not null id not valid;
> > CREATE TABLE parted_1 (id bigint default 1,id_abc bigint);
> > alter TABLE parted_1 add CONSTRAINT dummy_constr not null id;
> > ALTER TABLE parted ATTACH PARTITION parted_1 FOR VALUES IN ('1');
> >
> > select conrelid::regclass, conname, conislocal
> > from pg_constraint where conname = 'dummy_constr';
> >
> >  conrelid |   conname| conislocal
> > --+--+
> >  parted   | dummy_constr | t
> >  parted_1 | dummy_constr | f
> > (2 rows)
> >
> >
> > if you  do pg_dump, and execute the pg_dump output
> > pg_dump  --no-statistics --clean --table-and-children=*parted*
> > --no-owner --verbose --column-inserts --file=dump.sql --no-acl
> >
> > select conrelid::regclass, conname, conislocal
> > from pg_constraint where conname = 'dummy_constr';
> > output is
> >
> >  conrelid |   conname| conislocal
> > --+--+
> >  parted   | dummy_constr | t
> >  parted_1 | dummy_constr | t
> > (2 rows)
>
> Interesting.  Yeah, I removed the code you had there because it was
> super weird, had no comments, and removing it had zero effect (no tests
> failed), so I thought it was useless.  But apparently something is going
> on here that's not what we want.
>

my change in AdjustNotNullInheritance is copied from
MergeConstraintsIntoExisting
``
/*
 * OK, bump the child constraint's inheritance count.  (If we fail
 * later on, this change will just roll back.)
 */
child_copy = heap_copytuple(child_tuple);
child_con = (Form_pg_constraint) GETSTRUCT(child_copy);
if (pg_add_s16_overflow(child_con->coninhcount, 1,
&child_con->coninhcount))
ereport(ERROR,
errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many inheritance parents"));
/*
 * In case of partitions, an inherited constraint must be
 * inherited only once since it cannot have multiple parents and
 * it is never considered local.
 */
if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
Assert(child_con->coninhcount == 1);
child_con->conislocal = false;
}
``

if you look at MergeConstraintsIntoExisting, then it won't be weird.
AdjustNotNullInheritance is kind of doing the same thing as
MergeConstraintsIntoExisting, I think.




Re: Partition pruning on parameters grouped into an array does not prune properly

2025-03-31 Thread Andrei Lepikhov

On 3/27/25 01:58, David Rowley wrote:

I suspect the fix for this might be a bit invasive to backpatch. Maybe
it's something we can give a bit more clear thought to after the
freeze is over.
One more issue I think may be addressed (or just considered) here is the 
following:


CREATE TABLE parted (a int, b int) PARTITION BY RANGE (a, b);

CREATE TABLE part1 PARTITION OF parted
  FOR VALUES FROM (1, 1) TO (1, 10);
CREATE TABLE part2 PARTITION OF parted
  FOR VALUES FROM (2, 1) TO (2, 10);

INSERT INTO parted (VALUES (1, 2));
INSERT INTO parted VALUES (2, 2);

EXPLAIN (COSTS OFF)
SELECT * FROM parted WHERE a > 1 AND b < 1;
EXPLAIN (COSTS OFF)
SELECT * FROM parted WHERE a > 1 AND b > 10;

/*
 Seq Scan on part2 parted
   Filter: ((a > 1) AND (b < 1))
 Seq Scan on part2 parted
   Filter: ((a > 1) AND (b > 10))
*/

I think partition part2 could be pruned like in the following example:

EXPLAIN (COSTS OFF)
SELECT * FROM parted WHERE a > 2 AND b > 10;

/*
 Result
   One-Time Filter: false
*/

--
regards, Andrei Lepikhov




Re: Draft for basic NUMA observability

2025-03-31 Thread Jakub Wartak
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund  wrote:
>
> Hi,

Hi Andres,

> On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
> >setup_additional_packages_script: |
> > -#apt-get update
> > -#DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> > +apt-get update
> > +DEBIAN_FRONTEND=noninteractive apt-get -y install \
> > +  libnuma1 \
> > +  libnuma-dev
>
> I think libnuma is installed on the relevant platforms, so you shouldn't need
> to install it manually.

Fixed. Right, you mentioned this earlier, I just didn't know when it went online.

> > +#
> > +# libnuma
> > +#
> > +AC_MSG_CHECKING([whether to build with libnuma support])
> > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
>
> Most other dependencies say "build with libxyz ..."

Done.

> > + * pg_numa.h
[..]
> > +#include "c.h"
> > +#include "postgres.h"
>
> Headers should never include either of those headers.  Nor should .c files
> include both.

Fixed. Huh, I've found an explanation:
https://www.postgresql.org/message-id/11634.1488932...@sss.pgh.pa.us

> > @@ -200,6 +200,8 @@ pgxs_empty = [
> >
> >'ICU_LIBS',
> >
> > +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> > +
> >'LIBURING_CFLAGS', 'LIBURING_LIBS',
> >  ]
>
> Maybe I am missing something, but are you actually defining and using those
> LIBNUMA_* vars anywhere?

OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)`
in configure.ac, which would set those *FLAGS. I'm a little bit lost on how
to guarantee that meson stays in sync with autoconf as per project
requirements - I'm trying to use past commits as a reference, but I could
still get something wrong here (especially in src/Makefile.global.in).

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> Should this have a comment or an assertion that it can only be used after
> shared memory startup? Because before that huge_pages_status won't be
> meaningful?

Added both.

> > +#ifndef FRONTEND
> > +/*
> > + * XXX: not really tested as there is no way to trigger this in our
> > + * current usage of libnuma.
> > + *
> > + * The libnuma built-in code can be seen here:
> > + * https://github.com/numactl/numactl/blob/master/libnuma.c
> > + *
> > + */
> > +void
> > +numa_warn(int num, char *fmt,...)
[..]
>
> I think you would at least have to hold interrupts across this, as
> ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
> out of libnuma in case an interrupt has arrived.

At Alvaro's request I've removed it, as that code is simply
unreachable/untestable, but point taken - I'm planning to re-add this,
holding interrupts, in the future when we start using proper
numa_interleave() one day. Anyway, please let me know if you still want
to keep it as dead code. BTW, for context, why this is dead code is
explained in the latter part of the [1] message (TL;DR: unless we use
pinning/numa_interleave/local_alloc() we probably never reach those
warning/error handlers).

"Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine..."

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> I would probably implement this once, outside of the big ifdef, with one more
> ifdef inside, given that you're sharing the same implementation.

Done.

> > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> > From: Jakub Wartak 
> > Date: Wed, 19 Mar 2025 09:34:56 +0100
> > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
> >  primary function to separate functions.
> >
> > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> > and get_buffercache_tuple() that help fill result tuplestores based on the
> > buffercache contents. This will be used in a follow-up commit that 
> > implements
> > NUMA observability in pg_buffercache.
> >
> > Author: Jakub Wartak 
> > Reviewed-by: Bertrand Drouvot 
> > Discussion: 
> > https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> > ---
> >  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++
> >  1 file changed, 178 insertions(+), 139 deletions(-)
> >
> > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c 
> > b/contrib/pg_buffercache/pg_buffercache_pages.c
> > index 62602af1775..86e0b8afe01 100644
> > --- a/contrib/pg_bufferca

Re: Draft for basic NUMA observability

2025-03-31 Thread Jakub Wartak
On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera  wrote:

> Hello

Good morning :)

> I think you should remove numa_warn() and numa_error() from 0001.
> AFAICS they are dead code (even with all your patches applied), and
> furthermore would get you in trouble regarding memory allocation because
> src/port is not allowed to use palloc et al.  If you wanted to keep them
> you'd have to have them in src/common, but looking at the rest of the
> code in that patch, ISTM src/port is the right place for it.  If in the
> future you discover that you do need numa_warn(), you can create a
> src/common/ file for it then.

Understood, trimmed it out from the patch. I'm going to respond also
within minutes to Andres' review and I'm going to post a new version
(v17) there.

> Is pg_buffercache really the best place for these NUMA introspection
> routines?  I'm not saying that it isn't, maybe we're okay with that
> (particularly if we can avoid duplicated code), but it seems a bit weird
> to me.

I think it is, because, as I understand it, Andres wanted to have
observability per single database *page*, and to avoid code duplication
we are just putting it there (it's a natural fit). Imagine looking up an
8kB btree root page that is being hit hard from CPUs on other NUMA
nodes (this just gives the ability to see that, but you could of course
also aggregate it to get e.g. the NUMA node balance for a single relation
and so on).
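
Purely as a sketch of that kind of aggregation (the NUMA node column name is
assumed here, since the patch is still being shaped), something like:

  SELECT c.relname, b.numa_node, count(*) AS buffers
    FROM pg_buffercache b
    JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
   GROUP BY 1, 2
   ORDER BY buffers DESC
   LIMIT 10;

(The join is simplified and ignores database/tablespace; it is only meant to
illustrate the per-page granularity.)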

-J.




Re: Memoize ANTI and SEMI JOIN inner

2025-03-31 Thread David Rowley
On Mon, 31 Mar 2025 at 22:03, Richard Guo  wrote:
> I reviewed this patch and I have some concerns about the following
> code:
>
> if (extra->inner_unique &&
> (inner_path->param_info == NULL ||
>  bms_num_members(inner_path->param_info->ppi_serials) <
>  list_length(extra->restrictlist)))
> return NULL;
>
> I understand that this check is used to ensure that the entire join
> condition is parameterized in the case of unique joins, so that we can
> safely mark the cache entry as complete after reading the first tuple.
> However, ppi_clauses includes join clauses available from all outer
> rels, not just the current outer rel, while extra->restrictlist only
> includes the restriction clauses for the current join.  This means the
> check could pass even if a restriction clause isn't parameterized, as
> long as another join clause, which doesn't belong to the current join,
> is included in ppi_clauses.

Shouldn't you be more concerned about master here than the patch given
that the code you pasted is from master?

David




Re: Memoize ANTI and SEMI JOIN inner

2025-03-31 Thread Richard Guo
On Mon, Mar 31, 2025 at 6:39 PM David Rowley  wrote:
> On Mon, 31 Mar 2025 at 22:03, Richard Guo  wrote:
> > I reviewed this patch and I have some concerns about the following
> > code:
> >
> > if (extra->inner_unique &&
> > (inner_path->param_info == NULL ||
> >  bms_num_members(inner_path->param_info->ppi_serials) <
> >  list_length(extra->restrictlist)))
> > return NULL;
> >
> > I understand that this check is used to ensure that the entire join
> > condition is parameterized in the case of unique joins, so that we can
> > safely mark the cache entry as complete after reading the first tuple.
> > However, ppi_clauses includes join clauses available from all outer
> > rels, not just the current outer rel, while extra->restrictlist only
> > includes the restriction clauses for the current join.  This means the
> > check could pass even if a restriction clause isn't parameterized, as
> > long as another join clause, which doesn't belong to the current join,
> > is included in ppi_clauses.
>
> Shouldn't you be more concerned about master here than the patch given
> that the code you pasted is from master?

Right.  This code is from the master branch, not the patch.  It caught
my attention while I was reviewing the patch, because the patch modifies
this code.

Thanks
Richard




Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).

2025-03-31 Thread Fujii Masao




On 2025/03/30 5:01, Andrew Dunstan wrote:


On 2025-03-29 Sa 2:58 PM, David G. Johnston wrote:

On Sat, Mar 29, 2025 at 11:56 AM Andrew Dunstan  wrote:

I don't believe that the premise supports the conclusion.


Regardless, I do support this patch and probably any similar ones proposed in 
the future.  Do you have an opinion on that?




In principle I think it would be good to have COPY materialized_view TO ...


I haven't found any reasons to object to this patch for now,
so I have no objections to this change.


Regarding the patch, here are some review comments:

+   errmsg("cannot copy from 
materialized view when the materialized view is not populated"),

How about including the object name for consistency with
other error messages in BeginCopyTo(), like this?

errmsg("cannot copy from unpopulated materialized view \"%s\"",
   RelationGetRelationName(rel)),


+   errhint("Use the REFRESH 
MATERIALIZED VIEW command populate the materialized view first."));

There seems to be a missing "to" just after "command".
Should it be "Use the REFRESH MATERIALIZED VIEW command to
populate the materialized view first."? Or we could simplify
the hint to match what SELECT on an unpopulated materialized
view logs: "Use the REFRESH MATERIALIZED VIEW command.".


The copy.sgml documentation should clarify that COPY TO can
be used with a materialized view only if it is populated.


Wouldn't it be beneficial to add a regression test to check
whether COPY matview TO works as expected?
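
For example, something as small as this might do (a sketch, assuming the
patch is applied; object names are invented):

  CREATE MATERIALIZED VIEW copytest_mv AS SELECT 1 AS x WITH NO DATA;
  COPY copytest_mv TO stdout;           -- should fail: not populated
  REFRESH MATERIALIZED VIEW copytest_mv;
  COPY copytest_mv TO stdout;           -- should now work
  DROP MATERIALIZED VIEW copytest_mv;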

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: [PATCH] Split varlena.c into varlena.c and bytea.c

2025-03-31 Thread Aleksander Alekseev
> The proposed patch extracts the code dealing with bytea from varlena.c
> into a separate file, as proposed previously [1]. It merely changes
> the location of the existing functions. There are no other changes.

Rebased.


--
Best regards,
Aleksander Alekseev
From 86776455a7b82f8cad6b984d3ddd67fc8f0e3258 Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev 
Date: Tue, 25 Mar 2025 13:01:31 +0300
Subject: [PATCH v2] Split varlena.c into varlena.c and bytea.c

Move most of the code dealing with bytea into a separate file.

Aleksander Alekseev, reviewed by TODO FIXME
Discussion: https://postgr.es/m/CAJ7c6TMPVPJ5DL447zDz5ydctB8OmuviURtSwd=PHCRFEPDEAQ@mail.gmail.com
---
 src/backend/utils/adt/Makefile|1 +
 src/backend/utils/adt/bytea.c | 1147 +++
 src/backend/utils/adt/meson.build |1 +
 src/backend/utils/adt/varlena.c   | 1214 ++---
 4 files changed, 1207 insertions(+), 1156 deletions(-)
 create mode 100644 src/backend/utils/adt/bytea.c

diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 35e8c01aab9..d7b1902a070 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -23,6 +23,7 @@ OBJS = \
 	arrayutils.o \
 	ascii.o \
 	bool.o \
+	bytea.o \
 	cash.o \
 	char.o \
 	cryptohashfuncs.o \
diff --git a/src/backend/utils/adt/bytea.c b/src/backend/utils/adt/bytea.c
new file mode 100644
index 000..0b3acc7b300
--- /dev/null
+++ b/src/backend/utils/adt/bytea.c
@@ -0,0 +1,1147 @@
+/*-
+ *
+ * bytea.c
+ *	  Functions for the bytea type.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/bytea.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "access/detoast.h"
+#include "catalog/pg_collation.h"
+#include "common/int.h"
+#include "funcapi.h"
+#include "libpq/pqformat.h"
+#include "utils/builtins.h"
+#include "utils/bytea.h"
+#include "utils/varlena.h"
+
+/* GUC variable */
+int			bytea_output = BYTEA_OUTPUT_HEX;
+
+static StringInfo
+			makeStringAggState(FunctionCallInfo fcinfo);
+static bytea *bytea_catenate(bytea *t1, bytea *t2);
+static bytea *bytea_substring(Datum str, int S, int L, bool length_not_specified);
+static bytea *bytea_overlay(bytea *t1, bytea *t2, int sp, int sl);
+
+
+/* subroutine to initialize state */
+static StringInfo
+makeStringAggState(FunctionCallInfo fcinfo)
+{
+	StringInfo	state;
+	MemoryContext aggcontext;
+	MemoryContext oldcontext;
+
+	if (!AggCheckCallContext(fcinfo, &aggcontext))
+	{
+		/* cannot be called directly because of internal-type argument */
+		elog(ERROR, "bytea_string_agg_transfn called in non-aggregate context");
+	}
+
+	/*
+	 * Create state in aggregate context.  It'll stay there across subsequent
+	 * calls.
+	 */
+	oldcontext = MemoryContextSwitchTo(aggcontext);
+	state = makeStringInfo();
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
+/*
+ * bytea_catenate
+ *	Guts of byteacat(), broken out so it can be used by other functions
+ *
+ * Arguments can be in short-header form, but not compressed or out-of-line
+ */
+static bytea *
+bytea_catenate(bytea *t1, bytea *t2)
+{
+	bytea	   *result;
+	int			len1,
+len2,
+len;
+	char	   *ptr;
+
+	len1 = VARSIZE_ANY_EXHDR(t1);
+	len2 = VARSIZE_ANY_EXHDR(t2);
+
+	/* paranoia ... probably should throw error instead? */
+	if (len1 < 0)
+		len1 = 0;
+	if (len2 < 0)
+		len2 = 0;
+
+	len = len1 + len2 + VARHDRSZ;
+	result = (bytea *) palloc(len);
+
+	/* Set size of result string... */
+	SET_VARSIZE(result, len);
+
+	/* Fill data field of result string... */
+	ptr = VARDATA(result);
+	if (len1 > 0)
+		memcpy(ptr, VARDATA_ANY(t1), len1);
+	if (len2 > 0)
+		memcpy(ptr + len1, VARDATA_ANY(t2), len2);
+
+	return result;
+}
+
+#define PG_STR_GET_BYTEA(str_) \
+	DatumGetByteaPP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
+
+static bytea *
+bytea_substring(Datum str,
+int S,
+int L,
+bool length_not_specified)
+{
+	int32		S1;/* adjusted start position */
+	int32		L1;/* adjusted substring length */
+	int32		E;/* end position */
+
+	/*
+	 * The logic here should generally match text_substring().
+	 */
+	S1 = Max(S, 1);
+
+	if (length_not_specified)
+	{
+		/*
+		 * Not passed a length - DatumGetByteaPSlice() grabs everything to the
+		 * end of the string if we pass it a negative value for length.
+		 */
+		L1 = -1;
+	}
+	else if (L < 0)
+	{
+		/* SQL99 says to throw an error for E < S, i.e., negative length */
+		ereport(ERROR,
+(errcode(ERRCODE_SUBSTRING_ERROR),
+ errmsg("negative substring length not allowed")));
+		L1 = -1;/* silence stupider compilers */
+	}
+	else if (pg_add_s32_overflow(S, L, &E))
+	{
+		/*
+		 * L could be large enough for S + L to overflow, in which case the
+		 * substring must run to end of string.
+		 */
+		L1 = 

Re: Add partial :-variable expansion to psql \copy

2025-03-31 Thread Corey Huinker
>
> Anyway, my feeling about it is that \copy parsing is a huge hack
> right now, and I'd rather see it become less of a hack, that is
> more like other psql commands, instead of getting even hackier.
>

I wasn't as horrified as Tom, but it did have the feeling of it solving
half the problem.

We can already do this

COPY (SELECT :foo FROM :bar WHERE :condition) TO STDOUT \g :"myfilename"

So it seems that what we need is a good way to pipe local data to a
standard COPY command, which is then free to use the existing variable
interpolations.

If we could do this:

COPY :"myschema".:"mytable" FROM STDIN \g < :"myfilename"

that would fit our patterns most cleanly, but we would probably create a
parsing hassle for ourselves if we ever wanted to mix pipe-to with
pipe-from. It would also require checking on every command, when uploaded
\copy commands make up a very small percentage of commands issued. So I
don't think there's a good way around the asymmetry of COPY TO being a
regular \g-able command, whereas COPY FROM will always require some other
send-command.

Perhaps we create a new command \copyfrom:

COPY :"myschema".:"mytable" :options FROM STDIN \copyfrom :"myfilename"

COPY :"myschema".:"mytable" :options FROM STDIN \copyfrom
:"my_complex_command" |

If we had something like that we might be able to replace all existing uses
of \copy.


Re: Change log level for notifying hot standby is waiting non-overflowed snapshot

2025-03-31 Thread Fujii Masao




On 2025/03/31 22:45, Yugo Nagata wrote:

I prefer this approach clarifying that consistency and subtransaction overflow
are separate concepts in the documentation.

Here are minor comments on the patch:


Thanks for the review!



-   case CAC_NOTCONSISTENT:
-   if (EnableHotStandby)
+   case CAC_NOTHOTSTANDBY:
+   if (!EnableHotStandby)
 ereport(FATAL,
 (errcode(ERRCODE_CANNOT_CONNECT_NOW),
  errmsg("the database system is not yet accepting 
connections"),
-errdetail("Consistent recovery state has not been yet 
reached.")));
+errdetail("Hot standby mode is disabled.")));
+   else if (reachedConsistency)
+   ereport(FATAL,
+   (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+errmsg("the database system is not accepting 
connections"),
+errdetail("Recovery snapshot is not yet ready for hot 
standby."),
+errhint("To enable hot standby, close write 
transactions with more than %d subtransactions on the primary server.",
+PGPROC_MAX_CACHED_SUBXIDS)));
 else
 ereport(FATAL,
 (errcode(ERRCODE_CANNOT_CONNECT_NOW),
  errmsg("the database system is not accepting 
connections"),
-errdetail("Hot standby mode is disabled.")));
+errdetail("Consistent recovery state has not been yet 
reached.")));

The message says "the database system is not yet accepting connections" when "Hot 
standby mode is disabled".
I think "yet" is not necessary in this case. Otherwise, when "Recovery snapshot is 
not yet ready for hot standby"
or "Consistent recovery state has not been yet reached", it seems better to use 
"yet"


I may have unintentionally modified the error message.
I fixed the patch as suggested. Please check the latest patch
I posted earlier in response to Torikoshi-san.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: general purpose array_sort

2025-03-31 Thread Tom Lane
Junwang Zhao  writes:
> On Mon, Mar 31, 2025 at 5:58 AM Tom Lane  wrote:
>> In v18, it's somewhat annoying that the typcache doesn't cache
>> the typarray field; we would not need a separate get_array_type()
>> lookup if it did.  I doubt there is any real reason for that except
>> that pg_type.typarray didn't exist when the typcache was invented.
>> So I'm tempted to add it.  But I looked at existing callers of
>> get_array_type() and none of them are adjacent to typcache lookups,
>> so only array_sort would be helped immediately.  I left it alone
>> for the moment; wonder if anyone else has an opinion?

> The need for `elmtyp` and `array_type` here because a column can
> have arrays with varying dimensions. Maybe other callers don't share
> this behavior?

Maybe.  I think some of what's going on here is that because for a
long time we only had pg_type.typelem and not pg_type.typarray,
code was written to not need to look up the array type if at all
possible.  So there are simply not that many users.  Anyway it
seems really cheap to add this field to the typcache now.

Attached 0001 is the same as v18, and then 0002 is the proposed
addition to typcache.
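
For what it's worth, an illustrative call based on the semantics documented in
0001 (descending sort, with nulls_first defaulting to the same value as
descending):

  SELECT array_sort(ARRAY[3, 1, NULL, 2], true);
  -- expected per the docs: {NULL,3,2,1}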

regards, tom lane

From 21bfc6f86a767a0ef774dbaf9b3f3b6168c15a27 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Mon, 31 Mar 2025 12:52:00 -0400
Subject: [PATCH v19 1/2] Introduce a SQL-callable function
 array_sort(anyarray).

Create a function that will sort the elements of an array
according to the element type's sort order.  If the array
has more than one dimension, the sub-arrays of the first
dimension are sorted per normal array-comparison rules,
leaving their contents alone.

Author: Junwang Zhao 
Co-authored-by: Jian He 
Reviewed-by: Aleksander Alekseev 
Discussion: https://postgr.es/m/caeg8a3j41a4dpw_-f94ff-jprxyxw-gfsgogotkcjs9lvfe...@mail.gmail.com
---
 doc/src/sgml/func.sgml|  36 
 src/backend/utils/adt/array_userfuncs.c   | 183 ++
 src/include/catalog/pg_proc.dat   |  12 ++
 src/test/regress/expected/arrays.out  | 142 ++
 .../regress/expected/collate.icu.utf8.out |  13 ++
 src/test/regress/sql/arrays.sql   |  36 
 src/test/regress/sql/collate.icu.utf8.sql |   4 +
 src/tools/pgindent/typedefs.list  |   1 +
 8 files changed, 427 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5bf6656deca..2129d027398 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20741,6 +20741,42 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sort
+
+array_sort (
+  array anyarray
+  , descending boolean
+  , nulls_first boolean
+   )
+anyarray
+   
+   
+Sorts the first dimension of the array.
+The sort order is determined by the default sort ordering of the
+array's element type; however, if the element type is collatable,
+the collation to use can be forced by adding
+a COLLATE clause to
+the array argument.
+   
+   
+If descending is true then sort in
+descending order, otherwise ascending order.  If omitted, the
+default is ascending order.
+If nulls_first is true then nulls appear
+before non-null values, otherwise nulls appear after non-null
+values.
+If omitted, nulls_first is taken to have
+the same value as descending.
+   
+   
+array_sort(ARRAY[[2,4],[2,1],[6,5]])
+{{2,1},{2,4},{6,5}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index 2aae2f8ed93..2a8ea974029 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -12,16 +12,19 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_operator_d.h"
 #include "catalog/pg_type.h"
 #include "common/int.h"
 #include "common/pg_prng.h"
 #include "libpq/pqformat.h"
+#include "miscadmin.h"
 #include "nodes/supportnodes.h"
 #include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
 #include "utils/typcache.h"
 
 /*
@@ -43,6 +46,18 @@ typedef struct DeserialIOData
 	Oid			typioparam;
 } DeserialIOData;
 
+/*
+ * ArraySortCachedInfo
+ *		Used for caching catalog data in array_sort
+ */
+typedef struct ArraySortCachedInfo
+{
+	ArrayMetaState array_meta;	/* metadata for array_create_iterator */
+	Oid			elem_lt_opr;	/* "<" operator for element type */
+	Oid			elem_gt_opr;	/* ">" operator for element type */
+	Oid			array_type;		/* pg_type OID of array type */
+} ArraySortCachedInfo;
+
 static Datum array_position_common(FunctionCallInfo fcinfo);
 
 
@@ -1858,3 +1873,171 @@ array_reverse(PG_FUNCTION_ARGS)
 
 	PG_RETURN_ARRAYTYPE_P(res

Re: Non-text mode for pg_dumpall

2025-03-31 Thread Andrew Dunstan


On 2025-03-31 Mo 1:16 PM, Andrew Dunstan wrote:




Thanks. Here are patches that contain (my version of) all the 
cleanups. With this I get a clean restore run in my test case with no 
error messages.






This time with patches


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From 6286701ff360ccb8c105fa5aa0a8f9bba3b1d1d7 Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor 
Date: Wed, 19 Mar 2025 01:18:46 +0530
Subject: [PATCH v20250331 1/3] Move common pg_dump code related to connections
 to a new file

ConnectDatabase is used by pg_dumpall, pg_restore and pg_dump so move
common code to new file.

new file name: connectdb.c

Author:Mahendra Singh Thalor 
---
 src/bin/pg_dump/Makefile |   5 +-
 src/bin/pg_dump/connectdb.c  | 294 +++
 src/bin/pg_dump/connectdb.h  |  26 +++
 src/bin/pg_dump/meson.build  |   1 +
 src/bin/pg_dump/pg_backup.h  |   6 +-
 src/bin/pg_dump/pg_backup_archiver.c |   6 +-
 src/bin/pg_dump/pg_backup_db.c   |  79 +--
 src/bin/pg_dump/pg_dump.c|   2 +-
 src/bin/pg_dump/pg_dumpall.c | 278 +
 9 files changed, 352 insertions(+), 345 deletions(-)
 create mode 100644 src/bin/pg_dump/connectdb.c
 create mode 100644 src/bin/pg_dump/connectdb.h

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index 233ad15ca75..fa795883e9f 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	compress_lz4.o \
 	compress_none.o \
 	compress_zstd.o \
+	connectdb.o \
 	dumputils.o \
 	filter.o \
 	parallel.o \
@@ -50,8 +51,8 @@ pg_dump: pg_dump.o common.o pg_dump_sort.o $(OBJS) | submake-libpq submake-libpg
 pg_restore: pg_restore.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
 	$(CC) $(CFLAGS) pg_restore.o $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
-pg_dumpall: pg_dumpall.o dumputils.o filter.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
-	$(CC) $(CFLAGS) pg_dumpall.o dumputils.o filter.o $(WIN32RES) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+pg_dumpall: pg_dumpall.o $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
+	$(CC) $(CFLAGS) pg_dumpall.o $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
 install: all installdirs
 	$(INSTALL_PROGRAM) pg_dump$(X) '$(DESTDIR)$(bindir)'/pg_dump$(X)
diff --git a/src/bin/pg_dump/connectdb.c b/src/bin/pg_dump/connectdb.c
new file mode 100644
index 000..9e593b70e81
--- /dev/null
+++ b/src/bin/pg_dump/connectdb.c
@@ -0,0 +1,294 @@
+/*-
+ *
+ * connectdb.c
+ *This is a common file connection to the database.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *src/bin/pg_dump/connectdb.c
+ *
+ *-
+ */
+
+#include "postgres_fe.h"
+
+#include "common/connect.h"
+#include "common/logging.h"
+#include "common/string.h"
+#include "connectdb.h"
+#include "dumputils.h"
+#include "fe_utils/string_utils.h"
+
+static char *constructConnStr(const char **keywords, const char **values);
+
+/*
+ * ConnectDatabase
+ *
+ * Make a database connection with the given parameters.  An
+ * interactive password prompt is automatically issued if required.
+ *
+ * If fail_on_error is false, we return NULL without printing any message
+ * on failure, but preserve any prompted password for the next try.
+ *
+ * On success, the 'connstr' is set to a connection string containing
+ * the options used and 'server_version' is set to version so that caller
+ * can use them.
+ */
+PGconn *
+ConnectDatabase(const char *dbname, const char *connection_string,
+const char *pghost, const char *pgport, const char *pguser,
+trivalue prompt_password, bool fail_on_error, const char *progname,
+const char **connstr, int *server_version, char *password,
+char *override_dbname)
+{
+	PGconn	   *conn;
+	bool		new_pass;
+	const char *remoteversion_str;
+	int			my_version;
+	const char **keywords = NULL;
+	const char **values = NULL;
+	PQconninfoOption *conn_opts = NULL;
+	int			server_version_temp;
+
+	if (prompt_password == TRI_YES && !password)
+		password = simple_prompt("Password: ", false);
+
+	/*
+	 * Start the connection.  Loop until we have a password if requested by
+	 * backend.
+	 */
+	do
+	{
+		int			argcount = 8;
+		PQconninfoOption *conn_opt;
+		char	   *err_msg = NULL;
+		int			i = 0;
+
+		free(keywords);
+		free(values);
+		PQconninfoFree(conn_opts);
+
+		/*
+		 * Merge the connection info inputs given in form of connection string
+		 * and other options.  Explicitly discard any dbname value in the
+		 * connection string; otherwise, PQconnectdbParams() would interpret
+		 * that value as being itself a connection string.
+		 */
+		if (connecti

Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-31 Thread Robert Haas
On Fri, Mar 28, 2025 at 2:42 PM Alvaro Herrera  wrote:
> Maybe the real problem here is that making the (valid) child constraint
> no longer local when the parent constraint is not valid is not sensible,
> precisely because pg_dump won't be able to produce good output.  That
> sounds more workable to me ... except that we'd have to ensure that
> validating the parent constraint would turn the child constraints as not
> local anymore, which might be a bit weird.  But maybe not weirder than
> the other approach.

It seems like a bad idea to make conislocal and coninhcount have
anything to do with whether the constraint is valid. We need those to
mean what they have traditionally meant just to make the correct
things happen when the constraint is dropped, either directly from the
child or at some ancestor of that child. If something is dropped at an
ancestor table, we cascade down the tree and decrement coninhcount. If
something is dropped at a child table, we can set conislocal=false.
The constraint goes away, IIUC, when coninhcount=0 and
conislocal=false, which directly corresponds to "this constraint no
longer has a remaining definition either locally or by inheritance". I
don't see how you can change much of anything here without breaking
the existing structure. Validity could have separate tracking of some
sort, possibly as elaborate as convalidlocal and convalidinhcount, but
I don't think it can get away with redefining the tracking that we
already have for existence.
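
As a minimal sketch (not actual PostgreSQL code, and ignoring locking and
dependency machinery), the lifetime rule described above amounts to:

#include "postgres.h"
#include "catalog/pg_constraint.h"

/*
 * Sketch only: a constraint row can go away once it is neither locally
 * defined nor inherited from any remaining parent.  Any validity tracking
 * would need its own fields (the convalidlocal/convalidinhcount names above
 * are hypothetical), not a redefinition of these two.
 */
static bool
constraint_is_droppable(Form_pg_constraint con)
{
	return !con->conislocal && con->coninhcount == 0;
}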

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Statistics Import and Export

2025-03-31 Thread Corey Huinker
>
> The second is that the pg_upgrade test (when run with
>> olddump/oldinstall) compares the before and after dumps, and if the
>> "before" version is 17, then it will not have the relallfrozen argument
>> to pg_restore_relation_stats. We might need a filtering step in
>> adjust_new_dumpfile?
>>
>
> That sounds trickier.
>

Narrator: It was not trickier.

In light of v11-0001 being committed as 4694aedf63bf, I've rebased the
remaining patches.
From 607984bdcc91fa31fb7a12e9b24fb8704aa14975 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Fri, 14 Mar 2025 01:06:19 -0400
Subject: [PATCH v12 1/3] Introduce CreateStmtPtr.

CreateStmtPtr is a function pointer that can replace the createStmt/defn
parameter. This is useful in situations where the amount of text
generated for a definition is so large that it is undesirable to hold
many such objects in memory at the same time.

Using functions of this type, the text created is then immediately
written out to the appropriate file for the given dump format.
---
 src/bin/pg_dump/pg_backup.h  |   2 +
 src/bin/pg_dump/pg_backup_archiver.c |  22 ++-
 src/bin/pg_dump/pg_backup_archiver.h |   7 +
 src/bin/pg_dump/pg_dump.c| 229 +++
 4 files changed, 158 insertions(+), 102 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 658986de6f8..fdcccd64a70 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -289,6 +289,8 @@ typedef int (*DataDumperPtr) (Archive *AH, const void *userArg);
 
 typedef void (*SetupWorkerPtrType) (Archive *AH);
 
+typedef char *(*CreateStmtPtr) (Archive *AH, const void *userArg);
+
 /*
  * Main archiver interface.
  */
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1d131e5a57d..1b4c62fd7d7 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1265,6 +1265,9 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
 	newToc->dataDumper = opts->dumpFn;
 	newToc->dataDumperArg = opts->dumpArg;
 	newToc->hadDumper = opts->dumpFn ? true : false;
+	newToc->createDumper = opts->createFn;
+	newToc->createDumperArg = opts->createArg;
+	newToc->hadCreateDumper = opts->createFn ? true : false;
 
 	newToc->formatData = NULL;
 	newToc->dataLength = 0;
@@ -2621,7 +2624,17 @@ WriteToc(ArchiveHandle *AH)
 		WriteStr(AH, te->tag);
 		WriteStr(AH, te->desc);
 		WriteInt(AH, te->section);
-		WriteStr(AH, te->defn);
+
+		if (te->hadCreateDumper)
+		{
+			char	   *defn = te->createDumper((Archive *) AH, te->createDumperArg);
+
+			WriteStr(AH, defn);
+			pg_free(defn);
+		}
+		else
+			WriteStr(AH, te->defn);
+
 		WriteStr(AH, te->dropStmt);
 		WriteStr(AH, te->copyStmt);
 		WriteStr(AH, te->namespace);
@@ -3877,6 +3890,13 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, const char *pfx)
 	{
 		IssueACLPerBlob(AH, te);
 	}
+	else if (te->hadCreateDumper)
+	{
+		char	   *ptr = te->createDumper((Archive *) AH, te->createDumperArg);
+
+		ahwrite(ptr, 1, strlen(ptr), AH);
+		pg_free(ptr);
+	}
 	else if (te->defn && strlen(te->defn) > 0)
 	{
 		ahprintf(AH, "%s\n\n", te->defn);
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index a2064f471ed..e68db633995 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -368,6 +368,11 @@ struct _tocEntry
 	const void *dataDumperArg;	/* Arg for above routine */
 	void	   *formatData;		/* TOC Entry data specific to file format */
 
+	CreateStmtPtr createDumper; /* Routine for create statement creation */
+	const void *createDumperArg;	/* arg for the above routine */
+	bool		hadCreateDumper;	/* Archiver was passed a create statement
+	 * routine */
+
 	/* working state while dumping/restoring */
 	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
@@ -407,6 +412,8 @@ typedef struct _archiveOpts
 	int			nDeps;
 	DataDumperPtr dumpFn;
 	const void *dumpArg;
+	CreateStmtPtr createFn;
+	const void *createArg;
 } ArchiveOpts;
 #define ARCHIVE_OPTS(...) &(ArchiveOpts){__VA_ARGS__}
 /* Called to add a TOC entry */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4ca34be230c..cc195d6cd9e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -10560,42 +10560,44 @@ appendNamedArgument(PQExpBuffer out, Archive *fout, const char *argname,
 }
 
 /*
- * dumpRelationStats --
+ * printDumpRelationStats --
  *
- * Dump command to import stats into the relation on the new database.
+ * Generate the SQL statements needed to restore a relation's statistics.
  */
-static void
-dumpRelationStats(Archive *fout, const RelStatsInfo *rsinfo)
+static char *
+printRelationStats(Archive *fout, const void *userArg)
 {
+	const RelStatsInfo *rsinfo = (RelStatsInfo *) userArg;
 	const DumpableObject *dobj = &rsinfo->dobj;
+
+	PQExpBufferData query;
+	PQExpBuffe

Re: Draft for basic NUMA observability

2025-03-31 Thread Bertrand Drouvot
Hi,

On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
> On Thu, Mar 27, 2025 at 2:40 PM Andres Freund  wrote:
> >
> > > +Size
> > > +pg_numa_get_pagesize(void)
> [..]
> >
> > Should this have a comment or an assertion that it can only be used after
> > shared memory startup? Because before that huge_pages_status won't be
> > meaningful?
> 
> Added both.

Thanks for the updated version!

+   Assert(IsUnderPostmaster);

I wonder if it would make more sense to add an assertion that huge_pages_status
is not HUGE_PAGES_UNKNOWN instead (more or less as is done in
CreateSharedMemoryAndSemaphores()).

=== About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch

Once applied, I can see a mention of pg_buffercache_numa_pages(), while it
only arrives in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch.

I think pg_buffercache_numa_pages() should not be mentioned before it's actually
implemented.

=== 1

+   bufRecord->isvalid == false)
{
int i;

-   funcctx = SRF_FIRSTCALL_INIT();
-
-   /* Switch context when allocating stuff to be used in later 
calls */
-   oldcontext = 
MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
-   /* Create a user function context for cross-call persistence */
-   fctx = (BufferCachePagesContext *) 
palloc(sizeof(BufferCachePagesContext));
+   for (i = 1; i <= 9; i++)
+   nulls[i] = true; 

"i <= 9" will be correct only once v17-0003 is applied (when 
NUM_BUFFERCACHE_PAGES_ELEM
is increased to 10).

In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).

It could also make sense to remove the loop and use memset(), like this:

"
memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
"

instead. It's done that way in some other places (hbafuncs.c for example).

=== 2

-   if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
-   TupleDescInitEntry(tupledesc, (AttrNumber) 9, 
"pinning_backends",
-  INT4OID, -1, 0);

+   if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+   TupleDescInitEntry(tupledesc, (AttrNumber) 9, 
"pinning_backends",
+  INT4OID, -1, 0);

I think we should not change the "expected_tupledesc->natts" check here until
v17-0003 is applied.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Prevent internal error at concurrent CREATE OR REPLACE FUNCTION

2025-03-31 Thread Yugo Nagata
On Mon, 31 Mar 2025 20:00:57 +0900
Yugo Nagata  wrote:

> Hi,
> 
> I found that when multiple sessions concurrently execute CREATE OR REPLACE FUNCTION
> for the same function, the error "tuple concurrently updated" is raised. This is
> an internal error output by elog, and the message is not user-friendly.
> 
> I've attached a patch that prevents this internal error by taking an exclusive
> lock before the command and re-reading the tuple after the lock is acquired.
> Also, if the function has been removed while waiting for the lock, a new entry
> is created.

I also found that the same error is raised when concurrent ALTER FUNCTION commands
are executed. I've added a patch to fix this in a similar way.

Regards,
Yugo Nagata

-- 
Yugo Nagata 
>From e53f318dd42f909011227d3b11e8345ab7cc5641 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 31 Mar 2025 20:17:07 +0900
Subject: [PATCH 2/2] Prevent internal error at concurrent ALTER FUNCTION

---
 src/backend/commands/functioncmds.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index b9fd7683abb..3b9d9bb098c 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -61,6 +61,7 @@
 #include "parser/parse_func.h"
 #include "parser/parse_type.h"
 #include "pgstat.h"
+#include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
@@ -1384,6 +1385,22 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
 	if (!HeapTupleIsValid(tup)) /* should not happen */
 		elog(ERROR, "cache lookup failed for function %u", funcOid);
 
+	/* Lock the function so nobody else can do anything with it. */
+	LockDatabaseObject(ProcedureRelationId, funcOid, 0, AccessExclusiveLock);
+
+	/*
+	 * It is possible that by the time we acquire the lock on function,
+	 * concurrent DDL has removed it. We can test this by checking the
+	 * existence of function. We get the tuple again to avoid the risk
+	 * of function definition getting changed.
+	 */
+	tup = SearchSysCacheCopy1(PROCOID, ObjectIdGetDatum(funcOid));
+	if (!HeapTupleIsValid(tup))
+		ereport(ERROR,
+errcode(ERRCODE_UNDEFINED_OBJECT),
+errmsg("function \"%s\" does not exist",
+	   NameListToString(stmt->func->objname)));
+
 	procForm = (Form_pg_proc) GETSTRUCT(tup);
 
 	/* Permission check: must own function */
-- 
2.34.1

>From 49c890eb8cb115586e0e92294fb2fec60d80b8be Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 31 Mar 2025 18:46:58 +0900
Subject: [PATCH 1/2] Prevent internal error at concurrent CREATE OR REPLACE
 FUNCTION

---
 src/backend/catalog/pg_proc.c | 40 ---
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index fe0490259e9..cfd2e08ea23 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -35,6 +35,7 @@
 #include "parser/parse_coerce.h"
 #include "pgstat.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
@@ -355,24 +356,45 @@ ProcedureCreate(const char *procedureName,
 	tupDesc = RelationGetDescr(rel);
 
 	/* Check for pre-existing definition */
-	oldtup = SearchSysCache3(PROCNAMEARGSNSP,
-			 PointerGetDatum(procedureName),
-			 PointerGetDatum(parameterTypes),
-			 ObjectIdGetDatum(procNamespace));
+	oldtup = SearchSysCacheCopy3(PROCNAMEARGSNSP,
+ PointerGetDatum(procedureName),
+ PointerGetDatum(parameterTypes),
+ ObjectIdGetDatum(procNamespace));
 
 	if (HeapTupleIsValid(oldtup))
 	{
 		/* There is one; okay to replace it? */
 		Form_pg_proc oldproc = (Form_pg_proc) GETSTRUCT(oldtup);
-		Datum		proargnames;
-		bool		isnull;
-		const char *dropcmd;
 
 		if (!replace)
 			ereport(ERROR,
 	(errcode(ERRCODE_DUPLICATE_FUNCTION),
 	 errmsg("function \"%s\" already exists with same argument types",
 			procedureName)));
+
+		/* Lock the function so nobody else can do anything with it. */
+		LockDatabaseObject(ProcedureRelationId, oldproc->oid, 0, AccessExclusiveLock);
+
+		/*
+		 * It is possible that by the time we acquire the lock on function,
+		 * concurrent DDL has removed it. We can test this by checking the
+		 * existence of function. We get the tuple again to avoid the risk
+		 * of function definition getting changed.
+		 */
+		oldtup = SearchSysCacheCopy3(PROCNAMEARGSNSP,
+	 PointerGetDatum(procedureName),
+	 PointerGetDatum(parameterTypes),
+	 ObjectIdGetDatum(procNamespace));
+	}
+
+	if (HeapTupleIsValid(oldtup))
+	{
+
+		Form_pg_proc oldproc = (Form_pg_proc) GETSTRUCT(oldtup);
+		Datum		proargnames;
+		bool		isnull;
+		const char *dropcmd;
+
 		if (!object_ownercheck(ProcedureRelationId, oldproc->oid, proowner))
 			aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_FUNCTION,
 		   procedureName);
@@ -553,11 +5

Re: Orphaned users in PG16 and above can only be managed by Superusers

2025-03-31 Thread Robert Haas
On Mon, Mar 24, 2025 at 2:18 AM Ashutosh Sharma  wrote:
> Thank you, Robert and Tom, for sharing your valuable insights, and
> apologies for the slight delay in my response. From the discussion,
> what I understand is that we aim to extend the current DROP ROLE
> syntax to include the CASCADE/RESTRICT option, which has been
> introduced in the latest SQL standard, as mentioned by Tom. However,
> as Robert pointed out, implementing the CASCADE option for handling
> dependent objects that span multiple databases is not feasible at the
> moment. The RESTRICT option, as I understand it, is already the
> default behavior. Therefore, we will proceed with implementing the
> CASCADE/RESTRICT syntax specifically for handling dependent roles,
> rather than the dependent database objects like tables, views, etc.,
> which can span multiple databases.

I would be fairly strongly opposed to implementing RESTRICT and
CASCADE but having them only apply to role grants and not other
database objects.

I'm not entirely certain what the right thing to do is here, but I
don't think it's that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add pg_buffercache_evict_all() and pg_buffercache_mark_dirty[_all]() functions

2025-03-31 Thread Nazir Bilal Yavuz
Hi,

On Mon, 31 Mar 2025 at 16:37, Aidar Imamov  wrote:
>
> Hi!
>
> I've reviewed the latest version of the patches and found a few things
> worth
> discussing. This is probably my final feedback on the patches at this
> point.
> Maybe Joseph has something to add.

Thank you so much for the reviews, these were very helpful!

> After this discussion, I think it would be helpful if one of the more
> experienced
> hackers could take a look at the overall picture (perhaps we should set
> the
> status "Ready for Committer"? To be honest, I'm not sure what step that
> status
> should be set at).

I agree. I will mark this as 'ready for committer' after writing the tests.

> pg_buffercache--1.5--1.6.sql:
> It seems that there is no need for the OR REPLACE clause after the
> pg_buffercache_evict() function has been dropped.
>
> Maybe we should remove the RETURNS clause from function declarations
> that have
> OUT parameters?
> Doc: "When there are OUT or INOUT parameters, the RETURNS clause can be
> omitted"

You are right, both are done.

> pg_buffercache_evict_relation():
> The function is marked as STRICT so I think we do not need for redundant
> check:
> > if (PG_ARGISNULL(0))
> >  ereport(ERROR,
> >(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > errmsg("relation cannot be null")));
> Doc: "returns null whenever any of its arguments are null", "the
> function is not
> executed when there are null arguments;".

Done.

> EvictUnpinnedBuffer():
> Maybe edit this comment a little (just not to repeat the sentences):
> > buf_state is used to check if the buffer header lock has been acquired
> > before
> > calling this function. If buf_state has a non-zero value, it means that
> > the buffer
> > header has already been locked. This information is useful for evicting
> > specific
> > relation's buffers, as the buffers headers need to be locked before
> > this function
> > can be called with such a intention.

This is better, done.

> EvictUnpinnedBuffer() & EvictRelUnpinnedBuffers():
> Why did you decide to use the assert to check for NULLs?
> I understand that currently, the use of these functions is limited to a
> specific set
> of calls through the pg_buffercache interface. However, this may not
> always be the
> case. Wouldn't it be better to allow users to choose whether or not they
> want to
> receive additional information? For example, you could simply ignore any
> arguments
> that are passed as NULL.

I do not have a strong opinion on this. I agree that these functions could
be used elsewhere later, but it is easier to ignore these values at the call
site than to handle NULLs inside the functions. If we want to account for
the NULL possibility in EvictUnpinnedBuffer() & EvictRelUnpinnedBuffers(),
we need to write additional if clauses, which I think increases the
complexity unnecessarily.
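
For comparison, the alternative being suggested would look roughly like this
inside the functions (a sketch; the out-parameter name is hypothetical):

	/* Current approach, as described above: callers must pass valid pointers. */
	Assert(buffers_flushed != NULL);
	(*buffers_flushed)++;

	/* Alternative: silently skip out-parameters that are passed as NULL. */
	if (buffers_flushed != NULL)
		(*buffers_flushed)++;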

> Additionally, I believe it would be beneficial to include some
> regression tests to
> check at least the following cases: return type, being a superuser, bad
> buffer id,
> local buffers case.

You are right, I will try to write the tests but I am sharing the new
version without tests to speed things up.

> Also, there's a little thing about declaring functions as PARALLEL SAFE.
> To be honest,
> I don't really have any solid arguments against it. I just have some
> doubts. For
> example, how will it work if the plan is split up and we try to work on
> an object in
> one part, while the other part of the plan evicts the pages of that
> object or marks
> them as dirty... I can't really say for sure about that. And in that
> context, I'd
> suggest referring to that awesome statement in the documentation: "If in
> doubt,
> functions should be labeled as UNSAFE, which is the default."

You may be right. I thought they were parallel safe since we acquire the
buffer header lock for each buffer, but now I have the same doubts as you.
I want to hear other people's opinions on that before labeling them as
UNSAFE.

v5 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft


v5-0001-Add-pg_buffercache_evict_-relation-all-functions-.patch
Description: Binary data


v5-0002-Return-buffer-flushed-information-in-pg_buffercac.patch
Description: Binary data


v5-0003-Add-pg_buffercache_mark_dirty-_all-functions-for-.patch
Description: Binary data


Re: Make COPY format extendable: Extract COPY TO format implementations

2025-03-31 Thread David G. Johnston
On Wed, Mar 26, 2025 at 8:28 PM Sutou Kouhei  wrote:

>
> > ---
> > +static const CopyFromRoutine CopyFromRoutineTestCopyFormat = {
> > +.type = T_CopyFromRoutine,
> > +.CopyFromInFunc = CopyFromInFunc,
> > +.CopyFromStart = CopyFromStart,
> > +.CopyFromOneRow = CopyFromOneRow,
> > +.CopyFromEnd = CopyFromEnd,
> > +};
> >
>
>
In trying to document the current API I'm strongly disliking it.  Namely,
the amount of internal code an extension needs to care about/copy-paste to
create a working handler.

Presently, pg_type defines text and binary I/O routines and the text/csv
formats use the text I/O while binary uses binary I/O - for all
attributes.  The CopyFromInFunc API allows for each attribute to somehow
have its I/O format individualized.  But I don't see how that is practical
or useful, and it adds burden on API users.

I suggest we remove both .CopyFromInFunc and .CopyFromStart/End and add a
property to CopyFromRoutine (.ioMode?) with values of either Copy_IO_Text
or Copy_IO_Binary and then just branch to either:

CopyFromTextLikeInFunc & CopyFromTextLikeStart/End
or
CopyFromBinaryInFunc & CopyFromStart/End

So, in effect, the only method an extension needs to write is the conversion
between the 'serialized' form and the text/binary form (text being near
unanimous).
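
Concretely, the handler struct being proposed would shrink to something like
this (a sketch; the names here are hypothetical, not the current patch's API):

#include "commands/copy.h"		/* CopyFromState */
#include "nodes/nodes.h"		/* NodeTag */

/* Sketch of a slimmer handler API; all names are illustrative only. */
typedef enum CopyIOMode
{
	Copy_IO_Text,				/* core sets up the types' text input functions */
	Copy_IO_Binary				/* core sets up the types' binary input functions */
} CopyIOMode;

typedef struct CopyFromRoutine
{
	NodeTag		type;
	CopyIOMode	ioMode;			/* which built-in I/O setup to use */

	/*
	 * Parse one row of input into one string per attribute (a NULL pointer
	 * meaning SQL NULL); core code then applies defaults, ON_ERROR handling
	 * and the per-type input function calls in a single loop.
	 */
	bool		(*CopyFromOneRow) (CopyFromState cstate, char **fields);
} CopyFromRoutine;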

In a similar manner, the amount of boilerplate within CopyFromOneRow seems
undesirable from an API perspective.

cstate->cur_attname = NameStr(att->attname);
cstate->cur_attval = string;

if (string != NULL)
nulls[m] = false;

if (cstate->defaults[m])
{
/* We must have switched into the per-tuple memory context */
Assert(econtext != NULL);
Assert(CurrentMemoryContext == econtext->ecxt_per_tuple_memory);

values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
}

/*
* If ON_ERROR is specified with IGNORE, skip rows with soft errors
*/
else if (!InputFunctionCallSafe(&in_functions[m],
string,
typioparams[m],
att->atttypmod,
(Node *) cstate->escontext,
&values[m]))
{
CopyFromSkipErrorRow(cstate);
return true;
}

cstate->cur_attname = NULL;
cstate->cur_attval = NULL;

It seems to me that CopyFromOneRow could simply produce a *string collection,
one cell per attribute, and NextCopyFrom could do all of the above in a
for-loop over that *string collection.

The pg_type and pg_proc catalogs are not extensible, so the API can and
should be limited to producing the (usually text) values that are ready to be
passed into the text I/O routines, with all the work of finding and using
types and functions left in the template code.

I haven't looked at COPY TO, but I am expecting much the same. The API should
simply receive the content of the type I/O output routine (binary or text, as
it dictates) for each output attribute, row by row, and be expected to take
those values and produce a final output from them.

David J.


Re: Non-text mode for pg_dumpall

2025-03-31 Thread Álvaro Herrera
Hi

FWIW I don't think the on_exit_nicely business is in final shape just
yet.  We're doing something super strange and novel about keeping track
of an array index, so that we can modify it later.  Or something like
that, I think?  That doesn't sound all that nice to me.  Elsewhere it
was suggested that we need some way to keep track of the list of things
that need cleanup (a list of connections IIRC?) -- perhaps in a
thread-local variable or a global or something -- and we install the
cleanup function once, and that reads from the variable.  The program
can add things to the list, or remove them, at will; and we don't need
to modify the cleanup function in any way.
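
A minimal sketch of that shape, with hypothetical helper names, could look
like this:

#include "postgres_fe.h"

#include "fe_utils/simple_list.h"
#include "libpq-fe.h"
#include "pg_backup_utils.h"	/* on_exit_nicely() */

/* Sketch: one process-wide list of live connections needing cleanup. */
static SimplePtrList live_connections = {NULL, NULL};
static bool cleanup_installed = false;

/* Installed once; reads whatever happens to be in the list at exit time. */
static void
shutdown_connections(int code, void *arg)
{
	for (SimplePtrListCell *cell = live_connections.head; cell; cell = cell->next)
		PQfinish((PGconn *) cell->ptr);
}

static void
remember_connection(PGconn *conn)
{
	if (!cleanup_installed)
	{
		on_exit_nicely(shutdown_connections, NULL);
		cleanup_installed = true;
	}
	simple_ptr_list_append(&live_connections, conn);
}

Removing an entry when a connection is closed normally would need a matching
helper, but the installed callback itself never has to change.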

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Memoize ANTI and SEMI JOIN inner

2025-03-31 Thread Andrei Lepikhov

On 3/31/25 12:18, Richard Guo wrote:

On Mon, Mar 31, 2025 at 6:46 PM Andrei Lepikhov  wrote:
  Nested Loop
->  Seq Scan on t t2
->  Nested Loop
  ->  Seq Scan on t t1
  ->  Subquery Scan on t3
Filter: ((t2.a = t3.a) AND (t1.b = t3.b))
->  Seq Scan on t t3_1
(7 rows)

t3's ppi_clauses includes "t2.a = t3.a" and "t1.b = t3.b", while t1/t3
join's restrictlist only includes "t1.b = t3.b".

I attempted to make your query a bit closer to our case:

SET enable_mergejoin = f;
SET enable_hashjoin = f;
SET enable_material = f;
CREATE INDEX ON t(a);

explain (costs off)
select * from t t1 join t t2
  on EXISTS (select *, t1.a+t2.a as x from t t3
WHERE t2.a = t3.a AND t1.b = t3.b);

and I still don't get the case. As I see it, the ANTI/SEMI join just
transforms into a regular join, and it is still not the case. Could you be
more specific?


--
regards, Andrei Lepikhov




Re: per backend WAL statistics

2025-03-31 Thread Michael Paquier
> On Mar 31, 2025, at 16:42, Bertrand Drouvot  
> wrote:
> I think we can simply move the pgstat_fetch_stat_backend() call at the end
> of pgstat_fetch_stat_backend_by_pid(), like in the attached. With this in 
> place
> the issue is fixed on my side.

Thanks for the patch, I’ll check all that next week.
--
Michael





Re: SQLFunctionCache and generic plans

2025-03-31 Thread Alexander Pyhalov

Hi.

Tom Lane писал(а) 2025-03-30 19:10:

I spent some time reading and reworking this code, and have
arrived at a patch set that I'm pretty happy with.  I'm not
sure it's quite committable but it's close:




0005: This extracts the RLS test case you had and commits it
with the old non-failing behavior, just so that we can see that
the new code does it differently.  (I didn't adopt your test
from rules.sql, because AFAICS it works the same with or without
this patch set.  What was the point of that one again?)


The test was introduced after my failure to handle the case when
execution_state is NULL in the fcache->func_state list because the statement
was completely removed by an INSTEAD rule. After finding this issue, I added
a test to cover it.

Not sure if it should be preserved.



0006: The guts of the patch.  I couldn't break this down any
further.

One big difference from what you had is that there is only one path
of control: we always use the plan cache.  The hack you had to not
use it for triggers was only needed because you didn't include the
right cache key items to distinguish different trigger usages, but
the code coming from plpgsql has that right.



Yes, now it looks much more consistent. Still going through it. Will
do some additional testing here. So far have checked all known corner
cases and found no issues.


Also, the memory management is done a bit differently.  The
"fcontext" memory context holding the SQLFunctionCache struct is
now discarded at the end of each execution of the SQL function,
which considerably alleviates worries about leaking memory there.
I invented a small "SQLFunctionLink" struct that is what fn_extra
points at, and it survives as long as the FmgrInfo does, so that's
what saves us from redoing hash key computations in most cases.


I've looked through it and ran some tests, including the ones that originally
caused me to create a separate context for planning. I was a bit worried that
it was gone, but now that fcache->fcontext is deleted at the end of function
execution, I don't see the leaks that were the initial reason for introducing
it.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Get rid of WALBufMappingLock

2025-03-31 Thread Yura Sokolov
Good day,

14.03.2025 17:30, Tomas Vondra wrote:
> Hi,
> 
> I've briefly looked at this patch this week, and done a bit of testing.
> I don't have any comments about the correctness - it does seem correct
> to me and I haven't noticed any crashes/issues, but I'm not familiar
> with the WALBufMappingLock enough to have insightful opinions.
> 
> I have however decided to do a bit of benchmarking, to better understand
> the possible benefits of the change. I happen to have access to an Azure
> machine with 2x AMD EPYC 9V33X (176 cores in total), and NVMe SSD that
> can do ~1.5GB/s.
> 
> The benchmark script (attached) uses the workload mentioned by Andres
> some time ago [1]
> 
>SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
> 
> with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> results look like this (this is throughput):
> 
>|  8 |  64|  1024
>   clients  |  master   patched  |  master   patched  |  master  patched
>   -
> 1  |   11864 12035  |7419  7345  | 968  940
> 4  |   26311 26919  |   12414 12308  |1304 1293
> 8  |   38742 39651  |   14316 14539  |1348 1348
>16  |   57299 59917  |   15405 15871  |1304 1279
>32  |   74857 82598  |   17589 17126  |1233 1233
>48  |   87596 95495  |   18616 18160  |1199 1227
>64  |   89982 97715  |   19033 18910  |1196 1221
>96  |   92853103448  |   19694 19706  |1190 1210
>   128  |   95392103324  |   20085 19873  |1188 1213
>   160  |   94933102236  |   20227 20323  |1180 1214
>   196  |   95933103341  |   20448 20513  |1188 1199
> 
> To put this into a perspective, this throughput relative to master:
> 
>   clients  | 8  64 1024
>   --
> 1  |  101% 99%  97%
> 4  |  102% 99%  99%
> 8  |  102%102% 100%
>16  |  105%103%  98%
>32  |  110% 97% 100%
>48  |  109% 98% 102%
>64  |  109% 99% 102%
>96  |  111%100% 102%
>   128  |  108% 99% 102%
>   160  |  108%100% 103%
>   196  |  108%100% 101%
> 
> That does not seem like a huge improvement :-( Yes, there's 1-10%
> speedup for the small (8K) size, but for larger chunks it's a wash.
> 
> Looking at the pgbench progress, I noticed stuff like this:
> 
> ...
> progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> ...
> 
> i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> things (it's a virtual SSD, I'm not sure what model is that behind the
> curtains). So I decided to try running the benchmark on tmpfs, to get
> the storage out of the way and get the "best case" results.
> 
> This makes the pgbench progress perfectly "smooth" (no jumps like in the
> output above), and the comparison looks like this:
> 
>|  8  |  64| 1024
>   clients  |  masterpatched  |  master   patched  | master  patched
>   -|-||
> 1  |   32449  32032  |   19289 20344  |   3108 3081
> 4  |   68779  69256  |   24585 29912  |   2915 3449
> 8  |   79787 100655  |   28217 39217  |   3182 4086
>16  |  113024 148968  |   42969 62083  |   5134 5712
>32  |  125884 170678  |   44256 71183  |   4910 5447
>48  |  125571 166695  |   44693 76411  |   4717 5215
>64  |  122096 160470  |   42749 83754  |   4631 5103
>96  |  120170 154145  |   42696 86529  |   4556 5020
>   128  |  119204 152977  |   4

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-31 Thread Bertrand Drouvot
Hi Kuroda-san and Amit,

On Fri, Mar 28, 2025 at 09:02:29AM +, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit,
> 
> > 
> > Right, I think this is a better idea.

I like it too and the bonus point is that this injection point can be used
in more tests (more use cases).

A few comments:

 About v2-0001-Stabilize

=== 1

s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/?

=== 2 (Nit)

/* For testing slot invalidation due to the conflict */

Not sure "due to the conflict" is needed.

 About PG17-v2-0001

=== 3

The commit message still mentions injection point.

=== 4

-# Note that pg_current_snapshot() is used to get the horizon.  It does
-# not generate a Transaction/COMMIT WAL record, decreasing the risk of
-# seeing a xl_running_xacts that would advance an active replication slot's
-# catalog_xmin.  Advancing the active replication slot's catalog_xmin
-# would break some tests that expect the active slot to conflict with
-# the catalog xmin horizon.

I'd be tempted to not remove this comment but reword it a bit instead. Something
like?

# Note that pg_current_snapshot() is used to get the horizon.  It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin.  Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon. We ensure that active replication slots are not
# created for tests that might produce this race condition though. 

=== 5

The invalidation checks for active slots are kept for the wal_level case. Also,
active slots are still created to test that logical decoding on the standby
behaves correctly when no conflict is expected, and for the promotion.

The above makes sense to me.

=== 6 (Nit)

In drop_logical_slots(), s/needs_active_slot/drop_active_slot/?

=== 7 (Nit)

In check_slots_conflict_reason(), s/needs_active_slot/checks_active_slot/?

 About PG16-v2-0001

Same as for PG17-v2-0001.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Add partial :-variable expansion to psql \copy

2025-03-31 Thread Fabien COELHO
Hello,

I've been bitten by psql's \copy lack of variable expansion, in a limited-access
docker-inside-VM context where COPY is not a viable option and hardwired names
are not desirable. The attached patch allows \copy to use variables' values in
place of table and file names:

```psql
\set table 'some table'
\set input 'some file name.csv'
\copy :"table" from :'input' with (format csv)
```

--
Fabien.


Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints

2025-03-31 Thread jian he
hi.
in notnull-notvalid.patch

+ if (coninfo->contype == 'c')
+ keyword = "CHECK CONSTRAINT";
+ else
+ keyword = "INVALID NOT NULL CONSTRAINT";

we have a new TocEntry->desc kind, so the following related code within
src/bin/pg_dump also needs to change:


if (strcmp(te->desc, "CONSTRAINT") == 0 ||
strcmp(te->desc, "CHECK CONSTRAINT") == 0 ||
strcmp(te->desc, "FK CONSTRAINT") == 0)
strcpy(buffer, "DROP CONSTRAINT");
else
snprintf(buffer, sizeof(buffer), "DROP %s",
 te->desc);

else if (strcmp(te->desc, "CONSTRAINT") == 0 ||
 strcmp(te->desc, "CHECK CONSTRAINT") == 0 ||
 strcmp(te->desc, "FK CONSTRAINT") == 0 ||
 strcmp(te->desc, "INDEX") == 0 ||
 strcmp(te->desc, "RULE") == 0 ||
 strcmp(te->desc, "TRIGGER") == 0)
te->section = SECTION_POST_DATA;

/* these object types don't have separate owners */
else if (strcmp(type, "CAST") == 0 ||
 strcmp(type, "CHECK CONSTRAINT") == 0 ||
 strcmp(type, "CONSTRAINT") == 0 ||
 strcmp(type, "DATABASE PROPERTIES") == 0 ||
 strcmp(type, "DEFAULT") == 0 ||
 strcmp(type, "FK CONSTRAINT") == 0 ||
 strcmp(type, "INDEX") == 0 ||
 strcmp(type, "RULE") == 0 ||
 strcmp(type, "TRIGGER") == 0 ||
 strcmp(type, "ROW SECURITY") == 0 ||
 strcmp(type, "POLICY") == 0 ||
 strcmp(type, "USER MAPPING") == 0)
{
/* do nothing */
}
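
For instance (a sketch of the kind of change meant here), the first of those
checks would presumably become:

if (strcmp(te->desc, "CONSTRAINT") == 0 ||
    strcmp(te->desc, "CHECK CONSTRAINT") == 0 ||
    strcmp(te->desc, "INVALID NOT NULL CONSTRAINT") == 0 ||
    strcmp(te->desc, "FK CONSTRAINT") == 0)
    strcpy(buffer, "DROP CONSTRAINT");
else
    snprintf(buffer, sizeof(buffer), "DROP %s",
             te->desc);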




Re: general purpose array_sort

2025-03-31 Thread Junwang Zhao
Hi Tom,

On Mon, Mar 31, 2025 at 5:58 AM Tom Lane  wrote:
>
> I spent some time looking at the v17 patchset.  There were some pretty
> strange things in it --- why were some of the variants of array_sort()
> marked as volatile, for example?

I think this was due to some copy-paste of the code nearby.

> But the two things I'd like to
> suggest functionality-wise are:
>
> * The second argument of the variants with booleans should be defined
> as true=descending, not true=ascending.  It seems a little odd to me
> for the default of a boolean option not to be "false".  Also, then
> you don't need an inversion between the second and third arguments.
> I'm not dead set on this but it just seems a little cleaner.
>
Agreed.

> * I see that the code is set up to detect an unsortable input type
> before it takes the fast exit for "no sort required".  I think this
> is poor engineering: we ought to make the fast path as fast as
> possible.  The can't-sort case is so rare in real-world usage that
> I do not think it matters if the error isn't thrown by every possible
> call.  Besides which, it is inconsistent anyway: consider
> SELECT array_sort(NULL::xid[]);
> which will not error because it will never reach the C code.  Why's
> that okay but delivering an answer for "array_sort('{1}'::xid[])"
> is not?  I think "throw error only if we must sort and cannot" is
> a perfectly fine definition.
Agreed.

>
> At the code level, I didn't like the way that the multiple entry
> points were set up.  I think it's generally cleaner code to have
> a worker function with plain C call and return coding and make
> all the SQL-visible functions be wrappers around that.  Also the
> caching mechanism was overcomplicated, in particular because we
> do not need a cache lookup to know which sort operators apply to
> arrays.

Agreed, your refactor made the code cleaner.

>
> So all that leads me to v18 attached.  (I merged the two patches
> into one, didn't see much value in splitting them.)
>
> In v18, it's somewhat annoying that the typcache doesn't cache
> the typarray field; we would not need a separate get_array_type()
> lookup if it did.  I doubt there is any real reason for that except
> that pg_type.typarray didn't exist when the typcache was invented.
> So I'm tempted to add it.  But I looked at existing callers of
> get_array_type() and none of them are adjacent to typcache lookups,
> so only array_sort would be helped immediately.  I left it alone
> for the moment; wonder if anyone else has an opinion?

The need for `elmtyp` and `array_type` here is because a column can
have arrays with varying dimensions. Maybe other callers don't share
this behavior?
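
The caching pattern in question looks roughly like this (a sketch of the
patch's approach, not the actual code), which is where the extra
get_array_type() lookup comes from:

/* Sketch: refresh cached sort metadata only when the element type changes. */
static ArraySortCachedInfo *
array_sort_get_cache(FunctionCallInfo fcinfo, Oid elmtyp)
{
	ArraySortCachedInfo *info = (ArraySortCachedInfo *) fcinfo->flinfo->fn_extra;

	if (info == NULL)
	{
		info = (ArraySortCachedInfo *)
			MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt,
								   sizeof(ArraySortCachedInfo));
		fcinfo->flinfo->fn_extra = info;
	}

	if (info->array_meta.element_type != elmtyp)
	{
		TypeCacheEntry *typentry = lookup_type_cache(elmtyp,
													 TYPECACHE_LT_OPR |
													 TYPECACHE_GT_OPR);

		info->array_meta.element_type = elmtyp;
		info->elem_lt_opr = typentry->lt_opr;
		info->elem_gt_opr = typentry->gt_opr;
		/* typcache does not carry typarray, hence this separate lookup */
		info->array_type = get_array_type(elmtyp);
	}

	return info;
}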


>
> regards, tom lane
>


-- 
Regards
Junwang Zhao




Re: Test to dump and restore objects left behind by regression

2025-03-31 Thread Ashutosh Bapat
On Fri, Mar 28, 2025 at 11:43 PM Alvaro Herrera  wrote:
>
> On 2025-Mar-28, Tom Lane wrote:
>
> > I think instead of going this direction, we really need to create a
> > separately-purposed script that simply creates "one of everything"
> > without doing anything else (except maybe loading a little data).
> > I believe it'd be a lot easier to remember to add to that when
> > inventing new SQL than to remember to leave something behind from the
> > core regression tests.  This would also be far faster to run than any
> > approach that involves picking a random subset of the core test
> > scripts.

It's easier to remember to do something or not do something in the
same file than in some other file. I find it hard to believe that
introducing another set of SQL files somewhere far from regress would
make this problem easier.

The number of states in which objects can be left behind in the
regress/sql is very large - and maintaining that 1:1 in some other set
of scripts is impossible unless it's automated.

> FWIW this sounds closely related to what I tried to do with
> src/test/modules/test_ddl_deparse; it's currently incomplete, but maybe
> we can use that as a starting point.

create_table.sql in test_ddl_deparse has only one statement creating
an inheritance table whereas there are dozens of different states of
parent/child tables created by regress. It will require a lot of work
to bridge the gap between regress_ddl_deparse and regress and more
work to maintain it.

I might be missing something in your ideas.

IMO, whatever we do it should rely on a single set of files. One
possible way could be to break the existing files into three files
each, containing DDL, DML and queries from those files respectively
and create three schedules DDL, DML and queries containing the
respective files. These schedules will be run as required. Standard
regression run runs all the three schedules one by one. But
002_pg_upgrade will run DDL and DML on the source database and run
queries on target - thus checking sanity of the dump/restore or
pg_upgrade beyond just the dump comparison. 027_stream_regress might
run DDL, DML on the source server and queries on the target.

But that too is easier said than done for:
1. Our tests mix all three kinds of statements and also rely on the
order in which they are run. It will require some significant effort
to carefully separate the statements.
2. With the new set of files backpatching would become hard.

--
Best Wishes,
Ashutosh Bapat




Add comments about fire_triggers argument in ri_triggers.c

2025-03-31 Thread Yugo Nagata
Hi,

SPI_execute_snapshot() has an argument called "fire_triggers".  If this is false,
AFTER triggers are postponed to the end of the query. This is true in the normal
case, but it is set to false in RI triggers.

This was introduced by 9cb84097623e in 2007. It is aimed at firing check triggers
after all RI updates on the same row are complete.

However, I cannot find an explanation of why this is required anywhere in the
codebase. Therefore, I've attached a patch that adds comments in ri_triggers.c
explaining why fire_triggers is set to false.

SPI_execute_snapshot() is used in a few places in ri_triggers.c, but I added
the comments only in ri_PerformCheck(), because in the other places
SPI_execute_snapshot() is used only for SELECT queries. Therefore, I wonder
whether fire_triggers really needs to be false in those places, but I left
them as is.

Regards,
Yugo Nagata

-- 
Yugo Nagata 
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index c4ff18ce65e..45d7e6268e4 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2580,7 +2580,13 @@ ri_PerformCheck(const RI_ConstraintInfo *riinfo,
 		   save_sec_context | SECURITY_LOCAL_USERID_CHANGE |
 		   SECURITY_NOFORCE_RLS);
 
-	/* Finally we can run the query. */
+	/*
+	 * Finally we can run the query.
+	 *
+	 * Set fire_triggers to false so that AFTER triggers run at the end of
+	 * the query. This ensures check triggers fire after all RI updates on
+	 * the same row are complete.
+	 */
 	spi_result = SPI_execute_snapshot(qplan,
 	  vals, nulls,
 	  test_snapshot, crosscheck_snapshot,


Re: Proposal: Progressive explain

2025-03-31 Thread Robert Haas
Thanks for this valuable testing. I think this is actually a really
good idea for how to test something like this, because the regression
tests contain lots of different queries that do lots of weird things.

On Sun, Mar 30, 2025 at 8:23 PM torikoshia  wrote:
> I haven't looked into the code yet, but when I ran below commands during
> make installcheck, there was an error and an assertion failure
>
>=# select * from pg_stat_progress_explain;
>=# \watch 0.1
>
> ERROR:  could not attach to dynamic shared area

This seems like a race condition. Probably some backend's dsa_area
went away between the time we got a pointer to it and the time we
actually attached to it. We should be able to find some way of
handling this without an error, like treating the case where the DSA
area is missing the same as the case where there was no DSA pointer in
the first place. However, this is also making me wonder if we
shouldn't be using one DSA shared by all backends rather than a
separate DSA area for every backend. That would require more care to
avoid leaks, but I'm not sure that it's a good idea to be creating and
destroying a DSA area for every single query. But I'm not 100% sure
that's a problem.

>TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
> "ruleutils.c", Line: 8802, PID: 73180
>TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
> "ruleutils.c", Line: 8802, PID: 73181

I wonder what is happening here. One systemic danger of the patch is
that there can be a difference between what must be true at the *end*
of a query and what must be true *during* a query. Anything that can't
happen at the end but can happen in the middle is something that the
patch will need to do something about in order to work properly. But I
don't see how that can explain this failure, because AFAIR the patch
just prints the same things that would have been printed by any other
EXPLAIN, with the same stack of namespaces. It seems possible that
this is a pre-existing bug: the regression tests might contain a query
that would cause EXPLAIN to fail, but because the regression tests
don't actually EXPLAIN that query, no failure occurs. But it could
also be something else; for example, maybe this patch is trying to
EXPLAIN something that couldn't be used with a regular EXPLAIN for
some reason. Or maybe the patch doesn't actually succeed in doing the
EXPLAIN with the correct namespace stack in all cases.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Change log level for notifying hot standby is waiting non-overflowed snapshot

2025-03-31 Thread Fujii Masao



On 2025/03/31 22:44, torikoshia wrote:

Here are some comments on the documentation.


Thanks for the review!



The following description in high-availability.sgml also seems to misuse the 
word 'consistent':

   When the  parameter is set to true on a
   standby server, it will begin accepting connections once the recovery has
   brought the system to a consistent state.

Since this is part of the "User's Overview" section, it may not be appropriate 
to include too much detail.
How about rewording it to avoid using 'consistent', for example:

   When the  parameter is set to true on a
   standby server, it will begin accepting connections once it is ready.


"once it is ready." feels too vague to me. How about using
"once recovery has brought the system to a consistent state and
be ready for hot standby." instead?



+    delaying accepting read-only connections.  To enable hot standby,
+    a long-lived write transaction with more than 64 subtransactions
+    needs to be closed on the primary.

Is it better to use 'transactions' in the plural form rather than as a nominal?

- There may be more than one such transaction.
- The  below also uses the plural form.
- The newly added message also uses the plural form:
   + errhint("To enable hot standby, close write transactions with more than %d 
subtransactions on the primary server."


What do you think?


I'm not sure whether the plural form is better here, but I've updated
the patch as suggested. Attached is the revised version.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From 2c83ae035c8093498b78243095180fa906a39c4a Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Tue, 1 Apr 2025 00:47:58 +0900
Subject: [PATCH v6] Improve error message when standby does accept
 connections.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Even after reaching the minimum recovery point, if there are long-lived
write transactions with 64 subtransactions on the primary, the recovery
snapshot may not yet be ready for hot standby, delaying read-only
connections on the standby. Previously, when read-only connections were
not accepted due to this condition, the following error message was logged:

FATAL:  the database system is not yet accepting connections
DETAIL:  Consistent recovery state has not been yet reached.

This DETAIL message was misleading because the following message was
already logged in this case:

LOG:  consistent recovery state reached

This contradiction, i.e., indicating that the recovery state was consistent
while also stating it wasn’t, caused confusion.

This commit improves the error message to better reflect the actual state:

FATAL: the database system is not yet accepting connections
DETAIL: Recovery snapshot is not yet ready for hot standby.
HINT: To enable hot standby, close write transactions with more than 64 
subtransactions on the primary server.

To implement this, the commit introduces a new postmaster signal,
PMSIGNAL_RECOVERY_CONSISTENT. When the startup process reaches
a consistent recovery state, it sends this signal to the postmaster,
allowing it to correctly recognize that state.

Since this is not a clear bug, the change is applied only to the master
branch and is not back-patched.

Author: Atsushi Torikoshi 
Co-authored-by: Fujii Masao 
Reviewed-by: Yugo Nagata 
Discussion: https://postgr.es/m/02db8cd8e1f527a8b999b94a4bee3...@oss.nttdata.com
---
 doc/src/sgml/high-availability.sgml   | 12 
 src/backend/access/transam/xlogrecovery.c |  6 ++
 src/backend/postmaster/postmaster.c   | 12 +---
 src/backend/tcop/backend_startup.c| 18 +-
 src/include/storage/pmsignal.h|  1 +
 src/include/tcop/backend_startup.h|  2 +-
 6 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index acf3ac0601d..b47d8b4106e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1535,7 +1535,8 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'

 When the  parameter is set to true on a
 standby server, it will begin accepting connections once the recovery has
-brought the system to a consistent state.  All such connections are
+brought the system to a consistent state and be ready for hot standby.
+All such connections are
 strictly read-only; not even temporary tables may be written.

 
@@ -1974,9 +1975,12 @@ LOG:  database system is ready to accept read-only 
connections
 Consistency information is recorded once per checkpoint on the primary.
 It is not possible to enable hot standby when reading WAL
 written during a period when wal_level was not set to
-replica or logical on the primary.  
Reaching
-a consistent state can also be delayed in the presence of both of the

Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-03-31 Thread Robert Haas
On Thu, Mar 27, 2025 at 10:08 AM Richard Guo  wrote:
> On Thu, Mar 27, 2025 at 10:53 PM Tom Lane  wrote:
> > Richard Guo  writes:
> > > I'm planning to push this patch soon, barring any objections.
>
> > FWIW, I have not reviewed it at all.
>
> Oh, sorry.  I'll hold off on pushing it.

As a general point, non-trivial patches should really get some
substantive review before they are pushed. Please don't be in a rush
to commit. It is very common for the time from when a patch is first
posted to commit to be 3-6 months even for committers. Posting a
brand-new feature patch on March 21st and then pressing to commit on
March 27th is really not something you should be doing. I think it's
particularly inappropriate here where you actually got a review that
pointed out a serious design problem and then had to change the
design. If you didn't get it right on the first try, you shouldn't be
too confident that you did it perfectly the second time, either.

I took a look at this today and I'm not entirely comfortable with this:

+ rel = table_open(rte->relid, NoLock);
+
+ /* Record NOT NULL columns for this relation. */
+ get_relation_notnullatts(rel, rte);
+
+ table_close(rel, NoLock);

As a general principle, I have found that it's usually a sign that
something has been designed poorly when you find yourself wanting to
open a relation, get exactly one piece of information, and close the
relation again. That is why, today, all the information that the
planner needs about a particular relation is retrieved by
get_relation_info(). We do not just wander around doing random catalog
lookups wherever we need some critical detail. This patch increases
the number of places where we fetch relation data from 1 to 2, but
it's still the case that almost everything happens in
get_relation_info(), and there's now just exactly this 1 thing that is
done in a different place. That doesn't seem especially nice. I
thought the idea was going to be to move get_relation_info() to an
earlier stage, not split one thing out of it while leaving everything
else the same.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: explain analyze rows=%.0f

2025-03-31 Thread Robert Haas
On Tue, Mar 11, 2025 at 5:58 AM Ilia Evdokimov
 wrote:
> In the stats_ext regression test, there is a function
> check_estimated_rows that returns actual rows as an integer. Currently,
> none of the test cases produce a non-zero fractional part in actual rows.
>
> The question is: should this function be modified to return a fractional
> number instead of an integer?
>
> Personally, I don’t see much value in doing so, because the purpose of
> the test is to compare the estimated row count with the actual number of
> rows. It is also unlikely that there will be test cases where loops > 1,
> and the presence of a fractional part does not change the essence of the
> test.
>
> What do you think?

I suppose the way this is currently coded, it will have the effect of
extracting the integer part of the row estimate. That doesn't really
seem like a bad idea to me, so I'm not sure it's worth trying to
adjust anything here.

One thing that I just noticed, though, is that we added two decimal
places to the actual row count, but not the estimated row count. So
now we get stuff like this:

 ->  Seq Scan on pgbench_branches b  (cost=0.00..1.10 rows=10
width=364) (actual time=0.007..0.009 rows=10.00 loops=1)

But why isn't it just as valuable to have two decimal places for the
estimate? I theorize that the cases that are really a problem here are
those where the row count estimate is between 0 and 1 per row, and
rounding to an integer loses all precision.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: explain analyze rows=%.0f

2025-03-31 Thread Tom Lane
Robert Haas  writes:
> But why isn't it just as valuable to have two decimal places for the
> estimate? I theorize that the cases that are really a problem here are
> those where the row count estimate is between 0 and 1 per row, and
> rounding to an integer loses all precision.

Currently, the planner rounds *all* rowcount estimates to integers
(cf. clamp_row_est()).  Maybe it'd be appropriate to rethink that,
but it's not just a matter of changing EXPLAIN's print format.
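
For reference, clamp_row_est() currently does roughly the following
(paraphrased from costsize.c; details may differ slightly):

double
clamp_row_est(double nrows)
{
	/*
	 * Avoid absurdly large estimates, force at least one row, and round to
	 * an integer, since estimates below one row are unreliable anyway.
	 */
	if (nrows > MAXIMUM_ROWCOUNT)
		nrows = MAXIMUM_ROWCOUNT;
	else if (nrows <= 1.0)
		nrows = 1.0;
	else
		nrows = rint(nrows);

	return nrows;
}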

regards, tom lane




Re: Proposal - Allow extensions to set a Plan Identifier

2025-03-31 Thread Lukas Fittl
On Tue, Mar 25, 2025 at 8:12 PM Sami Imseih  wrote:

> So this comes down to forking the Postgres code to do the job.  What I
>> had in mind was a slightly different flow, where we would be able to
>> push custom node attributes between the header parsing and the
>> generation of the extension code.  If we do that, there would be no
>> need to fork the Postgres code: extensions could force the definitions
>> they want to the nodes they want to handle, as long as these do not
>> need to touch the in-core query jumble logic, of course.
>
>
> Allowing extensions to generate code for custom node attributes could be
> taken up in 19. What about, for 18, exposing AppendJumble,
> JUMBLE_FIELD, JUMBLE_FIELD_SINGLE and JUMBLE_STRING?
>

+1, I think exposing the node jumbling logic + ability to jumble values for
extensions would make a big difference to get into 18, specifically for
allowing extensions to do plan ID calculations efficiently.

Attached a rebased version that accounts for the changes in f31aad9b0 and
ad9a23bc4, and also makes AppendJumble* available, as well as two helper
defines JUMBLE_VALUE and JUMBLE_VALUE_STRING. These are intended to work on
values, not nodes, because requiring the caller to have a local "expr"
variable doesn't seem like a good API.
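
To illustrate the distinction, a hedged sketch of the two macro styles
(the exact definitions in queryjumblefuncs.c and in the attached patch
may differ in detail):

    /* Field-style: only usable where a local node pointer named "expr"
     * (and a "jstate") happen to be in scope. */
    #define JUMBLE_FIELD(item) \
        AppendJumble(jstate, (const unsigned char *) &expr->item, sizeof(expr->item))

    /* Value-style: takes any value the caller has at hand, with no
     * assumption about a local "expr" variable. */
    #define JUMBLE_VALUE(val) \
        AppendJumble(jstate, (const unsigned char *) &(val), sizeof(val))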

I've also intentionally reduced the API surface area: the patch neither
allows initializing a jumble state that records constant locations nor
exports RecordConstLocations. The utility of that seems less clear
(possibly out-of-core queryid customization with a future extension hook in
jumbleNode) and requires more refinement.

Thoughts?

Thanks,
Lukas

-- 
Lukas Fittl


v2-0001-Allow-plugins-to-Jumble-an-expression.patch
Description: Binary data


Re: Statistics Import and Export

2025-03-31 Thread Robert Haas
On Thu, Feb 27, 2025 at 10:43 PM Greg Sabino Mullane  wrote:
> I know I'm coming late to this, but I would like us to rethink having 
> statistics dumped by default.

+1. I think I said this before, but I don't think it's correct to
regard the statistics as part of the database. It's great for
pg_upgrade to preserve them, but I think doing so for a regular dump
should be opt-in.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Automatic client certificate selection support for libpq v1

2025-03-31 Thread Seth Robertson



Wow, blast from the past.

First, I no longer have this use case.

Second, the PGSSLCERT, PGSSLKEY, and other relevant environment
variables should make most clients able to be "fairly easily" switched
from one server to another.

Third, the only real use case where this feature would be critical is
a client that needs connections to two different PostgreSQL servers at
the same time.  Those applications are probably fairly rare, and doing
custom programming to support different filenames would likely be
warranted.

However, I still think that this feature would make it easier for
users who often connect to multiple servers from different
administrative/security domains.

Given the lack of "me too" or "+1" posts over the past 16 years, I
suspect there may be features with higher user benefit.  I would not
cry if it gets removed.

Thanks,
-Seth Robertson



From: "Robin Haberkorn"
Date: Mon, 31 Mar 2025 13:57:55 +0300
To: "Seth Robertson"
Subject: Re: [PATCH] Automatic client certificate selection support for libpq v1

Hello everybody!

I was investigating the item "Allow automatic selection of SSL client
certificates from a certificate store" from the Todo list [1]. If I
understand Seth Robertson in his initial mail correctly, he wanted libpq to
select client certificates automatically based on the host, while storing
several client certificates and keys in a single crt file and a single key
file.

Client certs are currently loaded with SSL_CTX_use_certificate_chain_file()
in src/interfaces/libpq/fe-secure-openssl.c.
While it is theoretically possible to implement host-specific logic using
SSL_CTX_set_client_cert_cb(), I don't think you could store all the
possible certificates in a single file, as you might already need several
certificates for a single host in the form of a certificate
chain. As was pointed out in the thread back then, you would have to
implement something like a certificate wallet/store from scratch.
At the most, Seth's initial patch could be improved by looking
up per-host client certificate/key files based on the host-reported
server name (SSL_get_servername()), which should be more reliable when
working with host aliases.

On the other hand, there have been sslcert/sslkey connection parameters
since 8.4, which Seth was apparently aware of. As far as I understand,
he just wanted this patch for 8.3 as well and he didn't want to update
all of his existing programs. Considering that his initial mail was
written 16 years ago, I don't think this is a valid argument anymore.
It should be possible to adapt programs easily, e.g. by accepting
"postgresql://" URIs instead of domains and manually choosing appropriate
certificate/key filenames.
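
For what it's worth, a minimal example of the per-server selection that is
already possible today (host name and file paths below are made up):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* One client certificate/key pair per server, chosen explicitly in
         * the connection string rather than by any automatic mechanism. */
        PGconn *conn = PQconnectdb(
            "postgresql://db-a.example.com/mydb"
            "?sslmode=verify-full"
            "&sslcert=/etc/pki/client-db-a.crt"
            "&sslkey=/etc/pki/client-db-a.key");

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }

Pointing the same program at a second server only means swapping in that
server's certificate/key paths, and the PGSSLCERT/PGSSLKEY environment
variables Seth mentions cover the case where the program cannot be changed
at all.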

In my opinion there is little that can and should be done in Postgres
at this point. Or does anybody think that a server-name-based certificate
file selection feature should still be implemented?
If yes, I would be happy to take care of it.
If not, I would suggest removing this item from the agenda/wiki.

Best regards,
Robin Haberkorn

[1]: http://archives.postgresql.org/pgsql-hackers/2009-05/msg00406.php

--
Robin Haberkorn
Senior Software Engineer
Tel.: +49 157 85228715
E-Mail: haberk...@b1-systems.de

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / https://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt, HRB 3537




Re: Using read stream in autoprewarm

2025-03-31 Thread Nazir Bilal Yavuz
Hi,

On Mon, 31 Mar 2025 at 21:15, Melanie Plageman
 wrote:
>
> On Mon, Mar 31, 2025 at 12:27 PM Nazir Bilal Yavuz  wrote:
> >
> > I worked on an alternative approach and refactored the code a bit. It
> > does not traverse the list twice, and I think the code is now better
> > suited to using read streams. I simply get how many blocks are processed by
> > read streams and move the list forward by this number, so the actual
> > loop skips these blocks. This approach is attached with 'alternative'
> > prefix.
>
> I am leaning toward the refactored approach because I don't think we
> want to go through the array twice, and I think it is hard to get it
> right when incrementing p.pos in both places while being sure we
> correctly close the relation, etc.

I liked it more as well.

> Looking at your alternative approach, I don't see how the innermost
> while loop in autoprewarm_database_main() is correct
>
> /* Check whether blocknum is valid and within fork file size. */
> while (cur_filenumber == blk->filenumber &&
>blk->blocknum >= nblocks_in_fork)
> {
> /* Move to next forknum. */
> pos++;
> continue;
> }
>
> Won't this just infinitely loop?

Oops, you are right. It should be an if statement instead of a while
loop; that is fixed now. Also, I moved 'dsm_detach(seg);' back to its
previous place to reduce the diff. Attached as v7. Do you think I should
continue to attach both approaches?
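
For clarity, the shape of the fix described above, using the names from
the quoted snippet (fragment only; the enclosing loop is omitted):

    /* Check whether blocknum is valid and within fork file size. */
    if (cur_filenumber == blk->filenumber &&
        blk->blocknum >= nblocks_in_fork)
    {
        /* Move to next forknum; the enclosing loop re-checks from here. */
        pos++;
        continue;
    }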

-- 
Regards,
Nazir Bilal Yavuz
Microsoft


v7-0001-Optimize-autoprewarm-with-read-streams.patch
Description: Binary data


v7-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch
Description: Binary data

