Re: Built-in Raft replication

2025-04-15 Thread Andrey Borodin



> On 16 Apr 2025, at 09:33, Ashutosh Bapat  wrote:
> 
> In my experience, the load of managing hundreds of replicas which all
> participate in the RAFT protocol becomes more than the regular
> transaction load. So making every replica a RAFT participant will
> affect the ability to deploy hundreds of replicas.

No need to make all standbys voting, and no need for a flat topology. 
pg_consul uses 2/3 or 3/5 HA groups, and cascades all other standbys from the 
HA group.
Existing tools already solve the original problem; Konstantin is just proposing 
to solve it in some standard “official” way.

> We may build an extension which
> has a similar role in the PostgreSQL world as Zookeeper has in Hadoop.

Patroni, pg_consul and others already use Zookeeper, etcd and similar systems 
for consensus.
Is it any better as an extension than as etcd?

> It can
> then be used for other distributed systems as well - like
> shared-nothing clusters based on FDW.

I didn’t get the FDW analogy. Why should other distributed systems choose a 
Postgres extension over Zookeeper?

> There's already a proposal to bring
> CREATE SERVER to the world of logical replication - so I see these two
> worlds uniting in the future.

Again, I’m lost here. Which two worlds?

> The way I imagine it is some PostgreSQL
> instances, which have this extension installed, will act as a RAFT
> cluster (similar to a Zookeeper ensemble or etcd cluster).

That’s exactly what is proposed here.

> The
> distributed system based on logical replication or FDW or both will
> use this ensemble to manage its shared state. The same ensemble can be
> shared across multiple distributed clusters if it has scaling
> capabilities.

Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance per 
hundred Postgres clusters to coordinate pg_consuls.

Actually, scalability is the opposite of the topic of this thread. Let me explain.
Currently, Postgres automatic failover tools rely on databases with built-in 
automatic failover. Konstantin is proposing to shorten this loop and make 
Postgres use its own built-in automatic failover.

So, existing tooling allows you to have 3 hosts for the DCS, with a majority of 
2 hosts able to elect a new leader in case of failover.
And you can have only 2 hosts for Postgres - primary and standby. You can have 
2 big Postgres machines with 64 CPUs, and 3 one-CPU hosts for Zookeeper/etcd.

If you use built-in failover, you have to resort to 3 big Postgres machines 
because you need a 2/3 majority. Of course, you can install a MySQL-style 
arbiter - a host that has no real PGDATA and only participates in voting. But 
this is a solution to a problem induced by built-in autofailover.


Best regards, Andrey Borodin.



Re: Conflict detection for update_deleted in logical replication

2025-04-15 Thread shveta malik
On Wed, Mar 26, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Here's a rebased version of the patch series.
>

Thanks Hou-San for the patches. I am going through this long thread
and the patches. One doubt I have: whenever there is a chance of a
conflict-slot update (either of xmin or the possibility of its
invalidation), the apply-worker gives a wake-up call to the launcher
(ApplyLauncherWakeup). Shouldn't that suffice to wake up the launcher
irrespective of its nap time? Do we actually need to introduce
MIN/MAX_NAPTIME_PER_SLOT_UPDATE in the launcher and the logic around
it?

thanks
Shveta




Re: Built-in Raft replication

2025-04-15 Thread Ashutosh Bapat
On Wed, Apr 16, 2025 at 10:29 AM Andrey Borodin  wrote:
>
> > We may build an extension which
> > has a similar role in the PostgreSQL world as Zookeeper has in Hadoop.
>
> Patroni, pg_consul and others already use Zookeeper, etcd and similar systems 
> for consensus.
> Is it any better as an extension than as etcd?

I feel so. An extension runs within a PostgreSQL process and uses the
same protocol as PostgreSQL, whereas etcd is a separate process
speaking a different protocol.

>
> > It can
> > then be used for other distributed systems as well - like
> > shared-nothing clusters based on FDW.
>
> I didn’t get the FDW analogy. Why should other distributed systems choose a 
> Postgres extension over Zookeeper?

By other distributed systems I mean PostgreSQL distributed systems -
FDW-based native sharding, native replication, or a system which uses
both.

>
> > There's already a proposal to bring
> > CREATE SERVER to the world of logical replication - so I see these two
> > worlds uniting in the future.
>
> Again, I’m lost here. Which two worlds?

Logical replication and FDW-based native sharding.

>
> > The
> > distributed system based on logical replication or FDW or both will
> > use this ensemble to manage its shared state. The same ensemble can be
> > shared across multiple distributed clusters if it has scaling
> > capabilities.
>
> Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance 
> per hundred Postgres clusters to coordinate pg_consuls.
>
> Actually, scalability is the opposite of the topic of this thread. Let me 
> explain.
> Currently, Postgres automatic failover tools rely on databases with built-in 
> automatic failover. Konstantin is proposing to shorten this loop and make 
> Postgres use its own built-in automatic failover.
>
> So, existing tooling allows you to have 3 hosts for the DCS, with a majority 
> of 2 hosts able to elect a new leader in case of failover.
> And you can have only 2 hosts for Postgres - primary and standby. You can 
> have 2 big Postgres machines with 64 CPUs, and 3 one-CPU hosts for 
> Zookeeper/etcd.
>
> If you use built-in failover, you have to resort to 3 big Postgres machines 
> because you need a 2/3 majority. Of course, you can install a MySQL-style 
> arbiter - a host that has no real PGDATA and only participates in voting. 
> But this is a solution to a problem induced by built-in autofailover.

Users find it a waste of resources to deploy 3 big PostgreSQL
instances just for HA where 2 suffice, even if they also deploy 3
lightweight DCS instances. Having only some of the nodes act as DCS
and others as purely PostgreSQL nodes will reduce the waste of resources.

-- 
Best Wishes,
Ashutosh Bapat




Re: recoveryCheck test failure on flaviventris

2025-04-15 Thread Richard Guo
On Wed, Apr 16, 2025 at 11:54 AM Richard Guo  wrote:
> I noticed this recoveryCheck test failure on flaviventris after
> pushing commit 3b35f9a4c (Fix an incorrect check in get_memoize_path).
>
> ### Reloading node "standby"
> # Running: pg_ctl --pgdata xxx/pgdata reload
> server signaled
> ### Restarting node "standby"
> # Running: pg_ctl --wait --pgdata xxx/pgdata --log xxx.log restart
> waiting for server to shut down ... failed
> pg_ctl: server does not shut down
> # pg_ctl restart failed; see logfile for details: xxx.log
> # Postmaster PID for node "standby" is 63577
> [02:09:29.767](364.847s) Bail out!  pg_ctl restart failed
>
> I'm not convinced this failure is related to 3b35f9a4c; it seems
> unlikely that just changing how Memoize paths are created would cause
> it.
>
> Does anyone have any insights into what might be going on?

Forgot to include the link to the failure.  Here it is:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-04-16%2001%3A58%3A42

Thanks
Richard




Re: pg_dump --if-exists --clean when drop index that is partition of a partitioned index

2025-04-15 Thread jian he
On Wed, Apr 16, 2025 at 6:51 AM Tom Lane  wrote:
>
> Or we could do what Jian suggested and just not emit any dropStmt
> for child indexes.  I continue to fear that that will have
> undesirable side-effects, but I have to admit that I'm not sure
> what.  The fact that the backend will automatically drop the
> child index if we drop either its parent index or its table
> seems to cover a lot of the edge cases ... but does it cover
> all of them?
>

If pg_dump does not produce "DROP INDEX IF EXISTS child_index;"
then we are actually relying on the drop of the parent index cascading to the
child index.
If that doesn't happen, then when executing the pg_dump output, CREATE INDEX
on the child table will fail with an error like:
ERROR:  relation "t_a_idx" already exists


CREATE TABLE tp(c int, a int, b int) PARTITION BY RANGE (b);
CREATE TABLE tp_1(c int, a int, b int);
ALTER TABLE tp ATTACH PARTITION tp_1 FOR VALUES FROM (0) TO (1);
CREATE INDEX t_a_idx ON tp_1(a);
CREATE INDEX tp_a_idx ON tp(a);

In this case, the dependency relationship between the partition index
and the partitioned index is DEPENDENCY_PARTITION_PRI.

per https://www.postgresql.org/docs/current/catalog-pg-depend.html
"
Example: a child partitioned index is made partition-dependent on both the
partition table it is on and the parent partitioned index, so that it goes away
if either of those is dropped, but not otherwise. The dependency on the parent
index is primary, so that if the user tries to drop the child partitioned index,
the error message will suggest dropping the parent index instead (not the
table).
"
From the doc, dropping the partitioned index will cascade and also drop
the partition index.
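
To make that concrete with the example above, the behavior is roughly the
following (error text approximate):

DROP INDEX t_a_idx;
-- ERROR:  cannot drop index t_a_idx because index tp_a_idx requires it
-- HINT:  You can drop index tp_a_idx instead.

DROP INDEX tp_a_idx;  -- succeeds, and drops t_a_idx along with it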



On Tue, Apr 15, 2025 at 3:08 PM Pavel Stehule  wrote:
>
>
> There is a question of why the dump exports DROP INDEX, although the index 
> will be dropped implicitly when the table is dropped.
>
> So your fix can work, because all indexes will be dropped by DROP TABLE, but 
> on the other hand it is not consistent with the current behavior of pg_dump, 
> where indexes are dropped explicitly.
>
> I have no knowledge of why pg_dump exports DROP INDEX explicitly, although 
> it is redundant.
>
> Regards
>
> Pavel
>
If you specify ``--section=post-data``, then only the index commands
will be generated, so we actually do need pg_dump to export a separate
DROP INDEX.
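
For instance, with the tables from upthread, a sketch of the relevant
post-data output (abridged; statement order and spellings approximate):

pg_dump --clean --if-exists --section=post-data testdb

DROP INDEX IF EXISTS public.t_a_idx;
DROP INDEX IF EXISTS public.tp_a_idx;
...
CREATE INDEX tp_a_idx ON ONLY public.tp USING btree (a);
CREATE INDEX t_a_idx ON public.tp_1 USING btree (a);
ALTER INDEX public.tp_a_idx ATTACH PARTITION public.t_a_idx;

Since no DROP TABLE appears in this mode, the explicit DROP INDEX statements
are the only thing that removes the old indexes before re-creation.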




Re: transforms [was Re: FmgrInfo allocation patterns (and PL handling as staged programming)]

2025-04-15 Thread Tom Lane
Chapman Flack  writes:
> On 04/15/25 10:52, Tom Lane wrote:
>> The problem from a PL's standpoint is "given this input or output
>> of type FOO, should I transform it, and if so using what?".  So
>> the starting point has to be a type not a transform.

> Perhaps the efficiency argument is really "say a function has
> a list of 100 arguments and only one is of type foo, how many cycles
> are wasted in get_transform_tosql and get_transform_fromsql applied
> to all those other types?"

That, and also "how many cycles are wasted in get_transform_tosql and
get_transform_fromsql when there isn't any TRANSFORM clause at all?"

> If pg_proc had protransforms instead, that would add a step zero: looking
> up the declared transforms to make an in-memory list of (typid, tosqloid,
> fromsqloid). After that, get_transform_{tosql,fromsql} would be applied
> and return quickly when the passed typid isn't in that list.

I don't doubt there are other ways that this could be built.  But
you've not really made the case that another way is better (let alone
enough better to be worth breaking backward compatibility for).

I think the "normal" case is that there isn't any TRANSFORM clause.
As long as that case falls through quickly, it's hard to credit that
there's any meaningful performance difference between different
arrangements here.

My own beef with the whole setup is that you can't specify *which*
arguments or results you want transformed.  I don't believe that
this should have been keyed off data types to begin with.  But
that design was the SQL committee's choice so we're stuck ...
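
For reference, a minimal example of the type-keyed syntax, assuming the
contrib transform hstore_plpython3u is installed; every hstore argument and
result of the function gets transformed, with no per-argument opt-out:

CREATE EXTENSION hstore_plpython3u CASCADE;

CREATE FUNCTION hkeys(h hstore) RETURNS text[]
TRANSFORM FOR TYPE hstore
LANGUAGE plpython3u
AS $$
    # the transform delivers h as a Python dict rather than a string
    return list(h.keys())
$$;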

regards, tom lane




Re: recoveryCheck test failure on flaviventris

2025-04-15 Thread Alexander Lakhin

Hello Richard,

16.04.2025 06:05, Richard Guo wrote:

On Wed, Apr 16, 2025 at 11:54 AM Richard Guo wrote:

I noticed this recoveryCheck test failure on flaviventris after
pushing commit 3b35f9a4c (Fix an incorrect check in get_memoize_path).
...
I'm not convinced this failure is related to 3b35f9a4c; it seems
unlikely that just changing how Memoize paths are created would cause
it.

Does anyone have any insights into what might be going on?

Forgot to include the link to the failure.  Here it is:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-04-16%2001%3A58%3A42


Judging by:
### Restarting node "standby"
# Running: pg_ctl --wait --pgdata 
/home/bf/bf-build/flaviventris/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata 
--log 
/home/bf/bf-build/flaviventris/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log 
restart
waiting for server to shut down... failed

pg_ctl: server does not shut down

I think it's a known issue:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_rep_changes.pl_fails_due_to_publisher_stuck_on_shutdown
and is not related to the Memoize paths.

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Built-in Raft replication

2025-04-15 Thread Andrey Borodin



> On 16 Apr 2025, at 04:19, Tom Lane  wrote:
> 
> feebly, and seems to have a bus factor of 1.  Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.

I think that's what Konstantin is proposing: to have our own Raft 
implementation, without dependencies.

IMO, to better understand what is proposed, we need some more description of 
the proposed system. How will the new system be configured? initdb and what 
then? How does a new node join the cluster? What runs pg_rewind when necessary?

Some time ago Peter E proposed to allow starting replication atop an empty 
directory, so that initial sync would be more straightforward. And Heikki 
proposed to remove the archive race condition when choosing a new timeline. I 
think these steps are gradual movement in the same direction.

My view is that what Konstantin wants is automatic replication topology 
management. For some reason this technology is called HA, DCS, Raft, Paxos and 
many other scary words. But basically it manages the primary_conninfo of some 
nodes to provide some fault-tolerance properties. I'd start the design from 
here, not from the Raft paper.


Best regards, Andrey Borodin.



Re: Built-in Raft replication

2025-04-15 Thread Tom Lane
Andrey Borodin  writes:
> I think that's what Konstantin is proposing: to have our own Raft 
> implementation, without dependencies.

Hmm, OK.  I thought that the proposal involved relying on some existing
code, but re-reading the thread that was said nowhere.  Still, that
moves it from a large project to a really large project :-(

I continue to think that it'd be best to try to implement it as
an extension, at least up till the point of finding show-stopping
reasons why it cannot be that.

regards, tom lane




Re: BitmapHeapScan streaming read user and prelim refactoring

2025-04-15 Thread Thomas Munro
On Tue, Apr 15, 2025 at 5:44 AM Robert Haas  wrote:
> On Thu, Apr 10, 2025 at 11:15 PM Thomas Munro  wrote:
> > The new streaming BHS isn't just issuing probabilistic hints about
> > future access obtained from a second iterator.  It has just one shared
> > iterator connected up to the workers' ReadStreams.  Each worker pulls
> > a disjoint set of blocks out of its stream, possibly running a bunch
> > of IOs in the background as required.
>
> It feels to me like the problem here is that the shared iterator is
> connected to unshared read-streams. If you make a shared read-stream
> object and connect the shared iterator to that instead, does that
> solve this whole problem, or is there more to it?

More or less, yeah, just put the whole ReadStream object in shared
memory, pin an LWLock on it and call it a parallel-aware or shared
ReadStream.  But how do you make the locking not terrible?

My "work stealing" brain dump was imagining a way to achieve the same
net effect, except NOT have to acquire an exclusive lock for every
buffer you pull out of the stream.  I was speculating that we could
achieve zero locking for most of the stream without any cache line
ping pong, but a cunning read barrier scheme could detect when you've
been flipped into a slower coordination mode by another backend and
need to turn on some locking and fight over the last handful of
buffers.  And I was also observing that if you can figure out to make
it general and reusable enough, we have more unsolved problems like
this in unrelated parallel query code not even involving streams.
It's a slightly more approachable subset of the old "data buffered in
other workers" problem, as I think you called it once.
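
As a strawman, the naive shared-stream version would look roughly like this
(SharedReadStream is hypothetical; read_stream_next_buffer is the existing
per-backend API):

/* Hypothetical sketch only: the ReadStream state lives in shared memory
 * and every consumer serializes on one LWLock per buffer pulled, which is
 * exactly the per-buffer locking cost the work-stealing idea tries to
 * avoid for most of the stream. */
typedef struct SharedReadStream
{
	LWLock		lock;			/* serializes the workers */
	ReadStream *stream;			/* stream state kept in shared memory */
} SharedReadStream;

static Buffer
shared_read_stream_next_buffer(SharedReadStream *sstream)
{
	Buffer		buf;

	LWLockAcquire(&sstream->lock, LW_EXCLUSIVE);
	buf = read_stream_next_buffer(sstream->stream, NULL);
	LWLockRelease(&sstream->lock);

	return buf;
}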




Re: use correct variable in error message in _allocAH function (pg_backup_archiver.c)

2025-04-15 Thread Fujii Masao




On 2025/04/15 23:37, Tom Lane wrote:

Fujii Masao  writes:

Thanks for the report and patch! It looks good to me.


Agreed.


Since this issue exists in the back branches,
the patch needs to be back-patched to all supported versions.


I doubt it's worth the trouble and buildfarm cycles to
back-patch, since this should be a can't-happen code path.
Worth fixing in HEAD, yes, but not convinced about doing
more than that.


Yes, that makes sense. I'll apply the fix to the master branch only.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Built-in Raft replication

2025-04-15 Thread Ashutosh Bapat
On Wed, Apr 16, 2025 at 9:37 AM Andrey Borodin  wrote:
>
> My view is that what Konstantin wants is automatic replication topology 
> management. For some reason this technology is called HA, DCS, Raft, Paxos 
> and many other scary words. But basically it manages the primary_conninfo of 
> some nodes to provide some fault-tolerance properties. I'd start the design 
> from here, not from the Raft paper.
>

In my experience, the load of managing hundreds of replicas which all
participate in the RAFT protocol becomes more than the regular
transaction load. So making every replica a RAFT participant will
affect the ability to deploy hundreds of replicas. We may build an
extension which has a similar role in the PostgreSQL world as
Zookeeper has in Hadoop. It can then be used for other distributed
systems as well - like shared-nothing clusters based on FDW. There's
already a proposal to bring CREATE SERVER to the world of logical
replication - so I see these two worlds uniting in the future. The way
I imagine it is some PostgreSQL instances, which have this extension
installed, will act as a RAFT cluster (similar to a Zookeeper ensemble
or etcd cluster). The distributed system based on logical replication
or FDW or both will use this ensemble to manage its shared state. The
same ensemble can be shared across multiple distributed clusters if it
has scaling capabilities.

-- 
Best Wishes,
Ashutosh Bapat




Re: Built-in Raft replication

2025-04-15 Thread Andrey Borodin



> On 16 Apr 2025, at 11:18, Ashutosh Bapat  wrote:
> 
> Having only some of the nodes act as DCS
> and others as purely PostgreSQL nodes will reduce the waste of resources.

But typically you need more DCS nodes than PostgreSQL nodes. Did you mean
“Having only some of the nodes act as PostgreSQL and others as purely DCS 
nodes will reduce the waste of resources”?


Best regards, Andrey Borodin.



Re: not null constraints, again

2025-04-15 Thread Tender Wang
Alvaro Herrera  wrote on Wed, Apr 16, 2025 at 03:12:

> On 2025-Apr-15, Tender Wang wrote:
>
> > I thought further about the lockmode used when calling
> > find_inheritance_children in ATPrepAddPrimaryKey.
> > What do we do here? We first get the OIDs of the children, then check if
> > the column of each child has been marked not-null, and if not, report an
> > error.
> > There is no operation on the children here. I checked other places that
> > call find_inheritance_children: if we have an operation on the children,
> > we usually pass a lockmode to find_inheritance_children, otherwise we
> > pass NoLock.
>
> Hmm, I'm wary of doing this, although you're perhaps right that there's
> no harm.  If we do need to add a not-null constraint on the children,
> surely we'll acquire a stronger lock further down the execution chain.
> In principle this sounds a good idea though.  (I'm not sure about doing
> SearchSysCacheAttName() on a relation that might be dropped
> concurrently; does dropping the child acquire lock on its parent?  I
> suppose so, in which case this is okay; but still icky.  What about
> DETACH CONCURRENTLY?)
>

Yes,  I'm also wary of doing this.
Although NoLock may fix this issue, I feel it will trigger other problems,
such as the scenario you listed above.


> However, I've also been looking at this and realized that this code can
> have a different structure which may allow us to skip the
> find_inheritance_children() altogether.  The reason is that we already
> scan the parent's list of columns searching for not-null constraints on
> each of them; we only need to run this verification on children for
> columns where there is none in the parent, and then only in the case
> where recursion is turned off.


> So I propose the attached patch, which also has some comments to
> hopefully explain what is going on and why.  I ran Tom's test script a
> few hundred times in a loop and I see no deadlock anymore.
>

No objection from me.   The comments may need a little polishing.


-- 
Thanks,
Tender Wang


Re: Built-in Raft replication

2025-04-15 Thread Ashutosh Bapat
On Wed, Apr 16, 2025 at 11:57 AM Andrey Borodin  wrote:
>
>
>
> > On 16 Apr 2025, at 11:18, Ashutosh Bapat  
> > wrote:
> >
> > Having only some of the nodes act as DCS
> > and others as purely PostgreSQL nodes will reduce the waste of resources.
>
> But typically you need more DCS nodes than PostgreSQL nodes. Did you mean

In a small HA setup this might be true. But not when there are many
replicas. But ...

> “Having only some of the nodes act as PostgreSQL and others as purely DCS 
> nodes will reduce the waste of resources”?

I mean, whatever the setup may be, one shouldn't be required to deploy a
big PostgreSQL server just because the DCS needs a majority.

-- 
Best Wishes,
Ashutosh Bapat




Re: dispchar for oauth_client_secret

2025-04-15 Thread Jelte Fennema-Nio
On Wed, 16 Apr 2025 at 02:03, Jacob Champion
 wrote:
> Thank you for saying something; I'd hallucinated that srvoptions was
> limited to the server owner, and that's not true. It's pg_user_mapping
> that has the protection.

FWIW, I have some ideas on being able to store secrets in a server in
a safe way. I'll probably start a thread on that somewhere in the next
few months.

> So: maybe it'd be best to disallow the OAuth options in the proxy code
> entirely, so that a proxy-friendly future implementation is not
> accidentally tied to decisions we made in v18.

Seems fine to me.

On Wed, 16 Apr 2025 at 02:03, Jacob Champion
 wrote:
>
> On Tue, Apr 15, 2025 at 3:25 PM Jelte Fennema-Nio  wrote:
> > I don't understand why it should be a server option instead of a user
> > mapping option. Having it be a server option means that *any* Postgres
> > user can read its contents.
>
> Thank you for saying something; I'd hallucinated that srvoptions was
> limited to the server owner, and that's not true. It's pg_user_mapping
> that has the protection.
>
> But if you want to hide the client secret from your users, making it a
> user mapping option has not made the situation better. The client ID
> and secret are there to authenticate libpq (and by extension,
> postgres_fdw), not the end user.
>
> > My knowledge on the purpose is pretty
> > limited, but having that be the case for something with "secret" in
> > the name seems very unintuitive.
>
> From the point of view of a proxy, whether you use a secret at all is
> up to the flow in use. And for 18, the only supported flow is focused
> on authorizing an actual human at the keyboard, without any token
> caching, making it pretty unhelpful for FDW. A more proxy-friendly
> flow would be awesome -- and/or explicit integration of the existing
> builtin flow into the proxies, both safely and helpfully -- but that's
> not a v18 thing. (I view it as similar in spirit to the
> SCRAM-passthrough work.)
>
> So: maybe it'd be best to disallow the OAuth options in the proxy code
> entirely, so that a proxy-friendly future implementation is not
> accidentally tied to decisions we made in v18.
>
> --Jacob




Re: Built-in Raft replication

2025-04-15 Thread Andrey Borodin



> On 16 Apr 2025, at 09:26, Tom Lane  wrote:
> 
> Andrey Borodin  writes:
>> I think it's what Konstantin is proposing. To have our own Raft 
>> implementation, without dependencies.
> 
> Hmm, OK.  I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere.  Still, that
> moves it from a large project to a really large project :-(
> 
> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

I think I can provide some reasons why it can be neither an extension, nor any 
part running within the postmaster's reign.

1. When joining a cluster, there’s no PGDATA to run postmaster on top of.

2. After failover, the old primary node must rejoin the cluster by running 
pg_rewind and following the timeline switch.

The system at hand must be able to manipulate PGDATA without starting 
Postgres.

My question to Konstantin is: why wouldn’t you just add Raft to Patroni? Is 
there a reason why something like Patroni is not in core and no one rushes to 
get it in?
Everyone is using it, or a system like it.


Best regards, Andrey Borodin.



Re: Built-in Raft replication

2025-04-15 Thread Kirill Reshke
On Wed, 16 Apr 2025 at 10:25, Andrey Borodin  wrote:
>
> I think I can provide some reasons why it can be neither an extension, nor 
> any part running within the postmaster's reign.
>
> 1. When joining a cluster, there’s no PGDATA to run postmaster on top of.

You can join the cluster from a pg_basebackup of its primary, so I don't get
why this is an anti-extension restriction.

> 2. After failover, the old primary node must rejoin the cluster by running 
> pg_rewind and following the timeline switch.

You can run bash from an extension; what's the point?

> The system at hand must be able to manipulate PGDATA without starting 
> Postgres.

-- 
Best regards,
Kirill Reshke




Re: Fundamental scheduling bug in parallel restore of partitioned tables

2025-04-15 Thread Tom Lane
Ashutosh Bapat  writes:
> On Mon, Apr 14, 2025 at 11:14 PM Tom Lane  wrote:
>> This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
>> command might fail outright if we've loaded data for the referencing
>> table but not the referenced table.

> There's a comment in getConstraints()
> /*
> * Restoring an FK that points to a partitioned table requires that
> * all partition indexes have been attached beforehand. Ensure that
> * happens by making the constraint depend on each index partition
> * attach object.
> */

Ah, that is an excellent point which I missed.  And the INDEX ATTACH
objects have dependencies on the leaf tables, which *will* get
repointed to their TABLE DATA objects by repoint_table_dependencies.
So by the time we are ready to restore the FK CONSTRAINT object,
we are certain to have loaded all the data of the referenced table.
But there's nothing delaying the constraint till after the referencing
table's data is loaded.

>> Even if it doesn't fail, if
>> it completes before we load data for the referencing table, we'll
>> have to do retail FK checks, greatly slowing that data load.

> FWIW, Executing pg_restore -j2 -v, I think, I see evidence that the FK
> constraint is created before data is loaded into the referencing
> table.

Yes, I reproduced that as well.  That squares with the above
analysis.

So at this point we have:

#1: ADD CONSTRAINT failure because of missing referenced data:
not possible after all.

#2: Deadlock between parallel restore jobs: possible in HEAD, but
it seems likely to be a bug introduced by the not-null-constraint
work rather than being pg_restore's fault.  We have no evidence
that such a deadlock can happen in released branches, and the lack
of field reports suggests that it can't.

#3: Restoring the FK constraint before referencing data is loaded:
this seems to be possible, and it's a performance problem, but
no more than that.

So now I withdraw the suggestion that this patch needs to be
back-patched.  We may not even need it in v18, if another fix
for #2 is found.  Fixing #3 would be a desirable thing to do
in v19, but if that's the only thing at stake then it's not
something to break feature freeze for.

For the moment I'll mark this CF entry as meant for v19.
We can resurrect consideration of it for v18 if there's not
a better way to fix the deadlock problem.

regards, tom lane




Re: not null constraints, again

2025-04-15 Thread Alvaro Herrera
On 2025-Apr-15, Tender Wang wrote:

> I thought further about the lockmode used when calling
> find_inheritance_children in ATPrepAddPrimaryKey.
> What do we do here? We first get the OIDs of the children, then check if the
> column of each child has been marked not-null, and if not, report an error.
> There is no operation on the children here. I checked other places that call
> find_inheritance_children: if we have an operation on the children, we
> usually pass a lockmode to find_inheritance_children, otherwise we pass
> NoLock.

Hmm, I'm wary of doing this, although you're perhaps right that there's
no harm.  If we do need to add a not-null constraint on the children,
surely we'll acquire a stronger lock further down the execution chain.
In principle this sounds a good idea though.  (I'm not sure about doing
SearchSysCacheAttName() on a relation that might be dropped
concurrently; does dropping the child acquire lock on its parent?  I
suppose so, in which case this is okay; but still icky.  What about
DETACH CONCURRENTLY?)

However, I've also been looking at this and realized that this code can
have a different structure which may allow us to skip the
find_inheritance_children() altogether.  The reason is that we already
scan the parent's list of columns searching for not-null constraints on
each of them; we only need to run this verification on children for
columns where there is none in the parent, and then only in the case
where recursion is turned off.

So I propose the attached patch, which also has some comments to
hopefully explain what is going on and why.  I ran Tom's test script a
few hundred times in a loop and I see no deadlock anymore.


Note that I also considered the idea of just not doing the check at all;
that is, if a child table doesn't have a not-null constraint, then let
  ALTER TABLE ONLY parent ADD PRIMARY KEY ( ... )
create the not-null constraint.  This works fine (it breaks one
regression test query, though that would be easily fixed).  But I don't like
this very much, because it means the user could be surprised by the
lengthy unexpected runtime of creating the primary key, only to realize
that the server is checking the child table for nulls.  This is
especially bad if the user says ONLY.  I think it's better if they have
the chance to create the not-null constraint on their own volition.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
https://twitter.com/thingskatedid/status/1456027786158776329
>From 8aacd6aa67ad6891aaea7a16b9304897798fef39 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Tue, 15 Apr 2025 20:38:54 +0200
Subject: [PATCH] Fix verification of not-null constraints on children during
 PK creation

---
 src/backend/commands/tablecmds.c | 80 +++-
 1 file changed, 49 insertions(+), 31 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3ed69457fc..03dd3754940 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9440,6 +9440,15 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 /*
  * Prepare to add a primary key on table, by adding not-null constraints
  * on all columns.
+ *
+ * An important aspect of this is pg_dump creation of primary keys on
+ * partitioned tables, which is done by first creating the primary key
+ * constraint on the partitioned table itself as not-recursive (so that the
+ * creation of the PK itself doesn't recurse to the partitions), then creating
+ * the corresponding indexes on the partitions, then doing ALTER INDEX ATTACH
+ * PARTITION.  This maximizes parallelism.  However, it means that we must
+ * ensure the creation of not-null constraints on the partitions even if asked
+ * not to recurse.
  */
 static void
 ATPrepAddPrimaryKey(List **wqueue, Relation rel, AlterTableCmd *cmd,
@@ -9447,42 +9456,13 @@ ATPrepAddPrimaryKey(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	AlterTableUtilityContext *context)
 {
 	Constraint *pkconstr;
+	List	   *children;
+	bool		got_children = false;
 
 	pkconstr = castNode(Constraint, cmd->def);
 	if (pkconstr->contype != CONSTR_PRIMARY)
 		return;
 
-	/*
-	 * If not recursing, we must ensure that all children have a NOT NULL
-	 * constraint on the columns, and error out if not.
-	 */
-	if (!recurse)
-	{
-		List	   *children;
-
-		children = find_inheritance_children(RelationGetRelid(rel),
-			 lockmode);
-		foreach_oid(childrelid, children)
-		{
-			foreach_node(String, attname, pkconstr->keys)
-			{
-HeapTuple	tup;
-Form_pg_attribute attrForm;
-
-tup = SearchSysCacheAttName(childrelid, strVal(attname));
-if (!tup)
-	elog(ERROR, "cache lookup failed for attribute %s of relation %u",
-		 strVal(attname), childrelid);
-attrForm = (Form_pg_attribute) GETSTRUCT(tup);
-if (!attrForm->attn

dispchar for oauth_client_secret

2025-04-15 Thread Noah Misch
commit b3f0be7 wrote:
+   {"oauth_scope", NULL, NULL, NULL,
+   "OAuth-Scope", "", 15,
+   offsetof(struct pg_conn, oauth_scope)},

The field containing "" is documented as follows:

	char	   *dispchar;		/* Indicates how to display this field in a
								 * connect dialog. Values are: "" Display
								 * entered value as is "*" Password field -
								 * hide value "D"  Debug option - don't show
								 * by default */

I suspect this should use .dispchar="*" to encourage UIs to display
oauth_client_secret like a password field.  Thoughts?

[I didn't review commit b3f0be7, but this caught my attention.]
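
For reference, the change would look roughly like this (the label, length and
offset values are illustrative, not copied from the commit):

	{"oauth_client_secret", NULL, NULL, NULL,
		"OAuth-Client-Secret", "*", 40,	/* "*" = display as password field */
		offsetof(struct pg_conn, oauth_client_secret)},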




Re: dispchar for oauth_client_secret

2025-04-15 Thread Jacob Champion
On Tue, Apr 15, 2025 at 12:14 PM Noah Misch  wrote:
> I suspect this should use .dispchar="*" to encourage UIs to display
> oauth_client_secret like a password field.  Thoughts?

Hmm, from a UI perspective I agree. (The builtin flow targets "public
clients", where secrets are discouraged and/or understood to be
not-really-secret, but there's no reason to assume that all flows used
by the application are public.)

From a proxy perspective, this would mess with FDW handling. By making
the dispchar '*', oauth_client_secret will be made into a user mapping
option rather than a server option. (Neither is very useful to
postgres_fdw anyway, because the builtin flow needs an end user to
interact with the provider.) But I'm not sure if we'll need to handle
compatibility in the future if we implement a proxy-friendly flow. Is
it okay to move options back and forth during a major version bump? I
assume it would present a problem for pg_upgrade?

Thanks!
--Jacob




Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-04-15 Thread David Rowley
On Tue, 15 Apr 2025 at 21:44, Ilia Evdokimov
 wrote:
> Wrapping the line in costs or verbose would help with test stability, but 
> doing that only to simplify test output doesn't look like the right reason. 
> Maybe it makes more sense to mask these values in other tests that use 
> Memoize too. What do you think?

Disregard what I said about the explain_memoize() function. The new
output just shouldn't be shown with EXPLAIN (costs off).

David




Re: pgsql: Non text modes for pg_dumpall, correspondingly change pg_restore

2025-04-15 Thread Álvaro Herrera
On 2025-Apr-15, Mahendra Singh Thalor wrote:

> I took this patch and did some testing. Patch looks good to me.

Hello, thank you very much for looking.

> I was not able to use "git am" in my setup due to CR's in patch but no
> warning in the patch when I used "patch -p".

Hmm, don't blame me; rather I think your email rig is broken and
produces CRs for no reason (pretty normal on Windows I think).  Here's
what hexdump shows for the attached file direct from the email:

02a0  2d 2d 2d 20 61 2f 73 72  63 2f 62 69 6e 2f 70 67  |--- a/src/bin/pg|
02b0  5f 64 75 6d 70 2f 70 67  5f 72 65 73 74 6f 72 65  |_dump/pg_restore|
02c0  2e 63 0a 2b 2b 2b 20 62  2f 73 72 63 2f 62 69 6e  |.c.+++ b/src/bin|
02d0  2f 70 67 5f 64 75 6d 70  2f 70 67 5f 72 65 73 74  |/pg_dump/pg_rest|
02e0  6f 72 65 2e 63 0a 40 40  20 2d 36 37 2c 31 30 20  |ore.c.@@ -67,10 |
02f0  2b 36 37 2c 32 30 20 40  40 20 73 74 61 74 69 63  |+67,20 @@ static|
0300  20 69 6e 74 09 70 72 6f  63 65 73 73 5f 67 6c 6f  | int.process_glo|
0310  62 61 6c 5f 73 71 6c 5f  63 6f 6d 6d 61 6e 64 73  |bal_sql_commands|
0320  28 50 47 63 6f 6e 6e 20  2a 63 6f 6e 6e 2c 20 63  |(PGconn *conn, c|
0330  6f 6e 73 74 20 63 68 61  72 20 2a 64 75 6d 70 64  |onst char *dumpd|
0340  69 72 70 61 74 68 2c 0a  20 09 09 09 09 09 09 09  |irpath,. ...|
0350  09 09 09 63 6f 6e 73 74  20 63 68 61 72 20 2a 6f  |...const char *o|
0360  75 74 66 69 6c 65 29 3b  0a 20 73 74 61 74 69 63  |utfile);. static|

In bytes 340ff you can see that the comma (2c) is followed by 0a ("line
feed" or newline) and then a bunch of tabs (09).  There's no carriage
return (0d) anywhere.

Maybe you would be able to convince git not to complain by downloading
the file from the email[1] directly onto it somehow, without letting
Windows mess things up.  Maybe something like this on a command line
  curl "https://www.postgresql.org/message-id/attachment/175873/0001-remove-unnecessary-oid_string-list-stuff.patch" | git am -
assuming you can get curl to run on a command line at all.

[1] https://postgr.es/m/202504141220.343fmoxfsbj4@alvherre.pgsql

> *One review comment:*
> We are using FLEXIBLE_ARRAY_MEMBER for dbname. Can we use NAMEDATALEN? In
> the code, many places use this hard-coded size of 64 for names.

Well, yes we can, but then we'd be allocating a bunch of null bytes for
no reason.  The current allocation strategy, which comes from the
original commit not my code, is to allocate a struct with the OID and
just enough bytes to store the database name, which is known.  I don't
see any reason to modify this tbh.  The reason to use NAMEDATALEN in
various places is to have enough room to store whatever name the user
specifies, without having to know the length before the allocation
occurs.

FLEXIBLE_ARRAY_MEMBER is there to document exactly what's happening
here, and used thoroughly across the codebase.
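
For reference, the allocation pattern being described works roughly like this
(names illustrative, not the actual ones from the commit):

typedef struct DbOidName
{
	Oid		oid;
	char	str[FLEXIBLE_ARRAY_MEMBER]; /* null-terminated database name */
} DbOidName;

/* allocate exactly enough bytes for the header plus the known-length name */
DbOidName  *entry = pg_malloc(offsetof(DbOidName, str) + strlen(dbname) + 1);
entry->oid = dboid;
strcpy(entry->str, dbname);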

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"E pur si muove" (Galileo Galilei)




Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-15 Thread Tom Lane
Andrei Lepikhov  writes:
> But what is the way you are proposing here? Do you mean that one more 
> entity will be explicitly introduced: a transformed parse tree?

No, I wasn't thinking of adding new concepts, just saying that the
inputs to certain operations need to be treated as read-only.
What would you envision a "transformed parse tree" being that's not
what we have today?

> It would open an opportunity for extensions to build a set of 
> alternative transformed trees, pass them through the optimisation phase 
> and choose the best plan.

Yeah, the ability to repeatedly operate on a tree without the overhead
of making copies would be a major benefit of being stricter here.

regards, tom lane




Re: Built-in Raft replication

2025-04-15 Thread Konstantin Osipov
* Greg Sabino Mullane  [25/04/15 18:08]:

> > If anyone is working on Raft already I'd be happy to discuss
> > the details. I am fairly new to the PostgreSQL hackers ecosystem
> > so cautious of starting work in isolation/knowing there is no
> > interest in accepting the feature into the trunk.
> >
> 
> Putting aside the technical concerns about this specific idea, it's best to
> start by laying out a very detailed plan of what you would want to change,
> and what you see as the costs and benefits. It's also extremely helpful to
> think about developing this as an extension. If you get stuck due to
> extension limitations, propose additional hooks. If the hooks will not
> work, explain why.
> 
> Getting this into core is going to be a long, multi-year effort, in which
> people are going to be pushing back the entire time, so prepare yourself
> for that. My immediate retort is going to be: why would we add this if
> there are existing tools that already do the job just fine? Postgres has
> lots of tasks that it is happy to let other programs/OS
> subsystems/extensions/etc. handle instead.

I had hoped I explained why external state providers cannot
provide the same seamless UX as built-in ones. The key idea is to
have built-in configuration management, so that adding and
removing replicas does not require changes in multiple disjoint
parts of the installation (server configurations, proxies,
clients).

I understand and accept that it's a multi-year effort, but I do
not accept the retort - my main point is that external tools
are not a replacement, and I'd like to reach consensus on that.

-- 
Konstantin Osipov, Moscow, Russia




Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-15 Thread Tom Lane
Richard Guo  writes:
> On Sun, Apr 6, 2025 at 1:46 AM Tom Lane  wrote:
>> Years ago we fixed the executor to treat its input Plan trees as
>> read-only.  It seems like we really ought to do the same for these
>> upstream subsystems.  Surely, whatever benefit we get from changing
>> Node contents in-place is lost due to all these other hacks.

> While I'm in favor of this idea, it seems that it will require lots of
> code changes.  In particular, within the planner, the parsetree goes
> through considerable transformations during the preprocessing phase,
> such as subquery pull-up, constant folding, and so on.  Would this
> proposal mean we'd need to refactor all those places to treat the
> input parsetrees as read-only?  Just trying to understand the scope of
> what would be involved.

From a high-level point of view, I'd be satisfied to have this
property at the subsystem level: parse analysis can't modify the raw
parse tree output by the grammar, rewriter can't modify the analyzed
parse tree it's given, planner can't modify the parse tree it's given.
That would not preclude, say, portions of the planner doing
modify-in-place as long as it was certain that what they were
modifying was a planner-local copy.  (The proposed test infrastructure
would only be able to enforce things at the subsystem level anyway.)
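
As an illustration, enforcement at the subsystem level could be as crude as
comparing against a pre-call copy; a sketch using the existing copyObject()
and equal() facilities (the wrapper itself is hypothetical):

static Query *
parse_analyze_readonly(RawStmt *parseTree, const char *sourceText,
					   const Oid *paramTypes, int numParams,
					   QueryEnvironment *queryEnv)
{
	RawStmt    *before = copyObject(parseTree);
	Query	   *result = parse_analyze_fixedparams(parseTree, sourceText,
												   paramTypes, numParams,
												   queryEnv);

	/* the raw parse tree must come out exactly as it went in */
	Assert(equal(parseTree, before));
	return result;
}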

As we get into the project we might decide that it'd be worth
enforcing similar rules on portions of those subsystems, eg successive
phases of the planner.  I don't have strong feelings about that at
present.  An example here is that it'd be nice to have some way of
verifying that a GEQO search step doesn't change any of the data
structures that are supposed to remain the same.

I do suspect that there are specific functions that will need to be
made strict about this, for example eval_const_expressions, but that
doesn't necessarily translate to a hard requirement across the board.

regards, tom lane




Re: Fix a resource leak (src/backend/utils/adt/rowtypes.c)

2025-04-15 Thread Ranier Vilela
On Mon, Apr 14, 2025 at 18:11 Tom Lane  wrote:

> Robert Haas  writes:
> > On Sun, Apr 13, 2025 at 7:34 PM Ranier Vilela 
> wrote:
> >> The function *record_in* has a new report about resource leak.
> >> I think Coverity is right.
>
> > I agree, for small values of "right". The memory isn't formally leaked
> > because it will be eventually released when the containing memory
> > context is deleted, but it's unclear why we should bother to clean up
> > the memory in the normal path yet skip it here. I wondered whether the
> > existing pfree calls were added in response to some specific
> > complaint, but it doesn't appear so: they date back to Tom's 2004-era
> > commit a3704d3deca6d08013a6b1db0432b75dc6b78d28, the commit message
> > for which is rather more brief than what is typical today. Still, it
> > seems safer to bet on the pfree being a good idea than on the reverse,
> > because record_in() can be called lots of times in a single
> > transaction.
>
> Well, the pfree's in the main path do, but the "fail:" code path is
> considerably more recent (ccff2d20e).  In the latter, I believe I
> thought it wasn't worth the bookkeeping to keep track of whether these
> allocations had been made yet.  Note the comment
>
> /* exit here once we've done lookup_rowtype_tupdesc */
>
> i.e. there is (was) exactly one criterion for whether to "goto fail"
> or just return.  I don't love the proposed patch, partly because it
> doesn't bother to update that comment after falsifying it,

I can't see how the patch could do that.
The comment is still valid, isn't it?
But it was incomplete from the beginning; it could be:
/* exit here once we've done lookup_rowtype_tupdesc, or on failure */

but mostly
> because it makes the code more fragile.  Put a "goto fail" in the
> wrong place, and kaboom!
>
The only fragile thing is the StringInfoData buf.
That's why the patch moves the initialization of buf, to avoid the "kaboom".


>
> We could do something like initialize all these variables to NULL
> at the top and then write
>
> if (values)
> pfree(values);
>
This is not necessary: on both the normal path and the failure path,
all variables are correctly initialized and correctly released.


> et cetera.  But I don't see the point.  Frankly, if I wanted to make
> this more consistent I would delete the pfrees in the main path.
> Exactly zero evidence has been offered that that would create a leak
> of significance, and the existing code goes against our general advice
> that retail pfree's are more expensive than letting context reset
> clean up.  (The fact that Coverity doesn't understand that doesn't
> make it not true.)
>
I think something needs to be done, and I vote for freeing the memory when it
fails.
The function as it is has "bipolar disorder":
sometimes it behaves one way, other times another.
IMO, there is also zero evidence that not releasing it would be beneficial.

best regards,
Ranier Vilela


Re: FmgrInfo allocation patterns (and PL handling as staged programming)

2025-04-15 Thread Tom Lane
Chapman Flack  writes:
> On 04/06/25 22:37, Tom Lane wrote:
>> BTW, I feel a little uncomfortable with the idea that we're adding
>> dependencies on objects that are explicitly mentioned nowhere in the
>> pg_proc entry.

> Pursuing that idea a bit further, was there a compelling original reason
> the column in pg_proc had to be protrftypes and not just protransforms?

The problem from a PL's standpoint is "given this input or output
of type FOO, should I transform it, and if so using what?".  So
the starting point has to be a type not a transform.  The lookups
are implemented by get_transform_tosql and get_transform_fromsql,
and if you look at those you find that the protrftypes data is used
as a filter before attempting a pg_transform lookup.  If the pg_proc
entry contained transform OIDs we'd have to filter after the lookup,
which is pretty inefficient, especially if you expect that the normal
case is that there's not an applicable transform.

So I agree with storing protrftypes.  Maybe we should also have stored
a parallel array of transform OIDs, but it'd be just to make the
dependency more obvious, and maybe it's not worth the storage space.

regards, tom lane




Re: Fundamental scheduling bug in parallel restore of partitioned tables

2025-04-15 Thread Ashutosh Bapat
On Mon, Apr 14, 2025 at 11:14 PM Tom Lane  wrote:
>
> I wrote:
> > Here's a draft patch for this.  It seems to fix the problem in
> > light testing.
>
> I realized that the "repro" I had for this isn't testing the same
> thing that Dimitrios is seeing; what it is exposing looks more like
> a bug or at least a behavioral change due to the v18 work to record
> not-null constraints in pg_constraint [1].  So my patch may fix his
> problem or it may not.  It would be good to have a reproducer that
> fails (not necessarily every time) in v17 or earlier.
>

I tried to reproduce the problem using your script on v17, but couldn't
get either the deadlock or the constraint violation error.

>
> This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
> command might fail outright if we've loaded data for the referencing
> table but not the referenced table.

There's a comment in getConstraints()
/*
* Restoring an FK that points to a partitioned table requires that
* all partition indexes have been attached beforehand. Ensure that
* happens by making the constraint depend on each index partition
* attach object.
*/
The FK constraint addition will wait for the child indexes of the referenced
partitioned table to be attached, which in turn wait for data to be
loaded into the child tables. So it doesn't look like we will see ADD
CONSTRAINT failing. I may be missing something though.

>  It could deadlock against other
> parallel restore jobs, as reported in [1] (and which I find not
> too terribly hard to reproduce here).

> Even if it doesn't fail, if
> it completes before we load data for the referencing table, we'll
> have to do retail FK checks, greatly slowing that data load.

FWIW, executing pg_restore -j2 -v, I think I see evidence that the FK
constraint is created before data is loaded into the referencing
table.

pg_restore: processing data for table "public.c22"
... snip
pg_restore: launching item 2477 FK CONSTRAINT parent2 parent2_ref_fkey
pg_restore: creating FK CONSTRAINT "public.parent2 parent2_ref_fkey"
pg_restore: finished item 2626 TABLE DATA c22
... snip
pg_restore: launching item 2625 TABLE DATA c21
pg_restore: processing data for table "public.c21"
pg_restore: finished item 2477 FK CONSTRAINT parent2 parent2_ref_fkey
... snip
pg_restore: finished item 2625 TABLE DATA c21

I tried applying your patch on v17 to see whether it causes the FK
creation to wait, but the patch doesn't apply cleanly.

-- 
Best Wishes,
Ashutosh Bapat




Re: pgsql: Add function to get memory context stats for processes

2025-04-15 Thread Robert Haas
On Thu, Apr 10, 2025 at 4:05 PM Andres Freund  wrote:
> I don't know of existing discussion, but it seems rather fundamental to me -
> if either DSA or memory contexts could be inconsistent at a CFI(), how could
> it possibly be safe to interrupt at that point? After all, after an error you
> need to be able to reset the memory contexts / release memory in a
> dsa/dshash/whatnot? Memory context reset requires walking over the allocations
> made in the context, similar releasing a dsa?

I think it would be a bit surprising if somebody put a
CHECK_FOR_INTERRUPTS() inside aset.c or similar, but I don't see a
reason why we couldn't end up with one reachable via the DSA code. DSA
calls DSM which depending on dynamic_shared_memory_type might involve
filesystem operations. That's a fairly large amount of code. I admit I
have no particular theory about how CFI could be reachable from there
today, but even if it definitely isn't, I don't see why someone would
hesitate to add one in the future.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Jacob Champion
On Tue, Apr 15, 2025 at 5:31 AM Peter Eisentraut  wrote:
> On 10.04.25 01:08, Jacob Champion wrote:
> > Christoph noted that this was also confusing from the packaging side,
> > earlier,

Since Christoph has withdrawn the request, I will drop -0002.

However, I'll go ahead and put some of my opinions down on paper here:

> The general idea, at least on the Autoconf side, is that --with-FOO
> means enable all the features that require library FOO.

I don't think this is particularly user-friendly if it's not obvious
what feature is enabled by FOO.

LDAP? PAM? Sure. SSL? Eh, I think the pgcrypto coupling is a little
strange -- that's not implied by "SSL" at all! -- but it's not
problematic enough to complain loudly. --with-gssapi selects... some
dependency... which may or may not come from a particular library.
--with-bsd-auth doesn't add any library dependencies at all, instead
depending on the kernel, but it makes sense.

But there's no connection between "libcurl" and "OAuth Device
Authorization flow" in anyone's mind except the people who have worked
on that feature.

If the argument is that we'd need to switch to --enable-oauth-client
rather than --with-oauth-client, that works for me. But I don't quite
understand the desire to stick to the existing configuration
methodology for something that's very different from an end-user
perspective.

> The naming system you propose has problems:
>
> First, what if we add another kind of "oauth-client" that doesn't use
> libcurl, how would you extend the set of options?

With an extension to the values that you can provide to
--with-oauth-client, similarly to what was originally proposed for
--with-ssl.
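
Concretely, the two spellings under discussion would be something like this
(the feature-named flag and its value syntax are hypothetical, modeled on the
old --with-ssl proposal):

# dependency-named, the current approach:
./configure --with-libcurl

# feature-named, as proposed:
./configure --with-oauth-client=curl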

> Second, what if we add some kind of oauth plugin for the server that
> uses libcurl, how would you extend the set of options?

With a new option.

But let me turn this around, because we currently have the opposite
problem: if someone comes in and adds a completely new feature
depending on libcurl, and you want OAuth but you do not want that new
feature -- or vice-versa -- what do you do? In other words, what if
your concern is not with libcurl, but with the feature itself?

> But worse, what you are hiding is the information what
> dependencies you are pulling in, which is the actual reason for the
> options. (If there was no external dependency, there would be no option
> at all.)

I'm not sure I agree, either practically or philosophically. I like to
see the build dependencies, definitely, but I also like to see the
features. (Meson will make both things visible separately, for that
matter.)

> This seems unnecessarily complicated and inconsistent.  Once you have
> made the choice of taking the libcurl dependency, why not build
> everything that requires it?

Simply because the end user or packager might not want to.

In any case -- I won't die on this particular hill, and I'm happy to
continue forward with 0001 alone.

Thanks!
--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Robert Haas
On Tue, Apr 15, 2025 at 8:32 AM Peter Eisentraut  wrote:
> On 10.04.25 01:08, Jacob Champion wrote:
> > Christoph noted that this was also confusing from the packaging side,
> > earlier, and Daniel proposed -Doauth-client/--with-oauth-client as the
> > feature switch name instead. Any objections? Unfortunately it would
> > mean a buildfarm email is in order, so we should get it locked in.
>
> We had that discussion early in the development, and I still think it's
> not the right choice.

I strongly agree. I think it will not be long before somebody
implements a second feature depending on libcurl, and there's no
precedent for the idea that we allow those features to be enabled and
disabled individually. If that turns out to be something that is
wanted, then since this will be a separate library, a packager can
choose not to ship it, or to put it in a separate package, if they
wish. If there's REALLY a lot of demand for a separate enable/disable
switch for this feature then we can consider making this an exception
to what we do for all other dependent libraries, but I bet there won't
be. I can imagine someone not wanting libcurl on their system on the
theory that it would potentially open up the ability to download data
from arbitrary URLs which might be considered bad from a security
posture -- but I don't really see why someone would be particularly
upset about one particular way in which libcurl might be used.

I also don't mind being wrong, of course. But I think it's better to
bet on making this like other things and then change strategy if that
doesn't work out, rather than starting out by making this different
from other things.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Jacob Champion
On Tue, Apr 15, 2025 at 11:57 AM Christoph Berg  wrote:
> What made me reconsider was Peter saying that what defines the blast
> radius of some feature is usually the extra dependency pulled in. If
> you don't like tracking OpenSSL problems, build without it. If you
> don't like libcurl, build without it. That's the "we are going to be
> hated by security scanner people" argument that brought up this
> sub-thread.
>
> Now if the feature itself were a problem, that might change how
> configuration should be working. Is "libpq can now initiate oauth
> requests" something people would like to be able to control?

Well... I'd sure like to live in a world where users thought about the
implications and risks of what they're using and why, rather than
farming a decision out to a static analysis tool. ("And as long as I'm
dreaming, I'd like a pony.")

But end users already control the initiation of OAuth requests (it's
opt-in via the connection string), so that's not really the goal.

> Debian does not care really about static libs. We are currently
> shipping libpq.a, but if it breaks in any funny way, we might as well
> remove it.

Awesome. I think we have a consensus.

Thanks!
--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Jacob Champion
On Tue, Apr 15, 2025 at 12:21 PM Robert Haas  wrote:
> I also don't mind being wrong, of course. But I think it's better to
> bet on making this like other things and then change strategy if that
> doesn't work out, rather than starting out by making this different
> from other things.

Works for me. (And it's less work, too!)

Thanks,
--Jacob




Re: not null constraints, again

2025-04-15 Thread Tom Lane
Alvaro Herrera  writes:
> However, I've also been looking at this and realized that this code can
> have different structure which may allows us to skip the
> find_inheritance_children() altogether.  The reason is that we already
> scan the parent's list of columns searching for not-null constraints on
> each of them; we only need to run this verification on children for
> columns where there is none in the parent, and then only in the case
> where recursion is turned off.

+1.  Fundamentally the problem here is that pg_restore needs

ALTER TABLE ONLY foo ADD PRIMARY KEY

to not recurse to child tables at all.  It is expecting this command
to acquire a lock on foo and nothing else; and it has already taken
care of making foo's PK column(s) NOT NULL, so there is no reason we
should have to examine the children.
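
For concreteness, a minimal sketch of that restore pattern (names are
illustrative, not from an actual dump):

    -- pg_restore has already created the partitioned table and its children,
    -- with the PK column marked NOT NULL:
    CREATE TABLE foo (id int NOT NULL) PARTITION BY RANGE (id);
    CREATE TABLE foo_1 PARTITION OF foo FOR VALUES FROM (0) TO (10);

    -- Per the expectation above, this should lock only foo and not recurse:
    ALTER TABLE ONLY foo ADD PRIMARY KEY (id);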

Looking at the patch itself, it doesn't seem like the got_children
flag is accomplishing anything; I guess that was leftover from an
earlier version?  You could declare "List *children" inside the
block where it's used, too.  Basically, this patch is just moving
the check-the-children logic from one place to another.

Also I find the comments still a bit confusing, but maybe that's
on me.

regards, tom lane




Re: Non-text mode for pg_dumpall

2025-04-15 Thread Mahendra Singh Thalor
On Sat, 5 Apr 2025 at 01:41, Andrew Dunstan  wrote:
>
>
> On 2025-04-04 Fr 5:12 AM, Mahendra Singh Thalor wrote:
>
> On Fri, 4 Apr 2025 at 13:52, Mahendra Singh Thalor  wrote:
>
> On Fri, 4 Apr 2025 at 01:17, Andrew Dunstan  wrote:
>
> On 2025-04-01 Tu 1:59 AM, Mahendra Singh Thalor wrote:
>
> On Mon, 31 Mar 2025 at 23:43, Álvaro Herrera  wrote:
>
> Hi
>
> FWIW I don't think the on_exit_nicely business is in final shape just
> yet.  We're doing something super strange and novel about keeping track
> of an array index, so that we can modify it later.  Or something like
> that, I think?  That doesn't sound all that nice to me.  Elsewhere it
> was suggested that we need some way to keep track of the list of things
> that need cleanup (a list of connections IIRC?) -- perhaps in a
> thread-local variable or a global or something -- and we install the
> cleanup function once, and that reads from the variable.  The program
> can add things to the list, or remove them, at will; and we don't need
> to modify the cleanup function in any way.
>
> --
> Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
>
> Thanks Álvaro for the feedback.
>
> I removed the old handling of on_exit_nicely_list from the last patch
> set and added one simple function to just update the archive handle in
> shutdown_info.  (shutdown_info.AHX = AHX;)
>
> For the first database, we will add an entry into the on_exit_nicely_list
> array, and for the rest of the databases we will update only shutdown_info,
> as we have already closed the connection for the previous database. With
> this fix, we will not touch the on_exit_nicely_list entry for each database.
>
> Here, I am attaching updated patches.
>
>
> OK, looks good. here's my latest. I'm currently working on tidying up
> docco and comments.
>
>
> cheers
>
>
> andrew
>
>
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
> Thanks Andrew for the updated patches.
>
> Here, I am attaching a delta patch with some more TAP-test cases.
>
> Here, I am attaching an updated delta patch which has some more TAP
> tests. Please include these tests also. This patch can be applied on
> v20250403_0004* patch.
>
>
>
> Thanks. I have pushed these now with a few further small tweaks.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com

Hi Andrew,
I did some refactoring to find out the dump file extension (.dmp/.tar, etc.)
in pg_restore. With the attached patch, we will not try to find out the
file extension for each database; rather, we will find it out once before the
loop.

Here, I am attaching a patch for the same. Please have a look over this.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


v01-pg_restore-refactor-code-of-dump-file-extenion.patch
Description: Binary data


Re: Non-text mode for pg_dumpall

2025-04-15 Thread Andrew Dunstan



On 2025-04-15 Tu 2:30 PM, Mahendra Singh Thalor wrote:

Hi Andrew,
I did some refactoring to find out the dump file extension (.dmp/.tar, etc.)
in pg_restore. With the attached patch, we will not try to find out the
file extension for each database; rather, we will find it out once before the
loop.

Here, I am attaching a patch for the same. Please have a look over this.




That doesn't look right at first glance. You shouldn't have to tell 
pg_restore what format to use; it should be able to intuit it from the 
dumps (and that's what the docs say it does).


The saving here would be hardly measurable anyway - you would be in 
effect saving one or two stat calls per database.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Christoph Berg
Re: Jacob Champion
> But there's no connection between "libcurl" and "OAuth Device
> Authorization flow" in anyone's mind except the people who have worked
> on that feature.

Fwiw that was exactly the reason I originally voiced the idea to
rename.

> But let me turn this around, because we currently have the opposite
> problem: if someone comes in and adds a completely new feature
> depending on libcurl, and you want OAuth but you do not want that new
> feature -- or vice-versa -- what do you do? In other words, what if
> your concern is not with libcurl, but with the feature itself?

What made me reconsider was Peter saying that what defines the blast
radius of some feature is usually the extra dependency pulled in. If
you don't like tracking OpenSSL problems, build without it. If you
don't like libcurl, build without it. That's the "we are going to be
hated by security scanner people" argument that brought up this
sub-thread.

Now if the feature itself were a problem, that might change how
configuration should be working. Is "libpq can now initiate oauth
requests" something people would like to be able to control?


Re: Jacob Champion
> Dynamic: --with-libcurl builds a runtime-loadable module, and if you
> don't install it, OAuth isn't supported (i.e. it's optional)

Ok.

> Static: --with-libcurl builds an additional linkable staticlib, which
> you must link into your application (i.e. not optional)

Debian does not care really about static libs. We are currently
shipping libpq.a, but if it breaks in any funny way, we might as well
remove it.

Christoph




Re: use correct variable in error message in _allocAH function (pg_backup_archiver.c)

2025-04-15 Thread Tom Lane
Fujii Masao  writes:
> Thanks for the report and patch! It looks good to me.

Agreed.

> Since this issue exists in the back branches,
> the patch needs be back-patched to all supported versions.

I doubt it's worth the trouble and buildfarm cycles to
back-patch, since this should be a can't-happen code path.
Worth fixing in HEAD, yes, but not convinced about doing
more than that.

regards, tom lane




Re: Built-in Raft replication

2025-04-15 Thread Greg Sabino Mullane
On Mon, Apr 14, 2025 at 1:15 PM Konstantin Osipov 
wrote:

> If anyone is working on Raft already I'd be happy to discuss
> the details. I am fairly new to the PostgreSQL hackers ecosystem
> so cautious of starting work in isolation/knowing there is no
> interest in accepting the feature into the trunk.
>

Putting aside the technical concerns about this specific idea, it's best to
start by laying out a very detailed plan of what you would want to change,
and what you see as the costs and benefits. It's also extremely helpful to
think about developing this as an extension. If you get stuck due to
extension limitations, propose additional hooks. If the hooks will not
work, explain why.

Getting this into core is going to be a long, multi-year effort, in which
people are going to be pushing back the entire time, so prepare yourself
for that. My immediate retort is going to be: why would we add this if
there are existing tools that already do the job just fine? Postgres has
lots of tasks that it is happy to let other programs/OS
subsystems/extensions/etc. handle instead.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-04-15 Thread Robert Haas
On Mon, Apr 14, 2025 at 8:23 PM David Rowley  wrote:
> Maybe we could compress the text output a bit with:
>
> "Estimates: capacity=N distinct keys=N hit ratio=N.N%"

Sure, that works for me.

> If we get it that compact, maybe lookups could fit in there too, i.e.:
>
> "Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

Is lookups=N here the estimated number of lookups i.e. what we think
nloops will end up being?
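
For context, a sketch of where the proposed line would appear in a plan tree
(hypothetical output with invented numbers, not what any current release
prints):

    EXPLAIN (COSTS OFF)
    SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;   -- t1/t2 are made-up tables
    -- ...
    --   ->  Memoize
    --         Cache Key: t1.a
    --         Estimates: capacity=1024 distinct keys=100 lookups=10000 hit ratio=99.00%
    --         ->  Index Scan using t2_a_idx on t2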

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: AIX support

2025-04-15 Thread Srirama Kucherlapati
Hi Team, Please find the attached diff with the latest changes.

The diff file changes for memset and spinlock are made on top of the
previous patch.

New changes for review: 0001-AIX-support.tas.memset.diffs
Previous patch: 0001-AIX-support.v8.patch


>> diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
>
We have replaced the asm code with the gcc-specific routine
__sync_lock_test_and_set() and ran pgbench. On AIX both have similar
performance; on PPCle we see a slight variation, and the asm code seems to
perform better there. Please let us know whether it would be better to
continue using the PPC assembly code over the gcc routines.

Attached are the complete stats. (spinlock.stats.log)
Below is the summary…

                          AIX with asm    AIX with __sync  PPCle with asm   PPCle with __sync
number of transactions    226554/556329   230810/567936    356945/918647    364346/863937
above the 10.0 ms         (40.723%)       (40.640%)        (38.856%)        (42.173%)
latency limit
latency average           16.160 ms       15.821 ms        9.796 ms         10.416 ms
initial connection time   230.786 ms      249.668 ms       19.606 ms        20.090 ms
tps                       3086.209412     3158.149155      5103.024267      4799.621896


> diff --git a/src/makefiles/Makefile.aix b/src/makefiles/Makefile.aix
>
> At least it needs to be updated to match what MemSet() looks like
> nowadays. The changes may be just cosmetic, but better check. Should
> also check the effect on MemSetAligned(). That might matter more for
> performance in practice.

As per the stats in the previous mail regarding memset, both the loop and the
native memset perform similarly after optimization with “-O2”. So for now we
removed the native memset changes.
We will set up memset performance tests on different platforms and will post
the details in a different thread.


>>> +# -blibpath must contain ALL directories where we should look for libraries
>>> +libpath := $(shell echo $(subst -L,:,$(filter -L/%,$(LDFLAGS))) | sed 
>>> -e's/ //g'):/usr/lib:/lib
>

Added additional details about the -blibpath usage. These details are
available at the link below:
https://download.boulder.ibm.com/ibmdl/pub/software/dw/aix/es-aix_ll.pdf



> 21  -- test overflow/underflow handling
>
> 22  SELECT gamma(float8 '-infinity');
>
> 23  ERROR:  value out of range: overflow

This test case is resolved on the AIX libm side. Below is the result:


postgres=#
SELECT x, gamma(x),lgamma(x)
FROM (VALUES (0.5), (1), (2), (3), (4), (5),
(float8 'infinity'), (float8 'nan')) AS t(x);

    x     |       gamma       |       lgamma
----------+-------------------+--------------------
      0.5 | 1.772453850905516 | 0.5723649429247001
        1 |                 1 |                  0
        2 |                 1 |                  0
        3 |                 2 | 0.6931471805599453
        4 |                 6 |  1.791759469228055
        5 |                24 | 3.1780538303479458
 Infinity |          Infinity |           Infinity
      NaN |               NaN |                NaN


Kindly let us know your feedback on the diffs.
I will commit the diffs to the patch (0001-AIX-support.v8.patch) once you are
fine with them.



Warm regards,
Sriram.




0001-AIX-support.tas.memset.diffs
Description: 0001-AIX-support.tas.memset.diffs
--
AIX with asm spin lock


>> pgbench -c 50 -p 5678 -d postgres -T 180 -r -P 10  -L 10 -j 20
pgbench (18devel)
starting vacuum...end.
progress: 10.0 s, 3181.5 tps, lat 15.331 ms stddev 17.867, 0 failed
progress: 20.0 s, 3277.2 tps, lat 14.730 ms stddev 18.792, 0 failed
progress: 30.0 s, 2642.9 tps, lat 19.556 ms stddev 106.940, 0 failed
progress: 40.0 s, 3614.3 tps, lat 13.830 ms stddev 14.673, 0 failed
progress: 50.0 s, 2355.4 tps, lat 21.208 ms stddev 132.907, 0 failed
progress: 60.0 s, 3046.8 tps, lat 16.407 ms stddev 53.507, 0 failed
progress: 70.0 s, 3640.1 tps, lat 13.734 ms stddev 14.549, 0 failed
progress: 80.0 s, 2435.2 tps, lat 20.518 ms stddev 125.603, 0 failed
progress: 90.0 s, 3576.3 tps, lat 13.977 ms stddev 14.341, 0 failed
progress: 100.0 s, 3653.9 tps, lat 13.681 ms stddev 14.554, 0 failed
progress: 110.0 s, 2414.6 tps, lat 20.684 ms stddev 134.585, 0 failed
progress: 120.0 s, 3051.6 tps, lat 16.383 ms stddev 63.253, 0 failed
progress: 130.0 s, 3657.2 tps, lat 13.667 ms stddev 14.375, 0 failed
progress: 140.0 s, 3159.0 tps, lat 14.259 ms stddev 18.915, 0 failed
progress: 150.0 s, 2889.8 tps, lat 19.006 ms stddev 121.852, 0 failed
progress: 160.0 s, 3649.0 tps, lat 13.693 ms stddev 14.288, 0 failed
progress: 170.0 s, 2799.8 tps, lat 14.552 ms stddev 17.307, 0 failed
progress: 180.0 s, 2582.9 tps, lat 22.914 ms stddev 121.785, 0 failed
transaction type: 
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 20
maximum number of tries: 1
duration: 180 s
number of transactions actually processed: 556329
number of failed transactions: 0 (0.00

Re: pg_dump --if-exists --clean when drop index that is partition of a partitioned index

2025-04-15 Thread jian he
On Tue, Apr 15, 2025 at 1:45 PM Pavel Stehule  wrote:
>>
>> seems pretty easy to fix.
>> we only need dropStmt when IndxInfo->parentidx oid is invalid.
>>
>> +if (!OidIsValid(indxinfo->parentidx))
>> +appendPQExpBuffer(delq, "DROP INDEX %s;\n", qqindxname);
>
>
> I don't think it is the correct fix.
>
> because then fails CREATE INDEX qqindxname
>
> The motivation for using the --clean and --if-exists options is the possibility
> of calling restore idempotently. It should not fail when I restore into an empty
> database, and it shouldn't fail if I restore into an existing database.
>


I am not sure what you mean by ``CREATE INDEX qqindxname``
failing?

for example:

>> >> CREATE TABLE tp(c int, a int, b int) PARTITION BY RANGE (b);
>> >> CREATE TABLE tp_1(c int, a int, b int);
>> >> ALTER TABLE tp ATTACH PARTITION tp_1 FOR VALUES FROM (0) TO (1);
>> >> CREATE INDEX t_a_idx ON tp_1(a);
>> >> CREATE INDEX tp_a_idx ON tp(a);

with the patch, pg_dump output will be
```
DROP INDEX IF EXISTS public.tp_a_idx;
DROP TABLE IF EXISTS public.tp_11;
DROP TABLE IF EXISTS public.tp;

CREATE INDEX tp_a_idx ON ONLY public.tp USING btree (a);
CREATE INDEX tp_11_a_idx ON public.tp_11 USING btree (a);
ALTER INDEX public.tp_a_idx ATTACH PARTITION public.tp_11_a_idx;
```

What's your expectation?




Re: Fundamental scheduling bug in parallel restore of partitioned tables

2025-04-15 Thread Dimitrios Apostolou
On 15 April 2025 18:18:41 CEST, Tom Lane  wrote:
>Dimitrios Apostolou  writes:
>> I only got the "ERROR: deadlock detected" message once, with pg_restore 
>> compiled from master branch.
>
>Oh, hmm ... do you recall just when on the master branch?  I'm
>wondering if it was before or after 14e87ffa5 (2024-11-08).
>If it was after, then it might have been subject to the same
>issue with not-null constraints that I ran into.
>
>   regards, tom lane

Definitely after.




Re: rounding_up

2025-04-15 Thread Daria Shanina
Hi, Tom!

> Round-to-nearest-even is a well-respected rule

Yes, you're convinced me! I can’t argue with IEEE 754 =) And, of course,
can’t break compatibility.


Best regards,

Daria Shanina

On Mon, Apr 14, 2025 at 17:26, Tom Lane  wrote:

> Christoph Moench-Tegeder  writes:
> > ## Daria Shanina (vilensip...@gmail.com):
> >> I noticed, when we parse and validate values (in particular, the int type),
> >> we use the *rint* method, but unfortunately it does not work according to
> >> the round rules.
>
> > First question would be "which round rule?" as (of course) there're
> > multiple to chose from.
>
> Yeah.  Round-to-nearest-even is a well-respected rule, which is why
> it's the default per IEEE 754.  I don't see a good reason for us
> to switch.  Even if someone advanced an argument, it would have
> to be a *mighty* convincing argument to justify breaking backwards
> compatibility here.
>
> I do find it a little unfortunate that our numeric type does it
> differently than our float types.  Again though, there's a huge
> compatibility argument against changing that now.  It does give
> you an "out" if you really need one or the other behavior for
> a particular application: you can cast to numeric or float8
> before casting to int.
>
> regards, tom lane
>
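
For illustration, a quick psql sketch of the two behaviors described above
(no tables needed; expected output shown in comments):

    -- float8 -> int goes through rint(): round-half-to-even (IEEE 754 default)
    SELECT 0.5::float8::int AS f0, 1.5::float8::int AS f1, 2.5::float8::int AS f2;
    --  f0 | f1 | f2
    --   0 |  2 |  2

    -- numeric -> int rounds halves away from zero ("school" rounding)
    SELECT 0.5::numeric::int AS n0, 1.5::numeric::int AS n1, 2.5::numeric::int AS n2;
    --  n0 | n1 | n2
    --   1 |  2 |  3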


-- 
Best regards,
Daria Shanina


Re: rounding_up

2025-04-15 Thread Daria Shanina
Hi, Christoph!

You wrote a very interesting answer.


> First question would be "which round rule?"

I meant rounding “as at school”, but there are nuances in programming.

> https://www.postgresql.org/docs/current/functions-math.html

Thanks a lot for the link to the doc!

> you set it with fesetround() (see the man page)

To my great shame, I didn't know about fesetround(). Thanks a lot too!


Best regards,

Daria Shanina

On Mon, Apr 14, 2025 at 17:26, Tom Lane  wrote:

> Christoph Moench-Tegeder  writes:
> > ## Daria Shanina (vilensip...@gmail.com):
> >> I noticed, when we parse and validate values (in particular, the int type),
> >> we use the *rint* method, but unfortunately it does not work according to
> >> the round rules.
>
> > First question would be "which round rule?" as (of course) there're
> > multiple to choose from.
>
> Yeah.  Round-to-nearest-even is a well-respected rule, which is why
> it's the default per IEEE 754.  I don't see a good reason for us
> to switch.  Even if someone advanced an argument, it would have
> to be a *mighty* convincing argument to justify breaking backwards
> compatibility here.
>
> I do find it a little unfortunate that our numeric type does it
> differently than our float types.  Again though, there's a huge
> compatibility argument against changing that now.  It does give
> you an "out" if you really need one or the other behavior for
> a particular application: you can cast to numeric or float8
> before casting to int.
>
> regards, tom lane
>


-- 
Best regards,
Daria Shanina


Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Jacob Champion
On Mon, Apr 14, 2025 at 1:17 PM Jacob Champion
 wrote:
> I believe so. I'm in the middle of the .pc stuff right now; v6 should
> have the fixes as long as I don't get stuck.

Done in v6-0001. I think this is now architecturally complete, so if
reviewers are happy I can work on docs and the commit message. As a
summary:

- We provide a libpq-oauth-18.so module for shared builds, and a
corresponding .a for static builds, when OAuth is enabled.
- Platforms which cannot support the builtin flow now error out if you
request OAuth at configure time.
- When OAuth is enabled and there's no custom client flow, libpq.so
loads the module via dlopen(), which respects RPATH/LD_LIBRARY_PATH et
al. If it's not installed, OAuth doesn't continue.
- Static builds must link libpq-oauth-18.a explicitly. libpq.pc now
puts -lpq-oauth-18 in Libs.private, and libcurl in Requires.private.
- Internally, we compile separate versions of fe-auth-oauth.c to
handle the different cases (disabled, dynamic, static). This is
borrowed from src/common.
- The only new export from libpq is appendPQExpBufferVA. Other
internal symbols are shared with libpq-oauth via dependency injection.

v6-0002 is a WIP rename of the --with-libcurl option to
--with-oauth-client. I'm not sure I have all of the Meson corner cases
with auto_features figured out, but maybe it doesn't matter since it's
temporary (it targets a total of seven buildfarm animals, and once
they've switched we can remove the old name). I have added a separate
open item for this.

Thanks,
--Jacob


v6-0002-oauth-rename-with-libcurl-to-with-oauth-client.patch
Description: Binary data


v6-0001-WIP-split-Device-Authorization-flow-into-dlopen-d.patch
Description: Binary data


Re: Fix a resource leak (src/backend/utils/adt/rowtypes.c)

2025-04-15 Thread Robert Haas
On Sun, Apr 13, 2025 at 7:34 PM Ranier Vilela  wrote:
> CID 1608916: (#1 of 1): Resource leak (RESOURCE_LEAK)
> 52. leaked_storage: Variable buf going out of scope leaks the storage 
> buf.data points to.
>
> The function *record_in* has a new report about resource leak.
> I think Coverity is right.

I agree, for small values of "right". The memory isn't formally leaked
because it will be eventually released when the containing memory
context is deleted, but it's unclear why we should bother to clean up
the memory in the normal path yet skip it here. I wondered whether the
existing pfree calls were added in response to some specific
complaint, but it doesn't appear so: they date back to Tom's 2004-era
commit a3704d3deca6d08013a6b1db0432b75dc6b78d28, the commit message
for which is rather more brief than what is typical today. Still, it
seems safer to bet on the pfree being a good idea than on the reverse,
because record_in() can be called lots of times in a single
transaction.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New committer: Jacob Champion

2025-04-15 Thread Andrey Borodin



> On 12 Apr 2025, at 01:26, Jonathan S. Katz  wrote:
> 
> Please join us in wishing Jacob much success and few reverts!

Congratulations, Jacob!


Best regards, Andrey Borodin.




Re: someone else to do the list of acknowledgments

2025-04-15 Thread Aleksander Alekseev
Hi,

>> The whole thing might take about 20 to 30 hours wall-clock time.
>
> After this dev cycle, things with a defined end to them hold a greater 
> attraction than they did previously.
>
>>
>> So, there is some time to think about this.  Please discuss here if
>> you're interested or have questions.
>
> I am interested.

+1

-- 
Best regards,
Aleksander Alekseev




Re: Support for runtime parameters in injection points, for AIO tests

2025-04-15 Thread Andres Freund
Hi,

On 2025-04-14 16:46:22 +0900, Michael Paquier wrote:
> While reading the AIO code I have noticed that the way it uses
> injection points is limited by the fact that we don't have support for
> runtime parameters in the existing injection point facility.

Yep.


> The code is shaped with two set/get functions that set a static parameter
> that the injection callback would reuse internally, using
> pgaio_inj_io_get(), and pgaio_io_call_inj() and a static
> pgaio_inj_cur_handle.  Relying on a static variable for that is not a good
> idea, IMO, even if the stack is reset with a TRY/CATCH block on error in the
> callback run.

It's not great, I agree.  I was frankly rather surprised to see that injection
points don't support anything to pass parameters, since, without that, you
really can't actually inject different results or anything reliably.


> We are still in early April, and I'd like to propose a cleanup of the
> AIO code on HEAD, even if we are post-feature freeze, to not release
> this new code in this state, implying an open item.  Please note that
> I'm OK to take responsibility for this patch set at the end, reviews
> are welcome.

I'm pretty agnostic about this happening for 18 or not.  If we can do it,
good, but if not, the impact of needing to use static variable supporting
injection points is really rather small.

I'd say that the fact that injection variables are really hard to use in
critical sections, requiring to have attached to the injection point
beforehand, is worse.


> Anyway, I have spent some time with my mind on that, and finished
> with the attached patch set:
> - 0001 is the addition of runtime parameters in the backend code.  I
> have made the choice of extending the existing INJECTION_POINT() and
> INJECTION_POINT_CACHED() instead of introducing new macros.  That's a
> matter of taste, perhaps, but increasing the number of these macros
> leads to a confusing result.  This one is more invasive, but that's OK
> for me.

I can see arguments for going either way on that one.


> The code is shaped so as we can rely on InjectionPointCallback to define the
> shape of a callback.

I don't know what this means though?

Greetings,

Andres Freund




Re: Doc: Move standalone backup section, mention -X argument

2025-04-15 Thread Benoit Lobréau

On 3/16/25 2:19 PM, vignesh C wrote:

I noticed that Alvaro's comment from [1] is not yet addressed, I have
changed the status of commitfest entry to Waiting on Author, please
address them and change the status back to Needs review.
[1] - 
https://www.postgresql.org/message-id/202502101154.bmb536npfl5e%40alvherre.pgsql

Regards,
Vignesh


Hi,

You will find a patch for the proposed changes attached to this mail.

The menu is now:

25.1. SQL Dump
25.1.1. Restoring the Dump
25.1.2. Using pg_dumpall
25.1.3. Handling Large Databases
25.2. Physical Backups Using Continuous Archiving
25.2.1. Built-In Standalone Backups
25.2.2. Setting Up WAL Archiving
25.2.3. Making a Base Backup
25.2.4. Making an Incremental Backup
25.2.5. Making a Base Backup Using the Low Level API
25.2.6. Recovering Using a Continuous Archive Backup
25.2.7. Timelines
25.2.8. Tips and Examples
25.2.9. Caveats
25.3. File System Level Backup

I slightly modified section 25.2.1 and 25.3 as proposed.

--
Benoit Lobréau
Consultant
http://dalibo.com
From 16813b396a45c4061b5c2d21a9091e3fb372567c Mon Sep 17 00:00:00 2001
From: benoit 
Date: Tue, 15 Apr 2025 15:25:08 +0200
Subject: [PATCH] Reorganize the backup section

The standalone backup section of the backup documentation lacks visibility.
The solution described in the file system level backup section, while still
usable, is not the preferred method. This patch attempts to remedy this by
moving things around.
---
 doc/src/sgml/backup.sgml | 278 +++
 1 file changed, 139 insertions(+), 139 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 25b8904baf7..c167fb5b6b6 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -354,124 +354,8 @@ pg_dump -j num -F d -f 
  
 
- 
-  File System Level Backup
-
-  
-   An alternative backup strategy is to directly copy the files that
-   PostgreSQL uses to store the data in the database;
-explains where these files
-   are located.  You can use whatever method you prefer
-   for doing file system backups; for example:
-
-
-tar -cf backup.tar /usr/local/pgsql/data
-
-  
-
-  
-   There are two restrictions, however, which make this method
-   impractical, or at least inferior to the pg_dump
-   method:
-
-   
-
- 
-  The database server must be shut down in order to
-  get a usable backup. Half-way measures such as disallowing all
-  connections will not work
-  (in part because tar and similar tools do not take
-  an atomic snapshot of the state of the file system,
-  but also because of internal buffering within the server).
-  Information about stopping the server can be found in
-  .  Needless to say, you
-  also need to shut down the server before restoring the data.
- 
-
-
-
- 
-  If you have dug into the details of the file system layout of the
-  database, you might be tempted to try to back up or restore only certain
-  individual tables or databases from their respective files or
-  directories. This will not work because the
-  information contained in these files is not usable without
-  the commit log files,
-  pg_xact/*, which contain the commit status of
-  all transactions. A table file is only usable with this
-  information. Of course it is also impossible to restore only a
-  table and the associated pg_xact data
-  because that would render all other tables in the database
-  cluster useless.  So file system backups only work for complete
-  backup and restoration of an entire database cluster.
- 
-
-   
-  
-
-  
-   An alternative file-system backup approach is to make a
-   consistent snapshot of the data directory, if the
-   file system supports that functionality (and you are willing to
-   trust that it is implemented correctly).  The typical procedure is
-   to make a frozen snapshot of the volume containing the
-   database, then copy the whole data directory (not just parts, see
-   above) from the snapshot to a backup device, then release the frozen
-   snapshot.  This will work even while the database server is running.
-   However, a backup created in this way saves
-   the database files in a state as if the database server was not
-   properly shut down; therefore, when you start the database server
-   on the backed-up data, it will think the previous server instance
-   crashed and will replay the WAL log.  This is not a problem; just
-   be aware of it (and be sure to include the WAL files in your backup).
-   You can perform a CHECKPOINT before taking the
-   snapshot to reduce recovery time.
-  
-
-  
-   If your database is spread across multiple file systems, there might not
-   be any way to obtain exactly-simultaneous frozen snapshots of all
-   the volumes.  For example, if your data files and WAL log are on different
-   disks, or if tablespaces are on different file systems, i

Re: [Proposal] Add \dAt [AMPTRN [TBLPTRN]] to list tables by Table Access Method in psql

2025-04-15 Thread Tom Lane
Greg Sabino Mullane  writes:
> On Tue, Apr 15, 2025 at 4:57 AM Srinath Reddy  wrote:
>> - There's currently no native `\d`-style way to explore which tables are
>> using a given TAM.

> Perhaps there could be a more generic table-filtering mechanism for \d, but
> carving something out for such a niche case seems unwarranted, IMO.

I don't have a strong opinion on whether this functionality is worth
having in a psql meta-command.  But I don't like the proposed syntax
one bit.  In my mind the \dA group of meta-commands are supposed to
provide information on the *properties* of access methods.  Not on
what uses them.  It could be reasonable to have a \dAt command that
shows information about a table access method (although not much is
exposed at SQL level today, so there's not a lot for it to do).
But, for example, \dAf does not run around and find all indexes
using that operator family.

I like your thought that maybe this functionality could be cast
as some sort of filter in the \dt command group (with a syntax that
would allow for other sorts of filters too).  I don't have concrete
ideas about how to write that though.

regards, tom lane




Re: Wrong security context for deferred triggers?

2025-04-15 Thread Noah Misch
On Thu, Jan 23, 2025 at 07:28:19PM +0100, Laurenz Albe wrote:
> On Thu, 2025-01-23 at 12:30 -0500, Tom Lane wrote:
> > Pushed with some cosmetic adjustments
> 
> Thank you!

commit 01463e1 wrote:
> +NOTICE:  I am regress_groot

Let's not incur trivially-avoidable trademark risks
(https://google.com/search?q=%22i+am+groot%22) in the source tree.

> --- a/doc/src/sgml/trigger.sgml
> +++ b/doc/src/sgml/trigger.sgml
> @@ -129,6 +129,10 @@
>  In all cases, a trigger is executed as part of the same transaction as
>  the statement that triggered it, so if either the statement or the
>  trigger causes an error, the effects of both will be rolled back.
> +Also, the trigger will always run in the security context of the role
> +that executed the statement that caused the trigger to fire, unless
> +the trigger function is defined as SECURITY DEFINER,
> +in which case it will run as the function owner.

Phrase "the role that executed the statement" doesn't match what happens if
the role changes mid-statement.  Example of a statement that does so:

  select set_config('role', rolname, true), current_user from pg_authid;

The term "security context" doesn't otherwise appear in doc/.  I would just
change "run in the security context of the role" to "run as the role".  That's
simpler and less likely to create an impression that this stops attacks.




RE: Conflict detection for update_deleted in logical replication

2025-04-15 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! I can finally come back to the thread.

Regarding max_retain_conflict_duration, I prefer the GUC approach because it
already has a mechanism for converting units; a subscription option does not.
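
For illustration, the GUC machinery's unit handling with a stock parameter (a
minimal sketch, nothing patch-specific):

    -- GUC values accept unit suffixes and are converted automatically:
    SET statement_timeout = '2min';
    SHOW statement_timeout;   -- reports "2min"; stored internally in milliseconds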

My comments are below:

01. check_new_cluster_subscription_configuration

```
@@ -2024,6 +2025,7 @@ check_new_cluster_subscription_configuration(void)
PGresult   *res;
PGconn *conn;
int max_active_replication_origins;
+   int max_replication_slots;
```

I feel max_replication_slots is needed when 
old_cluster.sub_retain_conflict_info is true.

02. check_old_cluster_for_valid_slots
```
+   /*
+* The name "pg_conflict_detection" (defined as
+* CONFLICT_DETECTION_SLOT) has been reserved for 
logical
+* replication conflict detection since PG18.
+*/
+   if (GET_MAJOR_VERSION(new_cluster.major_version) >= 
1800 &&
+   strcmp(slot->slotname, "pg_conflict_detection") 
== 0)
```

IIUC, we can assume that the version of new_cluster is the same as
pg_upgrade's, so there is no need to check the major version here.

03.

Can we add a test for upgrading a subscriber node with retain_conflict_info
in 004_subscription?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Fundamental scheduling bug in parallel restore of partitioned tables

2025-04-15 Thread Dimitrios Apostolou

On Mon, 14 Apr 2025, Tom Lane wrote:


I wrote:

Here's a draft patch for this.  It seems to fix the problem in
light testing.


I realized that the "repro" I had for this isn't testing the same
thing that Dimitrios is seeing; what it is exposing looks more like
a bug or at least a behavioral change due to the v18 work to record
not-null constraints in pg_constraint [1].  So my patch may fix his
problem or it may not.  It would be good to have a reproducer that
fails (not necessarily every time) in v17 or earlier.


Thank you for your work on it.

I only got the "ERROR: deadlock detected" message once, with pg_restore 
compiled from master branch. My dump is too large to test it many times on 
v17 so I can't tell if it occurs there.


In general I believe that dependency resolution is not optimal, whether or
not there is a deadlock bug. It can definitely be improved, as work (mostly
post-data) is not parallelized as much as it could be.


Anyway if I get the deadlock on v17 I'll update the initial thread.


Thanks,
Dimitris





Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Christoph Berg
Re: Peter Eisentraut
> But worse, what you are hiding is the information what dependencies
> you are pulling in, which is the actual reason for the options.  (If there
> was no external dependency, there would be no option at all.)
> 
> This seems unnecessarily complicated and inconsistent.  Once you have made
> the choice of taking the libcurl dependency, why not build everything that
> requires it?

I agree with this reasoning and retract my suggestion to rename the option.

Thanks for the explanation,
Christoph




Re: bug in stored generated column over domain with constraints.

2025-04-15 Thread Tom Lane
jian he  writes:
> Thanks for simplifying the tests, overall all looks good.

OK, pushed and back-patched, but only to v14.  I tried to make
a variant that'd work in v13, but it caused ExecCheckPlanOutput
to throw "Query provides a value for a generated column" errors
for UPDATEs.  That's surely related to the major refactoring of
UPDATE targetlists that we did in v14.  Given that we've had
zero field complaints about this problem so far, I think the
risk of breaking something in v13 exceeds the value of fixing
the bug, so I left v13 alone.

regards, tom lane




Re: Logical Replication of sequences

2025-04-15 Thread Peter Smith
Hi Vignesh,

Some review comments for v20250525-0004.

==
Commit message

1.
A new sequencesync worker is launched as needed to synchronize sequences.
It does the following:
a) Retrieves remote values of sequences with pg_sequence_state() INIT.
b) Log a warning if the sequence parameters differ between the
publisher and subscriber.
c) Sets the local sequence values accordingly.
d) Updates the local sequence state to READY.
e) Repeat until all done; Commits synchronized sequences in batches of 100

~

/Log a warning/Logs a warning/
/Repeat until all done/Repeats until all done/

~~~

2.
1) CREATE SUBSCRIPTION
- (PG17 command syntax is unchanged)
- The subscriber retrieves sequences associated with publications.
- Published sequences are added to pg_subscription_rel with INIT state.
- Initiates the sequencesync worker (see above) to synchronize all
  sequences.

~

2a.
Since PG18 is frozen now I think you can say "PG18 command syntax is unchanged"
(replace same elsewhere in this commit message)

~

2b.
/Initiates/Initiate/
(replace same elsewhere in this commit message)

==
src/backend/catalog/pg_publication.c

pg_get_publication_sequences:

3.
+Datum
+pg_get_publication_sequences(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ char*pubname = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ Publication *publication;
+ List*sequences = NIL;
+
+ /* stuff done only on the first call of the function */
+ if (SRF_IS_FIRSTCALL())
+ {

The 'pubname' and 'publication' variables can be declared later,
within the SRF_IS_FIRSTCALL block.

==
src/backend/commands/subscriptioncmds.c

CreateSubscription:

4.
+ /*
+ * XXX: If the subscription is for a sequence-only publication, creating
+ * this origin is unnecessary. It can be created later during the ALTER
+ * SUBSCRIPTION ... REFRESH command, if the publication is updated to
+ * include tables or tables in schemas.
+ */

Since it already says "to include tables", I didn't think you needed
to say "tables in schemas".

~~~

5.
+ *
+ * XXX: If the subscription is for a sequence-only publication,
+ * creating this slot is unnecessary. It can be created later
+ * during the ALTER SUBSCRIPTION ... REFRESH PUBLICATION or ALTER
+ * SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES command, if the
+ * publication is updated to include tables or tables in schema.
  */

(same comment as above #4).

I thought maybe it is redundant to say "or tables in schema".

~~~

AlterSubscription_refresh:

6.
+#ifdef USE_ASSERT_CHECKING
+ if (resync_all_sequences)
+ Assert(copy_data && !refresh_tables && refresh_sequences);
+#endif
+

Maybe this can have a comment like /* Sanity checks for parameter values */

~~~

7.
+ sub_remove_rels[remove_rel_len].relid = relid;
+ sub_remove_rels[remove_rel_len++].state = state;

  /*
- * For READY state, we would have already dropped the
- * tablesync origin.
+ * A single sequencesync worker synchronizes all sequences, so
+ * only stop workers when relation kind is not sequence.
  */
- if (state != SUBREL_STATE_READY)
+ if (relkind != RELKIND_SEQUENCE)

Should those assignments...:
sub_remove_rels[remove_rel_len].relid = relid;
sub_remove_rels[remove_rel_len++].state = state;

...be done only inside the "if (relkind != RELKIND_SEQUENCE)". It
seems like they'll be skipped anyway in subsequent code -- see "if
(get_rel_relkind(sub_remove_rels[off].relid) == RELKIND_SEQUENCE)".
Perhaps if these assignments are moved, then the subsequent skipping
code is also not needed anymore?

==
src/backend/replication/logical/launcher.c

logicalrep_worker_launch:

8.
+ case WORKERTYPE_SEQUENCESYNC:
+ snprintf(bgw.bgw_function_name, BGW_MAXLEN, "SequenceSyncWorkerMain");
+ snprintf(bgw.bgw_name, BGW_MAXLEN,
+ "logical replication sequencesync worker for subscription %u",
+ subid);
+ snprintf(bgw.bgw_type, BGW_MAXLEN, "logical replication sequencesync worker");
+ break;
+

Previously all these cases were in alphabetical order. Maybe you can
move this case to keep it that way.

~~~

pg_stat_get_subscription:

9.
  case WORKERTYPE_TABLESYNC:
  values[9] = CStringGetTextDatum("table synchronization");
  break;
+ case WORKERTYPE_SEQUENCESYNC:
+ values[9] = CStringGetTextDatum("sequence synchronization");
+ break;

Previously all these cases were in alphabetical order. Maybe you can
move this case to keep it that way.

==
.../replication/logical/sequencesync.c

ProcessSyncingSequencesForApply:

10.
+ * To prevent starting the sequencesync worker at a high frequency after a
+ * failure, we store its last failure time. We start the sequencesync worker
+ * again after waiting at least wal_retrieve_retry_interval.

I felt this comment might be better inside the function where it is
doing the TimestampDifferenceExceeds check.

~~~

11.
+ if (!started_tx)
+ {
+ StartTransactionCommand();
+ started_tx = true;
+ }
+
+ Assert(get_rel_relkind(rstate->relid) == RELKIND_SEQUENCE);

Maybe the Assert should come first, before the tx stuff

Re: BitmapHeapScan streaming read user and prelim refactoring

2025-04-15 Thread James Hunter
Thanks for the comments!

On Tue, Apr 15, 2025 at 3:11 AM Andres Freund  wrote:
>
> Hi,
>
> On 2025-04-14 09:58:19 -0700, James Hunter wrote:

> > I see two orthogonal problems, in processing Bitmap Heap pages in
> > parallel: (1) we need to prefetch enough pages, far enough in advance,
> > to hide read latency; (2) later, every parallel worker needs to be
> > given a set of pages to process, in a way that minimizes contention.
> >
> > The easiest way to hand out work to parallel workers (and often the
> > best) is to maintain a single, shared, global work queue. Just put
> > whatever pages you prefetch into a FIFO queue, and let each worker
> > pull one piece of "work" off that queue. In this was, there's no
> > "ramp-down" problem.
>
> If you just issue prefetch requests separately you'll get no read combining -
> and it turns out that that is a really rather significant loss, both on the
> storage layer and just due to the syscall overhead.  So you do need to perform
> batching when issuing IO. Which in turn requires a bit of rampup logic etc.

Right, so if you need to do batching anyway, contention on a shared
queue will be minimal, because it's amortized over the batch size.

I agree about ramp *up* logic, I just don't see the need for ramp *down* logic.

> > This is why a single shared queue is so nice, because it avoids
> > workers being idle. But I am confused by your proposal, which seems to
> > be trying to get the behavior of a single shared queue, but
> > implemented with the added complexity of multiple queues.
> >
> > Why not just use a single queue?
>
> Accessing buffers in a maximally interleaved way, which is what a single queue
> would give you, adds a good bit of overhead when you have a lot of memory,
> because e.g. TLB hit rate is minimized.

Well that's trade-off, right? As you point out, you need to do
batching when issuing reads, to allow for read combining. The larger
your batch, the more reads you can combine -- the more efficient your
I/O, etc. But the larger your batch, the less locality you get in
memory.

You always have to choose a batch size large enough to hide I/O
latency, plus allow, I guess, for read combining. I suspect that will
blow out your TLB more than letting 8 parallel workers share the same
queue.

Not to mention the complexity (as Thomas has described very nicely, in
this thread) of trying to partition+affinitize async read requests to
individual parallel workers. (Consider "ramp-down" for a moment: the
"problem" here is just that one parallel worker issued a batch of
async reads, near the end of the query; and since the worker is
affinitized to the async read, all other workers pack up and go home,
leaving a single worker to process this last batch. If, instead, we
just used a single queue, then there would be no need for "ramp-down"
logic, because async reads would go into a single queue/pool, and not
be affinitized to a single, "unlucky" worker.)

> > It has never been clear to me why prefetching the exact blocks you'll
> > later consume is seen as a *benefit*, rather than a *cost*. I'm not
> > aware of any prefetch interface, other than PG's "ReadStream," that
> > insists on this. But that's a separate discussion...
>
> ...
>
> As I said above, that's not to say that we'll only ever want to do readahead
> via a the read stream interface.

Well that's my point: since, I believe, we'll ultimately want a
"heuristic" prefetch, which will be incompatible with the new read
stream interface... we'll end up writing and supporting two different
prefetch interfaces.

It has never been clear to me that the advantages of having this
second, read-stream, prefetch interface outweigh the costs of having
to write and maintain two separate interfaces, to do pretty much the
same thing. If we *didn't* need the "heuristic" interface, then I
could be convinced that the "read-stream" interface was a good choice.
But since we'll (eventually) need the "heuristic" interface, anyway,
it's not clear to me that the benefits outweigh the costs of
implementing this "read-stream" interface, as well.

Thanks,
James




Re: AIO v2.5

2025-04-15 Thread Alexander Lakhin

Hello Andres,

14.04.2025 19:06, Andres Freund wrote:

Unfortunately I'm several hundred iterations in, without reproducing the
issue. I'm bad at statistics, but I think that makes it rather unlikely that I
will, without changing some aspect.

Was this an assert enabled build? What compiler and what optimization settings
did you use? Do you have huge pages configured (so that the default
huge_pages=try would end up with huge pages)?


Yes, I used --enable-cassert; no explicit optimization setting and no huge
pages configured. pg_config says:
CONFIGURE =  '--enable-debug' '--enable-cassert' '--enable-tap-tests' 
'--with-liburing'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -O2


Please look at the complete script attached. I've just run it and got:
iteration 56 (jobs: 44)
Tue Apr 15 06:30:52 PM CEST 2025
dropdb: error: database removal failed: ERROR:  could not read blocks 0..0 in file 
"global/1213": Operation canceled
2025-04-15 18:31:00.650 CEST [1612266] LOG:  could not read blocks 0..0 in file 
"global/1213": Operation canceled
2025-04-15 18:31:00.650 CEST [1612266] CONTEXT:  completing I/O on behalf of 
process 1612271
2025-04-15 18:31:00.650 CEST [1612266] STATEMENT:  DROP DATABASE db3;

I used gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, but now I've also
reproduced the issue with CC=clang (18.1.3 (1ubuntu1)).

Please take a look also at the simple reproducer for the crash inside
pg_get_aios() I mentioned upthread:
for i in {1..100}; do
  numjobs=12
  echo "iteration $i"
  date
  for ((j=1;j<=numjobs;j++)); do
    ( createdb db$j
      for k in {1..300}; do
        echo "CREATE TABLE t (a INT); CREATE INDEX ON t (a); VACUUM t;
              SELECT COUNT(*) >= 0 AS ok FROM pg_aios;" \
          | psql -d db$j >/dev/null 2>&1
      done
      dropdb db$j ) &
  done
  wait
  psql -c 'SELECT 1' || break
done

it fails for me as follows:
iteration 20
Tue Apr 15 07:21:29 PM EEST 2025
dropdb: error: connection to server on socket "/tmp/.s.PGSQL.55432" failed: No 
such file or directory
   Is the server running locally and accepting connections on that socket?
...
2025-04-15 19:21:30.675 EEST [3111699] LOG:  client backend (PID 3320979) was 
terminated by signal 11: Segmentation fault
2025-04-15 19:21:30.675 EEST [3111699] DETAIL:  Failed process was running: SELECT 
COUNT(*) >= 0 AS ok FROM pg_aios;
2025-04-15 19:21:30.675 EEST [3111699] LOG:  terminating any other active 
server processes


I reproduced this error on three different machines (all are running
Ubuntu 24.04, two with kernel version 6.8, one with 6.11), with PGDATA
located on tmpfs.

That's another variable to try - so far I've been trying this on 6.15.0-rc1
[1].  I guess I'll have to set up a ubuntu 24.04 VM and try with that.

Greetings,

Andres Freund


[1] I wanted to play with io_uring changes that were recently merged. Namely
support for readv/writev of "fixed" buffers. That avoids needing to pin/unpin
buffers while IO is ongoing, which turns out to be a noticeable bottleneck in
some workloads, particularly when using 1GB huge pages.


Best regards,
Alexander Lakhin
Neon (https://neon.tech)

repro.tar.gz
Description: application/gzip


Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-15 Thread Jacob Champion
On Tue, Apr 15, 2025 at 8:34 AM Christoph Berg  wrote:
> I agree with this reasoning and retract my suggestion to rename the option.

(Thank you for chiming in; having the packager feedback has been
extremely helpful.)

While I have you, may I ask whether you're okay (from the packager
perspective) with the current division of dynamic and static
behaviors?

Dynamic: --with-libcurl builds a runtime-loadable module, and if you
don't install it, OAuth isn't supported (i.e. it's optional)
Static: --with-libcurl builds an additional linkable staticlib, which
you must link into your application (i.e. not optional)

I want to make absolutely sure the existing packager requests are not
conflicting. :D

Thanks,
--Jacob




Re: New committer: Jacob Champion

2025-04-15 Thread Umar Hayat
On Sat, 12 Apr 2025 at 05:26, Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Jacob
> Champion, who has accepted an invitation to become our newest PostgreSQL
> committer.
>
> Please join us in wishing Jacob much success and few reverts!
>
> Thanks,
>
> Jonathan

Congratulations Jacob!

-- 
Umar Hayat
SKAI Worldwide (https://skaiworldwide.com/ )


Re: pg_combinebackup: correct code comment.

2025-04-15 Thread Daniel Gustafsson
> On 15 Apr 2025, at 06:22, Amul Sul  wrote:

> Attached is a patch that corrects the code comment for
> process_directory_recursively() in pg_combinebackup, where the comment
> incorrectly refers to "output_directory" instead of the intended
> "input_directory".

Agreed.  The second paragraph in the comment also seems a bit odd; how about
the below while we are there fixing things?

- * If processing is a user-defined tablespace, the tsoid should be the OID
+ * If processing a user-defined tablespace, the tsoid should be the OID

--
Daniel Gustafsson





Re: pg_dump --if-exists --clean when drop index that is partition of a partitioned index

2025-04-15 Thread Pavel Stehule
On Tue, Apr 15, 2025 at 8:52 AM, jian he
wrote:

> On Tue, Apr 15, 2025 at 1:45 PM Pavel Stehule 
> wrote:
> >>
> >> seems pretty easy to fix.
> >> we only need dropStmt when IndxInfo->parentidx oid is invalid.
> >>
> >> +if (!OidIsValid(indxinfo->parentidx))
> >> +appendPQExpBuffer(delq, "DROP INDEX %s;\n", qqindxname);
> >
> >
> > I don't think it is the correct fix.
> >
> > because then fails CREATE INDEX qqindxname
> >
> > The motivation for using the --clean and --if-exists options is the possibility
> > of calling restore idempotently. It should not fail when I restore into an
> > empty database, and it shouldn't fail if I restore into an existing
> > database.
> >
>
>
> I am not sure what you mean by ``CREATE INDEX qqindxname``
> failing?
>
> for example:
>
> >> >> CREATE TABLE tp(c int, a int, b int) PARTITION BY RANGE (b);
> >> >> CREATE TABLE tp_1(c int, a int, b int);
> >> >> ALTER TABLE tp ATTACH PARTITION tp_1 FOR VALUES FROM (0) TO (1);
> >> >> CREATE INDEX t_a_idx ON tp_1(a);
> >> >> CREATE INDEX tp_a_idx ON tp(a);
>
> with the patch, pg_dump output will be
> ```
> DROP INDEX IF EXISTS public.tp_a_idx;
> DROP TABLE IF EXISTS public.tp_11;
> DROP TABLE IF EXISTS public.tp;
> 
> CREATE INDEX tp_a_idx ON ONLY public.tp USING btree (a);
> CREATE INDEX tp_11_a_idx ON public.tp_11 USING btree (a);
> ALTER INDEX public.tp_a_idx ATTACH PARTITION public.tp_11_a_idx;
> ```
>
> What's your expectation?
>

I am not sure at this moment.

There is a question of why pg_dump exports DROP INDEX at all, although the
index will be dropped implicitly when the table is dropped.

So your fix can work, because all indexes will be dropped by DROP TABLE, but
on the other hand it is not consistent with the current behavior of pg_dump,
where indexes are dropped explicitly.

I don't know why pg_dump exports DROP INDEX explicitly, although it is
redundant.
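
For reference, a minimal illustration of the implicit drop in question (names
are made up):

    CREATE TABLE t (a int);
    CREATE INDEX t_a_idx ON t (a);

    DROP TABLE t;                    -- implicitly removes t_a_idx as well
    DROP INDEX IF EXISTS t_a_idx;    -- a no-op now; without IF EXISTS it would error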

Regards

Pavel


Re: minor error message enhancement in refuseDupeIndexAttach

2025-04-15 Thread Daniel Gustafsson
> On 15 Apr 2025, at 10:52, jian he  wrote:

> in refuseDupeIndexAttach, we can change from
> 
> errdetail("Another index is already attached for partition \"%s\"."...)
> 
> to
> 
> errdetail("Another index \"%s\" is already attached for partition \"%s\".")
> 
> So we can easily understand which index is already attached for
> partition \"%s\".

That seems like it could be handy.  As this relates to chaning old and existing
behavior, and we are in feature freeze, I suggest registering this in the
commitfest app.

--
Daniel Gustafsson





Re: use correct variable in error message in _allocAH function (pg_backup_archiver.c)

2025-04-15 Thread Fujii Masao




On 2025/04/15 7:21, Mahendra Singh Thalor wrote:

Hi hackers,

In the _allocAH function (pg_backup_archiver.c), we were using the *fmt*
variable in the switch case for the *default case*, but the correct variable
is AH->format.

Here, I am attaching a patch for the same.


Thanks for the report and patch! It looks good to me.
Since this issue exists in the back branches,
the patch needs to be back-patched to all supported versions.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





[Proposal] Add \dAt [AMPTRN [TBLPTRN]] to list tables by Table Access Method in psql

2025-04-15 Thread Srinath Reddy
Hi all,

I'd like to propose adding a new `psql` meta-command:

\dAt [AMPTRN [TBLPTRN]]

This would list all user tables that are using a specific Table Access
Method (TAM), optionally filtered by table name pattern.

Why?
- Table Access Methods are increasingly used.
- There's currently no native `\d`-style way to explore which tables are
using a given TAM.
- Users often need to check or debug TAM-based extensions and it'd be
useful to have a shortcut.

Example usage:

\dAt sometam        -- list all tables using the sometam TAM
\dAt sometam auto%  -- filter to tables starting with 'auto'

Would love to hear if this sounds like a useful addition — if so, I’d be
happy to work on a patch. Also, please let me know if something similar has
been discussed before.

Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/


Re: Built-in Raft replication

2025-04-15 Thread Yura Sokolov
14.04.2025 20:44, Kirill Reshke пишет:
> OTOH Raft needs to write its own log, and what's worse, it sometimes
> needs to remove already written parts of it (so, it is not appended
> only, unlike WAL). If you have a production system which maintains two
> kinds of logs with different semantics, it is a very hard system to
> maintain..

Raft is a log replication protocol which uses a log position and a term.
But... PostgreSQL already has a log position and a term in its WAL structure:
PostgreSQL's timeline is actually the Term.
A Raft implementer just needs to correct the rules for Term/Timeline switching:
- instead of "the next TimeLine number is just an increment of the largest
known TimeLine number", it needs to be "the next TimeLine number is the
result of Leader Election".

And yes, "it sometimes needs to remove already written parts of it".
But... that is exactly what every PostgreSQL cluster manager has to do to
rejoin the previous leader as a follower of the new leader - pg_rewind.

So, PostgreSQL already has 70-90% of the Raft implementation details.
Raft doesn't have to be implemented in PostgreSQL.
Raft has to be finished!!!

PS: One of the biggest issues is the forced snapshot on replica promotion. It
really slows down leader switch time. It looks like it is not really
needed, or some small workaround should be enough.

-- 
regards
Yura Sokolov aka funny-falcon




Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-15 Thread Richard Guo
On Sun, Apr 6, 2025 at 1:46 AM Tom Lane  wrote:
> The parser, rewriter, and planner all have a bad habit of scribbling
> on their input parsetrees.  At this point I've lost count of how many
> bugs that's caused (but e33f2335a and 0f43083d1 are the two latest
> examples), and of how many places there are that are brute-forcing a
> solution to the problem by copying a whole parsetree before letting
> one of these subsystems see it.
>
> Years ago we fixed the executor to treat its input Plan trees as
> read-only.  It seems like we really ought to do the same for these
> upstream subsystems.  Surely, whatever benefit we get from changing
> Node contents in-place is lost due to all these other hacks.

While I'm in favor of this idea, it seems that it will require lots of
code changes.  In particular, within the planner, the parsetree goes
through considerable transformations during the preprocessing phase,
such as subquery pull-up, constant folding, and so on.  Would this
proposal mean we'd need to refactor all those places to treat the
input parsetrees as read-only?  Just trying to understand the scope of
what would be involved.

Thanks
Richard




Re: pg_combinebackup: correct code comment.

2025-04-15 Thread Amul Sul
On Tue, Apr 15, 2025 at 1:16 PM Daniel Gustafsson  wrote:
>
> > On 15 Apr 2025, at 06:22, Amul Sul  wrote:
>
> > Attached is a patch that corrects the code comment for
> > process_directory_recursively() in pg_combinebackup, where the comment
> > incorrectly refers to "output_directory" instead of the intended
> > "input_directory".
>
> Agreed.  The second paragraph in the comment also seems a bit odd; how about
> the below while we are there fixing things?
>
> - * If processing is a user-defined tablespace, the tsoid should be the OID
> + * If processing a user-defined tablespace, the tsoid should be the OID
>

+1

Regards,
Amul




Re: support create index on virtual generated column.

2025-04-15 Thread jian he
On Mon, Apr 14, 2025 at 8:05 PM Kirill Reshke  wrote:
>
> On Mon, 14 Apr 2025 at 16:10, jian he  wrote:
> >
> > new patch attached. Now,
> > ALTER TABLE DROP COLUMN works fine.
> > ALTER INDEX ATTACH PARTITION works fine.
> > creating such an index on a partitioned table works just fine.
> > for table inheritance: create index on parent table will not cascade
> > to child table,
> > so we don't need to worry about this.
>
> Hi! I reviewed v2, and it seems to be working now.
>
> But there are tests that are comment-out, what is their purpose? I
> note that commit 83ea6c5 also included some commented tests, so
> perhaps there's a reason I'm not aware of.
>

> ```
> ALTER TABLE gtest22c DROP COLUMN e;
> \d gtest22c
>
> -- EXPLAIN (COSTS OFF) SELECT * FROM gtest22c WHERE b * 3 = 6;
> -- SELECT * FROM gtest22c WHERE b * 3 = 6;
> -- EXPLAIN (COSTS OFF) SELECT * FROM gtest22c WHERE a = 1 AND b > 0;
> -- SELECT * FROM gtest22c WHERE a = 1 AND b > 0;
> ```

The commented-out tests are for a feature yet to be implemented.
Some test changes were indeed unnecessary; I restored them.
Please check the attached patch.
From c3f39858798c27de7afdf5156c4174217920c676 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 15 Apr 2025 16:28:46 +0800
Subject: [PATCH v3 1/1] index on virtual generated column

* Internally, such an index is transformed into an expression index.
  For example, an index on (b int GENERATED ALWAYS AS (a * 2) VIRTUAL)
  has the internal representation of an expression index on ((a * 2)).
* In the pageinspect module, add some tests to check the index content of a
  virtual generated column.
* Primary keys and unique indexes over virtual generated columns are not
  supported.
* Expression indexes and predicate indexes over virtual generated columns are
  currently not supported.
* A virtual generated column cannot be an "include column".
* All index types are supported; hash index and gist index regression tests
  were added.
* To support ALTER TABLE SET EXPRESSION, in pg_index we need to track the
  original virtual generated column attribute number, so ALTER TABLE SET
  EXPRESSION can identify which index needs to be rebuilt.
* ALTER COLUMN SET DATA TYPE will also cause a table rewrite, so we really
  need to track the virtual generated column attribute number that the index
  was built on.

* If the partitioned table and a partition both have an index over the virtual
  generated column, then both indexes must be over the same expression. For
  example, the last command below should error out.

CREATE TABLE parted (b integer,c integer,a integer GENERATED ALWAYS AS (c+1)) PARTITION BY RANGE (b);
CREATE TABLE part (b integer,c integer,a integer GENERATED ALWAYS AS (c));
create index on part(a);
create index on parted(a);
alter table parted ATTACH partition part for values from (1) to (10);

discussion: https://postgr.es/m/CACJufxGao-cypdNhifHAdt8jHfK6-HX=trbovbkgruxw063...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5667/
---
 contrib/pageinspect/expected/btree.out|  33 ++
 contrib/pageinspect/sql/btree.sql |  21 ++
 doc/src/sgml/catalogs.sgml|  15 +
 src/backend/catalog/index.c   |  69 
 src/backend/commands/indexcmds.c  | 301 +++---
 src/backend/commands/tablecmds.c  |  61 
 src/backend/parser/parse_utilcmd.c|  25 +-
 src/backend/utils/adt/ruleutils.c |  28 +-
 src/include/catalog/index.h   |   6 +
 src/include/catalog/pg_index.h|   1 +
 src/include/nodes/execnodes.h |   1 +
 src/test/regress/expected/fast_default.out|   8 +
 .../regress/expected/generated_virtual.out| 187 ++-
 src/test/regress/sql/fast_default.sql |   6 +
 src/test/regress/sql/generated_virtual.sql|  94 +-
 15 files changed, 788 insertions(+), 68 deletions(-)

diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 0aa5d73322f..56d57848cf7 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -183,6 +183,39 @@ tids   |
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
+---test index over virtual generated column
+CREATE TABLE test3 (a int8, b int4range, c int8 generated always as (a+1) virtual);
+INSERT INTO test3 VALUES (72057594037927936, '[0,1)');
+CREATE INDEX test3_a_idx ON test3 USING btree (c);
+SELECT * FROM bt_page_items('test3_a_idx', 1);
+-[ RECORD 1 ]---
+itemoffset | 1
+ctid   | (0,1)
+itemlen| 16
+nulls  | f
+vars   | f
+data   | 01 00 00 00 00 00 00 01
+dead   | f
+htid   | (0,1)
+tids   | 
+
+--expect zero row.
+SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 1))
+EXCEPT ALL
+SELECT * FROM bt_page_items(get_raw_page('test3_a_idx', 1));
+(0 rows)
+
+CREATE TABLE test4(a int, b int GENERATED ALWAYS AS (a + 1),

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-04-15 Thread Ilia Evdokimov


On 15.04.2025 03:23, David Rowley wrote:

I'm just looking for ways to allow us to group all three of these
together for the TEXT format type. I see the BUFFERS are displayed
quite compactly, e.g. "Buffers: shared hit=17 read=4".
show_wal_usage() does something similar and so does
show_modifytable_info() for MERGE.

Maybe we could compress the text output a bit with:

"Estimates: capacity=N distinct keys=N hit ratio=N.N%"

If we get it that compact, maybe lookups could fit in there too, i.e.:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

I'd also like to vote that you modify explain.c and multiply the
hit_ratio by 100 and use "%" as the unit in ExplainPropertyFloat().
Just looking at the raw number of "1.00" in the expected output, it
isn't obvious if the planner expects every lookup to be a cache hit or
just 1% of them.



+1. I'll attach three patches: one with the 'capacity=N distinct keys=N'
output, one adding 'lookups=N', and one adding 'hit ratio=N.N%'.
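
For illustration, the compact TEXT output being discussed could look roughly
like this (a hypothetical plan fragment; the numbers are invented):

    ->  Memoize
          Cache Key: t1.x
          Estimates: capacity=1000 distinct keys=1000 lookups=10000 hit ratio=90.00%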





Also, to get something commitable, you'll also need to modify
explain_memoize() at the top of memoize.sql and add handling to mask
out the value of these new properties. These are not going to be
anywhere near stable enough across platforms to have these shown in
the expected output files.

David



There are other regression tests - such as create_index, subselect, 
join, and partition_join - that also use Memoize nodes. These tests 
currently don't have any masking in place, so the new output fields may 
cause instability across platforms.


Showing the line only under COSTS or VERBOSE would help with test stability,
but doing that merely to simplify test output doesn't look like the right
reason. Maybe it makes more sense to mask these values in the other tests
that use Memoize too. What do you think?


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


Re: [Proposal] Add \dAt [AMPTRN [TBLPTRN]] to list tables by Table Access Method in psql

2025-04-15 Thread Mahendra Singh Thalor
On Tue, 15 Apr 2025 at 14:27, Srinath Reddy  wrote:
>
> Hi all,
>
> I'd like to propose adding a new `psql` meta-command:
>
> \dAt [AMPTRN [TBLPTRN]]
>
> This would list all user tables that are using a specific Table Access Method 
> (TAM), optionally filtered by table name pattern.
>
> Why?
> - Table Access Methods are increasingly used.
> - There's currently no native `\d`-style way to explore which tables are 
> using a given TAM.
> - Users often need to check or debug TAM-based extensions and it'd be useful 
> to have a shortcut.
>
> Example usage:
>
> \dAt sometam-- list all tables using sometam TAM
> \dAt sometam auto%  -- filter to tables starting with 'auto'
>
> Would love to hear if this sounds like a useful addition — if so, I’d be 
> happy to work on a patch. Also, please let me know if something similar has 
> been discussed before.

I think we can get both of these details with a SELECT command. Additionally,
this info should be visible to admins only, because there is no use
case for a regular user to list all the tables using one TAM.

postgres=# select * from pg_am;
 oid  | amname |  amhandler   | amtype
--++--+
2 | heap   | heap_tableam_handler | t
  403 | btree  | bthandler| i
  405 | hash   | hashhandler  | i
  783 | gist   | gisthandler  | i
 2742 | gin| ginhandler   | i
 4000 | spgist | spghandler   | i
 3580 | brin   | brinhandler  | i
(7 rows)

postgres=#
postgres=# select relname from pg_class where relam = (select oid
from pg_am where amname = 'heap');
relname

 test
 pg_statistic
 pg_type
 pg_toast_1255
 pg_toast_1255_index
 pg_toast_1247
 pg_toast_1247_index
 pg_toast_2604
 pg_toast_2604_index
 pg_toast_2606
 pg_toast_2606_index
 pg_toast_2610
 pg_toast_2610_index
 pg_toast_2612
-

In the above command, we can also add a filter to select only the tables
matching a pattern.
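
For example, the underlying query for "\dAt heap auto%" could look roughly
like this (a sketch; psql would substitute the access method and table name
patterns):

SELECT c.relname
FROM pg_class c
     JOIN pg_am a ON a.oid = c.relam
WHERE a.amname = 'heap'
  AND c.relname LIKE 'auto%'
ORDER BY c.relname;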

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-15 Thread Andrei Lepikhov

On 4/5/25 18:46, Tom Lane wrote:

I'm not planning to pursue this idea Right Now, but it seems like
something that could happen for v19 or so.  In the meantime I wanted
to get the ideas down on electrons.

Thoughts?
I generally like the idea because, for now, I need to be sure that no 
one touched the parse tree before copying it to do additional 
transformations before the optimisation phase.
But what is the way you are proposing here? Do you mean that one more 
entity will be explicitly introduced: a transformed parse tree?
It would open an opportunity for extensions to build a set of 
alternative transformed trees, pass them through the optimisation phase 
and choose the best plan.


--
regards, Andrei Lepikhov




Re: Add pg_get_injection_points() for information of injection points

2025-04-15 Thread Michael Paquier
On Mon, Apr 14, 2025 at 05:26:14PM +0500, Kirill Reshke wrote:
> Hi! I noticed you do not bump catalog version here, so i was a little
> confused with
> ```
> reshke=# SELECT name, library, function FROM pg_get_injection_points();
> ERROR:  function pg_get_injection_points() does not exist
> ```
> for a while

Catalog version bumps are done by committers when code is committed.
When sending patches for reviews (this one is for July and v19),
including catversion bumps in the patches sent just leads to useless
code conflicts, so this should not be done when submitting something.

> Also, should we define pg_get_injection_points in the injection_points
> extension maybe?

As this is global for all libraries, that's not something I would add
there.
--
Michael




Re: Adding support for SSLKEYLOGFILE in the frontend

2025-04-15 Thread Jacob Champion
On Mon, Apr 14, 2025 at 1:16 PM Daniel Gustafsson  wrote:
> Just to close the loop, this was done yesterday as 2970c75dd982.

Thanks!

--Jacob




Re: use correct variable in error message in _allocAH function (pg_backup_archiver.c)

2025-04-15 Thread Fujii Masao




On 2025/04/15 17:02, Fujii Masao wrote:



On 2025/04/15 7:21, Mahendra Singh Thalor wrote:

Hi hackers,

In the _allocAH function (pg_backup_archiver.c), we were using the *fmt*
variable in the switch case for the *default case*, but the correct variable
is AH->format.

Here, I am attaching a patch for the same.


Thanks for the report and patch! It looks good to me.
Since this issue exists in the back branches,
the patch needs to be back-patched to all supported versions.


Unless there are any objections, I'll commit the patch with
the following commit message:

--
pg_dump: Fix incorrect archive format shown in error message.

In pg_dump and pg_restore, _allocAH() calls _discoverArchiveFormat() to
determine the archive format when the input format is unknown.
If the input or discovered format is unrecognized, it reports an error
including the archive format number.

If the discovered format is unrecognized, its number should be shown in
the error message. But previously the error message mistakenly showed
the originally requested format number (i.e., unknown one) instead of
the discovered one, due to referencing the wrong variable in the error
message.

This commit corrects the issue by using the appropriate variable in
the error message.

In practice, this has no visible effect for users, since
_discoverArchiveFormat() does not currently return unrecognized formats,
so the error condition is never actually triggered.

Back-patch to all supported versions.

Author: Mahendra Singh Thalor 
Reviewed-by: Fujii Masao 
Discussion: 
https://postgr.es/m/cakytnaqu+n-ab2fq6wznsom_-0n-bmneanynv1+6kfdxjva...@mail.gmail.com
Backpatch-through: 13
--

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: bug in stored generated column over domain with constraints.

2025-04-15 Thread jian he
On Tue, Apr 15, 2025 at 4:10 AM Tom Lane  wrote:
>
> jian he  writes:
> > new patch attached.
>
> I looked this over.  It's kind of astonishing that nobody has reported
> this before, because AFAICT it's been broken since we invented
> generated columns.
>
> > rewriteTargetListIU, expand_insert_targetlist these two places can
> > make a null Const TargetEntry for the generated column in an INSERT
> > operation.
>
> rewriteTargetListIU will *not* do that: it explicitly does nothing
> for an attgenerated column, except apply a bunch of error checks.
> See about line 988 in HEAD.  So it seems sufficient to fix
> expand_insert_targetlist as you've done here.  And then we have to
> make ExecCheckPlanOutput cope, too.
>
> I think that this patch is technically correct, but I don't like
> it much because it pays little attention to keeping
> expand_insert_targetlist and ExecCheckPlanOutput readable, and no
> attention at all to keeping their logic parallel.  I think we can
> make the code cleaner by moving the default case to the end, as
> in the attached.
>

Your ExecCheckPlanOutput change makes sense to me.
Calling getBaseTypeAndTypmod in ExecCheckPlanOutput would be a waste of
cycles, given that we will compute the generated column later.


> The test cases seemed a bit overdone, especially in comparison
> to the adjacent existing tests.  Fine for development maybe,
> but I don't see the value in carrying them forever.  On the other
> hand, there's a comment at the top of the test script:
> -- keep these tests aligned with generated_virtual.sql
> The VIRTUAL case is currently rejecting domains altogether.
> But perhaps that won't be true forever, so I think we ought
> to follow that advice.
>

I submitted a patch for domains over virtual generated columns,
so I didn't add such a test here.

Thanks for simplifying the tests; overall it all looks good.




Re: pgsql: Add function to get memory context stats for processes

2025-04-15 Thread Andres Freund
Hi,

On 2025-04-14 10:03:28 -0400, Robert Haas wrote:
> On Thu, Apr 10, 2025 at 4:05 PM Andres Freund  wrote:
> > I don't know of existing discussion, but it seems rather fundamental to me -
> > if either DSA or memory contexts could be inconsistent at a CFI(), how could
> > it possibly be safe to interrupt at that point? After all, after an error 
> > you
> > need to be able to reset the memory contexts / release memory in a
> > dsa/dshash/whatnot? Memory context reset requires walking over the 
> > allocations
> > made in the context, similar releasing a dsa?
> 
> I think it would be a bit surprising if somebody put a
> CHECK_FOR_INTERRUPTS() inside aset.c or similar, but I don't see a
> reason why we couldn't end up with one reachable via the DSA code. DSA
> calls DSM which depending on dynamic_shared_memory_type might involve
> filesystem operations. That's a fairly large amount of code. I admit I
> have no particular theory about how CFI could be reachable from there
> today, but even if it definitely isn't, I don't see why someone would
> hesitate to add one in the future.

There very well could be a CFI - but it had better be somewhere where the
in-memory state is consistent. Otherwise an error raised inside the CFI
would leave the in-memory state inconsistent, which then would cause problems
when cleaning up the dsa during resowner release or process exit.

What am I missing here?

Greetings,

Andres Freund




Re: Recent pg_rewind test failures in buildfarm

2025-04-15 Thread Andres Freund
Hi,

On 2025-04-14 22:58:28 -0400, Tom Lane wrote:
> In the last day or so, both skink and mamba have hit this
> in the pg_rewind test suite [1][2]:
> 
> #3  0x01f03f7c in ExceptionalCondition 
> (conditionName=conditionName@entry=0x2119c4c "pending_since == 0", 
> fileName=fileName@entry=0x2119454 "pgstat.c", 
> lineNumber=lineNumber@entry=734) at assert.c:66
> #4  0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734
> #5  0x01d7a248 in pgstat_shutdown_hook (code=, arg= out>) at pgstat.c:615
> #6  0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243
> #7  0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198
> #8  0x01d1e9fc in proc_exit (code=0) at ipc.c:111
> #9  0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294
> #10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 
> ) at walsender.c:2790
> #11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960
> #12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 
> "START_REPLICATION 0/300 TIMELINE 2") at walsender.c:2135
> #13 0x01d5bd84 in PostgresMain (dbname=, username= out>) at postgres.c:4767
> 
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?

See also 
https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq

I am fairly certain it's the fault of 039549d70f6

Greetings,

Andres Freund




pending_since assertion failure on skink

2025-04-15 Thread Andres Freund
Hi,

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-11%2007%3A41%3A36&stg=subscription-check
shows a failure on the publisher:

  2025-04-11 11:01:06.175 UTC [2975477][walsender][29/0:0] STATEMENT:  
START_REPLICATION SLOT "pg_16410_sync_16392_7491957737045826723" LOGICAL 
0/17DEB30 (proto_version '4', streaming 'parallel', origin 'any', 
publication_names '"tap_pub_x","tap_pub_allinschema"')
  TRAP: failed Assert("pending_since == 0"), File: 
"../pgsql/src/backend/utils/activity/pgstat.c", Line: 734, PID: 2975477
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(ExceptionalCondition+0x5f) [0x6fb855]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(pgstat_report_stat+0x119) [0x5cad06]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(PostgresMain+0x80f) [0x5b1e84]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(BackendMain+0x59) [0x5aacfe]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(postmaster_child_launch+0x117) [0x4f8e27]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(+0x3f3eec) [0x4fbeec]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(+0x3f5c6b) [0x4fdc6b]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(PostmasterMain+0x1190) [0x4ff26c]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(main+0x1e2) [0x42afe2]
  /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x5b3bca8]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x5b3bd65]
  postgres: publisher: walsender bf postgres [local] 
START_REPLICATION(_start+0x21) [0x1ee761]
  2025-04-11 11:01:06.574 UTC [2971921][postmaster][:0] LOG:  client backend 
(PID 2975477) was terminated by signal 6: Aborted
  2025-04-11 11:01:06.574 UTC [2971921][postmaster][:0] DETAIL:  Failed process 
was running: START_REPLICATION SLOT "pg_16410_sync_16392_7491957737045826723" 
LOGICAL 0/17DEB30 (proto_version '4', streaming 'parallel', origin 'any', 
publication_names '"tap_pub_x","tap_pub_allinschema"')
  2025-04-11 11:01:06.577 UTC [2971921][postmaster][:0] LOG:  terminating any 
other active server processes

My understanding of what this indicates is that we set pending_stats =
true, but then somehow discovered that we don't actually have pending stats.

I suspect that this is related to

commit 039549d70f6aa2daa3714a13752a08fa8ca2fb05
Author: Michael Paquier 
Date:   2025-04-08 07:57:19 +0900

Flush the IO statistics of active WAL senders more frequently

Which afaict could lead to having pending stats as known to
pgstat_report_stat() which then could be flushed "behind
pgstat_report_stat()'s back" due to the new pgstat_flush_io() and
pgstat_flush_backend() calls.

We might just have to give up on that assertion, I guess?

Greetings,

Andres Freund




Re: Recent pg_rewind test failures in buildfarm

2025-04-15 Thread Andres Freund
Hi,

On 2025-04-15 06:04:54 -0400, Andres Freund wrote:
> See also 
> https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq

Huh, turns out my emails of the last few days had gotten stuck. I only noticed
when trying to open the above link and it was dead.  I had been wondering why
neither Bertrand nor Michael had reacted to the mail...

Greetings,

Andres Freund




Re: Built-in Raft replication

2025-04-15 Thread Aleksander Alekseev
Hi Konstantin,

> I am considering starting work on implementing a built-in Raft
> replication for PostgreSQL.

Generally speaking I like the idea. The more important question IMO is
whether we want to maintain Raft within the PostgreSQL core project.

Building distributed systems on commodity hardware was a popular idea
back in the 2000s. These days you can rent a server with 2 Tb of RAM
for something like 2000 USD/month (numbers from my memory that were
valid ~5 years ago) which will fit many of the existing businesses (!)
in memory. And you can rent another one for a replica, just in order
not to recover from a backup if something happens to your primary
server. The common wisdom is if you can avoid building distributed
systems, don't build one.

Which brings the question of whether we want to maintain something like this
(which will include logic for cases when a node joins or leaves the
cluster, proxy server / service discovery for clients, test cases /
infrastructure for all this and also upgrading the cluster, docs, ...)
for a presumably few users whose business doesn't fit in a single
server *and* who want an automatic failover (not the manual one)
*and* who don't use Patroni/Stolon/CockroachDB/Neon/... already.

Although the idea is tempting personally I'm inclined to think that
it's better to invest community resources into something else.

-- 
Best regards,
Aleksander Alekseev




minor error message enhancement in refuseDupeIndexAttach

2025-04-15 Thread jian he
hi.

minor error message enhancement

in refuseDupeIndexAttach, we can change from

errdetail("Another index is already attached for partition \"%s\"."...)

to

errdetail("Another index \"%s\" is already attached for partition \"%s\".")

So we can easily understand which index is already attached for
partition \"%s\".
From 1b9f592756eab51049307862aa6f954e551743ac Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 15 Apr 2025 15:50:08 +0800
Subject: [PATCH v1 1/1] minor error message enhancement

in refuseDupeIndexAttach change it
from
errdetail("Another index is already attached for partition \"%s\"."...)
to
errdetail("Another index \"%s\" is already attached for partition \"%s\".")

So we can easily understand which index is already attached for partition \"%s\".

discussion: https://postgr.es/m/
---
 src/backend/commands/tablecmds.c   | 3 ++-
 src/test/regress/expected/indexing.out | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3ed69457fc..c113f7e022f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -21674,7 +21674,8 @@ refuseDupeIndexAttach(Relation parentIdx, Relation partIdx, Relation partitionTb
  errmsg("cannot attach index \"%s\" as a partition of index \"%s\"",
 		RelationGetRelationName(partIdx),
 		RelationGetRelationName(parentIdx)),
- errdetail("Another index is already attached for partition \"%s\".",
+ errdetail("Another index \"%s\" is already attached for partition \"%s\".",
+		   get_rel_name(existingIdx),
 		   RelationGetRelationName(partitionTbl;
 }
 
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..4d29fb85293 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -248,7 +248,7 @@ alter index idxpart_a_b_idx attach partition idxpart1_a_b_idx; -- quiet
 create index idxpart1_2_a_b on idxpart1 (a, b);
 alter index idxpart_a_b_idx attach partition idxpart1_2_a_b;
 ERROR:  cannot attach index "idxpart1_2_a_b" as a partition of index "idxpart_a_b_idx"
-DETAIL:  Another index is already attached for partition "idxpart1".
+DETAIL:  Another index "idxpart1_a_b_idx" is already attached for partition "idxpart1".
 drop table idxpart;
 -- make sure everything's gone
 select indexrelid::regclass, indrelid::regclass
-- 
2.34.1



Re: Built-in Raft replication

2025-04-15 Thread Yura Sokolov
15.04.2025 13:20, Aleksander Alekseev wrote:
> Hi Konstantin,
> 
>> I am considering starting work on implementing a built-in Raft
>> replication for PostgreSQL.
> 
> Generally speaking I like the idea. The more important question IMO is
> whether we want to maintain Raft within the PostgreSQL core project.
> 
> Building distributed systems on commodity hardware was a popular idea
> back in the 2000s. These days you can rent a server with 2 Tb of RAM
> for something like 2000 USD/month (numbers from my memory that were
> valid ~5 years ago) which will fit many of the existing businesses (!)
> in memory. And you can rent another one for a replica, just in order
> not to recover from a backup if something happens to your primary
> server. The common wisdom is if you can avoid building distributed
> systems, don't build one.
> 
> Which brings the question if we want to maintain something like this
> (which will include logic for cases when a node joins or leaves the
> cluster, proxy server / service discovery for clients, test cases /
> infrastructure for all this and also upgrading the cluster, docs, ...)
> for a presumably view users which business doesn't fit in a single
> server *and* they want an automatic failover (not the manual one)
> *and* they don't use Patroni/Stolon/CockroachDB/Neon/... already.
> 
> Although the idea is tempting personally I'm inclined to think that
> it's better to invest community resources into something else.

Raft is not for "commodity hardware". It is for reliability.
Yes, it needs 3 servers instead of 2. It costs more than simple replication
with "manual" failover.

But if a business needs high availability, it wouldn't rely on "manual"
failover. And if a business relies on correctness, it wouldn't rely on any
solution which "automatically switches between two replicas", because there
is no way to guarantee correctness with just two replicas. And the many
stories of lost transactions with Patroni/Stolon already confirm this thesis.

CockroachDB/Neon - they are good solutions for distributed systems. But, as
you've said, many clients don't need distributed systems. They just need
reliable replication.

I've been working at a company which uses MongoDB (3.6 and up) as its
primary storage. And it seemed to me a godsend. Everything just worked.
Replication was as reliable as one could imagine. It outlived several
hardware incidents without manual intervention. It allowed cluster
maintenance (software and hardware upgrades) without application downtime.
I really dream that PostgreSQL will be as reliable as MongoDB without the
need for external services.

-- 
regards
Yura Sokolov aka funny-falcon




Re: Add pg_get_injection_points() for information of injection points

2025-04-15 Thread Rahila Syed
Hi,

Thank you for this enhancement to injection points.

Following are a few comments.

+ * This overestimates the allocation size, including slots that may be
+ * free, for simplicity.
+ */
+ result = palloc0(sizeof(InjectionPointData) * max_inuse);
+

The result variable name is a bit generic. How about renaming it to
inj_points/injpnt_list or something similar?

+typedef struct InjectionPointData
+{
+ const char *name;
+ const char *library;
+ const char *function;
+} InjectionPointData;

+/*
+ * Retrieve a list of all the injection points currently attached.
+ *
+ * This list is palloc'd in the current memory context.
+ */
+InjectionPointData *
+InjectionPointList(uint32 *num_points)

The function actually retrieves an array, not a list, so the comment and
the function name may be misleading.
This function not only retrieves an array of injection points but also
the number of injection points. Would it be more efficient to define a
single struct that includes both the array of injection points and the
number of points?
This way, both values can be returned from the function.
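
For instance, something along these lines (a hypothetical sketch; the names
are invented):

typedef struct InjectionPointList
{
    uint32              num_points;   /* number of attached points */
    InjectionPointData *points;       /* palloc'd array of points */
} InjectionPointList;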

The second patch is still something folks seem to be meh about, but if
> the end consensus is to put the new SQL function in the module
> injection_points we are going to need the first patch anyway.
>

FWIW, I believe it would be beneficial to incorporate the new SQL function
into the core. This would eliminate the need for users to install the
injection_points module solely to monitor injection points attached in the
other modules or within the core itself.

+Datum
+pg_get_injection_points(PG_FUNCTION_ARGS)

Would it be useful to put the logic of the above function under #ifdef
USE_INJECTION_POINTS?  This approach would make it possible to
distinguish between cases where no injection points are attached and
builds that do not support injection points.

Thank you,
Rahila Syed


Re: Built-in Raft replication

2025-04-15 Thread Konstantin Osipov
* Yura Sokolov  [25/04/15 12:02]:

> > OTOH Raft needs to write its own log, and what's worse, it sometimes
> > needs to remove already written parts of it (so, it is not appended
> > only, unlike WAL). If you have a production system which maintains two
> > kinds of logs with different semantics, it is a very hard system to
> > maintain..
> 
> Raft is log replication protocol which uses log position and term.
> But... PostgreSQL already have log position and term in its WAL structure.
> PostgreSQL's timeline is actually the Term.
> Raft implementer needs just to correct rules for Term/Timeline switching:
> - instead of "next TimeLine number is just increment of largest known
> TimeLine number" it needs to be "next TimeLine number is the result of
> Leader Election".
> 
> And yes, "it sometimes needs to remove already written parts of it".
> But... It is exactly what every PostgreSQL's cluster manager software have
> to do to join previous leader as a follower to new leader - pg_rewind.
> 
> So, PostgreSQL already have 70-90%% of Raft implementation details.
> Raft doesn't have to be implemented in PostgreSQL.
> Raft has to be finished!!!
> 
> PS: One of the biggest issues is forced snapshot on replica promotion. It
> really slows down leader switch time. It looks like it is not really
> needed, or some small workaround should be enough.

I'd say my pet peeve is storing the cluster topology (the so-called
Raft configuration) inside the database, not in an external
state provider. I agree on the other points.

-- 
Konstantin Osipov




Re: Built-in Raft replication

2025-04-15 Thread Aleksander Alekseev
Hi Yura,

> I've been working in a company which uses MongoDB (3.6 and up) as their
> primary storage. And it seemed to me as "God Send". Everything just worked.
> Replication was as reliable as one could imagine. It outlives several
> hardware incidents without manual intervention. It allowed cluster
> maintenance (software and hardware upgrades) without application downtime.
> I really dream PostgreSQL will be as reliable as MongoDB without need of
> external services.

I completely understand. I had exactly the same experience with
Stolon. Everything just worked. And the setup took like 5 minutes.

It's a pity this project doesn't seem to get as much attention as
Patroni. Probably because attention requires traveling and presenting
the project at conferences which costs money. Or perhaps people are
just happy with Patroni. I'm not sure in which state Stolon is today.

-- 
Best regards,
Aleksander Alekseev




Re: pgsql: Non text modes for pg_dumpall, correspondingly change pg_restore

2025-04-15 Thread Andrew Dunstan


On 2025-04-14 Mo 6:55 PM, Mahendra Singh Thalor wrote:



--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -1053,15 +1053,19 @@ get_dbname_oid_list_from_mfile(const char
*dumpdirpath, SimpleOidStringList *dbn
        while ((fgets(line, MAXPGPATH, pfile)) != NULL)
        {
                Oid                     db_oid = InvalidOid;
-               char            db_oid_str[MAXPGPATH + 1] = "";
                char       *dbname;
+               char            *p = line;

                /* Extract dboid. */
                sscanf(line, "%u", &db_oid);
-               sscanf(line, "%20s", db_oid_str);
+
+               while(isdigit(*p))
+                       p++;
+
+               Assert(*p == ' ');

                /* dbname is the rest of the line */
-               dbname = line + strlen(db_oid_str) + 1;
+               dbname = ++p;

                /* Remove \n from dbname. */
                dbname[strlen(dbname) - 1] = '\0';




I don't think an Assert is the right thing here. It's for things that 
should not be possible, and won't trigger anything in a non-assertion 
build. This condition should cause a pg_fatal().
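
Something like this, perhaps (a sketch; the exact wording of the message is
up for grabs):

    if (*p != ' ')
        pg_fatal("invalid entry in database map file: \"%s\"", line);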



cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Built-in Raft replication

2025-04-15 Thread Yura Sokolov
15.04.2025 14:15, Aleksander Alekseev wrote:
> Hi Yura,
> 
>> I've been working in a company which uses MongoDB (3.6 and up) as their
>> primary storage. And it seemed to me as "God Send". Everything just worked.
>> Replication was as reliable as one could imagine. It outlives several
>> hardware incidents without manual intervention. It allowed cluster
>> maintenance (software and hardware upgrades) without application downtime.
>> I really dream PostgreSQL will be as reliable as MongoDB without need of
>> external services.
> 
> I completely understand. I had exactly the same experience with
> Stolon. Everything just worked. And the setup took like 5 minutes.
> 
> It's a pity this project doesn't seem to get as much attention as
> Patroni. Probably because attention requires traveling and presenting
> the project at conferences which costs money. Or perhaps people are
> just happy with Patroni. I'm not sure in which state Stolon is today.

But the key point: if PostgreSQL is improved a bit, there will be no
need for either Patroni or Stolon. Isn't that great?

-- 
regards
Yura Sokolov aka funny-falcon




Re: dispchar for oauth_client_secret

2025-04-15 Thread Noah Misch
On Tue, Apr 15, 2025 at 01:16:12PM -0700, Jacob Champion wrote:
> On Tue, Apr 15, 2025 at 12:14 PM Noah Misch  wrote:
> > I suspect this should use .dispchar="*" to encourage UIs to display
> > oauth_client_secret like a password field.  Thoughts?
> 
> Hmm, from a UI perspective I agree. (The builtin flow targets "public
> clients", where secrets are discouraged and/or understood to be
> not-really-secret, but there's no reason to assume that all flows used
> by the application are public.)
> 
> From a proxy perspective, this would mess with FDW handling. By making
> the dispchar '*', oauth_client_secret will be made into a user mapping
> option rather than a server option. (Neither is very useful to
> postgres_fdw anyway, because the builtin flow needs an end user to
> interact with the provider.) But I'm not sure if we'll need to handle
> compatibility in the future if we implement a proxy-friendly flow.

If we think oauth_client_secret should get dispchar="*" UI treatment yet be a
postgres_fdw server option, postgres_fdw code can make it so.  postgres_fdw
already has explicit code to reclassify the "user" option.

> Is
> it okay to move options back and forth during a major version bump?  I
> assume it would present a problem for pg_upgrade?

It would break dump/reload, so it's best not to move options between
optcontext=ForeignServerRelationId and optcontext=UserMappingRelationId, even
at major versions.  As above, we can change dispchar separately from
optcontext.  The documentation of dispchar likely should tell people to update
the postgres_fdw documentation when adding a dispchar="*" option.

It sounds like we might end up wanting to allow oauth_client_secret in both of
ForeignServerRelationId and UserMappingRelationId, each catering to different
oauth implementations.  Is that right?  (It would be fine to allow one today
and later change to allow both.)




Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-15 Thread Daniel Gustafsson
> On 15 Apr 2025, at 23:03, David Rowley  wrote:

> My vote is to make the levels 1-based in all locations where we output
> the context information.

I agree with this, pg_get_process_memory_contexts() also use 1-based levels
fwiw.

--
Daniel Gustafsson





Re: pipelining in psql, commit 41625ab

2025-04-15 Thread Jelte Fennema-Nio
On Tue, 15 Apr 2025 at 23:34, Noah Misch  wrote:
> That said, maybe having
> PQpipelineSync() was a mistake, since I think it's just PQsendPipelineSync() +
> PQflush().

Yes, IMO that's pretty much the case. But we cannot remove that
function because of backwards compatibility.

Note that for all the other commands (e.g. \bind, \close, \parse) we
use the send variant of the libpq function too, without including "send"
in the command name. However, "send" means a different thing for those: if
you use the non-send variant you get a PGresult back.
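
For illustration, the claimed equivalence would look roughly like this (a
sketch, not libpq's actual implementation; nonblocking corner cases are
elided):

    /* roughly what "PQsendPipelineSync() + PQflush()" means */
    if (PQsendPipelineSync(conn) == 0)
        return 0;               /* could not queue the sync message */
    return PQflush(conn) == 0;  /* 0 means all buffered data was sent */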




Re: Built-in Raft replication

2025-04-15 Thread Tom Lane
Nikolay Samokhvalov  writes:
> This is exactly what I wanted to write as well. The idea is great. At the
> same time, I think, consensus on many decisions will be extremely hard to
> reach, so this project has a high risk of being very long. Unless it's an
> extension, at least in the beginning.

Yeah.  The two questions you'd have to get past to get this into PG
core are:

1. Why can't it be an extension?  (You claimed it would work more
seamlessly in core, but I don't think you've made a proven case.)

2. Why depend on Raft rather than some other project?

Longtime PG developers are going to be particularly hard on point 2,
because we have a track record now of outliving outside projects
that we thought we could rely on.  One example here is the Snowball
stemmer; while its upstream isn't quite dead, it's twitching only
feebly, and seems to have a bus factor of 1.  Another example is the
Spencer regex engine; we thought we could depend on Tcl to be the
upstream for that, but for a decade or more they've acted as though
*we* are the upstream.  And then there's libxml2.  And uuid-ossp.
And autoconf.  And various documentation toolchains.  Need I go on?

The great advantage of implementing an outside dependency in an
extension is that if the depended-on project dies, we can say a few
words of mourning and move on.  It's a lot harder to walk away from
in-core features.

regards, tom lane




Re: Logical Replication of sequences

2025-04-15 Thread Peter Smith
Hi Vignesh,

Some review comments for patch v20250325-0005 (docs).

==
doc/src/sgml/catalogs.sgml

(52.55. pg_subscription_rel)

1.
State code: i = initialize, d = data is being copied, f = finished
table copy, s = synchronized, r = ready (normal replication)

~

This is not part of the patch, but AFAIK not all of those states are
relevant if the srrelid was a SEQUENCE relation. Should there be 2
sets of states given here for what is possible for tables/sequences?

==
doc/src/sgml/config.sgml

(19.6.4. Subscribers)

2.

 In logical replication, this parameter also limits how often a failing
-replication apply worker or table synchronization worker will be
-respawned.
+replication apply worker or table synchronization worker or sequence
+synchronization worker will be respawned.


Does it though? I thought /often/quickly/ because if there is some
ERROR the respawning may occur again and again forever, so you are not
really limiting how often it occurs, only the *rate* at which the
respawning happens.

==
doc/src/sgml/logical-replication.sgml

(29.1 Publication)

3.
Publications may currently only contain tables and all tables in schema.
Objects must be added explicitly, except when a publication is created for
-   ALL TABLES.
+   ALL TABLES. Publications can include sequences as well,
+   but their behavior differs from that of tables or groups of tables. Unlike
+   tables, sequences allow users to synchronize their current state at any
+   given time. For more information, refer to
+   .
   

This doesn't really make sense. The change seems too obviously just
tacked onto the end of the existing documentation. E.g. it is strange
saying "Publications may currently only contain tables and all tables
in schema." and then later saying Oh, BTW "Publications can include
sequences as well". I think the whole paragraph may need some
reworking.

The preceding para on this page also still says "A publication is a
set of changes generated from a table or a group of tables", which also
fails to mention sequences.

OTOH, I think it would get too confusing to mush everything together,
so after saying that you can publish tables and sequences, I think there
should be one paragraph that just talks about publishing tables,
followed by another paragraph that just talks about publishing
sequences. Probably this should mention ALL SEQUENCES too since it
already mentioned ALL TABLES.

~~~

(29.6. Replicating Sequences)

4.
+  
+   To replicate sequences from a publisher to a subscriber, first publish the
+   sequence using 
+   CREATE PUBLICATION ... FOR ALL SEQUENCES.
+  

/first publish the sequence/first publish them/

~~~

5.
+  
+   A new sequence synchronization worker will be started
+   to synchronize the sequences after executing any of the above subscriber
+   commands, and will exit once the sequences are synchronized.
+  

IMO the name of the worker makes it obvious what it does so I think
you can remove the redundant words "to synchronize the sequences" from
this sentence.

~~~

(29.7.1. Sequence Definition Mismatches)

6.
+   
+To resolve this, use
+ALTER SEQUENCE
+to align the subscriber's sequence parameters with those of the publisher.
+Subsequently, execute 
+ALTER SUBSCRIPTION ... REFRESH PUBLICATION
SEQUENCES.
+   

/Subsequently,/Then,/

~~~

(29.6.3. Examples)

7.
+   
+Create some test sequences on the publisher.

This whole section is just examples, so I don't think you need to call
them "test" sequences. Just say "Create some sequences on the
publisher.".

~~~

(29.9. Restrictions)

8.
 
- Sequence data is not replicated.  The data in serial or identity columns
- backed by sequences will of course be replicated as part of the table,
- but the sequence itself would still show the start value on the
- subscriber.  If the subscriber is used as a read-only database, then this
- should typically not be a problem.  If, however, some kind of switchover
- or failover to the subscriber database is intended, then the sequences
- would need to be updated to the latest values, either by copying the
- current data from the publisher (perhaps
- using pg_dump) or by determining a sufficiently high
- value from the tables themselves.
+ Incremental sequence changes are not replicated.  The data in serial or
+ identity columns backed by sequences will of course be replicated as part
+ of the table, but the sequence itself would still show the start value on
+ the subscriber.  If the subscriber is used as a read-only database, then
+ this should typically not be a problem.  If, however, some kind of
+ switchover or failover to the subscriber database is intended, then the
+ sequences would need to be updated to the latest values, either
by executing
+ 
+ ALTER SUBSCRIPTION ... REFRESH PUBLICATION
SEQUENCES
+ or by copying the current data from the p

Re: [18] Unintentional behavior change in commit e9931bfb75

2025-04-15 Thread Noah Misch
On Tue, Apr 15, 2025 at 04:08:55PM -0700, Jeff Davis wrote:
> On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote:
> > However, I think v17's
> > concept of separate PG_REGEX_ symbols for the default-locale case is
> > still the
> > right thing for v18.  In other words, this code should change to look
> > more
> > like v17, with the removal of non-locale_t calls being the main
> > change.
> 
> I tried that in v2-0003, but I think it ended up worse. Most
> pg_wc_xyz() functions don't care if it's the default collation or not,
> so there are a lot of duplicate cases.

I'd likely adopt 0003, but I'm fine if you stop at 0002.

> + * As a special case, in the "default" collation we force ASCII letters to

I'd change s/we force/(2) and (3) force/ to make explicit that this isn't
specific to (3).  That's all I would change in v2.




transforms [was Re: FmgrInfo allocation patterns (and PL handling as staged programming)]

2025-04-15 Thread Chapman Flack
On 04/15/25 10:52, Tom Lane wrote:
> The problem from a PL's standpoint is "given this input or output
> of type FOO, should I transform it, and if so using what?".  So
> the starting point has to be a type not a transform. ... protrftypes data
> is used as a filter before attempting a pg_transform lookup. If the pg_proc
> entry contained transform OIDs we'd have to filter after the lookup,

I don't think I follow. If a function is declared in LANGUAGE BAR with
TRANSFORM FOR TYPE FOO, then either:

1. There is no transform declared for (trftype foo, trflang bar).
   CREATE FUNCTION fails, 42704: transform for type foo language "bar"
   does not exist.

2. CREATE FUNCTION succeeds and there is a transform
   (trftype foo, trflang bar, trffromsql f1, trftosql f2).

So the choice of what to put in pg_proc is between the oid of type foo,
or the oid of the transform (foo, bar, f1, f2).

If the function is going to encounter type foo at all, it cannot avoid
looking up the transform, either by the transform oid or by (type, lang).

The only case in which that lookup could be avoided would be where
a function declared with TRANSFORM FOR TYPE FOO actually sees no input
or output arguments or internally-SQL-encountered values of type foo in
a given call or at a given call site.

That seems to me like an edge case, so I really am questioning this:

> which is pretty inefficient, especially if you expect that the normal
> case is that there's not an applicable transform.

I *don't* expect that. I expect that if someone took the time to declare
the function with TRANSFORM FOR TYPE FOO and CreateFunction confirmed
that transform existed, the function is expected to encounter values
of type foo and apply that intended transform to them.

Perhaps the efficiency argument is really "say a function has
a list of 100 arguments and only one is of type foo, how many cycles
are wasted in get_transform_tosql and get_transform_fromsql applied
to all those other types?"

At present, they return quickly when the passed typid isn't found
in the passed list of trftypes.

If pg_proc had protransforms instead, that would add a step zero: looking
up the declared transforms to make an in-memory list of (typid, tosqloid,
fromsqloid). After that, get_transform_{tosql,fromsql} would be applied
and return quickly when the passed typid isn't in that list. When it is
in the list, they'd return just as quickly the transform function oid
directly from the list. For a routine with no transforms declared,
step zero completes in the time it takes to see protransforms empty and
return an empty list.

Now the question becomes: how many cycles does step zero spend in excess
of those that must be spent for the function to have its intended behavior?

My answer would be "zero, except in the vanishingly perverse case of
a function declared with transforms for types it never sees."

Am I mistaken?

On 04/15/25 01:05, Pavel Stehule wrote:
> There was a long discussion, and I think the main reason for this design
> is  a possibility to use an old code without change although the
> transformations are installed. And secondly there was a possibility to use
> a transformation when it was installed, and use it although it was
> installed after the function was created. ... I can write code
> in PL/Python that can work with or without transformation
> ...
>
https://www.postgresql.org/message-id/flat/1339713732.11971.79.camel%40vanquo.pezone.net

Thank you for the link to that old thread. I can see now that the first
version of the patch in mid-2012 had CREATE and DROP TRANSFORM but did
not add CREATE FUNCTION syntax for which transforms to use; in that
version, it did entail the idea of transforms being selected for use
just by existing. But the problem of that changing the behavior of existing
functions was already recognized by the fifth message in the thread, where
Peter aptly used the words "worst nightmare".

By November of 2013 there were already suggestions about explicit
CREATE FUNCTION syntax for what transforms to apply, and by January 2014
Peter had found the ISO SQL syntax that does
exactly that, and by April 2015 the patch was adopted in that form.

I'll guess that the 'bogus' dependency creation, based only on argument
and return types, that Tom fixed last week in b73e6d7, was a vestige from
the earliest "apply whatever transforms exist" version of the patch that
didn't get changed when the explicit function-creation syntax was added.

At any rate, we are now firmly in the world of "apply exactly the
transforms explicitly requested at function creation", and that mirrors
ISO SQL and seems the sanest world to me.

While I don't doubt that a PL/Python function could be written that would
work with or without a transform, that sounds to me like the kind of feat
undertaken to show it can be done. My normal expectation would be for ~ 100%
of real-world functions to faceplant if their data suddenly started
arriving in completely different datatypes. And we have

Re: An incorrect check in get_memoize_path

2025-04-15 Thread Richard Guo
On Mon, Apr 14, 2025 at 8:08 PM wenhui qiu  wrote:
>   No objections. It's a pity that the PostgreSQL 18 version has been code-frozen

v18 is now in feature freeze, but not code freeze, so bug fixes are
still allowed.  I've pushed this patch after adding the "Reviewed-by"
tags.

Thanks
Richard



