Re: Built-in Raft replication

2025-04-16 Thread Alastair Turner
On Wed, 16 Apr 2025 at 07:18, Ashutosh Bapat 
wrote:

> On Wed, Apr 16, 2025 at 10:29 AM Andrey Borodin 
> wrote:
> >
> > If you use built-in failover you have to resort to 3 big Postgres
> > machines because you need a 2/3 majority. Of course, you can install a
> > MySQL-style arbiter - a host that has no real PGDATA and only participates
> > in voting. But this is a solution to a problem induced by built-in autofailover.
>
> Users find it a waste of resources to deploy 3 big PostgreSQL
> instances just for HA where 2 suffice even if they deploy 3
> lightweight DCS instances. Having only some of the nodes act as DCS
> and others purely PostgreSQL nodes will reduce waste of resources.

The experience of other projects/products with automated failover based on
quorum shows that this is a critical issue for adoption. In the in-memory
data grid space (Coherence, Geode/GemFire), the question of how to ensure
that some nodes don't carry any data comes up early in many architecture
discussions. When RabbitMQ shipped their Quorum Queues feature, the first
and hardest area of pushback was around all nodes hosting message content.

It's not just about the requirement for compute resources, it's also about
bandwidth and latency. Many large organisations have, for historical
reasons, pairs of data centres with very good point-to-point connectivity.
As the requirement for quorum witnesses has come up for all sorts of
things, including storage arrays, they have built arbiter/witness sites at
branches, colocation providers or even on the public cloud. Beyond not
holding user data or processing queries, the arbiter often can't even be sent
the replication stream for the user data in the database; it just won't fit
down the pipe.

Which feels like a very difficult requirement to meet if the replication
model for all data is being changed to a quorum model.

Regards
Alastair


NUMA shared memory interleaving

2025-04-16 Thread Jakub Wartak
Thanks to having pg_numa.c, we can now simply address problem #2 of
NUMA imbalance from [1] pages 11-14, by interleaving shm memory in
PG19 - patch attached (a minimal sketch of the call involved is included
after the open questions below). We do not need to call numa_set_localalloc()
as we only interleave shm segments, while local allocations stay the same
(well, "local" means relative to the CPU asking for private memory).
Below is a result from a legacy 4s32t64 Sandy Bridge-EP box with low NUMA
(QPI) interconnect bandwidth, chosen to better illustrate the problem (it's
a bit of an edge case, but someone may hit it):

Testcase:
a small shared_buffers (s_b; here it was 4GB*) that fully fits within a
single NUMA hugepage zone, as this was tested with huge_pages=on

$ cat seqconcurrscans.pgb
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;

/usr/local/pgsql/bin/pg_ctl -D /db/data -l logfile restart
/usr/local/pgsql/bin/psql  -c "select
pg_prewarm('pgbench_accounts_'||s) from generate_series(1, 8) s;"
#load all using current policy
/usr/local/pgsql/bin/psql  -c "select * from
pg_shmem_allocations_numa where name = 'Buffer Blocks';"
/usr/local/pgsql/bin/pgbench -c 64 -j 8 -P 1 -T 60 -f seqconcurrscans.pgb

on master and numa=off (default) and in previous versions:
     name      | numa_node |    size
---------------+-----------+------------
 Buffer Blocks |         0 |          0
 Buffer Blocks |         1 |          0
 Buffer Blocks |         2 | 4297064448
 Buffer Blocks |         3 |          0

latency average = 1826.324 ms
latency stddev = 665.567 ms
tps = 34.708151 (without initial connection time)

on master and numa=on:
     name      | numa_node |    size
---------------+-----------+------------
 Buffer Blocks |         0 | 1073741824
 Buffer Blocks |         1 | 1073741824
 Buffer Blocks |         2 | 1075838976
 Buffer Blocks |         3 | 1073741824

latency average = 1002.288 ms
latency stddev = 214.392 ms
tps = 63.344814 (without initial connection time)

Normal pgbench workloads tend not to be affected, as each backend
tends to touch just a small part of shm (thanks to BAS, the buffer
access strategies). Some remaining questions are:
1. How to name this GUC (numa or numa_shm_interleave)? I prefer the
first option, as we could potentially add more optimizations behind
that GUC in the future.
2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all.)
3. Should we fail to start if numa=on is set on an unsupported platform?
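As mentioned above, here is a minimal, self-contained sketch of the kind of
call the interleaving boils down to, using libnuma directly. The attached
patch may well differ (GUC plumbing, error handling, which segments are
covered), so treat this purely as an illustration of the mechanism:

/*
 * Illustration only -- not the attached patch.  Interleave an already
 * mmap()ed shared memory segment across all NUMA nodes so that its pages
 * are spread round-robin as they are first faulted in.  Per-backend private
 * allocations are left alone and stay node-local.
 */
#include <numa.h>
#include <stddef.h>

static void
interleave_shmem_sketch(void *base, size_t size)
{
	if (numa_available() == -1)
		return;				/* no kernel/libnuma support: keep the default policy */

	/* Spread the pages of [base, base + size) over all allowed nodes. */
	numa_interleave_memory(base, size, numa_all_nodes_ptr);
}

The policy only affects where pages land when they are first faulted in,
which is why the interleaving has to be set up before the buffers are first
touched.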

* An interesting tidbit for getting a reliable measurement: one needs to
double-check that s_b (the hugepage allocation) is smaller than the free
hugepages of a single NUMA zone (i.e. s_b fits within the static hugepage
allocation of one zone). This shouldn't be a problem on 2 sockets (as most
of the time there, s_b is < 50% of RAM anyway -- usually 26-30%, with some
extra from max_connections-related structures, so higher than 25%, but
people usually set sysctl nr_hugepages to 25% of RAM), but with >= 4 NUMA
nodes (4 sockets or some modern MCMs) the kernel might start spilling s_b
(> 25%) to another NUMA node on its own, so it's best to verify it using
pg_shmem_allocations_numa...

-J.

[1] - 
https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf


v1-0001-Add-capability-to-interleave-shared-memory-across.patch
Description: Binary data


Re: not null constraints, again

2025-04-16 Thread Alvaro Herrera
On 2025-Apr-16, Tender Wang wrote:

> if (conForm->contype != CONSTRAINT_NOTNULL)
> elog(ERROR, "constraint %u is not a not-null constraint", conForm->oid);
> 
> I feel that using conForm->conname is more friendly than oid for users.

Yeah, this doesn't really matter because this function would not be
called with any other kind of constraint anyway.  This test could just
as well be an Assert() ... I was pretty torn about that choice TBH (I
still am).

> Others look good to me.

Thanks for looking!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."




Re: Performance issues with v18 SQL-language-function changes

2025-04-16 Thread Bruce Momjian
On Mon, Apr 14, 2025 at 10:38:29AM -0400, Robert Haas wrote:
> On Sun, Apr 13, 2025 at 3:23 PM Tom Lane  wrote:
> > create function fx(p_summa bigint) returns text immutable strict
> > return ltrim(to_char(p_summa, '999 999 999 999 999 999 999 999'));
> >
> > explain analyze select fx(i) from generate_series(1,100) as i(i);
> >
> > you arrive at the rude discovery that 0dca5d68d is about 50% slower
> > than 0dca5d68d^, because the old implementation builds a plan for fx()
> > only once and then re-uses it throughout the query.
> 
> I agree that we should do something about this. I haven't reviewed
> your patches but the approach sounds broadly reasonable.

Yep, we went down the road in PG 18 to convert syntax, and now we have
to fix this, or we have to revert all the PG 18 syntax changes, which
seems like a step backward.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.




Re: Typos in the comment for the estimate_multivariate_ndistinct()

2025-04-16 Thread Daniel Gustafsson
> On 14 Apr 2025, at 15:34, Tender Wang  wrote:
> 
> Hi,
> 
> While reading the estimate_multivariate_ndistinct(),
>  I think "If a match it found, *varinfos is
>  * updated to remove the list of matched varinfos"
> should be "If a match is found, *varinfos is
>  * updated to remove the list of matched varinfos"
> I've attached a patch for that.

Seems like a correct change.

--
Daniel Gustafsson





Re: Add Pipelining support in psql

2025-04-16 Thread Michael Paquier
On Wed, Apr 16, 2025 at 09:31:59PM +0700, a.kozhemyakin wrote:
> After commit 2cce0fe on master
> 
> ERROR:  unexpected message type 0x50 during COPY from stdin
> CONTEXT:  COPY psql_pipeline, line 1
> Pipeline aborted, command did not run
> psql: common.c:1510: discardAbortedPipelineResults: Assertion `res == ((void
> *)0) || result_status == PGRES_PIPELINE_ABORTED' failed.
> Aborted (core dumped)

Reproduced here, thanks for the report.  I'll look at that at the
beginning of next week, adding an open item for now.
--
Michael




Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-04-16 Thread Maciek Sakrejda
On Tue, Apr 15, 2025 at 1:50 PM David Rowley  wrote:
> On Wed, 16 Apr 2025 at 04:25, Robert Haas  wrote:
> > On Mon, Apr 14, 2025 at 8:23 PM David Rowley  wrote:
> > > "Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"
> >
> > Is lookups=N here the estimated number of lookups i.e. what we think
> > nloops will end up being?
>
> Yes. The estimate is the "calls" variable in cost_memoize_rescan(),
> which is fairly critical in the hit ratio estimate calculation.
>
> Technically this is just the Nested Loop's outer_path->rows. There was
> an argument earlier in the thread for putting this in along with the
> other fields to make things easier to read. I did argue that it was
> redundant due to the fact that the reader can look at the row estimate
> for the outer side of the Nested Loop, but maybe it's small enough to go
> in using the above format.

This kind of thing is nice to have in the text format, but it's really
nice to have when working with structured formats. Currently, to get a
loops estimate for the inner node, I need to find the parent, find the
outer node in the parent's children, and get that sibling's Plan Rows
(I think). It'd be nice to make things like that easier.

Thanks,
Maciek




Re: pipelining in psql, commit 41625ab

2025-04-16 Thread Michael Paquier
On Tue, Apr 15, 2025 at 02:34:50PM -0700, Noah Misch wrote:
> On Fri, Feb 21, 2025 at 11:33:41AM +0900, Michael Paquier wrote:
> commit 41625ab wrote:
>> * \syncpipeline queues a synchronisation request, without flushing the
>> commands to the server, equivalent of PQsendPipelineSync().
>> 
> libpq has both PQpipelineSync() and PQsendPipelineSync(), so I find it odd
> that the psql command for PQsendPipelineSync() is \syncpipeline.  I would have
> expected the word "send" somewhere in its name.  That said, maybe having
> PQpipelineSync() was a mistake, since I think it's just PQsendPipelineSync() +
> PQflush().  In that light, it's reasonable not to spread the extra "send" word
> into psql.  \syncpipeline is fine with me, but it's worth others taking a
> second look.

At this stage, PQpipelineSync() is just around for compatibility
purposes; we cannot remove it, and I don't see us recommending it either.
It is true that \syncpipeline does not map clearly to the fact that
we're just calling PQsendPipelineSync() under the hood.

> 
>> +pg_log_error("\\getresults: invalid number of 
>> requested results");
>> +return PSQL_CMD_SKIP_LINE;
> 
> This should be PSQL_CMD_ERROR.  That matters under ON_ERROR_STOP=1.

Yes, good catch.  Will fix.

>> --- a/src/bin/psql/help.c
>> +++ b/src/bin/psql/help.c
>> @@ -167,15 +167,22 @@ slashUsage(unsigned short int pager)
>>  HELP0("  \\close STMT_NAME   close an existing prepared 
>> statement\n");
>>  HELP0("  \\copyright show PostgreSQL usage and distribution 
>> terms\n");
>>  HELP0("  \\crosstabview [COLUMNS] execute query and display result in 
>> crosstab\n");
>> +HELP0("  \\endpipeline   exit pipeline mode\n");
>>  HELP0("  \\errverboseshow most recent error message at 
>> maximum verbosity\n");
>> +HELP0("  \\flush push unsent data to the server\n");
>> +HELP0("  \\flushrequest  send a flushrequest command\n");
> 
> protocol.sgml calls it a "Flush command".

For \flushrequest, how about "send a request to the server to flush
its output buffer" and for \flush "flush output data to the server",
mapping more closely to the libpq descriptions?

> General
>   \bind [PARAM]...   set query parameters
>   \copyright show PostgreSQL usage and distribution terms
>   \crosstabview [COLUMNS] execute query and display result in crosstab
>   \errverboseshow most recent error message at maximum verbosity
>   \g [(OPTIONS)] [FILE]  execute query (and send result to file or |pipe);
>  \g with no arguments is equivalent to a semicolon
>   \gdesc describe result of query, without executing it
>   \gexec execute query, then execute each value in its result
>   \gset [PREFIX] execute query and store result in psql variables
>   \gx [(OPTIONS)] [FILE] as \g, but forces expanded output mode
>   \q quit psql
>   \watch [[i=]SEC] [c=N] [m=MIN]
>  execute query every SEC seconds, up to N times,
>  stop if less than MIN rows are returned
> 
> v18 has raised that to 26 lines via $SUBJECT and other additions.  I think a
> new "Extended Query Protocol" section should house \bind and all the v18
> additions.  Beginners should ignore the new section.  The section may as well
> appear last.  What do you think?

At this stage, most of the bloat of the general section is caused by
the extended protocol commands, so I like your suggestion of a new
section with a split from General.
--
Michael




Re: Built-in Raft replication

2025-04-16 Thread Yura Sokolov
16.04.2025 08:24, Andrey Borodin wrote:
> 2. After failover, the old Primary node must rejoin the cluster by running pg_rewind 
> and following the timeline switch.

It is really doable: BiHA already does it. And BiHA runs as a child
process of postmaster, i.e. neither postmaster nor BiHA restarts when
PostgreSQL needs to rewind and restart.

Yes, there is a non-trivial amount of changes to the postmaster
machinery. But it is doable.

-- 
regards
Yura Sokolov aka funny-falcon




Re: Add Pipelining support in psql

2025-04-16 Thread a.kozhemyakin

Hello,

After commit 2cce0fe on master

When executing query:
psql postgres  "discardAbortedPipelineResults") at ./assert/assert.c:96
#6  0x760edd23b517 in __assert_fail (assertion=assertion@entry=0x5ba46ab79850 "res == ((void *)0) || result_status == PGRES_PIPELINE_ABORTED",
    file=file@entry=0x5ba46ab6fcad "common.c", line=line@entry=1510,
    function=function@entry=0x5ba46ab9c780 <__PRETTY_FUNCTION__.3> "discardAbortedPipelineResults") at ./assert/assert.c:105
#7  0x5ba46ab2bd40 in discardAbortedPipelineResults () at common.c:1510
#8  ExecQueryAndProcessResults (query=query@entry=0x5ba4a2ec1e10 "SELECT 'val1';", elapsed_msec=elapsed_msec@entry=0x7ffeb07262a8,
    svpt_gone_p=svpt_gone_p@entry=0x7ffeb07262a7, is_watch=is_watch@entry=false, min_rows=min_rows@entry=0, opt=opt@entry=0x0, printQueryFout=0x0)
    at common.c:1811
#9  0x5ba46ab2983f in SendQuery (query=0x5ba4a2ec1e10 "SELECT 'val1';") at common.c:1212
#10 0x5ba46ab3f66a in MainLoop (source=source@entry=0x760edd4038e0 <_IO_2_1_stdin_>) at mainloop.c:515
#11 0x5ba46ab23f2a in process_file (filename=0x0, use_relative_path=use_relative_path@entry=false) at command.c:4870
#12 0x5ba46ab1e9d9 in main (argc=, argv=0x7ffeb07269d8) at startup.c:420





06.03.2025 11:20, Michael Paquier wrote:

On Wed, Mar 05, 2025 at 03:25:12PM +0100, Daniel Verite wrote:

Anthonin Bonnefoy wrote:

I do see the idea to make it easier to convert existing scripts into
using pipelining. The main focus of the initial implementation was
more on protocol regression tests with psql, so that's not necessarily
something I had in mind.

Understood. Yet pipelining can accelerate considerably certain scripts
when client-server latency is an issue. We should expect end users to
benefit from it too.

That was not a test case we had in mind originally here, but if it is
possible to keep the implementation simple while supporting your
demand, well, let's do it.  If it's not that straight-forward, let's
use the new meta-command, forbidding \g and \gx based on your
arguments from upthread.
--
Michael





Re: NUMA shared memory interleaving

2025-04-16 Thread Robert Haas
On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
 wrote:
> Normal pgbench workloads tend to be not affected, as each backend
> tends to touch just a small partition of shm (thanks to BAS
> strategies). Some remaining questions are:
> 1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
> first option, as we could potentially in future add more optimizations
> behind that GUC.

I wonder whether the GUC needs to support interleaving between a
designated set of nodes rather than only being able to do all nodes.
For example, suppose someone is pinning the processes to a certain set
of NUMA nodes; perhaps then they wouldn't want to use memory from
other nodes.
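For what it's worth, a hedged sketch of that variant, again using libnuma
directly and an invented node-list string (nothing here is from the posted
patch):

/*
 * Illustration only: interleave a shared memory segment over a
 * user-designated subset of NUMA nodes, e.g. node_list = "0,2".
 */
#include <numa.h>
#include <stddef.h>

static void
interleave_on_nodes_sketch(void *base, size_t size, const char *node_list)
{
	struct bitmask *nodes;

	if (numa_available() == -1)
		return;

	nodes = numa_parse_nodestring(node_list);
	if (nodes == NULL)
		return;				/* invalid node list: leave the default policy */

	numa_interleave_memory(base, size, nodes);
	numa_bitmask_free(nodes);
}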

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NUMA shared memory interleaving

2025-04-16 Thread Thomas Munro
On Thu, Apr 17, 2025 at 1:58 AM Thomas Munro  wrote:
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, and when that's not possible (more workers than
> batches, or finishing their existing work at different times and going
> to help others), they just proceed in round-robin order.  A beginner
> thought is: if you're going to help someone working on a hash table,
> it would surely be best to have the CPUs and all the data on the same
> NUMA node.  During loading, cache line ping pong would be cheaper, and
> during probing, it *might* be easier to tune explicit memory prefetch
> timing that way as it would look more like a single node system with a
> fixed latency, IDK (I've shared patches for prefetching before that
> showed pretty decent speedups, and the lack of that feature is
> probably a bigger problem than any of this stuff, who knows...).
> Another beginner thought is that the DSA allocator is a source of
> contention during loading: the dumbest problem is that the chunks are
> just too small, but it might also be interesting to look into per-node
> pools.  Or something.   IDK, just some thoughts...

And BTW there are papers about that (but they mostly just remind me
that I have to reboot the prefetching patch long before that...), for
example:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjoins/lang-imdm2013.pdf




Re: Built-in Raft replication

2025-04-16 Thread Konstantin Osipov
* Alastair Turner  [25/04/16 15:58]:

> > > If you use built-in failover you have to resort to 3 big Postgres
> > > machines because you need a 2/3 majority. Of course, you can install a
> > > MySQL-style arbiter - a host that has no real PGDATA and only participates
> > > in voting. But this is a solution to a problem induced by built-in autofailover.
> >
> > Users find it a waste of resources to deploy 3 big PostgreSQL
> > instances just for HA where 2 suffice even if they deploy 3
> > lightweight DCS instances. Having only some of the nodes act as DCS
> > and others purely PostgreSQL nodes will reduce waste of resources.
> >
> > The experience of other projects/products with automated failover based on
> quorum shows that this is a critical issue for adoption. In the In-memory
> Data Grid space (Coherence, Geode/GemFire) the question of how to ensure
> that some nodes didn't carry any data comes up early in many architecture
> discussions. When RabbitMQ shipped their Quorum Queues feature, the first
> and hardest area of pushback was around all nodes hosting message content.
> 
> It's not just about the requirement for compute resources, it's also about
> bandwidth and latency. Many large organisations have, for historical
> reasons, pairs of data centres with very good point-to-point connectivity.
> As the requirement for quorum witnesses has come up for all sorts of
> things, including storage arrays, they have built arbiter/witness sites at
> branches, colocation providers or even on the public cloud. More than not
> holding user data or processing queries, the arbiter can't even be sent the
> replication stream for the user data in the database, it just won't fit
> down the pipe.
> 
> Which feels like a very difficult requirement to meet if the replication
> model for all data is being changed to a quorum model.

I agree master/replica deployment layouts are very popular and are
not going to directly benefit from raft. They'll still work, but no
automation will be available, just like today with Patroni.

However, if the storage cost is an argument, then the logical path is to
disaggregate storage/compute altogether, i.e. use projects like
neon.

-- 
Konstantin Osipov, Moscow, Russia




Re: Built-in Raft replication

2025-04-16 Thread Yura Sokolov
16.04.2025 07:58, Andrey Borodin wrote:
> Yes, shared DCS are common these days. AFAIK, we use one Zookeeper instance 
> per hundred Postgres clusters to coordinate pg_consuls.
> 
> Actually, scalability is the opposite of the topic of this thread. Let me explain.
> Currently, Postgres automatic failover tools rely on databases with built-in 
> automatic failover. Konstantin is proposing to shorten this loop and make 
> Postgres use its built-in automatic failover.
> 
> So, existing tooling allows you to have 3 hosts for DCS, with a majority of 2 
> hosts able to elect a new leader in case of failover.
> And you can have only 2 hosts for Postgres - Primary and Standby. You can 
> have 2 big Postgres machines with 64 CPUs. And 3 one-CPU hosts for 
> Zookeeper/etcd.
> 
> If you use built-in failover you have to resort to 3 big Postgres machines 
> because you need a 2/3 majority. Of course, you can install a MySQL-style arbiter 
> - a host that has no real PGDATA and only participates in voting. But this is a 
> solution to a problem induced by built-in autofailover.

An arbiter can store WAL without (almost) any data. Then it is not only for
voting, but also adds reliability as an almost fully featured third server.

Certainly, it may only ever become a "read-only master" - it just replicates
the tail of WAL it has and commits it by writing a commit record in the new
term/timeline. Then it should hand leadership over to another replica immediately.

This idea is not a fantasy. BiHA does it.

-- 
regards
Yura Sokolov aka funny-falcon




Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-16 Thread Tom Lane
Andrei Lepikhov  writes:
> I just want to understand how your idea will work. The query_planner 
> does the job for subqueries separately. If a query is transformed in 
> some way (let's say, an unnecessary join is deleted), we need to change 
> references in the parse tree of another subquery, or it will not find 
> the reference at the moment of planning, right?

Don't see why.  If we're separately planning a subquery, we would
not dare to change anything about its outputs, just as we would
not make a change that affects the topmost level's outputs.  I don't
believe there's anything right now that requires a recursive
subquery_planner call to change the outer parsetree, and this idea
wouldn't affect that.

Now, subquery_planner does have side effects on the PlannerGlobal
struct, but that's planner-local data, not an input to the planner.

Maybe we would like to have some enforced contract about what
subquery_planner can and can't touch in the outer planner level's
data, but I'm not feeling a great need for that right now.

regards, tom lane




Re: pgsql: Non text modes for pg_dumpall, correspondingly change pg_restore

2025-04-16 Thread Andrew Dunstan



On 2025-04-14 Mo 8:20 AM, Álvaro Herrera wrote:
> On 2025-Apr-04, Andrew Dunstan wrote:
>
>> Non text modes for pg_dumpall, correspondingly change pg_restore
>
> I think the new oid_string_list stuff in this commit is unnecessary, and
> we can remove a bunch of lines by simplifying that to using
> SimplePtrList, as the attached illustrates.  (Perhaps the names of
> members of the proposed struct can be improved.)
>
> I also propose to remove the open-coded implementation of pg_get_line.
>
> WDYT?
>
> I'm also not sure about the double sscanf() business there ... There
> must be a better way to do this.


Pushed your patch plus cleanup of the sscanf stuff.

Thanks!


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: NUMA shared memory interleaving

2025-04-16 Thread Thomas Munro
On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
 wrote:
> 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> expert on DSA/DSM at all)

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order.  A beginner
thought is: if you're going to help someone working on a hash table,
it would surely be best to have the CPUs and all the data on the same
NUMA node.  During loading, cache line ping pong would be cheaper, and
during probing, it *might* be easier to tune explicit memory prefetch
timing that way as it would look more like a single node system with a
fixed latency, IDK (I've shared patches for prefetching before that
showed pretty decent speedups, and the lack of that feature is
probably a bigger problem than any of this stuff, who knows...).
Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools.  Or something.   IDK, just some thoughts...




Re: Typos in the comment for the estimate_multivariate_ndistinct()

2025-04-16 Thread Tender Wang
Tender Wang wrote on Mon, Apr 14, 2025 at 21:34:

> Hi,
>
> While reading the estimate_multivariate_ndistinct(),
>  I think "If a match it found, *varinfos is
>  * updated to remove the list of matched varinfos"
> should be "If a match is found, *varinfos is
>  * updated to remove the list of matched varinfos"
> I've attached a patch for that.
>

Hi Alvaro,

Can you help me double-check this?

-- 
Thanks,
Tender Wang


Re: pg_dump: Fix dangling pointer in EndCompressorZstd()

2025-04-16 Thread Daniel Gustafsson
> On 16 Apr 2025, at 13:48, Ashutosh Bapat  wrote:
> 
> On Wed, Apr 16, 2025 at 1:57 PM Alexander Kuznetsov
>  wrote:
>> 
>> Hello everyone,
>> 
>> We've found that EndCompressorZstd() doesn't set cs->private_data to NULL 
>> after pg_free(),
>> unlike other EndCompressor implementations.
>> While this doesn't currently cause issues (as the pointer soon gets 
>> reassigned),
>> we recommend fixing this to maintain consistency with other implementations 
>> and prevent potential future issues.
>> 
>> The patch is attached, would appreciate your thoughts on this change.
> 
> Thanks for the patch.
> 
> The next thing that happens in EndCompressor() is freeing cs itself.
> So cs->private_data is not used anywhere, so no harm in the current
> code. But it's better to set to NULL since EndCompressorZstd()
> wouldn't know how it's being accessed after returning from there. The
> other implementation of CompressionState::end() EndCompressorGzip()
> calls DeflateCompressorEnd() which also sets cs->private_data
> explicitly. So should EndCompressorZstd().

Agreed; while it's perfectly safe today, the end method should not make
assumptions about the use of the private_data pointer upon return and should
leave it set to NULL.
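For readers skimming the thread, the change under discussion boils down to
the pattern below; this is a self-contained illustration rather than the
pg_dump code or the actual patch:

#include <stdlib.h>

struct compressor_state
{
	void	   *private_data;	/* stands in for CompressorState's private state */
};

static void
end_compressor(struct compressor_state *cs)
{
	/* ... flush/close whatever private_data points to ... */
	free(cs->private_data);
	cs->private_data = NULL;	/* the proposed one-liner: no dangling pointer left behind */
}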

--
Daniel Gustafsson





Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-16 Thread torikoshia

On 2025-04-16 06:18, Daniel Gustafsson wrote:
>> On 15 Apr 2025, at 23:03, David Rowley  wrote:
>>
>> My vote is to make the levels 1-based in all locations where we output
>> the context information.
>
> I agree with this, pg_get_process_memory_contexts() also uses 1-based
> levels fwiw.

+1.
I believe there's no particular issue with starting the level from 1 in
pg_log_backend_memory_contexts().

Regarding the implementation:
In the initial patch attached, I naïvely incremented the level just
before emitting the log line.
However, it might be cleaner to simply initialize the level variable to
1 from the start. This could help avoid unnecessary confusion when
debugging that part of the code.

Similarly, I noticed that in pg_get_process_memory_contexts(), the level
is initialized to 0 in ProcessGetMemoryContextInterrupt(void):

int level = 0;
..
MemoryContextStatsInternal(c, level, 100, 100, &grand_totals, ..

If we want to be consistent, perhaps it would make sense to start from 1
there as well.

BTW, the level variable has existed since before pg_backend_memory_contexts
was introduced — it was originally used for functions that help inspect
memory contexts via the debugger. Because of that, I think changing this
would affect not only these functions' code but some older ones as well.



--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.




Re: transforms [was Re: FmgrInfo allocation patterns (and PL handling as staged programming)]

2025-04-16 Thread Chapman Flack
On 04/15/25 23:39, Tom Lane wrote:
> My own beef with the whole setup is that you can't specify *which*
> arguments or results you want transformed.  I don't believe that
> this should have been keyed off data types to begin with.  But
> that design was the SQL committee's choice so we're stuck ...

Yes, per-argument declarations would be more flexible ... but more verbose.
They were discussed in the long email thread Pavel linked, and I suppose
the SQL committee may have had its own discussions about the same tradeoff.

One area where we do depart from the standard is we don't have named
transform groups. I had wondered about that before, and supposed that
maybe our feature was designed from an older edition that didn't have
them yet. But they also got mentioned in that email thread, so they were
definitely in the standard already.

I do think our lack of named groups is unfortunate. It may be that Perl
and Python each have exactly one standard type that's exactly right for
an hstore, and Python exactly one representation you would ever want to
use for an ltree, but I don't think that generalizes. There are PostgreSQL
types you might map, with varying fidelity, into half a dozen different
Java standard library types, for example.

The standard also has GUC-like SET DEFAULT TRANSFORM GROUP and
SET TRANSFORM GROUP FOR TYPE session-wide settings that could help
reduce verbosity. (It doesn't quite come out and say that these session
values apply at function-creation time to types for which the CREATE
FUNCTION specifies no transform explicitly, but it does say that default
transforms will be supplied for those, with the defaults being
implementation-defined. So defaulting to session-wide settings would
not be wrong. The defaults do, of course, become part of the function
declaration at the time it's created, so the function doesn't fall over
later if someone has used SET DEFAULT TRANSFORM GROUP.)

Those features, naturally, would again require recording actual
transformation oids in pg_proc and not just types.

Maybe for per-argument flexibility we could have the transform spec
in CREATE FUNCTION accept FOR PARAMETER FOO as well as FOR TYPE BAR,
and then lobby the SQL committee not to adopt some conflicting variation
of it.

Regards,
-Chap




Re: Performance issues with v18 SQL-language-function changes

2025-04-16 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Apr 14, 2025 at 10:38:29AM -0400, Robert Haas wrote:
>> I agree that we should do something about this. I haven't reviewed
>> your patches but the approach sounds broadly reasonable.

> Yep, we went down the road in PG 18 to convert syntax, and now we have
> to fix this, or we have to revert all the PG 18 syntax changes, which
> seems like a step backward.

I'm confused?  0dca5d68d didn't have anything to do with
syntax changes, just with when planning happens.

regards, tom lane




Re: pgsql: Non text modes for pg_dumpall, correspondingly change pg_restore

2025-04-16 Thread Álvaro Herrera
On 2025-Apr-16, Andrew Dunstan wrote:

> Pushed your patch plus cleanup of the sscanf stuff.

Thank you!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El sabio habla porque tiene algo que decir;
el tonto, porque tiene que decir algo" (Platon).




Re: not null constraints, again

2025-04-16 Thread Tom Lane
Alvaro Herrera  writes:
> Here's another version where I do skip searching for children twice, and
> rewrote the comments.

v2 LGTM, with two small nits:

1. Grammar feels shaky here:

+ * normal case where we're asked to recurse, this routine ensures that the
+ * not-null constraints either exist already, or queues a requirement for them
+ * to be created by phase 2.

The "either" seems to apply to "ensures" versus "queues", but it's in
the wrong place for that.  Maybe something like

+ * normal case where we're asked to recurse, this routine checks if the
+ * not-null constraints exist already, and if not queues a requirement for
+ * them to be created by phase 2.

2. Stupider compilers are likely to whine about the "children"
variable possibly being used uninitialized.  Suggest initializing
it to NIL.

regards, tom lane




Re: Built-in Raft replication

2025-04-16 Thread Konstantin Osipov
* Greg Sabino Mullane  [25/04/16 22:33]:
> > Users find it a waste of resources to deploy 3 big PostgreSQL instances
> > just for HA where 2 suffice even if they deploy 3 lightweight DCS
> > instances. Having only some of the nodes act as DCS and others purely
> > PostgreSQL nodes will reduce waste of resources.
> >
> 
> A big problem is that putting your DCS into Postgres means your whole
> system is now super-sensitive to IO/WAL-streaming issues, and a busy
> database doing database stuff is going to start affecting the DCS stuff.

Affecting in what way? Do you have a scenario in mind where an
external state provider would act differently (better)? 

> With three lightweight DCS servers, you don't really need to worry about
> how stressed the database servers are. In that way, I feel etcd et al. are
> adhering to the unix philosophy of "do one thing, and do it well."

-- 
Konstantin Osipov, Moscow, Russia




Re: pg_dump --if-exists --clean when drop index that is partition of a partitioned index

2025-04-16 Thread Tom Lane
jian he  writes:
> On Wed, Apr 16, 2025 at 6:51 AM Tom Lane  wrote:
>> Or we could do what Jian suggested and just not emit any dropStmt
>> for child indexes.  I continue to fear that that will have
>> undesirable side-effects, but I have to admit that I'm not sure
>> what.  The fact that the backend will automatically drop the
>> child index if we drop either its parent index or its table
>> seems to cover a lot of the edge cases ... but does it cover
>> all of them?

> If pg_dump does not produce "DROP INDEX IF EXISTS child_index;"
> then we are actually testing that dropping the parent index will cascade to the child index.

That's only true if we emit a DROP for the parent index.
What's bothering me is the idea that a selective restore might
not drop either the parent index or the index's table, and yet
expect that this child index went away (and should be restored).
But I don't think we have any restore selectivity modes that
would make it easy to hit that corner case.

However, this asymmetry is not pg_dump's fault; it's being forced
on us by the backend's choice to not allow detaching/dropping an
index once it's been attached to a partitioned index.  So after
sleeping on it, I don't see that we have an alternative that is
better than what you proposed.  I pushed it after doing some work
on the comments.

regards, tom lane




Re: Performance issues with v18 SQL-language-function changes

2025-04-16 Thread Pavel Stehule
Hi

st 16. 4. 2025 v 19:43 odesílatel Tom Lane  napsal:

> Bruce Momjian  writes:
> > On Mon, Apr 14, 2025 at 10:38:29AM -0400, Robert Haas wrote:
> >> I agree that we should do something about this. I haven't reviewed
> >> your patches but the approach sounds broadly reasonable.
>
> > Yep, we went down the road in PG 18 to convert syntax, and now we have
> > to fix this, or we have to revert all the PG 18 syntax changes, which
> > seems like a step backward.
>
> I'm confused?  0dca5d68d didn't have anything to do with
> syntax changes, just with when planning happens.
>

yes, and for most cases it has significant performance benefits.  For a few
corner cases, there can be some slowdown, which can be fixed by Tom's latest
patches.

Regards

Pavel


> regards, tom lane
>


Re: recoveryCheck test failure on flaviventris

2025-04-16 Thread Alexander Lakhin

Dear Kuroda-san,

16.04.2025 11:12, Hayato Kuroda (Fujitsu) wrote:

Dear Richard, Alexander,

FYI - I felt there is the same type of failure [1] for REL_17_STABLE, so I
added it to the wiki page.
Feel free to revert it if I did something wrong.

[1]:https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prion&dt=2025-04-15%2016%3A16%3A06&stg=recovery-check


Thank you for adding that! (I'm a bit slow on analyzing new failures, but
I hope I could catch up on the weekend.)
I've just reformatted the record to match my script's [1] expectations.

[1] 
https://www.postgresql.org/message-id/b62f97b1-bbff-42cd-861f-c785f0774ab8%40gmail.com

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: pg_dump: Fix dangling pointer in EndCompressorZstd()

2025-04-16 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Apr 16, 2025 at 04:19:02PM +0200, Daniel Gustafsson wrote:
>> Agreed, while it's perfectly safe today the end method should not make
>> assumptions about the use of the private_data pointer upon return and should
>> leave it set to NULL.

> Indeed.  I was just looking at applying what Alexander has sent
> because EndCompressorZstd() not doing what the other methods do
> makes no sense.  Perhaps you are already on it, Daniel?

I think the actual reason for the difference is that the methods that
are taking care to zero the pointer do so because they test the
pointer themselves.  For instance in EndCompressorGzip, the test is
needed because perhaps no data was sent so the struct never got made.
It incidentally offers protection against a double call of that
function, but I don't think that was the intended reason.

I don't have any big objection to zeroing the pointer in
EndCompressorZstd, but I think the claim that it's precisely
analogous to the other EndCompressor methods is faulty,
because it has no similar test.

regards, tom lane




Re: jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Jacob Champion
On Wed, Apr 16, 2025 at 4:04 PM Tom Lane  wrote:
> Looking through all of the callers of freeJsonLexContext, quite
> a lot of them use local JsonLexContext structs, and probably some
> of them are more performance-critical than these.  So that raises
> the question of why are we seeing warnings for only these call
> sites?

Yeah, I had the same question...

> Maybe there is a more elegant way to suppress them.

Can we brute-force ignore this particular warning site with a #pragma
(suggested in [1])?
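For illustration, the brute-force variant could look roughly like this,
assuming the diagnostic involved is GCC's -Wfree-nonheap-object (an
assumption on my part, not something stated upthread), wrapped around just
the reported call site:

/* Hypothetical call site -- suppress the warning only for the line LTO points at. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfree-nonheap-object"
	freeJsonLexContext(&lex);
#pragma GCC diagnostic pop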

--Jacob

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98753




Re: pipelining in psql, commit 41625ab

2025-04-16 Thread Noah Misch
On Wed, Apr 16, 2025 at 09:46:42AM -0700, Michael Paquier wrote:
> On Tue, Apr 15, 2025 at 02:34:50PM -0700, Noah Misch wrote:
> > On Fri, Feb 21, 2025 at 11:33:41AM +0900, Michael Paquier wrote:
> > commit 41625ab wrote:
> >> --- a/src/bin/psql/help.c
> >> +++ b/src/bin/psql/help.c
> >> @@ -167,15 +167,22 @@ slashUsage(unsigned short int pager)
> >>HELP0("  \\close STMT_NAME   close an existing prepared 
> >> statement\n");
> >>HELP0("  \\copyright show PostgreSQL usage and distribution 
> >> terms\n");
> >>HELP0("  \\crosstabview [COLUMNS] execute query and display result in 
> >> crosstab\n");
> >> +  HELP0("  \\endpipeline   exit pipeline mode\n");
> >>HELP0("  \\errverboseshow most recent error message at 
> >> maximum verbosity\n");
> >> +  HELP0("  \\flush push unsent data to the server\n");
> >> +  HELP0("  \\flushrequest  send a flushrequest command\n");
> > 
> > protocol.sgml calls it a "Flush command".
> 
> For \flushrequest, how about "send a request to the server to flush
> its output buffer" and for \flush "flush output data to the server",
> mapping more closely to the libpq descriptions?

Works for me.




Re: jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 16 Apr 2025, at 23:42, Tom Lane  wrote:
>> I'm not sure
>> how other than giving up on stack allocation of JsonLexContexts,
>> though, especially if we consider the jsonapi API frozen.  But seeing
>> that there are only three such call sites and none of them seem in the
>> least performance-critical, maybe we should just do that?

> I can't see any other option really, and there is no performance angle really
> so that should be safe.  Since I committed at least one of these, let me know
> if you want me to tackle it.

The only alternative I can see that might stop the warning is if we
can find a way to make it clearer to the optimizer that the FREE()
isn't reached.  But I'm not sure about a trustworthy way to make that
happen.  Maybe it'd work to change the signature of freeJsonLexContext
(or perhaps better, add a separate entry point) so that the caller is
passing a bool constant that controls whether to free the struct.
We could have an Assert that compares that to the state of the
JSONLEX_FREE_STRUCT flag to catch mistakes.  This seems kind of messy
though.
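A self-contained sketch of that idea, with all names and the stand-in type
invented here rather than taken from the jsonapi headers:

#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define JSONLEX_FREE_STRUCT		(1 << 0)	/* stand-in for the real flag */

typedef struct LexSketch
{
	int			flags;
	char	   *strval;			/* one owned buffer, standing in for the rest */
} LexSketch;

/*
 * The caller passes a compile-time constant saying whether the struct itself
 * is heap-allocated; for stack-allocated contexts the compiler can then see
 * that free(lex) is unreachable, which is the point of the exercise.
 */
static inline void
free_lex_sketch(LexSketch *lex, bool free_struct)
{
	/* catch callers whose constant disagrees with how the lexer was set up */
	assert(free_struct == ((lex->flags & JSONLEX_FREE_STRUCT) != 0));

	free(lex->strval);
	if (free_struct)
		free(lex);
}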

regards, tom lane




Re: Skipping schema changes in publication

2025-04-16 Thread Amit Kapila
On Wed, Apr 16, 2025 at 8:22 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thu, Apr 10, 2025 at 7:25 PM Amit Kapila wrote:
> >
> > On Tue, Jan 9, 2024 at 12:02 PM vignesh C  wrote:
> > >
> > > As I did not see much interest from others, I'm withdrawing this patch
> > > for now. But if there is any interest others in future, I would be
> > > more than happy to work on this feature.
> > >
> >
> > Just FYI, I noticed a use case for this patch in email [1]. Users would 
> > like to
> > replicate all except a few columns having sensitive information. The 
> > challenge
> > with current column list features is that adding new tables to columns would
> > lead users to change the respective publications as well.
> >
> > [1] -
> > https://www.postgresql.org/message-id/tencent_DCDF626FCD4A556C51BE
> > 270FDC3047540208%40qq.com
>
> BTW, I noticed that debezium, an open source distributed platform for change
> data capture that relies on logical decoding, also supports specifying a
> column exclusion list [1]. So, this indicates that there could be some use
> cases for this feature.
>

Thanks for sharing the link. I see that they support both the include
and exclude lists for columns and tables.

-- 
With Regards,
Amit Kapila.




Re: Make prep_status() message translatable

2025-04-16 Thread Fujii Masao



On 2025/04/07 16:26, Fujii Masao wrote:



On 2025/04/07 15:55, Kyotaro Horiguchi wrote:

Hello,

The recent commit 173c97812ff made the following change:

-    prep_status("Adding \".old\" suffix to old global/pg_control");
+    prep_status("Adding \".old\" suffix to old " XLOG_CONTROL_FILE);

This change results in a message that is untranslatable, at least into
Japanese.  In addition, the file name should be quoted.

The attached patch modifies the message to use %s for XLOG_CONTROL_FILE,
making it properly translatable.


Thanks for the report and patch!

The fix looks good to me.

    pg_log(PG_REPORT, "\n"
           "If you want to start the old cluster, you will need to remove\n"
           "the \".old\" suffix from %s/%s.old.\n"
           "Because \"link\" mode was used, the old cluster cannot be safely\n"
           "started once the new cluster has been started.",
           old_cluster.pgdata, XLOG_CONTROL_FILE);

Commit 173c97812ff also updated the above part of disable_old_cluster()
and replaced the hardcoded "global/pg_control" with %s using XLOG_CONTROL_FILE.
Maybe we should also add a translator: comment and wrap %s/%s.old in
double quotes there?


I've updated the patch to reflect this comment.
Barring any objections, I'm planning to commit this patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From 36c91f4ecf21cdaabbb49d46f3a967e89b4f6714 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Thu, 17 Apr 2025 12:44:15 +0900
Subject: [PATCH v2] Make pg_upgrade log message with control file path
 translatable.

Commit 173c97812ff replaced the hardcoded "global/pg_control" in pg_upgrade
log message with a string literal concatenation of XLOG_CONTROL_FILE macro.
However, this change made the message untranslatable.

This commit fixes the issue by using %s with XLOG_CONTROL_FILE instead of
that literal concatenation, allowing the message to be translated properly.
It also wraps the file path in double quotes for consistency with similar
log messages.

Author: Kyotaro Horiguchi 
Reviewed-by: Masao Fujii 
Discussion: 
https://postgr.es/m/20250407.155546.2129693791769531891.horikyota@gmail.com
---
 src/bin/pg_upgrade/controldata.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index 8b7c349a875..90cef0864de 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -758,7 +758,8 @@ disable_old_cluster(transferMode transfer_mode)
 				new_path[MAXPGPATH];
 
 	/* rename pg_control so old server cannot be accidentally started */
-	prep_status("Adding \".old\" suffix to old " XLOG_CONTROL_FILE);
+	/* translator: %s is the file path of the control file */
+	prep_status("Adding \".old\" suffix to old \"%s\"", XLOG_CONTROL_FILE);
 
 	snprintf(old_path, sizeof(old_path), "%s/%s", old_cluster.pgdata, XLOG_CONTROL_FILE);
 	snprintf(new_path, sizeof(new_path), "%s/%s.old", old_cluster.pgdata, XLOG_CONTROL_FILE);
@@ -768,9 +769,10 @@ disable_old_cluster(transferMode transfer_mode)
 	check_ok();
 
 	if (transfer_mode == TRANSFER_MODE_LINK)
+		/* translator: %s/%s is the file path of the control file */
 		pg_log(PG_REPORT, "\n"
 			   "If you want to start the old cluster, you will need to remove\n"
-			   "the \".old\" suffix from %s/%s.old.\n"
+			   "the \".old\" suffix from \"%s/%s.old\".\n"
 			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
 			   "started once the new cluster has been started.",
 			   old_cluster.pgdata, XLOG_CONTROL_FILE);
-- 
2.49.0



Re: use correct variable in error message in _allocAH function (pg_backup_archiver.c)

2025-04-16 Thread Fujii Masao




On 2025/04/16 13:30, Fujii Masao wrote:



On 2025/04/15 23:37, Tom Lane wrote:

Fujii Masao  writes:

Thanks for the report and patch! It looks good to me.


Agreed.


Since this issue exists in the back branches,
the patch needs be back-patched to all supported versions.


I doubt it's worth the trouble and buildfarm cycles to
back-patch, since this should be a can't-happen code path.
Worth fixing in HEAD, yes, but not convinced about doing
more than that.


Yes, that makes sense. I'll apply the fix to the master branch only.


I've pushed the patch only to the master. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2025-04-16 Thread Ajin Cherian
On Thu, Apr 3, 2025 at 11:01 PM Shlok Kyal  wrote:
>
> Hi Ajin,
>
> I have reviewed patch v16-0001, here are my comments:
>
> 1. There are some places where comments are more than 80 columns
> window. I think it should be <=80 as per [1].

Fixed.

>
> 2. I think, 'rb->unfiltered_changes_count' should be initialised in
> the function 'ReorderBufferAllocate'. While debugging I found that the
> initial value of rb->unfiltered_changes_count is 127. I think it
> should be set to '0' inside 'ReorderBufferAllocate'. Am I missing
> something here?
> I have also added the same comment in point 1. in [2].
>

Fixed.



On Sat, Apr 5, 2025 at 12:34 AM Shlok Kyal  wrote:
>
> On Thu, 20 Mar 2025 at 18:09, Ajin Cherian  wrote:
> >
> > On Thu, Mar 13, 2025 at 5:49 PM Ajin Cherian  wrote:
> > >
> > > Moving this patch to the next CF as this patch needs more design level
> > > inputs which may not be feasible in this CF but do continue to review
> > > the patch.
> > >
> > > regards,
> > > Ajin Cherian
> > > Fujitsu Australia
> >
> > Rebased the patch as it no longer applied.
> >
>
> Hi Ajin,
>
> I reviewed v16-0002 patch, here are my comments:
>
> 1. Guc CHANGES_THRESHOLD_FOR_FILTER was introduced in 0001 patch,
> should this change be in 0001 patch?
>

Fixed.

> - * temporarily suspended. Filtering resumes after processing every 100 
> changes.
> + * temporarily suspended. Filtering resumes after processing
> CHANGES_THRESHOLD_FOR_FILTER
> + * changes.
>
> 2. Following comment is repeated inside DecodeInsert, DecodeUpdate,
> DecodeDelete, DecodeMultiInsert, DecodeSpecConfirm
> + /*
> + * When filtering changes, determine if the relation associated with the 
> change
> + * can be skipped. This could be because the relation is unlogged or because
> + * the plugin has opted to exclude this relation from decoding.
> + */
> + if (SkipThisChange(ctx, buf->origptr, XLogRecGetXid(r), &target_locator,
>
> Since this comment is the same, should we remove it from the above
> functions and add this where function 'SkipThisChange' is defined?
>

reworded this.

On Thu, Apr 10, 2025 at 2:26 PM Peter Smith  wrote:
>
> Hi Ajin.
>
> Some review comments for patch v16-0001.
>
> ==
> Commit message
>
> 1.
> A hash cache of relation file locators is implemented to cache whether a
> relation is filterable or not. This ensures that we only need to retrieve
>
> ~
>
> /hash cache/hash table/
>
> (AFAICT you called this a hash table elsewhere in all code comments)

Fixed.

>
> ~~~
>
> 2.
> Additionally, filtering is paused for transactions containing WAL records
> (INTERNAL_SNAPSHOT, COMMAND_ID, or INVALIDATION) that modify the historical
> snapshot constructed during logical decoding. This pause is necessary because
> constructing a correct historical snapshot for searching publication
> information requires processing these WAL records.
>
> ~
>
> IIUC, the "pause" here is really referring to the 100 changes
> following an unfilterable txn. So, I don't think what you are
> describing here is a "pause" -- it is just another reason for the tx
> to be marked unfilterable, and the pause logic will take care of
> itself. So, maybe it should be worded more like
>
> SUGGESTION
> Additionally, transactions containing WAL records (INTERNAL_SNAPSHOT,
> COMMAND_ID, or INVALIDATION) that modify the historical
> snapshot constructed during logical decoding are deemed unfilterable.
> This is necessary because constructing a correct historical snapshot
> for searching publication information requires processing these WAL
> records.
>

Fixed.

> ==
> src/backend/replication/logical/reorderbuffer.c
>
> 3. Header comment.
>
> If you change the commit message based on previous suggestions, then
> change the comment here also to match it.
>
> ~~~
>
> 4.
> +/*
> + * After encountering a change that cannot be filtered out, filtering is
> + * temporarily suspended. Filtering resumes after processing every 100 
> changes.
> + * This strategy helps to minimize the overhead of performing a hash table
> + * search for each record, especially when most changes are not filterable.
> + */
> +#define CHANGES_THRESHOLD_FOR_FILTER 100
>
> /every 100/the next 100/
>
> ~~~
>
> +ReorderBufferFilterByRelFileLocator:
>
> 5.
> + * if we are decoding a transaction without these records. See comment on top
> + * of GetDecodableRelation() to see list of relations that are not
> + * decoded by the reorderbuffer.
>
> /to see list of relations/to see the kinds of relation/
>

Fixed.

On Fri, Apr 11, 2025 at 1:15 PM Peter Smith  wrote:
>
> Hi Ajin,
>
> Here is another general review comment for patch 0001.
>
> ~~~
>
> I keep getting confused by the distinction between the 2 member fields
> that are often found working hand-in-hand:
> entry->filterable
> rb->can_filter_change
>
> Unfortunately, because the names are very similar I keep blurring the
> meanings, and then nothing makes sense.
>
> IIUC, the meanings are actually like:
>
> entry->filterable. This means filtering is *possib

Re: Add partial :-variable expansion to psql \copy

2025-04-16 Thread Fabien Coelho

Hello,

Here is a rebased and updated version of adding \gi:

- improved documentation
- some refactoring to share one function
- signal trap disabled on \gi as well

The thread subject is now a misnomer, I'm not sure whether I should 
update it or start a new thread.


On 05/04/2025 22:14, Fabien Coelho wrote:
> Hello Tom and Corey,
>
>> [...] Anyway, my feeling about it is that \copy parsing is a huge hack
>> right now, and I'd rather see it become less of a hack, that is
>> more like other psql commands, instead of getting even hackier.
>
> After giving it some thought, I concluded that trying to salvage \copy
> is not the way to go, and I have followed Corey's suggestion to extend
> the standard SQL COPY handling by providing an alternate input stream
> with "\gi file" or "\gi cmd|".
>
> This has led to significant restructuring so as to simplify \copy
> handling to use the existing \g and \gi infrastructure. I've moved
> checks performed only for \copy so that they are now also done with \g,
> and error messages are shown more systematically. The pipe char used to
> mark a command instead of a file is switched to a boolean, which is more
> consistent with other places and with how it can be managed with
> "\gi command|", as the pipe char is at the end instead of the start.
> The patch also includes tests and some doc.
>
> If this is accepted, ISTM that \copy could be retired and even removed
> at a later stage, which would elegantly solve its debatable implementation.


--
Fabien.
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index f7c8bc16a7f..8f2506517db 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1172,10 +1172,13 @@ SELECT $1 \parse stmt1
 
 
 Another way to obtain the same result as \copy
-... to is to use the SQL COPY
-... TO STDOUT command and terminate it
-with \g filename
-or \g |program.
+... to or from is to use the SQL
+COPY ... TO STDOUT or FROM STDIN
+command and terminate it with either
+\g filename
+or \g |program for output
+and \gi filename
+or \gi program| for input.
 Unlike \copy, this method allows the command to
 span multiple lines; also, variable interpolation and backquote
 expansion can be used.
@@ -1188,7 +1191,7 @@ SELECT $1 \parse stmt1
 COPY command with a file or program data source or
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
-command might be preferable.
+command might be preferable if data are available on the server.
 
 
 
@@ -2558,6 +2561,28 @@ CREATE INDEX
   
 
 
+  
+\gi file
+\gi command|
+
+
+Sends the current query buffer to the server and uses the provided
+file contents or command
+output as input.
+This should only apply to SQL
+COPY
+which seeks an input when used with FROM STDIN, and
+will simply result in the command simple execution for other commands
+which do not need an input stream.
+
+
+This approach should be prefered to using \copy
+as it achieves the same result but can span several lines and
+is subject to variable interpolation and backquote expansion.
+
+
+  
+
   
 \gset [ prefix ]
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index a8a13c2b88b..fd10e458eb1 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -107,6 +107,7 @@ static backslashResult exec_command_getenv(PsqlScanState scan_state, bool active
 		   const char *cmd);
 static backslashResult exec_command_gexec(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_getresults(PsqlScanState scan_state, bool active_branch);
+static backslashResult exec_command_gi(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_gset(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_help(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_html(PsqlScanState scan_state, bool active_branch);
@@ -380,6 +381,8 @@ exec_command(const char *cmd,
 		status = exec_command_getresults(scan_state, active_branch);
 	else if (strcmp(cmd, "gexec") == 0)
 		status = exec_command_gexec(scan_state, active_branch);
+	else if (strcmp(cmd, "gi") == 0)
+		status = exec_command_gi(scan_state, active_branch);
 	else if (strcmp(cmd, "gset") == 0)
 		status = exec_command_gset(scan_state, active_branch);
 	else if (strcmp(cmd, "h") == 0 || strcmp(cmd, "help") == 0)
@@ -1750,7 +1753,8 @@ exec_command_g(PsqlScanState scan_state, bool active_branch, const char *cmd)
 		else
 		{
 			expand_tilde(&fname);
-			pset.gfname = pg_strdup(fname);
+			pset.g_pipe = fname[0] == '|';
+			pset.gfname 

Re: Built-in Raft replication

2025-04-16 Thread Michael Banck
Hi,

On Wed, Apr 16, 2025 at 10:24:48AM +0500, Andrey Borodin wrote:
> I think I can provide some reasons why it cannot be neither extension,
> nor any part running within postmaster reign.
> 
> 1. When joining cluster, there’s not PGDATA to run postmaster on top
> of it.
> 
> 2. After failover, old Primary node must rejoin cluster by running
> pg_rewind and following timeline switch.
> 
> The system in hand must be able to manipulate with PGDATA without
> starting Postgres.

Yeah, while you could maybe implement some/all of the RAFT protocol in
an extension, actually building something useful on top with regards to
high availability or distributed whatever does not look feasible.
 
> My question to Konstantin is Why wouldn’t you just add Raft to
> Patroni?

Patroni can use pysyncobj, which is a Python implementation of RAFT, so
then you do not need an external RAFT provider like etcd, consul or
zookeeper. However, it is deemed deprecated by the Patroni authors due
to being difficult to debug when it breaks.

I guess a better Python implementation of RAFT for Patroni to use or
Patroni to implement it itself would help, but I believe nobody is
working on the latter right now, nor has any plans to do so. And there
also does not seem to be anybody working on a better pysyncobj.

> Is there a reason why something like Patroni is not in core and noone
> rushes to get it in?  Everyone is using it, or system like it.

Well, Patroni is written in Python, for starters.  It also does a lot
more than just leader election / cluster config. So I think nobody
seriously thought about proposing to put Patroni into core so far.

I guess the current proposal tries to be a step into the "something like
Patroni in core" if you tilt your head a little. It's just that the
whole thing would be a really big step for Postgres, maybe similar to
deciding we want in-core replication way back when.


Michael




POC: Parallel processing of indexes in autovacuum

2025-04-16 Thread Maxim Orlov
Hi!

The VACUUM command can be executed with the parallel option. As
documentation states, it will perform index vacuum and index cleanup phases
of VACUUM in parallel using *integer* background workers. But such an
interesting feature is not used for autovacuum. After a quick look at
the source code, it became clear to me that when the parallel option was
added, the corresponding option for autovacuum wasn't implemented, although
there are no clear obstacles to this.

Actually, one of our customers stepped into a problem with autovacuum on a
table with many indexes and relatively long transactions. Of course, long
transactions are an ultimate evil and the problem can be solved by running
vacuum from a cron task, but, I think, we can do better.

Anyhow, what about adding a parallel option for autovacuum? Here is a POC
patch for the proposed functionality. For the sake of simplicity, several
GUCs have been added. It would be good to think through the parallel
launch condition without them.

As always, any thoughts and opinions are very welcome!

-- 
Best regards,
Maxim Orlov.


WIP-Allow-autovacuum-to-process-indexes-of-single-table.patch
Description: Binary data


Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-16 Thread Andrei Lepikhov

On 4/15/25 19:34, Tom Lane wrote:

Andrei Lepikhov  writes:

But what is the way you are proposing here? Do you mean that one more
entity will be explicitly introduced: a transformed parse tree?


No, I wasn't thinking of adding new concepts, just saying that the
inputs to certain operations need to be treated as read-only.
What would you envision a "transformed parse tree" being that's not
what we have today?
I just want to understand how your idea will work. The query_planner 
does the job for subqueries separately. If a query is transformed in 
some way (let's say, an unnecessary join is deleted), we need to change 
references in the parse tree of another subquery, or it will not find 
the reference at the moment of planning, right?


--
regards, Andrei Lepikhov




pg_dump: Fix dangling pointer in EndCompressorZstd()

2025-04-16 Thread Alexander Kuznetsov

Hello everyone,

We've found that EndCompressorZstd() doesn't set cs->private_data to NULL after 
pg_free(),
unlike other EndCompressor implementations.
While this doesn't currently cause issues (as the pointer soon gets reassigned),
we recommend fixing this to maintain consistency with other implementations and 
prevent potential future issues.

The patch is attached, would appreciate your thoughts on this change.

--
Best regards,
Alexander Kuznetsov
From 428c60888f96aa5d0b7575a4342cdce4ff0257ab Mon Sep 17 00:00:00 2001
From: Alexander Kuznetsov 
Date: Wed, 16 Apr 2025 11:19:56 +0300
Subject: [PATCH] pg_dump: Fix dangling pointer in EndCompressorZstd()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

cs->private_data becomes dangling after pg_free() call and should be set to NULL
(consistent with other EndCompressor implementations)

Found by Linux Verification Center (linuxtesting.org) with Svace.
---
 src/bin/pg_dump/compress_zstd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/pg_dump/compress_zstd.c b/src/bin/pg_dump/compress_zstd.c
index 1f7b4942706..cb595b10c2d 100644
--- a/src/bin/pg_dump/compress_zstd.c
+++ b/src/bin/pg_dump/compress_zstd.c
@@ -142,6 +142,7 @@ EndCompressorZstd(ArchiveHandle *AH, CompressorState *cs)
 	/* output buffer may be allocated in either mode */
 	pg_free(zstdcs->output.dst);
 	pg_free(zstdcs);
+	cs->private_data = NULL;
 }
 
 static void
-- 
2.42.4



Re: pg_dump: Fix dangling pointer in EndCompressorZstd()

2025-04-16 Thread Ashutosh Bapat
On Wed, Apr 16, 2025 at 1:57 PM Alexander Kuznetsov
 wrote:
>
> Hello everyone,
>
> We've found that EndCompressorZstd() doesn't set cs->private_data to NULL 
> after pg_free(),
> unlike other EndCompressor implementations.
> While this doesn't currently cause issues (as the pointer soon gets 
> reassigned),
> we recommend fixing this to maintain consistency with other implementations 
> and prevent potential future issues.
>
> The patch is attached, would appreciate your thoughts on this change.

Thanks for the patch.

The next thing that happens in EndCompressor() is freeing cs itself.
So cs->private_data is not used anywhere, and there is no harm in the
current code. But it's better to set it to NULL, since EndCompressorZstd()
cannot know how cs will be accessed after returning from there. The
other implementation of CompressionState::end(), EndCompressorGzip(),
calls DeflateCompressorEnd(), which does reset cs->private_data
explicitly. So should EndCompressorZstd().
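
To make the recommended pattern concrete, here is a minimal sketch; it reuses the type and field names visible in the patch above, but it is only an illustration, not the exact pg_dump code (the real function also tears down the ZSTD streams and input buffer):

/*
 * Sketch only: an end callback frees its private state and then clears
 * the pointer, so later code cannot follow a dangling reference.
 */
static void
EndCompressorZstdSketch(ArchiveHandle *AH, CompressorState *cs)
{
    ZstdCompressorState *zstdcs = (ZstdCompressorState *) cs->private_data;

    if (zstdcs == NULL)
        return;                     /* nothing was ever allocated */

    pg_free(zstdcs->output.dst);    /* output buffer may be allocated in either mode */
    pg_free(zstdcs);
    cs->private_data = NULL;        /* same idea as DeflateCompressorEnd() for gzip */
}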

--
Best Wishes,
Ashutosh Bapat




Re: recoveryCheck test failure on flaviventris

2025-04-16 Thread Richard Guo
On Wed, Apr 16, 2025 at 1:00 PM Alexander Lakhin  wrote:
> 16.04.2025 06:05, Richard Guo wrote:
> On Wed, Apr 16, 2025 at 11:54 AM Richard Guo  wrote:
> I noticed this recoveryCheck test failure on flaviventris after
> pushing commit 3b35f9a4c (Fix an incorrect check in get_memoize_path).
> ...
> I'm not convinced this failure is related to 3b35f9a4c; it seems
> unlikely that just changing how Memoize paths are created would cause
> it.
>
> Does anyone have any insights into what might be going on?

> Forgot to include the link to the failure.  Here it is:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-04-16%2001%3A58%3A42

> Judging by:
> ### Restarting node "standby"
> # Running: pg_ctl --wait --pgdata 
> /home/bf/bf-build/flaviventris/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata
>  --log 
> /home/bf/bf-build/flaviventris/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log
>  restart
> waiting for server to shut 
> down...
>  failed
> pg_ctl: server does not shut down
>
> I think it's a known issue:
> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_rep_changes.pl_fails_due_to_publisher_stuck_on_shutdown
> and is not related to the Memoize paths.

Got it.  Thanks for the information.

Thanks
Richard




Re: Built-in Raft replication

2025-04-16 Thread Konstantin Osipov
* Tom Lane  [25/04/16 11:05]:
> Nikolay Samokhvalov  writes:
> > This is exactly what I wanted to write as well. The idea is great. At the
> > same time, I think, consensus on many decisions will be extremely hard to
> > reach, so this project has a high risk of being very long. Unless it's an
> > extension, at least in the beginning.
> 
> Yeah.  The two questions you'd have to get past to get this into PG
> core are:
> 
> 1. Why can't it be an extension?  (You claimed it would work more
> seamlessly in core, but I don't think you've made a proven case.)

I think this can be best addressed when the discussion moves on to
an architecture design record, where the UX and implementation
details are outlined. I'm sure there can be a lot of bike-shedding
on that part. For now I merely wanted to know if:
- maybe there is a reason this will never be accepted
- maybe someone is already working on this.

>From the replies I sense that while there is quite a bit of
scepticism about it ever making its way into the trunk, generally
there is no aversion to it. If my understanding is right,
it's a decent start.

> 2. Why depend on Raft rather than some other project?
> 
> Longtime PG developers are going to be particularly hard on point 2,
> because we have a track record now of outliving outside projects
> that we thought we could rely on.  One example here is the Snowball
> stemmer; while its upstream isn't quite dead, it's twitching only
> feebly, and seems to have a bus factor of 1.  Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream.  And then there's libxml2.  And uuid-ossp.
> And autoconf.  And various documentation toolchains.  Need I go on?
> 
> The great advantage of implementing an outside dependency in an
> extension is that if the depended-on project dies, we can say a few
> words of mourning and move on.  It's a lot harder to walk away from
> in-core features.

Raft is an algorithm, not a library. For a quick start the project
could use an existing library - I'd pick TiKV's raft-rs, which
happens to be implemented in Rust, but going forward I'd guess the
community will want to have a plain C implementation. 

There is a plethora of C implementations out there, but in my
somewhat educated opinion none are good enough for PostgreSQL
standards or purposes: ideally the protocol should be fully
isolated from storage and transport and extensively tested,
randomized & injection tests being a priority. Most of the C
implementations I've seen are built by enthusiasts as
self-education projects.
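
To make the "isolated from storage and transport" point concrete, here is a purely hypothetical sketch of what such a C interface could look like; every name below is invented for illustration and nothing like it exists in PostgreSQL today:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: a Raft core that performs no I/O itself.  The
 * embedding code supplies callbacks for durable log storage and message
 * transport, which is what makes the protocol state machine easy to
 * drive from randomized and fault-injection tests.
 */
typedef struct RaftIO
{
    /* durable storage supplied by the host */
    bool        (*append_entries) (void *host, const void *entries, size_t nbytes);
    bool        (*save_hard_state) (void *host, uint64_t term, uint64_t voted_for);

    /* transport supplied by the host */
    void        (*send_message) (void *host, uint64_t to_node,
                                 const void *msg, size_t nbytes);

    void       *host;           /* opaque pointer back to the embedding code */
} RaftIO;

typedef struct RaftNode RaftNode;   /* opaque protocol state machine */

extern RaftNode *raft_create(uint64_t node_id, const RaftIO *io);
extern void raft_receive(RaftNode *node, const void *msg, size_t nbytes);
extern void raft_tick(RaftNode *node);  /* timeouts are driven explicitly, so tests stay deterministic */

Because the core owns no sockets and no files, a test harness can feed it messages in any order, drop or duplicate them, and crash/restart the "storage" at will.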

So at some point the project will need its own Raft
implementation. The good news is that the design of Raft internals
has been fairly well polished in all of the various
implementations in many different programming languages, so 
it should be a fairly straightforward job.

Regarding maintenance, since its first publication back in ~2014
the protocol has stabilized quite a bit. The core of the protocol
doesn't get many changes - I'd say nearly no changes - and that is also
noticeable in implementations: e.g. etcd-raft, raft-rs from TiKV, etc.
don't get many new commits nowadays.

Now a broader question is whether or not Raft is an optimal
long-term solution for log replication. Generally Raft is
leader-based, so in theory it could be replaced with a leader-less
protocol - e.g. FastPaxos, EPaxos, and newer
developments on top of those. To the best of my understanding all 
leader-less algorithms which provide a single round-trip commit cost
require co-designing the transaction and replication
layer - which may be a way more intrusive change than adding raft
on top of the existing synchronous replication in PostgreSQL.

Given that Raft already provides an amortized single-round-trip
commit time, and the goal is simplicity of UX and unification,
I'd say it's wise to wait and see whether the leader-less approaches
mature.

At the end of the day, there is always a trade-off between trying to
do something today and waiting for perfection, but in the case of Raft,
in my personal opinion, the balance is just right.

-- 
Konstantin Osipov, Moscow, Russia




Re: Fundamental scheduling bug in parallel restore of partitioned tables

2025-04-16 Thread Ashutosh Bapat
On Wed, Apr 16, 2025 at 12:19 AM Tom Lane  wrote:
>
> Ashutosh Bapat  writes:
> > On Mon, Apr 14, 2025 at 11:14 PM Tom Lane  wrote:
> >> This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
> >> command might fail outright if we've loaded data for the referencing
> >> table but not the referenced table.
>
> > There's a comment in getConstraints()
> > /*
> > * Restoring an FK that points to a partitioned table requires that
> > * all partition indexes have been attached beforehand. Ensure that
> > * happens by making the constraint depend on each index partition
> > * attach object.
> > */
>
> Ah, that is an excellent point which I missed.  And the INDEX ATTACH
> objects have dependencies on the leaf tables, which *will* get
> repointed to their TABLE DATA objects by repoint_table_dependencies.
> So by the time we are ready to restore the FK CONSTRAINT object,
> we are certain to have loaded all the data of the referenced table.
> But there's nothing delaying the constraint till after the referencing
> table's data is loaded.

Yes.

>
> So at this point we have:
>
> #1: ADD CONSTRAINT failure because of missing referenced data:
> not possible after all.
>

check

> #2: Deadlock between parallel restore jobs: possible in HEAD, but
> it seems likely to be a bug introduced by the not-null-constraint
> work rather than being pg_restore's fault.  We have no evidence
> that such a deadlock can happen in released branches, and the lack
> of field reports suggests that it can't.
>

I agree. I ran that repro several times against v17 and never saw a
deadlock, even after changing the number of partitions, sizes of
partitions, adding more tables etc.

Creating an FK constraint happens either before or after loading the
data in one of the partitions. That might point towards either
constraint creation or data load waiting for the other. But I have not
seen any evidence that a full deadlock chain is possible.

> #3: Restoring the FK constraint before referencing data is loaded:
> this seems to be possible, and it's a performance problem, but
> no more than that.
>

Yes. And once we fix this, there won't be waiting between constraint
creation and data load, so the remote possibility of a deadlock also
vanishes.

> So now I withdraw the suggestion that this patch needs to be
> back-patched.  We may not even need it in v18, if another fix
> for #2 is found.  Fixing #3 would be a desirable thing to do
> in v19, but if that's the only thing at stake then it's not
> something to break feature freeze for.
>
> For the moment I'll mark this CF entry as meant for v19.
> We can resurrect consideration of it for v18 if there's not
> a better way to fix the deadlock problem.

+1.


-- 
Best Wishes,
Ashutosh Bapat




Re: Built-in Raft replication

2025-04-16 Thread Konstantin Osipov
* Ashutosh Bapat  [25/04/16 11:06]:
> > My view is what Konstantin wants is automatic replication topology 
> > management. For some reason this technology is called HA, DCS, Raft, Paxos 
> > and many other scary words. But basically it manages primary_conn_info of 
> > some nodes to provide some fault-tolerance properties. I'd start to design 
> > from here, not from Raft paper.
> >
> In my experience, the load of managing hundreds of replicas which all
> participate in RAFT protocol becomes more than regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replica.

I think this experience needs to be detailed out. There are
implementations in the field that are less efficient than others.

Early etcd-raft didn't have pre-voting and had a "bastardized"
(their own word) implementation of configuration changes
which didn't use joint consensus.

Then there is a liveness issue if leader election is implemented
in a straightforward way in large clusters. But this is addressed by
scaling up the randomized election timeout with the cluster size and
converting most participants to non-voters in large clusters.
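
As an illustration of the timeout-scaling idea only - the formula below is an arbitrary example, not something taken from etcd-raft or any other implementation:

#include <stdint.h>
#include <stdlib.h>

/*
 * Illustration only: widen the randomized election timeout window as the
 * number of voters grows, so large clusters are less likely to produce
 * competing candidates in the same term.
 */
static uint32_t
election_timeout_ms(uint32_t base_ms, uint32_t n_voters)
{
    uint32_t    window = base_ms * (1 + n_voters / 8);  /* grow with cluster size */

    return base_ms + (uint32_t) (rand() % window);
}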

Raft replication, again, if implemented in a naive way, would
require O(outstanding transactions * number of replicas) RAM.
But it doesn't have to be implemented naively.

To sum up, I am not aware of any principal limitations in this
area.

-- 
Konstantin Osipov, Moscow, Russia




Re: not null constraints, again

2025-04-16 Thread Alvaro Herrera
Here's another version where I do skip searching for children twice, and
rewrote the comments.

I also noticed that in child tables we were only looking for
pg_attribute.attnotnull, and not whether the constraints had been
validated or made inheritable.  This seemed a wasted opportunity, so I
refactored the code to instead examine the pg_constraint row and apply
the same checks as for the constraint on the parent (namely, that it's
valid and not NO INHERIT).  We already check for these things downstream
(alter table phase 2, during AdjustNotNullInheritance), but only after
potentially wasting more work, so it makes sense to do it here (alter
table phase 1) given that it's very easy.  I added some tests for these
things also, as those cases weren't covered.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)
>From 710ba9b55d9fcf38464c9d2e00e2946e051425ab Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Tue, 15 Apr 2025 20:38:54 +0200
Subject: [PATCH v2] Fix verification of not-null constraints on children
 during PK creation

---
 src/backend/commands/tablecmds.c  | 145 +-
 src/test/regress/expected/constraints.out |  22 +++-
 src/test/regress/sql/constraints.sql  |  14 +++
 3 files changed, 122 insertions(+), 59 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3ed69457fc..a485f045890 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -540,6 +540,7 @@ static ObjectAddress ATExecDropColumn(List **wqueue, Relation rel, const char *c
 static void ATPrepAddPrimaryKey(List **wqueue, Relation rel, AlterTableCmd *cmd,
 bool recurse, LOCKMODE lockmode,
 AlterTableUtilityContext *context);
+static void verifyNotNullPKCompatible(HeapTuple tuple, const char *colname);
 static ObjectAddress ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
 	IndexStmt *stmt, bool is_rebuild, LOCKMODE lockmode);
 static ObjectAddress ATExecAddStatistics(AlteredTableInfo *tab, Relation rel,
@@ -9438,8 +9439,26 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 }
 
 /*
- * Prepare to add a primary key on table, by adding not-null constraints
+ * Prepare to add a primary key on a table, by adding not-null constraints
  * on all columns.
+ *
+ * The not-null constraints for a primary key must cover the whole inheritance
+ * hierarchy (failing to ensure that leads to funny corner cases).  For the
+ * normal case where we're asked to recurse, this routine ensures that the
+ * not-null constraints either exist already, or queues a requirement for them
+ * to be created by phase 2.
+ *
+ * For the case where we're asked not to recurse, we verify that a not-null
+ * constraint exists on each column of each (direct) child table, throwing an
+ * error if not.  Not throwing an error would also work, because a not-null
+ * constraint would be created anyway, but it'd cause a silent scan of the
+ * child table to verify absence of nulls.  We prefer to let the user know so
+ * that they can add the constraint manually without having to hold
+ * AccessExclusiveLock while at it.
+ *
+ * However, it's also important that we do not acquire locks on children if
+ * the not-null constraints already exist on the parent, to avoid risking
+ * deadlocks during parallel pg_restore of PKs on partitioned tables.
  */
 static void
 ATPrepAddPrimaryKey(List **wqueue, Relation rel, AlterTableCmd *cmd,
@@ -9447,42 +9466,13 @@ ATPrepAddPrimaryKey(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	AlterTableUtilityContext *context)
 {
 	Constraint *pkconstr;
+	List	   *children;
+	bool		got_children = false;
 
 	pkconstr = castNode(Constraint, cmd->def);
 	if (pkconstr->contype != CONSTR_PRIMARY)
 		return;
 
-	/*
-	 * If not recursing, we must ensure that all children have a NOT NULL
-	 * constraint on the columns, and error out if not.
-	 */
-	if (!recurse)
-	{
-		List	   *children;
-
-		children = find_inheritance_children(RelationGetRelid(rel),
-			 lockmode);
-		foreach_oid(childrelid, children)
-		{
-			foreach_node(String, attname, pkconstr->keys)
-			{
-HeapTuple	tup;
-Form_pg_attribute attrForm;
-
-tup = SearchSysCacheAttName(childrelid, strVal(attname));
-if (!tup)
-	elog(ERROR, "cache lookup failed for attribute %s of relation %u",
-		 strVal(attname), childrelid);
-attrForm = (Form_pg_attribute) GETSTRUCT(tup);
-if (!attrForm->attnotnull)
-	ereport(ERROR,
-			errmsg("column \"%s\" of table \"%s\" is not marked NOT NULL",
-   strVal(attname), get_rel_name(childrelid)));
-ReleaseSysCache(tup);
-			}
-		}
-	}
-
 	/* Verify that columns are not-null, or request that they be made so */
 	foreach_node(String, column, pkconstr->keys)
 	{
@@ -9498,42 +9488,46 @@ ATPrepA

Re: Built-in Raft replication

2025-04-16 Thread Konstantin Osipov
* Andrey Borodin  [25/04/16 11:06]:
> > You can run bash from extension, what's the point?
> 
> You cannot run bash that will stop backend running bash.

You're right, there is a chicken-and-egg problem when you add
Raft to an existing project, and rebootstrapping
becomes a trick, but it's a plumbing trick.

The new member needs to generate and persist a globally unique
identifier as the first step. Later it can 
reintroduce itself to the cluster given this identifier
can be preserved in the new incarnation (popen + fork).

-- 
Konstantin Osipov, Moscow, Russia




Re: Performance issues with v18 SQL-language-function changes

2025-04-16 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, Apr 16, 2025 at 01:43:46PM -0400, Tom Lane wrote:
>> I'm confused?  0dca5d68d didn't have anything to do with
>> syntax changes, just with when planning happens.

> I was referencing the contrib initialization functions you converted to
> use SQL-standard function bodies:

Nah, that's not really relevant.  The speed concerns I have here
are mostly independent of whether the SQL function is written
in string or SQL-standard form.  Also, I think all of those
contrib functions that are at all performance-relevant are
capable of being inlined, and so wouldn't reach this code anyway.

regards, tom lane




Re: Performance issues with v18 SQL-language-function changes

2025-04-16 Thread Bruce Momjian
On Wed, Apr 16, 2025 at 01:43:46PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, Apr 14, 2025 at 10:38:29AM -0400, Robert Haas wrote:
> >> I agree that we should do something about this. I haven't reviewed
> >> your patches but the approach sounds broadly reasonable.
> 
> > Yep, we went down the road in PG 18 to convert syntax, and now we have
> > to fix this, or we have to revert all the PG 18 syntax changes, which
> > seems like a step backward.
> 
> I'm confused?  0dca5d68d didn't have anything to do with
> syntax changes, just with when planning happens.

I was referencing the contrib initialization functions you converted to
use SQL-standard function bodies:

commit 68ff25eef12
Author: Tom Lane 
Date:   Sun Dec 29 14:58:05 2024 -0500

contrib/pageinspect: Use SQL-standard function bodies.

In the same spirit as 969bbd0fa, 13e3796c9, 3f323eba8.

Tom Lane and Ronan Dunklau

Discussion: https://postgr.es/m/3316564.aeNJFYEL58@aivenlaptop

I thought those were the functions you were saying are now slower; maybe
I was confused.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.




Prototype: Implement dead tuples xid histograms

2025-04-16 Thread Renan Alves Fonseca
Hi hackers,

In a recent hacking workshop organized by Robert Haas, we discussed
[1]. Among the autovacuum issues exposed, the useless-vacuum case caught
my attention. I've started studying the respective code and came up
with a prototype to improve the statistics system regarding dead tuples.

The attached patch only partially implements the dead tuples histogram
mentioned in [1]. But since I'm a beginner, I thought it would be nice
to get early feedback just to make sure I don't do anything very
wrong.

My initial idea was to implement a growing histogram with a linked list
of bins, exploiting the fact that most dead tuples are added in the
last bin. Then I realized that there are no other cases of dynamic
data structures in pg_stats and that it would be harder to serialize.
That's why I chose to implement the histogram in a static data
structure inside one of the pg_stats data structures. It does require a
little bit more logic to maintain the histogram, but it is well
integrated in the whole pg_stats architecture.
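
To make the idea concrete, here is a rough sketch of what a fixed-size histogram with variable bounds could look like; the names and bin count are illustrative and do not match the attached patch:

#include <stdint.h>

typedef uint32_t TransactionId;     /* stand-in for the PostgreSQL typedef */

#define DEAD_TUPLES_XID_NBINS 8     /* arbitrary fixed size for illustration */

/*
 * Illustrative sketch: dead tuples bucketed by the xid that created them.
 * bounds[i] holds the highest xid counted in bin i, so "how many dead
 * tuples would a vacuum with OldestXmin = X remove?" can be answered by
 * summing freqs[i] over all bins whose bound is below X.
 */
typedef struct DeadTuplesXidHistogram
{
    TransactionId bounds[DEAD_TUPLES_XID_NBINS];
    int64_t     freqs[DEAD_TUPLES_XID_NBINS];
    int         nbins_used;
} DeadTuplesXidHistogram;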

As discussed in the hacking workshop, one of the problems is to capture
the exact xmin of the dead tuple. In my tests, I've observed that,
outside of a transaction, xmin corresponds to
GetCurrentTransactionId(). But inside a transaction, xmin receives
incremental xids on successive DML statements. Capturing xids for every
statement inside a transaction seems overkill. So, I decided to
attribute the highest xmin/xid of a transaction to all dead tuples
of that transaction.

In order to see the statistics in a table t1, we do:
select pg_stat_get_dead_tuples_xid_freqs ('t1'::regclass),
   pg_stat_get_dead_tuples_xid_bounds('t1'::regclass);

Then, to verify that the bounds make sense, I've used:
select xmin from t1;

In this version, the removal of dead tuples is not yet implemented, so
these histograms only grow.

I would really appreciate any kind of feedback.

Best regards,
Renan Fonseca

[1] How Autovacuum Goes Wrong: And Can We Please Make It Stop Doing
That? (PGConf.dev 2024)
>From 396eb5ae24fc37af01c81552f656ca628f57 Mon Sep 17 00:00:00 2001
From: "Renan A. Fonseca" 
Date: Wed, 16 Apr 2025 10:52:16 +0200
Subject: [PATCH] Implement dead tuples xid histograms

This is one possible solution to provide finer statistics regarding
dead tuples. It is conceived to answer the following question: If I
vacuum this table now, how many dead tuples will be removed?

We implement it by adding a fixed size data structure in
PgStat_StatTabEntry representing an histogram with variable
bounds. This means that we persist this histogram using pg_catalog
machinery, and we need a pg_catalog version update.

It is out of the scope of this patch to modify the autovacuum
algorithm to use this new information. We provide sql functions to
retrieve dead tuples xid histogram of a given table so that users can
make experiments with this refined info.

Some details on the implementation can be found in comments.

TODO: update the histogram when removing dead tuples
---
 src/backend/access/transam/xact.c|  18 +++
 src/backend/utils/activity/pgstat_relation.c | 147 ++-
 src/backend/utils/adt/pgstatfuncs.c  |  44 ++
 src/include/access/xact.h|   1 +
 src/include/catalog/pg_proc.dat  |   8 +
 src/include/pgstat.h |  29 +++-
 6 files changed, 242 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..87c9e6d9617 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -460,6 +460,24 @@ GetCurrentTransactionId(void)
 	return XidFromFullTransactionId(s->fullTransactionId);
 }
 
+
+/*
+ *	GetCurrentTransactionMaxChildId
+ *
+ * This will return the highest XID among current transaction childs XIDs. It
+ * returns InvalidTransactionId if current transaction has no childs.
+ */
+TransactionId
+GetCurrentTransactionMaxChildId(void)
+{
+	TransactionState s = CurrentTransactionState;
+
+	if (s->nChildXids == 0)
+		return InvalidTransactionId;
+	else
+		return s->childXids[s->nChildXids - 1];
+}
+
 /*
  *	GetCurrentTransactionIdIfAny
  *
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index eeb2d43cb10..08ce3d4ca85 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -448,6 +448,14 @@ pgstat_count_truncate(Relation rel)
  * recovery of "delta" dead tuples; so delta_dead_tuples decreases
  * rather than increasing, and the change goes straight into the per-table
  * counter, not into transactional state.
+ *
+ * TODO: we need to take this into account in the dead_tuples_xid
+ * histogram. Actually, this is one of the reasons for using a richer
+ * representation for XIDS in the intermediate data structure
+ * (PgStat_TableStatus.counts). Since this is called quite often in
+ * heap_page_prune_opt, we 

jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Tom Lane
I noticed some new warnings from buildfarm member chafer,
which I'm able to reproduce locally on a Fedora 41 box
by building with "meson setup build -Db_lto=true":

ninja: Entering directory `build'
[1515/2472] Linking target src/interfaces/libpq/libpq.so.5.18
In function 'freeJsonLexContext',
inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:688:1,
inlined from 'handle_oauth_sasl_error' at 
../src/interfaces/libpq/fe-auth-oauth.c:547:2:
../src/common/jsonapi.c:723:17: warning: 'free' called on unallocated object 
'lex' [-Wfree-nonheap-object]
  723 | FREE(lex);
  | ^
../src/interfaces/libpq/fe-auth-oauth.c: In function 'handle_oauth_sasl_error':
../src/interfaces/libpq/fe-auth-oauth.c:479:24: note: declared here
  479 | JsonLexContext lex = {0};
  |^
[2407/2472] Linking target 
src/test/modules/test_json_parser/test_json_parser_incremental_shlib
In function 'freeJsonLexContext',
inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:688:1,
inlined from 'main' at 
../src/test/modules/test_json_parser/test_json_parser_incremental.c:198:2:
../src/common/jsonapi.c:723:17: warning: 'free' called on unallocated object 
'lex' [-Wfree-nonheap-object]
  723 | FREE(lex);
  | ^
../src/test/modules/test_json_parser/test_json_parser_incremental.c: In 
function 'main':
../src/test/modules/test_json_parser/test_json_parser_incremental.c:87:24: 
note: declared here
   87 | JsonLexContext lex;
  |^
[2426/2472] Linking target 
src/test/modules/test_json_parser/test_json_parser_incremental
In function 'pg_free',
inlined from 'pfree' at ../src/common/fe_memutils.c:135:2,
inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:723:3,
inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:688:1,
inlined from 'main' at 
../src/test/modules/test_json_parser/test_json_parser_incremental.c:198:2:
../src/common/fe_memutils.c:107:9: warning: 'free' called on unallocated object 
'lex' [-Wfree-nonheap-object]
  107 | free(ptr);
  | ^
../src/test/modules/test_json_parser/test_json_parser_incremental.c: In 
function 'main':
../src/test/modules/test_json_parser/test_json_parser_incremental.c:87:24: 
note: declared here
   87 | JsonLexContext lex;
  |^

AFAICT there is no actual bug here: the FREE() call is reached only if
the JSONLEX_FREE_STRUCT flag is set, which it should not be for these
call sites.  But evidently the LTO optimizer is not quite smart enough
to realize that.

It seems fairly dangerous to ignore -Wfree-nonheap-object warnings.
I feel like we ought to move to prevent these somehow.  I'm not sure
how other than giving up on stack allocation of JsonLexContexts,
though, especially if we consider the jsonapi API frozen.  But seeing
that there are only three such call sites and none of them seem in the
least performance-critical, maybe we should just do that?

regards, tom lane




Re: Built-in Raft replication

2025-04-16 Thread Greg Sabino Mullane
On Wed, Apr 16, 2025 at 2:18 AM Ashutosh Bapat 
wrote:

> Users find it a waste of resources to deploy 3 big PostgreSQL instances
> just for HA where 2 suffice even if they deploy 3 lightweight DCS
> instances. Having only some of the nodes act as DCS and others purely
> PostgreSQL nodes will reduce waste of resources.
>

A big problem is that putting your DCS into Postgres means your whole
system is now super-sensitive to IO/WAL-streaming issues, and a busy
database doing database stuff is going to start affecting the DCS stuff.
With three lightweight DCS servers, you don't really need to worry about
how stressed the database servers are. In that way, I feel etcd et al. are
adhering to the unix philosophy of "do one thing, and do it well."

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: A modest proposal: make parser/rewriter/planner inputs read-only

2025-04-16 Thread Andrei Lepikhov

On 4/16/25 18:22, Tom Lane wrote:

Maybe we would like to have some enforced contract about what
subquery_planner can and can't touch in the outer planner level's
data, but I'm not feeling a great need for that right now.
I wouldn't say I'm entirely convinced that side effects are absent, but 
multiple copies of the same query tree look messy now. It makes sense to 
try this way and see what happens.


--
regards, Andrei Lepikhov




Re: jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Daniel Gustafsson
> On 17 Apr 2025, at 00:12, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 16 Apr 2025, at 23:42, Tom Lane  wrote:
>>> I'm not sure
>>> how other than giving up on stack allocation of JsonLexContexts,
>>> though, especially if we consider the jsonapi API frozen.  But seeing
>>> that there are only three such call sites and none of them seem in the
>>> least performance-critical, maybe we should just do that?
> 
>> I can't see any other option really, and there is no performance angle really
>> so that should be safe.  Since I committed at least one of these, let me know
>> if you want me to tackle it.
> 
> The only alternative I can see that might stop the warning is if we
> can find a way to make it clearer to the optimizer that the FREE()
> isn't reached.  But I'm not sure about a trustworthy way to make that
> happen.  Maybe it'd work to change the signature of freeJsonLexContext
> (or perhaps better, add a separate entry point) so that the caller is
> passing a bool constant that controls whether to free the struct.
> We could have an Assert that compares that to the state of the
> JSONLEX_FREE_STRUCT flag to catch mistakes.  This seems kind of messy
> though.

Yeah, that seems messy enough that someone down the line will go "why on earth"
and we'll have to revisit this discussion.  It can probably be made to work but
I doubt it will be worth it compared to allocating on the heap.

--
Daniel Gustafsson





Re: jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Ranier Vilela
Em qua., 16 de abr. de 2025 às 18:42, Tom Lane  escreveu:

> I noticed some new warnings from buildfarm member chafer,
> which I'm able to reproduce locally on a Fedora 41 box
> by building with "meson setup build -Db_lto=true":
>
> ninja: Entering directory `build'
> [1515/2472] Linking target src/interfaces/libpq/libpq.so.5.18
> In function 'freeJsonLexContext',
> inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:688:1,
> inlined from 'handle_oauth_sasl_error' at
> ../src/interfaces/libpq/fe-auth-oauth.c:547:2:
> ../src/common/jsonapi.c:723:17: warning: 'free' called on unallocated
> object 'lex' [-Wfree-nonheap-object]
>   723 | FREE(lex);
>   | ^
> ../src/interfaces/libpq/fe-auth-oauth.c: In function
> 'handle_oauth_sasl_error':
> ../src/interfaces/libpq/fe-auth-oauth.c:479:24: note: declared here
>   479 | JsonLexContext lex = {0};
>   |^
> [2407/2472] Linking target
> src/test/modules/test_json_parser/test_json_parser_incremental_shlib
> In function 'freeJsonLexContext',
> inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:688:1,
> inlined from 'main' at
> ../src/test/modules/test_json_parser/test_json_parser_incremental.c:198:2:
> ../src/common/jsonapi.c:723:17: warning: 'free' called on unallocated
> object 'lex' [-Wfree-nonheap-object]
>   723 | FREE(lex);
>   | ^
> ../src/test/modules/test_json_parser/test_json_parser_incremental.c: In
> function 'main':
> ../src/test/modules/test_json_parser/test_json_parser_incremental.c:87:24:
> note: declared here
>87 | JsonLexContext lex;
>   |^
> [2426/2472] Linking target
> src/test/modules/test_json_parser/test_json_parser_incremental
> In function 'pg_free',
> inlined from 'pfree' at ../src/common/fe_memutils.c:135:2,
> inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:723:3,
> inlined from 'freeJsonLexContext' at ../src/common/jsonapi.c:688:1,
> inlined from 'main' at
> ../src/test/modules/test_json_parser/test_json_parser_incremental.c:198:2:
> ../src/common/fe_memutils.c:107:9: warning: 'free' called on unallocated
> object 'lex' [-Wfree-nonheap-object]
>   107 | free(ptr);
>   | ^
> ../src/test/modules/test_json_parser/test_json_parser_incremental.c: In
> function 'main':
> ../src/test/modules/test_json_parser/test_json_parser_incremental.c:87:24:
> note: declared here
>87 | JsonLexContext lex;
>   |^
>
> AFAICT there is no actual bug here: the FREE() call is reached only if
> the JSONLEX_FREE_STRUCT flag is set, which it should not be for these
> call sites.

See the function *makeJsonLexContextCstringLen* (line 400).
The JSONLEX_FREE_STRUCT flag is enabled there, no?

fe-auth-oauth.c (line 507)
makeJsonLexContextCstringLen(&lex, msg, msglen, PG_UTF8, true);

Worse, on a second call with lex not NULL, the flags are reset
and the struct will no longer be released?

best regards,
Ranier Vilela


Re: Support for runtime parameters in injection points, for AIO tests

2025-04-16 Thread Michael Paquier
On Mon, Apr 14, 2025 at 04:05:40AM -0400, Andres Freund wrote:
> On 2025-04-14 16:46:22 +0900, Michael Paquier wrote:
>> We are still in early April, and I'd like to propose a cleanup of the
>> AIO code on HEAD, even if we are post-feature freeze, to not release
>> this new code in this state, implying an open item.  Please note that
>> I'm OK to take responsibility for this patch set at the end, reviews
>> are welcome.
> 
> I'm pretty agnostic about this happening for 18 or not.  If we can do it,
> good, but if not, the impact of needing to use static variable supporting
> injection points is really rather small.

One thing that I'm afraid of is seeing this pattern grow in the code
even for code in the works, and as injection points are for test
purposes, I'd like to think that we can be rather flexible with them.
We're still early in the post-freeze phase, so I'd say to just do it
at this point.  Of course, this won't happen without a clear consensus
and without folks being happy with what I've sent.

> I'd say that the fact that injection variables are really hard to use in
> critical sections, requiring to have attached to the injection point
> beforehand, is worse.

I've also looked at providing a new flavor of load_external_function()
that would not need to worry about the allocation, even if it means
relying on MAXPGPATH.  I've never come around to actually doing that, as
being able to load the callback into the cache beforehand is something I
think we'll require anyway.  So we don't really need this new flavor
API at this level.

>> The code is shaped so as we can rely on InjectionPointCallback to define the
>> shape of a callback.
> 
> I don't know what this means though?

I mean here to have all the callbacks rely on one single typedef.
I've looked at some options, like having multiple typedefs where some have
zero, one or more arguments, but found that inelegant.
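
As a purely illustrative sketch - an assumption about the shape of the proposal, not necessarily the committed interface - a single typedef with an optional runtime argument could look like this:

/*
 * Sketch only: one callback typedef, where "arg" is the optional runtime
 * argument supplied by the INJECTION_POINT() call site, and NULL when the
 * call site has nothing to hand over.
 */
typedef void (*InjectionPointCallback) (const char *name,
                                        const void *private_data,
                                        void *arg);

Keeping a single typedef means existing callbacks only need to grow one parameter they can ignore, rather than the code having to juggle several callback shapes.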

One thing that Jeff Davis (added in CC) has mentioned to me upthread is
that he was looking at runtime arguments for injection points for a
new project of his, where he wanted to do some stack manipulation to
make a couple of corner cases cheap to emulate.  I've asked his
opinion about the interface I've proposed, as well.
--
Michael


signature.asc
Description: PGP signature


Re: Rename injection point names in test_aio

2025-04-16 Thread Michael Paquier
On Mon, Apr 14, 2025 at 11:14:59AM +, Hayato Kuroda (Fujitsu) wrote:
>> Anyway, would you like to propose a patch for the documentation?
> 
> Sure, I did. Please see [1].
> 
> [1]: 
> https://www.postgresql.org/message-id/OSCPR01MB14966E14C1378DEE51FB7B7C5F5B32%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Thanks, I'll check that.

If anybody objects about the points getting renamed here, please let
me know.
--
Michael


signature.asc
Description: PGP signature


Re: pg_dump: Fix dangling pointer in EndCompressorZstd()

2025-04-16 Thread Michael Paquier
On Wed, Apr 16, 2025 at 04:19:02PM +0200, Daniel Gustafsson wrote:
> Agreed, while it's perfectly safe today the end method should not make
> assumptions about the use of the private_data pointer upon return and should
> leave it set to NULL.

Indeed.  I was just looking at applying what Alexander has sent
because EndCompressorZstd() not doing what the other methods do
makes no sense.  Perhaps you are already on it, Daniel?
--
Michael


signature.asc
Description: PGP signature


Decouple C++ support in Meson's PGXS from LLVM enablement

2025-04-16 Thread Tristan Partin
Howdy folks,

While playing around with pg_duckdb[0], a Postgres extension written in 
C++ which uses PGXS, I came across a strange build error:

std=c++17  -Wno-sign-compare...
/bin/sh: line 1: -Wno-sign-compare: command not found

I was very confused by the error, but reading the command line, it made 
sense. After talking to Jelte off-list, he told me to try a Postgres 
installation that had been built with autotools. Today, I finally had 
a chance to try that tip, and building pg_duckdb succeeded.

I spent some time exploring the Meson build a bit, and I realized that 
C++ support in PGXS is tied to LLVM enablement. Checking the autotools 
build in the configure.ac script indicates that that is not the case for 
it.

On master, C++ support looks like:

llvmopt = get_option('llvm')
llvm = not_found_dep
if add_languages('cpp', required: llvmopt, native: false)
  llvm = dependency('llvm', version: '>=14', method: 'config-tool', required: 
llvmopt)
if llvm.found()

  cdata.set('USE_LLVM', 1)

  cpp = meson.get_compiler('cpp')

By default, the `llvm` option is disabled, which Meson takes to mean 
"do not check for C++ support". Thus, add_languages() returns false. 
In addition, every check for adding to cxxflags, et al. is gated on 
llvm.found(), which is always false for the `not_found_dep`. All this 
considered, the Makefile.global of a Postgres build roughly looked like:

CXX =
CXXFLAGS = 
...

This then accounts for the original pg_duckdb command line looking the 
way that it did.

Attached is a patch which decouples C++ support in PGXS from LLVM for 
a Meson-compiled Postgres.

[0]: https://github.com/duckdb/pg_duckdb

-- 
Tristan Partin
Neon (https://neon.tech)
From aba365ed36a82f41e8bd6a7f61b90c91f1556d08 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Wed, 16 Apr 2025 20:25:21 -0500
Subject: [PATCH] Decouple C++ support in Meson's PGXS from LLVM enablement

This is important for Postgres extensions that are written in C++, such
as pg_duckdb, which uses PGXS as the build system currently. In the
autotools build, C++ is not coupled to LLVM. If the autotools build is
configured without --with-llvm, the C++ compiler and the various flags
get persisted into the Makefile.global.

Signed-off-by: Tristan Partin 
---
 meson.build   | 27 +--
 src/include/meson.build   |  2 +-
 src/makefiles/meson.build |  9 ++---
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/meson.build b/meson.build
index a1516e54529..f63769f8b67 100644
--- a/meson.build
+++ b/meson.build
@@ -40,6 +40,10 @@ build_system = build_machine.system()
 host_cpu = host_machine.cpu_family()
 
 cc = meson.get_compiler('c')
+have_cpp = add_languages('cpp', required: false, native: false)
+if have_cpp
+  cpp = meson.get_compiler('cpp')
+endif
 
 not_found_dep = dependency('', required: false)
 thread_dep = dependency('threads')
@@ -798,15 +802,13 @@ endif
 
 llvmopt = get_option('llvm')
 llvm = not_found_dep
-if add_languages('cpp', required: llvmopt, native: false)
+if have_cpp and not llvmopt.disabled()
   llvm = dependency('llvm', version: '>=14', method: 'config-tool', required: llvmopt)
 
   if llvm.found()
 
 cdata.set('USE_LLVM', 1)
 
-cpp = meson.get_compiler('cpp')
-
 llvm_binpath = llvm.get_variable(configtool: 'bindir')
 
 ccache = find_program('ccache', native: true, required: false)
@@ -815,8 +817,13 @@ if add_languages('cpp', required: llvmopt, native: false)
 # find via PATH, too.
 clang = find_program(llvm_binpath / 'clang', 'clang', required: true)
   endif
-elif llvmopt.auto()
-  message('llvm requires a C++ compiler')
+else
+  msg = 'llvm requires a C++ compiler'
+  if llvmopt.auto()
+message(msg)
+  elif llvmopt.enabled()
+error(msg)
+  endif
 endif
 
 
@@ -2053,7 +2060,7 @@ common_functional_flags = [
 ]
 
 cflags += cc.get_supported_arguments(common_functional_flags)
-if llvm.found()
+if have_cpp
   cxxflags += cpp.get_supported_arguments(common_functional_flags)
 endif
 
@@ -2077,7 +2084,7 @@ common_warning_flags = [
 ]
 
 cflags_warn += cc.get_supported_arguments(common_warning_flags)
-if llvm.found()
+if have_cpp
   cxxflags_warn += cpp.get_supported_arguments(common_warning_flags)
 endif
 
@@ -2131,7 +2138,7 @@ foreach w : negative_warning_flags
   if cc.has_argument('-W' + w)
 cflags_warn += '-Wno-' + w
   endif
-  if llvm.found() and cpp.has_argument('-W' + w)
+  if have_cpp and cpp.has_argument('-W' + w)
 cxxflags_warn += '-Wno-' + w
   endif
 endforeach
@@ -2192,7 +2199,7 @@ elif optimization == 's'
 endif
 
 cflags_builtin = cc.get_supported_arguments(common_builtin_flags)
-if llvm.found()
+if have_cpp
   cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags)
 endif
 
@@ -3920,7 +3927,7 @@ if meson.version().version_compare('>=0.57')
 section: 'Compiler Flags',
   )
 
-  if llvm.found()
+  if have_cpp
 summary(
   {
 'C++ compiler': '@0@ @1@'.format(cpp.get_id(), cpp.ver

Re: not null constraints, again

2025-04-16 Thread Tender Wang
Alvaro Herrera  于2025年4月16日周三 19:24写道:

> Here's another version where I do skip searching for children twice, and
> rewrote the comments.
>
> I also noticed that in child tables we were only looking for
> pg_attribute.attnotnull, and not whether the constraints had been
> validated or made inheritable.  This seemed a wasted opportunity, so I
> refactored the code to instead examine the pg_constraint row and apply
> the same checks as for the constraint on the parent (namely, that it's
> valid and not NO INHERIT).  We already check for these things downstream
> (alter table phase 2, during AdjustNotNullInheritance), but only after
> potentially wasting more work, so it makes sense to do it here (alter
> table phase 1) given that it's very easy.  I added some tests for these
> things also, as those cases weren't covered.
>

if (conForm->contype != CONSTRAINT_NOTNULL)
elog(ERROR, "constraint %u is not a not-null constraint", conForm->oid);

I feel that using conForm->conname would be more user-friendly than the oid.

The rest looks good to me.

-- 
Thanks,
Tender Wang


RE: recoveryCheck test failure on flaviventris

2025-04-16 Thread Hayato Kuroda (Fujitsu)
Dear Richard, Alexander,

FYI - I noticed the same type of failure [1] on REL_17_STABLE and added it to
the wiki page.
Feel free to revert it if I did something wrong.

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prion&dt=2025-04-15%2016%3A16%3A06&stg=recovery-check

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: POC: make mxidoff 64 bits

2025-04-16 Thread Maxim Orlov
I moved the topic to the next commitfest.

-- 
Best regards,
Maxim Orlov.


Re: Built-in Raft replication

2025-04-16 Thread Konstantin Osipov
* Andrey Borodin  [25/04/16 11:06]:
> > Andrey Borodin  writes:
> >> I think it's what Konstantin is proposing. To have our own Raft 
> >> implementation, without dependencies.
> > 
> > Hmm, OK.  I thought that the proposal involved relying on some existing
> > code, but re-reading the thread that was said nowhere.  Still, that
> > moves it from a large project to a really large project :-(
> > 
> > I continue to think that it'd be best to try to implement it as
> > an extension, at least up till the point of finding show-stopping
> > reasons why it cannot be that.
> 
> I think I can provide some reasons why it cannot be neither extension, nor 
> any part running within postmaster reign.
> 
> 1. When joining cluster, there’s not PGDATA to run postmaster on top of it.
> 
> 2. After failover, old Primary node must rejoin cluster by running pg_rewind 
> and following timeline switch.
> 
> The system in hand must be able to manipulate with PGDATA without starting 
> Postgres.
> 
> My question to Konstantin is Why wouldn’t you just add Raft to Patroni? Is 
> there a reason why something like Patroni is not in core and noone rushes to 
> get it in?
> Everyone is using it, or system like it.

Raft uses the same WAL to store configuration change records as is used
for commit records. This is at the core of the correctness of the
algorithm. This is also my biggest concern with the correctness of
Patroni - but to the best of my knowledge 90%+ of
Patroni use cases use a "fixed" quorum size that's defined at the
start of the deployment and never/rarely changes.
Contrast that with being able to add a replica to the quorum at any
time, where all it takes is just starting this replica and pointing
it at the existing cluster. This greatly simplifies the UX.

-- 
Konstantin Osipov, Moscow, Russia




Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-16 Thread David Rowley
On Thu, 17 Apr 2025 at 02:20, torikoshia  wrote:
> Regarding the implementation:
> In the initial patch attached, I naïvely incremented the level just
> before emitting the log line.
> However, it might be cleaner to simply initialize the level variable to
> 1 from the start. This could help avoid unnecessary confusion when
> debugging that part of the code.

I didn't look at your patch before, but have now and agree that's not
the best way.

> Similarly, I noticed that in pg_get_process_memory_contexts(), the level
> is initialized to 0 in ProcessGetMemoryContextInterrupt(void):
>
>  int level = 0;
>  ..
>  MemoryContextStatsInternal(c, level, 100, 100, &grand_totals, ..
>
> If we want to be consistent, perhaps it would make sense to start from 1
> there as well.

Yes.

> BTW level variable has existed since before pg_backend_memory_contexts
> was introduced — it was originally used for functions that help inspect
> memory contexts via the debugger. Because of that, I think changing this
> would affect not only these functions codes but some older ones.

I get the impression that it wasn't well thought through prior to
this. If you asked for max_level of 10 it would stop at 9. Changing
these to 1-based levels means we'll now stop at level 10 without
printing any more levels than we did before.

The attached patch is how I think we should do it.

David


make_memory_context_levels_1_based.patch
Description: Binary data


Re: Built-in Raft replication

2025-04-16 Thread Hannu Krosing
On Wed, Apr 16, 2025 at 6:27 AM Tom Lane  wrote:
>
> Andrey Borodin  writes:
> > I think it's what Konstantin is proposing. To have our own Raft 
> > implementation, without dependencies.
>
> Hmm, OK.  I thought that the proposal involved relying on some existing
> code, but re-reading the thread that was said nowhere.  Still, that
> moves it from a large project to a really large project :-(

My understanding is that RAFT is a fancy name for what PostgreSQL is
largely already doing - "electing" a leader and then doing all the
changes through that leader (a.k.a. WAL streaming)

The thing that needs adding - and which makes it "RAFT" instead of
just a streaming replication with a failover - is what happens when
the leader goes away and one of the replicas needs to become a new
leader.

We have ways to do rollbacks and roll-forwards but the main tricky
part is "how do you know that you have not lost some changes" and here
we must have some place to store the info about at which LSN the
failover happened, so that we know to run pg_rewind if any losts hosts
come back and want to join.

And of course we need to have a way to communicate this "who is the
new leader" to clients which needs new libpgq functionality of
"failover connection pools" which hide the failovers from old clients.

The RAFT protocol could be a provider of the "who is the current leader" info
and optionally cache the LSN at which the switch happened.

> I continue to think that it'd be best to try to implement it as
> an extension, at least up till the point of finding show-stopping
> reasons why it cannot be that.

The main thing I would like to see in core is the ability to do clean
*switchovers* (not failovers) by sending a magic WAL record with a
message "hey node N, please take over the write node role" over WAL, so
that node N knows to self-promote and all other nodes know to start
following N from the next WAL record, either directly or

Why WAL - because it is already sent to all replicas, and it guarantees
continuity, as it very clearly states at what LSN the write-head switch
happens.

We also should be able to send this info to the client libraries
currently connected to the writer, so that they can choose to switch
to the new head.

The rest could be easily an extension.

Mainly we want more than one "coordinator", which can be running on
some or all of the nodes, and which agree on
- which node is the current leader
- at which LSN the switch happened (so that if some node coming back
discovers that it has magically moved ahead, it knows to rewind to that
LSN and then re-stream from the commonly agreed place).

 It would also be nice to have some agreed, hopefully lightweight,
notion of node identity, which we could then use for many things,
including stamping it in WAL records to guarantee / verify that all
the nodes have been on the same WAL stream all the time


But regarding whether to use RAFT, I would just define a "coordinator
API" and leave it up to the specific coordinator/consensus extension
to decide how the consensus is achieved.
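
A hypothetical sketch of such an API, just to illustrate the shape - none of these types or hooks exist in PostgreSQL today:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical only: the hooks a consensus extension (Raft, Paxos, or
 * "the DBA says so") might implement for core to call.
 */
typedef struct ClusterCoordinator
{
    /* who is the current write head, and as of which LSN did it take over? */
    uint64_t    (*get_leader) (void);
    uint64_t    (*get_switch_lsn) (void);

    /* ask the coordinator to orchestrate a clean switchover to a node */
    bool        (*request_switchover) (uint64_t target_node_id);

    /* called by core when it writes or replays a switchover WAL record */
    void        (*on_switchover_record) (uint64_t new_leader, uint64_t switch_lsn);
} ClusterCoordinator;

The point of keeping the API this small is that core only needs to know "who leads and since which LSN"; how that answer is agreed upon stays entirely inside the extension.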


So to summarize:

# Core should provide

- a way to move to a new node,
  - for switchover a WAL-based switchover
  - for failover something similar which also writes the WAL record so
all histories are synced
- a libpq message informing clients about "new write head node"
- node IDs and more general cluster-awareness inside the PostgreSQL
node (I had a shoutout about this in a recent pgconf.dev unconference
talk)
- a new write-node field in WAL to track write head movement
- API for a joining node to find out which cluster it joins and the
switchover history
  - in WAL it is always switchover, maybe with some info saying that
it was a forced switchover because we lost the old write head
  - if some lost node comes back it may need to rewind or
re-initialize if it finds out it had been following a lost timeline
that is not fully part of

NOTE: switchovers in WAL would be very similar to timeline changes. I
am not sure how much extra info is needed there.

# Extension can provide
- agreeing on new leader node in case of failover
  - protocol can be RAFT, PAXOS or "the DBA says so" :)
- sharing fresh info about current leader and switch timelines (though
this should more likely be in core)
- anything else ???

# external apps are (likely?) needed for
- setting up cluster, provisioning machines / VMs
- setting up networking
- starting PostgreSQL servers.
- spinning up and down clients,
- communicating current leader and replica set (could be done by DNS
with agreed conventions)




Re: jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Daniel Gustafsson
> On 16 Apr 2025, at 23:42, Tom Lane  wrote:

> It seems fairly dangerous to ignore -Wfree-nonheap-object warnings.
> I feel like we ought to move to prevent these somehow.

Absolutely agree.

> I'm not sure
> how other than giving up on stack allocation of JsonLexContexts,
> though, especially if we consider the jsonapi API frozen.  But seeing
> that there are only three such call sites and none of them seem in the
> least performance-critical, maybe we should just do that?

I can't see any other option really, and there is no performance angle here,
so that should be safe.  Since I committed at least one of these, let me know
if you want me to tackle it.

--
Daniel Gustafsson





Re: jsonapi: scary new warnings with LTO enabled

2025-04-16 Thread Tom Lane
Daniel Gustafsson  writes:
> On 17 Apr 2025, at 00:12, Tom Lane  wrote:
>> The only alternative I can see that might stop the warning is if we
>> can find a way to make it clearer to the optimizer that the FREE()
>> isn't reached.  But I'm not sure about a trustworthy way to make that
>> happen.

> Yeah, that seems messy enough that someone down the line will go "why on
> earth" and we'll have to revisit this discussion.  It can probably be made
> to work but I doubt it will be worth it compared to allocating on the heap.

Looking through all of the callers of freeJsonLexContext, quite
a lot of them use local JsonLexContext structs, and probably some
of them are more performance-critical than these.  So that raises
the question of why we are seeing warnings for only these call
sites.  Maybe there is a more elegant way to suppress them.

Still, I think that observation refutes my initial thought that
we should rip out support for local JsonLexContext structs
altogether.  I'm inclined now to just do the minimal thing of
changing these callers to use an allocated struct, and call it
a day.  (BTW, there seem to be only 2 places to change not 3;
two of the warnings are pointing at the same variable.)
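
For anyone following along, the minimal change amounts to switching those
call sites from the stack form to the heap form of the same API, roughly
like this (a simplified sketch, not the actual call sites):

#include "postgres.h"
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"		/* PG_UTF8 */

static void
parse_example(const char *json)
{
	/* Stack form: after inlining, LTO can no longer prove that
	 * freeJsonLexContext() won't FREE() this stack pointer. */
	JsonLexContext lex;

	makeJsonLexContextCstringLen(&lex, json, strlen(json), PG_UTF8, true);
	/* ... run the parser ... */
	freeJsonLexContext(&lex);

	/* Heap form: passing NULL makes the API allocate the struct itself,
	 * so the eventual FREE() is unambiguously on heap memory. */
	JsonLexContext *lexp = makeJsonLexContextCstringLen(NULL, json,
							    strlen(json),
							    PG_UTF8, true);

	/* ... run the parser ... */
	freeJsonLexContext(lexp);
}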

regards, tom lane




Re: New committer: Jacob Champion

2025-04-16 Thread Dilip Kumar
Congratulations Jacob!

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


On Sat, 12 Apr 2025 at 1:56 AM, Jonathan S. Katz 
wrote:

> The Core Team would like to extend our congratulations to Jacob
> Champion, who has accepted an invitation to become our newest PostgreSQL
> committer.
>
> Please join us in wishing Jacob much success and few reverts!
>
> Thanks,
>
> Jonathan
>


Re: Fixup some appendPQExpBuffer() calls

2025-04-16 Thread David Rowley
On Mon, 14 Apr 2025 at 19:39, David Rowley  wrote:
> 1) I should commit the attached 0002 patch now, or;
> 2) Should commit only the new-to-v18 ones now, or;
> 3) do nothing.
>
> I think #3 isn't a good option as we (or I) have made efforts in the
> past to keep these in check. Also, for me, #2 isn't ideal as if I do
> this next year for v19, I'll need to ignore the older inconsistencies,
> and I'd rather not have to do that, so I vote for #1.
>
> Any other opinions?

To reduce the above 3 options down to two, I just pushed the ones that
were new to v18.  The attached patch handles the ones that were added
prior to v18.
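
For anyone skimming the thread, the pattern these cleanups target is
calls that go through the printf machinery without needing it; the usual
fix looks like this (an illustrative example, not a hunk from the patch):

#include "pqexpbuffer.h"

static void
append_example(PQExpBuffer buf)
{
	/* Before: a constant string routed through format parsing for nothing. */
	appendPQExpBuffer(buf, "some fixed message\n");

	/* After: the plain string append, no format processing. */
	appendPQExpBufferStr(buf, "some fixed message\n");

	/* Single characters can use the cheapest variant. */
	appendPQExpBufferChar(buf, '\n');
}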

The options now are:
1) Commit the attached to master
2) Do nothing.

I'd like to do #1.

David


v2-0001-Fixup-various-older-usages-of-appendPQExpBuffer.patch
Description: Binary data


Re: Conflict detection for update_deleted in logical replication

2025-04-16 Thread Nisha Moond
On Wed, Mar 26, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Here's a rebased version of the patch series.
>

Thanks for the patches.

While testing the GUC "max_conflict_retention_duration", I noticed a
behavior that seems to bypass its intended purpose.

On Pub, if a txn is stuck in the COMMIT phase for a long time, the
apply_worker on the sub keeps looping in wait_for_publisher_status()
until that Pub's concurrent txn completes its commit.
Due to this, the apply worker can't advance its
oldest_nonremovable_xid and keeps waiting for the Pub's txn to finish.
In such a case, even if the wait time exceeds the configured
max_conflict_retention_duration, conflict retention doesn't stop for
the apply_worker. The conflict info retention is stopped only once
the Pub's txn is committed and the apply_worker moves to
wait_for_local_flush().

Doesn't this defeat the purpose of max_conflict_retention_duration?
The apply worker has exceeded the max wait time but still retains the
conflict info.
I think we should consider applying the same max time limit check
inside wait_for_publisher_status() as well.
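
Something along the lines of the sketch below, i.e. reuse the duration
check while still waiting for the publisher status. Apart from the names
already mentioned in this thread, the struct and helper names here are
guesses rather than the patch's actual code:

/* Sketch only: apply the max_conflict_retention_duration limit inside
 * wait_for_publisher_status() too, not just in wait_for_local_flush().
 * RetainConflictInfoData, candidate_xid_time and
 * stop_conflict_info_retention() are guessed names. */
static void
wait_for_publisher_status(RetainConflictInfoData *data)
{
	TimestampTz now = GetCurrentTimestamp();

	/* ... existing logic that waits for the publisher's concurrent
	 * transactions to finish their commits ... */

	if (max_conflict_retention_duration > 0 &&
		TimestampDifferenceExceeds(data->candidate_xid_time, now,
					   max_conflict_retention_duration))
	{
		/* Give up retaining conflict info, as wait_for_local_flush() would. */
		stop_conflict_info_retention(data);
		return;
	}
}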

--
Thanks,
Nisha