Re: using index to speedup add not null constraints to a table

2025-04-18 Thread jian he
On Wed, Feb 5, 2025 at 4:24 PM jian he  wrote:
>
> rebased new patch attached.
> I also did some cosmetic changes. comments refined.
> make sure using index_scan mechanism to fast check column not-null can
> only be used via btree index.
> isolation tests are simplified.

I realized that my previous patch was quite wrong:
we should not do an index-scan verification of individual not-null constraints
in Phase 2.

So a new patch is attached.
The main idea is that Phase 2 collects all the to-be-added not-null constraints
into AlteredTableInfo->constraints,
and then Phase 3 checks whether we can use an index to quickly validate each
not-null constraint or not.

To minimize concurrency issues, using an index scan to quickly validate
NOT NULL constraints requires strict conditions in Phase3:
* No table rewrite
* No table scan
* Each NOT NULL constraint must have a suitable supporting index for
fast checking
* The table must already hold an AccessExclusiveLock
* The DDL must not involve creating any new indexes
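To illustrate the mechanism (this is only a rough sketch, not code from the
patch: the helper name and surrounding setup are made up, and the
index_beginscan() signature shown follows the PostgreSQL 17 headers), the fast
check boils down to probing the btree for a single visible IS NULL entry:

/*
 * Hedged sketch only -- not the actual patch.  Probe a btree index for any
 * entry whose indexed column IS NULL; one visible hit is enough to conclude
 * that the NOT NULL constraint cannot be validated via the index.
 */
#include "postgres.h"
#include "access/genam.h"
#include "access/skey.h"
#include "access/tableam.h"
#include "executor/executor.h"

static bool
column_has_null_via_index(Relation heapRel, Relation indexRel,
                          Snapshot snapshot, AttrNumber indexcol)
{
    ScanKeyData skey;
    IndexScanDesc scan;
    TupleTableSlot *slot;
    bool        found;

    /* Scan key meaning "indexed column IS NULL" */
    ScanKeyEntryInitialize(&skey,
                           SK_ISNULL | SK_SEARCHNULL,
                           indexcol,
                           InvalidStrategy,
                           InvalidOid,   /* no strategy subtype */
                           InvalidOid,   /* no collation */
                           InvalidOid,   /* no comparison proc */
                           (Datum) 0);

    slot = table_slot_create(heapRel, NULL);
    scan = index_beginscan(heapRel, indexRel, snapshot, 1, 0);
    index_rescan(scan, &skey, 1, NULL, 0);

    /* Any visible NULL entry means the constraint cannot be validated. */
    found = index_getnext_slot(scan, ForwardScanDirection, slot);

    index_endscan(scan);
    ExecDropSingleTupleTableSlot(slot);

    return found;
}

In the patch this kind of probe would run once per pending not-null constraint
inside ATRewriteTable, and only when all of the conditions above hold.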

I don't have a good idea how to write regression tests for this.
For now I use

ereport(NOTICE,
        errmsg("all not-null constraints on relation \"%s\" are validated by index scan",
               RelationGetRelationName(oldrel)));

to do the tests.

for example:
create temp table t2 (x int, y int, z int, primary key (x, y));
create unique index t2_z_uidx on t2(z);

alter table t2 alter column z set not null;
NOTICE:  all not-null constraints on relation "t2" are validated by index scan
ALTER TABLE
From 0b4ad35e8fbc46f93ba6df2ef7013ff2f3055216 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 18 Apr 2025 16:05:42 +0800
Subject: [PATCH v4 1/1] using indexscan to speedup add not null constraints

This patch tries to use index_beginscan() / index_getnext() / index_endscan()
mentioned in [1] to speedup adding not-null constraints to the existing table.

The main logic happens in Phase 3 (ATRewriteTable):
1. Collect all the not-null constraints.
2. Check whether each not-null constraint has a corresponding index to verify it;
   if not, do the normal processing.
3. If a table scan or rewrite is required, the table is not held with an
   AccessExclusiveLock, or the NOT NULL constraint is on a virtual generated
   column, then we cannot use an index scan to quickly validate the not-null
   constraints.

concurrency concern:
ALTER TABLE SET NOT NULL takes an ACCESS EXCLUSIVE lock, so fewer kinds of race
conditions can occur.  To verify this, I wrote some isolation tests; see [2].

performance:
It will only be slower than usual to add a NOT NULL constraint if your index is
bloated and a significant portion of that bloat is due to rows with NULL values.

demo:
drop table if exists t;
create unlogged table t(a int, b int, c int) with (autovacuum_enabled = off);
insert into t select g, g+1 from generate_series(1,1_000_000) g;
create index t_idx_a on t(a);

COMMAND: alter table t add constraint t1 not null a;
patch Time: 1.178 ms
master Time:  15.137 ms

100% table bloat: via (delete from t)
COMMAND: alter table t add constraint t1 not null a;
patch Time:   0.962 ms
master Time: 16.404 ms

case where 20 percent of the values are NULL and have been deleted from the heap
but are still present in the index:

drop table if exists t;
create unlogged table t(a int, b int) with (autovacuum_enabled = off);
insert into t select case when g % 5 = 0 then null else g end, g+1
from generate_series(1,1_000_000) g;
create index t_idx_a on t(a);
delete from t where a is null;

alter table t add constraint t1 not null a;
patch Time:  17.443 ms
master Time: 23.045 ms

references:
[1] https://postgr.es/m/CA%2BTgmoa5NKz8iGW_9v7wz%3D-%2BzQFu%3DE4SZoaTaU1znLaEXRYp-Q%40mail.gmail.com
[2] https://postgr.es/m/900056D1-32DF-4927-8251-3E0C0DC407FD%40anarazel.de
discussion: https://postgr.es/m/CACJufxFiW=4k1is=F1J=r-cx1rubyxqpurwb331u47rsngz...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5444/
---
 src/backend/commands/tablecmds.c  | 129 ++-
 src/backend/executor/execIndexing.c   | 200 ++
 src/include/executor/executor.h   |   2 +
 .../expected/indexscan-check-notnull.out  | 102 +
 src/test/isolation/isolation_schedule |   1 +
 .../specs/indexscan-check-notnull.spec|  45 
 src/test/regress/expected/aggregates.out  |   1 +
 src/test/regress/expected/alter_table.out |  42 
 src/test/regress/expected/indexing.out|   2 +
 src/test/regress/expected/publication.out |   1 +
 src/test/regress/expected/tablespace.out  |   1 +
 src/test/regress/sql/alter_table.sql  |  46 
 12 files changed, 566 insertions(+), 6 deletions(-)
 create mode 100644 src/test/isolation/expected/indexscan-check-notnull.out
 create mode 100644 src/test/isolation/specs/indexscan-check-

[BUG] temporary file usage report with extended protocol and unnamed portals

2025-04-18 Thread Frédéric Yhuel

Hi,

It seems there's a bug in the logging of temporary file usage when the 
extended protocol is used with unnamed portals.


For example, with the attached Java / pgJDBC  programs, we get the 
following logs:


[...] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp525566.0", 
size 2416640

[..] STATEMENT:  SELECT 1

but it should be:

[...] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp538230.0", 
size 2416640

[...] STATEMENT:  SELECT * FROM foo ORDER BY a OFFSET $1 LIMIT 2

It has been tested with HEAD and REL_17_STABLE.

My guess is that there's a race somewhere, probably when the global 
variable "debug_query_string" is set.
The unnamed portal is closed when the BIND of the next query arrives 
(here: SELECT 1), and I suspect that the variable is set before the 
temporary file is deleted (and logged).


pgJDBC uses unnamed portals, but I don't think this is specific to JDBC. 
I see the same problem with the attached Python / psycopg3 program.
I think it would be better if the drivers used named portals all the 
time (and an explicit close message), but this seems to be a postgres bug.


What do you think?

Best regards,
Frédéric


PS : the dataset is created like this on the server:

CREATE UNLOGGED TABLE foo(a int);
INSERT INTO foo SELECT * FROM generate_series(1, 20);
ALTER SYSTEM SET log_temp_files = 0;
ALTER SYSTEM SET log_min_duration_statement = -1;
SELECT pg_reload_conf();

import psycopg

with psycopg.connect("dbname=postgres user=postgres port=5434 host=localhost") as conn:

    conn.execute("SELECT * FROM foo ORDER BY a OFFSET %s LIMIT %s", [10, 2])
    conn.execute("SELECT 1")

conn.close()
// CREATE UNLOGGED TABLE foo(a int);
// INSERT INTO foo SELECT * FROM generate_series(1, 20);

public class Babar {
public static void main(String[] args)
throws ClassNotFoundException, java.sql.SQLException {
Class.forName("org.postgresql.Driver");

java.sql.Connection conn =
java.sql.DriverManager.getConnection(
"jdbc:postgresql://127.0.0.1:5434/postgres?user=postgres"
);

conn.setAutoCommit(false);

java.sql.ResultSet rs;

java.sql.PreparedStatement stmt =
conn.prepareStatement(
"SELECT * FROM foo ORDER BY a OFFSET ? LIMIT 2"
);

java.sql.PreparedStatement stmt2 =
conn.prepareStatement(
"SELECT 1"
);

	stmt.setInt(1, 10);

rs = stmt.executeQuery();
rs.close();
stmt.close();

rs = stmt2.executeQuery();
rs.close();
stmt2.close();

conn.commit();
conn.close();
}
}



Re: Make prep_status() message translatable

2025-04-18 Thread Fujii Masao




On 2025/04/17 13:12, Fujii Masao wrote:

I've updated the patch to reflect this comment.
Barring any objections, I'm thinking to commit this patch.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-18 Thread Fujii Masao




On 2025/04/18 18:23, David Rowley wrote:

On Fri, 18 Apr 2025 at 20:54, Fujii Masao  wrote:

Shouldn't the example output of pg_log_backend_memory_contexts() in
the documentation also be updated to use 1-based numbering for consistency?
Patch attached.


Yeah. I failed to notice we had an example of the output.

Want to take care of it?


Yeah, I will if you're okay with that!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





RE: Parallel heap vacuum

2025-04-18 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for updating the patch. I have been reviewing it, and below are my
comments so far.

01.
Sorry if I forgot something, but is there a reason that
parallel_vacuum_compute_workers() is mandatory? My fresh eyes felt that the
function could simply check for a NULL API callback and treat it as zero workers.

02. table_parallel_vacuum_estimate
We can assert that state is not NULL.

03. table_parallel_vacuum_initialize

Same as 02. Also, I think we should ensure here that table_parallel_vacuum_estimate()
has already been called, because the current assertion might not be enough:
others could set a random value here. Other functions like table_rescan_tidrange()
check an internal flag of the data structure, which looks good to me. How do you
feel about checking that pcxt->estimator has a non-zero value?

04. table_parallel_vacuum_initialize_worker
Same as 02. Also, can we ensure that table_parallel_vacuum_initialize() has been
done?

05. table_parallel_vacuum_collect_dead_items
Same as 02. Also, can we ensure that table_parallel_vacuum_initialize_worker()
has been done?

06. table_parallel_vacuum_initialize_worker
The comments atop the function need to be updated.

07. 
While playing with this, I found that more parallel vacuum workers can be
launched than there are pages:

```
postgres=# CREATE TABLE var (id int);
CREATE TABLE
postgres=# INSERT INTO var VALUES (generate_series(1, 1));
INSERT 0 1
postgres=# VACUUM (parallel 80, verbose) var ;
INFO:  vacuuming "postgres.public.var"
INFO:  launched 80 parallel vacuum workers for collecting dead tuples (planned: 
80)
INFO:  finished vacuuming "postgres.public.var": index scans: 0
pages: 0 removed, 45 remain, 45 scanned (100.00% of total), 0 eagerly scanned
...
```

I assume the minimum chunk size is one page, so this situation can hurt
performance.
How do you feel about capping the value with rel->rd_rel->relpages in
heap_parallel_vacuum_compute_workers()?
This value is not always up to date, but it seems like a good candidate.
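Something like the following is what I had in mind (only a rough sketch; the
function signature and variable names are illustrative guesses, not code from
your patch):

/*
 * Rough sketch of the suggested cap, not actual patch code: never plan more
 * workers than there are heap pages, using the possibly-stale
 * pg_class.relpages estimate.
 */
static int
heap_parallel_vacuum_compute_workers(Relation rel, int requested_workers)
{
    BlockNumber relpages = (BlockNumber) rel->rd_rel->relpages;
    int         nworkers = requested_workers;

    /* One heap page is the smallest chunk a worker can usefully process. */
    if (relpages > 0 && (BlockNumber) nworkers > relpages)
        nworkers = (int) relpages;

    return nworkers;
}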

Best regards,
Hayato Kuroda
FUJITSU LIMITED 



Re: Changing shared_buffers without restart

2025-04-18 Thread Thomas Munro
On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
> >
> > In an offlist chat Thomas Munro mentioned that just ftruncate() would
> > be enough to resize the shared memory without touching address maps
> > using mmap and munmap().
> >
> > ftruncate man page seems to concur with him
> >
> >If the effect of ftruncate() is to decrease the size of a memory
> >mapped file or a shared memory object and whole pages beyond the
> >new end were previously mapped, then the whole pages beyond the
> >new end shall be discarded.
> >
> >References to discarded pages shall result in the generation of a
> >SIGBUS signal.
> >
> >If the effect of ftruncate() is to increase the size of a memory
> >object, it is unspecified whether the contents of any mapped pages
> >between the old end-of-file and the new are flushed to the
> >underlying object.
> >
> > ftruncate() when shrinking memory will release the extra pages and
> > also would cause segmentation fault when memory outside the size of
> > file is accessed even if the actual address map is larger than the
> > mapped file. The expanded memory is allocated as it is written to, and
> > those pages also become visible in the underlying object.
>
> Thanks for sharing. I need to do more thorough tests, but after a quick
> look I'm not sure about that. ftruncate will take care about the memory,
> but AFAICT the memory mapping will stay the same, is that what you mean?
> In that case if the segment got increased, the memory still can't be
> used because it's beyond the mapping end (at least in my test that's
> what happened). If the segment got shrinked, the memory couldn't be
> reclaimed, because, well, there is already a mapping. Or do I miss
> something?

I was imagining that you might map some maximum possible size at the
beginning to reserve the address space permanently, and then adjust
the virtual memory object's size with ftruncate as required to provide
backing.  Doesn't that achieve the goal with fewer steps, using only
portable* POSIX stuff, and keeping all pointers stable?  I understand
that pointer stability may not be required (I can see roughly how that
argument is constructed), but isn't it still better to avoid having to
prove that and deal with various other problems completely?  Is there
a downside/cost to having a large mapping that is only partially
backed?  I suppose choosing that number might offend you but at least
there is an obvious upper bound: physical memory size.

*You might also want to use fallocate after ftruncate on Linux to
avoid SIGBUS on allocation failure on first touch page fault, which
raises portability questions since it's unspecified whether you can do
that with shm fds and fails on some systems, but let's call that an
independent topic as it's not affected by this choice.




Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-18 Thread David Rowley
On Fri, 18 Apr 2025 at 00:25, Rahila Syed  wrote:
> Regarding v2 patch,
> -   int level = 0;
>
> Retaining the level variable will enhance the code readability, IMO.

When I read that, I suspected it might have been leftover from a
refactor during the development that was forgotten about. There'd be
thousands of places in our code base that you could make the
readability argument for, including the max_level and max_children
parameters at the same call-site. But those didn't get the same
treatment.

I've now pushed the latest patch. Thanks for the reviews.

David




Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-18 Thread Fujii Masao



On 2025/04/18 6:11, David Rowley wrote:

On Fri, 18 Apr 2025 at 00:25, Rahila Syed  wrote:

Regarding v2 patch,
-   int level = 0;

Retaining the level variable will enhance the code readability, IMO.


When I read that, I suspected it might have been leftover from a
refactor during the development that was forgotten about. There'd be
thousands of places in our code base that you could make the
readability argument for, including the max_level and max_children
parameters at the same call-site. But those didn't get the same
treatment.

I've now pushed the latest patch. Thanks for the reviews.


Shouldn't the example output of pg_log_backend_memory_contexts() in
the documentation also be updated to use 1-based numbering for consistency?
Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From 7b3aa9de467005e540b2246ed89c9e43fe08f6bd Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Fri, 18 Apr 2025 17:45:29 +0900
Subject: [PATCH v1] doc: Fix memory context level in
 pg_log_backend_memory_contexts() example.

Commit d9e03864b6b changed the memory context level numbers shown by
pg_log_backend_memory_contexts() to be 1-based. However, the example in
the documentation was not updated and still used 0-based numbering.

This commit updates the example to match the current 1-based output.
---
 doc/src/sgml/func.sgml | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1c5cfee25d1..574a544d9fa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28922,16 +28922,16 @@ One message for each memory context will be logged. 
For example:
 
 LOG:  logging memory contexts of PID 10377
 STATEMENT:  SELECT pg_log_backend_memory_contexts(pg_backend_pid());
-LOG:  level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 
chunks); 66368 used
-LOG:  level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 
blocks; 1408 free (0 chunks); 6784 used
-LOG:  level: 1; TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 
chunks); 472 used
-LOG:  level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 
chunks); 1312 used
-LOG:  level: 1; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 
11232 used
-LOG:  level: 1; Operator class cache: 8192 total in 1 blocks; 512 free (0 
chunks); 7680 used
-LOG:  level: 1; smgr relation table: 16384 total in 2 blocks; 4544 free (3 
chunks); 11840 used
-LOG:  level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free 
(0 chunks); 264 used
+LOG:  level: 1; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 
chunks); 66368 used
+LOG:  level: 2; pgstat TabStatusArray lookup hash table: 8192 total in 1 
blocks; 1408 free (0 chunks); 6784 used
+LOG:  level: 2; TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 
chunks); 472 used
+LOG:  level: 2; RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 
chunks); 1312 used
+LOG:  level: 2; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 
11232 used
+LOG:  level: 2; Operator class cache: 8192 total in 1 blocks; 512 free (0 
chunks); 7680 used
+LOG:  level: 2; smgr relation table: 16384 total in 2 blocks; 4544 free (3 
chunks); 11840 used
+LOG:  level: 2; TransactionAbortContext: 32768 total in 1 blocks; 32504 free 
(0 chunks); 264 used
 ...
-LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 
264 used
+LOG:  level: 2; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 
264 used
 LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 
1029560 used
 
 If there are more than 100 child contexts under the same parent, the first
-- 
2.49.0



Re: Changing shared_buffers without restart

2025-04-18 Thread Dmitry Dolgov
> On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
>
> In an offlist chat Thomas Munro mentioned that just ftruncate() would
> be enough to resize the shared memory without touching address maps
> using mmap and munmap().
>
> ftruncate man page seems to concur with him
>
>If the effect of ftruncate() is to decrease the size of a memory
>mapped file or a shared memory object and whole pages beyond the
>new end were previously mapped, then the whole pages beyond the
>new end shall be discarded.
>
>References to discarded pages shall result in the generation of a
>SIGBUS signal.
>
>If the effect of ftruncate() is to increase the size of a memory
>object, it is unspecified whether the contents of any mapped pages
>between the old end-of-file and the new are flushed to the
>underlying object.
>
> ftruncate() when shrinking memory will release the extra pages and
> also would cause segmentation fault when memory outside the size of
> file is accessed even if the actual address map is larger than the
> mapped file. The expanded memory is allocated as it is written to, and
> those pages also become visible in the underlying object.

Thanks for sharing. I need to do more thorough tests, but after a quick
look I'm not sure about that. ftruncate will take care of the memory,
but AFAICT the memory mapping will stay the same, is that what you mean?
In that case, if the segment got increased, the memory still can't be
used because it's beyond the mapping end (at least in my test that's
what happened). If the segment got shrunk, the memory couldn't be
reclaimed, because, well, there is already a mapping. Or am I missing
something?

> > > I might have not noticed it, but are we putting two mappings one
> > > reserved and one allocated in the same address space, so that when the
> > > allocated mapping shrinks or expands, the reserved mapping continues
> > > to prohibit any other mapping from appearing there? I looked at some
> > > of the previous emails, but didn't find anything that describes how
> > > the reserved mapped space is managed.
> >
> > I though so, but this turns out to be incorrect. Just have done a small
> > experiment -- looks like when reserving some space, mapping and
> > unmapping a small segment from it leaves a non-mapped gap. That would
> > mean for shrinking the new available space has to be reserved again.
>
> Right. That's what I thought. But I didn't see the corresponding code.
> So we have to keep track of two mappings for every segment - 1 for
> allocation and one for reserving space and resize those two while
> shrinking and expanding buffers. Am I correct?

Not necessarily, depending on what we want. Again, I'll do a bit more testing,
but after a quick check it seems that it's possible to "plug" the gap with a
new reservation mapping, then reallocate it to another mapping or unmap both
reservations (main and the "gap" one) at once. That would mean that for the
current functionality we don't need to track the reservation in any way beyond
just the start and the end of the "main" reserved space. The only consequence I can
imagine is possible fragmentation of the reserved space in case of frequent
increase/decrease of a segment with even decreasing size. But since it's only
reserved space, which will not really be used, it's probably not going to be a
problem.




Re: Changing shared_buffers without restart

2025-04-18 Thread Dmitry Dolgov
> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
>
> 1. Performance of Postgres CLOCK page eviction algorithm depends on number
> of shared buffers. My first native attempt just to mark unused buffers as
> invalid cause significant degrade of performance

Thanks for sharing!

Right, but it concerns the case when the number of shared buffers is
high, independently from whether it was changed online or with a
restart, correct? In that case it's out of scope for this patch.

> 2. There are several data structures i Postgres which size depends on number
> of buffers.
> In my patch I used in some cases dynamic shared buffer size, but if this
> structure has to be allocated in shared memory then still maximal size has
> to be used. We have the buffers themselves (8 kB per buffer), then the main
> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
> CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
> 128 bytes per 8kb bytes seems to  large overhead (~1%) but but it may be
> quote noticeable with size differences larger than 2 orders of magnitude:
> E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd
> have ~2GiB of static overhead on only 0.5GiB of actual buffers.

I'm not sure what you mean by using a maximal size; can you elaborate?

In the current patch those structures are allocated as before, except
each goes into a separate segment -- without any extra memory overhead
as far as I see.

> 3. `madvise` is not portable.

The current implementation doesn't rely on madvise so far (it might for
shared memory shrinking), but yeah there are plenty of other not very
portable things (MAP_FIXED, memfd_create). All of that is mentioned in
the corresponding patches as a limitation.




Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-18 Thread David Rowley
On Fri, 18 Apr 2025 at 20:54, Fujii Masao  wrote:
> Shouldn't the example output of pg_log_backend_memory_contexts() in
> the documentation also be updated to use 1-based numbering for consistency?
> Patch attached.

Yeah. I failed to notice we had an example of the output.

Want to take care of it?

David




Re: n_ins_since_vacuum stats for aborted transactions

2025-04-18 Thread David Rowley
On Sat, 12 Apr 2025 at 07:33, Sami Imseih  wrote:
> What do you think of the attached?

I looked at the v3 patch and I'm having trouble getting excited about it.

I'd say this part is misleading:

@@ -3956,7 +3961,8 @@ description | Waiting for a newly initialized
WAL file to reach durable storage
n_dead_tup bigint
   
   
-   Estimated number of dead rows
+   Estimated number of dead rows (updated by committed transactions, or by
+   VACUUM and VACUUM FULL)
   

An aborted insert will contribute to that counter, but that's not mentioned.

Would it be ok just to adjust n_mod_since_analyze's "Estimated number
of rows modified since this table was last analyzed" and inject "by
committed transactions" after "modified", then call it a day?

David




Re: Fixup some appendPQExpBuffer() calls

2025-04-18 Thread David Rowley
On Thu, 17 Apr 2025 at 19:50, Daniel Gustafsson  wrote:
> > On 17 Apr 2025, at 01:44, David Rowley  wrote:
> > 1) Commit the attached to master
> > 2) Do nothing.
> >
> > I'd like to do #1.
>
> I vote for #1 as well.

Thanks for the judgment sense check.  I suspect anyone who preferred #2
would have come forward by now, so I've pushed the changes.

David




Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-04-18 Thread David Rowley
On Wed, 16 Apr 2025 at 10:09, Ilia Evdokimov
 wrote:
> I've prepared the updated patches as discussed, including the addition
> of estimated lookups in the EXPLAIN output for Memoize. Please let me
> know if you have any feedback or further suggestions.

While this is fresh, as I might forget before the July CF...

I know I did suggest that the hit_ratio should be a percent and it
should be multiplied by 100.0.  This seems fine for the text format as
you can have the % unit there. However, looking at
ExplainPropertyFloat(), we don't print the unit for non-text formats.
I wonder if printing a number between 0 and 100 there will be
confusing. I don't believe we have anything else in EXPLAIN that shows
a percentage. I don't quite know what to do about this. One thought I
had was to only * 100 for text format, but that might be more
confusing.

Aside from that, I'd prefer the new fields in struct Memoize to be
prefixed with "est_" the same as the existing "est_entries" field. I'm
unsure why MemoizePath.calls becomes Memoize.lookups. Seems
unnecessary and just means more brain space being used to maintain a
mental model.

David




Re: Changing shared_buffers without restart

2025-04-18 Thread Thomas Munro
On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro  wrote:
> On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Thanks for sharing. I need to do more thorough tests, but after a quick
> > look I'm not sure about that. ftruncate will take care about the memory,
> > but AFAICT the memory mapping will stay the same, is that what you mean?
> > In that case if the segment got increased, the memory still can't be
> > used because it's beyond the mapping end (at least in my test that's
> > what happened). If the segment got shrinked, the memory couldn't be
> > reclaimed, because, well, there is already a mapping. Or do I miss
> > something?
>
> I was imagining that you might map some maximum possible size at the
> beginning to reserve the address space permanently, and then adjust
> the virtual memory object's size with ftruncate as required to provide
> backing.  Doesn't that achieve the goal with fewer steps, using only
> portable* POSIX stuff, and keeping all pointers stable?  I understand
> that pointer stability may not be required (I can see roughly how that
> argument is constructed), but isn't it still better to avoid having to
> prove that and deal with various other problems completely?  Is there
> a downside/cost to having a large mapping that is only partially
> backed?  I suppose choosing that number might offend you but at least
> there is an obvious upper bound: physical memory size.

TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
effect on Linux, as if you had called ftruncate on it (I expected fully
allocated huge pages up to the object's size, just not magical size
changes beyond that when I merely asked to map it).  That doesn't
happen for the regular page size, or for any page size on my local OS's
shm objects, and it doesn't seem to fit mmap's job description given an
fd*, but maybe I'm just confused.  Anyway, a workaround seems to be
to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
| PROT_WRITE) new regions after extending with ftruncate(), at least
in simple tests...

(*Hmm, wild uninformed speculation: perhaps the size-setting behaviour
needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is
being exposed also when a hugetlbfs fd is given explicitly to mmap,
generating this bizarro side effect?)
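For what it's worth, here is a minimal standalone sketch of the sequence I
mean (reserve the maximum range up front with PROT_NONE/MAP_NORESERVE, back it
with ftruncate(), then mprotect() the newly backed region).  It assumes Linux
and memfd_create(), and it only demonstrates the system-call dance, not how
the actual patch wires this into PostgreSQL's shared memory code:

/*
 * Sketch: grow a shared memory object while keeping its base address stable.
 * Assumes Linux; a real implementation might also fallocate() after
 * ftruncate() to avoid SIGBUS on first touch if backing can't be allocated
 * (see the footnote above).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t      max_size = (size_t) 8 << 30;    /* reserve 8 GB of addresses */
    size_t      old_size = (size_t) 256 << 20;  /* start with 256 MB backed */
    size_t      new_size = (size_t) 512 << 20;  /* later grow to 512 MB */
    int         fd = memfd_create("shmem_resize_sketch", 0);
    char       *base;

    if (fd < 0)
    {
        perror("memfd_create");
        return 1;
    }

    /* Reserve the whole range: no backing yet, pointers stay stable. */
    base = mmap(NULL, max_size, PROT_NONE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (base == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /* Back the initial size and make it accessible. */
    if (ftruncate(fd, old_size) < 0 ||
        mprotect(base, old_size, PROT_READ | PROT_WRITE) < 0)
    {
        perror("initial backing");
        return 1;
    }
    memset(base, 0, old_size);

    /* "Grow shared_buffers": extend the object, then open up the new tail. */
    if (ftruncate(fd, new_size) < 0 ||
        mprotect(base + old_size, new_size - old_size, PROT_READ | PROT_WRITE) < 0)
    {
        perror("grow");
        return 1;
    }
    memset(base + old_size, 0, new_size - old_size);

    printf("resized %zu MB -> %zu MB, base %p unchanged\n",
           old_size >> 20, new_size >> 20, (void *) base);
    return 0;
}

Shrinking would presumably be the reverse: mprotect(PROT_NONE) the tail and
ftruncate() the object back down, letting the kernel discard the pages beyond
the new end.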




Re: magical eref alias names

2025-04-18 Thread Robert Haas
On Fri, Jan 3, 2025 at 8:44 AM Robert Haas  wrote:
> > Actually, I noticed that we are failing to honor that in the places
> > where we inject "*SELECT*" and "*SELECT* %d" names, because that
> > code puts those names into RTE->alias not only RTE->eref.
> > I experimented with the attached patch to not do that anymore,
> > which is sort of a subset of what you did but just focused on
> > not lying about what's generated versus user-written.  We could
> > alternatively keep the current generated names by extending
> > addRangeTableEntryForSubquery's API so that alias and generated eref
> > are passed separately.  (I didn't look to see if anyplace else
> > is messing up this distinction similarly.)
>
> Hmm, I definitely like not lying about what is generated vs. what is
> user-written. I don't have a strong opinion right now on the best way
> of accomplishing that.

I rediscovered, or re-encountered, this problem today, which motivated
me to have a closer look at your (Tom's) patch. My feeling is that
it's the right approach. I agree that we could try to keep the current
generated names by extending addRangeTableEntryForSubquery, but I'm
tentatively inclined to think that we shouldn't. Perhaps that's partly
a stylistic preference on my part: I think unnamed_subquery_N looks
nicer than "*SELECT * N", but there's also something to be said for
keeping the code simple. I think it would be reasonable to instead
extend addRangeTableEntryForSubquery if we find that the naming change
breaks something, but if it doesn't then I like what you've done
better. There's also an argument from consistency: even without the
patch, unnamed_subquery leaks out in some contexts, and I think it's
nicer to use unnamed_subquery everywhere than to use that in some
places and *SELECT* in others.

I then went looking for other places that have similar issues. I
pretty quickly discovered that convert_ANY_sublink_to_join is another
caller of addRangeTableEntryForSubquery that is fabricating an alias
when it could just pass NULL; in that case, the fabricated name is
ANY_subquery. Also, it seems like the recursive CTE stuff can just set
the alias to NULL and the eref as it's currently doing, instead of
setting both alias and eref as the code does currently. So, PFA 0001,
a rebased version of Tom's patch; 0002, addressing ANY_subquery; and
0003, addressing *TLOCRN* and *TROCRN*. If we decide to adopt all of
these, we may want to do some squashing before commit, but we have a
little time to figure that out since I think this is v19 material
anyway.

Apart from those cases, and at least AFAICS, everything that's using a
wholly fabricated name seems to be either (1) a case where we intend
for the name to be referenceable (old, new, excluded) or (2) a value
that is assigned only to eref and not to alias (e.g. *GROUP*).
However, I did come across one other mildly interesting case.
expand_single_inheritance_child has this:

/*
 * We just duplicate the parent's table alias name for each child.  If the
 * plan gets printed, ruleutils.c has to sort out unique table aliases to
 * use, which it can handle.
 */
childrte->alias = childrte->eref = makeAlias(parentrte->eref->aliasname,
 child_colnames);

What I find curious about this is that we're assigning the parent's
eref to both the child's eref and the child's alias. Maybe there's
something I don't understand here, or maybe it just doesn't matter,
but why wouldn't we assign eref to eref and alias to alias? Or even
alias to alias and generate a new eref?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0002-Don-t-generate-fake-ANY_subquery-aliases-either.patch
Description: Binary data


v2-0003-Don-t-generate-fake-TLOCRN-or-TROCRN-aliases-eith.patch
Description: Binary data


v2-0001-Don-t-generate-fake-SELECT-or-SELECT-d-subquery-a.patch
Description: Binary data


Re: Changing shared_buffers without restart

2025-04-18 Thread Konstantin Knizhnik



On 18/04/2025 12:26 am, Dmitry Dolgov wrote:

On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:

1. Performance of Postgres CLOCK page eviction algorithm depends on number
of shared buffers. My first native attempt just to mark unused buffers as
invalid cause significant degrade of performance

Thanks for sharing!

Right, but it concerns the case when the number of shared buffers is
high, independently from whether it was changed online or with a
restart, correct? In that case it's out of scope for this patch.


2. There are several data structures i Postgres which size depends on number
of buffers.
In my patch I used in some cases dynamic shared buffer size, but if this
structure has to be allocated in shared memory then still maximal size has
to be used. We have the buffers themselves (8 kB per buffer), then the main
BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
128 bytes per 8kb bytes seems to  large overhead (~1%) but but it may be
quote noticeable with size differences larger than 2 orders of magnitude:
E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd
have ~2GiB of static overhead on only 0.5GiB of actual buffers.

Not sure what do you mean by using a maximal size, can you elaborate.

In the current patch those structures are allocated as before, except
each goes into a separate segment -- without any extra memory overhead
as far as I see.


Thank you for the explanation. I am sorry that I had not investigated your
patch carefully before writing: I assumed that you were placing only the
content of the shared buffers in a separate segment.
Now I see that I was wrong, and this is actually the main difference from the
memory ballooning approach I have used. Since you are allocating the buffer
descriptors and the hash table in the same segment, there is no extra memory
overhead.
The only drawback is that we lose the content of shared buffers on a resize.
That may be sad, but it doesn't look like there is a better alternative.


But there are still some dependencies on shared buffers size which are
not addressed in this PR.
I am not sure how critical they are, or whether it is possible to do something
about them here, but at least I want to enumerate them:


1. Checkpointer: the maximal number of checkpointer requests depends on
NBuffers. So if we start with small shared buffers and then scale up, it
may cause too-frequent checkpoints:


Size
CheckpointerShmemSize(void)
...
        size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));

CheckpointerShmemInit(void)
    CheckpointerShmem->max_requests = NBuffers;

2. XLOG: number of xlog buffers is calculated depending on number of 
shared buffers:


XLOGChooseNumBuffers(void)
{
...
 xbuffers = NBuffers / 32;

This should not cause any errors, but it may not be very efficient if, once
again, we start with tiny shared buffers.


3. AIO: AIO max concurrency is also calculated based on number of shared 
buffers:


AioChooseMaxConcurrency(void)
{
...

    max_proportional_pins = NBuffers / max_backends;

For small shared buffers (e.g. 1MB), there will be no concurrency at all.

So none of these issues causes an error, just some inefficient behavior.
But if we want to start with very small shared buffers and then increase
them on demand, it can be a problem.

In all three cases NBuffers is used not just to calculate some threshold
value, but also to determine the size of a structure in shared memory.
The straightforward solution is to place them in the same segment as the
shared buffers, but I am not sure how difficult that will be to implement.









Re: disabled SSL log_like tests

2025-04-18 Thread Tom Lane
I wrote:
> What I think happened here is that a previous backend hadn't exited
> yet when we start the test, and when its report does come out,
> connect_fails prematurely stops waiting.  (In the above, evidently
> the child process we want to wait for is 599, but we're fooled by
> a delayed report for 25401.)  So my v1 patch needs work.

> Maybe we can make the test compare the PIDs in the "forked new client
> backend" and "client backend exited" log messages.  Stay tuned...

Okay, this version seems more reliable.

regards, tom lane

diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl
index 756b4146050..37d96d95a1a 100644
--- a/src/test/authentication/t/001_password.pl
+++ b/src/test/authentication/t/001_password.pl
@@ -66,6 +66,8 @@ sub test_conn
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 # Test behavior of log_connections GUC
diff --git a/src/test/authentication/t/003_peer.pl b/src/test/authentication/t/003_peer.pl
index 69ba73bd2b9..2879800eacf 100644
--- a/src/test/authentication/t/003_peer.pl
+++ b/src/test/authentication/t/003_peer.pl
@@ -72,6 +72,8 @@ sub test_role
 my $node = PostgreSQL::Test::Cluster->new('node');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 # Set pg_hba.conf with the peer authentication.
diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 6748b109dec..2dc6bec9b89 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -66,6 +66,7 @@ $node->append_conf(
 listen_addresses = '$hostaddr'
 krb_server_keyfile = '$krb->{keytab}'
 log_connections = on
+log_min_messages = debug2
 lc_messages = 'C'
 });
 $node->start;
diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl
index 352b0fc1fa7..d1315ed5351 100644
--- a/src/test/ldap/t/001_auth.pl
+++ b/src/test/ldap/t/001_auth.pl
@@ -48,6 +48,8 @@ note "setting up PostgreSQL instance";
 my $node = PostgreSQL::Test::Cluster->new('node');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 $node->safe_psql('postgres', 'CREATE USER test0;');
diff --git a/src/test/modules/oauth_validator/t/001_server.pl b/src/test/modules/oauth_validator/t/001_server.pl
index d88994abc24..4f035417a40 100644
--- a/src/test/modules/oauth_validator/t/001_server.pl
+++ b/src/test/modules/oauth_validator/t/001_server.pl
@@ -48,6 +48,8 @@ $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
 $node->append_conf('postgresql.conf',
 	"oauth_validator_libraries = 'validator'\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 $node->safe_psql('postgres', 'CREATE USER test;');
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 8759ed2cbba..4b858900139 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2618,13 +2618,19 @@ to fail.
 
 =item expected_stderr => B
 
-If this regular expression is set, matches it with the output generated.
+If this regular expression is set, matches it to the output generated
+by B.
 
 =item log_like => [ qr/required message/ ]
 
 =item log_unlike => [ qr/prohibited message/ ]
 
-See C.
+See C.  CAUTION: use of either option requires that
+the server's log_min_messages be at least DEBUG2, and that no other
+client backend is launched concurrently.  These requirements allow
+C to wait to see the postmaster-log report of backend
+exit, without which there is a race condition as to whether we will
+see the expected backend log output.
 
 =back
 
@@ -2652,7 +2658,14 @@ sub connect_fails
 		like($stderr, $params{expected_stderr}, "$test_name: matches");
 	}
 
-	$self->log_check($test_name, $log_location, %params);
+	if (defined($params{log_like}) or defined($params{log_unlike}))
+	{
+		$self->wait_for_log(
+			qr/DEBUG:  (?:0: )?forked new client backend, pid=(\d+) socket.*DEBUG:  (?:0: )?client backend \(PID \1\) exited with exit code \d/s,
+			$log_location);
+
+		$self->log_check($test_name, $log_location, %params);
+	}
 }
 
 =pod
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 086abf3b8b3..8b0de2d8e7e 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -60,6 +60,8 @@ my $common_connstr;
 note "setting up data dire

Re: Changing shared_buffers without restart

2025-04-18 Thread Andres Freund
Hi, 

On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro  
wrote:
> Doesn't that achieve the goal with fewer steps, using only
>portable* POSIX stuff, and keeping all pointers stable?  I understand
>that pointer stability may not be required (I can see roughly how that
>argument is constructed), but isn't it still better to avoid having to
>prove that and deal with various other problems completely?  

I think we should flat out reject any approach that does not maintain pointer
stability.  It would restrict future optimizations a lot if we can't rely on
it (e.g. not materializing tuples when transporting them from worker to
leader; pointer-based data structures in shared buffers).

Greetings, 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [BUG] temporary file usage report with extended protocol and unnamed portals

2025-04-18 Thread Frédéric Yhuel



On 4/18/25 10:49, Frédéric Yhuel wrote:

Hi,

It seems there's a bug in the logging of temporary file usage when the 
extended protocol is used with unnamed portals.




FWIW, the attached patch seems to fix the problem.

Best regards,
FrédéricFrom afb228f07f847c467ba05dbe204861e7be2ffc32 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Fri, 18 Apr 2025 13:20:52 +0200
Subject: [PATCH] fix reporting of temp files usage

when extended protocol is used with unnamed portals
---
 src/backend/utils/mmgr/portalmem.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index e3526e78064..246e711db81 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -27,6 +27,7 @@
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
+#include "tcop/tcopprot.h"
 
 /*
  * Estimate of the maximum number of open portals a user would have,
@@ -488,6 +489,9 @@ PortalDrop(Portal portal, bool isTopCommit)
 (errcode(ERRCODE_INVALID_CURSOR_STATE),
  errmsg("cannot drop active portal \"%s\"", portal->name)));
 
+	if (portal->queryDesc)
+		debug_query_string = portal->queryDesc->sourceText;
+
 	/*
 	 * Allow portalcmds.c to clean up the state it knows about, in particular
 	 * shutting down the executor if still active.  This step potentially runs
-- 
2.47.2



Re: ALTER COLUMN SET DATA TYPE does not change the generation expression's collation

2025-04-18 Thread jian he
On Wed, Mar 26, 2025 at 1:01 PM jian he  wrote:
>
> hi.
>
> ATPrepAlterColumnType forbids us to ALTER COLUMN SET DATA TYPE USING (expr)
> for generated columns.
> however we can still change the generated column type from non-text to text
> or text type from one collation to another collation.
>
> In ATExecAlterColumnType, we also need to set the generation
> expression collation?
>
> We can do this by adding exprSetCollation:
>
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -14115,6 +14115,7 @@ ATExecAlterColumnType(AlteredTableInfo *tab,
> Relation rel,
>  errmsg("default for
> column \"%s\" cannot be cast automatically to type %s",
>
> colName, format_type_be(targettype;
> }
> +   exprSetCollation(defaultexpr, targetcollid);
>
>
> -
> CREATE TABLE x1(a int,
> b int GENERATED ALWAYS AS (a * 2) stored,
> c text GENERATED ALWAYS AS ('1') stored );
> ALTER TABLE x1 alter column b set data type text collate "C";
> ALTER TABLE x1 alter column c set data type text collate "C";
>
> SELECT pg_get_expr(d.adbin, d.adrelid) AS default_value, d.adbin
> FROM   pg_catalog.pg_attribute a
> JOIN   pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum) = (d.adrelid, 
> d.adnum)
> ANDa.attrelid = 'x1'::regclass
> ANDa.attname in ('b', 'c');
> by adding exprSetCollation, the output is
>
> default_value | (a * 2)
> adbin | {COERCEVIAIO :arg {OPEXPR :opno 514 :opfuncid 141
> :opresulttype 23 :opretset false :opcollid 0 :inputcollid 0 :args
> ({VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0
> :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1
> :varattnosyn 1 :location -1} {CONST :consttype 23 :consttypmod -1
> :constcollid 0 :constlen 4 :constbyval true :constisnull false
> :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]}) :location -1}
> :resulttype 25 :resultcollid 950 :coerceformat 2 :location -1}
> -[ RECORD 2 
> ]-+
> default_value | '1'::text COLLATE "C"
> adbin | {CONST :consttype 25 :consttypmod -1 :constcollid 950
> :constlen -1 :constbyval false :constisnull false :location -1
> :constvalue 5 [ 20 0 0 0 49 ]}
>
>
> master behavior:
>
> default_value | (a * 2)
> adbin | {COERCEVIAIO :arg {OPEXPR :opno 514 :opfuncid 141
> :opresulttype 23 :opretset false :opcollid 0 :inputcollid 0 :args
> ({VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0
> :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1
> :varattnosyn 1 :location -1} {CONST :consttype 23 :consttypmod -1
> :constcollid 0 :constlen 4 :constbyval true :constisnull false
> :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]}) :location -1}
> :resulttype 25 :resultcollid 0 :coerceformat 2 :location -1}
> -[ RECORD 2 
> ]-+--
> default_value | '1'::text
> adbin | {CONST :consttype 25 :consttypmod -1 :constcollid 100
> :constlen -1 :constbyval false :constisnull false :location -1
> :constvalue 5 [ 20 0 0 0 49 ]}


I still think this is a bug, so I put it on the

https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items#Older_bugs_affecting_stable_branches




Re: Built-in Raft replication

2025-04-18 Thread Alastair Turner
Hi Konstantin

On Wed, 16 Apr 2025 at 15:07, Konstantin Osipov 
wrote:

> * Alastair Turner  [25/04/16 15:58]:
>
> > > > If you use build-in failover you have to resort to 3 big Postgres
> > > machines because you need 2/3 majority. Of course, you can install
> > > MySQL-stype arbiter - host that had no real PGDATA, only participates
> in
> > > voting. But this is a solution to problem induced by built-in
> autofailover.
> > >
> > > Users find it a waste of resources to deploy 3 big PostgreSQL
> > > instances just for HA where 2 suffice even if they deploy 3
> > > lightweight DCS instances. Having only some of the nodes act as DCS
> > > and others purely PostgreSQL nodes will reduce waste of resources.
> > >
> > > The experience of other projects/products with automated failover
> based on
> > quorum shows that this is a critical issue for adoption. In the In-memory
> > Data Grid space (Coherence, Geode/GemFire) the question of how to ensure
> > that some nodes didn't carry any data comes up early in many architecture
> > discussions. When RabbitMQ shipped their Quorum Queues feature, the first
> > and hardest area of pushback was around all nodes hosting message
> content.
> >
> > It's not just about the requirement for compute resources, it's also
> about
> > bandwidth and latency. Many large organisations have, for historical
> > reasons, pairs of data centres with very good point-to-point
> connectivity.
> > As the requirement for quorum witnesses has come up for all sorts of
> > things, including storage arrays, they have built arbiter/witness sites
> at
> > branches, colocation providers or even on the public cloud. More than not
> > holding user data or processing queries, the arbiter can't even be sent
> the
> > replication stream for the user data in the database, it just won't fit
> > down the pipe.
> >
> > Which feels like a very difficult requirement to meet if the replication
> > model for all data is being changed to a quorum model.
>
> I agree master/replica deployment layouts are very popular and are
> not going to directly benefit from raft. They'll still work, but no
> automation will be available, just like today with Patroni.
>
> Users of Patroni and etcd setups can get automation for two-site
primary/replica pairs by putting a third etcd node on a third site. Which
only requires moving the membership/leadership data to the arbiter site,
not all database activity.


> However, if the storage cost is an argument, then the logical path is to
> disaggregate storage/compute altogether, i.e. use projects like
> neon.
>
> The issue is not generally storage, but network. There may simply not be
enough bandwidth available to transmit the whole WAL to the arbiter site.

Many on-premises IT setups have this limitation in some form.

If your proposal would leave these large, traditional user organisations
(which account for thousands of Postgres HA pairs or DR pairs) doing what
they currently do with wraparound tooling like Patroni, and create a new,
in core, option for balanced 3, 5, 7... member groups, then I don't think
it's worth doing.

Regards,
Alastair


Re: jsonapi: scary new warnings with LTO enabled

2025-04-18 Thread Tom Lane
Jacob Champion  writes:
> On Wed, Apr 16, 2025 at 4:04 PM Tom Lane  wrote:
>> Looking through all of the callers of freeJsonLexContext, quite
>> a lot of them use local JsonLexContext structs, and probably some
>> of them are more performance-critical than these.  So that raises
>> the question of why are we seeing warnings for only these call
>> sites?

> Yeah, I had the same question...

After making another pass through the callers of freeJsonLexContext,
I observe that the warnings appear in callers that use a local
variable *and* contain goto statements.  So I'm betting that the
presence of goto's causes the LTO optimizer to pull in its horns
quite a bit and thereby fail to detect the flag correlation.

>> Maybe there is a more elegant way to suppress them.

> Can we brute-force ignore this particular warning site with a #pragma
> (suggested in [1])?

That's surely not elegant :-(.  However, I don't especially want to
rewrite away the goto's in these callers ...

regards, tom lane




Re: jsonapi: scary new warnings with LTO enabled

2025-04-18 Thread Jacob Champion
On Thu, Apr 17, 2025 at 8:20 AM Tom Lane  wrote:
> I confirm this silences those warnings on my Fedora 41 box.

Instead of doing

lex = calloc(...);
/* (error out on NULL return) */
makeJsonLexContextCstringLen(lex, ...);

we need to do

lex = makeJsonLexContextCstringLen(NULL, ...);
/* (error out on NULL return) */

so that JSONLEX_FREE_STRUCT is set correctly. Otherwise we'll leak the
main allocation:

==12550==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 120 byte(s) in 1 object(s) allocated from:
#0 0xe34d2a84 in __interceptor_calloc
(/home/jacob/src/postgres/worktree-oauth/build-clang/src/interfaces/libpq/fuzz_libpq_handle_oauth_sasl_error+0x112a84)
(BuildId: 359bf20b63a97771ccb3bd2c238485920485521f)
#1 0xe3510ff0 in handle_oauth_sasl_error
/home/jacob/src/postgres/worktree-oauth/build-clang/../src/interfaces/libpq/fe-auth-oauth.c:511:8
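For clarity, here is the corrected pattern in full (a minimal, self-contained
illustration rather than the actual fe-auth-oauth.c code; it assumes the
jsonapi interfaces declared in common/jsonapi.h, and the caller is made up):

/*
 * Passing NULL lets makeJsonLexContextCstringLen() allocate the context
 * itself and set JSONLEX_FREE_STRUCT, so freeJsonLexContext() later frees
 * the struct as well as its internals -- no leak, no double free.
 */
#include "postgres_fe.h"
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"

static bool
parse_some_json(const char *json)
{
    JsonLexContext *lex;
    JsonParseErrorType err;

    lex = makeJsonLexContextCstringLen(NULL, json, strlen(json),
                                       PG_UTF8, false /* need_escapes */ );
    if (lex == NULL)
        return false;           /* out of memory */

    err = pg_parse_json(lex, &nullSemAction);

    freeJsonLexContext(lex);    /* frees the internals and the struct */
    return err == JSON_SUCCESS;
}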

> I'm content to do it like this, but maybe Jacob wants to
> investigate alternatives?

I was more worried about it when you said you wanted to get rid of the
stack allocation API. (I like having the flexibility to choose between
the two forms, not just for performance but also for struct
embedding.) But I'm perfectly happy with just adjusting these sites.

Thanks!
--Jacob




SQL functions: avoid making a tuplestore unnecessarily

2025-04-18 Thread Tom Lane
(I'm not proposing this for v18, but I wanted to get the patch
written while functions.c is still fresh in mind.)

The attached patch changes functions.c so that we only make a
tuplestore object if we're actually going to return multiple
rows in it, that is if it's a set-returning function and we
cannot use "lazyEval" mode.  Otherwise, we can just rely on
the result slot we made anyway to hold the one tuple we need.

This saves a small number of cycles by not shoving a tuple into the
tuplestore only to pull it right back out.  But the real reason for
changing it is not speed but resource-management worries.  The
existing code (as it was before 0dca5d68d and is again as of
0313c5dc6) makes the tuplestore in the potentially long-lived fn_mcxt.
This'd be merely a slight annoyance if a tuplestore were purely a
memory object, but it isn't: it might contain an open file.  If it
does, that file reference will be backed by a ResourceOwner,
specifically whichever ResourceOwner was CurrentResourceOwner at the
time of creating the tuplestore.  What I am afraid of is that I don't
think there's any guarantee that that ResourceOwner is as long-lived
as the FmgrInfo.  It should be fine within normal SQL query execution,
where the FmgrInfo will be part of the executor's state tree.  But
there are long-lived FmgrInfos, such as those within the typcache, or
within an index's relcache entry.  What if somebody tries to use a
SQL function to implement functionality that's reached through those
mechanisms?

Given the lack of field complaints over the many years it's been
like this, there doesn't seem to be a live problem.  I think the
explanation for that is
(1) those mechanisms are never used to call set-returning functions,
(2) therefore, the tuplestore will not be called on to hold more
than one result row,
(3) therefore, it won't get large enough that it wants to spill
to disk,
(4) therefore, its potentially dangling resowner pointer is never
used.
However, this is an uncomfortably shaky chain of assumptions.
I want to cut it down by fixing things so that there is no
tuplestore, period, in a non-set-returning SQL function.

Furthermore, this patch changes things so that when we do make a
tuplestore, it lives in the per-query context not the fn_mcxt,
providing additional surety that it won't outlive its ResourceOwner.
This is a change I made in 0dca5d68d and then had to undo for
performance reasons, but because of these resource-lifetime
considerations I really want it back that way again.

Since I think there is probably not a reachable bug here today,
I'm going to park this for v19.

regards, tom lane

diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index e0bca7cb81c..8d4d062d579 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -44,7 +44,7 @@
 typedef struct
 {
 	DestReceiver pub;			/* publicly-known function pointers */
-	Tuplestorestate *tstore;	/* where to put result tuples */
+	Tuplestorestate *tstore;	/* where to put result tuples, or NULL */
 	JunkFilter *filter;			/* filter to convert tuple type */
 } DR_sqlfunction;
 
@@ -145,11 +145,13 @@ typedef struct SQLFunctionCache
 	bool		lazyEvalOK;		/* true if lazyEval is safe */
 	bool		shutdown_reg;	/* true if registered shutdown callback */
 	bool		lazyEval;		/* true if using lazyEval for result query */
+	bool		randomAccess;	/* true if tstore needs random access */
 	bool		ownSubcontext;	/* is subcontext really a separate context? */
 
 	ParamListInfo paramLI;		/* Param list representing current args */
 
-	Tuplestorestate *tstore;	/* where we accumulate result tuples */
+	Tuplestorestate *tstore;	/* where we accumulate result for a SRF */
+	MemoryContext tscontext;	/* memory context that tstore should be in */
 
 	JunkFilter *junkFilter;		/* will be NULL if function returns VOID */
 	int			jf_generation;	/* tracks whether junkFilter is up-to-date */
@@ -1250,7 +1252,7 @@ static void
 postquel_start(execution_state *es, SQLFunctionCachePtr fcache)
 {
 	DestReceiver *dest;
-	MemoryContext oldcontext;
+	MemoryContext oldcontext = CurrentMemoryContext;
 
 	Assert(es->qd == NULL);
 
@@ -1296,12 +1298,27 @@ postquel_start(execution_state *es, SQLFunctionCachePtr fcache)
 		fcache->ownSubcontext = false;
 	}
 
+	/*
+	 * Build a tuplestore if needed, that is if it's a set-returning function
+	 * and we're producing the function result without using lazyEval mode.
+	 */
+	if (es->setsResult)
+	{
+		Assert(fcache->tstore == NULL);
+		if (fcache->func->returnsSet && !es->lazyEval)
+		{
+			MemoryContextSwitchTo(fcache->tscontext);
+			fcache->tstore = tuplestore_begin_heap(fcache->randomAccess,
+   false, work_mem);
+		}
+	}
+
 	/* Switch into the selected subcontext (might be a no-op) */
-	oldcontext = MemoryContextSwitchTo(fcache->subcontext);
+	MemoryContextSwitchTo(fcache->subcontext);
 
 	/*
-	 * If this query produces the function result, send its output to the
-	 

Re: Missing comma in libpq.sgml

2025-04-18 Thread Daniel Gustafsson
> On 17 Apr 2025, at 13:30, Tatsuo Ishii  wrote:
> 
> In the "Asynchronous Command Processing" section of libpq.sgml,
> there's a line which misses a comma at the end. Attached patch should
> fix it.  I am going to push attached patch into master and
> REL_17_STABLE branches if there's no objection.

+1

--
Daniel Gustafsson





Re: Changing shared_buffers without restart

2025-04-18 Thread Konstantin Knizhnik


On 25/02/2025 11:52 am, Dmitry Dolgov wrote:

On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
changing shared memory mapping layout. Any feedback is appreciated.


Hi Dmitry,

I am sorry that I have not participated in the discussion in this thread 
from the very beginning, although I am also very interested in dynamic 
shared buffer resizing and even proposed my own implementation of it: 
https://github.com/knizhnik/postgres/pull/2, based on memory ballooning 
and using `madvise`. And it really works (it returns unused memory to the 
system).

This PoC allowed me to understand the main drawbacks of this approach:

1. The performance of Postgres' CLOCK page eviction algorithm depends on 
the number of shared buffers. My first naive attempt, which just marked 
unused buffers as invalid, caused a significant performance degradation with


pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S

(here shared_buffers is the maximal shared buffers size and 
`available_buffers` is the used part):


| shared_buffers | available_buffers | TPS  |
|----------------|-------------------|------|
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128Mb             | 130k |
| 2GB            | 1Gb               | 311k |
| 32GB           | 128Mb             | 13k  |
| 32GB           | 1Gb               | 140k |
| 32GB           | 2Gb               | 348k |


My first thought was to replace the clock with an LRU based on a 
doubly-linked list. As there is no lockless doubly-linked list 
implementation, it needs some global lock, and that lock can become a 
bottleneck. The standard solution is partitioning: use N LRU lists instead 
of 1, just like the partitioned hash table used by the buffer manager to 
look up buffers. Actually, we can use the same partition locks to protect 
the LRU lists. But it is not clear what to do with ring buffers 
(strategies). So I decided not to perform such a revolution in bufmgr, but 
to optimize the clock so that it skips reserved buffers more efficiently. 
Just add a `skip_count` field to the buffer descriptor. And it helps! Now 
the worst case, shared_buffers/available_buffers = 32GB/128MB, shows the 
same 280k TPS as shared_buffers=128MB without ballooning.
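
For illustration, a minimal standalone sketch of that idea (not the actual 
patch; the struct and field names here are made up for the example):

```
#include <stdbool.h>

/*
 * Hypothetical sketch of the skip_count idea: descriptors of ballooned-out
 * buffers record the length of the unavailable run starting at their
 * position (>= 1), so the sweep jumps over the whole run at once instead
 * of visiting every invalid buffer.
 */
typedef struct BufferDescSketch
{
	bool	available;		/* false if the buffer is ballooned out */
	int		skip_count;		/* length of unavailable run starting here */
	int		usage_count;
} BufferDescSketch;

static int
clock_sweep_next(BufferDescSketch *descs, int nbuffers, int hand)
{
	for (;;)
	{
		hand = (hand + 1) % nbuffers;

		if (!descs[hand].available)
		{
			/* jump past the whole unavailable run in one step */
			hand = (hand + descs[hand].skip_count - 1) % nbuffers;
			continue;
		}

		if (descs[hand].usage_count == 0)
			return hand;		/* victim found */

		descs[hand].usage_count--;	/* give it another chance */
	}
}
```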

2. There are several data structures in Postgres whose size depends on the 
number of buffers. In my patch I used the dynamic shared buffer size in 
some cases, but if a structure has to be allocated in shared memory, its 
maximal size still has to be used. We have the buffers themselves (8 kB per 
buffer), then the main BufferDescriptors array (64 B), the BufferIOCVArray 
(16 B), checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer 
cache (24 B + 8 B/entry). 128 bytes per 8 kB buffer does not seem a large 
overhead (~1%), but it may be quite noticeable when the sizes differ by 
more than two orders of magnitude. E.g. to support scaling from 0.5 GB to 
128 GB, with 128 bytes/buffer (128 GiB / 8 kB = ~16.8M buffers, times 128 B 
= ~2 GiB) we'd have ~2 GiB of static overhead on only 0.5 GiB of actual 
buffers.


3. `madvise` is not portable.

Certainly you have moved much further in your proposal compared with my 
PoC (including huge pages support). But it is still not quite clear to me 
how you are going to solve the problem of large memory overhead in the 
case of a ~100x variation in shared buffers size.



Re: Changing shared_buffers without restart

2025-04-18 Thread Ni Ku
Hi Ashutosh / Dmitry,

Thanks for the information and discussions, it's been very helpful.

I also have a related question about how ftruncate() is used in the patch.
In my testing I also see that when using ftruncate to shrink a shared
segment, the memory is freed immediately after the call, even if other
processes still have that memory mapped, and they will hit SIGBUS if they
try to access that memory again as the manpage says.

So am I correct to think that, to support the bufferpool shrinking case, it
would not be safe to call ftruncate in AnonymousShmemResize as-is, since at
that point other processes may still be using pages that belong to the
truncated memory?
It appears that for shrinking we should only call ftruncate when we're sure
no process will access those pages again (eg, all processes have handled
the resize interrupt signal barrier). I suppose this can be done by the
resize coordinator after synchronizing with all the other processes.
But in that case it seems we cannot use the postmaster as the coordinator
then? b/c I see some code comments saying the postmaster does not have
waiting infrastructure... (maybe even if the postmaster has waiting infra
we don't want to use it anyway since it can be blocked for a long time and
won't be able to serve other requests).
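
A minimal standalone sketch of that behavior (assumes Linux and
memfd_create; error handling omitted):

```
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
	size_t	full = 16 * 1024 * 1024;
	int		fd = memfd_create("demo_segment", 0);
	char   *p;

	ftruncate(fd, full);
	p = mmap(NULL, full, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	p[full - 1] = 1;			/* fine: backing pages exist */

	ftruncate(fd, full / 2);	/* shrink; tail pages are freed immediately */

	p[full - 1] = 2;			/* SIGBUS: the page is beyond the new size,
								 * even though the mapping itself survives */
	printf("not reached\n");
	return 0;
}
```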

Regards,

Jack Ng


Re: Changing shared_buffers without restart

2025-04-18 Thread Thomas Munro
On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut  wrote:
> On 19.11.24 14:29, Dmitry Dolgov wrote:
> >> I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
> >> that interacts with the MAP_HUGETLB flag for mmap().  Do you need to 
> >> specify
> >> both of them if you want huge pages?
> > Correct, both (one flag in memfd_create and one for mmap) are needed to
> > use huge pages.
>
> I was worried because the FreeBSD man page says
>
> MFD_HUGETLB   This flag is currently unsupported.
>
> It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
>
> But you should make sure in your patch that the right set of flags for
> huge pages is passed.

MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't
admit it (guessing an oversight, not sure, will see).  And you don't
need the corresponding (non-existent) mmap flag.  You also have to
specify a size eg MFD_HUGETLB | MFD_HUGE_2MB or you get ENOTSUPP, but
other than that quirk I see it definitely working with eg procstat -v.
That might be because FreeBSD doesn't have a default huge page size
concept?  On Linux that's a boot time setting, I guess rarely changed.
I contemplated that once before, when I wrote a quick demo patch[1] to
implement huge_pages=on for FreeBSD (ie explicit rather than
transparent).  I used a different function, not the Linuxoid one but
it's the same under the covers, and I wrote:

+ /*
+ * Find the matching page size index, or if huge_page_size wasn't set,
+ * then skip the smallest size and take the next one after that.
+ */
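
For reference, the ENOTSUPP quirk above boils down to something like this
(a sketch assuming the platform headers define the MFD_* flags; not tested
on FreeBSD):

```
#define _GNU_SOURCE
#include <sys/mman.h>

static int
create_segment_fd(void)
{
	/* the huge page size must be encoded in the flags, not just MFD_HUGETLB */
	int		fd = memfd_create("shared_buffers", MFD_HUGETLB | MFD_HUGE_2MB);

	if (fd < 0)
		fd = memfd_create("shared_buffers", 0); /* fall back to regular pages */
	return fd;
}
```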

Swapping that topic back in, I was left wondering: (1) how to choose
between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause
ftruncate() to try to defragment physical memory to fulfil your
request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT,
and (2) if it's the second thing, well Linux is like that in respect
of failing fast, but for it to succeed you have to configure
nr_hugepages in the OS as a separate administrative step and *that's*
when it does any defragmentation required, and that's another concept
FreeBSD doesn't have.  It's a bit of a weird concept too, I mean those
pages are not reserved for you in any way and anyone could nab them,
which is undeniably practical but it lacks a few qualities one might
hope for in a kernel facility...  IDK.  Anyway, the Linux-like
memfd_create() always does it the _DEFAULT way.  Either way, we can't
have identical "try" semantics: it'll actually put some effort into
trying, perhaps burning many seconds of CPU.

I took a peek at what we're doing for Windows and the man pages tell
me that it's like that too.  I don't recall hearing any complaints
about that, but it's gated on a Windows permission that I assume very
few enabled, so "try" probably isn't trying for most systems.
Quoting:

"Large-page memory regions may be difficult to obtain after the system
has been running for a long time because the physical space for each
large page must be contiguous, but the memory may have become
fragmented. Allocating large pages under these conditions can
significantly affect system performance. Therefore, applications
should avoid making repeated large-page allocations and instead
allocate all large pages one time, at startup."

For Windows we also interpret "on" with GetLargePageMinimum(), which
sounds like my "second known page size" idea.

To make Windows do the thing that this thread wants, I found a thread
saying that calling VirtualAlloc(..., MEM_RESET) and then convincing
every process to call VirtualUnlock(...) might work:

https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ

I'm not sure what to do about the other Unixen.  One option is
nothing, no feature, patches welcome.  Another is to use
shm_open(), like DSM segments, except we never need to
reopen these ones so we could immediately call shm_unlink() to leave
only a very short window to crash and leak a name.  It'd be low risk
name pollution in a name space that POSIX forgot to provide any way to
list.  The other idea is  non-standard madvise tricks but they seem
far too squishy to be part of a "portable" fallback if they even work
at all, so it might be better not to have the feature than that I
think.
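
A minimal sketch of that shm_open() fallback (POSIX calls only; error
handling abbreviated): create the named segment, size and map it, then
unlink the name right away so a crash could only leak it during a very
short window.

```
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

static void *
create_unlinked_segment(const char *name, size_t size)
{
	void	   *addr = NULL;
	int			fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);

	if (fd < 0)
		return NULL;

	if (ftruncate(fd, size) == 0)
		addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		addr = NULL;

	shm_unlink(name);			/* name no longer needed; mapping survives */
	close(fd);
	return addr;
}
```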




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-18 Thread Jacob Champion
On Tue, Apr 15, 2025 at 2:38 PM Jelte Fennema-Nio  wrote:
> libpq_append_conn_error(conn, "no custom OAuth flows are available,
> and the libpq-oauth library could not be loaded. Try
> installing the libpq-oauth package from the same source that you
> installed libpq from");

Thanks! I think that's a little too prescriptive for packagers,
personally, but I agree that the current message isn't correct
anymore. I've gone with "no custom OAuth flows are available, and the
builtin flow is not installed". (I suppose packagers could patch in a
platform-specific message if they really wanted?)

--

Other changes in v7:

- The option name remains --with-libcurl.
- Daniel and I have tweaked the documentation, and a draft commit message is up
- Removed the ENABLE_NLS-mismatch assertion in oauth-utils.c; we don't
need to care anymore
- Added an initialization mutex

I was feeling paranoid about injecting dependency pointers
concurrently with their use in another thread. They're _supposed_ to be
constant... but I have no doubt that someone somewhere knows of a
platform/compiler/linker combo where that blows up anyway.
Initialization is now run once, under pthread_mutex protection.
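
The pattern is roughly this (a sketch; the names here are illustrative,
not the actual libpq-oauth symbols):

```
#include <pthread.h>
#include <stdbool.h>

typedef void (*append_error_fn) (void *conn, const char *fmt, ...);

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static bool initialized = false;
static append_error_fn append_error = NULL;

void
libpq_oauth_init(append_error_fn errfn)
{
	/* store the injected pointers exactly once, under the mutex */
	pthread_mutex_lock(&init_lock);
	if (!initialized)
	{
		append_error = errfn;
		initialized = true;
	}
	pthread_mutex_unlock(&init_lock);
}
```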

- Fixed module load on macOS

The green CI was masking a bug with its use of DYLD_LIBRARY_PATH: we
don't make use of RPATH on macOS, so after installing libpq, it lost
the ability to find libpq-oauth. (A stale installation due to SIP
weirdness was masking this on my local machine; sorry for not catching
it before.)

I have swapped to using an absolute path on Mac only, because unlike
LD_LIBRARY_PATH on *nix, DYLD_LIBRARY_PATH can still override absolute
paths in dlopen()! Whee. I could use a sanity check from a native Mac
developer, but I believe this mirrors the expected behavior for a
"typical" runtime dependency: libraries point directly to the things
they depend on.

With those, I have no more TODOs and I believe this is ready for a
final review round.

Thanks,
--Jacob


v7-0001-oauth-Move-the-builtin-flow-into-a-separate-modul.patch
Description: Binary data


Re: disabled SSL log_like tests

2025-04-18 Thread Jacob Champion
On Fri, Apr 18, 2025 at 12:46 PM Tom Lane  wrote:
> * The commented-out tests in 001_ssltests.pl contained hard-wired
> expectations as to certificate serial numbers, which are obsolete now.
> I just replaced them with "\d+", but if anyone feels like that's not
> good enough, we could continue to check for exact serial numbers and
> eat the ensuing maintenance effort.

No, I think \d+ should be fine.

(I haven't stared closely at the racing code yet, but I like the
concept of the patch.)

Thanks!
--Jacob




SQL function to access to `creating_extension`

2025-04-18 Thread Yurii Rashkovskii
Hi,

I propose to introduce a `pg_creating_extension()` function that would return
the OID of the extension currently being created (or `null` if none is).

The core motivation for it is to be able to establish data provenance in
tables created by extensions to be used in `pg_extension_config_dump`
configuration. This way, even if a secondary extension inserts data, it can
still be excluded from the dump by tracking data provenance in a column
with a default. Something like

```
create table my_config (
   --- ...
   extension_provenance oid default pg_creating_extension()
)
```

This would allow for a generalized exclusion in pg_extension_config_dump.
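
For example, an extension script could then register the table with a
filter on the provenance column (a hypothetical sketch, using the table
from the example above):

```
-- Only rows not inserted by any extension script (extension_provenance
-- IS NULL) would end up in pg_dump output.
SELECT pg_extension_config_dump('my_config',
                                'WHERE extension_provenance IS NULL');
```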

I've attached a draft patch for this simple function. I am happy to
finalize it with tests and documentation if there is a consensus on the
shape.

--
Founder at Omnigres
https://omnigres.com


v1-0001-pg_creating_extension-to-inspect-if-an-extension-is-.patch
Description: Binary data


Re: disabled SSL log_like tests

2025-04-18 Thread Tom Lane
I wrote:
> * I was more than slightly surprised to find that there are a bunch of
> other connect_fails callers that are testing log_like or log_unlike
> and thereby risking the same type of race condition.  Some of those
> tests are relatively new and perhaps just haven't failed *yet*,
> but I wonder if we changed something since 2022 that solves this
> problem in a different way?

Nope, apparently not.  The failure depends on the kernel's scheduler,
so unsurprisingly it's quite platform-dependent.  But I can reproduce
it reliably on mamba's host (NetBSD 10/macppc) if I enable the
001_ssltests.pl log_like tests without applying the connect_fails
changes.

The fact that mamba hasn't been failing on the other affected
tests is a bit puzzling.  mamba isn't running kerberos or ldap
or oauth_validator, so the lack of failures there is expected.
authentication/t/001_password.pl appears accidentally not vulnerable:
it's not using log_like in any connect_fails tests.  (It is using
log_unlike, so those tests could be giving silent false negatives.)
But authentication/t/003_peer.pl has 8 test cases that look
vulnerable.  I guess there must be some extra dollop of timing
weirdness in the ssl tests that's not there in 003_peer.pl.

Unfortunately ... it sometimes fails even with the connect_fails
changes, for example

#   Failed test 'intermediate client certificate is missing: log matches'
#   at 
/home/tgl/pgsql/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm 
line 2666.
#   '2025-04-18 17:59:19.358 EDT [1460] DEBUG:  assigned pm 
child slot 2 for backend
# 2025-04-18 17:59:19.359 EDT [1460] DEBUG:  forked new client backend, pid=599 
socket=8
# 2025-04-18 17:59:19.369 EDT [599] [unknown] LOG:  connection received: 
host=localhost port=63709
# 2025-04-18 17:59:19.436 EDT [1460] DEBUG:  releasing pm child slot 1
# 2025-04-18 17:59:19.436 EDT [1460] DEBUG:  client backend (PID 25401) exited 
with exit code 0
# '
# doesn't match '(?^:Client certificate verification failed at depth 0: 
unable to get local issuer certificate)'

What I think happened here is that a previous backend hadn't exited
yet when we start the test, and when its report does come out,
connect_fails prematurely stops waiting.  (In the above, evidently
the child process we want to wait for is 599, but we're fooled by
a delayed report for 25401.)  So my v1 patch needs work.

Maybe we can make the test compare the PIDs in the "forked new client
backend" and "client backend exited" log messages.  Stay tuned...

regards, tom lane




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-04-18 Thread Jacob Champion
On Thu, Apr 17, 2025 at 5:47 PM Jacob Champion
 wrote:
> With those, I have no more TODOs and I believe this is ready for a
> final review round.

Some ABI self-review. These references to conn->errorMessage also need
the indirection treatment, which I'm working on now:

> if (actx->errctx)
> {
> appendPQExpBufferStr(&conn->errorMessage,
>  libpq_gettext(actx->errctx));
> appendPQExpBufferStr(&conn->errorMessage, ": ");
> ...

I was searching backwards through history to confirm that we don't
rearrange struct pg_conn in back branches; turns out that was a false
assumption. See e8f60e6fe2:

While at it, fix some places where parameter-related infrastructure
was added with the aid of a dartboard, or perhaps with the aid of
the anti-pattern "add new stuff at the end".  It should be safe
to rearrange the contents of struct pg_conn even in released
branches, since that's private to libpq (and we'd have to move
some fields in some builds to fix this, anyway).

So that means, I think, the name needs to go back to --,
unless anyone can think of a clever way around it. (Injecting
conn->errorMessage to avoid the messiness around ENABLE_GSS et al is
still useful, but injecting every single offset doesn't seem
maintainable to me.) Sorry, Christoph; I know that's not what you were
hoping for.

--Jacob




Re: [BUG] temporary file usage report with extended protocol and unnamed portals

2025-04-18 Thread Sami Imseih
> [...] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp525566.0",
> size 2416640
> [..] STATEMENT:  SELECT 1

> but it should be:

> [...] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp538230.0",
> size 2416640
> [...] STATEMENT:  SELECT * FROM foo ORDER BY a OFFSET $1 LIMIT 2


hmm, looking at the repro and your patch, I was really confused about
why the STATEMENT ends up being SELECT 1 even though the
query whose temp file is being tracked is the ORDER BY query.

So, I ran a slowed down test where there is a 15 second thread sleep between
the ORDER BY and SELECT 1 ( see attached ) and I also ran it with the
following GUCs enabled

```
log_temp_files = '0'
work_mem = '64kB'
max_parallel_workers_per_gather = '0'
log_min_duration_statement = '0'
```

2025-04-18 13:31:45.572 CDT [95918] LOG:  duration: 0.046 ms  bind
: select pg_sleep(0)
2025-04-18 13:31:45.572 CDT [95918] LOG:  duration: 0.009 ms  execute
: select pg_sleep(0)
2025-04-18 13:31:45.574 CDT [95918] LOG:  duration: 0.454 ms  parse
: SELECT * FROM foo ORDER BY a
2025-04-18 13:31:45.575 CDT [95918] LOG:  duration: 0.278 ms  bind
: SELECT * FROM foo ORDER BY a
2025-04-18 13:31:46.396 CDT [95918] LOG:  duration: 821.076 ms
execute : SELECT * FROM foo ORDER BY a

2025-04-18 13:32:01.403 CDT [95918] LOG:  duration: 0.438 ms  parse
: SELECT 1
2025-04-18 13:32:01.454 CDT [95918] LOG:  temporary file: path
"base/pgsql_tmp/pgsql_tmp95918.0", size 202039296
2025-04-18 13:32:01.454 CDT [95918] STATEMENT:  SELECT 1
2025-04-18 13:32:01.454 CDT [95918] LOG:  duration: 50.818 ms  bind
: SELECT 1
2025-04-18 13:32:01.454 CDT [95918] LOG:  duration: 0.022 ms  execute
: SELECT 1
2025-04-18 13:32:01.457 CDT [95918] LOG:  duration: 0.064 ms  parse S_1: COMMIT
2025-04-18 13:32:01.457 CDT [95918] LOG:  duration: 0.017 ms  bind S_1: COMMIT
2025-04-18 13:32:01.457 CDT [95918] LOG:  duration: 0.105 ms  execute
S_1: COMMIT

In the log above, it shows that the temporary file logging does not
actually occur
until the next query is executed, so at that point the query string is
already "SELECT 1"

what is even more interesting is when running the commands directly from psql
using extended query protocol, the STATEMENT is not logged after the
temporary file message.


```
SELECT  FROM foo ORDER BY a \bind
;
SELECT 1 \bind
;
```

```
istance=0 kB, estimate=0 kB; lsn=0/E55A088, redo lsn=0/E55A030
2025-04-18 13:36:39.275 CDT [96237] LOG:  duration: 1.447 ms  parse
: SELECT  FROM foo ORDER BY a
;
2025-04-18 13:36:39.275 CDT [96237] LOG:  duration: 0.719 ms  bind
: SELECT  FROM foo ORDER BY a
;
2025-04-18 13:36:39.822 CDT [96237] LOG:  duration: 546.803 ms
execute : SELECT  FROM foo ORDER BY a
;
2025-04-18 13:36:39.852 CDT [96237] LOG:  temporary file: path
"base/pgsql_tmp/pgsql_tmp96237.0", size 202039296
2025-04-18 13:36:40.736 CDT [96237] LOG:  duration: 0.394 ms  parse
: SELECT 1
;
2025-04-18 13:36:40.736 CDT [96237] LOG:  duration: 0.133 ms  bind
: SELECT 1
;
2025-04-18 13:36:40.736 CDT [96237] LOG:  duration: 0.023 ms  execute
: SELECT 1
;
```

I think the DropPortal is happening later in the JDBC case, and only
after the next query is
executed, maybe? So your patch does take care of the problem because
it ensures that
the drop portal, when it occurs, is referencing the correct SQL. I am
not yet sure whether the patch is the right solution, but it may be the
best answer.

I don't have a better answer, but wanted to share this research as well.

--
Sami Imseih
Amazon Web Services (AWS)


Test.java
Description: Binary data


Re: n_ins_since_vacuum stats for aborted transactions

2025-04-18 Thread Sami Imseih
On Thu, Apr 17, 2025 at 11:13 PM David Rowley  wrote:
>
> On Sat, 12 Apr 2025 at 07:33, Sami Imseih  wrote:
> > What do you think of the attached?
>
> I looked at the v3 patch and I'm having trouble getting excited about it.
>
> I'd say this part is misleading:
>
> @@ -3956,7 +3961,8 @@ description | Waiting for a newly initialized
> WAL file to reach durable storage
> n_dead_tup bigint
>
>
> -   Estimated number of dead rows
> +   Estimated number of dead rows (updated by committed transactions, or 
> by
> +   VACUUM and VACUUM FULL)
>
>
> An aborted insert will contribute to that counter, but that's not mentioned.

I thought the first line, "An aborted transaction will also increment
tuple-related counters, unless otherwise noted.", makes it clear that it
will be updated in an aborted transaction, but after re-reading I can see
it being confusing. What about this wording, to make it clearer when the
field is updated?


n_dead_tup bigint
   
   
-   Estimated number of dead rows
+   Estimated number of dead rows (updated either by committed or
aborted transactions,
+   or by VACUUM and VACUUM FULL)
   
  

> Would it be ok just to adjust n_mod_since_analyze's "Estimated number
> of rows modified since this table was last analyzed" and inject "by
> committed transactions" after "modified", then call it a day?

That works also.

--
Sami Imseih
Amazon Web Services (AWS)




Re: Fix slot synchronization with two_phase decoding enabled

2025-04-18 Thread Masahiko Sawada
On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu)
 wrote:
>
>
> --
> Approach 2
> --
>
> Instead of disallowing the use of two-phase and failover together, a more
> flexible strategy could be only restrict failover for slots with two-phase
> enabled when there's a possibility of existing prepared transactions before 
> the
> two_phase_at that are not yet replicated. During slot creation with two-phase
> and failover, we could check for any decoded prepared transactions when
> determining the decoding start point (DecodingContextFindStartpoint). For
> subsequent attempts to alter failover to true, we ensure that two_phase_at is
> less than restart_lsn, indicating that all prepared transactions have been
> committed and replicated, thus the bug would not happen.
>
> pros:
>
> This method minimizes restrictions for users. Especially during slot creation
> with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare
> during consistent snapshot creation, the restriction becomes almost
> unnoticeable.

I think this approach can work for the transactions that are prepared
while the slot is created. But if I understand the problem correctly,
while the initial table sync is performing, the slot's two_phase is
still false, so we need to deal with the transactions that are
prepared during the initial table sync too. What do you think?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: ALTER COLUMN SET DATA TYPE does not change the generation expression's collation

2025-04-18 Thread Tom Lane
jian he  writes:
>> ATPrepAlterColumnType forbids us to ALTER COLUMN SET DATA TYPE USING (expr)
>> for generated columns.
>> however we can still change the generated column type from non-text to text
>> or text type from one collation to another collation.

I don't really understand why we allow SET DATA TYPE on a generated
column at all.  However ...

>> In ATExecAlterColumnType, we also need to set the generation
>> expression collation?

Don't believe that would matter in the slightest.  The generation
expression is not exposed anywhere --- we don't incorporate it
in the plan tree, just evaluate it in ExecComputeStoredGenerated.

It could matter in the case of a virtual generated column,
but it appears that that works already:

regression=# CREATE TABLE x1(a int,
b int GENERATED ALWAYS AS (a * 2) virtual,
c text GENERATED ALWAYS AS ('1') stored );
CREATE TABLE
regression=# insert into x1 values (11);
INSERT 0 1
regression=# ALTER TABLE x1 alter column b set data type text collate "C";
ALTER TABLE
regression=#  select pg_collation_for(b) from x1;
 pg_collation_for 
--
 "C"
(1 row)
regression=# ALTER TABLE x1 alter column b set data type text collate "POSIX";
ALTER TABLE
regression=#  select pg_collation_for(b) from x1;
 pg_collation_for 
--
 "POSIX"
(1 row)

(It looks like the reason that works is that
build_generation_expression inserts the necessary coercion.)

So I don't see a bug here.  If you want to claim that this is
a bug deserving of being an open item, I think you need to
demonstrate some observable misbehavior.  If you want to say
it'd be cleaner to fix the stored expression and get rid of
the extra step in build_generation_expression, I'd probably
agree, but that seems like cleanup that could wait for v19.
It's certainly not a bug affecting any stable branches.

regards, tom lane




Re: n_ins_since_vacuum stats for aborted transactions

2025-04-18 Thread Sami Imseih
> I can see it being confusing. What about this wording to make it more
> clear when the field is
> updated?

here are both of the changes in v4.

--
Sami Imseih
Amazon Web Services (AWS)


v4-0001-Clarify-when-aborted-rows-are-tracked-for-tuple-r.patch
Description: Binary data


Re: disabled SSL log_like tests

2025-04-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 2025-04-17 Th 10:56 AM, Tom Lane wrote:
>> However, I wonder whether Andres' work at 8b886a4e3 could be used
>> to improve this, either directly or as inspiration?

> I don't think so - these tests are about checking log file contents, not 
> a psql problem.

Well, I was mainly leaning on the "inspiration" part: the idea
is to wait for something that must come out after the text we
want to look for.  After digging around a bit I noticed that
002_connection_limits.pl's connect_fails_wait is doing something
that's almost what we want: that test cranks log_min_messages
up to DEBUG2 and then waits for the postmaster's report of
backend exit before believing that it's done.

Awhile later I had the attached patch.  Some notes:

* The commented-out tests in 001_ssltests.pl contained hard-wired
expectations as to certificate serial numbers, which are obsolete now.
I just replaced them with "\d+", but if anyone feels like that's not
good enough, we could continue to check for exact serial numbers and
eat the ensuing maintenance effort.

* I was more than slightly surprised to find that there are a bunch of
other connect_fails callers that are testing log_like or log_unlike
and thereby risking the same type of race condition.  Some of those
tests are relatively new and perhaps just haven't failed *yet*,
but I wonder if we changed something since 2022 that solves this
problem in a different way?  Anyway, after this change any such
caller must set log_min_messages = debug2 or fail.  I think I got
all the relevant test scripts in the attached.

regards, tom lane

diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl
index 756b4146050..37d96d95a1a 100644
--- a/src/test/authentication/t/001_password.pl
+++ b/src/test/authentication/t/001_password.pl
@@ -66,6 +66,8 @@ sub test_conn
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 # Test behavior of log_connections GUC
diff --git a/src/test/authentication/t/003_peer.pl b/src/test/authentication/t/003_peer.pl
index 69ba73bd2b9..2879800eacf 100644
--- a/src/test/authentication/t/003_peer.pl
+++ b/src/test/authentication/t/003_peer.pl
@@ -72,6 +72,8 @@ sub test_role
 my $node = PostgreSQL::Test::Cluster->new('node');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 # Set pg_hba.conf with the peer authentication.
diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 6748b109dec..2dc6bec9b89 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -66,6 +66,7 @@ $node->append_conf(
 listen_addresses = '$hostaddr'
 krb_server_keyfile = '$krb->{keytab}'
 log_connections = on
+log_min_messages = debug2
 lc_messages = 'C'
 });
 $node->start;
diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl
index 352b0fc1fa7..d1315ed5351 100644
--- a/src/test/ldap/t/001_auth.pl
+++ b/src/test/ldap/t/001_auth.pl
@@ -48,6 +48,8 @@ note "setting up PostgreSQL instance";
 my $node = PostgreSQL::Test::Cluster->new('node');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 $node->safe_psql('postgres', 'CREATE USER test0;');
diff --git a/src/test/modules/oauth_validator/t/001_server.pl b/src/test/modules/oauth_validator/t/001_server.pl
index d88994abc24..4f035417a40 100644
--- a/src/test/modules/oauth_validator/t/001_server.pl
+++ b/src/test/modules/oauth_validator/t/001_server.pl
@@ -48,6 +48,8 @@ $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
 $node->append_conf('postgresql.conf',
 	"oauth_validator_libraries = 'validator'\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 $node->safe_psql('postgres', 'CREATE USER test;');
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 8759ed2cbba..7877cc8d994 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2624,7 +2624,12 @@ If this regular expression is set, matches it with the output generated.
 
 =item log_unlike => [ qr/prohibited message/ ]
 
-See C.
+See C.  CAUTION: use of either option requires that
+the server's log_min_messages be at least DEBUG2, and that no other
+client backend is launched concurrently.  These requirements allow
+C to wait to see the postmaster-l

Re: ZStandard (with dictionaries) compression support for TOAST compression

2025-04-18 Thread Robert Haas
On Tue, Apr 15, 2025 at 2:13 PM Nikhil Kumar Veldanda
 wrote:
> Addressing Compressed Datum Leaks problem (via CTAS, INSERT INTO ... SELECT 
> ...)
>
> As compressed datums can be copied to other unrelated tables via CTAS,
> INSERT INTO ... SELECT, or CREATE TABLE ... EXECUTE, I’ve introduced a
> method inheritZstdDictionaryDependencies. This method is invoked at
> the end of such statements and ensures that any dictionary
> dependencies from source tables are copied to the destination table.
> We determine the set of source tables using the relationOids field in
> PlannedStmt.

With the disclaimer that I haven't opened the patch or thought
terribly deeply about this issue, at least not yet, my fairly strong
suspicion is that this design is not going to work out, for multiple
reasons. In no particular order:

1. I don't think users will like it if dependencies on a zstd
dictionary spread like kudzu across all of their tables. I don't think
they'd like it even if it were 100% accurate, but presumably this is
going to add dependencies any time there MIGHT be a real dependency
rather than only when there actually is one.

2. Inserting into a table or updating it only takes RowExclusiveLock,
which is not even self-exclusive. I doubt that it's possible to change
system catalogs in a concurrency-safe way with such a weak lock. For
instance, if two sessions tried to do the same thing in concurrent
transactions, they could both try to add the same dependency at the
same time.

3. I'm not sure that CTAS, INSERT INTO...SELECT, and CREATE
TABLE...EXECUTE are the only ways that datums can creep from one table
into another. For example, what if I create a plpgsql function that
gets a value from one table and stores it in a variable, and then use
that variable to drive an INSERT into another table? I seem to recall
there are complex cases involving records and range types and arrays,
too, where the compressed object gets wrapped inside of another
object; though maybe that wouldn't matter to your implementation if
INSERT INTO ... SELECT uses a sufficiently aggressive strategy for
adding dependencies.
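
A minimal sketch of that last scenario, with made-up table names, just to
illustrate the path (whether the value still carries its original
compression at the INSERT is exactly the open question):

```
CREATE FUNCTION copy_one_row() RETURNS void AS $$
DECLARE
    v text;
BEGIN
    -- the value is read from one table into a plpgsql variable...
    SELECT payload INTO v FROM src WHERE id = 1;
    -- ...and stored into another table, without any INSERT INTO ... SELECT
    -- tying src and dst together in the statement's relationOids
    INSERT INTO dst (payload) VALUES (v);
END;
$$ LANGUAGE plpgsql;
```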

When Dilip and I were working on lz4 TOAST compression, my first
instinct was to not let LZ4-compressed datums leak out of a table by
forcing them to be decompressed (and then possibly recompressed). We
spent a long time trying to make that work before giving up. I think
this is approximately where things started to unravel, and I'd suggest
you read both this message and some of the discussion before and
after:

https://www.postgresql.org/message-id/20210316185455.5gp3c5zvvvq66...@alap3.anarazel.de

I think we could add plain-old zstd compression without really
tackling this issue, but if we are going to add dictionaries then I
think we might need to revisit the idea of preventing things from
leaking out of tables. What I can't quite remember at the moment is
how much of the problem was that it was going to be slow to force the
recompression, and how much of it was that we weren't sure we could
even find all the places in the code that might need such handling.

I'm now also curious to know whether Andres would agree that it's bad
if zstd dictionaries are un-droppable. After all, I thought it would
be bad if there was no way to eliminate a dependency on a compression
method, and he disagreed. So maybe he would also think undroppable
dictionaries are fine. But maybe not. It seems even worse to me than
undroppable compression methods, because you'll probably not have that
many compression methods ever, but you could have a large number of
dictionaries eventually.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: NUMA shared memory interleaving

2025-04-18 Thread Bertrand Drouvot
Hi,

On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:
> On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
>  wrote:
> > Normal pgbench workloads tend to be not affected, as each backend
> > tends to touch just a small partition of shm (thanks to BAS
> > strategies). Some remaining questions are:
> > 1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
> > first option, as we could potentially in future add more optimizations
> > behind that GUC.
> 
> I wonder whether the GUC needs to support interleaving between a
> designated set of nodes rather than only being able to do all nodes.
> For example, suppose someone is pinning the processes to a certain set
> of NUMA nodes; perhaps then they wouldn't want to use memory from
> other nodes.

+1. That could be used for instance consolidation on the same host. One could
ensure that NUMA nodes are not shared across instances (CPU and memory resource
isolation per instance). Bonus point: adding direct I/O into the game would
ensure that the OS page cache is not shared either.
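
A minimal sketch of what interleaving across an explicit node set could
look like (assumes libnuma; the node string is illustrative):

```
#include <numa.h>

static void
interleave_on_nodes(void *base, size_t size, char *nodestring)
{
	/* e.g. nodestring = "0,2" to interleave across nodes 0 and 2 only */
	struct bitmask *mask = numa_parse_nodestring(nodestring);

	if (mask != NULL)
	{
		numa_interleave_memory(base, size, mask);
		numa_free_nodemask(mask);
	}
}
```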

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: NUMA shared memory interleaving

2025-04-18 Thread Bertrand Drouvot
Hi,

On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:
> On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
>  wrote:
> > 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> > expert on DSA/DSM at all)
> 
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, and when that's not possible (more workers than
> batches, or finishing their existing work at different times and going
> to help others), they just proceed in round-robin order.  A beginner
> thought is: if you're going to help someone working on a hash table,
> it would surely be best to have the CPUs and all the data on the same
> NUMA node.  During loading, cache line ping pong would be cheaper, and
> during probing, it *might* be easier to tune explicit memory prefetch
> timing that way as it would look more like a single node system with a
> fixed latency, IDK (I've shared patches for prefetching before that
> showed pretty decent speedups, and the lack of that feature is
> probably a bigger problem than any of this stuff, who knows...).
> Another beginner thought is that the DSA allocator is a source of
> contention during loading: the dumbest problem is that the chunks are
> just too small, but it might also be interesting to look into per-node
> pools.  Or something.   IDK, just some thoughts...

I'm also thinking that could be beneficial for parallel workers. I think the
ideal scenario would be to have the parallel workers spread across NUMA nodes,
accessing their "local" memory first (and helping with "remote" memory access if
there is still more work to do "remotely").

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pending_since assertion failure on skink

2025-04-18 Thread Bertrand Drouvot
Hi,

On Sat, Apr 12, 2025 at 10:33:33AM -0400, Andres Freund wrote:
> I suspect that this is related to
> 
> commit 039549d70f6aa2daa3714a13752a08fa8ca2fb05

+1

> We might just have to give up on that assertion, I guess?

Probably. I'll have a look at it next week.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com