Re: Typo in comment for pgstat_database_flush_cb()

2025-03-30 Thread Heikki Linnakangas

On 30/03/2025 14:32, Heikki Linnakangas wrote:

On 30/03/2025 13:28, Etsuro Fujita wrote:

Another thing I noticed is $SUBJECT: I think “if lock could not
immediately acquired” should be “if lock could not be immediately
acquired”.  Attached is a patch for that.


Yep. And there are more instances of the same typo in other such 
flush_cb functions, if you search for "immediately acquired".


While we're at it, the comment feels a bit awkward to me anyway. Maybe 
rephrase it to "If nowait is true and the lock could not be immediately 
acquired, returns false without flushing the entry. Otherwise returns true."


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Thinko in pgstat_build_snapshot()

2025-03-30 Thread Heikki Linnakangas

On 30/03/2025 13:23, Etsuro Fujita wrote:

While working on something else I noticed $SUBJECT: we are allocating
more memory than necessary and copying more data than necessary
because we specify the wrong PgStat_KindInfo member as the size
argument for MemoryContextAlloc and memcpy.  This could become
problematic if snapshotting a very large number of variable stats, so
I fixed it.  Attached is a patch for that.


Good catch. Patch looks good to me at a quick glance.

--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Typo in comment for pgstat_database_flush_cb()

2025-03-30 Thread Tender Wang
Etsuro Fujita wrote on Sun, Mar 30, 2025 at 18:28:

> Another thing I noticed is $SUBJECT: I think “if lock could not
> immediately acquired” should be “if lock could not be immediately
> acquired”.  Attached is a patch for that.
>

Agree.
The patch looks good to me.


-- 
Thanks,
Tender Wang


Re: Amcheck verification of GiST and GIN

2025-03-30 Thread Tomas Vondra
On 3/30/25 06:04, Tom Lane wrote:
> Tomas Vondra  writes:
>> I've pushed all the parts of this patch series, except for the stress
>> test - which I think was not meant for commit.
>> buildfarm seems happy so far, except for a minor indentation issue
>> (forgot to reindent after merging the separate fix patch).
> 
> Not so much here:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gecko&dt=2025-03-29%2020%3A25%3A23
> 
> Need to avoid depending on md5(), or it'll fail on machines with
> FIPS mode enabled.  See for example 95b856de2.
> 

Ah, I forgot about that. Fixed.


regards

-- 
Tomas Vondra





Re: speedup COPY TO for partitioned table.

2025-03-30 Thread jian he
On Sun, Mar 30, 2025 at 9:14 AM vignesh C  wrote:
>
> On Sat, 29 Mar 2025 at 12:08, jian he  wrote:
> >
> >
> > I consolidated it into a new function: CopyThisRelTo.
>
> Few comments:
> 1) Here the error message is not correct: we are printing the original
> table from which the copy was done, which is a regular table and not a
> foreign table. We should use childreloid instead of rel.
>
> +   if (relkind == RELKIND_FOREIGN_TABLE)
> +   ereport(ERROR,
> +
> errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +   errmsg("cannot
> copy from foreign table \"%s\"",
> +
>  RelationGetRelationName(rel)),
> +
> errdetail("partition \"%s\" is a foreign table",
> RelationGetRelationName(rel)),
> +   errhint("Try
> the COPY (SELECT ...) TO variant."));
>
> In the error detail you can include the original table too.
>
I changed it to:
if (relkind == RELKIND_FOREIGN_TABLE)
{
char   *relation_name;

relation_name = get_rel_name(childreloid);
ereport(ERROR,
errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot copy from foreign table
\"%s\"", relation_name),
errdetail("Partition \"%s\" is a foreign
table in the partitioned table \"%s.%s\"",
  relation_name,
RelationGetRelationName(rel),

get_namespace_name(rel->rd_rel->relnamespace)),
errhint("Try the COPY (SELECT ...) TO variant."));
}

>
> 2) 2.a) I felt the comment should be "then copy partitioned rel to
> destination":
> + * rel: the relation to be copied to.
> + * root_rel: if not null, then the COPY TO partitioned rel.
> + * processed: number of tuple processed.
> +*/
> +static void
> +CopyThisRelTo(CopyToState cstate, Relation rel, Relation root_rel,
> uint64 *processed)
> +{
> +   TupleTableSlot *slot;
>

I changed it to:
+/*
+ * rel: the relation to be copied to.
+ * root_rel: if not null, then the COPY partitioned relation to destination.
+ * processed: number of tuple processed.
+*/
+static void
+CopyThisRelTo(CopyToState cstate, Relation rel, Relation root_rel,
+  uint64 *processed)


> 3) There is a small indentation issue here:
> +   /*
> +* partition's rowtype might differ from the root table's.  We must
> +* convert it back to the root table's rowtype as we are export
> +* partitioned table data here.
> +   */
> +   if (root_rel != NULL)
>
I am not so sure.
Can you check whether the attached still has the indentation issue?
From 4e501b7a8e67cffbf6432bdb43985b21d2b635b8 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Sun, 30 Mar 2025 21:47:12 +0800
Subject: [PATCH v7 1/1] support COPY partitioned_table TO

CREATE TABLE pp (id  INT, val int ) PARTITION BY RANGE (id);
create table pp_1 (val int, id int);
create table pp_2 (val int, id int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10);
insert into pp select g, 10 + g from generate_series(1,9) g;
copy pp to stdout(header);

the above case is much slower (around 25% in some cases) than
``COPY (select * from pp) to stdout(header);``,
because of column remapping.  But this is still a new
feature, since master does not support ``COPY (partitioned_table)``.

reviewed by: vignesh C 
reviewed by: David Rowley 
reviewed by: Melih Mutlu 

discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=i...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5467/
---
 doc/src/sgml/ref/copy.sgml  |   8 +-
 src/backend/commands/copyto.c   | 143 ++--
 src/test/regress/expected/copy2.out |  16 
 src/test/regress/sql/copy2.sql  |  11 +++
 4 files changed, 147 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index df093da97c5..f86e0b7ec35 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -521,15 +521,15 @@ COPY count
 

 COPY TO can be used only with plain
-tables, not views, and does not copy rows from child tables
-or child partitions.
+tables.  COPY TO can be used with partitioned tables.
+For example, in a table inheritance hierarchy, COPY table TO copies
 the same rows as SELECT * FROM ONLY table.
 The syntax COPY (SELECT * FROM table) TO ... can be used to
-dump all of the rows in an inheritance hierarchy, partitioned table,
-or view.
+dump all of the rows in an inheritance hierarchy, or view.

 

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 84a3f3879a8..b75bbfea6a4 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c

Re: Non-text mode for pg_dumpall

2025-03-30 Thread Andrew Dunstan


On 2025-03-29 Sa 1:17 AM, Mahendra Singh Thalor wrote:

On Sat, 29 Mar 2025 at 03:50, Andrew Dunstan  wrote:


On 2025-03-27 Th 5:15 PM, Andrew Dunstan wrote:

On 2025-03-19 We 2:41 AM, Mahendra Singh Thalor wrote:

On Wed, 12 Mar 2025 at 21:18, Andrew Dunstan 
wrote:

On 2025-03-12 We 3:03 AM, jian he wrote:

On Wed, Mar 12, 2025 at 1:06 AM Álvaro Herrera
 wrote:

Hello,

On 2025-Mar-11, Mahendra Singh Thalor wrote:


In map.dat file, I tried to fix this issue by adding number of
characters
in dbname but as per code comments, as of now, we are not
supporting \n\r
in dbnames so i removed handling.
I will do some more study to fix this issue.

Yeah, I think this is saying that you should not consider the
contents
of map.dat as a shell string.  After all, you're not going to
_execute_
that file via the shell.

Maybe for map.dat you need to escape such characters somehow, so that
they don't appear as literal newlines/carriage returns.


I am confused.
currently pg_dumpall plain format will abort when encountering dbname
containing newline.
the left dumped plain file does not contain all the cluster
databases data.


if pg_dumpall non-text format aborts earlier,
it's aligned with pg_dumpall plain format?
it's also an improvement since aborts earlier, nothing will be dumped?


am i missing something?



I think we should fix that.

But for the current proposal, Álvaro and I were talking this morning,
and we thought the simplest thing here would be to have the one line
format and escape NL/CRs in the database name.


cheers


Okay. As per discussions, we will keep a one-line entry for each
database in the map file.

Thanks all for feedback and review.

Here, I am attaching updated patches for review and testing. These
patches can be applied on commit a6524105d20b.



I'm working through this patch set with a view to committing it.
Attached is some cleanup which is where I got to today, although there
is more to do. One thing I am wondering is why not put the
SimpleDatabaseOidList stuff in fe_utils/simple_list.{c,h} ? That's
where all the similar stuff belongs, and it feels strange to have this
inline in pg_restore.c. (I also don't like the name much -
SimpleOidStringList or maybe SimpleOidPlusStringList might be better).





OK, I have done that, so here is the result. The first two are your
original patches. Patch 3 adds the new list type to fe-utils, and patch
4 contains my cleanups and use of the new list type. Apart from some
relatively minor cleanup, the one thing I would like to change is how
dumps are named. If we are producing tar or custom format dumps, I think
the file names should reflect that (oid.dmp and oid.tar rather than a
bare oid as the filename), and pg_restore should look for those. I'm
going to work on that tomorrow - I don't think it will be terribly
difficult.


Thanks Andrew.

Here, I am attaching a delta patch for oid.tar and oid.dmp format.




OK, looks good, I have incorporated that.

There are a couple of rough edges, though.

First, I see this:


andrew@ub22arm:inst $ bin/pg_restore -C -d postgres 
--exclude-database=regression_dummy_seclabel 
--exclude-database=regression_test_extensions 
--exclude-database=regression_test_pg_dump dest
pg_restore: error: could not execute query: "ERROR:  role "andrew" 
already exists

"
Command was: "

--
-- Roles
--

CREATE ROLE andrew;"
pg_restore: warning: errors ignored on global.dat file restore: 1
pg_restore: error: could not execute query: ERROR:  database "template1" 
already exists
Command was: CREATE DATABASE template1 WITH TEMPLATE = template0 
ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C';



pg_restore: warning: errors ignored on database "template1" restore: 1
pg_restore: error: could not execute query: ERROR:  database "postgres" 
already exists
Command was: CREATE DATABASE postgres WITH TEMPLATE = template0 ENCODING 
= 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C';



pg_restore: warning: errors ignored on database "postgres" restore: 1
pg_restore: warning: errors ignored on restore: 3



It seems pointless to be trying to create the role that we are connected 
as, and we also expect template1 and postgres to exist.


In a similar vein, I don't see why we are setting the --create flag in 
pg_dumpall for those databases. I'm attaching a patch that is designed 
to stop that, but it doesn't solve the above issues.


I also notice a bunch of these in globals.dat:


--
-- Databases
--

--
-- Database "template1" dump
--

--
-- Database "andrew" dump
--

--
-- Database "isolation_regression_brin" dump
--

--
-- Database "isolation_regression_delay_execution" dump
--

 ...


The patch also tries to fix this.

Lastly, this badly needs some TAP tests written.

I'm going to work on reviewing the documentation next.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From ed53d8b5ad82e49bb56bc5bc48ddb8426fdb4c80 Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor 
Date: Wed, 19 Mar 2025 01:18:46 +0530
Subject: [

Re: Options to control remote transactions’ access/deferrable modes in postgres_fdw

2025-03-30 Thread Etsuro Fujita
On Thu, Mar 27, 2025 at 1:25 PM Ashutosh Bapat
 wrote:
> On Tue, Mar 25, 2025 at 4:01 PM Etsuro Fujita  wrote:
> > In the patch I also fixed a bug; I trusted XactReadOnly to see if the
> > local transaction is READ ONLY, but I noticed that that is not 100%
> > correct, because a transaction which started as READ WRITE can show as
> > READ ONLY later within subtransactions, so I modified the patch so
> > that postgres_fdw opens remote transactions in READ ONLY mode if the
> > local transaction has been declared READ ONLY at the top level.
>
> Nice catch. postgres_fdw replicates the transaction stack on foreign
> server. I think we need to replicate it along with the transaction
> properties. And also we need a test which tests readonly
> subtransaction behaviour.

Ok, will do.

Thanks for the comment!

Best regards,
Etsuro Fujita




Thinko in pgstat_build_snapshot()

2025-03-30 Thread Etsuro Fujita
While working on something else I noticed $SUBJECT: we are allocating
more memory than necessary and copying more data than necessary
because we specify the wrong PgStat_KindInfo member as the size
argument for MemoryContextAlloc and memcpy.  This could become
problematic if snapshotting a very large number of variable stats, so
I fixed it.  Attached is a patch for that.

Best regards,
Etsuro Fujita


fix-thinko-in-pgstat_build_snapshot.patch
Description: Binary data


Minor edits to README.tuplock, and a question

2025-03-30 Thread Gurjeet Singh
Please see attached a few minor edits to README.tuplock, which I feel
make an improvement over the current version.

Reading through that, though, I could not see a functional difference
between FOR NO KEY UPDATE and FOR KEY SHARE mode of locks. I understand
they are of different strength, exclusive vs. shared, but the way the
text (quoted below) describes them, they essentially both achieve the
same effect.

> SELECT FOR NO
> KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
> and modifications which might alter the tuple's key.

> SELECT FOR KEY SHARE obtains a shared lock which only
> prevents tuple removal and modifications of key fields.

Am I missing something?



Never mind. Deciphering the conflict table below them makes clear why two
similar-looking lock modes are needed, given the exclusive vs. shared
distinction. I can't think of an improvement to the two sentences quoted
above, but perhaps others can think of something that helps the reader.

-- 
Best regards,
Gurjeet
http://Gurje.et
diff --git a/src/backend/access/heap/README.tuplock 
b/src/backend/access/heap/README.tuplock
index 843c2e58f92..0763fbaa9e7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -3,7 +3,7 @@ Locking tuples
 
 Locking tuples is not as easy as locking tables or other database objects.
 The problem is that transactions might want to lock large numbers of tuples at
-any one time, so it's not possible to keep the locks objects in shared memory.
+any one time, so it's not possible to keep the lock objects in shared memory.
 To work around this limitation, we use a two-level mechanism.  The first level
 is implemented by storing locking information in the tuple header: a tuple is
 marked as locked by setting the current transaction's XID as its XMAX, and
@@ -20,8 +20,8 @@ tuple, potentially leading to indefinite starvation of some 
waiters.  The
 possibility of share-locking makes the problem much worse --- a steady stream
 of share-lockers can easily block an exclusive locker forever.  To provide
 more reliable semantics about who gets a tuple-level lock first, we use the
-standard lock manager, which implements the second level mentioned above.  The
-protocol for waiting for a tuple-level lock is really
+standard lock manager, which implements the second of the two-level mechanism
+mentioned above.  The protocol for waiting for a tuple-level lock is really
 
  LockTuple()
  XactLockTableWait()
@@ -39,7 +39,7 @@ conflict for a tuple, we don't incur any extra overhead.
 We make an exception to the above rule for those lockers that already hold
 some lock on a tuple and attempt to acquire a stronger one on it.  In that
 case, we skip the LockTuple() call even when there are conflicts, provided
-that the target tuple is being locked, updated or deleted by multiple sessions
+that the target tuple is being locked, updated, or deleted by multiple sessions
 concurrently.  Failing to skip the lock would risk a deadlock, e.g., between a
 session that was first to record its weaker lock in the tuple header and would
 be waiting on the LockTuple() call to upgrade to the stronger lock level, and
@@ -142,7 +142,7 @@ The following infomask bits are applicable:
 
 - HEAP_KEYS_UPDATED
   This bit lives in t_infomask2.  If set, indicates that the operation(s) done
-  by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE
+  by the XMAX modify the tuple key, such as a SELECT FOR UPDATE, an UPDATE
   that modifies the columns of the key, or a DELETE.  It's set regardless of
   whether the XMAX is a TransactionId or a MultiXactId.
 


The 026_overwrite_contrecord test might fail on extremely slow animals

2025-03-30 Thread Alexander Lakhin

Hello hackers,

A couple of buildfarm failures produced by skink, which runs
tests under Valgrind: [1], [2], with the following diagnostics:
 31/324 postgresql:recovery / recovery/026_overwrite_contrecord ERROR    325.72s   (exit status 255 or signal 
127 SIGinvalid)


# Initializing node "standby" from backup "backup" of node "primary"
### Enabling streaming replication for node "standby"
### Starting node "standby"
# Running: pg_ctl -w -D 
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/026_overwrite_contrecord/data/t_026_overwrite_contrecord_standby_data/pgdata 
-l 
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/026_overwrite_contrecord/log/026_overwrite_contrecord_standby.log 
-o --cluster-name=standby start

waiting for server to start stopped waiting
pg_ctl: could not start server
Examine the log output.
# pg_ctl start failed; see logfile for details: 
/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/026_overwrite_contrecord/log/026_overwrite_contrecord_standby.log

# No postmaster PID for node "standby"
[14:44:39.051](7.936s) Bail out!  pg_ctl start failed

026_overwrite_contrecord_standby.log:
2025-03-14 14:44:38.533 UTC [1558222][startup][:0] LOG:  database system was interrupted; last known up at 2025-03-14 
14:44:30 UTC

2025-03-14 14:44:38.806 UTC [1558222][startup][:0] LOG:  invalid checkpoint 
record
2025-03-14 14:44:38.808 UTC [1558222][startup][:0] PANIC:  could not locate a 
valid checkpoint record at 0/2094248
2025-03-14 14:44:38.937 UTC [1555866][postmaster][:0] LOG:  startup process (PID 1558222) was terminated by signal 6: 
Aborted

2025-03-14 14:44:38.937 UTC [1555866][postmaster][:0] LOG:  aborting startup 
due to startup process failure

026_overwrite_contrecord_primary.log:
2025-03-14 14:44:30.536 UTC [1553014][client backend][4/2:0] LOG: statement: SELECT 
pg_walfile_name(pg_current_wal_insert_lsn())
2025-03-14 14:44:30.621 UTC [1519069][checkpointer][:0] LOG: checkpoint complete: wrote 109 buffers (85.2%), wrote 3 
SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=12.404 s, sync=0.013 s, total=12.574 s; sync files=28, 
longest=0.004 s, average=0.001 s; distance=8299 kB, estimate=8299 kB; lsn=0/2094248, redo lsn=0/1FA5A48

2025-03-14 14:44:31.119 UTC [1519026][postmaster][:0] LOG:  received immediate 
shutdown request

indicates that this test might fail when a checkpoint is performed during
the test execution (that is, when the duration of the test up to
$node->stop('immediate');
exceeds 300 secs).

This failure can be easily reproduced with:
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -24,2 +24,3 @@ autovacuum = off
 wal_keep_size = 1GB
+checkpoint_timeout=30s
 ));
@@ -61,2 +62,3 @@ $node->safe_psql('postgres',
 #$node->safe_psql('postgres', qq{create table foo ()});
+sleep(40);
 my $endfile = $node->safe_psql('postgres',

The last 5 runs of the test on skink on master show the following duration of 
the test:
 29/331 postgresql:recovery / recovery/026_overwrite_contrecord  OK 317.37s   3 
subtests passed
 29/331 postgresql:recovery / recovery/026_overwrite_contrecord  OK 412.78s   3 
subtests passed
 29/331 postgresql:recovery / recovery/026_overwrite_contrecord  OK 448.46s   3 
subtests passed
 29/331 postgresql:recovery / recovery/026_overwrite_contrecord  OK 445.73s   3 
subtests passed
 29/331 postgresql:recovery / recovery/026_overwrite_contrecord  OK 383.50s   3 
subtests passed

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-09%2020%3A23%3A09
 - REL_17_STABLE
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-14%2013%3A52%3A09
 - master

Best regards,
Alexander Lakhin
Neon (https://neon.tech)




Re: SQLFunctionCache and generic plans

2025-03-30 Thread Pavel Stehule
Hi


> We get substantial wins on all of fx, fx3, fx4.  fx2 is the
> case that gets inlined and never reaches functions.c, so the
> lack of change there is expected.  What I found odd is that
> I saw a small speedup (~6%) on fx5 and fx6; those functions
> are in plpgsql so they really shouldn't change either.
> The only thing I can think of is that I made the hash key
> computation a tiny bit faster by omitting unused argtypes[]
> entries.  That does avoid hashing several hundred bytes
> typically, but it's hard to believe it'd amount to any
> visible savings overall.
>
> Anyway, PFA v10.
>
>
I can confirm that all tests passed without problems.

Regards

Pavel


> regards, tom lane
>
> [1]
> https://www.postgresql.org/message-id/CAFj8pRDWDeF2cC%2BpCjLHJno7KnK5kdtjYN-f933RHS7UneArFw%40mail.gmail.com
>
>


Re: Typo in comment for pgstat_database_flush_cb()

2025-03-30 Thread Gurjeet Singh
On Sun Mar 30, 2025 at 4:39 AM PDT, Heikki Linnakangas wrote:
> On 30/03/2025 14:32, Heikki Linnakangas wrote:
>> On 30/03/2025 13:28, Etsuro Fujita wrote:
>>> Another thing I noticed is $SUBJECT: I think “if lock could not
>>> immediately acquired” should be “if lock could not be immediately
>>> acquired”.  Attached is a patch for that.
>> 
>> Yep. And there are more instances of the same typo in other such 
>> flush_cb functions, if you search for "immediately acquired".

+1 for changing other occurrences of this in other pgstat_*.c files.

> While we're at it, the comment feels a bit awkward to me anyway. Maybe 
> rephrase it to "If nowait is true and the lock could not be immediately 
> acquired, returns false without flushing the entry. Otherwise returns true."

This rephrasing does sound better.

Best regards,
Gurjeet
http://Gurje.et





Re: Typo in comment for pgstat_database_flush_cb()

2025-03-30 Thread Heikki Linnakangas

On 30/03/2025 13:28, Etsuro Fujita wrote:

Another thing I noticed is $SUBJECT: I think “if lock could not
immediately acquired” should be “if lock could not be immediately
acquired”.  Attached is a patch for that.


Yep. And there are more instances of the same typo in other such 
flush_cb functions, if you search for "immediately acquired".


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Some read stream improvements

2025-03-30 Thread Thomas Munro
On Tue, Mar 18, 2025 at 5:56 AM Andres Freund  wrote:
> So one thing is that the pin count differs by 1 at the start of the scan. No
> idea why.
>
> I still don't know what drives the difference between freebsd and the rest,
> but IIUC the reason this fails is just that we are holding too many buffers
> pinned, due to some buffers being pinned outside of read_stream.c.

I couldn't reproduce this on my local FreeBSD box, but I think I see
one part of the problem: the cursor a few lines up has a stream with a
higher distance holding a couple of pins.  Not sure how the local
buffer pool got into a state that caused that if it isn't doing the
same on other machines, but anyway, if I read the test right you
intend to pin strictly one page per cursor, so I tried saying so
explicitly:

-   query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM
test_temp WHERE ctid >= '( %s, 1)'::tid $q$, cursorname, i);
+   query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM
test_temp WHERE ctid between '(%s, 1)'::tid and '(%s, )'::tid $q$,
cursorname, i, i);

That passed on the FreeBSD CI task.




Re: AIO v2.5

2025-03-30 Thread Melanie Plageman
On Tue, Mar 25, 2025 at 11:58 AM Andres Freund  wrote:
>
> > Another thought on complete_shared running on other backends: I wonder if we
> > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
> > other process" or similar, so people wonder less about how "SELECT FROM a" 
> > led
> > to a log message about IO on table "b".
>
> I've been wondering about that as well, and yes, we probably should.
>
> I'd add the pid of the backend that started the IO to the message - although
> need to check whether we're trying to keep PIDs of other processes from
> unprivileged users.
>
> I think we probably should add a similar, but not equivalent, context in io
> workers. Maybe "I/O worker executing I/O on behalf of process %d".

I think this has not yet been done. Attached patch is an attempt to
add error context for IO completions by another backend when using
io_uring and IO processing in general by an IO worker. It seems to
work -- that is, running the test_aio tests, you can see the context
in the logs.

I'm not certain that I did this in the way you both were envisioning, though.

- Melanie
From 79d7e930de510cad6aef31532b06ba679a72d94a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 30 Mar 2025 14:03:55 -0400
Subject: [PATCH] Add errcontext for processing I/Os for another backend

Push an ErrorContextCallback adding additional detail about the process
performing the I/O and the owner of the I/O when those are not the same.

For io_method worker, this adds context specifying which process owns
the I/O that the I/O worker is processing.

For io_method io_uring, this adds context only when a backend is
*completing* I/O for another backend. It specifies the pid of the owning
process.

Discussion: https://postgr.es/m/20250325141120.8e.nmisch%40google.com
---
 src/backend/storage/aio/aio.c |  1 +
 src/backend/storage/aio/method_io_uring.c | 31 +++
 src/backend/storage/aio/method_worker.c   | 29 +
 3 files changed, 61 insertions(+)

diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 43f1e2a7785..3d5cf726e24 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -42,6 +42,7 @@
 #include "port/atomics.h"
 #include "storage/aio_internal.h"
 #include "storage/aio_subsys.h"
+#include "storage/proc.h"
 #include "storage/procnumber.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
index 3b299dcf388..244918e1883 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -303,14 +303,41 @@ pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 	return num_staged_ios;
 }
 
+static void
+pgaio_uring_completion_error_callback(void *arg)
+{
+	ProcNumber	owner;
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioHandle *ioh = arg;
+
+	if (!ioh)
+		return;
+
+	/* No need for context if a backend is completing the IO for itself */
+	if (ioh->owner_procno == MyProcNumber)
+		return;
+
+	owner = ioh->owner_procno;
+	owner_proc = GetPGProcByNumber(owner);
+	owner_pid = owner_proc->pid;
+
+	errcontext("completing I/O on behalf of process %d", owner_pid);
+}
+
 static void
 pgaio_uring_drain_locked(PgAioUringContext *context)
 {
 	int			ready;
 	int			orig_ready;
+	ErrorContextCallback errcallback = {0};
 
 	Assert(LWLockHeldByMeInMode(&context->completion_lock, LW_EXCLUSIVE));
 
+	errcallback.callback = pgaio_uring_completion_error_callback;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
 	/*
 	 * Don't drain more events than available right now. Otherwise it's
 	 * plausible that one backend could get stuck, for a while, receiving CQEs
@@ -338,9 +365,11 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
 			PgAioHandle *ioh;
 
 			ioh = io_uring_cqe_get_data(cqe);
+			errcallback.arg = ioh;
 			io_uring_cqe_seen(&context->io_uring_ring, cqe);
 
 			pgaio_io_process_completion(ioh, cqe->res);
+			errcallback.arg = NULL;
 		}
 
 		END_CRIT_SECTION();
@@ -349,6 +378,8 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
 	"drained %d/%d, now expecting %d",
 	ncqes, orig_ready, io_uring_cq_ready(&context->io_uring_ring));
 	}
+
+	error_context_stack = errcallback.previous;
 }
 
 static void
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 5ea00d8a89e..52f901ed4c2 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -361,11 +361,33 @@ pgaio_worker_register(void)
 	on_shmem_exit(pgaio_worker_die, 0);
 }
 
+static void
+pgaio_worker_error_callback(void *arg)
+{
+	ProcNumber	owner;
+	PGPROC	   *owner_proc;
+	int32		owner_pid;
+	PgAioHandle *ioh = arg;
+
+	if (!ioh)
+		return;
+
+	Assert(ioh->owner_procno != MyProcNumber);
+	Assert(MyBackendType == B_IO_WORKER);
+
+	owner = ioh->owner_procno

Re: pg_stat_database.checksum_failures vs shared relations

2025-03-30 Thread Magnus Hagander
On Sat, Mar 29, 2025 at 7:09 PM Andres Freund  wrote:

> Hi,
>
> On 2025-03-29 13:17:44 -0400, Andres Freund wrote:
> > On 2025-03-28 13:47:16 -0400, Andres Freund wrote:
> > > Attached is a fix for the issue.
> >
> > I plan to push this fix soon, unless somebody protests...
>
> And done.
>

Hi!

Sorry to get into this thread a bit late. Just to let you know that now
that I'm caught up, I do agree it looks right.

And thanks for handling this!

//Magnus


Re: Non-text mode for pg_dumpall

2025-03-30 Thread Andrew Dunstan


On 2025-03-30 Su 12:50 PM, Andrew Dunstan wrote:


On 2025-03-29 Sa 1:17 AM, Mahendra Singh Thalor wrote:
On Sat, 29 Mar 2025 at 03:50, Andrew Dunstan  
wrote:


On 2025-03-27 Th 5:15 PM, Andrew Dunstan wrote:

On 2025-03-19 We 2:41 AM, Mahendra Singh Thalor wrote:

On Wed, 12 Mar 2025 at 21:18, Andrew Dunstan 
wrote:

On 2025-03-12 We 3:03 AM, jian he wrote:

On Wed, Mar 12, 2025 at 1:06 AM Álvaro Herrera
 wrote:

Hello,

On 2025-Mar-11, Mahendra Singh Thalor wrote:


In map.dat file, I tried to fix this issue by adding number of
characters
in dbname but as per code comments, as of now, we are not
supporting \n\r
in dbnames so i removed handling.
I will do some more study to fix this issue.

Yeah, I think this is saying that you should not consider the
contents
of map.dat as a shell string.  After all, you're not going to
_execute_
that file via the shell.

Maybe for map.dat you need to escape such characters somehow, 
so that

they don't appear as literal newlines/carriage returns.


I am confused.
currently pg_dumpall plain format will abort when encountering 
dbname

containing newline.
the left dumped plain file does not contain all the cluster
databases data.


if pg_dumpall non-text format aborts earlier,
it's aligned with pg_dumpall plain format?
it's also an improvement since aborts earlier, nothing will be 
dumped?



am i missing something?



I think we should fix that.

But for the current proposal, Álvaro and I were talking this 
morning,

and we thought the simplest thing here would be to have the one line
format and escape NL/CRs in the database name.


cheers


Okay. As per discussions, we will keep one line entry for each
database into map.file.

Thanks all for feedback and review.

Here, I am attaching updated patches for review and testing. These
patches can be applied on commit a6524105d20b.



I'm working through this patch set with a view to committing it.
Attached is some cleanup which is where I got to today, although there
is more to do. One thing I am wondering is why not put the
SimpleDatabaseOidList stuff in fe_utils/simple_list.{c,h} ? That's
where all the similar stuff belongs, and it feels strange to have this
inline in pg_restore.c. (I also don't like the name much -
SimpleOidStringList or maybe SimpleOidPlusStringList might be better).





OK, I have done that, so here is the result. The first two are your
original patches. Patch 3 adds the new list type to fe-utils, and patch
4 contains my cleanups and use of the new list type. Apart from some
relatively minor cleanup, the one thing I would like to change is how
dumps are named. If we are producing tar or custom format dumps, I 
think

the file names should reflect that (oid.dmp and oid.tar rather than a
bare oid as the filename), and pg_restore should look for those. I'm
going to work on that tomorrow - I don't think it will be terribly
difficult.


Thanks Andrew.

Here, I am attaching a delta patch for oid.tar and oid.dmp format.




OK, looks good, I have incorporated that.

There are a couple of rough edges, though.

First, I see this:


andrew@ub22arm:inst $ bin/pg_restore -C -d postgres 
--exclude-database=regression_dummy_seclabel 
--exclude-database=regression_test_extensions 
--exclude-database=regression_test_pg_dump dest
pg_restore: error: could not execute query: "ERROR:  role "andrew" 
already exists

"
Command was: "

--
-- Roles
--

CREATE ROLE andrew;"
pg_restore: warning: errors ignored on global.dat file restore: 1
pg_restore: error: could not execute query: ERROR:  database 
"template1" already exists
Command was: CREATE DATABASE template1 WITH TEMPLATE = template0 
ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C';



pg_restore: warning: errors ignored on database "template1" restore: 1
pg_restore: error: could not execute query: ERROR:  database 
"postgres" already exists
Command was: CREATE DATABASE postgres WITH TEMPLATE = template0 
ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C';



pg_restore: warning: errors ignored on database "postgres" restore: 1
pg_restore: warning: errors ignored on restore: 3



It seems pointless to be trying to create the role that we are 
connected as, and we also expect template1 and postgres to exist.


In a similar vein, I don't see why we are setting the --create flag in 
pg_dumpall for those databases. I'm attaching a patch that is designed 
to stop that, but it doesn't solve the above issues.


I also notice a bunch of these in globals.dat:


--
-- Databases
--

--
-- Database "template1" dump
--

--
-- Database "andrew" dump
--

--
-- Database "isolation_regression_brin" dump
--

--
-- Database "isolation_regression_delay_execution" dump
--

 ...


The patch also tries to fix this.

Lastly, this badly needs some TAP tests written.

I'm going to work on reviewing the documentation next.







I have reworked the documentation some. See attached.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/doc/src/sgml

Re: Commit fest 2025-03

2025-03-30 Thread vignesh C
On Mon, 24 Mar 2025 at 10:07, vignesh C  wrote:
>
> On Mon, 17 Mar 2025 at 09:26, vignesh C  wrote:
>
Here's a quick commitfest status report as of today:
status               |  start |  10th |  17th |  24th |  31st
---------------------+--------+-------+-------+-------+-------
Needs review:        |   198  |  182  |  134  |  120  |  105
Waiting on Author:   |    37  |   35  |   69  |   62  |   59
Ready for Committer: |    33  |   33  |   34  |   34  |   27
Committed:           |    52  |   67  |   75  |   90  |  109
Moved to next CF:    |     7  |    7  |   11  |   14  |   16
Withdrawn:           |    12  |   15  |   15  |   18  |   18
Rejected:            |     1  |    1  |    3  |    3  |    5
RWF:                 |     2  |    2  |    2  |    2  |    4
Total:               |   343  |  343  |  343  |  343  |  343

There are currently 27 patches in the "Ready for Committer" status at
[1]. If any committers have bandwidth, please consider reviewing and
advancing them. If you have submitted a patch that is in the "Waiting
for Author" state, please update it to "Needs Review" as soon as
possible. This is where reviewers are most likely to focus their
efforts. Additionally, 12 Needs review patches require a rebase due to
recent commits at [2]. Patch owners are requested to rebase and submit
updated versions.

19 patches have been committed in the last week.

[1] - https://commitfest.postgresql.org/52/?status=3
[2] - https://commitfest.postgresql.org/52/?status=1

Regards,
Vignesh




Re: Memoize ANTI and SEMI JOIN inner

2025-03-30 Thread Alena Rybakina

On 31.03.2025 06:33, David Rowley wrote:

On Mon, 31 Mar 2025 at 16:21, Alena Rybakina  wrote:

However, is it necessary to check that extra->inner_unique must be false for 
SEMI/ANTI joins here, or am I missing something? It looks a little confusing at 
this point.

If it is necessary, I don't see the reason for it. It was me that
worked on unique joins and I see no reason why a SEMI or ANTI join
couldn't be marked as unique. The reason they're not today is that the
only point of the unique join optimisation is so that during
execution, the join nodes could skip to the next outer tuple after
matching the current outer to an inner.  If the join is unique, then
there are no more join partners to find for the current outer after
matching it up once. With SEMI and ANTI joins, we skip to the next
outer tuple after finding a match anyway, so there's no point in going
to the trouble of setting the inner_unique flag.

I can't say definitively that we won't find a reason in the future
that we should set inner_unique for SEMI/ANTI joins, so I don't follow
the need for the Assert.

Maybe you're seeing something that I'm not. What do you think will
break if both flags are true?

Actually, I was mainly confused by the code itself - the check seemed to 
contradict the explanation. It looked like we were enforcing that
inner_unique must be false for SEMI/ANTI joins, even though it's not 
actually important for those join types.
That’s why I originally proposed either adding an Assert or removing 
this flag from the check altogether, just to make it more explicit.


So, I agree with your explanation — by the definition of SEMI and ANTI 
joins, there's no need to set inner_unique, and also no need to assert 
against it.
These joins skip to the next outer tuple once they find a match (or fail 
to find one, in the case of ANTI).


I updated the diff, where I left changes only in the code comment.

--
Regards,
Alena Rybakina
Postgres Professional
diff --git a/src/backend/optimizer/path/joinpath.c 
b/src/backend/optimizer/path/joinpath.c
index c75408f552b..252cb712943 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -728,14 +728,16 @@ get_memoize_path(PlannerInfo *root, RelOptInfo *innerrel,
single_mode = true;
 
/*
-* Memoize normally marks cache entries as complete when it runs out of
-* tuples to read from its subplan.  However, with unique joins, Nested
-* Loop will skip to the next outer tuple after finding the first 
matching
-* inner tuple. Another case is a semi or anti join. If number of join
-* clauses, pushed to the inner as parameterised filter no less than the
-* number of join clauses, that means all the clauses have been pushed 
to
-* the inner and any tuple coming from the inner side will be 
successfully
-* used to build the join result.
+* Normally, memoize marks cache entries as complete when it exhausts
+* all tuples from its subplan.  However, in unique joins, Nested Loop
+* will skip to the next outer tuple after finding the first matching
+* inner tuple.
+* Another case is a SEMI or ANTI join. If the number of join clauses
+* pushed to the inner as a parameterised filter is equal to or greater
+* than the total number of join clauses, then all relevant
+* join conditions have been applied on the inner side, so any returned
+* inner tuple will be guaranteed to satisfy the join condition, making
+* it safe to memoize.
 * This means that we may not read the inner side of the
 * join to completion which leaves no opportunity to mark the cache 
entry
 * as complete.  To work around that, when the join is unique we


Re: general purpose array_sort

2025-03-30 Thread Tom Lane
I spent some time looking at the v17 patchset.  There were some pretty
strange things in it --- why were some of the variants of array_sort()
marked as volatile, for example?  But the two things I'd like to
suggest functionality-wise are:

* The second argument of the variants with booleans should be defined
as true=descending, not true=ascending.  It seems a little odd to me
for the default of a boolean option not to be "false".  Also, then
you don't need an inversion between the second and third arguments.
I'm not dead set on this but it just seems a little cleaner.

* I see that the code is set up to detect an unsortable input type
before it takes the fast exit for "no sort required".  I think this
is poor engineering: we ought to make the fast path as fast as
possible.  The can't-sort case is so rare in real-world usage that
I do not think it matters if the error isn't thrown by every possible
call.  Besides which, it is inconsistent anyway: consider
SELECT array_sort(NULL::xid[]);
which will not error because it will never reach the C code.  Why's
that okay but delivering an answer for "array_sort('{1}'::xid[])"
is not?  I think "throw error only if we must sort and cannot" is
a perfectly fine definition.

At the code level, I didn't like the way that the multiple entry
points were set up.  I think it's generally cleaner code to have
a worker function with plain C call and return coding and make
all the SQL-visible functions be wrappers around that.  Also the
caching mechanism was overcomplicated, in particular because we
do not need a cache lookup to know which sort operators apply to
arrays.

So all that leads me to v18 attached.  (I merged the two patches
into one, didn't see much value in splitting them.)

In v18, it's somewhat annoying that the typcache doesn't cache
the typarray field; we would not need a separate get_array_type()
lookup if it did.  I doubt there is any real reason for that except
that pg_type.typarray didn't exist when the typcache was invented.
So I'm tempted to add it.  But I looked at existing callers of
get_array_type() and none of them are adjacent to typcache lookups,
so only array_sort would be helped immediately.  I left it alone
for the moment; wonder if anyone else has an opinion?

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5bf6656deca..2129d027398 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20741,6 +20741,42 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sort
+
+array_sort (
+  array anyarray
+  , descending boolean
+  , nulls_first boolean
+   )
+anyarray
+   
+   
+Sorts the first dimension of the array.
+The sort order is determined by the default sort ordering of the
+array's element type; however, if the element type is collatable,
+the collation to use can be forced by adding
+a COLLATE clause to
+the array argument.
+   
+   
+If descending is true then sort in
+descending order, otherwise ascending order.  If omitted, the
+default is ascending order.
+If nulls_first is true then nulls appear
+before non-null values, otherwise nulls appear after non-null
+values.
+If omitted, nulls_first is taken to have
+the same value as descending.
+   
+   
+array_sort(ARRAY[[2,4],[2,1],[6,5]])
+{{2,1},{2,4},{6,5}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index 2aae2f8ed93..2a8ea974029 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -12,16 +12,19 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_operator_d.h"
 #include "catalog/pg_type.h"
 #include "common/int.h"
 #include "common/pg_prng.h"
 #include "libpq/pqformat.h"
+#include "miscadmin.h"
 #include "nodes/supportnodes.h"
 #include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
 #include "utils/typcache.h"
 
 /*
@@ -43,6 +46,18 @@ typedef struct DeserialIOData
 	Oid			typioparam;
 } DeserialIOData;
 
+/*
+ * ArraySortCachedInfo
+ *		Used for caching catalog data in array_sort
+ */
+typedef struct ArraySortCachedInfo
+{
+	ArrayMetaState array_meta;	/* metadata for array_create_iterator */
+	Oid			elem_lt_opr;	/* "<" operator for element type */
+	Oid			elem_gt_opr;	/* ">" operator for element type */
+	Oid			array_type;		/* pg_type OID of array type */
+} ArraySortCachedInfo;
+
 static Datum array_position_common(FunctionCallInfo fcinfo);
 
 
@@ -1858,3 +1873,171 @@ array_reverse(PG_FUNCTION_ARGS)
 
 	PG_RETURN_ARRAYTYPE_P(result);
 }
+
+/*
+ * array_sort
+ *
+ * Sorts the first dimension of the a

Re: AIO v2.5

2025-03-30 Thread Andres Freund
Hi,

On 2025-03-27 10:52:10 +1300, Thomas Munro wrote:
> On Thu, Mar 27, 2025 at 10:41 AM Andres Freund  wrote:
> > > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems
> > >
> > > Consider also updating this comment to stop focusing on prefetch; I think
> > > changing that aligns with the patch's other changes:
> > >
> > > /*
> > >  * How many buffers PrefetchBuffer callers should try to stay ahead of 
> > > their
> > >  * ReadBuffer calls by.  Zero means "never prefetch".  This value is only 
> > > used
> > >  * for buffers not belonging to tablespaces that have their
> > >  * effective_io_concurrency parameter set.
> > >  */
> > > int   effective_io_concurrency = 
> > > DEFAULT_EFFECTIVE_IO_CONCURRENCY;
> >
> > Good point.  Although I suspect it might be worth adjusting this, and also 
> > the
> > config.sgml bit about effective_io_concurrency separately. That seems like 
> > it
> > might take an iteration or two.
> 
> +1 for rewriting that separately from this work on the code (I can
> have a crack at that if you want).

You taking a crack at that would be appreciated!

> For the comment, my suggestion would be something like:
> 
> "Default limit on the level of concurrency that each I/O stream
> (currently, ReadStream but in future other kinds of streams) can use.
> Zero means that I/O is always performed synchronously, ie not
> concurrently with query execution. This value can be overridden at the
> tablespace level with the parameter of the same name. Note that
> streams performing I/O not classified as single-session work respect
> maintenance_io_concurrency instead."

Generally sounds good. I do wonder if the last sentence could be made a bit
simpler, it took me a few seconds to parse "not classified as single-session".

Maybe "classified as performing work for multiple sessions respect
maintenance_io_concurrency instead."?

Greetings,

Andres Freund




Re: AIO v2.5

2025-03-30 Thread Andres Freund
Hi,

On 2025-03-29 14:29:29 -0700, Noah Misch wrote:
> > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control
> >  ignore_checksum_failure
> > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers()
> > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO
> > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional 
> > batchmode
> >  support
> > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as
> >  wait time
> > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems
>
> Ready for commit

I've pushed most of these after some very light further editing.  Yay.  Thanks
a lot for all the reviews!

So far the buildfarm hasn't been complaining, but it's early days.


I didn't yet push

> > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level 
> > design

because I want to integrate some language that could be referenced by
smgrstartreadv() (and more in the future), as we have been talking about.


Tomorrow I'll work on sending out a new version with the remaining patches. I
plan for that version to have:

- pg_aios view with the security checks (already done, trivial)

- a commit with updated language for smgrstartreadv(), probably referencing
  aio's README

- a change to mdreadv() around zero_damaged_pages, as Noah and I have been
  discussing

- updated tests, with the FIXMEs etc addressed

- a reviewed version of the errcontext callback patch that Melanie sent
  earlier today


Todo beyond that:

- The comment and docs updates we've been discussing in
  
https://postgr.es/m/5fc6r4smanncsmqw7ib6s3uj6eoiqoioxbd5ibmhtqimcggtoe%40fyrok2gozsoq

- I think a good long search through the docs is in order, there probably are
  other things that should be updated, beyond concrete references to
  effective_io_concurrency etc.

- Whether we should do something, and if so what, about BAS_BULKREAD for
  18. Thomas may have some thoughts / code.

- Whether there's anything committable around Jelte's RLIMIT_NOFILE changes.


Greetings,

Andres Freund




Improve coments on structures in trigger.c

2025-03-30 Thread Yugo Nagata
Hi,

I found that trigger.c describes the fields of three different
structures in a single comment block. I feel this is confusing because some
structure names are not explicitly mentioned there even though common field
names are used between structures. I've attached a patch to improve the comments
by splitting the block into three.

Regards,
Yugo Nagata

-- 
Yugo Nagata 
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index c9f61130c69..699454ff277 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3771,55 +3771,15 @@ typedef struct AfterTriggerEventList
  *
  * query_stack[query_depth] is the per-query-level data, including these fields:
  *
- * events is a list of AFTER trigger events queued by the current query.
- * None of these are valid until the matching AfterTriggerEndQuery call
- * occurs.  At that point we fire immediate-mode triggers, and append any
- * deferred events to the main events list.
- *
- * fdw_tuplestore is a tuplestore containing the foreign-table tuples
- * needed by events queued by the current query.  (Note: we use just one
- * tuplestore even though more than one foreign table might be involved.
- * This is okay because tuplestores don't really care what's in the tuples
- * they store; but it's possible that someday it'd break.)
- *
- * tables is a List of AfterTriggersTableData structs for target tables
- * of the current query (see below).
- *
  * maxquerydepth is just the allocated length of query_stack.
  *
  * trans_stack holds per-subtransaction data, including these fields:
  *
- * state is NULL or a pointer to a saved copy of the SET CONSTRAINTS
- * state data.  Each subtransaction level that modifies that state first
- * saves a copy, which we use to restore the state if we abort.
- *
- * events is a copy of the events head/tail pointers,
- * which we use to restore those values during subtransaction abort.
- *
- * query_depth is the subtransaction-start-time value of query_depth,
- * which we similarly use to clean up at subtransaction abort.
- *
- * firing_counter is the subtransaction-start-time value of firing_counter.
- * We use this to recognize which deferred triggers were fired (or marked
- * for firing) within an aborted subtransaction.
- *
  * We use GetCurrentTransactionNestLevel() to determine the correct array
  * index in trans_stack.  maxtransdepth is the number of allocated entries in
  * trans_stack.  (By not keeping our own stack pointer, we can avoid trouble
  * in cases where errors during subxact abort cause multiple invocations
  * of AfterTriggerEndSubXact() at the same nesting depth.)
- *
- * We create an AfterTriggersTableData struct for each target table of the
- * current query, and each operation mode (INSERT/UPDATE/DELETE), that has
- * either transition tables or statement-level triggers.  This is used to
- * hold the relevant transition tables, as well as info tracking whether
- * we already queued the statement triggers.  (We use that info to prevent
- * firing the same statement triggers more than once per statement, or really
- * once per transition table set.)  These structs, along with the transition
- * table tuplestores, live in the (sub)transaction's CurTransactionContext.
- * That's sufficient lifespan because we don't allow transition tables to be
- * used by deferrable triggers, so they only need to survive until
- * AfterTriggerEndQuery.
  */
 typedef struct AfterTriggersQueryData AfterTriggersQueryData;
 typedef struct AfterTriggersTransData AfterTriggersTransData;
@@ -3842,6 +3802,23 @@ typedef struct AfterTriggersData
 	int			maxtransdepth;	/* allocated len of above array */
 } AfterTriggersData;
 
+/*
+ * AfterTriggersQueryData has the following fields:
+ *
+ * events is a list of AFTER trigger events queued by the current query.
+ * None of these are valid until the matching AfterTriggerEndQuery call
+ * occurs.  At that point we fire immediate-mode triggers, and append any
+ * deferred events to the main events list.
+ *
+ * fdw_tuplestore is a tuplestore containing the foreign-table tuples
+ * needed by events queued by the current query.  (Note: we use just one
+ * tuplestore even though more than one foreign table might be involved.
+ * This is okay because tuplestores don't really care what's in the tuples
+ * they store; but it's possible that someday it'd break.)
+ *
+ * tables is a List of AfterTriggersTableData structs for target tables
+ * of the current query (see below).
+ */
 struct AfterTriggersQueryData
 {
 	AfterTriggerEventList events;	/* events pending from this query */
@@ -3849,6 +3826,23 @@ struct AfterTriggersQueryData
 	List	   *tables;			/* list of AfterTriggersTableData, see below */
 };
 
+/*
+ * AfterTriggersTransData has the following fields:
+ *
+ * state is NULL or a pointer to a saved copy of the SET CONSTRAINTS
+ * state data.  Each subtransaction level that modifies that state first
+ * saves a copy, which we use to restore th

Re: Proposal: Progressive explain

2025-03-30 Thread Andrei Lepikhov

On 3/31/25 02:23, torikoshia wrote:
On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro 

Implemented this version. New patch has the following characteristics:



I haven't looked into the code yet, but when I ran the commands below during 
make installcheck, there was an error and an assertion failure:


   =# select * from pg_stat_progress_explain;
   =# \watch 0.1

Yeah, that's to be expected.
I think many corner cases may be found: a hash table in the middle of 
being filled, open file descriptors, an incorrect combination of variables, 
'not yet visited' subtrees - who knows what else? So, last time, I 
just ended up with the idea that reusing the explain code is a bit 
dangerous - in the middle of execution, it is enough to expose only 
basic data - rows, numbers and timings. That seems safe to gather.


--
regards, Andrei Lepikhov




Re: Replace IN VALUES with ANY in WHERE clauses during optimization

2025-03-30 Thread Alexander Korotkov
Hi, Alena!

On Sat, Mar 29, 2025 at 9:03 PM Alena Rybakina
 wrote:
> On 29.03.2025 14:03, Alexander Korotkov wrote:
>> One thing I have to fix: we must do
>> IncrementVarSublevelsUp() unconditionally for all expressions as Vars
>> could be deeper inside.
>
> Yes, I'm looking at it too. I've just understood that it was needed for 
> subqueries - they can contain Var elements whose sublevel parameter needs 
> to be decreased.
>
> for example for the query:
>
> EXPLAIN (COSTS OFF)
> SELECT ten FROM onek t
> WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c
>   WHERE c.unique2 = t.unique1))::integer));
>
> We are interested in this element: ((2 IN (SELECT unique2 FROM onek c  WHERE 
> c.unique2 = t.unique1))
>
> It is a FuncExpr object with a RangeTblEntry variable. I highlighted it:
>
> WARNING:  1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false 
> :funcvariadic false :funcformat 1 :funccollid 0 :inputcollid 0 :args 
> ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 
> :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({CONST 
> :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true 
> :constisnull false :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM 
> :paramkind 2 :paramid 1 :paramtype 23 :paramtypmod -1 :paramcollid 0 
> :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType 1 
> :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs 
> false :hasWindowFuncs false :hasTargetSRFs false :hasSubLinks false 
> :hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate 
> false :hasRowSecurity false :hasGroupRTE false :isReturn false :cteList <> 
> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c :colnames <>} :eref {ALIAS 
> :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" 
> "hundred" "thousand" "twothousand" "fivethous" "tenthous" "odd" "even" 
> "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true :relkind 
> r :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl 
> true :securityQuals <>}) :rteperminfos ({RTEPERMISSIONINFO :relid 32795 :inh 
> true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9) :insertedCols (b) 
> :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) 
> :quals {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false 
> :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23 
> :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 
> :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} {VAR :varno 1 
> :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) 
> :varlevelsup 2 :varreturningtype 0 :varnosyn 1 :varattnosyn 1 :location -1}) 
> :location -1}} :mergeActionList <> :mergeTargetRelation 0 :mergeJoinCondition 
> <> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 
> :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 
> :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 
> :resname unique2 :ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk 
> false}) :override 0 :onConflict <> :returningOldAlias <> :returningNewAlias 
> <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <> 
> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> 
> :limitOffset <> :limitCount <> :limitOption 0 :rowMarks <> :setOperations <> 
> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} 
> :location -1}) :location -1}
>
>
> I highlighted in bold the Var we need - since the subquery inside the IN 
> expression will be flattened, all elements contained in it should have their 
> level number decreased by one, since they will belong to the subtree located 
> above it. Because of that condition, this did not happen.
>
> I generally agree with you that it is better to remove that condition. The 
> function IncrementVarSublevelsUp essentially walks the structures 
> below and decreases the level of only the Vars for which this needs to be 
> done, and the condition with 1 will protect us from touching those Vars that 
> should not be changed. So the varlevelsup for this Var should be 1.
>
> I am currently investigating whether this transformation is valid for all 
> cases; I have not found any problems yet.

Thank you for your feedback.  I appreciate you're also looking for the
potential problems.  On thing to highlight: doing
IncrementVarSublevelsUp() unconditionally is required not just for
subqueries.  Consider the following example.

SELECT * FROM t WHERE val1 IN (VALUES (val2), (val2 +1));

The second value contains a Var, which needs IncrementVarSublevelsUp(),
but the top node is an OpExpr.

--
Regards,
Alexander Korotkov
Supabase




Typo in comment for pgstat_database_flush_cb()

2025-03-30 Thread Etsuro Fujita
Another thing I noticed is $SUBJECT: I think “if lock could not
immediately acquired” should be “if lock could not be immediately
acquired”.  Attached is a patch for that.

Best regards,
Etsuro Fujita


fix-typo-in-comment.patch
Description: Binary data


Re: Memoize ANTI and SEMI JOIN inner

2025-03-30 Thread Alena Rybakina

Hi!

On 21.03.2025 18:56, Andrei Lepikhov wrote:

On 20/3/2025 07:02, David Rowley wrote:

On Thu, 20 Mar 2025 at 06:16, Andrei Lepikhov  wrote:

How can we be sure that semi or anti-join needs only one tuple? I think
it would be enough to control the absence of join clauses and 
filters in

the join. Unfortunately, we only have such a guarantee in the plan
creation stage (maybe even setrefs.c). So, it seems we need to 
invent an

approach like AlternativeSubplan.


I suggest looking at what 9e215378d did.  You might be able to also
allow semi and anti-joins providing the cache keys cover the entire
join condition. I think this might be safe as Nested Loop will only
ask its inner subnode for the first match before skipping to the next
outer row and with anti-join, there's no reason to look for additional
rows after the first. Semi-join and unique joins do the same thing in
nodeNestloop.c. To save doing additional checks at run-time, the code
does:

Thank you for the clue! I almost took the wrong direction.
I have attached the new patch, which includes corrected comments for 
better clarification of the changes, as well as some additional tests.
I now feel much more confident about this version since I have 
resolved that concern.




I reviewed your patch and made a couple of suggestions.

The first change is related to your comment (and the one before it). I 
fixed some grammar issues and simplified the wording to make it clearer 
and easier to understand.


The second change involves adding an Assert when generating the Memoize 
path. Based on the existing comment and the surrounding logic (shown 
below),
I believe it's worth asserting that both inner_unique and single_mode 
are not true at the same time — just as a safety check.

/*
* We may do this for SEMI or ANTI joins when they need only one tuple from
* the inner side to produce the result. Following if condition checks that
* rule.
*
* XXX Currently we don't attempt to mark SEMI/ANTI joins as inner_unique
* = true. Should we? See add_paths_to_joinrel()
*/
if (!extra->inner_unique && (jointype == JOIN_SEMI ||
                             jointype == JOIN_ANTI))
    single_mode = true;

--
Regards,
Alena Rybakina
Postgres Professional
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index c75408f552b..252cb712943 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -728,14 +728,16 @@ get_memoize_path(PlannerInfo *root, RelOptInfo *innerrel,
 		single_mode = true;
 
 	/*
-	 * Memoize normally marks cache entries as complete when it runs out of
-	 * tuples to read from its subplan.  However, with unique joins, Nested
-	 * Loop will skip to the next outer tuple after finding the first matching
-	 * inner tuple. Another case is a semi or anti join. If number of join
-	 * clauses, pushed to the inner as parameterised filter no less than the
-	 * number of join clauses, that means all the clauses have been pushed to
-	 * the inner and any tuple coming from the inner side will be successfully
-	 * used to build the join result.
+	 * Normally, memoize marks cache entries as complete when it exhausts
+	 * all tuples from its subplan.  However, in unique joins, Nested Loop
+	 * will skip to the next outer tuple after finding the first matching
+	 * inner tuple.
+	 * Another case is a SEMI or ANTI join. If the number of join clauses
+	 * pushed to the inner as a parameterised filter is equal to or greater
+	 * than the total number of join clauses, then all relevant join
+	 * conditions have been applied on the inner side, so any returned
+	 * inner tuple will be guaranteed to satisfy the join condition, making
+	 * it safe to memoize.
 	 * This means that we may not read the inner side of the
 	 * join to completion which leaves no opportunity to mark the cache entry
 	 * as complete.  To work around that, when the join is unique we
@@ -808,6 +810,8 @@ get_memoize_path(PlannerInfo *root, RelOptInfo *innerrel,
 	&hash_operators,
 	&binary_mode))
 	{
+		Assert(!(extra->inner_unique && single_mode));
+
 		return (Path *) create_memoize_path(root,
 			innerrel,
 			inner_path,


Re: Memoize ANTI and SEMI JOIN inner

2025-03-30 Thread David Rowley
On Mon, 31 Mar 2025 at 16:21, Alena Rybakina  wrote:
> However, is it necessary to check that extra->inner_unique must be false for 
> SEMI/ANTI joins here, or am I missing something? It looks a little confusing 
> at this point.

If it is necessary, I don't see the reason for it. It was me that
worked on unique joins and I see no reason why a SEMI or ANTI join
couldn't be marked as unique. The reason they're not today is that the
only point of the unique join optimisation is so that during
execution, the join nodes could skip to the next outer tuple after
matching the current outer to an inner.  If the join is unique, then
there are no more join partners to find for the current outer after
matching it up once. With SEMI and ANTI joins, we skip to the next
outer tuple after finding a match anyway, so there's no point in going
to the trouble of setting the inner_unique flag.
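To make the short-circuit behaviour concrete, here is a toy Python sketch of a
nested-loop join (not executor code; `nestloop` and its parameters are invented
for illustration). It shows why setting inner_unique for a SEMI join changes
nothing: the SEMI join already skips to the next outer tuple after the first
match.

```python
# Toy nested-loop join illustrating why inner_unique is redundant for
# SEMI joins: the executor skips to the next outer tuple after the
# first match either way.  Illustrative sketch, not PostgreSQL code.

def nestloop(outer, inner, match, jointype="INNER", inner_unique=False):
    result = []
    for o in outer:
        for i in inner:
            if match(o, i):
                if jointype == "SEMI":
                    result.append(o)        # emit outer once, then move on
                    break
                result.append((o, i))
                if inner_unique:            # unique join: no more partners
                    break
    return result

outer = [1, 2, 3]
inner = [1, 1, 2]
eq = lambda o, i: o == i

# SEMI join emits each matching outer tuple once, with or without the
# inner_unique flag -- the short-circuit behaviour is identical.
print(nestloop(outer, inner, eq, "SEMI"))                     # [1, 2]
print(nestloop(outer, inner, eq, "SEMI", inner_unique=True))  # [1, 2]
```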

I can't say definitively that we won't find a reason in the future
that we should set inner_unique for SEMI/ANTI joins, so I don't follow
the need for the Assert.

Maybe you're seeing something that I'm not. What do you think will
break if both flags are true?

David




Re: Change log level for notifying hot standby is waiting non-overflowed snapshot

2025-03-30 Thread Fujii Masao



On 2025/03/28 0:13, torikoshia wrote:

On 2025-03-27 11:06, torikoshia wrote:

Hi,

I had another off-list discussion with Fujii-san, and according to the
following manual[1], it seems that a transaction with an overflowed
subtransaction is already considered inconsistent:

  Reaching a consistent state can also be delayed in the presence of
both of these conditions:

  - A write transaction has more than 64 subtransactions
  - Very long-lived write transactions

IIUC, the manual suggests that both conditions must be met -- recovery
reaching at least minRecoveryPoint and no overflowed subtransactions
-- for the standby to be considered consistent.

OTOH, the following log message is emitted even when subtransactions
have overflowed, which appears to contradict the definition of
consistency mentioned above:

  LOG:  consistent recovery state reached

This log message is triggered when recovery progresses beyond
minRecoveryPoint(according to CheckRecoveryConsistency()).
However, since this state does not satisfy 'consistency' defined in
the manual, I think it would be more accurate to log that it has
merely reached the "minimum recovery point".
Furthermore, it may be better to emit the above log message only when
recovery has progressed beyond minRecoveryPoint and there are no
overflowed subtransactions.

Attached patch does this.
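For clarity, the gating logic proposed above can be sketched like this in
Python (names and the LSN comparison are simplified stand-ins, not the actual
CheckRecoveryConsistency() code; only PGPROC_MAX_CACHED_SUBXIDS = 64 is a real
PostgreSQL constant):

```python
# Sketch of the proposed gating: treat the standby as ready for hot
# standby connections only once recovery has passed the minimum
# recovery point AND no running transaction has overflowed its
# subtransaction cache.  Names are illustrative, not PostgreSQL's.

PGPROC_MAX_CACHED_SUBXIDS = 64  # per-backend subxid cache size

def hot_standby_ready(replayed_lsn, min_recovery_point, running_xacts):
    if replayed_lsn < min_recovery_point:
        return False, "minimum recovery point not reached"
    if any(x["nsubxids"] > PGPROC_MAX_CACHED_SUBXIDS for x in running_xacts):
        return False, "snapshot pending: subtransactions overflowed"
    return True, "consistent recovery state reached"

ready, why = hot_standby_ready(
    replayed_lsn=200, min_recovery_point=100,
    running_xacts=[{"xid": 10, "nsubxids": 70}])
print(ready, why)
```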

Additionally, renaming variables such as reachedConsistency in
CheckRecoveryConsistency might also be appropriate.
However, in the attached patch, I have left them unchanged for now.


On 2025-03-25 00:55, Fujii Masao wrote:

-  case CAC_NOTCONSISTENT:
+  case CAC_NOTCONSISTENT_OR_OVERFLOWED:

This new name seems a bit too long. I'm OK to leave the name as it is.
Or, something like CAC_NOTHOTSTANDBY seems simpler and better to me.


Beyond just the length issue, given the understanding outlined above,
I now think CAC_NOTCONSISTENT does not actually need to be changed.



In high-availability.sgml, the "Administrator's Overview" section already
describes the conditions for accepting hot standby connections.
This section should also be updated accordingly.


Agreed.
I have updated this section to mention that the resolution is to close
the problematic transaction.
OTOH the changes made in v2 patch seem unnecessary, since the concept
of 'consistent' is already explained in the "Administrator's
Overview."


- errdetail("Consistent recovery state has not been yet reached.")));
+ errdetail("Consistent recovery state has not been yet reached, or snapshot is pending because subtransaction is overflowed."),

Given the above understanding, "or" is not appropriate in this
context, so I left this message unchanged.
Instead, I have added an errhint. The phrasing in the hint message
aligns with the manual, allowing users to search for this hint and
find the newly added resolution instructions.


On second thought, it may not be appropriate to show this output to clients 
attempting to connect. This message should be directed to administrators, 
not to clients.

 From this point of view, it'd be better to output a message indicating the 
status inside ProcArrayApplyRecoveryInfo(). However, a straightforward 
implementation would result in the same message being logged every time an 
XLOG_RUNNING_XACTS WAL is received, making it noisy.

Instead of directly outputting a log indicating that a hot standby connection 
cannot be established due to subtransaction overflow, the attached patch 
updates the manual so that administrators can determine whether a 
subtransaction overflow has occurred based on the modified log output.


What do you think?


I had the same thought during our off-list discussion. However,
after reviewing the recovery code - such as recoveryStopsBefore(),
which checks whether a consistent state is reached - I now believe
the manual’s definition of a consistent state may be incorrect.
A consistent state should be defined as the point where recovery
has reached minRecoveryPoint.

If we were to change the definition to match the manual,
we would also need to update various recovery checks,
which wouldn't be a trivial task.

Given that, I now think it's better to revive your v1 patch,
which introduces a new postmaster signal and improves the error message
when connections are not accepted during hot standby. I've attached
a revised version of the patch based on your v1. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From a21c5fa943c2e372bf6b77fcbb32345751fb7a84 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Fri, 28 Mar 2025 01:40:52 +0900
Subject: [PATCH v5] Improve error message when standby does not accept
 connections.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Even after reaching the minimum recovery point, if there are long-lived
write transactions with 

Re: [PoC] Reducing planning time when tables have many partitions

2025-03-30 Thread Yuya Watari
Hello Ashutosh,

Thank you for your detailed review, and apologies for my delayed response.

On Thu, Mar 27, 2025 at 1:42 PM Ashutosh Bapat
 wrote:
>
> Here are memory consumption numbers using list_free() instead of pfree(), 
> using the same method as [1], using a binary without asserts and debug info. 
> PFA the patchset where all the patches are the same as v35 but with an extra 
> patch fixing memory leak. The memory leak is visible with a higher number of 
> joins. At a lower number of joins, I expect that the memory saved is less 
> than a KB or the leaked memory fits within 1 chunk of memory context and 
> hence not visible.

Thank you for conducting your benchmarks. Your results clearly show
increased memory consumption with my patches. As Tom also suggested,
we may reduce memory usage by adopting a different design. I will
reconsider alternative approaches and compare the memory usage to the
current version.

Thank you once again for conducting the benchmarks.

-- 
Best regards,
Yuya Watari




Re: Thinko in pgstat_build_snapshot()

2025-03-30 Thread Gurjeet Singh
On Sun, Mar 30, 2025 at 4:31 AM Heikki Linnakangas  wrote:
>
> On 30/03/2025 13:23, Etsuro Fujita wrote:
> > While working on something else I noticed $SUBJECT: we are allocating
> > more memory than necessary and copying more data than necessary
> > because we specify the wrong PgStat_KindInfo member as the size
> > argument for MemoryContextAlloc and memcpy.  This could become
> > problematic if snapshotting a very large number of variables stats, so
> > I fixed it.  Attached is a patch for that.
>
> Good catch. Patch looks good to me at quick glance.

+1 on both counts.
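The pattern behind the thinko is easy to see with a sketch. The field names
below mirror the PgStat_KindInfo members involved, but the values and the
byte-slicing are invented for illustration; this is not the pgstat code:

```python
# Sketch of the thinko: allocating and copying with the wrong size
# field makes every snapshot entry carry more bytes than needed.
# Field values here are invented for the illustration.

kind_info = {
    "shared_size": 128,      # size of the whole shared entry
    "shared_data_len": 48,   # size of the stats payload to snapshot
}

shared_entry = bytes(i % 256 for i in range(kind_info["shared_size"]))

# Buggy: allocate and memcpy shared_size bytes per snapshotted entry.
buggy_copy = shared_entry[:kind_info["shared_size"]]
# Fixed: only shared_data_len bytes are actually needed.
fixed_copy = shared_entry[:kind_info["shared_data_len"]]

wasted = len(buggy_copy) - len(fixed_copy)
print(wasted)  # bytes wasted per entry, multiplied across many entries
```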

Best regards,
Gurjeet
http://Gurje.et




Re: Memoize ANTI and SEMI JOIN inner

2025-03-30 Thread Alena Rybakina

On 31.03.2025 06:04, David Rowley wrote:

On Mon, 31 Mar 2025 at 15:33, Alena Rybakina  wrote:

I believe it's worth asserting that both inner_unique and single_mode are not 
true at the same time — just as a safety check.

add_paths_to_joinrel() just chooses not to populate inner_unique for
SEMI and ANTI joins because, as of today's master, it's pretty
pointless to determine that because the executor will short-circuit
and skip to the next outer tuple for those join types anyway. I don't
follow why having both these flags set would cause trouble. It seems
perfectly legitimate that add_paths_to_joinrel() could choose to set
the inner_unique flag for these join types, and if it did, the Assert
you're proposing would fail for no good reason.

I tend to agree with you that someone might set this flag to true for 
these join types in the future.


However, is it necessary to check that extra->inner_unique must be false 
for SEMI/ANTI joins here, or am I missing something? It looks a little 
confusing at this point.


if (!extra->inner_unique && (jointype == JOIN_SEMI || jointype == JOIN_ANTI))
    single_mode = true;

--

Regards,
Alena Rybakina
Postgres Professional


Re: Improve monitoring of shared memory allocations

2025-03-30 Thread Rahila Syed
Hi Tomas,


>
> Right. I'm still not convinced if this makes any difference, or whether
> this alignment was merely a consequence of using ShmemAlloc(). I don't
> want to make this harder to understand unnecessarily.
>

Yeah, it makes sense.


> Let's keep this simple - without additional alignment. I'll think about
> it a bit more, and maybe add it before commit.
>

OK.


>
>
> > I will improve the comment in the next version.
> >
>
> OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
> really used except for this bit:
>
> +if (init_size > nelem_alloc)
> +element_alloc = false;
>
> Can't we determine before calling the function, to make it a bit less
> confusing?
>

Yes, we could determine whether the pre-allocated elements are zero before
calling the function, I have fixed it accordingly in the attached 0001
patch.
Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've
passed this information as a boolean variable, initial_elems. If it is false,
no elements are pre-allocated.
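A rough sketch of the shape of that refactor, with the caller deciding up
front whether elements are pre-allocated. Function and parameter names here
are simplified stand-ins for the dynahash internals, and the size arithmetic
is invented:

```python
# Sketch of the API change: instead of passing nelem_alloc into the
# size-estimation function and comparing inside it, the caller decides
# up front whether any elements are pre-allocated and passes a boolean.
# Names and sizes are illustrative, not the real dynahash code.

def hash_get_init_size(init_size, initial_elems, elem_size):
    size = init_size * 16          # pretend bucket/segment overhead
    if initial_elems:              # account for pre-allocated elements
        size += init_size * elem_size
    return size

# Caller computes the boolean before the call (mirroring the old
# "if (init_size > nelem_alloc) element_alloc = false" check):
nelem_alloc = 32
init_size = 64
initial_elems = init_size <= nelem_alloc   # False here: nothing pre-allocated
print(hash_get_init_size(init_size, initial_elems, elem_size=40))
```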

Please find attached the v7-series, which incorporates your review patches
and addresses a few remaining comments.

Thank you,
Rahila Syed


v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patch
Description: Binary data


v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patch
Description: Binary data


Re: Memoize ANTI and SEMI JOIN inner

2025-03-30 Thread David Rowley
On Mon, 31 Mar 2025 at 15:33, Alena Rybakina  wrote:
> I believe it's worth asserting that both inner_unique and single_mode are not 
> true at the same time — just as a safety check.

add_paths_to_joinrel() just chooses not to populate inner_unique for
SEMI and ANTI joins because, as of today's master, it's pretty
pointless to determine that because the executor will short-circuit
and skip to the next outer tuple for those join types anyway. I don't
follow why having both these flags set would cause trouble. It seems
perfectly legitimate that add_paths_to_joinrel() could choose to set
the inner_unique flag for these join types, and if it did, the Assert
you're proposing would fail for no good reason.

David