Re: Typo in comment for pgstat_database_flush_cb()
On 30/03/2025 14:32, Heikki Linnakangas wrote: On 30/03/2025 13:28, Etsuro Fujita wrote: Another thing I noticed is $SUBJECT: I think “if lock could not immediately acquired” should be “if lock could not be immediately acquired”. Attached is a patch for that. Yep. And there are more instances of the same typo in other such flush_cb functions, if you search for "immediately acquired". While we're at it, the comment feels a bit awkward to me anyway. Maybe rephrase it to "If nowait is true and the lock could not be immediately acquired, returns false without flushing the entry. Otherwise returns true." -- Heikki Linnakangas Neon (https://neon.tech)
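For concreteness, the suggested rephrasing applied to pgstat_database_flush_cb()'s header comment would look roughly like this (a sketch; the signature is from memory, not the committed hunk):

/*
 * Flush out pending stats for the entry.
 *
 * If nowait is true and the lock could not be immediately acquired, returns
 * false without flushing the entry.  Otherwise returns true.
 */
bool
pgstat_database_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)

The same wording would carry over to the other flush_cb functions that share the typo.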
Re: Thinko in pgstat_build_snapshot()
On 30/03/2025 13:23, Etsuro Fujita wrote: While working on something else I noticed $SUBJECT: we are allocating more memory than necessary and copying more data than necessary because we specify the wrong PgStat_KindInfo member as the size argument for MemoryContextAlloc and memcpy. This could become problematic if snapshotting a very large number of variable-numbered stats, so I fixed it. Attached is a patch for that. Good catch. Patch looks good to me at a quick glance. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Typo in comment for pgstat_database_flush_cb()
Etsuro Fujita wrote on Sun, Mar 30, 2025 at 18:28: > Another thing I noticed is $SUBJECT: I think “if lock could not > immediately acquired” should be “if lock could not be immediately > acquired”. Attached is a patch for that. > Agreed. The patch looks good to me. -- Thanks, Tender Wang
Re: Amcheck verification of GiST and GIN
On 3/30/25 06:04, Tom Lane wrote: > Tomas Vondra writes: >> I've pushed all the parts of this patch series, except for the stress >> test - which I think was not meant for commit. >> buildfarm seems happy so far, except for a minor indentation issue >> (forgot to reindent after merging the separate fix patch). > > Not so much here: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gecko&dt=2025-03-29%2020%3A25%3A23 > > Need to avoid depending on md5(), or it'll fail on machines with > FIPS mode enabled. See for example 95b856de2. > Ah, I forgot about that. Fixed. regards -- Tomas Vondra
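For anyone else tripping over this: the approach 95b856de2 took, if memory serves, was to stop calling md5() directly in the tests and instead go through a small FIPS-tolerant wrapper, along these lines (a sketch of the pattern, not the exact committed helper):

-- sha256() is in core since v11; keep the 32-character width md5() produced
CREATE FUNCTION fipshash(text) RETURNS text
LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE
AS $$ SELECT substr(encode(sha256($1::bytea), 'hex'), 1, 32) $$;

Test data generation then uses fipshash(i::text) where it previously used md5(i::text), which keeps working on machines with FIPS mode enabled.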
Re: speedup COPY TO for partitioned table.
On Sun, Mar 30, 2025 at 9:14 AM vignesh C wrote: > > On Sat, 29 Mar 2025 at 12:08, jian he wrote: > > > > > > I consolidated it into a new function: CopyThisRelTo. > > Few comments: > 1) Here the error message is not correct: we are printing the original > table from where the copy was done, which is a regular table and not a > foreign table; we should use childreloid instead of rel. > > + if (relkind == RELKIND_FOREIGN_TABLE) > + ereport(ERROR, > + > errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("cannot > copy from foreign table \"%s\"", > + > RelationGetRelationName(rel)), > + > errdetail("partition \"%s\" is a foreign table", > RelationGetRelationName(rel)), > + errhint("Try > the COPY (SELECT ...) TO variant.")); > > In the error detail you can include the original table too. > I changed it to: if (relkind == RELKIND_FOREIGN_TABLE) { char *relation_name; relation_name = get_rel_name(childreloid); ereport(ERROR, errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("cannot copy from foreign table \"%s\"", relation_name), errdetail("Partition \"%s\" is a foreign table in the partitioned table \"%s.%s\"", relation_name, get_namespace_name(rel->rd_rel->relnamespace), RelationGetRelationName(rel)), errhint("Try the COPY (SELECT ...) TO variant.")); } > > 2) 2.a) I felt the comment should be "then copy partitioned rel to > destination": > + * rel: the relation to be copied to. > + * root_rel: if not null, then the COPY TO partitioned rel. > + * processed: number of tuples processed. > + */ > +static void > +CopyThisRelTo(CopyToState cstate, Relation rel, Relation root_rel, > uint64 *processed) > +{ > + TupleTableSlot *slot; > I changed it to: +/* + * rel: the relation to be copied to. + * root_rel: if not null, then the COPY partitioned relation to destination. + * processed: number of tuples processed. + */ +static void +CopyThisRelTo(CopyToState cstate, Relation rel, Relation root_rel, + uint64 *processed) > 3) There is a small indentation issue here: > + /* > +* partition's rowtype might differ from the root table's. We must > +* convert it back to the root table's rowtype as we are exporting > +* partitioned table data here. > + */ > + if (root_rel != NULL) > I am not so sure; can you check whether the attached still has the indentation issue? From 4e501b7a8e67cffbf6432bdb43985b21d2b635b8 Mon Sep 17 00:00:00 2001 From: jian he Date: Sun, 30 Mar 2025 21:47:12 +0800 Subject: [PATCH v7 1/1] support COPY partitioned_table TO CREATE TABLE pp (id INT, val int ) PARTITION BY RANGE (id); create table pp_1 (val int, id int); create table pp_2 (val int, id int); ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5); ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10); insert into pp select g, 10 + g from generate_series(1,9) g; copy pp to stdout(header); the above case is much slower (around 25% in some cases) than ``COPY (select * from pp) to stdout(header);``, because of column remapping. But this is still a new feature, since master does not support ``COPY partitioned_table TO`` at all. 
Reviewed-by: vignesh C Reviewed-by: David Rowley Reviewed-by: Melih Mutlu Discussion: https://postgr.es/m/CACJufxEZt+G19Ors3bQUq-42-61__C=y5k2wk=sHEFRusu7=i...@mail.gmail.com Commitfest entry: https://commitfest.postgresql.org/patch/5467/ --- doc/src/sgml/ref/copy.sgml | 8 +- src/backend/commands/copyto.c | 143 ++-- src/test/regress/expected/copy2.out | 16 src/test/regress/sql/copy2.sql | 11 +++ 4 files changed, 147 insertions(+), 31 deletions(-) diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index df093da97c5..f86e0b7ec35 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -521,15 +521,15 @@ COPY count COPY TO can be used only with plain -tables, not views, and does not copy rows from child tables -or child partitions. For example, COPY COPY TO can be used with partitioned tables. +For example, in a table inheritance hierarchy, COPY table TO copies the same rows as SELECT * FROM ONLY table. The syntax COPY (SELECT * FROM table) TO ... can be used to -dump all of the rows in an inheritance hierarchy, partitioned table, -or view. +dump all of the rows in an inheritance hierarchy, or view. diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index 84a3f3879a8..b75bbfea6a4 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c
Re: Non-text mode for pg_dumpall
On 2025-03-29 Sa 1:17 AM, Mahendra Singh Thalor wrote: On Sat, 29 Mar 2025 at 03:50, Andrew Dunstan wrote: On 2025-03-27 Th 5:15 PM, Andrew Dunstan wrote: On 2025-03-19 We 2:41 AM, Mahendra Singh Thalor wrote: On Wed, 12 Mar 2025 at 21:18, Andrew Dunstan wrote: On 2025-03-12 We 3:03 AM, jian he wrote: On Wed, Mar 12, 2025 at 1:06 AM Álvaro Herrera wrote: Hello, On 2025-Mar-11, Mahendra Singh Thalor wrote: In the map.dat file, I tried to fix this issue by adding the number of characters in the dbname, but as per the code comments, as of now we are not supporting \n\r in dbnames, so I removed that handling. I will do some more study to fix this issue. Yeah, I think this is saying that you should not consider the contents of map.dat as a shell string. After all, you're not going to _execute_ that file via the shell. Maybe for map.dat you need to escape such characters somehow, so that they don't appear as literal newlines/carriage returns. I am confused. Currently pg_dumpall plain format will abort when encountering a dbname containing a newline, and the partially dumped plain file does not contain all the cluster's database data. If pg_dumpall non-text format aborts earlier, isn't that aligned with pg_dumpall plain format? It's also an improvement: since it aborts earlier, nothing will be dumped. Am I missing something? I think we should fix that. But for the current proposal, Álvaro and I were talking this morning, and we thought the simplest thing here would be to have the one-line format and escape NL/CRs in the database name. cheers Okay. As per discussions, we will keep a one-line entry for each database in the map file. Thanks all for the feedback and review. Here, I am attaching updated patches for review and testing. These patches can be applied on commit a6524105d20b. I'm working through this patch set with a view to committing it. Attached is some cleanup which is where I got to today, although there is more to do. One thing I am wondering is why not put the SimpleDatabaseOidList stuff in fe_utils/simple_list.{c,h}? That's where all the similar stuff belongs, and it feels strange to have this inline in pg_restore.c. (I also don't like the name much - SimpleOidStringList or maybe SimpleOidPlusStringList might be better.) OK, I have done that, so here is the result. The first two are your original patches. Patch 3 adds the new list type to fe-utils, and patch 4 contains my cleanups and use of the new list type. Apart from some relatively minor cleanup, the one thing I would like to change is how dumps are named. If we are producing tar or custom format dumps, I think the file names should reflect that (oid.dmp and oid.tar rather than a bare oid as the filename), and pg_restore should look for those. I'm going to work on that tomorrow - I don't think it will be terribly difficult. Thanks, Andrew. Here, I am attaching a delta patch for the oid.tar and oid.dmp format. OK, looks good, I have incorporated that. There are a couple of rough edges, though. 
First, I see this: andrew@ub22arm:inst $ bin/pg_restore -C -d postgres --exclude-database=regression_dummy_seclabel --exclude-database=regression_test_extensions --exclude-database=regression_test_pg_dump dest pg_restore: error: could not execute query: "ERROR: role "andrew" already exists " Command was: " -- -- Roles -- CREATE ROLE andrew;" pg_restore: warning: errors ignored on global.dat file restore: 1 pg_restore: error: could not execute query: ERROR: database "template1" already exists Command was: CREATE DATABASE template1 WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C'; pg_restore: warning: errors ignored on database "template1" restore: 1 pg_restore: error: could not execute query: ERROR: database "postgres" already exists Command was: CREATE DATABASE postgres WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C'; pg_restore: warning: errors ignored on database "postgres" restore: 1 pg_restore: warning: errors ignored on restore: 3 It seems pointless to be trying to create the role that we are connected as, and we also expect template1 and postgres to exist. In a similar vein, I don't see why we are setting the --create flag in pg_dumpall for those databases. I'm attaching a patch that is designed to stop that, but it doesn't solve the above issues. I also notice a bunch of these in globals.dat: -- -- Databases -- -- -- Database "template1" dump -- -- -- Database "andrew" dump -- -- -- Database "isolation_regression_brin" dump -- -- -- Database "isolation_regression_delay_execution" dump -- ... The patch also tries to fix this. Lastly, this badly needs some TAP tests written. I'm going to work on reviewing the documentation next. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com From ed53d8b5ad82e49bb56bc5bc48ddb8426fdb4c80 Mon Sep 17 00:00:00 2001 From: Mahendra Singh Thalor Date: Wed, 19 Mar 2025 01:18:46 +0530 Subject: [
Re: Options to control remote transactions’ access/deferrable modes in postgres_fdw
On Thu, Mar 27, 2025 at 1:25 PM Ashutosh Bapat wrote: > On Tue, Mar 25, 2025 at 4:01 PM Etsuro Fujita wrote: > > In the patch I also fixed a bug; I trusted XactReadOnly to see if the > > local transaction is READ ONLY, but I noticed that that is not 100% > > correct, because a transaction which started as READ WRITE can show as > > READ ONLY later within subtransactions, so I modified the patch so > > that postgres_fdw opens remote transactions in READ ONLY mode if the > > local transaction has been declared READ ONLY at the top level. > > Nice catch. postgres_fdw replicates the transaction stack on foreign > server. I think we need to replicate it along with the transaction > properties. And also we need a test which tests readonly > subtransaction behaviour. Ok, will do. Thanks for the comment! Best regards, Etsuro Fujita
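To make the pitfall concrete, here is a hypothetical session (foreign table name ft invented) where trusting XactReadOnly would open the remote transaction in the wrong mode:

BEGIN;                           -- top-level transaction is READ WRITE
SAVEPOINT s;
SET transaction_read_only = on;  -- XactReadOnly is now true
SELECT count(*) FROM ft;         -- remote transaction starts here; opening it
                                 -- READ ONLY based on XactReadOnly would be wrong
ROLLBACK TO s;                   -- the GUC change is rolled back
UPDATE ft SET a = 1;             -- the same remote transaction must now accept a write
COMMIT;

Hence the patch's choice to key off whether the transaction was declared READ ONLY at the top level, rather than the current XactReadOnly value.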
Thinko in pgstat_build_snapshot()
While working on something else I noticed $SUBJECT: we are allocating more memory than necessary and copying more data than necessary because we specify the wrong PgStat_KindInfo member as the size argument for MemoryContextAlloc and memcpy. This could become problematic if snapshotting a very large number of variable-numbered stats, so I fixed it. Attached is a patch for that. Best regards, Etsuro Fujita fix-thinko-in-pgstat_build_snapshot.patch Description: Binary data
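For readers without the attachment at hand, the shape of the fix is roughly this (a sketch of the pgstat_build_snapshot() hunk from memory; the point is that the allocation and copy must be sized by the stats payload, shared_data_len, not by the whole shared entry, shared_size):

- entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
-                                  kind_info->shared_size);
- memcpy(entry->data, pgstat_get_entry_data(kind, stats_data),
-        kind_info->shared_size);
+ entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
+                                  kind_info->shared_data_len);
+ memcpy(entry->data, pgstat_get_entry_data(kind, stats_data),
+        kind_info->shared_data_len);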
Minor edits to README.tuplock, and a question
Please see attached a few minor edits to README.tuplock, which I feel make an improvement over the current version. Reading through that, though, I could not see a functional difference between FOR NO KEY UPDATE and FOR KEY SHARE mode of locks. I understand they are of different strength, exclusive vs. shared, but the way the text (quoted below) describes them, they essentially both achieve the same effect. > SELECT FOR NO > KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal > and modifications which might alter the tuple's key. > SELECT FOR KEY SHARE obtains a shared lock which only > prevents tuple removal and modifications of key fields. Am I missing something? Nevermind. Deciphering the conflict table below it makes clear the need for similar looking locks, but with exclusive vs. shared mode differences. I can't think of an improvement in the two sentences quoted above, but perhaps others can think of something that helps the reader. -- Best regards, Gurjeet http://Gurje.et diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 843c2e58f92..0763fbaa9e7 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -3,7 +3,7 @@ Locking tuples Locking tuples is not as easy as locking tables or other database objects. The problem is that transactions might want to lock large numbers of tuples at -any one time, so it's not possible to keep the locks objects in shared memory. +any one time, so it's not possible to keep the lock objects in shared memory. To work around this limitation, we use a two-level mechanism. The first level is implemented by storing locking information in the tuple header: a tuple is marked as locked by setting the current transaction's XID as its XMAX, and @@ -20,8 +20,8 @@ tuple, potentially leading to indefinite starvation of some waiters. The possibility of share-locking makes the problem much worse --- a steady stream of share-lockers can easily block an exclusive locker forever. To provide more reliable semantics about who gets a tuple-level lock first, we use the -standard lock manager, which implements the second level mentioned above. The -protocol for waiting for a tuple-level lock is really +standard lock manager, which implements the second of the two-level mechanism +mentioned above. The protocol for waiting for a tuple-level lock is really LockTuple() XactLockTableWait() @@ -39,7 +39,7 @@ conflict for a tuple, we don't incur any extra overhead. We make an exception to the above rule for those lockers that already hold some lock on a tuple and attempt to acquire a stronger one on it. In that case, we skip the LockTuple() call even when there are conflicts, provided -that the target tuple is being locked, updated or deleted by multiple sessions +that the target tuple is being locked, updated, or deleted by multiple sessions concurrently. Failing to skip the lock would risk a deadlock, e.g., between a session that was first to record its weaker lock in the tuple header and would be waiting on the LockTuple() call to upgrade to the stronger lock level, and @@ -142,7 +142,7 @@ The following infomask bits are applicable: - HEAP_KEYS_UPDATED This bit lives in t_infomask2. If set, indicates that the operation(s) done - by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE + by the XMAX modify the tuple key, such as a SELECT FOR UPDATE, an UPDATE that modifies the columns of the key, or a DELETE. 
It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.
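To make the distinction in the question above concrete, consider a hypothetical two-session example (table t with a unique index on id, so changing id is a "key" modification):

-- session 1
BEGIN;
SELECT * FROM t WHERE id = 1 FOR KEY SHARE;      -- shared: any number of
                                                 -- KEY SHARE lockers can coexist

-- session 2, concurrently
BEGIN;
SELECT * FROM t WHERE id = 1 FOR NO KEY UPDATE;  -- exclusive, yet compatible
                                                 -- with session 1's lock
UPDATE t SET val = val + 1 WHERE id = 1;         -- non-key update: proceeds
UPDATE t SET id = 2 WHERE id = 1;                -- key update: blocks until
                                                 -- session 1 commits or aborts

So the two modes forbid the same operations by third parties, but differ in which other lockers they coexist with: two FOR NO KEY UPDATE lockers conflict with each other, while any number of FOR KEY SHARE lockers can run side by side.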
The 026_overwrite_contrecord test might fail on extremely slow animals
Hello hackers, A couple of buildfarm failures produced by skink, which runs tests under Valgrind: [1], [2], with the following diagnostics: 31/324 postgresql:recovery / recovery/026_overwrite_contrecord ERROR 325.72s (exit status 255 or signal 127 SIGinvalid) # Initializing node "standby" from backup "backup" of node "primary" ### Enabling streaming replication for node "standby" ### Starting node "standby" # Running: pg_ctl -w -D /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/026_overwrite_contrecord/data/t_026_overwrite_contrecord_standby_data/pgdata -l /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/026_overwrite_contrecord/log/026_overwrite_contrecord_standby.log -o --cluster-name=standby start waiting for server to start stopped waiting pg_ctl: could not start server Examine the log output. # pg_ctl start failed; see logfile for details: /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/026_overwrite_contrecord/log/026_overwrite_contrecord_standby.log # No postmaster PID for node "standby" [14:44:39.051](7.936s) Bail out! pg_ctl start failed 026_overwrite_contrecord_standby.log: 2025-03-14 14:44:38.533 UTC [1558222][startup][:0] LOG: database system was interrupted; last known up at 2025-03-14 14:44:30 UTC 2025-03-14 14:44:38.806 UTC [1558222][startup][:0] LOG: invalid checkpoint record 2025-03-14 14:44:38.808 UTC [1558222][startup][:0] PANIC: could not locate a valid checkpoint record at 0/2094248 2025-03-14 14:44:38.937 UTC [1555866][postmaster][:0] LOG: startup process (PID 1558222) was terminated by signal 6: Aborted 2025-03-14 14:44:38.937 UTC [1555866][postmaster][:0] LOG: aborting startup due to startup process failure 026_overwrite_contrecord_primary.log: 2025-03-14 14:44:30.536 UTC [1553014][client backend][4/2:0] LOG: statement: SELECT pg_walfile_name(pg_current_wal_insert_lsn()) 2025-03-14 14:44:30.621 UTC [1519069][checkpointer][:0] LOG: checkpoint complete: wrote 109 buffers (85.2%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=12.404 s, sync=0.013 s, total=12.574 s; sync files=28, longest=0.004 s, average=0.001 s; distance=8299 kB, estimate=8299 kB; lsn=0/2094248, redo lsn=0/1FA5A48 2025-03-14 14:44:31.119 UTC [1519026][postmaster][:0] LOG: received immediate shutdown request indicates that this test might fail when a checkpoint is performed during the test execution (that is, when the duration of the test up to $node->stop('immediate') exceeds 300 secs). 
This failure can be easily reproduced with: --- a/src/test/recovery/t/026_overwrite_contrecord.pl +++ b/src/test/recovery/t/026_overwrite_contrecord.pl @@ -24,2 +24,3 @@ autovacuum = off wal_keep_size = 1GB +checkpoint_timeout=30s )); @@ -61,2 +62,3 @@ $node->safe_psql('postgres', #$node->safe_psql('postgres', qq{create table foo ()}); +sleep(40); my $endfile = $node->safe_psql('postgres', The last 5 runs of the test on skink on master show the following test durations: 29/331 postgresql:recovery / recovery/026_overwrite_contrecord OK 317.37s 3 subtests passed 29/331 postgresql:recovery / recovery/026_overwrite_contrecord OK 412.78s 3 subtests passed 29/331 postgresql:recovery / recovery/026_overwrite_contrecord OK 448.46s 3 subtests passed 29/331 postgresql:recovery / recovery/026_overwrite_contrecord OK 445.73s 3 subtests passed 29/331 postgresql:recovery / recovery/026_overwrite_contrecord OK 383.50s 3 subtests passed [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-09%2020%3A23%3A09 - REL_17_STABLE [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-14%2013%3A52%3A09 - master Best regards, Alexander Lakhin Neon (https://neon.tech)
Re: SQLFunctionCache and generic plans
Hi > We get substantial wins on all of fx, fx3, fx4. fx2 is the > case that gets inlined and never reaches functions.c, so the > lack of change there is expected. What I found odd is that > I saw a small speedup (~6%) on fx5 and fx6; those functions > are in plpgsql so they really shouldn't change either. > The only thing I can think of is that I made the hash key > computation a tiny bit faster by omitting unused argtypes[] > entries. That does avoid hashing several hundred bytes > typically, but it's hard to believe it'd amount to any > visible savings overall. > > Anyway, PFA v10. > > I can confirm that all tests passed without problems. Regards Pavel > regards, tom lane > > [1] > https://www.postgresql.org/message-id/CAFj8pRDWDeF2cC%2BpCjLHJno7KnK5kdtjYN-f933RHS7UneArFw%40mail.gmail.com > >
Re: Typo in comment for pgstat_database_flush_cb()
On Sun Mar 30, 2025 at 4:39 AM PDT, Heikki Linnakangas wrote: > On 30/03/2025 14:32, Heikki Linnakangas wrote: >> On 30/03/2025 13:28, Etsuro Fujita wrote: >>> Another thing I noticed is $SUBJECT: I think “if lock could not >>> immediately acquired” should be “if lock could not be immediately >>> acquired”. Attached is a patch for that. >> >> Yep. And there are more instances of the same typo in other such >> flush_cb functions, if you search for "immediately acquired". +1 for changing other occurrences of this in other pgstat_*.c files. > While we're at it, the comment feels a bit awkward to me anyway. Maybe > rephrase it to "If nowait is true and the lock could not be immediately > acquired, returns false without flushing the entry. Otherwise returns true." This rephrasing does sound better. Best regards, Gurjeet http://Gurje.et
Re: Typo in comment for pgstat_database_flush_cb()
On 30/03/2025 13:28, Etsuro Fujita wrote: Another thing I noticed is $SUBJECT: I think “if lock could not immediately acquired” should be “if lock could not be immediately acquired”. Attached is a patch for that. Yep. And there are more instances of the same typo in other such flush_cb functions, if you search for "immediately acquired". -- Heikki Linnakangas Neon (https://neon.tech)
Re: Some read stream improvements
On Tue, Mar 18, 2025 at 5:56 AM Andres Freund wrote: > So one thing is that the pin count differs by 1 at the start of the scan. No > idea why. > > I still don't know what drives the difference between freebsd and the rest, > but IIUC the reason this fails is just that we are holding too many buffers > pinned, due to some buffers being pinned outside of read_stream.c. I couldn't reproduce this on my local FreeBSD box, but I think I see one part of the problem: the cursor a few lines up has a stream with a higher distance holding a couple of pins. Not sure how the local buffer pool got into a state that caused that if it isn't doing the same on other machines, but anyway, if I read the test right you intend to pin strictly one page per cursor, so I tried saying so explicitly: - query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM test_temp WHERE ctid >= '( %s, 1)'::tid $q$, cursorname, i); + query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM test_temp WHERE ctid between '(%s, 1)'::tid and '(%s, )'::tid $q$, cursorname, i, i); That passed on the FreeBSD CI task.
Re: AIO v2.5
On Tue, Mar 25, 2025 at 11:58 AM Andres Freund wrote: > > > Another thought on complete_shared running on other backends: I wonder if we > > should push an ErrorContextCallback that adds "CONTEXT: completing I/O of > > other process" or similar, so people wonder less about how "SELECT FROM a" > > led > > to a log message about IO on table "b". > > I've been wondering about that as well, and yes, we probably should. > > I'd add the pid of the backend that started the IO to the message - although > need to check whether we're trying to keep PIDs of other processes from > unprivileged users. > > I think we probably should add a similar, but not equivalent, context in io > workers. Maybe "I/O worker executing I/O on behalf of process %d". I think this has not yet been done. Attached patch is an attempt to add error context for IO completions by another backend when using io_uring and IO processing in general by an IO worker. It seems to work -- that is, running the test_aio tests, you can see the context in the logs. I'm not certain that I did this in the way you both were envisioning, though. - Melanie From 79d7e930de510cad6aef31532b06ba679a72d94a Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sun, 30 Mar 2025 14:03:55 -0400 Subject: [PATCH] Add errcontext for processing I/Os for another backend Push an ErrorContextCallback adding additional detail about the process performing the I/O and the owner of the I/O when those are not the same. For io_method worker, this adds context specifying which process owns the I/O that the I/O worker is processing. For io_method io_uring, this adds context only when a backend is *completing* I/O for another backend. It specifies the pid of the owning process. Discussion: https://postgr.es/m/20250325141120.8e.nmisch%40google.com --- src/backend/storage/aio/aio.c | 1 + src/backend/storage/aio/method_io_uring.c | 31 +++ src/backend/storage/aio/method_worker.c | 29 + 3 files changed, 61 insertions(+) diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c index 43f1e2a7785..3d5cf726e24 100644 --- a/src/backend/storage/aio/aio.c +++ b/src/backend/storage/aio/aio.c @@ -42,6 +42,7 @@ #include "port/atomics.h" #include "storage/aio_internal.h" #include "storage/aio_subsys.h" +#include "storage/proc.h" #include "storage/procnumber.h" #include "utils/guc.h" #include "utils/guc_hooks.h" diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c index 3b299dcf388..244918e1883 100644 --- a/src/backend/storage/aio/method_io_uring.c +++ b/src/backend/storage/aio/method_io_uring.c @@ -303,14 +303,41 @@ pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios) return num_staged_ios; } +static void +pgaio_uring_completion_error_callback(void *arg) +{ + ProcNumber owner; + PGPROC *owner_proc; + int32 owner_pid; + PgAioHandle *ioh = arg; + + if (!ioh) + return; + + /* No need for context if a backend is completing the IO for itself */ + if (ioh->owner_procno == MyProcNumber) + return; + + owner = ioh->owner_procno; + owner_proc = GetPGProcByNumber(owner); + owner_pid = owner_proc->pid; + + errcontext("completing I/O on behalf of process %d", owner_pid); +} + static void pgaio_uring_drain_locked(PgAioUringContext *context) { int ready; int orig_ready; + ErrorContextCallback errcallback = {0}; Assert(LWLockHeldByMeInMode(&context->completion_lock, LW_EXCLUSIVE)); + errcallback.callback = pgaio_uring_completion_error_callback; + errcallback.previous = error_context_stack; + error_context_stack = 
&errcallback; + /* * Don't drain more events than available right now. Otherwise it's * plausible that one backend could get stuck, for a while, receiving CQEs @@ -338,9 +365,11 @@ pgaio_uring_drain_locked(PgAioUringContext *context) PgAioHandle *ioh; ioh = io_uring_cqe_get_data(cqe); + errcallback.arg = ioh; io_uring_cqe_seen(&context->io_uring_ring, cqe); pgaio_io_process_completion(ioh, cqe->res); + errcallback.arg = NULL; } END_CRIT_SECTION(); @@ -349,6 +378,8 @@ pgaio_uring_drain_locked(PgAioUringContext *context) "drained %d/%d, now expecting %d", ncqes, orig_ready, io_uring_cq_ready(&context->io_uring_ring)); } + + error_context_stack = errcallback.previous; } static void diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c index 5ea00d8a89e..52f901ed4c2 100644 --- a/src/backend/storage/aio/method_worker.c +++ b/src/backend/storage/aio/method_worker.c @@ -361,11 +361,33 @@ pgaio_worker_register(void) on_shmem_exit(pgaio_worker_die, 0); } +static void +pgaio_worker_error_callback(void *arg) +{ + ProcNumber owner; + PGPROC *owner_proc; + int32 owner_pid; + PgAioHandle *ioh = arg; + + if (!ioh) + return; + + Assert(ioh->owner_procno != MyProcNumber); + Assert(MyBackendType == B_IO_WORKER); + + owner = ioh->owner_procno
Re: pg_stat_database.checksum_failures vs shared relations
On Sat, Mar 29, 2025 at 7:09 PM Andres Freund wrote: > Hi, > > On 2025-03-29 13:17:44 -0400, Andres Freund wrote: > > On 2025-03-28 13:47:16 -0400, Andres Freund wrote: > > > Attached is a fix for the issue. > > > > I plan to push this fix soon, unless somebody protests... > > And done. > Hi! Sorry to get into this thread a bit late. Just to let you know that now that I'm caught up, I do agree it looks right. And - thanks for handling this! //Magnus
Re: Non-text mode for pg_dumpall
On 2025-03-30 Su 12:50 PM, Andrew Dunstan wrote: On 2025-03-29 Sa 1:17 AM, Mahendra Singh Thalor wrote: On Sat, 29 Mar 2025 at 03:50, Andrew Dunstan wrote: On 2025-03-27 Th 5:15 PM, Andrew Dunstan wrote: On 2025-03-19 We 2:41 AM, Mahendra Singh Thalor wrote: On Wed, 12 Mar 2025 at 21:18, Andrew Dunstan wrote: On 2025-03-12 We 3:03 AM, jian he wrote: On Wed, Mar 12, 2025 at 1:06 AM Álvaro Herrera wrote: Hello, On 2025-Mar-11, Mahendra Singh Thalor wrote: In the map.dat file, I tried to fix this issue by adding the number of characters in the dbname, but as per the code comments, as of now we are not supporting \n\r in dbnames, so I removed that handling. I will do some more study to fix this issue. Yeah, I think this is saying that you should not consider the contents of map.dat as a shell string. After all, you're not going to _execute_ that file via the shell. Maybe for map.dat you need to escape such characters somehow, so that they don't appear as literal newlines/carriage returns. I am confused. Currently pg_dumpall plain format will abort when encountering a dbname containing a newline, and the partially dumped plain file does not contain all the cluster's database data. If pg_dumpall non-text format aborts earlier, isn't that aligned with pg_dumpall plain format? It's also an improvement: since it aborts earlier, nothing will be dumped. Am I missing something? I think we should fix that. But for the current proposal, Álvaro and I were talking this morning, and we thought the simplest thing here would be to have the one-line format and escape NL/CRs in the database name. cheers Okay. As per discussions, we will keep a one-line entry for each database in the map file. Thanks all for the feedback and review. Here, I am attaching updated patches for review and testing. These patches can be applied on commit a6524105d20b. I'm working through this patch set with a view to committing it. Attached is some cleanup which is where I got to today, although there is more to do. One thing I am wondering is why not put the SimpleDatabaseOidList stuff in fe_utils/simple_list.{c,h}? That's where all the similar stuff belongs, and it feels strange to have this inline in pg_restore.c. (I also don't like the name much - SimpleOidStringList or maybe SimpleOidPlusStringList might be better.) OK, I have done that, so here is the result. The first two are your original patches. Patch 3 adds the new list type to fe-utils, and patch 4 contains my cleanups and use of the new list type. Apart from some relatively minor cleanup, the one thing I would like to change is how dumps are named. If we are producing tar or custom format dumps, I think the file names should reflect that (oid.dmp and oid.tar rather than a bare oid as the filename), and pg_restore should look for those. I'm going to work on that tomorrow - I don't think it will be terribly difficult. Thanks, Andrew. Here, I am attaching a delta patch for the oid.tar and oid.dmp format. OK, looks good, I have incorporated that. There are a couple of rough edges, though. 
First, I see this: andrew@ub22arm:inst $ bin/pg_restore -C -d postgres --exclude-database=regression_dummy_seclabel --exclude-database=regression_test_extensions --exclude-database=regression_test_pg_dump dest pg_restore: error: could not execute query: "ERROR: role "andrew" already exists " Command was: " -- -- Roles -- CREATE ROLE andrew;" pg_restore: warning: errors ignored on global.dat file restore: 1 pg_restore: error: could not execute query: ERROR: database "template1" already exists Command was: CREATE DATABASE template1 WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C'; pg_restore: warning: errors ignored on database "template1" restore: 1 pg_restore: error: could not execute query: ERROR: database "postgres" already exists Command was: CREATE DATABASE postgres WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE_PROVIDER = libc LOCALE = 'C'; pg_restore: warning: errors ignored on database "postgres" restore: 1 pg_restore: warning: errors ignored on restore: 3 It seems pointless to be trying to create the role that we are connected as, and we also expect template1 and postgres to exist. In a similar vein, I don't see why we are setting the --create flag in pg_dumpall for those databases. I'm attaching a patch that is designed to stop that, but it doesn't solve the above issues. I also notice a bunch of these in globals.dat: -- -- Databases -- -- -- Database "template1" dump -- -- -- Database "andrew" dump -- -- -- Database "isolation_regression_brin" dump -- -- -- Database "isolation_regression_delay_execution" dump -- ... The patch also tries to fix this. Lastly, this badly needs some TAP tests written. I'm going to work on reviewing the documentation next. I have reworked the documentation some. See attached. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com diff --git a/doc/src/sgml
Re: Commit fest 2025-03
On Mon, 24 Mar 2025 at 10:07, vignesh C wrote: > > On Mon, 17 Mar 2025 at 09:26, vignesh C wrote: > Here's a quick commitfest status report as of today:

status               | start | 10th | 17th | 24th | 31st
---------------------+-------+------+------+------+------
Needs review:        |   198 |  182 |  134 |  120 |  105
Waiting on Author:   |    37 |   35 |   69 |   62 |   59
Ready for Committer: |    33 |   33 |   34 |   34 |   27
Committed:           |    52 |   67 |   75 |   90 |  109
Moved to next CF:    |     7 |    7 |   11 |   14 |   16
Withdrawn:           |    12 |   15 |   15 |   18 |   18
Rejected:            |     1 |    1 |    3 |    3 |    5
RWF:                 |     2 |    2 |    2 |    2 |    4
Total:               |   343 |  343 |  343 |  343 |  343

There are currently 27 patches in the "Ready for Committer" status at [1]. If any committers have bandwidth, please consider reviewing and advancing them. If you have submitted a patch that is in the "Waiting for Author" state, please update it to "Needs Review" as soon as possible. This is where reviewers are most likely to focus their efforts. Additionally, 12 Needs review patches require a rebase due to recent commits at [2]. Patch owners are requested to rebase and submit updated versions. 19 patches have been committed in the last week. [1] - https://commitfest.postgresql.org/52/?status=3 [2] - https://commitfest.postgresql.org/52/?status=1 Regards, Vignesh
Re: Memoize ANTI and SEMI JOIN inner
On 31.03.2025 06:33, David Rowley wrote: On Mon, 31 Mar 2025 at 16:21, Alena Rybakina wrote: However, is it necessary to check that extra->inner_unique must be false for SEMI/ANTI joins here, or am I missing something? It looks a little confusing at this point. If it is necessary, I don't see the reason for it. It was me that worked on unique joins and I see no reason why a SEMI or ANTI join couldn't be marked as unique. The reason they're not today is that the only point of the unique join optimisation is so that during execution, the join nodes could skip to the next outer tuple after matching the current outer to an inner. If the join is unique, then there are no more join partners to find for the current outer after matching it up once. With SEMI and ANTI joins, we skip to the next outer tuple after finding a match anyway, so there's no point in going to the trouble of setting the inner_unique flag. I can't say definitively that we won't find a reason in the future that we should set inner_unique for SEMI/ANTI joins, so I don't follow the need for the Assert. Maybe you're seeing something that I'm not. What do you think will break if both flags are true? Actually, I was mainly confused by the code itself - the check seemed to contradict the explanation. It looked like we were enforcing that inner_unique must be false for SEMI/ANTI joins, even though it's not actually important for those join types. That’s why I originally proposed either adding an Assert or removing this flag from the check altogether, just to make it more explicit. So, I agree with your explanation — by the definition of SEMI and ANTI joins, there's no need to set inner_unique, and also no need to assert against it. These joins skip to the next outer tuple once they find a match (or fail to find one, in the case of ANTI). I updated the diff, keeping only the changes to the code comment. -- Regards, Alena Rybakina Postgres Professional diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index c75408f552b..252cb712943 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -728,14 +728,16 @@ get_memoize_path(PlannerInfo *root, RelOptInfo *innerrel, single_mode = true; /* -* Memoize normally marks cache entries as complete when it runs out of -* tuples to read from its subplan. However, with unique joins, Nested -* Loop will skip to the next outer tuple after finding the first matching -* inner tuple. Another case is a semi or anti join. If number of join -* clauses, pushed to the inner as parameterised filter no less than the -* number of join clauses, that means all the clauses have been pushed to -* the inner and any tuple coming from the inner side will be successfully -* used to build the join result. +* Normally, memoize marks cache entries as complete when it exhausts +* all tuples from its subplan. However, in unique joins, Nested Loop +* will skip to the next outer tuple after finding the first matching +* inner tuple. +* Another case is SEMI or ANTI joins: if the number of join clauses +* pushed to the inner side as a parameterised filter is equal to or +* greater than the total number of join clauses, then all relevant +* join conditions have been applied on the inner side, so any returned +* inner tuple is guaranteed to satisfy the join condition, making +* it safe to memoize. * This means that we may not read the inner side of the * join to completion which leaves no opportunity to mark the cache entry * as complete. 
To work around that, when the join is unique we
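As an illustration of the plan shape this enables, consider a semi-join whose entire join condition is pushed to the inner side as a parameterized filter (hypothetical tables and index; whether the planner actually chooses this plan depends on costs and statistics):

EXPLAIN (COSTS OFF)
SELECT * FROM outer_tab o
WHERE EXISTS (SELECT 1 FROM inner_tab i WHERE i.x = o.x);

-- hoped-for plan shape with the patch applied:
-- Nested Loop Semi Join
--   ->  Seq Scan on outer_tab o
--   ->  Memoize
--         Cache Key: o.x
--         ->  Index Scan using inner_tab_x_idx on inner_tab i
--               Index Cond: (x = o.x)

Since the semi-join stops after the first matching inner tuple anyway, caching at most one inner tuple per distinct o.x value is sufficient and safe.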
Re: general purpose array_sort
I spent some time looking at the v17 patchset. There were some pretty strange things in it --- why were some of the variants of array_sort() marked as volatile, for example? But the two things I'd like to suggest functionality-wise are: * The second argument of the variants with booleans should be defined as true=descending, not true=ascending. It seems a little odd to me for the default of a boolean option not to be "false". Also, then you don't need an inversion between the second and third arguments. I'm not dead set on this but it just seems a little cleaner. * I see that the code is set up to detect an unsortable input type before it takes the fast exit for "no sort required". I think this is poor engineering: we ought to make the fast path as fast as possible. The can't-sort case is so rare in real-world usage that I do not think it matters if the error isn't thrown by every possible call. Besides which, it is inconsistent anyway: consider SELECT array_sort(NULL::xid[]); which will not error because it will never reach the C code. Why's that okay but delivering an answer for "array_sort('{1}'::xid[])" is not? I think "throw error only if we must sort and cannot" is a perfectly fine definition. At the code level, I didn't like the way that the multiple entry points were set up. I think it's generally cleaner code to have a worker function with plain C call and return coding and make all the SQL-visible functions be wrappers around that. Also the caching mechanism was overcomplicated, in particular because we do not need a cache lookup to know which sort operators apply to arrays. So all that leads me to v18 attached. (I merged the two patches into one, didn't see much value in splitting them.) In v18, it's somewhat annoying that the typcache doesn't cache the typarray field; we would not need a separate get_array_type() lookup if it did. I doubt there is any real reason for that except that pg_type.typarray didn't exist when the typcache was invented. So I'm tempted to add it. But I looked at existing callers of get_array_type() and none of them are adjacent to typcache lookups, so only array_sort would be helped immediately. I left it alone for the moment; wonder if anyone else has an opinion? regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5bf6656deca..2129d027398 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20741,6 +20741,42 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sort + +array_sort ( + array anyarray + , descending boolean + , nulls_first boolean + ) +anyarray + + +Sorts the first dimension of the array. +The sort order is determined by the default sort ordering of the +array's element type; however, if the element type is collatable, +the collation to use can be forced by adding +a COLLATE clause to +the array argument. + + +If descending is true then sort in +descending order, otherwise ascending order. If omitted, the +default is ascending order. +If nulls_first is true then nulls appear +before non-null values, otherwise nulls appear after non-null +values. +If omitted, nulls_first is taken to have +the same value as descending. 
+ + +array_sort(ARRAY[[2,4],[2,1],[6,5]]) +{{2,1},{2,4},{6,5}} + + + diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index 2aae2f8ed93..2a8ea974029 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -12,16 +12,19 @@ */ #include "postgres.h" +#include "catalog/pg_operator_d.h" #include "catalog/pg_type.h" #include "common/int.h" #include "common/pg_prng.h" #include "libpq/pqformat.h" +#include "miscadmin.h" #include "nodes/supportnodes.h" #include "port/pg_bitutils.h" #include "utils/array.h" #include "utils/builtins.h" #include "utils/datum.h" #include "utils/lsyscache.h" +#include "utils/tuplesort.h" #include "utils/typcache.h" /* @@ -43,6 +46,18 @@ typedef struct DeserialIOData Oid typioparam; } DeserialIOData; +/* + * ArraySortCachedInfo + * Used for caching catalog data in array_sort + */ +typedef struct ArraySortCachedInfo +{ + ArrayMetaState array_meta; /* metadata for array_create_iterator */ + Oid elem_lt_opr; /* "<" operator for element type */ + Oid elem_gt_opr; /* ">" operator for element type */ + Oid array_type; /* pg_type OID of array type */ +} ArraySortCachedInfo; + static Datum array_position_common(FunctionCallInfo fcinfo); @@ -1858,3 +1873,171 @@ array_reverse(PG_FUNCTION_ARGS) PG_RETURN_ARRAYTYPE_P(result); } + +/* + * array_sort + * + * Sorts the first dimension of the a
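Assuming the true=descending semantics argued for above (still under discussion), usage would look like:

SELECT array_sort(ARRAY[3,1,2]);                  -- {1,2,3}
SELECT array_sort(ARRAY[3,1,2], true);            -- {3,2,1}
SELECT array_sort(ARRAY[3,NULL,1], true, false);  -- {3,1,NULL}: descending, nulls last
SELECT array_sort(ARRAY['b','A'] COLLATE "C");    -- collation forced via COLLATE

With the two-argument form, nulls_first defaults to the same value as descending, matching the usual convention that NULLs sort as if larger than any non-null value.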
Re: AIO v2.5
Hi, On 2025-03-27 10:52:10 +1300, Thomas Munro wrote: > On Thu, Mar 27, 2025 at 10:41 AM Andres Freund wrote: > > > > Subject: [PATCH v2.12 13/28] Enable IO concurrency on all systems > > > > > > Consider also updating this comment to stop focusing on prefetch; I think > > > changing that aligns with the patch's other changes: > > > > > > /* > > > * How many buffers PrefetchBuffer callers should try to stay ahead of > > > their > > > * ReadBuffer calls by. Zero means "never prefetch". This value is only > > > used > > > * for buffers not belonging to tablespaces that have their > > > * effective_io_concurrency parameter set. > > > */ > > > int effective_io_concurrency = > > > DEFAULT_EFFECTIVE_IO_CONCURRENCY; > > > > Good point. Although I suspect it might be worth adjusting this, and also > > the > > config.sgml bit about effective_io_concurrency separately. That seems like > > it > > might take an iteration or two. > > +1 for rewriting that separately from this work on the code (I can > have a crack at that if you want). You taking a crack at that would be appreciated! > For the comment, my suggestion would be something like: > > "Default limit on the level of concurrency that each I/O stream > (currently, ReadStream but in future other kinds of streams) can use. > Zero means that I/O is always performed synchronously, ie not > concurrently with query execution. This value can be overridden at the > tablespace level with the parameter of the same name. Note that > streams performing I/O not classified as single-session work respect > maintenance_io_concurrency instead." Generally sounds good. I do wonder if the last sentence could be made a bit simpler; it took me a few seconds to parse "not classified as single-session". Maybe "classified as performing work for multiple sessions respect maintenance_io_concurrency instead."? Greetings, Andres Freund
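Combining Thomas's text with that simplification, the comment might end up roughly like this (a sketch; final wordsmithing pending):

/*
 * Default limit on the level of concurrency that each I/O stream
 * (currently, ReadStream but in future other kinds of streams) can use.
 * Zero means that I/O is always performed synchronously, i.e. not
 * concurrently with query execution.  This value can be overridden at the
 * tablespace level with the parameter of the same name.  Note that streams
 * classified as performing work for multiple sessions respect
 * maintenance_io_concurrency instead.
 */
int effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;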
Re: AIO v2.5
Hi, On 2025-03-29 14:29:29 -0700, Noah Misch wrote: > > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control > > ignore_checksum_failure > > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers() > > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO > > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional > > batchmode > > support > > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as > > wait time > > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems > > Ready for commit I've pushed most of these after some very light further editing. Yay. Thanks a lot for all the reviews! So far the buildfarm hasn't been complaining, but it's early days. I didn't yet push > > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level > > design because I want to integrate some language that could be referenced by smgrstartreadv() (and more in the future), as we have been talking about. Tomorrow I'll work on sending out a new version with the remaining patches. I plan for that version to have: - pg_aios view with the security checks (already done, trivial) - a commit with updated language for smgrstartreadv(), probably referencing aio's README - a change to mdreadv() around zero_damaged_pages, as Noah and I have been discussing - updated tests, with the FIXMEs etc addressed - a reviewed version of the errcontext callback patch that Melanie sent earlier today Todo beyond that: - The comment and docs updates we've been discussing in https://postgr.es/m/5fc6r4smanncsmqw7ib6s3uj6eoiqoioxbd5ibmhtqimcggtoe%40fyrok2gozsoq - I think a good long search through the docs is in order, there probably are other things that should be updated, beyond concrete references to effective_io_concurrency etc. - Whether we should do something, and if so what, about BAS_BULKREAD for 18. Thomas may have some thoughts / code. - Whether there's anything committable around Jelte's RLIMIT_NOFILE changes. Greetings, Andres Freund
Improve coments on structures in trigger.c
Hi, I found that trigger.c describes the fields of three different structures in a single comment block. I feel this is confusing because some structure names are not explicitly mentioned there even though common field names are used between the structures. I've attached a patch to improve the comments by splitting them into three blocks. Regards, Yugo Nagata -- Yugo Nagata diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index c9f61130c69..699454ff277 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -3771,55 +3771,15 @@ typedef struct AfterTriggerEventList * * query_stack[query_depth] is the per-query-level data, including these fields: * - * events is a list of AFTER trigger events queued by the current query. - * None of these are valid until the matching AfterTriggerEndQuery call - * occurs. At that point we fire immediate-mode triggers, and append any - * deferred events to the main events list. - * - * fdw_tuplestore is a tuplestore containing the foreign-table tuples - * needed by events queued by the current query. (Note: we use just one - * tuplestore even though more than one foreign table might be involved. - * This is okay because tuplestores don't really care what's in the tuples - * they store; but it's possible that someday it'd break.) - * - * tables is a List of AfterTriggersTableData structs for target tables - * of the current query (see below). - * * maxquerydepth is just the allocated length of query_stack. * * trans_stack holds per-subtransaction data, including these fields: * - * state is NULL or a pointer to a saved copy of the SET CONSTRAINTS - * state data. Each subtransaction level that modifies that state first - * saves a copy, which we use to restore the state if we abort. - * - * events is a copy of the events head/tail pointers, - * which we use to restore those values during subtransaction abort. - * - * query_depth is the subtransaction-start-time value of query_depth, - * which we similarly use to clean up at subtransaction abort. - * - * firing_counter is the subtransaction-start-time value of firing_counter. - * We use this to recognize which deferred triggers were fired (or marked - * for firing) within an aborted subtransaction. - * * We use GetCurrentTransactionNestLevel() to determine the correct array * index in trans_stack. maxtransdepth is the number of allocated entries in * trans_stack. (By not keeping our own stack pointer, we can avoid trouble * in cases where errors during subxact abort cause multiple invocations * of AfterTriggerEndSubXact() at the same nesting depth.) - * - * We create an AfterTriggersTableData struct for each target table of the - * current query, and each operation mode (INSERT/UPDATE/DELETE), that has - * either transition tables or statement-level triggers. This is used to - * hold the relevant transition tables, as well as info tracking whether - * we already queued the statement triggers. (We use that info to prevent - * firing the same statement triggers more than once per statement, or really - * once per transition table set.) These structs, along with the transition - * table tuplestores, live in the (sub)transaction's CurTransactionContext. - * That's sufficient lifespan because we don't allow transition tables to be - * used by deferrable triggers, so they only need to survive until - * AfterTriggerEndQuery. 
*/ typedef struct AfterTriggersQueryData AfterTriggersQueryData; typedef struct AfterTriggersTransData AfterTriggersTransData; @@ -3842,6 +3802,23 @@ typedef struct AfterTriggersData int maxtransdepth; /* allocated len of above array */ } AfterTriggersData; +/* + * AfterTriggersQueryData has the following fields: + * + * events is a list of AFTER trigger events queued by the current query. + * None of these are valid until the matching AfterTriggerEndQuery call + * occurs. At that point we fire immediate-mode triggers, and append any + * deferred events to the main events list. + * + * fdw_tuplestore is a tuplestore containing the foreign-table tuples + * needed by events queued by the current query. (Note: we use just one + * tuplestore even though more than one foreign table might be involved. + * This is okay because tuplestores don't really care what's in the tuples + * they store; but it's possible that someday it'd break.) + * + * tables is a List of AfterTriggersTableData structs for target tables + * of the current query (see below). + */ struct AfterTriggersQueryData { AfterTriggerEventList events; /* events pending from this query */ @@ -3849,6 +3826,23 @@ struct AfterTriggersQueryData List *tables; /* list of AfterTriggersTableData, see below */ }; +/* + * AfterTriggersTransData has the following fields: + * + * state is NULL or a pointer to a saved copy of the SET CONSTRAINTS + * state data. Each subtransaction level that modifies that state first + * saves a copy, which we use to restore th
Re: Proposal: Progressive explain
On 3/31/25 02:23, torikoshia wrote: On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro wrote: Implemented this version. New patch has the following characteristics: I haven't looked into the code yet, but when I ran the commands below during make installcheck, there was an error and an assertion failure: =# select * from pg_stat_progress_explain; =# \watch 0.1 Yeah, that's to be expected. I think many corner cases may be found: a hash table in the middle of being filled, open file descriptors, an inconsistent combination of variables, 'not yet visited' subtrees - who knows what else? So, last time I ended up concluding that reusing the EXPLAIN code in the middle of execution is a bit dangerous; it is enough to expose only basic data - rows, numbers and timings - which seems safe to gather. -- regards, Andrei Lepikhov
Re: Replace IN VALUES with ANY in WHERE clauses during optimization
Hi, Alena! On Sat, Mar 29, 2025 at 9:03 PM Alena Rybakina wrote: > On 29.03.2025 14:03, Alexander Korotkov wrote: >> One thing I have to fix: we must do >> IncrementVarSublevelsUp() unconditionally for all expressions as Vars >> could be deeper inside. > > Yes, I'm looking at it too; I've just understood that it was needed for > subqueries - they can contain Var elements which need the sublevel > parameter decreased. > > For example, for the query: > > EXPLAIN (COSTS OFF) > SELECT ten FROM onek t > WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c > WHERE c.unique2 = t.unique1))::integer)); > > We are interested in this element: ((2 IN (SELECT unique2 FROM onek c WHERE > c.unique2 = t.unique1)) > > It is a FuncExpr object with a RangeTblEntry variable. I highlighted > > WARNING: 1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 1 :funccollid 0 :inputcollid 0 :args ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM :paramkind 2 :paramid 1 :paramtype 23 :paramtypmod -1 :paramcollid 0 :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasTargetSRFs false :hasSubLinks false :hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :hasGroupRTE false :isReturn false :cteList <> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c :colnames <>} :eref {ALIAS :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" "hundred" "thousand" "twothousand" "fivethous" "tenthous" "odd" "even" "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true :relkind r :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl true :securityQuals <>}) :rteperminfos ({RTEPERMISSIONINFO :relid 32795 :inh true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9) :insertedCols (b) :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) :quals {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 2 :varreturningtype 0 :varnosyn 1 :varattnosyn 1 :location -1}) :location -1}} :mergeActionList <> :mergeTargetRelation 0 :mergeJoinCondition <> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 :resname unique2 :ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk false}) :override 0 :onConflict <> :returningOldAlias <> :returningNewAlias <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <> :limitCount <> :limitOption 0 :rowMarks <> :setOperations <> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} :location -1}) :location -1} > > > I highlighted in bold
the var we need - since the subquery inside the IN > expression will be flattened, all elements contained in it should have > their level number decreased by one, since they will then belong to the > subtree located above it. Because of that condition, this did not happen. > > I generally agree with you that it is better to remove that condition. The > function IncrementVarSublevelsUp essentially goes through the structures > below and will decrease the level of only the vars for which this needs to be > done, and the condition with 1 will protect us from touching those vars that > should not. So the varlevelsup for this var should be 1. > > I am currently investigating whether this transformation is correct for all > cases; I have not found any problems yet. Thank you for your feedback. I appreciate that you're also looking for potential problems. One thing to highlight: doing IncrementVarSublevelsUp() unconditionally is required not just for subqueries. Consider the following example. SELECT * FROM t WHERE val1 IN (VALUES (val2), (val2 + 1)); The second value contains a Var, which needs IncrementVarSublevelsUp(), but the top node is an OpExpr. -- Regards, Alexander Korotkov Supabase
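For clarity, the intended rewrite of that example is roughly:

SELECT * FROM t WHERE val1 = ANY (ARRAY[val2, val2 + 1]);

After the transformation, both array elements sit at the same query level as val1, which is why every Var extracted from the VALUES list, however deeply nested inside an expression, needs its varlevelsup decremented.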
Typo in comment for pgstat_database_flush_cb()
Another thing I noticed is $SUBJECT: I think “if lock could not immediately acquired” should be “if lock could not be immediately acquired”. Attached is a patch for that. Best regards, Etsuro Fujita fix-typo-in-comment.patch Description: Binary data
Re: Memoize ANTI and SEMI JOIN inner
Hi! On 21.03.2025 18:56, Andrei Lepikhov wrote: On 20/3/2025 07:02, David Rowley wrote: On Thu, 20 Mar 2025 at 06:16, Andrei Lepikhov wrote: How can we be sure that semi or anti-join needs only one tuple? I think it would be enough to control the absence of join clauses and filters in the join. Unfortunately, we only have such a guarantee in the plan creation stage (maybe even setrefs.c). So, it seems we need to invent an approach like AlternativeSubplan. I suggest looking at what 9e215378d did. You might be able to also allow semi and anti-joins providing the cache keys cover the entire join condition. I think this might be safe as Nested Loop will only ask its inner subnode for the first match before skipping to the next outer row and with anti-join, there's no reason to look for additional rows after the first. Semi-join and unique joins do the same thing in nodeNestloop.c. To save doing additional checks at run-time, the code does: Thank you for the clue! I almost took the wrong direction. I have attached the new patch, which includes corrected comments for better clarification of the changes, as well as some additional tests. I now feel much more confident about this version since I have resolved that concern. I reviewed your patch and made a couple of suggestions. The first change is related to your comment (and the one before it). I fixed some grammar issues and simplified the wording to make it clearer and easier to understand. The second change involves adding an Assert when generating the Memoize path. Based on the existing comment and the surrounding logic (shown below), I believe it's worth asserting that both inner_unique and single_mode are not true at the same time — just as a safety check. /* * We may do this for SEMI or ANTI joins when they need only one tuple from * the inner side to produce the result. Following if condition checks that * rule. * * XXX Currently we don't attempt to mark SEMI/ANTI joins as inner_unique * = true. Should we? See add_paths_to_joinrel() */ if (!extra->inner_unique && (jointype == JOIN_SEMI || jointype == JOIN_ANTI)) single_mode = true; -- Regards, Alena Rybakina Postgres Professional diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index c75408f552b..252cb712943 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -728,14 +728,16 @@ get_memoize_path(PlannerInfo *root, RelOptInfo *innerrel, single_mode = true; /* - * Memoize normally marks cache entries as complete when it runs out of - * tuples to read from its subplan. However, with unique joins, Nested - * Loop will skip to the next outer tuple after finding the first matching - * inner tuple. Another case is a semi or anti join. If number of join - * clauses, pushed to the inner as parameterised filter no less than the - * number of join clauses, that means all the clauses have been pushed to - * the inner and any tuple coming from the inner side will be successfully - * used to build the join result. + * Normally, memoize marks cache entries as complete when it exhausts + * all tuples from its subplan. However, in unique joins, Nested Loop + * will skip to the next outer tuple after finding the first matching + * inner tuple. + * Another case is a SEMI or ANTI joins. If the number of join clauses, + * pushed to the inner as parameterised filter is equal to or greater + * than the total number of join clauses.
This implies that all relevant + * join conditions have been applied on the inner side, so any returned + * inner tuple will be guaranteed to satisfy the join condition, making + * it safe to memoize. * This means that we may not read the inner side of the * join to completion which leaves no opportunity to mark the cache entry * as complete. To work around that, when the join is unique we @@ -808,6 +810,8 @@ get_memoize_path(PlannerInfo *root, RelOptInfo *innerrel, &hash_operators, &binary_mode)) { + Assert(!(extra->inner_unique && single_mode)); + return (Path *) create_memoize_path(root, innerrel, inner_path,
Re: Memoize ANTI and SEMI JOIN inner
On Mon, 31 Mar 2025 at 16:21, Alena Rybakina wrote: > However, is it necessary to check that extra->inner_unique must be false for > SEMI/ANTI joins here, or am I missing something? It looks a little confusing > at this point. If it is necessary, I don't see the reason for it. It was me that worked on unique joins and I see no reason why a SEMI or ANTI join couldn't be marked as unique. The reason they're not today is that the only point of the unique join optimisation is so that during execution, the join nodes could skip to the next outer tuple after matching the current outer to an inner. If the join is unique, then there are no more join partners to find for the current outer after matching it up once. With SEMI and ANTI joins, we skip to the next outer tuple after finding a match anyway, so there's no point in going to the trouble of setting the inner_unique flag. I can't say definitively that we won't find a reason in the future that we should set inner_unique for SEMI/ANTI joins, so I don't follow the need for the Assert. Maybe you're seeing something that I'm not. What do you think will break if both flags are true? David
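To ground the discussion, a minimal sketch of the query shape involved (hypothetical tables; whether the planner actually chooses a Nested Loop with a Memoize node here depends on statistics and costs):

CREATE TABLE outer_t (a int);
CREATE TABLE inner_t (a int);
INSERT INTO outer_t SELECT g % 100 FROM generate_series(1, 10000) g;
INSERT INTO inner_t SELECT g FROM generate_series(1, 1000) g;
ANALYZE outer_t, inner_t;

-- A semi-join: the executor skips to the next outer row after the first
-- match, so the inner side is never read to completion. The patch under
-- discussion lets Memoize mark such cache entries complete anyway when
-- the cache keys cover the entire join condition.
EXPLAIN (COSTS OFF)
SELECT * FROM outer_t o
WHERE EXISTS (SELECT 1 FROM inner_t i WHERE i.a = o.a);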
Re: Change log level for notifying hot standby is waiting non-overflowed snapshot
On 2025/03/28 0:13, torikoshia wrote: On 2025-03-27 11:06, torikoshia wrote: Hi, I had another off-list discussion with Fujii-san, and according to the following manual[1], it seems that a transaction with an overflowed subtransaction is already considered inconsistent: Reaching a consistent state can also be delayed in the presence of both of these conditions: - A write transaction has more than 64 subtransactions - Very long-lived write transactions IIUC, the manual suggests that both conditions must be met -- recovery reaching at least minRecoveryPoint and no overflowed subtransactions -- for the standby to be considered consistent. OTOH, the following log message is emitted even when subtransactions have overflowed, which appears to contradict the definition of consistency mentioned above: LOG: consistent recovery state reached This log message is triggered when recovery progresses beyond minRecoveryPoint (according to CheckRecoveryConsistency()). However, since this state does not satisfy 'consistency' defined in the manual, I think it would be more accurate to log that it has merely reached the "minimum recovery point". Furthermore, it may be better to emit the above log message only when recovery has progressed beyond minRecoveryPoint and there are no overflowed subtransactions. The attached patch does this. Additionally, renaming variables such as reachedConsistency in CheckRecoveryConsistency might also be appropriate. However, in the attached patch, I have left them unchanged for now. On 2025-03-25 00:55, Fujii Masao wrote: - case CAC_NOTCONSISTENT: + case CAC_NOTCONSISTENT_OR_OVERFLOWED: This new name seems a bit too long. I'm OK to leave the name as it is. Or, something like CAC_NOTHOTSTANDBY seems simpler and better to me. Beyond just the length issue, given the understanding outlined above, I now think CAC_NOTCONSISTENT does not actually need to be changed. In high-availability.sgml, the "Administrator's Overview" section already describes the conditions for accepting hot standby connections. This section should also be updated accordingly. Agreed. I have updated this section to mention that the resolution is to close the problematic transaction. OTOH, the changes made in the v2 patch seem unnecessary, since the concept of 'consistent' is already explained in the "Administrator's Overview." - errdetail("Consistent recovery state has not been yet reached."))); + errdetail("Consistent recovery state has not been yet reached, or snappshot is pending because subtransaction is overflowed."), Given the above understanding, "or" is not appropriate in this context, so I left this message unchanged. Instead, I have added an errhint. The phrasing in the hint message aligns with the manual, allowing users to search for this hint and find the newly added resolution instructions. On second thought, it may not be appropriate to show this output to clients attempting to connect. This message should be directed not to clients but to administrators. From this point of view, it'd be better to output a message indicating the status inside ProcArrayApplyRecoveryInfo(). However, a straightforward implementation would result in the same message being logged every time an XLOG_RUNNING_XACTS WAL is received, making it noisy. Instead of directly outputting a log indicating that a hot standby connection cannot be established due to subtransaction overflow, the attached patch updates the manual so that administrators can determine whether a subtransaction overflow has occurred based on the modified log output.
What do you think? I had the same thought during our off-list discussion. However, after reviewing the recovery code - such as recoveryStopsBefore(), which checks whether a consistent state is reached - I now believe the manual’s definition of a consistent state may be incorrect. A consistent state should be defined as the point where recovery has reached minRecoveryPoint. If we were to change the definition to match the manual, we would also need to update various recovery checks, which wouldn't be a trivial task. Given that, I now think it's better to revive your v1 patch, which introduces a new postmaster signal and improves the error message when connections are not accepted during hot standby. I've attached a revised version of the patch based on your v1. Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION From a21c5fa943c2e372bf6b77fcbb32345751fb7a84 Mon Sep 17 00:00:00 2001 From: Fujii Masao Date: Fri, 28 Mar 2025 01:40:52 +0900 Subject: [PATCH v5] Improve error message when standby does accept connections. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Even after reaching the minimum recovery point, if there are long-lived write transactions with
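For reference, a quick psql sketch of how such an overflowed transaction can arise on the primary (each SAVEPOINT opens a subtransaction; more than 64 of them, PGPROC_MAX_CACHED_SUBXIDS, overflows the subxid cache):

BEGIN;
-- generate 70 savepoints; \gexec runs each row of the result as a command
SELECT format('SAVEPOINT s%s', g) FROM generate_series(1, 70) g \gexec
-- Leave the transaction open. While it stays open, a standby replaying
-- the corresponding WAL may keep refusing hot-standby connections even
-- after reaching the minimum recovery point.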
Re: [PoC] Reducing planning time when tables have many partitions
Hello Ashutosh, Thank you for your detailed review, and apologies for my delayed response. On Thu, Mar 27, 2025 at 1:42 PM Ashutosh Bapat wrote: > > Here are memory consumption numbers using list_free() instead of pfree(), > using the same method as [1], using a binary without asserts and debug info. > PFA the patchset where all the patches are the same as v35 but with an extra > patch fixing memory leak. The memory leak is visible with a higher number of > joins. At a lower number of joins, I expect that the memory saved is less > than a KB or the leaked memory fits within 1 chunk of memory context and > hence not visible. Thank you for conducting your benchmarks. Your results clearly show increased memory consumption with my patches. As Tom also suggested, we may reduce memory usage by adopting a different design. I will reconsider alternative approaches and compare the memory usage to the current version. Thank you once again for conducting the benchmarks. -- Best regards, Yuya Watari
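As a rough illustration of the problem space (not the benchmark used in this thread), planning time can be observed directly once a table has many partitions:

CREATE TABLE pt (id int, v int) PARTITION BY RANGE (id);
-- create 1000 partitions via \gexec in psql
SELECT format('CREATE TABLE pt_%s PARTITION OF pt FOR VALUES FROM (%s) TO (%s)',
              i, i, i + 1)
FROM generate_series(0, 999) i \gexec
-- EXPLAIN's SUMMARY option reports "Planning Time" without executing:
EXPLAIN (SUMMARY) SELECT * FROM pt WHERE id = 42;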
Re: Thinko in pgstat_build_snapshot()
On Sun, Mar 30, 2025 at 4:31 AM Heikki Linnakangas wrote: > > On 30/03/2025 13:23, Etsuro Fujita wrote: > > While working on something else I noticed $SUBJECT: we are allocating > > more memory than necessary and copying more data than necessary > > because we specify the wrong PgStat_KindInfo member as the size > > argument for MemoryContextAlloc and memcpy. This could become > > problematic if snapshotting a very large number of variables stats, so > > I fixed it. Attached is a patch for that. > > Good catch. Patch looks good to me at quick glance. +1 on both counts. Best regards, Gurjeet http://Gurje.et
Re: Memoize ANTI and SEMI JOIN inner
On 31.03.2025 06:04, David Rowley wrote: On Mon, 31 Mar 2025 at 15:33, Alena Rybakina wrote: I believe it's worth asserting that both inner_unique and single_mode are not true at the same time — just as a safety check. add_paths_to_joinrel() just chooses not to populate inner_unique for SEMI and ANTI joins because, as of today's master, it's pretty pointless to determine that because the executor will short-circuit and skip to the next outer tuple for those join types anyway. I don't follow why having both these flags set would cause trouble. It seems perfectly legitimate that add_paths_to_joinrel() could choose to set the inner_unique flag for these join types, and if it did, the Assert you're proposing would fail for no good reason. I tend to agree with you that someone might set this flag to true for these join types in the future. However, is it necessary to check that extra->inner_unique must be false for SEMI/ANTI joins here, or am I missing something? It looks a little confusing at this point. if (!extra->inner_unique && (jointype == JOIN_SEMI || jointype == JOIN_ANTI)) single_mode = true; -- Regards, Alena Rybakina Postgres Professional
Re: Improve monitoring of shared memory allocations
Hi Tomas, > > Right. I'm still not convinced if this makes any difference, or whether > this alignment was merely a consequence of using ShmemAlloc(). I don't > want to make this harder to understand unnecessarily. > Yeah, it makes sense. > Let's keep this simple - without additional alignment. I'll think about > it a bit more, and maybe add it before commit. > OK. > > > > I will improve the comment in the next version. > > > > OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not > really used except for this bit: > > +if (init_size > nelem_alloc) > +element_alloc = false; > > Can't we determine before calling the function, to make it a bit less > confusing? > Yes, we could determine whether the pre-allocated elements are zero before calling the function; I have fixed it accordingly in the attached 0001 patch. Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've passed this information as a boolean variable, initial_elems. If it is false, no elements are pre-allocated. Please find attached the v7-series, which incorporates your review patches and addresses a few remaining comments. Thank you, Rahila Syed v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch Description: Binary data v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patch Description: Binary data v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patch Description: Binary data
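For context, the accounting this patch series tightens up is what surfaces in the pg_shmem_allocations view; a quick way to eyeball it (exact rows vary by version and configuration):

SELECT name, allocated_size
FROM pg_shmem_allocations
ORDER BY allocated_size DESC
LIMIT 10;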
Re: Memoize ANTI and SEMI JOIN inner
On Mon, 31 Mar 2025 at 15:33, Alena Rybakina wrote: > I believe it's worth asserting that both inner_unique and single_mode are not > true at the same time — just as a safety check. add_paths_to_joinrel() just chooses not to populate inner_unique for SEMI and ANTI joins because, as of today's master, it's pretty pointless to determine that because the executor will short-circuit and skip to the next outer tuple for those join types anyway. I don't follow why having both these flags set would cause trouble. It seems perfectly legitimate that add_paths_to_joinrel() could choose to set the inner_unique flag for these join types, and if it did, the Assert you're proposing would fail for no good reason. David