Re: Vacuum statistics

2025-01-13 Thread Alena Rybakina

On 10.01.2025 17:51, Alena Rybakina wrote:


Hi! I thought about this problem again and I think I have a solution.

On 02.12.2024 23:12, Alena Rybakina wrote:

* I've read the previous discussion on how important it is to keep all these
fields regarding vacuum statistics, including the points by Andrei and Jim.
It still worries me that the statistics volume is going to grow about
threefold, but I don't have a particular proposal on how to make the
approach more granular.  I wonder if you could propose something.
We can collect statistics on databases at all times - there are fewer of 
them compared to per-relation vacuum statistics, but they can give enough 
information to hint that something is going wrong.
With the track_vacuum_statistics GUC we can cover the cases that need 
extended, complete information: when it is enabled, we will collect 
vacuum statistics on relations, both heaps and indexes.
This will not lead to inconsistency between the always-collected database 
statistics and the optional relation statistics, since our vacuum 
statistics are cumulative and it is assumed that we will look at 
changes in statistics over a certain period.
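
For illustration, the configuration side of this would be just the following
(a sketch only; the GUC name is as above, its default and context are per the
patch):

    # postgresql.conf
    track_vacuum_statistics = on   # additionally collect per-relation (heap and
                                   # index) vacuum statistics; database-level
                                   # vacuum statistics are collected regardless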

What do you think?


I implemented this in my latest patch version [0].

[0] 
https://www.postgresql.org/message-id/1e81a0a1-a63b-48fb-905a-d6495f89ab73%40postgrespro.ru


--
Regards,
Alena Rybakina
Postgres Professional


Re: Reorder shutdown sequence, to flush pgstats later

2025-01-13 Thread Bertrand Drouvot
Hi,

On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:
> However, I think 0007 needs a bit more work.
> 
> One thing that worries me a bit is that using SIGINT for triggering the
> shutdown checkpoint could somehow be unintentionally triggered? Previously
> checkpointer would effectively have ignored that,

Right.

> but now it'd bring the whole
> cluster down (because postmaster would see an unexpected ShutdownXLOG()). But
> I can't really see a legitimate reason for checkpointer to get spurious
> SIGINTs.

Yeah, apart from human error on the OS side, I don't see any reason either.

> This made me think about what happens if a spurious SIGINT is sent - and in
> the last version the answer wasn't great, namely everything seemed to go on
> normal, except that the cluster couldn't be shut down normally.

Yeah, and I think that checkpoints would hang.

> There wasn't
> any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
> fatal.  I think this needs to be treated the same way we treat not being able
> to fork checkpointer during a shutdown.  Now receiving
> PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
> FatalError processing after logging "WAL was shut down unexpectedly".

I wonder if we could first ask the postmaster for confirmation that the SIGINT is
coming from it (and if not, simply ignore the SIGINT)? I thought about a
shared memory flag, but the postmaster has to be kept away from shared memory
operations, so that's not a valid idea. Another idea could be that the
checkpointer sends a new "confirm" request to the postmaster first. But then I'm
not sure how the postmaster could reply back (given the signals that are already
being used), and that might overcomplicate things.

So yeah, might be better to be on the safe side of things and "simply" enter
FatalError.

> We have
> multiple copies of code to go to FatalError, that doesn't seem great.

+  * FIXME: This should probably not have a copy of the code to
+  * transition into FatalError state.
+  */
+  ereport(LOG,
+  (errmsg("WAL was shut down unexpectedly")));

Indeed, might be worth extracting this into a helper function or such.
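
Something along these lines, pieced together only from the fragments quoted in
this thread (the name and the exact set of steps are assumptions, not the
actual patch):

static void
HandleFatalErrorTransition(void)
{
	/* set FatalError so that an "abnormal shutdown" message gets logged */
	FatalError = true;
	UpdatePMState(PM_WAIT_DEAD_END);
	ConfigurePostmasterWaitSet(false);
	/* stop any dead-end children and stop creating new ones */
	SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
}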

> Another thing I started worrying about is that the commit added
> PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
> where we directly jump to PM_WAIT_DEAD_END in fatal situations:
> 
> /*
>  * Stop any dead-end children and stop creating new ones.
>  */
> UpdatePMState(PM_WAIT_DEAD_END);
> ConfigurePostmasterWaitSet(false);
> SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
> and
> /*
>  * If we failed to fork a checkpointer, just shut down.
>  * Any required cleanup will happen at next restart. We
>  * set FatalError so that an "abnormal shutdown" message
>  * gets logged when we exit.
>  *
>  * We don't consult send_abort_for_crash here, as it's
>  * unlikely that dumping cores would illuminate the reason
>  * for checkpointer fork failure.
>  */
> FatalError = true;
> UpdatePMState(PM_WAIT_DEAD_END);
> ConfigurePostmasterWaitSet(false);
> 
> 
> I don't think this actively causes problems today,

For the first case, we'd ask the checkpointer to do work in the PM_WAIT_DEAD_END
case while already SIGQUIT'd (through SignalHandlerForCrashExit()). But that's
not an issue per se, as we check for CheckpointerPMChild != NULL when
pmState == PM_WAIT_DEAD_END.

For the second case (where we failed to fork a checkpointer), we'd also
check for CheckpointerPMChild != NULL when pmState == PM_WAIT_DEAD_END.

So in both cases we'd ask a non-existent checkpointer to do work, but we later
check that it exists, so I also don't think that it causes any problem.

> but it seems rather iffy.

Yeah, we'd be asking a non-existent checkpointer process to do work...

> Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
> thought it'd be better to shut checkpointer down after even dead-end children
> exited, in case we ever wanted to introduce stats for dead-end children - but
> that seems rather unlikely?

Yeah, given the above, it looks cleaner to change that ordering.

Indeed it seems unlikely that we'd introduce dead-end children stats, and even if
we do, we could think back to putting PM_WAIT_CHECKPOINTER after PM_WAIT_DEAD_END
(or any other way to ensure the stats are flushed after they are shut down). So
it's probably better to focus on the current state and then put
PM_WAIT_CHECKPOINTER before PM_WAIT_DEAD_END.

Anyway, if we reach the 2 cases mentioned above we're not going to flush the
stats anyway (CheckpointerPMChild is NULL), so better to reorder, then.

> Independent of t

Re: Conflict detection for update_deleted in logical replication

2025-01-13 Thread Nisha Moond
Here are the performance test results and analysis with the recent patches.

Test Setup:
- Created Pub and Sub nodes with logical replication and below configurations.
   autovacuum_naptime = '30s'
   shared_buffers = '30GB'
   max_wal_size = 20GB
   min_wal_size = 10GB
   track_commit_timestamp = on (only on Sub node).
- Pub and Sub had different pgbench tables with initial data of scale=100.

---
Case-0: Collected data on pgHead
---
 - Ran pgbench(read-write) on both the publisher and the subscriber
with 30 clients for a duration of 15 minutes, collecting data over 3
runs.

Results:
Run#    pub_TPS      sub_TPS
1       30551.63471  29476.81709
2       30112.31203  28933.75013
3       29599.40383  28379.4977
Median  30112.31203  28933.75013

---
Case-1: Long run (15-minute) tests when retain_conflict_info=ON
---
 - Code: pgHead + v19 patches.
 - At Sub set autovacuum=false.
 - Ran pgbench(read-write) on both the publisher and the subscriber
with 30 clients for a duration of 15 minutes, collecting data over 3
runs.

Results:
Run#        pub_TPS      sub_TPS
1           30326.57637  4890.410972
2           30412.85115  4787.192754
3           30860.13879  4864.549117
Median      30412.85115  4864.549117
regression  1%           -83%

  - A 15-minute pgbench run showed a higher reduction in the Sub's
TPS. As the test run time increased, the TPS at the Sub node reduced
further.

---
Case-2 : Re-ran the case-1 with autovacuum enabled and running every 30 seconds.
---
 - Code: pgHead + v19 patches.
 - At Sub set autovacuum=true.
 - Also measured the frequency of slot.xmin and the worker's
oldest_nonremovable_xid updates.

Results:
Run#        pub_TPS      sub_TPS      #slot.xmin_updates  #worker's_oldest_nonremovable_xid_updates
1           31080.30944  4573.547293  0                   1
regression  3%           -84%

 - Autovacuum did not help in improving the Sub's TPS.
 - The slot's xmin was not advanced.


Observations and RCA for TPS reduction in above tests:
 - The launcher was not able to advance slot.xmin during the 15-minute
pgbench run, leading to increased dead tuple accumulation on the
subscriber node.
 - The launcher failed to advance slot.xmin because the apply worker
could not set the oldest_nonremovable_xid early and frequently enough
due to the following two reasons -
   1) For large pgbench tables (scale=100), the tablesync takes time
to complete, forcing the apply worker to wait before updating its
oldest_nonremovable_xid.
   2) With 30 clients generating operations at a pace that a single
apply worker cannot match, the worker fails to catch up with the
rapidly increasing remote_lsn, lagging behind the Publisher's LSN
throughout the 15-minute run.

Considering the above reasons, for better performance measurements, we
collected data with table_sync off and a varying number of clients on
the publisher node. The test below used the v21 patch set, which also
includes improvement patches (006 and 007) for more frequent slot.xmin
updates.
---
Case-3: Create the subscription with option "copy_data=false", so, no
tablesync in the picture.
---
Test setup:
 - Code: pgHead + v21 patches.
 - Created Pub and Sub nodes with logical replication and below configurations.
   autovacuum_naptime = '30s'
   shared_buffers = '30GB'
   max_wal_size = 20GB
   min_wal_size = 10GB
   track_commit_timestamp = on (only on Sub node).

 - The Pub and Sub had different pgbench tables with initial data of scale=100.
 - Ran pgbench(read-write) on both the pub and the sub for a duration
of 15 minutes, using 30 clients on the Subscriber while varying the
number of clients on the Publisher.
 - In addition to TPS, the frequency of slot.xmin and the worker's
oldest_nonremovable_xid updates was also measured.

Observations:
 - As the number of clients on the publisher increased, the
publisher's TPS improved, but the subscriber's TPS dropped
significantly.
 - The frequency of slot.xmin updates also declined with more clients
on the publisher, indicating that the apply worker updated its
oldest_nonremovable_xid less frequently as the read-write operations
on the publisher increased.

Results:
#Pub-clients  pubTPS       pubTPS_increment  subTPS       subTPS_reduction  #slot.xmin_updates  #worker's_oldest_nonremovable_xid_updates
1             1364.487898  0                 35000.06738  0                 6976                6977
2             2706.100445  98%               32297.81408  -8%               5838                5839
4             5079.522778  272%              8581.034791  -75%              268                 269
30            31308.18524  2195%             5324.328696  -85%              4                   5

Note: In the above result table, the column -
 - "PubTPS_increment" represents the % improvement in the Pub's
TPS compared to its TPS in the initial run with #Pub-clients=1 and
 - "SubTPS_reduction" indicates the % decrease in the Sub's TPS
compared to its TPS in the initial 

Re: proposal: plpgsql, new check for extra_errors - strict_expr_check

2025-01-13 Thread Pavel Stehule
Hi

fresh rebase

Regards

Pavel
From 4e582d825af9d3463be332a90b9e959980480293 Mon Sep 17 00:00:00 2001
From: "ok...@github.com" 
Date: Wed, 12 Jun 2024 21:34:05 +0200
Subject: [PATCH 1/3] use strict rules for parsing PL/pgSQL expressions

Originally the rule PLpgSQL_Expr allowed almost all SQL clauses. It was designed
to allow the old undocumented syntax

var := col FROM tab;

The reason for supporting this "strange" syntax was technical. The PL/pgSQL parser
could not use the SQL parser accurately (it was really primitive), and people found
this undocumented syntax. Later, when exact parsing became possible, the parsing of
PL/pgSQL expressions kept allowing the described syntax for compatibility reasons.

Unfortunately, by supporting almost all SQL clauses, PL/pgSQL can accept
really broken code like

DO $$
DECLARE
  l_cnt int;
BEGIN
  l_cnt := 1
  DELETE FROM foo3 WHERE id=1;
END; $$;

The proposed patch introduces a new extra error check, strict_expr_check, which
solves this issue.
---
 doc/src/sgml/plpgsql.sgml|  18 
 src/backend/executor/spi.c   |   6 ++
 src/backend/parser/gram.y| 149 +++
 src/backend/parser/parser.c  |   6 ++
 src/include/parser/parser.h  |  22 
 src/interfaces/ecpg/preproc/parse.pl |  10 +-
 src/pl/plpgsql/src/pl_comp.c |  31 ++
 src/pl/plpgsql/src/pl_gram.y |  49 ++---
 src/pl/plpgsql/src/pl_handler.c  |   2 +
 src/pl/plpgsql/src/plpgsql.h |  19 
 10 files changed, 298 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 78e4983139..a27700d6cb 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -5386,6 +5386,24 @@ a_output := a_output || $$ if v_$$ || referrer_keys.kind || $$ like '$$

   
  
+
+ 
+  strict_expr_check
+  
+   
+Enabling this check will cause PL/pgSQL to
+check if a PL/pgSQL expression is just an
+expression without any SQL clauses like FROM,
+ORDER BY. This undocumented form of expressions
+is allowed for compatibility reasons, but in some special cases
+it doesn't allow detecting broken code.
+   
+
+   
+This check is allowed only in plpgsql.extra_errors.
+   
+  
+ 
 
 
 The following example shows the effect of plpgsql.extra_warnings
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index ecb2e4ccaa..f9d2703162 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -2984,11 +2984,17 @@ _SPI_error_callback(void *arg)
 		switch (carg->mode)
 		{
 			case RAW_PARSE_PLPGSQL_EXPR:
+			case RAW_PARSE_PLPGSQL_STRICT_EXPR:
+			case RAW_PARSE_PLPGSQL_STRICT_EXPR_LIST:
+			case RAW_PARSE_PLPGSQL_STRICT_NAMED_EXPR_LIST:
 errcontext("PL/pgSQL expression \"%s\"", query);
 break;
 			case RAW_PARSE_PLPGSQL_ASSIGN1:
 			case RAW_PARSE_PLPGSQL_ASSIGN2:
 			case RAW_PARSE_PLPGSQL_ASSIGN3:
+			case RAW_PARSE_PLPGSQL_STRICT_EXPR_ASSIGN1:
+			case RAW_PARSE_PLPGSQL_STRICT_EXPR_ASSIGN2:
+			case RAW_PARSE_PLPGSQL_STRICT_EXPR_ASSIGN3:
 errcontext("PL/pgSQL assignment \"%s\"", query);
 break;
 			default:
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6079de70e0..1b72db6952 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,6 +312,11 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	select_no_parens select_with_parens select_clause
 simple_select values_clause
 PLpgSQL_Expr PLAssignStmt
+PLpgSQLStrictExpr PLpgSQLStrictExprs PLpgSQLStrictNamedExprs
+PLAssignStmtStrictExpr
+
+%type 	plpgsql_strict_expr plpgsql_strict_named_expr
+%type 	plpgsql_strict_expr_list plpgsql_strict_named_expr_list
 
 %type 			opt_single_name
 %type 		opt_qualified_name
@@ -814,6 +819,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %token		MODE_PLPGSQL_ASSIGN1
 %token		MODE_PLPGSQL_ASSIGN2
 %token		MODE_PLPGSQL_ASSIGN3
+%token		MODE_PLPGSQL_STRICT_EXPR
+%token		MODE_PLPGSQL_STRICT_EXPR_LIST
+%token		MODE_PLPGSQL_STRICT_NAMED_EXPR_LIST
+%token		MODE_PLPGSQL_STRICT_EXPR_ASSIGN1
+%token		MODE_PLPGSQL_STRICT_EXPR_ASSIGN2
+%token		MODE_PLPGSQL_STRICT_EXPR_ASSIGN3
 
 
 /* Precedence: lowest to highest */
@@ -945,6 +956,46 @@ parse_toplevel:
 pg_yyget_extra(yyscanner)->parsetree =
 	list_make1(makeRawStmt((Node *) n, @2));
 			}
+			| MODE_PLPGSQL_STRICT_EXPR PLpgSQLStrictExpr
+			{
+pg_yyget_extra(yyscanner)->parsetree =
+	list_make1(makeRawStmt($2, 0));
+			}
+			| MODE_PLPGSQL_STRICT_EXPR_LIST PLpgSQLStrictExprs
+			{
+pg_yyget_extra(yyscanner)->parsetree =
+	list_make1(makeRawStmt($2, 0));
+			}
+			| MODE_PLPGSQL_STRICT_NAMED_EXPR_LIST PLpgSQLStrictNamedExprs
+			{
+	pg_yyget_extra(yyscanner)->parsetree =
+	list_make1(makeRawStmt($2, 0));
+			}
+			| MODE_PLPGSQ

Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-01-13 Thread Jacob Champion
On Mon, Jan 13, 2025 at 3:21 PM Jacob Champion
 wrote:
> Next email will discuss the architectural bug that Kashif found.

Okay, here goes. A standard OAuth connection attempt looks like this
(oh, I hope Gmail doesn't mangle it):

IssuerUserlibpq   Backend
  |||
  |x -> x -> o  [1] Startup Packet
  ||||
  ||x <- x  [2] OAUTHBEARER Request
  ||||
  ||x -> x  [3] Parameter Discovery
  ||||
  ||x <- o  [4] Parameters Stored
  |||
  |||
  |||
  ||x -> o  [5] New Startup Packet
  ||||
  ||x <- x  [6] OAUTHBEARER Request
  ||||
  x <- x <> x|
  x <- x <> x|  [7] OAuth Handshake
  x <- x <> x|
  ||||
  o|x -> x  [8] Send Token
   |||
   | <- x <- x  [9] Connection Established
   |||
   x <> x <> x
   x <> x <> x  [10] Use the DB
   ...
   ...
   ...

When the server first asks for a token via OAUTHBEARER (step 2), the
client doesn't necessarily know what the server's requirements are for
a given user. It uses the rest of the doomed OAUTHBEARER exchange to
store the issuer and scope information in the PGconn (step 3-4), then
disconnects and sets need_new_connection in PQconnectPoll() so that a
second connection is immediately opened (step 5). When the OAUTHBEARER
mechanism takes control the second time, it has everything it needs to
conduct the login flow with the issuer (step 7). It then sends the
obtained token to establish a connection (steps 8 onward).

The problem is that step 7 is consuming the authentication_timeout for
the backend. I'm very good at completing these flows quickly, but if
you can't complete the browser prompts in time, you will simply not be
able to log into the server. Which is harsh to say the least. (Imagine
the pain if the standard psql password prompt timed out.) DBAs can get
around it by increasing the timeout, obviously, but that doesn't feel
very good as a solution.
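(Concretely, that workaround is nothing more than raising the GUC, e.g.

    authentication_timeout = '5min'   # value purely illustrative

in postgresql.conf, so that it covers the whole browser-based flow.)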

Last week I looked into a fix where libpq would simply try again with
the stored token if the backend hangs up on it during the handshake,
but I think that will end up making the UX worse. The token validation
on the server side isn't going to be instantaneous, so if the client
is able to complete the token exchange in 59 seconds and send it to
the backend, there's an excellent chance that the connection is still
going to be torn down in a way that's indistinguishable from a crash.
We don't want the two sides to fight for time.

So I think what I'm going to need to do is modify v41-0003 to allow
the mechanism to politely hang up the connection while the flow is in
progress. This further decouples the lifetimes of the mechanism and
the async auth -- the async state now has to live outside of the SASL
exchange -- but I think it's probably more architecturally sound. Yell
at me if that sounds unmaintainable or if there's a more obvious fix
I'm missing.

Huge thanks to Kashif for pointing this out!

--Jacob




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2025-01-13 Thread Peter Smith
On Tue, Jan 14, 2025 at 8:22 AM Robert Treat  wrote:
>
> On Mon, Jan 13, 2025 at 3:55 AM Amit Kapila  wrote:
> > On Mon, Jan 13, 2025 at 10:22 AM Robert Treat  wrote:
> > > On Sun, Jan 12, 2025 at 11:00 PM Amit Kapila  
> > > wrote:
> > > > On Sat, Jan 11, 2025 at 7:28 PM Robert Treat  wrote:
> > > > +If a table with replica identity set to NOTHING
> > > > +(or set DEFAULT but with no primary key, or set
> > > > +USING INDEX but the index has been dropped) is
> > > > +added to a publication that replicates UPDATE
> > > > +or DELETE operations,
> > > > +subsequent UPDATE or DELETE
> > > > +operations will cause an error on the publisher.
> > > >
> > > > In the above change, we missed the existing "a table without a replica
> > > > identity" part. A slightly different way to write the above change
> > > > could be: "Tables lacking a replica identity or with an insufficiently
> > > > defined replica identity (e.g., set to NOTHING, set to DEFAULT but
> > > > with no primary key, or set USING INDEX but the index has been
> > > > dropped) cannot be updated or deleted when added to a publication that
> > > > replicates these operations. Attempting to do so will result in an
> > > > error on the publisher."
> > > >
> > >
> >
> 
> > > Aside from that, your above language is a little more compact with the
> > > trade-off of being less explicit in talking about publication
> > > properties; I didn't change that part because it didn't seem like an
> > > issue, but we could update that second part if you feel strongly about
> > > it. LMK.
> > >
> >
> > One of the reasons I tried to rephrase the sentence was it appears to
> > be long. I agree that the way you proposed is more explicit but the
> > way I phrased also conveys the information in a bit succinct form. I
> > think you can once propose with the wording on those lines then let us
> > what Peter or others think about it.
> >
>
> Splitting the difference would look like this?
>
> "Tables with replica identity set NOTHING,
> set DEFAULT but with no primary key, or set
> USING INDEX but the index has been
> dropped, cannot be updated or deleted when added to a publication that
> replicates these operations. Attempting to do so will result in an
> error on the publisher."
>
>

I thought Amit's proposed text was mostly OK;  it only needed the
"lacking a replica identity" part to be removed. (I've also changed
the e.g. to i.e.)

Like this:

"Tables with an insufficiently defined replica identity (i.e., set to
NOTHING, set to DEFAULT but with no primary key, or set USING INDEX
but the index has been dropped) cannot be updated or ...".

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Several buildfarm animals fail tests because of shared memory error

2025-01-13 Thread Tom Lane
Robins Tharakan  writes:
> Unrelated, parula has been failing the libperl test (only v15 and older),
> for the past
> 3 weeks - to clarify, this test started to fail (~18 days ago) before I
> fixed the
> 'RemoveIPC' configuration (~5 days ago), so this is unrelated to that
> change.

> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=parula&dt=2025-01-09%2003%3A13%3A18&stg=configure

> The first REL_15_STABLE test failure points to acd5c28db5 but I didn't see
> anything interesting there.

> The error seems to be around "annobin.so" and so it may be about how
> gcc is being compiled (not sure). While I figure out if GCC compilation
> needs work, I thought to bring it up here since v16+ seems to work fine on
> the same box and we may want to consider doing something similar for all
> older versions too?

In a failing build (v13) I see

checking for CFLAGS recommended by Perl... -D_REENTRANT -D_GNU_SOURCE -O2 
-ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g 
-grecord-gcc-switches -pipe -Wall -Werror=format-security 
-Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong 
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -march=armv8.2-a+crypto 
-mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables 
-fstack-clash-protection -fwrapv -fno-strict-aliasing -I/usr/local/include 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for CFLAGS to compile embedded Perl... 
checking for flags to link embedded Perl...  -Wl,-z,relro -Wl,--as-needed 
-Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld 
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1  
-fstack-protector-strong -L/usr/local/lib  -L/usr/lib64/perl5/CORE -lperl 
-lpthread -lresolv -ldl -lm -lcrypt -lutil -lc

HEAD reports the same "CFLAGS recommended by Perl" but is much more
selective about what it actually adopts:

checking for CFLAGS recommended by Perl... -D_REENTRANT -D_GNU_SOURCE -O2 
-ftree-vectorize -flto=auto -ffat-lto-objects -fexceptions -g 
-grecord-gcc-switches -pipe -Wall -Werror=format-security 
-Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong 
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -march=armv8.2-a+crypto 
-mtune=neoverse-n1 -mbranch-protection=standard -fasynchronous-unwind-tables 
-fstack-clash-protection -fwrapv -fno-strict-aliasing -I/usr/local/include 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
checking for CFLAGS to compile embedded Perl... 
checking for flags to link embedded Perl...   -L/usr/lib64/perl5/CORE -lperl 
-lpthread -lresolv -ldl -lm -lcrypt -lutil -lc

So it would appear that the link failure is down to those -specs
switches that we uncritically adopted.  The change in our behavior
was presumably at

commit b4e936859dc441102eb0b6fb7a104f3948c90490
Author: Peter Eisentraut 
Date:   Tue Aug 23 16:00:38 2022 +0200

Remove further unwanted linker flags from perl_embed_ldflags

Remove the contents of $Config{ldflags} from ExtUtils::Embed's ldopts,
like we already do with $Config{ccdlflags}.  Those flags are the
choices of those who built the Perl installation, which are not
necessarily appropriate for building PostgreSQL.  What we really want
from ldopts are the options identifying the location and name of the
libperl library, but unfortunately it doesn't appear possible to get
that separately from the other stuff.

The motivation for this was to strip -mmacosx-version-min options.  We
already did something similar for the -arch option.  Both of those are
now covered by this more general approach.

This went into v16 and was not back-patched, which is probably wise
because there was at least one followup fix (1c3aa5450) and we still
didn't entirely understand what was happening in the Cygwin build [1].
So I'm hesitant to consider back-patching it just because your
experimental gcc isn't working.  At the very least we ought to find
out why it worked up till three weeks ago.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/8c4fcb72-2574-ff7c-4c25-1f032d4a2a57%40enterprisedb.com




Re: Skip collecting decoded changes of already-aborted transactions

2025-01-13 Thread Peter Smith
Hi Sawada-San.

Some review comments for patch v13-0002.

==

I think the v12 ambiguity of RBTXN_PREPARE versus RBTXN_SENT_PREPARE
was mostly addressed already by the improved comments for the macros
in patch 0001.

Meanwhile, patch v13-0002 says it is renaming constants for better
consistency, but I don't think it went far enough.

For example, better name consistency would be achieved by changing
*all* of the constants related to prepared transactions:

#define RBTXN_IS_PREPARED  0x0040
#define RBTXN_IS_PREPARED_SKIPPED  0x0080
#define RBTXN_IS_PREPARED_SENT 0x0200

where:

RBTXN_IS_PREPARED. This means it's a prepared transaction. (but we
can't tell from this if it is skipped or sent).

RBTXN_IS_PREPARED_SKIPPED. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and it's being skipped.

RBTXN_IS_PREPARED_SENT. This means it's a prepared transaction
(RBTXN_IS_PREPARED) and we've sent it.

~

A note about RBTXN_IS_PREPARED. Since all of these constants are
clearly about transactions (e.g. "TXN" in prefix "RBTXN_"), I felt
patch 0002 calling this RBTXN_IS_PREPARED_TXN just seemed like adding
a redundant _TXN. e.g. we don't say RBTXN_IS_COMMITTED_TXN etc.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2025-01-13 Thread Robert Treat
On Mon, Jan 13, 2025 at 3:55 AM Amit Kapila  wrote:
> On Mon, Jan 13, 2025 at 10:22 AM Robert Treat  wrote:
> > On Sun, Jan 12, 2025 at 11:00 PM Amit Kapila  
> > wrote:
> > > On Sat, Jan 11, 2025 at 7:28 PM Robert Treat  wrote:
> > > +If a table with replica identity set to NOTHING
> > > +(or set DEFAULT but with no primary key, or set
> > > +USING INDEX but the index has been dropped) is
> > > +added to a publication that replicates UPDATE
> > > +or DELETE operations,
> > > +subsequent UPDATE or DELETE
> > > +operations will cause an error on the publisher.
> > >
> > > In the above change, we missed the existing "a table without a replica
> > > identity" part. A slightly different way to write the above change
> > > could be: "Tables lacking a replica identity or with an insufficiently
> > > defined replica identity (e.g., set to NOTHING, set to DEFAULT but
> > > with no primary key, or set USING INDEX but the index has been
> > > dropped) cannot be updated or deleted when added to a publication that
> > > replicates these operations. Attempting to do so will result in an
> > > error on the publisher."
> > >
> >
>

> > Aside from that, your above language is a little more compact with the
> > trade-off of being less explicit in talking about publication
> > properties; I didn't change that part because it didn't seem like an
> > issue, but we could update that second part if you feel strongly about
> > it. LMK.
> >
>
> One of the reasons I tried to rephrase the sentence was it appears to
> be long. I agree that the way you proposed is more explicit but the
> way I phrased also conveys the information in a bit succinct form. I
> think you can once propose with the wording on those lines then let us
> what Peter or others think about it.
>

Splitting the difference would look like this?

"Tables with replica identity set NOTHING,
set DEFAULT but with no primary key, or set
USING INDEX but the index has been
dropped, cannot be updated or deleted when added to a publication that
replicates these operations. Attempting to do so will result in an
error on the publisher."


Robert Treat
https://xzilla.net




Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-01-13 Thread Masahiko Sawada
On Mon, Jan 13, 2025 at 1:22 AM vignesh C  wrote:
>
> On Tue, 31 Dec 2024 at 10:15, Masahiko Sawada  wrote:
> >
> > Hi all,
> >
> > Logical decoding (and logical replication) are available only when
> > wal_level = logical. As the documentation says[1], Using the 'logical'
> > level increases the WAL volume which could negatively affect the
> > performance. For that reason, users might want to start with using
> > 'replica', but when they want to use logical decoding they need a
> > server restart to increase wal_level to 'logical'. My goal is to allow
> > users who are using 'replica' level to use logical decoding without a
> > server restart. There are other GUC parameters related to logical
> > decoding and logical replication such as max_wal_senders,
> > max_logical_replication_workers, and max_replication_slots, but even
> > if users set these parameters >0, there would not be a noticeable
> > performance impact. And their default values are already >0. So I'd
> > like to focus on making only the wal_level dynamic GUC parameter.
> > There are several earlier discussions[2][3] but no one has submitted
> > patches unless I'm missing something.
> >
> > The first idea I came up with is to make the wal_level a PGC_SIGHUP
> > parameter. However, it affects not only setting 'replica' to 'logical'
> > but also setting 'minimal' to 'replica' or higher. I'm not sure the
> > latter case is common and it might require a checkpoint. I don't want
> > to make the patch complex for uncommon cases.
> >
> > The second idea is to somehow allow both WAL-logging logical info and
> > logical decoding even when wal_level is 'replica'. I've attached a PoC
> > patch for that. The patch introduces new SQL functions such as
> > pg_activate_logical_decoding() and pg_deactivate_logical_decoding().
> > These functions are available only when wal_level is 'replica' (or
> > higher). In pg_activate_logical_decoding(), we set the status of
> > logical decoding stored on the shared memory from 'disabled' to
> > 'xlog-logical-info', allowing all processes to write logical
> > information to WAL records for logical decoding. But the logical
> > decoding is still not allowed. Once we confirm all in-progress
> > transactions completed, we switch the status to
> > 'logical-decoding-ready', meaning that users can create logical
> > replication slots and use logical decoding.
>
> I felt that the only disadvantage with this approach is that we
> currently wait for all in-progress transactions to complete before
> enabling logical decoding. If a long-running transaction exists and
> the session enabling logical decoding fails—due to factors like
> statement_timeout, transaction_timeout,
> idle_in_transaction_session_timeout, idle_session_timeout, or any
> other failure. This would require restarting the wait. During this
> time, there's a risk that a new long-running transaction could start,
> further delaying the process. Probably this could be solved if the
> waiting is done from any of the background processes through
> PGC_SIGHUP.  While this type of failure is likely rare, I’m unsure
> whether we should consider this scenario.
>
> Thoughts?

Yeah, delegating the activation to a background process such as the
checkpointer would also be one solution. This would work with the
approach where we enable logical decoding via
pg_activate_logical_decoding(). On the other hand, if we support
automatically enabling logical decoding (and logical info logging)
when the first logical slot is created, we might want to have the
process that is creating the logical slot activate the logical decoding
too.
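
For reference, the SQL-level flow of the PoC would be roughly as follows (a
sketch based only on the function names mentioned above; the exact signatures
and waiting behavior are per the patch):

    SELECT pg_activate_logical_decoding();
    -- once the status reaches 'logical-decoding-ready' (i.e. all transactions
    -- that were running at activation time have finished):
    SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
    -- ... use logical decoding ...
    SELECT pg_deactivate_logical_decoding();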

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Add get_bytes() and set_bytes() functions

2025-01-13 Thread Dean Rasheed
On Mon, 13 Jan 2025 at 19:23, Alvaro Herrera  wrote:
>
> But these don't show the acceptable range. We have these that do:
>
> #: utils/adt/varbit.c:1824 utils/adt/varbit.c:1882
> #, c-format
> msgid "bit index %d out of valid range (0..%d)"
>
> #: utils/adt/varlena.c:3218 utils/adt/varlena.c:3285
> #, c-format
> msgid "index %d out of valid range, 0..%d"
>
> #: utils/adt/varlena.c:3249 utils/adt/varlena.c:3321
> #, c-format
> msgid "index %lld out of valid range, 0..%lld"
>
> #: utils/misc/guc.c:3130
> #, c-format
> msgid "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)"
>

Those are all instances of a value that's outside a specific range
that you might not otherwise know, rather than being out of range of
the type itself. For that, we generally don't say what the range of
the type is. For example, we currently do:

select repeat('1', 50)::bit(50)::int;
ERROR:  integer out of range

Regards,
Dean




Re: pgsql: Consolidate docs for vacuum-related GUCs in new subsection

2025-01-13 Thread Alvaro Herrera
On 2025-Jan-13, Melanie Plageman wrote:

> Since I didn't hear back about this and I don't see an obvious
> alternative reorganization in guc_tables.c, I plan to just push the
> attached patch that updates only postgresql.conf.sample.

Apologies, I was very unclear -- I didn't want to talk about the
ordering of entries in the code, but the categorization.  See the
config_group_names list in guc_tables.c, which defines some groups.
Each setting belongs into a group, and those groups correspond to what
the sample config file lists as section/subsection titles and to the
grouping in the docs.  Also, this categorization affects how the entries
are listed in the pg_settings view and in "postgres --describe-config",
which feed from the same tables.

Perhaps with your changes (assuming I read your commit message right),
we need new groups:
VACUUMING
VACUUMING_FREEZING
VACUUMING_AUTOVACUUM

Thanks

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)




Re: Eagerly scan all-visible pages to amortize aggressive vacuum

2025-01-13 Thread Melanie Plageman
On Thu, Jan 9, 2025 at 1:24 PM Andres Freund  wrote:
>
> On 2025-01-07 15:46:26 -0500, Melanie Plageman wrote:
> > For table storage options, those related to vacuum but not autovacuum
> > are in the main StdRdOptions struct. Of those, some are overridden by
> > VACUUM command parameters which are parsed out into the VacuumParams
> > struct. Though the members of VacuumParams are initialized in
> > ExecVacuum(), the storage parameter overrides are determined in
> > vacuum_rel() and the final value goes in the VacuumParams struct which
> > is passed all the way through to heap_vacuum_rel().
> >
> > Because VacuumParams is what ultimately gets passed down to the
> > table-AM specific vacuum implementation, autovacuum also initializes
> > its own instance of VacuumParams in the autovac_table struct in
> > table_recheck_autovac() (even though no VACUUM command parameters can
> > affect autovacuum). These are overridden in vacuum_rel() as well.
> >
> > Ultimately vacuum_eager_scan_max_fails is a bit different from the
> > existing members of VacuumParams and StdRdOptions. It is a GUC and a
> > table storage option but not a SQL command parameter -- and both the
> > GUC and the table storage parameter affect both vacuum and autovacuum.
> > And it doesn't need to be initialized in different ways for autovacuum
> > and vacuum. In the end, I decided to follow the existing conventions
> > as closely as I could.
>
> I think that's fine. The abstractions in this area aren't exactly perfect, and
> I don't think this makes it worse in any meaningful way. It's not really
> different from having other heap-specific params like freeze_min_age in
> VacuumParams.

Got it. I've left it as is, then.

Attached v6 is rebased over recent changes in the vacuum-related docs.
I've also updated the "Routine Vacuuming" section of the docs to
mention eager scanning.

I'm planning to commit 0001 (which updates the code comment at the top
of vacuumlazy.c to explain heap vacuuming) --barring any objections.

I've been running a few multi-day benchmarks to ensure that the patch
behaves the same in a "normal" timeframe as it did in a compressed
one.

So far, it looks good. For a multi-day transactional benchmark with a
gaussian data access pattern, it looks about the same as a shorter
version (that is, aggressive vacuums are much shorter and there is no
difference when compared to master WRT total WAL volume, TPS, etc).

The final long benchmarks I'm waiting on are a hot tail workload with
a job that deletes old data.
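
For reference, the knob added by the patch below can be set either globally or
per table; a minimal illustration (the parameter name comes from the patch, the
values here are made up, not recommendations):

    -- postgresql.conf
    vacuum_eager_scan_max_fails = 128

    -- or per table, via the new storage parameter
    ALTER TABLE foo SET (vacuum_eager_scan_max_fails = 128);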

- Melanie
From a5080bb6c630af932451d56a0931c9bc96eb8417 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 7 Jan 2025 09:48:34 -0500
Subject: [PATCH v6 2/2] Eagerly scan all-visible pages to amortize aggressive
 vacuum

Introduce eager scanning in normal vacuums, in which vacuum scans some
of the all-visible but not all-frozen pages in the relation to amortize
the cost of an aggressive vacuum.

Because the goal is to freeze these all-visible pages, all-visible pages
that are eagerly scanned and set all-frozen in the visibility map are
considered successful eager scans and those not frozen are considered
failed eager scans.

If too many eager scans fail in a row, eager scanning is temporarily
suspended until a later portion of the relation. The number of failures
tolerated is configurable globally and per table. To effectively
amortize aggressive vacuums, we cap the number of successes as well.
Once we reach the maximum number of blocks successfully eager scanned
and frozen, eager scanning is permanently disabled for the current
vacuum.

Original design idea from Robert Haas, with enhancements from
Andres Freund, Tomas Vondra, and me

Author: Melanie Plageman
Reviewed-by: Andres Freund, Robert Haas, Robert Treat, Bilal Yavuz
Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  20 +
 doc/src/sgml/maintenance.sgml |  47 ++-
 doc/src/sgml/ref/create_table.sgml|  15 +
 src/backend/access/common/reloptions.c|  13 +-
 src/backend/access/heap/vacuumlazy.c  | 382 --
 src/backend/commands/vacuum.c |  13 +
 src/backend/postmaster/autovacuum.c   |   2 +
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/commands/vacuum.h |  23 ++
 src/include/utils/rel.h   |   7 +
 11 files changed, 488 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f41a17b1fe..305f4065495 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9116,6 +9116,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;

   
 
+ 
+  vacuum_eager_scan_max_fails (integer)
+  
+   vacuum_eager_scan_max_fails configuration parameter
+  
+  
+  
+   
+

Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-01-13 Thread Masahiko Sawada
On Mon, Jan 13, 2025 at 1:31 AM Ashutosh Bapat
 wrote:
>
> On Mon, Jan 13, 2025 at 2:52 PM vignesh C  wrote:
> >
> > I felt that the only disadvantage with this approach is that we
> > currently wait for all in-progress transactions to complete before
> > enabling logical decoding. If a long-running transaction exists and
> > the session enabling logical decoding fails—due to factors like
> > statement_timeout, transaction_timeout,
> > idle_in_transaction_session_timeout, idle_session_timeout, or any
> > other failure, this would require restarting the wait. During this
> > time, there's a risk that a new long-running transaction could start,
> > further delaying the process. Probably this could be solved if the
> > waiting is done from any of the background processes through
> > PGC_SIGHUP.  While this type of failure is likely rare, I’m unsure
> > whether we should consider this scenario.
>
> A related question: While the operation is waiting for already running
> transactions to end, the backends whose transactions have finished may
> start new transactions. When we switch WAL from 'replica' to
> 'logical', there may be some transactions running. Will this lead to
> a WAL stream with mixed WAL records - some with logical information and
> some without?

Yes. There could be mixed WAL records until all running transactions complete.

> Do we need to adjust logical decoding to tackle this
> situation? Is there a chance that some WAL from same transaction have
> logical information and some do not?

In the current approach, we first enable logical info WAL-logging to
let newly started transactions do logical info WAL-logging, then allow
the logical decoding only after all running transactions complete.
Therefore, at the time when we allow the logical decoding, all WAL
records are written with logical information.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: AIO v2.2

2025-01-13 Thread Andres Freund
Hi,

On 2025-01-13 15:43:46 -0500, Robert Haas wrote:
> On Wed, Jan 8, 2025 at 7:26 PM Andres Freund  wrote:
> > 1) Shared memory representation of an IO, for the AIO subsystem internally
> >
> >Currently: PgAioHandle
> >
> > 2) A way for the issuer of an IO to reference 1), to attach information to 
> > the
> >IO
> >
> >Currently: PgAioHandle*
> >
> > 3) A way for any backend to wait for a specific IO to complete
> >
> >Currently: PgAioHandleRef
>
> With that additional information, I don't mind this naming too much,
> but I still think PgAioHandle -> PgAio and PgAioHandleRef ->
> PgAioHandle is worth considering. Compare BackgroundWorkerSlot and
> BackgroundWorkerHandle, which suggests PgAioHandle -> PgAioSlot and
> PgAioHandleRef -> PgAioHandle.

I don't love PgAioHandle -> PgAio as there are other things (e.g. per-backend
state) in the PgAio namespace...



> > > I do agree with Heikki that REAPED sounds later than COMPLETED, because 
> > > you
> > > reap zombie processes by collecting their exit status. Maybe you could 
> > > have
> > > AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
> > > there's still completion-related work to be done, and then the other state
> > > could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.
> >
> > How about
> >
> > AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
> > AHS_COMPLETE_SHARED_CB - shared callback completed
> > AHS_COMPLETE_LOCAL_CB - local callback completed
> >
> > ?
>
> That's not bad. I like RAW better than KERNEL.

Cool.


> I was hoping to use different words like COMPLETE and DONE rather than, as
> you did it here, COMPLETE and COMPLETE, but it's probably fine.

Once the IO is really done, the handle is immediately recycled (and moved into
IDLE state, ready to be used again).

Greetings,

Andres Freund




Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

2025-01-13 Thread Michail Nikolaev
Hello, everyone!

I have noticed the tests are still a bit flaky on FreeBSD.
I have now added some markers to the isolation specs to avoid possible race
conditions.

Best regards,
Mikhail.

>


v6-0003-Modify-the-infer_arbiter_indexes-function-to-also.patch
Description: Binary data


v6-0001-Specs-top-reproduce-the-issues-with-CREATE-INDEX-.patch
Description: Binary data


v6-0004-Modify-the-ExecInitPartitionInfo-function-to-cons.patch
Description: Binary data


v6-0002-Modify-the-infer_arbiter_indexes-function-to-cons.patch
Description: Binary data


Re: Reduce TupleHashEntryData struct size by half

2025-01-13 Thread Jeff Davis
On Sun, 2025-01-12 at 14:54 +1300, David Rowley wrote:
> While I do understand the desire to reduce Hash Agg's memory usage,
> has this really been through enough performance testing to be getting
> committed?

Perhaps not. I'm going to revert it while we sort it out, and hopefully
we can find a solution because it's a substantial memory savings.


> I wonder if there's some other better way of doing this. Would it be
> worth having some function like ExecCopySlotMinimalTuple() that
> accepts an additional parameter so that the palloc allocates N more
> bytes at the end?  Maybe it's worth hunting around to see if there's
> any other executor nodes that could benefit from that infrastructure.

That would be convenient, but doesn't seem like a great separation of
responsibilities. Perhaps some API that separated the length
calculation, and accepted a caller-supplied buffer?
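To sketch what I mean (names and signatures made up purely for illustration):

/* how many bytes the minimal tuple formed from this slot would need */
extern Size ExecMinimalTupleSize(TupleTableSlot *slot);

/*
 * Form the minimal tuple directly into a caller-supplied buffer of at
 * least ExecMinimalTupleSize(slot) bytes; the caller can over-allocate
 * the buffer to co-locate its own per-entry data after the tuple.
 */
extern MinimalTuple ExecCopySlotMinimalTupleTo(TupleTableSlot *slot,
                                               void *buf, Size bufsize);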

Regards,
Jeff Davis






Re: Conflict detection for update_deleted in logical replication

2025-01-13 Thread Masahiko Sawada
On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila  wrote:
>
> On Fri, Jan 10, 2025 at 6:13 AM Masahiko Sawada  wrote:
> >
> > 3. If the apply worker cannot catch up, it could enter to a bad loop;
> > the publisher sends huge amount of data -> the apply worker cannot
> > catch up -> it needs to wait for a longer time to advance its
> > oldest_nonremovable_xid -> more garbage are accumulated and then have
> > the apply more slow -> (looping). I'm not sure how to deal with this
> > point TBH. We might be able to avoid entering this bad loop once we
> > resolve the other two points.
> >
>
> I don't think we can avoid accumulating garbage especially when the
> workload on the publisher is more. Consider the current case being
> discussed, on the publisher, we have 30 clients performing read-write
> operations and there is only one pair of reader (walsender) and writer
> (apply_worker) to perform all those write operations on the
> subscriber. It can never match the speed and the subscriber side is
> bound to have less performance (or accumulate more bloat) irrespective
> of its workload. If there is one client on the publisher performing
> operation, we won't see much degradation but as the number of clients
> increases, the performance degradation (and bloat) will keep on
> increasing.
>
> There are other scenarios that can lead to the same situation, such as
> a large table sync, the subscriber node being down for sometime, etc.
> Basically, any case where apply_side lags by a large amount from the
> remote node.
>
> One idea to prevent the performance degradation or bloat increase is
> to invalidate the slot, once we notice that subscriber lags (in terms
> of WAL apply) behind the publisher by a certain threshold. Say we have
> max_lag (or max_lag_behind_remote) (defined in terms of seconds)
> subscription option which allows us to stop calculating
> oldest_nonremovable_xid for that subscription. We can indicate that
> via some worker_level parameter. Once all the subscriptions on a node
> that has enabled retain_conflict_info have stopped calculating
> oldest_nonremovable_xid, we can invalidate the slot. Now, users can
> check this and need to disable/enable retain_conflict_info to again
> start retaining the required information. The other way could be that
> instead of invalidating the slot, we directly drop/re-create the slot
> or increase its xmin. If we choose to advance the slot automatically
> without user intervention, we need to let users know via LOG and or
> via information in the view.
>
> I think such a mechanism via the new option max_lag will address your
> concern: "It's reasonable behavior for this approach but it might not
> be a reasonable outcome for users if they could be affected by such a
> performance dip without no way to avoid it." as it will provide a way
> to avoid performance dip only when there is a possibility of such a
> dip.
>
> I mentioned max_lag as a subscription option instead of a GUC because
> it applies only to subscriptions that have enabled
> retain_conflict_info but we can consider it to be a GUC if you and
> others think so provided the above proposal sounds reasonable. Also,
> max_lag could be defined in terms of LSN as well but I think time
> would be easy to configure.
>
> Thoughts?

I agree that we cannot avoid accumulating dead tuples when the
workload on the publisher is more, and which affects the subscriber
performance. What we need to do is to update slot's xmin as quickly as
possible to minimize the dead tuple accumulation at least when the
subscriber is not much behind. If there is a tradeoff for doing so
(e.g., vs. the publisher performance), we need to provide a way for
users to balance it.  The max_lag idea sounds interesting for the case
where the subscriber is much behind. Probably we can visit this idea
as a new feature after completing this feature.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New GUC autovacuum_max_threshold ?

2025-01-13 Thread Nathan Bossart
On Mon, Jan 13, 2025 at 05:17:11PM -0600, Sami Imseih wrote:
> I propose renaming the GUC from "autovacuum_max_threshold" to
> "autovacuum_vacuum_max_threshold" to clarify that it applies only
> to the vacuum operation performed by autovacuum, not to the analyze operation.
> This will also align with naming for other related GUCs, i.e.,
> "autovacuum_analyze_threshold" and "autovacuum_vacuum_threshold."
> 
> The "vacuum threshold" calculation described in [1] will also need to be
> updated.

Good call.  Here is an updated patch.
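
To illustrate the effect with the defaults: for a table with 1 billion tuples,
the existing formula gives 50 + 0.2 * 1,000,000,000, i.e. about 200 million dead
tuples before autovacuum triggers, while the new cap's default of 100,000,000
cuts that roughly in half. (Just an example, using the stock
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor defaults.)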

-- 
nathan
>From 1a06dd9ccc2ee67a60278561f16c2c76cf3f8dc5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 13 Jan 2025 20:01:24 -0600
Subject: [PATCH v6 1/1] Introduce autovacuum_vacuum_max_threshold.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Nathan Bossart, Frédéric Yhuel
Reviewed-by: Melanie Plageman, Robert Haas, Laurenz Albe, Michael Banck, Joe 
Conway, Sami Imseih, David Rowley, wenhui qiu
Discussion: 
https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com
---
 doc/src/sgml/config.sgml  | 24 +++
 doc/src/sgml/maintenance.sgml |  6 +++--
 doc/src/sgml/ref/create_table.sgml| 15 
 src/backend/access/common/reloptions.c| 11 +
 src/backend/postmaster/autovacuum.c   | 12 ++
 src/backend/utils/misc/guc_tables.c   |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  3 +++
 src/bin/psql/tab-complete.in.c|  2 ++
 src/include/postmaster/autovacuum.h   |  1 +
 src/include/utils/rel.h   |  1 +
 10 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f41a17b1fe..54a1ec2084a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8579,6 +8579,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;

   
 
+  
+   autovacuum_vacuum_max_threshold 
(integer)
+   
+autovacuum_vacuum_max_threshold
+configuration parameter
+   
+   
+   
+
+ Specifies the maximum number of updated or deleted tuples needed to
+ trigger a VACUUM in any one table, i.e., a cap on
+ the value calculated with
+ autovacuum_vacuum_threshold and
+ autovacuum_vacuum_scale_factor.  The default is
+ 100,000,000 tuples.  If -1 is specified, autovacuum will not enforce a
+ maximum number of updated or deleted tuples that will trigger a
+ VACUUM operation.  This parameter can only be set
+ in the postgresql.conf file or on the server
+ command line; but the setting can be overridden for individual tables
+ by changing storage parameters.
+
+   
+  
+
   
autovacuum_vacuum_insert_threshold 
(integer)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 0be90bdc7ef..f84ad7557d9 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -895,9 +895,11 @@ HINT:  Execute a database-wide VACUUM in that database.
 VACUUM exceeds the vacuum threshold, the
 table is vacuumed.  The vacuum threshold is defined as:
 
-vacuum threshold = vacuum base threshold + vacuum scale factor * number of 
tuples
+vacuum threshold = Minimum(vacuum max threshold, vacuum base threshold + 
vacuum scale factor * number of tuples)
 
-where the vacuum base threshold is
+where the vacuum max threshold is
+,
+the vacuum base threshold is
 ,
 the vacuum scale factor is
 ,
diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index 2237321cb4f..417498f71db 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1712,6 +1712,21 @@ WITH ( MODULUS numeric_literal, REM
 

 
+   
+autovacuum_vacuum_max_threshold, 
toast.autovacuum_vacuum_max_threshold (integer)
+
+ autovacuum_vacuum_max_threshold
+ storage parameter
+
+
+   
+
+ Per-table value for 
+ parameter.
+
+   
+  
+

 autovacuum_vacuum_scale_factor, 
toast.autovacuum_vacuum_scale_factor (floating 
point)
 
diff --git a/src/backend/access/common/reloptions.c 
b/src/backend/access/common/reloptions.c
index e587abd9990..5731cf42f54 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -231,6 +231,15 @@ static relopt_int intRelOpts[] =
},
-1, 0, INT_MAX
},
+   {
+   {
+   "autovacuum_vacuum_max_threshold",
+   "Maximum number of tuple updates or deletes prior to 
vacuum",
+   RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+   ShareUpdateExclusiveLock
+   },
+   -2, -1, INT_MAX
+   }

Re: Some ExecSeqScan optimizations

2025-01-13 Thread Amit Langote
Hi Vladlen,

On Fri, Jan 10, 2025 at 11:49 PM Vladlen Popolitov
 wrote:
> Amit Langote писал(а) 2025-01-10 18:22:
> > On Fri, Jan 10, 2025 at 7:36 PM David Rowley 
> > wrote:
> >> On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
> >>  wrote:
> >> >   In case of query
> >> > select count(*) from test_table where a_1 = 100;
> >> > I would expect increase of query time due to additional if...else . It
> >> > is not clear
> >> > what code was eliminated to decrease query time.
> >>
> >> Are you talking about the code added to ExecInitSeqScan() to determine
> >> which node function to call? If so, that's only called during executor
> >> startup.  The idea here is to reduce the branching during execution by
> >> calling one of those special functions which has a more specialised
> >> version of the ExecScan code for the particular purpose it's going to
> >> be used for.
> >
> > Looks like I hadn't mentioned this key aspect of the patch in the
> > commit message, so did that in the attached.
> >
> > Vladlen, does what David wrote and the new commit message answer your
> > question(s)?
>
> Hi Amit,
>
>   Yes, David clarified the idea, but it is still hard to believe in a 5%
> improvement.
> The query
> select count(*) from test_table where a_1 = 100;
> has both qual and projection, and ExecScanExtended() will be generated
> similar to ExecScan() (the same not NULL values to check in if()).

Yes, I've noticed that if the plan for the above query contains a
projection, like when it contains a Gather node, the inlined version
of ExecScanExtended() will look more or less the same as the full
ExecScan().  There won't be noticeable speedup with the patch in that
case.

However, I ran the benchmark tests with Gather disabled such that I
get a plan without projection, which uses an inlined version that
doesn't have branches related to projection.  I illustrate my example
below.

>   Do you have some scripts to reproduce your benchmark?

Use these steps.  Set max_parallel_workers_per_gather to 0,
shared_buffers to 512MB.  Compile the patch using --buildtype=release.

create table foo (a int, b int, c int, d int, e int);
insert into foo select i, i, i, i, i from generate_series(1, 1000) i;

-- pg_prewarm: to ensure that no buffers lead to I/O to reduce noise
select pg_size_pretty(pg_prewarm('foo'));

select count(*) from foo where a = 1000;

Times I get on v17, master, and with the patch for the above query are
as follows:

v17: 173, 173, 174 ms

master: 173, 175, 169 ms

Patched: 160, 161, 158 ms

Please let me know if you're still unable to reproduce such numbers
with the steps I described.


--
Thanks, Amit Langote




Re: Some ExecSeqScan optimizations

2025-01-13 Thread Amit Langote
Hi Junwang,

On Sat, Jan 11, 2025 at 7:39 PM Junwang Zhao  wrote:
> Hi Amit,
>
> On Fri, Jan 10, 2025 at 7:22 PM Amit Langote  wrote:
> >
> > On Fri, Jan 10, 2025 at 7:36 PM David Rowley  wrote:
> > > On Fri, 10 Jan 2025 at 22:53, Vladlen Popolitov
> > >  wrote:
> > > >   In case of query
> > > > select count(*) from test_table where a_1 = 100;
> > > > I would expect increase of query time due to additional if...else . It
> > > > is not clear
> > > > what code was eliminated to decrease query time.
> > >
> > > Are you talking about the code added to ExecInitSeqScan() to determine
> > > which node function to call? If so, that's only called during executor
> > > startup.  The idea here is to reduce the branching during execution by
> > > calling one of those special functions which has a more specialised
> > > version of the ExecScan code for the particular purpose it's going to
> > > be used for.
> >
> > Looks like I hadn't mentioned this key aspect of the patch in the
> > commit message, so did that in the attached.
>
> Thanks for updating the patch. While seeing the patch, the es_epq_active
> confused me a little bit mostly because its name, a field name ending with
> "active" typically suggests a boolean value, but here it is not, should we
> change it to sth like es_epqstate? However this is not related to this patch,
> I can start a new thread if you think this is worth a patch.

Yeah, the name has confused me as well from time to time.

Though it might be a good idea to dig up the thread that led to the
introduction of this field to find out whether the naming has some
logic we're missing.

You may start a new thread to get the attention of other folks who
might have some clue.

> There is one tiny indent issue(my IDE does this automatically), which I
> guess you will fix before committing.
>
> -   EPQState *epqstate;
> +   EPQState   *epqstate;

Thanks for the heads up.

-- 
Thanks, Amit Langote




Re: Skip collecting decoded changes of already-aborted transactions

2025-01-13 Thread Peter Smith
Hi Sawada-San. Here are some cosmetic review comments for the patch v13-0001.

==
Commit message

1.
This commit introduces an additional CLOG lookup to check the
transaction status, so the logical decoding skips further change also
when it doesn't touch system catalogs if the transaction is already
aborted. This optimization enhances logical decoding performance,
especially for large transactions that have already been rolled back,
as it avoids unnecessary disk or network I/O.

~

That first sentence seems confusing. How about:

This commit adds a CLOG lookup to check the transaction status,
allowing logical decoding to skip changes for non-system catalogs if
the transaction is already aborted.

On Tue, Jan 14, 2025 at 5:56 AM Masahiko Sawada  wrote:
>
> On Mon, Jan 6, 2025 at 5:52 PM Peter Smith  wrote:
> >
> > Hi Sawada-San.
> >
> > Here are some review comments for the patch v12-0001.
>
> Thank you for reviewing the patch!
>
> >
> > ==
> > .../replication/logical/reorderbuffer.c
> >
> > ReorderBufferCheckAndTruncateAbortedTXN:
> >
> > ~~~
> >
> > 3.
> > + if (TransactionIdDidCommit(txn->xid))
> > + {
> > + /*
> > + * Remember the transaction is committed so that we can skip CLOG
> > + * check next time, avoiding the pressure on CLOG lookup.
> > + */
> > + Assert(!rbtxn_is_aborted(txn));
> > + txn->txn_flags |= RBTXN_IS_COMMITTED;
> > + return false;
> > + }
> > +
> > + /*
> > + * The transaction aborted. We discard the changes we've collected so far
> > + * and toast reconstruction data. The full cleanup will happen as part of
> > + * decoding ABORT record of this transaction.
> > + */
> > + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
> > + ReorderBufferToastReset(rb, txn);
> > +
> > + /* All changes should be discarded */
> > + Assert(txn->size == 0);
> > +
> > + /*
> > + * Mark the transaction as aborted so we can ignore future changes of this
> > + * transaction.
> > + */
> > + Assert(!rbtxn_is_committed(txn));
> > + txn->txn_flags |= RBTXN_IS_ABORTED;
> > +
> > + return true;
> > +}
> >
> > 3a.
> > That whole last part related to "The transaction aborted", might be
> > clearer if the whole chunk of code was in an 'else' block from the
> > previous "if (TransactionIdDidCommit(txn->xid))".
>
> I'm not sure it increases the readability. I think it pretty makes
> sense to me that we return false in the 'if
> (TransactionIdDidCommit(txn->xid))' block. If we add the 'else' block,
> the reader might be confused as we have the  'else' block in spite of
> having the return in the 'if' block. We can add a local variable for
> the result and return it at the end of the function but I'm not sure
> it's a good idea to increase the readability.
>

2.
I think adding a local variable is overkill but OTOH introducing
“else” clarifies that the following code can only be reached when the
transaction is aborted. E.g. You don’t even need to read the previous
code block and see the “return false” to know that. Anyway, it’s
probably just a personal preference.

> > 3c.
> > The "and toast reconstruction data" seems to be missing a word/s. (??)
> > - "... and also discard TOAST reconstruction data"
> > - "... and reset TOAST reconstruction data"
>
> I don't understand this comment. What words are you suggesting adding
> to these sentences?
>

3.
I meant something like:

BEFORE
We discard the changes we've collected so far and toast reconstruction data.

SUGGESTION
We discard both the changes collected so far and the TOAST reconstruction data.

==
src/include/replication/reorderbuffer.h

4.
-/* Has this transaction been prepared? */
+/*
+ * Is this transaction a prepared transaction?
+ *
+ * Being true means that this transaction should be prepared instead of
+ * committed. To check whether a prepare or a stream_prepare has already
+ * been sent for this transaction, we need to use rbtxn_sent_prepare().
+ */

/Is this transaction a prepared transaction?/Is this a prepared transaction?/

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Make pg_stat_io view count IOs as bytes instead of blocks

2025-01-13 Thread Michael Paquier
On Fri, Jan 10, 2025 at 08:23:46AM +, Bertrand Drouvot wrote:
> On Fri, Jan 10, 2025 at 10:15:52AM +0300, Nazir Bilal Yavuz wrote:
>> But I agree that having a macro has more benefits,
>> also there already is a check for the 'io_op < IOOP_NUM_TYPES' in the
>> pgstat_count_io_op() function.
> 
> Yeah, I think we can remove the "io_op < IOOP_NUM_TYPE" assertion in
> pgstat_count_io_op() (but keep this check as part of the macro).
>
> Appart from the above, LGTM.

Okay, so applied.

And I've somewhat managed to fat-finger the business with
pgstat_count_io_op() with an incorrect rebase.  Will remove in a
minute..
--
Michael


signature.asc
Description: PGP signature


Docs for pg_basebackup needs v17 note for incremental backup

2025-01-13 Thread David G. Johnston
Hackers,

Should the following paragraph in the docs be modified to point out the
minimum server version of v17 for incremental backups?

pg_basebackup works with servers of the same or an older major version,
down to 9.1. However, WAL streaming mode (-X stream) only works with server
version 9.3 and later, and tar format (--format=tar) only works with server
version 9.5 and later.

https://www.postgresql.org/docs/current/app-pgbasebackup.html

David J.


Re: WAL-logging facility for pgstats kinds

2025-01-13 Thread Michael Paquier
On Fri, Jan 10, 2025 at 01:46:53PM +0900, Michael Paquier wrote:
> I'd rather use RecoveryInProgress() here even if XLogInsertAllowed()
> is a synonym of that, minus the update of LocalXLogInsertAllowed for
> the local process.

I've applied v2-0002 for the new header as it is useful on its own.
Rebased to avoid the wrath of the CF bot, as v3.
--
Michael
From 7d4aa6fc0e348065b46c31cbb21d7a4d2f21e5b2 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 10 Jan 2025 13:26:46 +0900
Subject: [PATCH v3 3/4] Add RMGR and WAL-logging API for pgstats

This commit adds a new facility in the backend that offers pgstats
kinds the possibility to WAL-log data that needs to be persisted across
restarts:
- A new callback called redo_cb can be defined by a pgstats_kind.
- pgstat_xlog.c includes one API able to register some payload data and
its size.  Stats kinds doing WAL-logging require a definition of
redo_cb.
- pg_waldump is able to show the logged data, as hexadecimal data.

This facility has applications for in-core and custom stats kinds, with
one primary case being the possibility to WAL-log some statistics
related to relations so that these are not lost post-crash.  This is
left as future work.
---
 src/include/access/rmgrlist.h|  1 +
 src/include/utils/pgstat_internal.h  |  8 +++
 src/include/utils/pgstat_xlog.h  | 41 ++
 src/backend/access/rmgrdesc/Makefile |  1 +
 src/backend/access/rmgrdesc/meson.build  |  1 +
 src/backend/access/rmgrdesc/pgstatdesc.c | 49 
 src/backend/access/transam/rmgr.c|  1 +
 src/backend/utils/activity/Makefile  |  1 +
 src/backend/utils/activity/meson.build   |  1 +
 src/backend/utils/activity/pgstat_xlog.c | 72 
 src/bin/pg_waldump/.gitignore|  1 +
 src/bin/pg_waldump/rmgrdesc.c|  1 +
 src/bin/pg_waldump/t/001_basic.pl|  3 +-
 src/tools/pgindent/typedefs.list |  1 +
 14 files changed, 181 insertions(+), 1 deletion(-)
 create mode 100644 src/include/utils/pgstat_xlog.h
 create mode 100644 src/backend/access/rmgrdesc/pgstatdesc.c
 create mode 100644 src/backend/utils/activity/pgstat_xlog.c

diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 8e7fc9db87..409397cb21 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+PG_RMGR(RM_PGSTAT_ID, "PgStat", pgstat_redo, pgstat_desc, pgstat_identify, NULL, NULL, NULL, NULL)
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 4bb8e5c53a..7a87d4e2f6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -258,6 +258,14 @@ typedef struct PgStat_KindInfo
 	 */
 	void		(*init_backend_cb) (void);
 
+	/*
+	 * Perform custom actions when replaying WAL related to a statistics kind.
+	 *
+	 * "data" is a pointer to the stats information that can be used by the
+	 * stats kind at redo.
+	 */
+	void		(*redo_cb) (void *data, size_t data_size);
+
 	/*
 	 * For variable-numbered stats: flush pending stats. Required if pending
 	 * data is used.  See flush_fixed_cb for fixed-numbered stats.
diff --git a/src/include/utils/pgstat_xlog.h b/src/include/utils/pgstat_xlog.h
new file mode 100644
index 00..8e229b4f12
--- /dev/null
+++ b/src/include/utils/pgstat_xlog.h
@@ -0,0 +1,41 @@
+/* --
+ * pgstat_xlog.h
+ *	  Exports from utils/activity/pgstat_xlog.c
+ *
+ * Copyright (c) 2001-2025, PostgreSQL Global Development Group
+ *
+ * src/include/utils/pgstat_xlog.h
+ * --
+ */
+#ifndef PGSTAT_XLOG_H
+#define PGSTAT_XLOG_H
+
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+#include "utils/pgstat_kind.h"
+
+/*
+ * Generic WAL record for pgstat data.
+ */
+typedef struct xl_pgstat_data
+{
+	PgStat_Kind stats_kind;
+	size_t		data_size;		/* size of the data */
+	/* stats data, worth data_size */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} xl_pgstat_data;
+
+#define SizeOfPgStatData	(offsetof(xl_pgstat_data, data))
+
+extern XLogRecPtr pgstat_xlog_data(PgStat_Kind stats_kind,
+   const void *data,
+   size_t data_size);
+
+/* RMGR API */
+#define XLOG_PGSTAT_DATA	0x00
+extern void pgstat_redo(XLogReaderState *record);
+extern void pgstat_desc(StringInfo buf, XLogReaderState *record);
+extern const char *pgstat_identify(uint8 info);
+
+#endif			/* PGSTAT_XLOG_H */
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index cd95eec
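
A rough usage sketch based on the signatures above (the stats-kind ID,
the payload struct, and all function names other than pgstat_xlog_data()
and the redo_cb callback are invented for illustration, and the real
wiring would go through the kind's PgStat_KindInfo):

#include "postgres.h"

#include "utils/pgstat_internal.h"
#include "utils/pgstat_xlog.h"

#define MY_STATS_KIND	((PgStat_Kind) 130)	/* placeholder kind ID */

typedef struct MyStatsPayload
{
	int64		important_counter;
} MyStatsPayload;

/* redo_cb: called at WAL replay with the payload that was logged */
static void
my_stats_redo(void *data, size_t data_size)
{
	MyStatsPayload *payload = (MyStatsPayload *) data;

	Assert(data_size == sizeof(MyStatsPayload));

	/* restore payload->important_counter into the shared stats entry here */
	(void) payload;
}

/* called wherever the stats kind decides its state must survive a crash */
static void
my_stats_persist(int64 counter_value)
{
	MyStatsPayload payload = {.important_counter = counter_value};

	(void) pgstat_xlog_data(MY_STATS_KIND, &payload, sizeof(payload));
}

/* ... and the kind's PgStat_KindInfo would set .redo_cb = my_stats_redo */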

Re: Issue with markers in isolation tester? Or not?

2025-01-13 Thread Noah Misch
On Mon, Jan 13, 2025 at 02:02:00AM +0100, Michail Nikolaev wrote:
> While stabilizing tests for [0] I have noticed unclear (and confusing in my
> opinion) behavior of markers in the isolation tester.
> 
> I have attached a test with reproducer.
> 
> There are two permutations in the test:
> 
> permutation
>   after(before)
>   before
>   detach1
>   wakeup1
>   detach2
>   wakeup2
> 
> In that case, I see expected results:
>   step before: <... completed>
>   step after: <... completed>
> 
> But changing the order of steps slightly:
> 
> permutation
>after(before)
>wakeup1  <--- wakeup moved here
>before
>detach1
>detach2
>wakeup2
> 
> makes "after" to be completed before "before". Does that make sense?

Yes.  I don't see a good reason for isolationtester to disregard that
"(before)" marker, so I think you've found a bug in isolationtester.  How
might we fix it?




Re: Psql meta-command conninfo+

2025-01-13 Thread Alvaro Herrera
On 2025-Jan-10, Hunaid Sohail wrote:

> IMO, we should continue with v35 and add all parameters from
> PQparameterStatus,
> as Sami suggested. The names themselves are informative enough.

IMO we need more opinions.  Maybe enough other people are not as unhappy
about the three-table solution.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Changing shared_buffers without restart

2025-01-13 Thread Ashutosh Bapat
Hi Dmitry,

On Tue, Dec 17, 2024 at 7:40 PM Ashutosh Bapat
 wrote:
>
> I could verify the memory mappings, their sizes etc. by looking at
> /proc/PID/maps and /proc/PID/status but I did not find a way to verify
> the amount of memory actually allocated and verify that it's actually
> shrinking and expanding. Please let me know how to verify that.

As mentioned somewhere upthread, mmap or mremap by themselves do not
allocate any memory. Writing to the mapped region is what causes memory
to be allocated and show up in VmRSS and RssShmem. And the allocation
does get resized if mremap() shrinks the mapped region.
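
Here is a small standalone Linux-only program (my own sketch, not part
of the patch set) that shows the same effect outside PostgreSQL;
watching VmRSS/RssShmem in /proc/<pid>/status at each pause makes the
behaviour visible:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	size_t		size = 256UL * 1024 * 1024;	/* reserve 256 MB */
	char	   *p;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	printf("mapped 256 MB, pid %d; RSS is still small, press Enter\n",
		   (int) getpid());
	getchar();

	memset(p, 1, size / 2);		/* touching pages is what allocates memory */
	printf("wrote 128 MB; RssShmem has grown, press Enter\n");
	getchar();

	if (mremap(p, size, size / 4, 0) == MAP_FAILED)	/* shrink in place */
		return 1;
	printf("shrunk the mapping to 64 MB; RSS drops again, press Enter\n");
	getchar();
	return 0;
}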

Attached are patches rebased on top of commit
2a7b2d97171dd39dca7cefb91008a3c84ec003ba. I have also fixed
compilation errors. Otherwise I haven't changed anything in the
patches. The last patch adds some TODOs and questions which I think we
need to address while completing this work, just added as a reminder
for later. The TODO in postgres.c is related to your observation

> Another rough edge is that a
> backend, executing pg_reload_conf interactively, will not resize
> mappings immediately, for some reason it will require another command.
I don't have a solution right now, but at least the comment documents
the reason and points to its origin.

I am next looking at the problem of synchronizing the change across
the backends.

-- 
Best Wishes,
Ashutosh Bapat
From b5d295fc0a58b47228add95b4b8ad00fc0a66c07 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH 2/7] Allow placing shared memory mapping with an offset

Currently the kernel is responsible for choosing the address at which to
place each shared memory mapping, which is the lowest possible address
that does not clash with any other mapping. This is considered to be the
most portable approach, but one of the downsides is that there is no room
left to resize allocated mappings. Here is how it looks for one mapping in
/proc/$PID/maps; /dev/zero represents the anonymous shared memory we are
talking about:
0040-0049 /path/bin/postgres
...
012d9000-0133e000 [heap]
7f443a80-7f470a80 /dev/zero (deleted)
7f470a80-7f471831d000 /usr/lib/locale/locale-archive
7f471840-7f4718401000 /usr/lib64/libicudata.so.74.2
...
7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is first to get
the address chosen by the kernel, then apply some offset derived from
the expected upper limit. Because we base the layout on the address
chosen by the kernel, things like address space randomization should not
be a problem, since the randomization is applied to the mmap base, which
is one per process. The result looks like this:

012d9000-0133e000 [heap]
7f443a80-7f444196c000 /dev/zero (deleted)
[...free space...]
7f470a80-7f471831d000 /usr/lib/locale/locale-archive
7f471840-7f4718401000 /usr/lib64/libicudata.so.74.2

This approach does not impact the actual memory usage as reported by the
kernel. Here is the output of /proc/$PID/status for the master version
with shared_buffers = 128 MB:

// Peak virtual memory size, which is described as total pages mapped in mm_struct
VmPeak:   422780 kB
// Size of memory portions. It contains RssAnon + RssFile + RssShmem
VmRSS: 21248 kB
// Size of resident anonymous memory
RssAnon: 640 kB
// Size of resident file mappings
RssFile:9728 kB
// Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
// shared anonymous mappings)
RssShmem:  10880 kB

Here is the same for the patch with the shared mapping placed at
an offset 10 GB:

VmPeak:  1102844 kB
VmRSS: 21376 kB
RssAnon: 640 kB
RssFile:9856 kB
RssShmem:  10880 kB

Cgroup v2 doesn't have any problems with that either. To verify, a new
cgroup was created with a memory limit of 256 MB, then PostgreSQL was
launched within this cgroup with shared_buffers = 128 MB:

$ cd /sys/fs/cgroup
$ mkdir postgres
$ cd postgres
$ echo 268435456 > memory.max

$ echo $MASTER_PID_SHELL > cgroup.procs
# postgres from the master branch has been successfully launched
#  from that shell
$ cat memory.current
17465344 (~16 MB)
# stop postgres

$ echo $PATCH_PID_SHELL > cgroup.procs
# postgres from the patch has been successfully launched from that shell
$ cat memory.current
18219008 (~17 MB)

Note that currently the implementation makes assumptions about the upper limit.
Ideally it should be based on the maximum available memory.
---
 src/backend/port/sysv_shmem.c | 122 +-
 1 file changed, 121 insertions(+), 

Re: Vacuum statistics

2025-01-13 Thread Andrei Zubkov
Hi Sami,

Thank you for your attention to our patch and for your own work.

On Sun, 2025-01-05 at 20:00 -0600, Sami Imseih wrote:
> 
> You make valid points. I now think because track_vacuum_statistics is
> optional, we should track total_time in 2 places. First place in the
> new
> view being proposed here and the second place is in
> pg_stat_all_tables
> as being proposed here [3]. This way if track_vacuum_statistics is
> off, the
> total_time of vacuum could still be tracked by pg_stat_all_tables.

I think the total_time field in pg_stat_all_tables is redundant, at
least if it is the only field we want to add there. Yes, we have
vacuum counts in pg_stat_all_tables, but those don't really describe
the vacuum workload. When we see unusual numbers there, we can answer
the question "why": we know the conditions that cause autovacuum to
launch a vacuum on every particular table, we have tuple statistics
for the table, and we can detect anomalies from that. For example, a
case where the vacuum process should have been launched 5 times but
was launched only twice.

The total_time field is a workload metric. Yes, we can calculate the
mean duration of a vacuum operation on every particular table, but
there is nothing we can do with it. We don't know what this time should
be for this table now; we can only compare the metric to its values in
the past. But once we see this time rising we will immediately face the
question "why?", and we will have nothing to say about it. Where was
the time spent: vacuuming the heap, vacuuming indexes, sleeping at the
delay point, or performing I/O operations? Did the actual workload
performed by vacuum increase along with total_time, or are we now
spending more time on the same workload? I think if we are adding
workload statistics to the Cumulative Statistics System we should make
them as complete as possible.

-- 
Regards, Andrei Zubkov
Postgres Professional




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2025-01-13 Thread Amit Kapila
On Mon, Jan 13, 2025 at 10:22 AM Robert Treat  wrote:
>
> On Sun, Jan 12, 2025 at 11:00 PM Amit Kapila  wrote:
> >
> > On Sat, Jan 11, 2025 at 7:28 PM Robert Treat  wrote:
> > >
> > > Definitely couldn't hurt; Updated patch cleans that up a bit and
> > > tweaks the link to alter table replica status.
> > >
> >
> > IIUC, we have changed following to clarify the REPLICA IDENTITY usage:
> > If a table without a replica identity is
> > -   added to a publication that replicates UPDATE
> > -   or DELETE operations then
> > -   subsequent UPDATE or DELETE
> > -   operations will cause an error on the publisher.  
> > INSERT
> > -   operations can proceed regardless of any replica identity.
> >
> > +If a table with replica identity set to NOTHING
> > +(or set DEFAULT but with no primary key, or set
> > +USING INDEX but the index has been dropped) is
> > +added to a publication that replicates UPDATE
> > +or DELETE operations,
> > +subsequent UPDATE or DELETE
> > +operations will cause an error on the publisher.
> >
> > In the above change, we missed the existing "a table without a replica
> > identity" part. A slightly different way to write the above change
> > could be: "Tables lacking a replica identity or with an insufficiently
> > defined replica identity (e.g., set to NOTHING, set to DEFAULT but
> > with no primary key, or set USING INDEX but the index has been
> > dropped) cannot be updated or deleted when added to a publication that
> > replicates these operations. Attempting to do so will result in an
> > error on the publisher."
> >
>
> We didn't miss it, we removed it. It is a misnomer to say a table
> doesn't have a replica identity, because all tables do and always must
> have one, hence pg_class.relreplident is NOT NULL. In most cases it is
> set DEFAULT and people don't think about it, but it isn't due to a
> lack of or insufficient replica identity, and I think that language is
> part of what confuses people.
>

Okay, I got it.

> Aside from that, your above language is a little more compact with the
> trade-off of being less explicit in talking about publication
> properties; I didn't change that part because it didn't seem like an
> issue, but we could update that second part if you feel strongly about
> it. LMK.
>

One of the reasons I tried to rephrase the sentence was that it appears
to be long. I agree that the way you proposed is more explicit, but the
way I phrased it also conveys the information in a more succinct form.
I think you can propose wording along those lines and then let us see
what Peter or others think about it.
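
For reference, the behaviour that paragraph documents looks roughly
like the following (the error text here is paraphrased from memory, so
treat it as approximate):

CREATE TABLE t_no_ri (a int);
ALTER TABLE t_no_ri REPLICA IDENTITY NOTHING;
CREATE PUBLICATION pub_upd FOR TABLE t_no_ri
    WITH (publish = 'update, delete');
INSERT INTO t_no_ri VALUES (1);      -- inserts are fine
UPDATE t_no_ri SET a = 2;
-- ERROR:  cannot update table "t_no_ri" because it does not have a
-- replica identity and publishes updates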

-- 
With Regards,
Amit Kapila.




Re: Documentation update of wal_retrieve_retry_interval to mention table sync worker

2025-01-13 Thread Shlok Kyal
On Mon, 13 Jan 2025 at 12:33, vignesh C  wrote:
>
> On Mon, 6 Jan 2025 at 08:47, Peter Smith  wrote:
> >
> > Hi Vignesh,
> >
> > Some review comments for your v2 patch.
> >
> > ==
> > doc/src/sgml/logical-replication.sgml
> >
> > AFAICT the only difference you made is changing:
> > FROM "a special kind of apply process"
> > TO "a special kind of table synchronization worker process".
> >
> > There is only ONE kind of tablesync process, so I think saying "a
> > special kind of table synchronization worker process" seems
> > misleading. I also thought maybe it is better to mention that this is
> > PER table.
> >
> > SUGGESTION:
> > ... a special table synchronization worker process per table.
>
> Thanks, the updated v3 version patch has the changes for the same.
>
 Hi Vignesh,

I reviewed the v3 patch. And it looks good to me.

Thanks and Regards,
Shlok Kyal




Re: Adjusting hash join memory limit to handle batch explosion

2025-01-13 Thread Tomas Vondra
On 1/13/25 17:32, Melanie Plageman wrote:
> On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra  wrote:
>>
>> I had a quiet evening yesterday, so I decided to take a stab at this and
>> see how hard would it be, and how bad would the impact be. Attached is
>> an experimental patch, doing the *bare* minimum for a simple query:
>>
>> 1) It defines a limit of 128 batches (a bit low, but also 1MB). In
>> practice we'd use something like 256 - 1024, probably. Doesn't matter.
>>
>> 2) Ensures the initial pass over data in MultiExecPrivateHash does not
>> use more than 128 batches, switches to "tooManyBatches=true" if that
>> happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
>> it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
>> increase the nbatch further.
>>
>> 3) Does something similar for the outer relation - if there are too many
>> batches, we do ExecHashJoinRepartitionBatches() which first partitions
>> into 128 batches. This only does a single pass in the WIP, though.
>> Should be recursive or something.
>>
>> 4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
>> that the code can free the buffers. It also means the buffer needs to be
>> allocated separately, not embedded in BufFile struct. (I'm a bit
>> surprised it works without having to re-read the buffer after freeing
>> it, but that's probably thanks to how hashjoin uses the files).
> 
> I started looking at this. Even though you do explain what it does
> above, I still found it a bit hard to follow. Could you walk through
> an example (like the one you gave in SQL) but explaining what happens
> in the implementation? Basically what you have in 2 and 3 above but
> with a specific example.
> 

OK, I'll try ... see the end of this message.

> This is my understanding of what this does:
> if we are at the max number of batches when building the hashtable and
> we run out of space and need to double nbatches, we
> 1. dump the data from the current batch that is in the hashtable into a file
> 2. close and flush are the currently open buffiles, double the number
> of batches, and then only open files for the batches we need to store
> tuples from the batch we were trying to put in the hashtable when we
> hit the limit (now in a temp file)
> 

Roughly, but the second step needs to happen only after we finish the
first pass over the inner relation. I'll try to explain this as part of
the example.

> I also don't understand why ExecHashJoinRepartitionBatches() is needed
> -- I think it has something to do with needing a certain number of
> buffers open while processing batch 0, but what does this have to do
> with the outer side of the join?
> 

No, this is about building batches on the outer side. We've built the
hash table, and we may have ended up with a very high nbatch. We can't
build all of them right away (that would need too many BufFiles), so we
do it in multiple phases, to not cross the limit.

> Another random question: why doesn't ExecHashHandleTooManyBatches()
> free the outer batch files?
> 

Because it was tailored for the example when all batch splits happen for
batch 0, before we even start processing the outer side. In practice it
probably should free the files.

Let's do the example - as I mentioned, I only tried doing this for the
case where all the batch increases happen for batch 0, before we start
building the outer batches. I'm 99% sure the patch will need to modify a
couple more places to handle batch increases in later stages.

Assume we don't want to use more than 128 batches, but that we're
running a query that needs 256 batches. The patch will do this:

1) ExecHashTableCreate will set nbatch_maximum=128 as the limit for the
current pass over the inner relation, and it'll cap the other nbatch fields
accordingly. If we already know we'll need more batches, we set
tooManyBatches=true to remember this.

But let's say we start with nbatch=64, nbatch_maximum=128 (and thus also
with tooManyBatches=false).

2) We start loading data into the hash table, until we exceed the memory
limit for the first time. We double the number of batches to 128, move
some of the data from the hash table into the new batches, and continue.

3) We hit the memory limit again, but this time we've hit

(nbatch == nbatch_maximum)

so we can't double the number of batches. But we also can't continue
adding data to the in-memory hash table, so we set tooManyBatches=true
and we start spilling even the current batch to a file.

4) We finish the first pass over the inner relation with

nbatch = 128
nbatch_maximum = 128
tooManyBatches = true

so we need to do something. We run ExecHashHandleTooManyBatches(), which
starts increasing nbatch until the current batch fits into work_mem. We
have nbatch=128 and the query needs nbatch=256, so we only do one loop.

Note: Right now it simply doubles the number of batches in each loop.
But it could be faster and grow by up to a factor of 128 in one step.

128 -> 16k -> 1M

The later batches will already d
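
In code form, the step-4 loop boils down to something like this toy
model (the struct and helper names are invented for illustration; this
is a paraphrase of the WIP behaviour, not the actual patch code):

#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for HashJoinTable; field names only mimic the real thing. */
typedef struct ToyHashTable
{
	int			nbatch;			/* current number of batches */
	int			curbatch;		/* the batch that overflowed (0 above) */
	size_t		space_allowed;	/* work_mem-style budget */
	bool		too_many_batches;
} ToyHashTable;

/* hypothetical helpers: size of a spilled batch, and re-splitting it */
extern size_t toy_spilled_batch_size(ToyHashTable *ht, int batchno);
extern void toy_repartition_batch(ToyHashTable *ht, int batchno);

static void
toy_handle_too_many_batches(ToyHashTable *ht)
{
	/* keep splitting until the spilled batch fits under the memory limit */
	while (toy_spilled_batch_size(ht, ht->curbatch) > ht->space_allowed)
	{
		ht->nbatch *= 2;		/* the WIP doubles; could grow faster */
		toy_repartition_batch(ht, ht->curbatch);
	}
	ht->too_many_batches = false;	/* the batch now fits in work_mem */
}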

Re: Exists pull-up application with JoinExpr

2025-01-13 Thread Alena Rybakina

Hi! I have solved it.

On 30.12.2024 11:24, Alena Rybakina wrote:


Hi! Thank you for your interest in this subject!

On 27.12.2024 15:53, Ilia Evdokimov wrote:

Hi Alena,

Thank you for your work on subqueries with JOIN.

Have you considered the scenario where the subquery includes a qual 
like (tc.aid = 1)? When I tried executing those queries I received 
different results. In my opinion, to prevent this, we should add 
filters for such quals within the loop 'foreach (lc, all_clauses)'


EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta
WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
  QUERY PLAN
--
 Hash Join (actual rows=1 loops=1)
   Hash Cond: (ta.id = tb.id)
   Buffers: local hit=3
   ->  Seq Scan on ta (actual rows=3 loops=1)
 Buffers: local hit=1
   ->  Hash (actual rows=3 loops=1)
 Buckets: 4096  Batches: 1  Memory Usage: 33kB
 Buffers: local hit=2
 ->  HashAggregate (actual rows=3 loops=1)
   Group Key: tb.id
   Batches: 1  Memory Usage: 121kB
   Buffers: local hit=2
   ->  Nested Loop (actual rows=3 loops=1)
 Buffers: local hit=2
 ->  Seq Scan on tb (actual rows=3 loops=1)
   Buffers: local hit=1
 ->  Materialize (actual rows=1 loops=3)
   Storage: Memory  Maximum Storage: 17kB
   Buffers: local hit=1
   ->  Seq Scan on tc (actual rows=1 loops=1)
 Filter: (aid = 1)
 Rows Removed by Filter: 1
 Buffers: local hit=1
(23 rows)



EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id =
tb.id WHERE tc.aid = 1);
    QUERY PLAN
---------------------------------------------------------------
 Seq Scan on ta (actual rows=1 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 2
   Buffers: local hit=6
   SubPlan 1
 ->  Nested Loop (actual rows=0 loops=3)
   Buffers: local hit=5
   ->  Index Only Scan using tb_pkey on tb (actual rows=0 
loops=3)

 Index Cond: (id = ta.id)
 Heap Fetches: 1
 Buffers: local hit=4
   ->  Seq Scan on tc (actual rows=1 loops=1)
 Filter: (aid = 1)
 Buffers: local hit=1
(14 rows)


You are right; at the moment this case is not handled if there is a 
constant qual in the subquery (like t1.x1 = 1 in the example below), and 
this problem is not limited to the current patch.


For example, you can get such a query plan if you run this query on 
master:


create table t (x int);
create table t1 (x1 int);
create table t2 (x2 int);
  EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
  SELECT 1
FROM t
   WHERE EXISTS (SELECT 1
   FROM t1
   where t1.x1 = 1);
QUERY PLAN

  Result (actual rows=0 loops=1)
One-Time Filter: (InitPlan 1).col1
InitPlan 1
  ->  Seq Scan on t1 (actual rows=0 loops=1)
Filter: (x1 = 1)
->  Seq Scan on t (never executed)
(6 rows)

It's all because of the check in this function: this qual has 
levelsup = 0, not 1 (see !contain_vars_of_level(whereClause, 1)), but I 
have already found out that changing this requires somewhat more 
complicated correction logic there. At the moment, I'm working on 
adding this processing to the patch.


Thanks for the case!

The logic is the same, but extended to constants. I added a few more 
tests that not only cover this case, but also NOT EXISTS, which will be 
converted to ANTI JOIN.


--
Regards,
Alena Rybakina
Postgres Professional
From b911333078fad71d4509adab1b0473828409b000 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Tue, 12 Nov 2024 12:16:42 +0300
Subject: [PATCH] Teach the planner to convert EXISTS and NOT EXISTS subqueries
 into semi and anti joins, respectively, if the subquery's join expressions are
 independent and its vars have a level no higher than the parent's. In addition
 the transformation is allowed if the expressions are constant. To do this, we
 put all potential expressions from the qual list and join list into a common
 list and check each expression one by one to see if they are suitable for the
 transformation. In particular, we need to increment the level of the
 expressions' vars to the parent query level. We consider expressions only for
 the INNER JOIN type of join in the subquery, otherwise the transformation is
 not available.
 
 Authors: Alena Rybakina 
 Reviewed-by: Ranier Vilela , Ilia Evdokimov 

---
 src/backend/optimizer/plan/subselect.c  | 1

Re: [PATCH] Add get_bytes() and set_bytes() functions

2025-01-13 Thread Dean Rasheed
On Mon, 13 Jan 2025 at 17:36, Aleksander Alekseev
 wrote:
>
> Besides fixing opr_sanity test I corrected error messages:
>
> -ERROR:  bytea size 3 out of valid range, 0..2
> +ERROR:  bytea out of valid range, ''..'\x'
>

"smallint out of range", "integer out of range" and "bigint out of
range" would be more consistent with existing error messages.

Regards,
Dean




Re: Incorrect CHUNKHDRSZ in nodeAgg.c

2025-01-13 Thread Jeff Davis
On Fri, 2025-01-10 at 23:30 +1300, David Rowley wrote:
> Since bump.c does not add headers to the palloc'd chunks, I think the
> following code from hash_agg_entry_size() shouldn't be using
> CHUNKHDRSZ anymore.

Fixed.

I also tried to account for the power-of-two allocations for the
transition values. We don't do that in other places, but now that we
have the bump allocator which does not do that, it seems reasonable to
account for it here.

> I did some benchmarking using the attached script. There's a general
> speedup, but I saw some unexpected increase in the number of batches
> with the patched version on certain tests. See the attached results.
> For example, the work_mem = 8MB with 10 million rows shows "Batches:
> 129" on master but "Batches: 641" with the patched version. I didn't
> check why.

Somewhat counter-intuitively, HashAgg can use more batches when there
is more memory available. If already spilling, creating more small
batches is good, because it reduces the chances of recursing. The
limiting factor for creating a lot of tiny batches is that each
partition requires a logtape with its own write buffer, so if there's
more memory available, that allows creating more logtapes and a higher
partitioning fanout.

The runtimes I got running your tests are mixed. I'm still analyzing
whether test noise is a factor, or whether the increased number of
partitions is a factor. But any runtime regressions are minor in
comparison to the memory savings, so I think we are on the right track.

Attached v2.

Regards,
Jeff Davis

From 35e86405da5dd920450c3d939be0f4c16a7dda24 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Wed, 8 Jan 2025 11:46:35 -0800
Subject: [PATCH v2] HashAgg: use Bump allocator for hash TupleHashTable
 entries.

The entries aren't freed until the entire hash table is destroyed, so
use the Bump allocator to improve allocation speed, avoid wasting
space on the chunk header, and avoid wasting space due to the
power-of-two allocations.

Discussion: https://postgr.es/m/caaphdvqv1anb4cm36fzrwivxrevbo_lsg_eq3nqdxtjecaa...@mail.gmail.com
Reviewed-by: David Rowley
---
 src/backend/executor/execGrouping.c |  23 +++---
 src/backend/executor/nodeAgg.c  | 110 +++-
 src/include/nodes/execnodes.h   |   5 +-
 3 files changed, 109 insertions(+), 29 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 07750253963..9be4963d06f 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -512,27 +512,32 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
 		}
 		else
 		{
+			MinimalTuple mtup;
 			MinimalTuple firstTuple;
 			size_t totalsize; /* including alignment and additionalsize */
 
 			/* created new entry */
 			*isnew = true;
-			/* zero caller data */
-			MemoryContextSwitchTo(hashtable->tablecxt);
 
-			/* Copy the first tuple into the table context */
-			firstTuple = ExecCopySlotMinimalTuple(slot);
+			/*
+			 * Extract the minimal tuple into the temp context, then copy it
+			 * into the table context.
+			 */
+			MemoryContextSwitchTo(hashtable->tempcxt);
+			mtup = ExecCopySlotMinimalTuple(slot);
 
 			/*
-			 * Allocate additional space right after the MinimalTuple of size
-			 * additionalsize. The caller can get a pointer to this data with
-			 * TupleHashEntryGetAdditional(), and store arbitrary data there.
+			 * Allocate space for the MinimalTuple followed by empty space of
+			 * size additionalsize. The caller can get a maxaligned pointer to
+			 * this data with TupleHashEntryGetAdditional(), and store
+			 * arbitrary data there.
 			 *
 			 * This avoids the need to store an extra pointer or allocate an
 			 * additional chunk, which would waste memory.
 			 */
-			totalsize = MAXALIGN(firstTuple->t_len) + hashtable->additionalsize;
-			firstTuple = repalloc(firstTuple, totalsize);
+			totalsize = MAXALIGN(mtup->t_len) + hashtable->additionalsize;
+			firstTuple = MemoryContextAlloc(hashtable->tablecxt, totalsize);
+			memcpy(firstTuple, mtup, mtup->t_len);
 			memset((char *) firstTuple + firstTuple->t_len, 0,
    totalsize - firstTuple->t_len);
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 19640752967..43a90ed6f7a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -405,6 +405,7 @@ static void build_hash_tables(AggState *aggstate);
 static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 		  bool nullcheck);
+static void hash_create_memory(AggState *aggstate);
 static long hash_choose_num_buckets(double hashentrysize,
 	long ngroups, Size memory);
 static int	hash_choose_num_partitions(double input_groups,
@@ -1503,7 +1504,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 {
 	AggStatePerHash perhash = &aggstate->per

Re: explain analyze rows=%.0f

2025-01-13 Thread Ilia Evdokimov


On 11.01.2025 14:10, Ilia Evdokimov wrote:



On 11.01.2025 12:15, Guillaume Lelarge wrote:



Thanks for your patch, this looks like a very interesting feature 
that I'd like to see in a future release.


I did a quick run: patch OK, make OK, make install OK, but make 
check fails quite a lot on partition_prune.sql.


I guess it would need some work on partition_prune.sql tests and 
perhaps also on the docs.


Thanks again.


--
Guillaume.



Yes, certainly. I have fixed partition_prune.sql. In the documentation 
example for EXPLAIN ANALYZE where loops is greater than one, I updated 
how 'rows' and 'loops' values are displayed so they appear as decimal 
fractions with two digits after the decimal point.


I attached the fixed patch.

Any suggestions?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.



I guess it's not ideal to modify the existing example in the 
documentation, as the v5 patch did, because readers wouldn't immediately 
understand why decimal fractions appear there. Instead, I'll add a brief 
note in the documentation clarifying how rows and loops are displayed 
when the average row count is below one.


The documentation changes are attached in the v6 patch.

If you have any other suggestions or different opinions, I'd be happy to 
discuss them.
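
As a hypothetical illustration (table and column names invented) of the
kind of plan the note is aimed at:

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM refunds r WHERE r.order_id = o.id);
-- if only a few orders have refunds, the inner scan runs once per outer
-- row but returns a row far less than once per loop, so its average
-- would round down to rows=0 without the change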


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
From 08f92c7e11829045014598e1dcc042f3e5a1e1a3 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov 
Date: Mon, 13 Jan 2025 23:01:44 +0300
Subject: [PATCH] Clarify display of rows and loops as decimal fractions

When a node is executed more than once and the average number of rows per
loop is less than one, both rows and loops are displayed as decimal
fractions with two digits after the decimal point in EXPLAIN ANALYZE.
---
 doc/src/sgml/perform.sgml |  6 ++-
 src/backend/commands/explain.c| 49 +--
 src/test/regress/expected/partition_prune.out | 10 ++--
 src/test/regress/sql/partition_prune.sql  |  2 +-
 4 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index a502a2aaba..3f13d17fe9 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -757,7 +757,11 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
 comparable with the way that the cost estimates are shown.  Multiply by
 the loops value to get the total time actually spent in
 the node.  In the above example, we spent a total of 0.030 milliseconds
-executing the index scans on tenk2.
+executing the index scans on tenk2.   If a subplan node
+is executed multiple times and the average number of rows is less than one,
+the rows and loops values are shown as a decimal fraction
+(with two digits after the decimal point) to indicate that some rows
+were actually processed rather than simply rounding down to zero.

 

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c24e66f82e..200294b756 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1981,14 +1981,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
+			appendStringInfo(es->str, " (actual");
 			if (es->timing)
-appendStringInfo(es->str,
- " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
- startup_ms, total_ms, rows, nloops);
+appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms);
+
+			if (nloops > 1 && planstate->instrument->ntuples < nloops)
+appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops);
 			else
-appendStringInfo(es->str,
- " (actual rows=%.0f loops=%.0f)",
- rows, nloops);
+appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops);
 		}
 		else
 		{
@@ -1999,8 +1999,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
 	 3, es);
 			}
-			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
+			if (nloops > 1 && planstate->instrument->ntuples < nloops)
+			{
+ExplainPropertyFloat("Actual Rows", NULL, rows, 2, es);
+ExplainPropertyFloat("Actual Loops", NULL, nloops, 2, es);
+			}
+			else
+			{
+ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
+			}
 		}
 	}
 	else if (es->analyze)
@@ -2052,14 +2060,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (es->format == EXPLAIN_FORMAT_TEXT)
 			{
 ExplainIndentText(es);
+appendStringInfo(es->str, "actual");
 if (es->timing)
-	appendStringInfo(es->str,
-	 "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-	 startup_ms, total_ms, rows, nloops);
+	appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms);
+
+if (nloops > 1 && planstate->instrument->ntuples < nloops)
+	appendStringInfo(es->str," rows=%.2

Re: New GUC autovacuum_max_threshold ?

2025-01-13 Thread Nathan Bossart
Here is a rebased version of the patch (commit ca9c6a5 adjusted the
documentation for vacuum-related GUCs).
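
For anyone wanting to try it, the patch exposes this both as a GUC and as
a per-table (and per-TOAST-table) storage parameter, along these lines
(big_table is just a placeholder name):

ALTER TABLE big_table SET (autovacuum_max_threshold = 500000);
-- or cluster-wide in postgresql.conf:
-- autovacuum_max_threshold = 100000000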

-- 
nathan
From d78d621233734abc54d0273526983bf64b88e7ba Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 8 Jan 2025 13:11:52 -0600
Subject: [PATCH v5 1/1] Introduce autovacuum_max_threshold.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Nathan Bossart, Frédéric Yhuel
Reviewed-by: Melanie Plageman, Robert Haas, Laurenz Albe, Michael Banck, Joe 
Conway, Sami Imseih, David Rowley, wenhui qiu
Discussion: 
https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com
---
 doc/src/sgml/config.sgml  | 24 +++
 doc/src/sgml/ref/create_table.sgml| 15 
 src/backend/access/common/reloptions.c| 11 +
 src/backend/postmaster/autovacuum.c   | 12 ++
 src/backend/utils/misc/guc_tables.c   |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/bin/psql/tab-complete.in.c|  2 ++
 src/include/postmaster/autovacuum.h   |  1 +
 src/include/utils/rel.h   |  1 +
 9 files changed, 77 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f41a17b1fe..5a0ed9ac36b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8579,6 +8579,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;

   
 
+  
+   autovacuum_max_threshold (integer)
+   
+autovacuum_max_threshold
+configuration parameter
+   
+   
+   
+
+ Specifies the maximum number of updated or deleted tuples needed to
+ trigger a VACUUM in any one table, i.e., a cap on
+ the value calculated with
+ autovacuum_vacuum_threshold and
+ autovacuum_vacuum_scale_factor.  The default is
+ 100,000,000 tuples.  If -1 is specified, autovacuum will not enforce a
+ maximum number of updated or deleted tuples that will trigger a
+ VACUUM operation.  This parameter can only be set
+ in the postgresql.conf file or on the server
+ command line; but the setting can be overridden for individual tables
+ by changing storage parameters.
+
+   
+  
+
   
autovacuum_vacuum_insert_threshold 
(integer)

diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index 2237321cb4f..155c78ad6ba 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1712,6 +1712,21 @@ WITH ( MODULUS numeric_literal, REM
 

 
+   
+autovacuum_max_threshold, 
toast.autovacuum_max_threshold (integer)
+
+ autovacuum_max_threshold
+ storage parameter
+
+
+   
+
+ Per-table value for 
+ parameter.
+
+   
+  
+

 autovacuum_vacuum_scale_factor, 
toast.autovacuum_vacuum_scale_factor (floating 
point)
 
diff --git a/src/backend/access/common/reloptions.c 
b/src/backend/access/common/reloptions.c
index e587abd9990..fbae300a128 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -231,6 +231,15 @@ static relopt_int intRelOpts[] =
},
-1, 0, INT_MAX
},
+   {
+   {
+   "autovacuum_max_threshold",
+   "Maximum number of tuple updates or deletes prior to 
vacuum",
+   RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+   ShareUpdateExclusiveLock
+   },
+   -2, -1, INT_MAX
+   },
{
{
"autovacuum_vacuum_insert_threshold",
@@ -1843,6 +1852,8 @@ default_reloptions(Datum reloptions, bool validate, 
relopt_kind kind)
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
enabled)},
{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_threshold)},
+   {"autovacuum_max_threshold", RELOPT_TYPE_INT,
+   offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_max_threshold)},
{"autovacuum_vacuum_insert_threshold", RELOPT_TYPE_INT,
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_ins_threshold)},
{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
diff --git a/src/backend/postmaster/autovacuum.c 
b/src/backend/postmaster/autovacuum.c
index 0ab921a169b..ea48fba73f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -120,6 +120,7 @@ int autovacuum_max_workers;
 intautovacuum_work_mem = -1;
 intautovacuum_naptime;
 intautovacuum_vac_thresh;
+intautovacuum_max_thres

Re: Reorder shutdown sequence, to flush pgstats later

2025-01-13 Thread Andres Freund
Hi,

On 2025-01-13 12:20:39 +, Bertrand Drouvot wrote:
> On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:
> > There wasn't
> > any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
> > fatal.  I think this needs to be treated the same way we treat not being 
> > able
> > to fork checkpointer during a shutdown.  Now receiving
> > PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
> > FatalError processing after logging "WAL was shut down unexpectedly".
> 
> I wonder if we could first ask the postmaster a confirmation that the SIGINT 
> is
> coming from it? (and if not, then simply ignore the SIGINT). I thought about a
> shared memory flag but the postmaster has to be keept away from shared memory
> operations, so that's not a valid idea. Another idea could be that the 
> checkpointer
> sends a new "confirm" request to the postmater first. But then I'm not sure 
> how
> the postmaster could reply back (given the signals that are already being 
> used)
> and that might overcomplicate things.

That sounds more complicated than it's worth for a hypothetical risk.



> > Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
> > thought it'd be better to shut checkpointer down after even dead-end 
> > children
> > exited, in case we ever wanted to introduce stats for dead-end children - 
> > but
> > that seems rather unlikely?
> 
> Yeah, given the above, it looks more clean to change that ordering.

I'll post a version that does that in a bit.


> > Independent of this patch, we seem to deal with FatalError processing in a
> > somewhat inconsistently.  We have this comment (in master):
> > /*
> >  * PM_WAIT_DEAD_END state ends when all other children are gone 
> > except
> >  * for the logger.  During normal shutdown, all that remains are
> >  * dead-end backends, but in FatalError processing we jump 
> > straight
> >  * here with more processes remaining.  Note that they have 
> > already
> >  * been sent appropriate shutdown signals, either during a 
> > normal
> >  * state transition leading up to PM_WAIT_DEAD_END, or during
> >  * FatalError processing.
> > 
> > However that doesn't actually appear to be true in the most common way to
> > reach FatalError, via HandleChildCrash():
> > 
> > if (Shutdown != ImmediateShutdown)
> > FatalError = true;
> > 
> > /* We now transit into a state of waiting for children to die */
> > if (pmState == PM_RECOVERY ||
> > pmState == PM_HOT_STANDBY ||
> > pmState == PM_RUN ||
> > pmState == PM_STOP_BACKENDS ||
> > pmState == PM_WAIT_XLOG_SHUTDOWN)
> > UpdatePMState(PM_WAIT_BACKENDS);
> > 
> > Here we
> > a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS
> > 
> >From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
> >other than walsender, archiver and dead-end children exited.
> > 
> > b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
> >PM_WAIT_BACKENDS?
> > 
> > This comment seems to have been added in
> > 
> > commit a78af0427015449269fb7d9d8c6057cfcb740149
> > Author: Heikki Linnakangas 
> > Date:   2024-11-14 16:12:28 +0200
> > 
> > Assign a child slot to every postmaster child process
> > 
> > ISTM that section of the comment is largely wrong and has never really been
> > the case if my git sleuthing is correct?
> 
> I agree that does not look correct but I don't think it's coming from 
> a78af0427015449269fb7d9d8c6057cfcb740149.
> ISTM that a78af0427015449269fb7d9d8c6057cfcb740149 re-worded this already 
> wrong
> comment:
> 
> "
> /*
>  * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
>  * (ie, no dead_end children remain), and the archiver is gone too.
>  *
>  * The reason we wait for those two is to protect them against a new
>  * postmaster starting conflicting subprocesses; this isn't an
>  * ironclad protection, but it at least helps in the
>  * shutdown-and-immediately-restart scenario.  Note that they have
>  * already been sent appropriate shutdown signals, either during a
>  * normal state transition leading up to PM_WAIT_DEAD_END, or during
>  * FatalError processing.
>  */
> "

Hm. This version doesn't really seem wrong in the same way / to the same
degree.


> while we could also see:
> 
> "
> if (Shutdown != ImmediateShutdown)
> FatalError = true;
> 
> /* We now transit into a state of waiting for children to die */
> if (pmState == PM_RECOVERY ||
> pmState == PM_HOT_STANDBY ||
> pmState == PM_RUN ||
> pmState == PM_STOP_BACKENDS ||
> pmState == PM_SHUTDOWN)
> pmState = PM_WAIT_BACKENDS;
> "

I don't think this make the version of the comment you included ab

Re: pgsql: Consolidate docs for vacuum-related GUCs in new subsection

2025-01-13 Thread Melanie Plageman
On Mon, Jan 13, 2025 at 3:46 PM Alvaro Herrera  wrote:
>
> On 2025-Jan-13, Melanie Plageman wrote:
>
> > Since I didn't hear back about this and I don't see an obvious
> > alternative reorganization in guc_tables.c, I plan to just push the
> > attached patch that updates only postgresql.conf.sample.
>
> Apologies, I was very unclear -- I didn't want to talk about the
> ordering of entries in the code, but the categorization.  See the
> config_group_names list in guc_tables.c, which defines some groups.
> Each setting belongs into a group, and those groups correspond to what
> the sample config file lists as section/subsection titles and to the
> grouping in the docs.  Also, this categorization affects how the entries
> are listed in the pg_settings view and in "postgres --describe-config",
> which feed from the same tables.

Oh dear, I had no idea that these categories existed. I suppose I
never paid attention to the category column in pg_settings nor used
--describe-config. Attached is a patch to fix this. I checked both
pg_settings and --describe-config output, and it seems to work.

I'm quite sorry about the extra noise this is causing (especially for
people with patch sets requiring rebasing).

> Perhaps with your changes (assuming I read your commit message right),
> we need new groups:
> VACUUMING
> VACUUMING_FREEZING
> VACUUMING_AUTOVACUUM

I've gone with VACUUM_AUTOVACUUM, VACUUM_COST_DELAY, and
VACUUM_FREEZING, but I am open to feedback.
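
For anyone who wants to double-check the result, something like this
query against pg_settings shows the grouping (the exact category display
strings depend on what config_group_names ends up defining):

SELECT category, name
  FROM pg_settings
 WHERE category ILIKE '%vacuum%'
 ORDER BY category, name;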

- Melanie


v1-0001-Synchronize-guc_tables.c-categories-with-vacuum-d.patch
Description: Binary data


Re: Eagerly scan all-visible pages to amortize aggressive vacuum

2025-01-13 Thread Alena Rybakina

Hi!

On 14.01.2025 00:46, Melanie Plageman wrote:

On Thu, Jan 9, 2025 at 1:24 PM Andres Freund  wrote:

On 2025-01-07 15:46:26 -0500, Melanie Plageman wrote:

For table storage options, those related to vacuum but not autovacuum
are in the main StdRdOptions struct. Of those, some are overridden by
VACUUM command parameters which are parsed out into the VacuumParams
struct. Though the members of VacuumParams are initialized in
ExecVacuum(), the storage parameter overrides are determined in
vacuum_rel() and the final value goes in the VacuumParams struct which
is passed all the way through to heap_vacuum_rel().

Because VacuumParams is what ultimately gets passed down to the
table-AM specific vacuum implementation, autovacuum also initializes
its own instance of VacuumParams in the autovac_table struct in
table_recheck_autovac() (even though no VACUUM command parameters can
affect autovacuum). These are overridden in vacuum_rel() as well.

Ultimately vacuum_eager_scan_max_fails is a bit different from the
existing members of VacuumParams and StdRdOptions. It is a GUC and a
table storage option but not a SQL command parameter -- and both the
GUC and the table storage parameter affect both vacuum and autovacuum.
And it doesn't need to be initialized in different ways for autovacuum
and vacuum. In the end, I decided to follow the existing conventions
as closely as I could.

I think that's fine. The abstractions in this area aren't exactly perfect, and
I don't think this makes it worse in any meaningful way. It's not really
different from having other heap-specific params like freeze_min_age in
VacuumParams.

Got it. I've left it as is, then.

Attached v6 is rebased over recent changes in the vacuum-related docs.
I've also updated the "Routine Vacuuming" section of the docs to
mention eager scanning.

I'm planning to commit 0001 (which updates the code comment at the top
of vacuumlazy.c to explain heap vacuuming) --barring any objections.



Thank you for working on this patch; without this explanation it is 
difficult to understand what is happening, to put it mildly.


While working on the vacuum statistics (I'll add a link to the thread 
just in case [0]), I discovered some more important points that I think 
could be mentioned.


The first of them is that vacuum will not clean up some dead tuples 
(which remain referenced from indexes) when it is unable to take a 
cleanup lock on the heap page. You can see this in the increments of 
missed_dead_tuples and vacrel->missed_dead_pages in the 
lazy_scan_noprune function. That is, these are tuples that are 
definitely dead as far as vacuum is concerned but that it simply could 
not clean up.


Secondly, I think it is worth mentioning the case where vacuum urgently
processes the heap relation because of a wraparound threat (the failsafe).
At that point it skips the index processing phase and heap truncation.


Thirdly, the free space map is updated each time index and heap vacuuming
completes (after the lazy_vacuum function) and after the heap pruning stage
(the lazy_scan_prune function). Maybe you should add that.


I think it is also possible to add information about parallel vacuum:
workers are launched for the indexes and perform their cleanup. Some indexes
are classified by vacuum as unsafe for processing by a parallel worker and
can only be processed by the leader. These are indexes that do not support
parallel bulk-deletion or parallel cleanup (see the
parallel_vacuum_index_is_parallel_safe function).


I also noticed an interesting point (I don't know if it is necessary to
write about it, but it was not obvious to me): the buffer and WAL statistics
for indexes processed by workers are accumulated separately, in
pvs->buffer_usage and pvs->wal_usage.


[0] https://commitfest.postgresql.org/51/5012/

--
Regards,
Alena Rybakina
Postgres Professional


Re: pgbench error: (setshell) of script 0; execution of meta-command failed

2025-01-13 Thread Tom Lane
Nathan Bossart  writes:
> On Sat, Jan 11, 2025 at 02:04:13PM -0500, Tom Lane wrote:
>> What I propose doing in the released branches is what's shown in
>> the attached patch for v17: rename port/pqsignal.c's function to
>> pqsignal_fe in frontend, but leave it as pqsignal in the backend.
>> Leaving it as pqsignal in the backend preserves ABI for extension
>> modules, at the cost that it's not entirely clear which pqsignal
>> an extension that's also linked with libpq will get.  But that
>> hazard affects none of our code, and it existed already for any
>> third-party extensions that call pqsignal.  In principle using
>> "pqsignal_fe" in frontend creates an ABI hazard for outside users of
>> libpgport.a.  But I think it's not really a problem, because we don't
>> support that as a shared library.  As long as people build with
>> headers and .a files from the same minor release, they'll be OK.

> I don't have any concrete reasons to think you are wrong about this, but it
> does feel somewhat risky to me.  It might be worth testing it with a couple
> of third-party projects that use the frontend version of pqsignal().
> codesearch.debian.net shows a couple that may be doing so [0] [1] [2].

It's fair to worry about this, but I don't think my testing that would
prove a lot.  AFAICS, whether somebody runs into trouble would depend
on many factors like their specific build process and what versions of
which packages they have installed.

In any case, I think we have a very limited amount of wiggle room.
We definitely cannot change libpq's ABI without provoking howls of
anguish.

I have been wondering whether it would help to add something like
this at the end of port/pqsignal.c in the back branches:

#ifdef FRONTEND
#undef pqsignal

/* ABI-compatibility wrapper */
pqsigfunc
pqsignal(int signo, pqsigfunc func)
{
return pqsignal_fe(signo, func);
}
#endif

(plus or minus an extern or so, but you get the idea).  The point of
this is that compiling against old headers and then linking against
newer libpgport.a would still work.  It does nothing however for the
reverse scenario of compiling against new headers and then linking
against old libpgport.a.  So I haven't persuaded myself that it's
worth the trouble -- but I'm happy to include it if others think
it would help.

regards, tom lane




Re: New GUC autovacuum_max_threshold ?

2025-01-13 Thread Sami Imseih
> Here is a rebased version of the patch (commit ca9c6a5 adjusted the
> documentation for vacuum-related GUCs).

I looked at the patch and have a few comments.

I propose renaming the GUC from "autovacuum_max_threshold" to
"autovacuum_vacuum_max_threshold" to clarify that it applies only
to the vacuum operation performed by autovacuum, not to the analyze operation.
This will also align with naming for other related GUCs, i.e.,
"autovacuum_analyze_threshold" and "autovacuum_vacuum_threshold."

The "vacuum threshold" calculation described in [1] will also need to be
updated.

[1] https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM

Regards,

Sami




Re: Issue with markers in isolation tester? Or not?

2025-01-13 Thread Michail Nikolaev
Hello!

In case someone is looking for a workaround, it is possible to use the
"notices N" marker.

Instead of

step before {
SELECT injection_points_run('injection-points-wait-2');
}

use something like:

step before {
  DO $$
  BEGIN
  PERFORM injection_points_run('injection-points-wait-2');
  RAISE NOTICE 'before is complete';
  END $$
}

and then:

permutation
after(before notices 1)
before
detach1
wakeup1
detach2
wakeup2

permutation
after(before notices 1)
wakeup1
before
detach1
detach2
wakeup2

In such a case, both permutations report "before" as completing before
"after", rather than after it.

Best regards,
Michail.



Re: pgsql: Consolidate docs for vacuum-related GUCs in new subsection

2025-01-13 Thread Alena Rybakina

Hi!

On 14.01.2025 01:35, Melanie Plageman wrote:

On Mon, Jan 13, 2025 at 3:46 PM Alvaro Herrera  wrote:

On 2025-Jan-13, Melanie Plageman wrote:


Since I didn't hear back about this and I don't see an obvious
alternative reorganization in guc_tables.c, I plan to just push the
attached patch that updates only postgresql.conf.sample.

Apologies, I was very unclear -- I didn't want to talk about the
ordering of entries in the code, but the categorization.  See the
config_group_names list in guc_tables.c, which defines some groups.
Each setting belongs into a group, and those groups correspond to what
the sample config file lists as section/subsection titles and to the
grouping in the docs.  Also, this categorization affects how the entries
are listed in the pg_settings view and in "postgres --describe-config",
which feed from the same tables.

Oh dear, I had no idea that these categories existed. I suppose I
never paid attention to the category column in pg_settings nor used
--describe-config. Attached is a patch to fix this. I checked both
pg_settings and --describe-config output, and it seems to work.

I'm quite sorry about the extra noise this is causing (especially for
people with patch sets requiring rebasing).


Perhaps with your changes (assuming I read your commit message right),
we need new groups:
VACUUMING
VACUUMING_FREEZING
VACUUMING_AUTOVACUUM

I've gone with VACUUM_AUTOVACUUM, VACUUM_COST_DELAY, and
VACUUM_FREEZING, but I am open to feedback.


Looks good and convenient, thanks for the patch!

I also noticed the GUC autovacuum_work_mem, which in my opinion belongs more
to the autovacuum category; it is currently in RESOURCES_MEM, but in fact
only autovacuum uses it.
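
For reference, the current categorization is easy to check from SQL; an
illustrative query:

    SELECT name, category
      FROM pg_settings
     WHERE name IN ('autovacuum_work_mem', 'maintenance_work_mem');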


--
Regards,
Alena Rybakina
Postgres Professional


Re: Virtual generated columns

2025-01-13 Thread Dean Rasheed
On Wed, 8 Jan 2025 at 16:14, Peter Eisentraut  wrote:
>
> Here is a new patch version

In expand_generated_columns_in_expr():

+   RangeTblEntry *rte;
+
+   rte = makeNode(RangeTblEntry);
+   rte->relid = RelationGetRelid(rel);
+
+   node = expand_generated_columns_internal(node, rel, rt_index, rte);

This dummy RTE is a bit too minimal.

I think it should explicitly set rte->rtekind to RTE_RELATION, even
though that's technically not necessary since RTE_RELATION is zero.

In addition, it needs to set rte->eref, because expandRTE() (called
from ReplaceVarsFromTargetList()) needs that when expanding whole-row
variables. Here's a simple reproducer which crashes:

CREATE TABLE foo (a int, b int GENERATED ALWAYS AS (a*2) VIRTUAL);
ALTER TABLE foo ADD CONSTRAINT foo_check CHECK (foo IS NOT NULL);

Regards,
Dean




SQLJSON: errmsg(" .. should ...") -> must

2025-01-13 Thread Alvaro Herrera
Hi

There's a few recent SQL/JSON error messages in which we say something
"should" be something else.  We avoid this, so I think we shouldn't use
it here either.  (There's also wparser_def.c which uses "should" in that
way, and I think it's there just because it's a dark unvisited corner
that's been wrong for awhile.)

While at it, I wasn't sure about the following:

errmsg("JSON path expression in JSON_QUERY must return single item without 
wrapper")

It seems a strange phrasing: I read it as meaning that the code is
expecting that the path expression returns a single item and no wrapper
around it.  But what it really means is that if the user specifies that
no wrapping is required, then a single item must be returned (so the
"without wrapper" applies to the JSON_QUERY, not to the single item).
I find this confusing, so I'd change it like this:

errmsg("JSON path expression in JSON_QUERY must return single item when no 
wrapper is requested")

which I think is clearer.
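
For context, the case this message covers can be reproduced roughly like
this (a sketch; if I recall correctly the default ON ERROR behaviour is NULL
ON ERROR, so ERROR ON ERROR is needed to actually see the message):

    SELECT JSON_QUERY('[1,2,3]', '$[*]' ERROR ON ERROR);
    -- errors: the path returns three items and no wrapper was requested
    SELECT JSON_QUERY('[1,2,3]', '$[*]' WITH WRAPPER);
    -- returns [1, 2, 3]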

Patch attached, for branch master.  This applies to 17 cleanly, so I
lean towards backpatching it all there.  (The wparser_def.c changes go
much longer back, but I see little point in backpatching those any
further.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
>From ae8f41ec3955055832e250bcd1419a2adff4e582 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Mon, 13 Jan 2025 19:10:35 +0100
Subject: [PATCH] error messages must use 'must' rather than 'should'

---
 src/backend/tsearch/wparser_def.c|  8 
 src/backend/utils/adt/jsonpath_exec.c| 12 ++--
 src/test/regress/expected/sqljson_jsontable.out  |  2 +-
 src/test/regress/expected/sqljson_queryfuncs.out |  8 
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index f26923d044b..79bcd32a063 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -2671,19 +2671,19 @@ prsd_headline(PG_FUNCTION_ARGS)
 		if (min_words >= max_words)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("MinWords should be less than MaxWords")));
+	 errmsg("%s must be less than %s", "MinWords", "MaxWords")));
 		if (min_words <= 0)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("MinWords should be positive")));
+	 errmsg("%s must be positive", "MinWords")));
 		if (shortword < 0)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("ShortWord should be >= 0")));
+	 errmsg("%s must be >= 0", "ShortWord")));
 		if (max_fragments < 0)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("MaxFragments should be >= 0")));
+	 errmsg("%s must be >= 0", "MaxFragments")));
 	}
 
 	/* Locate words and phrases matching the query */
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index f6dfcb11a62..960fdec3dba 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -3977,13 +3977,13 @@ JsonPathQuery(Datum jb, JsonPath *jp, JsonWrapper wrapper, bool *empty,
 		if (column_name)
 			ereport(ERROR,
 	(errcode(ERRCODE_MORE_THAN_ONE_SQL_JSON_ITEM),
-	 errmsg("JSON path expression for column \"%s\" should return single item without wrapper",
+	 errmsg("JSON path expression for column \"%s\" must return single item when no wrapper is requested",
 			column_name),
 	 errhint("Use the WITH WRAPPER clause to wrap SQL/JSON items into an array.")));
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_MORE_THAN_ONE_SQL_JSON_ITEM),
-	 errmsg("JSON path expression in JSON_QUERY should return single item without wrapper"),
+	 errmsg("JSON path expression in JSON_QUERY must return single item when no wrapper is requested"),
 	 errhint("Use the WITH WRAPPER clause to wrap SQL/JSON items into an array.")));
 	}
 
@@ -4041,12 +4041,12 @@ JsonPathValue(Datum jb, JsonPath *jp, bool *empty, bool *error, List *vars,
 		if (column_name)
 			ereport(ERROR,
 	(errcode(ERRCODE_MORE_THAN_ONE_SQL_JSON_ITEM),
-	 errmsg("JSON path expression for column \"%s\" should return single scalar item",
+	 errmsg("JSON path expression for column \"%s\" must return single scalar item",
 			column_name)));
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_MORE_THAN_ONE_SQL_JSON_ITEM),
-	 errmsg("JSON path expression in JSON_VALUE should return single scalar item")));
+	 errmsg("JSON path expression in JSON_VALUE must return single scalar item")));
 	}
 
 	res = JsonValueListHead(&found);
@@ -4065,12 +4065,12 @@ JsonPathValue(Datum jb, JsonPath *jp, bool *empty, bool *error, List *vars,
 		if (column_name)
 			ereport(ERROR,
 	(errcode(ERRCODE_SQL_JSO

Re: Pgoutput not capturing the generated columns

2025-01-13 Thread vignesh C
On Mon, 13 Jan 2025 at 14:57, Amit Kapila  wrote:
>
> On Mon, Jan 13, 2025 at 5:25 AM Peter Smith  wrote:
> >
> > Future -- there probably need to be further clarifications/emphasis to
> > describe how the generated column replication feature only works for
> > STORED generated columns (not VIRTUAL ones), but IMO it is better to
> > address that separately *after* dealing with these missing
> > documentation patches.
> >
>
> I thought it was better to deal with the ambiguity related to the
> 'virtual' part first. I have summarized the options we have regarding
> this in an email [1]. I prefer to extend the current option to take
> values as 's', and 'n'. This will keep the room open to extending it
> with a new value 'v'. The primary reason to go this way is to avoid
> adding new options in the future. It is better to keep the number of
> subscription options under control. Do you have any preference?

Yes, this seems a better approach. Here is the attached patch which
handles the same.
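
For clarity, with the attached patch the option would be used roughly like
this (a sketch based on the 'none'/'stored' values described above):

    CREATE PUBLICATION pub_gencols FOR TABLE tab1
        WITH (publish_generated_columns = 'stored');
    ALTER PUBLICATION pub_gencols SET (publish_generated_columns = 'none');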

Regards,
Vignesh
From 853353bec204c349533791fd3e397e40f7dd70f0 Mon Sep 17 00:00:00 2001
From: Vignesh 
Date: Mon, 13 Jan 2025 23:37:07 +0530
Subject: [PATCH] Change publish_generated_columns option to use enum instead
 of boolean

The current boolean publish_generated_columns option only supports a binary
choice, which is insufficient for future enhancements where generated columns
can be of different types (e.g., stored and virtual). To better accommodate
future requirements, this commit changes the option to an enum, with initial
values 'none' and 'stored'.
---
 doc/src/sgml/ref/create_publication.sgml|  10 +-
 src/backend/catalog/pg_publication.c|  12 +-
 src/backend/commands/publicationcmds.c  |  58 ++--
 src/backend/replication/logical/proto.c |  43 +++---
 src/backend/replication/pgoutput/pgoutput.c |  30 ++--
 src/backend/utils/cache/relcache.c  |   2 +-
 src/bin/pg_dump/pg_dump.c   |  15 +-
 src/bin/pg_dump/pg_dump.h   |   2 +-
 src/bin/pg_dump/t/002_pg_dump.pl|   4 +-
 src/bin/psql/describe.c |  11 +-
 src/include/catalog/pg_publication.h|  18 ++-
 src/include/commands/publicationcmds.h  |   2 +-
 src/include/replication/logicalproto.h  |  10 +-
 src/test/regress/expected/publication.out   | 150 ++--
 src/test/regress/sql/publication.sql|  24 ++--
 src/test/subscription/t/011_generated.pl|  60 
 16 files changed, 252 insertions(+), 199 deletions(-)

diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml
index 5e25536554..f036635e6e 100644
--- a/doc/src/sgml/ref/create_publication.sgml
+++ b/doc/src/sgml/ref/create_publication.sgml
@@ -190,12 +190,18 @@ CREATE PUBLICATION name

 

-publish_generated_columns (boolean)
+publish_generated_columns (enum)
 
  
   Specifies whether the generated columns present in the tables
   associated with the publication should be replicated.
-  The default is false.
+  The default is none meaning the generated columns
+  present in the tables associated with publication will not be replicated.
+ 
+
+ 
+  If set to stored, the generated columns present in
+  the tables associated with publication will be replicated.
  
 
  
diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index b89098f5e9..a89aedcc20 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -622,10 +622,10 @@ pub_collist_to_bitmapset(Bitmapset *columns, Datum pubcols, MemoryContext mcxt)
 /*
  * Returns a bitmap representing the columns of the specified table.
  *
- * Generated columns are included if include_gencols is true.
+ * Generated columns are included if gencols_type is 'stored'.
  */
 Bitmapset *
-pub_form_cols_map(Relation relation, bool include_gencols)
+pub_form_cols_map(Relation relation, char gencols_type)
 {
 	Bitmapset  *result = NULL;
 	TupleDesc	desc = RelationGetDescr(relation);
@@ -634,7 +634,8 @@ pub_form_cols_map(Relation relation, bool include_gencols)
 	{
 		Form_pg_attribute att = TupleDescAttr(desc, i);
 
-		if (att->attisdropped || (att->attgenerated && !include_gencols))
+		if (att->attisdropped ||
+			(att->attgenerated && (gencols_type == PUBLISH_GENCOL_NONE)))
 			continue;
 
 		result = bms_add_member(result, att->attnum);
@@ -1068,7 +1069,7 @@ GetPublication(Oid pubid)
 	pub->pubactions.pubdelete = pubform->pubdelete;
 	pub->pubactions.pubtruncate = pubform->pubtruncate;
 	pub->pubviaroot = pubform->pubviaroot;
-	pub->pubgencols = pubform->pubgencols;
+	pub->pubgencols_type = pubform->pubgencols_type;
 
 	ReleaseSysCache(tup);
 
@@ -1276,7 +1277,8 @@ pg_get_publication_tables(PG_FUNCTION_ARGS)
 			{
 Form_pg_attribute att = TupleDescAttr(desc, i);
 
-if (att->attisdropped || (at

Re: pgsql: Consolidate docs for vacuum-related GUCs in new subsection

2025-01-13 Thread Melanie Plageman
On Mon, Jan 13, 2025 at 10:22 AM Melanie Plageman
 wrote:
>
> Thanks to Álvaro for pointing this out. I didn't think of it.
>
> On Sun, Jan 12, 2025 at 2:21 PM Tom Lane  wrote:
> >
> > Daniel Gustafsson  writes:
> > > On 11 Jan 2025, at 10:02, Alvaro Herrera  wrote:
> > >> and the GUC grouping in guc_tables.c/h?
> >
> > > I don't know what our policy around this is, and maybe the backpatching 
> > > hazard
> > > isn't too bad here, but it doesn't entirely seem worth the churn.
> >
> > I think the entire point of that categorization is to line up with the
> > docs, so our policy should be to fix this.
>
> I wrote a patch to reorder postgresql.conf.sample. But when I started
> looking at guc_tables.c, it doesn't seem like those are grouped
> according to the current docs order. Part of this is because some of
> the GUCs have different data types. But this appears to be more than
> that. For example, in master guc_tables.c,
> autovacuum_vacuum_cost_limit and vacuum_cost_limit are together (in
> docs in master they were in different sub-sections). Is guc_tables.c
> meant to be consistent with the ordering in the docs?

Since I didn't hear back about this and I don't see an obvious
alternative reorganization in guc_tables.c, I plan to just push the
attached patch that updates only postgresql.conf.sample.

- Melanie


v1-0001-Reorder-vacuum-GUCs-in-postgresql.conf.sample-to-.patch
Description: Binary data


Re: connection establishment versus parallel workers

2025-01-13 Thread Nathan Bossart
On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
>> 0001 patch is unchanged, 0002 patch sketches out a response to the
>> observation a couple of paragraphs above.
> 
> Both of these patches seem to improve matters quite a bit.  I haven't yet
> thought too deeply about it all, but upon a skim, your patches seem
> entirely reasonable to me.

I gave these a closer look, and I still feel that they are both
straightforward and reasonable.  IIUC the main open question is whether
this might cause problems for other PM signal kinds.  Like you, I don't see
anything immediately obvious there, but I'll admit I'm not terribly
familiar with the precise characteristics of postmaster signals.  In any
case, 0001 feels pretty safe to me.

> However, while this makes the test numbers for >= v16 look more like those
> for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
> cleanly to commit d872510.  I haven't figured out _why_ this commit is
> impacting this particular test, but I figured I'd at least update the
> thread with what we know so far.

I regrettably have no updates on this one, yet.

-- 
nathan




Re: InitControlFile misbehaving on graviton

2025-01-13 Thread Alexander Lakhin

Hello Christoph,

13.01.2025 21:04, Christoph Berg wrote:

Bernd and I have been chasing a bug that happens when all of the
following conditions are fulfilled:

...

It smells like graviton's arm9 isn't as backwards compatible to arm8
as it should be. (But then I don't understand why disabling openssl
fixes it.)


Very interesting! Maybe this can also explain weird leafhopper (which
uses graviton4, as Robins said) failures in the buildfarm. (I've
extracted diffs from the logs at [1] and brought Robins's attention to it
at [2].)

[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
[2] 
https://www.postgresql.org/message-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987%40gmail.com

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-01-13 Thread vignesh C
On Tue, 31 Dec 2024 at 10:15, Masahiko Sawada  wrote:
>
> Hi all,
>
> Logical decoding (and logical replication) are available only when
> wal_level = logical. As the documentation says[1], Using the 'logical'
> level increases the WAL volume which could negatively affect the
> performance. For that reason, users might want to start with using
> 'replica', but when they want to use logical decoding they need a
> server restart to increase wal_level to 'logical'. My goal is to allow
> users who are using 'replica' level to use logical decoding without a
> server restart. There are other GUC parameters related to logical
> decoding and logical replication such as max_wal_senders,
> max_logical_replication_workers, and max_replication_slots, but even
> if users set these parameters >0, there would not be a noticeable
> performance impact. And their default values are already >0. So I'd
> like to focus on making only the wal_level dynamic GUC parameter.
> There are several earlier discussions[2][3] but no one has submitted
> patches unless I'm missing something.
>
> The first idea I came up with is to make the wal_level a PGC_SIGHUP
> parameter. However, it affects not only setting 'replica' to 'logical'
> but also setting 'minimal' to 'replica' or higher. I'm not sure the
> latter case is common and it might require a checkpoint. I don't want
> to make the patch complex for uncommon cases.
>
> The second idea is to somehow allow both WAL-logging logical info and
> logical decoding even when wal_level is 'replica'. I've attached a PoC
> patch for that. The patch introduces new SQL functions such as
> pg_activate_logical_decoding() and pg_deactivate_logical_decoding().
> These functions are available only when wal_level is 'replica' (or
> higher). In pg_activate_logical_decoding(), we set the status of
> logical decoding stored on the shared memory from 'disabled' to
> 'xlog-logical-info', allowing all processes to write logical
> information to WAL records for logical decoding. But the logical
> decoding is still not allowed. Once we confirm all in-progress
> transactions completed, we switch the status to
> 'logical-decoding-ready', meaning that users can create logical
> replication slots and use logical decoding.
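
To make the flow described above concrete, the PoC would presumably be used
roughly like this (a sketch only; the function names are the ones mentioned
above, nothing here is final):

    SHOW wal_level;                          -- 'replica'
    SELECT pg_activate_logical_decoding();   -- start WAL-logging logical info
    -- ... once all in-progress transactions have finished ...
    SELECT pg_create_logical_replication_slot('slot1', 'pgoutput');
    SELECT pg_deactivate_logical_decoding();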

I felt that the only disadvantage of this approach is that we currently wait
for all in-progress transactions to complete before enabling logical
decoding. If a long-running transaction exists and the session enabling
logical decoding fails (due to factors like statement_timeout,
transaction_timeout, idle_in_transaction_session_timeout,
idle_session_timeout, or any other failure), the wait has to be restarted.
During this time, there's a risk that a new long-running transaction could
start, further delaying the process. This could probably be solved if the
waiting were done from one of the background processes via PGC_SIGHUP.
While this type of failure is likely rare, I'm unsure whether we should
consider this scenario.

Thoughts?

Regards,
Vignesh




Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2025-01-13 Thread Daniel Gustafsson
Circling back to wrap up this thread I retested with, and without, OpenSSL FIPS
and tweaked the docs and code a little to ensure it detects the right function
to use.  The attached is what I propose we go ahead with.

--
Daniel Gustafsson



v7-0001-pgcrypto-Make-it-possible-to-disable-built-in-cry.patch
Description: Binary data


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread chiranmoy.bhattacha...@fujitsu.com
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

Auto-vectorization is working now with the following addition on Graviton 3 
(m7g.4xlarge) with GCC 11.4, and the results match yours. Previously, 
auto-vectorization had no effect because we missed the -march=native option.

  encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

There is a 30% improvement using auto-vectorization.

     buf | default | auto_vec |  SVE
---------+---------+----------+-----
      16 |      16 |       12 |    8
      64 |      58 |       40 |    9
     256 |     223 |      152 |   18
    1024 |     934 |      613 |   54
    4096 |    3533 |     2430 |  202
   16384 |   14081 |     9831 |  800
   65536 |   56374 |    38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of control 
flow.

-
Here is a comment snippet from src/include/port/simd.h

"While Neon support is technically optional for aarch64, it appears that all 
available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for newer
advanced SIMD extensions like SVE (and AVX-512 for x86) this assumption may not
hold, so we need a runtime check to be sure. Using src/include/port/simd.h to
abstract away these advanced SIMD implementations may be difficult.
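
On Linux, a minimal runtime check could look something like this (just a
sketch; it relies on getauxval()/HWCAP_SVE and punts on non-Linux platforms):

    #include <stdbool.h>
    #include <sys/auxv.h>       /* getauxval(), AT_HWCAP */
    #include <asm/hwcap.h>      /* HWCAP_SVE */

    static bool
    pg_cpu_has_sve(void)
    {
    #if defined(__aarch64__) && defined(HWCAP_SVE)
        return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
    #else
        return false;
    #endif
    }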

We will update the thread once a solution is found.

-
Chiranmoy


Re: POC: track vacuum/analyze cumulative time per relation

2025-01-13 Thread Bertrand Drouvot
Hi,

On Fri, Jan 10, 2025 at 04:26:05PM -0600, Sami Imseih wrote:
> {
> /* time is already in msec, just convert to double for presentation */
> PG_RETURN_FLOAT8((double)
> pgstat_fetch_stat_checkpointer()->write_time);
> }
> 
> > +
> > +   PgStat_Counter total_vacuum_time;   /* user initiated vacuum */
> > +   PgStat_Counter total_autovacuum_time;   /* autovacuum initiated */
> > +   PgStat_Counter total_analyze_time;  /* user initiated vacuum */
> > +   PgStat_Counter total_autoanalyze_time;  /* autovacuum initiated */
> >
> > Those comments look weird to me.
> >
> 
> Ok, Will remove these.
> 
> I also updated the comments for the instrument code path to reflect the
> fact starttime is now set for all cases.Also, added documentation.
> 
> See the attached v2.

Thanks for the patch update!

A few random comments:

=== 1

+/* pg_stat_get_vacuum_time */
+PG_STAT_GET_RELENTRY_FLOAT8(total_vacuum_time)
+
+/* pg_stat_get_autovacuum_time */
+PG_STAT_GET_RELENTRY_FLOAT8(total_autovacuum_time)
+
+/* pg_stat_get_analyze_time */
+PG_STAT_GET_RELENTRY_FLOAT8(total_analyze_time)
+
+/* pg_stat_get_autoanalyze_time */
+PG_STAT_GET_RELENTRY_FLOAT8(total_autoanalyze_time)
+

The comments do not reflect the function names ("total" is missing to give
pg_stat_get_total_vacuum_time() and such).

=== 2

+#define PG_STAT_GET_RELENTRY_FLOAT8(stat)
+Datum
+CppConcat(pg_stat_get_,stat)(PG_FUNCTION_ARGS)
+{   
+   Oid relid = PG_GETARG_OID(0); 
+   int64   result;
+   PgStat_StatTabEntry *tabentry;
+   
+   if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+   result = 0;
+   else  
+   result = (float8) (tabentry->stat);
+  
+   PG_RETURN_FLOAT8(result);
+}
+
 /* pg_stat_get_analyze_count */
 PG_STAT_GET_RELENTRY_INT64(analyze_count

I think it's better to define the macro just before its first usage (meaning
just after pg_stat_get_vacuum_count()): that would be consistent with the places
the other macros are defined.

=== 3

+   int64   result; 

s/int64/double/?

=== 4

+   Total time this table has spent in manual vacuum
+  

Mention the unit?

=== 5

+   /*
+* When verbose or autovacuum logging is used, initialize a resource 
usage
+* snapshot and optionally track I/O timing.
+*/
if (instrument)
{

Out of curiosity, why this extra comment? To be somehow consistent with
do_analyze_rel()?

=== 6

@@ -343,6 +349,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
  
RelationGetRelid(rel));

+   starttime = GetCurrentTimestamp();

I wonder if it wouldn't make more sense to put the GetCurrentTimestamp() call
before the pgstat_progress_start_command() call. That would be aligned with the
"endtime" being after the pgstat_progress_end_command() and where it was before
the patch.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Consolidate docs for vacuum-related GUCs in new subsection

2025-01-13 Thread Melanie Plageman
Thanks to Álvaro for pointing this out. I didn't think of it.

On Sun, Jan 12, 2025 at 2:21 PM Tom Lane  wrote:
>
> Daniel Gustafsson  writes:
> > On 11 Jan 2025, at 10:02, Alvaro Herrera  wrote:
> >> and the GUC grouping in guc_tables.c/h?
>
> > I don't know what our policy around this is, and maybe the backpatching 
> > hazard
> > isn't too bad here, but it doesn't entirely seem worth the churn.
>
> I think the entire point of that categorization is to line up with the
> docs, so our policy should be to fix this.

I wrote a patch to reorder postgresql.conf.sample. But when I started
looking at guc_tables.c, it doesn't seem like those are grouped
according to the current docs order. Part of this is because some of
the GUCs have different data types. But this appears to be more than
that. For example, in master guc_tables.c,
autovacuum_vacuum_cost_limit and vacuum_cost_limit are together (in
docs in master they were in different sub-sections). Is guc_tables.c
meant to be consistent with the ordering in the docs?

- Melanie




Re: [PATCH] SVE popcount support

2025-01-13 Thread Malladi, Rama


Here is the updated patch using 
pg_attribute_target("arch=armv8-a+sve") to compile the arch-specific 
function instead of using compiler flags.


---
This looks good. Thanks Chiranmoy and team. Can you address any other 
feedback from Nathan or others here? Then we can pursue further reviews 
and merging of the patch.




Re: pgbench error: (setshell) of script 0; execution of meta-command failed

2025-01-13 Thread Nathan Bossart
On Sat, Jan 11, 2025 at 02:04:13PM -0500, Tom Lane wrote:
> After studying this more, I think what we should do in HEAD is
> even more aggressive: let's make the real name of port/pqsignal.c's
> function be either pqsignal_fe in frontend, or pqsignal_be in backend.
> This positively ensures that there's no collision with libpq's
> export, and it seems like a good idea for the additional reason that
> the frontend and backend versions are not really interchangeable.

That would be nice.

> However, we can't get away with that in released branches because
> it'd be an ABI break for any outside code that calls pqsignal.
> What I propose doing in the released branches is what's shown in
> the attached patch for v17: rename port/pqsignal.c's function to
> pqsignal_fe in frontend, but leave it as pqsignal in the backend.
> Leaving it as pqsignal in the backend preserves ABI for extension
> modules, at the cost that it's not entirely clear which pqsignal
> an extension that's also linked with libpq will get.  But that
> hazard affects none of our code, and it existed already for any
> third-party extensions that call pqsignal.  In principle using
> "pqsignal_fe" in frontend creates an ABI hazard for outside users of
> libpgport.a.  But I think it's not really a problem, because we don't
> support that as a shared library.  As long as people build with
> headers and .a files from the same minor release, they'll be OK.

I don't have any concrete reasons to think you are wrong about this, but it
does feel somewhat risky to me.  It might be worth testing it with a couple
of third-party projects that use the frontend version of pqsignal().
codesearch.debian.net shows a couple that may be doing so [0] [1] [2].

> BTW, this decouples legacy-pqsignal.c from pqsignal.c enough
> that we could now do what's contemplated in the comments from
> 3b00fdba9: simplify that version by making it return void,
> or perhaps better just a true/false success report.
> I've not touched that point here, though.

I think it should just return void since AFAICT nobody checks the return
value.  But it would probably be a good idea to add some sort of error
checking within the function.  I've attached a draft patch that adds some
new assertions.  I originally tried to use elog() where it was available,
but besides making the code even more unreadable, I think it's unnecessary
(since AFAICT any problems are likely coding errors).

[0] https://github.com/gleu/pgstats
[1] https://github.com/hapostgres/pg_auto_failover
[2] https://github.com/credativ/pg_checksums

-- 
nathan
>From 93ea04b912a90e6a166941c575b4da128e3b7533 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 13 Jan 2025 09:27:28 -0600
Subject: [PATCH v1 1/1] modify pqsignal

---
 src/include/port.h  |  2 +-
 src/port/pqsignal.c | 32 ++--
 2 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/src/include/port.h b/src/include/port.h
index f0e28ce5c53..4e9e5657872 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -520,7 +520,7 @@ extern int  pg_mkdir_p(char *path, int omode);
 #define pqsignal pqsignal_be
 #endif
 typedef void (*pqsigfunc) (SIGNAL_ARGS);
-extern pqsigfunc pqsignal(int signo, pqsigfunc func);
+extern void pqsignal(int signo, pqsigfunc func);
 
 /* port/quotes.c */
 extern char *escape_single_quotes_ascii(const char *src);
diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index f707b1c54e1..076e3dfe340 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -114,29 +114,15 @@ wrapper_handler(SIGNAL_ARGS)
  *
  * Returns the previous handler.
  *
- * NB: If called within a signal handler, race conditions may lead to bogus
- * return values.  You should either avoid calling this within signal handlers
- * or ignore the return value.
- *
- * XXX: Since no in-tree callers use the return value, and there is little
- * reason to do so, it would be nice if we could convert this to a void
- * function instead of providing potentially-bogus return values.
- * Unfortunately, that requires modifying the pqsignal() in legacy-pqsignal.c,
- * which in turn requires an SONAME bump, which is probably not worth it.
- *
  * Note: the actual name of this function is either pqsignal_fe when
  * compiled with -DFRONTEND, or pqsignal_be when compiled without.
  * This is to avoid a name collision with libpq's legacy-pqsignal.c.
  */
-pqsigfunc
+void
 pqsignal(int signo, pqsigfunc func)
 {
-   pqsigfunc   orig_func = pqsignal_handlers[signo];   /* assumed 
atomic */
 #if !(defined(WIN32) && defined(FRONTEND))
-   struct sigaction act,
-   oact;
-#else
-   pqsigfunc   ret;
+   struct sigaction act;
 #endif
 
Assert(signo < PG_NSIG);
@@ -155,17 +141,11 @@ pqsignal(int signo, pqsigfunc func)
if (signo == SIGCHLD)
act.sa_flags |= SA_NOCLDSTOP;
 #endif
-   if (sigaction(signo, &act, &oact) < 0)
-   return SIG_ERR;
-   else if (oact.

Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-01-13 Thread Ashutosh Bapat
On Mon, Jan 13, 2025 at 2:52 PM vignesh C  wrote:
>
> I felt that the only disadvantage with this approach is that we
> currently wait for all in-progress transactions to complete before
> enabling logical decoding. If a long-running transaction exists and
> the session enabling logical decoding fails—due to factors like
> statement_timeout, transaction_timeout,
> idle_in_transaction_session_timeout, idle_session_timeout, or any
> other failure. This would require restarting the wait. During this
> time, there's a risk that a new long-running transaction could start,
> further delaying the process. Probably this could be solved if the
> waiting is done from any of the background processes through
> PGC_SIGHUP.  While this type of failure is likely rare, I’m unsure
> whether we should consider this scenario.

A related question: while the operation is waiting for already running
transactions to end, backends whose transactions have finished may
start new transactions. When we switch wal_level from 'replica' to
'logical', there may still be some transactions running. Will this lead to
a WAL stream with mixed WAL records, some with logical information and
some without? Do we need to adjust logical decoding to tackle this
situation? Is there a chance that some WAL records from the same transaction
have logical information and some do not?

-- 
Best Wishes,
Ashutosh Bapat




Re: Skip collecting decoded changes of already-aborted transactions

2025-01-13 Thread Amit Kapila
On Tue, Jan 7, 2025 at 7:22 AM Peter Smith  wrote:
>
> ==
> src/include/replication/reorderbuffer.h
>
> 6.
>  #define RBTXN_PREPARE  0x0040
>  #define RBTXN_SKIPPED_PREPARE   0x0080
>  #define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
> +#define RBTXN_SENT_PREPARE 0x0200
> +#define RBTXN_IS_COMMITTED 0x0400
> +#define RBTXN_IS_ABORTED 0x0800
>
> Something about this new RBTXN_SENT_PREPARE name seems inconsistent to me.
>
> I feel there is now also some introduced ambiguity with these macros:
>
> /* Has this transaction been prepared? */
> #define rbtxn_prepared(txn) \
> ( \
> ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
> )
>
> +/* Has a prepare or stream_prepare already been sent? */
> +#define rbtxn_sent_prepare(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
> +)
>
>
> e.g. It's also not clear from the comments what is the distinction
> between the existing macro comment "Has this transaction been
> prepared?" and the new macro comment "Has a prepare or stream_prepare
> already been sent?".
>
> Indeed, I was wondering if some of the places currently calling
> "rbtxn_prepared(txn)" should now strictly be calling
> "rbtxn_sent_prepared(txn)" macro instead?
>

Right, I think after this change, it appears we should try to rename
the existing constants. One place where we can consider to use new
macro is the current usage of rbtxn_prepared() in
SnapBuildDistributeNewCatalogSnapshot().

> IMO some minor renaming of the existing constants (and also their
> associated macros) might help to make all this more coherent. For
> example, perhaps like:
>
> #define RBTXN_IS_PREPARE_NEEDED0x0040
>

The other option could be RBTXN_IS_PREPARE_REQUESTED.
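
To visualize it, the renamed pair might end up looking roughly like this
(purely illustrative; the final names are still being discussed):

    #define RBTXN_IS_PREPARE_REQUESTED 0x0040

    /* Has a prepare for this transaction been requested? */
    #define rbtxn_prepare_requested(txn) \
    ( \
        ((txn)->txn_flags & RBTXN_IS_PREPARE_REQUESTED) != 0 \
    )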

-- 
With Regards,
Amit Kapila.




Re: Psql meta-command conninfo+

2025-01-13 Thread Dean Rasheed
On Mon, 13 Jan 2025 at 08:44, Alvaro Herrera  wrote:
>
> > IMO, we should continue with v35 and add all parameters from
> > PQparameterStatus,
> > as Sami suggested. The names themselves are informative enough.
>
> IMO we need more opinions.  Maybe enough other people are not as unhappy
> about the three-table solution.
>

I don't like the 3-table format either. I think it should be a single table.

The trouble with v35 is that it produces 1 row with lots of columns,
which is pretty unreadable unless expanded mode is used. So I think we
should just do it that way round by default -- i.e., make it like
\dconfig and have 2 columns, "Parameter" and "Value", with lots of
rows.

Perhaps it could also have a "Description" column, which might help
with things like distinguishing between authenticated user and session
user.
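
Something like this, as a rough mock-up (column names and rows are purely
illustrative, not a proposal for the exact wording):

    => \conninfo+
                      Connection Information
          Parameter       |   Value   |       Description
    ----------------------+-----------+--------------------------
     Database             | postgres  | Current database
     Authenticated User   | alice     | User that authenticated
     Session User         | alice     | Current session user
     Server Version       | 17.2      | Server version
     ...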

Regards,
Dean




Accept recovery conflict interrupt on blocked writing

2025-01-13 Thread Anthonin Bonnefoy
Hi,

I have replicas that have regular transient bursts of replay lag (>5
minutes). The events have the following symptoms:
- Replicas are using physical replication slot and hot_standby_feedback
- The lag recovers by itself after at most 15 minutes
- During the same timeframe, there's a query stuck in ClientWrite on
the replica for ~15 minutes (despite the 30s statement_timeout). The
matching client was terminated at the beginning when the server
started sending results and when the server process died, the replay
lag started to recover.

The 15 minutes timeout matches the default linux TCP retransmission
timeout[1]. A similar situation can be triggered by using psql
pipelining patch[2].

# On the primary, generate constant updates:
echo "\set aid random(1, 10 * :scale)
UPDATE pgbench_accounts SET bid=bid+1 WHERE aid=:aid;" > update.sql
pgbench -f update.sql -T900

-- On a replica:
\startpipeline
-- At least 2 select are needed to completely saturate socket buffers
select * from pgbench_accounts \bind \g
select * from pgbench_accounts \bind \g
-- Flush the commands to the server
\flush

After that, the queries are sent to the server but the client doesn't
consume the results and the backend process should be stuck in a
ClientWrite state. Eventually, usually after ~10s, WAL replay becomes
completely blocked (sometimes I need to redo the pipeline query if
nothing happens). Looking at the backtrace, the recovery process is
blocked on ResolveRecoveryConflictWithBufferPin. The conflicting
buffer pin is held by the pipelined query, currently blocked trying to
write the result to the socket which is completely saturated. During
this time, all interrupts are ignored and WAL replay won't be able to
recover until the socket becomes either writable or ends with an
error.

The same situation happened on my instances where a backend process
was sending results to a client that died without sending a FIN or a
RST. The only way for this process to die is to reach the TCP
retransmission timeout after 15m. During this time, it can conflict
with the recovery as it holds a buffer pin, possibly blocking the
recovery for the whole duration.

To avoid blocking recovery for an extended period of time, this patch
changes client write interrupts by handling recovery conflict
interrupts instead of ignoring them. Since the interrupt happens while
we're likely to have partially written results on the socket, there's
no easy way to keep protocol sync so the session needs to be
terminated.

Setting tcp_user_timeout is also a possible mitigation for users, but
my assumption of how conflict recovery is done is that it is not
desirable to block recovery for an extended period of time and it is
fine to be aggressive when the standby delay is exceeded.
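
For example, something along these lines (an illustrative value, not a
recommendation):

    # postgresql.conf on the standby: give up on an unresponsive client after
    # 30 seconds instead of waiting out the ~15 minute TCP retransmission timeout
    tcp_user_timeout = 30000        # milliseconds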

[1]: https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html
[2]: https://commitfest.postgresql.org/51/5407/

Regards,
Anthonin Bonnefoy


v01-0001-Accept-recovery-conflict-interrupt-on-blocked-wr.patch
Description: Binary data


Re: why there is not VACUUM FULL CONCURRENTLY?

2025-01-13 Thread Michael Banck
Hi,

On Sat, Jan 11, 2025 at 09:01:54AM -0500, Andrew Dunstan wrote:
> On 2025-01-09 Th 8:35 AM, Alvaro Herrera wrote:
> > Maybe we should have a new toplevel command.  Some ideas that have been
> > thrown around:
> > 
> > - RETABLE (it's like REINDEX, but for tables)
> > - ALTER TABLE  SQUEEZE
> > - SQUEEZE 
> > - VACUUM (SQUEEZE)
> > - VACUUM (COMPACT)
> > - MAINTAIN  COMPACT
> > - MAINTAIN  SQUEEZE

I don't like any of them a lot :-/

> COMPACT tablename ...

That sounds like it would compress content rather than just rewrite it
normally to get rid of bloat.

I think REORG (or REPACK, though that has no history elsewhere) would fit
best; we don't need to emulate the myriad of DB2 options...


Michael




Re: Pgoutput not capturing the generated columns

2025-01-13 Thread Amit Kapila
On Mon, Jan 13, 2025 at 5:25 AM Peter Smith  wrote:
>
> Future -- there probably need to be further clarifications/emphasis to
> describe how the generated column replication feature only works for
> STORED generated columns (not VIRTUAL ones), but IMO it is better to
> address that separately *after* dealing with these missing
> documentation patches.
>

I thought it was better to deal with the ambiguity related to the
'virtual' part first. I have summarized the options we have regarding
this in an email [1]. I prefer to extend the current option to take
values as 's', and 'n'. This will keep the room open to extending it
with a new value 'v'. The primary reason to go this way is to avoid
adding new options in the future. It is better to keep the number of
subscription options under control. Do you have any preference?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JfEZUdtC5896vwEZFXBZnQ4aTDDXQxv3NOaosYu973Pw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Modern SHA2- based password hashes for pgcrypto

2025-01-13 Thread Alvaro Herrera
Hello

I was passing by and noticed that this badly needs pgindent, and some
comments are enumerations that would lose their formatting once run through
it. For example, this would happen, which is not good:

/*
-* 1. Start digest A
-* 2. Add the password string to digest A
-* 3. Add the salt to digest A
+* 1. Start digest A 2. Add the password string to digest A 3. Add the
+* salt to digest A
 */

I suggest you can fix this by adding a "-" sign to the opening "/*" line
so that pgindent doesn't mangle it (so it becomes /*- ).  This appears
in several places.

It's been said in my presence that pgcrypto is obsolete and shouldn't be
used anymore.  I'm not sure I believe that, but even if it's true, it's clear
that there are plenty of people who have an interest in it, so I don't see
that as grounds to reject this work.  So let's move on.


Your test data (crypt-shacrypt.sql) looks a bit short.  I noticed that
Drepper's SHA-crypt.txt file has a bunch of test lines (starting with
the "Test vectors from FIPS 180-2: appendix B.1." comment line, as well
as "appendix C.1" at the bottom) which perhaps could be incorporated
into the .sql script, to ensure correctness (or at least,
bug-compatibility with the reference implementation).  I'd also add a
note that Drepper's implementation is public domain in crypt-sha.c's
license block.

I think the "ascii_dollar" variable is a bit useless.  Why not just use the
literal $ sign where needed (or a DOLLAR_CHR if we feel like avoiding a
magic value there)?  Also, I wonder if it would be better to use a
StringInfo instead of a fixed-size buffer, which would probably make
some string manipulations easier ... Or maybe not, but let's not mix
strlcat calls with strncat calls with no comments about why.
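
For illustration, building the salt prefix with StringInfo might look
roughly like this (a sketch under the assumption that the code builds the
usual "$<id>$rounds=N$salt$" string; the function and variable names are
made up):

    #include "lib/stringinfo.h"

    static void
    build_shacrypt_prefix(StringInfo buf, const char *magic, int rounds,
                          const char *salt)
    {
        appendStringInfoString(buf, magic); /* e.g. "$5$" or "$6$" */
        if (rounds > 0)
            appendStringInfo(buf, "rounds=%d$", rounds);
        appendStringInfoString(buf, salt);
        appendStringInfoChar(buf, '$');
    }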

Some of your elog(ERROR)s should probably be ereport(), and I'm not sure
we want all the elog(DEBUG1)s.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread Nathan Bossart
On Mon, Jan 13, 2025 at 03:48:49PM +, chiranmoy.bhattacha...@fujitsu.com 
wrote:
> There is a 30% improvement using auto-vectorization.

It might be worth enabling auto-vectorization independently of any patches
that use intrinsics, then.

> Currently, it is assumed that all aarch64 machine support NEON, but for
> newer advanced SIMD like SVE (and AVX512 for x86) this assumption may not
> hold. We need a runtime check to be sure.. Using src/include/port/simd.h
> to abstract away these advanced SIMD implementations may be difficult.

Yeah, moving simd.h to anything beyond Neon/SSE2 might be tricky at the
moment.  Besides the need for additional runtime checks, using wider
registers can mean that you need more data before an optimization takes
effect, which is effectively a regression.  I ran into this when I tried to
add AVX2 support to simd.h [0].  My question about using simd.h was
ultimately about abstracting the relevant Neon/SSE2 instructions and using
those for hex_encode/decode().  If that's possible, I think it'd be
interesting to see how that compares to the SVE version.

[0] https://postgr.es/m/20231129171526.GA857928%40nathanxps13

-- 
nathan




Re: per backend I/O statistics

2025-01-13 Thread Michael Paquier
On Fri, Jan 10, 2025 at 01:56:01PM +, Bertrand Drouvot wrote:
> So now, I think that it is not a good idea for the per backend stats to 
> behave differently than the other stats kinds. I think we should get rid of
> the "snapshot" for all of them or for none of them. Reading the above 
> comments,
> it looks like it has already been discussed and that the outcome is to keep
> it for all of them.

Thanks a lot for diving into these details!  I did not recall that
there was a strict argument for the current behavior, even if I was
pretty sure that introducing an exception for the backend stats was
not quite right compared to the other stats kinds.  The consistency
line prevails here, then.
--
Michael


signature.asc
Description: PGP signature


Re: Conflict Detection and Resolution

2025-01-13 Thread Amit Kapila
On Mon, Jan 13, 2025 at 11:01 PM Michail Nikolaev
 wrote:
>
> Just curious - are any plans to continue to work on this project?
>

Yes, but our intention is to reach a consensus/conclusion about how to
deal with update_delete conflicts. The topic is discussed in the CF
entry [1].

If you want to help move forward with this work, it is better to help
with the review of work in the CF entry [1].

[1] - https://commitfest.postgresql.org/51/5378/

-- 
With Regards,
Amit Kapila.




Re: Reorder shutdown sequence, to flush pgstats later

2025-01-13 Thread Andres Freund
Hi,

On 2025-01-13 12:20:39 +, Bertrand Drouvot wrote:
> > We have
> > multiple copies of code to go to FatalError, that doesn't seem great.
> 
> +  * FIXME: This should probably not have a copy fo the code to
> +  * transition into FatalError state.
> +  */
> +  ereport(LOG,
> +  (errmsg("WAL was shut down unexpectedly")));
> 
> Indeed, might be worth extracting this into a helper function or such.

That is quite a rat's nest of issues :(. There are quite a few oddities around
the related code.  The attached series contains a few commits trying to unify
the paths transitioning to FatalError = true into the new HandleFatalError().

The current code has the weird behaviour of going through PM_WAIT_BACKENDS. I
left it like that for now. In fact more paths do so now, because we did so for
PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which
seems rather weird.

I suspect it'd be better to switch directly to PM_DEAD_END and have
HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's left
for another time.

After the earlier commit that just moves the code I turned the existing if ()
into a switch so that whoever adds new states is told to look at that code by
the compiler.

Thoughts?


> > Another thing I started worrying about is that the commit added
> > PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
> > where we directly jump to PM_WAIT_DEAD_END in fatal situations:

Attached patch is implemented that way.

Greetings,

Andres Freund
>From 833ee00ea2a1341b2e20e88c96d8824c09189df2 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v4 1/7] checkpointer: Request checkpoint via latch instead of
 signal

The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer already is busy.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/checkpointer.c | 60 +--
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..dd2c8376c6e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
 
 /*
  * Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
-/* 
- *		signal handler routines
- * 
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
-	SetLatch(MyLatch);
-}
-
-
 /* 
  *		communication with backends
  * 
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * w

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread Tom Lane
John Naylor  writes:
> We can do about as well simply by changing the nibble lookup to a byte
> lookup, which works on every compiler and architecture:

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion.  I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

regards, tom lane




RE: COPY performance on Windows

2025-01-13 Thread Ryohei Takahashi (Fujitsu)
Hi,


I did more investigation on the COPY performance on Windows.

With the community-distributed binaries, the COPY performance of PG 16.6 and
PG 17.0 is worse than that of PG 16.4.
However, with binaries I built myself, there is no difference,
so it is not a problem with the source code changes after PG 16.4.

On the other hand, I noticed that 82a4edabd2 degrades COPY performance on
Windows.

I will write the details in the next e-mail.

Regards,
Ryohei Takahashi


Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-13 Thread Shubham Khanna
On Fri, Jan 10, 2025 at 6:56 AM Peter Smith  wrote:
>
> Hi Shubham,
>
> Some review comments for patch v8=0001.
>
> ==
> Commit message.
>
> 1.
> By ensuring 'max_slot_wal_keep_size' is -1, this patch prevents the potential
> deletion of required WAL files on the publisher that could disrupt logical
> replication. A test case has been added to validate correct warning reporting
> when 'max_slot_wal_keep_size' is misconfigured.
>
> ~
>
> AFAIK the patch only logs a warning. It is not *enforcing* anything at all.
>
> So, saying "ensuring" and "preventing" is misleading.
>
> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> check_publisher:
>
> 2.
> + max_slot_wal_keep_size = strtoi64(PQgetvalue(res, 0, 6), NULL, 0);
>
> Is passing base 0 here ok?
>
> ==
> src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
>
> 3.
> +# reload configuration
> +$node_p->append_conf('postgresql.conf', 'max_slot_wal_keep_size = 10MB');
> +$node_p->reload;
> +
> +# dry run mode on node S and verify the warning message for misconfigured
> +# 'max_slot_wal_keep_size'
> +command_checks_all(
>
> Because of the --verbose I expected to see the pg_log_debug message
> getting output showing the large (10MB) value for
> max_slot_wal_keep_size.
>
> But, I can only see a value of -1 in the log file
> 'tmp_check/log/regress_log_040_pg_createsubscriber', which may not
> even be from the same test. Am I looking in the wrong place (???) e.g.
> Where is the logging evidence of that publisher's bad GUC (10MB) value
> in the logs before the warning is emitted?
>

I have fixed the given comments using your suggestions. The attached
patch contains the suggested changes.

Thanks and Regards,
Shubham Khanna.


v9-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch
Description: Binary data


RE: COPY performance on Windows

2025-01-13 Thread Ryohei Takahashi (Fujitsu)
Hi,


I compared the performance between 94f9c08 (previous commit of 82a4edabd2) and 
82a4edabd2 with nclient = 8.
The performance is degraded by 8%.


* 94f9c08 (previous commit of 82a4edabd2) 
75sec

* 82a4edabd2
81sec


I looked at the pg_stat_io view after COPY completed.


* 94f9c08 (previous commit of 82a4edabd2)

 backend_type   | client backend
 object         | relation
 context        | bulkwrite
 reads          | 0
 read_time      | 0
 writes         | 4408399
 write_time     | 57207.667
 writebacks     | 0
 writeback_time | 0
 extends        | 4424783
 extend_time    | 43897.6794
 op_bytes       | 8192
 hits           | 6559989
 evictions      | 0
 reuses         | 4408399
 fsyncs         |
 fsync_time     |
 stats_reset    | 2025-01-14 13:57:29.162001+09


* 82a4edabd2

 backend_type   | client backend
 object         | relation
 context        | bulkwrite
 reads          | 0
 read_time      | 0
 writes         | 4408514
 write_time     | 112722.993
 writebacks     | 0
 writeback_time | 0
 extends        | 4424898
 extend_time    | 33416.076
 op_bytes       | 8192
 hits           | 5245654
 evictions      | 0
 reuses         | 4408514
 fsyncs         |
 fsync_time     |
 stats_reset    | 2025-01-14 13:54:40.643054+09



With 82a4edabd2 applied, extend_time is reduced from 43897ms to 33416ms.
That appears to be the intended effect of 82a4edabd2.

On the other hand, write_time is increased from 57207ms to 112722ms.
I think write_time is just the time spent in pwrite(), so I cannot understand why 
82a4edabd2 would affect write_time.


What do you think about this?


Regards,
Ryohei Takahashi


Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-13 Thread Shubham Khanna
On Fri, Jan 10, 2025 at 8:21 AM Peter Smith  wrote:
>
> On Mon, Jan 6, 2025 at 1:29 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Shubham,
> >
> > Thanks for creating a patch. I have one comment about it.
> >
> > check_publisher() assumed that the SQL function 
> > `pg_catalog.current_setting('max_slot_wal_keep_size')`
> > will return the numeric, but it just return the text representation. I.e., 
> > if the parameter is
> > set to 10MB, the function returns like below:
> >
> > ```
> > postgres=# SELECT * FROM 
> > pg_catalog.current_setting('max_slot_wal_keep_size');
> >  current_setting
> > -
> >  10MB
> > (1 row)
> > ```
> >
> > Your patch can work well because atoi() ignores the latter part of the 
> > string,
> > e.g., "10MB" is converted to "10", but this is not clean. I suggest either 
> > of
> > 1) accepting the value as the string, or 2) using an SQL function 
> > pg_size_bytes()
> > to get max_slot_wal_keep_size.
> >
>
> Hi Shubham.
>
> Kuroda-san gave a couple of ideas above and you chose the option 2.
>
> FWIW, recently I've been thinking that option 1 might have been a
> better choice, because:
> i)  Using strtoi64 for a GUC value seems to be very rare. I didn't
> find any other examples of what you've ended up doing in v8-0001.
> ii) When you display the value in the pg_log_debug it would be better
> to display the same human-readable format that the user configured
> instead of some possibly huge int64 value
> iii) AFAIK, we only need to check this against "-1". The actual value
> is of no real importance to the patch, so going to extra trouble to
> extract an int64 that we don't care about seems unnecessary
>

I have fixed the given comment and used the string to accept the
value. The v9 version Patch attached at [1] has the changes for the
same.
[1] - 
https://www.postgresql.org/message-id/CAHv8RjL4MFEDosz9%2BTFmufYCDoNCnyN6PFeSxG6MwsH1ufhZLA%40mail.gmail.com
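
For reference, the string-based check ends up being roughly of this shape
(just an illustrative sketch; the column index, variable names, and message
wording here are my assumptions, the attached v9 patch is authoritative):

	char	   *max_slot_wal_keep_size;

	max_slot_wal_keep_size = pg_strdup(PQgetvalue(res, 0, 6));
	pg_log_debug("publisher: max_slot_wal_keep_size: %s",
				 max_slot_wal_keep_size);

	/* -1 is the only value that guarantees required WAL is retained */
	if (strcmp(max_slot_wal_keep_size, "-1") != 0)
	{
		pg_log_warning("publisher requires \"max_slot_wal_keep_size\" to be set to -1");
		pg_log_warning_hint("Set \"max_slot_wal_keep_size\" to -1 to ensure that required WAL files are not prematurely removed.");
	}

	pg_free(max_slot_wal_keep_size);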

Thanks and regards,
Shubham Khanna.




Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-13 Thread Shubham Khanna
On Tue, Jan 14, 2025 at 10:10 AM Peter Smith  wrote:
>
> Hi Shubham,
>
> In a previous review [1, comment #3] I was complaining about the lack
> of output emitted by the pg_log_debug() because I wanted to see some
> LOG evidence of the bad value of max_slot_wal_keep_size.
>
> FYI, I think that mystery has finally been solved.
>
> AFAIK 2 separate problems were contributing to the lack of debug output.
>
> 1. Unlike the 'command_ok' subroutine, the 'command_checks_all'
> subroutine apparently redirects the output before doing the requested
> pattern matching on it. So there is no output to be seen.
>
> 2. After that, I did not realise the effect of '--verbose' is
> cumulative on loglevel. Specifically, to see debug messages it is not
> enough to just say '--verbose'. In fact, you need to specify it 2
> times: '--verbose', '--verbose'
>
> ~
>
> After addressing these (e.g. by executing the same pg_createsubscriber
> using 'command_ok' and using a double '--verbose'), the expected LOGS
> can be seen.
>
> e.g. in the file tmp_check/log/regress_log_040_pg_createsubscriber
> --
> ...
> pg_createsubscriber: checking settings on publisher
> pg_createsubscriber: publisher: wal_level: logical
> pg_createsubscriber: publisher: max_replication_slots: 10
> pg_createsubscriber: publisher: current replication slots: 2
> pg_createsubscriber: publisher: max_wal_senders: 10
> pg_createsubscriber: publisher: current wal senders: 1
> pg_createsubscriber: publisher: max_prepared_transactions: 0
> pg_createsubscriber: publisher: max_slot_wal_keep_size: 10485760
> pg_createsubscriber: warning: publisher requires WAL size must not be 
> restricted
> pg_createsubscriber: hint: Set the configuration parameter
> "max_slot_wal_keep_size" to -1 to ensure that required WAL files are
> not prematurely removed.
> pg_createsubscriber: stopping the subscriber
> ...
> --
>
> See the attached diff for the TAP changes I used to expose this
> logging. Apply this atop v8.
>
> ==
> [1] 
> https://www.postgresql.org/message-id/CAHut%2BPszk61QM%2BcEvq_1-A2y%2BJrAD0USB%2BNvtcidajYOfHDkyw%40mail.gmail.com
>

I have used your diff to update the TAP tests accordingly. The v9
version Patch attached at [1] has the changes for the same.
[1] - 
https://www.postgresql.org/message-id/CAHv8RjL4MFEDosz9%2BTFmufYCDoNCnyN6PFeSxG6MwsH1ufhZLA%40mail.gmail.com

Thanks and regards,
Shubham Khanna.




Re: pg_createsubscriber TAP test wrapping makes command options hard to read.

2025-01-13 Thread Peter Smith
On Fri, Dec 13, 2024 at 12:17 AM Dagfinn Ilmari Mannsåker
 wrote:
>
> Dagfinn Ilmari Mannsåker  writes:
>
> > Peter Smith  writes:
> >
> >> On Thu, Dec 12, 2024 at 2:53 PM Michael Paquier  
> >> wrote:
> >> ...
> >>
> >>> > So, AFAICT I can workaround the perltidy wrapping just by putting all
> >>> > the noarg options at the bottom of the command, then all the
> >>> > option/optarg pairs (ie 2s) will stay together. I can post another
> >>> > patch to do it this way unless you think it is too hacky.
> >>>
> >>> This trick works for me if that makes the long list of option easier
> >>> to read.  With two elements of the array perl line, I would just put
> >>> some --dry-run or --verbose at the end of their respective arrays.
> >>> --
> >>> Michael
> >>
> >> Hi Michael.
> >>
> >> Yes, that is the workaround that I was proposing.
> >
> > A better option, IMO, is to use the fat comma (=>) between options and
> > their values.  This makes it clear both to humans and perltidy that they
> > belong together, and we can put all the valueless options first without
> > things being rewrapped.
>
> Here's a more thorough patch, that also applies the fat comma treatment
> to other pg_createsubscriber invocations in the same file that don't
> currently happen to be mangled by perltidy.  It also adds trailing
> commas to the last item in multi-line command arrays, which is common
> perl style.
>
> - ilmari
>

Hi,

In your v4 patch, there is a fragment (below) that replaces a double
'--verbose' switch with just a single '--verbose'.

As I have only recently learned, the '--verbose' switch has a
cumulative effect [1], so the original double '--verbose' was probably
deliberate and should be kept that way.

~~

 # Run pg_createsubscriber on node S
 command_ok(
  [
- 'pg_createsubscriber', '--verbose',
- '--recovery-timeout', "$PostgreSQL::Test::Utils::timeout_default",
- '--verbose', '--pgdata',
- $node_s->data_dir, '--publisher-server',
- $node_p->connstr($db1), '--socketdir',
- $node_s->host, '--subscriber-port',
- $node_s->port, '--publication',
- 'pub1', '--publication',
- 'Pub2', '--replication-slot',
- 'replslot1', '--replication-slot',
- 'replslot2', '--database',
- $db1, '--database',
- $db2
+ 'pg_createsubscriber',
+ '--verbose',
+ '--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
+ '--pgdata' => $node_s->data_dir,
+ '--publisher-server' => $node_p->connstr($db1),
+ '--socketdir' => $node_s->host,
+ '--subscriber-port' => $node_s->port,
+ '--publication' => 'pub1',
+ '--publication' => 'pub2',
+ '--replication-slot' => 'replslot1',
+ '--replication-slot' => 'replslot2',
+ '--database' => $db1,
+ '--database' => $db2,
  ],
  'run pg_createsubscriber on node S');


==
[1] https://www.postgresql.org/docs/devel/app-pgcreatesubscriber.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2025-01-13 Thread jian he
On Sat, Jan 11, 2025 at 5:54 PM Kirill Reshke  wrote:
>
> On Fri, 10 Jan 2025 at 11:38, jian he  wrote:
> > I think there are three remaining issues that may need more attention
> > 1.
> > Table 27.42. pg_stat_progress_copy View
> > (pg_stat_progress_copy)
> > column pg_stat_progress_copy.tuples_skipped now the description is
> > ""
> > When the ON_ERROR option is set to ignore, this value shows the number of 
> > tuples
> > skipped due to malformed data. When the ON_ERROR option is set to 
> > set_to_null,
> > this value shows the number of tuples where malformed data was converted to
> > NULL.
> > """
> > now the column name tuples_skipped would not be that suitable for
> > (on_error set_to_null).
> > since now it is not tuple skipped, it is in a tuple some value was set to 
> > null.
>
> Indeed this is something we need to fix.
>
> > Or
> > we can skip progress reports for (on_error set_to_null) case.
>
> Maybe we can add a `malformed_tuples` column to this view?
>
We can do this later.
So for on_error set_to_null, I've removed the pgstat_progress_update_param
related code.

The attached patch also includes some documentation and error message enhancements.
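
To illustrate the intended behavior (the example table and data are mine; the
exact NOTICE and error wording may differ from the final patch):

CREATE TABLE t (a int, b int NOT NULL);

COPY t FROM STDIN WITH (FORMAT csv, ON_ERROR set_to_null);
1,2
oops,3
\.
-- the second row is stored as (NULL, 3); a conversion failure in column b
-- would instead raise a not-null constraint violation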
From a95d42bf1e6044c6c9a2afbb15d168d6679eceab Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 14 Jan 2025 13:46:12 +0800
Subject: [PATCH v11 1/1] COPY (on_error set_to_null)

Extent "on_error action", introduce new option:  on_error set_to_null.
Current grammar makes us unable to use "on_error null", so we choose "on_error set_to_null".

Any data type conversion errors during the COPY FROM process will result in the
affected column being set to NULL. This only applicable when using the
non-binary format for COPY FROM. However, the not-null constraint will still be
enforced. If a conversion error leads to a NULL value in a column that has a
not-null constraint, a not-null constraint violation error will be triggered.

A regression test for a domain with a not-null constraint has been added.

Author: Jian He ,
Author: Kirill Reshke 

Reviewed-by:
Fujii Masao ,
Jim Jones ,
"David G. Johnston" ,
Yugo NAGATA ,
torikoshia 

discussion: https://postgr.es/m/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=bp3d1_asfe...@mail.gmail.com
---
 doc/src/sgml/ref/copy.sgml   | 34 +-
 src/backend/commands/copy.c  |  6 ++-
 src/backend/commands/copyfrom.c  | 29 
 src/backend/commands/copyfromparse.c | 46 +-
 src/bin/psql/tab-complete.in.c   |  2 +-
 src/include/commands/copy.h  |  1 +
 src/include/commands/copyfrom_internal.h |  4 +-
 src/test/regress/expected/copy2.out  | 60 
 src/test/regress/sql/copy2.sql   | 46 ++
 9 files changed, 201 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 8394402f09..5e1d08ab91 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -394,23 +394,34 @@ COPY { table_name [ ( error_action value of
-  stop means fail the command, while
-  ignore means discard the input row and continue with the next one.
+  stop means fail the command,
+  ignore means discard the input row and continue with the next one, and
+  set_to_null means replace columns containing erroneous input values with
+  NULL and move to the next field.
   The default is stop.
  
  
-  The ignore option is applicable only for COPY FROM
+  The ignore and set_to_null
+  options are applicable only for COPY FROM
   when the FORMAT is text or csv.
  
  
-  A NOTICE message containing the ignored row count is
-  emitted at the end of the COPY FROM if at least one
-  row was discarded. When LOG_VERBOSITY option is set to
-  verbose, a NOTICE message
+  For ignore option,
+  a NOTICE message containing the ignored row count is
+  emitted at the end of the COPY FROM if at least one row was discarded.
+  For set_to_null option,
+  a NOTICE message containing the count of rows whose erroneous input values were
+  replaced by NULL is emitted at the end of the COPY FROM if at least one row was replaced.
+ 
+ 
+  When LOG_VERBOSITY option is set to verbose,
+  for ignore option, a NOTICE message
   containing the line of the input file and the column name whose input
-  conversion has failed is emitted for each discarded row.
-  When it is set to silent, no message is emitted
-  regarding ignored rows.
+  conversion has failed is emitted for each discarded row;
+  for set_to_null option,
+  a NOTICE message containing the line of the input file and the column name
+  whose value was replaced with NULL is emitted for each discarded conversion failure.
+  When it is set to silent, no message is emitted regarding rows with failed input conversion.
  
 

@@ -458,7 +469,8 @@ COPY { table_name [ ( 
  
   This is currently used in COPY FROM

Re: Possible integer overflow in bringetbitmap()

2025-01-13 Thread Michael Paquier
On Fri, Jan 10, 2025 at 11:22:37AM -0800, James Hunter wrote:
> Attached the proposed one-line fix.

Thanks, applied and backpatched down to v13.  While on it, I've
checked all the other amgetbitmap callbacks and they all rely on an
int64 for the result they return.  So these parts are OK.
--
Michael


signature.asc
Description: PGP signature


Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size

2025-01-13 Thread Peter Smith
Hi Shubham,

In a previous review [1, comment #3] I was complaining about the lack
of output emitted by the pg_log_debug() because I wanted to see some
LOG evidence of the bad value of max_slot_wal_keep_size.

FYI, I think that mystery has finally been solved.

AFAIK 2 separate problems were contributing to the lack of debug output.

1. Unlike the 'command_ok' subroutine, the 'command_checks_all'
subroutine apparently redirects the output before doing the requested
pattern matching on it. So there is no output to be seen.

2. After that, I did not realise the effect of '--verbose' is
cumulative on loglevel. Specifically, to see debug messages it is not
enough to just say '--verbose'. In fact, you need to specify it 2
times: '--verbose', '--verbose'

~

After addressing these (e.g. by executing the same pg_createsubscriber
using 'command_ok' and using a double '--verbose'), the expected LOGS
can be seen.

e.g. in the file tmp_check/log/regress_log_040_pg_createsubscriber
--
...
pg_createsubscriber: checking settings on publisher
pg_createsubscriber: publisher: wal_level: logical
pg_createsubscriber: publisher: max_replication_slots: 10
pg_createsubscriber: publisher: current replication slots: 2
pg_createsubscriber: publisher: max_wal_senders: 10
pg_createsubscriber: publisher: current wal senders: 1
pg_createsubscriber: publisher: max_prepared_transactions: 0
pg_createsubscriber: publisher: max_slot_wal_keep_size: 10485760
pg_createsubscriber: warning: publisher requires WAL size must not be restricted
pg_createsubscriber: hint: Set the configuration parameter
"max_slot_wal_keep_size" to -1 to ensure that required WAL files are
not prematurely removed.
pg_createsubscriber: stopping the subscriber
...
--

See the attached diff for the TAP changes I used to expose this
logging. Apply this atop v8.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPszk61QM%2BcEvq_1-A2y%2BJrAD0USB%2BNvtcidajYOfHDkyw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl 
b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
index c1fc0cf..773a27b 100644
--- a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
+++ b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
@@ -327,7 +327,7 @@ $node_p->reload;
 # 'max_slot_wal_keep_size'
 command_checks_all(
[
-   'pg_createsubscriber', '--verbose',
+   'pg_createsubscriber', '--verbose', '--verbose',
'--recovery-timeout', 
"$PostgreSQL::Test::Utils::timeout_default",
'--dry-run', '--pgdata',
$node_s->data_dir, '--publisher-server',
@@ -350,6 +350,26 @@ command_checks_all(
'Validate warning for misconfigured max_slot_wal_keep_size on the 
publisher'
 );
 
+# And, same again to see the output...
+command_ok(
+   [
+   'pg_createsubscriber', '--verbose', '--verbose',
+   '--recovery-timeout', 
"$PostgreSQL::Test::Utils::timeout_default",
+   '--dry-run', '--pgdata',
+   $node_s->data_dir, '--publisher-server',
+   $node_p->connstr($db1), '--socketdir',
+   $node_s->host, '--subscriber-port',
+   $node_s->port, '--publication',
+   'pub1', '--publication',
+   'pub2', '--subscription',
+   'sub1', '--subscription',
+   'sub2', '--database',
+   $db1, '--database',
+   $db2
+   ],
+   'Ditto above, but this time using command_ok to retain the output'
+);
+
 # Reset 'max_slot_wal_keep_size' to default after the test
 $node_p->append_conf('postgresql.conf', 'max_slot_wal_keep_size = -1');
 $node_p->reload;


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread Michael Paquier
On Tue, Jan 14, 2025 at 12:27:30PM +0700, John Naylor wrote:
> We can do about as well simply by changing the nibble lookup to a byte
> lookup, which works on every compiler and architecture:
> 
> select hex_encode_test(100, 1024);
> master:
> Time: 1158.700 ms
> v2:
> Time:  777.443 ms
> 
> If we need to do much better than this, it seems better to send the
> data to the client as binary, if possible.

That's pretty cool.  Complex to parse, still really cool.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread John Naylor
On Sat, Jan 11, 2025 at 3:46 AM Nathan Bossart  wrote:
>
> I was able to get auto-vectorization to take effect on Apple clang 16 with
> the following addition to src/backend/utils/adt/Makefile:
>
> encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8
>
> This gave the following results with your hex_encode_test() function:
>
> buf  | HEAD  | patch | % diff
>   ---+---+---+
>   16 |21 |16 |   24
>   64 |54 |41 |   24
>  256 |   138 |   100 |   28
> 1024 |   441 |   300 |   32
> 4096 |  1671 |  1106 |   34
>16384 |  6890 |  4570 |   34
>65536 | 27393 | 18054 |   34

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(100, 1024);
master:
Time: 1158.700 ms
v2:
Time:  777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.
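
In case anyone wants to double-check the 512-byte table, it can be
regenerated with a query like this (just a verification aid, not part of the
patch):

SELECT string_agg(lpad(to_hex(i), 2, '0'), '' ORDER BY i)
FROM generate_series(0, 255) AS i;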

-- 
John Naylor
Amazon Web Services
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..8b059bc834 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,7 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +165,8 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
-		src++;
+		memcpy(dst, &hextbl[(* ((unsigned char *) src)) * 2], 2);
+		src++; dst+=2;
 	}
 	return (uint64) len * 2;
 }


Re: Skip collecting decoded changes of already-aborted transactions

2025-01-13 Thread Amit Kapila
On Tue, Jan 14, 2025 at 7:32 AM Peter Smith  wrote:
>
> Hi Sawada-San.
>
> Some review comments for patch v13-0002.
>
> ==
>
> I think the v12 ambiguity of RBTXN_PREPARE versus RBTXN_SENT_PREPARE
> was mostly addressed already by the improved comments for the macros
> in patch 0001.
>
> Meanwhile, patch v13-0002 says it is renaming constants for better
> consistency, but I don't think it went far enough.
>
> For example, better name consistency would be achieved by changing
> *all* of the constants related to prepared transactions:
>
> #define RBTXN_IS_PREPARED  0x0040
> #define RBTXN_IS_PREPARED_SKIPPED  0x0080
> #define RBTXN_IS_PREPARED_SENT 0x0200
>
> where:
>
> RBTXN_IS_PREPARED. This means it's a prepared transaction. (but we
> can't tell from this if it is skipped or sent).
>
> RBTXN_IS_PREPARED_SKIPPED. This means it's a prepared transaction
> (RBTXN_IS_PREPARED) and it's being skipped.
>
> RBTXN_IS_PREPARED_SENT. This means it's a prepared transaction
> (RBTXN_IS_PREPARED) and we've sent it.
>

The first one (RBTXN_IS_PREPARED) sounds like an improvement over what
we have now. I am not convinced about the other two.

> ~
>
> A note about RBTXN_IS_PREPARED. Since all of these constants are
> clearly about transactions (e.g. "TXN" in prefix "RBTXN_"), I felt
> patch 0002 calling this RBTXN_IS_PREPARED_TXN just seemed like adding
> a redundant _TXN. e.g. we don't say RBTXN_IS_COMMITTED_TXN etc.
>

+1. I felt the same.


-- 
With Regards,
Amit Kapila.




Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.

2025-01-13 Thread Rahila Syed
Hi,

If a DSM is created or attached from an interrupt handler while a
transaction is being rolled back, it may result in the following error:
"ResourceOwnerEnlarge called after release started"
This was found during testing of the Enhancing Memory Context Reporting
feature by Fujii Masao [1].

I propose a fix that falls back to creating the DSM or DSA with a NULL
resource owner if CurrentResourceOwner is in the "releasing" state.
Since a DSM or DSA can already be created with a NULL resource owner,
meaning it persists for the lifetime of the backend or until explicitly
detached, this approach should be acceptable.
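
In essence, the fallback looks something like this (a simplified sketch;
resource_owner_is_releasing() is a hypothetical helper standing in for the
"releasing" check, and the attached patch is the authoritative version):

	ResourceOwner saved_owner = CurrentResourceOwner;
	dsm_segment *seg;

	/* Don't register with an owner that has already started releasing. */
	if (saved_owner != NULL && resource_owner_is_releasing(saved_owner))
		CurrentResourceOwner = NULL;

	/* With a NULL owner the mapping lives until backend exit or explicit detach. */
	seg = dsm_create(size, 0);

	CurrentResourceOwner = saved_owner;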

Please find attached a patch which does that. Kindly let me know your views.


[1].
 
https://www.postgresql.org/message-id/a1a7e2b7-8f33-4313-baff-42e92ec14fd3%40oss.nttdata.com



0001-Prevent-the-error-on-creating-a-dsm-segment-from-an-.patch
Description: Binary data


Re: Psql meta-command conninfo+

2025-01-13 Thread Hunaid Sohail
Hi,

On Mon, Jan 13, 2025 at 4:12 PM Dean Rasheed 
wrote:

> I don't like the 3-table format either. I think it should be a single
> table.
>
> The trouble with v35 is that it produces 1 row with lots of columns,
> which is pretty unreadable unless expanded mode is used. So I think we
> should just do it that way round by default -- i.e., make it like
> \dconfig and have 2 columns, "Parameter" and "Value", with lots of
> rows.
>
> Perhaps it could also have a "Description" column, which might help
> with things like distinguishing between authenticated user and session
> user.
>

I've tried the approach and attached the output.
I'm not attaching the patch as it requires some formatting. Does this look
good?
I have added a one-line description that is consistent with the libpq docs.

postgres=# \conninfo+
                                     Connection Information
      Parameter       |         Value          |                       Description
----------------------+------------------------+---------------------------------------------------------
 Database             | postgres               | Displays the database name of the connection.
 Client User          | hunaid                 | Displays the name of the user connected to the database as returned by PQuser.
 Host                 | localhost              | Displays the server host name of the active connection.
 Host Address         | 127.0.0.1              | Displays the server IP address of the active connection.
 Port                 | 5430                   | Displays the port number used for the database connection.
 Options              |                        | Displays any command-line options passed in the connection request.
 Protocol Version     | 3                      | Displays the frontend/backend protocol being used.
 Password Used        | false                  | Indicates whether a password was used during authentication.
 GSSAPI Authenticated | false                  | Indicates if GSSAPI authentication was used for the connection.
 Backend PID          | 43465                  | Displays the process ID of the backend for the connection.
 SSL Connection       | true                   | Indicates whether SSL encryption is in use for the connection.
 SSL Protocol         | TLSv1.3                | Displays the SSL protocol used for the connection.
 SSL Cipher           | TLS_AES_256_GCM_SHA384 | Displays the SSL cipher used for the connection.
 SSL Compression      | off                    | Indicates whether SSL compression is enabled.
 ALPN                 | postgresql             | Displays the ALPN protocol used in the SSL connection.
(15 rows)
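
For what it's worth, this is roughly how I imagine the three-column output
could be built with psql's existing printTable API (a sketch only;
describe_connection_info() and the rows[] array are my inventions, not the
actual patch):

static bool
describe_connection_info(PGconn *conn)
{
	printTableContent cont;
	printTableOpt myopt = pset.popt.topt;
	struct
	{
		const char *param;
		const char *value;
		const char *description;
	}			rows[] = {
		{"Database", PQdb(conn), "Displays the database name of the connection."},
		{"Client User", PQuser(conn), "Displays the name of the user connected to the database as returned by PQuser."},
		{"Host", PQhost(conn), "Displays the server host name of the active connection."},
		/* ... remaining parameters elided ... */
	};

	printTableInit(&cont, &myopt, "Connection Information", 3, lengthof(rows));
	printTableAddHeader(&cont, (char *) "Parameter", true, 'l');
	printTableAddHeader(&cont, (char *) "Value", true, 'l');
	printTableAddHeader(&cont, (char *) "Description", true, 'l');

	for (int i = 0; i < lengthof(rows); i++)
	{
		printTableAddCell(&cont, (char *) rows[i].param, false, false);
		printTableAddCell(&cont, (char *) rows[i].value, false, false);
		printTableAddCell(&cont, (char *) rows[i].description, false, false);
	}

	printTable(&cont, pset.queryFout, false, pset.logfile);
	printTableCleanup(&cont);

	return true;
}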

I was wondering how we can show the descriptions of parameters
from PQparameterStatus. We can create an array of descriptions,
but if the order changes somehow, it can mess things up.

Looking forward to your opinions. Can we continue with this approach?

Regards,
Hunaid Sohail


Re: Conflict detection for update_deleted in logical replication

2025-01-13 Thread Amit Kapila
On Tue, Jan 14, 2025 at 7:14 AM Masahiko Sawada  wrote:
>
> On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila  wrote:
> >
> > I don't think we can avoid accumulating garbage especially when the
> > workload on the publisher is more. Consider the current case being
> > discussed, on the publisher, we have 30 clients performing read-write
> > operations and there is only one pair of reader (walsender) and writer
> > (apply_worker) to perform all those write operations on the
> > subscriber. It can never match the speed and the subscriber side is
> > bound to have less performance (or accumulate more bloat) irrespective
> > of its workload. If there is one client on the publisher performing
> > operation, we won't see much degradation but as the number of clients
> > increases, the performance degradation (and bloat) will keep on
> > increasing.
> >
> > There are other scenarios that can lead to the same situation, such as
> > a large table sync, the subscriber node being down for sometime, etc.
> > Basically, any case where apply_side lags by a large amount from the
> > remote node.
> >
> > One idea to prevent the performance degradation or bloat increase is
> > to invalidate the slot, once we notice that subscriber lags (in terms
> > of WAL apply) behind the publisher by a certain threshold. Say we have
> > max_lag (or max_lag_behind_remote) (defined in terms of seconds)
> > subscription option which allows us to stop calculating
> > oldest_nonremovable_xid for that subscription. We can indicate that
> > via some worker_level parameter. Once all the subscriptions on a node
> > that has enabled retain_conflict_info have stopped calculating
> > oldest_nonremovable_xid, we can invalidate the slot. Now, users can
> > check this and need to disable/enable retain_conflict_info to again
> > start retaining the required information. The other way could be that
> > instead of invalidating the slot, we directly drop/re-create the slot
> > or increase its xmin. If we choose to advance the slot automatically
> > without user intervention, we need to let users know via LOG and or
> > via information in the view.
> >
> > I think such a mechanism via the new option max_lag will address your
> > concern: "It's reasonable behavior for this approach but it might not
> > be a reasonable outcome for users if they could be affected by such a
> > performance dip without no way to avoid it." as it will provide a way
> > to avoid performance dip only when there is a possibility of such a
> > dip.
> >
> > I mentioned max_lag as a subscription option instead of a GUC because
> > it applies only to subscriptions that have enabled
> > retain_conflict_info but we can consider it to be a GUC if you and
> > others think so provided the above proposal sounds reasonable. Also,
> > max_lag could be defined in terms of LSN as well but I think time
> > would be easy to configure.
> >
> > Thoughts?
>
> I agree that we cannot avoid accumulating dead tuples when the
> workload on the publisher is more, and which affects the subscriber
> performance. What we need to do is to update slot's xmin as quickly as
> possible to minimize the dead tuple accumulation at least when the
> subscriber is not much behind. If there is a tradeoff for doing so
> (e.g., vs. the publisher performance), we need to provide a way for
> users to balance it.
>

As of now, I can't think of a way to throttle the publisher when the
apply_worker lags. Basically, we need some way to throttle (reduce the
speed of backends) when the apply worker is lagging behind a threshold
margin. Can you think of some way? I thought if one notices frequent
invalidation of the launcher's slot due to max_lag, then they can
rebalance their workload on the publisher.

>
  The max_lag idea sounds interesting for the case
> where the subscriber is much behind. Probably we can visit this idea
> as a new feature after completing this feature.
>

Sure, but what will be our answer to users for cases where the
performance tanks due to bloat accumulation? The tests show that once
the apply_lag becomes large, it becomes almost impossible for the
apply worker to catch up (or take a very long time) and advance the
slot's xmin. The users can disable retain_conflict_info to bring back
the performance and get rid of bloat, but I thought it would be easier
for users to do that if we have some knob so that they don't need to
wait until the problem of bloat/performance dip actually happens.

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2025-01-13 Thread Michail Nikolaev
Hello, everyone!

Just curious - are there any plans to continue working on this project?

Best regards,
Mikhail.


Re: Skip collecting decoded changes of already-aborted transactions

2025-01-13 Thread Masahiko Sawada
On Mon, Jan 13, 2025 at 3:07 AM Amit Kapila  wrote:
>
> On Tue, Jan 7, 2025 at 7:22 AM Peter Smith  wrote:
> >
> > ==
> > src/include/replication/reorderbuffer.h
> >
> > 6.
> >  #define RBTXN_PREPARE  0x0040
> >  #define RBTXN_SKIPPED_PREPARE   0x0080
> >  #define RBTXN_HAS_STREAMABLE_CHANGE 0x0100
> > +#define RBTXN_SENT_PREPARE 0x0200
> > +#define RBTXN_IS_COMMITTED 0x0400
> > +#define RBTXN_IS_ABORTED 0x0800
> >
> > Something about this new RBTXN_SENT_PREPARE name seems inconsistent to me.
> >
> > I feel there is now also some introduced ambiguity with these macros:
> >
> > /* Has this transaction been prepared? */
> > #define rbtxn_prepared(txn) \
> > ( \
> > ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
> > )
> >
> > +/* Has a prepare or stream_prepare already been sent? */
> > +#define rbtxn_sent_prepare(txn) \
> > +( \
> > + ((txn)->txn_flags & RBTXN_SENT_PREPARE) != 0 \
> > +)
> >
> >
> > e.g. It's also not clear from the comments what is the distinction
> > between the existing macro comment "Has this transaction been
> > prepared?" and the new macro comment "Has a prepare or stream_prepare
> > already been sent?".
> >
> > Indeed, I was wondering if some of the places currently calling
> > "rbtxn_prepared(txn)" should now strictly be calling
> > "rbtxn_sent_prepared(txn)" macro instead?
> >
>
> Right, I think after this change, it appears we should try to rename
> the existing constants. One place where we can consider to use new
> macro is the current usage of rbtxn_prepared() in
> SnapBuildDistributeNewCatalogSnapshot().

I think that RBTXN_PREPARE would mean that the transaction needs to be
prepared but it doesn't mean that a prepare or a stream_prepare has
already been sent. And RBTXN_SENT_PREPARE adds some internal details
about whether a prepare or a stream_prepare has actually been sent.
IIUC RBTXN_SENT_PREPARE is used only for a short time in
ReorderBufferPrepare(). So code outside of reorderbuffer, such as
snapbuild.c, doesn't need to care about RBTXN_SENT_PREPARE.

>
> > IMO some minor renaming of the existing constants (and also their
> > associated macros) might help to make all this more coherent. For
> > example, perhaps like:
> >
> > #define RBTXN_IS_PREPARE_NEEDED0x0040
> >
>
> The other option could be RBTXN_IS_PREPARE_REQUESTED.

I'm a bit concerned that these names sound like a state that the
transaction needs to be prepared but has not been done yet. But
rbtxn_prepared() is widely used to check if the transaction is a
prepared transaction regardless of a prepare or a stream_prepare
actually being sent. How about RBTXN_IS_PREPARED_TXN and
rbtxn_is_prepared_txn()? I think it would indicate well that the
transaction needs to be processed as a prepared transaction.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Skip collecting decoded changes of already-aborted transactions

2025-01-13 Thread Masahiko Sawada
On Mon, Jan 6, 2025 at 5:52 PM Peter Smith  wrote:
>
> Hi Sawada-San.
>
> Here are some review comments for the patch v12-0001.

Thank you for reviewing the patch!

>
> ==
> .../replication/logical/reorderbuffer.c
>
> ReorderBufferCheckAndTruncateAbortedTXN:
>
> 1.
> +/*
> + * Check the transaction status by looking CLOG and discard all changes if
> + * the transaction is aborted. The transaction status is cached in
> + * txn->txn_flags so we can skip future changes and avoid CLOG lookups on the
> + * next call.
> + *
> + * Return true if the transaction is aborted, otherwise return false.
> + *
> + * When the 'debug_logical_replication_streaming' is set to "immediate", we
> + * don't check the transaction status, meaning the caller will always process
> + * this transaction.
> + */
>
> Typo "by looking CLOG".
>
> It should be something like "by CLOG lookup".

Fixed.

>
> ~~~
>
> 2.
> + /* Quick return if the transaction status is already known */
> + if (rbtxn_is_committed(txn))
> + return false;
> + if (rbtxn_is_aborted(txn))
> + {
> + /* Already-aborted transactions should not have any changes */
> + Assert(txn->size == 0);
> +
> + return true;
> + }
> +
>
> Consider changing that 2nd 'if' to be 'else if', because then that
> will make it more obvious that the earlier single line comment "Quick
> return if...", in fact applies to both these conditions.
>
> Alternatively, make that a block comment and add some blank lines like:
>
> + /*
> +* Quick returns if the transaction status is already known.
> +*/
> +
> + if (rbtxn_is_committed(txn))
> + return false;
> +
> + if (rbtxn_is_aborted(txn))
> + {
> + /* Already-aborted transactions should not have any changes */
> + Assert(txn->size == 0);
> +
> + return true;
> + }

I used a block comment.

>
> ~~~
>
> 3.
> + if (TransactionIdDidCommit(txn->xid))
> + {
> + /*
> + * Remember the transaction is committed so that we can skip CLOG
> + * check next time, avoiding the pressure on CLOG lookup.
> + */
> + Assert(!rbtxn_is_aborted(txn));
> + txn->txn_flags |= RBTXN_IS_COMMITTED;
> + return false;
> + }
> +
> + /*
> + * The transaction aborted. We discard the changes we've collected so far
> + * and toast reconstruction data. The full cleanup will happen as part of
> + * decoding ABORT record of this transaction.
> + */
> + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
> + ReorderBufferToastReset(rb, txn);
> +
> + /* All changes should be discarded */
> + Assert(txn->size == 0);
> +
> + /*
> + * Mark the transaction as aborted so we can ignore future changes of this
> + * transaction.
> + */
> + Assert(!rbtxn_is_committed(txn));
> + txn->txn_flags |= RBTXN_IS_ABORTED;
> +
> + return true;
> +}
>
> 3a.
> That whole last part related to "The transaction aborted", might be
> clearer if the whole chunk of code was in an 'else' block from the
> previous "if (TransactionIdDidCommit(txn->xid))".

I'm not sure it increases the readability. It makes sense to me that
we return false in the 'if (TransactionIdDidCommit(txn->xid))' block.
If we add an 'else' block, the reader might be confused by having an
'else' block in spite of the return in the 'if' block. We could add a
local variable for the result and return it at the end of the
function, but I'm not sure that would improve readability.

>
> ~
>
> 3b.
> "toast" is an acronym so it should be written in uppercase IMO.
>
> ~

Hmm, it seems we don't spell it in uppercase as "TOAST" anywhere in
reorderbuffer.c. I would prefer to keep it consistent with the rest of the file.

>
> 3c.
> The "and toast reconstruction data" seems to be missing a word/s. (??)
> - "... and also discard TOAST reconstruction data"
> - "... and reset TOAST reconstruction data"

I don't understand this comment. What words are you suggesting adding
to these sentences?

>
> ~~~
>
> ReorderBufferMaybeMarkTXNStreamed:
>
> 4.
> +static void
> +ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
> + /*
> + * The top-level transaction, is marked as streamed always, even if it
> + * does not contain any changes (that is, when all the changes are in
> + * subtransactions).
> + *
> + * For subtransactions, we only mark them as streamed when there are
> + * changes in them.
> + *
> + * We do it this way because of aborts - we don't want to send aborts for
> + * XIDs the downstream is not aware of. And of course, it always knows
> + * about the toplevel xact (we send the XID in all messages), but we never
> + * stream XIDs of empty subxacts.
> + */
> + if (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0))
> + txn->txn_flags |= RBTXN_IS_STREAMED;
> +}
>
> /the toplevel xact/the top-level xact/

Fixed.

>
> ~~~
>
> 5.
>   /*
> - * We send the prepare for the concurrently aborted xacts so that later
> - * when rollback prepared is decoded and sent, the downstream should be
> - * able to rollback such a xact. See comments atop DecodePrepare.
> - *
> - * Note, for the concurrent_abort + streaming case a stre

InitControlFile misbehaving on graviton

2025-01-13 Thread Christoph Berg
Bernd and I have been chasing a bug that happens when all of the
following conditions are fulfilled:

* PG 15..18 (older PGs are ok)
* gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
* arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
* -O2 (ok with -O0)
* --with-openssl (ok without openssl)
* using no -march flag, or using -march=armv8.4-a (using `-march=armv9-a` fixes it)

The problem happens early during initdb:

$ ./configure --with-openssl --enable-debug
...
$ /usr/local/pgsql/bin/initdb -D broken --no-clean
...
running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  
control file contains invalid database cluster state
child process exited with exit code 1
initdb: data directory "broken" not removed at user's request

Looking at the control file, we can see that the cluster state is
"starting up":

$ /usr/local/pgsql/bin/pg_controldata broken/
pg_control version number:1700
Catalog version number:   202501101
Database system identifier:   7459462110308027428
Database cluster state:   starting up
pg_control last modified: Mon 13 Jan 2025 06:02:44 PM UTC
Latest checkpoint location:   0/128
Latest checkpoint's REDO location:0/128
Latest checkpoint's REDO WAL file:00010001
Latest checkpoint's TimeLineID:   1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:  0:3

The relevant code is in BootStrapXLOG():

/* Now create pg_control */
InitControlFile(sysidentifier, data_checksum_version);
ControlFile->time = checkPoint.time;
ControlFile->checkPoint = checkPoint.redo;
ControlFile->checkPointCopy = checkPoint;

/* some additional ControlFile fields are set in WriteControlFile() */
WriteControlFile();

and InitControlFile():

if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
ereport(PANIC,
(errcode(ERRCODE_INTERNAL_ERROR),
 errmsg("could not generate secret authorization token")));

memset(ControlFile, 0, sizeof(ControlFileData));
/* Initialize pg_control status fields */
ControlFile->system_identifier = sysidentifier;
memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, 
MOCK_AUTH_NONCE_LEN);
ControlFile->state = DB_SHUTDOWNED;

So the state should actually be DB_SHUTDOWNED (1), but on this system,
the value is DB_STARTUP (0).

Stepping through InitControlFile we can see that ControlFile->state is
never written to. (The trace jumps back to InitControlFile a lot
because it seems inlined into BootStrapXLOG):

$ cd broken
$ rm -f global/pg_control; PGDATA=$PWD gdb /usr/lib/postgresql/17/bin/postgres
Reading symbols from /usr/local/pgsql/bin/postgres...
(gdb) b InitControlFile
Breakpoint 1 at 0x1819bc: file xlog.c, line 4214.
(gdb) r --boot -F -c log_checkpoints=false -X 16777216 -k
Starting program: /usr/local/pgsql/bin/postgres --boot -F -c 
log_checkpoints=false -X 16777216 -k

Breakpoint 1, 0xaac219bc in InitControlFile (sysidentifier=<optimized out>, data_checksum_version=<optimized out>) at xlog.c:4214
4214if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
(gdb) s
5175InitControlFile(sysidentifier, data_checksum_version);
(gdb)
InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at 
xlog.c:4214
4214if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
(gdb)
pg_strong_random (buf=buf@entry=0xf670, len=len@entry=32) at 
pg_strong_random.c:79
79  for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
(gdb)
81  if (RAND_status() == 1)
(gdb)
87  RAND_poll();
(gdb)
90  if (RAND_bytes(buf, len) == 1)
(gdb)
InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at 
xlog.c:4219
4219memset(ControlFile, 0, sizeof(ControlFileData));
(gdb)
4221ControlFile->system_identifier = sysidentifier;
(gdb)
5175InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0xaac21a08 in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4221
4221ControlFile->system_identifier = sysidentifier;
(gdb)
4222memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, 
MOCK_AUTH_NONCE_LEN);
(gdb)
5175InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0xaac21a3c in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4234
4234ControlFile->track_commit_timestamp = track_commit_timestamp;
(gdb)
5175InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0xaac21a4c in InitControlFile (sysidentifier=7459466832287685723, 
data_checksum_version=1) at xlog.c:4232
4232ControlFile->wal_level = wal_level;
(gdb)
5175InitControlFile(sysidentifier, data_

Re: Adjusting hash join memory limit to handle batch explosion

2025-01-13 Thread Melanie Plageman
On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra  wrote:
>
> I had a quiet evening yesterday, so I decided to take a stab at this and
> see how hard would it be, and how bad would the impact be. Attached is
> an experimental patch, doing the *bare* minimum for a simple query:
>
> 1) It defines a limit of 128 batches (a bit low, but also 1MB). In
> practice we'd use something like 256 - 1024, probably. Doesn't matter.
>
> 2) Ensures the initial pass over data in MultiExecPrivateHash does not
> use more than 128 batches, switches to "tooManyBatches=true" if that
> happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
> it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
> increase the nbatch further.
>
> 3) Does something similar for the outer relation - if there are too many
> batches, we do ExecHashJoinRepartitionBatches() which first partitions
> into 128 batches. This only does a single pass in the WIP, though.
> Should be recursive or something.
>
> 4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
> that the code can free the buffers. It also means the buffer needs to be
> allocated separately, not embedded in BufFile struct. (I'm a bit
> surprised it works without having to re-read the buffer after freeing
> it, but that's probably thanks to how hashjoin uses the files).

I started looking at this. Even though you do explain what it does
above, I still found it a bit hard to follow. Could you walk through
an example (like the one you gave in SQL) but explaining what happens
in the implementation? Basically what you have in 2 and 3 above but
with a specific example.

This is my understanding of what this does:
if we are at the max number of batches when building the hashtable and
we run out of space and need to double nbatches, we
1. dump the data from the current batch that is in the hashtable into a file
2. close and flush all the currently open buffiles, double the number
of batches, and then only open files for the batches we need to store
tuples from the batch we were trying to put in the hashtable when we
hit the limit (now in a temp file)
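
Or, restating that in rough pseudocode (only my reading, reusing the function
names from your description plus a placeholder hashtable_out_of_space(); it
almost certainly doesn't match the WIP patch exactly):

	while (hashtable_out_of_space(hashtable))
	{
		/* 1. spill the in-memory batch to its own file */
		ExecHashDumpBatchToFile(hashtable, hashtable->curbatch);

		/* free per-batch file buffers so doubling nbatch stays within budget */
		for (int i = 0; i < hashtable->nbatch; i++)
			BufFileFreeBuffer(hashtable->innerBatchFile[i]);

		/*
		 * 2. double nbatch; only reopen the files that receive tuples from the
		 * batch we were loading when we hit the limit
		 */
		ExecHashIncreaseNumBatches(hashtable);
	}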

I also don't understand why ExecHashJoinRepartitionBatches() is needed
-- I think it has something to do with needing a certain number of
buffers open while processing batch 0, but what does this have to do
with the outer side of the join?

Another random question: why doesn't ExecHashHandleTooManyBatches()
free the outer batch files?

- Melanie




Re: [PATCH] Add get_bytes() and set_bytes() functions

2025-01-13 Thread Aleksander Alekseev
Hi Michael,

> v5-0001 includes the following output:
>
> --- a/src/test/regress/expected/opr_sanity.out
> +++ b/src/test/regress/expected/opr_sanity.out
> @@ -126,9 +126,12 @@ WHERE p1.oid < p2.oid AND
>   p1.proretset != p2.proretset OR
>   p1.provolatile != p2.provolatile OR
>   p1.pronargs != p2.pronargs);
> - oid | proname | oid | proname
> --+-+-+-
> -(0 rows)
> + oid  | proname  | oid  | proname
> +--+--+--+-
> + 2405 | int2send | 8577 | bytea
> + 2407 | int4send | 8578 | bytea
> + 2409 | int8send | 8579 | bytea
> +(3 rows)
>
> This should not happen, as you are using multiple times the same
> function.

Thanks. Here is the corrected patch.

Besides fixing opr_sanity test I corrected error messages:

-ERROR:  bytea size 3 out of valid range, 0..2
+ERROR:  bytea out of valid range, ''..'\x'

... and also included tests for min/max integer values.
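
For the archives, a couple of quick examples of what the casts do with this
patch applied (the exact error wording may still change):

SELECT '\x0102'::bytea::int2;        -- 258
SELECT 258::int2::bytea;             -- \x0102
SELECT '\x0102030405'::bytea::int4;  -- ERROR: bytea out of valid range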

I discarded the 0002 patch that implemented get_bytes() / set_bytes().
This part doesn't seem to get much support, so let's focus on casting.

--
Best regards,
Aleksander Alekseev
From 4b25ab46f10c8d284b2b2f8f3a7998a7b00c12d0 Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev 
Date: Mon, 26 Aug 2024 12:09:59 +0300
Subject: [PATCH v6] Allow casting between bytea and integer types.

For instance:

SELECT '\x12345678'::bytea::integer;
SELECT 0x12345678::bytea;

This works with int2's, int4's and int8's.

Aleksander Alekseev, reviewed by Peter Eisentraut, Michael Paquier

Discussion: https://postgr.es/m/CAJ7c6TPtOp6%2BkFX5QX3fH1SVr7v65uHr-7yEJ%3DGMGQi5uhGtcA%40mail.gmail.com

BUMP CATVERSION
---
 src/backend/utils/adt/int.c  |  68 +
 src/include/catalog/pg_cast.dat  |  14 +++
 src/include/catalog/pg_proc.dat  |  18 
 src/test/regress/expected/opr_sanity.out |   6 ++
 src/test/regress/expected/strings.out| 120 +++
 src/test/regress/sql/strings.sql |  34 +++
 6 files changed, 260 insertions(+)

diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..a240f647a48 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -336,6 +336,74 @@ int4send(PG_FUNCTION_ARGS)
  *		===
  */
 
+/* Common code for bytea_int2, bytea_int4 and bytea_int8 */
+static int64
+bytea_integer(bytea* v, int max_size, const char* max_value)
+{
+	int 	len = VARSIZE_ANY_EXHDR(v);
+	int 	offset = 0;
+	int64 	result = 0;
+
+	if (len > max_size)
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			errmsg("bytea out of valid range, ''..'\\x%s'",
+	max_value)));
+	while (len--)
+	{
+		result = result << 8;
+		result |= ((unsigned char *) VARDATA_ANY(v))[offset];
+		offset++;
+	}
+
+	return result;
+}
+
+/* Cast bytea -> int2 */
+Datum
+bytea_int2(PG_FUNCTION_ARGS)
+{
+	bytea	*v = PG_GETARG_BYTEA_PP(0);
+	PG_RETURN_INT16((int16)bytea_integer(v, sizeof(int16), ""));
+}
+
+/* Cast bytea -> int4 */
+Datum
+bytea_int4(PG_FUNCTION_ARGS)
+{
+	bytea	*v = PG_GETARG_BYTEA_PP(0);
+	PG_RETURN_INT32((int32)bytea_integer(v, sizeof(int32), ""));
+}
+
+/* Cast bytea -> int8 */
+Datum
+bytea_int8(PG_FUNCTION_ARGS)
+{
+	bytea	*v = PG_GETARG_BYTEA_PP(0);
+	PG_RETURN_INT64(bytea_integer(v, sizeof(int64), ""));
+}
+
+/* Cast int2 -> bytea; currently just a wrapper for int2send() */
+Datum
+int2_bytea(PG_FUNCTION_ARGS)
+{
+	return int2send(fcinfo);
+}
+
+/* Cast int4 -> bytea; currently just a wrapper for int4send() */
+Datum
+int4_bytea(PG_FUNCTION_ARGS)
+{
+	return int4send(fcinfo);
+}
+
+/* Cast int8 -> bytea; currently just a wrapper for int8send() */
+Datum
+int8_bytea(PG_FUNCTION_ARGS)
+{
+	return int8send(fcinfo);
+}
+
 Datum
 i2toi4(PG_FUNCTION_ARGS)
 {
diff --git a/src/include/catalog/pg_cast.dat b/src/include/catalog/pg_cast.dat
index a26ba34e869..ab46be606f0 100644
--- a/src/include/catalog/pg_cast.dat
+++ b/src/include/catalog/pg_cast.dat
@@ -320,6 +320,20 @@
 { castsource => 'varchar', casttarget => 'name', castfunc => 'name(varchar)',
   castcontext => 'i', castmethod => 'f' },
 
+# Allow explicit coercions between bytea and integer types
+{ castsource => 'int2', casttarget => 'bytea', castfunc => 'bytea(int2)',
+  castcontext => 'e', castmethod => 'f' },
+{ castsource => 'int4', casttarget => 'bytea', castfunc => 'bytea(int4)',
+  castcontext => 'e', castmethod => 'f' },
+{ castsource => 'int8', casttarget => 'bytea', castfunc => 'bytea(int8)',
+  castcontext => 'e', castmethod => 'f' },
+{ castsource => 'bytea', casttarget => 'int2', castfunc => 'int2(bytea)',
+  castcontext => 'e', castmethod => 'f' },
+{ castsource => 'bytea', casttarget => 'int4', castfunc => 'int4(bytea)',
+  castcontext => 'e', castmethod => 'f' },
+{ castsource => 'bytea', casttarget => 'int8', castfunc => 'int8(bytea)',
+  castcontext => 'e', castmethod => 'f' },
+
 # Allow explicit coercions between int4 and "char"
 { castsource => 'char', casttarget => 'int4', castfunc =>

Re: [PATCH] Add get_bytes() and set_bytes() functions

2025-01-13 Thread Alvaro Herrera
On 2025-Jan-13, Dean Rasheed wrote:

> On Mon, 13 Jan 2025 at 17:36, Aleksander Alekseev
>  wrote:
> >
> > Besides fixing opr_sanity test I corrected error messages:
> >
> > -ERROR:  bytea size 3 out of valid range, 0..2
> > +ERROR:  bytea out of valid range, ''..'\x'
> 
> "smallint out of range", "integer out of range" and "bigint out of
> range" would be more consistent with existing error messages.

But these don't show the acceptable range. We have these that do:

#: utils/adt/varbit.c:1824 utils/adt/varbit.c:1882
#, c-format
msgid "bit index %d out of valid range (0..%d)"

#: utils/adt/varlena.c:3218 utils/adt/varlena.c:3285
#, c-format
msgid "index %d out of valid range, 0..%d"

#: utils/adt/varlena.c:3249 utils/adt/varlena.c:3321
#, c-format
msgid "index %lld out of valid range, 0..%lld"

#: utils/misc/guc.c:3130
#, c-format
msgid "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)"


The quoting in Aleksander's proposal is not great.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Update: super-fast reaction on the Postgres bugs mailing list. The report
was acknowledged [...], and a fix is under discussion.
The wonders of open-source !"
 https://twitter.com/gunnarmorling/status/1596080409259003906




Re: InitControlFile misbehaving on graviton

2025-01-13 Thread Matthias van de Meent
On Mon, 13 Jan 2025 at 20:04, Christoph Berg  wrote:
>
> Bernd and I have been chasing a bug that happens when all of the
> following conditions are fulfilled:
>
> * PG 15..18 (older PGs are ok)
> * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> * -O2 (ok with -O0)
> * --with-openssl (ok without openssl)
> * using no -m flag, or using -marm8.4-a (using `-march=armv9-a` fixes it)
>
> The problem happens early during initdb:
>
> $ ./configure --with-openssl --enable-debug
> ...
> $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> ...
> running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  
> control file contains invalid database cluster state
> child process exited with exit code 1
> initdb: data directory "broken" not removed at user's request

Yes, weird.

> (gdb) disassemble
> Dump of assembler code for function BootStrapXLOG:
>0xaac21708 <+0>: stp x29, x30, [sp, #-272]!
>0xaac2170c <+4>: mov w1, #0x0// #0
>0xaac21710 <+8>: mov x29, sp
> ...
> => 0xaac219bc <+692>:   add x19, sp, #0x90
>0xaac219c0 <+696>:   mov x0, x19
>0xaac219c4 <+700>:   mov x1, #0x20   // #32
>0xaac219c8 <+704>:   str w2, [x21, #28]
>0xaac219cc <+708>:   bl  0xab0ac824 

pg_strong_random pulls random values from openssl's RAND_bytes
(defined in openssl/rand.h) when PostgreSQL is compiled with openSSL
support. If openSSL isn't enabled we instead use /dev/urandom (on
unix-y systems), which means different code will be generated for
pg_strong_random.

>0xaac219d0 <+712>:   tbz w0, #0, 0xaac21b28 
> 
>0xaac219d4 <+716>:   ldr x3, [x22, #32]
>0xaac219d8 <+720>:   mov x2, #0x128  // 
> #296
>0xaac219dc <+724>:   mov w1, #0x0// #0
>0xaac219e0 <+728>:   mov x0, x3
>0xaac219e4 <+732>:   bl  0xaab7f3b0 

Given this code, it looks like register x3 contains ControlFile - it's
being memset(..., 0, sizeof(ControlFileData));

>0xaac219e8 <+736>:   mov x3, x0
>0xaac219ec <+740>:   mov x1, #0x3e8  // 
> #1000
>0xaac219f0 <+744>:   ldr w9, [x21, #32]
>0xaac219f4 <+748>:   adrp    x7, 0xab3ce000 
> 
>0xaac219f8 <+752>:   ldr x7, [x7, #3720]
>0xaac219fc <+756>:   str x1, [x3, #128]

... Which would make this the assignment to unloggedLSN (which matches
the FirstNormalUnloggedLSN=1000 stored just above)

>0xaac21a00 <+760>:   ldr w1, [sp, #120]
>0xaac21a04 <+764>:   add x0, x0, #0x28
>0xaac21a08 <+768>:   str x23, [x3]

And this would be the assignment of systemidentifier,

>0xaac21a0c <+772>:   str w1, [x3, #252]

... data_checksum_version,

>0xaac21a10 <+776>:   adrp    x6, 0xab3cf000
>0xaac21a14 <+780>:   ldr x6, [x6, #2392]
>0xaac21a18 <+784>:   adrp    x5, 0xab3cf000
>0xaac21a1c <+788>:   ldr x5, [x5, #2960]
>0xaac21a20 <+792>:   adrp    x4, 0xab3cf000
>0xaac21a24 <+796>:   ldr x4, [x4, #3352]
>0xaac21a28 <+800>:   ldp q26, q25, [x19]
>0xaac21a2c <+804>:   str s15, [x3, #16]

... and finally ControlFile->state.

I don't see where s15 is initialized and/or written to first, but this
is the only reference in this section of ASM. As such, I think the
initialization (presumably, "mov s15, #1" or such) must have happened
before the call to pg_secure_rand/RAND_bytes.

Looking around on the internet, it seems that in the ARM Procedure
Call Standard register s15 does not need to be preserved, and thus
could be clobbered when we're going into pg_strong_random and co. If the
register was indeed clobbered by OpenSSL, that would be a good
explanation for these issues. Can you check this?

> The really weird thing is that the very same binaries work on a
> different host (arm64 VM provided by Huawei) - the
> postgresql_arm64.deb files compiled there and present on
> apt.postgresql.org are fine, but when installed on that graviton VM,
> they throw the above error.

If I were you, I'd start looking into the differences in behaviour of
OpenSSL between the two ARM-based systems you mention; particularly
with a focus on register contents. It looks like gdb's `i r ...`
command could help out with that - or so StackOverflow tells me.


Kind regards,

Matthias van de Meent




Re: connection establishment versus parallel workers

2025-01-13 Thread Thomas Munro
On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart  wrote:
> On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> > On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> >> 0001 patch is unchanged, 0002 patch sketches out a response to the
> >> observation a couple of paragraphs above.
> >
> > Both of these patches seem to improve matters quite a bit.  I haven't yet
> > thought too deeply about it all, but upon a skim, your patches seem
> > entirely reasonable to me.
>
> I gave these a closer look, and I still feel that they are both
> straightforward and reasonable.  IIUC the main open question is whether
> this might cause problems for other PM signal kinds.  Like you, I don't see
> anything immediately obvious there, but I'll admit I'm not terribly
> familiar with the precise characteristics of postmaster signals.  In any
> case, 0001 feels pretty safe to me.

Cool.  Thanks.  I'll think about what else could be affected by that
change as you say, and if nothing jumps out I'll go ahead and commit
them, back to 16.

I have done a lot more study of this problem and was about to write in
with some more patches to propose for master only.  Basically, that
"100" is destroying performance in this workload, which at least on my
machine hardly gets any parallelism at all, and only in sporadic
bursts.  You can argue that we aren't designed for high frequency
short-lived workers (we'll have to reuse workers in some way to be
good at that), but I don't think it has to fail as badly as it does
today.  It falls off a cliff instead of plateauing: we are so busy
forking that we don't get around to reaping children, so all our slots
are (artificially) used up most of the time, and the queries that do
manage to nab one then sit on their hands for a long time at query
end.  "1" gets much smoother results, but as prophesied in aa1351f1,
the complexity is terrible, possibly even O(n^3) in places depending
on how you count: there are many places that scan the whole worker
list, and one that even scans it again for each item, and that is for
each thing that starts.  IOW we have to fix the complexity
fundamentally.  I have a WIP patch that adds a couple of work queues,
so that the postmaster never has to consider anything more than the
head of a queue in various places.  More soon...
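
To make the queue idea concrete, here's an illustrative sketch using
ilist.h primitives (illustration only, not the actual WIP patch; the
names are made up):

    /* one entry per worker waiting to be started */
    typedef struct PendingWorkerStart
    {
        dlist_node  node;
        int         slot;       /* which registered worker this is */
    } PendingWorkerStart;

    static dlist_head pending_start = DLIST_STATIC_INIT(pending_start);

    /* registration time: O(1) enqueue instead of re-scanning later */
    static void
    enqueue_worker_start(PendingWorkerStart *w)
    {
        dlist_push_tail(&pending_start, &w->node);
    }

    /* main loop: look only at the head, O(1) per worker started */
    static bool
    maybe_start_one_worker(void)
    {
        PendingWorkerStart *w;

        if (dlist_is_empty(&pending_start))
            return false;

        w = dlist_container(PendingWorkerStart, node,
                            dlist_pop_head_node(&pending_start));
        /* ... fork/start the worker identified by w->slot here ... */
        return true;
    }

The point is just that nothing ever walks the whole worker list: each
event touches the head of the relevant queue and stops.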

> > However, while this makes the test numbers for >= v16 look more like those
> > for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
> > cleanly to commit d872510.  I haven't figured out _why_ this commit is
> > impacting this particular test, but I figured I'd at least update the
> > thread with what we know so far.
>
> I regrettably have no updates on this one, yet.

My first thought was that the catalogues needed for connection might
be getting evicted, but the data size surely seems too small for that,
and you'd probably have picked it up immediately from wait events.
Weird.




Re: AIO v2.2

2025-01-13 Thread Robert Haas
On Wed, Jan 8, 2025 at 7:26 PM Andres Freund  wrote:
> 1) Shared memory representation of an IO, for the AIO subsystem internally
>
>Currently: PgAioHandle
>
> 2) A way for the issuer of an IO to reference 1), to attach information to the
>IO
>
>Currently: PgAioHandle*
>
> 3) A way for any backend to wait for a specific IO to complete
>
>Currently: PgAioHandleRef

With that additional information, I don't mind this naming too much,
but I still think PgAioHandle -> PgAio and PgAioHandleRef ->
PgAioHandle is worth considering. Compare BackgroundWorkerSlot and
BackgroundWorkerHandle, which suggests PgAioHandle -> PgAioSlot and
PgAioHandleRef -> PgAioHandle.
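
To make the analogy concrete, the BackgroundWorker-style split would
look something like this (hypothetical declarations just to illustrate
the naming, not code from the patch):

    /* shared-memory representation, one per in-flight IO
     * (today: PgAioHandle) */
    typedef struct PgAioSlot
    {
        uint64      generation;     /* bumped when the slot is recycled */
        int         state;
        /* ... operation, target, result, callbacks ... */
    } PgAioSlot;

    /* The issuer would then hold a plain PgAioSlot * while defining
     * the IO (today: PgAioHandle *). */

    /* what any backend keeps in order to wait for the IO
     * (today: PgAioHandleRef) */
    typedef struct PgAioHandle
    {
        uint32      slotno;         /* index into the shared slot array */
        uint64      generation;     /* detects slot reuse, as in
                                     * BackgroundWorkerHandle */
    } PgAioHandle;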

> ZOMBIE feels even later than REAPED to me :)

Makes logical sense, because you would assume that you die first and
then later become an undead creature, but the UNIX precedent is that
dying turns you into a zombie and someone then has to reap the exit
status for you to be just plain dead. :-)

> > I do agree with Heikki that REAPED sounds later than COMPLETED, because you
> > reap zombie processes by collecting their exit status. Maybe you could have
> > AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
> > there's still completion-related work to be done, and then the other state
> > could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.
>
> How about
>
> AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
> AHS_COMPLETE_SHARED_CB - shared callback completed
> AHS_COMPLETE_LOCAL_CB - local callback completed
>
> ?

That's not bad. I like RAW better than KERNEL. I was hoping to use
different words like COMPLETE and DONE rather than, as you did it
here, COMPLETE and COMPLETE, but it's probably fine.
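
Spelled out as an enum, the progression would be something like the
following - only the three COMPLETE states are from your mail, the
earlier ones are placeholders:

    typedef enum PgAioHandleState
    {
        AHS_IDLE,                   /* slot not in use */
        AHS_DEFINED,                /* operation and target filled in */
        AHS_SUBMITTED,              /* handed to the kernel / io method */
        AHS_COMPLETE_RAW,           /* raw syscall completed */
        AHS_COMPLETE_SHARED_CB,     /* shared completion callback has run */
        AHS_COMPLETE_LOCAL_CB       /* local callback has run; all done */
    } PgAioHandleState;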

-- 
Robert Haas
EDB: http://www.enterprisedb.com