Re: Verify predefined LWLocks tranches have entries in wait_event_names.txt

2025-07-23 Thread Nathan Bossart
On Wed, Jul 23, 2025 at 06:22:16AM +, Bertrand Drouvot wrote:
> PFA v4 where the only change compared to v3 is that it reads lwlocklist.h once
> in generate-lwlocknames.pl.

Committed.

-- 
nathan




Re: Support getrandom() for pg_strong_random() source

2025-07-23 Thread Jacob Champion
On Tue, Jul 22, 2025 at 4:23 PM Masahiko Sawada  wrote:
> The trend of the results were similar:
>
> getrandom: 497.061 ms
> RAND_bytes: 1152.260 ms
> /dev/urandom: 1696.065 ms
>
> Please let me know if I'm missing configurations or settings to
> measure this workload properly.

I don't think you're missing anything, or else I'm missing something
too. If I modify pg_strong_random() to call getentropy() in addition
to the existing RAND_bytes() code, `perf` shows RAND_bytes() taking up
2.4x the samples that getentropy() does. That's very similar to your
results.

> On Tue, Jul 22, 2025 at 11:46 AM Jacob Champion
>  wrote:
> > That is _really_ surprising to me at first glance. I thought
> > RAND_bytes() was supposed to be a userspace PRNG, which I would
> > naively expect to take much less time than pulling data from Linux.

So my expectation was naive for sure. This has sent me down a bit of a
rabbit hole, starting with Adam Langley's BoringSSL post [1] which led
to a post/rant on urandom [2]. I don't think an API that advertises
"strong randomness" should ever prioritize performance over strength.
But maybe the pendulum has swung far enough that we can expect any
kernel supporting getentropy() to be able to do the job just as well
as OpenSSL does in userspace, except also faster? I think it might be
worth a discussion.
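(For anyone who wants to poke at this themselves, a standalone microbenchmark
along these lines is enough to reproduce the shape of the numbers -- this is
just a sketch, assuming glibc's getentropy() and OpenSSL's RAND_bytes(), not
the pg_strong_random() code itself:)

/* bench_random.c: gcc bench_random.c -lcrypto -o bench_random */
#include <openssl/rand.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000
#define NBYTES     16

static double
elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int
main(void)
{
    unsigned char buf[NBYTES];
    struct timespec start, stop;
    int         i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; i++)
        if (getentropy(buf, NBYTES) != 0)
            return 1;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    printf("getentropy: %.3f ms\n", elapsed_ms(start, stop));

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; i++)
        if (RAND_bytes(buf, NBYTES) != 1)
            return 1;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    printf("RAND_bytes: %.3f ms\n", elapsed_ms(start, stop));

    return 0;
}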

Thanks,
--Jacob

[1] https://www.imperialviolet.org/2015/10/17/boringssl.html
[2] https://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/




Re: Extension security improvement: Add support for extensions with an owned schema

2025-07-23 Thread Artem Gavrilov
Hello Jelte,

I reviewed your patch. Overall it looks good; I didn't find any problems
with the code. Documentation is in place and clear.

Initial Run
===
The patch applies cleanly to HEAD (196063d6761). All tests successfully
pass.

Comments
===
1) I noticed that the pg_dump changes aren't covered by tests.

2) I assume these error messages may be confusing, especially the first one:

> -- Fails for an already existing schema to be provided
> CREATE EXTENSION test_ext_owned_schema SCHEMA test_ext_owned_schema;
> ERROR:  schema "test_ext_owned_schema" already exists
> -- Fails because a different schema is set in control file
> CREATE EXTENSION test_ext_owned_schema SCHEMA test_schema;
> ERROR:  extension "test_ext_owned_schema" must be installed in schema
> "test_ext_owned_schema"


In both cases it's not clear that the extension requires schema ownership.
Can hint messages be added there?
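For example (just a sketch of the idea, not code from the patch -- the error
site, errcode and variable names below are made up):

/* Hypothetical: attach a hint where CREATE EXTENSION rejects the schema. */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
         errmsg("extension \"%s\" must be installed in schema \"%s\"",
                extname, owned_schema_name),
         errhint("The extension uses an owned schema, which is created and "
                 "owned by the extension itself.")));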

-- 

Artem Gavrilov

Senior Software Engineer, Percona

artem.gavri...@percona.com
percona.com 


Re: index prefetching

2025-07-23 Thread Tomas Vondra
On 7/23/25 02:59, Andres Freund wrote:
> Hi,
> 
> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
>> But I don't see why would this have any effect on the prefetch distance,
>> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
>> that. I'd have expected exactly the opposite behavior.
>>
>> Could be bug, of course. But it'd be helpful to see the dataset/query.
> 
> Pgbench scale 500, with the simpler query from my message.
> 

I tried to reproduce this, but I'm not seeing this behavior. I'm not sure how
you monitor the queue depth (presumably iostat?), but I added some basic
prefetch info to explain (see the attached WIP patch), reporting the
average prefetch distance, number of stalls (with distance=0) and stream
resets (after filling INDEX_SCAN_MAX_BATCHES).

And I see this (there's a complete explain output attached) for the two
queries from your message [1].

The simple query:

SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid
LIMIT 1000);

The complex query:

SELECT max(abalance), min(abalance), sum(abalance::numeric),
avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric) FROM
(SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 1000);

The stats actually look *exactly* the same, which makes sense because
it's reading the same index.


   max_batches   distance    stalls   resets   stalls/reset
   ---------------------------------------------------------
            64        272         3        3              1
            32         59    122939      653            188
            16         36    108101     1190             90
             8         21     98775     2104             46
             4         11     95627     4556             20

I think this behavior mostly matches my expectations, although it's
interesting the stalls jump so much between 64 and 32 batches.

I did test both with buffered I/O (io_method=sync) and direct I/O
(io_method=worker), and the results are exactly the same for me. Not the
timings, of course, but the prefetch stats.

Of course, maybe there's something wrong in how the stats are collected.
I wonder if maybe we should update the distance in get_block() and not
in next_buffer().

Or maybe there's some interference from having to read the leaf pages
sooner. But I don't see why that would affect the queue depth; fewer
resets should keep the queues fuller, I think.


I'll think about adding some sort of distance histogram to the stats.
Maybe something like tinyhist [2] would work here.
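(A log2-bucketed histogram is only a few lines either way -- here is a rough
sketch, not the tinyhist code:)

/* Sketch: bucket prefetch distances by floor(log2(distance));
 * distances 0 and 1 both land in bucket 0. */
#define DIST_HIST_BUCKETS 32

typedef struct DistanceHistogram
{
    uint64      buckets[DIST_HIST_BUCKETS];
} DistanceHistogram;

static inline void
distance_histogram_add(DistanceHistogram *h, uint64 distance)
{
    int         bucket = 0;

    while (distance > 1 && bucket < DIST_HIST_BUCKETS - 1)
    {
        distance >>= 1;
        bucket++;
    }
    h->buckets[bucket]++;
}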



[1]
https://www.postgresql.org/message-id/h2n7d7zb2lbkdcemopvrgmteo35zzi5ljl2jmk32vz5f4pziql%407ppr6r6yfv4z

[2] https://github.com/tvondra/tinyhist


regards

-- 
Tomas Vondra
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4835c48b448..5e6c3208955 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -407,12 +407,6 @@ index_beginscan_internal(Relation indexRelation,
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
-	/*
-	 * No batching by default, so set it to NULL. Will be initialized later if
-	 * batching is requested and AM supports it.
-	 */
-	scan->xs_batches = NULL;
-
 	return scan;
 }
 
@@ -463,6 +457,17 @@ index_rescan(IndexScanDesc scan,
 			orderbys, norderbys);
 }
 
+void
+index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets)
+{
+	/* ugly */
+	if (scan->xs_heapfetch->rs != NULL)
+	{
+		read_stream_prefetch_stats(scan->xs_heapfetch->rs,
+	   accum, count, stalls, resets);
+	}
+}
+
 /* 
  *		index_endscan - end a scan
  * 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e2792ead71..d92f68d0533 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -136,6 +136,7 @@ static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 			  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
 static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexprefetch_info(PlanState *planstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1966,6 +1967,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 show_instrumentation_count("Rows Removed by Filter", 1,
 		   planstate, es);
 			show_indexsearches_info(planstate, es);
+			show_indexprefetch_info(planstate, es);
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1983,6 +1985,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 ExplainPropertyFloat("Heap Fetches", NULL,
 	 planstate->instrument->ntuples2, 0, es);
 			show_indexsearches_info(planstate, es);
+			show_index

trivial grammar refactor

2025-07-23 Thread Álvaro Herrera
Hello

I noticed some duplicative coding while hacking on REPACK[1].  We can
save a few lines now with a trivial change to the rules for CHECKPOINT
and REINDEX, which will also allow saving a few extra lines in that patch.

Any objections to this?

[1] https://commitfest.postgresql.org/patch/5117/

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From 435fee7ff7400bf86f6ad0ea4df27b2d2b37c254 Mon Sep 17 00:00:00 2001
From: Álvaro Herrera 
Date: Wed, 23 Jul 2025 15:48:18 +0200
Subject: [PATCH] Refactor grammar productions to create
 opt_utility_option_list

This can be used by various commands that accept the standard optional
parenthesized list of options.  We already have CHECKPOINT and REINDEX,
and REPACK will take advantage of it as well.
---
 src/backend/parser/gram.y | 27 +++
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 73345bb3c70..903f854799c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -318,6 +318,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 		opt_qualified_name
 %type 		opt_concurrently
 %type 	opt_drop_behavior
+%type 		opt_utility_option_list
 
 %type 	alter_column_default opclass_item opclass_drop alter_using
 %type 	add_drop opt_asc_desc opt_nulls_order
@@ -556,7 +557,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	generic_option_list alter_generic_option_list
 
 %type 	reindex_target_relation reindex_target_all
-%type 	opt_reindex_option_list
 
 %type 	copy_generic_opt_arg copy_generic_opt_arg_list_item
 %type 	copy_generic_opt_elem
@@ -1141,6 +1141,11 @@ opt_drop_behavior:
 			| /* EMPTY */	{ $$ = DROP_RESTRICT; /* default */ }
 		;
 
+opt_utility_option_list:
+			'(' utility_option_list ')'		{ $$ = $2; }
+			| /* EMPTY */	{ $$ = NULL; }
+		;
+
 /*
  *
  * CALL statement
@@ -2028,18 +2033,12 @@ constraints_set_mode:
  * Checkpoint statement
  */
 CheckPointStmt:
-			CHECKPOINT
+			CHECKPOINT opt_utility_option_list
 {
 	CheckPointStmt *n = makeNode(CheckPointStmt);
 
 	$$ = (Node *) n;
-}
-			| CHECKPOINT '(' utility_option_list ')'
-{
-	CheckPointStmt *n = makeNode(CheckPointStmt);
-
-	$$ = (Node *) n;
-	n->options = $3;
+	n->options = $2;
 }
 		;
 
@@ -9354,7 +9353,7 @@ DropTransformStmt: DROP TRANSFORM opt_if_exists FOR Typename LANGUAGE name opt_d
  */
 
 ReindexStmt:
-			REINDEX opt_reindex_option_list reindex_target_relation opt_concurrently qualified_name
+			REINDEX opt_utility_option_list reindex_target_relation opt_concurrently qualified_name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 
@@ -9367,7 +9366,7 @@ ReindexStmt:
 			makeDefElem("concurrently", NULL, @4));
 	$$ = (Node *) n;
 }
-			| REINDEX opt_reindex_option_list SCHEMA opt_concurrently name
+			| REINDEX opt_utility_option_list SCHEMA opt_concurrently name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 
@@ -9380,7 +9379,7 @@ ReindexStmt:
 			makeDefElem("concurrently", NULL, @4));
 	$$ = (Node *) n;
 }
-			| REINDEX opt_reindex_option_list reindex_target_all opt_concurrently opt_single_name
+			| REINDEX opt_utility_option_list reindex_target_all opt_concurrently opt_single_name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 
@@ -9402,10 +9401,6 @@ reindex_target_all:
 			SYSTEM_P{ $$ = REINDEX_OBJECT_SYSTEM; }
 			| DATABASE{ $$ = REINDEX_OBJECT_DATABASE; }
 		;
-opt_reindex_option_list:
-			'(' utility_option_list ')'{ $$ = $2; }
-			| /* EMPTY */			{ $$ = NULL; }
-		;
 
 /*
  *
-- 
2.39.5



Re: Adding wait events statistics

2025-07-23 Thread Robert Haas
On Tue, Jul 22, 2025 at 8:24 AM Bertrand Drouvot
 wrote:
> So based on the cycles metric I think it looks pretty safe to implement for
> the whole majority of classes.

I'm not convinced that this is cheap enough to implement, and I
don't understand the value proposition, either. I see the first couple
of messages in the thread say that this is important for
troubleshooting problems in "our user base" (not sure about the
antecedent of "our") but the description of what those problems are
seems pretty vague to me. I've tried counting wait events of a
particular type and the resulting data was extremely misleading and
therefore useless.

I think it's probably a mistake to even be thinking about this in
terms of wait events. It seems very reasonable to want more data about
what PostgreSQL backends are doing, but I don't really see any reason
why that should happen to line up with whatever wait events do. For
instance, if you asked "what information is useful to gather about
heavyweight locks?" you might say "well, we'd like to know how many
times we tried to acquire one, and how many of those times we had to
wait, and how many of those times we waited for longer than
deadlock_timeout". And then you could design a facility to answer
those questions. Or you might say "we'd like a history of all the
different heavyweight locks that a certain backend has tried to
acquire," and then you could design a tracing facility to give you
that. Either of those hypothetical facilities involve providing more
information than you would get from just counting wait events, or
counting+timing wait events, or recording a complete history of all
wait events.

And, I would say that, for more or less that exact reason, there's a
real use case for either of them. Maybe either or both are valuable
enough to add and maybe they are not, and maybe the overhead is
acceptable or maybe it isn't, but I think the arguments are much
better than for a facility that just counts and times all the wait
events. For instance, the former facility would let you answer the
question "what percentage of lock acquisitions are contended?" whereas
a pure wait event count just lets you answer the question "how many
contended lock acquisitions were there?". The latter isn't useless,
but the former is a lot better, particularly since as I proposed it,
you can also judge how many of those contention events involved a
brief wait vs. a long one.
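(To make that concrete, the hypothetical facility could be little more than
three per-backend counters -- the struct and names below are invented, purely
for illustration:)

/* Hypothetical per-backend counters for heavyweight lock acquisition. */
typedef struct LockAcquireStats
{
    uint64      acquisitions;   /* total acquisition attempts */
    uint64      contended;      /* attempts that had to wait */
    uint64      long_waits;     /* waits longer than deadlock_timeout */
} LockAcquireStats;

/* "What percentage of lock acquisitions are contended?" then becomes: */
static inline double
contended_fraction(const LockAcquireStats *s)
{
    return s->acquisitions == 0 ? 0.0 :
        (double) s->contended / (double) s->acquisitions;
}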

But if, on the other hand, you look at LWLocks, it's a totally
different situation, IMHO. I'm almost sure that measuring LWLock wait
times is going to be too expensive to be practical, and I don't really
see why you'd want to: the right approach is sampling, which is cheap
and in my experience highly effective. Measuring counts doesn't seem
very useful either: knowing the number of times that somebody tried to
acquire a relation lock or a tuple lock arguably tells you something
about your workload that you might want to know, whereas I would argue
that knowing the number of times that somebody tried to acquire a
buffer lock doesn't really tell you anything at all. What you might
want to know is how many buffers you accessed, which is why we've
already got a system for tracking that. That's not to say that there's
nothing at all that somebody could want to know about LWLocks that you
can't already find out today: for example, a system that identifies
which buffers are experiencing significant buffer lock contention, by
relation OID and block number, sounds quite handy. But just counting
wait events, or counting and timing them, will not give you that.
Knowing which SLRUs are being heavily used could also be useful, but I
think there's a good argument that having some instrumentation that
cuts across SLRU types and exposes a bunch of useful numbers for each
could be more useful than just hoping you can figure it out from
LWLock wait event counts.

In short, I feel like just counting, or counting+timing, all the wait
events is painting with too broad a brush. Wait events get inserted
for a specific purpose: so you can know why a certain backend is
off-CPU without having to use debugging tools. They serve that purpose
pretty well, but that doesn't mean that they serve other purposes
well, and I find it kind of hard to see the argument that just
sticking a bunch of counters or timers in the same places where we put
the wait event calls would be the right thing in general.
Introspection is an important topic and, IMHO, deserves much more
specific and nuanced thought about what we're trying to accomplish and
how we're going about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: More protocol.h replacements this time into walsender.c

2025-07-23 Thread Nathan Bossart
Committed.  I noticed that there are several characters with no match in
protocol.h.  It might be worth adding those.

In walsender.c:

1537:   pq_sendbyte(ctx->out, 'w');
2353:   case 'r':
2357:   case 'h':
2361:   case 'p':
2755:   pq_sendbyte(&output_message, 's');
3367:   pq_sendbyte(&output_message, 'w');
4138:   pq_sendbyte(&output_message, 'k');

In walreceiver.c:

829:    case 'w':           /* WAL records */
853:    case 'k':           /* Keepalive */
1133:   pq_sendbyte(&reply_message, 'r');
1237:   pq_sendbyte(&reply_message, 'h');

In logical/worker.c:

3854:   if (c == 'w')
3876:   else if (c == 'k')
3895:   else if (c == 's')  /* Primary status update */
4127:   pq_sendbyte(reply_message, 'r');
4298:   pq_sendbyte(request_message, 'p');
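For instance, the replication sub-protocol bytes could get symbolic names
along these lines (the names below are invented, just to show the shape --
the point is only that the literals would go away):

/* Hypothetical protocol.h additions; names are illustrative only. */
#define PqReplMsg_WALData               'w'
#define PqReplMsg_Keepalive             'k'
#define PqReplMsg_StandbyStatusUpdate   'r'
#define PqReplMsg_HotStandbyFeedback    'h'
#define PqReplMsg_PrimaryStatusRequest  'p'
#define PqReplMsg_PrimaryStatusUpdate   's'

/* walsender.c would then read e.g.: */
pq_sendbyte(&output_message, PqReplMsg_WALData);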

-- 
nathan




Re: trivial grammar refactor

2025-07-23 Thread Nathan Bossart
On Wed, Jul 23, 2025 at 05:38:34PM +0200, Álvaro Herrera wrote:
> I noticed some duplicative coding while hacking on REPACK[1].  We can
> save a few lines now with a trivial change to the rules for CHECKPOINT
> and REINDEX, which will also allow saving a few extra lines in that patch.
> 
> Any objections to this?

Seems reasonable to me.  Any chance we can use this for CLUSTER, VACUUM,
ANALYZE, and EXPLAIN, too?

-- 
nathan




Re: trivial grammar refactor

2025-07-23 Thread Álvaro Herrera
On 2025-Jul-23, Álvaro Herrera wrote:

> So we can still do this, and I still think it's a win, but unfortunately
> it won't help for the REPACK patch.

Ah no, I can still use it:

RepackStmt:
REPACK opt_utility_option_list qualified_name USING INDEX name
| REPACK opt_utility_option_list qualified_name USING INDEX
| REPACK opt_utility_option_list qualified_name
| REPACK USING INDEX
| CLUSTER '(' utility_option_list ')' qualified_name 
cluster_index_specification
| CLUSTER qualified_name cluster_index_specification
| CLUSTER opt_utility_option_list
| CLUSTER VERBOSE qualified_name cluster_index_specification
| CLUSTER VERBOSE
| CLUSTER VERBOSE name ON qualified_name
| CLUSTER name ON qualified_name
;


-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)




Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

2025-07-23 Thread Andres Freund
Hi,

On 2025-03-08 11:47:25 -0500, Peter Geoghegan wrote:
> My current plan is to commit this on Tuesday or Wednesday, barring any
> objections.

A minor question about this patch: Was there a particular reason it added the
index specific instrumentation information inline in IndexScanState etc?  Of
course the amount of memory right now is rather trivial, so that is not an
issue memory usage wise. Is that the reason?

The background for my question is that I was looking at what it would take to
track the index and table buffer usage separately for
IndexScanState/IndexOnlyScanState and IndexScanInstrumentation seems to be
pre-destined for that information. But it seems a bit too much memory to
just keep a BufferUsage around even when analyze isn't used.


Greetings,

Andres Freund


PS: Another thing that I think we ought to track is the number of fetches from
the table that missed, but that's not really related to my question here or
this thread...




Re: trivial grammar refactor

2025-07-23 Thread Álvaro Herrera

... so using the same set of productions, I can rewrite the current
CLUSTER rule in this way and it won't be a problem for the REPACK
changes.

Thanks for looking!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From 91edb076325ca366762832fcf3a4eab7de21002d Mon Sep 17 00:00:00 2001
From: Álvaro Herrera 
Date: Wed, 23 Jul 2025 18:13:49 +0200
Subject: [PATCH v2] Refactor grammar to create opt_utility_option_list

This changes the grammar for REINDEX, CHECKPOINT and CLUSTER; they still
accept the same options as before, but the grammar is written
differently for convenience of future development.
---
 src/backend/parser/gram.y | 65 ++-
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 73345bb3c70..8a6236167db 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -318,6 +318,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 		opt_qualified_name
 %type 		opt_concurrently
 %type 	opt_drop_behavior
+%type 		opt_utility_option_list
 
 %type 	alter_column_default opclass_item opclass_drop alter_using
 %type 	add_drop opt_asc_desc opt_nulls_order
@@ -556,7 +557,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	generic_option_list alter_generic_option_list
 
 %type 	reindex_target_relation reindex_target_all
-%type 	opt_reindex_option_list
 
 %type 	copy_generic_opt_arg copy_generic_opt_arg_list_item
 %type 	copy_generic_opt_elem
@@ -1141,6 +1141,11 @@ opt_drop_behavior:
 			| /* EMPTY */	{ $$ = DROP_RESTRICT; /* default */ }
 		;
 
+opt_utility_option_list:
+			'(' utility_option_list ')'		{ $$ = $2; }
+			| /* EMPTY */	{ $$ = NULL; }
+		;
+
 /*
  *
  * CALL statement
@@ -2028,18 +2033,12 @@ constraints_set_mode:
  * Checkpoint statement
  */
 CheckPointStmt:
-			CHECKPOINT
+			CHECKPOINT opt_utility_option_list
 {
 	CheckPointStmt *n = makeNode(CheckPointStmt);
 
 	$$ = (Node *) n;
-}
-			| CHECKPOINT '(' utility_option_list ')'
-{
-	CheckPointStmt *n = makeNode(CheckPointStmt);
-
-	$$ = (Node *) n;
-	n->options = $3;
+	n->options = $2;
 }
 		;
 
@@ -9354,7 +9353,7 @@ DropTransformStmt: DROP TRANSFORM opt_if_exists FOR Typename LANGUAGE name opt_d
  */
 
 ReindexStmt:
-			REINDEX opt_reindex_option_list reindex_target_relation opt_concurrently qualified_name
+			REINDEX opt_utility_option_list reindex_target_relation opt_concurrently qualified_name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 
@@ -9367,7 +9366,7 @@ ReindexStmt:
 			makeDefElem("concurrently", NULL, @4));
 	$$ = (Node *) n;
 }
-			| REINDEX opt_reindex_option_list SCHEMA opt_concurrently name
+			| REINDEX opt_utility_option_list SCHEMA opt_concurrently name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 
@@ -9380,7 +9379,7 @@ ReindexStmt:
 			makeDefElem("concurrently", NULL, @4));
 	$$ = (Node *) n;
 }
-			| REINDEX opt_reindex_option_list reindex_target_all opt_concurrently opt_single_name
+			| REINDEX opt_utility_option_list reindex_target_all opt_concurrently opt_single_name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 
@@ -9402,10 +9401,6 @@ reindex_target_all:
 			SYSTEM_P{ $$ = REINDEX_OBJECT_SYSTEM; }
 			| DATABASE{ $$ = REINDEX_OBJECT_DATABASE; }
 		;
-opt_reindex_option_list:
-			'(' utility_option_list ')'{ $$ = $2; }
-			| /* EMPTY */			{ $$ = NULL; }
-		;
 
 /*
  *
@@ -11903,49 +11898,61 @@ ClusterStmt:
 	n->params = $3;
 	$$ = (Node *) n;
 }
-			| CLUSTER '(' utility_option_list ')'
+			| CLUSTER qualified_name cluster_index_specification
+{
+	ClusterStmt *n = makeNode(ClusterStmt);
+
+	n->relation = $2;
+	n->indexname = $3;
+	n->params = NIL;
+	$$ = (Node *) n;
+}
+			| CLUSTER opt_utility_option_list
 {
 	ClusterStmt *n = makeNode(ClusterStmt);
 
 	n->relation = NULL;
 	n->indexname = NULL;
-	n->params = $3;
+	n->params = $2;
 	$$ = (Node *) n;
 }
 			/* unparenthesized VERBOSE kept for pre-14 compatibility */
-			| CLUSTER opt_verbose qualified_name cluster_index_specification
+			| CLUSTER VERBOSE qualified_name cluster_index_specification
 {
 	ClusterStmt *n = makeNode(ClusterStmt);
 
 	n->relation = $3;
 	n->indexname = $4;
-	n->params = NIL;
-	if ($2)
-		n->params = lappend(n->params, makeDefElem("verbose", NULL, @2));
+	n->params = list_make1(makeDefElem("verbose", NULL, @2));
 	$$ = (Node *) n;
 }
 			/* unparenthesized VERBOSE kept for pre-17 compatibility */
-			| CLUSTER opt_verbose
+			| CLUSTER VER

Re: Making type Datum be 8 bytes everywhere

2025-07-23 Thread Andres Freund
Hi,

On 2025-07-18 13:24:32 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2025-07-17 20:09:57 -0400, Tom Lane wrote:
> >> I made it just as a proof-of-concept that this can work.  It compiled
> >> cleanly and passed check-world for me on a 32-bit FreeBSD image.
>
> > Interestingly it generates a *lot* of warnings here when building for 32 bit
> > with gcc.
>
> Oh, that's annoying.  I tested it with
>
> $ cc --version
> FreeBSD clang version 13.0.0 (g...@github.com:llvm/llvm-project.git 
> llvmorg-13.0.0-0-gd7b669b3a303)
> Target: i386-unknown-freebsd13.1
> Thread model: posix
> InstalledDir: /usr/bin
>
> which is slightly back-rev but not that old.  Which gcc did you use?

That was gcc 14.


> > One class of complaints is about DatumGetPointer() and
> > PointerGetDatum() casting between different sizes:
>
> > ../../../../../home/andres/src/postgresql/src/include/postgres.h: In 
> > function 'DatumGetPointer':
> > ../../../../../home/andres/src/postgresql/src/include/postgres.h:320:16: 
> > warning: cast to pointer from integer of different size 
> > [-Wint-to-pointer-cast]
> >   320 | return (Pointer) X;
> >   |^
> > ../../../../../home/andres/src/postgresql/src/include/postgres.h: In 
> > function 'PointerGetDatum':
> > ../../../../../home/andres/src/postgresql/src/include/postgres.h:330:16: 
> > warning: cast from pointer to integer of different size 
> > [-Wpointer-to-int-cast]
> >   330 | return (Datum) X;
> >   |^
>
> We might be able to silence those with intermediate casts to uintptr_t,
> perhaps?

Yep, that does the trick.
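(For the archives, the postgres.h helpers end up looking roughly like this
with the intermediate casts -- sketch only:)

static inline Pointer
DatumGetPointer(Datum X)
{
	return (Pointer) (uintptr_t) X;
}

static inline Datum
PointerGetDatum(const void *X)
{
	return (Datum) (uintptr_t) X;
}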


> >> I've not looked into the performance consequences.  We probably
> >> should at least try to measure that, though I'm not sure what
> >> our threshold of pain would be for deciding not to do this.
>
> > From my POV the threshold would have to be rather high for backend code. 
> > Less
> > so in libpq, but that's not affected here.
>
> I don't know if it's "rather high" or not, but that seems like
> the gating factor that ought to be checked before putting in
> more work.

The hard bit would be to determine what workload to measure.  Something like
pgbench probably won't suffer meaningfully, there's just not enough passing of
values around.

For a bit I thought it'd need to be a workload that does a lot of int4 math or
such, but I doubt the overhead of it matters sufficiently there.

Then I realized that the biggest issue probably would be a query that does a
lot of tuple deforming of 4 byte values, while not actually accessing them?

Can you think of a worse workload than that?

Greetings,

Andres Freund




Re: Document transition table triggers are not allowed on views/foreign tables

2025-07-23 Thread Ashutosh Bapat
On Wed, Jul 23, 2025 at 4:16 PM Etsuro Fujita  wrote:
>
> On Tue, Jul 15, 2025 at 4:55 PM Etsuro Fujita  wrote:
> > Another thing I noticed about transition tables is that while we
> > prohibit transition tables on views/foreign tables, there is no
> > description about that in the user-facing documentation.  So I would
> > like to propose to do $SUBJECT in create_trigger.sgml.  Attached is a
> > patch for that.

I think the restriction should be specified in a manner similar to how the
restriction on the CONSTRAINT option for foreign tables is specified, i.e.
in "This option is only allowed for an AFTER trigger that is not a
constraint trigger; also, if the trigger is an UPDATE trigger, it must
not specify a column_name list.". But that sentence is already a bit
complex because of the "; also, ..." part. How about splitting the
sentence into two and mentioning the restriction like below?

"This option is only allowed for an AFTER trigger on tables other than
views or foreign tables. The trigger should not be a constraint
trigger. If the trigger is an UPDATE trigger, it must not specify a
column_name list when using this option."

-- 
Best Wishes,
Ashutosh Bapat




Re: index prefetching

2025-07-23 Thread Andres Freund
Hi,

On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote:
> On 7/23/25 02:59, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
> >> But I don't see why would this have any effect on the prefetch distance,
> >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
> >> that. I'd have expected exactly the opposite behavior.
> >>
> >> Could be bug, of course. But it'd be helpful to see the dataset/query.
> > 
> > Pgbench scale 500, with the simpler query from my message.
> > 
> 
> I tried to reproduce this, but I'm not seeing this behavior. I'm not sure how
> you monitor the queue depth (presumably iostat?)

Yes, iostat, since I was looking at what the "actually required" lookahead
distance is.

Do you actually get the query to be entirely CPU bound? What amount of IO
waiting do you see EXPLAIN (ANALYZE, TIMING OFF) with track_io_timing=on
report?

Ah - I was using a very high effective_io_concurrency. With a high
effective_io_concurrency value I see a lot of stalls, even at
INDEX_SCAN_MAX_BATCHES = 64. And a lower prefetch distance, which seems
somewhat odd.


FWIW, in my tests I was just evicting lineitem from shared buffers, since I
wanted to test the heap prefetching, without stalls induced by blocking on
index reads. But what I described happens with either.

;SET effective_io_concurrency = 256;SELECT pg_buffercache_evict_relation('pgbench_accounts');
explain (analyze, costs off, timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 1000);

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared hit=27369 read=164191
   I/O Timings: shared read=358.795
   ->  Limit (actual rows=1000.00 loops=1)
         Buffers: shared hit=27369 read=164191
         I/O Timings: shared read=358.795
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=1000.00 loops=1)
               Index Searches: 1
               Prefetch Distance: 256.989
               Prefetch Stalls: 3
               Prefetch Resets: 3
               Buffers: shared hit=27369 read=164191
               I/O Timings: shared read=358.795
 Planning Time: 0.086 ms
 Execution Time: 4194.845 ms

;SET effective_io_concurrency = 512;SELECT pg_buffercache_evict_relation('pgbench_accounts');
explain (analyze, costs off, timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 1000);

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate (actual rows=1.00 loops=1)
   Buffers: shared hit=27368 read=164190
   I/O Timings: shared read=832.515
   ->  Limit (actual rows=1000.00 loops=1)
         Buffers: shared hit=27368 read=164190
         I/O Timings: shared read=832.515
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=1000.00 loops=1)
               Index Searche

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

2025-07-23 Thread Peter Geoghegan
On Wed, Jul 23, 2025 at 1:50 PM Andres Freund  wrote:
> A minor question about this patch: Was there a particular reason it added the
> index specific instrumentation information inline in IndexScanState etc?  Of
> course the amount of memory right now is rather trivial, so that is not an
> issue memory usage wise. Is that the reason?

There was no very good reason behind my choice to do things that way.
I wanted to minimize the amount of churn in files like
nodeIndexScan.c. It was almost an arbitrary choice.

> The background for my question is that I was looking at what it would take to
> track the index and table buffer usage separately for
> IndexScanState/IndexOnlyScanState and IndexScanInstrumentation seems to be
> pre-destined for that information. But it seems a bit too much memory to
> just keep a BufferUsage around even when analyze isn't used.

Offhand, I'd say that it would almost certainly be okay to switch over
to using dynamic allocation for IndexScanInstrumentation, instead of
storing it inline in IndexScanState/IndexOnlyScanState. That way you
could add many more fields to IndexScanInstrumentation, without that
creating any memory bloat problems in the common case where the scan
isn't running in an EXPLAIN ANALYZE.
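Roughly like this, I'd imagine (a sketch only; the field name and exact
placement are approximate):

/* Sketch: allocate the instrumentation only when EXPLAIN ANALYZE asks for it. */
if (estate->es_instrument)
{
	indexstate->iss_Instrument = palloc0(sizeof(IndexScanInstrumentation));
	/* room to grow: a per-scan BufferUsage could live here too */
}
else
	indexstate->iss_Instrument = NULL;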

-- 
Peter Geoghegan




Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Andres Freund
Hi,

On 2025-07-23 09:54:12 +0900, Michael Paquier wrote:
> On Tue, Jul 22, 2025 at 10:57:06AM -0400, Andres Freund wrote:
> > It seems rather unsurprising that that causes a slowdown.
> > 
> > The pre-check is there to:
> > /* Don't expend a clock check if nothing to do */
> > 
> > but you made it way more expensive than a clock check would have been (not
> > counting old vmware installs or such, where clock checks take ages).
> > 
> > If I change the loop count to only be the builtin stats kinds, the overhead
> > goes away almost completely.
> 
> Noted.  I was wondering originally if the threshold for the builtin
> and custom kinds should be lowered, as well.  Perhaps that's just too
> optimistic, so we can first lower things.  Say PGSTAT_KIND_CUSTOM_MIN
> to 32 and PGSTAT_KIND_MAX to 48 or so to begin with?  I don't see a
> strict policy here, but these would make the whole cheaper to begin
> with, with enough margin for any new stats kinds.

Yes, 256 seems pointlessly large.


> Then, about the loop used here, I'd be OK to keep the past shortcuts
> for the builtin fixed-sized stats kinds with the requirement that new
> builtin stats kinds need to hack into pgstat_report_stat() themselves
> on efficiency grounds, combined with a second loop for custom kinds,
> taken only if pgstat_register_kind() has been used by tracking if
> that's the case in a flag.  Do you have a specific preference here?

I think the downstream approach of having a flag that says if there are *any*
pending stats is better.

Greetings,

Andres Freund




Re: Parallel heap vacuum

2025-07-23 Thread Andres Freund
Hi,

On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote:
> Do you think it makes sense to implement the above idea that we launch
> parallel vacuum workers for heap through the same vacuumparallel.c
> codebase and maintain the consistent interface with parallel index
> vacuuming APIs?

Yes, that might make sense. But wiring it up via tableam doesn't make sense.

Greetings,

Andres Freund




Re: Error with DEFAULT VALUE in temp table

2025-07-23 Thread Tom Lane
Антуан Виолин  writes:
> I made patch for this problem, I changed event_trigger, for
> definitions of temporality
> DEFAULT VALUE

I looked over this patch.  I understand what you want to fix,
and I agree with the general plan of pushing the namespace-lookup
code into a subroutine so we can more easily point it at a different
object for the attrdef case.  However, you've done no favors for
future readers of the code:

* "process_catalog_object" is about as generic and uninformative a
name as one could easily think of.  I'd suggest something more like
"identify_object_namespace" --- I'm not wedded to that exact choice,
but the name should indicate what the function is trying to do.

* The new function also lacks a header comment, which isn't OK
except maybe for extremely trivial functions with obvious APIs.
Here I think you need to explain what values it outputs, and
you certainly need to explain what the return value means.

* The comment block at lines 1297ff is now kind of dangling,
because you moved half of what it's talking about to somewhere else.
Perhaps some of that text belongs in the new function's header
comment.

* Zero comments in the new code block for "object->classId ==
AttrDefaultRelationId" are not OK either.  I'd expect to see
something like "We treat a column default as temp if its table
is temp".

* I wonder if the check for is_objectclass_supported shouldn't
move into the new function too.  It's not really a concern
of the outer function where the new function is getting its
information from.

* If I'm reading it correctly, the patch depends on the
assumption that attrdefs aren't supported by the
is_objectclass_supported() infrastructure.  I'm not sure
that's right even today, and it sure seems like something
that could get broken by well-intentioned future patches.


Something that isn't the fault of your patch, but could be
improved while we're here:

* It seems rather messy and poorly-thought-out that schemas
themselves are handled in two separate places in the function,
at lines 1283ff and 1367ff.  Seems like that could be unified
and also made to look more like the equivalent code for
objects-contained-in-schemas.


Taking my last three comments together, maybe what we want for
the overall structure in EventTriggerSQLDropAddObject is

if (object->classId == NamespaceRelationId)
{
code for the schema case;
}
else if (object->classId == AttrDefaultRelationId)
{
code for the attrdef case;
}
else
{
generic case;
}

where the second and third blocks use this new function.

regards, tom lane




Re: DOCS: What SGML markup to use for user objects like tables, columns, etc?

2025-07-23 Thread Bruce Momjian
On Mon, Jun 23, 2025 at 10:29:50AM +1000, Peter Smith wrote:
> Hi,
> 
> I am looking for some guidelines/recommended SGML tags to use when
> referring in the PG documentation to any user-defined
> schema/table/column names.
> 
> This is most commonly seen near a  SQL example.
> 
> Currently, it seems a bit inconsistent. The rendering also looks quite
> different for these different markups.
> 
> 
> EXAMPLES OF DIFFERENT USAGE
> ===
> 
> 1.  as seen in create_publication.sgml,
> alter_publication.sgml, ddl.sgml, etc.
> 
> e.g.
>   
>Add tables users,
>departments and schema
>production to the publication
>production_publication:
> 
> ALTER PUBLICATION production_publication ADD TABLE users, departments,
> TABLES IN SCHEMA production;
> 
>  
> 
> ===
> 
> 2. , as seen in logical-replication.sgml
> 
> e.g.
> 
> CREATE SUBSCRIPTION mysub CONNECTION 'dbname=foo host=bar
> user=repuser' PUBLICATION mypub;
> 
>   
> 
>   
>The above will start the replication process, which synchronizes the
>initial table contents of the tables users and
>departments and then starts replicating
>incremental changes to those tables.
>   
> 
> ===
> 
> 3. , as seen in advanced.sgml
> 
> e.g.
>
> Let's create two tables:  A table cities
> and a table capitals.  Naturally, capitals
> are also cities, so you want some way to show the capitals
> implicitly when you list all cities.  If you're really clever you
> might invent some scheme like this:
> 
> 
> CREATE TABLE capitals (
>   name   text,
>   population real,
>   elevation  int,-- (in ft)
>   state  char(2)
> );
> 
> ==
> 
> My AI tool says the following.
> 
> 
> PostgreSQL documentation typically uses:
>  for specific, concrete names
>  for generic placeholders
>  for system objects and data types
> 
> 
> TBH, this seemed like good advice to me... however there are quite a
> few examples not following that.
> 
> Thoughts?

You are right that we are inconsistent.  I think  is the
accepted way to reference table names, and  for column
names.  If you can create a patch to make them consistent, I can apply
it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.




Re: pgbench - adding pl/pgsql versions of tests

2025-07-23 Thread Hannu Krosing
I had left plain < and > in a code sample in the documentation page; fixed now.




On Sun, Jul 6, 2025 at 11:52 PM Hannu Krosing  wrote:
>
> On Sat, Feb 3, 2024 at 8:54 AM Hannu Krosing  wrote:
> >
> > My justification for adding pl/pgsql tests as part of the immediately 
> > available tests
> > is that pl/pgsql itself is always enabled, so having a no-effort way to 
> > test its
> > performance benefits would be really helpful.
> > We also should have "tps-b-like as SQL function" to round up the "test 
> > what's available in server" set.
>
> Finally got around to adding the tests for all out-of-the box
> supported languages - pl/pgsql, and old and new SQL.
>
> Also added the documentation.
From 72ede8812f61ed4879f5c005a18131c32ba2768b Mon Sep 17 00:00:00 2001
From: Hannu Krosing 
Date: Sun, 6 Jul 2025 23:41:46 +0200
Subject: [PATCH v3 1/2] added new and old style SQL functions and
 documentation

---
 doc/src/sgml/ref/pgbench.sgml |  43 +++-
 src/bin/pgbench/pgbench.c | 187 +-
 2 files changed, 223 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..6d733cff1af 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -105,8 +105,9 @@ pgbench -i  other-options  pgbench -i creates four tables pgbench_accounts,
 pgbench_branches, pgbench_history, and
 pgbench_tellers,
-destroying any existing tables of these names.
-Be very careful to use another database if you have tables having these
+destroying any existing tables of these names and their dependencies.
+It also creates 6 functions starting with pgbench_.
+Be very careful to use another database if you have tables or functions having these
 names!

   
@@ -361,6 +362,15 @@ pgbench  options  d
   
  
 
+ 
+  --no-functions
+  
+   
+Do not create pl/pgsql and SQL functions for internal scripts.
+   
+  
+ 
+
  
   --partition-method=NAME
   
@@ -427,8 +437,35 @@ pgbench  options  d
 Available built-in scripts are: tpcb-like,
 simple-update and select-only.
 Unambiguous prefixes of built-in names are accepted.
+   
+   
+Unless disabled with --no-functions option at database 
+init the tpcb-like and simple-update
+scripts are also implemented as User-Defined functions in the database which 
+can be tested using scripts named plpgsql-tpcb-like, 
+sqlfunc-tpcb-like, oldsqlf-tpcb-like, 
+plpgsql-simple-update, sqlfunc-simple-update
+and oldsqlf-simple-update.
+The sqlfunc-* versions use the new SQL-standard SQL functions and 
+the oldsqlf-* use the SQL functions defined using LANGUAGE SQL.
+Use --show-script=scriptname to see what is actually run.
+   
+   
 With the special name list, show the list of built-in scripts
-and exit immediately.
+and exit immediately :
+
+$ pgbench -b list
+Available builtin scripts:
+  tpcb-like: 
+  plpgsql-tpcb-like: 
+  sqlfunc-tpcb-like: 
+  oldsqlf-tpcb-like: 
+  simple-update: 
+  plpgsql-simple-update: 
+  sqlfunc-simple-update: 
+  oldsqlf-simple-update: 
+select-only: 
+


 Optionally, write an integer weight after @ to
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..21d2fd64548 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -160,8 +160,8 @@ typedef struct socket_set
 /
  * some configurable parameters */
 
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
-#define ALL_INIT_STEPS "dtgGvpf"	/* all possible steps */
+#define DEFAULT_INIT_STEPS "dYtgvpy"	/* default -I setting */
+#define ALL_INIT_STEPS "dYtgGvpfy"	/* all possible steps */
 
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -796,6 +796,33 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-tpcb-like",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_tpcb_like(:aid, :bid, :tid, :delta);\n"
+	},
+	{
+		"sqlfunc-tpcb-like",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_tpcb_like_sqlfunc(:aid, :bid, :tid, :delta);\n"
+	},
+	{
+		"oldsqlf-tpcb-like",
+		"

Re: vacuumdb changes for stats import/export

2025-07-23 Thread Frédéric Yhuel




On 3/18/25 22:37, Nathan Bossart wrote:
> Committed with the following small changes:

Hi, I don't really understand this sentence in
doc/src/sgml/ref/vacuumdb.sgml:

> This option prevents vacuumdb from deleting existing statistics so
> that the query optimizer's choices do not become transiently worse.

I thought that the point was to avoid unnecessary post-upgrade analyzes?




Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Andres Freund
Hi,

On 2025-07-23 10:23:53 +, Bertrand Drouvot wrote:
> On Wed, Jul 23, 2025 at 05:09:54PM +0900, Michael Paquier wrote:
> > so we don’t really need to mix multiple numbers; we could just have a single
> > boolean flag that any fixed-sized stats kinds can set to let the reporting 
> > know
> > that some activity has happened.
> 
> That works to say "there are pending stats" but not well to say "there are no
> pending stats".
> 
> Indeed, with a single boolean flag, then how could a stat say that it has 
> nothing
> pending anymore (when flushing) without saying "all the stats have nothing
> pending" (while some may still have pending stats)?

I don't think that's a problem - reset that global flag after checking it at
the start of pgstat_report_stat() and set it to true if partial_flush is true
at the end of pgstat_report_stat().
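Something like this, in other words (names invented; just sketching the
control flow):

static bool pgstat_have_pending = false;

/* at the top of pgstat_report_stat(): */
if (!force && !pgstat_have_pending)
	return 0;					/* fast path: nothing to do, no clock check */

pgstat_have_pending = false;

/* ... flush pending entries, computing partial_flush as today ... */

/* at the bottom: */
if (partial_flush)
	pgstat_have_pending = true;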

Greetings,

Andres Freund




is there any specific use of setting transaction_read_only GUC externally?

2025-07-23 Thread Srinath Reddy Sadipiralla
Hi,
$SUBJECT, because once the next transaction starts, XactReadOnly is going to
change anyway, depending on either RecoveryInProgress() or the
DefaultXactReadOnly GUC. So, if there's no specific use for setting the
transaction_read_only GUC externally, can we change its GucContext from
PGC_USERSET to PGC_INTERNAL and remove the check_transaction_read_only API?
That would make it clear that it can't be set externally and can only be
inspected with SHOW. Thoughts?
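For reference, the guc_tables.c entry would then look something like this
(a rough sketch -- the group, flags and descriptions below are illustrative,
not copied from the real entry):

/* Sketch only: transaction_read_only as a PGC_INTERNAL GUC. */
{
	{"transaction_read_only", PGC_INTERNAL, UNGROUPED,
		gettext_noop("Shows the current transaction's read-only status."),
		NULL,
		GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
	},
	&XactReadOnly,
	false,
	NULL, NULL, NULL			/* check_transaction_read_only no longer needed */
},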

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/


Re: Postgres keep alive configuration

2025-07-23 Thread Fabrice Chapuis
OK, thanks Daniel.

It shows the right value with a connection over the network.

Regards

Fabrice

postgres [2867156]=> show tcp_keepalives_idle;
+---------------------+
| tcp_keepalives_idle |
+---------------------+
| 3                   |
+---------------------+
(1 row)

On Wed, Jul 23, 2025 at 1:30 PM Daniel Westermann (DWE) <
daniel.westerm...@dbi-services.com> wrote:

> >From: Fabrice Chapuis 
> >Sent: Wednesday, July 23, 2025 13:20
> >To: PostgreSQL Hackers 
> >Subject: Postgres keep alive configuration
> >
> >Hi,
> >
> >I try to configure these 3 parameters to activate keep alive on a server
> Postgres in version 14.
> >
> >tcp_keepalives_idle = 3
> >tcp_keepalives_interval = 2
> >tcp_keepalives_count = 30
>
> >After issuing alter system set ... command and a pg_reload_conf(); these
> parameters keep value 0.
>
> Did you create a connection over the network when checking the values or
> are you connected over a socket? For socket connections these values will
> always be 0.
>
> Regards
> Daniel


Re: index prefetching

2025-07-23 Thread Peter Geoghegan
On Tue, Jul 22, 2025 at 9:31 PM Peter Geoghegan  wrote:
> I'll get back to you on this soon. There are plenty of indexes that
> are not perfectly correlated (like pgbench_accounts_pkey is) that'll
> nevertheless benefit significantly from the approach taken by the
> complex patch.

I'll give you a few examples that look like this. I'm not necessarily
suggesting that you adopt these example indexes into your test suite
-- these are just to stimulate discussion.

* The TPC-C order line table primary key.

This is the single largest index used by TPC-C, by quite some margin.
This is the index that my Postgres 12 "split after new tuple"
optimization made about 40% smaller with retail inserts -- I've
already studied it in detail.

It's a composite index on 4 integer columns, where each leaf page
contains about 260 index tuples. Note that this is true regardless of
whether retail inserts or CREATE INDEX were used (thanks to the
"split after new tuple" thing). And yet I see that nhblks is even
*lower* than pgbench_accounts: the average per-leaf-page nhblks is about
4 or 5. While the odd leaf page has an nhblks of 7 or 8, there are some
individual leaf pages that are full but have an nhblks of only 4.

I would expect such an index to benefit to the maximum possible extent
from the complex patch/eager leaf page reading. This is true in spite
of the fact that technically the overall correlation is weak.

Here's what a random leaf page looks like, in terms of TIDs:

┌┬───┐
│ itemoffset │   htid│
├┼───┤
│  1 │ ∅ │
│  2 │ (1510,55) │
│  3 │ (1510,56) │
│  4 │ (1510,57) │
│  5 │ (1510,58) │
│  6 │ (1510,59) │
│  7 │ (1510,60) │
│  8 │ (1510,61) │
│  9 │ (1510,62) │
│ 10 │ (1510,63) │
│ 11 │ (1510,64) │
│ 12 │ (1510,65) │
│ 13 │ (1510,66) │
│ 14 │ (1510,67) │
│ 15 │ (1510,68) │
│ 16 │ (1510,69) │
│ 17 │ (1510,70) │
│ 18 │ (1510,71) │
│ 19 │ (1510,72) │
│ 20 │ (1510,73) │
│ 21 │ (1510,74) │
│ 22 │ (1510,75) │
│ 23 │ (1510,76) │
│ 24 │ (1510,77) │
│ 25 │ (1510,78) │
│ 26 │ (1510,79) │
│ 27 │ (1510,80) │
│ 28 │ (1510,81) │
│ 29 │ (1517,1)  │
│ 30 │ (1517,2)  │
│ 31 │ (1517,3)  │
│ 32 │ (1517,4)  │
│ 33 │ (1517,5)  │
│ 34 │ (1517,6)  │
│ 35 │ (1517,7)  │
│ 36 │ (1517,8)  │
│ 37 │ (1517,9)  │
│ 38 │ (1517,10) │
│ 39 │ (1517,11) │
│ 40 │ (1517,12) │
│ 41 │ (1517,13) │
│ 42 │ (1517,14) │
│ 43 │ (1517,15) │
│ 44 │ (1517,16) │
│ 45 │ (1517,17) │
│ 46 │ (1517,18) │
│ 47 │ (1517,19) │
│ 48 │ (1517,20) │
│ 49 │ (1517,21) │
│ 50 │ (1517,22) │
│ 51 │ (1517,23) │
│ 52 │ (1517,24) │
│ 53 │ (1517,25) │
│ 54 │ (1517,26) │
│ 55 │ (1517,27) │
│ 56 │ (1517,28) │
│ 57 │ (1517,29) │
│ 58 │ (1517,30) │
│ 59 │ (1517,31) │
│ 60 │ (1517,32) │
│ 61 │ (1517,33) │
│ 62 │ (1517,34) │
│ 63 │ (1517,35) │
│ 64 │ (1517,36) │
│ 65 │ (1517,37) │
│ 66 │ (1517,38) │
│ 67 │ (1517,39) │
│ 68 │ (1517,40) │
│ 69 │ (1517,41) │
│ 70 │ (1517,42) │
│ 71 │ (1517,43) │
│ 72 │ (1517,44) │
│ 73 │ (1517,45) │
│ 74 │ (1517,46) │
│ 75 │ (1517,47) │
│ 76 │ (1517,48) │
│ 77 │ (1517,49) │
│ 78 │ (1517,50) │
│ 79 │ (1517,51) │
│ 80 │ (1517,52) │
│ 81 │ (1517,53) │
│ 82 │ (1517,54) │
│ 83 │ (1517,55) │
│ 84 │ (1517,56) │
│ 85 │ (1517,57) │
│ 86 │ (1517,58) │
│ 87 │ (1517,59) │
│ 88 │ (1517,60) │
│ 89 │ (1517,62) │
│ 90 │ (1523,1)  │
│ 91 │ (1523,2)  │
│ 92 │ (1523,3)  │
│ 93 │ (1523,4)  │
│ 94 │ (1523,5)  │
│ 95 │ (1523,6)  │
│ 96 │ (1523,7)  │
│ 97 │ (1523,8)  │
│ 98 │ (1523,9)  │
│ 99 │ (1523,10) │
│100 │ (1523,11) │
│101 │ (1523,12) │
│102 │ (1523,13) │
│103 │ (1523,14) │
│104 │ (1523,15) │
│105 │ (1523,16) │
│106 │ (1523,17) │
│107 │ (1523,18) │
│108 │ (1523,19) │
│109 │ (1523,20) │
│110 │ (1523,21) │
│111 │ (1523,22) │
│112 │ (1523,23) │
│113 │ (1523,24) │
│114 │ (1523,25) │
│115 │ (1523,26) │
│116 │ (1523,27) │
│117 │ (1523,28) │
│118 │ (1523,29) │
│119 │ (1523,30) │
│120 │ (1523,31) │
│121 │ (1523,32) │
│122 │ (1523,33) │
│123 │ (1523,34) │
│124 │ (1523,35) │
│125 │ (1523,36) │
│126 │ (1523,37) │
│127 │ (1523,38) │
│128 │ (15

Re: trivial grammar refactor

2025-07-23 Thread Álvaro Herrera
On 2025-Jul-23, Nathan Bossart wrote:

> On Wed, Jul 23, 2025 at 05:38:34PM +0200, Álvaro Herrera wrote:
> > I noticed some duplicative coding while hacking on REPACK[1].  We can
> > save a few lines now with a trivial change to the rules for CHECKPOINT
> > and REINDEX, which will also allow saving a few extra lines in that patch.
> > 
> > Any objections to this?
> 
> Seems reasonable to me.  Any chance we can use this for CLUSTER, VACUUM,
> ANALYZE, and EXPLAIN, too?

ANALYZE and VACUUM cannot be changed this way, because the productions
where options are optional are the legacy ones; so running
"VACUUM/ANALYZE table" uses that one, and we cannot change that:

VacuumStmt: VACUUM opt_full opt_freeze opt_verbose opt_analyze opt_vacuum_relation_list

EXPLAIN doesn't work because the '(' starts either a query (via
select_with_parens) or an option list, so you get a shift/reduce
conflict.

I hadn't looked at CLUSTER, because the REPACK patch is absorbing that
rule into a larger rule for both REPACK and CLUSTER.  But now that I
look again, I realize that apparently my REPACK branch is bogus, because
I forgot to add a production for "CLUSTER name ON qualified_name".  And
with that one put back, I cannot use opt_utility_option_list there
either, because the above conflicts with "CLUSTER
opt_utility_option_list qualified_name cluster_index_specification".

So we can still do this, and I still think it's a win, but unfortunately
it won't help for the REPACK patch.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)




Re: index prefetching

2025-07-23 Thread Tomas Vondra
On 7/23/25 17:09, Andres Freund wrote:
> Hi,
> 
> On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote:
>> On 7/23/25 02:59, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
 But I don't see why would this have any effect on the prefetch distance,
 queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
 that. I'd have expected exactly the opposite behavior.

 Could be bug, of course. But it'd be helpful to see the dataset/query.
>>>
>>> Pgbench scale 500, with the simpler query from my message.
>>>
>>
>> I tried to reproduce this, but I'm not seeing this behavior. I'm not sure how
>> you monitor the queue depth (presumably iostat?)
> 
> Yes, iostat, since I was looking at what the "actually required" lookahead
> distance is.
> 
> Do you actually get the query to be entirely CPU bound? What amount of IO
> waiting do you see EXPLAIN (ANALYZE, TIMING OFF) with track_io_timing=on
> report?
> 

No, it definitely needs to wait for I/O (FWIW it's on the xeon, with a
single NVMe SSD).

> Ah - I was using a very high effective_io_concurrency. With a high
> effective_io_concurrency value I see a lot of stalls, even at
> INDEX_SCAN_MAX_BATCHES = 64. And a lower prefetch distance, which seems
> somewhat odd.
> 

I think that's a bug in the explain patch. The counters were updated at
the beginning of _next_buffer(), but that's wrong - a single call to
_next_buffer() can prefetch multiple blocks. This skewed the stats, as
the prefetches are not counted with "distance=0". With higher eic this
happens sooner, so the average distance seemed to decrease.

The attached patch does the updates in _get_block(), which I think is
better. And "stall" now means (distance == 1), which I think detects
requests without prefetching.

I also added a separate "Count" for the actual number of prefetched
blocks, and "Skipped" for duplicate blocks skipped (which the read
stream never even sees, because it's skipped in the callback).
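The accounting rule itself is small; roughly (names invented, not the exact
code in the attached patch):

/* Sketch: called for every block the callback is about to hand to the
 * read stream; returns false if the block is a duplicate to skip. */
typedef struct PrefetchStats
{
	uint64		count;			/* blocks handed to the read stream */
	uint64		skipped;		/* duplicates the stream never sees */
	uint64		stalls;			/* requests issued with no lookahead */
	uint64		distance_accum; /* for computing the average distance */
} PrefetchStats;

static inline bool
prefetch_account_block(PrefetchStats *ps, BlockNumber block,
					   BlockNumber prev_block, int distance)
{
	if (block == prev_block)
	{
		ps->skipped++;
		return false;			/* skipped in the callback */
	}

	ps->count++;
	ps->distance_accum += distance;
	if (distance <= 1)
		ps->stalls++;			/* distance 1 is effectively synchronous */

	return true;
}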

> 
> FWIW, in my tests I was just evicting lineitem from shared buffers, since I
> wanted to test the heap prefetching, without stalls induced by blocking on
> index reads. But what I described happens with either.
> 
> ;SET effective_io_concurrency = 256;SELECT pg_buffercache_evict_relation('pgbench_accounts');
> explain (analyze, costs off, timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 1000);
> 
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Aggregate (actual rows=1.00 loops=1)
>    Buffers: shared hit=27369 read=164191
>    I/O Timings: shared read=358.795
>    ->  Limit (actual rows=1000.00 loops=1)
>          Buffers: shared hit=27369 read=164191
>          I/O Timings: shared read=358.795
>          ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=1000.00 loops=1)
>                Index Searches: 1
>                Prefetch Distance: 256.989
>                Prefetch Stalls: 3
>                Prefetch Resets: 3
>                Buffers: shared hit=27369 read=164191
>                I/O Timings: shared read=358.795
>  Planning Time: 0.086 ms
>  Execution Time: 4194.845 ms
> 
> ;SET effective_io_concurrency = 512;SELECT pg_buffercache_evict_relation('pgbench_accounts');
> explain (analyze, costs off, timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 1000);

Re: trivial grammar refactor

2025-07-23 Thread Nathan Bossart
On Wed, Jul 23, 2025 at 06:50:59PM +0200, Álvaro Herrera wrote:
> So we can still do this, and I still think it's a win,

+1

> but unfortunately it won't help for the REPACK patch.

Darn.

-- 
nathan




Re: Making type Datum be 8 bytes everywhere

2025-07-23 Thread Tom Lane
Andres Freund  writes:
> On 2025-07-18 13:24:32 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> One class of complaints is about DatumGetPointer() and
>>> PointerGetDatum() casting between different sizes:

>> We might be able to silence those with intermediate casts to uintptr_t,
>> perhaps?

> Yep, that does the trick.

Cool, thanks for checking.

>>> I've not looked into the performance consequences.  We probably
>>> should at least try to measure that, though I'm not sure what
>>> our threshold of pain would be for deciding not to do this.

> The hard bit would be to determine what workload to measure.  Something like
> pgbench probably won't suffer meaningfully, there's just not enough passing of
> values around.

I'm disinclined to put in a huge amount of effort looking for the
worst case.  We established long ago that we weren't going to
optimize for 32-bit anymore.  So as long as this doesn't completely
tank performance on 32-bit, I'm satisfied.  I'd almost say that
if standard pgbench doesn't notice the change, that's good enough.

regards, tom lane




Re: RecoveryTargetAction is left out in xlog_interna.h

2025-07-23 Thread Álvaro Herrera
On 2024-Sep-04, Kyotaro Horiguchi wrote:

> Hello,
> 
> While reviewing a patch, I noticed that enum RecoveryTargetAction is
> still in xlog_internal.h, even though it seems like it should be in
> xlogrecovery.h. Commit 70e81861fa separated out xlogrecovery.c/h and
> moved several enums related to recovery targets to
> xlogrecovery.h. However, it appears that enum RecoveryTargetAction was
> inadvertently left behind in that commit.

True.  Pushed.  I felt backpatching was unnecessary, even though the
mistake is old.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




RE: 024_add_drop_pub.pl might fail due to deadlock

2025-07-23 Thread Hayato Kuroda (Fujitsu)
Dear Ajin,

Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired in the following order:

Order   target                      level
1       pg_subscription             row exclusive
2       pg_subscription, my tuple   access exclusive
3       pg_subscription_rel         access exclusive
4       pg_replication_origin       exclusive

In contrast, the apply worker works like this:

Order   target                      level
1       pg_replication_origin       exclusive
2       pg_subscription, my tuple   access share
3       pg_subscription_rel         row exclusive

Thus a backend may wait at step 4, and the apply worker can get stuck at step 2 or 3.
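
To illustrate, a minimal sketch of what the worker side could do: take the catalog
locks in the same order as the DDL path before touching the origin. The function
name, lock levels and placement here are assumptions for illustration, not the
actual patch:

static void
drop_origin_in_ddl_lock_order(char *originname)
{
	Relation	rel;

	/* 1 & 2: same catalog locks, in the same order as ALTER/DROP SUBSCRIPTION */
	LockSharedObject(SubscriptionRelationId, MySubscription->oid, 0,
					 AccessShareLock);
	rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);

	/* 3: only now drop the replication origin */
	replorigin_drop_by_name(originname, true, false);

	table_close(rel, NoLock);
}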

Below are my comments:

```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 * is dropped. So passing missing_ok = false.
 */
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, 
syncslotname, false);
-
```
This change is not needed.

```
+   if (!rel)
+   rel = 
table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
```

I feel it is too strong, isn't it enough to use row exclusive as initially used?

```
 UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
-  XLogRecPtr sublsn)
+  XLogRecPtr sublsn, bool 
lock_needed)
```

I feel the name `lock_needed` is a bit misleading, because the function still
opens the pg_subscription_rel catalog with row exclusive. How about
"lock_shared_object"?

```
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool 
isTopLevel)
 * Note that the state can't change because we have already stopped both
 * the apply and tablesync workers and they can't restart because of
 * exclusive lock on the subscription.
+*
+* Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+* deadlock with apply workers of other subscriptions trying
+* to drop tracking origin.
 */
+   sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
```

Hmm. Per my understanding, DropSubscription acquires the locks below until it
reaches replorigin_drop_by_name().

Order   target                      level
1       pg_subscription             access exclusive
2       pg_subscription, my tuple   access exclusive
3       pg_replication_origin       exclusive

IIUC we must preserve the ordering, not the target of the locks.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Bertrand Drouvot
On Wed, Jul 23, 2025 at 05:09:54PM +0900, Michael Paquier wrote:
> On Jul 23, 2025, at 16:33, Bertrand Drouvot  
> wrote:
> > Maybe we could use a flag, say:
> > 
> > #define PGSTAT_PENDING_IO (1 << 0)
> > #define PGSTAT_PENDING_WAL(1 << 1)
> > #define PGSTAT_PENDING_SLRU   (1 << 2)
> > 
> > and check for a pgstat_pending_mask in pgstat_report_stat() instead?
> > 
> > They would need to set pgstat_pending_mask accordingly when they flush, have
> > pending stats though.
> 
> The point is just to say to the report path to move further if at least one of
> the fixed stats kinds has something to flush, then let the loop do its work
> across all the stats kinds if they have a flush callback,

Right.

> so we don’t really need to mix multiple numbers; we could just have a single
> boolean flag that any fixed-sized stats kinds can set to let the reporting 
> know
> that some activity has happened.

That works for saying "there are pending stats", but not for saying "there are
no pending stats".

Indeed, with a single boolean flag, how could one stats kind say that it has
nothing pending anymore (when flushing) without implying "all the stats kinds
have nothing pending" (while some may still have pending stats)?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [(known) BUG] DELETE/UPDATE more than one row in partitioned foreign table

2025-07-23 Thread Etsuro Fujita
Hi,

On Sat, Jul 19, 2025 at 12:53 AM Jehan-Guillaume de Rorthais
 wrote:
> We have been bitten by this old bug recently:
>
> https://www.postgresql.org/message-id/flat/CAFjFpRfcgwsHRmpvoOK-GUQi-n8MgAS%2BOxcQo%3DaBDn1COywmcg%40mail.gmail.com
>
> I suspect this bug lost attention when the only fixed submitted to the
> commitfest has been flagged as "returned with feedback":
>
> https://commitfest.postgresql.org/patch/1819/

This is on my TODO list, but I didn't have time to work on it, unfortunately.

> Please, find in attachment the old patch forbidding more than one row to be
> deleted/updated from postgresExecForeignDelete and postgresExecForeignUpdate. 
> I
> just rebased them without modification. I suppose this is much safer than
> leaving the FDW destroy arbitrary rows on the remote side based on their sole
> ctid.

Thanks for rebasing the patch set, but I don't think the idea
implemented in the second patch is safe; even with the patch we have:

create table plt (a int, b int) partition by list (a);
create table plt_p1 partition of plt for values in (1);
create table plt_p2 partition of plt for values in (2);
create function trig_null() returns trigger language plpgsql as
  $$ begin return null; end; $$;
create trigger trig_null before update or delete on plt_p1
  for each row execute function trig_null();
create foreign table fplt (a int, b int)
  server loopback options (table_name 'plt');

insert into fplt values (1, 1), (2, 2);
INSERT 0 0
select tableoid::regclass, ctid, * from plt;
 tableoid | ctid  | a | b
--+---+---+---
 plt_p1   | (0,1) | 1 | 1
 plt_p2   | (0,1) | 2 | 2
(2 rows)

explain verbose update fplt set b = (case when random() <= 1 then 10
else 20 end) where a = 1;
   QUERY PLAN
-
 Update on public.fplt  (cost=100.00..128.02 rows=0 width=0)
   Remote SQL: UPDATE public.plt SET b = $2 WHERE ctid = $1
   ->  Foreign Scan on public.fplt  (cost=100.00..128.02 rows=7 width=42)
 Output: CASE WHEN (random() <= '1'::double precision) THEN 10
ELSE 20 END, ctid, fplt.*
 Remote SQL: SELECT a, b, ctid FROM public.plt WHERE ((a = 1))
FOR UPDATE
(5 rows)

update fplt set b = (case when random() <= 1 then 10 else 20 end) where a = 1;
UPDATE 1
select tableoid::regclass, ctid, * from plt;
 tableoid | ctid  | a | b
--+---+---+
 plt_p1   | (0,1) | 1 |  1
 plt_p2   | (0,2) | 2 | 10
(2 rows)

The row in the partition plt_p2 is updated, which is wrong as the row
doesn't satisfy the query's WHERE condition.

> Or maybe we should just not support foreign table to reference a
> remote partitioned table?

I don't think so because we can execute SELECT, INSERT, and direct
UPDATE/DELETE safely on such a foreign table.

I think a simple fix for this is to teach the system that the foreign
table is a partitioned table; in more detail, I would like to propose
to 1) add to postgres_fdw a table option, inherited, to indicate
whether the foreign table is a partitioned/inherited table or not, and
2) modify postgresPlanForeignModify() to throw an error if the given
operation is an update/delete on such a foreign table.  Attached is a
WIP patch for that.  I think it is the user's responsibility to set
the option properly, but we could modify postgresImportForeignSchema()
to support that.  Also, I think this would be back-patchable.
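
To illustrate the idea (this is only a sketch, not the attached WIP patch; the
option flag and the message wording are assumptions), the check in
postgresPlanForeignModify() could look roughly like:

	/* reject UPDATE/DELETE when the remote table is declared inherited */
	if (inherited_option_set &&
		(operation == CMD_UPDATE || operation == CMD_DELETE))
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cannot update or delete from foreign table \"%s\" whose remote table is partitioned or inherited",
						RelationGetRelationName(rel))));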

What do you think about that?

Best regards,
Etsuro Fujita


postgres_fdw-disallow-upddel-in-problematic-cases-wip.patch
Description: Binary data


Re: Postgres keep alive configuration

2025-07-23 Thread Daniel Westermann (DWE)
>From: Fabrice Chapuis 
>Sent: Wednesday, July 23, 2025 13:20
>To: PostgreSQL Hackers 
>Subject: Postgres keep alive configuration
> 
>Hi,
>
>I try to configure these 3 parameters to activate keep alive on a server
>running Postgres version 14.
>
>tcp_keepalives_idle = 3
>tcp_keepalives_interval = 2
>tcp_keepalives_count = 30

>After issuing an ALTER SYSTEM SET ... command and a pg_reload_conf(), these
>parameters keep the value 0.

Did you create a connection over the network when checking the values or are 
you connected over a socket? For socket connections these values will always be 
0.

Regards
Daniel



Re: 024_add_drop_pub.pl might fail due to deadlock

2025-07-23 Thread Amit Kapila
On Wed, Jul 23, 2025 at 2:42 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Ajin,
>
> Thanks for the patch. Firstly let me confirm my understanding. While altering the
> subscription, locks are acquired in the following order:
>
> Order   target                      level
> 1       pg_subscription             row exclusive
> 2       pg_subscription, my tuple   access exclusive
> 3       pg_subscription_rel         access exclusive
> 4       pg_replication_origin       exclusive
>
> In contrast, the apply worker works like this:
>
> Order   target                      level
> 1       pg_replication_origin       exclusive
> 2       pg_subscription, my tuple   access share
> 3       pg_subscription_rel         row exclusive
>
> Thus a backend may wait at step 4, and the apply worker can get stuck at step 2 or 3.
>
> Below are my comments:
>
> ```
> @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>  * is dropped. So passing missing_ok = false.
>  */
> ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, 
> syncslotname, false);
> -
> ```
> This change is not needed.
>
> ```
> +   if (!rel)
> +   rel = 
> table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> +
> ```
>
> I feel it is too strong, isn't it enough to use row exclusive as initially 
> used?
>
> ```
>  UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
> -  XLogRecPtr sublsn)
> +  XLogRecPtr sublsn, bool 
> lock_needed)
> ```
>
> I feel the name `lock_needed` is a bit misleading, because the function still
> opens the pg_subscription_rel catalog with row exclusive. How about
> "lock_shared_object"?
>

I think if we lock in a caller, we don't need to use any lock during
table_open. We can use the parameter name as already_locked as we do
at some other places in the code.
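
A minimal sketch of that convention (the extra parameter comes from the patch
under discussion, not HEAD; the function body is elided):

void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
						   XLogRecPtr sublsn, bool already_locked)
{
	/* caller already holds the lock on pg_subscription_rel => NoLock here */
	Relation	rel = table_open(SubscriptionRelRelationId,
								 already_locked ? NoLock : RowExclusiveLock);

	/* ... existing tuple lookup and update logic unchanged ... */

	table_close(rel, NoLock);
}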

-- 
With Regards,
Amit Kapila.




Re: Pathify RHS unique-ification for semijoin planning

2025-07-23 Thread Álvaro Herrera
Hello,

As a very trivial test on this patch, I ran the query in your opening
email, both with and without the patch, scaling up the size of the table
a little bit.  So I did this

drop table if exists t;
create table t(a int, b int);
insert into t select i % 10, i from generate_series(1,1e7) i;
create index on t(a);
vacuum analyze t;

set enable_hashagg to off;

explain (costs off, analyze, buffers)
select * from t t1 where t1.a in
  (select a from t t2 where a < 1)
order by t1.a;


This is the plan without the patch:

QUERY PLAN  
   
───
 Merge Join (actual time=289.262..700.761 rows=100.00 loops=1)
   Merge Cond: (t1.a = t2.a)
   Buffers: shared hit=1017728 read=3945 written=3361, temp read=1471 
written=1476
   ->  Index Scan using t_a_idx on t t1 (actual time=0.011..320.747 
rows=101.00 loops=1)
 Index Searches: 1
 Buffers: shared hit=997725 read=3112 written=2664
   ->  Sort (actual time=219.273..219.771 rows=1.00 loops=1)
 Sort Key: t2.a
 Sort Method: quicksort  Memory: 385kB
 Buffers: shared hit=20003 read=833 written=697, temp read=1471 
written=1476
 ->  Unique (actual time=128.173..218.708 rows=1.00 loops=1)
   Buffers: shared hit=20003 read=833 written=697, temp read=1471 
written=1476
   ->  Sort (actual time=128.170..185.461 rows=100.00 loops=1)
 Sort Key: t2.a
 Sort Method: external merge  Disk: 11768kB
 Buffers: shared hit=20003 read=833 written=697, temp 
read=1471 written=1476
 ->  Index Only Scan using t_a_idx on t t2 (actual 
time=0.024..74.171 rows=100.00 loops=1)
   Index Cond: (a < 1)
   Heap Fetches: 0
   Index Searches: 1
   Buffers: shared hit=20003 read=833 written=697
 Planning:
   Buffers: shared hit=28 read=7
 Planning Time: 0.212 ms
 Execution Time: 732.840 ms


and this is the plan with the patch:

  QUERY PLAN
   
───
 Merge Join (actual time=70.310..595.116 rows=100.00 loops=1)
   Merge Cond: (t1.a = t2.a)
   Buffers: shared hit=1017750 read=3923 written=3586
   ->  Index Scan using t_a_idx on t t1 (actual time=0.020..341.257 
rows=101.00 loops=1)
 Index Searches: 1
 Buffers: shared hit=996914 read=3923 written=3586
   ->  Unique (actual time=0.028..99.074 rows=1.00 loops=1)
 Buffers: shared hit=20836
 ->  Index Only Scan using t_a_idx on t t2 (actual time=0.026..66.219 
rows=100.00 loops=1)
   Index Cond: (a < 1)
   Heap Fetches: 0
   Index Searches: 1
   Buffers: shared hit=20836
 Planning:
   Buffers: shared hit=55 read=15 written=14
 Planning Time: 0.391 ms
 Execution Time: 621.377 ms


This is a really nice improvement.  I think we could find queries that
are arbitrarily faster, by feeding enough tuples to the unnecessary Sort
nodes.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)




Re: Proposal: QUALIFY clause

2025-07-23 Thread Álvaro Herrera
On 2025-Jul-22, Andrew Dunstan wrote:

> If we were making up our own syntax this would be a sensible thing to
> debate. If we're talking about implementing something we expect to be in the
> standard, I think we will have to live with what the standards committee
> decides, regardless of our preference. We've almost certainly been preempted
> here by other RDBMSs using QUALIFY, heedless of English grammar.

The Romans, the Vikings, the Normans, all have influenced the English
language.  Why not SQL?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Michael Paquier
On Jul 23, 2025, at 16:33, Bertrand Drouvot  
wrote:
> Maybe we could use a flag, say:
> 
> #define PGSTAT_PENDING_IO (1 << 0)
> #define PGSTAT_PENDING_WAL(1 << 1)
> #define PGSTAT_PENDING_SLRU   (1 << 2)
> 
> and check for a pgstat_pending_mask in pgstat_report_stat() instead?
> 
> They would need to set pgstat_pending_mask accordingly when they flush, have
> pending stats though.

The point is just to say to the report path to move further if at least one of 
the fixed stats kinds has something to flush, then let the loop do its work 
across all the stats kinds if they have a flush callback, so we don’t really 
need to mix multiple numbers; we could just have a single boolean flag that any 
fixed-sized stats kinds can set to let the reporting know that some activity 
has happened. This would mean that custom fixed-sized kinds would need to 
interact with this flag as well, but that makes the fast-exit path dirt cheap 
with or without custom stats kinds.
--
Michael






RE: Conflict detection for update_deleted in logical replication

2025-07-23 Thread Zhijie Hou (Fujitsu)
On Wednesday, July 23, 2025 12:08 PM Amit Kapila  
wrote:
> 
> On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada
>  wrote:
> >
> > I've reviewed the 0001 patch and it looks good to me.
> >
> 
> Thanks, I have pushed the 0001 patch.

Thanks for pushing. I have rebased the remaining patches.

I have reordered the patches to prioritize the detection of update_deleted as
the initial patch. This can give us more time to consider the new GUC, since the
performance-related aspects have been documented.

One previous patch, used to prove the possibility of allowing
retain_dead_tuples to be changed for an enabled subscription, has not yet been
rebased. I will rebase it once all the main patches are stable.

Best Regards,
Hou zj



v52-0003-Re-create-the-replication-slot-if-the-conflict-r.patch
Description:  v52-0003-Re-create-the-replication-slot-if-the-conflict-r.patch


v52-0001-Support-the-conflict-detection-for-update_delete.patch
Description:  v52-0001-Support-the-conflict-detection-for-update_delete.patch


v52-0002-Introduce-a-new-GUC-max_conflict_retention_durat.patch
Description:  v52-0002-Introduce-a-new-GUC-max_conflict_retention_durat.patch


Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Bertrand Drouvot
Hi,

On Wed, Jul 23, 2025 at 09:54:12AM +0900, Michael Paquier wrote:
> Then, about the loop used here, I'd be OK to keep the past shortcuts
> for the builtin fixed-sized stats kinds with the requirement that new
> builtin stats kinds need to hack into pgstat_report_stat() themselves
> on efficiency grounds

Maybe we could use a flag, say:

#define PGSTAT_PENDING_IO (1 << 0) 
#define PGSTAT_PENDING_WAL(1 << 1)
#define PGSTAT_PENDING_SLRU   (1 << 2)

and check for a pgstat_pending_mask in pgstat_report_stat() instead?

They would need to set pgstat_pending_mask accordingly when they flush, have
pending stats though.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SQL Property Graph Queries (SQL/PGQ)

2025-07-23 Thread Amit Langote
Hi Ashutosh,

On Tue, Jul 22, 2025 at 8:40 PM Ashutosh Bapat
 wrote:
> Here's the patchset rebased on the latest master.

Thanks for the update. I was going through the patch series today and
had a few comments on the structure that might help with reviewing and
eventually committing, whoever ends up doing that. :)

I noticed that some functions, like graph_table_property_reference(),
are introduced in one patch and then renamed or removed in a later
one. It’s easier to review if such refactoring is avoided across
patches -- ideally, each patch should introduce code in its final
intended form. That also helps reduce the number of patches and makes
each one more self-contained.

One of the later patches (0004) collects several follow-up changes.
While it fixes a real bug -- a crash when GRAPH_TABLE is used inside
plpgsql due to a conflict with the columnref hook -- it also includes
incidental cleanups like switching to foreach_node(), updating
comments, and adjusting function signatures. Those would be better
folded into the patches that introduced the relevant code, rather than
grouped into a catch-all at the end. That keeps each patch focused and
easier to review -- and easier to merge when committing.

A general principle that might help: if you're refactoring existing
code, a standalone preliminary patch makes sense. But if you're
tweaking something just added in the same series, it’s usually better
to squash that into the original patch. The rename from
graph_table_property_reference() to transformGraphTablePropertyRef()
may be a fair exception since it reflects a shift prompted by the bug
fix -- but most other adjustments could be folded in without loss of
clarity.

I understand the intent to spell out each change, but if the details
are incidental to the overall design, they don’t necessarily need to
be split out. Explaining the reasoning in the thread is always
helpful, but consolidating the patches once the design has settled
makes things easier for both reviewers and committers.

Finally, attaching the patches directly, with versioned names like
v8-000x, instead of zipping them helps.  Many folks (myself included)
will casually skip a zip file because of the small hassle of unzipping
just to read a patch. I once postponed reading such a patch and didn’t
get back to it for quite a while. :)

-- 
Thanks, Amit Langote




Re: display hot standby state in psql prompt

2025-07-23 Thread Srinath Reddy Sadipiralla
On Wed, Jul 23, 2025 at 1:33 PM Jim Jones  wrote:

>
> I believe that 'read/write' is more idiomatic than 'read-write' in this
> context. 'Read-only' works as a hyphenated adjective, and 'read/write'
> is typically treated as a paired label that indicates two distinct
> capabilities --- read and write. What do you think?
>
> Makes sense.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/


Re: Document transition table triggers are not allowed on views/foreign tables

2025-07-23 Thread Etsuro Fujita
On Tue, Jul 15, 2025 at 4:55 PM Etsuro Fujita  wrote:
> Another thing I noticed about transition tables is that while we
> prohibit transition tables on views/foreign tables, there is no
> description about that in the user-facing documentation.  So I would
> like to propose to do $SUBJECT in create_trigger.sgml.  Attached is a
> patch for that.

If there are no objections, I will push this and back-patch it to all
supported versions.

Best regards,
Etsuro Fujita




Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-07-23 Thread Ilia Evdokimov



On 23.07.2025 03:11, David Rowley wrote:

On Wed, 23 Jul 2025 at 02:35, Andrei Lepikhov  wrote:

The 'Buffers:' way looks more natural to me. I don't like duplicated
text in the explain format - it is already cluttered by multiple
unnecessary elements that distract attention from the problematic plan
elements, such as unplaggable costs output if we only need row
predictions, '.00' in estimations, etc.

Seems logical.



+1





Will you add the ExplainOpenGroup call to the final version of the patch?

I'm leaning towards not doing that as a reader might expect all the
"Estimates" to be within that group, but the estimated cost and row
counts won't be in there. Maybe it's possible to circumvent that
expectation by naming the group something like "MemoizeEstimates", but
IMO, that seems excessive for 4 properties.



I agree. I would consider adding a group if we displayed information in 
a looped format, like for Workers, or if we had some particularly useful 
data for parsers - for example, timings or memory usage. But for four 
estimates, grouping seems unnecessary.


Given that, patch v11 still looks like the most appropriate version to me.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.





Postgres keep alive configuration

2025-07-23 Thread Fabrice Chapuis
Hi,

I try to configure these 3 parameters to activate keep alive on a server
running Postgres version 14.

tcp_keepalives_idle = 3
tcp_keepalives_interval = 2
tcp_keepalives_count = 30

After issuing an ALTER SYSTEM SET ... command and a pg_reload_conf(), these
parameters keep the value 0.

I also check these OS parameters
grep TCP_KEEPIDLE /usr/include/netinet/tcp.h
#define TCP_KEEPIDLE 4  /* Start keeplives after this period */

grep TCP_KEEPINTVL  /usr/include/netinet/tcp.h
#define TCP_KEEPINTVL5  /* Interval between keepalives */

grep TCP_KEEPCNT   /usr/include/netinet/tcp.h
#define TCP_KEEPCNT  6  /* Number of keepalives before death */

and

sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200

Thanks for helping on this point

Fabrice


Re: Interrupts vs signals

2025-07-23 Thread Joel Jacobson
On Thu, Mar 6, 2025, at 17:43, Heikki Linnakangas wrote:
> On 06/03/2025 02:47, Heikki Linnakangas wrote:
>> Here's a new patch set. It includes the previous work, and also goes the 
>> whole hog and replaces procsignals and many other signalling with 
>> interrupts. It's based on Thomas's v3-0002-Redesign-interrupts-remove- 
>> ProcSignals.patch much earlier in this thread.
>
> And here's yet another version. It's the same at high level, but with a 
> ton of little fixes.
>
> One notable change is that I merged storage/interrupt.[ch] with 
> postmaster/interrupt.[ch], so the awkwardness of having two files with 
> same name is gone.
>
> -- 
> Heikki Linnakangas
> Neon (https://neon.tech)
>
> Attachments:
> * v7-0001-Replace-Latches-with-Interrupts.patch
> * v7-0002-Fix-lost-wakeup-issue-in-logical-replication-laun.patch
> * v7-0003-Use-INTERRUPT_GENERAL-for-bgworker-state-change-n.patch
> * v7-0004-Replace-ProcSignals-and-other-ad-hoc-signals-with.patch

Great work in this thread. I'll try to help, definitively benchmarking,
but will also try to load the new design into my brain, to get the
correct mental model of it, so I can hopefully help with code review as
well.

I'm trying to figure out what all possible states a backend can be in.
With the new design, it basically seems like there are two different
types of bits? An Interrupt bit type, and a Sleeping bit. This gives
four possible states, if I'm not mistaken.

I say "bit type" since there are of course many different bits, for all
the existing interrupts, previously called "reasons" in the old design.
But all those types of interrupt bits are just one type of bit. The
other type of bit seems to be the "sleeping bit".

I note how the InterruptType is overloaded to also be used for the
sleeping bit (SLEEPING_ON_INTERRUPTS).

Would it be correct to say that a backend can be in any of these four
states?

1. Interrupt clear, Sleeping clear: The backend is running its code and
is not waiting for anything.

2. Interrupt set, Sleeping clear: Interrupt pending. A sender has set
our interrupt bit, but we were still executing code. Crucially, the
sender saw the Sleeping bit was clear and thus did not send a physical
wakeup. The pending interrupt will be serviced at the next
CHECK_FOR_INTERRUPTS() call.

3. Interrupt clear, Sleeping set: Preparing to block. This is the
critical, short-lived state that solves the lost-wakeup problem. We have
just atomically set the Sleeping bit, advertising our intent to enter a
kernel wait. Any sender that sees this state knows it is now responsible
for sending a physical wakeup signal.

4. Interrupt set, Sleeping set: Wakeup in progress. This state confirms
the race was handled. A sender caught us in state 3, flipped our
Interrupt bit, saw our Sleeping bit was set, and sent the required
physical wakeup. We will enter the kernel wait but return immediately,
service the interrupt, and reset both bits.
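
To make the race concrete, here is a minimal, self-contained sketch of that
protocol using plain C11 atomics; all names and the bit layout are mine, not
taken from the patch set:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLEEPING_BIT	0x80000000u		/* "I am about to block" */
#define INTERRUPT_BITS	0x7fffffffu		/* any pending-interrupt bit */

typedef struct Backend { _Atomic uint32_t flags; } Backend;

/* sender: set an interrupt bit, and wake the target only if it advertised
 * that it is sleeping (states 3/4 above) */
static void
send_interrupt(Backend *target, uint32_t bit, void (*wakeup) (Backend *))
{
	uint32_t	old = atomic_fetch_or(&target->flags, bit);

	if (old & SLEEPING_BIT)
		wakeup(target);
}

/* sleeper: advertise intent to block, then re-check before blocking,
 * which is what closes the lost-wakeup window */
static bool
prepare_to_sleep(Backend *me)
{
	atomic_fetch_or(&me->flags, SLEEPING_BIT);
	if (atomic_load(&me->flags) & INTERRUPT_BITS)
	{
		atomic_fetch_and(&me->flags, ~SLEEPING_BIT);
		return false;			/* don't block; service interrupts instead */
	}
	return true;				/* safe to enter the kernel wait */
}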

If my mental model of your new design is correct, I was expecting to see similar
performance gains for LISTEN/NOTIFY, thanks to not unnecessarily interrupting
listening backends.

To test, I checked out an old git hash (d611f8b) from 2025-03-06, so I
could apply the v7 patch set.

I then ran one of the benchmark scripts that I've created for my work on
async.c.

It's a great win compared to master, but for some reason, your v7 patch
set is a bit slower on all benchmark configs I tested, except -j 1 -c 1,
where it's the same.

 Connections=Jobs | TPS (patch-joel) | TPS (patch-v7) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+------------------+----------------+-------------------+-----------------+----------------
                1 |           151510 |         151119 | -0.26%            |             923 |            577
                2 |           239051 |         227925 | -4.65%            |            1596 |            638
                4 |           250910 |         235238 | -6.25%            |            4891 |           5628
                8 |           171944 |         167271 | -2.72%            |            2752 |           5182
               16 |           165482 |         148305 | -10.38%           |            2825 |           6447
               32 |           145150 |         140216 | -3.40%            |            1566 |           1867
               64 |           131836 |         122617 | -6.99%            |             573 |            253
              128 |           121333 |         113668 | -6.32%            |             874 |            981
(8 rows)

This might be due to other changes since March, since my patch is against
master.

When you've rebased, I can run the benchmarks again, and try to find the
culprit.

Btw, I really like the idea of a "stepping stone" approach, mentioned by
Thomas Munro in the other thread [1]. That is, looking at whether we can find
ways to do smaller incremental changes, with a great ratio between
performance gains and additional complexity co

Re: Collation and primary keys

2025-07-23 Thread Daniel Verite
Jeff Davis wrote:

> * The libc C.UTF-8 locale was a reasonable default (though not a
> natural language collation). But now that we have C.UTF-8 available
> from the builtin provider, then we should encourage that instead of
> relying on the slower, platform-specific libc implementation.

Yes. In particular, we should encourage the ecosystem to support
the new collation features so that they're widely available to
end users.

Anecdotally, I was looking this week at upgrading instances to
v17 with the builtin C.UTF-8 locale.

They happen to be hosted by RDS, and RDS appears not to offer
the builtin provider at instance creation (nor ICU, even though
initdb with ICU is possible since v15).
Since the templates are restricted to libc, the database creations
should then use template0, locale_provider=builtin,
builtin_locale=...
But in my case, database creations have to be done by Terraform.
And as it turns out, Terraform's Postgres resource [1] also hasn't been
updated to allow for non-libc databases (only lc_collate and
lc_ctype can be set).

Conclusion for that upgrade: that will be v17 with libc collations :(

[1]
https://registry.terraform.io/providers/cyrilgdn/postgresql/latest/docs/resources/postgresql_database


Best regards,
-- 
Daniel Vérité 
https://postgresql.verite.pro/




Re: 024_add_drop_pub.pl might fail due to deadlock

2025-07-23 Thread Ajin Cherian
On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila  wrote:
>
> On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian  wrote:
> >
> > Yes, this is correct. I have also verified this in my testing that
> > when there is a second subscription, a deadlock can still happen with
> > my previous patch. I have now modified the patch in tablesync worker
> > to take locks on both SubscriptionRelationId and
> > SubscriptionRelRelationId prior to taking lock on
> > ReplicationOriginRelationId. I have also found that there is a similar
> > problem in DropSubscription() where lock on SubscriptionRelRelationId
> > is not taken before dropping the tracking origin. I've also modified
> > the signature of UpdateSubscriptionRelState to take a bool
> > "lock_needed" which if false, the SubscriptionRelationId is not
> > locked. I've made a new version of the patch on PG_15.
> >
>
> Review comments:
> 
> 1.
> + if (!rel)
> + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>
> Why did you acquire AccessExclusiveLock here when the current code has
> RowExclusiveLock? It should be RowExclusiveLock.
>

Yes, you are correct. I have replaced it with RowExclusiveLock.

> 2.
> + *
> + * Lock SubscriptionRelationId with AccessShareLock and
> + * take AccessExclusiveLock on SubscriptionRelRelationId to
> + * prevent any deadlocks with user concurrently performing
> + * refresh on the subscription.
>   */
>
> Try to tell in the comments that we want to keep the locking order
> same as DDL commands to prevent deadlocks.
>

Modified.

> 3.
> + * Also close any tables prior to the commit.
>   */
> + if (rel)
> + {
> + table_close(rel, AccessExclusiveLock);
> + rel = NULL;
>
> We don't need to explicitly release lock on table_close, it will be
> done at transaction end, so use NoLock here as we do in current HEAD
> code.
>

Done.

> 4.
> DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
>  {
> - Relation rel;
> + Relation rel, sub_rel;
>   ObjectAddress myself;
>   HeapTuple tup;
>   Oid subid;
> @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
> bool isTopLevel)
>   * Note that the state can't change because we have already stopped both
>   * the apply and tablesync workers and they can't restart because of
>   * exclusive lock on the subscription.
> + *
> + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
> + * deadlock with apply workers of other subscriptions trying
> + * to drop tracking origin.
>   */
> + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>
> I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
> in DropSubscription. Kindly test again after fixing the first comment
> above.
> --

Yes, it was failing because I was taking AccessExclusiveLock in the
apply worker, and that was causing the deadlock in my testing. I
verified that this works now.



On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Ajin,
>
> Thanks for the patch. Firstly let me confirm my understanding. While altering the
> subscription, locks are acquired in the following order:
>
> Order   target                      level
> 1       pg_subscription             row exclusive
> 2       pg_subscription, my tuple   access exclusive
> 3       pg_subscription_rel         access exclusive
> 4       pg_replication_origin       exclusive
>
> In contrast, the apply worker works like this:
>
> Order   target                      level
> 1       pg_replication_origin       exclusive
> 2       pg_subscription, my tuple   access share
> 3       pg_subscription_rel         row exclusive
>
> Thus a backend may wait at step 4, and the apply worker can get stuck at step 2 or 3.
>

Yes, that is correct.


> Below are my comments:
>
> ```
> @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>  * is dropped. So passing missing_ok = false.
>  */
> ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, 
> syncslotname, false);
> -
> ```
> This change is not needed.

Removed.

>
> ```
> +   if (!rel)
> +   rel = 
> table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> +
> ```
>
> I feel it is too strong, isn't it enough to use row exclusive as initially 
> used?
>

Yes, agree. Fixed.

> ```
>  UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
> -  XLogRecPtr sublsn)
> +  XLogRecPtr sublsn, bool 
> lock_needed)
> ```
>
> I feel the name `lock_needed` is a bit misleading, because the function still
> opens the pg_subscription_rel catalog with row exclusive. How about
> "lock_shared_object"?
>

I have modified it to not take lock while table_open as well and
changed the name of the variable to al

Re: Suggestion to add --continue-client-on-abort option to pgbench

2025-07-23 Thread Yugo Nagata
On Tue, 22 Jul 2025 17:49:49 +0900
Yugo Nagata  wrote:

> On Fri, 18 Jul 2025 17:07:53 +0900
> Rintaro Ikeda  wrote:
> 
> > Hi,
> > 
> > On 2025/07/16 22:49, Yugo Nagata wrote:
> > >> I think we should also change the error message in pg_log_error. I 
> > >> modified the
> > >> patch v8-0003 as follows:
> > >> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, 
> > >> char
> > >> *varprefix)
> > >>
> > >> default:
> > >> /* anything else is unexpected */
> > >> -   pg_log_error("client %d script %d 
> > >> aborted in
> > >> command %d query %d: %s",
> > >> -st->id, 
> > >> st->use_file,
> > >> st->command, qrynum,
> > >> +   pg_log_error("client %d aborted in 
> > >> command %d
> > >> query %d of script %d: %s",
> > >> +st->id, 
> > >> st->command,
> > >> qrynum, st->use_file,
> > >>  
> > >> PQerrorMessage(st->con));
> > >> goto error;
> > >> }
> > >>
> > >> With this change, the output now is like this:
> > >>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: 
> > >>> ERROR:
> > >> duplicate key value violates unique constraint "test_col2_key"
> > >>
> > >> I want hear your thoughts.
> > > 
> > > My idea is to modify this as follows;
> > > 
> > > default:
> > > /* anything else is unexpected */
> > > -   pg_log_error("client %d script %d aborted 
> > > in command %d query %d: %s",
> > > -st->id, 
> > > st->use_file, st->command, qrynum,
> > > -
> > > PQerrorMessage(st->con));
> > > +   commandFailed(st, "SQL", 
> > > PQerrorMessage(st->con));
> > > goto error;
> > > }
> > > 
> > > This fix is originally planned to be included in patch v8, but was missed.
> > > It is now included in the attached patch, v10.
> > > 
> > > With this change, the output becomes:
> > > 
> > >  pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> > >   ERROR:  duplicate key value violates unique constraint "t2_pkey"
> > > 
> > > Although there is a slight difference, the message is essentially the 
> > > same as
> > > your proposal. Also, I believe the use of commandFailed() makes the code 
> > > simpler
> > > and more consistent.
> > > 
> > > What do you think?
> > > 
> > 
> > Thank you for the new patch! I think Nagata-san's v10 patch is a clear
> > improvement over my v9 patch. I'm happy with the changes.
> 
> Thank you.
> 
> I believe the patches implement the expected behavior, include appropriate doc and test
> modifications, are in good shape overall, so if there are no objections,
> I'll mark this as Ready for Committer.

I've updated the CF status to Ready for Committer.

Regards,
Yugo Nagata

-- 
Yugo Nagata 




Re: Fixing MSVC's inability to detect elog(ERROR) does not return

2025-07-23 Thread Tom Lane
David Rowley  writes:
> On Thu, 24 Jul 2025 at 12:27, Tom Lane  wrote:
>> Hmmm ... but you did check that in fact we can remove such known-dead
>> code and not get a warning now?

> Yes. The patch has a small temporary adjustment to
> BaseBackupGetTargetHandle() to comment out the return. It compiles for
> me using Visual Studio 2022 without any warnings. If I remove the
> macro change, I get:
> [598/2255] Compiling C object
> src/backend/postgres_lib.a.p/backup_basebackup_target.c.obj
> src\backend\backup\basebackup_target.c(150) : warning C4715:
> 'BaseBackupGetTargetHandle': not all control paths return a value

OK.  I'd vote for going ahead and seeing what the buildfarm says.

regards, tom lane




Re: simple patch for discussion

2025-07-23 Thread Greg Hennessy
On Tue, Jul 22, 2025 at 8:49 PM David Rowley  wrote:

> I don't think having a GUC which allows exactly two settings is
> anywhere near as flexible as you could make this.


I would phrase it as being a GUC with two settings currently;
I think it is easier to go from 2 possible algorithms to 3 than
it is to go from 1 to 2.


> You're complaining
> that the current calculation isn't ideal for you and are proposing a
> new one which seemingly is more to your liking, but who's to say that
> someone else would find either useful? With this method, if someone
> else comes along complaining, do we give them what they want by adding
> a new enum? That does not sound great :(
>

Adding a new enum to solve a perceived problem doesn't seem like a large
ask to me. If you think this situation isn't great, may I ask what you
would
consider great?

One thought I had was that you could adjust the "divide by 3" to be
> "divide by " and make that calculation floating point.
>

I would think that changing the hard-coded log3 to a more general
log(x), where x is controlled by a GUC, would be better handled by a separate
patch, but I have no objection to submitting a new patch that includes that.


> On rethink, maybe you can use the log() function to get something
> close to what's being generated today.


I think the ceil function needs to be in there to exactly match the existing
behavior, but I haven't verified that. I do agree that having the default
behavior not change unless set by an admin is a good idea.

Can other people give advice on whether adding a new algorithm to
calculate the parallel worker count and changing the scaling
from log3 of a ratio to the log of a GUC-controlled base is best handled
by one patch or two?

It's clear that if we make the logarithm base an adjustable
parameter we have to ensure it is not 1.0 or less, but
how much larger than 1.0 can we allow? My memory says
that the smallest floating point number larger than unity is 1+2**(-23).
I guess we can make the minimum allowed number another GUC. :)
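
A rough sketch of what I have in mind, assuming a new floating-point GUC (the
name parallel_worker_log_base and the clamping below are placeholders, and the
equivalence with today's behaviour at base 3.0 is only approximate):

#include <math.h>

/* size_ratio is relation size divided by min_parallel_table_scan_size */
static int
compute_workers_by_log_base(double size_ratio, double log_base)
{
	/* guard against bases at or below 1.0, where log() breaks down */
	if (log_base <= 1.0 + 1e-6)
		log_base = 3.0;			/* roughly today's scaling */

	if (size_ratio < 1.0)
		return 0;

	return (int) ceil(log(size_ratio) / log(log_base));
}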


Fixing MSVC's inability to detect elog(ERROR) does not return

2025-07-23 Thread David Rowley
(Moving this discussion to hackers. Previously in [0].)

Background:

The ereport_domain() macro has a pg_unreachable() which is meant to be
hit when the elevel >= ERROR so that the compiler can determine that
elog/ereport ERROR never returns. This works fine in gcc and clang but
MSVC seems to be unable to figure this out.  This causes us to have to
maintain hundreds of lines like "return NULL; /* keep compiler quiet
*/" to avoid warnings.

What's causing this?:

You might imagine that this happens because pg_unreachable() isn't
working for MSVC, but you'd be wrong. It's because of the
double-evaluation mitigation of elevel in the ereport_domain() macro.
We have:

const int elevel_ = (elevel);

and because later we do:

if (elevel_ >= ERROR) \
pg_unreachable(); \

the MSVC compiler can't figure out that the pg_unreachable() is always
hit because it sees elevel_ as a variable rather than a constant, even
when the macro's elevel parameter is a compile-time constant.

Fixing it:

Ideally, MSVC would have something like __builtin_constant_p(). Looking
around, I found [1], which hacks together a macro which does the same
job. The problem is, the macro uses _Generic(), which is C11. Tom
suggested we adjust the meson build scripts to make MSVC always build
with C11, which seems doable due to 8fd9bb1d9 making our minimum
Visual Studio version 2019, and going by [2], vs2019 supports C11.

So, I think that means we can adjust the meson build scripts to pass
/std:c11 when building in MSVC. The attached patch does this and
defines a pg_builtin_constant() macro and adjusts ereport_domain() to
use it.
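
For reference, the kind of macro involved looks roughly like this (a sketch of
the trick from [1]; the exact form in the attached patch may differ). For an
integer constant expression e, (void *) ((e) * 0l) is a null pointer constant,
so the conditional expression has type int *; for a non-constant e it has type
void *, and _Generic can tell the two apart at compile time:

/* evaluates to 1 if e is an integer constant expression, 0 otherwise */
#define pg_builtin_constant(e) \
	_Generic((1 ? ((void *) ((e) * 0l)) : (int *) 1), \
			 int *: 1, \
			 void *: 0)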

One thing I've not done yet is adjust all other usages of
__builtin_constant_p() to use pg_builtin_constant().

I did a quick check of the postgres.exe size with and without this
change. It does save about 13KB, but that seems to be entirely from
the /std:c11 flag. None is due to the compiler emitting less code
after elog(ERROR) / ereport(ERROR) :-(

postgres.exe size:
master: 10_143_232 bytes
patched: 10_129_920 bytes

Does anyone have any objections or comments to any of the above?

David

[0] 
https://postgr.es/m/caaphdvrfdxjbrv6kcx_ghkysufubndyssjppcjqigourfje...@mail.gmail.com
[1] 
https://stackoverflow.com/questions/49480442/detecting-integer-constant-expressions-in-macros
[2] 
https://devblogs.microsoft.com/cppblog/c11-and-c17-standard-support-arriving-in-msvc/


v3-0001-Detect-elog-ERROR-can-t-return-in-MSVC-when-using.patch
Description: Binary data


Re: index prefetching

2025-07-23 Thread Peter Geoghegan
On Wed, Jul 23, 2025 at 9:59 PM Peter Geoghegan  wrote:
> Tomas' index-prefetch-simple-master branch:
> │ I/O Timings: shared read=1490.918
> │ Execution Time: 2015.731 ms

> Complex patch (same prewarming/eviction are omitted this time):
> │ I/O Timings: shared read=138.856
> │ Execution Time: 768.454 ms

> I'm using direct IO in both cases. This can easily be repeated, and is stable.

Forgot to add context about the master branch: Master can do this in
2386.850 ms, with "I/O Timings: shared read=1825.161". That's with
buffered I/O (not direct I/O), and with the same pg_prewarm +
pg_buffercache_evict_relation function calls as before. I'm running
"echo 3 > /proc/sys/vm/drop_caches" to drop the filesystem cache here,
too (unlike when testing the patches, where my use of direct i/o makes
that step unnecessary).

In summary, the simple patch + direct I/O clearly beats the master
branch + buffered I/O here -- though not by much. While the complex
patch gets a far greater benefit.

-- 
Peter Geoghegan




Re: Regression with large XML data input

2025-07-23 Thread Tom Lane
Michael Paquier  writes:
> Are you planning to look at that for the next minor release?  It would
> take me a couple of hours to dig into all that, and I am sure that I
> am going to need your stamp or Erik's to avoid doing a stupid thing.

It was my commit, so my responsibility, so I'll deal with it.
Planning to wait for Erik's input though.

regards, tom lane




Re: IndexAmRoutine aminsertcleanup function can be NULL?

2025-07-23 Thread Michael Paquier
On Wed, Jul 23, 2025 at 12:07:56PM +0800, Japin Li wrote:
> PFA to assert all required IndexAM callbacks are present.

@@ -42,6 +42,19 @@ GetIndexAmRoutine(Oid amhandler)
 elog(ERROR, "index access method handler function %u did not return an 
IndexAmRoutine struct",
  amhandler);
 
+/* Assert that all required callbacks are present. */
+Assert(routine->ambuild != NULL);
+Assert(routine->ambuildempty != NULL);
+Assert(routine->aminsert != NULL);
+Assert(routine->ambulkdelete != NULL);
+Assert(routine->amvacuumcleanup != NULL);
+Assert(routine->amcostestimate != NULL);
+Assert(routine->amoptions != NULL);
+Assert(routine->amvalidate != NULL);
+Assert(routine->ambeginscan != NULL);
+Assert(routine->amrescan != NULL);
+Assert(routine->amendscan != NULL);

Oh.  I like that, and I would like to apply it on HEAD if there are no
objections. 
--
Michael




Re: Report bytes and transactions actually sent downtream

2025-07-23 Thread Ashutosh Bapat
Hi Amit,

On Mon, Jul 14, 2025 at 3:31 PM Amit Kapila  wrote:
>
> On Mon, Jul 14, 2025 at 10:55 AM Ashutosh Bapat
>  wrote:
> >
> > On Sun, Jul 13, 2025 at 4:34 PM Amit Kapila  wrote:
> > >
> > >
> > > I think we don't want to make it mandatory for plugins to implement
> > > these stats, so instead of throwing ERROR, the view should show that
> > > the plugin doesn't provide stats. How about having OutputPluginStats
> > > similar to OutputPluginCallbacks and OutputPluginOptions members in
> > > LogicalDecodingContext? It will have members like stats_available,
> > > txns_sent or txns_skipped, txns_filtered, etc.
> >
> > Not making mandatory looks useful. I can try your suggestion. Rather
> > than having stats_available as a member of OutputPluginStats, it's
> > better to have a NULL value for the corresponding member in
> > LogicalDecodingContext. We don't want an output plugin to reset
> > stats_available once set. Will that work?
> >
>
> We can try that.
>
> > > I am thinking it will
> > > be better to provide this information in a separate view like
> > > pg_stat_plugin_stats or something like that, here we can report
> > > slot_name, plugin_name, then the other stats we want to implement part
> > > of OutputPluginStats.
> >
> > As you have previously pointed out, the view should make it explicit
> > that the new stats are maintained by the plugin and not core. I agree
> > with that intention. However, already have three views
> > pg_replication_slots (which has slot name and plugin name), then
> > pg_replication_stats which is about stats maintained by a WAL sender
> > or running replication and then pg_stat_replication_slots, which is
> > about accumulated statistics for a replication through a given
> > replication slot. It's already a bit hard to keep track of who's who
> > when debugging an issue. Adding one more view will add to confusion.
> >
> > Instead of adding a new view how about
> > a. name the columns as plugin_sent_txns, plugin_sent_bytes,
> > plugin_filtered_change_bytes to make it clear that these columns are
> > maintained by plugin
> > b. report these NULL if stats_available = false OR OutputPluginStats
> > is not set in LogicalDecodingContext
> > c. Document that NULL value for these columns indicates that the
> > plugin is not maintaining/reporting these stats
> > d. adding plugin name to pg_stat_replication_slots, that will make it
> > easy for users to know which plugin they should look at in case of
> > dubious or unavailable stats
> >
>
> Sounds reasonable.

Here's the next patch which considers all the discussion so far. It
adds four fields to pg_stat_replication_slots.
- plugin - the name of the output plugin
- plugin_filtered_bytes - the amount of changes filtered out by the
output plugin
- plugin_sent_txns - the number of transactions sent downstream by
the output plugin
- plugin_sent_bytes - the amount of data sent downstream by the
output plugin.
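
For reference, a rough sketch of the counters involved (field names here are
illustrative, not necessarily the exact definitions in the patch); a plugin
that never fills them in simply leaves the corresponding view columns NULL:

typedef struct OutputPluginStats
{
	int64		filtered_bytes; /* change bytes filtered out by the plugin */
	int64		sent_txns;		/* transactions actually sent downstream */
	int64		sent_bytes;		/* bytes actually sent downstream */
} OutputPluginStats;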

There are some points up for a discussion:
1. pg_stat_reset_replication_slot() zeroes out the statistics entry by
calling pgstat_reset() or pgstat_reset_of_kind(), which don't know
about the contents of the entry. So
PgStat_StatReplSlotEntry::plugin_has_stats is set to false and the plugin
stats are reported as NULL, instead of zero, immediately after a reset.
This is the same as when the stats are queried immediately after the
statistics entry is initialized and before any stats are reported. We could
instead make it report zero if we saved plugin_has_stats and restored it
after the reset. But doing that in pgstat_reset_of_kind() seems like extra
overhead, plus we would need to write a function to find all replication
slot entries. Resetting the stats would be a rare event in practice, and
trying to report 0 instead of NULL in that rare case doesn't seem to be
worth the effort and code. Given that the core code doesn't know whether a
given plugin reports stats or not, I think this behaviour is appropriate as
long as we document it. Please let me know if the documentation in the
patch is clear enough.

2. There's also a bit of asymmetry in the way sent_bytes is handled.
The code which actually sends the logical changes downstream is part of
the core code, but the format of the change, and hence the number of
bytes sent, is decided by the plugin. It's a stat related to the plugin
but maintained by the core code. The patch implements it as a plugin
stat (so the corresponding column has the "plugin" prefix and is also
reported as NULL upon reset, etc.), but we may want to reconsider how to
report and maintain it.

3. The names of new columns have the prefix "plugin_" but the internal
variables tracking those don't for the sake of brevity. If you prefer
to have the same prefix for the internal variables, I can change that.

I think I have covered all the cases where filteredbytes should be
incremented, but please let me know if I have missed any.

-- 
Best Wishes,
Ashutosh Bapat
From 0bdaca3d1bb95cb834d45137eddaaa58fe1646b9 Mon Sep 17 00

Re: Regression with large XML data input

2025-07-23 Thread Tom Lane
Michael Paquier  writes:
> A customer has reported a regression with the parsing of rather large
> XML data, introduced by the set of backpatches done with f68d6aabb7e2
> & friends.

Bleah.

> Switching back to the previous code, where we rely on
> xmlParseBalancedChunkMemory() fixes the issue.

Yeah, just reverting these commits might be an acceptable answer,
since the main point was to work around a bleeding-edge bug:

>> * Early 2.13.x releases of libxml2 contain a bug that causes
>> xmlParseBalancedChunkMemory to return the wrong status value in some
>> cases.  This breaks our regression tests.  While that bug is now fixed
>> upstream and will probably never be seen in any production-oriented
>> distro, it is currently a problem on some more-bleeding-edge-friendly
>> platforms.

Presumably that problem is now gone, a year later.  The other point
about

>> * xmlParseBalancedChunkMemory is considered to depend on libxml2's
>> semi-deprecated SAX1 APIs, and will go away when and if they do.

is still hypothetical I think.  But we might want to keep this bit:

>> While here, avoid allocating an xmlParserCtxt in DOCUMENT parse mode,
>> since that code path is not going to use it.

regards, tom lane




Re: Conflict detection for update_deleted in logical replication

2025-07-23 Thread shveta malik
On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, July 23, 2025 12:08 PM Amit Kapila  
> wrote:
> >
> > On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada
> >  wrote:
> > >
> > > I've reviewed the 0001 patch and it looks good to me.
> > >
> >
> > Thanks, I have pushed the 0001 patch.
>
> Thanks for pushing. I have rebased the remaining patches.
>
> I have reordered the patches to prioritize the detection of update_deleted as
> the initial patch. This can give us more time to consider the new GUC, since 
> the
> performance-related aspects have been documented.
>

Thank You for patches.  Please find a few comments for 0001:

1)
+   if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf)
== HEAPTUPLE_RECENTLY_DEAD)
+   dead = true;

Shall we name the variable 'recently_dead', as 'dead' could also mean
HEAPTUPLE_DEAD?

2)
+   if (MySubscription->retaindeadtuples &&
+   FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
+
  &conflicttuple.xmin,
+
  &conflicttuple.origin,
+
  &conflicttuple.ts) &&
+   conflicttuple.origin != replorigin_session_origin)
+   type = CT_UPDATE_DELETED;
+   else
+   type = CT_UPDATE_MISSING;

Shall the conflict be detected as update_deleted irrespective of origin?

3)
+   /*
+* We do not consider HEAPTUPLE_DEAD status because it indicates
+* either tuples whose inserting transaction was
aborted, meaning
+* there is no commit timestamp or origin, or tuples
deleted by a
+* transaction older than oldestXmin, making it safe
to ignore them
+* during conflict detection (See comments atop
+* maybe_advance_nonremovable_xid() for details).
+*/

a) We can use parentheses for the phrase below, otherwise the sentence
becomes confusing due to multiple 'or's combined with 'either':
 (meaning there is no commit timestamp or origin)

b) we can change last line to: See comments atop worker.c for details

4)
+  
+   The tuple to be updated was deleted by another origin. The update will
+   simply be skipped in this scenario.
+   Note that this conflict can only be detected when
+   
+   and retain_dead_tuples
+   are enabled. Note that if a tuple cannot be found due to the table being
+   truncated only a update_missing conflict will
+   arise
+  

4a)
Can we please make the link as:
track_commit_timestamp

This is because the current usage of 'xref linkend' gives it a
slightly bigger font in the HTML file, while all other references to this
GUC use the normal font.

4b) We need a comma after truncated:
Note that if a tuple cannot be found due to the table being truncated,
only a update_missing conflict will arise.

5)
monitoring.sgml:
+  
+   Number of times the tuple to be updated was deleted by another origin
+   during the application of changes. See 
+   for details about this conflict.
+  

Here we are using the term 'by another origin', while in the rest of
the doc (see confl_update_origin_differs, confl_delete_origin_differs)
we use the term 'by another source'. Shall we keep it the same?
OTOH, I think using 'origin' is better but the rest of the page  is
using source. So perhaps changing source to origin everywhere is
better. Thoughts?
This can be changed if needed once we decide on point 2 above.

thanks
Shveta




Re: Pathify RHS unique-ification for semijoin planning

2025-07-23 Thread Richard Guo
On Wed, Jul 23, 2025 at 5:11 PM Álvaro Herrera  wrote:
> As a very trivial test on this patch, I ran the query in your opening
> email, both with and without the patch, scaling up the size of the table
> a little bit.

> This is a really nice improvement.  I think we could find queries that
> are arbitrarily faster, by feeding enough tuples to the unnecessary Sort
> nodes.

Thank you for testing this patch!

In addition to eliminating unnecessary sort nodes, this patch also
allows us to exploit pre-sorted paths that aren't necessarily the
cheapest total during the unique-ification step.  It also allows the
use of parallel plans for that step on large tables.  I think we could
also find queries that become faster as a result of these improvements.

Thanks
Richard




Re: Regression with large XML data input

2025-07-23 Thread Michael Paquier
On Wed, Jul 23, 2025 at 11:28:38PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> Switching back to the previous code, where we rely on
>> xmlParseBalancedChunkMemory() fixes the issue.
> 
> Yeah, just reverting these commits might be an acceptable answer,
> since the main point was to work around a bleeding-edge bug:

Still it is not possible to do exactly that on all the branches
because of the business with XMLSERIALIZE that requires some options
for xmlParseInNodeContext(), is it?

>>> * Early 2.13.x releases of libxml2 contain a bug that causes
>>> xmlParseBalancedChunkMemory to return the wrong status value in some
>>> cases.  This breaks our regression tests.  While that bug is now fixed
>>> upstream and will probably never be seen in any production-oriented
>>> distro, it is currently a problem on some more-bleeding-edge-friendly
>>> platforms.
> 
> Presumably that problem is now gone, a year later.  The other point
> about

I would probably agree that it does not seem worth caring for this
range in the early 2.13 series.  I didn't mention it upthread but all
my tests were with debian GID's libxml2 which seems to be a 2.12.7
flavor with some 2.9.14 pieces, based on what apt is telling me.  I
did not test with a different version from upstream, but I'm pretty
sure that's the same story.

>>> * xmlParseBalancedChunkMemory is considered to depend on libxml2's
>>> semi-deprecated SAX1 APIs, and will go away when and if they do.
> 
> is still hypothetical I think.  But we might want to keep this bit:

Worth mentioning upstream 4f329dc52490, I guess, added to the 2.14
branch:
parser: Implement xmlCtxtParseContent

This implements xmlCtxtParseContent, a better alternative to
xmlParseInNodeContext or xmlParseBalancedChunkMemory. It accepts a
parser context and a parser input, making it a lot more versatile.

With all our stable branches, I am not sure if this should be
considered, but that seems worth keeping in mind.

>>> While here, avoid allocating an xmlParserCtxt in DOCUMENT parse mode,
>>> since that code path is not going to use it.

Are you planning to look at that for the next minor release?  It would
take me a couple of hours to dig into all that, and I am sure that I
am going to need your stamp or Erik's to avoid doing a stupid thing.
--
Michael


signature.asc
Description: PGP signature


Re: magical eref alias names

2025-07-23 Thread Tom Lane
[ returning to this thread now that v19 is open for business ]

Robert Haas  writes:
> I rediscovered, or re-encountered, this problem today, which motivated
> me to have a closer look at your (Tom's) patch. My feeling is that
> it's the right approach. I agree that we could try to keep the current
> generated names by extending addRangeTableEntryForSubquery, but I'm
> tentatively inclined to think that we shouldn't.

That's fine by me.  I'm personally content with all of the changes
shown in your patchset.

> However, I did come across one other mildly interesting case.
> expand_single_inheritance_child has this:
> ...
> What I find curious about this is that we're assigning the parent's
> eref to both the child's eref and the child's alias. Maybe there's
> something I don't understand here, or maybe it just doesn't matter,
> but why wouldn't we assign eref to eref and alias to alias? Or even
> alias to alias and generate a new eref?

The issue is explained in the previous comment block a few lines up:

 * Construct an alias clause for the child, which we can also use as eref.
 * This is important so that EXPLAIN will print the right column aliases
 * for child-table columns.  (Since ruleutils.c doesn't have any easy way
 * to reassociate parent and child columns, we must get the child column
 * aliases right to start with.  Note that setting childrte->alias forces
 * ruleutils.c to use these column names, which it otherwise would not.)

I think the case this is worried about is that if you have a parent
table t(a,b,c), and the query writes say "t as t1(x)", then t's column
"a" will be labeled "x" by EXPLAIN and we want the child's column "a"
to similarly be labeled "x".

So unless we want to start rejiggering ruleutils' already-overly-
complex behavior, we have to put something in childrte->alias, even
if the parent had no alias.  So that's a violation of the principle
you were hoping to establish, but as long as it only applies to
partitions and inheritance children, I'm not sure it's worth moving
heaven and earth to make it different.

We could certainly do something a little different than what the code
is doing, such as "if the parent does have a user-written alias, use
parentrte->alias->aliasname not parentrte->eref->aliasname for the
childrte->alias->aliasname".  I'm not sure it's worth bothering with
that, though.

I noticed that the patchset was failing in cfbot because we've since
grown another regression test case whose output is affected by 0002.
So here's a v3 that incorporates that additional change.  I did a
little bit of wordsmithing on the commit messages too, but the code
changes are identical to v2.

regards, tom lane

From ab46f16a15464e9ba3fae5008ef533e0a6a28303 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Wed, 23 Jul 2025 17:04:45 -0400
Subject: [PATCH v3 1/3] Don't generate fake "*SELECT*" or "*SELECT* %d"
 subquery aliases.

rte->alias should point only to a user-written alias, but in these
cases that principle was violated. Fixing this causes some regression
test output changes: wherever rte->alias previously had a value and
is now NULL, rte->eref is now set to a generated name rather than to
rte->alias; and the scheme used to generate eref names differs from
what we were doing for aliases.

The upshot is that instead of "*SELECT*" or "*SELECT* %d",
EXPLAIN will now emit "unnamed_subquery" or "unnamed_subquery_%d".
But that's a reasonable descriptor, and we were already producing
that in yet other cases, so this seems not too objectionable.

Author: Tom Lane 
Co-authored-by: Robert Haas 
Discussion: https://postgr.es/m/ca+tgmoysymda2gvanzpmci084n+mvucv0bj0hpbs6uhmmn6...@mail.gmail.com
---
 contrib/postgres_fdw/expected/postgres_fdw.out |  8 
 src/backend/executor/functions.c   |  2 +-
 src/backend/parser/analyze.c   |  7 ++-
 src/test/regress/expected/partition_prune.out  |  4 ++--
 src/test/regress/expected/rangefuncs.out   |  8 
 src/test/regress/expected/union.out| 14 +++---
 6 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4b6e49a5d95..c492ba38c58 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5098,13 +5098,13 @@ SELECT ft1.c1 FROM ft1 JOIN ft2 on ft1.c1 = ft2.c1 WHERE
 -- ===
 EXPLAIN (verbose, costs off)
 INSERT INTO ft2 (c1,c2,c3) SELECT c1+1000,c2+100, c3 || c3 FROM ft2 LIMIT 20;
-  QUERY PLAN  

Re: [WIP PATCH v2] Implement "pg_restore --data-only --clean" as a way to skip WAL

2025-07-23 Thread Dimitrios Apostolou

On Friday 2025-06-13 03:42, Greg Sabino Mullane wrote:



I think the overall idea is sound. But we need a better solution for the 
truncate fk failure. Can we introspect somehow and do a truncate or do a delete 
as necessary? I don't like the idea of simply ignoring the constraint, or of 
throwing an error.


Sorry I haven't answered, but I don't really know how to catch the error 
of TRUNCATE in the C code and do a DELETE instead. I was hoping someone 
would chime in and suggest an approach.


Furthermore, silently replacing the TRUNCATE with DELETE will negate all 
the performance benefits we are aiming for with this patch (avoiding WAL 
logging).


In pg_restore AFAICT the suggested way to handle similar FK errors is to 
add another option: --disable-triggers. This works for other operations 
besides TRUNCATE, and one can break referential integrity in order to 
speed up loading of data.


As such, I think the proper solution would be to change TRUNCATE 
implementation in the backend, in order to ignore FK constraints when 
the triggers are off. How correct do you think this is? And any ideas on 
how to approach the implementation?
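
For illustration only, the error-catching idea Greg raised could look roughly
like the sketch below at the plain libpq level. This is not pg_restore code
(pg_restore routes SQL through its own helpers, and inside a single
transaction the fallback would additionally need a SAVEPOINT, since a failed
TRUNCATE aborts the transaction); the table name and connection handling are
placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

/*
 * Sketch only: try TRUNCATE first and fall back to DELETE when the server
 * rejects it, e.g. because the table is referenced by a foreign key.
 */
static void
clear_table(PGconn *conn, const char *qualified_name)
{
	char		sql[1024];
	PGresult   *res;

	snprintf(sql, sizeof(sql), "TRUNCATE TABLE ONLY %s", qualified_name);
	res = PQexec(conn, sql);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "TRUNCATE failed (%s), falling back to DELETE\n",
				PQerrorMessage(conn));
		PQclear(res);

		snprintf(sql, sizeof(sql), "DELETE FROM ONLY %s", qualified_name);
		res = PQexec(conn, sql);
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
		{
			fprintf(stderr, "DELETE failed too: %s", PQerrorMessage(conn));
			PQclear(res);
			exit(1);
		}
	}
	PQclear(res);
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* parameters from environment */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}
	clear_table(conn, "public.some_table");	/* placeholder table name */
	PQfinish(conn);
	return 0;
}

Whether such a silent fallback is desirable at all is a separate question,
given the performance argument above.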


Thanks,
Dimitris


Re: trivial grammar refactor

2025-07-23 Thread Andres Freund
Hi,

On 2025-07-23 19:59:52 +0200, Álvaro Herrera wrote:
> ... so using the same set of productions, I can rewrite the current
> CLUSTER rule in this way and it won't be a problem for the REPACK
> changes.

But it comes at the price of henceforth duplicating all ClusterStmt, once for
VERBOSE and once without. That's not exactly a free lunch...

Greetings,

Andres Freund




Re: simple patch for discussion

2025-07-23 Thread David Rowley
On Thu, 24 Jul 2025 at 08:52, Greg Hennessy  wrote:
> Adding a new enum to solve a perceived problem doesn't seem like a large
> ask to me.

Seems overly slow to me. As someone has to write the patch, a
committer has to commit said patch, the user must wait for the next
major version to be released before upgrading their database to that
version. That's 6 to 15 months at best.

> Can other people give advice on if adding a new algorithm to
> calculate parallel worker number and change the scaling
> from log3 of a ratio to log of a GUC is best taken care of
> by one patch or two?

Others might be able to chime in if you gave a summary of what those
patches were. I assume you want an enum GUC for the algorithm and some
scale GUC?

> Its clear that if we make the logarithm base an adjustable
> parameter we have to insure it is not 1.0 or less, but
> how much larger than 1.0 can we allow? My memory says
> that the smallest floating point number larger than unity is 1+2**(-23).
> I guess we can make the minimum allowed number another GUC. :)

The range could be 1.0 to maybe 100.0 or 1000.0. If the number is too
close to 1.0 then just have compute_parallel_worker() return
max_workers. You could document 1.0 to mean "limit to
max_parallel_maintenance_workers / max_parallel_workers_per_gather".
The function won't return a value higher than that anyway.
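
To make the clamping arithmetic concrete, here is a small standalone sketch
of what such a computation could look like; the scale/base knob and the exact
formula are only illustrative of the idea discussed above, not an existing
GUC or the current compute_parallel_worker() code (compile with -lm):

#include <math.h>
#include <stdio.h>

/*
 * Sketch: derive a worker count from a "pages / threshold" ratio using a
 * configurable logarithm base.  Bases at (or extremely close to) 1.0 would
 * blow up the logarithm, so treat them as "use as many workers as allowed".
 */
static int
workers_for_ratio(double ratio, double log_base, int max_workers)
{
	int			workers;

	if (ratio < 1.0)
		return 0;				/* below the parallel threshold */
	if (log_base <= 1.0 + 1e-6)
		return max_workers;

	workers = (int) (log(ratio) / log(log_base)) + 1;
	if (workers > max_workers)
		workers = max_workers;
	return workers;
}

int
main(void)
{
	/* e.g. a table 1000x larger than the scan-size threshold */
	printf("base 3.0: %d\n", workers_for_ratio(1000.0, 3.0, 10));	/* 7 */
	printf("base 1.5: %d\n", workers_for_ratio(1000.0, 1.5, 10));	/* capped at 10 */
	printf("base 1.0: %d\n", workers_for_ratio(1000.0, 1.0, 10));	/* treated as max */
	return 0;
}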

David




Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

2025-07-23 Thread Dimitrios Apostolou
Hi Nathan, I've noticed you've set yourself as a reviewer of this patch 
in the commitfest. I appreciate it, but you might want to combine it 
with another simple patch [1] that speeds up the same part of 
pg_restore: the initial full scan on TOC-less archives.


[1] https://commitfest.postgresql.org/patch/5817/


On Saturday 2025-06-14 00:04, Nathan Bossart wrote:


On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:

By the way, I might have set the threshold to 1MB in my program, but
lowering it won't show a difference in my test case, since the lseek()s I
was noticing before the patch were mostly 8-16KB forward. Not sure what is
the defining factor for that. Maybe the compression algorithm, or how wide
the table is?


I may have missed it, but could you share what the strace looks like with
the patch applied?


I hope you've seen my response here, with special focus on the small 
block size that both compressed and uncompressed custom format archives 
have.


I have been needing to pg_restore 10TB TOC-less dumps recently, and it's 
a pain to do the full scan, even with both of my patches applied.


Maybe the block size could be a command line option of pg_dump, so that 
one could set it to sizes like 100MB, which sounds like a normal block 
from the perspective of a 10TB gigantic dump.


Regards,
Dimitris





Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions

2025-07-23 Thread jian he
On Tue, Jul 22, 2025 at 8:26 PM Vik Fearing  wrote:
>
> The actual syntax is:
>
>
>  ::=
>  CAST 
>   AS 
>  [ FORMAT  ]
>  [  ON CONVERSION ERROR ]
>  
>
>
> "CONVERSION" is probably a noise word, but it is there because A) Oracle
> wanted it there, and B) it makes sense because if the  behavior> fails, that is still a failure of the entire CAST.
>
>
> The  is:
>
>  ::=
>  ERROR
>| NULL
>| DEFAULT 
>
>
> but I am planning on removing the NULL variant in favor of having the
>  be a .  So it
> would be either ERROR ON CONVERSION ERROR (postgres's current behavior),
> or DEFAULT NULL ON CONVERSION ERROR.
>
>
> An example of B) above would be: CAST('five' AS INTEGER DEFAULT 'six' ON
> CONVERSION ERROR).  'six' is no more an integer than 'five' is, so that
> would error out because the conversion error does not happen on the
> operand but on the default clause. CAST('five' AS INTEGER DEFAULT 6 ON
> CONVERSION ERROR) would work.
>

hi.

>  ::=
>  ERROR
>| NULL
>| DEFAULT 

for 
I disallow it from returning a set, or using aggregate or window functions.
For example, the following three cases will fail:

+SELECT CAST('a' as int DEFAULT sum(1) ON CONVERSION ERROR); --error
+SELECT CAST('a' as int DEFAULT sum(1) over() ON CONVERSION ERROR); --error
+SELECT CAST('a' as int DEFAULT ret_setint() ON CONVERSION ERROR) --error
(the ret_setint function is wrapped as (select 1 union all select 2))

for array coerce, which you already mentioned, i think the following
is what we expected.
+SELECT CAST('{234,def,567}'::text[] AS integer[] DEFAULT '{-1011}' ON
CONVERSION ERROR);
+  int4
+-
+ {-1011}
+(1 row)

I didn't implement the [ FORMAT  ] part for now.
Please check the attached regression test and its expected results.
From 47c181eee593468c3d7b7cb57aec3a1ea8cb3c1d Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 18 Jul 2025 13:00:19 +0800
Subject: [PATCH v2 1/2] make ArrayCoerceExpr error safe

similar to https://git.postgresql.org/cgit/postgresql.git/commit/?id=aaaf9449ec6be62cb0d30ed3588dc384f56274bf
---
 src/backend/executor/execExpr.c   | 3 +++
 src/backend/executor/execExprInterp.c | 4 
 src/backend/utils/adt/arrayfuncs.c| 7 +++
 3 files changed, 14 insertions(+)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..1f3f899874f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1702,6 +1702,9 @@ ExecInitExprRec(Expr *node, ExprState *state,
 elemstate->innermost_caseval = (Datum *) palloc(sizeof(Datum));
 elemstate->innermost_casenull = (bool *) palloc(sizeof(bool));
 
+if (state->escontext != NULL)
+	elemstate->escontext = state->escontext;
+
 ExecInitExprRec(acoerce->elemexpr, elemstate,
 &elemstate->resvalue, &elemstate->resnull);
 
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 1a37737d4a2..81e46cff725 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -3644,6 +3644,10 @@ ExecEvalArrayCoerce(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
 			  econtext,
 			  op->d.arraycoerce.resultelemtype,
 			  op->d.arraycoerce.amstate);
+
+	if (SOFT_ERROR_OCCURRED(op->d.arraycoerce.elemexprstate->escontext)
+		&& (*op->resvalue = (Datum) 0))
+		*op->resnull = true;
 }
 
 /*
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index c8f53c6fbe7..b5f98bf22f9 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -3288,6 +3288,13 @@ array_map(Datum arrayd,
 		/* Apply the given expression to source element */
 		values[i] = ExecEvalExpr(exprstate, econtext, &nulls[i]);
 
+		if (SOFT_ERROR_OCCURRED(exprstate->escontext))
+		{
+			pfree(values);
+			pfree(nulls);
+			return (Datum) 0;
+		}
+
 		if (nulls[i])
 			hasnulls = true;
 		else
-- 
2.34.1

From 7ec9361ba5d6225d6adbda28ae218d82b4a74b7a Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 24 Jul 2025 09:10:35 +0800
Subject: [PATCH v2 2/2] CAST(expr AS newtype DEFAULT ON ERROR)

Type cast nodes are generally one of: FuncExpr, CoerceViaIO, or ArrayCoerceExpr;
see build_coercion_expression.  We've already made CoerceViaIO error safe[0]; we
just need to teach it to evaluate default expression after coercion failures.
ArrayCoerceExpr is also handling errors safely.
However, FuncExpr can't be made error-safe directly (see example like int84), so
instead, we construct a CoerceViaIO node using the source expression as its
argument.

[0]: https://git.postgresql.org/cgit/postgresql.git/commit/?id=aaaf9449ec6be62cb0d30ed3588dc384f56274bf
discussion: https://postgr.es/m/CADkLM=fv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ@mail.gmail.com

demo:
SELECT CAST('1' AS date  DEFAULT '2011-01-01' ON ERROR),
   CAST('{234,def,567}'::text[] AS integer[] DEFAULT '{-1011}' ON ERROR);

date|  int4
+--

Re: [PATCH] Proposal to Enable/Disable Index using ALTER INDEX

2025-07-23 Thread Robert Treat
On Tue, Jul 22, 2025 at 6:50 PM David Rowley  wrote:
> On Tue, 22 Jul 2025 at 05:16, Sami Imseih  wrote:
> > Also, I'd like to ask. what would be the argument against offering both 
> > options,
> > ALTER and a GUC to override the catalog, as currently proposed in the patch?
>
> For me, the reason I don't like ALTER TABLE + the use_invisible_index
> / force_invisible_index (the v18 patch seems to be confused about the
> name of that GUC) is because it puts into question what "invisible"
> means. It's going to be a pretty useless feature for use cases where a
> DBA wants to ensure a certain index is *never* used, but does not want
> to drop it. A DBA might want to disable a certain index to investigate
> certain forms of index corruption and it might not be good if people
> can just overwrite that to bypass the DBA's choice.
>

Thanks for elaborating on this, you said it better than me.

So I'll note that in my proposal, the hypothetical catalog update
("alter index set guc" or whatever) is a one-way door; if the DBA (or
whomever) sets it, then the index is ignored everywhere, since the
session-level GUC can only additionally suggest that the index be
ignored during planning. That is enough to allow people to slowly roll
out OR slowly roll in new indexes, as needed, which I think covers
enough ground without the complexity going too far (which your example
below clearly shows is possible).

> It might be a slightly more flexible feature if there were 3 possible
> states and one of those states could be clearly defined to mean that
> users can overwrite the disabledness of all indexes by setting a GUC.
> I'm still struggling to like that, however.
>
> Now wondering if it would be better to spend the effort looking at
> pg_hint_plan and seeing how hard it would be to get global hints added
> which are applied to all queries, and then add a way to disable use of
> a named index. (I don't have any experience with that extension other
> than looking at the documentation)
>

I'd be interested in your evaluation of this, but the GUC I've
outlined would accomplish most of the use cases here automagically.


Robert Treat
https://xzilla.net




Re: index prefetching

2025-07-23 Thread Peter Geoghegan
On Wed, Jul 23, 2025 at 12:36 PM Peter Geoghegan  wrote:
> * The TPC-C order line table primary key.

I tested this for myself.

Tomas' index-prefetch-simple-master branch:

set max_parallel_workers_per_gather =0;
SELECT pg_buffercache_evict_relation('order_line');
select pg_prewarm('order_line_pkey');

:ea select sum(ol_amount) from order_line where ol_w_id < 10;
                                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=264259.55..264259.56 rows=1 width=32) (actual time=2015.711..2015.712 rows=1.00 loops=1)
   Output: sum(ol_amount)
   Buffers: shared hit=17815 read=33855
   I/O Timings: shared read=1490.918
   ->  Index Scan using order_line_pkey on public.order_line  (cost=0.56..257361.93 rows=2759049 width=4) (actual time=7.936..1768.236 rows=2700116.00 loops=1)
         Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d, ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info
         Index Cond: (order_line.ol_w_id < 10)
         Index Searches: 1
         Index Prefetch: true
         Index Distance: 110.7
         Buffers: shared hit=17815 read=33855
         I/O Timings: shared read=1490.918
 Planning Time: 0.049 ms
 Serialization: time=0.003 ms  output=1kB  format=text
 Execution Time: 2015.731 ms
(15 rows)

Complex patch (same prewarming/eviction are omitted this time):

:ea select sum(ol_amount) from order_line where ol_w_id < 10;
                                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=264259.55..264259.56 rows=1 width=32) (actual time=768.387..768.388 rows=1.00 loops=1)
   Output: sum(ol_amount)
   Buffers: shared hit=17815 read=33855
   I/O Timings: shared read=138.856
   ->  Index Scan using order_line_pkey on public.order_line  (cost=0.56..257361.93 rows=2759049 width=4) (actual time=7.956..493.694 rows=2700116.00 loops=1)
         Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d, ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info
         Index Cond: (order_line.ol_w_id < 10)
         Index Searches: 1
         Buffers: shared hit=17815 read=33855
         I/O Timings: shared read=138.856
 Planning Time: 0.043 ms
 Serialization: time=0.003 ms  output=1kB  format=text
 Execution Time: 768.454 ms
(13 rows)

I'm using direct IO in both cases. This can easily be repeated, and is stable.

To be fair, the planner wants to use a parallel index scan for this.
If I allow the scan to be parallel, 5 parallel workers are used. The
simple patch now takes 295.722 ms, while the complex patch takes
301.875 ms. I imagine that that's because the use of parallelism
eliminates the natural advantage that the complex patch has with this
workload/index -- the scan as a whole is presumably no longer
bottlenecked on physical index characteristics. The parallel workers
can almost behave as 5 independent scans, all kept sufficiently busy,
even without our having to read ahead to later leaf pages.

It's possible that something weird is going on with the prefetch
distance, in the context of parallel scans specifically -- it's not
like we've really tested parallel scans just yet (with either patch).
Even if there is an addressable problem in either patch here, I'd be
surprised if it was the main factor behind the simple patch doing
relatively well when scanning in parallel like this.

-- 
Peter Geoghegan


Re: Remaining dependency on setlocale()

2025-07-23 Thread Jeff Davis
On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
> On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis  wrote:
> > I don't have a great windows development environment, and it
> > appears CI
> > and the buildfarm don't offer great coverage either. Can I ask for
> > a
> > volunteer to do the windows side of this work?
> 
> Me neither but I'm willing to help with that, and have done lots of
> closely related things through trial-by-CI...

Attached a patch to separate the message translation (both gettext and
strerror translations) from setlocale(). That's a step towards thread
safety, and also a step toward setting LC_CTYPE=C permanently (more
work still required there).

The patch feels a bit over-engineered, but I'd like to know what you
think. It would be great if you could test/debug the windows NLS-
enabled paths.

I'm also not sure what to do about the NetBSD path. NetBSD has no
uselocale(), so I have to fall back to a temporary setlocale(), which is
not thread safe. And I'm getting a mysterious error in test_aio for
NetBSD, which I haven't investigated yet.

Regards,
Jeff Davis

From 8bd59e7d52351fadeb4fe26023aa0ff57735e03f Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 18 Jul 2025 14:06:45 -0700
Subject: [PATCH v5] Create wrapper for managing NLS locale.

Message translation depends on LC_CTYPE and LC_MESSAGES. Use wrapper
functions to control those settings rather than relying on the
permanent setlocale() settings.

Improves thread safety by using "_l()" variants of functions or
uselocale() where available. On windows, setlocale() can be made
thread-safe. There is still at least one platform (NetBSD) where none
of those options are available, in which case it still depends on
thread-unsafe setlocale().

Also separates message translation behavior from other, unrelated
behaviors like tolower().

Discussion: https://postgr.es/m/f040113cf384ada69558ec004a04a3ddb3e40a26.ca...@j-davis.com
---
 configure.ac  |   2 +
 meson.build   |   2 +
 src/backend/main/main.c   |  13 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/meson.build |   1 +
 src/backend/utils/adt/pg_locale.c |  39 +--
 src/backend/utils/adt/pg_nls.c| 417 ++
 src/backend/utils/init/postinit.c |   4 +
 src/include/c.h   |  19 +-
 src/include/pg_config.h.in|   8 +
 src/include/port.h|  10 +
 src/include/utils/pg_nls.h|  29 +++
 src/tools/pg_bsd_indent/err.c |   2 +
 13 files changed, 524 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/utils/adt/pg_nls.c
 create mode 100644 src/include/utils/pg_nls.h

diff --git a/configure.ac b/configure.ac
index c2877e36935..493014302cd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1791,6 +1791,8 @@ AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
 	copy_file_range
+	dgettext_l
+	dngettext_l
 	elf_aux_info
 	getauxval
 	getifaddrs
diff --git a/meson.build b/meson.build
index 5365aaf95e6..d3c285b2d54 100644
--- a/meson.build
+++ b/meson.build
@@ -2882,6 +2882,8 @@ func_checks = [
   # when enabling asan the dlopen check doesn't notice that -ldl is actually
   # required. Just checking for dlsym() ought to suffice.
   ['dlsym', {'dependencies': [dl_dep], 'define': false}],
+  ['dgettext_l'],
+  ['dngettext_l'],
   ['elf_aux_info'],
   ['explicit_bzero'],
   ['getauxval'],
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index bdcb5e4f261..fbef0245b28 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -38,6 +38,7 @@
 #include "utils/help_config.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_nls.h"
 #include "utils/ps_status.h"
 
 
@@ -139,14 +140,16 @@ main(int argc, char *argv[])
 	init_locale("LC_CTYPE", LC_CTYPE, "");
 
 	/*
-	 * LC_MESSAGES will get set later during GUC option processing, but we set
-	 * it here to allow startup error messages to be localized.
+	 * Initialize NLS locale's LC_CTYPE and LC_MESSAGES from the environment.
+	 * It will be updated later during GUC option processing, but we set it
+	 * here to allow startup error messages to be localized.
 	 */
-#ifdef LC_MESSAGES
-	init_locale("LC_MESSAGES", LC_MESSAGES, "");
-#endif
+	pg_nls_set_locale("", "");
 
 	/* We keep these set to "C" always.  See pg_locale.c for explanation. */
+#ifdef LC_MESSAGES
+	init_locale("LC_MESSAGES", LC_MESSAGES, "C");
+#endif
 	init_locale("LC_MONETARY", LC_MONETARY, "C");
 	init_locale("LC_NUMERIC", LC_NUMERIC, "C");
 	init_locale("LC_TIME", LC_TIME, "C");
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index ffeacf2b819..38e395b7de9 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -84,6 +84,7 @@ OBJS = \
 	pg_locale_icu.o \
 	pg_locale_libc.o \
 	pg_lsn.o \
+	pg_nls.o \
 	pg_upgrade_support.o \
 	pgstatfuncs.o \
 	pseudorandomfuncs.o \
diff --git a/src/backend/utils/adt/meson.build b/src/back

Re: [PATCH] Proposal to Enable/Disable Index using ALTER INDEX

2025-07-23 Thread Michael Paquier
On Tue, Jul 22, 2025 at 06:04:50PM -0500, Sami Imseih wrote:
> Regardless of what comes out of this thread, pg_hint_plan supporting
> a NoIndexScan hint that takes in an index name, and not only a relname
> is needed. I plan on taking that up there.

Patches are welcome upstream.
--
Michael


signature.asc
Description: PGP signature


Re: Fixing MSVC's inability to detect elog(ERROR) does not return

2025-07-23 Thread David Rowley
On Thu, 24 Jul 2025 at 12:27, Tom Lane  wrote:
> Hmmm ... but you did check that in fact we can remove such known-dead
> code and not get a warning now?

Yes. The patch has a small temporary adjustment to
BaseBackupGetTargetHandle() to comment out the return. It compiles for
me using Visual Studio 2022 without any warnings. If I remove the
macro change, I get:

[598/2255] Compiling C object
src/backend/postgres_lib.a.p/backup_basebackup_target.c.obj
src\backend\backup\basebackup_target.c(150) : warning C4715:
'BaseBackupGetTargetHandle': not all control paths return a value
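
For anyone following along, the usual way to convince a compiler that a code
path cannot be reached -- and thereby silence warnings like C4715 -- is a
hint of the general shape sketched below. This is only the generic pattern;
it does not claim to be exactly what the macro change tested here does.

#include <stdio.h>
#include <stdlib.h>

/*
 * Generic "unreachable" hint: MSVC understands __assume(0), GCC/Clang
 * provide __builtin_unreachable(); fall back to abort() elsewhere.
 */
#if defined(_MSC_VER)
#define my_unreachable()	__assume(0)
#elif defined(__GNUC__) || defined(__clang__)
#define my_unreachable()	__builtin_unreachable()
#else
#define my_unreachable()	abort()
#endif

/* stand-in for elog(ERROR): it never returns, but MSVC is not told so */
static void
report_error(const char *msg)
{
	fprintf(stderr, "%s\n", msg);
	exit(1);
}

static int
pick(int kind)
{
	switch (kind)
	{
		case 0:
			return 10;
		case 1:
			return 20;
	}

	report_error("unexpected kind");
	my_unreachable();	/* silences "not all control paths return a value" */
}

int
main(void)
{
	printf("%d\n", pick(1));
	return 0;
}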

David




Re: [PATCH] Proposal to Enable/Disable Index using ALTER INDEX

2025-07-23 Thread Michael Paquier
On Tue, Jul 22, 2025 at 01:15:16PM -0500, Sami Imseih wrote:
> The GUC serves multiple purposes. For example,I can create an index as 
> invisible
> and use it in a controlled way, which is helpful for experimenting
> with a new index.

An in-core GUC to control the list of indexes that should be allowed
or disallowed is I think asking for trouble, adding schema-related
knowledge directly into the GUC machinery.  This does not scale well,
even if you force all the entries to be specified down to the database
and the schema.  And it makes it harder to control what a "good" behavior
should be at the query level.

My 2c.
--
Michael


signature.asc
Description: PGP signature


Re: [Buildfarm:84] Re: stats.sql might fail due to shared buffers also used by parallel tests

2025-07-23 Thread Alexander Lakhin

Hello Takatsuka-san,

23.07.2025 08:48, TAKATSUKA Haruka wrote:

Hello Alexander,

On Wed, 23 Jul 2025 00:55:37 +
Yugo Nagata - Buildfarm wrote:


Nagata-san, could you please share the configuration of hamerkop? If it's
running inside VM, what virtualization software is used?

It's vmware ESXi 7.0.3 (21930508).
This Windows enable auto-synchronize with 'time.windows.com'.
VMware Tools 12.1.5 build 20735119 is installed in the VM.


Thank you for your prompt response!

I used VirtualBox, not VMWare, when trying to reproduce the issue, maybe
that's the reason why I failed. It looks like VMWare has its own
specifics related to timekeeping: [1].

Maybe you could try tools.syncTime = "0" by any chance?

There is also an interesting note in VMware Tools 11 docs [2]:
If the clock on the guest operating system is ahead of the clock on the
host, VMware Tools causes the clock on the guest to run more slowly until
the clocks are synchronized.

But still it's hard to say without experimentation whether this can cause
the observed effect.

[1] https://knowledge.broadcom.com/external/article?legacyId=1318
[2] 
https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/tools/11-1-0/vmware-tools-administration-11-1-0/configuring-vmware-tools-components/using-vmware-tools-configuration-utility/enable-periodic-time-synchronization.html


Best regards,
Alexander

Re: Collation and primary keys

2025-07-23 Thread Jeff Davis
On Wed, 2025-07-23 at 13:53 +0200, Daniel Verite wrote:
> > * The libc C.UTF-8 locale was a reasonable default (though not a
> > natural language collation). But now that we have C.UTF-8 available
> > from the builtin provider, then we should encourage that instead of
> > relying on the slower, platform-specific libc implementation.
> 
> Yes. In particular, we should encourage the ecosystem to support
> the new collation features so that they're widely available to
> end users.

Then I propose that we change the initdb default to builtin C.UTF-8.
Patch attached.

To get the old initdb behavior use --locale-provider=libc, and all the
other defaults will work as before.

The change would not disrupt upgrades (see commit 9637badd9f).

One annoyance: if your environment has an LC_CTYPE with a non-UTF-8
locale, then initdb forces LC_CTYPE=C and emits a warning.

I had previously tried, and failed, to change the default to ICU for
v16, so it's worth mentioning why I don't believe this proposal will
run into the same problems:

* ICU, while better than libc, didn't completely solve any of the
problems. This proposal completely solves the inconsistent primary key
problem, and is much faster than libc or ICU.

* In the version 16 change, we were still attempting to map environment
variables to ICU locales, which was never going to work very well. In
particular, as you pointed out, ICU has nothing to approximate the
C.UTF-8 locale. The current proposal doesn't attempt that kind of
cleverness.

Comments?

Regards,
Jeff Davis

From 8ba8f74d28a64bfb006a76fbec64638f55f3660c Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 17 Jul 2025 13:07:50 -0700
Subject: [PATCH] initdb: default to builtin C.UTF-8

Discussion: https://postgr.es/m/918773d0-886b-4ce0-8b74-ae23496ad...@manitou-mail.org
---
 src/backend/commands/dbcommands.c |  2 +-
 src/bin/initdb/initdb.c   | 50 ++-
 src/bin/initdb/t/001_initdb.pl|  6 ++-
 src/bin/scripts/t/020_createdb.pl | 23 +
 .../modules/test_escape/t/001_test_escape.pl  |  2 +-
 5 files changed, 56 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 502a45163c8..92a396b8406 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1052,7 +1052,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		dbctype = src_ctype;
 	if (dblocprovider == '\0')
 		dblocprovider = src_locprovider;
-	if (dblocale == NULL)
+	if (dblocale == NULL && dblocprovider == src_locprovider)
 		dblocale = src_locale;
 	if (dbicurules == NULL)
 		dbicurules = src_icurules;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 62bbd08d9f6..ea2a3299737 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -82,6 +82,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 
+#define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
@@ -144,7 +145,7 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
-static char locale_provider = COLLPROVIDER_LIBC;
+static char locale_provider = COLLPROVIDER_BUILTIN;
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -2468,12 +2469,11 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
-	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
-		pg_fatal("locale must be specified if provider is %s",
- collprovider_name(locale_provider));
-
 	if (locale_provider == COLLPROVIDER_BUILTIN)
 	{
+		if (!datlocale)
+			datlocale = DEFAULT_BUILTIN_LOCALE;
+
 		if (strcmp(datlocale, "C") == 0)
 			canonname = "C";
 		else if (strcmp(datlocale, "C.UTF-8") == 0 ||
@@ -2491,6 +2491,9 @@ setlocales(void)
 	{
 		char	   *langtag;
 
+		if (!datlocale)
+			pg_fatal("locale must be specified if provider is icu");
+
 		/* canonicalize to a language tag */
 		langtag = icu_language_tag(datlocale);
 		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
@@ -2686,6 +2689,29 @@ setup_locale_encoding(void)
 {
 	setlocales();
 
+	/*
+	 * For the builtin provider (other than the "C" locale), default encoding
+	 * to UTF-8. If lc_ctype is not compatible with UTF-8, also force lc_ctype
+	 * to "C". On windows, all locales are compatible with UTF-8.
+	 */
+	if (!encoding && locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
+	{
+#ifndef WIN32
+		int ctype_enc = pg_get_encoding_from_locale(lc_ctype, false);
+		if (!(ctype_enc == PG_UTF8 ||
+			  ctype_enc == PG_SQL_ASCII))
+		{
+			pg_log_warning("setting LC_CTYPE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_CTYPE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+ 

Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Michael Paquier
On Wed, Jul 23, 2025 at 11:59:11AM -0400, Andres Freund wrote:
> On 2025-07-23 10:23:53 +, Bertrand Drouvot wrote:
>> Indeed, with a single boolean flag, then how could a stat say that it has 
>> nothing
>> pending anymore (when flushing) without saying "all the stats have nothing
>> pending" (while some may still have pending stats)?
> 
> I don't think that's a problem - reset that global flag after checking it at
> the start of pgstat_report_stat() and set it to true if partial_flush is true
> at the end of pgstat_report_stat().

That's one way of doing it.  While looking at the code, I tend to
think that it is actually simpler to reset it once at the bottom of
pgstat_report_fixed, not touching it at all if we are in a
partial_flush state because we know that we will need to do one report
later on anyway, be it just for the stats in the dshash.  If we do
that, we can also skip entirely the flush_static_cb calls if
pgstat_report_fixed is not set while doing the reports.
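
In case a sketch helps to visualize the control flow being discussed, it is
roughly the pattern below; all names are invented for the illustration and do
not claim to match the actual patch.

#include <stdbool.h>
#include <stdio.h>

static bool have_pending_stats = false;	/* set by "stats accumulated" paths */

/* pretend flush; returns true if something could not be flushed yet */
static bool
flush_all_pending(bool nowait)
{
	printf("flushing pending stats (nowait=%d)\n", nowait);
	return false;				/* no partial flush in this toy */
}

static void
report_stats(bool nowait)
{
	bool		partial_flush;

	/* fast exit: a single boolean check when nothing was ever accumulated */
	if (!have_pending_stats)
		return;

	/* clear before flushing; re-arm below if anything is left over */
	have_pending_stats = false;

	partial_flush = flush_all_pending(nowait);

	if (partial_flush)
		have_pending_stats = true;	/* try again at the next report */
}

int
main(void)
{
	report_stats(true);			/* nothing pending: returns immediately */
	have_pending_stats = true;
	report_stats(true);			/* flushes */
	return 0;
}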
--
Michael


signature.asc
Description: PGP signature


Re: Fixing MSVC's inability to detect elog(ERROR) does not return

2025-07-23 Thread Tom Lane
David Rowley  writes:
> I did a quick check of the postgres.exe size with and without this
> change. It does save about 13KB, but that seems to be entirely from
> the /std:c11 flag. None is due to the compiler emitting less code
> after elog(ERROR) / ereport(ERROR) :-(

Hmmm ... but you did check that in fact we can remove such known-dead
code and not get a warning now?

regards, tom lane




Re: Custom pgstat support performance regression for simple queries

2025-07-23 Thread Michael Paquier
On Wed, Jul 23, 2025 at 12:00:55PM -0400, Andres Freund wrote:
> On 2025-07-23 09:54:12 +0900, Michael Paquier wrote:
>> Noted.  I was wondering originally if the threshold for the builtin
>> and custom kinds should be lowered, as well.  Perhaps that's just too
>> optimistic, so we can first lower things.  Say PGSTAT_KIND_CUSTOM_MIN
>> to 32 and PGSTAT_KIND_MAX to 48 or so to begin with?  I don't see a
>> strict policy here, but these would make the whole cheaper to begin
>> with, with enough margin for any new stats kinds.
> 
> Yes, 256 seems pointlessly large.

I still cannot put my finger on what a good number should be, but I've
settled to something that I think should be good enough for an initial
release:
PGSTAT_KIND_CUSTOM_MIN 128 -> 24
PGSTAT_KIND_MAX 256 -> 32

These numbers mean that we have enough room for 7 more builtin kinds,
and the same for custom kinds.  Even if this becomes an issue, the bounds
could always be moved up at some point, even across major releases, even
if that would mean extension developers having to switch to a different ID.

>> Then, about the loop used here, I'd be OK to keep the past shortcuts
>> for the builtin fixed-sized stats kinds with the requirement that new
>> builtin stats kinds need to hack into pgstat_report_stat() themselves
>> on efficiency grounds, combined with a second loop for custom kinds,
>> taken only if pgstat_register_kind() has been used by tracking if
>> that's the case in a flag.  Do you have a specific preference here?
> 
> I think the downstream approach of having a flag that says if there are *any*
> pending stats is better.

Works for me.  I am finishing with the attached, meaning that
workloads that trigger no stats can take the fast-exit path with a
single boolean check.  What do you think?
--
Michael
From 05ded16fe6ffd87835ff29b0136e46bb76a81687 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 24 Jul 2025 08:18:06 +0900
Subject: [PATCH 1/2] Lower bounds of pgstats kinds

This changes stats kinds to have the following bounds, making their
handling in core cheaper by default:
- PGSTAT_KIND_CUSTOM_MIN 128 -> 24
- PGSTAT_KIND_MAX 256 -> 32

This should be enough to leave some room for the following years for
built-in and custom stats kinds.  At least that should be enough to
start with this facility for extension developers.
---
 src/include/utils/pgstat_kind.h   | 6 +++---
 src/test/modules/injection_points/injection_stats.c   | 2 +-
 src/test/modules/injection_points/injection_stats_fixed.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/pgstat_kind.h b/src/include/utils/pgstat_kind.h
index f44169fd5a3c..eb5f0b3ae6db 100644
--- a/src/include/utils/pgstat_kind.h
+++ b/src/include/utils/pgstat_kind.h
@@ -18,7 +18,7 @@
 
 /* Range of IDs allowed, for built-in and custom kinds */
 #define PGSTAT_KIND_MIN	1		/* Minimum ID allowed */
-#define PGSTAT_KIND_MAX	256		/* Maximum ID allowed */
+#define PGSTAT_KIND_MAX	32		/* Maximum ID allowed */
 
 /* use 0 for INVALID, to catch zero-initialized data */
 #define PGSTAT_KIND_INVALID 0
@@ -46,7 +46,7 @@
 /* Custom stats kinds */
 
 /* Range of IDs allowed for custom stats kinds */
-#define PGSTAT_KIND_CUSTOM_MIN	128
+#define PGSTAT_KIND_CUSTOM_MIN	24
 #define PGSTAT_KIND_CUSTOM_MAX	PGSTAT_KIND_MAX
 #define PGSTAT_KIND_CUSTOM_SIZE	(PGSTAT_KIND_CUSTOM_MAX - PGSTAT_KIND_CUSTOM_MIN + 1)
 
@@ -55,7 +55,7 @@
  * development and have not reserved their own unique kind ID yet. See:
  * https://wiki.postgresql.org/wiki/CustomCumulativeStats
  */
-#define PGSTAT_KIND_EXPERIMENTAL	128
+#define PGSTAT_KIND_EXPERIMENTAL	24
 
 static inline bool
 pgstat_is_kind_builtin(PgStat_Kind kind)
diff --git a/src/test/modules/injection_points/injection_stats.c b/src/test/modules/injection_points/injection_stats.c
index 14903c629e0d..e3947b23ba57 100644
--- a/src/test/modules/injection_points/injection_stats.c
+++ b/src/test/modules/injection_points/injection_stats.c
@@ -59,7 +59,7 @@ static const PgStat_KindInfo injection_stats = {
 /*
  * Kind ID reserved for statistics of injection points.
  */
-#define PGSTAT_KIND_INJECTION	129
+#define PGSTAT_KIND_INJECTION	25
 
 /* Track if stats are loaded */
 static bool inj_stats_loaded = false;
diff --git a/src/test/modules/injection_points/injection_stats_fixed.c b/src/test/modules/injection_points/injection_stats_fixed.c
index 3d0c01bdd05a..bc54c79d190b 100644
--- a/src/test/modules/injection_points/injection_stats_fixed.c
+++ b/src/test/modules/injection_points/injection_stats_fixed.c
@@ -64,7 +64,7 @@ static const PgStat_KindInfo injection_stats_fixed = {
 /*
  * Kind ID reserved for statistics of injection points.
  */
-#define PGSTAT_KIND_INJECTION_FIXED	130
+#define PGSTAT_KIND_INJECTION_FIXED	26
 
 /* Track if fixed-numbered stats are loaded */
 static bool inj_fixed_loaded = false;
-- 
2.50.0

From 6814ddec82e40a14d61ba0a5c93974c76640ab52 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 24 Jul 2025 09:03:53 +09

Re: [Buildfarm:84] Re: stats.sql might fail due to shared buffers also used by parallel tests

2025-07-23 Thread TAKATSUKA Haruka


On Wed, 23 Jul 2025 23:00:00 +0300
Alexander Lakhin  wrote:

{snip}
> >>> Nagata-san, could you please share the configuration of hamerkop? If it's
> >>> running inside VM, what virtualization software is used?
> > It's vmware ESXi 7.0.3 (21930508).
> > This Windows enable auto-synchronize with 'time.windows.com'.
> > VMware Tools 12.1.5 build 20735119 is installed in the VM.
> 
> Thank you for your prompt response!
> 
> I used VirtualBox, not VMWare, when trying to reproduce the issue, maybe
> that's the reason why I failed. It looks like VMWare has it's own
> specifics related to timekeeping: [1].
> 
> Maybe you could try tools.syncTime = "0" by any chance?

It has already been set to tools.syncTime = "0" so far.
I confirmed the following GUI setting.

VM Options
  VMware Tools
Time "Synchronize guest time with host":  unchecked

Thanks,
Haruka Takatsuka 




Regression with large XML data input

2025-07-23 Thread Michael Paquier
Hi all,
(Adding in CC Tom and Eric, as committer and  author.)

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

The problem is introduced by the change from
xmlParseBalancedChunkMemory() to xmlNewNode() +
xmlParseInNodeContext() in xml_parse(), to avoid an issue in
xmlParseBalancedChunkMemory() in the range of libxml2 2.13.0-2.13.2
for a bug that has already been fixed upstream, where we use a
temporary root node for the case where parse_as_document is false.

If the input XML data is large enough, one gets a failure at the top
of the latest branches, and it worked properly before.  Here is a
short test case (courtesy of a colleague, case that I've modified
slightly):
CREATE TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
DO $$ DECLARE size_40mb TEXT := repeat('X', 4000);
BEGIN
 BEGIN
   INSERT INTO xmldata (id, message) VALUES
 ( 1, (('Test40MB' || size_40mb || 
'')::xml) );
   RAISE NOTICE 'insert 40MB successful';
   EXCEPTION WHEN OTHERS THEN RAISE NOTICE 'Error insert 40MB: %', SQLERRM;
 END;
END $$;

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory() fixes the issue.  A quick POC is
attached.  It fails one case in check-world with SERIALIZE because I
am not sure it is possible to pass down some options through
xmlParseBalancedChunkMemory(); still, the regression is gone, and I am
wondering if there is a better solution that would dodge the
original problem and still accept this case.  One good thing is that
xmlParseBalancedChunkMemory() is able to return a list of nodes, which
we need for this code path of xml_parse().  So perhaps one solution
would be the addition of a code path with
xmlParseBalancedChunkMemory() depending on the options we want to
process, keeping the code path with the fake "content-root" for the
XML SERIALIZE case.
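
As a point of reference for that direction, xmlParseBalancedChunkMemory() can
indeed hand back a node list parsed against an existing document without any
fake root, as in this small standalone libxml2 sketch (error handling
trimmed, link against libxml2); it only demonstrates the library call, not
the proposed change to xml_parse():

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int
main(void)
{
	const char *chunk = "<a>x</a><b>y</b>";	/* CONTENT fragment, no single root */
	xmlDocPtr	doc;
	xmlNodePtr	node_list = NULL;
	int			res;

	xmlInitParser();
	doc = xmlNewDoc((const xmlChar *) "1.0");

	/* parse the fragment as balanced content, getting back a node list */
	res = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
									  (const xmlChar *) chunk, &node_list);
	if (res != 0)
	{
		fprintf(stderr, "parse failed: %d\n", res);
		return 1;
	}

	for (xmlNodePtr node = node_list; node != NULL; node = node->next)
		printf("node: %s\n", (const char *) node->name);

	xmlFreeNodeList(node_list);
	xmlFreeDoc(doc);
	xmlCleanupParser();
	return 0;
}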

The patch in question has been applied first to 6082b3d5d3d1 on HEAD
impacting v18~, then it has been backpatched down to all stable
branches, like f68d6aabb7e2, introducing the regression in all the
stable branches since the minor releases done in August 2024, as of:
12.20, 13.16, 14.13, 15.8, 16.4.

Thoughts or comments?
--
Michael
From 1834a83ec4e013fcc762e07372ca3fa365e2541e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 24 Jul 2025 12:01:52 +0900
Subject: [PATCH] Fix xml2 regression

---
 src/backend/utils/adt/xml.c | 27 ---
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index f7b731825fca..3a44794a2393 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -1900,9 +1900,6 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
 		}
 		else
 		{
-			xmlNodePtr	root;
-			xmlNodePtr	oldroot PG_USED_FOR_ASSERTS_ONLY;
-
 			/* set up document with empty root node to be the context node */
 			doc = xmlNewDoc(version);
 			if (doc == NULL || xmlerrcxt->err_occurred)
@@ -1916,31 +1913,15 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
 			"could not allocate XML document");
 			doc->standalone = standalone;
 
-			root = xmlNewNode(NULL, (const xmlChar *) "content-root");
-			if (root == NULL || xmlerrcxt->err_occurred)
-xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
-			"could not allocate xml node");
-
-			/*
-			 * This attaches root to doc, so we need not free it separately;
-			 * and there can't yet be any old root to free.
-			 */
-			oldroot = xmlDocSetRootElement(doc, root);
-			Assert(oldroot == NULL);
-
 			/* allow empty content */
 			if (*(utf8string + count))
 			{
 xmlNodePtr	node_list = NULL;
-xmlParserErrors res;
 
-res = xmlParseInNodeContext(root,
-			(char *) utf8string + count,
-			strlen((char *) utf8string + count),
-			options,
-			&node_list);
-
-if (res != XML_ERR_OK || xmlerrcxt->err_occurred)
+res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
+	   utf8string + count,
+	   &node_list);
+if (res_code != 0 || xmlerrcxt->err_occurred)
 {
 	xmlFreeNodeList(node_list);
 	xml_errsave(escontext, xmlerrcxt,
-- 
2.50.0



signature.asc
Description: PGP signature


Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-07-23 Thread Andrei Lepikhov
On 23/7/2025 02:11, David Rowley wrote:
> On Wed, 23 Jul 2025 at 02:35, Andrei Lepikhov  wrote:
>> However, at first, I'd consider how it could be added to the
>> IncrementalSort and HashJoin. The number of estimated groups/buckets may
>> also provide some insights into the planner's decision.
>
> Sounds like another patch for another thread.
Maybe; that is up to you.

> Do you have an argument in mind to support adding the group?
I have not. I just thought that grouping would be helpful for an explain
parser.


--
regards, Andrei Lepikhov




Re: display hot standby state in psql prompt

2025-07-23 Thread Srinath Reddy Sadipiralla
Hi Jim,

On Tue, Jul 22, 2025 at 4:40 PM Jim Jones  wrote:

> Hi
>
> On 21.07.25 21:12, Greg Sabino Mullane wrote:
> > Seems good enough for me. I think as long as we document it well, it's
> > only going to be a net positive, even with some edge cases.
>
> I just moved the patch from PG19-Drafts to PG19-2 commitfest.[1]
>
> Thanks a lot for the feedback!
>
> Best regards, Jim
>
> 1 - https://commitfest.postgresql.org/patch/5872/


+1 for the patch. I have reviewed and tested it; except for the cosmetic
changes below, it LGTM.

cosmetic changes:
1) add a comment about %i in the get_prompt API.
2) maybe we can use read-write instead of read/write, to be consistent with
existing naming such as the read-write option of target_session_attrs.

testing:

=> in primary node:

psql (19devel)
Type "help" for help.

postgres=# \set PROMPT1 '[%i] # '
[read/write] # set default_transaction_read_only=on;
SET
[read-only] # set default_transaction_read_only=off;
SET
[read/write] # show in_hot_standby ;
 in_hot_standby

 off
(1 row)

[read/write] # set default_transaction_read_only=on;
SET
[read-only] # show in_hot_standby ;
 in_hot_standby

 off
(1 row)

[read-only] # \q

=> in replica node

psql (19devel)
Type "help" for help.

postgres=# \set PROMPT1 '[%i] # '
[read-only] # show in_hot_standby ;
 in_hot_standby

 on
(1 row)

[read-only] # show default_transaction_read_only;
 default_transaction_read_only
---
 off
(1 row)

[read-only] # set default_transaction_read_only=on;
SET
[read-only] # set transaction_read_only=on;
SET
[read-only] # set transaction_read_only=off;
ERROR:  cannot set transaction read-write mode during recovery
[read-only] # select pg_promote();
 pg_promote

 t
(1 row)

[read-only] # show in_hot_standby ;
 in_hot_standby

 off
(1 row)

[read-only] # show default_transaction_read_only;
 default_transaction_read_only
---
 on
(1 row)

[read-only] # set default_transaction_read_only=off;
SET
[read/write] #


-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/


Re: recoveryStopsAfter is not usable when recovery_target_inclusive is false

2025-07-23 Thread Michael Paquier
On Wed, Jul 23, 2025 at 05:04:37AM +, Hayato Kuroda (Fujitsu) wrote:
> While working on [1] I found the point. When recovery_target_lsn is specified 
> and
> recovery_target_inclusive is false, recoveryStopsAfter() cannot return true.
> 
>   /* Check if the target LSN has been reached */
>   if (recoveryTarget == RECOVERY_TARGET_LSN &&
>   recoveryTargetInclusive &&
>   record->ReadRecPtr >= recoveryTargetLSN)
> 
> In this case the recovery can stop when next record is read. This normally 
> works
> well but if the next record has not been generated yet, startup process will 
> wait
> till new one will come then exit from the apply loop.

+static inline bool
+TargetLSNIsReached(XLogReaderState *record, bool is_before)
+{
+	XLogRecPtr	targetLsn;
+
+	Assert(recoveryTarget == RECOVERY_TARGET_LSN);
+
+	if (is_before)
+	{
+		if (recoveryTargetInclusive)
+			return false;
+
+		targetLsn = record->ReadRecPtr;
+	}
+	else
+	{
+		targetLsn = recoveryTargetInclusive ?
+			record->ReadRecPtr : record->EndRecPtr;
+	}
+
+	return targetLsn >= recoveryTargetLSN;
+}

So, what you are doing here is changing the behavior of the check in
recoveryStopsAfter(), with the addition of one exit path based on
EndRecPtr if recovery_target_inclusive is false, to leave a bit
earlier in case we don't have a follow-up record.  Did you notice a
speedup once you did that in your logirep workflow?  I am afraid that
this change would be a new mode, skipping a couple of code paths if
one decides to trigger the exit.  And I highly suspect that you could
break some cases that worked until now here, skipping one
recoveryPausesHere() and one ProcessStartupProcInterrupts() to say the
least.

> I feel the process can exit bit earliyer, by comparing with the end point of 
> the
> applied record and recovery_target_lsn.
> Attached patch roughly implemented the idea.

Could you prove your point with a test or something that justifies a
new definition?

> I read the old discussions, but I cannot find the reason of current style.

35250b6ad7a8 was quite some time ago, as of this thread, with my
fingerprints all over it:
https://www.postgresql.org/message-id/flat/CAB7nPqRKKyZQhcB39mty9C_gfB0ODRUQZEWB4GPP8ARpdAB%3DZA%40mail.gmail.com

If my memory serves me right, this comes down to the consistency with
how stop XID targets are handled before and after records are read,
except that this one applies to ReadRecPtr.
--
Michael


signature.asc
Description: PGP signature


Re: display hot standby state in psql prompt

2025-07-23 Thread Jim Jones
Hi Srinath

On 23.07.25 09:03, Srinath Reddy Sadipiralla wrote:
> +1 for the patch,i have reviewed and tested this patch, except these
> below cosmetic changes it LGTM.
>
> cosmetic changes:
> 1) add comment about %i in get_prompt api.

Done.

> 2) maybe we can use read-write instead of read/write to be consistent
> with the
>     naming such as options of target_session_attrs uses read-write.

I believe that 'read/write' is more idiomatic than 'read-write' in this
context. 'Read-only' works as a hyphenated adjective, and 'read/write'
is typically treated as a paired label that indicates two distinct
capabilities --- read and write. What do you think?

v3 attached.

Thanks for the thorough testing and review!

Best, Jim

From 260d23ca92d5c31dc717cca4fd05b760e0069142 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Wed, 23 Jul 2025 09:35:15 +0200
Subject: [PATCH v3] Add %i prompt escape to indicate server read-only status

This patch introduces a new prompt escape `%i` for psql, which shows
whether the connected server is operating in read-only mode. It
expands to `read-only` if either the server is in hot standby mode
(`in_hot_standby = on`) or the session's default transaction mode is
read-only (`default_transaction_read_only = on`). Otherwise, it
displays `read/write`.

This is useful for distinguishing read-only sessions (e.g. connected
to a standby, or using a default read-only transaction mode) from
read/write ones at a glance, especially when working with multiple
connections in replicated or restricted environments.
---
 doc/src/sgml/ref/psql-ref.sgml | 14 ++
 src/bin/psql/prompt.c  | 16 
 2 files changed, 30 insertions(+)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 4f7b11175c..c8e1449766 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -5008,6 +5008,20 @@ testdb=> INSERT INTO my_table VALUES (:'content');
 
   
 
+  
+%i
+
+  
+Displays the session's read-only status as read-only
+if the server is in hot standby (in_hot_standby is
+on) or the default transaction mode is read-only
+(default_transaction_read_only is on),
+or read-write otherwise. Useful for identifying
+sessions that cannot perform writes, such as in replication setups.
+  
+
+  
+
   
 %x
 
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index b08d7328fb..1d0eca5cfc 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -43,6 +43,8 @@
  *			or a ! if session is not connected to a database;
  *		in prompt2 -, *, ', or ";
  *		in prompt3 nothing
+ * %i - displays "read-only" if in hot standby or default_transaction_read_only
+ *		is on, "read/write" otherwise.
  * %x - transaction status: empty, *, !, ? (unknown or no connection)
  * %l - The line number inside the current statement, starting from 1.
  * %? - the error code of the last query (not yet implemented)
@@ -247,7 +249,21 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
 			break;
 	}
 	break;
+case 'i':
+	if (pset.db)
+	{
+		const char *hs = PQparameterStatus(pset.db, "in_hot_standby");
+		const char *ro = PQparameterStatus(pset.db, "default_transaction_read_only");
 
+		if ((hs && strcmp(hs, "on") == 0) ||
+			(ro && strcmp(ro, "on") == 0))
+			strlcpy(buf, "read-only", sizeof(buf));
+		else
+			strlcpy(buf, "read/write", sizeof(buf));
+	}
+	else
+		buf[0] = '\0';
+	break;
 case 'x':
 	if (!pset.db)
 		buf[0] = '?';
-- 
2.43.0



RE: 024_add_drop_pub.pl might fail due to deadlock

2025-07-23 Thread Hayato Kuroda (Fujitsu)
> Dear Ajin,
> 
> Thanks for the patch. Firstly let me confirm my understanding. While altering 
> the
> subscription, locks are acquired with below ordering:
>

I forgot to confirm one point. For which branches should this be backpatched? Initially
it was reported only on PG15 [1], but I found that 021_alter_sub_pub could fail on
PG14.
PG13 may not be affected, because the replication origin did not seem to be used
for table sync there; that use was introduced by ce0fdbfe97.

[1]: 
https://www.postgresql.org/message-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297%40gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: index prefetching

2025-07-23 Thread Tomas Vondra
On 7/23/25 03:31, Peter Geoghegan wrote:
> On Tue, Jul 22, 2025 at 8:37 PM Tomas Vondra  wrote:
>>> I happen to think that that's a very unrealistic assumption. Most
>>> standard benchmarks have indexes that almost all look fairly similar
>>> to pgbench_accounts_pkey, from the point of view of "heap page blocks
>>> per leaf page". There are exceptions, of course (e.g., the TPC-C order
>>> table's primary key suffers from fragmentation).
>>>
>>
>> I agree with all of this.
> 
> Cool.
> 
>> I assume you mean results for the "linear" data set, because for every
>> other data set the patches perform almost exactly the same (when
>> restoring the distance after stream reset):
>>
>> https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf
> 
> Right.
> 
>> And it's a very good point. I was puzzled by this too for a while, and
>> it took me a while to understand how/why this happens. It pretty much
>> boils down to the "duplicate block" detection and how it interacts with
>> the stream resets (again!).
> 
> I think that you slightly misunderstand where I'm coming from here: it
> *doesn't* puzzle me. What puzzled me was that it puzzled you.
> 
> Andres' test query is very simple, and not entirely sympathetic
> towards the complex patch (by design). And yet it *also* gets quite a
> decent improvement from the complex patch. It doesn't speed things up
> by another order of magnitude or anything, but it's a very decent
> improvement -- one well worth having.
> 
> I'm also unsurprised at the fact that all the other tests that you ran
> were more or less a draw between simple and complex. At least not now
> that I've drilled down and understood what the indexes from those
> other test cases actually look like, in practice.
> 
>> So you're right that the complex patch prefetches far ahead. I thought the
>> distance would quickly decrease because of the duplicate blocks, but I
>> missed the fact that the read stream will not see them at all.
> 
> FWIW I wasn't thinking about it at anything like that level of
> sophistication. Everything I've said about it was based on intuitions
> about how the prefetching was bound to work, for each different kind
> of index. I just looked at individual leaf pages (or small groups of
> them) from each index/test, and considered their TIDs, and imagined
> how that was likely to affect the scan.
> 
> It just seems obvious to me that all the tests (except for "linear")
> couldn't possibly be helped by eagerly reading multiple leaf pages. It
> seemed equally obvious that it's quite possible to come up with a
> suite of tests that have several tests that could benefit in the same
> way (not just 1). Although your "linear_1"/"linear_N" tests aren't
> actually like that, many cases will be -- and not just those that are
> perfectly correlated ala pgbench.
> 
>> I'm not sure it's desirable to "hide" blocks from the read stream like
>> this - it'll never see the misses. How could it make good decisions,
>> when we skew the data used by the heuristics like this?
> 
> I don't think that I fully understand what's desirable here myself.
> 
>>> Doing *no* prefetching will usually be the right thing to do. Does
>>> that make index prefetching pointless in general?
>>>
>>
>> I don't think so. Why would it? There's plenty of queries that can
>> benefit from it a lot, and as long as it doesn't cause harm to other
>> queries it's a win.
> 
> I was being sarcastic. That wasn't a useful thing for me to do. Apologies.
> 
>> This is not resetting the stream, though. This is resetting the position
>> tracking how far the stream got.
> 
> My main point is that there's stuff going on here that nobody quite
> understands just yet. And so it probably makes sense to defensively
> assume that the prefetch distance resetting stuff might matter with
> either the complex or simple patch.
> 
>> Sorry, got distracted and forgot to complete the sentence. I think I
>> wanted to write "mostly from not resetting the distance to 1". Which is
>> true, but the earlier "linear" example also shows there are cases where
>> the page boundaries are significant.
> 
> Of course that's true. But that was just a temporary defect of the
> "simple" patch (and perhaps even for the "complex" patch, albeit to a
> much lesser degree). It isn't really relevant to the important
> question of whether the simple or complex design should be pursued --
> we know that now.
> 
> As I said, I don't think that the test suite is particularly well
> suited to evaluating simple vs complex. Because there's only one test
> ("linear") that has any hope of being better with the complex patch.
> And because having only 1 such test isn't representative.
> 
>> That's actually intentional. I wanted to model tables with wider tuples,
>> without having to generate all the data, etc. Maybe 25% is too much, and
>> real tables have more than 20 tuples. It's true 400B is fairly large.
> 
> My point about fill fac