Re: Commit fest 2025-03

2025-03-09 Thread vignesh C
On Wed, 5 Mar 2025 at 16:50, Jim Jones  wrote:
>
> Hi Vignesh
>
> On 05.03.25 10:22, vignesh C wrote:
> > The following "Ready for committer" patches need a rebase
> > ---
> > Truncate logs by max_log_size - Kirill Gavrilov
> >
> > Patch owners, please provide a rebased version to prepare it for
> > reviewers and committers.
>
> Is there something wrong with the commitfest app? This patch applies
> cleanly and passes all tests

This issue has been addressed now. Here are the updated patches that
need to be rebased:
The walsender does not update its IO statistics until it exits -
Bertrand Drouvot
noreturn attribute for MSVC, C11 - Peter Eisentraut
Add pg_stat_session - Sergey Dudoladov
Logging plan of the currently running query - Atsushi Torikoshi
Index Prefetching -  Tomas Vondra
Allow partition-wise join when whole row var is needed - Alexander Pyhalov
Asynchronous MergeAppend Execution - Alexander Pyhalov
AIO - Andres Freund
Limiting overshoot in nbtree SAOP parallel index scans - Matthias van de Meent
Use read_stream in index vacuum -  Andrey Borodin
Adding compression of temporary files - Filip Janus
Allow to use an index for ILIKE in more cases - Yugo NAGATA
Use Bump allocator for HashAgg - Jeff Davis
Read stream scalability improvements and AIO-compatibility - Thomas Munro
Compress big WAL records  - Andrey Borodin
Reduce timing overhead of EXPLAIN ANALYZE using rdtsc  - Lukas Fittl
Use XLOG_CONTROL_FILE macro everywhere - Anton A. Melnikov
Don't dirty pages while they are getting flushed out - Andres Freund
Enable logical decoding when wal_level = 'replica' without a server
restart - Masahiko Sawada
VACUUM FULL / CLUSTER CONCURRENTLY - Antonin Houska
NOT ENFORCED constraint feature - Amul Sul
Changing shared_buffers without restart - Ashutosh Bapat
Extended Statistics set/restore/clear functions - Corey Huinker
general purpose array_sort - jian he
explain plans for foreign servers - Dinesh Salve
Enable fine-grained control over what gets logged on connection
attempt (reduces log size) - Sergey
Allow CI to only run the compiler warnings task - Bertrand Drouvot

Patch owners, please provide a rebased version to prepare them for
reviewers and committers.

Regards,
Vignesh




Re: MergeJoin beats HashJoin in the case of multiple hash clauses

2025-03-09 Thread Alexander Korotkov
On Wed, Mar 5, 2025 at 4:43 AM Alexander Korotkov  wrote:
>
> On Mon, Mar 3, 2025 at 10:24 AM Andrei Lepikhov  wrote:
> > On 17/2/2025 01:34, Alexander Korotkov wrote:
> > > Hi, Andrei!
> > >
> > > On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov  wrote:
> > > Thank you for your work on this subject.  I agree with the general
> > > direction.  While everyone has used conservative estimates for a long
> > > time, it's better to change them only when we're sure about it.
> > > However, I'm still not sure I get the conservatism.
> > >
> > > if (innerbucketsize > thisbucketsize)
> > >     innerbucketsize = thisbucketsize;
> > > if (innermcvfreq > thismcvfreq)
> > >     innermcvfreq = thismcvfreq;
> > >
> > > AFAICS, even in the worst case (all columns are totally correlated),
> > > the overall bucket size should be the smallest bucket size among
> > > clauses (not the largest).  And the same is true of MCV.  As a mental
> > > experiment, we can add a new clause to hash join, which is always true
> > > because columns on both sides have the same value.  In fact, it would
> > > have almost no influence except for the cost of extracting additional
> > > columns and the cost of executing additional operators.  But in the
> > > current model, this additional clause would completely ruin
> > > thisbucketsize and thismcvfreq, making hash join extremely
> > > unappealing.  Should we still revise this to calculate minimum instead
> > > of maximum?
> > I agree with your point. But I think the code works precisely the way
> > you have described.
>
> You're right.  I just mixed up the sides of the comparison operator.
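
Spelled out, the rule that snippet implements; here b[] is an
illustrative stand-in for the per-clause bucket-size estimates, not the
actual patch variables:

    /*
     * Combine per-clause bucket-size estimates for a hash join.  Taking
     * the minimum (rather than the conservative maximum) means an
     * always-true extra clause, whose estimate is 1.0, cannot make the
     * join look worse than the other clauses already imply.
     */
    double      innerbucketsize = 1.0;

    for (int i = 0; i < nclauses; i++)
    {
        if (innerbucketsize > b[i])
            innerbucketsize = b[i];     /* keep the smallest estimate */
    }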

I've revised the commit message, comments, formatting, etc.
I'm going to push this if no objections.

--
Regards,
Alexander Korotkov
Supabase


v3-0001-Use-extended-stats-for-precise-estimation-of-buck.patch
Description: Binary data


Re: Considering fractional paths in Append node

2025-03-09 Thread Nikita Malakhov
Hi!

No objections. Alexander, thank you!

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Incorrect assert in libpqwalreceiver

2025-03-09 Thread Heikki Linnakangas

On 09/03/2025 10:09, Jacob Brazeal wrote:
> The libpqrcv_connect function asserts 'Assert(i < sizeof(keys))', where
> keys is declared as const char *keys[6];.
>
> However, sizeof(keys) is not the correct way to check the array length
> (on my system, for example, it's 48 = 6 * 8 at this callsite, not 6.)
>
> I attached a patch to fix the assert, but I suppose we could also just
> remove the assert altogether, since it hasn't been doing anything for at
> least 8 years.


Committed, thanks!

I think it's still valuable; it's an easy mistake to make, to add a 
parameter and forget to increase the array size.
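
To make the pitfall concrete, a minimal sketch; lengthof() is the macro
PostgreSQL defines in c.h for exactly this:

    const char *keys[6];
    int         i;      /* index of the next parameter to fill */

    /*
     * Wrong: sizeof() yields the byte size of the array, which is
     * 6 * sizeof(char *) = 48 on 64-bit platforms, so the check is far
     * too lax.
     */
    Assert(i < sizeof(keys));

    /*
     * Right: lengthof(array) is sizeof(array) / sizeof((array)[0]), i.e.
     * the element count, 6.
     */
    Assert(i < lengthof(keys));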


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Printing window function OVER clauses in EXPLAIN

2025-03-09 Thread Tom Lane
Álvaro Herrera  writes:
> Would it be possible and make sense to use notation of explicit WINDOW
> clauses, for cases where multiple window functions invoke identical
> window definitions?

There's something to be said for that.  We would have to assign
made-up names to windows that didn't have one.  But then the
output might look like

  WindowAgg  (...)
Output: empno, depname, row_number() OVER (window1), rank() OVER (window1), 
count(*) OVER (window1), enroll_date
Window: window1 = PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

which is surely a lot nicer than 3x repetitions of the window spec.

After reading David's mail I'd been thinking of something like

  WindowAgg  (...)
Output: empno, depname, row_number() OVER (...), rank() OVER (...), 
count(*) OVER (...), enroll_date
Window: PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING

which is shorter but vaguer.  In particular, if you have more than one
WindowAgg, then with explicit names we'd have something like

  WindowAgg  (...)
Output: empno, depname, row_number() OVER (window1), rank() OVER (window1), 
(count(*) OVER (window2)), enroll_date
Window: window1 = PARTITION BY depname ORDER BY enroll_date ROWS UNBOUNDED 
PRECEDING
WindowAgg  (...)
  Output: empno, depname, count(*) OVER (window2), enroll_date
Window: window2 = PARTITION BY depname ORDER BY enroll_date ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

With "..." that would be confusing as heck to someone who didn't
understand the nuances of the extra parentheses.

> (Hmm, not sure if the Window clauses would be top-level or attached to
> each WindowAgg in its own level.)

IMO the obvious thing is to attach each WindowClause to the WindowAgg
node that implements it.

I'll go try to code this up.

regards, tom lane




Re: Parallel CREATE INDEX for GIN indexes

2025-03-09 Thread Tom Lane
Tomas Vondra  writes:
> I pushed the two smaller parts today.

Coverity is a little unhappy about this business in
_gin_begin_parallel:

boolleaderparticipates = true;
...
#ifdef DISABLE_LEADER_PARTICIPATION
leaderparticipates = false;
#endif
...
scantuplesortstates = leaderparticipates ? request + 1 : request;

It says

>>> CID 1644203:  Possible Control flow issues  (DEADCODE)
>>> Execution cannot reach the expression "request" inside this statement: 
>>> "scantuplesortstates = (lead...".
924 scantuplesortstates = leaderparticipates ? request + 1 : 
request;

If this were just temporary code I'd let it pass, but I see nothing
replacing this logic in the follow-up patches, so I think we ought
to do something to shut it up.

It's not complaining about the later bits like

if (leaderparticipates)
ginleader->nparticipanttuplesorts++;

(perhaps because there's no dead code there?)  So one idea is

scantuplesortstates = request;
if (leaderparticipates)
scantuplesortstates++;

which would look more like the other code anyway.

regards, tom lane




Re: Statistics Import and Export

2025-03-09 Thread Jeff Davis
On Sat, 2025-03-08 at 10:56 -0500, Robert Treat wrote:
> In the UX world, the general pattern is people start to get
> overwhelmed once you get over a 1/2 dozen options (I think that's
> based on Miller's law, but might be mis-remembering); we are already
> at 9 for this use case. So really it is quite the opposite, we'd be
> reducing the burden on customers by simplifying the interface rather
> than just throwing out every possible combination and saying "you
> figure it out". 

To be clear about your proposal:

* --include conflicts with --schema-only and --data-only
* --include overrides any default

is that right?

Thoughts on how we should document when/how to use --section vs.
--include? Granted, that might be a point of confusion regardless of
the options we offer.

Regards,
Jeff Davis





Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

2025-03-09 Thread vignesh C
On Wed, 5 Mar 2025 at 21:43, Fujii Masao  wrote:
>
>
>
> On 2024/11/30 15:23, Kirill Reshke wrote:
> > On Fri, 11 Oct 2024 at 06:53, Fujii Masao  wrote:
> >> However, this issue already exists without the proposed patch.
> >> Since file_fdw already reports progress partially, querying multiple
> >> file_fdw tables can lead to inaccurate or confusing progress reports.
> >> You can even observe this when analyzing a file_fdw table and also
> >> when copying to the table with a trigger that executes progress-reporting
> >> commands.
> >>
> >> So, I don’t think this issue should block the proposed patch.
> >> In fact, progress reporting is already flawed in these scenarios,
> >> regardless of whether the patch is applied.
>
> On second thought, supporting progress tracking for COPY used by file_fdw
> could increase the chances of multiple commands being tracked simultaneously
> by a single backend. This means the command progress view might show
> incorrect results more often.
>
> As I mentioned before, this issue already exists, but it currently
> only happens in rare cases. I don’t think the fact that the issue
> already exists is a good reason to introduce more, and likely more common,
> scenarios where it could occur.
>
> With that in mind, I'm thinking of withdrawing this patch for now.

I've updated the status to "withdrawn." Feel free to add it again
anytime if you change your mind.

Regards,
Vignesh




Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

2025-03-09 Thread Daniil Davydov
Hi,
A few days ago I came up with an idea to implement multi-insert
optimization wherever possible. I prepared a raw patch, and it showed a
great performance gain (up to 4x for INSERT ... INTO ... in the best case).
Then I was very happy to find this thread. You did a great job, and I
want to help you bring this work to completion.

On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang  wrote:
> I did some performance test these days, and I have some findings.
> HEAD:
>   12.29%  postgres[.] pg_checksum_block
>6.33%  postgres[.] GetPrivateRefCountEntry
>5.40%  postgres[.] pg_comp_crc32c_sse42
>4.54%  [kernel][k] copy_user_enhanced_fast_string
>2.69%  postgres[.] BufferIsValid
>1.52%  postgres[.] XLogRecordAssemble
>
> Patched:
>   11.75%  postgres[.] tts_virtual_materialize
>8.87%  postgres[.] pg_checksum_block
>8.17%  postgres[.] slot_deform_heap_tuple
>8.09%  postgres[.] heap_compute_data_size
>6.17%  postgres[.] fill_val
>3.81%  postgres[.] heap_fill_tuple
>3.37%  postgres[.] tts_virtual_copyslot
>2.62%  [kernel][k] copy_user_enhanced_fast_string

I applied the v25 patches on the master branch and made some measurements
to find out what the bottleneck is in this case. The 'time' utility
showed that without the patch, this query runs about 1.5 times slower. I
also made a few flamegraphs for this test. Most of the time is spent
calling these two functions: tts_virtual_copyslot and heap_form_tuple.
All tests were run in virtual machine with these CPU characteristics:
Architecture: x86_64
CPU(s):   2
  On-line CPU(s) list:0,1
Virtualization features:
  Virtualization: AMD-V
  Hypervisor vendor:  KVM
  Virtualization type:full
Caches (sum of all):
  L1d:128 KiB (2 instances)
  L1i:128 KiB (2 instances)
  L2: 1 MiB (2 instances)
  L3: 32 MiB (2 instances)
NUMA:
  NUMA node(s):   1
  NUMA node0 CPU(s):  0,1

In my implementation, I used Tuplestore functionality to store tuples.
To avoid getting stuck in the above-mentioned functions, I combined it
with the current implementation (v25 patches) and got a 10% increase in
performance (for the test above). I also set up the v22 patches to
compare performance (with/without tuplestore) for INSERT ... INTO ...
queries (with -j 4 -c 10 parameters for pgbench), and there was also an
increase in TPS (about 3-4%).

I attach a patch that adds Tuplestore to v25. What do you think about this idea?
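
For readers unfamiliar with the Tuplestore API, here is a rough sketch
of the buffering pattern the patch switches to; the function and the
flush step are simplified illustrations, not the patch itself:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "executor/tuptable.h"
    #include "utils/tuplestore.h"

    /* Buffer incoming tuples in a tuplestore, then replay them at flush. */
    static void
    buffer_tuples_sketch(TupleTableSlot *slot, TupleTableSlot *readslot)
    {
        /* no random access, not cross-transaction, may spill past work_mem */
        Tuplestorestate *tstore = tuplestore_begin_heap(false, false, work_mem);

        /* append; the slot's contents are stored as a MinimalTuple */
        tuplestore_puttupleslot(tstore, slot);

        /* read pointer 0 still sits at the start, so just read forward */
        while (tuplestore_gettupleslot(tstore, true, false, readslot))
        {
            /* hand readslot to the heap_multi_insert() batch here */
        }

        tuplestore_end(tstore);
    }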

--
Best regards,
Daniil Davydov
From a59cfcbb05bb07c94a4c0ad6531baa5e531629ae Mon Sep 17 00:00:00 2001
From: Daniil Davidov 
Date: Sun, 9 Mar 2025 16:37:44 +0700
Subject: [PATCH] Replace holding tuples in virtual slots with a tuplestore

During performance testing, it was found that in the current
implementation a lot of the program's time is spent calling two functions:
tts_virtual_copyslot and heap_fill_tuple. Calls to these functions are related
to the fact that tuples are stored in virtual_tts, so I propose to replace this
logic with Tuplestore functionality.

Discussion: https://www.postgresql.org/message-id/9F9326B4-8AD9-4858-B1C1-559FC64E6E93%40gmail.com
---
 src/backend/access/heap/heapam.c | 67 +++-
 src/include/access/heapam.h  |  9 -
 2 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index acdce1a4b4..276480213a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2665,7 +2665,6 @@ void
 heap_modify_buffer_insert(TableModifyState *state,
 		  TupleTableSlot *slot)
 {
-	TupleTableSlot *dstslot;
 	HeapInsertState *istate;
 	HeapMultiInsertState *mistate;
 	MemoryContext oldcontext;
@@ -2682,8 +2681,10 @@ heap_modify_buffer_insert(TableModifyState *state,
 		mistate =
 			(HeapMultiInsertState *) palloc(sizeof(HeapMultiInsertState));
 		mistate->slots =
-			(TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
-		mistate->cur_slots = 0;
+			(TupleTableSlot **) palloc0(sizeof(void *) * HEAP_MAX_BUFFERED_SLOTS);
+		mistate->tstore = tuplestore_begin_heap(false, false, work_mem);
+		mistate->nused = 0;
+
 		istate->mistate = mistate;
 
 		/*
@@ -2702,36 +2703,11 @@ heap_modify_buffer_insert(TableModifyState *state,
 	istate = (HeapInsertState *) state->data;
 	Assert(istate->mistate != NULL);
 	mistate = istate->mistate;
-	dstslot = mistate->slots[mistate->cur_slots];
-
-	if (dstslot == NULL)
-	{
-		/*
-		 * We use virtual tuple slots buffered slots for leveraging the
-		 * optimization it provides to minimize physical data copying. The
-		 * virtual slot gets materialized when we copy (via below
-		 * ExecCopySlot) the tuples from the so

Re: Considering fractional paths in Append node

2025-03-09 Thread Alexander Korotkov
On Wed, Mar 5, 2025 at 1:20 PM Alexander Korotkov  wrote:
> On Wed, Mar 5, 2025 at 8:32 AM Andrei Lepikhov  wrote:
> > On 5/3/2025 03:27, Alexander Korotkov wrote:
> > > On Mon, Mar 3, 2025 at 1:04 PM Andrei Lepikhov  wrote:
> > >>> 2. As usage of root->tuple_fraction RelOptInfo it has been criticized,
> > >>> do you think we could limit this to some simple cases?  For instance,
> > >>> check that RelOptInfo is the final result relation for given root.
> > >> I believe that using tuple_fraction is not an issue. Instead, it serves
> > >> as a flag that allows the upper-level optimisation to consider
> > >> additional options. The upper-level optimiser has more variants to
> > >> examine and will select the optimal path based on the knowledge
> > >> available at that level. Therefore, we're not introducing a mistake
> > >> here; we're simply adding extra work in the narrow case. However, having
> > >> only the bottom-up planning process, I don't see how we could avoid this
> > >> additional workload.
> > >
> > > Yes, but if we can assume root->tuple_fraction applies to result of
> > > Append, it's strange we apply the same tuple fraction to all the child
> > > rels.  Latter rels should less likely be used at all and perhaps
> > > should have less tuple_fraction.
> > Of course, it may happen. But I'm not sure it is a common rule.
> > Using LIMIT, we usually select data according to specific clauses.
> > Imagine, we need TOP-100 ranked goods. Appending partitions of goods, we
> > will use the index on the 'rating' column. But who knows how top-rated
> > goods are spread across partitions? Maybe a single partition contains
> > all of them? So, we need to select 100 top-rated goods from each partition.
> > Hence, applying the same limit to each partition seems reasonable, right?
>
> Ok, I didn't notice add_paths_to_append_rel() is used for MergeAppend
> as well.  I thought again about regular Append.  If we can get the
> required number of rows from the first few child relations, the error
> in the tuple fraction shouldn't influence plans much, and the other
> child relations wouldn't be used at all.  But if we can't, we are
> unlikely to get a selectivity prediction accurate enough to predict
> which exact child relations will be used.  So, using the root tuple
> fraction for every child relation would be safe.  So, this approach
> makes sense to me.

I've slightly revised the commit message and comments.  I'm going to
push this if no objections.

--
Regards,
Alexander Korotkov
Supabase
From c9c0efa98b74d76912845339cb5b70dba3c8cb17 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Thu, 5 Dec 2024 15:15:49 +0700
Subject: [PATCH v3] Teach Append to consider tuple_fraction when accumulating
 subpaths.

This change enables more active use of IndexScan and parameterized NestLoop
paths in partitioned cases under an Append node, as already works with plain
tables.  As the newly added regression tests demonstrate, it should make the
partitionwise technique smarter.

With an indication of how many tuples are needed, it may be more meaningful
to use the 'fractional branch' subpaths of the Append path list, which are
more optimal for this specific number of tuples.  Planning on a higher level,
if the optimizer needs all the tuples, it will choose non-fractional paths.
In the case when, during execution, Append needs to return fewer tuples than
declared by tuple_fraction, it would not be harmful to use the 'intermediate'
variant of paths.  However, it will earn a considerable profit if a sensible
set of tuples is selected.

The change of the existing regression test demonstrates the positive outcome
of this feature: instead of scanning the whole table, the optimizer prefers
to use a parameterized scan, being aware of the only single tuple the join
has to produce to perform the query.

Discussion: https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com
Author: Nikita Malakhov 
Author: Andrei Lepikhov 
Reviewed-by: Andy Fan 
Reviewed-by: Alexander Korotkov 
---
 src/backend/optimizer/path/allpaths.c|  18 ++-
 src/backend/optimizer/plan/planner.c |   8 ++
 src/test/regress/expected/partition_join.out | 116 +++
 src/test/regress/expected/union.out  |  15 ++-
 src/test/regress/sql/partition_join.sql  |  21 
 5 files changed, 168 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b5bc9b602e2..9e7f01a9684 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1371,9 +1371,23 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		 */
 		if (rel->consider_startup && childrel->cheapest_startup_path != NULL)
 		{
+			Path	   *cheapest_path;
+
+			/*
+			 * With an indication of how many tuples the query should provide,
+			 * the optimizer tries to choose the path optimal f

Re: [Doc] Improve hostssl related descriptions and option presentation

2025-03-09 Thread vignesh C
On Fri, 28 Feb 2025 at 00:08, David G. Johnston  wrote:
>
> On Mon, Apr 22, 2024 at 2:20 PM David G. Johnston  wrote:
>>
>> Thoughts anyone?
>>
>> On Thu, Feb 1, 2024 at 3:47 PM David G. Johnston  wrote:
>>>
>>> Motivated by a recent complaint [1] I found the hostssl related material in 
>>> our docs quite verbose and even repetitive.  Some of that is normal since 
>>> we have both an overview/walk-through section as well as a reference 
>>> section.  But the overview in particular was self-repetitive.  Here is a 
>>> first pass at some improvements.  The commit message contains both the 
>>> intent of the patch and some thoughts/questions still being considered.
>>>
>>> David J.
>>>
>>> [1] 
>>> https://www.postgresql.org/message-id/170672433664.663.11895343120533141715%40wrigleys.postgresql.org
>
>
> Withdrawn as this now has conflicts from recent changes and no one seems 
> interested in voicing an opinion on this for or against.

I've updated the commitfest entry status to "withdrawn." Feel free to
add it again anytime if you change your mind.

Regards,
Vignesh




Re: general purpose array_sort

2025-03-09 Thread jian he
hi.

patch rebased, also did some minor comments tweak.
From c9398dfe889f23dce147db1719aa9fe4dfaa3adc Mon Sep 17 00:00:00 2001
From: jian he 
Date: Sun, 9 Mar 2025 20:45:20 +0800
Subject: [PATCH v16 1/2] general purpose array_sort

Introduce the SQL-callable function array_sort(anyarray). The parameter passed
to this function cannot truly be a polymorphic data type. Instead, it accepts
any array type that supports the "less than" (`<`) operator.

If the input is a multidimensional array, array_sort sorts based on the
first dimension. By default, sorting is performed according to the
argument's collation, but a COLLATE clause can be specified if needed.
NULL is treated specially: nulls appear after non-null values.

For example:
SELECT array_sort('{foo,bar,CCC,Abc,bbc}'::text[] COLLATE "C");
will sort based on the "C" collation.

Author: Junwang Zhao 
Co-authored-by: Jian He 

Reviewed-by:
Michael Paquier 
Aleksander Alekseev ,
Tom Lane ,
David G. Johnston ,
Amit Langote ,
andr...@proxel.se ,
Robert Haas ,
Dean Rasheed 

discussion: https://postgr.es/m/CAEG8a3J41a4dpw_-F94fF-JPRXYxw-GfsgoGotKcjs9LVfEEvw%40mail.gmail.com
---
 doc/src/sgml/func.sgml|  20 +++
 src/backend/utils/adt/array_userfuncs.c   | 143 ++
 src/backend/utils/adt/arrayfuncs.c|   3 +-
 src/include/catalog/pg_proc.dat   |   3 +
 src/include/utils/array.h |   1 +
 src/test/regress/expected/arrays.out  |  90 +++
 .../regress/expected/collate.icu.utf8.out |  13 ++
 src/test/regress/sql/arrays.sql   |  25 +++
 src/test/regress/sql/collate.icu.utf8.sql |   4 +
 src/tools/pgindent/typedefs.list  |   1 +
 10 files changed, 302 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d6061a8458..e24ef42ad98 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20669,6 +20669,26 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sort
+
+array_sort ( anyarray )
+anyarray
+   
+   
+Sorts the first dimension of the array.
+The sort order is determined by the < operator of the element type,
+nulls will appear after non-null values.
+The collation to use can be forced by adding a COLLATE clause to any of the arguments.
+   
+   
+array_sort(ARRAY[[2,4],[2,1],[6,5]])
+{{2,1},{2,4},{6,5}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index 2aae2f8ed93..583e56fc805 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -15,6 +15,7 @@
 #include "catalog/pg_type.h"
 #include "common/int.h"
 #include "common/pg_prng.h"
+#include "miscadmin.h"
 #include "libpq/pqformat.h"
 #include "nodes/supportnodes.h"
 #include "port/pg_bitutils.h"
@@ -22,6 +23,7 @@
 #include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
 #include "utils/typcache.h"
 
 /*
@@ -43,6 +45,18 @@ typedef struct DeserialIOData
 	Oid			typioparam;
 } DeserialIOData;
 
+/*
+ * ArraySortCachedInfo
+ *		Used for caching data in array_sort
+ */
+typedef struct ArraySortCachedInfo
+{
+	TypeCacheEntry *typentry;		/* type cache entry for element type */
+	TypeCacheEntry *array_typentry; /* type cache entry for array type */
+	ArrayMetaState array_meta;		/* array metadata for better
+	 * array_create_iterator performance */
+} ArraySortCachedInfo;
+
 static Datum array_position_common(FunctionCallInfo fcinfo);
 
 
@@ -1858,3 +1872,132 @@ array_reverse(PG_FUNCTION_ARGS)
 
 	PG_RETURN_ARRAYTYPE_P(result);
 }
+
+/*
+ * array_sort
+ *
+ * Sorts the first dimension of the array.
+ * The sort order is determined by the "<" operator of the element type.
+ */
+Datum
+array_sort(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+	Oid			elmtyp;
+	Oid			array_type = InvalidOid;
+	Oid			collation = PG_GET_COLLATION();
+	ArraySortCachedInfo *cache_info;
+	TypeCacheEntry *typentry;
+	Tuplesortstate *tuplesortstate;
+	ArrayIterator array_iterator;
+	Datum		value;
+	bool		isnull;
+	ArrayBuildStateAny *astate = NULL;
+	int			ndim,
+			   *dims,
+			   *lbs;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	cache_info = (ArraySortCachedInfo *) fcinfo->flinfo->fn_extra;
+	if (cache_info == NULL)
+	{
+		cache_info = (ArraySortCachedInfo *) MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
+sizeof(ArraySortCachedInfo));
+		cache_info->typentry = NULL;
+		cache_info->array_typentry = NULL;
+		fcinfo->flinfo->fn_extra = (void *) cache_info;
+	}
+
+	if (ndim == 1)
+	{
+		/* Finds the ordering operator for the type for 1-D arrays */
+		typentry = cache_info->typentry;
+		if (typentry == NULL 

Re: Printing window function OVER clauses in EXPLAIN

2025-03-09 Thread Álvaro Herrera
Hello

Would it be possible and make sense to use notation of explicit WINDOW
clauses, for cases where multiple window functions invoke identical
window definitions?  I'm thinking of something like

explain verbose SELECT
empno,
depname,
row_number() OVER testwin rn,
rank() OVER testwin rnk,
count(*) OVER testwin cnt
FROM empsalary
window testwin as
  (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN
   UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

for which, with the patch, we'd get this

                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────
 WindowAgg  (cost=74.64..101.29 rows=1070 width=68)
   Output: empno, depname, row_number() OVER (PARTITION BY depname ORDER BY 
enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), rank() 
OVER (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING), count(*) OVER (PARTITION BY depname ORDER 
BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 
enroll_date
   ->  Sort  (cost=74.54..77.21 rows=1070 width=44)
 Output: depname, enroll_date, empno
 Sort Key: empsalary.depname, empsalary.enroll_date
 ->  Seq Scan on pg_temp.empsalary  (cost=0.00..20.70 rows=1070 
width=44)
   Output: depname, enroll_date, empno
(7 rows)

which is pretty ugly to read and requires careful tracking to verify
that they're all defined on the same window.  Previously, we just get

                          QUERY PLAN
──────────────────────────────────────────────────────────
 WindowAgg  (cost=74.64..101.29 rows=1070 width=68)
   Output: empno, depname, row_number() OVER (?), rank() OVER (?), count(*) 
OVER (?), enroll_date
   ->  Sort  (cost=74.54..77.21 rows=1070 width=44)
 Output: depname, enroll_date, empno
 Sort Key: empsalary.depname, empsalary.enroll_date
 ->  Seq Scan on pg_temp.empsalary  (cost=0.00..20.70 rows=1070 
width=44)
   Output: depname, enroll_date, empno
(7 rows)

so it didn't matter.

I'd imagine something like

                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────
 Window testwin AS (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
 WindowAgg  (cost=74.64..101.29 rows=1070 width=68)
   Output: empno, depname, row_number() OVER testwin, rank() OVER testwin, 
count(*) OVER testwin, enroll_date
   ->  Sort  (cost=74.54..77.21 rows=1070 width=44)
 Output: depname, enroll_date, empno
 Sort Key: empsalary.depname, empsalary.enroll_date
 ->  Seq Scan on pg_temp.empsalary  (cost=0.00..20.70 rows=1070 
width=44)
   Output: depname, enroll_date, empno
(7 rows)


I imagine this working even if the user doesn't explicitly use a WINDOW
clause, if only because it makes the explain easier to read, and it's
much clearer if the user specifies two different window definitions.
So with David Johnston's example, something like

 Window window1 AS (PARTITION BY depname ORDER BY enroll_date ROWS UNBOUNDED 
PRECEDING)
 Window window2 AS (PARTITION BY depname ORDER BY enroll_date RANGE BETWEEN 
CURRENT ROW AND CURRENT ROW)
 WindowAgg
   Output: empno, depname, (row_number() OVER window1), rank() OVER window1, 
count(*) OVER window2, enroll_date
   ->  WindowAgg
 Output: depname, enroll_date, empno, row_number() OVER window1, rank() 
OVER window1
 ->  Sort
   Output: depname, enroll_date, empno
   Sort Key: empsalary.depname, empsalary.enroll_date
   ->  Seq Scan on pg_temp.empsalary
 Output: depname, enroll_date, empno

(Hmm, not sure if the Window clauses would be top-level or attached to
each WindowAgg in its own level.)

Re: strange valgrind reports about wrapper_handler on 64-bit arm

2025-03-09 Thread Tomas Vondra
On 3/9/25 03:16, Nathan Bossart wrote:
> On Sat, Mar 08, 2025 at 11:48:22PM +0100, Tomas Vondra wrote:
>> Shortly after restarting this I got three more reports - all of them are
>> related to strcoll_l. This is on c472a18296e4, i.e. with the asserts
>> added in this thread etc. But none of those seem to fail.
> 
>> ==189168==    at 0xA683CC: wrapper_handler (pqsignal.c:90)
>> ==189168==    at 0xA683F0: wrapper_handler (pqsignal.c:91)
>> ==189168==    at 0xA684D4: wrapper_handler (pqsignal.c:110)
> 
> This appears to refer to the following lines:
> 
>   Assert(postgres_signal_arg > 0);
>   Assert(postgres_signal_arg < PG_NSIG);
>   (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
> 
> The common ingredient seems to be postgres_signal_arg.  I haven't found
> anything else that seems helpful.
> 

Yeah, it's a bit weird. I got another report on the 64-bit rpi5 machine
(except for the command it's exactly the same), and then also this
report on the 32-bit machine:

---
==3482== Use of uninitialised value of size 4
==3482==    at 0x991498: wrapper_handler (pqsignal.c:113)
==3482==    by 0x: ???
==3482==  Uninitialised value was created by a heap allocation
==3482==    at 0x940EB8: palloc (mcxt.c:1341)
==3482==    by 0x980037: initStringInfoInternal (stringinfo.c:45)
==3482==    by 0x980103: initStringInfo (stringinfo.c:99)
==3482==    by 0x7194B7: ReadArrayStr (arrayfuncs.c:613)
==3482==    by 0x718803: array_in (arrayfuncs.c:289)
==3482==    by 0x90024F: InputFunctionCallSafe (fmgr.c:1607)
==3482==    by 0x2E346B: CopyFromTextLikeOneRow (copyfromparse.c:1029)
==3482==    by 0x2E304B: CopyFromTextOneRow (copyfromparse.c:919)
==3482==    by 0x2E2EEF: NextCopyFrom (copyfromparse.c:890)
==3482==    by 0x2DF5C7: CopyFrom (copyfrom.c:1149)
==3482==    by 0x2DAE33: DoCopy (copy.c:306)
==3482==    by 0x6CC38F: standard_ProcessUtility (utility.c:738)
==3482==    by 0x4B654C3: pgss_ProcessUtility (pg_stat_statements.c:1179)
==3482==    by 0x6CBBF3: ProcessUtility (utility.c:519)
==3482==    by 0x6CA24B: PortalRunUtility (pquery.c:1184)
==3482==    by 0x6CA537: PortalRunMulti (pquery.c:1348)
==3482==    by 0x6C9837: PortalRun (pquery.c:819)
==3482==    by 0x6C1BEB: exec_simple_query (postgres.c:1272)
==3482==    by 0x6C74FF: PostgresMain (postgres.c:4693)
==3482==    by 0x6BD297: BackendMain (backend_startup.c:107)
==3482==
{
   
   Memcheck:Value4
   fun:wrapper_handler
   obj:*
}
**3482** Valgrind detected 1 error(s) during execution of "COPY
array_op_test FROM
'/home/debian/postgres/src/test/regress/data/array.data';"
---

This all seems very strange ... I'm starting to wonder if maybe this is
a valgrind issue. Both machines have valgrind 3.19, I'll try with a
custom build of 3.24 (the latest release).


regards

-- 
Tomas Vondra





Re: BackgroundPsql swallowing errors on windows

2025-03-09 Thread Noah Misch
On Sun, Mar 09, 2025 at 12:47:34PM -0400, Andres Freund wrote:
> On 2025-02-16 17:52:36 -0800, Noah Misch wrote:
> > On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote:
> > > On February 16, 2025 7:50:18 PM EST, Tom Lane  wrote:
> > > >Noah Misch  writes:
> > > >> On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote:
> > > >>> I think that
> > > >>> IPC::Run may be screwing up here, because I have seen non-Windows
> > > >>> CI failures that look like it didn't read all the stderr output.
> > > >>> For example, this pgbench test failure on macOS from [1]:
> > > >
> > > >> https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14
> > > >> fixed that one.
> > > >
> > > >Ah.  Do we know whether that fix has made it into our CI images?
> > > >(Or anywhere else, for that matter?)
> > > 
> > > The CI images are regenerated three times a week, but for most OSs, they 
> > > will only install perl modules via the applicable packaging method, so 
> > > it'll depend on when they pick up that version.
> > > 
> > > On Windows cpan is used, so it should pick that new version fairly 
> > > quickly if a release has been made.
> > > 
> > > On macos we can't currently use images, so we just cache all the installed
> > > macports packages. The cache is keyed by OS version and list of packages
> > > to be installed, with no other forced invalidation right now. So it's hard
> > > to predict when a new version of a package will be picked up and it will
> > > differ between git repositories.  I've been wondering whether the cached
> > > macports install should just be regularly generated instead, along the
> > > other ci images.
> > 
> > The change is not in a release yet.  We could have macos install IPC::Run 
> > from
> > github, or I could get a release cut so it can make its way to macports.
> 
> It'd be great if we could get a release.

Yep.  I put the tree in the necessary state, and I contacted the person on
2025-02-17 and again on 2025-03-04.  I'm scheduled to follow up again on
2025-03-11.

> I guess I can figure out the magic
> incantations to install it from git for CI

There's no need to build anything, so it suffices to do:

git clone https://github.com/cpan-authors/IPC-Run.git
export PERL5LIB=$PWD/IPC-Run/lib
# (or append, if you already have non-empty PERL5LIB)




Re: what's going on with lapwing?

2025-03-09 Thread Robert Haas
On Thu, Mar 6, 2025 at 12:22 PM Tom Lane  wrote:
> I don't think that's the way to think about old buildfarm members.
> Sure, nobody is very likely to be putting PG 18 on a Debian 7 box,
> but the odds are much higher that they might have PG 13 on it and
> wish to update to 13.latest.  So what you need to compare OS EOL
> dates to is not current development but our oldest supported branch.

But the work it's creating is mostly because it's still testing
master. If it were only testing a gradually-decreasing set of older
branches, that wouldn't seem weird to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Incorrect assert in libpqwalreceiver

2025-03-09 Thread Jacob Brazeal
Hello hackers,

The libpqrcv_connect function asserts 'Assert(i < sizeof(keys))', where
keys is declared as const char *keys[6];.

However, sizeof(keys) is not the correct way to check the array length (on
my system, for example, it's 48 = 6 * 8 at this callsite, not 6.)

I attached a patch to fix the assert, but I suppose we could also just
remove the assert altogether, since it hasn't been doing anything for at
least 8 years.

Regards,
Jacob


v1-length-assert.patch
Description: Binary data


Re: Commit fest 2025-03

2025-03-09 Thread vignesh C
On Thu, 6 Mar 2025 at 21:24, Tom Lane  wrote:
>
> vignesh C  writes:
> > On Wed, 5 Mar 2025 at 16:50, Jim Jones  wrote:
> >> Is there something wrong with the commitfest app? This patch applies
> >> cleanly and passes all tests
>
> > I verified that it applies neatly and passes the tests for me too, I
> > have reported this issue at [1]. I don't know the reason for this.
>
> What did you apply with?  "git am" is known to be far pickier than
> "patch".  (I thought the cfbot was using "patch", but I might be
> wrong.  There are many versions of "patch", too.)

In my local environment I used "git am" and it worked fine. cfbot has
been using "git apply --3way" as per [1]; they are planning to change it
to "git am" to fix this issue.
[1] - 
https://www.postgresql.org/message-id/CAGECzQSE7igwsTUk17i%2BRNEUL1zR-Zkp-1Zs7KkhmBxAu_Fksw%40mail.gmail.com

Regards,
Vignesh




Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2025-03-09 Thread Michail Nikolaev
Hello, Matthias!

>  though I suspect the SP-GIST tests to have
> bugs, as an intermediate version of my 0003 patch didn't trigger the
> tests to fail

It all fails on master - could you please clarify what "intermediate"
means in that case? Also, I think it is a good idea to add the same type
of test for btree.

>  * XXX: In the future we should probably reorder these operations so
>  * we can apply the checks in block order, rather than index order.

I think it is already done in your patch, no?

Should we then use that mechanism for btree as well? It seems to be
straightforward and non-invasive. In that case, "Unchecked" goes away, and
it is each AM's responsibility to call the check while holding the pin.

Best regards,
Mikhail.


Re: Commitfest app release on Feb 17 with many improvements

2025-03-09 Thread Jelte Fennema-Nio
On Sun, 9 Mar 2025 at 03:21, vignesh C  wrote:
> Couple of suggestions: a) No need to show CI status as "Needs rebase,"
> "Not processed," etc., for committed patches.

Do you mean specifically for committed ones? Or just for any patch
with a "closed" status.

> b) Can we add a filter
> for "Needs rebase!"? This would help the CommitFest manager easily
> list patches that need updating.

That should be pretty easy to implement. But is that really what we
want? In the next release, sorting by "failing since" is implemented.
It sounds like that could be enough instead. I.e., do we really only
want to call out patches that need a rebase? Or also ones that have
been failing in CI for a long time?

I'm even wondering if this whole flow still makes sense. Do we really
want to send an email to the mailing list about this? And if so, why
is someone doing that manually? If people subscribe to updates for
patches that they authored, then they get these "needs rebase"
automatically. Should we maybe simply default that option to true? And
for instance send a notification automatically to all people with a
"needs rebase" CI status whenever we start a new commitfest.




Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2025-03-09 Thread Oliver Ford
On Sun, Mar 9, 2025 at 6:40 AM Tatsuo Ishii  wrote:

> > Attached version removes the non-nulls array. That seems to speed
> > everything up.  Running the above query with 1 million rows averages
> > 450ms, similar when using lead/lag.
>
> Great. However, CFbot complains about the patch:
>
> https://cirrus-ci.com/task/6364194477441024
>
>
The attached version fixes the headerscheck failure locally.


0010-ignore-nulls.patch
Description: Binary data


Re: Log connection establishment timings

2025-03-09 Thread Jacob Champion
On Thu, Mar 6, 2025 at 3:16 PM Melanie Plageman  wrote:
>
> Jacob at some point had asked about this, and I didn't
> have a satisfactory answer. I'm not really sure what would be more
> useful to end users.

For the record, I'm not really sure either. I don't have a strong
opinion either way.

--Jacob




Re: maintenance_work_mem = 64kB doesn't work for vacuum

2025-03-09 Thread David Rowley
On Mon, 10 Mar 2025 at 07:46, Masahiko Sawada  wrote:
> A simple fix is to bump the minimum maintenance_work_mem to 256kB. We
> would break the compatibility for backbranch (i.e. v17) but I guess
> it's unlikely that existing v17 users are using less than 1MB
> maintenance_work_mem (the release note doesn't mention the fact that
> we lowered the minimum value).

Could you do something similar to what's in hash_agg_check_limits(),
where we check that we've got at least 1 item before bailing once we've
used up all the prescribed memory?  That seems like a safer coding
practice, since if the minimum memory usage for a DSM segment ever goes
above 256kB in the future, the bug comes back again.
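
A rough sketch of that guard, modeled on hash_agg_check_limits() in
nodeAgg.c; the vacuum-side names here are illustrative, not actual
fields:

    /*
     * Don't trigger an index-vacuum pass unless at least one dead item
     * has been stored, so we are guaranteed to make progress even when
     * the allocator's minimum footprint (e.g. the 256kB DSA segment
     * minimum) already exceeds a tiny maintenance_work_mem.
     */
    if (ndead_items > 0 &&
        dead_items_mem_usage > dead_items_mem_limit)
        perform_index_vacuum_pass();    /* then reset and keep scanning */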

David




Re: Printing window function OVER clauses in EXPLAIN

2025-03-09 Thread Tom Lane
I wrote:
> I'll go try to code this up.

OK, here's v2 done like that.  I do like this output better.
I backed off the idea of putting the WindowClause as such
into the plan, partly because I didn't feel like debugging
the setrefs.c problem that David discovered upthread.
This way does require a bit more code, but I think it's less
likely to have bugs.

A couple of notes:

* I made the Window: plan annotation come out unconditionally.
We could alternatively print it only in VERBOSE mode, which would
greatly reduce the number of regression test diffs.  However, it seems
fairly comparable to the sort keys of a Sort node or the group keys of
a Group node, which we print unconditionally.  Also, there are cases
where a higher-level node unconditionally prints a reference to a
window function output, which would mean that that output's reference
to "windowN" would have no referent in the displayed data.

* In passing, I editorialized on the order in which the Run Condition
annotation comes out:

 case T_WindowAgg:
+show_window_def(castNode(WindowAggState, planstate), ancestors, 
es);
 show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
+"Run Condition", planstate, ancestors, es);
 if (plan->qual)
 show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
-show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
-"Run Condition", planstate, ancestors, es);
 show_windowagg_info(castNode(WindowAggState, planstate), es);
 break;

It seemed quite weird to me to have the Run Condition plan property
come out in the middle of properties that only appear in EXPLAIN
ANALYZE mode.  Maybe there's a reason for this other than "people
added new properties at the end", but I don't see it.

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c68119030ab..29e8d7b806f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3968,10 +3968,11 @@ select c2, sum(c2), count(c2) over (partition by c2%2) from ft2 where c2 < 10 gr
  QUERY PLAN 
 
  Sort
-   Output: c2, (sum(c2)), (count(c2) OVER (?)), ((c2 % 2))
+   Output: c2, (sum(c2)), (count(c2) OVER window1), ((c2 % 2))
Sort Key: ft2.c2
->  WindowAgg
- Output: c2, (sum(c2)), count(c2) OVER (?), ((c2 % 2))
+ Output: c2, (sum(c2)), count(c2) OVER window1, ((c2 % 2))
+ Window: window1 AS (PARTITION BY ((ft2.c2 % 2)))
  ->  Sort
Output: c2, ((c2 % 2)), (sum(c2))
Sort Key: ((ft2.c2 % 2))
@@ -3979,7 +3980,7 @@ select c2, sum(c2), count(c2) over (partition by c2%2) from ft2 where c2 < 10 gr
  Output: c2, ((c2 % 2)), (sum(c2))
  Relations: Aggregate on (public.ft2)
  Remote SQL: SELECT c2, (c2 % 2), sum(c2) FROM "S 1"."T 1" WHERE ((c2 < 10)) GROUP BY 1
-(12 rows)
+(13 rows)
 
 select c2, sum(c2), count(c2) over (partition by c2%2) from ft2 where c2 < 10 group by c2 order by 1;
  c2 | sum | count 
@@ -4001,10 +4002,11 @@ select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 wher
 QUERY PLAN 
 ---
  Sort
-   Output: c2, (array_agg(c2) OVER (?)), ((c2 % 2))
+   Output: c2, (array_agg(c2) OVER window1), ((c2 % 2))
Sort Key: ft1.c2
->  WindowAgg
- Output: c2, array_agg(c2) OVER (?), ((c2 % 2))
+ Output: c2, array_agg(c2) OVER window1, ((c2 % 2))
+ Window: window1 AS (PARTITION BY ((ft1.c2 % 2)) ORDER BY ft1.c2)
  ->  Sort
Output: c2, ((c2 % 2))
Sort Key: ((ft1.c2 % 2)), ft1.c2 DESC
@@ -4012,7 +4014,7 @@ select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 wher
  Output: c2, ((c2 % 2))
  Relations: Aggregate on (public.ft1)
  Remote SQL: SELECT c2, (c2 % 2) FROM "S 1"."T 1" WHERE ((c2 < 10)) GROUP BY 1
-(12 rows)
+(13 rows)
 
 select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 where c2 < 10 group by c2 order by 1;
  c2 |  array_agg  
@@ -4031,13 +4033,14 @@ select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 wher
 
 explain (verbose, costs off)
 select c2, array_agg(c2) over (partition by c2%2 order by c2 range betwe

Re: Parallel heap vacuum

2025-03-09 Thread Peter Smith
Hi Sawada-San.

Here are some review comments for patch v11-0002

==
Commit message.

1.
Heap table AM disables the parallel heap vacuuming for now, but an
upcoming patch uses it.

This function implementation was moved into patch 0001, so probably
this part of the commit message comment also belongs now in patch
0001.

==
src/backend/commands/vacuumparallel.c

2.
+ * For processing indexes in parallel, individual indexes are processed by one
+ * vacuum process. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit.
+ *

"are processed by one vacuum process." -- Did you mean "are processed
by separate vacuum processes." ?

~~~

3.
+ *
+ * Each time we process table or indexes in parallel, the parallel context is
+ * re-initialized so that the same DSM can be used for multiple passes of
+ * table vacuum or index bulk-deletion and index cleanup.

Maybe I am mistaken, but it seems like the logic is almost always
re-initializing this. I wonder if it might be simpler to just remove
the 'need_reinitialize_dsm' field and initialize unconditionally.

~~~

4.
+ * nrequested_workers is >= 0 number and the requested parallel degree. 0
+ * means that the parallel degrees for table and indexes vacuum are decided
+ * differently. See the comments of parallel_vacuum_compute_workers() for
+ * details.
+ *
  * On success, return parallel vacuum state.  Otherwise return NULL.
  */

SUGGESTION
nrequested_workers is the requested parallel degree (>=0). 0 means that...

~~~

5.
 static int
-parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested,
- bool *will_parallel_vacuum)
+parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
+ int nrequested, int *nworkers_table_p,
+ bool *idx_will_parallel_vacuum)

IIUC the returns of this function seem inconsistent. AFAIK, it
previously returned the number of workers for parallel index
vacuuming. But now (after this patch) the return value is
Max(nworkers_table, nworkers_index). Meanwhile, the number of workers
for parallel table vacuuming is returned as a by-reference parameter
'nworkers_table_p'. In other words, it is returning the number of
workers in 2 different ways.

Why not make this a void function, but introduce another parameter
'nworkers_index_p', similar to 'nworkers_table_p'?
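
I.e., a sketch of the suggested shape (illustrative only, not the patch):

    static void
    parallel_vacuum_compute_workers(Relation rel, Relation *indrels,
                                    int nindexes, int nrequested,
                                    int *nworkers_table_p,
                                    int *nworkers_index_p,
                                    bool *idx_will_parallel_vacuum)
    {
        int     nworkers_table = 0;
        int     nworkers_index = 0;

        /* ... compute both values as the current code does ... */

        *nworkers_table_p = nworkers_table;
        *nworkers_index_p = nworkers_index;
    }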

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Parallel heap vacuum

2025-03-09 Thread Peter Smith
Hi Sawada-San,

Here are some review comments for patch v11-0001.

==
src/backend/access/heap/vacuumlazy.c

1.
+/*
+ * Compute the number of workers for parallel heap vacuum.
+ *
+ * Return 0 to disable parallel vacuum so far.
+ */
+int
+heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested)

You don't need to say "so far".

==
src/backend/access/table/tableamapi.c

2.
  Assert(routine->relation_vacuum != NULL);
+ Assert(routine->parallel_vacuum_compute_workers != NULL);
  Assert(routine->scan_analyze_next_block != NULL);

Is it better to keep these Asserts in the same order that the
TableAmRoutine fields are assigned (in heapam_handler.c)?

~~~

3.
+ /*
+ * Callbacks for parallel vacuum are also optional (except for
+ * parallel_vacuum_compute_workers). But one callback implies presence of
+ * the others.
+ */
+ Assert((((routine->parallel_vacuum_estimate == NULL) ==
+   (routine->parallel_vacuum_initialize == NULL)) ==
+ (routine->parallel_vacuum_initialize_worker == NULL)) ==
+ (routine->parallel_vacuum_collect_dead_items == NULL));

/also optional/optional/

==
src/include/access/heapam.h

+extern int heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested);

4.
wrong tab/space after 'int'.

==
src/include/access/tableam.h

5.
+ /*
+ * Compute the number of parallel workers for parallel table vacuum. The
+ * parallel degree for parallel vacuum is further limited by
+ * max_parallel_maintenance_workers. The function must return 0 to disable
+ * parallel table vacuum.
+ *
+ * 'nworkers_requested' is a >=0 number and the requested number of
+ * workers. This comes from the PARALLEL option. 0 means to choose the
+ * parallel degree based on the table AM specific factors such as table
+ * size.
+ */
+ int (*parallel_vacuum_compute_workers) (Relation rel,
+ int nworkers_requested);

The comment here says "This comes from the PARALLEL option." and "0
means to choose the parallel degree...". But, the PG docs [1] says "To
disable this feature, one can use PARALLEL option and specify parallel
workers as zero.".

These two descriptions "disable this feature" (PG docs) and letting
the system "choose the parallel degree" (code comment) don't sound the
same. Should this 0001 patch update the PG documentation for the
effect of setting PARALLEL value zero?

~~~

6.
+ /*
+ * Initialize DSM space for parallel table vacuum.
+ *
+ * Not called if parallel table vacuum is disabled.
+ *
+ * Optional callback, but either all other parallel vacuum callbacks need
+ * to exist, or neither.
+ */

"or neither"?

Also, saying "all other" seems incorrect because
parallel_vacuum_compute_workers callback must exist event if
parallel_vacuum_initialize does not exist.

IMO you meant to say "all optional", and "or none".

SUGGESTION:
Optional callback. Either all optional parallel vacuum callbacks need
to exist, or none.

(this same issue is repeated in multiple places).

==
[1] https://www.postgresql.org/docs/devel/sql-vacuum.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Parallel CREATE INDEX for GIN indexes

2025-03-09 Thread Peter Geoghegan
On Sun, Mar 9, 2025 at 6:23 PM Tom Lane  wrote:
> Ah.  Most likely somebody dismissed it years ago.  Given that
> precedent, I'm content to dismiss this one too.

It is dead code, unless somebody decides to #define
DISABLE_LEADER_PARTICIPATION to debug a problem.

-- 
Peter Geoghegan




Re: Parallel CREATE INDEX for GIN indexes

2025-03-09 Thread Tom Lane
Tomas Vondra  writes:
> On 3/9/25 17:38, Tom Lane wrote:
>> Coverity is a little unhappy about this business in
>> _gin_begin_parallel:

> I don't mind doing it differently, but this code is just a copy from
> _bt_begin_parallel. So how come coverity does not complain about that?
> Or is that whitelisted?

Ah.  Most likely somebody dismissed it years ago.  Given that
precedent, I'm content to dismiss this one too.

regards, tom lane




Re: what's going on with lapwing?

2025-03-09 Thread Robert Haas
On Thu, Mar 6, 2025 at 4:27 PM Tom Lane  wrote:
> I think you misunderstood my drift.  I'm okay with setting a project
> policy that we won't support OSes that are more than N years EOL,
> as long as it's phrased to account for older PG branches properly.

Yep, I misunderstood. That sounds awesome. Let's see if others agree.

> My point was that we can implement such a policy in a laissez-faire
> way: if an older BF animal isn't causing us trouble then why mess
> with it?  Once we *do* recognize that it's causing us trouble,
> we can apply the still-hypothetical policy and ask the owner to
> turn it off for branches where it's out of support.

Fair enough. This does have the disadvantage that people will still
commit things that turn the buildfarm red and have to go into panic
mode -- perhaps not even realizing which machines are running an older
OS -- and then reactively do this. However, it still sounds like
progress over where we are today, because it would (hopefully) remove
the need for an argument about each individual case.

Honestly, I'm kind of bummed by how fast operating systems and OS
versions die these days. It seems crazy to me that I probably couldn't
reasonably build a PG from today on the laptop I started at EDB with,
or a PG from then on my current laptop. But it doesn't seem like
there's much point in fighting the tide. We shouldn't be the only
people trying to keep stuff working long after the sell-by date.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Parallel CREATE INDEX for GIN indexes

2025-03-09 Thread Tomas Vondra



On 3/9/25 17:38, Tom Lane wrote:
> Tomas Vondra  writes:
>> I pushed the two smaller parts today.
> 
> Coverity is a little unhappy about this business in
> _gin_begin_parallel:
> 
>   boolleaderparticipates = true;
>   ...
> #ifdef DISABLE_LEADER_PARTICIPATION
>   leaderparticipates = false;
> #endif
>   ...
>   scantuplesortstates = leaderparticipates ? request + 1 : request;
> 
> It says
> 
 CID 1644203:  Possible Control flow issues  (DEADCODE)
 Execution cannot reach the expression "request" inside this statement: 
 "scantuplesortstates = (lead...".
> 924   scantuplesortstates = leaderparticipates ? request + 1 : 
> request;
> 
> If this were just temporary code I'd let it pass, but I see nothing
> replacing this logic in the follow-up patches, so I think we ought
> to do something to shut it up.
> 
> It's not complaining about the later bits like
> 
>   if (leaderparticipates)
>   ginleader->nparticipanttuplesorts++;
> 
> (perhaps because there's no dead code there?)  So one idea is
> 
>   scantuplesortstates = request;
>   if (leaderparticipates)
>   scantuplesortstates++;
> 
> which would look more like the other code anyway.
> 

I don't mind doing it differently, but this code is just a copy from
_bt_begin_parallel. So how come coverity does not complain about that?
Or is that whitelisted?


thanks

-- 
Tomas Vondra





Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

2025-03-09 Thread Alexander Korotkov
On Sun, Mar 9, 2025 at 4:53 AM Alexander Korotkov  wrote:
> On Sat, Mar 8, 2025 at 12:49 PM Alexander Korotkov  wrote:
> > On Fri, Mar 7, 2025 at 8:20 PM Álvaro Herrera  wrote:
> > >
> > > On 2024-Mar-25, Alexander Korotkov wrote:
> > >
> > > > reindexdb: Add the index-level REINDEX with multiple jobs
> > > >
> > > > Straight-forward index-level REINDEX is not supported with multiple 
> > > > jobs as
> > > > we cannot control the concurrent processing of multiple indexes 
> > > > depending on
> > > > the same relation.  Instead, we dedicate the whole table to certain 
> > > > reindex
> > > > job.  Thus, if indexes in the lists belong to different tables, that 
> > > > gives us
> > > > a fair level of parallelism.
> > >
> > > I tested this, because of a refactoring suggestion [1] and I find that
> > > it's rather completely broken.
> >
> > The code was written with the assumption that running
> > run_reindex_command() with async == true can schedule a number of
> > queries for a connection.  But actually that's not true and everything
> > is broken.
>
> The draft patch for revert is attached.  Could you, please, check.

On second thought, it's not so hard to fix.  The attached patch does
it by putting REINDEX commands related to the same table into a single
SQL statement.  Possibly, that could be better than a revert, given we
need to handle compatibility.  What do you think?
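
A minimal sketch of the batching, with hypothetical index names (the
patch itself builds the command list from the catalog):

    /*
     * All indexes belonging to one table are dedicated to one job and
     * sent as a single multi-statement query, so one PQsendQuery()
     * schedules them together on that connection.
     */
    PQExpBuffer sql = createPQExpBuffer();

    appendPQExpBufferStr(sql, "REINDEX INDEX public.t_idx1;\n");
    appendPQExpBufferStr(sql, "REINDEX INDEX public.t_idx2;");

    PQsendQuery(conn, sql->data);
    destroyPQExpBuffer(sql);    /* libpq has copied the query text */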

--
Regards,
Alexander Korotkov
Supabase


v1-0001-reindexdb-Fix-the-index-level-REINDEX-with-multip.patch
Description: Binary data


Re: maintenance_work_mem = 64kB doesn't work for vacuum

2025-03-09 Thread Melanie Plageman
On Sun, Mar 9, 2025 at 9:24 PM John Naylor  wrote:
>
> On Mon, Mar 10, 2025 at 1:46 AM Masahiko Sawada  wrote:
> >
> > Commit bbf668d66fbf6 (back-patched to v17) lowered the minimum
> > maintenance_work_mem to 64kB, but it doesn't work for parallel vacuum
>
> That was done in the first place to make a regression test for a bug
> fix easier, but that test never got committed. In any case I found it
> worked back in July:

Yes, I would like to keep the lower minimum. I really do have every
intention of committing that test. Apologies for taking so long.
Raising the limit to 256 kB might make the test take too long. And I
think it's nice to have that coverage (not just of the vacuum bug but
of a multi-pass vacuum with multiple indexes in a natural setting [as
opposed to the tidstore test module]). I don't recall if we have that
coverage elsewhere.

- Melanie




maintenance_work_mem = 64kB doesn't work for vacuum

2025-03-09 Thread Masahiko Sawada
Hi,

Commit bbf668d66fbf6 (back-patched to v17) lowered the minimum
maintenance_work_mem to 64kB, but it doesn't work for parallel vacuum
cases since the minimum dsa segment size (DSA_MIN_SEGMENT_SIZE) is
256kB. As soon as the radix tree allocates its control object and the
root node, the memory usage exceeds the maintenance_work_mem limit and
vacuum ends up with the following assertion:

TRAP: failed Assert("vacrel->lpdead_item_pages > 0"), File:
"vacuumlazy.c", Line: 2447, PID: 3208575

On builds without --enable-cassert, vacuum never ends.

I've tried to lower DSA_MIN_SEGMENT_SIZE to 64kB, but no luck.
Investigating it further, dsa creates a 64kB superblock for each size
class, and when creating a new shared radix tree we need to create two
superblocks: one for the radix tree control data (64 bytes) and
another one for the root node (40 bytes). Also, each superblock
requires span data, which uses 1 page (4096 bytes). Therefore, we need
at least 136kB for a shared radix tree even when it's empty.
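(That is, 2 superblocks × 64kB plus 2 span pages × 4kB = 128kB + 8kB = 136kB.)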

A simple fix is to bump the minimum maintenance_work_mem to 256kB. We
would break compatibility in a back branch (i.e. v17), but I guess
it's unlikely that existing v17 users are using less than 1MB of
maintenance_work_mem (the release notes don't mention the fact that
we lowered the minimum value).

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add regression test checking combinations of (object,backend_type,context) in pg_stat_io

2025-03-09 Thread Michael Paquier
On Wed, Mar 05, 2025 at 09:19:16PM +0900, Michael Paquier wrote:
> On Wed, Mar 05, 2025 at 07:34:16AM +, Bertrand Drouvot wrote:
>> What about adding some extra paranoia like?
>> 
>> SELECT backend_type, object, context FROM pg_stat_io ORDER BY
>> backend_type, object, context COLLATE "C";
> 
> Why not, to force the ordering.

For the sake of the archives.  As 8b532771a099 has proved, this has
required two more COLLATE clauses in the query to force a stable
output, but we are in the clear now.
--
Michael


signature.asc
Description: PGP signature


RE: Selectively invalidate caches in pgoutput module

2025-03-09 Thread Hayato Kuroda (Fujitsu)
I found that cfbot got angry due to variable shadowing. PSA fixed version.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v7-0001-Introduce-a-new-invalidation-message-to-invalidat.patch
Description:  v7-0001-Introduce-a-new-invalidation-message-to-invalidat.patch


v7-0002-Invalidate-Relcaches-while-ALTER-PUBLICATION-RENA.patch
Description:  v7-0002-Invalidate-Relcaches-while-ALTER-PUBLICATION-RENA.patch


Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-03-09 Thread Dilip Kumar
On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila  wrote:

> On Tue, Mar 4, 2025 at 6:54 PM vignesh C  wrote:
> >
> > On further thinking, I felt the use of publications_updated variable
> > is not required we can use publications_valid itself which will be set
> > if the publication system table is invalidated. Here is a patch for
> > the same.
> >
>
> The patch relies on the fact that whenever a publication's data is
> invalidated, it will also invalidate all the RelSyncEntries as per
> publication_invalidation_cb. But note that we are discussing removing
> that inefficiency in the thread  [1]. So, we should try to rebuild the
> entry when we have skipped the required publication previously.
>
> Apart from this, please consider updating the docs, as mentioned in my
> response to Sawada-San's email.
>

I'm not sure I fully understand it, but based on your previous email and
the initial email from Vignesh, IIUC, the issue occurs when a
publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET
PUBLICATION is executed, the subscriber workers restart and request the
changes based on restart_lsn, which is at an earlier LSN in the WAL than
the LSN at which the publication was created. This leads to an error. We
are addressing this behavior as part of the fix by skipping the changes
that fall between the subscriber's restart_lsn and the LSN at which the
publication was created, and this behavior looks fine.

> BTW, I am planning to commit this only on HEAD as this is a behavior
> change. Please let me know if you guys think otherwise.
>

Somehow this looks like a bug fix which should be backported no?  Am I
missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-09 Thread Ashutosh Bapat
On Thu, Mar 6, 2025 at 9:18 AM Ajin Cherian  wrote:

> +   Subscription names, publication names, and replication slot names are
> +   automatically generated. Cannot be used together with
> +   --database, --publication,
> +   --replication-slot or 
> --subscription.
>
> Don't start the sentence with "Cannot". Change the sentence to "This
> option cannot be used together with ..."
> similar sentences used in 3 other places below this as well. Change all of 
> them.

I didn't find any instance of "Cannot be" in the *.sgml files, but I
find some instances of "Can be". So it seems we allow such constructs
in the documentation. Any reasons for suggesting this change?

-- 
Best Wishes,
Ashutosh Bapat




Re: RFC: Logging plan of the running query

2025-03-09 Thread torikoshia

On 2025-03-09 00:42, Akshat Jaimini wrote:

Hi,
I think there is still some problem with the patch. I am not able to
apply it to the master branch.

Can you please take another look at it?


Thanks for pointing it out!
Modified it.

BTW, the patch adds about 400 lines to explain.c, and it may be better to
split the file as was done in 9173e8b6046, but I leave it as it is for now.


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

From 699264fbc1f4e99114966eaeba55a0ec5184e1c2 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 10 Mar 2025 14:01:54 +0900
Subject: [PATCH v41]  Add function to log the plan of the currently running
 query

Currently, we have to wait for query execution to finish
before we can know its plan, either using EXPLAIN ANALYZE or
auto_explain. This is not so convenient, for example when
investigating long-running queries in production environments.
To improve this situation, this patch adds a pg_log_query_plan()
function that requests logging the plan of the currently
executing query.

By default, only superusers are allowed to request logging the
plans, because allowing any user to issue this request at an
unbounded rate would cause lots of log messages, which can
lead to denial of service.

On receipt of the request, the plan-logging code is wrapped
around every ExecProcNode, and when the executor runs one of
the wrapped ExecProcNodes, the plan is actually logged. The
wrappers are removed once the plan has been logged.
In this way, we can safely avoid the overhead we would face
by adding CHECK_FOR_INTERRUPTS()-like mechanisms somewhere
in the executor code.
---
 contrib/auto_explain/auto_explain.c  |  23 +-
 doc/src/sgml/func.sgml   |  50 +++
 src/backend/access/transam/xact.c|   7 +
 src/backend/catalog/system_functions.sql |   2 +
 src/backend/commands/explain.c   | 372 ++-
 src/backend/executor/execMain.c  |  12 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/tcop/postgres.c  |   4 +
 src/backend/utils/init/globals.c |   2 +
 src/include/catalog/pg_proc.dat  |   6 +
 src/include/commands/explain.h   |  12 +
 src/include/miscadmin.h  |   1 +
 src/include/nodes/execnodes.h|   3 +
 src/include/storage/procsignal.h |   2 +
 src/include/tcop/pquery.h|   2 +-
 src/test/regress/expected/misc_functions.out |  54 ++-
 src/test/regress/sql/misc_functions.sql  |  41 +-
 17 files changed, 555 insertions(+), 42 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 7007a226c0..85cfe1b4f4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -408,26 +408,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			es->format = auto_explain_log_format;
 			es->settings = auto_explain_log_settings;
 
-			ExplainBeginOutput(es);
-			ExplainQueryText(es, queryDesc);
-			ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length);
-			ExplainPrintPlan(es, queryDesc);
-			if (es->analyze && auto_explain_log_triggers)
-ExplainPrintTriggers(es, queryDesc);
-			if (es->costs)
-ExplainPrintJITSummary(es, queryDesc);
-			ExplainEndOutput(es);
-
-			/* Remove last line break */
-			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
-es->str->data[--es->str->len] = '\0';
-
-			/* Fix JSON to output an object */
-			if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
-			{
-es->str->data[0] = '{';
-es->str->data[es->str->len - 1] = '}';
-			}
+			ExplainAssembleLogOutput(es, queryDesc, auto_explain_log_format,
+	 auto_explain_log_triggers,
+	 auto_explain_log_parameter_max_length);
 
 			/*
 			 * Note: we rely on the existing logging of context or
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 51dd8ad657..d1a16a89fb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28553,6 +28553,25 @@ acl  | {postgres=arwdDxtm/postgres,foo=r/postgres}

   
 
+  
+   
+
+ pg_log_query_plan
+
+pg_log_query_plan ( pid integer )
+boolean
+   
+   
+Requests to log the plan of the query currently running on the
+backend with specified process ID.
+It will be logged at LOG message level and
+will appear in the server log based on the log
+configuration set (See 
+for more information), but will not be sent to the client
+regardless of .
+
+  
+
   

 
@@ -28671,6 +28690,37 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 because it may generate a large number of log messages.

 
+   
+pg_log_query_plan can be used
+to log the plan of a backend process. For example:
+
+p

Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-03-09 Thread Amit Kapila
On Tue, Mar 4, 2025 at 6:54 PM vignesh C  wrote:
>
> On further thinking, I felt the use of publications_updated variable
> is not required we can use publications_valid itself which will be set
> if the publication system table is invalidated. Here is a patch for
> the same.
>

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntries as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread  [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

[1] - 
https://www.postgresql.org/message-id/OSCPR01MB14966C09AA201EFFA706576A7F5C92%40OSCPR01MB14966.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Parallel heap vacuum

2025-03-09 Thread Amit Kapila
On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada  wrote:
>
> On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada  wrote:
> >
> >
> > Another performance regression I can see in the results is that heap
> > vacuum phase (phase III) got slower with the patch. It's weird to me
> > since I don't touch the code of heap vacuum phase. I'm still
> > investigating the cause.
>
> I have investigated this regression. I've confirmed that in both
> scenarios (patched and unpatched), the entire table and its associated
> indexes were loaded into the shared buffer before the vacuum. Then,
> the 'perf record' analysis, focused specifically on the heap vacuum
> phase of the patched code, revealed numerous soft page faults
> occurring:
>
> 62.37%  13.90%  postgres  postgres  [.] lazy_vacuum_heap_rel
> |
> |--52.44%--lazy_vacuum_heap_rel
> |          |
> |          |--46.33%--lazy_vacuum_heap_page (inlined)
> |                     |
> |                     |--32.42%--heap_page_is_all_visible (inlined)
> |                     |          |
> |                     |          |--26.46%--HeapTupleSatisfiesVacuum
> |                     |          |          HeapTupleSatisfiesVacuumHorizon
> |                     |          |          HeapTupleHeaderXminCommitted (inlined)
> |                     |          |          |
> |                     |          |           --18.52%--page_fault
> |                     |          |                     do_page_fault
> |                     |          |                     __do_page_fault
> |                     |          |                     handle_mm_fault
> |                     |          |                     __handle_mm_fault
> |                     |          |                     handle_pte_fault
> |                     |          |                     |
> |                     |          |                     |--16.53%--filemap_map_pages
> |                     |          |                     |          |
> |                     |          |                     |           --2.63%--alloc_set_pte
> |                     |          |                     |                     pfn_pte
> |                     |          |                     |
> |                     |          |                      --1.99%--pmd_page_vaddr
> |                     |          |
> |                     |           --1.99%--TransactionIdPrecedes
>
> I did not observe these page faults in the 'perf record' results for
> the HEAD version. Furthermore, when I disabled parallel heap vacuum
> while keeping parallel index vacuuming enabled, the regression
> disappeared. Based on these findings, the likely cause of the
> regression appears to be that during parallel heap vacuum operations,
> table blocks were loaded into the shared buffer by parallel vacuum
> workers.
>

In the previous paragraph, you mentioned that the entire table and its
associated indexes were loaded into the shared buffer before the
vacuum. If that is true, then why does the parallel vacuum need to
reload the table blocks into shared buffers?

> However, in the heap vacuum phase, the leader process needed
> to process all blocks, resulting in soft page faults while creating
> Page Table Entries (PTEs). Without the patch, the backend process had
> already created PTEs during the heap scan, thus preventing these
> faults from occurring during the heap vacuum phase.
>

This part is again not clear to me because I am assuming all the data
exists in shared buffers before the vacuum, so why the page faults
will occur in the first place.

-- 
With Regards,
Amit Kapila.




Re: Parallel heap vacuum

2025-03-09 Thread Amit Kapila
On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada  wrote:
>
> Discussing with Amit offlist, I've run another benchmark test where no
> data is loaded on the shared buffer. In the previous test, I loaded
> all table blocks before running vacuum, so it was the best case. The
> attached test results showed the worst case.
>
> Overall, while the numbers don't seem stable, phase I got sped up a
> bit, but it's not as scalable as expected, which is not surprising.
>

Sorry, but it is difficult for me to understand this data because it
doesn't contain the schema or details like what exactly a fraction is.
It is also not clear how the workers are divided among the heap and
indexes - e.g., do we use parallelism for both heap phases or only the
first phase, and do we reuse those workers for index vacuuming? These
tests were probably discussed earlier, but it would be better to
either add a summary of the required information to understand the
results, or at least a link to a previous email that has such details.

> Please
> note that the test results show that phase III also got sped up,
> but this is because in parallel vacuum we use more ring buffers than
> the single-process vacuum. So we need to compare only the phase I time
> in terms of the benefit of the parallelism.
>

Does phase 3 also use parallelism? If so, can we try to divide the
ring buffers among workers, or at least try vacuum with an increased
number of ring buffers? This would be good to do for both phases,
if they both use parallelism.

-- 
With Regards,
Amit Kapila.




Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-03-09 Thread Shubham Khanna
On Wed, Mar 5, 2025 at 3:55 PM Nisha Moond  wrote:
>
> On Tue, Mar 4, 2025 at 8:05 PM Shubham Khanna
>  wrote:
> >
> > The attached Patch contains the suggested changes.
> >
>
> Hi Shubham,
>
> Here are few comments for 040_pg_createsubscriber.pl
>
> 1)
> +# Run pg_createsubscriber on node S using '--cleanup-existing-publications'.
> +# --verbose is used twice to show more information.
>
>   1a) /node S/ node S1
>
>   1b) Also, can we keep the comment in the same pattern as it was earlier -
> # Run pg_createsubscriber on node S1. --verbose is used twice
> # to show more information.
> # In passing, also test the --enable-two-phase and
> --cleanup-existing-publications
> # options.
>

Fixed.

> 2)
> +# Reuse P as primary
> +# Set up node S2 as standby linking to node P
> +$node_p->backup('backup_3');
>
> /Reuse P as/ Reuse node P as/
>

Fixed.

> 3)
> +$node_s2->append_conf(
> + 'postgresql.conf', qq[
> +   primary_conninfo = '$pconnstr'
> +   hot_standby_feedback = on
> +   max_logical_replication_workers = 5
> +   ]);
>
> Do we need "hot_standby_feedback = on" on node_s2? I think we can remove it.
>

Removed and updated the configurations.

> 4)
> +# Create user-defined publications
> +$node_p->safe_psql($db3, "CREATE PUBLICATION test_pub3 FOR ALL TABLES;");
> +$node_p->safe_psql($db3, "CREATE PUBLICATION test_pub4 FOR ALL TABLES;");
>
> Can create both publications under one safe_psql as -
>
> $node_p->safe_psql($db3, qq[CREATE PUBLICATION test_pub3 FOR ALL TABLES;
> CREATE PUBLICATION test_pub4 FOR ALL TABLES;
> ]);
>

Fixed.

> 5)
> +# Run pg_createsubscriber on node A without using
> +# '--cleanup-existing-publications'.
> +# --verbose is used twice to show more information.
>
>   5a)  /node A/node S2/
>   5b)  /without using '--cleanup-existing-publications' / without
> '--cleanup-existing-publications' option
>

Fixed.

> 6)
> + ],
> + 'run pg_createsubscriber without --cleanup-existing-publications on node A'
> +);
>
> /node A/node S2/
>

Fixed.

> 7)
> +# Drop the newly created publications
> +$node_p->safe_psql($db3, "DROP PUBLICATION IF EXISTS test_pub3;");
> +$node_p->safe_psql($db3, "DROP PUBLICATION IF EXISTS test_pub4;");
>
> Similar to #4, use single safe_psql to drop both the publications.
> OTOH, do we really need to drop the publications here? I think we can
> remove this part since we're performing cleanup right after.
> 
>
> --

Fixed.

The attached patch contains the suggested changes.

Thanks and regards,
Shubham Khanna.


v14-0001-Support-for-dropping-all-publications-in-pg_crea.patch
Description: Binary data


Re: Make tuple deformation faster

2025-03-09 Thread David Rowley
On Thu, 6 Mar 2025 at 10:17, Andres Freund  wrote:
> FWIW, I am fairly certain that I looked at this at an earlier state of the
> patch, and at least for me the issue wasn't that it was inherently slower to
> use the bitmask, but that it was hard to convince the compiler not generate
> worse code.
>
> IIRC the compiler generated more complicated address gathering instructions
> which are slower on some older microarchitectures, but this is a vague memory.

I've been reading GCC's assembly output with -fverbose-asm. I find it
quite hard to follow as the changes between the 16-byte and 8-byte
CompactAttribute versions are vast (see attached).

A few interesting things jump out, e.g., in master:

# execTuples.c:1080: thisatt->attcacheoff = *offp;
.loc 1 1080 26 is_stmt 0 view .LVU1468
movl %ebp, (%rax) # off, MEM[(int *)_22]

whereas with the 8-byte version, I see:

# execTuples.c:1080: thisatt->attcacheoff = *offp;
.loc 1 1080 26 is_stmt 0 view .LVU1484
movl %ebp, 24(%rax) # off, MEM[(int *)_358 + 24B]

You can see the MOVL in the 8-byte version should amount to an
additional micro op to add 24 to RAX before the dereference.

One interesting thing to note about having CompactAttribute in its
8-byte form is that the compiler is tempted into sharing a register
with the tts_values array before Datum is also 8-bytes. Note the
difference in [1] between the two left compiler outputs and the
right-hand one. You can see RCX is dedicated for addressing
CompactAttribute in the right window, but RAX is used for both arrays
in the left two.  I don't 100% know for sure that's the reason for the
slowness with the full version but it does seem from the fragment I
posted just above that RAX does need 24 bytes added in the 8-byte
version but not in the 16-byte version, so RAX is certainly not
dedicated and ready to point at attcacheoff at that point.

Jeff, I'm not sure if I understand this well enough to write a
meaningful comment to explain why we don't use bitflags. With my
current knowledge level on this, it's a bit hand-wavy at best. Are you
content with this, or do you want to see something written into the
header comment for CompactAttribute in the code?

David

[1] https://godbolt.org/z/7hWvqdW6E


Re: Restrict copying of invalidated replication slots

2025-03-09 Thread Shlok Kyal
On Fri, 28 Feb 2025 at 08:56, Amit Kapila  wrote:
>
> On Fri, Feb 28, 2025 at 5:10 AM Masahiko Sawada  wrote:
> >
> > On Thu, Feb 27, 2025 at 12:52 AM Amit Kapila  
> > wrote:
> > >
> > > On Thu, Feb 27, 2025 at 10:47 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Feb 25, 2025 at 7:33 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with
> > > > > this operation. So, isn't it possible that the source slot exists at
> > > > > the later position in ReplicationSlotCtl->replication_slots and the
> > > > > loop traversing slots is already ahead from the point where the newly
> > > > > copied slot is created?
> > > >
> > > > Good point. I think it's possible.
> > > >
> > > > > If so, the newly created slot won't be marked
> > > > > as invalid whereas the source slot will be marked as invalid. I agree
> > > > > that even in such a case, at a later point, the newly created slot
> > > > > will also be marked as invalid.
> > > >
> > > > The wal_status of the newly created slot would immediately become
> > > > 'lost' and the next checkpoint will invalidate it. Do we need to do
> > > > something to deal with this case?
> > > >
> > >
> > > + /* Check if source slot became invalidated during the copy operation */
> > > + if (second_slot_contents.data.invalidated != RS_INVAL_NONE)
> > > + ereport(ERROR,
> > > + errmsg("cannot copy replication slot \"%s\"",
> > > +NameStr(*src_name)),
> > > + errdetail("The source replication slot was invalidated during the
> > > copy operation."));
> > >
> > > How about adding a similar test as above for MyReplicationSlot? That
> > > should suffice the need because we have already acquired the new slot
> > > by this time and invalidation should signal this process before
> > > marking the new slot as invalid.
> >
> > IIUC in the scenario you mentioned, the loop traversing slots already
> > passed the position of newly created slot in
> > ReplicationSlotCtl->replication_slots array, so
> > MyReplicationSlot->data.invalidated is still RS_INVAL_NONE, no?
> >
>
> Right, I don't have a simpler solution for this apart from making it
> somehow serialize this operation with
> InvalidateObsoleteReplicationSlots(). I don't think we need to go that
> far to handle the scenario being discussed. It is better to add a
> comment for this race and why it won't harm.

I tried to reproduce the scenario described using the following steps:

1. set max_replication_slots = 2

2. create two logical replication slot, lets say slot1 and slot2. drop
the first slot (slot1).

3. Now run the pg_copy_logical_replication_slot function on slot2. Let's
say the copied slot is 'slot_cp'.
In the function copy_replication_slot, add a breakpoint just after the
call to 'create_logical_replication_slot'.
slot_cp will be created. In the array
ReplicationSlotCtl->replication_slots, slot_cp will come before slot2.

4. Now invalidate 'slot2'.
The function 'InvalidateObsoleteReplicationSlots' will be called. It
will loop through ReplicationSlotCtl->replication_slots and will get
'slot_cp' first. Since that slot is acquired by copy_replication_slot,
it will wait for the slot to be released. Once the slot is released,
'slot_cp' and 'slot2' will be invalidated.

I have tried it with 'wal_removal' invalidation, so it is an issue from
PG 13 to HEAD.
I also see a kill signal sent to the function 'copy_replication_slot'.

Terminal output:
postgres=# select pg_copy_logical_replication_slot('slot2', 'slot_cp');
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

5. Execute 'SELECT * from pg_replication_slots'. slot_cp is present in
the list with wal_status as lost.

I have added the comment to the existing patches for REL_16 through
master, and created a new patch that adds only the comments for
REL_13 through REL_15.

Thanks and Regards,
Shlok Kyal


REL_17_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch
Description: Binary data


REL_16_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch
Description: Binary data


REL_15-REL_13_v7-0001-Comment-race-condition-in-copy_replication_slot.patch
Description: Binary data


master_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch
Description: Binary data


Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-03-09 Thread Shubham Khanna
On Thu, Mar 6, 2025 at 9:27 AM Peter Smith  wrote:
>
> Hi Shubham.
>
> Some review comments for patch v13-0001.
>
> ==
> GENERAL
>
> 1.
> --cleanup-existing-publications
>
> I've never liked this proposed switch name much.
>
> e.g. why say "cleanup" instead of "drop"? What is the difference?
> Saying drop is very explicit about what the switch will do.
> e.g. why say "existing"? It seems redundant; you can't cleanup/drop
> something that does not exist.
>
> My preference is one of:
> --drop-publications
> --drop-all-publications
>
> either of which seem nicely aligned with the descriptions in the usage and 
> docs.
>
> Yeah, I know I have raised this same point before, but AFAICT the
> reply was like "will revise it in the next patch version", but that
> was many versions ago. I think it is important to settle the switch
> name earlier than later because there will be many tentacles into the
> code (vars, params, fields, comments) and docs if anything changes --
> so it is not a decision you want to leave till the end because it
> could destabilise everything at the last minute.
>

I have updated the option name to '--drop-all-publications'.

> ==
> Commit message
>
> 2.
> By default, publications are preserved to avoid unintended data loss.
>
> ~
>
> Was there supposed to be a blank line before this text, or should this
> text be wrapped into the preceding paragraph?
>

Fixed.

> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> setup_subscriber:
>
> 3.
>  /*
>   * Create the subscriptions, adjust the initial location for logical
>   * replication and enable the subscriptions. That's the last step for logical
> - * replication setup.
> + * replication setup. If 'drop_publications' options is true, existing
> + * publications on the subscriber will be dropped before creating new
> + * subscriptions.
>   */
>
> There are multiple things amiss with this comment.
> - 'drop_publications' is not the parameter name
> - 'drop_publications' options [sic plural??]. It is not an option
> here; it is a parameter
>

Fixed.

> ~~~
>
> check_and_drop_existing_publications:
>
> 4.
>  /*
> - * Remove publication if it couldn't finish all steps.
> + * Check and drop existing publications on the subscriber if requested.
>   */
>
> There is no need to say "if requested.". It is akin to saying this
> function does XXX if this function is called.
>

Fixed.

> ~~~
>
> drop_publication_by_name:
>
> 5.
> +/* Drop the specified publication of the given database. */
> +static void
> +drop_publication_by_name(PGconn *conn, const char *pubname, const char 
> *dbname)
>
> 5a.
> I think it is better to define this function before
> check_and_drop_existing_publications. YMMV.
>
> ~
>
> 5b.
> IMO the parameters should be reordered (PGconn *conn, const char
> *dbname, const char *pubname). It seems more natural and would be
> consistent with check_and_drop_existing_publications. YMMV.
>
> ~~~

Fixed.

>
> 6.
> - dbinfo->pubname, dbinfo->dbname, PQresultErrorMessage(res));
> - dbinfo->made_publication = false; /* don't try again. */
> + pubname, dbname, PQresultErrorMessage(res));
> + dbinfos.dbinfo->made_publication = false; /* don't try again. */
>
> Something about this flag assignment seems odd to me. IIUC
> 'made_publications' is for removing the cleanup_objects_atexit to be
> able to remove the special publication implicitly made by the
> pg_createsubscriber. But, AFAIK we can also get to this code via the
> --cleanup-existing-publication switch, so actually it might be one of
> the "existing" publication DROPS that has failed. If so, then the
> "don't try again comment" is misleading because it may not be that
> same publication "again" at all.
>

I agree with your point. The current comment is misleading, as the
failure could be related to an existing publication drop via
--cleanup-existing-publications (now --drop-all-publications), not just
the publication created by pg_createsubscriber.
This issue is still open, and I will address it in the next version of
the patch.

> ==
> .../t/040_pg_createsubscriber.pl
>
> GENERAL.
>
> 7.
> Most of the changes to this test code are just changing node name S ->
> S1. It's not much to do with the patch other than tidying it in
> preparation for a new node S2.  OTOH it makes the review far harder
> because there are so many changes. IMO all this S->S1 stuff should be
> done as a separate pre-requisite patch and then it will be far easier
> to see what changes are added for the --clean-existing-publications
> testing.
>
> ~~~
>

I understand your concern. Since I am using two nodes (node_s1 and
node_s2), I will work on creating a separate prerequisite patch for
renaming S -> S1. This should make it easier to review the actual
changes related to --cleanup-existing-publications (now
--drop-all-publications) testing.

> 8.
>  # Set up node S as standby linking to node P
>  $node_p->backup('backup_1');
> -my $node_s = PostgreSQL::Test::Cluster->new('node_s');
> -$node_s->init_from_ba

Re: SQLFunctionCache and generic plans

2025-03-09 Thread Pavel Stehule
Hi

On Thu, 6 Mar 2025 at 9:57, Alexander Pyhalov
wrote:

> Hi.
>
> Tom Lane wrote on 2025-02-27 23:40:
> > Alexander Pyhalov  writes:
> >> Now sql functions plans are actually saved. The most of it is a
> >> simplified version of plpgsql plan cache. Perhaps, I've missed
> >> something.
> >
> > A couple of thoughts about v6:
> >
> > * I don't think it's okay to just summarily do this:
> >
> >   /* It's stale; unlink and delete */
> >   fcinfo->flinfo->fn_extra = NULL;
> >   MemoryContextDelete(fcache->fcontext);
> >   fcache = NULL;
> >
> > when fmgr_sql sees that the cache is stale.  If we're
> > doing a nested call of a recursive SQL function, this'd be
> > cutting the legs out from under the outer recursion level.
> > plpgsql goes to some lengths to do reference-counting of
> > function cache entries, and I think you need the same here.
> >
>
> I've looked for original bug report 7881 (
>
> https://www.postgresql.org/message-id/E1U5ytP-0006E3-KB%40wrigleys.postgresql.org
> ).
> It's interesting, but it seems that the plan cache is not affected by it:
> when the fcinfo xid mismatches, we destroy the fcinfo, not the plan itself
> (the cached plan survives and can still be used).
>
> I thought about another case - when a recursive function is invalidated
> during its execution. But I haven't found such a case. If a function is
> modified during its execution in another backend, the original
> backend uses the old snapshot during function execution and still sees the
> old function definition. I looked at the following case -
>
> create or replace function f (int) returns setof int language sql as $$
> select i from t where t.j = $1
> union all
> select f(i) from t where t.j = $1
> $$;
>
> and changed function definition to
>
> create or replace function f (int) returns setof int language sql as $$
> select i from t where t.j = $1  and i > 0
> union all
> select f(i) from t where t.j = $1
> $$;
>
> during execution of the function. As expected, the backend still sees the
> old definition and uses the cached plan.
>
> > * I don't like much of anything about 0004.  It's messy and it
> > gives up all the benefit of plan caching in some pretty-common
> > cases (anywhere where the user was sloppy about what data type
> > is being returned).  I wonder if we couldn't solve that with
> > more finesse by applying check_sql_fn_retval() to the query tree
> > after parse analysis, but before we hand it to plancache.c?
> > This is different from what happens now because we'd be applying
> > it before not after query rewrite, but I don't think that
> > query rewrite ever changes the targetlist results.  Another
> > point is that the resultTargetList output might change subtly,
> > but I don't think we care there either: I believe we only examine
> > that output for its result data types and resjunk markers.
> > (This is nonobvious and inadequately documented, but for sure we
> > couldn't try to execute that tlist --- it's never passed through
> > the planner.)
>
> I'm also not fond of the last patch. So I tried to fix it in the way you
> suggested - we call check_sql_fn_retval() before creating cached plans.
> It fixes the issue with revalidation happening after the query results
> are modified.
>
> One test now changes behavior. What's happening is that after moving an
> extension to another schema, the cached plan is invalidated. But
> revalidation happens and rebuilds the plan. As we've saved the analyzed
> parse tree, not the raw one, we refer to public.dep_req2() not by name,
> but by oid. The oid didn't change, so the cached plan is rebuilt and
> used. I don't know whether we should fix it, and if so, how.
> >
> > * One diff that caught my eye was
> >
> > - if (!ActiveSnapshotSet() &&
> > - plansource->raw_parse_tree &&
> > - analyze_requires_snapshot(plansource->raw_parse_tree))
> > + if (!ActiveSnapshotSet() &&
> StmtPlanRequiresRevalidation(plansource))
> >
> > Because StmtPlanRequiresRevalidation uses
> > stmt_requires_parse_analysis, this is basically throwing away the
> > distinction between stmt_requires_parse_analysis and
> > analyze_requires_snapshot.  I do not think that's okay, for the
> > reasons explained in analyze_requires_snapshot.  We should expend the
> > additional notational overhead to keep those things separate.
> >
> > * I'm also not thrilled by teaching StmtPlanRequiresRevalidation
> > how to do something equivalent to stmt_requires_parse_analysis for
> > Query trees.  The entire point of the current division of labor is
> > for it *not* to know that, but keep the knowledge of the properties
> > of different statement types in parser/analyze.c.  So it looks to me
> > like we need to add a function to parser/analyze.c that does this.
> > Not quite sure what to call it though.
> > querytree_requires_parse_analysis() would be a weird name, because
> > if it's a Query tree then it's already been through parse analysis.
> > Maybe querytree_requires

Re: Enhance file_fdw to report processed and skipped tuples in COPY progress

2025-03-09 Thread Fujii Masao




On 2025/03/09 20:38, vignesh C wrote:

I've updated the status to "withdrawn." Feel free to add it again
anytime if you change your mind.


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-09 Thread Michael Paquier
On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote:
> Regarding the issue itself, query jumbling behavior is often subjective,
> making it difficult to classify as a bug. I'm not entirely sure this
> qualifies as a bug either, but I do believe it should be addressed.

I would call that a bug and something that we should fix, but not
something that we can really backpatch, as this unfortunately has an
impact on monitoring tools.  Stability takes priority in this area in
already released branches. 

> By rearranging them as done in variant A, the position where the expression
> is appended in the jumble changes, leading to a different hash. I just
> don't like
> this solution because it requires one to carefully construct a struct,
> but it may be the best out of the other options.

I prefer something like variant A.  It would not be the first time we
are manipulating the structure of the parse nodes for the sake of
making the query jumbling more transparent.  An advantage of depending
on the structure definition is that we can document the expectation in
the header, and not hide it in the middle of the jumble code.
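
To make that concrete, here is a rough sketch of what variant A amounts
to in the structure; the exact placement is illustrative only, the point
being that a non-expression field sits between the two nullable int8
expressions so that a lone LIMIT and a lone OFFSET jumble differently:

/*
 * Sketch of "variant A".  A NULL expression node appends nothing to
 * the jumble, so limitOption acts as a separator that keeps a NULL
 * limitOffset from colliding with a NULL limitCount.
 */
Node       *limitOffset;    /* # of result tuples to skip (int8 expr) */
LimitOption limitOption;    /* limit type, jumbled between the exprs */
Node       *limitCount;     /* # of result tuples to return (int8 expr) */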

> Variant B is not acceptable IMO as it adds a whole bunch of null-terminators
> unnecessarily. For example, in a simple "select 1", (expr == NULL) is
> true 19 times,
> so that is an extra 19 bytes.

Variant B is not acceptable here.

> I think a third option is to add a new pg_node_attr called "query_jumble_null"
> and be very selective on which fields should append a null-terminator to the
> jumble when the expression is null
> 
> The queryjumblefuncs.c could look like the below. JUMBLE_NULL will
> be responsible for appending the null terminator.

Not much a fan of that.  For the sake of this thread, I'd still go
with the simplicity of A.  And please, let's add a couple of queries
in pgss to track that these lead to two different entries.

Another option that I was thinking about, which could be slightly cleaner,
is the addition of an extra field in Query that is set when we go
through an offset_clause in the parser.  It could just be a boolean,
false by default.  We have been using this practice in
DeallocateStmt, for example.  And there are only a few places where
limitOffset is set.  However, I'd rather discard this idea, as
transformSelectStmt() may also transform a LIMIT clause
into an OFFSET clause, changing the Query representation.  And note that
we calculate the query jumbling after applying the query
transformation.  For these reasons, variant A where we put the
LimitOption between the two int8 expression nodes feels like the
"okay" approach here.  But we must document this expectation in the
structure, and check for more grammar variants of LIMIT and OFFSET
clauses in pgss.

Another concept would be to add into the jumble the offset of a Node
in the upper structure we are jumbling, but this requires keeping
track of all the nested things, which would make the query jumbling
code much more complicated and more expensive, for sure.

I'd also recommend to be careful and double-check all these
transformation assumptions for LIMIT and OFFSET.  We should shape
things so that an OFFSET never maps to a LIMIT variant when we
differentiate all these flavors at query string level.
--
Michael


signature.asc
Description: PGP signature


Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-09 Thread David Rowley
On Mon, 10 Mar 2025 at 12:39, Michael Paquier  wrote:
>
> On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote:
> > Regarding the issue itself, query jumbling behavior is often subjective,
> > making it difficult to classify as a bug. I'm not entirely sure this
> > qualifies as a bug either, but I do believe it should be addressed.
>
> I would call that a bug and something that we should fix, but not
> something that we can really backpatch, as this unfortunately has an
> impact on monitoring tools.  Stability takes priority in this area in
> already released branches.

I know you reviewed this, Michael, so you're aware; 2d3389c28 did
recently document that we'd only break this in minor versions as a
last resort. So, I agree that it sounds like a master-only fix is in
order.

For the record, the docs [1] read:

"Generally, it can be assumed that queryid values are stable between
minor version releases of PostgreSQL, providing that instances are
running on the same machine architecture and the catalog metadata
details match. Compatibility will only be broken between minor
versions as a last resort."

David

[1] https://www.postgresql.org/docs/current/pgstatstatements.html




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-03-09 Thread Ajin Cherian
On Mon, Mar 10, 2025 at 3:58 PM Ashutosh Bapat
 wrote:
>
> On Thu, Mar 6, 2025 at 9:18 AM Ajin Cherian  wrote:
>
> > +   Subscription names, publication names, and replication slot names 
> > are
> > +   automatically generated. Cannot be used together with
> > +   --database, --publication,
> > +   --replication-slot or 
> > --subscription.
> >
> > Don't start the sentence with "Cannot". Change the sentence to "This
> > option cannot be used together with ..."
> > similar sentences used in 3 other places below this as well. Change all of 
> > them.
>
> I didn't find any instance of "Cannot be" in the *.sgml files, but I
> find some instances of "Can be". So it seems we allow such constructs
> in the documentation. Any reasons for suggesting this change?
>

I just don't think it is correct English to start a sentence with
"Cannot be". I checked with grammarly as well.

regards,
Ajin Cherian
Fujitsu Australia




Re: maintenance_work_mem = 64kB doesn't work for vacuum

2025-03-09 Thread Masahiko Sawada
On Sun, Mar 9, 2025 at 7:03 PM David Rowley  wrote:
>
> On Mon, 10 Mar 2025 at 10:30, David Rowley  wrote:
> > Could you do something similar to what's in hash_agg_check_limits()
> > where we check we've got at least 1 item before bailing before we've
> > used up the all the prescribed memory?  That seems like a safer coding
> > practise as if in the future the minimum usage for a DSM segment goes
> > above 256KB, the bug comes back again.
>
> FWIW, I had something like the attached in mind.
>

Thank you for the patch! I like your idea. This means that even if we
set maintenance_work_mem to 64kB, the memory usage would not actually
be limited to 64kB, but we're probably fine as it's primarily for
testing purposes.

Regarding that patch, we need to note that lpdead_items is a
counter that is never reset during the entire vacuum. Therefore, with
maintenance_work_mem = 64kB, once we collect at least one lpdead item,
we perform a cycle of index vacuuming and heap vacuuming for every
subsequent block even if it has no lpdead items. I think we
should use vacrel->dead_items_info->num_items instead.
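
Something like this is what I mean - a sketch only, based on my reading
of vacuumlazy.c and tidstore.c, not an actual patch:

/*
 * Trigger an index/heap vacuum cycle only once the memory limit is
 * exceeded AND we have actually collected dead items since the last
 * cycle, using the per-pass num_items counter rather than the
 * never-reset lpdead_items accumulator.
 */
if (TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes &&
    vacrel->dead_items_info->num_items > 0)
{
    /* ... perform one round of index and heap vacuuming ... */
}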

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Orphaned users in PG16 and above can only be managed by Superusers

2025-03-09 Thread Ashutosh Sharma
Hi,

On Fri, Mar 7, 2025 at 10:55 PM Nathan Bossart  wrote:
>
> I noticed that much of this code is lifted from DropRole(), and the new
> check_drop_role_dependency() function is only used by DropRole() right
> before it does the exact same scans.  Couldn't we put the new dependency
> detection in those existing scans in DropRole()?
>

It can be done, but mixing the code that checks for the drop-role
dependency with the code that removes entries for the role being
dropped from pg_auth_members could reduce clarity and precision. This
is more of a sanity check, which I felt was necessary before we proceed
with actually dropping the role, starting with the deletion of the
role's entries from the system catalogs. I'm aware there's some code
duplication, but I think it should be fine.

--
With Regards,
Ashutosh Sharma.




Re: Printing window function OVER clauses in EXPLAIN

2025-03-09 Thread Álvaro Herrera
On 2025-Mar-09, Tom Lane wrote:

> David Rowley  writes:
> > What are your thoughts on being a bit more brief with the naming and
> > just prefix with "w" instead of "window"? Looking at window.out, I see
> > that the EXPLAIN output does become quite a bit wider than before. I
> > favour the idea of saving a bit of space.  There is an example in
> > src/sgml/advanced.sgml that has "OVER w", so it does not seem overly
> > strange to me to name them "w1", "w2", etc.
> 
> OK by me, any objections elsewhere?

WFM.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: [BUG]: the walsender does not update its IO statistics until it exits

2025-03-09 Thread Michael Paquier
On Mon, Mar 03, 2025 at 11:54:39AM +, Bertrand Drouvot wrote:
> So it does not look like what we're adding here can be seen as a primary 
> bottleneck
> but it is probably worth implementing the "have_iostats" optimization 
> attached.
> 
> Also, while I did not measure any noticeable extra lag, given the fact that 
> pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 
> 2.5%,
> that could still make sense to reduce the frequency of the flush calls, 
> thoughts?

Okay.  The number of reports as proposed in the patch really worries
me anyway.  It's not really complicated to see dozens of pgstats calls
in a single millisecond in the WAL sender, even with some of the
regression tests.  That's more present in the logirep scenarios, it
seems...

For example, here are some numbers based on some elog() calls planted
in walsender.c, looking at the logs of the TAP tests.  Through a
single check-world, I am able to gather the following information:
- The location proposed by the patch leads to 680k stats reports,
in WalSndLoop(), because of the "Block if we have unsent data" path.
Some tests show quite some contention.
- A different location, in WalSndLoop() just before WalSndWait(), leads
to 59k calls.  And there's some contention in the logirep tests...
- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait().  This leads to 2.9k reports in
the whole test suite, with much less contention in the reports.  These
can still be rather frequent, up to ~50 calls per second, but that's
really less than the two others.
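
In code terms, the third candidate would amount to something like the
sketch below - just an illustration of the placement, reusing the
existing pgstat flush routines; the actual patch may differ:

/* In WalSndWaitForWal(), just before going to sleep: */

/* Report pending IO and backend statistics, if any */
pgstat_flush_io(false);
pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);

/* ... then the existing WalSndWait() call follows, unchanged ... */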

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload.  More
frequent reports are useful for spikey data detection, showing more
noise.  Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks.  So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing.  Choice 3 I'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.
--
Michael


signature.asc
Description: PGP signature


Re: Printing window function OVER clauses in EXPLAIN

2025-03-09 Thread David Rowley
On Mon, 10 Mar 2025 at 14:13, Tom Lane  wrote:
> Hmm, OK.  Do you think it could be sensible to put Run Condition
> before Filter, then?  On the same grounds of "keeping related
> things together", it could be argued that Run Condition is
> related to the Window property.  Also, the Run Condition acts
> before the Filter does, if I've got my head screwed on straight.

Yes, directly after the "Window" property makes sense for the reason
you stated. Thanks for thinking of that.

David




Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET

2025-03-09 Thread Michael Paquier
On Mon, Mar 10, 2025 at 02:14:01PM +1300, David Rowley wrote:
> I know you reviewed this, Michael, so you're aware; 2d3389c28 did
> recently document that we'd only break this in minor versions as a
> last resort. So, I agree that it sounds like a master-only fix is in
> order.

Yes, this thread's problem does not pass the compatibility bar.
Thanks for the reminder about the docs.
--
Michael


signature.asc
Description: PGP signature


Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-03-09 Thread Amit Kapila
On Sun, Mar 9, 2025 at 9:00 AM Masahiko Sawada  wrote:
>
> On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila  wrote:
>
> >
> > I see the point of adding such an option to avoid breaking the current
> > applications (if there are any) that are relying on current behaviour.
> > But OTOH, I am not sure if users expect us to fail explicitly in such
> > scenarios.
>
> On the side note, with the patch since we ignore missing publications
> we will be able to create a subscription with whatever publications
> regardless their existence like:
>
> CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., 
> pub1000;
>
> The walsender corresponding to such subscriber can stream changes as
> soon as the listed publications are created on the publisher and
> REFRESH PUBLICATION is executed.
>

Right, but OTOH, one can expect that the data should start replicating
as soon as one creates a publication on the publisher. However, the
data for tables that are part of the publication will start
replicating from the point when the decoding process processes the
WAL corresponding to CREATE PUBLICATION. I suggest adding something
about this to the docs unless it is already explained.

> >
> > This is a long-standing behaviour for which we get reports from time
> > to time, and once analyzing a failure, Tom also looked at it and
> > agreed that we don't have much choice to avoid skipping non-existent
> > publications [5]. But we never concluded as to whether skipping should
> > be a default behavior or an optional one. So, we need more opinions on
> > it.
>
> I'm leaning toward making the skipping behavior a default as I could
> not find a good benefit for the current behavior (i.e., stopping
> logical replication due to missing publications).
>

Sounds reasonable. We can always add the option at a later point if
required. Thanks for your input. We can continue reviewing and
committing the current patch.

-- 
With Regards,
Amit Kapila.




Re: Printing window function OVER clauses in EXPLAIN

2025-03-09 Thread David Rowley
On Mon, 10 Mar 2025 at 11:19, Tom Lane  wrote:
> OK, here's v2 done like that.  I do like this output better.
> I backed off the idea of putting the WindowClause as such
> into the plan, partly because I didn't feel like debugging
> the setrefs.c problem that David discovered upthread.
> This way does require a bit more code, but I think it's less
> likely to have bugs.

This output is much nicer. The patch looks good to me.

What are your thoughts on being a bit more brief with the naming and
just prefix with "w" instead of "window"? Looking at window.out, I see
that the EXPLAIN output does become quite a bit wider than before. I
favour the idea of saving a bit of space.  There is an example in
src/sgml/advanced.sgml that has "OVER w", so it does not seem overly
strange to me to name them "w1", "w2", etc.

> * In passing, I editorialized on the order in which the Run Condition
> annotation comes out:
>
>  case T_WindowAgg:
> +show_window_def(castNode(WindowAggState, planstate), ancestors, 
> es);
>  show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
> +show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
> +"Run Condition", planstate, ancestors, es);
>  if (plan->qual)
>  show_instrumentation_count("Rows Removed by Filter", 1,
> planstate, es);
> -show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
> -"Run Condition", planstate, ancestors, es);
>  show_windowagg_info(castNode(WindowAggState, planstate), es);
>  break;
>
> It seemed quite weird to me to have the Run Condition plan property
> come out in the middle of properties that only appear in EXPLAIN
> ANALYZE mode.  Maybe there's a reason for this other than "people
> added new properties at the end", but I don't see it.

I did it that way because "Rows Removed by Filter" is a property of
"Filter", so it makes sense to me for those to be together. It doesn't
make sense to me to put something unrelated between them.

If you look at BitmapHeapScan output, this keeps the related outputs
together, i.e:

   ->  Parallel Bitmap Heap Scan on ab  (cost=111.20..82787.64 rows=1
width=8) (actual time=172.498..172.499 rows=0.00 loops=3)
 Recheck Cond: (a = 1)
 Rows Removed by Index Recheck: 705225
 Filter: (b = 3)
 Rows Removed by Filter: 

What you're proposing seems equivalent to if we did it like:

   ->  Parallel Bitmap Heap Scan on ab  (cost=111.20..82787.64 rows=1
width=8) (actual time=172.498..172.499 rows=0.00 loops=3)
 Recheck Cond: (a = 1)
 Filter: (b = 3)
 Rows Removed by Index Recheck: 705225
 Rows Removed by Filter: 

David




Re: CREATE OR REPLACE MATERIALIZED VIEW

2025-03-09 Thread Erik Wienhold
The attached v6 fixes the build.  Somehow I missed testing with
--enable-cassert the whole time, and it turned out that I forgot to pass
queryString to ExecRefreshMatView.

-- 
Erik Wienhold
>From 6ec126d8da5ca80f93ea8a58e07d654f5e21ef6d Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Tue, 21 May 2024 18:35:47 +0200
Subject: [PATCH v6 1/3] Add CREATE OR REPLACE MATERIALIZED VIEW

---
 .../sgml/ref/create_materialized_view.sgml|  15 +-
 src/backend/commands/createas.c   | 207 ++
 src/backend/commands/tablecmds.c  |   8 +-
 src/backend/commands/view.c   | 106 ++---
 src/backend/parser/gram.y |  15 ++
 src/bin/psql/tab-complete.in.c|  26 ++-
 src/include/commands/view.h   |   3 +
 src/include/nodes/parsenodes.h|   2 +-
 src/include/nodes/primnodes.h |   1 +
 src/test/regress/expected/matview.out | 191 
 src/test/regress/sql/matview.sql  | 108 +
 11 files changed, 589 insertions(+), 93 deletions(-)

diff --git a/doc/src/sgml/ref/create_materialized_view.sgml b/doc/src/sgml/ref/create_materialized_view.sgml
index 62d897931c3..5e03320eb73 100644
--- a/doc/src/sgml/ref/create_materialized_view.sgml
+++ b/doc/src/sgml/ref/create_materialized_view.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
+CREATE [ OR REPLACE ] MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
 [ (column_name [, ...] ) ]
 [ USING method ]
[ WITH ( storage_parameter [= value] [, ... ] ) ]
@@ -60,6 +60,17 @@ CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
   Parameters
 
   
+   
+OR REPLACE
+
+ 
+  Replaces a materialized view if it already exists.
+  Specifying OR REPLACE together with
+  IF NOT EXISTS is an error.
+ 
+
+   
+

 IF NOT EXISTS
 
@@ -67,7 +78,7 @@ CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] 
table_name
   Do not throw an error if a materialized view with the same name already
   exists. A notice is issued in this case.  Note that there is no guarantee
   that the existing materialized view is anything like the one that would
-  have been created.
+  have been created, unless you use OR REPLACE instead.
  
 

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 44b4665ccd3..9f37ed7f177 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -79,55 +79,151 @@ static void intorel_destroy(DestReceiver *self);
 static ObjectAddress
 create_ctas_internal(List *attrList, IntoClause *into)
 {
-   CreateStmt *create = makeNode(CreateStmt);
-   boolis_matview;
+   boolis_matview,
+   replace = false;
charrelkind;
-   Datum   toast_options;
-   const char *const validnsps[] = HEAP_RELOPT_NAMESPACES;
+   Oid matviewOid = InvalidOid;
ObjectAddress intoRelationAddr;
 
	/* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */
is_matview = (into->viewQuery != NULL);
relkind = is_matview ? RELKIND_MATVIEW : RELKIND_RELATION;
 
-   /*
-* Create the target relation by faking up a CREATE TABLE parsetree and
-* passing it to DefineRelation.
-*/
-   create->relation = into->rel;
-   create->tableElts = attrList;
-   create->inhRelations = NIL;
-   create->ofTypename = NULL;
-   create->constraints = NIL;
-   create->options = into->options;
-   create->oncommit = into->onCommit;
-   create->tablespacename = into->tableSpaceName;
-   create->if_not_exists = false;
-   create->accessMethod = into->accessMethod;
+   /* Check if an existing materialized view needs to be replaced. */
+   if (is_matview)
+   {
+   LOCKMODElockmode;
 
-   /*
-	 * Create the relation.  (This will error out if there's an existing view,
-* so we don't need more code to complain if "replace" is false.)
-*/
-   intoRelationAddr = DefineRelation(create, relkind, InvalidOid, NULL, NULL);
+   lockmode = into->replace ? AccessExclusiveLock : NoLock;
+   (void) RangeVarGetAndCheckCreationNamespace(into->rel, lockmode,
+   
&matviewOid);
+   replace = OidIsValid(matviewOid) && into->replace;
+   }
 
-   /*
-* If necessary, create a TOAST table for the target table.  Note that
-* NewRelationCreateToastTable ends with CommandCounterIncrement(), so
-* that the TOAST table will be visible for insertion.
-*/
-   CommandCounterIncrement();
+   if (is_matview && replace)
+   {
+   Relation