Re: alphabetize long options in pg_dump[all] docs
On 2025-Apr-30, Peter Eisentraut wrote:

> On 29.04.25 23:54, Nathan Bossart wrote:
> > Fair point.  We seem to be pivoting towards long options, anyway.  If
> > there's support for this, I could go through all the client and server
> > application docs to ensure they match this style.
>
> However, I think this would require coordination across all --help output
> and man pages (76 objects), so for the short term, let's just move recently
> added options to the right place under the current theory/theories, and
> leave a larger reshuffling for later.

+1 WFM

--
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pero la cosa no es muy grave ..." (le petit Nicolas -- René Goscinny)
Re: [RFC] Lock-free XLog Reservation from WAL
Just rebase -- regards Yura Sokolov aka funny-falconFrom 4ea25d6feb655a072d1e9f40a547dc6aeab762ac Mon Sep 17 00:00:00 2001 From: Yura Sokolov Date: Sun, 19 Jan 2025 17:40:28 +0300 Subject: [PATCH v6] Lock-free XLog Reservation using lock-free hash-table Removed PrevBytePos to eliminate lock contention, allowing atomic updates to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing to store link to PrevBytePos. --- src/backend/access/transam/xlog.c | 585 +++--- src/tools/pgindent/typedefs.list | 2 + 2 files changed, 532 insertions(+), 55 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2d4c346473b..0dff9addfe1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -68,6 +68,8 @@ #include "catalog/pg_database.h" #include "common/controldata_utils.h" #include "common/file_utils.h" +#include "common/hashfn.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "miscadmin.h" #include "pg_trace.h" @@ -379,6 +381,94 @@ typedef union WALInsertLockPadded char pad[PG_CACHE_LINE_SIZE]; } WALInsertLockPadded; +/* #define WAL_LINK_64 0 */ +#ifndef WAL_LINK_64 +#ifdef PG_HAVE_ATOMIC_U64_SIMULATION +#define WAL_LINK_64 0 +#else +#define WAL_LINK_64 1 +#endif +#endif + +/* + * It links current position with previous one. + * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32)) + * Since CurrBytePos grows monotonically and it is aligned to MAXALIGN, + * CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of + * WAL logs. + * - CurrPosHigh is (CurrBytePos>>32), it is stored for strong uniqueness check. + * - PrevSize is difference between CurrBytePos and PrevBytePos + */ +typedef struct +{ +#if WAL_LINK_64 + uint64 CurrPos; + uint64 PrevPos; +#define WAL_PREV_EMPTY (~((uint64)0)) +#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY) +#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos) +#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0) +#else + uint32 CurrPosId; + uint32 CurrPosHigh; + uint32 PrevSize; +#define WALLinkEmpty(l) ((l).PrevSize == 0) +#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh) +#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0) +#endif +} WALPrevPosLinkVal; + +/* + * This is an element of lock-free hash-table. + * In 32 bit mode PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed. + * In 64 bit mode lock protocol is more complex. + */ +typedef struct +{ +#if WAL_LINK_64 + pg_atomic_uint64 CurrPos; + pg_atomic_uint64 PrevPos; +#else + pg_atomic_uint32 CurrPosId; + uint32 CurrPosHigh; + pg_atomic_uint32 PrevSize; + uint32 pad; /* to align to 16 bytes */ +#endif +} WALPrevPosLink; + +StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes"); + +#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2) +StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)), + "PREV_LINKS_HASH_CAPA should be power of two"); +StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX, + "PREV_LINKS_HASH_CAPA is too large"); + +/*--- + * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in hash table + * 1 - 4 positions h1,h1+1,h2,h2+2 - it guarantees at least 3 distinct points, + * but may spread at 4 cache lines. + * 2 - 4 positions h,h^1,h^2,h^3 - 4 points in single cache line. + * 3 - 8 positions h1,h1^1,h1^2,h1^4,h2,h2^1,h2^2,h2^3 - 8 distinct points in + * in two cache lines. 
+ */ +#ifndef PREV_LINKS_HASH_STRATEGY +#define PREV_LINKS_HASH_STRATEGY 3 +#endif + +#if PREV_LINKS_HASH_STRATEGY <= 2 +#define PREV_LINKS_LOOKUPS 4 +#else +#define PREV_LINKS_LOOKUPS 8 +#endif + +struct WALPrevLinksLookups +{ + uint16 pos[PREV_LINKS_LOOKUPS]; +}; + +#define SWAP_ONCE_IN 128 + /* * Session status of running backup, used for sanity checks in SQL-callable * functions to start and stop backups. @@ -390,17 +480,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; */ typedef struct XLogCtlInsert { - slock_t insertpos_lck; /* protects CurrBytePos and PrevBytePos */ - /* * CurrBytePos is the end of reserved WAL. The next record will be - * inserted at that position. PrevBytePos is the start position of the - * previously inserted (or rather, reserved) record - it is copied to the - * prev-link of the next record. These are stored as "usable byte - * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()). + * inserted at that position. + * + * The start position of the previously inserted (or rather, reserved) + * record (it is copied to the prev-link of the next record) will be + * stored in PrevLinksHash. + * + * These are stored as "usable byte positions" rather than XLogRecPtrs + * (see XLogBytePosToRecPtr()). */ - uint64 CurrBytePos; - uint64 PrevBytePos; + pg_atomic_uint64 CurrBytePos; /*
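For illustration only (not code from the patch): the 32-bit link encoding described in the WALPrevPosLinkVal comment above could be filled in as follows, assuming the fields are set exactly as that comment describes. The helper name is hypothetical.

```c
/*
 * Hypothetical illustration of the 32-bit WALPrevPosLinkVal encoding:
 * CurrPosId folds the 64-bit usable byte position, CurrPosHigh keeps the
 * high word for a stronger uniqueness check, and PrevSize stores the
 * distance back to the previously reserved record.
 */
static inline void
wal_link_set(WALPrevPosLinkVal *val, uint64 CurrBytePos, uint64 PrevBytePos)
{
	val->CurrPosId = (uint32) (CurrBytePos ^ (CurrBytePos >> 32));
	val->CurrPosHigh = (uint32) (CurrBytePos >> 32);
	val->PrevSize = (uint32) (CurrBytePos - PrevBytePos);
}
```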
Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN
On Wed, Apr 30, 2025 at 5:15 AM Peter Eisentraut wrote:
>
> On 28.04.25 18:56, Álvaro Herrera wrote:
> > On 2025-Apr-23, Nathan Bossart wrote:
> >
> >> On Mon, Mar 24, 2025 at 11:37:20AM +0100, Álvaro Herrera wrote:
> >
> >>> I'd add a note about these two things to the open items page, and wait
> >>> to see if we get some of these limitations fixed, so that if we don't,
> >>> we remember to note this limitation in the documentation.
> >>
> >> Are we still waiting on something for this, or should we proceed with the
> >> documentation changes?  It doesn't seem tremendously urgent, but I noticed
> >> it's been about a month since the last message on this thread.
> >
> > I've edited the Open Items page to disclaim my responsibility from this
> > item, since this comes from virtual generated columns which is not my
> > turf.  I think we should just document the current state of affairs; we
> > can come back with further code improvements during the next cycle.
>
> Here is a proposed patch that includes some text about virtual generated
> columns and also fixes up a small mistake in the previous patch
> (confused identity and generated columns) and improves the wording and
> formatting a bit more.

If I were going to quibble, I'd probably rewrite the second paragraph as

+Changing the type of an existing column will normally cause the entire table
+and its indexes to be rewritten.
+As an exception, when changing the type of an existing column,
 if the USING clause does not change the column contents and the old type is
 either binary coercible to the new type or an unconstrained domain over the
 new type, a table rewrite is not
-needed. However, indexes must always be rebuilt unless the system
+needed. However, indexes will still need to be rebuilt unless the system
 can verify that the new index would be logically equivalent to the existing
 one. For example, if the collation for a column has been changed, an index
 rebuild is required because the new sort order might be different. However,
 in the absence of a collation change, a column can be changed from text to
 varchar (or vice versa) without rebuilding the indexes
-because these data types sort identically. Table and/or index
+because these data types sort identically.

But otherwise this LGTM.

Robert Treat
https://xzilla.net
RE: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > wrote: > > > > Your fix looks good to me. Is it worth considering putting an > > assertion to verify if new two_phase_at is equal to or greater than > > confirmed_lsn (or at least it doesn't go backward), when syncing > > two_phase_at? > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > should be less than or equal to confirmed_flush). I have done that, changed a > few comments, and committed the patch. I noticed a BF failure[1] related to the fix, where the recently added assertion Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. After examining the logs and code, I couldn't identify any functionality issues. Here is my analysis: AFAICS, the last slotsync cycle should have retrieved the latest confirmed_flush and two_phase_at from the source slot, both expected to be 0/6005150. Here are some details: The standby's log[1] shows the source slot's two_phase_at as 0/6005150. Meanwhile, the publisher's log reveals that the source slot's confirmed_flush was already 0/6005150 before the last slotsync. Therefore, the last slotsync cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. If we assume remote_slot->confirmed_lsn during the last sync cycle is 0/6005150, then either the local slot's confirmed_lsn had already been updated to this value in a prior sync, leading it to skip the update in the last cycle (in which case, the assertion shouldn't be broken), or it should enter the slot update branch to set the synced slot to 0/6005150 (see update_local_synced_slot() for details). Thus, theoretically, the assertion wouldn't fail. Beyond functionality problems, another possibility might be the CPU reordered memory access, causing the assertion to execute before updating confirmed_flush. However, we lack the information needed to verify this, so one idea is to add some DEBUG message in update_local_synced_slot() to facilitate tracking if the issue recurs. I'll continue to investigate, but any suggestions or insights would be greatly appreciated. [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=calliphoridae&dt=2025-04-29%2009%3A11%3A31&stg=recovery-check [2] 2025-04-29 09:18:05.641 UTC [1023004][client backend][0/2:0] LOG: statement: SELECT two_phase AND '0/6005150' = two_phase_at from pg_replication_slots WHERE slot_name = 'lsub1_slot' AND synced AND NOT temporary; [3] 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.489 UTC [1022911][client backend][:0] LOG: disconnection: session time: 0:00:00.005 user=bf database=postgres host=[local] 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, reading WAL from 0/6004A00. 
2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.518 UTC [1022687][client backend][9/19:0] LOG: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, two_phase_at, failover, database, invalidation_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2025-04-29 09:18:05.524 UTC [1022928][not initialized][:0] LOG: connection received: host=[local] Best Regards, Hou zj
Re: RFC: Logging plan of the running query
Hi, On 2025-04-28 08:55, Hannu Krosing wrote: Have you also checked out https://github.com/postgrespro/pg_query_state which logs running query plan AND collected counts and timings as a response to a signal? Yes. For example, see the discussion: https://www.postgresql.org/message-id/d68c3ae31672664876b22d2dcbb526d2%40postgrespro.ru diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index a750dc8..e1b0be5 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3492,6 +3492,8 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + CheckAndHandleCustomSignals(); Has this ever been discussed for inclusion in core ? As far as I understand from reading a bit of pg_query_state, it registers custom interrupts to obtain the query state, including the output of EXPLAIN: -- pg_query_state/patches/custom_signals_17.0.patch diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index a750dc8..e1b0be5 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3492,6 +3492,8 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + CheckAndHandleCustomSignals(); However, we believe it is not safe to perform something as complex as EXPLAIN during an interrupt. For more details, please refer to the discussion below: https://www.postgresql.org/message-id/CA%2BTgmobH%2BUto9MCD%2BvWc71bVbOnd7d8zeYjRT8nXaeLe5hsNJQ%40mail.gmail.com Previous patches in this thread also attempted a similar approach, but due to the safety concerns mentioned above, we decided to explore alternative solutions. As a result, we are currently proposing an approach based on wrapping plan nodes instead. -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Re: alphabetize long options in pg_dump[all] docs
On 29.04.25 23:54, Nathan Bossart wrote:

On Tue, Apr 29, 2025 at 11:45:11PM +0200, Álvaro Herrera wrote:

I think the concept here is that all short options go first in alphabetical
order, then the long options in their own alphabetical order, and if one
option has both, then the short option takes precedence.

That's what it looks like to me, too.

If that's the idea, then --filter in pg_dumpall is in the wrong place, and
other than that it looks good.

I missed that one, thanks.

I think that's what gives the shorter patch.  But where would you look for,
say, --large-objects?  I mean, how do you know that its short version is -b?
Maybe it would make more sense to sort on long options first and put short
options as the second-priority item for each option.

Fair point.  We seem to be pivoting towards long options, anyway.  If there's
support for this, I could go through all the client and server application
docs to ensure they match this style.

There are two styles currently in use: First, as described above, list all
short options first, then all long-only options.  The second style is that
long-only options are listed alphabetically between short options.  I think
both of these styles are used in --help output and man pages, and I've long
had a desire to unify under one style.  Which would also be helpful to offer
guidance when new options are added.

However, I think this would require coordination across all --help output and
man pages (76 objects), so for the short term, let's just move recently added
options to the right place under the current theory/theories, and leave a
larger reshuffling for later.
Re: Introduce some randomness to autovacuum
Hi Nikita, wenhui,

On Fri, Apr 25, 2025 at 11:16 PM Nikita Malakhov wrote:
>
> Hi!
>
> I agree it is a good idea to shift the table list. Although vacuuming larger
> tables first is a questionable approach, because smaller ones could wait a
> long time to be vacuumed. It looks like the most obvious and simple way is
> that the first table to be vacuumed should not be the first one from the
> previous iteration.
>
> On Fri, Apr 25, 2025 at 6:04 PM wenhui qiu wrote:
>>
>> Hi, I like your idea. It would be even better if the weights could be taken
>> according to the larger tables.
>>
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/

Thanks for your feedback.

I ended up adding a GUC configuration that may support different vacuum
strategies. I named it `autovacuum_vacuum_strategy`, but you might have a
better name. For now it supports only two strategies:

1. Sequential: Tables are vacuumed in the order they are collected.
2. Random: The list of tables is rotated around a randomly chosen pivot
   before vacuuming to avoid always starting with the same table, which
   prevents vacuuming starvation for some tables (a rough sketch of the
   rotation follows below).

We can extend this with other strategies, such as prioritization or other
algorithms, in the future.

--
Regards
Junwang Zhao

v1-0001-Introduce-autovacuum-vacuum-strategy.patch
Description: Binary data
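A rough sketch of the rotation described in strategy 2 above — illustrative only, not the attached patch; the helper name, its placement, and the use of the global PRNG state are assumptions:

```c
/*
 * Rotate the collected table_oids list around a random pivot so that each
 * autovacuum cycle starts from a different table.  Uses PostgreSQL's List
 * and pg_prng APIs; assumes the usual autovacuum.c includes plus
 * common/pg_prng.h.
 */
static List *
rotate_table_list(List *table_oids)
{
	int			n = list_length(table_oids);
	int			pivot;
	List	   *rotated = NIL;
	ListCell   *lc;
	int			i;

	if (n < 2)
		return table_oids;

	pivot = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, n - 1);

	/* append elements pivot .. n-1 first */
	i = 0;
	foreach(lc, table_oids)
	{
		if (i++ >= pivot)
			rotated = lappend_oid(rotated, lfirst_oid(lc));
	}

	/* then wrap around with elements 0 .. pivot-1 */
	i = 0;
	foreach(lc, table_oids)
	{
		if (i++ >= pivot)
			break;
		rotated = lappend_oid(rotated, lfirst_oid(lc));
	}

	return rotated;
}
```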
Re: Some problems regarding the self-join elimination code
On 4/30/25 13:22, Alexander Korotkov wrote:

Thank you, Andrei. I've put it all together.

0001 Fixes material bugs in ChangeVarNodes_walker(), including a regression test
0002 Puts back comments which got accidentally removed
0003 Refactors ChangeVarNodesExtended() with a custom user-defined callback

I've revised the remaining refactoring patch. I've made the callback an
additional callback, not a replacement for the walker. It looks better to me
now. Also, I've written some comments and the commit message. Any thoughts?

It seems quite elegant to me. I have not precisely examined the part with the
RestrictInfo replacement logic - I guess you just copied it - but I reviewed
the rest of the patch. Generally, it looks good, but let me be a little picky.

1. Looking into the callback-related code, I suggest changing the name of
   ChangeVarNodes_callback to something less general and more specific - like
   replace_relid_callback. My variant doesn't seem the best, but a
   general-purpose name seems worse.

2. This callback doesn't modify the query replacement logic's behaviour - it
   only affects expressions. It makes sense to mention that in the description
   of ChangeVarNodesExtended().

3. Should the ChangeVarNodesWalkExpression function return the walker's return
   value?

--
regards, Andrei Lepikhov
Re: Accounting for metapages in genericcostestimate()
Álvaro Herrera writes:
> On 2025-Apr-28, Tom Lane wrote:
>> +	BlockNumber	numNonLeafPages;	/* # of index pages that are not leafs */

> I find the use of "leafs" as plural for "leaf" a bit strange ...
> We already have uses of that word, but I wonder if they don't mostly
> or exclusively come from non-native English speakers.

Yeah, "leaves" would be correct, but I wondered whether that'd confuse
non-native speakers more.  Happy to change it though.

			regards, tom lane
Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Hi,

We would like to propose the removal of the Instruction Synchronization
Barrier (isb) in spin_delay() for aarch64 architectures. Based on our testing
on Graviton instances (m7g.16xlarge), we see on average, over multiple
iterations, up to 12% better performance using PGBench select-only and up to
9% with Sysbench oltp_read_only workloads. On Graviton4 (m8g.24xlarge),
results are up to 8% better using PGBench select-only and up to 6% with
Sysbench oltp_read_only workloads.

We have also tested it while putting more pressure on the spin_delay
function, by enabling pg_stat_statements.track_planning with PGBench
read-only [0], and on average the patch shows up to 27% better performance on
m6g.16xlarge and up to 37% on m7g.16xlarge.

Testing environment:
- PostgreSQL version: 17.2
- Operating System: Ubuntu 22.04
- Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge and m8g.24xlarge)

Our benchmark results on PGBench select-only without
pg_stat_statements.track_planning:

```
# Load DB on m7g.16xlarge
$ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85 --username=postgres pgtest

# Without patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
"transaction type: ",
"scaling factor: 5644",
"query mode: prepared",
"number of clients: 256",
"number of threads: 96",
"duration: 600 s",
"number of transactions actually processed: 359864937",
"latency average = 0.420 ms",
"latency stddev = 1.755 ms",
"tps = 599770.727316 (including connections establishing)",
"tps = 599826.788919 (excluding connections establishing)"

# With patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
"transaction type: ",
"scaling factor: 5644",
"query mode: prepared",
"number of clients: 256",
"number of threads: 96",
"duration: 600 s",
"number of transactions actually processed: 405891881",
"latency average = 0.371 ms",
"latency stddev = 0.569 ms",
"tps = 676480.900049 (including connections establishing)",
"tps = 676523.557293 (excluding connections establishing)"
```

[0] https://www.postgresql.org/message-id/ZxgDEb_VpWyNZKB_%40nathan

0001-Remove-Instruction-Synchronization-Barrier-in-spin_d.patch
Description: Binary data
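For context, the aarch64 spin_delay() being discussed lives in src/include/storage/s_lock.h and looks roughly like the following (paraphrased, not copied verbatim from any particular release); the proposal is essentially to delete the isb so that ARM64 falls back to the generic no-op SPIN_DELAY().

```c
/* Roughly the current aarch64 delay primitive; the proposed patch removes
 * the isb instruction from the inline assembly. */
#if defined(__aarch64__)
#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
	/* Instruction Synchronization Barrier, used as a spin-wait hint */
	__asm__ __volatile__(
		" isb;	\n");
}
#endif
```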
Re: Accounting for metapages in genericcostestimate()
On 2025-Apr-28, Tom Lane wrote:

> @@ -135,6 +141,7 @@ typedef struct
>  	double		numIndexTuples;	/* number of leaf tuples visited */
>  	double		spc_random_page_cost;	/* relevant random_page_cost value */
>  	double		num_sa_scans;	/* # indexscans from ScalarArrayOpExprs */
> +	BlockNumber	numNonLeafPages;	/* # of index pages that are not leafs */
>  } GenericCosts;

The idea you described seems quite reasonable, though I didn't review the
patch in detail.

I find the use of "leafs" as plural for "leaf" a bit strange ...
We already have uses of that word, but I wonder if they don't mostly
or exclusively come from non-native English speakers.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Re: Enhancing Memory Context Statistics Reporting
> On 30 Apr 2025, at 12:14, Peter Eisentraut wrote:
>
> On 29.04.25 15:13, Rahila Syed wrote:
>> Please find attached a patch with some comments and documentation changes.
>> Additionally, added a missing '\0' termination to the "Remaining Totals"
>> string. I think this became necessary after we replaced dsa_allocate0()
>> with dsa_allocate() in the latest version.
>
> >		strncpy(nameptr, "Remaining Totals", namelen);
> > +	nameptr[namelen] = '\0';
>
> Looks like a case for strlcpy()?

True.  I did go ahead with the strncpy and nul terminator assignment, mostly
out of muscle memory, but I agree that this would be a good place for a
strlcpy() instead.

--
Daniel Gustafsson
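For reference, the strlcpy() form being discussed would look something like the sketch below. The exact size argument is an assumption: it presumes namelen counts the characters the destination can hold, with room for one extra terminating byte, as the explicit nameptr[namelen] = '\0' in the original implies.

```c
/* before: copy, then terminate by hand */
strncpy(nameptr, "Remaining Totals", namelen);
nameptr[namelen] = '\0';

/* after: strlcpy() always NUL-terminates within the given buffer size */
strlcpy(nameptr, "Remaining Totals", namelen + 1);
```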
Re: Introduce some randomness to autovacuum
On Fri, Apr 25, 2025 at 10:02:49PM +0800, Junwang Zhao wrote:
> After watching Robert's talk[1] on autovacuum and participating in the related
> workshop yesterday, it appears that people are inclined to use prioritization
> to address the issues highlighted in Robert's presentation.  Here I list two
> of the failure modes that were discussed.
>
> - Spinning. Running repeatedly on the same table but not accomplishing
>   anything useful.
> - Starvation. autovacuum can't vacuum everything that needs vacuuming.
> - ...
>
> The prioritization way needs some basic stuff that postgres doesn't have now.
>
> I had a random thought that introducing some randomness might help
> mitigate some of the issues mentioned above. Before performing vacuum
> on the collected tables, we could rotate the table_oids list by a random
> number within the range [0, list_length(table_oids)]. This way, every table
> would have an equal chance of being vacuumed first, thus no spinning and
> starvation.
>
> Even if there is a broken table that repeatedly gets stuck, this random
> approach would still provide opportunities for other tables to be vacuumed.
> Eventually, the system would converge.

First off, thank you for thinking about this problem and for sharing your
thoughts.  Adding randomness to solve this is a creative idea.

That being said, I am -1 for this proposal.  Autovacuum parameters and
scheduling are already quite complicated, and making it nondeterministic
would add an additional layer of complexity (and may introduce its own
problems).  But more importantly, IMHO it masks the problems instead of
solving them more directly, and it could mask future problems, too.  It'd
probably behoove us to think about the known problems more deeply and to
craft more targeted solutions.

--
nathan
Fix outdated comments for IndexInfo
Hi, all

While working on [1], I found outdated comments in IndexInfo.
The attached patch corrects them.

[1] https://www.postgresql.org/message-id/2A40921D-83AB-411E-ADA6-7E509A46F1E4%40logansw.com

--
Regards,
Japin Li

>From 7c01644860a32ca9ab93367c2f8e34047c9d703f Mon Sep 17 00:00:00 2001
From: Li Jianping
Date: Wed, 30 Apr 2025 23:33:04 +0800
Subject: [PATCH] Fix outdated comments for IndexInfo

* Commit 78416235713 removed the ii_OpclassOptions field.
* Commit 94aa7cc5f70 added the ii_NullsNotDistinct field.
* Commit fc0438b4e80 added the ii_WithoutOverlaps field.

All previous comments were not updated accordingly.
---
 src/include/nodes/execnodes.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..076ffa45d60 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -174,13 +174,14 @@ typedef struct ExprState
  *		UniqueProcs
  *		UniqueStrats
  *		Unique				is it a unique index?
- *		OpclassOptions		opclass-specific options, or NULL if none
+ *		NullsNotDistinct	is unique nulls distinct?
  *		ReadyForInserts		is it valid for inserts?
  *		CheckedUnchanged	IndexUnchanged status determined yet?
  *		IndexUnchanged		aminsert hint, cached for retail inserts
  *		Concurrent			are we doing a concurrent index build?
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
+ *		WithoutOverlaps		is it a without overlaps index?
  *		ParallelWorkers		# of workers requested (excludes leader)
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
--
2.43.0
Re: Typo in multixact.c and jsonfuncs.c documentation
On Wed, 30 Apr 2025 at 19:13, Fujii Masao wrote:
> This commit seems to have caused an indent-check failure on the buildfarm
> member koel.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koel&dt=2025-04-30%2002%3A19%3A00

Thanks. Fixed.

David
Re: Persist injection points across server restarts
On Wed, Apr 30, 2025 at 10:52:51AM +0500, Andrey Borodin wrote:
> Did you consider custom resource manager to WAL-log injection
> points?  This way we do not need to flush them explicitly, they
> always will be persistent.

WAL-logging would not work in the case I've faced, because we need a point
much earlier than the startup process beginning any redo activity, so I
don't think that this would be useful.
--
Michael

signature.asc
Description: PGP signature
Re: alphabetize long options in pg_dump[all] docs
On Wed, Apr 30, 2025 at 04:46:37PM +0200, Álvaro Herrera wrote:
> On 2025-Apr-30, Peter Eisentraut wrote:
>> On 29.04.25 23:54, Nathan Bossart wrote:
>> > Fair point.  We seem to be pivoting towards long options, anyway.  If
>> > there's support for this, I could go through all the client and server
>> > application docs to ensure they match this style.
>>
>> However, I think this would require coordination across all --help output
>> and man pages (76 objects), so for the short term, let's just move recently
>> added options to the right place under the current theory/theories, and
>> leave a larger reshuffling for later.
>
> +1 WFM

Committed.

--
nathan
Re: vacuumdb --missing-stats-only and pg_upgrade from PG13
On Thu, Apr 24, 2025 at 03:25:55PM +0200, Christoph Berg wrote:
> Re: Nathan Bossart
>> Here is what I have staged for commit.  I'll aim to commit these patches
>> sometime next week to give time for additional feedback.
>
> I confirm my PG13 test table gets analyzed now with the patch.

Committed.

--
nathan
Re: Performance issues with v18 SQL-language-function changes
Hello Tom, Sorry if I've chosen the wrong thread, but starting from 0313c5dc6, the following script: CREATE TYPE aggtype AS (a int, b text); CREATE FUNCTION aggfns_trans(aggtype[], integer, text) RETURNS aggtype[] LANGUAGE sql AS 'SELECT array_append($1,ROW($2,$3)::aggtype)'; CREATE AGGREGATE aggfns(integer, text) (SFUNC = public.aggfns_trans, STYPE = public.aggtype[], INITCOND = '{}'); CREATE TABLE t(i int, k int); INSERT INTO t SELECT 1, 2 FROM generate_series(1, 4000); SET statement_timeout = '10s'; SELECT aggfns(i, repeat('x', 8192)) OVER (PARTITION BY i) FROM t; crashes the server for me like this: corrupted size vs. prev_size while consolidating 2025-04-30 19:40:04.209 UTC [286426] LOG: client backend (PID 286441) was terminated by signal 6: Aborted 2025-04-30 19:40:04.209 UTC [286426] DETAIL: Failed process was running: SELECT aggfns(i, repeat('x', 8192)) OVER (PARTITION BY i) FROM t; (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x73cc15c4526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x73cc15c288ff in __GI_abort () at ./stdlib/abort.c:79 #5 0x73cc15c297b6 in __libc_message_impl (fmt=fmt@entry=0x73cc15dce8d7 "%s\n") at ../sysdeps/posix/libc_fatal.c:132 #6 0x73cc15ca8fe5 in malloc_printerr (str=str@entry=0x73cc15dd1b30 "corrupted size vs. prev_size while consolidating") at ./malloc/malloc.c:5772 #7 0x73cc15cab144 in _int_free_merge_chunk (av=0x73cc15e03ac0 , p=0x5cb3ac57b2c0, size=12541904) at ./malloc/malloc.c:4695 #8 0x73cc15cadd9e in __GI___libc_free (mem=mem@entry=0x5cb3acd73290) at ./malloc/malloc.c:3398 #9 0x5cb3a0c2db4c in AllocSetFree (pointer=0x5cb3acd732c8) at aset.c:1107 #10 0x5cb3a0c381f8 in pfree (pointer=) at mcxt.c:241 #11 0x5cb3a067de98 in heap_free_minimal_tuple (mtup=) at heaptuple.c:1532 #12 0x5cb3a08b86a1 in tts_minimal_clear (slot=0x5cb3ac576fb0) at execTuples.c:532 #13 0x5cb3a08ab16e in ExecClearTuple (slot=0x5cb3ac576fb0) at ../../../src/include/executor/tuptable.h:460 #14 ExecFilterJunk (junkfilter=, slot=) at execJunk.c:277 #15 0x5cb3a08bdb03 in sqlfunction_receive (slot=, self=0x5cb3ac525ce0) at functions.c:2597 #16 0x5cb3a08ab4e7 in ExecutePlan (dest=0x5cb3ac525ce0, direction=, numberTuples=1, sendTuples=true, operation=CMD_SELECT, queryDesc=0x5cb3ac525d30) at execMain.c:1814 ... With some script modifications, I observed also other memory-context- related crashes. (Probably this effect is achieved just because of the performance optimization -- I haven't look deeper yet.) This issue is discovered with SQLsmith. Best regards, Alexander Lakhin Neon (https://neon.tech)
Re: AIO v2.5
Hi,

After a bit more private back and forth with Alexander I have found the
issue - and it's pretty stupid:

pgaio_io_wait_for_free() does what it says on the tin. For that, after a
bunch of other things, it finds the oldest in-flight IO and waits for it.

	PgAioHandle *ioh = dclist_head_element(PgAioHandle, node,
										   &pgaio_my_backend->in_flight_ios);

	switch (ioh->state)
	{
		...
		case PGAIO_HS_COMPLETED_IO:
		case PGAIO_HS_SUBMITTED:
			pgaio_debug_io(DEBUG2, ioh,
						   "waiting for free io with %d in flight",
						   dclist_count(&pgaio_my_backend->in_flight_ios));
			...
			pgaio_io_wait(ioh, ioh->generation);
			break;

The problem is that, if the log level is low enough, ereport() (which is
called by pgaio_debug_io()) processes interrupts. The interrupt processing
may end up executing ProcessBarrierSmgrRelease(), which in turn needs to wait
for all in-flight IOs before the IOs are closed. Which then leads to the

	elog(PANIC, "waiting for own IO in wrong state: %d", state);

error.

The waiting for in-flight IOs before closing FDs only happens with io_uring,
hence this only triggering with io_uring.

A similar set of steps can lead to the "no free IOs despite no in-flight IOs"
ERROR that Alexander also observed - if pgaio_submit_staged() triggers a
debug ereport that executes ProcessBarrierSmgrRelease() in an interrupt, we
might wait for all in-flight IOs during IO submission, triggering the error.

I'm somewhat amazed that Alexander could somewhat reliably reproduce this - I
haven't been able to do so once using Alexander's recipe.

I did find a reproducer though:

  c=32;pgbench -n -c$c -j$c -P1 -T1000 -f <(echo 'SELECT sum(abalance) FROM pgbench_accounts;')
  c=1;pgbench -n -c$c -j$c -P1 -T1000 -f <(echo 'DROP DATABASE IF EXISTS foo;CREATE DATABASE foo;')

triggers both issues quickly if run with log_min_messages=debug2 and
io_method=io_uring, and not when a non-debug log level is used.

I'm not yet sure how to best fix it - locally I have done so by making
pgaio_debug() do a HOLD_INTERRUPTS()/RESUME_INTERRUPTS() around the call to
ereport.  But that doesn't really seem great - otoh, requiring various pieces
of code to know that anything emitting debug messages needs to hold
interrupts etc makes for rare and hard to understand bugs.

We could just make the relevant functions hold interrupts, and that might be
the best path forward, but we don't really need to hold all interrupts
(e.g. termination would be fine), so it's a bit coarse grained.  It would
need to happen in a few places, which isn't great either.

Other suggestions?

Thanks again for finding and reporting this, Alexander!

Greetings,

Andres Freund
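For reference, the local workaround described above — holding interrupts around the debug ereport — might look roughly like the sketch below. The macro body is illustrative only and is not the actual pgaio_debug() definition; it simply shows the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() bracketing being discussed.

```c
/*
 * Illustrative sketch: prevent interrupt processing (and therefore
 * ProcessBarrierSmgrRelease()) from running inside the debug ereport().
 */
#define pgaio_debug(elevel, msg, ...) \
	do { \
		if (message_level_is_interesting(elevel)) \
		{ \
			HOLD_INTERRUPTS(); \
			ereport(elevel, \
					errhidestmt(true), errhidecontext(true), \
					errmsg_internal(msg, __VA_ARGS__)); \
			RESUME_INTERRUPTS(); \
		} \
	} while (0)
```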
Re: pgsql: Add function to log the memory contexts of specified backend pro
On Tue, Apr 6, 2021 at 12:45 AM Fujii Masao wrote:
> Add function to log the memory contexts of specified backend process.

Hi,

I think this might need a recursion guard. I tried this:

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index dc4c600922d..b219a934034 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3532,7 +3532,7 @@ ProcessInterrupts(void)
 	if (ParallelMessagePending)
 		ProcessParallelMessages();

-	if (LogMemoryContextPending)
+	if (true)
 		ProcessLogMemoryContextInterrupt();

 	if (PublishMemoryContextPending)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72f5655fb34..867fd7b0ad5 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -112,7 +112,7 @@ extern void ProcessInterrupts(void);
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
-	(unlikely(InterruptPending))
+	(unlikely(InterruptPending) || true)
 #else
 #define INTERRUPTS_PENDING_CONDITION() \
 	(unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \

That immediately caused infinite recursion, ending in a core dump:

frame #13: 0x000104607b00 postgres`errfinish(filename=, lineno=, funcname=) at elog.c:543:2 [opt]
frame #14: 0x000104637078 postgres`ProcessLogMemoryContextInterrupt at mcxt.c:1392:2 [opt]
frame #15: 0x0001044a901c postgres`ProcessInterrupts at postgres.c:3536:3 [opt]
frame #16: 0x000104607b54 postgres`errfinish(filename=, lineno=, funcname=) at elog.c:608:2 [opt] [artificial]
frame #17: 0x000104637078 postgres`ProcessLogMemoryContextInterrupt at mcxt.c:1392:2 [opt]
frame #18: 0x0001044a901c postgres`ProcessInterrupts at postgres.c:3536:3 [opt]

It might be unlikely that a process can be signalled fast enough to actually
fail in this way, but I'm not sure it's impossible, and I think we should be
defending against it. The most trivial recursion guard would be
HOLD_INTERRUPTS()/RESUME_INTERRUPTS() around
ProcessLogMemoryContextInterrupt(), but I think that's probably not quite
good enough, because it would make the backend impervious to
pg_terminate_backend() while it's dumping memory contexts, and that could be
a long time if the write blocks.

--
Robert Haas
EDB: http://www.enterprisedb.com
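One possible shape for a recursion guard that avoids blocking all interrupts — a hypothetical sketch, not a proposed patch — would be a static re-entrancy flag around the existing context-dumping code:

```c
/*
 * Sketch only: a re-entrancy flag guards against recursive invocation via
 * nested interrupt processing, while leaving other interrupts (such as
 * termination) deliverable.  The existing function body is elided.
 */
void
ProcessLogMemoryContextInterrupt(void)
{
	static bool in_progress = false;

	LogMemoryContextPending = false;

	if (in_progress)
		return;
	in_progress = true;

	PG_TRY();
	{
		/* ... existing memory-context-dumping code ... */
	}
	PG_FINALLY();
	{
		in_progress = false;
	}
	PG_END_TRY();
}
```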
Re: Introduce some randomness to autovacuum
> Yes, it is masking the problem, but maybe a better way to think about it is
> that it is delaying the performance impact, allowing more time for a manual
> intervention of the problematic table(s).

I question how the user will gauge the success of setting the strategy to
"random"?  They may make it random by default, but fall into the same issues
and revert it back to the default strategy.

But also, the key, as you mention, is "manual intervention", which requires
proper monitoring.  I will argue that for the two cases that this proposal is
seeking to correct, we already have good solutions that could be implemented
by a user.

Let's take the "spinning" case again.  If a table has some sort of problem
causing vacuum to error out, one can just disable autovacuum on a per-table
level and correct the issue.  Also, the xmin horizon being held back (which
is claimed to be the most common cause, and I agree with that), well, that
one is just going to cause all your autovacuums to become useless.

Also, I do think the starvation problem has a good answer now that
autovacuum_max_workers can be modified online.  Maybe something can be done
for autovacuum to auto-tune this setting to give more workers at times when
it's needed.  Not sure what that looks like, but it is more possible now that
this setting does not require a restart.

--
Sami Imseih
Amazon Web Services (AWS)
Re: Improve explicit cursor handling in pg_stat_statements
I forgot to add the proper tests to the "Normalize cursor utility statements"
patch. Reattaching v2.

I also want to add that the decision to not normalize the cursor name in the
FETCH command is because it would not make sense to combine FETCH commands
for various cursors into the same entry.

Regards,

--
Sami Imseih
Amazon Web Services (AWS)

v2-0002-Improve-cursor-handling-in-pg_stat_statements.patch
Description: Binary data

v2-0001-Normalize-cursor-utility-statements.patch
Description: Binary data
Re: Introduce some randomness to autovacuum
On Wed, Apr 30, 2025 at 10:07 AM Junwang Zhao wrote:
> I ended up with adding a guc configuration that may support different
> vacuum strategies.

+1 to this: it's a good solution to a tricky problem. I would be a -1 if this
were not a GUC. Yes, it is masking the problem, but maybe a better way to
think about it is that it is delaying the performance impact, allowing more
time for a manual intervention of the problematic table(s).

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Apr 30, 2025 at 5:55 AM Daniel Gustafsson wrote: > > To keep things moving: I assume this is unacceptable. So v10 redirects > > every access to a PGconn struct member through a shim, similarly to > > how conn->errorMessage was translated in v9. This adds plenty of new > > boilerplate, but not a whole lot of complexity. To try to keep us > > honest, libpq-int.h has been removed from the libpq-oauth includes. > > That admittedly seems like a win regardless. Yeah, it moves us much closer to the long-term goal. > We should either clarify that it was never shipped as part of libpq core, or > remove this altogether. Done in v11, with your suggested wording. > I think this explanatory paragraph should come before the function prototype. Done. > Nitpick, but it won't be .so everywhere. Would this be clearar if spelled out > with something like "do not rely on libpq-int.h when building libpq-oauth as > dynamic shared lib"? I went with "do not rely on libpq-int.h in dynamic builds of libpq-oauth", since devs are hopefully going to be the only people who see it. I've also fixed up an errant #endif label right above it. I'd ideally like to get a working split in for beta. Barring objections, I plan to get this pushed tomorrow so that the buildfarm has time to highlight any corner cases well before the Saturday freeze. I still see the choice of naming (with its forced-ABI break every major version) as needing more scrutiny, and probably worth a Revisit entry. The CI still looks happy, and I will spend today with VMs and more testing on the Autoconf side. I'll try to peer at Alpine and musl libc, too; dogfish and basilisk are the Curl-enabled animals that caught my attention most. Thanks! --Jacob 1: e86e93f7ac8 ! 1: 5a1d1345919 oauth: Move the builtin flow into a separate module @@ Commit message Per request from Tom Lane and Bruce Momjian. Based on an initial patch by Daniel Gustafsson, who also contributed docs changes. The "bare" -dlopen() concept came from Thomas Munro. Many many people reviewed the -design and implementation; thank you! +dlopen() concept came from Thomas Munro. Many people reviewed the design +and implementation; thank you! Co-authored-by: Daniel Gustafsson Reviewed-by: Andres Freund Reviewed-by: Christoph Berg +Reviewed-by: Daniel Gustafsson Reviewed-by: Jelte Fennema-Nio Reviewed-by: Peter Eisentraut Reviewed-by: Wolfgang Walther @@ src/interfaces/libpq-oauth/Makefile (new) ## src/interfaces/libpq-oauth/README (new) ## @@ +libpq-oauth is an optional module implementing the Device Authorization flow for -+OAuth clients (RFC 8628). It was originally developed as part of libpq core and -+later split out as its own shared library in order to isolate its dependency on -+libcurl. (End users who don't want the Curl dependency can simply choose not to -+install this module.) ++OAuth clients (RFC 8628). It is maintained as its own shared library in order to ++isolate its dependency on libcurl. (End users who don't want the Curl dependency ++can simply choose not to install this module.) + +If a connection string allows the use of OAuth, and the server asks for it, and +a libpq client has not installed its own custom OAuth flow, libpq will attempt @@ src/interfaces/libpq-oauth/README (new) +pg_fe_run_oauth_flow and pg_fe_cleanup_oauth_flow are implementations of +conn->async_auth and conn->cleanup_async_auth, respectively. 
+ ++At the moment, pg_fe_run_oauth_flow() relies on libpq's pg_g_threadlock and ++libpq_gettext(), which must be injected by libpq using this initialization ++function before the flow is run: ++ +- void libpq_oauth_init(pgthreadlock_t threadlock, + libpq_gettext_func gettext_impl, + conn_errorMessage_func errmsg_impl, @@ src/interfaces/libpq-oauth/README (new) + set_conn_altsock_func setaltsock_impl, + set_conn_oauth_token_func settoken_impl); + -+At the moment, pg_fe_run_oauth_flow() relies on libpq's pg_g_threadlock and -+libpq_gettext(), which must be injected by libpq using this initialization -+function before the flow is run. -+ +It also relies on access to several members of the PGconn struct. Not only can +these change positions across minor versions, but the offsets aren't necessarily +stable within a single minor release (conn->errorMessage, for instance, can @@ src/interfaces/libpq/fe-auth-oauth-curl.c => src/interfaces/libpq-oauth/oauth-cu +#define set_conn_altsock(CONN, VAL) do { CONN->altsock = VAL; } while (0) +#define set_conn_oauth_token(CONN, VAL) do { CONN->oaut
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 30 Apr 2025, at 19:59, Jacob Champion wrote:
> On Wed, Apr 30, 2025 at 5:55 AM Daniel Gustafsson wrote:
>> Nitpick, but it won't be .so everywhere. Would this be clearer if spelled
>> out with something like "do not rely on libpq-int.h when building
>> libpq-oauth as dynamic shared lib"?
>
> I went with "do not rely on libpq-int.h in dynamic builds of
> libpq-oauth", since devs are hopefully going to be the only people who
> see it. I've also fixed up an errant #endif label right above it.

That's indeed better than my suggestion.

> I'd ideally like to get a working split in for beta.

+Many

> Barring
> objections, I plan to get this pushed tomorrow so that the buildfarm
> has time to highlight any corner cases well before the Saturday
> freeze.

I'll try to kick the tyres a bit more as well.

--
Daniel Gustafsson
Re: Introduce some randomness to autovacuum
On Wed, Apr 30, 2025 at 1:56 PM Sami Imseih wrote:
> > Yes, it is masking the problem, but maybe a better way to think about it
> > is that it is delaying the performance impact, allowing more time for a
> > manual intervention of the problematic table(s).
>
> I question how the user will gauge the success of setting the strategy to
> "random"?  They may make it random by default, but fall into the same
> issues and revert it back to the default strategy.
>
> But also, the key as you mention is "manual intervention" which requires
> proper monitoring.  I will argue that for the two cases that this proposal
> is seeking to correct, we already have good solutions that could be
> implemented by a user.

I would have a lot more faith in this discussion if there was any kind of
external solution that had gained popularity as a general solution, but this
doesn't seem to be the case (and trying to wedge something into the server
will likely hurt that kind of research).

As an example, the first fallacy of autovacuum management is the idea that a
single strategy will always work. Having implemented a number of crude vacuum
management systems in user space already, I know that I have run into
multiple cases where I had to essentially build two different "queues" of
vacuums (one for xids, one for bloat) to be fed into Postgres so as not to be
blocked (in either direction) by conflicting priorities, no matter how wonky
things got. I can imagine a set of GUCs that we could put in to try to mimic
such types of behavior, but I imagine it would take quite a few rounds before
we got the behavior correct.

Robert Treat
https://xzilla.net
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Wed, 30 Apr 2025 at 17:41, Amit Kapila wrote: > > On Wed, Apr 30, 2025 at 11:22 AM Tom Lane wrote: > > > > Xuneng Zhou pointed out on Discord that the test case added by > > 7c99dc587 has caused repeated failures in CI --- though oddly, > > it's not failed in the buildfarm so far as I can find. The > > failures look like > > > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading > > publication: tap_pub_3) at > > /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103. > > > > I analyzed the relevant publisher-side CI Logs [1]: > ... > 2025-04-19 08:24:14.096 UTC [21961][client backend] > [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3 > values(1) > 2025-04-19 08:24:14.098 UTC [21961][client backend] > [024_add_drop_pub.pl][:0] LOG: disconnection: session time: > 0:00:00.003 user=postgres database=postgres host=[local] > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG: > released logical replication slot "tap_sub" > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG: > disconnection: session time: 0:00:00.329 user=postgres > database=postgres host=[local] > 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0] > LOG: connection received: host=[local] > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] > LOG: connection authenticated: user="postgres" method=trust > (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117) > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] > LOG: replication connection authorized: user=postgres > application_name=tap_sub > 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG: > statement: SELECT pg_catalog.set_config('search_path', '', false); > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG: > received replication command: IDENTIFY_SYSTEM > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] > STATEMENT: IDENTIFY_SYSTEM > 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG: > received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL > 0/0 (proto_version '4', streaming 'parallel', origin 'any', > publication_names '"tap_pub_ > ... > > This shows that walsender restarts after the "INSERT INTO tab_3 > values(1)" is processed by the previous walsender ("released logical > replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)"). > So, it is possible that the old apply worker has sent the confirmation > of WAL received location after the Insert (due to keep_alive message > handling). So, after the restart, the new walsender will start > processing WAL after the INSERT and wait for the skipped message LOG > timed out. > > Considering the above theory is correct, after "ALTER SUBSCRIPTION > tap_sub SET PUBLICATION tap_pub_3", we should wait for the new > walsender to restart. We are already doing the same for a similar case > in 001_rep_changes.pl (See "# check that change of connection string > and/or publication list causes restart of subscription workers. We > check the state along with application_name to ensure that the > walsender is (re)started.). > > Unfortunately, I will be away for the rest of the week. In the > meantime, if you or someone else is able to reproduce and fix it, then > good; otherwise, I'll take care of it after I return. I agree with your analysis. 
I was able to reproduce the issue by delaying the invalidation of the subscription until the walsender finished decoding the INSERT operation following the ALTER SUBSCRIPTION through a debugger and using the lsn from the pg_waldump of the INSERT after the ALTER SUBSCRIPTION. In this scenario, the confirmed_flush_lsn ends up pointing to a location after the INSERT. When the invalidation is eventually received and the apply worker/walsender is restarted, the restarted walsender begins decoding from that LSN—after the INSERT—which means the "skipped loading publication" warning is never triggered, causing the test to fail. Attached is a patch that ensures the walsender process is properly restarted after ALTER SUBSCRIPTION, preventing this race condition. Regards, Vignesh From 26c8efbf59a456f4ad8a87b504180449efe1cd69 Mon Sep 17 00:00:00 2001 From: Vignesh Date: Wed, 30 Apr 2025 21:38:33 +0530 Subject: [PATCH] Fix race condition after ALTER SUBSCRIPTION SET PUBLICATION Previously, after executing ALTER SUBSCRIPTION tap_sub SET PUBLICATION, we did not wait for the new walsender process to restart. As a result, an INSERT executed immediately after the ALTER could be decoded and the confirmed flush lsn is advanced. This could cause replication to resume from a point after the INSERT. In such cases, we miss the expected warning about the missing publication. To fix this, we now ensure that the walsender has restarted before continuing after ALTER SUBSCRIPTION. --- src/test/subscription/t/024_add_drop
Re: Introduce some randomness to autovacuum
> - Spinning. Running repeatedly on the same table but not accomplishing
>   anything useful.

> But more importantly, IMHO it masks the problems instead of
> solving them more directly, and it could mask future problems, too

To add more to Nathan's comment about masking future problems, this will not
solve the "spinning" problem because, if the most common reason for this is a
long-running transaction, etc., all your tables will eventually end up with
wasted vacuum cycles because the xmin horizon is not advancing.

--
Sami Imseih
Re: add --sequence-data to pg_dumpall
On Wed, Apr 30, 2025 at 09:29:59AM +0900, Michael Paquier wrote:
> On Tue, Apr 29, 2025 at 03:55:08PM -0500, Nathan Bossart wrote:
>> Assuming we want this patch, should we apply it to v18?  It's arguably an
>> oversight in the pg_dump --sequence-data commit, and pg_dumpall will just
>> pass the option through to pg_dump, but otherwise there's not a really
>> strong reason it can't wait.
>
> This reminds me of, that fixed a similar defect in pg_dumpall
> following the addition of an option in pg_dump where the former was
> forgotten:
> https://www.postgresql.org/message-id/YKHC%2BqCJvzCRVCpY%40paquier.xyz
>
> I agree with applying that to v18 now and treat it as a defect rather
> than wait for v19 and treat this patch as a new feature.  Bonus points
> for the patch being straight-forward.

Since there's precedent, I'll plan on committing this in the next few days
unless someone objects.  I've added the rest of the RMT to this thread, too,
just in case.

--
nathan
Re: pgsql: Add function to get memory context stats for processes
On Sat, Apr 26, 2025 at 5:07 PM Tomas Vondra wrote:
> > Just for the record, it sounds quite unsafe to me too. I could
> > credit it being all right to examine the process' MemoryContext data
> > structures, but calling dsa_create() from CFI seems really insane.
> > Way too many moving parts there.
>
> Maybe I'm oblivious to some very obvious issues, but why is this so
> different from everything else that is already called from
> ProcessInterrupts()? Perhaps dsa_create() is more complex compared to
> the other stuff (haven't checked), but why would it be unsafe?
>
> The one risk I can think of is a risk of recursion - if any of the
> functions called from ProcessGetMemoryContextInterrupt() does CFI, could
> that be a problem if there's another pending signal. I see some other
> handlers (e.g. ProcessParallelApplyMessages) handle this by explicitly
> holding interrupts. Should ProcessGetMemoryContextInterrupt() do the
> same thing?
>
> In any case, if DSA happens to not be the right way to transfer this,
> what should we use instead? The only thing I can think of is some sort
> of pre-allocated chunk of shared memory.

The big disadvantage of a pre-allocated chunk of shared memory is that it
would necessarily be fixed size, and a memory context dump can be really big.
Another alternative would be a temporary file.

But none of that settles the question of safety. I think part of the reason
why it's possible for us to have disagreements about whether this is safe is
that we don't have any clear documentation of what you can assume to be true
at a CHECK_FOR_INTERRUPTS(). It's not possible for there to be an LWLock held
at that point, because we hold off interrupts when you acquire an LWLock and
don't re-enable them until all LWLocks have been released. We can't be
holding a spinlock, either, because we only insert CHECK_FOR_INTERRUPTS() at
the top of loops, and code that holds a spinlock is supposed to be
straight-line code that never loops. But what else can you assume? Can you
assume, for example, that there's a transaction? I doubt it. Can you assume
that the transaction is non-aborted? I doubt that even more. There's no
obvious-to-me reason why those things should be true. And in fact if you try
this on a backend waiting in an aborted transaction, it breaks:

robert.haas=# select pg_backend_pid();
 pg_backend_pid
----------------
          19321
(1 row)

robert.haas=# begin;
BEGIN
robert.haas=*# select 1/0;
ERROR:  division by zero

And then from another session, run this command, using the PID from above:

select * from pg_get_process_memory_contexts(19321, false, 1);

Then you get:

2025-04-30 15:14:33.878 EDT [19321] ERROR:  ResourceOwnerEnlarge called after release started
2025-04-30 15:14:33.878 EDT [19321] FATAL:  terminating connection because protocol synchronization was lost

I kind of doubt whether that's the only problem here, but it's *a* problem.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: pgsql: Add function to get memory context stats for processes
> On 30 Apr 2025, at 22:17, Robert Haas wrote:
> I kind of doubt whether that's the only problem here, but it's *a* problem.

This is indeed exposing a pre-existing issue, which was reported in [0], and a
patch fixing it has been submitted.  I have been testing and reworking the
patch slightly, but due to $life and $work events have yet to have time to
push it.

--
Daniel Gustafsson

[0] 3eb40b3e-45c7-426a-b7f8-81f7d05a9...@oss.nttdata.com
Re: fixing CREATEROLE
necro-ing an old thread ...

Robert Haas writes:
> [ v4-0002-Add-new-GUC-createrole_self_grant.patch ]

I confess to not having paid close enough attention when these patches went
in, or I would have complained about createrole_self_grant.  It changes the
user-visible behavior of SQL commands, specifically CREATE ROLE.  We have
learned over and over again that GUCs that do that are generally a bad idea.

Two years later, it's perhaps too late to take it out again.  However, I'd at
least like to complain about the fact that it breaks pg_dumpall, which is
surely not expecting anything but the default behavior.  If for any reason
the restore is run under a non-default setting of createrole_self_grant,
there's a potential of creating role grants that were not there in the source
database.

Admittedly the damage is probably limited by the fact that it only applies if
the restoring user has CREATEROLE but not SUPERUSER, which I imagine is a
rare case.  But don't we need to add createrole_self_grant to the set of GUCs
that pg_dump[all] resets in the emitted SQL?

			regards, tom lane
Improve explicit cursor handling in pg_stat_statements
Hi hackers, I recently looked into a workload that makes heavy use of explicit cursors, and I found that pg_stat_statements can become a bottleneck. The application in question declares hundreds of cursors, and for each one, performs many FETCH and MOVE operations with varying fetch sizes. As a result, pg_stat_statements ends up overwhelmed by the deallocation (and garbage collection) of DECLARE CURSOR, FETCH, and MOVE entries. Each of these is associated with a unique queryId, which leads to bloated entries with limited diagnostic value. Other issues: 1. FETCH/MOVE statements don't really help much in laying blame on a specific query. the DECLARE CURSOR statement could have been evicted in pg_stat_statements by that point or a similar cursor name is pointing to a different query. Also, FETCH doesn't aggregate for the same cursor — e.g., FETCH 10 c1 and FETCH 20 c1 show up as separate entries. 2. DECLARE CURSOR doesn't provide execution stats for the underlying SQL. Enabling pg_stat_statements.track = 'all' can expose the underlying SQL, but adds overhead.There’s also a bug: the toplevel column for the underlying query is still marked as "t", even though you must set track "all" to see it. Based on this, I propose the following improvements: 1. Better normalization of cursor utility commands: 2. Normalize the cursor name in CLOSE. 3. Normalize fetch/move sizes in FETCH and MOVE. Users can use the rows and calls columns to derive average fetch size. Ideally I would want to normalize the cursor name and generate the queryId f the FETCH statement based on the underlying query, but that is not possible to do that post parsing. (The above normalizations of these utility statements will reduce the bloat.) 4. Track the underlying query of a cursor by default, even when pg_stat_statements.track_utility = off. I’ve attached two patches that implement this. Here's a quick example: ``` begin; declare cs1 cursor for select from pg_class; declare cs2 cursor for select from pg_class; fetch 10 cs1; fetch 20 cs1; fetch 10 cs1; fetch 10 cs2; close cs1; close cs2; declare cs1 cursor for select from pg_attribute; SELECT calls, rows, query, toplevel FROM pg_stat_statements ORDER BY query COLLATE "C"; commit; ``` current state: ``` postgres=*# SELECT calls, rows, query, toplevel FROM pg_stat_statements ORDER BY query COLLATE "C"; calls | rows | query | toplevel ---+--+---+-- 1 |1 | SELECT nameFROM pg_catalog.pg_available_extensions WHERE name LIKE $1 AND installed_version IS NULL+| t | | LIMIT $2 | 1 |0 | begin | t 1 |0 | close cs1 | t 1 |0 | close cs2 | t 1 |0 | create extension pg_stat_statements | t 1 |0 | declare cs1 cursor for select from pg_attribute | t 1 |0 | declare cs1 cursor for select from pg_class | t 1 |0 | declare cs2 cursor for select from pg_class | t 2 | 20 | fetch 10 cs1 | t 1 | 10 | fetch 10 cs2 | t 1 | 20 | fetch 20 cs1 | t (11 rows) ``` with both patches applied: ``` postgres=*# SELECT calls, rows, query, toplevel FROM pg_stat_statements ORDER BY query COLLATE "C"; calls | rows | query | toplevel ---+--++-- 1 |0 | begin | t 2 |0 | close $1 | t 1 |0 | declare $1 cursor for select from pg_attribute | t 2 |0 | declare $1 cursor for select from pg_class | t 3 | 40 | fetch $1 cs1 | t 1 | 10 | fetch $1 cs2 | t 2 | 50 | select from pg_class | t (7 rows) postgres=*# commit; COMMIT ``` FWIW, I raised this ~3 years ago [0], but there was not much interest. I have seen this being a problem a few times since then that I think something should be done about. 
I also was not happy with the approach I took in [0]. Looking forward to feedback! Regards, -- Sami Imseih Amazon Web Services (AWS) [0] https://www.postgresql.org/message-id/flat/203CFCF7-176E-4AFC-A48E-B2CECFECD6AA%40amazon.com v1-0001-Improve-cursor-handling-in-pg_stat_stateme
Re: Typo in multixact.c and jsonfuncs.c documentation
On 2025/04/30 10:41, David Rowley wrote: On Wed, 30 Apr 2025 at 13:25, Junwang Zhao wrote: Found a trivial typo in multixact.c, after searching the code base there is another such typo in jsonfuncs.c. Thanks. Pushed. This commit seems to have caused an indent-check failure on the buildfarm member koel. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koel&dt=2025-04-30%2002%3A19%3A00 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN
On 28.04.25 18:56, Álvaro Herrera wrote: On 2025-Apr-23, Nathan Bossart wrote: On Mon, Mar 24, 2025 at 11:37:20AM +0100, Álvaro Herrera wrote: I'd add a note about these two things to the open items page, and wait to see if we get some of these limitations fixed, so that if we don't, we remember to note this limitation in the documentation. Are we still waiting on something for this, or should we proceed with the documentation changes? It doesn't seem tremendously urgent, but I noticed it's been about a month since the last message on this thread. I've edited the Open Items page to disclaim my responsibility from this item, since this comes from virtual generated columns which is not my turf. I think we should just document the current state of affairs; we can come back with further code improvements during the next cycle. Here is a proposed patch that includes some text about virtual generated columns and also fixes up a small mistake in the previous patch (confused identity and generated columns) and improves the wording and formatting a bit more. From 33fb59c94ae3dbf6367e36c79f71dc9e291423d8 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 30 Apr 2025 11:11:15 +0200 Subject: [PATCH] doc: Improve explanations when a table rewrite is needed Further improvement for commit 11bd8318602. That commit confused identity and generated columns; fix that. Also, virtual generated columns have since been added; add more details about that. Also some small rewordings and reformattings to further improve clarity. Discussion: https://postgr.es/m/00e6eb5f5c793b8ef722252c7a519...@oss.nttdata.com --- doc/src/sgml/ref/alter_table.sgml | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index a75e75d800d..9bf7ca1462e 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -1436,22 +1436,31 @@ Notes Adding a column with a volatile DEFAULT -(e.g., clock_timestamp()), a generated column -(e.g., GENERATED BY DEFAULT AS IDENTITY), a domain -data type with constraints will require the entire table and its -indexes to be rewritten, as will changing the type of an existing -column. As an exception, when changing the type of an existing column, +(e.g., clock_timestamp()), a stored generated column, +an identity column, or a column with a domain data type that has +constraints will cause the entire table and its indexes to be rewritten. +Adding a virtual generated column never requires a rewrite. + + + +Changing the type of an existing column will also cause the entire table +and its indexes to be rewritten. +As an exception, when changing the type of an existing column, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not -needed. However, indexes must always be rebuilt unless the system +needed. However, indexes are always rebuilt unless the system can verify that the new index would be logically equivalent to the existing one. For example, if the collation for a column has been changed, an index rebuild is required because the new sort order might be different. However, in the absence of a collation change, a column can be changed from text to varchar (or vice versa) without rebuilding the indexes -because these data types sort identically. Table and/or index +because these data types sort identically. 
+ + + +Table and/or index rebuilds may take a significant amount of time for a large table, and will temporarily require as much as double the disk space. -- 2.49.0
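To make the documented rules concrete, here is a hedged SQL sketch (table and column names are invented, not taken from the patch) of the cases the reworded paragraphs describe:

```
-- Hedged illustration of the rewrite rules described above; names are invented.
CREATE TABLE rewrite_demo (id int, note text);

-- Constant default: no table rewrite, the default is applied lazily.
ALTER TABLE rewrite_demo ADD COLUMN added_at timestamptz DEFAULT '2025-01-01';

-- Volatile default: rewrites the table and its indexes.
ALTER TABLE rewrite_demo ADD COLUMN seen_at timestamptz DEFAULT clock_timestamp();

-- Identity column: also causes a rewrite.
ALTER TABLE rewrite_demo ADD COLUMN seq bigint GENERATED BY DEFAULT AS IDENTITY;

-- Virtual generated column: never requires a rewrite.
ALTER TABLE rewrite_demo ADD COLUMN id_twice int GENERATED ALWAYS AS (id * 2) VIRTUAL;

-- text to varchar is binary coercible: no table rewrite, and per the text
-- above no index rebuild either, since the two types sort identically.
ALTER TABLE rewrite_demo ALTER COLUMN note TYPE varchar;
```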
Re: Some problems regarding the self-join elimination code
On Sun, Apr 27, 2025 at 2:02 PM Alexander Korotkov wrote: > > On Fri, Apr 11, 2025 at 5:46 PM Andrei Lepikhov wrote: > > On 4/10/25 14:39, Andrei Lepikhov wrote: > > > On 4/10/25 13:36, Alexander Korotkov wrote: > > >> On Wed, Apr 9, 2025 at 10:39 AM Andrei Lepikhov > > >> wrote: > > >>> It seems we are coming to the conclusion that join removal optimisation > > >>> may do something out of ChangeVarNodes resposibility. Before further > > >>> complicating of this function code I would like to know opinion of Tom, > > >>> who initially proposed [1] to use this routine. May be better a) return > > >>> to more specialised change_relid / sje_walker machinery or b) move > > >>> ChangeVarNodes out of rewriteManip and make it multi-purpose routine, > > >>> allowing to transform expression that may happen after a Var node > > >>> change? > > >> > > >> What about adding a callback to ChangeVarNodes_context that would > > >> called for each RestrictInfo after changing varnodes itself? SJE > > >> could use a callback that replaces OpExpr with NullTest when needed. > > > I think it is doable, of course. Just looking forward a little, it may > > > need more complication in the future (SJE definitely should be widened > > > to partitioned tables) and it may be simpler to have two different > > > routines for two different stages of planning. > > To provide some food for thought, here is a draft in attachment which > > addresses both issues: RestrictInfo relid replacement and move > > SJE-specific code out of the ChangeVarNodes routine (callback approach). > > Thank you, Andrei. I've put it all together. > 0001 Fixes material bugs in ChangeVarNodes_walker() including regression test > 0002 Puts back comments which got accidentally removed > 0003 Refactors ChangeVarNodesExtended() with custom user-defined callback > > I'm going to further work on improvement of these patches. Sorry, I accidentally sent some messages off-list. So, now 0001 and 0002 are pushed. I've revised the remaining refactoring patch. I've made a callback an additional callback, but not the replacement to the walker. It looks better for me now. Also, I've written some comments and the commit message. Any thoughts? -- Regards, Alexander Korotkov Supabase v4-0001-Refactor-ChangeVarNodesExtended-using-the-custom-.patch Description: Binary data
Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding
On Tue, Apr 22, 2025 at 5:00 PM Peter Smith wrote: > > Hi Ajin, > > Some review comments for patch v17-0001 > > == > Commit message > > 1. > Additionally, transactions containing WAL records (INTERNAL_SNAPSHOT, > COMMAND_ID, or INVALIDATION) that modify the historical snapshot > constructed during logical decoding are deemed unfilterable. This > is necessary because constructing a correct historical snapshot > for searching publication information requires processing these WAL record. > > ~ > > /these WAL record./these WAL records./ > Fixed. > == > src/backend/replication/logical/reorderbuffer.c > > ReorderBufferFilterByRelFileLocator: > > 2. > + if (found) > + { > + rb->try_to_filter_change = entry->filterable; > + return entry->filterable; > + } > + > > Bad indentation. > Fixed. > == > src/include/replication/reorderbuffer.h > > 3. > +extern bool ReorderBufferCanFilterChanges(ReorderBuffer *rb); > + > > Why is this extern here? This function is not implemented in patch 0001. > Fixed. On Wed, Apr 23, 2025 at 1:11 PM Peter Smith wrote: > > Hi Ajin, > > Some review comments for patch v17-0002. > > == > src/backend/replication/logical/decode.c > > 1. > /* > + * Check if filtering changes before decoding is supported and we're > not suppressing filter > + * changes currently. > + */ > +static inline bool > +FilterChangeIsEnabled(LogicalDecodingContext *ctx) > +{ > + return (ctx->callbacks.filter_change_cb != NULL && > + ctx->reorder->try_to_filter_change); > +} > + > > I still have doubts about the need for/benefits of this to be a > separate function. > > It is only called from *one* place within the other static function, > FilterChange. > > Just putting that code inline seems just as readable as maintaining > the separate function for it. > > SUGGESTION: > static inline bool > FilterChange(LogicalDecodingContext *ctx, XLogRecPtr origptr, TransactionId > xid, > RelFileLocator *target_locator, ReorderBufferChangeType > change_type) > { > return > ctx->callbacks.filter_change_cb && > ctx->reorder->try_to_filter_change && > ReorderBufferFilterByRelFileLocator(ctx->reorder, xid, origptr, > target_locator, > change_type)); > } > > Fixed. > == > src/backend/replication/logical/reorderbuffer.c > > RelFileLocatorCacheInvalidateCallback: > > 2. > if (hash_search(RelFileLocatorFilterCache, > - &entry->key, > - HASH_REMOVE, > - NULL) == NULL) > + &entry->key, > + HASH_REMOVE, > + NULL) == NULL) > elog(ERROR, "hash table corrupted"); > > I think this whitespace change belongs back in patch 0001 where this > function was introduced, not here in patch 0002. > Fixed. > ~~~ > > ReorderBufferFilterByRelFileLocator: > > 3. > + /* > + * Quick return if we already know that the relation is not to be > + * decoded. These are for special relations that are unlogged and for > + * sequences and catalogs. > + */ > + if (entry->filterable) > + return true; > > /These are for/This is for/ Fixed. > > ~~~ > > 4. 
> if (RelationIsValid(relation)) > { > - entry->relid = RelationGetRelid(relation); > + if (IsToastRelation(relation)) > + { > + char *toast_name = RelationGetRelationName(relation); > + int n PG_USED_FOR_ASSERTS_ONLY; > + > + n = sscanf(toast_name, "pg_toast_%u", &entry->relid); > + > + Assert(n == 1); > + } > + else > + entry->relid = RelationGetRelid(relation); > + > entry->filterable = false; > + rb->try_to_filter_change = rb->filter_change(rb, entry->relid, change_type, > + true, &cache_valid); > RelationClose(relation); > } > else > { > entry->relid = InvalidOid; > - entry->filterable = true; > + rb->try_to_filter_change = entry->filterable = true; > } > > rb->try_to_filter_change = entry->filterable; > ~ > > Something seems fishy here. AFAICT, the rb->try_to_filter_change will > already be assigned in both the *if* and the *else* blocks. So, why is > it being overwritten again outside that if/else? > Fixed. On Wed, Apr 23, 2025 at 1:49 PM Peter Smith wrote: > > Hi Ajin. > > Here are some v17-0003 review comments (aka some v16-0003 comments > that were not yet addressed or rejected) > > On Fri, Apr 11, 2025 at 4:05 PM Peter Smith wrote: > ... > > == > > Commit message > > > > 2. missing commit message > > Not yet addressed. Fixed. > > > ~~~ > > > > 8. > > # Insert, delete, and update tests for restricted publication tables > > $log_location = -s $node_publisher->logfile; > > $node_publisher->safe_psql('postgres', "INSERT INTO insert_only_table > > VALUES (1, 'to be inserted')"); > > $node_publisher->safe_psql('postgres', "UPDATE insert_only_table SET > > data = 'updated' WHERE id = 1"); > > $logfile = slurp_file($node_publisher->logfile, $log_location); > > ok($logfile =~ qr/Filtering UPDATE/, > > 'unpublished UPDATE is filtered'); > > > > $log_location = -s $node_publisher->logfile; > > $node_publisher->safe_psql('postgres', "DELETE FROM insert_only_table > > WHERE id = 1"); >
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 29 Apr 2025, at 02:10, Jacob Champion > wrote: > > On Wed, Apr 23, 2025 at 10:46 AM Jacob Champion > wrote: >> Are there any readers who feel like an internal ABI version for >> `struct pg_conn`, bumped during breaking backports, would be >> acceptable? (More definitively: are there any readers who would veto >> that?) > > To keep things moving: I assume this is unacceptable. So v10 redirects > every access to a PGconn struct member through a shim, similarly to > how conn->errorMessage was translated in v9. This adds plenty of new > boilerplate, but not a whole lot of complexity. To try to keep us > honest, libpq-int.h has been removed from the libpq-oauth includes. That admittedly seems like a win regardless. > This will now handle in-place minor version upgrades that swap pg_conn > internals around, so I've gone back to -MAJOR versioning alone. > fe_oauth_state is still exported; it now has an ABI warning above it. > (I figure that's easier to draw a line around during backports, > compared to everything in PGconn. We can still break things there > during major version upgrades.) While I'm far from the expert on this subject (luckily there are such in this thread), I am unable to see any sharp edges from reading and testing this version of the patch. A few small comments: +libpq-oauth is an optional module implementing the Device Authorization flow for +OAuth clients (RFC 8628). It was originally developed as part of libpq core and +later split out as its own shared library in order to isolate its dependency on +libcurl. (End users who don't want the Curl dependency can simply choose not to +install this module.) We should either clarify that it was never shipped as part of libpq core, or remove this altogether. I would vote for the latter since we typically don't document changes that happen during the devcycle. How about something like: +libpq-oauth is an optional module implementing the Device Authorization flow for +OAuth clients (RFC 8628). It is maintained as its own shared library in order to +isolate its dependency on libcurl. (End users who don't want the Curl dependency +can simply choose not to install this module.) +- void libpq_oauth_init(pgthreadlock_t threadlock, + +At the moment, pg_fe_run_oauth_flow() relies on libpq's pg_g_threadlock and +libpq_gettext(), which must be injected by libpq using this initialization +function before the flow is run. I think this explanatory paragraph should come before the function prototype. The following paragraph on the setters/getters make sense where it is though. +#if defined(USE_DYNAMIC_OAUTH) && defined(LIBPQ_INT_H) +#error do not rely on libpq-int.h in libpq-oauth.so +#endif Nitpick, but it won't be .so everywhere. Would this be clearar if spelled out with something like "do not rely on libpq-int.h when building libpq-oauth as dynamic shared lib"? -- Daniel Gustafsson
GIN tries to form a tuple with a partial compressedList during insertion
Hi, In the functions addItemPointersToLeafTuple and buildFreshLeafTuple (in gininsert.c), the result of ginCompressPostingList() is passed to GinFormTuple without checking whether ginCompressPostingList() successfully packed all items. These GinFormTuple calls consistently fail because the resulting tuples always exceed the maximum size. While this doesn’t result in data corruption, it might still be worth addressing. Additionally, the condition if (compressedList) is unnecessary, since ginCompressPostingList() never returns NULL. Please find the attached patch fixing it. Best regards, Arseniy Mukhin diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index a7b7b5996e3..5643c423627 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -218,7 +218,8 @@ addItemPointersToLeafTuple(GinState *ginstate, ItemPointerData *newItems, *oldItems; int oldNPosting, -newNPosting; +newNPosting, +nwritten; GinPostingList *compressedList; Assert(!GinIsPostingTree(old)); @@ -236,17 +237,18 @@ addItemPointersToLeafTuple(GinState *ginstate, /* Compress the posting list, and try to a build tuple with room for it */ res = NULL; compressedList = ginCompressPostingList(newItems, newNPosting, GinMaxItemSize, - NULL); +&nwritten); pfree(newItems); - if (compressedList) + if (nwritten == newNPosting) { res = GinFormTuple(ginstate, attnum, key, category, (char *) compressedList, SizeOfGinPostingList(compressedList), newNPosting, false); - pfree(compressedList); } +pfree(compressedList); + if (!res) { /* posting list would be too big, convert to posting tree */ @@ -293,17 +295,19 @@ buildFreshLeafTuple(GinState *ginstate, { IndexTuple res = NULL; GinPostingList *compressedList; +int nwritten; /* try to build a posting list tuple with all the items */ - compressedList = ginCompressPostingList(items, nitem, GinMaxItemSize, NULL); - if (compressedList) + compressedList = ginCompressPostingList(items, nitem, GinMaxItemSize, &nwritten); + if (nwritten == nitem) { res = GinFormTuple(ginstate, attnum, key, category, (char *) compressedList, SizeOfGinPostingList(compressedList), nitem, false); - pfree(compressedList); } +pfree(compressedList); + if (!res) { /* posting list would be too big, build posting tree */
Re: Enhancing Memory Context Statistics Reporting
On 29.04.25 15:13, Rahila Syed wrote:
> Please find attached a patch with some comments and documentation changes.
> Additionally, added a missing '\0' termination to the "Remaining Totals"
> string. I think this became necessary after we replaced dsa_allocate0()
> with dsa_allocate() in the latest version.

>    strncpy(nameptr, "Remaining Totals", namelen);
> +  nameptr[namelen] = '\0';

Looks like a case for strlcpy()?
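For readers unfamiliar with the difference, here is a minimal hedged C sketch (the buffer is assumed to be namelen + 1 bytes, as the patch's indexing implies) contrasting the two idioms; PostgreSQL supplies strlcpy() via src/port on platforms that lack it:

```
/*
 * Minimal sketch, not the actual patch.  strncpy() does not guarantee
 * NUL-termination when the source is at least namelen bytes long, hence
 * the explicit terminator in the patch; strlcpy() copies at most size - 1
 * bytes and always terminates.
 */
#include <string.h>

static void
fill_name(char *nameptr, size_t namelen)
{
	/* patch as posted: copy, then terminate by hand */
	strncpy(nameptr, "Remaining Totals", namelen);
	nameptr[namelen] = '\0';

	/* suggested alternative: one call, always terminated */
	strlcpy(nameptr, "Remaining Totals", namelen + 1);
}
```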
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Wed, Apr 30, 2025 at 11:22 AM Tom Lane wrote: > > Xuneng Zhou pointed out on Discord that the test case added by > 7c99dc587 has caused repeated failures in CI --- though oddly, > it's not failed in the buildfarm so far as I can find. The > failures look like > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading > publication: tap_pub_3) at > /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103. > I analyzed the relevant publisher-side CI Logs [1]: ... 2025-04-19 08:24:14.096 UTC [21961][client backend] [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3 values(1) 2025-04-19 08:24:14.098 UTC [21961][client backend] [024_add_drop_pub.pl][:0] LOG: disconnection: session time: 0:00:00.003 user=postgres database=postgres host=[local] 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG: released logical replication slot "tap_sub" 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG: disconnection: session time: 0:00:00.329 user=postgres database=postgres host=[local] 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] LOG: connection authenticated: user="postgres" method=trust (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117) 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] LOG: replication connection authorized: user=postgres application_name=tap_sub 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG: received replication command: IDENTIFY_SYSTEM 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] STATEMENT: IDENTIFY_SYSTEM 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', streaming 'parallel', origin 'any', publication_names '"tap_pub_ ... This shows that walsender restarts after the "INSERT INTO tab_3 values(1)" is processed by the previous walsender ("released logical replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)"). So, it is possible that the old apply worker has sent the confirmation of WAL received location after the Insert (due to keep_alive message handling). So, after the restart, the new walsender will start processing WAL after the INSERT and wait for the skipped message LOG timed out. Considering the above theory is correct, after "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_3", we should wait for the new walsender to restart. We are already doing the same for a similar case in 001_rep_changes.pl (See "# check that change of connection string and/or publication list causes restart of subscription workers. We check the state along with application_name to ensure that the walsender is (re)started.). Unfortunately, I will be away for the rest of the week. In the meantime, if you or someone else is able to reproduce and fix it, then good; otherwise, I'll take care of it after I return. [1] - https://api.cirrus-ci.com/v1/artifact/task/6561639182368768/testrun/build/testrun/subscription/024_add_drop_pub/log/024_add_drop_pub_publisher.log -- With Regards, Amit Kapila.
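A hedged sketch of the suggested wait (application name taken from the test above; this is not the actual fix), in the style of the check already done in 001_rep_changes.pl:

```
# Hedged sketch, not the committed fix: after ALTER SUBSCRIPTION ... SET
# PUBLICATION tap_pub_3, wait for the subscription's walsender to be
# (re)started on the publisher before looking for the "skipped loading
# publication" warning in the log.
$node_publisher->poll_query_until('postgres',
	"SELECT EXISTS (SELECT 1 FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming')"
) or die "timed out waiting for the walsender to restart";
```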
Re: Parallel CREATE INDEX for GIN indexes
On 4/18/25 03:03, Vinod Sridharan wrote: > Hello, > As part of testing this change I believe I found a scenario where the > parallel build seems to trigger OOMs for larger indexes. Specifically, > the calls for ginEntryInsert seem to leak memory into > TopTransactionContext and OOM/crash the outer process. > For serial build, the calls for ginEntryInsert tend to happen in a > temporary memory context that gets reset at the end of the > ginBuildCallback. > For inserts, the call has a custom memory context and gets reset at > the end of the insert. > For parallel build, during the merge phase, the MemoryContext isn't > swapped - and so this happens on the TopTransactionContext, and ends > up growing (especially for larger indexes). > Yes, that's true. The ginBuildCallbackParallel() already releases memory after flushing the in-memory state, but I missed _gin_parallel_merge() needs to be careful about memory usage too. I haven't been able to trigger OOM (or even particularly bad) memory usage, but I suppose it might be an issue with custom GIN opclasses with much wider keys. > I believe at the very least these should happen inside the tmpCtx > found in the GinBuildState and reset periodically. > > In the attached patch, I've tried to do this, and I'm able to build > the index without OOMing, and only consuming maintenance_work_mem > through the merge process. > > Would appreciate your thoughts on this (and whether there's other approaches > to > resolve this too). > The patch seems fine to me - I repeated the tests with mailing list archives, with MemoryContextStats() in _gin_parallel_merge, and it reliably minimizes the memory usage. So that's fine. I was also worried if this might have performance impact, but it actually seems to make it a little bit faster. I'll get this pushed. thanks -- Tomas Vondra
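For context, here is a hedged C fragment (not the committed patch; it assumes the tmpCtx that GinBuildState already carries) of the pattern being applied to the merge phase:

```
/*
 * Hedged fragment: run the merge phase's ginEntryInsert() calls inside the
 * build state's short-lived context and reset it periodically, so per-key
 * allocations do not accumulate in the caller's context (such as
 * TopTransactionContext).
 */
oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);

/* ... ginEntryInsert() calls for the merged keys go here ... */

MemoryContextSwitchTo(oldCtx);
MemoryContextReset(buildstate->tmpCtx);
```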
Re: Questions about logicalrep_worker_launch()
On 2025/04/29 21:21, Amit Kapila wrote: On Mon, Apr 28, 2025 at 2:37 PM Fujii Masao wrote: On 2025/04/26 3:03, Masahiko Sawada wrote: I agree with these changes. I think that while the changes for (2) should be for v19, the changes for (1) might be treated as a bug fix? Agreed. I've split the patch into two parts: 0001 is for (1) and is a bug fix that should be back-patched to v16, where parallel apply workers were introduced. Since it didn't apply cleanly to v16, I also created a separate patch specifically for v16. The situation for parallel_apply workers is not as bad as for tablesync workers because even if the worker for parallel apply is not available, the apply worker will apply the changes by itself. OTOH, if the tablesync worker is not available, the tablesync will be pending till the time a worker for the same is not available. So, I am not sure if this is a clear cut bug which requires a fix in backbranches. I'm fine with treating this as an improvement rather than a bug fix. In any case, I've registered the patches for the next CommitFest. The attached patches are the same as before, just rebased for the master branch. Additionally, shall we try to reproduce this case for parallel apply workers? I noticed this issue while reading the code, so I haven't actually reproduced it. Are you saying it's not possible to reproduce this in practice? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION From 134331b89c0c413b20866e25c332b23c253bfa54 Mon Sep 17 00:00:00 2001 From: Fujii Masao Date: Mon, 28 Apr 2025 14:29:26 +0900 Subject: [PATCH v3 1/2] Fix bug that could block the startup of parallel apply workers. If a logical replication worker fails to start and its parent crashes while waiting, its worker slot can remain marked as "in use". This can prevent new workers from starting, as the launcher may not find a free slot or may incorrectly think the sync or parallel apply worker limits have been reached. To handle this, the launcher already performs garbage collection when no free slot is found or when the sync worker limit is hit, and then retries launching workers. However, it previously did not trigger garbage collection when the parallel apply worker limit was reached. As a result, stale slots could block new parallel apply workers from starting, even though they could have been launched after cleanup. This commit fixes the issue by triggering garbage collection when the parallel apply worker limit is reached as well. If stale slots are cleared and the number of parallel apply workers drops below the limit, new parallel apply worker can then be started successfully. 
--- src/backend/replication/logical/launcher.c | 65 +++--- 1 file changed, 31 insertions(+), 34 deletions(-) diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index 10677da56b2..ac95afe4bae 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -96,7 +96,7 @@ static void logicalrep_launcher_onexit(int code, Datum arg); static void logicalrep_worker_onexit(int code, Datum arg); static void logicalrep_worker_detach(void); static void logicalrep_worker_cleanup(LogicalRepWorker *worker); -static int logicalrep_pa_worker_count(Oid subid); +static void logicalrep_worker_count(Oid subid, int *nsync, int *nparallelapply); static void logicalrep_launcher_attach_dshmem(void); static void ApplyLauncherSetWorkerStartTime(Oid subid, TimestampTz start_time); static TimestampTz ApplyLauncherGetWorkerStartTime(Oid subid); @@ -350,16 +350,21 @@ retry: } } - nsyncworkers = logicalrep_sync_worker_count(subid); + logicalrep_worker_count(subid, &nsyncworkers, &nparallelapplyworkers); now = GetCurrentTimestamp(); /* -* If we didn't find a free slot, try to do garbage collection. The -* reason we do this is because if some worker failed to start up and its -* parent has crashed while waiting, the in_use state was never cleared. +* If we can't start a new logical replication background worker because +* no free slot is available, or because the number of sync workers or +* parallel apply workers has reached the limit per subscriptoin, try +* running garbage collection. The reason we do this is because if some +* workers failed to start up and their parent has crashed while waiting, +* the in_use state was never cleared. By freeing up these stale worker +* slots, we may be able to start a new worker. */ - if (worker == NULL || nsyncworkers >= max_sync_workers_per_subscription) + if (worker == NULL || nsyncworkers >= max_sync_workers_per_subscription || + nparallelapplyworkers >= max_parallel_apply_workers_per_subscription) { bool
Re: Introduce some randomness to autovacuum
Hi Sami,

On Thu, May 1, 2025 at 1:56 AM Sami Imseih wrote:
>
> > Yes, it is masking the problem, but maybe a better way to think about it is
> > that it is delaying the
> > performance impact, allowing more time for a manual intervention of the
> > problematic table(s).
>
> I question how the user will gauge the success of setting the strategy
> to "random"? They may make
> it random by default, but fall into the same issues and revert it back
> to the default strategy.
>
> But also, the key as you mention is "manual intervention" which
> requires proper monitoring. I will
> argue that for the two cases that this proposal is seeking to correct,
> we already have good
> solutions that could be implemented by a user.
>
> Let's take the "spinning" case again. If a table has some sort of
> problem causing
> vacuum to error out, one can just disable autovacuum on a per-table
> level and correct
> the issue. Also, the xmin horizon being held back ( which is claimed
> to be the most common cause,
> and I agree with that ), well that one is just going to cause all your
> autovacuums to become
> useless.

Yeah, I tend to agree with you that the xmin horizon hold back will make autovacuum become useless for all tables. But I have a question; let me quote Andres' comment on Slack first:

```quote begin
It seems a bit silly to not just do some basic prioritization instead, but perhaps we just need to reach for some basic stuff, given that we seem unable to progress on prioritization.
```quote end

If randomness is not working, ISTM that the prioritization will not benefit the "spinning" case either, am I right?

> Also, I do think the starvation problem has a good answer now that
> autovacuum_max_workers
> can be modified online. Maybe something can be done for autovacuum to
> auto-tune this
> setting to give more workers at times when it's needed. Not sure what
> that looks like,
> but it is more possible now that this setting does not require a restart.

Good to know, thanks. One case I didn't mention is that some corruption due to vacuuming the same table might starve other tables too; randomness gives other tables some chance to be vacuumed. I do admit that multiple vacuum workers can eliminate this issue a little bit if the corrupted table's vacuum progress lasts for some time, but I think randomness is much better.

> --
> Sami Imseih
> Amazon Web Services (AWS)

--
Regards
Junwang Zhao
Re: Improve explicit cursor handling in pg_stat_statements
On Wed, Apr 30, 2025 at 02:43:41PM -0500, Sami Imseih wrote:
> I also want to add that the decision to not normalize the cursor name in
> the FETCH command is because it would not make sense to combine
> FETCH commands for various cursors into the same entry.

- calls | rows | query
+-------+------+------
- 2 | 0 | CLOSE cursor_stats_1
- 2 | 0 | DECLARE cursor_stats_1 CURSOR WITH HOLD FOR SELECT $1
+ calls | rows | query
+-------+------+------
+ 2 | 0 | CLOSE $1
+ 2 | 0 | DECLARE $1 CURSOR WITH HOLD FOR SELECT $2

Hmm. What are the workloads that you have seen as problematic? Do these involve cursor names generated randomly, where most of them are similar with a random factor for the name? Too much normalization here would limit the amount of verbosity that we have for this area, especially if we are dealing with query patterns that rely on a few CLOSE naming patterns spread across a couple of key queries, because we would not know anymore about their distribution.

- 1 | 5 | FETCH FORWARD 5 pgss_cursor
- 1 | 7 | FETCH FORWARD ALL pgss_cursor
- 1 | 1 | FETCH NEXT pgss_cursor
+ 1 | 0 | DECLARE $1 CURSOR FOR SELECT * FROM pgss_matv

Saying that, applying normalization for the number of FETCH looks like a natural move. It seems to me that we should still make a difference with the ALL case, though?

typedef struct ClosePortalStmt
{
	NodeTag		type;
-	char	   *portalname;	/* name of the portal (cursor) */
+	/* name of the portal (cursor) */
+	char	   *portalname pg_node_attr(query_jumble_ignore);
+	ParseLoc	location pg_node_attr(query_jumble_location);
	/* NULL means CLOSE ALL */

Could it matter to make a distinction with CLOSE ALL, compared to the case where the CLOSE statements are named? It would be possible to make a difference compared to the named case with an extra boolean field, for example. I would suggest adding some tests for CLOSE ALL anyway; we don't have any currently.
--
Michael

signature.asc Description: PGP signature
Re: Introduce some randomness to autovacuum
On Thu, May 1, 2025 at 8:12 AM David Rowley wrote: > > On Thu, 1 May 2025 at 03:29, Nathan Bossart wrote: > > That being said, I am -1 for this proposal. Autovacuum parameters and > > scheduling are already quite complicated, and making it nondeterministic > > would add an additional layer of complexity (and may introduce its own > > problems). But more importantly, IMHO it masks the problems instead of > > solving them more directly, and it could mask future problems, too. It'd > > probably behoove us to think about the known problems more deeply and to > > craft more targeted solutions. > > -1 from me too. > > It sounds like the aim is to fix the problem with autovacuum vacuuming > the same table over and over and being unable to remove enough dead > tuples due to something holding back the oldest xmin horizon. Why > can't we just fix that by remembering the value that > VacuumCutoffs.OldestXmin and only coming back to that table once > that's moved forward some amount? Users expect the tables to be auto vacuumed when: *dead_tuples > vac_base_thresh + vac_scale_factor * reltuples* If we depend on xid moving forward to do autovacuum, I think there are chances some bloated tables won't be vacuumed? > > David -- Regards Junwang Zhao
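For reference, the trigger condition quoted above can be checked by hand with a query along these lines; this is a hedged illustration assuming the default values of 50 and 0.2 for the threshold and scale factor:

```
-- Hedged illustration: tables whose dead tuples currently exceed
-- autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
SELECT s.relname,
       s.n_dead_tup,
       50 + 0.2 * c.reltuples AS vacuum_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > 50 + 0.2 * c.reltuples;
```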
Should shared_preload_libraries be loaded during binary upgrade?
Does it make sense to load "shared_preload_libraries" during binary upgrade mode? An extension might unintentionally interfere with pg_upgrade, for example, by connecting to the 'postgres' database, which can cause the upgrade to fail as the restore needs to drop that database. While it's true that extensions should ideally handle this themselves, wouldn't it be safer if we could avoid loading them at all during the binary upgrade mode? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Introduce some randomness to autovacuum
On Thu, 1 May 2025 at 17:35, Junwang Zhao wrote: > > On Thu, May 1, 2025 at 8:12 AM David Rowley wrote: > > It sounds like the aim is to fix the problem with autovacuum vacuuming > > the same table over and over and being unable to remove enough dead > > tuples due to something holding back the oldest xmin horizon. Why > > can't we just fix that by remembering the value that > > VacuumCutoffs.OldestXmin and only coming back to that table once > > that's moved forward some amount? > > Users expect the tables to be auto vacuumed when: > *dead_tuples > vac_base_thresh + vac_scale_factor * reltuples* > If we depend on xid moving forward to do autovacuum, I think > there are chances some bloated tables won't be vacuumed? Can you explain why you think that? The idea is to start vacuum other tables that perhaps can have dead tuples removed instead of repeating vacuums on the same table over and over without any chance of being able to remove any more dead tuples than we could during the last vacuum. David
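A hedged C sketch of that idea (the remembered per-table horizon field is hypothetical and does not exist today):

```
/*
 * Hedged sketch: remember the OldestXmin used by the previous vacuum of the
 * table (hypothetical field last_vacuum_oldest_xmin) and skip the table until
 * the current horizon has advanced past it, since re-vacuuming earlier cannot
 * remove any additional dead tuples.
 */
if (TransactionIdIsValid(tabentry->last_vacuum_oldest_xmin) &&
	!TransactionIdPrecedes(tabentry->last_vacuum_oldest_xmin, current_oldest_xmin))
	continue;			/* horizon has not moved; skip this table for now */
```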
not not - pgupgrade.sgml
Hi, It seems to me that, in doc/src/sgml/ref/pgupgrade.sgml, this phrase: "Because not all statistics are not transferred" should be: "Because not all statistics are transferred" Thanks, Erik
Re: Fix outdated comments for IndexInfo
On Thu, May 1, 2025 at 12:49 AM Japin Li wrote: > While working on [1], I found outdated comments in IndexInfo. > The attached patch corrects them. Nice catch. LGTM. Thanks Richard
queryId constant squashing does not support prepared statements
62d712ec added the ability to squash constants from an IN list for queryId computation purposes. This means that a similar queryId will be generated for queries that differ only in the number of values in the IN list. The patch lacks the ability to apply this optimization to values passed in as parameters (i.e., parameter kind = PARAM_EXTERN), which will be the case for SQL prepared statements and protocol-level prepared statements, i.e.

```
select from t where id in (1, 2, 3) \bind
```

or

```
prepare prp(int, int, int) as select from t where id in ($1, $2, $3);
```

Here is the current state,

```
postgres=# create table t (id int);
CREATE TABLE
postgres=# prepare prp(int, int, int) as select from t where id in ($1, $2, $3);
PREPARE
postgres=# execute prp(1, 2, 3);
postgres=# select from t where id in (1, 2, 3);
--
(0 rows)

postgres=# SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
 query | calls
-------+-------
...
 select from t where id in ($1 /*, ... */) | 1
 select from t where id in ($1, $2, $3)    | 1   <<- prepared statement
(6 rows)
```

but with the attached patch, the optimization applies.

```
 create table t (id int)                    | 1
 select from t where id in ($1 /*, ... */)  | 2
(3 rows)
```

I think this is a pretty big gap, as many of the common drivers such as JDBC, which use the extended query protocol, will not be able to take advantage of the optimization in 18, which will be very disappointing.

Thoughts?

Sami Imseih
Amazon Web Services (AWS)

v1-0001-Allow-query-jumble-to-squash-a-list-external-para.patch Description: Binary data
Re: pgsql: Add function to get memory context stats for processes
On Wed, Apr 30, 2025 at 4:29 PM Daniel Gustafsson wrote: > This is indeed exposing a pre-existing issue, which was reported in [0] and a > patch fixing it has been submitted. I have been testing and reworking the > patch slightly but due to $life and $work events have yet to have time to push > it. Thanks for the pointer. I will reply to that thread. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.
On Wed, Apr 30, 2025 at 5:24 PM Daniel Gustafsson wrote: > Attached is a current v4 with a few small tweaks. Sorry to turn up late here, but I strongly disagree with the notion that this is a bug in the DSM or DSA code. It seems to me that it is the caller's responsibility to provide a valid resource owner, not the job of the called code to ignore the resource owner when it's unusable. I suspect that there are many other parts of the system that rely on the ResourceOwner machinery which likewise assume that the ResourceOwner that they are passed is valid. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 30, 2025 at 2:38 AM Zhijie Hou (Fujitsu) wrote: > > On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > > wrote: > > > > > > Your fix looks good to me. Is it worth considering putting an > > > assertion to verify if new two_phase_at is equal to or greater than > > > confirmed_lsn (or at least it doesn't go backward), when syncing > > > two_phase_at? > > > > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > > should be less than or equal to confirmed_flush). I have done that, changed > > a > > few comments, and committed the patch. > > I noticed a BF failure[1] related to the fix, where the recently added > assertion > Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. > After > examining the logs and code, I couldn't identify any functionality issues. Yeah, that's weird. > AFAICS, the last slotsync cycle should have retrieved the latest > confirmed_flush and > two_phase_at from the source slot, both expected to be 0/6005150. Here are > some details: > > The standby's log[1] shows the source slot's two_phase_at as 0/6005150. > Meanwhile, the publisher's log reveals that the source slot's confirmed_flush > was already 0/6005150 before the last slotsync. Therefore, the last slotsync > cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. > > If we assume remote_slot->confirmed_lsn during the last sync cycle is > 0/6005150, then either the local slot's confirmed_lsn had already been updated > to this value in a prior sync, leading it to skip the update in the last cycle > (in which case, the assertion shouldn't be broken), or it should enter the > slot > update branch to set the synced slot to 0/6005150 (see > update_local_synced_slot() for details). Thus, theoretically, the assertion > wouldn't fail. Agreed with your analysis. After enabling the subscription with two_phase=true, the primary server has the following logs that indicate that logical decoding started from 0/6005150, not 0/6004FC8. Given that the slot's two_phase was enabled at this time, the slot's confirmed_flush and two_phase_at were 0/6005150. 
2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: received replication command: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: acquired logical replication slot "lsub1_slot" 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, reading WAL from 0/6004A00. 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') > > Beyond functionality problems, another possibility might be the CPU reordered > memory access, causing the assertion to execute before updating > confirmed_flush. Not sure CPU reordered memory access could matter in this case but I don't have other ideas of possible causes. > However, we lack the information needed to verify this, so one > idea is to add some DEBUG message in update_local_synced_slot() to facilitate > tracking if the issue recurs. That would be a good idea. Also, it's unrelated to this issue but probably we are better off doing this assertion check only when enabling two_phase. The assertion is currently checked even when disabling the two_phase, which seems not to make sense, and we don't clear two_phase_at value when disabling two_phase by ReplicationSlotAlter(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
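A hedged C sketch of the narrower check being suggested here (not a proposed patch):

```
/*
 * Hedged sketch: only enforce the invariant while two_phase decoding is
 * enabled for the slot, since two_phase_at is not cleared when two_phase is
 * turned off via ReplicationSlotAlter().
 */
if (slot->data.two_phase)
	Assert(slot->data.two_phase_at <= slot->data.confirmed_flush);
```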
Re: Introduce some randomness to autovacuum
On Thu, 1 May 2025 at 03:29, Nathan Bossart wrote: > That being said, I am -1 for this proposal. Autovacuum parameters and > scheduling are already quite complicated, and making it nondeterministic > would add an additional layer of complexity (and may introduce its own > problems). But more importantly, IMHO it masks the problems instead of > solving them more directly, and it could mask future problems, too. It'd > probably behoove us to think about the known problems more deeply and to > craft more targeted solutions. -1 from me too. It sounds like the aim is to fix the problem with autovacuum vacuuming the same table over and over and being unable to remove enough dead tuples due to something holding back the oldest xmin horizon. Why can't we just fix that by remembering the value that VacuumCutoffs.OldestXmin and only coming back to that table once that's moved forward some amount? David
Re: Performance issues with v18 SQL-language-function changes
Alexander Lakhin writes: > Sorry if I've chosen the wrong thread, but starting from 0313c5dc6, > the following script: > CREATE TYPE aggtype AS (a int, b text); > CREATE FUNCTION aggfns_trans(aggtype[], integer, text) RETURNS aggtype[] > LANGUAGE sql AS 'SELECT > array_append($1,ROW($2,$3)::aggtype)'; > CREATE AGGREGATE aggfns(integer, text) (SFUNC = public.aggfns_trans, STYPE = > public.aggtype[], INITCOND = '{}'); > CREATE TABLE t(i int, k int); > INSERT INTO t SELECT 1, 2 FROM generate_series(1, 4000); > SET statement_timeout = '10s'; > SELECT aggfns(i, repeat('x', 8192)) OVER (PARTITION BY i) FROM t; > crashes the server for me like this: > corrupted size vs. prev_size while consolidating Hmm. What seems to be going on here is that once the aggfns_trans() result gets large enough that the SQL-function-result tuplestore decides to spill to disk, when we pull the result tuple back out of the tuplestore with tuplestore_gettupleslot we end up with the jf_resultSlot holding a should-free tuple pointer that points into the tuplestore's storage. After tuplestore_clear that is a dangling pointer, and the next use of the jf_resultSlot fails while trying to free the tuple. So the attached fixes it for me, but I'm still mighty confused because I don't understand why it didn't fail in the old code. This logic doesn't seem noticeably different from before, and there's even a very old comment (in the SRF path) alleging that /* NB: this might delete the slot's content, but we don't care */ Now we *do* care, but what changed? (As an aside, seems like tuplestore is leaving money on the table, because there's hardly any point in spilling to disk when it never holds more than one tuple. But that's not something to change now.) regards, tom lane diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c index e0bca7cb81c..37f616280c6 100644 --- a/src/backend/executor/functions.c +++ b/src/backend/executor/functions.c @@ -1728,8 +1728,9 @@ fmgr_sql(PG_FUNCTION_ARGS) elog(ERROR, "failed to fetch lazy-eval tuple"); /* Extract the result as a datum, and copy out from the slot */ result = postquel_get_single_result(slot, fcinfo, fcache); + /* Clear the slot, in case it points into the tuplestore */ + ExecClearTuple(slot); /* Clear the tuplestore, but keep it for next time */ - /* NB: this might delete the slot's content, but we don't care */ tuplestore_clear(fcache->tstore); /* @@ -1810,7 +1811,11 @@ fmgr_sql(PG_FUNCTION_ARGS) /* Re-use the junkfilter's output slot to fetch back the tuple */ slot = fcache->junkFilter->jf_resultSlot; if (tuplestore_gettupleslot(fcache->tstore, true, false, slot)) + { result = postquel_get_single_result(slot, fcinfo, fcache); +/* Clear the slot, in case it points into the tuplestore */ +ExecClearTuple(slot); + } else { fcinfo->isnull = true;
Re: queryId constant squashing does not support prepared statements
On Wed, Apr 30, 2025 at 04:52:06PM -0500, Sami Imseih wrote:
> 62d712ec added the ability to squash constants from an IN list
> for queryId computation purposes. This means that a similar
> queryId will be generated for queries that differ only in the
> number of values in the IN list.
>
> The patch lacks the ability to apply this optimization to values
> passed in as parameters (i.e., parameter kind = PARAM_EXTERN),
> which will be the case for SQL prepared statements and protocol-level
> prepared statements, i.e.
>
> I think this is a pretty big gap, as many of the common drivers such as JDBC,
> which use the extended query protocol, will not be able to take advantage of
> the optimization in 18, which will be very disappointing.
>
> Thoughts?

Yes. Long IN/ANY clauses are, as far as I know, a more common pattern caused by ORMs, and I'd like to think that application developers would not hardcode such clauses in their right minds (well, okay, I'm likely wrong about this assumption, feel free to counter-argue). These also like relying on the extended query protocol. Not taking JDBC into account in that is a bummer, and it is very popular.

I agree that the current solution we have in the tree feels incomplete because we are not taking into account the most common cases that users would care about. Now, allowing PARAM_EXTERN means that we allow any expression to be detected as a squashable thing, and this kind of breaks the assumption of IsSquashableConst(), where we want only constants to be allowed, because EXECUTE parameters can be any kind of Expr node. At least that's the intention of the code on HEAD.

Now, I am not actually objecting to PARAM_EXTERN being included or not, if there's a consensus behind it and my arguments are considered not relevant. The patch is written as if a PARAM_EXTERN implied that the expression is a Const, but that may not be so depending on the execution path given for the parameter. Or at least the patch could be clearer and rename the parts about the "Const" squashable APIs around queryjumblefuncs.c.

To be honest, the situation of HEAD makes me question whether we are using the right approach for this facility. I did mention an alternative a couple of months ago, but it comes down to accepting that any expression would be normalized. Unfortunately I never got down to studying it in detail, as this touches around expr_list in the parser: we could detect in the parser the start and end locations of an expression list in a query string, then group all of them together based on their location in the string. This would also be cheaper than going through all the elements in the list, tweaking things when dealing with a subquery.

The PARAM_EXTERN part has been mentioned a couple of weeks ago here, btw:
https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=v...@mail.gmail.com
--
Michael

signature.asc Description: PGP signature
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Peter, moving the conversation here from "pgsql: Improve nbtree skip scan primitive scan scheduling" thread on -committers. Attached is another regression test for your consideration, which gives rise to the following assertion: TRAP: failed Assert("numSkipArrayKeys == 0"), File: "nbtpreprocesskeys.c", Line: 1859, PID: 7210 0 postgres0x0001032f30e0 ExceptionalCondition + 108 1 postgres0x000102e83100 _bt_preprocess_keys + 6036 2 postgres0x000102e87338 _bt_first + 168 3 postgres0x000102e84680 btgettuple + 196 4 postgres0x000102e75cdc index_getnext_tid + 68 5 postgres0x0001030017a0 IndexOnlyNext + 228 6 postgres0x000102fe5b2c ExecScan + 228 7 postgres0x000103011088 ExecSort + 536 8 postgres0x000102fdbc68 standard_ExecutorRun + 312 9 postgres0x0001031bdfb8 PortalRunSelect + 236 10 postgres0x0001031bdbd4 PortalRun + 492 11 postgres0x0001031bcb18 exec_simple_query + 1292 12 postgres0x0001031b9d1c PostgresMain + 1388 13 postgres0x0001031b59a8 BackendInitialize + 0 14 postgres0x00010310edd8 postmaster_child_launch + 372 15 postgres0x00010311303c ServerLoop + 4948 16 postgres0x000103111360 InitProcessGlobals + 0 17 postgres0x000103030c00 help + 0 18 dyld0x00018eb17154 start + 2476 This looks sufficiently different from the assertion mentioned on the other thread to be worth posting here. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company _skipscan2.sql Description: application/sql
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
On Wed, Apr 30, 2025 at 9:12 PM Mark Dilger wrote: > TRAP: failed Assert("numSkipArrayKeys == 0"), File: "nbtpreprocesskeys.c", > Line: 1859, PID: 7210 > 0 postgres0x0001032f30e0 > ExceptionalCondition + 108 > 1 postgres0x000102e83100 > _bt_preprocess_keys + 6036 > This looks sufficiently different from the assertion mentioned on the other > thread to be worth posting here. It is a completely unrelated issue. Fortunately, this one is harmless. The assertion merely failed to account for the case where we completely end the scan during preprocessing due to it having an unsatisfiable array qual. Pushed a fix for this just now. Thanks -- Peter Geoghegan
Re: Performance issues with v18 SQL-language-function changes
Alexander Lakhin writes: > Sorry if I've chosen the wrong thread, but starting from 0313c5dc6, > the following script: > ... > crashes the server for me like this: > corrupted size vs. prev_size while consolidating Yup, duplicated here. Thanks for the report! regards, tom lane
Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.
> On 11 Apr 2025, at 21:08, Rahila Syed wrote: > > Hi Daniel, > > Thank you for your review and code improvements. > > Please find below some observations. > > 1. dsm_unpin_mapping(dsm_segment *seg) > + if (CurrentResourceOwner && > IsResourceOwnerReleasing(CurrentResourceOwner)) > + return; > > Given that the function can return without setting resowner to a > CurrentResourceOwner which is not NULL, shall we change the function > signature to return true when "unpin" is successful and false when not? Maybe, but at this point in the cycle we cannot change the prototype like that. Food for thought for v19 though. > 2. If value of IsResourceOwnerReleasing changes between dsm_create_descriptor > and attach_internal, the dsm segment and dsa area will end up with different > resource owners. > Earlier the chances of CurrentResourceOwner changing between the two > functions were zero. > May be something can be done to keep resowner assignments under both these > functions > in sync. Hmm, that's a good question, not sure. Do you have a repro for this handy? Attached is a current v4 with a few small tweaks. -- Daniel Gustafsson v4-0001-Don-t-try-to-enlarge-resourceowner-when-releasing.patch Description: Binary data
Batch TIDs lookup in ambulkdelete
Hi all, During index bulk-deletion in lazy vacuum, we currently check the deletability of each index tuple individually using the vac_tid_reaped() function. The attached proof-of-concept patches propose batching multiple TID lookups for deletability checks to reduce overhead. This optimization aims to minimize redundant function calls and repeated TidStore entry retrievals for TIDs on the same page. I have conducted benchmarks across several scenarios to evaluate the performance impact. # Case-1 (btree tuples are regular tuples and dead TIDs are concentrated): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select generate_series(1, ${NROWS}); create index on test (c); delete from test where c < ${NROWS} * 0.3; # Case-2 (btree tuples are regular tuples and dead TIDs are sparsed): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select generate_series(1, ${NROWS}); create index on test (c); delete from test where random() < 0.3; # Case-3 (btree tuples are deduplicated tuples): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select c % 1000 from generate_series(1, ${NROWS}) c; create index on test (c); select pg_relation_size('test') / 8192 as relpages \gset delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3); # Case-4 (btree tuples are deduplicated tuples and table is clustered): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select c % 1000 from generate_series(1, ${NROWS}) c; create index on test (c); cluster test using test_c_idx; select pg_relation_size('test') / 8192 as relpages \gset delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3); # Case-5 (creating btree index on UUID column) create unlogged table test (c uuid) with (autovacuum_enabled = off); insert into test select uuidv4() from generate_series(1, ${NROWS}) c; create index on test (c); select pg_relation_size('test') / 8192 as relpages \gset delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3); Here are the results (NROWS = 5000): HEAD PATCHEDDIFF case-1: 3,021 ms 2.818 ms93.29% case-2: 5, 697 ms 5.545 ms97.34% case-3: 2,833 ms 2.790 ms98.48% case-4: 2,564 ms 2.279 ms88.86% case-5: 4,657 ms 4.706 ms 101.04% I've measured 6 ~ 11% improvement in btree bulk-deletion. Here are the summary of each attached patch: 0001: Introduce TIdStoreIsMemberMulti() where we can do IsMember check for multiple TIDs in one function call. If the given TIDs are sorted (at least in block number), we can save radix tree lookup for the same page entry. 0002: Convert IndexBUlkDeleteCallback() to batched operation. 0003: Use batch TIDs lookup in btree index bulk-deletion. In patch 0003, we implement batch TID lookups for both each deduplicated index tuple and remaining all regular index tuples, which seems to be the most straightforward approach. While further optimizations are possible, such as performing batch TID lookups for all index tuples on a single page, these could introduce additional overhead from sorting and re-sorting TIDs. Moreover, considering that radix tree lookups are relatively inexpensive, the benefits of sorting TIDs and using TidStoreIsMemberMulti() might be minimal. Nevertheless, these potential optimizations warrant further evaluation to determine their actual impact on performance. Also, the patch includes the batch TIDs lookup support only for btree indexes but we potentially can improve other index AMs too. 
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

From ba4e7dc1f06f3dceab896b6ee4bf68ff23db1c09 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada
Date: Tue, 22 Apr 2025 12:25:11 -0700
Subject: [PATCH v1 3/3] Use batch TIDs lookup in btree index bulk-deletion.

TIDs in the postlist are sorted. But TIDs of the gathered regular index
tuples are not sorted.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/access/nbtree/nbtree.c | 107 +++--
 1 file changed, 71 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 821e16d5691..c92efecc076 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/stratnum.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "common/int.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "storage/bulk_write.h"
@@ -1309,6 +1310,13 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexFreeSpaceMapVacuum(rel);
 }
+/* qsort comparator for sorting OffsetNumbers */
+static int
+cmpOffsetNumbers(const void *a, const void *b)
+{
+	return pg_cmp_u16(*(const OffsetNumber *) a, *(const OffsetNumber *) b);
+}
+
 /*
 * btvac
Re: Vacuum timing in pg_stat_all_tables
On Wed, Mar 12, 2025 at 10:52:57AM +, Bertrand Drouvot wrote:
> Note that it does not add extra explanation to "cost-based delay". If we feel
> the need we could add a link to "<xref linkend="runtime-config-resource-vacuum-cost"/>"
> like it has been done for delay_time in bb8dff9995f.

Sorry for the delay; this has been sitting in my inbox for some time, and I am catching up with some of my backlog.

-        Total time this table has been manually vacuumed, in milliseconds
+        Total time this table has been manually vacuumed, in milliseconds. (This
+        includes the time spent sleeping due to cost-based delay.)

Hmm, okay, adding this information to these four new fields is fine here, so I'll apply that on HEAD shortly. I can see that this matches the style used for some of the other fields, like n_tup_upd for the details regarding HOT.
--
Michael

signature.asc
Description: PGP signature
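As a usage note, a query along the following lines would read back the cumulative timings being documented here. The column names (total_vacuum_time, total_autovacuum_time) are taken to be the new fields discussed in this thread; this is an illustrative query, not part of the patch.

-- Cumulative vacuum/autovacuum time per table, in milliseconds; per the doc
-- change above, these figures include time spent sleeping in cost-based delay.
SELECT relname,
       vacuum_count,
       total_vacuum_time,
       autovacuum_count,
       total_autovacuum_time
FROM pg_stat_all_tables
WHERE total_vacuum_time > 0 OR total_autovacuum_time > 0
ORDER BY total_vacuum_time + total_autovacuum_time DESC;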
Re: fixing CREATEROLE
On Wed, Apr 30, 2025 at 1:29 PM Tom Lane wrote:
> But don't we need to add createrole_self_grant to the set of GUCs that
> pg_dump[all] resets in the emitted SQL?

The other approach would be to do what we do for the role options and just specify everything explicitly in the dump. The GUC is only a default specifier, so let's not leave room for defaults in the dump file.

David J.
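For illustration only, the two directions might translate into emitted SQL roughly like the fragments below. These are hypothetical dump fragments with invented role names, not actual pg_dump[all] output, and the exact statements would depend on how the implementation shakes out.

-- Direction 1: reset the GUC so the restore is not influenced by whatever
-- default the restoring role happens to have.
SET createrole_self_grant = '';
CREATE ROLE staff NOSUPERUSER NOCREATEDB CREATEROLE;

-- Direction 2: leave the GUC alone and spell the grants out explicitly,
-- the way the boolean role options are already dumped explicitly.
CREATE ROLE staff NOSUPERUSER NOCREATEDB CREATEROLE;
GRANT staff TO dump_owner WITH ADMIN TRUE, INHERIT TRUE, SET TRUE;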
Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.
On Wed, Apr 30, 2025 at 06:03:49PM -0400, Robert Haas wrote:
> Sorry to turn up late here, but I strongly disagree with the notion
> that this is a bug in the DSM or DSA code. It seems to me that it is
> the caller's responsibility to provide a valid resource owner, not the
> job of the called code to ignore the resource owner when it's
> unusable. I suspect that there are many other parts of the system that
> rely on the ResourceOwner machinery which likewise assume that the
> ResourceOwner that they are passed is valid.

Yeah, dshash would be one, I think.

It feels to me that if you want this kind of policy to be checked, it should be done in the shape of one or more assertions based on the state of the resource owner expected in these low-level paths, rather than tweaking the DSA and DSM code to do what you are expecting here, and such new policies should only be enforced on HEAD to avoid disruption of existing systems. I'm actually rather scared of the patch: isn't there a risk of breaking existing patterns that worked out of the box by forcing the resowner to not be set? My spidey sense tingles when I see such patterns, because this enforces assumptions hidden from the callers.
--
Michael

signature.asc
Description: PGP signature
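Purely to illustrate the contrast being drawn (defensively skipping an unusable context versus asserting the caller's precondition), here is a self-contained toy in plain C. ResourceContext and both register_segment_* functions are invented names, not PostgreSQL code; a real check would be expressed in terms of the ResourceOwner state in the low-level DSM/DSA paths.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct ResourceContext
{
    bool    being_released;   /* stand-in for "not in a usable state" */
} ResourceContext;

/* Option A (what the patch effectively does): silently skip registration. */
static void
register_segment_defensive(ResourceContext *ctx)
{
    if (ctx == NULL || ctx->being_released)
        return;                /* pretend nothing happened */
    printf("registered with owner\n");
}

/* Option B (assertion-based policy): make the precondition explicit. */
static void
register_segment_assert(ResourceContext *ctx)
{
    assert(ctx == NULL || !ctx->being_released);
    if (ctx != NULL)
        printf("registered with owner\n");
}

int
main(void)
{
    ResourceContext ok = {false};

    register_segment_defensive(&ok);
    register_segment_assert(&ok);
    register_segment_assert(NULL);   /* a NULL owner remains allowed */
    return 0;
}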
Re: POC: Parallel processing of indexes in autovacuum
Hi,

On Wed, Apr 16, 2025 at 4:05 AM Maxim Orlov wrote:
>
> Hi!
>
> The VACUUM command can be executed with the parallel option. As the documentation
> states, it will perform the index vacuum and index cleanup phases of VACUUM in
> parallel using integer background workers. But such an interesting feature is
> not used for autovacuum. After a quick look at the source code, it became
> clear to me that when the parallel option was added, the corresponding option
> for autovacuum wasn't implemented, although there are no clear obstacles to
> doing so.
>
> Actually, one of our customers stepped into a problem with autovacuum on a table
> with many indexes and relatively long transactions. Of course, long
> transactions are an ultimate evil and the problem can be solved by running
> vacuum from a cron task, but, I think, we can do better.
>
> Anyhow, what about adding a parallel option for autovacuum? Here is a POC
> patch for the proposed functionality. For the sake of simplicity, several GUCs
> have been added. It would be good to think through the parallel launch
> condition without them.
>
> As always, any thoughts and opinions are very welcome!

As I understand it, we initially disabled parallel vacuum for autovacuum because their objectives are somewhat contradictory. Parallel vacuum aims to accelerate the process by utilizing additional resources, while autovacuum is designed to perform cleaning operations with minimal impact on foreground transaction processing (e.g., through vacuum delay). Nevertheless, I see your point about the potential benefits of using parallel vacuum within autovacuum in specific scenarios.

The crucial consideration is determining appropriate criteria for triggering parallel vacuum in autovacuum. Given that we currently support only parallel index processing, suitable candidates might be autovacuum operations on large tables that have a substantial number of sufficiently large indexes and a high volume of garbage tuples. Once we have parallel heap vacuum, as discussed in thread [1], it would also likely be beneficial to incorporate it into autovacuum during aggressive vacuum or failsafe mode.

Although the actual number of parallel workers ultimately depends on the number of eligible indexes, it might be beneficial to introduce a storage parameter, say parallel_vacuum_workers, that allows control over the number of parallel vacuum workers on a per-table basis.

Regarding implementation: I notice the WIP patch implements its own parallel vacuum mechanism for autovacuum. Have you considered simply setting at_params.nworkers to a value greater than zero?

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAEfCNv-GgaDheDJ%2Bs-p_Lv1H24AiJeNoPGCmZNSwL1YA%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
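A standalone sketch of the kind of worker-count decision discussed above (a per-table cap plus an "enough large indexes and enough garbage" test) might look like the following. The thresholds, the TableVacState fields, and the parallel_vacuum_workers reloption are all assumptions made for illustration; they are not part of the POC patch, and the chosen value would ultimately be fed into something like at_params.nworkers.

#include <stdio.h>

typedef struct TableVacState
{
    int     nindexes;               /* indexes on the table */
    int     nlarge_indexes;         /* indexes above a size threshold */
    double  dead_tuple_ratio;       /* fraction of dead tuples */
    int     parallel_vacuum_workers;/* hypothetical per-table reloption; -1 = unset */
    int     max_worker_budget;      /* global budget available to autovacuum */
} TableVacState;

static int
autovac_choose_nworkers(const TableVacState *t)
{
    int     nworkers;

    /* not worth the setup cost: few large indexes or little garbage */
    if (t->nlarge_indexes < 2 || t->dead_tuple_ratio < 0.1)
        return 0;

    /* at most one worker per eligible index, as in manual VACUUM (PARALLEL) */
    nworkers = t->nlarge_indexes;

    /* honor the per-table cap when it is set */
    if (t->parallel_vacuum_workers >= 0 && nworkers > t->parallel_vacuum_workers)
        nworkers = t->parallel_vacuum_workers;

    /* never exceed the global worker budget */
    if (nworkers > t->max_worker_budget)
        nworkers = t->max_worker_budget;

    return nworkers;
}

int
main(void)
{
    TableVacState t = {.nindexes = 8, .nlarge_indexes = 5,
                       .dead_tuple_ratio = 0.25,
                       .parallel_vacuum_workers = 3,
                       .max_worker_budget = 8};

    printf("nworkers = %d\n", autovac_choose_nworkers(&t)); /* prints 3 */
    return 0;
}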