Re: alphabetize long options in pg_dump[all] docs
On 2025-Apr-30, Peter Eisentraut wrote:

> On 29.04.25 23:54, Nathan Bossart wrote:
> > Fair point.  We seem to be pivoting towards long options, anyway.  If
> > there's support for this, I could go through all the client and server
> > application docs to ensure they match this style.
>
> However, I think this would require coordination across all --help output
> and man pages (76 objects), so for the short term, let's just move recently
> added options to the right place under the current theory/theories, and
> leave a larger reshuffling for later.

+1 WFM

--
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pero la cosa no es muy grave ..." (le petit Nicolas -- René Goscinny)
Re: [RFC] Lock-free XLog Reservation from WAL
Just rebase -- regards Yura Sokolov aka funny-falconFrom 4ea25d6feb655a072d1e9f40a547dc6aeab762ac Mon Sep 17 00:00:00 2001 From: Yura Sokolov Date: Sun, 19 Jan 2025 17:40:28 +0300 Subject: [PATCH v6] Lock-free XLog Reservation using lock-free hash-table Removed PrevBytePos to eliminate lock contention, allowing atomic updates to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing to store link to PrevBytePos. --- src/backend/access/transam/xlog.c | 585 +++--- src/tools/pgindent/typedefs.list | 2 + 2 files changed, 532 insertions(+), 55 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2d4c346473b..0dff9addfe1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -68,6 +68,8 @@ #include "catalog/pg_database.h" #include "common/controldata_utils.h" #include "common/file_utils.h" +#include "common/hashfn.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "miscadmin.h" #include "pg_trace.h" @@ -379,6 +381,94 @@ typedef union WALInsertLockPadded char pad[PG_CACHE_LINE_SIZE]; } WALInsertLockPadded; +/* #define WAL_LINK_64 0 */ +#ifndef WAL_LINK_64 +#ifdef PG_HAVE_ATOMIC_U64_SIMULATION +#define WAL_LINK_64 0 +#else +#define WAL_LINK_64 1 +#endif +#endif + +/* + * It links current position with previous one. + * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32)) + * Since CurrBytePos grows monotonically and it is aligned to MAXALIGN, + * CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of + * WAL logs. + * - CurrPosHigh is (CurrBytePos>>32), it is stored for strong uniqueness check. + * - PrevSize is difference between CurrBytePos and PrevBytePos + */ +typedef struct +{ +#if WAL_LINK_64 + uint64 CurrPos; + uint64 PrevPos; +#define WAL_PREV_EMPTY (~((uint64)0)) +#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY) +#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos) +#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0) +#else + uint32 CurrPosId; + uint32 CurrPosHigh; + uint32 PrevSize; +#define WALLinkEmpty(l) ((l).PrevSize == 0) +#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh) +#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0) +#endif +} WALPrevPosLinkVal; + +/* + * This is an element of lock-free hash-table. + * In 32 bit mode PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed. + * In 64 bit mode lock protocol is more complex. + */ +typedef struct +{ +#if WAL_LINK_64 + pg_atomic_uint64 CurrPos; + pg_atomic_uint64 PrevPos; +#else + pg_atomic_uint32 CurrPosId; + uint32 CurrPosHigh; + pg_atomic_uint32 PrevSize; + uint32 pad; /* to align to 16 bytes */ +#endif +} WALPrevPosLink; + +StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes"); + +#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2) +StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)), + "PREV_LINKS_HASH_CAPA should be power of two"); +StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX, + "PREV_LINKS_HASH_CAPA is too large"); + +/*--- + * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in hash table + * 1 - 4 positions h1,h1+1,h2,h2+2 - it guarantees at least 3 distinct points, + * but may spread at 4 cache lines. + * 2 - 4 positions h,h^1,h^2,h^3 - 4 points in single cache line. + * 3 - 8 positions h1,h1^1,h1^2,h1^4,h2,h2^1,h2^2,h2^3 - 8 distinct points in + * in two cache lines. 
+ */ +#ifndef PREV_LINKS_HASH_STRATEGY +#define PREV_LINKS_HASH_STRATEGY 3 +#endif + +#if PREV_LINKS_HASH_STRATEGY <= 2 +#define PREV_LINKS_LOOKUPS 4 +#else +#define PREV_LINKS_LOOKUPS 8 +#endif + +struct WALPrevLinksLookups +{ + uint16 pos[PREV_LINKS_LOOKUPS]; +}; + +#define SWAP_ONCE_IN 128 + /* * Session status of running backup, used for sanity checks in SQL-callable * functions to start and stop backups. @@ -390,17 +480,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; */ typedef struct XLogCtlInsert { - slock_t insertpos_lck; /* protects CurrBytePos and PrevBytePos */ - /* * CurrBytePos is the end of reserved WAL. The next record will be - * inserted at that position. PrevBytePos is the start position of the - * previously inserted (or rather, reserved) record - it is copied to the - * prev-link of the next record. These are stored as "usable byte - * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()). + * inserted at that position. + * + * The start position of the previously inserted (or rather, reserved) + * record (it is copied to the prev-link of the next record) will be + * stored in PrevLinksHash. + * + * These are stored as "usable byte positions" rather than XLogRecPtrs + * (see XLogBytePosToRecPtr()). */ - uint64 CurrBytePos; - uint64 PrevBytePos; + pg_atomic_uint64 CurrBytePos; /*
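For illustration only (not code from the patch): the 32-bit link encoding described in the WALPrevPosLinkVal comment above could be filled in as follows, assuming the fields are set exactly as that comment describes. The helper name is hypothetical.

```c
/*
 * Hypothetical illustration of the 32-bit WALPrevPosLinkVal encoding:
 * CurrPosId folds the 64-bit usable byte position, CurrPosHigh keeps the
 * high word for a stronger uniqueness check, and PrevSize stores the
 * distance back to the previously reserved record.
 */
static inline void
wal_link_set(WALPrevPosLinkVal *val, uint64 CurrBytePos, uint64 PrevBytePos)
{
	val->CurrPosId = (uint32) (CurrBytePos ^ (CurrBytePos >> 32));
	val->CurrPosHigh = (uint32) (CurrBytePos >> 32);
	val->PrevSize = (uint32) (CurrBytePos - PrevBytePos);
}
```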
Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN
On Wed, Apr 30, 2025 at 5:15 AM Peter Eisentraut wrote:
>
> On 28.04.25 18:56, Álvaro Herrera wrote:
> > On 2025-Apr-23, Nathan Bossart wrote:
> >
> >> On Mon, Mar 24, 2025 at 11:37:20AM +0100, Álvaro Herrera wrote:
> >
> >>> I'd add a note about these two things to the open items page, and wait
> >>> to see if we get some of these limitations fixed, so that if we don't,
> >>> we remember to note this limitation in the documentation.
> >>
> >> Are we still waiting on something for this, or should we proceed with the
> >> documentation changes?  It doesn't seem tremendously urgent, but I noticed
> >> it's been about a month since the last message on this thread.
> >
> > I've edited the Open Items page to disclaim my responsibility from this
> > item, since this comes from virtual generated columns which is not my
> > turf.  I think we should just document the current state of affairs; we
> > can come back with further code improvements during the next cycle.
>
> Here is a proposed patch that includes some text about virtual generated
> columns and also fixes up a small mistake in the previous patch
> (confused identity and generated columns) and improves the wording and
> formatting a bit more.

If I were going to quibble, I'd probably rewrite the second paragraph as

+Changing the type of an existing column will normally cause the entire table
+and its indexes to be rewritten.
+As an exception, when changing the type of an existing column,
 if the USING clause does not change the column contents and the old type is
 either binary coercible to the new type or an unconstrained domain over the
 new type, a table rewrite is not
-needed. However, indexes must always be rebuilt unless the system
+needed. However, indexes will still need to be rebuilt unless the system
 can verify that the new index would be logically equivalent to the existing
 one. For example, if the collation for a column has been changed, an index
 rebuild is required because the new sort order might be different. However,
 in the absence of a collation change, a column can be changed from text to
 varchar (or vice versa) without rebuilding the indexes
-because these data types sort identically. Table and/or index
+because these data types sort identically.

But otherwise this LGTM.

Robert Treat
https://xzilla.net
RE: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > wrote: > > > > Your fix looks good to me. Is it worth considering putting an > > assertion to verify if new two_phase_at is equal to or greater than > > confirmed_lsn (or at least it doesn't go backward), when syncing > > two_phase_at? > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > should be less than or equal to confirmed_flush). I have done that, changed a > few comments, and committed the patch. I noticed a BF failure[1] related to the fix, where the recently added assertion Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. After examining the logs and code, I couldn't identify any functionality issues. Here is my analysis: AFAICS, the last slotsync cycle should have retrieved the latest confirmed_flush and two_phase_at from the source slot, both expected to be 0/6005150. Here are some details: The standby's log[1] shows the source slot's two_phase_at as 0/6005150. Meanwhile, the publisher's log reveals that the source slot's confirmed_flush was already 0/6005150 before the last slotsync. Therefore, the last slotsync cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. If we assume remote_slot->confirmed_lsn during the last sync cycle is 0/6005150, then either the local slot's confirmed_lsn had already been updated to this value in a prior sync, leading it to skip the update in the last cycle (in which case, the assertion shouldn't be broken), or it should enter the slot update branch to set the synced slot to 0/6005150 (see update_local_synced_slot() for details). Thus, theoretically, the assertion wouldn't fail. Beyond functionality problems, another possibility might be the CPU reordered memory access, causing the assertion to execute before updating confirmed_flush. However, we lack the information needed to verify this, so one idea is to add some DEBUG message in update_local_synced_slot() to facilitate tracking if the issue recurs. I'll continue to investigate, but any suggestions or insights would be greatly appreciated. [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=calliphoridae&dt=2025-04-29%2009%3A11%3A31&stg=recovery-check [2] 2025-04-29 09:18:05.641 UTC [1023004][client backend][0/2:0] LOG: statement: SELECT two_phase AND '0/6005150' = two_phase_at from pg_replication_slots WHERE slot_name = 'lsub1_slot' AND synced AND NOT temporary; [3] 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.489 UTC [1022911][client backend][:0] LOG: disconnection: session time: 0:00:00.005 user=bf database=postgres host=[local] 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, reading WAL from 0/6004A00. 
2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.518 UTC [1022687][client backend][9/19:0] LOG: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, two_phase_at, failover, database, invalidation_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2025-04-29 09:18:05.524 UTC [1022928][not initialized][:0] LOG: connection received: host=[local] Best Regards, Hou zj
Re: RFC: Logging plan of the running query
Hi, On 2025-04-28 08:55, Hannu Krosing wrote: Have you also checked out https://github.com/postgrespro/pg_query_state which logs running query plan AND collected counts and timings as a response to a signal? Yes. For example, see the discussion: https://www.postgresql.org/message-id/d68c3ae31672664876b22d2dcbb526d2%40postgrespro.ru diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index a750dc8..e1b0be5 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3492,6 +3492,8 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + CheckAndHandleCustomSignals(); Has this ever been discussed for inclusion in core ? As far as I understand from reading a bit of pg_query_state, it registers custom interrupts to obtain the query state, including the output of EXPLAIN: -- pg_query_state/patches/custom_signals_17.0.patch diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index a750dc8..e1b0be5 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3492,6 +3492,8 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + CheckAndHandleCustomSignals(); However, we believe it is not safe to perform something as complex as EXPLAIN during an interrupt. For more details, please refer to the discussion below: https://www.postgresql.org/message-id/CA%2BTgmobH%2BUto9MCD%2BvWc71bVbOnd7d8zeYjRT8nXaeLe5hsNJQ%40mail.gmail.com Previous patches in this thread also attempted a similar approach, but due to the safety concerns mentioned above, we decided to explore alternative solutions. As a result, we are currently proposing an approach based on wrapping plan nodes instead. -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Re: alphabetize long options in pg_dump[all] docs
On 29.04.25 23:54, Nathan Bossart wrote:

On Tue, Apr 29, 2025 at 11:45:11PM +0200, Álvaro Herrera wrote:

I think the concept here is that all short options go first in alphabetical
order, then the long options in their own alphabetical order, and if one
option has both, then the short option takes precedence.

That's what it looks like to me, too.

If that's the idea, then --filter in pg_dumpall is in the wrong place, and
other than that it looks good.

I missed that one, thanks.

I think that's what gives the shorter patch.  But where would you look for,
say, --large-objects?  I mean, how do you know that its short version is -b?
Maybe it would make more sense to sort on long options first and put short
options as the second-priority item for each option.

Fair point.  We seem to be pivoting towards long options, anyway.  If there's
support for this, I could go through all the client and server application
docs to ensure they match this style.

There are two styles currently in use: First, as described above, list all
short options first, then all long-only options.  The second style is that
long-only options are listed alphabetically between short options.  I think
both of these styles are used in --help output and man pages, and I've long
had a desire to unify under one style.  Which would also be helpful to offer
guidance when new options are added.

However, I think this would require coordination across all --help output and
man pages (76 objects), so for the short term, let's just move recently added
options to the right place under the current theory/theories, and leave a
larger reshuffling for later.
Re: Introduce some randomness to autovacuum
Hi Nikita, wenhui,

On Fri, Apr 25, 2025 at 11:16 PM Nikita Malakhov wrote:
>
> Hi!
>
> I agree it is a good idea to shift the table list. Although vacuuming larger
> tables first is a questionable approach, because smaller ones could wait a
> long time to be vacuumed. It looks like the most obvious and simple way is
> that the first table to be vacuumed should not be the first one from the
> previous iteration.
>
> On Fri, Apr 25, 2025 at 6:04 PM wenhui qiu wrote:
>>
>> Hi, I like your idea. It would be even better if the weights could be taken
>> according to the larger tables.
>>
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/

Thanks for your feedback.

I ended up adding a GUC configuration that may support different vacuum
strategies. I named it `autovacuum_vacuum_strategy`, but you might have a
better name. For now it supports only two strategies:

1. Sequential: Tables are vacuumed in the order they are collected.
2. Random: The list of tables is rotated around a randomly chosen pivot
   before vacuuming to avoid always starting with the same table, which
   prevents vacuuming starvation for some tables (a rough sketch of the
   rotation follows below).

We can extend this with other strategies, such as prioritization or other
algorithms, in the future.

--
Regards
Junwang Zhao

v1-0001-Introduce-autovacuum-vacuum-strategy.patch
Description: Binary data
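A rough sketch of the rotation described in strategy 2 above — illustrative only, not the attached patch; the helper name, its placement, and the use of the global PRNG state are assumptions:

```c
/*
 * Rotate the collected table_oids list around a random pivot so that each
 * autovacuum cycle starts from a different table.  Uses PostgreSQL's List
 * and pg_prng APIs; assumes the usual autovacuum.c includes plus
 * common/pg_prng.h.
 */
static List *
rotate_table_list(List *table_oids)
{
	int			n = list_length(table_oids);
	int			pivot;
	List	   *rotated = NIL;
	ListCell   *lc;
	int			i;

	if (n < 2)
		return table_oids;

	pivot = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, n - 1);

	/* append elements pivot .. n-1 first */
	i = 0;
	foreach(lc, table_oids)
	{
		if (i++ >= pivot)
			rotated = lappend_oid(rotated, lfirst_oid(lc));
	}

	/* then wrap around with elements 0 .. pivot-1 */
	i = 0;
	foreach(lc, table_oids)
	{
		if (i++ >= pivot)
			break;
		rotated = lappend_oid(rotated, lfirst_oid(lc));
	}

	return rotated;
}
```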
Re: Some problems regarding the self-join elimination code
On 4/30/25 13:22, Alexander Korotkov wrote:

Thank you, Andrei. I've put it all together.

0001 Fixes material bugs in ChangeVarNodes_walker(), including a regression test
0002 Puts back comments which got accidentally removed
0003 Refactors ChangeVarNodesExtended() with a custom user-defined callback

I've revised the remaining refactoring patch. I've made the callback an
additional callback, not a replacement for the walker. It looks better to me
now. Also, I've written some comments and the commit message. Any thoughts?

It seems quite elegant to me. I have not precisely examined the part with the
RestrictInfo replacement logic - I guess you just copied it - but I reviewed
the rest of the patch. Generally, it looks good, but let me be a little picky.

1. Looking into the callback-related code, I suggest changing the name of
   ChangeVarNodes_callback to something less general and more specific - like
   replace_relid_callback. My variant doesn't seem the best, but a
   general-purpose name seems worse.

2. This callback doesn't modify the query replacement logic's behaviour - it
   only affects expressions. It makes sense to mention that in the description
   of ChangeVarNodesExtended().

3. Should the ChangeVarNodesWalkExpression function return the walker's return
   value?

--
regards, Andrei Lepikhov
Re: Accounting for metapages in genericcostestimate()
Álvaro Herrera writes:
> On 2025-Apr-28, Tom Lane wrote:
>> +	BlockNumber	numNonLeafPages;	/* # of index pages that are not leafs */

> I find the use of "leafs" as plural for "leaf" a bit strange ...
> We already have uses of that word, but I wonder if they don't mostly
> or exclusively come from non-native English speakers.

Yeah, "leaves" would be correct, but I wondered whether that'd confuse
non-native speakers more.  Happy to change it though.

			regards, tom lane
Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Hi,

We would like to propose the removal of the Instruction Synchronization
Barrier (isb) in spin_delay() for aarch64 architectures. Based on our testing
on Graviton instances (m7g.16xlarge), we see on average, over multiple
iterations, up to 12% better performance using PGBench select-only and up to
9% with Sysbench oltp_read_only workloads. On Graviton4 (m8g.24xlarge),
results are up to 8% better using PGBench select-only and up to 6% with
Sysbench oltp_read_only workloads.

We have also tested it while putting more pressure on the spin_delay
function, by enabling pg_stat_statements.track_planning with PGBench
read-only [0], and on average the patch shows up to 27% better performance on
m6g.16xlarge and up to 37% on m7g.16xlarge.

Testing environment:
- PostgreSQL version: 17.2
- Operating System: Ubuntu 22.04
- Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge and m8g.24xlarge)

Our benchmark results on PGBench select-only without
pg_stat_statements.track_planning:

```
# Load DB on m7g.16xlarge
$ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85 --username=postgres pgtest

# Without patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
"transaction type: ",
"scaling factor: 5644",
"query mode: prepared",
"number of clients: 256",
"number of threads: 96",
"duration: 600 s",
"number of transactions actually processed: 359864937",
"latency average = 0.420 ms",
"latency stddev = 1.755 ms",
"tps = 599770.727316 (including connections establishing)",
"tps = 599826.788919 (excluding connections establishing)"

# With patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
"transaction type: ",
"scaling factor: 5644",
"query mode: prepared",
"number of clients: 256",
"number of threads: 96",
"duration: 600 s",
"number of transactions actually processed: 405891881",
"latency average = 0.371 ms",
"latency stddev = 0.569 ms",
"tps = 676480.900049 (including connections establishing)",
"tps = 676523.557293 (excluding connections establishing)"
```

[0] https://www.postgresql.org/message-id/ZxgDEb_VpWyNZKB_%40nathan

0001-Remove-Instruction-Synchronization-Barrier-in-spin_d.patch
Description: Binary data
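For context, the aarch64 spin_delay() being discussed lives in src/include/storage/s_lock.h and looks roughly like the following (paraphrased, not copied verbatim from any particular release); the proposal is essentially to delete the isb so that ARM64 falls back to the generic no-op SPIN_DELAY().

```c
/* Roughly the current aarch64 delay primitive; the proposed patch removes
 * the isb instruction from the inline assembly. */
#if defined(__aarch64__)
#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
	/* Instruction Synchronization Barrier, used as a spin-wait hint */
	__asm__ __volatile__(
		" isb;	\n");
}
#endif
```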
Re: Accounting for metapages in genericcostestimate()
On 2025-Apr-28, Tom Lane wrote:

> @@ -135,6 +141,7 @@ typedef struct
>  	double		numIndexTuples;	/* number of leaf tuples visited */
>  	double		spc_random_page_cost;	/* relevant random_page_cost value */
>  	double		num_sa_scans;	/* # indexscans from ScalarArrayOpExprs */
> +	BlockNumber	numNonLeafPages;	/* # of index pages that are not leafs */
>  } GenericCosts;

The idea you described seems quite reasonable, though I didn't review the
patch in detail.

I find the use of "leafs" as plural for "leaf" a bit strange ...
We already have uses of that word, but I wonder if they don't mostly
or exclusively come from non-native English speakers.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Re: Enhancing Memory Context Statistics Reporting
> On 30 Apr 2025, at 12:14, Peter Eisentraut wrote:
>
> On 29.04.25 15:13, Rahila Syed wrote:
>> Please find attached a patch with some comments and documentation changes.
>> Additionally, added a missing '\0' termination to the "Remaining Totals"
>> string. I think this became necessary after we replaced dsa_allocate0()
>> with dsa_allocate() in the latest version.
>
> >		strncpy(nameptr, "Remaining Totals", namelen);
> > +	nameptr[namelen] = '\0';
>
> Looks like a case for strlcpy()?

True.  I did go ahead with the strncpy and nul terminator assignment, mostly
out of muscle memory, but I agree that this would be a good place for a
strlcpy() instead.

--
Daniel Gustafsson
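For reference, the strlcpy() form being discussed would look something like the sketch below. The exact size argument is an assumption: it presumes namelen counts the characters the destination can hold, with room for one extra terminating byte, as the explicit nameptr[namelen] = '\0' in the original implies.

```c
/* before: copy, then terminate by hand */
strncpy(nameptr, "Remaining Totals", namelen);
nameptr[namelen] = '\0';

/* after: strlcpy() always NUL-terminates within the given buffer size */
strlcpy(nameptr, "Remaining Totals", namelen + 1);
```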
Re: Introduce some randomness to autovacuum
On Fri, Apr 25, 2025 at 10:02:49PM +0800, Junwang Zhao wrote:
> After watching Robert's talk[1] on autovacuum and participating in the related
> workshop yesterday, it appears that people are inclined to use prioritization
> to address the issues highlighted in Robert's presentation.  Here I list two
> of the failure modes that were discussed.
>
> - Spinning. Running repeatedly on the same table but not accomplishing
>   anything useful.
> - Starvation. autovacuum can't vacuum everything that needs vacuuming.
> - ...
>
> The prioritization way needs some basic stuff that postgres doesn't have now.
>
> I had a random thought that introducing some randomness might help
> mitigate some of the issues mentioned above. Before performing vacuum
> on the collected tables, we could rotate the table_oids list by a random
> number within the range [0, list_length(table_oids)]. This way, every table
> would have an equal chance of being vacuumed first, thus no spinning and
> starvation.
>
> Even if there is a broken table that repeatedly gets stuck, this random
> approach would still provide opportunities for other tables to be vacuumed.
> Eventually, the system would converge.

First off, thank you for thinking about this problem and for sharing your
thoughts.  Adding randomness to solve this is a creative idea.

That being said, I am -1 for this proposal.  Autovacuum parameters and
scheduling are already quite complicated, and making it nondeterministic
would add an additional layer of complexity (and may introduce its own
problems).  But more importantly, IMHO it masks the problems instead of
solving them more directly, and it could mask future problems, too.  It'd
probably behoove us to think about the known problems more deeply and to
craft more targeted solutions.

--
nathan
Fix outdated comments for IndexInfo
Hi, all

While working on [1], I found outdated comments in IndexInfo.
The attached patch corrects them.

[1] https://www.postgresql.org/message-id/2A40921D-83AB-411E-ADA6-7E509A46F1E4%40logansw.com

--
Regards,
Japin Li

>From 7c01644860a32ca9ab93367c2f8e34047c9d703f Mon Sep 17 00:00:00 2001
From: Li Jianping
Date: Wed, 30 Apr 2025 23:33:04 +0800
Subject: [PATCH] Fix outdated comments for IndexInfo

* Commit 78416235713 removed the ii_OpclassOptions field.
* Commit 94aa7cc5f70 added the ii_NullsNotDistinct field.
* Commit fc0438b4e80 added the ii_WithoutOverlaps field.

All previous comments were not updated accordingly.
---
 src/include/nodes/execnodes.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..076ffa45d60 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -174,13 +174,14 @@ typedef struct ExprState
  *		UniqueProcs
  *		UniqueStrats
  *		Unique				is it a unique index?
- *		OpclassOptions		opclass-specific options, or NULL if none
+ *		NullsNotDistinct	is unique nulls distinct?
  *		ReadyForInserts		is it valid for inserts?
  *		CheckedUnchanged	IndexUnchanged status determined yet?
  *		IndexUnchanged		aminsert hint, cached for retail inserts
  *		Concurrent			are we doing a concurrent index build?
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
+ *		WithoutOverlaps		is it a without overlaps index?
  *		ParallelWorkers		# of workers requested (excludes leader)
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
--
2.43.0
Re: Typo in multixact.c and jsonfuncs.c documentation
On Wed, 30 Apr 2025 at 19:13, Fujii Masao wrote:
> This commit seems to have caused an indent-check failure on the buildfarm
> member koel.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koel&dt=2025-04-30%2002%3A19%3A00

Thanks. Fixed.

David
Re: Persist injection points across server restarts
On Wed, Apr 30, 2025 at 10:52:51AM +0500, Andrey Borodin wrote:
> Did you consider custom resource manager to WAL-log injection
> points?  This way we do not need to flush them explicitly, they
> always will be persistent.

WAL-logging would not work in the case I've faced, because we need a point
much earlier than the startup process beginning any redo activity, so I
don't think that this would be useful.
--
Michael

signature.asc
Description: PGP signature
Re: alphabetize long options in pg_dump[all] docs
On Wed, Apr 30, 2025 at 04:46:37PM +0200, Álvaro Herrera wrote:
> On 2025-Apr-30, Peter Eisentraut wrote:
>> On 29.04.25 23:54, Nathan Bossart wrote:
>> > Fair point.  We seem to be pivoting towards long options, anyway.  If
>> > there's support for this, I could go through all the client and server
>> > application docs to ensure they match this style.
>>
>> However, I think this would require coordination across all --help output
>> and man pages (76 objects), so for the short term, let's just move recently
>> added options to the right place under the current theory/theories, and
>> leave a larger reshuffling for later.
>
> +1 WFM

Committed.

--
nathan
Re: vacuumdb --missing-stats-only and pg_upgrade from PG13
On Thu, Apr 24, 2025 at 03:25:55PM +0200, Christoph Berg wrote:
> Re: Nathan Bossart
>> Here is what I have staged for commit.  I'll aim to commit these patches
>> sometime next week to give time for additional feedback.
>
> I confirm my PG13 test table gets analyzed now with the patch.

Committed.

--
nathan
Re: Performance issues with v18 SQL-language-function changes
Hello Tom, Sorry if I've chosen the wrong thread, but starting from 0313c5dc6, the following script: CREATE TYPE aggtype AS (a int, b text); CREATE FUNCTION aggfns_trans(aggtype[], integer, text) RETURNS aggtype[] LANGUAGE sql AS 'SELECT array_append($1,ROW($2,$3)::aggtype)'; CREATE AGGREGATE aggfns(integer, text) (SFUNC = public.aggfns_trans, STYPE = public.aggtype[], INITCOND = '{}'); CREATE TABLE t(i int, k int); INSERT INTO t SELECT 1, 2 FROM generate_series(1, 4000); SET statement_timeout = '10s'; SELECT aggfns(i, repeat('x', 8192)) OVER (PARTITION BY i) FROM t; crashes the server for me like this: corrupted size vs. prev_size while consolidating 2025-04-30 19:40:04.209 UTC [286426] LOG: client backend (PID 286441) was terminated by signal 6: Aborted 2025-04-30 19:40:04.209 UTC [286426] DETAIL: Failed process was running: SELECT aggfns(i, repeat('x', 8192)) OVER (PARTITION BY i) FROM t; (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x73cc15c4526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x73cc15c288ff in __GI_abort () at ./stdlib/abort.c:79 #5 0x73cc15c297b6 in __libc_message_impl (fmt=fmt@entry=0x73cc15dce8d7 "%s\n") at ../sysdeps/posix/libc_fatal.c:132 #6 0x73cc15ca8fe5 in malloc_printerr (str=str@entry=0x73cc15dd1b30 "corrupted size vs. prev_size while consolidating") at ./malloc/malloc.c:5772 #7 0x73cc15cab144 in _int_free_merge_chunk (av=0x73cc15e03ac0 , p=0x5cb3ac57b2c0, size=12541904) at ./malloc/malloc.c:4695 #8 0x73cc15cadd9e in __GI___libc_free (mem=mem@entry=0x5cb3acd73290) at ./malloc/malloc.c:3398 #9 0x5cb3a0c2db4c in AllocSetFree (pointer=0x5cb3acd732c8) at aset.c:1107 #10 0x5cb3a0c381f8 in pfree (pointer=) at mcxt.c:241 #11 0x5cb3a067de98 in heap_free_minimal_tuple (mtup=) at heaptuple.c:1532 #12 0x5cb3a08b86a1 in tts_minimal_clear (slot=0x5cb3ac576fb0) at execTuples.c:532 #13 0x5cb3a08ab16e in ExecClearTuple (slot=0x5cb3ac576fb0) at ../../../src/include/executor/tuptable.h:460 #14 ExecFilterJunk (junkfilter=, slot=) at execJunk.c:277 #15 0x5cb3a08bdb03 in sqlfunction_receive (slot=, self=0x5cb3ac525ce0) at functions.c:2597 #16 0x5cb3a08ab4e7 in ExecutePlan (dest=0x5cb3ac525ce0, direction=, numberTuples=1, sendTuples=true, operation=CMD_SELECT, queryDesc=0x5cb3ac525d30) at execMain.c:1814 ... With some script modifications, I observed also other memory-context- related crashes. (Probably this effect is achieved just because of the performance optimization -- I haven't look deeper yet.) This issue is discovered with SQLsmith. Best regards, Alexander Lakhin Neon (https://neon.tech)
Re: AIO v2.5
Hi,

After a bit more private back and forth with Alexander I have found the
issue - and it's pretty stupid:

pgaio_io_wait_for_free() does what it says on the tin. For that, after a
bunch of other things, it finds the oldest in-flight IO and waits for it.

	PgAioHandle *ioh = dclist_head_element(PgAioHandle, node,
										   &pgaio_my_backend->in_flight_ios);

	switch (ioh->state)
	{
		...
		case PGAIO_HS_COMPLETED_IO:
		case PGAIO_HS_SUBMITTED:
			pgaio_debug_io(DEBUG2, ioh,
						   "waiting for free io with %d in flight",
						   dclist_count(&pgaio_my_backend->in_flight_ios));
			...
			pgaio_io_wait(ioh, ioh->generation);
			break;

The problem is that, if the log level is low enough, ereport() (which is
called by pgaio_debug_io()) processes interrupts. The interrupt processing
may end up executing ProcessBarrierSmgrRelease(), which in turn needs to wait
for all in-flight IOs before the IOs are closed. Which then leads to the

	elog(PANIC, "waiting for own IO in wrong state: %d", state);

error.

The waiting for in-flight IOs before closing FDs only happens with io_uring,
hence this only triggering with io_uring.

A similar set of steps can lead to the "no free IOs despite no in-flight IOs"
ERROR that Alexander also observed - if pgaio_submit_staged() triggers a
debug ereport that executes ProcessBarrierSmgrRelease() in an interrupt, we
might wait for all in-flight IOs during IO submission, triggering the error.

I'm somewhat amazed that Alexander could somewhat reliably reproduce this - I
haven't been able to do so once using Alexander's recipe.

I did find a reproducer though:

  c=32;pgbench -n -c$c -j$c -P1 -T1000 -f <(echo 'SELECT sum(abalance) FROM pgbench_accounts;')
  c=1;pgbench -n -c$c -j$c -P1 -T1000 -f <(echo 'DROP DATABASE IF EXISTS foo;CREATE DATABASE foo;')

triggers both issues quickly if run with log_min_messages=debug2 and
io_method=io_uring, and not when a non-debug log level is used.

I'm not yet sure how to best fix it - locally I have done so by making
pgaio_debug() do a HOLD_INTERRUPTS()/RESUME_INTERRUPTS() around the call to
ereport.  But that doesn't really seem great - otoh, requiring various pieces
of code to know that anything emitting debug messages needs to hold
interrupts etc makes for rare and hard to understand bugs.

We could just make the relevant functions hold interrupts, and that might be
the best path forward, but we don't really need to hold all interrupts
(e.g. termination would be fine), so it's a bit coarse grained.  It would
need to happen in a few places, which isn't great either.

Other suggestions?

Thanks again for finding and reporting this, Alexander!

Greetings,

Andres Freund
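For reference, the local workaround described above — holding interrupts around the debug ereport — might look roughly like the sketch below. The macro body is illustrative only and is not the actual pgaio_debug() definition; it simply shows the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() bracketing being discussed.

```c
/*
 * Illustrative sketch: prevent interrupt processing (and therefore
 * ProcessBarrierSmgrRelease()) from running inside the debug ereport().
 */
#define pgaio_debug(elevel, msg, ...) \
	do { \
		if (message_level_is_interesting(elevel)) \
		{ \
			HOLD_INTERRUPTS(); \
			ereport(elevel, \
					errhidestmt(true), errhidecontext(true), \
					errmsg_internal(msg, __VA_ARGS__)); \
			RESUME_INTERRUPTS(); \
		} \
	} while (0)
```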
Re: pgsql: Add function to log the memory contexts of specified backend pro
On Tue, Apr 6, 2021 at 12:45 AM Fujii Masao wrote:
> Add function to log the memory contexts of specified backend process.

Hi,

I think this might need a recursion guard. I tried this:

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index dc4c600922d..b219a934034 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3532,7 +3532,7 @@ ProcessInterrupts(void)
 	if (ParallelMessagePending)
 		ProcessParallelMessages();

-	if (LogMemoryContextPending)
+	if (true)
 		ProcessLogMemoryContextInterrupt();

 	if (PublishMemoryContextPending)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72f5655fb34..867fd7b0ad5 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -112,7 +112,7 @@ extern void ProcessInterrupts(void);
 /* Test whether an interrupt is pending */
 #ifndef WIN32
 #define INTERRUPTS_PENDING_CONDITION() \
-	(unlikely(InterruptPending))
+	(unlikely(InterruptPending) || true)
 #else
 #define INTERRUPTS_PENDING_CONDITION() \
 	(unlikely(UNBLOCKED_SIGNAL_QUEUE()) ? pgwin32_dispatch_queued_signals() : 0, \

That immediately caused infinite recursion, ending in a core dump:

frame #13: 0x000104607b00 postgres`errfinish(filename=, lineno=, funcname=) at elog.c:543:2 [opt]
frame #14: 0x000104637078 postgres`ProcessLogMemoryContextInterrupt at mcxt.c:1392:2 [opt]
frame #15: 0x0001044a901c postgres`ProcessInterrupts at postgres.c:3536:3 [opt]
frame #16: 0x000104607b54 postgres`errfinish(filename=, lineno=, funcname=) at elog.c:608:2 [opt] [artificial]
frame #17: 0x000104637078 postgres`ProcessLogMemoryContextInterrupt at mcxt.c:1392:2 [opt]
frame #18: 0x0001044a901c postgres`ProcessInterrupts at postgres.c:3536:3 [opt]

It might be unlikely that a process can be signalled fast enough to actually
fail in this way, but I'm not sure it's impossible, and I think we should be
defending against it. The most trivial recursion guard would be
HOLD_INTERRUPTS()/RESUME_INTERRUPTS() around
ProcessLogMemoryContextInterrupt(), but I think that's probably not quite
good enough, because it would make the backend impervious to
pg_terminate_backend() while it's dumping memory contexts, and that could be
a long time if the write blocks.

--
Robert Haas
EDB: http://www.enterprisedb.com
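One possible shape for a recursion guard that avoids blocking all interrupts — a hypothetical sketch, not a proposed patch — would be a static re-entrancy flag around the existing context-dumping code:

```c
/*
 * Sketch only: a re-entrancy flag guards against recursive invocation via
 * nested interrupt processing, while leaving other interrupts (such as
 * termination) deliverable.  The existing function body is elided.
 */
void
ProcessLogMemoryContextInterrupt(void)
{
	static bool in_progress = false;

	LogMemoryContextPending = false;

	if (in_progress)
		return;
	in_progress = true;

	PG_TRY();
	{
		/* ... existing memory-context-dumping code ... */
	}
	PG_FINALLY();
	{
		in_progress = false;
	}
	PG_END_TRY();
}
```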
Re: Introduce some randomness to autovacuum
> Yes, it is masking the problem, but maybe a better way to think about it is
> that it is delaying the performance impact, allowing more time for a manual
> intervention of the problematic table(s).

I question how the user will gauge the success of setting the strategy to
"random"?  They may make it random by default, but fall into the same issues
and revert it back to the default strategy.

But also, the key, as you mention, is "manual intervention", which requires
proper monitoring.  I will argue that for the two cases that this proposal is
seeking to correct, we already have good solutions that could be implemented
by a user.

Let's take the "spinning" case again.  If a table has some sort of problem
causing vacuum to error out, one can just disable autovacuum on a per-table
level and correct the issue.  Also, the xmin horizon being held back (which
is claimed to be the most common cause, and I agree with that), well, that
one is just going to cause all your autovacuums to become useless.

Also, I do think the starvation problem has a good answer now that
autovacuum_max_workers can be modified online.  Maybe something can be done
for autovacuum to auto-tune this setting to give more workers at times when
it's needed.  Not sure what that looks like, but it is more possible now that
this setting does not require a restart.

--
Sami Imseih
Amazon Web Services (AWS)
Re: Improve explicit cursor handling in pg_stat_statements
I forgot to add the proper tests to the "Normalize cursor utility statements"
patch. Reattaching v2.

I also want to add that the decision to not normalize the cursor name in the
FETCH command is because it would not make sense to combine FETCH commands
for various cursors into the same entry.

Regards,

--
Sami Imseih
Amazon Web Services (AWS)

v2-0002-Improve-cursor-handling-in-pg_stat_statements.patch
Description: Binary data

v2-0001-Normalize-cursor-utility-statements.patch
Description: Binary data
Re: Introduce some randomness to autovacuum
On Wed, Apr 30, 2025 at 10:07 AM Junwang Zhao wrote:
> I ended up with adding a guc configuration that may support different
> vacuum strategies.

+1 to this: it's a good solution to a tricky problem. I would be a -1 if this
were not a GUC. Yes, it is masking the problem, but maybe a better way to
think about it is that it is delaying the performance impact, allowing more
time for a manual intervention of the problematic table(s).

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Apr 30, 2025 at 5:55 AM Daniel Gustafsson wrote: > > To keep things moving: I assume this is unacceptable. So v10 redirects > > every access to a PGconn struct member through a shim, similarly to > > how conn->errorMessage was translated in v9. This adds plenty of new > > boilerplate, but not a whole lot of complexity. To try to keep us > > honest, libpq-int.h has been removed from the libpq-oauth includes. > > That admittedly seems like a win regardless. Yeah, it moves us much closer to the long-term goal. > We should either clarify that it was never shipped as part of libpq core, or > remove this altogether. Done in v11, with your suggested wording. > I think this explanatory paragraph should come before the function prototype. Done. > Nitpick, but it won't be .so everywhere. Would this be clearar if spelled out > with something like "do not rely on libpq-int.h when building libpq-oauth as > dynamic shared lib"? I went with "do not rely on libpq-int.h in dynamic builds of libpq-oauth", since devs are hopefully going to be the only people who see it. I've also fixed up an errant #endif label right above it. I'd ideally like to get a working split in for beta. Barring objections, I plan to get this pushed tomorrow so that the buildfarm has time to highlight any corner cases well before the Saturday freeze. I still see the choice of naming (with its forced-ABI break every major version) as needing more scrutiny, and probably worth a Revisit entry. The CI still looks happy, and I will spend today with VMs and more testing on the Autoconf side. I'll try to peer at Alpine and musl libc, too; dogfish and basilisk are the Curl-enabled animals that caught my attention most. Thanks! --Jacob 1: e86e93f7ac8 ! 1: 5a1d1345919 oauth: Move the builtin flow into a separate module @@ Commit message Per request from Tom Lane and Bruce Momjian. Based on an initial patch by Daniel Gustafsson, who also contributed docs changes. The "bare" -dlopen() concept came from Thomas Munro. Many many people reviewed the -design and implementation; thank you! +dlopen() concept came from Thomas Munro. Many people reviewed the design +and implementation; thank you! Co-authored-by: Daniel Gustafsson Reviewed-by: Andres Freund Reviewed-by: Christoph Berg +Reviewed-by: Daniel Gustafsson Reviewed-by: Jelte Fennema-Nio Reviewed-by: Peter Eisentraut Reviewed-by: Wolfgang Walther @@ src/interfaces/libpq-oauth/Makefile (new) ## src/interfaces/libpq-oauth/README (new) ## @@ +libpq-oauth is an optional module implementing the Device Authorization flow for -+OAuth clients (RFC 8628). It was originally developed as part of libpq core and -+later split out as its own shared library in order to isolate its dependency on -+libcurl. (End users who don't want the Curl dependency can simply choose not to -+install this module.) ++OAuth clients (RFC 8628). It is maintained as its own shared library in order to ++isolate its dependency on libcurl. (End users who don't want the Curl dependency ++can simply choose not to install this module.) + +If a connection string allows the use of OAuth, and the server asks for it, and +a libpq client has not installed its own custom OAuth flow, libpq will attempt @@ src/interfaces/libpq-oauth/README (new) +pg_fe_run_oauth_flow and pg_fe_cleanup_oauth_flow are implementations of +conn->async_auth and conn->cleanup_async_auth, respectively. 
+ ++At the moment, pg_fe_run_oauth_flow() relies on libpq's pg_g_threadlock and ++libpq_gettext(), which must be injected by libpq using this initialization ++function before the flow is run: ++ +- void libpq_oauth_init(pgthreadlock_t threadlock, + libpq_gettext_func gettext_impl, + conn_errorMessage_func errmsg_impl, @@ src/interfaces/libpq-oauth/README (new) + set_conn_altsock_func setaltsock_impl, + set_conn_oauth_token_func settoken_impl); + -+At the moment, pg_fe_run_oauth_flow() relies on libpq's pg_g_threadlock and -+libpq_gettext(), which must be injected by libpq using this initialization -+function before the flow is run. -+ +It also relies on access to several members of the PGconn struct. Not only can +these change positions across minor versions, but the offsets aren't necessarily +stable within a single minor release (conn->errorMessage, for instance, can @@ src/interfaces/libpq/fe-auth-oauth-curl.c => src/interfaces/libpq-oauth/oauth-cu +#define set_conn_altsock(CONN, VAL) do { CONN->altsock = VAL; } while (0) +#define set_conn_oauth_token(CONN, VAL) do { CONN->oaut
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 30 Apr 2025, at 19:59, Jacob Champion wrote:
> On Wed, Apr 30, 2025 at 5:55 AM Daniel Gustafsson wrote:
>> Nitpick, but it won't be .so everywhere. Would this be clearer if spelled
>> out with something like "do not rely on libpq-int.h when building
>> libpq-oauth as dynamic shared lib"?
>
> I went with "do not rely on libpq-int.h in dynamic builds of
> libpq-oauth", since devs are hopefully going to be the only people who
> see it. I've also fixed up an errant #endif label right above it.

That's indeed better than my suggestion.

> I'd ideally like to get a working split in for beta.

+Many

> Barring
> objections, I plan to get this pushed tomorrow so that the buildfarm
> has time to highlight any corner cases well before the Saturday
> freeze.

I'll try to kick the tyres a bit more as well.

--
Daniel Gustafsson
Re: Introduce some randomness to autovacuum
On Wed, Apr 30, 2025 at 1:56 PM Sami Imseih wrote:
> > Yes, it is masking the problem, but maybe a better way to think about it
> > is that it is delaying the performance impact, allowing more time for a
> > manual intervention of the problematic table(s).
>
> I question how the user will gauge the success of setting the strategy to
> "random"?  They may make it random by default, but fall into the same
> issues and revert it back to the default strategy.
>
> But also, the key as you mention is "manual intervention" which requires
> proper monitoring.  I will argue that for the two cases that this proposal
> is seeking to correct, we already have good solutions that could be
> implemented by a user.

I would have a lot more faith in this discussion if there was any kind of
external solution that had gained popularity as a general solution, but this
doesn't seem to be the case (and trying to wedge something into the server
will likely hurt that kind of research).

As an example, the first fallacy of autovacuum management is the idea that a
single strategy will always work. Having implemented a number of crude vacuum
management systems in user space already, I know that I have run into
multiple cases where I had to essentially build two different "queues" of
vacuums (one for xids, one for bloat) to be fed into Postgres so as not to be
blocked (in either direction) by conflicting priorities, no matter how wonky
things got. I can imagine a set of GUCs that we could put in to try to mimic
such types of behavior, but I imagine it would take quite a few rounds before
we got the behavior correct.

Robert Treat
https://xzilla.net
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Wed, 30 Apr 2025 at 17:41, Amit Kapila wrote: > > On Wed, Apr 30, 2025 at 11:22 AM Tom Lane wrote: > > > > Xuneng Zhou pointed out on Discord that the test case added by > > 7c99dc587 has caused repeated failures in CI --- though oddly, > > it's not failed in the buildfarm so far as I can find. The > > failures look like > > > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading > > publication: tap_pub_3) at > > /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103. > > > > I analyzed the relevant publisher-side CI Logs [1]: > ... > 2025-04-19 08:24:14.096 UTC [21961][client backend] > [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3 > values(1) > 2025-04-19 08:24:14.098 UTC [21961][client backend] > [024_add_drop_pub.pl][:0] LOG: disconnection: session time: > 0:00:00.003 user=postgres database=postgres host=[local] > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG: > released logical replication slot "tap_sub" > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG: > disconnection: session time: 0:00:00.329 user=postgres > database=postgres host=[local] > 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0] > LOG: connection received: host=[local] > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] > LOG: connection authenticated: user="postgres" method=trust > (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117) > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] > LOG: replication connection authorized: user=postgres > application_name=tap_sub > 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG: > statement: SELECT pg_catalog.set_config('search_path', '', false); > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG: > received replication command: IDENTIFY_SYSTEM > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] > STATEMENT: IDENTIFY_SYSTEM > 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG: > received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL > 0/0 (proto_version '4', streaming 'parallel', origin 'any', > publication_names '"tap_pub_ > ... > > This shows that walsender restarts after the "INSERT INTO tab_3 > values(1)" is processed by the previous walsender ("released logical > replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)"). > So, it is possible that the old apply worker has sent the confirmation > of WAL received location after the Insert (due to keep_alive message > handling). So, after the restart, the new walsender will start > processing WAL after the INSERT and wait for the skipped message LOG > timed out. > > Considering the above theory is correct, after "ALTER SUBSCRIPTION > tap_sub SET PUBLICATION tap_pub_3", we should wait for the new > walsender to restart. We are already doing the same for a similar case > in 001_rep_changes.pl (See "# check that change of connection string > and/or publication list causes restart of subscription workers. We > check the state along with application_name to ensure that the > walsender is (re)started.). > > Unfortunately, I will be away for the rest of the week. In the > meantime, if you or someone else is able to reproduce and fix it, then > good; otherwise, I'll take care of it after I return. I agree with your analysis. 
I was able to reproduce the issue by delaying the invalidation of the subscription until the walsender finished decoding the INSERT operation following the ALTER SUBSCRIPTION through a debugger and using the lsn from the pg_waldump of the INSERT after the ALTER SUBSCRIPTION. In this scenario, the confirmed_flush_lsn ends up pointing to a location after the INSERT. When the invalidation is eventually received and the apply worker/walsender is restarted, the restarted walsender begins decoding from that LSN—after the INSERT—which means the "skipped loading publication" warning is never triggered, causing the test to fail. Attached is a patch that ensures the walsender process is properly restarted after ALTER SUBSCRIPTION, preventing this race condition. Regards, Vignesh From 26c8efbf59a456f4ad8a87b504180449efe1cd69 Mon Sep 17 00:00:00 2001 From: Vignesh Date: Wed, 30 Apr 2025 21:38:33 +0530 Subject: [PATCH] Fix race condition after ALTER SUBSCRIPTION SET PUBLICATION Previously, after executing ALTER SUBSCRIPTION tap_sub SET PUBLICATION, we did not wait for the new walsender process to restart. As a result, an INSERT executed immediately after the ALTER could be decoded and the confirmed flush lsn is advanced. This could cause replication to resume from a point after the INSERT. In such cases, we miss the expected warning about the missing publication. To fix this, we now ensure that the walsender has restarted before continuing after ALTER SUBSCRIPTION. --- src/test/subscription/t/024_add_drop
Re: Introduce some randomness to autovacuum
> - Spinning. Running repeatedly on the same table but not accomplishing
>   anything useful.

> But more importantly, IMHO it masks the problems instead of
> solving them more directly, and it could mask future problems, too

To add more to Nathan's comment about masking future problems, this will not
solve the "spinning" problem because, if the most common reason for this is a
long-running transaction, etc., all your tables will eventually end up with
wasted vacuum cycles because the xmin horizon is not advancing.

--
Sami Imseih
Re: add --sequence-data to pg_dumpall
On Wed, Apr 30, 2025 at 09:29:59AM +0900, Michael Paquier wrote:
> On Tue, Apr 29, 2025 at 03:55:08PM -0500, Nathan Bossart wrote:
>> Assuming we want this patch, should we apply it to v18?  It's arguably an
>> oversight in the pg_dump --sequence-data commit, and pg_dumpall will just
>> pass the option through to pg_dump, but otherwise there's not a really
>> strong reason it can't wait.
>
> This reminds me of, that fixed a similar defect in pg_dumpall
> following the addition of an option in pg_dump where the former was
> forgotten:
> https://www.postgresql.org/message-id/YKHC%2BqCJvzCRVCpY%40paquier.xyz
>
> I agree with applying that to v18 now and treat it as a defect rather
> than wait for v19 and treat this patch as a new feature.  Bonus points
> for the patch being straight-forward.

Since there's precedent, I'll plan on committing this in the next few days
unless someone objects.  I've added the rest of the RMT to this thread, too,
just in case.

--
nathan
Re: pgsql: Add function to get memory context stats for processes
On Sat, Apr 26, 2025 at 5:07 PM Tomas Vondra wrote:
> > Just for the record, it sounds quite unsafe to me too. I could
> > credit it being all right to examine the process' MemoryContext data
> > structures, but calling dsa_create() from CFI seems really insane.
> > Way too many moving parts there.
>
> Maybe I'm oblivious to some very obvious issues, but why is this so
> different from everything else that is already called from
> ProcessInterrupts()? Perhaps dsa_create() is more complex compared to
> the other stuff (haven't checked), but why would it be unsafe?
>
> The one risk I can think of is a risk of recursion - if any of the
> functions called from ProcessGetMemoryContextInterrupt() does CFI, could
> that be a problem if there's another pending signal. I see some other
> handlers (e.g. ProcessParallelApplyMessages) handle this by explicitly
> holding interrupts. Should ProcessGetMemoryContextInterrupt() do the
> same thing?
>
> In any case, if DSA happens to not be the right way to transfer this,
> what should we use instead? The only thing I can think of is some sort
> of pre-allocated chunk of shared memory.

The big disadvantage of a pre-allocated chunk of shared memory is that it
would necessarily be fixed size, and a memory context dump can be really big.
Another alternative would be a temporary file.

But none of that settles the question of safety. I think part of the reason
why it's possible for us to have disagreements about whether this is safe is
that we don't have any clear documentation of what you can assume to be true
at a CHECK_FOR_INTERRUPTS(). It's not possible for there to be an LWLock held
at that point, because we hold off interrupts when you acquire an LWLock and
don't re-enable them until all LWLocks have been released. We can't be
holding a spinlock, either, because we only insert CHECK_FOR_INTERRUPTS() at
the top of loops, and code that holds a spinlock is supposed to be
straight-line code that never loops. But what else can you assume? Can you
assume, for example, that there's a transaction? I doubt it. Can you assume
that the transaction is non-aborted? I doubt that even more. There's no
obvious-to-me reason why those things should be true. And in fact if you try
this on a backend waiting in an aborted transaction, it breaks:

robert.haas=# select pg_backend_pid();
 pg_backend_pid
----------------
          19321
(1 row)

robert.haas=# begin;
BEGIN
robert.haas=*# select 1/0;
ERROR:  division by zero

And then from another session, run this command, using the PID from above:

select * from pg_get_process_memory_contexts(19321, false, 1);

Then you get:

2025-04-30 15:14:33.878 EDT [19321] ERROR:  ResourceOwnerEnlarge called after release started
2025-04-30 15:14:33.878 EDT [19321] FATAL:  terminating connection because protocol synchronization was lost

I kind of doubt whether that's the only problem here, but it's *a* problem.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: pgsql: Add function to get memory context stats for processes
> On 30 Apr 2025, at 22:17, Robert Haas wrote:
> I kind of doubt whether that's the only problem here, but it's *a* problem.

This is indeed exposing a pre-existing issue, which was reported in [0], and a
patch fixing it has been submitted.  I have been testing and reworking the
patch slightly, but due to $life and $work events have yet to have time to
push it.

--
Daniel Gustafsson

[0] 3eb40b3e-45c7-426a-b7f8-81f7d05a9...@oss.nttdata.com
Re: fixing CREATEROLE
necro-ing an old thread ...

Robert Haas writes:
> [ v4-0002-Add-new-GUC-createrole_self_grant.patch ]

I confess to not having paid close enough attention when these patches went
in, or I would have complained about createrole_self_grant.  It changes the
user-visible behavior of SQL commands, specifically CREATE ROLE.  We have
learned over and over again that GUCs that do that are generally a bad idea.

Two years later, it's perhaps too late to take it out again.  However, I'd at
least like to complain about the fact that it breaks pg_dumpall, which is
surely not expecting anything but the default behavior.  If for any reason
the restore is run under a non-default setting of createrole_self_grant,
there's a potential of creating role grants that were not there in the source
database.

Admittedly the damage is probably limited by the fact that it only applies if
the restoring user has CREATEROLE but not SUPERUSER, which I imagine is a
rare case.  But don't we need to add createrole_self_grant to the set of GUCs
that pg_dump[all] resets in the emitted SQL?

			regards, tom lane
Improve explicit cursor handling in pg_stat_statements
Hi hackers, I recently looked into a workload that makes heavy use of explicit cursors, and I found that pg_stat_statements can become a bottleneck. The application in question declares hundreds of cursors, and for each one, performs many FETCH and MOVE operations with varying fetch sizes. As a result, pg_stat_statements ends up overwhelmed by the deallocation (and garbage collection) of DECLARE CURSOR, FETCH, and MOVE entries. Each of these is associated with a unique queryId, which leads to bloated entries with limited diagnostic value. Other issues: 1. FETCH/MOVE statements don't really help much in laying blame on a specific query. the DECLARE CURSOR statement could have been evicted in pg_stat_statements by that point or a similar cursor name is pointing to a different query. Also, FETCH doesn't aggregate for the same cursor — e.g., FETCH 10 c1 and FETCH 20 c1 show up as separate entries. 2. DECLARE CURSOR doesn't provide execution stats for the underlying SQL. Enabling pg_stat_statements.track = 'all' can expose the underlying SQL, but adds overhead.There’s also a bug: the toplevel column for the underlying query is still marked as "t", even though you must set track "all" to see it. Based on this, I propose the following improvements: 1. Better normalization of cursor utility commands: 2. Normalize the cursor name in CLOSE. 3. Normalize fetch/move sizes in FETCH and MOVE. Users can use the rows and calls columns to derive average fetch size. Ideally I would want to normalize the cursor name and generate the queryId f the FETCH statement based on the underlying query, but that is not possible to do that post parsing. (The above normalizations of these utility statements will reduce the bloat.) 4. Track the underlying query of a cursor by default, even when pg_stat_statements.track_utility = off. I’ve attached two patches that implement this. Here's a quick example: ``` begin; declare cs1 cursor for select from pg_class; declare cs2 cursor for select from pg_class; fetch 10 cs1; fetch 20 cs1; fetch 10 cs1; fetch 10 cs2; close cs1; close cs2; declare cs1 cursor for select from pg_attribute; SELECT calls, rows, query, toplevel FROM pg_stat_statements ORDER BY query COLLATE "C"; commit; ``` current state: ``` postgres=*# SELECT calls, rows, query, toplevel FROM pg_stat_statements ORDER BY query COLLATE "C"; calls | rows | query | toplevel ---+--+---+-- 1 |1 | SELECT nameFROM pg_catalog.pg_available_extensions WHERE name LIKE $1 AND installed_version IS NULL+| t | | LIMIT $2 | 1 |0 | begin | t 1 |0 | close cs1 | t 1 |0 | close cs2 | t 1 |0 | create extension pg_stat_statements | t 1 |0 | declare cs1 cursor for select from pg_attribute | t 1 |0 | declare cs1 cursor for select from pg_class | t 1 |0 | declare cs2 cursor for select from pg_class | t 2 | 20 | fetch 10 cs1 | t 1 | 10 | fetch 10 cs2 | t 1 | 20 | fetch 20 cs1 | t (11 rows) ``` with both patches applied: ``` postgres=*# SELECT calls, rows, query, toplevel FROM pg_stat_statements ORDER BY query COLLATE "C"; calls | rows | query | toplevel ---+--++-- 1 |0 | begin | t 2 |0 | close $1 | t 1 |0 | declare $1 cursor for select from pg_attribute | t 2 |0 | declare $1 cursor for select from pg_class | t 3 | 40 | fetch $1 cs1 | t 1 | 10 | fetch $1 cs2 | t 2 | 50 | select from pg_class | t (7 rows) postgres=*# commit; COMMIT ``` FWIW, I raised this ~3 years ago [0], but there was not much interest. I have seen this being a problem a few times since then that I think something should be done about. 
I also was not happy with the approach I took in [0]. Looking forward to feedback! Regards, -- Sami Imseih Amazon Web Services (AWS) [0] https://www.postgresql.org/message-id/flat/203CFCF7-176E-4AFC-A48E-B2CECFECD6AA%40amazon.com v1-0001-Improve-cursor-handling-in-pg_stat_stateme
Re: Typo in multixact.c and jsonfuncs.c documentation
On 2025/04/30 10:41, David Rowley wrote: On Wed, 30 Apr 2025 at 13:25, Junwang Zhao wrote: Found a trivial typo in multixact.c, after searching the code base there is another such typo in jsonfuncs.c. Thanks. Pushed. This commit seems to have caused an indent-check failure on the buildfarm member koel. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koel&dt=2025-04-30%2002%3A19%3A00 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN
On 28.04.25 18:56, Álvaro Herrera wrote: On 2025-Apr-23, Nathan Bossart wrote: On Mon, Mar 24, 2025 at 11:37:20AM +0100, Álvaro Herrera wrote: I'd add a note about these two things to the open items page, and wait to see if we get some of these limitations fixed, so that if we don't, we remember to note this limitation in the documentation. Are we still waiting on something for this, or should we proceed with the documentation changes? It doesn't seem tremendously urgent, but I noticed it's been about a month since the last message on this thread. I've edited the Open Items page to disclaim my responsibility from this item, since this comes from virtual generated columns which is not my turf. I think we should just document the current state of affairs; we can come back with further code improvements during the next cycle. Here is a proposed patch that includes some text about virtual generated columns and also fixes up a small mistake in the previous patch (confused identity and generated columns) and improves the wording and formatting a bit more. From 33fb59c94ae3dbf6367e36c79f71dc9e291423d8 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 30 Apr 2025 11:11:15 +0200 Subject: [PATCH] doc: Improve explanations when a table rewrite is needed Further improvement for commit 11bd8318602. That commit confused identity and generated columns; fix that. Also, virtual generated columns have since been added; add more details about that. Also some small rewordings and reformattings to further improve clarity. Discussion: https://postgr.es/m/00e6eb5f5c793b8ef722252c7a519...@oss.nttdata.com --- doc/src/sgml/ref/alter_table.sgml | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index a75e75d800d..9bf7ca1462e 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -1436,22 +1436,31 @@ Notes Adding a column with a volatile DEFAULT -(e.g., clock_timestamp()), a generated column -(e.g., GENERATED BY DEFAULT AS IDENTITY), a domain -data type with constraints will require the entire table and its -indexes to be rewritten, as will changing the type of an existing -column. As an exception, when changing the type of an existing column, +(e.g., clock_timestamp()), a stored generated column, +an identity column, or a column with a domain data type that has +constraints will cause the entire table and its indexes to be rewritten. +Adding a virtual generated column never requires a rewrite. + + + +Changing the type of an existing column will also cause the entire table +and its indexes to be rewritten. +As an exception, when changing the type of an existing column, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not -needed. However, indexes must always be rebuilt unless the system +needed. However, indexes are always rebuilt unless the system can verify that the new index would be logically equivalent to the existing one. For example, if the collation for a column has been changed, an index rebuild is required because the new sort order might be different. However, in the absence of a collation change, a column can be changed from text to varchar (or vice versa) without rebuilding the indexes -because these data types sort identically. Table and/or index +because these data types sort identically. 
+ + + +Table and/or index rebuilds may take a significant amount of time for a large table, and will temporarily require as much as double the disk space. -- 2.49.0
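To make the documented rules concrete, here is a hedged SQL sketch (table and column names are invented, not taken from the patch) of the cases the reworded paragraphs describe:

```
-- Hedged illustration of the rewrite rules described above; names are invented.
CREATE TABLE rewrite_demo (id int, note text);

-- Constant default: no table rewrite, the default is applied lazily.
ALTER TABLE rewrite_demo ADD COLUMN added_at timestamptz DEFAULT '2025-01-01';

-- Volatile default: rewrites the table and its indexes.
ALTER TABLE rewrite_demo ADD COLUMN seen_at timestamptz DEFAULT clock_timestamp();

-- Identity column: also causes a rewrite.
ALTER TABLE rewrite_demo ADD COLUMN seq bigint GENERATED BY DEFAULT AS IDENTITY;

-- Virtual generated column: never requires a rewrite.
ALTER TABLE rewrite_demo ADD COLUMN id_twice int GENERATED ALWAYS AS (id * 2) VIRTUAL;

-- text to varchar is binary coercible: no table rewrite, and per the text
-- above no index rebuild either, since the two types sort identically.
ALTER TABLE rewrite_demo ALTER COLUMN note TYPE varchar;
```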
Re: Some problems regarding the self-join elimination code
On Sun, Apr 27, 2025 at 2:02 PM Alexander Korotkov wrote: > > On Fri, Apr 11, 2025 at 5:46 PM Andrei Lepikhov wrote: > > On 4/10/25 14:39, Andrei Lepikhov wrote: > > > On 4/10/25 13:36, Alexander Korotkov wrote: > > >> On Wed, Apr 9, 2025 at 10:39 AM Andrei Lepikhov > > >> wrote: > > >>> It seems we are coming to the conclusion that join removal optimisation > > >>> may do something out of ChangeVarNodes resposibility. Before further > > >>> complicating of this function code I would like to know opinion of Tom, > > >>> who initially proposed [1] to use this routine. May be better a) return > > >>> to more specialised change_relid / sje_walker machinery or b) move > > >>> ChangeVarNodes out of rewriteManip and make it multi-purpose routine, > > >>> allowing to transform expression that may happen after a Var node > > >>> change? > > >> > > >> What about adding a callback to ChangeVarNodes_context that would > > >> called for each RestrictInfo after changing varnodes itself? SJE > > >> could use a callback that replaces OpExpr with NullTest when needed. > > > I think it is doable, of course. Just looking forward a little, it may > > > need more complication in the future (SJE definitely should be widened > > > to partitioned tables) and it may be simpler to have two different > > > routines for two different stages of planning. > > To provide some food for thought, here is a draft in attachment which > > addresses both issues: RestrictInfo relid replacement and move > > SJE-specific code out of the ChangeVarNodes routine (callback approach). > > Thank you, Andrei. I've put it all together. > 0001 Fixes material bugs in ChangeVarNodes_walker() including regression test > 0002 Puts back comments which got accidentally removed > 0003 Refactors ChangeVarNodesExtended() with custom user-defined callback > > I'm going to further work on improvement of these patches. Sorry, I accidentally sent some messages off-list. So, now 0001 and 0002 are pushed. I've revised the remaining refactoring patch. I've made a callback an additional callback, but not the replacement to the walker. It looks better for me now. Also, I've written some comments and the commit message. Any thoughts? -- Regards, Alexander Korotkov Supabase v4-0001-Refactor-ChangeVarNodesExtended-using-the-custom-.patch Description: Binary data
Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding
On Tue, Apr 22, 2025 at 5:00 PM Peter Smith wrote: > > Hi Ajin, > > Some review comments for patch v17-0001 > > == > Commit message > > 1. > Additionally, transactions containing WAL records (INTERNAL_SNAPSHOT, > COMMAND_ID, or INVALIDATION) that modify the historical snapshot > constructed during logical decoding are deemed unfilterable. This > is necessary because constructing a correct historical snapshot > for searching publication information requires processing these WAL record. > > ~ > > /these WAL record./these WAL records./ > Fixed. > == > src/backend/replication/logical/reorderbuffer.c > > ReorderBufferFilterByRelFileLocator: > > 2. > + if (found) > + { > + rb->try_to_filter_change = entry->filterable; > + return entry->filterable; > + } > + > > Bad indentation. > Fixed. > == > src/include/replication/reorderbuffer.h > > 3. > +extern bool ReorderBufferCanFilterChanges(ReorderBuffer *rb); > + > > Why is this extern here? This function is not implemented in patch 0001. > Fixed. On Wed, Apr 23, 2025 at 1:11 PM Peter Smith wrote: > > Hi Ajin, > > Some review comments for patch v17-0002. > > == > src/backend/replication/logical/decode.c > > 1. > /* > + * Check if filtering changes before decoding is supported and we're > not suppressing filter > + * changes currently. > + */ > +static inline bool > +FilterChangeIsEnabled(LogicalDecodingContext *ctx) > +{ > + return (ctx->callbacks.filter_change_cb != NULL && > + ctx->reorder->try_to_filter_change); > +} > + > > I still have doubts about the need for/benefits of this to be a > separate function. > > It is only called from *one* place within the other static function, > FilterChange. > > Just putting that code inline seems just as readable as maintaining > the separate function for it. > > SUGGESTION: > static inline bool > FilterChange(LogicalDecodingContext *ctx, XLogRecPtr origptr, TransactionId > xid, > RelFileLocator *target_locator, ReorderBufferChangeType > change_type) > { > return > ctx->callbacks.filter_change_cb && > ctx->reorder->try_to_filter_change && > ReorderBufferFilterByRelFileLocator(ctx->reorder, xid, origptr, > target_locator, > change_type)); > } > > Fixed. > == > src/backend/replication/logical/reorderbuffer.c > > RelFileLocatorCacheInvalidateCallback: > > 2. > if (hash_search(RelFileLocatorFilterCache, > - &entry->key, > - HASH_REMOVE, > - NULL) == NULL) > + &entry->key, > + HASH_REMOVE, > + NULL) == NULL) > elog(ERROR, "hash table corrupted"); > > I think this whitespace change belongs back in patch 0001 where this > function was introduced, not here in patch 0002. > Fixed. > ~~~ > > ReorderBufferFilterByRelFileLocator: > > 3. > + /* > + * Quick return if we already know that the relation is not to be > + * decoded. These are for special relations that are unlogged and for > + * sequences and catalogs. > + */ > + if (entry->filterable) > + return true; > > /These are for/This is for/ Fixed. > > ~~~ > > 4. 
> if (RelationIsValid(relation)) > { > - entry->relid = RelationGetRelid(relation); > + if (IsToastRelation(relation)) > + { > + char *toast_name = RelationGetRelationName(relation); > + int n PG_USED_FOR_ASSERTS_ONLY; > + > + n = sscanf(toast_name, "pg_toast_%u", &entry->relid); > + > + Assert(n == 1); > + } > + else > + entry->relid = RelationGetRelid(relation); > + > entry->filterable = false; > + rb->try_to_filter_change = rb->filter_change(rb, entry->relid, change_type, > + true, &cache_valid); > RelationClose(relation); > } > else > { > entry->relid = InvalidOid; > - entry->filterable = true; > + rb->try_to_filter_change = entry->filterable = true; > } > > rb->try_to_filter_change = entry->filterable; > ~ > > Something seems fishy here. AFAICT, the rb->try_to_filter_change will > already be assigned in both the *if* and the *else* blocks. So, why is > it being overwritten again outside that if/else? > Fixed. On Wed, Apr 23, 2025 at 1:49 PM Peter Smith wrote: > > Hi Ajin. > > Here are some v17-0003 review comments (aka some v16-0003 comments > that were not yet addressed or rejected) > > On Fri, Apr 11, 2025 at 4:05 PM Peter Smith wrote: > ... > > == > > Commit message > > > > 2. missing commit message > > Not yet addressed. Fixed. > > > ~~~ > > > > 8. > > # Insert, delete, and update tests for restricted publication tables > > $log_location = -s $node_publisher->logfile; > > $node_publisher->safe_psql('postgres', "INSERT INTO insert_only_table > > VALUES (1, 'to be inserted')"); > > $node_publisher->safe_psql('postgres', "UPDATE insert_only_table SET > > data = 'updated' WHERE id = 1"); > > $logfile = slurp_file($node_publisher->logfile, $log_location); > > ok($logfile =~ qr/Filtering UPDATE/, > > 'unpublished UPDATE is filtered'); > > > > $log_location = -s $node_publisher->logfile; > > $node_publisher->safe_psql('postgres', "DELETE FROM insert_only_table > > WHERE id = 1"); >
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 29 Apr 2025, at 02:10, Jacob Champion > wrote: > > On Wed, Apr 23, 2025 at 10:46 AM Jacob Champion > wrote: >> Are there any readers who feel like an internal ABI version for >> `struct pg_conn`, bumped during breaking backports, would be >> acceptable? (More definitively: are there any readers who would veto >> that?) > > To keep things moving: I assume this is unacceptable. So v10 redirects > every access to a PGconn struct member through a shim, similarly to > how conn->errorMessage was translated in v9. This adds plenty of new > boilerplate, but not a whole lot of complexity. To try to keep us > honest, libpq-int.h has been removed from the libpq-oauth includes. That admittedly seems like a win regardless. > This will now handle in-place minor version upgrades that swap pg_conn > internals around, so I've gone back to -MAJOR versioning alone. > fe_oauth_state is still exported; it now has an ABI warning above it. > (I figure that's easier to draw a line around during backports, > compared to everything in PGconn. We can still break things there > during major version upgrades.) While I'm far from the expert on this subject (luckily there are such in this thread), I am unable to see any sharp edges from reading and testing this version of the patch. A few small comments: +libpq-oauth is an optional module implementing the Device Authorization flow for +OAuth clients (RFC 8628). It was originally developed as part of libpq core and +later split out as its own shared library in order to isolate its dependency on +libcurl. (End users who don't want the Curl dependency can simply choose not to +install this module.) We should either clarify that it was never shipped as part of libpq core, or remove this altogether. I would vote for the latter since we typically don't document changes that happen during the devcycle. How about something like: +libpq-oauth is an optional module implementing the Device Authorization flow for +OAuth clients (RFC 8628). It is maintained as its own shared library in order to +isolate its dependency on libcurl. (End users who don't want the Curl dependency +can simply choose not to install this module.) +- void libpq_oauth_init(pgthreadlock_t threadlock, + +At the moment, pg_fe_run_oauth_flow() relies on libpq's pg_g_threadlock and +libpq_gettext(), which must be injected by libpq using this initialization +function before the flow is run. I think this explanatory paragraph should come before the function prototype. The following paragraph on the setters/getters make sense where it is though. +#if defined(USE_DYNAMIC_OAUTH) && defined(LIBPQ_INT_H) +#error do not rely on libpq-int.h in libpq-oauth.so +#endif Nitpick, but it won't be .so everywhere. Would this be clearar if spelled out with something like "do not rely on libpq-int.h when building libpq-oauth as dynamic shared lib"? -- Daniel Gustafsson
GIN tries to form a tuple with a partial compressedList during insertion
Hi, In the functions addItemPointersToLeafTuple and buildFreshLeafTuple (in gininsert.c), the result of ginCompressPostingList() is passed to GinFormTuple without checking whether ginCompressPostingList() successfully packed all items. These GinFormTuple calls consistently fail because the resulting tuples always exceed the maximum size. While this doesn’t result in data corruption, it might still be worth addressing. Additionally, the condition if (compressedList) is unnecessary, since ginCompressPostingList() never returns NULL. Please find the attached patch fixing it. Best regards, Arseniy Mukhin diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index a7b7b5996e3..5643c423627 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -218,7 +218,8 @@ addItemPointersToLeafTuple(GinState *ginstate, ItemPointerData *newItems, *oldItems; int oldNPosting, -newNPosting; +newNPosting, +nwritten; GinPostingList *compressedList; Assert(!GinIsPostingTree(old)); @@ -236,17 +237,18 @@ addItemPointersToLeafTuple(GinState *ginstate, /* Compress the posting list, and try to a build tuple with room for it */ res = NULL; compressedList = ginCompressPostingList(newItems, newNPosting, GinMaxItemSize, - NULL); +&nwritten); pfree(newItems); - if (compressedList) + if (nwritten == newNPosting) { res = GinFormTuple(ginstate, attnum, key, category, (char *) compressedList, SizeOfGinPostingList(compressedList), newNPosting, false); - pfree(compressedList); } +pfree(compressedList); + if (!res) { /* posting list would be too big, convert to posting tree */ @@ -293,17 +295,19 @@ buildFreshLeafTuple(GinState *ginstate, { IndexTuple res = NULL; GinPostingList *compressedList; +int nwritten; /* try to build a posting list tuple with all the items */ - compressedList = ginCompressPostingList(items, nitem, GinMaxItemSize, NULL); - if (compressedList) + compressedList = ginCompressPostingList(items, nitem, GinMaxItemSize, &nwritten); + if (nwritten == nitem) { res = GinFormTuple(ginstate, attnum, key, category, (char *) compressedList, SizeOfGinPostingList(compressedList), nitem, false); - pfree(compressedList); } +pfree(compressedList); + if (!res) { /* posting list would be too big, build posting tree */
Re: Enhancing Memory Context Statistics Reporting
On 29.04.25 15:13, Rahila Syed wrote:
> Please find attached a patch with some comments and documentation changes.
> Additionally, added a missing '\0' termination to the "Remaining Totals"
> string. I think this became necessary after we replaced dsa_allocate0()
> with dsa_allocate() in the latest version.

>    strncpy(nameptr, "Remaining Totals", namelen);
> +  nameptr[namelen] = '\0';

Looks like a case for strlcpy()?
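For readers unfamiliar with the difference, here is a minimal hedged C sketch (the buffer is assumed to be namelen + 1 bytes, as the patch's indexing implies) contrasting the two idioms; PostgreSQL supplies strlcpy() via src/port on platforms that lack it:

```
/*
 * Minimal sketch, not the actual patch.  strncpy() does not guarantee
 * NUL-termination when the source is at least namelen bytes long, hence
 * the explicit terminator in the patch; strlcpy() copies at most size - 1
 * bytes and always terminates.
 */
#include <string.h>

static void
fill_name(char *nameptr, size_t namelen)
{
	/* patch as posted: copy, then terminate by hand */
	strncpy(nameptr, "Remaining Totals", namelen);
	nameptr[namelen] = '\0';

	/* suggested alternative: one call, always terminated */
	strlcpy(nameptr, "Remaining Totals", namelen + 1);
}
```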
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Wed, Apr 30, 2025 at 11:22 AM Tom Lane wrote: > > Xuneng Zhou pointed out on Discord that the test case added by > 7c99dc587 has caused repeated failures in CI --- though oddly, > it's not failed in the buildfarm so far as I can find. The > failures look like > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading > publication: tap_pub_3) at > /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103. > I analyzed the relevant publisher-side CI Logs [1]: ... 2025-04-19 08:24:14.096 UTC [21961][client backend] [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3 values(1) 2025-04-19 08:24:14.098 UTC [21961][client backend] [024_add_drop_pub.pl][:0] LOG: disconnection: session time: 0:00:00.003 user=postgres database=postgres host=[local] 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG: released logical replication slot "tap_sub" 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG: disconnection: session time: 0:00:00.329 user=postgres database=postgres host=[local] 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] LOG: connection authenticated: user="postgres" method=trust (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117) 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] LOG: replication connection authorized: user=postgres application_name=tap_sub 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG: received replication command: IDENTIFY_SYSTEM 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] STATEMENT: IDENTIFY_SYSTEM 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', streaming 'parallel', origin 'any', publication_names '"tap_pub_ ... This shows that walsender restarts after the "INSERT INTO tab_3 values(1)" is processed by the previous walsender ("released logical replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)"). So, it is possible that the old apply worker has sent the confirmation of WAL received location after the Insert (due to keep_alive message handling). So, after the restart, the new walsender will start processing WAL after the INSERT and wait for the skipped message LOG timed out. Considering the above theory is correct, after "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_3", we should wait for the new walsender to restart. We are already doing the same for a similar case in 001_rep_changes.pl (See "# check that change of connection string and/or publication list causes restart of subscription workers. We check the state along with application_name to ensure that the walsender is (re)started.). Unfortunately, I will be away for the rest of the week. In the meantime, if you or someone else is able to reproduce and fix it, then good; otherwise, I'll take care of it after I return. [1] - https://api.cirrus-ci.com/v1/artifact/task/6561639182368768/testrun/build/testrun/subscription/024_add_drop_pub/log/024_add_drop_pub_publisher.log -- With Regards, Amit Kapila.
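A hedged sketch of the suggested wait (application name taken from the test above; this is not the actual fix), in the style of the check already done in 001_rep_changes.pl:

```
# Hedged sketch, not the committed fix: after ALTER SUBSCRIPTION ... SET
# PUBLICATION tap_pub_3, wait for the subscription's walsender to be
# (re)started on the publisher before looking for the "skipped loading
# publication" warning in the log.
$node_publisher->poll_query_until('postgres',
	"SELECT EXISTS (SELECT 1 FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming')"
) or die "timed out waiting for the walsender to restart";
```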
Re: Parallel CREATE INDEX for GIN indexes
On 4/18/25 03:03, Vinod Sridharan wrote: > Hello, > As part of testing this change I believe I found a scenario where the > parallel build seems to trigger OOMs for larger indexes. Specifically, > the calls for ginEntryInsert seem to leak memory into > TopTransactionContext and OOM/crash the outer process. > For serial build, the calls for ginEntryInsert tend to happen in a > temporary memory context that gets reset at the end of the > ginBuildCallback. > For inserts, the call has a custom memory context and gets reset at > the end of the insert. > For parallel build, during the merge phase, the MemoryContext isn't > swapped - and so this happens on the TopTransactionContext, and ends > up growing (especially for larger indexes). > Yes, that's true. The ginBuildCallbackParallel() already releases memory after flushing the in-memory state, but I missed _gin_parallel_merge() needs to be careful about memory usage too. I haven't been able to trigger OOM (or even particularly bad) memory usage, but I suppose it might be an issue with custom GIN opclasses with much wider keys. > I believe at the very least these should happen inside the tmpCtx > found in the GinBuildState and reset periodically. > > In the attached patch, I've tried to do this, and I'm able to build > the index without OOMing, and only consuming maintenance_work_mem > through the merge process. > > Would appreciate your thoughts on this (and whether there's other approaches > to > resolve this too). > The patch seems fine to me - I repeated the tests with mailing list archives, with MemoryContextStats() in _gin_parallel_merge, and it reliably minimizes the memory usage. So that's fine. I was also worried if this might have performance impact, but it actually seems to make it a little bit faster. I'll get this pushed. thanks -- Tomas Vondra
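For context, here is a hedged C fragment (not the committed patch; it assumes the tmpCtx that GinBuildState already carries) of the pattern being applied to the merge phase:

```
/*
 * Hedged fragment: run the merge phase's ginEntryInsert() calls inside the
 * build state's short-lived context and reset it periodically, so per-key
 * allocations do not accumulate in the caller's context (such as
 * TopTransactionContext).
 */
oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);

/* ... ginEntryInsert() calls for the merged keys go here ... */

MemoryContextSwitchTo(oldCtx);
MemoryContextReset(buildstate->tmpCtx);
```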
Re: Questions about logicalrep_worker_launch()
On 2025/04/29 21:21, Amit Kapila wrote: On Mon, Apr 28, 2025 at 2:37 PM Fujii Masao wrote: On 2025/04/26 3:03, Masahiko Sawada wrote: I agree with these changes. I think that while the changes for (2) should be for v19, the changes for (1) might be treated as a bug fix? Agreed. I've split the patch into two parts: 0001 is for (1) and is a bug fix that should be back-patched to v16, where parallel apply workers were introduced. Since it didn't apply cleanly to v16, I also created a separate patch specifically for v16. The situation for parallel_apply workers is not as bad as for tablesync workers because even if the worker for parallel apply is not available, the apply worker will apply the changes by itself. OTOH, if the tablesync worker is not available, the tablesync will be pending till the time a worker for the same is not available. So, I am not sure if this is a clear cut bug which requires a fix in backbranches. I'm fine with treating this as an improvement rather than a bug fix. In any case, I've registered the patches for the next CommitFest. The attached patches are the same as before, just rebased for the master branch. Additionally, shall we try to reproduce this case for parallel apply workers? I noticed this issue while reading the code, so I haven't actually reproduced it. Are you saying it's not possible to reproduce this in practice? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION From 134331b89c0c413b20866e25c332b23c253bfa54 Mon Sep 17 00:00:00 2001 From: Fujii Masao Date: Mon, 28 Apr 2025 14:29:26 +0900 Subject: [PATCH v3 1/2] Fix bug that could block the startup of parallel apply workers. If a logical replication worker fails to start and its parent crashes while waiting, its worker slot can remain marked as "in use". This can prevent new workers from starting, as the launcher may not find a free slot or may incorrectly think the sync or parallel apply worker limits have been reached. To handle this, the launcher already performs garbage collection when no free slot is found or when the sync worker limit is hit, and then retries launching workers. However, it previously did not trigger garbage collection when the parallel apply worker limit was reached. As a result, stale slots could block new parallel apply workers from starting, even though they could have been launched after cleanup. This commit fixes the issue by triggering garbage collection when the parallel apply worker limit is reached as well. If stale slots are cleared and the number of parallel apply workers drops below the limit, new parallel apply worker can then be started successfully. 
--- src/backend/replication/logical/launcher.c | 65 +++--- 1 file changed, 31 insertions(+), 34 deletions(-) diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index 10677da56b2..ac95afe4bae 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -96,7 +96,7 @@ static void logicalrep_launcher_onexit(int code, Datum arg); static void logicalrep_worker_onexit(int code, Datum arg); static void logicalrep_worker_detach(void); static void logicalrep_worker_cleanup(LogicalRepWorker *worker); -static int logicalrep_pa_worker_count(Oid subid); +static void logicalrep_worker_count(Oid subid, int *nsync, int *nparallelapply); static void logicalrep_launcher_attach_dshmem(void); static void ApplyLauncherSetWorkerStartTime(Oid subid, TimestampTz start_time); static TimestampTz ApplyLauncherGetWorkerStartTime(Oid subid); @@ -350,16 +350,21 @@ retry: } } - nsyncworkers = logicalrep_sync_worker_count(subid); + logicalrep_worker_count(subid, &nsyncworkers, &nparallelapplyworkers); now = GetCurrentTimestamp(); /* -* If we didn't find a free slot, try to do garbage collection. The -* reason we do this is because if some worker failed to start up and its -* parent has crashed while waiting, the in_use state was never cleared. +* If we can't start a new logical replication background worker because +* no free slot is available, or because the number of sync workers or +* parallel apply workers has reached the limit per subscriptoin, try +* running garbage collection. The reason we do this is because if some +* workers failed to start up and their parent has crashed while waiting, +* the in_use state was never cleared. By freeing up these stale worker +* slots, we may be able to start a new worker. */ - if (worker == NULL || nsyncworkers >= max_sync_workers_per_subscription) + if (worker == NULL || nsyncworkers >= max_sync_workers_per_subscription || + nparallelapplyworkers >= max_parallel_apply_workers_per_subscription) { bool
Re: Introduce some randomness to autovacuum
Hi Sami,

On Thu, May 1, 2025 at 1:56 AM Sami Imseih wrote:
>
> > Yes, it is masking the problem, but maybe a better way to think about it is
> > that it is delaying the
> > performance impact, allowing more time for a manual intervention of the
> > problematic table(s).
>
> I question how the user will gauge the success of setting the strategy
> to "random"? They may make
> it random by default, but fall into the same issues and revert it back
> to the default strategy.
>
> But also, the key as you mention is "manual intervention" which
> requires proper monitoring. I will
> argue that for the two cases that this proposal is seeking to correct,
> we already have good
> solutions that could be implemented by a user.
>
> Let's take the "spinning" case again. If a table has some sort of
> problem causing
> vacuum to error out, one can just disable autovacuum on a per-table
> level and correct
> the issue. Also, the xmin horizon being held back ( which is claimed
> to be the most common cause,
> and I agree with that ), well that one is just going to cause all your
> autovacuums to become
> useless.

Yeah, I tend to agree with you that the xmin horizon hold back will make autovacuum become useless for all tables. But I have a question; let me quote Andres' comment on Slack first:

```quote begin
It seems a bit silly to not just do some basic prioritization instead, but perhaps we just need to reach for some basic stuff, given that we seem unable to progress on prioritization.
```quote end

If randomness is not working, ISTM that the prioritization will not benefit the "spinning" case either, am I right?

> Also, I do think the starvation problem has a good answer now that
> autovacuum_max_workers
> can be modified online. Maybe something can be done for autovacuum to
> auto-tune this
> setting to give more workers at times when it's needed. Not sure what
> that looks like,
> but it is more possible now that this setting does not require a restart.

Good to know, thanks. One case I didn't mention is that some corruption due to vacuuming the same table might starve other tables too; randomness gives other tables some chance to be vacuumed. I do admit that multiple vacuum workers can eliminate this issue a little bit if the corrupted table's vacuum progress lasts for some time, but I think randomness is much better.

> --
> Sami Imseih
> Amazon Web Services (AWS)

--
Regards
Junwang Zhao
Re: Improve explicit cursor handling in pg_stat_statements
On Wed, Apr 30, 2025 at 02:43:41PM -0500, Sami Imseih wrote:
> I also want to add that the decision to not normalize the cursor name in
> the FETCH command is because it would not make sense to combine
> FETCH commands for various cursors into the same entry.

- calls | rows | query
+-------+------+------
- 2 | 0 | CLOSE cursor_stats_1
- 2 | 0 | DECLARE cursor_stats_1 CURSOR WITH HOLD FOR SELECT $1
+ calls | rows | query
+-------+------+------
+ 2 | 0 | CLOSE $1
+ 2 | 0 | DECLARE $1 CURSOR WITH HOLD FOR SELECT $2

Hmm. What are the workloads that you have seen as problematic? Do these involve cursor names generated randomly, where most of them are similar with a random factor for the name? Too much normalization here would limit the amount of verbosity that we have for this area, especially if we are dealing with query patterns that rely on a few CLOSE naming patterns spread across a couple of key queries, because we would not know anymore about their distribution.

- 1 | 5 | FETCH FORWARD 5 pgss_cursor
- 1 | 7 | FETCH FORWARD ALL pgss_cursor
- 1 | 1 | FETCH NEXT pgss_cursor
+ 1 | 0 | DECLARE $1 CURSOR FOR SELECT * FROM pgss_matv

Saying that, applying normalization for the number of FETCH looks like a natural move. It seems to me that we should still make a difference with the ALL case, though?

typedef struct ClosePortalStmt
{
	NodeTag		type;
-	char	   *portalname;	/* name of the portal (cursor) */
+	/* name of the portal (cursor) */
+	char	   *portalname pg_node_attr(query_jumble_ignore);
+	ParseLoc	location pg_node_attr(query_jumble_location);
	/* NULL means CLOSE ALL */

Could it matter to make a distinction with CLOSE ALL, compared to the case where the CLOSE statements are named? It would be possible to make a difference compared to the named case with an extra boolean field, for example. I would suggest adding some tests for CLOSE ALL anyway; we don't have any currently.
--
Michael

signature.asc Description: PGP signature
Re: Introduce some randomness to autovacuum
On Thu, May 1, 2025 at 8:12 AM David Rowley wrote: > > On Thu, 1 May 2025 at 03:29, Nathan Bossart wrote: > > That being said, I am -1 for this proposal. Autovacuum parameters and > > scheduling are already quite complicated, and making it nondeterministic > > would add an additional layer of complexity (and may introduce its own > > problems). But more importantly, IMHO it masks the problems instead of > > solving them more directly, and it could mask future problems, too. It'd > > probably behoove us to think about the known problems more deeply and to > > craft more targeted solutions. > > -1 from me too. > > It sounds like the aim is to fix the problem with autovacuum vacuuming > the same table over and over and being unable to remove enough dead > tuples due to something holding back the oldest xmin horizon. Why > can't we just fix that by remembering the value that > VacuumCutoffs.OldestXmin and only coming back to that table once > that's moved forward some amount? Users expect the tables to be auto vacuumed when: *dead_tuples > vac_base_thresh + vac_scale_factor * reltuples* If we depend on xid moving forward to do autovacuum, I think there are chances some bloated tables won't be vacuumed? > > David -- Regards Junwang Zhao
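For reference, the trigger condition quoted above can be checked by hand with a query along these lines; this is a hedged illustration assuming the default values of 50 and 0.2 for the threshold and scale factor:

```
-- Hedged illustration: tables whose dead tuples currently exceed
-- autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
SELECT s.relname,
       s.n_dead_tup,
       50 + 0.2 * c.reltuples AS vacuum_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > 50 + 0.2 * c.reltuples;
```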
Should shared_preload_libraries be loaded during binary upgrade?
Does it make sense to load "shared_preload_libraries" during binary upgrade mode? An extension might unintentionally interfere with pg_upgrade, for example, by connecting to the 'postgres' database, which can cause the upgrade to fail as the restore needs to drop that database. While it's true that extensions should ideally handle this themselves, wouldn't it be safer if we could avoid loading them at all during the binary upgrade mode? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Introduce some randomness to autovacuum
On Thu, 1 May 2025 at 17:35, Junwang Zhao wrote: > > On Thu, May 1, 2025 at 8:12 AM David Rowley wrote: > > It sounds like the aim is to fix the problem with autovacuum vacuuming > > the same table over and over and being unable to remove enough dead > > tuples due to something holding back the oldest xmin horizon. Why > > can't we just fix that by remembering the value that > > VacuumCutoffs.OldestXmin and only coming back to that table once > > that's moved forward some amount? > > Users expect the tables to be auto vacuumed when: > *dead_tuples > vac_base_thresh + vac_scale_factor * reltuples* > If we depend on xid moving forward to do autovacuum, I think > there are chances some bloated tables won't be vacuumed? Can you explain why you think that? The idea is to start vacuum other tables that perhaps can have dead tuples removed instead of repeating vacuums on the same table over and over without any chance of being able to remove any more dead tuples than we could during the last vacuum. David
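A hedged C sketch of that idea (the remembered per-table horizon field is hypothetical and does not exist today):

```
/*
 * Hedged sketch: remember the OldestXmin used by the previous vacuum of the
 * table (hypothetical field last_vacuum_oldest_xmin) and skip the table until
 * the current horizon has advanced past it, since re-vacuuming earlier cannot
 * remove any additional dead tuples.
 */
if (TransactionIdIsValid(tabentry->last_vacuum_oldest_xmin) &&
	!TransactionIdPrecedes(tabentry->last_vacuum_oldest_xmin, current_oldest_xmin))
	continue;			/* horizon has not moved; skip this table for now */
```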
not not - pgupgrade.sgml
Hi, It seems to me that, in doc/src/sgml/ref/pgupgrade.sgml, this phrase: "Because not all statistics are not transferred" should be: "Because not all statistics are transferred" Thanks, Erik
Re: Fix outdated comments for IndexInfo
On Thu, May 1, 2025 at 12:49 AM Japin Li wrote: > While working on [1], I found outdated comments in IndexInfo. > The attached patch corrects them. Nice catch. LGTM. Thanks Richard
queryId constant squashing does not support prepared statements
62d712ec added the ability to squash constants from an IN list for queryId computation purposes. This means that a similar queryId will be generated for queries that differ only in the number of values in the IN list. The patch lacks the ability to apply this optimization to values passed in as parameters (i.e., parameter kind = PARAM_EXTERN), which will be the case for SQL prepared statements and protocol-level prepared statements, i.e.

```
select from t where id in (1, 2, 3) \bind
```

or

```
prepare prp(int, int, int) as select from t where id in ($1, $2, $3);
```

Here is the current state,

```
postgres=# create table t (id int);
CREATE TABLE
postgres=# prepare prp(int, int, int) as select from t where id in ($1, $2, $3);
PREPARE
postgres=# execute prp(1, 2, 3);
postgres=# select from t where id in (1, 2, 3);
--
(0 rows)

postgres=# SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
 query | calls
-------+-------
...
 select from t where id in ($1 /*, ... */) | 1
 select from t where id in ($1, $2, $3)    | 1   <<- prepared statement
(6 rows)
```

but with the attached patch, the optimization applies.

```
 create table t (id int)                    | 1
 select from t where id in ($1 /*, ... */)  | 2
(3 rows)
```

I think this is a pretty big gap, as many of the common drivers such as JDBC, which use the extended query protocol, will not be able to take advantage of the optimization in 18, which will be very disappointing.

Thoughts?

Sami Imseih
Amazon Web Services (AWS)

v1-0001-Allow-query-jumble-to-squash-a-list-external-para.patch Description: Binary data
Re: pgsql: Add function to get memory context stats for processes
On Wed, Apr 30, 2025 at 4:29 PM Daniel Gustafsson wrote: > This is indeed exposing a pre-existing issue, which was reported in [0] and a > patch fixing it has been submitted. I have been testing and reworking the > patch slightly but due to $life and $work events have yet to have time to push > it. Thanks for the pointer. I will reply to that thread. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.
On Wed, Apr 30, 2025 at 5:24 PM Daniel Gustafsson wrote: > Attached is a current v4 with a few small tweaks. Sorry to turn up late here, but I strongly disagree with the notion that this is a bug in the DSM or DSA code. It seems to me that it is the caller's responsibility to provide a valid resource owner, not the job of the called code to ignore the resource owner when it's unusable. I suspect that there are many other parts of the system that rely on the ResourceOwner machinery which likewise assume that the ResourceOwner that they are passed is valid. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 30, 2025 at 2:38 AM Zhijie Hou (Fujitsu) wrote: > > On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > > wrote: > > > > > > Your fix looks good to me. Is it worth considering putting an > > > assertion to verify if new two_phase_at is equal to or greater than > > > confirmed_lsn (or at least it doesn't go backward), when syncing > > > two_phase_at? > > > > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > > should be less than or equal to confirmed_flush). I have done that, changed > > a > > few comments, and committed the patch. > > I noticed a BF failure[1] related to the fix, where the recently added > assertion > Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. > After > examining the logs and code, I couldn't identify any functionality issues. Yeah, that's weird. > AFAICS, the last slotsync cycle should have retrieved the latest > confirmed_flush and > two_phase_at from the source slot, both expected to be 0/6005150. Here are > some details: > > The standby's log[1] shows the source slot's two_phase_at as 0/6005150. > Meanwhile, the publisher's log reveals that the source slot's confirmed_flush > was already 0/6005150 before the last slotsync. Therefore, the last slotsync > cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. > > If we assume remote_slot->confirmed_lsn during the last sync cycle is > 0/6005150, then either the local slot's confirmed_lsn had already been updated > to this value in a prior sync, leading it to skip the update in the last cycle > (in which case, the assertion shouldn't be broken), or it should enter the > slot > update branch to set the synced slot to 0/6005150 (see > update_local_synced_slot() for details). Thus, theoretically, the assertion > wouldn't fail. Agreed with your analysis. After enabling the subscription with two_phase=true, the primary server has the following logs that indicate that logical decoding started from 0/6005150, not 0/6004FC8. Given that the slot's two_phase was enabled at this time, the slot's confirmed_flush and two_phase_at were 0/6005150. 
2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: received replication command: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: acquired logical replication slot "lsub1_slot" 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, reading WAL from 0/6004A00. 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') > > Beyond functionality problems, another possibility might be the CPU reordered > memory access, causing the assertion to execute before updating > confirmed_flush. Not sure CPU reordered memory access could matter in this case but I don't have other ideas of possible causes. > However, we lack the information needed to verify this, so one > idea is to add some DEBUG message in update_local_synced_slot() to facilitate > tracking if the issue recurs. That would be a good idea. Also, it's unrelated to this issue but probably we are better off doing this assertion check only when enabling two_phase. The assertion is currently checked even when disabling the two_phase, which seems not to make sense, and we don't clear two_phase_at value when disabling two_phase by ReplicationSlotAlter(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
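A hedged C sketch of the narrower check being suggested here (not a proposed patch):

```
/*
 * Hedged sketch: only enforce the invariant while two_phase decoding is
 * enabled for the slot, since two_phase_at is not cleared when two_phase is
 * turned off via ReplicationSlotAlter().
 */
if (slot->data.two_phase)
	Assert(slot->data.two_phase_at <= slot->data.confirmed_flush);
```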
Re: Introduce some randomness to autovacuum
On Thu, 1 May 2025 at 03:29, Nathan Bossart wrote: > That being said, I am -1 for this proposal. Autovacuum parameters and > scheduling are already quite complicated, and making it nondeterministic > would add an additional layer of complexity (and may introduce its own > problems). But more importantly, IMHO it masks the problems instead of > solving them more directly, and it could mask future problems, too. It'd > probably behoove us to think about the known problems more deeply and to > craft more targeted solutions. -1 from me too. It sounds like the aim is to fix the problem with autovacuum vacuuming the same table over and over and being unable to remove enough dead tuples due to something holding back the oldest xmin horizon. Why can't we just fix that by remembering the value that VacuumCutoffs.OldestXmin and only coming back to that table once that's moved forward some amount? David
Re: Performance issues with v18 SQL-language-function changes
Alexander Lakhin writes: > Sorry if I've chosen the wrong thread, but starting from 0313c5dc6, > the following script: > CREATE TYPE aggtype AS (a int, b text); > CREATE FUNCTION aggfns_trans(aggtype[], integer, text) RETURNS aggtype[] > LANGUAGE sql AS 'SELECT > array_append($1,ROW($2,$3)::aggtype)'; > CREATE AGGREGATE aggfns(integer, text) (SFUNC = public.aggfns_trans, STYPE = > public.aggtype[], INITCOND = '{}'); > CREATE TABLE t(i int, k int); > INSERT INTO t SELECT 1, 2 FROM generate_series(1, 4000); > SET statement_timeout = '10s'; > SELECT aggfns(i, repeat('x', 8192)) OVER (PARTITION BY i) FROM t; > crashes the server for me like this: > corrupted size vs. prev_size while consolidating Hmm. What seems to be going on here is that once the aggfns_trans() result gets large enough that the SQL-function-result tuplestore decides to spill to disk, when we pull the result tuple back out of the tuplestore with tuplestore_gettupleslot we end up with the jf_resultSlot holding a should-free tuple pointer that points into the tuplestore's storage. After tuplestore_clear that is a dangling pointer, and the next use of the jf_resultSlot fails while trying to free the tuple. So the attached fixes it for me, but I'm still mighty confused because I don't understand why it didn't fail in the old code. This logic doesn't seem noticeably different from before, and there's even a very old comment (in the SRF path) alleging that /* NB: this might delete the slot's content, but we don't care */ Now we *do* care, but what changed? (As an aside, seems like tuplestore is leaving money on the table, because there's hardly any point in spilling to disk when it never holds more than one tuple. But that's not something to change now.) regards, tom lane diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c index e0bca7cb81c..37f616280c6 100644 --- a/src/backend/executor/functions.c +++ b/src/backend/executor/functions.c @@ -1728,8 +1728,9 @@ fmgr_sql(PG_FUNCTION_ARGS) elog(ERROR, "failed to fetch lazy-eval tuple"); /* Extract the result as a datum, and copy out from the slot */ result = postquel_get_single_result(slot, fcinfo, fcache); + /* Clear the slot, in case it points into the tuplestore */ + ExecClearTuple(slot); /* Clear the tuplestore, but keep it for next time */ - /* NB: this might delete the slot's content, but we don't care */ tuplestore_clear(fcache->tstore); /* @@ -1810,7 +1811,11 @@ fmgr_sql(PG_FUNCTION_ARGS) /* Re-use the junkfilter's output slot to fetch back the tuple */ slot = fcache->junkFilter->jf_resultSlot; if (tuplestore_gettupleslot(fcache->tstore, true, false, slot)) + { result = postquel_get_single_result(slot, fcinfo, fcache); +/* Clear the slot, in case it points into the tuplestore */ +ExecClearTuple(slot); + } else { fcinfo->isnull = true;
Re: queryId constant squashing does not support prepared statements
On Wed, Apr 30, 2025 at 04:52:06PM -0500, Sami Imseih wrote:
> 62d712ec added the ability to squash constants from an IN list
> for queryId computation purposes. This means that a similar
> queryId will be generated for queries that differ only in the
> number of values in the IN list.
>
> The patch lacks the ability to apply this optimization to values
> passed in as parameters (i.e., parameter kind = PARAM_EXTERN),
> which will be the case for SQL prepared statements and protocol-level
> prepared statements, i.e.
>
> I think this is a pretty big gap, as many of the common drivers such as JDBC,
> which use the extended query protocol, will not be able to take advantage of
> the optimization in 18, which will be very disappointing.
>
> Thoughts?

Yes. Long IN/ANY clauses are, as far as I know, a more common pattern caused by ORMs, and I'd like to think that application developers would not hardcode such clauses in their right minds (well, okay, I'm likely wrong about this assumption, feel free to counter-argue). These also like relying on the extended query protocol. Not taking JDBC into account in that is a bummer, and it is very popular.

I agree that the current solution we have in the tree feels incomplete because we are not taking into account the most common cases that users would care about. Now, allowing PARAM_EXTERN means that we allow any expression to be detected as a squashable thing, and this kind of breaks the assumption of IsSquashableConst(), where we want only constants to be allowed, because EXECUTE parameters can be any kind of Expr node. At least that's the intention of the code on HEAD.

Now, I am not actually objecting to PARAM_EXTERN being included or not, if there's a consensus behind it and my arguments are considered not relevant. The patch is written as if a PARAM_EXTERN implied that the expression is a Const, but that may not be so depending on the execution path given for the parameter. Or at least the patch could be clearer and rename the parts about the "Const" squashable APIs around queryjumblefuncs.c.

To be honest, the situation of HEAD makes me question whether we are using the right approach for this facility. I did mention an alternative a couple of months ago, but it comes down to accepting that any expression would be normalized. Unfortunately I never got down to studying it in detail, as this touches around expr_list in the parser: we could detect in the parser the start and end locations of an expression list in a query string, then group all of them together based on their location in the string. This would also be cheaper than going through all the elements in the list, tweaking things when dealing with a subquery.

The PARAM_EXTERN part has been mentioned a couple of weeks ago here, btw:
https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=v...@mail.gmail.com
--
Michael

signature.asc Description: PGP signature
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Peter, moving the conversation here from "pgsql: Improve nbtree skip scan primitive scan scheduling" thread on -committers. Attached is another regression test for your consideration, which gives rise to the following assertion: TRAP: failed Assert("numSkipArrayKeys == 0"), File: "nbtpreprocesskeys.c", Line: 1859, PID: 7210 0 postgres0x0001032f30e0 ExceptionalCondition + 108 1 postgres0x000102e83100 _bt_preprocess_keys + 6036 2 postgres0x000102e87338 _bt_first + 168 3 postgres0x000102e84680 btgettuple + 196 4 postgres0x000102e75cdc index_getnext_tid + 68 5 postgres0x0001030017a0 IndexOnlyNext + 228 6 postgres0x000102fe5b2c ExecScan + 228 7 postgres0x000103011088 ExecSort + 536 8 postgres0x000102fdbc68 standard_ExecutorRun + 312 9 postgres0x0001031bdfb8 PortalRunSelect + 236 10 postgres0x0001031bdbd4 PortalRun + 492 11 postgres0x0001031bcb18 exec_simple_query + 1292 12 postgres0x0001031b9d1c PostgresMain + 1388 13 postgres0x0001031b59a8 BackendInitialize + 0 14 postgres0x00010310edd8 postmaster_child_launch + 372 15 postgres0x00010311303c ServerLoop + 4948 16 postgres0x000103111360 InitProcessGlobals + 0 17 postgres0x000103030c00 help + 0 18 dyld0x00018eb17154 start + 2476 This looks sufficiently different from the assertion mentioned on the other thread to be worth posting here. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company _skipscan2.sql Description: application/sql
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
On Wed, Apr 30, 2025 at 9:12 PM Mark Dilger wrote: > TRAP: failed Assert("numSkipArrayKeys == 0"), File: "nbtpreprocesskeys.c", > Line: 1859, PID: 7210 > 0 postgres0x0001032f30e0 > ExceptionalCondition + 108 > 1 postgres0x000102e83100 > _bt_preprocess_keys + 6036 > This looks sufficiently different from the assertion mentioned on the other > thread to be worth posting here. It is a completely unrelated issue. Fortunately, this one is harmless. The assertion merely failed to account for the case where we completely end the scan during preprocessing due to it having an unsatisfiable array qual. Pushed a fix for this just now. Thanks -- Peter Geoghegan
Re: Performance issues with v18 SQL-language-function changes
Alexander Lakhin writes: > Sorry if I've chosen the wrong thread, but starting from 0313c5dc6, > the following script: > ... > crashes the server for me like this: > corrupted size vs. prev_size while consolidating Yup, duplicated here. Thanks for the report! regards, tom lane
Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.
> On 11 Apr 2025, at 21:08, Rahila Syed wrote: > > Hi Daniel, > > Thank you for your review and code improvements. > > Please find below some observations. > > 1. dsm_unpin_mapping(dsm_segment *seg) > + if (CurrentResourceOwner && > IsResourceOwnerReleasing(CurrentResourceOwner)) > + return; > > Given that the function can return without setting resowner to a > CurrentResourceOwner which is not NULL, shall we change the function > signature to return true when "unpin" is successful and false when not? Maybe, but at this point in the cycle we cannot change the prototype like that. Food for thought for v19 though. > 2. If value of IsResourceOwnerReleasing changes between dsm_create_descriptor > and attach_internal, the dsm segment and dsa area will end up with different > resource owners. > Earlier the chances of CurrentResourceOwner changing between the two > functions were zero. > May be something can be done to keep resowner assignments under both these > functions > in sync. Hmm, that's a good question, not sure. Do you have a repro for this handy? Attached is a current v4 with a few small tweaks. -- Daniel Gustafsson v4-0001-Don-t-try-to-enlarge-resourceowner-when-releasing.patch Description: Binary data
Batch TIDs lookup in ambulkdelete
Hi all, During index bulk-deletion in lazy vacuum, we currently check the deletability of each index tuple individually using the vac_tid_reaped() function. The attached proof-of-concept patches propose batching multiple TID lookups for deletability checks to reduce overhead. This optimization aims to minimize redundant function calls and repeated TidStore entry retrievals for TIDs on the same page. I have conducted benchmarks across several scenarios to evaluate the performance impact. # Case-1 (btree tuples are regular tuples and dead TIDs are concentrated): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select generate_series(1, ${NROWS}); create index on test (c); delete from test where c < ${NROWS} * 0.3; # Case-2 (btree tuples are regular tuples and dead TIDs are sparsed): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select generate_series(1, ${NROWS}); create index on test (c); delete from test where random() < 0.3; # Case-3 (btree tuples are deduplicated tuples): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select c % 1000 from generate_series(1, ${NROWS}) c; create index on test (c); select pg_relation_size('test') / 8192 as relpages \gset delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3); # Case-4 (btree tuples are deduplicated tuples and table is clustered): create unlogged table test (c int) with (autovacuum_enabled = off); insert into test select c % 1000 from generate_series(1, ${NROWS}) c; create index on test (c); cluster test using test_c_idx; select pg_relation_size('test') / 8192 as relpages \gset delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3); # Case-5 (creating btree index on UUID column) create unlogged table test (c uuid) with (autovacuum_enabled = off); insert into test select uuidv4() from generate_series(1, ${NROWS}) c; create index on test (c); select pg_relation_size('test') / 8192 as relpages \gset delete from test where (ctid::text::point)[0] < ((:'relpages')::int * 0.3); Here are the results (NROWS = 5000): HEAD PATCHEDDIFF case-1: 3,021 ms 2.818 ms93.29% case-2: 5, 697 ms 5.545 ms97.34% case-3: 2,833 ms 2.790 ms98.48% case-4: 2,564 ms 2.279 ms88.86% case-5: 4,657 ms 4.706 ms 101.04% I've measured 6 ~ 11% improvement in btree bulk-deletion. Here are the summary of each attached patch: 0001: Introduce TIdStoreIsMemberMulti() where we can do IsMember check for multiple TIDs in one function call. If the given TIDs are sorted (at least in block number), we can save radix tree lookup for the same page entry. 0002: Convert IndexBUlkDeleteCallback() to batched operation. 0003: Use batch TIDs lookup in btree index bulk-deletion. In patch 0003, we implement batch TID lookups for both each deduplicated index tuple and remaining all regular index tuples, which seems to be the most straightforward approach. While further optimizations are possible, such as performing batch TID lookups for all index tuples on a single page, these could introduce additional overhead from sorting and re-sorting TIDs. Moreover, considering that radix tree lookups are relatively inexpensive, the benefits of sorting TIDs and using TidStoreIsMemberMulti() might be minimal. Nevertheless, these potential optimizations warrant further evaluation to determine their actual impact on performance. Also, the patch includes the batch TIDs lookup support only for btree indexes but we potentially can improve other index AMs too. 
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

From ba4e7dc1f06f3dceab896b6ee4bf68ff23db1c09 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada
Date: Tue, 22 Apr 2025 12:25:11 -0700
Subject: [PATCH v1 3/3] Use batch TIDs lookup in btree index bulk-deletion.

TIDs in the postlist are sorted. But TIDs of the gathered regular index
tuples are not sorted.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/access/nbtree/nbtree.c | 107 +++--
 1 file changed, 71 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 821e16d5691..c92efecc076 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/stratnum.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "common/int.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
 #include "storage/bulk_write.h"
@@ -1309,6 +1310,13 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	IndexFreeSpaceMapVacuum(rel);
 }
+/* qsort comparator for sorting OffsetNumbers */
+static int
+cmpOffsetNumbers(const void *a, const void *b)
+{
+	return pg_cmp_u16(*(const OffsetNumber *) a, *(const OffsetNumber *) b);
+}
+
 /*
 * btvac
Re: Vacuum timing in pg_stat_all_tables
On Wed, Mar 12, 2025 at 10:52:57AM +, Bertrand Drouvot wrote:
> Note that it does not add extra explanation to "cost-based delay". If we feel
> the need we could add a link to "<xref linkend="runtime-config-resource-vacuum-cost"/>"
> like it has been done for delay_time in bb8dff9995f.

Sorry for the delay; this has been sitting in my inbox for some time, and I am catching up with some of my backlog.

-        Total time this table has been manually vacuumed, in milliseconds
+        Total time this table has been manually vacuumed, in milliseconds. (This
+        includes the time spent sleeping due to cost-based delay.)

Hmm, okay, adding this information to these four new fields is fine here, so I'll apply that on HEAD shortly. I can see that this matches the style used for some of the other fields, like n_tup_upd for the details regarding HOT.
--
Michael

signature.asc
Description: PGP signature
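As a usage note, a query along the following lines would read back the cumulative timings being documented here. The column names (total_vacuum_time, total_autovacuum_time) are taken to be the new fields discussed in this thread; this is an illustrative query, not part of the patch.

-- Cumulative vacuum/autovacuum time per table, in milliseconds; per the doc
-- change above, these figures include time spent sleeping in cost-based delay.
SELECT relname,
       vacuum_count,
       total_vacuum_time,
       autovacuum_count,
       total_autovacuum_time
FROM pg_stat_all_tables
WHERE total_vacuum_time > 0 OR total_autovacuum_time > 0
ORDER BY total_vacuum_time + total_autovacuum_time DESC;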
Re: fixing CREATEROLE
On Wed, Apr 30, 2025 at 1:29 PM Tom Lane wrote:
> But don't we need to add createrole_self_grant to the set of GUCs that
> pg_dump[all] resets in the emitted SQL?

The other approach would be to do what we do for the role options and just specify everything explicitly in the dump. The GUC is only a default specifier, so let's not leave room for defaults in the dump file.

David J.
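For illustration only, the two directions might translate into emitted SQL roughly like the fragments below. These are hypothetical dump fragments with invented role names, not actual pg_dump[all] output, and the exact statements would depend on how the implementation shakes out.

-- Direction 1: reset the GUC so the restore is not influenced by whatever
-- default the restoring role happens to have.
SET createrole_self_grant = '';
CREATE ROLE staff NOSUPERUSER NOCREATEDB CREATEROLE;

-- Direction 2: leave the GUC alone and spell the grants out explicitly,
-- the way the boolean role options are already dumped explicitly.
CREATE ROLE staff NOSUPERUSER NOCREATEDB CREATEROLE;
GRANT staff TO dump_owner WITH ADMIN TRUE, INHERIT TRUE, SET TRUE;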
Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.
On Wed, Apr 30, 2025 at 06:03:49PM -0400, Robert Haas wrote:
> Sorry to turn up late here, but I strongly disagree with the notion
> that this is a bug in the DSM or DSA code. It seems to me that it is
> the caller's responsibility to provide a valid resource owner, not the
> job of the called code to ignore the resource owner when it's
> unusable. I suspect that there are many other parts of the system that
> rely on the ResourceOwner machinery which likewise assume that the
> ResourceOwner that they are passed is valid.

Yeah, dshash would be one, I think.

It feels to me that if you want this kind of policy to be checked, it should be done in the shape of one or more assertions based on the state of the resource owner expected in these low-level paths, rather than tweaking the DSA and DSM code to do what you are expecting here, and such new policies should only be enforced on HEAD to avoid disruption of existing systems. I'm actually rather scared of the patch: isn't there a risk of breaking existing patterns that worked out of the box by forcing the resowner to not be set? My spidey sense tingles when I see such patterns, because this enforces assumptions hidden from the callers.
--
Michael

signature.asc
Description: PGP signature
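Purely to illustrate the contrast being drawn (defensively skipping an unusable context versus asserting the caller's precondition), here is a self-contained toy in plain C. ResourceContext and both register_segment_* functions are invented names, not PostgreSQL code; a real check would be expressed in terms of the ResourceOwner state in the low-level DSM/DSA paths.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct ResourceContext
{
    bool    being_released;   /* stand-in for "not in a usable state" */
} ResourceContext;

/* Option A (what the patch effectively does): silently skip registration. */
static void
register_segment_defensive(ResourceContext *ctx)
{
    if (ctx == NULL || ctx->being_released)
        return;                /* pretend nothing happened */
    printf("registered with owner\n");
}

/* Option B (assertion-based policy): make the precondition explicit. */
static void
register_segment_assert(ResourceContext *ctx)
{
    assert(ctx == NULL || !ctx->being_released);
    if (ctx != NULL)
        printf("registered with owner\n");
}

int
main(void)
{
    ResourceContext ok = {false};

    register_segment_defensive(&ok);
    register_segment_assert(&ok);
    register_segment_assert(NULL);   /* a NULL owner remains allowed */
    return 0;
}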
Re: POC: Parallel processing of indexes in autovacuum
Hi,

On Wed, Apr 16, 2025 at 4:05 AM Maxim Orlov wrote:
>
> Hi!
>
> The VACUUM command can be executed with the parallel option. As the documentation
> states, it will perform the index vacuum and index cleanup phases of VACUUM in
> parallel using integer background workers. But such an interesting feature is
> not used for autovacuum. After a quick look at the source code, it became
> clear to me that when the parallel option was added, the corresponding option
> for autovacuum wasn't implemented, although there are no clear obstacles to
> doing so.
>
> Actually, one of our customers stepped into a problem with autovacuum on a table
> with many indexes and relatively long transactions. Of course, long
> transactions are an ultimate evil and the problem can be solved by running
> vacuum from a cron task, but, I think, we can do better.
>
> Anyhow, what about adding a parallel option for autovacuum? Here is a POC
> patch for the proposed functionality. For the sake of simplicity, several GUCs
> have been added. It would be good to think through the parallel launch
> condition without them.
>
> As always, any thoughts and opinions are very welcome!

As I understand it, we initially disabled parallel vacuum for autovacuum because their objectives are somewhat contradictory. Parallel vacuum aims to accelerate the process by utilizing additional resources, while autovacuum is designed to perform cleaning operations with minimal impact on foreground transaction processing (e.g., through vacuum delay). Nevertheless, I see your point about the potential benefits of using parallel vacuum within autovacuum in specific scenarios.

The crucial consideration is determining appropriate criteria for triggering parallel vacuum in autovacuum. Given that we currently support only parallel index processing, suitable candidates might be autovacuum operations on large tables that have a substantial number of sufficiently large indexes and a high volume of garbage tuples. Once we have parallel heap vacuum, as discussed in thread [1], it would also likely be beneficial to incorporate it into autovacuum during aggressive vacuum or failsafe mode.

Although the actual number of parallel workers ultimately depends on the number of eligible indexes, it might be beneficial to introduce a storage parameter, say parallel_vacuum_workers, that allows control over the number of parallel vacuum workers on a per-table basis.

Regarding implementation: I notice the WIP patch implements its own parallel vacuum mechanism for autovacuum. Have you considered simply setting at_params.nworkers to a value greater than zero?

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAEfCNv-GgaDheDJ%2Bs-p_Lv1H24AiJeNoPGCmZNSwL1YA%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
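A standalone sketch of the kind of worker-count decision discussed above (a per-table cap plus an "enough large indexes and enough garbage" test) might look like the following. The thresholds, the TableVacState fields, and the parallel_vacuum_workers reloption are all assumptions made for illustration; they are not part of the POC patch, and the chosen value would ultimately be fed into something like at_params.nworkers.

#include <stdio.h>

typedef struct TableVacState
{
    int     nindexes;               /* indexes on the table */
    int     nlarge_indexes;         /* indexes above a size threshold */
    double  dead_tuple_ratio;       /* fraction of dead tuples */
    int     parallel_vacuum_workers;/* hypothetical per-table reloption; -1 = unset */
    int     max_worker_budget;      /* global budget available to autovacuum */
} TableVacState;

static int
autovac_choose_nworkers(const TableVacState *t)
{
    int     nworkers;

    /* not worth the setup cost: few large indexes or little garbage */
    if (t->nlarge_indexes < 2 || t->dead_tuple_ratio < 0.1)
        return 0;

    /* at most one worker per eligible index, as in manual VACUUM (PARALLEL) */
    nworkers = t->nlarge_indexes;

    /* honor the per-table cap when it is set */
    if (t->parallel_vacuum_workers >= 0 && nworkers > t->parallel_vacuum_workers)
        nworkers = t->parallel_vacuum_workers;

    /* never exceed the global worker budget */
    if (nworkers > t->max_worker_budget)
        nworkers = t->max_worker_budget;

    return nworkers;
}

int
main(void)
{
    TableVacState t = {.nindexes = 8, .nlarge_indexes = 5,
                       .dead_tuple_ratio = 0.25,
                       .parallel_vacuum_workers = 3,
                       .max_worker_budget = 8};

    printf("nworkers = %d\n", autovac_choose_nworkers(&t)); /* prints 3 */
    return 0;
}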