Re: Commit fest 2025-03
On Wed, 5 Mar 2025 at 16:50, Jim Jones wrote:
>
> Hi Vignesh
>
> On 05.03.25 10:22, vignesh C wrote:
> > The following "Ready for committer" patches need a rebase
> > ---
> > Truncate logs by max_log_size - Kirill Gavrilov
> >
> > Patch owners, please provide a rebased version to prepare it for
> > reviewers and committers.
>
> Is there something wrong with the commitfest app? This patch applies
> cleanly and passes all tests.

This issue has been addressed now; here are the updated patches that need to be rebased:

The walsender does not update its IO statistics until it exits - Bertrand Drouvot
noreturn attribute for MSVC, C11 - Peter Eisentraut
Add pg_stat_session - Sergey Dudoladov
Logging plan of the currently running query - Atsushi Torikoshi
Index Prefetching - Tomas Vondra
Allow partition-wise join when whole row var is needed - Alexander Pyhalov
Asynchronous MergeAppend Execution - Alexander Pyhalov
AIO - Andres Freund
Limiting overshoot in nbtree SAOP parallel index scans - Matthias van de Meent
Use read_stream in index vacuum - Andrey Borodin
Adding compression of temporary files - Filip Janus
Allow to use an index for ILIKE in more cases - Yugo NAGATA
Use Bump allocator for HashAgg - Jeff Davis
Read stream scalability improvements and AIO-compatibility - Thomas Munro
Compress big WAL records - Andrey Borodin
Reduce timing overhead of EXPLAIN ANALYZE using rdtsc - Lukas Fittl
Use XLOG_CONTROL_FILE macro everywhere - Anton A. Melnikov
Don't dirty pages while they are getting flushed out - Andres Freund
Enable logical decoding when wal_level = 'replica' without a server restart - Masahiko Sawada
VACUUM FULL / CLUSTER CONCURRENTLY - Antonin Houska
NOT ENFORCED constraint feature - Amul Sul
Changing shared_buffers without restart - Ashutosh Bapat
Extended Statistics set/restore/clear functions - Corey Huinker
general purpose array_sort - jian he
explain plans for foreign servers - Dinesh Salve
Enable fine-grained control over what gets logged on connection attempt (reduces log size) - Sergey
Allow CI to only run the compiler warnings task - Bertrand Drouvot

Patch owners, please provide a rebased version to prepare it for reviewers and committers.

Regards,
Vignesh
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On Wed, Mar 5, 2025 at 4:43 AM Alexander Korotkov wrote: > > On Mon, Mar 3, 2025 at 10:24 AM Andrei Lepikhov wrote: > > On 17/2/2025 01:34, Alexander Korotkov wrote: > > > Hi, Andrei! > > > > > > On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov wrote: > > > Thank you for your work on this subject. I agree with the general > > > direction. While everyone has used conservative estimates for a long > > > time, it's better to change them only when we're sure about it. > > > However, I'm still not sure I get the conservatism. > > > > > > if (innerbucketsize > thisbucketsize) > > > innerbucketsize = thisbucketsize; > > > if (innermcvfreq > thismcvfreq) > > > innermcvfreq = thismcvfreq; > > > > > > IFAICS, even in the worst case (all columns are totally correlated), > > > the overall bucket size should be the smallest bucket size among > > > clauses (not the largest). And the same is true of MCV. As a mental > > > experiment, we can add a new clause to hash join, which is always true > > > because columns on both sides have the same value. In fact, it would > > > have almost no influence except for the cost of extracting additional > > > columns and the cost of executing additional operators. But in the > > > current model, this additional clause would completely ruin > > > thisbucketsize and thismcvfreq, making hash join extremely > > > unappealing. Should we still revise this to calculate minimum instead > > > of maximum? > > I agree with your point. But I think the code works precisely the way > > you have described. > > You're right. I just messed up with the sides of comparison operator. I've revised commit message, comments, formatting etc. I'm going to push this if no objections. -- Regards, Alexander Korotkov Supabase v3-0001-Use-extended-stats-for-precise-estimation-of-buck.patch Description: Binary data
Re: Considering fractional paths in Append node
Hi! No objections. Alexander, thank you! -- Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: Incorrect assert in libpqwalreceiver
On 09/03/2025 10:09, Jacob Brazeal wrote: The libpqrcv_connect function asserts 'Assert(i < sizeof(keys))', where keys is declared as const char *keys[6];. However, sizeof(keys) is not the correct way to check the array length (on my system, for example, it's 48 = 6 * 8 at this callsite, not 6.) I attached a patch to fix the assert, but I suppose we could also just remove the assert altogether, since it hasn't been doing anything for at least 8 years. Committed, thanks! I think it's still valuable; it's an easy mistake to make, to add a parameter and forget to increase the array size. -- Heikki Linnakangas Neon (https://neon.tech)
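As an aside for readers outside the thread, the pitfall here is that sizeof applied to an array of pointers yields its size in bytes, not its element count. A minimal standalone C sketch of the distinction (not the committed patch; lengthof() is the element-count macro PostgreSQL's c.h provides):

    #include <assert.h>
    #include <stdio.h>

    /* Element count of a fixed-size array; PostgreSQL's c.h defines this as lengthof(). */
    #define lengthof(array) (sizeof(array) / sizeof((array)[0]))

    int
    main(void)
    {
        const char *keys[6];
        size_t      i = 7;

        /* sizeof() yields bytes, not elements: 6 pointers * 8 bytes on an LP64 build. */
        printf("sizeof(keys) = %zu\n", sizeof(keys));     /* 48 */
        printf("lengthof(keys) = %zu\n", lengthof(keys)); /* 6 */

        /* The loose bound never fires for i in 6..47, hiding an overrun ... */
        assert(i < sizeof(keys));   /* passes, although keys[7] is out of bounds */
        /* ... while the element-count bound catches the mistake immediately. */
        assert(i < lengthof(keys)); /* fails (aborts), as intended for illustration */
        return 0;
    }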
Re: Printing window function OVER clauses in EXPLAIN
Álvaro Herrera writes:
> Would it be possible and make sense to use notation of explicit WINDOW
> clauses, for cases where multiple window functions invoke identical
> window definitions?

There's something to be said for that. We would have to assign made-up
names to windows that didn't have one. But then the output might look like

    WindowAgg (...)
      Output: empno, depname, row_number() OVER (window1), rank() OVER (window1), count(*) OVER (window1), enroll_date
      Window: window1 = PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

which is surely a lot nicer than 3x repetitions of the window spec.
After reading David's mail I'd been thinking of something like

    WindowAgg (...)
      Output: empno, depname, row_number() OVER (...), rank() OVER (...), count(*) OVER (...), enroll_date
      Window: PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

which is shorter but vaguer. In particular, if you have more than one
WindowAgg, then with explicit names we'd have something like

    WindowAgg (...)
      Output: empno, depname, row_number() OVER (window1), rank() OVER (window1), (count(*) OVER (window2)), enroll_date
      Window: window1 = PARTITION BY depname ORDER BY enroll_date ROWS UNBOUNDED PRECEDING
    WindowAgg (...)
      Output: empno, depname, count(*) OVER (window2), enroll_date
      Window: window2 = PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

With "..." that would be confusing as heck to someone who didn't
understand the nuances of the extra parentheses.

> (Hmm, not sure if the Window clauses would be top-level or attached to
> each WindowAgg in its own level.)

IMO the obvious thing is to attach each WindowClause to the WindowAgg
node that implements it.

I'll go try to code this up.

regards, tom lane
Re: Parallel CREATE INDEX for GIN indexes
Tomas Vondra writes:
> I pushed the two smaller parts today.

Coverity is a little unhappy about this business in _gin_begin_parallel:

    bool        leaderparticipates = true;
    ...
    #ifdef DISABLE_LEADER_PARTICIPATION
        leaderparticipates = false;
    #endif
    ...
    scantuplesortstates = leaderparticipates ? request + 1 : request;

It says

>>> CID 1644203: Possible Control flow issues (DEADCODE)
>>> Execution cannot reach the expression "request" inside this statement:
>>> "scantuplesortstates = (lead...".
924     scantuplesortstates = leaderparticipates ? request + 1 : request;

If this were just temporary code I'd let it pass, but I see nothing
replacing this logic in the follow-up patches, so I think we ought
to do something to shut it up.

It's not complaining about the later bits like

    if (leaderparticipates)
        ginleader->nparticipanttuplesorts++;

(perhaps because there's no dead code there?)  So one idea is

    scantuplesortstates = request;
    if (leaderparticipates)
        scantuplesortstates++;

which would look more like the other code anyway.

regards, tom lane
Re: Statistics Import and Export
On Sat, 2025-03-08 at 10:56 -0500, Robert Treat wrote:
> In the UX world, the general pattern is people start to get
> overwhelmed once you get over a 1/2 dozen options (I think that's
> based on Miller's law, but might be mis-remembering); we are already
> at 9 for this use case. So really it is quite the opposite, we'd be
> reducing the burden on customers by simplifying the interface rather
> than just throwing out every possible combination and saying "you
> figure it out".

To be clear about your proposal:

* --include conflicts with --schema-only and --data-only
* --include overrides any default

is that right?

Thoughts on how we should document when/how to use --section vs
--include? Granted, that might be a point of confusion regardless of the
options we offer.

Regards,
Jeff Davis
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress
On Wed, 5 Mar 2025 at 21:43, Fujii Masao wrote: > > > > On 2024/11/30 15:23, Kirill Reshke wrote: > > On Fri, 11 Oct 2024 at 06:53, Fujii Masao > > wrote: > >> However, this issue already exists without the proposed patch. > >> Since file_fdw already reports progress partially, querying multiple > >> file_fdw tables can lead to inaccurate or confusing progress reports. > >> You can even observe this when analyzing a file_fdw table and also > >> when copying to the table with a trigger that executes progress-reporting > >> commands. > >> > >> So, I don’t think this issue should block the proposed patch. > >> In fact, progress reporting is already flawed in these scenarios, > >> regardless of whether the patch is applied. > > On second thought, supporting progress tracking for COPY used by file_fdw > could increase the chances of multiple commands being tracked simultaneously > by a single backend. This means the command progress view might show > incorrect results more often. > > As I mentioned before, this issue already exists, but it currently > only happens in rare cases. I don’t think the fact that the issue > already exists is a good reason to introduce more, and likely more common, > scenarios where it could occur. > > With that in mind, I'm thinking of withdrawing this patch for now. I've updated the status to "withdrawn." Feel free to add it again anytime if you change your mind. Regards, Vignesh
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi, A few days ago I came up with an idea to implement multi insert optimization wherever possible. I prepared a raw patch and it showed a great performance gain (up to 4 times during INSERT ... INTO ... in the best case). Then I was very happy to find this thread. You did a great job and I want to help you to bring the matter to an end. On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang wrote: > I did some performance test these days, and I have some findings. > HEAD: > 12.29% postgres[.] pg_checksum_block >6.33% postgres[.] GetPrivateRefCountEntry >5.40% postgres[.] pg_comp_crc32c_sse42 >4.54% [kernel][k] copy_user_enhanced_fast_string >2.69% postgres[.] BufferIsValid >1.52% postgres[.] XLogRecordAssemble > > Patched: > 11.75% postgres[.] tts_virtual_materialize >8.87% postgres[.] pg_checksum_block >8.17% postgres[.] slot_deform_heap_tuple >8.09% postgres[.] heap_compute_data_size >6.17% postgres[.] fill_val >3.81% postgres[.] heap_fill_tuple >3.37% postgres[.] tts_virtual_copyslot >2.62% [kernel][k] copy_user_enhanced_fast_string I applied v25 patches on the master branch and made some measurements to find out what is the bottleneck in this case. The 'time' utility showed that without a patch, this query will run 1.5 times slower. I also made a few flamegraphs for this test. Most of the time is spent calling these two functions : tts_virtual_copyslot and heap_form_tuple. All tests were run in virtual machine with these CPU characteristics: Architecture: x86_64 CPU(s): 2 On-line CPU(s) list:0,1 Virtualization features: Virtualization: AMD-V Hypervisor vendor: KVM Virtualization type:full Caches (sum of all): L1d:128 KiB (2 instances) L1i:128 KiB (2 instances) L2: 1 MiB (2 instances) L3: 32 MiB (2 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0,1 In my implementation, I used Tuplestore functionality to store tuples. In order to get rid of getting stuck in the above mentioned functions, I crossed it with the current implementation (v25 patches) and got a 10% increase in performance (for the test above). I also set up v22 patches to compare performance (with/without tuplestore) for INSERT ... INTO ... queries (with -j 4 -c 10 parameters for pgbech), and there also was an increase in TPS (about 3-4%). I attach a patch that adds Tuplestore to v25. What do you think about this idea? -- Best regards, Daniil Davydov From a59cfcbb05bb07c94a4c0ad6531baa5e531629ae Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sun, 9 Mar 2025 16:37:44 +0700 Subject: [PATCH] Replace holding tuples in virtual slots with tuplestorage During performance testing, it was found out that in the current implementation a lot of the program's time is spent calling two functions : tts_virtual_copyslot and heap_fill_tuple. Calls to these functions are related to the fact that tuples are stored in virtual_tts, so I propose to replace this logic with Tuplestore functionality. 
Discussion: https://www.postgresql.org/message-id/9F9326B4-8AD9-4858-B1C1-559FC64E6E93%40gmail.com --- src/backend/access/heap/heapam.c | 67 +++- src/include/access/heapam.h | 9 - 2 files changed, 38 insertions(+), 38 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index acdce1a4b4..276480213a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2665,7 +2665,6 @@ void heap_modify_buffer_insert(TableModifyState *state, TupleTableSlot *slot) { - TupleTableSlot *dstslot; HeapInsertState *istate; HeapMultiInsertState *mistate; MemoryContext oldcontext; @@ -2682,8 +2681,10 @@ heap_modify_buffer_insert(TableModifyState *state, mistate = (HeapMultiInsertState *) palloc(sizeof(HeapMultiInsertState)); mistate->slots = - (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS); - mistate->cur_slots = 0; + (TupleTableSlot **) palloc0(sizeof(void *) * HEAP_MAX_BUFFERED_SLOTS); + mistate->tstore = tuplestore_begin_heap(false, false, work_mem); + mistate->nused = 0; + istate->mistate = mistate; /* @@ -2702,36 +2703,11 @@ heap_modify_buffer_insert(TableModifyState *state, istate = (HeapInsertState *) state->data; Assert(istate->mistate != NULL); mistate = istate->mistate; - dstslot = mistate->slots[mistate->cur_slots]; - - if (dstslot == NULL) - { - /* - * We use virtual tuple slots buffered slots for leveraging the - * optimization it provides to minimize physical data copying. The - * virtual slot gets materialized when we copy (via below - * ExecCopySlot) the tuples from the so
Re: Considering fractional paths in Append node
On Wed, Mar 5, 2025 at 1:20 PM Alexander Korotkov wrote: > On Wed, Mar 5, 2025 at 8:32 AM Andrei Lepikhov wrote: > > On 5/3/2025 03:27, Alexander Korotkov wrote: > > > On Mon, Mar 3, 2025 at 1:04 PM Andrei Lepikhov wrote: > > >>> 2. As usage of root->tuple_fraction RelOptInfo it has been criticized, > > >>> do you think we could limit this to some simple cases? For instance, > > >>> check that RelOptInfo is the final result relation for given root. > > >> I believe that using tuple_fraction is not an issue. Instead, it serves > > >> as a flag that allows the upper-level optimisation to consider > > >> additional options. The upper-level optimiser has more variants to > > >> examine and will select the optimal path based on the knowledge > > >> available at that level. Therefore, we're not introducing a mistake > > >> here; we're simply adding extra work in the narrow case. However, having > > >> only the bottom-up planning process, I don't see how we could avoid this > > >> additional workload. > > > > > > Yes, but if we can assume root->tuple_fraction applies to result of > > > Append, it's strange we apply the same tuple fraction to all the child > > > rels. Latter rels should less likely be used at all and perhaps > > > should have less tuple_fraction. > > Of course, it may happen. But I'm not sure it is a common rule. > > Using LIMIT, we usually select data according to specific clauses. > > Imagine, we need TOP-100 ranked goods. Appending partitions of goods, we > > will use the index on the 'rating' column. But who knows how top-rated > > goods are spread across partitions? Maybe a single partition contains > > all of them? So, we need to select 100 top-rated goods from each partition. > > Hence, applying the same limit to each partition seems reasonable, right? > > Ok, I didn't notice add_paths_to_append_rel() is used for MergeAppend > as well. I thought again about regular Append. If can have required > number of rows from the first few children relations, the error of > tuple fraction shouldn't influence plans much, and other children > relations wouldn't be used at all. But if we don't, we unlikely get > prediction of selectivity accurate enough to predict which exact > children relations are going to be used. So, usage root tuple > fraction for every child relation would be safe. So, this approach > makes sense to me. I've slightly revised the commit message and comments. I'm going to push this if no objections. -- Regards, Alexander Korotkov Supabase From c9c0efa98b74d76912845339cb5b70dba3c8cb17 Mon Sep 17 00:00:00 2001 From: "Andrei V. Lepikhov" Date: Thu, 5 Dec 2024 15:15:49 +0700 Subject: [PATCH v3] Teach Append to consider tuple_fraction when accumulating subpaths. This change is dedicated to more active usage of IndexScan and parameterized NestLoop paths in partitioned cases under an Append node, as it already works with plain tables. As newly added regression tests demonstrate, it should provide more smartness to the partitionwise technique. With an indication of how many tuples are needed, it may be more meaningful to use the 'fractional branch' subpaths of the Append path list, which are more optimal for this specific number of tuples. Planning on a higher level, if the optimizer needs all the tuples, it will choose non-fractional paths. In the case when, during execution, Append needs to return fewer tuples than declared by tuple_fraction, it would not be harmful to use the 'intermediate' variant of paths. 
However, it will earn a considerable profit if a sensible set of tuples is selected. The change of the existing regression test demonstrates the positive outcome of this feature: instead of scanning the whole table, the optimizer prefers to use a parameterized scan, being aware of the only single tuple the join has to produce to perform the query. Discussion: https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com Author: Nikita Malakhov Author: Andrei Lepikhov Reviewed-by: Andy Fan Reviewed-by: Alexander Korotkov --- src/backend/optimizer/path/allpaths.c| 18 ++- src/backend/optimizer/plan/planner.c | 8 ++ src/test/regress/expected/partition_join.out | 116 +++ src/test/regress/expected/union.out | 15 ++- src/test/regress/sql/partition_join.sql | 21 5 files changed, 168 insertions(+), 10 deletions(-) diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index b5bc9b602e2..9e7f01a9684 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -1371,9 +1371,23 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, */ if (rel->consider_startup && childrel->cheapest_startup_path != NULL) { + Path *cheapest_path; + + /* + * With an indication of how many tuples the query should provide, + * the optimizer tries to choose the path optimal f
Re: [Doc] Improve hostssl related descriptions and option presentation
On Fri, 28 Feb 2025 at 00:08, David G. Johnston wrote: > > On Mon, Apr 22, 2024 at 2:20 PM David G. Johnston > wrote: >> >> Thoughts anyone? >> >> On Thu, Feb 1, 2024 at 3:47 PM David G. Johnston >> wrote: >>> >>> Motivated by a recent complaint [1] I found the hostssl related material in >>> our docs quite verbose and even repetitive. Some of that is normal since >>> we have both an overview/walk-through section as well as a reference >>> section. But the overview in particular was self-repetitive. Here is a >>> first pass at some improvements. The commit message contains both the >>> intent of the patch and some thoughts/questions still being considered. >>> >>> David J. >>> >>> [1] >>> https://www.postgresql.org/message-id/170672433664.663.11895343120533141715%40wrigleys.postgresql.org > > > Withdrawn as this now has conflicts from recent changes and no one seems > interested in voicing an opinion on this for or against. I've updated the commitfest entry status to "withdrawn." Feel free to add it again anytime if you change your mind. Regards, Vignesh
Re: general purpose array_sort
hi. patch rebased, also did some minor comments tweak. From c9398dfe889f23dce147db1719aa9fe4dfaa3adc Mon Sep 17 00:00:00 2001 From: jian he Date: Sun, 9 Mar 2025 20:45:20 +0800 Subject: [PATCH v16 1/2] general purpose array_sort Introduce the SQL-callable function array_sort(anyarray). The parameter passed to this function cannot truly be a polymorphic data type. Instead, it accepts any array type that supports the "less than" (`<`) operator. If the input parameter is a multidimensional array, array_sort will sort based on the first dimension. By default, sorting is performed based on the argument's collation. However, you can also specify a collation clause if needed, for special value NULL: nulls will appear after non-null values. for example: SELECT array_sort('{foo,bar,CCC,Abc,bbc}'::text[] COLLATE "C"); will sort based on "C" collation. Author: Junwang Zhao Co-authored-by: Jian He Reviewed-by: Michael Paquier Aleksander Alekseev , Tom Lane , David G. Johnston , Amit Langote , andr...@proxel.se , Robert Haas , Dean Rasheed discussion: https://postgr.es/m/CAEG8a3J41a4dpw_-F94fF-JPRXYxw-GfsgoGotKcjs9LVfEEvw%40mail.gmail.com --- doc/src/sgml/func.sgml| 20 +++ src/backend/utils/adt/array_userfuncs.c | 143 ++ src/backend/utils/adt/arrayfuncs.c| 3 +- src/include/catalog/pg_proc.dat | 3 + src/include/utils/array.h | 1 + src/test/regress/expected/arrays.out | 90 +++ .../regress/expected/collate.icu.utf8.out | 13 ++ src/test/regress/sql/arrays.sql | 25 +++ src/test/regress/sql/collate.icu.utf8.sql | 4 + src/tools/pgindent/typedefs.list | 1 + 10 files changed, 302 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 4d6061a8458..e24ef42ad98 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -20669,6 +20669,26 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sort + +array_sort ( anyarray ) +anyarray + + +Sorts the first dimension of the array. +The sort order is determined by the < operator of the element type, +nulls will appear after non-null values. +The collation to use can be forced by adding a COLLATE clause to any of the arguments. + + +array_sort(ARRAY[[2,4],[2,1],[6,5]]) +{{2,1},{2,4},{6,5}} + + + diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index 2aae2f8ed93..583e56fc805 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -15,6 +15,7 @@ #include "catalog/pg_type.h" #include "common/int.h" #include "common/pg_prng.h" +#include "miscadmin.h" #include "libpq/pqformat.h" #include "nodes/supportnodes.h" #include "port/pg_bitutils.h" @@ -22,6 +23,7 @@ #include "utils/builtins.h" #include "utils/datum.h" #include "utils/lsyscache.h" +#include "utils/tuplesort.h" #include "utils/typcache.h" /* @@ -43,6 +45,18 @@ typedef struct DeserialIOData Oid typioparam; } DeserialIOData; +/* + * ArraySortCachedInfo + * Used for caching data in array_sort + */ +typedef struct ArraySortCachedInfo +{ + TypeCacheEntry *typentry; /* type cache entry for element type */ + TypeCacheEntry *array_typentry; /* type cache entry for array type */ + ArrayMetaState array_meta; /* array metadata for better + * array_create_iterator performance */ +} ArraySortCachedInfo; + static Datum array_position_common(FunctionCallInfo fcinfo); @@ -1858,3 +1872,132 @@ array_reverse(PG_FUNCTION_ARGS) PG_RETURN_ARRAYTYPE_P(result); } + +/* + * array_sort + * + * Sorts the first dimension of the array. + * The sort order is determined by the "<" operator of the element type. 
+ */ +Datum +array_sort(PG_FUNCTION_ARGS) +{ + ArrayType *array = PG_GETARG_ARRAYTYPE_P(0); + Oid elmtyp; + Oid array_type = InvalidOid; + Oid collation = PG_GET_COLLATION(); + ArraySortCachedInfo *cache_info; + TypeCacheEntry *typentry; + Tuplesortstate *tuplesortstate; + ArrayIterator array_iterator; + Datum value; + bool isnull; + ArrayBuildStateAny *astate = NULL; + int ndim, + *dims, + *lbs; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + elmtyp = ARR_ELEMTYPE(array); + cache_info = (ArraySortCachedInfo *) fcinfo->flinfo->fn_extra; + if (cache_info == NULL) + { + cache_info = (ArraySortCachedInfo *) MemoryContextAlloc(fcinfo->flinfo->fn_mcxt, +sizeof(ArraySortCachedInfo)); + cache_info->typentry = NULL; + cache_info->array_typentry = NULL; + fcinfo->flinfo->fn_extra = (void *) cache_info; + } + + if (ndim == 1) + { + /* Finds the ordering operator for the type for 1-D arrays */ + typentry = cache_info->typentry; + if (typentry == NULL
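For readers who just want the proposed behavior, here are a few hypothetical calls whose expected outputs follow the semantics documented in the patch above (first-dimension ordering, NULLs sorted after non-null values, collation overridable with COLLATE); they are illustrative only, since the function is not committed:

    -- NULLs sort last, per the patch's func.sgml description.
    SELECT array_sort(ARRAY[3, 1, NULL, 2]);
    --  {1,2,3,NULL}

    -- COLLATE "C" forces byte-order comparison, per the commit message.
    SELECT array_sort('{foo,bar,CCC,Abc,bbc}'::text[] COLLATE "C");
    --  {Abc,CCC,bar,bbc,foo}

    -- Multidimensional input is sorted by its first dimension (example taken from the patch's docs).
    SELECT array_sort(ARRAY[[2,4],[2,1],[6,5]]);
    --  {{2,1},{2,4},{6,5}}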
Re: Printing window function OVER clauses in EXPLAIN
Hello Would it be possible and make sense to use notation of explicit WINDOW clauses, for cases where multiple window functions invoke identical window definitions? I'm thinking of something like explain verbose SELECT empno, depname, row_number() OVER testwin rn, rank() OVER testwin rnk, count(*) OVER testwin cnt FROM empsalary window testwin as (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING); for which, with the patch, we'd get this QUERY PLAN ─ WindowAgg (cost=74.64..101.29 rows=1070 width=68) Output: empno, depname, row_number() OVER (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), rank() OVER (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), count(*) OVER (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), enroll_date -> Sort (cost=74.54..77.21 rows=1070 width=44) Output: depname, enroll_date, empno Sort Key: empsalary.depname, empsalary.enroll_date -> Seq Scan on pg_temp.empsalary (cost=0.00..20.70 rows=1070 width=44) Output: depname, enroll_date, empno (7 filas) which is pretty ugly to read and requires careful tracking to verify that they're all defined on the same window. Previously, we just get QUERY PLAN ── WindowAgg (cost=74.64..101.29 rows=1070 width=68) Output: empno, depname, row_number() OVER (?), rank() OVER (?), count(*) OVER (?), enroll_date -> Sort (cost=74.54..77.21 rows=1070 width=44) Output: depname, enroll_date, empno Sort Key: empsalary.depname, empsalary.enroll_date -> Seq Scan on pg_temp.empsalary (cost=0.00..20.70 rows=1070 width=44) Output: depname, enroll_date, empno (7 filas) so it didn't matter. I'd imagine something like QUERY PLAN ─ Window testwin AS (PARTITION BY depname ORDER BY enroll_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) WindowAgg (cost=74.64..101.29 rows=1070 width=68) Output: empno, depname, row_number() OVER testwin, rank() OVER testwin, count(*) OVER testwin, enroll_date -> Sort (cost=74.54..77.21 rows=1070 width=44) Output: depname, enroll_date, empno Sort Key: empsalary.depname, empsalary.enroll_date -> Seq Scan on pg_temp.empsalary (cost=0.00..20.70 rows=1070 width=44) Output: depname, enroll_date, empno (7 filas) I imagine this working even if the user doesn't explicitly use a WINDOW clause, if only because it makes the explain easier to read, and it's much clearer if the user specifies two different window definitions. So with David Johnston's example, something like Window window1 AS (PARTITION BY depname ORDER BY enroll_date ROWS UNBOUNDED PRECEDING) Window window2 AS (PARTITION BY depname ORDER BY enroll_date RANGE BETWEEN CURRENT ROW AND CURRENT ROW) WindowAgg Output: empno, depname, (row_number() OVER window1), rank() OVER window1, count(*) OVER window2, enroll_date -> WindowAgg Output: depname, enroll_date, empno, row_number() OVER window1, rank() OVER window1 -> Sort Output: depname, enroll_date, empno Sort Key: empsalary.depname, empsalary.enroll_date -> Seq Scan on pg_temp.empsalary Output: depname, enroll_date, empno (Hmm, not sure if the Window clauses
Re: strange valgrind reports about wrapper_handler on 64-bit arm
On 3/9/25 03:16, Nathan Bossart wrote:
> On Sat, Mar 08, 2025 at 11:48:22PM +0100, Tomas Vondra wrote:
>> Shortly after restarting this I got three more reports - all of them are
>> related to strcoll_l. This is on c472a18296e4, i.e. with the asserts
>> added in this thread etc. But none of those seem to fail.
>
>> ==189168==    at 0xA683CC: wrapper_handler (pqsignal.c:90)
>> ==189168==    at 0xA683F0: wrapper_handler (pqsignal.c:91)
>> ==189168==    at 0xA684D4: wrapper_handler (pqsignal.c:110)
>
> This appears to refer to the following lines:
>
>     Assert(postgres_signal_arg > 0);
>     Assert(postgres_signal_arg < PG_NSIG);
>     (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);
>
> The common ingredient seems to be postgres_signal_arg. I haven't found
> anything else that seems helpful.

Yeah, it's a bit weird. I got another report on the 64-bit rpi5 machine
(except for the command it's exactly the same), and then also this report
on the 32-bit machine:

---
==3482== Use of uninitialised value of size 4
==3482==    at 0x991498: wrapper_handler (pqsignal.c:113)
==3482==    by 0x: ???
==3482== Uninitialised value was created by a heap allocation
==3482==    at 0x940EB8: palloc (mcxt.c:1341)
==3482==    by 0x980037: initStringInfoInternal (stringinfo.c:45)
==3482==    by 0x980103: initStringInfo (stringinfo.c:99)
==3482==    by 0x7194B7: ReadArrayStr (arrayfuncs.c:613)
==3482==    by 0x718803: array_in (arrayfuncs.c:289)
==3482==    by 0x90024F: InputFunctionCallSafe (fmgr.c:1607)
==3482==    by 0x2E346B: CopyFromTextLikeOneRow (copyfromparse.c:1029)
==3482==    by 0x2E304B: CopyFromTextOneRow (copyfromparse.c:919)
==3482==    by 0x2E2EEF: NextCopyFrom (copyfromparse.c:890)
==3482==    by 0x2DF5C7: CopyFrom (copyfrom.c:1149)
==3482==    by 0x2DAE33: DoCopy (copy.c:306)
==3482==    by 0x6CC38F: standard_ProcessUtility (utility.c:738)
==3482==    by 0x4B654C3: pgss_ProcessUtility (pg_stat_statements.c:1179)
==3482==    by 0x6CBBF3: ProcessUtility (utility.c:519)
==3482==    by 0x6CA24B: PortalRunUtility (pquery.c:1184)
==3482==    by 0x6CA537: PortalRunMulti (pquery.c:1348)
==3482==    by 0x6C9837: PortalRun (pquery.c:819)
==3482==    by 0x6C1BEB: exec_simple_query (postgres.c:1272)
==3482==    by 0x6C74FF: PostgresMain (postgres.c:4693)
==3482==    by 0x6BD297: BackendMain (backend_startup.c:107)
==3482==
{
   Memcheck:Value4
   fun:wrapper_handler
   obj:*
}
**3482** Valgrind detected 1 error(s) during execution of "COPY array_op_test FROM '/home/debian/postgres/src/test/regress/data/array.data';"
---

This all seems very strange ... I'm starting to wonder if maybe this is
a valgrind issue. Both machines have valgrind 3.19, I'll try with a
custom build of 3.24 (the latest release).

regards

--
Tomas Vondra
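For anyone wanting to try the same experiment, building a newer valgrind locally is a plain autoconf build; a sketch under the assumption that 3.24.0 is the current release tarball (check https://valgrind.org/downloads/ for the actual file name before copying this):

    # Hypothetical local build of valgrind 3.24.0, installed out of the system paths.
    wget https://sourceware.org/pub/valgrind/valgrind-3.24.0.tar.bz2
    tar xjf valgrind-3.24.0.tar.bz2
    cd valgrind-3.24.0
    ./configure --prefix=$HOME/valgrind-3.24
    make -j4 && make install
    export PATH=$HOME/valgrind-3.24/bin:$PATH   # pick this build ahead of the distro's 3.19
    valgrind --version                          # should report valgrind-3.24.0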
Re: BackgroundPsql swallowing errors on windows
On Sun, Mar 09, 2025 at 12:47:34PM -0400, Andres Freund wrote: > On 2025-02-16 17:52:36 -0800, Noah Misch wrote: > > On Sun, Feb 16, 2025 at 08:42:50PM -0500, Andres Freund wrote: > > > On February 16, 2025 7:50:18 PM EST, Tom Lane wrote: > > > >Noah Misch writes: > > > >> On Sun, Feb 16, 2025 at 06:18:44PM -0500, Tom Lane wrote: > > > >>> I think that > > > >>> IPC::Run may be screwing up here, because I have seen non-Windows > > > >>> CI failures that look like it didn't read all the stderr output. > > > >>> For example, this pgbench test failure on macOS from [1]: > > > > > > > >> https://github.com/cpan-authors/IPC-Run/commit/2128df3bbcac7e733ac46302c4b1371ffb88fe14 > > > >> fixed that one. > > > > > > > >Ah. Do we know whether that fix has made it into our CI images? > > > >(Or anywhere else, for that matter?) > > > > > > The CI images are regenerated three times a week, but for most OSs, they > > > will only install perl modules via the applicable packaging method, so > > > it'll depend on when they pick up that version. > > > > > > On Windows cpan is used, so it should pick that new version fairly > > > quickly if a release has been made. > > > > > > On macos we can't currently use images, so we just cache all the installed > > > macports packages. The cache is keyed by OS version and list of packages > > > to be installed, with no other forced invalidation right now. So it's hard > > > to predict when a new version of a package will be picked up and it will > > > differ between git repositories. I've been wondering whether the cached > > > macports install should just be regularly generated instead, along the > > > other ci images. > > > > The change is not in a release yet. We could have macos install IPC::Run > > from > > github, or I could get a release cut so it can make its way to macports. > > It'd be great if we could get a release. Yep. I put the tree in the necessary state, and I contacted the person on 2025-02-17 and again on 2025-03-04. I'm scheduled to follow up again on 2025-03-11. > I guess I can figure out the magic > incantations to install it from git for CI There's no need to build anything, so it suffices to do: git clone https://github.com/cpan-authors/IPC-Run.git export PERL5LIB=$PWD/IPC-Run/lib # (or append, if you already have non-empty PERL5LIB)
Re: what's going on with lapwing?
On Thu, Mar 6, 2025 at 12:22 PM Tom Lane wrote: > I don't think that's the way to think about old buildfarm members. > Sure, nobody is very likely to be putting PG 18 on a Debian 7 box, > but the odds are much higher that they might have PG 13 on it and > wish to update to 13.latest. So what you need to compare OS EOL > dates to is not current development but our oldest supported branch. But the work it's creating is mostly because it's still testing master. If it were only testing a gradually-decreasing set of older branches, that wouldn't seem weird to me. -- Robert Haas EDB: http://www.enterprisedb.com
Incorrect assert in libpqwalreceiver
Hello hackers, The libpqrcv_connect function asserts 'Assert(i < sizeof(keys))', where keys is declared as const char *keys[6];. However, sizeof(keys) is not the correct way to check the array length (on my system, for example, it's 48 = 6 * 8 at this callsite, not 6.) I attached a patch to fix the assert, but I suppose we could also just remove the assert altogether, since it hasn't been doing anything for at least 8 years. Regards, Jacob v1-length-assert.patch Description: Binary data
Re: Commit fest 2025-03
On Thu, 6 Mar 2025 at 21:24, Tom Lane wrote: > > vignesh C writes: > > On Wed, 5 Mar 2025 at 16:50, Jim Jones wrote: > >> Is there something wrong with the commitfest app? This patch applies > >> cleanly and passes all tests > > > I verified that it applies neatly and passes the tests for me too, I > > have reported this issue at [1]. I don't know the reason for this. > > What did you apply with? "git am" is known to be far pickier than > "patch". (I thought the cfbot was using "patch", but I might be > wrong. There are many versions of "patch", too.) In my local environment I used "git am" and it worked fine. cfbot has used "git apply --3way" as per [1], they are planning to change it to "git am" to fix this issue. [1] - https://www.postgresql.org/message-id/CAGECzQSE7igwsTUk17i%2BRNEUL1zR-Zkp-1Zs7KkhmBxAu_Fksw%40mail.gmail.com Regards, Vignesh
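For anyone reproducing this locally, the two apply methods behave quite differently; a quick sketch (the patch file name is hypothetical):

    # git apply takes a plain diff; --3way falls back to a three-way merge using
    # the blob IDs recorded in the patch (this is what the cfbot had been doing):
    git apply --3way v5-0001-truncate-logs-by-max_log_size.patch

    # git am expects mailbox-formatted patches (git format-patch output) and creates
    # a commit per patch (what worked locally, and what the cfbot plans to switch to):
    git am v5-0001-truncate-logs-by-max_log_size.patch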
Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?
Hello, Matthias!

> though I suspect the SP-GIST tests to have
> bugs, as an intermediate version of my 0003 patch didn't trigger the
> tests to fail

It all fails on master - could you please detail what is "intermediate"
in that case? Also, I think it is a good idea to add the same type of
test to btree.

> * XXX: In the future we should probably reorder these operations so
> * we can apply the checks in block order, rather than index order.

I think it is already done in your patch, no? Should we then use that
mechanism for btree as well? It seems to be straightforward and
non-invasive. In such a case, "Unchecked" goes away, and it is each AM's
responsibility to call the check while holding the pin.

Best regards,
Mikhail.
Re: Commitfest app release on Feb 17 with many improvements
On Sun, 9 Mar 2025 at 03:21, vignesh C wrote: > Couple of suggestions: a) No need to show CI status as "Needs rebase," > "Not processed," etc., for committed patches. Do you mean specifically for committed ones? Or just for any patch with a "closed" status. > b) Can we add a filter > for "Needs rebase!"? This would help the CommitFest manager easily > list patches that need updating. That should be pretty easy to implement. But is that really what we want? In the next release, sorting by "failing since" is implemented. It sounds like that could be enough instead. i.e. do we really only want to call out patches that need a rebase? Or also ones that have been failing in CI for a long time? I'm even wondering if this whole flow still makes sense. Do we really want to send an email to the mailing list about this? And if so, why is someone doing that manually? If people subscribe to updates for patches that they authored, then they get these "needs rebase" automatically. Should we maybe simply default that option to true? And for instance send a notification automatically to all people with a "needs rebase" CI status whenever we start a new commitfest.
Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options
On Sun, Mar 9, 2025 at 6:40 AM Tatsuo Ishii wrote: > > Attached version removes the non-nulls array. That seems to speed > > everything up. Running the above query with 1 million rows averages > 450ms, > > similar when using lead/lag. > > Great. However, CFbot complains about the patch: > > https://cirrus-ci.com/task/6364194477441024 > > Attached fixes the headerscheck locally. 0010-ignore-nulls.patch Description: Binary data
Re: Log connection establishment timings
On Thu, Mar 6, 2025 at 3:16 PM Melanie Plageman wrote: > > Jacob at some point had asked about this, and I didn't > have a satisfactory answer. I'm not really sure what would be more > useful to end users. For the record, I'm not really sure either. I don't have a strong opinion either way. --Jacob
Re: maintenance_work_mem = 64kB doesn't work for vacuum
On Mon, 10 Mar 2025 at 07:46, Masahiko Sawada wrote:
> A simple fix is to bump the minimum maintenance_work_mem to 256kB. We
> would break the compatibility for backbranch (i.e. v17) but I guess
> it's unlikely that existing v17 users are using less than 1MB
> maintenance_work_mem (the release note doesn't mention the fact that
> we lowered the minimum value).

Could you do something similar to what's in hash_agg_check_limits(),
where we check that we've got at least 1 item before bailing out because
we've used up all the prescribed memory? That seems like a safer coding
practice, as if in the future the minimum usage for a DSM segment goes
above 256KB, the bug comes back again.

David
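For context, the pattern David is pointing at is "only enforce the memory limit once at least one item has been stored, so every pass is guaranteed to make progress." A generic C sketch of that idea, modeled loosely on how he describes hash_agg_check_limits(); the function name and signature are illustrative, not PostgreSQL's actual vacuum code:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Decide whether to flush the accumulated dead items. Never flush before at
     * least one item has been stored, even if fixed overhead (such as a minimum
     * DSA/DSM segment) already exceeds the budget; otherwise a tiny
     * maintenance_work_mem would make the caller loop without making progress.
     */
    bool
    dead_items_should_flush(size_t nitems, size_t mem_used, size_t mem_limit)
    {
        if (nitems == 0)
            return false;

        return mem_used >= mem_limit;
    }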
Re: Printing window function OVER clauses in EXPLAIN
I wrote: > I'll go try to code this up. OK, here's v2 done like that. I do like this output better. I backed off the idea of putting the WindowClause as such into the plan, partly because I didn't feel like debugging the setrefs.c problem that David discovered upthread. This way does require a bit more code, but I think it's less likely to have bugs. A couple of notes: * I made the Window: plan annotation come out unconditionally. We could alternatively print it only in VERBOSE mode, which would greatly reduce the number of regression test diffs. However, it seems fairly comparable to the sort keys of a Sort node or the group keys of a Group node, which we print unconditionally. Also, there are cases where a higher-level node unconditionally prints a reference to a window function output, which would mean that that output's reference to "windowN" would have no referent in the displayed data. * In passing, I editorialized on the order in which the Run Condition annotation comes out: case T_WindowAgg: +show_window_def(castNode(WindowAggState, planstate), ancestors, es); show_upper_qual(plan->qual, "Filter", planstate, ancestors, es); +show_upper_qual(((WindowAgg *) plan)->runConditionOrig, +"Run Condition", planstate, ancestors, es); if (plan->qual) show_instrumentation_count("Rows Removed by Filter", 1, planstate, es); -show_upper_qual(((WindowAgg *) plan)->runConditionOrig, -"Run Condition", planstate, ancestors, es); show_windowagg_info(castNode(WindowAggState, planstate), es); break; It seemed quite weird to me to have the Run Condition plan property come out in the middle of properties that only appear in EXPLAIN ANALYZE mode. Maybe there's a reason for this other than "people added new properties at the end", but I don't see it. regards, tom lane diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c68119030ab..29e8d7b806f 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -3968,10 +3968,11 @@ select c2, sum(c2), count(c2) over (partition by c2%2) from ft2 where c2 < 10 gr QUERY PLAN Sort - Output: c2, (sum(c2)), (count(c2) OVER (?)), ((c2 % 2)) + Output: c2, (sum(c2)), (count(c2) OVER window1), ((c2 % 2)) Sort Key: ft2.c2 -> WindowAgg - Output: c2, (sum(c2)), count(c2) OVER (?), ((c2 % 2)) + Output: c2, (sum(c2)), count(c2) OVER window1, ((c2 % 2)) + Window: window1 AS (PARTITION BY ((ft2.c2 % 2))) -> Sort Output: c2, ((c2 % 2)), (sum(c2)) Sort Key: ((ft2.c2 % 2)) @@ -3979,7 +3980,7 @@ select c2, sum(c2), count(c2) over (partition by c2%2) from ft2 where c2 < 10 gr Output: c2, ((c2 % 2)), (sum(c2)) Relations: Aggregate on (public.ft2) Remote SQL: SELECT c2, (c2 % 2), sum(c2) FROM "S 1"."T 1" WHERE ((c2 < 10)) GROUP BY 1 -(12 rows) +(13 rows) select c2, sum(c2), count(c2) over (partition by c2%2) from ft2 where c2 < 10 group by c2 order by 1; c2 | sum | count @@ -4001,10 +4002,11 @@ select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 wher QUERY PLAN --- Sort - Output: c2, (array_agg(c2) OVER (?)), ((c2 % 2)) + Output: c2, (array_agg(c2) OVER window1), ((c2 % 2)) Sort Key: ft1.c2 -> WindowAgg - Output: c2, array_agg(c2) OVER (?), ((c2 % 2)) + Output: c2, array_agg(c2) OVER window1, ((c2 % 2)) + Window: window1 AS (PARTITION BY ((ft1.c2 % 2)) ORDER BY ft1.c2) -> Sort Output: c2, ((c2 % 2)) Sort Key: ((ft1.c2 % 2)), ft1.c2 DESC @@ -4012,7 +4014,7 @@ select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 wher Output: c2, 
((c2 % 2)) Relations: Aggregate on (public.ft1) Remote SQL: SELECT c2, (c2 % 2) FROM "S 1"."T 1" WHERE ((c2 < 10)) GROUP BY 1 -(12 rows) +(13 rows) select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 where c2 < 10 group by c2 order by 1; c2 | array_agg @@ -4031,13 +4033,14 @@ select c2, array_agg(c2) over (partition by c2%2 order by c2 desc) from ft1 wher explain (verbose, costs off) select c2, array_agg(c2) over (partition by c2%2 order by c2 range betwe
Re: Parallel heap vacuum
Hi Sawada-San. Here are some review comments for patch v11-0002 == Commit message. 1. Heap table AM disables the parallel heap vacuuming for now, but an upcoming patch uses it. This function implementation was moved into patch 0001, so probably this part of the commit message comment also belongs now in patch 0001. == src/backend/commands/vacuumparallel.c 2. + * For processing indexes in parallel, individual indexes are processed by one + * vacuum process. We launch parallel worker processes at the start of parallel index + * bulk-deletion and index cleanup and once all indexes are processed, the parallel + * worker processes exit. + * "are processed by one vacuum process." -- Did you mean "are processed by separate vacuum processes." ? ~~~ 3. + * + * Each time we process table or indexes in parallel, the parallel context is + * re-initialized so that the same DSM can be used for multiple passes of table vacuum + * or index bulk-deletion and index cleanup. Maybe I am mistaken, but it seems like the logic is almost always re-initializing this. I wonder if it might be simpler to just remove the 'need_reinitialize_dsm' field and initialize unconditionally. ~~~ 4. + * nrequested_workers is >= 0 number and the requested parallel degree. 0 + * means that the parallel degrees for table and indexes vacuum are decided + * differently. See the comments of parallel_vacuum_compute_workers() for + * details. + * * On success, return parallel vacuum state. Otherwise return NULL. */ SUGGESTION nrequested_workers is the requested parallel degree (>=0). 0 means that... ~~~ 5. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, - bool *will_parallel_vacuum) +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, + bool *idx_will_parallel_vacuum) IIUC the returns for this function seem inconsistent. AFAIK, it previously would return the number of workers for parallel index vacuuming. But now (after this patch) the return value is returned Max(nworkers_table, nworkers_index). Meanwhile, the number of workers for parallel table vacuuming is returned as a by-reference parameter 'nworkers_table_p'. In other words, it is returning the number of workers in 2 different ways. Why not make this a void function, but introduce another parameter 'nworkers_index_p', similar to 'nworkers_table_p'? == Kind Regards, Peter Smith. Fujitsu Australia
Re: Parallel heap vacuum
Hi Sawada-San, Here are some review comments for patch v11-0001. == src/backend/access/heap/vacuumlazy.c 1. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Return 0 to disable parallel vacuum so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested) You don't need to say "so far". == src/backend/access/table/tableamapi.c 2. Assert(routine->relation_vacuum != NULL); + Assert(routine->parallel_vacuum_compute_workers != NULL); Assert(routine->scan_analyze_next_block != NULL); Is it better to keep these Asserts in the same order that the TableAmRoutine fields are assigned (in heapam_handler.c)? ~~~ 3. + /* + * Callbacks for parallel vacuum are also optional (except for + * parallel_vacuum_compute_workers). But one callback implies presence of + * the others. + */ + Assert(routine->parallel_vacuum_estimate == NULL) == + (routine->parallel_vacuum_initialize == NULL)) == + (routine->parallel_vacuum_initialize_worker == NULL)) == + (routine->parallel_vacuum_collect_dead_items == NULL))); /also optional/optional/ == src/include/access/heapam.h +extern int heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested); 4. wrong tab/space after 'int'. == src/include/access/tableam.h 5. + /* + * Compute the number of parallel workers for parallel table vacuum. The + * parallel degree for parallel vacuum is further limited by + * max_parallel_maintenance_workers. The function must return 0 to disable + * parallel table vacuum. + * + * 'nworkers_requested' is a >=0 number and the requested number of + * workers. This comes from the PARALLEL option. 0 means to choose the + * parallel degree based on the table AM specific factors such as table + * size. + */ + int (*parallel_vacuum_compute_workers) (Relation rel, + int nworkers_requested); The comment here says "This comes from the PARALLEL option." and "0 means to choose the parallel degree...". But, the PG docs [1] says "To disable this feature, one can use PARALLEL option and specify parallel workers as zero.". These two descriptions "disable this feature" (PG docs) and letting the system "choose the parallel degree" (code comment) don't sound the same. Should this 0001 patch update the PG documentation for the effect of setting PARALLEL value zero? ~~~ 6. + /* + * Initialize DSM space for parallel table vacuum. + * + * Not called if parallel table vacuum is disabled. + * + * Optional callback, but either all other parallel vacuum callbacks need + * to exist, or neither. + */ "or neither"? Also, saying "all other" seems incorrect because parallel_vacuum_compute_workers callback must exist event if parallel_vacuum_initialize does not exist. IMO you meant to say "all optional", and "or none". SUGGESTION: Optional callback. Either all optional parallel vacuum callbacks need to exist, or none. (this same issue is repeated in multiple places). == [1] https://www.postgresql.org/docs/devel/sql-vacuum.html Kind Regards, Peter Smith. Fujitsu Australia
Re: Parallel CREATE INDEX for GIN indexes
On Sun, Mar 9, 2025 at 6:23 PM Tom Lane wrote: > Ah. Most likely somebody dismissed it years ago. Given that > precedent, I'm content to dismiss this one too. It is dead code, unless somebody decides to #define DISABLE_LEADER_PARTICIPATION to debug a problem. -- Peter Geoghegan
Re: Parallel CREATE INDEX for GIN indexes
Tomas Vondra writes: > On 3/9/25 17:38, Tom Lane wrote: >> Coverity is a little unhappy about this business in >> _gin_begin_parallel: > I don't mind doing it differently, but this code is just a copy from > _bt_begin_parallel. So how come coverity does not complain about that? > Or is that whitelisted? Ah. Most likely somebody dismissed it years ago. Given that precedent, I'm content to dismiss this one too. regards, tom lane
Re: what's going on with lapwing?
On Thu, Mar 6, 2025 at 4:27 PM Tom Lane wrote: > I think you misunderstood my drift. I'm okay with setting a project > policy that we won't support OSes that are more than N years EOL, > as long as it's phrased to account for older PG branches properly. Yep, I misunderstood. That sounds awesome. Let's see if others agree. > My point was that we can implement such a policy in a laissez-faire > way: if an older BF animal isn't causing us trouble then why mess > with it? Once we *do* recognize that it's causing us trouble, > we can apply the still-hypothetical policy and ask the owner to > turn it off for branches where it's out of support. Fair enough. This does have the disadvantage that people will still commit things that turn the buildfarm red and have to go into panic mode -- perhaps not even realizing which machines are running an older OS -- and then reactively do this. However, it still sounds like progress over where we are today, because it would (hopefully) remove the need for an argument about each individual case. Honestly, I'm kind of bummed by how fast operating systems and OS versions die these days. It seems crazy to me that I probably couldn't reasonably build a PG from today on the laptop I started at EDB with, or a PG from then on my current laptop. But it doesn't seem like there's much point in fighting the tide. We shouldn't be the only people trying to keep stuff working long after the sell-by date. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Parallel CREATE INDEX for GIN indexes
On 3/9/25 17:38, Tom Lane wrote: > Tomas Vondra writes: >> I pushed the two smaller parts today. > > Coverity is a little unhappy about this business in > _gin_begin_parallel: > > boolleaderparticipates = true; > ... > #ifdef DISABLE_LEADER_PARTICIPATION > leaderparticipates = false; > #endif > ... > scantuplesortstates = leaderparticipates ? request + 1 : request; > > It says > CID 1644203: Possible Control flow issues (DEADCODE) Execution cannot reach the expression "request" inside this statement: "scantuplesortstates = (lead...". > 924 scantuplesortstates = leaderparticipates ? request + 1 : > request; > > If this were just temporary code I'd let it pass, but I see nothing > replacing this logic in the follow-up patches, so I think we ought > to do something to shut it up. > > It's not complaining about the later bits like > > if (leaderparticipates) > ginleader->nparticipanttuplesorts++; > > (perhaps because there's no dead code there?) So one idea is > > scantuplesortstates = request; > if (leaderparticipates) > scantuplesortstates++; > > which would look more like the other code anyway. > I don't mind doing it differently, but this code is just a copy from _bt_begin_parallel. So how come coverity does not complain about that? Or is that whitelisted? thanks -- Tomas Vondra
Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs
On Sun, Mar 9, 2025 at 4:53 AM Alexander Korotkov wrote: > On Sat, Mar 8, 2025 at 12:49 PM Alexander Korotkov > wrote: > > On Fri, Mar 7, 2025 at 8:20 PM Álvaro Herrera > > wrote: > > > > > > On 2024-Mar-25, Alexander Korotkov wrote: > > > > > > > reindexdb: Add the index-level REINDEX with multiple jobs > > > > > > > > Straight-forward index-level REINDEX is not supported with multiple > > > > jobs as > > > > we cannot control the concurrent processing of multiple indexes > > > > depending on > > > > the same relation. Instead, we dedicate the whole table to certain > > > > reindex > > > > job. Thus, if indexes in the lists belong to different tables, that > > > > gives us > > > > a fair level of parallelism. > > > > > > I tested this, because of a refactoring suggestion [1] and I find that > > > it's rather completely broken. > > > > The code was written with assumption that running > > run_reindex_command() with async == true can schedule a number of > > queries for a connection. But actually that's not true and everything > > is broken. > > The draft patch for revert is attached. Could you, please, check. After second thought it's not so hard to fix. The attached patch does it by putting REINDEX commands related to the same table into a single SQL statement. Possibly, that could be better than revert given we need to handle compatibility. What do you think? -- Regards, Alexander Korotkov Supabase v1-0001-reindexdb-Fix-the-index-level-REINDEX-with-multip.patch Description: Binary data
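To make the proposed fix concrete: one way to read "putting REINDEX commands related to the same table into a single SQL statement" is that each reindexdb job receives one command string carrying all of that table's index-level REINDEXes, so they execute back to back on that job's connection and never run concurrently with each other. An illustrative sketch only (index names hypothetical, and this reading of the patch is an assumption):

    -- Job 1: everything for table t1 travels as one command string.
    REINDEX INDEX public.t1_pkey; REINDEX INDEX public.t1_col_idx;

    -- Job 2: a different table's indexes can proceed in parallel on another connection.
    REINDEX INDEX public.t2_pkey;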
Re: maintenance_work_mem = 64kB doesn't work for vacuum
On Sun, Mar 9, 2025 at 9:24 PM John Naylor wrote: > > On Mon, Mar 10, 2025 at 1:46 AM Masahiko Sawada wrote: > > > > Commit bbf668d66fbf6 (back-patched to v17) lowered the minimum > > maintenance_work_mem to 64kB, but it doesn't work for parallel vacuum > > That was done in the first place to make a regression test for a bug > fix easier, but that test never got committed. In any case I found it > worked back in July: Yes, I would like to keep the lower minimum. I really do have every intention of committing that test. Apologies for taking so long. Raising the limit to 256 kB might make the test take too long. And I think it's nice to have that coverage (not just of the vacuum bug but of the multi-index vacuum pass vacuum in a natural setting [as opposed to the tidstore test module]). I don't recall if we have that elsewhere. - Melanie
maintenance_work_mem = 64kB doesn't work for vacuum
Hi,

Commit bbf668d66fbf6 (back-patched to v17) lowered the minimum
maintenance_work_mem to 64kB, but it doesn't work for parallel vacuum
cases since the minimum dsa segment size (DSA_MIN_SEGMENT_SIZE) is 256kB.
As soon as the radix tree allocates its control object and the root node,
the memory usage exceeds the maintenance_work_mem limit and vacuum ends up
with the following assertion failure:

TRAP: failed Assert("vacrel->lpdead_item_pages > 0"), File: "vacuumlazy.c", Line: 2447, PID: 3208575

On a build without --enable-cassert, vacuum never ends.

I've tried to lower DSA_MIN_SEGMENT_SIZE to 64kB but no luck.
Investigating it further, dsa creates a 64kB superblock for each size
class, and when creating a new shared radix tree we need to create two
superblocks: one for the radix tree control data (64 bytes) and another
one for the root node (40 bytes). Also, each superblock requires span
data, which uses 1 page (4,096 bytes). Therefore, we need at least 136kB
(two 64kB superblocks plus two 4kB span pages) for a shared radix tree
even when it's empty.

A simple fix is to bump the minimum maintenance_work_mem to 256kB. We
would break compatibility for the back branch (i.e. v17), but I guess
it's unlikely that existing v17 users are using less than 1MB
maintenance_work_mem (the release note doesn't mention the fact that
we lowered the minimum value).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Add regression test checking combinations of (object,backend_type,context) in pg_stat_io
On Wed, Mar 05, 2025 at 09:19:16PM +0900, Michael Paquier wrote: > On Wed, Mar 05, 2025 at 07:34:16AM +, Bertrand Drouvot wrote: >> What about adding some extra paranoia like? >> >> SELECT backend_type, object, context FROM pg_stat_io ORDER BY >> backend_type, object, context COLLATE "C"; > > Why not, to force the ordering. For the sake of the archives. As 8b532771a099 has proved, this has required two more COLLATE clauses in the query to force a stable output, but we are in the clear now. -- Michael signature.asc Description: PGP signature
RE: Selectively invalidate caches in pgoutput module
I found that cfbot got angry due to variable shadowing. PSA the fixed version. Best regards, Hayato Kuroda FUJITSU LIMITED v7-0001-Introduce-a-new-invalidation-message-to-invalidat.patch Description: v7-0001-Introduce-a-new-invalidation-message-to-invalidat.patch v7-0002-Invalidate-Relcaches-while-ALTER-PUBLICATION-RENA.patch Description: v7-0002-Invalidate-Relcaches-while-ALTER-PUBLICATION-RENA.patch
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila wrote: > On Tue, Mar 4, 2025 at 6:54 PM vignesh C wrote: > > > > On further thinking, I felt the use of publications_updated variable > > is not required we can use publications_valid itself which will be set > > if the publication system table is invalidated. Here is a patch for > > the same. > > > > The patch relies on the fact that whenever a publication's data is > invalidated, it will also invalidate all the RelSyncEntires as per > publication_invalidation_cb. But note that we are discussing removing > that inefficiency in the thread [1]. So, we should try to rebuild the > entry when we have skipped the required publication previously. > > Apart from this, please consider updating the docs, as mentioned in my > response to Sawada-San's email. > I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed, the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fix by skipping the changes that fall between the subscriber's restart_lsn and the LSN at which the publication was created, and this behavior looks fine. BTW, I am planning to commit this only on HEAD as this is a behavior > change. Please let me know if you guys think otherwise. > Somehow this looks like a bug fix that should be backported, no? Am I missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
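A hedged sketch of the sequence being described (object names invented):

-- Publisher: the publication is created at some LSN "X".
CREATE PUBLICATION pub_new FOR TABLE t;

-- Subscriber: switching an existing subscription restarts the apply worker,
-- which requests streaming from its restart_lsn, a WAL position before "X".
ALTER SUBSCRIPTION sub SET PUBLICATION pub_new;

-- Without the fix the walsender errors out because pub_new does not exist at
-- the decoding position; with the fix, changes between restart_lsn and "X"
-- for that publication are skipped.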
Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
On Thu, Mar 6, 2025 at 9:18 AM Ajin Cherian wrote: > + Subscription names, publication names, and replication slot names are > + automatically generated. Cannot be used together with > + --database, --publication, > + --replication-slot or > --subscription. > > Don't start the sentence with "Cannot". Change the sentence to "This > option cannot be used together with ..." > similar sentences used in 3 other places below this as well. Change all of > them. I didn't find any instance of "Cannot be" in the *.sgml files, but I find some instances of "Can be". So it seems we allow such constructs in the documentation. Any reasons for suggesting this change? -- Best Wishes, Ashutosh Bapat
Re: RFC: Logging plan of the running query
On 2025-03-09 00:42, Akshat Jaimini wrote: Hi, I think there is still some problem with the patch. I am not able to apply it to the master branch. Can you please take another look at it? Thanks for pointing it out! Modified it. BTW the patch adds about 400 lines to explain.c and it may be better to split the file as well as 9173e8b6046, but I leave it as it is for now. -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.From 699264fbc1f4e99114966eaeba55a0ec5184e1c2 Mon Sep 17 00:00:00 2001 From: Atsushi Torikoshi Date: Mon, 10 Mar 2025 14:01:54 +0900 Subject: [PATCH v41] Add function to log the plan of the currently running query Currently, we have to wait for the query execution to finish to know its plan either using EXPLAIN ANALYZE or auto_explain. This is not so convenient, for example when investigating long-running queries on production environments. To improve this situation, this patch adds pg_log_query_plan() function that requests to log the plan of the currently executing query. By default, only superusers are allowed to request to log the plans because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. On receipt of the request, codes for logging plan is wrapped to every ExecProcNode and when the executor runs one of ExecProcNode, the plan is actually logged. These wrappers are unwrapped when once the plan is logged. In this way, we can avoid adding the overhead which we'll face when adding CHECK_FOR_INTERRUPTS() like mechanisms in somewhere in executor codes safely. --- contrib/auto_explain/auto_explain.c | 23 +- doc/src/sgml/func.sgml | 50 +++ src/backend/access/transam/xact.c| 7 + src/backend/catalog/system_functions.sql | 2 + src/backend/commands/explain.c | 372 ++- src/backend/executor/execMain.c | 12 + src/backend/storage/ipc/procsignal.c | 4 + src/backend/tcop/postgres.c | 4 + src/backend/utils/init/globals.c | 2 + src/include/catalog/pg_proc.dat | 6 + src/include/commands/explain.h | 12 + src/include/miscadmin.h | 1 + src/include/nodes/execnodes.h| 3 + src/include/storage/procsignal.h | 2 + src/include/tcop/pquery.h| 2 +- src/test/regress/expected/misc_functions.out | 54 ++- src/test/regress/sql/misc_functions.sql | 41 +- 17 files changed, 555 insertions(+), 42 deletions(-) diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index 7007a226c0..85cfe1b4f4 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -408,26 +408,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc) es->format = auto_explain_log_format; es->settings = auto_explain_log_settings; - ExplainBeginOutput(es); - ExplainQueryText(es, queryDesc); - ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length); - ExplainPrintPlan(es, queryDesc); - if (es->analyze && auto_explain_log_triggers) -ExplainPrintTriggers(es, queryDesc); - if (es->costs) -ExplainPrintJITSummary(es, queryDesc); - ExplainEndOutput(es); - - /* Remove last line break */ - if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n') -es->str->data[--es->str->len] = '\0'; - - /* Fix JSON to output an object */ - if (auto_explain_log_format == EXPLAIN_FORMAT_JSON) - { -es->str->data[0] = '{'; -es->str->data[es->str->len - 1] = '}'; - } + ExplainAssembleLogOutput(es, queryDesc, auto_explain_log_format, + auto_explain_log_triggers, + auto_explain_log_parameter_max_length); /* * Note: we rely on the existing logging of context 
or diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 51dd8ad657..d1a16a89fb 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -28553,6 +28553,25 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres} + + + + pg_log_query_plan + +pg_log_query_plan ( pid integer ) +boolean + + +Requests to log the plan of the query currently running on the +backend with specified process ID. +It will be logged at LOG message level and +will appear in the server log based on the log +configuration set (See +for more information), but will not be sent to the client +regardless of . + + + @@ -28671,6 +28690,37 @@ LOG: Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 because it may generate a large number of log messages. + +pg_log_query_plan can be used +to log the plan of a backend process. For example: + +p
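Going by the documentation added in the patch, usage would look roughly like this (the pid below is illustrative):

-- Find the backend running the query of interest.
SELECT pid, state, query FROM pg_stat_activity WHERE state = 'active';

-- Ask that backend to log the plan of its currently running query;
-- returns a boolean, and the plan appears in the server log at LOG level
-- rather than being sent to the client.
SELECT pg_log_query_plan(12345);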
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Tue, Mar 4, 2025 at 6:54 PM vignesh C wrote: > > On further thinking, I felt the use of publications_updated variable > is not required we can use publications_valid itself which will be set > if the publication system table is invalidated. Here is a patch for > the same. > The patch relies on the fact that whenever a publication's data is invalidated, it will also invalidate all the RelSyncEntires as per publication_invalidation_cb. But note that we are discussing removing that inefficiency in the thread [1]. So, we should try to rebuild the entry when we have skipped the required publication previously. Apart from this, please consider updating the docs, as mentioned in my response to Sawada-San's email. BTW, I am planning to commit this only on HEAD as this is a behavior change. Please let me know if you guys think otherwise. [1] - https://www.postgresql.org/message-id/OSCPR01MB14966C09AA201EFFA706576A7F5C92%40OSCPR01MB14966.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Re: Parallel heap vacuum
On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada wrote: > > On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada wrote: > > > > > > Another performance regression I can see in the results is that heap > > vacuum phase (phase III) got slower with the patch. It's weired to me > > since I don't touch the code of heap vacuum phase. I'm still > > investigating the cause. > > I have investigated this regression. I've confirmed that In both > scenarios (patched and unpatched), the entire table and its associated > indexes were loaded into the shared buffer before the vacuum. Then, > the 'perf record' analysis, focused specifically on the heap vacuum > phase of the patched code, revealed numerous soft page faults > occurring: > > 62.37%13.90% postgres postgres[.] lazy_vacuum_heap_rel > | > |--52.44%--lazy_vacuum_heap_rel > | | > | |--46.33%--lazy_vacuum_heap_page (inlined) > | | | > | | |--32.42%--heap_page_is_all_visible > (inlined) > | | | | > | | | > |--26.46%--HeapTupleSatisfiesVacuum > | | | | > HeapTupleSatisfiesVacuumHorizon > | | | | > HeapTupleHeaderXminCommitted (inlined) > | | | | | > | | | | --18.52%--page_fault > | | | | > do_page_fault > | | | | > __do_page_fault > | | | | > handle_mm_fault > | | | | > __handle_mm_fault > | | | | > handle_pte_fault > | | | | | > | | | | > |--16.53%--filemap_map_pages > | | | | | > | > | | | | | > --2.63%--alloc_set_pte > | | | | | > pfn_pte > | | | | | > | | | | > --1.99%--pmd_page_vaddr > | | | | > | | | --1.99%--TransactionIdPrecedes > > I did not observe these page faults in the 'perf record' results for > the HEAD version. Furthermore, when I disabled parallel heap vacuum > while keeping parallel index vacuuming enabled, the regression > disappeared. Based on these findings, the likely cause of the > regression appears to be that during parallel heap vacuum operations, > table blocks were loaded into the shared buffer by parallel vacuum > workers. > In the previous paragraph, you mentioned that the entire table and its associated indexes were loaded into the shared buffer before the vacuum. If that is true, then why does the parallel vacuum need to reload the table blocks into shared buffers? > However, in the heap vacuum phase, the leader process needed > to process all blocks, resulting in soft page faults while creating > Page Table Entries (PTEs). Without the patch, the backend process had > already created PTEs during the heap scan, thus preventing these > faults from occurring during the heap vacuum phase. > This part is again not clear to me because I am assuming all the data exists in shared buffers before the vacuum, so why the page faults will occur in the first place. -- With Regards, Amit Kapila.
Re: Parallel heap vacuum
On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada wrote: > > Discussing with Amit offlist, I've run another benchmark test where no > data is loaded on the shared buffer. In the previous test, I loaded > all table blocks before running vacuum, so it was the best case. The > attached test results showed the worst case. > > Overall, while the numbers seem not stable, the phase I got sped up a > bit, but not as scalable as expected, which is not surprising. > Sorry, but it is difficult for me to understand this data because it doesn't contain the schema or details like what exactly is a fraction. It is also not clear how the workers are divided among heap and indexes, like do we use parallelism for both phases of heap or only first phase and do we reuse those workers for index vacuuming. These tests were probably discussed earlier, but it would be better to either add a summary of the required information to understand the results or at least a link to a previous email that has such details. Please > note that the test results shows that the phase III also got sped up > but this is because in parallel vacuum we use more ring buffers than > the single process vacuum. So we need to compare the only phase I time > in terms of the benefit of the parallelism. > Does phase 3 also use parallelism? If so, can we try to divide the ring buffers among workers or at least try vacuum with an increased number of ring buffers. This would be good to do for both the phases, if they both use parallelism. -- With Regards, Amit Kapila.
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Wed, Mar 5, 2025 at 3:55 PM Nisha Moond wrote: > > On Tue, Mar 4, 2025 at 8:05 PM Shubham Khanna > wrote: > > > > The attached Patch contains the suggested changes. > > > > Hi Shubham, > > Here are few comments for 040_pg_createsubscriber.pl > > 1) > +# Run pg_createsubscriber on node S using '--cleanup-existing-publications'. > +# --verbose is used twice to show more information. > > 1a) /node S/ node S1 > > 1b) Also, can we keep the comment in the same pattern as it was earlier - > # Run pg_createsubscriber on node S1. --verbose is used twice > # to show more information. > # In passing, also test the --enable-two-phase and > --cleanup-existing-publications > # options. > Fixed. > 2) > +# Reuse P as primary > +# Set up node S2 as standby linking to node P > +$node_p->backup('backup_3'); > > /Reuse P as/ Reuse node P as/ > Fixed. > 3) > +$node_s2->append_conf( > + 'postgresql.conf', qq[ > + primary_conninfo = '$pconnstr' > + hot_standby_feedback = on > + max_logical_replication_workers = 5 > + ]); > > Do we need "hot_standby_feedback = on" on node_s2? I think we can remove it. > Removed and updated the configurations. > 4) > +# Create user-defined publications > +$node_p->safe_psql($db3, "CREATE PUBLICATION test_pub3 FOR ALL TABLES;"); > +$node_p->safe_psql($db3, "CREATE PUBLICATION test_pub4 FOR ALL TABLES;"); > > Can create both publications under one safe_psql as - > > $node_p->safe_psql($db3, qq[CREATE PUBLICATION test_pub3 FOR ALL TABLES; > CREATE PUBLICATION test_pub4 FOR ALL TABLES; > ]); > Fixed. > 5) > +# Run pg_createsubscriber on node A without using > +# '--cleanup-existing-publications'. > +# --verbose is used twice to show more information. > > 5a) /node A/node S2/ > 5b) /without using '--cleanup-existing-publications' / without > '--cleanup-existing-publications' option > Fixed. > 6) > + ], > + 'run pg_createsubscriber without --cleanup-existing-publications on node A' > +); > > /node A/node S2/ > Fixed. > 7) > +# Drop the newly created publications > +$node_p->safe_psql($db3, "DROP PUBLICATION IF EXISTS test_pub3;"); > +$node_p->safe_psql($db3, "DROP PUBLICATION IF EXISTS test_pub4;"); > > Similar to #4, use single safe_psql to drop both the publications. > OTOH, do we really need to drop the publications here? I think we can > remove this part since we're performing cleanup right after. > > > -- Fixed. The attached patch contains the suggested changes. Thanks and regards, Shubham Khanna. v14-0001-Support-for-dropping-all-publications-in-pg_crea.patch Description: Binary data
Re: Make tuple deformation faster
On Thu, 6 Mar 2025 at 10:17, Andres Freund wrote: > FWIW, I am fairly certain that I looked at this at an earlier state of the > patch, and at least for me the issue wasn't that it was inherently slower to > use the bitmask, but that it was hard to convince the compiler not generate > worse code. > > IIRC the compiler generated more complicated address gathering instructions > which are slower on some older microarchitectures, but this is a vague memory. I've been reading GCC's assembly output with -fverbose-asm. I find it quite hard to follow as the changes between the 16-byte and 8-byte CompactAttribute versions are vast (see attached). A few interesting things jump out. e.g., in master: # execTuples.c:1080: thisatt->attcacheoff = *offp; .loc 1 1080 26 is_stmt 0 view .LVU1468 movl %ebp, (%rax) # off, MEM[(int *)_22] whereas with the 8-byte version, I see: # execTuples.c:1080: thisatt->attcacheoff = *offp; .loc 1 1080 26 is_stmt 0 view .LVU1484 movl %ebp, 24(%rax) # off, MEM[(int *)_358 + 24B] You can see the MOVL in the 8-byte version should amount to an additional micro op to add 24 to RAX before the dereference. One interesting thing to note about having CompactAttribute in its 8-byte form is that the compiler is tempted into sharing a register with the tts_values array since Datum is also 8 bytes. Note the difference in [1] between the two left compiler outputs and the right-hand one. You can see RCX is dedicated for addressing CompactAttribute in the right window, but RAX is used for both arrays in the left two. I don't 100% know for sure that's the reason for the slowness with the full version but it does seem from the fragment I posted just above that RAX does need 24 bytes added in the 8 bytes version but not in the 16 byte version, so RAX is certainly not dedicated and ready pointing to attcacheoff at that point. Jeff, I'm not sure if I understand this well enough to write a meaningful comment to explain why we don't use bitflags. With my current knowledge level on this, it's a bit hand-wavy at best. Are you content with this, or do you want to see something written into the header comment for CompactAttribute in the code? David [1] https://godbolt.org/z/7hWvqdW6E <>
Re: Restrict copying of invalidated replication slots
On Fri, 28 Feb 2025 at 08:56, Amit Kapila wrote: > > On Fri, Feb 28, 2025 at 5:10 AM Masahiko Sawada wrote: > > > > On Thu, Feb 27, 2025 at 12:52 AM Amit Kapila > > wrote: > > > > > > On Thu, Feb 27, 2025 at 10:47 AM Masahiko Sawada > > > wrote: > > > > > > > > On Tue, Feb 25, 2025 at 7:33 PM Amit Kapila > > > > wrote: > > > > > > > > > > AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with > > > > > this operation. So, isn't it possible that the source slot exists at > > > > > the later position in ReplicationSlotCtl->replication_slots and the > > > > > loop traversing slots is already ahead from the point where the newly > > > > > copied slot is created? > > > > > > > > Good point. I think it's possible. > > > > > > > > > If so, the newly created slot won't be marked > > > > > as invalid whereas the source slot will be marked as invalid. I agree > > > > > that even in such a case, at a later point, the newly created slot > > > > > will also be marked as invalid. > > > > > > > > The wal_status of the newly created slot would immediately become > > > > 'lost' and the next checkpoint will invalidate it. Do we need to do > > > > something to deal with this case? > > > > > > > > > > + /* Check if source slot became invalidated during the copy operation */ > > > + if (second_slot_contents.data.invalidated != RS_INVAL_NONE) > > > + ereport(ERROR, > > > + errmsg("cannot copy replication slot \"%s\"", > > > +NameStr(*src_name)), > > > + errdetail("The source replication slot was invalidated during the > > > copy operation.")); > > > > > > How about adding a similar test as above for MyReplicationSlot? That > > > should suffice the need because we have already acquired the new slot > > > by this time and invalidation should signal this process before > > > marking the new slot as invalid. > > > > IIUC in the scenario you mentioned, the loop traversing slots already > > passed the position of newly created slot in > > ReplicationSlotCtl->replication_slots array, so > > MyReplicationSlot->data.invalidated is still RS_INVAL_NONE, no? > > > > Right, I don't have a simpler solution for this apart from making it > somehow serialize this operation with > InvalidateObsoleteReplicationSlots(). I don't think we need to go that > far to handle the scenario being discussed. It is better to add a > comment for this race and why it won't harm. I tried to reproduce the scenario described using the following steps: Here are the steps: 1. set max_replication_slots = 2 2. create two logical replication slot, lets say slot1 and slot2. drop the first slot (slot1). 3. Now run pg_copy_logical_replication slot function on slot2. Lets say copied slot is 'slot_cp'. In function copy_replication_slot add breakpoint just after function 'create_logical_replication_slot'. slot_cp will be created. In array ReplicationSlotCtl->replication_slots, slot_cp will come before slot2. 4. Now invalidate the 'slot2'. Function 'InvalidateObsoleteReplicationSlots' will be called. Now it will loop through ReplicationSlotCtl->replication_slots and will get 'slot_cp' first. Since the slot is acquired by copy_replication_slot, It will wait for the slot to be released. Once slot is released, 'slot_cp' and 'slot2' will be invalidated. I have tried it with 'wal_removal' invalidation. So it is issue in PG_13 to HEAD. I also see a kill signal sent to function 'copy_replication_slot'. 
Terminal output: postgres=# select pg_copy_logical_replication_slot('slot2', 'slot_cp'); FATAL: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. 5. Execute 'SELECT * from pg_replication_slots'. slot_cp is present in the list with wal_status as lost. I have added the comment on existing patches for REL_16 to master. Created a new patch to add only comments in REL13-REL15. Thanks and Regards, Shlok Kyal REL_17_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch Description: Binary data REL_16_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch Description: Binary data REL_15-REL_13_v7-0001-Comment-race-condition-in-copy_replication_slot.patch Description: Binary data master_v7-0001-Restrict-copying-of-invalidated-replication-slots.patch Description: Binary data
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Thu, Mar 6, 2025 at 9:27 AM Peter Smith wrote: > > Hi Shubham. > > Some review comments for patch v13-0001. > > == > GENERAL > > 1. > --cleanup-existing-publications > > I've never liked this proposed switch name much. > > e.g. why say "cleanup" instead of "drop"? What is the difference? > Saying drop is very explicit about what the switch will do. > e.g. why say "existing"? It seems redundant; you can't cleanup/drop > something that does not exist. > > My preference is one of: > --drop-publications > --drop-all-publications > > either of which seem nicely aligned with the descriptions in the usage and > docs. > > Yeah, I know I have raised this same point before, but AFAICT the > reply was like "will revise it in the next patch version", but that > was many versions ago. I think it is important to settle the switch > name earlier than later because there will be many tentacles into the > code (vars, params, fields, comments) and docs if anything changes -- > so it is not a decision you want to leave till the end because it > could destabilise everything at the last minute. > I have updated the option name to '--drop-all-publications'. > == > Commit message > > 2. > By default, publications are preserved to avoid unintended data loss. > > ~ > > Was there supposed to be a blank line before this text, or should this > text be wrapped into the preceding paragraph? > Fixed. > == > src/bin/pg_basebackup/pg_createsubscriber.c > > setup_subscriber: > > 3. > /* > * Create the subscriptions, adjust the initial location for logical > * replication and enable the subscriptions. That's the last step for logical > - * replication setup. > + * replication setup. If 'drop_publications' options is true, existing > + * publications on the subscriber will be dropped before creating new > + * subscriptions. > */ > > There are multiple things amiss with this comment. > - 'drop_publications' is not the parameter name > - 'drop_publications' options [sic plural??]. It is not an option > here; it is a parameter > Fixed. > ~~~ > > check_and_drop_existing_publications: > > 4. > /* > - * Remove publication if it couldn't finish all steps. > + * Check and drop existing publications on the subscriber if requested. > */ > > There is no need to say "if requested.". It is akin to saying this > function does XXX if this function is called. > Fixed. > ~~~ > > drop_publication_by_name: > > 5. > +/* Drop the specified publication of the given database. */ > +static void > +drop_publication_by_name(PGconn *conn, const char *pubname, const char > *dbname) > > 5a. > I think it is better to define this function before > check_and_drop_existing_publications. YMMV. > > ~ > > 5b. > IMO the parameters should be reordered (PGconn *conn, const char > *dbname, const char *pubname). It seems more natural and would be > consistent with check_and_drop_existing_publications. YMMV. > > ~~~ Fixed. > > 6. > - dbinfo->pubname, dbinfo->dbname, PQresultErrorMessage(res)); > - dbinfo->made_publication = false; /* don't try again. */ > + pubname, dbname, PQresultErrorMessage(res)); > + dbinfos.dbinfo->made_publication = false; /* don't try again. */ > > Something about this flag assignment seems odd to me. IIUC > 'made_publications' is for removing the cleanup_objects_atexit to be > able to remove the special publication implicitly made by the > pg_createsubscriber. But, AFAIK we can also get to this code via the > --cleanup-existing-publication switch, so actually it might be one of > the "existing" publication DROPS that has failed. 
If so, then the > "don't try again comment" is misleading because it may not be that > same publication "again" at all. > I agree with your point. The current comment is misleading, as the failure could be related to an existing publication drop via --cleanup-existing-publications now --drop-all-publications, not just the publication created by pg_createsubscriber. This issue is still open, and I will address it in the next version of the patch. > == > .../t/040_pg_createsubscriber.pl > > GENERAL. > > 7. > Most of the changes to this test code are just changing node name S -> > S1. It's not much to do with the patch other than tidying it in > preparation for a new node S2. OTOH it makes the review far harder > because there are so many changes. IMO all this S->S1 stuff should be > done as a separate pre-requisite patch and then it will be far easier > to see what changes are added for the --clean-existing-publications > testing. > > ~~~ > I understand your concern. Since I am using two nodes (node_s1 and node_s2), I will work on creating a separate prerequisite patch for renaming S -> S1. This should make it easier to review the actual changes related to --cleanup-existing-publications now --drop-all-publications testing. > 8. > # Set up node S as standby linking to node P > $node_p->backup('backup_1'); > -my $node_s = PostgreSQL::Test::Cluster->new('node_s'); > -$node_s->init_from_ba
Re: SQLFunctionCache and generic plans
Hi On Thu, 6 Mar 2025 at 9:57, Alexander Pyhalov wrote: > Hi. > > Tom Lane wrote on 2025-02-27 23:40: > > Alexander Pyhalov writes: > >> Now sql functions plans are actually saved. The most of it is a > >> simplified version of plpgsql plan cache. Perhaps, I've missed > >> something. > > > > A couple of thoughts about v6: > > > > * I don't think it's okay to just summarily do this: > > > > /* It's stale; unlink and delete */ > > fcinfo->flinfo->fn_extra = NULL; > > MemoryContextDelete(fcache->fcontext); > > fcache = NULL; > > > > when fmgr_sql sees that the cache is stale. If we're > > doing a nested call of a recursive SQL function, this'd be > > cutting the legs out from under the outer recursion level. > > plpgsql goes to some lengths to do reference-counting of > > function cache entries, and I think you need the same here. > > > > I've looked for original bug report 7881 ( > > https://www.postgresql.org/message-id/E1U5ytP-0006E3-KB%40wrigleys.postgresql.org > ). > It's interesting, but it seems that plan cache is not affected by it as > when fcinfo xid mismatches we destroy fcinfo, not plan itself (cached > plan survives and still can be used). > > I thought about another case - when recursive function is invalidated > during its execution. But I haven't found such case. If function is > modified during function execution in another backend, the original > backend uses old snapshot during function execution and still sees the > old function definition. Looked at the following case - > > create or replace function f (int) returns setof int language sql as $$ > select i from t where t.j = $1 > union all > select f(i) from t where t.j = $1 > $$; > > and changed function definition to > > create or replace function f (int) returns setof int language sql as $$ > select i from t where t.j = $1 and i > 0 > union all > select f(i) from t where t.j = $1 > $$; > > during execution of the function. As expected, backend still sees the > old definition and uses cached plan. > > > * I don't like much of anything about 0004. It's messy and it > > gives up all the benefit of plan caching in some pretty-common > > cases (anywhere where the user was sloppy about what data type > > is being returned). I wonder if we couldn't solve that with > > more finesse by applying check_sql_fn_retval() to the query tree > > after parse analysis, but before we hand it to plancache.c? > > This is different from what happens now because we'd be applying > > it before not after query rewrite, but I don't think that > > query rewrite ever changes the targetlist results. Another > > point is that the resultTargetList output might change subtly, > > but I don't think we care there either: I believe we only examine > > that output for its result data types and resjunk markers. > > (This is nonobvious and inadequately documented, but for sure we > > couldn't try to execute that tlist --- it's never passed through > > the planner.) > > I'm also not fond of the last patch. So tried to fix it in a way you've > suggested - we call check_sql_fn_retval() before creating cached plans. > It fixes issue with revalidation happening after modifying query > results. > > One test now changes behavior. What's happening is that after moving > extension to another schema, cached plan is invalidated. But > revalidation > happens and rebuilds the plan. As we've saved analyzed parse tree, not > the raw one, we refer to public.dep_req2() not by name, but by oid. Oid > didn't change, so cached plan is rebuilt and used. 
Don't know, should we > fix it and if should, how. > > > > * One diff that caught my eye was > > > > - if (!ActiveSnapshotSet() && > > - plansource->raw_parse_tree && > > - analyze_requires_snapshot(plansource->raw_parse_tree)) > > + if (!ActiveSnapshotSet() && > StmtPlanRequiresRevalidation(plansource)) > > > > Because StmtPlanRequiresRevalidation uses > > stmt_requires_parse_analysis, this is basically throwing away the > > distinction between stmt_requires_parse_analysis and > > analyze_requires_snapshot. I do not think that's okay, for the > > reasons explained in analyze_requires_snapshot. We should expend the > > additional notational overhead to keep those things separate. > > > > * I'm also not thrilled by teaching StmtPlanRequiresRevalidation > > how to do something equivalent to stmt_requires_parse_analysis for > > Query trees. The entire point of the current division of labor is > > for it *not* to know that, but keep the knowledge of the properties > > of different statement types in parser/analyze.c. So it looks to me > > like we need to add a function to parser/analyze.c that does this. > > Not quite sure what to call it though. > > querytree_requires_parse_analysis() would be a weird name, because > > if it's a Query tree then it's already been through parse analysis. > > Maybe querytree_requires
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress
On 2025/03/09 20:38, vignesh C wrote: I've updated the status to "withdrawn." Feel free to add it again anytime if you change your mind. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote: > Regarding the issue itself, query jumbling behavior is often subjective, > making it difficult to classify as a bug. I'm not entirely sure this > qualifies as a bug either, but I do believe it should be addressed. I would call that a bug and something that we should fix, but not something that we can really backpatch as this unfortunately has an impact on monitoring tools. Stability takes priority in this area in already released branches. > By rearranging them as done in variant A, the position where the expression > is appended in the jumble changes, leading to a different hash. I just > don't like > this solution because it requires one to carefully construct a struct, > but it maybe the best out of the other options. I prefer something like variant A. It would not be the first time we are manipulating the structure of the parse nodes for the sake of making the query jumbling more transparent. An advantage of depending on the structure definition is that we can document the expectation in the header, and not hide it in the middle of the jumble code. > Variant B is not acceptable IMO as it adds a whole bunch of null-terminators > unnecessarily. For example, in a simple "select 1", (expr == NULL) is > true 19 times, > so that is an extra 19 bytes. Variant B is not acceptable here. > I think a third option is to add a new pg_node_attr called "query_jumble_null" > and be very selective on which fields should append a null-terminator to the > jumble when the expression is null > > The queryjumblefuncs.c could look like the below. JUMBLE_NULL will > be responsible for appending the null terminator. Not much of a fan of that. For the sake of this thread, I'd still go with the simplicity of A. And please, let's add a couple of queries in pgss to track that these lead to two different entries. Another option that I was thinking about and could be slightly cleaner is the addition of an extra field in Query that is set when we go through an offset_clause in the parser. It could just be a boolean, false by default. We have been using this practice in DeallocateStmt, for example. And there are only a few places where limitOffset is set. However, I'd rather discard this idea as transformSelectStmt() also has the idea to transform a LIMIT clause into an OFFSET clause, changing a Query representation. And note that we calculate the query jumbling after applying the query transformation. For these reasons, variant A where we put the LimitOption between the two int8 expression nodes feels like the "okay" approach here. But we must document this expectation in the structure, and check for more grammar variants of LIMIT and OFFSET clauses in pgss. Another concept would be to add into the jumble the offset of a Node in the upper structure we are jumbling, but this requires keeping track of all the nested things, which would make the query jumbling code much more complicated and more expensive, for sure. I'd also recommend being careful and double-checking all these transformation assumptions for LIMIT and OFFSET. We should shape things so that an OFFSET never maps to a LIMIT variant when we differentiate all these flavors at query string level. -- Michael signature.asc Description: PGP signature
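A hedged sketch of the kind of paired queries that could be added to the pg_stat_statements regression tests, assuming the reported collisions are between statements that differ only in which of the two adjacent clauses carries the expression:

-- Each pair reportedly jumbled to the same queryid before the fix;
-- afterwards they must show up as distinct pg_stat_statements entries.
SELECT DISTINCT 1;
SELECT 1 ORDER BY 1;

SELECT 1 LIMIT 1;
SELECT 1 OFFSET 1;

SELECT queryid, query, calls FROM pg_stat_statements
  WHERE query LIKE 'SELECT %' ORDER BY query;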
Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
On Mon, 10 Mar 2025 at 12:39, Michael Paquier wrote: > > On Thu, Mar 06, 2025 at 06:44:27PM -0600, Sami Imseih wrote: > > Regarding the issue itself, query jumbling behavior is often subjective, > > making it difficult to classify as a bug. I'm not entirely sure this > > qualifies as a bug either, but I do believe it should be addressed. > > I would call that a bug and something that we should fix, but not > something that we can really backpatch as this has unfortunately an > impact on monitoring tools. Stability takes priority in this area in > already released branches. I know you reviewed this, Michael, so you're aware; 2d3389c28 did recently document that we'd only break this in minor versions as a last resort. So, I agree that it sounds like a master-only fix is in order. For the record, the docs [1] read: "Generally, it can be assumed that queryid values are stable between minor version releases of PostgreSQL, providing that instances are running on the same machine architecture and the catalog metadata details match. Compatibility will only be broken between minor versions as a last resort." David [1] https://www.postgresql.org/docs/current/pgstatstatements.html
Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
On Mon, Mar 10, 2025 at 3:58 PM Ashutosh Bapat wrote: > > On Thu, Mar 6, 2025 at 9:18 AM Ajin Cherian wrote: > > > + Subscription names, publication names, and replication slot names > > are > > + automatically generated. Cannot be used together with > > + --database, --publication, > > + --replication-slot or > > --subscription. > > > > Don't start the sentence with "Cannot". Change the sentence to "This > > option cannot be used together with ..." > > similar sentences used in 3 other places below this as well. Change all of > > them. > > I didn't find any instance of "Cannot be" in the *.sgml files, but I > find some instances of "Can be". So it seems we allow such constructs > in the documentation. Any reasons for suggesting this change? > I just don't think it is correct English to start a sentence with "Cannot be". I checked with Grammarly as well. regards, Ajin Cherian Fujitsu Australia
Re: maintenance_work_mem = 64kB doesn't work for vacuum
On Sun, Mar 9, 2025 at 7:03 PM David Rowley wrote: > > On Mon, 10 Mar 2025 at 10:30, David Rowley wrote: > > Could you do something similar to what's in hash_agg_check_limits() > > where we check we've got at least 1 item before bailing before we've > > used up the all the prescribed memory? That seems like a safer coding > > practise as if in the future the minimum usage for a DSM segment goes > > above 256KB, the bug comes back again. > > FWIW, I had something like the attached in mind. > Thank you for the patch! I like your idea. This means that even if we set maintenance_work_mem to 64kB the memory usage would not actually be limited to 64kB, but we're probably fine as it's primarily for testing purposes. Regarding that patch, we need to note that lpdead_items is a counter that is not reset during the entire vacuum. Therefore, with maintenance_work_mem = 64kB, once we collect at least one lpdead item, we perform a cycle of index vacuuming and heap vacuuming for every subsequent block even if they don't have any lpdead items. I think we should use vacrel->dead_items_info->num_items instead. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Orphaned users in PG16 and above can only be managed by Superusers
Hi, On Fri, Mar 7, 2025 at 10:55 PM Nathan Bossart wrote: > > I noticed that much of this code is lifted from DropRole(), and the new > check_drop_role_dependency() function is only used by DropRole() right > before it does the exact same scans. Couldn't we put the new dependency > detection in those existing scans in DropRole()? > It can be done, but mixing the code that checks for the drop role dependency with the code that removes entries for the role being dropped from pg_auth_members could reduce clarity and precision. This is more of a sanity check which I felt was necessary before we proceed with actually dropping the role, starting with the deletion of drop role entries from the system catalogs. I’m aware there’s some code duplication, but I think it should be fine. -- With Regards, Ashutosh Sharma.
Re: Printing window function OVER clauses in EXPLAIN
On 2025-Mar-09, Tom Lane wrote: > David Rowley writes: > > What are your thoughts on being a bit more brief with the naming and > > just prefix with "w" instead of "window"? Looking at window.out, I see > > that the EXPLAIN output does become quite a bit wider than before. I > > favour the idea of saving a bit of space. There is an example in > > src/sgml/advanced.sgml that has "OVER w", so it does not seem overly > > strange to me to name them "w1", "w2", etc. > > OK by me, any objections elsewhere? WFM. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: [BUG]: the walsender does not update its IO statistics until it exits
On Mon, Mar 03, 2025 at 11:54:39AM +, Bertrand Drouvot wrote: > So it does not look like what we're adding here can be seen as a primary > bottleneck > but that is probably worth implementing the "have_iostats" optimization > attached. > > Also, while I did not measure any noticeable extra lag, given the fact that > pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about > 2.5%, > that could still make sense to reduce the frequency of the flush calls, > thoughts? Okay. The number of reports as proposed in the patch really worries me anyway. It's not really complicated to see dozens of pgstats calls in a single millisecond in the WAL sender, even with some of the regression tests. That's more present in the logirep scenarios, it seems. For example, here are some numbers based on some elog() calls planted in walreceiver.c, looking at the logs of the TAP tests. Through a single check-world, I am able to gather the following information: - The location proposed by the patch leads to 680k stats reports, in WalSndLoop() because of the "Block if we have unsent data" part. Some tests show quite some contention. - A different location, in WalSndLoop() just before WalSndWait(), leads to 59k calls. And there's some contention in the logirep tests. - I've spotted a third candidate which looks pretty solid, actually: WalSndWaitForWal() before WalSndWait(). This leads to 2.9k reports in the whole test suite, with much less contention in the reports. These can still be rather frequent, up to ~50 calls per second, but that's really less than the two others. Stats data is useful as long as it is possible to get an idea of how the system behaves, particularly with a steady workload. More frequent reports are useful for spiky data detection, showing more noise. Still, too many reports may cause the part gathering the reports to become a bottleneck, while we want it to offer hints about bottlenecks. So I would argue in favor of a more conservative choice in the back branches than what the patch is proposing. Choice 3 I'm quoting above is tempting by design: not too much, still frequent enough to offer enough relevant information in the stats. -- Michael signature.asc Description: PGP signature
Re: Printing window function OVER clauses in EXPLAIN
On Mon, 10 Mar 2025 at 14:13, Tom Lane wrote: > Hmm, OK. Do you think it could be sensible to put Run Condition > before Filter, then? On the same grounds of "keeping related > things together", it could be argued that Run Condition is > related to the Window property. Also, the Run Condition acts > before the Filter does, if I've got my head screwed on straight. Yes, directly after the "Window" property makes sense for the reason you stated. Thanks for thinking of that. David
Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
On Mon, Mar 10, 2025 at 02:14:01PM +1300, David Rowley wrote: > I know you reviewed this, Michael, so you're aware; 2d3389c28 did > recently document that we'd only break this in minor versions as a > last resort. So, I agree that it sounds like a master-only fix is in > order. Yes, this thread's problem does not pass the compatibility bar. Thanks for the reminder about the docs. -- Michael signature.asc Description: PGP signature
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Sun, Mar 9, 2025 at 9:00 AM Masahiko Sawada wrote: > > On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila wrote: > > > > > I see the point of adding such an option to avoid breaking the current > > applications (if there are any) that are relying on current behaviour. > > But OTOH, I am not sure if users expect us to fail explicitly in such > > scenarios. > > On the side note, with the patch since we ignore missing publications > we will be able to create a subscription with whatever publications > regardless their existence like: > > CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., > pub1000; > > The walsender corresponding to such subscriber can stream changes as > soon as the listed publications are created on the publisher and > REFRESH PUBLICATION is executed. > Right, but OTOH, one can expect that the data should start replicating as soon as one creates a publication on the publisher. However, the data for tables that are part of the publication will start replicating from the point when the decoding process processes the WAL corresponding to the CREATE PUBLICATION. I suggest adding something about this to the docs unless it is already explained. > > > > This is a long-standing behaviour for which we get reports from time > > to time, and once analyzing a failure, Tom also looked at it and > > agreed that we don't have much choice to avoid skipping non-existent > > publications [5]. But we never concluded as to whether skipping should > > be a default behavior or an optional one. So, we need more opinions on > > it. > > I'm leaning toward making the skipping behavior a default as I could > not find a good benefit for the current behavior (i.e., stopping > logical replication due to missing publications). > Sounds reasonable. We can always add the option at a later point if required. Thanks for your input. We can continue reviewing and committing the current patch. -- With Regards, Amit Kapila.
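To make the ordering point concrete, a hedged illustration under the proposed skip-by-default behavior (object names invented):

-- Subscriber: with missing publications skipped, this can be created even
-- though pub1000 does not exist on the publisher yet.
CREATE SUBSCRIPTION sub CONNECTION 'dbname=src host=pubhost' PUBLICATION pub1000;

-- Publisher, some time later: decoding only starts picking up pub1000's
-- tables from the WAL position of this command, not from earlier changes.
CREATE PUBLICATION pub1000 FOR TABLE t;

-- Subscriber: refresh to pick up the newly published table.
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;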
Re: Printing window function OVER clauses in EXPLAIN
On Mon, 10 Mar 2025 at 11:19, Tom Lane wrote: > OK, here's v2 done like that. I do like this output better. > I backed off the idea of putting the WindowClause as such > into the plan, partly because I didn't feel like debugging > the setrefs.c problem that David discovered upthread. > This way does require a bit more code, but I think it's less > likely to have bugs. This output is much nicer. The patch looks good to me. What are your thoughts on being a bit more brief with the naming and just prefix with "w" instead of "window"? Looking at window.out, I see that the EXPLAIN output does become quite a bit wider than before. I favour the idea of saving a bit of space. There is an example in src/sgml/advanced.sgml that has "OVER w", so it does not seem overly strange to me to name them "w1", "w2", etc. > * In passing, I editorialized on the order in which the Run Condition > annotation comes out: > > case T_WindowAgg: > +show_window_def(castNode(WindowAggState, planstate), ancestors, > es); > show_upper_qual(plan->qual, "Filter", planstate, ancestors, es); > +show_upper_qual(((WindowAgg *) plan)->runConditionOrig, > +"Run Condition", planstate, ancestors, es); > if (plan->qual) > show_instrumentation_count("Rows Removed by Filter", 1, > planstate, es); > -show_upper_qual(((WindowAgg *) plan)->runConditionOrig, > -"Run Condition", planstate, ancestors, es); > show_windowagg_info(castNode(WindowAggState, planstate), es); > break; > > It seemed quite weird to me to have the Run Condition plan property > come out in the middle of properties that only appear in EXPLAIN > ANALYZE mode. Maybe there's a reason for this other than "people > added new properties at the end", but I don't see it. I did it that way because "Rows Removed by Filter" is a property of "Filter", so it makes sense to me for those to be together. It doesn't make sense to me to put something unrelated between them. If you look at BitmapHeapScan output, this keeps the related outputs together, i.e: -> Parallel Bitmap Heap Scan on ab (cost=111.20..82787.64 rows=1 width=8) (actual time=172.498..172.499 rows=0.00 loops=3) Recheck Cond: (a = 1) Rows Removed by Index Recheck: 705225 Filter: (b = 3) Rows Removed by Filter: What you're proposing seems equivalent to if we did it like: -> Parallel Bitmap Heap Scan on ab (cost=111.20..82787.64 rows=1 width=8) (actual time=172.498..172.499 rows=0.00 loops=3) Recheck Cond: (a = 1) Filter: (b = 3) Rows Removed by Index Recheck: 705225 Rows Removed by Filter: David
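For context, the kind of query where the generated names matter is one with more than one window definition; borrowing the empsalary example table from the tutorial, something like the following would presumably show two WindowAgg nodes annotated with short names such as w1 and w2 under the proposal (a sketch, not the patch's exact output):

EXPLAIN (COSTS OFF)
SELECT depname,
       sum(salary) OVER w1,
       rank()      OVER w2
  FROM empsalary
  WINDOW w1 AS (PARTITION BY depname),
         w2 AS (ORDER BY salary DESC);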
Re: CREATE OR REPLACE MATERIALIZED VIEW
The attached v6 fixes the build. Somehow I missed testing with --with-cassert the whole time and it turned that out I forgot to pass queryString to ExecRefreshMatView. -- Erik Wienhold >From 6ec126d8da5ca80f93ea8a58e07d654f5e21ef6d Mon Sep 17 00:00:00 2001 From: Erik Wienhold Date: Tue, 21 May 2024 18:35:47 +0200 Subject: [PATCH v6 1/3] Add CREATE OR REPLACE MATERIALIZED VIEW --- .../sgml/ref/create_materialized_view.sgml| 15 +- src/backend/commands/createas.c | 207 ++ src/backend/commands/tablecmds.c | 8 +- src/backend/commands/view.c | 106 ++--- src/backend/parser/gram.y | 15 ++ src/bin/psql/tab-complete.in.c| 26 ++- src/include/commands/view.h | 3 + src/include/nodes/parsenodes.h| 2 +- src/include/nodes/primnodes.h | 1 + src/test/regress/expected/matview.out | 191 src/test/regress/sql/matview.sql | 108 + 11 files changed, 589 insertions(+), 93 deletions(-) diff --git a/doc/src/sgml/ref/create_materialized_view.sgml b/doc/src/sgml/ref/create_materialized_view.sgml index 62d897931c3..5e03320eb73 100644 --- a/doc/src/sgml/ref/create_materialized_view.sgml +++ b/doc/src/sgml/ref/create_materialized_view.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name +CREATE [ OR REPLACE ] MATERIALIZED VIEW [ IF NOT EXISTS ] table_name [ (column_name [, ...] ) ] [ USING method ] [ WITH ( storage_parameter [= value] [, ... ] ) ] @@ -60,6 +60,17 @@ CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name Parameters + +OR REPLACE + + + Replaces a materialized view if it already exists. + Specifying OR REPLACE together with + IF NOT EXISTS is an error. + + + + IF NOT EXISTS @@ -67,7 +78,7 @@ CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name Do not throw an error if a materialized view with the same name already exists. A notice is issued in this case. Note that there is no guarantee that the existing materialized view is anything like the one that would - have been created. + have been created, unless you use OR REPLACE instead. diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index 44b4665ccd3..9f37ed7f177 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -79,55 +79,151 @@ static void intorel_destroy(DestReceiver *self); static ObjectAddress create_ctas_internal(List *attrList, IntoClause *into) { - CreateStmt *create = makeNode(CreateStmt); - boolis_matview; + boolis_matview, + replace = false; charrelkind; - Datum toast_options; - const char *const validnsps[] = HEAP_RELOPT_NAMESPACES; + Oid matviewOid = InvalidOid; ObjectAddress intoRelationAddr; /* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */ is_matview = (into->viewQuery != NULL); relkind = is_matview ? RELKIND_MATVIEW : RELKIND_RELATION; - /* -* Create the target relation by faking up a CREATE TABLE parsetree and -* passing it to DefineRelation. -*/ - create->relation = into->rel; - create->tableElts = attrList; - create->inhRelations = NIL; - create->ofTypename = NULL; - create->constraints = NIL; - create->options = into->options; - create->oncommit = into->onCommit; - create->tablespacename = into->tableSpaceName; - create->if_not_exists = false; - create->accessMethod = into->accessMethod; + /* Check if an existing materialized view needs to be replaced. */ + if (is_matview) + { + LOCKMODElockmode; - /* -* Create the relation. (This will error out if there's an existing view, -* so we don't need more code to complain if "replace" is false.) 
-*/ - intoRelationAddr = DefineRelation(create, relkind, InvalidOid, NULL, NULL); + lockmode = into->replace ? AccessExclusiveLock : NoLock; + (void) RangeVarGetAndCheckCreationNamespace(into->rel, lockmode, + &matviewOid); + replace = OidIsValid(matviewOid) && into->replace; + } - /* -* If necessary, create a TOAST table for the target table. Note that -* NewRelationCreateToastTable ends with CommandCounterIncrement(), so -* that the TOAST table will be visible for insertion. -*/ - CommandCounterIncrement(); + if (is_matview && replace) + { + Relation