Getting rid of OverrideSearchPath in namespace.c
Hello hackers,

As a follow-up to the CVE-2023-2454 fix, I think it makes sense to
completely remove the unsafe functions
PushOverrideSearchPath()/PopOverrideSearchPath(), which are no longer used
in core. Please see the patch attached.

Besides that, maybe it's worth renaming the three functions with
"Override" in their names (GetOverrideSearchPath(),
CopyOverrideSearchPath(), OverrideSearchPathMatchesCurrent()), and then
maybe the struct OverrideSearchPath as well. Noah Misch proposed the name
GetSearchPathMatcher() for the first. What do you think?

Best regards,
Alexander

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 14e57adee2..93610f948e 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -67,9 +67,7 @@
  * may be included:
  *
  * 1. If a TEMP table namespace has been initialized in this session, it
- * is implicitly searched first.  (The only time this doesn't happen is
- * when we are obeying an override search path spec that says not to use the
- * temp namespace, or the temp namespace is included in the explicit list.)
+ * is implicitly searched first.
  *
  * 2. The system catalog namespace is always searched.  If the system
  * namespace is present in the explicit path then it will be searched in
@@ -108,19 +106,14 @@
  * namespace (if it exists), preceded by the user's personal namespace
  * (if one exists).
  *
- * We support a stack of "override" search path settings for use within
- * specific sections of backend code.  namespace_search_path is ignored
- * whenever the override stack is nonempty.  activeSearchPath is always
- * the actually active path; it points either to the search list of the
- * topmost stack entry, or to baseSearchPath which is the list derived
- * from namespace_search_path.
+ * activeSearchPath is always the actually active path; it points to
+ * baseSearchPath which is the list derived from namespace_search_path.
 *
 * If baseSearchPathValid is false, then baseSearchPath (and other
 * derived variables) need to be recomputed from namespace_search_path.
 * We mark it invalid upon an assignment to namespace_search_path or receipt
 * of a syscache invalidation event for pg_namespace.  The recomputation
- * is done during the next non-overridden lookup attempt.  Note that an
- * override spec is never subject to recomputation.
+ * is done during the next lookup attempt.
 *
 * Any namespaces mentioned in namespace_search_path that are not readable
 * by the current user ID are simply left out of baseSearchPath; so
@@ -161,17 +154,6 @@
 static Oid	namespaceUser = InvalidOid;

 /* The above four values are valid only if baseSearchPathValid */
 static bool baseSearchPathValid = true;

-/* Override requests are remembered in a stack of OverrideStackEntry structs */
-
-typedef struct
-{
-	List	   *searchPath;		/* the desired search path */
-	Oid			creationNamespace;	/* the desired creation namespace */
-	int			nestLevel;		/* subtransaction nesting level */
-} OverrideStackEntry;
-
-static List *overrideStack = NIL;
-
 /*
 * myTempNamespace is InvalidOid until and unless a TEMP namespace is set up
 * in a particular backend session (this happens when a CREATE TEMP TABLE
@@ -3392,8 +3374,7 @@ SetTempNamespaceState(Oid tempNamespaceId, Oid tempToastNamespaceId)

 /*
- * GetOverrideSearchPath - fetch current search path definition in form
- * used by PushOverrideSearchPath.
+ * GetOverrideSearchPath - fetch current search path definition.
 *
 * The result structure is allocated in the specified memory context
 * (which might or might not be equal to CurrentMemoryContext); but any
@@ -3512,189 +3512,6 @@ OverrideSearchPathMatchesCurrent(OverrideSearchPath *path)
 	return true;
 }

-/*
- * PushOverrideSearchPath - temporarily override the search path
- *
- * Do not use this function; almost any usage introduces a security
- * vulnerability.  It exists for the benefit of legacy code running in
- * non-security-sensitive environments.
- *
- * We allow nested overrides, hence the push/pop terminology.  The GUC
- * search_path variable is ignored while an override is active.
- *
- * It's possible that newpath->useTemp is set but there is no longer any
- * active temp namespace, if the path was saved during a transaction that
- * created a temp namespace and was later rolled back.  In that case we just
- * ignore useTemp.  A plausible alternative would be to create a new temp
- * namespace, but for existing callers that's not necessary because an empty
- * temp namespace wouldn't affect their results anyway.
- *
- * It's also worth noting that other schemas listed in newpath might not
- * exist anymore either.  We don't worry about this because OIDs that match
- * no existing namespace will simply not produce any hits during searches.
- */
-void
-PushOverrideSearchPath(OverrideSearchPath *newpath)
-{
-	OverrideStackEntry *entry;
-	List	   *oidlist;
-	Oid			firstNS;
-	MemoryContext oldcxt;
-
-	/*
-	 * Copy th
Re: Should we remove db_user_namespace?
On Sat, Jul 15, 2023 at 1:34 AM Nathan Bossart wrote:
>
> On Mon, Jul 10, 2023 at 03:43:07PM +0900, Michael Paquier wrote:
> > On Wed, Jul 05, 2023 at 08:49:26PM -0700, Nathan Bossart wrote:
> >> On Thu, Jul 06, 2023 at 08:21:18AM +0900, Michael Paquier wrote:
> >>> Removing the GUC from this table is kind of annoying.  Cannot this be
> >>> handled like default_with_oids or ssl_renegotiation_limit to avoid any
> >>> kind of issues with the reload of dump files and the kind?
> >>
> >> Ah, good catch.
> >
> > Thanks.  Reading through the patch, this version should be able to
> > handle the dump reloads.
>
> Hm.  Do we actually need to worry about this?  It's a PGC_SIGHUP GUC, so it
> can only be set at postmaster start or via a configuration file.  Any dump
> files that are trying to set it or clients that are trying to add it to
> startup packets are already broken.  I guess keeping the GUC around would
> avoid breaking any configuration files or startup scripts that happen to be
> setting it to false, but I don't know if that's really worth worrying
> about.

I'd lean towards "no". A hard break, when it's a major release, is better
than a "it stopped having effect but didn't tell you anything" break.
Especially when it comes to things like startup scripts etc.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Re: Inefficiency in parallel pg_restore with many tables
On 2023-07-15 Sa 13:47, Tom Lane wrote:

> I looked into the performance gripe at [1] about pg_restore not making
> effective use of parallel workers when there are a lot of tables.
> I was able to reproduce that by dumping and parallel restoring 100K
> tables made according to this script:
>
> do $$
> begin
>   for i in 1..100000 loop
>     execute format('create table t%s (f1 int unique, f2 int unique);', i);
>     execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
>     if i % 100 = 0 then commit; end if;
>   end loop;
> end
> $$;
>
> Once pg_restore reaches the parallelizable part of the restore, what
> I see is that the parent pg_restore process goes to 100% CPU while its
> children (and the server) mostly sit idle; that is, the task dispatch
> logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
> to keep the children busy.  A quick perf check showed most of the time
> being eaten by pg_qsort and TocEntrySizeCompare.
>
> What I believe is happening is that we start the parallel restore phase
> with 100K TableData items that are ready to go (they are in the
> ready_list) and 200K AddConstraint items that are pending, because we
> make those have dependencies on the corresponding TableData so we don't
> build an index until after its table is populated.  Each time one of
> the TableData items is completed by some worker, the two AddConstraint
> items for its table are moved from the pending_list to the ready_list
> --- and that means ready_list_insert marks the ready_list as no longer
> sorted.  When we go to pop the next task from the ready_list, we
> re-sort that entire list first.  So we spend something like
> O(N^2 * log(N)) time just sorting, if there are N tables.  Clearly,
> this code is much less bright than it thinks it is (and that's all my
> fault, if memory serves).
>
> I'm not sure how big a deal this is in practice: in most situations the
> individual jobs are larger than they are in this toy example, plus the
> initial non-parallelizable part of the restore is a bigger bottleneck
> anyway with this many tables.  Still, we do have one real-world
> complaint, so maybe we should look into improving it.
>
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job.  Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.

Yeah, I think that last idea is reasonable. Something like if the number
added since the last sort is more than min(50, list_length/4) then sort.
That shouldn't be too invasive.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
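The deferred-sort heuristic Andrew suggests can be sketched roughly as
follows. This is a hypothetical, simplified illustration only: plain ints
stand in for the TocEntry jobs that pg_backup_archiver.c actually
schedules, and the structure names are invented.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the ready_list: an array of job "sizes", kept roughly
 * sorted in descending order.  Instead of re-sorting on every insert,
 * we count insertions and re-sort lazily when popping. */
typedef struct
{
	int		   *items;
	int			nitems;
	int			added_since_sort;	/* insertions since the last qsort */
} ReadyList;

static int
size_cmp_desc(const void *a, const void *b)
{
	return *(const int *) b - *(const int *) a;
}

static void
ready_list_insert(ReadyList *rl, int job_size)
{
	rl->items[rl->nitems++] = job_size;
	rl->added_since_sort++;
}

/* Re-sort only when more than min(50, nitems / 4) entries were added
 * since the last sort, then pop the (approximately) largest job. */
static int
pop_next_work_item(ReadyList *rl)
{
	int			threshold = rl->nitems / 4 < 50 ? rl->nitems / 4 : 50;
	int			job;

	if (rl->added_since_sort > threshold)
	{
		qsort(rl->items, rl->nitems, sizeof(int), size_cmp_desc);
		rl->added_since_sort = 0;
	}
	job = rl->items[0];
	memmove(rl->items, rl->items + 1, --rl->nitems * sizeof(int));
	return job;
}
```

This trades strict largest-first ordering for much lower sorting overhead:
entries inserted since the last sort may be dispatched slightly out of
priority order until the next lazy re-sort fires.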
Re: Inefficiency in parallel pg_restore with many tables
Andrew Dunstan writes:
> On 2023-07-15 Sa 13:47, Tom Lane wrote:
>> I wonder if we could replace the sorted ready-list with a priority heap,
>> although that might be complicated by the fact that pop_next_work_item
>> has to be capable of popping something that's not necessarily the
>> largest remaining job.  Another idea could be to be a little less eager
>> to sort the list every time; I think in practice scheduling wouldn't
>> get much worse if we only re-sorted every so often.

> Yeah, I think that last idea is reasonable. Something like if the number
> added since the last sort is more than min(50, list_length/4) then sort.
> That shouldn't be too invasive.

Actually, as long as we're talking about approximately-correct behavior:
let's make the ready_list be a priority heap, and then just make
pop_next_work_item scan forward from the array start until it finds an
item that's runnable per the lock heuristic.  If the heap root is blocked,
the next things we'll examine will be its two children.  We might pick the
lower-priority of those two, but it's still known to be higher priority
than at least 50% of the remaining heap entries, so it shouldn't be too
awful as a choice.  The argument gets weaker the further you go into the
heap, but we're not expecting that having most of the top entries blocked
will be a common case.  (Besides which, the priorities are pretty crude
to begin with.)  Once selected, pulling out an entry that is not the heap
root is no problem: you just start the sift-down process from there.

The main advantage of this over the only-sort-sometimes idea is that we
can guarantee that the largest ready item will always be dispatched as
soon as it can be (because it will be the heap root).  So cases involving
one big table (with big indexes) and a lot of little ones should get
scheduled sanely, which is the main thing we want this algorithm to
ensure.  With the other approach we can't really promise much at all.

			regards, tom lane
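Tom's scheme can be illustrated with a minimal max-heap sketch. Everything
here is hypothetical scaffolding: integer "sizes" stand in for the
TocEntry cost estimates, and a single "blocked" value stands in for the
lock-conflict heuristic. Note one wrinkle the sketch makes explicit: when
the element that fills the hole comes from the end of the array, restoring
the invariant may in general require sifting up as well as down.

```c
/* Minimal max-heap of ints, supporting removal of a non-root entry. */
typedef struct
{
	int		   *v;
	int			n;
} Heap;

static void
swap_ints(int *a, int *b)
{
	int			t = *a;

	*a = *b;
	*b = t;
}

static void
sift_up(Heap *h, int i)
{
	while (i > 0 && h->v[(i - 1) / 2] < h->v[i])
	{
		swap_ints(&h->v[(i - 1) / 2], &h->v[i]);
		i = (i - 1) / 2;
	}
}

static void
sift_down(Heap *h, int i)
{
	for (;;)
	{
		int			l = 2 * i + 1,
					r = 2 * i + 2,
					m = i;

		if (l < h->n && h->v[l] > h->v[m])
			m = l;
		if (r < h->n && h->v[r] > h->v[m])
			m = r;
		if (m == i)
			break;
		swap_ints(&h->v[i], &h->v[m]);
		i = m;
	}
}

static void
heap_push(Heap *h, int x)
{
	h->v[h->n++] = x;
	sift_up(h, h->n - 1);
}

/* Remove the entry at index i (not necessarily the root): move the last
 * element into the hole, then restore the invariant in both directions. */
static int
heap_remove_at(Heap *h, int i)
{
	int			x = h->v[i];

	h->v[i] = h->v[--h->n];
	sift_down(h, i);
	sift_up(h, i);
	return x;
}

/* Pop the best runnable item: scan forward from the heap root, skipping
 * entries equal to "blocked" (a stand-in for the lock heuristic). */
static int
heap_pop_first_runnable(Heap *h, int blocked)
{
	for (int i = 0; i < h->n; i++)
		if (h->v[i] != blocked)
			return heap_remove_at(h, i);
	return -1;					/* nothing runnable */
}
```

Because the array is a heap rather than a fully sorted list, both insert
and remove cost O(log N) instead of an O(N log N) re-sort per pop, while
the forward scan preserves Tom's "approximately largest runnable" pick.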
Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Hi,

I'm still working on it, but unfortunately I haven't had enough time to
produce something that could be shown yet. Right now I am trying to sort
out the problems I drew attention to in the previous letter.

--
Regards,
Alena Rybakina
Postgres Professional
Re: Autogenerate some wait events code and documentation
Hi,

On 2023-07-14 13:49:22 +0900, Michael Paquier wrote:
> I have looked again at 0001 and 0002 and applied them to get them out
> of the way.  0003 and 0004 are rebased and attached.  I'll add them to
> the CF for later consideration.  More opinions are welcome.

> From b6390183bdcc054df82279bb0b2991730f85a0a3 Mon Sep 17 00:00:00 2001
> From: Michael Paquier
> Date: Thu, 13 Jul 2023 10:14:47 +0900
> Subject: [PATCH v3 4/4] Remove column for enum elements in
>  wait_event_names.txt
>
> This file is now made of two columns, removing the column listing the
> enum elements for each wait event class:
> - Camelcase event name used in pg_stat_activity.  These are now
>   unquoted.
> - Description of the documentation.
> The enum elements are generated from what is now the first column.

I think the search issue is valid, so I do think going the other way is
preferable.  I.e. use just the enum value in the .txt and generate the
camel case name from that.  That allows you to search for the define used
in code and find a hit in the file.  I personally would still leave off
the WAIT_EVENT prefix in the .txt, I think most of us can remember to
chop that off.

I don't think we need to be particularly consistent with wait events
across major versions.  They're necessarily tied to how the code works,
and we've yanked that around plenty.

Greetings,

Andres Freund
Re: Use of additional index columns in rows filtering
On 7/15/23 16:20, Tomas Vondra wrote:
>
> ...
>
> 4) problems with opcintype != opckeytype (name_ops)
>
> While running the tests, I ran into an issue with name_ops, causing
> failures for \dT and other catalog queries. The root cause is that
> name_ops has opcintype = name, but opckeytype = cstring. The index-only
> clauses are copied from the table, with Vars mutated to reference the
> INDEX_VAR. But the type is not, so when we get to evaluating the
> expressions, CheckVarSlotCompatibility() fails because the Var has name,
> but the iss_IndexSlot (created with the index tuple descriptor) has
> cstring.
>
> The rebased patch fixes this by explicitly adjusting types of the
> descriptor in ExecInitIndexScan().
>
> However, maybe this indicates the very idea of evaluating expressions
> using a slot with the index tuple descriptor is misguided. This made me
> look at regular index-only scan (nodeIndexonlyscan.c), and that uses a
> slot with the "table" structure, and instead of evaluating the
> expression on the index tuple it expands the index tuple into the table
> slot. Which is what StoreIndexTuple() does.
>
> So maybe this should do what IOS does - expand the index tuple into the
> "table slot" and evaluate the expression on that. That'd also make the
> INDEX_VAR tweak in createplan.c unnecessary - in fact, that seemed a bit
> strange anyway, so ditching fix_indexfilter_mutator would be good.

This kept bothering me, so I looked at it today and reworked it to use
the IOS approach. It's a bit more complicated because for IOS both slots
have the same overall structure, except for the data types. But for
regular index scans that's not the case - the code has to "expand" the
index tuple into the larger "table slot".

This works, and in general I think the result is much cleaner - in
particular, it means we don't need to switch the Var nodes to reference
the INDEX_VAR.

While experimenting with this I realized again that we're not matching
expressions to IOS.
So if you have an expression index on (a+b), that can't be used even if
the query only uses this particular expression. The same limitation
applies to index-only filters, of course. It's not the fault of this
patch, but perhaps it'd be an interesting improvement.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 685e4014533d9d5f29564a3a21b972abba70ee9f Mon Sep 17 00:00:00 2001
From: Tomas Vondra
Date: Sun, 9 Apr 2023 02:08:45 +0200
Subject: [PATCH] evaluate filters on the index tuple (when possible)

Discussion: https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU%3D%40yamlcoder.me
---
 src/backend/commands/explain.c                |   2 +
 src/backend/executor/nodeIndexscan.c          | 258 +-
 src/backend/optimizer/path/costsize.c         |  23 ++
 src/backend/optimizer/path/indxpath.c         | 190 +++--
 src/backend/optimizer/plan/createplan.c       |  80 ++
 src/backend/optimizer/plan/planner.c          |   2 +-
 src/backend/optimizer/plan/setrefs.c          |   6 +
 src/backend/optimizer/util/pathnode.c         |   2 +
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/execnodes.h                 |   7 +
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/nodes/plannodes.h                 |   2 +
 src/include/optimizer/cost.h                  |   1 +
 src/include/optimizer/pathnode.h              |   1 +
 src/test/regress/expected/create_index.out    |  19 +-
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/expected/updatable_views.out |  12 +-
 18 files changed, 580 insertions(+), 40 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f621..556388bc499 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1778,6 +1778,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_IndexScan:
 			show_scan_qual(((IndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
+			show_scan_qual(((IndexScan *) plan)->indexfiltersorig,
+						   "Index Filter", planstate, ancestors, es);
 			if (((IndexScan *) plan)->indexqualorig)
 				show_instrumentation_count("Rows Removed by Index Recheck", 2,
 										   planstate, es);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..5ee1fd86d7f 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -32,6 +32,8 @@
 #include "access/nbtree.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/visibilitymap.h"
+#include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "executor/execdebug.h"
 #include "executor/nodeIndexscan.h"
@@ -69,6 +71,10 @@ static void reorderqueue_push(Inde
Re: Permute underscore separated components of columns before fuzzy matching
Hello Mikhail,

I'm sorry. Please try the attached patch instead. Thank you for having a look!

Regards
Arne

From: Mikhail Gribkov
Sent: Thursday, July 6, 2023 13:31
To: Arne Roland
Cc: Pg Hackers
Subject: Re: Permute underscore separated components of columns before fuzzy matching

Hello Arne,

The goal of supporting words-switching hints sounds interesting and I've
tried to apply your patch. The patch was applied smoothly to the latest
master and check-world reported no problems. Although I had problems after
trying to test the new functionality. I tried to simply mix words in
pg_stat_activity.wait_event_type:

postgres=# select wait_type_event from pg_stat_activity ;
2023-07-06 14:12:35.968 MSK [1480] WARNING:  detected write past chunk end in MessageContext 0x559d668aaf30
2023-07-06 14:12:35.968 MSK [1480] WARNING:  detected write past chunk end in MessageContext 0x559d668aaf30
2023-07-06 14:12:35.968 MSK [1480] WARNING:  detected write past chunk end in MessageContext 0x559d668aaf30
[... the same WARNING repeated many more times ...]
2023-07-06 14:12:35.968 MSK [1480] ERROR:  column "wait_type_event" does not exist at character 8
2023-07-06 14:12:35.968 MSK [1480] HINT:  Perhaps you meant to reference the column "pg_stat_activity.wait_event_type".
2023-07-06 14:12:35.968 MSK [1480] STATEMENT:  select wait_type_event from pg_stat_activity ;
WARNING:  detected write past chunk end in MessageContext 0x559d668aaf30
WARNING:  detected write past chunk
Re: Changing types of block and chunk sizes in memory contexts
On Fri, 14 Jul 2023 at 18:53, Melih Mutlu wrote:
> David Rowley wrote on Thu, 13 Jul 2023 at 08:04:
>>
>> I looked at your v2 patch. The only thing that really looked wrong
>> were the (Size) casts in the context creation functions. These should
>> have been casts to uint32 rather than Size. Basically, all the casts
>> do is say to the compiler "Yes, I know this could cause truncation due
>> to assigning to a size smaller than the source type's size". Some
>> compilers will likely warn without that and the cast will stop them.
>> We know there can't be any truncation due to the Asserts. There's also
>> the fundamental limitation that MemoryChunk can't store block offsets
>> larger than 1GB anyway, so things will go bad if we tried to have
>> blocks bigger than 1GB.
>
> Right! I don't know why I cast them to Size. Thanks for the fix.

Pushed.

David
Re: Autogenerate some wait events code and documentation
On Sun, Jul 16, 2023 at 12:21:20PM -0700, Andres Freund wrote:
> I think the search issue is valid, so I do think going the other way is
> preferable.  I.e. use just the enum value in the .txt and generate the
> camel case name from that.  That allows you to search the define used in
> code and find a hit in the file.
>
> I personally would still leave off the WAIT_EVENT prefix in the .txt, I
> think most of us can remember to chop that off.

So you mean to switch a line that now looks like that:
WAIT_EVENT_FOO_BAR	FooBar	"Waiting on Foo Bar."

To that:
FOO_BAR	"Waiting on Foo Bar."

Or even that:
WAIT_EVENT_FOO_BAR	"Waiting on Foo Bar."

Sure, it is an improvement for any wait events that use WAIT_EVENT_ when
searching them, but this adds more magic into the LWLock and Lock areas
if the same conversion is applied there.  Or am I right to assume that
you'd mean to *not* do any of that for these two classes?  These can be
treated as exceptions in the script when generating the wait event names
from the enum elements, of course.

> I don't think we need to be particularly consistent with wait events
> across major versions.  They're necessarily tied to how the code works,
> and we've yanked that around plenty.

IMO, it depends on the code path involved.  For example, I know of some
code that relies on SyncRep to track backends waiting on a sync reply,
and that's one sensible to keep compatible.  I'd be sad if something like
that breaks suddenly after a major release.
--
Michael
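For illustration, the conversion the generator would have to perform is
the usual SNAKE_CASE-to-CamelCase transform, e.g. "FOO_BAR" -> "FooBar".
The sketch below is hypothetical C (the in-tree generator is a Perl
script); it just shows that deriving the pg_stat_activity name from the
enum-style value is mechanical:

```c
#include <ctype.h>

/* Derive a camel-case display name from an enum-style identifier:
 * underscores are dropped and each following letter is upper-cased,
 * all other letters are lower-cased.  "FOO_BAR" becomes "FooBar". */
static void
camel_from_enum(const char *src, char *dst)
{
	int			upper_next = 1;

	for (; *src; src++)
	{
		if (*src == '_')
		{
			upper_next = 1;
			continue;
		}
		*dst++ = upper_next ? (char) toupper((unsigned char) *src)
			: (char) tolower((unsigned char) *src);
		upper_next = 0;
	}
	*dst = '\0';
}
```

Classes whose display names don't follow this rule (such as the LWLock
and Lock areas mentioned above) would indeed need to be special-cased by
the script rather than run through this transform.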
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Fri, Jul 14, 2023 at 01:32:49PM -0700, David Zhang wrote:
> I believe before users can make a backup using pg_basebackup and then
> start the backup as an independent primary server for whatever reasons.
> Now, if this is still allowed, then users need to be aware that the
> backup_label must be manually deleted, otherwise the backup won't be
> able to start as a primary.

Deleting a backup_label from a fresh base backup can easily lead to data
corruption, as the startup process would pick up the LSN to start
recovery from the control file rather than from the backup_label file.
This would happen if a checkpoint updates the redo LSN in the control
file while a backup happens and the control file is copied after the
checkpoint, for instance.

If one wishes to deploy a new primary from a base backup, recovery.signal
is the way to go, making sure that the new primary is bumped into a new
timeline once recovery finishes, on top of making sure that the startup
process starts recovery from a position where the cluster would be able
to achieve a consistent state.

> The current message below doesn't provide such a hint.
>
> +			if (!ArchiveRecoveryRequested)
> +				ereport(FATAL,
> +						(errmsg("could not find recovery.signal or standby.signal when recovering with backup_label"),
> +						 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
> +								 DataDir, DataDir)));

How would you rewrite that?  I am not sure how many details we want to
put here in terms of differences between recovery.signal and
standby.signal, still we surely should mention these are the two possible
choices.
--
Michael
Re: Improve heapgetpage() performance, overhead from serializable
Hi,

Regards,
Zhang Mingli

On Jul 16, 2023 at 09:57 +0800, Andres Freund wrote:
> Hi,
>
> Several loops which are important for query performance, like
> heapgetpage()'s loop over all tuples, have to call functions like
> HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
> iteration.
>
> When serializable is not in use, all those functions do is return. But
> being situated in a different translation unit, the compiler can't
> inline (without LTO at least) the check whether serializability is
> needed. It's not just the function call overhead that's noticeable,
> it's also that registers have to be spilled to the stack / reloaded
> from memory etc.
>
> On a freshly loaded pgbench scale 100, with turbo mode disabled,
> postgres pinned to one core. Parallel workers disabled to reduce noise.
> All times are the average of 15 executions with pgbench, in a newly
> started, but prewarmed postgres.
>
> SELECT * FROM pgbench_accounts OFFSET 1000;
>
> HEAD:
> 397.977
>
> removing the HeapCheckForSerializableConflictOut() from heapgetpage()
> (incorrect!), to establish the baseline of what serializable costs:
> 336.695
>
> pulling out CheckForSerializableConflictOutNeeded() from
> HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding
> calling HeapCheckForSerializableConflictOut() in the loop:
> 339.742
>
> moving the loop into a static inline function, marked as
> pg_always_inline, called with static arguments for always_visible,
> check_serializable:
> 326.546
>
> marking the always_visible, !check_serializable case likely():
> 322.249
>
> removing TestForOldSnapshot() calls, which we pretty much already
> decided on:
> 312.987
>
> FWIW, there's more we can do; with some hacky changes I got the time
> down to 273.261, but the tradeoffs start to be a bit more complicated.
> And 397->320ms for something as core as this is imo worth considering
> on its own.
>
> Now, this just affects the sequential scan case.
> heap_hot_search_buffer() shares many of the same pathologies. I find it
> a bit harder to improve, because the compiler's code generation seems
> to switch between good / bad with changes that seem unrelated...
>
> I wonder why we haven't used PageIsAllVisible() in
> heap_hot_search_buffer() so far?
>
> Greetings,
>
> Andres Freund

LGTM, and I have a naive question about this part:

    if (likely(all_visible))
    {
        if (likely(!check_serializable))
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 1, 0);
        else
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 1, 1);
    }
    else
    {
        if (likely(!check_serializable))
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 0, 0);
        else
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 0, 1);
    }

Does it make sense to combine the if/else branches and pass the variables
directly as the inline function's parameters? Like:

    scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                           block, lines, all_visible,
                                           check_serializable);
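For context on that question: the reason the call sites spell out the four
calls with literal constants is constant propagation. With an
always-inlined helper and compile-time-constant flag arguments, the
compiler can delete the untaken branches from each specialized copy of the
loop; passing the runtime variables would keep both checks inside the hot
loop on every iteration. A toy illustration of the pattern (invented
flags, not the actual heapam code; PostgreSQL spells the inlining hint
pg_attribute_always_inline):

```c
/* Toy version of "specialize by constant arguments": each call site
 * passes literal 0/1, so after inlining the compiler can drop the dead
 * branches from the loop body of each specialization. */
static inline int
collect(const int *tuple_flags, int ntuples,
		int all_visible, int check_serializable)
{
	int			count = 0;

	for (int i = 0; i < ntuples; i++)
	{
		/* bit 0: tuple visible; bit 1: would raise a conflict */
		if (!all_visible && !(tuple_flags[i] & 1))
			continue;
		if (check_serializable && (tuple_flags[i] & 2))
			continue;
		count++;
	}
	return count;
}

/* Dispatcher mirroring the quoted if/else ladder: the branching happens
 * once per page, not once per tuple. */
static int
collect_dispatch(const int *tuple_flags, int ntuples,
				 int all_visible, int check_serializable)
{
	if (all_visible)
		return check_serializable
			? collect(tuple_flags, ntuples, 1, 1)
			: collect(tuple_flags, ntuples, 1, 0);
	return check_serializable
		? collect(tuple_flags, ntuples, 0, 1)
		: collect(tuple_flags, ntuples, 0, 0);
}
```

Passing the variables directly, as the question suggests, is functionally
equivalent, but the per-tuple branches then survive into the generated
loop unless the compiler happens to clone it on its own.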
Re: logicalrep_message_type throws an error
On Sat, Jul 15, 2023 at 7:16 PM Euler Taveira wrote:
>
> On Sat, Jul 15, 2023, at 4:27 AM, Amit Kapila wrote:
>> Do you have something like attached in mind?
>
> WFM. I would change the comment that says
>
> This function is called to provide context in the error ...
>
> to
>
> This message provides context in the error ...
>
> because this comment is not at the beginning of the function but
> *before* the message.

Sounds reasonable to me. I'll make this modification before pushing unless
there are more comments/suggestions.

--
With Regards,
Amit Kapila.
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Thu, Jun 01, 2023 at 08:50:50AM -0400, Michael Paquier wrote:
> >> Speaking of which, ATExecSetAccessMethodNoStorage() does a catalog
> >> update even if ALTER TABLE is defined to use the same table AM as what
> >> is currently set.  There is no need to update the relation's pg_class
> >> entry in this case.  Be careful that InvokeObjectPostAlterHook() still
> >> needs to be checked in this case.  Perhaps some tests should be added
> >> in test_oat_hooks.sql?  It would be tempting to add a new SQL file for
> >> that.
> >
> > Are you suggesting to put this in a conditional: if
> > oldrelam != newAccessMethod ?
>
> Yes, that's what I would add with a few lines close to the beginning
> of ATExecSetTableSpaceNoStorage() to save from catalog updates if
> these are not needed.

I understand that it's possible for it to be conditional, but I don't
understand why it's desirable/important?  It still seems preferable to be
unconditional.

On Wed, May 31, 2023 at 06:35:34PM -0500, Justin Pryzby wrote:
>> Why is that desirable?  I'd prefer to see it written with fewer
>> conditionals, not more; then, it hits the same path every time.

It's not conditional in the tablespace code that this follows, nor in
others like ATExecSetStatistics(), ATExecForceNoForceRowSecurity(), and
ATExecSetCompression(), some of which are recently added.  If it's
important to do here, isn't it also important to do everywhere else?

--
Justin
Re: doc: clarify the limitation for logical replication when REPLICA IDENTITY is FULL
On Sat, Jul 15, 2023 at 2:10 PM Amit Kapila wrote:
>
> On Fri, Jul 14, 2023 at 2:15 PM Hayato Kuroda (Fujitsu) wrote:
> >
> > > > I think it's appropriate to add on the restrictions page. (But
> > > > mentioning that this restriction is only for subscriber)
> > > >
> > > > If the list were larger, then the restrictions page could be divided
> > > > into publisher and subscriber restrictions. But not for one very
> > > > specific restriction.
> > >
> > > Okay, how about something like: "The UPDATE and DELETE operations
> > > cannot be applied on the subscriber for the published tables that
> > > specify REPLICA IDENTITY FULL when the table has attributes with
> > > datatypes (e.g point or box) that don't have a default operator class
> > > for Btree or Hash. This won't be a problem if the table has a primary
> > > key or replica identity defined for it."?
> >
> > Thanks for discussing and giving suggestions. But it seems that the first
> > sentence is difficult to read for me. How about attached?
>
> The last line seems repetitive to me. So, I have removed it. Apart
> from that patch looks good to me. Sergie, Peter, and others, any
> thoughts?

The v5 patch LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: unrecognized node type while displaying a Path due to dangling pointer
On Wed, 12 Jul 2023 at 14:50, David Rowley wrote:
>
> On Wed, 12 Jul 2023 at 14:23, Tom Lane wrote:
> > I did think about that, but "shallow copy a Path" seems nontrivial
> > because the Path structs are all different sizes. Maybe it is worth
> > building some infrastructure to support that?
>
> It seems a reasonable thing to have to do. It seems the minimum thing
> we could do to ensure each Path is only mentioned in at most 1
> RelOptInfo.

I've attached a draft patch which adds copyObjectFlat() and supports all Node
types aside from the ones mentioned in @extra_tags in gen_node_support.pl.
This did require including all the node header files in copyfuncs.c, which
that file seems to have avoided until now.

I also didn't do anything about ExtensibleNode types. I assume just copying
the ExtensibleNode isn't good enough. To flat copy the actual node I think
would require adding a new function to ExtensibleNodeMethods.

I was also unsure what we should do when shallow copying a List. The problem
there is if we just do a shallow copy, a repalloc() on the elements array
would end up pfreeing memory that might be used by a shallow copied clone.
Perhaps List is not unique in that regard? Maybe the solution there is to add
a special case and list_copy() Lists like what is done in copyObjectImpl().

I'm hoping the attached patch will at least assist in moving the discussion
along.
David

diff --git a/src/backend/nodes/.gitignore b/src/backend/nodes/.gitignore
index 0c14b5697b..91cbd2cf24 100644
--- a/src/backend/nodes/.gitignore
+++ b/src/backend/nodes/.gitignore
@@ -1,4 +1,5 @@
 /node-support-stamp
+/nodesizes.h
 /nodetags.h
 /*funcs.funcs.c
 /*funcs.switch.c
diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile
index 0a95e683d0..46ce62f828 100644
--- a/src/backend/nodes/Makefile
+++ b/src/backend/nodes/Makefile
@@ -99,4 +99,4 @@ queryjumblefuncs.o: queryjumblefuncs.c queryjumblefuncs.funcs.c queryjumblefuncs
 readfuncs.o: readfuncs.c readfuncs.funcs.c readfuncs.switch.c | node-support-stamp

 maintainer-clean: clean
-	rm -f node-support-stamp $(addsuffix funcs.funcs.c,copy equal out queryjumble read) $(addsuffix funcs.switch.c,copy equal out queryjumble read) nodetags.h
+	rm -f node-support-stamp $(addsuffix funcs.funcs.c,copy equal out queryjumble read) $(addsuffix funcs.switch.c,copy equal out queryjumble read) nodesizes.h nodetags.h
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f2568ff5e6..ccd4cbfa01 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -15,9 +15,54 @@

 #include "postgres.h"

+#include "access/amapi.h"
+#include "access/tableam.h"
+#include "access/tsmapi.h"
+#include "commands/event_trigger.h"
+#include "commands/trigger.h"
+#include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "nodes/extensible.h"
+#include "nodes/miscnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
+#include "nodes/plannodes.h"
+#include "nodes/replnodes.h"
+#include "nodes/supportnodes.h"
+#include "nodes/tidbitmap.h"
 #include "utils/datum.h"

+static const Size flat_node_sizes[] = {
+	0,							/* T_Invalid */
+#include "nodes/nodesizes.h"
+};
+
+/*
+ * copyObjectFlat
+ *		Allocate a new copy of the Node type denoted by 'from' and flat copy the
+ *		contents of it into the newly allocated node and return it.
+ */
+void *
+copyObjectFlat(const void *from)
+{
+	Size		size;
+	void	   *retval;
+	NodeTag		tag = nodeTag(from);
+
+	if ((unsigned int) tag >= lengthof(flat_node_sizes))
+	{
+		elog(ERROR, "unrecognized node type: %d", (int) tag);
+		return NULL;
+	}
+
+	/* XXX how to handle ExtensibleNodes? Can we just deep copy? */
+	size = flat_node_sizes[tag];
+	retval = palloc(size);
+	memcpy(retval, from, size);
+
+	return retval;
+}

 /*
  * Macros to simplify copying of different kinds of fields. Use these
diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index 72c7963578..e9a759c5a0 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -2,6 +2,7 @@
 #--
 #
 # Generate node support files:
+# - nodesizes.h
 # - nodetags.h
 # - copyfuncs
 # - equalfuncs
@@ -599,6 +600,28 @@ my $header_comment =
  */
 ';

+# nodesizes.h
+
+push @output_files, 'nodesizes.h';
+open my $ns, '>', "$output_path/nodesizes.h$tmpext"
+  or die "$output_path/nodesizes.h$tmpext: $!";
+
+printf $ns $header_comment, 'nodesizes.h';
+
+foreach my $n (@node_types)
+{
+	next if elem $n, @abstract_types;
+	if (defined $manual_nodetag_number{$n})
+	{
+		print $ns "\tsizeof(T_${n}),\n";
+	}
+	else
+	{
+		print $ns "\tsizeof(${n}),\n";
+	}
+}
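For readers following along, the mechanism the patch relies on, a tag-indexed table of struct sizes driving a palloc-and-memcpy flat copy, can be sketched standalone. The node types, size table, and function below are simplified stand-ins for illustration only, not the actual PostgreSQL definitions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified node hierarchy: every concrete node starts with a tag. */
typedef enum { T_Invalid, T_Foo, T_Bar } NodeTag;
typedef struct { NodeTag tag; } Node;
typedef struct { NodeTag tag; int a; } Foo;
typedef struct { NodeTag tag; double x, y; } Bar;

/* Stand-in for the generated nodesizes.h table, indexed by tag. */
static const size_t flat_node_sizes[] = {
    0,              /* T_Invalid */
    sizeof(Foo),
    sizeof(Bar),
};

/*
 * Flat copy: look up the concrete struct size by tag, then memcpy.
 * Note that any pointer fields in the copy still alias the original's
 * children -- only the top-level struct is duplicated.
 */
static void *
copy_object_flat(const void *from)
{
    NodeTag     tag = ((const Node *) from)->tag;
    size_t      size = flat_node_sizes[tag];
    void       *retval = malloc(size);

    memcpy(retval, from, size);
    return retval;
}
```

Because only the top-level struct is duplicated, sub-nodes remain shared between original and copy, which is precisely why the List and ExtensibleNode questions come up in the thread.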
Re: unrecognized node type while displaying a Path due to dangling pointer
David Rowley writes:
> On Wed, 12 Jul 2023 at 14:50, David Rowley wrote:
>> On Wed, 12 Jul 2023 at 14:23, Tom Lane wrote:
>>> I did think about that, but "shallow copy a Path" seems nontrivial
>>> because the Path structs are all different sizes. Maybe it is worth
>>> building some infrastructure to support that?

>> It seems a reasonable thing to have to do. It seems the minimum thing
>> we could do to ensure each Path is only mentioned in at most 1
>> RelOptInfo.

> ...
> I also didn't do anything about ExtensibleNode types. I assume just
> copying the ExtensibleNode isn't good enough. To flat copy the actual
> node I think would require adding a new function to
> ExtensibleNodeMethods.

Yeah, the problem I've got with this approach is that flat-copying FDW and
Custom paths would require extending the respective APIs. While that's a
perfectly reasonable ask if we only need to do this in HEAD, it would be a
nonstarter for released branches. Is it okay to only fix this issue in HEAD?

> I was also unsure what we should do when shallow copying a List.

The proposal is to shallow-copy a Path node. List is not a kind of Path, so
how does List get into it? (Lists below Paths would not get copied, by
definition.)

			regards, tom lane
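The repalloc() hazard David raises for shallow-copied Lists is the classic aliased-buffer problem: two headers sharing one elements array, where growing one may move or free the storage the other still points at. The sketch below is a simplified stand-in using malloc/realloc, not PostgreSQL's actual List or memory-context code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a List: a header plus a separately allocated array. */
typedef struct
{
    int     nelems;
    int     capacity;
    int    *elements;       /* separately allocated, like a List's array */
} FakeList;

/*
 * A flat (shallow) copy duplicates only the header, so both headers
 * now alias the same elements array.
 */
static FakeList *
flat_copy(const FakeList *src)
{
    FakeList *dst = malloc(sizeof(FakeList));

    memcpy(dst, src, sizeof(FakeList));
    return dst;
}

/*
 * Growing the original may move (and free) the old array, leaving the
 * clone's elements pointer dangling -- the repalloc() hazard described
 * in the thread.
 */
static void
grow(FakeList *list, int newcap)
{
    list->elements = realloc(list->elements, newcap * sizeof(int));
    list->capacity = newcap;
}
```

After `flat_copy`, both headers point at the same array; once `grow` is called on the original, dereferencing the clone's `elements` is undefined behavior, which is why a special case like `list_copy()` is suggested.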
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Jul 16, 2023 at 09:45:54AM -0400, Tom Lane wrote:
> Actually, as long as we're talking about approximately-correct behavior:
> let's make the ready_list be a priority heap, and then just make
> pop_next_work_item scan forward from the array start until it finds an
> item that's runnable per the lock heuristic. If the heap root is
> blocked, the next things we'll examine will be its two children.
> We might pick the lower-priority of those two, but it's still known to
> be higher priority than at least 50% of the remaining heap entries, so
> it shouldn't be too awful as a choice. The argument gets weaker the
> further you go into the heap, but we're not expecting that having most
> of the top entries blocked will be a common case. (Besides which, the
> priorities are pretty crude to begin with.) Once selected, pulling out
> an entry that is not the heap root is no problem: you just start the
> sift-down process from there.
>
> The main advantage of this over the only-sort-sometimes idea is that
> we can guarantee that the largest ready item will always be dispatched
> as soon as it can be (because it will be the heap root). So cases
> involving one big table (with big indexes) and a lot of little ones
> should get scheduled sanely, which is the main thing we want this
> algorithm to ensure. With the other approach we can't really promise
> much at all.

This seems worth a try. IIUC you are suggesting making binaryheap.c
frontend-friendly and expanding its API a bit. If no one has volunteered, I
could probably hack something together.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
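Tom's scheme, keeping the ready_list as a max-heap, scanning forward from the root for the first runnable entry, and removing it by sifting down from its slot, can be sketched in miniature. The helper names and the fixed-size integer heap below are hypothetical, not the actual binaryheap.c or pg_restore code:

```c
#include <assert.h>
#include <stdbool.h>

#define MAXN 16

/* Max-heap of integer "priorities" (think: table sizes). */
static int heap[MAXN];
static int nheap;

static void
sift_down(int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, largest = i;

        if (l < nheap && heap[l] > heap[largest]) largest = l;
        if (r < nheap && heap[r] > heap[largest]) largest = r;
        if (largest == i)
            break;
        int tmp = heap[i]; heap[i] = heap[largest]; heap[largest] = tmp;
        i = largest;
    }
}

static void
heap_push(int v)
{
    int i = nheap++;

    heap[i] = v;
    while (i > 0 && heap[(i - 1) / 2] < heap[i])
    {
        int p = (i - 1) / 2;
        int tmp = heap[i]; heap[i] = heap[p]; heap[p] = tmp;
        i = p;
    }
}

/* Remove the entry at index i: move the last entry there, sift down. */
static int
heap_remove_at(int i)
{
    int v = heap[i];

    heap[i] = heap[--nheap];
    sift_down(i);
    return v;
}

/* Scan forward from the root for the first entry the heuristic allows. */
static int
pop_next_runnable(bool (*runnable) (int))
{
    for (int i = 0; i < nheap; i++)
        if (runnable(heap[i]))
            return heap_remove_at(i);
    return -1;                  /* nothing runnable right now */
}
```

When the root is blocked, the scan may settle for the lower-priority of its two children, exactly the trade-off Tom describes, but any entry it picks is still known to dominate its own subtree, and removal stays O(log n) via the sift-down from the vacated slot.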
Re: remaining sql/json patches
hi.

It seems there is no explanation of json_api_common_syntax in
functions-json.html.

I can get the full json_query synopsis from functions-json.html as follows:

json_query ( context_item, path_expression
    [ PASSING { value AS varname } [, ...]]
    [ RETURNING data_type [ FORMAT JSON [ ENCODING UTF8 ] ] ]
    [ { WITHOUT | WITH { CONDITIONAL | [UNCONDITIONAL] } } [ ARRAY ] WRAPPER ]
    [ { KEEP | OMIT } QUOTES [ ON SCALAR STRING ] ]
    [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON EMPTY ]
    [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON ERROR ])

But it seems there is no full synopsis for json_table, only a partial,
one-by-one explanation.
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Thu, Jun 8, 2023 at 9:24 AM Hayato Kuroda (Fujitsu) wrote:
>

Few comments/questions:

1.
+check_for_parameter_settings(ClusterInfo *new_cluster)
{
...
+	res = executeQueryOrDie(conn, "SHOW max_replication_slots;");
+	max_replication_slots = atoi(PQgetvalue(res, 0, 0));
+
+	if (max_replication_slots == 0)
+		pg_fatal("max_replication_slots must be greater than 0");
...
}

Won't it be better to verify that the value of "max_replication_slots" on the
new cluster is greater than the number of logical slots we are planning to
copy from the old one? Similar to this, I thought about whether we need to
check the value of max_wal_senders. But, I guess one can simply decode from
slots by using APIs, so I am not sure about that. What do you think?

2.
+	/*
+	 * Dump logical replication slots if needed.
+	 *
+	 * XXX We cannot dump replication slots at the same time as the schema
+	 * dump because we need to separate the timing of restoring
+	 * replication slots and other objects. Replication slots, in
+	 * particular, should not be restored before executing the pg_resetwal
+	 * command because it will remove WALs that are required by the slots.
+	 */
+	if (user_opts.include_logical_slots)

Can you explain this point a bit more with some example scenarios? Basically,
if we had sent all the WAL before the upgrade, then why do we need to worry
about the timing of pg_resetwal?

3. I see that you are trying to ensure that all the WAL has been consumed for
a slot except for shutdown_checkpoint in patch 0003, but do we need to think
of any interaction with restart_lsn (MyReplicationSlot->data.restart_lsn),
which is the start point to read WAL for decoding by walsender?

--
With Regards,
Amit Kapila.
Re: doc: clarify the limitation for logical replication when REPLICA IDENTITY is FULL
Hi,

> > The last line seems repetitive to me. So, I have removed it. Apart
> > from that patch looks good to me. Sergie, Peter, and others, any
> > thoughts?
>
> The v5 patch LGTM.

Overall looks good to me as well. Please consider the following as an
optional improvement.

My only minor concern here is the use of the term "default operator class".
It is accurate to use it. However, as far as I know, not many users can
follow that easily. I think the "pkey/repl full" suggestion gives some hint,
but I wonder if we could add something like the following to the text so that
users can understand more:

    do not have a default operator class for B-tree or Hash.
+   If there is no default operator class, usually the type does not have an
+   equality operator.

However, this limitation ..

Thanks,
Onder