Getting rid of OverrideSearchPath in namespace.c

2023-07-16 Thread Alexander Lakhin

Hello hackers,

As a follow-up to the CVE-2023-2454 fix, I think it makes sense to
completely remove the unsafe functions
PushOverrideSearchPath()/PopOverrideSearchPath(), which are no longer used
in core.
Please look at the patch attached.
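
For reference, the replacement pattern used in core after the CVE fix looks
roughly like this (a minimal sketch built on the standard GUC save/restore
APIs; the search_path value shown is only illustrative):

    int         save_nestlevel;

    /*
     * Switch to a safe search_path for the duration of the operation.
     * GUC_ACTION_SAVE arranges for the previous value to be restored when
     * the nest level is popped, so no explicit "pop" call can be missed.
     */
    save_nestlevel = NewGUCNestLevel();
    (void) set_config_option("search_path", "pg_catalog, pg_temp",
                             PGC_USERSET, PGC_S_SESSION,
                             GUC_ACTION_SAVE, true, 0, false);

    /* ... do the work that must not trust the caller's search_path ... */

    /* restore the caller's search_path (and any other saved GUCs) */
    AtEOXact_GUC(false, save_nestlevel);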

Besides that, maybe it's worth renaming the three functions with "Override" in
their names: GetOverrideSearchPath(), CopyOverrideSearchPath(),
OverrideSearchPathMatchesCurrent(), and then maybe struct OverrideSearchPath.
Noah Misch proposed the name GetSearchPathMatcher() for the first of these.

What do you think?

Best regards,
Alexander

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 14e57adee2..93610f948e 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -67,9 +67,7 @@
  * may be included:
  *
  * 1. If a TEMP table namespace has been initialized in this session, it
- * is implicitly searched first.  (The only time this doesn't happen is
- * when we are obeying an override search path spec that says not to use the
- * temp namespace, or the temp namespace is included in the explicit list.)
+ * is implicitly searched first.
  *
  * 2. The system catalog namespace is always searched.  If the system
  * namespace is present in the explicit path then it will be searched in
@@ -108,19 +106,14 @@
  * namespace (if it exists), preceded by the user's personal namespace
  * (if one exists).
  *
- * We support a stack of "override" search path settings for use within
- * specific sections of backend code.  namespace_search_path is ignored
- * whenever the override stack is nonempty.  activeSearchPath is always
- * the actually active path; it points either to the search list of the
- * topmost stack entry, or to baseSearchPath which is the list derived
- * from namespace_search_path.
+ * activeSearchPath is always the actually active path; it points to
+ * baseSearchPath which is the list derived from namespace_search_path.
  *
  * If baseSearchPathValid is false, then baseSearchPath (and other
  * derived variables) need to be recomputed from namespace_search_path.
  * We mark it invalid upon an assignment to namespace_search_path or receipt
  * of a syscache invalidation event for pg_namespace.  The recomputation
- * is done during the next non-overridden lookup attempt.  Note that an
- * override spec is never subject to recomputation.
+ * is done during the next lookup attempt.
  *
  * Any namespaces mentioned in namespace_search_path that are not readable
  * by the current user ID are simply left out of baseSearchPath; so
@@ -161,17 +154,6 @@ static Oid	namespaceUser = InvalidOid;
 /* The above four values are valid only if baseSearchPathValid */
 static bool baseSearchPathValid = true;
 
-/* Override requests are remembered in a stack of OverrideStackEntry structs */
-
-typedef struct
-{
-	List	   *searchPath;		/* the desired search path */
-	Oid			creationNamespace;	/* the desired creation namespace */
-	int			nestLevel;		/* subtransaction nesting level */
-} OverrideStackEntry;
-
-static List *overrideStack = NIL;
-
 /*
  * myTempNamespace is InvalidOid until and unless a TEMP namespace is set up
  * in a particular backend session (this happens when a CREATE TEMP TABLE
@@ -3392,8 +3374,7 @@ SetTempNamespaceState(Oid tempNamespaceId, Oid tempToastNamespaceId)
 
 
 /*
- * GetOverrideSearchPath - fetch current search path definition in form
- * used by PushOverrideSearchPath.
+ * GetOverrideSearchPath - fetch current search path definition.
  *
  * The result structure is allocated in the specified memory context
  * (which might or might not be equal to CurrentMemoryContext); but any
@@ -3512,189 +3493,6 @@ OverrideSearchPathMatchesCurrent(OverrideSearchPath *path)
 	return true;
 }
 
-/*
- * PushOverrideSearchPath - temporarily override the search path
- *
- * Do not use this function; almost any usage introduces a security
- * vulnerability.  It exists for the benefit of legacy code running in
- * non-security-sensitive environments.
- *
- * We allow nested overrides, hence the push/pop terminology.  The GUC
- * search_path variable is ignored while an override is active.
- *
- * It's possible that newpath->useTemp is set but there is no longer any
- * active temp namespace, if the path was saved during a transaction that
- * created a temp namespace and was later rolled back.  In that case we just
- * ignore useTemp.  A plausible alternative would be to create a new temp
- * namespace, but for existing callers that's not necessary because an empty
- * temp namespace wouldn't affect their results anyway.
- *
- * It's also worth noting that other schemas listed in newpath might not
- * exist anymore either.  We don't worry about this because OIDs that match
- * no existing namespace will simply not produce any hits during searches.
- */
-void
-PushOverrideSearchPath(OverrideSearchPath *newpath)
-{
-	OverrideStackEntry *entry;
-	List	   *oidlist;
-	Oid			firstNS;
-	MemoryContext oldcxt;
-
-	/*
-	 * Copy th

Re: Should we remove db_user_namespace?

2023-07-16 Thread Magnus Hagander
On Sat, Jul 15, 2023 at 1:34 AM Nathan Bossart  wrote:
>
> On Mon, Jul 10, 2023 at 03:43:07PM +0900, Michael Paquier wrote:
> > On Wed, Jul 05, 2023 at 08:49:26PM -0700, Nathan Bossart wrote:
> >> On Thu, Jul 06, 2023 at 08:21:18AM +0900, Michael Paquier wrote:
> >>> Removing the GUC from this table is kind of annoying.  Cannot this be
> >>> handled like default_with_oids or ssl_renegotiation_limit to avoid any
> >>> kind of issues with the reload of dump files and the kind?
> >>
> >> Ah, good catch.
> >
> > Thanks.  Reading through the patch, this version should be able to
> > handle the dump reloads.
>
> Hm.  Do we actually need to worry about this?  It's a PGC_SIGHUP GUC, so it
> can only be set at postmaster start or via a configuration file.  Any dump
> files that are trying to set it or clients that are trying to add it to
> startup packets are already broken.  I guess keeping the GUC around would
> avoid breaking any configuration files or startup scripts that happen to be
> setting it to false, but I don't know if that's really worth worrying
> about.

I'd lean towards "no". A hard break, when it's a major release, is
better than an "it stopped having effect but didn't tell you anything"
break, especially when it comes to things like startup scripts etc.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Inefficiency in parallel pg_restore with many tables

2023-07-16 Thread Andrew Dunstan


On 2023-07-15 Sa 13:47, Tom Lane wrote:

I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.
I was able to reproduce that by dumping and parallel restoring 100K
tables made according to this script:

do $$
begin
for i in 1..100000 loop
   execute format('create table t%s (f1 int unique, f2 int unique);', i);
   execute format('insert into t%s select x, x from generate_series(1,1000) x',
  i);
   if i % 100 = 0 then commit; end if;
end loop;
end
$$;

Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
to keep the children busy.  A quick perf check showed most of the time
being eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because
we make those have dependencies on the corresponding TableData so we
don't build an index until after its table is populated.  Each time
one of the TableData items is completed by some worker, the two
AddConstraint items for its table are moved from the pending_list
to the ready_list --- and that means ready_list_insert marks the
ready_list as no longer sorted.  When we go to pop the next task
from the ready_list, we re-sort that entire list first.  So
we spend something like O(N^2 * log(N)) time just sorting, if
there are N tables.  Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).

I'm not sure how big a deal this is in practice: in most situations
the individual jobs are larger than they are in this toy example,
plus the initial non-parallelizable part of the restore is a bigger
bottleneck anyway with this many tables.  Still, we do have one
real-world complaint, so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job.  Another idea could be to be a little less eager
to sort the list every time; I think in practice scheduling wouldn't
get much worse if we only re-sorted every so often.



Yeah, I think that last idea is reasonable. Something like if the number 
added since the last sort is more than min(50, list_length/4) then sort. 
That shouldn't be too invasive.
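
A hedged sketch of that heuristic (struct and field names are illustrative,
not the actual pg_backup_archiver.c code):

/* Re-sort the ready_list only once enough unsorted items have accumulated. */
static void
ready_list_maybe_sort(ParallelReadyList *list)
{
    int     threshold = Min(50, list->num_items / 4);

    if (list->num_added_since_sort > threshold)
    {
        qsort(list->items, list->num_items, sizeof(TocEntry *),
              TocEntrySizeCompare);
        list->num_added_since_sort = 0;
    }
}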



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Inefficiency in parallel pg_restore with many tables

2023-07-16 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-07-15 Sa 13:47, Tom Lane wrote:
>> I wonder if we could replace the sorted ready-list with a priority heap,
>> although that might be complicated by the fact that pop_next_work_item
>> has to be capable of popping something that's not necessarily the
>> largest remaining job.  Another idea could be to be a little less eager
>> to sort the list every time; I think in practice scheduling wouldn't
>> get much worse if we only re-sorted every so often.

> Yeah, I think that last idea is reasonable. Something like if the number 
> added since the last sort is more than min(50, list_length/4) then sort. 
> That shouldn't be too invasive.

Actually, as long as we're talking about approximately-correct behavior:
let's make the ready_list be a priority heap, and then just make
pop_next_work_item scan forward from the array start until it finds an
item that's runnable per the lock heuristic.  If the heap root is
blocked, the next things we'll examine will be its two children.
We might pick the lower-priority of those two, but it's still known to
be higher priority than at least 50% of the remaining heap entries, so
it shouldn't be too awful as a choice.  The argument gets weaker the
further you go into the heap, but we're not expecting that having most
of the top entries blocked will be a common case.  (Besides which, the
priorities are pretty crude to begin with.)  Once selected, pulling out
an entry that is not the heap root is no problem: you just start the
sift-down process from there.
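
A hedged sketch of that scheme (is_runnable() stands in for the
lock-conflict heuristic and sift_down() for the usual max-heap primitive;
neither name is from the actual code):

static TocEntry *
pop_next_work_item(TocEntry **heap, int *nmembers)
{
    for (int i = 0; i < *nmembers; i++)
    {
        if (is_runnable(heap[i]))
        {
            TocEntry   *te = heap[i];

            /* fill the hole with the last member and restore heap order */
            heap[i] = heap[--(*nmembers)];
            sift_down(heap, *nmembers, i);
            return te;
        }
    }
    return NULL;                /* nothing runnable right now */
}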

The main advantage of this over the only-sort-sometimes idea is that
we can guarantee that the largest ready item will always be dispatched
as soon as it can be (because it will be the heap root).  So cases
involving one big table (with big indexes) and a lot of little ones
should get scheduled sanely, which is the main thing we want this
algorithm to ensure.  With the other approach we can't really promise
much at all.

regards, tom lane




Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

2023-07-16 Thread Alena Rybakina

Hi,

I'm still working on it, but unfortunately I haven't had enough time to
bring it into a state that could be shown.
Now I am trying to sort out the problems that I drew attention to in the
previous letter.


--
Regards,
Alena Rybakina
Postgres Professional





Re: Autogenerate some wait events code and documentation

2023-07-16 Thread Andres Freund
Hi,

On 2023-07-14 13:49:22 +0900, Michael Paquier wrote:
> I have looked again at 0001 and 0002 and applied them to get them out
> of the way.  0003 and 0004 are rebased and attached.  I'll add them to
> the CF for later consideration.  More opinions are welcome.

> From b6390183bdcc054df82279bb0b2991730f85a0a3 Mon Sep 17 00:00:00 2001
> From: Michael Paquier 
> Date: Thu, 13 Jul 2023 10:14:47 +0900
> Subject: [PATCH v3 4/4] Remove column for enum elements in
>  wait_event_names.txt
> 
> This file is now made of two columns, removing the column listing the
> enum elements for each wait event class:
> - Camelcase event name used in pg_stat_activity.  These are now
> unquoted.
> - Description used in the documentation.

> The enum elements are generated from what is now the first column.

I think the search issue is valid, so I do think going the other way is
preferable, i.e. use just the enum value in the .txt and generate the camel
case name from that. That allows you to search for the define used in code and
find a hit in the file.
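
For illustration, the proposed mapping — deriving "FooBar" from "FOO_BAR" —
could look like this hedged C sketch (the real conversion would live in the
generator script):

#include <ctype.h>
#include <stdbool.h>

/* Derive the pg_stat_activity name ("FooBar") from an enum element
 * spelling ("FOO_BAR"); 'dst' must be large enough for the result. */
static void
camelcase_from_enum(const char *enum_name, char *dst)
{
    bool    upper_next = true;

    for (; *enum_name != '\0'; enum_name++)
    {
        if (*enum_name == '_')
        {
            upper_next = true;
            continue;
        }
        *dst++ = upper_next ? *enum_name
                            : (char) tolower((unsigned char) *enum_name);
        upper_next = false;
    }
    *dst = '\0';
}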

I personally would still leave off the WAIT_EVENT prefix in the .txt, I think
most of us can remember to chop that off.

I don't think we need to be particularly consistent with wait events across
major versions. They're necessarily tied to how the code works, and we've
yanked that around plenty.

Greetings,

Andres Freund




Re: Use of additional index columns in rows filtering

2023-07-16 Thread Tomas Vondra
On 7/15/23 16:20, Tomas Vondra wrote:
>
> ...
> 
> 4) problems with opcintype != opckeytype (name_ops)
> 
> While running the tests, I ran into an issue with name_ops, causing
> failures for \dT and other catalog queries. The root cause is that
> name_ops has opcintype = name, but opckeytype = cstring. The index-only
> clauses are copied from the table, with Vars mutated to reference the
> INDEX_VAR. But the type is not, so when we get to evaluating the
> expressions, CheckVarSlotCompatibility() fails because the Var has name,
> but the iss_IndexSlot (created with index tuple descriptor) has cstring.
> 
> The rebased patch fixes this by explicitly adjusting types of the
> descriptor in ExecInitIndexScan().
> 
> However, maybe this indicates the very idea of evaluating expressions
> using slot with index tuple descriptor is misguided. This made me look
> at regular index-only scan (nodeIndexonlyscan.c), and that uses a slot
> with the "table" structure, and instead of evaluating the expression on
> the index tuple it expands the index tuple into the table slot.
> Which is what StoreIndexTuple() does.
> 
> So maybe this should do what IOS does - expand the index tuple into
> "table slot" and evaluate the expression on that. That'd also make the
> INDEX_VAR tweak in createplan.c unnecessary - in fact, that seemed a bit
> strange anyway, so ditching fix_indexfilter_mutator would be good.
> 

This kept bothering me, so I looked at it today, and reworked it to use
the IOS approach. It's a bit more complicated because for IOS both slots
have the same overall structure, except for the data types. But for
regular index scans that's not the case - the code has to "expand" the
index tuple into the larger "table slot". This works, and in general I
think the result is much cleaner - in particular, it means we don't need
to switch the Var nodes to reference the INDEX_VAR.
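
Roughly, the expansion works like the following hedged sketch, modeled on
StoreIndexTuple(); the helper name and the index-to-table attribute mapping
are illustrative, not the actual patch code:

/*
 * Sketch: expand an index tuple into the (larger) table slot, leaving
 * table columns the index does not cover as NULL.  table_attnos[i] is
 * assumed to give the table column fed by index attribute i+1.
 */
static void
expand_index_tuple(IndexTuple itup, TupleDesc itupdesc,
                   TupleTableSlot *slot, const AttrNumber *table_attnos)
{
    ExecClearTuple(slot);

    /* start with every table column NULL */
    memset(slot->tts_isnull, true,
           slot->tts_tupleDescriptor->natts * sizeof(bool));

    for (int i = 0; i < itupdesc->natts; i++)
    {
        bool        isnull;
        Datum       val = index_getattr(itup, i + 1, itupdesc, &isnull);

        slot->tts_values[table_attnos[i] - 1] = val;
        slot->tts_isnull[table_attnos[i] - 1] = isnull;
    }

    ExecStoreVirtualTuple(slot);
}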

While experimenting with this I realized again that we're not matching
expressions to IOS. So if you have an expression index on (a+b), that
can't be used even if the query only uses this particular expression.
The same limitation applies to index-only filters, of course. It's not
the fault of this patch, but perhaps it'd be an interesting improvement.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 685e4014533d9d5f29564a3a21b972abba70ee9f Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 9 Apr 2023 02:08:45 +0200
Subject: [PATCH] evaluate filters on the index tuple (when possible)

Discussion: https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU%3D%40yamlcoder.me
---
 src/backend/commands/explain.c                |   2 +
 src/backend/executor/nodeIndexscan.c          | 258 +-
 src/backend/optimizer/path/costsize.c         |  23 ++
 src/backend/optimizer/path/indxpath.c         | 190 +++--
 src/backend/optimizer/plan/createplan.c       |  80 ++
 src/backend/optimizer/plan/planner.c          |   2 +-
 src/backend/optimizer/plan/setrefs.c          |   6 +
 src/backend/optimizer/util/pathnode.c         |   2 +
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/execnodes.h                 |   7 +
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/nodes/plannodes.h                 |   2 +
 src/include/optimizer/cost.h                  |   1 +
 src/include/optimizer/pathnode.h              |   1 +
 src/test/regress/expected/create_index.out    |  19 +-
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/expected/updatable_views.out |  12 +-
 18 files changed, 580 insertions(+), 40 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f621..556388bc499 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1778,6 +1778,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_IndexScan:
 			show_scan_qual(((IndexScan *) plan)->indexqualorig,
 		   "Index Cond", planstate, ancestors, es);
+			show_scan_qual(((IndexScan *) plan)->indexfiltersorig,
+		   "Index Filter", planstate, ancestors, es);
 			if (((IndexScan *) plan)->indexqualorig)
 show_instrumentation_count("Rows Removed by Index Recheck", 2,
 		   planstate, es);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..5ee1fd86d7f 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -32,6 +32,8 @@
 #include "access/nbtree.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "access/visibilitymap.h"
+#include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "executor/execdebug.h"
 #include "executor/nodeIndexscan.h"
@@ -69,6 +71,10 @@ static void reorderqueue_push(Inde

Re: Permute underscore separated components of columns before fuzzy matching

2023-07-16 Thread Arne Roland
Hello Mikhail,

I'm sorry. Please try attached patch instead.

Thank you for having a look!

Regards
Arne


From: Mikhail Gribkov 
Sent: Thursday, July 6, 2023 13:31
To: Arne Roland 
Cc: Pg Hackers 
Subject: Re: Permute underscore separated components of columns before fuzzy 
matching

Hello Arne,

The goal of supporting words-switching hints sounds interesting and I've tried 
to apply your patch.
The patch was applied smoothly to the latest master and check-world reported no 
problems. However, I ran into issues when trying to test the new functionality.

I tried to simply mix words in pg_stat_activity.wait_event_type:

postgres=# select wait_type_event from pg_stat_activity ;
2023-07-06 14:12:35.968 MSK [1480] WARNING:  detected write past chunk end in MessageContext 0x559d668aaf30
WARNING:  detected write past chunk end in MessageContext 0x559d668aaf30
[the same WARNING repeated several dozen times]
2023-07-06 14:12:35.968 MSK [1480] ERROR:  column "wait_type_event" does not exist at character 8
2023-07-06 14:12:35.968 MSK [1480] HINT:  Perhaps you meant to reference the column "pg_stat_activity.wait_event_type".
2023-07-06 14:12:35.968 MSK [1480] STATEMENT:  select wait_type_event from pg_stat_activity ;

Re: Changing types of block and chunk sizes in memory contexts

2023-07-16 Thread David Rowley
On Fri, 14 Jul 2023 at 18:53, Melih Mutlu  wrote:
> On Thu, 13 Jul 2023 at 08:04, David Rowley wrote:
>>
>> I looked at your v2 patch. The only thing that really looked wrong
>> were the (Size) casts in the context creation functions.  These should
>> have been casts to uint32 rather than Size. Basically, all the casts
>> do is say to the compiler "Yes, I know this could cause truncation due
>> to assigning to a size smaller than the source type's size". Some
>> compilers will likely warn without that and the cast will stop them.
>> We know there can't be any truncation due to the Asserts. There's also
>> the fundamental limitation that MemoryChunk can't store block offsets
>> larger than 1GBs anyway, so things will go bad if we tried to have
>> blocks bigger than 1GB.
>
>
> Right! I don't know why I cast them to Size. Thanks for the fix.
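
For illustration, the resulting pattern looks roughly like this (a hedged
sketch; the field and variable names are illustrative, not the exact
committed code):

    /*
     * The value arrives as Size, but the context field is uint32.  The
     * Assert documents that truncation cannot occur, and the explicit cast
     * marks the narrowing as intentional, silencing compiler warnings.
     */
    Assert(initBlockSize <= PG_UINT32_MAX);
    set->initBlockSize = (uint32) initBlockSize;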

Pushed.

David




Re: Autogenerate some wait events code and documentation

2023-07-16 Thread Michael Paquier
On Sun, Jul 16, 2023 at 12:21:20PM -0700, Andres Freund wrote:
> I think the search issue is valid, so I do think going the other way is
> preferrable. I.e. use just the enum value in the .txt and generate the camel
> case name from that. That allows you to search the define used in code and
> find a hit in the file.
> 
> I personally would still leave off the WAIT_EVENT prefix in the .txt, I think
> most of us can remember to chop that off.

So you mean to switch a line that now looks like that:
WAIT_EVENT_FOO_BAR   FooBar   "Waiting on Foo Bar."
To that:
FOO_BAR   "Waiting on Foo Bar."
Or even that:
WAIT_EVENT_FOO_BAR   "Waiting on Foo Bar."

Sure, it is an improvement for any wait events that use WAIT_EVENT_
when searching them, but this adds more magic into the LWLock and Lock
areas if the same conversion is applied there.  Or am I right to
assume that you'd mean to *not* do any of that for these two classes?
These can be treated as exceptions in the script when generating the
wait event names from the enum elements, of course.

> I don't think we need to be particularly consistent with wait events across
> major versions. They're necessarily tied to how the code works, and we've
> yanked that around plenty.

IMO, it depends on the code path involved.  For example, I know of
some code that relies on SyncRep to track backends waiting on a sync
reply, and that's one that is sensible to keep compatible.  I'd be sad if
something like that breaks suddenly after a major release.
--
Michael




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-07-16 Thread Michael Paquier
On Fri, Jul 14, 2023 at 01:32:49PM -0700, David Zhang wrote:
> I believe before users can make a backup using pg_basebackup and then start
> the backup as an independent Primary server for whatever reasons. Now, if
> this is still allowed, then users need to be aware that the backup_label
> must be manually deleted, otherwise, the backup won't be able to start as a
> Primary.

Deleting a backup_label from a fresh base backup can easily lead to data
corruption, as the startup process would take the LSN to start recovery
from out of the control file rather than the backup_label file.
This would happen if a checkpoint updates the redo LSN in the control
file while a backup happens and the control file is copied after the
checkpoint, for instance.  If one wishes to deploy a new primary from
a base backup, recovery.signal is the way to go, making sure that the
new primary is bumped into a new timeline once recovery finishes, on
top of making sure that the startup process starts recovery from a
position where the cluster would be able to achieve a consistent
state.

> The current message below doesn't provide such a hint.
> 
+ if (!ArchiveRecoveryRequested)
+     ereport(FATAL,
+             (errmsg("could not find recovery.signal or standby.signal when recovering with backup_label"),
+              errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+                      DataDir, DataDir)));

How would you rewrite that?  I am not sure how many details we want to
put here in terms of differences between recovery.signal and
standby.signal, still we surely should mention these are the two
possible choices.
--
Michael




Re: Improve heapgetpage() performance, overhead from serializable

2023-07-16 Thread Zhang Mingli
Hi,

Regards,
Zhang Mingli
On Jul 16, 2023 at 09:57 +0800, Andres Freund , wrote:
> Hi,
>
> Several loops which are important for query performance, like heapgetpage()'s
> loop over all tuples, have to call functions like
> HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
> iteration.
>
> When serializable is not in use, all those functions do is return. But
> being situated in a different translation unit, the compiler can't inline
> (without LTO at least) the check whether serializability is needed. It's not
> just the function call overhead that's noticeable, it's also that registers
> have to be spilled to the stack / reloaded from memory etc.
>
> On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
> pinned to one core. Parallel workers disabled to reduce noise. All times are
> the average of 15 executions with pgbench, in a newly started, but prewarmed
> postgres.
>
> SELECT * FROM pgbench_accounts OFFSET 1000;
> HEAD:
> 397.977
>
> removing the HeapCheckForSerializableConflictOut() from heapgetpage()
> (incorrect!), to establish the baseline of what serializable costs:
> 336.695
>
> pulling out CheckForSerializableConflictOutNeeded() from
> HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
> HeapCheckForSerializableConflictOut() in the loop:
> 339.742
>
> moving the loop into a static inline function, marked as pg_always_inline,
> called with static arguments for always_visible, check_serializable:
> 326.546
>
> marking the always_visible, !check_serializable case likely():
> 322.249
>
> removing TestForOldSnapshot() calls, which we pretty much already decided on:
> 312.987
>
>
> FWIW, there's more we can do, with some hacky changes I got the time down to
> 273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
> for something as core as this, is imo worth considering on its own.
>
>
>
>
> Now, this just affects the sequential scan case. heap_hot_search_buffer()
> shares many of the same pathologies. I find it a bit harder to improve,
> because the compiler's code generation seems to switch between good / bad with
> changes that seem unrelated...
>
>
> I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
> far?
>
>
> Greetings,
>
> Andres Freund

LGTM, and I have a silly question:

if (likely(all_visible))
{
    if (likely(!check_serializable))
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 1, 0);
    else
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 1, 1);
}
else
{
    if (likely(!check_serializable))
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 0, 0);
    else
        scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                               block, lines, 0, 1);
}


Does it make sense to combine the if/else conditions and pass them directly
as the inline function's parameters?  Like:

scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                       block, lines, all_visible, check_serializable);



Re: logicalrep_message_type throws an error

2023-07-16 Thread Amit Kapila
On Sat, Jul 15, 2023 at 7:16 PM Euler Taveira  wrote:
>
> On Sat, Jul 15, 2023, at 4:27 AM, Amit Kapila wrote:
>
> Do you have something like attached in mind?
>
>
> WFM. I would change the comment that says
>
> This function is called to provide context in the error ...
>
> to
>
> This message provides context in the error ...
>
> because this comment is not at the beginning of the function but *before* the
> message.
>

Sounds reasonable to me. I'll make this modification before pushing
unless there are more comments/suggestions.

-- 
With Regards,
Amit Kapila.




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2023-07-16 Thread Justin Pryzby
On Thu, Jun 01, 2023 at 08:50:50AM -0400, Michael Paquier wrote:
> >> Speaking of which, ATExecSetAccessMethodNoStorage() does a catalog
> >> update even if ALTER TABLE is defined to use the same table AM as what
> >> is currently set.  There is no need to update the relation's pg_class
> >> entry in this case.  Be careful that InvokeObjectPostAlterHook() still
> >> needs to be checked in this case.  Perhaps some tests should be added
> >> in test_oat_hooks.sql?  It would be tempted to add a new SQL file for
> >> that.
> >
> > Are you suggesting to put this in a conditional: if 
> > oldrelam!=newAccessMethod ?
> 
> Yes, that's what I would add with a few lines close to the beginning
> of ATExecSetTableSpaceNoStorage() to save from catalog updates if
> these are not needed.

I understand that it's possible for it to be conditional, but I don't
understand why it's desirable/important?

It still seems preferable to be unconditional.

On Wed, May 31, 2023 at 06:35:34PM -0500, Justin Pryzby wrote:
>> Why is that desirable ?  I'd prefer to see it written with fewer
>> conditionals, not more; then, it hits the same path every time.

It's not conditional in the tablespace code that this follows, nor
others like ATExecSetStatistics(), ATExecForceNoForceRowSecurity(),
ATExecSetCompression(), etc, some of which are recently-added.  If it's
important to do here, isn't it also important to do everywhere else ?

-- 
Justin




Re: doc: clarify the limitation for logical replication when REPLICA IDENTITY is FULL

2023-07-16 Thread Peter Smith
On Sat, Jul 15, 2023 at 2:10 PM Amit Kapila  wrote:
>
> On Fri, Jul 14, 2023 at 2:15 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > > I think it's appropriate to add on the restrictions page. (But 
> > > > mentioning that this
> > > restriction is only for subscriber)
> > > >
> > > > If the list were larger, then the restrictions page could be divided 
> > > > into publisher
> > > and subscriber restrictions. But not for one very specific restriction.
> > > >
> > >
> > > Okay, how about something like: "The UPDATE and DELETE operations
> > > cannot be applied on the subscriber for the published tables that
> > > specify REPLICA IDENTITY FULL when the table has attributes with
> > > datatypes (e.g point or box) that don't have a default operator class
> > > for Btree or Hash. This won't be a problem if the table has a primary
> > > key or replica identity defined for it."?
> >
> > Thanks for discussing and giving suggestions. But the first sentence
> > seems difficult for me to read. How about the attached?
> >
>
> The last line seems repetitive to me. So, I have removed it. Apart
> from that, the patch looks good to me. Sergie, Peter, and others, any
> thoughts?

The v5 patch LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-16 Thread David Rowley
On Wed, 12 Jul 2023 at 14:50, David Rowley  wrote:
>
> On Wed, 12 Jul 2023 at 14:23, Tom Lane  wrote:
> > I did think about that, but "shallow copy a Path" seems nontrivial
> > because the Path structs are all different sizes.  Maybe it is worth
> > building some infrastructure to support that?
>
> It seems a reasonable thing to have to do.  It seems the minimum thing
> we could do to ensure each Path is only mentioned in at most 1
> RelOptInfo.

I've attached a draft patch which adds copyObjectFlat() and supports
all Node types aside from the ones mentioned in @extra_tags in
gen_node_support.pl.  This did require including all the node header
files in copyfuncs.c, which that file seems to have avoided until now.

I also didn't do anything about ExtensibleNode types. I assume just
copying the ExtensibleNode isn't good enough. To flat copy the actual
node I think would require adding a new function to
ExtensibleNodeMethods.

I was also unsure what we should do when shallow copying a List. The
problem there is if we just do a shallow copy, a repalloc() on the
elements array would end up pfreeing memory that might be used by a
shallow copied clone.  Perhaps List is not unique in that regard?
Maybe the solution there is to add a special case and list_copy()
Lists like what is done in copyObjectImpl().
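
To illustrate the hazard (a hypothetical snippet; some_node and other_node
are placeholders):

    List       *orig = list_make1(some_node);
    List       *clone = palloc(sizeof(List));

    memcpy(clone, orig, sizeof(List));      /* flat copy shares orig->elements */

    /* growing the original may repalloc()/pfree() the shared array ... */
    orig = lappend(orig, other_node);

    /* ... leaving clone->elements pointing at freed memory */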

I'm hoping the attached patch will at least assist in moving the
discussion along.

David
diff --git a/src/backend/nodes/.gitignore b/src/backend/nodes/.gitignore
index 0c14b5697b..91cbd2cf24 100644
--- a/src/backend/nodes/.gitignore
+++ b/src/backend/nodes/.gitignore
@@ -1,4 +1,5 @@
 /node-support-stamp
+/nodesizes.h
 /nodetags.h
 /*funcs.funcs.c
 /*funcs.switch.c
diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile
index 0a95e683d0..46ce62f828 100644
--- a/src/backend/nodes/Makefile
+++ b/src/backend/nodes/Makefile
 queryjumblefuncs.o: queryjumblefuncs.c queryjumblefuncs.funcs.c queryjumblefuncs.switch.c | node-support-stamp
 readfuncs.o:  readfuncs.c readfuncs.funcs.c readfuncs.switch.c | node-support-stamp
 
 maintainer-clean: clean
-	rm -f node-support-stamp $(addsuffix funcs.funcs.c,copy equal out queryjumble read) $(addsuffix funcs.switch.c,copy equal out queryjumble read) nodetags.h
+	rm -f node-support-stamp $(addsuffix funcs.funcs.c,copy equal out queryjumble read) $(addsuffix funcs.switch.c,copy equal out queryjumble read) nodesizes.h nodetags.h
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f2568ff5e6..ccd4cbfa01 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -15,9 +15,54 @@
 
 #include "postgres.h"
 
+#include "access/amapi.h"
+#include "access/tableam.h"
+#include "access/tsmapi.h"
+#include "commands/event_trigger.h"
+#include "commands/trigger.h"
+#include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "nodes/extensible.h"
+#include "nodes/miscnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
+#include "nodes/plannodes.h"
+#include "nodes/replnodes.h"
+#include "nodes/supportnodes.h"
+#include "nodes/tidbitmap.h"
 #include "utils/datum.h"
 
+static const Size flat_node_sizes[] = {
+   0, /* T_Invalid */
+#include "nodes/nodesizes.h"
+};
+
+/*
+ * copyObjectFlat
+ *      Allocate a new copy of the Node type denoted by 'from' and flat copy
+ *      the contents of it into the newly allocated node and return it.
+ */
+void *
+copyObjectFlat(const void *from)
+{
+   Size        size;
+   void       *retval;
+   NodeTag     tag = nodeTag(from);
+
+   if ((unsigned int) tag >= lengthof(flat_node_sizes))
+   {
+   elog(ERROR, "unrecognized node type: %d", (int) tag);
+   return NULL;
+   }
+
+   /* XXX how to handle ExtensibleNodes? Can we just deep copy? */
+   size = flat_node_sizes[tag];
+   retval = palloc(size);
+   memcpy(retval, from, size);
+
+   return retval;
+}
 
 /*
  * Macros to simplify copying of different kinds of fields.  Use these
diff --git a/src/backend/nodes/gen_node_support.pl 
b/src/backend/nodes/gen_node_support.pl
index 72c7963578..e9a759c5a0 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -2,6 +2,7 @@
 #--
 #
 # Generate node support files:
+# - nodesizes.h
 # - nodetags.h
 # - copyfuncs
 # - equalfuncs
@@ -599,6 +600,28 @@ my $header_comment =
  */
 ';
 
+# nodesizes.h
+
+push @output_files, 'nodesizes.h';
+open my $ns, '>', "$output_path/nodesizes.h$tmpext"
+  or die "$output_path/nodesizes.h$tmpext: $!";
+
+printf $ns $header_comment, 'nodesizes.h';
+
+foreach my $n (@node_types)
+{
+   next if elem $n, @abstract_types;
+   if (defined $manual_nodetag_number{$n})
+   {
+   print $ns "\tsizeof(T_${n}),\n";
+   }
+   else
+   {
+   print $ns "\tsizeof(${n}),\n";
+   }
+}
+

Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-16 Thread Tom Lane
David Rowley  writes:
> On Wed, 12 Jul 2023 at 14:50, David Rowley  wrote:
>> On Wed, 12 Jul 2023 at 14:23, Tom Lane  wrote:
>>> I did think about that, but "shallow copy a Path" seems nontrivial
>>> because the Path structs are all different sizes.  Maybe it is worth
>>> building some infrastructure to support that?

>> It seems a reasonable thing to have to do.  It seems the minimum thing
>> we could do to ensure each Path is only mentioned in at most 1
>> RelOptInfo.

> ...
> I also didn't do anything about ExtensibleNode types. I assume just
> copying the ExtensibleNode isn't good enough. To flat copy the actual
> node I think would require adding a new function to
> ExtensibleNodeMethods.

Yeah, the problem I've got with this approach is that flat-copying
FDW and Custom paths would require extending the respective APIs.
While that's a perfectly reasonable ask if we only need to do this
in HEAD, it would be a nonstarter for released branches.  Is it
okay to only fix this issue in HEAD?

> I was also unsure what we should do when shallow copying a List.

The proposal is to shallow-copy a Path node.  List is not a kind
of Path, so how does List get into it?  (Lists below Paths would
not get copied, by definition.)

regards, tom lane




Re: Inefficiency in parallel pg_restore with many tables

2023-07-16 Thread Nathan Bossart
On Sun, Jul 16, 2023 at 09:45:54AM -0400, Tom Lane wrote:
> Actually, as long as we're talking about approximately-correct behavior:
> let's make the ready_list be a priority heap, and then just make
> pop_next_work_item scan forward from the array start until it finds an
> item that's runnable per the lock heuristic.  If the heap root is
> blocked, the next things we'll examine will be its two children.
> We might pick the lower-priority of those two, but it's still known to
> be higher priority than at least 50% of the remaining heap entries, so
> it shouldn't be too awful as a choice.  The argument gets weaker the
> further you go into the heap, but we're not expecting that having most
> of the top entries blocked will be a common case.  (Besides which, the
> priorities are pretty crude to begin with.)  Once selected, pulling out
> an entry that is not the heap root is no problem: you just start the
> sift-down process from there.
> 
> The main advantage of this over the only-sort-sometimes idea is that
> we can guarantee that the largest ready item will always be dispatched
> as soon as it can be (because it will be the heap root).  So cases
> involving one big table (with big indexes) and a lot of little ones
> should get scheduled sanely, which is the main thing we want this
> algorithm to ensure.  With the other approach we can't really promise
> much at all.

This seems worth a try.  IIUC you are suggesting making binaryheap.c
frontend-friendly and expanding its API a bit.  If no one has volunteered,
I could probably hack something together.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remaining sql/json patches

2023-07-16 Thread jian he
Hi.
It seems there is no explanation of json_api_common_syntax in
functions-json.html.

I can get json_query full synopsis from functions-json.html as follows:
json_query ( context_item, path_expression [ PASSING { value AS
varname } [, ...]] [ RETURNING data_type [ FORMAT JSON [ ENCODING UTF8
] ] ] [ { WITHOUT | WITH { CONDITIONAL | [UNCONDITIONAL] } } [ ARRAY ]
WRAPPER ] [ { KEEP | OMIT } QUOTES [ ON SCALAR STRING ] ] [ { ERROR |
NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON EMPTY ]
[ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression }
ON ERROR ])

It seems there isn't a full synopsis for json_table, only a partial
one-by-one explanation.




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-07-16 Thread Amit Kapila
On Thu, Jun 8, 2023 at 9:24 AM Hayato Kuroda (Fujitsu)
 wrote:
>

Few comments/questions

1.
+check_for_parameter_settings(ClusterInfo *new_cluster)
{
...
+
+ res = executeQueryOrDie(conn, "SHOW max_replication_slots;");
+ max_replication_slots = atoi(PQgetvalue(res, 0, 0));
+
+ if (max_replication_slots == 0)
+ pg_fatal("max_replication_slots must be greater than 0");
...
}

Won't it be better to verify that the value of "max_replication_slots"
on the new cluster is greater than the number of logical slots we are
planning to copy from the old one? Similar to this, I wondered whether we
need to check the value of max_wal_senders? But, I guess one can
simply decode from slots by using APIs, so not sure about that. What
do you think?
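
Something along these lines, perhaps (a sketch only;
count_old_cluster_logical_slots() is a hypothetical helper returning the
number of slots to be migrated):

    int         nslots;
    int         max_replication_slots;
    PGresult   *res;

    nslots = count_old_cluster_logical_slots();

    res = executeQueryOrDie(conn, "SHOW max_replication_slots;");
    max_replication_slots = atoi(PQgetvalue(res, 0, 0));
    PQclear(res);

    if (max_replication_slots < nslots)
        pg_fatal("max_replication_slots (%d) must be at least the number of "
                 "logical replication slots on the old cluster (%d)",
                 max_replication_slots, nslots);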

2.
+ /*
+ * Dump logical replication slots if needed.
+ *
+ * XXX We cannot dump replication slots at the same time as the schema
+ * dump because we need to separate the timing of restoring
+ * replication slots and other objects. Replication slots, in
+ * particular, should not be restored before executing the pg_resetwal
+ * command because it will remove WALs that are required by the slots.
+ */
+ if (user_opts.include_logical_slots)

Can you explain this point a bit more with some example scenarios?
Basically, if we had sent all the WAL before the upgrade then why do
we need to worry about the timing of pg_resetwal?

3. I see that you are trying to ensure that all the WAL has been
consumed for a slot except for shutdown_checkpoint in patch 0003 but
do we need to think of any interaction with restart_lsn
(MyReplicationSlot->data.restart_lsn) which is the start point to read
WAL for decoding by walsender?

-- 
With Regards,
Amit Kapila.




Re: doc: clarify the limitation for logical replication when REPLICA IDENTITY is FULL

2023-07-16 Thread Önder Kalacı
Hi,

>
> > The last line seems repetitive to me. So, I have removed it. Apart
> > from that patch looks good to me. Sergie, Peter, and others, any
> > thoughts?
>
> The v5 patch LGTM.
>
>
Overall looks good to me as well. Please consider the following as an
optional improvement.

My only minor concern here is the use of the term "default operator class".
It is accurate, but as far as I know, not many users can follow it easily.
I think the "pkey/repl full" suggestion gives some hint, but I wonder if we
should add something like the following to the text so that users can
understand it better:

 do not have a default operator class for B-tree or Hash.

+ If there is no default operator class, usually the type does not have an
+ equality operator.

However, this limitation ...


Thanks,
Onder