RE: Modify the document of Logical Replication configuration settings

2023-01-18 Thread Takamichi Osumi (Fujitsu)
On Wednesday, January 18, 2023 4:11 PM Michael Paquier  
wrote:
> On Wed, Jan 18, 2023 at 07:00:43AM +, Takamichi Osumi (Fujitsu) wrote:
> > The attached patch includes below two changes for the description of
> > Logical Replication "Configuration Settings".
> >
> > 1. Add one brief description about wal_sender_timeout.
> >I made it similar to one other sentence for subscriber.
> > 2. Fix a wrong GUC name "wal_receiver_retry_interval".
> >I think this doesn't seem to exist and would mean
> "wal_retrieve_retry_interval".
> >
> > Kindly have a look at it.
> 
> Looks right to me, thanks!
Thank you for checking, too!


Best Regards,
Takamichi Osumi





Re: Improve GetConfigOptionValues function

2023-01-18 Thread Bharath Rupireddy
On Wed, Jan 18, 2023 at 1:24 PM Nitin Jadhav
 wrote:
>
> Attaching the patch.
>
> On Wed, Jan 18, 2023 at 1:21 PM Nitin Jadhav
>  wrote:
> >
> > Hi,
> >
> > GetConfigOptionValues function extracts the config parameters for the
> > given variable irrespective of whether it results in noshow or not.
> > But the parent function show_all_settings ignores the values parameter
> > if it results in noshow. It's unnecessary to fetch all the values
> > during noshow. So a return statement in GetConfigOptionValues() when
> > noshow is set to true is needed. Attached the patch for the same.
> > Please share your thoughts.

Yes, the existing caller isn't using the fetched values when noshow is
set to true. However, I think returning from GetConfigOptionValues()
when noshow is set to true, without fetching the values, limits the use of
the function. What if some other caller wants to use the function to
get the values with noshow passed in, and wants to use the values even when
noshow is set to true?
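
For context, the change being proposed is roughly of this shape (just a sketch;
guc_is_hidden() is a placeholder and the real function in guc_funcs.c differs
in detail):

static void
GetConfigOptionValues(struct config_generic *conf, const char **values,
                      bool *noshow)
{
    if (noshow)
    {
        /* existing logic deciding whether this GUC may be shown */
        *noshow = guc_is_hidden(conf);  /* placeholder for the real test */

        /* the proposal: don't bother building values[] for hidden GUCs */
        if (*noshow)
            return;
    }

    /* ... existing code filling values[0..N] from conf ... */
}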

Also, do we gain anything with the patch? I mean, can
show_all_settings()/pg_settings/pg_show_all_settings() get faster, say
with a non-superuser lacking the pg_read_all_settings predefined role, or
with a superuser? I see there are about 6 GUC_NO_SHOW_ALL GUCs and 20
GUC_SUPERUSER_ONLY GUCs, so I'm not sure the patch brings much benefit.

Having said that, I don't mind keeping things the way they are right now.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-18 Thread Dilip Kumar
On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan  wrote:
>
> On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar  wrote:
> > I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> > strictly for freezing.  But the point is that the eager scanning
> > strategy is driven by table freezing needs of the table (tableagefrac)
> > that make sense, but if we have selected the eager freezing based on
> > the table age and its freezing need then why don't we force the eager
> > freezing as well if we have selected eager scanning, after all the
> > eager scanning is selected for satisfying the freezing need.
>
> Don't think of eager scanning as the new name for aggressive mode --
> it's a fairly different concept, because we care about costs now.
> Eager scanning can be chosen just because it's very cheap relative to
> the alternative of lazy scanning, even when relfrozenxid is still very
> recent. (This kind of behavior isn't really new [1], but the exact
> implementation from the patch is new.)
>
> Tables such as pgbench_branches and pgbench_tellers will reliably use
> eager scanning strategy, no matter how any GUC has been set -- just
> because the added cost is always zero (relative to lazy scanning). It
> really doesn't matter how far along tableagefrac here, ever. These
> same tables will never use eager freezing strategy, unless the
> vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
> another example of how scanning strategy and freezing strategy may
> differ for the same table.)

Yes, I agree with that.  Thanks for explaining in detail.

> You do have a good point, though. I think that I know what you mean.
> Note that antiwraparound autovacuums (or VACUUMs of tables very near
> to that point) *will* always use both the eager freezing strategy and
> the eager scanning strategy -- which is probably close to what you
> meant.

Right

> The important point is that there can be more than one reason to
> prefer one strategy to another -- and the reasons can be rather
> different. Occasionally it'll be a combination of two factors together
> that push things in favor of one strategy over the other -- even
> though either factor on its own would not have resulted in the same
> choice.

Yes, that makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Logical replication timeout problem

2023-01-18 Thread wangw.f...@fujitsu.com
On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila  wrote:
> On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
>  wrote:
> >
> > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila  wrote:
> >
> > > >
> > > > I am a bit worried about the indirections that the wrappers and hooks
> > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > > call OutputPluginUpdateProgress.
> > > >
> > >
> > > Yeah, I think we can do it as we are doing the previous approach but
> > > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > > current patch has so that we can add error context information. This
> > > is similar to why we have a wrapper for all other callbacks like
> > > change_cb_wrapper.
> > >
> >
> > Ultimately OutputPluginUpdateProgress() will be called - which in turn
> > will call ctx->update_progress.
> >
> 
> No, update_progress_cb_wrapper() should directly call
> ctx->update_progress(). The key reason to have a
> update_progress_cb_wrapper() is that it allows us to add error context
> information (see the usage of output_plugin_error_callback).

I think it makes sense. This also avoids the need for every output plugin to
implement the callback. So I tried to improve the patch based on this approach.

And I tried to add some comments for this new callback to distinguish it from
ctx->update_progress.
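
For reference, the wrapper in the attached patch is shaped much like the
existing change_cb_wrapper, roughly along these lines (a simplified sketch;
the exact code in the patch may differ):

static void
update_progress_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                           XLogRecPtr lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    /* do nothing if the writer did not provide an update_progress callback */
    if (ctx->update_progress == NULL)
        return;

    /* push callback + info on the error context stack, as change_cb_wrapper does */
    state.ctx = ctx;
    state.callback_name = "update_progress";
    state.report_location = lsn;
    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* directly invoke the writer's update_progress, no output plugin involved */
    ctx->update_progress(ctx, lsn, txn->xid, false /* skipped_xact */);

    /* pop the error context stack */
    error_context_stack = errcallback.previous;
}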

Attached is the new patch.

Regards,
Wang Wei


v3-0001-Fix-the-logical-replication-timeout-during-proces.patch
Description:  v3-0001-Fix-the-logical-replication-timeout-during-proces.patch


Re: minor bug

2023-01-18 Thread Laurenz Albe
On Tue, 2023-01-17 at 10:32 -0500, Tom Lane wrote:
> Laurenz Albe  writes:
> > On Mon, 2023-01-16 at 19:59 +0100, Torsten Förtsch wrote:
> > > So, the timestamp displayed in the log message is certainly wrong.
> 
> > If recovery stops at a WAL record that has no timestamp, you get this
> > bogus recovery stop time.  I think we should show the recovery stop time
> > only if time was the target, as in the attached patch.
> 
> I don't think that is a tremendously useful definition: the user
> already knows what recoveryStopTime is, or can find it out from
> their settings easily enough.
> 
> I seem to recall that the original idea was to report the timestamp
> of the commit/abort record we are stopping at.  Maybe my memory is
> faulty, but I think that'd be significantly more useful than the
> current behavior, *especially* when the replay stopping point is
> defined by something other than time.
> 
> (Also, the wording of the log message suggests that that's what
> the reported time is supposed to be.  I wonder if somebody messed
> this up somewhere along the way.)

recoveryStopTime is set to recordXtime (the time of the xlog record)
a few lines above that patch, so this is useful information if it is
present.

I realized that my original patch might be a problem for translation;
here is an updated version that does not take any shortcuts.

Yours,
Laurenz Albe
From b01428486f7795f757edd14f66362ee575d2f168 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 18 Jan 2023 09:23:50 +0100
Subject: [PATCH] Don't show bogus recovery stop time
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If the target for archive recovery was different from time, the
WAL record that ended recovery might not have a timestamp.
In that case, the log message at the end of recovery would
show a bogus recovery time of midnight on 2000-01-01.

Reported-by: Torsten Förtsch
Author: Laurenz Albe
Discussion: https://postgr.es/m/cakkg4_kuevpqbmyoflajx7opaqk6cvwkvx0hrcfjspfrptx...@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c | 35 +--
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index bc3c3eb3e7..73929a535e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2570,19 +2570,38 @@ recoveryStopsBefore(XLogReaderState *record)
 		recoveryStopLSN = InvalidXLogRecPtr;
 		recoveryStopName[0] = '\0';
 
+		/* this could be simplified, but that might break translatability */
 		if (isCommit)
 		{
-			ereport(LOG,
-	(errmsg("recovery stopping before commit of transaction %u, time %s",
-			recoveryStopXid,
-			timestamptz_to_str(recoveryStopTime))));
+			if (recoveryStopTime != 0)
+			{
+ereport(LOG,
+		(errmsg("recovery stopping before commit of transaction %u, time %s",
+recoveryStopXid,
+timestamptz_to_str(recoveryStopTime))));
+			}
+			else
+			{
+ereport(LOG,
+		(errmsg("recovery stopping before commit of transaction %u",
+recoveryStopXid)));
+			}
 		}
 		else
 		{
-			ereport(LOG,
-	(errmsg("recovery stopping before abort of transaction %u, time %s",
-			recoveryStopXid,
-			timestamptz_to_str(recoveryStopTime))));
+			if (recoveryStopTime != 0)
+			{
+ereport(LOG,
+		(errmsg("recovery stopping before abort of transaction %u, time %s",
+recoveryStopXid,
+timestamptz_to_str(recoveryStopTime))));
+			}
+			else
+			{
+ereport(LOG,
+		(errmsg("recovery stopping before abort of transaction %u",
+recoveryStopXid)));
+			}
 		}
 	}
 
-- 
2.39.0



Re: Inconsistency in vacuum behavior

2023-01-18 Thread Nikita Malakhov
Hi hackers!

Alexander found a very good issue.
Please check the solution above. Are there any objections? It's a production case,
so please review; any thoughts and objections are welcome.

On Mon, Jan 16, 2023 at 8:15 PM Alexander Pyhalov 
wrote:

> Nikita Malakhov писал 2023-01-16 20:12:
> > Hi,
> >
> > Currently there is no error in this case, so additional thrown error
> > would require a new test.
> > Besides, throwing an error here does not make sense - it is just a
> > check for a vacuum
> > permission, I think the right way is to just skip a relation that is
> > not suitable for vacuum.
> > Any thoughts or objections?
> >
>
> No objections for not throwing an error.
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: Modify the document of Logical Replication configuration settings

2023-01-18 Thread Bharath Rupireddy
On Wed, Jan 18, 2023 at 12:31 PM Takamichi Osumi (Fujitsu)
 wrote:
>
> Hi, hackers
>
> The attached patch includes below two changes for the description of
> Logical Replication "Configuration Settings".
>
> 1. Add one brief description about wal_sender_timeout.
>I made it similar to one other sentence for subscriber.

+   
+Logical replication walsender is also affected by
+wal_sender_timeout.
+   

Looks fine. Adding something like [1] to the wal_sender_timeout GUC's
description might be a good idea, just to state specifically that
logical replication subscribers are affected too. Perhaps it's not
required, since the postgres glossary wraps logical replication
subscribers under standby anyway -
https://www.postgresql.org/docs/devel/glossary.html. To me personally,
though, the typical notion of a standby is one connected to the primary
via streaming replication.

> 2. Fix a wrong GUC name "wal_receiver_retry_interval".
>I think this doesn't seem to exist and would mean 
> "wal_retrieve_retry_interval".

Good catch. +1.

[1]
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89d53f2a64..6f9509267c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4326,7 +4326,8 @@ restore_command = 'copy
"C:\\server\\archivedir\\%f" "%p"'  # Windows

 Terminate replication connections that are inactive for longer
 than this amount of time. This is useful for
-the sending server to detect a standby crash or network outage.
+the sending server to detect a standby crash or logical replication
+subscriber crash or network outage.
 If this value is specified without units, it is taken as milliseconds.
 The default value is 60 seconds.
 A value of zero disables the timeout mechanism.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




SHARED locks barging behaviour

2023-01-18 Thread Arul Ajmani
I'm trying to better understand the following barging behaviour with SHARED
locks.

*Setup:*

postgres=# create table t(a INT);
CREATE TABLE
postgres=# INSERT INTO t VALUES(1);
INSERT 0 1

Then, performing the following operations in 3 different sessions, in
order, we observe:

Session 1:
BEGIN;
BEGIN
postgres=*# SELECT * FROM t WHERE a = 1 FOR SHARE;
 a
---
 1
(1 row)

Session 2:
postgres=# BEGIN;
BEGIN
postgres=*# SELECT * FROM t WHERE a = 1 FOR UPDATE;
-- waits

Session 3:
BEGIN;
BEGIN
postgres=*# SELECT * FROM t WHERE a = 1 FOR SHARE;
 a
---
 1
(1 row)
-- returns immediately

Given there is a transaction waiting to acquire a FOR UPDATE lock, I was
surprised to see the second FOR SHARE transaction return immediately
instead of waiting. I have two questions:

1) Could this barging behaviour potentially starve out the transaction
waiting to acquire the FOR UPDATE lock, if there is a continuous queue of
transactions that acquire a FOR SHARE lock briefly?
2) Assuming this is by design, I couldn't find (in code) where this
explicit policy choice is made. I was looking around LockAcquireExtended, but
it seems like the decision is made above this layer. Could someone more
familiar with this code point me at the right place?

Thanks


Issue with psql's create_help.pl under perlcritic

2023-01-18 Thread Michael Paquier
Hi all,

A recent system update of a Debian SID host has begun to show me this
issue:
./src/bin/psql/create_help.pl: Bareword dir handle opened at line 47,
column 1.  See pages 202,204 of PBP.
([InputOutput::ProhibitBarewordDirHandles] Severity: 5)

This issue gets fixed here as of the attached.
Comments?
--
Michael
diff --git a/src/bin/psql/create_help.pl b/src/bin/psql/create_help.pl
index 2c5fed203e..1d5366db16 100644
--- a/src/bin/psql/create_help.pl
+++ b/src/bin/psql/create_help.pl
@@ -44,7 +44,7 @@ my $define = $hfilebasename;
 $define =~ tr/a-z/A-Z/;
 $define =~ s/\W/_/g;
 
-opendir(DIR, $docdir)
+opendir(my $dh, $docdir)
   or die "$0: could not open documentation source dir '$docdir': $!\n";
 open(my $hfile_handle, '>', "$outdir/$hfile")
   or die "$0: could not open output file '$hfile': $!\n";
@@ -103,7 +103,7 @@ my $maxlen = 0;
 
 my %entries;
 
-foreach my $file (sort readdir DIR)
+foreach my $file (sort readdir $dh)
 {
 	my ($cmdid, @cmdnames, $cmddesc, $cmdsynopsis);
 	$file =~ /\.sgml$/ or next;
@@ -230,4 +230,4 @@ print $hfile_handle "
 close $cfile_handle;
 close $hfile_handle;
 close $depfile_handle if ($depfile);
-closedir DIR;
+closedir $dh;


signature.asc
Description: PGP signature


Re: document the need to analyze partitioned tables

2023-01-18 Thread Laurenz Albe
On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > Maybe (all?) the clarification the docs need is to say:
> > "Partitioned tables are not *themselves* processed by autovacuum."
> 
> Yes, I think the lack of autovacuum needs to be specifically mentioned
> since most people assume autovacuum handles _all_ statistics updating.
> 
> Can someone summarize how bad it is we have no statistics on partitioned
> tables?  It sounds bad to me.

Andrey Lepikhov had an example earlier in this thread[1].  It doesn't take
an exotic query. 

Attached is a new version of my patch that tries to improve the wording.

Yours,
Laurenz Albe

 [1]: https://postgr.es/m/3df5c68b-13aa-53d0-c0ec-ed98e6972e2e%40postgrespro.ru
From 53da8083556364490d42077492e608152f9ae02e Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 18 Jan 2023 10:09:21 +0100
Subject: [PATCH] Improve autovacuum doc on partitioned tables

The documentation mentioned that autovacuum doesn't process
partitioned tables, but it was unclear about the impact.
The old wording could be interpreted to mean that there are
problems with dead tuple cleanup on partitioned tables.
Clarify that the only potential problem is autoanalyze, and
that statistics for the partitions will be gathered.

Author: Laurenz Albe
Reviewed-by: Nathan Bossart, Justin Pryzby
Discussion: https://postgr.es/m/1fd81ddc7710a154834030133c6fea41e55c8efb.camel%40cybertec.at
---
 doc/src/sgml/maintenance.sgml | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9c..3954e797a4 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -860,8 +860,15 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu

 

-Partitioned tables are not processed by autovacuum.  Statistics
-should be collected by running a manual ANALYZE when it is
+The partitions of a partitioned table are normal tables and get processed
+by autovacuum, but autovacuum doesn't process the partitioned table itself.
+This is no problem as far as VACUUM is concerned, since
+processing the partitions is sufficient.  But, as mentioned in
+, it also means that autovacuum won't
+run ANALYZE on the partitioned table itself.
+While statistics are gathered on the partitions, some queries may rely on
+the statistics for the partitioned table.  You should collect statistics by
+running a manual ANALYZE when the partitioned table is
 first populated, and again whenever the distribution of data in its
 partitions changes significantly.

-- 
2.39.0



Re: [EXTERNAL] Re: [PATCH] Support using "all" for the db user in pg_ident.conf

2023-01-18 Thread Jelte Fennema
> 0004 looks fine as well, be it for the tests (I am hesitating to tweak
> things a bit here actually for the role names), the code or the docs,

Anything I can do to help with this? Or will you do that yourself?




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-18 Thread Richard Guo
On Tue, Jan 17, 2023 at 3:05 PM Tom Lane  wrote:

> Richard Guo  writes:
> > BTW, I wonder if we should have checked CoercionForm before
> > fix_upper_expr_mutator steps into CoerceViaIO->arg to adjust the expr
> > there.
>
> I will just quote what it says in primnodes.h:
>
>  * NB: equal() ignores CoercionForm fields, therefore this *must* not carry
>  * any semantically significant information.
>
> If you think the planner should act differently for different values of
> CoercionForm, you are mistaken.  Maybe this is evidence of some
> previous bit of brain-fade, but if so we need to fix that.


According to this comment in primnodes.h, the planner is not supposed to
treat implicit and explicit casts differently.  In this case
set_plan_references is doing its job correctly, to adjust both random()
FuncExprs in targetlist to refer to subplan's output for query 2.  As a
consequence we would get a sorted output.

I'm still confused that when the same scenario happens with ORDER BY in
an aggregate function, like in query 1, the result is different in that
we would get an unsorted output.

I wonder if we should avoid this inconsistent behavior.

Thanks
Richard


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-18 Thread David Rowley
On Tue, 10 Jan 2023 at 06:15, Ankit Kumar Pandey  wrote:
>
>
> > On 09/01/23 17:53, David Rowley wrote:
> > I gave some thought to whether doing foreach_delete_current() is safe
> > within a foreach_reverse() loop. I didn't try it, but I couldn't see
> > any reason why not.  It is pretty late here and I'd need to test that
> > to be certain. If it turns out not to be safe then we need to document
> > that fact in the comments of the foreach_reverse() macro and the
> > foreach_delete_current() macro.
>
> I tested foreach_delete_current inside foreach_reverse loop.
>
> It worked fine.

I also thought I'd better test that foreach_delete_current() works
with foreach_reverse(). I can confirm that it *does not* work
correctly.  I guess maybe you only tested the fact that it deleted the
current item and not that the subsequent loop correctly went to the
item directly before the deleted item. There's a problem with that. We
skip an item.

Instead of fixing that, I think it's likely better just to loop
backwards manually with a for() loop, so I've adjusted the patch to
work that way.  It's quite likely that the additional code in
foreach() and what was in foreach_reverse() is slower than looping
manually due to the additional work those macros do to set the cell to
NULL when we run out of cells to loop over.
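
In other words, the adjusted code now does something along these lines (just a
sketch; "lst" and should_remove() stand in for the real list and condition):

/* walk the list from the tail so deletions don't disturb unvisited cells */
for (int i = list_length(lst) - 1; i >= 0; i--)
{
    Node   *item = (Node *) list_nth(lst, i);

    if (should_remove(item))
        lst = list_delete_nth_cell(lst, i);
}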

> I have attached patch with one extra test case (as mentioned above).
> Rest of the changes are looking fine.

I made another pass over the v7 patch and fixed a bug that was
disabling the optimization when the deepest WindowAgg had a
runCondition.  This should have been using llast_node instead of
linitial_node.  The top-level WindowAgg is the last in the list.

I also made a pass over the tests and added a test case for the
runCondition check to make sure we disable the optimization when the
top-level WindowAgg has one of those.  I wasn't sure what your test
comments' case numbers were meant to represent. They were not aligned
with the code comments that document when the optimisation is
disabled; they started out aligned but seemed to go off the rails at
#3. I've now made them follow the comments in create_one_window_path()
and made it clearer what we expect the test outcome to be in each
case.

I've attached the v9 patch. I feel like this patch is quite
self-contained and I'm quite happy with it now.  If there are no
objections soon, I'm planning on pushing it.

David
diff --git a/src/backend/optimizer/plan/planner.c 
b/src/backend/optimizer/plan/planner.c
index 093ace0864..c110696403 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4464,8 +4464,11 @@ create_one_window_path(PlannerInfo *root,
   List *activeWindows)
 {
PathTarget *window_target;
+   Query  *parse = root->parse;
ListCell   *l;
List   *topqual = NIL;
+   List   *orderby_pathkeys = NIL;
+   int sort_pushdown_idx = -1;
 
/*
 * Since each window clause could require a different sort order, we stack
@@ -4484,6 +4487,75 @@ create_one_window_path(PlannerInfo *root,
 */
window_target = input_target;
 
+   /*
+    * Here we attempt to minimize the number of sorts which must be performed
+    * for queries with an ORDER BY clause.
+    *
+    * It's possible due to select_active_windows() putting the WindowClauses
+    * with the lowest tleSortGroupRef last in the activeWindows list that the
+    * final WindowClause has a subset of the sort requirements that the
+    * query's ORDER BY clause has.  Below we try to detect this case and if
+    * we find this is true, we consider adjusting the sort that's done for
+    * WindowAggs and include the additional clauses so that no additional
+    * sorting is required for the query's ORDER BY clause.
+    *
+    * We don't try this optimization for the following cases:
+    *
+    * 1. If the query has a DISTINCT clause.  We needn't waste any additional
+    * effort for the more strict sort order as if DISTINCT is done via Hash
+    * Aggregate then that's going to undo this work.
+    *
+    * 2. If the query has a LIMIT clause.  The top-level sort will be able to
+    * use a top-n sort which might be more efficient than sorting by the
+    * additional columns.  If the LIMIT does not reduce the number of rows
+    * significantly then this might not be true, but we don't try to consider
+    * that here.
+    *
+    * 3. If the top-level WindowClause has a runCondition then this can
+    * filter out tuples and make the final sort cheaper.  If we pushed the
+    * sort down below the WindowAgg then we'd need to sort all rows including
+    * ones that the runCondition might filter out.  This may waste effort so
+    * we just don't try to push down the sort for this case.
+    *

Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-18 Thread David Rowley
On Wed, 18 Jan 2023 at 22:37, Richard Guo  wrote:
> I'm still confused that when the same scenario happens with ORDER BY in
> an aggregate function, like in query 1, the result is different in that
> we would get an unsorted output.
>
> I wonder if we should avoid this inconsistent behavior.

It certainly seems pretty strange that aggregates with an ORDER BY
behave differently from the query's ORDER BY. I'd have expected that
to be the same.  I've not looked to see why there's a difference, but
suspect that we thought about how we want it to work for the query's
ORDER BY and when ORDER BY aggregates were added, that behaviour was
not considered.

Likely finding the code or location where that code should be would
help us understand if something was just forgotten in the aggregate's
case.

It's probably another question as to if we should be adjusting this
behaviour now.

David




RE: Update comments in multixact.c

2023-01-18 Thread shiy.f...@fujitsu.com
On Wed, Jan 18, 2023 6:04 AM Peter Geoghegan  wrote:
> 
> On Tue, Jan 17, 2023 at 1:33 AM shiy.f...@fujitsu.com
>  wrote:
> > I noticed that commit 5212d447fa updated some comments in multixact.c
> because
> > SLRU truncation for multixacts is performed during VACUUM, instead of
> > checkpoint. Should the following comments which mentioned checkpointer be
> > changed, too?
> 
> Yes, I think so.

Thanks for your reply.

Attach a patch which fixed them.

Regards,
Shi yu


v1-0001-Update-comments-in-multixact.c.patch
Description: v1-0001-Update-comments-in-multixact.c.patch


Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-18 Thread Alvaro Herrera
On 2023-Jan-18, Drouvot, Bertrand wrote:

> The current calls are done that way:
> 
> wait_for_replay_catchup called:
> - 8 times with write LSN as an argument
> - 1 time with insert LSN as an argument
> - 16 times with flush LSN as an argument

> So it looks like that providing a node as a second argument
> would not help for the wait_for_replay_catchup() case.

... unless we changed the calls that wait for reply that use write or
insert so that they use flush instead.  Surely everything should still
work, right?  Flushing would still occur, either right after the write
(as the transaction commits) or ~200ms afterwards when WAL writer
catches up to that point.
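
That is, call sites that currently do something like the first line below
would simply switch to the second (sketch only):

$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('write'));
$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('flush'));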

I suppose this may fail to be true if there is some test that is
specifically testing whether writing WAL without flushing works, which
should rare enough, but if it does exist, in that one place we can use
the underlying wait_for_catchup().

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-01-18 Thread Jelte Fennema
As far as I can tell this is ready for committer feedback now btw. I'd
really like to get this into PG16.

> It hadn't been my intention to block the patch on it, sorry. Just
> registering a preference.

No problem. I hadn't looked into the shared PRNG solution closely
enough to determine if I thought it was better or not. Now that I
implemented an initial version I personally don't think it brings
enough advantages to warrant the added complexity. I definitely
don't think it's required for this patch, if needed this change can
always be done later without negative user impact afaict. And the
connection local PRNG works well enough to bring advantages.

> my
> assumption was that you could use the existing pglock_thread() to
> handle it, since it didn't seem like the additional contention would
> hurt too much.

That definitely would have been the easier approach and I considered
it. But the purpose of pglock_thread seemed so different from this lock
that it felt weird to combine the two. Another reason I refactored the lock
code is that it would probably be necessary for a future round-robin
load balancing, which would require sharing state between different
connections.

> > And my thought was that the one-time
> > initialization could be moved to a place that doesn't need to know the
> > connection options at all, to make it easier to reason about the
> > architecture. Say, next to the WSAStartup machinery.

That's an interesting thought, but I don't think it would really simplify
the initialization code. Mostly it would change its location.

> (And after marinating on this over the weekend, it occurred to me that
> keeping the per-connection option while making the PRNG global
> introduces an additional hazard, because two concurrent connections
> can now fight over the seed value.)

I think since setting the initial seed value is really only meant for testing
it's not a big deal if it doesn't work with concurrent connections.




Re: Minimal logical decoding on standbys

2023-01-18 Thread Drouvot, Bertrand

Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:

Hi,
0004:


@@ -3037,6 +3037,43 @@ $SIG{TERM} = $SIG{INT} = sub {

  =pod

+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{


Any reason this has to be standby specific?



Due to the extra work to be done for this case (aka wait for restart_lsn
and trigger a checkpoint on the primary).




+   # Now arrange for the xl_running_xacts record for which pg_recvlogical
+   # is waiting.
+   $master->safe_psql('postgres', 'CHECKPOINT');
+


Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
for us instead? This will likely also be needed in real applications after all.



Not sure I got it. What the C helper would be supposed to do?




+   print "starting pg_recvlogical\n";


I don't think tests should just print somewhere. Either diag() or note()
should be used.




Will be done.


+   if ($wait)
+   # make sure activeslot is in use
+   {
+   $node_standby->poll_query_until('testdb',
+   "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'activeslot' AND active_pid IS NOT NULL)"
+   ) or die "slot never became active";
+   }


That comment placement imo is quite odd.




Agree, will be done.


+# test if basic decoding works
+is(scalar(my @foobar = split /^/m, $result),
+   14, 'Decoding produced 14 rows');


Maybe mention that it's 2 transactions + 10 rows?




Agree, will be done.


+$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('flush'));


There's enough copies of this that I wonder if we shouldn't introduce a
Cluster.pm level helper for this.



Done in [1].




+print "waiting to replay $endpos\n";


See above.




Will be done.


+my $stdout_recv = $node_standby->pg_recvlogical_upto(
+'testdb', 'activeslot', $endpos, 180,
+'include-xids' => '0',
+'skip-empty-xacts' => '1');


I don't think this should use a hardcoded 180 but
$PostgreSQL::Test::Utils::timeout_default.




Agree, will be done.


+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on the standby.
+$node_standby->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_standby->restart;


IIRC a reload should suffice.




Right.

With a reload in place in my testing, I now notice that the catalog_xmin
is updated on the primary physical slot after the logical slots are
invalidated, when reloading hot_standby_feedback from "off" to "on".

This is not the case after a restart (aka catalog_xmin is NULL).

I think a restart and a reload should produce identical behavior on
the primary physical slot. If so, I'm tempted to think that the catalog_xmin
should be updated in the restart case too (even if all the logical slots are
invalidated), because the slots are not dropped yet. What do you think?


+# This should trigger the conflict
+$node_primary->safe_psql('testdb', 'VACUUM FULL');


Can we do something cheaper than rewriting the entire database? Seems
rewriting a single table ought to be sufficient?



While implementing the test at the table level, I discovered that there is
apparently no guarantee that, say, a "vacuum full pg_class;" would produce
a conflict.

Indeed, from what I can see in my testing it could generate an XLOG_HEAP2_PRUNE
with snapshotConflictHorizon set to 0:

"rmgr: Heap2   len (rec/tot):107/   107, tx:848, lsn: 0/03B98B30, prev 0/03B98AF0, desc: PRUNE snapshotConflictHorizon 0"

Having a snapshotConflictHorizon of zero leads to
ResolveRecoveryConflictWithSnapshot() simply returning without any conflict
handling.

It does look like that's not the right behavior in the standby decoding case
(and that the xid that generated the pruning should be used instead). What do
you think?


I think it'd also be good to test that rewriting a non-catalog table doesn't
trigger an issue.




Good point, but need to understand the above first.



+##
+# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Scenario 2: conflict due to row removal with hot_standby_feedback off.
+##
+
+# get the position to search from in the standby logfile
+my $logstart = -s $node_standby->logfile;
+
+# drop the logical slots
+$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('inactiveslot')]);
+$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('activeslot')]);
+
+create_logical_slots();
+
+# One way to produce recovery conflict is to create/drop a relation and launch a vacuum
+# with hot_standby_feedback turned off on the standby.
+$node_standby->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_standby->restart;
+# ensure walreceiver feedback off by waiting 

Re: [DOCS] Stats views and functions not in order?

2023-01-18 Thread Peter Eisentraut

On 11.01.23 07:11, Peter Smith wrote:

v9-0003 --> v10-0001


I'm not sure if anything is pending for v9-0003, if there is something
pending, please post an updated patch for the same.


Thanks for the reminder. PSA v10.


So this patch changes some sections describing system views to 
refentry's.  What is the reason for that?  refentry's are basically man 
pages; do we want man pages for each system view?


Maybe (*), but then we should also do the same to all the other system 
views, all the system catalogs, everything else.  Just changing a few in 
a single place seems odd.


(*) -- but also maybe not?





Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-18 Thread Dean Rasheed
On Wed, 18 Jan 2023 at 09:49, David Rowley  wrote:
>
> On Wed, 18 Jan 2023 at 22:37, Richard Guo  wrote:
> > I'm still confused that when the same scenario happens with ORDER BY in
> > an aggregate function, like in query 1, the result is different in that
> > we would get an unsorted output.
> >
> > I wonder if we should avoid this inconsistent behavior.
>
> It certainly seems pretty strange that aggregates with an ORDER BY
> behave differently from the query's ORDER BY. I'd have expected that
> to be the same.  I've not looked to see why there's a difference, but
> suspect that we thought about how we want it to work for the query's
> ORDER BY and when ORDER BY aggregates were added, that behaviour was
> not considered.
>

I think the behaviour of an ORDER BY in the query can also be pretty
surprising. For example, consider:

SELECT ARRAY[random(), random(), random()]
FROM generate_series(1, 3);

array
-
 {0.2335800863701647,0.14688842754711273,0.2975659224823368}
 {0.10616525384762876,0.8371175798972244,0.2936178886154661}
 {0.21679841321788262,0.5254761982948826,0.7789412240118161}
(3 rows)

which produces 9 different random values, as expected, and compare that to:

SELECT ARRAY[random(), random(), random()]
FROM generate_series(1, 3)
ORDER BY random();

 array
---
 {0.01952216253949679,0.01952216253949679,0.01952216253949679}
 {0.6735145595500629,0.6735145595500629,0.6735145595500629}
 {0.9406665780147616,0.9406665780147616,0.9406665780147616}
(3 rows)

which now only has 3 distinct random values. It's pretty
counterintuitive that adding an ORDER BY clause changes the contents
of the rows returned, not just their order.

The trouble is, if we tried to fix that, we'd risk changing some other
behaviour that users may have come to rely on.

Regards,
Dean




Re: heapgettup refactoring

2023-01-18 Thread Peter Eisentraut

On 10.01.23 21:31, Melanie Plageman wrote:

On Thu, Jan 5, 2023 at 8:52 AM Peter Eisentraut
 wrote:


Ok, let's look through these patches starting from the top then.

v4-0001-Add-no-movement-scan-helper.patch

This makes sense overall; there is clearly some duplicate code that can
be unified.

It appears that during your rebasing you have effectively reverted your
earlier changes that have been committed as
8e1db29cdbbd218ab6ba53eea56624553c3bef8c.  You should undo that.


Thanks. I think I have addressed this.
I've attached a rebased v5.


In your v2 patch, you remove these assertions:

-   /* check that rs_cindex is in sync */
-   Assert(scan->rs_cindex < scan->rs_ntuples);
-   Assert(lineoff == scan->rs_vistuples[scan->rs_cindex]);

Is that intentional?

I don't see any explanation, or some other equivalent code appearing 
elsewhere to replace this.






Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2023-01-18 Thread Etsuro Fujita
Hi Vignesh,

On Wed, Jan 4, 2023 at 9:19 PM vignesh C  wrote:
> On Tue, 1 Nov 2022 at 15:54, Etsuro Fujita  wrote:
> > Attached is a rebased version of the patch.
>
> The patch does not apply on top of HEAD as in [1], please post a rebased 
> patch:

I rebased the patch.  Attached is an updated patch.

Thanks!

Best regards,
Etsuro Fujita


v11-postgres-fdw-Add-support-for-parallel-abort.patch
Description: Binary data


Re: Rethinking the implementation of ts_headline()

2023-01-18 Thread Alvaro Herrera
I tried this other test, based on looking at the new regression tests
you added,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, 
MaxWords=9, MinWords=4');
   ts_headline   
─
 motion,↵
 As idle as a painted Ship↵
   Upon
(1 fila)

and was surprised that the match for the 'day & drink' arm of the OR
disappears from the reported headline.

This is what 15 reports for the same query:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, 
MaxWords=9, MinWords=4');
ts_headline
───
 Day after day, day after day,↵
   We stuck ... motion,   ↵
 As idle as a painted Ship  ↵
   Upon
(1 fila)

I think this was better.

15 seems to fail in other ways; for instance, 'drink' is not highlighted in the
headline when the OR matches, but if the other arm of the OR doesn't match, it
is; for example both 15 and master return the same for this one:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (mountain & backpack)'), 
'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───
 Day after day, day after day,↵
   We stuck ... drop to drink. ↵
 S. T. Coleridge
(1 fila)



Another thing I think might be a regression is the way fragments are
selected.  Consider what happens if I change the "idle & painted" in the
earlier query to "idle <-> painted", and MaxWords is kept low:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 
'MaxFragments=5, MaxWords=9, MinWords=4');
  ts_headline  
───
 day,  ↵
   We stuck, nor breath nor motion,   ↵
 As idle ... painted Ship   ↵
   Upon a painted Ocean.   ↵
 Water, water, every ... drop to drink.↵
 S. T. Coleridge
(1 fila)

Note that it chose to put a fragment delimiter exactly in the middle of the
phrase match, where the stop words are.  If I raise MaxWords, it is of course
much better, I suppose because the word limit doesn't force a new fragment,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 
'MaxFragments=5, MaxWords=25, MinWords=4');
   ts_headline
──
 after day, day after day,  ↵
   We stuck, nor breath nor motion,  ↵
 As idle as a painted Ship ↵
   Upon a painted Ocean.  ↵
 Water, water, every where ... boards did shrink;↵
 Water, water, every where,  ↵
   Nor any drop to drink. ↵
 S. T. Coleridge
(1 fila)

But in 15, the query with low MaxWords does this instead, where the
fragment delimiter occurs just *before* the phrasal match.

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 
'MaxFragments=5, MaxWords=9, MinWords=4');
ts_headline
───
 Day after day, 

Re: Question about initial logical decoding snapshot

2023-01-18 Thread shveta malik
Hello,

I was curious as to why we need 3rd running_xact and wanted to learn
more about it, so I have made a few changes to come up with a patch
which builds the snapshot in 2 running_xacts. The motive is to run the
tests to see the failures/issues with this approach to understand the
need of reading 3rd running_xact to build a consistent snapshot. On
this patch, I have got one test-failure which is
test_decoding/twophase_snapshot.

Approach:
When we start building a snapshot, on the occurrence of first
running_xact, move the state from START to BUILDING and wait for all
in-progress transactions to finish. On the second running_xact where
we find oldestRunningXid >= 1st xl_running_xact's nextXid, move to
CONSISTENT state. So, it means all the transactions started before
BUILDING state are now finished and all the new transactions that are
currently in progress are the ones that are started after BUILDING
state and thus have enough info to be decoded.
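
In code terms, the transition in the attached patch is roughly the following
(a simplified sketch, not the exact patch; building_cutoff_xid is just an
illustrative name for the remembered nextXid):

if (builder->state == SNAPBUILD_START)
{
    /* 1st xl_running_xacts: remember its nextXid and start building */
    builder->building_cutoff_xid = running->nextXid;
    builder->state = SNAPBUILD_BUILDING_SNAPSHOT;
}
else if (builder->state == SNAPBUILD_BUILDING_SNAPSHOT &&
         TransactionIdFollowsOrEquals(running->oldestRunningXid,
                                      builder->building_cutoff_xid))
{
    /* every xact that was running at the 1st record has now finished */
    builder->state = SNAPBUILD_CONSISTENT;
}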

Failure analysis for twophase_snapshot test:
After the patch application, test-case fails because slot is created
sooner and 'PREPARE TRANSACTION test1' is available as result of first
'pg_logical_slot_get_changes'  itself.  Intent of this testcase is to
see how two-phase txn is handled when snapshot-build completes in 3
stages (BUILDING-->FULL-->CONSISTENT). Originally, the PREPARED txn is
started between FULL and CONSISTENT stage and thus as per the current
code logic, 'DecodePrepare' will skip it. Please see code in
DecodePrepare:

 /* We can't start streaming unless a consistent state is reached. */
if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
{
ReorderBufferSkipPrepare(ctx->reorder, xid);
return;
}

So first 'pg_logical_slot_get_changes' will not show these changes.
Once we do 'commit prepared' after CONSISTENT state is reached, it
will be available for next 'pg_logical_slot_get_changes' to consume.

On the other hand, after the current patch, since we reach consistent
state sooner, so with the same test-case, PREPARED transaction now
ends up starting after CONSISTENT state and thus will be available to
be consumed by first 'pg_logical_slot_get_changes' itself. This makes
the testcase to fail.

Please note that in the patch, I have maintained 'WAIT for all running
transactions to end' even after reaching CONSISTENT state. I have
tried running tests even after removing that WAIT after CONSISTENT,
with that, we get one more test failure which is
test_decoding/ondisk_startup. The reason for failure here is the same
as previous case i.e., since we reach CONSISTENT state earlier,
slot-creation finishes faster and thus we see slight change in result
for this test.  ('step s1init completed' seen earlier in log file).

Both the failing tests here are written in such a way that they align
with the 3-phase snapshot build process. Otherwise, I do not see any
logical issues yet with this approach based on the test-cases
available so far.

So, I still have not gotten clarity on why we need 3rd running_xact
here. In code, I see a comment in SnapBuildFindSnapshot() which says
"c) ...But for older running transactions no viable snapshot exists
yet, so CONSISTENT will only be reached once all of those have
finished."  This comment refers to txns started between BUILDING and
FULL state. I do not understand it fully. I am not sure what tests I
need to run on the patch to reproduce this issue where we do not have
a viable snapshot when we go by two running_xacts only.

Any thoughts/comments are most welcome. Attached the patch for review.

Thanks
Shveta

On Fri, Dec 30, 2022 at 11:57 PM Chong Wang  wrote:
>
> Hi hackers.
>
> I'm studying the source code about creation of initial logical decoding 
> snapshot. What confused me is that why must we process 3 xl_running_xacts 
> before we get to the consistent state. I think we only need 2 
> xl_running_xacts.
>
> I think we can get to consistent state when we meet the 2nd xl_running_xact 
> with its oldestRunningXid > 1st xl_running_xact's nextXid, this means the 
> active transactions in 1st xl_running_xact all had commited, and we have all 
> the logs of transactions who will commit afterwards, so there is consistent 
> state in this time point and we can export a snapshot.
>
> I had read the discussion in [0] and the comment of commit '955a684', but I 
> haven't got a detailed explanation about why we need 4 stages during creation 
> of initial logical decoding snapshot but not 3 stages.
>
> My rencent job is relevant to logical decoding so I want to figure this 
> problem out, I'm very grateful if you can answer me, thanks.
>
>
>
> [0] 
> https://www.postgresql.org/message-id/flat/f37e975c-908f-858e-707f-058d3b1eb214%402ndquadrant.com
>
>
>
> --
>
> Best regards
>
> Chong Wang
>
> Greenplum DataFlow team


v1-0001-SnapBuild-in-2-stages.patch
Description: Binary data


Re: Support logical replication of DDLs

2023-01-18 Thread Amit Kapila
On Sat, Jan 7, 2023 at 8:58 PM Zheng Li  wrote:
>
> I added documentation and changed user interface design in the
> attached v60 patch set.
> The patch set addressed comments from Peter in [1].
>
> The motivation for the user interface change is that we want to manage
> DDL replication feature in stages with fine grained replication
> levels.
> For example, we can focus on reviewing and testing table commands
> first, then other commands. It also make sense to introduce different
> DDL replication levels
> from the user perspective as pointed out in [1]. We can add more
> replication levels along the way.
>
> In this patch DDL replication is disabled by default and it can be
> enabled at different levels
> using the new PUBLICATION option 'ddl'. This option currently has two
> levels and are
> only allowed to be set if the PUBLICATION is FOR ALL TABLES.
>
>   all: this option enables replication of all supported DDL commands.
>   table: this option enables replication of Table DDL commands which include:
>   -CREATE/ALTER/DROP TABLE
>   -CREATE TABLE AS
>

I think this point needs some thought. When you say 'all', how do you
think it will help to support DDL replication for foreign tables,
materialized views, views, etc where changes to such relations are
currently not supported by logical replication? We should also think
about initial sync for all those objects as well.

-- 
With Regards,
Amit Kapila.




Re: Logical replication timeout problem

2023-01-18 Thread Ashutosh Bapat
On Wed, Jan 18, 2023 at 1:49 PM wangw.f...@fujitsu.com
 wrote:
>
> On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila  wrote:
> > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila  
> > > wrote:
> > >
> > > > >
> > > > > I am a bit worried about the indirections that the wrappers and hooks
> > > > > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > > > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > > > call OutputPluginUpdateProgress.
> > > > >
> > > >
> > > > Yeah, I think we can do it as we are doing the previous approach but
> > > > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > > > current patch has so that we can add error context information. This
> > > > is similar to why we have a wrapper for all other callbacks like
> > > > change_cb_wrapper.
> > > >
> > >
> > > Ultimately OutputPluginUpdateProgress() will be called - which in turn
> > > will call ctx->update_progress.
> > >
> >
> > No, update_progress_cb_wrapper() should directly call
> > ctx->update_progress(). The key reason to have a
> > update_progress_cb_wrapper() is that it allows us to add error context
> > information (see the usage of output_plugin_error_callback).
>
> I think it makes sense. This also avoids the need for every output plugin to
> implement the callback. So I tried to improve the patch based on this 
> approach.
>
> And I tried to add some comments for this new callback to distinguish it from
> ctx->update_progress.

Comments don't help when using cscope or some such code browsing tool.
Better to use a different variable name.

-- 
Best Wishes,
Ashutosh Bapat




Re: Logical replication timeout problem

2023-01-18 Thread Amit Kapila
On Wed, Jan 18, 2023 at 5:37 PM Ashutosh Bapat
 wrote:
>
> On Wed, Jan 18, 2023 at 1:49 PM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Wed, Jan 18, 2023 at 13:29 PM Amit Kapila  
> > wrote:
> > > On Tue, Jan 17, 2023 at 6:41 PM Ashutosh Bapat
> > >  wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > > >
> > > > > > I am a bit worried about the indirections that the wrappers and 
> > > > > > hooks
> > > > > > create. Output plugins call OutputPluginUpdateProgress() in 
> > > > > > callbacks
> > > > > > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > > > > > call OutputPluginUpdateProgress.
> > > > > >
> > > > >
> > > > > Yeah, I think we can do it as we are doing the previous approach but
> > > > > we need an additional wrapper (update_progress_cb_wrapper()) as the
> > > > > current patch has so that we can add error context information. This
> > > > > is similar to why we have a wrapper for all other callbacks like
> > > > > change_cb_wrapper.
> > > > >
> > > >
> > > > Ultimately OutputPluginUpdateProgress() will be called - which in turn
> > > > will call ctx->update_progress.
> > > >
> > >
> > > No, update_progress_cb_wrapper() should directly call
> > > ctx->update_progress(). The key reason to have a
> > > update_progress_cb_wrapper() is that it allows us to add error context
> > > information (see the usage of output_plugin_error_callback).
> >
> > I think it makes sense. This also avoids the need for every output plugin to
> > implement the callback. So I tried to improve the patch based on this 
> > approach.
> >
> > And I tried to add some comments for this new callback to distinguish it 
> > from
> > ctx->update_progress.
>
> Comments don't help when using cscope or some such code browsing tool.
> Better to use a different variable name.
>

+ /*
+ * Callback to be called when updating progress during sending data of a
+ * transaction (and its subtransactions) to the output plugin.
+ */
+ ReorderBufferUpdateProgressCB update_progress;

Are you suggesting changing the name of the above variable? If so, how
about apply_progress, progress, or updateprogress? If you don't like
any of these then feel free to suggest something else. If we change
the variable name then accordingly, we need to update
ReorderBufferUpdateProgressCB as well.

-- 
With Regards,
Amit Kapila.




Re: PL/Python: Fix return in the middle of PG_TRY() block.

2023-01-18 Thread Xing Guo
On Mon, Jan 16, 2023 at 11:29 PM Tom Lane  wrote:

> Xing Guo  writes:
> > Are there any unsafe codes in pltcl.c? The return statement is in the
> > PG_CATCH() block, I think the exception stack has been recovered in
> > PG_CATCH block so the return statement in PG_CATCH block should be ok?
>
> Yes, the stack has already been unwound at the start of a PG_CATCH
> (or PG_FINALLY) block, so there is no reason to avoid returning
> out of those.
>
> In principle you could also mess things up with a "continue", "break",
> or "goto" leading out of PG_TRY.  That's probably far less likely
> than "return", but I wonder whether Andres' compiler hack will
> catch that.
>
> regards, tom lane
>

Thank you Tom,

Based on your comments, I've refactored my clang checker[1]; now it can
warn about the following patterns:
1. return statements in PG_TRY(). We've caught all of them in this thread.
2. continue statements in PG_TRY() *unless* they are in for/while/do-while
statements.
3. break statements in PG_TRY() *unless* they are in for/while/do-while/switch
statements.
4. goto statements in PG_TRY() *unless* the label they point to is in the
same PG_TRY block.

The good news is that there are no instances of patterns 2, 3 or 4 in the
Postgres source tree, and we've caught all of the return statements in
PG_TRY blocks in this thread.

[1]
https://github.com/higuoxing/clang-plugins/blob/main/lib/ReturnInPgTryBlockChecker.cpp
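
For anyone wondering, pattern 1 (the one all the fixes in this thread address)
looks roughly like the made-up example below. Returning out of PG_TRY() leaves
PG_exception_stack pointing at a sigsetjmp frame that no longer exists, so a
later elog(ERROR) longjmps into garbage:

#include "postgres.h"

static void *
unsafe_example(bool bail_out_early)
{
    void   *result = NULL;

    PG_TRY();
    {
        if (bail_out_early)
            return NULL;    /* BAD: skips PG_END_TRY()'s restore of
                             * PG_exception_stack and error_context_stack */
        result = palloc(64);
    }
    PG_CATCH();
    {
        PG_RE_THROW();
    }
    PG_END_TRY();

    return result;          /* fine: PG_END_TRY() has already run */
}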

Best Regards,
Xing


Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Alvaro Herrera
On 2023-Jan-02, Karl O. Pinc wrote:

> Hi,
> 
> Attached is a patch: contrib_v1.patch
> 
> It modifies Appendix F, the contrib directory.
> 
> It adds brief text into the titles shown in the
> table of contents so it's easier to tell what
> each module does.  It also suffixes [trusted] or [obsolete]
> on the relevant titles.

This looks like a good idea to me.  I'm not 100% sold on having the "trusted"
or "obsolete" marker on the titles themselves, though.  I'm not sure what
alternative we have, other than leaving them out completely.

There's a typo "equalivent" in two places.

In passwordcheck, I would say just "check for weak passwords" or maybe
"verify password strength".

pg_buffercache is missing.  Maybe "-- inspect state of the Postgres
buffer cache".

For pg_stat_statements I suggest "track statistics of planning and
execution of SQL queries"

For sepgsql, as I understand it is strictly SELinux based, not just
"-like".  So this needs rewording: "label-based, SELinux-like, mandatory
access control".  Maybe "SELinux-based implementation of mandatory
access control for row-level security".

xml -- typo "qeurying"

> The sentences describing what the modules are and how
> to build them have been reworked.  Some split in 2,
> some words removed or replaced, etc.
> 
> I introduced the word "component" because the appendix
> has build instructions for command line programs as well
> as extensions and libraries loaded with shared_preload_libraries().
> This involved removing most occurrences of the word
> "module", although it is left in the section title.

I haven't read this part yet.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"But static content is just dynamic content that isn't moving!"
http://smylers.hates-software.com/2007/08/15/fe244d0c.html




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Andrew Dunstan


On 2023-01-17 Tu 11:30, Tom Lane wrote:
> Andrew Dunstan  writes:
>> FYI crake has just passed the test with flying colours.
> Cool.  I await the Windows machines' results with interest.


fairywren and drongo are clean except for fairywren upgrading 9.6 to 11.
This appears to be a longstanding issue that the fuzz processing was
causing us to ignore. See for example


It's somewhat interesting that this doesn't appear to be an issue with
the MSVC builds on drongo. And it disappears when upgrading to release
12 or later where we use the extra-float-digits=0 hack.

I propose to add this to just the release 11 AdjustUpgrade.pm:


    # float4 values in this table on Msys can have precision differences
    # in representation between old and new versions
    if ($old_version < 10 && $dbnames{contrib_regression_btree_gist} &&
    $^O eq 'msys')
    {
    _add_st($result, 'contrib_regression_btree_gist',
            'drop table if exists float4tmp');
    }


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Alvaro Herrera
Not related to this patch: it's very annoying that in the PDF output,
each section in the appendix doesn't start on a blank page -- which
means that the doc page for many modules starts in the middle of a page
where the previous one ends.  This is very ugly.  And then you get to
dblink, which contains a bunch of reference pages for the functions it
provides, and those *do* start a new page each.  So it looks quite
inconsistent.

I wonder if we can tweak something in the stylesheet to include a page
break.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The Postgresql hackers have what I call a "NASA space shot" mentality.
 Quite refreshing in a world of "weekend drag racer" developers."
(Scott Marlowe)




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-18 Thread David Geier

On 1/16/23 21:39, Pavel Stehule wrote:


On Mon, Jan 16, 2023 at 21:34, Tomas Vondra  wrote:


Hi,

there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.

As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:

 Planning time: 0.197 ms
 Execution time: 0.225 ms
 Timing source: clock_gettime (or tsc)

There has been a proposal to expose this as a GUC (or perhaps as
explain
option), to allow users to pick what timing source to use. I
wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.


+1


Thanks for looking at the patch.

I'll fix the merge conflict.

I like the idea of exposing the timing source in the EXPLAIN ANALYZE output.
It's a good tradeoff between inspectability and effort, given that RDTSC 
should always be better to use.

If there are no objections, I'll go this way.

--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-18 Thread David Geier

On 1/16/23 18:37, Andres Freund wrote:

Hi,

On 2023-01-02 14:28:20 +0100, David Geier wrote:

I'm doubtful this is worth the complexity it incurs. By the time we convert
out of the instr_time format, the times shouldn't be small enough that the
accuracy is affected much.
I don't feel strongly about it, and you have a point that we most likely
only convert once we've accumulated a fair amount of cycles.

Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
actually accumulate themselves, and should instead keep things in the
instr_time format and convert later. We'd win more accuracy / speed that way.

I don't think the introduction of pg_time_usec_t was a great idea, but oh
well.
Fully agreed. Why not replace pg_time_usec_t with instr_time in a
separate patch? I haven't looked carefully enough to know whether all
occurrences could sanely be replaced, but at least the ones that accumulate
time seem like good starting points.
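
To make the "accumulate in instr_time, convert once" point concrete, a
minimal sketch (do_work() is just an assumed stand-in for whatever is being
measured):

#include "postgres.h"
#include "portability/instr_time.h"

extern void do_work(void);      /* assumed workload */

static void
measure_loop(int nloops)
{
    instr_time  start;
    instr_time  end;
    instr_time  total;

    INSTR_TIME_SET_ZERO(total);

    for (int i = 0; i < nloops; i++)
    {
        INSTR_TIME_SET_CURRENT(start);
        do_work();
        INSTR_TIME_SET_CURRENT(end);
        /* stay in the native instr_time format while accumulating */
        INSTR_TIME_ACCUM_DIFF(total, end, start);
    }

    /* convert exactly once, when reporting */
    elog(LOG, "total: %.3f ms", INSTR_TIME_GET_MILLISEC(total));
}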

Additionally, I initialized a few variables of type instr_time which
otherwise resulted in warnings due to use of potentially uninitialized
variables.

Unless we decide, as I suggested downthread, that we deprecate
INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar
patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls.
I don't feel strongly about it, but like Tom I tend towards keeping the
initialization macro.
Thanks for improving on the first patch and fixing these issues in a
better way.

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that
it's consistent with the _MILLISEC() and _MICROSEC() variants?
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function or
also have the function return a double and cast where necessary at the call
site?

I think those should be a separate discussion / patch.


OK. I'll propose follow-on patches once we're done with the ones at hand.

I'll then rebase the RDTSC patches on your patch set.

--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-18 Thread David Geier

Hi,

@Andres: will you take care of these changes and provide me with an 
updated patch set so I can rebase the RDTSC changes?
Otherwise, I can also apply Tom suggestions to your patch set and send 
out the complete patch set.


--
David Geier
(ServiceNow)





Re: Issue with psql's create_help.pl under perlcritic

2023-01-18 Thread Andrew Dunstan


On 2023-01-18 We 03:50, Michael Paquier wrote:
> Hi all,
>
> A recent system update of a Debian SID host has begun to show me this
> issue:
> ./src/bin/psql/create_help.pl: Bareword dir handle opened at line 47,
> column 1.  See pages 202,204 of PBP.
> ([InputOutput::ProhibitBarewordDirHandles] Severity: 5)
>
> This issue gets fixed here as of the attached.
> Comments?

Looks fine. Interesting it's not caught by perlcritic on my Fedora 35
instance, nor my Ubuntu 22.04 instance.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Add LZ4 compression in pg_dump

2023-01-18 Thread Tomas Vondra
Hi,

On 1/16/23 16:14, gkokola...@pm.me wrote:
> Hi,
> 
> I admit I am completely at a loss as to what is expected from me anymore.
> 

:-(

I understand it's frustrating not to know why a patch is not moving
forward. Particularly when it seems fairly straightforward ...

Let me briefly explain my personal (and admittedly very subjective) view
on picking what patches to review/commit. I'm sure other committers have
other criteria, but maybe this will help.

There are always more patches than I can review/commit, so I have to
prioritize, and pick which patches to look at. For me, it's mostly about
cost/benefit of the patch. The cost is e.g. the amount of time I need to
spend to review/commit the stuff, maybe read the thread, etc. The benefit
is mainly the new features/improvements.

It's oversimplified; we could talk about various bits that contribute to
the costs and benefits, but this is what it boils down to.

There's always the aspect of time: if patches A and B have roughly the
same benefits, but with A we get it "immediately" while B requires
additional parts that we don't have ready yet (and if they don't make it,
we get no benefit), I'll probably pick A.

Unfortunately, this plays against this patch - I'm certainly in favor of
adding lz4 (and other compression algos) into pg_dump, but if I commit
0001 we get little benefit, and the other parts actually adding lz4/zstd
are treated as "WIP / for completeness" so it's unclear when we'd get to
commit them.

So if I could recommend one thing, it'd be to get at least one of those
WIP patches into a shape that's likely committable right after 0001.

> I had posted v19-0001 for a committer's consideration and v19-000{2,3} for 
> completeness.
> Please find a rebased v20 attached.
> 

I took a quick look at 0001, so a couple comments (sorry if some of this
was already discussed in the thread):

1) I don't think a "refactoring" patch should reference particular
compression algorithms (lz4/zstd), and in particular I don't think we
should have "not yet implemented" messages. We only have a couple other
places doing that, when we didn't have a better choice. But here we can
simply reject the algorithm when parsing the options, we don't need to
do that in a dozen other places.

2) I wouldn't reorder the cases in WriteDataToArchive, i.e. I'd keep
"none" at the end. It might make backpatches harder.

3) While building, I get bunch of warnings about missing cfdopen()
prototype and pg_backup_archiver.c not knowing about cfdopen() and
adding an implicit prototype (so I doubt it actually works).

4) "cfp" struct no longer wraps gzFile, but the comment was not updated.
FWIW I'm not sure switching to "void *" is an improvement, maybe it'd be
better to have a "union" of correct types?

5) cfopen/cfdopen are missing comments. cfopen_internal has an updated
comment, but that's a static function while cfopen/cfdopen are the
actual API.
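
To make point 4 concrete, a rough sketch of the kind of tagged union I have
in mind (field names are made up, this is not the code in the patch):

#include "postgres.h"
#include "common/compression.h"
#include <zlib.h>

typedef struct cfp
{
    pg_compress_algorithm algorithm;    /* which union member is valid */
    union
    {
        FILE       *fp;         /* PG_COMPRESSION_NONE */
        gzFile      gzfp;       /* PG_COMPRESSION_GZIP */
        /* lz4/zstd members would be added by the follow-up patches */
    }           handle;
} cfp;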

> Also please let me know if I should silently step away from it and let other 
> people lead
> it. I would be glad to comply either way.
> 

Please don't. I promise to take a look at this patch again.

Thanks for doing all the work.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

ReadDataFromArchiveZlib


compress_io.c:605:1: warning: no previous prototype for ‘cfdopen’ [-Wmissing-prototypes]
  605 | cfdopen(int fd, const char *mode,
  | ^~~
pg_backup_archiver.c: In function ‘SetOutput’:
pg_backup_archiver.c:1528:26: warning: implicit declaration of function ‘cfdopen’; did you mean ‘cfopen’? [-Wimplicit-function-declaration]
 1528 | AH->OF = cfdopen(dup(fn), mode, compression_spec);
  |  ^~~
  |  cfopen
pg_backup_archiver.c:1528:24: warning: assignment to ‘void *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
 1528 | AH->OF = cfdopen(dup(fn), mode, compression_spec);
  |^
pg_backup_archiver.c: In function ‘_allocAH’:
pg_backup_archiver.c:2236:16: warning: assignment to ‘void *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
 2236 | AH->OF = cfdopen(dup(fileno(stdout)), PG_BINARY_A, out_compress_spec);
  |^


Re: ANY_VALUE aggregate

2023-01-18 Thread Peter Eisentraut

On 05.12.22 21:18, Vik Fearing wrote:

On 12/5/22 15:57, Vik Fearing wrote:
The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It 
returns an implementation-dependent (i.e. non-deterministic) value 
from the rows in its group.


PFA an implementation of this aggregate.


Here is v2 of this patch.  I had forgotten to update sql_features.txt.


In your patch, the documentation says the definition is any_value("any") 
but the catalog definitions are any_value(anyelement).  Please sort that 
out.


Since the transition function is declared strict, null values don't need 
to be checked.  I think the whole function could be reduced to


Datum
any_value_trans(PG_FUNCTION_ARGS)
{
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}





Re: [DOCS] Stats views and functions not in order?

2023-01-18 Thread David G. Johnston
On Wed, Jan 18, 2023 at 3:36 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 11.01.23 07:11, Peter Smith wrote:
> > v9-0003 --> v10-0001
> >
> >> I'm not sure if anything is pending for v9-0003, if there is something
> >> pending, please post an updated patch for the same.
> >
> > Thanks for the reminder. PSA v10.
>
> So this patch changes some sections describing system views to
> refentry's.  What is the reason for that?  refentry's are basically man
> pages; do we want man pages for each system view?
>

I didn't really consider the effect this might have on man pages.  I knew
it would produce the desired effect in the HTML and assumed it would
produce an acceptable effect in the PDF.  I was going for the HTML effect
of having these views chunked into their own pages, with any other changes
being non-detrimental.  Inspecting the DocBook configuration showed that
sect1 and refentry had this effect.  Using sect1 is not possible in this
part of the documentation.


>
> Maybe (*), but then we should also do the same to all the other system
> views, all the system catalogs, everything else.  Just changing a few in
> a single place seems odd.
>
> (*) -- but also maybe not?
>
>
I could see those who use man pages being pleased to have ready access to
these core building blocks of the system.  I am not one of those people,
using the website exclusively.  If there is a champion of man pages here
who wants to ensure that changes in this area work well there, this patch
would be better for it.

I really want a one-page-per-view output on the website in this section.
This is the only way I could see getting to that point (as noted upthread,
system catalogs don't have this problem because they can be marked up as
sect1).  The existing side effect is, for me, an acceptable
trade-off situation.  If you want to provide a statement for why these are
special, it's because they are in the System Monitoring chapter instead of
System Internals and the man pages don't cover system internals...

I'm not opposed to alternative markup that gets the pagination job done,
though it likely involves tool-chain configuration/modifications. There is
a nearby thread where this is being done presently so maybe if refentry is
a commit-blocker there is still hope, but it is presently outside my
capability.  I'm after the pagination and have no current preference as to
how it is technically accomplished.

David J.


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-18 Thread Justin Pryzby
TBH, I think the best approach is what I did in:
0001-report-top-parent-progress-for-CREATE-INDEX.txt

That's a minimal patch, ideal for backpatching.

..which defines/clarifies that the progress reporting is only for
*direct* children.  That avoids the need to change any data structures,
and it's what was probably intended by the original patch, which doesn't
seem to have considered intermediate partitioned tables.

I think it'd be fine to re-define that in some future release, to allow
showing indirect children (probably only "leaves", and not intermediate
partitioned tables).  Or "total_bytes" or other global progress.

-- 
Justin




Re: Remove source code display from \df+?

2023-01-18 Thread Isaac Morland
On Wed, 18 Jan 2023 at 00:00, Pavel Stehule  wrote:

>
> On Tue, Jan 17, 2023 at 20:29, Isaac Morland  wrote:
>
>>
>> I welcome comments and feedback. Now to try to find something manageable
>> to review.
>>
>
> looks good
>
> you missed updating the psql documentation
>
> https://www.postgresql.org/docs/current/app-psql.html
>
> If the form \df+ is used, additional information about each function is
> shown, including volatility, parallel safety, owner, security
> classification, access privileges, language, source code and description.
>

Thanks, and sorry about that, it just completely slipped my mind. I’ve
attached a revised patch with a slight update of the psql documentation.

> you should assign your patch to the commitfest app
>
> https://commitfest.postgresql.org/
>

I thought I had: https://commitfest.postgresql.org/42/4133/

Did I miss something?


0001-Remove-source-code-display-from-df-v2.patch
Description: Binary data


Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
Andrew Dunstan  writes:
> fairywren and drongo are clean except for fairywren upgrading 9.6 to 11.
> This appears to be a longstanding issue that the fuzz processing was
> causing us to ignore. See for example
> 

Interesting.  I suspected that removing the fuzz allowance would teach
us some things we hadn't known about.

> I propose to add this to just the release 11 AdjustUpgrade.pm:
>     # float4 values in this table on Msys can have precision differences
>     # in representation between old and new versions
>     if ($old_version < 10 && $dbnames{contrib_regression_btree_gist} &&
>     $^O eq 'msys')
>     {
>     _add_st($result, 'contrib_regression_btree_gist',
>             'drop table if exists float4tmp');
>     }

Seems reasonable (but I wonder if you don't need "$old_version < 11").
A nicer answer would be to apply --extra-float-digits=0 across the
board, but pre-v12 pg_dump lacks that switch.

regards, tom lane




Re: [DOCS] Stats views and functions not in order?

2023-01-18 Thread Tom Lane
"David G. Johnston"  writes:
> ...  I was going for the html effect
> of having these views chunked into their own pages, any other changes being
> non-detrimental.

But is that a result we want?  It will for example break any bookmarks
that people might have for these documentation entries.  It will also
pretty thoroughly break the cross-version navigation links in this
part of the docs.

Maybe the benefit is worth those costs, but I'm entirely not convinced
of that.  I think we need to tread pretty lightly when rearranging
longstanding documentation-layout decisions.

regards, tom lane




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-18 Thread Tom Lane
Dean Rasheed  writes:
> I think the behaviour of an ORDER BY in the query can also be pretty
> surprising.

Indeed.  The fundamental question is this: in

> SELECT ARRAY[random(), random(), random()]
> FROM generate_series(1, 3)
> ORDER BY random();

are those four occurrences of random() supposed to refer to the
same value, or not?  This only matters for volatile functions
of course; with stable or immutable functions, textually-equal
subexpressions should have the same value in any given row.

It is very clear what we are supposed to do for

SELECT random() FROM ... ORDER BY 1;

which sadly isn't legal SQL anymore.  It gets fuzzy as soon
as we have

SELECT random() FROM ... ORDER BY random();

You could make an argument either way for those being the
same value or not, but historically we've concluded that
it's more useful to deem them the same value.  Then the
behavior you show is not such a surprising extension,
although it could be argued that such matches should only
extend to identical top-level targetlist entries.

> The trouble is, if we tried to fix that, we'd risk changing some other
> behaviour that users may have come to rely on.

Yeah.  I'm hesitant to try to adjust semantics here;
we're much more likely to get complaints than kudos.

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-18 Thread Masahiko Sawada
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
 wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada  wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> >  wrote:
>
> Attached is an update that mostly has the modest goal of getting CI green 
> again. v19-0003 has squashed the entire radix tree template from previously. 
> I've kept out the perf test module for now -- still needs updating.
>
> > > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > > external symbol pg_popcount64
> > > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > > fatal error LNK1120: 1 unresolved externals
> >
> > Yeah, I'm not sure what's causing that. Since that comes from a debugging 
> > function, we could work around it, but it would be nice to understand why, 
> > so I'll probably have to experiment on my CI repo.
>
> I'm still confused by this error, because it only occurs in the test module. 
> I successfully built with just 0002 in CI so elsewhere where bmw_* symbols 
> resolve just fine on all platforms. I've worked around the error in v19-0004 
> by using the general-purpose pg_popcount() function. We only need to count 
> bits in assert builds, so it doesn't matter a whole lot.

I spent today investigating this issue, and I found out that on Windows,
libpgport_src.a is not linked when building code outside of
src/backend unless it is linked explicitly. It's not a problem on Linux
etc., but the linker raises a fatal error on Windows. I'm not sure of the
right way to fix it, but the attached patch resolved the issue on
cfbot. Since it seems not to be related to the 0002 patch, but may rather
be the designed behavior or a problem in meson, we can discuss it on a
separate thread.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


link_pgport_src.patch
Description: Binary data


Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Karl O. Pinc
On Wed, 18 Jan 2023 13:30:45 +0100
Alvaro Herrera  wrote:

> Not related to this patch: it's very annoying that in the PDF output,
> each section in the appendix doesn't start on a blank page -- which
> means that the doc page for many modules starts in the middle of a
> page were the previous one ends.

> I wonder if we can tweak something in the stylesheet to include a page
> break.

Would this be something to be included in this patch?
(If I can figure it out.)

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Robert Haas
On Tue, Jan 17, 2023 at 5:56 PM Peter Geoghegan  wrote:
> > Why do you think that?
>
> For the reasons I gave about statistics, random sampling, the central
> limit theorem. All that stuff. This matches the experience of Andres.
> And is obviously the only explanation behind the reliance on
> antiwraparound autovacuums for cleaning up bloat in larger OLTP
> databases. It just fits: the dead tuples approach can sometimes be so
> completely wrong that even an alternative triggering condition based
> on something that is virtually unrelated to the thing we actually care
> about can do much better in practice. Consistently, reliably, for a
> given table/workload.

Hmm, I don't know. I have no intuition one way or the other for
whether we're undercounting dead tuples, and I don't understand what
would cause us to do that. I thought that we tracked that accurately,
as part of the statistics system, not by sampling
(pg_stat_all_tables.n_dead_tup).

But, I think there are a number of other explanations for why we tend
to rely on antiwraparound vacuums more than we should.
Auto-cancellation. Skipping tables that are locked, or pages that are
pinned. A cost limit that is too low relative to the size of the
database, so that eventually all tables are in wraparound danger all
the time. The fact that we can vacuum tables uselessly, without
accomplishing anything, because the XID horizon is too new, but we
don't know that so we just try to vacuum anyway. And then we repeat
that useless work in an infinite loop. The fact that the system's idea
of when a vacuum needs to happen grows with
autovacuum_vacuum_scale_factor, but that actually gets too big too
fast, so that eventually it never triggers vacuuming at all, or at
least not before XID age does.

I think we ought to fire autovacuum_vacuum_scale_factor out of an
airlock. It's not the right model, and I think many people have been
aware that it's not the right model for a decade, and we haven't been
creative enough to come up with anything better. We *know* that you
have to lower this value for large tables or they just don't get
vacuumed often enough. That means we have some idea how often they
ought to be vacuumed. I'm sure I'm not the person who has the best
intuition on that point, but I bet people who have been responsible
for large production systems have some decent ideas in that area. We
should find out what that intuition is and come up with a new formula
that matches the intuition of people with experience doing this sort
of thing.

e.g.

1. When computing autovacuum_vacuum_threshold + table_size *
autovacuum_vacuum_scale_factor, if the result exceeds the value of a
new parameter autovacuum_vacuum_maximum_threshold, then clamp the
result to that value.

2. When computing autovacuum_vacuum_threshold + table_size *
autovacuum_vacuum_scale_factor, if the result exceeds 80% of the
number of dead TIDs we could store, clamp it to that number.

3. Change the formula altogether to use a square root or a cube root
or a logarithm somewhere.
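
To make (1) a bit more concrete, a rough sketch (not actual autovacuum.c
code; vac_max_thresh stands in for a hypothetical new
autovacuum_vacuum_maximum_threshold GUC, with -1 meaning "no clamp"):

static float4
vacuum_threshold_clamped(float4 reltuples,
                         int vac_base_thresh,       /* autovacuum_vacuum_threshold */
                         float8 vac_scale_factor,   /* autovacuum_vacuum_scale_factor */
                         int vac_max_thresh)        /* hypothetical new GUC */
{
    float4      vacthresh;

    vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;

    if (vac_max_thresh >= 0 && vacthresh > (float4) vac_max_thresh)
        vacthresh = (float4) vac_max_thresh;

    return vacthresh;
}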

I think we also ought to invent some sort of better cost limit system
that doesn't shoot you in the foot automatically as the database
grows. Nobody actually wants to limit the rate at which the database
vacuums stuff to a constant. What they really want to do is limit it
to a rate that is somewhat faster than the minimum rate needed to
avoid disaster. We should try to develop metrics for whether vacuum is
keeping up. I think one good one would be duty cycle -- if we have N
vacuum workers, then over a period of K seconds we could have done as
much as N*K process-seconds of vacuum work, and as little as 0. So
figure out how many seconds of vacuum work we actually did, and divide
that by N*K to get a percentage. If it's over, say, 90%, then we are
not keeping up. We should dynamically raise the cost limit until we
do. And drop it back down later when things look better.
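
For illustration, the duty-cycle computation could be as simple as something
like this (all names made up):

static double
autovacuum_duty_cycle(double busy_worker_seconds,   /* measured vacuum work */
                      int n_workers,                /* autovacuum_max_workers */
                      double interval_seconds)      /* the "K" above */
{
    double      capacity = (double) n_workers * interval_seconds;

    if (capacity <= 0.0)
        return 0.0;

    /* e.g. > 0.9 would mean "not keeping up": raise the cost limit */
    return busy_worker_seconds / capacity;
}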

I don't actually see any reason why dead tuples, even counted in a
relatively stupid way, isn't fundamentally good enough to get all
tables vacuumed before we hit the XID age cutoff. It doesn't actually
do that right now, but I feel like that must be because we're doing
other stupid things, not because there's anything that terrible about
the metric as such. Maybe that's wrong, but I find it hard to imagine.
If I imagine a world where vacuum always gets started when the number
of dead tuples hits some reasonable bound (rather than the
unreasonable bound that the scale factor stuff computes) and it always
cleans up those dead tuples (instead of doing a lot of work to clean
up nothing at all, or doing a lot of work to clean up only a small
fraction of those dead tuples, or cancelling itself, or skipping the
table that has the problem because it's locked, or running with an
unreasonably low cost limit, or otherwise being unable to GET THE JOB
DONE) then how do we ever reach autovacuum_freeze_max_age? I think it
would still be possible, but only if the XID consumption rate

Re: [DOCS] Stats views and functions not in order?

2023-01-18 Thread David G. Johnston
On Wed, Jan 18, 2023 at 8:38 AM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > ...  I was going for the html effect
> > of having these views chunked into their own pages, any other changes
> being
> > non-detrimental.
>
> But is that a result we want?  It will for example break any bookmarks
> that people might have for these documentation entries.  It will also
> pretty thoroughly break the cross-version navigation links in this
> part of the docs.


> Maybe the benefit is worth those costs, but I'm entirely not convinced
> of that.  I think we need to tread pretty lightly when rearranging
> longstanding documentation-layout decisions.
>
>
Fair points.

The external linking can be solved with redirect rules, as I believe we've
done before, and fairly recently.  Even if not, I think when they see why
the break happened they will be happy with the improved user experience.

I do think it is important enough a change to warrant breaking the
cross-version navigation links.  I can imagine a linking scheme that would
still work but I'm doubtful that this is important enough to expend the
development effort.

David J.


Re: ANY_VALUE aggregate

2023-01-18 Thread Tom Lane
Peter Eisentraut  writes:
> On 05.12.22 21:18, Vik Fearing wrote:
>> On 12/5/22 15:57, Vik Fearing wrote:
>>> The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It 
>>> returns an implementation-dependent (i.e. non-deterministic) value 
>>> from the rows in its group.

> Since the transition function is declared strict, null values don't need 
> to be checked.

Hmm, but should it be strict?  That means that what it's returning
is *not* "any value" but "any non-null value".  What does the draft
spec have to say about that?

regards, tom lane




Re: ANY_VALUE aggregate

2023-01-18 Thread Vik Fearing

On 1/18/23 16:55, Tom Lane wrote:

Peter Eisentraut  writes:

On 05.12.22 21:18, Vik Fearing wrote:

On 12/5/22 15:57, Vik Fearing wrote:

The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It
returns an implementation-dependent (i.e. non-deterministic) value
from the rows in its group.



Since the transition function is declared strict, null values don't need
to be checked.


Hmm, but should it be strict?  That means that what it's returning
is *not* "any value" but "any non-null value".  What does the draft
spec have to say about that?


It falls into the same category as AVG() etc.  That is, nulls are 
removed before calculation.

--
Vik Fearing





Re: Improve GetConfigOptionValues function

2023-01-18 Thread Tom Lane
Nitin Jadhav  writes:
> GetConfigOptionValues function extracts the config parameters for the
> given variable irrespective of whether it results in noshow or not.
> But the parent function show_all_settings ignores the values parameter
> if it results in noshow. It's unnecessary to fetch all the values
> during noshow. So a return statement in GetConfigOptionValues() when
> noshow is set to true is needed. Attached the patch for the same.
> Please share your thoughts.

I do not think this is an improvement: it causes GetConfigOptionValues
to be making assumptions about how its results will be used.  If
show_all_settings() were a big performance bottleneck, and there were
a lot of no-show values that we could optimize, then maybe the extra
coupling would be worthwhile.  But I don't believe either of those
things.

Possibly a better answer is to refactor into separate functions,
along the lines of

static bool
ConfigOptionIsShowable(struct config_generic *conf)

static void
GetConfigOptionValues(struct config_generic *conf, const char **values)
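
For concreteness, a very rough sketch of that split (illustrative only;
GUC_NO_SHOW_ALL, GUC_SUPERUSER_ONLY and pg_read_all_settings are the
existing symbols, the bodies are simplified):

static bool
ConfigOptionIsShowable(struct config_generic *conf)
{
    if (conf->flags & GUC_NO_SHOW_ALL)
        return false;

    if ((conf->flags & GUC_SUPERUSER_ONLY) &&
        !has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_SETTINGS))
        return false;

    return true;
}

static void
GetConfigOptionValues(struct config_generic *conf, const char **values)
{
    /* fill in values[] unconditionally; no noshow logic here anymore */
}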

regards, tom lane




Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Robert Haas
On Tue, Jan 17, 2023 at 7:15 PM Nathan Bossart  wrote:
> Great.  Here is a first attempt at the patch.

In general, looks good. I think this will often call HaveNFreeProcs
twice, though, and that would be better to avoid, e.g.

if (!am_superuser && !am_walsender &&
    (SuperuserReservedBackends + ReservedBackends) > 0 &&
    !HaveNFreeProcs(SuperuserReservedBackends + ReservedBackends))
{
    if (!HaveNFreeProcs(SuperuserReservedBackends))
        remaining connection slots are reserved for non-replication
        superuser connections;
    if (!has_privs_of_role(GetUserId(), ROLE_PG_USE_RESERVED_CONNECTIONS))
        remaining connection slots are reserved for roles with
        privileges of pg_use_reserved_backends;
}

In the common case where we hit neither limit, this only counts free
connection slots once. We could do even better by making
HaveNFreeProcs have an out parameter for the number of free procs
actually found when it returns false, but that's probably not
important.

I don't think that we should default both the existing GUC and the new
one to 3, because that raises the default limit in the case where the
new feature is not used from 3 to 6. I think we should default one of
them to 0 and the other one to 3. Not sure which one should get which
value.

> > I think the documentation will need some careful wordsmithing,
> > including adjustments to superuser_reserved_connections. We want to
> > recast superuser_reserved_connections as a final reserve to be touched
> > after even reserved_connections has been exhausted.
>
> I tried to do this, but there is probably still room for improvement,
> especially for the parts that discuss the relationship between
> max_connections, superuser_reserved_connections, and reserved_connections.

I think it's pretty good the way you have it. I agree that there might
be a way to make it even better, but I don't think I know what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-01-18 Thread Tomas Vondra
Hi,

I took a quick look at this patch, to see if there's something we
want/can get into v16. The last version was submitted about 9 months
ago, and it doesn't apply cleanly anymore, but the bitrot is fairly
minor. Not sure there's still interest, though.

As for the patch, I wonder if it's unnecessarily complex. It adds *two*
timestamps for each pg_stat_statements entry - one for reset of the
whole entry, one for reset of "min/max" times only.

I can see why the first timestamp (essentially tracking creating of the
entry) is useful. I'd probably call it "created_at" or something like
that, but that's a minor detail. Or maybe stats_reset, which is what we
use in pgstat?

But is the second timestamp for the min/max fields really useful? AFAIK
to perform analysis, people take regular pg_stat_statements snapshots,
which works fine for counters (calculating deltas) but not for gauges
(which need a reset, to track fresh values). But people analyzing this
are already resetting the whole entry, and so the snapshots already are
tracking deltas.

So I'm not convinced we actually need the second timestamp.

A couple more comments:

1) I'm not sure why the patch is adding tests of permissions on the
pg_stat_statements_reset function?

2) If we want the second timestamp, shouldn't it also cover resets of
the mean values, not just min/max?

3) I don't understand why the patch is adding "IS NOT NULL AS t" to
various places in the regression tests.

4) I rather dislike the "minmax" naming, because that's often used in
other contexts (for BRIN indexes), and as I mentioned maybe it should
also cover the "mean" fields.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improve GetConfigOptionValues function

2023-01-18 Thread Bharath Rupireddy
On Wed, Jan 18, 2023 at 9:44 PM Tom Lane  wrote:
>
> Possibly a better answer is to refactor into separate functions,
> along the lines of
>
> static bool
> ConfigOptionIsShowable(struct config_generic *conf)
>
> static void
> GetConfigOptionValues(struct config_generic *conf, const char **values)

+1 and ConfigOptionIsShowable() function can replace explicit showable
checks in two other places too.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: ANY_VALUE aggregate

2023-01-18 Thread Vik Fearing

On 1/18/23 16:06, Peter Eisentraut wrote:

On 05.12.22 21:18, Vik Fearing wrote:

On 12/5/22 15:57, Vik Fearing wrote:
The SQL:2023 Standard defines a new aggregate named ANY_VALUE.  It 
returns an implementation-dependent (i.e. non-deterministic) value 
from the rows in its group.


PFA an implementation of this aggregate.


Here is v2 of this patch.  I had forgotten to update sql_features.txt.


In your patch, the documentation says the definition is any_value("any") 
but the catalog definitions are any_value(anyelement).  Please sort that 
out.


Since the transition function is declared strict, null values don't need 
to be checked.


Thank you for the review.  Attached is a new version rebased to d540a02a72.
--
Vik Fearing
From 9cf2c5b56ea38d3080c0cb9f8ef9e6229d8696b4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 9 Apr 2022 00:07:38 +0200
Subject: [PATCH] Implement ANY_VALUE aggregate

SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an
implementation-dependent (i.e. non-deterministic) value from the
aggregated rows.
---
 doc/src/sgml/func.sgml   | 14 ++
 src/backend/catalog/sql_features.txt |  1 +
 src/backend/utils/adt/misc.c | 13 +
 src/include/catalog/pg_aggregate.dat |  4 
 src/include/catalog/pg_proc.dat  |  8 
 src/test/regress/expected/aggregates.out | 24 
 src/test/regress/sql/aggregates.sql  |  6 ++
 7 files changed, 70 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b8dac9ef46..8ff9decfec 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19735,6 +19735,20 @@ SELECT NULLIF(value, '(none)') ...
  
 
  
+  
+   
+
+ any_value
+
+any_value ( anyelement )
+same as input type
+   
+   
+Chooses a non-deterministic value from the non-null input values.
+   
+   Yes
+  
+
   

 
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index abad216b7e..dfd3882801 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -520,6 +520,7 @@ T622	Trigonometric functions			YES
 T623	General logarithm functions			YES	
 T624	Common logarithm functions			YES	
 T625	LISTAGG			NO	
+T626	ANY_VALUE			YES	
 T631	IN predicate with one list element			YES	
 T641	Multiple column assignment			NO	only some syntax variants supported
 T651	SQL-schema statements in SQL routines			YES	
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 220ddb8c01..a9251f977e 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -1041,3 +1041,16 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS)
 	else
 		PG_RETURN_NULL();
 }
+
+/*
+ * Transition function for the ANY_VALUE aggregate
+ *
+ * Currently this just returns the first value, but in the future it might be
+ * able to signal to the aggregate that it does not need to be called anymore.
+ */
+Datum
+any_value_trans(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_DATUM(PG_GETARG_DATUM(0));
+}
+
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 8c957437ea..aac60dee58 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -625,4 +625,8 @@
   aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w',
   aggmfinalmodify => 'w', aggtranstype => 'internal' },
 
+# any_value
+{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans',
+  aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' },
+
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 86eb8e8c58..95e760440e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11891,4 +11891,12 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '8981', descr => 'arbitrary value from among input values',
+  proname => 'any_value', prokind => 'a', proisstrict => 'f',
+  prorettype => 'anyelement', proargtypes => 'anyelement',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8982', descr => 'any_value transition function',
+  proname => 'any_value_trans', prorettype => 'anyelement', proargtypes => 'anyelement anyelement',
+  prosrc => 'any_value_trans' },
+
 ]
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index ae3b905331..b240ef522b 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -25,6 +25,24 @@ SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
  32.6667
 (1 row)
 
+SELECT any_value(v) FROM (VALUES (1)) AS v (v);
+ any_value 
+---
+ 1
+(1 row)
+
+SELECT any_value(v) FROM (VALUES (NULL)) AS v (v);
+ any_value 
+---
+ 
+(1 row)
+
+SE

CREATEROLE users vs. role properties

2023-01-18 Thread Robert Haas
On Mon, Jan 16, 2023 at 2:29 PM Robert Haas  wrote:
> 1. It's still possible for a CREATEROLE user to hand out role
> attributes that they don't possess. The new prohibitions in
> cf5eb37c5ee0cc54c80d95c1695d7fca1f7c68cb prevent a CREATEROLE user
> from handing out membership in a role on which they lack sufficient
> permissions, but they don't prevent a CREATEROLE user who lacks
> CREATEDB from creating a new user who does have CREATEDB. I think we
> should subject the CREATEDB, REPLICATION, and BYPASSRLS attributes to
> the same rule that we now use for role memberships: you've got to have
> the property in order to give it to someone else. In the case of
> CREATEDB, this would tighten the current rules, which allow you to
> give out CREATEDB without having it. In the case of REPLICATION and
> BYPASSRLS, this would liberalize the current rules: right now, a
> CREATEROLE user cannot give REPLICATION or BYPASSRLS to another user
> even if they possess those attributes.
>
> This proposal doesn't address the CREATEROLE or CONNECTION LIMIT
> properties. It seems possible to me that someone might want to set up
> a CREATEROLE user who can't make more such users, and this proposal
> doesn't manufacture any way of doing that. It also doesn't let you
> constraint the ability of a CREATEROLE user to set a CONNECTION LIMIT
> for some other user. I think that's OK. It might be nice to have ways
> of imposing such restrictions at some point in the future, but it is
> not very obvious what to do about such cases and, importantly, I don't
> think there's any security impact from failing to address those cases.
> If a CREATEROLE user without CREATEDB can create a new role that does
> have CREATEDB, that's a privilege escalation. If they can hand out
> CREATEROLE, that isn't: they already have it.

Here is a patch implementing the above proposal. Since this is fairly
closely related to already-committed work, I would like to get this
into v16. That way, all the changes to how CREATEROLE works will go
into a single release, which seems less confusing for users. It is
also fairly clear to me that this is an improvement over the status
quo. Sometimes things that seem clear to me turn out to be false, so
if this change seems like a problem to you, please let me know.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Adjust-interaction-of-CREATEROLE-with-role-proper.patch
Description: Binary data


Re: Implement missing join selectivity estimation for range types

2023-01-18 Thread Tomas Vondra
Hello Mahmoud,

Thanks for the patch and sorry for not taking a look earlier.

On 6/30/22 16:31, Mahmoud Sakr wrote:
> Hi,
> Given a query:
> SELECT * FROM t1, t2 WHERE t1.r << t2.r
> where t1.r, t2.r are of range type,
> currently PostgreSQL will estimate a constant selectivity for the << 
> predicate,
> which is equal to 0.005, not utilizing the statistics that the optimizer
> collects for range attributes.
> 
> We have worked out a theory for inequality join selectivity estimation
> (http://arxiv.org/abs/2206.07396), and implemented it for range
> types it in this patch.
> 

Interesting. Are there any particular differences compared to how we
estimate for example range clauses on regular columns?

> The algorithm in this patch re-uses the currently collected statistics for
> range types, which is the bounds histogram. It works fairly accurate for the
> operations <<, >>, &&, &<, &>, <=, >= with estimation error of about 0.5%.

Right. I think 0.5% is roughly expected for the default statistics
target, which creates 100 histogram bins, each representing ~1% of the
values. Which on average means ~0.5% error.

> The patch also implements selectivity estimation for the
> operations @>, <@ (contains and is contained in), but their accuracy is not
> stable, since the bounds histograms assume independence between the range
> bounds. A point to discuss is whether or not to keep these last two 
> operations.

That's a good question. I think the independence assumption is rather
foolish in this case, so I wonder if we could "stabilize" this by making
some different - less optimistic - assumption. Essentially, we have an
estimates for lower/upper boundaries:

  P1 = P(lower(var1) <= lower(var2))
  P2 = P(upper(var2) <= upper(var1))

and independence means we take (P1*P2). But maybe we should be very
pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
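
Just to make the alternatives concrete, a tiny illustrative sketch (not
proposing this exact code):

static double
combine_bound_probs(double p1, double p2, bool pessimistic)
{
    if (pessimistic)
        return Min(p1, p2);     /* no-independence upper bound */

    /* "something in between" could be e.g. the geometric mean sqrt(p1 * p2) */

    return p1 * p2;             /* current independence assumption */
}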

Another option is to use the length histogram, right? I mean, we know
what the average length is, and it should be possible to use that to
calculate how "far" ranges in a histogram can overlap.

> The patch also includes the selectivity estimation for multirange types,
> treating a multirange as a single range which is its bounding box.
> 

OK. But ideally we'd cross-check elements of the two multiranges, no?

> The same algorithm in this patch is applicable to inequality joins of scalar
> types. We, however, don't implement it for scalars, since more work is needed
> to make use of the other statistics available for scalars, such as the MCV.
> This is left as a future work.
> 

So if the column(s) contain a couple very common (multi)ranges that make
it into an MCV, we'll ignore those? That's a bit unfortunate, because
those MCV elements are potentially the main contributors to selectivity.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-18 Thread Ankit Kumar Pandey




On 18/01/23 15:12, David Rowley wrote:



I also thought I'd better test that foreach_delete_current() works
with foreach_reverse(). I can confirm that it *does not* work
correctly.  I guess maybe you only tested the fact that it deleted the
current item and not that the subsequent loop correctly went to the
item directly before the deleted item. There's a problem with that. We
skip an item.


Hmm, not really sure why I missed that. I tried this again (added the
following in postgres.c above PortalStart):

List       *l = NIL;
ListCell   *lc;

l = lappend_int(l, 1);
l = lappend_int(l, 2);
l = lappend_int(l, 3);
l = lappend_int(l, 4);

foreach_reverse(lc, l)
{
    if (foreach_current_index(lc) == 2) /* delete 3 */
    {
        foreach_delete_current(l, lc);
    }
}

foreach(lc, l)
{
    int         i = lfirst_int(lc);

    ereport(LOG, (errmsg("%d", i)));
}

Got result:
2023-01-18 20:23:28.115 IST [51007] LOG:  1
2023-01-18 20:23:28.115 IST [51007] STATEMENT:  select pg_backend_pid();
2023-01-18 20:23:28.115 IST [51007] LOG:  2
2023-01-18 20:23:28.115 IST [51007] STATEMENT:  select pg_backend_pid();
2023-01-18 20:23:28.115 IST [51007] LOG:  4
2023-01-18 20:23:28.115 IST [51007] STATEMENT:  select pg_backend_pid();

I had expected list_delete_cell to take care of the rest.


Instead of fixing that, I think it's likely better just to loop
backwards manually with a for() loop, so I've adjusted the patch to
work that way.  It's quite likely that the additional code in
foreach() and what was in foreach_reverse() is slower than looping
manually due to the additional work those macros do to set the cell to
NULL when we run out of cells to loop over.


Okay, the current version looks fine as well.


I made another pass over the v7 patch and fixed a bug that was
disabling the optimization when the deepest WindowAgg had a
runCondition.  This should have been using llast_node instead of
linitial_node.  The top-level WindowAgg is the last in the list.



I also made a pass over the tests and added a test case for the
runCondition check to make sure we disable the optimization when the
top-level WindowAgg has one of those.  


I wasn't sure what your test comments' case numbers were meant to represent.
They were not aligned with the code comments that document when the
optimisation is disabled; they started out aligned, but seemed to go off the
rails at #3. I've now made them follow the comments in
create_one_window_path() and made it more clear what we expect the test
outcome to be in each case.


Those were just numbers for the exceptional cases; making them in sync
with the comments wasn't really on my mind, but now they look better.


I've attached the v9 patch. I feel like this patch is quite
self-contained and I'm quite happy with it now.  If there are no
objections soon, I'm planning on pushing it.


The patch is already rebased on the latest master; tests are all green.

Tried some basic profiling and it looked good.


I also tried a somewhat unrealistic case.

create table abcdefgh(a int, b int, c int, d int, e int, f int, g int, h int);

insert into abcdefgh select a,b,c,d,e,f,g,h from
generate_series(1,7) a,
generate_series(1,7) b,
generate_series(1,7) c,
generate_series(1,7) d,
generate_series(1,7) e,
generate_series(1,7) f,
generate_series(1,7) g,
generate_series(1,7) h;

explain analyze select count(*) over (order by a),
row_number() over (partition by a order by b) from abcdefgh order by 
a,b,c,d,e,f,g,h;

In patch version

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=1023241.14..1225007.67 rows=5764758 width=48) (actual time=64957.894..81950.352 rows=5764801 loops=1)
   ->  WindowAgg  (cost=1023241.14..1138536.30 rows=5764758 width=40) (actual time=37959.055..60391.799 rows=5764801 loops=1)
         ->  Sort  (cost=1023241.14..1037653.03 rows=5764758 width=32) (actual time=37959.045..52968.791 rows=5764801 loops=1)
               Sort Key: a, b, c, d, e, f, g, h
               Sort Method: external merge  Disk: 237016kB
               ->  Seq Scan on abcdefgh  (cost=0.00..100036.58 rows=5764758 width=32) (actual time=0.857..1341.107 rows=5764801 loops=1)
 Planning Time: 0.168 ms
 Execution Time: 82748.789 ms
(8 rows)

In Master

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=1040889.72..1960081.97 rows=5764758 width=48) (actual time=23461.815..69654.700 rows=5764801 loops=1)
   Sort Key: a, b, c, d, e, f, g, h
   Presorted Key: a, b
   Full-sort Groups: 49  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
   Pre-sorted Groups: 49  Sort Method: external merge  Average Disk: 6688kB  Peak Disk: 6688kB
   ->  WindowAgg  (cost=1023241.14..1225007.67 rows=5764758 width=48) (actual time=22729.171..40189.407 rows=5764801 loops=1)
         ->  Windo

Re: Removing redundant grouping columns

2023-01-18 Thread Tom Lane
David Rowley  writes:
> No objections from me.

Pushed, thanks for looking at it.

regards, tom lane




Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Alvaro Herrera
On 2023-Jan-18, Karl O. Pinc wrote:

> On Wed, 18 Jan 2023 13:30:45 +0100
> Alvaro Herrera  wrote:
> 
> > Not related to this patch: it's very annoying that in the PDF output,
> > each section in the appendix doesn't start on a blank page -- which
> > means that the doc page for many modules starts in the middle of a
> > page were the previous one ends.
> 
> > I wonder if we can tweak something in the stylesheet to include a page
> > break.
> 
> Would this be something to be included in this patch?
> (If I can figure it out.)

No, I think we should do that change separately.  I just didn't think a
parenthetical complaint was worth a separate thread; but if you do
create a patch, please do create a new thread (unless the current patch
in this one is committed already.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)




Re: document the need to analyze partitioned tables

2023-01-18 Thread Justin Pryzby
On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > Maybe (all?) the clarification the docs need is to say:
> > > "Partitioned tables are not *themselves* processed by autovacuum."
> > 
> > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > since most people assume autovacuum handles _all_ statistics updating.

That's what 61fa6ca79 aimed to do.  Laurenz is suggesting further
clarification.

> > Can someone summarize how bad it is we have no statistics on partitioned
> > tables?  It sounds bad to me.
> 
> Andrey Lepikhov had an example earlier in this thread[1].  It doesn't take
> an exotic query. 
> 
> Attached is a new version of my patch that tries to improve the wording.

I tweaked this a bit to end up with:

> -Partitioned tables are not processed by autovacuum.  Statistics
> -should be collected by running a manual ANALYZE when 
> it is
> +The leaf partitions of a partitioned table are normal tables and are 
> processed
> +by autovacuum; however, autovacuum does not process the partitioned 
> table itself.
> +This is no problem as far as VACUUM is concerned, 
> since
> +there's no need to vacuum the empty, partitioned table.  But, as 
> mentioned in
> +, it also means that autovacuum 
> won't
> +run ANALYZE on the partitioned table.
> +Although statistics are automatically gathered on its leaf partitions, 
> some queries also need
> +statistics on the partitioned table to run optimally.  You should 
> collect statistics by
> +running a manual ANALYZE when the partitioned table is
>  first populated, and again whenever the distribution of data in its
>  partitions changes significantly.
> 

"partitions are normal tables" was techically wrong, as partitions can
also be partitioned.

-- 
Justin




Re: Implement missing join selectivity estimation for range types

2023-01-18 Thread Tomas Vondra
Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
a proper comment, not just "this is a copy from rangetypes".

However, it seems the two functions are exactly the same. Would the
functions diverge in the future? If not, maybe there should be just a
single shared function?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: document the need to analyze partitioned tables

2023-01-18 Thread Bruce Momjian
On Wed, Jan 18, 2023 at 11:49:19AM -0600, Justin Pryzby wrote:
> On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> > On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > > Maybe (all?) the clarification the docs need is to say:
> > > > "Partitioned tables are not *themselves* processed by autovacuum."
> > > 
> > > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > > since most people assume autovacuum handles _all_ statistics updating.
> 
> That's what 61fa6ca79 aimed to do.  Laurenz is suggesting further
> clarification.

Ah, makes sense, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: [PATCH] Const'ify the arguments of ilist.c/ilist.h functions

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-12 08:34:25 +0100, Peter Eisentraut wrote:
> On 07.01.23 08:21, Peter Eisentraut wrote:
> > On 23.11.22 14:57, Aleksander Alekseev wrote:
> > > Hi Andres,
> > > 
> > > Thanks for the review!
> > > 
> > > > I don't think it is correct for any of these to add const. The
> > > > only reason it
> > > > works is because of casting etc.
> > > 
> > > Fair enough. PFA the corrected patch v2.
> > 
> > This patch version looks correct to me.  It is almost the same as the
> > one that Andres had posted in his thread, except that yours also
> > modifies slist_delete() and dlist_member_check().  Both of these changes
> > also look correct to me.
> 
> committed

Unfortunately this causes a build failure with ILIST_DEBUG
enabled. dlist_member_check() uses dlist_foreach(), which isn't set up to work
with const :(.  I'll push a quick workaround.

Greetings,

Andres Freund




Re: [PATCH] Add <> support to sepgsql_restorecon

2023-01-18 Thread Joe Conway
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

This needs regression test support for the feature and some minimal 
documentation that shows how to make use of it.

The new status of this patch is: Waiting on Author


Re: [PATCH] Const'ify the arguments of ilist.c/ilist.h functions

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 10:22:14 -0800, Andres Freund wrote:
> On 2023-01-12 08:34:25 +0100, Peter Eisentraut wrote:
> > On 07.01.23 08:21, Peter Eisentraut wrote:
> > > This patch version looks correct to me.  It is almost the same as the
> > > one that Andres had posted in his thread, except that yours also
> > > modifies slist_delete() and dlist_member_check().  Both of these changes
> > > also look correct to me.
> > 
> > committed
> 
> Unfortunately this causes a build failure with ILIST_DEBUG
> enabled. dlist_member_check() uses dlist_foreach(), which isn't set up to work
> with const :(.  I'll push a quick workaround.

Pushed.

Greetings,

Andres Freund




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan

On Wed, Jan 18, 2023 at 7:54 AM Robert Haas  wrote:
> > It just fits: the dead tuples approach can sometimes be so
> > completely wrong that even an alternative triggering condition based
> > on something that is virtually unrelated to the thing we actually care
> > about can do much better in practice. Consistently, reliably, for a
> > given table/workload.
>
> Hmm, I don't know. I have no intuition one way or the other for
> whether we're undercounting dead tuples, and I don't understand what
> would cause us to do that. I thought that we tracked that accurately,
> as part of the statistics system, not by sampling
> (pg_stat_all_tables.n_dead_tup).

It's both, kind of.

pgstat_report_analyze() will totally override the
tabentry->dead_tuples information that drives autovacuum.c, based on
an estimate derived from a random sample -- which seems to me to be an
approach that just doesn't have any sound theoretical basis. So while
there is a sense in which we track dead tuples incrementally and
accurately using the statistics system, we occasionally call
pgstat_report_analyze (and pgstat_report_vacuum) like this, so AFAICT
we might as well not even bother tracking things reliably the rest of
the time.

Random sampling works because the things that you don't sample are
very well represented by the things that you do sample. That's why
even very stale optimizer statistics can work quite well (and why the
EAV anti-pattern makes query optimization impossible) -- the
distribution is often fixed, more or less. The statistics generalize
very well because the data meets certain underlying assumptions that
all data stored in a relational database is theoretically supposed to
meet. Whereas with dead tuples, the whole point is to observe and
count dead tuples so that autovacuum can then go remove the dead
tuples -- which then utterly changes the situation! That's a huge
difference.

ISTM that you need a *totally* different approach for something that's
fundamentally dynamic, which is what this really is. Think about how
the random sampling will work in a very large table with concentrated
updates. The modified pages need to outweigh the large majority of
pages in the table that can be skipped by VACUUM anyway.

I wonder how workable it would be to just teach pgstat_report_analyze
and pgstat_report_vacuum to keep out of this, or to not update the
stats unless it's to increase the number of dead_tuples...
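
A minimal sketch of the "never lower it" variant, assuming we are inside
pgstat_report_analyze() with the shared entry at hand (illustrative only,
not what the code does today):

	/* only let ANALYZE raise the estimate, never lower it */
	if (deadtuples > tabentry->dead_tuples)
		tabentry->dead_tuples = deadtuples;
	/* otherwise keep the incrementally-maintained counter we already have */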

> I think we ought to fire autovacuum_vacuum_scale_factor out of an
> airlock.

Couldn't agree more. I think that this and the underlying statistics
are the really big problem as far as under-vacuuming is concerned.

> I think we also ought to invent some sort of better cost limit system
> that doesn't shoot you in the foot automatically as the database
> grows. Nobody actually wants to limit the rate at which the database
> vacuums stuff to a constant. What they really want to do is limit it
> to a rate that is somewhat faster than the minimum rate needed to
> avoid disaster. We should try to develop metrics for whether vacuum is
> keeping up.

Definitely agree that doing some kind of dynamic updating is
promising. What we thought at the start versus what actually happened.
Something cyclic, just like autovacuum itself.

> I don't actually see any reason why dead tuples, even counted in a
> relatively stupid way, isn't fundamentally good enough to get all
> tables vacuumed before we hit the XID age cutoff. It doesn't actually
> do that right now, but I feel like that must be because we're doing
> other stupid things, not because there's anything that terrible about
> the metric as such. Maybe that's wrong, but I find it hard to imagine.

On reflection, maybe you're right here. Maybe it's true that the
bigger problem is just that the implementation is bad, even on its own
terms -- since it's pretty bad! Hard to say at this point.

Depends on how you define it, too. Statistical sampling is just not
fit for purpose here. But is that a problem with
autovacuum_vacuum_scale_factor? I may have said words that could
reasonably be interpreted that way, but I'm not prepared to blame it
on the underlying autovacuum_vacuum_scale_factor model now. It's
fuzzy.

> > We're
> > still subject to the laws of physics. VACUUM would still be something
> > that more or less works at the level of the whole table, or not at
> > all. So being omniscient seems kinda overrated to me. Adding more
> > information does not in general lead to better outcomes.
>
> Yeah, I think that's true. In particular, it's not much use being
> omniscient but stupid. It would be better to have limited information
> and be smart about what you did with it.

I would put it like this: autovacuum shouldn't ever be a sucker. It
should pay attention to disconfirmatory signals. The information that
drives its decision making process should be treated as provisional.

Even if the information was correct at one point, the contents of the
table are constantly changing in

Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 11:28:57AM -0500, Robert Haas wrote:
> In general, looks good. I think this will often call HaveNFreeProcs
> twice, though, and that would be better to avoid, e.g.

I should have thought of this.  This is fixed in v2.

> In the common case where we hit neither limit, this only counts free
> connection slots once. We could do even better by making
> HaveNFreeProcs have an out parameter for the number of free procs
> actually found when it returns false, but that's probably not
> important.

Actually, I think it might be important.  IIUC the separate calls to
HaveNFreeProcs might return different values for the same input, which
could result in incorrect error messages (e.g., you might get the
reserved_connections message despite setting reserved_connections to 0).
So, I made this change in v2, too.
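
For the archives, the shape I have in mind is roughly this (a sketch only,
not the exact v2 hunk; the error texts here are placeholders):

	int			nfree;

	/*
	 * Count free slots once; HaveNFreeProcs() reports how many it found
	 * (up to the requested n) via the out parameter, so both limits are
	 * checked against the same snapshot.
	 */
	if (!am_superuser && !am_walsender &&
		(SuperuserReservedBackends + ReservedBackends) > 0 &&
		!HaveNFreeProcs(SuperuserReservedBackends + ReservedBackends, &nfree))
	{
		if (nfree < SuperuserReservedBackends)
			ereport(FATAL,
					(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
					 errmsg("remaining connection slots are reserved for superusers")));
		else
			ereport(FATAL,
					(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
					 errmsg("remaining connection slots are reserved for superusers and reserved roles")));
	}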

> I don't think that we should default both the existing GUC and the new
> one to 3, because that raises the default limit in the case where the
> new feature is not used from 3 to 6. I think we should default one of
> them to 0 and the other one to 3. Not sure which one should get which
> value.

I chose to set reserved_connections to 0 since it is new and doesn't have a
pre-existing default value.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 94ee89548fd1080447b784993a1418480f407b49 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v2 1/2] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC
that will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +-
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char	   *Unix_socket_directories;
 char	   *ListenAddresses;
 
 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int			ReservedBackends;
+int			SuperuserReservedBackends;
 
 /* The socket(s) we're listening to. */
 #define MAXLISTEN	64
@@ -908,11 +908,11 @@ PostmasterMain(int argc, char *argv[])
 	/*
 	 * Check for invalid combinations of GUC settings.
 	 */
-	if (ReservedBackends >= MaxConnections)
+	if (SuperuserReservedBackends >= MaxConnections)
 	{
 		write_stderr("%s: superuser_reserved_connections (%d) must be less than max_connections (%d)\n",
 	 progname,
-	 ReservedBackends, MaxConnections);
+	 SuperuserReservedBackends, MaxConnections);
 		ExitPostmaster(1);
 	}
 	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..6fa696fe8d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -927,8 +927,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	 * limited by max_connections or superuser_reserved_connections.
 	 */
 	if (!am_superuser && !am_walsender &&
-		ReservedBackends > 0 &&
-		!HaveNFreeProcs(ReservedBackends))
+		SuperuserReservedBackends > 0 &&
+		!HaveNFreeProcs(SuperuserReservedBackends))
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
  errmsg("remaining connection slots are reserved for non-replication superuser connections")));
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..5aa2cda8f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2163,7 +2163,7 @@ struct config_int ConfigureNamesInt[] =
 			gettext_noop("Sets the number of connection slots reserved for superusers."),
 			NULL
 		},
-		&ReservedBackends,
+		&SuperuserReservedBackends,
 		3, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 203177e1ff..168d

Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Karl O. Pinc
On Wed, 18 Jan 2023 13:25:57 +0100
Alvaro Herrera  wrote:

> On 2023-Jan-02, Karl O. Pinc wrote:

> > Attached is a patch: contrib_v1.patch
> > 
> > It modifies Appendix F, the contrib directory.
> > 
> > It adds brief text into the titles shown in the
> > table of contents so it's easier to tell what
> > each module does.  It also suffixes [trusted] or [obsolete]
> > on the relevant titles.  

> 
> I'm not 100% sold on having the
> "trusted" or "obsolete" marker on the titles themselves, though.  Not
> sure what alternative do we have, though, other than leave them out
> completely.

The alternative would be to have a separate table with modules
for rows and "trusted" and "obsolete" columns.  It seems like
more of a maintenance hassle than having the markers in the titles.

Let me know if you want a table.  I do like having a single place
to look over all the modules to see what is "trusted" or "obsolete".

I suppose there could just be a table, with module names, descriptions,
and trusted and obsolete flags.  Instead of a table of contents
for the modules the module names in the table could be links.  But
that'd involve suppressing the table of contents showing all the
module names.  And it has the problem of a possible mismatch between
the modules listed in the table and the modules that exist.

> There's a typo "equalivent" in two places.

Fixed.

> In passwordcheck, I would say just "check for weak passwords" or maybe
> "verify password strength".

I used "verify password strength".

> 
> pg_buffercache is missing.  Maybe "-- inspect state of the Postgres
> buffer cache".

I used "inspect Postgres buffer cache state"

> For pg_stat_statements I suggest "track statistics of planning and
> execution of SQL queries"

I had written "track SQL query planning and execution statistics".
Changed to: "track statistics of SQL planning and execution"

I don't really care.  If you want your version I'll submit another
patch.

> For sepgsql, as I understand it is strictly SELinux based, not just
> "-like".  So this needs rewording: "label-based, SELinux-like,
> mandatory access control".  Maybe "SELinux-based implementation of
> mandatory access control for row-level security".

Changed to: "SELinux-based row-level security mandatory access control"

> xml -- typo "qeurying"

Fixed.

I have also made the patch put each module on a separate
page when producing PDF documents.  This did produce one warning,
which seems unrelated to me.  The pdf seems right. I also tried 
just "make", to be sure I didn't break anything unrelated.  Seemed 
to work.  So..., works for me.

New patch attached: contrib_v2.patch

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/adminpack.sgml b/doc/src/sgml/adminpack.sgml
index 184e96d7a0..04f3b52379 100644
--- a/doc/src/sgml/adminpack.sgml
+++ b/doc/src/sgml/adminpack.sgml
@@ -1,7 +1,7 @@
 
 
 
- adminpack
+ adminpack — pgAdmin support toolpack
 
  
   adminpack
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 923cbde9dd..4006c75cdf 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -1,7 +1,7 @@
 
 
 
- amcheck
+ amcheck — tools to verify index consistency
 
  
   amcheck
diff --git a/doc/src/sgml/auth-delay.sgml b/doc/src/sgml/auth-delay.sgml
index 40629311b1..0571f2a99d 100644
--- a/doc/src/sgml/auth-delay.sgml
+++ b/doc/src/sgml/auth-delay.sgml
@@ -1,7 +1,7 @@
 
 
 
- auth_delay
+ auth_delay — pause on authentication failure
 
  
   auth_delay
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index bb7342b120..0c4656ee30 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -1,7 +1,7 @@
 
 
 
- auto_explain
+ auto_explain — log execution plans of slow queries
 
  
   auto_explain
diff --git a/doc/src/sgml/basebackup-to-shell.sgml b/doc/src/sgml/basebackup-to-shell.sgml
index 491368eb8f..b6a3b39541 100644
--- a/doc/src/sgml/basebackup-to-shell.sgml
+++ b/doc/src/sgml/basebackup-to-shell.sgml
@@ -1,7 +1,7 @@
 
 
 
- basebackup_to_shell
+ basebackup_to_shell — example "shell" pg_basebackup module
 
  
   basebackup_to_shell
diff --git a/doc/src/sgml/basic-archive.sgml b/doc/src/sgml/basic-archive.sgml
index 60f23d2855..b4d43ced20 100644
--- a/doc/src/sgml/basic-archive.sgml
+++ b/doc/src/sgml/basic-archive.sgml
@@ -1,7 +1,7 @@
 
 
 
- basic_archive
+ basic_archive — an example WAL archive module
 
  
   basic_archive
diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
index 98d0316175..672ac2ed19 100644
--- a/doc/src/sgml/bloom.sgml
+++ b/doc/src/sgml/bloom.sgml
@@ -1,7 +1,7 @@
 
 
 
- bloom
+ bloom — bloom filter index access
 
  
   bloom
diff --git a/doc/src/sgml/btree-gin.sgml b/doc/src/sgml/btree-gin.sgml
index 870c25559e..0eaea0dbcd 100644
--- a/doc/src/sgml/btree-gin.sgml
+++ b/doc/src/sgml/btree-gin.sgml
@@ -1,7 +1,8 @@
 
 
 
- btree_gin
+ btree_gin —
+   sample GIN B-tree equivalent operato

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> pgstat_report_analyze() will totally override the
> tabentry->dead_tuples information that drives autovacuum.c, based on
> an estimate derived from a random sample -- which seems to me to be an
> approach that just doesn't have any sound theoretical basis.

Yikes. I think we don't have a choice but to have a method to correct
the information somehow, because AFAIK the statistics system is not
crash-safe. But that approach does seem to carry significant risk of
overwriting correct information with wrong information.

> On reflection, maybe you're right here. Maybe it's true that the
> bigger problem is just that the implementation is bad, even on its own
> terms -- since it's pretty bad! Hard to say at this point.
>
> Depends on how you define it, too. Statistical sampling is just not
> fit for purpose here. But is that a problem with
> autovacuum_vacuum_scale_factor? I may have said words that could
> reasonably be interpreted that way, but I'm not prepared to blame it
> on the underlying autovacuum_vacuum_scale_factor model now. It's
> fuzzy.

Yep. I think what we should try to evaluate is which number is
furthest from the truth. My guess is that the threshold is so high
relative to what a reasonable value would be that you can't get any
benefit out of making the dead tuple count more accurate. Like, if the
threshold is 100x too high, or something, then who cares how accurate
the dead tuples number is? It's going to be insufficient to trigger
vacuuming whether it's right or wrong. We should try substituting a
less-bogus threshold calculation and see what happens then. An
alternative theory is that the threshold is fine and we're only
failing to reach it because the dead tuple calculation is so
inaccurate. Maybe that's even true in some scenarios, but I bet that
it's never the issue when people have really big tables. The fact that
I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
1TB of bloat in my 10TB table. Among other problems, I can't even
vacuum away that much bloat in one index pass, because autovacuum
can't use enough work memory for that. Also, the absolute space
wastage matters.
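
To make that concrete: today the trigger is roughly

	/* as in relation_needs_vacanalyze(), approximately */
	vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
	dovacuum = force_vacuum || vactuples > vacthresh;

so the tolerated bloat grows without bound as the table grows. A
"doesn't shoot you in the foot as the database grows" variant could cap it:

	/* purely illustrative cap, not a proposal for the actual number */
	vacthresh = Min(vacthresh, (float4) 50 * 1000 * 1000);

The point being only that the absolute amount of tolerated bloat should
stop tracking table size at some point.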

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-01-18 Thread Andrei Zubkov
Hi Tomas,

On Wed, 2023-01-18 at 17:29 +0100, Tomas Vondra wrote:
> I took a quick look at this patch, to see if there's something we
> want/can get into v16. The last version was submitted about 9 months
> ago, and it doesn't apply cleanly anymore, but the bitrot is fairly
> minor. Not sure there's still interest, though.

Thank you for your attention to this patch!

I'm still very interested in this patch, and I think I'll try to rebase
it during the next week or two if it seems possible to get it into 16.
> 
> I'd probably call it "created_at" or something like
> that, but that's a minor detail. Or maybe stats_reset, which is what
> we
> use in pgstat?

Yes, there is a naming issue here. My thinking was the following:
 - "stats_reset" is not quite correct here, because the moment a
statement entry is created is definitely not a reset. The field is
named just as it means: the time from which statistics have been
collected for this particular entry.
 - "created_at" perfectly matches the purpose of the field, but does
not seem as self-explanatory to me.

> 
> But is the second timestamp for the min/max fields really useful?
> AFAIK
> to perform analysis, people take regular pg_stat_statements
> snapshots,
> which works fine for counters (calculating deltas) but not for gauges
> (which need a reset, to track fresh values). But people analyzing
> this
> are already resetting the whole entry, and so the snapshots already
> are
> tracking deltas.
>
> So I'm not convinced actually need the second timestamp.

The main purpose of the patch is to give sampling (collecting)
solutions a way to avoid resetting pgss at all, just as they do with
the pg_stat_ views. The only real need for a reset is that we can't be
sure an observed statement wasn't evicted and re-added since the last
sample. Right now all we can do is a whole reset on each sample and
then check how many entries are in the pgss hashtable on the next
sample - how close that value is to the max. If there is plenty of
space in the hashtable we can hope that there were no evictions since
the last sample. However, a reset could also have been performed by
someone else, and we would know nothing about it.
With a timestamp in the stats_since field we know exactly how long
statistics have been tracked for each statement, so a sampling
solution can avoid pgss resets entirely. Avoiding such resets means
avoiding interference between monitoring solutions.
But if no more resets are done we can't track min/max values, because
they still need a reset, and those resets are unavoidable. However, I
still want to know when a min/max reset was performed; this will help
to detect possible interference from such resets.
> 
> 
> A couple more comments:
> 
> 1) I'm not sure why the patch is adding tests of permissions on the
> pg_stat_statements_reset function?
> 
> 2) If we want the second timestamp, shouldn't it also cover resets of
> the mean values, not just min/max?

I think that mean values shouldn't be a target for a partial reset,
because mean values can easily be reconstructed by the sampling
solution without a reset.

> 
> 3) I don't understand why the patch is adding "IS NOT NULL AS t" to
> various places in the regression tests.

Most of the tests were copied from the previous version. I'll recheck
them.

> 
> 4) I rather dislike the "minmax" naming, because that's often used in
> other contexts (for BRIN indexes), and as I mentioned maybe it should
> also cover the "mean" fields.

Agreed, but I couldn't make it better. Other versions seemed worse to
me...
> 
> 
Regards, Andrei Zubkov





Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-01-18 Thread Karl O. Pinc
On Wed, 18 Jan 2023 18:34:47 +0100
Alvaro Herrera  wrote:

> On 2023-Jan-18, Karl O. Pinc wrote:
> 
> > On Wed, 18 Jan 2023 13:30:45 +0100
> > Alvaro Herrera  wrote:
> >   
> > > Not related to this patch: it's very annoying that in the PDF
> > > output, each section in the appendix doesn't start on a blank
> > > page -- which means that the doc page for many modules starts in
> > > the middle of a page were the previous one ends.  
> >   
> > > I wonder if we can tweak something in the stylesheet to include a
> > > page break.  
> > 
> > Would this be something to be included in this patch?
> > (If I can figure it out.)  
> 
> No, I think we should do that change separately.  I just didn't think
> a parenthical complain was worth a separate thread for it; but if you
> do create a patch, please do create a new thread (unless the current
> patch in this one is committed already.)

Oops.  Already sent a revised patch that includes starting each
module on a new page, for PDF output.  I'll wait to rip that
out after review and start a new thread if necessary.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Parallelize correlated subqueries that execute within each worker

2023-01-18 Thread Tomas Vondra
Hi,

This patch hasn't been updated since September, and it got broken by
4a29eabd1d91c5484426bc5836e0a7143b064f5a, which changed the incremental
sort stuff a little bit. But the breakage was rather limited, so I took a
stab at fixing it - attached is the result, hopefully correct.

I also added a couple minor comments about stuff I noticed while
rebasing and skimming the patch, I kept those in separate commits.
There's also a couple pre-existing TODOs.

James, what's your plan with this patch? Do you intend to work on it for
PG16, or are there some issues I missed in the thread?


One of the queries in in incremental_sort changed plans a little bit:

explain (costs off) select distinct
  unique1,
  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
from tenk1 t, generate_series(1, 1000);

switched from

 Unique  (cost=18582710.41..18747375.21 rows=1 width=8)
   ->  Gather Merge  (cost=18582710.41..18697375.21 rows=1000 ...)
 Workers Planned: 2
 ->  Sort  (cost=18582710.39..18593127.06 rows=417 ...)
   Sort Key: t.unique1, ((SubPlan 1))
 ...

to

 Unique  (cost=18582710.41..18614268.91 rows=1 ...)
   ->  Gather Merge  (cost=18582710.41..18614168.91 rows=2 ...)
 Workers Planned: 2
 ->  Unique  (cost=18582710.39..18613960.39 rows=1 ...)
   ->  Sort  (cost=18582710.39..18593127.06 ...)
 Sort Key: t.unique1, ((SubPlan 1))
   ...

which probably makes sense, as the cost estimate decreases a bit.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 95db15fe16303ed3f4fdea52af3b8d6d05a8d7a6 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Mon, 26 Sep 2022 20:30:23 -0400
Subject: [PATCH 1/5] Add tests before change

---
 src/test/regress/expected/select_parallel.out | 108 ++
 src/test/regress/sql/select_parallel.sql  |  25 
 2 files changed, 133 insertions(+)

diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 91f74fe47a3..9b4d7dd44a4 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -311,6 +311,114 @@ select count(*) from tenk1 where (two, four) not in
  1
 (1 row)
 
+-- test parallel plans for queries containing correlated subplans
+-- where the subplan only needs params available from the current
+-- worker's scan.
+explain (costs off, verbose) select
+  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
+  from tenk1 t, generate_series(1, 10);
+ QUERY PLAN 
+
+ Gather
+   Output: (SubPlan 1)
+   Workers Planned: 4
+   ->  Nested Loop
+ Output: t.unique1
+ ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
+   Output: t.unique1
+ ->  Function Scan on pg_catalog.generate_series
+   Output: generate_series.generate_series
+   Function Call: generate_series(1, 10)
+   SubPlan 1
+ ->  Index Only Scan using tenk1_unique1 on public.tenk1
+   Output: t.unique1
+   Index Cond: (tenk1.unique1 = t.unique1)
+(14 rows)
+
+explain (costs off, verbose) select
+  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
+  from tenk1 t;
+  QUERY PLAN  
+--
+ Gather
+   Output: (SubPlan 1)
+   Workers Planned: 4
+   ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
+ Output: t.unique1
+   SubPlan 1
+ ->  Index Only Scan using tenk1_unique1 on public.tenk1
+   Output: t.unique1
+   Index Cond: (tenk1.unique1 = t.unique1)
+(9 rows)
+
+explain (costs off, verbose) select
+  (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
+  from tenk1 t
+  limit 1;
+QUERY PLAN 
+---
+ Limit
+   Output: ((SubPlan 1))
+   ->  Seq Scan on public.tenk1 t
+ Output: (SubPlan 1)
+ SubPlan 1
+   ->  Index Only Scan using tenk1_unique1 on public.tenk1
+ Output: t.unique1
+ Index Cond: (tenk1.unique1 = t.unique1)
+(8 rows)
+
+explain (costs off, verbose) select t.unique1
+  from tenk1 t
+  where t.unique1 = (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1);
+ QUERY PLAN  
+-
+ Seq Scan on public.tenk1 t
+   Output: t.unique1
+   Filter: (t.unique1 = (SubPlan 1))
+   SubPlan 1
+ ->  Index Only Scan using tenk1_unique1 on public.tenk1
+   Output: t.unique1
+   Index Cond: (tenk1.unique1 = t.unique1)
+(7

Re: document the need to analyze partitioned tables

2023-01-18 Thread Laurenz Albe
On Wed, 2023-01-18 at 11:49 -0600, Justin Pryzby wrote:
> I tweaked this a bit to end up with:
> 
> > -    Partitioned tables are not processed by autovacuum.  Statistics
> > -    should be collected by running a manual ANALYZE 
> > when it is
> > +    The leaf partitions of a partitioned table are normal tables and are 
> > processed
> > +    by autovacuum; however, autovacuum does not process the partitioned 
> > table itself.
> > +    This is no problem as far as VACUUM is concerned, 
> > since
> > +    there's no need to vacuum the empty, partitioned table.  But, as 
> > mentioned in
> > +    , it also means that autovacuum 
> > won't
> > +    run ANALYZE on the partitioned table.
> > +    Although statistics are automatically gathered on its leaf partitions, 
> > some queries also need
> > +    statistics on the partitioned table to run optimally.  You should 
> > collect statistics by
> > +    running a manual ANALYZE when the partitioned table 
> > is
> >  first populated, and again whenever the distribution of data in its
> >  partitions changes significantly.
> >     
> 
"partitions are normal tables" was technically wrong, as partitions can
> also be partitioned.

I am fine with your tweaks.  I think this is good to go.

Yours,
Laurenz Albe




Re: Implement missing join selectivity estimation for range types

2023-01-18 Thread Mahmoud Sakr
Hi Tomas,
Thanks for picking up the patch and for the interesting discussions that
you bring!

> Interesting. Are there any particular differences compared to how we
> estimate for example range clauses on regular columns?
The theory is the same for scalar types. Yet, the statistics that are currently
collected for scalar types include other synopses besides the histogram, such
as the MCV list, which should be incorporated into the estimation. The theory
for using the additional statistics is ready in the paper, but we haven't
implemented it yet. We thought of sharing the part that is ready until time
allows us to implement the rest, or other developers continue it.

> Right. I think 0.5% is roughly expected for the default statistics
> target, which creates 100 histogram bins, each representing ~1% of the
> values. Which on average means ~0.5% error.
Since this patch deals with join selectivity, we are then crossing 100*100 bins.
The ~0.5% error estimation comes from our experiments, rather than a
mathematical analysis.

> independence means we take (P1*P2). But maybe we should be very
> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>
> Another option is to use the length histogram, right? I mean, we know
> what the average length is, and it should be possible to use that to
> calculate how "far" ranges in a histogram can overlap.
The independence assumption exists if we use the lower and upper
histograms. It equally exists if we use the lower and length histograms.
In both cases, the link between the two histograms is lost during their
construction.
Your discussion brings up an interesting trade-off between optimistic and
pessimistic estimation. A typical way to deal with such a trade-off is to
average the two, as is done for example in model validation in machine
learning. Do you think we should implement something like
average( (P1*P2), Min(P1,P2) )?
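
As a sketch (purely illustrative, not part of the posted patch; Min() as
in c.h), the blended estimate would be something like:

	static double
	blend_join_selectivity(double p1, double p2)
	{
		double		independent = p1 * p2;		/* independence assumption */
		double		pessimistic = Min(p1, p2);	/* pessimistic bound */

		return 0.5 * (independent + pessimistic);
	}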

> OK. But ideally we'd cross-check elements of the two multiranges, no?

> So if the column(s) contain a couple very common (multi)ranges that make
> it into an MCV, we'll ignore those? That's a bit unfortunate, because
> those MCV elements are potentially the main contributors to selectivity.
Both ideas would require collecting more detailed statistics, for example
similar to those collected for arrays. In this patch, we restricted
ourselves to the existing statistics.


> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
Right, the comment should elaborate more that the collected statistics are
currently the same as for rangetypes but may potentially deviate.

> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
Indeed, it is possible that the two functions will deviate if the statistics
of multirange types are refined.

--
Best regards
Mahmoud SAKR

On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra
 wrote:
>
> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
>
> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
One more thing before we move on from this topic.  I'd been testing
modified versions of the AdjustUpgrade.pm logic by building from a
--from-source source tree, which seemed way easier than dealing
with a private git repo.  As it stands, TestUpgradeXversion.pm
refuses to run under $from_source, but I just diked that check out
and it seemed to work fine for my purposes.  Now, that's going to be
a regular need going forward, so I'd like to not need a hacked version
of the BF client code to do it.

Also, your committed version of TestUpgradeXversion.pm breaks that
use-case because you did

-   unshift(@INC, "$self->{pgsql}/src/test/perl");
+   unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");

which AFAICS is an empty directory in a $from_source run.

I suppose that the reason for not running under $from_source is to
avoid corrupting the saved installations with unofficial versions.
However, couldn't we skip the "save" step and still run the upgrade
tests against whatever we have saved?  (Maybe skip the same-version
test, as it's not quite reflecting any real case then.)

Here's a quick draft patch showing what I have in mind.  There may
well be a better way to deal with the wheres-the-source issue than
what is in hunk 2.  Also, I didn't reindent the unchanged code in
sub installcheck, and I didn't add anything about skipping
same-version tests.

regards, tom lane

diff --git a/PGBuild/Modules/TestUpgradeXversion.pm b/PGBuild/Modules/TestUpgradeXversion.pm
index a784686..408432d 100644
--- a/PGBuild/Modules/TestUpgradeXversion.pm
+++ b/PGBuild/Modules/TestUpgradeXversion.pm
@@ -51,8 +51,6 @@ sub setup
 	my $conf  = shift;# ref to the whole config object
 	my $pgsql = shift;# postgres build dir
 
-	return if $from_source;
-
 	return if $branch !~ /^(?:HEAD|REL_?\d+(?:_\d+)?_STABLE)$/;
 
 	my $animal = $conf->{animal};
@@ -351,7 +349,16 @@ sub test_upgrade## no critic (Subroutines::ProhibitManyArgs)
 	  if $verbose;
 
 	# load helper module from source tree
-	unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");
+	my $source_tree;
+	if ($from_source)
+	{
+		$source_tree = $self->{pgsql};
+	}
+	else
+	{
+		$source_tree = "$self->{buildroot}/$this_branch/pgsql";
+	}
+	unshift(@INC, "$source_tree/src/test/perl");
 	require PostgreSQL::Test::AdjustUpgrade;
 	PostgreSQL::Test::AdjustUpgrade->import;
 	shift(@INC);
@@ -694,6 +701,11 @@ sub installcheck
 	my $upgrade_loc  = "$upgrade_install_root/$this_branch";
 	my $installdir   = "$upgrade_loc/inst";
 
+	# Don't save in a $from_source build: we want the saved trees always
+	# to correspond to branch tips of the animal's standard repo.  We can
+	# perform upgrade tests against previously-saved trees, though.
+	if (!$from_source)
+	{
 	# for saving we need an exclusive lock.
 	get_lock($self, $this_branch, 1);
 
@@ -716,6 +728,7 @@ sub installcheck
 	  if ($verbose > 1);
 	send_result('XversionUpgradeSave', $status, \@saveout) if $status;
 	$steps_completed .= " XVersionUpgradeSave";
+	}
 
 	# in saveonly mode our work is done
 	return if $ENV{PG_UPGRADE_SAVE_ONLY};
@@ -744,7 +757,7 @@ sub installcheck
 		# other branch from being removed or changed under us.
 		get_lock($self, $oversion, 0);
 
-		$status =
+		my $status =
 		  test_upgrade($self, $save_env, $this_branch, $upgrade_install_root,
 			$dport, $install_loc, $other_branch) ? 0 : 1;
 


Re: Non-superuser subscription owners

2023-01-18 Thread Robert Haas
On Sat, Jan 8, 2022 at 2:38 AM Jeff Davis  wrote:
> Committed.

I was just noticing that what was committed here didn't actually fix
the problem implied by the subject line. That is, non-superuser still
can't own subscriptions. To put that another way, there's no way for
the superuser to delegate the setup and administration of logical
replication to a non-superuser. That's a bummer.

Reading the thread, I'm not quite sure why we seemingly did all the
preparatory work and then didn't actually fix the problem. It was
previously proposed that we introduce a new predefined role
pg_create_subscriptions and allow users who have the privileges of
that predefined role to create and alter subscriptions. There are a
few issues with that which, however, seem fairly solvable to me:

1. Jeff pointed out that if you supply a connection string that is
going to try to access local files, you'd better have
pg_read_server_files, or else we should not let you use that
connection string. I guess that's mostly a function of which
parameters you specify, e.g. passfile, sslcert, sslkey, though maybe
for host it depends on whether the value starts with a slash. We might
need to think a bit here to make sure we get the rules right, but it
seems like a pretty solvable problem (see the sketch after point 3 below).

2. There was also quite a bit of discussion of what to do if a user
who was previously eligible to own a subscription ceases to be
eligible, in particular around a superuser who is made into a
non-superuser, but the same problem would apply if you had
pg_create_subscriptions or pg_read_server_files and then lost it. My
vote is to not worry about it too much. Specifically, I think we
should certainly check whether the user has permission to create a
subscription before letting them do so, but we could handle the case
where the user already owns a subscription and tries to modify it by
either allowing or denying the operation and I think either of those
would be fine. I even think we could do one of those in some cases and
the other in other cases and as long as there is some principle to the
thing, it's fine. I argue that it's not a normal configuration and
therefore it doesn't have to work in a particularly useful way. It
shouldn't break the world in some horrendous way, but that's about as
good as it needs to be. I'd argue for example that DROP SUBSCRIPTION
could just check whether you own the object, and that ALTER
SUBSCRIPTION could check whether you own the object and, if you're
changing the connection string, also whether you would have privileges
to set that new connection string on a new subscription.

3. There was a bit of discussion of maybe wanting to allow users to
create subscriptions with some connection strings but not others,
perhaps by having some kind of intermediate object that owns the
connection string and is owned by a superuser or someone with lots of
privileges, and then letting a less-privileged user point a
subscription at that object. I agree that might be useful to somebody,
but I don't see why it's a hard requirement to get anything at all
done here. Right now, a subscription contains a connection string
directly. If in the future someone wants to introduce a CREATE
REPLICATION DESTINATION command (or whatever) and have a way to point
a subscription at a replication destination rather than a connection
string directly, cool. Or if someone wants to wire this into CREATE
SERVER somehow, also cool. But if you don't care about restricting
which IPs somebody can try to access by providing a connection string
of their choice, then you would be happy if we just did something
simple here and left this problem for another day.
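
Going back to point 1, a rough sketch of the rule I have in mind there
(the function name is invented and the keyword list is an assumption,
not a vetted set):

	/*
	 * Walk the parsed connection options (libpq's PQconninfoParse())
	 * and flag anything that reads local files, so the caller can
	 * require pg_read_server_files in that case.
	 */
	static bool
	connstr_touches_local_files(const char *connstr)
	{
		PQconninfoOption *opts = PQconninfoParse(connstr, NULL);
		bool		found = false;

		if (opts == NULL)
			return true;		/* unparseable: be conservative */

		for (PQconninfoOption *opt = opts; opt->keyword != NULL; opt++)
		{
			if (opt->val == NULL || opt->val[0] == '\0')
				continue;
			if (strcmp(opt->keyword, "passfile") == 0 ||
				strcmp(opt->keyword, "sslcert") == 0 ||
				strcmp(opt->keyword, "sslkey") == 0 ||
				strcmp(opt->keyword, "sslrootcert") == 0 ||
				(strcmp(opt->keyword, "host") == 0 && opt->val[0] == '/'))
			{
				found = true;
				break;
			}
		}
		PQconninfoFreeOptions(opts);
		return found;
	}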

I am very curious to know (a) why work on this was abandoned (perhaps
the answer is just lack of round tuits, in which case there is no more
to be said), and (b) what people think of (1)-(3) above, and (c)
whether anyone knows of further problems that need to be considered
here.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Small omission in type_sanity.sql

2023-01-18 Thread Melanie Plageman
Hi,

I was playing around with splitting up the tablespace test in regress so
that I could use the tablespaces it creates in another test and happened
to notice that the pg_class validity checks in type_sanity.sql are
incomplete.

It seems that 8b08f7d4820fd did not update the pg_class tests in
type_sanity to include partitioned indexes and tables.

patch attached.
I only changed these few lines in type_sanity to be more correct; I
didn't change anything else in regress to actually exercise them (e.g.
ensuring a partitioned table is around when running type_sanity). It
might be worth moving type_sanity down in the parallel schedule?

It does seem a bit hard to remember to update these tests in
type_sanity.sql when adding some new value for a pg_class field. I
wonder if there is a better way of testing this.

- Melanie
From 1531cdf257af92d31836e50e1a0b865131f1a018 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 18 Jan 2023 14:26:36 -0500
Subject: [PATCH v1] Update pg_class validity test

Partitioned tables and indexes were not considered in the pg_class
validity checks in type_sanity.sql. This is not currently exercised
because all partitioned tables and indexes have been dropped by the time
type_sanity is run.
---
 src/test/regress/expected/type_sanity.out | 44 ---
 src/test/regress/sql/type_sanity.sql  | 36 ++-
 2 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/src/test/regress/expected/type_sanity.out b/src/test/regress/expected/type_sanity.out
index a640cfc476..95435e3de6 100644
--- a/src/test/regress/expected/type_sanity.out
+++ b/src/test/regress/expected/type_sanity.out
@@ -498,34 +498,36 @@ ORDER BY 1;
 
 --  pg_class 
 -- Look for illegal values in pg_class fields
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
+SELECT oid, relname, relkind, relpersistence, relreplident
+FROM pg_class
+WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR
 relpersistence NOT IN ('p', 'u', 't') OR
 relreplident NOT IN ('d', 'n', 'f', 'i');
- oid | relname 
--+-
+ oid | relname | relkind | relpersistence | relreplident 
+-+-+-++--
 (0 rows)
 
--- All tables and indexes should have an access method.
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and
-c1.relam = 0;
- oid | relname 
--+-
+-- All tables and indexes except partitioned tables should have an access
+-- method.
+SELECT oid, relname, relkind, relam
+FROM pg_class
+WHERE relkind NOT IN ('S', 'v', 'f', 'c', 'p') and
+relam = 0;
+ oid | relname | relkind | relam 
+-+-+-+---
 (0 rows)
 
--- Conversely, sequences, views, types shouldn't have them
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind IN ('S', 'v', 'f', 'c') and
-c1.relam != 0;
- oid | relname 
--+-
+-- Conversely, sequences, views, types, and partitioned tables shouldn't have
+-- them
+SELECT oid, relname, relkind, relam
+FROM pg_class
+WHERE relkind IN ('S', 'v', 'f', 'c', 'p') and
+relam != 0;
+ oid | relname | relkind | relam 
+-+-+-+---
 (0 rows)
 
--- Indexes should have AMs of type 'i'
+-- Indexes and partitioned indexes should have AMs of type 'i'
 SELECT pc.oid, pc.relname, pa.amname, pa.amtype
 FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid)
 WHERE pc.relkind IN ('i') and
@@ -534,7 +536,7 @@ WHERE pc.relkind IN ('i') and
 -+-++
 (0 rows)
 
--- Tables, matviews etc should have AMs of type 't'
+-- Tables, matviews etc should have AMs of type 't' (except partitioned tables)
 SELECT pc.oid, pc.relname, pa.amname, pa.amtype
 FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid)
 WHERE pc.relkind IN ('r', 't', 'm') and
diff --git a/src/test/regress/sql/type_sanity.sql b/src/test/regress/sql/type_sanity.sql
index 79ec410a6c..22a2dba94d 100644
--- a/src/test/regress/sql/type_sanity.sql
+++ b/src/test/regress/sql/type_sanity.sql
@@ -358,31 +358,33 @@ ORDER BY 1;
 
 -- Look for illegal values in pg_class fields
 
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
+SELECT oid, relname, relkind, relpersistence, relreplident
+FROM pg_class
+WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR
 relpersistence NOT IN ('p', 'u', 't') OR
 relreplident NOT IN ('d', 'n', 'f', 'i');
 
--- All tables and indexes should have an access method.
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and
-c1.relam = 0;
-
--- Conversely, sequences, views, types shouldn't have them
-SELECT c1.oid, c1.relname
-FROM pg_class as c1
-WHERE c1.relkind IN ('S', 'v', 'f', 'c') and
-c1.relam != 0;
-
--- Indexes should have AMs of type 'i'

Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 2:00 PM Nathan Bossart  wrote:
> On Wed, Jan 18, 2023 at 11:28:57AM -0500, Robert Haas wrote:
> > In general, looks good. I think this will often call HaveNFreeProcs
> > twice, though, and that would be better to avoid, e.g.
>
> I should have thought of this.  This is fixed in v2.

Should (nfree < SuperuserReservedBackends) be using <=, or am I confused?

> > In the common case where we hit neither limit, this only counts free
> > connection slots once. We could do even better by making
> > HaveNFreeProcs have an out parameter for the number of free procs
> > actually found when it returns false, but that's probably not
> > important.
>
> Actually, I think it might be important.  IIUC the separate calls to
> HaveNFreeProcs might return different values for the same input, which
> could result in incorrect error messages (e.g., you might get the
> reserved_connections message despite setting reserved_connections to 0).
> So, I made this change in v2, too.

I thought of that briefly and it didn't seem that important, but the
way you did it seems fine, so let's go with that.

What's the deal with removing "and no new replication connections will
be accepted" from the documentation? Is the existing documentation
just wrong? If so, should we fix that first? And maybe delete
"non-replication" from the error message that says "remaining
connection slots are reserved for non-replication superuser
connections"? It seems like right now the comments say that
replication connections are a completely separate pool of connections,
but the documentation and the error message make it sound otherwise.
If that's true, then one of them is wrong, and I think it's the
docs/error message. Or am I just misreading it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: minor bug

2023-01-18 Thread Tom Lane
Laurenz Albe  writes:
> On Tue, 2023-01-17 at 10:32 -0500, Tom Lane wrote:
>> I seem to recall that the original idea was to report the timestamp
>> of the commit/abort record we are stopping at.  Maybe my memory is
>> faulty, but I think that'd be significantly more useful than the
>> current behavior, *especially* when the replay stopping point is
>> defined by something other than time.
>> (Also, the wording of the log message suggests that that's what
>> the reported time is supposed to be.  I wonder if somebody messed
>> this up somewhere along the way.)

> recoveryStopTime is set to recordXtime (the time of the xlog record)
> a few lines above that patch, so this is useful information if it is
> present.

Ah, but that only happens if recoveryTarget == RECOVERY_TARGET_TIME.
Digging in the git history, I see that this did use to work as
I remember: we always extracted the record time before printing it.
That was accidentally broken in refactoring in c945af80c.  I think
the correct fix is more like the attached.

regards, tom lane

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5e65785306..c14d1f3ef6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2548,8 +2548,13 @@ recoveryStopsBefore(XLogReaderState *record)
 		stopsHere = (recordXid == recoveryTargetXid);
 	}
 
-	if (recoveryTarget == RECOVERY_TARGET_TIME &&
-		getRecordTimestamp(record, &recordXtime))
+	/*
+	 * Note: we must fetch recordXtime regardless of recoveryTarget setting.
+	 * We don't expect getRecordTimestamp ever to fail, since we already know
+	 * this is a commit or abort record; but test its result anyway.
+	 */
+	if (getRecordTimestamp(record, &recordXtime) &&
+		recoveryTarget == RECOVERY_TARGET_TIME)
 	{
 		/*
 		 * There can be many transactions that share the same commit time, so


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas  wrote:
> On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> > pgstat_report_analyze() will totally override the
> > tabentry->dead_tuples information that drives autovacuum.c, based on
> > an estimate derived from a random sample -- which seems to me to be an
> > approach that just doesn't have any sound theoretical basis.
>
> Yikes. I think we don't have a choice but to have a method to correct
> the information somehow, because AFAIK the statistics system is not
> crash-safe. But that approach does seem to carry significant risk of
> overwriting correct information with wrong information.

This situation is really quite awful, so maybe we should do something
about it soon, in the scope of the Postgres 16 work on autovacuum that
is already underway. In fact I think that the problem here is so bad
that even something slightly less naive would be far more effective.

You're right to point out that pgstat_report_analyze needs to update
the stats in case there is a hard crash, of course. But there is
plenty of context with which to make better decisions close at hand.
For example, I bet that pgstat_report_analyze already does a pretty
good job of estimating live_tuples -- my spiel about statistics mostly
doesn't apply to live_tuples. Suppose that we notice that its new
estimate for live_tuples approximately matches what the stats
subsystem already thought about live_tuples, while dead_tuples is far
far lower. We shouldn't be so credulous as to believe the new
dead_tuples estimate at that point.

Perhaps we can change nothing about dead_tuples at all when this
happens. Or perhaps we can set dead_tuples to a value that is scaled
from the old estimate. The new dead_tuples value could be derived by
taking the ratio of the old live_tuples to the old dead_tuples, and
then using that to scale from the new live_tuples. This is just a
first pass, to see what you and others think. Even very simple
heuristics seem like they could make things much better.
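
In pseudo-C, the kind of thing I mean (the names and thresholds are
invented for illustration):

	/*
	 * If ANALYZE's live-tuple estimate agrees with what we already
	 * believed but its dead-tuple estimate is far lower, rescale the
	 * old dead/live ratio to the new live count instead of trusting
	 * the sample outright.
	 */
	if (estimates_agree(new_live, old_live) && new_dead < old_dead / 2)
	{
		double		ratio = (old_live > 0) ? (double) old_dead / old_live : 0;

		new_dead = (PgStat_Counter) rint(ratio * new_live);
	}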

Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
acquire_sample_rows() could pay attention to -- especially in larger
tables, where the difference between all pages and just the
all-visible subset of pages is most likely to matter. The more sampled
pages that had PD_ALL_VISIBLE set, the less credible the new
dead_tuples estimate will be (relative to existing information), and
so pgstat_report_analyze() should prefer the new estimate over the old
one in proportion to that.
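
Again just as a sketch (the counters here are invented), the weighting
could be as simple as:

	/*
	 * Lean on the sampled estimate only in proportion to the fraction
	 * of sampled pages that were not all-visible, since those are the
	 * only pages the sample really tells us anything about.
	 */
	double		notav = 1.0 - (double) allvisible_sampled / sampled_pages;

	new_dead = (PgStat_Counter)
		rint(notav * sampled_dead + (1.0 - notav) * old_dead);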

We probably shouldn't change anything about pgstat_report_vacuum as
part of this effort to make pgstat_report_analyze less terrible in the
near term. It certainly has its problems (what is true for pages that
VACUUM scanned at the end of VACUUM is far from representative for new
pages!), but it's probably much less of a contributor to issues like those
that Andres reports seeing.

BTW, one of the nice things about the insert-driven autovacuum stats
is that pgstat_report_analyze doesn't have an opinion about how many
tuples were inserted since the last VACUUM ran. It does have other
problems, but they seem less serious to me.

> Yep. I think what we should try to evaluate is which number is
> furthest from the truth. My guess is that the threshold is so high
> relative to what a reasonable value would be that you can't get any
> benefit out of making the dead tuple count more accurate. Like, if the
> threshold is 100x too high, or something, then who cares how accurate
> the dead tuples number is?

Right. Or if we don't make any reasonable distinction between LP_DEAD
items and dead heap-only tuples, then the total number of both things
together may matter very little. Better to be approximately correct
than exactly wrong. Deliberately introducing a bias to lower the
variance is a perfectly valid approach.

> Maybe that's even true in some scenarios, but I bet that
> it's never the issue when people have really big tables. The fact that
> I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
> 1TB of bloat in my 10TB table. Among other problems, I can't even
> vacuum away that much bloat in one index pass, because autovacuum
> can't use enough work memory for that. Also, the absolute space
> wastage matters.

I certainly agree with all that.

FWIW, part of my mental model with VACUUM is that the rules kind of
change in the case of a big table. We're far more vulnerable to issues
such as (say) waiting for cleanup locks because the overall cadence
used by autovacuum is so infrequently relative to everything else.
There are more opportunities for things to go wrong, worse
consequences when they do go wrong, and greater potential for the
problems to compound.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-17 12:08:01 -0800, Peter Geoghegan wrote:
> > I think that's not the fault of relfrozenxid as a trigger, but that we 
> > simply
> > don't keep enough other stats. We should imo at least keep track of:
> 
> If you assume that there is chronic undercounting of dead tuples
> (which I think is very common), then of course anything that triggers
> vacuuming is going to help with that problem -- it might be totally
> inadequate, but still make the critical difference by not allowing the
> system to become completely destabilized. I absolutely accept that
> users that are relying on that exist, and that those users ought to
> not have things get even worse -- I'm pragmatic. But overall, what we
> should be doing is fixing the real problem, which is that the dead
> tuples accounting is deeply flawed. Actually, it's not just that the
> statistics are flat out wrong; the whole model is flat-out wrong.

I think that depends on what "whole model" encompasses...


> The assumptions that work well for optimizer statistics quite simply
> do not apply here. Random sampling for this is just wrong, because
> we're not dealing with something that follows a distribution that can
> be characterized with a sufficiently large sample. With optimizer
> statistics, the entire contents of the table is itself a sample taken
> from the wider world -- so even very stale statistics can work quite
> well (assuming that the schema is well normalized). Whereas the
> autovacuum dead tuples stuff is characterized by constant change. I
> mean of course it is -- that's the whole point! The central limit
> theorem obviously just doesn't work for something like this  -- we
> cannot generalize from a sample, at all.

If we were to stop dropping stats after crashes, I think we likely could
afford to stop messing with dead tuple stats during analyze. Right now it's
valuable to some degree because it's a way to reasonably quickly recover from
lost stats.

The main way to collect inserted / dead tuple info for autovacuum's benefit is
via the stats collected when making changes.

We probably ought to simply not update dead tuples after analyze if the stats
entry has information about a prior [auto]vacuum. Or at least split the
fields.


> How many dead heap-only tuples are equivalent to one LP_DEAD item?
> What about page-level concentrations, and the implication for
> line-pointer bloat? I don't have a good answer to any of these
> questions myself. And I have my doubts that there are *any* good
> answers.

Hence my suggestion to track several of these via page level stats. In the big
picture it doesn't really matter that much whether there's 10 or 100 (dead
tuples|items) on a page that needs to be removed. It matters that the page
needs to be processed.
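
To make "page level stats" slightly more concrete, something on the order
of the following is what I mean (field names invented for illustration,
not a proposal for the actual pgstats layout):

	typedef struct PgStat_TableCleanupPages
	{
		PgStat_Counter pages_with_dead_items;	/* need heap/index cleanup */
		PgStat_Counter pages_unset_allvisible;	/* dropped out of the VM */
		PgStat_Counter pages_all_frozen;		/* all-frozen per the VM */
	} PgStat_TableCleanupPages;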


> Even these questions are the wrong questions (they're just less wrong).

I don't agree. Nothing is going to be perfect, but you're not going to be able
to do sensible vacuum scheduling without some stats, and it's fine if those
are an approximation, as long as the approximation makes some sense.


> I'd like to use the visibility map more for stuff here, too. It is
> totally accurate about all-visible/all-frozen pages, so many of my
> complaints about statistics don't really apply. Or need not apply, at
> least. If 95% of a table's pages are all-frozen in the VM, then of
> course it's pretty unlikely to be the right time to VACUUM the table
> if it's to clean up bloat -- this is just about the most reliable
> information we have access to.

I think querying that from stats is too expensive for most things. I suggested
tracking all-frozen in pg_class. Perhaps we should also track when pages are
*removed* from the VM in pgstats, I don't think we do today. That should give
a decent picture?



> > > This sounds like a great argument in favor of suspend-and-resume as a
> > > way of handling autocancellation -- no useful work needs to be thrown
> > > away for AV to yield for a minute or two.
> 
> > Hm, that seems a lot of work. Without having held a lock you don't even know
> > whether your old dead items still apply. Of course it'd improve the 
> > situation
> > substantially, if we could get it.
> 
> I don't think it's all that much work, once the visibility map
> snapshot infrastructure is there.
> 
> Why wouldn't your old dead items still apply?

Well, for one the table could have been rewritten. Of course we can add the
code to deal with that, but it is definitely something to be aware of. There
might also be some oddities around indexes getting added / removed.



> > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> > > special cases? Maybe they should always be able to auto cancel an
> > > autovacuum?
> >
> > Yea, I think so. It's not obvious how to best pass down that knowledge into
> > ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
> > way would be to change LockAcquireExtended() to get a flags argument instead
> > of reportMemoryError

Re: Non-superuser subscription owners

2023-01-18 Thread Mark Dilger



> On Jan 18, 2023, at 11:38 AM, Robert Haas  wrote:
> 
> I was just noticing that what was committed here didn't actually fix
> the problem implied by the subject line. That is, non-superuser still
> can't own subscriptions.

Not so.  They can.  See src/test/subscription/027_nosuperuser.pl

> To put that another way, there's no way for
> the superuser to delegate the setup and administration of logical
> replication to a non-superuser.

True.

> That's a bummer.

Also true.

> Reading the thread, I'm not quite sure why we seemingly did all the
> preparatory work and then didn't actually fix the problem.

Prior to the patch, if a superuser created a subscription, then later was 
demoted to non-superuser, the subscription apply workers still applied the 
changes with superuser force.  So creating a superuser Alice, letting Alice 
create a subscription, then revoking superuser from Alice didn't accomplish 
anything interesting.  But after the patch, it does.  The superuser can now 
create non-superuser subscriptions.  (I'm not sure this ability is well 
advertised.)  But the problem of non-superuser roles creating non-superuser 
subscriptions is not solved.

From a security perspective, the bit that was solved may be the more important 
part; from a usability perspective, perhaps not.

> It was
> previously proposed that we introduce a new predefined role
> pg_create_subscriptions and allow users who have the privileges of
> that predefined role to create and alter subscriptions. There are a
> few issues with that which, however, seem fairly solvable to me:
> 
> 1. Jeff pointed out that if you supply a connection string that is
> going to try to access local files, you'd better have
> pg_read_server_files, or else we should not let you use that
> connection string. I guess that's mostly a function of which
> parameters you specify, e.g. passfile, sslcert, sslkey, though maybe
> for host it depends on whether the value starts with a slash. We might
> need to think a bit here to make sure we get the rules right but it
> seems like a pretty solvable problem.
> 
> 2. There was also quite a bit of discussion of what to do if a user
> who was previously eligible to own a subscription ceases to be
> eligible, in particular around a superuser who is made into a
> non-superuser, but the same problem would apply if you had
> pg_create_subscriptions or pg_read_server_files and then lost it. My
> vote is to not worry about it too much. Specifically, I think we
> should certainly check whether the user has permission to create a
> subscription before letting them do so, but we could handle the case
> where the user already owns a subscription and tries to modify it by
> either allowing or denying the operation and I think either of those
> would be fine. I even think we could do one of those in some cases and
> the other in other cases and as long as there is some principle to the
> thing, it's fine. I argue that it's not a normal configuration and
> therefore it doesn't have to work in a particularly useful way. It
> shouldn't break the world in some horrendous way, but that's about as
> good as it needs to be. I'd argue for example that DROP SUBSCRIPTION
> could just check whether you own the object, and that ALTER
> SUBSCRIPTION could check whether you own the object and, if you're
> changing the connection string, also whether you would have privileges
> to set that new connection string on a new subscription.
> 
> 3. There was a bit of discussion of maybe wanting to allow users to
> create subscriptions with some connection strings but not others,
> perhaps by having some kind of intermediate object that owns the
> connection string and is owned by a superuser or someone with lots of
> privileges, and then letting a less-privileged user point a
> subscription at that object. I agree that might be useful to somebody,
> but I don't see why it's a hard requirement to get anything at all
> done here. Right now, a subscription contains a connection string
> directly. If in the future someone wants to introduce a CREATE
> REPLICATION DESTINATION command (or whatever) and have a way to point
> a subscription at a replication destination rather than a connection
> string directly, cool. Or if someone wants to wire this into CREATE
> SERVER somehow, also cool. But if you don't care about restricting
> which IPs somebody can try to access by providing a connection string
> of their choice, then you would be happy if we just did something
> simple here and left this problem for another day.
> 
> I am very curious to know (a) why work on this was abandoned (perhaps
> the answer is just lack of round tuits, in which case there is no more
> to be said)

Mostly, it was a lack of round-tuits.  After the patch was committed, I quickly 
switched my focus elsewhere.

> , and (b) what people think of (1)-(3) above

There are different ways of solving (1), and Jeff and I discussed them in Dec 
2021.  My recollection was that idea (3) was the cleanest.  Other ideas might 
be simpler than (3), or they may just appear simpler but in truth turn into a 
can of worms.  I don't know, since I never went as far as trying to implement 
either approach.

Idea (2) seems to contemplate non-superuser subscription owners as a 
theoretical thing, but it's quite real already.  Again, see 027_nosuperuser.pl.

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 3:15 PM Peter Geoghegan  wrote:
> Suppose that we notice that its new
> estimate for live_tuples approximately matches what the stats
> subsystem already thought about live_tuples, while dead_tuples is far
> far lower. We shouldn't be so credulous as to believe the new
> dead_tuples estimate at that point.
>
> Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
> acquire_sample_rows() could pay attention to -- especially in larger
> tables, where the difference between all pages and just the
> all-visible subset of pages is most likely to matter. The more sampled
> pages that had PD_ALL_VISIBLE set, the less credible the new
> dead_tuples estimate will be (relative to existing information), and
> so pgstat_report_analyze() should prefer the new estimate over the old
> one in proportion to that.

I don't know enough about the specifics of how this works to have an
intelligent opinion about how likely these particular ideas are to
work out. However, I think it's risky to look at estimates and try to
infer whether they are reliable. It's too easy to be wrong. What we
really want to do is anchor our estimates to some data source that we
know we can trust absolutely. If you trust possibly-bad data less, it
screws up your estimates more slowly, but it still screws them up.

If Andres is correct that what really matters is the number of pages
we're going to have to dirty, we could abandon counting dead tuples
altogether and just count not-all-visible pages in the VM map. That
would be cheap enough to recompute periodically. However, it would
also be a big behavior change from the way we do things now, so I'm
not sure it's a good idea. Still, a quantity that we can be certain
we're measuring accurately is better than one we can't measure
accurately even if it's a somewhat worse proxy for the thing we really
care about. There's a ton of value in not being completely and totally
wrong.
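
As a rough illustration of the kind of measurement I mean (this uses the
pg_visibility extension and a made-up table name; autovacuum itself would of
course read the VM directly rather than go through SQL):

  CREATE EXTENSION IF NOT EXISTS pg_visibility;

  SELECT count(*) AS not_all_visible_pages
  FROM pg_visibility_map('foo')
  WHERE NOT all_visible;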

> FWIW, part of my mental model with VACUUM is that the rules kind of
> change in the case of a big table. We're far more vulnerable to issues
> such as (say) waiting for cleanup locks because the overall cadence
> used by autovacuum is so infrequently relative to everything else.
> There are more opportunities for things to go wrong, worse
> consequences when they do go wrong, and greater potential for the
> problems to compound.

Yes. A lot of parts of PostgreSQL, including this one, were developed
a long time ago when PostgreSQL databases were a lot smaller than they
often are today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-01-18 Thread Robert Haas
On Wed, Jan 18, 2023 at 3:26 PM Mark Dilger
 wrote:
> Prior to the patch, if a superuser created a subscription, then later was 
> demoted to non-superuser, the subscription apply workers still applied the 
> changes with superuser force.  So creating a superuser Alice, letting Alice 
> create a subscription, then revoking superuser from Alice didn't accomplish 
> anything interesting.  But after the patch, it does.  The superuser can now 
> create non-superuser subscriptions.  (I'm not sure this ability is well 
> advertised.)  But the problem of non-superuser roles creating non-superuser 
> subscriptions is not solved.

Ah, OK, thanks for the clarification!

> There are different ways of solving (1), and Jeff and I discussed them in Dec 
> 2021.  My recollection was that idea (3) was the cleanest.  Other ideas might 
> be simpler than (3), or they may just appear simpler but in truth turn into a 
> can of worms.  I don't know, since I never went as far as trying to implement 
> either approach.
>
> Idea (2) seems to contemplate non-superuser subscription owners as a 
> theoretical thing, but it's quite real already.  Again, see 
> 027_nosuperuser.pl.

I think the solution to the problem of a connection string trying to
access local files is to just look at the connection string, decide
whether it does that, and if yes, require the owner to have
pg_read_server_files as well as pg_create_subscription. (3) is about
creating some more sophisticated and powerful solution to that
problem, but that seems like a nice-to-have, not something essential,
and a lot more complicated to implement.
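
In SQL terms the extra gate amounts to something like the membership test below
(a sketch only; the real check would live in C, and pg_create_subscription is
still only a proposed predefined role, not something that exists today):

  -- required in addition to the proposed pg_create_subscription role
  -- whenever the connection string references local files (passfile,
  -- sslcert, sslkey, or a host value starting with a slash)
  SELECT pg_has_role(current_user, 'pg_read_server_files', 'MEMBER');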

I guess what I listed as (2) is not relevant since I didn't understand
correctly what the current state of things is.

Unless I'm missing something, it seems like this could be a quite small patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-01-18 Thread Mark Dilger



> On Jan 18, 2023, at 12:51 PM, Robert Haas  wrote:
> 
> Unless I'm missing something, it seems like this could be a quite small patch.

I didn't like the idea of the create/alter subscription commands needing to 
parse the connection string and think about what it might do, because at some 
point in the future we might extend what things are allowed in that string, and 
we have to keep everything that contemplates that string in sync.  I may have 
been overly hesitant to tackle that problem.  Or maybe I just ran short of 
round tuits.
 
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas  wrote:
> I don't know enough about the specifics of how this works to have an
> intelligent opinion about how likely these particular ideas are to
> work out. However, I think it's risky to look at estimates and try to
> infer whether they are reliable. It's too easy to be wrong. What we
> really want to do is anchor our estimates to some data source that we
> know we can trust absolutely. If you trust possibly-bad data less, it
> screws up your estimates more slowly, but it still screws them up.

Some of what I'm proposing arguably amounts to deliberately adding a
bias. But that's not an unreasonable thing in itself. I think of it as
related to the bias-variance tradeoff, which is a concept that comes
up a lot in machine learning and statistical inference.

We can afford to be quite imprecise at times, especially if we choose
a bias that we know has much less potential to do us harm -- some
mistakes hurt much more than others. We cannot afford to ever be
dramatically wrong, though -- especially in the direction of vacuuming
less often.

Besides, there is something that we *can* place a relatively high
degree of trust in that will still be in the loop here: VACUUM itself.
If VACUUM runs then it'll call pgstat_report_vacuum(), which will set
the record straight in the event of over estimating dead tuples. To
some degree the problem of over estimating dead tuples is
self-limiting.
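
For instance (with a placeholder table name), the correction shows up in the
per-table stats shortly after a manual VACUUM:

  VACUUM foo;

  SELECT n_dead_tup, last_vacuum
  FROM pg_stat_user_tables
  WHERE relname = 'foo';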

> If Andres is correct that what really matter is the number of pages
> we're going to have to dirty, we could abandon counting dead tuples
> altogether and just count not-all-visible pages in the VM map.

That's what matters most from a cost point of view IMV. So it's a big
part of the overall picture, but not everything. It tells us
relatively little about the benefits, except perhaps when most pages
are all-visible.

-- 
Peter Geoghegan




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Andrew Dunstan


On 2023-01-18 We 14:32, Tom Lane wrote:
> One more thing before we move on from this topic.  I'd been testing
> modified versions of the AdjustUpgrade.pm logic by building from a
> --from-source source tree, which seemed way easier than dealing
> with a private git repo.  As it stands, TestUpgradeXversion.pm
> refuses to run under $from_source, but I just diked that check out
> and it seemed to work fine for my purposes.  Now, that's going to be
> a regular need going forward, so I'd like to not need a hacked version
> of the BF client code to do it.
>
> Also, your committed version of TestUpgradeXversion.pm breaks that
> use-case because you did
>
> -   unshift(@INC, "$self->{pgsql}/src/test/perl");
> +   unshift(@INC, "$self->{buildroot}/$this_branch/pgsql/src/test/perl");
>
> which AFAICS is an empty directory in a $from_source run.
>
> I suppose that the reason for not running under $from_source is to
> avoid corrupting the saved installations with unofficial versions.
> However, couldn't we skip the "save" step and still run the upgrade
> tests against whatever we have saved?  (Maybe skip the same-version
> test, as it's not quite reflecting any real case then.)
>
> Here's a quick draft patch showing what I have in mind.  There may
> well be a better way to deal with the wheres-the-source issue than
> what is in hunk 2.  Also, I didn't reindent the unchanged code in
> sub installcheck, and I didn't add anything about skipping
> same-version tests.


No, that won't work if we're using vpath builds (which was why I changed
it from what you had). $self->{pgsql} is always the build directory.

Something like this should do it:


my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 12:15:17 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 11:02 AM Robert Haas  wrote:
> > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> > > pgstat_report_analyze() will totally override the
> > > tabentry->dead_tuples information that drives autovacuum.c, based on
> > > an estimate derived from a random sample -- which seems to me to be an
> > > approach that just doesn't have any sound theoretical basis.
> >
> > Yikes. I think we don't have a choice but to have a method to correct
> > the information somehow, because AFAIK the statistics system is not
> > crash-safe. But that approach does seem to carry significant risk of
> > overwriting correct information with wrong information.

I suggested nearby to only have ANALYZE update dead_tuples if there's been no
[auto]vacuum since the stats entry was created. That allows recovering from
stats resets, be it via crashes or explicitly. What do you think?

To add insult to injury, we overwrite accurate information gathered by VACUUM
with bad information gathered by ANALYZE if you do VACUUM ANALYZE.



One complicating factor is that VACUUM sometimes computes an incrementally
more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
computes something sane. I unintentionally encountered one when I was trying
something while writing this email, reproducer attached.


VACUUM (DISABLE_PAGE_SKIPPING) foo;
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│        901 │         50 │
└────────────┴────────────┘

after one VACUUM:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    8549905 │         50 │
└────────────┴────────────┘

after 9 more VACUUMs:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    5388421 │         50 │
└────────────┴────────────┘
(1 row)


I briefly tried it out, and it does *not* reproduce in 11, but does in
12. Haven't dug into what the cause is, but we probably use the wrong
denominator somewhere...

Greetings,

Andres Freund


vactest.sql
Description: application/sql


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan  wrote:
> Some of what I'm proposing arguably amounts to deliberately adding a
> bias. But that's not an unreasonable thing in itself. I think of it as
> related to the bias-variance tradeoff, which is a concept that comes
> up a lot in machine learning and statistical inference.

To be clear, I was thinking of unreservedly trusting what
pgstat_report_analyze() says about dead_tuples in the event of its
estimate increasing our dead_tuples estimate, while being skeptical
(to a varying degree) when it's the other way around.

But now I need to go think about what Andres just brought up...

-- 
Peter Geoghegan




Re: almost-super-user problems that we haven't fixed yet

2023-01-18 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 02:51:38PM -0500, Robert Haas wrote:
> Should (nfree < SuperuserReservedBackends) be using <=, or am I confused?

I believe < is correct.  At this point, the new backend will have already
claimed a proc struct, so if the number of remaining free slots equals the
number of reserved slots, it is okay.
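
To spell it out with made-up numbers (illustration only; the query is just a
convenient way to see the two knobs involved):

  -- Suppose max_connections = 10 and superuser_reserved_connections = 3.
  -- The new non-superuser backend has already taken its proc, so the one
  -- that brings the count to 7 sees nfree = 3; 3 < 3 is false and it is
  -- admitted.  The one that would bring it to 8 sees nfree = 2; 2 < 3, so
  -- it is rejected unless it's a superuser.  Ordinary connections therefore
  -- top out at 10 - 3 = 7.
  SELECT name, setting
  FROM pg_settings
  WHERE name IN ('max_connections', 'superuser_reserved_connections');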

> What's the deal with removing "and no new replication connections will
> be accepted" from the documentation? Is the existing documentation
> just wrong? If so, should we fix that first? And maybe delete
> "non-replication" from the error message that says "remaining
> connection slots are reserved for non-replication superuser
> connections"? It seems like right now the comments say that
> replication connections are a completely separate pool of connections,
> but the documentation and the error message make it sound otherwise.
> If that's true, then one of them is wrong, and I think it's the
> docs/error message. Or am I just misreading it?

I think you are right.  This seems to have been missed in ea92368.  I moved
this part to a new patch that should probably be back-patched to v12.

On that note, I wonder if it's worth changing the "sorry, too many clients
already" message to make it clear that max_connections has been reached.
IME some users are confused by this error, and I think it would be less
confusing if it pointed to the parameter that governs the number of
connection slots.  I'll create a new thread for this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 8f0aa2fa54ae01149cffe9a69265f98e76a08a23 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 18 Jan 2023 12:43:41 -0800
Subject: [PATCH v3 1/3] Code review for ea92368.

This commit missed an error message and a line in the docs.

Back-patch to v12.
---
 doc/src/sgml/config.sgml  | 3 +--
 src/backend/utils/init/postinit.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89d53f2a64..e019a1aac9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -725,8 +725,7 @@ include_dir 'conf.d'
 number of active concurrent connections is at least
 max_connections minus
 superuser_reserved_connections, new
-connections will be accepted only for superusers, and no
-new replication connections will be accepted.
+connections will be accepted only for superusers.

 

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..9145d96b38 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -931,7 +931,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 		!HaveNFreeProcs(ReservedBackends))
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
- errmsg("remaining connection slots are reserved for non-replication superuser connections")));
+ errmsg("remaining connection slots are reserved for superusers")));
 
 	/* Check replication permissions needed for walsender processes. */
 	if (am_walsender)
-- 
2.25.1

From bc110461e4f0a73c2b76ba0a6e821349c2cbe3df Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v3 2/3] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC
that will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +-
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char	   *Unix_socket_directories;
 char	   *ListenAddresses;
 
 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int			ReservedBackends;
+int			SuperuserReservedBackends;
 
 /* The socket(s) we're listening to. */
 

Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-01-18 We 14:32, Tom Lane wrote:
>> I suppose that the reason for not running under $from_source is to
>> avoid corrupting the saved installations with unofficial versions.
>> However, couldn't we skip the "save" step and still run the upgrade
>> tests against whatever we have saved?  (Maybe skip the same-version
>> test, as it's not quite reflecting any real case then.)

> Something like this should do it:
> my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";

Ah, I didn't understand that $from_source is a path not just a bool.

What do you think about the above questions?  Is this $from_source
exclusion for the reason I guessed, or some other one?

regards, tom lane




Re: document the need to analyze partitioned tables

2023-01-18 Thread Bruce Momjian
On Wed, Jan 18, 2023 at 10:15:18AM +0100, Laurenz Albe wrote:
> On Tue, 2023-01-17 at 16:16 -0500, Bruce Momjian wrote:
> > On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> > > Maybe (all?) the clarification the docs need is to say:
> > > "Partitioned tables are not *themselves* processed by autovacuum."
> > 
> > Yes, I think the lack of autovacuum needs to be specifically mentioned
> > since most people assume autovacuum handles _all_ statistics updating.
> > 
> > Can someone summarize how bad it is we have no statistics on partitioned
> > tables?  It sounds bad to me.
> 
> Andrey Lepikhov had an example earlier in this thread[1].  It doesn't take
> an exotic query. 
> 
> Attached is a new version of my patch that tries to improve the wording.

Ah, yes, that is the example I saw but could not re-find.  Here is the
output:

CREATE TABLE test (id integer, val integer) PARTITION BY hash (id);

CREATE TABLE test_0 PARTITION OF test
   FOR VALUES WITH (modulus 2, remainder 0);
CREATE TABLE test_1 PARTITION OF test
   FOR VALUES WITH (modulus 2, remainder 1);

INSERT INTO test (SELECT q, q FROM generate_series(1,10) AS q);

VACUUM ANALYZE test;

INSERT INTO test (SELECT q, q%2 FROM generate_series(11,200) AS q);

VACUUM ANALYZE test_0,test_1;

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Hash Join  (cost=7.50..15.25 rows=200 width=16) (actual rows=105 loops=1)
   Hash Cond: (t1.id = t2.val)
   ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         ->  Seq Scan on test_0 t1_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
         ->  Seq Scan on test_1 t1_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 16kB
         ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
               ->  Seq Scan on test_0 t2_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
               ->  Seq Scan on test_1 t2_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)

VACUUM ANALYZE test;

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Hash Join  (cost=7.50..15.25 rows=200 width=16) (actual rows=105 loops=1)
   Hash Cond: (t2.val = t1.id)
   ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         ->  Seq Scan on test_0 t2_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
         ->  Seq Scan on test_1 t2_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)
   ->  Hash  (cost=5.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 16kB
         ->  Append  (cost=0.00..5.00 rows=200 width=8) (actual rows=200 loops=1)
               ->  Seq Scan on test_0 t1_1  (cost=0.00..2.13 rows=113 width=8) (actual rows=113 loops=1)
               ->  Seq Scan on test_1 t1_2  (cost=0.00..1.87 rows=87 width=8) (actual rows=87 loops=1)

I see the inner side uses 'val' in the first EXPLAIN and 'id' in the
second, and you are right that 'val' has mostly 0/1.

Is it possible to document when partitioned table statistics help?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Ability to reference other extensions by schema in extension scripts

2023-01-18 Thread Sandro Santilli
On Mon, Jan 16, 2023 at 11:57:30PM -0500, Regina Obe wrote:

> What would be more bullet-proof is having an extra column in pg_extension or
> adding an extra array element to pg_extension.extcondition[] that allows you
> to say "Hey, don't allow this to be relocatable cause other extensions
> depend on it that have explicitly referenced the schema."

I've given this some more thought and I think a good
compromise could be to add the safety net in ALTER EXTENSION SET SCHEMA
so that it checks not only "extrelocatable" but also the presence
of any extension effectively depending on it, in which case the
operation could be prevented with a more useful message than
"extension does not support SET SCHEMA" (what is currently output).

Example query to determine those cases:

  SELECT e.extname, array_agg(v.name)
 FROM pg_extension e, pg_available_extension_versions v
 WHERE e.extname = ANY( v.requires )
 AND e.extrelocatable
 AND v.installed group by e.extname;

      extname    |        array_agg
   ---------------+--------------------------
    fuzzystrmatch | {postgis_tiger_geocoder}

--strk;




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 13:08:44 -0800, Andres Freund wrote:
> One complicating factor is that VACUUM sometimes computes an incrementally
> more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
> computes something sane. I unintentionally encountered one when I was trying
> something while writing this email, reproducer attached.
> 
> 
> VACUUM (DISABLE_PAGE_SKIPPING) foo;
> SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │        901 │         50 │
> └────────────┴────────────┘
> 
> after one VACUUM:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    8549905 │         50 │
> └────────────┴────────────┘
> 
> after 9 more VACUUMs:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    5388421 │         50 │
> └────────────┴────────────┘
> (1 row)
> 
> 
> I briefly tried it out, and it does *not* reproduce in 11, but does in
> 12. Haven't dug into what the cause is, but we probably use the wrong
> denominator somewhere...

Oh, it does actually reproduce in 11 too - my script just didn't see it
because it was "too fast". For some reason < 12 it takes longer for the new
pgstat snapshot to be available. If I add a few sleeps, it shows in 11.

The real point of change appears to be 10->11.

There's a relevant looking difference in the vac_estimate_reltuples call:
10:
	/* now we can compute the new value for pg_class.reltuples */
	vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
														 nblocks,
														 vacrelstats->tupcount_pages,
														 num_tuples);

11:
	/* now we can compute the new value for pg_class.reltuples */
	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
														  nblocks,
														  vacrelstats->tupcount_pages,
														  live_tuples);
which points to:

commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
Author: Tom Lane 
Date:   2018-03-22 15:47:29 -0400

Sync up our various ways of estimating pg_class.reltuples.

Greetings,

Andres Freund




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund  wrote:
> I suggested nearby to only have ANALYZE update dead_tuples if there's been no
> [auto]vacuum since the stats entry was created. That allows recovering from
> stats resets, be it via crashes or explicitly. What do you think?

I like that idea. It's far simpler than the kind of stuff I was
thinking about, and probably just as effective. Even if it introduces
some unforeseen problem (which seems unlikely), we can still rely on
pgstat_report_vacuum() to set things straight before too long.

Are you planning on writing a patch for this? I'd be very interested
in seeing this through. Could definitely review it.

-- 
Peter Geoghegan




RE: Ability to reference other extensions by schema in extension scripts

2023-01-18 Thread Regina Obe
> On Mon, Jan 16, 2023 at 11:57:30PM -0500, Regina Obe wrote:
> 
> > What would be more bullet-proof is having an extra column in
> > pg_extension or adding an extra array element to
> > pg_extension.extcondition[] that allows you to say "Hey, don't allow
> > this to be relocatable cause other extensions depend on it that have
> > explicitly referenced the schema."
> 
> I've given this some more thought and I think a good compromise could be
> to add the safety net in ALTER EXTENSION SET SCHEMA so that it checks not
> only "extrelocatable" but also the presence of any extension effectively
> depending on it, in which case the operation could be prevented with a
> more useful message than "extension does not support SET SCHEMA" (what is
> currently output).
> 
> Example query to determine those cases:
> 
>   SELECT e.extname, array_agg(v.name)
>  FROM pg_extension e, pg_available_extension_versions v
>  WHERE e.extname = ANY( v.requires )
>  AND e.extrelocatable
>  AND v.installed group by e.extname;
> 
>      extname    |        array_agg
>   ---------------+--------------------------
>    fuzzystrmatch | {postgis_tiger_geocoder}
> 
> --strk;

The only problem with the above is that it then bars an extension from being
relocated even if no extension references its schema.  Note you wouldn't
be able to tell if an extension references a schema without analyzing the
install script.  Which is why I was thinking another property would be
better, because that could be checked during the install/upgrade of the
dependent extensions.

I personally would be okay with this, and it is easier to code I think and
doesn't require structural changes, but I'm not sure others would be, as it's
taking away permissions they had before when it wasn't necessary to do so.

Thanks,
Regina





Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Andrew Dunstan


On 2023-01-18 We 16:14, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 2023-01-18 We 14:32, Tom Lane wrote:
>>> I suppose that the reason for not running under $from_source is to
>>> avoid corrupting the saved installations with unofficial versions.
>>> However, couldn't we skip the "save" step and still run the upgrade
>>> tests against whatever we have saved?  (Maybe skip the same-version
>>> test, as it's not quite reflecting any real case then.)
>> Something like this should do it:
>> my $source_tree = $from_source || "$self->{buildroot}/$this_branch/pgsql";
> Ah, I didn't understand that $from_source is a path not just a bool.
>
> What do you think about the above questions?  Is this $from_source
> exclusion for the reason I guessed, or some other one?
>
>   

Yes, the reason is that, unlike almost everything else in the buildfarm,
cross version upgrade testing requires saved state (binaries and data
directory), and we don't want from-source builds corrupting that state.

I think we can do what you want but it's a bit harder than what you've
done. If we're not going to save the current run's product then we need
to run the upgrade test from a different directory (probably directly in
"$buildroot/$this_branch/inst"). Otherwise we'll be testing upgrade to
the saved product of a previous run of this branch. I'll take a stab at
it tomorrow if you like.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-18 Thread Tom Lane
Andrew Dunstan  writes:
> I think we can do what you want but it's a bit harder than what you've
> done. If we're not going to save the current run's product then we need
> to run the upgrade test from a different directory (probably directly in
> "$buildroot/$this_branch/inst"). Otherwise we'll be testing upgrade to
> the saved product of a previous run of this branch.

Hmm, maybe that explains some inconsistent results I remember getting.

> I'll take a stab at it tomorrow if you like.

Please do.

regards, tom lane




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Andres Freund
Hi,

On 2023-01-18 13:42:40 -0800, Andres Freund wrote:
> The real point of change appears to be 10->11.
>
> There's a relevant looking difference in the vac_estimate_reltuples call:
> 10:
> 	/* now we can compute the new value for pg_class.reltuples */
> 	vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
> 														 nblocks,
> 														 vacrelstats->tupcount_pages,
> 														 num_tuples);
>
> 11:
> 	/* now we can compute the new value for pg_class.reltuples */
> 	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
> 														  nblocks,
> 														  vacrelstats->tupcount_pages,
> 														  live_tuples);
> which points to:
>
> commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
> Author: Tom Lane 
> Date:   2018-03-22 15:47:29 -0400
>
> Sync up our various ways of estimating pg_class.reltuples.

The problem with the change is here:

/*
 * Okay, we've covered the corner cases.  The normal calculation is to
 * convert the old measurement to a density (tuples per page), then
 * estimate the number of tuples in the unscanned pages using that figure,
 * and finally add on the number of tuples in the scanned pages.
 */
old_density = old_rel_tuples / old_rel_pages;
unscanned_pages = (double) total_pages - (double) scanned_pages;
total_tuples = old_density * unscanned_pages + scanned_tuples;
return floor(total_tuples + 0.5);


Because we'll re-scan the pages for not-yet-removable rows in subsequent
vacuums, the next vacuum will process the same pages again. By using
scanned_tuples = live_tuples, we basically remove not-yet-removable tuples
from reltuples, each time.
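
To make the drift concrete, here's a toy calculation (all numbers invented): a
1000-page table starts with reltuples = 10 million; each vacuum re-scans the
same 100 pages, which hold 500k live tuples plus the not-yet-removable ones,
so scanned_tuples = 500000.  Feeding that repeatedly through the formula above
walks the estimate down toward the live-only density (the fixed point here is
5 million):

  WITH RECURSIVE est(pass, reltuples) AS (
      SELECT 0, 10000000::float8
      UNION ALL
      SELECT pass + 1, (reltuples / 1000) * 900 + 500000
      FROM est
      WHERE pass < 10
  )
  SELECT pass, round(reltuples) AS reltuples FROM est;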

The commit *did* try to account for that to some degree:

+/* also compute total number of surviving heap entries */
+vacrelstats->new_rel_tuples =
+vacrelstats->new_live_tuples + vacrelstats->new_dead_tuples;


but new_rel_tuples isn't used for pg_class.reltuples or pgstat.


This is pretty nasty. We use reltuples for a lot of things. And while analyze
might fix it sometimes, that won't reliably be the case, particularly when
there are repeated autovacuums due to a long-running transaction - there's no
cause for auto-analyze to trigger again soon, while autovacuum will go at it
again and again.

Greetings,

Andres Freund




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund  wrote:
> The problem with the change is here:
>
> /*
>  * Okay, we've covered the corner cases.  The normal calculation is to
>  * convert the old measurement to a density (tuples per page), then
>  * estimate the number of tuples in the unscanned pages using that figure,
>  * and finally add on the number of tuples in the scanned pages.
>  */
> old_density = old_rel_tuples / old_rel_pages;
> unscanned_pages = (double) total_pages - (double) scanned_pages;
> total_tuples = old_density * unscanned_pages + scanned_tuples;
> return floor(total_tuples + 0.5);

My assumption has always been that vac_estimate_reltuples() is prone
to issues like this because it just doesn't have access to very much
information each time it runs. It can only see the delta between what
VACUUM just saw, and what the last VACUUM (or possibly the last
ANALYZE) saw according to pg_class. You're always going to find
weaknesses in such a model if you go looking for them. You're always
going to find a way to salami slice your way from good information to
total nonsense, if you pick the right/wrong test case, which runs
VACUUM in a way that allows whatever bias there may be to accumulate.
It's sort of like the way floating point values can become very
inaccurate through a process that allows many small inaccuracies to
accumulate over time.

Maybe you're right to be concerned to the degree that you're concerned
-- I'm not sure. I'm just adding what I see as important context.

-- 
Peter Geoghegan



