Re: Some shared memory chunks are allocated even if related processes won't start
On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote: > Dear hackers, > > While reading codes, I found that ApplyLauncherShmemInit() and > AutoVacuumShmemInit() are always called even if they would not be > launched. Note that there are situations where the autovacuum launcher is started even though autovacuum is nominally turned off, and I suspect your proposal would break that. IIRC this occurs when the Xid or multixact counters cross the max_freeze_age threshold. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Porque Kim no hacía nada, pero, eso sí, con extraordinario éxito" ("Kim", Kipling)
Re: Avoid stack frame setup in performance critical routines using tail calls
Hi, On 2024-03-04 17:43:50 +1300, David Rowley wrote: > On Thu, 29 Feb 2024 at 00:29, David Rowley wrote: > > I switched over to working on doing what you did in 0002 for > > generation.c and slab.c. > > > > See the attached patch which runs the same test as in [1] (aset.c is > > just there for comparisons between slab and generation) > > > > The attached includes some additional tuning to generation.c: > > I've now pushed this. Thanks for working on all these, much appreciated! Greetings, Andres Freund
Re: Comments on Custom RMGRs
> On 29 Feb 2024, at 19:47, Danil Anisimow wrote: > > Answering your questions might take some time as I want to write a sample > patch for pg_stat_statements and make some tests. > What do you think about putting the patch into the commitfest, as it is closing in a > few hours? I’ve switched the patch to “Waiting on Author” to indicate that the patch is not available yet. Please flip it back when it’s available for review. Thanks! Best regards, Andrey Borodin.
Re: initdb's -c option behaves wrong way?
> On 4 Mar 2024, at 02:01, Tom Lane wrote: > > Daniel Gustafsson writes: >> I took the liberty to add this to the upcoming CF to make sure we don't lose >> track of it. > > Thanks for doing that, because the cfbot pointed out a problem: > I should have written pg_strncasecmp not strncasecmp. If this > version tests cleanly, I'll push it. +1, LGTM. -- Daniel Gustafsson
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi, On Sun, Mar 03, 2024 at 03:44:34PM -0600, Nathan Bossart wrote: > On Sun, Mar 03, 2024 at 11:40:00PM +0530, Bharath Rupireddy wrote: > > On Sat, Mar 2, 2024 at 3:41 AM Nathan Bossart > > wrote: > >> Would you ever see "conflict" as false and "invalidation_reason" as > >> non-null for a logical slot? > > > > No. Because both conflict and invalidation_reason are decided based on > > the invalidation reason i.e. value of slot_contents.data.invalidated. > > IOW, a logical slot that reports conflict as true must have been > > invalidated. > > > > Do you have any thoughts on reverting 007693f and introducing > > invalidation_reason? > > Unless I am misinterpreting some details, ISTM we could rename this column > to invalidation_reason and use it for both logical and physical slots. I'm > not seeing a strong need for another column. Yeah, having two columns was more for convenience purposes. Without the "conflict" one, a slot conflicting with recovery would be "a logical slot having a non-NULL invalidation_reason". I'm also fine with one column if most of you prefer it that way. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CF entries for 17 to be reviewed
> On 3 Mar 2024, at 01:19, Melanie Plageman wrote: > > I'm not > sure if the ones that do have a target version = 17 are actually all > the patches targeting 17. Yes, for me it’s only a hint where to bump things up. I will extend the scope to other versions when I finish a pass through the entries with version 17. Here’s my take on the “Miscellaneous” section.
* Permute underscore separated components of columns before fuzzy matching The patch received some review, but not for the latest version. I pinged the reviewer for an update.
* Add pg_wait_for_lockers() function A Citus-related patch, but it may be of use to other distributed systems. This patch is worth attention, at least because the author replied to himself 10+ times. An interesting addition, with some feedback from Andres and Laurenz.
* Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set A relatively simple change to improve user-friendliness. Daniel Gustafsson expressed interest recently.
* Fix log_line_prefix to display the transaction id (%x) for statements not in a transaction block A reasonable log improvement; the code is simple but in a tricky place. There was some feedback, and I've asked whether the respondent can be a reviewer.
* Add Index-level REINDEX with multiple jobs An addition to the DBA toolset. Some unaddressed feedback; I pinged the authors.
* Add LSN <-> time conversion facility There's an ongoing discussion between Melanie and Tomas. A relatively heavyweight patchset, but given the solid technical level of the discussion this might land in 17. Take your chance to chime in with a review! :)
* date_trunc function in interval version Some time tricks. There are review notes by Tomas. I pinged the authors.
* Adding comments to help understand psql hidden queries A couple of patches for psql --echo-hidden. Seems useful and simple, though there are no reviews at all. I moved the patch to "Clients" to reflect the actual patch purpose and lighten the generic “Miscellaneous” section.
* Improving EXPLAIN's display of SubPlan nodes Some EXPLAIN changes; Alexander Alekseev was looking into this. I've asked him if he would be the reviewer.
* Should we remove -Wdeclaration-after-statement? Not really a patch, more of an opinion poll. The tally is now around -BigNumber; I see no chance for this to get into 17, but maybe in the future.
* Add to_regtypemod() SQL function A nice function; some reviewers approved the patch. I took a glance at the patch, it seems fine, so I switched it to "Ready for Committer". There are some unrelated changes to genbki.pl, but according to the thread they were needed for something.
* un-revert MAINTAIN privilege and pg_maintain predefined role The work seems to be going on.
* Checkpoint extension hook The patch is not provided yet. I've pinged the thread.
Stay tuned; I hope everyone interested in reviewing will find themselves a cool, interesting patch or two. Best regards, Andrey Borodin.
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Fri, Mar 01, 2024 at 03:03:14PM -0600, Justin Pryzby wrote: > I think if the user sets something "explicitly", the catalog should > reflect what they set. Tablespaces have dattablespace, but AMs don't -- > it's a simpler case. Okay. > For 001: we don't *need* to support "ALTER SET AM default" for leaf > tables. It doesn't do anything that's not already possible. But, if > AMs for partitioned tables are optional rather than required, then seems > to be needed to allow (re)settinng relam=0. Indeed, for non-partitioned tables DEFAULT is a sugar flavor. Not mandatory, still it's nice to have to not have to type an AM. > But for partitioned tables, I think it should set relam=0 directly. > Currently it 1) falls through to default_table_am; and 2) detects that > it's the default, so then sets relam to 0. > > Since InvalidOid is already taken, I guess you might need to introduce a > boolean flag, like set_relam, indicating that the statement has an > ACCESS METHOD clause. Yes, I don't see an alternative. The default needs a different field to be tracked down to the execution. >> + * method defined so as their children can inherit it; however, this is >> handled > > so that > >> + * Do nothing: access methods is a setting that partitions can > > method (singular), or s/is/are/ Indeed. Fixed both. > In any case, it'd be a bit confusing for the error message to still say: > > postgres=# CREATE TABLE a(i int) PARTITION BY RANGE(a) USING heap2; > ERROR: specifying a table access method is not supported on a partitioned > table I was looking at this one as well and I don't see why we could not remove it, so you are right (missed the tablespace part last week). A partitioned table created as a partition of a partitioned table would inherit the relam of its parent (0 if default is set, or non-0 is something is set). I have added some regression tests for that. And I'm finishing with the attached. To summarize SET ACCESS METHOD on a partitioned table, the semantics are: - DEFAULT sets the relam to 0, any partitions with storage would use the GUC at creation time. Partitioned tables use a relam of 0. - If a value is set for the am, relam becomes non-0. Any partitions created on it inherit it (partitioned as well as non-partitioned tables). - No USING clause means to set its relam to 0. 0001 seems OK here, 0002 needs more eyes. The bulk of the changes is in the regression tests to cover all the cases I could think of. -- Michael From 4d8a5de79b829e4f1be875c668f85bbfa6d3f37a Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 1 Mar 2024 10:22:22 +0900 Subject: [PATCH v3 1/2] Add DEFAULT option to ALTER TABLE SET ACCESS METHOD --- src/backend/commands/tablecmds.c| 3 ++- src/backend/parser/gram.y | 10 -- src/test/regress/expected/create_am.out | 21 + src/test/regress/sql/create_am.sql | 11 +++ doc/src/sgml/ref/alter_table.sgml | 6 -- 5 files changed, 46 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index f798794556..3b1c2590fd 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -15212,7 +15212,8 @@ ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname) Oid amoid; /* Check that the table access method exists */ - amoid = get_table_am_oid(amname, false); + amoid = get_table_am_oid(amname ? 
amname : default_table_access_method, + false); if (rel->rd_rel->relam == amoid) return; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 130f7fc7c3..c6e2f679fd 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -338,6 +338,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type alter_identity_column_option_list %type alter_identity_column_option %type set_statistics_value +%type set_access_method_name %type createdb_opt_list createdb_opt_items copy_opt_list transaction_mode_list @@ -2859,8 +2860,8 @@ alter_table_cmd: n->newowner = $3; $$ = (Node *) n; } - /* ALTER TABLE SET ACCESS METHOD */ - | SET ACCESS METHOD name + /* ALTER TABLE SET ACCESS METHOD { | DEFAULT } */ + | SET ACCESS METHOD set_access_method_name { AlterTableCmd *n = makeNode(AlterTableCmd); @@ -3076,6 +3077,11 @@ set_statistics_value: | DEFAULT { $$ = NULL; } ; +set_access_method_name: + ColId { $$ = $1; } + | DEFAULT { $$ = NULL; } + ; + PartitionBoundSpec: /* a HASH partition */ FOR VALUES WITH '(' hash_partbound ')' diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out index b50293d514..e843d39ee7 100644 --- a/src/test/regress/expected/create_am.out +++ b/src/test/regress/expected/create_am.out @@ -283,6 +283,27 @@ SELECT COUNT(a), COUNT(1) FILTER(WHERE a=1) FROM heaptable;
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 29, 2024 at 03:38:59PM +0530, Amit Kapila wrote: > On Thu, Feb 29, 2024 at 9:13 AM Peter Smith wrote: > > > > On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila > > wrote: > > > > > > > > Also, adding wait sounds > > > more like a boolean. So, I don't see the proposed names any better > > > than the current one. > > > > > > > Anyway, the point is that the current GUC name 'standby_slot_names' is > > not ideal IMO because it doesn't have enough meaning by itself -- e.g. > > you have to read the accompanying comment or documentation to have any > > idea of its purpose. > > > > Yeah, one has to read the description but that is true for other > parameters like "temp_tablespaces". I don't have any better ideas but > open to suggestions. What about "non_lagging_standby_slots"? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro, Thanks for the comments! > > While reading codes, I found that ApplyLauncherShmemInit() and > > AutoVacuumShmemInit() are always called even if they would not be > > launched. > > Note that there are situations where the autovacuum launcher is started > even though autovacuum is nominally turned off, and I suspect your > proposal would break that. IIRC this occurs when the Xid or multixact > counters cross the max_freeze_age threshold. Right. In GetNewTransactionId(), SetTransactionIdLimit() and some other places, PMSIGNAL_START_AUTOVAC_LAUNCHER is sent to the postmaster when the xid exceeds autovacuum_freeze_max_age. This behavior is already described in the docs [1]:
```
To ensure that this does not happen, autovacuum is invoked on any table that might contain unfrozen rows with XIDs older than the age specified by the configuration parameter autovacuum_freeze_max_age. (This will happen even if autovacuum is disabled.)
```
This means that my first idea won't work well. Even if the postmaster does not initially allocate the shared memory, backends may request to start autovacuum and use the region. However, the second idea, allocating the shared memory dynamically, is still valid. This would be a bit more efficient for systems where tuples are never frozen. Thoughts? [1]: https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
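[Editor's note: to make the "allocate dynamically" idea concrete, here is a minimal C sketch using the recently added DSM registry (GetNamedDSMSegment()). The segment name, callback, and helper function are illustrative assumptions, not the actual proposal.]
```
/*
 * Minimal sketch of the "allocate on demand" idea inside autovacuum.c,
 * using the DSM registry.  Names are illustrative only.
 */
#include "postgres.h"
#include "storage/dsm_registry.h"

static void
autovac_shmem_init_cb(void *ptr)
{
	AutoVacuumShmemStruct *shmem = (AutoVacuumShmemStruct *) ptr;

	/* runs once, in whichever process attaches the segment first */
	memset(shmem, 0, sizeof(AutoVacuumShmemStruct));
	/* ... initialize worker lists etc., as AutoVacuumShmemInit() does ... */
}

static AutoVacuumShmemStruct *
GetAutoVacuumShmem(void)
{
	bool		found;

	/* attach, initializing on first use, instead of reserving the memory
	 * unconditionally at postmaster startup */
	return (AutoVacuumShmemStruct *)
		GetNamedDSMSegment("AutoVacuumData", sizeof(AutoVacuumShmemStruct),
						   autovac_shmem_init_cb, &found);
}
```
Note that GetNamedDSMSegment() always returns the same region for a given name, so this sketch does not cover growing the allocation later; that limitation comes up again downthread.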
Re: Injection points: some tools to wait and wake
On Mon, 4 Mar 2024 at 06:27, Michael Paquier wrote: > I have mentioned that on a separate thread, Yeah, I didn't read all emails related to this feature > Perhaps we could consider that as an exception in "contrib", or have a > separate path for test modules we're OK to install (the calls had > better be superuser-only if we do that). Yeah, it makes sense that you'd want to backport fixes/changes to this. As long as you put a disclaimer in the docs that you can do that for this module, I think it would be fine. Our tests fairly regularly break anyway when changing minor versions of postgres in our CI, e.g. due to improvements in the output of isolationtester. So if changes to this module require some changes that's fine by me. Seems much nicer than having to copy-paste the code.
Re: Synchronizing slots from primary to standby
On Mon, Mar 4, 2024 at 9:35 AM Peter Smith wrote: > > OK, if the code will remain as-is wouldn't it be better to anyway > change the function name to indicate what it really does? > > e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys > This seems too long. I would prefer the current name NeedToWaitForWal as waiting for WAL means waiting to flush the WAL and waiting to replicate it to standby. On similar lines, the variable name standby_slot_oldest_flush_lsn looks too long. How about ss_oldest_flush_lsn (where ss indicates standy_slots)? Apart from this, I have made minor modifications in the attached. -- With Regards, Amit Kapila. diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index aa477015bd..1530e7720c 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -2390,8 +2390,8 @@ validate_standby_slots(char **newval) { /* * We cannot validate the replication slot if the replication slots' -* data has not been initialized. This is ok as we will validate the -* specified slot when waiting for them to catch up. See +* data has not been initialized. This is ok as we will anyway validate +* the specified slot when waiting for them to catch up. See * StandbySlotsHaveCaughtup for details. */ } @@ -2473,8 +2473,7 @@ assign_standby_slot_names(const char *newval, void *extra) char *standby_slot_names_cpy = extra; /* -* The standby slots may have changed, so we need to recompute the oldest -* LSN. +* The standby slots may have changed, so we must recompute the oldest LSN. */ standby_slot_oldest_flush_lsn = InvalidXLogRecPtr; @@ -2664,6 +2663,7 @@ StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel) if (caught_up_slot_num != list_length(standby_slot_names_list)) return false; + /* The standby_slot_oldest_flush_lsn must not retreat. */ Assert(XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) || min_restart_lsn >= standby_slot_oldest_flush_lsn); diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index b71408d533..580f9dabd3 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1805,10 +1805,10 @@ NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, /* * Check if the standby slots have caught up to the flushed position. It -* is good to wait up to flushed position and then let it send the changes -* to logical subscribers one by one which are already covered in flushed -* position without needing to wait on every change for standby -* confirmation. +* is good to wait up to the flushed position and then let the WalSender +* send the changes to logical subscribers one by one which are already +* covered by the flushed position without needing to wait on every change +* for standby confirmation. */ if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) return true;
Re: date_trunc function in interval version
Tomas Vondra wrote on 18.02.2024 01:29: Hi, Please don't too-post on this list. The custom is to bottom-post or reply inline, and it's much easier to follow such replies. On 12/23/23 23:45, Przemysław Sztoch wrote: date_bin has big problem with DST. In example, if you put origin in winter zone, then generated bin will be incorrect for summer input date. date_trunc is resistant for this problem. My version of date_trunc is additionally more flexible, you can select more granular interval, 12h, 8h, 6h, 15min, 10 min etc... I'm not very familiar with date_bin(), but is this issue inherent or could we maybe fix date_bin() to handle DST better? Apparently the functionality is identical to date_bin. When I saw date_bin in the documentation, I thought it solved all my problems. Unfortunately, DST problems have many corner cases. I tried to change date_bin several times, but unfortunately in some cases it would start working differently than before. In any case, the patch needs to add the new stuff to the SGML docs (to doc/src/sgml/func.sgml), which now documents the date_trunc(text,...) variant only. Updated. -- Przemysław Sztoch | Mobile +48 509 99 00 66 diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e5fa82c161..95cdfab2d0 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9472,6 +9472,23 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); + + + + date_trunc + + date_trunc ( interval, timestamp with time zone ) + timestamp with time zone + + + Truncate to specified precision in the specified time zone. Interval has to be a divisor of a day, week or century. + + + date_trunc('30 minutes'::interval, timestamp '2001-02-16 20:38:40+00') + 2001-02-16 20:30:00+00 + + + date_trunc ( text, timestamp with time zone, text ) @@ -9487,6 +9504,24 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); + + + date_trunc ( interval, timestamp with time zone, text ) + timestamp with time zone + + + Truncate to specified precision in the specified time zone. Interval has to be a divisor of a day, week or century. + + + date_trunc('3 hour'::interval, timestamptz '2001-02-16 21:38:40+00', 'Europe/Warsaw') + 2001-02-16 20:00:00+00 + + + date_trunc('15 minutes'::interval, timestamptz '2001-02-16 21:38:40+00', 'Europe/Warsaw') + 2001-02-16 21:30:00+00 + + + date_trunc ( text, interval ) diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index 7a016a6923..e376968c49 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -4999,6 +4999,177 @@ timestamptz_trunc_zone(PG_FUNCTION_ARGS) PG_RETURN_TIMESTAMPTZ(result); } +/* + * Common code for timestamptz_trunc_int() and timestamptz_trunc_int_zone(). + * + * tzp identifies the zone to truncate with respect to. We assume + * infinite timestamps have already been rejected. 
+ */ +static TimestampTz +timestamptz_trunc_int_internal(Interval *interval, TimestampTz timestamp, pg_tz *tzp) +{ + TimestampTz result; + int tz; + int interval_parts = 0; + boolbad_interval = false; + boolredotz = false; + fsec_t fsec; + struct pg_tm tt, + *tm = &tt; + + if (interval->month != 0) + { + interval_parts++; + /* 1200 = hundred years */ + if ((1200/interval->month) * interval->month != 1200) + bad_interval = true; + } + if (interval->day != 0) + { + interval_parts++; + if (interval->day != 1 && interval->day != 7) + bad_interval = true; + } + if (interval->time != 0) + { + interval_parts++; + if (interval->time > USECS_PER_SEC) + { + if ((interval->time % USECS_PER_SEC) != 0) + bad_interval = true; + if ((USECS_PER_DAY/interval->time) * interval->time != USECS_PER_DAY) + bad_interval = true; + } + else if (interval->time < USECS_PER_SEC && (USECS_PER_SEC/interval->time) * interval->time != USECS_PER_SEC) + bad_interval = true; + } + if (interval_parts != 1 || bad_interval) + { + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("interval has to be a divisor of a day, week or century.")))
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > Hearty congratulations. Well deserved. -- Best Wishes, Ashutosh Bapat
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > > Thank you and congratulations to all! > Congratulations to all! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
Hi, On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote: > Here is the V104 patch which addressed above and Peter's comments. Thanks! A few more random comments: 1 === +The function may be blocked if the specified slot is a failover enabled s/blocked/waiting/ ? 2 === +* specified slot when waiting for them to catch up. See +* StandbySlotsHaveCaughtup for details. s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ? 3 === + /* Now verify if the specified slots really exist and have correct type */ remove "really"? 4 === + /* +* Don't need to wait for the standbys to catch up if there is no value in +* standby_slot_names. +*/ + if (standby_slot_names_list == NIL) + return true; + + /* +* Don't need to wait for the standbys to catch up if we are on a standby +* server, since we do not support syncing slots to cascading standbys. +*/ + if (RecoveryInProgress()) + return true; + + /* +* Don't need to wait for the standbys to catch up if they are already +* beyond the specified WAL location. +*/ + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && + standby_slot_oldest_flush_lsn >= wait_for_lsn) + return true; What about using OR conditions instead? 5 === +static bool +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, +uint32 *wait_event) Not a big deal but does it need to return a bool? (I mean it all depends of the *wait_event value). Is it for better code readability in the caller? 6 === +static bool +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, +uint32 *wait_event) Same questions as for NeedToWaitForStandby(). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > > Thank you and congratulations to all! +1. Congratulations to all! -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Mar 4, 2024 at 4:52 PM Bertrand Drouvot wrote: > > On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote: > > Here is the V104 patch which addressed above and Peter's comments. > > Thanks! > > > 4 === > > + /* > +* Don't need to wait for the standbys to catch up if there is no > value in > +* standby_slot_names. > +*/ > + if (standby_slot_names_list == NIL) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if we are on a > standby > +* server, since we do not support syncing slots to cascading > standbys. > +*/ > + if (RecoveryInProgress()) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if they are already > +* beyond the specified WAL location. > +*/ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) > + return true; > > What about using OR conditions instead? > I think we could use OR conditions, but the code seems easier to follow this way; it's just a matter of personal choice. > 5 === > > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > +uint32 *wait_event) > > Not a big deal but does it need to return a bool? (I mean it all depends of > the *wait_event value). Is it for better code readability in the caller? > Yes, I think so. Adding checks based on wait_events would look a bit awkward. -- With Regards, Amit Kapila.
Re: a wrong index choose when statistics is out of date
David Rowley writes: > On Sun, 3 Mar 2024 at 20:08, Andy Fan wrote: >> The issue can be reproduced with the following steps: >> >> create table x_events (.., created_at timestamp, a int, b int); >> >> create index idx_1 on t(created_at, a); >> create index idx_2 on t(created_at, b); >> >> query: >> select * from t where create_at = current_timestamp and b = 1; >> >> index (created_at, a) rather than (created_at, b) may be chosen for the >> above query if the statistics think "create_at = current_timestamp" has >> no rows; then both indexes look OK, but actually that is true only because >> the statistics are out of date. > > I don't think there's really anything too special about the fact that > the created_at column is always increasing. We commonly get 1-row > estimates after multiplying the selectivities from individual stats. > Your example just seems like yet another reason that this could > happen. You are right that there are more cases which can lead to this. However, this is the only case where the "created_at = $1" trick works, which was the problem I wanted to resolve when I was writing. > I've been periodically talking about introducing "risk" as a factor > that the planner should consider. I did provide some detail in [1] > about the design that was in my head at that time. I'd not previously > thought that it could also solve this problem, but after reading your > email, I think it can. Haha, I remember you were against the "risk factor" before at [1]; at that time we were talking about exactly the same topic as here, and I proposed another risk factor. Without an agreement, I implemented it in my own internal version and got burned by it, in that I didn't pay enough attention to Bitmap Index Scan vs. Index Scan. Then I forgot about the "risk factor". > > I don't think it would be right to fudge the costs in any way, but I > think the risk factor for IndexPaths could take into account the > number of unmatched index clauses and increment the risk factor, or > "certainty_factor" as it is currently in my brain-based design. That > way add_path() would be more likely to prefer the index that matches > the most conditions. Is this somewhat similar to my proposal at [1]? What do you think about treating 'col op const' as 'col op $1' for the marked column? This would resolve only a subset of the questions in your mind, but the method seems to have a solid rationale. Currently I treat the risk factor the way you did before, but this may be another occasion for me to change my mind again. [1] https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com -- Best Regards Andy Fan
Re: a wrong index choose when statistics is out of date
Andrei Lepikhov writes: > On 3/3/2024 14:01, Andy Fan wrote: >> 1. We can let the user define the column as the value is increased day by >> day. the syntax may be: >> ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED. >> then when a query like 'create_at op const', the statistics module >> can >> treat it as 'created_at = $1'. so the missing statistics doesn't make >> difference. Then I think the above issue can be avoided. > Let me write some words to support your efforts in that way. > I also have some user cases where they periodically insert data in large > chunks. These chunks contain 'always increased' values, and it causes > trouble each time they start an analytic query over this new data before > the analyze command. > I have thought about that issue before but invented nothing special > except a more aggressive analysis of such tables. I have to say we ran into exactly the same situation and used the same trick to solve the problem, and we know that no matter how aggressive the analysis is, the problem may still happen. > Your trick can work, but it needs a new parameter in pg_type and a lot > of additional code for such a rare case. > I'm looking forward to the demo patch. Maybe my word "auto_increased" sounds too much like a type, but what I actually want is to add a new attribute to pg_attribute which is tied to one column in one relation. When we compute a selectivity for this *column*, we apply the trick. This probably doesn't need much code. -- Best Regards Andy Fan
Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)
On Mon, 4 Mar 2024 at 03:18, Andrey M. Borodin wrote: > > > > On 14 Jan 2024, at 18:55, John Naylor wrote: > > > > On Sat, Jan 13, 2024 at 9:36 PM Ranier Vilela > wrote: > >> > >> On Tue, 9 Jan 2024 at 06:31, John Naylor < johncnaylo...@gmail.com> wrote: > > > >>> This just moves an operation from one place to the other, while > >>> obliterating the explanatory comment, so I don't see an advantage. > >> > >> Well, I think that is precisely the case for using memset. > >> The way initialization is currently done is much slower and harmful to branch prediction. > >> Of course, the gain should be small, but it is fully justified for switching to memset. > > > > We haven't seen any evidence or reasoning for that. Simple > > rules-of-thumb are not enough. > > > > Hi Ranier, > > I’ll mark CF entry [0] as “Returned with Feedback”. Feel free to reopen the > item in this CF or submit it to the next, if you want to continue working on > this. > > I took a glance into the patch, and I would agree that setting fields to > nonzero values with memset() is somewhat unusual. Please provide stronger > evidence to do so. I counted the calls with non-zero memset in the entire Postgres code, and there are about 183 such calls. At least 4 of them use -1:
File contrib\pg_trgm\trgm_op.c: memset(lastpos, -1, sizeof(int) * len);
File src\backend\executor\execPartition.c: memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
File src\backend\partitioning\partprune.c: memset(subplan_map, -1, nparts * sizeof(int)); memset(subpart_map, -1, nparts * sizeof(int));
Does filling a memory area one by one, with branches, really need strong evidence to show it is better than filling the memory area all at once, without branches? best regards, Ranier Vilela
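[Editor's note: for readers following along, the reason -1 is one of the few non-zero values that is safe to memset into an int array is purely byte-level. This small standalone example (not from the patch) shows the effect:]
```
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int		a[4];

	/* memset writes the byte 0xFF into every byte; an int whose bytes are
	 * all 0xFF is -1 on two's-complement machines, so this fill is safe */
	memset(a, -1, sizeof(a));
	printf("%d\n", a[0]);		/* prints -1 */

	/* this does NOT generalize to other values: filling with the byte
	 * 0x01 yields 0x01010101 per int, which is 16843009, not 1 */
	memset(a, 1, sizeof(a));
	printf("%d\n", a[0]);		/* prints 16843009 */

	return 0;
}
```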
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada wrote: > > On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote: > > Right, I should have said "reset". Resetting a context will delete > > its children as well, and it seems like it should work to reset the tree > > context, and we don't have to know whether that context actually > > contains leaves at all. That should allow copying "tree context" to > > "leaf context" in the case where we have no special context for > > leaves. > > Resetting the tree->context seems to work. But I think we should note > for callers that the dsa_area passed to RT_CREATE should be created in > a different context than the context passed to RT_CREATE because > otherwise RT_FREE() will also free the dsa_area. For example, the > following code in test_radixtree.c will no longer work: > > dsa = dsa_create(tranche_id); > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id); > : > rt_free(radixtree); > dsa_detach(dsa); // dsa is already freed. > > So I think that the practical usage of the radix tree will be that the > caller creates a memory context for a radix tree and passes it to > RT_CREATE(). That sounds workable to me. > I've attached an updated patch set: > > - 0008 updates RT_FREE_RECURSE(). Thanks! > - 0009 is an updated version of the radix tree memory-handling cleanup. Looks pretty good, as does the rest. I'm going through again, squashing and making tiny adjustments to the template. The only thing not done is changing the test with many values to resemble the perf test more. I wrote: > > Secondly, I thought about my recent work to skip checking if we first > > need to create a root node, and that has a harmless (for vacuum at > > least) but slightly untidy behavior: When RT_SET is first called, and > > the key is bigger than 255, new nodes will go on top of the root node. > > These have chunk '0'. If all subsequent keys are big enough, the > > original root node will stay empty. If all keys are deleted, there will > > be a chain of empty nodes remaining. Again, I believe this is > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > on this, but likely not today. > > This turns out to be a lot trickier than it looked, so it seems best > to allow a trivial amount of waste, as long as it's documented > somewhere. It also wouldn't be terrible to re-add those branches, > since they're highly predictable. I put a little more work into this and got it working; it just needs a small amount of finicky coding. I'll share tomorrow. I have a question about RT_FREE_RECURSE: + check_stack_depth(); + CHECK_FOR_INTERRUPTS(); I'm not sure why these are here: the first seems overly paranoid, although harmless, but the second is probably a bad idea. Why should the user be able to interrupt the freeing of memory? Also, I'm not quite happy that RT_ITER has a copy of a pointer to the tree, leading to coding like "iter->tree->ctl->root". I *think* it would be easier to read if the tree was a parameter to these iteration functions. That would require an API change, so the tests/tidstore would have some churn. I can do that, but before trying I wanted to see what you think -- is there some reason to keep the current way?
Re: Eager aggregation, take 3
Richard Guo writes: > Hi All, > > Eager aggregation is a query optimization technique that partially > pushes a group-by past a join, and finalizes it once all the relations > are joined. Eager aggregation reduces the number of input rows to the > join and thus may result in a better overall plan. This technique is > thoroughly described in the 'Eager Aggregation and Lazy Aggregation' > paper [1]. This is a really helpful and far-from-easy task. Even though I am not sure when I can find time to study this, I want to say "Thanks for working on this!" first, and I hope we can make real progress on this topic. Good luck! -- Best Regards Andy Fan
Re: CF entries for 17 to be reviewed
> On 4 Mar 2024, at 13:42, Andrey M. Borodin wrote: > > Here’s my take on “Miscellaneous” section. I’ve read the other small sections.
Monitoring & Control
* Logging parallel worker draught A patchset improving the logging of resource starvation when parallel workers are spawned. Some major refactoring was proposed. I've switched it to WoA.
* System username in pg_stat_activity There's an active discussion on whether to extend pg_stat_activity.
Testing
* Add basic tests for the low-level backup method Michael Paquier provided feedback, so I switched it to WoA.
System Administration
* recovery modules There was a very active discussion, but it stalled after April 2023, and only some rebases have happened since. Maybe a fresh look could revive the thread.
* Possibility to disable `ALTER SYSTEM` The discussion seems active, but inconclusive.
* Add checkpoint/redo LSNs to recovery errors. Michael Paquier provided feedback, so I switched it to WoA.
Security
* add not_before and not_after timestamps to sslinfo extension and pg_stat_ssl The most recent version was provided by Daniel Gustafsson, but the thread stalled in September 2023. Maybe Jacob or some other reviewer could refresh it; IMO this might land in 17.
Thanks! Best regards, Andrey Borodin.
Re: Some shared memory chunks are allocated even if related processes won't start
On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote: > However, the second idea is still valid, which allows the allocation > of shared memory dynamically. This is a bit efficient for the system > which tuples won't be frozen. Thought? I think it would be worth allocating AutoVacuumShmem->av_workItems using dynamic shmem allocation, particularly to prevent workitems from being discarded just because the array is full¹; but other than that, the struct is just 64 bytes long so I doubt it's useful to allocate it dynamically. ¹ I mean, if the array is full, just allocate another array, point to it from the original one, and keep going. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "The problem with the facetime model is not just that it's demoralizing, but that the people pretending to work interrupt the ones actually working." -- Paul Graham, http://www.paulgraham.com/opensource.html
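[Editor's note: a hypothetical sketch of the chained-array idea in the footnote; the struct layout, field names, and use of a DSA pointer are illustrative assumptions, not a worked-out design.]
```
/*
 * Hypothetical layout for the "allocate another array" idea: each block
 * of work items links to the next block, allocated on demand from
 * dynamic shared memory when the current block fills up.
 */
typedef struct AutoVacuumWorkItems
{
	int			avs_usedItems;	/* slots used in this block */
	dsa_pointer avs_next;		/* next block, or InvalidDsaPointer */
	AutoVacuumWorkItem avs_workItems[NUM_WORKITEMS];
} AutoVacuumWorkItems;
```
A requester that finds every slot in a block occupied would follow avs_next, allocating a fresh block when it reaches the end of the chain, instead of silently discarding the work item.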
Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery
Hi hackers, I wanted to hook into the EXPLAIN output for queries and add some extra information, but since there is no standard_ExplainOneQuery() I had to copy the code and create my own version. Since the pattern with other hooks for a function WhateverFunction() seems to be that there is a standard_WhateverFunction() for each WhateverFunction_hook, I created a patch to follow this pattern for your consideration. I was also considering adding a callback so that you can annotate any node with explanatory information that is not a custom scan node. This could be used to propagate and summarize information from custom scan nodes, but I had no immediate use for that so did not add it here. I would still be interested in hearing if you think this is something that would be useful to the community. Best wishes, Mats Kindahl, Timescale From eaab4d7c088ff3ee9b0e6ec3de96677bafd184c0 Mon Sep 17 00:00:00 2001 From: Mats Kindahl Date: Mon, 4 Mar 2024 12:38:05 +0100 Subject: Improve support for ExplainOneQuery hook There is a hook ExplainOneQuery_hook, but there is no corresponding standard_ExplainOneQuery and the code that belongs there is written in-line in ExplainOneQuery(). As a result, anybody adding a hook for ExplainOneQuery_hook has to copy the code from ExplainOneQuery() to run the standard ExplainOneQuery before adding their own messages. This commit fixes this by refactoring ExplainOneQuery() and factor out the standard code into standard_ExplainOneQuery() and call it from ExplainOneQuery() in a similar manner to how it is done for other hooks. --- src/backend/commands/explain.c | 106 ++--- src/include/commands/explain.h | 4 ++ 2 files changed, 61 insertions(+), 49 deletions(-) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 78754bc6ba..3b7bed3ca2 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -391,63 +391,71 @@ ExplainOneQuery(Query *query, int cursorOptions, /* if an advisor plugin is present, let it manage things */ if (ExplainOneQuery_hook) - (*ExplainOneQuery_hook) (query, cursorOptions, into, es, - queryString, params, queryEnv); + (*ExplainOneQuery_hook)(query, cursorOptions, into, es, + queryString, params, queryEnv); else - { - PlannedStmt *plan; - instr_time planstart, - planduration; - BufferUsage bufusage_start, - bufusage; - MemoryContextCounters mem_counters; - MemoryContext planner_ctx = NULL; - MemoryContext saved_ctx = NULL; - - if (es->memory) - { - /* - * Create a new memory context to measure planner's memory - * consumption accurately. Note that if the planner were to be - * modified to use a different memory context type, here we would - * be changing that to AllocSet, which might be undesirable. - * However, we don't have a way to create a context of the same - * type as another, so we pray and hope that this is OK. 
- */ - planner_ctx = AllocSetContextCreate(CurrentMemoryContext, -"explain analyze planner context", -ALLOCSET_DEFAULT_SIZES); - saved_ctx = MemoryContextSwitchTo(planner_ctx); - } + standard_ExplainOneQuery(query, cursorOptions, into, es, + queryString, params, queryEnv); +} - if (es->buffers) - bufusage_start = pgBufferUsage; - INSTR_TIME_SET_CURRENT(planstart); +void +standard_ExplainOneQuery(Query *query, int cursorOptions, + IntoClause *into, ExplainState *es, + const char *queryString, ParamListInfo params, + QueryEnvironment *queryEnv) +{ + PlannedStmt *plan; + instr_time planstart, +planduration; + BufferUsage bufusage_start, +bufusage; + MemoryContextCounters mem_counters; + MemoryContext planner_ctx = NULL; + MemoryContext saved_ctx = NULL; + + if (es->memory) + { + /* + * Create a new memory context to measure planner's memory consumption + * accurately. Note that if the planner were to be modified to use a + * different memory context type, here we would be changing that to + * AllocSet, which might be undesirable. However, we don't have a way + * to create a context of the same type as another, so we pray and + * hope that this is OK. + */ + planner_ctx = AllocSetContextCreate(CurrentMemoryContext, + "explain analyze planner context", + ALLOCSET_DEFAULT_SIZES); + saved_ctx = MemoryContextSwitchTo(planner_ctx); + } - /* plan the query */ - plan = pg_plan_query(query, queryString, cursorOptions, params); + if (es->buffers) + bufusage_start = pgBufferUsage; + INSTR_TIME_SET_CURRENT(planstart); - INSTR_TIME_SET_CURRENT(planduration); - INSTR_TIME_SUBTRACT(planduration, planstart); + /* plan the query */ + plan = pg_plan_query(query, queryString, cursorOptions, params); - if (es->memory) - { - MemoryContextSwitchTo(saved_ctx); - MemoryContextMemConsumed(planner_ctx, &mem_counters); - } + INSTR_TIME_SET_CURRENT(planduration); + INSTR_TIME_SUBTRACT(planduration, plans
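[Editor's note: to illustrate the usage pattern the patch enables, here is a minimal, hypothetical extension hook that calls the new standard_ExplainOneQuery() and appends one extra property. The hook body and label are illustrative; a real extension should also save and chain any previously installed hook.]
```
/* Sketch of an extension using the refactored hook; assumes the patch
 * above is applied. */
#include "postgres.h"
#include "fmgr.h"
#include "commands/explain.h"

PG_MODULE_MAGIC;

static void
my_explain_one_query(Query *query, int cursorOptions, IntoClause *into,
					 ExplainState *es, const char *queryString,
					 ParamListInfo params, QueryEnvironment *queryEnv)
{
	/* run the standard EXPLAIN machinery first... */
	standard_ExplainOneQuery(query, cursorOptions, into, es,
							 queryString, params, queryEnv);

	/* ...then append extra information without copying any core code */
	ExplainPropertyText("Extension Note", "extra information here", es);
}

void
_PG_init(void)
{
	ExplainOneQuery_hook = my_explain_one_query;
}
```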
Re: Shared detoast Datum proposal
On 3/4/24 02:29, Andy Fan wrote: > > Tomas Vondra writes: > >> On 3/3/24 07:10, Andy Fan wrote: >>> >>> Hi, >>> >>> Here is a updated version, the main changes are: >>> >>> 1. an shared_detoast_datum.org file which shows the latest desgin and >>> pending items during discussion. >>> 2. I removed the slot->pre_detoast_attrs totally. >>> 3. handle some pg_detoast_datum_slice use case. >>> 4. Some implementation improvement. >>> >> >> I only very briefly skimmed the patch, and I guess most of my earlier >> comments still apply. > > Yes, the overall design is not changed. > >> But I'm a bit surprised the patch needs to pass a >> MemoryContext to so many places as a function argument - that seems to >> go against how we work with memory contexts. Doubly so because it seems >> to only ever pass CurrentMemoryContext anyway. So what's that about? > > I think you are talking about the argument like this: > > /* -- > - * detoast_attr - > + * detoast_attr_ext - > * > * Public entry point to get back a toasted value from compression > * or external storage. The result is always non-extended varlena form. > * > + * ctx: The memory context which the final value belongs to. > + * > * Note some callers assume that if the input is an EXTERNAL or COMPRESSED > * datum, the result will be a pfree'able chunk. > * -- > */ > > +extern struct varlena * > +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) > > This is mainly because 'detoast_attr' will apply more memory than the > "final detoast datum" , for example the code to scan toast relation > needs some memory as well, and what I want is just keeping the memory > for the final detoast datum so that other memory can be released sooner, > so I added the function argument for that. > What exactly does detoast_attr allocate in order to scan toast relation? Does that happen in master, or just with the patch? If with master, I suggest to ignore that / treat that as a separate issue and leave it for a different patch. In any case, the custom is to allocate results in the context that is set in CurrentMemoryContext at the moment of the call, and if there's substantial amount of allocations that we want to free soon, we either do that by explicit pfree() calls, or create a small temporary context in the function (but that's more expensive). I don't think we should invent a new convention where the context is passed as an argument, unless absolutely necessary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
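[Editor's note: a minimal sketch of the convention Tomas describes, where the caller, not the callee, decides where the result lives by switching CurrentMemoryContext around the call. The detoast_into() helper and result_context are illustrative, not existing code.]
```
#include "postgres.h"
#include "access/detoast.h"

/* Illustrative helper: detoast a value into a caller-chosen context
 * without changing detoast_attr()'s signature. */
static struct varlena *
detoast_into(struct varlena *attr, MemoryContext result_context)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(result_context);

	/* for EXTERNAL/COMPRESSED inputs the result is palloc'd, and it now
	 * lands in result_context */
	struct varlena *result = detoast_attr(attr);

	MemoryContextSwitchTo(oldcxt);
	return result;
}
```
Under this convention, short-lived scratch allocations are handled inside the callee, with explicit pfree() calls or a small temporary context, as described above.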
Re: Commitfest Manager for March
Hi, > >> The call for a CFM volunteer is still open. > > > > I always wanted to try. And most of the stuff I'm interested in is already > > committed. > > > > But given importance of last commitfest before feature freeze, we might be > > interested in more experienced CFM. > > If I can do something useful - I'm up for it. > > I'm absolutely convinced you have more than enough experience with postgres > hacking to do an excellent job. I'm happy to give a hand as well. > > Thanks for volunteering! > > (someone from pginfra will give you the required admin permissions on the CF > app) Thanks for volunteering, Andrey! If you need any help please let me know. -- Best regards, Aleksander Alekseev
Re: PostgreSQL Contributors Updates
> > All, > > > > The PostgreSQL Contributor Page > > (https://www.postgresql.org/community/contributors/) includes people who > > have made substantial, long-term contributions of time and effort to the > > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > > following people for their contributions. > > > > New PostgreSQL Contributors: > > > > * Bertrand Drouvot > > * Gabriele Bartolini > > * Richard Guo > > > > New PostgreSQL Major Contributors: > > > > * Alexander Lakhin > > * Daniel Gustafsson > > * Dean Rasheed > > * John Naylor > > * Melanie Plageman > > * Nathan Bossart > > > > Thank you and congratulations to all! > > +1. Congratulations to all! Congratulations to all! -- Best regards, Aleksander Alekseev
Re: Shared detoast Datum proposal
Tomas Vondra writes: >>> But I'm a bit surprised the patch needs to pass a >>> MemoryContext to so many places as a function argument - that seems to >>> go against how we work with memory contexts. Doubly so because it seems >>> to only ever pass CurrentMemoryContext anyway. So what's that about? >> >> I think you are talking about the argument like this: >> >> /* -- >> - * detoast_attr - >> + * detoast_attr_ext - >> * >> * Public entry point to get back a toasted value from compression >> * or external storage. The result is always non-extended varlena form. >> * >> + * ctx: The memory context which the final value belongs to. >> + * >> * Note some callers assume that if the input is an EXTERNAL or COMPRESSED >> * datum, the result will be a pfree'able chunk. >> * -- >> */ >> >> +extern struct varlena * >> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx) >> >> This is mainly because 'detoast_attr' will apply more memory than the >> "final detoast datum" , for example the code to scan toast relation >> needs some memory as well, and what I want is just keeping the memory >> for the final detoast datum so that other memory can be released sooner, >> so I added the function argument for that. >> > > What exactly does detoast_attr allocate in order to scan toast relation? > Does that happen in master, or just with the patch? It is in the current master; for example, the TupleTableSlot creation needed to scan the toast relation requires a memory allocation. > If with master, I > suggest to ignore that / treat that as a separate issue and leave it for > a different patch. OK, I can make it a separate commit in the next version. > In any case, the custom is to allocate results in the context that is > set in CurrentMemoryContext at the moment of the call, and if there's > substantial amount of allocations that we want to free soon, we either > do that by explicit pfree() calls, or create a small temporary context > in the function (but that's more expensive). > > I don't think we should invent a new convention where the context is > passed as an argument, unless absolutely necessary. Hmm, in this case, if we don't add the new argument, we have to get the detoast datum in Context-1 and copy it to the desired memory context, which is exactly what I want to avoid. I think we have the same decision to make on the TOAST cache method as well. -- Best Regards Andy Fan
Re: Extract numeric filed in JSONB more effectively
On 09.02.24 10:05, Andy Fan wrote: 2. Where has the current feature been blocked for the past few months? It's an error-message compatibility issue! Continuing with the above setup:
master:
select * from tb where (a->'b')::numeric > 3::numeric;
ERROR: cannot cast jsonb string to type numeric
select * from tb where (a->'b')::int4 > 3::numeric;
ERROR: cannot cast jsonb string to type integer
You can see that the error messages differ (numeric vs. integer). Patched: we can still get the same error message as master, BUT the code looks odd.
select * from tb where (a->'b')::int4 > 3;
QUERY PLAN
---
Seq Scan on public.tb
  Output: a
  Filter: ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 'b'::text), '23'::oid))::integer > 3)
(3 rows)
You can see "jsonb_finish_numeric(.., '23'::oid)"; the '23'::oid is there only to produce *"integer"* in the error message: "cannot cast jsonb string to type *integer*". Now the situation is that either we use the odd argument ('23'::oid) in jsonb_finish_numeric, or we emit an error message incompatible with the previous version. I'm not sure which way is better, but this is where the current feature is blocked. I'm not bothered by that. It also happens on occasion in the backend C code that we pass around extra information to be able to construct better error messages. The functions here are not backend C code, but they are internal functions, so similar considerations can apply. But I have a different question about this patch set. This has some overlap with the JSON_VALUE function that is being discussed at [0][1]. For example, if I apply the patch v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run select count(*) from tb where json_value(a, '$.a' returning numeric) = 2; and I get a noticeable performance boost over select count(*) from tb where cast (a->'a' as numeric) = 2; So some questions to think about: 1. Compare performance of base case vs. this patch vs. json_value. 2. Can json_value be optimized further? 3. Is this patch still needed? 3a. If yes, should the internal rewriting make use of json_value or share code with it? [0]: https://www.postgresql.org/message-id/flat/CA+HiwqE4XTdfb1nW=ojoy_tqsrhyt-q_kb6i5d4xckyrlc1...@mail.gmail.com [1]: https://commitfest.postgresql.org/47/4377/
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
Hi hackers, >> I did not see any test addition for this, can we add it? > > > Ok, added it in the attached version. > > This was an experimental patch; I was looking for comments on the proposed > approach -- whether we could simply skip the throwaway NOT NULL constraint for > a deferred PK constraint. Moreover, skip that for any PK constraint. I confirm that the patch fixes the bug. All the tests pass. Looks like RfC to me. -- Best regards, Aleksander Alekseev
Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery
Hi, > I wanted to hook into the EXPLAIN output for queries and add some extra > information, but since there is no standard_ExplainOneQuery() I had to copy > the code and create my own version. > > Since the pattern with other hooks for a function WhateverFunction() seems to > be that there is a standard_WhateverFunction() for each > WhateverFunction_hook, I created a patch to follow this pattern for your > consideration. > > I was also considering adding a callback so that you can annotate any node > with explanatory information that is not a custom scan node. This could be > used to propagate and summarize information from custom scan nodes, but I had > no immediate use for that so did not add it here. I would still be interested > in hearing if you think this is something that would be useful to the > community. Thanks for the patch. LGTM. I registered the patch on the nearest open CF [1] and marked it as RfC. It is a pretty straightforward refactoring. [1]: https://commitfest.postgresql.org/48/4879/ -- Best regards, Aleksander Alekseev
Re: abi-compliance-checker
On 27.02.24 14:25, Alvaro Herrera wrote: I like this idea and I think we should integrate it with the objective of it becoming the workhorse of ABI-stability testing. However, I do not think that the subsequent patches should be part of the tree at all; certainly not the produced .xml files in your 0004, as that would be far too unstable and would cause a lot of pointless churn. Looking at this again, if we don't want the .xml files in the tree, then we don't really need this patch series. Most of the delicate work in the 0001 patch was to find the right combination of abidw options to reduce the variability in the .xml output files (--no-show-locs is an obvious example). If we don't want to record the .xml files in the tree, then we don't need all these complications. For example, if we want to check the postgres backend ABI across minor versions, we could just compile it multiple times and compare the binaries directly:
git checkout REL_16_0
meson setup build-0
meson compile -C build-0
git checkout REL_16_STABLE
meson setup build-1
meson compile -C build-1
abidiff --no-added-syms build-0/src/backend/postgres build-1/src/backend/postgres
The way I see this working is that we set up a buildfarm animal [per architecture] that runs the new rules produced by the abidw option and stores the result locally, so that for stable branches it can turn red when an ABI-breaking change relative to the previous minor release of the same branch is introduced. There's no point in it ever turning red on the master branch, since we're obviously not concerned with ABI changes there. Maybe the way forward here is to write a buildfarm module for this, and then see from there what further needs there are.
Re: type cache cleanup improvements
Hi Teodor, > I'd like to suggest two independent patches to improve the performance of type cache > cleanup. I found a case where type cache cleanup was a reason for low > performance. In short, a customer creates 100 thousand temporary tables in one > transaction. > > 1 mapRelType.patch > It just adds a local map between a relation and its type, as was suggested in the > comment above TypeCacheRelCallback(). Unfortunately, using the syscache here was > impossible because this callback can be called outside a transaction, which > makes catalog lookups impossible. > > 2 hash_seq_init_with_hash_value.patch > TypeCacheTypCallback() loops over the type hash to find the entry with a given hash > value. There are two problems here: 1) dynahash has no interface to search for an > entry with a given hash value, and 2) the hash calculation algorithm differs from > the system cache's, even though the incoming hash value comes from the system cache. > The patch addresses both issues. It adds a hash_seq_init_with_hash_value() call, > which initializes a hash sequential scan over the single bucket that could contain > an entry with the given hash value, so hash_seq_search() iterates only over such > entries. And the patch changes the hash algorithm to match the syscache. The patch > also makes a small refactoring of dynahash, introducing a common function > hash_do_lookup() which does the initial lookup in the hash. > > An artificial performance test is in the attachment; the command to test is 'time > psql < custom_types_and_array.sql'; here I show only the last rollback time and > total execution time:
> 1) master 92d2ab7554f92b841ea71bcc72eaa8ab11aae662
> Time: 33353,288 ms (00:33,353)
> psql < custom_types_and_array.sql 0,82s user 0,71s system 1% cpu 1:28,36 total
> 2) mapRelType.patch
> Time: 7455,581 ms (00:07,456)
> psql < custom_types_and_array.sql 1,39s user 1,19s system 6% cpu 41,220 total
> 3) hash_seq_init_with_hash_value.patch
> Time: 24975,886 ms (00:24,976)
> psql < custom_types_and_array.sql 1,33s user 1,25s system 3% cpu 1:19,77 total
> 4) both
> Time: 89,446 ms
> psql < custom_types_and_array.sql 0,72s user 0,52s system 10% cpu 12,137 total
These changes look very promising. Unfortunately, the proposed patches conflict with each other regardless of the order of applying: ``` error: patch failed: src/backend/utils/cache/typcache.c:356 error: src/backend/utils/cache/typcache.c: patch does not apply ``` So it's difficult to confirm case 4, not to mention the fact that we are unable to test the patches on cfbot. Could you please rebase the patches against the recent master branch (in any order) and submit the result of `git format-patch`? -- Best regards, Aleksander Alekseev
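[Editor's note: to help reviewers picture the second patch, here is a hypothetical sketch of how TypeCacheTypCallback() could use the proposed interface. The hash_seq_init_with_hash_value() signature and loop body are assumptions inferred from the description above, not taken from the actual patch.]
```
/* Hypothetical use of the proposed single-bucket scan in typcache.c. */
static void
TypeCacheTypCallback(Datum arg, int cacheid, uint32 hashvalue)
{
	HASH_SEQ_STATUS status;
	TypeCacheEntry *typentry;

	/* visit only the bucket that can hold entries with this hash value,
	 * instead of walking the whole TypeCacheHash */
	hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue);
	while ((typentry = (TypeCacheEntry *) hash_seq_search(&status)) != NULL)
	{
		/* ... invalidate the matching entry, as the existing loop does ... */
	}
}
```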
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro,

Thanks for discussing!

> I think it would be worth allocating AutoVacuumShmem->av_workItems using
> dynamic shmem allocation, particularly to prevent workitems from being
> discarded just because the array is full¹; but other than that, the
> struct is just 64 bytes long so I doubt it's useful to allocate it
> dynamically.
>
> ¹ I mean, if the array is full, just allocate another array, point to it
> from the original one, and keep going.

OK, I understand that my initial proposal is not so valuable, so I can withdraw it.

About the suggestion: you were imagining AutoVacuumRequestWork() and brininsert(), right? I agree it sounds good, but I don't think it can be implemented with the current interface. The interface for dynamically allocating shared memory is GetNamedDSMSegment(), and it returns the same shared memory region if the input names are the same. Therefore, there is no way to re-allocate the shared memory.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Replication conflicts not processed in ClientWrite
When a backend is blocked on writing data (such as with a network error or a very slow client), indicated with wait event ClientWrite, it appears to not properly notice that it's overrunning max_standby_streaming_delay, and therefore does not cancel the transaction on the backend.

I've reproduced this repeatedly on Ubuntu 20.04 with PostgreSQL 15 out of the Debian packages. Curiously enough, if I install the debug symbols and restart in order to get a backtrace, it starts processing the cancellation again and I can no longer reproduce it. So it sounds like some timing issue around it.

My simple test was, with session 1 on the standby and session 2 on the primary:

Session 1: begin transaction isolation level repeatable read;
Session 1: select count(*) from testtable;
Session 2: alter table testtable rename to testtable2;
Session 1: select * from testtable t1 cross join testtable t2;
kill -STOP

At this point, replication lag starts growing on the standby and it never terminates the session. If I then SIGCONT it, it will get terminated by a replication conflict. If I kill the session hard, the replication lag recovers immediately.

AFAICT if the conflict happens at ClientRead, for example, it's picked up immediately, but there's something in ClientWrite that prevents it.

My first thought would be OpenSSL, but this is reproducible both on TLS-over-TCP and on Unix sockets.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
> These are "my" animals (running at a local university). There's a couple
> of interesting details:

Hi Tomas, do you still have the failing cluster data?

Noah pointed me to this thread, and it looks a bit similar to the FSM corruption issue I'm facing: https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop

So if you still have the data, it would be nice to see if you indeed have a corrupted FSM, and if you have any indication of when it happened.

Best regards,

--
Ronan Dunklau
RE: Synchronizing slots from primary to standby
On Monday, March 4, 2024 7:22 PM Bertrand Drouvot wrote: > > Hi, > > On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote: > > Here is the V104 patch which addressed above and Peter's comments. > > Thanks! > > A few more random comments: Thanks for the comments! > > 1 === > > +The function may be blocked if the specified slot is a failover > + enabled > > s/blocked/waiting/ ? Changed. > > 2 === > > +* specified slot when waiting for them to catch up. See > +* StandbySlotsHaveCaughtup for details. > > s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ? Changed. > > 3 === > > + /* Now verify if the specified slots really exist and have > + correct type */ > > remove "really"? Changed. > > 4 === > > + /* > +* Don't need to wait for the standbys to catch up if there is no > value in > +* standby_slot_names. > +*/ > + if (standby_slot_names_list == NIL) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if we are on a > standby > +* server, since we do not support syncing slots to cascading > standbys. > +*/ > + if (RecoveryInProgress()) > + return true; > + > + /* > +* Don't need to wait for the standbys to catch up if they are already > +* beyond the specified WAL location. > +*/ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) > + return true; > > What about using OR conditions instead? > > 5 === > > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > +uint32 *wait_event) > > Not a big deal but does it need to return a bool? (I mean it all depends of > the > *wait_event value). Is it for better code readability in the caller? > > 6 === > > +static bool > +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > +uint32 *wait_event) > > Same questions as for NeedToWaitForStandby(). I also feel the current style looks a bit cleaner, so didn’t change these. Best Regards, Hou zj
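For illustration, here is the OR-combined form of the three early-exit checks from the quoted hunk, as Bertrand's fourth comment suggests; this is just a sketch of the suggestion, not the committed code:

```c
/*
 * Sketch only: the three early-exit checks collapsed into one OR
 * condition, using the variables from the quoted hunk.
 */
if (standby_slot_names_list == NIL ||
	RecoveryInProgress() ||
	(!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
	 standby_slot_oldest_flush_lsn >= wait_for_lsn))
	return true;
```

The trade-off is that the three per-check comments in the original would have to be merged or dropped, which may be why the separate statements were kept.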
RE: Synchronizing slots from primary to standby
On Monday, March 4, 2024 9:55 AM Peter Smith wrote: > > On Sun, Mar 3, 2024 at 6:51 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Sunday, March 3, 2024 7:47 AM Peter Smith > wrote: > > > > > > > 3. > > > + > > > + > > > + Value * is not accepted as it is > > > inappropriate to > > > + block logical replication for physical slots that either lack > > > + associated standbys or have standbys associated that are > > > + not > > > enabled > > > + for replication slot synchronization. (see > > > + > > linkend="logicaldecoding-replication-slots-synchronization"/>). > > > + > > > + > > > > > > Why does the document need to provide an excuse/reason for the rule? > > > You could just say something like: > > > > > > SUGGESTION > > > The slots must be named explicitly. For example, specifying wildcard > > > values like * is not permitted. > > > > As suggested by Amit, I moved this to code comments. > > Was the total removal of this note deliberate? I only suggested removing the > *reason* for the rule, not the entire rule. Otherwise, the user won't know to > avoid doing this until they try it and get an error. OK, Added. > > > > > > > > 9. NeedToWaitForWal > > > > > > + /* > > > + * Check if the standby slots have caught up to the flushed position. > > > + It > > > + * is good to wait up to flushed position and then let it send the > > > + changes > > > + * to logical subscribers one by one which are already covered in > > > + flushed > > > + * position without needing to wait on every change for standby > > > + * confirmation. Note that after receiving the shutdown signal, an > > > + ERROR > > > + * is reported if any slots are dropped, invalidated, or inactive. > > > + This > > > + * measure is taken to prevent the walsender from waiting indefinitely. > > > + */ > > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > > + return true; > > > > > > I felt it was confusing things for this function to also call to the > > > other one -- it seems an overlapping/muddling of the purpose of these. > > > I think it will be easier to understand if the calling code just > > > calls to one or both of these functions as required. > > > > Same as Amit, I didn't change this. > > AFAICT my previous review comment was misinterpreted. Please see [1] for > more details. > > > > Here are some more review comments for v104-1 Thanks! > > == > Commit message > > 1. > Additionally, The SQL functions pg_logical_slot_get_changes, > pg_logical_slot_peek_changes and pg_replication_slot_advance are modified > to wait for the replication slots specified in 'standby_slot_names' to catch > up > before returning. > > ~ > > Maybe that should be expressed using similar wording as the docs... > > SUGGESTION > Additionally, The SQL functions ... are modified. Now, when used with > failover-enabled logical slots, these functions will block until all physical > slots > specified in 'standby_slot_names' have confirmed WAL receipt. Changed. > > == > doc/src/sgml/config.sgml > > 2. > +pg_logical_slot_peek_changes, > +when used with failover enabled logical slots, will block until all > +physical slots specified in > standby_slot_names have > +confirmed WAL receipt. > > /failover enabled logical slots/failover-enabled logical slots/ Changed. Note that for this comment and remaining comments, I used the later version we agreed(logical failover slot). > > == > doc/src/sgml/func.sgml > > 3. 
> +The function may be blocked if the specified slot is a failover > enabled > +slot and linkend="guc-standby-slot-names">standby_slot_names me> > +is configured. > > > /a failover enabled slot//a failover-enabled slot/ Changed. > > ~~~ > > 4. > +slot may return to an earlier position. The function may be blocked > if > +the specified slot is a failover enabled slot and > + linkend="guc-standby-slot-names">standby_slot_names me> > +is configured. > > /a failover enabled slot//a failover-enabled slot/ Changed. > > == > src/backend/replication/slot.c > > 5. > +/* > + * Wait for physical standbys to confirm receiving the given lsn. > + * > + * Used by logical decoding SQL functions that acquired failover enabled > slot. > + * It waits for physical standbys corresponding to the physical slots > +specified > + * in the standby_slot_names GUC. > + */ > +void > +WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > > /failover enabled slot/failover-enabled slot/ Changed. > > ~~~ > > 6. > + /* > + * Don't need to wait for the standby to catch up if the current > + acquired > + * slot is not a failover enabled slot, or there is no value in > + * standby_slot_names. > + */ > > /failover enabled slot/failover-enabled slot/ Changed. > > == > src/backend/replication/slotfuncs.c > > 7. > + > + /* > + * Wake up logical wals
RE: Synchronizing slots from primary to standby
On Monday, March 4, 2024 5:52 PM Amit Kapila wrote:
>
> On Mon, Mar 4, 2024 at 9:35 AM Peter Smith wrote:
> >
> > OK, if the code will remain as-is wouldn't it be better to anyway
> > change the function name to indicate what it really does?
> >
> > e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys
> >
>
> This seems too long. I would prefer the current name NeedToWaitForWal, as
> waiting for WAL means waiting to flush the WAL and waiting to replicate it
> to the standby. On similar lines, the variable name
> standby_slot_oldest_flush_lsn looks too long. How about ss_oldest_flush_lsn
> (where ss indicates standby_slots)?
>
> Apart from this, I have made minor modifications in the attached.

Thanks, I have merged it.

Attach the V105 patch set, which addressed Peter's, Amit's and Bertrand's comments. This version also includes the following changes:

* We found a string matching issue for query_until() and fixed it.
* Removed one unused parameter from NeedToWaitForStandbys.
* Disabled the subscription before testing pg_logical_slot_get_changes in 040.pl, to prevent this test from catching the warning from another walsender.
* Ran pgindent.

Best Regards,
Hou zj

v105-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
v105-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 3/4/24 14:16, Ronan Dunklau wrote:
> On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
>> These are "my" animals (running at a local university). There's a couple
>> of interesting details:
>
> Hi Tomas,
> do you still have the failing cluster data?
>
> Noah pointed me to this thread, and it looks a bit similar to the FSM
> corruption issue I'm facing:
> https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop
>
> So if you still have the data, it would be nice to see if you indeed have
> a corrupted FSM, and if you have any indication of when it happened.

Sorry, I nuked the buildroot so I don't have the data anymore. Let's see if it fails again.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On Mon, 4 Mar 2024 at 12:34, Aleksander Alekseev wrote: > > > This was an experimental patch, I was looking for the comment on the > > proposed > > approach -- whether we could simply skip the throwaway NOT NULL constraint > > for > > deferred PK constraint. Moreover, skip that for any PK constraint. > > I confirm that the patch fixes the bug. All the tests pass. Looks like > RfC to me. > I don't think that this is the right fix. ISTM that the real issue is that dropping a NOT NULL constraint should not mark the column as nullable if it is part of a PK, whether or not that PK is deferrable -- a deferrable PK still marks a column as not nullable. The reason pg_dump creates these throwaway NOT NULL constraints is to avoid a table scan to check for NULLs when the PK is later created. That rationale still applies to deferrable PKs, so we still want the throwaway NOT NULL constraints in that case, otherwise we'd be hurting performance of restore. Regards, Dean
Re: Some shared memory chunks are allocated even if related processes won't start
Hello Hayato,

On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote:

> OK, I understand that my initial proposal is not so valuable, so I can
> withdraw it.

Yeah, that's what it seems to me.

> About the suggestion: you were imagining AutoVacuumRequestWork() and
> brininsert(), right?

Correct.

> I agree it sounds good, but I don't think it can be implemented with
> the current interface. The interface for dynamically allocating shared
> memory is GetNamedDSMSegment(), and it returns the same shared memory
> region if the input names are the same. Therefore, there is no way to
> re-allocate the shared memory.

Yeah, I was imagining something like this: the workitem-array becomes a struct, which has a name, a "next" pointer, and a variable number of workitem slots; the AutoVacuumShmem struct has pointers to the first and last workitem-structs in the list.

When a workitem is requested by brininsert, we initially allocate, via GetNamedDSMSegment("workitem-0"), a workitem-struct with a smallish number of elements. If we request another workitem and the array is full, we allocate another array via GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0 (so that the list can be followed by an autovacuum worker that's processing the database); it's also set as the tail of the list in AutoVacuumShmem (so that we know where to store further work items). When all items in a workitem-struct are processed, we can free it (I guess via dsm_unpin_segment) and make AutoVacuumShmem->av_workitems point to the next one in the list.

This way, the "array" can grow arbitrarily.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
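To make the shape of this design concrete, here is a minimal sketch of the struct layout being described; the field names and sizes are illustrative assumptions, not code from any posted patch:

```c
/*
 * Hypothetical layout for a dynamically-grown workitem list, following
 * the description above.  All names here are illustrative.
 */
typedef struct AutoVacuumWorkItems
{
	dsm_handle	next_segment;	/* next array in the list, or DSM_HANDLE_INVALID */
	int			num_items;		/* length of the items[] array */
	AutoVacuumWorkItem items[FLEXIBLE_ARRAY_MEMBER];
} AutoVacuumWorkItems;
```

An autovacuum worker would start from the head pointer in AutoVacuumShmem and chase next_segment handles; once every slot in a segment is processed, that segment can be released with dsm_unpin_segment() and the head advanced, as described above.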
Provide a pg_truncate_freespacemap function
Hello, As we are currently experiencing a FSM corruption issue [1], we need to rebuild FSM when we detect it. I noticed we have something to truncate a visibility map, but nothing for the freespace map, so I propose the attached (liberally copied from the VM counterpart) to allow to truncate a FSM without incurring downtime, as currently our only options are to either VACUUM FULL the table or stop the cluster and remove the FSM manually. Does that seem correct ? [1] https://www.postgresql.org/message-id/flat/ 1925490.taCxCBeP46%40aivenlaptop#7ace95c95cab17b6d92607e5362984ac -- Ronan Dunklau >From b3e28e4542311094ee7177b39cc9cdf3672d1b55 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 1 Mar 2024 08:57:49 +0100 Subject: [PATCH] Provide a pg_truncate_freespacemap function. Similar to the pg_truncate_visibilitymap function, this one allows to avoid downtime to fix an FSM in the case a corruption event happens --- contrib/pg_freespacemap/Makefile | 5 +- .../pg_freespacemap--1.2--1.3.sql | 13 + contrib/pg_freespacemap/pg_freespacemap.c | 58 +++ .../pg_freespacemap/pg_freespacemap.control | 2 +- 4 files changed, 75 insertions(+), 3 deletions(-) create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile index b48e4b255bc..0f4b52f5812 100644 --- a/contrib/pg_freespacemap/Makefile +++ b/contrib/pg_freespacemap/Makefile @@ -6,8 +6,9 @@ OBJS = \ pg_freespacemap.o EXTENSION = pg_freespacemap -DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \ - pg_freespacemap--1.0--1.1.sql +DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \ + pg_freespacemap--1.1--1.2.sql \ + pg_freespacemap--1.0--1.1.sql PGFILEDESC = "pg_freespacemap - monitoring of free space map" REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql new file mode 100644 index 000..1f25877a175 --- /dev/null +++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql @@ -0,0 +1,13 @@ +/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. 
\quit + +CREATE FUNCTION pg_truncate_freespace_map(regclass) +RETURNS void +AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map' +LANGUAGE C STRICT +PARALLEL UNSAFE; + + + diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c index b82cab2d97e..17f6a631ad8 100644 --- a/contrib/pg_freespacemap/pg_freespacemap.c +++ b/contrib/pg_freespacemap/pg_freespacemap.c @@ -9,8 +9,12 @@ #include "postgres.h" #include "access/relation.h" +#include "access/xloginsert.h" +#include "catalog/storage_xlog.h" #include "funcapi.h" #include "storage/freespace.h" +#include "storage/smgr.h" +#include "utils/rel.h" PG_MODULE_MAGIC; @@ -20,6 +24,9 @@ PG_MODULE_MAGIC; */ PG_FUNCTION_INFO_V1(pg_freespace); +/* Truncate the free space map, in case of corruption */ +PG_FUNCTION_INFO_V1(pg_truncate_freespace_map); + Datum pg_freespace(PG_FUNCTION_ARGS) { @@ -40,3 +47,54 @@ pg_freespace(PG_FUNCTION_ARGS) relation_close(rel, AccessShareLock); PG_RETURN_INT16(freespace); } + + +Datum +pg_truncate_freespace_map(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + Relation rel; + ForkNumber fork; + BlockNumber block; + + rel = relation_open(relid, AccessExclusiveLock); + + /* Only some relkinds have a freespacemap */ + if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind)) + ereport(ERROR, +(errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("relation \"%s\" is of wrong relation kind", + RelationGetRelationName(rel)), + errdetail_relkind_not_supported(rel->rd_rel->relkind))); + + + /* Forcibly reset cached file size */ + RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber; + + /* Just pretend we're going to wipeout the whole rel */ + block = FreeSpaceMapPrepareTruncateRel(rel, 0); + + if (BlockNumberIsValid(block)) + { + fork = FSM_FORKNUM; + smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block); + } + + if (RelationNeedsWAL(rel)) + { + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rlocator = rel->rd_locator; + xlrec.flags = SMGR_TRUNCATE_FSM; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + } + + relation_close(rel, AccessExclusiveLock); + + PG_RETURN_VOID(); +} diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control index ac8fc5050a9..1992320691b 100644 --- a/contrib/pg_freespacemap/pg_freespacemap.control +++ b/contrib/pg_freespacemap/pg_freespacemap.control @@ -
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
> On 6 Jan 2024, at 00:03, Nathan Bossart wrote:
> I gave it a try.

Looking at this again, I think this is about ready to go in. My only comment is that doc/src/sgml/archive-modules.sgml should probably be updated to refer to setting the errdetail, especially since we document the error message there.

-- Daniel Gustafsson
Re: Statistics Import and Export
Hi, On Tue, Feb 20, 2024 at 02:24:52AM -0500, Corey Huinker wrote: > On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker > wrote: > > > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), > > which address many of the concerns listed earlier. > > > > Leaving the export/import scripts off for the time being, as they haven't > > changed and the next likely change is to fold them into pg_dump. > > > > > > > v6 posted below. Thanks! I had in mind to look at it but it looks like a rebase is needed. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: remaining sql/json patches
On 3/4/24 at 10:40, Amit Langote wrote:
> Hi Jian,
>
> Thanks for the reviews and sorry for the late reply. Replying to all
> emails in one.
>
> [v40-0001-Add-SQL-JSON-query-functions.patch]
> [v40-0002-Show-function-name-in-TableFuncScan.patch]
> [v40-0003-JSON_TABLE.patch]

In my hands (applying with patch), the patches, esp. 0001, do not apply. But I see the cfbot builds without problem, so maybe just ignore these FAILED lines. Better get them merged - so I can test there...

Erik

checking file doc/src/sgml/func.sgml
checking file src/backend/catalog/sql_features.txt
checking file src/backend/executor/execExpr.c
Hunk #1 succeeded at 48 with fuzz 2 (offset -1 lines).
Hunk #2 succeeded at 88 (offset -1 lines).
Hunk #3 succeeded at 2419 (offset -1 lines).
Hunk #4 succeeded at 4195 (offset -1 lines).
checking file src/backend/executor/execExprInterp.c
Hunk #1 succeeded at 72 (offset -1 lines).
Hunk #2 succeeded at 180 (offset -1 lines).
Hunk #3 succeeded at 485 (offset -1 lines).
Hunk #4 succeeded at 1560 (offset -1 lines).
Hunk #5 succeeded at 4242 (offset -1 lines).
checking file src/backend/jit/llvm/llvmjit_expr.c
checking file src/backend/jit/llvm/llvmjit_types.c
checking file src/backend/nodes/makefuncs.c
Hunk #1 succeeded at 856 (offset -1 lines).
checking file src/backend/nodes/nodeFuncs.c
Hunk #1 succeeded at 233 (offset -1 lines).
Hunk #2 succeeded at 517 (offset -1 lines).
Hunk #3 succeeded at 1019 (offset -1 lines).
Hunk #4 succeeded at 1276 (offset -1 lines).
Hunk #5 succeeded at 1617 (offset -1 lines).
Hunk #6 succeeded at 2381 (offset -1 lines).
Hunk #7 succeeded at 3429 (offset -1 lines).
Hunk #8 succeeded at 4164 (offset -1 lines).
checking file src/backend/optimizer/path/costsize.c
Hunk #1 succeeded at 4878 (offset -1 lines).
checking file src/backend/optimizer/util/clauses.c
Hunk #1 succeeded at 50 (offset -3 lines).
Hunk #2 succeeded at 415 (offset -3 lines).
checking file src/backend/parser/gram.y
checking file src/backend/parser/parse_expr.c
checking file src/backend/parser/parse_target.c
Hunk #1 succeeded at 1988 (offset -1 lines).
checking file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 4465 (offset -1 lines).
checking file src/backend/utils/adt/jsonb.c
Hunk #1 succeeded at 2159 (offset -4 lines).
checking file src/backend/utils/adt/jsonfuncs.c
checking file src/backend/utils/adt/jsonpath.c
Hunk #1 FAILED at 68.
Hunk #2 succeeded at 1239 (offset -1 lines).
1 out of 2 hunks FAILED
checking file src/backend/utils/adt/jsonpath_exec.c
Hunk #1 succeeded at 229 (offset -5 lines).
Hunk #2 succeeded at 2866 (offset -5 lines).
Hunk #3 succeeded at 3751 (offset -5 lines).
checking file src/backend/utils/adt/ruleutils.c
Hunk #1 succeeded at 474 (offset -1 lines).
Hunk #2 succeeded at 518 (offset -1 lines).
Hunk #3 succeeded at 8303 (offset -1 lines).
Hunk #4 succeeded at 8475 (offset -1 lines).
Hunk #5 succeeded at 8591 (offset -1 lines).
Hunk #6 succeeded at 9808 (offset -1 lines).
Hunk #7 succeeded at 9858 (offset -1 lines).
Hunk #8 succeeded at 10039 (offset -1 lines).
Hunk #9 succeeded at 10909 (offset -1 lines).
checking file src/include/executor/execExpr.h
checking file src/include/nodes/execnodes.h
checking file src/include/nodes/makefuncs.h
checking file src/include/nodes/parsenodes.h
checking file src/include/nodes/primnodes.h
checking file src/include/parser/kwlist.h
checking file src/include/utils/formatting.h
checking file src/include/utils/jsonb.h
checking file src/include/utils/jsonfuncs.h
checking file src/include/utils/jsonpath.h
checking file src/interfaces/ecpg/preproc/ecpg.trailer
checking file src/test/regress/expected/sqljson_queryfuncs.out
checking file src/test/regress/parallel_schedule
checking file src/test/regress/sql/sqljson_queryfuncs.sql
checking file src/tools/pgindent/typedefs.list
Re: remaining sql/json patches
On 2024-Mar-04, Erik Rijkers wrote: > In my hands (applying with patch), the patches, esp. 0001, do not apply. > But I see the cfbot builds without problem so maybe just ignore these FAILED > lines. Better get them merged - so I can test there... It's because of dbbca2cf299b. It should apply cleanly if you do "git checkout dbbca2cf299b^" first ... That commit is so recent that evidently the cfbot hasn't had a chance to try this patch again since it went in, which is why it's still green. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint)
Re: pgsql: Improve performance of subsystems on top of SLRU
On 2024-Mar-03, Tom Lane wrote: > This is certainly simpler, but I notice that it holds the current > LWLock across the line > > ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember)); > > where the old code did not. Could the palloc take long enough that > holding the lock is bad? Hmm, I guess most of the time it shouldn't be much of a problem (if the length is small so we can palloc without malloc'ing); but it could be in the worst case. But the fix is simple: just release the lock before. There's no correctness argument for holding it all the way down. I was just confused about how the original code worked. > Also, with this coding the "lock = NULL;" assignment just before > "goto retry" is a dead store. Not sure if Coverity or other static > analyzers would whine about that. Oh, right. I removed that. On 2024-Mar-04, Dilip Kumar wrote: > I am not sure about the other changes, I mean that makes the code much > simpler but now we are not releasing the 'MultiXactOffsetCtl' related > bank lock, and later in the following loop, we are comparing that lock > against 'MultiXactMemberCtl' related bank lock. This makes code > simpler because now in the loop we are sure that we are always holding > the lock but I do not like comparing the bank locks for 2 different > SLRUs, although there is no problem as there would not be a common > lock address, True. This can be addressed in the same way Tom's first comment is: just release the lock before entering the second loop, and setting lock to NULL. This brings the code to a similar state than before, except that the additional LWLock * variables are in a tighter scope. That's in 0001. Now, I had a look at the other users of slru.c and noticed in subtrans.c that StartupSUBTRANS we have some duplicate code that I think could be rewritten by making the "while" block test the condition at the end instead of at the start; changed that in 0002. I'll leave this one for later, because I want to add some test code for it -- right now it's pretty much test-uncovered code. I also looked at slru.c for uses of shared->bank_locks and noticed a couple that could be made simpler. That's 0003 here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "If it is not right, do not do it. If it is not true, do not say it." (Marcus Aurelius, Meditations) >From a6be7cac5de83a36e056147f3e81bea2eb1096bd Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Sun, 3 Mar 2024 15:20:36 +0100 Subject: [PATCH v2 1/3] rework locking code in GetMultiXactIdMembers Per Coverity --- src/backend/access/transam/multixact.c | 53 +++--- 1 file changed, 22 insertions(+), 31 deletions(-) diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index cd476b94fa..83b578dced 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1247,14 +1247,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members, MultiXactOffset offset; int length; int truelength; - int i; MultiXactId oldestMXact; MultiXactId nextMXact; MultiXactId tmpMXact; MultiXactOffset nextOffset; MultiXactMember *ptr; LWLock *lock; - LWLock *prevlock = NULL; debug_elog3(DEBUG2, "GetMembers: asked for %u", multi); @@ -1361,18 +1359,9 @@ retry: pageno = MultiXactIdToOffsetPage(multi); entryno = MultiXactIdToOffsetEntry(multi); - /* - * If this page falls under a different bank, release the old bank's lock - * and acquire the lock of the new bank. - */ + /* Acquire the bank lock for the page we need. 
*/ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno); - if (lock != prevlock) - { - if (prevlock != NULL) - LWLockRelease(prevlock); - LWLockAcquire(lock, LW_EXCLUSIVE); - prevlock = lock; - } + LWLockAcquire(lock, LW_EXCLUSIVE); slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi); offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno]; @@ -1407,17 +1396,19 @@ retry: if (pageno != prev_pageno) { + LWLock *newlock; + /* * Since we're going to access a different SLRU page, if this page * falls under a different bank, release the old bank's lock and * acquire the lock of the new bank. */ - lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno); - if (prevlock != lock) + newlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno); + if (newlock != lock) { -LWLockRelease(prevlock); -LWLockAcquire(lock, LW_EXCLUSIVE); -prevlock = lock; +LWLockRelease(lock); +LWLockAcquire(newlock, LW_EXCLUSIVE); +lock = newlock; } slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact); } @@ -1429,8 +1420,7 @@ retry: if (nextMXOffset == 0) { /* Corner case 2: next multixact is still being filled in */ - LWLockRelease(prevlock); - prevlock = NULL; + LWLockRelease(lock); CHECK_FO
Re: Experiments with Postgres and SSL
On 01/03/2024 23:49, Jacob Champion wrote:
> On Wed, Feb 28, 2024 at 4:10 AM Heikki Linnakangas wrote:
>> I think we'd want to *avoid* changing the major protocol version in a
>> way that would introduce a new roundtrip, though.
>
> I'm starting to get up to speed with this patchset. So far I'm mostly
> testing how it works; I have yet to take an in-depth look at the
> implementation.

Thank you!

> I'll squint more closely at the MITM-protection changes in 0008 later.
> First impressions, though: it looks like that code has gotten much less
> straightforward, which I think is dangerous given the attack it's
> preventing. (Off-topic: I'm skeptical of future 0-RTT support. Our
> protocol doesn't seem particularly replay-safe to me.)

Let's drop that patch. AFAICS it's not needed by the rest of the patches.

> If we're interested in ALPN negotiation in the future, we may also want
> to look at GREASE [1] to keep those options open in the presence of
> third-party implementations. Unfortunately OpenSSL doesn't do this
> automatically yet.

Can you elaborate? Do we need to do something extra in the server to be compatible with GREASE?

> If we don't have a reason not to, it'd be good to follow the strictest
> recommendations from [2] to avoid cross-protocol attacks. (For anyone
> currently running web servers and Postgres on the same host, they
> really don't want browsers "talking" to their Postgres servers.) That
> would mean checking the negotiated ALPN on both the server and client
> side, and failing if it's not what we expect.

Hmm, I thought that's what the patches do. But looking closer, libpq is not checking that ALPN was used. We should add that. Am I right?

> I'm not excited about the proliferation of connection options. I don't
> have a lot of ideas on how to fix it, though, other than to note that
> the current sslnegotiation option names are very unintuitive to me:
>
> - "postgres": only legacy handshakes
> - "direct": might be direct... or maybe legacy
> - "requiredirect": only direct handshakes... unless other options are
>   enabled and then we fall back again to legacy?
>
> How many people willing to break TLS compatibility with old servers via
> "requiredirect" are going to be okay with lazy fallback to GSS or
> otherwise?

Yeah, this is my biggest complaint about all this. Not so much the names of the options, but the number of combinations of different options, and how we're going to test them all. I don't have any great solutions, except adding a lot of tests to cover them, like Matthias did.

> Heikki mentioned possibly hard-coding a TLS alert if direct SSL is
> attempted without server TLS support. I think that's a cool idea, but
> without an official "TLS not supported" alert code (which, honestly,
> would be strange to standardize) I'm kinda -0.5 on it. If the client
> tells me about a handshake_failure or similar, I'm going to start
> investigating protocol versions and ciphersuites; I'm not going to
> think to myself that maybe the server lacks TLS support altogether.

Agreed.

> (Plus, we need to have a good error message when connecting to older
> servers anyway. I think we should be able to key off of the EOF coming
> back from OpenSSL; it'd be a good excuse to give that part of the code
> some love.)

Hmm, if OpenSSL sends the ClientHello and the server responds with a Postgres error packet, OpenSSL will presumably consume the error packet, or at least part of it. But with our custom BIO, we can peek at the server response before handing it to OpenSSL.

If it helps, we could backport a nicer error message to old server versions, similar to what we did with SCRAM in commit 96d0f988b1.

> For the record, I'm adding some one-off tests for this feature to a
> local copy of my OAuth pytest suite, which is designed to do the kinds
> of testing you're running into trouble with. It's not in any way viable
> for a PG17 commit, but if you're interested I can make the patches
> available.

Yes please, it would be nice to see what tests you've performed, and have it archived.

--
Heikki Linnakangas
Neon (https://neon.tech)
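For reference, the client-side check being discussed could look roughly like the sketch below, built on OpenSSL's SSL_get0_alpn_selected(). The protocol name "postgresql" is an assumption for illustration, not something settled in this thread:

```c
#include <stdbool.h>
#include <string.h>
#include <openssl/ssl.h>

/*
 * Sketch: return false if the server did not negotiate the expected
 * ALPN protocol.  "postgresql" is an assumed protocol name.
 */
static bool
alpn_negotiated_ok(SSL *ssl)
{
	const unsigned char *proto;
	unsigned int proto_len;

	SSL_get0_alpn_selected(ssl, &proto, &proto_len);

	/* NULL / zero length means no ALPN protocol was negotiated */
	if (proto == NULL || proto_len == 0)
		return false;

	return proto_len == strlen("postgresql") &&
		memcmp(proto, "postgresql", proto_len) == 0;
}
```

The same shape works on the server side via the ALPN selection callback, which is where a mismatch can be rejected during the handshake itself.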
Re: Synchronizing slots from primary to standby
Hi,

On Mon, Mar 04, 2024 at 01:28:04PM +0000, Zhijie Hou (Fujitsu) wrote:
> Attach the V105 patch set

Thanks!

Sorry, I missed these during the previous review:

1 ===

Commit message: "these functions will block until"

s/block/wait/ ?

2 ===

+when used with logical failover slots, will block until all

s/block/wait/ ?

It seems those are the 2 remaining "block"s that could deserve the change proposed above.

3 ===

+ invalidated = slot->data.invalidated != RS_INVAL_NONE;
+ inactive = slot->active_pid == 0;

invalidated = (slot->data.invalidated != RS_INVAL_NONE);
inactive = (slot->active_pid == 0);

instead? I think it's easier to read, and it looks like this is the way it's written in other places (at least the few I checked).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: LogwrtResult contended spinlock
Thanks for looking into this. On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis wrote: > > > 3. > > @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI) > > If at all, a read > > barrier is warranted here, we can use atomic read with full barrier > > I don't think we need a full barrier but I'm fine with using > pg_atomic_read_membarrier_u64() if it's better for whatever reason. For the sake of clarity and correctness, I've used pg_atomic_read_membarrier_u64 everywhere for reading XLogCtl->LogwrtResult.Write and XLogCtl->LogwrtResult.Flush. > > 5. I guess we'd better use pg_atomic_read_u64_impl and > > pg_atomic_compare_exchange_u64_impl in > > pg_atomic_monotonic_advance_u64 > > to reduce one level of function indirections. > > Aren't they inlined? Yes, all of them are inlined. But, it seems like XXX_impl functions are being used in implementing exposed functions as a convention. Therefore, having pg_atomic_read_u64_impl and pg_atomic_compare_exchange_u64_impl doesn't sound bad IMV. > > 6. > > + * Full barrier semantics (even when value is unchanged). > > > > +currval = pg_atomic_read_u64(ptr); > > +if (currval >= target_) > > +{ > > +pg_memory_barrier(); > > I don't think they are exactly equivalent: in the current patch, the > first pg_atomic_read_u64() could be reordered with earlier reads; > whereas that wouldn't work if using pg_atomic_read_membarrier_u64() it > could not be. I'm not sure whether that could create a performance > problem or not. I left it as-is. > > 9. > > +copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy); > > +if (startptr + count > copyptr) > > +ereport(WARNING, > > +(errmsg("request to read past end of generated WAL; > > request %X/%X, current position %X/%X", > > +LSN_FORMAT_ARGS(startptr + count), > > LSN_FORMAT_ARGS(copyptr; > > > > Any specific reason for this to be a WARNING rather than an ERROR? > > Good question. WaitXLogInsertionsToFinish() uses a LOG level message > for the same situation. They should probably be the same log level, and > I would think it would be either PANIC or WARNING. I have no idea why > LOG was chosen. WaitXLogInsertionsToFinish adjusts upto after LOG so that the wait is never past the current insert position even if a caller asks for reading WAL that doesn't yet exist. And the comment there says "Here we just assume that to mean that all WAL that has been reserved needs to be finished." In contrast, WALReadFromBuffers kind of enforces callers to do WaitXLogInsertionsToFinish (IOW asks callers to send in the WAL that exists in the server). Therefore, an ERROR seems a reasonable choice to me, if PANIC sounds rather strong affecting all the postgres processes. FWIW, a PANIC when requested to flush past the end of WAL in WaitXLogInsertionsToFinish instead of LOG seems to be good. CF bot animals don't complain - https://github.com/BRupireddy2/postgres/tree/be_harsh_when_request_to_flush_past_end_of_WAL_WIP. > 0001: > > * The comments on the two versions of the functions are redundant, and > the style in that header seems to be to omit the comment from the u64 > version. Removed comments atop 64-bit version. > * I'm not sure the AssertPointerAlignment is needed in the u32 version? Borrowed them from pg_atomic_read_u32 and pg_atomic_compare_exchange_u32, just like how they assert before calling XXX_impl versions. I don't see any problem with them. > 0002: > > * All updates to the non-shared LogwrtResult should update both values. 
> It's confusing to update those local values independently, because it > violates the invariant that LogwrtResult.Flush <= LogwrtResult.Write. > > > 2. I guess we need to update both the Write and Flush local copies in > > AdvanceXLInsertBuffer. > > I agree. Whenever we update the non-shared LogwrtResult, let's update > the whole thing. Otherwise it's confusing. Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs. > * pg_memory_barrier() is not needed right before a spinlock Got rid of it as we read both Flush and Write local copies with pg_atomic_read_membarrier_u64. > * As mentioned above, I think GetFlushRecPtr() needs two read barriers. > Also, I think the same for GetXLogWriteRecPtr(). Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs. > * In general, for any place using both Write and Flush, I think Flush > should be loaded first, followed by a read barrier, followed by a load > of the Write pointer. Why read Flush first rather than Write? I think it's enough to do {read Write, read barrier, read Flush}. This works because Write is monotonically advanced first before Flush using full barriers and we don't get reordering issues between the readers and writers no? Am I missing anything here? > And I think in most of those places there shou
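To ground the ordering discussion above, here is a sketch of the compare-and-swap loop behind a monotonic-advance primitive with the "full barrier even when the value is unchanged" semantics quoted earlier; the function name comes from the patch, but this particular body is illustrative:

```c
/*
 * Sketch: advance *ptr to target if target is greater, providing full
 * barrier semantics even when the value is left unchanged.
 */
static inline void
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target)
{
	uint64		currval = pg_atomic_read_u64(ptr);

	while (currval < target)
	{
		/* CAS acts as a full barrier whether it succeeds or fails */
		if (pg_atomic_compare_exchange_u64(ptr, &currval, target))
			return;
		/* on failure, currval now holds the value actually seen; retry */
	}

	/*
	 * The value was already >= target: issue an explicit full barrier so
	 * callers get the same ordering guarantees on this path too.
	 */
	pg_memory_barrier();
}
```

Both exit paths give callers the same ordering guarantee, which is what the quoted review comment about "full barrier semantics (even when value is unchanged)" asks for.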
Re: Shared detoast Datum proposal
On 3/4/24 02:23, Andy Fan wrote:
>
> Tomas Vondra writes:
>
>> On 2/26/24 16:29, Andy Fan wrote:
>>>
>>> ...
>>>
>>> I can understand the benefits of the TOAST cache, but the following
>>> issues look not good to me IIUC:
>>>
>>> 1). We can't put the result into tts_values[*], since without the
>>> planner decision we don't know if this will break the small_tlist
>>> logic. But if we put it into the cache only, that means a cache lookup
>>> is an unavoidable overhead.
>>
>> True - if you're comparing having the detoasted value in tts_values[*]
>> directly with having to do a lookup in a cache, then yes, there's a bit
>> of an overhead.
>>
>> But I think from the discussion it's clear having to detoast the value
>> into tts_values[*] has some weaknesses too, in particular:
>>
>> - It requires decisions about which attributes to detoast eagerly, which
>>   is quite invasive (having to walk the plan, ...).
>>
>> - I'm sure there will be cases where we choose not to detoast, but it
>>   would be beneficial to detoast.
>>
>> - Detoasting just the initial slices does not seem compatible with this.
>>
>> IMHO the overhead of the cache lookup would be negligible compared to
>> the repeated detoasting of the value (which is the current baseline). I
>> somewhat doubt the difference compared to tts_values[*] will even be
>> measurable.
>>
>>> 2). It is hard to decide which entry should be evicted without
>>> attaching it to the TupleTableSlot's life cycle; how can we guarantee
>>> that the entry we keep is the entry we will reuse soon?
>>
>> True. But is that really a problem? I imagined we'd set some sort of
>> memory limit on the cache (work_mem?), and evict the oldest entries. So
>> the entries would eventually get evicted, and the memory limit would
>> ensure we don't consume arbitrary amounts of memory.
>
> Here is a copy from the shared_detoast_datum.org in the patch. I feel
> that when / which entry to free is a key problem, while the run-time
> (detoast_attr) overhead vs the createplan.c overhead is a small
> difference; the overhead I paid in createplan.c/setref.c doesn't look
> huge either.

I decided to whip up a PoC patch implementing this approach, to get a better idea of how it compares to the original proposal, both in terms of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but differ in when exactly the caching happens:

toast-cache-v1 - caching happens in toast_fetch_datum, which means this happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but that did not help with the example workloads (from the original post) at all. Only after I changed this to cache decompressed data (in v2) did it become competitive.

There's a GUC to enable the cache (enable_toast_cache, off by default), to make experimentation easier.

The cache is invalidated at the end of a transaction - I think this should be OK, because the rows may be deleted but can't be cleaned up (or replaced with a new TOAST value). This means the cache could cover multiple queries - not sure if that's a good thing or a bad thing.

I haven't implemented any sort of cache eviction. I agree that's a hard problem in general, but I doubt we can do better than LRU. I also didn't implement any memory limit.

FWIW I think it's a fairly serious issue that Andy's approach does not have any way to limit the memory. It detoasts the values eagerly, but what if the rows have multiple large values?
What if we end up keeping multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying the code to support this should not be difficult - it'd require tracking how much data we have in the cache, and considering that during lookup and when adding entries to the cache.

I've implemented the cache as "global". Maybe it should be tied to the query or executor state, but then it'd be harder to access it from detoast.c (we'd have to pass the cache pointer in some way, and it wouldn't work for use cases that don't go through the executor).

I think implementation-wise this is pretty non-invasive.

Performance-wise, these patches are slower than Andy's patch. For example for the jsonb Q1, I see master at ~500ms, Andy's patch at ~100ms, and v2 at ~150ms. I'm sure there are a number of optimization opportunities that could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow as master, because the data set is small enough to be fully cached, so there's no I/O and the decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some cases. If there's enough memory pressure to evict TOAST, it's quite possible the I/O would become the bottleneck. At that point it might be a good trade-off to cache compressed data, because then we'd fit more TOAST values in the cache. OTOH it's likely the access to TOAST values is localiz
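To make the PoC's shape easier to picture, here is an illustrative sketch of what a cache entry in detoast.c could look like. The key fields mirror what a TOAST pointer already carries (the TOAST table's OID and the value's OID), but the struct names and layout are assumptions, not the actual patch:

```c
/*
 * Illustrative cache entry for detoasted values; not the actual PoC.
 * A TOAST pointer identifies a value by the containing TOAST table's
 * OID plus the value's OID, so that pair serves as the hash key.
 */
typedef struct ToastCacheKey
{
	Oid			toastrelid;		/* OID of the TOAST table */
	Oid			valueid;		/* OID identifying the value within it */
} ToastCacheKey;

typedef struct ToastCacheEntry
{
	ToastCacheKey key;
	struct varlena *value;		/* detoasted (in v2: decompressed) copy */
	Size		size;			/* entry size, for a future memory limit */
} ToastCacheEntry;
```

With such a key, end-of-transaction invalidation is just destroying the hash table, and the v1-versus-v2 distinction is only about whether value holds compressed or decompressed bytes.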
Re: Extract numeric filed in JSONB more effectively
Peter Eisentraut writes:

> On 09.02.24 10:05, Andy Fan wrote:
>> 2. Where is the current feature blocked for the past few months?
>> It's an error-message compatibility issue! Continue with the above setup:
>>
>> master:
>>
>> select * from tb where (a->'b')::numeric > 3::numeric;
>> ERROR: cannot cast jsonb string to type numeric
>>
>> select * from tb where (a->'b')::int4 > 3::numeric;
>> ERROR: cannot cast jsonb string to type integer
>>
>> You can see the error message is different (numeric vs integer).
>>
>> Patched:
>>
>> We can still get the same error message as master BUT the code looks odd:
>>
>> select * from tb where (a->'b')::int4 > 3;
>> QUERY PLAN
>> ---------------------------------------------------------------------
>> Seq Scan on public.tb
>>   Output: a
>>   Filter: ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 'b'::text), '23'::oid))::integer > 3)
>> (3 rows)
>>
>> You can see "jsonb_finish_numeric(.., '23'::oid)"; the '23'::oid is just
>> for the *"integer"* output in the error message:
>> "cannot cast jsonb string to type *integer*"
>>
>> Now the situation is that either we use the odd argument (23::oid) in
>> jsonb_finish_numeric, or we use an error message incompatible with the
>> previous version. I'm not sure which way is better, but this is the
>> place where the current feature is blocked.
>
> I'm not bothered by that. It also happens on occasion in the backend C
> code that we pass around extra information to be able to construct
> better error messages. The functions here are not backend C code, but
> they are internal functions, so similar considerations can apply.

Thanks for speaking on this!

> But I have a different question about this patch set. This has some
> overlap with the JSON_VALUE function that is being discussed at [0][1].
> For example, if I apply the patch
> v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run
>
> select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;
>
> and I get a noticeable performance boost over
>
> select count(*) from tb where cast (a->'a' as numeric) = 2;

Here are my test and profile for the above two queries:

create table tb(a jsonb);
insert into tb select jsonb_build_object('a', i) from generate_series(1, 1)i;

cat a.sql
select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;

cat b.sql
select count(*) from tb where cast (a->'a' as numeric) = 2;

pgbench -n -f xxx.sql postgres -T100 | grep lat

Then here are the results:

| query | master  | patched (patch here and jsonb_value) |
|-------+---------+--------------------------------------|
| a.sql | /       | 2.59 (ms)                            |
| b.sql | 3.34 ms | 1.75 (ms)                            |

As we can see, the patch here has the best performance (does this result look different from yours?). After checking the code, I am sure both patches *don't* have the problem in master where it gets a JsonbValue first, converts it to jsonb, and then casts to numeric.

Then I profiled both queries with perf and found the following:

JSON_VALUE:

- 32.02% 4.30% postgres postgres [.] JsonPathValue
   - 27.72% JsonPathValue
      - 22.63% executeJsonPath (inlined)
         - 19.97% executeItem (inlined)
            - executeItemOptUnwrapTarget
               - 17.79% executeNextItem
                  - 15.49% executeItem (inlined)
                     - executeItemOptUnwrapTarget
                        + 8.50% getKeyJsonValueFromContainer (note here...)
                        + 2.30% executeNextItem
                          0.79% findJsonbValueFromContainer
                  + 0.68% check_stack_depth
         + 1.51% jspGetNext
         + 0.73% check_stack_depth
           1.27% jspInitByBuffer
           0.85% JsonbExtractScalar
   + 4.91% DatumGetJsonbP (inlined)

Patch here, for b.sql:

- 19.98% 2.10% postgres postgres [.] jsonb_object_field_start
   - 17.88% jsonb_object_field_start
      - 17.70% jsonb_object_field_internal (inlined)
         + 11.03% getKeyJsonValueFromContainer
         - 6.26% DatumGetJsonbP (inlined)
            + 5.78% detoast_attr
   + 1.21% _start
   + 0.54% 0x55ddb44552a

JSON_VALUE has a much longer path to reach getKeyJsonValueFromContainer, so I think JSON_VALUE is probably designed for more complex paths, which have to pay extra effort; that accounts for the performance difference above. I added Amit and Alvaro to this thread in case they have more insight on this.

> So some questions to think about:
>
> 1. Compare performance of base case vs. this patch vs. json_value.

Done, as above.

> 2. Can json_value be optimized further?

Hmm, I have some trouble understanding A's performance boost over B and deciding which is better. But in my test above, it looks like the patch here is better in the given case, and the difference may come from JS
Re: a wrong index choose when statistics is out of date
On 3/4/24 06:33, David Rowley wrote:
> On Sun, 3 Mar 2024 at 20:08, Andy Fan wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (.., created_at timestamp, a int, b int);
>>
>> create index idx_1 on x_events(created_at, a);
>> create index idx_2 on x_events(created_at, b);
>>
>> query:
>> select * from x_events where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics think "created_at = current_timestamp" has
>> no rows; then both indexes look OK, but actually it is true just because
>> the statistics are out of date.
>
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could happen.
>
> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider. I did provide some detail in [1]
> about the design that was in my head at that time. I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.
>
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently in my brain-based design. That
> way add_path() would be more likely to prefer the index that matches
> the most conditions.
>
> The exact maths to calculate the certainty_factor for this case I
> don't quite have worked out yet. I plan to work on documenting the
> design of this and try and get a prototype patch out sometime during
> this coming southern hemisphere winter so that there's at least a full
> cycle of feedback opportunity before the PG18 freeze.

I've been thinking about this stuff too, so I'm curious to hear what kind of plan you come up with. Every time I tried to formalize a more concrete plan, I ended up with a very complex (and possibly yet more fragile) approach.

I think we'd need to consider a couple of things:

1) reliability of cardinality estimates

I think this is pretty much the same concept as confidence intervals, i.e. knowing not just the regular estimate, but also a range where the actual value lies with high confidence (e.g. 90%). For a single clause this might not be terribly difficult, but for more complex cases (multiple conditions, ...) it seems far more complex :-(

For example, let's say we know confidence intervals for two conditions. What's the confidence interval when combined using AND or OR? (A first-cut sketch of this is included after this message.)

2) robustness of the paths

Knowing just the confidence intervals does not seem sufficient, though. The other thing that matters is how this affects the paths, how robust the paths are. I mean, if we have alternative paths with costs that flip somewhere in the confidence interval - which one to pick? Surely one thing to consider is how fast the costs change for each path.

3) complexity of the model

I suppose we'd prefer a systematic approach (and not some ad hoc solution for one particular path/plan type). So this would be somehow integrated into the cost model, making it yet more complex. I'm quite worried about this (not necessarily because of performance reasons).

I wonder if trying to improve the robustness solely by changes in the planning phase is a very promising approach.
I mean - sure, we should improve that, but by definition it relies on a priori information. And not only the stats may be stale - it's a very lossy approximation of the actual data. Even if the stats are perfectly up to date / very detailed, there's still going to be a lot of uncertainty. I wonder if we'd be better off if we experimented with more robust plans, like SmoothScan [1] or g-join [2]. regards [1] https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf [2] http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/haerder/2011/JoinAndGrouping.pdf -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
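As flagged above, a first-cut sketch of the AND/OR combination question, assuming the two clauses are independent (the same assumption the planner's point estimates already make) and selectivities $s_1 \in [l_1, u_1]$, $s_2 \in [l_2, u_2]$:

$$
\begin{aligned}
s_{\mathrm{AND}} &= s_1 s_2 \in [\,l_1 l_2,\; u_1 u_2\,], \\
s_{\mathrm{OR}} &= s_1 + s_2 - s_1 s_2 \in [\,l_1 + l_2 - l_1 l_2,\; u_1 + u_2 - u_1 u_2\,],
\end{aligned}
$$

because both combinators are monotone non-decreasing in each argument over $[0,1]$. The caveat is the confidence level itself: if each input interval holds with probability 0.9, a union bound only guarantees the combined interval with probability at least 0.8, and any correlation between the columns voids the independence assumption entirely - which may be part of why the formalization keeps turning out complex.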
Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)
> On 4 Mar 2024, at 16:47, Ranier Vilela wrote:
>
> Does filling a memory area, one by one, with branches, need strong
> evidence to prove that it is better than filling a memory area, all at
> once, without branches?

I apologise for being too quick to decide to mark the patch RwF. Would you mind if I restore the patch to "Needs review" so we can get more feedback?

Best regards, Andrey Borodin.
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
On Mon, Mar 04, 2024 at 03:21:59PM +0100, Daniel Gustafsson wrote: > Looking at this again I think this is about ready to go in. My only comment > is > that doc/src/sgml/archive-modules.sgml probably should be updated to refer to > setting the errdetail, especially since we document the errormessage there. Thanks for reviewing. How does this look? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 437e4fa9ec27ecf870530fc579cd7673dfcf96af Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 4 Mar 2024 11:15:37 -0600 Subject: [PATCH v5 1/1] Add macro for customizing an archive module WARNING message. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Presently, if an archive module's check_configured_cb callback returns false, a generic WARNING about the archive module misconfiguration is emitted, which unfortunately provides no actionable details about the source of the miconfiguration. This commit introduces a macro that archive module authors can use to add a DETAIL line to this WARNING. Co-authored-by: Tung Nguyen Reviewed-by: Daniel Gustafsson, Álvaro Herrera Discussion: https://postgr.es/m/4109578306242a7cd5661171647e11b2%40oss.nttdata.com --- contrib/basic_archive/basic_archive.c | 7 ++- doc/src/sgml/archive-modules.sgml | 12 src/backend/archive/shell_archive.c | 7 ++- src/backend/postmaster/pgarch.c | 8 +++- src/include/archive/archive_module.h | 8 5 files changed, 39 insertions(+), 3 deletions(-) diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index 804567e919..6b102e9072 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -161,7 +161,12 @@ check_archive_directory(char **newval, void **extra, GucSource source) static bool basic_archive_configured(ArchiveModuleState *state) { - return archive_directory != NULL && archive_directory[0] != '\0'; + if (archive_directory != NULL && archive_directory[0] != '\0') + return true; + + arch_module_check_errdetail("%s is not set.", +"basic_archive.archive_directory"); + return false; } /* diff --git a/doc/src/sgml/archive-modules.sgml b/doc/src/sgml/archive-modules.sgml index 7064307d9e..cf7438a759 100644 --- a/doc/src/sgml/archive-modules.sgml +++ b/doc/src/sgml/archive-modules.sgml @@ -114,6 +114,18 @@ WARNING: archive_mode enabled, yet archiving is not configured In the latter case, the server will periodically call this function, and archiving will proceed only when it returns true. + + + + When returning false, it may be useful to append some + additional information to the generic warning message. To do that, provide + a message to the arch_module_check_errdetail macro + before returning false. Like + errdetail(), this macro accepts a format string + followed by an optional list of arguments. The resulting string will be + emitted as the DETAIL line of the warning message. 
+ + diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c index c95b732495..bff0ab800d 100644 --- a/src/backend/archive/shell_archive.c +++ b/src/backend/archive/shell_archive.c @@ -45,7 +45,12 @@ shell_archive_init(void) static bool shell_archive_configured(ArchiveModuleState *state) { - return XLogArchiveCommand[0] != '\0'; + if (XLogArchiveCommand[0] != '\0') + return true; + + arch_module_check_errdetail("%s is not set.", +"archive_command"); + return false; } static bool diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index bb0eb13a89..f97035ca03 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -88,6 +88,7 @@ typedef struct PgArchData } PgArchData; char *XLogArchiveLibrary = ""; +char *arch_module_check_errdetail_string; /* -- @@ -401,12 +402,17 @@ pgarch_ArchiverCopyLoop(void) */ HandlePgArchInterrupts(); + /* Reset variables that might be set by the callback */ + arch_module_check_errdetail_string = NULL; + /* can't do anything if not configured ... */ if (ArchiveCallbacks->check_configured_cb != NULL && !ArchiveCallbacks->check_configured_cb(archive_module_state)) { ereport(WARNING, - (errmsg("archive_mode enabled, yet archiving is not configured"))); + (errmsg("archive_mode enabled, yet archiving is not configured"), + arch_module_check_errdetail_string ? + errdetail_internal("%s", arch_module_check_errdetail_string) : 0)); return; } diff --git a/src/include/archive/archive_module.h b/src/include/archive/archive_module.h index fd59b9faf4..763af76e54 100644 --- a/src/include/archive/archive_module.h +++ b/src/include/archive/archive_module.h @@ -56,4 +56,12 @@ typedef const ArchiveModuleCallbacks *(*ArchiveModuleInit) (void); extern PGDLLEXPORT const ArchiveModuleCallback
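For third-party archive modules, usage of the new macro would look roughly like the following (a hypothetical sketch based on the patch above; the module and GUC names "my_archive" / "my_archive.archive_directory" are illustrative only, and the module boilerplate - init function, callback struct, GUC definition - is omitted):

    static bool
    my_archive_configured(ArchiveModuleState *state)
    {
        if (my_archive_directory != NULL && my_archive_directory[0] != '\0')
            return true;

        /* becomes the DETAIL line of the generic WARNING */
        arch_module_check_errdetail("%s is not set.",
                                    "my_archive.archive_directory");
        return false;
    }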
Re: pgsql: Improve performance of subsystems on top of SLRU
FWIW there's a stupid bug in 0002, which is fixed here. I'm writing a simple test for it. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Now I have my system running, not a byte was off the shelf; It rarely breaks and when it does I fix the code myself. It's stable, clean and elegant, and lightning fast as well, And it doesn't cost a nickel, so Bill Gates can go to hell." >From ecf613092a3cbc4c5112d30af7eea55b668decec Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 4 Mar 2024 11:49:01 +0100 Subject: [PATCH v3] Rework redundant loop in subtrans.c --- src/backend/access/transam/subtrans.c | 29 +++ 1 file changed, 7 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index dc9566fb51..50bb1d8cfc 100644 --- a/src/backend/access/transam/subtrans.c +++ b/src/backend/access/transam/subtrans.c @@ -311,7 +311,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID) FullTransactionId nextXid; int64 startPage; int64 endPage; - LWLock *prevlock; + LWLock *prevlock = NULL; LWLock *lock; /* @@ -324,42 +324,27 @@ StartupSUBTRANS(TransactionId oldestActiveXID) nextXid = TransamVariables->nextXid; endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid)); - prevlock = SimpleLruGetBankLock(SubTransCtl, startPage); - LWLockAcquire(prevlock, LW_EXCLUSIVE); - while (startPage != endPage) + for (;;) { lock = SimpleLruGetBankLock(SubTransCtl, startPage); - - /* - * Check if we need to acquire the lock on the new bank then release - * the lock on the old bank and acquire on the new bank. - */ if (prevlock != lock) { - LWLockRelease(prevlock); + if (prevlock) +LWLockRelease(prevlock); LWLockAcquire(lock, LW_EXCLUSIVE); prevlock = lock; } (void) ZeroSUBTRANSPage(startPage); + if (startPage == endPage) + break; + startPage++; /* must account for wraparound */ if (startPage > TransactionIdToPage(MaxTransactionId)) startPage = 0; } - lock = SimpleLruGetBankLock(SubTransCtl, startPage); - - /* - * Check if we need to acquire the lock on the new bank then release the - * lock on the old bank and acquire on the new bank. - */ - if (prevlock != lock) - { - LWLockRelease(prevlock); - LWLockAcquire(lock, LW_EXCLUSIVE); - } - (void) ZeroSUBTRANSPage(startPage); LWLockRelease(lock); } -- 2.39.2
Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
Hi,

The function varstr_cmp is a critical function. Inside the function there is a shortcut condition which allows a quick exit.

Unfortunately, the current code calls a very expensive function beforehand; if the test is true, all of that call's time is wasted. So, IMO, it's better to postpone the function call until it is actually necessary.

best regards, Ranier Vilela

avoid-call-expensive-function-varlena.patch Description: Binary data
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
On Mon, 4 Mar 2024 at 18:39, Ranier Vilela wrote:
>
> Hi,
>
> The function varstr_cmp is a critical function. Inside the function there is a shortcut condition which allows a quick exit.
>
> Unfortunately, the current code calls a very expensive function beforehand; if the test is true, all of that call's time is wasted. So, IMO, it's better to postpone the function call until it is actually necessary.

Thank you for your contribution.

I agree it would be better, but your current patch is incorrect, because we need to check if the user has access to the locale (and throw an error if not) before we return that the two strings are equal.

Kind regards, Matthias van de Meent Neon (https://neon.tech)
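For context, the reordering under discussion amounts to roughly this (a simplified sketch of varstr_cmp()'s equality fast path, not the verbatim varlena.c source):

    /* current shape: the collation lookup always runs first */
    mylocale = pg_newlocale_from_collation(collid);    /* expensive, and it validates the collation */
    if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
        return 0;

    /* proposed shape: take the equality shortcut first */
    if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
        return 0;    /* but the lookup's checks are now skipped for equal inputs */
    mylocale = pg_newlocale_from_collation(collid);

The objection is that pg_newlocale_from_collation() is not pure overhead: it can raise errors, so hoisting the memcmp() above it changes user-visible behavior for equal strings.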
Re: Shared detoast Datum proposal
> > I decided to whip up a PoC patch implementing this approach, to get a
> > better idea of how it compares to the original proposal, both in terms
> > of performance and code complexity.
> >
> > Attached are two patches that both add a simple cache in detoast.c, but
> > differ in when exactly the caching happens.
> >
> > toast-cache-v1 - caching happens in toast_fetch_datum, which means this
> > happens before decompression
> >
> > toast-cache-v2 - caching happens in detoast_attr, after decompression
...
> > I think implementation-wise this is pretty non-invasive.
..
> > Performance-wise, these patches are slower than with Andy's patch. For
> > example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
> > v2 at ~150ms. I'm sure there's a number of optimization opportunities
> > and could get v2 closer to the 100ms.
> >
> > v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
> > as master, because the data set is small enough to be fully cached, so
> > there's no I/O and the decompression is the actual bottleneck.
> >
> > That being said, I'm not convinced v1 would be a bad choice for some
> > cases. If there's memory pressure enough to evict TOAST, it's quite
> > possible the I/O would become the bottleneck. At which point it might be
> > a good trade off to cache compressed data, because then we'd cache more
> > detoasted values.
> >
> > OTOH it's likely the access to TOAST values is localized (in a temporal
> > sense), i.e. we read it from disk, detoast it a couple times in a short
> > time interval, and then move to the next row. That'd mean v2 is probably
> > the correct trade off.

I don't have a different view on the above, and thanks for the PoC code, which will make later testing much easier.

>>
>> """
>> An alternative design: toast cache
>> -
>>
>> This method is provided by Tomas during the review process. IIUC, this
>> method would maintain a local HTAB which maps a toast datum to a detoasted
>> datum and the entry is maintained / used in the detoast_attr
>> function. Within this method, the overall design is pretty clear and the
>> code modification can be contained in the toasting system only.
>
> Right.
>
>> I assumed that releasing all of the memory at the end of the executor once
>> is not an option since it may consume too much memory. Then, when and
>> which entry to release becomes a trouble for me. For example:
>>
>> QUERY PLAN
>> --
>> Nested Loop
>>    Join Filter: (t1.a = t2.a)
>>    -> Seq Scan on t1
>>    -> Seq Scan on t2
>> (4 rows)
>>
>> In this case t1.a needs a longer lifespan than t2.a since it is
>> in the outer relation. Without the help of the slot's life-cycle system, I
>> can't think of an answer for the above question.
>
> This is true, but how likely are such plans? I mean, surely no one would
> do a nested loop with sequential scans on reasonably large tables, so how
> representative is this example?

Actually this is the simplest join case; we still have the same problem with Nested Loop + Index Scan, which will be pretty common.

> Also, this leads me to the need of having some sort of memory limit. If
> we may be keeping entries for extended periods of time, and we don't
> have any way to limit the amount of memory, that does not seem great.
>
> AFAIK if we detoast everything into tts_values[] there's no way to
> implement and enforce such a limit. What happens if there's a row with
> multiple large-ish TOAST values? What happens if those rows are in
> different (and distant) parts of the plan?
I think this can be done by tracking the memory usage at the EState level or in a global variable, disabling it if it exceeds the limit and resuming it when we free detoasted datums we no longer need. I think no other changes need to be done.

> It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction are the key issues now. IMO, there is still a difference even when both methods support a memory limit. The reason is that my patch can guarantee the cached memory will be reused, so if we set the limit to 10MB, we know all of the 10MB is useful; the TOAST cache method probably can't guarantee that, so to make it effective we would have to set a higher limit.

>> Another difference between the 2 methods is my method has many
>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>> it can save some run-time effort, like the hash_search find / enter run-time
>> of method 2, since I put the values directly into tts_values[*].
>>
>> I'm not sure factor 2 makes a real measurable difference in real
>> cases, so my current concern mainly comes from factor 1.
>> """

> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention :(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

This testing for
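For readers following along, the toast-cache design being compared boils down to a small backend-local hash table consulted in detoast.c, along these lines (an illustrative sketch only; names such as toast_cache are assumptions, and the v1/v2 PoC patches differ in whether the cached copy is compressed):

    typedef struct ToastCacheKey
    {
        Oid         toastrelid;     /* TOAST table OID, from the toast pointer */
        Oid         valueid;        /* chunk_id identifying the value */
    } ToastCacheKey;

    typedef struct ToastCacheEntry
    {
        ToastCacheKey key;
        struct varlena *data;       /* cached copy (compressed in v1, detoasted in v2) */
    } ToastCacheEntry;

    /* in detoast_attr() / toast_fetch_datum(), before doing any I/O */
    entry = hash_search(toast_cache, &key, HASH_FIND, &found);
    if (found)
        return entry->data;         /* hit: skip the I/O (and, in v2, the decompression) */

    /* ... fetch (and possibly decompress) as today, then remember the result ... */
    entry = hash_search(toast_cache, &key, HASH_ENTER, &found);
    entry->data = result;

A total-size counter checked on entry, plus an LRU list threaded through the entries, is the natural place to hang the memory limit and eviction policy debated above.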
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
On Mon, Mar 4, 2024 at 14:54, Matthias van de Meent < boekewurm+postg...@gmail.com> wrote:
> On Mon, 4 Mar 2024 at 18:39, Ranier Vilela wrote:
> >
> > Hi,
> >
> > The function varstr_cmp is a critical function. Inside the function there is a shortcut condition which allows a quick exit.
> >
> > Unfortunately, the current code calls a very expensive function beforehand; if the test is true, all of that call's time is wasted. So, IMO, it's better to postpone the function call until it is actually necessary.
>
> Thank you for your contribution.
>
> I agree it would be better, but your current patch is incorrect, because we need to check if the user has access to the locale (and throw an error if not) before we return that the two strings are equal.

I can't see any user validation in the function pg_newlocale_from_collation.

meson test passes all checks.

best regards, Ranier Vilela
Re: RFC: Logging plan of the running query
On Sat, Mar 2, 2024 at 10:46 AM James Coleman wrote: > If I can rephrase this idea: it's basically "delay this interrupt > until inline to the next ExecProcNode execution". Yes, but it's not just that. It also means that the code which would handle the interrupt doesn't need to be called at every ExecProcNode. Only when the interrupt actually arrives do we enable the code that handles it. -- Robert Haas EDB: http://www.enterprisedb.com
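Roughly, the technique leans on the executor's existing ExecProcNode indirection; a sketch of the idea (illustrative only - the flag and wrapper names here are not from the actual patch):

    static TupleTableSlot *
    ExecProcNodeWithPlanLogging(PlanState *node)
    {
        if (LogQueryPlanPending)            /* set by the interrupt handler */
        {
            LogQueryPlanPending = false;
            /* safe point: log the plan of the running query here */
        }

        /* restore the real function so later calls pay no overhead */
        node->ExecProcNode = node->ExecProcNodeReal;
        return node->ExecProcNodeReal(node);
    }

The interrupt handler only sets the flag and swaps in the wrapper; until an interrupt actually arrives, every ExecProcNode() call goes straight to the node's real function.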
Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)
Hi,

On 2024-03-04 08:47:11 -0300, Ranier Vilela wrote:
> Does filling a memory area, one by one, with branches, need strong evidence to prove that it is better than filling a memory area, all at once, without branches?

That's a bogus comparison:

a) the memset version modifies the whole array, rather than just the elements that correspond to an item - often there will be far fewer items than the maximum possible number

b) the memset version initializes array elements that will be set to an actual value anyway

c) switching to memset does not elide any branches, as the branch is still needed

And even without those, it'd still not be obviously better to use an ahead-of-time memset(), as storing lots of values at once is more likely to be bound by memory bandwidth than interleaving different work.

Andres
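To make (a)-(c) concrete, the two shapes being compared look roughly like this (an illustrative sketch, not the actual pruneheap.c code):

    /* per-item fill: writes only slots up to maxoff; the branch picks the value */
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId itemid = PageGetItemId(page, offnum);

        htsv[offnum] = ItemIdIsNormal(itemid) ?
            heap_prune_satisfies_vacuum(...) : -1;
    }

    /* memset-first: touches all MaxHeapTuplesPerPage slots up front, and the
     * same branch is still needed afterwards to fill in the real items */
    memset(htsv, -1, sizeof(htsv));
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId itemid = PageGetItemId(page, offnum);

        if (ItemIdIsNormal(itemid))
            htsv[offnum] = heap_prune_satisfies_vacuum(...);
    }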
Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY
On Sat, Feb 24, 2024 at 12:10 PM Noah Misch wrote: > Agreed, those don't touch relation data files. I think you've got all > relation data file mutations. XLOG_DBASE_CREATE_FILE_COPY and XLOG_DBASE_DROP > are the only record types that touch a relation data file without mentioning > it in XLogRecordBlockHeader, XACT_XINFO_HAS_RELFILELOCATORS, or an RM_SMGR_ID > rlocator field. Thanks for the review. I have committed this. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Exponential backoff for auth_delay
Hi,

On Wed, Feb 21, 2024 at 10:26:11PM +0100, Tomas Vondra wrote:
> Thanks for the patch. I took a closer look at v3, so let me share some review comments. Please push back if you happen to disagree with some of it, some of this is going to be more a matter of personal preference.

Thanks! As my patch was based on a previous patch, some of the decisions were carry-overs I am not overly attached to.

> 1) I think it's a bit weird to have two options specifying amount of time, but one is in seconds and one in milliseconds. Won't that be unnecessarily confusing? Could we do both in milliseconds?

Alright, I changed that. See below for a discussion about the GUCs in general.

> 2) The C code defines the GUC as auth_delay.exponential_backoff, but the SGML docs say auth_delay.exp_backoff.

Right, an oversight from the last version where the GUC name got changed but I forgot to change the documentation, fixed.

> 3) Do we actually need the exponential_backoff GUC? Wouldn't it be sufficient to have the maximum value, and if it's -1 treat that as no backoff?

That is a good question, I guess that makes sense. The next question then is: what should the default for (now) auth_delay.max_milliseconds be in this case, -1? Or do we say that, as the default for auth_delay.milliseconds is 0 anyway, why would somebody not want exponential backoff when they switch it on and keep it at the current 10s/1ms? I have not changed that for now, pending further input.

> 4) I think the SGML docs should more clearly explain that the delay is initialized to auth_delay.milliseconds, and reset to this value after successful authentication. The wording kinda implies that, but it's not quite clear I think.

Ok, I added some text to that end. I also added a note that auth_delay.max_milliseconds will mean that the delay doubling never stops.

> 4) I've been really confused by the change from
>
> if (status != STATUS_OK)
> to
> if (status == STATUS_ERROR)
>
> in auth_delay_checks, until I realized those two codes are not the only ones, and we intentionally ignore STATUS_EOF. I think it'd be good to mention that in a comment, it's not quite obvious (I only realized it because the e-mail mentioned it).

Yeah, I agree, I tried to explain that now.

> 5) I kinda like the custom that functions in a contrib module start with a clear common prefix, say auth_delay_ in this case. Matter of personal preference, ofc.

Ok, I changed the functions to have an auth_delay_ prefix throughout.

> 6) I'm a bit skeptical about some acr_array details. Firstly, why would 50 entries be enough? Seems like a pretty low and arbitrary number ... Also, what happens if someone attempts to authenticate, fails to do so, and then disconnects and never tries again? Or just changes IP? Seems like the entry will remain in the array forever, no?

Yeah, that is how v3 of this patch worked. I have changed that now, see below.

> Seems like a way to cause a "limited" DoS - do auth failure from 50 different hosts, to fill the table, and now everyone has to wait the maximum amount of time (if they fail to authenticate).

Right, though the problem would only exist on authentication failures, so it is really rather limited.

> I think it'd be good to consider:
>
> - Making the acr_array a hash table, and larger than 50 entries (I wonder if that should be dynamic / customizable by GUC?).

I would say a GUC would be overkill for this as this would mostly be an implementation detail.
More generally, I think now that entries are expired (see below), this should be less of a concern, so I have not changed this to a hash table for now but doubled MAX_CONN_RECORDS to 100 entries. > - Make sure the entries are eventually expired, based on time (for > example after 2*max_delay?). I went with 5*max_milliseconds - the AuthConnRecord struct now has a last_failed_auth timestamp member; if we increase the delay for a host, we check if any other host expired in the meantime and remove it if so. > - It would be a good idea to log something when we get into the "full > table" and start delaying everyone by max_delay_seconds. (How would > anyone know that's what's happening, seems rather confusing.) Right, I added a log line for that. However, I made it LOG instead of WARNING as I don't think the client would ever see it, would he? Attached is v4 with the above changes. Cheers, Michael >From 6520fb1e768bb8bc26cafad014d08d7e9ac7f12a Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Wed, 27 Dec 2023 15:55:39 +0100 Subject: [PATCH v4] Add optional exponential backoff to auth_delay contrib module. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds two new GUCs for auth_delay, exponential_backoff and max_milliseconds. The former controls whether exponential backoff should be used or not, the latter sets an maximum delay (default is 10s) in case exponential backoff is act
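To illustrate the resulting behavior with assumed settings of auth_delay.milliseconds = 1000 and auth_delay.max_milliseconds = 10000: consecutive failures from one host would sleep roughly 1s, 2s, 4s, 8s, and then 10s (clamped) for every further failure; a successful authentication resets the host's delay back to 1s, and a host record that sees no authentication attempts for 5 * 10s = 50s is expired and its slot reclaimed.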
Re: Add system identifier to backup manifest
On Mon, Mar 4, 2024 at 12:35 AM Amul Sul wrote:
> Yes, you are correct. Both of the current callers of get_controlfile() have
> access to the root directory.
>
> I have dropped the 0002 patch -- I don't have a very strong opinion to
> refactor get_controlfile() apart from saying that it might be good to have
> both versions :) .

I don't have an enormously strong opinion on what the right thing to do is here either, but I am not convinced that the change proposed by Michael is an improvement. After all, that leaves us with the situation where we know the path to the control file in three different places. First, verify_backup_file() does a strcmp() against the string "global/pg_control" to decide whether to call verify_system_identifier(). Then, verify_system_identifier() uses that string to construct a pathname to the file that it will read. Then, get_controlfile() reconstructs the same pathname using its own logic. That's all pretty disagreeable. Hard-coded constants are hard to avoid completely, but here it looks an awful lot like we're trying to hardcode the same constant into as many different places as we can. The now-dropped patch seems like an effort to avoid this, and while it's possible that it wasn't the best way to avoid this, I still think avoiding it somehow is probably the right idea.

I get a compiler warning with 0002, too:

../pgsql/src/backend/backup/basebackup_incremental.c:960:22: warning: call to undeclared function 'GetSystemIdentifier'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] system_identifier = GetSystemIdentifier(); ^ 1 warning generated.

But I've committed 0001.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: postgres_fdw fails to see that array type belongs to extension
I wrote: > Now that the multirange issue is fixed (3e8235ba4), here's a > new version that includes the needed recursion in ALTER EXTENSION. > I spent some more effort on a proper test case, too. I looked this over again and pushed it. regards, tom lane
Re: Fix a typo in pg_rotate_logfile
> On 14 Feb 2024, at 21:48, Daniel Gustafsson wrote: >> On 14 Feb 2024, at 19:51, Nathan Bossart wrote: >> On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote: >>> Daniel Gustafsson writes: Attached is a diff to show what it would look like to remove adminpack (catalog version bump omitted on purpose to avoid conflicts until commit). >>> >>> I don't see any references you missed, so +1. >> >> Seems reasonable to me, too. > > Thanks! I'll put this in the next CF to keep it open for comments a bit > longer, but will close it early in the CF. This has now been pushed, adminpack has left the building. -- Daniel Gustafsson
Re: Shared detoast Datum proposal
On 3/4/24 18:08, Andy Fan wrote:
> ...
>>
>>> I assumed that releasing all of the memory at the end of the executor once
>>> is not an option since it may consume too much memory. Then, when and
>>> which entry to release becomes a trouble for me. For example:
>>>
>>> QUERY PLAN
>>> --
>>> Nested Loop
>>>    Join Filter: (t1.a = t2.a)
>>>    -> Seq Scan on t1
>>>    -> Seq Scan on t2
>>> (4 rows)
>>>
>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>> in the outer relation. Without the help of the slot's life-cycle system, I
>>> can't think of an answer for the above question.
>>
>> This is true, but how likely are such plans? I mean, surely no one would
>> do a nested loop with sequential scans on reasonably large tables, so how
>> representative is this example?
>
> Actually this is the simplest join case; we still have the same problem with
> Nested Loop + Index Scan, which will be pretty common.

Yes, I understand there are cases where LRU eviction may not be the best choice - like here, where the "t1" value should stay in the cache. But there are also cases where this is the wrong choice, and LRU would be better.

For example, a couple paragraphs down you suggest enforcing the memory limit by disabling detoasting when the limit is reached. That means the detoasting can get disabled because there's a single access to the attribute somewhere "up the plan tree". But what if the other attributes (which now won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very good idea of the exact life span of the values, etc. Which is why other nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is accessed (or is likely to be). That might be used to inform a better eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the data part (guaranteed to be fairly large), but keep the header, and keep the counts, expected number of hits, and other info. And use this to e.g. release entries that reached the expected number of hits. But I'd probably start with LRU and only do this as an improvement later.

>> Also, this leads me to the need of having some sort of memory limit. If
>> we may be keeping entries for extended periods of time, and we don't
>> have any way to limit the amount of memory, that does not seem great.
>>
>> AFAIK if we detoast everything into tts_values[] there's no way to
>> implement and enforce such a limit. What happens if there's a row with
>> multiple large-ish TOAST values? What happens if those rows are in
>> different (and distant) parts of the plan?
>
> I think this can be done by tracking the memory usage at the EState level
> or in a global variable, disabling it if it exceeds the limit and
> resuming it when we free detoasted datums we no longer need. I think
> no other changes need to be done.

That seems like a fair amount of additional complexity. And what if the toasted values are accessed in a context without an EState (I haven't checked how common / important that is)?

And I'm not sure about just disabling detoasting as a way to enforce a memory limit, as explained earlier.

>> It seems far easier to limit the memory with the toast cache.
>
> I think the memory limit and entry eviction are the key issues now. IMO,
> there is still a difference even when both methods support a memory limit.
> The reason is that my patch can guarantee the cached memory will be reused,
> so if we set the limit to 10MB, we know all of the 10MB is useful; the
> TOAST cache method probably can't guarantee that, so to make it effective
> we would have to set a higher limit.
Can it actually guarantee that? It can guarantee the slot may be used, but I don't see how it could guarantee the detoasted value will be used. We may be keeping the slot for other reasons. And even if it could guarantee the detoasted value will be used, does that actually prove it's better to keep that value? What if it's only used once, but it's blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value on the outer side of the Nested Loop, what if the inner path has multiple accesses to another 10MB value that now can't be detoasted (as a shared value)?

>>> Another difference between the 2 methods is my method has many
>>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>>> it can save some run-time effort, like the hash_search find / enter run-time
>>> of method 2, since I put the values directly into tts_values[*].
>>>
>>> I'm not sure factor 2 makes a real measurable difference in real
>>> cases, so my current concern mainly comes from factor 1.
>>> """
>>
>> This seems a bit dismissive of the (possible) issue.
>
> Hmm, sorry about this, that is not my intention :(

No need to apologize.

>> It'd be good to do
>> some measurements, especially on simple queries that
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
> On 4 Mar 2024, at 18:22, Nathan Bossart wrote: > > On Mon, Mar 04, 2024 at 03:21:59PM +0100, Daniel Gustafsson wrote: >> Looking at this again I think this is about ready to go in. My only comment >> is >> that doc/src/sgml/archive-modules.sgml probably should be updated to refer to >> setting the errdetail, especially since we document the errormessage there. > > Thanks for reviewing. How does this look? Looks good from a read-through, I like it. A few comments on the commit message only: actionable details about the source of the miconfiguration. This s/miconfiguration/misconfiguration/ Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera Alvaro's name seems wrong. -- Daniel Gustafsson
Re: [PATCH] Exponential backoff for auth_delay
On Mon, Mar 4, 2024 at 2:43 PM Michael Banck wrote:
> > 3) Do we actually need the exponential_backoff GUC? Wouldn't it be
> > sufficient to have the maximum value, and if it's -1 treat that as no
> > backoff?
>
> That is a good question, I guess that makes sense.
>
> The next question then is: what should the default for (now)
> auth_delay.max_milliseconds be in this case, -1? Or do we say that, as
> the default for auth_delay.milliseconds is 0 anyway, why would somebody
> not want exponential backoff when they switch it on and keep it at the
> current 10s/1ms?
>
> I have not changed that for now, pending further input.

I agree that two GUCs here seems to be one more than necessary, but I wonder whether we couldn't just say 0 means no exponential backoff and any other value is the maximum time.

The idea that 0 means unlimited doesn't seem useful in practice. There's always going to be some limit, at least by the number of bits we have in the data type that we're using to do the calculation. But that limit is basically never the right answer. I mean, I think 2^31 milliseconds is roughly 25 days, but it seems unlikely to me that delays measured in days are helpfully more secure than delays measured in minutes, and they don't seem very convenient for users either, and do you really want a failed connection to linger for days before failing? That seems like you're DOSing yourself.

If somebody wants to configure a very large value explicitly, cool, they can do as they like, but we don't need to complicate the interface to make it easier for them to do so.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan wrote:
>
> On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan wrote:
> > Attached is v11, which now says something like that in the commit
> > message.
>
> Attached is v12.

Some initial comments on the documentation:

> +that searches for multiple values together. Queries that use certain
> +SQL constructs to search for rows matching any value
> +out of a list (or an array) of multiple scalar values might perform
> +multiple primitive index scans (up to one primitive scan
> +per scalar value) at runtime. See <xref linkend="functions-comparisons"/>
> +for details.

I don't think the "see ... for details" is correctly worded: The surrounding text implies that it would contain details about in which precise situations multiple primitive index scans would be consumed, while it only contains documentation about IN/NOT IN/ANY/ALL/SOME. Something like the following would fit better IMO:

+that searches for multiple values together. Queries that use certain
+SQL constructs to search for rows matching any value
+out of a list or array of multiple scalar values (such as those described in
+<xref linkend="functions-comparisons"/>) might perform multiple primitive
+index scans (up to one primitive scan per scalar value) at runtime.

Then there is a second issue in the paragraph: Inverted indexes such as GIN might well decide to start counting more than one "primitive scan" per scalar value, because they may need to go through their internal structure more than once to produce results for a single scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a trigram index would likely use at least 4 descents here: one for each of "som", "ome", "wor", "ord".

> > All that really remains now is to research how we might integrate this
> > work with the recently added continuescanPrechecked/haveFirstMatch
> > stuff from Alexander Korotkov, if at all.
>
> The main change in v12 is that I've integrated both the
> continuescanPrechecked and the haveFirstMatch optimizations. Both of
> these fields are now page-level state, shared between the _bt_readpage
> caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
> appear next to the new home for _bt_checkkeys' continuescan field, in
> the new page state struct).

Cool. I'm planning to review the rest of this patch this week/tomorrow; could you take some time to review some of my btree patches, too?

Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Adding deprecation notices to pgcrypto documentation
> On 16 Nov 2023, at 21:49, Daniel Gustafsson wrote:
>
> In the "Allow tests to pass in OpenSSL FIPS mode" thread [0] it was discovered
> that 3DES is joining the ranks of NIST disallowed algorithms. The attached
> patch adds a small note to the pgcrypto documentation about deprecated uses of
> algorithms. I've kept it to "official" notices such as RFC's and NIST SP's.
> There might be more that deserve a notice, but this seemed like a good start.
>
> Any thoughts on whether this would be helpful?

Cleaning out my inbox I came across this, which I still think is worth doing; any objections to going ahead with it?

-- Daniel Gustafsson
RE: Popcount optimization using AVX512
Hi,

First, apologies about the patch. Find the updated version re-attached.

Now I have some questions.

#1

> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
> +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31)
> +<< 31))
>
> IME this means that the autoconf you are using has been patched. A quick
> search on the mailing lists seems to indicate that it might be specific to
> Debian [1].

I am not sure what the ask is here. I made changes to configure.ac and ran autoconf2.69 to get builds to succeed. Do you have separate feedback here?

#2

As for the refactoring, this was done to satisfy previous review feedback about applying the AVX512 CFLAGS to the entire pg_bitutils.c file, mainly to avoid segfaults due to the AVX512 flags. If it's OK, I would prefer to make a single commit as the change is pretty small and straightforward.

#3

I am not sure I understand the comment about the SIZEOF_VOID_P checks. Aren't they necessary to choose which functions to call based on 32- or 64-bit architectures?

#4

Would this change qualify for Workflow A as described in [0] and can it be picked up by a committer, given it has been reviewed by multiple committers so far? The scope of the change is pretty contained as well.

[0] https://wiki.postgresql.org/wiki/Submitting_a_Patch

Thanks, Paul

-Original Message- From: Nathan Bossart Sent: Friday, March 1, 2024 1:45 PM To: Amonson, Paul D Cc: Andres Freund ; Alvaro Herrera ; Shankaran, Akash ; Noah Misch ; Tom Lane ; Matthias van de Meent ; pgsql-hackers@lists.postgresql.org Subject: Re: Popcount optimization using AVX512

Thanks for the new version of the patch. I didn't see a commitfest entry for this one, and unfortunately I think it's too late to add it for the March commitfest. I would encourage you to add it to July's commitfest [0] so that we can get some routine cfbot coverage.

On Tue, Feb 27, 2024 at 08:46:06PM +, Amonson, Paul D wrote:
> After consulting some Intel internal experts on MSVC the linking issue
> as it stood was not resolved. Instead, I created an MSVC-only work-around.
> This adds one extra function call on the Windows builds (the linker
> resolves a real function just fine but not a function pointer of the
> same name). This extra latency does not exist on any of the other
> platforms. I also believe I addressed all issues raised in the
> previous reviews. The new pg_popcnt_x86_64_accel.c file is now the
> ONLY file compiled with the AVX512 compiler flags. I added support for
> the MSVC compiler flag as well. Both meson and autoconf are updated
> with the new refactor.
>
> I am attaching the new patch.

I think this patch might be missing the new files.

-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) +<< 31))

IME this means that the autoconf you are using has been patched. A quick search on the mailing lists seems to indicate that it might be specific to Debian [1].

-static int pg_popcount32_slow(uint32 word); -static int pg_popcount64_slow(uint64 word); +int pg_popcount32_slow(uint32 word); +int pg_popcount64_slow(uint64 word); +uint64 pg_popcount_slow(const char *buf, int bytes);

This patch appears to do a lot of refactoring. Would it be possible to break out the refactoring parts into a prerequisite patch that could be reviewed and committed independently from the AVX512 stuff?

-#if SIZEOF_VOID_P >= 8 +#if SIZEOF_VOID_P == 8 /* Process in 64-bit chunks if the buffer is aligned.
*/ - if (buf == (const char *) TYPEALIGN(8, buf)) + if (buf == (const char *)TYPEALIGN(8, buf)) { - const uint64 *words = (const uint64 *) buf; + const uint64 *words = (const uint64 *)buf; while (bytes >= 8) { @@ -309,9 +213,9 @@ pg_popcount(const char *buf, int bytes) bytes -= 8; } - buf = (const char *) words; + buf = (const char *)words; } -#else +#elif SIZEOF_VOID_P == 4 /* Process in 32-bit chunks if the buffer is aligned. */ if (buf == (const char *) TYPEALIGN(4, buf)) { Most, if not all, of these changes seem extraneous. Do we actually need to more strictly check SIZEOF_VOID_P? [0] https://commitfest.postgresql.org/48/ [1] https://postgr.es/m/20230211020042.uthdgj72kp3xlqam%40awork3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch Description: v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch
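For reference, the function-pointer dispatch under discussion follows the pattern pg_bitutils.c already uses for pg_popcount32/64, roughly (a sketch; avx512_popcount_available() stands in for the real CPUID/XGETBV probing):

    static uint64 pg_popcount_choose(const char *buf, int bytes);
    static uint64 (*pg_popcount_impl) (const char *buf, int bytes) = pg_popcount_choose;

    static uint64
    pg_popcount_choose(const char *buf, int bytes)
    {
        /* first call only: probe the CPU once, then redirect the pointer */
        if (avx512_popcount_available())
            pg_popcount_impl = pg_popcount_avx512;
        else
            pg_popcount_impl = pg_popcount_slow;

        return pg_popcount_impl(buf, bytes);
    }

The MSVC-only workaround mentioned above adds one more plain function call in front of this, since that linker reportedly resolves a real function but not a function pointer of the same name.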
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
On Mon, Mar 04, 2024 at 09:27:23PM +0100, Daniel Gustafsson wrote: > Looks good from a read-through, I like it. A few comments on the commit > message only: > > actionable details about the source of the miconfiguration. This > s/miconfiguration/misconfiguration/ I reworded the commit message a bit to avoid the word "misconfiguration," as it felt a bit misleading to me. In any case, this was fixed, albeit indirectly. > Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera > Alvaro's name seems wrong. Hm. It looks alright to me. I copied the name from his e-mail signature, which has an accent over the first 'A'. I assume that's why it's not showing up correctly in some places. Anyway, I've committed this now. Thanks for taking a look! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Exponential backoff for auth_delay
Hi, On Mon, Mar 04, 2024 at 03:50:07PM -0500, Robert Haas wrote: > I agree that two GUCs here seems to be one more than necessary, but I > wonder whether we couldn't just say 0 means no exponential backoff and > any other value is the maximum time. Alright, I have changed it so that auth_delay.milliseconds and auth_delay.max_milliseconds are the only GUCs, their default being 0. If the latter is 0, the former's value is always taken. If the latter is non-zero and larger than the former, exponential backoff is applied with the latter's value as maximum delay. If the latter is smaller than the former then auth_delay just sets the delay to the latter, I don't think this is problem or confusing, or should this be considered a misconfiguration? > The idea that 0 means unlimited doesn't seem useful in practice. Yeah, that was more how it was coded than a real policy decision, so let's do away with it. V5 attached. Michael >From 3563d77061480b7e022255b968a39086b0cc8814 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Wed, 27 Dec 2023 15:55:39 +0100 Subject: [PATCH v5] Add optional exponential backoff to auth_delay contrib module. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This adds the new auth_delay.max_milliseconds GUC. If set (its default is 0), auth_delay adds exponential backoff with this GUC's value as maximum delay. The exponential backoff is tracked per remote host and doubled for every failed login attempt (i.e., wrong password, not just missing pg_hba line or database) and reset to auth_delay.milliseconds after a successful authentication or when no authentication attempts have been made for 5*max_milliseconds from that host. Authors: Michael Banck, based on an earlier patch by 成之焕 Reviewed-by: Abhijit Menon-Sen, Tomas Vondra Discussion: https://postgr.es/m/ahwaxacqiwivoehs5yejpqog.1.1668569845751.hmail.zhch...@ceresdata.com --- contrib/auth_delay/auth_delay.c | 216 ++- doc/src/sgml/auth-delay.sgml | 31 - src/tools/pgindent/typedefs.list | 1 + 3 files changed, 244 insertions(+), 4 deletions(-) diff --git a/contrib/auth_delay/auth_delay.c b/contrib/auth_delay/auth_delay.c index ff0e1fd461..5fb123d133 100644 --- a/contrib/auth_delay/auth_delay.c +++ b/contrib/auth_delay/auth_delay.c @@ -14,24 +14,50 @@ #include #include "libpq/auth.h" +#include "miscadmin.h" #include "port.h" +#include "storage/dsm_registry.h" +#include "storage/ipc.h" +#include "storage/lwlock.h" +#include "storage/shmem.h" #include "utils/guc.h" #include "utils/timestamp.h" PG_MODULE_MAGIC; +#define MAX_CONN_RECORDS 100 + /* GUC Variables */ static int auth_delay_milliseconds = 0; +static int auth_delay_max_milliseconds = 0; /* Original Hook */ static ClientAuthentication_hook_type original_client_auth_hook = NULL; +typedef struct AuthConnRecord +{ + char remote_host[NI_MAXHOST]; + double sleep_time; /* in milliseconds */ + TimestampTz last_failed_auth; +} AuthConnRecord; + +static shmem_startup_hook_type shmem_startup_next = NULL; +static AuthConnRecord *acr_array = NULL; + +static AuthConnRecord *auth_delay_find_acr_for_host(char *remote_host); +static AuthConnRecord *auth_delay_find_free_acr(void); +static double auth_delay_increase_delay_after_failed_conn_auth(Port *port); +static void auth_delay_cleanup_conn_record(Port *port); +static void auth_delay_expire_conn_records(Port *port); + /* * Check authentication */ static void auth_delay_checks(Port *port, int status) { + double delay = auth_delay_milliseconds; + /* * Any other plugins which use 
ClientAuthentication_hook. */ @@ -39,20 +65,190 @@ auth_delay_checks(Port *port, int status) original_client_auth_hook(port, status); /* - * Inject a short delay if authentication failed. + * We handle both STATUS_ERROR and STATUS_OK - the third option + * (STATUS_EOF) is disregarded. + * + * In case of STATUS_ERROR we inject a short delay, optionally with + * exponential backoff. + */ + if (status == STATUS_ERROR) + { + if (auth_delay_max_milliseconds > 0) + { + /* + * Delay by 2^n seconds after each authentication failure from a + * particular host, where n is the number of consecutive + * authentication failures. + */ + delay = auth_delay_increase_delay_after_failed_conn_auth(port); + + /* + * Clamp delay to a maximum of auth_delay_max_milliseconds. + */ + delay = Min(delay, auth_delay_max_milliseconds); + } + + if (delay > 0) + { + elog(DEBUG1, "Authentication delayed for %g seconds due to auth_delay", delay / 1000.0); + pg_usleep(1000L * (long) delay); + } + + /* + * Expire delays from other hosts after auth_delay_max_milliseconds * + * 5. + */ + auth_delay_expire_conn_records(port); + } + + /* + * Remove host-specific delay if authentication succeeded. + */ + if (status == STATUS_OK) + auth_delay_cleanup_conn_record(port); +} + +static double +auth_delay_increase_delay_after_fai
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
Nathan Bossart writes: > On Mon, Mar 04, 2024 at 09:27:23PM +0100, Daniel Gustafsson wrote: >> Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera >> Alvaro's name seems wrong. > Hm. It looks alright to me. I copied the name from his e-mail signature, > which has an accent over the first 'A'. I assume that's why it's not > showing up correctly in some places. I think that git has an expectation of commit log entries being in UTF8. The committed message looks okay from my end, but maybe some encoding mangling happened to the version Daniel was looking at? regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Alvaro Herrera writes: > Pushed that way, but we can discuss further wording improvements/changes > if someone wants to propose any. I just noticed that drongo is complaining about two lines added by commit 53c2a97a9: drongo| 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int' drongo| 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int' These lines are Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY); Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY); These are comparing the address of something with an enum value, which surely cannot be sane. Is the "&" operator incorrect? It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) the numeric value of zero, so I guess the majority of our BF animals are understanding this as "address != NULL". But that doesn't look like a useful test to be making. regards, tom lane
Re: Popcount optimization using AVX512
(Please don't top-post on the Postgres lists.) On Mon, Mar 04, 2024 at 09:39:36PM +, Amonson, Paul D wrote: > First, apologies on the patch. Find re-attached updated version. Thanks for the new version of the patch. >> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) >> +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) >> +<< 31)) >> >> IME this means that the autoconf you are using has been patched. A >> quick search on the mailing lists seems to indicate that it might be >> specific to Debian [1]. > > I am not sure what the ask is here? I made changes to the configure.ac > and ran autoconf2.69 to get builds to succeed. Do you have a separate > feedback here? These LARGE_OFF_T changes are unrelated to the patch at hand and should be removed. This likely means that you are using a patched autoconf that is making these extra changes. > As for the refactoring, this was done to satisfy previous review feedback > about applying the AVX512 CFLAGS to the entire pg_bitutils.c file. Mainly > to avoid segfault due to the AVX512 flags. If its ok, I would prefer to > make a single commit as the change is pretty small and straight forward. Okay. The only reason I suggest this is to ease review. For example, if there is some required refactoring that doesn't involve any functionality changes, it can be advantageous to get that part reviewed and committed first so that reviewers can better focus on the code for the new feature. But, of course, that isn't necessary and/or isn't possible in all cases. > I am not sure I understand the comment about the SIZE_VOID_P checks. > Aren't they necessary to choose which functions to call based on 32 or 64 > bit architectures? Yes. My comment was that the patch appeared to make unnecessary changes to this code. Perhaps I am misunderstanding something here. > Would this change qualify for Workflow A as described in [0] and can be > picked up by a committer, given it has been reviewed by multiple > committers so far? The scope of the change is pretty contained as well. I think so. I would still encourage you to create an entry for this so that it is automatically tested via cfbot [0]. [0] http://commitfest.cputube.org/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Injection points: some tools to wait and wake
On Mon, Mar 04, 2024 at 10:51:41AM +0100, Jelte Fennema-Nio wrote:
> Yeah, it makes sense that you'd want to backport fixes/changes to
> this. As long as you put a disclaimer in the docs that you can do that
> for this module, I think it would be fine. Our tests fairly regularly
> break anyway when changing minor versions of postgres in our CI, e.g.
> due to improvements in the output of isolationtester. So if changes to
> this module require some changes that's fine by me. Seems much nicer
> than having to copy-paste the code.

In my experience, anybody who does serious testing with their product integrated with Postgres has two or three types of builds with their own scripts: one with assertions, -DG and other developer-oriented options enabled, and one for production deployments with more optimized options like -O2. Once there are custom scripts to build and package Postgres, do we really need to move that to contrib/ at all? make install would work for a test module as long as the command is run locally in its directory.

-- Michael signature.asc Description: PGP signature
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I wrote: > It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately) > the numeric value of zero, so I guess the majority of our BF > animals are understanding this as "address != NULL". But that > doesn't look like a useful test to be making. In hopes of noticing whether there are other similar thinkos, I permuted the order of the SlruPageStatus enum values, and now I get the expected warnings from gcc: In file included from ../../../../src/include/postgres.h:45, from slru.c:59: slru.c: In function ‘SimpleLruWaitIO’: slru.c:436:38: warning: comparison between pointer and integer Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY); ^~ ../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’ if (!(condition)) \ ^ slru.c: In function ‘SimpleLruWritePage’: slru.c:717:43: warning: comparison between pointer and integer Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY); ^~ ../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’ if (!(condition)) \ ^ So it looks like it's just these two places. regards, tom lane
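Presumably the intended assertions simply drop the &, i.e. (assuming the tests were meant to inspect the status values themselves):

    Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
    Assert(ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);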
Re: Adding deprecation notices to pgcrypto documentation
On Mon, Mar 04, 2024 at 10:03:13PM +0100, Daniel Gustafsson wrote: > Cleaning out my inbox I came across this which I still think is worth doing, > any objections to going ahead with it? I think the general idea is reasonable, but I have two small comments: * Should this be a "Warning" and/or moved to the top of the page? This seems like a relatively important notice that folks should see when beginning to use pgcrypto. * Should we actually document the exact list of algorithms along with detailed reasons? This list seems prone to becoming outdated. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
> I can't see any user validation in the function pg_newlocale_from_collation.

Matthias is right, look closer. I can see more than one check; especially note the one related to the collation version mismatch, which should not be silently ignored.

> meson test passes all checks.

Collations are harder to test because they depend on the environment where the test happens, especially with ICU.

-- Michael signature.asc Description: PGP signature
Re: a wrong index choose when statistics is out of date
On Tue, 5 Mar 2024 at 00:37, Andy Fan wrote:
>
> David Rowley writes:
> > I don't think it would be right to fudge the costs in any way, but I
> > think the risk factor for IndexPaths could take into account the
> > number of unmatched index clauses and increment the risk factor, or
> > "certainty_factor" as it is currently in my brain-based design. That
> > way add_path() would be more likely to prefer the index that matches
> > the most conditions.
>
> This is somewhat similar to my proposal at [1]? What do you think
> about treating 'col op const' as 'col op $1' for the marked column?
> This could just resolve a subset of the questions in your mind, but the
> method looks to have a solid reason.

Do you mean this?

> + /*
> + * To make the planner more robust to handle some inaccurate statistics
> + * issue, we will add a extra cost to qpquals so that the less qpquals
> + * the lower cost it has.
> + */
> + cpu_run_cost += 0.01 * list_length(qpquals);

I don't think it's a good idea to add cost penalties like you proposed there. This is what I meant by "I don't think it would be right to fudge the costs in any way".

If you modify the costs to add some small penalty so that the planner is more likely to favour some other plan, what happens if we then decide the other plan has some issue and we want to penalise that for some other reason? Adding the 2nd penalty might result in the original plan choice again. Which one should be penalised more? I think the uncertainty needs to be tracked separately.

Fudging the costs like this is also unlikely to play nicely with add_path's use of STD_FUZZ_FACTOR. There'd be an incentive to do things like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large enough penalty.

David

> [1] https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com
Re: Add new error_action COPY ON_ERROR "log"
On Mon, Mar 04, 2024 at 05:00:00AM +0530, Bharath Rupireddy wrote: > How about an extra option to error_action ignore-with-verbose which is > similar to ignore but when specified emits one NOTICE per malformed > row? With this, one can say COPY x FROM stdin (ON_ERROR > ignore-with-verbose);. > > Alternatively, we can think of adding a new option verbose altogether > which can be used for not only this but for reporting some other COPY > related info/errors etc. With this, one can say COPY x FROM stdin > (VERBOSE on, ON_ERROR ignore);. I would suggest a completely separate option, because that offers more flexibility as each option has a separate meaning. My main concern in using one option to control them all is that one may want in the future to be able to specify more combinations of actions at query level, especially if more modes are added to the ON_ERROR mode. One option would prevent that. Perhaps ERROR_VERBOSE or ERROR_VERBOSITY would be better names, but I'm never wedded to my naming suggestions. Bad history with the matter. > There's also another way of having a separate GUC, but -100 from me > for it. Because, it not only increases the total number of GUCs by 1, > but also might set a wrong precedent to have a new GUC for controlling > command level outputs. What does this have to do with GUCs? The ON_ERROR option does not have one. -- Michael signature.asc Description: PGP signature
Re: Avoid if possible an expensive function call (src/backend/utils/adt/varlena.c)
Michael Paquier writes:
> On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
>> I can't see any user validation in the function pg_newlocale_from_collation.

> Matthias is right, look closer. I can see more than one check,
> especially note the one related to the collation version mismatch that
> should not be silently ignored.

The fast path through that code doesn't include any checks, true, but the point is that finding the entry proves that we made those checks previously. I can't agree with making those semantics squishy in order to save a few cycles in the exact-equality case.

regards, tom lane
Re: [PATCH] updates to docs about HOT updates for BRIN
On Tue, 2024-02-27 at 09:48 -0500, Stephen Frost wrote:
> Attached is an updated patch which drops the 'such as' and adds a
> sentence mentioning that BRIN is the only in-core summarizing index.

The original patch reads more clearly to me. In v4, summarizing (the exception) feels like it's dominating the description.

Also, is it standard practice to backport this kind of doc update? I ordinarily wouldn't be inclined to do so, but v4 seems targeted at 16 as well.

Attached is my own suggested wording, which hopefully addresses Stephen and Alvaro's concerns. I agree that it's tricky to write, so I took a more minimalist approach:

* I got rid of the "In summary" sentence because (a) it's confusing now that we're talking about summarizing indexes; and (b) it's not summarizing anything, it's just redundant.

* I removed the mention of partial or expression indexes. It's a bit redundant and doesn't seem especially helpful in this context.

If this is agreeable I can commit it.

Regards, Jeff Davis

From d17ecaf1af52ba5b055c19b465d77cc53f825747 Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Mon, 26 Feb 2024 17:17:54 -0500 Subject: [PATCH v5] docs: Update HOT update docs for 19d8e2308b Commit 19d8e2308b changed when the HOT update optimization is possible but neglected to update the Heap-Only Tuples (HOT) documentation. This patch updates that documentation accordingly. Author: Elizabeth Christensen Reviewed-By: Stephen Frost, Alvaro Herrera Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com --- doc/src/sgml/storage.sgml | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index 652946db7d..2e1914b95b 100644 --- a/doc/src/sgml/storage.sgml +++ b/doc/src/sgml/storage.sgml @@ -1097,8 +1097,10 @@ data. Empty in ordinary tables. - The update does not modify any columns referenced by the table's - indexes, including expression and partial indexes. + The update does not modify any columns referenced by the table's indexes, + not including summarizing indexes. The only summarizing index method in + the core PostgreSQL distribution is BRIN. @@ -1114,7 +1116,8 @@ data. Empty in ordinary tables. - New index entries are not needed to represent updated rows. + New index entries are not needed to represent updated rows, however, + summary indexes may still need to be updated. @@ -1130,14 +1133,12 @@ data. Empty in ordinary tables. - In summary, heap-only tuple updates can only be created - if columns used by indexes are not updated. You can - increase the likelihood of sufficient page space for + You can increase the likelihood of sufficient page space for HOT updates by decreasing a table's fillfactor. - If you don't, HOT updates will still happen because - new rows will naturally migrate to new pages and existing pages with - sufficient free space for new row versions. The system view fillfactor. If you + don't, HOT updates will still happen because new rows + will naturally migrate to new pages and existing pages with sufficient free + space for new row versions. The system view pg_stat_all_tables allows monitoring of the occurrence of HOT and non-HOT updates. -- 2.34.1
Re: Synchronizing slots from primary to standby
Here are some review comments for v105-0001

==
doc/src/sgml/config.sgml

1.
+
+        The standbys corresponding to the physical replication slots in
+        standby_slot_names must configure
+        sync_replication_slots = true so they can receive
+        logical failover slots changes from the primary.
+

/slots changes/slot changes/

==
doc/src/sgml/func.sgml

2.
+        The function may be waiting if the specified slot is a logical failover
+        slot and standby_slot_names
+        is configured.

I know this has been through multiple versions already, but this latest wording "may be waiting..." doesn't seem very good to me. How about one of these?

* The function may not be able to return immediately if the specified slot is a logical failover slot and standby_slot_names is configured.

* The function return might be blocked if the specified slot is a logical failover slot and standby_slot_names is configured.

* If the specified slot is a logical failover slot then the function will block until all physical slots specified in standby_slot_names have confirmed WAL receipt.

* If the specified slot is a logical failover slot then the function will not return until all physical slots specified in standby_slot_names have confirmed WAL receipt.

~~~

3.
+        slot may return to an earlier position. The function may be waiting if
+        the specified slot is a logical failover slot and
+        standby_slot_names

Same as previous review comment #2.

==
src/backend/replication/slot.c

4. WaitForStandbyConfirmation

+ * Used by logical decoding SQL functions that acquired logical failover slot.

IIUC it doesn't work like that. pg_logical_slot_get_changes_guts() calls here unconditionally (i.e. the SQL functions don't even check if they are failover slots before calling this), so the comment seems misleading/redundant.

==
src/backend/replication/walsender.c

5. NeedToWaitForWal

+	/*
+	 * Check if the standby slots have caught up to the flushed position. It
+	 * is good to wait up to the flushed position and then let the WalSender
+	 * send the changes to logical subscribers one by one which are already
+	 * covered by the flushed position without needing to wait on every change
+	 * for standby confirmation.
+	 */
+	if (NeedToWaitForStandbys(flushed_lsn, wait_event))
+		return true;
+
+	*wait_event = 0;
+	return false;
+}
+

5a.
The comment (or part of it?) seems misplaced, because it is talking about the WalSender sending changes, but that is not happening in this function. Also, isn't what this is saying already described by the other comment in the caller? e.g.:

+		/*
+		 * Within the loop, we wait for the necessary WALs to be flushed to
+		 * disk first, followed by waiting for standbys to catch up if there
+		 * are enough WALs or upon receiving the shutdown signal. To avoid the
+		 * scenario where standbys need to catch up to a newer WAL location in
+		 * each iteration, we update our idea of the currently flushed
+		 * position only if we are not waiting for standbys to catch up.
+		 */

~

5b.
Most of the code is unnecessary. AFAICT all this is exactly the same as just one line:

return NeedToWaitForStandbys(flushed_lsn, wait_event);

~~~

6. WalSndWaitForWal

+		/*
+		 * Within the loop, we wait for the necessary WALs to be flushed to
+		 * disk first, followed by waiting for standbys to catch up if there
+		 * are enough WALs or upon receiving the shutdown signal. To avoid the
+		 * scenario where standbys need to catch up to a newer WAL location in
+		 * each iteration, we update our idea of the currently flushed
+		 * position only if we are not waiting for standbys to catch up.
+		 */

Regarding that 1st sentence: maybe this logic used to be done explicitly "within the loop", but IIUC this logic is now hidden inside NeedToWaitForWal(), so the comment should mention that.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
> (see the attached).

I have looked at what you have here.

First, in a build where 818fefd8fd is included, this makes the test script a lot slower. Most of the logic is quick, but we're spending 10s or so checking that catalog_xmin has advanced. Could it be possible to make that faster?

A second issue is the failure mode when 818fefd8fd is reverted. The test is getting stuck when we are waiting on the standby to catch up, until a timeout decides to kick in to fail the test, and all the previous tests pass. Could it be possible to make that more responsive? I assume that in the failure mode we would get an incorrect conflict_reason for injection_inactiveslot, succeeding in checking the failure.

+my $terminated = 0;
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	if ($node_standby->log_contains(
+		'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
+	{
+		$terminated = 1;
+		last;
+	}
+	usleep(100_000);
+}
+ok($terminated, 'terminating process holding the active slot is logged with injection point');

The LOG exists when we are sure that the startup process is waiting in the injection point, so this loop could be replaced with something like:

+ $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
+ ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)

Nit: the name of the injection point should be terminate-process-holding-slot rather than TerminateProcessHoldingSlot, to be consistent with the other ones.
--
Michael
Re: Preserve subscription OIDs during pg_upgrade
On Mon, Feb 26, 2024 at 09:51:40AM +0530, Robert Haas wrote: > > I am not sure that it is a good idea to relax that for PG17 at this > > stage of the development cycle, though, as we have already done a lot > > in this area for pg_upgrade and it may require more tweaks during the > > beta period depending on the feedback received, so I would suggest to > > do more improvements for the 18 cycle instead once we have a cleaner > > picture of the whole. > > That's fair. > > I want to say that, unlike Tom, I'm basically in favor of preserving > OIDs in more places across updates. It seems to have little downside > and improve the understandability of the outcome. But that's separate > from whether it is a good idea to build on that infrastructure in any > particular way in the time we have left for this release. Yes, the _minimal_ approach has changed in the past few years to make pg_upgrade debugging easier. The original design was ultra-conservative where it could be, considering how radical the core functionality was. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 4, 2024 at 8:48 PM John Naylor wrote:
>
> On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada wrote:
> >
> > On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote:
> > > Right, I should have said "reset". Resetting a context will delete
> > > its children as well, and seems like it should work to reset the tree
> > > context, and we don't have to know whether that context actually
> > > contains leaves at all. That should allow copying "tree context" to
> > > "leaf context" in the case where we have no special context for
> > > leaves.
> >
> > Resetting the tree->context seems to work. But I think we should note
> > for callers that the dsa_area passed to RT_CREATE should be created in
> > a different context than the context passed to RT_CREATE, because
> > otherwise RT_FREE() will also free the dsa_area. For example, the
> > following code in test_radixtree.c will no longer work:
> >
> > dsa = dsa_create(tranche_id);
> > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> > :
> > rt_free(radixtree);
> > dsa_detach(dsa); // dsa is already freed.
> >
> > So I think that a practical usage of the radix tree will be that the
> > caller creates a memory context for a radix tree and passes it to
> > RT_CREATE().
>
> That sounds workable to me.
>
> > I've attached an updated patch set:
> >
> > - 0008 updates RT_FREE_RECURSE().
>
> Thanks!
>
> > - 0009 patch is an updated version of cleanup radix tree memory handling.
>
> Looks pretty good, as does the rest. I'm going through again,
> squashing and making tiny adjustments to the template. The only thing
> not done is changing the test with many values to resemble the perf
> test more.
>
> I wrote:
> > > Secondly, I thought about my recent work to skip checking if we first
> > > need to create a root node, and that has a harmless (for vacuum at
> > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > the key is bigger than 255, new nodes will go on top of the root node.
> > > These have chunk '0'. If all subsequent keys are big enough, the
> > > original root node will stay empty. If all keys are deleted, there will
> > > be a chain of empty nodes remaining. Again, I believe this is
> > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP
> > > to call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > on this, but likely not today.
> >
> > This turns out to be a lot trickier than it looked, so it seems best
> > to allow a trivial amount of waste, as long as it's documented
> > somewhere. It also wouldn't be terrible to re-add those branches,
> > since they're highly predictable.
>
> I put a little more work into this, and got it working, just needs a
> small amount of finicky coding. I'll share tomorrow.
>
> I have a question about RT_FREE_RECURSE:
>
> + check_stack_depth();
> + CHECK_FOR_INTERRUPTS();
>
> I'm not sure why these are here: The first seems overly paranoid,
> although harmless, but the second is probably a bad idea. Why should
> the user be able to interrupt the freeing of memory?

Good catch. We should not check for interrupts there.

> Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> tree, leading to coding like "iter->tree->ctl->root". I *think* it
> would be easier to read if the tree was a parameter to these iteration
> functions. That would require an API change, so the tests/tidstore
> would have some churn. I can do that, but before trying I wanted to
> see what you think -- is there some reason to keep the current way?

I considered both usages; there are two reasons for the current style. I'm concerned that if we pass both the tree and RT_ITER to the iteration functions, the caller could mistakenly pass a different tree than the one that was specified to create the RT_ITER. And the second reason is just to make it consistent with other data structures such as dynahash.c and dshash.c, but I now realize that in simplehash.h we pass both the hash table and the iterator.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
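[Editor's note: rough fragments to visualize the two API shapes being compared; the actual RT_* template signatures differ and none of this is taken from the patch.]

    /* Current style: the iterator carries its tree pointer internally. */
    RT_ITER    *iter = RT_BEGIN_ITERATE(tree);
    uint64      key;
    RT_VALUE_TYPE *value;

    while ((value = RT_ITERATE_NEXT(iter, &key)) != NULL)
        do_something(key, value);   /* do_something is a placeholder */
    RT_END_ITERATE(iter);

    /*
     * Alternative (like simplehash.h): the tree is passed to each call, so
     * "iter->tree" goes away, but a caller could pass a mismatched tree.
     */
    RT_ITER     iter2;

    RT_BEGIN_ITERATE(tree, &iter2);
    while ((value = RT_ITERATE_NEXT(tree, &iter2, &key)) != NULL)
        do_something(key, value);

The trade-off is exactly the one stated above: the first form cannot be misused with the wrong tree, while the second reads more directly and avoids the extra pointer hop inside the iterator.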
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro,

Thanks for giving comments!

> > I agreed it sounds good, but I don't think it can be implemented by
> > the current interface. An interface for dynamically allocating memory is
> > GetNamedDSMSegment(), and it returns the same shared memory region if
> > the input names are the same. Therefore, there is no way to re-alloc the
> > shared memory.

> Yeah, I was imagining something like this: the workitem-array becomes a
> struct, which has a name and a "next" pointer and a variable number of
> workitem slots; the AutoVacuumShmem struct has a pointer to the first
> workitem-struct and the last one; when a workitem is requested by
> brininsert, we initially allocate via GetNamedDSMSegment("workitem-0") a
> workitem-struct with a smallish number of elements; if we request
> another workitem and the array is full, we allocate another array via
> GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0
> (so that the list can be followed by an autovacuum worker that's
> processing the database), and it's also set as the tail of the list in
> AutoVacuumShmem (so that we know where to store further work items).
> When all items in a workitem-struct are processed, we can free it
> (I guess via dsm_unpin_segment), and make AutoVacuumShmem->av_workitems
> point to the next one in the list.
>
> This way, the "array" can grow arbitrarily.

Basically sounds good. My concerns are:

* GetNamedDSMSegment() does not return a raw pointer to the dsm_segment. This means that it may be difficult to do dsm_unpin_segment on the caller side.

* Dynamic shared memory is recorded in a dshash table (dsm_registry_table) and the entry won't be deleted, so a stale reference to the chunk might remain.

Overall, the dsm_registry may also need to be extended. I have not started working on this yet, so I will share results after trying it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
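[Editor's note: a rough C sketch of the chained-chunk layout discussed above. All names and sizes are invented, not code from any patch; the link is stored as the next segment's registry name rather than a raw pointer, on the assumption that GetNamedDSMSegment() may map a segment at different addresses in different backends.]

    /* Hypothetical chunk layout for a chained work-item list. */
    #define WORKITEMS_PER_CHUNK 256

    typedef struct AutoVacuumWorkItemChunk
    {
        char        next_name[64];  /* registry name of the next chunk, or ""
                                     * if this chunk is the tail of the list */
        int         num_used;       /* number of slots used in this chunk */
        AutoVacuumWorkItem items[WORKITEMS_PER_CHUNK];
    } AutoVacuumWorkItemChunk;

    /*
     * Sketch of growing the list when the tail chunk is full; the DSM
     * registry creates the segment on first use of a new name.
     */
    static AutoVacuumWorkItemChunk *
    workitem_chunk_extend(AutoVacuumWorkItemChunk *tail, int next_no)
    {
        bool        found;

        snprintf(tail->next_name, sizeof(tail->next_name),
                 "autovacuum-workitem-%d", next_no);
        return GetNamedDSMSegment(tail->next_name,
                                  sizeof(AutoVacuumWorkItemChunk),
                                  NULL, &found);
    }

As the mail points out, the hard part is the release path: the registry keeps its entry (and the segment pinned) after GetNamedDSMSegment(), so actually freeing a drained chunk would require extending the registry itself.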
Re: Switching XLog source from archive to streaming when primary available
cfbot claims that this one needs another rebase.

I've spent some time thinking about this one. I'll admit I'm a bit worried about adding more complexity to this state machine, but I also haven't thought of any other viable approaches, and this still seems like a useful feature. So, for now, I think we should continue with the current approach.

+        fails to switch to stream mode, it falls back to archive mode. If this
+        parameter value is specified without units, it is taken as
+        milliseconds. Default is 5min. With a lower value

Does this really need to be milliseconds? I would think that any reasonable setting would be at least on the order of seconds.

+        attempts. To avoid this, it is recommended to set a reasonable value.

I think we might want to suggest what a "reasonable value" is.

+	static bool canSwitchSource = false;
+	bool		switchSource = false;

IIUC "canSwitchSource" indicates that we are trying to force a switch to streaming, but we are currently exhausting anything that's present in the pg_wal directory, while "switchSource" indicates that we should force a switch to streaming right now. Furthermore, "canSwitchSource" is static while "switchSource" is not. Is there any way to simplify this? For example, would it be possible to make an enum that tracks the streaming_replication_retry_interval state?

 	/*
 	 * Don't allow any retry loops to occur during nonblocking
-	 * readahead. Let the caller process everything that has been
-	 * decoded already first.
+	 * readahead if we failed to read from the current source. Let the
+	 * caller process everything that has been decoded already first.
 	 */
-	if (nonblocking)
+	if (nonblocking && lastSourceFailed)
 		return XLREAD_WOULDBLOCK;

Why do we skip this when "switchSource" is set?

+	/* Reset the WAL source switch state */
+	if (switchSource)
+	{
+		Assert(canSwitchSource);
+		Assert(currentSource == XLOG_FROM_STREAM);
+		Assert(oldSource == XLOG_FROM_ARCHIVE);
+		switchSource = false;
+		canSwitchSource = false;
+	}

How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE? Is there no way it could be XLOG_FROM_PG_WAL?

+#streaming_replication_retry_interval = 5min	# time after which standby
+				# attempts to switch WAL source from archive to
+				# streaming replication
+				# in milliseconds; 0 disables

I think we might want to turn this feature off by default, at least for the first release.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
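[Editor's note: one possible shape for the enum suggested above, as a minimal sketch; the names are invented here, not taken from the patch.]

    /*
     * Hypothetical single state variable replacing the two booleans.
     */
    typedef enum WALSourceSwitchState
    {
        SWITCH_SOURCE_NONE,     /* retry interval has not elapsed */
        SWITCH_SOURCE_PENDING,  /* interval elapsed; still draining WAL
                                 * already available from pg_wal */
        SWITCH_SOURCE_NOW       /* ready to force the switch to streaming */
    } WALSourceSwitchState;

    static WALSourceSwitchState switchState = SWITCH_SOURCE_NONE;

A single variable would make the invalid combination (switchSource set while canSwitchSource is not) unrepresentable, so the Assert(canSwitchSource) in the reset block above could go away.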
Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
On Mon, Oct 23, 2023 at 03:25:42PM -0500, Nathan Bossart wrote: > rebased I saw that this thread was referenced elsewhere [0], so I figured I'd take a fresh look. From a quick glance, I'd say 0001 and 0002 are pretty reasonable and could probably be committed for v17. 0003 probably requires some more attention. If there is indeed interest in these changes, I'll try to spend some more time on it. [0] https://postgr.es/m/E0D2F0CE-D27C-49B1-902B-AD8D2427F07E%40yandex-team.ru -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
Hi,

Here are some review comments for v7-0001

1.
/*
 * binaryheap_free
 *
 * Releases memory used by the given binaryheap.
 */
void
binaryheap_free(binaryheap *heap)
{
	pfree(heap);
}

Shouldn't the above function (not modified by the patch) also first free the memory allocated for heap->bh_nodes?

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "make sure" of anything, because the caller checks whether we need more space. All that happens here is allocating more space. Maybe this function comment should say something like "Double the space allocated for nodes."

--
Kind Regards,
Peter Smith.
Fujitsu Australia
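[Editor's note: for concreteness, the fix implied by comment #1 would presumably be just the following — an untested sketch, assuming bh_nodes is a separately palloc'd array as in this patch (repalloc'd by bh_enlarge_node_array above).]

    void
    binaryheap_free(binaryheap *heap)
    {
        /* free the separately allocated node array before the struct itself */
        pfree(heap->bh_nodes);
        pfree(heap);
    }

Without the extra pfree, the node array would leak whenever the heap's containing memory context outlives the heap itself.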
Re: Support "Right Semi Join" plan shapes
On Mon, Mar 4, 2024 at 10:33 AM wenhui qiu wrote:
> Hi Richard,
> Now that the last commitfest for v17 is starting, can you respond to
> Alena Rybakina's points?

Thanks for the reminder. Will do that soon.

Thanks
Richard
Re: Support "Right Semi Join" plan shapes
On Tue, Jan 30, 2024 at 2:51 PM Alena Rybakina wrote:
> I have reviewed your patch and I think it is better to add an Assert for
> JOIN_RIGHT_SEMI to the ExecMergeJoin and ExecNestLoop functions to
> prevent the use of RIGHT_SEMI for these types of joins (NestedLoop
> and MergeJoin).

Hmm, I don't see why this is necessary. The planner should already guarantee that we won't have nestloops/mergejoins with right-semi joins.

Thanks
Richard
Re: PostgreSQL Contributors Updates
On Mon, 4 Mar 2024 at 17:43, Aleksander Alekseev wrote: > > > > All, > > > > > > The PostgreSQL Contributor Page > > > (https://www.postgresql.org/community/contributors/) includes people who > > > have made substantial, long-term contributions of time and effort to the > > > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > > > following people for their contributions. > > > > > > New PostgreSQL Contributors: > > > > > > * Bertrand Drouvot > > > * Gabriele Bartolini > > > * Richard Guo > > > > > > New PostgreSQL Major Contributors: > > > > > > * Alexander Lakhin > > > * Daniel Gustafsson > > > * Dean Rasheed > > > * John Naylor > > > * Melanie Plageman > > > * Nathan Bossart > > > > > > Thank you and congratulations to all! > > > > +1. Congratulations to all! > > Congratulations to all! Congratulations to all!
Re: Shared detoast Datum proposal
>
>> 2. more likely to use up all the memory which is allowed. for example:
>> if we set the limit to 1MB, then we keep more data which will not be
>> used and then consume all of the 1MB.
>>
>> My method resolves this with some help from other modules (kind of
>> invasive) but can control the eviction well and use as little memory
>> as possible.
>>
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us in different directions, so I want to explain more about this, to see if we can get some agreement here.

It is a bit hard to decide which memory limit to set: 1MB, 10MB, 40MB, or 100MB. In my current case it is 40MB at least. Too low a memory limit makes the cache ineffective, while too high a limit causes a potential memory use-up issue in the TOAST cache design. But in my method, even if we set a higher value, memory use is limited to what the query really (nearly) needs, and it would not cache more values until the limit is hit. This would make a noticeable difference when we want to set a high limit and we have some highly active sessions, like 100 * 40MB = 4GB.

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>> I assumed that releasing all of the memory at the end of the executor
>>>> at once is not an option since it may consume too much memory. Then,
>>>> when and which entry to release becomes a trouble for me. For example:
>>>>
>>>> QUERY PLAN
>>>> ------------------------------
>>>>  Nested Loop
>>>>    Join Filter: (t1.a = t2.a)
>>>>    ->  Seq Scan on t1
>>>>    ->  Seq Scan on t2
>>>> (4 rows)
>>>>
>>>> In this case t1.a needs a longer lifespan than t2.a since it is in the
>>>> outer relation. Without the help from the slot's life-cycle system, I
>>>> can't think of an answer to the above question.
>>>
>>> This is true, but how likely are such plans? I mean, surely no one would
>>> do a nested loop with sequential scans on reasonably large tables, so how
>>> representative is this example?
>>
>> Actually this is the simplest join case; we still have the same problem
>> with Nested Loop + Index Scan, which will be pretty common.
>>
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the cache. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example, a couple of paragraphs down you suggest enforcing the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I follow here, but I want to explain more about the disable-detoast-sharing logic when the memory limit is hit. When this happens, the detoast sharing is disabled, but since the detoasted datum will be released very soon, when the slot->tts_values[*] is discarded, the 'disable' turns back to 'enable' quickly. So it is not true that once it gets disabled, it can't be enabled any more for the given query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.
>
> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but in the current issue we can get a better estimation with the help of the plan shape, and Memoize depends on some planner information as well. If we bypass the planner information and try to resolve it at the cache level, the code may become too complex as well, and all of that cost is run-time overhead, while the other way is planning-time overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lesson learnt here, thanks for sharing!

As for the current use case, what I want to highlight is that we are "caching" "user data" "locally". USER DATA indicates it might be very huge; it is not common to have 1M tables, but it is much more common to have 1M tuples in one scan, so keeping the headers might add extra memory usage as well, like 10M * 24 bytes = 240MB. LOCALLY means it is not friendly to multiple active sessions. CACHE indicates it is hard to evict correctly. My method also has the USER DATA and LOCALLY attributes, but it would be better at eviction, and eviction then leads to the memory usage issue which is discr