Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-10-03 Thread Michael Paquier
On Mon, Oct 02, 2023 at 10:22:06AM +0200, Jakub Wartak wrote:
> I've attached v4 that covers your suggestions.

Hmm.  I was looking at all that and pondered quite a bit about the
addition of the restriction when starting up the server, particularly
why there would be any need to include it in the same commit as the
one fixing the arguments given to AllocHuge().  At the end, as the
restriction goes a bit against 8d0ddccec636, I have removed it and
gone with the simplest solution of adding casts to Size, which is
the same thing we do in all the other callers that deal with signed
variables (like mbutils.c).
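
For readers following along, a minimal sketch of the overflow pattern in
question (the variable names are hypothetical, not the actual pgstat code):

    #include <stddef.h>

    typedef size_t Size;        /* as in c.h */

    /*
     * With two int operands the multiplication is evaluated in int and
     * can overflow before the result is widened.  Casting one operand to
     * Size first makes the whole expression compute in Size.
     */
    static Size
    alloc_size(int nslots, int slotsize)
    {
        /* return nslots * slotsize;   -- overflows int for large inputs */
        return (Size) nslots * slotsize;    /* widen first, then multiply */
    }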

Regarding any restrictions, perhaps we should improve the docs, at
least.  It would be better to discuss that separately.
--
Michael


signature.asc
Description: PGP signature


Re: pg16: XX000: could not find pathkey item to sort

2023-10-03 Thread David Rowley
On Tue, 3 Oct 2023 at 09:11, David Rowley  wrote:
> I'm concerned that this patch will be too much overhead when creating
> paths when a PathKey's EquivalenceClass has a large number of members
> from partitioned tables.

I just tried out the patch to see how much it affects the performance
of the planner.  I think we need to find a better way to strip off the
pathkeys for the columns that have been aggregated.

Setup:
create table lp (a int, b int) partition by list (a);
select 'create table lp'||x||' partition of lp for values in('||x||')'
from generate_series(0,999)x;
\gexec

\pset pager off
set enable_partitionwise_aggregate=1;

Benchmark query:
explain (summary on) select a,count(*) from lp group by a;

Master:
Planning Time: 23.945 ms
Planning Time: 23.887 ms
Planning Time: 23.927 ms

perf top:
   7.39%  libc.so.6 [.] __memmove_avx_unaligned_erms
   6.98%  [kernel]  [k] clear_page_rep
   5.69%  postgres  [.] bms_is_subset
   5.07%  postgres  [.] fetch_upper_rel
   4.41%  postgres  [.] bms_equal

Patched:
Planning Time: 41.410 ms
Planning Time: 41.474 ms
Planning Time: 41.488 ms

perf top:
 19.02%  postgres  [.] bms_is_subset
   6.91%  postgres  [.] find_ec_member_matching_expr
   5.93%  libc.so.6 [.] __memmove_avx_unaligned_erms
   5.55%  [kernel]  [k] clear_page_rep
   4.07%  postgres  [.] fetch_upper_rel
   3.46%  postgres  [.] bms_equal

> I wondered if we should instead just check
> if the subpath's pathkeys match root->group_pathkeys and if they do
> set the AggPath's pathkeys to list_copy_head(subpath->pathkeys,
> root->num_groupby_pathkeys),  that'll be much cheaper, but it just
> feels a bit too much like a special case.

I tried this approach (patch attached) and it does perform better than
the other patch:

create_agg_path_fix2.patch:
Planning Time: 24.357 ms
Planning Time: 24.293 ms
Planning Time: 24.259 ms

   7.45%  libc.so.6 [.] __memmove_avx_unaligned_erms
   6.90%  [kernel]  [k] clear_page_rep
   5.56%  postgres  [.] bms_is_subset
   5.38%  postgres  [.] bms_equal

I wonder if the attached patch is too much of a special case fix.  I
guess from the lack of complaints previously that there are no other
cases where we could possibly have pathkeys that belong to columns
that are aggregated.  I've not gone to much effort to see if I can
craft a case that hits this without the ORDER BY/DISTINCT aggregate
optimisation, however.
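
For reference, a rough sketch of the shape of that approach in
create_agg_path() (an approximation for illustration, not necessarily the
attached patch verbatim):

    /*
     * If the input is sorted by the GROUP BY pathkeys, keep only the
     * leading pathkeys that correspond to the grouping columns; anything
     * beyond those refers to columns that are aggregated away.
     */
    if (pathkeys_contained_in(root->group_pathkeys, subpath->pathkeys))
        pathnode->path.pathkeys = list_copy_head(subpath->pathkeys,
                                                 root->num_groupby_pathkeys);
    else
        pathnode->path.pathkeys = NIL;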

David


create_agg_path_fix2.patch
Description: Binary data


Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-10-03 Thread Michael Paquier
On Tue, Sep 26, 2023 at 03:48:07PM +0900, Michael Paquier wrote:
> By the way, anything that I am proposing here cannot be backpatched
> because of the infrastructure changes required in walreader.c, so I am
> going to create a second thread with something that could be
> backpatched (yeah, likely FATALs on OOM to stop recovery from doing
> something bad).

The patch set is rebased as a consequence of 6b18b3fe2c2f, which switched
OOMs to fail hard in xlogreader.c.  The patch set has nothing
new, except that 0001 is now a revert of 6b18b3fe2c2f to switch
xlogreader.c back to using soft errors on OOMs.

If there's no interest in this patch set after the next CF, I'm OK to
drop it.  The state of HEAD is at least correct in the OOM cases now.
--
Michael
From aa5377d221371f6be8729a27f1df18aa9c4a48e2 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 3 Oct 2023 16:12:14 +0900
Subject: [PATCH v5 1/4] Revert "Fail hard on out-of-memory failures in
 xlogreader.c"

This reverts commit 6b18b3fe2c2, putting back the code of xlogreader.c
to handle OOMs as soft failures.
---
 src/backend/access/transam/xlogreader.c | 47 -
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a1363e3b8f..a17263df20 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -43,7 +43,7 @@
 
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
-static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
+static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
@@ -155,7 +155,14 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	 * Allocate an initial readRecordBuf of minimal size, which can later be
 	 * enlarged if necessary.
 	 */
-	allocate_recordbuf(state, 0);
+	if (!allocate_recordbuf(state, 0))
+	{
+		pfree(state->errormsg_buf);
+		pfree(state->readBuf);
+		pfree(state);
+		return NULL;
+	}
+
 	return state;
 }
 
@@ -177,6 +184,7 @@ XLogReaderFree(XLogReaderState *state)
 
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
+ * Returns true if successful, false if out of memory.
  *
  * readRecordBufSize is set to the new buffer size.
  *
@@ -188,7 +196,7 @@ XLogReaderFree(XLogReaderState *state)
  * Note: This routine should *never* be called for xl_tot_len until the header
  * of the record has been fully validated.
  */
-static void
+static bool
 allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 {
 	uint32		newSize = reclength;
@@ -198,8 +206,15 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
-	state->readRecordBuf = (char *) palloc(newSize);
+	state->readRecordBuf =
+		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+	if (state->readRecordBuf == NULL)
+	{
+		state->readRecordBufSize = 0;
+		return false;
+	}
 	state->readRecordBufSize = newSize;
+	return true;
 }
 
 /*
@@ -490,7 +505,9 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
 	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
 	if (allow_oversized)
 	{
-		decoded = palloc(required_space);
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
 		decoded->oversized = true;
 		return decoded;
 	}
@@ -798,7 +815,13 @@ restart:
 Assert(gotlen <= lengthof(save_copy));
 Assert(gotlen <= state->readRecordBufSize);
 memcpy(save_copy, state->readRecordBuf, gotlen);
-allocate_recordbuf(state, total_len);
+if (!allocate_recordbuf(state, total_len))
+{
+	/* We treat this as a "bogus data" condition */
+	report_invalid_record(state, "record length %u at %X/%X too long",
+		  total_len, LSN_FORMAT_ARGS(RecPtr));
+	goto err;
+}
 memcpy(state->readRecordBuf, save_copy, gotlen);
 buffer = state->readRecordBuf + gotlen;
 			}
@@ -854,8 +877,16 @@ restart:
 		decoded = XLogReadRecordAlloc(state,
 	  total_len,
 	  true /* allow_oversized */ );
-		/* allocation should always happen under allow_oversized */
-		Assert(decoded != NULL);
+		if (decoded == NULL)
+		{
+			/*
+			 * We failed to allocate memory for an oversized record.  As
+			 * above, we currently treat this as a "bogus data" condition.
+			 */
+			report_invalid_record(state,
+  "out of memory while trying to decode a record of length %u", total_len);
+			goto err;
+		}
 	}
 
 	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
-- 
2.42.0

From 970f1374272c058be51eedf77585d98b925b25c0 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 26 Sep 2023 15:40:05 +0900
Subject: [PAT

Re: Trigger violates foreign key constraint

2023-10-03 Thread Laurenz Albe
On Mon, 2023-10-02 at 09:49 -0400, Tom Lane wrote:
> This is by design: triggers operate at a lower level than
> foreign keys, so an ill-conceived trigger can break an FK constraint.
> That's documented somewhere, though maybe not visibly enough.

Not having found any documentation, I propose the attached caution.
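
To illustrate the hazard the proposed caution describes, here is a minimal
example (hypothetical table and function names); a BEFORE DELETE trigger
that returns NULL silently cancels the cascaded delete:

    CREATE TABLE parent (id int PRIMARY KEY);
    CREATE TABLE child  (id int REFERENCES parent ON DELETE CASCADE);

    CREATE FUNCTION keep_row() RETURNS trigger LANGUAGE plpgsql AS
    $$BEGIN RETURN NULL; END;$$;

    CREATE TRIGGER child_keep BEFORE DELETE ON child
        FOR EACH ROW EXECUTE FUNCTION keep_row();

    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (1);
    DELETE FROM parent WHERE id = 1;   -- succeeds
    SELECT * FROM child;               -- still returns 1: the FK is now violated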

Yours,
Laurenz Albe
From b6abd7dfdf1e25515ead092489efde0d239f7053 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Tue, 3 Oct 2023 09:20:54 +0200
Subject: [PATCH] Document foreign key internals

Warn the user that foreign keys are implemented as triggers, and
that user-defined triggers can interact with them and break
referential integrity.

Discussion: https://postgr.es/m/b81fe38fcc25a81be6e2e5b3fc1ff624130762fa.camel%40cybertec.at
---
 doc/src/sgml/ddl.sgml | 12 
 1 file changed, 12 insertions(+)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 075ff32991..8016b9fd81 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1224,6 +1224,18 @@ CREATE TABLE posts (
 syntax in the reference documentation for
 .

+
+   
+
+ Foreign key constraints are implemented as system triggers in PostgreSQL.
+ As such, they are subject to the trigger firing rules described in
+ .  In particular, other triggers
+ defined on the referencing table can cancel or modify the effects of
+ cascading delete or update, thereby breaking referential integrity.
+ This is not considered a bug, and it is the responsibility of the user
+ to write triggers so that such problems are avoided.
+
+   
   
 
   
-- 
2.41.0



Re: [PGDOCS] Inconsistent linkends to "monitoring" views.

2023-10-03 Thread Michael Paquier
On Tue, Oct 03, 2023 at 01:11:15PM +1100, Peter Smith wrote:
> I noticed one or two "monitoring" links and linkends that are slightly
> inconsistent with all the others.

-   
+   

Is that really worth bothering with for the internal link references?  This
can create extra backpatching conflicts.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Fix memory leak in memoize for numeric key

2023-10-03 Thread David Rowley
On Tue, 3 Oct 2023 at 19:38, Orlov Aleksej  wrote:
> I found a query which consumes a lot of memory and triggers OOM killer.
> Memory leak occurs in memoize node for numeric key.

Thanks for the analysis and the patch.

> I've attached memoize_memory_leak_numeric_key.patch to address this.

Yeah, this is a bug for sure.

Looking at ExecHashGetHashValue() for example purposes, I see it's
quite careful to call ResetExprContext(econtext) at the top of the
function to reset the tuple context.

I think the patch might need to go a bit further and also adjust
MemoizeHash_equal().  In non-binary mode, we just call
ExecQualAndReset() which evaluates the join condition and resets the
context.  The binary mode code does not do this, so I think we should
expand on what you've done and adjust that code too.

I've done that in the attached patch.
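
For context, the pattern being applied looks roughly like this (a
simplified sketch, not the attached patch itself): evaluate the comparison
in the per-tuple memory context, then reset that context so per-call
allocations, such as detoasted numeric keys, are freed instead of
accumulating:

    static bool
    equal_and_reset(ExprState *qual, ExprContext *econtext)
    {
        bool    match = ExecQual(qual, econtext);

        ResetExprContext(econtext);
        return match;
    }

This mirrors what ExecQualAndReset() already does for the non-binary mode.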

David


fix_memoize_memory_leak_v2.patch
Description: Binary data


Re: CHECK Constraint Deferrable

2023-10-03 Thread Himanshu Upadhyaya
On Tue, Sep 19, 2023 at 4:14 PM Dean Rasheed 
wrote:

> > I think we should be able to defer one constraint like in the case of
> > foreign key constraint
> >
>
> Agreed. It should be possible to have a mix of deferred and immediate
> constraint checks. In the example, the tbl_chk_1 is set deferred, but
> it fails immediately, which is clearly not right.
>
> Fixed in V3 patch.

> I would say that it's reasonable to limit the scope of this patch to
> table constraints only, and leave domain constraints to a possible
> follow-up patch.
>
> Sure, Agree.

> A few other review comments:
>
> 1. The following produces a WARNING (possibly the same issue already
> reported):
>
> CREATE TABLE foo (a int, b int);
> ALTER TABLE foo ADD CONSTRAINT a_check CHECK (a > 0);
> ALTER TABLE foo ADD CONSTRAINT b_check CHECK (b > 0) DEFERRABLE;
>
> WARNING:  unexpected pg_constraint record found for relation "foo"
>
> fixed in V3 patch.

> 2. I think that equalTupleDescs() should compare the new fields, when
> comparing the 2 sets of check constraints.
>
> Fixed in V3 patch.

> 3. The constraint exclusion code in the planner should ignore
> deferrable check constraints (see get_relation_constraints() in
> src/backend/optimizer/util/plancat.c), otherwise it might incorrectly
> exclude a relation on the basis of a constraint that is temporarily
> violated, and return incorrect query results. For example:
>
> CREATE TABLE foo (a int);
> CREATE TABLE foo_c1 () INHERITS (foo);
> CREATE TABLE foo_c2 () INHERITS (foo);
> ALTER TABLE foo_c2 ADD CONSTRAINT cc CHECK (a != 5) INITIALLY DEFERRED;
>
> BEGIN;
> INSERT INTO foo_c2 VALUES (5);
> SET LOCAL constraint_exclusion TO off;
> SELECT * FROM foo WHERE a = 5;
> SET LOCAL constraint_exclusion TO on;
> SELECT * FROM foo WHERE a = 5;
> ROLLBACK;
>
> Fixed in V3 patch.

> 4. The code in MergeWithExistingConstraint() should prevent inherited
> constraints being merged if their deferrable properties don't match
> (as MergeConstraintsIntoExisting() does, since
> constraints_equivalent() tests the deferrable fields). I.e., the
> following should fail to merge the constraints, since they don't
> match:
>
> DROP TABLE IF EXISTS p,c;
>
> CREATE TABLE p (a int, b int);
> ALTER TABLE p ADD CONSTRAINT c1 CHECK (a > 0) DEFERRABLE;
> ALTER TABLE p ADD CONSTRAINT c2 CHECK (b > 0);
>
> CREATE TABLE c () INHERITS (p);
> ALTER TABLE c ADD CONSTRAINT c1 CHECK (a > 0);
> ALTER TABLE c ADD CONSTRAINT c2 CHECK (b > 0) DEFERRABLE;
>
> I.e., that should produce an error, as happens if c is made to inherit
> p *after* the constraints have been added.
>
> Fixed in V3 patch.

> 5. Instead of just adding the new fields to the end of the ConstrCheck
> struct, and to the end of lists of function parameters like
> StoreRelCheck(), and other related code, it would be more logical to
> put them immediately before the valid/invalid entries, to match the
> order of constraint properties in pg_constraint, and functions like
> CreateConstraintEntry().
>
> Fixed in V3 patch.

-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: pg*.dll and *.pdb files in psqlODBC have no version numbers

2023-10-03 Thread Tom Lane
Mark Hill  writes:
> I posted this question to 
> pgsql-gene...@lists.postgrsql.org 
> last week but no one has responded, so posting here now.

> A colleague noticed that the following files in the psqlODBC MSI for Windows 
> have no version numbers:
> ...
> Does anyone know if that is by design or some other reason?   Should they 
> have version numbers?

No idea, but actually the pgsql-odbc list would be the most authoritative
place for answers, if you find none here.

regards, tom lane




Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag

2023-10-03 Thread Bharath Rupireddy
On Mon, Oct 2, 2023 at 4:58 PM Drouvot, Bertrand
 wrote:
>
> On 9/29/23 8:19 AM, Michael Paquier wrote:
> > On Thu, Sep 28, 2023 at 02:37:02PM +0200, Drouvot, Bertrand wrote:
> >> This patch allows the role provided in 
> >> BackgroundWorkerInitializeConnection()
> >> and BackgroundWorkerInitializeConnectionByOid() to lack login 
> >> authorization.
> >
> > Interesting.  Yes, there would be use cases for that, I suppose.

Correct. It allows the roles that don't have LOGIN capabilities to
start and use bg workers.

> > This may be more adapted with a bits32 for the flags.
>
> Done that way in v2 attached.

While I like the idea of the flag to skip login checks for bg workers,
I don't quite like the API changes to InitializeSessionUserId and
InitPostgres (adding a new input parameter), and to
BackgroundWorkerInitializeConnection and
BackgroundWorkerInitializeConnectionByOid (changing an input
parameter's type), given that all of these functions are available to
external modules and will break things for sure.

What if BGWORKER_BYPASS_ROLELOGINCHECK were part of bgw_flags? With
this, none of the APIs needs to be changed, so no compatibility
problems as such for external modules, and InitializeSessionUserId
can just do something like [1]. We might be tempted to add
BGWORKER_BYPASS_ALLOWCONN to bgw_flags as well, but I'd prefer not to
do it, for the same compatibility reasons.

Thoughts?

[1]
diff --git a/src/backend/utils/init/miscinit.c
b/src/backend/utils/init/miscinit.c
index 1e671c560c..27dcf052ab 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -786,10 +786,17 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 */
if (IsUnderPostmaster)
{
+   boolskip_check = false;
+
+   /* If asked, skip the role login check for background
workers. */
+   if (IsBackgroundWorker &&
+   (MyBgworkerEntry->bgw_flags &
BGWORKER_BYPASS_ROLELOGINCHECK) != 0)
+   skip_check = true;
+
/*
 * Is role allowed to login at all?
 */
-   if (!rform->rolcanlogin)
+   if (!skip_check && !rform->rolcanlogin)
ereport(FATAL,

(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
 errmsg("role \"%s\" is not
permitted to log in",

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Doc: Minor update for enable_partitionwise_aggregate

2023-10-03 Thread Ashutosh Bapat
On Sun, Oct 1, 2023 at 7:38 AM Andy Atkinson  wrote:
>
> Hello. While reading the docs for the enable_partitionwise_aggregate 
> parameter on the Query Planning page, I thought the description had a small 
> mistake that could be improved.
>
> The current wording is: "which allows grouping or aggregation on a 
> partitioned tables performed separately "
>
> Page: https://www.postgresql.org/docs/current/runtime-config-query.html
>
> I think possible better alternatives could be:
>
> (Option 1) a "partitioned table's partitions" (the possessive form of 
> "it's"). The "enable_partition_pruning" parameter uses "the partitioned 
> table's partitions" in this form. I think this option is good, but I had a 
> slight preference for option 2.
> (Option 2) Or to just cut out the first part and say "to be performed 
> separately for each partition", which seemed simpler. So the sentence reads: 
> "which allows grouping or aggregation to be performed separately for each 
> partition"

I would leave "on a partitioned table". Notice that I have removed "s"
from tables.

> (Option 3) dropping the "a" so it says "which allows grouping or aggregation 
> on partitioned tables performed separately". I don't think this is as good 
> though because the aggregation happens on the partitions, so it feels 
> slightly off to me to say the "partitioned tables" instead of the partitions.

It's technically incorrect as well. Aggregation is always performed on a
single relation - a join, a subquery, or a simple relation. A join may
have multiple tables in it, but the aggregation is performed on its
result, not on the individual tables, and hence not on partitions of the
individual tables.
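
A hypothetical example: in

    SELECT a, count(*) FROM t1 JOIN t2 USING (a) GROUP BY a;

the aggregation runs over the join's result, not over t1 or t2
individually, so it cannot run over partitions of t1 or t2 either.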

-- 
Best Wishes,
Ashutosh Bapat




Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

2023-10-03 Thread Nazir Bilal Yavuz
Hi,

On Fri, 15 Sept 2023 at 16:30, Melanie Plageman
 wrote:
>
> Yes, good catch. This is a bug. I will note that at least in 15 and
> likely before, pgBufferUsage.local_blks_written is incremented for
> local buffers but pgBufferUsage.blk_write_time is only added to for
> shared buffers (in FlushBuffer()). I think it makes sense to propose a
> bug fix to stable branches counting blk_write_time for local buffers
> as well.

I attached the PG16+ (after pg_stat_io) and PG15- (before pg_stat_io)
versions of the same patch.

Regards,
Nazir Bilal Yavuz
Microsoft
From b4ea504497708b1017e8cfda5f1ac366a9e34386 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 15 Sep 2023 11:55:43 +0300
Subject: [PATCH v2 2/2] [PG16+] Add to pgBufferUsage.blk_write_time for IOOp
 IOOP_EXTENDs

---
 src/backend/utils/activity/pgstat_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index c8058b57962..56051fc6072 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -119,7 +119,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE)
+		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION
-- 
2.42.0

From 46aaa1f76f02f7697f520e6509a3aeb970a862ae Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 15 Sep 2023 12:35:01 +0300
Subject: [PATCH v2 1/2] [PG16+] Increase pgBufferUsage.blk_{read|write}_time
 when IOObject is IOOBJECT_TEMP_RELATION

---
 src/backend/utils/activity/pgstat_io.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d4225..c8058b57962 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -122,13 +122,15 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		if (io_op == IOOP_WRITE)
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
+			if (io_object == IOOBJECT_RELATION
+			|| io_object == IOOBJECT_TEMP_RELATION)
 INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
+			if (io_object == IOOBJECT_RELATION
+			|| io_object == IOOBJECT_TEMP_RELATION)
 INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
 
-- 
2.42.0

From 1e3af6e9466924713488b374b5d7bfb2c3bd5983 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 2 Oct 2023 16:31:46 +0300
Subject: [PATCH v2 2/2] [PG15-] Add to pgBufferUsage.blk_write_time for
 extends

---
 src/backend/storage/buffer/bufmgr.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9fcb3d6e194..a4e9fd3317b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -825,6 +825,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	bool		found;
 	bool		isExtend;
 	bool		isLocalBuf = SmgrIsTemp(smgr);
+	instr_time	io_start,
+io_time;
 
 	*hit = false;
 
@@ -992,9 +994,21 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	{
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
 
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
 		 * although we're essentially performing a write. At least on linux
@@ -1012,9 +1026,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-		io_time;
-
 			if (track_io_timing)
 INSTR_TIME_SET_CURRENT(io_start);
 
-- 
2.42.0

From c7fdf1fa2a1f878a96a71b04e9fc65b1f2a0a402 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 2 Oct 2023 16:29:16 +0300
Subject: [PATCH v2 1/2] [PG15-] Add to pgBufferUsage.blk_write_time for local
 buffers

---
 src/backend/storage/buffer/localbuf.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index e71f95ac1f

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-03 Thread Hayato Kuroda (Fujitsu)
Dear Bharath,

> I'm attaching 0002 patch (on top of v45) which implements the new
> decodable callback approach that I have in mind. IMO, this new
> approach is extensible, better than the current approach (hard-coding
> of certain WAL records that may be generated during pg_upgrade) taken
> by the patch, and helps deal with the issue that custom WAL resource
> managers can have with the current approach taken by the patch.

Thanks for sharing your PoC! I tested it and it worked well. I have also
implemented the decoding approach locally, but your approach is
conceptually faster. I think it still checks the record types one by one,
so I'm not sure whether that is acceptable, but at least the checks are
centralized. We must hear opinions from others. What do others think?

Here are comments on your patch. I attached them as a txt file; please
include them if they look OK.

1.
According to your post, we must add comments to notify developers that
the is_decodable API must be implemented. Please share wording for it too
if you have an idea.

 
2.
The existence of is_decodable should be checked in RegisterCustomRmgr().

3.
Another rmgr API (rm_identify) requires the raw uint8 without a bit
operation applied: the callbacks do "info & ~XLR_INFO_MASK" themselves.
Should we follow that?

4.
It would be helpful for developers to add a function to the test_custom_rmgrs module.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

diff --git a/src/backend/access/transam/rmgr.c 
b/src/backend/access/transam/rmgr.c
index 001bdf3535..850ba7829a 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -117,6 +117,11 @@ RegisterCustomRmgr(RmgrId rmid, const RmgrData *rmgr)
 errdetail("Custom resource manager \"%s\" 
already registered with the same ID.",
   RmgrTable[rmid].rm_name)));
 
+   if (rmgr->rm_decode && rmgr->rm_is_record_decodable == NULL)
+   ereport(ERROR,
+   (errmsg("failed to register custom resource 
manager \"%s\" with ID %d", rmgr->rm_name, rmid),
+errdetail("Custom resource manager which has a 
decode function must have is_record_decodable function too.")));
+
/* check for existing rmgr with the same name */
for (int existing_rmid = 0; existing_rmid <= RM_MAX_ID; existing_rmid++)
{
diff --git a/src/backend/replication/logical/decode.c 
b/src/backend/replication/logical/decode.c
index 60d26ae015..2e97962e60 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -214,7 +214,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 xlog_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_CHECKPOINT_SHUTDOWN:
case XLOG_END_OF_RECOVERY:
@@ -401,7 +401,7 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 xact_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_XACT_COMMIT:
case XLOG_XACT_COMMIT_PREPARED:
@@ -471,7 +471,7 @@ standby_decode(LogicalDecodingContext *ctx, 
XLogRecordBuffer *buf)
 bool
 standby_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_RUNNING_XACTS:
return true;
@@ -550,7 +550,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 heap2_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_HEAP2_MULTI_INSERT:
case XLOG_HEAP2_NEW_CID:
@@ -661,7 +661,7 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 heap_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_HEAP_INSERT:
case XLOG_HEAP_HOT_UPDATE:
@@ -782,7 +782,7 @@ logicalmsg_decode(LogicalDecodingContext *ctx, 
XLogRecordBuffer *buf)
 bool
 logicalmsg_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_LOGICAL_MESSAGE:
return true;
diff --git a/src/backend/utils/adt/pg_upgrade_support.c 
b/src/backend/utils/adt/pg_upgrade_support.c
index cfd3e448b1..f19cb68d92 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -320,15 +320,13 @@ binary_upgrade_validate_wal_logical_end(PG_FUNCTION_ARGS)
while (is_valid && ReadNextXLogRecord(xlogreader))
{
RmgrDatarmgr;
-   RmgrIds rmid;
-   uint8   info;
-
-   /* Check the type of WAL */
-   rmid = XLogRecGetRmid(xlogreader);
-   info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
 
if (initial_reco

Re: [PGDOCS] Inconsistent linkends to "monitoring" views.

2023-10-03 Thread Peter Smith
On Tue, Oct 3, 2023 at 6:30 PM Michael Paquier  wrote:
>
> On Tue, Oct 03, 2023 at 01:11:15PM +1100, Peter Smith wrote:
> > I noticed one or two "monitoring" links and linkends that are slightly
> > inconsistent with all the others.
>
> -   
> +   
>
> Is that really worth bothering with for the internal link references?

I preferred 100% consistency instead of 95% consistency. YMMV.

> This can create extra backpatching conflicts.

Couldn't the same be said for every patch that fixes a comment typo?
This is like a link typo, so what's the difference?

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Improvements in pg_dump/pg_restore toc format and performances

2023-10-03 Thread vignesh C
On Tue, 19 Sept 2023 at 15:46, Pierre Ducroquet  wrote:
>
> On Monday, September 18, 2023 11:54:42 PM CEST Nathan Bossart wrote:
> > On Mon, Sep 18, 2023 at 02:52:47PM -0700, Nathan Bossart wrote:
> > > On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> > >> I ended up writing several patches that shaved some time for pg_restore
> > >> -l,
> > >> and reduced the toc.dat size.
> > >
> > > I've only just started taking a look at these patches, and I intend to do
> > > a
> > > more thorough review in the hopefully-not-too-distant future.
> >
> > Since cfbot is failing on some pg_upgrade and pg_dump tests, I've set this
> > to waiting-on-author.
>
> Attached updated patches fix this regression, I'm sorry I missed that.

A few comments:
1) These printf statements are not required:
+   /* write the list of am */
+   printf("%d tableams to save\n", tableam_count);
+   WriteInt(AH, tableam_count);
+   for (i = 0 ; i < tableam_count ; i++)
+   {
+   printf("%d is %s\n", i, tableams[i]);
+   WriteStr(AH, tableams[i]);
+   }
+
+   /* write the list of namespaces */
+   printf("%d namespaces to save\n", namespace_count);
+   WriteInt(AH, namespace_count);
+   for (i = 0 ; i < namespace_count ; i++)
+   {
+   printf("%d is %s\n", i, namespaces[i]);
+   WriteStr(AH, namespaces[i]);
+   }
+
+   /* write the list of owners */
+   printf("%d owners to save\n", owner_count);

2) We generally use pg_malloc in client tools; we should change palloc
to pg_malloc:
+   /* prepare dynamic arrays */
+   tableams = palloc(sizeof(char*) * 1);
+   tableams[0] = NULL;
+   tableam_count = 0;
+   namespaces = palloc(sizeof(char*) * 1);
+   namespaces[0] = NULL;
+   namespace_count = 0;
+   owners = palloc(sizeof(char*) * 1);
+   owners[0] = NULL;
+   owner_count = 0;
+   tablespaces = palloc(sizeof(char*) * 1);

3) Similar code is repeated a few times; would it be possible to do
it in a common function (see the sketch after this list):
+   if (namespace_count < 128)
+   {
+   te->namespace_idx = AH->ReadBytePtr(AH);
+   invalid_entry = 255;
+   }
+   else
+   {
+   te->namespace_idx = ReadInt(AH);
+   invalid_entry = -1;
+   }
+   if (te->namespace_idx == invalid_entry)
+   te->namespace = "";
+   else
+   te->namespace = namespaces[te->namespace_idx];

4) Can the initialization of tableam_count, namespace_count,
owner_count and tablespace_count be done at declaration, so that this
initialization code can be removed:
+   /* prepare dynamic arrays */
+   tableams = palloc(sizeof(char*) * 1);
+   tableams[0] = NULL;
+   tableam_count = 0;
+   namespaces = palloc(sizeof(char*) * 1);
+   namespaces[0] = NULL;
+   namespace_count = 0;
+   owners = palloc(sizeof(char*) * 1);
+   owners[0] = NULL;
+   owner_count = 0;
+   tablespaces = palloc(sizeof(char*) * 1);
+   tablespaces[0] = NULL;
+   tablespace_count = 0;

5) There are some whitespace issues in the patch:
Applying: move static strings to arrays at beginning
.git/rebase-apply/patch:24: trailing whitespace.
.git/rebase-apply/patch:128: trailing whitespace.
.git/rebase-apply/patch:226: trailing whitespace.
.git/rebase-apply/patch:232: trailing whitespace.
warning: 4 lines add whitespace errors.
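
For point 3, a hypothetical shape of such a common helper (the name and
signature are illustrative only, based on the quoted patch code):

    static int
    ReadCompactIndex(ArchiveHandle *AH, int count)
    {
        int     idx;
        int     invalid_entry;

        if (count < 128)
        {
            idx = AH->ReadBytePtr(AH);
            invalid_entry = 255;
        }
        else
        {
            idx = ReadInt(AH);
            invalid_entry = -1;
        }

        /* map the "invalid" marker to -1 so callers test one sentinel */
        return (idx == invalid_entry) ? -1 : idx;
    }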

Regards,
Vignesh




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-03 Thread Amit Kapila
On Tue, Oct 3, 2023 at 9:58 AM Bharath Rupireddy
 wrote:
>
> On Fri, Sep 29, 2023 at 5:27 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Yeah, the approach forces developers to check the decodability.
> > But the benefit seems smaller than the effort required for it because
> > the function would be used only by pg_upgrade. Could you tell me if you
> > have another use case in mind? We may be able to adopt it if we have...
>
> I'm attaching 0002 patch (on top of v45) which implements the new
> decodable callback approach that I have in mind. IMO, this new
> approach is extensible, better than the current approach (hard-coding
> of certain WAL records that may be generated during pg_upgrade) taken
> by the patch, and helps deal with the issue that custom WAL resource
> managers can have with the current approach taken by the patch.
>

+xlog_is_record_decodable(uint8 info)
+{
+ switch (info)
+ {
+ case XLOG_CHECKPOINT_SHUTDOWN:
+ case XLOG_END_OF_RECOVERY:
+ return true;
+ case XLOG_CHECKPOINT_ONLINE:
+ case XLOG_PARAMETER_CHANGE:
...
+ return false;
}

I think this won't behave correctly. Without your patch, we consider
both XLOG_CHECKPOINT_SHUTDOWN and XLOG_CHECKPOINT_ONLINE as valid
records, but after the patch only one of these will be considered
valid, which won't lead to the desired behavior.

BTW, the API proposed in your patch returns the WAL record type as
valid if there is something we do for it during decoding, but the check
in the upgrade function expects the reverse value. For example, for the
WAL record type XLOG_HEAP_INSERT, the API returns true, and that is an
indication to the caller that this is an expected record after the
confirmed_flush LSN location, which doesn't seem correct. Am I missing
something?
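
If I read the first complaint correctly, the switch would need to treat
both checkpoint record types alike, along these lines (illustrative only):

 switch (info)
 {
 case XLOG_CHECKPOINT_SHUTDOWN:
 case XLOG_CHECKPOINT_ONLINE:
 case XLOG_END_OF_RECOVERY:
 return true;
 ...
 }

though what returning true should mean for the upgrade check is exactly
the second question above.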

-- 
With Regards,
Amit Kapila.




Modernize const handling with readline

2023-10-03 Thread Peter Eisentraut

The comment

/* On some platforms, readline is declared as readline(char *) */

is obsolete.  The casting away of const can be removed.

The const in the readline() prototype was added in GNU readline 4.2, 
released in 2001.  BSD libedit has also had const in the prototype since 
at least 2001.


(The commit that introduced this comment (187e865174) talked about 
FreeBSD 4.8, which didn't have readline compatibility in libedit yet, so 
it must have been talking about GNU readline in the base system.  This 
checks out, but already FreeBSD 5 had an updated GNU readline with const.)
From 4db9b54b40833486e366c2d117291147ce47b195 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 3 Oct 2023 12:08:25 +0200
Subject: [PATCH] Modernize const handling with readline

The comment

/* On some platforms, readline is declared as readline(char *) */

is obsolete.  The casting away of const can be removed.

The const in the readline() prototype was added in GNU readline 4.2,
released in 2001.  BSD libedit has also had const in the prototype
since at least 2001.

(The commit that introduced this comment (187e865174) talked about
FreeBSD 4.8, which didn't have readline compatibility in libedit yet,
so it must have been talking about GNU readline in the base system.
This checks out, but already FreeBSD 5 had an updated GNU readline
with const.)
---
 src/bin/psql/input.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/bin/psql/input.c b/src/bin/psql/input.c
index 873d85079c..7c77fd8e89 100644
--- a/src/bin/psql/input.c
+++ b/src/bin/psql/input.c
@@ -88,8 +88,7 @@ gets_interactive(const char *prompt, PQExpBuffer query_buf)
/* Enable SIGINT to longjmp to sigint_interrupt_jmp */
sigint_interrupt_enabled = true;
 
-   /* On some platforms, readline is declared as readline(char *) 
*/
-   result = readline((char *) prompt);
+   result = readline(prompt);
 
/* Disable SIGINT again */
sigint_interrupt_enabled = false;
-- 
2.42.0



Re: Skip collecting decoded changes of already-aborted transactions

2023-10-03 Thread vignesh C
On Mon, 3 Jul 2023 at 07:16, Masahiko Sawada  wrote:
>
> On Fri, Jun 23, 2023 at 12:39 PM Dilip Kumar  wrote:
> >
> > On Fri, Jun 9, 2023 at 10:47 AM Masahiko Sawada  
> > wrote:
> > >
> > > Hi,
> > >
> > > In logical decoding, we don't need to collect decoded changes of
> > > aborted transactions. While streaming changes, we can detect
> > > concurrent abort of the (sub)transaction but there is no mechanism to
> > > skip decoding changes of transactions that are known to already be
> > > aborted. With the attached WIP patch, we check CLOG when decoding the
> > > transaction for the first time. If it's already known to be aborted,
> > > we skip collecting decoded changes of such transactions. That way,
> > > when the logical replication is behind or restarts, we don't need to
> > > decode large transactions that already aborted, which helps improve
> > > the decoding performance.
> > >
> > +1 for the idea of checking the transaction status only when we need
> > to flush it to the disk or send it downstream (if streaming in
> > progress is enabled).  Although this check is costly, since we are
> > planning it only for large transactions it is worth it if we can
> > occasionally avoid disk or network I/O for the aborted transactions.
> >
>
> Thanks.
>
> I've attached the updated patch. With this patch, we check the
> transaction status only for large transactions at eviction time. For
> regression test purposes, I disable this transaction status check when
> logical_replication_mode is set to 'immediate'.

Maybe there are some changes missing in the patch; it is giving the
following errors:
reorderbuffer.c: In function ‘ReorderBufferCheckTXNAbort’:
reorderbuffer.c:3584:22: error: ‘logical_replication_mode’ undeclared
(first use in this function)
 3584 | if (unlikely(logical_replication_mode ==
LOGICAL_REP_MODE_IMMEDIATE))
  |  ^~~~

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2023-10-03 Thread Amit Kapila
On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
 wrote:
>
> On 9/29/23 1:33 PM, Amit Kapila wrote:
> > On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
> >  wrote:
> >>
> >
> >> - probably open corner cases like: what if a standby is down? would that 
> >> mean
> >> that synchronize_slot_names not being send to the primary would allow the 
> >> decoding
> >> on the primary to go ahead?
> >>
> >
> > Good question. BTW, irrespective of whether we have
> > 'standby_slot_names' parameters or not, how should we behave if
> > standby is down? Say, if 'synchronize_slot_names' is only specified on
> > standby then in such a situation primary won't be even aware that some
> > of the logical walsenders need to wait.
>
> Exactly, that's why I was thinking of keeping standby_slot_names to address
> this scenario. In such a case one could simply decide to keep or remove
> the associated physical replication slot from standby_slot_names. Keeping it
> would mean "wait" and removing it would mean allowing decoding on the primary.
>
> > OTOH, one can say that users
> > should configure 'synchronize_slot_names' on both primary and standby
> > but note that this value could be different for different standby's,
> > so we can't configure it on primary.
> >
>
> Yeah, I think that's a good use case for standby_slot_names, what do you 
> think?
>

But, even if we keep 'standby_slot_names' for this purpose, the
primary doesn't know the value of 'synchronize_slot_names' once the
standby is down and/or the primary is restarted. So, how will we know
which logical WAL senders need to wait for 'standby_slot_names'?
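
For readers following along, the parameters under discussion would be set
roughly like this (the parameter names come from the proposed patch, not
from any released version):

    # on the primary
    standby_slot_names = 'standby1_physical_slot'

    # on each standby
    synchronize_slot_names = 'logical_slot_1, logical_slot_2'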


--
With Regards,
Amit Kapila.




Re: Modernize const handling with readline

2023-10-03 Thread Aleksander Alekseev
Hi,

> The comment
>
>  /* On some platforms, readline is declared as readline(char *) */
>
> is obsolete.  The casting away of const can be removed.
>
> The const in the readline() prototype was added in GNU readline 4.2,
> released in 2001.  BSD libedit has also had const in the prototype since
> at least 2001.
>
> (The commit that introduced this comment (187e865174) talked about
> FreeBSD 4.8, which didn't have readline compatibility in libedit yet, so
> it must have been talking about GNU readline in the base system.  This
> checks out, but already FreeBSD 5 had an updated GNU readline with const.)

LGTM.

While examining the code for similar places I noticed that the
following functions can also be const'ified:

- crc32_sz
- pg_checksum_page (? temporarily modifies the page but then restores it)
- XLogRegisterData (?)

The callers of cstring_to_text[_with_len] often cast the argument to
(char *) while in fact it's (const char *). This can be refactored
too.
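
A trivial illustration of that last point (hypothetical caller):

    const char *s = "hello";
    text       *t;

    t = cstring_to_text((char *) s);   /* today: needless cast */
    t = cstring_to_text(s);            /* parameter is already const char * */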

Additionally there is a slight difference between XLogRegisterBlock()
declaration in xloginsert.h:

```
extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
  ForkNumber forknum, BlockNumber blknum,
char *page,
  uint8 flags);
```

... and xloginsert.c:

```
void
XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
  BlockNumber blknum, Page page, uint8 flags)
```

Will there be a value in addressing anything of this?

-- 
Best regards,
Aleksander Alekseev




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-03 Thread Andrei Lepikhov

On 29/9/2023 09:52, Andrei Lepikhov wrote:

On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote:

Attach patches updated to master.
Pulled from patch 2 back to patch 1 a change that was also pertinent 
to patch 1.

+1 to the idea, have doubts on the implementation.

I have a question. I see the feature triggers an ERROR when the memory
limit is exceeded. The superior PG_CATCH() section will handle the error.
As I see it, many such sections use memory allocations. What if some
routine, like CopyErrorData(), exceeds the limit too? In this case, we
could repeat the error all the way up to the top PG_CATCH(). Is this
correct behaviour? Maybe check for recursion in
exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed
this hard limit?

With the attached patch I try to show which sort of problems I'm worrying
about. In some PG_CATCH() sections we do CopyErrorData() (allocating some
memory) before aborting the transaction. So, the allocation error can move
us out of this section before the abort. We expect a soft ERROR message
but will face harder consequences.
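
A rough sketch of the headroom idea, purely illustrative (the function
name and the 10% figure are assumptions, not the patch's code):

    static bool
    exceeds_limit_with_headroom(Size request, Size used, Size limit,
                                bool in_error_handler)
    {
        /* let error-handling paths slightly exceed the hard limit */
        Size    effective_limit = in_error_handler ? limit + limit / 10 : limit;

        return used + request > effective_limit;
    }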


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 33975687b3..3f992b8d92 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -291,10 +291,7 @@ _SPI_commit(bool chain)
{
ErrorData  *edata;
 
-   /* Save error info in caller's context */
MemoryContextSwitchTo(oldcontext);
-   edata = CopyErrorData();
-   FlushErrorState();
 
/*
 * Abort the failed transaction.  If this fails too, we'll just
@@ -302,6 +299,10 @@ _SPI_commit(bool chain)
 */
AbortCurrentTransaction();
 
+   /* Save error info in caller's context */
+   edata = CopyErrorData();
+   FlushErrorState();
+
/* ... and start a new one */
StartTransactionCommand();
if (chain)
@@ -383,10 +384,7 @@ _SPI_rollback(bool chain)
{
ErrorData  *edata;
 
-   /* Save error info in caller's context */
MemoryContextSwitchTo(oldcontext);
-   edata = CopyErrorData();
-   FlushErrorState();
 
/*
 * Try again to abort the failed transaction.  If this fails 
too,
@@ -395,6 +393,10 @@ _SPI_rollback(bool chain)
 */
AbortCurrentTransaction();
 
+   /* Save error info in caller's context */
+   edata = CopyErrorData();
+   FlushErrorState();
+
/* ... and start a new one */
StartTransactionCommand();
if (chain)
diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c
index 12edc5772a..f9cf599026 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2565,7 +2565,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, 
ReorderBufferTXN *txn,
PG_CATCH();
{
MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
-   ErrorData  *errdata = CopyErrorData();
+   ErrorData  *errdata;
 
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH 
blocks */
if (iterstate)
@@ -2579,6 +2579,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, 
ReorderBufferTXN *txn,
 */
AbortCurrentTransaction();
 
+   errdata = CopyErrorData();
+
/* make sure there's no cache pollution */
ReorderBufferExecuteInvalidations(txn->ninvalidations,

  txn->invalidations);


Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag

2023-10-03 Thread Drouvot, Bertrand

Hi,

On 10/3/23 11:21 AM, Bharath Rupireddy wrote:

On Mon, Oct 2, 2023 at 4:58 PM Drouvot, Bertrand
 wrote:


On 9/29/23 8:19 AM, Michael Paquier wrote:

On Thu, Sep 28, 2023 at 02:37:02PM +0200, Drouvot, Bertrand wrote:

This patch allows the role provided in BackgroundWorkerInitializeConnection()
and BackgroundWorkerInitializeConnectionByOid() to lack login authorization.


Interesting.  Yes, there would be use cases for that, I suppose.


Correct. It allows the roles that don't have LOGIN capabilities to
start and use bg workers.


This may be more adapted with a bits32 for the flags.


Done that way in v2 attached.


While I like the idea of the flag to skip login checks for bg workers,
I don't quite like the APIs being changes InitializeSessionUserId and
InitPostgres (adding a new input parameter),
BackgroundWorkerInitializeConnection and
BackgroundWorkerInitializeConnectionByOid (changing of input parameter
type) given that all of these functions are available for external
modules and will break things for sure.

What if BGWORKER_BYPASS_ROLELOGINCHECK be part of bgw_flags? With
this, none of the API needs to be changed, so no compatibility
problems as such for external modules and the InitializeSessionUserId
can just do something like [1]. We might be tempted to add
BGWORKER_BYPASS_ALLOWCONN also to bgw_flags, but I'd prefer not to do
it for the same compatibility reasons.

Thoughts?



Thanks for looking at it!

I did some research: BGWORKER_BYPASS_ALLOWCONN was added in eed1ce72e1,
and bgw_flags already existed at that time.

In the related thread [1], Tom mentioned:

"
We change exported APIs in new major versions all the time.  As
long as it's just a question of an added parameter, people can deal
with it.
"

And I agree with that.

Now, I understand your point, but it looks to me that bgw_flags is more
about capabilities (access to shared memory with BGWORKER_SHMEM_ACCESS,
or the ability to establish a database connection with
BGWORKER_BACKEND_DATABASE_CONNECTION), while
BGWORKER_BYPASS_ROLELOGINCHECK (and BGWORKER_BYPASS_ALLOWCONN) relate
more to the BGW's behavior once the capability is in place.

So, I think I'm fine with the current proposal and don't see the need to
move BGWORKER_BYPASS_ROLELOGINCHECK into bgw_flags.

[1]: https://www.postgresql.org/message-id/22769.1519323861%40sss.pgh.pa.us

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: New WAL record to detect the checkpoint redo location

2023-10-03 Thread Robert Haas
On Wed, Sep 20, 2023 at 4:20 PM Robert Haas  wrote:
> Here are some patches.

Here are some updated patches. Following some off-list conversation
with Andres, I restructured 0003 to put the common case first and use
likely(), and I fixed the brown-paper-bag noted by Amit. I then turned
my attention to performance testing. I was happy to find out when I
did a bunch of testing on Friday that my branch with these patches
applied outperformed master. I was then less happy to find that when I
repeated the same tests today, master outperformed the branch. So now
I don't know what is going on, but it doesn't seem like my test
results are stable enough to draw meaningful conclusions.

I was trying to think of a test case where XLogInsertRecord would be
exercised as heavily as possible, so I really wanted to generate a lot
of WAL while doing as little real work as possible. The best idea that
I had was to run pg_create_restore_point() in a loop. Initially,
performance was dominated by the log messages which that function
emits, so I set log_min_messages='FATAL' to suppress those. To try to
further reduce other bottlenecks, I also set max_wal_size='50GB',
fsync='off', synchronous_commit='off', and wal_buffers='256MB'. Then I
ran this query:

select count(*) from (SELECT pg_create_restore_point('banana') from
generate_series(1,100000000) g) x;
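
For reproduction, the settings described above in postgresql.conf form:

    log_min_messages = 'FATAL'
    max_wal_size = '50GB'
    fsync = off
    synchronous_commit = off
    wal_buffers = '256MB'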

I can't help laughing at the comedy of creating 100 million
banana-named restore points with no fsyncs or logging, but here we
are. All of my test runs (with master, with the patches, and with
just the first patch) ran in between 34 and 39 seconds. As I say, I
can't really separate out which versions are faster and slower with
any confidence. Before I fixed the brown-paper bag that Amit pointed
out, it was using WALInsertLockAcquireExclusive() instead of
WALInsertLockAcquire() for *all* WAL records, and that created an
extremely large and obvious increase in the runtime of the tests. So
I'm relatively confident that this test case is sensitive to changes
in execution time of XLogInsertRecord(), but apparently the changes
caused by rearranging the branches are a bit too marginal for them to
show up here.

One possible conclusion is that the differences here aren't actually
big enough to get stressed about, but I don't want to jump to that
conclusion without investigating the competing hypothesis that this
isn't the right way to test this, and that some better test would show
clearer results. Suggestions?

--
Robert Haas
EDB: http://www.enterprisedb.com


v7-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patch
Description: Binary data


v7-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patch
Description: Binary data


v7-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patch
Description: Binary data


Re: bgwriter doesn't flush WAL stats

2023-10-03 Thread Nazir Bilal Yavuz
Hi,

On Mon, 2 Oct 2023 at 13:08, Heikki Linnakangas  wrote:
>
> The first patch, to flush the bgwriter's WAL stats to the stats
> collector, seems like a straightforward bug fix, so committed and
> backpatched that. Thank you!
>
> I didn't look at the second patch.

Thanks for the push!

Actual commitfest entry for the second patch is:
https://commitfest.postgresql.org/45/4416/. I sent a second patch to
this thread just to show how I found this bug. There is no need to
review it, this commitfest entry could be closed as committed.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-10-03 Thread Ashutosh Bapat
On Fri, Sep 29, 2023 at 8:36 AM Amit Langote  wrote:
> IOW, something
> like the following would have sufficed:
>
> @@ -1735,6 +1735,10 @@ build_child_join_sjinfo(PlannerInfo *root,
> SpecialJoinInfo *parent_sjinfo,
>  /*
>   * free_child_sjinfo_members
>   * Free memory consumed by members of a child SpecialJoinInfo.
> + *
> + * Only members that are translated copies of their counterpart in the parent
> + * SpecialJoinInfo are freed here.  However, members that could be referenced
> + * elsewhere are not freed.
>   */
>  static void
>  free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo)
> @@ -1755,10 +1759,7 @@ free_child_sjinfo_members(SpecialJoinInfo 
> *child_sjinfo)
> bms_free(child_sjinfo->syn_lefthand);
> bms_free(child_sjinfo->syn_righthand);
>
> -   /*
> -* But the list of operator OIDs and the list of expressions may be
> -* referenced somewhere else. Do not free those.
> -*/
> +   /* semi_rhs_exprs may be referenced, so don't free. */
>  }

Works for me. PFA patchset with these changes. I have still left the
changes addressing your comments as a separate patch for easier
review.

-- 
Best Wishes,
Ashutosh Bapat
From 6f55c13473b04eb5634ed0f34d4d5bce8bfdf3e8 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH 1/3] Report memory used for planning a query in EXPLAIN
 ANALYZE

The memory used in the CurrentMemoryContext and its children is sampled
before and after calling pg_plan_query() from ExplainOneQuery(). The
difference in the two samples is reported as the memory consumed while
planning the query. This may not account for the memory allocated in
memory contexts which are not children of CurrentMemoryContext. These
contexts are usually other long lived contexts, e.g.
CacheMemoryContext, which are shared by all the queries run in a
session. The consumption in those cannot be attributed only to a given
query and hence should not be reported anyway.

The memory consumption is reported as "Planning Memory" property in
EXPLAIN ANALYZE output.

Ashutosh Bapat
---
 src/backend/commands/explain.c| 12 ++--
 src/backend/commands/prepare.c|  7 ++-
 src/backend/utils/mmgr/mcxt.c | 19 +++
 src/include/commands/explain.h|  3 ++-
 src/include/utils/memutils.h  |  1 +
 src/test/regress/expected/explain.out | 15 +++
 6 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 13217807ee..e3482cabd0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -397,16 +397,20 @@ ExplainOneQuery(Query *query, int cursorOptions,
 	planduration;
 		BufferUsage bufusage_start,
 	bufusage;
+		Size		mem_consumed;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
 		INSTR_TIME_SET_CURRENT(planstart);
+		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
 
 		/* plan the query */
 		plan = pg_plan_query(query, queryString, cursorOptions, params);
 
 		INSTR_TIME_SET_CURRENT(planduration);
 		INSTR_TIME_SUBTRACT(planduration, planstart);
+		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext)
+			- mem_consumed;
 
 		/* calc differences of buffer counters. */
 		if (es->buffers)
@@ -417,7 +421,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-	   &planduration, (es->buffers ? &bufusage : NULL));
+	   &planduration, (es->buffers ? &bufusage : NULL), &mem_consumed);
 	}
 }
 
@@ -527,7 +531,7 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage)
+			   const BufferUsage *bufusage, const Size *mem_consumed)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -628,6 +632,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		double		plantime = INSTR_TIME_GET_DOUBLE(*planduration);
 
 		ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 3, es);
+
+		if (mem_consumed)
+			ExplainPropertyUInteger("Planning Memory", "bytes",
+	(uint64) *mem_consumed, es);
 	}
 
 	/* Print info about runtime of triggers */
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 18f70319fc..02b48f845f 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,10 +583,12 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	instr_time	planduration;
 	BufferUsage bufusage_start,
 bufusage;
+	Size		mem_consumed;
 
 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
 	INSTR_TIME_SET_CURRENT(planstart);
+	mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
 
 	/* Look it up in the hash table */
 	entry = FetchPrep

Re: remaining sql/json patches

2023-10-03 Thread Amit Langote
On Mon, Oct 2, 2023 at 2:26 PM Amit Langote  wrote:
> On Mon, Oct 2, 2023 at 1:24 PM Amit Langote  wrote:
> > Pushed this 30 min ago (no email on -committers yet!) and am looking
> > at the following llvm crash reported by buildfarm animal pogona [1]:
> >
> > #4  0x7f5bceb673d5 in __assert_fail_base (fmt=0x7f5bcecdbdc8
> > "%s%s%s:%u: %s%sAssertion `%s' failed.\\n%n",
> > assertion=assertion@entry=0x7f5bc1336419 "(i >= FTy->getNumParams() ||
> > FTy->getParamType(i) == Args[i]->getType()) && \\"Calling a function
> > with a bad signature!\\"", file=file@entry=0x7f5bc1336051
> > "/home/bf/src/llvm-project-5/llvm/lib/IR/Instructions.cpp",
> > line=line@entry=299, function=function@entry=0x7f5bc13362af "void
> > llvm::CallInst::init(llvm::FunctionType *, llvm::Value *,
> > ArrayRef<llvm::Value *>, ArrayRef<llvm::OperandBundleDef>, const
> > llvm::Twine &)") at ./assert/assert.c:92
> > #5  0x7f5bceb763a2 in __assert_fail (assertion=0x7f5bc1336419 "(i
> > >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType())
> > && \\"Calling a function with a bad signature!\\"",
> > file=0x7f5bc1336051
> > "/home/bf/src/llvm-project-5/llvm/lib/IR/Instructions.cpp", line=299,
> > function=0x7f5bc13362af "void llvm::CallInst::init(llvm::FunctionType
> > *, llvm::Value *, ArrayRef<llvm::Value *>,
> > ArrayRef<llvm::OperandBundleDef>, const llvm::Twine &)") at
> > ./assert/assert.c:101
> > #6  0x7f5bc110f138 in llvm::CallInst::init (this=0x557a91f3e508,
> > FTy=0x557a91ed9ae0, Func=0x557a91f8be88, Args=..., Bundles=...,
> > NameStr=...) at
> > /home/bf/src/llvm-project-5/llvm/lib/IR/Instructions.cpp:297
> > #7  0x7f5bc0fa579d in llvm::CallInst::CallInst
> > (this=0x557a91f3e508, Ty=0x557a91ed9ae0, Func=0x557a91f8be88,
> > Args=..., Bundles=..., NameStr=..., InsertBefore=0x0) at
> > /home/bf/src/llvm-project-5/llvm/include/llvm/IR/Instructions.h:1934
> > #8  0x7f5bc0fa538c in llvm::CallInst::Create (Ty=0x557a91ed9ae0,
> > Func=0x557a91f8be88, Args=..., Bundles=..., NameStr=...,
> > InsertBefore=0x0) at
> > /home/bf/src/llvm-project-5/llvm/include/llvm/IR/Instructions.h:1444
> > #9  0x7f5bc0fa51f9 in llvm::IRBuilder<llvm::ConstantFolder,
> > llvm::IRBuilderDefaultInserter>::CreateCall (this=0x557a91f9c6a0,
> > FTy=0x557a91ed9ae0, Callee=0x557a91f8be88, Args=..., Name=...,
> > FPMathTag=0x0) at
> > /home/bf/src/llvm-project-5/llvm/include/llvm/IR/IRBuilder.h:1669
> > #10 0x7f5bc100edda in llvm::IRBuilder<llvm::ConstantFolder,
> > llvm::IRBuilderDefaultInserter>::CreateCall (this=0x557a91f9c6a0,
> > Callee=0x557a91f8be88, Args=..., Name=..., FPMathTag=0x0) at
> > /home/bf/src/llvm-project-5/llvm/include/llvm/IR/IRBuilder.h:1663
> > #11 0x7f5bc100714e in LLVMBuildCall (B=0x557a91f9c6a0,
> > Fn=0x557a91f8be88, Args=0x7ffde6fa0b50, NumArgs=6, Name=0x7f5bc30b648c
> > "funccall_iocoerce_in_safe") at
> > /home/bf/src/llvm-project-5/llvm/lib/IR/Core.cpp:2964
> > #12 0x7f5bc30af861 in llvm_compile_expr (state=0x557a91fbeac0) at
> > /home/bf/bf-build/pogona/HEAD/pgsql.build/../pgsql/src/backend/jit/llvm/llvmjit_expr.c:1373
> >
> > This seems to me to be complaining about the following addition:
> >
> > +   {
> > +   Oid ioparam = op->d.iocoerce.typioparam;
> > +   LLVMValueRef v_params[6];
> > +   LLVMValueRef v_success;
> > +
> > +   v_params[0] = l_ptr_const(op->d.iocoerce.finfo_in,
> > + l_ptr(StructFmgrInfo));
> > +   v_params[1] = v_output;
> > +   v_params[2] = l_oid_const(lc, ioparam);
> > +   v_params[3] = l_int32_const(lc, -1);
> > +   v_params[4] = l_ptr_const(op->d.iocoerce.escontext,
> > +
> > l_ptr(StructErrorSaveContext));
> >
> > -   LLVMBuildStore(b, v_retval, v_resvaluep);
> > +   /*
> > +* InputFunctionCallSafe() will write directly into
> > +* *op->resvalue.
> > +*/
> > +   v_params[5] = v_resvaluep;
> > +
> > +   v_success = LLVMBuildCall(b, llvm_pg_func(mod,
> > "InputFunctionCallSafe"),
> > + v_params, 
> > lengthof(v_params),
> > + 
> > "funccall_iocoerce_in_safe");
> > +
> > +   /*
> > +* Return null if InputFunctionCallSafe() 
> > encountered
> > +* an error.
> > +*/
> > +   v_resnullp = LLVMBuildICmp(b, LLVMIntEQ, v_success,
> > +  l_sbool_const(0), "");
> > +   }
>
> Although all buildfarm animals except pogona looked fine, I've decided to
> revert the patch for now.
>
> IIUC, LLVM is complaining that the code in the above block is not passing the 
> arguments of InputFunctionCallSafe() using the correct types.  I'm not 
> exactly sure which particular argum

Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag

2023-10-03 Thread Bharath Rupireddy
On Tue, Oct 3, 2023 at 5:45 PM Drouvot, Bertrand
 wrote:
>
> > While I like the idea of the flag to skip login checks for bg workers,
> > I don't quite like the APIs being changed: InitializeSessionUserId and
> > InitPostgres (adding a new input parameter),
> > BackgroundWorkerInitializeConnection and
> > BackgroundWorkerInitializeConnectionByOid (changing of input parameter
> > type) given that all of these functions are available for external
> > modules and will break things for sure.
> >
> > What if BGWORKER_BYPASS_ROLELOGINCHECK were part of bgw_flags? With
> > this, none of the API needs to be changed, so no compatibility
> > problems as such for external modules and the InitializeSessionUserId
> > can just do something like [1]. We might be tempted to add
> > BGWORKER_BYPASS_ALLOWCONN also to bgw_flags, but I'd prefer not to do
> > it for the same compatibility reasons.
> >
> > Thoughts?
> >
>
> Thanks for looking at it!
>
> I did some research and BGWORKER_BYPASS_ALLOWCONN was added in eed1ce72e1,
> and at that time bgw_flags already existed.
>
> In this the related thread [1], Tom mentioned:
>
> "
> We change exported APIs in new major versions all the time.  As
> long as it's just a question of an added parameter, people can deal
> with it.
> "

It doesn't have to be always/all the time. If it's okay here to change the
bgw and other core function APIs, I honestly feel that we must move
BGWORKER_BYPASS_ALLOWCONN to bgw_flags.

> Now, I understand your point but it looks to me that bgw_flags is more
> about the capabilities (access to shared memory with BGWORKER_SHMEM_ACCESS
> or the ability to establish a database connection with
> BGWORKER_BACKEND_DATABASE_CONNECTION),
>
> while with BGWORKER_BYPASS_ROLELOGINCHECK (and BGWORKER_BYPASS_ALLOWCONN)
> it's more related to the BGW behavior once the capability is in place.

I look at the new flag as a capability of the bgw to connect with a
role without login access. IMV, they are all the same.

> So, I think I'm fine with the current proposal and don't see the need to move
> BGWORKER_BYPASS_ROLELOGINCHECK in bgw_flags.
>
> [1]: https://www.postgresql.org/message-id/22769.1519323861%40sss.pgh.pa.us

I prefer to have it as a bgw_flag; however, let's hear from others.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: On login trigger: take three

2023-10-03 Thread Daniel Gustafsson
> On 2 Oct 2023, at 20:10, Robert Haas  wrote:
> 
> Sorry to have gone dark on this for a long time after having been
> asked for my input back in March. I'm not having a great time trying
> to keep up with email, and the threads getting split up makes it a lot
> worse for me.

Not a problem, thanks for chiming in.

> On Fri, Sep 29, 2023 at 6:15 AM Daniel Gustafsson  wrote:
>> Running the same pgbench command on my laptop looking at the average 
>> connection
>> times, and the averaging that over five runs (low/avg/high) I see ~5% 
>> increase
>> over master with the patched version (compiled without assertions and debug):
>> 
>> Patched event_triggers on:  6.858 ms/7.038 ms/7.434 ms
>> Patched event_triggers off: 6.601 ms/6.958 ms/7.539 ms
>> Master: 6.676 ms/6.697 ms/6.760 ms
> 
> This seems kind of crazy to me. Why does it happen? It sounds to me
> like we must be doing a lot of extra catalog access to find out
> whether there are any on-login event triggers. Like maybe a sequential
> scan of pg_event_trigger.

That's exactly what happens, the patch is using BuildEventTriggerCache() to
build the hash for EVT which is then checked for login triggers.  This is
clearly the bottleneck and there needs to be a fast-path.  There used to be a
cache flag in an earlier version of the patch but it was a bit kludgy; a
version of that needs to be reimplemented for this patch to fly.

> I think a lot of users would say that logins on PostgreSQL are too slow 
> already.

Agreed.

--
Daniel Gustafsson





Re: CHECK Constraint Deferrable

2023-10-03 Thread David G. Johnston
On Monday, October 2, 2023, Andreas Joseph Krogh  wrote:

> På fredag 07. juli 2023 kl. 13:50:44, skrev Dilip Kumar <
> dilipbal...@gmail.com>:
>
> On Wed, Jul 5, 2023 at 3:08 PM Himanshu Upadhyaya
>  wrote:
> >
> > Hi,
> >
> > Currently, there is no support for CHECK constraint DEFERRABLE in a
> create table statement.
> > SQL standard specifies that CHECK constraint can be defined as
> DEFERRABLE.
>
> I think this is a valid argument that this is part of SQL standard so
> it would be good addition to PostgreSQL.  So +1 for the feature.
>
> But I am wondering whether there are some real-world use cases for
> deferred CHECK/NOT NULL constraints?  I mean like for foreign key
> constraints if there is a cyclic dependency between two tables then
> deferring the constraint is the simplest way to insert without error.
>
>
>
> The real-world use case, at least for me, is when using an ORM. For large
> object-graphs ORMs have a tendency to INSERT first with NULLs then UPDATE
> the “NOT NULLs” later.
>
> “Rewrite the ORM” is not an option for most of us…
>
Between this and Vik's comment it sounds like we should probably require a
patch in this area to solve both the not-null and check constraint deferral
omissions, not just one of them (alternatively, let's solve the not-null
one first).
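
For concreteness, the SQL-standard behavior being asked for would look
something like this sketch (CHECK ... DEFERRABLE is not accepted by
PostgreSQL today):

  CREATE TABLE t (
      a int,
      b int,
      CONSTRAINT b_filled CHECK (b IS NOT NULL) DEFERRABLE INITIALLY DEFERRED
  );

  BEGIN;
  INSERT INTO t VALUES (1, NULL);    -- ORM inserts with NULL first
  UPDATE t SET b = 42 WHERE a = 1;   -- and fills in the value later
  COMMIT;                            -- the constraint is checked only here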

David J.


Re: Modernize const handling with readline

2023-10-03 Thread Tom Lane
Peter Eisentraut  writes:
> The comment
>  /* On some platforms, readline is declared as readline(char *) */
> is obsolete.  The casting away of const can be removed.

+1, that's surely not of interest on anything we still support.

regards, tom lane




Re: On login trigger: take three

2023-10-03 Thread Robert Haas
On Tue, Oct 3, 2023 at 9:43 AM Daniel Gustafsson  wrote:
> That's exactly what happens, the patch is using BuildEventTriggerCache() to
> build the hash for EVT which is then checked for login triggers.  This is
> clearly the bottleneck and there needs to be a fast-path.  There used to be a
> cache flag in an earlier version of the patch but it was a but klugy, a 
> version
> of that needs to be reimplemented for this patch to fly.

So I haven't looked at this patch, but are we basically saying that only
the superuser can create login triggers, and if they do, those
triggers apply to every single user on the system? That would seem to
be the logical extension of the existing event trigger mechanism, but
it isn't obviously as good of a fit for this case as it is for other
cases where event triggers are a thing.
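
For reference, the proposal piggybacks on the existing event trigger
syntax, roughly as below (taken from the patch under discussion, so the
details may change):

  CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$
  BEGIN
      RAISE NOTICE 'welcome, %', session_user;
  END;
  $$ LANGUAGE plpgsql;

  CREATE EVENT TRIGGER login_trigger ON login
      EXECUTE FUNCTION on_login_proc();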

Changing the catalog representation could be a way around this. What
if you only allowed one login trigger per database, and instead of
being stored in pg_event_trigger, the OID of the function gets
recorded in the pg_database row? Then this would be a lot cheaper
since we have to fetch the pg_database row anyway. Or change the SQL
syntax to something entirely new so you can have different login
triggers for different users -- and maybe users are allowed to create
their own -- but the relevant ones can be found by an index scan
instead of a sequential scan.

I'm just spitballing here. If you think the present design is good and
just want to try to speed it up, I'm not deeply opposed to that. But
it's also not obvious to me how to stick a cache in front of something
that's basically a full-table scan.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-10-03 Thread Drouvot, Bertrand

Hi,

On 10/3/23 12:54 PM, Amit Kapila wrote:

On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
 wrote:


On 9/29/23 1:33 PM, Amit Kapila wrote:

On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
 wrote:





- probably open corner cases like: what if a standby is down? would that mean
that synchronize_slot_names not being sent to the primary would allow the
decoding on the primary to go ahead?



Good question. BTW, irrespective of whether we have
'standby_slot_names' parameters or not, how should we behave if
standby is down? Say, if 'synchronize_slot_names' is only specified on
standby then in such a situation primary won't be even aware that some
of the logical walsenders need to wait.


Exactly, that's why I was thinking of keeping standby_slot_names to address
this scenario. In such a case one could simply decide to keep or remove
the associated physical replication slot from standby_slot_names. Keep would
mean "wait" and removing would mean allow to decode on the primary.


OTOH, one can say that users
should configure 'synchronize_slot_names' on both primary and standby
but note that this value could be different for different standby's,
so we can't configure it on primary.



Yeah, I think that's a good use case for standby_slot_names, what do you think?



But, even if we keep 'standby_slot_names' for this purpose, the
primary doesn't know the value of 'synchronize_slot_names' once the
standby is down and/or the primary is restarted. So, how will we know
which logical WAL senders need to wait for 'standby_slot_names'?



Yeah right, I also think we'd need:

- synchronize_slot_names on both primary and standby

But now we would need to take care of different standbys having different
values (as you said up-thread).

Thinking out loud: What about a single GUC on the primary (not
standby_slot_names nor synchronize_slot_names) but say
logical_slots_wait_for_standby that could be a list of say
"logical_slot_name:physical_slot".

I think this GUC would help us define each walsender behavior (should the
standby(s) be up or down):

- don't wait if its associated logical_slot is not listed in this GUC
- or wait based on its associated "list" of mapped physical slots (would
probably have to deal with the min restart_lsn for all the corresponding
mapped ones).

I don't think we can avoid having to define at least one GUC on the primary
(at least to handle the case of standby(s) being down).
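
As a sketch of how that could be set (the GUC name and the value syntax are
hypothetical; nothing like this exists today):

  ALTER SYSTEM SET logical_slots_wait_for_standby =
      'logical_slot_1:physical_slot_a, logical_slot_2:physical_slot_b';
  SELECT pg_reload_conf();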

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: trying again to get incremental backup

2023-10-03 Thread Robert Haas
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas  wrote:
> > No objections to 0001/0002.
>
> Cool.

Nobody else objected either, so I went ahead and committed those. I'll
rebase the rest of the patches on top of the latest master and repost,
hopefully after addressing some of the other review comments from
Dilip and Jakub.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Allow deleting enumerated values from an existing enumerated data type

2023-10-03 Thread Tom Lane
Vik Fearing  writes:
> On 10/2/23 20:07, Dagfinn Ilmari Mannsåker wrote:
>> FWIW I'm +1 on this patch,

> Thanks.

>> and with Tom on dropping the "yet".  To me it
>> makes it sound like we intend to implement it soon (fsvo).

> I am not fundamentally opposed to it, nor to any other wordsmithing the 
> committer (probably Tom) wants to do.  The main point of the patch is to 
> list at least some of the problems that need to be solved in a correct 
> implementation.

Pushed with a bit more work on the text.

I left out the regression test, as it seems like it'd add test cycles
to little purpose.  It won't do anything to improve the odds that
someone finds this text.

regards, tom lane




Re: Allow deleting enumerated values from an existing enumerated data type

2023-10-03 Thread Vik Fearing

On 10/3/23 17:44, Tom Lane wrote:

Vik Fearing  writes:

On 10/2/23 20:07, Dagfinn Ilmari Mannsåker wrote:

FWIW I'm +1 on this patch,



Thanks.



and with Tom on dropping the "yet".  To me it
makes it sound like we intend to implement it soon (fsvo).



I am not fundamentally opposed to it, nor to any other wordsmithing the
committer (probably Tom) wants to do.  The main point of the patch is to
list at least some of the problems that need to be solved in a correct
implementation.


Pushed with a bit more work on the text.

I left out the regression test, as it seems like it'd add test cycles
to little purpose.  It won't do anything to improve the odds that
someone finds this text.


Thanks!
--
Vik Fearing





Re: Remove IndexInfo.ii_OpclassOptions field

2023-10-03 Thread Peter Eisentraut

On 30.08.23 02:51, Michael Paquier wrote:

On Tue, Aug 29, 2023 at 10:51:10AM +0200, Peter Eisentraut wrote:

At a glance, however, I think my patch is (a) not related, and (b) if it
were, it would probably *help*, because the change is to not allocate any
long-lived structures that no one needs and that might get out of date.


Hmm, yeah, perhaps you're right about (b) here.  I have a few other
high-priority items for stable branches on my board before being able
to look at all this in more detail, unfortunately, so feel free to
ignore me if you think that this is an improvement anyway even
regarding the other issue discussed.


I have committed this.





Re: Synchronizing slots from primary to standby

2023-10-03 Thread shveta malik
On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 10/3/23 12:54 PM, Amit Kapila wrote:
> > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
> >  wrote:
> >>
> >> On 9/29/23 1:33 PM, Amit Kapila wrote:
> >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
> >>>  wrote:
> 
> >>>
>  - probably open corner cases like: what if a standby is down? would that
>  mean that synchronize_slot_names not being sent to the primary would
>  allow the decoding on the primary to go ahead?
> 
> >>>
> >>> Good question. BTW, irrespective of whether we have
> >>> 'standby_slot_names' parameters or not, how should we behave if
> >>> standby is down? Say, if 'synchronize_slot_names' is only specified on
> >>> standby then in such a situation primary won't be even aware that some
> >>> of the logical walsenders need to wait.
> >>
> >> Exactly, that's why I was thinking of keeping standby_slot_names to address
> >> this scenario. In such a case one could simply decide to keep or remove
> >> the associated physical replication slot from standby_slot_names. Keep 
> >> would
> >> mean "wait" and removing would mean allow to decode on the primary.
> >>
> >>> OTOH, one can say that users
> >>> should configure 'synchronize_slot_names' on both primary and standby
> >>> but note that this value could be different for different standby's,
> >>> so we can't configure it on primary.
> >>>
> >>
> >> Yeah, I think that's a good use case for standby_slot_names, what do you 
> >> think?
> >>
> >
> > But, even if we keep 'standby_slot_names' for this purpose, the
> > primary doesn't know the value of 'synchronize_slot_names' once the
> > standby is down and/or the primary is restarted. So, how will we know
> > which logical WAL senders need to wait for 'standby_slot_names'?
> >
>
> Yeah right, I also think we'd need:
>
> - synchronize_slot_names on both primary and standby
>
> But now we would need to take care of different standbys having different
> values (as you said up-thread).
>
> Thinking out loud: What about a single GUC on the primary (not 
> standby_slot_names nor
> synchronize_slot_names) but say logical_slots_wait_for_standby that could be 
> a list of say
> "logical_slot_name:physical_slot".
>
> I think this GUC would help us define each walsender behavior (should the 
> standby(s)
> be up or down):
>

It may help in defining the walsender's behaviour better, for sure. But
the problem I see once we start defining sync-slot-names on the primary
(in any form, whether as an independent GUC or as the above mapping GUC) is
that it then needs to be in sync with the standbys, as each standby needs
to maintain its own sync-slot-names GUC to make it aware of what it needs
to sync. This brings us back to the original question of how we actually
keep these configurations in sync between primary and standby if we plan
to maintain it on both.


> - don't wait if its associated logical_slot is not listed in this GUC
> - or wait based on its associated "list" of mapped physical slots (would 
> probably
> have to deal with the min restart_lsn for all the corresponding mapped ones).
>
> I don't think we can avoid having to define at least one GUC on the primary 
> (at least to
> handle the case of standby(s) being down).
>
> Thoughts?
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com




Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

2023-10-03 Thread Robert Haas
On Fri, Sep 15, 2023 at 12:34 PM Melanie Plageman
 wrote:
> On Fri, Sep 15, 2023 at 9:24 AM Nazir Bilal Yavuz  wrote:
> > I found that pgBufferUsage.blk_{read|write}_time are zero although there 
> > are pgBufferUsage.local_blks_{read|written}
>
> Yes, good catch. This is a bug. I will note that at least in 15 and
> likely before, pgBufferUsage.local_blks_written is incremented for
> local buffers but pgBufferUsage.blk_write_time is only added to for
> shared buffers (in FlushBuffer()). I think it makes sense to propose a
> bug fix to stable branches counting blk_write_time for local buffers
> as well.

My first thought was to wonder whether this was even a bug. I
remembered that EXPLAIN treats shared, local, and temp buffers as
three separate categories of things. But it seems that someone decided
to conflate two of them for I/O timing purposes:

if (has_timing)
{
    appendStringInfoString(es->str, " shared/local");

 Notice this bit in particular.

    if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
        appendStringInfo(es->str, " read=%0.3f",
                         INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
    if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
        appendStringInfo(es->str, " write=%0.3f",
                         INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
    if (has_temp_timing)
        appendStringInfoChar(es->str, ',');
}
if (has_temp_timing)
{
    appendStringInfoString(es->str, " temp");
    if (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time))
        appendStringInfo(es->str, " read=%0.3f",
                         INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
    if (!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time))
        appendStringInfo(es->str, " write=%0.3f",
                         INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
}

Given that, I'm inclined to agree that this is a bug. But we might
need to go through and make sure all of the code that deals with these
counters is on the same page about what the values represent. Maybe
there is code lurking somewhere that thinks these counters are only
supposed to include "shared" rather than, as the fragment above
suggests, "shared/local".

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Annoying build warnings from latest Apple toolchain

2023-10-03 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> On 2023-09-29 12:14:40 -0400, Tom Lane wrote:
>>> Therefore, I think the prudent thing to do in the back branches is use the
>>> patch I posted before, to suppress the duplicate -l switches only on macOS.
>>> In HEAD, I propose we simplify life by doing it everywhere, as attached.

>> Makes sense.

> Done that way.

So, in the no-good-deed-goes-unpunished department, I see that Noah's
AIX animals are now complaining about a few duplicate symbols, eg
while building initdb:

ld: 0711-224 WARNING: Duplicate symbol: .pg_encoding_to_char
ld: 0711-224 WARNING: Duplicate symbol: .pg_valid_server_encoding
ld: 0711-224 WARNING: Duplicate symbol: .pg_valid_server_encoding_id
ld: 0711-224 WARNING: Duplicate symbol: .pg_char_to_encoding

It's far from clear to me why we see this warning now when we didn't
before, because there are strictly fewer sources of these symbols in
the link than before.  They are available from libpgcommon and are
also intentionally exported from libpq.  Presumably, initdb is now
linking to these symbols from libpq where before it got them from the
first mention of libpgcommon, but why is that any more worthy of a
warning?

Anyway, I don't have a huge problem with just ignoring these warnings
as such, since AFAICT they're only showing up on AIX.

However, thinking about this made me realize that there's a related
problem.  At one time we had intentionally made initdb use its own
copy of these routines rather than letting it get them from libpq.
The reason is explained in commit 8468146b0:

Fix the inadvertent libpq ABI breakage discovered by Martin Pitt: the
renumbering of encoding IDs done between 8.2 and 8.3 turns out to break 8.2
initdb and psql if they are run with an 8.3beta1 libpq.so.

This policy is still memorialized in a comment in initdb/Makefile:

# Note: it's important that we link to encnames.o from libpgcommon, not
# from libpq, else we have risks of version skew if we run with a libpq
# shared library from a different PG version.  The libpq_pgport macro
# should ensure that that happens.

and pg_wchar.h has this related comment:

 * XXX  We must avoid renumbering any backend encoding until libpq's major
 * version number is increased beyond 5; it turns out that the backend
 * encoding IDs are effectively part of libpq's ABI as far as 8.2 initdb and
 * psql are concerned.

Now it's not happening that way.  How big a problem is that?

In the case of psql, I think it's actually fixing a latent bug.
8468146b0's changes in psql clearly intend that psql will be
linking to libpq's copies of pg_char_to_encoding and
pg_valid_server_encoding_id, which is appropriate because it's
dealing with libpq's encoding IDs.  We broke that when we moved
encnames.c into libpgcommon.  We've not made any encoding ID
redefinitions since then, and we'd be unlikely to renumber PG_UTF8
in any case, but clearly linking to libpq's copies is safer.

(Which means that the makefiles are now OK, but the meson
build is not: we need libpq to be linked before libpgcommon.)

However, in the case of initdb, we had better be using the same
encoding IDs as the backend code we are setting up the database for.
If we ever add/renumber any backend-safe encodings again, we'd be
exposed to the same problem that 8.3 had.

Assuming that this problem is restricted to initdb, which I think
is true, probably the best fix is to cause the initdb link *only*
to link libpgcommon before libpq.  Every other non-backend program
is interested in libpq's encoding IDs if it cares at all.

Thoughts?

regards, tom lane




Re: On login trigger: take three

2023-10-03 Thread Alexander Korotkov
Hi, Robert!

On Tue, Oct 3, 2023 at 5:21 PM Robert Haas  wrote:
> On Tue, Oct 3, 2023 at 9:43 AM Daniel Gustafsson  wrote:
> > That's exactly what happens, the patch is using BuildEventTriggerCache() to
> > build the hash for EVT which is then checked for login triggers.  This is
> > clearly the bottleneck and there needs to be a fast-path.  There used to
> > be a cache flag in an earlier version of the patch but it was a bit
> > kludgy; a version of that needs to be reimplemented for this patch to fly.
>
> So I haven't looked at this patch, but are we basically saying that only
> the superuser can create login triggers, and if they do, those
> triggers apply to every single user on the system? That would seem to
> be the logical extension of the existing event trigger mechanism, but
> it isn't obviously as good of a fit for this case as it is for other
> cases where event triggers are a thing.
>
> Changing the catalog representation could be a way around this. What
> if you only allowed one login trigger per database, and instead of
> being stored in pg_event_trigger, the OID of the function gets
> recorded in the pg_database row? Then this would be a lot cheaper
> since we have to fetch the pg_database row anyway. Or change the SQL
> syntax to something entirely new so you can have different login
> triggers for different users -- and maybe users are allowed to create
> their own -- but the relevant ones can be found by an index scan
> instead of a sequential scan.
>
> I'm just spitballing here. If you think the present design is good and
> just want to try to speed it up, I'm not deeply opposed to that. But
> it's also not obvious to me how to stick a cache in front of something
> that's basically a full-table scan.

Thank you for the interesting ideas. I'd like to try to revive the
version with the flag in pg_database.  Will use other ideas as backup
if no success.

--
Regards,
Alexander Korotkov




Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2023-10-03 Thread Karl O. Pinc
On Mon, 2 Oct 2023 15:18:32 -0500
"Karl O. Pinc"  wrote:

Version 7

Added:
v7-0016-Predicate-locks-are-held-per-cluster-not-per-data.patch

> > > On Mon, 25 Sep 2023 23:37:44 -0500
> > > "Karl O. Pinc"  wrote:
> > > 
> > > > > On Mon, 25 Sep 2023 17:55:59 -0500
> > > > > "Karl O. Pinc"  wrote:
> > > > > 
> > > > > > On Mon, 25 Sep 2023 14:14:37 +0200
> > > > > > Daniel Gustafsson  wrote:  
> > > > > 
> > > > > > > Once done you can do "git format-patch origin/master -v 1"
> > > > > > > which will generate a set of n patches named v1-0001
> > > > > > > through v1-000n.  
> > > 
> > > > > > I am not particularly confident in the top-line commit
> > > > > > descriptions.  
> > > > > 
> > > > > > The bulk of the commit descriptions are very wordy  
> > > > > 
> > > > > > Listing all the attachments here for future discussion:

v7-0001-Change-section-heading-to-better-reflect-saving-a.patch
v7-0002-Change-section-heading-to-better-describe-referen.patch
v7-0003-Better-section-heading-for-plpgsql-exception-trap.patch
v7-0004-Describe-how-to-raise-an-exception-in-the-excepti.patch
v7-0005-Improve-sentences-in-overview-of-system-configura.patch
v7-0006-Provide-examples-of-listing-all-settings.patch
v7-0007-Cleanup-summary-of-role-powers.patch
v7-0008-Explain-the-difference-between-role-attributes-an.patch
v7-0009-Document-the-oidvector-type.patch
v7-0010-Improve-sentences-about-the-significance-of-the-s.patch
v7-0011-Add-a-sub-section-to-describe-schema-resolution.patch
v7-0012-Explain-role-management.patch
v7-0013-Hyperlink-from-CREATE-FUNCTION-reference-page-to-.patch
v7-0014-Add-index-entries-for-parallel-safety.patch
v7-0015-Trigger-authors-need-not-worry-about-parallelism.patch
v7-0016-Predicate-locks-are-held-per-cluster-not-per-data.patch

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
From 122665c4155698abe88e2bd17639a991791b94e3 Mon Sep 17 00:00:00 2001
From: "Karl O. Pinc" 
Date: Sun, 24 Sep 2023 15:49:30 -0500
Subject: [PATCH v7 01/16] Change section heading to better reflect saving a
 result in variable(s)

The current section title of "Executing a Command with a Single-Row
Result" does not reflect what the section is really about.  Other
sections make clear how to _execute_ commands, single-row result or not.
What this section is about is how to _save_ a single row of results into
variable(s).

It would be nice to talk about saving results into variables in the
section heading but I couldn't come up with anything pithy.  "Saving a
Single-Row of a Command's Result" seems good enough, especially since
there's few other places to save results other than in variables.
---
 doc/src/sgml/plpgsql.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index f55e901c7e..8747e84245 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1126,7 +1126,7 @@ PERFORM create_mv('cs_session_page_requests_mv', my_query);

 

-Executing a Command with a Single-Row Result
+Saving a Single-Row of a Command's Result
 
 
  SELECT INTO
-- 
2.30.2

From 4de4a31d41124dfa793cc5cce0516673811ea414 Mon Sep 17 00:00:00 2001
From: "Karl O. Pinc" 
Date: Sun, 24 Sep 2023 15:52:21 -0500
Subject: [PATCH v7 02/16] Change section heading to better describe reference
 of existing types

The section heading of "Copying Types" does not reflect what the
section is about.  It is not about making copies of data types but
about using the data type of existing columns (or rows) in new type
declarations without having to know what the existing type is.

"Re-Using the Type of Columns and Variables" seems adequate.  Getting
something in there about declarations seems too wordy.  I thought
perhaps "Referencing" instead of "Re-Using", but "referencing" isn't
perfect and "re-using" is generic enough, shorter, and simpler to read.
---
 doc/src/sgml/plpgsql.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 8747e84245..874578265e 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -672,7 +672,7 @@ DECLARE

 
   
-   Copying Types
+   Re-Using the Type of Columns and Variables
 
 
 variable%TYPE
-- 
2.30.2

From 80c2b8ef7ad6e610f5c7bdc61b827983a87110e2 Mon Sep 17 00:00:00 2001
From: "Karl O. Pinc" 
Date: Sun, 24 Sep 2023 16:03:29 -0500
Subject: [PATCH v7 03/16] Better section heading for plpgsql exception
 trapping

The docs seem to use "error" and "exception" interchangeably, perhaps
50% each.  But they never say that they are the same thing, and in the
larger world they are not.  Errors tend to be dropped on the floor and
usually halt execution, whereas exceptions can be trapped
and give the programmer more control over the flow of the program.
(Although, to be fair, exceptions are a subset of

Re: RFC: Logging plan of the running query

2023-10-03 Thread James Coleman
On Thu, Sep 7, 2023 at 2:09 AM torikoshia  wrote:
>
> On 2023-09-06 11:17, James Coleman wrote:
>
> >> > I've never been able to reproduce it (haven't tested the new version,
> >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> >> > (first buster and now bullseye).
> >> >
> >> > I'm attaching several stacktraces in the hope that they provide some
> >> > clues. These all match the ps output I sent earlier, though note in
> >> > that output there is both the regress instance and my test instance
> >> > (pid 3213249) running (different ports, of course, and they are from
> >> > the exact same compilation run). I've attached ps output for the
> >> > postgres processes under the make check process to simplify cross
> >> > referencing.
> >> >
> >> > A few interesting things:
> >> > - There's definitely a lock on a relation that seems to be what's
> >> > blocking the processes.
> >> > - When I try to connect with psql the process forks but then hangs
> >> > (see the ps output with task names stuck in "authentication"). I've
> >> > also included a trace from one of these.
> >>
> >> Thanks for sharing them!
> >>
> >> Many processes are waiting to acquire the LW lock, including the
> >> process
> >> trying to output the plan(select1.trace).
> >>
> >> I suspect that this is due to a lock that was acquired prior to being
> >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able
> >> to
> >> reproduce the same situation..
> >>
> >
> > I don't have time immediately to dig into this, but perhaps loading up
> > the core dumps would allow us to see what query is running in each
> > backend process (if it hasn't already been discarded by that point)
> > and thereby determine what point in each test process led to the error
> > condition?
>
> Thanks for the suggestion.
> I am concerned that core dumps may not be readable on different
> operating systems or other environments. (Unfortunately, I do not have
> Debian on hand)
>
> It seems that we can know what queries were running from the stack
> traces you shared.
> As described above, I suspect a lock which was acquired prior to
> ProcessLogQueryPlanInterrupt() caused the issue.
> I will try a little more to see if I can devise a way to create the same
> situation.
>
> > Alternatively we might be able to apply the same trick to the test
> > client instead...
> >
> > BTW, for my own easy reference in this thread: relid 1259 is pg_class
> > if I'm not mistaken.
>
Yeah, and I think it's strange that the lock on 1259 appears twice and
> must be avoided.
>
>#10 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at
> lmgr.c:117
>..
>#49 0x559d61b4989d in ProcessLogQueryPlanInterrupt () at
> explain.c:5158
>..
>#53 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at
> lmgr.c:117

I chatted with Andres and David about this at PGConf.NYC, and I think
what we need to do is explicitly disallow running this code any time
we are inside of lock acquisition code.

Regards,
James Coleman




Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence. 
> There
> is no standard Unicode `normalize()` that would cause the above
> equality
> predicate to be true.  If you normalize to NFD (normal form
> decomposed)
> then a _prefix_ of those two strings will be equal, but that's
> clearly
> not what you're looking for.

From [1]:

"Unicode Normalization Forms are formally defined normalizations of
Unicode strings which make it possible to determine whether any two
Unicode strings are equivalent to each other. Depending on the
particular Unicode Normalization Form, that equivalence can either be a
canonical equivalence or a compatibility equivalence... A binary
comparison of the transformed strings will then determine equivalence."

NFC and NFD are based on Canonical Equivalence.

"Canonical equivalence is a fundamental equivalency between characters
or sequences of characters which represent the same abstract character,
and which when correctly displayed should always have the same visual
appearance and behavior."

Can you explain why NFC (the default form of normalization used by the
postgres normalize() function), followed by memcmp(), is not the right
thing to use to determine Canonical Equivalence?

Or are you saying that Canonical Equivalence is not a useful thing to
test?

What do you mean about the "prefix"?

In Postgres today:

  SELECT normalize(U&'\0061\0301', nfc)::bytea; -- \xc3a1
  SELECT normalize(U&'\00E1', nfc)::bytea; -- \xc3a1

  SELECT normalize(U&'\0061\0301', nfd)::bytea; -- \x61cc81
  SELECT normalize(U&'\00E1', nfd)::bytea; -- \x61cc81

which looks useful to me, but I assume you are saying that it doesn't
generalize well to other cases?

[1] https://unicode.org/reports/tr15/

> There are two ways to write 'á' in Unicode: one is pre-composed (one
> codepoint) and the other is decomposed (two codepoints in this
> specific
> case), and it would be nice to be able to preserve input form when
> storing strings but then still be able to index and match them
> form-insensitively (in the case of 'á' both equivalent
> representations
> should be considered equal, and for UNIQUE indexes they should be
> considered the same).

Sometimes preserving input differences is a good thing, other times
it's not, depending on the context. Almost any data type has some
aspects of the input that might not be preserved -- leading zeros in a
number, or whitespace in jsonb, etc.

If text is stored as normalized with NFC, it could be frustrating if
the retrieved string has a different binary representation than the
source data. But it could also be frustrating to look at two strings
made up of ordinary characters that look identical and for the database
to consider them unequal.

Regards,
Jeff Davis






Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-10-03 Thread James Coleman
On Mon, Oct 2, 2023 at 2:55 PM Robert Haas  wrote:
>
> On Sat, Sep 30, 2023 at 1:05 AM Peter Geoghegan  wrote:
> > > This is why I discovered it: it says "indexes do not reference their
> > > page item identifiers", which is manifestly not true when talking
> > > about the root item, and in fact would defeat the whole purpose of HOT
> > (at least in an old-to-new chain like Postgres uses).
> >
> > Yeah, but...that's not what was intended. Obviously, the index hasn't
> > changed, and we expect index scans to continue to give correct
> > answers. So it is pretty strongly implied that it continues to point
> > to something valid.
>
> I took a look at this. I agree with James that the current wording is
> just plain wrong.
>
>  periodic vacuum operations.  (This is possible because indexes
>  do not reference their page
>  item identifiers.)
>
> Here, the antecedent of "their" is "old versions of updated rows". It
> is slightly unclear whether we should interpret this to mean (a) the
> tuples together with the line pointers which point to them or (b) only
> the tuple data to which the line pointers point. If (a), then it's
> wrong because we can't actually get rid of the root line pointer;
> rather, we have to change it to a redirect. If (b), then it's wrong
> because heap_page_prune() removes dead tuples in this sense whether
> HOT is involved or not. I can see no interpretation under which this
> statement is true as written.
>
> I reviewed James's proposed alternative:
>
> + periodic vacuum operations. However because indexes reference the old
> + version's page item identifiers
> + the line pointer must remain in place. Such a line pointer has its
> + LP_REDIRECT bit set and its offset updated to the
> + page item identifiers of
> + the updated row.
>
> I don't think that's really right either. That's true for the root
> line pointer, but the phrasing seems to be referring to old versions
> generally, which would seem to include not only the root, for which
> this is correct, and also all subsequent now-dead row versions, for
> which it is wrong.
>
> Here is my attempt:
>
> When a row is updated multiple times, row versions other than the
> oldest and the newest can be completely removed during normal
> operation, including SELECTs, instead of requiring
> periodic vacuum operations. (Indexes always refer to the page
> item identifiers of the
> original row version. The tuple data associated with that row version
> is removed, and its item identifier is converted to a redirect that
> points to the oldest version that may still be visible to some
> concurrent transaction. Intermediate row versions that are no longer
> visible to anyone are completely removed, and the associated page item
> identifiers are made available for reuse.)

Hi Robert,

Thanks for reviewing!

I like your changes. Reading through this several times, and noting
Peter's comments about pruning being more than just HOT, I'm thinking
that rather than a simple fixup for this one paragraph what we
actually want is to split out the concept of page pruning into its own
section of the storage docs. Attached is a patch that does that,
incorporating much of your language about LP_REDIRECT, along with
LP_DEAD so that readers know this affects more than just heap-only
tuple workloads.

Regards,
James


v2-0001-Correct-HOT-docs-to-account-for-LP_REDIRECT.patch
Description: Binary data


Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Yeah, because we lose forward-compatibility of some useful operations.

> Here, Jeff mentions normalization, but I think it's a major issue
> with
> collation support. If new code points are added, users can put them
> into the database before they are known to the collation library, and
> then when they become known to the collation library the sort order
> changes and indexes break.

The collation version number may reflect the change in understanding
about assigned code points that may affect collation -- though I'd like
to understand whether this is guaranteed or not.

Regardless, given that (a) we don't have a good story for migrating to
new collation versions; and (b) it would be painful to rebuild indexes
even if we did; then you are right that it's a problem.

>  Would we endorse a proposal to make
> pg_catalog.text with encoding UTF-8 reject code points that aren't
> yet
> known to the collation library? To do so would be tighten things up
> considerably from where they stand today, and the way things stand
> today is already rigid enough to cause problems for some users.

What problems exist today due to the rigidity of text?

I assume you mean because we reject invalid byte sequences? Yeah, I'm
sure that causes a problem for some (especially migrations), but it's
difficult for me to imagine a database working well with no rules at
all for the basic data types.

> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type
> would get enough traction that we'd want it to be part of PostgreSQL
> itself.

At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.

I also think we should expose some SQL functions to check for
unassigned code points. That sounds useful, especially since we already
expose normalization functions.

One could easily imagine a domain with CHECK(NOT
contains_unassigned(a)). Or an extension with a data type that uses the
internal functions.
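
A minimal sketch of that idea, assuming a hypothetical
contains_unassigned() function (no such function exists in core today):

  CREATE DOMAIN stable_text AS text
      CHECK (NOT contains_unassigned(VALUE));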

Whether we ever get to a core data type -- and more importantly,
whether anyone uses it -- I'm not sure.

>  But at the same time I can certainly understand why Jeff finds
> the status quo problematic.

Yeah, I am looking for a better compromise between:

  * everything is memcmp() and 'á' sometimes doesn't equal 'á'
(depending on code point sequence)
  * everything is constantly changing, indexes break, and text
comparisons are slow

A stable idea of unicode normalization based on using only assigned
code points is very tempting.

Regards,
Jeff Davis





Re: Annoying build warnings from latest Apple toolchain

2023-10-03 Thread Tom Lane
I wrote:
> Assuming that this problem is restricted to initdb, which I think
> is true, probably the best fix is to cause the initdb link *only*
> to link libpgcommon before libpq.  Every other non-backend program
> is interested in libpq's encoding IDs if it cares at all.

The more I thought about that the less I liked it.  We're trying to
get away from link order dependencies, not add more.  And the fact
that we've had a latent bug for awhile from random-ish changes in
link order should reinforce our desire to get out of that business.

So I experimented with fixing things so that the versions of these
functions exported by libpq have physically different names from those
that you'd get from linking to libpgcommon.a or libpgcommon_srv.a.
Then, there's certainty about which one a given usage will link to,
based on what the #define environment is when the call is compiled.

This leads to a pleasingly small patch, at least in the Makefile
universe (I've not attempted to sync the meson or MSVC infrastructure
with this yet).  As a bonus, it should silence those new warnings
on AIX.  A disadvantage is that this causes an ABI break for
backend extensions, so we couldn't consider back-patching it.
But I think that's fine given that the problem is only latent
in released branches.

Thoughts?

regards, tom lane


diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index d69bd89572..e80e57e457 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -16,13 +16,12 @@ subdir = src/bin/initdb
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(ICU_CFLAGS) $(CPPFLAGS)
-
 # Note: it's important that we link to encnames.o from libpgcommon, not
 # from libpq, else we have risks of version skew if we run with a libpq
-# shared library from a different PG version.  The libpq_pgport macro
-# should ensure that that happens.
-#
+# shared library from a different PG version.  Define
+# USE_PRIVATE_ENCODING_FUNCS to ensure that that happens.
+override CPPFLAGS := -DUSE_PRIVATE_ENCODING_FUNCS -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(ICU_CFLAGS) $(CPPFLAGS)
+
 # We need libpq only because fe_utils does.
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) $(ICU_LIBS)
 
diff --git a/src/common/Makefile b/src/common/Makefile
index cc5c54dcee..70884be00c 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -140,6 +140,13 @@ libpgcommon.a: $(OBJS_FRONTEND)
 	rm -f $@
 	$(AR) $(AROPT) $@ $^
 
+#
+# Files in libpgcommon.a should use/export the "xxx_private" versions
+# of pg_char_to_encoding() and friends.
+#
+$(OBJS_FRONTEND): CPPFLAGS += -DUSE_PRIVATE_ENCODING_FUNCS
+
+
 #
 # Shared library versions of object files
 #
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 25276b199f..7d2fad91e6 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -13,6 +13,8 @@
  *		included by libpq client programs.  In particular, a libpq client
  *		should not assume that the encoding IDs used by the version of libpq
  *		it's linked to match up with the IDs declared here.
+ *		To help prevent mistakes, relevant functions that are exported by
+ *		libpq have a physically different name when being referenced directly.
  *
  *-
  */
@@ -562,6 +564,23 @@ surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
 }
 
 
+/*
+ * The functions in this list are exported by libpq, and we need to be sure
+ * that we know which calls are satisfied by libpq and which are satisfied
+ * by static linkage to libpgcommon.  (This is because we might be using a
+ * libpq.so that's of a different major version and has different encoding
+ * IDs from what libpgcommon knows.)  The official function names are what
+ * is actually used in and exported by libpq, while the names exported by
+ * libpgcommon.a and libpgcommon_srv.a end in "_private".
+ */
+#if defined(USE_PRIVATE_ENCODING_FUNCS) || !defined(FRONTEND)
+#define pg_char_to_encoding			pg_char_to_encoding_private
+#define pg_encoding_to_char			pg_encoding_to_char_private
+#define pg_valid_server_encoding	pg_valid_server_encoding_private
+#define pg_valid_server_encoding_id	pg_valid_server_encoding_id_private
+#define pg_utf_mblen				pg_utf_mblen_private
+#endif
+
 /*
  * These functions are considered part of libpq's exported API and
  * are also declared in libpq-fe.h.


Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Nico Williams
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence. 
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true.  If you normalize to NFD
> > (normal form decomposed) then a _prefix_ of those two strings will
> > be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'a' correctly, thus I misunderstood your
post.

> From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
|  SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'a' correctly!  (It displays like 'a' but
it should display like 'á'.)

Bah.  So I'd (mis)interpreted you as saying that normalize('a') should
equal normalize('á').  Please disregard that part of my reply.

> > There are two ways to write 'á' in Unicode: one is pre-composed (one
> > codepoint) and the other is decomposed (two codepoints in this
> > specific case), and it would be nice to be able to preserve input
> > form when storing strings but then still be able to index and match
> > them form-insensitively (in the case of 'á' both equivalent
> > representations should be considered equal, and for UNIQUE indexes
> > they should be considered the same).
> 
> Sometimes preserving input differences is a good thing, other times
> it's not, depending on the context. Almost any data type has some
> aspects of the input that might not be preserved -- leading zeros in a
> number, or whitespace in jsonb, etc.

Almost every Latin input mode out there produces precomposed characters
and so they effectively produce NFC.  I'm not sure if the same is true
for, e.g., Hangul (Korean) and various other scripts.

But there are things out there that produce NFD.  Famously Apple's HFS+
uses NFD (or something very close to NFD).  So if you cut-n-paste things
that got normalized to NFD and paste them into contexts where
normalization isn't done, then you might start wanting to alter those
contexts to either normalize or be form-preserving/form-insensitive.
Sometimes you don't get to normalize, so you have to pick form-
preserving/form-insensitive behavior.

> If text is stored as normalized with NFC, it could be frustrating if
> the retrieved string has a different binary representation than the
> source data. But it could also be frustrating to look at two strings
> made up of ordinary characters that look identical and for the database
> to consider them unequal.

Exactly.  If you have such a case you might like the option to make your
database form-preserving and form-insensitive.  That means that indices
need to normalize strings, but tables need to store unnormalized
strings.

ZFS (filesystems are a bit like databases) does just that!
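
In Postgres today one can approximate form-insensitive matching with an
expression index over normalize(), while the table keeps the text exactly
as entered; a sketch:

  CREATE TABLE names (n text);
  CREATE INDEX names_nfc_idx ON names (normalize(n, nfc));

  -- finds both precomposed and decomposed spellings, using the index
  SELECT * FROM names WHERE normalize(n, nfc) = normalize('á', nfc);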

Nico
-- 




Re: Allow deleting enumerated values from an existing enumerated data type

2023-10-03 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-09-28 Th 14:46, Tom Lane wrote:
>> We went through all these points years ago when the enum feature
>> was first developed, as I recall.  Nobody thought that the ability
>> to remove an enum value was worth the amount of complexity it'd
>> entail.

> That's quite true, and I accept my part in this history. But I'm not 
> sure we were correct back then.

I think it was the right decision at the time, given that the
alternative was to not add the enum feature at all.  The question
is whether we're now prepared to do additional work to support DROP
VALUE.  But the tradeoff still looks pretty grim, because the
problems haven't gotten any easier.

I've been trying to convince myself that there'd be some value in
your idea about a DISABLE flag, but I feel like there's something
missing there.  The easiest implementation would be to have
enum_in() reject disabled values, while still allowing enum_out()
to print them.  But that doesn't seem to lead to nice results:

* You couldn't do, say,
SELECT * FROM my_table WHERE enum_col = 'disabled_value'
to look for rows that you need to clean up.  I guess this'd work:
SELECT * FROM my_table WHERE enum_col::text = 'disabled_value'
but it's un-obvious and could not use an index on enum_col.

* If any of the disabled values remain, dump/restore would fail.
Maybe you'd want that to be sure you got rid of them, but it sounds
like a foot-gun.  ("What do you mean, our only backup doesn't
restore?")  Probably people would wish for two different behaviors:
either don't list disabled values at all in the dumped CREATE TYPE,
or do list them but disable them only after loading data.  The latter
approach will still have problems in data-only restores, but there are
comparable hazards with things like foreign keys.  (pg_upgrade would
need still a third behavior, perhaps.)

On the whole this is still a long way from a clean easy-to-use DROP
facility, and it adds a lot of complexity of its own for pg_dump.
So I'm not sure we want to build it.
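
To make the idea concrete, it might look something like this (hypothetical
syntax; nothing of the sort exists):

  ALTER TYPE mood DISABLE VALUE 'meh';
  SELECT 'meh'::mood;               -- would now fail in enum_in()
  SELECT * FROM my_table
    WHERE enum_col::text = 'meh';   -- cleanup remains possible, but unindexed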

regards, tom lane




Re: Allow deleting enumerated values from an existing enumerated data type

2023-10-03 Thread Matthias van de Meent
On Tue, 3 Oct 2023 at 22:49, Tom Lane  wrote:
>
> Andrew Dunstan  writes:
> > On 2023-09-28 Th 14:46, Tom Lane wrote:
> >> We went through all these points years ago when the enum feature
> >> was first developed, as I recall.  Nobody thought that the ability
> >> to remove an enum value was worth the amount of complexity it'd
> >> entail.
>
> > That's quite true, and I accept my part in this history. But I'm not
> > sure we were correct back then.
>
> I think it was the right decision at the time, given that the
> alternative was to not add the enum feature at all.  The question
> is whether we're now prepared to do additional work to support DROP
> VALUE.  But the tradeoff still looks pretty grim, because the
> problems haven't gotten any easier.
>
> I've been trying to convince myself that there'd be some value in
> your idea about a DISABLE flag, but I feel like there's something
> missing there.  The easiest implementation would be to have
> enum_in() reject disabled values, while still allowing enum_out()
> to print them.  But that doesn't seem to lead to nice results:
>
> [...]
>
> On the whole this is still a long way from a clean easy-to-use DROP
> facility, and it adds a lot of complexity of its own for pg_dump.
> So I'm not sure we want to build it.

I don't quite get what the hard problem is that we haven't already
solved for other systems:
We already can add additional constraints to domains (e.g. VALUE::int
<> 4), which (according to docs) scan existing data columns for
violations. We already drop columns without rewriting the table to
remove the column's data, and reject new data insertions for those
still-in-the-catalogs-but-inaccessible columns.

So, if a user wants to drop an enum value, why couldn't we "just" use
the DOMAIN facilities and 1.) add a constraint WHERE value NOT IN
(deleted_values), and after validation of that constraint 2.) mark the
enum value as deleted like we do with table column's pg_attribute
entries?
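
For reference, the existing domain facility being leaned on here can be
demonstrated directly (only the domain half; the proposed enum step has no
syntax today):

```
CREATE DOMAIN small_int AS int;
CREATE TABLE t (v small_int);
INSERT INTO t VALUES (4);
-- Adding the constraint scans existing rows and rejects the change on
-- violations, which is the validation step referred to above:
ALTER DOMAIN small_int ADD CONSTRAINT not_four CHECK (VALUE <> 4);
-- ERROR:  column "v" of table "t" contains values that violate the new constraint
```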

The only real issue that I can think of is making sure that concurrent
backends don't modify this data, but that shouldn't be very different
from the other locks we already have to take in e.g. ALTER TYPE ...
DROP ATTRIBUTE.

Kind regards,

Matthias van de Meent




Re: Allow deleting enumerated values from an existing enumerated data type

2023-10-03 Thread Tom Lane
Matthias van de Meent  writes:
> I don't quite get what the hard problem is that we haven't already
> solved for other systems:
> We already can add additional constraints to domains (e.g. VALUE::int
> <> 4), which (according to docs) scan existing data columns for
> violations.

That's "solved" only for rather small values of "solved".  While the
code does try to look for violations of the new constraint, there is
no interlock against race conditions (ie, concurrent insertions of
a conflicting value).  It doesn't check temp tables belonging to
other backends, because it can't.  And IIRC it doesn't look for
instances in metadata, such as stored views.

The reason we've considered that Good Enough(TM) for domain
constraints is that if something does sneak through those loopholes,
nothing terribly surprising happens.  You've got a value there that
shouldn't be there, but that's mostly your own fault, and the
system continues to behave sanely.  Also you don't have any problem
introspecting what you've got, because such values will still print
normally.  Plus, we can't really guarantee that users won't get
into such a state in other ways, for example if their constraint
isn't really immutable.

The problem with dropping an enum value is that surprising things
might very well happen, because removal of the pg_enum row risks
comparisons failing if they involve the dropped value.  Thus for
example you might find yourself with a broken index that fails
all insertion and search attempts, even if you'd carefully removed
every user-visible instance of the doomed value: an instance of
it high up in the index tree will break most index accesses.
Even for values in user-visible places like views, the fact that
enum_out will fail doesn't make it any easier to figure out what
is wrong.

We might be able to get to a place where the surprise factor is
low enough to tolerate, but the domain-constraint precedent isn't
good enough for that IMO.

Andrew's idea of DISABLE rather than full DROP is one way of
ameliorating these problems: comparisons would still work, and
we can still print a value that perhaps shouldn't have been there.
But it's not without other problems.

> We already drop columns without rewriting the table to
> remove the column's data, and reject new data insertions for those
> still-in-the-catalogs-but-inaccessible columns.

Those cases don't seem to have a lot of connection to the enum problem.

> The only real issue that I can think of is making sure that concurrent
> backends don't modify this data, but that shouldn't be very different
> from the other locks we already have to take in e.g. ALTER TYPE ...
> DROP ATTRIBUTE.

I'd bet a good deal of money that those cases aren't too bulletproof.
We do not lock types simply because somebody has a value of the type
in flight somewhere in a query, and the cost of doing so would be
quite discouraging I fear.  On the whole, I'd rather accept the
idea that the DROP might not be completely watertight; but then
we have to work out the details of coping with orphaned values
in an acceptable way.

regards, tom lane




Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2023-10-03 Thread David G. Johnston
On Tue, Oct 3, 2023 at 10:56 AM Karl O. Pinc  wrote:

> On Mon, 2 Oct 2023 15:18:32 -0500
> "Karl O. Pinc"  wrote:
>
> Version 7
>
>
0001 - I would just call the section:
Capturing Command Results into Variables
I would add commentary in there that it is only possible for variables to
take on a single value at any given time, and so in order to handle multiple
row results you need to instantiate a loop as per 43.6.6 (see the sketch below).
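
A sketch of the kind of example such commentary could point at
(hypothetical doc snippet, PL/pgSQL):

```
DO $$
DECLARE
    r record;   -- a variable holds one value (here, one row) at a time
BEGIN
    -- a multi-row result therefore needs a loop (per 43.6.6):
    FOR r IN SELECT schema_name FROM information_schema.schemata LOOP
        RAISE NOTICE 'schema: %', r.schema_name;
    END LOOP;
END
$$;
```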

0002 - {Inferred | Indirect} Types ?
We are already in the Declarations section so the fact we are declaring new
variables is already covered.
"Instead of literally writing a type name you can write variable%TYPE and
the system will indirectly apply the then-current type of the named
variable to the newly declared variable." (using "copy the then-current"
reads pretty well and makes the existing title usable...)

0003 - The noun "Exception" here means "deviating from the normal flow of
the code", not, "A subclass of error".  I don't see this title change being
particularly beneficial.

0004 - Agreed, but "You can raise an error explicitly as described in
"Errors and Messages".  I would not use the phrase "raise an exception", it
doesn't fit.

0005 - This tone of voice is used throughout the introductory documentation
sections, this single change doesn't seem warranted.

0006 - Useful.  The view itself provides all configuration parameters known
to the system, the "or selected" isn't true of the view itself, and it
seems to go without saying that the user can add a where clause to any
query they write using that view.  At most I'd make one of the examples
include a where clause.

0007 - I'm leaning toward this area being due for some improvement,
especially given the v16 changes bring new attention to it, but this patch
doesn't scream "improvement" to me.

0008 - Same as 0007.  INHERIT is no longer an attribute though, it is a
property of membership.  This seems more acceptable on its own but I'd need
more time to take in the big picture myself.

That's it for now.  I'll look over the other 8 when I can.

David J.


Re: Index range search optimization

2023-10-03 Thread Pavel Borisov
Hi!

On Fri, 29 Sept 2023 at 10:35, Alexander Korotkov  wrote:
>
> Hi, Peter.
>
> On Fri, Sep 29, 2023 at 4:57 AM Peter Geoghegan  wrote:
> > On Fri, Sep 22, 2023 at 7:24 AM Alexander Korotkov  
> > wrote:
> > > The thing is that NULLs could appear in the middle of matching values.
> > >
> > > # WITH t (a, b) AS (VALUES ('a', 'b'), ('a', NULL), ('b', 'a'))
> > > SELECT a, b, (a, b) > ('a', 'a') FROM t ORDER BY (a, b);
> > >  a |  b   | ?column?
> > > ---+------+----------
> > >  a | b    | t
> > >  a | NULL | NULL
> > >  b | a    | t
> > > (3 rows)
> > >
> > > So we can't just skip the row comparison operator, because we can meet
> > > NULL at any place.
> >
> > But why would SK_ROW_HEADER be any different? Is it related to this
> > existing case inside _bt_check_rowcompare()?:
> >
> > if (subkey->sk_flags & SK_ISNULL)
> > {
> > /*
> >  * Unlike the simple-scankey case, this isn't a disallowed case.
> >  * But it can never match.  If all the earlier row comparison
> >  * columns are required for the scan direction, we can stop the
> >  * scan, because there can't be another tuple that will succeed.
> >  */
> > if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
> > subkey--;
> > if ((subkey->sk_flags & SK_BT_REQFWD) &&
> > ScanDirectionIsForward(dir))
> > *continuescan = false;
> > else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
> >  ScanDirectionIsBackward(dir))
> > *continuescan = false;
> > return false;
> > }
>
> Yes, exactly. Our row comparison operators don't match if there is any
> null inside the row.  But you can find these rows within the matching
> range.
>
> > I noticed that you're not initializing so->firstPage correctly for the
> > _bt_endpoint() path, which is used when the initial position of the
> > scan is either the leftmost or rightmost page. That is, it's possible
> > to reach _bt_readpage() without having reached the point in
> > _bt_first() where you initialize so->firstPage to "true".
>
> Good catch, thank you!
>
> > It would probably make sense if the flag was initialized to "false" in
> > the same way as most other scan state is already, somewhere in
> > nbtree.c. Probably in btrescan().
>
> Makes sense, initialisation is added.
I've looked through the patch v8. I think it's good enough to be
pushed if Peter has no objections.

Regards,
Pavel.




Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> Ugh, my client is not displaying 'a' correctly

Ugh. Is that an argument in favor of normalization or against?

I've also noticed that some fonts render the same character a bit
differently depending on the constituent code points. For instance, if
the accent is its own code point, it seems to be more prominent than if
a single code point represents both the base character and the accent.
That seems to be a violation, but I can understand why that might be
useful.

> 
> Almost every Latin input mode out there produces precomposed
> characters
> and so they effectively produce NFC.

The problem is not the normal case, the problem will be things like
obscure input methods, some kind of software that's being too clever,
or some kind of malicious user trying to confuse the database.

> 
> That means that indices
> need to normalize strings, but tables need to store unnormalized
> strings.

That's an interesting idea. Would the equality operator normalize
first, or are you saying that the index would need to recheck the
results?

Regards,
Jeff Davis





Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Jeff Davis
On Mon, 2023-10-02 at 10:47 +0200, Peter Eisentraut wrote:
> I think a better direction here would be to work toward making 
> nondeterministic collations usable on the global/database level and
> then 
> encouraging users to use those.
> 
> It's also not clear which way the performance tradeoffs would fall.
> 
> Nondeterministic collations are obviously going to be slower, but by
> how 
> much?  People have accepted moving from C locale to "real" locales 
> because they needed those semantics.  Would it be any worse moving
> from 
> real locales to "even realer" locales?

If you normalize first, then you can get some semantic improvements
without giving up on the stability and performance of memcmp(). That
seems like a win with zero costs in terms of stability or performance
(except perhaps some extra text->utext casts).
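
As a quick illustration with the normalize() function PostgreSQL already
ships, assuming a deterministic collation where equality is byte-wise:

```
SELECT U&'a\0301' = U&'\00E1'                 AS raw_equal,   -- f
       normalize(U&'a\0301', NFC) = U&'\00E1' AS nfc_equal;   -- t
```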

Going to a "real" locale gives more semantic benefits but at a very
high cost: depending on a collation provider library, dealing with
collation changes, and performance costs. While supporting the use of
nondeterministic collations at the database level may be a good idea,
it's not helping to reach the compromise that I'm trying to reach in
this thread.

Regards,
Jeff Davis





Re: Pre-proposal: unicode normalized text

2023-10-03 Thread Nico Williams
On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:
> On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> > Ugh, my client is not displaying 'a' correctly
> 
> Ugh. Is that an argument in favor of normalization or against?

Heheh, well, it's an argument in favor of more software getting this
right (darn it).

It's also an argument for building a time machine so HFS+ can just
always have used NFC.  But the existence of UTF-16 is proof that time
machines don't exist (or that only bad actors have them).

> I've also noticed that some fonts render the same character a bit
> differently depending on the constituent code points. For instance, if
> the accent is its own code point, it seems to be more prominent than if
> a single code point represents both the base character and the accent.
> That seems to be a violation, but I can understand why that might be
> useful.

Yes, that happens.  Did you know that the ASCII character set was
designed with overstrike in mind for typing of accented Latin
characters?  Unicode combining sequences are kinda like that, but more
complex.

Yes, the idea really was that you could write a' (or 'a) to get á.
That's how people did it with typewriters anyways.

> > Almost every Latin input mode out there produces precomposed
> > characters and so they effectively produce NFC.
> 
> The problem is not the normal case, the problem will be things like
> obscure input methods, some kind of software that's being too clever,
> or some kind of malicious user trying to confuse the database.

_HFS+ enters the chat_

> > That means that indices
> > need to normalize strings, but tables need to store unnormalized
> > strings.
> 
> That's an interesting idea. Would the equality operator normalize
> first, or are you saying that the index would need to recheck the
> results?

You can optimize this to avoid having to normalize first.  Most strings
are not equal, and they tend to differ early.  And most strings will
likely be ASCII-mostly or in the same form anyways.  So you can just
walk a cursor down each string looking at two bytes, and if they are
both ASCII then you move each cursor forward by one byte, and if they
are not both ASCII then you take a slow path where you normalize one
grapheme cluster at each cursor (if necessary) and compare that.  (ZFS
does this.)
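
A rough sketch in C of that fast path; the function and slow-path helper
names are invented for illustration and are not ZFS or PostgreSQL code:

```
#include <stddef.h>

/* Hypothetical slow path: normalize the next grapheme cluster at each
 * cursor (advancing *i and *j) and compare the normalized forms. */
extern int compare_next_clusters(const unsigned char *a, size_t alen, size_t *i,
                                 const unsigned char *b, size_t blen, size_t *j);

static int
formins_strcmp(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
    size_t i = 0, j = 0;

    while (i < alen && j < blen)
    {
        if (a[i] < 0x80 && b[j] < 0x80)
        {
            /* Fast path: two ASCII bytes compare directly. */
            if (a[i] != b[j])
                return a[i] < b[j] ? -1 : 1;
            i++;
            j++;
        }
        else
        {
            /* Non-ASCII on either side: normalize and compare one
             * grapheme cluster per cursor, then continue. */
            int r = compare_next_clusters(a, alen, &i, b, blen, &j);

            if (r != 0)
                return r;
        }
    }
    /* A strict prefix sorts first. */
    return (i < alen) - (j < blen);
}
```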

You can also assume ASCII-mostly, load as many bits of each string
(padding as needed) as will fit in SIMD registers, compare and check
that they're all ASCII, and if not then jump to the slow path.

You can also normalize one grapheme cluster at a time when hashing
(e.g., for hash indices), thus avoiding a large allocation if the string
is large.

Nico
-- 




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-03 Thread Jeff Davis
On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> One benefit would be that it'd make it more realistic to use direct
> IO for WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less
> clearly a win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

If I recall, this patch does not provide a performance benefit as-is
(correct me if things have changed) and I don't know if a reduction in
syscalls alone is enough to justify it. But if it paves the way for
direct IO for WAL, that does seem worth it.

Regards,
Jeff Davis





Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2023-10-03 Thread Karl O. Pinc
On Tue, 3 Oct 2023 14:51:31 -0700
"David G. Johnston"  wrote:

> 0001 - I would just call the section:
> Capturing Command Results into Variables

I like your wording a lot.  Let's use it!

> I would add commentary in there that it is only possible for
> variables to take on single value at any given time and so in order
> to handle multiple row results you need to instantiate a loop as per
> 43.6.6

That sounds reasonable.  Let me see what I can come up with.
I'll do it in a separate commit.

> 0002 - {Inferred | Indirect} Types ?
> We are already in the Declarations section so the fact we are
> declaring new variables is already covered.

I agree.  But the problem is that the section title needs
to stand on its own when the table of contents is scanned.
By that I don't mean that getting context from a "Declaration"
higher level section is somehow out-of-bounds, but that
neither "inferred" nor "indirect" has a recognizable meaning
unless the section body itself is read.

I thought about: Referencing Existing Types
But the problem with that is that the reader does not know
that this has to do with the types of existing objects
rather than working with user-defined types (or something else).

Also, I kind of like "re-using".  Shorter, simpler word.

There is: Re-Using the Type of Objects

"Objects" is not very clear.  Variables are very different from
columns.  It seemed best to just write it out.

> "Instead of literally writing a type name you can write variable%TYPE
> and the system will indirectly apply the then-current type of the
> named variable to the newly declared variable." 

I've no objection to the section leading with a summary sentence like
this.  The trouble is coming up with something good.  I'd go with
"You can reference the data type of an existing column or variable
with the %TYPE notation.  The syntax is:"  Shorter and simpler.
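
For reference, the notation under discussion, as a minimal PL/pgSQL
example:

```
DO $$
DECLARE
    quantity integer := 30;
    backup   quantity%TYPE;  -- copies the then-current type of quantity
BEGIN
    backup := quantity;
END
$$;
```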

Along with this change, it might be nice to move the "By using %TYPE
..." paragraph to after the first example (before the first paragraph).

But this is really feature creep for this commit.  :)

> (using "copy the
> then-current" reads pretty well and makes the existing title
> usable...)

I disagree.  The title needs to be understandable without reading
the section body.

> 
> 0003 - The noun "Exception" here means "deviating from the normal
> flow of the code", not, "A subclass of error".  I don't see this
> title change being particularly beneficial.

Isn't the entire section about "deviating from the normal flow of the
code"?  That's what makes me want "Exception" in the section title.

> 0004 - Agreed, but "You can raise an error explicitly as described in
> "Errors and Messages".  I would not use the phrase "raise an
> exception", it doesn't fit.

I like the brevity of your sentence.  And you're right that
the sentence does not flow as written.  How about:

You can raise an exception to throw an error as described in
"Errors and Messages".

?  I remain (overly?) focused on the word "exception", since that's
what's in the brain of the user who's writing RAISE EXCEPTION.
It matters if exceptions and errors are different.  If they're
not, then it also matters since it's exceptions that the user's
code raises.

> 0005 - This tone of voice is used throughout the introductory
> documentation sections, this single change doesn't seem warranted.

Ok.  I'll drop it unless somebody else chimes in.

> 0006 - Useful.  The view itself provides all configuration parameters
> known to the system, the "or selected" isn't true of the view itself,
> and it seems to go without saying that the user can add a where
> clause to any query they write using that view.  At most I'd make one
> of the examples include a where clause.

Good points.

I'll get rid of the ", or selected,".  May as well leave the
examples as short as possible.  Less is more.  :)

> 0007 - I'm leaning toward this area being due for some improvement,
> especially given the v16 changes bring new attention to it, but this
> patch doesn't scream "improvement" to me.

Let's see how it looks with 0012, which uses the vocabulary.

I do like the "Roles therefore control who has what access to which
objects." sentence.  I was shooting for shorter sentences, but then
when I broke the existing sentences into pieces I couldn't resist
adding text.

> 0008 - Same as 0007.  INHERIT is no longer an attribute though, it is
> a property of membership.

(!) I'm going to have to pay more attention.

>  This seems more acceptable on its own but
> I'd need more time to take in the big picture myself.

It's the big picture that I'm trying to draw. 0007, 0008, and 0012
all kind of go together.  It may be worth forking the email thread,
or something.

Have you any thoughts on the "permissions", "privileges" and 
"attributes" vocabulary/concepts used in this area?


Thanks very much for the review.  I'm going to wait a bit
before incorporating your suggestions and sending in another
patch set in case Daniel chimes in.  (I'm slightly nervous about the
renumbering making the thread hard to follow.)

Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag

2023-10-03 Thread Michael Paquier
On Tue, Oct 03, 2023 at 07:02:11PM +0530, Bharath Rupireddy wrote:
> On Tue, Oct 3, 2023 at 5:45 PM Drouvot, Bertrand
>  wrote:
>> I did some research and BGWORKER_BYPASS_ALLOWCONN has been added in 
>> eed1ce72e1 and
>> at that time the bgw_flags did already exist.
>>
>> In this the related thread [1], Tom mentioned:
>>
>> "
>> We change exported APIs in new major versions all the time.  As
>> long as it's just a question of an added parameter, people can deal
>> with it.
>> "
> 
> It doesn't have to be always/all the time. If it is acceptable here to
> change the bgw and other core function APIs, I honestly feel that we
> must move BGWORKER_BYPASS_ALLOWCONN to bgw_flags.

I don't agree with this point.  BackgroundWorkerInitializeConnection()
and its other friend are designed to be called at the beginning of the
main routine of a background worker, where bgw_flags is not accessible.
There is much more happening before a bgworker initializes its
connection, like signal handling and definitions of other states that
depend on the GUCs loaded for the bgworker.
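
To make that ordering concrete, here is a simplified sketch of a bgworker
entry point (not from any patch in this thread; the database name is a
placeholder):

```
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

void
my_worker_main(Datum main_arg)
{
    /* signal handling and GUC-dependent setup happen first */
    BackgroundWorkerUnblockSignals();

    /* Only now does the worker connect.  The BGWORKER_BYPASS_* flags
     * apply here, long after the postmaster consumed bgw_flags. */
    BackgroundWorkerInitializeConnection("postgres", NULL,
                                         BGWORKER_BYPASS_ALLOWCONN);
}
```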

>> Now, I understand your point but it looks to me that bgw_flags is more
>> about the capabilities (Access to shared memory with BGWORKER_SHMEM_ACCESS
>> or ability to establish database connection with 
>> BGWORKER_BACKEND_DATABASE_CONNECTION),
>>
>> While with BGWORKER_BYPASS_ROLELOGINCHECK (and BGWORKER_BYPASS_ALLOWCONN) 
>> it's more related to
>> the BGW behavior once the capability is in place.
> 
> I look at the new flag as a capability of the bgw to connect with a
> role without login access. IMV, all are the same.

Bertrand is arguing that the current code with its current split is
OK, because both are different concepts:
- bgw_flags is used by the postmaster to control how to launch the
bgworkers.
- The BGWORKER_* flags are used by the bgworkers themselves, once
things are set up by the postmaster based on bgw_flags.
--
Michael


signature.asc
Description: PGP signature


Re: Add support for AT LOCAL

2023-10-03 Thread Michael Paquier
On Tue, Oct 03, 2023 at 02:10:48AM +0200, Vik Fearing wrote:
> On 9/29/23 09:27, Michael Paquier wrote:
>> As the deparsing code introduced by this patch is showing, this leads
>> to a lot of extra complexity.  And, actually, this can be quite
>> expensive as well with these two layers of functions.  Note also that
>> in comparison to SQLValueFunctions, COERCE_SQL_SYNTAX does less
>> inlining.  So here comes my question: why doesn't this stuff just use
>> one underlying function to do this job?
> 
> I guess I don't understand the question.  Why do you think a single function
> that repeats what these functions do would be preferable?  I am not sure how
> doing it a different way would be better.

Leaving aside the ruleutils.c changes introduced by the patch that are
quite confusing, having one function in the executor stack is going to
be more efficient than two (i.e., fewer ExecInitFunc calls), and this syntax
could be used in SQL queries where the operation is repeated a lot.
This patch introduces two COERCE_SQL_SYNTAX, meaning that we would do
twice the ACL check, twice the function hook, etc, so this could lead
to significant differences.  I think that we should be careful with
the approach taken, and do benchmarks to choose an efficient approach
from the start.  See for example:
https://www.postgresql.org/message-id/zghbge45jkw8f...@paquier.xyz
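
For readers following along, a hedged sketch of the construct under
discussion (per the proposal, AT LOCAL is shorthand for AT TIME ZONE with
the session's TimeZone):

```
SELECT TIMESTAMP WITH TIME ZONE '2023-10-03 12:00+00' AT LOCAL;
-- intended to be equivalent to:
SELECT TIMESTAMP WITH TIME ZONE '2023-10-03 12:00+00'
       AT TIME ZONE current_setting('TimeZone');
```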
--
Michael


signature.asc
Description: PGP signature


Re: Index range search optimization

2023-10-03 Thread Alexander Korotkov
On Wed, Oct 4, 2023 at 12:59 AM Pavel Borisov  wrote:
> I've looked through the patch v8. I think it's good enough to be
> pushed if Peter has no objections.

Thank you, Pavel.
I'll push this if there are no objections.

--
Regards,
Alexander Korotkov




Re: pgstatindex vs. !indisready

2023-10-03 Thread Michael Paquier
On Sun, Oct 01, 2023 at 06:31:26PM -0700, Noah Misch wrote:
> The !indisvalid index may be missing tuples, yes.  In what way does that make
> one of those four operations incorrect?

Reminding myself of what these four do, it looks like you're right and
that the state of indisvalid is not going to change what they report.

Still, I'd like to agree with Tom's point to be more conservative and
check also for indisvalid which is what the planner does.  These
functions will be used in SELECTs, and one thing that worries me is
that checks based on indisready may get copy-pasted somewhere else,
leading to incorrect results where they get copied.  (indisready &&
!indisvalid) is also a short-lived combination during a concurrent build
process (it depends on how long one waits for old snapshots
before switching indisvalid, but that's way shorter than the period of
time during which the built indexes remain valid).
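
For reference, a query to inspect the two flags being discussed
("some_table" is a placeholder):

```
SELECT indexrelid::regclass AS index, indisready, indisvalid
FROM pg_index
WHERE indrelid = 'some_table'::regclass;
```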
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2023-10-03 Thread Amit Kapila
On Tue, Oct 3, 2023 at 9:27 PM shveta malik  wrote:
>
> On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand
>  wrote:
> >
> > Hi,
> >
> > On 10/3/23 12:54 PM, Amit Kapila wrote:
> > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
> > >  wrote:
> > >>
> > >> On 9/29/23 1:33 PM, Amit Kapila wrote:
> > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
> > >>>  wrote:
> > 
> > >>>
> >  - probably open corner cases like: what if a standby is down? would 
> >  that mean
> >  that synchronize_slot_names not being sent to the primary would allow 
> >  the decoding
> >  on the primary to go ahead?
> > 
> > >>>
> > >>> Good question. BTW, irrespective of whether we have
> > >>> 'standby_slot_names' parameters or not, how should we behave if
> > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on
> > >>> standby then in such a situation primary won't be even aware that some
> > >>> of the logical walsenders need to wait.
> > >>
> > >> Exactly, that's why I was thinking keeping standby_slot_names to address
> > >> this scenario. In such a case one could simply decide to keep or remove
> > >> the associated physical replication slot from standby_slot_names. Keep 
> > >> would
> > >> mean "wait" and removing would mean allow to decode on the primary.
> > >>
> > >>> OTOH, one can say that users
> > >>> should configure 'synchronize_slot_names' on both primary and standby
> > >>> but note that this value could be different for different standby's,
> > >>> so we can't configure it on primary.
> > >>>
> > >>
> > >> Yeah, I think that's a good use case for standby_slot_names, what do you 
> > >> think?
> > >>
> > >
> > > But, even if we keep 'standby_slot_names' for this purpose, the
> > > primary doesn't know the value of 'synchronize_slot_names' once the
> > > standby is down and or the primary is restarted. So, how will we know
> > > which logical WAL senders needs to wait for 'standby_slot_names'?
> > >
> >
> > Yeah right, I also think we'd need:
> >
> > - synchronize_slot_names on both primary and standby
> >
> > But now we would need to take care of different standby having different 
> > values (
> > as you said up-thread)
> >
> > Thinking out loud: What about a single GUC on the primary (not 
> > standby_slot_names nor
> > synchronize_slot_names) but say logical_slots_wait_for_standby that could 
> > be a list of say
> > "logical_slot_name:physical_slot".
> >
> > I think this GUC would help us define each walsender behavior (should the 
> > standby(s)
> > be up or down):
> >
>
> It may help in defining the walsender's behaviour better for sure. But
> the problem I see once we start defining sync-slot-names on primary
> (in any form whether as independent GUC or as above mapping GUC) is
> that it needs to be then in sync with standbys, as each standby for
> sure needs to maintain its own sync-slot-names GUC to make it aware of
> what all it needs to sync.

Yes, I also think so. Also, defining such a GUC where the user wants to
sync all the slots (which would normally be the case) would be a
nightmare for the users.

>
> This brings us to the original question of
> how do we actually keep these configurations in sync between primary
> and standby if we plan to maintain it on both?
>
>
> > - don't wait if its associated logical_slot is not listed in this GUC
> > - or wait based on its associated "list" of mapped physical slots (would 
> > probably
> > have to deal with the min restart_lsn for all the corresponding mapped 
> > ones).
> >
> > I don't think we can avoid having to define at least one GUC on the primary 
> > (at least to
> > handle the case of standby(s) being down).
> >

How about an alternate scheme where we define sync_slot_names on
standby but then store the physical_slot_name in the corresponding
logical slot (ReplicationSlotPersistentData) to be synced? So, the
standby will send the list of 'sync_slot_names' and the primary will
add the physical standby's slot_name in each of the corresponding
sync_slot. Now, if we do this then even after restart, we should be
able to know for which physical slot each logical slot needs to wait.
We can even provide an SQL API to reset the value of
standby_slot_names in logical slots as a way to unblock decoding in
case of emergency (for example, corresponding when physical standby
never comes up).

-- 
With Regards,
Amit Kapila.




Re: False "pg_serial": apparent wraparound” in logs

2023-10-03 Thread Michael Paquier
On Sun, Oct 01, 2023 at 09:43:21PM +0300, Heikki Linnakangas wrote:
> I think the smallest fix here would be to change CheckPointPredicate() so
> that if tailPage > headPage, pass headPage to SimpleLruTruncate() instead of
> tailPage. Or perhaps it should go into the "The SLRU is no longer needed"
> codepath in that case. If tailPage > headPage, the SLRU isn't needed at the
> moment.

Good idea.  Indeed that should be good and simple enough for the
back-branches, at quick glance.
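
A rough sketch of that suggestion (names simplified and not taken from the
actual code; the real logic lives in CheckPointPredicate()):

```
if (tailPage > headPage)
    tailPage = headPage;    /* SLRU not needed right now */
SimpleLruTruncate(SerialSlruCtl, tailPage);
```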
--
Michael


signature.asc
Description: PGP signature


Re: Removing unneeded self joins

2023-10-03 Thread Alexander Korotkov
Hi!

I think this is a neat optimization.

On Tue, Sep 12, 2023 at 4:58 PM Andrey Lepikhov
 wrote:
> On 5/7/2023 21:28, Andrey Lepikhov wrote:
> > During the significant code revision in v.41 I lost some replacement
> > operations. Here is the fix and extra tests to check this in the future.
> > Also, Tom added the JoinDomain structure five months ago, and I added
> > code to replace relids for that place too.
> > One more thing, I found out that we didn't replace SJs, defined by
> > baserestrictinfos if no one self-join clause have existed for the join.
> > Now, it is fixed, and the test has been added.
> > To understand changes readily, see the delta file in the attachment.
> Here is new patch in attachment. Rebased on current master and some
> minor gaffes are fixed.

I went through the thread and I think the patch gets better shape.  A
couple of notes from my side.
1) Why does replace_relid() make a copy of relids only on insert/replace
of a member, but perform deletion in-place?
2) It would be nice to skip the insertion of IS NOT NULL checks when
they are not necessary.  [1] points out that infrastructure from [2] might
be useful.  The patchset from [2] seems to be committed now.  However, I
can't see that it is directly helpful in this matter.  Could we just skip
adding the IS NOT NULL clause for columns that have
pg_attribute.attnotnull set?
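
For illustration, the catalog flag referred to in point 2 can be inspected
like this ("some_table" is a placeholder):

```
SELECT attname, attnotnull
FROM pg_attribute
WHERE attrelid = 'some_table'::regclass AND attnum > 0;
```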

Links
1. https://www.postgresql.org/message-id/2375492.jE0xQCEvom%40aivenronan
2.  https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us


--
Regards,
Alexander Korotkov




Re: make add_paths_to_append_rel aware of startup cost

2023-10-03 Thread David Rowley
On Sun, 1 Oct 2023 at 21:26, Andy Fan  wrote:
>> But overall, I'm more inclined to just go with the more simple "add a
>> cheap unordered startup append path if considering cheap startup
>> plans" version. I see your latest patch does both. So, I'd suggest two
>> patches as I do see the merit in keeping this simple and cheap.  If we
>> can get the first part in and you still find cases where you're not
>> getting the most appropriate startup plan based on the tuple fraction,
>> then we can reconsider what extra complexity we should endure in the
>> code based on the example query where we've demonstrated the planner
>> is not choosing the best startup path appropriate to the given tuple
>> fraction.
>
> I think this is a fair point; I agree that your first part is good enough
> to be committed first.  I tried hard to construct a test case that proves
> the value of the cheapest fractional cost, but no luck so far :(

I've attached a patch with the same code as the previous patch but
this time including a regression test.

I see no reason to not commit this so if anyone feels differently
please let me know.

David


consider_cheapest_startup_appendpath_v2.patch
Description: Binary data


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-03 Thread Hayato Kuroda (Fujitsu)
Dear Bharath,

While checking more, I found some problems your PoC.

1. rm_is_record_decodable() returns true when WAL records are decodable.
   Based on that, should is_valid be false when the function returns true?
   E.g., XLOG_HEAP_INSERT is accepted in the PoC.
2. XLOG_CHECKPOINT_SHUTDOWN and XLOG_RUNNING_XACTS should return false because
   these records may be generated during the upgrade but they are acceptable.
3. Bit operations are done for extracting a WAL type, but the mask is
   different based on the rmgr. E.g., XLOG uses XLR_INFO_MASK, but XACT uses
   XLOG_XACT_OPMASK.
4. There is a possibility that "XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE" is 
inserted,
   but it is not handled.

Regarding point 2, maybe we should say "if the reorderbuffer is modified while 
decoding,
rm_is_record_decodable must return false" or something. If so, the return value
of XLOG_END_OF_RECOVERY and XLOG_HEAP2_NEW_CID should be also changed.

I attached a fix patch for the above. What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

diff --git a/src/backend/access/transam/rmgr.c 
b/src/backend/access/transam/rmgr.c
index 001bdf3535..850ba7829a 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -117,6 +117,11 @@ RegisterCustomRmgr(RmgrId rmid, const RmgrData *rmgr)
 errdetail("Custom resource manager \"%s\" 
already registered with the same ID.",
   RmgrTable[rmid].rm_name)));
 
+   if (rmgr->rm_decode && rmgr->rm_is_record_decodable == NULL)
+   ereport(ERROR,
+   (errmsg("failed to register custom resource 
manager \"%s\" with ID %d", rmgr->rm_name, rmid),
+errdetail("Custom resource manager which has a 
decode function must have is_reacode_decodable function too.")));
+
/* check for existing rmgr with the same name */
for (int existing_rmid = 0; existing_rmid <= RM_MAX_ID; existing_rmid++)
{
diff --git a/src/backend/replication/logical/decode.c 
b/src/backend/replication/logical/decode.c
index 60d26ae015..72a542a06b 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -214,11 +214,10 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 xlog_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_CHECKPOINT_SHUTDOWN:
case XLOG_END_OF_RECOVERY:
-   return true;
case XLOG_CHECKPOINT_ONLINE:
case XLOG_PARAMETER_CHANGE:
case XLOG_NOOP:
@@ -401,7 +400,7 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 xact_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & XLOG_XACT_OPMASK)
{
case XLOG_XACT_COMMIT:
case XLOG_XACT_COMMIT_PREPARED:
@@ -471,10 +470,9 @@ standby_decode(LogicalDecodingContext *ctx, 
XLogRecordBuffer *buf)
 bool
 standy_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_RUNNING_XACTS:
-   return true;
case XLOG_STANDBY_LOCK:
case XLOG_INVALIDATIONS:
return false;
@@ -550,11 +548,11 @@ heap2_decode(LogicalDecodingContext *ctx, 
XLogRecordBuffer *buf)
 bool
 heap2_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & XLOG_HEAP_OPMASK)
{
case XLOG_HEAP2_MULTI_INSERT:
-   case XLOG_HEAP2_NEW_CID:
return true;
+   case XLOG_HEAP2_NEW_CID:
case XLOG_HEAP2_REWRITE:
case XLOG_HEAP2_FREEZE_PAGE:
case XLOG_HEAP2_PRUNE:
@@ -661,9 +659,10 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf)
 bool
 heap_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & XLOG_HEAP_OPMASK)
{
case XLOG_HEAP_INSERT:
+   case XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE:
case XLOG_HEAP_HOT_UPDATE:
case XLOG_HEAP_UPDATE:
case XLOG_HEAP_DELETE:
@@ -782,7 +781,7 @@ logicalmsg_decode(LogicalDecodingContext *ctx, 
XLogRecordBuffer *buf)
 bool
 logicalmsg_is_record_decodable(uint8 info)
 {
-   switch (info)
+   switch (info & ~XLR_INFO_MASK)
{
case XLOG_LOGICAL_MESSAGE:
return true;
diff --git a/src/backend/utils/adt/pg_upgrade_support.c 
b/src/backend/utils/adt/pg_upgrade_support.c
index cfd3e448b1..52084dc644 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -320,19 +320,18 @@ binary_upgrade_validate_wal_logical_end(PG_FUNCTION_ARGS)
while (is_valid && ReadNextXLogRecord

Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2023-10-03 Thread David G. Johnston
On Tue, Oct 3, 2023 at 4:15 PM Karl O. Pinc  wrote:

> On Tue, 3 Oct 2023 14:51:31 -0700
> "David G. Johnston"  wrote:
>
> Isn't the entire section about "deviating from the normal flow of the
> code"?  That's what makes me want "Exception" in the section title.
>

It is about how error handling in a procedure diverts the flow from the
normal code path to some other code path - what that path is labelled is
less important than the thing that causes the diversion - an error.


> ?  I remain (overly?) focused on the word "exception", since that's
> whats in the brain of the user that's writing RAISE EXCEPTION.
> It matters if exceptions and errors are different.  If they're
> not, then it also matters since it's exceptions that the user's
> code raises.
>
>
It's unfortunate that the keyword to raise the message level "ERROR" is
"EXCEPTION" in that command, but I'd rather simply handle that one anomaly
than make the rest of the system use the word exception, especially as we
seem to be fairly consistent in our usage of ERROR already.  I'm sympathetic that
other systems out there also encourage the usage of exception in this
context instead of error - but not to the point of opening up this
long-standing decision for rework.


> Have you any thoughts on the "permissions", "privleges" and
> "attributes" vocabulary/concepts used in this area?
>

I think we benefit from being able to equate permissions and privileges and
trying to separate them is going to be more harmful than helpful.  The
limited things that role attributes permit, and how they fall outside the
privilege/permission concept as we use it, isn't something that I've
noticed is a problem that needs addressing.


 (I'm slightly
> nervous about the renumbering making the thread hard to follow.)
>
>
0009 - Something just seems off with this one.  Unless there are more
places with this type in use I would just move the relevant notes (i.e.,
the one in proallargtypes) to that column and be done with it.  If there
are multiple places then moving the notes to the main docs and
cross-referencing to them seems warranted.  I also wouldn't call it legacy.

0010 -

When creating new objects, if a schema qualification is not given with the
name the first extant entry in the search_path is chosen; then an error
will be raised should the supplied name already exist in that schema.
In contexts where the object must already exist, but its name is not schema
qualified, the extant search_path schemas will be consulted serially until
one of them contains an appropriate object, returning it, or all schemas
are consulted, resulting in an object not found error.
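
A small illustration of the behavior described (schema and table names
invented):

```
CREATE SCHEMA first;
CREATE SCHEMA second;
SET search_path = first, second;
CREATE TABLE t ();   -- created as first.t: the first extant schema wins
CREATE TABLE t ();   -- ERROR: relation "t" already exists
SELECT * FROM t;     -- resolves to first.t; second would be consulted
                     -- only if first contained no matching object
```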

I'm not seeing much value in presenting the additional user/public details
here.  Especially as it would then seem appropriate to include pg_temp.
And now we have to deal with the fact that by default the public schema
isn't so public anymore.

David J.


Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-10-03 Thread Thomas wen
> > This patch makes the filenames always 12 characters wide (for SLRUs that
> > opt-in to the new naming). That's actually not enough for the full range
> > that a 64 bit page number can represent. Should we make it 16 characters
> > now, to avoid repeating the same mistake we made earlier? Or make it
> > more configurable, on a per-SLRU basis. One reason I don't want to just
> > make it 16 characters is that WAL segment filenames are also 16 hex
> > characters, which could cause confusion.
>
> Good catch. I propose the following solution:
>
> ```
>  SlruFileName(SlruCtl ctl, char *path, int64 segno)
>  {
>  if (ctl->long_segment_names)
> -return snprintf(path, MAXPGPATH, "%s/%012llX", ctl->Dir,
> +/*
> + * We could use 16 characters here but the disadvantage would be that
> + * the SLRU segments will be hard to distinguish from WAL segments.
> + *
> + * For this reason we use 15 characters. It is enough but also means
> + * that in the future we can't decrease SLRU_PAGES_PER_SEGMENT 
> easily.
> + */
> +return snprintf(path, MAXPGPATH, "%s/%015llX", ctl->Dir,
>  (long long) segno);
>  else
>  return snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir,
> ```
>
> PFE the corrected patchset v58.
Good idea.


From: Aleksander Alekseev 
Sent: September 4, 2023 22:41
To: Postgres hackers 
Cc: Heikki Linnakangas ; Maxim Orlov ; 
Jacob Champion ; Japin Li ; 
Andres Freund ; Michael Paquier ; 
Pavel Borisov ; Peter Eisentraut 
; Alexander Korotkov 
Subject: Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into 
PostgreSQL 15)

Hi,

> > Reviewing this now. I think it's almost ready to be committed.
> >
> > There's another big effort going on to move SLRUs to the regular buffer
> > cache (https://commitfest.postgresql.org/43/3514/). I wonder how moving
> > to 64 bit page numbers will affect that. BlockNumber is still 32 bits,
> > after all.
>
> Somehow I didn't pay too much attention to this effort, thanks. I will
> familiarize myself with the patch. Intuitively I don't think that the
> patchse should block each other.
>
> > This patch makes the filenames always 12 characters wide (for SLRUs that
> > opt-in to the new naming). That's actually not enough for the full range
> > that a 64 bit page number can represent. Should we make it 16 characters
> > now, to avoid repeating the same mistake we made earlier? Or make it
> > more configurable, on a per-SLRU basis. One reason I don't want to just
> > make it 16 characters is that WAL segment filenames are also 16 hex
> > characters, which could cause confusion.
>
> Good catch. I propose the following solution:
>
> ```
>  SlruFileName(SlruCtl ctl, char *path, int64 segno)
>  {
>  if (ctl->long_segment_names)
> -return snprintf(path, MAXPGPATH, "%s/%012llX", ctl->Dir,
> +/*
> + * We could use 16 characters here but the disadvantage would be that
> + * the SLRU segments will be hard to distinguish from WAL segments.
> + *
> + * For this reason we use 15 characters. It is enough but also means
> + * that in the future we can't decrease SLRU_PAGES_PER_SEGMENT 
> easily.
> + */
> +return snprintf(path, MAXPGPATH, "%s/%015llX", ctl->Dir,
>  (long long) segno);
>  else
>  return snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir,
> ```
>
> PFE the corrected patchset v58.

While triaging the patches for the September CF [1] a consensus was
reached that the patchset needs another round of review. Also I
removed myself from the list of reviewers in order to make it clear
that a review from somebody else would be appreciated.

[1]: https://postgr.es/m/0737f444-59bb-ac1d-2753-873c40da0840%40eisentraut.org
--
Best regards,
Aleksander Alekseev




Re: Synchronizing slots from primary to standby

2023-10-03 Thread Peter Smith
Here are some review comments for v20-0002.

==
1. GENERAL - errmsg/elog messages

There are a lot of minor problems and/or quirks across all the
message texts. Here is a summary of some I found:

ERROR
errmsg("could not receive list of slots from the primary server: %s",
errmsg("invalid response from primary server"),
errmsg("invalid connection string syntax: %s",
errmsg("replication slot-sync worker slot %d is empty, cannot attach",
errmsg("replication slot-sync worker slot %d is already used by
another worker, cannot attach",
errmsg("replication slot-sync worker slot %d is already used by
another worker, cannot attach",
errmsg("could not connect to the primary server: %s",

errmsg("operation not permitted on replication slots on standby which
are synchronized from primary")));
/primary/the primary/

errmsg("could not fetch invalidation cuase for slot \"%s\" from primary: %s",
/cuase/cause/
/primary/the primary/

errmsg("slot \"%s\" disapeared from the primary",
/disapeared/disappeared/

errmsg("could not fetch slot info from the primary: %s",
errmsg("could not connect to the primary server: %s", err)));
errmsg("could not map dynamic shared memory segment for slot-sync worker")));

errmsg("physical replication slot %s found in synchronize_slot_names",
slot name not quoted?
---

WARNING
errmsg("out of background worker slots"),

errmsg("Replication slot-sync worker failed to attach to worker-pool slot %d",
case?

errmsg("Removed database %d from replication slot-sync worker %d;
dbcount now: %d",
case?

errmsg("Skipping slots synchronization as primary_slot_name is not set."));
case?

errmsg("Skipping slots synchronization as hot_standby_feedback is off."));
case?

errmsg("Skipping slots synchronization as dbname is not specified in
primary_conninfo."));
case?

errmsg("slot-sync wait for slot %s interrupted by promotion, slot
creation aborted",

errmsg("could not fetch slot info for slot \"%s\" from primary: %s",
/primary/the primary/

errmsg("slot \"%s\" disappeared from the primary, aborting slot creation",
errmsg("slot \"%s\" invalidated on primary, aborting slot creation",

errmsg("slot-sync for slot %s interrupted by promotion, sync not possible",
slot name not quoted?

errmsg("skipping sync of slot \"%s\" as the received slot-sync lsn
%X/%X is ahead of the standby position %X/%X",

errmsg("not synchronizing slot %s; synchronization would move it backward",
slot name not quoted?
/backward/backwards/

---

LOG
errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d",
errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d",
errmsg("Stopping replication slot-sync worker %d",
errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin
(%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)",

errmsg("wait over for remote slot \"%s\" as its LSN (%X/%X)and catalog
xmin (%u) has now passed local slot LSN (%X/%X) and catalog xmin
(%u)",
missing spaces?

elog(LOG, "Dropped replication slot \"%s\" ",
extra space?
why this one is elog but others are not?

elog(LOG, "Replication slot-sync worker %d is shutting down on
receiving SIGINT", MySlotSyncWorker->slot);
case?
why this one is elog but others are not?

elog(LOG, "Replication slot-sync worker %d started", worker_slot);
case?
why this one is elog but others are not?


DEBUG1
errmsg("allocated dsa for slot-sync worker for dbcount: %d"
worker number not given?
should be elog?

errmsg_internal("logical replication launcher started")
should be elog?



DEBUG2
elog(DEBUG2, "slot-sync worker%d's query:%s \n",
missing space after 'worker'
extra space before \n

==
.../libpqwalreceiver/libpqwalreceiver.c

2. libpqrcv_get_dbname_from_conninfo

+/*
+ * Get database name from primary conninfo.
+ *
+ * If dbanme is not found in connInfo, return NULL value.
+ * The caller should take care of handling NULL value.
+ */
+static char *
+libpqrcv_get_dbname_from_conninfo(const char *connInfo)

2a.
/dbanme/dbname/

~

2b.
"The caller should take care of handling NULL value."

IMO this is not very useful; it's like saying "caller must handle
function return values".

~~~

3.
+ for (opt = opts; opt->keyword != NULL; ++opt)
+ {
+ /* Ignore connection options that are not present. */
+ if (opt->val == NULL)
+ continue;
+
+ if (strcmp(opt->keyword, "dbname") == 0 && opt->val[0] != '\0')
+ {
+ dbname = pstrdup(opt->val);
+ }
+ }

3a.
If there are multiple "dbname" in the conninfo then it will be the
LAST one that is returned.

Judging by my quick syntax experiment (below) this seemed like the
correct thing to do, but I think there should be some comment to
explain it.

test_sub=# create subscription sub1 connection 'dbname=foo dbname=bar
dbname=test_pub' publication pub1;
2023-09-28 19:15:15.012 AEST [23997] WARNING:  subscriptions created
by regression test cases should have names starting with "regress_"
WARNING:  subscriptions created by regression test cases should have
names starting with "regress

Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-03 Thread Michael Paquier
On Tue, Oct 03, 2023 at 06:02:10PM +1300, David Rowley wrote:
> I know I said I'd drop this, but I was reminded of it again today.  I
> ended up adjusting the patch so that it no longer adds a helper
> function to stringinfo.c and instead just manually assigns the
> StringInfo.data field to point to the bytea's buffer.  This follows
> what's done in some existing places such as
> LogicalParallelApplyLoop(), ReadArrayBinary() and record_recv() to
> name a few.
> 
> I ran a fresh set of benchmarks on today's master with and without the
> patch applied. I used the same benchmark as I did in [1].  The average
performance increase between 0 and 12 workers is about 6.6%.
> 
> This seems worthwhile to me.  Any objections?

Interesting.

+   buf.len = VARSIZE_ANY_EXHDR(sstate);
+   buf.maxlen = 0;
+   buf.cursor = 0;

Perhaps it would be worth hiding that in a macro defined in
stringinfo.h?
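
Something along these lines, perhaps (the name is invented for
illustration; maxlen == 0 would mark the StringInfo as wrapping a foreign,
non-enlargeable buffer):

```
#define initReadOnlyStringInfo(str, buffer, buffer_len) \
    do { \
        (str)->data = (buffer); \
        (str)->len = (buffer_len); \
        (str)->maxlen = 0; \
        (str)->cursor = 0; \
    } while (0)
```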
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2023-10-03 Thread shveta malik
On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila  wrote:
>
> On Tue, Oct 3, 2023 at 9:27 PM shveta malik  wrote:
> >
> > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand
> >  wrote:
> > >
> > > Hi,
> > >
> > > On 10/3/23 12:54 PM, Amit Kapila wrote:
> > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
> > > >  wrote:
> > > >>
> > > >> On 9/29/23 1:33 PM, Amit Kapila wrote:
> > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
> > > >>>  wrote:
> > > 
> > > >>>
> > >  - probably open corner cases like: what if a standby is down? would 
> > >  that mean
> > >  that synchronize_slot_names not being sent to the primary would 
> > >  allow the decoding
> > >  on the primary to go ahead?
> > > 
> > > >>>
> > > >>> Good question. BTW, irrespective of whether we have
> > > >>> 'standby_slot_names' parameters or not, how should we behave if
> > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on
> > > >>> standby then in such a situation primary won't be even aware that some
> > > >>> of the logical walsenders need to wait.
> > > >>
> > > >> Exactly, that's why I was thinking keeping standby_slot_names to 
> > > >> address
> > > >> this scenario. In such a case one could simply decide to keep or remove
> > > >> the associated physical replication slot from standby_slot_names. Keep 
> > > >> would
> > > >> mean "wait" and removing would mean allow to decode on the primary.
> > > >>
> > > >>> OTOH, one can say that users
> > > >>> should configure 'synchronize_slot_names' on both primary and standby
> > > >>> but note that this value could be different for different standby's,
> > > >>> so we can't configure it on primary.
> > > >>>
> > > >>
> > > >> Yeah, I think that's a good use case for standby_slot_names, what do 
> > > >> you think?
> > > >>
> > > >
> > > > But, even if we keep 'standby_slot_names' for this purpose, the
> > > > primary doesn't know the value of 'synchronize_slot_names' once the
> > > > standby is down and or the primary is restarted. So, how will we know
> > > > which logical WAL senders needs to wait for 'standby_slot_names'?
> > > >
> > >
> > > Yeah right, I also think we'd need:
> > >
> > > - synchronize_slot_names on both primary and standby
> > >
> > > But now we would need to take care of different standby having different 
> > > values (
> > > as you said up-thread)
> > >
> > > Thinking out loud: What about a single GUC on the primary (not 
> > > standby_slot_names nor
> > > synchronize_slot_names) but say logical_slots_wait_for_standby that could 
> > > be a list of say
> > > "logical_slot_name:physical_slot".
> > >
> > > I think this GUC would help us define each walsender behavior (should the 
> > > standby(s)
> > > be up or down):
> > >
> >
> > It may help in defining the walsender's behaviour better for sure. But
> > the problem I see once we start defining sync-slot-names on primary
> > (in any form whether as independent GUC or as above mapping GUC) is
> > that it needs to be then in sync with standbys, as each standby for
> > sure needs to maintain its own sync-slot-names GUC to make it aware of
> > what all it needs to sync.
>
> Yes, I also think so. Also, defining such a GUC where the user wants to
> sync all the slots (which would normally be the case) would be a
> nightmare for the users.
>
> >
> > This brings us to the original question of
> > how do we actually keep these configurations in sync between primary
> > and standby if we plan to maintain it on both?
> >
> >
> > > - don't wait if its associated logical_slot is not listed in this GUC
> > > - or wait based on its associated "list" of mapped physical slots (would 
> > > probably
> > > have to deal with the min restart_lsn for all the corresponding mapped 
> > > ones).
> > >
> > > I don't think we can avoid having to define at least one GUC on the 
> > > primary (at least to
> > > handle the case of standby(s) being down).
> > >
>
> How about an alternate scheme where we define sync_slot_names on
> standby but then store the physical_slot_name in the corresponding
> logical slot (ReplicationSlotPersistentData) to be synced? So, the
> standby will send the list of 'sync_slot_names' and the primary will
> add the physical standby's slot_name in each of the corresponding
> sync_slot. Now, if we do this then even after restart, we should be
> able to know for which physical slot each logical slot needs to wait.
> We can even provide an SQL API to reset the value of
> standby_slot_names in logical slots as a way to unblock decoding in
> case of emergency (for example, corresponding when physical standby
> never comes up).
>


Looks like a better approach to me. It solves most of the pain points:
1) It avoids the need for multiple GUCs.
2) Primary and standby need not worry about staying in sync, as the
sync-slot-names GUC is no longer maintained on both.
3) The user still gets the flexibility to remove a standby from the
wait-list of the primary's logical walsenders using the reset SQL API.

Now some ini

Re: Synchronizing slots from primary to standby

2023-10-03 Thread shveta malik
On Wed, Oct 4, 2023 at 9:56 AM shveta malik  wrote:
>
> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila  wrote:
> >
> > On Tue, Oct 3, 2023 at 9:27 PM shveta malik  wrote:
> > >
> > > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 10/3/23 12:54 PM, Amit Kapila wrote:
> > > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
> > > > >  wrote:
> > > > >>
> > > > >> On 9/29/23 1:33 PM, Amit Kapila wrote:
> > > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
> > > > >>>  wrote:
> > > > 
> > > > >>>
> > > >  - probably open corner cases like: what if a standby is down? 
> > > >  would that mean
> > > >  that synchronize_slot_names not being sent to the primary would 
> > > >  allow the decoding
> > > >  on the primary to go ahead?
> > > > 
> > > > >>>
> > > > >>> Good question. BTW, irrespective of whether we have
> > > > >>> 'standby_slot_names' parameters or not, how should we behave if
> > > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified 
> > > > >>> on
> > > > >>> standby then in such a situation primary won't be even aware that 
> > > > >>> some
> > > > >>> of the logical walsenders need to wait.
> > > > >>
> > > > >> Exactly, that's why I was thinking keeping standby_slot_names to 
> > > > >> address
> > > > >> this scenario. In such a case one could simply decide to keep or 
> > > > >> remove
> > > > >> the associated physical replication slot from standby_slot_names. 
> > > > >> Keep would
> > > > >> mean "wait" and removing would mean allow to decode on the primary.
> > > > >>
> > > > >>> OTOH, one can say that users
> > > > >>> should configure 'synchronize_slot_names' on both primary and 
> > > > >>> standby
> > > > >>> but note that this value could be different for different standby's,
> > > > >>> so we can't configure it on primary.
> > > > >>>
> > > > >>
> > > > >> Yeah, I think that's a good use case for standby_slot_names, what do 
> > > > >> you think?
> > > > >>
> > > > >
> > > > > But, even if we keep 'standby_slot_names' for this purpose, the
> > > > > primary doesn't know the value of 'synchronize_slot_names' once the
> > > > > standby is down and or the primary is restarted. So, how will we know
> > > > > which logical WAL senders needs to wait for 'standby_slot_names'?
> > > > >
> > > >
> > > > Yeah right, I also think we'd need:
> > > >
> > > > - synchronize_slot_names on both primary and standby
> > > >
> > > > But now we would need to take care of different standby having 
> > > > different values (
> > > > as you said up-thread)
> > > >
> > > > Thinking out loud: What about a single GUC on the primary (not 
> > > > standby_slot_names nor
> > > > synchronize_slot_names) but say logical_slots_wait_for_standby that 
> > > > could be a list of say
> > > > "logical_slot_name:physical_slot".
> > > >
> > > > I think this GUC would help us define each walsender behavior (should 
> > > > the standby(s)
> > > > be up or down):
> > > >
> > >
> > > It may help in defining the walsender's behaviour better for sure. But
> > > the problem I see once we start defining sync-slot-names on primary
> > > (in any form whether as independent GUC or as above mapping GUC) is
> > > that it needs to be then in sync with standbys, as each standby for
> > > sure needs to maintain its own sync-slot-names GUC to make it aware of
> > > what all it needs to sync.
> >
> > Yes, I also think so. Also, defining such a GUC where the user wants
> > to sync all the slots (which would normally be the case) would be a
> > nightmare for the users.
> >
> > >
> > > This brings us to the original question of
> > > how do we actually keep these configurations in sync between primary
> > > and standby if we plan to maintain it on both?
> > >
> > >
> > > > - don't wait if its associated logical_slot is not listed in this GUC
> > > > - or wait based on its associated "list" of mapped physical slots 
> > > > (would probably
> > > > have to deal with the min restart_lsn for all the corresponding mapped 
> > > > ones).
> > > >
> > > > I don't think we can avoid having to define at least one GUC on the 
> > > > primary (at least to
> > > > handle the case of standby(s) being down).
> > > >
> >
> > How about an alternate scheme where we define sync_slot_names on
> > standby but then store the physical_slot_name in the corresponding
> > logical slot (ReplicationSlotPersistentData) to be synced? So, the
> > standby will send the list of 'sync_slot_names' and the primary will
> > add the physical standby's slot_name in each of the corresponding
> > sync_slot. Now, if we do this then even after restart, we should be
> > able to know for which physical slot each logical slot needs to wait.
> > We can even provide an SQL API to reset the value of
> > standby_slot_names in logical slots as a way to unblock decoding in
> > case of emergency (for example, corresponding when physical standby
> > never comes up).
> >
>
>
> Looks like a

Re: Synchronizing slots from primary to standby

2023-10-03 Thread shveta malik
On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shveta,
>
> Thank you for updating the patch!
>
> I found another ERROR due to the slot removal. Is this a real issue?
>
> 1. applied add_sleep.txt, which emulated the case where the tablesync worker
>got stuck and the primary crashed during the initial sync.
> 2. executed test_0925_v2.sh (You attached in [1])
> 3. secondary could not start the logical replication because the slot was not
>created (log files were also attached).
>
>
> Here is my analysis. The cause is that the slotsync worker aborts the slot
> creation on the secondary server because the restart_lsn of the secondary is
> ahead of the primary's. IIUC this can occur when a tablesync worker finishes
> the initial copy before the walsenders stream changes. In this case, the
> relstate of the worker is set to SUBREL_STATE_CATCHUP and the apply worker
> waits till the relation becomes SUBREL_STATE_SYNCDONE. From here the slot on
> the primary will not be updated until the relation has caught up. If some
> changes come in and the primary crashes at that time, the slotsync worker
> will abort the slot creation.
>

Kuroda-San, we need to let slot creation on the standby finish before we
start expecting it to support logical replication on failover. In the
current case, as you stated, the slot creation itself is aborted and
thus it cannot support logical replication later.  We are currently
exploring ways to have the slot-sync workers advance remote_lsn on the
primary internally, in order to accelerate slot creation on the standby
for cases where it is stuck because the primary's restart_lsn lags
behind the standby's. But till then, the way to proceed for testing such
cases is to execute a workload on the primary to accelerate slot
creation.
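
For instance, something as simple as the following on the primary (the
table name is purely illustrative) generates WAL activity of the kind
meant here:

-- one example of the kind of workload meant above (names illustrative)
CREATE TABLE dummy_activity (i int);
INSERT INTO dummy_activity SELECT generate_series(1, 1000);
SELECT pg_switch_wal();  -- force out the current WAL segment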

thanks
Shveta




Re: Row pattern recognition

2023-10-03 Thread Tatsuo Ishii
> By the way, I was thinking about eliminating recursive calls in
> pattern matching. Attached is the first cut of the implementation. In
> the attached v8 patch:
> 
> - No recursive calls anymore. search_str_set_recurse was removed.
> 
> - Instead it generates all possible pattern variable name initial
>   strings before pattern matching. Suppose we have "ab" (row 0) and
>   "ac" (row 1), where "a" and "b" represent the pattern variable names
>   that evaluate to true.  In this case it will generate "aa",
>   "ac", "ba" and "bc", and they are examined by do_pattern_match one
>   by one, which performs the pattern matching.
> 
> - To implement this, an infrastructure string_set* are created. They
>   take care of set of StringInfo.
> 
> I found that the previous implementation did not search all possible
> cases. I believe the bug is fixed in v8.
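
To illustrate the generation step: it amounts to enumerating the
Cartesian product of the per-row sets of matching variable names. A
standalone sketch of that enumeration, done without recursion (this is
only an illustration, not the patch's actual code):

#include <stdio.h>

int
main(void)
{
    /* rows[i] holds the variable names that evaluate to true for row i */
    const char *rows[] = {"ab", "ac"};
    int         nrows = 2;
    int         idx[2] = {0, 0};    /* odometer over the rows */
    char        buf[3];
    int         i;

    for (;;)
    {
        for (i = 0; i < nrows; i++)
            buf[i] = rows[i][idx[i]];
        buf[nrows] = '\0';
        printf("%s\n", buf);        /* a candidate for do_pattern_match */

        /* advance the odometer, rightmost row first */
        for (i = nrows - 1; i >= 0; i--)
        {
            if (rows[i][++idx[i]] != '\0')
                break;
            idx[i] = 0;             /* wrap and carry to the next row */
        }
        if (i < 0)
            break;                  /* every combination was emitted */
    }
    return 0;
}

This prints "aa", "ac", "ba" and "bc" for the example above.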

The v8 patch does not apply anymore due to commit d060e921ea "Remove obsolete
executor cleanup code", so I rebased and created a v9 patch. The differences
from v8 include:

- Fix a bug in get_slots: it did not correctly detect the end of the full frame.
- Add a test case using "ROWS BETWEEN CURRENT ROW AND offset FOLLOWING".

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
>From d8fd143ab717fd62814e73fd24bf1a1694143dfc Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii 
Date: Wed, 4 Oct 2023 14:51:48 +0900
Subject: [PATCH v9 1/7] Row pattern recognition patch for raw parser.

---
 src/backend/parser/gram.y   | 222 ++--
 src/include/nodes/parsenodes.h  |  56 
 src/include/parser/kwlist.h |   8 ++
 src/include/parser/parse_node.h |   1 +
 4 files changed, 273 insertions(+), 14 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e56cbe77cb..730c51bc87 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	DefElem	   *defelt;
 	SortBy	   *sortby;
 	WindowDef  *windef;
+	RPCommonSyntax	*rpcom;
+	RPSubsetItem	*rpsubset;
 	JoinExpr   *jexpr;
 	IndexElem  *ielem;
 	StatsElem  *selem;
@@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	MergeWhenClause *mergewhen;
 	struct KeyActions *keyactions;
 	struct KeyAction *keyaction;
+	RPSkipTo	skipto;
 }
 
 %type 	stmt toplevel_stmt schema_stmt routine_body_stmt
@@ -453,8 +456,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 TriggerTransitions TriggerReferencing
 vacuum_relation_list opt_vacuum_relation_list
 drop_option_list pub_obj_list
-
-%type 	opt_routine_body
+row_pattern_measure_list row_pattern_definition_list
+opt_row_pattern_subset_clause
+row_pattern_subset_list row_pattern_subset_rhs
+row_pattern
+%type 	 row_pattern_subset_item
+%type 	opt_routine_body row_pattern_term
 %type  group_clause
 %type 	group_by_list
 %type 	group_by_item empty_grouping_set rollup_clause cube_clause
@@ -551,6 +558,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	relation_expr_opt_alias
 %type 	tablesample_clause opt_repeatable_clause
 %type 	target_el set_target insert_column_item
+row_pattern_measure_item row_pattern_definition
+%type 	first_or_last
 
 %type 		generic_option_name
 %type 	generic_option_arg
@@ -633,6 +642,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	window_clause window_definition_list opt_partition_clause
 %type 	window_definition over_clause window_specification
 opt_frame_clause frame_extent frame_bound
+%type 	opt_row_pattern_common_syntax opt_row_pattern_skip_to
+%type 	opt_row_pattern_initial_or_seek
+%type 	opt_row_pattern_measures
 %type 	opt_window_exclusion_clause
 %type 		opt_existing_window_name
 %type  opt_if_not_exists
@@ -659,7 +671,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 json_object_constructor_null_clause_opt
 json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -702,7 +713,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
 
 	DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
-	DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
+	DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
 	DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P
 	DOUBLE_P DROP
 
@@ -718,7 +729,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	HANDLER HAVING HEADER_P HOLD HOUR_P
 
 	IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN

Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag

2023-10-03 Thread Michael Paquier
On Mon, Oct 02, 2023 at 10:53:22AM +0200, Drouvot, Bertrand wrote:
> On 10/2/23 10:17 AM, Michael Paquier wrote:
>> On Mon, Oct 02, 2023 at 10:01:04AM +0200, Drouvot, Bertrand wrote:
>>> I think that would make sense to have more flexibility in the worker_spi
>>> module. I think that could be done in a dedicated patch though. I
>>> think it makes more sense to have the current patch "focusing" on
>>> this new flag (while adding a test about it without too much
>>> refactoring). What about doing the worker_spi module  re-factoring
>>> as a follow up of this one?
>> 
>> I would do that first, as that's what I usually do,
> 
> The reason I was thinking of not doing that first is that there is no real
> use case in the current worker_spi module test.

As a template, improving and extending it seems worth it to me as long
as it can also improve tests.

> > but I see also
> > your point that this is not mandatory.  If you want, I could give it a
> > shot tomorrow to see where it leads.
> 
> Oh yeah that would be great (and maybe you already see a use case in the
> current test). Thanks!

Yes, while it shows a bit more of what one can achieve with bgw_extra, it
also helps improve the test coverage by starting dynamic workers across
defined databases and roles through a SQL function.

It took me a bit longer than I expected, but here is what I finished
with:
- 0001 extends worker_spi to be able to pass down database and role
IDs for the workers to start, through MyBgworkerEntry->bgw_extra.
Perhaps the two new arguments of worker_spi_launch() should use
InvalidOid as default, actually, to fall back to the database and
roles defined by the GUCs in these cases.  That would be two lines to
change in worker_spi--1.0.sql.
- 0002 is your patch, on top of which I have added handling for the
flags in the launch() function with a text[].  The tests get much
simpler, and don't need restarts.
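
For readers unfamiliar with bgw_extra, the mechanism in 0001 boils down
to copying a small struct into the worker entry before registration and
reading it back in the worker's main routine; a condensed sketch (the
struct name is hypothetical, see the attached patch for the real code):

/* shared between launcher and worker; must fit in BGW_EXTRALEN bytes */
typedef struct worker_spi_extra
{
    Oid         dboid;          /* database to connect to, or InvalidOid */
    Oid         roleoid;        /* role to connect as, or InvalidOid */
} worker_spi_extra;

/* launcher side, before RegisterDynamicBackgroundWorker(&worker, ...) */
worker_spi_extra extra = {dboid, roleoid};

StaticAssertStmt(sizeof(extra) <= BGW_EXTRALEN, "bgw_extra too small");
memcpy(worker.bgw_extra, &extra, sizeof(extra));

/* worker side, at the top of its main routine */
memcpy(&extra, MyBgworkerEntry->bgw_extra, sizeof(extra));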

By the way, I am pretty sure that, even with the original patch, we are
going to need a wait phase after the startup of the worker in the case
where "nologrole" is not allowed to log in: the worker may not have
started at the point where we check the logs.  That's true as well when
worker_spi_launch() is involved.

What do you think?
--
Michael
From b37dd684da3ff6f9c00719c27b89f1e39f6961a4 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 4 Oct 2023 14:28:20 +0900
Subject: [PATCH v3 1/2] Redesign database and role handling in worker_spi

worker_spi_launch gains two arguments for a database ID and a role ID.
If these are InvalidOid, the dynamic worker falls back to the GUCs
defined to launch the workers.  Tests are refactored accordingly.

bgw_extra is used to pass down these two OIDs to the main bgworker
routine.  Workers loaded with shared_preload_libraries just rely on the
GUCs.
---
 .../modules/worker_spi/t/001_worker_spi.pl| 21 +---
 .../modules/worker_spi/worker_spi--1.0.sql|  2 +-
 src/test/modules/worker_spi/worker_spi.c  | 54 ++-
 3 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/src/test/modules/worker_spi/t/001_worker_spi.pl b/src/test/modules/worker_spi/t/001_worker_spi.pl
index 2965acd789..a026042c65 100644
--- a/src/test/modules/worker_spi/t/001_worker_spi.pl
+++ b/src/test/modules/worker_spi/t/001_worker_spi.pl
@@ -20,9 +20,11 @@ $node->safe_psql('postgres', 'CREATE EXTENSION worker_spi;');
 # This consists in making sure that a table name "counted" is created
 # on a new schema whose name includes the index defined in input argument
 # of worker_spi_launch().
-# By default, dynamic bgworkers connect to the "postgres" database.
+# By default, dynamic bgworkers connect to the "postgres" database with
+# an undefined role, falling back to the GUC defaults (InvalidOid for
+# worker_spi_launch).
 my $result =
-  $node->safe_psql('postgres', 'SELECT worker_spi_launch(4) IS NOT NULL;');
+  $node->safe_psql('postgres', 'SELECT worker_spi_launch(4, 0, 0) IS NOT NULL;');
 is($result, 't', "dynamic bgworker launched");
 $node->poll_query_until(
 	'postgres',
@@ -58,6 +60,7 @@ note "testing bgworkers loaded with shared_preload_libraries";
 # Create the database first so as the workers can connect to it when
 # the library is loaded.
 $node->safe_psql('postgres', q(CREATE DATABASE mydb;));
+$node->safe_psql('postgres', q(CREATE ROLE myrole SUPERUSER LOGIN;));
 $node->safe_psql('mydb', 'CREATE EXTENSION worker_spi;');
 
 # Now load the module as a shared library.
@@ -80,16 +83,18 @@ ok( $node->poll_query_until(
 
 # Ask worker_spi to launch dynamic bgworkers with the library loaded, then
 # check their existence.  Use IDs that do not overlap with the schemas created
-# by the previous workers.
-my $worker1_pid = $node->safe_psql('mydb', 'SELECT worker_spi_launch(10);');
-my $worker2_pid = $node->safe_psql('mydb', 'SELECT worker_spi_launch(11);');
+# by the previous workers.  These use a specific role.
+my $myrole_id = $node->safe_psql('mydb', "SELECT oid FROM pg_roles where rolname 

Re: Synchronizing slots from primary to standby

2023-10-03 Thread Drouvot, Bertrand

Hi,

On 10/4/23 6:26 AM, shveta malik wrote:

On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila  wrote:


On Tue, Oct 3, 2023 at 9:27 PM shveta malik  wrote:


On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand
 wrote:


Hi,

On 10/3/23 12:54 PM, Amit Kapila wrote:

On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
 wrote:


On 9/29/23 1:33 PM, Amit Kapila wrote:

On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
 wrote:





- probably open corner cases like: what if a standby is down? Would that mean
that synchronize_slot_names not being sent to the primary would allow the
decoding on the primary to go ahead?



Good question. BTW, irrespective of whether we have
'standby_slot_names' parameters or not, how should we behave if
standby is down? Say, if 'synchronize_slot_names' is only specified on
standby then in such a situation primary won't be even aware that some
of the logical walsenders need to wait.


Exactly, that's why I was thinking keeping standby_slot_names to address
this scenario. In such a case one could simply decide to keep or remove
the associated physical replication slot from standby_slot_names. Keep would
mean "wait" and removing would mean allow to decode on the primary.


OTOH, one can say that users
should configure 'synchronize_slot_names' on both primary and standby
but note that this value could be different for different standby's,
so we can't configure it on primary.



Yeah, I think that's a good use case for standby_slot_names, what do you think?



But, even if we keep 'standby_slot_names' for this purpose, the
primary doesn't know the value of 'synchronize_slot_names' once the
standby is down and or the primary is restarted. So, how will we know
which logical WAL senders needs to wait for 'standby_slot_names'?



Yeah right, I also think we'd need:

- synchronize_slot_names on both primary and standby

But now we would need to take care of different standbys having different
values (as you said up-thread).

Thinking out loud: What about a single GUC on the primary (not 
standby_slot_names nor
synchronize_slot_names) but say logical_slots_wait_for_standby that could be a 
list of say
"logical_slot_name:physical_slot".

I think this GUC would help us define each walsender behavior (should the 
standby(s)
be up or down):



It may help in defining the walsender's behaviour better for sure. But
the problem I see once we start defining sync-slot-names on primary
(in any form whether as independent GUC or as above mapping GUC) is
that it needs to be then in sync with standbys, as each standby for
sure needs to maintain its own sync-slot-names GUC to make it aware of
what all it needs to sync.


Yes, I also think so. Also, defining such a GUC where the user wants to
sync all the slots (which would normally be the case) would be a
nightmare for the users.



This brings us to the original question of
how do we actually keep these configurations in sync between primary
and standby if we plan to maintain it on both?



- don't wait if its associated logical_slot is not listed in this GUC
- or wait based on its associated "list" of mapped physical slots (would 
probably
have to deal with the min restart_lsn for all the corresponding mapped ones).

I don't think we can avoid having to define at least one GUC on the primary (at 
least to
handle the case of standby(s) being down).



How about an alternate scheme where we define sync_slot_names on
standby but then store the physical_slot_name in the corresponding
logical slot (ReplicationSlotPersistentData) to be synced? So, the
standby will send the list of 'sync_slot_names' and the primary will
add the physical standby's slot_name in each of the corresponding
sync_slot. Now, if we do this then even after restart, we should be
able to know for which physical slot each logical slot needs to wait.
We can even provide an SQL API to reset the value of
standby_slot_names in logical slots as a way to unblock decoding in
case of emergency (for example, corresponding when physical standby
never comes up).




This looks like a better approach to me. It solves most of the pain points:
1) It avoids the need for multiple GUCs.
2) Primary and standby need not worry about keeping a sync-slot-names
GUC in sync on both sides.
3) The user still gets the flexibility to remove a standby from the
wait-list of the primary's logical walsenders using the reset SQL API.



Fully agree.


Now some initial thoughts:
1) Since each logical slot may need to be synced by multiple physical
standbys, in ReplicationSlotPersistentData we need to hold a list of
standby names. This brings us to the question of how much to allocate
initially in shared memory. Shall it be enough for
max_replication_slots (the worst-case scenario) physical-standby names
in each ReplicationSlotPersistentData?



Yeah, and even if we do the opposite, meaning add the 'to-sync'
logical replication slot to the ReplicationSlotPersistentData of the physical
slot(s), the questions still remain (as a physical st

Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2023-10-03 Thread David G. Johnston
Extending my prior email, which is now redundant.

On Tue, Oct 3, 2023 at 7:00 PM David G. Johnston 
wrote:

> On Tue, Oct 3, 2023 at 4:15 PM Karl O. Pinc  wrote:
>
>> On Tue, 3 Oct 2023 14:51:31 -0700
>> "David G. Johnston"  wrote:
>>
>> Isn't the entire section about "deviating from the normal flow of the
>> code"?  That's what makes me want "Exception" in the section title.
>>
>
> It is about how error handling in a procedure diverts the flow from the
> normal code path to some other code path - what that path is labelled is
> less important than the thing that causes the diversion - an error.
>
>
>> ?  I remain (overly?) focused on the word "exception", since that's
>> what's in the brain of the user that's writing RAISE EXCEPTION.
>> It matters if exceptions and errors are different.  If they're
>> not, then it also matters since it's exceptions that the user's
>> code raises.
>>
>>
> It's unfortunate the keyword to raise the message level "ERROR" is
> "EXCEPTION" in that command, but I'd rather simply handle that one anomaly
> than make the rest of the system use the word exception, especially as we
> seem to be fairly consistent in our usage of ERROR already.  I'm sympathetic that
> other systems out there also encourage the usage of exception in this
> context instead of error - but not to the point of opening up this
> long-standing decision for rework.
>
>
>> Have you any thoughts on the "permissions", "privleges" and
>> "attributes" vocabulary/concepts used in this area?
>>
>
> I think we benefit from being able to equate permissions and privileges
> and trying to separate them is going to be more harmful than helpful.  The
> limited things that role attributes permit, and how they fall outside the
> privilege/permission concept as we use it, isn't something that I've
> noticed is a problem that needs addressing.
>
>
>  (I'm slightly
>> nervous about the renumbering making the thread hard to follow.)
>>
>>
> 0009 - Something just seems off with this one.  Unless there are more
> places with this type in use I would just move the relevant notes (i.e.,
> the one in proallargtypes) to that column and be done with it.  If there
> are multiple places then moving the notes to the main docs and
> cross-referencing to them seems warranted.  I also wouldn't call it legacy.
>
> 0010 -
>
> When creating new objects, if a schema qualification is not given with the
> name, the first extant entry in the search_path is chosen; then an error
> will be raised should the supplied name already exist in that schema.
> In contexts where the object must already exist, but its name is not
> schema-qualified, the extant search_path schemas will be consulted serially
> until one of them contains an appropriate object, which is returned, or all
> schemas have been consulted, resulting in an object-not-found error.
>
> I'm not seeing much value in presenting the additional user/public details
> here.  Especially as it would then seem appropriate to include pg_temp.
> And now we have to deal with the fact that by default the public schema
> isn't so public anymore.
>
>
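
To illustrate the two resolution rules in the quoted 0010 wording
(schema names purely hypothetical):

SET search_path = alpha, beta;   -- both schemas exist
CREATE TABLE t ();               -- created as alpha.t, the first entry
SELECT * FROM t;                 -- alpha is consulted first, then beta
CREATE TABLE t ();               -- error: "t" already exists in alpha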
0011 - (first pass, going from memory, might have missed some needed
details)

Aside from non-atomic SQL routine bodies (functions and procedures),
executing SQL sent by the connected client does not result in raw SQL, or
textual expressions, being stored for later evaluation.  All objects are
identified (or created) during execution, their effects stored within the
system catalogs, and the objects assigned system identifiers (oids) to
provide an absolute and immutable reference to be used while establishing
inter-object dependencies.  In short, indirect actions taken by the
server, based upon stored knowledge, can and often will execute while in a
search_path that only contains the pg_catalog schema, so that the stored
knowledge can be found.

For routines written in any language except atomic SQL, the textual body of
the routine is stored as-is within the database.  When executing such a
routine, the (parent) session basically opens up a new connection to the
server (one per routine) and within that new sub-session sends the SQL
contained within the routine to the server for execution just like any
other client, and therefore any object references present in that SQL need
to be resolved to a schema as previously discussed.  By default, upon
connecting, the newly created session is updated so that its settings take
on the current values in the parent session.  When authoring a routine this
is often undesirable, as the behavior of the routine now depends upon an
environment that is not definitively known to the routine author.
Schema-qualifying object references within the routine body is one tool to
remove such uncertainty.  Another is using the SET clause of the relevant
CREATE SQL command to specify what the values of important settings are to
be.
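
As a concrete illustration of that last point (function and table names
hypothetical):

CREATE FUNCTION count_widgets() RETURNS bigint
LANGUAGE plpgsql
SET search_path = pg_catalog, pg_temp   -- pin the routine's environment
AS $$
BEGIN
    -- the remaining reference is schema-qualified, removing any
    -- dependence on the caller's search_path
    RETURN (SELECT count(*) FROM public.widgets);
END;
$$;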

The key takeaway from the preceding two paragraphs is that because routines
are stored as text and their settings res

Re: Synchronizing slots from primary to standby

2023-10-03 Thread Drouvot, Bertrand

Hi,

On 10/4/23 7:00 AM, shveta malik wrote:

On Wed, Oct 4, 2023 at 9:56 AM shveta malik  wrote:



The most simplistic approach would be:

1) maintain standby_slot_names GUC on primary
2) maintain synchronize_slot_names GUC on physical standby alone.

On the primary, let all logical walsenders wait on the physical standbys
configured in the standby_slot_names GUC. This will work and will avoid
all the complexity involved in the designs discussed above. But this
simplistic approach comes with disadvantages like the ones below:

1) Even if the associated slot of logical-walsender is not part of
synchronize_slot_names of any of the physical-standbys, it is still
waiting for all the configured standbys to finish.


That's right. Currently (with the walsender waiting an arbitrary amount of
time) that sounds like a -1. But if we're going with a new CV approach (like
the one proposed in [1]), that might not be so terrible. Though I don't feel
comfortable with waiting for no reason (even if it is possibly only for a
short amount of time).


2) If the associated slot of a logical walsender is part of
synchronize_slot_names of standby1, it is still waiting on standby2,
standby3, etc. to finish, i.e. waiting on the rest of the standbys
configured in standby_slot_names which have not even marked that logical
slot in their synchronize_slot_names.



Same thoughts as above for 1)


So we need to weigh our options here.



With the simplistic approach, if a standby goes down, that would impact
unrelated walsenders on the primary until the standby's associated physical
slot is removed from standby_slot_names, and I don't feel comfortable with
this behavior.

So, I'm +1 for the ReplicationSlotPersistentData approach proposed by Amit.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1LNjgL6Lghgu1PcDfuoOfa8Ug4J7Uv-H%3DBPP8Wgf1%2BpOw%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-03 Thread David Rowley
Thanks for taking a look at this.

On Wed, 4 Oct 2023 at 16:57, Michael Paquier  wrote:
> +   buf.len = VARSIZE_ANY_EXHDR(sstate);
> +   buf.maxlen = 0;
> +   buf.cursor = 0;
>
> Perhaps it would be worth hiding that in a macro defined in
> stringinfo.h?

The original patch had a new function in stringinfo.c which allowed a
StringInfoData to be initialised from an existing string with some
given length.  Tom wasn't a fan of that because there wasn't any
protection against someone trying to use the given StringInfoData and
then calling appendStringInfo to append another string. That can't be
done in this case as we can't repalloc the VARDATA_ANY(state) pointer
due to it not pointing directly to a palloc'd chunk.  Tom's complaint
seemed to be about having a reusable function which could be abused,
so I modified the patch to remove the reusable code.  I think your
macro idea in stringinfo.h would put the patch in the same position as
it was initially.
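
For context, the open-coded initialization under discussion looks
roughly like this (a sketch based on the fragment quoted above, where
sstate is the serialized input):

StringInfoData buf;

/*
 * Point the StringInfo at the existing buffer without copying it.
 * maxlen = 0 marks the buffer as not owned by stringinfo.c, so it must
 * not be passed to the appendStringInfo* functions afterwards.
 */
buf.data = VARDATA_ANY(sstate);
buf.len = VARSIZE_ANY_EXHDR(sstate);
buf.maxlen = 0;
buf.cursor = 0;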

It would be possible to do something like have maxlen == -1 mean that
the StringInfoData.data field isn't being managed internally in
stringinfo.c and then have all the appendStringInfo functions check
for that, but I really don't want to add overhead to everything that
uses appendStringInfo just for this.

David




Re: Removing unneeded self joins

2023-10-03 Thread Andrei Lepikhov

On 4/10/2023 07:12, Alexander Korotkov wrote:

Hi!

Thanks for the review!


I think this is a neat optimization.

On Tue, Sep 12, 2023 at 4:58 PM Andrey Lepikhov
 wrote:

On 5/7/2023 21:28, Andrey Lepikhov wrote:

During the significant code revision in v.41 I lost some replacement
operations. Here is the fix and extra tests to check this in the future.
Also, Tom added the JoinDomain structure five months ago, and I added
code to replace relids for that place too.
One more thing, I found out that we didn't replace SJs defined by
baserestrictinfos if not a single self-join clause existed for the join.
Now it is fixed, and a test has been added.
To understand the changes readily, see the delta file in the attachment.

Here is the new patch in attachment, rebased on current master and with
some minor gaffes fixed.


I went through the thread and I think the patch is getting into better
shape.  A couple of notes from my side.
1) Why does replace_relid() make a copy of relids only on insert/replace
of a member, but perform deletion in-place?


In short, it was done according to a 'paranoid' strategy.
The main reason for copying before deletion was the case of the rinfo's
required_relids and clause_relids: they both point to the same Bitmapset
in some cases, and we feared similar aliasing for other fields.
Right now, it may be redundant because we resolved the issue mentioned
above in replace_varno_walker.


The relid replacement machinery is the most contentious code here. We took
a utilitarian approach and implemented a simplistic variant.



2) It would be nice to skip the insertion of IS NOT NULL checks when
they are not necessary.  [1] points out that infrastructure from [2] might
be useful.  The patchset from [2] seems committed now.  However, I
can't see that it is directly helpful in this matter.  Could we just skip
adding the IS NOT NULL clause for columns that have
pg_attribute.attnotnull set?

Thanks for the links, I will look into that case.
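
For readers following along, the shape of the case in question (table
definition hypothetical):

CREATE TABLE t (pk int PRIMARY KEY, x int);

-- self-join on a unique key: the optimization removes the join,
SELECT t1.x FROM t t1 JOIN t t2 ON t1.pk = t2.pk;
-- roughly rewriting the query as:
SELECT t1.x FROM t t1 WHERE t1.pk IS NOT NULL;
-- pk is declared PRIMARY KEY, so pg_attribute.attnotnull is set and
-- the added IS NOT NULL qual is provably redundant.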


Links
1. https://www.postgresql.org/message-id/2375492.jE0xQCEvom%40aivenronan
2.  https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us


--
regards,
Andrey Lepikhov
Postgres Professional