Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund wrote:
>
> Hi,
>
> On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a
> > hard time imagining how it would come up, even for something like the
> > conveyor-belt concept.
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.

I took a brief look at that and agree we should sometime make it work there
as well.

v26 tidstore_add_tids() appears to assume that it's only called once per
blocknumber. While the order of offsets doesn't matter there for a single
block, calling it again with the same block would wipe out the earlier
offsets, IIUC.

To do an actual "add tid" where the order doesn't matter, it seems we would
need to (acquire lock if needed), read the current bitmap and OR in the new
bit if it exists, then write it back out. That sounds slow, so it might still
be good for vacuum to call a function that passes a block and an array of
offsets that are assumed ordered (as in v28), but with a more accurate name,
like tidstore_set_block_offsets().

--
John Naylor
EDB: http://www.enterprisedb.com
Re: recovery modules
On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: > On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote: >> On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: >>> Not related the things changed here, but this should never have been pushed >>> down into individual archive modules. There's absolutely no way that we're >>> going to keep this up2date and working correctly in random archive >>> modules. And it would break if archive modules are ever called outside of >>> pgarch.c. Hmm, yes. That's a bad idea to copy the error handling stack. The maintenance of this code could quickly go wrong. All that had better be put into their own threads, IMO, to bring more visibility on these subjects. > I'm quite baffled by: > /* Close any files left open by copy_file() or compare_files() > */ > AtEOSubXact_Files(false, InvalidSubTransactionId, > InvalidSubTransactionId); > > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() > completely outside the context of a transaction environment. And it only does > the thing you want because you pass parameters that aren't actually valid in > the normal use in AtEOSubXact_Files(). I really don't understand how that's > supposed to be ok. As does this part, probably with a backpatch.. Saying that, I have spent more time on the revamped version of the archive modules and it was already doing a lot, so I have applied it as it covered all the points discussed.. -- Michael signature.asc Description: PGP signature
Re: Move defaults toward ICU in 16?
On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:
> I am saying that pg_upgrade should be able to deal with the
> difference. The details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update pg_database to
set the new databases' collation fields equal to that of the old cluster?

I'll submit it as a separate patch because it would be independently useful.

Regards,
	Jeff Davis
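Roughly speaking, the catalog change in question would amount to something
like the following on the new cluster (a sketch only, with placeholder
values; pg_upgrade would carry over whatever the old cluster's databases
actually used rather than hard-coding anything):

-- sketch with placeholder values: make the new cluster's databases carry the
-- old cluster's default collation settings instead of the new initdb defaults
UPDATE pg_database
   SET datlocprovider = 'c',            -- e.g. the old cluster used libc
       datcollate     = 'en_US.UTF-8',
       datctype       = 'en_US.UTF-8',
       daticulocale   = NULL
 WHERE datname IN ('template0', 'template1', 'postgres');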
Re: Normalization of utility queries in pg_stat_statements
Hi, On 2/17/23 3:35 AM, Michael Paquier wrote: On Thu, Feb 16, 2023 at 10:55:32AM +0100, Drouvot, Bertrand wrote: In the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave with default values for those (currently we are setting both of them as non default). Then, with the default values in place, if we feel that some tests are missing we could add them in > utility.sql or planning.sql accordingly. I am not sure about this part, TBH, so I have left these as they are. Anyway, while having a second look at that, I have noticed that it is possible to extract as an independent piece all the tests related to level tracking. Things are worse than I thought initially, actually, because we had test scenarios mixing planning and level tracking, but the tests don't care about measuring plans at all, see around FETCH FORWARD, meaning that queries on the table pg_stat_statements have just been copy-pasted around for the last few years. There were more tests that used "test" for a table name ;) I have been pondering about this part, and the tracking matters for DO blocks and PL functions, so I have moved all these cases into a new, separate file. There is a bit more that can be done, for WAL tracking and roles near the end of pg_stat_statements.sql, but I have left that out for now. I have checked the output of the tests before and after the refactoring for quite a bit of time, and the outputs match so there is no loss of coverage. 0001 looks quite committable at this stage, and that's independent on the rest. At the end this patch creates four new test files that are extended in the next patches: utility, planning, track and cleanup. Thanks! LGTM. 0002: Produces: v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace. CREATE VIEW view_stats_1 AS v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace. CREATE VIEW view_stats_1 AS warning: 2 lines add whitespace errors. Thanks, fixed. Thanks! +-- SET TRANSACTION ISOLATION +BEGIN; +SET TRANSACTION ISOLATION LEVEL READ COMMITTED; +SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; +SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; What about adding things like "SET SESSION CHARACTERISTICS AS TRANSACTION..." too? That's a good idea. It is again one of these fancy cases, better to keep a track of them in the long-term.. Right. 002 LGTM. 0003 and 0004: Thanks for keeping 0003 that's useful to see the impact of A_Const normalization. Looking at the diff they produce, I also do think that 0004 is what could be done for PG16. I am wondering if others have an opinion to share about that, but, yes, 0004 seems enough to begin with. We could always study more normalization areas in future releases, taking it slowly. Agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [Proposal] Add foreign-server health checks infrastructure
On 2023-02-09 23:39, Hayato Kuroda (Fujitsu) wrote: Dear Katsuragi-san, Thank you for reviewing! PSA new version patches. Thank you for updating the patch! These are my comments, please check. 0001: Extending pqSocketPoll seems to be a better way because we can avoid having multiple similar functions. I also would like to hear horiguchi-san's opinion whether this matches his expectation. Improvements of pqSocketPoll/pqSocketCheck is discussed in this thread[1]. I'm concerned with the discussion. As for the function's name, what do you think about keeping current name (pqSocketCheck)? pqSocketIsReadable... describes the functionality very well though. pqConnCheck seems to be a family of pqReadReady or pqWriteRedy, so how about placing pqConnCheck below them? + * Moreover, when neither forRead nor forWrite is requested and timeout is + * disabled, try to check the health of socket. Isn't it better to put the comment on how the arguments are interpreted before the description of return value? +#if defined(POLLRDHUP) + input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL; ... + input_fd.events |= POLLERR; To my understanding, POLLHUP, POLLNVAL and POLLERR are ignored in event. Are they necessary? 0002: As for the return value of postgres_fdw_verify_connection_states, what do you think about returning NULL when connection-checking is not performed? I think there are two cases 1) ConnectionHash is not initialized or 2) connection is not found for specified server name, That is, no entry passes the first if statement below (case 2)). ``` if (all || entry->serverid == serverid) { if (PQconnCheck(entry->conn)) { ``` 0004: I'm wondering if we should add kqueue support in this version, because adding kqueue support introduces additional things to be considered. What do you think about focusing on the main functionality using poll in this patch and going for kqueue support after this patch? [1]: https://www.postgresql.org/message-id/20230209.115009.2229702014236187289.horikyota.ntt%40gmail.com regards, -- Katsuragi Yuta Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Doc: Improve note about copying into postgres_fdw foreign tables in batch
Hi, Here is a small patch to improve the note, which was added by commit 97da48246 ("Allow batch insertion during COPY into a foreign table."), by adding an explanation about how the actual number of rows postgres_fdw inserts at once is determined in the COPY case, including a limitation that does not apply to the INSERT case. I will add this to the next CF. Best regards, Etsuro Fujita postgres-fdw-batch-insert-doc.patch Description: Binary data
Re: Make set_ps_display faster and easier to use
Thank you for having a look at this. On Fri, 17 Feb 2023 at 14:01, Andres Freund wrote: > > +set_ps_display_suffix(const char *suffix) > > +{ > > + size_t len; > > Think this will give you an unused-variable warning in the PS_USE_NONE case. Fixed > > +#ifndef PS_USE_NONE > > + /* update_process_title=off disables updates */ > > + if (!update_process_title) > > + return; > > + > > + /* no ps display for stand-alone backend */ > > + if (!IsUnderPostmaster) > > + return; > > + > > +#ifdef PS_USE_CLOBBER_ARGV > > + /* If ps_buffer is a pointer, it might still be null */ > > + if (!ps_buffer) > > + return; > > +#endif > > This bit is now repeated three times. How about putting it into a helper? Good idea. Done. > > +set_ps_display_internal(void) > > Very very minor nit: Perhaps this should be update_ps_display() or > flush_ps_display() instead? I called the precheck helper update_ps_display_precheck(), so went with flush_ps_display() for updating the display so they both didn't start with "update". Updated patch attached. David diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c index 80d681b71c..889e20b5dd 100644 --- a/src/backend/replication/syncrep.c +++ b/src/backend/replication/syncrep.c @@ -148,8 +148,6 @@ static bool SyncRepQueueIsOrderedByLSN(int mode); void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) { - char *new_status = NULL; - const char *old_status; int mode; /* @@ -216,15 +214,10 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) /* Alter ps display to show waiting for sync rep. */ if (update_process_title) { - int len; - - old_status = get_ps_display(&len); - new_status = (char *) palloc(len + 32 + 1); - memcpy(new_status, old_status, len); - sprintf(new_status + len, " waiting for %X/%X", - LSN_FORMAT_ARGS(lsn)); - set_ps_display(new_status); - new_status[len] = '\0'; /* truncate off " waiting ..." 
*/ + charbuffer[32]; + + sprintf(buffer, "waiting for %X/%X", LSN_FORMAT_ARGS(lsn)); + set_ps_display_suffix(buffer); } /* @@ -322,12 +315,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) MyProc->syncRepState = SYNC_REP_NOT_WAITING; MyProc->waitLSN = 0; - if (new_status) - { - /* Reset ps display */ - set_ps_display(new_status); - pfree(new_status); - } + /* reset ps display to remove the suffix */ + if (update_process_title) + set_ps_display_remove_suffix(); } /* diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 2d6dbc6561..98904a7c05 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4302,8 +4302,8 @@ void LockBufferForCleanup(Buffer buffer) { BufferDesc *bufHdr; - char *new_status = NULL; TimestampTz waitStart = 0; + boolwaiting = false; boollogged_recovery_conflict = false; Assert(BufferIsPinned(buffer)); @@ -4350,11 +4350,11 @@ LockBufferForCleanup(Buffer buffer) waitStart, GetCurrentTimestamp(), NULL, false); - /* Report change to non-waiting status */ - if (new_status) + if (waiting) { - set_ps_display(new_status); - pfree(new_status); + /* reset ps display to remove the suffix if we added one */ + set_ps_display_remove_suffix(); + waiting = false; } return; } @@ -4374,18 +4374,11 @@ LockBufferForCleanup(Buffer buffer) /* Wait to be signaled by UnpinBuffer() */ if (InHotStandby) { - /* Report change to waiting status */ - if (update_process_title && new_status == NULL) + if (!waiting) { - const char *old_status; - int len; - - old_status = get_ps_display(&len); - new_status = (char *) palloc(len + 8 + 1); - memcpy(new_status, old_status, len); - strcpy(new_status + len, " waiting"); - set_ps_display(new_status); -
Re: [PATCH] Add pretty-printed XML output option
On 16.02.23 05:38, Nikolay Samokhvalov wrote:
> On Thu, Feb 9, 2023 at 2:31 AM Peter Eisentraut wrote:
> > I suggest we call it "xmlformat", which is an established term for this.
>
> Some very-very old, rusted memory told me that there was something in the
> standard – and indeed, it seems it described an optional Feature X069,
> "XMLSerialize: INDENT" for XMLSERIALIZE. So probably pretty-printing should
> go there, to XMLSERIALIZE, to follow the standard?
>
> Oracle also has an option for it in XMLSERIALIZE, although in a slightly
> different form, with the ability to specify the number of spaces for
> indentation:
> https://docs.oracle.com/database/121/SQLRF/functions268.htm#SQLRF06231

After your comment I'm studying the possibility of extending the existing
xmlserialize function to add the indentation feature. If so, what do you
think it should look like? An extra parameter? e.g.

SELECT xmlserialize(DOCUMENT '<foo>42</foo>'::XML AS text, true);

... or more like how Oracle does it:

SELECT XMLSERIALIZE(DOCUMENT xmltype('<foo>42</foo>') AS BLOB INDENT) FROM dual;

Thanks!

Best, Jim
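For reference, if the SQL-standard Feature X069 route is taken, the option
would presumably hang off XMLSERIALIZE itself as an optional clause rather
than an extra function argument, roughly like this (a sketch of the assumed
standard syntax only, not an existing PostgreSQL feature):

-- sketch, assuming the standard's [NO] INDENT clause were adopted
SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>'::xml AS text INDENT);
SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>'::xml AS text NO INDENT);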
Re: Support logical replication of DDLs
On Fri, Feb 17, 2023 at 1:13 AM Jonathan S. Katz wrote: > > On 2/16/23 2:38 PM, Alvaro Herrera wrote: > > On 2023-Feb-16, Jonathan S. Katz wrote: > > > >> On 2/16/23 12:53 PM, Alvaro Herrera wrote: > > > >>> I don't think this is the fault of logical replication. Consider that > >>> for the backend server, the function source code is just an opaque > >>> string that is given to the plpgsql engine to interpret. So there's no > >>> way for the logical DDL replication engine to turn this into runnable > >>> code if the table name is not qualified. > >> > >> Sure, that's fair. That said, the example above would fall under a "typical > >> use case", i.e. I'm replicating functions that call tables without schema > >> qualification. This is pretty common, and as logical replication becomes > >> used for more types of workloads (e.g. high availability), we'll definitely > >> see this. > > > > Hmm, I think you're saying that replay should turn check_function_bodies > > off, and I think I agree with that. > > Yes, exactly. +1 > But will that be sufficient? I guess such functions can give errors at a later stage when invoked at DML or another DDL time. Consider the following example: Pub: CREATE PUBLICATION pub FOR ALL TABLES with (ddl = 'all'); Sub: (Set check_function_bodies = off in postgresql.conf) CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub; Pub: CREATE FUNCTION t1(a int) RETURNS int AS $$ select a+1; $$ LANGUAGE sql; CREATE FUNCTION t(a int) RETURNS int AS $$ select t1(a); $$ LANGUAGE sql; CREATE TABLE tbl1 (a int primary key, b text); create index idx on tbl1(t(a)); insert into tbl1 values (1,1); -- This insert on publisher causes an error on the subscriber. Check subscriber Logs (ERROR: function t1(integer) does not exist at character 9.) This happens because of the function used in the index expression. Now, this is not the only thing, the replication can even fail during DDL replication when the function like above is IMMUTABLE and used as follows: ALTER TABLE tbl ADD COLUMN d int DEFAULT t(1); Normally, it is recommended that users can fix such errors by schema-qualifying affected names. See commits 11da97024a and 582edc369c. -- With Regards, Amit Kapila.
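For the example above, the usual fix is to schema-qualify the reference
inside the function body, e.g. (same object names as in the script above):

CREATE FUNCTION t(a int) RETURNS int AS $$ select public.t1(a); $$ LANGUAGE sql;

With that, the index expression and the DEFAULT expression can be evaluated
on the subscriber regardless of its search_path.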
Re: Kerberos delegation support in libpq and postgres_fdw
Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: > On Mon, Sep 19, 2022 at 02:05:39PM -0700, Jacob Champion wrote: > > It's not prevented, because a password is being used. In my tests I'm > > connecting as an unprivileged user. > > > > You're claiming that the middlebox shouldn't be doing this. If this new > > default behavior were the historical behavior, then I would have agreed. > > But the cat's already out of the bag on that, right? It's safe today. > > And if it's not safe today for some other reason, please share why, and > > maybe I can work on a patch to try to prevent people from doing it. > > Please note that this has been marked as returned with feedback in the > current CF, as this has remained unanswered for a bit more than three > weeks. There's some ongoing discussion about how to handle outbound connections from the server ending up picking up credentials from the server's environment (that really shouldn't be allowed unless specifically asked for..), that's ultimately an independent change from what this patch is doing. Here's an updated version which does address Robert's concerns around having this disabled by default and having options on both the server and client side saying if it is to be enabled or not. Also added to pg_stat_gssapi a field that indicates if credentials were proxied or not and made some other improvements and added additional regression tests to test out various combinations. Thanks, Stephen diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 78a8bcee6e..713ef7c248 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2589,7 +2589,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn) { if (!superuser()) { - if (!PQconnectionUsedPassword(conn)) + if (!(PQconnectionUsedPassword(conn) || PQconnectionUsedGSSAPI(conn))) { libpqsrv_disconnect(conn); if (rconn) @@ -2597,8 +2597,8 @@ dblink_security_check(PGconn *conn, remoteConn *rconn) ereport(ERROR, (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), - errmsg("password is required"), - errdetail("Non-superuser cannot connect if the server does not request a password."), + errmsg("password or GSSAPI is required"), + errdetail("Non-superuser cannot connect if the server does not request a password or use GSSAPI."), errhint("Target server's authentication method must be changed."))); } } diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index d5fc61446a..77656ccf80 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -171,7 +171,8 @@ ALTER SERVER testserver1 OPTIONS ( sslcrl 'value', --requirepeer 'value', krbsrvname 'value', - gsslib 'value' + gsslib 'value', + gssdeleg 'value' --replication 'value' ); -- Error, invalid list syntax diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c index 984e4d168a..bb6f309907 100644 --- a/contrib/postgres_fdw/option.c +++ b/contrib/postgres_fdw/option.c @@ -286,6 +286,9 @@ InitPgFdwOptions(void) {"sslcert", UserMappingRelationId, true}, {"sslkey", UserMappingRelationId, true}, + /* gssencmode is also libpq option, same to above. 
*/ + {"gssencmode", UserMappingRelationId, true}, + {NULL, InvalidOid, false} }; diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 1e50be137b..1a43bdf55b 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -185,7 +185,8 @@ ALTER SERVER testserver1 OPTIONS ( sslcrl 'value', --requirepeer 'value', krbsrvname 'value', - gsslib 'value' + gsslib 'value', + gssdeleg 'value' --replication 'value' ); diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index ecd9aa73ef..a931996968 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1170,6 +1170,23 @@ include_dir 'conf.d' + + gss_accept_deleg (boolean) + + gss_accept_deleg configuration parameter + + + + +Sets whether GSSAPI delegation should be accepted from the client. +The default is off meaning credentials from the client will +NOT be accepted. Changing this to on will make the server +accept credentials delegated to it from the client. This parameter can only be +set in the postgresql.conf file or on the server command line. + + + + db_user_namespace (boolean) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 0e7ae70c70..9f3e12b2a7 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1879,6 +1879,18 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname + + gssdeleg + + +Forward (delegate) GSS credentials to the server. The default is +disable
Re: Support logical replication of global object commands
On Fri, Feb 17, 2023 at 10:58 AM Zheng Li wrote:
>
> > > > Actually, I intend something for global objects. But the main thing
> > > > that is worrying me about this is that we don't have a clean way to
> > > > untie global object replication from database-specific object
> > > > replication.
> > >
> > > I think ultimately we need a clean and efficient way to publish (and
> > > subscribe to) any changes in all databases, preferably in one logical
> > > replication slot.
> > >
> >
> > Agreed. I was thinking currently for logical replication both
> > walsender and slot are database-specific. So we need a way to
> > distinguish the WAL for global objects and then avoid filtering based
> > on the slot's database during decoding.
>
> But which WALSender should handle the WAL for global objects if we
> don't filter by database? Is there any specific problem you see for
> decoding global objects commands in a database specific WALSender?
>

I haven't verified but I was concerned about the below check:

logicalddl_decode
{
...
+
+	if (message->dbId != ctx->slot->data.database ||

--
With Regards,
Amit Kapila.
The output sql generated by pg_dump for a create function refers to a modified table name
Hi,

The output SQL generated by pg_dump for the below function refers to a
modified table name:

create table t1 (c1 int);
create table t2 (c1 int);

CREATE OR REPLACE FUNCTION test_fun(c1 int) RETURNS void LANGUAGE SQL
BEGIN ATOMIC
  WITH delete_t1 AS (
    DELETE FROM t1 WHERE c1 = $1
  )
  INSERT INTO t1 (c1) SELECT $1 FROM t2;
END;

The SQL output created by pg_dump refers to t1_1 where it should have been t1:

CREATE FUNCTION public.test_fun(c1 integer) RETURNS void
    LANGUAGE sql
    BEGIN ATOMIC
     WITH delete_t1 AS (
              DELETE FROM public.t1
               WHERE (t1_1.c1 = test_fun.c1)
             )
      INSERT INTO public.t1 (c1)  SELECT test_fun.c1
                FROM public.t2;
    END;

pg_get_function_sqlbody also returns a similar result:

select proname, pg_get_function_sqlbody(oid) from pg_proc where proname = 'test_fun';
 proname  |        pg_get_function_sqlbody
----------+---------------------------------------------
 test_fun | BEGIN ATOMIC                                +
          |  WITH delete_t1 AS (                        +
          |           DELETE FROM t1                    +
          |            WHERE (t1_1.c1 = test_fun.c1)    +
          |          )                                  +
          |   INSERT INTO t1 (c1)  SELECT test_fun.c1   +
          |     FROM t2;                                +
          | END
(1 row)

I felt the problem here is with the set_rtable_names function, which changes
the relation name t1 to t1_1 while parsing the statement:

/*
 * If the selected name isn't unique, append digits to make it so, and
 * make a new hash entry for it once we've got a unique name.  For a
 * very long input name, we might have to truncate to stay within
 * NAMEDATALEN.
 */

During query generation we set the table names before generating each
statement; in our case the table t1 would already have been added to the hash
table while generating the first INSERT statement. The next time it tries to
set the relation names for the next statement, i.e. the DELETE statement, an
entry with the same name already exists, so it changes the name to t1_1 by
appending a digit to keep the hash entry unique.

Regards,
Vignesh
Re: pg_upgrade and logical replication
On Fri, Feb 17, 2023 at 1:24 PM Julien Rouhaud wrote: > > I was working on testing a major upgrade scenario using a mix of physical and > logical replication when I faced some unexpected problem leading to missing > rows. Note that my motivation is to rely on physical replication / physical > backup to avoid recreating a node from scratch using logical replication, as > the initial sync with logical replication is much more costly and impacting > compared to pg_basebackup / restoring a physical backup, but the same problem > exist if you just pg_upgrade a node that has subscriptions. > > The problem is that pg_upgrade creates the subscriptions on the newly upgraded > node using "WITH (connect = false)", which seems expected as you obviously > don't want to try to connect to the publisher at that point. But then once > the > newly upgraded node is restarted and ready to replace the previous one, unless > I'm missing something there's absolutely no possibility to use the created > subscriptions without losing some data from the publisher. > > The reason is that the subscription doesn't have a local list of relation to > process until you refresh the subscription, but you can't refresh the > subscription without enabling it (and you can't enable it in a transaction), > which means that you have to let the logical worker start, consume and ignore > all changes that happened on the publisher side until the refresh happens. > > An easy workaround that I tried is to allow something like > > ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) > > so that the refresh internally happens before the apply worker is started and > you just keep consuming the delta, which works on naive scenario. > > One concern I have with this approach is that the default values for both > "refresh" and "copy_data" for all other subcommands is "true, but we would > probably need a different default value in that exact scenario (as we know we > already have the data). I think that it would otherwise be safe in my very > specific scenario, assuming that you created the slot beforehand and moved the > slot's LSN at the promotion point, as even if you add non-empty tables to the > publication you will only need the delta whether those were initially empty or > not given your initial physical replica state. > This point is not very clear. Why would one just need delta even for new tables? > Any other scenario would make > this new option dangerous, if not entirely useless, but not more than any of > the current commands that lead to refreshing a subscription and have the same > options I guess. > > All in all, currently the only way to somewhat safely resume logical > replication after a pg_upgrade is to drop all the subscriptions that were > transferred during pg_upgrade on all databases and recreate them (using the > existing slots on the publisher side obviously), allowing the initial > connection. But this approach only works in the exact scenario I mentioned > (physical to logical replication, or at least a case where *all* the tables > where logically replicated prior to the pg_ugprade), otherwise you have to > recreate the follower node from scratch using logical repication. > I think if you dropped and recreated the subscriptions by retaining old slots, the replication should resume from where it left off before the upgrade. Which scenario are you concerned about? > Is that indeed the current behavior, or did I miss something? 
> > Is this "resume logical replication on pg_upgraded node" something we want to > support better? I was thinking that we could add a new pg_dump mode (maybe > only usable during pg_upgrade) that also restores the pg_subscription_rel > content in each subscription or something like that. If not, should > pg_upgrade > keep preserving the subscriptions as it doesn't seem safe to use them, or at > least document the hazards (I didn't find anything about it in the > documentation)? > > There is a mention of this in pg_dump docs. See [1] (When dumping logical replication subscriptions ...) [1] - https://www.postgresql.org/docs/devel/app-pgdump.html -- With Regards, Amit Kapila.
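Concretely, the drop-and-recreate route discussed above looks roughly like
this on the upgraded node (names are placeholders; the point is to keep the
existing remote slot and to skip the initial copy):

-- the subscription comes out of pg_upgrade disabled (it was restored with
-- connect = false), so slot_name can be detached right away
ALTER SUBSCRIPTION sub SET (slot_name = NONE);   -- detach, so DROP keeps the remote slot
DROP SUBSCRIPTION sub;
CREATE SUBSCRIPTION sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub
    WITH (slot_name = 'sub', create_slot = false, copy_data = false);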
Re: Refactor calculations to use instr_time
Hi, On 2/16/23 19:13, Andres Freund wrote: +#define WALSTAT_ACC(fld, var_to_add) \ + (stats_shmem->stats.fld += var_to_add.fld) +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \ + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld)) + WALSTAT_ACC(wal_records, diff); + WALSTAT_ACC(wal_fpi, diff); + WALSTAT_ACC(wal_bytes, diff); + WALSTAT_ACC(wal_buffers_full, PendingWalStats); + WALSTAT_ACC(wal_write, PendingWalStats); + WALSTAT_ACC(wal_sync, PendingWalStats); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time); #undef WALSTAT_ACC - LWLockRelease(&stats_shmem->lock); WALSTAT_ACC is undefined, but WALLSTAT_ACC_INSTR_TIME_TYPE isn't. I'd not remove the newline before LWLockRelease(). /* diff --git a/src/include/pgstat.h b/src/include/pgstat.h index db9675884f3..295c5eabf38 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats TimestampTz stat_reset_timestamp; } PgStat_WalStats; +/* Created for accumulating wal_write_time and wal_sync_time as a instr_time Minor code-formatting point: In postgres we don't put code in the same line as a multi-line comment starting with the /*. So either /* single line comment */ or /* * multi line * comment */ Thanks for the review. I updated the patch. Regards, Nazir Bilal Yavuz Microsoft From e3723aca8b79b07190834b0cfd1440dbbf706862 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 2 Feb 2023 15:06:48 +0300 Subject: [PATCH v2] Refactor instr_time calculations --- src/backend/access/transam/xlog.c | 6 ++ src/backend/storage/file/buffile.c | 6 ++ src/backend/utils/activity/pgstat_wal.c | 27 + src/include/pgstat.h| 18 - 4 files changed, 35 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d1..3c35dc1ca23 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2206,8 +2206,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start); } PendingWalStats.wal_write++; @@ -8201,8 +8200,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_sync_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start); } PendingWalStats.wal_sync++; diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 0a51624df3b..e55f86b675e 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -469,8 +469,7 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start); } /* we choose not to advance curOffset here */ @@ -544,8 +543,7 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start); } file->curOffset += bytestowrite; diff --git 
a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c index e8598b2f4e0..ab2869a0e24 100644 --- a/src/backend/utils/activity/pgstat_wal.c +++ b/src/backend/utils/activity/pgstat_wal.c @@ -21,7 +21,7 @@ #include "executor/instrument.h" -PgStat_WalStats PendingWalStats = {0}; +PgStat_PendingWalUsage PendingWalStats = {0}; /* * WAL usage counters saved from pgWALUsage at the previous call to @@ -89,24 +89,25 @@ pgstat_flush_wal(bool nowait) * previous counters from the current ones. */ WalUsageAccumDiff(&diff, &pgWalUsage, &prevWalUsage); - PendingWalStats.wal_records = diff.wal_records; - PendingWalStats.wal_fpi = diff.wal_fpi; - PendingWalStats.wal_bytes = diff.wal_bytes; if (!nowait) LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE); else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE)) return true; -#define WALSTAT_ACC(fld) stats_shmem->stats.fld += PendingWalStats.fld - WALSTAT_ACC(wal_records); - WALSTAT_ACC(wal_fpi); - WALSTAT_ACC(wal_bytes); - WALSTAT_ACC(wal_buffers_full); - WALSTAT_ACC(wal_write); - WALSTAT_ACC(wal_sync); - WALSTAT_ACC(wal_write_time); - WALSTAT_
Re: psql: Add role's membership options to the \du+ command
Hello,

On the one hand, it would be nice to see the membership options with the psql
command. After playing with cf5eb37c and e5b8a4c0 I think something must be
done with the \du command.

postgres@demo(16.0)=# CREATE ROLE admin LOGIN CREATEROLE;
CREATE ROLE
postgres@demo(16.0)=# \c - admin
You are now connected to database "demo" as user "admin".
admin@demo(16.0)=> SET createrole_self_grant = 'SET, INHERIT';
SET
admin@demo(16.0)=> CREATE ROLE bob LOGIN;
CREATE ROLE
admin@demo(16.0)=> \du
                                     List of roles
 Role name |                         Attributes                          | Member of
-----------+-------------------------------------------------------------+-----------
 admin     | Create role                                                 | {bob,bob}
 bob       |                                                             | {}
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

We see two bob roles in the 'Member of' column. Strange? But this is correct.

admin@demo(16.0)=> select roleid::regrole, member::regrole, * from pg_auth_members where roleid = 'bob'::regrole;
 roleid | member |  oid  | roleid | member | grantor | admin_option | inherit_option | set_option
--------+--------+-------+--------+--------+---------+--------------+----------------+------------
 bob    | admin  | 16713 |  16712 |  16711 |      10 | t            | f              | f
 bob    | admin  | 16714 |  16712 |  16711 |   16711 | f            | t              | t
(2 rows)

The first 'GRANT bob TO admin' was issued by the superuser (grantor=10)
immediately after role bob was created. The second was issued by the admin
role and sets the membership options SET and INHERIT.

If we are not ready to display membership options with \du+, maybe we should
at least group the records in the 'Member of' column for the \du command?

- Pavel Luzanov
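For what it's worth, the information \du+ would need is already all in the
catalog; a query along these lines (just an illustration, not the proposed
psql query) shows one row per grant together with its options:

-- rough sketch: one line per membership grant, with its options spelled out
SELECT m.member::regrole  AS member,
       m.roleid::regrole  AS member_of,
       m.grantor::regrole AS grantor,
       concat_ws(', ',
                 CASE WHEN m.admin_option   THEN 'ADMIN'   END,
                 CASE WHEN m.inherit_option THEN 'INHERIT' END,
                 CASE WHEN m.set_option     THEN 'SET'     END) AS options
FROM pg_auth_members m
ORDER BY 1, 2;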
Re: Support logical replication of DDLs
On Fri, 17 Feb 2023 at 02:38, Jonathan S. Katz wrote: > > On 2/16/23 2:43 PM, Jonathan S. Katz wrote: > > On 2/16/23 2:38 PM, Alvaro Herrera wrote: > >> On 2023-Feb-16, Jonathan S. Katz wrote: > >> > >>> On 2/16/23 12:53 PM, Alvaro Herrera wrote: > >> > I don't think this is the fault of logical replication. Consider that > for the backend server, the function source code is just an opaque > string that is given to the plpgsql engine to interpret. So there's no > way for the logical DDL replication engine to turn this into runnable > code if the table name is not qualified. > >>> > >>> Sure, that's fair. That said, the example above would fall under a > >>> "typical > >>> use case", i.e. I'm replicating functions that call tables without > >>> schema > >>> qualification. This is pretty common, and as logical replication becomes > >>> used for more types of workloads (e.g. high availability), we'll > >>> definitely > >>> see this. > >> > >> Hmm, I think you're saying that replay should turn check_function_bodies > >> off, and I think I agree with that. > > > > Yes, exactly. +1 > > I drilled into this a bit more using the SQL standard bodies (BEGIN > ATOMIC) to see if there were any other behaviors we needed to account > for. Overall, it worked well but I ran into one issue. > > First, functions with "BEGIN ATOMIC" ignores "check_function_bodies" > which is by design based on how this feature works. We should still turn > "check_function_bodies" to "off" though, per above discussion. > > In the context of DDL replication, "BEGIN ATOMIC" does support > schema-unqualified functions, presumably because it includes the parsed > content? > > I created an updated example[1] where I converted the SQL functions to > use the standard syntax and I returned the table names to be schema > unqualified. 
This seemed to work, but I ran into a weird case with this > function: > > CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, > calendar_date date) > RETURNS void > LANGUAGE SQL > BEGIN ATOMIC > WITH delete_calendar AS ( > DELETE FROM calendar > WHERE > room_id = $1 AND > calendar_date = $2 > ) > INSERT INTO calendar (room_id, status, calendar_date, calendar_range) > SELECT $1, c.status, $2, c.calendar_range > FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; > END; > > This produced an error on the subscriber, with the following message: > > 2023-02-16 20:58:24.096 UTC [26864] ERROR: missing FROM-clause entry > for table "calendar_1" at character 322 > 2023-02-16 20:58:24.096 UTC [26864] CONTEXT: processing remote data for > replication origin "pg_18658" during message type "DDL" in transaction > 980, finished at 0/C099A7D8 > 2023-02-16 20:58:24.096 UTC [26864] STATEMENT: CREATE OR REPLACE > FUNCTION public.calendar_manage ( IN room_id pg_catalog.int4, IN > calendar_date pg_catalog.date ) RETURNS pg_catalog.void LANGUAGE sql > VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 > BEGIN ATOMIC > WITH delete_calendar AS ( > DELETE FROM public.calendar >WHERE ((calendar_1.room_id OPERATOR(pg_catalog.=) > calendar_manage.room_id) AND (calendar_1.calendar_date > OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) > ) > INSERT INTO public.calendar (room_id, status, calendar_date, > calendar_range) SELECT calendar_manage.room_id, > c.status, > calendar_manage.calendar_date, > c.calendar_range > FROM > public.calendar_generate_calendar(calendar_manage.room_id, > pg_catalog.tstzrange((calendar_manage.calendar_date)::timestamp with > time zone, ((calendar_manage.calendar_date OPERATOR(pg_catalog.+) > 1))::timestamp with time zone)) c(status, calendar_range); > END > > This seemed to add an additional, incorrect reference to the origin > table for the "room_id" and "calendar_date" attributes within the CTE of > this function. I don't know if this is directly related to the DDL > replication patch, but reporting it as I triggered the behavior through it. I had analyzed this issue and found that this issue exists with getting query definition, I could reproduce the issue with pg_dump and pg_get_function_sqlbody. I have started a new thread for this at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1MMntjmT_NJGp-Z%3DxbF02qHGAyuSHfYHias3TqQbPF2w%40mail.gmail.com Regards, Vignesh
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
On 2022-Dec-11, Amit Langote wrote: > While staring at the build_simple_rel() bit mentioned above, I > realized that this code fails to set userid correctly in the > inheritance parent rels that are child relations of subquery parent > relations, such as UNION ALL subqueries. In that case, instead of > copying the userid (= 0) of the parent rel, the child should look up > its own RTEPermissionInfo, which should be there, and use the > checkAsUser value from there. I've attached 0002 to fix this hole. I > am not sure whether there's a way to add a test case for this in the > core suite. I gave this a look and I thought it was clearer to have the new condition depend on rel->reloptkind instead parent or no. I tried a few things for a new test case, but I was unable to find anything useful. Maybe an intermediate view, I thought; no dice. Maybe one with a security barrier would do? Anyway, for now I just kept what you added in v2. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 423416f41247d5e4fa2e99de411ffa3b5e09cc8e Mon Sep 17 00:00:00 2001 From: amitlan Date: Wed, 18 Jan 2023 16:49:49 +0900 Subject: [PATCH v3] Correctly set userid of subquery rel's child rels For a RTE_SUBQUERY parent baserel's child relation that itself is a RTE_RELATION rel, build_simple_rel() should explicitly look up the latter's RTEPermissionInfo instead of copying the parent rel's userid, which would be 0 given that it's a subquery rel. --- src/backend/optimizer/util/relnode.c | 18 ++--- src/test/regress/expected/inherit.out | 39 +-- src/test/regress/sql/inherit.sql | 28 +-- 3 files changed, 77 insertions(+), 8 deletions(-) diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index a70a16238a..6c4550b90f 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -233,12 +233,22 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent) rel->serverid = InvalidOid; if (rte->rtekind == RTE_RELATION) { + Assert(parent == NULL || + parent->rtekind == RTE_RELATION || + parent->rtekind == RTE_SUBQUERY); + /* - * Get the userid from the relation's RTEPermissionInfo, though only - * the tables mentioned in query are assigned RTEPermissionInfos. - * Child relations (otherrels) simply use the parent's value. + * For any RELATION rte, we need a userid with which to check + * permission access. Baserels simply use their own RTEPermissionInfo's + * checkAsUser. + * + * For otherrels normally there's no RTEPermissionInfo, so we use the + * parent's, which normally has one. The exceptional case is that the + * parent is a subquery, in which case the otherrel will have its own. */ - if (parent == NULL) + if (rel->reloptkind == RELOPT_BASEREL || + (rel->reloptkind == RELOPT_OTHER_MEMBER_REL && + parent->rtekind == RTE_SUBQUERY)) { RTEPermissionInfo *perminfo; diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out index 2d49e765de..143271cd62 100644 --- a/src/test/regress/expected/inherit.out +++ b/src/test/regress/expected/inherit.out @@ -2512,9 +2512,44 @@ explain (costs off) (6 rows) reset session authorization; -revoke all on permtest_parent from regress_no_child_access; -drop role regress_no_child_access; +-- Check with a view over permtest_parent and a UNION ALL over the view, +-- especially that permtest_parent's permissions are checked with the role +-- owning the view as permtest_parent's RTE's checkAsUser. 
+create role regress_permtest_view_owner; +revoke all on permtest_grandchild from regress_permtest_view_owner; +grant select(a, c) on permtest_parent to regress_permtest_view_owner; +create view permtest_parent_view as + select a, c from permtest_parent; +alter view permtest_parent_view owner to regress_permtest_view_owner; +-- Like above, the 2nd arm of UNION ALL gets a hash join due to lack of access +-- to the expression index's stats +revoke select on permtest_parent_view from regress_no_child_access; +grant select(a,c) on permtest_parent_view to regress_no_child_access; +set session authorization regress_no_child_access; +explain (costs off) + select p1.a, p2.c from + (select * from permtest_parent_view + union all + select * from permtest_parent_view) p1 + inner join permtest_parent p2 on p1.a = p2.a and left(p1.c, 3) ~ 'a1$'; + QUERY PLAN +- + Hash Join + Hash Cond: (p2.a = permtest_parent.a) + -> Seq Scan on permtest_grandchild p2 + -> Hash + -> Append + -> Seq Scan on permtest_grandchild permtest_parent + Filter: ("left"(c, 3) ~ 'a1$'::text) + -> Seq Scan on permtest_grandchild permtest_parent_1 + Filter: ("left"(c, 3) ~ 'a1$'::text)
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
On 2023-Feb-17, Alvaro Herrera wrote:
> I tried a few things for a new test case, but I was unable to find
> anything useful.  Maybe an intermediate view, I thought; no dice.
> Maybe one with a security barrier would do?  Anyway, for now I just kept
> what you added in v2.

Sorry, I failed to keep count of the patch version correctly.  The test case
here is what you sent in v3 [1], and consequently the patch I just attached
should have been labelled v4.

[1] https://postgr.es/m/CA+HiwqF6ricH7HFCkyrK72c=KN-PRkdncxdLmU_mEQx=dra...@mail.gmail.com

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La vida es para el que se aventura"
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 14:40 +0900, Michael Paquier wrote: > Separate question: what's the state of the Windows installers provided > by the community regarding libicu? Is that embedded in the MSI? The EDB installer installs a quite old version of the ICU library for compatibility reasons, as far as I know. Yours, Laurenz Albe
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 5:22 AM, vignesh C wrote: Hi, The output sql generated by pg_dump for the below function refers to a modified table name: create table t1 (c1 int); create table t2 (c1 int); CREATE OR REPLACE FUNCTION test_fun(c1 int) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_t1 AS ( DELETE FROM t1 WHERE c1 = $1 ) INSERT INTO t1 (c1) SELECT $1 FROM t2; END; The below sql output created by pg_dump refers to t1_1 which should have been t1: CREATE FUNCTION public.test_fun(c1 integer) RETURNS void LANGUAGE sql BEGIN ATOMIC WITH delete_t1 AS ( DELETE FROM public.t1 WHERE (t1_1.c1 = test_fun.c1) ) INSERT INTO public.t1 (c1) SELECT test_fun.c1 FROM public.t2; END; pg_get_function_sqlbody also returns similar result: select proname, pg_get_function_sqlbody(oid) from pg_proc where proname = 'test_fun'; proname | pg_get_function_sqlbody --+--- test_fun | BEGIN ATOMIC + | WITH delete_t1 AS ( + | DELETE FROM t1 + |WHERE (t1_1.c1 = test_fun.c1) + | ) + | INSERT INTO t1 (c1) SELECT test_fun.c1+ | FROM t2; + | END (1 row) Thanks for reproducing and demonstrating that this was more generally applicable. For context, this was initially discovered when testing the DDL replication patch[1] under that context. I felt the problem here is with set_rtable_names function which changes the relation name t1 to t1_1 while parsing the statement: /* * If the selected name isn't unique, append digits to make it so, and * make a new hash entry for it once we've got a unique name. For a * very long input name, we might have to truncate to stay within * NAMEDATALEN. */ During the query generation we will set the table names before generating each statement, in our case the table t1 would have been added already to the hash table during the first insert statement generation. Next time it will try to set the relation names again for the next statement, i.e delete statement, if the entry with same name already exists, it will change the name to t1_1 by appending a digit to keep the has entry unique. Good catch. Do you have thoughts on how we can adjust the naming logic to handle cases like this? Jonathan [1] https://www.postgresql.org/message-id/e947fa21-24b2-f922-375a-d4f763ef3e4b%40postgresql.org OpenPGP_signature Description: OpenPGP digital signature
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On 16.02.23 22:29, Andres Freund wrote: What's the story behind 100_bugs.pl? This name clearly is copied from src/test/subscription/t/100_bugs.pl - but I've never understood why that is outside of the normal numbering space. Mainly to avoid awkwardness for backpatching. The number of tests in src/test/subscription/ varies quite a bit across branches.
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi, On 2/16/23 1:26 PM, Drouvot, Bertrand wrote: Hi, On 2/16/23 12:00 PM, Amit Kapila wrote: BTW, feel free to create the second patch (to align the types for variables/arguments) as that would be really helpful after we commit this one. Please find attached a patch proposal to do so. It looks like a Pandora's box as it produces those cascading changes: _hash_vacuum_one_page index_compute_xid_horizon_for_tuples gistprunepage PageIndexMultiDelete gistXLogDelete PageIndexMultiDelete spgRedoVacuumRedirect vacuumRedirectAndPlaceholder spgPageIndexMultiDelete moveLeafs doPickSplit _bt_delitems_vacuum btvacuumpage _bt_delitems_delete _bt_delitems_delete_check hash_xlog_move_page_contents gistvacuumpage gistXLogUpdate gistplacetopage hashbucketcleanup Which makes me: - wonder it is not too intrusive (we could reduce the scope and keep the PageIndexMultiDelete()'s nitems argument as an int for example). - worry if there is no side effects (like the one I'm mentioning as a comment in PageIndexMultiDelete()) even if it passes the CI tests. - wonder if we should not change MaxIndexTuplesPerPage from int to uint16 too (given the fact that the maximum block size is 32 KB. I'm sharing this version but I still need to think about it and I'm curious about your thoughts too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comFrom dfb3d5cae3163ebc9073cc762adec966f17e2f33 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 17 Feb 2023 11:30:21 + Subject: [PATCH v1] Change ndeletable in _hash_vacuum_one_page() from int to uint16 deletable in _hash_vacuum_one_page() is assigned to xl_hash_vacuum_one_page.ntuples which has been changed to uint16 in a previous commit. This commit ensures that the assigned value has the same type. Doing so produces those cascading changes: _hash_vacuum_one_page index_compute_xid_horizon_for_tuples gistprunepage PageIndexMultiDelete gistXLogDelete PageIndexMultiDelete spgRedoVacuumRedirect vacuumRedirectAndPlaceholder spgPageIndexMultiDelete moveLeafs doPickSplit _bt_delitems_vacuum btvacuumpage _bt_delitems_delete _bt_delitems_delete_check hash_xlog_move_page_contents gistvacuumpage gistXLogUpdate gistplacetopage hashbucketcleanup which also fixed others missmatch on the way. Also in some places the types have been changed from OffsetNumber to uint16. Even if at the time of this commit the OffsetNumber is defined as uint16, it's better to match the functions arguments types they are linked to. 
--- src/backend/access/gist/gist.c | 6 +++--- src/backend/access/gist/gistvacuum.c| 2 +- src/backend/access/gist/gistxlog.c | 4 ++-- src/backend/access/hash/hash.c | 2 +- src/backend/access/hash/hash_xlog.c | 8 src/backend/access/hash/hashinsert.c| 2 +- src/backend/access/index/genam.c| 2 +- src/backend/access/nbtree/nbtpage.c | 14 +++--- src/backend/access/nbtree/nbtree.c | 4 ++-- src/backend/access/spgist/spgdoinsert.c | 8 src/backend/access/spgist/spgvacuum.c | 4 ++-- src/backend/access/spgist/spgxlog.c | 2 +- src/backend/storage/page/bufpage.c | 12 src/include/access/genam.h | 2 +- src/include/access/gist_private.h | 4 ++-- src/include/access/nbtree.h | 4 ++-- src/include/access/spgist_private.h | 2 +- src/include/storage/bufpage.h | 2 +- 18 files changed, 44 insertions(+), 40 deletions(-) 14.8% src/backend/access/gist/ 11.6% src/backend/access/hash/ 25.0% src/backend/access/nbtree/ 11.2% src/backend/access/spgist/ 14.3% src/backend/storage/page/ 20.1% src/include/access/ diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index ba394f08f6..587fb04585 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -573,8 +573,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, { if (RelationNeedsWAL(rel)) { - OffsetNumber ndeloffs = 0, -
Re: The output sql generated by pg_dump for a create function refers to a modified table name
"Jonathan S. Katz" writes: > Good catch. Do you have thoughts on how we can adjust the naming logic > to handle cases like this? I think it's perfectly fine that ruleutils decided to use different aliases for the two different occurrences of "t1": the statement is quite confusing as written. The problem probably is that get_delete_query_def() has no idea that it's supposed to print the adjusted alias just after "DELETE FROM tab". UPDATE likely has same issue ... maybe INSERT too? regards, tom lane
Re: pg_upgrade and logical replication
Hi, On Fri, Feb 17, 2023 at 04:12:54PM +0530, Amit Kapila wrote: > On Fri, Feb 17, 2023 at 1:24 PM Julien Rouhaud wrote: > > > > An easy workaround that I tried is to allow something like > > > > ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) > > > > so that the refresh internally happens before the apply worker is started > > and > > you just keep consuming the delta, which works on naive scenario. > > > > One concern I have with this approach is that the default values for both > > "refresh" and "copy_data" for all other subcommands is "true, but we would > > probably need a different default value in that exact scenario (as we know > > we > > already have the data). I think that it would otherwise be safe in my very > > specific scenario, assuming that you created the slot beforehand and moved > > the > > slot's LSN at the promotion point, as even if you add non-empty tables to > > the > > publication you will only need the delta whether those were initially empty > > or > > not given your initial physical replica state. > > > > This point is not very clear. Why would one just need delta even for new > tables? Because in my scenario I'm coming from physical replication, so I know that I did replicate everything until the promotion LSN. Any table later added in the publication is either already fully replicated until that LSN on the upgraded node, so only the delta is needed, or has been created after that LSN. In the latter case, the entirety of the table will be replicated with the logical replication as a delta right? > > Any other scenario would make > > this new option dangerous, if not entirely useless, but not more than any of > > the current commands that lead to refreshing a subscription and have the > > same > > options I guess. > > > > All in all, currently the only way to somewhat safely resume logical > > replication after a pg_upgrade is to drop all the subscriptions that were > > transferred during pg_upgrade on all databases and recreate them (using the > > existing slots on the publisher side obviously), allowing the initial > > connection. But this approach only works in the exact scenario I mentioned > > (physical to logical replication, or at least a case where *all* the tables > > where logically replicated prior to the pg_ugprade), otherwise you have to > > recreate the follower node from scratch using logical repication. > > > > I think if you dropped and recreated the subscriptions by retaining > old slots, the replication should resume from where it left off before > the upgrade. Which scenario are you concerned about? I'm concerned about people not coming from physical replication. If you just had some "normal" logical replication, you can't assume that you already have all the data from the upstream subscription. If it was modified and a non empty table is added, you might need to copy the data of part of the tables and keep replicating for the rest. It's hard to be sure from a user point of view, and even if you knew you have no way to express it. > > Is that indeed the current behavior, or did I miss something? > > > > Is this "resume logical replication on pg_upgraded node" something we want > > to > > support better? I was thinking that we could add a new pg_dump mode (maybe > > only usable during pg_upgrade) that also restores the pg_subscription_rel > > content in each subscription or something like that. 
If not, should > > pg_upgrade > > keep preserving the subscriptions as it doesn't seem safe to use them, or at > > least document the hazards (I didn't find anything about it in the > > documentation)? > > > > > > There is a mention of this in pg_dump docs. See [1] (When dumping > logical replication subscriptions ...) Indeed, but it's barely saying "It is then up to the user to reactivate the subscriptions in a suitable way" and "It might also be appropriate to truncate the target tables before initiating a new full table copy". As I mentioned, I don't think there's a suitable way to reactivate the subscription, at least if you don't want to miss some records, so truncating all target tables is the only fully safe way to proceed. It seems quite silly to have to do so just because pg_upgrade doesn't retain the list of relation per subscription.
Re: pg_stat_statements and "IN" conditions
> On Wed, Feb 15, 2023 at 08:51:56AM +0100, David Geier wrote: > Hi, > > On 2/11/23 13:08, Dmitry Dolgov wrote: > > > On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote: > > > > > > The original version of the patch was doing all of this, i.e. handling > > > numerics, Param nodes, RTE_VALUES. The commentary about > > > find_const_walker in tests is referring to a part of that, that was > > > dealing with evaluation of expression to see if it could be reduced to a > > > constant. > > > > > > Unfortunately there was a significant push back from reviewers because > > > of those features. That's why I've reduced the patch to it's minimally > > > useful version, having in mind re-implementing them as follow-up patches > > > in the future. This is the reason as well why I left tests covering all > > > this missing functionality -- as breadcrumbs to already discovered > > > cases, important for the future extensions. > > I'd like to elaborate on this a bit and remind about the origins of the > > patch, as it's lost somewhere in the beginning of the thread. The idea > > is not pulled out of thin air, everything is coming from our attempts to > > improve one particular monitoring infrastructure in a real commercial > > setting. Every covered use case and test in the original proposal was a > > result of field trials, when some application-side library or ORM was > > responsible for gigabytes of data in pgss, chocking the monitoring agent. > > Thanks for the clarifications. I didn't mean to contend the usefulness of > the patch and I wasn't aware that you already jumped through the loops of > handling Param, etc. No worries, I just wanted to emphasize that we've already collected quite some number of use cases. > Seems like supporting only constants is a good starting > point. The only thing that is likely confusing for users is that NUMERICs > (and potentially constants of other types) are unsupported. Wouldn't it be > fairly simple to support them via something like the following? > > is_const(element) || (is_coercion(element) && is_const(element->child)) It definitely makes sense to implement that, although I don't think it's going to be acceptable to do that via directly listing conditions an element has to satisfy. It probably has to be more flexible, sice we would like to extend it in the future. My plan is to address this in a follow-up patch, when the main mechanism is approved. Would you agree with this approach?
Re: pg_stat_statements and "IN" conditions
> On Thu, Feb 09, 2023 at 08:43:29PM +0100, Dmitry Dolgov wrote: > > On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote: > > On 2023-Feb-09, Dmitry Dolgov wrote: > > > > > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote: > > > > > > What is the point of making this a numeric setting? Either you want > > > > to merge all values or you don't want to merge any values. > > > > > > At least in theory the definition of "too many constants" is different > > > for different use cases and I see allowing to configure it as a way of > > > reducing the level of surprise here. > > > > I was thinking about this a few days ago and I agree that we don't > > necessarily want to make it just a boolean thing; we may want to make it > > more complex. One trivial idea is to make it group entries in powers of > > 10: for 0-9 elements, you get one entry, and 10-99 you get a different > > one, and so on: > > > > # group everything in a single bucket > > const_merge_threshold = true / yes / on > > > > # group 0-9, 10-99, 100-999, 1000- > > const_merge_treshold = powers > > > > Ideally the value would be represented somehow in the query text. For > > example > > > > query| calls > > --+--- > > select * from test where i in ({... 0-9 entries ...})| 2 > > select * from test where i in ({... 10-99 entries ...}) | 1 > > > > What do you think? The jumble would have to know how to reduce all > > values within each power-of-ten group to one specific value, but I don't > > think that should be particularly difficult. > > Yeah, it sounds appealing and conveniently addresses the question of > losing the information about how many constants originally were there. > Not sure if the example above would be the most natural way to represent > it in the query text, but otherwise I'm going to try implementing this. > Stay tuned. It took me couple of evenings, here is what I've got: * The representation is not that far away from your proposal, I've settled on: SELECT * FROM test_merge WHERE id IN (... [10-99 entries]) * To not reinvent the wheel, I've reused decimalLenght function from numutils, hence one more patch to make it available to reuse. * This approach resolves my concerns about letting people tuning the behaviour of merging, as now it's possible to distinguish between calls with different number of constants up to the power of 10. So I've decided to simplify the configuration and make the guc boolean to turn it off or on. * To separate queries with constants falling into different ranges (10-99, 100-999, etc), the order of magnitude is added into the jumble hash. * I've incorporated feedback from Sergei and David, as well as tried to make comments and documentation more clear. Any feedback is welcomed, thanks! >From dbb4eab9f3efbcee2326278be6f70ff52685b2b0 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Fri, 17 Feb 2023 10:17:55 +0100 Subject: [PATCH v13 1/2] Reusable decimalLength functions Move out decimalLength functions to reuse in the following patch. 
--- src/backend/utils/adt/numutils.c | 50 +--- src/include/utils/numutils.h | 67 2 files changed, 68 insertions(+), 49 deletions(-) create mode 100644 src/include/utils/numutils.h diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 471fbb7ee6..df7418cce7 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -18,9 +18,8 @@ #include #include -#include "common/int.h" #include "utils/builtins.h" -#include "port/pg_bitutils.h" +#include "utils/numutils.h" /* * A table of all two-digit numbers. This is used to speed up decimal digit @@ -38,53 +37,6 @@ static const char DIGIT_TABLE[200] = "80" "81" "82" "83" "84" "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"; -/* - * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10 - */ -static inline int -decimalLength32(const uint32 v) -{ - int t; - static const uint32 PowersOfTen[] = { - 1, 10, 100, - 1000, 1, 10, - 100, 1000, 1, - 10 - }; - - /* - * Compute base-10 logarithm by dividing the base-2 logarithm by a - * good-enough approximation of the base-2 logarithm of 10 - */ - t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096; - return t + (v >= PowersOfTen[t]); -} - -static inline int -decimalLength64(const uint64 v) -{ - int t; - static const uint64 PowersOfTen[] = { - UINT64CONST(1), UINT64CONST(10), - UINT64CONST(100), UINT64CONST(1000), - UINT64CONST(1), UINT64CONST(10), - UINT64CONST(100), UINT64CONST(1000), - UINT64CONST(1), UINT64CONST(10), - UINT64CONST(100), UINT64CONST(1000), - UINT64CONST(1), UINT64CONST(10
Re: Possible false valgrind error reports
Thank you, I moved comment changes to 0001 and rewrote the fix through Min(). > The first hunk in 0001 doesn't seem quite right yet: > > * old allocation. > */ > #ifdef USE_VALGRIND > -if (oldsize > chunk->requested_size) > +if (size > chunk->requested_size && oldsize > chunk->requested_size) > VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + > chunk->requested_size, > oldsize - chunk->requested_size); > #endif > > If size < oldsize, aren't we still doing the wrong thing? Seems like > maybe it has to be like If size > chunk->requested_size than chksize >= oldsize and so we can mark this memory without worries. Region from size to chksize will be marked NOACCESS later anyway: /* Ensure any padding bytes are marked NOACCESS. */ VALGRIND_MAKE_MEM_NOACCESS((char *) pointer + size, chksize - size); I agree that it's not obvious, so I changed the first hunk like this: - if (oldsize > chunk->requested_size) + if (Min(size, oldsize) > chunk->requested_size) VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + chunk->requested_size, - oldsize - chunk->requested_size); + Min(size, oldsize) - chunk->requested_size); Any ideas on how to make this place easier to understand and comment above it concise and clear are welcome. There is another thing about this version. New line + Min(size, oldsize) - chunk->requested_size); is longer than 80 symbols and I don't know what's the best way to avoid this without making it look weird. I also noticed that if RANDOMIZE_ALLOCATED_MEMORY is defined then randomize_mem() have already marked this memory UNDEFINED. So we only "may need to adjust trailing bytes" if RANDOMIZE_ALLOCATED_MEMORY isn't defined. I reflected it in v2 of 0001 too. From 009ef9bbabcd71e0d5f2b60799c0b71d21fb9767 Mon Sep 17 00:00:00 2001 From: Karina Litskevich Date: Fri, 17 Feb 2023 15:32:05 +0300 Subject: [PATCH v2 2/2] Change variable name in AllocSetRealloc() --- src/backend/utils/mmgr/aset.c | 32 +--- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index ace4907ce9..9584d50b14 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -1112,7 +1112,7 @@ AllocSetRealloc(void *pointer, Size size) AllocBlock block; AllocSet set; MemoryChunk *chunk = PointerGetMemoryChunk(pointer); - Size oldsize; + Size oldchksize; int fidx; /* Allow access to private part of chunk header. */ @@ -1140,11 +1140,11 @@ AllocSetRealloc(void *pointer, Size size) set = block->aset; - oldsize = block->endptr - (char *) pointer; + oldchksize = block->endptr - (char *) pointer; #ifdef MEMORY_CONTEXT_CHECKING /* Test for someone scribbling on unused space in chunk */ - Assert(chunk->requested_size < oldsize); + Assert(chunk->requested_size < oldchksize); if (!sentinel_ok(pointer, chunk->requested_size)) elog(WARNING, "detected write past chunk end in %s %p", set->header.name, chunk); @@ -1197,15 +1197,15 @@ AllocSetRealloc(void *pointer, Size size) #else /* * If this is an increase, realloc() will have left any newly-allocated - * part (from oldsize to chksize) UNDEFINED, but we may need to adjust + * part (from oldchksize to chksize) UNDEFINED, but we may need to adjust * trailing bytes from the old allocation (from chunk->requested_size to - * oldsize) as they are marked NOACCESS. Make sure not to mark extra - * bytes in case chunk->requested_size < size < oldsize. + * oldchksize) as they are marked NOACCESS. Make sure not to mark extra + * bytes in case chunk->requested_size < size < oldchksize. 
*/ #ifdef USE_VALGRIND - if (Min(size, oldsize) > chunk->requested_size) + if (Min(size, oldchksize) > chunk->requested_size) VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + chunk->requested_size, - Min(size, oldsize) - chunk->requested_size); + Min(size, oldchksize) - chunk->requested_size); #endif #endif @@ -1223,7 +1223,7 @@ AllocSetRealloc(void *pointer, Size size) * portion DEFINED. Make sure not to mark memory behind the new * allocation in case it's smaller than old one. */ - VALGRIND_MAKE_MEM_DEFINED(pointer, Min(size, oldsize)); + VALGRIND_MAKE_MEM_DEFINED(pointer, Min(size, oldchksize)); #endif /* Ensure any padding bytes are marked NOACCESS. */ @@ -1248,11 +1248,11 @@ AllocSetRealloc(void *pointer, Size size) fidx = MemoryChunkGetValue(chunk); Assert(FreeListIdxIsValid(fidx)); - oldsize = GetChunkSizeFromFreeListIdx(fidx); + oldchksize = GetChunkSizeFromFreeListIdx(fidx); #ifdef MEMORY_CONTEXT_CHECKING /* Test for someone scribbling on unused space in chunk */ - if (chunk->requested_size < oldsize) + if (chunk->requested_size < oldchksize) if (!sentinel_ok(pointer, chunk->requested_size)) elog(WARNING, "detected write past chunk end in %s %p", set->header.name, c
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 10:09 AM, Tom Lane wrote: "Jonathan S. Katz" writes: Good catch. Do you have thoughts on how we can adjust the naming logic to handle cases like this? I think it's perfectly fine that ruleutils decided to use different aliases for the two different occurrences of "t1": the statement is quite confusing as written. Agreed on that -- while it's harder to set up, I do prefer the original example[1] to demonstrate this, as it shows the issue given it does not have those multiple occurrences, at least not within the same context, i.e.: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM calendar WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; the table prefixes on the attributes within the DELETE statement were ultimately mangled: WITH delete_calendar AS ( DELETE FROM public.calendar WHERE ((calendar_1.room_id OPERATOR(pg_catalog.=) calendar_manage.room_id) AND (calendar_1.calendar_date OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) ) INSERT INTO public.calendar (room_id, status, calendar_date, calendar_range) The problem probably is that get_delete_query_def() has no idea that it's supposed to print the adjusted alias just after "DELETE FROM tab". UPDATE likely has same issue ... maybe INSERT too? Maybe? I modified the function above to do an INSERT/UPDATE instead of a DELETE but I did not get any errors. However, if the logic is similar there could be an issue there. Thanks, Jonathan [1] https://www.postgresql.org/message-id/e947fa21-24b2-f922-375a-d4f763ef3e4b%40postgresql.org OpenPGP_signature Description: OpenPGP digital signature
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 11:19 AM, Jonathan S. Katz wrote: On 2/17/23 10:09 AM, Tom Lane wrote: Agreed on that -- while it's harder to set up, I do prefer the original example[1] to demonstrate this, as it shows the issue given it does not have those multiple occurrences, at least not within the same context, i.e.: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM calendar WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; The problem probably is that get_delete_query_def() has no idea that it's supposed to print the adjusted alias just after "DELETE FROM tab". UPDATE likely has same issue ... maybe INSERT too? Maybe? I modified the function above to do an INSERT/UPDATE instead of a DELETE but I did not get any errors. However, if the logic is similar there could be an issue there. I spoke too soon -- I was looking at the wrong logs. I did reproduce it with UPDATE, but not INSERT. The example I used for UPDATE: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH update_calendar AS ( UPDATE calendar SET room_id = $1 WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; which produced: WITH update_calendar AS ( UPDATE public.calendar SET room_id = calendar_manage.room_id WHERE ( (calendar_1.room_id OPERATOR(pg_catalog.=) calendar_manage.room_id) AND (calendar_1.calendar_date OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) ) INSERT INTO public.calendar (room_id, status, calendar_date, calendar_range) SELECT calendar_manage.room_id, c.status, calendar_manage.calendar_date, c.calendar_range FROM public.calendar_generate_calendar(calendar_manage.room_id, pg_catalog.tstzrange((calendar_manage.calendar_date)::timestamp with time zone, ((calendar_manage.calendar_date OPERATOR(pg_catalog.+) 1))::timestamp with time zone)) c(status, calendar_range); Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
pg_init_privs corruption.
Hi hackers! Recently we faced a problem with one of our production clusters. The problem was with pg_upgrade; the cause was an invalid pg_dump of the cluster schema. The pg_dump SQL contained strange records like

REVOKE SELECT,INSERT,DELETE,UPDATE ON TABLE *relation* FROM "144841";

but there is no role "144841". We dug in, and it turns out that 144841 was the OID of a previously deleted role. I have reproduced the issue using a simple test extension, yoext [1].

SQL script:

create role user1;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT select ON TABLES TO user1;
create extension yoext;
drop owned by user1;
select * from pg_init_privs where privtype = 'e';
drop role user1;
select * from pg_init_privs where privtype = 'e';

Result of execution (executed on fresh master from commit 17feb6a566b77bf62ca453dec215adcc71755c20):

psql (16devel) Type "help" for help.
postgres=# create role user1;
CREATE ROLE
postgres=# ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT select ON TABLES TO user1;
ALTER DEFAULT PRIVILEGES
postgres=# create extension yobaext ;
CREATE EXTENSION
postgres=# drop owned by user1;
DROP OWNED
postgres=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype | initprivs
+--+--+--+---
 16387 | 1259 |0 | e| {reshke=arwdDxtm/reshke,user1=r/reshke,=r/reshke}
(1 row)
postgres=# drop role user1;
DROP ROLE
postgres=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype | initprivs
+--+--+--+---
 16387 | 1259 |0 | e| {reshke=arwdDxtm/reshke,16384=r/reshke,=r/reshke}
(1 row)

As you can see, after dropping the role there are invalid records in the pg_init_privs system relation. After this, pg_dump generates SQL statements, some of which are based on the content of pg_init_privs, resulting in an invalid dump.

PFA a fix. The idea of the fix is simply to drop records from pg_init_privs while dropping a role: records with grantor or grantee equal to the OID of the dropped role are erased. After that, pg_dump works OK.

Implementation comment: I failed to find a proper way to allocate the ACL array, so I exposed the acl.c-internal function `allocacl` in a header. This needs to be improved somehow.

[1] yoext https://github.com/reshke/yoext/ v1-0001-Fix-pg_init_prevs-corruption.patch Description: Binary data
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Fri, Feb 17, 2023 at 12:03 AM David Rowley wrote: > On Fri, 17 Feb 2023 at 17:40, Jonah H. Harris > wrote: > > Yeah. There’s definitely a smarter and more reusable approach than I was > proposing. A lot of that code is fairly mature and I figured more people > wouldn’t want to alter it in such ways - but I’m up for it if an approach > like this is the direction we’d want to go in. > > Having something agnostic to if it's allocating a new context or > adding a block to an existing one seems like a good idea to me. > I like this idea. > I think the tricky part will be the discussion around which size > classes to keep around and in which cases can we use a larger > allocation without worrying too much that it'll be wasted. We also > don't really want to make the minimum memory that a backend can keep > around too bad. Patches such as [1] are trying to reduce that. Maybe > we can just keep a handful of blocks of 1KB, 8KB and 16KB around, or > more accurately put, ALLOCSET_SMALL_INITSIZE, > ALLOCSET_DEFAULT_INITSIZE and ALLOCSET_DEFAULT_INITSIZE * 2, so that > it works correctly if someone adjusts those definitions. > Per that patch and the general idea, what do you think of either: 1. A single GUC, something like backend_keep_mem, that represents the cached memory we'd retain rather than send directly to free()? 2. Multiple GUCs, one per block size? While #2 would give more granularity, I'm not sure it would necessarily be needed. The main issue I'd see in that case would be the selection approach to block sizes to keep given a fixed amount of keep memory. We'd generally want the majority of the next queries to make use of it as best as possible, so we'd either need each size to be equally represented or some heuristic. I don't really like #2, but threw it out there :) I think you'll want to look at what the maximum memory a backend can > keep around in context_freelists[] and not make the worst-case memory > consumption worse than it is today. > Agreed. > I imagine this would be some new .c file in src/backend/utils/mmgr > which aset.c, generation.c and slab.c each call a function from to see > if we have any cached blocks of that size. You'd want to call that in > all places we call malloc() from those files apart from when aset.c > and generation.c malloc() for a dedicated block. You can probably get > away with replacing all of the free() calls with a call to another > function where you pass the pointer and the size of the block to have > it decide if it's going to free() it or cache it. Agreed. I would see this as practically just a generic allocator free-list; is that how you view it also? > I doubt you need to care too much if the block is from a dedicated > allocation or a normal > block. We'd just always free() if it's not in the size classes that > we care about. > Agreed. -- Jonah H. Harris
Re: psql: Add role's membership options to the \du+ command
On Fri, Feb 17, 2023 at 4:02 AM Pavel Luzanov wrote: >List of roles > Role name | Attributes | > Member of > > ---++--- > admin | Create role| > {bob,bob} > bob || > {} > postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | > {} > > First 'grant bob to admin' command issued immediately after creating role > bob by superuser(grantor=10). Second command issues by admin role and set > membership options SET and INHERIT. > If we don't ready to display membership options with \du+ may be at least > we must group records in 'Member of' column for \du command? > I agree that these views should GROUP BY roleid and use bool_or(*_option) to produce their result. Their purpose is to communicate the current effective state to the user, not facilitate full inspection of the configuration, possibly to aid in issuing GRANT and REVOKE commands. One thing I found, and I plan to bring this up independently once I've collected my thoughts, is that pg_has_role() uses the terminology "USAGE" and "MEMBER" for "INHERIT" and "SET" respectively. It's annoying that "member" has been overloaded here. And the choice of USAGE just seems arbitrary (though I haven't researched it) given the related syntax. https://www.postgresql.org/docs/15/functions-info.html
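Under the hood, the grouping being suggested amounts to something like the following sketch (the inherit_option and set_option column names are assumed from the current PG16 development catalog, so treat the exact spellings as provisional):

SELECT m.member::regrole           AS role_name,
       m.roleid::regrole           AS member_of,
       bool_or(m.admin_option)     AS admin_option,
       bool_or(m.inherit_option)   AS inherit_option,
       bool_or(m.set_option)       AS set_option
FROM pg_auth_members m
GROUP BY m.member, m.roleid;

That collapses the duplicated {bob,bob} entries into one row per grantee/role pair while still showing whether any of the underlying grants carries each option.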
Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()
On Thu, Feb 16, 2023 at 10:59 PM Michael Paquier wrote: > I am adding Stephen Frost > in CC to see if he has any comments about all this part of the logic > with gssencmode. Sounds good. > I agree that > PQconnectPoll() has grown beyond the point of making it easy to > maintain. I am wondering which approach we could take when it comes > to simplify something like that. Attempting to reduce the number of > flags stored in PGconn would be one. The second may be to split the > internal logic into more functions, for each state we are going > through? The first may lead to an even cleaner logic for the second > point. Yeah, a mixture of both might be helpful -- the first to reduce the inputs to the state machine; the second to reduce interdependencies between cases, the distance of the potential goto jumps, and the number of state machine outputs. (When cases are heavily dependent on each other, probably best to handle them in the same function?) Luckily it looks like the current machine is usually linear. Thanks, --Jacob
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > I am saying that pg_upgrade should be able to deal with the > > difference. The > > details of how to implement that, don't matter that much. > > To clarify, you're saying that pg_upgrade should simply update > pg_database to set the new databases' collation fields equal to that > of > the old cluster? Thinking about this more, it's not clear to me if this would be in scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster rather than checking for compatibility, why doesn't it just take over and do the initdb for the new cluster itself? That would be less confusing for users, and avoid some weirdness (like, if you drop the database "postgres" on the original, why does it reappear after an upgrade?). Someone might want to do something interesting to the new cluster before the upgrade, but it's not clear from the docs what would be both useful and safe. Regards, Jeff Davis
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-17 09:01:54 -0800, Jeff Davis wrote: > On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > > I am saying that pg_upgrade should be able to deal with the > > > difference. The > > > details of how to implement that, don't matter that much. > > > > To clarify, you're saying that pg_upgrade should simply update > > pg_database to set the new databases' collation fields equal to that > > of > > the old cluster? Yes. > Thinking about this more, it's not clear to me if this would be in > scope for pg_upgrade or not. I don't think we should consider changing the default collation provider without making this more seamless, one way or another. > If pg_upgrade is fixing up the new cluster rather than checking for > compatibility, why doesn't it just take over and do the initdb for the new > cluster itself? That would be less confusing for users, and avoid some > weirdness (like, if you drop the database "postgres" on the original, why > does it reappear after an upgrade?). I've wondered about that as well. There are some initdb-time options you can set, but pg_upgrade could forward those. Greetings, Andres Freund
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 6:02 PM Jeff Davis wrote: > On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > > I am saying that pg_upgrade should be able to deal with the > > > difference. The > > > details of how to implement that, don't matter that much. > > > > To clarify, you're saying that pg_upgrade should simply update > > pg_database to set the new databases' collation fields equal to that > > of > > the old cluster? > > Thinking about this more, it's not clear to me if this would be in > scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster > rather than checking for compatibility, why doesn't it just take over > and do the initdb for the new cluster itself? That would be less > confusing for users, and avoid some weirdness (like, if you drop the > database "postgres" on the original, why does it reappear after an > upgrade?). > > Someone might want to do something interesting to the new cluster > before the upgrade, but it's not clear from the docs what would be both > useful and safe. >

Today I tested ICU for Czech sorting. It is a little slower, though not by much, but it produces partially different results:

select row_number() over (order by nazev collate "cs-x-icu"), nazev from obce
except
select row_number() over (order by nazev collate "cs_CZ"), nazev from obce;

returns a non-empty set. So, at least for the Czech collation, an index rebuild is necessary.

Regards Pavel

> > Regards, > Jeff Davis > > > >
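The flip side of that is that after moving a database's default from libc to ICU (however pg_upgrade ends up handling the transfer), any index on affected text columns has to be rebuilt before it can be trusted, e.g. something along the lines of (the database name is only an example):

REINDEX DATABASE appdb;
-- or, more selectively, REINDEX only the indexes on collatable columns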
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-17 16:19:46 +0900, Michael Paquier wrote: > But it looks like I misunderstood what this quote meant compared to > what v3 does. It is true that v3 sets iov_len and iov_base more than > needed when writing sizes larger than BLCKSZ. I don't think it does for writes larger than BLCKSZ, it just does more for writes larger than PG_IOV_MAX * BLCKSZ. But in those cases CPU time is going to be spent elsewhere. > Seems like you think that it is not really going to matter much to track > which iovecs have been already initialized during the first loop on > pg_pwritev_with_retry() to keep the code shorter? Yes. I'd bet that, in the unlikely case you're going to see any difference at all, unconditionally initializing is going to win. Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every call. Even if we could do some further optimizations of what I did in the patch, you can initialize only the needed IOVs a *lot* of times before it shows up... I'm inclined to go with my version, with the argument order swapped to Bharath's order. Greetings, Andres Freund
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi, On 2023-02-17 17:26:20 +1300, David Rowley wrote: > I didn't hear it mentioned explicitly here, but I suspect it's faster > when increasing the initial size due to the memory context caching > code that reuses aset MemoryContexts (see context_freelists[] in > aset.c). Since we reset the context before caching it, then it'll > remain fast when we can reuse a context, provided we don't need to do > a malloc for an additional block beyond the initial block that's kept > in the cache. I'm not so sure this is the case. Which is one of the reasons I'd really like to see a) memory context stats for executor context b) a CPU profile of the problem c) a reproducer. Jonah, did you just increase the initial size, or did you potentially also increase the maximum block size? And did you increase ALLOCSET_DEFAULT_INITSIZE everywhere, or just passed a larger block size in CreateExecutorState()? If the latter,the context freelist wouldn't even come into play. A 8MB max block size is pretty darn small if you have a workload that ends up with a gigabytes worth of blocks. And the problem also could just be that the default initial blocks size takes too long to ramp up to a reasonable block size. I think it's 20 blocks to get from ALLOCSET_DEFAULT_INITSIZE to ALLOCSET_DEFAULT_MAXSIZE. Even if you allocate a good bit more than 8MB, having to additionally go through 20 smaller chunks is going to be noticable until you reach a good bit higher number of blocks. > Maybe we should think of a more general-purpose way of doing this > caching which just keeps a global-to-the-process dclist of blocks > laying around. We could see if we have any free blocks both when > creating the context and also when we need to allocate another block. Not so sure about that. I suspect the problem could just as well be the maximum block size, leading to too many blocks being allocated. Perhaps we should scale that to a certain fraction of work_mem, by default? Either way, I don't think we should go too deep without some data, too likely to miss the actual problem. > I see no reason why this couldn't be shared among the other context > types rather than keeping this cache stuff specific to aset.c. slab.c > might need to be pickier if the size isn't exactly what it needs, but > generation.c should be able to make use of it the same as aset.c > could. I'm unsure what'd we'd need in the way of size classing for > this, but I suspect we'd need to pay attention to that rather than do > things like hand over 16MBs of memory to some context that only wants > a 1KB initial block. Possible. I can see something like a generic "free block" allocator being useful. Potentially with allocating the underlying memory with larger mmap()s than we need for individual blocks. Random note: I wonder if we should having a bitmap (in an int) in front of aset's freelist. In a lot of cases we incur plenty cache misses, just to find the freelist bucket empty. Greetings, Andres Freund
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 09:05 -0800, Andres Freund wrote: > > Thinking about this more, it's not clear to me if this would be in > > scope for pg_upgrade or not. > > I don't think we should consider changing the default collation > provider > without making this more seamless, one way or another. I guess I'm fine hacking pg_upgrade, but I think I'd like to make it conditional on this specific case: only perform the fixup if the old cluster is 15 or earlier and using libc and the newer cluster is 16 or later and using icu. There's already a check that the new cluster is empty, so I think it's safe to hack the pg_database locale fields. Regards, Jeff Davis >
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-17 10:00:41 -0800, Jeff Davis wrote: > I guess I'm fine hacking pg_upgrade, but I think I'd like to make it > conditional on this specific case: only perform the fixup if the old > cluster is 15 or earlier and using libc and the newer cluster is 16 or > later and using icu. -1. That's just going to cause pain one major version upgrade further down the line. Why would we want to incur that pain? > There's already a check that the new cluster is empty, so I think it's > safe to hack the pg_database locale fields. I don't think we need to, we do issue the CREATE DATABASEs. So we just need to make sure that includes the collation provider info, and the proper template database, in pg_upgrade mode. Greetings, Andres Freund
Re: The output sql generated by pg_dump for a create function refers to a modified table name
"Jonathan S. Katz" writes: > I spoke too soon -- I was looking at the wrong logs. I did reproduce it > with UPDATE, but not INSERT. It can be reproduced with INSERT too, on the same principle as the others: put the DML command inside a WITH, and give it an alias conflicting with the outer query. Being a lazy sort, I tried to collapse all three cases into a single test case, and observed something I hadn't thought of: we disambiguate aliases in a WITH query with respect to the outer query, but not with respect to other WITH queries. This makes the example (see attached) a bit more confusing than I would have hoped. However, the same sort of thing happens within other kinds of nested subqueries, so I think it's probably all right as-is. In any case, changing this aspect would require a significantly bigger patch with more risk of unwanted side-effects. To fix it, I pulled out the print-an-alias logic within get_from_clause_item and called that new function for INSERT/UPDATE/DELETE. This is a bit of overkill perhaps, because only the RTE_RELATION case can be needed by these other callers, but it seemed like a sane refactorization. I've not tested, but I imagine this will need patched all the way back. The rule case should be reachable in all supported versions. regards, tom lane diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 9ac42efdbc..6dc117dea8 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -475,6 +475,8 @@ static void get_from_clause(Query *query, const char *prefix, deparse_context *context); static void get_from_clause_item(Node *jtnode, Query *query, deparse_context *context); +static void get_rte_alias(RangeTblEntry *rte, int varno, bool use_as, + deparse_context *context); static void get_column_alias_list(deparse_columns *colinfo, deparse_context *context); static void get_from_clause_coldeflist(RangeTblFunction *rtfunc, @@ -6648,12 +6650,14 @@ get_insert_query_def(Query *query, deparse_context *context, context->indentLevel += PRETTYINDENT_STD; appendStringInfoChar(buf, ' '); } - appendStringInfo(buf, "INSERT INTO %s ", + appendStringInfo(buf, "INSERT INTO %s", generate_relation_name(rte->relid, NIL)); - /* INSERT requires AS keyword for target alias */ - if (rte->alias != NULL) - appendStringInfo(buf, "AS %s ", - quote_identifier(rte->alias->aliasname)); + + /* Print the relation alias, if needed; INSERT requires explicit AS */ + get_rte_alias(rte, query->resultRelation, true, context); + + /* always want a space here */ + appendStringInfoChar(buf, ' '); /* * Add the insert-column-names list. 
Any indirection decoration needed on @@ -6835,9 +6839,10 @@ get_update_query_def(Query *query, deparse_context *context, appendStringInfo(buf, "UPDATE %s%s", only_marker(rte), generate_relation_name(rte->relid, NIL)); - if (rte->alias != NULL) - appendStringInfo(buf, " %s", - quote_identifier(rte->alias->aliasname)); + + /* Print the relation alias, if needed */ + get_rte_alias(rte, query->resultRelation, false, context); + appendStringInfoString(buf, " SET "); /* Deparse targetlist */ @@ -7043,9 +7048,9 @@ get_delete_query_def(Query *query, deparse_context *context, appendStringInfo(buf, "DELETE FROM %s%s", only_marker(rte), generate_relation_name(rte->relid, NIL)); - if (rte->alias != NULL) - appendStringInfo(buf, " %s", - quote_identifier(rte->alias->aliasname)); + + /* Print the relation alias, if needed */ + get_rte_alias(rte, query->resultRelation, false, context); /* Add the USING clause if given */ get_from_clause(query, " USING ", context); @@ -10887,10 +10892,8 @@ get_from_clause_item(Node *jtnode, Query *query, deparse_context *context) { int varno = ((RangeTblRef *) jtnode)->rtindex; RangeTblEntry *rte = rt_fetch(varno, query->rtable); - char *refname = get_rtable_name(varno, context); deparse_columns *colinfo = deparse_columns_fetch(varno, dpns); RangeTblFunction *rtfunc1 = NULL; - bool printalias; if (rte->lateral) appendStringInfoString(buf, "LATERAL "); @@ -11027,59 +11030,7 @@ get_from_clause_item(Node *jtnode, Query *query, deparse_context *context) } /* Print the relation alias, if needed */ - printalias = false; - if (rte->alias != NULL) - { - /* Always print alias if user provided one */ - printalias = true; - } - else if (colinfo->printaliases) - { - /* Always print alias if we need to print column aliases */ - printalias = true; - } - else if (rte->rtekind == RTE_RELATION) - { - /* - * No need to print alias if it's same as relation name (this - * would normally be the case, but not if set_rtable_names had to - * resolve a conflict). - */ - if (strcmp(refname, get_relation_name(rte->relid)) != 0) -printalias = true; - } - else if (rte->rtekind == RTE_FUNCTION) - { -
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 09:01:54AM -0800, Jeff Davis wrote: > On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > > I am saying that pg_upgrade should be able to deal with the > > > difference. The > > > details of how to implement that, don't matter that much. > > > > To clarify, you're saying that pg_upgrade should simply update > > pg_database to set the new databases' collation fields equal to that > > of > > the old cluster? > > Thinking about this more, it's not clear to me if this would be in > scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster > rather than checking for compatibility, why doesn't it just take over > and do the initdb for the new cluster itself? That would be less > confusing for users, and avoid some weirdness (like, if you drop the > database "postgres" on the original, why does it reappear after an > upgrade?). > > Someone might want to do something interesting to the new cluster > before the upgrade, but it's not clear from the docs what would be both > useful and safe. This came up before - I'm of the opinion that it's unsupported and/or useless to try to do anything on the new cluster between initdb and pg_upgrade. https://www.postgresql.org/message-id/20220707184410.gb13...@telsasoft.com https://www.postgresql.org/message-id/20220905170322.gm31...@telsasoft.com -- Justin
Re: pg_init_privs corruption.
Kirill Reshke writes: > As you can see, after drop role there is invalid records in pg_init_privs > system relation. After this, pg_dump generate sql statements, some of which > are based on content of pg_init_privs, resulting in invalid dump. Ugh. > PFA fix. I don't think this is anywhere near usable as-is, because it only accounts for pg_init_privs entries in the current database. We need to handle these records in the DROP OWNED BY mechanism instead, and also ensure there are shared-dependency entries for them so that the role can't be dropped until the entries are gone in all DBs. The real problem may be less that DROP is doing the wrong thing, and more that creation of the pg_init_privs entries neglects to record a dependency. regards, tom lane
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 1:18 PM, Tom Lane wrote: It can be reproduced with INSERT too, on the same principle as the others: put the DML command inside a WITH, and give it an alias conflicting with the outer query. Ah, I see based on your example below. I did not alias the INSERT statement in the way (and I don't know how common of a pattern it is to o that). Being a lazy sort, I tried to collapse all three cases into a single test case, and observed something I hadn't thought of: we disambiguate aliases in a WITH query with respect to the outer query, but not with respect to other WITH queries. This makes the example (see attached) a bit more confusing than I would have hoped. However, the same sort of thing happens within other kinds of nested subqueries, so I think it's probably all right as-is. In any case, changing this aspect would require a significantly bigger patch with more risk of unwanted side-effects. I think I agree. Most people should not be looking at the disambiguated statements unless they are troubleshooting an issue (such as $SUBJECT). The main goal is to disambiguate correctly. To fix it, I pulled out the print-an-alias logic within get_from_clause_item and called that new function for INSERT/UPDATE/DELETE. This is a bit of overkill perhaps, because only the RTE_RELATION case can be needed by these other callers, but it seemed like a sane refactorization. I've not tested, but I imagine this will need patched all the way back. The rule case should be reachable in all supported versions. I tested this against HEAD (+v69 of the DDL replication patch). My cases are now all passing. The code looks good to me -- I don't know if moving that logic is overkill, but it makes the solution relatively clean. I didn't test in any back branches yet, but given this can generate an invalid function body, it does likely need to be backpatched. Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH] Add pretty-printed XML output option
On 17.02.23 01:08, Andrey Borodin wrote: On Thu, Feb 16, 2023 at 2:12 PM Jim Jones wrote: I've looked into the patch. The code looks to conform to usual expectations. One nit: this comment should have just one asterisk. + /** Thanks for reviewing! Asterisk removed in v14. And I have a dumb question: is this function protected from using external XML namespaces? What if the user passes some xmlns that will force it to read namespace data from the server filesystem? Or is it not possible? I see there are a lot of calls to xml_parse() anyway, but still... According to the documentation,[1] such validations are not supported. /"The |xml| type does not validate input values against a document type declaration (DTD), even when the input value specifies a DTD. There is also currently no built-in support for validating against other XML schema languages such as XML Schema."/ But I'll have a look at the code to be sure :) Best, Jim 1- https://www.postgresql.org/docs/15/datatype-xml.html From 44825f436e9c8f06a9bea3ed5966ef73bab208a9 Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Thu, 16 Feb 2023 22:36:19 +0100 Subject: [PATCH v14] Add pretty-printed XML output option This small patch introduces a XML pretty print function. It basically takes advantage of the indentation feature of xmlDocDumpFormatMemory from libxml2 to format XML strings. --- doc/src/sgml/func.sgml | 34 ++ src/backend/utils/adt/xml.c | 44 src/include/catalog/pg_proc.dat | 3 + src/test/regress/expected/xml.out | 101 src/test/regress/expected/xml_1.out | 53 +++ src/test/regress/expected/xml_2.out | 101 src/test/regress/sql/xml.sql| 33 + 7 files changed, 369 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e09e289a43..a621192425 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14861,6 +14861,40 @@ SELECT xmltable.* ]]> + + +xmlformat + + + xmlformat + + + +xmlformat ( xml ) xml + + + + Converts the given XML value to pretty-printed, indented text. 
+ + + + Example: + + + + diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 079bcb1208..e96cbf65a7 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -473,6 +473,50 @@ xmlBuffer_to_xmltype(xmlBufferPtr buf) } #endif +Datum +xmlformat(PG_FUNCTION_ARGS) +{ +#ifdef USE_LIBXML + + xmlDocPtr doc; + xmlChar*xmlbuf = NULL; + text *arg = PG_GETARG_TEXT_PP(0); + StringInfoData buf; + int nbytes; + + doc = xml_parse(arg, XMLOPTION_DOCUMENT, false, GetDatabaseEncoding(), NULL); + + if(!doc) + elog(ERROR, "could not parse the given XML document"); + + /* + * xmlDocDumpFormatMemory ( + * xmlDocPtr doc, # the XML document + * xmlChar **xmlbuf, # the memory pointer + * int *nbytes, # the memory length + * int format # 1 = node indenting) + */ + + xmlDocDumpFormatMemory(doc, &xmlbuf, &nbytes, 1); + + xmlFreeDoc(doc); + + if(!nbytes) + elog(ERROR, "could not indent the given XML document"); + + initStringInfo(&buf); + appendStringInfoString(&buf, (const char *)xmlbuf); + + xmlFree(xmlbuf); + + PG_RETURN_XML_P(stringinfo_to_xmltype(&buf)); + +#else + NO_XML_SUPPORT(); +return 0; +#endif +} + Datum xmlcomment(PG_FUNCTION_ARGS) diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index c0f2a8a77c..54e8a6262a 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -8842,6 +8842,9 @@ { oid => '3053', descr => 'determine if a string is well formed XML content', proname => 'xml_is_well_formed_content', prorettype => 'bool', proargtypes => 'text', prosrc => 'xml_is_well_formed_content' }, +{ oid => '4642', descr => 'Indented text from xml', + proname => 'xmlformat', prorettype => 'xml', + proargtypes => 'xml', prosrc => 'xmlformat' }, # json { oid => '321', descr => 'I/O', diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out index 3c357a9c7e..e45116aaa7 100644 --- a/src/test/regress/expected/xml.out +++ b/src/test/regress/expected/xml.out @@ -1599,3 +1599,104 @@ SELECT * FROM XMLTABLE('.' PASSING XMLELEMENT(NAME a) columns a varchar(20) PATH |(1 row) +-- XML format: single line XML string +SELECT xmlformat('Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650'); +xmlformat +-- + + + + + Belgian Waffles
RE: pg_init_privs corruption.
> Kirill Reshke writes: > > As you can see, after drop role there is invalid records in > > pg_init_privs system relation. After this, pg_dump generate sql > > statements, some of which are based on content of pg_init_privs, resulting > in invalid dump. > This is as far as I can see the same case as what I reported a few years ago here: https://www.postgresql.org/message-id/flat/1574068566573.13088%40Optiver.com#488bd647ce6f5d2c92764673a7c58289 There was a discussion with some options, but no fix back then. -Floris
Re: recovery modules
Hi, On 2023-02-16 13:58:10 -0800, Nathan Bossart wrote: > On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: > > I'm quite baffled by: > > /* Close any files left open by copy_file() or compare_files() > > */ > > AtEOSubXact_Files(false, InvalidSubTransactionId, > > InvalidSubTransactionId); > > > > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() > > completely outside the context of a transaction environment. And it only > > does > > the thing you want because you pass parameters that aren't actually valid in > > the normal use in AtEOSubXact_Files(). I really don't understand how that's > > supposed to be ok. > > Hm. Should copy_file() and compare_files() have PG_FINALLY blocks that > attempt to close the files instead? What would you recommend? I don't fully know; it's not entirely clear to me what the goals here were. I think you'd likely need to do a bit of infrastructure work to do this sanely. So far we just haven't had the need to release files in the way you want to do there. I suspect a good direction would be to use resource owners. Add a separate set of functions that release files on resource owner release. Most of the infrastructure is there already, for temporary files (c.f. OpenTemporaryFile()). Then that resource owner could be reset in case of error. I'm not even sure that erroring out is a reasonable way to implement copy_file(), compare_files(), particularly because you want to return via a return code from basic_archive_files(). Greetings, Andres Freund
Re: wrong query result due to wang plan
Richard Guo writes: > It occurs to me that we can leverage JoinDomain to tell if we are below > the nullable side of a higher-level outer join if the clause is not an > outer-join clause. If we are belew outer join, the current JoinDomain > is supposed to be a proper subset of top JoinDomain. Otherwise the > current JoinDomain must be equal to top JoinDomain. And that leads to a > fix as code changes below. That doesn't look right at all: surely this situation can occur further down in a join tree, not only just below top level. I thought about this some more and realized that the whole business of trying to push a qual up the join tree is probably obsolete. In the first place, there's more than one problem with assigning the ON FALSE condition the required_relids {t2, t3, t4, t2/t4}. As you say, it causes bogus conclusions about which joinrel can be considered empty if we commute the outer joins' order. But also, doing it this way loses the information that t2/t3 can be considered empty, if we do the joins in an order where that is useful to know. In the second place, I think recording the info that t2/t3 is empty is probably sufficient now, because of the mark_dummy_rel/is_dummy_rel bookkeeping in joinrels.c (which did not exist when we first added this "push to top of tree" hack). If we know t2/t3 is empty then we will propagate that knowledge to {t2, t3, t4, t2/t4} when it's formed, without needing to have a clause that can be applied at that join level. So that leads to a conclusion that we could just forget the whole thing and always use the syntactic qualscope here. I tried that and it doesn't break any existing regression tests, which admittedly doesn't prove a lot in this area :-(. However, we'd then need to decide what to do in process_implied_equality, which is annotated as * "qualscope" is the nominal syntactic level to impute to the restrictinfo. * This must contain at least all the rels used in the expressions, but it * is used only to set the qual application level when both exprs are * variable-free. (Hence, it should usually match the join domain in which * the clause applies.) and indeed equivclass.c is passing a join domain relid set. It's not hard to demonstrate that that path is also broken, for instance explain (costs off) select * from int8_tbl t1 left join (int8_tbl t2 inner join int8_tbl t3 on (t2.q1-t3.q2) = 0 and (t2.q1-t3.q2) = 1 left join int8_tbl t4 on t2.q2 = t4.q2) on t1.q1 = t2.q1; One idea I played with is that we could take the join domain relid set and subtract off the OJ relid and RHS relids of any fully-contained SpecialJoinInfos, reasoning that we can reconstruct the fact that those won't make the join domain's overall result nondummy, and thereby avoiding the possibility of confusion if we end up commuting with one of those joins. This feels perhaps overly complicated, though, compared to the brain-dead-simple "use the syntactic scope" approach. Maybe we can get equivclass.c to do something equivalent to that? Or maybe we only need to use this complex rule in process_implied_equality? (I'm also starting to realize that the current basically-syntactic definition of join domains isn't really going to work with commutable outer joins, so maybe the ultimate outcome is that the join domain itself is defined in this more narrowly scoped way. But that feels like a task to tackle later.) regards, tom lane
Re: Missing free_var() at end of accum_sum_final()?
Hi, On 2023-02-17 11:48:14 +0900, Michael Paquier wrote: > On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote: > > But why do we need it? Most SQL callable functions don't need to be careful > > about not leaking O(1) memory, the exception being functions backing btree > > opclasses. > > > > In fact, the detailed memory management often is *more* expensive than just > > relying on the calling memory context being reset. > > > > Of course, numeric.c doesn't really seem to have gotten that message, so > > there's a consistency argument here. > > I don't know which final result is better. The arguments go two ways: > 1) Should numeric.c be simplified so as its memory structure with extra > pfree()s, making it more consistent with more global assumptions than > just this file? This has the disadvantage of creating more noise in > backpatching, while increasing the risk of leaks if some of the > removed parts are allocated in a tight loop within the same context. > This makes memory management less complicated. That's how I am > understanding your point. It's not just simplification, it's just faster to free via context reset. I whipped up a random query exercising numeric math a bunch: SELECT max(a + b + '17'::numeric + c) FROM (SELECT generate_series(1::numeric, 1000::numeric)) aa(a), (SELECT generate_series(1::numeric, 100::numeric)) bb(b), (SELECT generate_series(1::numeric, 10::numeric)) cc(c); Removing the free_var()s from numeric_add_opt_error() speeds it up from ~361ms to ~338ms. This code really needs some memory management overhead reduction love. Many allocation could be avoided by having a small on-stack "buffer" that's used unless the numeric is large. > 2) Should the style within numeric.c be more consistent? This is how > I am understanding this proposal. As you quote, this makes memory > management more complicated (not convinced about that for the internal > of numerics?), while making the file more consistent. > At the end, perhaps that's not worth bothering, but 2) prevails when > it comes to the rule of making some code consistent with its > surroundings. 1) has more risks seeing how old this code is. I'm a bit wary that this will trigger a stream of patches to pointlessly free things, causing churn and slowdowns. I suspect there's other places in numeric.c where we don't free, and there certainly are a crapton in other functions. Greetings, Andres Freund
Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error
On 2023-Feb-13, Andres Freund wrote: > There's something wrong with the patch, it reliably fails with core dumps: > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3260 I think this would happen on machines where sizeof(bool) is not 1 (which mine is evidently not). Fixed. In addition, there was the problem that the psprintf() to generate the command name would race against each other if you had multiple threads. I changed the code so that the name to prepare each statement under is generated when the Command struct is first initialized, which occurs before the threads are started. One small issue is that now we use a single counter for all commands of all scripts, rather than a script-local counter. This doesn't seem at all important. I did realize that Nagata-san was right that we've always prepared the whole script in advance; that behavior was there already in commit 49639a7b2c52 that introduced -Mprepared. We've never done each command just before executing it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ Y una voz del caos me habló y me dijo "Sonríe y sé feliz, podría ser peor". Y sonreí. Y fui feliz. Y fue peor. >From 0728193a5f02d0dd6a1f3ec5fef314aec646ba33 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 17 Feb 2023 21:01:15 +0100 Subject: [PATCH v8] pgbench: Prepare commands in pipelines in advance Failing to do so results in an error when a pgbench script starts a serializable transaction inside a pipeline. --- src/bin/pgbench/pgbench.c| 155 +-- src/bin/pgbench/t/001_pgbench_with_server.pl | 20 +++ 2 files changed, 126 insertions(+), 49 deletions(-) diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 508ed218e8..38e0830e7e 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -628,7 +628,8 @@ typedef struct pg_time_usec_t txn_begin; /* used for measuring schedule lag times */ pg_time_usec_t stmt_begin; /* used for measuring statement latencies */ - bool prepared[MAX_SCRIPTS]; /* whether client prepared the script */ + /* whether client prepared each command of each script */ + bool **prepared; /* * For processing failures and repeating transactions with serialization @@ -733,12 +734,13 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"}; * argv Command arguments, the first of which is the command or SQL *string itself. For SQL commands, after post-processing *argv[0] is the same as 'lines' with variables substituted. - * varprefix SQL commands terminated with \gset or \aset have this set + * varprefix SQL commands terminated with \gset or \aset have this set *to a non NULL value. If nonempty, it's used to prefix the *variable name that receives the value. * aset do gset on all possible queries of a combined query (\;). * expr Parsed expression, if needed. * stats Time spent in this command. + * prepname The name that this command is prepared under, in prepare mode * retries Number of retries after a serialization or deadlock error in the *current command. * failures Number of errors in the current command that were not retried. 
@@ -754,6 +756,7 @@ typedef struct Command char *varprefix; PgBenchExpr *expr; SimpleStats stats; + char *prepname; int64 retries; int64 failures; } Command; @@ -3006,13 +3009,6 @@ runShellCommand(Variables *variables, char *variable, char **argv, int argc) return true; } -#define MAX_PREPARE_NAME 32 -static void -preparedStatementName(char *buffer, int file, int state) -{ - sprintf(buffer, "P%d_%d", file, state); -} - /* * Report the abortion of the client when processing SQL commands. */ @@ -3053,6 +3049,87 @@ chooseScript(TState *thread) return i - 1; } +/* + * Prepare the SQL command from st->use_file at command_num. + */ +static void +prepareCommand(CState *st, int command_num) +{ + Command*command = sql_script[st->use_file].commands[command_num]; + + /* No prepare for non-SQL commands */ + if (command->type != SQL_COMMAND) + return; + + /* + * If not already done, allocate space for 'prepared' flags: one boolean + * for each command of each script. + */ + if (!st->prepared) + { + st->prepared = pg_malloc(sizeof(bool *) * num_scripts); + for (int i = 0; i < num_scripts; i++) + { + ParsedScript *script = &sql_script[i]; + int numcmds; + + for (numcmds = 0; script->commands[numcmds] != NULL; numcmds++) +; + st->prepared[i] = pg_malloc0(sizeof(bool) * numcmds); + } + } + + if (!st->prepared[st->use_file][command_num]) + { + PGresult *res; + + pg_log_debug("client %d preparing %s", st->id, command->prepname); + res = PQprepare(st->con, command->prepname, + command->argv[0], command->argc - 1, NULL); + if (PQresultStatus(res) != PGRES_COMMAND_OK) + pg_log_error("%s", PQerrorMessage(st->con)); + PQclear(res); + st->prepared[st->use_file][command_nu
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 10:09 -0800, Andres Freund wrote: > -1. That's just going to cause pain one major version upgrade further > down the > line. Why would we want to incur that pain? OK, we can just always do the fixup as long as the old one is libc and the new one is ICU. I'm just trying to avoid this becoming a general mechanism to fix up an incompatible new cluster. > > There's already a check that the new cluster is empty, so I think > > it's > > safe to hack the pg_database locale fields. > > I don't think we need to, we do issue the CREATE DATABASEs. So we > just need to > make sure that includes the collation provider info, and the proper > template > database, in pg_upgrade mode. We must fixup template1/postgres in the new cluster (change it to libc to match the old cluster), because any objects existing in those databases in the old cluster may depend on the default collation. I don't see how we can do that without updating pg_database, so I'm not following your point. (I think you're right that template0 is optional; but since we're fixing up the other databases it would be less surprising if we also fixed up template0.) And if we do fixup template0/template1/postgres to match the old cluster, then CREATE DATABASE will have no issue. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: pg_init_privs corruption.
Floris Van Nee writes: > This is as far as I can see the same case as what I reported a few years ago > here: > https://www.postgresql.org/message-id/flat/1574068566573.13088%40Optiver.com#488bd647ce6f5d2c92764673a7c58289 > There was a discussion with some options, but no fix back then. Hmm, so Stephen was opining that the extension's objects shouldn't have gotten these privs attached in the first place. I'm not quite convinced about that one way or the other, but if you buy it then maybe this situation is unreachable once we fix that. I'm not sure though. It's still clear that we are making ACL entries that aren't reflected in pg_shdepend, and that seems bad. regards, tom lane
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 18:27 +0100, Pavel Stehule wrote: > Today I tested icu for Czech sorting. It is a little bit slower, but > not too much, but it produces partially different results. Thank you for trying it. If it's a significant slowdown, can you please send more information? ICU version, libc version, and testcase? > select row_number() over (order by nazev collate "cs-x-icu"), nazev > from obce > except select row_number() over (order by nazev collate "cs_CZ"), > nazev from obce; > > returns a not empty set. So minimally for Czech collate, an index > rebuild is necessary Yes, that's true of any locale change, provider change, or even provider version change. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: The output sql generated by pg_dump for a create function refers to a modified table name
"Jonathan S. Katz" writes: > On 2/17/23 1:18 PM, Tom Lane wrote: >> It can be reproduced with INSERT too, on the same principle as the others: >> put the DML command inside a WITH, and give it an alias conflicting with >> the outer query. > Ah, I see based on your example below. I did not alias the INSERT > statement in the way (and I don't know how common of a pattern it is to > o that). I suppose you can also make examples where the true name of the DML target table conflicts with an outer-query name, implying that we need to give it an alias even though the user wrote none. > I tested this against HEAD (+v69 of the DDL replication patch). My cases > are now all passing. > The code looks good to me -- I don't know if moving that logic is > overkill, but it makes the solution relatively clean. Cool, thanks for testing and code-reading. I'll go see about back-patching. > I didn't test in any back branches yet, but given this can generate an > invalid function body, it does likely need to be backpatched. Presumably it can also cause dump/restore failures for rules that do this sort of thing, though admittedly those wouldn't be common. regards, tom lane
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-17 12:36:05 -0800, Jeff Davis wrote: > > > There's already a check that the new cluster is empty, so I think > > > it's > > > safe to hack the pg_database locale fields. > > > > I don't think we need to, we do issue the CREATE DATABASEs. So we > > just need to > > make sure that includes the collation provider info, and the proper > > template > > database, in pg_upgrade mode. > > We must fixup template1/postgres in the new cluster (change it to libc > to match the old cluster), because any objects existing in those > databases in the old cluster may depend on the default collation. I > don't see how we can do that without updating pg_database, so I'm not > following your point. I think we just drop/recreate template1 and postgres during pg_upgrade. Yep, looks like it. See create_new_objects(): /* * template1 database will already exist in the target installation, * so tell pg_restore to drop and recreate it; otherwise we would fail * to propagate its database-level properties. */ create_opts = "--clean --create"; and then: /* * postgres database will already exist in the target installation, so * tell pg_restore to drop and recreate it; otherwise we would fail to * propagate its database-level properties. */ if (strcmp(old_db->db_name, "postgres") == 0) create_opts = "--clean --create"; else create_opts = "--create"; Greetings, Andres Freund
Reducing connection overhead in pg_upgrade compat check phase
When adding a check to pg_upgrade a while back I noticed in a profile that the cluster compatibility check phase spent a lot of time in connectToServer. Some of this can be attributed to the data type checks, each of which runs serially, in turn connecting to each database to run the check, and this seemed like a place where we can do better. The attached patch moves the checks from individual functions, each of which loops over all databases, into a struct which is consumed by a single umbrella check where all data type queries are executed against a database using the same connection. This way we can amortize the connectToServer overhead across more accesses to the database. In the trivial case, a single database, I don't see a reduction in performance over the current approach. In a cluster with 100 (empty) databases there is a ~15% reduction in time to run a --check pass. While it won't move the earth in terms of wallclock time, consuming fewer resources on the old cluster and making --check cheaper might be the bigger win. -- Daniel Gustafsson 0001-pg_upgrade-run-all-data-type-checks-per-connection.patch Description: Binary data
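Since the patch itself is attached as binary data above, here is a rough sketch of the table-driven shape described: the struct fields, the example check, and the umbrella loop are illustrative only and may differ from what the attached patch actually does.

/* Sketch only -- would live alongside the other checks in src/bin/pg_upgrade/check.c. */
typedef struct
{
	const char *status;			/* status message printed while checking */
	const char *report_text;	/* error text if the check finds anything */
	const char *base_query;		/* catalog query returning offending columns */
} DataTypesUsageChecks;

static const DataTypesUsageChecks data_types_usage_checks[] = {
	{
		"Checking for incompatible \"line\" data type",
		"Your installation contains the \"line\" data type in user tables.",
		"SELECT c.relname, a.attname "
		"FROM pg_catalog.pg_class c, pg_catalog.pg_attribute a "
		"WHERE c.oid = a.attrelid AND NOT a.attisdropped "
		"  AND a.atttypid = 'pg_catalog.line'::pg_catalog.regtype"
	},
	/* ... one entry per data type check ... */
	{NULL, NULL, NULL}
};

/*
 * Umbrella check: one connection per database, with every data type query
 * run over that same connection, instead of one connection per check and
 * per database.
 */
static void
check_data_types_usage(ClusterInfo *cluster)
{
	for (int dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
	{
		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
		PGconn	   *conn = connectToServer(cluster, active_db->db_name);

		for (const DataTypesUsageChecks *check = data_types_usage_checks;
			 check->status != NULL; check++)
		{
			PGresult   *res;

			prep_status("%s", check->status);
			res = executeQueryOrDie(conn, "%s", check->base_query);
			if (PQntuples(res) > 0)
				pg_fatal("%s", check->report_text);
			PQclear(res);
			check_ok();
		}

		PQfinish(conn);
	}
}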
archive modules loose ends
Andres recently reminded me of some loose ends in archive modules [0], so I'm starting a dedicated thread to address his feedback. The first one is the requirement that archive module authors create their own exception handlers if they want to make use of ERROR. Ideally, there would be a handler in pgarch.c so that authors wouldn't need to deal with this. I do see some previous discussion about this [1] in which I expressed concerns about memory management. Looking at this again, I may have been overthinking it. IIRC I was thinking about creating a memory context that would be switched into for only the archiving callback (and reset afterwards), but that might not be necessary. Instead, we could rely on module authors to handle this. One example is basic_archive, which maintains its own memory context. Alternatively, authors could simply pfree() anything that was allocated. Furthermore, by moving the exception handling to pgarch.c, module authors can begin using PG_TRY, etc. in their archiving callbacks, which simplifies things a bit. I've attached a work-in-progress patch for this change. On Fri, Feb 17, 2023 at 11:41:32AM -0800, Andres Freund wrote: > On 2023-02-16 13:58:10 -0800, Nathan Bossart wrote: >> On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: >> > I'm quite baffled by: >> >/* Close any files left open by copy_file() or compare_files() >> > */ >> >AtEOSubXact_Files(false, InvalidSubTransactionId, >> > InvalidSubTransactionId); >> > >> > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() >> > completely outside the context of a transaction environment. And it only >> > does >> > the thing you want because you pass parameters that aren't actually valid >> > in >> > the normal use in AtEOSubXact_Files(). I really don't understand how >> > that's >> > supposed to be ok. >> >> Hm. Should copy_file() and compare_files() have PG_FINALLY blocks that >> attempt to close the files instead? What would you recommend? > > I don't fully know, it's not entirely clear to me what the goals here were. I > think you'd likely need to do a bit of infrastructure work to do this > sanely. So far we just didn't have the need to handle files being released in > a way like you want to do there. > > I suspect a good direction would be to use resource owners. Add a separate set > of functions that release files on resource owner release. Most of the > infrastructure is there already, for temporary files > (c.f. OpenTemporaryFile()). > > Then that resource owner could be reset in case of error. > > > I'm not even sure that erroring out is a reasonable way to implement > copy_file(), compare_files(), particularly because you want to return via a > return code from basic_archive_files(). To initialize this thread, I'll provide a bit more background. basic_archive makes use of copy_file(), and it introduces a function called compare_files() that is used to check whether two files have the same content. These functions make use of OpenTransientFile() and CloseTransientFile(). In basic_archive's sigsetjmp() block, there's a call to AtEOSubXact_Files() to make sure we close any files that are open when there is an ERROR. IIRC I was following the example set by other processes that make use of the AtEOXact* functions in their sigsetjmp() blocks. Looking again, I think AtEOXact_Files() would also work for basic_archive's use-case. That would at least avoid the hack of using InvalidSubTransactionId for the second and third arguments.
>From the feedback quoted above, it sounds like improving this further will require a bit of infrastructure work. I haven't looked too deeply into this yet. [0] https://postgr.es/m/20230216192956.mhi6uiakchkolpki%40awork3.anarazel.de [1] https://postgr.es/m/20220202224433.GA1036711%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index cd852888ce..2c51d2b100 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -172,7 +172,6 @@ basic_archive_configured(ArchiveModuleState *state) static bool basic_archive_file(ArchiveModuleState *state, const char *file, const char *path) { - sigjmp_buf local_sigjmp_buf; MemoryContext oldcontext; BasicArchiveData *data = (BasicArchiveData *) state->private_data; MemoryContext basic_archive_context = data->context; @@ -184,51 +183,22 @@ basic_archive_file(ArchiveModuleState *state, const char *file, const char *path */ oldcontext = MemoryContextSwitchTo(basic_archive_context); - /* - * Since the archiver operates at the bottom of the exception stack, - * ERRORs turn into FATALs and cause the archiver process to restart. - * However, using ereport(ERROR, ...) when there are problems is easy to - * code and maintain. Therefore, we create our own exception handler to - * catch ERRORs and return fals
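As a concrete illustration of what the WIP patch enables: once pgarch.c owns the top-level exception handler, a module's archiving callback can use an ordinary PG_TRY block for its own cleanup instead of rolling its own sigsetjmp() handling. A minimal, hypothetical sketch (not basic_archive itself; the module-init glue is omitted):

#include "postgres.h"

#include <fcntl.h>

#include "archive/archive_module.h"
#include "storage/fd.h"

/*
 * Hypothetical archiving callback.  It relies on pgarch.c to catch any
 * ERROR thrown here, and only uses PG_TRY/PG_CATCH to make sure its
 * transient file is closed on the way out.
 */
static bool
my_archive_file_cb(ArchiveModuleState *state, const char *file, const char *path)
{
	int			srcfd;

	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
	if (srcfd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));

	PG_TRY();
	{
		/* ... copy or verify the segment here, ereport(ERROR) on trouble ... */
	}
	PG_CATCH();
	{
		/* Close the file ourselves; pgarch.c handles the rest of the ERROR. */
		CloseTransientFile(srcfd);
		PG_RE_THROW();
	}
	PG_END_TRY();

	CloseTransientFile(srcfd);
	return true;
}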
Re: [PATCH] Add pretty-printed XML output option
On Fri, Feb 17, 2023 at 1:14 AM Jim Jones wrote: > After your comment I'm studying the possibility to extend the existing > xmlserialize function to add the indentation feature. If so, how do you > think it should look like? An extra parameter? e.g. > > SELECT xmlserialize(DOCUMENT '42'::XML AS text, > true); > > .. or more or like Oracle does it > > SELECT XMLSERIALIZE(DOCUMENT xmltype('42') AS BLOB > INDENT) > FROM dual; > My idea was to follow the SQL standard (part 14, SQL/XML); unfortunately, there is no free version, but there are drafts at http://www.wiscorp.com/SQLStandards.html. ::= XMLSERIALIZE [ ] AS [ ] [ ] [ ] [ ] ::= [ NO ] INDENT Oracle's extension SIZE=n also seems interesting (including a special case SIZE=0, which means using new-line characters without spaces for each line).
Re: recovery modules
On Fri, Feb 17, 2023 at 05:01:47PM +0900, Michael Paquier wrote: > All that had better > be put into their own threads, IMO, to bring more visibility on these > subjects. I created a new thread for these [0]. > Saying that, I have spent more time on the revamped version of the > archive modules and it was already doing a lot, so I have applied > it as it covered all the points discussed.. Thanks! [0] https://postgr.es/m/20230217215624.GA3131134%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On Fri, Feb 17, 2023 at 03:13:32PM +0100, Peter Eisentraut wrote: > On 16.02.23 22:29, Andres Freund wrote: >> What's the story behind 100_bugs.pl? This name clearly is copied from >> src/test/subscription/t/100_bugs.pl - but I've never understood why that is >> outside of the normal numbering space. > > Mainly to avoid awkwardness for backpatching. The number of tests in > src/test/subscription/ varies quite a bit across branches. I'm happy to move this new test to wherever folks think it should go. I'll look around to see if I can find a better place, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: recovery modules
Here is a new revision of the restore modules patch set. In this patch set, the interface looks similar to the recent archive modules redesign, and there are separate callbacks for retrieving different types of files. I've attempted to address all the feedback I've received, but there was a lot scattered across different threads, so it's possible I've missed something. Note that 0001 is the stopgap fix for restore_command that's being tracked elsewhere [0]. I was careful to avoid repeating the recent mistake with the SIGTERM handling. This patch set is still a little rough around the edges, but I wanted to post it in case folks had general thoughts about the structure, interface, etc. This implementation restores files synchronously one-by-one just like archive modules, but in the future, I would like to add asynchronous/parallel/batching support. My intent is for this work to move us closer to that. [0] https://postgr.es/m/20230214174755.GA1348509%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 64651ae3c56160fea31d15a78f07c8c12950ec99 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 14 Feb 2023 09:44:53 -0800 Subject: [PATCH v12 1/6] stopgap fix for restore_command --- src/backend/access/transam/xlogarchive.c | 15 +++ src/backend/postmaster/startup.c | 20 +++- src/backend/storage/ipc/ipc.c| 3 +++ src/backend/storage/lmgr/proc.c | 2 ++ 4 files changed, 35 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index fcc87ff44f..41684418b6 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -159,20 +159,27 @@ RestoreArchivedFile(char *path, const char *xlogfname, (errmsg_internal("executing restore command \"%s\"", xlogRestoreCmd))); + fflush(NULL); + pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); + /* - * Check signals before restore command and reset afterwards. + * PreRestoreCommand() informs the SIGTERM handler for the startup process + * that it should proc_exit() right away. This is done for the duration of + * the system() call because there isn't a good way to break out while it + * is executing. Since we might call proc_exit() in a signal handler, it + * is best to put any additional logic before or after the + * PreRestoreCommand()/PostRestoreCommand() section. */ PreRestoreCommand(); /* * Copy xlog from archival storage to XLOGDIR */ - fflush(NULL); - pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); rc = system(xlogRestoreCmd); - pgstat_report_wait_end(); PostRestoreCommand(); + + pgstat_report_wait_end(); pfree(xlogRestoreCmd); if (rc == 0) diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c index efc2580536..de2b56c2fa 100644 --- a/src/backend/postmaster/startup.c +++ b/src/backend/postmaster/startup.c @@ -19,6 +19,8 @@ */ #include "postgres.h" +#include + #include "access/xlog.h" #include "access/xlogrecovery.h" #include "access/xlogutils.h" @@ -121,7 +123,23 @@ StartupProcShutdownHandler(SIGNAL_ARGS) int save_errno = errno; if (in_restore_command) - proc_exit(1); + { + /* + * If we are in a child process (e.g., forked by system() in + * RestoreArchivedFile()), we don't want to call any exit callbacks. + * The parent will take care of that. 
+ */ + if (MyProcPid == (int) getpid()) + proc_exit(1); + else + { + const char msg[] = "StartupProcShutdownHandler() called in child process"; + int rc pg_attribute_unused(); + + rc = write(STDERR_FILENO, msg, sizeof(msg)); + _exit(1); + } + } else shutdown_requested = true; WakeupRecovery(); diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c index 1904d21795..6796cabc3e 100644 --- a/src/backend/storage/ipc/ipc.c +++ b/src/backend/storage/ipc/ipc.c @@ -103,6 +103,9 @@ static int on_proc_exit_index, void proc_exit(int code) { + /* proc_exit() is not safe in forked processes from system(), etc. */ + Assert(MyProcPid == getpid()); + /* Clean up everything that must be cleaned up */ proc_exit_prepare(code); diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 22b4278610..ae845e8249 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg) dlist_head *procgloballist; Assert(MyProc != NULL); + Assert(MyProcPid == getpid()); /* not safe if forked by system(), etc. */ /* Make sure we're out of the sync rep lists */ SyncRepCleanupAtProcExit(); @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg) PGPROC *proc; Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS); + Assert(MyProcPid == getpid()); /* not safe if forked by system(), etc. */ auxproc = &AuxiliaryProcs[proctype]; -- 2.25.1 >From 2ef84ed51ba934c7979e2a7
Re: Reducing connection overhead in pg_upgrade compat check phase
On Fri, Feb 17, 2023 at 10:44:49PM +0100, Daniel Gustafsson wrote: > In the trivial case, a single database, I don't see a reduction of performance > over the current approach. In a cluster with 100 (empty) databases there is a > ~15% reduction in time to run a --check pass. While it won't move the earth > in > terms of wallclock time, consuming less resources on the old cluster allowing > --check to be cheaper might be the bigger win. Nice! This has actually been on my list of things to look into, so I intend to help review the patch. In any case, +1 for making pg_upgrade faster. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 12:50 -0800, Andres Freund wrote: > I think we just drop/recreate template1 and postgres during > pg_upgrade. Thank you, that makes much more sense now. I was confused because pg_upgrade loops through to check compatibility with all the databases, which makes zero sense if it's going to drop all of them except template0 anyway. The checks on template1/postgres should be bypassed. So the two approaches are: 1. Don't bother with locale provider compatibility checks at all (even on template0). The emitted CREATE DATABASE statements already specify the locale provider, so that will take care of everything except template0. Maybe the locale provider of template0 shouldn't matter, but some users might be surprised if it changes during upgrade. It also raises some questions about the other properties of template0 like encoding, lc_collate, and lc_ctype, which also don't matter a whole lot (because they can all be changed when using template0 as a template). 2. Update the pg_database entry for template0. This has less potential for surprise in case people are actually using template0 for a template. -- Jeff Davis PostgreSQL Contributor Team - AWS
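For what it's worth, option 2 could presumably reuse the pattern pg_upgrade already has in set_frozenxids(), i.e. a direct catalog UPDATE over a connection to the new cluster. A rough, illustrative sketch only, assuming the v15/v16 pg_database columns and an old cluster using libc; this is not a worked-out patch, and whether such a fix-up is acceptable is exactly what is being discussed above.

/*
 * Illustrative only: force the new cluster's template0/template1/postgres
 * entries back to the libc provider to match the old cluster, in the
 * style of set_frozenxids().
 */
static void
set_template_locale_provider(void)
{
	PGconn	   *conn = connectToServer(&new_cluster, "template1");

	PQclear(executeQueryOrDie(conn,
							  "UPDATE pg_catalog.pg_database "
							  "SET datlocprovider = 'c', daticulocale = NULL "
							  "WHERE datname IN ('template0', 'template1', 'postgres')"));
	PQfinish(conn);
}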
Re: pg_walinspect memory leaks
On Thu, 2023-02-16 at 18:00 +0530, Bharath Rupireddy wrote: > I'm attaching the patches here. For HEAD, I'd > want to be a bit defensive and use the temporary memory context for > pg_get_wal_fpi_info() too. I don't see why we shouldn't backpatch that, too? Also, it seems like we should do the same thing for the loop in GetXLogSummaryStats(). Maybe just for the outer loop is fine (the inner loop is only 16 elements); though again, there's not an obvious downside to fixing that, too. -- Jeff Davis PostgreSQL Contributor Team - AWS
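For reference, the shape of the fix under discussion is roughly the following sketch; read_next_record(), fill_values_from_record(), NUM_OUTPUT_COLS, and rsinfo are stand-ins rather than the actual pg_walinspect code. Each iteration's palloc'd output values live in a short-lived context that is reset after the tuple is emitted, so memory use stays flat even over millions of WAL records.

MemoryContext tmp_cxt,
			old_cxt;

tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
								"temporary per-record context",
								ALLOCSET_DEFAULT_SIZES);

while (read_next_record())		/* stand-in for the real WAL record loop */
{
	Datum		values[NUM_OUTPUT_COLS];
	bool		nulls[NUM_OUTPUT_COLS];

	old_cxt = MemoryContextSwitchTo(tmp_cxt);

	/* pallocs text datums, arrays, etc. for one output row */
	fill_values_from_record(values, nulls);
	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);

	MemoryContextSwitchTo(old_cxt);
	MemoryContextReset(tmp_cxt);
}

MemoryContextDelete(tmp_cxt);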
Re: O(n) tasks cause lengthy startups and checkpoints
On Thu, Feb 02, 2023 at 09:48:08PM -0800, Nathan Bossart wrote: > rebased for cfbot another rebase for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 1c9b95cae7adcc57b7544a44ff16a26e71c6c736 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 5 Jan 2022 19:24:22 + Subject: [PATCH v20 1/4] Introduce custodian. The custodian process is a new auxiliary process that is intended to help offload tasks could otherwise delay startup and checkpointing. This commit simply adds the new process; it does not yet do anything useful. --- doc/src/sgml/glossary.sgml | 11 + src/backend/postmaster/Makefile | 1 + src/backend/postmaster/auxprocess.c | 8 + src/backend/postmaster/custodian.c | 377 src/backend/postmaster/meson.build | 1 + src/backend/postmaster/postmaster.c | 38 ++- src/backend/storage/ipc/ipci.c | 3 + src/backend/storage/lmgr/proc.c | 1 + src/backend/utils/activity/pgstat_io.c | 4 +- src/backend/utils/activity/wait_event.c | 3 + src/backend/utils/init/miscinit.c | 3 + src/include/miscadmin.h | 3 + src/include/postmaster/custodian.h | 32 ++ src/include/storage/proc.h | 11 +- src/include/utils/wait_event.h | 1 + 15 files changed, 491 insertions(+), 6 deletions(-) create mode 100644 src/backend/postmaster/custodian.c create mode 100644 src/include/postmaster/custodian.h diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml index 7c01a541fe..ad3f53e2a3 100644 --- a/doc/src/sgml/glossary.sgml +++ b/doc/src/sgml/glossary.sgml @@ -144,6 +144,7 @@ (but not the autovacuum workers), the background writer, the checkpointer, + the custodian, the logger, the startup process, the WAL archiver, @@ -484,6 +485,16 @@ + + Custodian (process) + + + An auxiliary process + that is responsible for executing assorted cleanup tasks. + + + + Data area diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile index 047448b34e..5f4dde85cf 100644 --- a/src/backend/postmaster/Makefile +++ b/src/backend/postmaster/Makefile @@ -18,6 +18,7 @@ OBJS = \ bgworker.o \ bgwriter.o \ checkpointer.o \ + custodian.o \ fork_process.o \ interrupt.o \ pgarch.o \ diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c index cae6feb356..a1f042f13a 100644 --- a/src/backend/postmaster/auxprocess.c +++ b/src/backend/postmaster/auxprocess.c @@ -20,6 +20,7 @@ #include "pgstat.h" #include "postmaster/auxprocess.h" #include "postmaster/bgwriter.h" +#include "postmaster/custodian.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/walreceiver.h" @@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype) case CheckpointerProcess: MyBackendType = B_CHECKPOINTER; break; + case CustodianProcess: + MyBackendType = B_CUSTODIAN; + break; case WalWriterProcess: MyBackendType = B_WAL_WRITER; break; @@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype) CheckpointerMain(); proc_exit(1); + case CustodianProcess: + CustodianMain(); + proc_exit(1); + case WalWriterProcess: WalWriterMain(); proc_exit(1); diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c new file mode 100644 index 00..98bb9efcfd --- /dev/null +++ b/src/backend/postmaster/custodian.c @@ -0,0 +1,377 @@ +/*- + * + * custodian.c + * + * The custodian process handles a variety of non-critical tasks that might + * otherwise delay startup, checkpointing, etc. Offloaded tasks should not + * be synchronous (e.g., checkpointing shouldn't wait for the custodian to + * complete a task before proceeding). 
However, tasks can be synchronously + * executed when necessary (e.g., single-user mode). The custodian is not + * an essential process and can shutdown quickly when requested. The + * custodian only wakes up to perform its tasks when its latch is set. + * + * + * Copyright (c) 2022, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * src/backend/postmaster/custodian.c + * + *- + */ +#include "postgres.h" + +#include "libpq/pqsignal.h" +#include "pgstat.h" +#include "postmaster/custodian.h" +#include "postmaster/interrupt.h" +#include "storage/bufmgr.h" +#include "storage/condition_variable.h" +#include "storage/fd.h" +#include "storage/proc.h" +#include "storage/procsignal.h" +#include "storage/smgr.h" +#include "utils/memutils.h" + +static void DoCustodianTasks(void); +static CustodianTask CustodianGetNextTask(void); +static void CustodianEnqueueTask(CustodianTask task); +static const struct cust_task_funcs_entry *LookupCustodianFunct
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
I still have no theory for how this condition was reached despite a lot of time thinking about it and searching for more clues. As far as I can tell, the recent improvements to postmaster's signal and event handling shouldn't be related: the state management and logic was unchanged. While failing to understand this, I worked[1] on CI log indexing tool with public reports that highlight this sort of thing[2], so I'll be watching out for more evidence. Unfortunately I have no data from before 1 Feb (cfbot previously wasn't interested in the past at all; I'd need to get my hands on the commit IDs for earlier testing but I can't figure out how to get those out of Cirrus or Github -- anyone know how?). FWIW I have a thing I call bfbot for slurping up similar data from the build farm. It's not pretty enough for public consumption, but I do know that this assertion hasn't failed there, except the cases I mentioned earlier, and a load of failures on lorikeet which was completely b0rked until recently. [1] https://xkcd.com/974/ [2] http://cfbot.cputube.org/highlights/assertion-90.html
Share variable between psql backends in CustomScan
Hi, I am looking for a way to define a global variable in CustomScan plugin that is shared between different psql backends. Is it possible without using shared memory? Does postgresql implement any function that facilitates this? Thank you, Amin
File* argument order, argument types
Hi, While trying to add additional File* functions (FileZero, FileFallocate) I went back and forth about the argument order between "amount" and "offset". We have: extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info); extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info); extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info); extern int FileTruncate(File file, off_t offset, uint32 wait_event_info); extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info); and I want to add (for [1]) extern int FileZero(File file, off_t amount, off_t offset, uint32 wait_event_info); extern int FileFallocate(File file, off_t amount, off_t offset, uint32 wait_event_info); The differences originate in trying to mirror the underlying functions' signatures: int posix_fadvise(int fd, off_t offset, off_t len, int advice); ssize_t pread(int fd, void buf[.count], size_t count, off_t offset); ssize_t pwrite(int fd, const void buf[.count], size_t count, off_t offset); int ftruncate(int fd, off_t length); int posix_fallocate(int fd, off_t offset, off_t len); int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags); It seems quite confusing to be this inconsistent about argument order and argument types in the File* functions. For one, the relation to the underlying posix functions isn't always obvious. For another, we're not actually mirroring the signatures all that well, since our argument and return types don't actually match. It'd be easy enough to decide on a set of types for the arguments; that'd be API compatible (though not necessarily ABI compatible, but we don't care). But changing the argument order would commonly lead to silent breakage, which obviously would be bad. Or maybe it's unlikely enough that there are external callers? I don't know what to actually propose. I guess the least bad I can see is to pick one type & argument order that we document to be the default, with a caveat placed above the functions not following the argument order. Order wise, I think we should choose amount, offset. For the return type we probably should pick ssize_t? I don't know what we should standardize on for 'amount', I'd probably be inclined to go for size_t. Greetings, Andres Freund [1] https://postgr.es/m/20221029025420.eplyow6k7tgu6...@awork3.anarazel.de
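Putting that together, the documented default might look something like the following; these are hypothetical declarations sketching the suggested convention (amount before offset, size_t amount, ssize_t return), not what fd.h currently contains:

/* default convention: amount before offset, size_t amount, ssize_t return */
extern ssize_t FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern ssize_t FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern ssize_t FileZero(File file, size_t amount, off_t offset, uint32 wait_event_info);
extern ssize_t FileFallocate(File file, size_t amount, off_t offset, uint32 wait_event_info);

/* documented exceptions, keeping the underlying call's (offset, amount) order */
extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);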
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Hi, On 2023-02-18 13:27:04 +1300, Thomas Munro wrote: > I still have no theory for how this condition was reached despite a > lot of time thinking about it and searching for more clues. As far as > I can tell, the recent improvements to postmaster's signal and event > handling shouldn't be related: the state management and logic was > unchanged. Yea, it's all very odd. If you look at the log: 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL: No valid identifier after ".". 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT: SELECT parse_ident('xxx.1020'); ... TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 5948 abort() has been called ... 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: server process (PID 5948) was terminated by exception 0xC354 2023-02-08 00:53:27.420 GMT postmaster[872] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: terminating any other active server processes 2023-02-08 00:53:27.434 GMT postmaster[872] LOG: all server processes terminated; reinitializing and that it's indeed the money test that failed: money... FAILED (test process exited with exit code 2) 7337 ms it's very hard to understand how this stack can come to be: 0085`f03ffa40 7ff6`fd89faa8 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77] 0085`f03ffa80 7ff6`fd6474dc postgres!ExceptionalCondition( char * conditionName = 0x7ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", char * fileName = 0x7ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c", int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67] 0085`f03ffac0 7ff6`fd676eff postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c @ 329] 0085`f03ffb00 7ff6`fd59aa3a postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375] 0085`f03ffb60 7ff6`fd467689 postgres!SubPostmasterMain( int argc = 0n3, char ** argv = 0x01c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962] 0085`f03ffd90 7ff6`fda0e1c9 postgres!main( int argc = 0n3, char ** argv = 0x01c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192] How can a process that we did notify crashing, that has already executed SQL statements, end up in MarkPostmasterChildActive()? > While failing to understand this, I worked[1] on CI log indexing tool > with public reports that highlight this sort of thing[2], so I'll be > watching out for more evidence. Unfortunately I have no data from > before 1 Feb (cfbot previously wasn't interested in the past at all; > I'd need to get my hands on the commit IDs for earlier testing but I > can't figure out how to get those out of Cirrus or Github -- anyone > know how?). FWIW I have a thing I call bfbot for slurping up similar > data from the build farm. It's not pretty enough for public > consumption, but I do know that this assertion hasn't failed there, > except the cases I mentioned earlier, and a load of failures on > lorikeet which was completely b0rked until recently. > [1] https://xkcd.com/974/ > [2] http://cfbot.cputube.org/highlights/assertion-90.html I think this extremely useful. Greetings, Andres Freund
Re: Add index scan progress to pg_stat_progress_vacuum
Thanks for the review! >+ >+ ParallelVacuumFinish >+ Waiting for parallel vacuum workers to finish index >vacuum. >+ >This change is out-of-date. That was an oversight. Thanks for catching. >Total number of indexes that will be vacuumed or cleaned up. This >number is reported as of the beginning of the vacuuming indexes phase >or the cleaning up indexes phase. This is cleaner. I was being unnecessarily verbose in the original description. >Number of indexes processed. This counter only advances when the phase >is vacuuming indexes or cleaning up indexes. I agree. >Also, index_processed sounds accurate to me. What do you think? At one point, I used index_processed, but decided to change it. "processed" makes sense also. I will use this. >I think these settings are not necessary since the pcxt is palloc0'ed. Good point. >Assert(pvs->pcxt->parallel_progress_callback_arg) looks wrong to me. >If 'arg' is NULL, a SEGV happens. Correct, Assert(pvs) is all that is needed. >I think it's better to update pvs->shared->nindexes_completed by both >leader and worker processes who processed the index. No reason for that, since only the leader process can report progress to backend_progress. >I think it's better to make the function type consistent with the >existing parallel_worker_main_type. How about >parallel_progress_callback_type? Yes, that makes sense. > I've attached a patch that incorporates the above comments and has >some suggestions of updating comments etc. I reviewed and incorporated these changes, with a slight change. See v24. -* Increase and report the number of index. Also, we reset the progress -* counters. +* Increase and report the number of index scans. Also, we reset the progress +* counters. Thanks -- Sami Imseih Amazon Web Services (AWS) v24-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch Description: v24-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
On Fri, Feb 17, 2023 at 9:43 AM Andres Freund wrote: > > On 2023-02-17 08:30:09 +0530, Amit Kapila wrote: > > Thanks, I was not completely sure about whether we need to bump > > XLOG_PAGE_MAGIC for this patch as this makes the additional space just > > by changing the datatype of one of the members of the existing WAL > > record. We normally change it for the addition/removal of new fields > > aka change in WAL record format, or addition of a new type of WAL > > record. Does anyone else have an opinion/suggestion on this matter? > > I'd definitely change it - the width of a field still means you can't really > parse the old WAL sensibly anymore. > Okay, thanks for your input. I'll push this patch early next week. -- With Regards, Amit Kapila.
Re: pg_upgrade and logical replication
On Fri, Feb 17, 2023 at 9:05 PM Julien Rouhaud wrote: > > On Fri, Feb 17, 2023 at 04:12:54PM +0530, Amit Kapila wrote: > > On Fri, Feb 17, 2023 at 1:24 PM Julien Rouhaud wrote: > > > > > > An easy workaround that I tried is to allow something like > > > > > > ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) > > > > > > so that the refresh internally happens before the apply worker is started > > > and > > > you just keep consuming the delta, which works on naive scenario. > > > > > > One concern I have with this approach is that the default values for both > > > "refresh" and "copy_data" for all other subcommands is "true, but we would > > > probably need a different default value in that exact scenario (as we > > > know we > > > already have the data). I think that it would otherwise be safe in my > > > very > > > specific scenario, assuming that you created the slot beforehand and > > > moved the > > > slot's LSN at the promotion point, as even if you add non-empty tables to > > > the > > > publication you will only need the delta whether those were initially > > > empty or > > > not given your initial physical replica state. > > > > > > > This point is not very clear. Why would one just need delta even for new > > tables? > > Because in my scenario I'm coming from physical replication, so I know that I > did replicate everything until the promotion LSN. Any table later added in > the > publication is either already fully replicated until that LSN on the upgraded > node, so only the delta is needed, or has been created after that LSN. In the > latter case, the entirety of the table will be replicated with the logical > replication as a delta right? > That makes sense to me. > > > Any other scenario would make > > > this new option dangerous, if not entirely useless, but not more than any > > > of > > > the current commands that lead to refreshing a subscription and have the > > > same > > > options I guess. > > > > > > All in all, currently the only way to somewhat safely resume logical > > > replication after a pg_upgrade is to drop all the subscriptions that were > > > transferred during pg_upgrade on all databases and recreate them (using > > > the > > > existing slots on the publisher side obviously), allowing the initial > > > connection. But this approach only works in the exact scenario I > > > mentioned > > > (physical to logical replication, or at least a case where *all* the > > > tables > > > where logically replicated prior to the pg_ugprade), otherwise you have to > > > recreate the follower node from scratch using logical repication. > > > > > > > I think if you dropped and recreated the subscriptions by retaining > > old slots, the replication should resume from where it left off before > > the upgrade. Which scenario are you concerned about? > > I'm concerned about people not coming from physical replication. If you just > had some "normal" logical replication, you can't assume that you already have > all the data from the upstream subscription. If it was modified and a non > empty table is added, you might need to copy the data of part of the tables > and > keep replicating for the rest. It's hard to be sure from a user point of > view, > and even if you knew you have no way to express it. > Can't the user create a separate publication for such newly added tables and a corresponding new subscription on the downstream node? Now, I think it would be a bit tricky if the user already has a publication defined with FOR ALL TABLES. 
In that case, we probably need some way to specify FOR ALL TABLES EXCEPT (list of tables) which we currently don't have. > > > Is that indeed the current behavior, or did I miss something? > > > > > > Is this "resume logical replication on pg_upgraded node" something we > > > want to > > > support better? I was thinking that we could add a new pg_dump mode > > > (maybe > > > only usable during pg_upgrade) that also restores the pg_subscription_rel > > > content in each subscription or something like that. If not, should > > > pg_upgrade > > > keep preserving the subscriptions as it doesn't seem safe to use them, or > > > at > > > least document the hazards (I didn't find anything about it in the > > > documentation)? > > > > > > > > > > There is a mention of this in pg_dump docs. See [1] (When dumping > > logical replication subscriptions ...) > > Indeed, but it's barely saying "It is then up to the user to reactivate the > subscriptions in a suitable way" and "It might also be appropriate to truncate > the target tables before initiating a new full table copy". As I mentioned, I > don't think there's a suitable way to reactivate the subscription, at least if > you don't want to miss some records, so truncating all target tables is the > only fully safe way to proceed. It seems quite silly to have to do so just > because pg_upgrade doesn't retain the list of relation per subscription. > I also don't know i
Re: Share variable between psql backends in CustomScan
Hi On Sat, Feb 18, 2023 at 1:37 AM Amin wrote: > Hi, > > I am looking for a way to define a global variable in CustomScan plugin > that is shared between different psql backends. Is it possible without > using shared memory? Does postgresql implement any function that > facilitates this? > No - there is nothing like this. You need to use shared memory. Regards Pavel > > Thank you, > Amin >
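To expand on that a bit: the usual pattern for an extension is to reserve its shared struct via the shmem hooks at preload time and have every backend attach to it on startup. A minimal sketch (PG 15+ hook style; all names here are invented for illustration, and the library must be listed in shared_preload_libraries):

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* The state every backend shares; protect access with the lock. */
typedef struct MySharedState
{
	LWLock	   *lock;
	int64		counter;
} MySharedState;

static MySharedState *my_state = NULL;
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void
my_shmem_request(void)
{
	if (prev_shmem_request_hook)
		prev_shmem_request_hook();

	RequestAddinShmemSpace(MAXALIGN(sizeof(MySharedState)));
	RequestNamedLWLockTranche("my_custom_scan", 1);
}

static void
my_shmem_startup(void)
{
	bool		found;

	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	my_state = ShmemInitStruct("my_custom_scan state",
							   sizeof(MySharedState), &found);
	if (!found)
	{
		/* first process to attach initializes the shared struct */
		my_state->lock = &(GetNamedLWLockTranche("my_custom_scan"))->lock;
		my_state->counter = 0;
	}
	LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
	/* only effective when the library is in shared_preload_libraries */
	prev_shmem_request_hook = shmem_request_hook;
	shmem_request_hook = my_shmem_request;
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = my_shmem_startup;
}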
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 9:43 PM Jeff Davis wrote: > On Fri, 2023-02-17 at 18:27 +0100, Pavel Stehule wrote: > > Today I tested icu for Czech sorting. It is a little bit slower, but > > not too much, but it produces partially different results. > > Thank you for trying it. > > If it's a significant slowdown, can you please send more information? > ICU version, libc version, and testcase? > no - this slowdown is not significant - although 1% can look like too much, it is just two ms. It looks like libicu has a slightly more expensive initialization, but the execution is slightly faster. But when I try to repeat the measurements, the results are very unstable on my desktop :-/ SELECT * FROM obce ORDER BY nazev LIMIT 10 // is a little faster with glibc SELECT * FROM obce ORDER BY nazev // is faster with libicu You can download the dataset https://pgsql.cz/files/obce.sql It is a table of municipalities in the Czech Republic (real names) - about 6000 rows. I use Fedora 37 - so libicu 71.1, glibc 2.36 Regards Pavel > > > select row_number() over (order by nazev collate "cs-x-icu"), nazev > > from obce > > except select row_number() over (order by nazev collate "cs_CZ"), > > nazev from obce; > > > > returns a not empty set. So minimally for Czech collate, an index > > rebuild is necessary > > Yes, that's true of any locale change, provider change, or even > provider version change. > > > -- > Jeff Davis > PostgreSQL Contributor Team - AWS > > >
Re: Reducing connection overhead in pg_upgrade compat check phase
On Fri, Feb 17, 2023 at 10:44:49PM +0100, Daniel Gustafsson wrote: > When adding a check to pg_upgrade a while back I noticed in a profile that the > cluster compatibility check phase spend a lot of time in connectToServer. > Some > of this can be attributed to data type checks which each run serially in turn > connecting to each database to run the check, and this seemed like a place > where we can do better. > > The attached patch moves the checks from individual functions, which each > loops > over all databases, into a struct which is consumed by a single umbrella check > where all data type queries are executed against a database using the same > connection. This way we can amortize the connectToServer overhead across more > accesses to the database. This change consolidates all the data type checks, so instead of 7 separate loops through all the databases, there is just one. However, I wonder if we are leaving too much on the table, as there are a number of other functions that also loop over all the databases: * get_loadable_libraries * get_db_and_rel_infos * report_extension_updates * old_9_6_invalidate_hash_indexes * check_for_isn_and_int8_passing_mismatch * check_for_user_defined_postfix_ops * check_for_incompatible_polymorphics * check_for_tables_with_oids * check_for_user_defined_encoding_conversions I suspect consolidating get_loadable_libraries, get_db_and_rel_infos, and report_extension_updates would be prohibitively complicated and not worth the effort. old_9_6_invalidate_hash_indexes is only needed for unsupported versions, so that might not be worth consolidating. check_for_isn_and_int8_passing_mismatch only loops through all databases when float8_pass_by_value in the control data differs, so that might not be worth it, either. The last 4 are for supported versions and, from a very quick glance, seem possible to consolidate. That would bring us to a total of 11 separate loops that we could consolidate into one. However, the data type checks seem to follow a nice pattern, so perhaps this is easier said than done. IIUC with the patch, pg_upgrade will immediately fail as soon as a single check in a database fails. I believe this differs from the current behavior where all matches for a given check in the cluster are logged before failing. I wonder if it'd be better to perform all of the data type checks in all databases before failing so that all of the violations are reported. Else, users would have to run pg_upgrade, fix a violation, run pg_upgrade again, fix another one, etc. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Fri, Feb 17, 2023 at 12:14 PM Hayato Kuroda (Fujitsu) wrote: > > Thank you for replying! This direction seems OK, so I started to revise the > patch. > PSA new version. > Few comments: = 1. + + The minimum delay for publisher sends data, in milliseconds + + It would probably be better to write it as "The minimum delay, in milliseconds, by the publisher to send changes" 2. The subminsenddelay is placed inconsistently in the patch. In the docs (catalogs.sgml), system_views.sql, and in some places in the code, it is after subskiplsn, but in the catalog table and corresponding structure, it is placed after subowner. It should be consistently placed after the subscription owner. 3. + + WalSenderSendDelay + Waiting for sending changes to subscriber in WAL sender + process. How about writing it as follows: "Waiting while sending changes for time-delayed logical replication in the WAL sender process."? 4. + + Any delay becomes effective only after all initial table + synchronization has finished and occurs before each transaction + starts to get applied on the subscriber. The delay does not take into + account the overhead of time spent in transferring the transaction, + which means that the arrival time at the subscriber may be delayed + more than the given time. + This needs to change based on a new approach. It should be something like: "The delay is effective only when the publisher decides to send a particular transaction downstream." 5. + * allowed. This is because in parallel streaming mode, we start applying + * the transaction stream as soon as the first change arrives without + * knowing the transaction's prepare/commit time. Always waiting for the + * full 'min_send_delay' period might include unnecessary delay. + * + * The other possibility was to apply the delay at the end of the parallel + * apply transaction but that would cause issues related to resource bloat + * and locks being held for a long time. + */ This part of the comments seems to imply more of a subscriber-side delay approach. I think we should try to adjust these as per the changed approach. 6. @@ -666,6 +666,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, buf->origptr, buf->endptr); } + /* Delay given time if the context has 'delay' callback */ + if (ctx->delay) + ctx->delay(ctx, commit_time); + I think we should invoke delay functionality only when ctx->min_send_delay > 0. Otherwise, there will be some unnecessary overhead. We can change the comment along the lines of: "Delay sending the changes if required. For streaming transactions, this means a delay in sending the last stream but that is okay because on the downstream the changes will be applied only after receiving the last stream." 7. For 2PC transactions, I think we should add the delay in DecodePrerpare. Because after receiving the PREPARE, the downstream will apply the xact. In this case, we shouldn't add a delay for the commit_prepared. 8. +# +# If the subscription sets min_send_delay parameter, the logical replication +# worker will delay the transaction apply for min_send_delay milliseconds. I think here also comments should be updated as per the changed approach for applying the delay on the publisher side. -- With Regards, Amit Kapila.
Re: pg_upgrade and logical replication
On Sat, Feb 18, 2023 at 09:31:30AM +0530, Amit Kapila wrote: > On Fri, Feb 17, 2023 at 9:05 PM Julien Rouhaud wrote: > > > > I'm concerned about people not coming from physical replication. If you > > just > > had some "normal" logical replication, you can't assume that you already > > have > > all the data from the upstream subscription. If it was modified and a non > > empty table is added, you might need to copy the data of part of the tables > > and > > keep replicating for the rest. It's hard to be sure from a user point of > > view, > > and even if you knew you have no way to express it. > > > > Can't the user create a separate publication for such newly added > tables and a corresponding new subscription on the downstream node? Yes that seems like a safe way to go, but it relies on users being very careful if they don't want to get corrupted logical standby, and I think it's impossible to run any check to make sure that the subscription is adequate? > Now, I think it would be a bit tricky if the user already has a > publication defined with FOR ALL TABLES. In that case, we probably > need some way to specify FOR ALL TABLES EXCEPT (list of tables) which > we currently don't have. Yes, and note that I rely on FOR ALL TABLES for my original physical to logical use case. > > > > Indeed, but it's barely saying "It is then up to the user to reactivate the > > subscriptions in a suitable way" and "It might also be appropriate to > > truncate > > the target tables before initiating a new full table copy". As I > > mentioned, I > > don't think there's a suitable way to reactivate the subscription, at least > > if > > you don't want to miss some records, so truncating all target tables is the > > only fully safe way to proceed. It seems quite silly to have to do so just > > because pg_upgrade doesn't retain the list of relation per subscription. > > > > I also don't know if there is any other safe way for newly added > tables apart from the above suggestion to create separate publications > but that can work only in specific cases. I might be missing something, but what could go wrong if pg_upgrade could emit a bunch of commands like: ALTER SUBSCRIPTION subname ADD RELATION relid STATE 'x' LSN 'X/Y'; pg_upgrade already preserves the relation's oid, so we could restore the exact original state and then enabling the subscription would just work? We could restrict this form to --binary only so we don't provide a way for users to mess the data.