Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund wrote:
>
> Hi,
>
> On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a
> > hard time imagining how it would come up, even for something like the
> > conveyor-belt concept.
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.

I took a brief look at that and agree we should sometime make it work there
as well.

v26 tidstore_add_tids() appears to assume that it's only called once per
blocknumber. While the order of offsets doesn't matter there for a single
block, calling it again with the same block would wipe out the earlier
offsets, IIUC.

To do an actual "add tid" where the order doesn't matter, it seems we would
need to (acquire lock if needed), read the current bitmap and OR in the new
bit if it exists, then write it back out. That sounds slow, so it might still
be good for vacuum to call a function that passes a block and an array of
offsets that are assumed ordered (as in v28), but with a more accurate name,
like tidstore_set_block_offsets().

--
John Naylor
EDB: http://www.enterprisedb.com
Re: recovery modules
On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: > On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote: >> On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: >>> Not related the things changed here, but this should never have been pushed >>> down into individual archive modules. There's absolutely no way that we're >>> going to keep this up2date and working correctly in random archive >>> modules. And it would break if archive modules are ever called outside of >>> pgarch.c. Hmm, yes. That's a bad idea to copy the error handling stack. The maintenance of this code could quickly go wrong. All that had better be put into their own threads, IMO, to bring more visibility on these subjects. > I'm quite baffled by: > /* Close any files left open by copy_file() or compare_files() > */ > AtEOSubXact_Files(false, InvalidSubTransactionId, > InvalidSubTransactionId); > > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() > completely outside the context of a transaction environment. And it only does > the thing you want because you pass parameters that aren't actually valid in > the normal use in AtEOSubXact_Files(). I really don't understand how that's > supposed to be ok. As does this part, probably with a backpatch.. Saying that, I have spent more time on the revamped version of the archive modules and it was already doing a lot, so I have applied it as it covered all the points discussed.. -- Michael signature.asc Description: PGP signature
Re: Move defaults toward ICU in 16?
On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:
> I am saying that pg_upgrade should be able to deal with the
> difference. The details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update pg_database to
set the new databases' collation fields equal to that of the old cluster?

I'll submit it as a separate patch because it would be independently useful.

Regards,
	Jeff Davis
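Roughly speaking, the catalog change in question would amount to something
like the following on the new cluster (a sketch only, with placeholder
values; pg_upgrade would carry over whatever the old cluster's databases
actually used rather than hard-coding anything):

-- sketch with placeholder values: make the new cluster's databases carry the
-- old cluster's default collation settings instead of the new initdb defaults
UPDATE pg_database
   SET datlocprovider = 'c',            -- e.g. the old cluster used libc
       datcollate     = 'en_US.UTF-8',
       datctype       = 'en_US.UTF-8',
       daticulocale   = NULL
 WHERE datname IN ('template0', 'template1', 'postgres');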
Re: Normalization of utility queries in pg_stat_statements
Hi, On 2/17/23 3:35 AM, Michael Paquier wrote: On Thu, Feb 16, 2023 at 10:55:32AM +0100, Drouvot, Bertrand wrote: In the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave with default values for those (currently we are setting both of them as non default). Then, with the default values in place, if we feel that some tests are missing we could add them in > utility.sql or planning.sql accordingly. I am not sure about this part, TBH, so I have left these as they are. Anyway, while having a second look at that, I have noticed that it is possible to extract as an independent piece all the tests related to level tracking. Things are worse than I thought initially, actually, because we had test scenarios mixing planning and level tracking, but the tests don't care about measuring plans at all, see around FETCH FORWARD, meaning that queries on the table pg_stat_statements have just been copy-pasted around for the last few years. There were more tests that used "test" for a table name ;) I have been pondering about this part, and the tracking matters for DO blocks and PL functions, so I have moved all these cases into a new, separate file. There is a bit more that can be done, for WAL tracking and roles near the end of pg_stat_statements.sql, but I have left that out for now. I have checked the output of the tests before and after the refactoring for quite a bit of time, and the outputs match so there is no loss of coverage. 0001 looks quite committable at this stage, and that's independent on the rest. At the end this patch creates four new test files that are extended in the next patches: utility, planning, track and cleanup. Thanks! LGTM. 0002: Produces: v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace. CREATE VIEW view_stats_1 AS v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace. CREATE VIEW view_stats_1 AS warning: 2 lines add whitespace errors. Thanks, fixed. Thanks! +-- SET TRANSACTION ISOLATION +BEGIN; +SET TRANSACTION ISOLATION LEVEL READ COMMITTED; +SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; +SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; What about adding things like "SET SESSION CHARACTERISTICS AS TRANSACTION..." too? That's a good idea. It is again one of these fancy cases, better to keep a track of them in the long-term.. Right. 002 LGTM. 0003 and 0004: Thanks for keeping 0003 that's useful to see the impact of A_Const normalization. Looking at the diff they produce, I also do think that 0004 is what could be done for PG16. I am wondering if others have an opinion to share about that, but, yes, 0004 seems enough to begin with. We could always study more normalization areas in future releases, taking it slowly. Agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [Proposal] Add foreign-server health checks infrastructure
On 2023-02-09 23:39, Hayato Kuroda (Fujitsu) wrote: Dear Katsuragi-san, Thank you for reviewing! PSA new version patches. Thank you for updating the patch! These are my comments, please check. 0001: Extending pqSocketPoll seems to be a better way because we can avoid having multiple similar functions. I also would like to hear horiguchi-san's opinion whether this matches his expectation. Improvements of pqSocketPoll/pqSocketCheck is discussed in this thread[1]. I'm concerned with the discussion. As for the function's name, what do you think about keeping current name (pqSocketCheck)? pqSocketIsReadable... describes the functionality very well though. pqConnCheck seems to be a family of pqReadReady or pqWriteRedy, so how about placing pqConnCheck below them? + * Moreover, when neither forRead nor forWrite is requested and timeout is + * disabled, try to check the health of socket. Isn't it better to put the comment on how the arguments are interpreted before the description of return value? +#if defined(POLLRDHUP) + input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL; ... + input_fd.events |= POLLERR; To my understanding, POLLHUP, POLLNVAL and POLLERR are ignored in event. Are they necessary? 0002: As for the return value of postgres_fdw_verify_connection_states, what do you think about returning NULL when connection-checking is not performed? I think there are two cases 1) ConnectionHash is not initialized or 2) connection is not found for specified server name, That is, no entry passes the first if statement below (case 2)). ``` if (all || entry->serverid == serverid) { if (PQconnCheck(entry->conn)) { ``` 0004: I'm wondering if we should add kqueue support in this version, because adding kqueue support introduces additional things to be considered. What do you think about focusing on the main functionality using poll in this patch and going for kqueue support after this patch? [1]: https://www.postgresql.org/message-id/20230209.115009.2229702014236187289.horikyota.ntt%40gmail.com regards, -- Katsuragi Yuta Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Doc: Improve note about copying into postgres_fdw foreign tables in batch
Hi, Here is a small patch to improve the note, which was added by commit 97da48246 ("Allow batch insertion during COPY into a foreign table."), by adding an explanation about how the actual number of rows postgres_fdw inserts at once is determined in the COPY case, including a limitation that does not apply to the INSERT case. I will add this to the next CF. Best regards, Etsuro Fujita postgres-fdw-batch-insert-doc.patch Description: Binary data
Re: Make set_ps_display faster and easier to use
Thank you for having a look at this. On Fri, 17 Feb 2023 at 14:01, Andres Freund wrote: > > +set_ps_display_suffix(const char *suffix) > > +{ > > + size_t len; > > Think this will give you an unused-variable warning in the PS_USE_NONE case. Fixed > > +#ifndef PS_USE_NONE > > + /* update_process_title=off disables updates */ > > + if (!update_process_title) > > + return; > > + > > + /* no ps display for stand-alone backend */ > > + if (!IsUnderPostmaster) > > + return; > > + > > +#ifdef PS_USE_CLOBBER_ARGV > > + /* If ps_buffer is a pointer, it might still be null */ > > + if (!ps_buffer) > > + return; > > +#endif > > This bit is now repeated three times. How about putting it into a helper? Good idea. Done. > > +set_ps_display_internal(void) > > Very very minor nit: Perhaps this should be update_ps_display() or > flush_ps_display() instead? I called the precheck helper update_ps_display_precheck(), so went with flush_ps_display() for updating the display so they both didn't start with "update". Updated patch attached. David diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c index 80d681b71c..889e20b5dd 100644 --- a/src/backend/replication/syncrep.c +++ b/src/backend/replication/syncrep.c @@ -148,8 +148,6 @@ static bool SyncRepQueueIsOrderedByLSN(int mode); void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) { - char *new_status = NULL; - const char *old_status; int mode; /* @@ -216,15 +214,10 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) /* Alter ps display to show waiting for sync rep. */ if (update_process_title) { - int len; - - old_status = get_ps_display(&len); - new_status = (char *) palloc(len + 32 + 1); - memcpy(new_status, old_status, len); - sprintf(new_status + len, " waiting for %X/%X", - LSN_FORMAT_ARGS(lsn)); - set_ps_display(new_status); - new_status[len] = '\0'; /* truncate off " waiting ..." 
*/ + charbuffer[32]; + + sprintf(buffer, "waiting for %X/%X", LSN_FORMAT_ARGS(lsn)); + set_ps_display_suffix(buffer); } /* @@ -322,12 +315,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) MyProc->syncRepState = SYNC_REP_NOT_WAITING; MyProc->waitLSN = 0; - if (new_status) - { - /* Reset ps display */ - set_ps_display(new_status); - pfree(new_status); - } + /* reset ps display to remove the suffix */ + if (update_process_title) + set_ps_display_remove_suffix(); } /* diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 2d6dbc6561..98904a7c05 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4302,8 +4302,8 @@ void LockBufferForCleanup(Buffer buffer) { BufferDesc *bufHdr; - char *new_status = NULL; TimestampTz waitStart = 0; + boolwaiting = false; boollogged_recovery_conflict = false; Assert(BufferIsPinned(buffer)); @@ -4350,11 +4350,11 @@ LockBufferForCleanup(Buffer buffer) waitStart, GetCurrentTimestamp(), NULL, false); - /* Report change to non-waiting status */ - if (new_status) + if (waiting) { - set_ps_display(new_status); - pfree(new_status); + /* reset ps display to remove the suffix if we added one */ + set_ps_display_remove_suffix(); + waiting = false; } return; } @@ -4374,18 +4374,11 @@ LockBufferForCleanup(Buffer buffer) /* Wait to be signaled by UnpinBuffer() */ if (InHotStandby) { - /* Report change to waiting status */ - if (update_process_title && new_status == NULL) + if (!waiting) { - const char *old_status; - int len; - - old_status = get_ps_display(&len); - new_status = (char *) palloc(len + 8 + 1); - memcpy(new_status, old_status, len); - strcpy(new_status + len, " waiting"); - set_ps_display(new_status); -
Re: [PATCH] Add pretty-printed XML output option
On 16.02.23 05:38, Nikolay Samokhvalov wrote:
> On Thu, Feb 9, 2023 at 2:31 AM Peter Eisentraut wrote:
> > I suggest we call it "xmlformat", which is an established term for this.
>
> Some very-very old, rusted memory told me that there was something in the
> standard – and indeed, it seems it described an optional Feature X069,
> "XMLSerialize: INDENT" for XMLSERIALIZE. So probably pretty-printing should
> go there, to XMLSERIALIZE, to follow the standard?
>
> Oracle also has an option for it in XMLSERIALIZE, although in a slightly
> different form, with the ability to specify the number of spaces for
> indentation:
> https://docs.oracle.com/database/121/SQLRF/functions268.htm#SQLRF06231

After your comment I'm studying the possibility of extending the existing
xmlserialize function to add the indentation feature. If so, what do you
think it should look like? An extra parameter? e.g.

SELECT xmlserialize(DOCUMENT '<foo>42</foo>'::XML AS text, true);

... or more like how Oracle does it:

SELECT XMLSERIALIZE(DOCUMENT xmltype('<foo>42</foo>') AS BLOB INDENT) FROM dual;

Thanks!

Best, Jim
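For reference, if the SQL-standard Feature X069 route is taken, the option
would presumably hang off XMLSERIALIZE itself as an optional clause rather
than an extra function argument, roughly like this (a sketch of the assumed
standard syntax only, not an existing PostgreSQL feature):

-- sketch, assuming the standard's [NO] INDENT clause were adopted
SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>'::xml AS text INDENT);
SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>'::xml AS text NO INDENT);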
Re: Support logical replication of DDLs
On Fri, Feb 17, 2023 at 1:13 AM Jonathan S. Katz wrote: > > On 2/16/23 2:38 PM, Alvaro Herrera wrote: > > On 2023-Feb-16, Jonathan S. Katz wrote: > > > >> On 2/16/23 12:53 PM, Alvaro Herrera wrote: > > > >>> I don't think this is the fault of logical replication. Consider that > >>> for the backend server, the function source code is just an opaque > >>> string that is given to the plpgsql engine to interpret. So there's no > >>> way for the logical DDL replication engine to turn this into runnable > >>> code if the table name is not qualified. > >> > >> Sure, that's fair. That said, the example above would fall under a "typical > >> use case", i.e. I'm replicating functions that call tables without schema > >> qualification. This is pretty common, and as logical replication becomes > >> used for more types of workloads (e.g. high availability), we'll definitely > >> see this. > > > > Hmm, I think you're saying that replay should turn check_function_bodies > > off, and I think I agree with that. > > Yes, exactly. +1 > But will that be sufficient? I guess such functions can give errors at a later stage when invoked at DML or another DDL time. Consider the following example: Pub: CREATE PUBLICATION pub FOR ALL TABLES with (ddl = 'all'); Sub: (Set check_function_bodies = off in postgresql.conf) CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub; Pub: CREATE FUNCTION t1(a int) RETURNS int AS $$ select a+1; $$ LANGUAGE sql; CREATE FUNCTION t(a int) RETURNS int AS $$ select t1(a); $$ LANGUAGE sql; CREATE TABLE tbl1 (a int primary key, b text); create index idx on tbl1(t(a)); insert into tbl1 values (1,1); -- This insert on publisher causes an error on the subscriber. Check subscriber Logs (ERROR: function t1(integer) does not exist at character 9.) This happens because of the function used in the index expression. Now, this is not the only thing, the replication can even fail during DDL replication when the function like above is IMMUTABLE and used as follows: ALTER TABLE tbl ADD COLUMN d int DEFAULT t(1); Normally, it is recommended that users can fix such errors by schema-qualifying affected names. See commits 11da97024a and 582edc369c. -- With Regards, Amit Kapila.
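For the example above, the usual fix is to schema-qualify the reference
inside the function body, e.g. (same object names as in the script above):

CREATE FUNCTION t(a int) RETURNS int AS $$ select public.t1(a); $$ LANGUAGE sql;

With that, the index expression and the DEFAULT expression can be evaluated
on the subscriber regardless of its search_path.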
Re: Kerberos delegation support in libpq and postgres_fdw
Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: > On Mon, Sep 19, 2022 at 02:05:39PM -0700, Jacob Champion wrote: > > It's not prevented, because a password is being used. In my tests I'm > > connecting as an unprivileged user. > > > > You're claiming that the middlebox shouldn't be doing this. If this new > > default behavior were the historical behavior, then I would have agreed. > > But the cat's already out of the bag on that, right? It's safe today. > > And if it's not safe today for some other reason, please share why, and > > maybe I can work on a patch to try to prevent people from doing it. > > Please note that this has been marked as returned with feedback in the > current CF, as this has remained unanswered for a bit more than three > weeks. There's some ongoing discussion about how to handle outbound connections from the server ending up picking up credentials from the server's environment (that really shouldn't be allowed unless specifically asked for..), that's ultimately an independent change from what this patch is doing. Here's an updated version which does address Robert's concerns around having this disabled by default and having options on both the server and client side saying if it is to be enabled or not. Also added to pg_stat_gssapi a field that indicates if credentials were proxied or not and made some other improvements and added additional regression tests to test out various combinations. Thanks, Stephen diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 78a8bcee6e..713ef7c248 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2589,7 +2589,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn) { if (!superuser()) { - if (!PQconnectionUsedPassword(conn)) + if (!(PQconnectionUsedPassword(conn) || PQconnectionUsedGSSAPI(conn))) { libpqsrv_disconnect(conn); if (rconn) @@ -2597,8 +2597,8 @@ dblink_security_check(PGconn *conn, remoteConn *rconn) ereport(ERROR, (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), - errmsg("password is required"), - errdetail("Non-superuser cannot connect if the server does not request a password."), + errmsg("password or GSSAPI is required"), + errdetail("Non-superuser cannot connect if the server does not request a password or use GSSAPI."), errhint("Target server's authentication method must be changed."))); } } diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index d5fc61446a..77656ccf80 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -171,7 +171,8 @@ ALTER SERVER testserver1 OPTIONS ( sslcrl 'value', --requirepeer 'value', krbsrvname 'value', - gsslib 'value' + gsslib 'value', + gssdeleg 'value' --replication 'value' ); -- Error, invalid list syntax diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c index 984e4d168a..bb6f309907 100644 --- a/contrib/postgres_fdw/option.c +++ b/contrib/postgres_fdw/option.c @@ -286,6 +286,9 @@ InitPgFdwOptions(void) {"sslcert", UserMappingRelationId, true}, {"sslkey", UserMappingRelationId, true}, + /* gssencmode is also libpq option, same to above. 
*/ + {"gssencmode", UserMappingRelationId, true}, + {NULL, InvalidOid, false} }; diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 1e50be137b..1a43bdf55b 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -185,7 +185,8 @@ ALTER SERVER testserver1 OPTIONS ( sslcrl 'value', --requirepeer 'value', krbsrvname 'value', - gsslib 'value' + gsslib 'value', + gssdeleg 'value' --replication 'value' ); diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index ecd9aa73ef..a931996968 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1170,6 +1170,23 @@ include_dir 'conf.d' + + gss_accept_deleg (boolean) + + gss_accept_deleg configuration parameter + + + + +Sets whether GSSAPI delegation should be accepted from the client. +The default is off meaning credentials from the client will +NOT be accepted. Changing this to on will make the server +accept credentials delegated to it from the client. This parameter can only be +set in the postgresql.conf file or on the server command line. + + + + db_user_namespace (boolean) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 0e7ae70c70..9f3e12b2a7 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1879,6 +1879,18 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname + + gssdeleg + + +Forward (delegate) GSS credentials to the server. The default is +disable
Re: Support logical replication of global object commands
On Fri, Feb 17, 2023 at 10:58 AM Zheng Li wrote:
>
> > > > Actually, I intend something for global objects. But the main thing
> > > > that is worrying me about this is that we don't have a clean way to
> > > > untie global object replication from database-specific object
> > > > replication.
> > >
> > > I think ultimately we need a clean and efficient way to publish (and
> > > subscribe to) any changes in all databases, preferably in one logical
> > > replication slot.
> > >
> >
> > Agreed. I was thinking currently for logical replication both
> > walsender and slot are database-specific. So we need a way to
> > distinguish the WAL for global objects and then avoid filtering based
> > on the slot's database during decoding.
>
> But which WALSender should handle the WAL for global objects if we
> don't filter by database? Is there any specific problem you see for
> decoding global objects commands in a database specific WALSender?
>

I haven't verified but I was concerned about the below check:

logicalddl_decode
{
...
+
+	if (message->dbId != ctx->slot->data.database ||

--
With Regards,
Amit Kapila.
The output sql generated by pg_dump for a create function refers to a modified table name
Hi,

The output SQL generated by pg_dump for the below function refers to a
modified table name:

create table t1 (c1 int);
create table t2 (c1 int);

CREATE OR REPLACE FUNCTION test_fun(c1 int) RETURNS void LANGUAGE SQL
BEGIN ATOMIC
  WITH delete_t1 AS (
    DELETE FROM t1 WHERE c1 = $1
  )
  INSERT INTO t1 (c1) SELECT $1 FROM t2;
END;

The SQL output created by pg_dump refers to t1_1 where it should have been t1:

CREATE FUNCTION public.test_fun(c1 integer) RETURNS void
    LANGUAGE sql
    BEGIN ATOMIC
     WITH delete_t1 AS (
              DELETE FROM public.t1
               WHERE (t1_1.c1 = test_fun.c1)
             )
      INSERT INTO public.t1 (c1)  SELECT test_fun.c1
                FROM public.t2;
    END;

pg_get_function_sqlbody also returns a similar result:

select proname, pg_get_function_sqlbody(oid) from pg_proc where proname = 'test_fun';
 proname  |        pg_get_function_sqlbody
----------+---------------------------------------------
 test_fun | BEGIN ATOMIC                                +
          |  WITH delete_t1 AS (                        +
          |           DELETE FROM t1                    +
          |            WHERE (t1_1.c1 = test_fun.c1)    +
          |          )                                  +
          |   INSERT INTO t1 (c1)  SELECT test_fun.c1   +
          |     FROM t2;                                +
          | END
(1 row)

I felt the problem here is with the set_rtable_names function, which changes
the relation name t1 to t1_1 while parsing the statement:

/*
 * If the selected name isn't unique, append digits to make it so, and
 * make a new hash entry for it once we've got a unique name.  For a
 * very long input name, we might have to truncate to stay within
 * NAMEDATALEN.
 */

During query generation we set the table names before generating each
statement; in our case the table t1 would already have been added to the hash
table while generating the first INSERT statement. The next time it tries to
set the relation names for the next statement, i.e. the DELETE statement, an
entry with the same name already exists, so it changes the name to t1_1 by
appending a digit to keep the hash entry unique.

Regards,
Vignesh
Re: pg_upgrade and logical replication
On Fri, Feb 17, 2023 at 1:24 PM Julien Rouhaud wrote: > > I was working on testing a major upgrade scenario using a mix of physical and > logical replication when I faced some unexpected problem leading to missing > rows. Note that my motivation is to rely on physical replication / physical > backup to avoid recreating a node from scratch using logical replication, as > the initial sync with logical replication is much more costly and impacting > compared to pg_basebackup / restoring a physical backup, but the same problem > exist if you just pg_upgrade a node that has subscriptions. > > The problem is that pg_upgrade creates the subscriptions on the newly upgraded > node using "WITH (connect = false)", which seems expected as you obviously > don't want to try to connect to the publisher at that point. But then once > the > newly upgraded node is restarted and ready to replace the previous one, unless > I'm missing something there's absolutely no possibility to use the created > subscriptions without losing some data from the publisher. > > The reason is that the subscription doesn't have a local list of relation to > process until you refresh the subscription, but you can't refresh the > subscription without enabling it (and you can't enable it in a transaction), > which means that you have to let the logical worker start, consume and ignore > all changes that happened on the publisher side until the refresh happens. > > An easy workaround that I tried is to allow something like > > ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) > > so that the refresh internally happens before the apply worker is started and > you just keep consuming the delta, which works on naive scenario. > > One concern I have with this approach is that the default values for both > "refresh" and "copy_data" for all other subcommands is "true, but we would > probably need a different default value in that exact scenario (as we know we > already have the data). I think that it would otherwise be safe in my very > specific scenario, assuming that you created the slot beforehand and moved the > slot's LSN at the promotion point, as even if you add non-empty tables to the > publication you will only need the delta whether those were initially empty or > not given your initial physical replica state. > This point is not very clear. Why would one just need delta even for new tables? > Any other scenario would make > this new option dangerous, if not entirely useless, but not more than any of > the current commands that lead to refreshing a subscription and have the same > options I guess. > > All in all, currently the only way to somewhat safely resume logical > replication after a pg_upgrade is to drop all the subscriptions that were > transferred during pg_upgrade on all databases and recreate them (using the > existing slots on the publisher side obviously), allowing the initial > connection. But this approach only works in the exact scenario I mentioned > (physical to logical replication, or at least a case where *all* the tables > where logically replicated prior to the pg_ugprade), otherwise you have to > recreate the follower node from scratch using logical repication. > I think if you dropped and recreated the subscriptions by retaining old slots, the replication should resume from where it left off before the upgrade. Which scenario are you concerned about? > Is that indeed the current behavior, or did I miss something? 
> > Is this "resume logical replication on pg_upgraded node" something we want to > support better? I was thinking that we could add a new pg_dump mode (maybe > only usable during pg_upgrade) that also restores the pg_subscription_rel > content in each subscription or something like that. If not, should > pg_upgrade > keep preserving the subscriptions as it doesn't seem safe to use them, or at > least document the hazards (I didn't find anything about it in the > documentation)? > > There is a mention of this in pg_dump docs. See [1] (When dumping logical replication subscriptions ...) [1] - https://www.postgresql.org/docs/devel/app-pgdump.html -- With Regards, Amit Kapila.
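Concretely, the drop-and-recreate route discussed above looks roughly like
this on the upgraded node (names are placeholders; the point is to keep the
existing remote slot and to skip the initial copy):

-- the subscription comes out of pg_upgrade disabled (it was restored with
-- connect = false), so slot_name can be detached right away
ALTER SUBSCRIPTION sub SET (slot_name = NONE);   -- detach, so DROP keeps the remote slot
DROP SUBSCRIPTION sub;
CREATE SUBSCRIPTION sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub
    WITH (slot_name = 'sub', create_slot = false, copy_data = false);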
Re: Refactor calculations to use instr_time
Hi, On 2/16/23 19:13, Andres Freund wrote: +#define WALSTAT_ACC(fld, var_to_add) \ + (stats_shmem->stats.fld += var_to_add.fld) +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \ + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld)) + WALSTAT_ACC(wal_records, diff); + WALSTAT_ACC(wal_fpi, diff); + WALSTAT_ACC(wal_bytes, diff); + WALSTAT_ACC(wal_buffers_full, PendingWalStats); + WALSTAT_ACC(wal_write, PendingWalStats); + WALSTAT_ACC(wal_sync, PendingWalStats); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time); #undef WALSTAT_ACC - LWLockRelease(&stats_shmem->lock); WALSTAT_ACC is undefined, but WALLSTAT_ACC_INSTR_TIME_TYPE isn't. I'd not remove the newline before LWLockRelease(). /* diff --git a/src/include/pgstat.h b/src/include/pgstat.h index db9675884f3..295c5eabf38 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats TimestampTz stat_reset_timestamp; } PgStat_WalStats; +/* Created for accumulating wal_write_time and wal_sync_time as a instr_time Minor code-formatting point: In postgres we don't put code in the same line as a multi-line comment starting with the /*. So either /* single line comment */ or /* * multi line * comment */ Thanks for the review. I updated the patch. Regards, Nazir Bilal Yavuz Microsoft From e3723aca8b79b07190834b0cfd1440dbbf706862 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 2 Feb 2023 15:06:48 +0300 Subject: [PATCH v2] Refactor instr_time calculations --- src/backend/access/transam/xlog.c | 6 ++ src/backend/storage/file/buffile.c | 6 ++ src/backend/utils/activity/pgstat_wal.c | 27 + src/include/pgstat.h| 18 - 4 files changed, 35 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d1..3c35dc1ca23 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2206,8 +2206,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start); } PendingWalStats.wal_write++; @@ -8201,8 +8200,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_sync_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start); } PendingWalStats.wal_sync++; diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 0a51624df3b..e55f86b675e 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -469,8 +469,7 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start); } /* we choose not to advance curOffset here */ @@ -544,8 +543,7 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start); } file->curOffset += bytestowrite; diff --git 
a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c index e8598b2f4e0..ab2869a0e24 100644 --- a/src/backend/utils/activity/pgstat_wal.c +++ b/src/backend/utils/activity/pgstat_wal.c @@ -21,7 +21,7 @@ #include "executor/instrument.h" -PgStat_WalStats PendingWalStats = {0}; +PgStat_PendingWalUsage PendingWalStats = {0}; /* * WAL usage counters saved from pgWALUsage at the previous call to @@ -89,24 +89,25 @@ pgstat_flush_wal(bool nowait) * previous counters from the current ones. */ WalUsageAccumDiff(&diff, &pgWalUsage, &prevWalUsage); - PendingWalStats.wal_records = diff.wal_records; - PendingWalStats.wal_fpi = diff.wal_fpi; - PendingWalStats.wal_bytes = diff.wal_bytes; if (!nowait) LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE); else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE)) return true; -#define WALSTAT_ACC(fld) stats_shmem->stats.fld += PendingWalStats.fld - WALSTAT_ACC(wal_records); - WALSTAT_ACC(wal_fpi); - WALSTAT_ACC(wal_bytes); - WALSTAT_ACC(wal_buffers_full); - WALSTAT_ACC(wal_write); - WALSTAT_ACC(wal_sync); - WALSTAT_ACC(wal_write_time); - WALSTAT_
Re: psql: Add role's membership options to the \du+ command
Hello,

On the one hand, it would be nice to see the membership options with the psql
command. After playing with cf5eb37c and e5b8a4c0 I think something must be
done with the \du command.

postgres@demo(16.0)=# CREATE ROLE admin LOGIN CREATEROLE;
CREATE ROLE
postgres@demo(16.0)=# \c - admin
You are now connected to database "demo" as user "admin".
admin@demo(16.0)=> SET createrole_self_grant = 'SET, INHERIT';
SET
admin@demo(16.0)=> CREATE ROLE bob LOGIN;
CREATE ROLE
admin@demo(16.0)=> \du
                                     List of roles
 Role name |                         Attributes                          | Member of
-----------+-------------------------------------------------------------+-----------
 admin     | Create role                                                 | {bob,bob}
 bob       |                                                             | {}
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

We see two bob roles in the 'Member of' column. Strange? But this is correct.

admin@demo(16.0)=> select roleid::regrole, member::regrole, * from pg_auth_members where roleid = 'bob'::regrole;
 roleid | member |  oid  | roleid | member | grantor | admin_option | inherit_option | set_option
--------+--------+-------+--------+--------+---------+--------------+----------------+------------
 bob    | admin  | 16713 |  16712 |  16711 |      10 | t            | f              | f
 bob    | admin  | 16714 |  16712 |  16711 |   16711 | f            | t              | t
(2 rows)

The first 'GRANT bob TO admin' was issued by the superuser (grantor=10)
immediately after role bob was created. The second was issued by the admin
role and sets the membership options SET and INHERIT.

If we are not ready to display membership options with \du+, maybe we should
at least group the records in the 'Member of' column for the \du command?

- Pavel Luzanov
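For what it's worth, the information \du+ would need is already all in the
catalog; a query along these lines (just an illustration, not the proposed
psql query) shows one row per grant together with its options:

-- rough sketch: one line per membership grant, with its options spelled out
SELECT m.member::regrole  AS member,
       m.roleid::regrole  AS member_of,
       m.grantor::regrole AS grantor,
       concat_ws(', ',
                 CASE WHEN m.admin_option   THEN 'ADMIN'   END,
                 CASE WHEN m.inherit_option THEN 'INHERIT' END,
                 CASE WHEN m.set_option     THEN 'SET'     END) AS options
FROM pg_auth_members m
ORDER BY 1, 2;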
Re: Support logical replication of DDLs
On Fri, 17 Feb 2023 at 02:38, Jonathan S. Katz wrote: > > On 2/16/23 2:43 PM, Jonathan S. Katz wrote: > > On 2/16/23 2:38 PM, Alvaro Herrera wrote: > >> On 2023-Feb-16, Jonathan S. Katz wrote: > >> > >>> On 2/16/23 12:53 PM, Alvaro Herrera wrote: > >> > I don't think this is the fault of logical replication. Consider that > for the backend server, the function source code is just an opaque > string that is given to the plpgsql engine to interpret. So there's no > way for the logical DDL replication engine to turn this into runnable > code if the table name is not qualified. > >>> > >>> Sure, that's fair. That said, the example above would fall under a > >>> "typical > >>> use case", i.e. I'm replicating functions that call tables without > >>> schema > >>> qualification. This is pretty common, and as logical replication becomes > >>> used for more types of workloads (e.g. high availability), we'll > >>> definitely > >>> see this. > >> > >> Hmm, I think you're saying that replay should turn check_function_bodies > >> off, and I think I agree with that. > > > > Yes, exactly. +1 > > I drilled into this a bit more using the SQL standard bodies (BEGIN > ATOMIC) to see if there were any other behaviors we needed to account > for. Overall, it worked well but I ran into one issue. > > First, functions with "BEGIN ATOMIC" ignores "check_function_bodies" > which is by design based on how this feature works. We should still turn > "check_function_bodies" to "off" though, per above discussion. > > In the context of DDL replication, "BEGIN ATOMIC" does support > schema-unqualified functions, presumably because it includes the parsed > content? > > I created an updated example[1] where I converted the SQL functions to > use the standard syntax and I returned the table names to be schema > unqualified. 
This seemed to work, but I ran into a weird case with this > function: > > CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, > calendar_date date) > RETURNS void > LANGUAGE SQL > BEGIN ATOMIC > WITH delete_calendar AS ( > DELETE FROM calendar > WHERE > room_id = $1 AND > calendar_date = $2 > ) > INSERT INTO calendar (room_id, status, calendar_date, calendar_range) > SELECT $1, c.status, $2, c.calendar_range > FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; > END; > > This produced an error on the subscriber, with the following message: > > 2023-02-16 20:58:24.096 UTC [26864] ERROR: missing FROM-clause entry > for table "calendar_1" at character 322 > 2023-02-16 20:58:24.096 UTC [26864] CONTEXT: processing remote data for > replication origin "pg_18658" during message type "DDL" in transaction > 980, finished at 0/C099A7D8 > 2023-02-16 20:58:24.096 UTC [26864] STATEMENT: CREATE OR REPLACE > FUNCTION public.calendar_manage ( IN room_id pg_catalog.int4, IN > calendar_date pg_catalog.date ) RETURNS pg_catalog.void LANGUAGE sql > VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 > BEGIN ATOMIC > WITH delete_calendar AS ( > DELETE FROM public.calendar >WHERE ((calendar_1.room_id OPERATOR(pg_catalog.=) > calendar_manage.room_id) AND (calendar_1.calendar_date > OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) > ) > INSERT INTO public.calendar (room_id, status, calendar_date, > calendar_range) SELECT calendar_manage.room_id, > c.status, > calendar_manage.calendar_date, > c.calendar_range > FROM > public.calendar_generate_calendar(calendar_manage.room_id, > pg_catalog.tstzrange((calendar_manage.calendar_date)::timestamp with > time zone, ((calendar_manage.calendar_date OPERATOR(pg_catalog.+) > 1))::timestamp with time zone)) c(status, calendar_range); > END > > This seemed to add an additional, incorrect reference to the origin > table for the "room_id" and "calendar_date" attributes within the CTE of > this function. I don't know if this is directly related to the DDL > replication patch, but reporting it as I triggered the behavior through it. I had analyzed this issue and found that this issue exists with getting query definition, I could reproduce the issue with pg_dump and pg_get_function_sqlbody. I have started a new thread for this at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1MMntjmT_NJGp-Z%3DxbF02qHGAyuSHfYHias3TqQbPF2w%40mail.gmail.com Regards, Vignesh
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
On 2022-Dec-11, Amit Langote wrote: > While staring at the build_simple_rel() bit mentioned above, I > realized that this code fails to set userid correctly in the > inheritance parent rels that are child relations of subquery parent > relations, such as UNION ALL subqueries. In that case, instead of > copying the userid (= 0) of the parent rel, the child should look up > its own RTEPermissionInfo, which should be there, and use the > checkAsUser value from there. I've attached 0002 to fix this hole. I > am not sure whether there's a way to add a test case for this in the > core suite. I gave this a look and I thought it was clearer to have the new condition depend on rel->reloptkind instead parent or no. I tried a few things for a new test case, but I was unable to find anything useful. Maybe an intermediate view, I thought; no dice. Maybe one with a security barrier would do? Anyway, for now I just kept what you added in v2. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 423416f41247d5e4fa2e99de411ffa3b5e09cc8e Mon Sep 17 00:00:00 2001 From: amitlan Date: Wed, 18 Jan 2023 16:49:49 +0900 Subject: [PATCH v3] Correctly set userid of subquery rel's child rels For a RTE_SUBQUERY parent baserel's child relation that itself is a RTE_RELATION rel, build_simple_rel() should explicitly look up the latter's RTEPermissionInfo instead of copying the parent rel's userid, which would be 0 given that it's a subquery rel. --- src/backend/optimizer/util/relnode.c | 18 ++--- src/test/regress/expected/inherit.out | 39 +-- src/test/regress/sql/inherit.sql | 28 +-- 3 files changed, 77 insertions(+), 8 deletions(-) diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index a70a16238a..6c4550b90f 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -233,12 +233,22 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent) rel->serverid = InvalidOid; if (rte->rtekind == RTE_RELATION) { + Assert(parent == NULL || + parent->rtekind == RTE_RELATION || + parent->rtekind == RTE_SUBQUERY); + /* - * Get the userid from the relation's RTEPermissionInfo, though only - * the tables mentioned in query are assigned RTEPermissionInfos. - * Child relations (otherrels) simply use the parent's value. + * For any RELATION rte, we need a userid with which to check + * permission access. Baserels simply use their own RTEPermissionInfo's + * checkAsUser. + * + * For otherrels normally there's no RTEPermissionInfo, so we use the + * parent's, which normally has one. The exceptional case is that the + * parent is a subquery, in which case the otherrel will have its own. */ - if (parent == NULL) + if (rel->reloptkind == RELOPT_BASEREL || + (rel->reloptkind == RELOPT_OTHER_MEMBER_REL && + parent->rtekind == RTE_SUBQUERY)) { RTEPermissionInfo *perminfo; diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out index 2d49e765de..143271cd62 100644 --- a/src/test/regress/expected/inherit.out +++ b/src/test/regress/expected/inherit.out @@ -2512,9 +2512,44 @@ explain (costs off) (6 rows) reset session authorization; -revoke all on permtest_parent from regress_no_child_access; -drop role regress_no_child_access; +-- Check with a view over permtest_parent and a UNION ALL over the view, +-- especially that permtest_parent's permissions are checked with the role +-- owning the view as permtest_parent's RTE's checkAsUser. 
+create role regress_permtest_view_owner; +revoke all on permtest_grandchild from regress_permtest_view_owner; +grant select(a, c) on permtest_parent to regress_permtest_view_owner; +create view permtest_parent_view as + select a, c from permtest_parent; +alter view permtest_parent_view owner to regress_permtest_view_owner; +-- Like above, the 2nd arm of UNION ALL gets a hash join due to lack of access +-- to the expression index's stats +revoke select on permtest_parent_view from regress_no_child_access; +grant select(a,c) on permtest_parent_view to regress_no_child_access; +set session authorization regress_no_child_access; +explain (costs off) + select p1.a, p2.c from + (select * from permtest_parent_view + union all + select * from permtest_parent_view) p1 + inner join permtest_parent p2 on p1.a = p2.a and left(p1.c, 3) ~ 'a1$'; + QUERY PLAN +- + Hash Join + Hash Cond: (p2.a = permtest_parent.a) + -> Seq Scan on permtest_grandchild p2 + -> Hash + -> Append + -> Seq Scan on permtest_grandchild permtest_parent + Filter: ("left"(c, 3) ~ 'a1$'::text) + -> Seq Scan on permtest_grandchild permtest_parent_1 + Filter: ("left"(c, 3) ~ 'a1$'::text)
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
On 2023-Feb-17, Alvaro Herrera wrote:
> I tried a few things for a new test case, but I was unable to find
> anything useful.  Maybe an intermediate view, I thought; no dice.
> Maybe one with a security barrier would do?  Anyway, for now I just kept
> what you added in v2.

Sorry, I failed to keep count of the patch version correctly.  The test case
here is what you sent in v3 [1], and consequently the patch I just attached
should have been labelled v4.

[1] https://postgr.es/m/CA+HiwqF6ricH7HFCkyrK72c=KN-PRkdncxdLmU_mEQx=dra...@mail.gmail.com

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La vida es para el que se aventura"
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 14:40 +0900, Michael Paquier wrote: > Separate question: what's the state of the Windows installers provided > by the community regarding libicu? Is that embedded in the MSI? The EDB installer installs a quite old version of the ICU library for compatibility reasons, as far as I know. Yours, Laurenz Albe
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 5:22 AM, vignesh C wrote: Hi, The output sql generated by pg_dump for the below function refers to a modified table name: create table t1 (c1 int); create table t2 (c1 int); CREATE OR REPLACE FUNCTION test_fun(c1 int) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_t1 AS ( DELETE FROM t1 WHERE c1 = $1 ) INSERT INTO t1 (c1) SELECT $1 FROM t2; END; The below sql output created by pg_dump refers to t1_1 which should have been t1: CREATE FUNCTION public.test_fun(c1 integer) RETURNS void LANGUAGE sql BEGIN ATOMIC WITH delete_t1 AS ( DELETE FROM public.t1 WHERE (t1_1.c1 = test_fun.c1) ) INSERT INTO public.t1 (c1) SELECT test_fun.c1 FROM public.t2; END; pg_get_function_sqlbody also returns similar result: select proname, pg_get_function_sqlbody(oid) from pg_proc where proname = 'test_fun'; proname | pg_get_function_sqlbody --+--- test_fun | BEGIN ATOMIC + | WITH delete_t1 AS ( + | DELETE FROM t1 + |WHERE (t1_1.c1 = test_fun.c1) + | ) + | INSERT INTO t1 (c1) SELECT test_fun.c1+ | FROM t2; + | END (1 row) Thanks for reproducing and demonstrating that this was more generally applicable. For context, this was initially discovered when testing the DDL replication patch[1] under that context. I felt the problem here is with set_rtable_names function which changes the relation name t1 to t1_1 while parsing the statement: /* * If the selected name isn't unique, append digits to make it so, and * make a new hash entry for it once we've got a unique name. For a * very long input name, we might have to truncate to stay within * NAMEDATALEN. */ During the query generation we will set the table names before generating each statement, in our case the table t1 would have been added already to the hash table during the first insert statement generation. Next time it will try to set the relation names again for the next statement, i.e delete statement, if the entry with same name already exists, it will change the name to t1_1 by appending a digit to keep the has entry unique. Good catch. Do you have thoughts on how we can adjust the naming logic to handle cases like this? Jonathan [1] https://www.postgresql.org/message-id/e947fa21-24b2-f922-375a-d4f763ef3e4b%40postgresql.org OpenPGP_signature Description: OpenPGP digital signature
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On 16.02.23 22:29, Andres Freund wrote: What's the story behind 100_bugs.pl? This name clearly is copied from src/test/subscription/t/100_bugs.pl - but I've never understood why that is outside of the normal numbering space. Mainly to avoid awkwardness for backpatching. The number of tests in src/test/subscription/ varies quite a bit across branches.
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi, On 2/16/23 1:26 PM, Drouvot, Bertrand wrote: Hi, On 2/16/23 12:00 PM, Amit Kapila wrote: BTW, feel free to create the second patch (to align the types for variables/arguments) as that would be really helpful after we commit this one. Please find attached a patch proposal to do so. It looks like a Pandora's box as it produces those cascading changes: _hash_vacuum_one_page index_compute_xid_horizon_for_tuples gistprunepage PageIndexMultiDelete gistXLogDelete PageIndexMultiDelete spgRedoVacuumRedirect vacuumRedirectAndPlaceholder spgPageIndexMultiDelete moveLeafs doPickSplit _bt_delitems_vacuum btvacuumpage _bt_delitems_delete _bt_delitems_delete_check hash_xlog_move_page_contents gistvacuumpage gistXLogUpdate gistplacetopage hashbucketcleanup Which makes me: - wonder it is not too intrusive (we could reduce the scope and keep the PageIndexMultiDelete()'s nitems argument as an int for example). - worry if there is no side effects (like the one I'm mentioning as a comment in PageIndexMultiDelete()) even if it passes the CI tests. - wonder if we should not change MaxIndexTuplesPerPage from int to uint16 too (given the fact that the maximum block size is 32 KB. I'm sharing this version but I still need to think about it and I'm curious about your thoughts too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comFrom dfb3d5cae3163ebc9073cc762adec966f17e2f33 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 17 Feb 2023 11:30:21 + Subject: [PATCH v1] Change ndeletable in _hash_vacuum_one_page() from int to uint16 deletable in _hash_vacuum_one_page() is assigned to xl_hash_vacuum_one_page.ntuples which has been changed to uint16 in a previous commit. This commit ensures that the assigned value has the same type. Doing so produces those cascading changes: _hash_vacuum_one_page index_compute_xid_horizon_for_tuples gistprunepage PageIndexMultiDelete gistXLogDelete PageIndexMultiDelete spgRedoVacuumRedirect vacuumRedirectAndPlaceholder spgPageIndexMultiDelete moveLeafs doPickSplit _bt_delitems_vacuum btvacuumpage _bt_delitems_delete _bt_delitems_delete_check hash_xlog_move_page_contents gistvacuumpage gistXLogUpdate gistplacetopage hashbucketcleanup which also fixed others missmatch on the way. Also in some places the types have been changed from OffsetNumber to uint16. Even if at the time of this commit the OffsetNumber is defined as uint16, it's better to match the functions arguments types they are linked to. 
--- src/backend/access/gist/gist.c | 6 +++--- src/backend/access/gist/gistvacuum.c| 2 +- src/backend/access/gist/gistxlog.c | 4 ++-- src/backend/access/hash/hash.c | 2 +- src/backend/access/hash/hash_xlog.c | 8 src/backend/access/hash/hashinsert.c| 2 +- src/backend/access/index/genam.c| 2 +- src/backend/access/nbtree/nbtpage.c | 14 +++--- src/backend/access/nbtree/nbtree.c | 4 ++-- src/backend/access/spgist/spgdoinsert.c | 8 src/backend/access/spgist/spgvacuum.c | 4 ++-- src/backend/access/spgist/spgxlog.c | 2 +- src/backend/storage/page/bufpage.c | 12 src/include/access/genam.h | 2 +- src/include/access/gist_private.h | 4 ++-- src/include/access/nbtree.h | 4 ++-- src/include/access/spgist_private.h | 2 +- src/include/storage/bufpage.h | 2 +- 18 files changed, 44 insertions(+), 40 deletions(-) 14.8% src/backend/access/gist/ 11.6% src/backend/access/hash/ 25.0% src/backend/access/nbtree/ 11.2% src/backend/access/spgist/ 14.3% src/backend/storage/page/ 20.1% src/include/access/ diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index ba394f08f6..587fb04585 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -573,8 +573,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, { if (RelationNeedsWAL(rel)) { - OffsetNumber ndeloffs = 0, -
Re: The output sql generated by pg_dump for a create function refers to a modified table name
"Jonathan S. Katz" writes: > Good catch. Do you have thoughts on how we can adjust the naming logic > to handle cases like this? I think it's perfectly fine that ruleutils decided to use different aliases for the two different occurrences of "t1": the statement is quite confusing as written. The problem probably is that get_delete_query_def() has no idea that it's supposed to print the adjusted alias just after "DELETE FROM tab". UPDATE likely has same issue ... maybe INSERT too? regards, tom lane
Re: pg_upgrade and logical replication
Hi, On Fri, Feb 17, 2023 at 04:12:54PM +0530, Amit Kapila wrote: > On Fri, Feb 17, 2023 at 1:24 PM Julien Rouhaud wrote: > > > > An easy workaround that I tried is to allow something like > > > > ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) > > > > so that the refresh internally happens before the apply worker is started > > and > > you just keep consuming the delta, which works on naive scenario. > > > > One concern I have with this approach is that the default values for both > > "refresh" and "copy_data" for all other subcommands is "true, but we would > > probably need a different default value in that exact scenario (as we know > > we > > already have the data). I think that it would otherwise be safe in my very > > specific scenario, assuming that you created the slot beforehand and moved > > the > > slot's LSN at the promotion point, as even if you add non-empty tables to > > the > > publication you will only need the delta whether those were initially empty > > or > > not given your initial physical replica state. > > > > This point is not very clear. Why would one just need delta even for new > tables? Because in my scenario I'm coming from physical replication, so I know that I did replicate everything until the promotion LSN. Any table later added in the publication is either already fully replicated until that LSN on the upgraded node, so only the delta is needed, or has been created after that LSN. In the latter case, the entirety of the table will be replicated with the logical replication as a delta right? > > Any other scenario would make > > this new option dangerous, if not entirely useless, but not more than any of > > the current commands that lead to refreshing a subscription and have the > > same > > options I guess. > > > > All in all, currently the only way to somewhat safely resume logical > > replication after a pg_upgrade is to drop all the subscriptions that were > > transferred during pg_upgrade on all databases and recreate them (using the > > existing slots on the publisher side obviously), allowing the initial > > connection. But this approach only works in the exact scenario I mentioned > > (physical to logical replication, or at least a case where *all* the tables > > where logically replicated prior to the pg_ugprade), otherwise you have to > > recreate the follower node from scratch using logical repication. > > > > I think if you dropped and recreated the subscriptions by retaining > old slots, the replication should resume from where it left off before > the upgrade. Which scenario are you concerned about? I'm concerned about people not coming from physical replication. If you just had some "normal" logical replication, you can't assume that you already have all the data from the upstream subscription. If it was modified and a non empty table is added, you might need to copy the data of part of the tables and keep replicating for the rest. It's hard to be sure from a user point of view, and even if you knew you have no way to express it. > > Is that indeed the current behavior, or did I miss something? > > > > Is this "resume logical replication on pg_upgraded node" something we want > > to > > support better? I was thinking that we could add a new pg_dump mode (maybe > > only usable during pg_upgrade) that also restores the pg_subscription_rel > > content in each subscription or something like that. 
If not, should > > pg_upgrade > > keep preserving the subscriptions as it doesn't seem safe to use them, or at > > least document the hazards (I didn't find anything about it in the > > documentation)? > > > > > > There is a mention of this in pg_dump docs. See [1] (When dumping > logical replication subscriptions ...) Indeed, but it's barely saying "It is then up to the user to reactivate the subscriptions in a suitable way" and "It might also be appropriate to truncate the target tables before initiating a new full table copy". As I mentioned, I don't think there's a suitable way to reactivate the subscription, at least if you don't want to miss some records, so truncating all target tables is the only fully safe way to proceed. It seems quite silly to have to do so just because pg_upgrade doesn't retain the list of relation per subscription.
Re: pg_stat_statements and "IN" conditions
> On Wed, Feb 15, 2023 at 08:51:56AM +0100, David Geier wrote: > Hi, > > On 2/11/23 13:08, Dmitry Dolgov wrote: > > > On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote: > > > > > > The original version of the patch was doing all of this, i.e. handling > > > numerics, Param nodes, RTE_VALUES. The commentary about > > > find_const_walker in tests is referring to a part of that, that was > > > dealing with evaluation of expression to see if it could be reduced to a > > > constant. > > > > > > Unfortunately there was a significant push back from reviewers because > > > of those features. That's why I've reduced the patch to it's minimally > > > useful version, having in mind re-implementing them as follow-up patches > > > in the future. This is the reason as well why I left tests covering all > > > this missing functionality -- as breadcrumbs to already discovered > > > cases, important for the future extensions. > > I'd like to elaborate on this a bit and remind about the origins of the > > patch, as it's lost somewhere in the beginning of the thread. The idea > > is not pulled out of thin air, everything is coming from our attempts to > > improve one particular monitoring infrastructure in a real commercial > > setting. Every covered use case and test in the original proposal was a > > result of field trials, when some application-side library or ORM was > > responsible for gigabytes of data in pgss, chocking the monitoring agent. > > Thanks for the clarifications. I didn't mean to contend the usefulness of > the patch and I wasn't aware that you already jumped through the loops of > handling Param, etc. No worries, I just wanted to emphasize that we've already collected quite some number of use cases. > Seems like supporting only constants is a good starting > point. The only thing that is likely confusing for users is that NUMERICs > (and potentially constants of other types) are unsupported. Wouldn't it be > fairly simple to support them via something like the following? > > is_const(element) || (is_coercion(element) && is_const(element->child)) It definitely makes sense to implement that, although I don't think it's going to be acceptable to do that via directly listing conditions an element has to satisfy. It probably has to be more flexible, sice we would like to extend it in the future. My plan is to address this in a follow-up patch, when the main mechanism is approved. Would you agree with this approach?
Re: pg_stat_statements and "IN" conditions
> On Thu, Feb 09, 2023 at 08:43:29PM +0100, Dmitry Dolgov wrote: > > On Thu, Feb 09, 2023 at 06:26:51PM +0100, Alvaro Herrera wrote: > > On 2023-Feb-09, Dmitry Dolgov wrote: > > > > > > On Thu, Feb 09, 2023 at 02:30:34PM +0100, Peter Eisentraut wrote: > > > > > > What is the point of making this a numeric setting? Either you want > > > > to merge all values or you don't want to merge any values. > > > > > > At least in theory the definition of "too many constants" is different > > > for different use cases and I see allowing to configure it as a way of > > > reducing the level of surprise here. > > > > I was thinking about this a few days ago and I agree that we don't > > necessarily want to make it just a boolean thing; we may want to make it > > more complex. One trivial idea is to make it group entries in powers of > > 10: for 0-9 elements, you get one entry, and 10-99 you get a different > > one, and so on: > > > > # group everything in a single bucket > > const_merge_threshold = true / yes / on > > > > # group 0-9, 10-99, 100-999, 1000- > > const_merge_treshold = powers > > > > Ideally the value would be represented somehow in the query text. For > > example > > > > query| calls > > --+--- > > select * from test where i in ({... 0-9 entries ...})| 2 > > select * from test where i in ({... 10-99 entries ...}) | 1 > > > > What do you think? The jumble would have to know how to reduce all > > values within each power-of-ten group to one specific value, but I don't > > think that should be particularly difficult. > > Yeah, it sounds appealing and conveniently addresses the question of > losing the information about how many constants originally were there. > Not sure if the example above would be the most natural way to represent > it in the query text, but otherwise I'm going to try implementing this. > Stay tuned. It took me couple of evenings, here is what I've got: * The representation is not that far away from your proposal, I've settled on: SELECT * FROM test_merge WHERE id IN (... [10-99 entries]) * To not reinvent the wheel, I've reused decimalLenght function from numutils, hence one more patch to make it available to reuse. * This approach resolves my concerns about letting people tuning the behaviour of merging, as now it's possible to distinguish between calls with different number of constants up to the power of 10. So I've decided to simplify the configuration and make the guc boolean to turn it off or on. * To separate queries with constants falling into different ranges (10-99, 100-999, etc), the order of magnitude is added into the jumble hash. * I've incorporated feedback from Sergei and David, as well as tried to make comments and documentation more clear. Any feedback is welcomed, thanks! >From dbb4eab9f3efbcee2326278be6f70ff52685b2b0 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Fri, 17 Feb 2023 10:17:55 +0100 Subject: [PATCH v13 1/2] Reusable decimalLength functions Move out decimalLength functions to reuse in the following patch. 
--- src/backend/utils/adt/numutils.c | 50 +--- src/include/utils/numutils.h | 67 2 files changed, 68 insertions(+), 49 deletions(-) create mode 100644 src/include/utils/numutils.h diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 471fbb7ee6..df7418cce7 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -18,9 +18,8 @@ #include #include -#include "common/int.h" #include "utils/builtins.h" -#include "port/pg_bitutils.h" +#include "utils/numutils.h" /* * A table of all two-digit numbers. This is used to speed up decimal digit @@ -38,53 +37,6 @@ static const char DIGIT_TABLE[200] = "80" "81" "82" "83" "84" "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"; -/* - * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10 - */ -static inline int -decimalLength32(const uint32 v) -{ - int t; - static const uint32 PowersOfTen[] = { - 1, 10, 100, - 1000, 1, 10, - 100, 1000, 1, - 10 - }; - - /* - * Compute base-10 logarithm by dividing the base-2 logarithm by a - * good-enough approximation of the base-2 logarithm of 10 - */ - t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096; - return t + (v >= PowersOfTen[t]); -} - -static inline int -decimalLength64(const uint64 v) -{ - int t; - static const uint64 PowersOfTen[] = { - UINT64CONST(1), UINT64CONST(10), - UINT64CONST(100), UINT64CONST(1000), - UINT64CONST(1), UINT64CONST(10), - UINT64CONST(100), UINT64CONST(1000), - UINT64CONST(1), UINT64CONST(10), - UINT64CONST(100), UINT64CONST(1000), - UINT64CONST(1), UINT64CONST(10
Re: Possible false valgrind error reports
Thank you, I moved comment changes to 0001 and rewrote the fix through Min(). > The first hunk in 0001 doesn't seem quite right yet: > > * old allocation. > */ > #ifdef USE_VALGRIND > -if (oldsize > chunk->requested_size) > +if (size > chunk->requested_size && oldsize > chunk->requested_size) > VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + > chunk->requested_size, > oldsize - chunk->requested_size); > #endif > > If size < oldsize, aren't we still doing the wrong thing? Seems like > maybe it has to be like If size > chunk->requested_size than chksize >= oldsize and so we can mark this memory without worries. Region from size to chksize will be marked NOACCESS later anyway: /* Ensure any padding bytes are marked NOACCESS. */ VALGRIND_MAKE_MEM_NOACCESS((char *) pointer + size, chksize - size); I agree that it's not obvious, so I changed the first hunk like this: - if (oldsize > chunk->requested_size) + if (Min(size, oldsize) > chunk->requested_size) VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + chunk->requested_size, - oldsize - chunk->requested_size); + Min(size, oldsize) - chunk->requested_size); Any ideas on how to make this place easier to understand and comment above it concise and clear are welcome. There is another thing about this version. New line + Min(size, oldsize) - chunk->requested_size); is longer than 80 symbols and I don't know what's the best way to avoid this without making it look weird. I also noticed that if RANDOMIZE_ALLOCATED_MEMORY is defined then randomize_mem() have already marked this memory UNDEFINED. So we only "may need to adjust trailing bytes" if RANDOMIZE_ALLOCATED_MEMORY isn't defined. I reflected it in v2 of 0001 too. From 009ef9bbabcd71e0d5f2b60799c0b71d21fb9767 Mon Sep 17 00:00:00 2001 From: Karina Litskevich Date: Fri, 17 Feb 2023 15:32:05 +0300 Subject: [PATCH v2 2/2] Change variable name in AllocSetRealloc() --- src/backend/utils/mmgr/aset.c | 32 +--- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index ace4907ce9..9584d50b14 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -1112,7 +1112,7 @@ AllocSetRealloc(void *pointer, Size size) AllocBlock block; AllocSet set; MemoryChunk *chunk = PointerGetMemoryChunk(pointer); - Size oldsize; + Size oldchksize; int fidx; /* Allow access to private part of chunk header. */ @@ -1140,11 +1140,11 @@ AllocSetRealloc(void *pointer, Size size) set = block->aset; - oldsize = block->endptr - (char *) pointer; + oldchksize = block->endptr - (char *) pointer; #ifdef MEMORY_CONTEXT_CHECKING /* Test for someone scribbling on unused space in chunk */ - Assert(chunk->requested_size < oldsize); + Assert(chunk->requested_size < oldchksize); if (!sentinel_ok(pointer, chunk->requested_size)) elog(WARNING, "detected write past chunk end in %s %p", set->header.name, chunk); @@ -1197,15 +1197,15 @@ AllocSetRealloc(void *pointer, Size size) #else /* * If this is an increase, realloc() will have left any newly-allocated - * part (from oldsize to chksize) UNDEFINED, but we may need to adjust + * part (from oldchksize to chksize) UNDEFINED, but we may need to adjust * trailing bytes from the old allocation (from chunk->requested_size to - * oldsize) as they are marked NOACCESS. Make sure not to mark extra - * bytes in case chunk->requested_size < size < oldsize. + * oldchksize) as they are marked NOACCESS. Make sure not to mark extra + * bytes in case chunk->requested_size < size < oldchksize. 
*/ #ifdef USE_VALGRIND - if (Min(size, oldsize) > chunk->requested_size) + if (Min(size, oldchksize) > chunk->requested_size) VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + chunk->requested_size, - Min(size, oldsize) - chunk->requested_size); + Min(size, oldchksize) - chunk->requested_size); #endif #endif @@ -1223,7 +1223,7 @@ AllocSetRealloc(void *pointer, Size size) * portion DEFINED. Make sure not to mark memory behind the new * allocation in case it's smaller than old one. */ - VALGRIND_MAKE_MEM_DEFINED(pointer, Min(size, oldsize)); + VALGRIND_MAKE_MEM_DEFINED(pointer, Min(size, oldchksize)); #endif /* Ensure any padding bytes are marked NOACCESS. */ @@ -1248,11 +1248,11 @@ AllocSetRealloc(void *pointer, Size size) fidx = MemoryChunkGetValue(chunk); Assert(FreeListIdxIsValid(fidx)); - oldsize = GetChunkSizeFromFreeListIdx(fidx); + oldchksize = GetChunkSizeFromFreeListIdx(fidx); #ifdef MEMORY_CONTEXT_CHECKING /* Test for someone scribbling on unused space in chunk */ - if (chunk->requested_size < oldsize) + if (chunk->requested_size < oldchksize) if (!sentinel_ok(pointer, chunk->requested_size)) elog(WARNING, "detected write past chunk end in %s %p", set->header.name, c
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 10:09 AM, Tom Lane wrote: "Jonathan S. Katz" writes: Good catch. Do you have thoughts on how we can adjust the naming logic to handle cases like this? I think it's perfectly fine that ruleutils decided to use different aliases for the two different occurrences of "t1": the statement is quite confusing as written. Agreed on that -- while it's harder to set up, I do prefer the original example[1] to demonstrate this, as it shows the issue given it does not have those multiple occurrences, at least not within the same context, i.e.: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM calendar WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; the table prefixes on the attributes within the DELETE statement were ultimately mangled: WITH delete_calendar AS ( DELETE FROM public.calendar WHERE ((calendar_1.room_id OPERATOR(pg_catalog.=) calendar_manage.room_id) AND (calendar_1.calendar_date OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) ) INSERT INTO public.calendar (room_id, status, calendar_date, calendar_range) The problem probably is that get_delete_query_def() has no idea that it's supposed to print the adjusted alias just after "DELETE FROM tab". UPDATE likely has same issue ... maybe INSERT too? Maybe? I modified the function above to do an INSERT/UPDATE instead of a DELETE but I did not get any errors. However, if the logic is similar there could be an issue there. Thanks, Jonathan [1] https://www.postgresql.org/message-id/e947fa21-24b2-f922-375a-d4f763ef3e4b%40postgresql.org OpenPGP_signature Description: OpenPGP digital signature
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 11:19 AM, Jonathan S. Katz wrote: On 2/17/23 10:09 AM, Tom Lane wrote: Agreed on that -- while it's harder to set up, I do prefer the original example[1] to demonstrate this, as it shows the issue given it does not have those multiple occurrences, at least not within the same context, i.e.: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM calendar WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; The problem probably is that get_delete_query_def() has no idea that it's supposed to print the adjusted alias just after "DELETE FROM tab". UPDATE likely has same issue ... maybe INSERT too? Maybe? I modified the function above to do an INSERT/UPDATE instead of a DELETE but I did not get any errors. However, if the logic is similar there could be an issue there. I spoke too soon -- I was looking at the wrong logs. I did reproduce it with UPDATE, but not INSERT. The example I used for UPDATE: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH update_calendar AS ( UPDATE calendar SET room_id = $1 WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; which produced: WITH update_calendar AS ( UPDATE public.calendar SET room_id = calendar_manage.room_id WHERE ( (calendar_1.room_id OPERATOR(pg_catalog.=) calendar_manage.room_id) AND (calendar_1.calendar_date OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) ) INSERT INTO public.calendar (room_id, status, calendar_date, calendar_range) SELECT calendar_manage.room_id, c.status, calendar_manage.calendar_date, c.calendar_range FROM public.calendar_generate_calendar(calendar_manage.room_id, pg_catalog.tstzrange((calendar_manage.calendar_date)::timestamp with time zone, ((calendar_manage.calendar_date OPERATOR(pg_catalog.+) 1))::timestamp with time zone)) c(status, calendar_range); Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
pg_init_privs corruption.
Hi hackers! Recently we faced a problem with one of our production clusters. The problem was with pg_upgrade; the cause was an invalid pg_dump of the cluster schema. The pg_dump SQL contained strange records like

REVOKE SELECT,INSERT,DELETE,UPDATE ON TABLE *relation* FROM "144841";

but there is no role "144841". We dug in, and it turns out that 144841 was the OID of a previously deleted role. I have reproduced the issue using a simple test extension, yoext [1].

SQL script:

create role user1;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT select ON TABLES TO user1;
create extension yoext;
drop owned by user1;
select * from pg_init_privs where privtype = 'e';
drop role user1;
select * from pg_init_privs where privtype = 'e';

Result of execution (executed on fresh master from commit 17feb6a566b77bf62ca453dec215adcc71755c20):

psql (16devel) Type "help" for help.
postgres=# create role user1;
CREATE ROLE
postgres=# ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT select ON TABLES TO user1;
ALTER DEFAULT PRIVILEGES
postgres=# create extension yobaext ;
CREATE EXTENSION
postgres=# drop owned by user1;
DROP OWNED
postgres=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype | initprivs
+--+--+--+---
 16387 | 1259 |0 | e| {reshke=arwdDxtm/reshke,user1=r/reshke,=r/reshke}
(1 row)
postgres=# drop role user1;
DROP ROLE
postgres=# select * from pg_init_privs where privtype = 'e';
 objoid | classoid | objsubid | privtype | initprivs
+--+--+--+---
 16387 | 1259 |0 | e| {reshke=arwdDxtm/reshke,16384=r/reshke,=r/reshke}
(1 row)

As you can see, after dropping the role there are invalid records in the pg_init_privs system relation. After this, pg_dump generates SQL statements, some of which are based on the content of pg_init_privs, resulting in an invalid dump.

PFA a fix. The idea of the fix is simply to drop records from pg_init_privs while dropping a role: records with grantor or grantee equal to the OID of the dropped role are erased. After that, pg_dump works OK.

Implementation comment: I failed to find a proper way to allocate the ACL array, so I exposed the acl.c-internal function `allocacl` in a header. This needs to be improved somehow.

[1] yoext https://github.com/reshke/yoext/ v1-0001-Fix-pg_init_prevs-corruption.patch Description: Binary data
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Fri, Feb 17, 2023 at 12:03 AM David Rowley wrote: > On Fri, 17 Feb 2023 at 17:40, Jonah H. Harris > wrote: > > Yeah. There’s definitely a smarter and more reusable approach than I was > proposing. A lot of that code is fairly mature and I figured more people > wouldn’t want to alter it in such ways - but I’m up for it if an approach > like this is the direction we’d want to go in. > > Having something agnostic to if it's allocating a new context or > adding a block to an existing one seems like a good idea to me. > I like this idea. > I think the tricky part will be the discussion around which size > classes to keep around and in which cases can we use a larger > allocation without worrying too much that it'll be wasted. We also > don't really want to make the minimum memory that a backend can keep > around too bad. Patches such as [1] are trying to reduce that. Maybe > we can just keep a handful of blocks of 1KB, 8KB and 16KB around, or > more accurately put, ALLOCSET_SMALL_INITSIZE, > ALLOCSET_DEFAULT_INITSIZE and ALLOCSET_DEFAULT_INITSIZE * 2, so that > it works correctly if someone adjusts those definitions. > Per that patch and the general idea, what do you think of either: 1. A single GUC, something like backend_keep_mem, that represents the cached memory we'd retain rather than send directly to free()? 2. Multiple GUCs, one per block size? While #2 would give more granularity, I'm not sure it would necessarily be needed. The main issue I'd see in that case would be the selection approach to block sizes to keep given a fixed amount of keep memory. We'd generally want the majority of the next queries to make use of it as best as possible, so we'd either need each size to be equally represented or some heuristic. I don't really like #2, but threw it out there :) I think you'll want to look at what the maximum memory a backend can > keep around in context_freelists[] and not make the worst-case memory > consumption worse than it is today. > Agreed. > I imagine this would be some new .c file in src/backend/utils/mmgr > which aset.c, generation.c and slab.c each call a function from to see > if we have any cached blocks of that size. You'd want to call that in > all places we call malloc() from those files apart from when aset.c > and generation.c malloc() for a dedicated block. You can probably get > away with replacing all of the free() calls with a call to another > function where you pass the pointer and the size of the block to have > it decide if it's going to free() it or cache it. Agreed. I would see this as practically just a generic allocator free-list; is that how you view it also? > I doubt you need to care too much if the block is from a dedicated > allocation or a normal > block. We'd just always free() if it's not in the size classes that > we care about. > Agreed. -- Jonah H. Harris
Re: psql: Add role's membership options to the \du+ command
On Fri, Feb 17, 2023 at 4:02 AM Pavel Luzanov wrote: >List of roles > Role name | Attributes | > Member of > > ---++--- > admin | Create role| > {bob,bob} > bob || > {} > postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | > {} > > First 'grant bob to admin' command issued immediately after creating role > bob by superuser(grantor=10). Second command issues by admin role and set > membership options SET and INHERIT. > If we don't ready to display membership options with \du+ may be at least > we must group records in 'Member of' column for \du command? > I agree that these views should GROUP BY roleid and use bool_or(*_option) to produce their result. Their purpose is to communicate the current effective state to the user, not facilitate full inspection of the configuration, possibly to aid in issuing GRANT and REVOKE commands. One thing I found, and I plan to bring this up independently once I've collected my thoughts, is that pg_has_role() uses the terminology "USAGE" and "MEMBER" for "INHERIT" and "SET" respectively. It's annoying that "member" has been overloaded here. And the choice of USAGE just seems arbitrary (though I haven't researched it) given the related syntax. https://www.postgresql.org/docs/15/functions-info.html
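Under the hood, the grouping being suggested amounts to something like the following sketch (the inherit_option and set_option column names are assumed from the current PG16 development catalog, so treat the exact spellings as provisional):

SELECT m.member::regrole           AS role_name,
       m.roleid::regrole           AS member_of,
       bool_or(m.admin_option)     AS admin_option,
       bool_or(m.inherit_option)   AS inherit_option,
       bool_or(m.set_option)       AS set_option
FROM pg_auth_members m
GROUP BY m.member, m.roleid;

That collapses the duplicated {bob,bob} entries into one row per grantee/role pair while still showing whether any of the underlying grants carries each option.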
Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()
On Thu, Feb 16, 2023 at 10:59 PM Michael Paquier wrote: > I am adding Stephen Frost > in CC to see if he has any comments about all this part of the logic > with gssencmode. Sounds good. > I agree that > PQconnectPoll() has grown beyond the point of making it easy to > maintain. I am wondering which approach we could take when it comes > to simplify something like that. Attempting to reduce the number of > flags stored in PGconn would be one. The second may be to split the > internal logic into more functions, for each state we are going > through? The first may lead to an even cleaner logic for the second > point. Yeah, a mixture of both might be helpful -- the first to reduce the inputs to the state machine; the second to reduce interdependencies between cases, the distance of the potential goto jumps, and the number of state machine outputs. (When cases are heavily dependent on each other, probably best to handle them in the same function?) Luckily it looks like the current machine is usually linear. Thanks, --Jacob
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > I am saying that pg_upgrade should be able to deal with the > > difference. The > > details of how to implement that, don't matter that much. > > To clarify, you're saying that pg_upgrade should simply update > pg_database to set the new databases' collation fields equal to that > of > the old cluster? Thinking about this more, it's not clear to me if this would be in scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster rather than checking for compatibility, why doesn't it just take over and do the initdb for the new cluster itself? That would be less confusing for users, and avoid some weirdness (like, if you drop the database "postgres" on the original, why does it reappear after an upgrade?). Someone might want to do something interesting to the new cluster before the upgrade, but it's not clear from the docs what would be both useful and safe. Regards, Jeff Davis
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-17 09:01:54 -0800, Jeff Davis wrote: > On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > > I am saying that pg_upgrade should be able to deal with the > > > difference. The > > > details of how to implement that, don't matter that much. > > > > To clarify, you're saying that pg_upgrade should simply update > > pg_database to set the new databases' collation fields equal to that > > of > > the old cluster? Yes. > Thinking about this more, it's not clear to me if this would be in > scope for pg_upgrade or not. I don't think we should consider changing the default collation provider without making this more seamless, one way or another. > If pg_upgrade is fixing up the new cluster rather than checking for > compatibility, why doesn't it just take over and do the initdb for the new > cluster itself? That would be less confusing for users, and avoid some > weirdness (like, if you drop the database "postgres" on the original, why > does it reappear after an upgrade?). I've wondered about that as well. There are some initdb-time options you can set, but pg_upgrade could forward those. Greetings, Andres Freund
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 6:02 PM Jeff Davis wrote: > On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > > I am saying that pg_upgrade should be able to deal with the > > > difference. The > > > details of how to implement that, don't matter that much. > > > > To clarify, you're saying that pg_upgrade should simply update > > pg_database to set the new databases' collation fields equal to that > > of > > the old cluster? > > Thinking about this more, it's not clear to me if this would be in > scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster > rather than checking for compatibility, why doesn't it just take over > and do the initdb for the new cluster itself? That would be less > confusing for users, and avoid some weirdness (like, if you drop the > database "postgres" on the original, why does it reappear after an > upgrade?). > > Someone might want to do something interesting to the new cluster > before the upgrade, but it's not clear from the docs what would be both > useful and safe. >

Today I tested ICU for Czech sorting. It is a little slower, though not by much, but it produces partially different results:

select row_number() over (order by nazev collate "cs-x-icu"), nazev from obce
except
select row_number() over (order by nazev collate "cs_CZ"), nazev from obce;

returns a non-empty set. So, at least for the Czech collation, an index rebuild is necessary.

Regards Pavel

> > Regards, > Jeff Davis > > > >
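The flip side of that is that after moving a database's default from libc to ICU (however pg_upgrade ends up handling the transfer), any index on affected text columns has to be rebuilt before it can be trusted, e.g. something along the lines of (the database name is only an example):

REINDEX DATABASE appdb;
-- or, more selectively, REINDEX only the indexes on collatable columns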
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-17 16:19:46 +0900, Michael Paquier wrote: > But it looks like I misunderstood what this quote meant compared to > what v3 does. It is true that v3 sets iov_len and iov_base more than > needed when writing sizes larger than BLCKSZ. I don't think it does for writes larger than BLCKSZ, it just does more for writes larger than PG_IOV_MAX * BLCKSZ. But in those cases CPU time is going to be spent elsewhere. > Seems like you think that it is not really going to matter much to track > which iovecs have been already initialized during the first loop on > pg_pwritev_with_retry() to keep the code shorter? Yes. I'd bet that, in the unlikely case you're going to see any difference at all, unconditionally initializing is going to win. Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every call. Even if we could do some further optimizations of what I did in the patch, you can initialize only the needed IOVs a *lot* of times before it shows up... I'm inclined to go with my version, with the argument order swapped to Bharath's order. Greetings, Andres Freund
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi, On 2023-02-17 17:26:20 +1300, David Rowley wrote: > I didn't hear it mentioned explicitly here, but I suspect it's faster > when increasing the initial size due to the memory context caching > code that reuses aset MemoryContexts (see context_freelists[] in > aset.c). Since we reset the context before caching it, then it'll > remain fast when we can reuse a context, provided we don't need to do > a malloc for an additional block beyond the initial block that's kept > in the cache. I'm not so sure this is the case. Which is one of the reasons I'd really like to see a) memory context stats for executor context b) a CPU profile of the problem c) a reproducer. Jonah, did you just increase the initial size, or did you potentially also increase the maximum block size? And did you increase ALLOCSET_DEFAULT_INITSIZE everywhere, or just passed a larger block size in CreateExecutorState()? If the latter,the context freelist wouldn't even come into play. A 8MB max block size is pretty darn small if you have a workload that ends up with a gigabytes worth of blocks. And the problem also could just be that the default initial blocks size takes too long to ramp up to a reasonable block size. I think it's 20 blocks to get from ALLOCSET_DEFAULT_INITSIZE to ALLOCSET_DEFAULT_MAXSIZE. Even if you allocate a good bit more than 8MB, having to additionally go through 20 smaller chunks is going to be noticable until you reach a good bit higher number of blocks. > Maybe we should think of a more general-purpose way of doing this > caching which just keeps a global-to-the-process dclist of blocks > laying around. We could see if we have any free blocks both when > creating the context and also when we need to allocate another block. Not so sure about that. I suspect the problem could just as well be the maximum block size, leading to too many blocks being allocated. Perhaps we should scale that to a certain fraction of work_mem, by default? Either way, I don't think we should go too deep without some data, too likely to miss the actual problem. > I see no reason why this couldn't be shared among the other context > types rather than keeping this cache stuff specific to aset.c. slab.c > might need to be pickier if the size isn't exactly what it needs, but > generation.c should be able to make use of it the same as aset.c > could. I'm unsure what'd we'd need in the way of size classing for > this, but I suspect we'd need to pay attention to that rather than do > things like hand over 16MBs of memory to some context that only wants > a 1KB initial block. Possible. I can see something like a generic "free block" allocator being useful. Potentially with allocating the underlying memory with larger mmap()s than we need for individual blocks. Random note: I wonder if we should having a bitmap (in an int) in front of aset's freelist. In a lot of cases we incur plenty cache misses, just to find the freelist bucket empty. Greetings, Andres Freund
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 09:05 -0800, Andres Freund wrote: > > Thinking about this more, it's not clear to me if this would be in > > scope for pg_upgrade or not. > > I don't think we should consider changing the default collation > provider > without making this more seamless, one way or another. I guess I'm fine hacking pg_upgrade, but I think I'd like to make it conditional on this specific case: only perform the fixup if the old cluster is 15 or earlier and using libc and the newer cluster is 16 or later and using icu. There's already a check that the new cluster is empty, so I think it's safe to hack the pg_database locale fields. Regards, Jeff Davis >
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-17 10:00:41 -0800, Jeff Davis wrote: > I guess I'm fine hacking pg_upgrade, but I think I'd like to make it > conditional on this specific case: only perform the fixup if the old > cluster is 15 or earlier and using libc and the newer cluster is 16 or > later and using icu. -1. That's just going to cause pain one major version upgrade further down the line. Why would we want to incur that pain? > There's already a check that the new cluster is empty, so I think it's > safe to hack the pg_database locale fields. I don't think we need to, we do issue the CREATE DATABASEs. So we just need to make sure that includes the collation provider info, and the proper template database, in pg_upgrade mode. Greetings, Andres Freund
Re: The output sql generated by pg_dump for a create function refers to a modified table name
"Jonathan S. Katz" writes: > I spoke too soon -- I was looking at the wrong logs. I did reproduce it > with UPDATE, but not INSERT. It can be reproduced with INSERT too, on the same principle as the others: put the DML command inside a WITH, and give it an alias conflicting with the outer query. Being a lazy sort, I tried to collapse all three cases into a single test case, and observed something I hadn't thought of: we disambiguate aliases in a WITH query with respect to the outer query, but not with respect to other WITH queries. This makes the example (see attached) a bit more confusing than I would have hoped. However, the same sort of thing happens within other kinds of nested subqueries, so I think it's probably all right as-is. In any case, changing this aspect would require a significantly bigger patch with more risk of unwanted side-effects. To fix it, I pulled out the print-an-alias logic within get_from_clause_item and called that new function for INSERT/UPDATE/DELETE. This is a bit of overkill perhaps, because only the RTE_RELATION case can be needed by these other callers, but it seemed like a sane refactorization. I've not tested, but I imagine this will need patched all the way back. The rule case should be reachable in all supported versions. regards, tom lane diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 9ac42efdbc..6dc117dea8 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -475,6 +475,8 @@ static void get_from_clause(Query *query, const char *prefix, deparse_context *context); static void get_from_clause_item(Node *jtnode, Query *query, deparse_context *context); +static void get_rte_alias(RangeTblEntry *rte, int varno, bool use_as, + deparse_context *context); static void get_column_alias_list(deparse_columns *colinfo, deparse_context *context); static void get_from_clause_coldeflist(RangeTblFunction *rtfunc, @@ -6648,12 +6650,14 @@ get_insert_query_def(Query *query, deparse_context *context, context->indentLevel += PRETTYINDENT_STD; appendStringInfoChar(buf, ' '); } - appendStringInfo(buf, "INSERT INTO %s ", + appendStringInfo(buf, "INSERT INTO %s", generate_relation_name(rte->relid, NIL)); - /* INSERT requires AS keyword for target alias */ - if (rte->alias != NULL) - appendStringInfo(buf, "AS %s ", - quote_identifier(rte->alias->aliasname)); + + /* Print the relation alias, if needed; INSERT requires explicit AS */ + get_rte_alias(rte, query->resultRelation, true, context); + + /* always want a space here */ + appendStringInfoChar(buf, ' '); /* * Add the insert-column-names list. 
Any indirection decoration needed on @@ -6835,9 +6839,10 @@ get_update_query_def(Query *query, deparse_context *context, appendStringInfo(buf, "UPDATE %s%s", only_marker(rte), generate_relation_name(rte->relid, NIL)); - if (rte->alias != NULL) - appendStringInfo(buf, " %s", - quote_identifier(rte->alias->aliasname)); + + /* Print the relation alias, if needed */ + get_rte_alias(rte, query->resultRelation, false, context); + appendStringInfoString(buf, " SET "); /* Deparse targetlist */ @@ -7043,9 +7048,9 @@ get_delete_query_def(Query *query, deparse_context *context, appendStringInfo(buf, "DELETE FROM %s%s", only_marker(rte), generate_relation_name(rte->relid, NIL)); - if (rte->alias != NULL) - appendStringInfo(buf, " %s", - quote_identifier(rte->alias->aliasname)); + + /* Print the relation alias, if needed */ + get_rte_alias(rte, query->resultRelation, false, context); /* Add the USING clause if given */ get_from_clause(query, " USING ", context); @@ -10887,10 +10892,8 @@ get_from_clause_item(Node *jtnode, Query *query, deparse_context *context) { int varno = ((RangeTblRef *) jtnode)->rtindex; RangeTblEntry *rte = rt_fetch(varno, query->rtable); - char *refname = get_rtable_name(varno, context); deparse_columns *colinfo = deparse_columns_fetch(varno, dpns); RangeTblFunction *rtfunc1 = NULL; - bool printalias; if (rte->lateral) appendStringInfoString(buf, "LATERAL "); @@ -11027,59 +11030,7 @@ get_from_clause_item(Node *jtnode, Query *query, deparse_context *context) } /* Print the relation alias, if needed */ - printalias = false; - if (rte->alias != NULL) - { - /* Always print alias if user provided one */ - printalias = true; - } - else if (colinfo->printaliases) - { - /* Always print alias if we need to print column aliases */ - printalias = true; - } - else if (rte->rtekind == RTE_RELATION) - { - /* - * No need to print alias if it's same as relation name (this - * would normally be the case, but not if set_rtable_names had to - * resolve a conflict). - */ - if (strcmp(refname, get_relation_name(rte->relid)) != 0) -printalias = true; - } - else if (rte->rtekind == RTE_FUNCTION) - { -
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 09:01:54AM -0800, Jeff Davis wrote: > On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote: > > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote: > > > I am saying that pg_upgrade should be able to deal with the > > > difference. The > > > details of how to implement that, don't matter that much. > > > > To clarify, you're saying that pg_upgrade should simply update > > pg_database to set the new databases' collation fields equal to that > > of > > the old cluster? > > Thinking about this more, it's not clear to me if this would be in > scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster > rather than checking for compatibility, why doesn't it just take over > and do the initdb for the new cluster itself? That would be less > confusing for users, and avoid some weirdness (like, if you drop the > database "postgres" on the original, why does it reappear after an > upgrade?). > > Someone might want to do something interesting to the new cluster > before the upgrade, but it's not clear from the docs what would be both > useful and safe. This came up before - I'm of the opinion that it's unsupported and/or useless to try to do anything on the new cluster between initdb and pg_upgrade. https://www.postgresql.org/message-id/20220707184410.gb13...@telsasoft.com https://www.postgresql.org/message-id/20220905170322.gm31...@telsasoft.com -- Justin
Re: pg_init_privs corruption.
Kirill Reshke writes: > As you can see, after drop role there is invalid records in pg_init_privs > system relation. After this, pg_dump generate sql statements, some of which > are based on content of pg_init_privs, resulting in invalid dump. Ugh. > PFA fix. I don't think this is anywhere near usable as-is, because it only accounts for pg_init_privs entries in the current database. We need to handle these records in the DROP OWNED BY mechanism instead, and also ensure there are shared-dependency entries for them so that the role can't be dropped until the entries are gone in all DBs. The real problem may be less that DROP is doing the wrong thing, and more that creation of the pg_init_privs entries neglects to record a dependency. regards, tom lane
Re: The output sql generated by pg_dump for a create function refers to a modified table name
On 2/17/23 1:18 PM, Tom Lane wrote: It can be reproduced with INSERT too, on the same principle as the others: put the DML command inside a WITH, and give it an alias conflicting with the outer query. Ah, I see based on your example below. I did not alias the INSERT statement in the way (and I don't know how common of a pattern it is to o that). Being a lazy sort, I tried to collapse all three cases into a single test case, and observed something I hadn't thought of: we disambiguate aliases in a WITH query with respect to the outer query, but not with respect to other WITH queries. This makes the example (see attached) a bit more confusing than I would have hoped. However, the same sort of thing happens within other kinds of nested subqueries, so I think it's probably all right as-is. In any case, changing this aspect would require a significantly bigger patch with more risk of unwanted side-effects. I think I agree. Most people should not be looking at the disambiguated statements unless they are troubleshooting an issue (such as $SUBJECT). The main goal is to disambiguate correctly. To fix it, I pulled out the print-an-alias logic within get_from_clause_item and called that new function for INSERT/UPDATE/DELETE. This is a bit of overkill perhaps, because only the RTE_RELATION case can be needed by these other callers, but it seemed like a sane refactorization. I've not tested, but I imagine this will need patched all the way back. The rule case should be reachable in all supported versions. I tested this against HEAD (+v69 of the DDL replication patch). My cases are now all passing. The code looks good to me -- I don't know if moving that logic is overkill, but it makes the solution relatively clean. I didn't test in any back branches yet, but given this can generate an invalid function body, it does likely need to be backpatched. Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH] Add pretty-printed XML output option
On 17.02.23 01:08, Andrey Borodin wrote: On Thu, Feb 16, 2023 at 2:12 PM Jim Jones wrote: I've looked into the patch. The code looks to conform to usual expectations. One nit: this comment should have just one asterisk. + /** Thanks for reviewing! Asterisk removed in v14. And I have a dumb question: is this function protected from using external XML namespaces? What if the user passes some xmlns that will force it to read namespace data from the server filesystem? Or is it not possible? I see there are a lot of calls to xml_parse() anyway, but still... According to the documentation,[1] such validations are not supported. /"The |xml| type does not validate input values against a document type declaration (DTD), even when the input value specifies a DTD. There is also currently no built-in support for validating against other XML schema languages such as XML Schema."/ But I'll have a look at the code to be sure :) Best, Jim 1- https://www.postgresql.org/docs/15/datatype-xml.html From 44825f436e9c8f06a9bea3ed5966ef73bab208a9 Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Thu, 16 Feb 2023 22:36:19 +0100 Subject: [PATCH v14] Add pretty-printed XML output option This small patch introduces a XML pretty print function. It basically takes advantage of the indentation feature of xmlDocDumpFormatMemory from libxml2 to format XML strings. --- doc/src/sgml/func.sgml | 34 ++ src/backend/utils/adt/xml.c | 44 src/include/catalog/pg_proc.dat | 3 + src/test/regress/expected/xml.out | 101 src/test/regress/expected/xml_1.out | 53 +++ src/test/regress/expected/xml_2.out | 101 src/test/regress/sql/xml.sql| 33 + 7 files changed, 369 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e09e289a43..a621192425 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14861,6 +14861,40 @@ SELECT xmltable.* ]]> + + +xmlformat + + + xmlformat + + + +xmlformat ( xml ) xml + + + + Converts the given XML value to pretty-printed, indented text. 
+ + + + Example: + + + + diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 079bcb1208..e96cbf65a7 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -473,6 +473,50 @@ xmlBuffer_to_xmltype(xmlBufferPtr buf) } #endif +Datum +xmlformat(PG_FUNCTION_ARGS) +{ +#ifdef USE_LIBXML + + xmlDocPtr doc; + xmlChar*xmlbuf = NULL; + text *arg = PG_GETARG_TEXT_PP(0); + StringInfoData buf; + int nbytes; + + doc = xml_parse(arg, XMLOPTION_DOCUMENT, false, GetDatabaseEncoding(), NULL); + + if(!doc) + elog(ERROR, "could not parse the given XML document"); + + /* + * xmlDocDumpFormatMemory ( + * xmlDocPtr doc, # the XML document + * xmlChar **xmlbuf, # the memory pointer + * int *nbytes, # the memory length + * int format # 1 = node indenting) + */ + + xmlDocDumpFormatMemory(doc, &xmlbuf, &nbytes, 1); + + xmlFreeDoc(doc); + + if(!nbytes) + elog(ERROR, "could not indent the given XML document"); + + initStringInfo(&buf); + appendStringInfoString(&buf, (const char *)xmlbuf); + + xmlFree(xmlbuf); + + PG_RETURN_XML_P(stringinfo_to_xmltype(&buf)); + +#else + NO_XML_SUPPORT(); +return 0; +#endif +} + Datum xmlcomment(PG_FUNCTION_ARGS) diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index c0f2a8a77c..54e8a6262a 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -8842,6 +8842,9 @@ { oid => '3053', descr => 'determine if a string is well formed XML content', proname => 'xml_is_well_formed_content', prorettype => 'bool', proargtypes => 'text', prosrc => 'xml_is_well_formed_content' }, +{ oid => '4642', descr => 'Indented text from xml', + proname => 'xmlformat', prorettype => 'xml', + proargtypes => 'xml', prosrc => 'xmlformat' }, # json { oid => '321', descr => 'I/O', diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out index 3c357a9c7e..e45116aaa7 100644 --- a/src/test/regress/expected/xml.out +++ b/src/test/regress/expected/xml.out @@ -1599,3 +1599,104 @@ SELECT * FROM XMLTABLE('.' PASSING XMLELEMENT(NAME a) columns a varchar(20) PATH |(1 row) +-- XML format: single line XML string +SELECT xmlformat('Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650'); +xmlformat +-- + + + + + Belgian Waffles
RE: pg_init_privs corruption.
> Kirill Reshke writes: > > As you can see, after drop role there is invalid records in > > pg_init_privs system relation. After this, pg_dump generate sql > > statements, some of which are based on content of pg_init_privs, resulting > in invalid dump. > This is as far as I can see the same case as what I reported a few years ago here: https://www.postgresql.org/message-id/flat/1574068566573.13088%40Optiver.com#488bd647ce6f5d2c92764673a7c58289 There was a discussion with some options, but no fix back then. -Floris
Re: recovery modules
Hi, On 2023-02-16 13:58:10 -0800, Nathan Bossart wrote: > On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: > > I'm quite baffled by: > > /* Close any files left open by copy_file() or compare_files() > > */ > > AtEOSubXact_Files(false, InvalidSubTransactionId, > > InvalidSubTransactionId); > > > > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() > > completely outside the context of a transaction environment. And it only > > does > > the thing you want because you pass parameters that aren't actually valid in > > the normal use in AtEOSubXact_Files(). I really don't understand how that's > > supposed to be ok. > > Hm. Should copy_file() and compare_files() have PG_FINALLY blocks that > attempt to close the files instead? What would you recommend? I don't fully know; it's not entirely clear to me what the goals here were. I think you'd likely need to do a bit of infrastructure work to do this sanely. So far we just haven't had the need to release files in the way you want to do there. I suspect a good direction would be to use resource owners. Add a separate set of functions that release files on resource owner release. Most of the infrastructure is there already, for temporary files (c.f. OpenTemporaryFile()). Then that resource owner could be reset in case of error. I'm not even sure that erroring out is a reasonable way to implement copy_file(), compare_files(), particularly because you want to return via a return code from basic_archive_files(). Greetings, Andres Freund
Re: wrong query result due to wang plan
Richard Guo writes: > It occurs to me that we can leverage JoinDomain to tell if we are below > the nullable side of a higher-level outer join if the clause is not an > outer-join clause. If we are belew outer join, the current JoinDomain > is supposed to be a proper subset of top JoinDomain. Otherwise the > current JoinDomain must be equal to top JoinDomain. And that leads to a > fix as code changes below. That doesn't look right at all: surely this situation can occur further down in a join tree, not only just below top level. I thought about this some more and realized that the whole business of trying to push a qual up the join tree is probably obsolete. In the first place, there's more than one problem with assigning the ON FALSE condition the required_relids {t2, t3, t4, t2/t4}. As you say, it causes bogus conclusions about which joinrel can be considered empty if we commute the outer joins' order. But also, doing it this way loses the information that t2/t3 can be considered empty, if we do the joins in an order where that is useful to know. In the second place, I think recording the info that t2/t3 is empty is probably sufficient now, because of the mark_dummy_rel/is_dummy_rel bookkeeping in joinrels.c (which did not exist when we first added this "push to top of tree" hack). If we know t2/t3 is empty then we will propagate that knowledge to {t2, t3, t4, t2/t4} when it's formed, without needing to have a clause that can be applied at that join level. So that leads to a conclusion that we could just forget the whole thing and always use the syntactic qualscope here. I tried that and it doesn't break any existing regression tests, which admittedly doesn't prove a lot in this area :-(. However, we'd then need to decide what to do in process_implied_equality, which is annotated as * "qualscope" is the nominal syntactic level to impute to the restrictinfo. * This must contain at least all the rels used in the expressions, but it * is used only to set the qual application level when both exprs are * variable-free. (Hence, it should usually match the join domain in which * the clause applies.) and indeed equivclass.c is passing a join domain relid set. It's not hard to demonstrate that that path is also broken, for instance explain (costs off) select * from int8_tbl t1 left join (int8_tbl t2 inner join int8_tbl t3 on (t2.q1-t3.q2) = 0 and (t2.q1-t3.q2) = 1 left join int8_tbl t4 on t2.q2 = t4.q2) on t1.q1 = t2.q1; One idea I played with is that we could take the join domain relid set and subtract off the OJ relid and RHS relids of any fully-contained SpecialJoinInfos, reasoning that we can reconstruct the fact that those won't make the join domain's overall result nondummy, and thereby avoiding the possibility of confusion if we end up commuting with one of those joins. This feels perhaps overly complicated, though, compared to the brain-dead-simple "use the syntactic scope" approach. Maybe we can get equivclass.c to do something equivalent to that? Or maybe we only need to use this complex rule in process_implied_equality? (I'm also starting to realize that the current basically-syntactic definition of join domains isn't really going to work with commutable outer joins, so maybe the ultimate outcome is that the join domain itself is defined in this more narrowly scoped way. But that feels like a task to tackle later.) regards, tom lane
Re: Missing free_var() at end of accum_sum_final()?
Hi, On 2023-02-17 11:48:14 +0900, Michael Paquier wrote: > On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote: > > But why do we need it? Most SQL callable functions don't need to be careful > > about not leaking O(1) memory, the exception being functions backing btree > > opclasses. > > > > In fact, the detailed memory management often is *more* expensive than just > > relying on the calling memory context being reset. > > > > Of course, numeric.c doesn't really seem to have gotten that message, so > > there's a consistency argument here. > > I don't know which final result is better. The arguments go two ways: > 1) Should numeric.c be simplified so as its memory structure with extra > pfree()s, making it more consistent with more global assumptions than > just this file? This has the disadvantage of creating more noise in > backpatching, while increasing the risk of leaks if some of the > removed parts are allocated in a tight loop within the same context. > This makes memory management less complicated. That's how I am > understanding your point. It's not just simplification, it's just faster to free via context reset. I whipped up a random query exercising numeric math a bunch: SELECT max(a + b + '17'::numeric + c) FROM (SELECT generate_series(1::numeric, 1000::numeric)) aa(a), (SELECT generate_series(1::numeric, 100::numeric)) bb(b), (SELECT generate_series(1::numeric, 10::numeric)) cc(c); Removing the free_var()s from numeric_add_opt_error() speeds it up from ~361ms to ~338ms. This code really needs some memory management overhead reduction love. Many allocation could be avoided by having a small on-stack "buffer" that's used unless the numeric is large. > 2) Should the style within numeric.c be more consistent? This is how > I am understanding this proposal. As you quote, this makes memory > management more complicated (not convinced about that for the internal > of numerics?), while making the file more consistent. > At the end, perhaps that's not worth bothering, but 2) prevails when > it comes to the rule of making some code consistent with its > surroundings. 1) has more risks seeing how old this code is. I'm a bit wary that this will trigger a stream of patches to pointlessly free things, causing churn and slowdowns. I suspect there's other places in numeric.c where we don't free, and there certainly are a crapton in other functions. Greetings, Andres Freund
Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error
On 2023-Feb-13, Andres Freund wrote: > There's something wrong with the patch, it reliably fails with core dumps: > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3260 I think this would happen on machines where sizeof(bool) is not 1 (which mine is evidently not). Fixed. In addition, there was the problem that the psprintf() to generate the command name would race against each other if you had multiple threads. I changed the code so that the name to prepare each statement under is generated when the Command struct is first initialized, which occurs before the threads are started. One small issue is that now we use a single counter for all commands of all scripts, rather than a script-local counter. This doesn't seem at all important. I did realize that Nagata-san was right that we've always prepared the whole script in advance; that behavior was there already in commit 49639a7b2c52 that introduced -Mprepared. We've never done each command just before executing it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ Y una voz del caos me habló y me dijo "Sonríe y sé feliz, podría ser peor". Y sonreí. Y fui feliz. Y fue peor. >From 0728193a5f02d0dd6a1f3ec5fef314aec646ba33 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 17 Feb 2023 21:01:15 +0100 Subject: [PATCH v8] pgbench: Prepare commands in pipelines in advance Failing to do so results in an error when a pgbench script starts a serializable transaction inside a pipeline. --- src/bin/pgbench/pgbench.c| 155 +-- src/bin/pgbench/t/001_pgbench_with_server.pl | 20 +++ 2 files changed, 126 insertions(+), 49 deletions(-) diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 508ed218e8..38e0830e7e 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -628,7 +628,8 @@ typedef struct pg_time_usec_t txn_begin; /* used for measuring schedule lag times */ pg_time_usec_t stmt_begin; /* used for measuring statement latencies */ - bool prepared[MAX_SCRIPTS]; /* whether client prepared the script */ + /* whether client prepared each command of each script */ + bool **prepared; /* * For processing failures and repeating transactions with serialization @@ -733,12 +734,13 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"}; * argv Command arguments, the first of which is the command or SQL *string itself. For SQL commands, after post-processing *argv[0] is the same as 'lines' with variables substituted. - * varprefix SQL commands terminated with \gset or \aset have this set + * varprefix SQL commands terminated with \gset or \aset have this set *to a non NULL value. If nonempty, it's used to prefix the *variable name that receives the value. * aset do gset on all possible queries of a combined query (\;). * expr Parsed expression, if needed. * stats Time spent in this command. + * prepname The name that this command is prepared under, in prepare mode * retries Number of retries after a serialization or deadlock error in the *current command. * failures Number of errors in the current command that were not retried. 
@@ -754,6 +756,7 @@ typedef struct Command char *varprefix; PgBenchExpr *expr; SimpleStats stats; + char *prepname; int64 retries; int64 failures; } Command; @@ -3006,13 +3009,6 @@ runShellCommand(Variables *variables, char *variable, char **argv, int argc) return true; } -#define MAX_PREPARE_NAME 32 -static void -preparedStatementName(char *buffer, int file, int state) -{ - sprintf(buffer, "P%d_%d", file, state); -} - /* * Report the abortion of the client when processing SQL commands. */ @@ -3053,6 +3049,87 @@ chooseScript(TState *thread) return i - 1; } +/* + * Prepare the SQL command from st->use_file at command_num. + */ +static void +prepareCommand(CState *st, int command_num) +{ + Command*command = sql_script[st->use_file].commands[command_num]; + + /* No prepare for non-SQL commands */ + if (command->type != SQL_COMMAND) + return; + + /* + * If not already done, allocate space for 'prepared' flags: one boolean + * for each command of each script. + */ + if (!st->prepared) + { + st->prepared = pg_malloc(sizeof(bool *) * num_scripts); + for (int i = 0; i < num_scripts; i++) + { + ParsedScript *script = &sql_script[i]; + int numcmds; + + for (numcmds = 0; script->commands[numcmds] != NULL; numcmds++) +; + st->prepared[i] = pg_malloc0(sizeof(bool) * numcmds); + } + } + + if (!st->prepared[st->use_file][command_num]) + { + PGresult *res; + + pg_log_debug("client %d preparing %s", st->id, command->prepname); + res = PQprepare(st->con, command->prepname, + command->argv[0], command->argc - 1, NULL); + if (PQresultStatus(res) != PGRES_COMMAND_OK) + pg_log_error("%s", PQerrorMessage(st->con)); + PQclear(res); + st->prepared[st->use_file][command_nu
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 10:09 -0800, Andres Freund wrote: > -1. That's just going to cause pain one major version upgrade further > down the > line. Why would we want to incur that pain? OK, we can just always do the fixup as long as the old one is libc and the new one is ICU. I'm just trying to avoid this becoming a general mechanism to fix up an incompatible new cluster. > > There's already a check that the new cluster is empty, so I think > > it's > > safe to hack the pg_database locale fields. > > I don't think we need to, we do issue the CREATE DATABASEs. So we > just need to > make sure that includes the collation provider info, and the proper > template > database, in pg_upgrade mode. We must fixup template1/postgres in the new cluster (change it to libc to match the old cluster), because any objects existing in those databases in the old cluster may depend on the default collation. I don't see how we can do that without updating pg_database, so I'm not following your point. (I think you're right that template0 is optional; but since we're fixing up the other databases it would be less surprising if we also fixed up template0.) And if we do fixup template0/template1/postgres to match the old cluster, then CREATE DATABASE will have no issue. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: pg_init_privs corruption.
Floris Van Nee writes: > This is as far as I can see the same case as what I reported a few years ago > here: > https://www.postgresql.org/message-id/flat/1574068566573.13088%40Optiver.com#488bd647ce6f5d2c92764673a7c58289 > There was a discussion with some options, but no fix back then. Hmm, so Stephen was opining that the extension's objects shouldn't have gotten these privs attached in the first place. I'm not quite convinced about that one way or the other, but if you buy it then maybe this situation is unreachable once we fix that. I'm not sure though. It's still clear that we are making ACL entries that aren't reflected in pg_shdepend, and that seems bad. regards, tom lane
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 18:27 +0100, Pavel Stehule wrote: > Today I tested icu for Czech sorting. It is a little bit slower, but > not too much, but it produces partially different results. Thank you for trying it. If it's a significant slowdown, can you please send more information? ICU version, libc version, and testcase? > select row_number() over (order by nazev collate "cs-x-icu"), nazev > from obce > except select row_number() over (order by nazev collate "cs_CZ"), > nazev from obce; > > returns a not empty set. So minimally for Czech collate, an index > rebuild is necessary Yes, that's true of any locale change, provider change, or even provider version change. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: The output sql generated by pg_dump for a create function refers to a modified table name
"Jonathan S. Katz" writes: > On 2/17/23 1:18 PM, Tom Lane wrote: >> It can be reproduced with INSERT too, on the same principle as the others: >> put the DML command inside a WITH, and give it an alias conflicting with >> the outer query. > Ah, I see based on your example below. I did not alias the INSERT > statement in the way (and I don't know how common of a pattern it is to > o that). I suppose you can also make examples where the true name of the DML target table conflicts with an outer-query name, implying that we need to give it an alias even though the user wrote none. > I tested this against HEAD (+v69 of the DDL replication patch). My cases > are now all passing. > The code looks good to me -- I don't know if moving that logic is > overkill, but it makes the solution relatively clean. Cool, thanks for testing and code-reading. I'll go see about back-patching. > I didn't test in any back branches yet, but given this can generate an > invalid function body, it does likely need to be backpatched. Presumably it can also cause dump/restore failures for rules that do this sort of thing, though admittedly those wouldn't be common. regards, tom lane
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-17 12:36:05 -0800, Jeff Davis wrote: > > > There's already a check that the new cluster is empty, so I think > > > it's > > > safe to hack the pg_database locale fields. > > > > I don't think we need to, we do issue the CREATE DATABASEs. So we > > just need to > > make sure that includes the collation provider info, and the proper > > template > > database, in pg_upgrade mode. > > We must fixup template1/postgres in the new cluster (change it to libc > to match the old cluster), because any objects existing in those > databases in the old cluster may depend on the default collation. I > don't see how we can do that without updating pg_database, so I'm not > following your point. I think we just drop/recreate template1 and postgres during pg_upgrade. Yep, looks like it. See create_new_objects(): /* * template1 database will already exist in the target installation, * so tell pg_restore to drop and recreate it; otherwise we would fail * to propagate its database-level properties. */ create_opts = "--clean --create"; and then: /* * postgres database will already exist in the target installation, so * tell pg_restore to drop and recreate it; otherwise we would fail to * propagate its database-level properties. */ if (strcmp(old_db->db_name, "postgres") == 0) create_opts = "--clean --create"; else create_opts = "--create"; Greetings, Andres Freund
Reducing connection overhead in pg_upgrade compat check phase
When adding a check to pg_upgrade a while back I noticed in a profile that the cluster compatibility check phase spent a lot of time in connectToServer. Some of this can be attributed to the data type checks, each of which runs serially, in turn connecting to each database to run the check, and this seemed like a place where we can do better. The attached patch moves the checks from individual functions, each of which loops over all databases, into a struct which is consumed by a single umbrella check where all data type queries are executed against a database using the same connection. This way we can amortize the connectToServer overhead across more accesses to the database. In the trivial case, a single database, I don't see a reduction in performance over the current approach. In a cluster with 100 (empty) databases there is a ~15% reduction in time to run a --check pass. While it won't move the earth in terms of wallclock time, consuming fewer resources on the old cluster and making --check cheaper might be the bigger win. -- Daniel Gustafsson 0001-pg_upgrade-run-all-data-type-checks-per-connection.patch Description: Binary data
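Since the patch itself is attached as binary data above, here is a rough sketch of the table-driven shape described: the struct fields, the example check, and the umbrella loop are illustrative only and may differ from what the attached patch actually does.

/* Sketch only -- would live alongside the other checks in src/bin/pg_upgrade/check.c. */
typedef struct
{
	const char *status;			/* status message printed while checking */
	const char *report_text;	/* error text if the check finds anything */
	const char *base_query;		/* catalog query returning offending columns */
} DataTypesUsageChecks;

static const DataTypesUsageChecks data_types_usage_checks[] = {
	{
		"Checking for incompatible \"line\" data type",
		"Your installation contains the \"line\" data type in user tables.",
		"SELECT c.relname, a.attname "
		"FROM pg_catalog.pg_class c, pg_catalog.pg_attribute a "
		"WHERE c.oid = a.attrelid AND NOT a.attisdropped "
		"  AND a.atttypid = 'pg_catalog.line'::pg_catalog.regtype"
	},
	/* ... one entry per data type check ... */
	{NULL, NULL, NULL}
};

/*
 * Umbrella check: one connection per database, with every data type query
 * run over that same connection, instead of one connection per check and
 * per database.
 */
static void
check_data_types_usage(ClusterInfo *cluster)
{
	for (int dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
	{
		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
		PGconn	   *conn = connectToServer(cluster, active_db->db_name);

		for (const DataTypesUsageChecks *check = data_types_usage_checks;
			 check->status != NULL; check++)
		{
			PGresult   *res;

			prep_status("%s", check->status);
			res = executeQueryOrDie(conn, "%s", check->base_query);
			if (PQntuples(res) > 0)
				pg_fatal("%s", check->report_text);
			PQclear(res);
			check_ok();
		}

		PQfinish(conn);
	}
}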
archive modules loose ends
Andres recently reminded me of some loose ends in archive modules [0], so I'm starting a dedicated thread to address his feedback. The first one is the requirement that archive module authors create their own exception handlers if they want to make use of ERROR. Ideally, there would be a handler in pgarch.c so that authors wouldn't need to deal with this. I do see some previous discussion about this [1] in which I expressed concerns about memory management. Looking at this again, I may have been overthinking it. IIRC I was thinking about creating a memory context that would be switched into for only the archiving callback (and reset afterwards), but that might not be necessary. Instead, we could rely on module authors to handle this. One example is basic_archive, which maintains its own memory context. Alternatively, authors could simply pfree() anything that was allocated. Furthermore, by moving the exception handling to pgarch.c, module authors can begin using PG_TRY, etc. in their archiving callbacks, which simplifies things a bit. I've attached a work-in-progress patch for this change. On Fri, Feb 17, 2023 at 11:41:32AM -0800, Andres Freund wrote: > On 2023-02-16 13:58:10 -0800, Nathan Bossart wrote: >> On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: >> > I'm quite baffled by: >> >/* Close any files left open by copy_file() or compare_files() >> > */ >> >AtEOSubXact_Files(false, InvalidSubTransactionId, >> > InvalidSubTransactionId); >> > >> > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() >> > completely outside the context of a transaction environment. And it only >> > does >> > the thing you want because you pass parameters that aren't actually valid >> > in >> > the normal use in AtEOSubXact_Files(). I really don't understand how >> > that's >> > supposed to be ok. >> >> Hm. Should copy_file() and compare_files() have PG_FINALLY blocks that >> attempt to close the files instead? What would you recommend? > > I don't fully know, it's not entirely clear to me what the goals here were. I > think you'd likely need to do a bit of infrastructure work to do this > sanely. So far we just didn't have the need to handle files being released in > a way like you want to do there. > > I suspect a good direction would be to use resource owners. Add a separate set > of functions that release files on resource owner release. Most of the > infrastructure is there already, for temporary files > (c.f. OpenTemporaryFile()). > > Then that resource owner could be reset in case of error. > > > I'm not even sure that erroring out is a reasonable way to implement > copy_file(), compare_files(), particularly because you want to return via a > return code from basic_archive_files(). To initialize this thread, I'll provide a bit more background. basic_archive makes use of copy_file(), and it introduces a function called compare_files() that is used to check whether two files have the same content. These functions make use of OpenTransientFile() and CloseTransientFile(). In basic_archive's sigsetjmp() block, there's a call to AtEOSubXact_Files() to make sure we close any files that are open when there is an ERROR. IIRC I was following the example set by other processes that make use of the AtEOXact* functions in their sigsetjmp() blocks. Looking again, I think AtEOXact_Files() would also work for basic_archive's use-case. That would at least avoid the hack of using InvalidSubTransactionId for the second and third arguments.
>From the feedback quoted above, it sounds like improving this further will require a bit of infrastructure work. I haven't looked too deeply into this yet. [0] https://postgr.es/m/20230216192956.mhi6uiakchkolpki%40awork3.anarazel.de [1] https://postgr.es/m/20220202224433.GA1036711%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index cd852888ce..2c51d2b100 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -172,7 +172,6 @@ basic_archive_configured(ArchiveModuleState *state) static bool basic_archive_file(ArchiveModuleState *state, const char *file, const char *path) { - sigjmp_buf local_sigjmp_buf; MemoryContext oldcontext; BasicArchiveData *data = (BasicArchiveData *) state->private_data; MemoryContext basic_archive_context = data->context; @@ -184,51 +183,22 @@ basic_archive_file(ArchiveModuleState *state, const char *file, const char *path */ oldcontext = MemoryContextSwitchTo(basic_archive_context); - /* - * Since the archiver operates at the bottom of the exception stack, - * ERRORs turn into FATALs and cause the archiver process to restart. - * However, using ereport(ERROR, ...) when there are problems is easy to - * code and maintain. Therefore, we create our own exception handler to - * catch ERRORs and return fals
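As a concrete illustration of what the WIP patch enables: once pgarch.c owns the top-level exception handler, a module's archiving callback can use an ordinary PG_TRY block for its own cleanup instead of rolling its own sigsetjmp() handling. A minimal, hypothetical sketch (not basic_archive itself; the module-init glue is omitted):

#include "postgres.h"

#include <fcntl.h>

#include "archive/archive_module.h"
#include "storage/fd.h"

/*
 * Hypothetical archiving callback.  It relies on pgarch.c to catch any
 * ERROR thrown here, and only uses PG_TRY/PG_CATCH to make sure its
 * transient file is closed on the way out.
 */
static bool
my_archive_file_cb(ArchiveModuleState *state, const char *file, const char *path)
{
	int			srcfd;

	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
	if (srcfd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));

	PG_TRY();
	{
		/* ... copy or verify the segment here, ereport(ERROR) on trouble ... */
	}
	PG_CATCH();
	{
		/* Close the file ourselves; pgarch.c handles the rest of the ERROR. */
		CloseTransientFile(srcfd);
		PG_RE_THROW();
	}
	PG_END_TRY();

	CloseTransientFile(srcfd);
	return true;
}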
Re: [PATCH] Add pretty-printed XML output option
On Fri, Feb 17, 2023 at 1:14 AM Jim Jones wrote: > After your comment I'm studying the possibility to extend the existing > xmlserialize function to add the indentation feature. If so, how do you > think it should look like? An extra parameter? e.g. > > SELECT xmlserialize(DOCUMENT '42'::XML AS text, > true); > > .. or more or like Oracle does it > > SELECT XMLSERIALIZE(DOCUMENT xmltype('42') AS BLOB > INDENT) > FROM dual; > My idea was to follow the SQL standard (part 14, SQL/XML); unfortunately, there is no free version, but there are drafts at http://www.wiscorp.com/SQLStandards.html. ::= XMLSERIALIZE [ ] AS [ ] [ ] [ ] [ ] ::= [ NO ] INDENT Oracle's extension SIZE=n also seems interesting (including a special case SIZE=0, which means using new-line characters without spaces for each line).
Re: recovery modules
On Fri, Feb 17, 2023 at 05:01:47PM +0900, Michael Paquier wrote: > All that had better > be put into their own threads, IMO, to bring more visibility on these > subjects. I created a new thread for these [0]. > Saying that, I have spent more time on the revamped version of the > archive modules and it was already doing a lot, so I have applied > it as it covered all the points discussed.. Thanks! [0] https://postgr.es/m/20230217215624.GA3131134%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On Fri, Feb 17, 2023 at 03:13:32PM +0100, Peter Eisentraut wrote: > On 16.02.23 22:29, Andres Freund wrote: >> What's the story behind 100_bugs.pl? This name clearly is copied from >> src/test/subscription/t/100_bugs.pl - but I've never understood why that is >> outside of the normal numbering space. > > Mainly to avoid awkwardness for backpatching. The number of tests in > src/test/subscription/ varies quite a bit across branches. I'm happy to move this new test to wherever folks think it should go. I'll look around to see if I can find a better place, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: recovery modules
Here is a new revision of the restore modules patch set. In this patch set, the interface looks similar to the recent archive modules redesign, and there are separate callbacks for retrieving different types of files. I've attempted to address all the feedback I've received, but there was a lot scattered across different threads, so it's possible I've missed something. Note that 0001 is the stopgap fix for restore_command that's being tracked elsewhere [0]. I was careful to avoid repeating the recent mistake with the SIGTERM handling. This patch set is still a little rough around the edges, but I wanted to post it in case folks had general thoughts about the structure, interface, etc. This implementation restores files synchronously one-by-one just like archive modules, but in the future, I would like to add asynchronous/parallel/batching support. My intent is for this work to move us closer to that. [0] https://postgr.es/m/20230214174755.GA1348509%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 64651ae3c56160fea31d15a78f07c8c12950ec99 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 14 Feb 2023 09:44:53 -0800 Subject: [PATCH v12 1/6] stopgap fix for restore_command --- src/backend/access/transam/xlogarchive.c | 15 +++ src/backend/postmaster/startup.c | 20 +++- src/backend/storage/ipc/ipc.c| 3 +++ src/backend/storage/lmgr/proc.c | 2 ++ 4 files changed, 35 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index fcc87ff44f..41684418b6 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -159,20 +159,27 @@ RestoreArchivedFile(char *path, const char *xlogfname, (errmsg_internal("executing restore command \"%s\"", xlogRestoreCmd))); + fflush(NULL); + pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); + /* - * Check signals before restore command and reset afterwards. + * PreRestoreCommand() informs the SIGTERM handler for the startup process + * that it should proc_exit() right away. This is done for the duration of + * the system() call because there isn't a good way to break out while it + * is executing. Since we might call proc_exit() in a signal handler, it + * is best to put any additional logic before or after the + * PreRestoreCommand()/PostRestoreCommand() section. */ PreRestoreCommand(); /* * Copy xlog from archival storage to XLOGDIR */ - fflush(NULL); - pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); rc = system(xlogRestoreCmd); - pgstat_report_wait_end(); PostRestoreCommand(); + + pgstat_report_wait_end(); pfree(xlogRestoreCmd); if (rc == 0) diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c index efc2580536..de2b56c2fa 100644 --- a/src/backend/postmaster/startup.c +++ b/src/backend/postmaster/startup.c @@ -19,6 +19,8 @@ */ #include "postgres.h" +#include + #include "access/xlog.h" #include "access/xlogrecovery.h" #include "access/xlogutils.h" @@ -121,7 +123,23 @@ StartupProcShutdownHandler(SIGNAL_ARGS) int save_errno = errno; if (in_restore_command) - proc_exit(1); + { + /* + * If we are in a child process (e.g., forked by system() in + * RestoreArchivedFile()), we don't want to call any exit callbacks. + * The parent will take care of that. 
+ */ + if (MyProcPid == (int) getpid()) + proc_exit(1); + else + { + const char msg[] = "StartupProcShutdownHandler() called in child process"; + int rc pg_attribute_unused(); + + rc = write(STDERR_FILENO, msg, sizeof(msg)); + _exit(1); + } + } else shutdown_requested = true; WakeupRecovery(); diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c index 1904d21795..6796cabc3e 100644 --- a/src/backend/storage/ipc/ipc.c +++ b/src/backend/storage/ipc/ipc.c @@ -103,6 +103,9 @@ static int on_proc_exit_index, void proc_exit(int code) { + /* proc_exit() is not safe in forked processes from system(), etc. */ + Assert(MyProcPid == getpid()); + /* Clean up everything that must be cleaned up */ proc_exit_prepare(code); diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 22b4278610..ae845e8249 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg) dlist_head *procgloballist; Assert(MyProc != NULL); + Assert(MyProcPid == getpid()); /* not safe if forked by system(), etc. */ /* Make sure we're out of the sync rep lists */ SyncRepCleanupAtProcExit(); @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg) PGPROC *proc; Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS); + Assert(MyProcPid == getpid()); /* not safe if forked by system(), etc. */ auxproc = &AuxiliaryProcs[proctype]; -- 2.25.1 >From 2ef84ed51ba934c7979e2a7
Re: Reducing connection overhead in pg_upgrade compat check phase
On Fri, Feb 17, 2023 at 10:44:49PM +0100, Daniel Gustafsson wrote: > In the trivial case, a single database, I don't see a reduction of performance > over the current approach. In a cluster with 100 (empty) databases there is a > ~15% reduction in time to run a --check pass. While it won't move the earth > in > terms of wallclock time, consuming less resources on the old cluster allowing > --check to be cheaper might be the bigger win. Nice! This has actually been on my list of things to look into, so I intend to help review the patch. In any case, +1 for making pg_upgrade faster. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Move defaults toward ICU in 16?
On Fri, 2023-02-17 at 12:50 -0800, Andres Freund wrote: > I think we just drop/recreate template1 and postgres during > pg_upgrade. Thank you, that makes much more sense now. I was confused because pg_upgrade loops through to check compatibility with all the databases, which makes zero sense if it's going to drop all of them except template0 anyway. The checks on template1/postgres should be bypassed. So the two approaches are: 1. Don't bother with locale provider compatibility checks at all (even on template0). The emitted CREATE DATABASE statements already specify the locale provider, so that will take care of everything except template0. Maybe the locale provider of template0 shouldn't matter, but some users might be surprised if it changes during upgrade. It also raises some questions about the other properties of template0 like encoding, lc_collate, and lc_ctype, which also don't matter a whole lot (because they can all be changed when using template0 as a template). 2. Update the pg_database entry for template0. This has less potential for surprise in case people are actually using template0 for a template. -- Jeff Davis PostgreSQL Contributor Team - AWS
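For what it's worth, option 2 could presumably reuse the pattern pg_upgrade already has in set_frozenxids(), i.e. a direct catalog UPDATE over a connection to the new cluster. A rough, illustrative sketch only, assuming the v15/v16 pg_database columns and an old cluster using libc; this is not a worked-out patch, and whether such a fix-up is acceptable is exactly what is being discussed above.

/*
 * Illustrative only: force the new cluster's template0/template1/postgres
 * entries back to the libc provider to match the old cluster, in the
 * style of set_frozenxids().
 */
static void
set_template_locale_provider(void)
{
	PGconn	   *conn = connectToServer(&new_cluster, "template1");

	PQclear(executeQueryOrDie(conn,
							  "UPDATE pg_catalog.pg_database "
							  "SET datlocprovider = 'c', daticulocale = NULL "
							  "WHERE datname IN ('template0', 'template1', 'postgres')"));
	PQfinish(conn);
}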
Re: pg_walinspect memory leaks
On Thu, 2023-02-16 at 18:00 +0530, Bharath Rupireddy wrote: > I'm attaching the patches here. For HEAD, I'd > want to be a bit defensive and use the temporary memory context for > pg_get_wal_fpi_info() too. I don't see why we shouldn't backpatch that, too? Also, it seems like we should do the same thing for the loop in GetXLogSummaryStats(). Maybe just for the outer loop is fine (the inner loop is only 16 elements); though again, there's not an obvious downside to fixing that, too. -- Jeff Davis PostgreSQL Contributor Team - AWS
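For reference, the shape of the fix under discussion is roughly the following sketch; read_next_record(), fill_values_from_record(), NUM_OUTPUT_COLS, and rsinfo are stand-ins rather than the actual pg_walinspect code. Each iteration's palloc'd output values live in a short-lived context that is reset after the tuple is emitted, so memory use stays flat even over millions of WAL records.

MemoryContext tmp_cxt,
			old_cxt;

tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
								"temporary per-record context",
								ALLOCSET_DEFAULT_SIZES);

while (read_next_record())		/* stand-in for the real WAL record loop */
{
	Datum		values[NUM_OUTPUT_COLS];
	bool		nulls[NUM_OUTPUT_COLS];

	old_cxt = MemoryContextSwitchTo(tmp_cxt);

	/* pallocs text datums, arrays, etc. for one output row */
	fill_values_from_record(values, nulls);
	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);

	MemoryContextSwitchTo(old_cxt);
	MemoryContextReset(tmp_cxt);
}

MemoryContextDelete(tmp_cxt);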
Re: O(n) tasks cause lengthy startups and checkpoints
On Thu, Feb 02, 2023 at 09:48:08PM -0800, Nathan Bossart wrote: > rebased for cfbot another rebase for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 1c9b95cae7adcc57b7544a44ff16a26e71c6c736 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 5 Jan 2022 19:24:22 + Subject: [PATCH v20 1/4] Introduce custodian. The custodian process is a new auxiliary process that is intended to help offload tasks could otherwise delay startup and checkpointing. This commit simply adds the new process; it does not yet do anything useful. --- doc/src/sgml/glossary.sgml | 11 + src/backend/postmaster/Makefile | 1 + src/backend/postmaster/auxprocess.c | 8 + src/backend/postmaster/custodian.c | 377 src/backend/postmaster/meson.build | 1 + src/backend/postmaster/postmaster.c | 38 ++- src/backend/storage/ipc/ipci.c | 3 + src/backend/storage/lmgr/proc.c | 1 + src/backend/utils/activity/pgstat_io.c | 4 +- src/backend/utils/activity/wait_event.c | 3 + src/backend/utils/init/miscinit.c | 3 + src/include/miscadmin.h | 3 + src/include/postmaster/custodian.h | 32 ++ src/include/storage/proc.h | 11 +- src/include/utils/wait_event.h | 1 + 15 files changed, 491 insertions(+), 6 deletions(-) create mode 100644 src/backend/postmaster/custodian.c create mode 100644 src/include/postmaster/custodian.h diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml index 7c01a541fe..ad3f53e2a3 100644 --- a/doc/src/sgml/glossary.sgml +++ b/doc/src/sgml/glossary.sgml @@ -144,6 +144,7 @@ (but not the autovacuum workers), the background writer, the checkpointer, + the custodian, the logger, the startup process, the WAL archiver, @@ -484,6 +485,16 @@ + + Custodian (process) + + + An auxiliary process + that is responsible for executing assorted cleanup tasks. + + + + Data area diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile index 047448b34e..5f4dde85cf 100644 --- a/src/backend/postmaster/Makefile +++ b/src/backend/postmaster/Makefile @@ -18,6 +18,7 @@ OBJS = \ bgworker.o \ bgwriter.o \ checkpointer.o \ + custodian.o \ fork_process.o \ interrupt.o \ pgarch.o \ diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c index cae6feb356..a1f042f13a 100644 --- a/src/backend/postmaster/auxprocess.c +++ b/src/backend/postmaster/auxprocess.c @@ -20,6 +20,7 @@ #include "pgstat.h" #include "postmaster/auxprocess.h" #include "postmaster/bgwriter.h" +#include "postmaster/custodian.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/walreceiver.h" @@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype) case CheckpointerProcess: MyBackendType = B_CHECKPOINTER; break; + case CustodianProcess: + MyBackendType = B_CUSTODIAN; + break; case WalWriterProcess: MyBackendType = B_WAL_WRITER; break; @@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype) CheckpointerMain(); proc_exit(1); + case CustodianProcess: + CustodianMain(); + proc_exit(1); + case WalWriterProcess: WalWriterMain(); proc_exit(1); diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c new file mode 100644 index 00..98bb9efcfd --- /dev/null +++ b/src/backend/postmaster/custodian.c @@ -0,0 +1,377 @@ +/*- + * + * custodian.c + * + * The custodian process handles a variety of non-critical tasks that might + * otherwise delay startup, checkpointing, etc. Offloaded tasks should not + * be synchronous (e.g., checkpointing shouldn't wait for the custodian to + * complete a task before proceeding). 
However, tasks can be synchronously + * executed when necessary (e.g., single-user mode). The custodian is not + * an essential process and can shutdown quickly when requested. The + * custodian only wakes up to perform its tasks when its latch is set. + * + * + * Copyright (c) 2022, PostgreSQL Global Development Group + * + * + * IDENTIFICATION + * src/backend/postmaster/custodian.c + * + *- + */ +#include "postgres.h" + +#include "libpq/pqsignal.h" +#include "pgstat.h" +#include "postmaster/custodian.h" +#include "postmaster/interrupt.h" +#include "storage/bufmgr.h" +#include "storage/condition_variable.h" +#include "storage/fd.h" +#include "storage/proc.h" +#include "storage/procsignal.h" +#include "storage/smgr.h" +#include "utils/memutils.h" + +static void DoCustodianTasks(void); +static CustodianTask CustodianGetNextTask(void); +static void CustodianEnqueueTask(CustodianTask task); +static const struct cust_task_funcs_entry *LookupCustodianFunct
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
I still have no theory for how this condition was reached despite a lot of time thinking about it and searching for more clues. As far as I can tell, the recent improvements to postmaster's signal and event handling shouldn't be related: the state management and logic was unchanged. While failing to understand this, I worked[1] on CI log indexing tool with public reports that highlight this sort of thing[2], so I'll be watching out for more evidence. Unfortunately I have no data from before 1 Feb (cfbot previously wasn't interested in the past at all; I'd need to get my hands on the commit IDs for earlier testing but I can't figure out how to get those out of Cirrus or Github -- anyone know how?). FWIW I have a thing I call bfbot for slurping up similar data from the build farm. It's not pretty enough for public consumption, but I do know that this assertion hasn't failed there, except the cases I mentioned earlier, and a load of failures on lorikeet which was completely b0rked until recently. [1] https://xkcd.com/974/ [2] http://cfbot.cputube.org/highlights/assertion-90.html
Share variable between psql backends in CustomScan
Hi, I am looking for a way to define a global variable in CustomScan plugin that is shared between different psql backends. Is it possible without using shared memory? Does postgresql implement any function that facilitates this? Thank you, Amin
File* argument order, argument types
Hi, While trying to add additional File* functions (FileZero, FileFallocate) I went back and forth about the argument order between "amount" and "offset". We have: extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info); extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info); extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info); extern int FileTruncate(File file, off_t offset, uint32 wait_event_info); extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info); and I want to add (for [1]) extern int FileZero(File file, off_t amount, off_t offset, uint32 wait_event_info); extern int FileFallocate(File file, off_t amount, off_t offset, uint32 wait_event_info); The differences originate in trying to mirror the underlying functions' signatures: int posix_fadvise(int fd, off_t offset, off_t len, int advice); ssize_t pread(int fd, void buf[.count], size_t count, off_t offset); ssize_t pwrite(int fd, const void buf[.count], size_t count, off_t offset); int ftruncate(int fd, off_t length); int posix_fallocate(int fd, off_t offset, off_t len); int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags); It seems quite confusing to be this inconsistent about argument order and argument types in the File* functions. For one, the relation to the underlying posix functions isn't always obvious. For another, we're not actually mirroring the signatures all that well, since our argument and return types don't actually match. It'd be easy enough to decide on a set of types for the arguments; that'd be API compatible (though not necessarily ABI compatible, but we don't care). But changing the argument order would commonly lead to silent breakage, which obviously would be bad. Or maybe it's unlikely enough that there are external callers? I don't know what to actually propose. I guess the least bad I can see is to pick one type & argument order that we document to be the default, with a caveat placed above the functions not following the argument order. Order wise, I think we should choose amount, offset. For the return type we probably should pick ssize_t? I don't know what we should standardize on for 'amount', I'd probably be inclined to go for size_t. Greetings, Andres Freund [1] https://postgr.es/m/20221029025420.eplyow6k7tgu6...@awork3.anarazel.de
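Putting that together, the documented default might look something like the following; these are hypothetical declarations sketching the suggested convention (amount before offset, size_t amount, ssize_t return), not what fd.h currently contains:

/* default convention: amount before offset, size_t amount, ssize_t return */
extern ssize_t FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern ssize_t FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern ssize_t FileZero(File file, size_t amount, off_t offset, uint32 wait_event_info);
extern ssize_t FileFallocate(File file, size_t amount, off_t offset, uint32 wait_event_info);

/* documented exceptions, keeping the underlying call's (offset, amount) order */
extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);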
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Hi, On 2023-02-18 13:27:04 +1300, Thomas Munro wrote: > I still have no theory for how this condition was reached despite a > lot of time thinking about it and searching for more clues. As far as > I can tell, the recent improvements to postmaster's signal and event > handling shouldn't be related: the state management and logic was > unchanged. Yea, it's all very odd. If you look at the log: 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL: No valid identifier after ".". 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT: SELECT parse_ident('xxx.1020'); ... TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 5948 abort() has been called ... 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: server process (PID 5948) was terminated by exception 0xC354 2023-02-08 00:53:27.420 GMT postmaster[872] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: terminating any other active server processes 2023-02-08 00:53:27.434 GMT postmaster[872] LOG: all server processes terminated; reinitializing and that it's indeed the money test that failed: money... FAILED (test process exited with exit code 2) 7337 ms it's very hard to understand how this stack can come to be: 0085`f03ffa40 7ff6`fd89faa8 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77] 0085`f03ffa80 7ff6`fd6474dc postgres!ExceptionalCondition( char * conditionName = 0x7ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", char * fileName = 0x7ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c", int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67] 0085`f03ffac0 7ff6`fd676eff postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c @ 329] 0085`f03ffb00 7ff6`fd59aa3a postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375] 0085`f03ffb60 7ff6`fd467689 postgres!SubPostmasterMain( int argc = 0n3, char ** argv = 0x01c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962] 0085`f03ffd90 7ff6`fda0e1c9 postgres!main( int argc = 0n3, char ** argv = 0x01c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192] How can a process that we did notify crashing, that has already executed SQL statements, end up in MarkPostmasterChildActive()? > While failing to understand this, I worked[1] on CI log indexing tool > with public reports that highlight this sort of thing[2], so I'll be > watching out for more evidence. Unfortunately I have no data from > before 1 Feb (cfbot previously wasn't interested in the past at all; > I'd need to get my hands on the commit IDs for earlier testing but I > can't figure out how to get those out of Cirrus or Github -- anyone > know how?). FWIW I have a thing I call bfbot for slurping up similar > data from the build farm. It's not pretty enough for public > consumption, but I do know that this assertion hasn't failed there, > except the cases I mentioned earlier, and a load of failures on > lorikeet which was completely b0rked until recently. > [1] https://xkcd.com/974/ > [2] http://cfbot.cputube.org/highlights/assertion-90.html I think this extremely useful. Greetings, Andres Freund
Re: Add index scan progress to pg_stat_progress_vacuum
Thanks for the review! >+ >+ ParallelVacuumFinish >+ Waiting for parallel vacuum workers to finish index >vacuum. >+ >This change is out-of-date. That was an oversight. Thanks for catching. >Total number of indexes that will be vacuumed or cleaned up. This >number is reported as of the beginning of the vacuuming indexes phase >or the cleaning up indexes phase. This is cleaner. I was being unnecessarily verbose in the original description. >Number of indexes processed. This counter only advances when the phase >is vacuuming indexes or cleaning up indexes. I agree. >Also, index_processed sounds accurate to me. What do you think? At one point, I used index_processed, but decided to change it. "processed" makes sense also. I will use this. >I think these settings are not necessary since the pcxt is palloc0'ed. Good point. >Assert(pvs->pcxt->parallel_progress_callback_arg) looks wrong to me. >If 'arg' is NULL, a SEGV happens. Correct, Assert(pvs) is all that is needed. >I think it's better to update pvs->shared->nindexes_completed by both >leader and worker processes who processed the index. No reason for that, since only the leader process can report progress to backend_progress. >I think it's better to make the function type consistent with the >existing parallel_worker_main_type. How about >parallel_progress_callback_type? Yes, that makes sense. > I've attached a patch that incorporates the above comments and has >some suggestions of updating comments etc. I reviewed and incorporated these changes, with a slight change. See v24. -* Increase and report the number of index. Also, we reset the progress -* counters. +* Increase and report the number of index scans. Also, we reset the progress +* counters. Thanks -- Sami Imseih Amazon Web Services (AWS) v24-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch Description: v24-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
On Fri, Feb 17, 2023 at 9:43 AM Andres Freund wrote: > > On 2023-02-17 08:30:09 +0530, Amit Kapila wrote: > > Thanks, I was not completely sure about whether we need to bump > > XLOG_PAGE_MAGIC for this patch as this makes the additional space just > > by changing the datatype of one of the members of the existing WAL > > record. We normally change it for the addition/removal of new fields > > aka change in WAL record format, or addition of a new type of WAL > > record. Does anyone else have an opinion/suggestion on this matter? > > I'd definitely change it - the width of a field still means you can't really > parse the old WAL sensibly anymore. > Okay, thanks for your input. I'll push this patch early next week. -- With Regards, Amit Kapila.
Re: pg_upgrade and logical replication
On Fri, Feb 17, 2023 at 9:05 PM Julien Rouhaud wrote: > > On Fri, Feb 17, 2023 at 04:12:54PM +0530, Amit Kapila wrote: > > On Fri, Feb 17, 2023 at 1:24 PM Julien Rouhaud wrote: > > > > > > An easy workaround that I tried is to allow something like > > > > > > ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) > > > > > > so that the refresh internally happens before the apply worker is started > > > and > > > you just keep consuming the delta, which works on naive scenario. > > > > > > One concern I have with this approach is that the default values for both > > > "refresh" and "copy_data" for all other subcommands is "true, but we would > > > probably need a different default value in that exact scenario (as we > > > know we > > > already have the data). I think that it would otherwise be safe in my > > > very > > > specific scenario, assuming that you created the slot beforehand and > > > moved the > > > slot's LSN at the promotion point, as even if you add non-empty tables to > > > the > > > publication you will only need the delta whether those were initially > > > empty or > > > not given your initial physical replica state. > > > > > > > This point is not very clear. Why would one just need delta even for new > > tables? > > Because in my scenario I'm coming from physical replication, so I know that I > did replicate everything until the promotion LSN. Any table later added in > the > publication is either already fully replicated until that LSN on the upgraded > node, so only the delta is needed, or has been created after that LSN. In the > latter case, the entirety of the table will be replicated with the logical > replication as a delta right? > That makes sense to me. > > > Any other scenario would make > > > this new option dangerous, if not entirely useless, but not more than any > > > of > > > the current commands that lead to refreshing a subscription and have the > > > same > > > options I guess. > > > > > > All in all, currently the only way to somewhat safely resume logical > > > replication after a pg_upgrade is to drop all the subscriptions that were > > > transferred during pg_upgrade on all databases and recreate them (using > > > the > > > existing slots on the publisher side obviously), allowing the initial > > > connection. But this approach only works in the exact scenario I > > > mentioned > > > (physical to logical replication, or at least a case where *all* the > > > tables > > > where logically replicated prior to the pg_ugprade), otherwise you have to > > > recreate the follower node from scratch using logical repication. > > > > > > > I think if you dropped and recreated the subscriptions by retaining > > old slots, the replication should resume from where it left off before > > the upgrade. Which scenario are you concerned about? > > I'm concerned about people not coming from physical replication. If you just > had some "normal" logical replication, you can't assume that you already have > all the data from the upstream subscription. If it was modified and a non > empty table is added, you might need to copy the data of part of the tables > and > keep replicating for the rest. It's hard to be sure from a user point of > view, > and even if you knew you have no way to express it. > Can't the user create a separate publication for such newly added tables and a corresponding new subscription on the downstream node? Now, I think it would be a bit tricky if the user already has a publication defined with FOR ALL TABLES. 
In that case, we probably need some way to specify FOR ALL TABLES EXCEPT (list of tables) which we currently don't have. > > > Is that indeed the current behavior, or did I miss something? > > > > > > Is this "resume logical replication on pg_upgraded node" something we > > > want to > > > support better? I was thinking that we could add a new pg_dump mode > > > (maybe > > > only usable during pg_upgrade) that also restores the pg_subscription_rel > > > content in each subscription or something like that. If not, should > > > pg_upgrade > > > keep preserving the subscriptions as it doesn't seem safe to use them, or > > > at > > > least document the hazards (I didn't find anything about it in the > > > documentation)? > > > > > > > > > > There is a mention of this in pg_dump docs. See [1] (When dumping > > logical replication subscriptions ...) > > Indeed, but it's barely saying "It is then up to the user to reactivate the > subscriptions in a suitable way" and "It might also be appropriate to truncate > the target tables before initiating a new full table copy". As I mentioned, I > don't think there's a suitable way to reactivate the subscription, at least if > you don't want to miss some records, so truncating all target tables is the > only fully safe way to proceed. It seems quite silly to have to do so just > because pg_upgrade doesn't retain the list of relation per subscription. > I also don't know i
Re: Share variable between psql backends in CustomScan
Hi On Sat, Feb 18, 2023 at 1:37 AM Amin wrote: > Hi, > > I am looking for a way to define a global variable in CustomScan plugin > that is shared between different psql backends. Is it possible without > using shared memory? Does postgresql implement any function that > facilitates this? > No - there is nothing like this. You need to use shared memory. Regards Pavel > > Thank you, > Amin >
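To expand on that a bit: the usual pattern for an extension is to reserve its shared struct via the shmem hooks at preload time and have every backend attach to it on startup. A minimal sketch (PG 15+ hook style; all names here are invented for illustration, and the library must be listed in shared_preload_libraries):

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* The state every backend shares; protect access with the lock. */
typedef struct MySharedState
{
	LWLock	   *lock;
	int64		counter;
} MySharedState;

static MySharedState *my_state = NULL;
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void
my_shmem_request(void)
{
	if (prev_shmem_request_hook)
		prev_shmem_request_hook();

	RequestAddinShmemSpace(MAXALIGN(sizeof(MySharedState)));
	RequestNamedLWLockTranche("my_custom_scan", 1);
}

static void
my_shmem_startup(void)
{
	bool		found;

	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	my_state = ShmemInitStruct("my_custom_scan state",
							   sizeof(MySharedState), &found);
	if (!found)
	{
		/* first process to attach initializes the shared struct */
		my_state->lock = &(GetNamedLWLockTranche("my_custom_scan"))->lock;
		my_state->counter = 0;
	}
	LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
	/* only effective when the library is in shared_preload_libraries */
	prev_shmem_request_hook = shmem_request_hook;
	shmem_request_hook = my_shmem_request;
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = my_shmem_startup;
}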
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 9:43 PM Jeff Davis wrote: > On Fri, 2023-02-17 at 18:27 +0100, Pavel Stehule wrote: > > Today I tested icu for Czech sorting. It is a little bit slower, but > > not too much, but it produces partially different results. > > Thank you for trying it. > > If it's a significant slowdown, can you please send more information? > ICU version, libc version, and testcase? > no - this slowdown is not significant - although 1% can look like too much, it is just two ms. It looks like libicu has a slightly more expensive initialization, but the execution is slightly faster. But when I try to repeat the measurements, the results are very unstable on my desktop :-/ SELECT * FROM obce ORDER BY nazev LIMIT 10 // is a little faster with glibc SELECT * FROM obce ORDER BY nazev // is faster with libicu You can download the dataset https://pgsql.cz/files/obce.sql It is a table of municipalities in the Czech Republic (real names) - about 6000 rows. I use Fedora 37 - so libicu 71.1, glibc 2.36 Regards Pavel > > > select row_number() over (order by nazev collate "cs-x-icu"), nazev > > from obce > > except select row_number() over (order by nazev collate "cs_CZ"), > > nazev from obce; > > > > returns a not empty set. So minimally for Czech collate, an index > > rebuild is necessary > > Yes, that's true of any locale change, provider change, or even > provider version change. > > > -- > Jeff Davis > PostgreSQL Contributor Team - AWS > > >
Re: Reducing connection overhead in pg_upgrade compat check phase
On Fri, Feb 17, 2023 at 10:44:49PM +0100, Daniel Gustafsson wrote: > When adding a check to pg_upgrade a while back I noticed in a profile that the > cluster compatibility check phase spend a lot of time in connectToServer. > Some > of this can be attributed to data type checks which each run serially in turn > connecting to each database to run the check, and this seemed like a place > where we can do better. > > The attached patch moves the checks from individual functions, which each > loops > over all databases, into a struct which is consumed by a single umbrella check > where all data type queries are executed against a database using the same > connection. This way we can amortize the connectToServer overhead across more > accesses to the database. This change consolidates all the data type checks, so instead of 7 separate loops through all the databases, there is just one. However, I wonder if we are leaving too much on the table, as there are a number of other functions that also loop over all the databases: * get_loadable_libraries * get_db_and_rel_infos * report_extension_updates * old_9_6_invalidate_hash_indexes * check_for_isn_and_int8_passing_mismatch * check_for_user_defined_postfix_ops * check_for_incompatible_polymorphics * check_for_tables_with_oids * check_for_user_defined_encoding_conversions I suspect consolidating get_loadable_libraries, get_db_and_rel_infos, and report_extension_updates would be prohibitively complicated and not worth the effort. old_9_6_invalidate_hash_indexes is only needed for unsupported versions, so that might not be worth consolidating. check_for_isn_and_int8_passing_mismatch only loops through all databases when float8_pass_by_value in the control data differs, so that might not be worth it, either. The last 4 are for supported versions and, from a very quick glance, seem possible to consolidate. That would bring us to a total of 11 separate loops that we could consolidate into one. However, the data type checks seem to follow a nice pattern, so perhaps this is easier said than done. IIUC with the patch, pg_upgrade will immediately fail as soon as a single check in a database fails. I believe this differs from the current behavior where all matches for a given check in the cluster are logged before failing. I wonder if it'd be better to perform all of the data type checks in all databases before failing so that all of the violations are reported. Else, users would have to run pg_upgrade, fix a violation, run pg_upgrade again, fix another one, etc. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Fri, Feb 17, 2023 at 12:14 PM Hayato Kuroda (Fujitsu) wrote: > > Thank you for replying! This direction seems OK, so I started to revise the > patch. > PSA new version. > Few comments: = 1. + + The minimum delay for publisher sends data, in milliseconds + + It would probably be better to write it as "The minimum delay, in milliseconds, by the publisher to send changes" 2. The subminsenddelay is placed inconsistently in the patch. In the docs (catalogs.sgml), system_views.sql, and in some places in the code, it is after subskiplsn, but in the catalog table and corresponding structure, it is placed after subowner. It should be consistently placed after the subscription owner. 3. + + WalSenderSendDelay + Waiting for sending changes to subscriber in WAL sender + process. How about writing it as follows: "Waiting while sending changes for time-delayed logical replication in the WAL sender process."? 4. + + Any delay becomes effective only after all initial table + synchronization has finished and occurs before each transaction + starts to get applied on the subscriber. The delay does not take into + account the overhead of time spent in transferring the transaction, + which means that the arrival time at the subscriber may be delayed + more than the given time. + This needs to change based on a new approach. It should be something like: "The delay is effective only when the publisher decides to send a particular transaction downstream." 5. + * allowed. This is because in parallel streaming mode, we start applying + * the transaction stream as soon as the first change arrives without + * knowing the transaction's prepare/commit time. Always waiting for the + * full 'min_send_delay' period might include unnecessary delay. + * + * The other possibility was to apply the delay at the end of the parallel + * apply transaction but that would cause issues related to resource bloat + * and locks being held for a long time. + */ This part of the comments seems to imply more of a subscriber-side delay approach. I think we should try to adjust these as per the changed approach. 6. @@ -666,6 +666,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, buf->origptr, buf->endptr); } + /* Delay given time if the context has 'delay' callback */ + if (ctx->delay) + ctx->delay(ctx, commit_time); + I think we should invoke delay functionality only when ctx->min_send_delay > 0. Otherwise, there will be some unnecessary overhead. We can change the comment along the lines of: "Delay sending the changes if required. For streaming transactions, this means a delay in sending the last stream but that is okay because on the downstream the changes will be applied only after receiving the last stream." 7. For 2PC transactions, I think we should add the delay in DecodePrerpare. Because after receiving the PREPARE, the downstream will apply the xact. In this case, we shouldn't add a delay for the commit_prepared. 8. +# +# If the subscription sets min_send_delay parameter, the logical replication +# worker will delay the transaction apply for min_send_delay milliseconds. I think here also comments should be updated as per the changed approach for applying the delay on the publisher side. -- With Regards, Amit Kapila.
Re: pg_upgrade and logical replication
On Sat, Feb 18, 2023 at 09:31:30AM +0530, Amit Kapila wrote: > On Fri, Feb 17, 2023 at 9:05 PM Julien Rouhaud wrote: > > > > I'm concerned about people not coming from physical replication. If you > > just > > had some "normal" logical replication, you can't assume that you already > > have > > all the data from the upstream subscription. If it was modified and a non > > empty table is added, you might need to copy the data of part of the tables > > and > > keep replicating for the rest. It's hard to be sure from a user point of > > view, > > and even if you knew you have no way to express it. > > > > Can't the user create a separate publication for such newly added > tables and a corresponding new subscription on the downstream node? Yes that seems like a safe way to go, but it relies on users being very careful if they don't want to get corrupted logical standby, and I think it's impossible to run any check to make sure that the subscription is adequate? > Now, I think it would be a bit tricky if the user already has a > publication defined with FOR ALL TABLES. In that case, we probably > need some way to specify FOR ALL TABLES EXCEPT (list of tables) which > we currently don't have. Yes, and note that I rely on FOR ALL TABLES for my original physical to logical use case. > > > > Indeed, but it's barely saying "It is then up to the user to reactivate the > > subscriptions in a suitable way" and "It might also be appropriate to > > truncate > > the target tables before initiating a new full table copy". As I > > mentioned, I > > don't think there's a suitable way to reactivate the subscription, at least > > if > > you don't want to miss some records, so truncating all target tables is the > > only fully safe way to proceed. It seems quite silly to have to do so just > > because pg_upgrade doesn't retain the list of relation per subscription. > > > > I also don't know if there is any other safe way for newly added > tables apart from the above suggestion to create separate publications > but that can work only in specific cases. I might be missing something, but what could go wrong if pg_upgrade could emit a bunch of commands like: ALTER SUBSCRIPTION subname ADD RELATION relid STATE 'x' LSN 'X/Y'; pg_upgrade already preserves the relation's oid, so we could restore the exact original state and then enabling the subscription would just work? We could restrict this form to --binary only so we don't provide a way for users to mess the data.