Re: RFC: Logging plan of the running query
At Tue, 8 Feb 2022 01:13:44 +0900, Fujii Masao wrote in
> > > On 2022/02/02 21:59, torikoshia wrote:
> >> This may cause users to misunderstand that pg_log_query_plan() fails
> >> while the target backend is holding *any* locks? Isn't it better to
> >> mention "page-level locks", instead? So how about the following?
> >>
> >> --
> >> Note that the request to log the query plan will be ignored if it's
> >> received during a short period while the target backend is holding a
> >> page-level lock.
> >> --
> > Agreed.
>
> On second thought, this note is confusing rather than helpful? Because
> the users don't know when and what operation needs page-level lock. So
> now I'm thinking it's better to remove this note.

*I* agree to removing the note.

And the following error message looks as mysterious as the note is, and the DETAIL doesn't help.

	ereport(LOG_SERVER_ONLY,
+			errmsg("could not log the query plan"),
+			errdetail("Cannot log the query plan while holding page-level lock."));
+	hash_seq_term(&status);

We should tell that the command can be retried soon, like this?

"LOG: ignored request for logging query plan due to lock conflicts"
"HINT: You can try again in a moment."

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
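For illustration, here is a minimal sketch of how that friendlier wording could look in the ereport() call. This is only a hypothetical rewrite of the snippet quoted above, not the committed patch; the exact message and hint text are assumptions.

```
/*
 * Hypothetical sketch: report the skipped request at LOG level and hint
 * that it can simply be retried.  The wording is an assumption.
 */
ereport(LOG_SERVER_ONLY,
		errmsg("ignored request for logging query plan due to lock conflicts"),
		errhint("You can try again in a moment."));
hash_seq_term(&status);
```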
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
Mmm.. checkpoint and checkpointer are quite confusing in this context.. At Tue, 08 Feb 2022 16:58:22 +0900 (JST), Kyotaro Horiguchi wrote in > At Mon, 7 Feb 2022 13:51:31 +0900, Fujii Masao > wrote in > > > > > > On 2022/02/07 12:02, Kyotaro Horiguchi wrote: > > > - If any later checkpoint/restartpoint has been established, just skip > > >remaining task then return false. (!chkpt_was_latest) > > >(I'm not sure this can happen, though.) > > > - we update control file only when archive recovery is still ongoing. > > > > This comment seems to conflict with what your PoC patch does. Because > > with the patch, ControlFile->checkPoint and > > ControlFile->checkPointCopy seem to be updated even when > > ControlFile->state != DB_IN_ARCHIVE_RECOVERY. > > Ah, yeah, by "update" I meant that "move forward". Sorry for confusing > wording. Please ignore the "that". > > I agree with what your PoC patch does for now. When we're not in > > archive recovery state, checkpoint and REDO locations in pg_control > > should be updated but min recovery point should be reset to invalid > > one (which instruments that subsequent crash recovery should replay > > all available WAL files). > > Yes. All buffers before the last recovery point's end have been > flushed out so the recovery point is valid as a checkpoint. On the > other hand minRecoveryPoint is no longer needed and actually is > ignored at the next crash recovery. We can leave it alone but it is > consistent that it is cleared. > > > > - Otherwise reset minRecoveryPoint then continue. > > > Do you have any thoughts or opinions? > > > > Regarding chkpt_was_latest, whether the state is > > DB_IN_ARCHIVE_RECOVERY or not, if checkpoint and redo locations in > > pg_control are updated, IMO we don't need to skip the "remaining > > tasks". Since those locations are updated and subsequent crash > > recovery will start from that redo location, for example, ISTM that we > > can safely delete old WAL files prior to the redo location as the > > "remaining tasks". Thought? > > If I read you correctly, the PoC works that way. It updates pg_control > if the restart point is latest then performs the remaining cleanup > tasks in that case. Recovery state doesn't affect this process. > > I reexamined about the possibility of concurrent checkpoints. > > Both CreateCheckPoint and CreateRestartPoint are called from > checkpointer loop, shutdown handler of checkpointer and standalone > process. So I can't see a possibility of concurrent checkpoints. > > In the past we had a time when startup process called CreateCheckPoint - directly in the crash recovery case where checkpoint is not running - but since 7ff23c6d27 checkpoint is started before startup process + directly in the crash recovery case where checkpointer is not running + but since 7ff23c6d27 checkpointer is launched before startup process > starts. So I conclude that that cannot happen. > > So the attached takes away the path for the case where the restart > point is overtaken by a concurrent checkpoint. > > Thus.. the attached removes the ambiguity of of the proposed patch > about the LSNs in the restartpoint-ending log message. Thoughts? -- Kyotaro Horiguchi NTT Open Source Software Center
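To make the pg_control handling being discussed a bit more concrete, here is a rough, hedged sketch of what the update in CreateRestartPoint() could look like under this scheme. This is not the attached PoC patch; the variable names (ControlFile, lastCheckPoint, lastCheckPointRecPtr) follow xlog.c, but the exact code is an assumption.

```
/*
 * Hedged sketch only: advance the checkpoint/REDO locations whenever this
 * restart point is newer than what pg_control records, and reset
 * minRecoveryPoint once we are no longer in archive recovery so that a
 * later crash recovery replays all available WAL.
 */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
{
	ControlFile->checkPoint = lastCheckPointRecPtr;
	ControlFile->checkPointCopy = lastCheckPoint;

	if (ControlFile->state != DB_IN_ARCHIVE_RECOVERY)
	{
		/* Archive recovery has ended; minRecoveryPoint is no longer needed. */
		ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
		ControlFile->minRecoveryPointTLI = 0;
	}
	UpdateControlFile();
}
LWLockRelease(ControlFileLock);
```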
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Tue, Feb 8, 2022 at 2:10 PM Kyotaro Horiguchi wrote:
> > Thus.. the attached removes the ambiguity of of the proposed patch
> > about the LSNs in the restartpoint-ending log message.
> > Thoughts?

Thanks for the patch. I have a few comments on the v1-0001-Get-rid-of-unused-path-to-handle-concurrent-check.patch:

1) Can we have this Assert right after the "skipping restartpoint, already performed at %X/%X" error message block? Does it make any difference? My point is that if at all we were to assert this, why can't we do it before CheckPointGuts?

+	/* We mustn't have a concurrent checkpoint that advances checkpoint LSN */
+	Assert(lastCheckPoint.redo > ControlFile->checkPointCopy.redo);
+

2) Related to the above Assert, do we really need an assertion or a FATAL error?

3) Let's be consistent with "crash recovery" - replace "archive-recovery" with "archive recovery"?

+	 * We have exited from archive-recovery mode after this restartpoint
+	 * started. Crash recovery ever after should always recover to the end

4) Isn't it enough to say "Crash recovery should always recover to the end of WAL."?

+	 * started. Crash recovery ever after should always recover to the end

5) Is there a reliable test case covering this code? Please point me to it if the test case is shared upthread somewhere.

6) So, with this patch, the v8 patch-set posted at [1] doesn't need any changes IIUC. If that's the case, please feel free to post all the patches together such that they get tested in cfbot.

[1] - https://www.postgresql.org/message-id/CALj2ACUtZhTb%3D2ENkF3BQ3wi137uaGi__qzvXC-qFYC0XwjALw%40mail.gmail.com

Regards,
Bharath Rupireddy.
RE: Logical replication timeout problem
Dear Wang,

Thank you for making a patch. I applied your patch and confirmed that the code passes the regression tests. Here is a short review:

```
+	static int	skipped_changes_count = 0;
+	/*
+	 * Conservatively, at least 150,000 changes can be skipped in 1s.
+	 *
+	 * Because we use half of wal_sender_timeout as the threshold, and the unit
+	 * of wal_sender_timeout in process is ms, the final threshold is
+	 * wal_sender_timeout * 75.
+	 */
+	int			skipped_changes_threshold = wal_sender_timeout * 75;
```

I'm not sure, but could you tell me the background of this calculation? Is this assumption reasonable?

```
@@ -654,20 +663,62 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 {
 		case REORDER_BUFFER_CHANGE_INSERT:
 			if (!relentry->pubactions.pubinsert)
+			{
+				if (++skipped_changes_count >= skipped_changes_threshold)
+				{
+					OutputPluginUpdateProgress(ctx, true);
+
+					/*
+					 * After sending keepalive message, reset
+					 * skipped_changes_count.
+					 */
+					skipped_changes_count = 0;
+				}
 				return;
+			}
 			break;
```

Is the if-statement needed? In the walsender case OutputPluginUpdateProgress() leads to WalSndUpdateProgress(), and that function also has the threshold for ping-ing.

```
 static void
-WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)
+WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool send_keep_alive)
 {
-	static TimestampTz sendTime = 0;
+	static TimestampTz trackTime = 0;
 	TimestampTz now = GetCurrentTimestamp();

+	if (send_keep_alive)
+	{
+		/*
+		 * If half of wal_sender_timeout has lapsed without send message standby,
+		 * send a keep-alive message to the standby.
+		 */
+		static TimestampTz sendTime = 0;
+		TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
+											wal_sender_timeout / 2);
+		if (now >= ping_time)
+		{
+			WalSndKeepalive(false);
+
+			/* Try to flush pending output to the client */
+			if (pq_flush_if_writable() != 0)
+				WalSndShutdown();
+			sendTime = now;
+		}
+	}
+
```

* +1 about renaming to trackTime.
* `/2` might be a magic number. How about the following? Renaming is very welcome:

```
+#define WALSND_LOGICAL_PING_FACTOR	0.5
+	static TimestampTz sendTime = 0;
+	TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
+										wal_sender_timeout * WALSND_LOGICAL_PING_FACTOR)
```

Could you add a commitfest entry for cfbot?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: [PATCH] Add min() and max() aggregate functions for xid8
On 2022-02-08 15:34, Fujii Masao wrote: Thanks for updating the patch! It basically looks good to me. I applied the following small changes to the patch. Updated version of the patch attached. Could you review this version? Thank you for the patch! It looks good to me! I'm not sure how strict coding conventions are, but the following line is over 80 characters. +insert into xid8_t1 values ('0'), ('010'), ('42'), ('0x'), ('-1'); Therefore, I made a patch which removed ('010') just to fit in 80 characters. Best wishes, -- Ken Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATIONdiff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 8754f2f89b..1b064b4feb 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19973,7 +19973,7 @@ SELECT NULLIF(value, '(none)') ... values. Available for any numeric, string, date/time, or enum type, as well as inet, interval, money, oid, pg_lsn, -tid, +tid, xid8, and arrays of any of these types. Yes @@ -19992,7 +19992,7 @@ SELECT NULLIF(value, '(none)') ... values. Available for any numeric, string, date/time, or enum type, as well as inet, interval, money, oid, pg_lsn, -tid, +tid, xid8, and arrays of any of these types. Yes diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c index 9b4ceaea47..e4b4952a28 100644 --- a/src/backend/utils/adt/xid.c +++ b/src/backend/utils/adt/xid.c @@ -286,6 +286,30 @@ xid8cmp(PG_FUNCTION_ARGS) PG_RETURN_INT32(-1); } +Datum +xid8_larger(PG_FUNCTION_ARGS) +{ + FullTransactionId fxid1 = PG_GETARG_FULLTRANSACTIONID(0); + FullTransactionId fxid2 = PG_GETARG_FULLTRANSACTIONID(1); + + if (FullTransactionIdFollows(fxid1, fxid2)) + PG_RETURN_FULLTRANSACTIONID(fxid1); + else + PG_RETURN_FULLTRANSACTIONID(fxid2); +} + +Datum +xid8_smaller(PG_FUNCTION_ARGS) +{ + FullTransactionId fxid1 = PG_GETARG_FULLTRANSACTIONID(0); + FullTransactionId fxid2 = PG_GETARG_FULLTRANSACTIONID(1); + + if (FullTransactionIdPrecedes(fxid1, fxid2)) + PG_RETURN_FULLTRANSACTIONID(fxid1); + else + PG_RETURN_FULLTRANSACTIONID(fxid2); +} + /* * COMMAND IDENTIFIER ROUTINES * */ diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat index 137f6eef69..2843f4b415 100644 --- a/src/include/catalog/pg_aggregate.dat +++ b/src/include/catalog/pg_aggregate.dat @@ -149,6 +149,9 @@ { aggfnoid => 'max(pg_lsn)', aggtransfn => 'pg_lsn_larger', aggcombinefn => 'pg_lsn_larger', aggsortop => '>(pg_lsn,pg_lsn)', aggtranstype => 'pg_lsn' }, +{ aggfnoid => 'max(xid8)', aggtransfn => 'xid8_larger', + aggcombinefn => 'xid8_larger', aggsortop => '>(xid8,xid8)', + aggtranstype => 'xid8' }, # min { aggfnoid => 'min(int8)', aggtransfn => 'int8smaller', @@ -214,6 +217,9 @@ { aggfnoid => 'min(pg_lsn)', aggtransfn => 'pg_lsn_smaller', aggcombinefn => 'pg_lsn_smaller', aggsortop => '<(pg_lsn,pg_lsn)', aggtranstype => 'pg_lsn' }, +{ aggfnoid => 'min(xid8)', aggtransfn => 'xid8_smaller', + aggcombinefn => 'xid8_smaller', aggsortop => '<(xid8,xid8)', + aggtranstype => 'xid8' }, # count { aggfnoid => 'count(any)', aggtransfn => 'int8inc_any', diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 7024dbe10a..62f36daa98 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -203,6 +203,12 @@ { oid => '5071', descr => 'convert xid8 to xid', proname => 'xid', prorettype => 'xid', proargtypes => 'xid8', prosrc => 'xid8toxid' }, +{ oid => '5097', descr => 'larger of two', + proname => 'xid8_larger', prorettype => 'xid8', 
proargtypes => 'xid8 xid8', + prosrc => 'xid8_larger' }, +{ oid => '5098', descr => 'smaller of two', + proname => 'xid8_smaller', prorettype => 'xid8', proargtypes => 'xid8 xid8', + prosrc => 'xid8_smaller' }, { oid => '69', proname => 'cideq', proleakproof => 't', prorettype => 'bool', proargtypes => 'cid cid', prosrc => 'cideq' }, @@ -6564,6 +6570,9 @@ { oid => '4189', descr => 'maximum value of all pg_lsn input values', proname => 'max', prokind => 'a', proisstrict => 'f', prorettype => 'pg_lsn', proargtypes => 'pg_lsn', prosrc => 'aggregate_dummy' }, +{ oid => '5099', descr => 'maximum value of all xid8 input values', + proname => 'max', prokind => 'a', proisstrict => 'f', prorettype => 'xid8', + proargtypes => 'xid8', prosrc => 'aggregate_dummy' }, { oid => '2131', descr => 'minimum value of all bigint input values', proname => 'min', prokind => 'a', proisstrict => 'f', prorettype => 'int8', @@ -6631,6 +6640,9 @@ { oid => '4190', descr => 'minimum value of all pg_lsn input values', proname => 'min', prokind => 'a', p
Re: [BUG]Update Toast data failure in logical replication
On Tue, Feb 8, 2022 at 12:48 AM Andres Freund wrote: > > On 2022-02-07 08:44:00 +0530, Amit Kapila wrote: > > Right, and it is getting changed. We are just printing the first 200 > > characters (by using SQL [1]) from the decoded tuple so what is shown > > in the results is the initial 200 bytes. > > Ah, I knew I must have been missing something. > > > > The complete decoded data after the patch is as follows: > > Hm. I think we should change the way the strings are shortened - otherwise we > don't really verify much in that test. Perhaps we could just replace the long > repetitive strings with something shorter in the output? > > E.g. using something like regexp_replace(data, > '(1234567890|9876543210){200}', '\1{200}','g') > inside the substr(). > This sounds like a good idea. Shall we do this as part of this patch itself or as a separate improvement? > Wonder if we should deduplicate the number of different toasted strings in the > file to something that'd allow us to have a single "redact_toast" function or > such. There's too many different ones to have a reasonbly simple redaction > function right now. > I think this is also worth trying. > But that's perhaps better done separately. > +1. -- With Regards, Amit Kapila.
Should pg_restore vacuum the tables before the post-data stage?
Hello,

I was wondering if pg_restore should call VACUUM ANALYZE for all tables, after the "COPY" stage, and before the "post-data" stage. Indeed, without such a VACUUM, the visibility map isn't available. Depending on the size of the tables and on the configuration, a foreign key constraint check would have to perform either:

* a seq scan of the referencing table, even if an index exists for it;
* an index-only scan, with 100% heap fetches.

And it's the same for the referenced table, of course, except that there the index always exists. In both cases, it could be very slow. What's more, the seq scan isn't parallelized [1].

It seems to me that in most cases, autovacuum doesn't have enough time to process the tables before the post-data stage, or only processes some of them.

What do you think? If the matter has already been discussed previously, can you point me to the thread discussing it?

Best regards,
Frédéric

[1] https://www.postgresql.org/message-id/flat/0d21e3b4-dcde-290c-875e-6ed5013e8e52%40dalibo.com
Re: Database-level collation version tracking
On 07.02.22 17:08, Julien Rouhaud wrote: There's so limited testing in collate.* regression tests, so I thought it would be ok to add it there. At least some ALTER DATABASE ... REFRESH VERSION would be good, similarly to collation-level versioning. I don't think you can run ALTER DATABASE from the regression test scripts, since the database name is not fixed. You'd have to paste the command together using psql tricks or something. I guess it could be done, but maybe there is a better solution. We could put it into the createdb test suite, or write a new TAP test suite somewhere. There is no good precedent for this. That code takes a RowExclusiveLock on pg_database. Did you have something else in mind? But that lock won't prevent a concurrent DROP DATABASE, so it's totally expected to hit that cache lookup failed error. There should either be a shdepLockAndCheckObject(), or changing the error message to some errmsg("data xxx does not exist"). I was not familiar with that function. AFAICT, it is only used for database and role settings (AlterDatabaseSet(), AlterRoleSet()), but not when just updating the role or database catalog (e.g., AlterRole(), RenameRole(), RenameDatabase()). So I don't think it is needed here. Maybe I'm missing something.
Re: [PATCH v2] use has_privs_for_role for predefined roles
On 2/7/22 12:09, Robert Haas wrote: On Mon, Feb 7, 2022 at 11:13 AM Joe Conway wrote: It is confusing and IMHO dangerous that the predefined roles currently work differently than regular roles eith respect to privilege inheritance. I feel like that's kind of a conclusory statement, as opposed to making an argument. I mean that this tells me something about how you feel, but it doesn't really help me understand why you feel that way. The argument is that we call these things "predefined roles", but they do not behave the same way normal "roles" behave. Someone not intimately familiar with that fact could easily make bad assumptions, and therefore wind up with misconfigured security settings. In other words, it violates the principle of least astonishment (POLA). As Joshua said nearby, it simply jumps out at me as a bug. --- After more thought, perhaps the real problem is that these things should not have been called "predefined roles" at all. I know, the horse has already left the barn on that, but in any case... They are (to me at least) similar in concept to Linux capabilities in that they allow roles other than superusers to do a certain subset of the things historically reserved for superusers through a special mechanism (hardcoded) rather than through the normal privilege system (GRANTS/ACLs). As an example, the predefined role pg_read_all_settings allows a non-superuser to read GUC normally reserved for superuser access only. If I create a new user "bob" with the default INHERIT attribute, and I grant postgres to bob, bob must SET ROLE to postgres in order to access the capability to read superuser settings. This is similar to bob's access to the default superuser privilege to read data in someone else's table (must SET ROLE to access that capability). But it is different from bob's access to inherited privileges which are GRANTed: 8<-- psql nmx psql (15devel) Type "help" for help. nmx=# create user bob; CREATE ROLE nmx=# grant postgres to bob; GRANT ROLE nmx=# \q 8<-- -and- 8<-- psql -U bob nmx psql (15devel) Type "help" for help. nmx=> select current_user; current_user -- bob (1 row) nmx=> show stats_temp_directory; ERROR: must be superuser or have privileges of pg_read_all_settings to examine "stats_temp_directory" nmx=> set role postgres; SET nmx=# show stats_temp_directory; stats_temp_directory -- pg_stat_tmp (1 row) nmx=# select current_user; current_user -- postgres (1 row) nmx=# select * from foo; id 42 (1 row) nmx=# reset role; RESET nmx=> select current_user; current_user -- bob (1 row) nmx=> select * from foo; ERROR: permission denied for table foo nmx=> set role postgres; SET nmx=# grant select on table foo to postgres; GRANT nmx=# reset role; RESET nmx=> select current_user; current_user -- bob (1 row) nmx=> select * from foo; id 42 (1 row) 8<-- Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Temporary file access API
Here I'm starting a new thread to discuss a topic that's related to Transparent Data Encryption (TDE), but could be useful even without that.

The problem has been addressed somehow in the Cybertec TDE fork, and I can post the code here if it helps. However, after reading [1] (and the posts upthread), I've got another idea, so let's try to discuss it first.

It makes sense to me if we first implement the buffering (i.e. writing/reading a certain amount of data at a time) and make the related functions aware of encryption later: as long as we use a block cipher, we also need to read/write (suitably sized) chunks rather than individual bytes (or arbitrary amounts of data). (In theory, someone might need encryption but reject buffering, but I'm not sure if this is a realistic use case.)

For the buffering, I imagine a "file stream" object that the user creates on top of a file descriptor, such as

	FileStream *FileStreamCreate(File file, int buffer_size)

or

	FileStream *FileStreamCreateFD(int fd, int buffer_size)

and uses functions like

	int FileStreamWrite(FileStream *stream, char *buffer, int amount)

and

	int FileStreamRead(FileStream *stream, char *buffer, int amount)

to write and read data respectively.

Besides functions to close the streams explicitly (e.g. FileStreamClose() / FileStreamFDClose()), we'd need to ensure automatic closing where that happens to the file. For example, if OpenTemporaryFile() was used to obtain the file descriptor, the user expects that the file will be closed and deleted on transaction boundary, so the corresponding stream should be freed automatically as well.

To avoid code duplication, buffile.c should use these streams internally as well, as it also performs buffering. (Here we'd also need functions to change the reading/writing position.)

Once we implement the encryption, we might need to add an argument to the FileStreamCreate...() functions that helps to generate a unique IV, but the ...Read() / ...Write() functions would stay intact. And possibly one more argument to specify the kind of cipher, in case we support more than one.

I think that's enough to start the discussion. Thanks for feedback in advance.

[1] https://www.postgresql.org/message-id/CA%2BTgmoYGjN_f%3DFCErX49bzjhNG%2BGoctY%2Ba%2BXhNRWCVvDY8U74w%40mail.gmail.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
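To make the proposed interface a little more tangible, here is a hedged usage sketch built only from the signatures above. FileStream, FileStreamCreate(), FileStreamWrite() and FileStreamClose() are the hypothetical API being proposed, not existing PostgreSQL functions; the buffer size and the error handling are likewise assumptions.

```
/*
 * Hypothetical caller of the proposed file-stream API.  Everything named
 * FileStream* below is part of the proposal, not existing code.
 */
File		file = OpenTemporaryFile(false);
FileStream *stream = FileStreamCreate(file, 8192);	/* 8 kB buffer, arbitrary */
char		buf[1024];

memset(buf, 0, sizeof(buf));
if (FileStreamWrite(stream, buf, sizeof(buf)) != sizeof(buf))
	elog(ERROR, "could not write to temporary file stream");

/* Explicit close; otherwise the stream would go away with the file. */
FileStreamClose(stream);
```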
Re: storing an explicit nonce
Stephen Frost wrote: > Perhaps this is all too meta and we need to work through some specific > ideas around just what this would look like. In particular, thinking > about what this API would look like and how it would be used by > reorderbuffer.c, which builds up changes in memory and then does a bare > write() call, seems like a main use-case to consider. The gist there > being "can we come up with an API to do all these things that doesn't > require entirely rewriting ReorderBufferSerializeChange()?" > > Seems like it'd be easier to achieve that by having something that looks > very close to how write() looks, but just happens to have the option to > run the data through a stream cipher and maybe does better error > handling for us. Making that layer also do block-based access to the > files underneath seems like a much larger effort that, sure, may make > some things better too but if we could do that with the same API then it > could also be done later if someone's interested in that. My initial proposal is in this new thread: https://www.postgresql.org/message-id/4987.1644323098%40antos -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Database-level collation version tracking
On Tue, Feb 08, 2022 at 12:14:02PM +0100, Peter Eisentraut wrote: > On 07.02.22 17:08, Julien Rouhaud wrote: > > There's so limited testing in collate.* regression tests, so I thought it > > would > > be ok to add it there. At least some ALTER DATABASE ... REFRESH VERSION > > would > > be good, similarly to collation-level versioning. > > I don't think you can run ALTER DATABASE from the regression test scripts, > since the database name is not fixed. You'd have to paste the command > together using psql tricks or something. I guess it could be done, but > maybe there is a better solution. We could put it into the createdb test > suite, or write a new TAP test suite somewhere. There is no good precedent > for this. I was thinking using a simple DO command, as there are already some other usage for that for non deterministic things (like TOAST tables). If it's too problematic I'm fine with not having tests for that for now. > > > That code takes a RowExclusiveLock on pg_database. Did you have something > > > else in mind? > > > > But that lock won't prevent a concurrent DROP DATABASE, so it's totally > > expected to hit that cache lookup failed error. There should either be a > > shdepLockAndCheckObject(), or changing the error message to some > > errmsg("data > > xxx does not exist"). > > I was not familiar with that function. AFAICT, it is only used for database > and role settings (AlterDatabaseSet(), AlterRoleSet()), but not when just > updating the role or database catalog (e.g., AlterRole(), RenameRole(), > RenameDatabase()). So I don't think it is needed here. Maybe I'm missing > something. I'm just saying that without such a lock you can easily trigger the "cache lookup" error, and that's something that's supposed to happen with normal usage I think. So it should be a better message saying that the database has been concurrently dropped, or actually simply does not exist like it's done in AlterDatabaseOwner() for the same pattern: [...] tuple = systable_getnext(scan); if (!HeapTupleIsValid(tuple)) ereport(ERROR, (errcode(ERRCODE_UNDEFINED_DATABASE), errmsg("database \"%s\" does not exist", dbname))); [...] Apart from that I still think that we should check the collation version of the source database when creating a new database. It won't cost much but will give the DBA a chance to recreate the indexes before risking invalid index usage.
Re: Make mesage at end-of-recovery less scary.
Hi, Here are some of my review comments on the v11 patch: - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery", +LSN_FORMAT_ARGS(ErrRecPtr), +replayTLI, +xlogSourceNames[currentSource]))); Why crash recovery? Won't this message get printed even during PITR? I just did a PITR and could see these messages in the logfile. 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time recovery to WAL location (LSN) "0/5227790" 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; automatic recovery in progress 2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8 2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at 0/3D0 on timeline 1 in pg_wal during crash recovery, entering archive recovery == + /* +* If we haven't emit an error message, we have safely reached the +* end-of-WAL. +*/ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), +(errormsg ? errdetail_internal("%s", errormsg) : 0))); + } Doesn't it make sense to add an assert statement inside this if-block that will check for xlogreader->EndOfWAL? == -* We only end up here without a message when XLogPageRead() -* failed - in that case we already logged something. In -* StandbyMode that only happens if we have been triggered, so we -* shouldn't loop anymore in that case. +* If we get here for other than end-of-wal, emit the error +* message right now. Otherwise the message if any is shown as a +* part of the end-of-WAL message below. */ For consistency, I think we can replace "end-of-wal" with "end-of-WAL". Please note that everywhere else in the comments you have used "end-of-WAL". So why not the same here? == ereport(LOG, - (errmsg("replication terminated by primary server"), -errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write; + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", + startpointTLI, + LSN_FORMAT_ARGS(LogstreamResult.Write; Is this change really required? I don't see any issue with the existing error message. == Lastly, are we also planning to backport this patch? -- With Regards, Ashutosh Sharma. On Wed, Feb 2, 2022 at 11:05 AM Kyotaro Horiguchi wrote: > > At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov > wrote in > > Maybe it can be written little bit shorter: > > pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > > as > > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > > ? > > That difference would be a matter of taste, but I found it looks > cleaner that definition and assignment is separated for both p and pe. > Now it is like the following. > > > char *p; > > char *pe; > > > > /* scan from the beginning of the record to the end of block */ > > p = (char *) record; > > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > > > > The problem that pgindent sometimes reflow formatting of unrelated blocks > > is indeed existing. 
But I think it's right to manually leave pgindent-ed > > code only on what is related to the patch. The leftover is pgindent-ed in a > > scheduled manner sometimes, so don't need to bother. > > Yeah, I meant that it is a bit annoying to unpginden-ting unrelated > edits:p > > > I'd like to set v10 as RfC. > > Thanks! The suggested change is done in the attached v11. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center
Re: row filtering for logical replication
On Tue, Feb 8, 2022 at 8:01 AM houzj.f...@fujitsu.com wrote: > > > > > 12. src/backend/replication/pgoutput/pgoutput.c - get_rel_sync_entry > > > > + /* > > + * Initialize the row filter after getting the final publish_as_relid > > + * as we only evaluate the row filter of the relation which we publish > > + * change as. > > + */ > > + pgoutput_row_filter_init(data, active_publications, entry); > > > > The comment "which we publish change as" seems strangely worded. > > > > Perhaps it should be: > > "... only evaluate the row filter of the relation which being published." > > Changed. > I don't know if this change is an improvement. If you want to change then I don't think 'which' makes sense in the following part of the comment: "...relation which being published." Few other comments: 1. Can we save sending schema change messages if the row filter doesn't match by moving maybe_send_schema after row filter checks? 2. commit message/docs: "The WHERE clause only allows simple expressions that don't have user-defined functions, user-defined operators, user-defined collations, non-immutable built-in functions, or references to system columns." "user-defined types" is missing in this sentence. 3. + /* + * For all the supported nodes, check the functions and collations used in + * the nodes. + */ Again 'types' is missing in this comment. -- With Regards, Amit Kapila.
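To illustrate the first comment, here is a hedged sketch of the reordering in pgoutput_change()'s INSERT path. The function names follow the patch under discussion, but the argument lists and the surrounding code are simplified assumptions.

```
/*
 * Hedged sketch only: evaluate the row filter before sending any schema
 * message, so a filtered-out change costs neither a schema message nor a
 * data message.  Argument lists are illustrative, not the patch's.
 */
if (!relentry->pubactions.pubinsert)
	return;

/* Row filter first: a filtered-out tuple needs neither schema nor data. */
if (!pgoutput_row_filter(relation, change, relentry))
	return;

/* Only changes that will actually be published get the schema message. */
maybe_send_schema(ctx, change, relation, relentry);
```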
Re: [RFC] building postgres with meson - autogenerated headers
On 07.02.22 20:24, Andres Freund wrote:
> To be honest, I do not really understand the logic behind when autoconf
> ends up with #defines that define a macro to 0/1 and when a macro ends
> defined/or not and when we end up with a macro defined to 1 or not
> defined at all.

The default is to define to 1 or not at all. The reason for this is presumably that originally, autoconf (or its predecessor practices) just populated the command line with a few -DHAVE_THIS options. Creating a header file came later. And -DFOO is equivalent to #define FOO 1. Also, this behavior allows code to use both the #ifdef HAVE_THIS and the #if HAVE_THIS style. The cases that deviate from this have a special reason for this.

One issue to consider is that depending on how the configure script is set up or structured, a test might not run at all. But for example, if you have a check for a declaration of a function, and the test doesn't run in a particular configuration, the fallback in your own code would normally be to then manually declare the function yourself. But if you didn't even run the test, then adding a declaration of a function you didn't want in the first place might be bad. In that case, you can check with #ifdef whether the test was run, and then check the value of the macro for the test outcome.
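A small illustration of the two resulting usage patterns; HAVE_STRLCPY and HAVE_DECL_STRLCPY are just familiar example names here, and the declaration fallback is a sketch rather than a quote of port.h.

```
/* Feature macros are defined to 1 or left undefined, so both tests work. */
#ifdef HAVE_STRLCPY
/* ... strlcpy() is available ... */
#endif

#if HAVE_STRLCPY				/* an undefined macro evaluates to 0 in #if */
/* ... strlcpy() is available ... */
#endif

/*
 * Declaration checks are defined to 0 or 1, so "was the test run?" and
 * "what was the result?" can be told apart before adding our own
 * declaration.
 */
#if defined(HAVE_DECL_STRLCPY) && !HAVE_DECL_STRLCPY
extern size_t strlcpy(char *dst, const char *src, size_t siz);
#endif
```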
Re: [PATCH v2] use has_privs_for_role for predefined roles
On Tue, Feb 8, 2022 at 6:59 AM Joe Conway wrote: > This is similar to bob's access to the default superuser privilege to > read data in someone else's table (must SET ROLE to access that capability). > > But it is different from bob's access to inherited privileges which are > GRANTed: Yeah. I think right here you've put your finger on what's been bugging me about this: it's similar to one thing, and it's different from another. To you and Joshua and Stephen, it seems 100% obvious that these roles should work like grants of other roles. But I think of them as capabilities derived from the superuser account, and so I'm sort of tempted to think that they should work the way the superuser bit does. And that's why I don't think the fact that they work the other way is "just a bug" -- it's one of two possible ways that someone could think that it ought to work based on how other things in the system actually do work. I'm not hard stuck on the idea that the current behavior is right, but I don't think that we can really say that we've made things fully consistent unless we make things like SUPERUSER and BYPASSRLS work the same way that you want to make predefined roles work. And probably do something about the INHERIT flag too because the current situation seems like a hot mess. -- Robert Haas EDB: http://www.enterprisedb.com
Improve correlation names in sanity tests
I had to do some analysis on the "sanity" tests in the regression test suite (opr_sanity, type_sanity) recently, and I found some of the queries very confusing. One main stumbling block is that for some probably ancient reason many of the older queries are written with correlation names p1, p2, etc. independent of the name of the catalog. This one is a good example: SELECT p1.oid, p1.oprname, p2.oid, p2.proname FROM pg_operator AS p1, pg_proc AS p2 <-- HUH?!? WHERE p1.oprcode = p2.oid AND p1.oprkind = 'l' AND (p2.pronargs != 1 OR NOT binary_coercible(p2.prorettype, p1.oprresult) OR NOT binary_coercible(p1.oprright, p2.proargtypes[0]) OR p1.oprleft != 0); I think this is better written as SELECT o1.oid, o1.oprname, p1.oid, p1.proname FROM pg_operator AS o1, pg_proc AS p1 WHERE o1.oprcode = p1.oid AND o1.oprkind = 'l' AND (p1.pronargs != 1 OR NOT binary_coercible(p1.prorettype, o1.oprresult) OR NOT binary_coercible(o1.oprright, p1.proargtypes[0]) OR o1.oprleft != 0); Attached is a patch that cleans up all the queries in this manner. (As in the above case, I kept the digits like o1 and p1 even in cases where only one of each letter is used in a query. This is mainly to keep the style consistent, but if people don't like that at all, it could be changed.)From a194456df5f56042b3dd2b6c5bd95b27a771761a Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 8 Feb 2022 14:37:56 +0100 Subject: [PATCH] Improve correlation names in sanity tests --- src/test/regress/expected/opr_sanity.out | 374 - src/test/regress/expected/type_sanity.out | 466 +++--- src/test/regress/sql/opr_sanity.sql | 376 - src/test/regress/sql/type_sanity.sql | 466 +++--- 4 files changed, 841 insertions(+), 841 deletions(-) diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out index 562b586d8e..4ce6c039b4 100644 --- a/src/test/regress/expected/opr_sanity.out +++ b/src/test/regress/expected/opr_sanity.out @@ -991,9 +991,9 @@ WHERE c.castmethod = 'b' AND -- pg_conversion -- Look for illegal values in pg_conversion fields. -SELECT p1.oid, p1.conname -FROM pg_conversion as p1 -WHERE p1.conproc = 0 OR +SELECT c.oid, c.conname +FROM pg_conversion as c +WHERE c.conproc = 0 OR pg_encoding_to_char(conforencoding) = '' OR pg_encoding_to_char(contoencoding) = ''; oid | conname @@ -1025,8 +1025,8 @@ WHERE p.oid = c.conproc AND -- them directly from SQL. But there are no non-default built-in -- conversions anyway. -- (Similarly, this doesn't cope with any search path issues.) -SELECT p1.oid, p1.conname -FROM pg_conversion as p1 +SELECT c.oid, c.conname +FROM pg_conversion as c WHERE condefault AND convert('ABC'::bytea, pg_encoding_to_char(conforencoding), pg_encoding_to_char(contoencoding)) != 'ABC'; @@ -1036,32 +1036,32 @@ WHERE condefault AND -- pg_operator -- Look for illegal values in pg_operator fields. 
-SELECT p1.oid, p1.oprname -FROM pg_operator as p1 -WHERE (p1.oprkind != 'b' AND p1.oprkind != 'l') OR -p1.oprresult = 0 OR p1.oprcode = 0; +SELECT o1.oid, o1.oprname +FROM pg_operator as o1 +WHERE (o1.oprkind != 'b' AND o1.oprkind != 'l') OR +o1.oprresult = 0 OR o1.oprcode = 0; oid | oprname -+- (0 rows) -- Look for missing or unwanted operand types -SELECT p1.oid, p1.oprname -FROM pg_operator as p1 -WHERE (p1.oprleft = 0 and p1.oprkind != 'l') OR -(p1.oprleft != 0 and p1.oprkind = 'l') OR -p1.oprright = 0; +SELECT o1.oid, o1.oprname +FROM pg_operator as o1 +WHERE (o1.oprleft = 0 and o1.oprkind != 'l') OR +(o1.oprleft != 0 and o1.oprkind = 'l') OR +o1.oprright = 0; oid | oprname -+- (0 rows) -- Look for conflicting operator definitions (same names and input datatypes). -SELECT p1.oid, p1.oprcode, p2.oid, p2.oprcode -FROM pg_operator AS p1, pg_operator AS p2 -WHERE p1.oid != p2.oid AND -p1.oprname = p2.oprname AND -p1.oprkind = p2.oprkind AND -p1.oprleft = p2.oprleft AND -p1.oprright = p2.oprright; +SELECT o1.oid, o1.oprcode, o2.oid, o2.oprcode +FROM pg_operator AS o1, pg_operator AS o2 +WHERE o1.oid != o2.oid AND +o1.oprname = o2.oprname AND +o1.oprkind = o2.oprkind AND +o1.oprleft = o2.oprleft AND +o1.oprright = o2.oprright; oid | oprcode | oid | oprcode -+-+-+- (0 rows) @@ -1070,14 +1070,14 @@ WHERE p1.oid != p2.oid AND -- DEFINITIONAL NOTE: If A.oprcom = B, then x A y has the same result as y B x. -- We expect that B will always say that B.oprcom = A as well; that's not -- inherently essential, but it would be inefficient not to mark it so. -SELECT p1.oid, p1.oprcode, p2.oid, p2.oprcode -FROM pg_operator AS p1, pg_operator AS p2 -WHERE p1.oprcom = p2.oid AND -(p1.oprkind != 'b' OR - p1.oprleft != p2.oprright O
Re: Database-level collation version tracking
On Mon, Feb 07, 2022 at 04:44:24PM +0100, Peter Eisentraut wrote:
> On 07.02.22 11:29, Julien Rouhaud wrote:
> > - that's not really something new with this patch, but should we output the
> >   collation version info or mismatch info in \l / \dO?
>
> It's a possibility. Perhaps there is a question of performance if we show
> it in \l and people have tons of databases and they have to make a locale
> call for each one. As you say, it's more an independent feature, but it's
> worth looking into.

Ok, but \l+ shows, among other things, the database size, so I guess showing this at that level should be fine? (Or is that already the case?)

Michael
Re: [RFC] building postgres with meson - perl embedding
On 2/7/22 21:40, Tom Lane wrote: > I wrote: >> Andres Freund writes: >>> What is the reason behind subtracting ccdlflags? >> It looks like the coding actually originated here: >> commit f5d0c6cad5bb2706e0e63f3f8f32e431ea428100 > Ah, here's the thread leading up to that: > > https://www.postgresql.org/message-id/flat/200106191206.f5JC6R108371%40candle.pha.pa.us > > The use of ldopts rather than hand-hacked link options seems to date to > 0ed7864d6, only a couple days before that. I don't think we had a > buildfarm then, but I'd bet against the problem being especially > widespread even then, or more people would've complained. The buildfarm's first entry is from 22 Oct 2004. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Add min() and max() aggregate functions for xid8
On 2022/02/08 18:43, Ken Kato wrote: On 2022-02-08 15:34, Fujii Masao wrote: Thanks for updating the patch! It basically looks good to me. I applied the following small changes to the patch. Updated version of the patch attached. Could you review this version? Thank you for the patch! It looks good to me! Thanks for the review! I'm not sure how strict coding conventions are, but the following line is over 80 characters. +insert into xid8_t1 values ('0'), ('010'), ('42'), ('0x'), ('-1'); Therefore, I made a patch which removed ('010') just to fit in 80 characters. If you want to avoid the line longer than 80 columns, you should break it into two or more rather than remove the test code, I think. What to test is more important than formatting. Also the following descriptions about formatting would be helpful. --- https://www.postgresql.org/docs/devel/source-format.html Limit line lengths so that the code is readable in an 80-column window. (This doesn't mean that you must never go past 80 columns. For instance, breaking a long error message string in arbitrary places just to keep the code within 80 columns is probably not a net gain in readability.) --- Therefore I'm ok with the patch that I posted upthread. Also I'm ok if you will break that longer line into two and post new patch. Or if the value '010' is really useless for the test purpose, I'm also ok if you remove it. Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Refactoring SSL tests
> On 7 Feb 2022, at 17:29, Andrew Dunstan wrote: > On 2/2/22 14:50, Daniel Gustafsson wrote: >>> On 2 Feb 2022, at 17:09, Andrew Dunstan wrote: >>> On 2/2/22 08:26, Daniel Gustafsson wrote: Thoughts? I'm fairly sure there are many crimes against Perl in this patch, I'm happy to take pointers on how to improve that. >>> It feels a bit odd to me from a perl POV. I think it needs to more along >>> the lines of standard OO patterns. I'll take a stab at that based on >>> this, might be a few days. >> That would be great, thanks! > > Here's the result of that surgery. It's a little incomplete in that it > needs some POD adjustment, but I think the code is right - it passes > testing for me. Confirmed, it passes all tests for me as well. > One of the advantages of this, apart from being more idiomatic, is that > by avoiding the use of package level variables you can have two > SSL::Server objects, one for OpenSSL and (eventually) one for NSS. This > was the original motivation for the recent install_path additions to > PostgreSQL::Test::Cluster, so it complements that work nicely. Agreed, this version is a clear improvement over my attempt. Thanks! The attached v2 takes a stab at fixing up the POD sections. -- Daniel Gustafsson https://vmware.com/ v2-0001-ssl-test-refactoring.patch Description: Binary data
Re: Synchronizing slots from primary to standby
On Tue, Feb 8, 2022 at 2:02 AM Andres Freund wrote: > > Hi, > > On 2022-02-07 13:38:38 +0530, Ashutosh Sharma wrote: > > Are you talking about this scenario - what if the logical replication > > slot on the publisher is dropped, but is being referenced by the > > standby where the slot is synchronized? > > It's a bit hard to say, because neither in this thread nor in the patch I've > found a clear description of what the syncing needs to & tries to > guarantee. It might be that that was discussed in one of the precursor > threads, but... > > Generally I don't think we can permit scenarios where a slot can be in a > "corrupt" state, i.e. missing required catalog entries, after "normal" > administrative commands (i.e. not mucking around in catalog entries / on-disk > files). Even if the sequence of commands may be a bit weird. All such cases > need to be either prevented or detected. > > > As far as I can tell, the way this patch keeps slots on physical replicas > "valid" is solely by reorderbuffer.c blocking during replay via > wait_for_standby_confirmation(). > > Which means that if e.g. the standby_slot_names GUC differs from > synchronize_slot_names on the physical replica, the slots synchronized on the > physical replica are not going to be valid. Or if the primary drops its > logical slots. > > > > Should the redo function for the drop replication slot have the capability > > to drop it on standby and its subscribers (if any) as well? > > Slots are not WAL logged (and shouldn't be). > > I think you pretty much need the recovery conflict handling infrastructure I > referenced upthread, which recognized during replay if a record has a conflict > with a slot on a standby. And then ontop of that you can build something like > this patch. > OK. Understood, thanks Andres. -- With Regards, Ashutosh Sharma.
Re: [PATCH v2] use has_privs_for_role for predefined roles
On Tue, Feb 8, 2022 at 8:46 AM Robert Haas wrote: > > On Tue, Feb 8, 2022 at 6:59 AM Joe Conway wrote: > > This is similar to bob's access to the default superuser privilege to > > read data in someone else's table (must SET ROLE to access that capability). > > > > But it is different from bob's access to inherited privileges which are > > GRANTed: > > Yeah. I think right here you've put your finger on what's been bugging > me about this: it's similar to one thing, and it's different from > another. To you and Joshua and Stephen, it seems 100% obvious that > these roles should work like grants of other roles. But I think of > them as capabilities derived from the superuser account, and so I'm > sort of tempted to think that they should work the way the superuser > bit does. And that's why I don't think the fact that they work the > other way is "just a bug" -- it's one of two possible ways that > someone could think that it ought to work based on how other things in > the system actually do work. > > I'm not hard stuck on the idea that the current behavior is right, but > I don't think that we can really say that we've made things fully > consistent unless we make things like SUPERUSER and BYPASSRLS work the > same way that you want to make predefined roles work. And probably do > something about the INHERIT flag too because the current situation > seems like a hot mess. I think hot mess is an apt description of the current situation, for example consider that: src/backend/catalog/aclchk.c 3931: has_privs_of_role(roleid, ROLE_PG_READ_ALL_DATA)) 3943: has_privs_of_role(roleid, ROLE_PG_WRITE_ALL_DATA)) 4279: (has_privs_of_role(roleid, ROLE_PG_READ_ALL_DATA) || 4280: has_privs_of_role(roleid, ROLE_PG_WRITE_ALL_DATA))) src/backend/storage/ipc/signalfuncs.c 82: if (!has_privs_of_role(GetUserId(), proc->roleId) && 83: !has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_BACKEND)) src/backend/storage/ipc/procarray.c 3843: if (!has_privs_of_role(GetUserId(), proc->roleId) && 3844: !has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_BACKEND)) src/backend/tcop/utility.c 943: if (!has_privs_of_role(GetUserId(), ROLE_PG_CHECKPOINTER)) 4 predefined roles currently use has_privs_of_role in master. Further, pg_monitor, as an SQL-only predefined role, also behaves consistently with the INHERIT rules that other roles do. In order for SQL-only predefined roles to ignore INHERIT we would need to hardcode bypasses for them, which IMO seems like the worst possible solution to the current inconsistency.
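For readers following along, a hedged illustration (not taken from the patch) of the semantic difference the thread keeps circling around:

```
/*
 * is_member_of_role() answers "was the role granted at all?", ignoring the
 * INHERIT attribute; has_privs_of_role() answers "does the role currently
 * have the privileges?", which respects INHERIT/NOINHERIT.  The predefined
 * role used here is just an example.
 */
Oid			roleid = GetUserId();

if (is_member_of_role(roleid, ROLE_PG_READ_ALL_DATA))
{
	/* true for any member, even a NOINHERIT one that has not done SET ROLE */
}

if (has_privs_of_role(roleid, ROLE_PG_READ_ALL_DATA))
{
	/* true only if the membership's privileges are actually inherited */
}
```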
Re: libpq async duplicate error results
On 2022-Feb-07, Tom Lane wrote: > In any case, the particular example we're looking at here is connection > loss, which is not something that should greatly concern us as far as > pipeline semantics are concerned, because you're not getting any more > pipelined results anyway if that happens. In the non-I/O-error case, > I see that PQgetResult does a hacky "resetPQExpBuffer(&conn->errorMessage)" > in between pipelined queries. It seems like if there are real usability > issues in this area, they probably come from that not being well placed. > In particular, I wonder why that's done with the pqPipelineProcessQueue > call in the PGASYNC_IDLE stanza, yet not with the pqPipelineProcessQueue > call in the PGASYNC_READY stanza (where we're returning a PIPELINE_SYNC > result). It kinda looks to me like maybe pqPipelineProcessQueue > ought to have the responsibility for doing that, rather than having > two out of the three call sites do it. Also it would seem more natural > to associate it with the reinitialization that happens inside > pqPipelineProcessQueue. Yeah, I agree that the placement of error message reset in pipelined libpq is more ad-hoc to the testing I was doing than carefully principled. I didn't test changing it, but on a quick look your proposed change seems reasonable to me. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Pensar que el espectro que vemos es ilusorio no lo despoja de espanto, sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)
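For concreteness, a rough sketch of the kind of rearrangement being suggested; this is not a patch, and the exact placement within the function is an assumption.

```
/*
 * Hedged sketch: make pqPipelineProcessQueue() itself clear the accumulated
 * error message whenever it advances to the next pipelined query, instead
 * of resetting conn->errorMessage at only some of its call sites.
 */
void
pqPipelineProcessQueue(PGconn *conn)
{
	/* ... existing state checks and queue bookkeeping ... */

	/*
	 * Reset the error buffer as part of reinitializing for the next query
	 * in the pipeline, so every call site gets the same behavior.
	 */
	resetPQExpBuffer(&conn->errorMessage);

	/* ... advance the command queue and set asyncStatus for the next entry ... */
}
```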
Re: [PATCH v2] use has_privs_for_role for predefined roles
On Tue, Feb 8, 2022 at 10:00 AM Joshua Brindle wrote: > 4 predefined roles currently use has_privs_of_role in master. > > Further, pg_monitor, as an SQL-only predefined role, also behaves > consistently with the INHERIT rules that other roles do. > > In order for SQL-only predefined roles to ignore INHERIT we would need > to hardcode bypasses for them, which IMO seems like the worst possible > solution to the current inconsistency. I agree we need to make the situation consistent. But if you think there's exactly one problem here and this patch fixes it, I emphatically disagree. -- Robert Haas EDB: http://www.enterprisedb.com
Re: RFC: Logging plan of the running query
On 2022-02-03 17:09, Robert Treat wrote: On Wed, Feb 2, 2022 at 7:59 AM torikoshia wrote: 2022-02-01 01:51, Fujii Masao wrote: > +Note that nested statements (statements executed inside a > function) are not > +considered for logging. Only the plan of the most deeply nested > query is logged. > > Now the plan of even nested statement can be logged. So this > description needs to be updated? Modified it as below: -Note that nested statements (statements executed inside a function) are not -considered for logging. Only the plan of the most deeply nested query is logged. +Note that when the statements are executed inside a function, only the +plan of the most deeply nested query is logged. Minor nit, but I think the "the" is superfluous.. ie. "Note that when statements are executed inside a function, only the plan of the most deeply nested query is logged" Thanks! Modified it. On 2022-02-08 01:13, Fujii Masao wrote: Thanks for the comments! On 2022/02/02 21:59, torikoshia wrote: This may cause users to misunderstand that pg_log_query_plan() fails while the target backend is holding *any* locks? Isn't it better to mention "page-level locks", instead? So how about the following? -- Note that the request to log the query plan will be ignored if it's received during a short period while the target backend is holding a page-level lock. -- Agreed. On second thought, this note is confusing rather than helpful? Because the users don't know when and what operation needs page-level lock. So now I'm thinking it's better to remove this note. Removed it. As you pointed out offlist, the issue could occur even when both pg_log_backend_memory_contexts and pg_log_query_plan are called and it may be better to make another patch. OK. You also pointed out offlist that it would be necessary to call LockErrorCleanup() if ProcessLogQueryPlanInterrupt() can incur ERROR. AFAICS it can call ereport(ERROR), i.e., palloc0() in NewExplainState(), so I added PG_TRY/CATCH and make it call LockErrorCleanup() when ERROR occurs. As we discussed off-list, this treatment might not be necessary. Because when ERROR or FATAL error happens, AbortTransaction() is called and LockErrorCleanup() is also called inside there. Agreed. Since I'm not sure how long it will take to discuss this point, the attached patch is based on the current HEAD at this time. Thanks for updating the patch! @@ -5048,6 +5055,12 @@ AbortSubTransaction(void) */ PG_SETMASK(&UnBlockSig); + /* + * When ActiveQueryDesc is referenced after abort, some of its elements +* are freed. To avoid accessing them, reset ActiveQueryDesc here. +*/ + ActiveQueryDesc = NULL; AbortSubTransaction() should reset ActiveQueryDesc to save_ActiveQueryDesc that ExecutorRun() set, instead of NULL? Otherwise ActiveQueryDesc of top-level statement will be unavailable after subtransaction is aborted in the nested statements. For example, no plan is logged while the following "select pg_sleep(test())" is running, because the exception inside test() function aborts the subtransaction and resets ActiveQueryDesc to NULL. create or replace function test () returns int as $$ begin perform 1/0; exception when others then return 30; end; $$ language plpgsql; select pg_sleep(test()); Agreed. BTW, since the above example results in calling ExecutorRun() only once, the output didn't differ even after ActiveQueryDesc is reset to save_ActiveQueryDesc. The below definition of test() worked as expected. 
create or replace function test () returns int as $$ begin perform 1; perform 1/0; exception when others then return 30; end; $$ language plpgsql; -CREATE ROLE regress_log_memory; +CREATE ROLE regress_log; Isn't this name too generic? How about regress_log_function, for example? Agreed. On 2022-02-08 17:18, Kyotaro Horiguchi wrote: At Tue, 8 Feb 2022 01:13:44 +0900, Fujii Masao wrote in On 2022/02/02 21:59, torikoshia wrote: >> This may cause users to misunderstand that pg_log_query_plan() fails >> while the target backend is holding *any* locks? Isn't it better to >> mention "page-level locks", instead? So how about the following? >> >> -- >> Note that the request to log the query plan will be ignored if it's >> received during a short period while the target backend is holding a >> page-level lock. >> -- > Agreed. On second thought, this note is confusing rather than helpful? Because the users don't know when and what operation needs page-level lock. So now I'm thinking it's better to remove this note. *I* agree to removing the note. And the following error message looks as mysterious as the note is, and the DETAIL doesn't help.. ereport(LOG_SERVER_ONLY, + errmsg("could
Re: Refactoring SSL tests
On 2/8/22 09:24, Daniel Gustafsson wrote:
>
> The attached v2 takes a stab at fixing up the POD sections.

There's a capitalization typo in SSL/Backend/OpenSSL.pm - looks like that's my fault:

+	my $backend = SSL::backend::OpenSSL->new();

Also, I think we should document that SSL::Server::new() takes an optional flavor parameter, in the absence of which it uses $ENV{with_ssl}, and that 'openssl' is the only currently supported value.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
is the base backup protocol used by out-of-core tools?
Hi,

Commit 0ba281cb4bf9f5f65529dfa4c8282abb734dd454 added a new syntax for the BASE_BACKUP command, and commit cc333f32336f5146b75190f57ef587dff225f565 added a new COPY subprotocol for taking base backups. In both cases, I preserved backward compatibility. However, nothing in PostgreSQL itself cares about this, because if you try to run an older version of pg_basebackup against a v15 server, it's going to fail anyway:

pg_basebackup: error: incompatible server version 15devel

From that point of view, there's no downside to removing from the server the old syntax for BASE_BACKUP and the old protocol for taking backups. We can't remove anything from pg_basebackup, because it is our practice to make new versions of pg_basebackup work with old versions of the server. But the reverse is not true: an older pg_basebackup will categorically refuse to work with a newer server version. Therefore keeping the code for this stuff around in the server has no value ... unless there is out-of-core code that (a) uses the BASE_BACKUP command and (b) wouldn't immediately adopt the new syntax and protocol anyway. If there is, we might want to keep the backward-compatibility code around in the server for a few releases. If not, we should probably nuke that code to simplify things and reduce the maintenance burden.

Patches for the nuking are attached. If nobody writes back, I'm going to assume that means nobody cares, and commit these some time before feature freeze. If one or more people do write back, then my plan is to see what they have to say and proceed accordingly.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

0002-Remove-legacy-base-backup-protocol.patch Description: Binary data
0001-Remove-old-base-backup-syntax.patch Description: Binary data
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Mon, Dec 6, 2021 at 12:45 PM Robert Haas wrote: > So for example, imagine tests with 1GB of shared_buffers, 8GB, and > 64GB. And template databases with sizes of whatever the default is, > 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then > create a new database from one of the templates. And then just measure > the performance. Maybe for large databases this approach is just > really the pits -- and if your max_wal_size is too small, it > definitely will be. But, I don't know, maybe with reasonable settings > it's not that bad. Writing everything to disk twice - once to WAL and > once to the target directory - has to be more expensive than doing it > once. But on the other hand, it's all sequential I/O and the data > pages don't need to be fsync'd, so perhaps the overhead is relatively > mild. I don't know. I have been tied up with other things for a bit now and have not had time to look at this thread; sorry about that. I have a little more time available now so I thought I would take a look at this again and see where things stand. Sadly, it doesn't appear to me that anyone has done any performance testing of this patch, along the lines suggested above or otherwise, and I think it's a crucial question for the patch. My reading of this thread is that nobody really likes the idea of maintaining two methods for performing CREATE DATABASE, but nobody wants to hose people who are using it to clone large databases, either. To some extent those things are inexorably in conflict. If we postulate that the 10TB template database is on a local RAID array with 40 spindles, while pg_wal is on an iSCSI volume that we access via a 128kB ISDN link, then the new system is going to be infinitely worse. But real situations aren't likely to be that bad, and it would be useful in my opinion to have an idea how bad they actually are. I'm somewhat inclined to propose that we keep the existing method around along with the new method. Even though nobody really likes that, we don't necessarily have to maintain both methods forever. If, say, we use the new method by default in all cases, but add an option to get the old method back if you need it, we could leave it that way for a few years and then propose removing the old method (and the switch to activate it) and see if anyone complains. That way, if the new method turns out to suck in certain cases, users have a way out. However, I still think doing some performance testing would be a really good idea. It's not a great plan to make decisions about this kind of thing in an information vacuum. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Wed, Dec 22, 2021 at 9:32 AM Ashutosh Sharma wrote: > I couldn't find the mdpostchkpt() function. Are you talking about > SyncPostCheckpoint() ? Anyway, as you have rightly said, we need to > unlink all the files available inside the dst_tablespaceoid/dst_dboid/ > directory by scanning the pendingUnlinks list. And finally we don't > want the next checkpoint to unlink this file again and PANIC so for > that we have to update the entry for this unlinked rel file in the > hash table i.e. cancel the sync request for it. Until commit 3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 in April 2019, there was an mdpostckpt function, which is probably what was meant here. -- Robert Haas EDB: http://www.enterprisedb.com
Re: decoupling table and index vacuum
On Sun, Feb 6, 2022 at 11:25 PM Dilip Kumar wrote: > > One thing we could try doing in order to make that easier would be: > > tweak things so that when autovacuum vacuums the table, it only > > vacuums the indexes if they meet some threshold for bloat. I'm not > > sure exactly what happens with the heap vacuuming then - do we do > > phases 1 and 2 always, or a combined heap pass, or what? But if we > > pick some criteria that vacuums indexes sometimes and not other times, > > we can probably start doing some meaningful measurement of whether > > this patch is making bloat better or worse, and whether it's using > > fewer or more resources to do it. > > I think we can always trigger phase 1 and 2 and phase 2 will only > vacuum conditionally based on if all the indexes are vacuumed for some > conveyor belt pages so we don't have risk of scanning without marking > anything unused. Not sure what you mean about a risk of scanning without marking any LP_DEAD items as LP_UNUSED. If VACUUM always does some amount of this, then it follows that the new mechanism added by the patch just can't safely avoid any work at all, making it all pointless. We have to expect heap vacuuming to take place much less often with the patch. Simply because that's what the invariant described in comments above lazy_scan_heap() requires. Note that this is not the same thing as saying that we do less *absolute* heap vacuuming with the conveyor belt -- my statement about less heap vacuuming taking place is *only* true relative to the amount of other work that happens in any individual "shortened" VACUUM operation. We could do exactly the same total amount of heap vacuuming as before (in a version of Postgres without the conveyor belt but with the same settings), but much *more* index vacuuming (at least for one or two problematic indexes). > And we can try to measure with other approaches as > well where we completely avoid phase 2 and it will be done only along > with phase 1 whenever applicable. I believe that the main benefit of the dead TID conveyor belt (outside of global index use cases) will be to enable us to do more (much more) index vacuuming for one index in particular. So it's not really about doing less index vacuuming or less heap vacuuming -- it's about doing a *greater* amount of *useful* index vacuuming, in less time. There is often some way in which failing to vacuum one index for a long time does lasting damage to the index structure. -- Peter Geoghegan
Re: decoupling table and index vacuum
On Tue, Feb 8, 2022 at 12:12 PM Peter Geoghegan wrote: > I believe that the main benefit of the dead TID conveyor belt (outside > of global index use cases) will be to enable us to do more (much more) > index vacuuming for one index in particular. So it's not really about > doing less index vacuuming or less heap vacuuming -- it's about doing > a *greater* amount of *useful* index vacuuming, in less time. There is > often some way in which failing to vacuum one index for a long time > does lasting damage to the index structure. This makes sense to me, and I think it's a good insight. It's not clear to me that we have enough information to make good decisions about which indexes to vacuum and which indexes to skip. -- Robert Haas EDB: http://www.enterprisedb.com
Re: decoupling table and index vacuum
On Tue, Feb 8, 2022 at 9:33 AM Robert Haas wrote: > On Tue, Feb 8, 2022 at 12:12 PM Peter Geoghegan wrote: > > I believe that the main benefit of the dead TID conveyor belt (outside > > of global index use cases) will be to enable us to do more (much more) > > index vacuuming for one index in particular. So it's not really about > > doing less index vacuuming or less heap vacuuming -- it's about doing > > a *greater* amount of *useful* index vacuuming, in less time. There is > > often some way in which failing to vacuum one index for a long time > > does lasting damage to the index structure. > > This makes sense to me, and I think it's a good insight. > > It's not clear to me that we have enough information to make good > decisions about which indexes to vacuum and which indexes to skip. What if "extra vacuuming, not skipping vacuuming" was not just an abstract goal, but an actual first-class part of the implementation, and the index AM API? Then the question we're asking the index/index AM is no longer "Do you [an index] *not* require index vacuuming, even though you are entitled to it according to the conventional rules of autovacuum scheduling?". The question is instead more like "Could you use an extra, early VACUUM?". if we invert the question like this then we have something that makes more sense at the index AM level, but requires significant improvements at the level of autovacuum scheduling. On the other hand I think that you already need to do at least some work in that area. -- Peter Geoghegan
Re: is the base backup protocol used by out-of-core tools?
On Tue, Feb 08, 2022 at 11:26:41AM -0500, Robert Haas wrote: > Patches for the nuking are attached. If nobody writes back, I'm going > to assume that means nobody cares, and commit these some time before > feature freeze. If one or more people do write back, then my plan is > to see what they have to say and proceed accordingly. I created this and added that for visibility and so it's not forgotten. ISTM that could be done post-freeze, although I don't know if that's useful or important. https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items -- Justin
Re: is the base backup protocol used by out-of-core tools?
On Tue, Feb 8, 2022 at 1:03 PM Justin Pryzby wrote: > I created this and added that for visibility and so it's not forgotten. > ISTM that could be done post-freeze, although I don't know if that's useful or > important. > > https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items Thanks. I feel like this is in the department of things that are unlikely to be an issue for anyone, but could be. But the set of people who could care is basically just backup tool authors, so hopefully they'll notice this thread, or that list. -- Robert Haas EDB: http://www.enterprisedb.com
Re: is the base backup protocol used by out-of-core tools?
On 2/8/22 12:39, Robert Haas wrote: On Tue, Feb 8, 2022 at 1:03 PM Justin Pryzby wrote: I created this and added that for visibility and so it's not forgotten. ISTM that could be done post-freeze, although I don't know if that's useful or important. https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items Thanks. I feel like this is in the department of things that are unlikely to be an issue for anyone, but could be. But the set of people who could care is basically just backup tool authors, so hopefully they'll notice this thread, or that list. I'm aware of several tools that use pg_basebackup but they are using the command-line tool, not the server protocol directly. Personally, I'm in favor of simplifying the code on the server side. If anyone is using the protocol directly (which I kind of doubt) I imagine they would want to take advantage of all the new features that have been added. Regards, -David
Re: decoupling table and index vacuum
On Tue, Feb 8, 2022 at 12:50 PM Peter Geoghegan wrote: > > It's not clear to me that we have enough information to make good > > decisions about which indexes to vacuum and which indexes to skip. > > What if "extra vacuuming, not skipping vacuuming" was not just an > abstract goal, but an actual first-class part of the implementation, > and the index AM API? Then the question we're asking the index/index > AM is no longer "Do you [an index] *not* require index vacuuming, even > though you are entitled to it according to the conventional rules of > autovacuum scheduling?". The question is instead more like "Could you > use an extra, early VACUUM?". > > if we invert the question like this then we have something that makes > more sense at the index AM level, but requires significant > improvements at the level of autovacuum scheduling. On the other hand > I think that you already need to do at least some work in that area. Right, that's why I asked the question. If we're going to ask the index AM whether it would like to be vacuumed right now, we're going to have to put some logic into the index AM that knows how to answer that question. But if we don't have any useful statistics that would let us answer the question correctly, then we have problems. While I basically agree with everything that you just wrote, I'm somewhat inclined to think that the question is not best phrased as either extra-vacuum or skip-a-vacuum. Either of those supposes a normative amount of vacuuming from which we could deviate in one direction or the other. I think it would be better to phrase it in a way that doesn't make such a supposition. Maybe something like: "Hi, we are vacuuming the heap right now and we are also going to vacuum any indexes that would like it, and does that include you?" The point is that it's a continuum. If we decide that we're asking the index "do you want extra vacuuming?" then that phrasing suggests that you should only say yes if you really need it. If we decide we're asking the index "can we skip vacuuming you this time?" then the phrasing suggests that you should not feel bad about insisting on a vacuum right now, and only surrender your claim if you're sure you don't need it. But in reality, no bias either way is warranted. It is either better that this index should be vacuumed right now, or better that it should not be vacuumed right now, and whichever is better should be what we choose to do. To expand on that just a bit, if I'm a btree index and someone asks me "can we skip vacuuming you this time?" I might say "return dead_tups < tiny_amount" and if they ask me "do you want extra vacuuming" I might say "return dead_tups > quite_large_amount". But if they ask me "should we vacuum you now?" then I might say "return dead_tups > moderate_amount" which feels like the correct thing here. -- Robert Haas EDB: http://www.enterprisedb.com
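As one way to picture what such a question could look like at the index AM level, here is a hypothetical sketch -- none of this is an existing API; the struct, the callback name, and the 10% threshold are all assumptions:

#include "postgres.h"

#include "storage/block.h"
#include "utils/rel.h"

/*
 * Hypothetical information the vacuum machinery could pass when asking an
 * index "should we vacuum you now?".  Purely illustrative.
 */
typedef struct IndexVacuumQuestion
{
	double		dead_tuples;	/* TIDs accumulated (e.g. on the conveyor belt) */
	double		index_tuples;	/* approximate number of live index entries */
	BlockNumber index_pages;	/* current physical size of the index */
} IndexVacuumQuestion;

/*
 * A btree-ish answer along the lines sketched above: neither "only if I'm
 * desperate" nor "only if it costs nothing", just a moderate threshold.
 */
static bool
btree_should_vacuum_now(Relation index, const IndexVacuumQuestion *q)
{
	const double moderate_fraction = 0.10;	/* assumed, not tuned */

	if (q->index_tuples <= 0)
		return false;

	return q->dead_tuples > moderate_fraction * q->index_tuples;
}

The interesting part is not the particular threshold but that the same callback shape can express all three phrasings above simply by changing the comparison.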
Re: is the base backup protocol used by out-of-core tools?
On Tue, Feb 8, 2022 at 1:52 PM David Steele wrote: > I'm aware of several tools that use pg_basebackup but they are using the > command-line tool, not the server protocol directly. Good to know. > Personally, I'm in favor of simplifying the code on the server side. If > anyone is using the protocol directly (which I kind of doubt) I imagine > they would want to take advantage of all the new features that have been > added. I agree, and I'm glad you do too, but I thought it was worth asking. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Documentation about PL transforms
On 02/07/22 15:14, Chapman Flack wrote: > It has since occurred to me that another benefit of having a > transform_validator per PL would be immediate error reporting > if someone, for whatever reason, tries out CREATE TRANSFORM > for a PL that doesn't grok transforms. The same could be achieved, I guess, by an event trigger, though that would take positive effort to set up, where the benefit of a per-PL transform_validator would be that if a given PL does not bother to provide one, CREATE TRANSFORM for it would automatically fail. (And such a validator would not have to spend most of its life ignoring other DDL.) That does reveal another documentation gap: table 40.1 does not show CREATE or DROP TRANSFORM being supported for event triggers. I've confirmed they work, though. I'll tack that onto the doc patch. I notice our transforms lack the named groups of 9075-2. With those (plus our LANGUAGE clause), I could write, for a made-up example: CREATE TRANSFORM FOR hstore LANGUAGE plpython3u asdict ( FROM SQL WITH FUNCTION hstore_to_plpython3dict, TO SQL WITH FUNCTION plpython3dict_to_hstore) asnamedtuple ( FROM SQL WITH FUNCTION hstore_to_plpython3namedtup, TO SQL WITH FUNCTION plpython3namedtup_to_hstore); CREATE FUNCTION f1(val hstore) RETURNS int LANGUAGE plpython3u TRANSFORM GROUP asdict FOR TYPE hstore ... CREATE FUNCTION f2(val hstore) RETURNS int LANGUAGE plpython3u TRANSFORM GROUP asnamedtuple FOR TYPE hstore ... It seems to me that could be useful, in cases where a PL offers more than one good choice for representing a PostgreSQL type and the preferred one could depend on the function. Was that considered and rejected for our transforms, or were ours just based on an earlier 9075-2 without the named groups? Also, I am doubtful of our Compatibility note, "There is a CREATE TRANSFORM command in the SQL standard, but it is for adapting data types to client languages." In my reading of 9075-2, I do see the transforms used for client languages (all the s), but I also see the to-sql and from-sql functions being applied in whenever "R is an external routine" and the type "is a user-defined type". The latter doesn't seem much different from our usage. The differences I see are (1) our LANGUAGE clause, (2) we don't have a special "user-defined type" category limiting what types can have transforms (and (3), we don't have the named groups). And we are applying them /only/ for server-side routines, rather than for server and client code. The ISO transforms work by mapping the ("user-defined") type to some existing SQL type for which the PL has a standard mapping already. Ours work by mapping the type directly to some suitable type in the PL. Am I reading this accurately? Regards, -Chap
Re: Improve correlation names in sanity tests
Peter Eisentraut writes: > I had to do some analysis on the "sanity" tests in the regression test > suite (opr_sanity, type_sanity) recently, and I found some of the > queries very confusing. One main stumbling block is that for some > probably ancient reason many of the older queries are written with > correlation names p1, p2, etc. independent of the name of the catalog. I think that was at least partially my fault, and no there isn't any very good reason for it. Your proposal seems fine. regards, tom lane
Re: decoupling table and index vacuum
On Tue, Feb 8, 2022 at 10:58 AM Robert Haas wrote: > Right, that's why I asked the question. If we're going to ask the > index AM whether it would like to be vacuumed right now, we're going > to have to put some logic into the index AM that knows how to answer > that question. But if we don't have any useful statistics that would > let us answer the question correctly, then we have problems. I have very little faith in the use of statistical sampling for anything involving vacuuming. In fact, I think that the current way in which ANALYZE counts dead tuples is a misapplication of statistics. It isn't even wrong. One of the things that I really like about this project is that it can plausibly solve that problem by splitting up the work of VACUUM, at low cost -- it's less top-down. Not only do you get the obvious benefits with preventing bloat; you also get *continual* feedback about the actual physical reality in the table (and indexes, to a lesser extent). As I said recently, right now the more bloat we have, the more uncertainty about the total amount of bloat exists. We need to control both the bloat, and the uncertainty about the bloat. The basic high level idea behind how the optimizer uses statistics involves the assumption that *all* the rows in the table are *themselves* a sample taken from some larger distribution -- something from the real physical world (meeting this assumption is one reason why database/schema normalization really matters). And so on a good week it probably won't matter too much to the optimizer if ANALYZE doesn't run until the table size doubles (for a table that was already quite large). These are pretty delicate assumptions, that (from the point of view of the optimizer) work out surprisingly well in practice. Bloat just isn't like that. Dead tuples are fundamentally cyclic and dynamic in nature -- conventional statistics just won't work with something like that. Worst of all, the process that counts dead tuples (ANALYZE) is really an active participant in the system -- the whole entire purpose of even looking is to *reduce* the number of dead tuples by making an autovacuum run. That's deeply weird. > The point is that it's a continuum. If we decide that we're asking the > index "do you want extra vacuuming?" then that phrasing suggests that > you should only say yes if you really need it. If we decide we're > asking the index "can we skip vacuuming you this time?" then the > phrasing suggests that you should not feel bad about insisting on a > vacuum right now, and only surrender your claim if you're sure you > don't need it. But in reality, no bias either way is warranted. Actually, I think that this particular bias *is* warranted. We should openly and plainly be biased in the direction of causing the least harm. What's wrong with that? Having accurate information is not an intrinsic good. I even think that having more information can be strictly worse, because you might actually believe it. Variance matters a lot -- the bias/variance tradeoff is pretty fundamental here. I'm also saying some of this stuff because of broader VACUUM design considerations. VACUUM fundamentally has to work at the table level, and I don't see that changing. The approach of making autovacuum do something akin to a plain VACUUM command in the simplest cases, and only later some extra "dynamic mini vacuums" (that pick up where the VACUUM command style VACUUM left off) has a lot to recommend it.
This approach allows most of the current autovacuum settings to continue to work in roughly the same way. They just need to have their documentation updated to make it clear that they're about the worst case. > To expand on that just a bit, if I'm a btree index and someone asks me > "can we skip vacuuming you this time?" I might say "return dead_tups < > tiny_amount" and if they ask me "do you want extra vacuuming" I might > say "return dead_tups > quite_large_amount". But if they ask me > "should we vacuum you now?" then I might say "return dead_tups > > moderate_amount" which feels like the correct thing here. The btree side of this shouldn't care at all about dead tuples (in general we focus way too much on dead tuples, and way too little on pages). With bottom-up index deletion the number of dead tuples in the index is just about completely irrelevant. It's entirely possible and often even likely that 20%+ of all index tuples will be dead at any one time, when the optimization perfectly preserves the index structure. The btree side of the index AM API should be focussing on the growth in index size, relative to some expectation (like maybe the growth for whatever index on the same table has grown the least since last time, accounting for obvious special cases like partial indexes). Perhaps we'd give some consideration to bulk deletes, too. Overall, it should be pretty simple, and should sometimes force us to do one of these "dynamic mini vacuums" of the index just because we're not quite sure wh
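A rough sketch of that growth-relative signal, assuming we can see each index's size now and as of its last vacuum; the function, its inputs, and the 10% slack factor are all assumptions for illustration, not an existing API:

#include "postgres.h"

#include "storage/block.h"

/*
 * Hypothetical growth-based trigger: compare an index's growth since its
 * last vacuum against the least growth seen among indexes on the same
 * table, and only flag the clear outliers.
 */
static bool
index_growth_looks_excessive(BlockNumber pages_now, BlockNumber pages_before,
							 double least_growth_on_table)
{
	double		growth;

	if (pages_before == 0)
		return false;			/* brand-new index, nothing to compare against */

	growth = (double) pages_now / (double) pages_before;

	/* Flag the index if it grew noticeably more than its best-behaved peer. */
	return growth > least_growth_on_table * 1.10;
}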
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Sun, Dec 12, 2021 at 3:09 AM Dilip Kumar wrote: > Correct, I have done this cleanup, apart from this we have dropped the > fsync request in create database failure case as well and also need to > drop buffer in error case of createdb as well as movedb. I have also > fixed the other issue for which you gave the patch (a bit differently) > basically, in case of movedb the source and destination dboid are same > so we don't need an additional parameter and also readjusted the > conditions to avoid nested if. Amazingly to me given how much time has passed, these patches still apply, although I think there are a few outstanding issues that you promised to fix in the next version and haven't yet addressed. In 0007, I think you will need to work a bit harder. I don't think that you can just add a second argument to ForgetDatabaseSyncRequests() that makes it do something other than what the name of the function suggests but without renaming the function or updating any comments. Elsewhere we have things like TablespaceCreateDbspace and ResetUnloggedRelationsInDbspaceDir so perhaps we ought to just add a new function with a name inspired by those precedents alongside the existing one, rather than doing it this way. In 0008, this is a bit confusing: + PageInit(dstPage, BufferGetPageSize(dstBuf), 0); + memcpy(dstPage, srcPage, BLCKSZ); After a minute, I figured out that the point here was to force log_newpage() to actually set the LSN, but how about a comment? I kind of wonder whether GetDatabaseRelationList should be broken into two functions so that we don't have quite such deep nesting. And I wonder if maybe the return value of GetActiveSnapshot() should be cached in a local variable. On the whole I think there aren't huge code-level issues here, even if things need to be tweaked here and there and bugs fixed. The real key is arriving at a set of design trade-offs that doesn't make anyone too upset. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Refactoring the regression tests for more independence
Julien Rouhaud writes: > On Mon, Feb 07, 2022 at 02:00:25PM -0500, Tom Lane wrote: >> Not too surprisingly, these patches broke during the commitfest. >> Here's a rebased version. >> I'm not sure that anyone wants to review these in detail ... >> should I just go ahead and push them? > I don't see anything shocking after a quick glance, and I don't think any > review is going to give any more confidence compared to the script-dep-testing > script, so +1 for pushing them since the cf bot is green again. Done, will watch the farm. regards, tom lane
Re: Move replication slot structures/enums/macros to a new header file for better usability by external tools/modules
On Mon, Feb 7, 2022 at 4:22 PM Bharath Rupireddy wrote: > > Hi, > > While working on pg_replslotdata tool [1], it was observed that some > of the replication slot structures/enums/macros such as > ReplicationSlotPersistentData, ReplicationSlotPersistency, > ReplicationSlotOnDisk, ReplicationSlotOnDisk etc. are currently in > replication/slot.h and replication/slot.c. This makes the replication > slot on disk data structures unusable by the external tools/modules. > How about moving these structures to a new header file called > slot_common.h as attached in the patch here? > > Thoughts? > > PS: I'm planning to have the tool separately, as it was rejected to be in > core. > > [1] > https://www.postgresql.org/message-id/CALj2ACW0rV5gWK8A3m6_X62qH%2BVfaq5hznC%3Di0R5Wojt5%2Byhyw%40mail.gmail.com > > Regards, > Bharath Rupireddy. Recently I was also looking to add some new enums but I found it was difficult to find any good place to put them where they could be shared by the replication code and the pg_recvlogical tool. So +1 to your suggestion to have a common header, but I wonder can it have a more generic name (e.g. repl_common.h? ...) since the stuff I wanted to put there was not really "slot" related. -- Kind Regards, Peter Smith. Fujitsu Australia.
Re: Fix BUG #17335: Duplicate result rows in Gather node
Thanks for having a look at this. On Fri, 4 Feb 2022 at 13:48, Robert Haas wrote: > I think the actual rule is: every path under a Gather or GatherMerge > must be parallel-safe. I've adjusted the patch so that it counts parallel_aware and parallel_safe Paths independently and verifies everything below a Gather[Merge] is parallel_safe. The diff stat currently looks like: src/backend/optimizer/plan/createplan.c | 230 1 file changed, 230 insertions(+) I still feel this is quite a bit of code for what we're getting here. I'd be more for it if the path traversal function existed for some other reason and I was just adding the callback functions and Asserts. I'm keen to hear what others think about that. David diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index cd6d72c763..898046ca07 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -313,6 +313,20 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan, List *rowMarks, OnConflictExpr *onconflict, int epqParam); static GatherMerge *create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path); +static bool contains_a_parallel_aware_path(Path *path); +static bool contains_only_parallel_safe_paths(Path *path); + +/* + * PathTypeCount + * Used for various checks to assert plans are sane in assert enabled + * builds. + */ +typedef struct PathTypeCount +{ + uint64 count; + uint64 parallel_safe_count; + uint64 parallel_aware_count; +} PathTypeCount; /* @@ -389,6 +403,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags) /* Guard against stack overflow due to overly complex plans */ check_stack_depth(); + /* Parallel aware paths should contain only parallel safe subpaths. */ + Assert(!best_path->parallel_aware || + contains_only_parallel_safe_paths(best_path)); + switch (best_path->pathtype) { case T_SeqScan: @@ -481,6 +499,14 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags) case T_Gather: plan = (Plan *) create_gather_plan(root, (GatherPath *) best_path); + + /* +* We expect a Gather to contain at least one parallel aware path +* unless running in single_copy mode. +*/ + Assert(((GatherPath *) best_path)->single_copy || + contains_a_parallel_aware_path(((GatherPath *) + best_path)->subpath)); break; case T_Sort: plan = (Plan *) create_sort_plan(root, @@ -537,6 +563,9 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags) case T_GatherMerge: plan = (Plan *) create_gather_merge_plan(root, (GatherMergePath *) best_path); + /* GatherMerge must contain at least one parallel aware path */ + Assert(contains_a_parallel_aware_path(((GatherMergePath *) + best_path)->subpath)); break; default: elog(ERROR, "unrecognized node type: %d", @@ -7052,6 +7081,207 @@ make_modifytable(PlannerInfo *root, Plan *subplan, return node; } +/* + * path_tree_walker + * Walk a path tree beginning with 'path' and call the 'walker' function + * for that path and each of its subpaths recursively. 
+ */ +static void +path_tree_walker(Path *path, void (*walker) (), void *context) + +{ + if (path == NULL) + return; + + /* Guard against stack overflow due to overly complex path trees */ + check_stack_depth(); + + walker(path, context); + + switch (path->pathtype) + { + case T_SeqScan: + case T_SampleScan: + case T_IndexScan: + case T_IndexOnlyScan: + case T_BitmapHeapScan: + case T_TidScan: + case T_TidRangeScan: + case T_SubqueryScan: + case T_FunctionScan: + case T_TableFuncScan: + case T_ValuesScan: + case T_CteScan: + case T_WorkTableScan: + case T_NamedTuplestoreSc
Re: make MaxBackends available in _PG_init
On Fri, Feb 4, 2022 at 3:27 PM Robert Haas wrote: > Great. I'll take a look at this next week, but not right now, mostly > because it's Friday afternoon and if I commit it and anything breaks I > don't want to end up having to fix it on the weekend if I can avoid > it. After some investigation I've determined that it's no longer Friday afternoon. I also spent time investigating whether the patch had problems that would make me uncomfortable with the idea of committing it, and I did not find any. So, I committed it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fix BUG #17335: Duplicate result rows in Gather node
On Tue, Feb 8, 2022 at 1:11 PM David Rowley wrote: > Thanks for having a look at this. > > On Fri, 4 Feb 2022 at 13:48, Robert Haas wrote: > > I think the actual rule is: every path under a Gather or GatherMerge > > must be parallel-safe. > > I've adjusted the patch so that it counts parallel_aware and > parallel_safe Paths independently and verifies everything below a > Gather[Merge] is parallel_safe. > > The diff stat currently looks like: > > src/backend/optimizer/plan/createplan.c | 230 > 1 file changed, 230 insertions(+) > > I still feel this is quite a bit of code for what we're getting here. > I'd be more for it if the path traversal function existed for some > other reason and I was just adding the callback functions and Asserts. > > I'm keen to hear what others think about that. > > David > Hi, + break; + case T_MergeAppend: The case for T_MergeAppend should be left indented. + case T_Result: + if (IsA(path, ProjectionPath)) Since the remaining sub-cases don't have subpath, they are covered by the final `else` block - MinMaxAggPath and GroupResultPath don't need to be checked. For contains_a_parallel_aware_path(), it seems path_type_counter() can return bool indicating whether the walker should return early (when parallel aware count reaches 1). Cheers
Re: Fix BUG #17335: Duplicate result rows in Gather node
On Tue, Feb 8, 2022 at 4:11 PM David Rowley wrote: > I still feel this is quite a bit of code for what we're getting here. > I'd be more for it if the path traversal function existed for some > other reason and I was just adding the callback functions and Asserts. > > I'm keen to hear what others think about that. My view is that functions like path_tree_walker are good things to have on general principle. I find it likely that it will find other uses, and that if we don't add as part of this patch, someone will add it for some other reason in the future. So I would not really count that in deciding how big this patch is, and the rest of what you have here is pretty short and to the point. There is the more difficult philosophical question of whether it's worth expending any code on this at all. I think it is pretty clear that this has positive value: it could easily prevent >0 future bugs, which IMHO is not bad for such a small patch. However, it does feel a little bit primitive somehow, in the sense that there are a lot of things you could do wrong which this wouldn't catch. For example, a Gather with no parallel-aware node under it is probably busted, unless someone invents new kinds of parallel operators that work differently from what we have now. But a join beneath a Gather that is not itself parallel-aware should have a parallel-aware node under exactly one side of the join. If there's a parallel scan on both sides or neither side, even with stuff on top of it, that's wrong. But a parallel-aware join could do something else, e.g. Parallel Hash Join expects a parallel path on both sides. Some other parallel-aware join type could expect a parallel path on exactly one side without caring which one, or on one specific side, or maybe even on neither side. What we're really reasoning about here is whether the input is going to be partitioned across multiple executions of the plan in a proper way. A Gather is going to run the same plan in all of its workers, so it wants a subplan that when run in all workers will together produce all output rows. Parallel-aware scans partition the results across workers, so they behave that way. A non-parallel aware join will work that way if it joins a partition the input on one side to all of the input from the other side, hence the rule I describe above. For aggregates, you can't safely apply a plain old Aggregate operation either to a regular scan or to a parallel-aware scan and get the right answer, which is why we need Partial and Finalize stages for parallel query. But for a lot of other nodes, like Materialize, their output will have the same properties as the input: if the subplan of a Materialize node produces all the rows on each execution, the Materialize node will too; if it produces a partition of the output rows each time it's executed, once per worker, the Materialize node will do the same. And I think it's that kind of case that leads to the check we have here, that there ought to be a parallel-aware node in there someplace. It might be the case that there's some more sophisticated check we could be doing here that would be more satisfying than the one you've written, but I'm not sure. Such a check might end up needing to know the behavior of the existing nodes in a lot of detail, which then wouldn't help with finding bugs in new functionality we add in the future. 
In that sense, the kind of simple check you've got here has something to recommend it: it won't catch everything people can do wrong, but when it does trip, chances are good it's found a bug, and it's got a good chance of continuing to work as well as it does today even in the face of future additions. So I guess I'm mildly in favor of it, but I would also find it entirely reasonable if you were to decide it's not quite worth it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: WIP: System Versioned Temporal Table
On 1/24/22 00:16, Corey Huinker wrote: - Table schemas change, and all (SV active) AV items would logically need to fit the active schema or be updated to do so. Different story for SV, nothing there should ever need to be changed. Yeah, there's a mess (which you state below) about what happens if you create a table and then rename a column, or drop a column and add a same-named column back of another type at a later date, etc. In theory, this means that the valid set of columns and their types changes according to the time range specified. I may not be remembering correctly, but Vik stated that the SQL spec seemed to imply that you had to track all those things. The spec does not allow schema changes at all on a system versioned table, except to change the system versioning itself. -- Vik Fearing
Re: [RFC] building postgres with meson - perl embedding
Andres Freund writes: > On 2022-02-07 20:42:09 -0500, Tom Lane wrote: >> ... Peter just copied the logic in 7662419f1. Considering that >> the point of 7662419f1 was to get rid of MakeMaker, maybe we no >> longer needed that at that point. > Yea. And maybe what was broken in 2001 isn't broken anymore either ;) Yeah --- note that Bruce was complaining about a problem on Perl 5.005, which was already a bit over-the-hill in 2001. > AIX is the one exception. Specifying -bE... doesn't seem right for building > plperl etc. So possibly the subtraction accidentally does work for us there... I tried this on AIX 7.2 (using the gcc farm, same build options as hoverfly). The build still works and passes regression tests, but you get a warning about each symbol exported by Perl itself: ... ld: 0711-415 WARNING: Symbol PL_veto_cleanup is already exported. ld: 0711-415 WARNING: Symbol PL_warn_nl is already exported. ld: 0711-415 WARNING: Symbol PL_warn_nosemi is already exported. ld: 0711-415 WARNING: Symbol PL_warn_reserved is already exported. ld: 0711-415 WARNING: Symbol PL_warn_uninit is already exported. ld: 0711-415 WARNING: Symbol PL_WB_invlist is already exported. ld: 0711-415 WARNING: Symbol PL_XPosix_ptrs is already exported. ld: 0711-415 WARNING: Symbol PL_Yes is already exported. ld: 0711-415 WARNING: Symbol PL_Zero is already exported. So there's about 1200 such warnings for plperl, and then the same again for each contrib foo_plperl module. Maybe that's annoying enough that we should keep the logic. OTOH, it seems entirely accidental that it has that effect. I'd be a little inclined to replace it with some rule about stripping '-bE:' switches out of the ldopts result. regards, tom lane
Re: [PATCH] Add min() and max() aggregate functions for xid8
On 2022-02-08 23:16, Fujii Masao wrote: If you want to avoid the line longer than 80 columns, you should break it into two or more rather than remove the test code, I think. What to test is more important than formatting. Also the following descriptions about formatting would be helpful. --- https://www.postgresql.org/docs/devel/source-format.html Limit line lengths so that the code is readable in an 80-column window. (This doesn't mean that you must never go past 80 columns. For instance, breaking a long error message string in arbitrary places just to keep the code within 80 columns is probably not a net gain in readability.) --- Therefore I'm ok with the patch that I posted upthread. Also I'm ok if you will break that longer line into two and post new patch. Or if the value '010' is really useless for the test purpose, I'm also ok if you remove it. Thought? Thank you for the explanation! Even though the line is over 80 characters, it makes more sense to put it in one line and it enhances readability IMO. Also, '010' is good to have since it is the only octal value in the test. Therefore, I think min_max_aggregates_for_xid8_v4.patch is the best one to go. Best wishes, -- Ken Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
RE: Support escape sequence for cluster_name in postgres_fdw.application_name
Hi, Thank you for updating the patch. I agree with the documentation and program. How about adding the test for %c (Session ID)? (Adding the test for %C (cluster_name) seems difficult.) Regards, Ryohei Takahashi
Re: [PATCH v2] use has_privs_for_role for predefined roles
On 2/8/22 10:07, Robert Haas wrote: On Tue, Feb 8, 2022 at 10:00 AM Joshua Brindle wrote: 4 predefined roles currently use has_privs_of_role in master. Further, pg_monitor, as an SQL-only predefined role, also behaves consistently with the INHERIT rules that other roles do. In order for SQL-only predefined roles to ignore INHERIT we would need to hardcode bypasses for them, which IMO seems like the worst possible solution to the current inconsistency. I agree we need to make the situation consistent. But if you think there's exactly one problem here and this patch fixes it, I emphatically disagree. If we were to start all over again with this feature my vote would be to do things differently than we have done. I would not have called them predefined roles, and I would have used attributes of roles (e.g. make rolsuper into a bitmap rather than a boolean) rather than role membership to implement them. But I didn't find time to participate in the original discussion or review/write the code, so I have little room to complain. However since we did call these things predefined roles, and used role membership to implement them, I think they ought to behave both self-consistently as a group, and like other real roles. That is what this patch does unless I am missing something. I guess an alternative is to discuss a "proper fix", but it seems to me that would be a version 16 thing given how late we are in this development cycle and how invasive it is likely to be. And doing nothing for pg15 is not a very satisfying proposition :-/ Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: [PATCH] Accept IP addresses in server certificate SANs
On Mon, 2022-02-07 at 17:29 +0900, Kyotaro Horiguchi wrote: > At Fri, 4 Feb 2022 17:06:53 +, Jacob Champion > wrote in > > That works a lot better than what I had in my head. Done that way in > > v4. Thanks! > > Thanks! > > 0002: > > +#define PGSQL_AF_INET (AF_INET + 0) > +#define PGSQL_AF_INET6 (AF_INET + 1) > .. > -#define PGSQL_AF_INET (AF_INET + 0) > -#define PGSQL_AF_INET6 (AF_INET + 1) > > I feel this should be a part of 0001. (But the patches will be > finally merged so maybe no need to bother moving it). Okay. I can move it easily if you feel like it would help review, but for now I've kept it in 0002. > > * The use of inet_aton() instead of inet_pton() is deliberate; the > > * latter cannot handle alternate IPv4 notations ("numbers-and-dots"). > > I think we should be consistent in handling IP addresses. We have > both inet_pton and inet_aton to parse IPv4 addresses. > > We use inet_pton in the inet type (network_in). > We use inet_aton in server addresses. > > # Hmm. I'm surprised to see listen_addresses accepts "0x7f.1". > # I think we should accept the same by network_in but it is another > # issue. Yeah, that's an interesting inconsistency. > So, inet_aton there seems to be the right choice but the comment > doesn't describe the reason for that behavior. I think we should add > an explanation about the reason for the behavior, maybe something like > this: > > > We accept alternative IPv4 address notations that are accepted by > > inet_aton but not by inet_pton as server address. I've pulled this wording into the comment in v5, attached. > +* GEN_IPADD is an OCTET STRING containing an IP address in network > byte > +* order. > > + /* OK to cast from unsigned to plain char, since it's all ASCII. */ > + return pq_verify_peer_name_matches_certificate_ip(conn, (const char > *) addrdata, len, store_name); > > Aren't the two comments contradicting each other? The retruned general > name looks like an octet array, which is not a subset of ASCII > string. So pq_verify_peer_name_matches_certificate_ip should receive > addrdata as "const unsigned char *", without casting. Bad copy-paste on my part; thanks for the catch. Fixed. > + if (name->type == host_type) > + check_cn = false; > > Don't we want a concise coment for this? Added one; see what you think. > - if (*names_examined == 0) > + if ((rc == 0) && check_cn) > > To me, it seems a bit hard to understand. We can set false to > check_cn in the rc != 0 path in the loop on i, like this: > > > if (rc != 0) > > + { > > + /* > > + * don't fall back to CN when we have a match > > or have an error > > + */ > > + check_cn = false; > > break; > > + } > ... > > - if ((rc == 0) && check_cn) > > + if (check_cn) If I understand right, that's not quite equivalent (and the new tests fail if I implement it that way). We have to disable fallback if the SAN exists, whether it matches or not. > The following existing code (CN fallback) > > > rc = openssl_verify_peer_name_matches_certificate_name(conn, > > X509_NAME_ENTRY_get_data(X509_NAME_get_entry(subject_name, cn_index)), > > first_name); > > is expecting that first_name has not been set when it is visited. > However, with this patch, first_name can be set when the cert has any > SAN of unmatching type (DNS/IPADD) and the already-set name leaks. We > need to avoid that memory leak since the path can be visited multiple > times from the user-program of libpq. I came up with two directions. > > 1. Completely ignore type-unmatching entries. first_name is not set by > such entries. 
Such unmatching entreis doesn't increase > *names_examined. > > 2. Avoid overwriting first_name there. > > I like 1, but since we don't make distinction between DNS and IPADDR > in the error message emited by the caller, we would take 2? Great catch, thanks! I implemented option 2 to start. Option 1 might make things difficult to debug if you're connecting to a server by IP address but its certificate only has DNS names. Thanks! --Jacob commit 8c427e3289a28cb683eff5d05b2e8770c3c07662 Author: Jacob Champion Date: Tue Feb 8 16:26:27 2022 -0800 squash! libpq: allow IP address SANs in server certs diff --git a/src/interfaces/libpq/fe-secure-common.c b/src/interfaces/libpq/fe-secure-common.c index cfdde58e67..4d78715756 100644 --- a/src/interfaces/libpq/fe-secure-common.c +++ b/src/interfaces/libpq/fe-secure-common.c @@ -160,7 +160,8 @@ pq_verify_peer_name_matches_certificate_name(PGconn *conn, */ int pq_verify_peer_name_matches_certificate_ip(PGconn *conn, - const char *ipdata, size_
RE: [BUG]Update Toast data failure in logical replication
On Mon, Feb 7, 2022 2:55 PM Amit Kapila wrote: > > On Sat, Feb 5, 2022 at 6:10 AM Amit Kapila wrote: > > > > On Fri, Feb 4, 2022 at 9:06 PM Alvaro Herrera > > wrote: > > > > > > > > > I have some suggestions > > > on the comments and docs though. > > > > > > > Thanks, your suggestions look good to me. I'll take care of these in > > the next version. > > > > Attached please find the modified patches. > Thanks for your patch. I tried it and it works well. Two small comments: 1) +static Bitmapset *HeapDetermineColumnsInfo(Relation relation, + Bitmapset *interesting_cols, + Bitmapset *external_cols, + HeapTuple oldtup, HeapTuple newtup, + bool *id_has_external); +HeapDetermineColumnsInfo(Relation relation, +Bitmapset *interesting_cols, +Bitmapset *external_cols, +HeapTuple oldtup, HeapTuple newtup, +bool *has_external) The declaration and the definition of this function use different parameter names for the last parameter (id_has_external and has_external), it's better to be consistent. 2) + /* +* Check if the old tuple's attribute is stored externally and is a +* member of external_cols. +*/ + if (VARATT_IS_EXTERNAL((struct varlena *) DatumGetPointer(value1)) && + bms_is_member(attrnum - FirstLowInvalidHeapAttributeNumber, + external_cols)) + *has_external = true; If has_external is already true, it seems we don't need this check, so should we check has_external first? Regards, Tang
Re: GUC flags
On Tue, Feb 08, 2022 at 01:06:29PM +0900, Michael Paquier wrote: > Makes sense. check_guc also checks after this pattern. Okay, I have done all the adjustments you mentioned, added a couple of comments and applied the patch. If the buildfarm is happy, I'll go retire check_guc. -- Michael signature.asc Description: PGP signature
Re: row filtering for logical replication
I did a review of the v79 patch. Below are my review comments: == 1. doc/src/sgml/ref/create_publication.sgml - CREATE PUBLICATION The commit message for v79-0001 says: If your publication contains a partitioned table, the publication parameter publish_via_partition_root determines if it uses the partition row filter (if the parameter is false, the default) or the root partitioned table row filter. I think that the same information should also be mentioned in the PG DOCS for CREATE PUBLICATION note about the WHERE clause. ~~~ 2. src/backend/commands/publicationcmds.c - contain_mutable_or_ud_functions_checker +/* check_functions_in_node callback */ +static bool +contain_mutable_or_user_functions_checker(Oid func_id, void *context) +{ + return (func_volatile(func_id) != PROVOLATILE_IMMUTABLE || + func_id >= FirstNormalObjectId); +} I was wondering why is the checking for user function and mutable functions combined in one function like this. IMO it might be better to have 2 "checker" callback functions instead of just one - then the error messages can be split too so that only the relevant one is displayed to the user. BEFORE contain_mutable_or_user_functions_checker --> "User-defined or built-in mutable functions are not allowed." AFTER contain_user_functions_checker --> "User-defined functions are not allowed." contain_mutable_function_checker --> "Built-in mutable functions are not allowed." ~~~ 3. src/backend/commands/publicationcmds.c - check_simple_rowfilter_expr_walker + case T_Const: + case T_FuncExpr: + case T_BoolExpr: + case T_RelabelType: + case T_CollateExpr: + case T_CaseExpr: + case T_CaseTestExpr: + case T_ArrayExpr: + case T_CoalesceExpr: + case T_MinMaxExpr: + case T_XmlExpr: + case T_NullTest: + case T_BooleanTest: + case T_List: + break; Perhaps a comment should be added here simply saying "OK, supported" just to make it more obvious? ~~~ 4. src/test/regress/sql/publication.sql - test comment +-- fail - user-defined types disallowed For consistency with the nearby comments it would be better to reword this one: "fail - user-defined types are not allowed" ~~~ 5. src/test/regress/sql/publication.sql - test for \d +-- test \d+ (now it displays filter information) +SET client_min_messages = 'ERROR'; +CREATE PUBLICATION testpub_dplus_rf_yes FOR TABLE testpub_rf_tbl1 WHERE (a > 1) WITH (publish = 'insert'); +CREATE PUBLICATION testpub_dplus_rf_no FOR TABLE testpub_rf_tbl1; +RESET client_min_messages; +\d+ testpub_rf_tbl1 Actually, the \d (without "+") will also display filters but I don't think that has been tested anywhere. So suggest updating the comment and adding one more test AFTER -- test \d+ and \d (now these display filter information) ... \d+ testpub_rf_tbl1 \d testpub_rf_tbl1 ~~~ 6. 
src/test/regress/sql/publication.sql - tests for partitioned table +-- Tests for partitioned table +ALTER PUBLICATION testpub6 SET TABLE rf_tbl_abcd_part_pk WHERE (a > 99); +ALTER PUBLICATION testpub6 SET (PUBLISH_VIA_PARTITION_ROOT=0); +-- ok - partition does not have row filter +UPDATE rf_tbl_abcd_part_pk SET a = 1; +ALTER PUBLICATION testpub6 SET (PUBLISH_VIA_PARTITION_ROOT=1); +-- ok - "a" is a OK col +UPDATE rf_tbl_abcd_part_pk SET a = 1; +ALTER PUBLICATION testpub6 SET TABLE rf_tbl_abcd_part_pk WHERE (b > 99); +ALTER PUBLICATION testpub6 SET (PUBLISH_VIA_PARTITION_ROOT=0); +-- ok - partition does not have row filter +UPDATE rf_tbl_abcd_part_pk SET a = 1; +ALTER PUBLICATION testpub6 SET (PUBLISH_VIA_PARTITION_ROOT=1); +-- fail - "b" is not in REPLICA IDENTITY INDEX +UPDATE rf_tbl_abcd_part_pk SET a = 1; Those comments and the way the code is arranged did not make it very clear to me what exactly these tests are doing. I think all the changes to the publish_via_partition_root belong BELOW those test comments don't they? Also the same comment "-- ok - partition does not have row filter" appears 2 times so that can be made more clear too. e.g. IIUC it should be changed to something a bit like this (Note - I did not change the SQL, I only moved it a bit and changed the comments): AFTER (??) -- Tests for partitioned table ALTER PUBLICATION testpub6 SET TABLE rf_tbl_abcd_part_pk WHERE (a > 99); -- ok - PUBLISH_VIA_PARTITION_ROOT is false -- Here the partition does not have a row filter -- Col "a" is in replica identity. ALTER PUBLICATION testpub6 SET (PUBLISH_VIA_PARTITION_ROOT=0); UPDATE rf_tbl_abcd_part_pk SET a = 1; -- ok - PUBLISH_VIA_PARTITION_ROOT is true -- Here the partition does not have a row filter, so the root filter will be used. -- Col "a" is in replica identity. ALTER PUBLICATION testpub6 SET (PUBLISH_VIA_PARTITION_ROOT=1); UPDATE rf_tbl_abcd_part_pk SET a = 1; -- Now change the root filter to use a column "b" (which is not in the replica identity) ALTER PUBLICATION testpub6 SET TABLE rf_tbl_abcd_part_pk WHERE (b > 99); -- ok - PUBLISH_VIA_PARTITION_ROOT is false -- Here the partition does not have a row filter -- Col "a" is in replica identity. AL
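To make review comment 2 above concrete, the split could look roughly like this (a sketch based on the combined callback quoted in that comment; only the error reporting at the call sites would then differ):

#include "postgres.h"

#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "utils/lsyscache.h"

/* check_functions_in_node callback: reject user-defined functions */
static bool
contain_user_functions_checker(Oid func_id, void *context)
{
	return (func_id >= FirstNormalObjectId);
}

/* check_functions_in_node callback: reject mutable built-in functions */
static bool
contain_mutable_functions_checker(Oid func_id, void *context)
{
	return (func_volatile(func_id) != PROVOLATILE_IMMUTABLE);
}

The trade-off is that the row filter expression then has to be walked once per checker in order to report the two cases separately.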
Re: [BUG]Update Toast data failure in logical replication
On Tue, Feb 8, 2022, at 10:18 PM, tanghy.f...@fujitsu.com wrote: > 2) > + /* > + * Check if the old tuple's attribute is stored externally and is a > + * member of external_cols. > + */ > + if (VARATT_IS_EXTERNAL((struct varlena *) DatumGetPointer(value1)) && > + bms_is_member(attrnum - FirstLowInvalidHeapAttributeNumber, > + external_cols)) > + *has_external = true; > > If has_external is already true, it seems we don't need this check, so should > we > check has_external first? Is it worth it? I don't think so. It complicates a non-critical path. In general, the condition will be executed once or twice. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: [PATCH] Add min() and max() aggregate functions for xid8
On 2022/02/09 8:49, Ken Kato wrote: On 2022-02-08 23:16, Fujii Masao wrote: If you want to avoid the line longer than 80 columns, you should break it into two or more rather than remove the test code, I think. What to test is more important than formatting. Also the following descriptions about formatting would be helpful. --- https://www.postgresql.org/docs/devel/source-format.html Limit line lengths so that the code is readable in an 80-column window. (This doesn't mean that you must never go past 80 columns. For instance, breaking a long error message string in arbitrary places just to keep the code within 80 columns is probably not a net gain in readability.) --- Therefore I'm ok with the patch that I posted upthread. Also I'm ok if you will break that longer line into two and post new patch. Or if the value '010' is really useless for the test purpose, I'm also ok if you remove it. Thought? Thank you for the explanation! Even though the line is over 80 characters, it makes more sense to put in one line and it enhances readability IMO. Also, '010' is good to have since it is the only octal value in the test. Therefore, I think min_max_aggregates_for_xid8_v4.patch is the best one to go. Agreed. So barring any objection, I will commit that patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: row filtering for logical replication
On Wed, Feb 9, 2022 at 7:07 AM Peter Smith wrote: > > 2. src/backend/commands/publicationcmds.c - > contain_mutable_or_ud_functions_checker > > +/* check_functions_in_node callback */ > +static bool > +contain_mutable_or_user_functions_checker(Oid func_id, void *context) > +{ > + return (func_volatile(func_id) != PROVOLATILE_IMMUTABLE || > + func_id >= FirstNormalObjectId); > +} > > I was wondering why is the checking for user function and mutable > functions combined in one function like this. IMO it might be better > to have 2 "checker" callback functions instead of just one - then the > error messages can be split too so that only the relevant one is > displayed to the user. > For that, we need to invoke the checker function multiple times for a node and or expression. So, not sure if it is worth it. -- With Regards, Amit Kapila.
Re: is the base backup protocol used by out-of-core tools?
On Tue, Feb 08, 2022 at 11:26:41AM -0500, Robert Haas wrote: > From that point of view, there's no downside to removing from the > server the old syntax for BASE_BACKUP and the old protocol for taking > backups. We can't remove anything from pg_basebackup, because it is > our practice to make new versions of pg_basebackup work with old > versions of the server. But the reverse is not true: an older > pg_basebackup will categorically refuse to work with a newer server > version. Therefore keeping the code for this stuff around in the > server has no value ... unless there is out-of-core code that (a) uses > the BASE_BACKUP command and (b) wouldn't immediately adopt the new > syntax and protocol anyway. If there is, we might want to keep the > backward-compatibility code around in the server for a few releases. > If not, we should probably nuke that code to simplify things and > reduce the maintenance burden. This line of arguments looks sensible from here, so +1 for this cleanup in the backend as of 15~. I am not sure if we should worry about out-of-core tools that use replication commands, either, and the new grammar is easy to adapt to. FWIW, one backup tool maintained by NTT is pg_rman, which does not use the replication protocol AFAIK: https://github.com/ossc-db/pg_rman Perhaps Horiguchi-san or Fujita-san have an opinion on that. -- Michael signature.asc Description: PGP signature
Re: [BUG]Update Toast data failure in logical replication
On Wed, Feb 9, 2022 at 7:16 AM Euler Taveira wrote: > > On Tue, Feb 8, 2022, at 10:18 PM, tanghy.f...@fujitsu.com wrote: > > 2) > + /* > + * Check if the old tuple's attribute is stored externally and is a > + * member of external_cols. > + */ > + if (VARATT_IS_EXTERNAL((struct varlena *) DatumGetPointer(value1)) && > + bms_is_member(attrnum - FirstLowInvalidHeapAttributeNumber, > + external_cols)) > + *has_external = true; > > If has_external is already true, it seems we don't need this check, so should > we > check has_external first? > > Is it worth it? I don't think so. > I also don't think it is worth adding such a check. -- With Regards, Amit Kapila.
Re: make MaxBackends available in _PG_init
On Tue, Feb 08, 2022 at 04:12:26PM -0500, Robert Haas wrote: > After some investigation I've determined that it's no longer Friday > afternoon. I also spent time investigating whether the patch had > problems that would make me uncomfortable with the idea of committing > it, and I did not find any. So, I committed it. @@ -1641,8 +1642,8 @@ SignalBackends(void) * XXX in principle these pallocs could fail, which would be bad. Maybe * preallocate the arrays? They're not that large, though. */ - pids = (int32 *) palloc(MaxBackends * sizeof(int32)); - ids = (BackendId *) palloc(MaxBackends * sizeof(BackendId)); + pids = (int32 *) palloc(GetMaxBackends() * sizeof(int32)); + ids = (BackendId *) palloc(GetMaxBackends() * sizeof(BackendId)); You could have optimized this one while at it, as well as the ones in pgstat_beinit() and pg_safe_snapshot_blocking_pids(). It is not hot, but you did that for all the other callers of GetMaxBackends(). Just saying... -- Michael signature.asc Description: PGP signature
Re: is the base backup protocol used by out-of-core tools?
2022年2月9日(水) 11:21 Michael Paquier : > > On Tue, Feb 08, 2022 at 11:26:41AM -0500, Robert Haas wrote: > > From that point of view, there's no downside to removing from the > > server the old syntax for BASE_BACKUP and the old protocol for taking > > backups. We can't remove anything from pg_basebackup, because it is > > our practice to make new versions of pg_basebackup work with old > > versions of the server. But the reverse is not true: an older > > pg_basebackup will categorically refuse to work with a newer server > > version. Therefore keeping the code for this stuff around in the > > server has no value ... unless there is out-of-core code that (a) uses > > the BASE_BACKUP command and (b) wouldn't immediately adopt the new > > syntax and protocol anyway. If there is, we might want to keep the > > backward-compatibility code around in the server for a few releases. > > If not, we should probably nuke that code to simplify things and > > reduce the maintenance burden. > > This line of arguments looks sensible from here, so +1 for this > cleanup in the backend as of 15~. I am not sure if we should worry > about out-of-core tools that use replication commands, either, and the > new grammar is easy to adapt to. FWIW the out-of-core utility I have some involvement in and which uses base backup functionality, uses this by executing pg_basebackup, so no problem there. But even if it used the replication protocol directly, it'd just be a case of adding another bit of Pg version-specific code handling; I certainly wouldn't expect core to hold itself back for my convenience. Regards Ian Barwick -- EnterpriseDB: https://www.enterprisedb.com
Unnecessary call to resetPQExpBuffer in getIndexes
Hi, I just noticed that e2c52beecd (adding PeterE in Cc) added a resetPQExpBuffer() which seems unnecessary since the variable is untouched since the initial createPQExpBuffer(). Simple patch attached. >From 1ebddb696af3b77f7d373034b938a358529a9ea1 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Wed, 9 Feb 2022 10:36:07 +0800 Subject: [PATCH v1] Remove unnecessary resetPQExpBuffer call. Oversight in e2c52beecdea152ca680a22ef35c6a7da55aa30f. --- src/bin/pg_dump/pg_dump.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 3499c0a4d5..3b4b63d897 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -6524,8 +6524,6 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables) } appendPQExpBufferChar(tbloids, '}'); - resetPQExpBuffer(query); - appendPQExpBuffer(query, "SELECT t.tableoid, t.oid, i.indrelid, " "t.relname AS indexname, " -- 2.35.0
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
Hi, Bharath. At Tue, 8 Feb 2022 14:33:01 +0530, Bharath Rupireddy wrote in > On Tue, Feb 8, 2022 at 2:10 PM Kyotaro Horiguchi > wrote: > > > Thus.. the attached removes the ambiguity of of the proposed patch > > > about the LSNs in the restartpoint-ending log message. > > > > Thoughts? > > Thanks for the patch. I have few comments on the > v1-0001-Get-rid-of-unused-path-to-handle-concurrent-check.patch Thanks for checking this! > 1) Can we have this Assert right after "skipping restartpoint, already > performed at %X/%X" error message block? Does it make any difference? > My point is that if at all, we were to assert this, why can't we do it > before CheckPointGuts? > + /* We mustn't have a concurrent checkpoint that advances checkpoint LSN */ > + Assert(lastCheckPoint.redo > ControlFile->checkPointCopy.redo); No. The assertion checks whether something wrong has happened while CheckPointGuts - which may take a long time - is running. If we need to do that, it should be after CheckPointGuts. (However, I removed it in the end. See below.) > 2) Related to the above Assert, do we really need an assertion or a FATAL > error? It's just to make sure, in case that happens by any chance in the future. But on second thought, as I mentioned, CreateRestartPoint is called from a standalone process or from the checkpointer. I added that assertion at the beginning of CreateRestartPoint. I think that assertion is logically equal to the old assertion. I remember that I felt uncomfortable with the lock-less behavior on ControlFile, which makes the code a bit complex to read. So I moved the "PriorRedoPtr = " line into the lock section just below. In addition to that, I was confused by the parallel use of lastCheckPoint.redo and RedoRecPtr, so after assigning the former to the latter I replaced lastCheckPoint.redo with RedoRecPtr. > 3) Let's be consistent with "crash recovery" - replace > "archive-recovery" with "archive recovery"? > + * We have exited from archive-recovery mode after this restartpoint > + * started. Crash recovery ever after should always recover to the end That's sensible. I found several existing uses of "archive-recovery" in xlog.c and a few other files, but the fix for them is separated out as another patch (0005). > 4) Isn't it enough to say "Crash recovery should always recover to the > end of WAL."? > + * started. Crash recovery ever after should always recover to the end I think we need to explicitly say something like "we have exited archive recovery while *this* restartpoint is running". I simplified the sentence as follows. + * Archive recovery has ended. Crash recovery ever after should + * always recover to the end of WAL. > 5) Is there a reliable test case covering this code? Please point me > if the test case is shared upthread somewhere. Nothing. Looking at it from the opposite side, the existing code for the competing restartpoint/checkpoint case should not have been exercised for at least the last several major versions. Instead, I added an assertion at the beginning of CreateRestartPoint that asserts that "this should be called only by a standalone process or by the checkpointer". If that assert doesn't fire during the whole test run, that is proof that the premise of this patch is correct. > 6) So, with this patch, the v8 patch-set posted at [1] doesn't need > any changes IIUC. If that's the case, please feel free to post all the > patches together such that they get tested in cfbot. The two are different fixes, so I don't think they ought to be merged together.
> [1] - > https://www.postgresql.org/message-id/CALj2ACUtZhTb%3D2ENkF3BQ3wi137uaGi__qzvXC-qFYC0XwjALw%40mail.gmail.com As for the old 0001 (though I said it's fine :p), I added a comment that reading ControlFile->checkPoint* is safe here. The old 0002 (attached 0003) looks fine to me. The old 0003 (attached 0004): +++ b/src/backend/access/rmgrdesc/xlogdesc.c - appendStringInfo(buf, "redo %X/%X; " + appendStringInfo(buf, "redo lsn %X/%X; " It is shown in the context of a checkpoint record, so I think it is not needed and rather lengthens the dump line uselessly. +++ b/src/backend/access/transam/xlog.c - (errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X", + (errmsg("request to flush past end of generated WAL; request lsn %X/%X, current lsn %X/%X", +++ b/src/backend/replication/walsender.c - (errmsg("requested starting point %X/%X is ahead of the WAL flush position of this server %X/%X", + (errmsg("requested starting point %X/%X is ahead of the WAL flush LSN of this server %X/%X", "WAL" is upper-cased, so it seems rather strange that "lsn" is lower-cased. In the first place, the message doesn't look like a user-facing error message, and I feel we don't need "position" or "lsn" there at all. +++ b/src/bin/pg_rewin
Re: Plug minor memleak in pg_dump
On Wed, Feb 02, 2022 at 10:06:13AM +0100, Daniel Gustafsson wrote: > The leak itself is clearly not something to worry about wrt memory pressure. > We do read into tmp and free it in other places in the same function though > (as > you note above), so for code consistency alone this is worth doing IMO (and it > reduces the risk of static analyzers flagging this). > > Unless objected to I will go ahead with getting this committed. Looks like you forgot to apply that? -- Michael signature.asc Description: PGP signature
Re: [PATCH] Add min() and max() aggregate functions for xid8
At Wed, 9 Feb 2022 11:01:57 +0900, Fujii Masao wrote in > Agreed. So barring any objection, I will commit that patch. Sorry for being late, but I don't like the function names. +xid8_larger(PG_FUNCTION_ARGS) +xid8_smaller(PG_FUNCTION_ARGS) Wouldn't they conventionally be named like xid8gt and xid8lt? At the least, the names lack any mention of equality. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH v2] use has_privs_for_role for predefined roles
On Tue, Feb 8, 2022 at 7:38 PM Joe Conway wrote: > If we were to start all over again with this feature my vote would be to > do things differently than we have done. I would not have called them > predefined roles, and I would have used attributes of roles (e.g. make > rolsuper into a bitmap rather than a boolean) rather than role > membership to implement them. But I didn't find time to participate in > the original discussion or review/write the code, so I have little room > to complain. Yep, fair. I kind of like the predefined role concept myself. I find it sort of elegant, mostly because I think it scales better than a bitmask, which can run out of bits surprisingly rapidly. But opinions can vary, of course. > However since we did call these things predefined roles, and used role > membership to implement them, I think they ought to behave both > self-consistently as a group, and like other real roles. > > That is what this patch does unless I am missing something. Yes, I see that. > I guess an alternative is to discuss a "proper fix", but it seems to me > that would be a version 16 thing given how late we are in this > development cycle and how invasive it is likely to be. And doing nothing > for pg15 is not a very satisfying proposition :-/ True. The fact that we don't have consistency among existing predefined roles is IMHO the best argument for this patch. That surely can't be right. -- Robert Haas EDB: http://www.enterprisedb.com
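As a concrete illustration of the inconsistency being debated, a small sketch using made-up role names (it shows the ordinary-role semantics that the patch wants predefined roles to follow):

CREATE ROLE alice LOGIN NOINHERIT;
GRANT pg_read_all_stats TO alice;
-- With a bare role-membership check, alice can read other sessions' query
-- text in pg_stat_activity immediately, despite NOINHERIT.
-- With the has_privs_for_role style check proposed here, she must first run:
SET ROLE pg_read_all_stats;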
Re: make MaxBackends available in _PG_init
On Tue, Feb 8, 2022 at 9:38 PM Michael Paquier wrote: > On Tue, Feb 08, 2022 at 04:12:26PM -0500, Robert Haas wrote: > > After some investigation I've determined that it's no longer Friday > > afternoon. I also spent time investigating whether the patch had > > problems that would make me uncomfortable with the idea of committing > > it, and I did not find any. So, I committed it. > > @@ -1641,8 +1642,8 @@ SignalBackends(void) > * XXX in principle these pallocs could fail, which would be bad. > Maybe > * preallocate the arrays? They're not that large, though. > */ > - pids = (int32 *) palloc(MaxBackends * sizeof(int32)); > - ids = (BackendId *) palloc(MaxBackends * sizeof(BackendId)); > + pids = (int32 *) palloc(GetMaxBackends() * sizeof(int32)); > + ids = (BackendId *) palloc(GetMaxBackends() * sizeof(BackendId)); > > You could have optimized this one, while on it, as well as the ones in > pgstat_beinit() and pg_safe_snapshot_blocking_pids(). It is not hot, > but you did that for all the other callers of GetMaxBackends(). Just > saying.. Well I didn't do anything myself except review and commit Nathan's patch, so I suppose you mean he could have done that, but fair enough. I don't mind if you want to change it around. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Add min() and max() aggregate functions for xid8
At Wed, 09 Feb 2022 12:04:51 +0900 (JST), Kyotaro Horiguchi wrote in > At Wed, 9 Feb 2022 11:01:57 +0900, Fujii Masao > wrote in > > Agreed. So barring any objection, I will commit that patch. > > Sorry for being late, but I don't like the function names. > > +xid8_larger(PG_FUNCTION_ARGS) > +xid8_smaller(PG_FUNCTION_ARGS) > > Aren't they named like xid8gt and xid8lt conventionally? At least the > name lacks a mention to equality. Ah, sorry. The functions return the larger/smaller one of their parameters. Sorry for the noise. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Tue, Feb 8, 2022 at 10:39 PM Robert Haas wrote: > > I have been tied up with other things for a bit now and have not had > time to look at this thread; sorry about that. I have a little more > time available now so I thought I would take a look at this again and > see where things stand. Thanks for looking into this. > > Sadly, it doesn't appear to me that anyone has done any performance > testing of this patch, along the lines suggested above or otherwise, > and I think it's a crucial question for the patch. Yeah, some performance testing actually started, as shared by Ashutosh [1], and soon after that we got sidetracked by another issue [2] which we thought had to be fixed before we proceed with this feature. > My reading of this > thread is that nobody really likes the idea of maintaining two methods > for performing CREATE DATABASE, but nobody wants to hose people who > are using it to clone large databases, either. To some extent those > things are inexorably in conflict. If we postulate that the 10TB > template database is on a local RAID array with 40 spindles, while > pg_wal is on an iSCSI volume that we access via a 128kB ISDN link, > then the new system is going to be infinitely worse. But real > situations aren't likely to be that bad, and it would be useful in my > opinion to have an idea how bad they actually are. Yeah, that makes sense. I will work on performance testing along these lines and also on the previous ideas you suggested. > I'm somewhat inclined to propose that we keep the existing method > around along with the new method. Even though nobody really likes > that, we don't necessarily have to maintain both methods forever. If, > say, we use the new method by default in all cases, but add an option > to get the old method back if you need it, we could leave it that way > for a few years and then propose removing the old method (and the > switch to activate it) and see if anyone complains. That way, if the > new method turns out to suck in certain cases, users have a way out. > However, I still think doing some performance testing would be a > really good idea. It's not a great plan to make decisions about this > kind of thing in an information vacuum. Yeah, that makes sense to me. Now, one bigger question is whether we can proceed with this patch without fixing [2]. If we decide to keep the old method intact as well, one option could be to change only CREATE DATABASE for now so that it supports both the old and new ways of creating a database, and for the time being leave ALTER DATABASE SET TABLESPACE alone so that it works only with the old method. Another option is to first fix the issue related to the tombstone files and then come back to this. IMHO, the first option could be better, given that we have already made more progress on this patch and it is in better shape than the other patch we are trying to write for removing the tombstone files. [1]https://www.postgresql.org/message-id/CAE9k0Pkg20tHq8oiJ%2BxXa9%3Daf3QZCSYTw99aBaPthA1UMKhnTg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
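To make the "keep the old method behind a switch" idea concrete, a purely hypothetical sketch (the option name and values below are invented for illustration, not committed syntax):

-- new WAL-logged strategy used by default
CREATE DATABASE clonedb TEMPLATE bigdb;
-- hypothetical escape hatch back to the old checkpoint-based file copy
CREATE DATABASE clonedb TEMPLATE bigdb STRATEGY = FILE_COPY;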
Re: Plug minor memleak in pg_dump
On Wed, Feb 9, 2022 at 8:26 AM Michael Paquier wrote: > > On Wed, Feb 02, 2022 at 10:06:13AM +0100, Daniel Gustafsson wrote: > > The leak itself is clearly not something to worry about wrt memory pressure. > > We do read into tmp and free it in other places in the same function though > > (as > > you note above), so for code consistency alone this is worth doing IMO (and > > it > > reduces the risk of static analyzers flagging this). > > > > Unless objected to I will go ahead with getting this committed. > > Looks like you forgot to apply that? Attaching the patch that I suggested above, also the original patch proposed by Georgios is at [1], leaving the decision to the committer to pick up the best one. [1] https://www.postgresql.org/message-id/oZwKiUxFsVaetG2xOJp7Hwao8F1AKIdfFDQLNJrnwoaxmjyB-45r_aYmhgXHKLcMI3GT24m9L6HafSi2ns7WFxXe0mw2_tIJpD-Z3vb_eyI%3D%40pm.me Regards, Bharath Rupireddy. From 335db3331a99a6cfc35608cdbec204d843b8ac55 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 9 Feb 2022 03:25:50 + Subject: [PATCH v1] Fix a memory leak while reading Table of Contents ReadStr() returns a malloc'ed pointer. Using it directly in a function call results in a memleak. Rewrite to use a temporary buffer which is then freed. --- src/bin/pg_dump/pg_backup_archiver.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 49bf0907cd..e62be78982 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -2500,6 +2500,8 @@ ReadToc(ArchiveHandle *AH) for (i = 0; i < AH->tocCount; i++) { + bool is_supported = true; + te = (TocEntry *) pg_malloc0(sizeof(TocEntry)); te->dumpId = ReadInt(AH); @@ -2574,7 +2576,20 @@ ReadToc(ArchiveHandle *AH) te->tableam = ReadStr(AH); te->owner = ReadStr(AH); - if (AH->version < K_VERS_1_9 || strcmp(ReadStr(AH), "true") == 0) + + if (AH->version < K_VERS_1_9) + is_supported = false; + else + { + tmp = ReadStr(AH); + + if (strcmp(tmp, "true") == 0) +is_supported = false; + + pg_free(tmp); + } + + if (!is_supported) pg_log_warning("restoring tables WITH OIDS is not supported anymore"); /* Read TOC entry dependencies */ -- 2.25.1
Re: make MaxBackends available in _PG_init
On Tue, Feb 08, 2022 at 10:57:37PM -0500, Robert Haas wrote: > Well I didn't do anything myself except review and commit Nathan's > patch, so I suppose you mean he could have done that, but fair enough. > I don't mind if you want to change it around. Okay, I'd rather apply the same rule everywhere for consistency, then, like in the attached. That's minimal, still. -- Michael diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index d44001a49f..455d895a44 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -1633,6 +1633,7 @@ SignalBackends(void) int32 *pids; BackendId *ids; int count; + int max_backends = GetMaxBackends(); /* * Identify backends that we need to signal. We don't want to send @@ -1642,8 +1643,8 @@ SignalBackends(void) * XXX in principle these pallocs could fail, which would be bad. Maybe * preallocate the arrays? They're not that large, though. */ - pids = (int32 *) palloc(GetMaxBackends() * sizeof(int32)); - ids = (BackendId *) palloc(GetMaxBackends() * sizeof(BackendId)); + pids = (int32 *) palloc(max_backends * sizeof(int32)); + ids = (BackendId *) palloc(max_backends * sizeof(BackendId)); count = 0; LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE); diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 1528d788d0..ee2e15c17e 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -2924,6 +2924,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp) LWLock *partitionLock; int count = 0; int fast_count = 0; + int max_backends = GetMaxBackends(); if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods)) elog(ERROR, "unrecognized lock method: %d", lockmethodid); @@ -2942,12 +2943,12 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp) vxids = (VirtualTransactionId *) MemoryContextAlloc(TopMemoryContext, sizeof(VirtualTransactionId) * - (GetMaxBackends() + max_prepared_xacts + 1)); + (max_backends + max_prepared_xacts + 1)); } else vxids = (VirtualTransactionId *) palloc0(sizeof(VirtualTransactionId) * - (GetMaxBackends() + max_prepared_xacts + 1)); + (max_backends + max_prepared_xacts + 1)); /* Compute hash code and partition lock, and look up conflicting modes. 
*/ hashcode = LockTagHashCode(locktag); @@ -3104,7 +3105,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp) LWLockRelease(partitionLock); - if (count > GetMaxBackends() + max_prepared_xacts) /* should never happen */ + if (count > max_backends + max_prepared_xacts) /* should never happen */ elog(PANIC, "too many conflicting locks found"); vxids[count].backendId = InvalidBackendId; diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c index 4e517b28e1..944cd6df03 100644 --- a/src/backend/utils/adt/lockfuncs.c +++ b/src/backend/utils/adt/lockfuncs.c @@ -559,13 +559,14 @@ pg_safe_snapshot_blocking_pids(PG_FUNCTION_ARGS) int *blockers; int num_blockers; Datum *blocker_datums; + int max_backends = GetMaxBackends(); /* A buffer big enough for any possible blocker list without truncation */ - blockers = (int *) palloc(GetMaxBackends() * sizeof(int)); + blockers = (int *) palloc(max_backends * sizeof(int)); /* Collect a snapshot of processes waited for by GetSafeSnapshot */ num_blockers = - GetSafeSnapshotBlockingPids(blocked_pid, blockers, GetMaxBackends()); + GetSafeSnapshotBlockingPids(blocked_pid, blockers, max_backends); /* Convert int array to Datum array */ if (num_blockers > 0) signature.asc Description: PGP signature
RE: [BUG]Update Toast data failure in logical replication
On Tue, Feb 8, 2022 3:18 AM Andres Freund wrote: > > On 2022-02-07 08:44:00 +0530, Amit Kapila wrote: > > Right, and it is getting changed. We are just printing the first 200 > > characters (by using SQL [1]) from the decoded tuple so what is shown > > in the results is the initial 200 bytes. > > Ah, I knew I must have been missing something. > > > > The complete decoded data after the patch is as follows: > > Hm. I think we should change the way the strings are shortened - otherwise we > don't really verify much in that test. Perhaps we could just replace the long > repetitive strings with something shorter in the output? > > E.g. using something like regexp_replace(data, > '(1234567890|9876543210){200}', '\1{200}','g') > inside the substr(). > > Wonder if we should deduplicate the number of different toasted strings in the > file to something that'd allow us to have a single "redact_toast" function or > such. There's too many different ones to have a reasonbly simple redaction > function right now. But that's perhaps better done separately. > I tried to make the output shorter using your suggestion, with SQL like the following; please see the attached patch, which is based on the v8 patch [1]. SELECT substr(regexp_replace(data, '(1234567890|9876543210){200}', '\1{200}','g'), 1, 200) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); Note that some strings are still longer than 200 characters even after being shortened, so they can't be shown entirely, e.g. table public.toasted_key: UPDATE: old-key: toasted_key[text]:'1234567890{200}' new-tuple: id[integer]:1 toasted_key[text]:unchanged-toast-datum toasted_col1[text]:unchanged-toast-datum toasted_col2[te The entire string is: table public.toasted_key: UPDATE: old-key: toasted_key[text]:'1234567890{200}' new-tuple: id[integer]:1 toasted_key[text]:unchanged-toast-datum toasted_col1[text]:unchanged-toast-datum toasted_col2[text]:'9876543210{200}' Maybe it's better to change the substr length to 250 to show the entire string, or we can do it as a separate HEAD-only improvement where we deduplicate some of the other long strings as well. Thoughts? [1] https://www.postgresql.org/message-id/CAA4eK1L_Z_2LDwMNbGrwoO%2BFc-2Q04YORQSA9UfGUTMQpy2O1Q%40mail.gmail.com Regards, Tang improve_toast_test.diff Description: improve_toast_test.diff
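As a rough sketch of the single "redact_toast" helper Andres mentions (hypothetical, only to illustrate the idea; the pattern below covers just the two strings used in this test):

CREATE FUNCTION redact_toast(data text) RETURNS text AS $$
  -- collapse the long repetitive test strings into a short marker
  SELECT regexp_replace(data, '(1234567890|9876543210){200}', '\1{200}', 'g')
$$ LANGUAGE sql IMMUTABLE;

SELECT substr(redact_toast(data), 1, 250)
FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
  'include-xids', '0', 'skip-empty-xacts', '1');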
Re: decoupling table and index vacuum
On Tue, Feb 8, 2022 at 10:42 PM Peter Geoghegan wrote: > > On Sun, Feb 6, 2022 at 11:25 PM Dilip Kumar wrote: > > > One thing we could try doing in order to make that easier would be: > > > tweak things so that when autovacuum vacuums the table, it only > > > vacuums the indexes if they meet some threshold for bloat. I'm not > > > sure exactly what happens with the heap vacuuming then - do we do > > > phases 1 and 2 always, or a combined heap pass, or what? But if we > > > pick some criteria that vacuums indexes sometimes and not other times, > > > we can probably start doing some meaningful measurement of whether > > > this patch is making bloat better or worse, and whether it's using > > > fewer or more resources to do it. > > > > I think we can always trigger phase 1 and 2 and phase 2 will only > > vacuum conditionally based on if all the indexes are vacuumed for some > > conveyor belt pages so we don't have risk of scanning without marking > > anything unused. > > Not sure what you mean about a risk of scanning without marking any > LP_DEAD items as LP_UNUSED. I mean, for testing purposes, we could integrate with autovacuum such that 1) we always do the first pass of the vacuum, 2) index vacuum is done only for the indexes which have bloated more than some threshold, and then 3) we always trigger the heap vacuum second pass. So my point was that even if autovacuum triggers the second vacuum pass every time, it will not do anything unless all the indexes have been vacuumed. > If VACUUM always does some amount of this, > then it follows that the new mechanism added by the patch just can't > safely avoid any work at all, making it all pointless. We have to > expect heap vacuuming to take place much less often with the patch. > Simply because that's what the invariant described in comments above > lazy_scan_heap() requires. In the second pass we make sure that we don't mark any LP_DEAD item as LP_UNUSED for which index vacuuming has not been done. Basically, we store dead items in the conveyor belt, and whenever we do an index pass we remember up to which conveyor belt page that index has been vacuumed. Before starting the second heap pass we find the minimum conveyor belt page up to which all the indexes have been vacuumed. > Note that this is not the same thing as saying that we do less > *absolute* heap vacuuming with the conveyor belt -- my statement about > less heap vacuuming taking place is *only* true relative to the amount > of other work that happens in any individual "shortened" VACUUM > operation. We could do exactly the same total amount of heap vacuuming > as before (in a version of Postgres without the conveyor belt but with > the same settings), but much *more* index vacuuming (at least for one > or two problematic indexes). > > > And we can try to measure with other approaches as > > well where we completely avoid phase 2 and it will be done only along > > with phase 1 whenever applicable. > > I believe that the main benefit of the dead TID conveyor belt (outside > of global index use cases) will be to enable us to do more (much more) > index vacuuming for one index in particular. So it's not really about > doing less index vacuuming or less heap vacuuming -- it's about doing > a *greater* amount of *useful* index vacuuming, in less time. There is > often some way in which failing to vacuum one index for a long time > does lasting damage to the index structure. I agree with the point. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: decoupling table and index vacuum
On Wed, Feb 9, 2022 at 1:21 AM Peter Geoghegan wrote: > > The btree side of this shouldn't care at all about dead tuples (in > general we focus way too much on dead tuples, and way too little on > pages). With bottom-up index deletion the number of dead tuples in the > index is just about completely irrelevant. It's entirely possible and > often even likely that 20%+ of all index tuples will be dead at any > one time, when the optimization perfectly preserves the index > structure. > > The btree side of the index AM API should be focussing on the growth > in index size, relative to some expectation (like maybe the growth for > whatever index on the same table has grown the least since last time, > accounting for obvious special cases like partial indexes). Perhaps > we'd give some consideration to bulk deletes, too. Overall, it should > be pretty simple, and should sometimes force us to do one of these > "dynamic mini vacuums" of the index just because we're not quite sure > what to do. There is nothing wrong with admitting the uncertainty. I agree with the point that we should be focusing more on index size growth than on dead tuples. But I don't think that we can completely ignore the number of dead tuples. Although we have bottom-up index deletion, whether the index structure will be preserved or not depends upon what keys we insert next. So, for example, if there are 80% dead tuples but so far the index size is fine, can we avoid vacuuming? If we avoid vacuuming, it is very much possible that in some cases we will create huge bloat, e.g. if we are inserting keys which cannot take advantage of bottom-up deletion. So IMHO the decision should be based on a combination of index size bloat and the percentage of dead tuples. Maybe we can give more weight to the size bloat and less to the dead-tuple percentage, but we should not completely ignore it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Make mesage at end-of-recovery less scary.
Hi, Ashutosh. At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma wrote in > Here are some of my review comments on the v11 patch: Thank you for taking a look at this. > - (errmsg_internal("reached end of WAL in > pg_wal, entering archive recovery"))); > + (errmsg_internal("reached end of WAL at %X/%X > on timeline %u in %s during crash recovery, entering archive > recovery", > +LSN_FORMAT_ARGS(ErrRecPtr), > +replayTLI, > +xlogSourceNames[currentSource]))); > > Why crash recovery? Won't this message get printed even during PITR? It is in the if-block with the following condition. >* If archive recovery was requested, but we were still doing >* crash recovery, switch to archive recovery and retry using the >* offline archive. We have now replayed all the valid WAL in >* pg_wal, so we are presumably now consistent. ... >if (!InArchiveRecovery && ArchiveRecoveryRequested) This means archive recovery has been requested but has not started yet. That is, we've just finished crash recovery. The existing comment, cited above, mentions that. At the end of PITR (or archive recovery), the other code works. > /* > * If we haven't emit an error message, we have safely reached the > * end-of-WAL. > */ > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) > { > char *fmt; > > if (StandbyMode) > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u > in %s during standby mode"); > else if (InArchiveRecovery) > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u > in %s during archive recovery"); > else > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u > in %s during crash recovery"); The last of the above messages is chosen when archive recovery is not requested at all. > I just did a PITR and could see these messages in the logfile. Yeah, the log lines describe that the server is starting with crash recovery in order to run PITR. > 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time > recovery to WAL location (LSN) "0/5227790" > 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not > properly shut down; automatic recovery in progress Well, I guess "automatic recovery" is ambiguous. Would it make more sense if the second line were like the following instead? + 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress > 2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8 > 2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at > 0/3D0 on timeline 1 in pg_wal during crash recovery, entering > archive recovery (I don't include this change in this patch since it would be another issue.) > == > > + /* > +* If we haven't emit an error message, we have safely reached the > +* end-of-WAL. > +*/ > + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) > + { > + char *fmt; > + > + if (StandbyMode) > + fmt = gettext_noop("reached end of WAL at %X/%X on > timeline %u in %s during standby mode"); > + else if (InArchiveRecovery) > + fmt = gettext_noop("reached end of WAL at %X/%X on > timeline %u in %s during archive recovery"); > + else > + fmt = gettext_noop("reached end of WAL at %X/%X on > timeline %u in %s during crash recovery"); > + > + ereport(LOG, > + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, > + xlogSourceNames[currentSource]), > +(errormsg ? errdetail_internal("%s", errormsg) : > 0))); > + } > > Doesn't it make sense to add an assert statement inside this if-block > that will check for xlogreader->EndOfWAL? Good point.
On second thought, the condition there is flat wrong. The message is "reached end of WAL" so the condition should be EndOfWAL. On the other hand we didn't make sure that the error message for the stop is emitted anywhere. Thus I don't particularly want to be strict on that point. I made the following change for this. - if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + if (xlogreader->EndOfWAL) > == > > -* We only end up here without a message when XLogPageRead() > -* failed - in that case we already logged something. In > -* StandbyMode that only happens if we have been triggered, so we > -* shouldn't loop anymore in that case. > +* If we get here for other than end-of-wal, emit the error > +* message right now. Otherwise the message if any is shown as a >
Re: Support escape sequence for cluster_name in postgres_fdw.application_name
Sorry for missing this. At Thu, 27 Jan 2022 19:26:39 +0900, Fujii Masao wrote in > > On 2022/01/27 17:10, Kyotaro Horiguchi wrote: > > I don't object to adding more meaningful replacements, but more escape > > sequence makes me anxious about the increased easiness of exceeding > > the size limit of application_name. > > If this is really an issue, it might be time to reconsider the size > limit of application_name. If it's considered too short, the patch > that enlarges it should be proposed separately. That makes sense. > > Considering that it is used to > > identify fdw-initinator server, we might need to add padding (or > > rather truncating) option in the escape sequence syntax, then warn > > about truncated application_names for safety. > > I failed to understand this. Could you tell me why we might need to > add padding option here? My point was the "truncating" option, which would limit the length of the replacement string. But expanding the application_name limit is more sensible. > > Is the reason for 'C' in upper-case to avoid possible conflict with > > 'c' of log_line_prefix? > > Yes. > > > I'm not sure that preventive measure is worth > > doing. Looking the escape-sequence spec alone, it seems to me rather > > strange that an upper-case letter is used in spite of its lower-case > > is not used yet. > > I have no strong opinion about using %C. If there is better character > for the escape sequence, I'm happy to use it. So what character is > more proper? %c? I think so. > > Otherwise all looks fine to me except the lack of documentation. > > The patch updated postgres-fdw.sgml, but you imply there are other > documents that the patch should update? Could you tell me where the > patch should update? Mmm. I must have missed that part. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
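For context, a small sketch of how such an escape would be used, assuming the cluster-name escape ends up spelled %C as in the current patch and that a user-name escape %u is also available (both are assumptions; the exact letters are still being discussed above):

-- on the local server, before accessing postgres_fdw foreign tables
SET postgres_fdw.application_name TO 'fdw_%C_%u';
-- with cluster_name = 'pgprimary' and local user alice, the remote session
-- would then appear in pg_stat_activity with application_name 'fdw_pgprimary_alice'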