Re: New docs chapter on Transaction Management and related changes
On Mon, 2022-11-07 at 22:43 -0500, Bruce Momjian wrote:
> > I thought some more about the patch, and I don't like the title
> > "Transaction Management" for the new chapter. I'd expect some more
> > from a chapter titled "Internals" / "Transaction Management".
> >
> > In reality, the new chapter is about transaction IDs. So perhaps the
> > name should reflect that, so that it does not mislead the reader.
>
> I renamed it to "Transaction Processing" since we also cover locking and
> subtransactions. How is that?

It is better.

Did you take my suggestions from [1] into account in your latest
cumulative patch in [2]? Otherwise, it will be difficult to integrate both.

Yours,
Laurenz Albe

[1]: https://postgr.es/m/3603e6e85544daa5300c7106c31bc52673711cd0.camel%40cybertec.at
[2]: https://postgr.es/m/Y2nP04/3BHQOviVB%40momjian.us
Re: security_context_t marked as deprecated in libselinux 3.1
On 2022-Nov-09, Michael Paquier wrote: > I got to look at that, now that the minor releases have been tagged, > and the change has no conflicts down to 9.3. 9.2 needed a slight > tweak, though, but it seemed fine as well. (Tested the build on all > branches.) So done all the way down, sticking with the new no-warning > policy for all the buildable branches. Thank you :-) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Reviving lost replication slots
I don't think walsenders fetching segments from the archive is totally
stupid. With that feature, we can use fast and expensive but small storage
for pg_wal, while keeping replication from dying even in an emergency.

At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi wrote in
> > If it's a streaming replication slot, the standby will anyway jump to
> > archive mode ignoring the replication slot and the slot will never be
> > usable again unless somebody creates a new replication slot and
> > provides it to the standby for reuse.
> > If it's a logical replication slot, the subscriber will start to
> > diverge from the publisher and the slot will have to be revived
> > manually i.e. created again.
> >
>
> Physical slots can be revived with the standby downloading the WAL from the
> archive directly. This patch is helpful for the logical slots.

However, supposing that WalSndSegmentOpen() fetches segments from the
archive as the fallback and that succeeds, the slot can survive missing WAL
in pg_wal in the first place. So this patch doesn't seem to be needed for
the purpose.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: thinko in basic_archive.c
On Tue, Nov 8, 2022 at 3:18 AM Nathan Bossart wrote: > > The call to snprintf() should take care of adding a terminating null byte > in the right place. Ah, my bad. MemSet is avoided in v5 patch setting only the first byte. > > So, IIUC, your point here is what if the copy_file fails to create the > > temp file when it already exists. With the name collision being a rare > > scenario, given the pid and timestamp variables, I'm not sure if > > copy_file can ever fail because the temp file already exists (with > > errno EEXIST). However, if we want to be extra-cautious, checking if > > temp file exists with file_exists() before calling copy_file() might > > help avoid such cases. If we don't want to have extra system call (via > > file_exists()) to check the temp file existence, we can think of > > sending a flag to copy_file(src, dst, &is_dst_file_created) and use > > is_dst_file_created in the shutdown callback. Thoughts? > > Presently, if copy_file() encounters a pre-existing file, it should ERROR, > which will be caught in the sigsetjmp() block in basic_archive_file(). The > shutdown callback shouldn't run in this scenario. Determining the "file already exists" error/EEXIST case from a bunch of other errors in copy_file() is tricky. However, I quickly hacked up copy_file() by adding elevel parameter, please see the attached v5-0001. > I think this cleanup logic should run in both the shutdown callback and the > sigsetjmp() block, but it should only take action (i.e., deleting the > leftover temporary file) if the ERROR or shutdown occurs after creating the > file in copy_file() and before renaming the temporary file to its final > destination. Please see attached v5 patch set. If the direction seems okay, I'll add a CF entry. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From cd805d6d26e66dcfc51ca6632a0147827a5b9544 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 9 Nov 2022 08:29:19 + Subject: [PATCH v5] Refactor copy_file() to remove leftover temp files in basic_archive --- contrib/basic_archive/basic_archive.c | 23 ++- src/backend/storage/file/copydir.c| 56 --- src/backend/storage/file/reinit.c | 2 +- src/include/storage/copydir.h | 2 +- 4 files changed, 66 insertions(+), 17 deletions(-) diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index 9f221816bb..1b9f48eeed 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -275,14 +275,33 @@ basic_archive_file_internal(const char *file, const char *path) * Copy the file to its temporary destination. Note that this will fail * if temp already exists. */ - copy_file(unconstify(char *, path), temp); + if (copy_file(unconstify(char *, path), temp, LOG) != 0) + { + /* Remove the leftover temporary file. */ + if (errno != EEXIST) + unlink(temp); + + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not copy file \"%s\" to temporary file \"%s\": %m", + path, temp))); + } /* * Sync the temporary file to disk and move it to its final destination. * Note that this will overwrite any existing file, but this is only * possible if someone else created the file since the stat() above. */ - (void) durable_rename(temp, destination, ERROR); + if (durable_rename(temp, destination, LOG) != 0) + { + /* Remove the leftover temporary file. 
*/ + unlink(temp); + + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not rename temporary file \"%s\" to \"%s\": %m", + temp, destination))); + } ereport(DEBUG1, (errmsg("archived \"%s\" via basic_archive", file))); diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 0cea4cbc89..ee20c83949 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -71,7 +71,7 @@ copydir(char *fromdir, char *todir, bool recurse) copydir(fromfile, tofile, true); } else if (xlde_type == PGFILETYPE_REG) - copy_file(fromfile, tofile); + copy_file(fromfile, tofile, ERROR); } FreeDir(xldir); @@ -113,8 +113,8 @@ copydir(char *fromdir, char *todir, bool recurse) /* * copy one file */ -void -copy_file(char *fromfile, char *tofile) +int +copy_file(char *fromfile, char *tofile, int elevel) { char *buffer; int srcfd; @@ -138,28 +138,35 @@ copy_file(char *fromfile, char *tofile) #define FLUSH_DISTANCE (1024 * 1024) #endif - /* Use palloc to ensure we get a maxaligned buffer */ - buffer = palloc(COPY_BUF_SIZE); - /* * Open the files */ srcfd = OpenTransientFile(fromfile, O_RDONLY | PG_BINARY); if (srcfd < 0) - ereport(ERROR, + { + ereport(elevel, (errcode_for_file_access(), errmsg("could not open file \"%s\": %m", fromfile))); + return -1; + } dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG
Re: Reviving lost replication slots
On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi wrote:
>
> I don't think walsenders fetching segments from the archive is totally
> stupid. With that feature, we can use fast and expensive but small
> storage for pg_wal, while keeping replication from dying even in an
> emergency.

It seems like a useful feature to have at least as an option, and it saves
a lot of work - failovers, expensive rebuilds of standbys/subscribers,
manual interventions etc.

If you're saying that even the walsenders serving logical replication
subscribers would go fetch from the archive location for the removed WAL
files, it mandates enabling archiving on the subscribers. And we know that
archiving is not cheap and has its own advantages and disadvantages, so the
feature may or may not help.

If you're saying that only the walsenders serving streaming replication
standbys would go fetch from the archive location for the removed WAL
files, it's easy to implement, however it is not a complete feature and
doesn't solve the problem for logical replication.

With the feature, it'll be something like 'you, as primary/publisher,
archive the WAL files and when you don't have them, you'll restore them'.
It may not sound elegant, however, it can solve the lost replication slots
problem. And, the cost of restoring WAL files from the archive might
further slow down the replication, thus increasing the replication lag.
And, one needs to think about how many such WAL files are restored and
kept, whether they'll be kept in pg_wal or some other directory, and how
disk-full conditions, fetching too-old or too many WAL files for
replication slots lagging far behind, removal of unnecessary WAL files etc.
will be handled. I'm not sure about other implications at this point in
time.

Perhaps, implementing this feature as a core/external extension by
introducing segment_open() or other necessary hooks might be worth it. If
implemented in some way, I think the scope of the replication slot
invalidation/max_slot_wal_keep_size feature gets reduced or it can be
removed completely, no?

> However, supposing that WalSndSegmentOpen() fetches segments from
> archive as the fallback and that succeeds, the slot can survive
> missing WAL in pg_wal in the first place. So this patch doesn't seem
> to be needed for the purpose.

That is a simple solution one can think of and provide for streaming
replication standbys, however, is it worth implementing it in the core as
explained above?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
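[Editor's note: to make the WalSndSegmentOpen() fallback being discussed
above concrete, here is a rough, untested sketch of what an archive
fallback in the walsender's segment-open callback might look like. It
glosses over real problems the thread mentions (RestoreArchivedFile()
today assumes archive recovery, the restored segment's lifetime/cleanup,
timeline handling), so treat it purely as an illustration of the shape of
the change, not as a workable patch. The function name is invented.]

/*
 * Illustrative sketch only.  Loosely modelled on WalSndSegmentOpen() in
 * walsender.c; the archive-fallback branch is hypothetical.
 */
static void
WalSndSegmentOpenWithArchiveFallback(XLogReaderState *state,
                                     XLogSegNo nextSegNo, TimeLineID *tli_p)
{
    char        path[MAXPGPATH];
    char        xlogfname[MAXFNAMELEN];

    XLogFileName(xlogfname, *tli_p, nextSegNo, state->segcxt.ws_segsize);
    XLogFilePath(path, *tli_p, nextSegNo, state->segcxt.ws_segsize);

    state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
    if (state->seg.ws_file >= 0)
        return;

    if (errno == ENOENT)
    {
        /*
         * Segment already recycled/removed from pg_wal: try the archive.
         * RestoreArchivedFile() overwrites "path" with the location of the
         * restored file on success.  (Today it also insists on being in
         * archive recovery, which is one of the things a real patch would
         * have to address.)
         */
        if (RestoreArchivedFile(path, xlogfname, "RECOVERYXLOG",
                                state->segcxt.ws_segsize, false))
            state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
    }

    if (state->seg.ws_file < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("requested WAL segment %s has already been removed",
                        xlogfname)));
}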
Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
On Sun, 06 Nov 2022 12:54:17 -0500 Tom Lane wrote: > Yugo NAGATA writes: > >> The attached patch tries to add comments explaining it on the functions. > > > I forward it to the hackers list because the patch is to fix comments. > > What do you think of the attached wording? It looks good to me. That describes the expected behaviour exactly. > I don't think the pipeline angle is of concern to anyone who might be > reading these comments with the aim of understanding what guarantees > they have. Perhaps there should be more about that in the user-facing > docs, though. I agree with that we don't need to mention pipelining in these comments, and that we need more in the documentation. I attached a doc patch to add a mention of commands that do internal commit to the pipelining section. Also, this adds a reference for the pipelining protocol to the libpq doc. Regards, Yugo Nagata -- Yugo NAGATA diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 3c9bd3d673..727a024e60 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -5058,7 +5058,8 @@ int PQflush(PGconn *conn); While the pipeline API was introduced in PostgreSQL 14, it is a client-side feature which doesn't require special server support and works on any server - that supports the v3 extended query protocol. + that supports the v3 extended query protocol; see . + diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 5fdd429e05..2edd42d7e9 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -1095,6 +1095,9 @@ SELCT 1/0; that cannot be executed inside a transaction block. If one of these is executed in a pipeline, it will, upon success, force an immediate commit to preserve database consistency. +In addition, execution one of some commands (such as +VACUUM ANALYZE) that start/commit transactions +internally also can cause an immediate commit even if it fails. A Sync immediately following one of these has no effect except to respond with ReadyForQuery.
Re: Reviving lost replication slots
On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy wrote: > > On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi > wrote: > > > > I don't think walsenders fetching segment from archive is totally > > stupid. With that feature, we can use fast and expensive but small > > storage for pg_wal, while avoiding replciation from dying even in > > emergency. > > It seems like a useful feature to have at least as an option and it > saves a lot of work - failovers, expensive rebuilds of > standbys/subscribers, manual interventions etc. > > If you're saying that even the walsedners serving logical replication > subscribers would go fetch from the archive location for the removed > WAL files, it mandates enabling archiving on the subscribers. > Why archiving on subscribers is required? Won't it be sufficient if that is enabled on the publisher where we have walsender? -- With Regards, Amit Kapila.
Re: heavily contended lwlocks with long wait queues scale badly
On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy wrote:
>
> > Updated patch attached.
>
> Thanks. It looks good to me. However, some minor comments on the v3 patch:
>
> 1.
> -    if (MyProc->lwWaiting)
> +    if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
>          elog(PANIC, "queueing for lock while waiting on another one");
>
> Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
> MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
>
> Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
> MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
> LW_WS_WAITING?
>
> 2.
> /* Awaken any waiters I removed from the queue. */
> proclist_foreach_modify(iter, &wakeup, lwWaitLink)
> {
>
> @@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
>      * another lock.
>      */
>     pg_write_barrier();
> -   waiter->lwWaiting = false;
> +   waiter->lwWaiting = LW_WS_NOT_WAITING;
>     PGSemaphoreUnlock(waiter->sem);
> }
>
> /*
>  * Awaken any waiters I removed from the queue.
>  */
> proclist_foreach_modify(iter, &wakeup, lwWaitLink)
> {
>     PGPROC *waiter = GetPGProcByNumber(iter.cur);
>
>     proclist_delete(&wakeup, iter.cur, lwWaitLink);
>     /* check comment in LWLockWakeup() about this barrier */
>     pg_write_barrier();
>     waiter->lwWaiting = LW_WS_NOT_WAITING;
>
> Can we add an assertion Assert(waiter->lwWaiting ==
> LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
> list and set the LW_WS_NOT_WAITING flag in above loops, having an
> assertion is better here IMO.
>
> Below are test results with v3 patch. +1 for back-patching it.
>
> Clients    HEAD      PATCHED
> 1          34142     34289
> 2          72760     69720
> 4          136300    131848
> 8          210809    210192
> 16         240718    242744
> 32         297587    297354
> 64         341939    343036
> 128        383615    383801
> 256        342094    337680
> 512        263194    288629
> 768        145526    261553
> 1024       107267    241811
> 2048       35716     188389
> 4096       12415     120300
>
> Clients    PG15      PATCHED
> 1          34503     34078
> 2          73708     72054
> 4          139415    133321
> 8          212396    211390
> 16         242227    242584
> 32         303441    309288
> 64         362680    339211
> 128        378645    344291
> 256        340016    344291
> 512        290044    293337
> 768        140277    264618
> 1024       96191     247636
> 2048       35158     181488
> 4096       12164     118610

I looked at the v3 patch again today and ran some performance tests.
The results look impressive as they were earlier. Andres, any plans to
get this in?

pgbench with SELECT txid_current();:

Clients    HEAD      PATCHED
1          34613     33611
2          72634     70546
4          137885    136911
8          216470    216076
16         242535    245392
32         299952    304740
64         329788    347401
128        378296    386873
256        344939    343832
512        292196    295839
768        144212    260102
1024       101525    250263
2048       35594     185878
4096       11842     104227

pgbench with insert into pgbench_accounts table:

Clients    HEAD      PATCHED
1          1660      1600
2          1848      1746
4          3547      3395
8          7330      6754
16         13103     13613
32         26011     26372
64         52331     52594
128        93313     95526
256        127373    126182
512        126712    127857
768        116765    119227
1024       111464    112499
2048       58838     92756
4096       26066     60543

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
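[Editor's note: the review hunks quoted above reference a three-state wait
flag replacing the old boolean lwWaiting. A minimal sketch of that enum,
reconstructed from the constants that appear in the hunks (the names are
taken from the hunks; the comments are my interpretation, not the patch's
text):]

typedef enum LWLockWaitState
{
    LW_WS_NOT_WAITING,          /* not waiting on any lwlock */
    LW_WS_WAITING,              /* on a lock's wait list, sleeping on semaphore */
    LW_WS_PENDING_WAKEUP        /* removed from the wait list, wakeup not yet sent */
} LWLockWaitState;

The extra LW_WS_PENDING_WAKEUP state is what makes the suggested
Assert(waiter->lwWaiting == LW_WS_PENDING_WAKEUP) possible: a process that
has been taken off the wait list but not yet signalled is distinguishable
from one that was never queued at all.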
Re: Reviving lost replication slots
On Wed, Nov 9, 2022 at 3:53 PM Amit Kapila wrote: > > On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy > wrote: > > > > On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi > > wrote: > > > > > > I don't think walsenders fetching segment from archive is totally > > > stupid. With that feature, we can use fast and expensive but small > > > storage for pg_wal, while avoiding replciation from dying even in > > > emergency. > > > > It seems like a useful feature to have at least as an option and it > > saves a lot of work - failovers, expensive rebuilds of > > standbys/subscribers, manual interventions etc. > > > > If you're saying that even the walsedners serving logical replication > > subscribers would go fetch from the archive location for the removed > > WAL files, it mandates enabling archiving on the subscribers. > > > > Why archiving on subscribers is required? Won't it be sufficient if > that is enabled on the publisher where we have walsender? Ugh. A typo. I meant it mandates enabling archiving on publishers. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Small TAP improvements
On 2022-11-06 Su 08:51, Álvaro Herrera wrote: > On 2022-Jun-14, Andrew Dunstan wrote: > >> OK, here's a more principled couple of patches. For config_data, if you >> give multiple options it gives you back the list of values. If you don't >> specify any, in scalar context it just gives you back all of pg_config's >> output, but in array context it gives you a map, so you should be able >> to say things like: >> >> my %node_config = $node->config_data; > Hi, it looks to me like these were forgotten? > Yeah, will get to it this week. Thanks for the reminder. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Reviving lost replication slots
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi wrote:
> Was the intent of setting restart_lsn to InvalidXLogRecPtr to disallow
> reviving the slot?
>

I think the intent is to compute the correct value for replicationSlotMinLSN,
as we use restart_lsn for it, and using the invalidated slot's restart_lsn
value for it doesn't make sense.

-- 
With Regards,
Amit Kapila.
Remove redundant declaration for XidInMVCCSnapshot
Commit b7eda3e0e3 moved XidInMVCCSnapshot into snapmgr.{c,h}; however, it
forgot to remove the old declaration of XidInMVCCSnapshot from heapam.h.
Attached removes the redundant declaration in heapam.h.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

>From 16f4bd33d7886c872319393dfebb5dd570b750ab Mon Sep 17 00:00:00 2001
From: Japin Li
Date: Wed, 9 Nov 2022 18:40:11 +0800
Subject: [PATCH v1 1/1] Remove redundant declaration for XidInMVCCSnapshot

---
 src/include/access/heapam.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9dab35551e..5b07875740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -213,7 +213,6 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
-extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup, struct GlobalVisState *vistest);
-- 
2.25.1
Re: Perform streaming logical transactions by background workers and parallel apply
On Mon, Nov 7, 2022 at 1:46 PM Peter Smith wrote:
>
> Here are my review comments for v42-0001
...
...
>
> 8.
>
> + /*
> +  * Resend the pending message to parallel apply worker to cleanup the
> +  * queue. Note that parallel apply worker will just ignore this message
> +  * as it has already handled this message while applying spooled
> +  * messages.
> +  */
> + result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
> +                      winfo->pending_msg, false, true);
>
> If I understand this logic it seems a bit hacky. From the comment, it
> seems you are resending a message that you know/expect to be ignored
> simply to make it disappear. (??). Isn't there some other way to clear
> the pending message without requiring a bogus send?
>

IIUC, this handling is required for the case when we are not able to send a
message to the parallel apply worker and switch to serialize mode (write
remaining data to a file). Basically, it is possible that the message is
only partially sent and there is no way to clean the queue. I feel we can
directly free the worker in this case even if there is space in the worker
pool. The other idea could be that we detach from the shm_mq and then
invent a way to re-attach it after we try to reuse the same worker.

-- 
With Regards,
Amit Kapila.
Re: libpq error message refactoring
Hello I gave this series a quick look. Overall it seems a good idea, since the issue of newlines-or-not is quite bothersome for the libpq translations. > +/* > + * Append a formatted string to the given buffer, after translation. A > + * newline is automatically appended; the format should not end with a > + * newline. > + */ I find the "after translation" bit unclear -- does it mean that the caller should have already translated, or is it the other way around? I would say "Append a formatted string to the given buffer, after translating it", which (to me) conveys more clearly that translation occurs here. > + /* Loop in case we have to retry after enlarging the buffer. */ > + do > + { > + errno = save_errno; > + va_start(args, fmt); > + done = appendPQExpBufferVA(errorMessage, libpq_gettext(fmt), > args); I wonder if it makes sense to do libpq_gettext() just once, instead of redoing it on each iteration. > +void > +libpq_append_conn_error(PGconn *conn, const char *fmt, ...) These two routines are essentially identical. While we could argue about sharing an underlying implementation, I think it's okay the way you have it, because the overheard of sharing it would make that pointless, given how short they are. > +extern void libpq_append_error(PQExpBuffer errorMessage, const char *fmt, > ...) pg_attribute_printf(2, 3); > +extern void libpq_append_conn_error(PGconn *conn, const char *fmt, ...) > pg_attribute_printf(2, 3); pg_attribute_printf marker present -- check. > -GETTEXT_TRIGGERS = libpq_gettext pqInternalNotice:2 > -GETTEXT_FLAGS= libpq_gettext:1:pass-c-format pqInternalNotice:2:c-format > +GETTEXT_TRIGGERS = libpq_append_conn_error:2 \ > + libpq_append_error:2 \ > + libpq_gettext pqInternalNotice:2 > +GETTEXT_FLAGS= libpq_append_conn_error:2:c-format \ > + libpq_append_error:2:c-format \ > + libpq_gettext:1:pass-c-format pqInternalNotice:2:c-format Looks good. > --- a/src/interfaces/libpq/pqexpbuffer.h > +++ b/src/interfaces/libpq/pqexpbuffer.h > +/* > + * appendPQExpBufferVA > + * Shared guts of printfPQExpBuffer/appendPQExpBuffer. > + * Attempt to format data and append it to str. Returns true if done > + * (either successful or hard failure), false if need to retry. > + * > + * Caution: callers must be sure to preserve their entry-time errno > + * when looping, in case the fmt contains "%m". > + */ > +extern bool appendPQExpBufferVA(PQExpBuffer str, const char *fmt, va_list > args) pg_attribute_printf(2, 0); As an API user, I don't care that this is shared guts for something else, I just care about what it does. I think deleting that line is a sufficient fix. > - appendPQExpBufferStr(&conn->errorMessage, > - > libpq_gettext("malformed SCRAM message (empty message)\n")); > + libpq_append_conn_error(conn, "malformed SCRAM message > (empty message)"); Overall, this type of change looks positive. I didn't review all these changes too closely other than the first couple of dozens, as there are way too many; I suppose you did these with some Emacs macros or something? 
> @@ -420,7 +418,8 @@ pqsecure_raw_write(PGconn *conn, const void *ptr, size_t > len) > snprintf(msgbuf, sizeof(msgbuf), >libpq_gettext("server closed > the connection unexpectedly\n" > > "\tThis probably means the server terminated abnormally\n" > - > "\tbefore or while processing the request.\n")); > + > "\tbefore or while processing the request.")); > + strlcat(msgbuf, "\n", sizeof(msgbuf)); > conn->write_err_msg = strdup(msgbuf); > /* Now claim the write succeeded */ > n = len; In these two places, we're writing the error message manually to a separate variable, so the extra \n is necessary. It looks a bit odd to do it with strlcat() after the fact, but AFAICT it's necessary as it keeps the \n out of the translation catalog, which is good. This is nonobvious, so perhaps add a comment about it. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Before you were born your parents weren't as boring as they are now. They got that way paying your bills, cleaning up your room and listening to you tell them how idealistic you are." -- Charles J. Sykes' advice to teenagers
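[Editor's note: for readers who have not seen the patch itself, the new
routine under review can be pieced together from the fragments quoted
earlier in this review. The following is my reassembly of the conn variant
plus some glue, a sketch of what it presumably looks like rather than the
patch verbatim:]

void
libpq_append_conn_error(PGconn *conn, const char *fmt, ...)
{
    int         save_errno = errno;
    bool        done;
    va_list     args;

    /* The format string is expected not to end in a newline. */

    if (PQExpBufferBroken(&conn->errorMessage))
        return;                 /* already failed, nothing more to report */

    /* Loop in case we have to retry after enlarging the buffer. */
    do
    {
        errno = save_errno;
        va_start(args, fmt);
        done = appendPQExpBufferVA(&conn->errorMessage, libpq_gettext(fmt), args);
        va_end(args);
    } while (!done);

    /* The trailing newline is added once, outside the translatable string. */
    appendPQExpBufferChar(&conn->errorMessage, '\n');
}

This also shows why the pqsecure_raw_write() case discussed above is
special: there the message goes into a separate buffer via strlcat(), so
the newline has to be appended by hand instead.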
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 5:08 AM David Christensen wrote:
>
> Enclosed is v6, which squashes your refactor and adds the additional
> recent suggestions.

Thanks for working on this feature. Here are some comments for now. I
haven't looked at the tests, I will take another look at the code and
tests after these and all other comments are addressed.

1. For ease of review, please split the test patch to 0002.

2. I'm unable to understand the use-case for the --fixup-fpi option.
pg_waldump is supposed to be just a WAL reader, and must not return any
modified information; with the --fixup-fpi option, the patch violates this
principle, i.e., it sets the page LSN and returns it. Without actually
replaying the WAL record on the page, how is it correct to just set the
LSN? How will it be useful? ISTM, we must ignore this option unless there's
a strong use-case.

3.
+    if (fork >= 0 && fork <= MAX_FORKNUM)
+    {
+        if (fork)
+            sprintf(forkname, "_%s", forkNames[fork]);
+        else
+            forkname[0] = 0;
+    }
+    else
+        pg_fatal("found invalid fork number: %u", fork);

Isn't the above complex? What's the problem with something like below? Why
do we need the if (fork) - else block?

    if (fork >= 0 && fork <= MAX_FORKNUM)
        sprintf(forkname, "_%s", forkNames[fork]);
    else
        pg_fatal("found invalid fork number: %u", fork);

4.
+    char        page[BLCKSZ] = {0};

I think when writing to a file, we need PGAlignedBlock rather than a simple
char array of bytes, see the description around PGAlignedBlock for why it
is so.

5.
+    if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)

Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can avoid
fileno() system calls, no? Do you need the file position to remain the same
after writing, hence pg_pwrite()?

6.
+    if (!RestoreBlockImage(record, block_id, page))
+        continue;
+
+    /* we have our extracted FPI, let's save it now */

After extracting the page from the WAL record, do we need to perform a
checksum on it?

7.
+    if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)

Use pg_dir_create_mode instead of hard-coded 0700?

8.
+if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
+fsync(fileno(OPF));
+fclose(OPF);

Since you're creating the directory in case it's not available, you need to
fsync the directory too?

9.
+    case 'W':
+    case 'X':
+        config.fixup_fpw = (option == 'X');
+        config.save_fpw_path = pg_strdup(optarg);
+        break;

Just set config.fixup_fpw = false before the switch block starts, like the
other variables, and then perhaps doing like below is more readable:

    case 'W':
        config.save_fpw_path = pg_strdup(optarg);
        break;
    case 'X':
        config.fixup_fpw = true;
        config.save_fpw_path = pg_strdup(optarg);
        break;

10.
+    if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)

Should we use pg_mkdir_p() instead of mkdir()?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
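[Editor's note: to make review comments 4 and 5 above concrete, here is
roughly what the write path could look like if it used PGAlignedBlock and
fwrite() instead of a plain char array and pg_pwrite(). This is a sketch of
the suggestion, not code from the patch; the function name and the "path"
argument are placeholders, and error handling is minimal.]

/*
 * Sketch only.  Dump one full-page image from the given WAL record to
 * "path", using an aligned buffer and stdio as suggested above.
 */
static void
dump_block_image(XLogReaderState *record, uint8 block_id, const char *path)
{
    PGAlignedBlock buf;         /* aligned, BLCKSZ-sized buffer from c.h */
    FILE       *f;

    if (!RestoreBlockImage(record, block_id, buf.data))
        return;                 /* this block reference carries no image */

    f = fopen(path, PG_BINARY_W);
    if (f == NULL)
        pg_fatal("could not open file \"%s\": %s", path, strerror(errno));

    if (fwrite(buf.data, BLCKSZ, 1, f) != 1)
        pg_fatal("could not write file \"%s\": %s", path, strerror(errno));

    if (fclose(f) != 0)
        pg_fatal("could not close file \"%s\": %s", path, strerror(errno));
}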
Re: Add tracking of backend memory allocated to pg_stat_activity
Hi Andres, Thanks for looking at this and for the feedback. Responses inline below. On Fri, 2022-11-04 at 19:41 -0700, Andres Freund wrote: > Hi, > > On 2022-11-04 08:56:13 -0400, Reid Thompson wrote: > > I'm not convinced that counting DSM values this way is quite right. > There are a few uses of DSMs that track shared resources, with the biggest > likely being the stats for relations etc. I suspect that tracking that via > backend memory usage will be quite confusing, because fairly random backends > that > had to grow the shared state end up being blamed with the memory usage in > perpituity - and then suddenly that memory usage will vanish when that > backend exits, > despite the memory continuing to exist. Ok, I'll make an attempt to identify these allocations and manage them elsewhere. > > > > @@ -734,6 +747,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > return NULL; > > > > context->mem_allocated += blksize; > > + pgstat_report_backend_allocated_bytes_increase(blksize); > > > > block->aset = set; > > block->freeptr = block->endptr = ((char *) block) + blksize; > > @@ -944,6 +958,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > return NULL; > > > > context->mem_allocated += blksize; > > + pgstat_report_backend_allocated_bytes_increase(blksize); > > > > block->aset = set; > > block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; > > @@ -1043,6 +1058,7 @@ AllocSetFree(void *pointer) > > block->next->prev = block->prev; > > > > set->header.mem_allocated -= block->endptr - ((char *) > > block); > > + > > pgstat_report_backend_allocated_bytes_decrease(block->endptr - ((char *) > > block)); > > > > #ifdef CLOBBER_FREED_MEMORY > > wipe_mem(block, block->freeptr - ((char *) block)); > > I suspect this will be noticable cost-wise. Even though these paths aren't the > hottest memory allocation paths, by nature of going down into malloc, adding > an external function call that then does a bunch of branching etc. seems > likely to add up to some. See below for how I think we can deal with > that... > > > > + > > +/* > > + * pgstat_report_backend_allocated_bytes_increase() - > > + * > > + * Called to report increase in memory allocated for this backend > > + * > > + */ > > +void > > +pgstat_report_backend_allocated_bytes_increase(uint64 allocation) > > +{ > > + volatile PgBackendStatus *beentry = MyBEEntry; > > + > > + if (!beentry || !pgstat_track_activities) > > + { > > + /* > > + * Account for memory before pgstats is initialized. This > > will be > > + * migrated to pgstats on initialization. > > + */ > > + backend_allocated_bytes += allocation; > > + > > + return; > > + } > > + > > + /* > > + * Update my status entry, following the protocol of bumping > > + * st_changecount before and after. We use a volatile pointer here > > to > > + * ensure the compiler doesn't try to get cute. > > + */ > > + PGSTAT_BEGIN_WRITE_ACTIVITY(beentry); > > + beentry->backend_allocated_bytes += allocation; > > + PGSTAT_END_WRITE_ACTIVITY(beentry); > > +} > > This is quite a few branches, including write/read barriers. > > It doesn't really make sense to use the PGSTAT_BEGIN_WRITE_ACTIVITY() pattern > here - you're just updating a single value, there's nothing to be gained by > it. The point of PGSTAT_BEGIN_*ACTIVITY() stuff is to allow readers to get a > consistent view of multiple values - but there aren't multiple values > here! I'll remove the barriers - initially I copied how prior functions were coded as my template ala pgstat_report_query_id, pgstat_report_xact_timestamp. 
> > To avoid the overhead of checking (!beentry || !pgstat_track_activities) and > the external function call, I think you'd be best off copying the trickery I > introduced for pgstat_report_wait_start(), in 225a22b19ed. > > I.e. make pgstat_report_backend_allocated_bytes_increase() a static inline > function that unconditionally updates something like > *my_backend_allocated_memory. To deal with the case of (!beentry || > !pgstat_track_activities), that variable initially points to some backend > local state and is set to the shared state in pgstat_bestart(). > > This additionally has the nice benefit that you can track memory usage from > before pgstat_bestart(), it'll be in the local variable. OK, I think I can mimic the code you reference. > > > +void > > +pgstat_report_backend_allocated_bytes_decrease(uint64 > > deallocation) > > +{ > > + volatile PgBackendStatus *beentry = MyBEEntry; > > + > > + /* > > + * Cases may occur where shared memory from a previo
Re: HOT chain validation in verify_heapam()
Hi Himanshu, > Test cases are now part of this v6 patch. I believe the patch is in pretty good shape now. I'm going to change its status to "Ready for Committer" soon unless there are going to be any objections. -- Best regards, Aleksander Alekseev
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 09, 2022 at 06:00:40PM +0530, Bharath Rupireddy wrote:
> On Wed, Nov 9, 2022 at 5:08 AM David Christensen wrote:
> >
> > Enclosed is v6, which squashes your refactor and adds the additional
> > recent suggestions.
>
> Thanks for working on this feature. Here are some comments for now. I
> haven't looked at the tests, I will take another look at the code and
> tests after these and all other comments are addressed.
>
> 1. For ease of review, please split the test patch to 0002.

This is just my opinion, but .. why?  Since it's easy to
filter/skip/display a file, I don't think it's usually useful to have
separate patches for tests or docs.

> 7.
> +    if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Use pg_dir_create_mode instead of hard-coded 0700?

I think I thought of that when I first looked at the patch ... but, I'm not
sure, since it says:

src/include/common/file_perm.h-/* Modes for creating directories and files IN THE DATA DIRECTORY */
src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode;

I was wondering if there's any reason to do "CREATE DATABASE". The vast
majority of TAP tests don't.

$ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
   1435 safe_psql('postgres',
    335 safe_psql(
     23 safe_psql($connect_db,

-- 
Justin
Re: New docs chapter on Transaction Management and related changes
On Mon, Nov 7, 2022 at 5:04 PM Laurenz Albe wrote: > > On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote: > > Agreed; new compilation patch attached, including mine and then > > Robert's suggested rewordings. > > Thanks. There is clearly a lot of usefule information in this. > > Some comments: > > > --- a/doc/src/sgml/ref/release_savepoint.sgml > > +++ b/doc/src/sgml/ref/release_savepoint.sgml > > @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] > > savepoint_name > >Description > > > > > > - RELEASE SAVEPOINT destroys a savepoint previously > > defined > > - in the current transaction. > > + RELEASE SAVEPOINT will subcommit the subtransaction > > + established by the named savepoint, if one exists. This will release > > + any resources held by the subtransaction. If there were any > > + subtransactions of the named savepoint, these will also be subcommitted. > > > > > > > > "Subtransactions of the named savepoint" is somewhat confusing; how about > "subtransactions of the subtransaction established by the named savepoint"? > > If that is too long and explicit, perhaps "subtransactions of that > subtransaction". > Personally, I think these are more confusing. > > --- a/doc/src/sgml/ref/rollback.sgml > > +++ b/doc/src/sgml/ref/rollback.sgml > > @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ] > > AND CHAIN > > > > > > - If AND CHAIN is specified, a new transaction is > > + If AND CHAIN is specified, a new unaborted > > transaction is > >immediately started with the same transaction characteristics (see > > >linkend="sql-set-transaction"/>) as the just finished one. > > Otherwise, > >no new transaction is started. > > I don't think that is an improvement. "Unaborted" is an un-word. A new > transaction > is always "unaborted", isn't it? > I thought about this as well when reviewing it, but I do think something is needed for the case where you have a transaction which has suffered an error and then you issue "rollback and chain"; if you just say "a new transaction is immediately started with the same transaction characteristics" it might imply to some the new transaction has some kind of carry over of the previous broken transaction... the use of the word unaborted makes it clear that the new transaction is 100% functional. > > --- a/doc/src/sgml/wal.sgml > > +++ b/doc/src/sgml/wal.sgml > > @@ -909,4 +910,36 @@ > > seem to be a problem in practice. > > > > > > + > > + > > + > > + Two-Phase Transactions > > + > > + > > + PostgreSQL supports a two-phase commit (2PC) > [...] > > + pg_twophase directory. Currently-prepared > > + transactions can be inspected using > + > > linkend="view-pg-prepared-xacts">pg_prepared_xacts. > > + > > + > > + > > > > I don't like "currently-prepared". How about: > "Transaction that are currently prepared can be inspected..." > This seems to align with other usage, so +1 > This is clearly interesting information, but I don't think the WAL chapter is > the right > place for this. "pg_twophase" is already mentioned in "storage.sgml", and > details about > when exactly a prepared transaction is persisted may exceed the details level > needed by > the end user. > > I'd look for that information in the reference page for PREPARE TRANSACTION; > perhaps > that would be a better place. Or, even better, the new "xact.sgml" chapter. > > > --- /dev/null > > +++ b/doc/src/sgml/xact.sgml > > + Transaction Management > > + The word transaction is often abbreviated as "xact". > > Should use here. 
> > > + Transactions and Identifiers > > > + > > +Once a transaction writes to the database, it is assigned a > > +non-virtual TransactionId (or xid), > > +e.g., 278394. Xids are assigned sequentially > > +using a global counter used by all databases within the > > +PostgreSQL cluster. This property is used by > > +the transaction system to order transactions by their first database > > +write, i.e., lower-numbered xids started writing before higher-numbered > > +xids. Of course, transactions might start in a different order. > > + > > "This property"? How about: > "Because transaction IDs are assigned sequentially, the transaction system can > use them to order transactions by their first database write" > +1 > I would want some additional information here: why does the transaction > system have > to order transactions by their first database write? > > "Of course, transactions might start in a different order." > > Now that confuses me. Are you saying that BEGIN could be in a different order > than the first database write? Perhaps like this: > > "Note that the order in which transactions perform their first database write > might be different from the order in which the transactions started." > +1 > > +The internal transaction ID type xid is 32-bits wide > > There should be no hyphen in "32 bi
Update some more ObjectType switch statements to not have default
This arose during the review of another patch. We often omit the default case of a switch statement to allow the compiler to complain if an enum case has been missed. I found a few where that wasn't done yet, but it would make sense and would have found an omission in another patch.From 37a03b420ce52f02119a0085f3c5fe3b44fb Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 9 Nov 2022 14:52:33 +0100 Subject: [PATCH] Update some more ObjectType switch statements to not have default This allows the compiler to complain if a case has been missed. In these instances, the tradeoff of having to list a few unneeded cases to silence the compiler seems better than the risk of actually missing one. --- src/backend/catalog/objectaddress.c | 26 +++ src/backend/commands/dropcmds.c | 39 +++-- 2 files changed, 53 insertions(+), 12 deletions(-) diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index c7de7232b890..f36581d106b2 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -960,7 +960,7 @@ ObjectAddress get_object_address(ObjectType objtype, Node *object, Relation *relp, LOCKMODE lockmode, bool missing_ok) { - ObjectAddress address; + ObjectAddress address = {InvalidOid, InvalidOid, 0}; ObjectAddress old_address = {InvalidOid, InvalidOid, 0}; Relationrelation = NULL; uint64 inval_count; @@ -991,6 +991,7 @@ get_object_address(ObjectType objtype, Node *object, &relation, lockmode, missing_ok); break; + case OBJECT_ATTRIBUTE: case OBJECT_COLUMN: address = get_object_address_attribute(objtype, castNode(List, object), @@ -1163,14 +1164,12 @@ get_object_address(ObjectType objtype, Node *object, missing_ok); address.objectSubId = 0; break; - default: - elog(ERROR, "unrecognized object type: %d", (int) objtype); - /* placate compiler, in case it thinks elog might return */ - address.classId = InvalidOid; - address.objectId = InvalidOid; - address.objectSubId = 0; + /* no default, to let compiler warn about missing case */ } + if (!address.classId) + elog(ERROR, "unrecognized object type: %d", (int) objtype); + /* * If we could not find the supplied object, return without locking. */ @@ -2635,9 +2634,16 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address, aclcheck_error(ACLCHECK_NOT_OWNER, objtype, NameListToString(castNode(List, object))); break; - default: - elog(ERROR, "unrecognized object type: %d", -(int) objtype); + case OBJECT_AMOP: + case OBJECT_AMPROC: + case OBJECT_DEFAULT: + case OBJECT_DEFACL: + case OBJECT_PUBLICATION_NAMESPACE: + case OBJECT_PUBLICATION_REL: + case OBJECT_USER_MAPPING: + /* These are currently not supported or don't make sense here. */ + elog(ERROR, "unsupported object type: %d", (int) objtype); + break; } } diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c index 26157eb4e3f5..8bac58a9941a 100644 --- a/src/backend/commands/dropcmds.c +++ b/src/backend/commands/dropcmds.c @@ -480,10 +480,45 @@ does_not_exist_skipping(ObjectType objtype, Node *object) msg = gettext_noop("publication \"%s\" does not exist, skipping"); name = strVal(object); break; - default: - elog(ERROR, "unrecognized object type: %d", (int) objtype); + + case OBJECT_COLUMN: + case OBJECT_DATABASE: + case OBJECT_FOREIGN_TABLE: + case OBJECT_INDEX: + case OBJECT_MATVIEW: + case OBJECT_ROLE: + case OBJECT_SEQUENCE: + case OBJECT_SUBSCRIPT
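[Editor's note: for readers unfamiliar with the trick being relied on here:
leaving out the default label lets GCC/Clang's -Wswitch warning (enabled by
-Wall) flag any enum value that has no case. A tiny standalone illustration,
unrelated to the PostgreSQL sources themselves:]

typedef enum
{
    SHAPE_CIRCLE,
    SHAPE_SQUARE,
    SHAPE_TRIANGLE
} Shape;

static const char *
shape_name(Shape s)
{
    switch (s)
    {
        case SHAPE_CIRCLE:
            return "circle";
        case SHAPE_SQUARE:
            return "square";
        case SHAPE_TRIANGLE:
            return "triangle";

            /*
             * No default: if someone later adds SHAPE_HEXAGON to the enum,
             * -Wswitch reports "enumeration value 'SHAPE_HEXAGON' not
             * handled in switch" here.  A default label would silence that.
             */
    }
    return "unknown";           /* unreachable for valid enum values */
}

The cost, as the patch's commit message notes, is having to spell out a few
cases that only exist to error out (the "unsupported object type" branches),
which the patch considers better than the risk of silently missing one.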
Re: Add tracking of backend memory allocated to pg_stat_activity
Hi Melanie, Thank you for looking at this and for the feedback. Responses inline below. On Mon, 2022-11-07 at 16:17 -0500, Melanie Plageman wrote: > On Fri, Nov 04, 2022 at 08:56:13AM -0400, Reid Thompson wrote: > > From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 > > 2001 > > From: Reid Thompson > > Date: Thu, 11 Aug 2022 12:01:25 -0400 > > Subject: [PATCH] Add tracking of backend memory allocated to > > pg_stat_activity > > + > > It doesn't seem like you need the backend_ prefix in the view since > that is implied by it being in pg_stat_activity. I will remove the prefix. > For the wording on the description, I find "memory allocated to this > backend" a bit confusing. Perhaps you could reword it to make clear > you mean that the number represents the balance of allocations by this > backend. Something like: > > Memory currently allocated to this backend in bytes. This is the > balance of bytes allocated and freed by this backend. > I would also link to the system administration function > pg_size_pretty() so users know how to easily convert the value. Thanks, I'll make these changes > If you end up removing shared memory as Andres suggests in [1], I > would link pg_shmem_allocations view here and point out that shared memory > allocations can be viewed there instead (and why). > > You could instead add dynamic shared memory allocation to the > pg_shmem_allocations view as suggested as follow-on work by the > commit which introduced it, ed10f32e3. > > > +/* > > + * pgstat_report_backend_allocated_bytes_increase() - > > + * > > + * Called to report increase in memory allocated for this backend > > + * > > + */ > > It seems like you could combine the > pgstat_report_backend_allocated_bytes_decrease/increase() by either > using a signed integer to represent the allocation/deallocation or passing in > a "direction" that is just a positive or negative multiplier enum. > > Especially if you don't use the write barriers, I think you could > simplify the logic in the two functions. > > If you do combine them, you might shorten the name to > pgstat_report_backend_allocation() or pgstat_report_allocation(). Agreed. This seems a cleaner, simpler way to go. I'll add it to the TODO list. > > + /* > > + * Do not allow backend_allocated_bytes to go below zero. > > ereport if we > > + * would have. There's no need for a lock around the read > > here as it's > > + * being referenced from the same backend which means that > > there shouldn't > > + * be concurrent writes. We want to generate an ereport in > > these cases. > > + */ > > + if (deallocation > beentry->backend_allocated_bytes) > > + { > > + ereport(LOG, errmsg("decrease reduces reported > > backend memory allocated below zero; setting reported to 0")); > > + > > I also think it would be nice to include the deallocation amount and > backend_allocated_bytes amount in the log message. > It also might be nice to start the message with something more clear > than "decrease". > For example, I would find this clear as a user: > > Backend [backend_type or pid] deallocated [deallocation number] bytes, > [backend_allocated_bytes - deallocation number] more than this backend > has reported allocating. 
Sounds good, I'll implement these changes > > diff --git a/src/backend/utils/adt/pgstatfuncs.c > > b/src/backend/utils/adt/pgstatfuncs.c > > index 96bffc0f2a..b6d135ad2f 100644 > > --- a/src/backend/utils/adt/pgstatfuncs.c > > +++ b/src/backend/utils/adt/pgstatfuncs.c > > @@ -553,7 +553,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS) > > Datum > > pg_stat_get_activity(PG_FUNCTION_ARGS) > > { > > -#define PG_STAT_GET_ACTIVITY_COLS 30 > > +#define PG_STAT_GET_ACTIVITY_COLS 31 > > int num_backends = > > > > + values[30] = > > UInt64GetDatum(beentry->backend_allocated_bytes); > > Though not the fault of this patch, it is becoming very difficult to > keep the columns straight in pg_stat_get_activity(). Perhaps you > could add a separate commit to add an enum for the columns so the function > is easier to understand. > > > diff --git a/src/include/utils/backend_status.h > > b/src/include/utils/backend_status.h > > index b582b46e9f..75d87e8308 100644 > > --- a/src/include/utils/backend_status.h > > +++ b/src/include/utils/backend_status.h > > @@ -169,6 +169,9 @@ typedef struct PgBackendStatus > > > > /* query identifier, optionally computed using > > post_parse_analyze_hook */ > > uint64 st_query_id; > > + > > + /* Current memory allocated to this backend */ > > + uint64 backend_allocated_bytes; > > } PgBackendStatus; > > I don't think you need the backend_ prefix here since it is in > PgBackendStatus. Agreed again, I'll remove the prefix. > > @@ -313,7 +316,9 @@ extern const char > > *pgstat_get_backend_curren
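[Editor's note: putting the two review suggestions together (a single
signed entry point, plus the pgstat_report_wait_start()-style indirection
Andres suggested earlier in the thread), a minimal sketch might look like
the following. The names are placeholders of mine, not the patch's, and the
field on PgBackendStatus is assumed to exist as discussed above.]

/* Accumulates allocations made before pgstat_bestart() runs. */
static uint64 local_allocated_bytes = 0;

/* Points at the local counter until the shared entry is available. */
static uint64 *my_allocated_bytes = &local_allocated_bytes;

/* Hot path: no branches, no barriers, just one addition. */
static inline void
pgstat_report_allocation(int64 delta)
{
    *my_allocated_bytes += delta;
}

/* In pgstat_bestart(), once MyBEEntry exists: migrate and switch over. */
static void
pgstat_switch_to_shared_allocation_counter(PgBackendStatus *beentry)
{
    beentry->allocated_bytes = local_allocated_bytes;
    my_allocated_bytes = &beentry->allocated_bytes;
}

With this shape, the underflow check discussed above (refusing to let the
reported value go below zero on a too-large deallocation) would live in one
place, in or next to pgstat_report_allocation().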
Call lazy_check_wraparound_failsafe earlier for parallel vacuum
While looking through vacuum code, I noticed that unlike non-parallel
vacuum, parallel vacuum only gets a failsafe check after an entire index
cycle completes. In vacuumlazy.c, lazy_check_wraparound_failsafe is checked
after every index completes, while in parallel, it is checked after an
entire index cycle has completed.

    if (!ParallelVacuumIsActive(vacrel))
    {
        for (int idx = 0; idx < vacrel->nindexes; idx++)
        {
            Relation    indrel = vacrel->indrels[idx];
            IndexBulkDeleteResult *istat = vacrel->indstats[idx];

            vacrel->indstats[idx] = lazy_vacuum_one_index(indrel, istat,
                                                          vacrel->old_live_tuples,
                                                          vacrel);

            /*
             * Done vacuuming an index. Increment the indexes completed
             */
            pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                         idx + 1);

            if (lazy_check_wraparound_failsafe(vacrel))
            {
                /* Wraparound emergency -- end current index scan */
                allindexes = false;
                break;
            }
        }
    }
    else
    {
        /* Outsource everything to parallel variant */
        parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, vacrel->old_live_tuples,
                                            vacrel->num_index_scans);

        /*
         * Do a postcheck to consider applying wraparound failsafe now. Note
         * that parallel VACUUM only gets the precheck and this postcheck.
         */
        if (lazy_check_wraparound_failsafe(vacrel))
            allindexes = false;
    }

When a user is running a parallel vacuum and the vacuum is long running due
to many large indexes, it would make sense to check for failsafe earlier.
Also, checking after every index for parallel vacuum will provide the same
failsafe behavior for both parallel and non-parallel vacuums.

To make this work, it is possible to call lazy_check_wraparound_failsafe
inside parallel_vacuum_process_unsafe_indexes and
parallel_vacuum_process_safe_indexes of vacuumparallel.c.

Regards,

Sami Imseih
Amazon Web Services (AWS)
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On 2022-Nov-09, Justin Pryzby wrote: > On Wed, Nov 09, 2022 at 06:00:40PM +0530, Bharath Rupireddy wrote: > > 1. For ease of review, please split the test patch to 0002. > > This is just my opinion, but .. why ? Since it's easy to > filter/skip/display a file, I don't think it's usually useful to have > separate patches for tests or docs. I concur with Justin. When a patch is bugfix and a test is added that verifies it, I like to keep the test in a separate commit (for submit purposes and in my personal repo -- not for the official push!) so that I can git-checkout to just the test and make sure it fails ahead of pushing the fix commit. But for a new feature, there's no reason to do so. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Add tracking of backend memory allocated to pg_stat_activity
Hi, 2022-11-09 08:54:54 -0500, Reid Thompson wrote: > Thanks for looking at this and for the feedback. Responses inline below. > > > +void > > > +pgstat_report_backend_allocated_bytes_decrease(uint64 > > > deallocation) > > > +{ > > > + volatile PgBackendStatus *beentry = MyBEEntry; > > > + > > > + /* > > > + * Cases may occur where shared memory from a previous postmaster > > > + * invocation still exist. These are cleaned up at startup by > > > + * dsm_cleanup_using_control_segment. Limit decreasing memory > > > allocated to > > > + * zero in case no corresponding prior increase exists or > > > decrease has > > > + * already been accounted for. > > > + */ > > > > I don't really follow - postmaster won't ever have a backend status > > array, so how would they be tracked here? > > On startup, a check is made for leftover dsm control segments in the > DataDir. It appears possible that in certain situations on startup we > may find and destroy stale segments and thus decrement the allocation > variable. > > I based this off of: > /ipc/dsm.c > > dsm_postmaster_startup: > 150 dsm_postmaster_startup(PGShmemHeader *shim) > { > ... > 281 } I don't think we should account for memory allocations done in postmaster in this patch. They'll otherwise be counted again in each of the forked backends. As this cleanup happens during postmaster startup, we'll have to make sure accounting is reset during backend startup. Greetings, Andres Freund
Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
Yugo NAGATA writes: > Tom Lane wrote: >> What do you think of the attached wording? > It looks good to me. That describes the expected behaviour exactly. Pushed that, then. >> I don't think the pipeline angle is of concern to anyone who might be >> reading these comments with the aim of understanding what guarantees >> they have. Perhaps there should be more about that in the user-facing >> docs, though. > I agree with that we don't need to mention pipelining in these comments, > and that we need more in the documentation. I attached a doc patch to add > a mention of commands that do internal commit to the pipelining section. > Also, this adds a reference for the pipelining protocol to the libpq doc. Hmm ... I don't really find either of these changes to be improvements. The fact that, say, multi-table ANALYZE uses multiple transactions seems to me to be a property of that statement, not of the protocol. regards, tom lane
Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
Peter Eisentraut writes: > This has broken the following use: > parse: create temporary table t1 (a int) on commit drop > bind > execute > parse: analyze t1 > bind > execute > parse: select * from t1 > bind > execute > sync > I think the behavior of IsInTransactionBlock() needs to be further > refined to support this. Hmm. Maybe the right way to think about this is "if we have completed an EXECUTE, and not yet received a following SYNC, then report that we are in a transaction block"? But I'm not sure if that breaks any other cases. regards, tom lane
Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
On 08.08.22 17:21, Yugo NAGATA wrote:
> In fact, the result of IsInTransactionBlock does not make sense at all in
> pipeline mode regardless of the fix. ANALYZE could commit all previous
> commands in pipelining, and this may not be the behaviour the user expects.
>
>> This seems pretty much isomorphic to the fact that CREATE DATABASE will
>> commit preceding steps in the pipeline.
>
> I am not sure if we can treat the CREATE DATABASE case and the ANALYZE
> case similarly. First, CREATE DATABASE is one of the commands that cannot
> be executed inside a transaction block, but ANALYZE can be. So, users
> would not be able to tell from the documentation that ANALYZE in a
> pipeline causes a commit. Second, ANALYZE issues a commit internally at an
> early stage, not only after it has finished successfully. For example,
> even if ANALYZE is failing because a non-existent column name is
> specified, it issues a commit before checking the column name. This makes
> it harder to know which statements will be committed and which will not be
> committed in a pipeline. Also, as you know, there are other commands that
> issue internal commits.

This has broken the following use:

parse: create temporary table t1 (a int) on commit drop
bind
execute
parse: analyze t1
bind
execute
parse: select * from t1
bind
execute
sync

I think the behavior of IsInTransactionBlock() needs to be further refined
to support this.

If we are worried about external callers, maybe we need to provide a
separate version. AFAICT, all the callers of IsInTransactionBlock() over
time have been in vacuum/analyze-related code, so perhaps in master we
should just move it there.
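[Editor's note: for anyone who wants to reproduce the parse/bind/execute
sequence above without hand-writing protocol messages, a roughly equivalent
libpq pipeline program is sketched below (connection string is a
placeholder; error handling is pared down). The point of interest is the
ANALYZE in the middle: if it performs an internal commit, the ON COMMIT
DROP action fires and the final SELECT fails because t1 is gone, which is
presumably the breakage being reported.]

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PQenterPipelineMode(conn);

    /* Each of these is a parse/bind/execute triplet on the wire. */
    PQsendQueryParams(conn, "CREATE TEMPORARY TABLE t1 (a int) ON COMMIT DROP",
                      0, NULL, NULL, NULL, NULL, 0);
    PQsendQueryParams(conn, "ANALYZE t1", 0, NULL, NULL, NULL, NULL, 0);
    PQsendQueryParams(conn, "SELECT * FROM t1", 0, NULL, NULL, NULL, NULL, 0);

    PQpipelineSync(conn);       /* the single Sync at the end */

    /*
     * Drain results.  In pipeline mode PQgetResult() returns NULL between
     * statements, and a PGRES_PIPELINE_SYNC result at the sync point.
     */
    for (;;)
    {
        res = PQgetResult(conn);
        if (res == NULL)
            continue;           /* boundary between statements */
        if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
        {
            PQclear(res);
            break;              /* reached the Sync, pipeline drained */
        }
        if (PQresultStatus(res) == PGRES_FATAL_ERROR)
            fprintf(stderr, "statement failed: %s", PQresultErrorMessage(res));
        PQclear(res);
    }

    PQexitPipelineMode(conn);
    PQfinish(conn);
    return 0;
}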
Re: Error for WITH options on partitioned tables
Simon Riggs writes: > Karina's changes make sense to me, so +1. > This is a minor patch, so I will set this as Ready For Committer. Pushed with minor fiddling: * I concur with Karina's thought that ERRCODE_WRONG_OBJECT_TYPE is the most on-point errcode for this. The complaint is specifically about the table relkind and has nothing to do with the storage parameter per se. I also agree that it's not worth trying to use a different errcode for CREATE vs. ALTER. * The HINT message wasn't per project style (it should be formatted as a complete sentence), and I thought using "parameters for" in the main message but "parameters on" in the hint was oddly inconsistent. regards, tom lane
Re: Remove redundant declaration for XidInMVCCSnapshot
On 2022-Nov-09, Japin Li wrote: > Commit b7eda3e0e3 moves XidINMVCCSnapshot into snapmgr.{c,h}, > however, it forgets the declaration of XidINMVCCSnapshot in > heapam.h. True. Pushed, thanks. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ #error "Operator lives in the wrong universe" ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)
Re: heavily contended lwlocks with long wait queues scale badly
Hi, On 2022-11-09 15:54:16 +0530, Bharath Rupireddy wrote: > On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy > wrote: > > > > > Updated patch attached. > > > > Thanks. It looks good to me. However, some minor comments on the v3 patch: > > > > 1. > > -if (MyProc->lwWaiting) > > +if (MyProc->lwWaiting != LW_WS_NOT_WAITING) > > elog(PANIC, "queueing for lock while waiting on another one"); > > > > Can the above condition be MyProc->lwWaiting == LW_WS_WAITING || > > MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability? > > > > Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING && > > MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting > > LW_WS_WAITING? I don't think that's a good idea - it'll just mean we have to modify more places if we add another state, without making anything more robust. > > 2. > > /* Awaken any waiters I removed from the queue. */ > > proclist_foreach_modify(iter, &wakeup, lwWaitLink) > > { > > > > @@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock) > > * another lock. > > */ > > pg_write_barrier(); > > -waiter->lwWaiting = false; > > +waiter->lwWaiting = LW_WS_NOT_WAITING; > > PGSemaphoreUnlock(waiter->sem); > > } > > > > /* > > * Awaken any waiters I removed from the queue. > > */ > > proclist_foreach_modify(iter, &wakeup, lwWaitLink) > > { > > PGPROC *waiter = GetPGProcByNumber(iter.cur); > > > > proclist_delete(&wakeup, iter.cur, lwWaitLink); > > /* check comment in LWLockWakeup() about this barrier */ > > pg_write_barrier(); > > waiter->lwWaiting = LW_WS_NOT_WAITING; > > > > Can we add an assertion Assert(waiter->lwWaiting == > > LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup > > list and set the LW_WS_NOT_WAITING flag in above loops, having an > > assertion is better here IMO. I guess it can't hurt - but it's not really related to the changes in the patch, no? > I looked at the v3 patch again today and ran some performance tests. > The results look impressive as they were earlier. Andres, any plans to > get this in? I definitely didn't want to backpatch before this point release. But it seems we haven't quite got to an agreement what to do about backpatching. It's probably best to just commit it to HEAD and let the backpatch discussion happen concurrently. I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK to push it to HEAD if I get it done in the next few hours. Bigger issues, which I do not expect, should show up before tomorrow afternoon. Smaller things could wait till Sunday if necessary. Greetings, Andres Freund
Re: [PATCH] Support % wildcard in extension upgrade filenames
And now a version of the patch including documentation and regression tests. Anything you see I should improve ? --strk; On Fri, Nov 04, 2022 at 06:31:58PM +0100, Sandro Santilli wrote: > On Mon, Oct 31, 2022 at 01:55:05AM -0400, Regina Obe wrote: > > > > Sandro, can you submit an updated version of this patch. > > I was testing it out and looked good first time. > > Attached updated version of the patch. > > > If everyone is okay with this patch, we'll go ahead and add tests and > > documentation to go with it. > > Yes please let us know if there's any chance of getting this > mechanism approved or if we should keep advertising works around > this limitation (it's not just installing 100s of files but also > still missing needed upgrade paths when doing so). > > > --strk; >From 4b10297315d2db4365e6f47e375588394b260d0d Mon Sep 17 00:00:00 2001 From: Sandro Santilli Date: Wed, 14 Sep 2022 11:10:10 +0200 Subject: [PATCH] Allow wildcard (%) in extension upgrade paths A wildcard character "%" will be accepted in the "source" side of the upgrade script and be considered usable to upgrade any version to the "target" side. Using wildcards needs to be explicitly requested by extensions via a "wildcard_upgrades" setting in their control file. Includes regression test and documentation. --- doc/src/sgml/extend.sgml | 14 + src/backend/commands/extension.c | 58 +-- src/test/modules/test_extensions/Makefile | 6 +- .../expected/test_extensions.out | 15 + .../test_extensions/sql/test_extensions.sql | 7 +++ .../test_ext_wildcard1--%--2.0.sql| 6 ++ .../test_ext_wildcard1--1.0.sql | 6 ++ .../test_ext_wildcard1.control| 4 ++ 8 files changed, 108 insertions(+), 8 deletions(-) create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index 46e873a166..4012652574 100644 --- a/doc/src/sgml/extend.sgml +++ b/doc/src/sgml/extend.sgml @@ -807,6 +807,20 @@ RETURNS anycompatible AS ... + + + wildcard_upgrades (boolean) + + +This parameter, if set to true (which is not the +default), allows ALTER EXTENSION to consider +a wildcard character % as matching any version of +the extension. Such wildcard match will only be used when no +perfect match is found for a version. + + + + diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 1a62e5dac5..ea8825fcff 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -86,6 +86,7 @@ typedef struct ExtensionControlFile bool relocatable; /* is ALTER EXTENSION SET SCHEMA supported? */ bool superuser; /* must be superuser to install? */ bool trusted; /* allow becoming superuser on the fly? 
*/ + bool wildcard_upgrades; /* allow using wildcards in upgrade scripts */ int encoding; /* encoding of the script file, or -1 */ List *requires; /* names of prerequisite extensions */ } ExtensionControlFile; @@ -128,6 +129,7 @@ static void ApplyExtensionUpdates(Oid extensionOid, bool cascade, bool is_create); static char *read_whole_file(const char *filename, int *length); +static bool file_exists(const char *name); /* @@ -579,6 +581,14 @@ parse_extension_control_file(ExtensionControlFile *control, errmsg("parameter \"%s\" requires a Boolean value", item->name))); } + else if (strcmp(item->name, "wildcard_upgrades") == 0) + { + if (!parse_bool(item->value, &control->wildcard_upgrades)) +ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("parameter \"%s\" requires a Boolean value", +item->name))); + } else if (strcmp(item->name, "encoding") == 0) { control->encoding = pg_valid_server_encoding(item->value); @@ -636,6 +646,7 @@ read_extension_control_file(const char *extname) control->relocatable = false; control->superuser = true; control->trusted = false; + control->wildcard_upgrades = false; control->encoding = -1; /* @@ -890,7 +901,15 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, if (from_version == NULL) elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version); else + { + if ( control->wildcard_upgrades && ! file_exists(filename) ) + { + elog(DEBUG1, "extension upgrade script \"%s\" does not exist, will try wildcard", filename); + /* if filename does not exist, try wildcard */ + filename = get_extension_script_filename(control, "%", version); + } elog(DEBUG1, "executing extension script for \"%s\" update from version '%s
Re: refactor ownercheck and aclcheck functions
> > After considering this again, I decided to brute-force this and get rid > of all the trivial wrapper functions and also several of the special > cases. That way, there is less confusion at the call sites about why > this or that style is used in a particular case. Also, it now makes > sure you can't accidentally use the generic functions when a particular > one should be used. > +1 However, the aclcheck patch isn't applying for me now. That patch modifies 37 files, so it's hard to say just which commit conflicts.
Re: psql: Add command to use extended query protocol
Peter Eisentraut wrote: > Is there a use case for a global setting? I assume that we may sometimes want to use the extended protocol on all queries of a script, like pgbench does with --protocol=extended. Outside of psql, it's too complicated to parse a SQL script to replace the end-of-query semicolons with \gp, whereas a psql setting solves this effortlessly. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
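As background for what \gp (or a would-be global setting) switches between, a minimal libpq sketch, not part of the patch: PQexec() uses the simple query protocol, while PQexecParams() goes through parse/bind/execute even with zero parameters.

    #include <libpq-fe.h>

    /* Simple protocol: a single Query message. */
    static PGresult *
    run_simple(PGconn *conn, const char *sql)
    {
        return PQexec(conn, sql);
    }

    /* Extended protocol: Parse/Bind/Execute, which is what \gp exercises. */
    static PGresult *
    run_extended(PGconn *conn, const char *sql)
    {
        return PQexecParams(conn, sql, 0, NULL, NULL, NULL, NULL, 0);
    }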
Re: allow segment size to be set to < 1GiB
On 2022-11-08 18:28:08 -0800, Andres Freund wrote: > On 2022-11-07 21:36:33 -0500, Tom Lane wrote: > > Andres Freund writes: > > > On 2022-11-07 12:52:25 -0500, Tom Lane wrote: > > >> How about instead allowing the segment size to be set in pages? > > > > > In addition or instead of --with-segsize/-Dsegsize? > > > > In addition to. What I meant by "instead" was to replace > > your proposal of --with-segsize-mb. > > Working on updating the patch. > > One semi-interesting bit is that <= 5 blocks per segment fails, because > corrupt_page_checksum() doesn't know about segments and > src/bin/pg_basebackup/t/010_pg_basebackup.pl does A second question: Both autoconf and meson print the segment size as GB right now. Obviously that'll print out a size of 0 for a segsize < 1GB. The easiest way would be to just display the number of blocks, but that's not particularly nice. We could show kB, but that ends up being large. Or we can have some code to adjust the unit, but that seems a bit overkill. Opinions?
Re: allow segment size to be set to < 1GiB
Andres Freund writes: > A second question: Both autoconf and meson print the segment size as GB right > now. Obviously that'll print out a size of 0 for a segsize < 1GB. > The easiest way would be to just display the number of blocks, but that's > not particularly nice. Well, it would be fine if you'd written --with-segsize-blocks, wouldn't it? Can we make the printout format depend on which switch was used? regards, tom lane
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 6:30 AM Bharath Rupireddy wrote: > > On Wed, Nov 9, 2022 at 5:08 AM David Christensen > wrote: > > > > Enclosed is v6, which squashes your refactor and adds the additional > > recent suggestions. > > Thanks for working on this feature. Here are some comments for now. I > haven't looked at the tests, I will take another look at the code and > tests after these and all other comments are addressed. > > 1. For ease of review, please split the test patch to 0002. Per later discussion seems like new feature tests are fine in the same patch, yes? > 2. I'm unable to understand the use-case for --fixup-fpi option. > pg_waldump is supposed to be just WAL reader, and must not return any > modified information, with --fixup-fpi option, the patch violates this > principle i.e. it sets page LSN and returns. Without actually > replaying the WAL record on the page, how is it correct to just set > the LSN? How will it be useful? ISTM, we must ignore this option > unless there's a strong use-case. How I was envisioning this was for cases like extreme surgery for corrupted pages, where you extract the page from WAL but it has lsn and checksum set so you could do something like `dd if=fixup-block of=relation ...`, so it *simulates* the replay of said fullpage blocks in cases where for some reason you can't play the intermediate records; since this is always a fullpage block, it's capturing what would be the snapshot so you could manually insert somewhere as needed without needing to replay (say if dealing with an incomplete or corrupted WAL stream). > 3. > +if (fork >= 0 && fork <= MAX_FORKNUM) > +{ > +if (fork) > +sprintf(forkname, "_%s", forkNames[fork]); > +else > +forkname[0] = 0; > +} > +else > +pg_fatal("found invalid fork number: %u", fork); > > Isn't the above complex? What's the problem with something like below? > Why do we need if (fork) - else block? > > if (fork >= 0 && fork <= MAX_FORKNUM) > sprintf(forkname, "_%s", forkNames[fork]); > else > pg_fatal("found invalid fork number: %u", fork); This was to suppress any suffix for main forks, but yes, could simplify and include the `_` in the suffix name. Will include such a change. > 3. > +charpage[BLCKSZ] = {0}; > I think when writing to a file, we need PGAlignedBlock rather than a > simple char array of bytes, see the description around PGAlignedBlock > for why it is so. Easy enough change, and makes sense. > 4. > +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ) > Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can > avoid fileno() system calls, no? Do you need the file position to > remain the same after writing, hence pg_pwrite()? I don't recall the original motivation, TBH. > 5. > +if (!RestoreBlockImage(record, block_id, page)) > +continue; > + > +/* we have our extracted FPI, let's save it now */ > After extracting the page from the WAL record, do we need to perform a > checksum on it? That is there in fixup mode (or should be). Are you thinking this should also be set if not in fixup mode? That defeats the purpose of the raw page extract, which is to see *exactly* what the WAL stream has. > 6. > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0) > Use pg_dir_create_mode instead of hard-coded 0007? Sure. > 7. > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0) > +fsync(fileno(OPF)); > +fclose(OPF); > Since you're creating the directory in case it's not available, you > need to fsync the directory too? Sure. > 8. 
> +case 'W': > +case 'X': > +config.fixup_fpw = (option == 'X'); > +config.save_fpw_path = pg_strdup(optarg); > +break; > Just set config.fixup_fpw = false before the switch block starts, > like the other variables, and then perhaps doing like below is more > readable: > case 'W': > config.save_fpw_path = pg_strdup(optarg); > case 'X': >config.fixup_fpw = true; >config.save_fpw_path = pg_strdup(optarg); Like separate opt processing with their own `break` statement? Probably a bit more readable/conventional. > 9. > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0) > Should we use pg_mkdir_p() instead of mkdir()? Sure. Thanks, David
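To make the fixup idea described above concrete, here is a minimal sketch of stamping an extracted full-page image so it looks replayed. It assumes the page buffer has already been filled by RestoreBlockImage(); the function name and placement are illustrative, not the v6/v7 patch verbatim.

    #include "postgres.h"

    #include "access/xlogreader.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    static void
    fixup_extracted_page(XLogReaderState *record, BlockNumber blkno, Page page)
    {
        /* Redo would leave the page LSN pointing at the end of this record. */
        PageSetLSN(page, record->EndRecPtr);

        /* Recompute the checksum in place, as a backend does before a write. */
        ((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
    }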
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
> > 6. > > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0) > > Use pg_dir_create_mode instead of hard-coded 0007? > > I think I thought of that when I first looked at the patch ... but, I'm > not sure, since it says: > > src/include/common/file_perm.h-/* Modes for creating directories and files IN > THE DATA DIRECTORY */ > src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode; Looks like it's pretty evenly split in src/bin: $ git grep -o -E -w '(pg_mkdir_p|mkdir)' '**.c' | sort | uniq -c 3 initdb/initdb.c:mkdir 3 initdb/initdb.c:pg_mkdir_p 1 pg_basebackup/bbstreamer_file.c:mkdir 2 pg_basebackup/pg_basebackup.c:pg_mkdir_p 1 pg_dump/pg_backup_directory.c:mkdir 1 pg_rewind/file_ops.c:mkdir 4 pg_upgrade/pg_upgrade.c:mkdir So if that is the preferred approach I'll go ahead and use it. > I was wondering if there's any reason to do "CREATE DATABASE". The vast > majority of TAP tests don't. > > $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head >1435 safe_psql('postgres', > 335 safe_psql( > 23 safe_psql($connect_db, If there was a reason, I don't recall offhand; I will test removing it and if things still work will consider it good enough. David
Re: allow segment size to be set to < 1GiB
Hi, On 2022-11-09 14:44:42 -0500, Tom Lane wrote: > Andres Freund writes: > > A second question: Both autoconf and meson print the segment size as GB > > right > > now. Obviously that'll print out a size of 0 for a segsize < 1GB. > > > The easiest way to would be to just display the number of blocks, but that's > > not particularly nice. > > Well, it would be fine if you'd written --with-segsize-blocks, wouldn't > it? Can we make the printout format depend on which switch was used? Not sure why I didn't think of that... Updated patch attached. I made one autoconf and one meson CI task use a small block size, but just to ensure it work on both. I'd probably leave it set on one, so we keep the coverage for cfbot? Greetings, Andres Freund >From 7a76be232dd0587e9c84af4e4481d5a98f94bd22 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Wed, 9 Nov 2022 12:01:54 -0800 Subject: [PATCH v2] Add option to specify segment size in blocks The tests don't have much coverage of segment related code, as we don't create large enough tables. To make it easier to test these paths, add a new option specifying the segment size in blocks. Set the new option to 6 blocks in one of the CI tasks. Smaller numbers currently fail one of the tests, for understandable reasons. While at it, fix some segment size related issues in the meson build. Author: Andres Freund Discussion: https://postgr.es/m/20221107171355.c23fzwanfzq2p...@awork3.anarazel.de --- meson.build| 24 ++--- meson_options.txt | 3 ++ configure.ac | 36 ++- doc/src/sgml/installation.sgml | 14 .cirrus.yml| 2 ++ configure | 63 -- 6 files changed, 118 insertions(+), 24 deletions(-) diff --git a/meson.build b/meson.build index ce2f223a409..b0fcf82a9c2 100644 --- a/meson.build +++ b/meson.build @@ -418,7 +418,19 @@ meson_bin = find_program(meson_binpath, native: true) cdata.set('USE_ASSERT_CHECKING', get_option('cassert') ? 1 : false) -cdata.set('BLCKSZ', get_option('blocksize').to_int() * 1024, description: +blocksize = get_option('blocksize').to_int() * 1024 + +if get_option('segsize_blocks') != 0 + if get_option('segsize') != 1 +warning('both segsize and segsize_blocks specified, segsize_blocks wins') + endif + + segsize = get_option('segsize_blocks') +else + segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize +endif + +cdata.set('BLCKSZ', blocksize, description: '''Size of a disk block --- this also limits the size of a tuple. You can set it bigger if you need bigger tuples (although TOAST should reduce the need to have large tuples, since fields can be spread across multiple tuples). @@ -428,7 +440,7 @@ cdata.set('BLCKSZ', get_option('blocksize').to_int() * 1024, description: Changing BLCKSZ requires an initdb.''') cdata.set('XLOG_BLCKSZ', get_option('wal_blocksize').to_int() * 1024) -cdata.set('RELSEG_SIZE', get_option('segsize') * 131072) +cdata.set('RELSEG_SIZE', segsize) cdata.set('DEF_PGPORT', get_option('pgport')) cdata.set_quoted('DEF_PGPORT_STR', get_option('pgport').to_string()) cdata.set_quoted('PG_KRB_SRVNAM', get_option('krb_srvnam')) @@ -3053,9 +3065,11 @@ if meson.version().version_compare('>=0.57') summary( { - 'data block size': cdata.get('BLCKSZ'), - 'WAL block size': cdata.get('XLOG_BLCKSZ') / 1024, - 'segment size': cdata.get('RELSEG_SIZE') / 131072, + 'data block size': '@0@ kB'.format(cdata.get('BLCKSZ') / 1024), + 'WAL block size': '@0@ kB'.format(cdata.get('XLOG_BLCKSZ') / 1024), + 'segment size': get_option('segsize_blocks') != 0 ? 
+'@0@ blocks'.format(cdata.get('RELSEG_SIZE')) : +'@0@ GB'.format(get_option('segsize')), }, section: 'Data layout', ) diff --git a/meson_options.txt b/meson_options.txt index c7ea57994dc..0d2a34fd154 100644 --- a/meson_options.txt +++ b/meson_options.txt @@ -13,6 +13,9 @@ option('wal_blocksize', type : 'combo', option('segsize', type : 'integer', value : 1, description : '''Segment size, in gigabytes''') +option('segsize_blocks', type : 'integer', value: 0, + description : '''Segment size, in blocks''') + # Miscellaneous options diff --git a/configure.ac b/configure.ac index f76b7ee31fc..94542e862cf 100644 --- a/configure.ac +++ b/configure.ac @@ -285,15 +285,31 @@ AC_DEFINE_UNQUOTED([BLCKSZ], ${BLCKSZ}, [ # # Relation segment size # -AC_MSG_CHECKING([for segment size]) PGAC_ARG_REQ(with, segsize, [SEGSIZE], [set table segment size in GB [1]], [segsize=$withval], [segsize=1]) -# this expression is set up to avoid unnecessary integer overflow -# blocksize is already guaranteed to be a factor of 1024 -RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024` -test $? -eq 0 || exit 1 -AC_MSG_RESULT([${segsize}GB]) +PGAC_ARG_REQ(with, segsize-blocks, [SEGSIZE_BLOCKS], [set ta
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 2:08 PM David Christensen wrote: > Justin sez: > > I was wondering if there's any reason to do "CREATE DATABASE". The vast > > majority of TAP tests don't. > > > > $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head > >1435 safe_psql('postgres', > > 335 safe_psql( > > 23 safe_psql($connect_db, > > If there was a reason, I don't recall offhand; I will test removing it > and if things still work will consider it good enough. Things blew up when I did that; rather than hunt it down, I just left it in. :-) Enclosed is v7, with the changes suggested thus far. v7-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: real/float example for testlibpq3
On Thu, Nov 03, 2022 at 09:55:22AM -0400, Tom Lane wrote: > Peter Eisentraut writes: > > On 01.11.22 09:15, Tom Lane wrote: > >> Agreed that the libpq manual is not the place for this, but I feel > >> like it will also be clutter in "Data Types". Perhaps we should > >> invent a new appendix or the like? Somewhere near the wire protocol > >> docs seems sensible. > > > Would that clutter the protocol docs? ;-) > > I said "near", not "in". At the time I was thinking "new appendix", > but I now recall that the wire protocol docs are not an appendix > but a chapter in the Internals division. So that doesn't seem like > quite the right place anyway. > > Perhaps a new chapter under "IV. Client Interfaces" is the right > place? > > If we wanted to get aggressive, we could move most of the nitpicky details > about datatype text formatting (e.g., the array quoting rules) there too. > I'm not set on that, but it'd make datatype.sgml smaller which could > hardly be a bad thing. > > > I suppose figuring out exactly where to put it and how to mark it up, > > etc., in a repeatable fashion is part of the job here. > > Yup. I'll take a stab at adding a new chapter and share how that looks. Regards, Mark -- Mark Wong EDB https://enterprisedb.com
Re: Expand palloc/pg_malloc API
Peter Eisentraut writes: > On 11.10.22 18:04, Tom Lane wrote: >> Hmm ... the individual allocators have that info, but mcxt.c doesn't >> have access to it. I guess we could invent an additional "method" >> to return the requested size of a chunk, which is only available in >> MEMORY_CONTEXT_CHECKING builds, or maybe in !MEMORY_CONTEXT_CHECKING >> it returns the allocated size instead. > I'm not sure whether that amount of additional work would be useful > relative to the size of this patch. Is the patch as it stands now > making the code less robust than what the code is doing now? No. It's slightly annoying that the call sites still have to track the old size of the allocation, but I guess that they have to have that information in order to know that they need to repalloc in the first place. I agree that this patch does make things easier to read and a bit less error-prone. Also, I realized that what I proposed above doesn't really work anyway for this purpose. Consider ptr = palloc(size); ... fill all "size" bytes ... ptr = repalloc0(ptr, size, newsize); where the initial size request isn't a power of 2. If production builds rely on the initial allocated size not requested size to decide where to start zeroing, this would work (no uninitialized holes) in a debug build, but leave some uninitialized bytes in a production build, which is absolutely horrible. So I guess we have to rely on the callers to track their requests. > In the meantime, here is an updated patch with the argument order > swapped and an additional assertion, as previously discussed. I think it'd be worth expending an actual runtime test in repalloc0, that is if (unlikely(oldsize > size)) elog(ERROR, "invalid repalloc0 call: oldsize %zu, new size %zu", oldsize, size); This seems cheap compared to the cost of the repalloc+memset, and the consequences of not detecting the error seem pretty catastrophic --- the memset would try to zero almost your whole address space. No objections beyond that. I've marked this RFC. regards, tom lane
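For reference, a hedged usage sketch of the API as discussed above (argument order: pointer, old size, new size; the helper name is made up for illustration):

    #include "postgres.h"

    /* Grow a Datum array, zero-filling the newly added slots. */
    static Datum *
    grow_datum_array(Datum *items, int oldcount, int newcount)
    {
        return (Datum *) repalloc0(items, oldcount * sizeof(Datum),
                                   newcount * sizeof(Datum));
    }

The caller still has to remember oldcount, which is the point made above about callers tracking their own request sizes.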
Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum
On Wed, Nov 9, 2022 at 6:29 AM Imseih (AWS), Sami wrote: > When a user is running a parallel vacuum and the vacuum is long running > > due to many large indexes, it would make sense to check for failsafe earlier. It makes sense to prefer consistency here, I suppose. The reason why we're not consistent is because it was easier not to be, which isn't exactly the best reason (nor the worst). I don't think that it's obvious that we need to call the failsafe at any particular frequency. There is probably an argument to be made for the idea that we're not checking frequently enough (even in the common serial VACUUM case), just as there is an argument to be made for the opposite idea. It's not like there is some simple linear relationship (or any kind of relationship) between the amount of physical work performed by one VACUUM operation, and the rate of XID consumption by the system as a whole. And so the details of how we do it have plenty to do with what we can afford to do. My gut instinct is that the most important thing is to at least call lazy_check_wraparound_failsafe() once per index scan. Multiple index scans are disproportionately involved in VACUUMs that take far longer than expected, which are presumably the kind of VACUUMs that tend to be running when the failsafe actually triggers. We don't really expect the failsafe to trigger, so presumably when it actually does things haven't been going well for some time. (Index corruption that prevents forward progress on one particular index is another example.) That said, one thing that does bother me in this area occurs to me: we call lazy_check_wraparound_failsafe() from lazy_scan_heap() (before we get to the index scans that you're talking about) at an interval that is based on how many heap pages we've either processed (and recorded as a scanned_pages page) *or* have skipped over using the visibility map. In other words we use blkno here, when we should really be using scanned_pages instead: if (blkno - next_failsafe_block >= FAILSAFE_EVERY_PAGES) { lazy_check_wraparound_failsafe(vacrel); next_failsafe_block = blkno; } This code effectively treats pages skipped using the visibility map as equivalent to pages physically scanned (scanned_pages), even though skipping requires essentially no work at all. That just isn't logical, and feels like something worth fixing. The fundamental unit of work in lazy_scan_heap() is a *scanned* heap page. -- Peter Geoghegan
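A minimal sketch of the adjustment hinted at above (not committed code; next_failsafe_scanned_pages is a hypothetical local variable analogous to next_failsafe_block), keying the interval off scanned_pages instead of blkno:

    /*
     * Trigger the periodic check based on pages actually scanned, so that
     * pages skipped via the visibility map no longer count towards it.
     */
    if (vacrel->scanned_pages - next_failsafe_scanned_pages >= FAILSAFE_EVERY_PAGES)
    {
        lazy_check_wraparound_failsafe(vacrel);
        next_failsafe_scanned_pages = vacrel->scanned_pages;
    }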
Re: HOT chain validation in verify_heapam()
Hi, To start with: I think this is an extremely helpful and important feature. Both for checking production systems and for finding problems during development. > From 08fe01f5073c0a850541265494bb4a875bec7d3f Mon Sep 17 00:00:00 2001 > From: Himanshu Upadhyaya > Date: Fri, 30 Sep 2022 17:44:56 +0530 > Subject: [PATCH v6] Implement HOT chain validation in verify_heapam() > > Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev > > Discussion: > https://postgr.es/m/CAPF61jBBR2-iE-EmN_9v0hcQEfyz_17e5Lbb0%2Bu2%3D9ukA9sWmQ%40mail.gmail.com > --- > contrib/amcheck/verify_heapam.c | 207 ++ > src/bin/pg_amcheck/t/004_verify_heapam.pl | 192 ++-- > 2 files changed, 388 insertions(+), 11 deletions(-) > > diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c > index c875f3e5a2..007f7b2f37 100644 > --- a/contrib/amcheck/verify_heapam.c > +++ b/contrib/amcheck/verify_heapam.c > @@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS) > for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++) > { > OffsetNumber maxoff; > + OffsetNumber predecessor[MaxOffsetNumber] = {0}; > + OffsetNumber successor[MaxOffsetNumber] = {0}; > + boollp_valid[MaxOffsetNumber] = {false}; > > CHECK_FOR_INTERRUPTS(); > > @@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS) > for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff; >ctx.offnum = OffsetNumberNext(ctx.offnum)) > { > + OffsetNumber nextoffnum; > + > ctx.itemid = PageGetItemId(ctx.page, ctx.offnum); > > /* Skip over unused/dead line pointers */ > @@ -469,6 +474,13 @@ verify_heapam(PG_FUNCTION_ARGS) > report_corruption(&ctx, > > psprintf("line pointer redirection to unused item at offset %u", > >(unsigned) rdoffnum)); > + > + /* > + * make entry in successor array, redirected > tuple will be > + * validated at the time when we loop over > successor array > + */ > + successor[ctx.offnum] = rdoffnum; > + lp_valid[ctx.offnum] = true; > continue; > } > > @@ -504,9 +516,197 @@ verify_heapam(PG_FUNCTION_ARGS) > /* It should be safe to examine the tuple's header, at > least */ > ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, > ctx.itemid); > ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr); > + lp_valid[ctx.offnum] = true; > > /* Ok, ready to check this next tuple */ > check_tuple(&ctx); > + > + /* > + * Add the data to the successor array if next updated > tuple is in > + * the same page. It will be used later to generate the > + * predecessor array. > + * > + * We need to access the tuple's header to populate the > + * predecessor array. However the tuple is not > necessarily sanity > + * checked yet so delaying construction of predecessor > array until > + * all tuples are sanity checked. > + */ > + nextoffnum = > ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid); > + if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == > ctx.blkno && > + nextoffnum != ctx.offnum) > + { > + successor[ctx.offnum] = nextoffnum; > + } I don't really understand this logic - why can't we populate the predecessor array, if we can construct a successor entry? > + } > + > + /* > + * Loop over offset and populate predecessor array from all > entries > + * that are present in successor array. > + */ > + ctx.attnum = -1; > + for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff; > + ctx.offnum = OffsetNumberNext(ctx.offnum)) > + { > + ItemId curr_lp; > + ItemId next_lp; > + HeapTupleHeader curr_htup; > + HeapTupleHeader next_htup; > + TransactionId curr_xmax; > + TransactionId next_xmin; > + > + OffsetNumber nextoffnum = succes
Re: [DOCS] Stats views and functions not in order?
On Mon, Nov 7, 2022 at 5:19 PM Peter Smith wrote: > On Mon, Nov 7, 2022 at 5:50 AM Tom Lane wrote: > > > > Peter Smith writes: > > > Sorry, I forgot the attachments in the previous post. PSA. > > > > I spent a bit of time looking at this. I agree that a lot of the > > current ordering choices here look like they were made with the > > advice of a dartboard, and there's a number of things that are > > pretty blatantly just sloppy merging (like the out-of-order > > wait-event items). However, I'm not a big fan of "alphabetical > > order at all costs", because that frequently leads to ordering > > decisions that are not a lot better than random from a semantic > > standpoint. For example, I resist the idea that it's sensible > > to put pg_stat_all_indexes before pg_stat_all_tables. > > I'm unconvinced that putting pg_stat_sys_tables and > > pg_stat_user_tables far away from pg_stat_all_tables is great, > > either. > > > > Thanks for taking the time to look at my patch. The "at all costs" > approach was not the intention - I was just trying only to apply some > sane ordering where I did not recognize a reason for the current > order. > > > So ... how do we proceed? > > > > To proceed with the existing patches I need some guidance on exactly > which of the changes can be considered improvements versus which ones > are maybe just trading one 'random' order for another. > > How about below? > > Table 28.1. Dynamic Statistics Views -- Alphabetical order would be a > small improvement here, right? > The present ordering seems mostly OK, though just like the "Progress" update below the bottom 6 pg_stat_progress_* entries should be alphabetized; but leaving them as a group at the end seems desirable. Move pg_stat_recovery_prefetch either after subscription or after activity - the replication/received/subscription stuff all seems like it should be grouped together. As well as the security related ssl/gssapi. > Table 28.2. Collected Statistics Views -- Leave this one unchanged > (per your comments above). > I would suggest moving the 3 pg_statio_*_tables rows between the pg_stat_*_tables and the pg_stat_xact_*_tables groups. Everything pertaining to cluster, database, tables, indexes, functions. slru and replication slots should likewise shift to the (near) top in the cluster/database grouping. > Table 28.12 Wait Events of type LWLock -- Seems a clear case of bad > merging. Alphabetical order is surely needed here, right? > +1 Agreed. > > Table 28.34 Additional Statistic Functions -- Alphabetical order would > be a small improvement here, right? > No. All "reset" items should be grouped at the end like they are. I don't see an alternative ordering among them that is clearly superior. Same for the first four. > > Table 28.35 Per-Backend Statistics Functions -- Alphabetical order > would be a small improvement here, right? > > This one I would rearrange alphabetically. Or, at least, I have a different opinion of what would make a decent order but it doesn't seem all that clearly better than alphabetical. > > I'd be inclined to alphabetize by SQL command name, but maybe > > leave Base Backup to the end since it's not a SQL command. > > > > Yes, I had previously only looked at the content of section 28.2 > because I didn't want to get carried away by changing too much until > there was some support for doing the first part. > > Now PSA a separate patch for fixing section "28.4. Progress Reporting" > order as suggested. > > This seems like a clear win. David J.
Re: HOT chain validation in verify_heapam()
On Wed, Nov 9, 2022 at 2:08 PM Andres Freund wrote: > To start with: I think this is an extremely helpful and important > feature. Both for checking production systems and for finding problems during > development. +1. It's painful to get this in, in part because we now have to actually decide what the rules really are with total precision, including for all of the tricky edge cases. The exercise of writing this code should "keep us honest" about whether or not we really know what the invariants are, which is more than half the battle. > > + /* > > + * Add a line pointer offset to the predecessor array > > if xmax is > > + * matching with xmin of next tuple (reaching via its > > t_ctid). > > + * Prior to PostgreSQL 9.4, we actually changed the > > xmin to > > + * FrozenTransactionId > > I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood > of getting that right seems low and I don't see us gaining much by even > trying. This is the kind of comment that I'd usually agree with, but I disagree in this instance because of special considerations that apply to amcheck (or should IMV apply, at least). We're living in a world where we have to assume that the pre-9.4 format can occur in the field. If we can't get it right in amcheck, what chance do we have with other new code that tickles the same areas? I think that we need to support obsolescent heapam representations (both FrozenTransactionId and xvac) here on general principle. Besides, why not accept some small chance of getting this wrong? The worst that can happen is that we'll have a relatively benign bug. If we get it wrong then it's a short term problem, but also an opportunity to be less wrong in the future -- including in places where the consequences of being wrong are much more serious. > I doubt it is correct to enter this path with next_xmin == > FrozenTransactionId. This is following a ctid chain that we normally wouldn't > follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition. We should never see FrozenTransactionId in an xmax field (nor should it be returned by HeapTupleHeaderGetUpdateXid() under any circumstances). We can "freeze xmax" during VACUUM, but that actually means setting xmax to InvalidTransactionId (in rare cases it might mean replacing a Multi with a new Multi). The terminology in this area is a bit tricky. Anyway, it follows that we cannot expect "next_xmin == FrozenTransactionId", because that would mean that we'd called HeapTupleHeaderGetUpdateXid() which returned FrozenTransactionId -- an impossibility. (Maybe we should be checking that it really is an impossibility by checking the HeapTupleHeaderGetUpdateXid() return value, but that should be enough.) > I don't immediately see what prevents the frozen tuple being from an entirely > different HOT chain than the two tuples pointing to it. In my view it simply isn't possible for a valid HOT chain to be in this state in the first place. So by definition it wouldn't be a HOT chain. That would be a form of corruption, which is something that would probably be detected by noticing orphaned heap-only tuples (heap-only tuples not reachable from some root item on the same page, or some other intermediary heap-only tuple reachable from a root item). > Can't we have a an update chain that is e.g. > xmin 10, xmax 5 -> xmin 5, xmax invalid > > and a vacuum cutoff of 7? That'd preent the first tuple from being removed, > but would allow 5 to be frozen. I don't see how that can be possible. 
That is contradictory, and cannot possibly work, since it supposes a situation where every possible MVCC snapshot sees the update that generated the second/successor tuple as committed, while at the same time also somehow needing the original tuple to stay in place. Surely both things can never be true at the same time. I believe you're right that an update chain that looks like this one is possible. However, I don't think it's possible for OldestXmin/FreezeLimit to take on a value like that (i.e. a value that "skewers" the update chain like this, the value 7 from your example). We ought to be able to rely on an OldestXmin value that can never let such a situation emerge. Right? -- Peter Geoghegan
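For concreteness, a small sketch of the chain-continuation rule referenced above (the same test that heap_hot_search_buffer() relies on; the helper itself is illustrative, not code from the patch):

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * The next tuple continues the update chain only if its xmin matches
     * the updater XID of the previous tuple; a frozen or unrelated xmin
     * means it is not a member of the same chain.
     */
    static bool
    continues_update_chain(HeapTupleHeader prev, HeapTupleHeader next)
    {
        return TransactionIdEquals(HeapTupleHeaderGetUpdateXid(prev),
                                   HeapTupleHeaderGetXmin(next));
    }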
Re: HOT chain validation in verify_heapam()
Hi, On 2022-11-09 15:03:39 -0800, Peter Geoghegan wrote: > > > + /* > > > + * Add a line pointer offset to the predecessor > > > array if xmax is > > > + * matching with xmin of next tuple (reaching via > > > its t_ctid). > > > + * Prior to PostgreSQL 9.4, we actually changed the > > > xmin to > > > + * FrozenTransactionId > > > > I'm doubtful it's a good idea to try to validate the 9.4 case. The > > likelihood > > of getting that right seems low and I don't see us gaining much by even > > trying. > > This is the kind of comment that I'd usually agree with, but I > disagree in this instance because of special considerations that apply > to amcheck (or should IMV apply, at least). We're living in a world > where we have to assume that the pre-9.4 format can occur in the > field. If we can't get it right in amcheck, what chance do we have > with other new code that tickles the same areas? I think that we need > to support obsolescent heapam representations (both > FrozenTransactionId and xvac) here on general principle. To me this is extending the problem into more areas rather than reducing it. I'd have *zero* confidence in any warnings that amcheck issued that involved <9.4 special cases. We've previously discussed adding pg_class column tracking the PG version that last scanned the whole relation. We really should get to that one of these decades :(. > Besides, why not accept some small chance of getting this wrong? The > worst that can happen is that we'll have a relatively benign bug. If > we get it wrong then it's a short term problem, but also an > opportunity to be less wrong in the future -- including in places > where the consequences of being wrong are much more serious. I think it doesn't just affect the < 9.4 path, but also makes us implement things differently for >= 9.4. And we loose some accuracy due to that. > > I doubt it is correct to enter this path with next_xmin == > > FrozenTransactionId. This is following a ctid chain that we normally > > wouldn't > > follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin > > condition. > > We should never see FrozenTransactionId in an xmax field (nor should > it be returned by HeapTupleHeaderGetUpdateXid() under any > circumstances). The field we check for FrozenTransactionId in the code I was quoting is the xmin of the follower tuple. We follow the chain if either cur->xmax == next->xmin or if next->xmin == FrozenTransactionId What I'm doubting is the FrozenTransactionId path. > Anyway, it follows that we cannot expect "next_xmin == > FrozenTransactionId", because that would mean that we'd called > HeapTupleHeaderGetUpdateXid() which returned FrozenTransactionId -- an > impossibility. (Maybe we should be checking that it really is an > impossibility by checking the HeapTupleHeaderGetUpdateXid() return > value, but that should be enough.) next_xmin is acquired via HeapTupleHeaderGetXmin(next_htup), not HeapTupleHeaderGetUpdateXid(cur_typ). > > I don't immediately see what prevents the frozen tuple being from an > > entirely > > different HOT chain than the two tuples pointing to it. > > In my view it simply isn't possible for a valid HOT chain to be in > this state in the first place. So by definition it wouldn't be a HOT > chain. We haven't done any visibility checking at this point and my whole point is that there's no guarantee that the pointed-to tuple actually belongs to the same hot chain, given that we follow as soon as "xmin == FrozenXid". So the pointing tuple might be an orphaned tuple. 
> That would be a form of corruption, which is something that > would probably be detected by noticing orphaned heap-only tuples > (heap-only tuples not reachable from some root item on the same page, > or some other intermediary heap-only tuple reachable from a root > item). You're saying that there's no way that there's a tuple pointing to another tuple on the same page, with the pointed-to tuple belonging to a different HOT chain? I'm fairly certain that that at least used to be possible, and likely is still possible. Isn't that pretty much what you'd expect to happen if there's concurrent aborts leading to abandoned hot chains? > > Can't we have a an update chain that is e.g. > > xmin 10, xmax 5 -> xmin 5, xmax invalid > > > > and a vacuum cutoff of 7? That'd preent the first tuple from being removed, > > but would allow 5 to be frozen. > > I don't see how that can be possible. That is contradictory, and > cannot possibly work, since it supposes a situation where every > possible MVCC snapshot sees the update that generated the > second/successor tuple as committed, while at the same time also > somehow needing the original tuple to stay in place. Surely both > things can never be true at the same time. The xmin horizon is very coarse grained. Just because it is 7 doesn't mean that xid 10 is still running.
Re: Out-of-date clang/llvm version lists in PGAC_LLVM_SUPPORT
Hi, On 2022-11-08 13:08:45 -0500, Tom Lane wrote: > I happened to notice that these lists of supported versions haven't > been updated in a good long time: > > PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-7 llvm-config-6.0 > llvm-config-5.0 llvm-config-4.0 llvm-config-3.9) > > PGAC_PATH_PROGS(CLANG, clang clang-7 clang-6.0 clang-5.0 clang-4.0 > clang-3.9) > > Given the lack of complaints, it seems likely that nobody is relying > on these. Maybe we should just nuke them? Yea, that's probably a good idea. Pretty clear that that should happen only in HEAD? > I may be missing it, but it doesn't look like meson.build has any > equivalent lists. So that might be an argument for getting rid > of the lists here? The list is just in meson, it has a builtin helper for depending on llvm. So that's not quite an argument unfortunately. Greetings, Andres Freund
Document parameter count limit
Inspired by a recent posting on Slack... diff --git a/doc/src/sgml/limits.sgml b/doc/src/sgml/limits.sgml index d5b2b627dd..5d68eef093 100644 --- a/doc/src/sgml/limits.sgml +++ b/doc/src/sgml/limits.sgml @@ -97,6 +97,13 @@ 32 can be increased by recompiling PostgreSQL + + +parameters per query +65,535 +if you are reading this preparatorily, please redesign your query to use temporary tables or arrays + + David J.
Re: heavily contended lwlocks with long wait queues scale badly
On 2022-11-09 09:38:08 -0800, Andres Freund wrote: > I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK > to push it to HEAD if I get it done in the next few hours. Bigger issues, > which I do not expect, should show up before tomorrow afternoon. Smaller > things could wait till Sunday if necessary. I didn't get to it in time, so I'll leave it for when I'm back.
Re: [PATCH] Support % wildcard in extension upgrade filenames
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, passed Spec compliant: tested, passed Documentation:tested, passed PURPOSE: This feature is intended to allow projects with many micro versions that do the same thing, be able to ship much fewer files by specifying less granular stopping. One of the projects that will benefit from this is the PostGIS project. Currently we ship over 300 extension scripts per version which are currently just symlinks to the same file. Part of the reason for that is our changes are dependent on both PostgreSQL version and PostGIS version, so a simple upgrade that only considers say PostGIS 3.2.0--3.3.1 is not sufficient. Also much of the time, our function definitions don't change in micros, but yet we need to ship them to allow users to properly upgrade. This feature will allow us to get rid of all these symlinks or 0-byte files. I've tested this feature against the PostgreSQL master branch on mingw64 gcc 8.1. BASIC FEATURES 1) As stated, this feature only works if in the .control file, has a line: wildcard_upgrades = true 2) It does nothing different if the line is missing or set to false. 3) If there is such a line and there is no other path that takes an extension from it's current version to the requested version, then it will use the wildcard script file. TESTING AUTOMATED TESTS I have confirmed tests are in place. However the tests do not run when I do make check (from the root source folder) or when I do make installcheck-world To run these tests, I had to cd src/test/modules/test_extensions make check and see the output showing: == running regression test queries== test test_extensions ... ok 186 ms test test_extdepend ... ok 97 ms I confirmed the tests are in test_extensions, which has always existed. So I assume why it's not being run in installcheck-world is something off with my configuration or it's intentionally not run. MANUAL TESTS I also ran some adhoc tests of my own to confirm the behavior. and checking with SET client_min_messages='DEBUG1'; To confirm that the wildcard script I have only gets called when there is no other path that will take the user to the new version. I created my own extension wildtest 1) I confirmed It only understands % as a complete wildcard for a version number So this is legal wildtest--%--2.0.sql This does nothing wildtest--2%--2.0.sql 2) I confirmed that if you have a path such as wildtest--2.0--2.2.sql wildtest--%--2.2.sql then the exact match trumps the wildcard. In the above case if I am on 2.0 and going to 2.2, the wildtest--2.0--2.2.sql script is used instead of the wildtest--% one. 3) It is not possible to downgrade with the wildcard. For example I had paths wildtest--%--2.1.sql and I was unable to go from a version 2.2 down to a version 2.1. DOCUMENTATION I built the html docs and confirmed that in the section of the docs where it defines valid properties of extension files it now includes this line: wildcard_upgrades (boolean) This parameter, if set to true (which is not the default), allows ALTER EXTENSION to consider a wildcard character % as matching any version of the extension. Such wildcard match will only be used when no perfect match is found for a version. The new status of this patch is: Ready for Committer
Re: Allow file inclusion in pg_hba and pg_ident files
On Wed, Nov 09, 2022 at 12:09:01PM +0800, Julien Rouhaud wrote: > On Wed, Nov 09, 2022 at 09:51:17AM +0900, Michael Paquier wrote: >> Julien, please note that this is waiting on author for now. What do >> you think about the now-named v18-0001 and the addition of an >> ErrorContextCallback to provide more information about the list of >> included files on error? > > Yes, I'm unfortunately fully aware that it's waiting on me. I've been a bit > busy this week with $work but I will try to go back to it as soon as I can, > hopefully this week! FWIW, I have been playing with the addition of a ErrorContextCallback in tokenize_auth_file(), and this addition leads to a really nice result. With this method, it is possible to know the full chain of events leading to a failure when tokenizing included files, which is not available now in the logs when reloading the server. We could extend it to have more verbose information by passing more arguments to tokenize_auth_file(), still I'd like to think that just knowing the line number and the full path to the file is more than enough once you know the full chain of events. 0001 and 0002 ought to be merged together, but I am keeping these separate to show how simple the addition of the ErrorContextCallback is. -- Michael From 41ae50e3b5b8e1d9013f337e4574702830dca95a Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 7 Nov 2022 13:35:44 +0900 Subject: [PATCH v19 1/3] Invent open_auth_file() in hba.c, to refactor auth file opening This adds a check on the recursion depth when including auth files, something that has never been done when processing '@' files for database and user name lists in pg_hba.conf. --- src/include/libpq/hba.h | 4 +- src/backend/libpq/hba.c | 100 ++- src/backend/utils/adt/hbafuncs.c | 18 ++ 3 files changed, 79 insertions(+), 43 deletions(-) diff --git a/src/include/libpq/hba.h b/src/include/libpq/hba.h index 7ad227d34a..a84a5f0961 100644 --- a/src/include/libpq/hba.h +++ b/src/include/libpq/hba.h @@ -177,7 +177,9 @@ extern int check_usermap(const char *usermap_name, extern HbaLine *parse_hba_line(TokenizedAuthLine *tok_line, int elevel); extern IdentLine *parse_ident_line(TokenizedAuthLine *tok_line, int elevel); extern bool pg_isblank(const char c); +extern FILE *open_auth_file(const char *filename, int elevel, int depth, + char **err_msg); extern MemoryContext tokenize_auth_file(const char *filename, FILE *file, - List **tok_lines, int elevel); + List **tok_lines, int elevel, int depth); #endif /* HBA_H */ diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index a9f87ab5bf..d8c0b585e5 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -117,7 +117,8 @@ static const char *const UserAuthName[] = static List *tokenize_inc_file(List *tokens, const char *outer_filename, - const char *inc_filename, int elevel, char **err_msg); + const char *inc_filename, int elevel, + int depth, char **err_msg); static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int elevel, char **err_msg); static int regcomp_auth_token(AuthToken *token, char *filename, int line_num, @@ -414,7 +415,7 @@ regexec_auth_token(const char *match, AuthToken *token, size_t nmatch, */ static List * next_field_expand(const char *filename, char **lineptr, - int elevel, char **err_msg) + int elevel, int depth, char **err_msg) { char buf[MAX_TOKEN]; bool trailing_comma; @@ -431,7 +432,7 @@ next_field_expand(const char *filename, char **lineptr, /* Is this referencing a file? 
*/ if (!initial_quote && buf[0] == '@' && buf[1] != '\0') tokens = tokenize_inc_file(tokens, filename, buf + 1, - elevel, err_msg); + elevel, depth + 1, err_msg); else tokens = lappend(tokens, make_auth_token(buf, initial_quote)); } while (trailing_comma && (*err_msg == NULL)); @@ -459,6 +460,7 @@ tokenize_inc_file(List *tokens, const char *outer_filename, const char *inc_filename, int elevel, + int depth, char **err_msg) { char *inc_fullname; @@ -468,24 +470,18 @@ tokenize_inc_file(List *tokens, MemoryContext linecxt; inc_fullname = AbsoluteConfigLocation(inc_filename, outer_filename); + inc_file = open_auth_file(inc_fullname, elevel, depth, err_msg); - inc_file = AllocateFile(inc_fullname, "r"); if (inc_file == NULL) { - int save_errno = errno; - - ereport(elevel, -(errcode_for_file_access(), - errmsg("could not open secondary authentication file \"@%s\" as \"%s\": %m", - inc_filename, inc_fullname))); - *err_msg = psprintf("could not open secondary authentication file \"@%s\" as \"%s\": %s", - inc_filename, inc_fullname, strerror(save_errno)); + /* error already logged */ pfree(inc_fullname); return tokens; } /* There is possible recursion here if the file contains @ */ - line
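For illustration, the general shape of such a callback (the names here are invented; the actual hook and what it reports are in the attached v19 patches), which is what makes the chain of included files show up as CONTEXT lines:

    typedef struct tokenize_error_callback_arg
    {
        const char *filename;   /* file currently being tokenized */
        int         linenum;    /* line number within that file */
    } tokenize_error_callback_arg;

    static void
    tokenize_error_callback(void *arg)
    {
        tokenize_error_callback_arg *cbarg = (tokenize_error_callback_arg *) arg;

        errcontext("line %d of configuration file \"%s\"",
                   cbarg->linenum, cbarg->filename);
    }

    /* ... inside tokenize_auth_file(), around the per-line loop ... */
        ErrorContextCallback errcallback;
        tokenize_error_callback_arg cbarg;

        cbarg.filename = filename;
        cbarg.linenum = 0;
        errcallback.callback = tokenize_error_callback;
        errcallback.arg = &cbarg;
        errcallback.previous = error_context_stack;
        error_context_stack = &errcallback;

        /* ... tokenize each line, keeping cbarg.linenum up to date ... */

        error_context_stack = errcallback.previous;

Because the callbacks stack, an error while tokenizing a nested inclusion reports one CONTEXT line per file, i.e. the full chain of events mentioned above.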
Re: HOT chain validation in verify_heapam()
On Wed, Nov 9, 2022 at 4:15 PM Andres Freund wrote: > To me this is extending the problem into more areas rather than reducing > it. I'd have *zero* confidence in any warnings that amcheck issued that > involved <9.4 special cases. Maybe you would at first. But then we get to learn what mistake we made. And then we get to fix the bug, and get to know better next time around. Next time (or maybe the time after that) you really will have confidence in amcheck, because it'll have been battle tested at that point. For something like this that seems like the best way to go. Either we support an on-disk format that includes legacy representations, or we don't. > I think it doesn't just affect the < 9.4 path, but also makes us implement > things differently for >= 9.4. And we loose some accuracy due to that. I don't follow. How so? > The field we check for FrozenTransactionId in the code I was quoting is the > xmin of the follower tuple. We follow the chain if either > cur->xmax == next->xmin or if next->xmin == FrozenTransactionId > > What I'm doubting is the FrozenTransactionId path. AFAICT we shouldn't be treating it as part of the same HOT chain. Only the first heap-only tuple in a valid HOT chain should have an xmin that is FrozenTransactionId (and only with an LP_REDIRECT root item, I think). Otherwise the "prev_xmax==xmin" HOT chain traversal logic used in places like heap_hot_search_buffer() simply won't work. > > In my view it simply isn't possible for a valid HOT chain to be in > > this state in the first place. So by definition it wouldn't be a HOT > > chain. > > We haven't done any visibility checking at this point and my whole point is > that there's no guarantee that the pointed-to tuple actually belongs to the > same hot chain, given that we follow as soon as "xmin == FrozenXid". So the > pointing tuple might be an orphaned tuple. But an orphaned heap-only tuple shouldn't ever have xmin==FrozenTransactionId to begin with. The only valid source of orphaned heap-only tuples is transaction aborts. Aborted XID xmin fields are never frozen. > > That would be a form of corruption, which is something that > > would probably be detected by noticing orphaned heap-only tuples > > (heap-only tuples not reachable from some root item on the same page, > > or some other intermediary heap-only tuple reachable from a root > > item). > > You're saying that there's no way that there's a tuple pointing to another > tuple on the same page, which the pointed-to tuple belonging to a different > HOT chain? Define "belongs to a different HOT chain". You can get orphaned heap-only tuples, obviously. But only due to transaction abort. Any page with an orphaned heap-only tuple that is not consistent with it being from an earlier abort is a corrupt heap page. > > > Can't we have a an update chain that is e.g. > > > xmin 10, xmax 5 -> xmin 5, xmax invalid > > > > > > and a vacuum cutoff of 7? That'd preent the first tuple from being > > > removed, > > > but would allow 5 to be frozen. > > > > I don't see how that can be possible. That is contradictory, and > > cannot possibly work, since it supposes a situation where every > > possible MVCC snapshot sees the update that generated the > > second/successor tuple as committed, while at the same time also > > somehow needing the original tuple to stay in place. Surely both > > things can never be true at the same time. > > The xmin horizon is very coarse grained. Just because it is 7 doesn't mean > that xid 10 is still running. 
All it means that one backend or slot has an > xmin or xid of 7. Of course that's true. But I wasn't talking about the general case -- I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid" update chain case specifically, with its "skewered" OldestXmin of 7. > s1: acquire xid 5 > s2: acquire xid 7 > s3: acquire xid 10 > > s3: insert > s3: commit > s1: update > s1: commit > > s2: get a new snapshot, xmin 7 (or just hold no snapshot) > > At this point the xmin horizon is 7. The first tuple's xmin can't be > frozen. The second tuple's xmin can be. Basically what I'm saying about OldestXmin is that it ought to "work transitively", from the updater to the inserter that inserted the now-updated tuple. That is, the OldestXmin should either count both XIDs that appear in the update chain, or neither XID. > > I believe you're right that an update chain that looks like this one > > is possible. However, I don't think it's possible for > > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that > > "skewers" the update chain like this, the value 7 from your example). > > We ought to be able to rely on an OldestXmin value that can never let > > such a situation emerge. Right? > > I don't see anything that'd guarantee that currently, nor do immediately see a > possible way to get there. > > What do you think prevents such an OldestXmin? ComputeXidHorizons() computes VACUUM's OldestXmin (actually it computes h->data_oldest_no
Re: HOT chain validation in verify_heapam()
Hi, On 2022-11-09 17:32:46 -0800, Peter Geoghegan wrote: > > The xmin horizon is very coarse grained. Just because it is 7 doesn't mean > > that xid 10 is still running. All it means that one backend or slot has an > > xmin or xid of 7. > > Of course that's true. But I wasn't talking about the general case -- > I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid" > update chain case specifically, with its "skewered" OldestXmin of 7. The sequence below produces such an OldestXmin: > > s1: acquire xid 5 > > s2: acquire xid 7 > > s3: acquire xid 10 > > > > s3: insert > > s3: commit > > s1: update > > s1: commit > > > > s2: get a new snapshot, xmin 7 (or just hold no snapshot) > > > > At this point the xmin horizon is 7. The first tuple's xmin can't be > > frozen. The second tuple's xmin can be. > > Basically what I'm saying about OldestXmin is that it ought to "work > transitively", from the updater to the inserter that inserted the > now-updated tuple. That is, the OldestXmin should either count both > XIDs that appear in the update chain, or neither XID. It doesn't work that way. The above sequence shows one case where it doesn't. > > > I believe you're right that an update chain that looks like this one > > > is possible. However, I don't think it's possible for > > > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that > > > "skewers" the update chain like this, the value 7 from your example). > > > We ought to be able to rely on an OldestXmin value that can never let > > > such a situation emerge. Right? > > > > I don't see anything that'd guarantee that currently, nor do immediately > > see a > > possible way to get there. > > > > What do you think prevents such an OldestXmin? > > ComputeXidHorizons() computes VACUUM's OldestXmin (actually it > computes h->data_oldest_nonremovable values) by scanning the proc > array. And counts PGPROC.xmin from each running xact. So ultimately > the inserter and updater are tied together by that. It's either an > OldestXmin that includes both, or one that includes neither. > Here are some facts that I think we both agree on already: > > 1. It is definitely possible to have an update chain like your "xmin > 10, xmax 5 -> xmin 5, xmax invalid" example. > > 2. It is definitely not possible to "freeze xmax" by setting its value > to FrozenTransactionId or something similar -- there is simply no code > path that can do that, and never has been. (The term "freeze xmax" is > a bit ambiguous, though it usually means set xmax to > InvalidTransactionId.) > > 3. There is no specific reason to believe that there is a live bug here. I don't think there's a live bug here. I think the patch isn't dealing correctly with that issue though. > Putting all 3 together: doesn't it seem quite likely that the way that > we compute OldestXmin is the factor that prevents "skewering" of an > update chain? What else could possibly be preventing corruption here? > (Theoretically it might never have been discovered, but that seems > pretty hard to believe.) I don't see how that follows. The existing code is just ok with that. In fact we have explicit code trying to exploit this: /* * If the DEAD tuple is at the end of the chain, the entire chain is * dead and the root line pointer can be marked dead. Otherwise just * redirect the root to the correct chain member. */ if (i >= nchain) heap_prune_record_dead(prstate, rootoffnum); else heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); Greetings, Andres Freund
Re: HOT chain validation in verify_heapam()
Hi, And thinking about it, it'd be quite bad if the horizon worked that way. You can easily construct a workload where every single xid would "skewer" some chain, never allowing the horizon to be raised. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: HOT chain validation in verify_heapam()
On Wed, Nov 9, 2022 at 5:46 PM Andres Freund wrote: > > Putting all 3 together: doesn't it seem quite likely that the way that > > we compute OldestXmin is the factor that prevents "skewering" of an > > update chain? What else could possibly be preventing corruption here? > > (Theoretically it might never have been discovered, but that seems > > pretty hard to believe.) > > I don't see how that follows. The existing code is just ok with that. My remarks about "3 facts we agree on" were not intended to be a watertight argument. More like: what else could it possibly be that prevents problems in practice, if not *something* to do with how we compute OldestXmin? Leaving aside the specifics of how OldestXmin is computed for a moment: what alternative explanation is even remotely plausible? There just aren't that many moving parts involved here. The idea that we can ever freeze the xmin of a successor tuple/version from an update chain without also pruning away earlier versions of the same chain is wildly implausible. It sounds totally contradictory. > In fact > we have explicit code trying to exploit this: > > /* > * If the DEAD tuple is at the end of the chain, the entire > chain is > * dead and the root line pointer can be marked dead. > Otherwise just > * redirect the root to the correct chain member. > */ > if (i >= nchain) > heap_prune_record_dead(prstate, rootoffnum); > else > heap_prune_record_redirect(prstate, rootoffnum, > chainitems[i]); I don't see why this code is relevant. -- Peter Geoghegan
Re: Refactor to introduce pg_strcoll().
Attached some new patches. I think you were right that the API of pg_strcoll() was strange before, so I changed it to: * pg_strcoll() takes nul-terminated arguments * pg_strncoll() takes arguments and lengths The new pg_strcoll()/pg_strncoll() API in 0001 seems much more reasonable to support in the long term, potentially with other callers. Patch 0004 exports: * pg_collate_icu() is for ICU and takes arguments and lengths * pg_collate_libc() is for libc and takes nul-terminated arguments for a tiny speedup because varstrfastcmp_locale() has both nul-terminated arguments and their lengths. On Fri, 2022-11-04 at 15:06 -0700, Jeff Davis wrote: > But I think the complex hacks are just the transformation into UTF 16 > and calling of wcscoll(). If that's done from within pg_strcoll() > (after my patch), then I think it just works, right? I included a patch (0005) to enable varstrfastcmp_locale on Windows with a server encoding of UTF-8. I don't have a great way of testing this, but it seems like it should work. > I was also considering an optimization to use stack allocation for > short strings when doing the non-UTF8 icu comparison. I am seeing > some > benefit there I also included this optimization in 0003: try to use the stack for reasonable values; allocate on the heap for larger strings. I think it's helping a bit. Patch 0002 helps a lot more: for the case of ICU with a non-UTF8 server encoding, the speedup is something like 30%. The reason is that icu_to_uchar() currently calls ucnv_toUChars() twice: once to determine the precise output buffer size required, and then again once it has the buffer ready. But it's easy to just size the destination buffer conservatively, because the maximum space required is enough to hold twice as many UChars as there are input characters[1], plus the terminating nul. That means we just call ucnv_toUChars() once, to do the actual conversion. My perf test was to use a quick hack to disable abbreviated keys (so that it's doing more comparisons), and then just do a large ORDER BY ... COLLATE on a table with generated text. The text is random lorem ipsum data, with some characters replaced by multibyte characters to exercise some more interesting paths. If someone else has some more sophisticated collation tests I'd be interested to try them. Regards, Jeff Davis [1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucnv_8h.html#ae1049fcb893783c860fe0f9d4da84939 From 29249d17b5580e87365c008b18c83db5528db37e Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Wed, 9 Nov 2022 08:58:08 -0800 Subject: [PATCH v3 1/5] Introduce pg_strcoll() and pg_strncoll(). Hide the special cases of the platform, collation provider, and database encoding in pg_locale.c. Simplify varlena.c.
--- src/backend/utils/adt/pg_locale.c | 263 ++ src/backend/utils/adt/varlena.c | 230 +- src/include/utils/pg_locale.h | 3 + 3 files changed, 268 insertions(+), 228 deletions(-) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index 2b42d9ccd8..94dc64c2d0 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -79,6 +79,12 @@ #include #endif +/* + * This should be large enough that most strings will fit, but small enough + * that we feel comfortable putting it on the stack + */ +#define TEXTBUFLEN 1024 + #define MAX_L10N_DATA 80 @@ -1731,6 +1737,263 @@ get_collation_actual_version(char collprovider, const char *collcollate) return collversion; } +/* + * win32_utf8_wcscoll + * + * Convert UTF8 arguments to wide characters and invoke wcscoll() or + * wcscoll_l(). + */ +#ifdef WIN32 +static int +win32_utf8_wcscoll(const char *arg1, size_t len1, const char *arg2, size_t len2, + pg_locale_t locale) +{ + char a1buf[TEXTBUFLEN]; + char a2buf[TEXTBUFLEN]; + char *a1p, + *a2p; + int a1len; + int a2len; + int r; + int result; + + if (len1 >= TEXTBUFLEN / 2) + { + a1len = len1 * 2 + 2; + a1p = palloc(a1len); + } + else + { + a1len = TEXTBUFLEN; + a1p = a1buf; + } + if (len2 >= TEXTBUFLEN / 2) + { + a2len = len2 * 2 + 2; + a2p = palloc(a2len); + } + else + { + a2len = TEXTBUFLEN; + a2p = a2buf; + } + + /* API does not work for zero-length input */ + if (len1 == 0) + r = 0; + else + { + r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1, +(LPWSTR) a1p, a1len / 2); + if (!r) + ereport(ERROR, + (errmsg("could not convert string to UTF-16: error code %lu", + GetLastError(; + } + ((LPWSTR) a1p)[r] = 0; + + if (len2 == 0) + r = 0; + else + { + r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2, +(LPWSTR) a2p, a2len / 2); + if (!r) + ereport(ERROR, + (errmsg("could not convert string to UTF-16: error code %lu", + GetLastError(; + } + ((LPWSTR) a2p)[r] = 0; + + errno = 0; +#ifdef HAVE_LOCALE_T + if (locale) + result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
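As a minimal illustration of the calling convention described in the message above, a caller might look roughly like this. The exact signatures (in particular the pg_locale_t argument) are inferred from the description and the attached patch rather than from committed code, so treat this as a sketch only.

    /* NUL-terminated variant: the collation code determines the lengths. */
    if (pg_strcoll(str1, str2, mylocale) < 0)
        elog(DEBUG1, "str1 sorts before str2");

    /* Length-taking variant: handy when the caller already has the lengths,
     * as varstrfastcmp_locale() does for text datums that need not be
     * NUL-terminated. */
    cmp = pg_strncoll(VARDATA_ANY(t1), VARSIZE_ANY_EXHDR(t1),
                      VARDATA_ANY(t2), VARSIZE_ANY_EXHDR(t2),
                      mylocale);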
Re: HOT chain validation in verify_heapam()
On Wed, Nov 9, 2022 at 6:10 PM Andres Freund wrote: > And thinking about it, it'd be quite bad if the horizon worked that way. You > can easily construct a workload where every single xid would "skewer" some > chain, never allowing the horizon to be raised. Your whole scenario is one involving an insert of a tuple by XID 10, which is then updated by XID 5 -- a lower XID. Obviously that's possible, but it's relatively rare. I have to imagine that the vast majority of updates affect tuples inserted by an XID before the XID of the updater. My use of the term "skewer" was limited to updates that look like that. So I don't know what you mean about never allowing the horizon to be raised. -- Peter Geoghegan
Ability to reference other extensions by schema in extension scripts
I think I had brought this up a while ago, but I forget what the opinion was on the matter. PostGIS has a number of extensions that rely on it. For the extensions that are packaged with PostGIS, we force them all into the same schema except for the postgis_topology and postgis_tiger_geocoder extensions, which are already installed in dedicated schemas. This makes it impossible for postgis_topology and postgis_tiger_geocoder to schema-qualify their use of postgis. Other extensions like pgRouting, h3-pg, mobilitydb have similar issues. My proposal is this. If you think it's a good enough idea I can work up a patch for this. Extensions currently are allowed to specify a requires in the control file. I propose to use this information to allow replacement of the phrase @extschema_nameofextension@ as a variable, where nameofextension has to be one of the extensions listed in the requires. The extension plumbing will then use this information to look up the schemas that the required extensions are installed in, and replace each variable with the schema of the corresponding required extension. Does anyone see any issue with this idea? Thanks, Regina
Re: Ability to reference other extensions by schema in extension scripts
"Regina Obe" writes: > My proposal is this. If you think it's a good enough idea I can work up a > patch for this. > Extensions currently are allowed to specify a requires in the control file. > I propose to use this information, to allow replacement of phrases > @extschema_nameofextension@ as a variable, where nameofextension has to be > one of the extensions listed in the requires. I have a distinct sense of deja vu here. I think this idea, or something isomorphic to it, was previously discussed with some other syntax details. I'm too lazy to go searching the archives right now, but I suggest that you try to find that discussion and see if the discussed syntax seems better or worse than what you mention. I think it might've been along the line of @extschema:nameofextension@, which seems like it might be superior because colon isn't a valid identifier character so there's less risk of ambiguity. regards, tom lane
Re: Tests for psql \g and \o
On Tue, Nov 01, 2022 at 12:42:47PM +0100, Daniel Verite wrote: > It's a follow up to the discussion at [1]. Since this discussion > already has a slot in the CF [2] with a committed patch, let's start a > new separate thread. +psql_like($node, "SELECT 'one' \\g | cat >$g_file", qr//, "one command \\g"); +my $c1 = slurp_file($g_file); +like($c1, qr/one/); Windows may not have an equivalent for "cat", no? Note that psql's 001_basic.pl has no restriction in place for Windows. Perhaps you could use the same trick as basebackup_to_shell, where GZIP is used to write some arbitrary data. Anyway, doesn't this have quoting issues, especially if the file's path contains whitespace? The file is located under File::Temp::tempdir, but it does not sound like a good idea to rely on that assumption on portability grounds. -- Michael signature.asc Description: PGP signature
RE: Ability to reference other extensions by schema in extension scripts
> "Regina Obe" writes: > > My proposal is this. If you think it's a good enough idea I can work > > up a patch for this. > > Extensions currently are allowed to specify a requires in the control file. > > I propose to use this information, to allow replacement of phrases > > @extschema_nameofextension@ as a variable, where nameofextension has > > to be one of the extensions listed in the requires. > > I have a distinct sense of deja vu here. I think this idea, or something > isomorphic to it, was previously discussed with some other syntax details. > I'm too lazy to go searching the archives right now, but I suggest that you try to > find that discussion and see if the discussed syntax seems better or worse than > what you mention. > > I think it might've been along the line of @extschema:nameofextension@, > which seems like it might be superior because colon isn't a valid identifier > character so there's less risk of ambiguity. > > regards, tom lane I found the old discussion I recalled having and Stephen had suggested using @extschema{'postgis'}@ On this thread -- https://www.postgresql.org/message-id/20160425232251.GR10850@tamriel.snowman .net Is that the one you remember? Thanks, Regina
Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
On Wed, 09 Nov 2022 11:17:29 -0500 Tom Lane wrote: > Yugo NAGATA writes: > > Tom Lane wrote: > >> What do you think of the attached wording? > > > It looks good to me. That describes the expected behaviour exactly. > > Pushed that, then. Thank you. > >> I don't think the pipeline angle is of concern to anyone who might be > >> reading these comments with the aim of understanding what guarantees > >> they have. Perhaps there should be more about that in the user-facing > >> docs, though. > > > I agree with that we don't need to mention pipelining in these comments, > > and that we need more in the documentation. I attached a doc patch to add > > a mention of commands that do internal commit to the pipelining section. > > Also, this adds a reference for the pipelining protocol to the libpq doc. > > Hmm ... I don't really find either of these changes to be improvements. > The fact that, say, multi-table ANALYZE uses multiple transactions > seems to me to be a property of that statement, not of the protocol. OK. Then, if we want to make users aware that commands performing internal commits could unexpectedly close a transaction during pipelining, is the proper place the libpq section? Regards, Yugo Nagata -- Yugo NAGATA
Re: Fix some newly modified tab-complete changes
On Tue, Oct 18, 2022 at 05:17:32PM +1100, Peter Smith wrote: > I re-tested and confirm that the patch does indeed fix the quirk I'd > previously reported. > > But, looking at the patch code, I don't know if it is the best way to > fix the problem or not. Someone with more experience of the > tab-complete module can judge that. It seems to me that the patch as proposed has more problems than that. I have spotted a few issues at a quick glance; there may be more. For example, take this one: + else if (TailMatches("GRANT") || +TailMatches("REVOKE", "GRANT", "OPTION", "FOR")) COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles, "REVOKE GRANT OPTION FOR" completes with a list of role names, which is incorrect. FWIW, I am not much a fan of the approach taken by the patch of duplicating the full list of keywords to append after REVOKE or GRANT, the only difference being "GRANT OPTION FOR". This may be more readable if unified into a single list, with extra items appended for GRANT and REVOKE? Note that REVOKE has an "ADMIN OPTION FOR" clause, for which no completion is offered. -- Michael signature.asc Description: PGP signature
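For what it's worth, the unification being suggested might look roughly like the following; the privilege list is abbreviated and the branches are illustrative rather than a tested change to tab-complete.c.

    /* One shared list of privilege keywords (abbreviated for this sketch). */
    #define Privilege_options "SELECT", "INSERT", "UPDATE", "DELETE", "ALL"

    else if (TailMatches("GRANT"))
        COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles, Privilege_options);
    else if (TailMatches("REVOKE"))
        COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles, Privilege_options,
                                 "GRANT OPTION FOR", "ADMIN OPTION FOR");
    else if (TailMatches("REVOKE", "GRANT", "OPTION", "FOR"))
        /* privilege keywords only, no role names, per the issue quoted above */
        COMPLETE_WITH(Privilege_options);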
Re: [PATCH] Allow usage of archive .backup files as backup_label
On Tue, Oct 18, 2022 at 04:55:46AM +0200, Laurenz Albe wrote: > I tend to agree with you. It is easy to break PostgreSQL by manipulating > or removing "backup_label", and copying a file from the WAL archive and > renaming it to "backup_label" sounds like a footgun of the first order. > There is nothing that prevents you from copying the wrong file. > Such practices should not be encouraged. > > Anybody who knows enough about PostgreSQL to be sure that what they are > doing is correct should be smart enough to know how to edit the copied file. A few weeks later, I still have the same thoughts on the matter, so please note that I have marked this as rejected in the CF app. If somebody wants to offer more arguments on this thread, please feel free, of course. -- Michael signature.asc Description: PGP signature
Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands
On Wed, 09 Nov 2022 11:38:05 -0500 Tom Lane wrote: > Peter Eisentraut writes: > > This has broken the following use: > > > parse: create temporary table t1 (a int) on commit drop > > bind > > execute > > parse: analyze t1 > > bind > > execute > > parse: select * from t1 > > bind > > execute > > sync > > > I think the behavior of IsInTransactionBlock() needs to be further > > refined to support this. > > Hmm. Maybe the right way to think about this is "if we have completed an > EXECUTE, and not yet received a following SYNC, then report that we are in > a transaction block"? But I'm not sure if that breaks any other cases. Or, in that case, regarding it as an implicit transaction when multiple commands are executed in a pipeline, as proposed in [1], could be another solution, although I once withdrew that proposal to avoid breaking backward compatibility. Attached is the same patch as in [1]. [1] https://www.postgresql.org/message-id/20220728105134.d5ce51dd756b3149e9b9c52c%40sraoss.co.jp Regards, Yugo Nagata -- Yugo NAGATA
Re: Unit tests for SLRU
On Wed, Apr 13, 2022 at 03:51:30PM +0400, Pavel Borisov wrote: > Only one thing to note. Maybe it would be good not to copy-paste Assert > after every call of SimpleLruInit, putting it into the wrapper function > instead. So the test can call calling the inner function (without assert) > and all other callers using the wrapper. Not sure about naming though. > Maybe rename current SimpleLruInit -> SimpleLruInitInner and a new wrapper > being under the old name (SimpleLruInit). I have looked at what you have here. This patch redesigns SimpleLruInit() so that the caller is now in charge of checking that the SLRU has been created in the context of the postmaster (i.e., when initializing shared memory). While this should work as long as the shared memory area is correctly sized in _PG_init() and the area is initialized and then attached later, as autoprewarm.c does (the same applies to LWLockRegisterTranche(), for example), I am not really convinced that this is something that a patch aimed at extending test coverage should redesign, especially for a routine as old as this one. If you don't know what you are doing, it could easily lead to problems with external code. Note that I don't object to the addition of a new code path or a routine that would be able to create an SLRU on the fly with fewer restrictions, but I am not convinced that we should change this behavior (well, there is a new argument that would force a recompilation). I am not sure what the use cases would be for an SLRU created outside the _PG_init() phase, but perhaps you have more imagination than I do for such matters ;p FWIW, I'd like to think that the proper way of doing things for this test facility is to initialize an SLRU from _PG_init() when processing shared_preload_libraries, meaning that you'd better put this facility in src/test/modules/ with a custom configuration file that sets shared_preload_libraries and a NO_INSTALLCHECK, without touching SimpleLruInit(). -- Michael signature.asc Description: PGP signature
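For reference, the module layout being described follows the pattern already used by other test modules that need a custom configuration; roughly like the sketch below, where every file and module name is hypothetical.

    # src/test/modules/test_slru/Makefile (hypothetical sketch)
    MODULE_big = test_slru
    OBJS = test_slru.o
    EXTENSION = test_slru
    DATA = test_slru--1.0.sql

    REGRESS = test_slru
    # Load the module at postmaster start so that _PG_init() can size the
    # shared memory and create the SLRU; do not run against an existing server.
    REGRESS_OPTS = --temp-config $(top_srcdir)/src/test/modules/test_slru/test_slru.conf
    NO_INSTALLCHECK = 1

    subdir = src/test/modules/test_slru
    top_builddir = ../../../..
    include $(top_builddir)/src/Makefile.global
    include $(top_srcdir)/contrib/contrib-global.mk

    # test_slru.conf (hypothetical)
    shared_preload_libraries = 'test_slru'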
Re: Direct I/O
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro wrote: > Memory alignment patches: > > Direct I/O generally needs to be done to/from VM page-aligned > addresses, but only "standard" 4KB pages, even when larger VM pages > are in use (if there is an exotic system where that isn't true, it > won't work). We need to deal with buffers on the stack, the heap and > in shmem. For the stack, see patch 0001. For the heap and shared > memory, see patch 0002, but David Rowley is going to propose that part > separately, as MemoryContext API adjustments are a specialised enough > topic to deserve another thread; here I include a copy as a > dependency. The main direct I/O patch is 0003. One thing to note: Currently, a request to aset above 8kB must go into a dedicated block. Not sure if it's a coincidence that that matches the default PG page size, but if allocating pages on the heap is hot enough, maybe we should consider raising that limit. Although then, aligned-to-4kB requests would result in 16kB chunks requested unless a different allocator was used. -- John Naylor EDB: http://www.enterprisedb.com
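To spell out the arithmetic behind that last sentence (my reading of it, not something stated explicitly in the message): an allocator that does not align must over-allocate to guarantee 4kB alignment, and a power-of-two allocator then rounds the request up again.

    #include <stdio.h>

    int
    main(void)
    {
        size_t  want = 8192;        /* one default-sized page */
        size_t  align = 4096;       /* direct I/O alignment requirement */
        size_t  request = want + align - 1; /* worst case for aligning by hand */
        size_t  chunk = 1;

        /* round up to the next power of two, as a power-of-two allocator would */
        while (chunk < request)
            chunk <<= 1;

        printf("request %zu -> chunk %zu\n", request, chunk);   /* 12287 -> 16384 */
        return 0;
    }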
Re: New docs chapter on Transaction Management and related changes
On Wed, 2022-11-09 at 09:16 -0500, Robert Treat wrote: > On Mon, Nov 7, 2022 at 5:04 PM Laurenz Albe wrote: > > Some comments: > > > > > > --- a/doc/src/sgml/ref/release_savepoint.sgml > > > +++ b/doc/src/sgml/ref/release_savepoint.sgml > > > @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] > > > savepoint_name > > > Description > > > > > > > > > - RELEASE SAVEPOINT destroys a savepoint previously > > > defined > > > - in the current transaction. > > > + RELEASE SAVEPOINT will subcommit the subtransaction > > > + established by the named savepoint, if one exists. This will release > > > + any resources held by the subtransaction. If there were any > > > + subtransactions of the named savepoint, these will also be > > > subcommitted. > > > > > > > > > > > > > "Subtransactions of the named savepoint" is somewhat confusing; how about > > "subtransactions of the subtransaction established by the named savepoint"? > > > > If that is too long and explicit, perhaps "subtransactions of that > > subtransaction". > > Personally, I think these are more confusing. Perhaps. I was worried because everywhere else, the wording makes a clear distinction between a savepoint and the subtransaction created by a savepoint. But perhaps some sloppiness is better to avoid such word cascades. > > > --- a/doc/src/sgml/ref/rollback.sgml > > > +++ b/doc/src/sgml/ref/rollback.sgml > > > @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ] > > > AND CHAIN > > > > > > > > > - If AND CHAIN is specified, a new transaction is > > > + If AND CHAIN is specified, a new unaborted > > > transaction is > > > immediately started with the same transaction characteristics (see > > > > > linkend="sql-set-transaction"/>) as the just finished one. > > > Otherwise, > > > no new transaction is started. > > > > I don't think that is an improvement. "Unaborted" is an un-word. A new > > transaction > > is always "unaborted", isn't it? > > I thought about this as well when reviewing it, but I do think > something is needed for the case where you have a transaction which > has suffered an error and then you issue "rollback and chain"; if you > just say "a new transaction is immediately started with the same > transaction characteristics" it might imply to some the new > transaction has some kind of carry over of the previous broken > transaction... the use of the word unaborted makes it clear that the > new transaction is 100% functional. A new transaction is never aborted in my understanding. Being aborted is not a characteristic of a transaction, but a state. > > > > > > > + The internal transaction ID type xid is 32-bits wide > > > > There should be no hyphen in "32 bits wide", just as in "3 years old". > > Minor aside, we should clean up glossary.sgml as well. Right, it has this: The numerical, unique, sequentially-assigned identifier that each transaction receives when it first causes a database modification. Frequently abbreviated as xid. When stored on disk, xids are only 32-bits wide, so only approximately four billion write transaction IDs can be generated; to permit the system to run for longer than that, epochs are used, also 32 bits wide. Which reminds me that I should have suggested rather than where I complained about the use of . > > > > + Up to > > + 64 open subxids are cached in shared memory for each backend; after > > + that point, the overhead increases significantly since we must look > > + up subxid entries in pg_subtrans. > > > > Comma before "since". Perhaps you should mention that this means disk I/O. 
> > ISTR that you only use a comma before since in cases where the > preceding thought contains a negative. Not being a native speaker, I'll leave that to those who are; I went by feeling. > In any case, are you thinking something like this: > > " 64 open subxids are cached in shared memory for each backend; after > that point the overhead increases significantly due to additional disk I/O > from looking up subxid entries in pg_subtrans." Yes. Yours, Laurenz Albe