Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
I've attached an updated patch. On Fri, Feb 18, 2022 at 10:48:10PM +0530, Ashutosh Sharma wrote: > +* runningBackups is a counter indicating the number of backups currently > in > +* progress. forcePageWrites is set to true when either of these is > +* non-zero. lastBackupStart is the latest checkpoint redo location used > as > +* a starting point for an online backup. > */ > - ExclusiveBackupState exclusiveBackupState; > - int nonExclusiveBackups; > > What do you mean by "either of these is non-zero ''. Earlier we used > to set forcePageWrites in case of both exclusive and non-exclusive > backups, but we have just one type of backup now. Fixed this. > -* OK to update backup counters, forcePageWrites and session-level lock. > +* OK to update backup counters and forcePageWrites. > * > > We still update the status of session-level lock so I don't think we > should update the above comment. See below code: Fixed this. > I think we have a lot of common code in do_pg_abort_backup() and > pg_do_stop_backup(). So why not have a common function that can be > called from both these functions. I didn't follow through with this change. I only saw a handful of lines that looked similar, and AFAICT we'd need an extra branch for cleaning up the session-level lock since do_pg_abort_backup() doesn't. > +# Now delete the bogus backup_label file since it will interfere with startup > +unlink("$pgdata/backup_label") > + or BAIL_OUT("unable to unlink $pgdata/backup_label"); > + > > Why do we need this additional change? Earlier this was not required. IIUC this test relied on the following code to handle the bogus file: /* * Terminate exclusive backup mode to avoid recovery after a clean * fast shutdown. Since an exclusive backup can only be taken * during normal running (and not, for example, while running * under Hot Standby) it only makes sense to do this if we reached * normal running. If we're still in recovery, the backup file is * one we're recovering *from*, and we must keep it around so that * recovery restarts from the right place. */ if (ReachedNormalRunning) CancelBackup(); The attached patch removes this code. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7fd092c90be90018eeddf3ea5088197627cbe593 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v3 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 154 +- doc/src/sgml/func.sgml| 30 +- src/backend/access/transam/xlog.c | 466 ++ src/backend/access/transam/xlogfuncs.c| 235 ++--- src/backend/catalog/system_functions.sql | 14 +- src/backend/postmaster/postmaster.c | 46 +- src/backend/replication/basebackup.c | 6 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 3 +- src/include/catalog/pg_proc.dat | 20 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +-- .../t/010_logical_decoding_timelines.pl | 4 +- 14 files changed, 113 insertions(+), 935 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..f241bcab9c 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. - -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. 
- - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -884,7 +876,7 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 rights to run pg_start_backup (superuser, or a user who has been granted EXECUTE on the function) and issue the command: -SELECT pg_start_backup('label', false, false); +SELECT pg_start_backup('label', false); where label is an
Re: Add index scan progress to pg_stat_progress_vacuum
On Mon, Feb 21, 2022 at 07:03:39PM +, Imseih (AWS), Sami wrote: > Sending again with patch files renamed to ensure correct apply order. I haven't had a chance to test this too much, but I did look through the patch set and have a couple of small comments. + + + indexes_total bigint + + + The number of indexes to be processed in the + vacuuming indexes or cleaning up indexes phase + of the vacuum. + + + + + + indexes_processed bigint + + + The number of indexes processed in the + vacuuming indexes or cleaning up indexes phase. + At the start of an index vacuum cycle, this value is set to 0. + + Will these be set to 0 for failsafe vacuums and vacuums with INDEX_CLEANUP turned off? +typedef struct VacWorkerProgressInfo +{ +int num_vacuums;/* number of active VACUUMS with parallel workers */ +int max_vacuums;/* max number of VACUUMS with parallel workers */ +slock_t mutex; +VacOneWorkerProgressInfo vacuums[FLEXIBLE_ARRAY_MEMBER]; +} VacWorkerProgressInfo; max_vacuums appears to just be a local copy of MaxBackends. Does this information really need to be stored here? Also, is there a strong reason for using a spinlock instead of an LWLock? +void +vacuum_worker_end(int leader_pid) +{ +SpinLockAcquire(&vacworkerprogress->mutex); +for (int i = 0; i < vacworkerprogress->num_vacuums; i++) +{ +VacOneWorkerProgressInfo *vac = &vacworkerprogress->vacuums[i]; + +if (vac->leader_pid == leader_pid) +{ +*vac = vacworkerprogress->vacuums[vacworkerprogress->num_vacuums - 1]; +vacworkerprogress->num_vacuums--; +SpinLockRelease(&vacworkerprogress->mutex); +break; +} +} +SpinLockRelease(&vacworkerprogress->mutex); +} I see this loop pattern in a couple of places, and it makes me wonder if this information would fit more naturally in a hash table. +if (callback) +callback(values, 3); Why does this need to be set up as a callback function? Could we just call the function if cmdtype == PROGRESS_COMMAND_VACUUM? ISTM that is pretty much all this is doing. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
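To illustrate the hash table idea, here is a rough sketch of a shared table keyed by the leader PID. None of this is from the patch; the names and sizing are hypothetical, and a real version would still need to synchronize concurrent access (e.g., with an LWLock or a partitioned hash):

	#include "postgres.h"

	#include "miscadmin.h"
	#include "storage/shmem.h"
	#include "utils/hsearch.h"

	/* hypothetical per-leader progress entry */
	typedef struct VacProgressEntry
	{
		int			leader_pid;			/* hash key */
		int			indexes_processed;
	} VacProgressEntry;

	static HTAB *VacProgressHash;		/* hypothetical shared hash table */

	/* during shared-memory initialization */
	static void
	vac_progress_shmem_init(void)
	{
		HASHCTL		info;

		memset(&info, 0, sizeof(info));
		info.keysize = sizeof(int);
		info.entrysize = sizeof(VacProgressEntry);
		VacProgressHash = ShmemInitHash("Vacuum worker progress", 64, MaxBackends,
										&info, HASH_ELEM | HASH_BLOBS);
	}

	/* vacuum_worker_end() then collapses to a single keyed removal */
	static void
	vacuum_worker_end(int leader_pid)
	{
		(void) hash_search(VacProgressHash, &leader_pid, HASH_REMOVE, NULL);
	}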
remove more archiving overhead
Hi hackers,

With the addition of archive modules (5ef1eef) and other archiving improvements (e.g., beb4e9b), the amount of archiving overhead has been reduced. I spent some time measuring the remaining overhead, and the following function stood out:

	/*
	 * pgarch_archiveDone
	 *
	 * Emit notification that an xlog file has been successfully archived.
	 * We do this by renaming the status file from NNN.ready to NNN.done.
	 * Eventually, a checkpoint process will notice this and delete both the
	 * NNN.done file and the xlog file itself.
	 */
	static void
	pgarch_archiveDone(char *xlog)
	{
		char		rlogready[MAXPGPATH];
		char		rlogdone[MAXPGPATH];

		StatusFilePath(rlogready, xlog, ".ready");
		StatusFilePath(rlogdone, xlog, ".done");
		(void) durable_rename(rlogready, rlogdone, WARNING);
	}

Specifically, the call to durable_rename(), which calls fsync() a few times, can cause a decent delay. On my machine, archiving ~12k WAL files using a bare-bones archive module that did nothing but return true took over a minute. When I changed this to rename() (i.e., reverting the change to pgarch.c in 1d4a0ab), the same test took a second or less.

I also spent some time investigating whether durably renaming the archive status files was even necessary. In theory, it shouldn't be. If a crash happens before the rename is persisted, we might try to re-archive files, but your archive_command/archive_library should handle that. If the file was already recycled or removed, the archiver will skip it (thanks to 6d8727f). But when digging further, I found that WAL file recycling uses durable_rename_excl(), which has the following note:

	 * Note that a crash in an unfortunate moment can leave you with two links to
	 * the target file.

IIUC this means that in theory, a crash at an unfortunate moment could leave you with a .ready file, the file to archive, and another link to that file with a "future" WAL filename. If you re-archive the file after it has been reused, you could end up with corrupt WAL archives. I think this might already be possible today. Syncing the directory after every rename probably reduces the likelihood quite substantially, but if RemoveOldXlogFiles() quickly picks up the .done file and attempts recycling before durable_rename() calls fsync() on the directory, presumably the same problem could occur.

I believe the current recommendation is to fail if the archive file already exists. The basic_archive contrib module fails if the archive already exists and has different contents, and it succeeds without re-archiving if it already exists with identical contents. Since something along these lines is recommended as a best practice, perhaps this is okay.

But I think we might be able to do better. If we ensure that the archive_status directory is sync'd prior to recycling/removing a WAL file, I believe we are safe from the aforementioned problem. The drawback of this is that we are essentially offloading more work to the checkpointer process. In addition to everything else it must do, it will also need to fsync() the archive_status directory many times. I'm not sure how big of a problem this is. It should be good to ensure that archiving keeps up, and there are far more expensive tasks that the checkpointer must perform that could make the added fsync() calls relatively insignificant.

I've attached a patch that makes the changes discussed above. Thoughts?
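To make the last idea a bit more concrete, here is a minimal sketch of syncing the archive_status directory before recycling. fsync_fname() and XLOGDIR are existing helpers, but where exactly such a call would live (and how often it would fire) is precisely the open question above:

	#include "postgres.h"

	#include "access/xlog_internal.h"	/* XLOGDIR */
	#include "storage/fd.h"				/* fsync_fname() */

	/*
	 * Sketch: persist the archiver's .ready -> .done renames before the
	 * checkpointer recycles or removes old WAL segments, so a crash cannot
	 * resurrect a .ready file for a segment that has already been reused.
	 */
	static void
	sync_archive_status_dir(void)
	{
		fsync_fname(XLOGDIR "/archive_status", true);	/* true = directory */
	}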
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From a6d033aff90a6218345cba41847ccfdfbe6447d7 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 21 Feb 2022 16:29:14 -0800 Subject: [PATCH v1 1/1] remove more archiving overhead --- src/backend/access/transam/xlogarchive.c | 15 +-- src/backend/postmaster/pgarch.c | 13 - 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index a2657a2005..d6823c9c38 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -585,8 +585,13 @@ XLogArchiveForceDone(const char *xlog) * If it is not time to delete it, make sure a .ready file exists, and return * false. * - * If .done exists, then return true; else if .ready exists, - * then return false; else create .ready and return false. + * If .done exists, then sync the archive_status directory to disk and + * return true. Syncing the directory ensures that the rename from + * .ready to .done is persisted so that we won't attempt to re- + * archive the file after a crash. + * + * If .ready exists, then return false. If neither .ready nor + * .done exist, create .ready and return false. * * The reason w
Re: remove more archiving overhead
On Mon, Feb 21, 2022 at 05:19:48PM -0800, Nathan Bossart wrote:
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary. In theory, it shouldn't be. If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that. If the file
> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f). But when digging further, I found that WAL file recycling uses
> durable_rename_excl(), which has the following note:
>
>  * Note that a crash in an unfortunate moment can leave you with two links to
>  * the target file.
>
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename. If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives. I think this
> might already be possible today. Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

In my testing, I found that when I killed the server just before unlink() during WAL recycling, I ended up with links to the same file in pg_wal after restarting. My latest test produced links to the same file for the current WAL file and the next one. Maybe WAL recycling should use durable_rename() instead of durable_rename_excl().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
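For illustration, the duplicate links come from the way durable_rename_excl() behaves on platforms with a working link(): roughly a link() followed by an unlink(). A simplified sketch of that sequence (error handling and the fsync calls of the real function are omitted):

	#include <unistd.h>

	/*
	 * Simplified sketch of the link()-based path in durable_rename_excl();
	 * the real function also fsyncs the files and their directory.
	 */
	static int
	rename_excl_sketch(const char *oldfile, const char *newfile)
	{
		if (link(oldfile, newfile) < 0)	/* (1) newfile becomes a second hard link */
			return -1;

		/*
		 * A crash (or kill -9) between (1) and (2) leaves two directory
		 * entries pointing at the same WAL segment -- which matches what I
		 * saw after killing the server just before unlink().
		 */
		if (unlink(oldfile) < 0)		/* (2) drop the old name */
			return -1;

		return 0;
	}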
Re: remove more archiving overhead
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote: > In my testing, I found that when I killed the server just before unlink() > during WAL recyling, I ended up with links to the same file in pg_wal after > restarting. My latest test produced links to the same file for the current > WAL file and the next one. Maybe WAL recyling should use durable_rename() > instead of durable_rename_excl(). Here is an updated patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From e080d493985cf0a203122c1e07952b0f765019e4 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 22 Feb 2022 11:39:28 -0800 Subject: [PATCH v2 1/1] reduce archiving overhead --- src/backend/access/transam/xlog.c | 10 ++ src/backend/postmaster/pgarch.c | 14 +- 2 files changed, 19 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 0d2bd7a357..2ad047052f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3334,13 +3334,15 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, } /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). + * Perform the rename. Ideally, we'd use link() and unlink() to avoid + * overwriting an existing file (there shouldn't be one). However, that + * approach opens up the possibility that pg_wal will contain multiple hard + * links to the same WAL file after a crash. */ - if (durable_rename_excl(tmppath, path, LOG) != 0) + if (durable_rename(tmppath, path, LOG) != 0) { LWLockRelease(ControlFileLock); - /* durable_rename_excl already emitted log message */ + /* durable_rename already emitted log message */ return false; } diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index d916ed39a8..641297e9f5 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -746,7 +746,19 @@ pgarch_archiveDone(char *xlog) StatusFilePath(rlogready, xlog, ".ready"); StatusFilePath(rlogdone, xlog, ".done"); - (void) durable_rename(rlogready, rlogdone, WARNING); + + /* + * To avoid extra overhead, we don't durably rename the .ready file to + * .done. Archive commands and libraries must already gracefully handle + * attempts to re-archive files (e.g., if the server crashes just before + * this function is called), so it should be okay if the .ready file + * reappears after a crash. + */ + if (rename(rlogready, rlogdone) < 0) + ereport(WARNING, +(errcode_for_file_access(), + errmsg("could not rename file \"%s\" to \"%s\": %m", + rlogready, rlogdone))); } -- 2.25.1
Re: C++ Trigger Framework
On Tue, Feb 22, 2022 at 04:02:39PM +0200, Shmuel Kamensky wrote: > I'm interested in creating a Postgres extension that would enable > developers to write triggers in (modern) C++. Does anyone know if there is > already some sort of translation wrapper between the native Postgres C > API's and C++? Or if there's already a project that allows for writing > triggers in C++ with ease? > I see that https://github.com/clkao/plv8js-clkao/blob/master/plv8_type.cc > does implement an abstraction of sorts, but it's specific to v8 types and > is not genericized as a way of interacting with Postgres C API's from C++ > from *an*y C++ code. > > Can you imagine any potential technical challenges I may encounter (e.g. > massaging postgres' custom allocator to work with C++'s new and delete > operators, or unresolvable compiler incompatibilities)? This might answer your questions: https://www.postgresql.org/docs/devel/xfunc-c.html#EXTEND-CPP -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow file inclusion in pg_hba and pg_ident files
On Wed, Feb 23, 2022 at 12:59:59PM +0800, Julien Rouhaud wrote: > To address that, I'd like to propose the possibility to include files in hba > and ident configuration files. This was already discussed in the past, and in > my understanding this is mostly wanted, while some people expressed concerned > on a use case that wouldn't rely on thousands of entries. +1, I think this would be very useful. > 0001 adds a new pg_ident_file_mappings view, which is basically the same as > pg_hba_file_rules view but for mappings. It's probably already useful, for > instance if you need to tweak some regexp. This seems reasonable. > Finally I also added 0003, which is a POC for a new pg_hba_matches() function, > that can help DBA to understand why their configuration isn't working as they > expect. This only to start the discussion on that topic, the code is for now > really hackish, as I don't know how much this is wanted and/or if some other > behavior would be better, and there's also no documentation or test. The > function for now only takes an optional inet (null means unix socket), the > target role and an optional ssl flag and returns the file, line and raw line > matching if any, or null. For instance: I think another use-case for this is testing updates to your configuration files. For example, I could ensure that hba_forbid_non_ssl.conf wasn't accidentally reverted as part of an unrelated change. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: remove more archiving overhead
Thanks for taking a look! On Thu, Feb 24, 2022 at 02:13:49PM +0900, Kyotaro Horiguchi wrote: > https://www.postgresql.org/docs/14/continuous-archiving.html > | The archive command should generally be designed to refuse to > | overwrite any pre-existing archive file. This is an important safety > | feature to preserve the integrity of your archive in case of > | administrator error (such as sending the output of two different > | servers to the same archive directory). > > The recommended behavior of archive_command above is to avoid this > kind of corruption. If we officially admit that a WAL segment can be > archive twice, we need to revise that part (and the following part) of > the doc. Yeah, we might want to recommend succeeding if the archive already exists with identical contents (like basic_archive does). IMO this would best be handled in a separate thread that is solely focused on documentation updates. AFAICT this is an existing problem. >> IIUC this means that in theory, a crash at an unfortunate moment could >> leave you with a .ready file, the file to archive, and another link to that >> file with a "future" WAL filename. If you re-archive the file after it has >> been reused, you could end up with corrupt WAL archives. I think this > > IMHO the difference would be that the single system call (maybe) > cannot be interrupted by sigkill. So I'm not sure that that is worth > considering. The sigkill window can be elimianted if we could use > renameat(RENAME_NOREPLACE). But finally power-loss can cause that. Perhaps durable_rename_excl() could use renameat2() for systems that support it. However, the problem would still exist for systems that have link() but not renameat2(). In this particular case, there really shouldn't be an existing file in pg_wal. >> might already be possible today. Syncing the directory after every rename >> probably reduces the likelihood quite substantially, but if >> RemoveOldXlogFiles() quickly picks up the .done file and attempts >> recycling before durable_rename() calls fsync() on the directory, >> presumably the same problem could occur. > > If we don't fsync at every rename, we don't even need for > RemoveOldXlogFiles to pickup .done very quickly. Isn't it a big > difference? Yeah, it probably increases the chances quite a bit. > I think we don't want make checkpointer slower in exchange of making > archiver faster. Perhaps we need to work using a condensed > information rather than repeatedly scanning over the distributed > information like archive_status? v2 avoids giving the checkpointer more work. > At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart > wrote in >> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote: >> > In my testing, I found that when I killed the server just before unlink() >> > during WAL recyling, I ended up with links to the same file in pg_wal after >> > restarting. My latest test produced links to the same file for the current >> > WAL file and the next one. Maybe WAL recyling should use durable_rename() >> > instead of durable_rename_excl(). >> >> Here is an updated patch. > > Is it intentional that v2 loses the xlogarchive.c part? Yes. I found that a crash at an unfortunate moment can produce multiple links to the same file in pg_wal, which seemed bad independent of archival. By fixing that (i.e., switching from durable_rename_excl() to durable_rename()), we not only avoid this problem, but we also avoid trying to archive a file the server is concurrently writing. 
Then, after a crash, the WAL file to archive should either not exist (which is handled by the archiver) or contain the same contents as any preexisting archives. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
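As a side note on the renameat2() idea mentioned earlier in this message, a minimal Linux-only sketch of an atomic "rename unless the target exists" follows; a real patch would need a configure check and a fallback for platforms without it:

	#define _GNU_SOURCE
	#include <fcntl.h>				/* AT_FDCWD, RENAME_NOREPLACE */
	#include <stdio.h>				/* renameat2(), glibc 2.28+ */

	/*
	 * Sketch of an atomic no-replace rename.  Unlike the link()/unlink()
	 * dance, a crash can never leave two links behind; EEXIST indicates
	 * the target was already present.
	 */
	static int
	rename_noreplace_sketch(const char *oldfile, const char *newfile)
	{
		return renameat2(AT_FDCWD, oldfile, AT_FDCWD, newfile, RENAME_NOREPLACE);
	}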
Re: remove more archiving overhead
I tested this again with many more WAL files and a much larger machine (r5d.24xlarge with data directory on an NVMe SSD instance store volume). As before, I am using a bare-bones archive module that does nothing but return true. Without the patch, archiving ~120k WAL files took about 109 seconds. With the patch, it took around 62 seconds, which amounts to a ~43% reduction in overhead. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Fix overflow in justify_interval related functions
On Fri, Feb 25, 2022 at 10:30:57AM -0500, Joseph Koshakow wrote: > Just checking because I'm not very familiar with the process, > are there any outstanding items that I need to do for this patch? Unless someone has additional feedback, I don't think so. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)
On Fri, Feb 25, 2022 at 08:31:37PM +0530, Bharath Rupireddy wrote:
> Thanks Satya and others for the inputs. Here's the v1 patch that
> basically allows async wal senders to wait until the sync standbys
> report their flush lsn back to the primary. Please let me know your
> thoughts.

I haven't had a chance to look too closely yet, but IIUC this adds a new function that waits for synchronous replication. This new function essentially spins until the synchronous LSN has advanced. I don't think it's a good idea to block sending any WAL like this. AFAICT it is possible that there will be a lot of synchronously replicated WAL that we can send, and it might just be the last several bytes that cannot yet be replicated to the asynchronous standbys. I believe this patch will cause the server to avoid sending _any_ WAL until the synchronous LSN advances. Perhaps we should instead just choose the SendRqstPtr based on the current synchronous LSN. Presumably there are other things we'd need to consider, but in general, I think we ought to send as much WAL as possible for a given call to XLogSendPhysical().

> I've done pgbench testing to see if the patch causes any problems. I
> ran tests two times, there isn't much difference in the txns per
> seconds (tps), although there's a delay in the async standby receiving
> the WAL, after all, that's the feature we are pursuing.

I'm curious what a longer pgbench run looks like when the synchronous replicas are in the same region. That is probably a more realistic use-case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Add index scan progress to pg_stat_progress_vacuum
On Wed, Feb 23, 2022 at 10:41:36AM -0800, Peter Geoghegan wrote: > On Wed, Feb 23, 2022 at 10:02 AM Imseih (AWS), Sami > wrote: >> If the failsafe kicks in midway through a vacuum, the number indexes_total >> will not be reset to 0. If INDEX_CLEANUP is turned off, then the value will >> be 0 at the start of the vacuum. > > The way that this works with num_index_scans is that we "round up" > when there has been non-zero work in lazy_vacuum_all_indexes(), but > not if the precheck in lazy_vacuum_all_indexes() fails. That seems > like a good model to generalize from here. Note that this makes > INDEX_CLEANUP=off affect num_index_scans in much the same way as a > VACUUM where the failsafe kicks in very early, during the initial heap > pass. That is, if the failsafe kicks in before we reach lazy_vacuum() > for the first time (which is not unlikely), or even in the > lazy_vacuum_all_indexes() precheck, then num_index_scans will remain > at 0, just like INDEX_CLEANUP=off. > > The actual failsafe WARNING shows num_index_scans, possibly before it > gets incremented one last time (by "rounding up"). So it's reasonably > clear how this all works from that context (assuming that the > autovacuum logging stuff, which reports num_index_scans, outputs a > report for a table where the failsafe kicked in). I am confused. If failsafe kicks in during the middle of a vacuum, I (perhaps naively) would expect indexes_total and indexes_processed to go to zero, and I'd expect to no longer see the "vacuuming indexes" and "cleaning up indexes" phases. Otherwise, how would I know that we are now skipping indexes? Of course, you won't have any historical context about the index work done before failsafe kicked in, but IMO it is misleading to still include it in the progress view. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
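To make the expectation above concrete, a sketch of the reset I have in mind follows. The PROGRESS_VACUUM_INDEXES_* constants are from the patch under discussion, not existing code, and the placement of the failsafe check is only indicative:

	/*
	 * Sketch only: where lazy_check_wraparound_failsafe() trips and index
	 * vacuuming is abandoned, also zero the proposed counters so the view
	 * no longer suggests index work is still pending.
	 */
	if (lazy_check_wraparound_failsafe(vacrel))
	{
		pgstat_progress_update_param(PROGRESS_VACUUM_INDEXES_TOTAL, 0);
		pgstat_progress_update_param(PROGRESS_VACUUM_INDEXES_PROCESSED, 0);
	}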
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On Mon, Jan 31, 2022 at 04:25:21PM +0900, Michael Paquier wrote: > At the end of the day, it may be better to just let this stuff be. > Another argument for doing nothing is that this could cause > hard-to-spot conflicts when it comes to back-patch something. This one has been quiet for a while. Should we mark it as returned-with-feedback? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)
On Sat, Feb 26, 2022 at 02:17:50PM +0530, Bharath Rupireddy wrote: > A global min LSN of SendRqstPtr of all the sync standbys can be > calculated and the async standbys can send WAL up to global min LSN. > This is unlike what the v1 patch does i.e. async standbys will wait > until the sync standbys report flush LSN back to the primary. Problem > with the global min LSN approach is that there can still be a small > window where async standbys can get ahead of sync standbys. Imagine > async standbys being closer to the primary than sync standbys and if > the failover has to happen while the WAL at SendRqstPtr isn't received > by the sync standbys, but the async standbys can receive them as they > are closer. We hit the same problem that we are trying to solve with > this patch. This is the reason, we are waiting till the sync flush LSN > as it guarantees more transactional protection. Do you mean that the application of WAL gets ahead on your async standbys or that the writing/flushing of WAL gets ahead? If synchronous_commit is set to 'remote_write' or 'on', I think either approach can lead to situations where the async standbys are ahead of the sync standbys with WAL application. For example, a conflict between WAL replay and a query on your sync standby could delay WAL replay, but the primary will not wait for this conflict to resolve before considering a transaction synchronously replicated and sending it to the async standbys. If writing/flushing WAL gets ahead on async standbys, I think something is wrong with the patch. If you aren't sending WAL to async standbys until it is synchronously replicated to the sync standbys, it should by definition be impossible for this to happen. If you wanted to make sure that WAL was not applied to async standbys before it was applied to sync standbys, I think you'd need to set synchronous_commit to 'remote_apply'. This would ensure that the WAL is replayed on sync standbys before the primary considers the transaction synchronously replicated and sends it to the async standbys. > Do you think allowing async standbys optionally wait for either remote > write or flush or apply or global min LSN of SendRqstPtr so that users > can choose what they want? I'm not sure I follow the difference between "global min LSN of SendRqstPtr" and remote write/flush/apply. IIUC you are saying that we could use the LSN of what is being sent to sync standbys instead of the LSN of what the primary considers synchronously replicated. I don't think we should do that because it provides no guarantee that the WAL has even been sent to the sync standbys before it is sent to the async standbys. For this feature, I think we always need to consider what the primary considers synchronously replicated. My suggested approach doesn't change that. I'm saying that instead of spinning in a loop waiting for the WAL to be synchronously replicated, we just immediately send WAL up to the LSN that is presently known to be synchronously replicated. You do bring up an interesting point, though. Is there a use-case for specifying synchronous_commit='on' but not sending WAL to async replicas until it is synchronously applied? Or alternatively, would anyone want to set synchronous_commit='remote_apply' but send WAL to async standbys as soon as it is written to the sync standbys? My initial reaction is that we should depend on the synchronous replication setup. As long as the primary considers an LSN synchronously replicated, it would be okay to send it to the async standbys. 
I personally don't think it is worth taking on the extra complexity for that level of configuration just yet. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Sat, Feb 26, 2022 at 04:48:52PM +, Chapman Flack wrote: > My biggest concerns are the changes to the SQL-visible pg_start_backup > and pg_stop_backup functions. When the non-exclusive API was implemented > (in 7117685), that was done with care (with a new optional argument to > pg_start_backup, and a new overload of pg_stop_backup) to avoid immediate > breakage of working backup scripts. > > With this patch, even scripts that were dutifully migrated to that new API and > now invoke pg_start_backup(label, false) or (label, exclusive => false) will > immediately and unnecessarily break. What I would suggest for this patch > would be to change the exclusive default from true to false, and have the > function report an ERROR if true is passed. > > Otherwise, for sites using a third-party backup solution, there will be an > unnecessary requirement to synchronize a PostgreSQL upgrade with an > upgrade of the backup solution that won't be broken by the change. For > a site with their backup procedures scripted in-house, there will be an > unnecessarily urgent need for the current admin team to study and patch > the currently-working scripts. > > That can be avoided by just changing the default to false and rejecting calls > where true is passed. That will break only scripts that never got the memo > about moving to non-exclusive backup, available for six years now. > > Assuming the value is false, so no error is thrown, is it practical to > determine > from flinfo->fn_expr whether the value was defaulted or supplied? If so, I > would > further suggest reporting a deprecation WARNING if it was explicitly supplied, > with a HINT that the argument can simply be removed at the call site, and will > become unrecognized in some future release. This is a good point. I think I agree with your proposed changes. I believe it is possible to add a deprecation warning only when 'exclusive' is specified. If anything, we can create a separate function that accepts the 'exclusive' parameter and that always emits a NOTICE or WARNING. > pg_stop_backup needs thought, because 7117685 added a new overload for that > function, rather than just an optional argument. This patch removes the old > niladic version that returned pg_lsn, leaving just one version, with an > optional > argument, that returns a record. > > Here again, the old niladic one was only suitable for exclusive backups, so > there > can't be any script existing in 2022 that still calls that unless it has > never been > updated in six years to nonexclusive backups, and that breakage can't be > helped. > > Any scripts that did get dutifully updated over the last six years will be > calling the > record-returning version, passing false, or exclusive => false. This patch as > it > stands will unnecessarily break those, but here again I think that can be > avoided > just by making the exclusive parameter optional with default false, and > reporting > an error if true is passed. > > Here again, I would consider also issuing a deprecation warning if the > argument > is explicitly supplied, if it is practical to determine that from fn_expr. (I > haven't > looked yet to see how practical that is.) Agreed. I will look into updating this one, too. I think the 'exclusive' parameter should remain documented for now for both pg_start_backup() and pg_stop_backup(), but this documentation will just note that it is there for backward compatibility and must be set to false. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
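For concreteness, the sort of compatibility check discussed above might look roughly like this inside pg_start_backup(); this is a sketch only, and the exact argument position and message wording are not settled:

	/* Sketch: reject the removed mode explicitly instead of honoring it. */
	bool		exclusive = PG_GETARG_BOOL(2);

	if (exclusive)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("exclusive backup mode is no longer supported"),
				 errhint("Remove the \"exclusive\" argument or pass exclusive => false.")));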
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Sat, Feb 26, 2022 at 05:03:04PM -0500, Chapman Flack wrote: > I've now looked through the rest, and the only further thing I noticed > was that xlog.c's do_pg_start_backup still has a tablespaces parameter > to receive a List* of tablespaces if the caller wants, but this patch > removes the comment describing it: > > > - * If "tablespaces" isn't NULL, it receives a list of tablespaceinfo structs > - * describing the cluster's tablespaces. > > > which seems like collateral damage. Thanks. I will fix this and the proofreading nits you noted upthread in the next revision. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: parse/analyze API refactoring
On Mon, Feb 28, 2022 at 07:46:40AM +0100, Peter Eisentraut wrote: > You set this commit fest entry to Waiting on Author, but there were no > reviews posted and the patch still applies and builds AFAICT, so I don't > know what you meant by that. Apologies for the lack of clarity. I believe my only feedback was around deduplicating the pg_analyze_and_rewrite_*() functions. Would you rather handle that in a separate patch? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)
On Mon, Feb 28, 2022 at 06:45:51PM +0530, Bharath Rupireddy wrote: > On Sat, Feb 26, 2022 at 9:37 PM Nathan Bossart > wrote: >> For >> this feature, I think we always need to consider what the primary considers >> synchronously replicated. My suggested approach doesn't change that. I'm >> saying that instead of spinning in a loop waiting for the WAL to be >> synchronously replicated, we just immediately send WAL up to the LSN that >> is presently known to be synchronously replicated. > > As I said above, v1 patch does that i.e. async standbys wait until the > sync standbys update their flush LSN. > > Flush LSN is this - flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH]; > which gets updated in SyncRepReleaseWaiters. > > Async standbys with their SendRqstPtr will wait in XLogSendPhysical or > XLogSendLogical until SendRqstPtr <= flushLSN. My feedback is specifically about this behavior. I don't think we should spin in XLogSend*() waiting for an LSN to be synchronously replicated. I think we should just choose the SendRqstPtr based on what is currently synchronously replicated. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Sat, Feb 26, 2022 at 02:06:14PM -0800, Nathan Bossart wrote: > On Sat, Feb 26, 2022 at 04:48:52PM +, Chapman Flack wrote: >> Assuming the value is false, so no error is thrown, is it practical to >> determine >> from flinfo->fn_expr whether the value was defaulted or supplied? If so, I >> would >> further suggest reporting a deprecation WARNING if it was explicitly >> supplied, >> with a HINT that the argument can simply be removed at the call site, and >> will >> become unrecognized in some future release. > > This is a good point. I think I agree with your proposed changes. I > believe it is possible to add a deprecation warning only when 'exclusive' > is specified. If anything, we can create a separate function that accepts > the 'exclusive' parameter and that always emits a NOTICE or WARNING. I've spent some time looking into this, and I haven't found a clean way to emit a WARNING only if the "exclusive" parameter is supplied (and set to false). AFAICT flinfo->fn_expr doesn't tell us whether the parameter was supplied or the default value was used. I was able to get it working by splitting pg_start_backup() into 3 separate internal functions (i.e., pg_start_backup_1arg(), pg_start_backup_2arg(), and pg_start_backup_3arg()), but this breaks calls such as pg_start_backup('mylabel', exclusive => false), and it might complicate privilege management for users. Without a WARNING, I think it will be difficult to justify removing the "exclusive" parameter in the future. We would either need to leave it around forever, or we would have to risk unnecessarily breaking some working backup scripts. I wonder if we should just remove it now and make sure that this change is well-documented in the release notes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)
On Tue, Mar 01, 2022 at 11:10:09AM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart wrote:
>> My feedback is specifically about this behavior. I don't think we should
>> spin in XLogSend*() waiting for an LSN to be synchronously replicated. I
>> think we should just choose the SendRqstPtr based on what is currently
>> synchronously replicated.
>
> Do you mean something like the following?
>
> /* Main loop of walsender process that streams the WAL over Copy messages. */
> static void
> WalSndLoop(WalSndSendDataCallback send_data)
> {
>     /*
>      * Loop until we reach the end of this timeline or the client requests to
>      * stop streaming.
>      */
>     for (;;)
>     {
>         if (am_async_walsender && there_are_sync_standbys)
>         {
>             XLogRecPtr SendRqstLSN;
>             XLogRecPtr SyncFlushLSN;
>
>             SendRqstLSN = GetFlushRecPtr(NULL);
>             LWLockAcquire(SyncRepLock, LW_SHARED);
>             SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
>             LWLockRelease(SyncRepLock);
>
>             if (SendRqstLSN > SyncFlushLSN)
>                 continue;
>         }

Not quite. Instead of "continue", I would set SendRqstLSN to SyncFlushLSN so that the WAL sender only sends up to the current synchronously replicated LSN. TBH there are probably other things that need to be considered (e.g., how do we ensure that the WAL sender sends the rest once it is replicated?), but I still think we should avoid spinning in the WAL sender waiting for WAL to be replicated.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
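Roughly, I am thinking of something along these lines instead. This is just a sketch layered on the snippet above; am_async_walsender and there_are_sync_standbys come from the proposed patch, not existing code:

	if (am_async_walsender && there_are_sync_standbys)
	{
		XLogRecPtr	SyncFlushLSN;

		LWLockAcquire(SyncRepLock, LW_SHARED);
		SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
		LWLockRelease(SyncRepLock);

		/*
		 * Clamp instead of skipping the iteration: this cycle still sends
		 * everything that is already synchronously replicated.  Something
		 * else (e.g., being woken when the sync flush LSN advances) is
		 * still needed so the remainder gets sent later.
		 */
		if (SendRqstLSN > SyncFlushLSN)
			SendRqstLSN = SyncFlushLSN;
	}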
Re: Pre-allocating WAL files
On Tue, Mar 01, 2022 at 08:40:44AM -0600, Justin Pryzby wrote: > FYI: this is currently failing in cfbot on linux. > > https://cirrus-ci.com/task/4934371210690560 > https://api.cirrus-ci.com/v1/artifact/task/4934371210690560/log/src/test/regress/regression.diffs > > DROP TABLESPACE regress_tblspace_renamed; > +ERROR: tablespace "regress_tblspace_renamed" is not empty I believe this is due to an existing bug. This patch set seems to influence the timing to make it more likely. I'm tracking the fix here: https://commitfest.postgresql.org/37/3544/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Tue, Mar 01, 2022 at 08:44:51AM -0600, David Steele wrote: > Personally, I am in favor of removing it. We change/rename > functions/tables/views when we need to, and this happens in almost every > release. > > What we need to do is make sure that an older installation won't silently > work in a broken way, i.e. if we remove the exclusive flag somebody > expecting the pre-9.6 behavior might not receive an error and think > everything is OK. That would not be good. > > One option might be to rename the functions. Something like > pg_backup_start/stop. I'm fine with this approach. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow async standbys wait for sync replication
On Tue, Mar 01, 2022 at 04:34:31PM +0900, Kyotaro Horiguchi wrote: > At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart > wrote in >> replicated LSN. TBH there are probably other things that need to be >> considered (e.g., how do we ensure that the WAL sender sends the rest once >> it is replicated?), but I still think we should avoid spinning in the WAL >> sender waiting for WAL to be replicated. > > It seems to me it would be something similar to > SyncRepReleaseWaiters(). Or it could be possible to consolidate this > feature into the function, I'm not sure, though. Yes, perhaps the synchronous replication framework will need to alert WAL senders when the syncrep LSN advances so that the WAL is sent to the async standbys. I'm glossing over the details, but I think that should be the general direction. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Tue, Mar 01, 2022 at 11:09:13AM -0500, Chapman Flack wrote: > On 03/01/22 09:44, David Steele wrote: >> Personally, I am in favor of removing it. We change/rename >> functions/tables/views when we need to, and this happens in almost every >> release. > > For clarification, is that a suggestion to remove the 'exclusive' parameter > in some later release, after using this release to default it to false and > reject calls with true? My suggestion was to remove it in v15. My impression is that David and Stephen agree, but I could be misinterpreting their responses. > That way, at least, there would be a period of time where procedures > that currently work (by passing exclusive => false) would continue to work, > and could be adapted as time permits by removing that argument, with no > behavioral change. I'm not sure if there's any advantage to kicking the can down the road. At some point, we'll need to break existing backup scripts. Will we be more prepared to do that in v17 than we are now? We could maintain two sets of functions for a few releases and make it really clear in the documentation that pg_start/stop_backup() are going to be removed soon (and always emit a WARNING when they are used). Would that address your concerns? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Proposal: Support custom authentication methods using hooks
On Mon, Feb 28, 2022 at 03:46:34PM -0500, Stephen Frost wrote: > md5 should be removed. +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Allow async standbys wait for sync replication
On Tue, Mar 01, 2022 at 11:09:57PM +0530, Bharath Rupireddy wrote: > On Tue, Mar 1, 2022 at 10:35 PM Nathan Bossart > wrote: >> Yes, perhaps the synchronous replication framework will need to alert WAL >> senders when the syncrep LSN advances so that the WAL is sent to the async >> standbys. I'm glossing over the details, but I think that should be the >> general direction. > > It's doable. But we can't avoid async walsenders waiting for the flush > LSN even if we take the SyncRepReleaseWaiters() approach right? I'm > not sure (at this moment) what's the biggest advantage of this > approach i.e. (1) backends waking up walsenders after flush lsn is > updated vs (2) walsenders keep looking for the new flush lsn. I think there are a couple of advantages. For one, spinning is probably not the best from a resource perspective. There is no guarantee that the desired SendRqstPtr will ever be synchronously replicated, in which case the WAL sender would spin forever. Also, this approach might fit in better with the existing synchronous replication framework. When a WAL sender realizes that it can't send up to the current "flush" LSN because it's not synchronously replicated, it will request to be alerted when it is. In the meantime, it can send up to the latest syncrep LSN so that the async standby is as up-to-date as possible. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
Here is a new version of the patch with the following changes: 1. Addressed Chap's feedback from upthread. 2. Renamed pg_start/stop_backup() to pg_backup_start/stop() as suggested by David. 3. A couple of other small documentation adjustments. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7119f9063f22652fca1e2a44fdf6b4b6b3fbf679 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v4 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 188 +-- doc/src/sgml/func.sgml| 99 +--- doc/src/sgml/high-availability.sgml | 6 +- doc/src/sgml/monitoring.sgml | 4 +- doc/src/sgml/ref/pgupgrade.sgml | 2 +- src/backend/access/transam/xlog.c | 493 ++ src/backend/access/transam/xlogfuncs.c| 253 ++--- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/catalog/system_functions.sql | 18 +- src/backend/postmaster/postmaster.c | 46 +- src/backend/replication/basebackup.c | 20 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_ctl/pg_ctl.c | 4 +- src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 7 +- src/include/catalog/pg_control.h | 2 +- src/include/catalog/pg_proc.dat | 26 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +- .../t/010_logical_decoding_timelines.pl | 4 +- 20 files changed, 190 insertions(+), 1054 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..acffee4688 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. - -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. - - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 Connect to the server (it does not matter which database) as a user with - rights to run pg_start_backup (superuser, or a user who has been granted + rights to run pg_backup_start (superuser, or a user who has been granted EXECUTE on the function) and issue the command: -SELECT pg_start_backup('label', false, false); +SELECT pg_backup_start('label', false); where label is any string you want to use to uniquely identify this backup operation. The connection - calling pg_start_backup must be maintained until the end of + calling pg_backup_start must be maintained until the end of the backup, or the backup will be automatically aborted. - By default, pg_start_backup can take a long time to finish. + By default, pg_backup_start can take a long time to finish. This is because it performs a checkpoint, and the I/O required for the checkpoint will be spread out over a significant period of time, by default half your inter-checkpoint interval @@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false); issue an immediate checkpoint using as much I/O as available. - - The third parameter being false tells - pg_start_backup to initiate a non-exclusive base backup. 
- @@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false); In the same connection as before, issue the command: -SELECT * FROM pg_stop_backup(false, true); +SELECT * FROM pg_backup_stop(true); This terminates backup mode. On a primary, it also performs an automatic switch to the next WAL segment. On a standby, it is not possible to @@ -937,7 +925,7 @@ SELECT * FROM pg_stop_backup(false, true); ready to archive. - The pg_stop_backup will return one row with three + The pg_backup_stop will return one row with three values. The second of these fields should be written to a file named backup_label in the root directory of the backup. The third field should be written to a file named @@ -949,14 +937,14 @@ SELECT * FROM pg_stop_backup(false, true); Once the WAL segment files active during the backup are archived, yo
Re: Allow async standbys wait for sync replication
On Wed, Mar 02, 2022 at 09:47:09AM +0530, Bharath Rupireddy wrote: > On Wed, Mar 2, 2022 at 2:57 AM Nathan Bossart > wrote: >> I think there are a couple of advantages. For one, spinning is probably >> not the best from a resource perspective. > > Just to be on the same page - by spinning do you mean - the async > walsender waiting for the sync flushLSN in a for-loop with > WaitLatch()? Yes. >> Also, this approach might fit in better >> with the existing synchronous replication framework. When a WAL sender >> realizes that it can't send up to the current "flush" LSN because it's not >> synchronously replicated, it will request to be alerted when it is. > > I think you are referring to the way a backend calls SyncRepWaitForLSN > and waits until any one of the walsender sets syncRepState to > SYNC_REP_WAIT_COMPLETE in SyncRepWakeQueue. Firstly, SyncRepWaitForLSN > blocking i.e. the backend spins/waits in for (;;) loop until its > syncRepState becomes SYNC_REP_WAIT_COMPLETE. The backend doesn't do > any other work but waits. So, spinning isn't avoided completely. > > Unless, I'm missing something, the existing syc repl queue > (SyncRepQueue) mechanism doesn't avoid spinning in the requestors > (backends) SyncRepWaitForLSN or in the walsenders SyncRepWakeQueue. My point is that there are existing tools for alerting processes when an LSN is synchronously replicated and for waking up WAL senders. What I am proposing wouldn't involve spinning in XLogSendPhysical() waiting for synchronous replication. Like SyncRepWaitForLSN(), we'd register our LSN in the queue (SyncRepQueueInsert()), but we wouldn't sit in a separate loop waiting to be woken. Instead, SyncRepWakeQueue() would eventually wake up the WAL sender and trigger another iteration of WalSndLoop(). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
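To spell out the flow I have in mind, a pseudocode sketch follows. This is not working code: walsenders cannot currently register in the sync-rep queue, SyncRepQueueInsert() is a static function in syncrep.c, and the necessary locking (SyncRepLock) is omitted:

	if (SendRqstPtr > SyncFlushLSN)
	{
		/* register interest, the way SyncRepWaitForLSN() does for backends */
		MyProc->waitLSN = SendRqstPtr;
		SyncRepQueueInsert(SYNC_REP_WAIT_FLUSH);

		/*
		 * ...but return to WalSndLoop() instead of blocking.  Once the sync
		 * standbys report this LSN, SyncRepWakeQueue() sets our latch and
		 * the next loop iteration sends the rest.
		 */
		SendRqstPtr = SyncFlushLSN;
	}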
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Wed, Mar 02, 2022 at 02:23:51PM -0500, Chapman Flack wrote: > I did not notice this earlier (sorry), but there seems to remain in > backup.sgml a programlisting example that shows a psql invocation > for pg_backup_start, then a tar command, then another psql invocation > for pg_backup_stop. > > I think that was only workable for the exclusive mode, and now it is > necessary to issue pg_backup_start and pg_backup_stop in the same session. > > (The 'touch backup_in_progress' business seems a bit bogus now too, > suggesting an exclusivity remembered from bygone days.) > > I am not sure what a workable, simple example ought to look like. > Maybe a single psql script issuing the pg_backup_start and the > pg_backup_stop, with a tar command in between with \! ? > > Several bricks shy of production-ready, but it would give the idea. Another option might be to just remove this section. The top of the section mentions that this is easily done using pg_basebackup with the -X parameter. The bottom part of the section includes more complicated steps for when "more flexibility in copying the backup files is needed..." AFAICT the more complicated strategy was around before pg_basebackup, and the pg_basebackup recommendation was added in 2012 as part of 920febd. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Tue, Mar 08, 2022 at 03:09:50PM -0600, David Steele wrote: > On 3/8/22 14:01, Nathan Bossart wrote: >> On Wed, Mar 02, 2022 at 02:23:51PM -0500, Chapman Flack wrote: >> > I did not notice this earlier (sorry), but there seems to remain in >> > backup.sgml a programlisting example that shows a psql invocation >> > for pg_backup_start, then a tar command, then another psql invocation >> > for pg_backup_stop. >> > >> > I think that was only workable for the exclusive mode, and now it is >> > necessary to issue pg_backup_start and pg_backup_stop in the same session. >> > >> > (The 'touch backup_in_progress' business seems a bit bogus now too, >> > suggesting an exclusivity remembered from bygone days.) >> > >> > I am not sure what a workable, simple example ought to look like. >> > Maybe a single psql script issuing the pg_backup_start and the >> > pg_backup_stop, with a tar command in between with \! ? >> > >> > Several bricks shy of production-ready, but it would give the idea. >> >> Another option might be to just remove this section. The top of the >> section mentions that this is easily done using pg_basebackup with the -X >> parameter. The bottom part of the section includes more complicated steps >> for when "more flexibility in copying the backup files is needed..." >> AFAICT the more complicated strategy was around before pg_basebackup, and >> the pg_basebackup recommendation was added in 2012 as part of 920febd. >> Thoughts? > > This makes sense to me. I think pg_basebackup is far preferable to doing > anything like what is described in this section. Unless you are planning to > do something fancy (parallelization, snapshots, object stores, etc.) then > pg_basebackup is the way to go. I spent some time trying to come up with a workable script to replace the existing one. I think the main problem is that you need to write out both the backup label file and the tablespace map file, but I didn't find an easy way to write the different output columns of pg_backup_stop() to separate files via psql. We'd probably need to write out the steps in prose like the 'Making a Base Backup Using the Low Level API' section does. Ultimately, I just removed everything beyond the pg_basebackup recommendation in the 'Standalone Hot Backups' section. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 8b6fa9d0b7794e3c37f79f3c71472b2c6adabe47 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v5 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 222 +--- doc/src/sgml/func.sgml| 99 +--- doc/src/sgml/high-availability.sgml | 6 +- doc/src/sgml/monitoring.sgml | 4 +- doc/src/sgml/ref/pgupgrade.sgml | 2 +- src/backend/access/transam/xlog.c | 493 ++ src/backend/access/transam/xlogfuncs.c| 253 ++--- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/catalog/system_functions.sql | 18 +- src/backend/postmaster/postmaster.c | 46 +- src/backend/replication/basebackup.c | 20 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_ctl/pg_ctl.c | 4 +- src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 7 +- src/include/catalog/pg_control.h | 2 +- src/include/catalog/pg_proc.dat | 28 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +- .../t/010_logical_decoding_timelines.pl | 4 +- 20 files changed, 189 insertions(+), 1091 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..79eeb25eee 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. - -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. - - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
Re: parse/analyze API refactoring
On Wed, Mar 09, 2022 at 11:35:32AM +0100, Peter Eisentraut wrote: > I have committed my original patches. I'll leave the above-mentioned topic > as ideas for the future. Sounds good. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Wed, Mar 09, 2022 at 02:32:24PM -0500, Chapman Flack wrote: > On 03/09/22 12:19, Stephen Frost wrote: >> Let's avoid hijacking this thread, which is about this >> patch, for an independent debate about what our documentation should or >> shouldn't include. > > Agreed. New thread here: > > https://www.postgresql.org/message-id/6228FFE4.3050309%40anastigmatix.net Great. Is there any additional feedback on this patch? Should we add an example of using pg_basebackup in the "Standalone Hot Backups" section, or should we leave all documentation additions like this for Chap's new thread? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Wed, Mar 09, 2022 at 06:11:19PM -0500, Chapman Flack wrote: > I think the listitem > > In the same connection as before, issue the command: > SELECT * FROM pg_backup_stop(true); > > would be clearer if it used named-parameter form, wait_for_archive => true. > > This is not strictly necessary, of course, for a function with a single > IN parameter, but it's good documentation (and also could save us headaches > like these if there is ever another future need to give it more parameters). > > That listitem doesn't say anything about what the parameter means, which > is a little weird, but probably ok because the next listitem does go into > it in some detail. I don't think a larger reorg is needed to bring that > language one listitem earlier. Just naming the parameter is probably > enough to make it less puzzling (or adding in that listitem, at most, > "the effect of the wait_for_archive parameter is explained below"). > > For consistency (and the same futureproofing benefit), I'd go to > fast => false in the earlier pg_backup_start as well. > > I'm more ambivalent about label => 'label'. It would be consistent, > but should we just agree for conciseness that there will always be > a label and it will always be first? > > You can pretty much tell in a call what's a label; it's those anonymous > trues and falses that are easier to read with named notation. Done. I went ahead and added "label => 'label'" for consistency. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 904d5b9f77262fe0ea740de86ca41c7dbaf210ef Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v6 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 222 +--- doc/src/sgml/func.sgml| 99 +--- doc/src/sgml/high-availability.sgml | 6 +- doc/src/sgml/monitoring.sgml | 4 +- doc/src/sgml/ref/pgupgrade.sgml | 2 +- src/backend/access/transam/xlog.c | 493 ++ src/backend/access/transam/xlogfuncs.c| 253 ++--- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/catalog/system_functions.sql | 18 +- src/backend/postmaster/postmaster.c | 46 +- src/backend/replication/basebackup.c | 20 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_ctl/pg_ctl.c | 4 +- src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 7 +- src/include/catalog/pg_control.h | 2 +- src/include/catalog/pg_proc.dat | 28 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +- .../t/010_logical_decoding_timelines.pl | 4 +- 20 files changed, 189 insertions(+), 1091 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..c8b914c1aa 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. - -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. - - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -881,19 +873,19 @@ test ! 
-f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 Connect to the server (it does not matter which database) as a user with - rights to run pg_start_backup (superuser, or a user who has been granted + rights to run pg_backup_start (superuser, or a user who has been granted EXECUTE on the function) and issue the command: -SELECT pg_start_backup('label', false, false); +SELECT pg_backup_start(label => 'label', fast => false); where label is any string you want to use to uniquely identify this backup operation. The connection - calling pg_start_backup must be maintained until the end of + calling pg_backup_start must be maintained until the end of the backup, or the backup will be automatically aborted. - By default, pg_start_backup can take a long time to finish. + By default, pg_backup_start can take a long time to finish. This is because it performs a checkpoint, and t
Re: Add index scan progress to pg_stat_progress_vacuum
I took a look at the latest patch set. + + indexes_total bigint + + + The number of indexes to be processed in the + vacuuming indexes + or cleaning up indexes phase. It is set to + 0 when vacuum is not in any of these phases. + Could we avoid resetting it to 0 unless INDEX_CLEANUP was turned off or failsafe kicked in? It might be nice to know how many indexes the vacuum intends to process. I don't feel too strongly about this, so if it would add a lot of complexity, it might be okay as is. BTreeShmemInit(); SyncScanShmemInit(); AsyncShmemInit(); + vacuum_worker_init(); Don't we also need to add the size of the hash table to CalculateShmemSize()? + * A command type can optionally define a callback function + * which will derive Datum values rather than use values + * directly from the backends progress array. I think this can be removed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add index scan progress to pg_stat_progress_vacuum
On Thu, Mar 10, 2022 at 09:30:57PM +, Imseih (AWS), Sami wrote: > Attached v4 which includes accounting for the hash size on startup, removal > of the no longer needed comment in pgstatfuncs.c and a change in both > code/docs to only reset the indexes_total to 0 when failsafe is triggered. Thanks for the new patch set. +/* + * Structs for tracking shared Progress information + * amongst worker ( and leader ) processes of a vacuum. + */ nitpick: Can we remove the extra spaces in the parentheses? +if (entry != NULL) +values[PGSTAT_NUM_PROGRESS_COMMON + PROGRESS_VACUUM_INDEXES_COMPLETED] = entry->indexes_processed; What does it mean if there isn't an entry in the map? Is this actually expected, or should we ERROR instead? +/* vacuum worker progress hash table */ +max_table_size = GetMaxBackends(); +size = add_size(size, hash_estimate_size(max_table_size, + sizeof(VacProgressEntry))); I think the number of entries should be shared between VacuumWorkerProgressShmemInit() and VacuumWorkerProgressShmemSize(). Otherwise, we might update one and not the other. +/* Call the command specific function to override datum values */ +if (pg_strcasecmp(cmd, "VACUUM") == 0) +set_vaccum_worker_progress(values); I think we should elaborate a bit more in this comment. It's difficult to follow what this is doing without referencing the comment above set_vacuum_worker_progress(). IMO the patches are in decent shape, and this should likely be marked as ready-for-committer in the near future. Before doing so, I think we should check that Sawada-san is okay with moving the deeper infrastructure changes to a separate thread. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
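To illustrate the point about sharing the entry count, here is a rough sketch of what I have in mind; apart from the names already quoted above (GetMaxBackends(), hash_estimate_size(), VacProgressEntry), everything here is invented for illustration and is not the patch's actual code:

static HTAB *VacProgressHash;   /* assumed handle for the shared hash table */

/* single source of truth for the number of entries */
static long
VacuumWorkerProgressMaxEntries(void)
{
    return GetMaxBackends();
}

Size
VacuumWorkerProgressShmemSize(void)
{
    return hash_estimate_size(VacuumWorkerProgressMaxEntries(),
                              sizeof(VacProgressEntry));
}

void
VacuumWorkerProgressShmemInit(void)
{
    HASHCTL     info;

    info.keysize = sizeof(int);     /* key type assumed for illustration */
    info.entrysize = sizeof(VacProgressEntry);

    VacProgressHash = ShmemInitHash("Vacuum Worker Progress",
                                    VacuumWorkerProgressMaxEntries(),
                                    VacuumWorkerProgressMaxEntries(),
                                    &info,
                                    HASH_ELEM | HASH_BLOBS);
}

With something along these lines, CalculateShmemSize() can simply add VacuumWorkerProgressShmemSize(), and the sizing and initialization paths cannot disagree on the entry count.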
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Thu, Mar 10, 2022 at 07:13:14PM -0500, Chapman Flack wrote: > Looks like this change to an example in func.sgml is not quite right: > > -postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); > +postgres=# SELECT * FROM pg_walfile_name_offset(pg_backup_stop()); > > pg_backup_stop returns a record now, not just lsn. So this works for me: > > +postgres=# SELECT * FROM pg_walfile_name_offset((pg_backup_stop()).lsn); Ah, good catch. I made this change in v7. I considered doing something like this SELECT w.* FROM pg_backup_stop() b, pg_walfile_name_offset(b.lsn) w; but I think your suggestion is simpler. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 3283d2b85f38f46d1e2ada0e6c5ea59d8c8e9f9d Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v7 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 222 +--- doc/src/sgml/func.sgml| 99 +--- doc/src/sgml/high-availability.sgml | 6 +- doc/src/sgml/monitoring.sgml | 4 +- doc/src/sgml/ref/pgupgrade.sgml | 2 +- src/backend/access/transam/xlog.c | 493 ++ src/backend/access/transam/xlogfuncs.c| 253 ++--- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/catalog/system_functions.sql | 18 +- src/backend/postmaster/postmaster.c | 46 +- src/backend/replication/basebackup.c | 20 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_ctl/pg_ctl.c | 4 +- src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 7 +- src/include/catalog/pg_control.h | 2 +- src/include/catalog/pg_proc.dat | 28 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +- .../t/010_logical_decoding_timelines.pl | 4 +- 20 files changed, 189 insertions(+), 1091 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..c8b914c1aa 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. - -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. - - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0 Connect to the server (it does not matter which database) as a user with - rights to run pg_start_backup (superuser, or a user who has been granted + rights to run pg_backup_start (superuser, or a user who has been granted EXECUTE on the function) and issue the command: -SELECT pg_start_backup('label', false, false); +SELECT pg_backup_start(label => 'label', fast => false); where label is any string you want to use to uniquely identify this backup operation. The connection - calling pg_start_backup must be maintained until the end of + calling pg_backup_start must be maintained until the end of the backup, or the backup will be automatically aborted. - By default, pg_start_backup can take a long time to finish. + By default, pg_backup_start can take a long time to finish. 
This is because it performs a checkpoint, and the I/O required for the checkpoint will be spread out over a significant period of time, by default half your inter-checkpoint interval @@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false); issue an immediate checkpoint using as much I/O as available. - - The third parameter being false tells - pg_start_backup to initiate a non-exclusive base backup. - @@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false); In the same connection as before, issue the command: -SELECT * FROM pg_stop_backup(false, true); +SELECT * FROM pg_backup_stop(wait_for_archive => true); This terminates backup mode. On a primary, it also performs an automatic switch to the next WAL segment. On a standby, it is not possible to @@ -937,7 +925,7 @@ SELECT * FROM pg_stop_backup(false, true); ready to archive. - The pg_stop_b
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Fri, Mar 11, 2022 at 02:00:37PM -0500, Chapman Flack wrote: > v7 looks good to me. I have repeated the installcheck-world and given > the changed documentation sections one last read-through, and will > change the status to RfC. Thanks for reviewing! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
+CheckpointStats.repl_map_cutoff_lsn = cutoff; Could we set repl_map_cutoff_lsn closer to where it is calculated? Right now, it's set at the bottom of the function just before the directory is freed. Is there a strong reason to do it there? +"logical rewrite mapping file(s) removed/synced=" UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X", Can we split out the "removed" versus "synced" files? Otherwise, I think this is basically just telling us how many files are in the mappings directory. +CheckpointStats.repl_snap_cutoff_lsn = cutoff; I have the same note for this one as I have for repl_map_cutoff_lsn. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add index scan progress to pg_stat_progress_vacuum
On Sat, Mar 12, 2022 at 07:00:06AM +, Imseih (AWS), Sami wrote: >> nitpick: Can we remove the extra spaces in the parentheses? > > fixed > >> What does it mean if there isn't an entry in the map? Is this actually >> expected, or should we ERROR instead? > > I cleaned up the code here and added comments. > >> I think the number of entries should be shared between >> VacuumWorkerProgressShmemInit() and VacuumWorkerProgressShmemSize(). >> Otherwise, we might update one and not the other. > > Fixed > >> I think we should elaborate a bit more in this comment. It's difficult to >> follow what this is doing without referencing the comment above >> set_vacuum_worker_progress(). > > More comments added > > I also simplified the 0002 patch as well. These patches look pretty good to me. Barring additional feedback, I intend to mark this as ready-for-committer early next week. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimize external TOAST storage
I think this is an interesting idea. My first thought is that it would be nice if we didn't have to first compress the data to see whether we wanted to store it compressed or not, but I don't think there is much we can do about that. In any case, we aren't adding much work in the write path compared to the potential savings in the read path, so that is probably okay. Do you think it is worth making this configurable? I don't think it is outside the realm of possibility for some users to care more about disk space than read performance. And maybe the threshold for this optimization could differ based on the workload. extern void toast_tuple_externalize(ToastTupleContext *ttc, int attribute, int options); +extern void toast_tuple_opt_externalize(ToastTupleContext *ttc, int attribute, +int options, Datum old_toast_value, +ToastAttrInfo *old_toast_attr); Could we bake this into toast_tuple_externalize() so that all existing callers benefit from this optimization? Is there a reason to export both functions? Perhaps toast_tuple_externalize() should be renamed and made static, and then this new function could be called toast_tuple_externalize() (which would be a wrapper around the internal function). +/* Sanity check: if data is not compressed then we can proceed as usual. */ +if (!VARATT_IS_COMPRESSED(DatumGetPointer(*value))) +toast_tuple_externalize(ttc, attribute, options); With a --enable-cassert build, this line causes assertion failures in the call to GetMemoryChunkContext() from pfree(). 'make check' is enough to reproduce it. Specifically, it fails the following assertion: AssertArg(MemoryContextIsValid(context)); I didn't see an existing commitfest entry for this patch. I'd encourage you to add one in the July commitfest: https://commitfest.postgresql.org/38/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
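To sketch the wrapper idea structurally (the internal function name is mine, and the elided bodies stand in for the existing logic and the patch's new decision code, so this is only an outline):

/* existing logic, renamed and made static */
static void
toast_tuple_externalize_internal(ToastTupleContext *ttc, int attribute,
                                 int options)
{
    /* ... current externalization code, unchanged ... */
}

/* wrapper keeps the old name, so every existing caller gets the optimization */
void
toast_tuple_externalize(ToastTupleContext *ttc, int attribute, int options)
{
    Datum  *value = &ttc->ttc_values[attribute];

    /* if the data is not compressed, proceed exactly as before */
    if (!VARATT_IS_COMPRESSED(DatumGetPointer(*value)))
    {
        toast_tuple_externalize_internal(ttc, attribute, options);
        return;
    }

    /* ... otherwise decide whether to keep the compressed data ... */
}

That keeps the exported API unchanged and avoids a second public entry point.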
Re: Allow async standbys wait for sync replication
On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote: > To me it's architecturally the completely wrong direction. We should move in > the *other* direction, i.e. allow WAL to be sent to standbys before the > primary has finished flushing it locally. Which requires similar > infrastructure to what we're discussing here. I think this is a good point. After all, WALRead() has the following comment: * XXX probably this should be improved to suck data directly from the * WAL buffers when possible. Once you have all the infrastructure for that, holding back WAL replay on async standbys based on synchronous replication might be relatively easy. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote: > Thanks for reviewing this. I agree with all of the above suggestions > and incorporated them in the v2 patch. Thanks for the new patch. > Another thing I added in v2 is to not emit snapshot and mapping files > stats in case of restartpoint as logical decoding isn't supported on > standbys, so it doesn't make sense to emit the stats there and cause > server log to grow unnecessarily. Having said that, I added a note > there to change it whenever logical decoding on standbys is supported. I think we actually do want to include this information for restartpoints since some files might be left over from the snapshot that was used to create the standby. Also, presumably these functions could do some work during recovery on a primary. Another problem I see is that this patch depends on the return value of the lstat() calls that we are trying to remove in 0001 from another thread [0]. I think the size of the removed/sync'd files is somewhat useful for understanding disk space usage, but I suspect the time spent performing these tasks is more closely related to the number of files. Do you think reporting the sizes is worth the extra system call? [0] https://postgr.es/m/20220215231123.GA2584239%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Mon, Mar 14, 2022 at 05:03:59PM +0530, Bharath Rupireddy wrote: > On Mon, Mar 14, 2022 at 1:33 PM Michael Paquier wrote: >> On Mon, Mar 14, 2022 at 10:54:56AM +0530, Bharath Rupireddy wrote: >> > Yes, this is a concern. Also, when there were no logical replication >> > slots on a plain server or the server removed or cleaned up all the >> > snapshot/mappings files, why would anyone want to have these messages >> > with all 0s in the server log? >> >> The default settings don't enable that, so making things conditional >> roughly as you are suggesting with two different LOG messages sounds >> rather fine. > > Attaching v3 patch which emits logs only when necessary and doesn't > clutter the existing "checkpoint/restartpoint completed" message, see > some sample logs at [1]. Please review it further. I'm okay with not adding these statistics when the number of files removed and sync'd is zero. IMO we shouldn't include the size of the files since it will prevent us from removing lstat() calls. As noted upthread, I suspect the time spent in these tasks is more closely related to the number of files anyway. I'm -1 on splitting these new statistics to separate LOGs. In addition to making it more difficult to discover statistics for a given checkpoint, I think it actually adds even more bloat to the server log. If we are concerned about the amount of information in these LOGs, perhaps we should adjust the format to make it more human-readable. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Estimating HugePages Requirements?
On Mon, Mar 14, 2022 at 04:26:43PM +0100, Magnus Hagander wrote: > Nothing fixing this ended up actually getting committed, right? That > is, we still get the extra log output? Correct. > And in fact, the command documented on > https://www.postgresql.org/docs/devel/kernel-resources.html doesn't > actually produce the output that the docs show, it also shows the log > line, in the default config? If we can't fix the extra logging we > should at least have our docs represent reality -- maybe by adding a > "2>/dev/null" entry? But it'd be better to have it not output that log > in the first place... I attached a patch to adjust the documentation for now. This apparently crossed my mind earlier [0], but I didn't follow through with it for some reason. > (Of course what I'd really want is to be able to run it on a cluster > that's running, but that was discussed downthread so I'm not bringing > that one up for changes now) I think it is worth revisiting the extra logging and the ability to view runtime-computed GUCs on a running server. Should this be an open item for v15, or do you think it's alright to leave it for the v16 development cycle? [0] https://postgr.es/m/C45224E1-29C8-414C-A8E6-B718C07ACB94%40amazon.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7ee7b176c5280349631426ff047a9df394e26d59 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 14 Mar 2022 10:24:48 -0700 Subject: [PATCH v1 1/1] send stderr to /dev/null in Linux Huge Pages documentation --- doc/src/sgml/runtime.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index f77ed24204..85b3ffcd71 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -1448,7 +1448,7 @@ export PG_OOM_ADJUST_VALUE=0 server must be shut down to view this runtime-computed parameter. This might look like: -$ postgres -D $PGDATA -C shared_memory_size_in_huge_pages +$ postgres -D $PGDATA -C shared_memory_size_in_huge_pages 2> /dev/null 3170 $ grep ^Hugepagesize /proc/meminfo Hugepagesize: 2048 kB -- 2.25.1
Re: Optionally automatically disable logical replication subscriptions on error
My compiler is worried that syncslotname may be used uninitialized in start_table_sync(). The attached patch seems to silence this warning. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 05e4e03af5afa1658ede8d78b31c1c999b5c7deb Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 14 Mar 2022 16:00:18 -0700 Subject: [PATCH v1 1/1] silence compiler warning --- src/backend/replication/logical/worker.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index a1fe81b34f..03e069c7cd 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -3387,7 +3387,7 @@ TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid) static void start_table_sync(XLogRecPtr *origin_startpos, char **myslotname) { - char *syncslotname; + char *syncslotname = NULL; Assert(am_tablesync_worker()); -- 2.25.1
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Tue, Mar 15, 2022 at 11:04:26AM +0900, Michael Paquier wrote: > On Mon, Mar 14, 2022 at 03:54:19PM +0530, Bharath Rupireddy wrote: >> At times, the snapshot or mapping files can be large in number and one >> some platforms it takes time for checkpoint to process all of them. >> Having the stats about them in server logs can help us better analyze >> why checkpoint took a long time and provide a better RCA. > > Do you have any numbers to share regarding that? Seeing information > about 1k WAL segments being recycled and/or removed by a checkpoint > where the operation takes dozens of seconds to complete because we can > talk about hundred of gigs worth of files moved around. If we are > talking about 100~200 files up to 10~20kB each for snapshot and > mapping files, the information has less value, worth only a portion of > one WAL segment. I don't have specific numbers to share, but as noted elsewhere [0], I routinely see lengthy checkpoints that spend a lot of time in these cleanup tasks. [0] https://postgr.es/m/18ED8B1F-7F5B-4ABF-848D-45916C938BC7%40amazon.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimize external TOAST storage
On Wed, Mar 16, 2022 at 01:03:38AM +0530, davinder singh wrote: > Regarding the 2nd part of configuring the threshold, Based on our > experiments, we have fixed it for the attributes with size > 2 * chunk_size. > The default chunk_size is 2KB and the page size is 8KB. While toasting each > attribute is divided into chunks, and each page can hold a max of 4 such > chunks. > We only need to think about the space used by the last chunk of the > attribute. > This means with each value optimization, it might use extra space in the > range > (0B,2KB]. I think this extra space is independent of attribute size. So we > don't > need to worry about configuring this threshold. Let me know if I missed > something > here. Looking closer, ISTM there are two thresholds here. The first threshold is the minimum size of the uncompressed data required to trigger this optimization. The purpose behind this threshold is to avoid increasing disk space usage in return for very little gain in the read path (i.e., skipping decompression). Specifically, this threshold helps avoid increasing disk space usage when there are many small TOAST attributes. You've chosen 2x the TOAST chunk size as the threshold. The second threshold is the compression ratio required to trigger this optimization. The purpose behind this threshold is to avoid storing poorly compressed data so that we can improve performance in the read path (i.e., skipping decompression). You've chosen >=1 TOAST chunk as the threshold. This means that when I have an attribute with size 3x a TOAST chunk, I'll only store it compressed when the compressed attribute is 2x a TOAST chunk or smaller (a 33% improvement). If I have an attribute with size 100x a TOAST chunk, I'll store it compressed when the compressed attribute is 99x a TOAST chunk or smaller (a 1% improvement). IIUC this means that it is much more likely that you will store a compressed attribute when the uncompressed data is larger. The benefit of this approach is that you know you'll never give up more than one TOAST chunk of disk space in the name of faster reads. The downside is that a huge number of smaller attributes can require much more disk space than if they were universally compressed. Perhaps this is missing one additional threshold. Perhaps there should be a compression ratio for which the optimization is ignored. If my uncompressed data is 3x a TOAST chunk and compressing it would save half of a TOAST chunk of disk space (a ~17% improvement), maybe we should still store it compressed. I think this threshold would be most important for servers with many small TOAST attributes. I don't know whether it is necessary to make all of this configurable or whether we should just use some heuristics to find a good balance. In any case, I imagine the settings should look different for the different TOAST compression methods. Or maybe we should just make the first threshold the configuration option for this feature. If you set it to -1, the optimization is never used. If you set it to 0, you always try to save on read time at the expense of disk space. If you set it to 2x the TOAST chunk size, you might use up to 50% more disk space for read path optimizations. If you set it to 100x the TOAST chunk size, you might use up to 1% more disk space for read optimizations. By default, this configuration option could be set conservatively (e.g., 10x a TOAST chunk, which could lead to a maximum of 10% more disk space). 
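To make those two thresholds concrete, here is a rough sketch of the decision rule as I understand it (the function and parameter names are mine, not the patch's):

/*
 * Returns true if the compressed form should be kept when the value is
 * stored externally.
 */
static bool
keep_compressed(Size orig_size, Size cmp_size, Size chunk_size)
{
    /* first threshold: small values are always stored compressed */
    if (orig_size < 2 * chunk_size)
        return true;

    /* second threshold: compression must save at least one full chunk */
    return (orig_size - cmp_size) >= chunk_size;
}

The first branch is what could become the configuration option described above.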
I suspect the increase in disk space would be much less than these numbers in most cases, but it might be helpful to think of these things in terms of the worst case scenario. I apologize for thinking out loud a bit here, but I hope this gives you some insight into my perspective on this. In general, I am skeptical that we can choose one threshold that will work for all PostgreSQL installations in the known universe. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Estimating HugePages Requirements?
On Tue, Mar 15, 2022 at 11:02:37PM +0100, Magnus Hagander wrote: > I think we're talking about two different things here. > > I think the "avoid extra logging" would be worth seeing if we can > address for 15. A simple approach could be to just set log_min_messages to PANIC before exiting. I've attached a patch for this. With this patch, we'll still see a FATAL if we try to use 'postgres -C' for a runtime-computed GUC on a running server, and there will be no extra output as long as the user sets log_min_messages to INFO or higher (i.e., not a DEBUG* value). For comparison, 'postgres -C' for a non-runtime-computed GUC does not emit extra output as long as the user sets log_min_messages to DEBUG2 or higher. > The "able to run on a live cluster" seems a lot bigger and more scary > and not 15 material. +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cdfd1ad00ca1d8afbfcbafc1f684b5ba9cc43eb6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 15 Mar 2022 15:36:41 -0700 Subject: [PATCH v2 1/1] don't emit shutdown messages for 'postgres -C' with runtime-computed GUCs --- src/backend/postmaster/postmaster.c | 4 1 file changed, 4 insertions(+) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 80bb269599..bf48bc6326 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1066,6 +1066,10 @@ PostmasterMain(int argc, char *argv[]) false, false); puts(config_val ? config_val : ""); + + /* don't emit shutdown messages */ + SetConfigOption("log_min_messages", "PANIC", PGC_INTERNAL, PGC_S_OVERRIDE); + ExitPostmaster(0); } -- 2.25.1
Re: USE_BARRIER_SMGRRELEASE on Linux?
On Wed, Mar 16, 2022 at 06:40:23PM +1300, Thomas Munro wrote: > Pushed and back-patched (it's slightly different before 12). Thanks! Thank you! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimize external TOAST storage
On Wed, Mar 16, 2022 at 10:08:45AM -0400, Robert Haas wrote: > I would like to take a slightly contrary position. I think that a > system here that involves multiple knobs is going to be too > complicated to be of any real-world use, because almost nobody will > understand it or tune it properly for their installation. And who is > to say that a single setting would even be appropriate for a whole > installation, as opposed to something that needs to be tuned for each > individual table? A single tunable might be OK, but what I would > really like here is a heuristic that, while perhaps not optimal in > every environment, is good enough that a very high percentage of users > will never need to worry about it. In a case like this, a small gain > that happens automatically feels better than a large gain that > requires manual tweaking in every install. Agreed. If there is a tunable, it needs to be as simple as possible. And ideally most users would never need to use it because the default would be "good enough." > Now that doesn't answer the question of what that heuristic should be. > I initially thought that if compression didn't end up reducing the > number of TOAST chunks then there was no point, but that's not true, > because having the last chunk be smaller than the rest can allow us to > fit more than 4 into a page rather than exactly 4. However, this > effect isn't likely to be important if most chunks are full-size > chunks. If we insert 100 full-size chunks and 1 partial chunk at the > end, we can't save much even if that chunk ends up being 1 byte > instead of a full-size chunk. 25 TOAST table pages out of 26 are going > to be full of full-size chunks either way, so we can't save more than > ~4% and we might easily save nothing at all. As you say, the potential > savings are larger as the values get smaller, because a higher > proportion of the TOAST table pages are not quite full. So I think the > only cases where it's even interesting to think about suppressing this > optimization are those where the value is quite small -- and maybe the > thing to do is just pick a conservatively large threshold and call it > good. > > For example, say that instead of applying this technique when there > are at least 2 TOAST chunks, we apply it when there are at least 500 > (i.e. normally 1MB). It's hard for me to imagine that we can ever > lose, and in fact I think we could come down quite a bit from there > and still end up winning pretty much all the time. Setting the > threshold to, say, 50 or 100 or 150 TOAST chunks instead of 2 may > leave some money on the table, but if it means that we don't have > meaningful downsides and we don't have tunables for users to fiddle > with, it might be worth it. As long as we can demonstrate some decent improvements, I think using a conservative threshold is a good idea. I do wonder whether this could weaken the read performance gains quite a bit, though. Thinking further, is simply reducing the number of TOAST chunks the right thing to look at? If I want to add a TOAST attribute that requires 100,000 chunks, and you told me that I could save 10% in the read path for an extra 250 chunks of disk space, I would probably choose read performance. If I wanted to add 100,000 attributes that were each 3 chunks, and you told me that I could save 10% in the read path for an extra 75,000 chunks of disk space, I might choose the extra disk space.
These are admittedly extreme (and maybe even impossible) examples, but my point is that the amount of disk space you are willing to give up may be related to the size of the attribute. And maybe one way to extract additional read performance with this optimization is to use a variable threshold so that we are more likely to use it for large attributes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Wed, Mar 16, 2022 at 03:02:41PM +0530, Bharath Rupireddy wrote: > On Mon, Mar 14, 2022 at 10:34 PM Nathan Bossart > wrote: >> I'm -1 on splitting these new statistics to separate LOGs. In addition to >> making it more difficult to discover statistics for a given checkpoint, I >> think it actually adds even more bloat to the server log. If we are >> concerned about the amount of information in these LOGs, perhaps we should >> adjust the format to make it more human-readable. > > Below are the ways that I can think of. Thoughts? I think I prefer the first option. If these tasks don't do anything, we leave the stats out of the message. If they do some work, we add them. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimize external TOAST storage
On Wed, Mar 16, 2022 at 11:36:56AM -0700, Nathan Bossart wrote: > Thinking further, is simply reducing the number of TOAST chunks the right > thing to look at? If I want to add a TOAST attribute that requires 100,000 > chunks, and you told me that I could save 10% in the read path for an extra > 250 chunks of disk space, I would probably choose read performance. If I > wanted to add 100,000 attributes that were each 3 chunks, and you told me > that I could save 10% in the read path for an extra 75,000 chunks of disk > space, I might choose the extra disk space. These are admittedly extreme > (and maybe even impossible) examples, but my point is that the amount of > disk space you are willing to give up may be related to the size of the > attribute. And maybe one way to extract additional read performance with > this optimization is to use a variable threshold so that we are more likely > to use it for large attributes. I might be overthinking this. Maybe it is enough to skip compressing the attribute whenever compression saves no more than some percentage of the uncompressed attribute size. A conservative default setting might be something like 5% or 10%. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add index scan progress to pg_stat_progress_vacuum
On Wed, Mar 16, 2022 at 09:47:49PM +, Imseih (AWS), Sami wrote: > Spoke to Nathan offline and fixed some more comments/nitpicks in the patch. I don't have any substantial comments for v9, so I think this can be marked as ready-for-committer. However, we probably should first see whether Sawada-san has any comments on the revised approach. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimize external TOAST storage
On Thu, Mar 17, 2022 at 01:04:17PM -0400, Robert Haas wrote: > Right, so perhaps the ultimate thing here would be a more fine-grained > knob than SET STORAGE EXTERNAL -- something that allows you to specify > that you want to compress only when it really helps. While some people > might find that useful, I think the current patch is less ambitious, > and I think that's OK. It just wants to save something in the cases > where it's basically free. Unfortunately we've learned that it's never > *entirely* free because making the last TOAST chunk longer can always > cost you something, even if it gets longer by only 1 byte. But for > larger values it's hard for that to be significant. I guess I think we should be slightly more ambitious. One idea could be to create a default_toast_compression_ratio GUC with a default of 0.95. This means that, by default, a compressed attribute must be 0.95x or less of the size of the uncompressed attribute to be stored compressed. Like default_toast_compression, this could also be overridden at the column level with something like ALTER TABLE mytbl ALTER mycol SET COMPRESSION lz4 RATIO 0.9; If the current "basically free" patch is intended for v15, then maybe all this extra configurability stuff could wait for a bit, especially if we can demonstrate a decent read performance boost with a more conservative setting. However, I don't see anything terribly complicated about the proposed configurability changes (assuming my proposal makes some amount of sense to you and others). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
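To illustrate, the check itself could be as simple as the following sketch; the GUC variable and the helper are hypothetical, and the wiring into the TOAST code is omitted:

/* hypothetical GUC, default 0.95 */
double      default_toast_compression_ratio = 0.95;

static bool
compression_is_worthwhile(Size orig_size, Size cmp_size)
{
    /* keep the compressed form only if it is at most ratio * original */
    return (double) cmp_size <=
           (double) orig_size * default_toast_compression_ratio;
}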
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Thu, Mar 17, 2022 at 06:48:43PM +0530, Bharath Rupireddy wrote: > Personally, I tend to agree with v4-0001 (option (4)) or v4-0002 > (option (3)) than v4-0003 (option (1)) as it adds more unreadability > with most of the code duplicated creating more diff with previous > versions and maintainability problems. Having said that, I will leave > it to the committer to decide on that. I don't think v4-0003/option 1 needs to be unreadable. Perhaps we can use a StringInfo to build the message. That might be a net improvement in readability anyway. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: O(n) tasks cause lengthy startups and checkpoints
It seems unlikely that anything discussed in this thread will be committed for v15, so I've adjusted the commitfest entry to v16 and moved it to the next commitfest. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Pre-allocating WAL files
It seems unlikely that this will be committed for v15, so I've adjusted the commitfest entry to v16 and moved it to the next commitfest. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: support for CREATE MODULE
It seems unlikely that this will be committed for v15. Swaha, should the commitfest entry be adjusted to v16 and moved to the next commitfest? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Patch to avoid orphaned dependencies
Bertrand, do you think this has any chance of making it into v15? If not, are you alright with adjusting the commitfest entry to v16 and moving it to the next commitfest? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: support for CREATE MODULE
On Thu, Mar 17, 2022 at 04:26:31PM -0700, Swaha Miller wrote: > On Thu, Mar 17, 2022 at 4:16 PM Nathan Bossart > wrote: > >> It seems unlikely that this will be committed for v15. Swaha, should the >> commitfest entry be adjusted to v16 and moved to the next commitfest? >> >> > Yes please, thank you Nathan Done! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
I think this one requires some more work, and it needn't be a priority for v15, so I've adjusted the commitfest entry to v16 and moved it to the next commitfest. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Fri, Mar 18, 2022 at 01:32:52PM +0530, Bharath Rupireddy wrote: > Well, here's the v5 patch, sample output at [1]. Please have a look at it. I think this is on the right track, but I would even take it a step further by separating each group of information: if (restartpoint) appendStringInfo(&logmsg, "restartpoint complete: "); else appendStringInfo(&logmsg, "checkpoint complete: "); /* buffer stats */ appendStringInfo(&logmsg, "wrote %d buffers (%.1f%%); ", CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers); /* WAL file stats */ appendStringInfo(&logmsg, "%d WAL file(s) added, %d removed, %d recycled; ", CheckpointStats.ckpt_segs_added, CheckpointStats.ckpt_segs_removed, CheckpointStats.ckpt_segs_recycled); ... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Mon, Mar 21, 2022 at 11:27:15AM +0530, Bharath Rupireddy wrote: > On Sat, Mar 19, 2022 at 3:16 AM Nathan Bossart > wrote: >> /* buffer stats */ >> appendStringInfo(&logmsg, "wrote %d buffers (%.1f%%); ", >> >> CheckpointStats.ckpt_bufs_written, >> (double) >> CheckpointStats.ckpt_bufs_written * 100 / NBuffers); >> >> /* WAL file stats */ >> appendStringInfo(&logmsg, "%d WAL file(s) added, %d removed, %d >> recycled; ", >> >> CheckpointStats.ckpt_segs_added, >> >> CheckpointStats.ckpt_segs_removed, >> >> CheckpointStats.ckpt_segs_recycled); > > Do we actually need to granularize the message like this? I actually > don't think so, because we print the stats even if they are 0 (buffers > written is 0 or WAL segments added is 0 and so on). I suggested it because I thought it would improve readability and simplify optionally adding new stats to the message. If there is another way to accomplish those things, that is fine by me. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
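For example, the new groups could then be appended only when they have something to report; the CheckpointStats field names below are placeholders for whatever the patch ends up using, not actual code:

/* logical decoding file stats, only when non-zero */
if (CheckpointStats.repl_snap_files_rmvd_cnt > 0)
    appendStringInfo(&logmsg,
                     "logical decoding snapshot file(s) removed=" UINT64_FORMAT "; ",
                     CheckpointStats.repl_snap_files_rmvd_cnt);

if (CheckpointStats.repl_map_files_rmvd_cnt > 0)
    appendStringInfo(&logmsg,
                     "logical rewrite mapping file(s) removed=" UINT64_FORMAT "; ",
                     CheckpointStats.repl_map_files_rmvd_cnt);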
Re: Estimating HugePages Requirements?
On Tue, Mar 15, 2022 at 03:44:39PM -0700, Nathan Bossart wrote: > A simple approach could be to just set log_min_messages to PANIC before > exiting. I've attached a patch for this. With this patch, we'll still see > a FATAL if we try to use 'postgres -C' for a runtime-computed GUC on a > running server, and there will be no extra output as long as the user sets > log_min_messages to INFO or higher (i.e., not a DEBUG* value). For > comparison, 'postgres -C' for a non-runtime-computed GUC does not emit > extra output as long as the user sets log_min_messages to DEBUG2 or higher. I created a commitfest entry for this: https://commitfest.postgresql.org/38/3596/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pgcrypto: Remove internal padding implementation
On Wed, Mar 16, 2022 at 11:12:06AM +0100, Peter Eisentraut wrote: > Right, the previous behaviors were clearly faulty. I have updated the > commit message to call out the behavior change more clearly. > > This patch is now complete from my perspective. I took a look at this patch and found nothing of concern. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Optimize external TOAST storage
On Tue, Mar 22, 2022 at 04:34:05PM -0400, Robert Haas wrote: > We seem to have a shortage of "others" showing up with opinions on > this topic, but I guess I'm not very confident about the general > utility of such a setting. Just to be clear, I'm also not very > confident about the usefulness of the existing settings for > controlling TOAST. Why is it useful default behavior to try to get > rows down to 2kB by default, rather than 1.8kB or 3kB? Even more, why > don't we try to compress primarily based on the length of individual > attributes and then compress further only if the resulting tuple > doesn't fit into a page at all? There doesn't seem to be anything > magic about fitting tuples into a quarter-page, yet the code acts as > though that's the primary goal - and then, when that didn't turn out > to work well in all cases, we added a user-settable parameter > (toast_tuple_target) to let you say you really want tuples in table X > to fit into a third of a page or a fifth of a page instead of a > quarter. And there's some justification for that: a proposal to > fundamentally change the algorithm would likely have gotten bogged > down for years, and the parameter no doubt lets you solve some > problems. Yet if the whole algorithm is wrong, and I think maybe it > is, then being able to change the constants is not really getting us > where we need to be. > > Then, too, I'm not very confident about the usefulness of EXTENDED, > EXTERNAL, and MAIN. I think it's useful to be able to categorically > disable compression (as EXTERNAL does), but you can't categorically > disable out-of-line storage because the value can be bigger than the > page, so MAIN vs. EXTENDED is just changing the threshold for the use > of out-of-line storage. However, it does so in a way that IMHO is not > particularly intuitive, which goes back to my earlier point about the > algorithm seeming wacky, and it's in any case unclear why we should > offer exactly two possible thresholds and not any of the intermediate > values. I agree with all of this. Adding configurability for each constant might help folks in the short term, but using these knobs probably requires quite a bit of expertise in Postgres internals as well as a good understanding of your data. I think we ought to revisit TOAST configurability from a user perspective. IOW what can be chosen automatically, and how do we enable users to effectively configure the settings that cannot be chosen automatically? IMO this is a worthwhile conversation to have as long as it doesn't stray too far into the "let's rewrite TOAST" realm. I think there is always going to be some level of complexity with stuff like TOAST, but there are probably all sorts of ways to simplify/improve it also. > Maybe the conclusion here is that more thought is needed before > changing anything in this area You've certainly got me thinking more about this. If the scope of this work is going to expand, a few months before the first v16 commitfest is probably the right time for it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
On Thu, Mar 17, 2022 at 04:45:28PM -0700, Nathan Bossart wrote: > I think this one requires some more work, and it needn't be a priority for > v15, so I've adjusted the commitfest entry to v16 and moved it to the next > commitfest. Here is a new patch. The main differences from v3 are in heapam_visibility.c. Specifically, instead of trying to work the infomask checks into the visibility logic, I added a new function that does a couple of assertions. This function is called at the beginning of each visibility function. What do folks think? The options I've considered are 1) not adding any such checks to heapam_visibility.c, 2) only adding assertions like the attached patch, or 3) actually using elog(ERROR, ...) when the invalid bit patterns are detected. AFAICT (1) is more in line with existing invalid bit patterns (e.g., XMAX_COMMITTED + XMAX_IS_MULTI). There are a couple of scattered assertions, but most code paths don't check for it. (2) adds additional checks, but only for --enable-cassert builds. (3) would add checks even for non-assert builds, but there would presumably be some performance cost involved. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 2d6b372cf61782e0fd52590b57b1c914b0ed7a4c Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 22 Mar 2022 15:35:34 -0700 Subject: [PATCH v4 1/1] disallow XMAX_COMMITTED + XMAX_LOCK_ONLY --- contrib/amcheck/verify_heapam.c | 19 src/backend/access/heap/README.tuplock | 2 +- src/backend/access/heap/heapam.c| 10 +++ src/backend/access/heap/heapam_visibility.c | 97 ++--- src/backend/access/heap/hio.c | 2 + 5 files changed, 96 insertions(+), 34 deletions(-) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index e5f7355dcb..f30b9c234f 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -665,6 +665,25 @@ check_tuple_header(HeapCheckContext *ctx) */ } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)) + { + report_corruption(ctx, + pstrdup("locked-only should not be marked committed")); + + /* As above, do not skip further checks. */ + } + + /* also check for pre-v9.3 lock-only bit pattern */ + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask)) + { + report_corruption(ctx, + pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed")); + + /* As above, do not skip further checks. */ + } + if (infomask & HEAP_HASNULL) expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts)); else diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8baf0..4e565e60ab 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -152,4 +152,4 @@ The following infomask bits are applicable: whether the XMAX is a TransactionId or a MultiXactId. We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit -is set. +or the HEAP_XMAX_LOCK_ONLY bit is set. diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 3746336a09..3f0b4cd754 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -5151,6 +5151,16 @@ l5: MultiXactStatus status; MultiXactStatus new_status; + /* + * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be + * set in a tuple header, so cross-check. 
+ * + * Note that this also checks for the special locked-only bit pattern + * that may be present on servers that were pg_upgraded from pre-v9.3 + * versions. + */ + Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask)); + if (old_infomask2 & HEAP_KEYS_UPDATED) status = MultiXactStatusUpdate; else diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index ff0b8a688d..7d6fb66b0d 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -78,6 +78,31 @@ #include "utils/snapmgr.h" +/* + * InfomaskAssertions() + * + * Checks for invalid infomask bit patterns. + */ +static inline void +InfomaskAssertions(HeapTupleHeader tuple) +{ + /* + * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header. + * + * Note that this also checks for the special locked-only bit pattern that + * may be present on servers that were pg_upgraded from pre-v9.3 versions. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))); + + /* + * XMAX_COMMITTED and XMAX_IS_MULTI cannot be be set in a tuple header. + */ + Assert(!((tuple->t
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
On Tue, Mar 22, 2022 at 04:13:47PM -0700, Andres Freund wrote: > Just skimming this thread quickly, I really have no idea what this is trying > to achieve and the commit message doesn't help either... I didn't read the > referenced thread, but I shouldn't have to, to get a basic idea. Ah, my bad. I should've made sure the context was carried over better. I updated the commit message with some basic information about the intent. Please let me know if there is anything else that needs to be cleared up. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 739ec6e42183280d57d867157cfe1b37d127ca54 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 22 Mar 2022 15:35:34 -0700 Subject: [PATCH v5 1/1] Disallow setting XMAX_COMMITTED and XMAX_LOCK_ONLY together. Even though Postgres doesn't set both the XMAX_COMMITTED and XMAX_LOCK_ONLY infomask bits on the same tuple header, the heap visibility logic still accepts it. However, other functions like compute_new_xmax_infomask() seem to assume that this bit pattern is not possible. This change marks this bit pattern as disallowed, removes the heap visibility logic that handles it, and adds checks like those for other disallowed infomask bit combinations (e.g., XMAX_COMMITTED and XMAX_IS_MULTI). Besides simplifying the heap visibility logic a bit, this change aims to reduce ambiguity about the legal tuple header states. Note that this change also disallows XMAX_COMMITTED together with the special pre-v9.3 locked-only bit pattern that HEAP_XMAX_IS_LOCKED_ONLY checks for. This locked-only bit pattern may still be present on servers pg_upgraded from pre-v9.3 versions. --- contrib/amcheck/verify_heapam.c | 19 src/backend/access/heap/README.tuplock | 2 +- src/backend/access/heap/heapam.c| 10 +++ src/backend/access/heap/heapam_visibility.c | 97 ++--- src/backend/access/heap/hio.c | 2 + 5 files changed, 96 insertions(+), 34 deletions(-) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index e5f7355dcb..f30b9c234f 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -665,6 +665,25 @@ check_tuple_header(HeapCheckContext *ctx) */ } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)) + { + report_corruption(ctx, + pstrdup("locked-only should not be marked committed")); + + /* As above, do not skip further checks. */ + } + + /* also check for pre-v9.3 lock-only bit pattern */ + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask)) + { + report_corruption(ctx, + pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed")); + + /* As above, do not skip further checks. */ + } + if (infomask & HEAP_HASNULL) expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts)); else diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8baf0..4e565e60ab 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -152,4 +152,4 @@ The following infomask bits are applicable: whether the XMAX is a TransactionId or a MultiXactId. We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit -is set. +or the HEAP_XMAX_LOCK_ONLY bit is set. 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 3746336a09..3f0b4cd754 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -5151,6 +5151,16 @@ l5: MultiXactStatus status; MultiXactStatus new_status; + /* + * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be + * set in a tuple header, so cross-check. + * + * Note that this also checks for the special locked-only bit pattern + * that may be present on servers that were pg_upgraded from pre-v9.3 + * versions. + */ + Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask)); + if (old_infomask2 & HEAP_KEYS_UPDATED) status = MultiXactStatusUpdate; else diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index ff0b8a688d..7d6fb66b0d 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -78,6 +78,31 @@ #include "utils/snapmgr.h" +/* + * InfomaskAssertions() + * + * Checks for invalid infomask bit patterns. + */ +static inline void +InfomaskAssertions(HeapTupleHeader tuple) +{ + /* + * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header. + * + * Note that this also checks for the special locked-only bit pattern that + * may be present on servers that wer
Re: Use durable_unlink for .ready and .done files for WAL segment removal
One argument for checking WAL file existence before calling archive_command instead might be to avoid the increased startup time. Granted, any added delay from this patch is unlikely to be noticeable unless your archiver is way behind and archive_status contains a huge number of files. However, I have seen cases where startup is stuck on other tasks like SyncDataDirectory() and RemovePgTempFiles() for a very long time, so perhaps it is worth considering. Nathan
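For illustration, here is a minimal sketch of that alternative: verify that the WAL segment still exists before handing it to the archiver, and discard the orphaned .ready file if it does not. This is plain C with placeholder path handling, not the pgarch.c code or the patch under discussion.

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static bool
wal_segment_exists(const char *waldir, const char *segname)
{
    char        path[1024];
    struct stat st;

    snprintf(path, sizeof(path), "%s/%s", waldir, segname);
    return stat(path, &st) == 0;
}

static bool
maybe_archive_segment(const char *waldir, const char *segname)
{
    if (!wal_segment_exists(waldir, segname))
    {
        char        ready[1024];

        /* segment vanished, so drop the stale .ready status file instead */
        snprintf(ready, sizeof(ready), "%s/archive_status/%s.ready",
                 waldir, segname);
        (void) unlink(ready);
        return false;
    }

    /* ... invoke archive_command or the archive module callback here ... */
    return true;
}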
Re: heavily contended lwlocks with long wait queues scale badly
On Mon, Nov 21, 2022 at 10:31:14AM -0500, Jonathan S. Katz wrote: > On 11/20/22 2:56 PM, Andres Freund wrote: >> I still think it might be worth to backpatch in a bit, but so far the votes >> on >> that weren't clear enough on that to feel comfortable. > > My general feeling is "yes" on backpatching, particularly if this is a bug > and it's fixable without ABI breaks. Now that commit a4adc31 has had some time to bake and concerns about unintended consequences may have abated, I wanted to revive this back-patching discussion. I see a few possibly-related reports [0] [1] [2], and I'm now seeing this in the field, too. While it is debatable whether this is a bug, it's a quite nasty issue for users, and it's both difficult to detect and difficult to work around. Thoughts? [0] https://postgr.es/m/CAM527d-uDn5osa6QPKxHAC6srOfBH3M8iXUM%3DewqHV6n%3Dw1u8Q%40mail.gmail.com [1] https://postgr.es/m/VI1PR05MB62031A41186ACC3FC91ACFC70%40VI1PR05MB6206.eurprd05.prod.outlook.com [2] https://postgr.es/m/dd0e070809430a31f7ddd8483fbcce59%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: introduce dynamic shared memory registry
On Thu, Jan 11, 2024 at 09:50:19AM +0700, Andrei Lepikhov wrote: > On 9/1/2024 00:16, Nathan Bossart wrote: >> On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote: >> > 2. FWIW, I'd like to call this whole feature "Support for named DSM >> > segments in Postgres". Do you see anything wrong with this? >> >> Why do you feel it should be renamed? I don't see anything wrong with it, >> but I also don't see any particular advantage with that name compared to >> "dynamic shared memory registry." > It is not a big issue, I suppose. But for me personally (as not a native > English speaker), the label "Named DSM segments" seems more straightforward > to understand. That is good to know, thanks. I see that it would also align better with RequestNamedLWLockTranche/GetNamedLWLockTranche. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: recovery modules
rebased -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 6cee7d220b886e9be309929da5274c5770e59845 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 15 Feb 2023 14:28:53 -0800 Subject: [PATCH v17 1/5] introduce routine for checking mutually exclusive string GUCs --- src/backend/postmaster/pgarch.c | 8 +++- src/backend/utils/misc/guc.c| 22 ++ src/include/utils/guc.h | 3 +++ 3 files changed, 28 insertions(+), 5 deletions(-) diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 67693b0580..6ae8bb53c4 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -824,11 +824,9 @@ LoadArchiveLibrary(void) { ArchiveModuleInit archive_init; - if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0') - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("both archive_command and archive_library set"), - errdetail("Only one of archive_command, archive_library may be set."))); + (void) CheckMutuallyExclusiveStringGUCs(XLogArchiveLibrary, "archive_library", + XLogArchiveCommand, "archive_command", + ERROR); /* * If shell archiving is enabled, use our special initialization function. diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8f65ef3d89..9d21c16169 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2640,6 +2640,28 @@ ReportGUCOption(struct config_generic *record) pfree(val); } +/* + * If both parameters are set, emits a log message at 'elevel' and returns + * false. Otherwise, returns true. + */ +bool +CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name, + const char *p2val, const char *p2name, + int elevel) +{ + if (p1val[0] != '\0' && p2val[0] != '\0') + { + ereport(elevel, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("cannot set both %s and %s", p1name, p2name), + errdetail("Only one of %s or %s may be set.", + p1name, p2name))); + return false; + } + + return true; +} + /* * Convert a value from one of the human-friendly units ("kB", "min" etc.) * to the given base unit. 
'value' and 'unit' are the input value and unit diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h index 471d53da8f..6543e61463 100644 --- a/src/include/utils/guc.h +++ b/src/include/utils/guc.h @@ -375,6 +375,9 @@ extern int NewGUCNestLevel(void); extern void AtEOXact_GUC(bool isCommit, int nestLevel); extern void BeginReportingGUCOptions(void); extern void ReportChangedGUCOptions(void); +extern bool CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name, + const char *p2val, const char *p2name, + int elevel); extern void ParseLongOption(const char *string, char **name, char **value); extern const char *get_config_unit_name(int flags); extern bool parse_int(const char *value, int *result, int flags, -- 2.25.1 >From 9f5c2602baa8e29058bc976c01516d04e3c1f901 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 15 Feb 2023 10:36:00 -0800 Subject: [PATCH v17 2/5] refactor code for restoring via shell --- src/backend/Makefile | 2 +- src/backend/access/transam/timeline.c | 12 +- src/backend/access/transam/xlog.c | 50 - src/backend/access/transam/xlogarchive.c | 167 --- src/backend/access/transam/xlogrecovery.c | 3 +- src/backend/meson.build | 1 + src/backend/postmaster/startup.c | 16 +- src/backend/restore/Makefile | 18 ++ src/backend/restore/meson.build | 5 + src/backend/restore/shell_restore.c | 245 ++ src/include/Makefile | 2 +- src/include/access/xlogarchive.h | 9 +- src/include/meson.build | 1 + src/include/postmaster/startup.h | 1 + src/include/restore/shell_restore.h | 26 +++ src/tools/pgindent/typedefs.list | 1 + 16 files changed, 407 insertions(+), 152 deletions(-) create mode 100644 src/backend/restore/Makefile create mode 100644 src/backend/restore/meson.build create mode 100644 src/backend/restore/shell_restore.c create mode 100644 src/include/restore/shell_restore.h diff --git a/src/backend/Makefile b/src/backend/Makefile index 7d2ea7d54a..cbeb8ac93c 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global SUBDIRS = access archive backup bootstrap catalog parser commands executor \ foreign lib libpq \ main nodes optimizer partitioning port postmaster \ - regex replication rewrite \ + regex replication restore rewrite \ statistics storage tcop tsearch utils $(t
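As a usage illustration of the CheckMutuallyExclusiveStringGUCs() routine introduced in 0001 above, here is a minimal sketch. The restore_command/restore_library pairing and the WARNING-level, non-fatal call are assumptions made for the example (it presumes the backend headers utils/guc.h and utils/elog.h are included), not code from the patch set.

/* returns true if at most one of the two parameters is set */
static bool
recovery_params_are_consistent(const char *restoreCommand,
                               const char *restoreLibrary)
{
    /* emits a WARNING and returns false if both parameters are set */
    return CheckMutuallyExclusiveStringGUCs(restoreCommand, "restore_command",
                                            restoreLibrary, "restore_library",
                                            WARNING);
}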
reorganize "Shared Memory and LWLocks" section of docs
I recently began trying to write documentation for the dynamic shared memory registry feature [0], and I noticed that the "Shared Memory and LWLocks" section of the documentation might need some improvement. At least, I felt that it would be hard to add any new content to this section without making it very difficult to follow. Concretely, I am proposing breaking it into two sections: one for shared memory and one for LWLocks. Furthermore, the LWLocks section will be split into two: one for requesting locks at server startup and one for requesting locks after server startup. I intend to also split the shared memory section into at-startup and after-startup sections if/when the dynamic shared memory registry feature is committed. Besides this restructuring, I felt that certain parts of this documentation could benefit from rephrasing and/or additional detail. Thoughts? [0] https://postgr.es/m/20231205034647.GA2705267%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From dc5c1b37ced883021f4a92a17aa50b9d80d73fc6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v1 1/1] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 180 +--- 1 file changed, 112 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..a758384a9a 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,134 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. 
Any registered + shmem_startup_hook will be executed shortly after each + backend attaches to shared memory. Note that add-ins should still + acquire AddinShmemInitLock within this hook, as + shown in the example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. Like with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. -
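To make the workflow described by the restructured section concrete, here is a minimal end-to-end sketch of an add-in that reserves both shared memory and named LWLocks at startup. It assumes the usual backend headers (postgres.h, miscadmin.h, storage/ipc.h, storage/lwlock.h, storage/shmem.h); the struct, tranche, and hook names are placeholders and this is not part of the documentation patch.

PG_MODULE_MAGIC;

typedef struct mystruct
{
    LWLockPadded *locks;
    int         counter;
} mystruct;

static mystruct *ptr = NULL;
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void
my_shmem_request(void)
{
    if (prev_shmem_request_hook)
        prev_shmem_request_hook();

    /* reserve space and a named LWLock tranche at server startup */
    RequestAddinShmemSpace(MAXALIGN(sizeof(mystruct)));
    RequestNamedLWLockTranche("my tranche name", 1);
}

static void
my_shmem_startup(void)
{
    bool        found;

    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    /* initialize the allocation exactly once, under AddinShmemInitLock */
    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    ptr = ShmemInitStruct("my struct name", sizeof(mystruct), &found);
    if (!found)
    {
        ptr->counter = 0;
        ptr->locks = GetNamedLWLockTranche("my tranche name");
    }
    LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
    if (!process_shared_preload_libraries_in_progress)
        return;

    prev_shmem_request_hook = shmem_request_hook;
    shmem_request_hook = my_shmem_request;
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = my_shmem_startup;
}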
Re: reorganize "Shared Memory and LWLocks" section of docs
Thanks for reviewing. On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote: > """ > Any registered shmem_startup_hook will be executed shortly after each > backend attaches to shared memory. > """ > > IMO the word "each" here can give the wrong impression as if there are > certain guarantees about synchronization between backends. Maybe we > should change this to simply "... will be executed shortly after > [the?] backend attaches..." I see what you mean, but I don't think the problem is the word "each." I think the problem is the use of passive voice. What do you think about something like Each backend will execute the registered shmem_startup_hook shortly after it attaches to shared memory. > """ > should ensure that only one process allocates a new tranche_id > (LWLockNewTrancheId) and initializes each new LWLock > (LWLockInitialize). > """ > > Personally I think that reminding the corresponding function name here > is redundant and complicates reading just a bit. But maybe it's just > me. Yeah, I waffled on this one. I don't mind removing it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: introduce dynamic shared memory registry
Here's a new version of the patch set in which I've attempted to address the feedback in this thread. Note that 0001 is being tracked in a separate thread [0], but it is basically a prerequisite for adding the documentation for this feature, so that's why I've also added it here. [0] https://postgr.es/m/20240112041430.GA3557928%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v6 1/3] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..0ba52b41d4 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. 
Like with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. - - - An example of a shmem_request_hook can be found in - contrib/pg_stat_statements/pg_stat_statements.c in the - PostgreSQL source tree. - - - There is another, more flexible method of obtaining LWLocks. First, - allocate a tranche_id from a shared counter by - calling: + This ensures that an array of num_lwlocks LWLocks is + available under the name tranche_name. A pointer to + this array can be obtained by calling: -int LWLockNewTrancheId(void) +LWLockPadded *GetNamedLWLockTranche(const char *tranche_name) - Next, each
Re: reorganize "Shared Memory and LWLocks" section of docs
On Fri, Jan 12, 2024 at 09:46:50AM -0600, Nathan Bossart wrote: > On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote: >> """ >> Any registered shmem_startup_hook will be executed shortly after each >> backend attaches to shared memory. >> """ >> >> IMO the word "each" here can give the wrong impression as if there are >> certain guarantees about synchronization between backends. Maybe we >> should change this to simply "... will be executed shortly after >> [the?] backend attaches..." > > I see what you mean, but I don't think the problem is the word "each." I > think the problem is the use of passive voice. What do you think about > something like > > Each backend will execute the registered shmem_startup_hook shortly > after it attaches to shared memory. > >> """ >> should ensure that only one process allocates a new tranche_id >> (LWLockNewTrancheId) and initializes each new LWLock >> (LWLockInitialize). >> """ >> >> Personally I think that reminding the corresponding function name here >> is redundant and complicates reading just a bit. But maybe it's just >> me. > > Yeah, I waffled on this one. I don't mind removing it. Here is a new version of the patch with these changes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v2 1/1] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..0ba52b41d4 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... 
+ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. Like with shared memory, + the add-in's shared library must be p
Re: introduce dynamic shared memory registry
On Fri, Jan 12, 2024 at 11:13:46PM +0530, Abhijit Menon-Sen wrote: > At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote: >> + Each backend sould obtain a pointer to the reserved shared memory by > > sould → should D'oh. Thanks. >> + Add-ins can reserve LWLocks on server startup. Like with shared >> memory, > > (Would "As with shared memory" read better? Maybe, but then again maybe > it should be left alone because you also write "Unlike with" elsewhere.) I think "As with shared memory..." sounds better here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: > * I did not invent a switch to control the batching of blobs; it's > just hard-wired at 1000 blobs per group here. Probably we need some > user knob for that, but I'm unsure if we want to expose a count or > just a boolean for one vs more than one blob per batch. The point of > forcing one blob per batch would be to allow exact control during > selective restore, and I'm not sure if there's any value in random > other settings. On the other hand, selective restore of blobs has > been completely broken for the last dozen years and I can't recall any > user complaints about that; so maybe nobody cares and we could just > leave this as an internal choice. I think the argument for making this configurable is that folks might have fewer larger blobs, in which case it might make sense to lower the batch size, or they might have many smaller blobs, in which case they might want to increase it. But I'm a bit skeptical that will make any sort of tremendous difference in practice, and I'm not sure how a user would decide on the right value besides trial-and-error. In any case, at the moment I think it'd be okay to keep this an internal setting and wait for feedback from the field. > * As the patch stands, we still build a separate TOC entry for each > comment or seclabel or ACL attached to a blob. If you have a lot of > blobs with non-default properties then the TOC bloat problem comes > back again. We could do something about that, but it would take a bit > of tedious refactoring, and the most obvious way to handle it probably > re-introduces too-many-locks problems. Is this a scenario that's > worth spending a lot of time on? I'll ask around about this. > Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes. This one looked good to me. > Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata > and data entries. I spent most of my review time reading this one. Overall, it looks pretty good to me, and I think you've called out most of the interesting design choices. > + char *cmdEnd = psprintf(" OWNER TO %s", > fmtId(te->owner)); > + > + IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", > cmdEnd); This is just a nitpick, but is there any reason not to have IssueCommandPerBlob() accept a format string and the corresponding arguments? > + while (n < 1000 && i + n < ntups) Another nitpick: it might be worth moving these hard-wired constants to macros. Without a control switch, that'd at least make it easy for anyone determined to change the value for their installation. > Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA. This one looks reasonable, too. > In this patch I have just hard-wired pg_upgrade to use > --transaction-size 1000. Perhaps there would be value in adding > another pg_upgrade option to allow user control of that, but I'm > unsure that it's worth the trouble; I think few users would use it, > and any who did would see not that much benefit. However, we > might need to adjust the logic to make the size be 1000 divided > by the number of parallel restore jobs allowed. I wonder if it'd be worth making this configurable for pg_upgrade as an escape hatch in case the default setting is problematic. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
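To illustrate the nitpick about naming the hard-wired constant, here is a toy, self-contained sketch of the grouping loop shape. This is not pg_dump code; the macro and function names are placeholders.

#include <stdio.h>

#define MAX_BLOBS_PER_GROUP 1000    /* the previously hard-wired 1000 */

static void
dump_blob_group(int first, int count)
{
    printf("TOC entry covering blobs %d..%d\n", first, first + count - 1);
}

int
main(void)
{
    int         ntups = 2500;   /* pretend number of large objects */
    int         i = 0;

    while (i < ntups)
    {
        int         n = 0;

        while (n < MAX_BLOBS_PER_GROUP && i + n < ntups)
            n++;

        dump_blob_group(i, n);
        i += n;
    }

    return 0;
}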
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, Jan 05, 2024 at 03:02:34PM -0500, Tom Lane wrote: > On further reflection, there is a very good reason why it's done like > that. Because pg_upgrade is doing schema-only dump and restore, > there's next to no opportunity for parallelism within either pg_dump > or pg_restore. There's no data-loading steps, and there's no > index-building either, so the time-consuming stuff that could be > parallelized just isn't happening in pg_upgrade's usage. > > Now it's true that my 0003 patch moves the needle a little bit: > since it makes BLOB creation (as opposed to loading) parallelizable, > there'd be some hope for parallel pg_restore doing something useful in > a database with very many blobs. But it makes no sense to remove the > existing cross-database parallelism in pursuit of that; you'd make > many more people unhappy than happy. I assume the concern is that we'd end up multiplying the effective number of workers if we parallelized both in-database and cross-database? Would it be sufficient to make those separately configurable with a note about the multiplicative effects of setting both? I think it'd be unfortunate if pg_upgrade completely missed out on this improvement. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, Jan 12, 2024 at 04:42:27PM -0600, Nathan Bossart wrote: > On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: >> +char *cmdEnd = psprintf(" OWNER TO %s", >> fmtId(te->owner)); >> + >> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", >> cmdEnd); > > This is just a nitpick, but is there any reason not to have > IssueCommandPerBlob() accept a format string and the corresponding > arguments? Eh, I guess you'd have to find some other way of specifying where the OID is supposed to go, which would probably be weird. Please disregard this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Fix minor memory leak in connection string validation
On Fri, Jan 12, 2024 at 03:06:26PM -0800, Jeff Davis wrote: > It makes me wonder if we should use the resowner mechanism to track > pointers to malloc'd memory. Then we could use a standard pattern for > these kinds of cases, and it would also catch more remote issues, like > if a pstrdup() fails in an error path (which can happen a few lines up > if the parse fails). That seems worth exploring. > if (!uses_password) > + { > + /* malloc'd, so we must free it explicitly */ > + PQconninfoFree(opts); > + > ereport(ERROR, > > (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), >errmsg("password is required"), >errdetail("Non-superusers must provide > a password in the connection string."))); > + } > } > > PQconninfoFree(opts); Another option could be to surround this with PG_TRY/PG_FINALLY, but your patch seems sufficient, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
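For what it's worth, here is a rough sketch of the PG_TRY/PG_FINALLY alternative, assuming backend context with libpq-fe.h available; the function name and the password-scanning loop are made up for illustration, and this is not the committed fix.

static void
validate_connstr_requires_password(const char *connstr)
{
    PQconninfoOption *opts;
    PQconninfoOption *opt;
    bool        uses_password = false;

    opts = PQconninfoParse(connstr, NULL);
    if (opts == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid connection string syntax")));

    PG_TRY();
    {
        for (opt = opts; opt->keyword != NULL; opt++)
        {
            if (strcmp(opt->keyword, "password") == 0 &&
                opt->val != NULL && opt->val[0] != '\0')
                uses_password = true;
        }

        if (!uses_password)
            ereport(ERROR,
                    (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
                     errmsg("password is required"),
                     errdetail("Non-superusers must provide a password in the connection string.")));
    }
    PG_FINALLY();
    {
        /* opts is malloc'd by libpq, so it must be freed on both paths */
        PQconninfoFree(opts);
    }
    PG_END_TRY();
}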
Re: reorganize "Shared Memory and LWLocks" section of docs
On Sat, Jan 13, 2024 at 01:49:08PM +0300, Aleksander Alekseev wrote: > That's much better, thanks. > > I think the patch could use another pair of eyes, ideally from a > native English speaker. But if no one will express any objections for > a while I suggest merging it. Great. I've attached a v3 with a couple of fixes suggested in the other thread [0]. I'll wait a little while longer in case anyone else wants to take a look. [0] https://postgr.es/m/ZaF6UpYImGqVIhVp%40toroid.org -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From b56931edc4488a7376b27ba0e5519cc3a61b4899 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v3 1/1] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..82e1dadcca 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend should obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. 
As with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. - - - An example of a shmem_request_hook can be found in - contrib/pg_stat_statements/pg_stat_statements.c in the - PostgreSQL source tree. - - - There is another, more flexible method of obtaining LWLocks. First, - allocate a tranche_id from a shared counter by - calling: + This ensures that an array of num_lwlocks LWLocks is + available under the name tranche_name. A pointer to +
Re: introduce dynamic shared memory registry
On Fri, Jan 12, 2024 at 01:45:55PM -0600, Nathan Bossart wrote: > On Fri, Jan 12, 2024 at 11:13:46PM +0530, Abhijit Menon-Sen wrote: >> At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote: >>> + Each backend sould obtain a pointer to the reserved shared memory by >> >> sould → should > > D'oh. Thanks. > >>> + Add-ins can reserve LWLocks on server startup. Like with shared >>> memory, >> >> (Would "As with shared memory" read better? Maybe, but then again maybe >> it should be left alone because you also write "Unlike with" elsewhere.) > > I think "As with shared memory..." sounds better here. Here is a new version of the patch set with these changes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From b56931edc4488a7376b27ba0e5519cc3a61b4899 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v7 1/3] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..82e1dadcca 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend should obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. 
+ + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. As with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. - - - An example of a shmem_request_hook can be found in - contrib/pg_stat_statements/pg_stat_statements.c in the - PostgreSQL sour
Re: Postgres and --config-file option
On Sat, Jan 13, 2024 at 01:39:50PM +0300, Aleksander Alekseev wrote: > OK, let's check section "20.1.4. Parameter Interaction via the Shell" > [1] of the documentation. Currently it doesn't tell anything about the > ability to specify GUCs --like-this, unless I missed something. It appears to be documented for 'postgres' as follows [0]: --name=value Sets a named run-time parameter; a shorter form of -c. and similarly within the --help output: --NAME=VALUE set run-time parameter Its documentation also describes this method of specifying parameters in the 'Examples' section. The section you refer to calls out "-c", so it is sort-of indirectly mentioned, but that might be a bit of a generous assessment. > Should we remove --config-file from the error message to avoid any > confusion? Should we correct --help output? Should we update the > documentation? It might be worthwhile to update the documentation if it would've helped prevent confusion here. Separately, I noticed that this is implemented in postmaster.c by looking for the '-' option character returned by getopt(), and I'm wondering why this doesn't use getopt_long() instead. AFAICT this dates back to the introduction of GUCs in 6a68f426 (May 2000). [0] https://www.postgresql.org/docs/devel/app-postgres.html -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
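For illustration, the equivalent spellings being discussed (paths are placeholders):

postgres -D /path/to/data -c config_file=/custom/postgresql.conf
postgres -D /path/to/data --config_file=/custom/postgresql.conf
# ParseLongOption() replaces dashes with underscores, so the spelling used in
# the error message should work as well:
postgres -D /path/to/data --config-file=/custom/postgresql.conf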
Re: ALTER ROLE documentation improvement
On Sun, Jan 14, 2024 at 04:17:41PM +0530, vignesh C wrote: > The attached v3 version patch has the changes for the same. LGTM. I'll wait a little while longer for additional feedback, but if none materializes, I'll commit this soon. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: archive modules loose ends
On Mon, Jan 15, 2024 at 12:21:44PM +, Li, Yong wrote: > The patch looks good to me. With the context explained in the thread, > the patch is easy to understand. > The patch serves as a refactoring which pulls up common memory management > and error handling concerns into the pgarch.c. With the patch, > individual archive callbacks can focus on copying the files and leave the > boilerplate code to pgarch.c.. > > The patch applies cleanly to HEAD. “make check-world” also runs cleanly > with no error. Thanks for reviewing. I've marked this as ready-for-committer, and I'm hoping to commit it in the near future. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: make pg_ctl more friendly
+ POSTMASTER_RECOVERY_SHUTDOWN, Perhaps this should be POSTMASTER_SHUTDOWN_IN_RECOVERY to match the state in the control file? + case POSTMASTER_RECOVERY_SHUTDOWN: + print_msg(_("PITR shutdown\n")); + print_msg(_("update configuration for startup again if needed\n")); + break; I'm not sure I agree that this is a substantially friendlier message. From a quick skim of the thread, it seems like you want to avoid sending a scary error message if Postgres was intentionally shut down while in recovery. If I got this particular message, I think I would be worried that something went wrong during my point-in-time restore, and I'd be scrambling to figure out what configuration this message wants me to update. If I'm correctly interpreting the intent here, it might be worth fleshing out the messages a bit more. For example, instead of "PITR shutdown," perhaps we could say "shut down while in recovery." And maybe we should point to the specific settings in the latter message. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
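To make that concrete, one way the fleshed-out message might read (the enum value and wording below are only a sketch of the suggestion, not the patch):

        case POSTMASTER_SHUTDOWN_IN_RECOVERY:
            print_msg(_("server was shut down while in recovery\n"));
            print_msg(_("If recovery is not yet complete, adjust recovery_target* or standby settings before starting the server again.\n"));
            break;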
Re: reorganize "Shared Memory and LWLocks" section of docs
On Tue, Jan 16, 2024 at 10:02:15AM +0530, Bharath Rupireddy wrote: > The v3 patch looks good to me except for a nitpick: the input > parameter for RequestAddinShmemSpace is 'Size' not 'int' > > > void RequestAddinShmemSpace(int size) > Hah, I think this mistake is nearly old enough to vote (e0dece1, 5f78aa5). Good catch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: reorganize "Shared Memory and LWLocks" section of docs
On Tue, Jan 16, 2024 at 08:20:19AM -0600, Nathan Bossart wrote: > On Tue, Jan 16, 2024 at 10:02:15AM +0530, Bharath Rupireddy wrote: >> The v3 patch looks good to me except for a nitpick: the input >> parameter for RequestAddinShmemSpace is 'Size' not 'int' >> >> >> void RequestAddinShmemSpace(int size) >> > > Hah, I think this mistake is nearly old enough to vote (e0dece1, 5f78aa5). > Good catch. I fixed this in v4. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 402eaf87776fb6a9d212da66947f47c63bd53f2a Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v4 1/1] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 184 +--- 1 file changed, 115 insertions(+), 69 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..ede2a5dea6 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: -void RequestAddinShmemSpace(int size) +void RequestAddinShmemSpace(Size size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend should obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. 
As with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. - - - An example of a shmem_request_hook can be found in - contrib/pg_stat_statements/pg_stat_statements.c in the - PostgreSQL source tree. - - - There is another, more flexible method of obtaining LWLocks. First, - allocate a tranche_id from a shared counter by - calling: + This ensures that an array of num_lwlocks LWLocks is + available under the name tranch
Re: introduce dynamic shared memory registry
On Tue, Jan 16, 2024 at 10:28:29AM +0530, Bharath Rupireddy wrote: > I think it's better for GetNamedDSMSegment() to error out on empty > 'name' and size 0. This makes the user-facing function > GetNamedDSMSegment more concrete. Agreed, thanks for the suggestion. > +void * > +GetNamedDSMSegment(const char *name, size_t size, > + void (*init_callback) (void *ptr), bool *found) > > +Assert(found); > > Why is input parameter 'found' necessary to be passed by the caller? > Neither the test module added, nor the pg_prewarm is using the found > variable. The function will anyway create the DSM segment if one with > the given name isn't found. IMO, found is an optional parameter for > the caller. So, the assert(found) isn't necessary. The autoprewarm change (0003) does use this variable. I considered making it optional (i.e., you could pass in NULL if you didn't want it), but I didn't feel like the extra code in GetNamedDSMSegment() to allow this was worth it so that callers could avoid creating a single bool. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 402eaf87776fb6a9d212da66947f47c63bd53f2a Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v8 1/3] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 184 +--- 1 file changed, 115 insertions(+), 69 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..ede2a5dea6 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: -void RequestAddinShmemSpace(int size) +void RequestAddinShmemSpace(Size size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend should obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... 
+ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. As with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwl
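Returning to the GetNamedDSMSegment() discussion at the top of this message, here is a minimal sketch of a caller that uses the "found" output parameter. The struct, callback, and segment names are invented for the example; the signature is the one quoted in the review.

typedef struct MySharedState
{
    int         counter;
} MySharedState;

static void
my_init_shared_state(void *ptr)
{
    /* runs only in the backend that first creates the segment */
    ((MySharedState *) ptr)->counter = 0;
}

static MySharedState *
attach_my_shared_state(void)
{
    bool        found;
    MySharedState *state;

    state = (MySharedState *) GetNamedDSMSegment("my_extension_state",
                                                 sizeof(MySharedState),
                                                 my_init_shared_state,
                                                 &found);

    /*
     * "found" reports whether the segment already existed, which a caller
     * (like autoprewarm) can use to decide whether follow-up work is needed.
     */
    if (found)
        elog(DEBUG1, "attached to existing my_extension_state segment");

    return state;
}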
Re: ALTER ROLE documentation improvement
On Thu, Jan 18, 2024 at 02:44:35PM -0700, David G. Johnston wrote: > LGTM too. I didn't go looking for anything else related to this, but the > proposed changes all look needed. Committed. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: reorganize "Shared Memory and LWLocks" section of docs
On Wed, Jan 17, 2024 at 06:48:37AM +0530, Bharath Rupireddy wrote: > LGTM. Committed. Thanks for reviewing! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: introduce dynamic shared memory registry
On Wed, Jan 17, 2024 at 08:00:00AM +0530, Bharath Rupireddy wrote: > The v8 patches look good to me. Committed. Thanks everyone for reviewing! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
cleanup patches for dshash
While working on the dynamic shared memory registry, I noticed a couple of potential improvements for code that uses dshash tables. * A couple of dshash_create() callers pass in 0 for the "void *arg" parameter, which seemed weird. I incorrectly assumed this was some sort of flags parameter until I actually looked at the function signature. IMHO we should specify NULL here if arg is not used. 0001 makes this change. This is admittedly a nitpick. * There are no dshash compare/hash functions for string keys. 0002 introduces some that simply forward to strcmp()/string_hash(), and it uses them for the DSM registry's dshash table. This allows us to remove the hacky key padding code for lookups on this table. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 5fe6d05f2cfa1401c2fb967fe4bb52cae3706228 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 19 Jan 2024 15:23:35 -0600 Subject: [PATCH v1 1/2] Use NULL instead of 0 for arg parameter in dshash_create(). --- src/backend/replication/logical/launcher.c | 2 +- src/backend/utils/activity/pgstat_shmem.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index 122db0bb13..ee66c292bd 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -1013,7 +1013,7 @@ logicalrep_launcher_attach_dshmem(void) last_start_times_dsa = dsa_create(LWTRANCHE_LAUNCHER_DSA); dsa_pin(last_start_times_dsa); dsa_pin_mapping(last_start_times_dsa); - last_start_times = dshash_create(last_start_times_dsa, &dsh_params, 0); + last_start_times = dshash_create(last_start_times_dsa, &dsh_params, NULL); /* Store handles in shared memory for other backends to use. */ LogicalRepCtx->last_start_dsa = dsa_get_handle(last_start_times_dsa); diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index 3ce48e88a3..71410ddd3f 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -180,7 +180,7 @@ StatsShmemInit(void) * With the limit in place, create the dshash table. XXX: It'd be nice * if there were dshash_create_in_place(). */ - dsh = dshash_create(dsa, &dsh_params, 0); + dsh = dshash_create(dsa, &dsh_params, NULL); ctl->hash_handle = dshash_get_hash_table_handle(dsh); /* lift limit set above */ -- 2.25.1 >From 6c6760e6553ac86c334e7c35d2ddbac8fdc1cb9b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 19 Jan 2024 15:38:34 -0600 Subject: [PATCH v1 2/2] Add hash/compare functions for dshash tables with string keys. --- src/backend/lib/dshash.c | 23 +++ src/backend/storage/ipc/dsm_registry.c | 8 +++- src/include/lib/dshash.h | 4 3 files changed, 30 insertions(+), 5 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index b0bc0abda0..e497238e8f 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -583,6 +583,29 @@ dshash_memhash(const void *v, size_t size, void *arg) return tag_hash(v, size); } +/* + * A compare function that forwards to strcmp. + */ +int +dshash_strcmp(const void *a, const void *b, size_t size, void *arg) +{ + Assert(strlen((const char *) a) < size); + Assert(strlen((const char *) b) < size); + + return strcmp((const char *) a, (const char *) b); +} + +/* + * A hash function that forwards to string_hash. 
+ */ +dshash_hash +dshash_strhash(const void *v, size_t size, void *arg) +{ + Assert(strlen((const char *) v) < size); + + return string_hash((const char *) v, size); +} + /* * Sequentially scan through dshash table and return all the elements one by * one, return NULL when all elements have been returned. diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c index ac11f51375..9dce41e8ab 100644 --- a/src/backend/storage/ipc/dsm_registry.c +++ b/src/backend/storage/ipc/dsm_registry.c @@ -50,8 +50,8 @@ typedef struct DSMRegistryEntry static const dshash_parameters dsh_params = { offsetof(DSMRegistryEntry, handle), sizeof(DSMRegistryEntry), - dshash_memcmp, - dshash_memhash, + dshash_strcmp, + dshash_strhash, LWTRANCHE_DSM_REGISTRY_HASH }; @@ -132,7 +132,6 @@ GetNamedDSMSegment(const char *name, size_t size, { DSMRegistryEntry *entry; MemoryContext oldcontext; - char name_padded[offsetof(DSMRegistryEntry, handle)] = {0}; void *ret; Assert(found); @@ -155,8 +154,7 @@ GetNamedDSMSegment(const char *name, size_t size, /* Connect to the registry. */ init_dsm_registry(); - strcpy(name_padded, name); - entry = dshash_find_or_insert(dsm_registry_table, name_padded, found); + entry = dshash_find_or_insert(dsm_registry_table, name, found); if (!(*found)) { /* Initialize the segment. *
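As a usage illustration of the new string-key helpers in 0002, here is a sketch of how another dshash table keyed by a fixed-size, NUL-terminated string might set up its parameters; the entry layout and tranche macro are placeholders, not code from the patch.

#define MY_KEY_LEN 64

typedef struct MyEntry
{
    char        name[MY_KEY_LEN];   /* key: NUL-terminated string */
    int         value;
} MyEntry;

static const dshash_parameters my_params = {
    offsetof(MyEntry, value),       /* key size (the padded name field) */
    sizeof(MyEntry),                /* full entry size */
    dshash_strcmp,
    dshash_strhash,
    MY_TRANCHE_ID                   /* placeholder for a registered tranche */
};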
Re: introduce dynamic shared memory registry
On Sun, Jan 21, 2024 at 11:21:46AM -0500, Tom Lane wrote: > Coverity complained about this: > > *** CID 1586660: Null pointer dereferences (NULL_RETURNS) > /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/ipc/dsm_registry.c: > 185 in GetNamedDSMSegment() > 179 } > 180 else if (!dsm_find_mapping(entry->handle)) > 181 { > 182 /* Attach to existing segment. */ > 183 dsm_segment *seg = dsm_attach(entry->handle); > 184 >>>> CID 1586660: Null pointer dereferences (NULL_RETURNS) >>>> Dereferencing a pointer that might be "NULL" "seg" when calling >>>> "dsm_pin_mapping". > 185 dsm_pin_mapping(seg); > 186 ret = dsm_segment_address(seg); > 187 } > 188 else > 189 { > 190 /* Return address of an already-attached segment. */ > > I think it's right --- the comments for dsm_attach explicitly > point out that a NULL return is possible. You need to handle > that scenario in some way other than SIGSEGV. Oops. I've attached an attempt at fixing this. I took the opportunity to clean up the surrounding code a bit. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From f4c1c7a7ce8bccf7251e384f895f34beb33f839e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sun, 21 Jan 2024 16:05:16 -0600 Subject: [PATCH v1 1/1] fix coverity complaint --- src/backend/storage/ipc/dsm_registry.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c index ac11f51375..c178173653 100644 --- a/src/backend/storage/ipc/dsm_registry.c +++ b/src/backend/storage/ipc/dsm_registry.c @@ -177,19 +177,22 @@ GetNamedDSMSegment(const char *name, size_t size, (errmsg("requested DSM segment size does not match size of " "existing segment"))); } - else if (!dsm_find_mapping(entry->handle)) + else { - /* Attach to existing segment. */ - dsm_segment *seg = dsm_attach(entry->handle); + dsm_segment *seg = dsm_find_mapping(entry->handle); + + /* If the existing segment is not already attached, attach it now. */ + if (seg == NULL) + { + seg = dsm_attach(entry->handle); + if (seg == NULL) +elog(ERROR, "could not map dynamic shared memory segment"); + + dsm_pin_mapping(seg); + } - dsm_pin_mapping(seg); ret = dsm_segment_address(seg); } - else - { - /* Return address of an already-attached segment. */ - ret = dsm_segment_address(dsm_find_mapping(entry->handle)); - } dshash_release_lock(dsm_registry_table, entry); MemoryContextSwitchTo(oldcontext); -- 2.25.1
Re: cleanup patches for dshash
On Mon, Jan 22, 2024 at 10:28:42AM +0800, Andy Fan wrote: > Both LGTM. Thanks for looking. > +dshash_strcmp(const void *a, const void *b, size_t size, void *arg) > +{ > + Assert(strlen((const char *) a) < size); > + Assert(strlen((const char *) b) < size); > + > > Do you think the below change will be more explicitly? > > #define DSMRegistryNameSize 64 > > DSMRegistryEntry > { > char name[DSMRegistryNameSize]; > > } > > Assert(strlen((const char *) a) < DSMRegistryNameSize); This is effectively what it's doing already. These functions are intended to be generic so that they could be used with other dshash tables with string keys, possibly with different sizes. I did notice a cfbot failure [0]. After a quick glance, I'm assuming this is caused by the memcpy() in insert_into_bucket(). Even if the string is short, it will copy the maximum key size, which is bad. So, 0002 is totally broken at the moment, and we'd need to teach insert_into_bucket() to strcpy() instead for string keys to fix it. Perhaps we could add a field to dshash_parameters for this purpose... [0] https://api.cirrus-ci.com/v1/artifact/task/5124848070950912/log/src/test/modules/test_dsm_registry/log/postmaster.log -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
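Concretely, the idea is to give dshash an optional key-copy hook so that insert_into_bucket() copies just strlen(key) + 1 bytes for string keys instead of the full key_size. A hedged sketch of how the registry's parameters might look with such a slot (dshash_strcpy and the extra member do not exist at this point in the thread; the v2 patch below introduces them):

static const dshash_parameters dsh_params = {
	offsetof(DSMRegistryEntry, handle),	/* key_size: the name buffer */
	sizeof(DSMRegistryEntry),
	dshash_strcmp,
	dshash_strhash,
	dshash_strcpy,						/* proposed key-copy hook */
	LWTRANCHE_DSM_REGISTRY_HASH
};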
Re: cleanup patches for dshash
On Sun, Jan 21, 2024 at 09:51:18PM -0600, Nathan Bossart wrote: > I did notice a cfbot failure [0]. After a quick glance, I'm assuming this > is caused by the memcpy() in insert_into_bucket(). Even if the string is > short, it will copy the maximum key size, which is bad. So, 0002 is > totally broken at the moment, and we'd need to teach insert_into_bucket() > to strcpy() instead for string keys to fix it. Perhaps we could add a > field to dshash_parameters for this purpose... I attempted to fix this in v2 of the patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From af203e6e80f7a2843890e51fd3e3641aef8bd901 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 19 Jan 2024 15:23:35 -0600 Subject: [PATCH v2 1/2] Use NULL instead of 0 for arg parameter in dshash_create(). --- src/backend/replication/logical/launcher.c | 2 +- src/backend/utils/activity/pgstat_shmem.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index 122db0bb13..ee66c292bd 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -1013,7 +1013,7 @@ logicalrep_launcher_attach_dshmem(void) last_start_times_dsa = dsa_create(LWTRANCHE_LAUNCHER_DSA); dsa_pin(last_start_times_dsa); dsa_pin_mapping(last_start_times_dsa); - last_start_times = dshash_create(last_start_times_dsa, &dsh_params, 0); + last_start_times = dshash_create(last_start_times_dsa, &dsh_params, NULL); /* Store handles in shared memory for other backends to use. */ LogicalRepCtx->last_start_dsa = dsa_get_handle(last_start_times_dsa); diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index 3ce48e88a3..71410ddd3f 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -180,7 +180,7 @@ StatsShmemInit(void) * With the limit in place, create the dshash table. XXX: It'd be nice * if there were dshash_create_in_place(). */ - dsh = dshash_create(dsa, &dsh_params, 0); + dsh = dshash_create(dsa, &dsh_params, NULL); ctl->hash_handle = dshash_get_hash_table_handle(dsh); /* lift limit set above */ -- 2.25.1 >From 1d328f0be681da7e43574350a1d3c8da8ae8ddcc Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 19 Jan 2024 15:38:34 -0600 Subject: [PATCH v2 2/2] Add hash/compare functions for dshash tables with string keys. --- src/backend/lib/dshash.c | 58 +- src/backend/replication/logical/launcher.c | 1 + src/backend/storage/ipc/dsm_registry.c | 9 ++-- src/backend/utils/activity/pgstat_shmem.c | 1 + src/backend/utils/cache/typcache.c | 2 + src/include/lib/dshash.h | 19 ++- 6 files changed, 83 insertions(+), 7 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index b0bc0abda0..cc49b4ca51 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -188,6 +188,8 @@ static bool delete_item_from_bucket(dshash_table *hash_table, static inline dshash_hash hash_key(dshash_table *hash_table, const void *key); static inline bool equal_keys(dshash_table *hash_table, const void *a, const void *b); +static inline void copy_key(dshash_table *hash_table, void *dest, + const void *src); #define PARTITION_LOCK(hash_table, i) \ (&(hash_table)->control->partitions[(i)].lock) @@ -583,6 +585,49 @@ dshash_memhash(const void *v, size_t size, void *arg) return tag_hash(v, size); } +/* + * A copy function that forwards to memcpy. 
+ */ +void +dshash_memcpy(void *dest, const void *src, size_t size, void *arg) +{ + (void) memcpy(dest, src, size); +} + +/* + * A compare function that forwards to strcmp. + */ +int +dshash_strcmp(const void *a, const void *b, size_t size, void *arg) +{ + Assert(strlen((const char *) a) < size); + Assert(strlen((const char *) b) < size); + + return strcmp((const char *) a, (const char *) b); +} + +/* + * A hash function that forwards to string_hash. + */ +dshash_hash +dshash_strhash(const void *v, size_t size, void *arg) +{ + Assert(strlen((const char *) v) < size); + + return string_hash((const char *) v, size); +} + +/* + * A copy function that forwards to strcpy. + */ +void +dshash_strcpy(void *dest, const void *src, size_t size, void *arg) +{ + Assert(strlen((const char *) src) < size); + + (void) strcpy((char *) dest, (const char *) src); +} + /* * Sequentially scan through dshash table and return all the elements one by * one, return NULL when all elements have been returned. @@ -949,7 +994,7 @@ insert_into_bucket(dshash_table *hash_table, hash_table->params.entry_size + MAXALIGN(sizeof(dshash_table_item))); item = dsa_get_address(hash_table->area, item_pointer); - memcpy(ENTRY_F
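With the v2 patch in place, a string-keyed table no longer requires callers to pad their keys out to key_size before lookups. A hedged usage sketch follows; the entry type, the position of the copy slot in dshash_parameters, and all names are illustrative assumptions, not taken from the patch:

#include "postgres.h"
#include "lib/dshash.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

typedef struct MyStringEntry
{
	char		key[64];		/* NUL-terminated key; must fit within key_size */
	int			value;
} MyStringEntry;

static dshash_parameters my_string_params = {
	sizeof(((MyStringEntry *) 0)->key),	/* key_size */
	sizeof(MyStringEntry),
	dshash_strcmp,
	dshash_strhash,
	dshash_strcpy,				/* assumed position of the new copy slot */
	0							/* tranche_id, filled in below */
};

static void
demo(dsa_area *area)
{
	dshash_table *htab;
	MyStringEntry *entry;
	bool		found;

	my_string_params.tranche_id = LWLockNewTrancheId();
	htab = dshash_create(area, &my_string_params, NULL);

	/* The key can now be an ordinary C string; no padded buffer needed. */
	entry = dshash_find_or_insert(htab, "some key", &found);
	if (!found)
		entry->value = 0;
	dshash_release_lock(htab, entry);
}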
Re: core dumps in auto_prewarm, tests succeed
On Mon, Jan 22, 2024 at 12:41:17PM -0800, Andres Freund wrote: > I noticed that I was getting core dumps while executing the tests, without the > tests failing. Backtraces are variations of this: Looking, thanks for the heads-up. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: core dumps in auto_prewarm, tests succeed
On Mon, Jan 22, 2024 at 02:44:57PM -0600, Nathan Bossart wrote: > On Mon, Jan 22, 2024 at 12:41:17PM -0800, Andres Freund wrote: >> I noticed that I was getting core dumps while executing the tests, without >> the >> tests failing. Backtraces are variations of this: > > Looking, thanks for the heads-up. I think this is because the autoprewarm state was moved to a DSM segment, and DSM segments are detached before the on_shmem_exit callbacks are called during process exit. Moving apw_detach_shmem to the before_shmem_exit list seems to resolve the crashes. diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index 9ea6c2252a..88c3a04109 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -165,7 +165,7 @@ autoprewarm_main(Datum main_arg) first_time = false; /* Set on-detach hook so that our PID will be cleared on exit. */ -on_shmem_exit(apw_detach_shmem, 0); +before_shmem_exit(apw_detach_shmem, 0); /* * Store our PID in the shared memory area --- unless there's already -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
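The underlying rule: during shmem_exit(), before_shmem_exit() callbacks run while DSM segments are still mapped, whereas on_shmem_exit() callbacks run after the backend has detached them, so an exit callback that writes into DSM-backed state has to use the former. A hedged sketch of that pattern, not autoprewarm's actual code; MyState and my_detach_callback are illustrative names:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/ipc.h"

typedef struct MyState
{
	pid_t		bgworker_pid;
} MyState;

static MyState *my_state;		/* points into a DSM segment */

static void
my_detach_callback(int code, Datum arg)
{
	/* Only safe while the DSM segment is still mapped. */
	my_state->bgworker_pid = InvalidPid;
}

static void
register_cleanup(void)
{
	/* before_shmem_exit(), not on_shmem_exit(): must run before DSM detach. */
	before_shmem_exit(my_detach_callback, (Datum) 0);
}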
Re: core dumps in auto_prewarm, tests succeed
On Mon, Jan 22, 2024 at 01:24:54PM -0800, Andres Freund wrote: > On 2024-01-22 15:19:36 -0600, Nathan Bossart wrote: >> I think this is because the autoprewarm state was moved to a DSM segment, >> and DSM segments are detached before the on_shmem_exit callbacks are called >> during process exit. Moving apw_detach_shmem to the before_shmem_exit list >> seems to resolve the crashes. > > That seems plausible. Would still be nice to have at least this test ensure > that the shutdown code works. Perhaps just a check of the control file after > shutdown, ensuring that the state is "shutdowned" vs crashed? I'll give that a try. I'll also expand the comment above the before_shmem_exit() call. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com