Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-21 Thread Nathan Bossart
I've attached an updated patch.

On Fri, Feb 18, 2022 at 10:48:10PM +0530, Ashutosh Sharma wrote:
> +* runningBackups is a counter indicating the number of backups currently in
> +* progress. forcePageWrites is set to true when either of these is
> +* non-zero. lastBackupStart is the latest checkpoint redo location used as
> +* a starting point for an online backup.
>  */
> -   ExclusiveBackupState exclusiveBackupState;
> -   int nonExclusiveBackups;
> 
> What do you mean by "either of these is non-zero"?  Earlier we used
> to set forcePageWrites in case of both exclusive and non-exclusive
> backups, but we have just one type of backup now.

Fixed this.

> -* OK to update backup counters, forcePageWrites and session-level lock.
> +* OK to update backup counters and forcePageWrites.
>  *
> 
> We still update the status of session-level lock so I don't think we
> should update the above comment. See below code:

Fixed this.

> I think we have a lot of common code in do_pg_abort_backup() and
> pg_do_stop_backup(). So why not have a common function that can be
> called from both these functions.

I didn't follow through with this change.  I only saw a handful of lines
that looked similar, and AFAICT we'd need an extra branch for cleaning up
the session-level lock, since do_pg_abort_backup() doesn't touch it.

> +# Now delete the bogus backup_label file since it will interfere with startup
> +unlink("$pgdata/backup_label")
> +  or BAIL_OUT("unable to unlink $pgdata/backup_label");
> +
> 
> Why do we need this additional change? Earlier this was not required.

IIUC this test relied on the following code to handle the bogus file:

/*
 * Terminate exclusive backup mode to avoid recovery after a clean
 * fast shutdown.  Since an exclusive backup can only be taken
 * during normal running (and not, for example, while running
 * under Hot Standby) it only makes sense to do this if we reached
 * normal running. If we're still in recovery, the backup file is
 * one we're recovering *from*, and we must keep it around so that
 * recovery restarts from the right place.
 */
if (ReachedNormalRunning)
    CancelBackup();

The attached patch removes this code.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7fd092c90be90018eeddf3ea5088197627cbe593 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v3 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 154 +-
 doc/src/sgml/func.sgml|  30 +-
 src/backend/access/transam/xlog.c | 466 ++
 src/backend/access/transam/xlogfuncs.c| 235 ++---
 src/backend/catalog/system_functions.sql  |  14 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |   6 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   3 +-
 src/include/catalog/pg_proc.dat   |  20 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +--
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 14 files changed, 113 insertions(+), 935 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..f241bcab9c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -884,7 +876,7 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
  rights to run pg_start_backup (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_start_backup('label', false);
 
  where label is an

Re: Add index scan progress to pg_stat_progress_vacuum

2022-02-21 Thread Nathan Bossart
On Mon, Feb 21, 2022 at 07:03:39PM +, Imseih (AWS), Sami wrote:
> Sending again with patch files renamed to ensure correct apply order.

I haven't had a chance to test this too much, but I did look through the
patch set and have a couple of small comments.

+ 
+  
+   indexes_total bigint
+  
+  
+   The number of indexes to be processed in the
+   vacuuming indexes or cleaning up indexes phase
+   of the vacuum.
+  
+ 
+
+ 
+  
+   indexes_processed bigint
+  
+  
+   The number of indexes processed in the
+   vacuuming indexes or cleaning up indexes phase.
+   At the start of an index vacuum cycle, this value is set to 0.
+  
+ 

Will these be set to 0 for failsafe vacuums and vacuums with INDEX_CLEANUP
turned off?

+typedef struct VacWorkerProgressInfo
+{
+    int num_vacuums;    /* number of active VACUUMS with parallel workers */
+    int max_vacuums;    /* max number of VACUUMS with parallel workers */
+    slock_t mutex;
+    VacOneWorkerProgressInfo vacuums[FLEXIBLE_ARRAY_MEMBER];
+} VacWorkerProgressInfo;

max_vacuums appears to just be a local copy of MaxBackends.  Does this
information really need to be stored here?  Also, is there a strong reason
for using a spinlock instead of an LWLock?

+void
+vacuum_worker_end(int leader_pid)
+{
+    SpinLockAcquire(&vacworkerprogress->mutex);
+    for (int i = 0; i < vacworkerprogress->num_vacuums; i++)
+    {
+        VacOneWorkerProgressInfo *vac = &vacworkerprogress->vacuums[i];
+
+        if (vac->leader_pid == leader_pid)
+        {
+            *vac = vacworkerprogress->vacuums[vacworkerprogress->num_vacuums - 1];
+            vacworkerprogress->num_vacuums--;
+            SpinLockRelease(&vacworkerprogress->mutex);
+            break;
+        }
+    }
+    SpinLockRelease(&vacworkerprogress->mutex);
+}

I see this loop pattern in a couple of places, and it makes me wonder if
this information would fit more naturally in a hash table.
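To make that concrete, here is a minimal sketch of the shape I have in
mind: a shared hash table keyed by leader_pid, guarded by an LWLock
instead of a spinlock.  The entry struct, table, and lock names below are
hypothetical, not from the patch.

#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/hsearch.h"

/* hypothetical per-leader progress entry */
typedef struct VacProgressEntry
{
    int         leader_pid;         /* hash key */
    int         indexes_processed;  /* ... other progress counters ... */
} VacProgressEntry;

static HTAB *VacProgressHash;       /* built with ShmemInitHash() at startup */
static LWLock *VacProgressLock;     /* hypothetical lock protecting the table */

static void
vacuum_worker_end_sketch(int leader_pid)
{
    bool        found;

    LWLockAcquire(VacProgressLock, LW_EXCLUSIVE);
    (void) hash_search(VacProgressHash, &leader_pid, HASH_REMOVE, &found);
    LWLockRelease(VacProgressLock);
}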

+    if (callback)
+        callback(values, 3);

Why does this need to be set up as a callback function?  Could we just call
the function if cmdtype == PROGRESS_COMMAND_VACUUM?  ISTM that is pretty
much all this is doing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




remove more archiving overhead

2022-02-21 Thread Nathan Bossart
Hi hackers,

With the addition of archive modules (5ef1eef) and other archiving
improvements (e.g., beb4e9b), the amount of archiving overhead has been
reduced.  I spent some time measuring the remaining overhead, and the
following function stood out:

/*
 * pgarch_archiveDone
 *
 * Emit notification that an xlog file has been successfully archived.
 * We do this by renaming the status file from NNN.ready to NNN.done.
 * Eventually, a checkpoint process will notice this and delete both the
 * NNN.done file and the xlog file itself.
 */
static void
pgarch_archiveDone(char *xlog)
{
    char        rlogready[MAXPGPATH];
    char        rlogdone[MAXPGPATH];

    StatusFilePath(rlogready, xlog, ".ready");
    StatusFilePath(rlogdone, xlog, ".done");
    (void) durable_rename(rlogready, rlogdone, WARNING);
}

Specifically, the call to durable_rename(), which calls fsync() a few
times, can cause a decent delay.  On my machine, archiving ~12k WAL files
using a bare-bones archive module that did nothing but return true took
over a minute.  When I changed this to rename() (i.e., reverting the change
to pgarch.c in 1d4a0ab), the same test took a second or less.
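For reference, durable_rename() amounts to roughly the following.  This is
a simplified paraphrase of the logic in src/backend/storage/file/fd.c with
error handling and elevel plumbing omitted, so don't take it as the exact
implementation; the point is that each fsync step below is a potential
disk flush.

#include "postgres.h"
#include "storage/fd.h"

static int
durable_rename_sketch(const char *oldfile, const char *newfile)
{
    fsync_fname(oldfile, false);    /* flush the source file's contents */

    /* (an existing newfile, if any, is opened and fsync'd here as well) */

    if (rename(oldfile, newfile) < 0)   /* the rename() itself */
        return -1;

    fsync_fname(newfile, false);    /* flush the file under its new name */
    fsync_parent_path(newfile, WARNING);    /* persist the directory entry */

    return 0;
}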

I also spent some time investigating whether durably renaming the archive
status files was even necessary.  In theory, it shouldn't be.  If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that.  If the file
was already recycled or removed, the archiver will skip it (thanks to
6d8727f).  But when digging further, I found that WAL file recycling uses
durable_rename_excl(), which has the following note:

 * Note that a crash in an unfortunate moment can leave you with two links to
 * the target file.

IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename.  If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives.  I think this
might already be possible today.  Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.

I believe the current recommendation is to fail if the archive file already
exists.  The basic_archive contrib module fails if the archive already
exists and has different contents, and it succeeds without re-archiving if
it already exists with identical contents.  Since something along these
lines is recommended as a best practice, perhaps this is okay.  But I think
we might be able to do better.  If we ensure that the archive_status
directory is sync'd prior to recycling/removing a WAL file, I believe we
are safe from the aforementioned problem.

The drawback of this is that we are essentially offloading more work to the
checkpointer process.  In addition to everything else it must do, it will
also need to fsync() the archive_status directory many times.  I'm not sure
how big of a problem this is.  Keeping archiving caught up is arguably worth
it, and the checkpointer already performs far more expensive tasks that
could make the added fsync() calls relatively insignificant.

I've attached a patch that makes the changes discussed above.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a6d033aff90a6218345cba41847ccfdfbe6447d7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 21 Feb 2022 16:29:14 -0800
Subject: [PATCH v1 1/1] remove more archiving overhead

---
 src/backend/access/transam/xlogarchive.c | 15 +--
 src/backend/postmaster/pgarch.c  | 13 -
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index a2657a2005..d6823c9c38 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -585,8 +585,13 @@ XLogArchiveForceDone(const char *xlog)
  * If it is not time to delete it, make sure a .ready file exists, and return
  * false.
  *
- * If .done exists, then return true; else if .ready exists,
- * then return false; else create .ready and return false.
+ * If .done exists, then sync the archive_status directory to disk and
+ * return true.  Syncing the directory ensures that the rename from
+ * .ready to .done is persisted so that we won't attempt to re-
+ * archive the file after a crash.
+ *
+ * If .ready exists, then return false.  If neither .ready nor
+ * .done exist, create .ready and return false.
  *
  * The reason w

Re: remove more archiving overhead

2022-02-22 Thread Nathan Bossart
On Mon, Feb 21, 2022 at 05:19:48PM -0800, Nathan Bossart wrote:
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary.  In theory, it shouldn't be.  If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that.  If the file
> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f).  But when digging further, I found that WAL file recycling uses
> durable_rename_excl(), which has the following note:
> 
>* Note that a crash in an unfortunate moment can leave you with two links to
>* the target file.
> 
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename.  If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives.  I think this
> might already be possible today.  Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

In my testing, I found that when I killed the server just before unlink()
during WAL recycling, I ended up with links to the same file in pg_wal after
restarting.  My latest test produced links to the same file for the current
WAL file and the next one.  Maybe WAL recycling should use durable_rename()
instead of durable_rename_excl().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remove more archiving overhead

2022-02-22 Thread Nathan Bossart
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> In my testing, I found that when I killed the server just before unlink()
> during WAL recycling, I ended up with links to the same file in pg_wal after
> restarting.  My latest test produced links to the same file for the current
> WAL file and the next one.  Maybe WAL recycling should use durable_rename()
> instead of durable_rename_excl().

Here is an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e080d493985cf0a203122c1e07952b0f765019e4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 22 Feb 2022 11:39:28 -0800
Subject: [PATCH v2 1/1] reduce archiving overhead

---
 src/backend/access/transam/xlog.c | 10 ++
 src/backend/postmaster/pgarch.c   | 14 +-
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..2ad047052f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3334,13 +3334,15 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
+	 * Perform the rename.  Ideally, we'd use link() and unlink() to avoid
+	 * overwriting an existing file (there shouldn't be one).  However, that
+	 * approach opens up the possibility that pg_wal will contain multiple hard
+	 * links to the same WAL file after a crash.
 	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index d916ed39a8..641297e9f5 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -746,7 +746,19 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	(void) durable_rename(rlogready, rlogdone, WARNING);
+
+	/*
+	 * To avoid extra overhead, we don't durably rename the .ready file to
+	 * .done.  Archive commands and libraries must already gracefully handle
+	 * attempts to re-archive files (e.g., if the server crashes just before
+	 * this function is called), so it should be okay if the .ready file
+	 * reappears after a crash.
+	 */
+	if (rename(rlogready, rlogdone) < 0)
+		ereport(WARNING,
+				(errcode_for_file_access(),
+				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+						rlogready, rlogdone)));
 }
 
 
-- 
2.25.1



Re: C++ Trigger Framework

2022-02-22 Thread Nathan Bossart
On Tue, Feb 22, 2022 at 04:02:39PM +0200, Shmuel Kamensky wrote:
> I'm interested in creating a Postgres extension that would enable
> developers to write triggers in (modern) C++. Does anyone know if there is
> already some sort of translation wrapper between the native Postgres C
> API's and C++? Or if there's already a project that allows for writing
> triggers in C++ with ease?
> I see that https://github.com/clkao/plv8js-clkao/blob/master/plv8_type.cc
> does implement an abstraction of sorts, but it's specific to v8 types and
> is not genericized as a way of interacting with Postgres C API's from C++
> from *any* C++ code.
> 
> Can you imagine any potential technical challenges I may encounter (e.g.
> massaging postgres' custom allocator to work with C++'s new and delete
> operators, or unresolvable compiler incompatibilities)?

This might answer your questions:

    https://www.postgresql.org/docs/devel/xfunc-c.html#EXTEND-CPP

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow file inclusion in pg_hba and pg_ident files

2022-02-23 Thread Nathan Bossart
On Wed, Feb 23, 2022 at 12:59:59PM +0800, Julien Rouhaud wrote:
> To address that, I'd like to propose the possibility to include files in hba
> and ident configuration files.  This was already discussed in the past, and in
> my understanding this is mostly wanted, although some people expressed concern
> about a use case that wouldn't rely on thousands of entries.

+1, I think this would be very useful.

> 0001 adds a new pg_ident_file_mappings view, which is basically the same as
> pg_hba_file_rules view but for mappings.  It's probably already useful, for
> instance if you need to tweak some regexp.

This seems reasonable.

> Finally I also added 0003, which is a POC for a new pg_hba_matches() function,
> that can help DBAs understand why their configuration isn't working as they
> expect.  This is only to start the discussion on that topic; the code is for now
> really hackish, as I don't know how much this is wanted and/or if some other
> behavior would be better, and there's also no documentation or test.  The
> function for now only takes an optional inet (null means unix socket), the
> target role and an optional ssl flag, and returns the file, line and raw line
> matching if any, or null.  For instance:

I think another use-case for this is testing updates to your configuration
files.  For example, I could ensure that hba_forbid_non_ssl.conf wasn't
accidentally reverted as part of an unrelated change.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remove more archiving overhead

2022-02-24 Thread Nathan Bossart
Thanks for taking a look!

On Thu, Feb 24, 2022 at 02:13:49PM +0900, Kyotaro Horiguchi wrote:
> https://www.postgresql.org/docs/14/continuous-archiving.html
> | The archive command should generally be designed to refuse to
> | overwrite any pre-existing archive file. This is an important safety
> | feature to preserve the integrity of your archive in case of
> | administrator error (such as sending the output of two different
> | servers to the same archive directory).
> 
> The recommended behavior of archive_command above is to avoid this
> kind of corruption.  If we officially admit that a WAL segment can be
> archived twice, we need to revise that part (and the following part) of
> the doc.

Yeah, we might want to recommend succeeding if the archive already exists
with identical contents (like basic_archive does).  IMO this would best be
handled in a separate thread that is solely focused on documentation
updates.  AFAICT this is an existing problem.
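(For anyone unfamiliar with that behavior, the identical-contents check
amounts to something like the following.  This is a minimal standalone
sketch in the spirit of basic_archive's compare_files(), not the contrib
code itself.)

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* return true if both files exist and have identical contents */
static bool
files_identical(const char *path1, const char *path2)
{
    char    buf1[8192], buf2[8192];
    size_t  n1, n2;
    bool    same = true;
    FILE   *f1 = fopen(path1, "rb");
    FILE   *f2 = fopen(path2, "rb");

    if (f1 == NULL || f2 == NULL)
        same = false;
    else
    {
        do
        {
            n1 = fread(buf1, 1, sizeof(buf1), f1);
            n2 = fread(buf2, 1, sizeof(buf2), f2);
            if (n1 != n2 || memcmp(buf1, buf2, n1) != 0)
            {
                same = false;
                break;
            }
        } while (n1 > 0);
    }

    if (f1)
        fclose(f1);
    if (f2)
        fclose(f2);
    return same;
}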

>> IIUC this means that in theory, a crash at an unfortunate moment could
>> leave you with a .ready file, the file to archive, and another link to that
>> file with a "future" WAL filename.  If you re-archive the file after it has
>> been reused, you could end up with corrupt WAL archives.  I think this
> 
> IMHO the difference would be that the single system call (maybe)
> cannot be interrupted by sigkill.  So I'm not sure that that is worth
> considering.  The sigkill window can be eliminated if we could use
> renameat(RENAME_NOREPLACE).  But ultimately a power loss can still cause that.

Perhaps durable_rename_excl() could use renameat2() for systems that
support it.  However, the problem would still exist for systems that have
link() but not renameat2().  In this particular case, there really
shouldn't be an existing file in pg_wal.
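For the Linux path, I'm thinking of something like this (a sketch only;
renameat2() with RENAME_NOREPLACE requires kernel 3.15+ and glibc 2.28+,
so a configure check and a fallback to the existing link()/unlink()
approach would still be needed):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stdio.h>      /* renameat2, RENAME_NOREPLACE */

static int
rename_noreplace_sketch(const char *oldpath, const char *newpath)
{
    if (renameat2(AT_FDCWD, oldpath, AT_FDCWD, newpath, RENAME_NOREPLACE) == 0)
        return 0;

    if (errno == EEXIST)
        return 1;       /* target already exists; caller decides what to do */
    return -1;          /* some other failure */
}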

>> might already be possible today.  Syncing the directory after every rename
>> probably reduces the likelihood quite substantially, but if
>> RemoveOldXlogFiles() quickly picks up the .done file and attempts
>> recycling before durable_rename() calls fsync() on the directory,
>> presumably the same problem could occur.
> 
> If we don't fsync at every rename, we don't even need
> RemoveOldXlogFiles to pick up .done very quickly.  Isn't that a big
> difference?

Yeah, it probably increases the chances quite a bit.

> I think we don't want to make the checkpointer slower in exchange for making
> the archiver faster.  Perhaps we need to work from condensed
> information rather than repeatedly scanning over distributed
> information like archive_status?

v2 avoids giving the checkpointer more work.

> At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart wrote in 
>> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
>> > In my testing, I found that when I killed the server just before unlink()
>> > during WAL recycling, I ended up with links to the same file in pg_wal after
>> > restarting.  My latest test produced links to the same file for the current
>> > WAL file and the next one.  Maybe WAL recycling should use durable_rename()
>> > instead of durable_rename_excl().
>> 
>> Here is an updated patch.
> 
> Is it intentional that v2 loses the xlogarchive.c part?

Yes.  I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing.  Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remove more archiving overhead

2022-02-24 Thread Nathan Bossart
I tested this again with many more WAL files and a much larger machine
(r5d.24xlarge with data directory on an NVMe SSD instance store volume).
As before, I am using a bare-bones archive module that does nothing but
return true.  Without the patch, archiving ~120k WAL files took about 109
seconds.  With the patch, it took around 62 seconds, which amounts to a
~43% reduction in overhead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Fix overflow in justify_interval related functions

2022-02-25 Thread Nathan Bossart
On Fri, Feb 25, 2022 at 10:30:57AM -0500, Joseph Koshakow wrote:
> Just checking because I'm not very familiar with the process,
> are there any outstanding items that I need to do for this patch?

Unless someone has additional feedback, I don't think so.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

2022-02-25 Thread Nathan Bossart
On Fri, Feb 25, 2022 at 08:31:37PM +0530, Bharath Rupireddy wrote:
> Thanks Satya and others for the inputs. Here's the v1 patch that
> basically allows async wal senders to wait until the sync standbys
> report their flush lsn back to the primary. Please let me know your
> thoughts.

I haven't had a chance to look too closely yet, but IIUC this adds a new
function that waits for synchronous replication.  This new function
essentially spins until the synchronous LSN has advanced.

I don't think it's a good idea to block sending any WAL like this.  AFAICT
it is possible that there will be a lot of synchronously replicated WAL
that we can send, and it might just be the last several bytes that cannot
yet be replicated to the asynchronous standbys.  I believe this patch will
cause the server to avoid sending _any_ WAL until the synchronous LSN
advances.

Perhaps we should instead just choose the SendRqstPtr based on the current
synchronous LSN.  Presumably there are other things we'd need to consider,
but in general, I think we ought to send as much WAL as possible for a
given call to XLogSendPhysical().

> I've done pgbench testing to see if the patch causes any problems. I
> ran the tests two times; there isn't much difference in the transactions
> per second (tps), although there's a delay in the async standby receiving
> the WAL; after all, that's the feature we are pursuing.

I'm curious what a longer pgbench run looks like when the synchronous
replicas are in the same region.  That is probably a more realistic
use-case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-02-25 Thread Nathan Bossart
On Wed, Feb 23, 2022 at 10:41:36AM -0800, Peter Geoghegan wrote:
> On Wed, Feb 23, 2022 at 10:02 AM Imseih (AWS), Sami wrote:
>> If the failsafe kicks in midway through a vacuum, the number indexes_total 
>> will not be reset to 0. If INDEX_CLEANUP is turned off, then the value will 
>> be 0 at the start of the vacuum.
> 
> The way that this works with num_index_scans is that we "round up"
> when there has been non-zero work in lazy_vacuum_all_indexes(), but
> not if the precheck in lazy_vacuum_all_indexes() fails. That seems
> like a good model to generalize from here. Note that this makes
> INDEX_CLEANUP=off affect num_index_scans in much the same way as a
> VACUUM where the failsafe kicks in very early, during the initial heap
> pass. That is, if the failsafe kicks in before we reach lazy_vacuum()
> for the first time (which is not unlikely), or even in the
> lazy_vacuum_all_indexes() precheck, then num_index_scans will remain
> at 0, just like INDEX_CLEANUP=off.
> 
> The actual failsafe WARNING shows num_index_scans, possibly before it
> gets incremented one last time (by "rounding up"). So it's reasonably
> clear how this all works from that context (assuming that the
> autovacuum logging stuff, which reports num_index_scans, outputs a
> report for a table where the failsafe kicked in).

I am confused.  If failsafe kicks in during the middle of a vacuum, I
(perhaps naively) would expect indexes_total and indexes_processed to go to
zero, and I'd expect to no longer see the "vacuuming indexes" and "cleaning
up indexes" phases.  Otherwise, how would I know that we are now skipping
indexes?  Of course, you won't have any historical context about the index
work done before failsafe kicked in, but IMO it is misleading to still
include it in the progress view.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-02-25 Thread Nathan Bossart
On Mon, Jan 31, 2022 at 04:25:21PM +0900, Michael Paquier wrote:
> At the end of the day, it may be better to just let this stuff be.
> Another argument for doing nothing is that this could cause
> hard-to-spot conflicts when it comes to back-patch something.

This one has been quiet for a while.  Should we mark it as
returned-with-feedback?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

2022-02-26 Thread Nathan Bossart
On Sat, Feb 26, 2022 at 02:17:50PM +0530, Bharath Rupireddy wrote:
> A global min LSN of SendRqstPtr of all the sync standbys can be
> calculated and the async standbys can send WAL up to global min LSN.
> This is unlike what the v1 patch does i.e. async standbys will wait
> until the sync standbys report flush LSN back to the primary. Problem
> with the global min LSN approach is that there can still be a small
> window where async standbys can get ahead of sync standbys. Imagine
> async standbys being closer to the primary than sync standbys and if
> the failover has to happen while the WAL at SendRqstPtr isn't received
> by the sync standbys, but the async standbys can receive them as they
> are closer. We hit the same problem that we are trying to solve with
> this patch. This is the reason, we are waiting till the sync flush LSN
> as it guarantees more transactional protection.

Do you mean that the application of WAL gets ahead on your async standbys
or that the writing/flushing of WAL gets ahead?  If synchronous_commit is
set to 'remote_write' or 'on', I think either approach can lead to
situations where the async standbys are ahead of the sync standbys with WAL
application.  For example, a conflict between WAL replay and a query on
your sync standby could delay WAL replay, but the primary will not wait for
this conflict to resolve before considering a transaction synchronously
replicated and sending it to the async standbys.

If writing/flushing WAL gets ahead on async standbys, I think something is
wrong with the patch.  If you aren't sending WAL to async standbys until
it is synchronously replicated to the sync standbys, it should by
definition be impossible for this to happen.

If you wanted to make sure that WAL was not applied to async standbys
before it was applied to sync standbys, I think you'd need to set
synchronous_commit to 'remote_apply'.  This would ensure that the WAL is
replayed on sync standbys before the primary considers the transaction
synchronously replicated and sends it to the async standbys.

> Do you think we should allow async standbys to optionally wait for either
> remote write, flush, apply, or the global min LSN of SendRqstPtr, so that
> users can choose what they want?

I'm not sure I follow the difference between "global min LSN of
SendRqstPtr" and remote write/flush/apply.  IIUC you are saying that we
could use the LSN of what is being sent to sync standbys instead of the LSN
of what the primary considers synchronously replicated.  I don't think we
should do that because it provides no guarantee that the WAL has even been
sent to the sync standbys before it is sent to the async standbys.  For
this feature, I think we always need to consider what the primary considers
synchronously replicated.  My suggested approach doesn't change that.  I'm
saying that instead of spinning in a loop waiting for the WAL to be
synchronously replicated, we just immediately send WAL up to the LSN that
is presently known to be synchronously replicated.

You do bring up an interesting point, though.  Is there a use-case for
specifying synchronous_commit='on' but not sending WAL to async replicas
until it is synchronously applied?  Or alternatively, would anyone want to
set synchronous_commit='remote_apply' but send WAL to async standbys as
soon as it is written to the sync standbys?  My initial reaction is that we
should depend on the synchronous replication setup.  As long as the primary
considers an LSN synchronously replicated, it would be okay to send it to
the async standbys.  I personally don't think it is worth taking on the
extra complexity for that level of configuration just yet.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-26 Thread Nathan Bossart
On Sat, Feb 26, 2022 at 04:48:52PM +, Chapman Flack wrote:
> My biggest concerns are the changes to the SQL-visible pg_start_backup
> and pg_stop_backup functions. When the non-exclusive API was implemented
> (in 7117685), that was done with care (with a new optional argument to
> pg_start_backup, and a new overload of pg_stop_backup) to avoid immediate
> breakage of working backup scripts.
> 
> With this patch, even scripts that were dutifully migrated to that new API and
> now invoke pg_start_backup(label, false) or (label, exclusive => false) will
> immediately and unnecessarily break. What I would suggest for this patch
> would be to change the exclusive default from true to false, and have the
> function report an ERROR if true is passed.
> 
> Otherwise, for sites using a third-party backup solution, there will be an
> unnecessary requirement to synchronize a PostgreSQL upgrade with an
> upgrade of the backup solution that won't be broken by the change. For
> a site with their backup procedures scripted in-house, there will be an
> unnecessarily urgent need for the current admin team to study and patch
> the currently-working scripts.
> 
> That can be avoided by just changing the default to false and rejecting calls
> where true is passed. That will break only scripts that never got the memo
> about moving to non-exclusive backup, available for six years now.
> 
> Assuming the value is false, so no error is thrown, is it practical to 
> determine
> from flinfo->fn_expr whether the value was defaulted or supplied? If so, I 
> would
> further suggest reporting a deprecation WARNING if it was explicitly supplied,
> with a HINT that the argument can simply be removed at the call site, and will
> become unrecognized in some future release.

This is a good point.  I think I agree with your proposed changes.  I
believe it is possible to add a deprecation warning only when 'exclusive'
is specified.  If anything, we can create a separate function that accepts
the 'exclusive' parameter and that always emits a NOTICE or WARNING.
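Roughly, such a compatibility overload could look like this.  This is
entirely hypothetical; the function name and the hand-off at the end are
illustrative, not from the patch.

#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgrprotos.h"

Datum
pg_start_backup_compat(PG_FUNCTION_ARGS)
{
    /* the third argument is the deprecated 'exclusive' flag */
    if (PG_GETARG_BOOL(2))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("exclusive backup mode has been removed")));

    ereport(WARNING,
            (errmsg("the \"exclusive\" parameter of pg_start_backup is deprecated"),
             errhint("Remove the argument from the call site.")));

    /* fall through to the normal two-argument implementation */
    return pg_start_backup(fcinfo);
}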

> pg_stop_backup needs thought, because 7117685 added a new overload for that
> function, rather than just an optional argument. This patch removes the old
> niladic version that returned pg_lsn, leaving just one version, with an 
> optional
> argument, that returns a record.
> 
> Here again, the old niladic one was only suitable for exclusive backups, so 
> there
> can't be any script existing in 2022 that still calls that unless it has 
> never been
> updated in six years to nonexclusive backups, and that breakage can't be
> helped.
> 
> Any scripts that did get dutifully updated over the last six years will be 
> calling the
> record-returning version, passing false, or exclusive => false. This patch as 
> it
> stands will unnecessarily break those, but here again I think that can be 
> avoided
> just by making the exclusive parameter optional with default false, and 
> reporting
> an error if true is passed.
> 
> Here again, I would consider also issuing a deprecation warning if the 
> argument
> is explicitly supplied, if it is practical to determine that from fn_expr. (I 
> haven't
> looked yet to see how practical that is.)

Agreed.  I will look into updating this one, too.  I think the 'exclusive'
parameter should remain documented for now for both pg_start_backup() and
pg_stop_backup(), but this documentation will just note that it is there
for backward compatibility and must be set to false.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-26 Thread Nathan Bossart
On Sat, Feb 26, 2022 at 05:03:04PM -0500, Chapman Flack wrote:
> I've now looked through the rest, and the only further thing I noticed
> was that xlog.c's do_pg_start_backup still has a tablespaces parameter
> to receive a List* of tablespaces if the caller wants, but this patch
> removes the comment describing it:
> 
> 
> - * If "tablespaces" isn't NULL, it receives a list of tablespaceinfo structs
> - * describing the cluster's tablespaces.
> 
> 
> which seems like collateral damage.

Thanks.  I will fix this and the proofreading nits you noted upthread in
the next revision.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: parse/analyze API refactoring

2022-02-28 Thread Nathan Bossart
On Mon, Feb 28, 2022 at 07:46:40AM +0100, Peter Eisentraut wrote:
> You set this commit fest entry to Waiting on Author, but there were no
> reviews posted and the patch still applies and builds AFAICT, so I don't
> know what you meant by that.

Apologies for the lack of clarity.  I believe my only feedback was around
deduplicating the pg_analyze_and_rewrite_*() functions.  Would you rather
handle that in a separate patch?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

2022-02-28 Thread Nathan Bossart
On Mon, Feb 28, 2022 at 06:45:51PM +0530, Bharath Rupireddy wrote:
> On Sat, Feb 26, 2022 at 9:37 PM Nathan Bossart wrote:
>> For
>> this feature, I think we always need to consider what the primary considers
>> synchronously replicated.  My suggested approach doesn't change that.  I'm
>> saying that instead of spinning in a loop waiting for the WAL to be
>> synchronously replicated, we just immediately send WAL up to the LSN that
>> is presently known to be synchronously replicated.
> 
> As I said above, the v1 patch does that, i.e. async standbys wait until the
> sync standbys update their flush LSN.
> 
> Flush LSN is this - flushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
> which gets updated in SyncRepReleaseWaiters.
> 
> Async standbys with their SendRqstPtr will wait in XLogSendPhysical or
> XLogSendLogical until SendRqstPtr <= flushLSN.

My feedback is specifically about this behavior.  I don't think we should
spin in XLogSend*() waiting for an LSN to be synchronously replicated.  I
think we should just choose the SendRqstPtr based on what is currently
synchronously replicated.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-02-28 Thread Nathan Bossart
On Sat, Feb 26, 2022 at 02:06:14PM -0800, Nathan Bossart wrote:
> On Sat, Feb 26, 2022 at 04:48:52PM +, Chapman Flack wrote:
>> Assuming the value is false, so no error is thrown, is it practical to 
>> determine
>> from flinfo->fn_expr whether the value was defaulted or supplied? If so, I 
>> would
>> further suggest reporting a deprecation WARNING if it was explicitly 
>> supplied,
>> with a HINT that the argument can simply be removed at the call site, and 
>> will
>> become unrecognized in some future release.
> 
> This is a good point.  I think I agree with your proposed changes.  I
> believe it is possible to add a deprecation warning only when 'exclusive'
> is specified.  If anything, we can create a separate function that accepts
> the 'exclusive' parameter and that always emits a NOTICE or WARNING.

I've spent some time looking into this, and I haven't found a clean way to
emit a WARNING only if the "exclusive" parameter is supplied (and set to
false).  AFAICT flinfo->fn_expr doesn't tell us whether the parameter was
supplied or the default value was used.  I was able to get it working by
splitting pg_start_backup() into 3 separate internal functions (i.e.,
pg_start_backup_1arg(), pg_start_backup_2arg(), and
pg_start_backup_3arg()), but this breaks calls such as
pg_start_backup('mylabel', exclusive => false), and it might complicate
privilege management for users.

Without a WARNING, I think it will be difficult to justify removing the
"exclusive" parameter in the future.  We would either need to leave it
around forever, or we would have to risk unnecessarily breaking some
working backup scripts.  I wonder if we should just remove it now and make
sure that this change is well-documented in the release notes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

2022-02-28 Thread Nathan Bossart
On Tue, Mar 01, 2022 at 11:10:09AM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart wrote:
>> My feedback is specifically about this behavior.  I don't think we should
>> spin in XLogSend*() waiting for an LSN to be synchronously replicated.  I
>> think we should just choose the SendRqstPtr based on what is currently
>> synchronously replicated.
> 
> Do you mean something like the following?
> 
> /* Main loop of walsender process that streams the WAL over Copy messages. */
> static void
> WalSndLoop(WalSndSendDataCallback send_data)
> {
> /*
>  * Loop until we reach the end of this timeline or the client requests to
>  * stop streaming.
>  */
> for (;;)
> {
>     if (am_async_walsender && there_are_sync_standbys)
>     {
>         XLogRecPtr SendRqstLSN;
>         XLogRecPtr SyncFlushLSN;
> 
>         SendRqstLSN = GetFlushRecPtr(NULL);
>         LWLockAcquire(SyncRepLock, LW_SHARED);
>         SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
>         LWLockRelease(SyncRepLock);
> 
>         if (SendRqstLSN > SyncFlushLSN)
>             continue;
>     }

Not quite.  Instead of "continue", I would set SendRqstLSN to SyncFlushLSN
so that the WAL sender only sends up to the current synchronously
replicated LSN.  TBH there are probably other things that need to be
considered (e.g., how do we ensure that the WAL sender sends the rest once
it is replicated?), but I still think we should avoid spinning in the WAL
sender waiting for WAL to be replicated.
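In other words, something like this, reusing the names from the snippet
above (so the same caveats apply; this is a fragment, not a complete
patch):

SendRqstLSN = GetFlushRecPtr(NULL);

LWLockAcquire(SyncRepLock, LW_SHARED);
SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
LWLockRelease(SyncRepLock);

/* never send beyond what is already synchronously replicated */
if (SendRqstLSN > SyncFlushLSN)
    SendRqstLSN = SyncFlushLSN;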

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Pre-allocating WAL files

2022-03-01 Thread Nathan Bossart
On Tue, Mar 01, 2022 at 08:40:44AM -0600, Justin Pryzby wrote:
> FYI: this is currently failing in cfbot on linux.
> 
> https://cirrus-ci.com/task/4934371210690560
> https://api.cirrus-ci.com/v1/artifact/task/4934371210690560/log/src/test/regress/regression.diffs
> 
>  DROP TABLESPACE regress_tblspace_renamed;
> +ERROR:  tablespace "regress_tblspace_renamed" is not empty

I believe this is due to an existing bug.  This patch set seems to
influence the timing to make it more likely.  I'm tracking the fix here:

https://commitfest.postgresql.org/37/3544/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-01 Thread Nathan Bossart
On Tue, Mar 01, 2022 at 08:44:51AM -0600, David Steele wrote:
> Personally, I am in favor of removing it. We change/rename
> functions/tables/views when we need to, and this happens in almost every
> release.
> 
> What we need to do is make sure that an older installation won't silently
> work in a broken way, i.e. if we remove the exclusive flag somebody
> expecting the pre-9.6 behavior might not receive an error and think
> everything is OK. That would not be good.
> 
> One option might be to rename the functions. Something like
> pg_backup_start/stop.

I'm fine with this approach.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication

2022-03-01 Thread Nathan Bossart
On Tue, Mar 01, 2022 at 04:34:31PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart wrote in 
>> replicated LSN.  TBH there are probably other things that need to be
>> considered (e.g., how do we ensure that the WAL sender sends the rest once
>> it is replicated?), but I still think we should avoid spinning in the WAL
>> sender waiting for WAL to be replicated.
> 
> It seems to me it would be something similar to
> SyncRepReleaseWaiters().  Or it could be possible to consolidate this
> feature into the function, I'm not sure, though.

Yes, perhaps the synchronous replication framework will need to alert WAL
senders when the syncrep LSN advances so that the WAL is sent to the async
standbys.  I'm glossing over the details, but I think that should be the
general direction.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-01 Thread Nathan Bossart
On Tue, Mar 01, 2022 at 11:09:13AM -0500, Chapman Flack wrote:
> On 03/01/22 09:44, David Steele wrote:
>> Personally, I am in favor of removing it. We change/rename
>> functions/tables/views when we need to, and this happens in almost every
>> release.
> 
> For clarification, is that a suggestion to remove the 'exclusive' parameter
> in some later release, after using this release to default it to false and
> reject calls with true?

My suggestion was to remove it in v15.  My impression is that David and
Stephen agree, but I could be misinterpreting their responses.

> That way, at least, there would be a period of time where procedures
> that currently work (by passing exclusive => false) would continue to work,
> and could be adapted as time permits by removing that argument, with no
> behavioral change.

I'm not sure if there's any advantage to kicking the can down the road.  At
some point, we'll need to break existing backup scripts.  Will we be more
prepared to do that in v17 than we are now?  We could maintain two sets of
functions for a few releases and make it really clear in the documentation
that pg_start/stop_backup() are going to be removed soon (and always emit a
WARNING when they are used).  Would that address your concerns?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Proposal: Support custom authentication methods using hooks

2022-03-01 Thread Nathan Bossart
On Mon, Feb 28, 2022 at 03:46:34PM -0500, Stephen Frost wrote:
> md5 should be removed.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication

2022-03-01 Thread Nathan Bossart
On Tue, Mar 01, 2022 at 11:09:57PM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 1, 2022 at 10:35 PM Nathan Bossart wrote:
>> Yes, perhaps the synchronous replication framework will need to alert WAL
>> senders when the syncrep LSN advances so that the WAL is sent to the async
>> standbys.  I'm glossing over the details, but I think that should be the
>> general direction.
> 
> It's doable. But we can't avoid async walsenders waiting for the flush
> LSN even if we take the SyncRepReleaseWaiters() approach, right? I'm
> not sure (at this moment) what the biggest advantage of this
> approach is, i.e. (1) backends waking up walsenders after the flush lsn is
> updated vs. (2) walsenders keep looking for the new flush lsn.

I think there are a couple of advantages.  For one, spinning is probably
not the best from a resource perspective.  There is no guarantee that the
desired SendRqstPtr will ever be synchronously replicated, in which case
the WAL sender would spin forever.  Also, this approach might fit in better
with the existing synchronous replication framework.  When a WAL sender
realizes that it can't send up to the current "flush" LSN because it's not
synchronously replicated, it will request to be alerted when it is.  In the
meantime, it can send up to the latest syncrep LSN so that the async
standby is as up-to-date as possible.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-01 Thread Nathan Bossart
Here is a new version of the patch with the following changes:

1. Addressed Chap's feedback from upthread.
2. Renamed pg_start/stop_backup() to pg_backup_start/stop() as
   suggested by David.
3. A couple of other small documentation adjustments.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7119f9063f22652fca1e2a44fdf6b4b6b3fbf679 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v4 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 188 +--
 doc/src/sgml/func.sgml|  99 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 src/backend/access/transam/xlog.c | 493 ++
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |  20 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |   4 +-
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   2 +-
 src/include/catalog/pg_proc.dat   |  26 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 20 files changed, 190 insertions(+), 1054 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..acffee4688 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0

 
  Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a user who has been granted
+ rights to run pg_backup_start (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_backup_start('label', false);
 
  where label is any string you want to use to uniquely
  identify this backup operation. The connection
- calling pg_start_backup must be maintained until the end of
+ calling pg_backup_start must be maintained until the end of
  the backup, or the backup will be automatically aborted.
 
 
 
- By default, pg_start_backup can take a long time to finish.
+ By default, pg_backup_start can take a long time to finish.
  This is because it performs a checkpoint, and the I/O
  required for the checkpoint will be spread out over a significant
  period of time, by default half your inter-checkpoint interval
@@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false);
  issue an immediate checkpoint using as much I/O as available.
 
 
-
- The third parameter being false tells
- pg_start_backup to initiate a non-exclusive base backup.
-


 
@@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false);
 
  In the same connection as before, issue the command:
 
-SELECT * FROM pg_stop_backup(false, true);
+SELECT * FROM pg_backup_stop(true);
 
  This terminates backup mode. On a primary, it also performs an automatic
  switch to the next WAL segment.  On a standby, it is not possible to
@@ -937,7 +925,7 @@ SELECT * FROM pg_stop_backup(false, true);
  ready to archive.
 
 
- The pg_stop_backup will return one row with three
+ The pg_backup_stop will return one row with three
  values. The second of these fields should be written to a file named
  backup_label in the root directory of the backup. The
  third field should be written to a file named
@@ -949,14 +937,14 @@ SELECT * FROM pg_stop_backup(false, true);

 
  Once the WAL segment files active during the backup are archived, yo

Re: Allow async standbys wait for sync replication

2022-03-04 Thread Nathan Bossart
On Wed, Mar 02, 2022 at 09:47:09AM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 2, 2022 at 2:57 AM Nathan Bossart wrote:
>> I think there are a couple of advantages.  For one, spinning is probably
>> not the best from a resource perspective.
> 
> Just to be on the same page - by spinning do you mean - the async
> walsender waiting for the sync flushLSN in a for-loop with
> WaitLatch()?

Yes.

>> Also, this approach might fit in better
>> with the existing synchronous replication framework.  When a WAL sender
>> realizes that it can't send up to the current "flush" LSN because it's not
>> synchronously replicated, it will request to be alerted when it is.
> 
> I think you are referring to the way a backend calls SyncRepWaitForLSN
> and waits until any one of the walsenders sets syncRepState to
> SYNC_REP_WAIT_COMPLETE in SyncRepWakeQueue. Firstly, SyncRepWaitForLSN
> is blocking, i.e. the backend spins/waits in a for (;;) loop until its
> syncRepState becomes SYNC_REP_WAIT_COMPLETE. The backend doesn't do
> any other work but waits. So, spinning isn't avoided completely.
> 
> Unless I'm missing something, the existing sync repl queue
> (SyncRepQueue) mechanism doesn't avoid spinning in the requestors
> (backends) in SyncRepWaitForLSN or in the walsenders in SyncRepWakeQueue.

My point is that there are existing tools for alerting processes when an
LSN is synchronously replicated and for waking up WAL senders.  What I am
proposing wouldn't involve spinning in XLogSendPhysical() waiting for
synchronous replication.  Like SyncRepWaitForLSN(), we'd register our LSN
in the queue (SyncRepQueueInsert()), but we wouldn't sit in a separate loop
waiting to be woken.  Instead, SyncRepWakeQueue() would eventually wake up
the WAL sender and trigger another iteration of WalSndLoop().
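As a very rough sketch of that direction (WalSndWakeup() is the existing
mechanism for waking walsenders; whether SyncRepReleaseWaiters() is the
right place to call it, and how to limit the wakeups to async walsenders,
is hand-waved here):

#include "postgres.h"
#include "replication/walsender.h"

static void
SyncRepReleaseWaiters_sketch(void)
{
    /* ... existing logic that advances walsndctl->lsn[] and wakes backends ... */

    /*
     * Nudge the walsenders; on their next WalSndLoop() iteration, the async
     * ones will see the advanced flush LSN and send the newly-replicated WAL.
     */
    WalSndWakeup();
}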

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-08 Thread Nathan Bossart
On Wed, Mar 02, 2022 at 02:23:51PM -0500, Chapman Flack wrote:
> I did not notice this earlier (sorry), but there seems to remain in
> backup.sgml a programlisting example that shows a psql invocation
> for pg_backup_start, then a tar command, then another psql invocation
> for pg_backup_stop.
> 
> I think that was only workable for the exclusive mode, and now it is
> necessary to issue pg_backup_start and pg_backup_stop in the same session.
> 
> (The 'touch backup_in_progress' business seems a bit bogus now too,
> suggesting an exclusivity remembered from bygone days.)
> 
> I am not sure what a workable, simple example ought to look like.
> Maybe a single psql script issuing the pg_backup_start and the
> pg_backup_stop, with a tar command in between with \! ?
> 
> Several bricks shy of production-ready, but it would give the idea.

Another option might be to just remove this section.  The top of the
section mentions that this is easily done using pg_basebackup with the -X
parameter.  The bottom part of the section includes more complicated steps
for when "more flexibility in copying the backup files is needed..."
AFAICT the more complicated strategy was around before pg_basebackup, and
the pg_basebackup recommendation was added in 2012 as part of 920febd.
Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-08 Thread Nathan Bossart
On Tue, Mar 08, 2022 at 03:09:50PM -0600, David Steele wrote:
> On 3/8/22 14:01, Nathan Bossart wrote:
>> On Wed, Mar 02, 2022 at 02:23:51PM -0500, Chapman Flack wrote:
>> > I did not notice this earlier (sorry), but there seems to remain in
>> > backup.sgml a programlisting example that shows a psql invocation
>> > for pg_backup_start, then a tar command, then another psql invocation
>> > for pg_backup_stop.
>> > 
>> > I think that was only workable for the exclusive mode, and now it is
>> > necessary to issue pg_backup_start and pg_backup_stop in the same session.
>> > 
>> > (The 'touch backup_in_progress' business seems a bit bogus now too,
>> > suggesting an exclusivity remembered from bygone days.)
>> > 
>> > I am not sure what a workable, simple example ought to look like.
>> > Maybe a single psql script issuing the pg_backup_start and the
>> > pg_backup_stop, with a tar command in between with \! ?
>> > 
>> > Several bricks shy of production-ready, but it would give the idea.
>> 
>> Another option might be to just remove this section.  The top of the
>> section mentions that this is easily done using pg_basebackup with the -X
>> parameter.  The bottom part of the section includes more complicated steps
>> for when "more flexibility in copying the backup files is needed..."
>> AFAICT the more complicated strategy was around before pg_basebackup, and
>> the pg_basebackup recommendation was added in 2012 as part of 920febd.
>> Thoughts?
> 
> This makes sense to me. I think pg_basebackup is far preferable to doing
> anything like what is described in this section. Unless you are planning to
> do something fancy (parallelization, snapshots, object stores, etc.) then
> pg_basebackup is the way to go.

I spent some time trying to come up with a workable script to replace the
existing one.  I think the main problem is that you need to write out both
the backup label file and the tablespace map file, but I didn't find an
easy way to write the different output columns of pg_backup_stop() to
separate files via psql.  We'd probably need to write out the steps in
prose like the 'Making a Base Backup Using the Low Level API' section does.
Ultimately, I just removed everything beyond the pg_basebackup
recommendation in the 'Standalone Hot Backups' section.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 8b6fa9d0b7794e3c37f79f3c71472b2c6adabe47 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v5 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 222 +---
 doc/src/sgml/func.sgml|  99 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 src/backend/access/transam/xlog.c | 493 ++
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |  20 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |   4 +-
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   2 +-
 src/include/catalog/pg_proc.dat   |  28 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 20 files changed, 189 insertions(+), 1091 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..79eeb25eee 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0

 

Re: parse/analyze API refactoring

2022-03-09 Thread Nathan Bossart
On Wed, Mar 09, 2022 at 11:35:32AM +0100, Peter Eisentraut wrote:
> I have committed my original patches.  I'll leave the above-mentioned topic
> as ideas for the future.

Sounds good.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-09 Thread Nathan Bossart
On Wed, Mar 09, 2022 at 02:32:24PM -0500, Chapman Flack wrote:
> On 03/09/22 12:19, Stephen Frost wrote:
>> Let's avoid hijacking this thread, which is about this
>> patch, for an independent debate about what our documentation should or
>> shouldn't include.
> 
> Agreed. New thread here:
> 
> https://www.postgresql.org/message-id/6228FFE4.3050309%40anastigmatix.net

Great.  Is there any additional feedback on this patch?  Should we add an
example of using pg_basebackup in the "Standalone Hot Backups" section, or
should we leave all documentation additions like this for Chap's new
thread?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-09 Thread Nathan Bossart
On Wed, Mar 09, 2022 at 06:11:19PM -0500, Chapman Flack wrote:
> I think the listitem
> 
>   In the same connection as before, issue the command:
>   SELECT * FROM pg_backup_stop(true);
> 
> would be clearer if it used named-parameter form, wait_for_archive => true.
> 
> This is not strictly necessary, of course, for a function with a single
> IN parameter, but it's good documentation (and also could save us headaches
> like these if there is ever another future need to give it more parameters).
> 
> That listitem doesn't say anything about what the parameter means, which
> is a little weird, but probably ok because the next listitem does go into
> it in some detail. I don't think a larger reorg is needed to bring that
> language one listitem earlier. Just naming the parameter is probably
> enough to make it less puzzling (or adding in that listitem, at most,
> "the effect of the wait_for_archive parameter is explained below").
> 
> For consistency (and the same futureproofing benefit), I'd go to
> fast => false in the earlier pg_backup_start as well.
> 
> I'm more ambivalent about label => 'label'. It would be consistent,
> but should we just agree for conciseness that there will always be
> a label and it will always be first?
> 
> You can pretty much tell in a call what's a label; it's those anonymous
> trues and falses that are easier to read with named notation.

Done.  I went ahead and added "label => 'label'" for consistency.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 904d5b9f77262fe0ea740de86ca41c7dbaf210ef Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v6 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 222 +---
 doc/src/sgml/func.sgml|  99 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 src/backend/access/transam/xlog.c | 493 ++
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |  20 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |   4 +-
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   2 +-
 src/include/catalog/pg_proc.dat   |  28 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 20 files changed, 189 insertions(+), 1091 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..c8b914c1aa 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0

 
  Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a user who has been granted
+ rights to run pg_backup_start (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_backup_start(label => 'label', fast => false);
 
  where label is any string you want to use to uniquely
  identify this backup operation. The connection
- calling pg_start_backup must be maintained until the end of
+ calling pg_backup_start must be maintained until the end of
  the backup, or the backup will be automatically aborted.
 
 
 
- By default, pg_start_backup can take a long time to finish.
+ By default, pg_backup_start can take a long time to finish.
  This is because it performs a checkpoint, and t

Re: Add index scan progress to pg_stat_progress_vacuum

2022-03-09 Thread Nathan Bossart
I took a look at the latest patch set.

+  
+   indexes_total bigint
+  
+  
+   The number of indexes to be processed in the 
+   vacuuming indexes 
+   or cleaning up indexes phase. It is set to
+   0 when vacuum is not in any of these phases.
+  

Could we avoid resetting it to 0 unless INDEX_CLEANUP was turned off or
failsafe kicked in?  It might be nice to know how many indexes the vacuum
intends to process.  I don't feel too strongly about this, so if it would
add a lot of complexity, it might be okay as is.

BTreeShmemInit();
SyncScanShmemInit();
AsyncShmemInit();
+   vacuum_worker_init();

Don't we also need to add the size of the hash table to
CalculateShmemSize()?
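For example, I'd expect a line like the following next to the other sizing
calls in CalculateShmemSize(), assuming the patch grows a matching
VacuumWorkerProgressShmemSize() function (that name is my assumption):

    size = add_size(size, VacuumWorkerProgressShmemSize());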

+ * A command type can optionally define a callback function
+ * which will derive Datum values rather than use values
+ * directly from the backends progress array.

I think this can be removed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-03-10 Thread Nathan Bossart
On Thu, Mar 10, 2022 at 09:30:57PM +, Imseih (AWS), Sami wrote:
> Attached v4 which includes accounting for the hash size on startup, removal 
> of the no longer needed comment in pgstatfuncs.c and a change in both 
> code/docs to only reset the indexes_total to 0 when failsafe is triggered.

Thanks for the new patch set.

+/*
+ * Structs for tracking shared Progress information
+ * amongst worker ( and leader ) processes of a vacuum.
+ */

nitpick: Can we remove the extra spaces in the parentheses?

+if (entry != NULL)
+    values[PGSTAT_NUM_PROGRESS_COMMON + PROGRESS_VACUUM_INDEXES_COMPLETED] = entry->indexes_processed;

What does it mean if there isn't an entry in the map?  Is this actually
expected, or should we ERROR instead?

+/* vacuum worker progress hash table */
+max_table_size = GetMaxBackends();
+size = add_size(size, hash_estimate_size(max_table_size,
+ sizeof(VacProgressEntry)));

I think the number of entries should be shared between
VacuumWorkerProgressShmemInit() and VacuumWorkerProgressShmemSize().
Otherwise, we might update one and not the other.
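To illustrate, here's the kind of pairing I have in mind, with both
functions built around a single helper.  This is only a sketch; the hash
table variable, the key type, and the helper name are my inventions, not
necessarily what the patch does (assumes utils/hsearch.h, storage/shmem.h,
and miscadmin.h):

    static HTAB *VacProgressHash;

    /* single source of truth for the number of entries */
    static long
    vacuum_progress_max_entries(void)
    {
        return GetMaxBackends();
    }

    Size
    VacuumWorkerProgressShmemSize(void)
    {
        return hash_estimate_size(vacuum_progress_max_entries(),
                                  sizeof(VacProgressEntry));
    }

    void
    VacuumWorkerProgressShmemInit(void)
    {
        HASHCTL     info;

        info.keysize = sizeof(int);     /* assumed: keyed by leader PID */
        info.entrysize = sizeof(VacProgressEntry);

        VacProgressHash = ShmemInitHash("Vacuum Worker Progress",
                                        vacuum_progress_max_entries(),
                                        vacuum_progress_max_entries(),
                                        &info,
                                        HASH_ELEM | HASH_BLOBS);
    }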

+/* Call the command specific function to override datum values */
+if (pg_strcasecmp(cmd, "VACUUM") == 0)
+set_vaccum_worker_progress(values);

I think we should elaborate a bit more in this comment.  It's difficult to
follow what this is doing without referencing the comment above
set_vacuum_worker_progress().

IMO the patches are in decent shape, and this should likely be marked as
ready-for-committer in the near future.  Before doing so, I think we should
check that Sawada-san is okay with moving the deeper infrastructure changes
to a separate thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-10 Thread Nathan Bossart
On Thu, Mar 10, 2022 at 07:13:14PM -0500, Chapman Flack wrote:
> Looks like this change to an example in func.sgml is not quite right:
> 
> -postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
> +postgres=# SELECT * FROM pg_walfile_name_offset(pg_backup_stop());
> 
> pg_backup_stop returns a record now, not just lsn. So this works for me:
> 
> +postgres=# SELECT * FROM pg_walfile_name_offset((pg_backup_stop()).lsn);

Ah, good catch.  I made this change in v7.  I considered doing something
like this

SELECT w.* FROM pg_backup_stop() b, pg_walfile_name_offset(b.lsn) w;

but I think your suggestion is simpler.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3283d2b85f38f46d1e2ada0e6c5ea59d8c8e9f9d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v7 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 222 +---
 doc/src/sgml/func.sgml|  99 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 src/backend/access/transam/xlog.c | 493 ++
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |  20 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |   4 +-
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   2 +-
 src/include/catalog/pg_proc.dat   |  28 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 20 files changed, 189 insertions(+), 1091 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..c8b914c1aa 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -881,19 +873,19 @@ test ! -f /mnt/server/archivedir/000100A90065 && cp pg_wal/0

 
  Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a user who has been granted
+ rights to run pg_backup_start (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_backup_start(label => 'label', fast => false);
 
  where label is any string you want to use to uniquely
  identify this backup operation. The connection
- calling pg_start_backup must be maintained until the end of
+ calling pg_backup_start must be maintained until the end of
  the backup, or the backup will be automatically aborted.
 
 
 
- By default, pg_start_backup can take a long time to finish.
+ By default, pg_backup_start can take a long time to finish.
  This is because it performs a checkpoint, and the I/O
  required for the checkpoint will be spread out over a significant
  period of time, by default half your inter-checkpoint interval
@@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false);
  issue an immediate checkpoint using as much I/O as available.
 
 
-
- The third parameter being false tells
- pg_start_backup to initiate a non-exclusive base backup.
-


 
@@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false);
 
  In the same connection as before, issue the command:
 
-SELECT * FROM pg_stop_backup(false, true);
+SELECT * FROM pg_backup_stop(wait_for_archive => true);
 
  This terminates backup mode. On a primary, it also performs an automatic
  switch to the next WAL segment.  On a standby, it is not possible to
@@ -937,7 +925,7 @@ SELECT * FROM pg_stop_backup(false, true);
  ready to archive.
 
 
- The pg_stop_b

Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-03-11 Thread Nathan Bossart
On Fri, Mar 11, 2022 at 02:00:37PM -0500, Chapman Flack wrote:
> v7 looks good to me. I have repeated the installcheck-world and given
> the changed documentation sections one last read-through, and will
> change the status to RfC.

Thanks for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-11 Thread Nathan Bossart
+CheckpointStats.repl_map_cutoff_lsn = cutoff;

Could we set repl_map_cutoff_lsn closer to where it is calculated?  Right
now, it's set at the bottom of the function just before the directory is
freed.  Is there a strong reason to do it there?

+"logical rewrite mapping file(s) removed/synced=" UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",

Can we split out the "removed" versus "synced" files?  Otherwise, I think
this is basically just telling us how many files are in the mappings
directory.
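In other words, two counters instead of one, something roughly like this
(illustrative only; I haven't thought through what the underlying stats
fields would need to look like):

    "logical rewrite mapping file(s) removed=" UINT64_FORMAT
    ", synced=" UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",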

+CheckpointStats.repl_snap_cutoff_lsn = cutoff;

I have the same note for this one as I have for repl_map_cutoff_lsn.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-03-12 Thread Nathan Bossart
On Sat, Mar 12, 2022 at 07:00:06AM +, Imseih (AWS), Sami wrote:
>> nitpick: Can we remove the extra spaces in the parentheses?
> 
> fixed
> 
>> What does it mean if there isn't an entry in the map?  Is this actually
>> expected, or should we ERROR instead?
> 
> I cleaned up the code here and added comments. 
> 
>> I think the number of entries should be shared between
>> VacuumWorkerProgressShmemInit() and VacuumWorkerProgressShmemSize().
>> Otherwise, we might update one and not the other.
> 
> Fixed
> 
>> I think we should elaborate a bit more in this comment.  It's difficult to
>> follow what this is doing without referencing the comment above
>> set_vacuum_worker_progress().
> 
> More comments added
> 
> I also simplified the 0002 patch as well.

These patches look pretty good to me.  Barring additional feedback, I
intend to mark this as ready-for-committer early next week.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-12 Thread Nathan Bossart
I think this is an interesting idea.  My first thought is that it would be
nice if we didn't have to first compress the data to see whether we wanted
to store it compressed or not, but I don't think there is much we can do
about that.  In any case, we aren't adding much work in the write path
compared to the potential savings in the read path, so that is probably
okay.

Do you think it is worth making this configurable?  I don't think it is
outside the realm of possibility for some users to care more about disk
space than read performance.  And maybe the threshold for this optimization
could differ based on the workload.

 extern void toast_tuple_externalize(ToastTupleContext *ttc, int attribute,
 int options);
+extern void toast_tuple_opt_externalize(ToastTupleContext *ttc, int attribute,
+int options, Datum old_toast_value,
+ToastAttrInfo *old_toast_attr);

Could we bake this into toast_tuple_externalize() so that all existing
callers benefit from this optimization?  Is there a reason to export both
functions?  Perhaps toast_tuple_externalize() should be renamed and made
static, and then this new function could be called
toast_tuple_externalize() (which would be a wrapper around the internal
function).

+/* Sanity check: if data is not compressed then we can proceed as usual. */
+if (!VARATT_IS_COMPRESSED(DatumGetPointer(*value)))
+toast_tuple_externalize(ttc, attribute, options);

With a --enable-cassert build, this line causes assertion failures in the
call to GetMemoryChunkContext() from pfree().  'make check' is enough to
reproduce it.  Specifically, it fails the following assertion:

AssertArg(MemoryContextIsValid(context));

I didn't see an existing commitfest entry for this patch.  I'd encourage
you to add one in the July commitfest:

https://commitfest.postgresql.org/38/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow async standbys wait for sync replication

2022-03-12 Thread Nathan Bossart
On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> To me it's architecturally the completely wrong direction. We should move in
> the *other* direction, i.e. allow WAL to be sent to standbys before the
> primary has finished flushing it locally. Which requires similar
> infrastructure to what we're discussing here.

I think this is a good point.  After all, WALRead() has the following
comment:

 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.

Once you have all the infrastructure for that, holding back WAL replay on
async standbys based on synchronous replication might be relatively easy.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-13 Thread Nathan Bossart
On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote:
> Thanks for reviewing this. I agree with all of the above suggestions
> and incorporated them in the v2 patch.

Thanks for the new patch.

> Another thing I added in v2 is to not emit snapshot and mapping files
> stats in case of restartpoint as logical decoding isn't supported on
> standbys, so it doesn't make sense to emit the stats there and cause
> server log to grow unnecessarily. Having said that, I added a note
> there to change it whenever logical decoding on standbys is supported.

I think we actually do want to include this information for restartpoints
since some files might be left over from the snapshot that was used to
create the standby.  Also, presumably these functions could do some work
during recovery on a primary.

Another problem I see is that this patch depends on the return value of the
lstat() calls that we are trying to remove in 0001 from another thread [0].
I think the size of the removed/sync'd files is somewhat useful for
understanding disk space usage, but I suspect the time spent performing
these tasks is more closely related to the number of files.  Do you think
reporting the sizes is worth the extra system call?

[0] https://postgr.es/m/20220215231123.GA2584239%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-14 Thread Nathan Bossart
On Mon, Mar 14, 2022 at 05:03:59PM +0530, Bharath Rupireddy wrote:
> On Mon, Mar 14, 2022 at 1:33 PM Michael Paquier  wrote:
>> On Mon, Mar 14, 2022 at 10:54:56AM +0530, Bharath Rupireddy wrote:
>> > Yes, this is a concern. Also, when there were no logical replication
>> > slots on a plain server or the server removed or cleaned up all the
>> > snapshot/mappings files, why would anyone want to have these messages
>> > with all 0s in the server log?
>>
>> The default settings don't enable that, so making things conditional
>> roughly as you are suggesting with two different LOG messages sounds
>> rather fine.
> 
> Attaching v3 patch which emits logs only when necessary and doesn't
> clutter the existing "checkpoint/restartpoint completed" message, see
> some sample logs at [1]. Please review it further.

I'm okay with not adding these statistics when the number of files removed
and sync'd is zero.

IMO we shouldn't include the size of the files since it will prevent us
from removing lstat() calls.  As noted upthread, I suspect the time spent
in these tasks is more closely related to the number of files anyway.

I'm -1 on splitting these new statistics to separate LOGs.  In addition to
making it more difficult to discover statistics for a given checkpoint, I
think it actually adds even more bloat to the server log.  If we are
concerned about the amount of information in these LOGs, perhaps we should
adjust the format to make it more human-readable.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Estimating HugePages Requirements?

2022-03-14 Thread Nathan Bossart
On Mon, Mar 14, 2022 at 04:26:43PM +0100, Magnus Hagander wrote:
> Nothing fixing this ended up actually getting committed, right? That
> is, we still get the extra log output?

Correct.

> And in fact, the command documented on
> https://www.postgresql.org/docs/devel/kernel-resources.html doesn't
> actually produce the output that the docs show, it also shows the log
> line, in the default config? If we can't fix the extra logging we
> should at least have our docs represent reality -- maybe by adding a
> "2>/dev/null" entry? But it'd be better to have it not output that log
> in the first place...

I attached a patch to adjust the documentation for now.  This apparently
crossed my mind earlier [0], but I didn't follow through with it for some
reason.

> (Of course what I'd really want is to be able to run it on a cluster
> that's running, but that was discussed downthread so I'm not bringing
> that one up for changes now)

I think it is worth revisiting the extra logging and the ability to view
runtime-computed GUCs on a running server.  Should this be an open item for
v15, or do you think it's alright to leave it for the v16 development
cycle?

[0] https://postgr.es/m/C45224E1-29C8-414C-A8E6-B718C07ACB94%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7ee7b176c5280349631426ff047a9df394e26d59 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 14 Mar 2022 10:24:48 -0700
Subject: [PATCH v1 1/1] send stderr to /dev/null in Linux Huge Pages
 documentation

---
 doc/src/sgml/runtime.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index f77ed24204..85b3ffcd71 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1448,7 +1448,7 @@ export PG_OOM_ADJUST_VALUE=0
 server must be shut down to view this runtime-computed parameter.
 This might look like:
 
-$ postgres -D $PGDATA -C shared_memory_size_in_huge_pages
+$ postgres -D $PGDATA -C shared_memory_size_in_huge_pages 2> /dev/null
 3170
 $ grep ^Hugepagesize /proc/meminfo
 Hugepagesize:   2048 kB
-- 
2.25.1



Re: Optionally automatically disable logical replication subscriptions on error

2022-03-14 Thread Nathan Bossart
My compiler is worried that syncslotname may be used uninitialized in
start_table_sync().  The attached patch seems to silence this warning.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 05e4e03af5afa1658ede8d78b31c1c999b5c7deb Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 14 Mar 2022 16:00:18 -0700
Subject: [PATCH v1 1/1] silence compiler warning

---
 src/backend/replication/logical/worker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a1fe81b34f..03e069c7cd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3387,7 +3387,7 @@ TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
 static void
 start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
 {
-	char	   *syncslotname;
+	char	   *syncslotname = NULL;
 
 	Assert(am_tablesync_worker());
 
-- 
2.25.1



Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-15 Thread Nathan Bossart
On Tue, Mar 15, 2022 at 11:04:26AM +0900, Michael Paquier wrote:
> On Mon, Mar 14, 2022 at 03:54:19PM +0530, Bharath Rupireddy wrote:
>> At times, the snapshot or mapping files can be large in number, and on
>> some platforms it takes time for a checkpoint to process all of them.
>> Having the stats about them in server logs can help us better analyze
>> why checkpoint took a long time and provide a better RCA.
> 
> Do you have any numbers to share regarding that?  Seeing information
> about 1k WAL segments being recycled and/or removed by a checkpoint
> where the operation takes dozens of seconds to complete because we can
> talk about hundred of gigs worth of files moved around.  If we are
> talking about 100~200 files up to 10~20kB each for snapshot and
> mapping files, the information has less value, worth only a portion of
> one WAL segment.

I don't have specific numbers to share, but as noted elsewhere [0], I
routinely see lengthy checkpoints that spend a lot of time in these cleanup
tasks.

[0] https://postgr.es/m/18ED8B1F-7F5B-4ABF-848D-45916C938BC7%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-15 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 01:03:38AM +0530, davinder singh wrote:
> Regarding the 2nd part of configuring the threshold, Based on our
> experiments, we have fixed it for the attributes with size > 2 * chunk_size.
> The default chunk_size is 2KB and the page size is 8KB. While toasting each
> attribute is divided into chunks, and each page can hold a max of 4 such
> chunks.
> We only need to think about the space used by the last chunk of the
> attribute.
> This means with each value optimization, it might use extra space in the
> range
> (0B,2KB]. I think this extra space is independent of attribute size. So we
> don't
> need to worry about configuring this threshold. Let me know if I missed
> something
> here.

Looking closer, ISTM there are two thresholds here.  The first threshold is
the minimum size of the uncompressed data required to trigger this
optimization.  The purpose behind this threshold is to avoid increasing
disk space usage in return for very little gain in the read path (i.e.,
skipping decompression).  Specifically, this threshold helps avoid
increasing disk space usage when there are many small TOAST attributes.
You've chosen 2x the TOAST chunk size as the threshold.

The second threshold is the compression ratio required to trigger this
optimization.  The purpose behind this threshold is to avoid storing poorly
compressed data so that we can improve performance in the read path (i.e.,
skipping decompression).  You've chosen >=1 TOAST chunk as the threshold.
This means that when I have an attribute with size 3x a TOAST chunk, I'll
only store it compressed when the compressed attribute is 2x a TOAST chunk
or smaller (a 33% improvement).  If I have an attribute with size 100x a
TOAST chunk, I'll store it compressed when the compressed attribute is 99x
a TOAST chunk or smaller (a 1% improvement).

IIUC this means that it is much more likely that you will store a
compressed attribute when the uncompressed data is larger.  The benefit of
this approach is that you know you'll never give up more than one TOAST
chunk of disk space in the name of faster reads.  The downside is that a
huge number of smaller attributes can require much more disk space than if
they were universally compressed.

Perhaps this is missing one additional threshold.  Perhaps there should be
a compression ratio for which the optimization is ignored.  If my
uncompressed data is 3x a TOAST chunk and compressing it would save half of
a TOAST chunk of disk space (a ~17% improvement), maybe we should still
store it compressed.  I think this threshold would be most important for
servers with many small TOAST attributes.  I don't know whether it is
necessary to make all of this configurable or whether we should just use
some heuristics to find a good balance.  In any case, I imagine the
settings should look different for the different TOAST compression methods.

Or maybe we should just make the first threshold the configuration option
for this feature.  If you set it to -1, the optimization is never used.  If
you set it to 0, you always try to save on read time at the expense of disk
space.  If you set it to 2x the TOAST chunk size, you might use up to 50%
more disk space for read path optimizations.  If you set it to 100x the
TOAST chunk size, you might use up to 1% more disk space for read
optimizations.  By default, this configuration option could be set
conservatively (e.g., 10x a TOAST chunk, which could lead to a maximum of
10% more disk space).  I suspect the increase in disk space would be much
less than these numbers in most cases, but it might be helpful to think of
these things in terms of the worst case scenario.
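To make that concrete, the write-path check could look roughly like this.
This is only a sketch; the GUC and function names are made up for
illustration:

    /*
     * Hypothetical GUC: minimum uncompressed attribute size, measured in
     * TOAST chunks, before we consider storing the data uncompressed to
     * speed up reads.  -1 disables the optimization entirely.
     */
    int         toast_external_uncompressed_threshold = 10;

    static bool
    consider_storing_uncompressed(Size uncompressed_size)
    {
        if (toast_external_uncompressed_threshold < 0)
            return false;       /* optimization disabled */

        return uncompressed_size >=
            toast_external_uncompressed_threshold * TOAST_MAX_CHUNK_SIZE;
    }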

I apologize for thinking out loud a bit here, but I hope this gives you
some insight into my perspective on this.  In general, I am skeptical that
we can choose one threshold that will work for all PostgreSQL installations
in the known universe.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Estimating HugePages Requirements?

2022-03-15 Thread Nathan Bossart
On Tue, Mar 15, 2022 at 11:02:37PM +0100, Magnus Hagander wrote:
> I think we're talking about two different things here.
> 
> I think the "avoid extra logging" would be worth seeing if we can
> address for 15.

A simple approach could be to just set log_min_messages to PANIC before
exiting.  I've attached a patch for this.  With this patch, we'll still see
a FATAL if we try to use 'postgres -C' for a runtime-computed GUC on a
running server, and there will be no extra output as long as the user sets
log_min_messages to INFO or higher (i.e., not a DEBUG* value).  For
comparison, 'postgres -C' for a non-runtime-computed GUC does not emit
extra output as long as the user sets log_min_messages to DEBUG2 or higher.

> The "able to run on a live cluster" seems a lot bigger and more scary
> and not 15 material.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From cdfd1ad00ca1d8afbfcbafc1f684b5ba9cc43eb6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 15 Mar 2022 15:36:41 -0700
Subject: [PATCH v2 1/1] don't emit shutdown messages for 'postgres -C' with
 runtime-computed GUCs

---
 src/backend/postmaster/postmaster.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 80bb269599..bf48bc6326 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1066,6 +1066,10 @@ PostmasterMain(int argc, char *argv[])
  false, false);
 
 		puts(config_val ? config_val : "");
+
+		/* don't emit shutdown messages */
+		SetConfigOption("log_min_messages", "PANIC", PGC_INTERNAL, PGC_S_OVERRIDE);
+
 		ExitPostmaster(0);
 	}
 
-- 
2.25.1



Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 06:40:23PM +1300, Thomas Munro wrote:
> Pushed and back-patched (it's slightly different before 12).  Thanks!

Thank you!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 10:08:45AM -0400, Robert Haas wrote:
> I would like to take a slightly contrary position. I think that a
> system here that involves multiple knobs is going to be too
> complicated to be of any real-world use, because almost nobody will
> understand it or tune it properly for their installation. And who is
> to say that a single setting would even be appropriate for a whole
> installation, as opposed to something that needs to be tuned for each
> individual table? A single tunable might be OK, but what I would
> really like here is a heuristic that, while perhaps not optimal in
> every environment, is good enough that a very high percentage of users
> will never need to worry about it. In a case like this, a small gain
> that happens automatically feels better than a large gain that
> requires manual tweaking in every install.

Agreed.  If there is a tunable, it needs to be as simple as possible.  And
ideally most users would never need to use it because the default would be
"good enough."

> Now that doesn't answer the question of what that heuristic should be.
> I initially thought that if compression didn't end up reducing the
> number of TOAST chunks then there was no point, but that's not true,
> because having the last chunk be smaller than the rest can allow us to
> fit more than 4 into a page rather than exactly 4. However, this
> effect isn't likely to be important if most chunks are full-size
> chunks. If we insert 100 full-size chunks and 1 partial chunk at the
> end, we can't save much even if that chunk ends up being 1 byte
> instead of a full-size chunk. 25 TOAST table pages out of 26 are going
> to be full of full-size chunks either way, so we can't save more than
> ~4% and we might easily save nothing at all. As you say, the potential
> savings are larger as the values get smaller, because a higher
> proportion of the TOAST table pages are not quite full. So I think the
> only cases where it's even interesting to think about suppressing this
> optimization are those where the value is quite small -- and maybe the
> thing to do is just pick a conservatively large threshold and call it
> good.
> 
> For example, say that instead of applying this technique when there
> are at least 2 TOAST chunks, we apply it when there are at least 500
> (i.e. normally 1MB). It's hard for me to imagine that we can ever
> lose, and in fact I think we could come down quite a bit from there
> and still end up winning pretty much all the time. Setting the
> threshold to, say, 50 or 100 or 150 TOAST chunks instead of 2 may
> leave some money on the table, but if it means that we don't have
> meaningful downsides and we don't have tunables for users to fiddle
> with, it might be worth it.

As long as we can demonstrate some decent improvements, I think using a
conservative threshold is a good idea.  I do wonder whether this could
weaken the read performance gains quite a bit, though.

Thinking further, is simply reducing the number of TOAST chunks the right
thing to look at?  If I want to add a TOAST attribute that requires 100,000
chunks, and you told me that I could save 10% in the read path for an extra
250 chunks of disk space, I would probably choose read performance.  If I
wanted to add 100,000 attributes that were each 3 chunks, and you told me
that I could save 10% in the read path for an extra 75,000 chunks of disk
space, I might choose the extra disk space.  These are admittedly extreme
(and maybe even impossible) examples, but my point is that the amount of
disk space you are willing to give up may be related to the size of the
attribute.  And maybe one way to extract additional read performance with
this optimization is to use a variable threshold so that we are more likely
to use it for large attributes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 03:02:41PM +0530, Bharath Rupireddy wrote:
> On Mon, Mar 14, 2022 at 10:34 PM Nathan Bossart  wrote:
>> I'm -1 on splitting these new statistics to separate LOGs.  In addition to
>> making it more difficult to discover statistics for a given checkpoint, I
>> think it actually adds even more bloat to the server log.  If we are
>> concerned about the amount of information in these LOGs, perhaps we should
>> adjust the format to make it more human-readable.
> 
> Below are the ways that I can think of. Thoughts?

I think I prefer the first option.  If these tasks don't do anything, we
leave the stats out of the message.  If they do some work, we add them.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 11:36:56AM -0700, Nathan Bossart wrote:
> Thinking further, is simply reducing the number of TOAST chunks the right
> thing to look at?  If I want to add a TOAST attribute that requires 100,000
> chunks, and you told me that I could save 10% in the read path for an extra
> 250 chunks of disk space, I would probably choose read performance.  If I
> wanted to add 100,000 attributes that were each 3 chunks, and you told me
> that I could save 10% in the read path for an extra 75,000 chunks of disk
> space, I might choose the extra disk space.  These are admittedly extreme
> (and maybe even impossible) examples, but my point is that the amount of
> disk space you are willing to give up may be related to the size of the
> attribute.  And maybe one way to extract additional read performance with
> this optimization is to use a variable threshold so that we are more likely
> to use it for large attributes.

I might be overthinking this.  Maybe it is enough to skip compressing the
attribute whenever compression saves no more than some percentage of the
uncompressed attribute size.  A conservative default setting might be
something like 5% or 10%.
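In code form, that simpler heuristic is just something like the following
(sketch only; the GUC is hypothetical):

    /* hypothetical GUC: required savings as a percentage, e.g., 10 */
    int         toast_min_compression_savings = 10;

    static bool
    compression_is_worthwhile(Size uncompressed_size, Size compressed_size)
    {
        Size        saved;

        if (compressed_size >= uncompressed_size)
            return false;       /* compression didn't help at all */

        saved = uncompressed_size - compressed_size;
        return saved * 100 >= uncompressed_size * toast_min_compression_savings;
    }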

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 09:47:49PM +, Imseih (AWS), Sami wrote:
> Spoke to Nathan offline and fixed some more comments/nitpicks in the patch.

I don't have any substantial comments for v9, so I think this can be marked
as ready-for-committer.  However, we probably should first see whether
Sawada-san has any comments on the revised approach.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-17 Thread Nathan Bossart
On Thu, Mar 17, 2022 at 01:04:17PM -0400, Robert Haas wrote:
> Right, so perhaps the ultimate thing here would be a more fine-grained
> knob than SET STORAGE EXTERNAL -- something that allows you to specify
> that you want to compress only when it really helps. While some people
> might find that useful, I think the current patch is less ambitious,
> and I think that's OK. It just wants to save something in the cases
> where it's basically free. Unfortunately we've learned that it's never
> *entirely* free because making the last TOAST chunk longer can always
> cost you something, even if it gets longer by only 1 byte. But for
> larger values it's hard for that to be significant.

I guess I think we should be slightly more ambitious.  One idea could be to
create a default_toast_compression_ratio GUC with a default of 0.95.  This
means that, by default, a compressed attribute must be 0.95x or less of the
size of the uncompressed attribute to be stored compressed.  Like
default_toast_compression, this could also be overridden at the column
level with something like

ALTER TABLE mytbl ALTER mycol SET COMPRESSION lz4 RATIO 0.9;

If the current "basically free" patch is intended for v15, then maybe all
this extra configurability stuff could wait for a bit, especially if we can
demonstrate a decent read performance boost with a more conservative
setting.  However, I don't see anything terribly complicated about the
proposed configurability changes (assuming my proposal makes some amount of
sense to you and others).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-17 Thread Nathan Bossart
On Thu, Mar 17, 2022 at 06:48:43PM +0530, Bharath Rupireddy wrote:
> Personally, I tend to agree with v4-0001 (option (4)) or v4-0002
> (option (3)) than v4-0003 (option (1)) as it adds more unreadability
> with most of the code duplicated creating more diff with previous
> versions and maintainability problems. Having said that, I will leave
> it to the committer to decide on that.

I don't think v4-0003/option 1 needs to be unreadable.  Perhaps we can use
a StringInfo to build the message.  That might be a net improvement in
readability anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-03-17 Thread Nathan Bossart
It seems unlikely that anything discussed in this thread will be committed
for v15, so I've adjusted the commitfest entry to v16 and moved it to the
next commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Pre-allocating WAL files

2022-03-17 Thread Nathan Bossart
It seems unlikely that this will be committed for v15, so I've adjusted the
commitfest entry to v16 and moved it to the next commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: support for CREATE MODULE

2022-03-17 Thread Nathan Bossart
It seems unlikely that this will be committed for v15.  Swaha, should the
commitfest entry be adjusted to v16 and moved to the next commitfest?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Patch to avoid orphaned dependencies

2022-03-17 Thread Nathan Bossart
Bertand, do you think this has any chance of making it into v15?  If not,
are you alright with adjusting the commitfest entry to v16 and moving it to
the next commitfest?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: support for CREATE MODULE

2022-03-17 Thread Nathan Bossart
On Thu, Mar 17, 2022 at 04:26:31PM -0700, Swaha Miller wrote:
> On Thu, Mar 17, 2022 at 4:16 PM Nathan Bossart  wrote:
> 
>> It seems unlikely that this will be committed for v15.  Swaha, should the
>> commitfest entry be adjusted to v16 and moved to the next commitfest?
>>
>>
> Yes please, thank you Nathan

Done!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2022-03-17 Thread Nathan Bossart
I think this one requires some more work, and it needn't be a priority for
v15, so I've adjusted the commitfest entry to v16 and moved it to the next
commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-18 Thread Nathan Bossart
On Fri, Mar 18, 2022 at 01:32:52PM +0530, Bharath Rupireddy wrote:
> Well, here's the v5 patch, sample output at [1]. Please have a look at it.

I think this is on the right track, but I would even take it a step
further by separating each group of information:

if (restartpoint)
appendStringInfo(&logmsg, "restartpoint complete: ");
else
appendStringInfo(&logmsg, "checkpoint complete: ");

/* buffer stats */
appendStringInfo(&logmsg, "wrote %d buffers (%.1f%%); ",
                 CheckpointStats.ckpt_bufs_written,
                 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers);

/* WAL file stats */
appendStringInfo(&logmsg, "%d WAL file(s) added, %d removed, %d recycled; ",
                 CheckpointStats.ckpt_segs_added,
                 CheckpointStats.ckpt_segs_removed,
                 CheckpointStats.ckpt_segs_recycled);

...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-21 Thread Nathan Bossart
On Mon, Mar 21, 2022 at 11:27:15AM +0530, Bharath Rupireddy wrote:
> On Sat, Mar 19, 2022 at 3:16 AM Nathan Bossart  wrote:
>> /* buffer stats */
>> appendStringInfo(&logmsg, "wrote %d buffers (%.1f%%); ",
>>                  CheckpointStats.ckpt_bufs_written,
>>                  (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers);
>>
>> /* WAL file stats */
>> appendStringInfo(&logmsg, "%d WAL file(s) added, %d removed, %d recycled; ",
>>                  CheckpointStats.ckpt_segs_added,
>>                  CheckpointStats.ckpt_segs_removed,
>>                  CheckpointStats.ckpt_segs_recycled);
> 
> Do we actually need to granularize the message like this? I actually
> don't think so, because we print the stats even if they are 0 (buffers
> written is 0 or WAL segments added is 0 and so on).

I suggested it because I thought it would improve readability and make it
simpler to optionally add new stats to the message.  If there is another
way to accomplish those things, that is fine by me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Estimating HugePages Requirements?

2022-03-21 Thread Nathan Bossart
On Tue, Mar 15, 2022 at 03:44:39PM -0700, Nathan Bossart wrote:
> A simple approach could be to just set log_min_messages to PANIC before
> exiting.  I've attached a patch for this.  With this patch, we'll still see
> a FATAL if we try to use 'postgres -C' for a runtime-computed GUC on a
> running server, and there will be no extra output as long as the user sets
> log_min_messages to INFO or higher (i.e., not a DEBUG* value).  For
> comparison, 'postgres -C' for a non-runtime-computed GUC does not emit
> extra output as long as the user sets log_min_messages to DEBUG2 or higher.

I created a commitfest entry for this:

    https://commitfest.postgresql.org/38/3596/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pgcrypto: Remove internal padding implementation

2022-03-21 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 11:12:06AM +0100, Peter Eisentraut wrote:
> Right, the previous behaviors were clearly faulty.  I have updated the
> commit message to call out the behavior change more clearly.
> 
> This patch is now complete from my perspective.

I took a look at this patch and found nothing of concern.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-22 Thread Nathan Bossart
On Tue, Mar 22, 2022 at 04:34:05PM -0400, Robert Haas wrote:
> We seem to have a shortage of "others" showing up with opinions on
> this topic, but I guess I'm not very confident about the general
> utility of such a setting. Just to be clear, I'm also not very
> confident about the usefulness of the existing settings for
> controlling TOAST. Why is it useful default behavior to try to get
> rows down to 2kB by default, rather than 1.8kB or 3kB? Even more, why
> don't we try to compress primarily based on the length of individual
> attributes and then compress further only if the resulting tuple
> doesn't fit into a page at all? There doesn't seem to be anything
> magic about fitting tuples into a quarter-page, yet the code acts as
> though that's the primary goal - and then, when that didn't turn out
> to work well in all cases, we added a user-settable parameter
> (toast_tuple_target) to let you say you really want tuples in table X
> to fit into a third of a page or a fifth of a page instead of a
> quarter. And there's some justification for that: a proposal to
> fundamentally change the algorithm would likely have gotten bogged
> down for years, and the parameter no doubt lets you solve some
> problems. Yet if the whole algorithm is wrong, and I think maybe it
> is, then being able to change the constants is not really getting us
> where we need to be.
> 
> Then, too, I'm not very confident about the usefulness of EXTENDED,
> EXTERNAL, and MAIN. I think it's useful to be able to categorically
> disable compression (as EXTERNAL does), but you can't categorically
> disable out-of-line storage because the value can be bigger than the
> page, so MAIN vs. EXTENDED is just changing the threshold for the use
> of out-of-line storage. However, it does so in a way that IMHO is not
> particularly intuitive, which goes back to my earlier point about the
> algorithm seeming wacky, and it's in any case unclear why we should
> offer exactly two possible thresholds and not any of the intermediate
> values.

I agree with all of this.  Adding configurability for each constant might
help folks in the short term, but using these knobs probably requires quite
a bit of expertise in Postgres internals as well as a good understanding of
your data.  I think we ought to revisit TOAST configurability from a user
perspective.  IOW what can be chosen automatically, and how do we enable
users to effectively configure the settings that cannot be chosen
automatically?  IMO this is a worthwhile conversation to have as long as it
doesn't stray too far into the "let's rewrite TOAST" realm.  I think there
is always going to be some level of complexity with stuff like TOAST, but
there are probably all sorts of ways to simplify/improve it also.

> Maybe the conclusion here is that more thought is needed before
> changing anything in this area

You've certainly got me thinking more about this.  If the scope of this
work is going to expand, a few months before the first v16 commitfest is
probably the right time for it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2022-03-22 Thread Nathan Bossart
On Thu, Mar 17, 2022 at 04:45:28PM -0700, Nathan Bossart wrote:
> I think this one requires some more work, and it needn't be a priority for
> v15, so I've adjusted the commitfest entry to v16 and moved it to the next
> commitfest.

Here is a new patch.  The main differences from v3 are in
heapam_visibility.c.  Specifically, instead of trying to work the infomask
checks into the visibility logic, I added a new function that does a couple
of assertions.  This function is called at the beginning of each visibility
function.

What do folks think?  The options I've considered are 1) not adding any
such checks to heapam_visibility.c, 2) only adding assertions like the
attached patch, or 3) actually using elog(ERROR, ...) when the invalid bit
patterns are detected.  AFAICT (1) is more in line with existing invalid
bit patterns (e.g., XMAX_COMMITTED + XMAX_IS_MULTI).  There are a couple of
scattered assertions, but most code paths don't check for it.  (2) adds
additional checks, but only for --enable-cassert builds.  (3) would add
checks even for non-assert builds, but there would presumably be some
performance cost involved.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 2d6b372cf61782e0fd52590b57b1c914b0ed7a4c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 22 Mar 2022 15:35:34 -0700
Subject: [PATCH v4 1/1] disallow XMAX_COMMITTED + XMAX_LOCK_ONLY

---
 contrib/amcheck/verify_heapam.c | 19 
 src/backend/access/heap/README.tuplock  |  2 +-
 src/backend/access/heap/heapam.c| 10 +++
 src/backend/access/heap/heapam_visibility.c | 97 ++---
 src/backend/access/heap/hio.c   |  2 +
 5 files changed, 96 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e5f7355dcb..f30b9c234f 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -665,6 +665,25 @@ check_tuple_header(HeapCheckContext *ctx)
 		 */
 	}
 
+	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+		(ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY))
+	{
+		report_corruption(ctx,
+		  pstrdup("locked-only should not be marked committed"));
+
+		/* As above, do not skip further checks. */
+	}
+
+	/* also check for pre-v9.3 lock-only bit pattern */
+	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+		HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask))
+	{
+		report_corruption(ctx,
+		  pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed"));
+
+		/* As above, do not skip further checks. */
+	}
+
 	if (infomask & HEAP_HASNULL)
 		expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts));
 	else
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8baf0..4e565e60ab 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -152,4 +152,4 @@ The following infomask bits are applicable:
   whether the XMAX is a TransactionId or a MultiXactId.
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
-is set.
+or the HEAP_XMAX_LOCK_ONLY bit is set.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3746336a09..3f0b4cd754 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5151,6 +5151,16 @@ l5:
 		MultiXactStatus status;
 		MultiXactStatus new_status;
 
+		/*
+		 * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be
+		 * set in a tuple header, so cross-check.
+		 *
+		 * Note that this also checks for the special locked-only bit pattern
+		 * that may be present on servers that were pg_upgraded from pre-v9.3
+		 * versions.
+		 */
+		Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask));
+
 		if (old_infomask2 & HEAP_KEYS_UPDATED)
 			status = MultiXactStatusUpdate;
 		else
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index ff0b8a688d..7d6fb66b0d 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -78,6 +78,31 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * InfomaskAssertions()
+ *
+ * Checks for invalid infomask bit patterns.
+ */
+static inline void
+InfomaskAssertions(HeapTupleHeader tuple)
+{
+	/*
+	 * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header.
+	 *
+	 * Note that this also checks for the special locked-only bit pattern that
+	 * may be present on servers that were pg_upgraded from pre-v9.3 versions.
+	 */
+	Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
+			 HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)));
+
+	/*
+	 * XMAX_COMMITTED and XMAX_IS_MULTI cannot both be set in a tuple header.
+	 */
+	Assert(!((tuple->t

Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2022-03-22 Thread Nathan Bossart
On Tue, Mar 22, 2022 at 04:13:47PM -0700, Andres Freund wrote:
> Just skimming this thread quickly, I really have no idea what this is trying
> to achieve and the commit message doesn't help either...  I didn't read the
> referenced thread, but I shouldn't have to, to get a basic idea.

Ah, my bad.  I should've made sure the context was carried over better.  I
updated the commit message with some basic information about the intent.
Please let me know if there is anything else that needs to be cleared up.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 739ec6e42183280d57d867157cfe1b37d127ca54 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 22 Mar 2022 15:35:34 -0700
Subject: [PATCH v5 1/1] Disallow setting XMAX_COMMITTED and XMAX_LOCK_ONLY
 together.

Even though Postgres doesn't set both the XMAX_COMMITTED and
XMAX_LOCK_ONLY infomask bits on the same tuple header, the heap
visibility logic still accepts it.  However, other functions like
compute_new_xmax_infomask() seem to assume that this bit pattern
is not possible.

This change marks this bit pattern as disallowed, removes the heap
visibility logic that handles it, and adds checks like those for
other disallowed infomask bit combinations (e.g., XMAX_COMMITTED
and XMAX_IS_MULTI).  Besides simplifying the heap visibility logic
a bit, this change aims to reduce ambiguity about the legal tuple
header states.

Note that this change also disallows XMAX_COMMITTED together with
the special pre-v9.3 locked-only bit pattern that
HEAP_XMAX_IS_LOCKED_ONLY checks for.  This locked-only bit pattern
may still be present on servers pg_upgraded from pre-v9.3 versions.
---
 contrib/amcheck/verify_heapam.c | 19 
 src/backend/access/heap/README.tuplock  |  2 +-
 src/backend/access/heap/heapam.c| 10 +++
 src/backend/access/heap/heapam_visibility.c | 97 ++---
 src/backend/access/heap/hio.c   |  2 +
 5 files changed, 96 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e5f7355dcb..f30b9c234f 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -665,6 +665,25 @@ check_tuple_header(HeapCheckContext *ctx)
 		 */
 	}
 
+	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+		(ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY))
+	{
+		report_corruption(ctx,
+		  pstrdup("locked-only should not be marked committed"));
+
+		/* As above, do not skip further checks. */
+	}
+
+	/* also check for pre-v9.3 lock-only bit pattern */
+	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+		HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask))
+	{
+		report_corruption(ctx,
+		  pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed"));
+
+		/* As above, do not skip further checks. */
+	}
+
 	if (infomask & HEAP_HASNULL)
 		expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts));
 	else
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8baf0..4e565e60ab 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -152,4 +152,4 @@ The following infomask bits are applicable:
   whether the XMAX is a TransactionId or a MultiXactId.
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
-is set.
+or the HEAP_XMAX_LOCK_ONLY bit is set.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3746336a09..3f0b4cd754 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5151,6 +5151,16 @@ l5:
 		MultiXactStatus status;
 		MultiXactStatus new_status;
 
+		/*
+		 * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be
+		 * set in a tuple header, so cross-check.
+		 *
+		 * Note that this also checks for the special locked-only bit pattern
+		 * that may be present on servers that were pg_upgraded from pre-v9.3
+		 * versions.
+		 */
+		Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask));
+
 		if (old_infomask2 & HEAP_KEYS_UPDATED)
 			status = MultiXactStatusUpdate;
 		else
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index ff0b8a688d..7d6fb66b0d 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -78,6 +78,31 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * InfomaskAssertions()
+ *
+ * Checks for invalid infomask bit patterns.
+ */
+static inline void
+InfomaskAssertions(HeapTupleHeader tuple)
+{
+	/*
+	 * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header.
+	 *
+	 * Note that this also checks for the special locked-only bit pattern that
+	 * may be present on servers that wer

Re: Use durable_unlink for .ready and .done files for WAL segment removal

2018-11-02 Thread Nathan Bossart
One argument for checking WAL file existence before calling
archive_command instead might be to avoid the increased startup time.
Granted, any added delay from this patch is unlikely to be noticeable
unless your archiver is way behind and archive_status has a huge
number of files.  However, I have seen cases where startup is stuck on
other tasks like SyncDataDirectory() and RemovePgTempFiles() for a
very long time, so perhaps it is worth considering.
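
For illustration, a minimal sketch of that alternative (not from any posted
patch; the function name is invented):

#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Return true if the WAL segment named by a .ready file still exists and
 * should be handed to archive_command; on ENOENT the orphaned .ready file
 * can simply be discarded instead of failing the command.
 */
static bool
wal_segment_still_exists(const char *xlogpath)
{
	struct stat st;

	if (stat(xlogpath, &st) == 0)
		return true;			/* segment present; archive it */
	if (errno == ENOENT)
		return false;			/* already removed; drop the .ready file */
	return true;				/* other errors: let archive_command surface them */
}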

Nathan

Re: heavily contended lwlocks with long wait queues scale badly

2024-01-10 Thread Nathan Bossart
On Mon, Nov 21, 2022 at 10:31:14AM -0500, Jonathan S. Katz wrote:
> On 11/20/22 2:56 PM, Andres Freund wrote:
>> I still think it might be worth to backpatch in a bit, but so far the votes 
>> on
>> that weren't clear enough on that to feel comfortable.
> 
> My general feeling is "yes" on backpatching, particularly if this is a bug
> and it's fixable without ABI breaks.

Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion.  I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too.  While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.

Thoughts?

[0] https://postgr.es/m/CAM527d-uDn5osa6QPKxHAC6srOfBH3M8iXUM%3DewqHV6n%3Dw1u8Q%40mail.gmail.com
[1] https://postgr.es/m/VI1PR05MB62031A41186ACC3FC91ACFC70%40VI1PR05MB6206.eurprd05.prod.outlook.com
[2] https://postgr.es/m/dd0e070809430a31f7ddd8483fbcce59%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: introduce dynamic shared memory registry

2024-01-10 Thread Nathan Bossart
On Thu, Jan 11, 2024 at 09:50:19AM +0700, Andrei Lepikhov wrote:
> On 9/1/2024 00:16, Nathan Bossart wrote:
>> On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote:
>> > 2. FWIW, I'd like to call this whole feature "Support for named DSM
>> > segments in Postgres". Do you see anything wrong with this?
>> 
>> Why do you feel it should be renamed?  I don't see anything wrong with it,
>> but I also don't see any particular advantage with that name compared to
>> "dynamic shared memory registry."
> It is not a big issue, I suppose. But for me personally (as not a native
> English speaker), the label "Named DSM segments" seems more straightforward
> to understand.

That is good to know, thanks.  I see that it would also align better with
RequestNamedLWLockTranche/GetNamedLWLockTranche.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: recovery modules

2024-01-11 Thread Nathan Bossart
rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6cee7d220b886e9be309929da5274c5770e59845 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 14:28:53 -0800
Subject: [PATCH v17 1/5] introduce routine for checking mutually exclusive
 string GUCs

---
 src/backend/postmaster/pgarch.c |  8 +++-
 src/backend/utils/misc/guc.c| 22 ++
 src/include/utils/guc.h |  3 +++
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 67693b0580..6ae8bb53c4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -824,11 +824,9 @@ LoadArchiveLibrary(void)
 {
 	ArchiveModuleInit archive_init;
 
-	if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("both archive_command and archive_library set"),
-				 errdetail("Only one of archive_command, archive_library may be set.")));
+	(void) CheckMutuallyExclusiveStringGUCs(XLogArchiveLibrary, "archive_library",
+			XLogArchiveCommand, "archive_command",
+			ERROR);
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8f65ef3d89..9d21c16169 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2640,6 +2640,28 @@ ReportGUCOption(struct config_generic *record)
 	pfree(val);
 }
 
+/*
+ * If both parameters are set, emits a log message at 'elevel' and returns
+ * false.  Otherwise, returns true.
+ */
+bool
+CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+								 const char *p2val, const char *p2name,
+								 int elevel)
+{
+	if (p1val[0] != '\0' && p2val[0] != '\0')
+	{
+		ereport(elevel,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("cannot set both %s and %s", p1name, p2name),
+				 errdetail("Only one of %s or %s may be set.",
+						   p1name, p2name)));
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Convert a value from one of the human-friendly units ("kB", "min" etc.)
  * to the given base unit.  'value' and 'unit' are the input value and unit
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 471d53da8f..6543e61463 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -375,6 +375,9 @@ extern int	NewGUCNestLevel(void);
 extern void AtEOXact_GUC(bool isCommit, int nestLevel);
 extern void BeginReportingGUCOptions(void);
 extern void ReportChangedGUCOptions(void);
+extern bool CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+			 const char *p2val, const char *p2name,
+			 int elevel);
 extern void ParseLongOption(const char *string, char **name, char **value);
 extern const char *get_config_unit_name(int flags);
 extern bool parse_int(const char *value, int *result, int flags,
-- 
2.25.1

>From 9f5c2602baa8e29058bc976c01516d04e3c1f901 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 10:36:00 -0800
Subject: [PATCH v17 2/5] refactor code for restoring via shell

---
 src/backend/Makefile  |   2 +-
 src/backend/access/transam/timeline.c |  12 +-
 src/backend/access/transam/xlog.c |  50 -
 src/backend/access/transam/xlogarchive.c  | 167 ---
 src/backend/access/transam/xlogrecovery.c |   3 +-
 src/backend/meson.build   |   1 +
 src/backend/postmaster/startup.c  |  16 +-
 src/backend/restore/Makefile  |  18 ++
 src/backend/restore/meson.build   |   5 +
 src/backend/restore/shell_restore.c   | 245 ++
 src/include/Makefile  |   2 +-
 src/include/access/xlogarchive.h  |   9 +-
 src/include/meson.build   |   1 +
 src/include/postmaster/startup.h  |   1 +
 src/include/restore/shell_restore.h   |  26 +++
 src/tools/pgindent/typedefs.list  |   1 +
 16 files changed, 407 insertions(+), 152 deletions(-)
 create mode 100644 src/backend/restore/Makefile
 create mode 100644 src/backend/restore/meson.build
 create mode 100644 src/backend/restore/shell_restore.c
 create mode 100644 src/include/restore/shell_restore.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 7d2ea7d54a..cbeb8ac93c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access archive backup bootstrap catalog parser commands executor \
 	foreign lib libpq \
 	main nodes optimizer partitioning port postmaster \
-	regex replication rewrite \
+	regex replication restore rewrite \
 	statistics storage tcop tsearch utils $(t

reorganize "Shared Memory and LWLocks" section of docs

2024-01-11 Thread Nathan Bossart
I recently began trying to write documentation for the dynamic shared
memory registry feature [0], and I noticed that the "Shared Memory and
LWLocks" section of the documentation might need some improvement.  At
least, I felt that it would be hard to add any new content to this section
without making it very difficult to follow.

Concretely, I am proposing breaking it into two sections: one for shared
memory and one for LWLocks.  Furthermore, the LWLocks section will be split
into two: one for requesting locks at server startup and one for requesting
locks after server startup.  I intend to also split the shared memory
section into at-startup and after-startup sections if/when the dynamic
shared memory registry feature is committed.

Besides this restructuring, I felt that certain parts of this documentation
could benefit from rephrasing and/or additional detail.

Thoughts?

[0] https://postgr.es/m/20231205034647.GA2705267%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From dc5c1b37ced883021f4a92a17aa50b9d80d73fc6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v1 1/1] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 180 +---
 1 file changed, 112 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..a758384a9a 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,134 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
 void RequestAddinShmemSpace(int size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend sould obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Any registered
+  shmem_startup_hook will be executed shortly after each
+  backend attaches to shared memory.  Note that add-ins should still
+  acquire AddinShmemInitLock within this hook, as
+  shown in the example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  Like with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-

Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-12 Thread Nathan Bossart
Thanks for reviewing.

On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote:
> """
> Any registered shmem_startup_hook will be executed shortly after each
> backend attaches to shared memory.
> """
> 
> IMO the word "each" here can give the wrong impression as if there are
> certain guarantees about synchronization between backends. Maybe we
> should change this to simply "... will be executed shortly after
> [the?] backend attaches..."

I see what you mean, but I don't think the problem is the word "each."  I
think the problem is the use of passive voice.  What do you think about
something like

Each backend will execute the registered shmem_startup_hook shortly
after it attaches to shared memory.

> """
> should ensure that only one process allocates a new tranche_id
> (LWLockNewTrancheId) and initializes each new LWLock
> (LWLockInitialize).
> """
> 
> Personally I think that reminding the corresponding function name here
> is redundant and complicates reading just a bit. But maybe it's just
> me.

Yeah, I waffled on this one.  I don't mind removing it.
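
For reference, the "more flexible method" the quoted text describes boils
down to something like this (struct and tranche name invented for
illustration):

/* One process allocates the tranche_id and initializes each LWLock: */
ptr->tranche_id = LWLockNewTrancheId();
LWLockInitialize(&ptr->lock, ptr->tranche_id);

/* Every backend then associates a name with that tranche_id: */
LWLockRegisterTranche(ptr->tranche_id, "my tranche name");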

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: introduce dynamic shared memory registry

2024-01-12 Thread Nathan Bossart
Here's a new version of the patch set in which I've attempted to address
the feedback in this thread.  Note that 0001 is being tracked in a separate
thread [0], but it is basically a prerequisite for adding the documentation
for this feature, so that's why I've also added it here.

[0] https://postgr.es/m/20240112041430.GA3557928%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v6 1/3] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 182 +---
 1 file changed, 114 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..0ba52b41d4 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
 void RequestAddinShmemSpace(int size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend sould obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  Like with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-
-
- An example of a shmem_request_hook can be found in
- contrib/pg_stat_statements/pg_stat_statements.c in the
- PostgreSQL source tree.
-
-
- There is another, more flexible method of obtaining LWLocks. First,
- allocate a tranche_id from a shared counter by
- calling:
+  This ensures that an array of num_lwlocks LWLocks is
+  available under the name tranche_name.  A pointer to
+  this array can be obtained by calling:
 
-int LWLockNewTrancheId(void)
+LWLockPadded *GetNamedLWLockTranche(const char *tranche_name)
 
- Next, each 

Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-12 Thread Nathan Bossart
On Fri, Jan 12, 2024 at 09:46:50AM -0600, Nathan Bossart wrote:
> On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote:
>> """
>> Any registered shmem_startup_hook will be executed shortly after each
>> backend attaches to shared memory.
>> """
>> 
>> IMO the word "each" here can give the wrong impression as if there are
>> certain guarantees about synchronization between backends. Maybe we
>> should change this to simply "... will be executed shortly after
>> [the?] backend attaches..."
> 
> I see what you mean, but I don't think the problem is the word "each."  I
> think the problem is the use of passive voice.  What do you think about
> something like
> 
>   Each backend will execute the registered shmem_startup_hook shortly
>   after it attaches to shared memory.
> 
>> """
>> should ensure that only one process allocates a new tranche_id
>> (LWLockNewTrancheId) and initializes each new LWLock
>> (LWLockInitialize).
>> """
>> 
>> Personally I think that reminding the corresponding function name here
>> is redundant and complicates reading just a bit. But maybe it's just
>> me.
> 
> Yeah, I waffled on this one.  I don't mind removing it.

Here is a new version of the patch with these changes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v2 1/1] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 182 +---
 1 file changed, 114 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..0ba52b41d4 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
 void RequestAddinShmemSpace(int size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend sould obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  Like with shared memory,
+  the add-in's shared library must be p

Re: introduce dynamic shared memory registry

2024-01-12 Thread Nathan Bossart
On Fri, Jan 12, 2024 at 11:13:46PM +0530, Abhijit Menon-Sen wrote:
> At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote:
>> +  Each backend sould obtain a pointer to the reserved shared memory by
> 
> sould → should

D'oh.  Thanks.

>> +  Add-ins can reserve LWLocks on server startup.  Like with shared 
>> memory,
> 
> (Would "As with shared memory" read better? Maybe, but then again maybe
> it should be left alone because you also write "Unlike with" elsewhere.)

I think "As with shared memory..." sounds better here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Nathan Bossart
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
> * I did not invent a switch to control the batching of blobs; it's
> just hard-wired at 1000 blobs per group here.  Probably we need some
> user knob for that, but I'm unsure if we want to expose a count or
> just a boolean for one vs more than one blob per batch.  The point of
> forcing one blob per batch would be to allow exact control during
> selective restore, and I'm not sure if there's any value in random
> other settings.  On the other hand, selective restore of blobs has
> been completely broken for the last dozen years and I can't recall any
> user complaints about that; so maybe nobody cares and we could just
> leave this as an internal choice.

I think the argument for making this configurable is that folks might have
fewer larger blobs, in which case it might make sense to lower the batch
size, or they might have many smaller blobs, in which case they might want
to increase it.  But I'm a bit skeptical that will make any sort of
tremendous difference in practice, and I'm not sure how a user would decide
on the right value besides trial-and-error.  In any case, at the moment I
think it'd be okay to keep this an internal setting and wait for feedback
from the field.

> * As the patch stands, we still build a separate TOC entry for each
> comment or seclabel or ACL attached to a blob.  If you have a lot of
> blobs with non-default properties then the TOC bloat problem comes
> back again.  We could do something about that, but it would take a bit
> of tedious refactoring, and the most obvious way to handle it probably
> re-introduces too-many-locks problems.  Is this a scenario that's
> worth spending a lot of time on?

I'll ask around about this.

> Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes.

This one looked good to me.

> Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata
>  and data entries.

I spent most of my review time reading this one.  Overall, it looks pretty
good to me, and I think you've called out most of the interesting design
choices.

> +			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
> +
> +			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);

This is just a nitpick, but is there any reason not to have
IssueCommandPerBlob() accept a format string and the corresponding
arguments?

> + while (n < 1000 && i + n < ntups)

Another nitpick: it might be worth moving these hard-wired constants to
macros.  Without a control switch, that'd at least make it easy for anyone
determined to change the value for their installation.
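
Something along these lines is what I had in mind (macro names invented):

/* hypothetical homes for the hard-wired constants discussed above */
#define LO_METADATA_BATCH_SIZE	1000	/* blobs grouped per metadata TOC entry */
#define LO_RESTORE_TXN_SIZE		1000	/* pg_upgrade's --transaction-size value */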

> Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA.

This one looks reasonable, too.

> In this patch I have just hard-wired pg_upgrade to use
> --transaction-size 1000.  Perhaps there would be value in adding
> another pg_upgrade option to allow user control of that, but I'm
> unsure that it's worth the trouble; I think few users would use it,
> and any who did would see not that much benefit.  However, we
> might need to adjust the logic to make the size be 1000 divided
> by the number of parallel restore jobs allowed.

I wonder if it'd be worth making this configurable for pg_upgrade as an
escape hatch in case the default setting is problematic.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Nathan Bossart
On Fri, Jan 05, 2024 at 03:02:34PM -0500, Tom Lane wrote:
> On further reflection, there is a very good reason why it's done like
> that.  Because pg_upgrade is doing schema-only dump and restore,
> there's next to no opportunity for parallelism within either pg_dump
> or pg_restore.  There's no data-loading steps, and there's no
> index-building either, so the time-consuming stuff that could be
> parallelized just isn't happening in pg_upgrade's usage.
> 
> Now it's true that my 0003 patch moves the needle a little bit:
> since it makes BLOB creation (as opposed to loading) parallelizable,
> there'd be some hope for parallel pg_restore doing something useful in
> a database with very many blobs.  But it makes no sense to remove the
> existing cross-database parallelism in pursuit of that; you'd make
> many more people unhappy than happy.

I assume the concern is that we'd end up multiplying the effective number
of workers if we parallelized both in-database and cross-database?  Would
it be sufficient to make those separately configurable with a note about
the multiplicative effects of setting both?  I think it'd be unfortunate if
pg_upgrade completely missed out on this improvement.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-12 Thread Nathan Bossart
On Fri, Jan 12, 2024 at 04:42:27PM -0600, Nathan Bossart wrote:
> On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
>> +			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
>> +
>> +			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);
> 
> This is just a nitpick, but is there any reason not to have
> IssueCommandPerBlob() accept a format string and the corresponding
> arguments?

Eh, I guess you'd have to find some other way of specifying where the OID
is supposed to go, which would probably be weird.  Please disregard this
one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Fix minor memory leak in connection string validation

2024-01-12 Thread Nathan Bossart
On Fri, Jan 12, 2024 at 03:06:26PM -0800, Jeff Davis wrote:
> It makes me wonder if we should use the resowner mechanism to track
> pointers to malloc'd memory. Then we could use a standard pattern for
> these kinds of cases, and it would also catch more remote issues, like
> if a pstrdup() fails in an error path (which can happen a few lines up
> if the parse fails).

That seems worth exploring.

>   if (!uses_password)
> + {
> + /* malloc'd, so we must free it explicitly */
> + PQconninfoFree(opts);
> +
>   ereport(ERROR,
>   			(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
>   			 errmsg("password is required"),
>   			 errdetail("Non-superusers must provide a password in the connection string.")));
> + }
>   }
>  
>   PQconninfoFree(opts);

Another option could be to surround this with PG_TRY/PG_FINALLY, but your
patch seems sufficient, too.
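
For comparison, the PG_TRY variant would look roughly like this (sketch
only, not the committed fix):

	opts = PQconninfoParse(connstr, &err);
	PG_TRY();
	{
		/* checks that may ereport(ERROR), e.g. the password requirement */
	}
	PG_FINALLY();
	{
		PQconninfoFree(opts);
	}
	PG_END_TRY();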

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-13 Thread Nathan Bossart
On Sat, Jan 13, 2024 at 01:49:08PM +0300, Aleksander Alekseev wrote:
> That's much better, thanks.
> 
> I think the patch could use another pair of eyes, ideally from a
> native English speaker. But if no one will express any objections for
> a while I suggest merging it.

Great.  I've attached a v3 with a couple of fixes suggested in the other
thread [0].  I'll wait a little while longer in case anyone else wants to
take a look.

[0] https://postgr.es/m/ZaF6UpYImGqVIhVp%40toroid.org

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b56931edc4488a7376b27ba0e5519cc3a61b4899 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v3 1/1] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 182 +---
 1 file changed, 114 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..82e1dadcca 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
 void RequestAddinShmemSpace(int size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  As with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-
-
- An example of a shmem_request_hook can be found in
- contrib/pg_stat_statements/pg_stat_statements.c in the
- PostgreSQL source tree.
-
-
- There is another, more flexible method of obtaining LWLocks. First,
- allocate a tranche_id from a shared counter by
- calling:
+  This ensures that an array of num_lwlocks LWLocks is
+  available under the name tranche_name.  A pointer to
+

Re: introduce dynamic shared memory registry

2024-01-13 Thread Nathan Bossart
On Fri, Jan 12, 2024 at 01:45:55PM -0600, Nathan Bossart wrote:
> On Fri, Jan 12, 2024 at 11:13:46PM +0530, Abhijit Menon-Sen wrote:
>> At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote:
>>> +  Each backend sould obtain a pointer to the reserved shared memory by
>> 
>> sould → should
> 
> D'oh.  Thanks.
> 
>>> +  Add-ins can reserve LWLocks on server startup.  Like with shared 
>>> memory,
>> 
>> (Would "As with shared memory" read better? Maybe, but then again maybe
>> it should be left alone because you also write "Unlike with" elsewhere.)
> 
> I think "As with shared memory..." sounds better here.

Here is a new version of the patch set with these changes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b56931edc4488a7376b27ba0e5519cc3a61b4899 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v7 1/3] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 182 +---
 1 file changed, 114 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..82e1dadcca 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
 void RequestAddinShmemSpace(int size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  As with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-
-
- An example of a shmem_request_hook can be found in
- contrib/pg_stat_statements/pg_stat_statements.c in the
- PostgreSQL sour

Re: Postgres and --config-file option

2024-01-13 Thread Nathan Bossart
On Sat, Jan 13, 2024 at 01:39:50PM +0300, Aleksander Alekseev wrote:
> OK, let's check section "20.1.4. Parameter Interaction via the Shell"
> [1] of the documentation. Currently it doesn't tell anything about the
> ability to specify GUCs --like-this, unless I missed something.

It appears to be documented for 'postgres' as follows [0]:

--name=value
Sets a named run-time parameter; a shorter form of -c.

and similarly within the --help output:

--NAME=VALUE   set run-time parameter

Its documentation also describes this method of specifying parameters in
the 'Examples' section.  The section you refer to calls out "-c", so it is
sort-of indirectly mentioned, but that might be a bit of a generous
assessment.

> Should we remove --config-file from the error message to avoid any
> confusion? Should we correct --help output? Should we update the
> documentation?

It might be worthwhile to update the documentation if it would've helped
prevent confusion here.

Separately, I noticed that this is implemented in postmaster.c by looking
for the '-' option character returned by getopt(), and I'm wondering why
this doesn't use getopt_long() instead.  AFAICT this dates back to the
introduction of GUCs in 6a68f426 (May 2000).

[0] https://www.postgresql.org/docs/devel/app-postgres.html
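
For anyone following along, the --NAME=VALUE handling boils down to
splitting the argument and folding dashes to underscores, roughly like this
simplified sketch of ParseLongOption() (the real one lives in guc.c and
uses palloc; error handling omitted here):

#include <stdlib.h>
#include <string.h>

static void
parse_long_option(const char *string, char **name, char **value)
{
	const char *equal = strchr(string, '=');

	/* assume an '=' is present; the real code also handles its absence */
	*name = strndup(string, equal - string);
	*value = strdup(equal + 1);

	/* fold GUC-name dashes to underscores, so --config-file works too */
	for (char *p = *name; *p; p++)
		if (*p == '-')
			*p = '_';
}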

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: ALTER ROLE documentation improvement

2024-01-14 Thread Nathan Bossart
On Sun, Jan 14, 2024 at 04:17:41PM +0530, vignesh C wrote:
> The attached v3 version patch has the changes for the same.

LGTM.  I'll wait a little while longer for additional feedback, but if none
materializes, I'll commit this soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: archive modules loose ends

2024-01-15 Thread Nathan Bossart
On Mon, Jan 15, 2024 at 12:21:44PM +, Li, Yong wrote:
> The patch looks good to me.  With the context explained in the thread,
> the patch is easy to understand.
> The patch serves as a refactoring which pulls up common memory management
> and error handling concerns into the pgarch.c.  With the patch,
> individual archive callbacks can focus on copying the files and leave the
> boilerplate code to pgarch.c..
> 
> The patch applies cleanly to HEAD.  “make check-world” also runs cleanly
> with no error.

Thanks for reviewing.  I've marked this as ready-for-committer, and I'm
hoping to commit it in the near future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: make pg_ctl more friendly

2024-01-15 Thread Nathan Bossart
+   POSTMASTER_RECOVERY_SHUTDOWN,

Perhaps this should be POSTMASTER_SHUTDOWN_IN_RECOVERY to match the state
in the control file?

+   case POSTMASTER_RECOVERY_SHUTDOWN:
+   print_msg(_("PITR shutdown\n"));
+   print_msg(_("update configuration for startup again if needed\n"));
+   break;

I'm not sure I agree that this is a substantially friendlier message.  From
a quick skim of the thread, it seems like you want to avoid sending a scary
error message if Postgres was intentionally shut down while in recovery.
If I got this particular message, I think I would be worried that something
went wrong during my point-in-time restore, and I'd be scrambling to figure
out what configuration this message wants me to update.

If I'm correctly interpreting the intent here, it might be worth fleshing
out the messages a bit more.  For example, instead of "PITR shutdown,"
perhaps we could say "shut down while in recovery."  And maybe we should
point to the specific settings in the latter message.
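
Concretely, something like this (wording invented, just to sketch the
direction):

	case POSTMASTER_SHUTDOWN_IN_RECOVERY:
		print_msg(_("server was shut down while in recovery\n"));
		print_msg(_("If this was unexpected, check the recovery_target settings before restarting.\n"));
		break;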

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-16 Thread Nathan Bossart
On Tue, Jan 16, 2024 at 10:02:15AM +0530, Bharath Rupireddy wrote:
> The v3 patch looks good to me except for a nitpick: the input
> parameter for RequestAddinShmemSpace is 'Size' not 'int'
> 
>  
>  void RequestAddinShmemSpace(int size)
>  

Hah, I think this mistake is nearly old enough to vote (e0dece1, 5f78aa5).
Good catch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-16 Thread Nathan Bossart
On Tue, Jan 16, 2024 at 08:20:19AM -0600, Nathan Bossart wrote:
> On Tue, Jan 16, 2024 at 10:02:15AM +0530, Bharath Rupireddy wrote:
>> The v3 patch looks good to me except for a nitpick: the input
>> parameter for RequestAddinShmemSpace is 'Size' not 'int'
>> 
>>  
>>  void RequestAddinShmemSpace(int size)
>>  
> 
> Hah, I think this mistake is nearly old enough to vote (e0dece1, 5f78aa5).
> Good catch.

I fixed this in v4.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 402eaf87776fb6a9d212da66947f47c63bd53f2a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v4 1/1] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 184 +---
 1 file changed, 115 insertions(+), 69 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..ede2a5dea6 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
-void RequestAddinShmemSpace(int size)
+void RequestAddinShmemSpace(Size size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  As with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-
-
- An example of a shmem_request_hook can be found in
- contrib/pg_stat_statements/pg_stat_statements.c in the
- PostgreSQL source tree.
-
-
- There is another, more flexible method of obtaining LWLocks. First,
- allocate a tranche_id from a shared counter by
- calling:
+  This ensures that an array of num_lwlocks LWLocks is
+  available under the name tranch

Re: introduce dynamic shared memory registry

2024-01-16 Thread Nathan Bossart
On Tue, Jan 16, 2024 at 10:28:29AM +0530, Bharath Rupireddy wrote:
> I think it's better for GetNamedDSMSegment() to error out on empty
> 'name' and size 0. This makes the user-facing function
> GetNamedDSMSegment more concrete.

Agreed, thanks for the suggestion.

> +void *
> +GetNamedDSMSegment(const char *name, size_t size,
> +   void (*init_callback) (void *ptr), bool *found)
> 
> +Assert(found);
> 
> Why is input parameter 'found' necessary to be passed by the caller?
> Neither the test module added, nor the pg_prewarm is using the found
> variable. The function will anyway create the DSM segment if one with
> the given name isn't found. IMO, found is an optional parameter for
> the caller. So, the assert(found) isn't necessary.

The autoprewarm change (0003) does use this variable.  I considered making
it optional (i.e., you could pass in NULL if you didn't want it), but I
didn't feel like the extra code in GetNamedDSMSegment() to allow this was
worth it so that callers could avoid creating a single bool.
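
For the archives, typical usage looks roughly like this (struct, segment
name, and callback are invented for illustration):

typedef struct MyModuleState
{
	int			counter;
} MyModuleState;

static void
mymodule_init_state(void *ptr)
{
	((MyModuleState *) ptr)->counter = 0;
}

static MyModuleState *
mymodule_attach(void)
{
	bool		found;

	return (MyModuleState *) GetNamedDSMSegment("mymodule",
												sizeof(MyModuleState),
												mymodule_init_state,
												&found);
}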

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 402eaf87776fb6a9d212da66947f47c63bd53f2a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v8 1/3] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 184 +---
 1 file changed, 115 insertions(+), 69 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..ede2a5dea6 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
-void RequestAddinShmemSpace(int size)
+void RequestAddinShmemSpace(Size size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+bool		found;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, &found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  As with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)

Re: ALTER ROLE documentation improvement

2024-01-18 Thread Nathan Bossart
On Thu, Jan 18, 2024 at 02:44:35PM -0700, David G. Johnston wrote:
> LGTM too.  I didn't go looking for anything else related to this, but the
> proposed changes all look needed.

Committed.  Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-19 Thread Nathan Bossart
On Wed, Jan 17, 2024 at 06:48:37AM +0530, Bharath Rupireddy wrote:
> LGTM.

Committed.  Thanks for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: introduce dynamic shared memory registry

2024-01-19 Thread Nathan Bossart
On Wed, Jan 17, 2024 at 08:00:00AM +0530, Bharath Rupireddy wrote:
> The v8 patches look good to me.

Committed.  Thanks everyone for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




cleanup patches for dshash

2024-01-19 Thread Nathan Bossart
While working on the dynamic shared memory registry, I noticed a couple of
potential improvements for code that uses dshash tables.

* A couple of dshash_create() callers pass in 0 for the "void *arg"
  parameter, which seemed weird.  I incorrectly assumed this was some sort
  of flags parameter until I actually looked at the function signature.
  IMHO we should specify NULL here if arg is not used.  0001 makes this
  change.  This is admittedly a nitpick.

* There are no dshash compare/hash functions for string keys.  0002
  introduces some that simply forward to strcmp()/string_hash(), and it
  uses them for the DSM registry's dshash table.  This allows us to remove
  the hacky key padding code for lookups on this table.
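  For illustration, a table keyed by a NUL-terminated name stored in a
  fixed-size field could then declare its parameters like this (a made-up
  example; the real usage is the dsm_registry.c change in 0002 below):

	typedef struct MyEntry
	{
		char		name[64];	/* key: NUL-terminated string */
		int			value;
	} MyEntry;

	static const dshash_parameters my_params = {
		offsetof(MyEntry, value),	/* key size: the whole name[] field */
		sizeof(MyEntry),
		dshash_strcmp,			/* compare as strings, not raw bytes */
		dshash_strhash,			/* hash stops at the NUL terminator */
		MY_TRANCHE_ID			/* made-up tranche ID */
	};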

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 5fe6d05f2cfa1401c2fb967fe4bb52cae3706228 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Jan 2024 15:23:35 -0600
Subject: [PATCH v1 1/2] Use NULL instead of 0 for arg parameter in
 dshash_create().

---
 src/backend/replication/logical/launcher.c | 2 +-
 src/backend/utils/activity/pgstat_shmem.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..ee66c292bd 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1013,7 +1013,7 @@ logicalrep_launcher_attach_dshmem(void)
 		last_start_times_dsa = dsa_create(LWTRANCHE_LAUNCHER_DSA);
 		dsa_pin(last_start_times_dsa);
 		dsa_pin_mapping(last_start_times_dsa);
-		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, 0);
+		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, NULL);
 
 		/* Store handles in shared memory for other backends to use. */
 		LogicalRepCtx->last_start_dsa = dsa_get_handle(last_start_times_dsa);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 3ce48e88a3..71410ddd3f 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -180,7 +180,7 @@ StatsShmemInit(void)
 		 * With the limit in place, create the dshash table. XXX: It'd be nice
 		 * if there were dshash_create_in_place().
 		 */
-		dsh = dshash_create(dsa, &dsh_params, 0);
+		dsh = dshash_create(dsa, &dsh_params, NULL);
 		ctl->hash_handle = dshash_get_hash_table_handle(dsh);
 
 		/* lift limit set above */
-- 
2.25.1

>From 6c6760e6553ac86c334e7c35d2ddbac8fdc1cb9b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Jan 2024 15:38:34 -0600
Subject: [PATCH v1 2/2] Add hash/compare functions for dshash tables with
 string keys.

---
 src/backend/lib/dshash.c   | 23 +++
 src/backend/storage/ipc/dsm_registry.c |  8 +++-
 src/include/lib/dshash.h   |  4 
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b0bc0abda0..e497238e8f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -583,6 +583,29 @@ dshash_memhash(const void *v, size_t size, void *arg)
 	return tag_hash(v, size);
 }
 
+/*
+ * A compare function that forwards to strcmp.
+ */
+int
+dshash_strcmp(const void *a, const void *b, size_t size, void *arg)
+{
+	Assert(strlen((const char *) a) < size);
+	Assert(strlen((const char *) b) < size);
+
+	return strcmp((const char *) a, (const char *) b);
+}
+
+/*
+ * A hash function that forwards to string_hash.
+ */
+dshash_hash
+dshash_strhash(const void *v, size_t size, void *arg)
+{
+	Assert(strlen((const char *) v) < size);
+
+	return string_hash((const char *) v, size);
+}
+
 /*
  * Sequentially scan through dshash table and return all the elements one by
  * one, return NULL when all elements have been returned.
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index ac11f51375..9dce41e8ab 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -50,8 +50,8 @@ typedef struct DSMRegistryEntry
 static const dshash_parameters dsh_params = {
 	offsetof(DSMRegistryEntry, handle),
 	sizeof(DSMRegistryEntry),
-	dshash_memcmp,
-	dshash_memhash,
+	dshash_strcmp,
+	dshash_strhash,
 	LWTRANCHE_DSM_REGISTRY_HASH
 };
 
@@ -132,7 +132,6 @@ GetNamedDSMSegment(const char *name, size_t size,
 {
 	DSMRegistryEntry *entry;
 	MemoryContext oldcontext;
-	char		name_padded[offsetof(DSMRegistryEntry, handle)] = {0};
 	void	   *ret;
 
 	Assert(found);
@@ -155,8 +154,7 @@ GetNamedDSMSegment(const char *name, size_t size,
 	/* Connect to the registry. */
 	init_dsm_registry();
 
-	strcpy(name_padded, name);
-	entry = dshash_find_or_insert(dsm_registry_table, name_padded, found);
+	entry = dshash_find_or_insert(dsm_registry_table, name, found);
 	if (!(*found))
 	{
 		/* Initialize the segment. */

Re: introduce dynamic shared memory registry

2024-01-21 Thread Nathan Bossart
On Sun, Jan 21, 2024 at 11:21:46AM -0500, Tom Lane wrote:
> Coverity complained about this:
> 
> *** CID 1586660:  Null pointer dereferences  (NULL_RETURNS)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/storage/ipc/dsm_registry.c:
>  185 in GetNamedDSMSegment()
> 179   }
> 180   else if (!dsm_find_mapping(entry->handle))
> 181   {
> 182   /* Attach to existing segment. */
> 183   dsm_segment *seg = dsm_attach(entry->handle);
> 184 
>>>> CID 1586660:  Null pointer dereferences  (NULL_RETURNS)
>>>> Dereferencing a pointer that might be "NULL" "seg" when calling 
>>>> "dsm_pin_mapping".
> 185   dsm_pin_mapping(seg);
> 186   ret = dsm_segment_address(seg);
> 187   }
> 188   else
> 189   {
> 190   /* Return address of an already-attached segment. */
> 
> I think it's right --- the comments for dsm_attach explicitly
> point out that a NULL return is possible.  You need to handle
> that scenario in some way other than SIGSEGV.

Oops.  I've attached an attempt at fixing this.  I took the opportunity to
clean up the surrounding code a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From f4c1c7a7ce8bccf7251e384f895f34beb33f839e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sun, 21 Jan 2024 16:05:16 -0600
Subject: [PATCH v1 1/1] fix coverity complaint

---
 src/backend/storage/ipc/dsm_registry.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index ac11f51375..c178173653 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -177,19 +177,22 @@ GetNamedDSMSegment(const char *name, size_t size,
 (errmsg("requested DSM segment size does not match size of "
 		"existing segment")));
 	}
-	else if (!dsm_find_mapping(entry->handle))
+	else
 	{
-		/* Attach to existing segment. */
-		dsm_segment *seg = dsm_attach(entry->handle);
+		dsm_segment *seg = dsm_find_mapping(entry->handle);
+
+		/* If the existing segment is not already attached, attach it now. */
+		if (seg == NULL)
+		{
+			seg = dsm_attach(entry->handle);
+			if (seg == NULL)
+elog(ERROR, "could not map dynamic shared memory segment");
+
+			dsm_pin_mapping(seg);
+		}
 
-		dsm_pin_mapping(seg);
 		ret = dsm_segment_address(seg);
 	}
-	else
-	{
-		/* Return address of an already-attached segment. */
-		ret = dsm_segment_address(dsm_find_mapping(entry->handle));
-	}
 
 	dshash_release_lock(dsm_registry_table, entry);
 	MemoryContextSwitchTo(oldcontext);
-- 
2.25.1



Re: cleanup patches for dshash

2024-01-21 Thread Nathan Bossart
On Mon, Jan 22, 2024 at 10:28:42AM +0800, Andy Fan wrote:
> Both LGTM.

Thanks for looking.

> +dshash_strcmp(const void *a, const void *b, size_t size, void *arg)
> +{
> + Assert(strlen((const char *) a) < size);
> + Assert(strlen((const char *) b) < size);
> +
> 
> Do you think the below change would be more explicit?
> 
> #define DSMRegistryNameSize 64
> 
> DSMRegistryEntry
> {
> char name[DSMRegistryNameSize];
> 
> }
> 
> Assert(strlen((const char *) a) < DSMRegistryNameSize);

This is effectively what it's doing already.  These functions are intended
to be generic so that they could be used with other dshash tables with
string keys, possibly with different sizes.

I did notice a cfbot failure [0].  After a quick glance, I'm assuming this
is caused by the memcpy() in insert_into_bucket().  Even if the string is
short, it will copy the maximum key size, which can read past the end of
the source buffer.  So, 0002 is totally broken at the moment, and we'd need
to teach insert_into_bucket() to strcpy() instead for string keys to fix
it.  Perhaps we could add a field to dshash_parameters for this purpose...
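
Something like this, perhaps (the field name and placement are just my
guess at this point):

	typedef struct dshash_parameters
	{
		size_t		key_size;
		size_t		entry_size;
		dshash_compare_function compare_function;
		dshash_hash_function hash_function;
		dshash_copy_function copy_function; /* new: memcpy vs. strcpy */
		int			tranche_id;
	} dshash_parameters;

insert_into_bucket() would then call copy_function() on the new entry's key
instead of a bare memcpy().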

[0] 
https://api.cirrus-ci.com/v1/artifact/task/5124848070950912/log/src/test/modules/test_dsm_registry/log/postmaster.log

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for dshash

2024-01-21 Thread Nathan Bossart
On Sun, Jan 21, 2024 at 09:51:18PM -0600, Nathan Bossart wrote:
> I did notice a cfbot failure [0].  After a quick glance, I'm assuming this
> is caused by the memcpy() in insert_into_bucket().  Even if the string is
> short, it will copy the maximum key size, which is bad.  So, 0002 is
> totally broken at the moment, and we'd need to teach insert_into_bucket()
> to strcpy() instead for string keys to fix it.  Perhaps we could add a
> field to dshash_parameters for this purpose...

I attempted to fix this in v2 of the patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From af203e6e80f7a2843890e51fd3e3641aef8bd901 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Jan 2024 15:23:35 -0600
Subject: [PATCH v2 1/2] Use NULL instead of 0 for arg parameter in
 dshash_create().

---
 src/backend/replication/logical/launcher.c | 2 +-
 src/backend/utils/activity/pgstat_shmem.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..ee66c292bd 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1013,7 +1013,7 @@ logicalrep_launcher_attach_dshmem(void)
 		last_start_times_dsa = dsa_create(LWTRANCHE_LAUNCHER_DSA);
 		dsa_pin(last_start_times_dsa);
 		dsa_pin_mapping(last_start_times_dsa);
-		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, 0);
+		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, NULL);
 
 		/* Store handles in shared memory for other backends to use. */
 		LogicalRepCtx->last_start_dsa = dsa_get_handle(last_start_times_dsa);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 3ce48e88a3..71410ddd3f 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -180,7 +180,7 @@ StatsShmemInit(void)
 		 * With the limit in place, create the dshash table. XXX: It'd be nice
 		 * if there were dshash_create_in_place().
 		 */
-		dsh = dshash_create(dsa, &dsh_params, 0);
+		dsh = dshash_create(dsa, &dsh_params, NULL);
 		ctl->hash_handle = dshash_get_hash_table_handle(dsh);
 
 		/* lift limit set above */
-- 
2.25.1

>From 1d328f0be681da7e43574350a1d3c8da8ae8ddcc Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Jan 2024 15:38:34 -0600
Subject: [PATCH v2 2/2] Add hash/compare functions for dshash tables with
 string keys.

---
 src/backend/lib/dshash.c   | 58 +-
 src/backend/replication/logical/launcher.c |  1 +
 src/backend/storage/ipc/dsm_registry.c |  9 ++--
 src/backend/utils/activity/pgstat_shmem.c  |  1 +
 src/backend/utils/cache/typcache.c |  2 +
 src/include/lib/dshash.h   | 19 ++-
 6 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b0bc0abda0..cc49b4ca51 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -188,6 +188,8 @@ static bool delete_item_from_bucket(dshash_table *hash_table,
 static inline dshash_hash hash_key(dshash_table *hash_table, const void *key);
 static inline bool equal_keys(dshash_table *hash_table,
 			  const void *a, const void *b);
+static inline void copy_key(dshash_table *hash_table, void *dest,
+			const void *src);
 
 #define PARTITION_LOCK(hash_table, i)			\
 	(&(hash_table)->control->partitions[(i)].lock)
@@ -583,6 +585,49 @@ dshash_memhash(const void *v, size_t size, void *arg)
 	return tag_hash(v, size);
 }
 
+/*
+ * A copy function that forwards to memcpy.
+ */
+void
+dshash_memcpy(void *dest, const void *src, size_t size, void *arg)
+{
+	(void) memcpy(dest, src, size);
+}
+
+/*
+ * A compare function that forwards to strcmp.
+ */
+int
+dshash_strcmp(const void *a, const void *b, size_t size, void *arg)
+{
+	Assert(strlen((const char *) a) < size);
+	Assert(strlen((const char *) b) < size);
+
+	return strcmp((const char *) a, (const char *) b);
+}
+
+/*
+ * A hash function that forwards to string_hash.
+ */
+dshash_hash
+dshash_strhash(const void *v, size_t size, void *arg)
+{
+	Assert(strlen((const char *) v) < size);
+
+	return string_hash((const char *) v, size);
+}
+
+/*
+ * A copy function that forwards to strcpy.
+ */
+void
+dshash_strcpy(void *dest, const void *src, size_t size, void *arg)
+{
+	Assert(strlen((const char *) src) < size);
+
+	(void) strcpy((char *) dest, (const char *) src);
+}
+
 /*
  * Sequentially scan through dshash table and return all the elements one by
  * one, return NULL when all elements have been returned.
@@ -949,7 +994,7 @@ insert_into_bucket(dshash_table *hash_table,
 hash_table->params.entry_size +
 MAXALIGN(sizeof(dshash_table_item)));
 	item = dsa_get_address(hash_table->area, item_pointer);
-	memcpy(ENTRY_FROM_ITEM(item), key, hash_table->params.key_size);

Re: core dumps in auto_prewarm, tests succeed

2024-01-22 Thread Nathan Bossart
On Mon, Jan 22, 2024 at 12:41:17PM -0800, Andres Freund wrote:
> I noticed that I was getting core dumps while executing the tests, without the
> tests failing. Backtraces are variations of this:

Looking, thanks for the heads-up.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: core dumps in auto_prewarm, tests succeed

2024-01-22 Thread Nathan Bossart
On Mon, Jan 22, 2024 at 02:44:57PM -0600, Nathan Bossart wrote:
> On Mon, Jan 22, 2024 at 12:41:17PM -0800, Andres Freund wrote:
>> I noticed that I was getting core dumps while executing the tests, without 
>> the
>> tests failing. Backtraces are variations of this:
> 
> Looking, thanks for the heads-up.

I think this is because the autoprewarm state was moved to a DSM segment,
and DSM segments are detached before the on_shmem_exit callbacks are called
during process exit.  Moving apw_detach_shmem to the before_shmem_exit list
seems to resolve the crashes.

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 9ea6c2252a..88c3a04109 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -165,7 +165,7 @@ autoprewarm_main(Datum main_arg)
 first_time = false;
 
 /* Set on-detach hook so that our PID will be cleared on exit. */
-on_shmem_exit(apw_detach_shmem, 0);
+before_shmem_exit(apw_detach_shmem, 0);
 
 /*
  * Store our PID in the shared memory area --- unless there's already

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: core dumps in auto_prewarm, tests succeed

2024-01-22 Thread Nathan Bossart
On Mon, Jan 22, 2024 at 01:24:54PM -0800, Andres Freund wrote:
> On 2024-01-22 15:19:36 -0600, Nathan Bossart wrote:
>> I think this is because the autoprewarm state was moved to a DSM segment,
>> and DSM segments are detached before the on_shmem_exit callbacks are called
>> during process exit.  Moving apw_detach_shmem to the before_shmem_exit list
>> seems to resolve the crashes.
> 
> That seems plausible. Would still be nice to have at least this test ensure
> that the shutdown code works. Perhaps just a check of the control file after
> shutdown, ensuring that the state is "shutdowned" vs crashed?

I'll give that a try.  I'll also expand the comment above the
before_shmem_exit() call.
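
Roughly along these lines (my wording, not necessarily the final text):

	/*
	 * Set the on-detach hook so that our PID will be cleared on exit.
	 *
	 * This must be registered via before_shmem_exit() rather than
	 * on_shmem_exit(): the shared state lives in a DSM segment, and DSM
	 * segments are detached before on_shmem_exit callbacks run, so a
	 * later callback would touch memory that is no longer mapped.
	 */
	before_shmem_exit(apw_detach_shmem, 0);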

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



