Re: making relfilenodes 56 bits

2022-07-02 Thread Dilip Kumar
On Sat, Jul 2, 2022 at 9:38 AM Andres Freund  wrote:

Thanks for the review,

> I'm not feeling inspired by "locator", tbh. But I don't really have a great
> alternative, so ...
>
>
> On 2022-07-01 16:12:01 +0530, Dilip Kumar wrote:
> > From f07ca9ef19e64922c6ee410707e93773d1a01d7c Mon Sep 17 00:00:00 2001
> > From: dilip kumar 
> > Date: Sat, 25 Jun 2022 10:43:12 +0530
> > Subject: [PATCH v4 2/4] Preliminary refactoring for supporting larger
> >  relfilenumber
>
> I don't think we have abbreviated buffer as 'buff' in many places? I think we
> should either spell buffer out or use 'buf'. This is in regard to BuffTag etc.

Okay, I will change it to 'buf'

> > Subject: [PATCH v4 3/4] Use 56 bits for relfilenumber to avoid wraparound

> Normally we don't capitalize the first character of a comment that's not a
> full sentence (i.e. ending with a punctuation mark).

Okay.

> "logged for use" - looks like you reformulated the sentence incompletely.

Right, I will fix it.

> > + if (ShmemVariableCache->relnumbercount == 0)
> > + {
> > + 
> > XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> > +  
> > VAR_RFN_PREFETCH);
>
> I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.

Maybe we can change to LogNextRelFileNumber()?

> What's the story behind moving relfilenode to the front? Alignment
> consideration? Seems odd to move the relfilenode before the oid. If there's an
> alignment issue, can't you just swap it with reltablespace or such to resolve
> it?

Because of a test case added in this commit
(79b716cfb7a1be2a61ebb4418099db1258f35e30).  Even I did not like to
move relfilenode before oid, but under this commit it is expected this
to aligned as well as before any NameData check this comments

===
+--
+--  Keep such columns before the first NameData column of the
+-- catalog, since packagers can override NAMEDATALEN to an odd number.
+--
===

>
> > From f6e8e0e7412198b02671e67d1859a7448fe83f38 Mon Sep 17 00:00:00 2001
> > From: dilip kumar 
> > Date: Wed, 29 Jun 2022 13:24:32 +0530
> > Subject: [PATCH v4 4/4] Don't delay removing Tombstone file until next
> >  checkpoint
> >
> > Currently, we can not remove the unused relfilenode until the
> > next checkpoint because if we remove them immediately then
> > there is a risk of reusing the same relfilenode for two
> > different relations during single checkpoint due to Oid
> > wraparound.
>
> Well, not quite "currently", because at this point we've fixed that in a prior
> commit ;)

Right, I will change, but I'm not sure whether we want to commit 0003
and 0004 as an independent patch or as a simple patch.

> > Now as part of the previous patch set we have made relfilenode
> > 56 bit wider and removed the risk of wraparound so now we don't
> > need to wait till the next checkpoint for removing the unused
> > relation file and we can clean them up on commit.
>
> Hm. Wasn't there also some issue around crash-restarts benefiting from having
> those files around until later? I think what I'm remembering is what is
> referenced in this comment:

I think due to wraparound if relfilenode gets reused by another
relation in the same checkpoint then there was an issue in crash
recovery with wal level minimal.  But the root of the issue is a
wraparound, right?

>
> > - * For regular relations, we don't unlink the first segment file of the 
> > rel,
> > - * but just truncate it to zero length, and record a request to unlink it 
> > after
> > - * the next checkpoint.  Additional segments can be unlinked immediately,
> > - * however.  Leaving the empty file in place prevents that relfilenumber
> > - * from being reused.  The scenario this protects us from is:
> > - * 1. We delete a relation (and commit, and actually remove its file).
> > - * 2. We create a new relation, which by chance gets the same 
> > relfilenumber as
> > - * the just-deleted one (OIDs must've wrapped around for that to 
> > happen).
> > - * 3. We crash before another checkpoint occurs.
> > - * During replay, we would delete the file and then recreate it, which is 
> > fine
> > - * if the contents of the file were repopulated by subsequent WAL entries.
> > - * But if we didn't WAL-log insertions, but instead relied on fsyncing the
> > - * file after populating it (as we do at wal_level=minimal), the contents 
> > of
> > - * the file would be lost forever.  By leaving the empty file until after 
> > the
> > - * next checkpoint, we prevent reassignment of the relfilenumber until it's
> > - * safe, because relfilenumber assignment skips over any existing file.
>
> This isn't related to oid wraparound, just crashes. It's possible that the
> XLogFlush() in XLogPutNextRelFileNumber() prevents such a scenario, but if so
> it still ought to be explained here, I think.

I think the root cause of the problem is oid reuse which is due to
relfilenode wraparound, and the problem is created if the

Re: making relfilenodes 56 bits

2022-07-02 Thread Robert Haas
On Sat, Jul 2, 2022 at 4:53 AM Dilip Kumar  wrote:
> > I'm doubtful it's a good idea to start dropping at the first segment. I'm
> > fairly certain that there's smgrexists() checks in some places, and they'll
> > now stop working, even if there are later segments that remained, e.g. 
> > because
> > of an error in the middle of removing later segments.
>
> Okay, so you mean to say that we can first drop the remaining segment
> and at last we drop the segment 0 right?

I think we need to do it in descending order, starting with the
highest-numbered segment and working down. md.c isn't smart about gaps
in the sequence of files, so it's better if we don't create any gaps.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-07-02 Thread Robert Haas
On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby  wrote:
> I noticed this during beta1, but dismissed the issue when it wasn't easily
> reproducible.  Now, I saw the same problem while upgrading from beta1 to 
> beta2,
> so couldn't dismiss it.  It turns out that LOs are lost if VACUUM FULL was 
> run.

Yikes. That's really bad, and I have no idea what might be causing it,
either. I'll plan to investigate this on Tuesday unless someone gets
to it before then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: replacing role-level NOINHERIT with a grant-level option

2022-07-02 Thread Robert Haas
On Fri, Jul 1, 2022 at 5:12 PM Nathan Bossart  wrote:
> > I have some hesitations about this approach. On the positive side, I
> > believe it's fully backward compatible, and that's something to like.
> > On the negative side, I've always felt that the role-level properties
> > - [NO]INHERIT, [NO]CREATEROLE, [NO]SUPERUSER, [NO]CREATEDB were hacky
> > kludges, and I'd be happier they all went away in favor of more
> > principled ways of controlling those behaviors. I think that the
> > previous design moves us in the direction of being able to eventually
> > remove [NO]INHERIT, whereas this, if anything, entrenches it.
>
> +1.  As mentioned upthread [0], I think attributes like CREATEROLE,
> SUPERUSER, and CREATEDB could be replaced with predefined roles.  However,
> since role attributes aren't inherited, that would result in a behavior
> change.  I'm curious what you have in mind.  It might be worth spinning off
> a new thread for this sooner than later.

I don't think there is a whole lot of point in replacing role-level
flags with predefined roles that work exactly the same way. Maybe
there is some point, but I'm not excited about it. The problem with
these settings in my opinion is that they are too monolithic. Either
you can create any role with basically any privileges or you can
create no roles at all. Either you are a superuser and can bypass all
permissions checks or you are not and can't bypass any permissions
checks.

> unparenthesized syntax or add WARNINGs when it is used?)  For [NO]INHERIT
> and WITH INHERIT DEFAULT, presumably we could do something similar.  Down
> the road, those would be removed in favor of only using grant-level
> options.

I think it'd be hard to do that if WITH INHERIT DEFAULT is actually
state stored in the catalog. Maybe I should revise this again so that
the catalog column is just true or false, and the role-level property
only sets the default for future grants. That might be more like what
Tom had in mind originally.

More clear sketch: Remove WITH INHERIT DEFAULT and just have WITH
INHERIT { TRUE | FALSE }. If neither is specified, the default for a
new GRANT is set based on the role-level property. I want more control
than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support logical replication of DDLs

2022-07-02 Thread vignesh C
On Sat, Jul 2, 2022 at 8:51 AM Amit Kapila  wrote:
>
> On Fri, Jul 1, 2022 at 10:22 PM vignesh C  wrote:
> >
> > On Wed, Jun 29, 2022 at 3:25 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> >
> > Thanks for the updated patch.
> > Few comments on 0002 patch:
> > 1) When we create a subscription for a publication with the existing
> > default PUBLISH parameter having default value as
> > 'insert,update,delete,truncate', we do an initial table sync to get
> > the initial table data from the publisher to the subscriber. But in
> > case of a publication created with 'ddl', the subscription expects the
> > existing initial tables present in the publisher to be created
> > beforehand in the subscriber. Should this be the default behavior?
> > Should we do a ddl dump for all the tables and restore the ddl to the
> > subscription while creating the subscription? Or is this planned as an
> > option for the later version.
> >
>
> The idea is to develop initial sync (for ddl replication) as a
> separate patch. But both need to be integrated at some point.

Yes, that approach makes sense.

> >
> > 3) SYNTAX Support:
> > Currently creation of "FOR TABLE" publication with ddl is supported.
> > Should we allow support of ddl for "FOR TABLE" publication.
> >
>
> The above comment is unclear to me. It seems to me in the first
> sentence, you are saying that the "FOR TABLE" syntax is supported and
> in the second sentence, you are asking to allow support of it? I think
> at this stage, the focus is to build the core part of the feature
> (allow ddl replication and deparsing support), and then we can discuss
> more on Syntax. Having said that, it will be good if we can support
> table-level DDL replication as well in the patch as you seem to be
> suggesting.

I initially thought that supporting "FOR TABLE" publication for ddl
might not be useful as currently the create subscription fails with
table does not exist error. Now that the initial sync for ddl
replication will also be implemented as mentioned in [1], this issue
will be handled. I agree with supporting table-level DDL replication.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2B5zJAT_RYOAEOq8M33s196kR5sDyLQLUXd8Rnqr%2BiB0Q%40mail.gmail.com

Regards,
Vignesh




Re: pg15b2: large objects lost on upgrade

2022-07-02 Thread Justin Pryzby
On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:
> On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby  wrote:
> > I noticed this during beta1, but dismissed the issue when it wasn't easily
> > reproducible.  Now, I saw the same problem while upgrading from beta1 to 
> > beta2,
> > so couldn't dismiss it.  It turns out that LOs are lost if VACUUM FULL was 
> > run.
> 
> Yikes. That's really bad, and I have no idea what might be causing it,
> either. I'll plan to investigate this on Tuesday unless someone gets
> to it before then.

I suppose it's like Bruce said, here.

https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us

|One tricky case is pg_largeobject, which is copied from the old to new
|cluster since it has user data.  To preserve that relfilenode, you would
|need to have pg_upgrade perform cluster surgery in each database to
|renumber its relfilenode to match since it is created by initdb.  I
|can't think of a case where pg_upgrade already does something like that.

Rather than setting the filenode of the next object as for user tables,
pg-upgrade needs to UPDATE the relfilenode.

This patch "works for me" but feel free to improve on it.
>From 88cbe118d4eb4dcf9b6d2831d81f723587d80942 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 2 Jul 2022 00:48:47 -0500
Subject: [PATCH] wip: preserve relfilenode of pg_largeobject and its indexes

An empty table is created by initdb, but its filenode was not preserved, so
pg_largeobject was empty after upgrades.

See also:
9a974cbcba005256a19991203583a94b4f9a21a9
https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us
https://www.postgresql.org/message-id/20220701231413.GI13040%40telsasoft.com
---
 src/bin/pg_dump/pg_dump.c  | 37 +++---
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 ++
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d9c5bcafd20..c629f4154fc 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3135,7 +3135,7 @@ dumpDatabase(Archive *fout)
 
 	/*
 	 * pg_largeobject comes from the old system intact, so set its
-	 * relfrozenxids and relminmxids.
+	 * relfrozenxids, relminmxids and relfilenode.
 	 */
 	if (dopt->binary_upgrade)
 	{
@@ -3143,34 +3143,41 @@ dumpDatabase(Archive *fout)
 		PQExpBuffer loFrozenQry = createPQExpBuffer();
 		PQExpBuffer loOutQry = createPQExpBuffer();
 		int			i_relfrozenxid,
+	i_relfilenode,
+	i_oid,
 	i_relminmxid;
 
 		/*
 		 * pg_largeobject
 		 */
 		if (fout->remoteVersion >= 90300)
-			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid\n"
+			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid, relfilenode, oid\n"
 			  "FROM pg_catalog.pg_class\n"
-			  "WHERE oid = %u;\n",
-			  LargeObjectRelationId);
+			  "WHERE oid IN (%u, %u);\n",
+			  LargeObjectRelationId, LargeObjectLOidPNIndexId);
 		else
-			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, 0 AS relminmxid\n"
+			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, 0 AS relminmxid, relfilenode, oid\n"
 			  "FROM pg_catalog.pg_class\n"
-			  "WHERE oid = %u;\n",
-			  LargeObjectRelationId);
+			  "WHERE oid IN (%u, %u);\n",
+			  LargeObjectRelationId, LargeObjectLOidPNIndexId);
 
-		lo_res = ExecuteSqlQueryForSingleRow(fout, loFrozenQry->data);
+		lo_res = ExecuteSqlQuery(fout, loFrozenQry->data, PGRES_TUPLES_OK);
 
 		i_relfrozenxid = PQfnumber(lo_res, "relfrozenxid");
 		i_relminmxid = PQfnumber(lo_res, "relminmxid");
+		i_relfilenode = PQfnumber(lo_res, "relfilenode");
+		i_oid = PQfnumber(lo_res, "oid");
+
+		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid, relminmxid and relfilenode\n");
+		for (int i = 0; i < PQntuples(lo_res); ++i)
+			appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
+			  "SET relfrozenxid = '%u', relminmxid = '%u', relfilenode = '%u'\n"
+			  "WHERE oid = %u;\n",
+			  atooid(PQgetvalue(lo_res, i, i_relfrozenxid)),
+			  atooid(PQgetvalue(lo_res, i, i_relminmxid)),
+			  atooid(PQgetvalue(lo_res, i, i_relfilenode)),
+			  atooid(PQgetvalue(lo_res, i, i_oid)));
 
-		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
-		appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
-		  "SET relfrozenxid = '%u', relminmxid = '%u'\n"
-		  "WHERE oid = %u;\n",
-		  atooid(PQgetvalue(lo_res, 0, i_relfrozenxid)),
-		  atooid(PQgetvalue(lo_res, 0, i_relminmxid)),
-		  LargeObjectRelationId);
 		ArchiveEntry(fout, nilCatalogId, createDumpId(),
 	 ARCHIVE_OPTS(.tag = "pg_largeobject",
   .description = "pg_largeobject",
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 67e0be68568..a22f0f8c885 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t

Re: pg15b2: large objects lost on upgrade

2022-07-02 Thread Julien Rouhaud
On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:
> On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby  wrote:
> > I noticed this during beta1, but dismissed the issue when it wasn't easily
> > reproducible.  Now, I saw the same problem while upgrading from beta1 to 
> > beta2,
> > so couldn't dismiss it.  It turns out that LOs are lost if VACUUM FULL was 
> > run.
> 
> Yikes. That's really bad, and I have no idea what might be causing it,
> either. I'll plan to investigate this on Tuesday unless someone gets
> to it before then.

As far as I can see the data is still there, it's just that the new cluster
keeps its default relfilenode instead of preserving the old cluster's value:

regression=# table pg_largeobject;
 loid | pageno | data
--++--
(0 rows)

regression=# select oid, relfilenode from pg_class where relname = 
'pg_largeobject';
 oid  | relfilenode
--+-
 2613 |2613
(1 row)

-- using the value from the old cluster
regression=# update pg_class set relfilenode = 39909 where oid = 2613;
UPDATE 1

regression=# table pg_largeobject;
 loid  | pageno |
---++-
 33211 |  0 | \x0a4920776[...]
 34356 |  0 | \xdeadbeef
(2 rows)




Re: PSA: Autoconf has risen from the dead

2022-07-02 Thread Tom Lane
Peter Eisentraut  writes:
> To summarize:
> - Autoconf 2.71 has been out for 1.5 years.
> - It is available in many recently updated OSs.
> - It allows us to throw away several workarounds.
> Hence:
>> Currently, I think early PG16 might be good time to do this update.

In preparation for reviewing this, I tried to install autoconf 2.71
from source locally.  All went well on my RHEL8 workstation, but
autoconf's testsuite falls over rather badly on my macOS laptop [1].
It fails differently on another Mac where I have a MacPorts
installation at the head of the search path [2].

After sending the requested reports, I tried scanning the bug-autoconf
archives, and found a similar report that was answered thus [3]:

> I *think* this is the same problem as 
> https://savannah.gnu.org/support/?110492 
> : current Autoconf doesn't work correctly with the (rather old) version of 
> GNU 
> M4 that ships with MacOS.  Please try installing a current version of GNU M4 
> in 
> your PATH and then retry the build and testsuite.

So that explains part of it: most of the failures are down to using
Apple's hoary m4 instead of the one from MacPorts.  We could usefully
warn about that in our own docs, perhaps.  But there's still these
scary failures:

509: AC_CHECK_HEADER_STDBOOL FAILED (acheaders.at:9)
514: AC_HEADER_STDBOOL   FAILED (acheaders.at:14)

The generated autoconf program builds the same output files as you have
in your patch, and running the configure script gives the correct answer
from AC_HEADER_STDBOOL, so I'm not sure what these test failures are
unhappy about.  Still, this is not a good look for a mainstream
development platform.  I wonder if we ought to wait for a fix.

regards, tom lane

[1] https://lists.gnu.org/archive/html/bug-autoconf/2022-07/msg0.html
[2] https://lists.gnu.org/archive/html/bug-autoconf/2022-07/msg1.html
[3] https://lists.gnu.org/archive/html/bug-autoconf/2022-04/msg2.html




Re: Slow standby snapshot

2022-07-02 Thread Michail Nikolaev
Hello, Simon.

Sorry for calling you directly, but you know the subject better than
anyone else. It is related to your work from 2010 - replacing
KnownAssignedXidsHash with the KnownAssignedXids array.

I have added additional optimization to the data structure you
implemented. Initially, it was caused by the huge usage of CPU (70%+)
for KnownAssignedXidsGetAndSetXmin in case of long (few seconds)
transactions on primary and high (few thousands) max_connections in
Postgres 11.

After snapshot scalability optimization by Andres Freund (2), it is
not so crucial but still provides a significant performance impact
(+10% TPS) for a typical workload, see benchmark (3).

Last patch version is here - (4).

Does such optimisation look worth committing?

Thanks in advance,
Michail.

[1]: 
https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179
[2]: https://commitfest.postgresql.org/29/2500/
[3]: 
https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6
[4]: 
https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382




Re: PSA: Autoconf has risen from the dead

2022-07-02 Thread Tom Lane
I wrote:
> So that explains part of it: most of the failures are down to using
> Apple's hoary m4 instead of the one from MacPorts.  We could usefully
> warn about that in our own docs, perhaps.

Hmm.  I have just spent a very frustrating hour trying, and failing,
to build any version of GNU m4 from source on either RHEL8 or current
macOS.  I don't quite understand why: neither the RPM specfile nor
the MacPorts recipe for their respective m4 packages seem to contain
any special hacks, so that it looks like the usual "configure; make;
make check; make install" procedure ought to work fine.  But it doesn't.
I hit build failures (apparently because the source code is far too much
in bed with nonstandard aspects of libc), or get an executable that
SIGABRT's instantly, or if it doesn't do that it still fails some
self-tests.  With the latest 1.4.19 on macOS, the configure script
hangs up, for crissakes.

I am now feeling *very* hesitant about doing anything where we might
be effectively asking people to build m4 for themselves.

On the whole, I'm questioning the value of messing with our autoconf
infrastructure at this stage.  We did agree at PGCon that we'd keep
it going for a couple years more, but it's not real clear to me why
we can't limp along with 2.69 until we decide to drop it.

regards, tom lane




AIX support - alignment issues

2022-07-02 Thread Andres Freund
Hi,

(sorry for sending this twice to you Noah, forgot -hackers the first time
round)

We've had a bunch of changes to manually deal with our alignment code not
understanding AIX alignment.

commit f3b421da5f4addc95812b9db05a24972b8fd9739
Author: Peter Eisentraut 
Date:   2016-12-21 12:00:00 -0500

Reorder pg_sequence columns to avoid alignment issue

commit 79b716cfb7a1be2a61ebb4418099db1258f35e30
Author: Amit Kapila 
Date:   2022-04-07 09:39:25 +0530

Reorder subskiplsn in pg_subscription to avoid alignment issues.


A good explanation of the problem is in 
https://postgr.es/m/20220402081346.GD3719101%40rfd.leadboat.com


I strikes me as a remarkably bad idea to manually try to maintain the correct
alignment. Even with the tests added it's still quite manual and requires
contorted struct layouts (see e.g. [1]).

I think we should either teach our system the correct alignment rules or we
should drop AIX support.


If we decide we want to continue supporting AIX we should bite the bullet and
add a 64bit-int TYPALIGN_*. It might be worth to translate that to bytes when
building tupledescs, so we don't need more branches (reducing them compared to
today).


Personally I think we should just drop AIX. The amount of effort to keep it
working is substantial due to being quite different from other unices ([2]), 
the is
very outdated, the whole ecosystem is barely on lifesupport ([3]). And all of 
that
for very little real world use.

Afaics we don't have access to an up2date AIX system. Some of have access to
7.2 via the gcc compile farm, but not 7.3. Most other niche-y operating
systems we can start in a VM, but I've yet to see a legal and affordable way
to do that with AIX.


I think Noah has done quite a heroic effort at keeping the AIX animals in a
kind-of-healthy state, but without more widespread access and more widespread
usage it seems like a doomed effort.


Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/CAFiTN-uiAngcW50Trwa94F1EWY2BxEx%2BB38QSyX3DtV3dzEGhA%40mail.gmail.com

[2] linking etc is handled entirely different, so there's a fair bit of
dedicated AIX code around the buildsystem - a lot of it vestigial stuff,
see references to aix3.2.5 etc.

[3] 7.2 was released in 2015-10-05, 7.3 in 2021-12-10, the set of changes is
pretty darn small for that timeframe

https://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&htmlfid=897/ENUS221-328&appname=USN

Bull / Atos stopped their AIX work in 2022-03-01 - unfortunately they
didn't even keep the announcement of that online.

https://www.linkedin.com/pulse/said-say-bull-closing-down-aix-open-source-platform-michaelis
https://github.com/power-devops/bullfreeware




Re: Emit extra debug message when executing extension script.

2022-07-02 Thread Jeff Davis
On Fri, 2022-07-01 at 15:33 -0700, Nathan Bossart wrote:
> On Fri, Jul 01, 2022 at 03:24:27PM -0700, Jeff Davis wrote:
> > +   ereport(DEBUG1, errmsg("executing extension update
> > script from version '%s' to '%s'", from_version, version));
> 
> nitpick: I would suggest "executing extension script for update from
> version X to Y."

Thank you. Committed with minor modification to include the extension
name.

I did end up using Peter's suggestion. I reviewed other DEBUG messages
and it seems nearly all use elog() or errmsg_internal().

> I personally would rather this output the name of the file.  If
> revealing
> the directory is a concern, perhaps we could just trim everything but
> the
> file name.

I could have slightly refactored the code to do this, but it didn't
quite seem worth it for a single debug message.

Regards,
Jeff Davis






Re: Probable memory leak with ECPG and AIX

2022-07-02 Thread Tom Lane
Noah Misch  writes:
> I had expected to use pthread_once() for the newlocale() call, but there would
> be no useful way to report failure and try again later.  Instead, I called
> newlocale() while ECPGconnect() holds connections_mutex.  See log message and
> comments for details.  I tested "./configure ac_cv_func_uselocale=no ..." and
> tested the scenario of newlocale() failing every time.

This looks solid to me.  The only nit I can find to pick is that I'd
have added one more comment, along the lines of

diff --git a/src/interfaces/ecpg/ecpglib/connect.c 
b/src/interfaces/ecpg/ecpglib/connect.c
index 9f958b822c..96f99ae072 100644
--- a/src/interfaces/ecpg/ecpglib/connect.c
+++ b/src/interfaces/ecpg/ecpglib/connect.c
@@ -508,6 +508,11 @@ ECPGconnect(int lineno, int c, const char *name, const 
char *user, const char *p
 #ifdef ENABLE_THREAD_SAFETY
pthread_mutex_lock(&connections_mutex);
 #endif
+
+   /*
+* ... but first, make certain we have created ecpg_clocale.  Rely on
+* holding connections_mutex to ensure this is done by only one thread.
+*/
 #ifdef HAVE_USELOCALE
if (!ecpg_clocale)
{

I've marked it RFC.

regards, tom lane




Re: AIX support - alignment issues

2022-07-02 Thread Peter Geoghegan
On Sat, Jul 2, 2022 at 11:34 AM Andres Freund  wrote:
> Personally I think we should just drop AIX. The amount of effort to keep it
> working is substantial due to being quite different from other unices ([2]), 
> the is
> very outdated, the whole ecosystem is barely on lifesupport ([3]). And all of 
> that
> for very little real world use.

I tend to agree about dropping AIX. But I wonder if there is an
argument against that proposal that doesn't rely on AIX being relevant
to at least one user. Has supporting AIX ever led to the discovery of
a bug that didn't just affect AIX? In other words, are AIX systems
peculiar in some particular way that clearly makes them more likely to
flush out a certain class of bugs? What is the best argument *against*
desupporting AIX that you know of?

Desupporting AIX doesn't mean that any AIX users will be left in the
lurch immediately. Obviously these users will be able to use a
supported version of Postgres for several more years.

-- 
Peter Geoghegan




Re: AIX support - alignment issues

2022-07-02 Thread Tom Lane
Peter Geoghegan  writes:
> I tend to agree about dropping AIX. But I wonder if there is an
> argument against that proposal that doesn't rely on AIX being relevant
> to at least one user. Has supporting AIX ever led to the discovery of
> a bug that didn't just affect AIX?

Searching the commit log quickly finds

591e088dd

datetime.c's parsing logic has assumed that strtod() will accept
a string that looks like ".", which it does in glibc, but not on
some less-common platforms such as AIX.

glibc's behavior is clearly not meeting the letter of the POSIX spec here.

a745b9365

I'm not sure how we've managed not to notice this problem, but it
seems to explain slow execution of the 017_shm.pl test script on AIX
since commit 4fdbf9af5, which added a speculative "pg_ctl stop" with
the idea of making real sure that the postmaster isn't there.  In the
test steps that kill-9 and then restart the postmaster, it's possible
to get past the initial signal attempt before kill() stops working
for the doomed postmaster.  If that happens, pg_ctl waited till
PGCTLTIMEOUT before giving up ... and the buildfarm's AIX members
have that set very high.

Admittedly, this one is more about "slow" than about "AIX".

57b5a9646

Most versions of tar are willing to overlook the missing terminator, but
the AIX buildfarm animals were not. Fix by inventing a new kind of
bbstreamer that just blindly adds a terminator, and using it whenever we
don't parse the tar archive.

Another place where we failed to conform to relevant standards.

b9b610577

Fix ancient violation of zlib's API spec.

And another.

Now, it's certainly possible that AIX is the only surviving platform
that hasn't adopted bug-compatible-with-glibc interpretations of
POSIX.  But I think the standard is the standard, and we ought to
stay within it.  So I find value in these fixes.

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-02 Thread Andres Freund
Hi,

On 2022-07-02 14:23:08 +0530, Dilip Kumar wrote:
> > > + if (ShmemVariableCache->relnumbercount == 0)
> > > + {
> > > + 
> > > XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> > > +  
> > > VAR_RFN_PREFETCH);
> >
> > I know this is just copied, but I find "XLogPut" as a prefix pretty 
> > unhelpful.
> 
> Maybe we can change to LogNextRelFileNumber()?

Much better.


Hm. Now that I think about it, isn't the XlogFlush() in
XLogPutNextRelFileNumber() problematic performance wise? Yes, we'll spread the
cost across a number of GetNewRelFileNumber() calls, but still, an additional
f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's
zero fsyncs when creating a sequence or table inside a transaction (there are
some for indexes, but there's patches to fix that).

Not that I really see an obvious alternative.

I guess we could try to invent a flush-log-before-write type logic for
relfilenodes somehow? So that the first block actually written to a file needs
to ensure the WAL record that created the relation is flushed. But getting
that to work reliably seems nontrivial.


One thing that would be good is to add an assertion to a few places ensuring
that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere
in the recovery path.


Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
is 8192, but you chose 64 for VAR_RFN_PREFETCH?

I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to
relfilenode.


> > What's the story behind moving relfilenode to the front? Alignment
> > consideration? Seems odd to move the relfilenode before the oid. If there's 
> > an
> > alignment issue, can't you just swap it with reltablespace or such to 
> > resolve
> > it?
> 
> Because of a test case added in this commit
> (79b716cfb7a1be2a61ebb4418099db1258f35e30).  Even I did not like to
> move relfilenode before oid, but under this commit it is expected this
> to aligned as well as before any NameData check this comments
> 
> ===
> +--
> +--  Keep such columns before the first NameData column of the
> +-- catalog, since packagers can override NAMEDATALEN to an odd number.
> +--
> ===

This is embarassing. Trying to keep alignment match between C and catalog
alignment on AIX, without actually making the system understand the alignment
rules, is a remarkably shortsighted approach.

I started a separate thread about it, since it's not really relevant to this 
thread:
https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de

Maybe we could at least make the field order to be something like
  oid, relam, relfilenode, relname

that should be ok alignment wise, keep oid first, and seems to make sense from
an "importance" POV? Can't really interpret later fields without knowing relam
etc.



> > > Now as part of the previous patch set we have made relfilenode
> > > 56 bit wider and removed the risk of wraparound so now we don't
> > > need to wait till the next checkpoint for removing the unused
> > > relation file and we can clean them up on commit.
> >
> > Hm. Wasn't there also some issue around crash-restarts benefiting from 
> > having
> > those files around until later? I think what I'm remembering is what is
> > referenced in this comment:
> 
> I think due to wraparound if relfilenode gets reused by another
> relation in the same checkpoint then there was an issue in crash
> recovery with wal level minimal.  But the root of the issue is a
> wraparound, right?

I'm not convinced the tombstones were required solely in the oid wraparound
case before, despite the comment saying so, with wal_level=minimal. I gotta do
some non-work stuff for a bit, so I need to stop pondering this now :)

I think it might be a good idea to have a few weeks in which we do *not*
remove the tombstones, but have assertion checks against such files existing
when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4
a bit later.


> > I'm doubtful it's a good idea to start dropping at the first segment. I'm
> > fairly certain that there's smgrexists() checks in some places, and they'll
> > now stop working, even if there are later segments that remained, e.g. 
> > because
> > of an error in the middle of removing later segments.
> 
> Okay, so you mean to say that we can first drop the remaining segment
> and at last we drop the segment 0 right?

I'd use the approach Robert suggested and delete from the end, going down.

Greetings,

Andres Freund




Re: Add red-black tree missing comparison searches

2022-07-02 Thread Steve Chavez
Hey Alexander,

> But I think we can support strict inequalities too (e.g.
less and greater without equals).  Could you please make it a bool
argument equal_matches?

Sure, an argument is a good idea to keep the code shorter.

> Could you please extract this change as a separate patch.

Done!

On Thu, 30 Jun 2022 at 14:34, Alexander Korotkov 
wrote:

> Hi, Steve!
>
> Thank you for working on this.
>
> On Thu, Jun 30, 2022 at 7:51 PM Steve Chavez  wrote:
> > Currently the red-black tree implementation only has an equality search.
> Other extensions might need other comparison searches, like less-or-equal
> or greater-or-equal. For example OrioleDB defines a greater-or-equal search
> on its postgres fork:
> >
> >
> https://github.com/orioledb/postgres/blob/4c18ae94c20e3e95c374b9947f1ace7d1d6497a1/src/backend/lib/rbtree.c#L164-L186
> >
> > So I thought this might be valuable to have in core. I've added
> less-or-equal and greater-or-equal searches functions plus tests in the
> test_rbtree module. I can add the remaining less/great searches if this is
> deemed worth it.
>
> Looks good.  But I think we can support strict inequalities too (e.g.
> less and greater without equals).  Could you please make it a bool
> argument equal_matches?
>
> > Also I refactored the sentinel used in the rbtree.c to use C99
> designators.
>
> Could you please extract this change as a separate patch.
>
> --
> Regards,
> Alexander Korotkov
>
From 8046463e2c98fb4c51ec05740cee130bdf2ad533 Mon Sep 17 00:00:00 2001
From: steve-chavez 
Date: Tue, 28 Jun 2022 23:46:38 -0500
Subject: [PATCH 1/2] Change rbtree sentinel to C99 designator

---
 src/backend/lib/rbtree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/lib/rbtree.c b/src/backend/lib/rbtree.c
index a9981dbada..e1cc4110bd 100644
--- a/src/backend/lib/rbtree.c
+++ b/src/backend/lib/rbtree.c
@@ -62,7 +62,7 @@ struct RBTree
 
 static RBTNode sentinel =
 {
-	RBTBLACK, RBTNIL, RBTNIL, NULL
+	.color = RBTBLACK,.left = RBTNIL,.right = RBTNIL,.parent = NULL
 };
 
 
-- 
2.34.1

From 5d49ddd764b05083e1c276df0b50d43ec6896931 Mon Sep 17 00:00:00 2001
From: steve-chavez 
Date: Wed, 29 Jun 2022 16:28:02 -0500
Subject: [PATCH 2/2] Add rbtree missing comparison searches

* Add find great/less or equal by adding an argument
---
 src/backend/lib/rbtree.c   |  61 +
 src/include/lib/rbtree.h   |   2 +
 src/test/modules/test_rbtree/test_rbtree.c | 101 +
 3 files changed, 164 insertions(+)

diff --git a/src/backend/lib/rbtree.c b/src/backend/lib/rbtree.c
index e1cc4110bd..92f6b99164 100644
--- a/src/backend/lib/rbtree.c
+++ b/src/backend/lib/rbtree.c
@@ -161,6 +161,67 @@ rbt_find(RBTree *rbt, const RBTNode *data)
 	return NULL;
 }
 
+/*
+ * rbt_find_great: search for a greater value in an RBTree
+ *
+ * If equal_match is true, this will be a great or equal search.
+ *
+ * Returns the matching tree entry, or NULL if no match is found.
+ */
+RBTNode *
+rbt_find_great(RBTree *rbt, const RBTNode *data, bool equal_match)
+{
+	RBTNode*node = rbt->root;
+	RBTNode*greater = NULL;
+
+	while (node != RBTNIL)
+	{
+		int			cmp = rbt->comparator(data, node, rbt->arg);
+
+		if (equal_match && cmp == 0)
+			return node;
+		else if (cmp < 0)
+		{
+			greater = node;
+			node = node->left;
+		}
+		else
+			node = node->right;
+	}
+
+	return greater;
+}
+
+/*
+ * rbt_find_less: search for a lesser value in an RBTree
+ *
+ * If equal_match is true, this will be a less or equal search.
+ *
+ * Returns the matching tree entry, or NULL if no match is found.
+ */
+RBTNode *
+rbt_find_less(RBTree *rbt, const RBTNode *data, bool equal_match)
+{
+	RBTNode*node = rbt->root;
+	RBTNode*lesser = NULL;
+
+	while (node != RBTNIL)
+	{
+		int			cmp = rbt->comparator(data, node, rbt->arg);
+
+		if (equal_match && cmp == 0)
+			return node;
+		else if (cmp > 0){
+			lesser = node;
+			node = node->right;
+		}
+		else
+			node = node->left;
+	}
+
+	return lesser;
+}
+
 /*
  * rbt_leftmost: fetch the leftmost (smallest-valued) tree node.
  * Returns NULL if tree is empty.
diff --git a/src/include/lib/rbtree.h b/src/include/lib/rbtree.h
index 580a3e3439..13cf68415e 100644
--- a/src/include/lib/rbtree.h
+++ b/src/include/lib/rbtree.h
@@ -67,6 +67,8 @@ extern RBTree *rbt_create(Size node_size,
 		  void *arg);
 
 extern RBTNode *rbt_find(RBTree *rbt, const RBTNode *data);
+extern RBTNode *rbt_find_great(RBTree *rbt, const RBTNode *data, bool equal_match);
+extern RBTNode *rbt_find_less(RBTree *rbt, const RBTNode *data, bool equal_match);
 extern RBTNode *rbt_leftmost(RBTree *rbt);
 
 extern RBTNode *rbt_insert(RBTree *rbt, const RBTNode *data, bool *isNew);
diff --git a/src/test/modules/test_rbtree/test_rbtree.c b/src/test/modules/test_rbtree/test_rbtree.c
index 7cb38759a2..c17d524e01 100644
--- a/src/test/modules/test_rbtree/test_rbtree.c
+++ b/src/test/modules/test_rbtree/test_rbtree.c
@@ -278,6 +278,105 @@ testfind(int si

Re: AIX support - alignment issues

2022-07-02 Thread Andres Freund
Hi,

On 2022-07-02 11:54:16 -0700, Peter Geoghegan wrote:
> I tend to agree about dropping AIX. But I wonder if there is an
> argument against that proposal that doesn't rely on AIX being relevant
> to at least one user. Has supporting AIX ever led to the discovery of
> a bug that didn't just affect AIX?

Yes, it clearly has. But I tend to think that that's far outweighed by the
complications triggered by AIX support. It'd be a different story if AIX
hadn't a very peculiar linking model and was more widely accessible.


> What is the best argument *against* desupporting AIX that you know of?

Hm.

- a distinct set of system libraries that can help find portability issues

- With IBM's compiler it adds a, not otherwise used, compiler that PG builds
  with. So the warnings could theoretically help find issues that we wouldn't
  otherwise see - but I don't think that's been particularly useful (nor
  monitored). And the compiler is buggy enough to add a fair bit work over the
  years.


> Desupporting AIX doesn't mean that any AIX users will be left in the
> lurch immediately. Obviously these users will be able to use a
> supported version of Postgres for several more years.

Right.

Greetings,

Andres Freund




Re: doc: Make selectivity example match wording

2022-07-02 Thread Dian M Fay
On Thu Jun 9, 2022 at 11:57 AM EDT, David G. Johnston wrote:
> Reposting this to its own thread.
>
> https://www.postgresql.org/message-id/flat/CAKFQuwby1aMsJDMeibaBaohgoaZhivAo4WcqHC1%3D9-GDZ3TSng%40mail.gmail.com
>
> doc: make unique non-null join selectivity example match the prose
>
> The description of the computation for the unique, non-null,
> join selectivity describes a division by the maximum of two values,
> while the example shows a multiplication by their reciprocal.  While
> equivalent the max phrasing is easier to understand; which seems
> more important here than precisely adhering to the formula used
> in the code (for which either variant is still an approximation).
>
> While both num_distinct and num_rows are equal for a unique column
> both the concept and formula use row count (10,000) and the
> field num_distinct has already been set to mean the specific value
> present in the pg_stats table (i.e, -1), so use num_rows here.

Pointing out that n_distinct = -1 is helpful but changing "because" to
"and" suggests that the missing MCV info is coincidental or a side
effect. Is there any case in which the stronger "because" wouldn't be
appropriate?

The second parenthetical (num_rows, not shown, but "tenk") took me a
minute to get since the row counts are only apparent on looking somewhat
closely at the other examples in the chapter. num_rows also isn't a
column in pg_stats which the "not shown" could be taken to imply; it's
sourced from somewhere else and only given as num_rows in this example.
How's '(as num_rowsN, 10,000 for both "tenk" example tables)'?

By "this value does get scaled in the non-unique case" do you mean it
relies on n_distinct as in the uncorrected algorithm listing? If so I
think it'd help to specify that.

You didn't take this line on but "This is, subtract the null
fraction..." omits the step of multiplying the complements of the null
fractions together before dividing.

Should n_distinct and num_rows be d in the text?




Re: AIX support - alignment issues

2022-07-02 Thread Peter Geoghegan
On Sat, Jul 2, 2022 at 12:22 PM Tom Lane  wrote:
> Now, it's certainly possible that AIX is the only surviving platform
> that hasn't adopted bug-compatible-with-glibc interpretations of
> POSIX.  But I think the standard is the standard, and we ought to
> stay within it.  So I find value in these fixes.

I imagine that there is strong evolutionary pressure pushing minority
platforms in the direction of bug-compatible-with-glibc. There is
definitely a similar trend around things like endianness and alignment
pickiness. But it wasn't always so.

It seems fair to wonder if AIX bucks the glibc-compatible trend
because it is already on the verge of extinction. If it wasn't just
about dead already then somebody would have gone to the trouble of
making it bug-compatible-with-glibc by now. (To be clear, I'm not
arguing that this is a good thing.)

Maybe it is still worth hanging on to AIX support for the time being,
but it would be nice to have some idea of where we *will* finally draw
the line. If the complaints from Andres aren't a good enough reason
now, then what other hypothetical reasons might be good enough in the
future? It seems fairly likely that Postgres desupporting AIX will
happen (say) at some time in the next decade, no matter what happens
today.

-- 
Peter Geoghegan




Re: AIX support - alignment issues

2022-07-02 Thread Tom Lane
Peter Geoghegan  writes:
> Maybe it is still worth hanging on to AIX support for the time being,
> but it would be nice to have some idea of where we *will* finally draw
> the line. If the complaints from Andres aren't a good enough reason
> now, then what other hypothetical reasons might be good enough in the
> future? It seems fairly likely that Postgres desupporting AIX will
> happen (say) at some time in the next decade, no matter what happens
> today.

Agreed.  But I think that this sort of thing is better driven by
"when there's no longer anyone willing to do the legwork" than
by project policy.  IOW, we'll stop when Noah gets tired of doing
it (and no one steps up to take his place).

In the case at hand, given that the test added by 79b716cfb/c1da0acbb
correctly detects troublesome catalog layouts (and no I've not studied
it myself), I don't see that we have to do more right now.

I am a little concerned though that we don't have access to the latest
version of AIX --- that seems like a non-maintainable situation.

regards, tom lane




Re: AIX support - alignment issues

2022-07-02 Thread Andres Freund
Hi,

On 2022-07-02 16:34:35 -0400, Tom Lane wrote:
> Agreed.  But I think that this sort of thing is better driven by
> "when there's no longer anyone willing to do the legwork" than
> by project policy.  IOW, we'll stop when Noah gets tired of doing
> it (and no one steps up to take his place).

I do think we should take the impact it has on everyone into account, not just
Noah's willingness. If it's just "does somebody still kind of maintain it"
then we'll bear the distributed cost of complications for irrelevant platforms
way longer than worthwhile.


> In the case at hand, given that the test added by 79b716cfb/c1da0acbb
> correctly detects troublesome catalog layouts (and no I've not studied
> it myself), I don't see that we have to do more right now.

What made me look at this issue right now is that the alignment issue lead the
56bit relfilenode patch to move the relfilenode field to the start of pg_class
(ahead of the oid), because a 64bit value cannot be after a NameData. Now, I
think we can do a bit better by moving a few more fields around. But the
restriction still seems quite onerous.

Greetings,

Andres Freund




Re: AIX support - alignment issues

2022-07-02 Thread Tom Lane
Andres Freund  writes:
> What made me look at this issue right now is that the alignment issue lead the
> 56bit relfilenode patch to move the relfilenode field to the start of pg_class
> (ahead of the oid),

Agreed, up with that we should not put.  However ...

> because a 64bit value cannot be after a NameData.

... this coding rule strikes me as utterly ridiculous.  Why can't we
instead insist that NAMEDATALEN must be a multiple of 8?  Anyone who
tries to make it different from that is likely to be wasting padding
space even on platforms where there's not a deeper problem.

regards, tom lane




Re: O(n) tasks cause lengthy startups and checkpoints

2022-07-02 Thread Nathan Bossart
On Fri, Jun 24, 2022 at 11:45:22AM +0100, Simon Riggs wrote:
> On Thu, 23 Jun 2022 at 18:15, Nathan Bossart  wrote:
>> I'm grateful for the discussion in this thread so far, but I'm not seeing a
>> clear path forward.
> 
> +1 to add the new auxiliary process.

I went ahead and put together a new patch set for this in which I've
attempted to address most of the feedback from upthread.  Notably, I've
abandoned 0007 and 0008, added a way for processes to request specific
tasks for the custodian, and removed all the checks for
ShutdownRequestPending.

I haven't addressed the existing transaction ID wraparound risk with the
logical replication files.  My instinct is that this deserveѕ its own
thread, and it might need to be considered a prerequisite to this change
based on the prior discussion here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 68e3005c14ba116e372a1724dad079914108ab2d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 5 Jan 2022 19:24:22 +
Subject: [PATCH v6 1/6] Introduce custodian.

The custodian process is a new auxiliary process that is intended
to help offload tasks could otherwise delay startup and
checkpointing.  This commit simply adds the new process; it does
not yet do anything useful.
---
 src/backend/postmaster/Makefile |   1 +
 src/backend/postmaster/auxprocess.c |   8 +
 src/backend/postmaster/custodian.c  | 252 
 src/backend/postmaster/postmaster.c |  44 -
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/wait_event.c |   3 +
 src/backend/utils/init/miscinit.c   |   3 +
 src/include/miscadmin.h |   3 +
 src/include/postmaster/custodian.h  |  20 ++
 src/include/storage/proc.h  |  11 +-
 src/include/utils/wait_event.h  |   1 +
 12 files changed, 345 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/postmaster/custodian.c
 create mode 100644 src/include/postmaster/custodian.h

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	bgworker.o \
 	bgwriter.o \
 	checkpointer.o \
+	custodian.o \
 	fork_process.o \
 	interrupt.o \
 	pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 39ac4490db..620a0b1bae 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 		case CheckpointerProcess:
 			MyBackendType = B_CHECKPOINTER;
 			break;
+		case CustodianProcess:
+			MyBackendType = B_CUSTODIAN;
+			break;
 		case WalWriterProcess:
 			MyBackendType = B_WAL_WRITER;
 			break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 			CheckpointerMain();
 			proc_exit(1);
 
+		case CustodianProcess:
+			CustodianMain();
+			proc_exit(1);
+
 		case WalWriterProcess:
 			WalWriterMain();
 			proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 00..db00282658
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,252 @@
+/*-
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc.  Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding).  Also, ensure that any offloaded
+ * tasks are either not required during single-user mode or are performed
+ * separately during single-user mode.
+ *
+ * The custodian is not an essential process and can shutdown quickly when
+ * requested.  The custodian will wake up approximately once every 5
+ * minutes to perform its tasks, but backends can (and should) set its
+ * latch to wake it up sooner.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/postmaster/custodian.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+#define CUSTODIAN_TIMEOUT_S (300)		/* 5 minutes */
+
+typedef struct
+{
+

Re: replacing role-level NOINHERIT with a grant-level option

2022-07-02 Thread Nathan Bossart
On Sat, Jul 02, 2022 at 08:45:50AM -0400, Robert Haas wrote:
> I don't think there is a whole lot of point in replacing role-level
> flags with predefined roles that work exactly the same way. Maybe
> there is some point, but I'm not excited about it. The problem with
> these settings in my opinion is that they are too monolithic. Either
> you can create any role with basically any privileges or you can
> create no roles at all. Either you are a superuser and can bypass all
> permissions checks or you are not and can't bypass any permissions
> checks.

Okay.  I see.

>> unparenthesized syntax or add WARNINGs when it is used?)  For [NO]INHERIT
>> and WITH INHERIT DEFAULT, presumably we could do something similar.  Down
>> the road, those would be removed in favor of only using grant-level
>> options.
> 
> I think it'd be hard to do that if WITH INHERIT DEFAULT is actually
> state stored in the catalog. Maybe I should revise this again so that
> the catalog column is just true or false, and the role-level property
> only sets the default for future grants. That might be more like what
> Tom had in mind originally.

I was thinking that when DEFAULT was removed, pg_dump would just need to
generate WITH INHERIT TRUE/FALSE based on the value of rolinherit for older
versions.  Using the role-level property as the default for future grants
seems a viable strategy, although it would break backward compatibility.
For example, if I create a NOINHERIT role, grant a bunch of roles to it,
and then change it to INHERIT, the role won't begin inheriting the
privileges of the roles it is a member of.  Right now, it does.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-07-02 Thread Andres Freund
Hi,

On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
> + /* Obtain requested tasks */
> + SpinLockAcquire(&CustodianShmem->cust_lck);
> + flags = CustodianShmem->cust_flags;
> + CustodianShmem->cust_flags = 0;
> + SpinLockRelease(&CustodianShmem->cust_lck);

Just resetting the flags to 0 is problematic. Consider what happens if there's
two tasks and and the one processed first errors out. You'll loose information
about needing to run the second task.


> + /* TODO: offloaded tasks go here */

Seems we're going to need some sorting of which tasks are most "urgent" / need
to be processed next if we plan to make this into some generic facility.


> +/*
> + * RequestCustodian
> + *   Called to request a custodian task.
> + */
> +void
> +RequestCustodian(int flags)
> +{
> + SpinLockAcquire(&CustodianShmem->cust_lck);
> + CustodianShmem->cust_flags |= flags;
> + SpinLockRelease(&CustodianShmem->cust_lck);
> +
> + if (ProcGlobal->custodianLatch)
> + SetLatch(ProcGlobal->custodianLatch);
> +}

With this representation we can't really implement waiting for a task or
such. And it doesn't seem like a great API for the caller to just specify a
mix of flags.


> + /* Calculate how long to sleep */
> + end_time = (pg_time_t) time(NULL);
> + elapsed_secs = end_time - start_time;
> + if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
> + continue;   /* no sleep for us */
> + cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
> +
> + (void) WaitLatch(MyLatch,
> +  WL_LATCH_SET | WL_TIMEOUT | 
> WL_EXIT_ON_PM_DEATH,
> +  cur_timeout * 1000L /* convert 
> to ms */ ,
> +  WAIT_EVENT_CUSTODIAN_MAIN);
> + }

I don't think we should have this thing wake up on a regular basis. We're
doing way too much of that already, and I don't think we should add
more. Either we need a list of times when tasks need to be processed and wake
up at that time, or just wake up if somebody requests a task.


> From 5e95666efa31d6c8aa351e430c37ead6e27acb72 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart 
> Date: Sun, 5 Dec 2021 21:16:44 -0800
> Subject: [PATCH v6 3/6] Split pgsql_tmp cleanup into two stages.
> 
> First, pgsql_tmp directories will be renamed to stage them for
> removal.  Then, all files in pgsql_tmp are removed before removing
> the staged directories themselves.  This change is being made in
> preparation for a follow-up change to offload most temporary file
> cleanup to the new custodian process.

> Note that temporary relation files cannot be cleaned up via the
> aforementioned strategy and will not be offloaded to the custodian.

> ---
>  src/backend/postmaster/postmaster.c |   8 +-
>  src/backend/storage/file/fd.c   | 174 
>  src/include/storage/fd.h|   2 +-
>  3 files changed, 160 insertions(+), 24 deletions(-)
> 
> diff --git a/src/backend/postmaster/postmaster.c 
> b/src/backend/postmaster/postmaster.c
> index e67370012f..82aa0c6307 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
>* Remove old temporary files.  At this point there can be no other
>* Postgres processes running in this directory, so this should be safe.
>*/
> - RemovePgTempFiles();
> + RemovePgTempFiles(true, true);
> + RemovePgTempFiles(false, false);

This is imo hard to read and easy to get wrong. Make it multiple functions or
pass named flags in.


> + * StagePgTempDirForRemoval
> + *
> + * This function renames the given directory with a special prefix that
> + * RemoveStagedPgTempDirs() will know to look for.  An integer is appended to
> + * the end of the new directory name in case previously staged pgsql_tmp
> + * directories have not yet been removed.
> + */

It doesn't seem great to need to iterate through a directory that contains
other files, potentially a significant number. How about having a
staged_for_removal/ directory, and then only scanning that?


> +static void
> +StagePgTempDirForRemoval(const char *tmp_dir)
> +{
> + DIR*dir;
> + charstage_path[MAXPGPATH * 2];
> + charparent_path[MAXPGPATH * 2];
> + struct stat statbuf;
> +
> + /*
> +  * If tmp_dir doesn't exist, there is nothing to stage.
> +  */
> + dir = AllocateDir(tmp_dir);
> + if (dir == NULL)
> + {
> + if (errno != ENOENT)
> + ereport(LOG,
> + (errcode_for_file_access(),
> +  errmsg("could not open directory 
> \"%s\": %m", tmp_dir)));
> + return;
> + }
> +  

Re: First draft of the PG 15 release notes

2022-07-02 Thread Bruce Momjian
On Fri, Jul  1, 2022 at 09:56:17AM -0700, Peter Geoghegan wrote:
> On Fri, Jul 1, 2022 at 9:41 AM Bruce Momjian  wrote:
> > > It might be worth explaining the shift directly in the release notes.
> > > The new approach is simpler and makes a lot more sense -- why should
> > > the relfrozenxid be closely tied to freezing? We don't necessarily
> >
> > I don't think this is an appropriate detail for the release notes.
> 
> Okay. What about saying something about relminmxid advancement where
> the database consumes lots of multixacts?

No. same issue.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: First draft of the PG 15 release notes

2022-07-02 Thread Peter Geoghegan
>
> Okay, thanks Bruce.
-- 
Peter Geoghegan


Re: replacing role-level NOINHERIT with a grant-level option

2022-07-02 Thread Robert Haas
On Sat, Jul 2, 2022 at 6:16 PM Nathan Bossart  wrote:
> I was thinking that when DEFAULT was removed, pg_dump would just need to
> generate WITH INHERIT TRUE/FALSE based on the value of rolinherit for older
> versions.  Using the role-level property as the default for future grants
> seems a viable strategy, although it would break backward compatibility.
> For example, if I create a NOINHERIT role, grant a bunch of roles to it,
> and then change it to INHERIT, the role won't begin inheriting the
> privileges of the roles it is a member of.  Right now, it does.

I think the idea you propose here is interesting, because I think it
proves that committing v2 or something like it doesn't really lock us
into the role-level property any more than we already are, which at
least makes me feel slightly less bad about that option. However, if
there's implacable opposition to any compatibility break at any point,
then maybe this plan would never actually be implemented in practice.
And if there's not, maybe we can be bolder now.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Probable memory leak with ECPG and AIX

2022-07-02 Thread Noah Misch
On Sat, Jul 02, 2022 at 02:53:34PM -0400, Tom Lane wrote:
> This looks solid to me.  The only nit I can find to pick is that I'd
> have added one more comment, along the lines of
> 
> diff --git a/src/interfaces/ecpg/ecpglib/connect.c 
> b/src/interfaces/ecpg/ecpglib/connect.c
> index 9f958b822c..96f99ae072 100644
> --- a/src/interfaces/ecpg/ecpglib/connect.c
> +++ b/src/interfaces/ecpg/ecpglib/connect.c
> @@ -508,6 +508,11 @@ ECPGconnect(int lineno, int c, const char *name, const 
> char *user, const char *p
>  #ifdef ENABLE_THREAD_SAFETY
>   pthread_mutex_lock(&connections_mutex);
>  #endif
> +
> + /*
> +  * ... but first, make certain we have created ecpg_clocale.  Rely on
> +  * holding connections_mutex to ensure this is done by only one thread.
> +  */
>  #ifdef HAVE_USELOCALE
>   if (!ecpg_clocale)
>   {
> 
> I've marked it RFC.

Thanks for reviewing.  Pushed with that comment.  prairiedog complains[1]:

  ld: common symbols not allowed with MH_DYLIB output format with the 
-multi_module option
  connect.o definition of common _ecpg_clocale (size 4)

I bet this would fix it:

--- a/src/interfaces/ecpg/ecpglib/connect.c
+++ b/src/interfaces/ecpg/ecpglib/connect.c
@@ -11,7 +11,7 @@
 #include "sqlca.h"
 
 #ifdef HAVE_USELOCALE
-locale_t   ecpg_clocale;
+locale_t   ecpg_clocale = (locale_t) 0;
 #endif
 
 #ifdef ENABLE_THREAD_SAFETY

I hear[1] adding -fno-common to compiler options would also fix that.  Still,
in the absence of other opinions, I'll just add the no-op initialization.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2022-07-03%2001%3A14%3A19
[2] https://gcc.gnu.org/legacy-ml/gcc/2005-06/msg00378.html




Re: Probable memory leak with ECPG and AIX

2022-07-02 Thread Tom Lane
Noah Misch  writes:
> Thanks for reviewing.  Pushed with that comment.  prairiedog complains[1]:
>   ld: common symbols not allowed with MH_DYLIB output format with the 
> -multi_module option
>   connect.o definition of common _ecpg_clocale (size 4)

Blah.

> I bet this would fix it:

> -locale_t ecpg_clocale;
> +locale_t ecpg_clocale = (locale_t) 0;

Hmm, I was considering suggesting that just on stylistic grounds,
but decided it was too nitpicky even for me.
Do you want me to test it on prairiedog?

> I hear[1] adding -fno-common to compiler options would also fix that.

I've got -fno-common turned on on my other macOS animals, but in
those cases I did it to detect bugs not fix them.  I'm not sure
whether prairiedog's ancient toolchain has that switch at all,
or whether it behaves the same as in more recent platforms.
Still, that gcc.gnu.org message you cite is of the right era.

regards, tom lane




Re: Probable memory leak with ECPG and AIX

2022-07-02 Thread Noah Misch
On Sat, Jul 02, 2022 at 11:37:08PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > Thanks for reviewing.  Pushed with that comment.  prairiedog complains[1]:
> >   ld: common symbols not allowed with MH_DYLIB output format with the 
> > -multi_module option
> >   connect.o definition of common _ecpg_clocale (size 4)
> 
> Blah.
> 
> > I bet this would fix it:
> 
> > -locale_t   ecpg_clocale;
> > +locale_t   ecpg_clocale = (locale_t) 0;
> 
> Hmm, I was considering suggesting that just on stylistic grounds,
> but decided it was too nitpicky even for me.
> Do you want me to test it on prairiedog?

Sure, if it's easy enough.  If not, I'm 87% sure it will suffice.




Re: Probable memory leak with ECPG and AIX

2022-07-02 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jul 02, 2022 at 11:37:08PM -0400, Tom Lane wrote:
>> Do you want me to test it on prairiedog?

> Sure, if it's easy enough.  If not, I'm 87% sure it will suffice.

On it now, but it'll take a few minutes :-(

regards, tom lane




Re: Probable memory leak with ECPG and AIX

2022-07-02 Thread Tom Lane
I wrote:
> On it now, but it'll take a few minutes :-(

Confirmed that either initializing ecpg_clocale or adding -fno-common
allows the ecpglib build to succeed.  (Testing it beyond that would
require another hour or so to build the rest of the system, so I won't.)

As I said, I was already leaning to the idea that initializing the
variable explicitly is better style, so I recommend we do that.

regards, tom lane