Re: relfilenode statistics
Hi,

On Mon, Jun 03, 2024 at 11:11:46AM +, Bertrand Drouvot wrote:
> Yeah, I'll update the commit message in V2 with better explanations once I get
> feedback on V1 (should we decide to move on with the relfilenode stats idea).

Please find attached v2, a mandatory rebase due to cd312adc56. In passing it provides a more detailed commit message (also making clear that the goal of this patch is to start the discussion and agree on the design before moving forward).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 81d25e077c9f4eafa5304c257d1b39ee8a811628 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v2] Provide relfilenode statistics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently don't have write counters for relations. The reason is that we don't have the relation OID when writing buffers out. Tracking writes per relfilenode would allow us to track/consolidate writes per relation.

Relfilenode stats are also beneficial for the "Split index and table statistics into different types of stats" work in progress: they would allow us to avoid additional branches in some situations.

=== Remarks ===

This is a POC patch. There is still work to do: there are more places where we should add relfilenode counters, more APIs to create for retrieving the relfilenode stats, and while the patch takes care of the rewrite generated by TRUNCATE, there is more to handle, such as CLUSTER and VACUUM FULL. The new logic to retrieve stats in pg_statio_all_tables has been implemented only for the new blocks_written stat (we'd need to do the same for the existing buffer read / buffer hit counters if we agree on the approach implemented here).

The goal of this patch is to start the discussion and agree on the design before moving forward.
--- src/backend/access/rmgrdesc/xactdesc.c| 5 +- src/backend/catalog/storage.c | 8 ++ src/backend/catalog/system_functions.sql | 2 +- src/backend/catalog/system_views.sql | 5 +- src/backend/postmaster/checkpointer.c | 5 + src/backend/storage/buffer/bufmgr.c | 6 +- src/backend/storage/smgr/md.c | 7 ++ src/backend/utils/activity/pgstat.c | 39 -- src/backend/utils/activity/pgstat_database.c | 12 +- src/backend/utils/activity/pgstat_function.c | 13 +- src/backend/utils/activity/pgstat_relation.c | 112 -- src/backend/utils/activity/pgstat_replslot.c | 13 +- src/backend/utils/activity/pgstat_shmem.c | 19 ++- .../utils/activity/pgstat_subscription.c | 12 +- src/backend/utils/activity/pgstat_xact.c | 60 +++--- src/backend/utils/adt/pgstatfuncs.c | 34 +- src/include/access/tableam.h | 19 +++ src/include/access/xact.h | 1 + src/include/catalog/pg_proc.dat | 14 ++- src/include/pgstat.h | 19 ++- src/include/utils/pgstat_internal.h | 34 -- src/test/recovery/t/029_stats_restart.pl | 40 +++ .../recovery/t/030_stats_cleanup_replica.pl | 6 +- src/test/regress/expected/rules.out | 12 +- src/test/regress/expected/stats.out | 30 ++--- src/test/regress/sql/stats.sql| 30 ++--- src/test/subscription/t/026_stats.pl | 4 +- src/tools/pgindent/typedefs.list | 1 + 28 files changed, 415 insertions(+), 147 deletions(-) 4.6% src/backend/catalog/ 47.8% src/backend/utils/activity/ 6.5% src/backend/utils/adt/ 3.7% src/backend/ 3.3% src/include/access/ 3.3% src/include/catalog/ 6.2% src/include/utils/ 3.3% src/include/ 12.1% src/test/recovery/t/ 5.5% src/test/regress/expected/ 3.0% src/test/ diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c index dccca201e0..c02b079645 100644 --- a/src/backend/access/rmgrdesc/xactdesc.c +++ b/src/backend/access/rmgrdesc/xactdesc.c @@ -319,10 +319,11 @@ xact_desc_stats(StringInfo buf, const char *label, appendStringInfo(buf, "; %sdropped stats:", label); for (i = 0; i < ndropped; i++) { - appendStringInfo(buf, " %d/%u/%u", + appendStringInfo(buf, " %d/%u/%u/%u", dropped_stats[i].kind, dropped_stats[i].dboid, - dropped_stats[i].objoid); + dropped_stats[i].objoid, + dropped_stats[i].relfile); } } } diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index f56b3cc0f2..db6107cd90 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -33,6 +33,7 @@ #include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" +#include "utils/pgstat_internal.h" #include "utils/rel.h" /* GUC variables */ @@ -152,6 +153,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence, i
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Tue, 4 Jun 2024 08:30:19 +0900, Michael Paquier wrote in > On Mon, Jan 15, 2024 at 01:34:46PM -0500, Robert Haas wrote: > > This kind of change looks massively helpful to me. I don't know if it > > is exactly right or not, but it would have been a big help to me when > > writing my previous review, so +1 for some change of this general > > type. > > During a live review of this patch last week, as part of the Advanced > Patch Workshop of pgconf.dev, it has been mentioned by Tom that we may > be able to simplify the check on pmstart if the detection of the > postmaster PID started by pg_ctl is more stable using the WIN32 > internals that this patch relies on. I am not sure that this > suggestion is right, though, because we should still care about the > clock skew case as written in the surrounding comments? Even if > that's OK, I would assume that this should be an independent patch, > written on top of the proposed v6-0001. > > Tom, could you comment about that? Perhaps my notes did not catch > what you meant. Thank you for the follow-up. I have been thinking about this since then. At first, I thought it referred to FindFirstChangeNotification() and friends, and inotify on Linux. However, I haven't found a way to simplify the specified code area using those APIs. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in > I have been thinking about this since then. At first, I thought it > referred to FindFirstChangeNotification() and friends, and inotify on > Linux. However, I haven't found a way to simplify the specified code > area using those APIs. By the way, the need to shift by 2 seconds to tolerate clock skew suggests that the current launcher-postmaster association mechanism is somewhat unreliable. Couldn't we add a command line option to postmaster to explicitly pass a unique identifier (say, pid itself) of the launcher? If it is not specified, the number should be the PID of the immediate parent process. This change avoids the need for the special treatment for Windows. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 06 Jun 2024 17:15:15 +0900 (JST), Kyotaro Horiguchi wrote in
> At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in
> > I have been thinking about this since then. At first, I thought it
> > referred to FindFirstChangeNotification() and friends, and inotify on
> > Linux. However, I haven't found a way to simplify the specified code
> > area using those APIs.
>
> By the way, the need to shift by 2 seconds to tolerate clock skew
> suggests that the current launcher-postmaster association mechanism is
> somewhat unreliable. Couldn't we add a command line option to
> postmaster to explicitly pass a unique identifier (say, pid itself) of
> the launcher? If it is not specified, the number should be the PID of
> the immediate parent process.

No, it should be the combination of pg_ctl's PID and a timestamp, to avoid false matching across a reboot.

> This change avoids the need for the special treatment for Windows.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: question regarding policy for patches to out-of-support branches
On Wed, Jun 5, 2024 at 8:29 PM Tom Lane wrote:
>
> Joe Conway writes:
> > I was having a discussion regarding out-of-support branches and effort
> > to keep them building, but could not for the life of me find any actual
> > documented policy (although I distinctly remember that we do something...).
> > Is the policy written down somewhere, or is it only project lore? In
> > either case, what is the actual policy?
>
> I believe our policy was set in this thread:
>
> https://www.postgresql.org/message-id/flat/2923349.1634942313%40sss.pgh.pa.us
>
> and you're right that it hasn't really been memorialized anywhere
> else. I'm not sure where would be appropriate.

Not absolutely sure, but would at least adding a page about this to the PostgreSQL Wiki make sense?

---
Hannu
Proposal: Job Scheduler
Hackers,

We are the PostgreSQL team in Tencent. We have recently developed a job scheduler that runs inside the database to schedule and manage jobs, similar to the Oracle DBMS_JOB package, and we would like to contribute this feature to the community.

Similar to autovacuum, the job scheduler consists of two parts: the job launcher and the job worker. The job launcher periodically scans a metadata table and signals the postmaster to start new workers if needed.

As far as we know, there are currently two open-sourced job scheduling extensions for PostgreSQL: pg_cron (https://github.com/citusdata/pg_cron/) and pg_dbms_job (https://github.com/MigOpsRepos/pg_dbms_job/tree/main). However, the cron-based syntax is not easy to use and suffers from limitations such as one-off commands. The pg_dbms_job extension is difficult to manage and operate because it runs as a standalone process. That's why we have developed a job scheduler that runs as a process inside the database, just like autovacuum.

We can start to send the patch if this idea makes sense to you.

Thanks for your time.

Regards,
Cheng
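[For readers unfamiliar with the DBMS_JOB-style model described above, here is a minimal sketch of what a launcher-scanned metadata table and job registration could look like. All names below are hypothetical illustrations only; the actual patch has not been posted, so nothing here is taken from it.]

```sql
-- Hypothetical metadata table that a job launcher could scan periodically.
-- Table and column names are illustrative; the proposed patch may differ.
CREATE TABLE job_schedule (
    job_id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_command   text        NOT NULL,          -- SQL to run, e.g. a procedure call
    next_run_at   timestamptz NOT NULL,          -- when the launcher should start a worker
    repeat_every  interval,                      -- NULL means a one-off job
    enabled       boolean     NOT NULL DEFAULT true
);

-- Registering a nightly job would then be a plain INSERT:
INSERT INTO job_schedule (job_command, next_run_at, repeat_every)
VALUES ('CALL purge_old_rows()',
        date_trunc('day', now()) + interval '1 day',
        interval '1 day');
```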
Re: Proposal: Job Scheduler
On Thu, 2024-06-06 at 16:27 +0800, Wang Cheng wrote: > We are the PostgreSQL team in Tencent. We have recently developed a job > scheduler > that runs inside the database to schedules and manages jobs similar to Oracle > DBMS_JOB package, and we would like to contribute this feature to the > community. > > As far as we know, there are currently two open-sourced job scheduling > extensions > for PostgreSQL: pg_cron (https://github.com/citusdata/pg_cron/) and > pg_dbms_job > (https://github.com/MigOpsRepos/pg_dbms_job/tree/main). However, the > cron-based > syntax is not easy to use and suffers some limitations like one-off commands. > The pg_dbms_job extension is difficult to manage and operate because it runs > as > a standalone process . There is also pg_timetable: https://github.com/cybertec-postgresql/pg_timetable > That's why we have developed the job scheduler that runs as a process inside > the > database just like autovacuum. > > We can start to send the patch if this idea makes sense to the you. Perhaps your job scheduler is much better than all the existing ones. But what would be a compelling reason to keep it in the PostgreSQL source tree? With PostgreSQL's extensibility features, it should be possible to write your job scheduler as an extension and maintain it outside the PostgreSQL source. I am sure that the PostgreSQL community will be happy to use the extension if it is any good. Yours, Laurenz Albe
Re: Proposal: Job Scheduler
On Thu, 6 Jun 2024 at 09:47, Laurenz Albe wrote: > On Thu, 2024-06-06 at 16:27 +0800, Wang Cheng wrote: > > We are the PostgreSQL team in Tencent. We have recently developed a job > scheduler > > that runs inside the database to schedules and manages jobs similar to > Oracle > > DBMS_JOB package, and we would like to contribute this feature to the > community. > > > > As far as we know, there are currently two open-sourced job scheduling > extensions > > for PostgreSQL: pg_cron (https://github.com/citusdata/pg_cron/) and > pg_dbms_job > > (https://github.com/MigOpsRepos/pg_dbms_job/tree/main). However, the > cron-based > > syntax is not easy to use and suffers some limitations like one-off > commands. > > The pg_dbms_job extension is difficult to manage and operate because it > runs as > > a standalone process . > > There is also pg_timetable: > https://github.com/cybertec-postgresql/pg_timetable And probably the oldest of them all, pgAgent: https://www.pgadmin.org/docs/pgadmin4/8.7/pgagent.html > > > > That's why we have developed the job scheduler that runs as a process > inside the > > database just like autovacuum. > > > > We can start to send the patch if this idea makes sense to the you. > > Perhaps your job scheduler is much better than all the existing ones. > But what would be a compelling reason to keep it in the PostgreSQL source > tree? > With PostgreSQL's extensibility features, it should be possible to write > your > job scheduler as an extension and maintain it outside the PostgreSQL > source. > > I am sure that the PostgreSQL community will be happy to use the extension > if it is any good. > I agree. This is an area in which there are lots of options at the moment, with compelling reasons to choose from various of them depending on your needs. It's this kind of choice that means it's unlikely we'd include any one option in PostgreSQL, much like various other tools such as failover managers or poolers. -- Dave Page pgAdmin: https://www.pgadmin.org PostgreSQL: https://www.postgresql.org EDB: https://www.enterprisedb.com
Re: [multithreading] extension compatibility
On 06/06/2024 05:47, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 10:09 PM Andres Freund wrote:
>> Maybe. I think shipping a mode where users can fairly simply toggle
>> between threaded and process mode will allow us to get this stable a
>> *lot* quicker than if we distribute two builds. Most users don't build
>> from source, distros will have to pick the mode. If they don't choose
>> threaded mode, we'll not find problems. If they do choose threaded mode,
>> we can't ask users to switch to a process based mode to check if the
>> problem is related.
>
> I don't believe that being coercive here is the right approach. I think
> distros see the value in compiling with as many things turned on as
> possible; when they ship with something turned off, it's because it's
> unstable or introduces weird dependencies or has some other disadvantage.

I presume there's no harm in building with multithreading support. If you don't want to use it, put "multithreading=off" in your config file (which will presumably be the default for a while). If we're worried about the performance impact of thread-local variables in particular, we can try to measure that. I don't think it's material though.

If there is some material harm from compiling with multithreading support even if you're not using it, we should try to fix that.

I'm not dead set against having a compile-time option, but I don't see the need for it at the moment.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Proposal: Job Scheduler
Noted. Thanks for suggestions. We will open-source it as an extension.

Regards,
Cheng
Re: Things I don't like about \du's "Attributes" column
On 16.04.2024 09:15, Pavel Luzanov wrote:
> On 16.04.2024 01:06, David G. Johnston wrote:
>> At this point I'm on board with retaining the \dr charter of simply being an easy way to access the detail exposed in pg_roles with some display formatting but without any attempt to convey how the system uses said information. Without changing pg_roles. Our level of effort here, and degree of dependence on superuser, doesn't seem to be bothering people enough to push more radical changes here through and we have good improvements that are being held up in the hope of possible perfection.
>
> I have similar thoughts. I decided to wait for the end of feature freeze and propose a simpler version of the patch for v18, without changes in pg_roles.

Since there were no votes for the changes in pg_roles, please look at the simplified version. We can return to this topic later. But now there are no changes in pg_roles, just a special interpretation of the two values of the "Connection limit" column:
   0 - Not allowed (changed from 'No connections')
  -1 - empty string

The full list of changes is in the commit message.

Example output:

\du+ regress_du*
                                            List of roles
    Role name     | Login |  Attributes  |         Valid until          | Connection limit |   Description
------------------+-------+--------------+------------------------------+------------------+------------------
 regress_du_admin | yes   | Superuser   +|                              |                  | some description
                  |       | Create DB   +|                              |                  |
                  |       | Create role +|                              |                  |
                  |       | Inherit     +|                              |                  |
                  |       | Replication +|                              |                  |
                  |       | Bypass RLS   |                              |                  |
 regress_du_role0 | yes   | Inherit      | Tue Jun 04 00:00:00 2024 PDT | Not allowed      |
 regress_du_role1 | no    | Create role +| infinity                     |                  |
                  |       | Inherit      |                              |                  |
 regress_du_role2 | yes   | Inherit     +|                              | 42               |
                  |       | Replication +|                              |                  |
                  |       | Bypass RLS   |                              |                  |
(4 rows)

Data:
CREATE ROLE regress_du_role0 LOGIN PASSWORD '123' VALID UNTIL '2024-06-04' CONNECTION LIMIT 0;
CREATE ROLE regress_du_role1 CREATEROLE CONNECTION LIMIT -1 VALID UNTIL 'infinity';
CREATE ROLE regress_du_role2 LOGIN REPLICATION BYPASSRLS CONNECTION LIMIT 42;
CREATE ROLE regress_du_admin LOGIN SUPERUSER CREATEROLE CREATEDB BYPASSRLS REPLICATION INHERIT;
COMMENT ON ROLE regress_du_admin IS 'some description';

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com

From 8cd9a815de36a40d450029d327e013cf69827499 Mon Sep 17 00:00:00 2001
From: Pavel Luzanov
Date: Thu, 6 Jun 2024 11:30:23 +0300
Subject: [PATCH v7] psql: Rethinking of \du command

Changes in the \du command:
- "Login", "Connection limit" and "Valid until" attributes are placed in separate columns.
- The "Attributes" column includes only the enabled logical attributes.
- The attribute names correspond to the keywords of the CREATE ROLE command.
- The attributes are listed in the same order as in the documentation.
- A special interpretation of the two values of the "Connection limit" column:
     0 - Not allowed
    -1 - empty string
- General refactoring of describeRoles function in describe.c.
--- src/bin/psql/describe.c| 149 - src/test/regress/expected/psql.out | 40 +--- src/test/regress/sql/psql.sql | 12 ++- 3 files changed, 75 insertions(+), 126 deletions(-) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index f67bf0b892..8967102261 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -36,7 +36,6 @@ static bool describeOneTableDetails(const char *schemaname, bool verbose); static void add_tablespace_footer(printTableContent *const cont, char relkind, Oid tablespace, const bool newline); -static void add_role_attribute(PQExpBuffer buf, const char *const str); static bool listTSParsersVerbose(const char *pattern); static bool describeOneTSParser(const char *oid, const char *nspname, const char *prsname); @@ -3615,34 +3614,50 @@ describeRoles(const char *pattern, bool verbose, bool showSystem) { PQExpBufferData buf; PGresult *res; - printTableContent cont; - printTableOpt myopt = pset.popt.topt; - int ncols = 2; - int nrows = 0; - int i; - int conns; - const char align = 'l'; - char **attr; - - myopt.default_footer = false; + printQueryOpt myopt = pset.popt; initPQExpBuffer(&buf); - printfPQEx
Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was removed from the vacuum. However, do_analyze_rel() was still using these counters, necessitating the tracking of global counters alongside BufferUsage counters. The attached patch addresses the issue by eliminating the need to track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making the global counters obsolete. This simplifies the code and improves consistency. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v1-0001-Remove-duplicate-tracking-of-the-page-stats-durin.patch Description: Binary data
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, 6 Jun 2024 at 03:03, Robert Haas wrote:
> This makes some sense to me. I don't think that I believe
> max_protocol_version=latest is future-proof: just because I want to
> opt into this round of new features doesn't mean I feel the same way
> about the next round. But it may still be a good idea.

I think for most people the only reason not to opt in to improvements (even if they are small) is if those improvements break something else. Once the NegotiateProtocolVersion message is implemented everywhere in the ecosystem, nothing should break when going from e.g. 3.2 to 3.4. So for the majority of people I think max_protocol_version=latest is what they'll want to use once the ecosystem has caught up. Of course there will be people that want tight control, but they can set max_protocol_version=3.2 instead.

> I suppose the semantics are that we try to connect with the version
> specified by max_protocol_version and, if we get downgraded by the
> server, we abort if we end up below min_protocol_version.

Correct

> I like those
> semantics, and I think I also like having both parameters, but I'm not
> 100% sure of the naming. It's a funny use of "max" and "min", because
> the max is really what we're trying to do and the min is what we end
> up with, and those terms don't necessarily bring those ideas to mind.
> I don't have a better idea off-hand, though.

I borrowed this terminology from the ssl_min_protocol_version and ssl_max_protocol_version connection options that we already have. Those basically have the same semantics as what I'm proposing here, but for the TLS protocol version instead of the Postgres protocol version. I'm also not a huge fan of the min_protocol_version and max_protocol_version names, but staying consistent with existing options seems quite nice.

Looking at ssl_max_protocol_version more closely though, to stay fully consistent I'd have to rename "latest" to the empty string (i.e. there is no max_protocol_version). I think I might prefer staying consistent over introducing an imho slightly clearer name. Another way to stay consistent would of course be to also add "latest" as an option to ssl_max_protocol_version. What do you think?

I'll look into adding min_protocol_version to the patchset soonish. Some review of the existing code in the first few patches would definitely be appreciated.
Incorrect matching of sql/json PASSING variable names
Hi,

Alvaro reported off-list that the following should really fail, because the jsonpath expression refers to a PASSING variable that doesn't exist:

select json_query('"1"', jsonpath '$xy' passing 2 AS xyz);
 json_query
------------
 2
(1 row)

This works because of a bug in GetJsonPathVar() whereby it allows a jsonpath expression to reference any prefix of the PASSING variable names.

Attached is a patch to fix that. Thanks Alvaro for the report.

--
Thanks, Amit Langote

v1-0001-in-transformJsonBehavior-better-handle-default-ex.patch
Description: Binary data
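[To make the intended behavior concrete, here is a hedged sketch of the corrected matching, based only on the description above; the exact error text is an assumption, not quoted from the patch.]

```sql
-- Matching the PASSING name exactly should keep working:
select json_query('"1"', jsonpath '$xyz' passing 2 AS xyz);
-- returns 2

-- With the fix, referencing a mere prefix of the PASSING name ($xy vs. xyz)
-- should presumably raise an error instead of silently matching; the exact
-- message shown here is an assumption.
select json_query('"1"', jsonpath '$xy' passing 2 AS xyz);
-- ERROR:  could not find jsonpath variable "xy"
```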
Re: Proposal: Job Scheduler
On 6/6/2024 16:04, Wang Cheng wrote:
> Noted. Thanks for suggestions. We will open-source it as an extension.

It would be nice! For me it doesn't matter whether it is contributed to PostgreSQL core or as an extension, as long as it is published under the BSD license.

--
regards, Andrei Lepikhov
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada wrote: > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila wrote: > > > > > To achieve this, we can allow sequences to be copied during > > the initial CREATE SUBSCRIPTION command similar to what we do for > > tables. And then later by new/existing command, we re-copy the already > > existing sequences on the subscriber. > > > > The options for the new command could be: > > Alter Subscription ... Refresh Sequences > > Alter Subscription ... Replicate Sequences > > > > In the second option, we need to introduce a new keyword Replicate. > > Can you think of any better option? > > Another idea is doing that using options. For example, > > For initial sequences synchronization: > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true); > How will it interact with the existing copy_data option? So copy_data will become equivalent to copy_table_data, right? > For re-copy (or update) sequences: > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true); > Similar to the previous point it can be slightly confusing w.r.t copy_data. And would copy_sequence here mean that it would copy sequence values of both pre-existing and newly added sequences, if so, that would make it behave differently than copy_data? The other possibility in this direction would be to introduce an option like replicate_all_sequences/copy_all_sequences which indicates a copy of both pre-existing and new sequences, if any. If we want to go in the direction of having an option such as copy_(all)_sequences then do you think specifying that copy_data is just for tables in the docs would be sufficient? I am afraid that it would be confusing for users. > > > > In addition to the above, the command Alter Subscription .. Refresh > > Publication will fetch any missing sequences similar to what it does > > for tables. > > On the subscriber side, do we need to track which sequences are > created via CREATE/ALTER SUBSCRIPTION? > I think so unless we find some other way to know at refresh publication time which all new sequences need to be part of the subscription. What should be the behavior w.r.t sequences when the user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? I was thinking similar to tables, it should fetch any missing sequence information from the publisher. -- With Regards, Amit Kapila.
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
Hi, I sent a similar patch for this in https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com Regards, Anthonin On Thu, Jun 6, 2024 at 11:10 AM Dilip Kumar wrote: > As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the > dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was > removed from the vacuum. However, do_analyze_rel() was still using > these counters, necessitating the tracking of global counters > alongside BufferUsage counters. > > The attached patch addresses the issue by eliminating the need to > track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making > the global counters obsolete. This simplifies the code and improves > consistency. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello hackers, I tried to investigate a recent buildfarm test failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-06-04%2003%3A27%3A47 29/295 postgresql:recovery / recovery/026_overwrite_contrecord ERROR 39.55s exit status 32 log/026_overwrite_contrecord_standby.log TRAP: failed Assert("ItemIdIsNormal(lpp)"), File: "../pgsql/src/backend/access/heap/heapam.c", Line: 1002, PID: 3740958 postgres: standby: bf postgres [local] startup(ExceptionalCondition+0x81)[0x56c60bf9] postgres: standby: bf postgres [local] startup(+0xf776e)[0x5667276e] postgres: standby: bf postgres [local] startup(heap_getnextslot+0x40)[0x56672ee1] postgres: standby: bf postgres [local] startup(+0x11c218)[0x56697218] postgres: standby: bf postgres [local] startup(systable_getnext+0xfa)[0x56697c1a] postgres: standby: bf postgres [local] startup(+0x6d29c7)[0x56c4d9c7] postgres: standby: bf postgres [local] startup(+0x6d372c)[0x56c4e72c] postgres: standby: bf postgres [local] startup(+0x6d8288)[0x56c53288] postgres: standby: bf postgres [local] startup(RelationCacheInitializePhase3+0x149)[0x56c52d71] (It's not the only failure of that ilk in the buildfarm.) and managed to reproduce the failure by running many 026_overwrite_contrecord tests in parallel (with fsync=on). Analyzing the core dump added some info: ... #3 0x00bb43cc in ExceptionalCondition (conditionName=0xc45c77 "ItemIdIsNormal(lpp)", fileName=0xc45aa8 "heapam.c", lineNumber=1002) at assert.c:66 #4 0x004f7f13 in heapgettup_pagemode (scan=0x19f5660, dir=ForwardScanDirection, nkeys=2, key=0x19f61d0) at heapam.c:1002 #5 0x004f86d1 in heap_getnextslot (sscan=0x19f5660, direction=ForwardScanDirection, slot=0x19f5da0) at heapam.c:1307 #6 0x0051d028 in table_scan_getnextslot (sscan=0x19f5660, direction=ForwardScanDirection, slot=0x19f5da0) at ../../../../src/include/access/tableam.h:1081 #7 0x0051da80 in systable_getnext (sysscan=0x19f5470) at genam.c:530 #8 0x00ba0937 in RelationBuildTupleDesc (relation=0x7fa004feea88) at relcache.c:572 #9 0x00ba17b9 in RelationBuildDesc (targetRelId=2679, insertIt=true) at relcache.c:1184 #10 0x00ba6520 in load_critical_index (indexoid=2679, heapoid=2610) at relcache.c:4353 #11 0x00ba607d in RelationCacheInitializePhase3 () at relcache.c:4132 #12 0x00bcb704 in InitPostgres (in_dbname=0x196ca30 "postgres", dboid=5, username=0x19a91b8 "law", useroid=0, flags=1, out_dbname=0x0) at postinit.c:1193 ... (gdb) frame 4 (gdb) p lpp->lp_flags $2 = 1 (gdb) p ItemIdIsNormal(lpp) $12 = 1 So it looks like the Assert had failed when lpp->lp_flags had some other contents... I added the following debugging code: --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -995,10 +995,14 @@ continue_page: for (; linesleft > 0; linesleft--, lineindex += dir) { ItemId lpp; + ItemIdData iid; OffsetNumber lineoff; lineoff = scan->rs_vistuples[lineindex]; lpp = PageGetItemId(page, lineoff); + iid = *((ItemIdData *)lpp); + + Assert(ItemIdIsNormal(&iid)); Assert(ItemIdIsNormal(lpp)); and got: ... #2 0x55b68dc6998c in ExceptionalCondition (conditionName=0x55b68dcfe5f7 "ItemIdIsNormal(&iid)", fileName=0x55b68dcfe428 "heapam.c", lineNumber=1010) at assert.c:66 #3 0x55b68d588a78 in heapgettup_pagemode (scan=0x55b68f0905e0, dir=ForwardScanDirection, nkeys=2, key=0x55b68f091150) at heapam.c:1010 #4 0x55b68d58930e in heap_getnextslot (sscan=0x55b68f0905e0, direction=ForwardScanDirection, slot=0x55b68f090d20) at heapam.c:1322 ... (gdb) frame 3 #3 0x55b68d588a78 in heapgettup_pagemode (...) 
at heapam.c:1010 1010 Assert(ItemIdIsNormal(&iid)); (gdb) info locals lpp = 0x7f615c34b0ec iid = {lp_off = 0, lp_flags = 0, lp_len = 0} lineoff = 54 tuple = 0x55b68f090638 page = 0x7f615c34b000 "" (gdb) p *lpp $1 = {lp_off = 3160, lp_flags = 1, lp_len = 136} It seemingly confirms that the underlying memory was changed while being processed in heapgettup_pagemode(). I've tried to add checks for the page buffer content as below: --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -953,11 +953,15 @@ heapgettup_pagemode(HeapScanDesc scan, Page page; int lineindex; int linesleft; +char page_copy[BLCKSZ]; if (likely(scan->rs_inited)) { /* continue from previously returned page/tuple */ page = BufferGetPage(scan->rs_cbuf); +memcpy(page_copy, page, BLCKSZ); +for (int i = 0; i < 100; i++) +Assert(memcmp(page_copy, page, BLCKSZ) == 0); lineindex = scan->rs_cindex + dir; if (ScanDirectionIsForward(dir)) @@ -986,6
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > method? It seems after the initial update on the local node, it will > > never allow remote update to succeed which sounds a bit odd. Jan has > > shared this and similar concerns about this resolution method, so I > > have added him to the email as well. > > > I can not think of a use case exactly in this context but it's very > common to have such a use case while designing a distributed > application with multiple clients. For example, when we are doing git > push concurrently from multiple clients it is expected that the > earliest commit wins. > Okay, I think it mostly boils down to something like what Shveta mentioned where Inserts for a primary key can use "earliest_timestamp_wins" resolution method [1]. So, it seems useful to support this method as well. [1] - https://www.postgresql.org/message-id/CAJpy0uC4riK8e6hQt8jcU%2BnXYmRRjnbFEapYNbmxVYjENxTw2g%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 9:22 AM Amit Kapila wrote: > On Wed, Jun 5, 2024 at 6:01 PM Ashutosh Bapat > wrote: > > > > On Wed, Jun 5, 2024 at 8:45 AM Amit Kapila > wrote: > >> > >> How about periodically sending this information? > >> > > >> > >> Now, if we want to support some sort of failover then probably this > >> will help. Do you have that use case in mind? > > > > > > Regular failover was a goal for supporting logical replication of > sequences. That might be more common than major upgrade scenario. > > > > We can't support regular failovers to subscribers unless we can > replicate/copy slots because the existing nodes connected to the > current publisher/primary would expect that. It should be primarily > useful for major version upgrades at this stage. > We don't want to design it in a way that requires major rework when we are able to copy slots and then support regular failovers. That's when the consistency between a sequence and the table using it would be a must. So it's better that we take that into consideration now. -- Best Wishes, Ashutosh Bapat
Re: Conflict Detection and Resolution
On Thu, Jun 6, 2024 at 3:43 PM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > > method? It seems after the initial update on the local node, it will > > > never allow remote update to succeed which sounds a bit odd. Jan has > > > shared this and similar concerns about this resolution method, so I > > > have added him to the email as well. > > > > > I can not think of a use case exactly in this context but it's very > > common to have such a use case while designing a distributed > > application with multiple clients. For example, when we are doing git > > push concurrently from multiple clients it is expected that the > > earliest commit wins. > > > > Okay, I think it mostly boils down to something like what Shveta > mentioned where Inserts for a primary key can use > "earliest_timestamp_wins" resolution method [1]. So, it seems useful > to support this method as well. Correct, but we still need to think about how to make it work correctly in the presence of a clock skew as I mentioned in one of my previous emails. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
On Thu, Jun 6, 2024 at 3:23 PM Anthonin Bonnefoy wrote: > > Hi, > > I sent a similar patch for this in > https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com Okay, I see, In that case, we can just discard mine, thanks for notifying me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Injection points: preloading and runtime arguments
> On 5 Jun 2024, at 03:52, Michael Paquier wrote:
>
> Another thing you could do is to define a INJECTION_POINT_LOAD() in the code path you're stressing outside the critical section where the point is run. This should save from a call to the SQL function. This choice is up to the one implementing the test, both can be useful depending on what one is trying to achieve.

Thanks! Interestingly, previously having INJECTION_POINT_PRELOAD() was not enough. But now either INJECTION_POINT_LOAD() or injection_points_load() does the trick, so for me any of them is enough.

My test works with the current version, but I have one slight problem: I need to call

$node->safe_psql('postgres', q(select injection_points_detach('GetMultiXactIdMembers-CV-sleep')));

before

$node->safe_psql('postgres', q(select injection_points_wakeup('GetMultiXactIdMembers-CV-sleep')));

Is it OK to detach() before wakeup()? Or, perhaps, can a detach() do a wakeup() automatically?

Best regards, Andrey Borodin.
Re: Proposal: Job Scheduler
On 2024-Jun-06, Dave Page wrote: > It's this kind of choice that means it's unlikely we'd include any one > option in PostgreSQL, much like various other tools such as failover > managers or poolers. TBH I see that more as a bug than as a feature, and I see the fact that there are so many schedulers as a process failure. If we could have _one_ scheduler in core that encompassed all the important features of all the independent ones we have, with hooks or whatever to allow the user to add any fringe features they need, that would probably lead to less duplicative code and divergent UIs, and would be better for users overall. That's, of course, just my personal opinion. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Compress ReorderBuffer spill files using LZ4
Hi,

When the content of a large transaction (size exceeding logical_decoding_work_mem) and its sub-transactions has to be reordered during logical decoding, all the changes are written to disk in temporary files located in pg_replslot/. Decoding very large transactions by multiple replication slots can lead to disk space saturation and high I/O utilization.

When compiled with LZ4 support (--with-lz4), this patch enables data compression/decompression of these temporary files. Each transaction change that must be written on disk (ReorderBufferDiskChange) is now compressed and encapsulated in a new structure.

Three different compression strategies are implemented:
1. LZ4 streaming compression is the preferred one and works efficiently for small individual changes.
2. LZ4 regular compression when the changes are too large for using the streaming API.
3. No compression when compression fails; the change is then stored uncompressed.

When not using compression, the following case generates 1590MB of spill files:

CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT);
INSERT INTO t SELECT i, 'Hello number n°'||i::TEXT FROM generate_series(1, 1000) as i;

With LZ4 compression, it creates 653MB of spill files: 58.9% less disk space usage.

Open items:
1. The spill_bytes column from pg_stat_get_replication_slot() still returns the plain data size, not the compressed data size. Should we expose the compressed data size when compression occurs?
2. Do we want a GUC to switch compression on/off?

Regards,

JT

v1-0001-Compress-ReorderBuffer-spill-files-using-LZ4.patch
Description: Binary data
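[For anyone wanting to reproduce the measurement, the cumulative spill activity of a slot can be watched through the statistics view that exposes the spill_bytes counter mentioned above; a minimal sketch, assuming a logical slot named 'test_slot' already exists.]

```sql
-- Observe how much reordered change data has been spilled to disk per slot.
-- Per open item 1 above, spill_bytes currently reports the plain (uncompressed) size.
SELECT slot_name, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication_slots
WHERE slot_name = 'test_slot';
```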
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 3:44 PM Ashutosh Bapat wrote: > > On Thu, Jun 6, 2024 at 9:22 AM Amit Kapila wrote: >> >> On Wed, Jun 5, 2024 at 6:01 PM Ashutosh Bapat >> wrote: >> > >> > On Wed, Jun 5, 2024 at 8:45 AM Amit Kapila wrote: >> >> >> >> How about periodically sending this information? >> >> > >> >> >> >> Now, if we want to support some sort of failover then probably this >> >> will help. Do you have that use case in mind? >> > >> > >> > Regular failover was a goal for supporting logical replication of >> > sequences. That might be more common than major upgrade scenario. >> > >> >> We can't support regular failovers to subscribers unless we can >> replicate/copy slots because the existing nodes connected to the >> current publisher/primary would expect that. It should be primarily >> useful for major version upgrades at this stage. > > > We don't want to design it in a way that requires major rework when we are > able to copy slots and then support regular failover. > I don't think we can just copy slots like we do for standbys. The slots would require WAL locations to continue, so not sure if we can make it work for failover for subscribers. > That's when the consistency between a sequence and the table using it would be a must. So it's better that we take that into consideration now. > With the ideas being discussed here, I could only see the use case of a major version upgrade or planned switchover to work. If we come up with any other agreeable way that is better than this then we can consider the same. -- With Regards, Amit Kapila.
Re: Function and Procedure with same signature?
Hi Peter and Tom Following up on our conversation art pgcon.dev If I understood correctly Peter has some old patch for splitting the namespaces which could be resurrected to try to move forward on this ? Can you share what you did there ? Also, while at it we should extend the function lookup to support full "method call syntax" in general, up from one-argument case so that SELECT function(a_thing, arg2, arg 2, ...) could also be called as SELECT a_thing.function(arg2, arg 2, ...) -- Hannu On Mon, Mar 11, 2024 at 12:55 PM Hannu Krosing wrote: > > On Thu, Mar 7, 2024 at 5:46 PM Tom Lane wrote: > > > > Hannu Krosing writes: > > > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote: > > >> Worth noting perhaps that this is actually required by the SQL > > >> standard: per spec, functions and procedures are both "routines" > > >> and share the same namespace, > > > > > Can you point me to a place in the standard where it requires all > > > kinds of ROUTINES to be using the same namespace ? > > > > [ digs around a bit... ] Well, the spec is vaguer about this than > > I thought. It is clear on one thing: 11.60 > > conformance rules include > ... > > Thanks for thorough analysis of the standard. > > I went and looked at more what other relevant database do in this > space based on their documentation > > Tl;DR > > * MS SQL Server >- no overloading allowed anywhere > * MySQL >- no overloading > * Oracle >- no overloading at top level >- overloading and independent namespaces for functions and procedures > * Teradata >- function overloading allowed >- not clear from documentation if this also applies procedures >- function overloading docs does not mention possible clashes with > procedure names anywhere > * DB2 >- function overloading fully supported >- procedure overloading supported, but only for distinct NUMBER OF > ARGUMENTS > > I'll try to get access to a Teradata instance to verify the above > > So at high level most other Serious Databases > - do support function overloading > - keep functions and procedures in separated namespaces > > I still think that PostgreSQL having functions and procedures share > the same namespace is an unneeded and unjustified restriction > > > I plan to do some hands-on testing on Teradata and DB2 to understand it > > But my current thinking is that we should not be more restrictive than > others unless there is a strong technical reason for it. And currently > I do not see any. > > > It could be argued that this doesn't prohibit having both a function > > and a procedure with the same data type list, only that you can't > > write ROUTINE when trying to drop or alter one. But I think that's > > just motivated reasoning. The paragraphs for being > > FUNCTION or PROCEDURE are exactly like the above except they say > > "exactly one function" or "exactly one procedure". If you argue that > > this text means we must allow functions and procedures with the same > > parameter lists, then you are also arguing that we must allow multiple > > functions with the same parameter lists, and it's just the user's > > tough luck if they need to drop one of them. > > The main issue is not dropping them, but inability to determine which > one to call. 
> > We already have this in case of two overloaded functions with same > initial argument types and the rest having defaults - when > > --- > hannuk=# create function get(i int, out o int) begin atomic select i; end; > CREATE FUNCTION > hannuk=# create function get(i int, j int default 0, out o int) begin > atomic select i+j; end; > CREATE FUNCTION > hannuk=# select get(1); > ERROR: function get(integer) is not unique > LINE 1: select get(1); >^ > HINT: Could not choose a best candidate function. You might need to > add explicit type casts. > --- > > > A related point is that our tolerance for overloading routine > > names isn't unlimited: we don't allow duplicate names with the > > same list of input parameters, even if the output parameters are > > different. > > This again has a good reason, as there would be many cases where you > could not decide which one to call > > Not allowing overloading based on only return types is common across > all OO languages. > > My point is that this does not apply to FUNCTION vs. PROCEDURE as it > is very clear from the CALL syntax which one is meant. > > > This is not something that the SQL spec cares to > > address, but there are good ambiguity-avoidance reasons for it. > > I think limiting overloading so that a ROUTINE specification is > > unambiguous is also a good thing. > > I think ROUTINE being unambiguous is not e very good goal. > > What if next version of standard introduces DROP DATABASE OBJECT ? > > > I remain unexcited about changing our definition of this. > > "Oracle allows it" is not something that has any weight in > > my view: they have made a bunch of bad design decisions > > as well as good ones, and I think this is a
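[For readers following along, the shared-namespace restriction under discussion is easy to demonstrate; a minimal sketch of the current behavior, with illustrative object names and approximate error wording.]

```sql
CREATE FUNCTION get_thing(id int) RETURNS text
    LANGUAGE sql AS $$ SELECT 'thing ' || id $$;

-- Today this fails, because functions and procedures share pg_proc's
-- name/argument-type namespace; the exact error wording may vary by version.
CREATE PROCEDURE get_thing(id int)
    LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'id = %', id; END $$;
-- ERROR:  function "get_thing" already exists with same argument types
```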
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > When the content of a large transaction (size exceeding > logical_decoding_work_mem) and its sub-transactions has to be > reordered during logical decoding, then, all the changes are written > on disk in temporary files located in pg_replslot/. > Decoding very large transactions by multiple replication slots can > lead to disk space saturation and high I/O utilization. > Why can't one use 'streaming' option to send changes to the client once it reaches the configured limit of 'logical_decoding_work_mem'? > > 2. Do we want a GUC to switch compression on/off? > It depends on the overhead of decoding. Did you try to measure the decoding overhead of decompression when reading compressed files? -- With Regards, Amit Kapila.
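[For context on the first question, the existing way to avoid spilling large in-progress transactions to disk is the subscription-side streaming option; a minimal sketch, where the publication name and connection string are placeholders.]

```sql
-- Stream in-progress transactions to the subscriber once they exceed
-- logical_decoding_work_mem, instead of spilling them under pg_replslot/.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = on);

-- An existing subscription can be switched over as well:
ALTER SUBSCRIPTION mysub SET (streaming = on);
```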
Re: ssl tests fail due to TCP port conflict
On 2024-06-05 We 16:00, Alexander Lakhin wrote: Hello Andrew, 05.06.2024 21:10, Andrew Dunstan wrote: I think I see what's going on here. It looks like it's because we start the server in unix socket mode, and then switch to using TCP as well. Can you try your test with this patch applied and see if the problem persists? If we start in TCP mode the framework should test for a port clash. It seems that the failure rate decreased (I guess the patch rules out the case with two servers choosing the same port), but I still got: 16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 subtests passed 17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit status 255 or signal 127 SIGinvalid) 2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit 2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 address "127.0.0.1": Address already in use 2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster already running on port 50072? If not, wait a few seconds and retry. `grep '\b50072\b' -r testrun/` yields: testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 19:40:37.392 UTC [414111] [unknown] LOG: connection received: host=localhost port=50072 (a psql case) That is, psql from the test instance 001_ssltests_34 opened a connection to the test server with the client port 50072 and it made using the port by the server from the test instance 001_ssltests_30 impossible. After sleeping on it, I still think the patch would be a good thing. Your torture test might still show some failures, but the buildfarm isn't running those, and it might be enough to eliminate or at least substantially reduce buildfarm failures by reducing to almost zero the time in which a competing script might grab the port. The biggest problem with the current script is apparently that we delay using the TCP port by starting the server in Unix socket mode, and only switch to using TCP when we restart. If changing that doesn't fix the problem we'll have to rethink. If this isn't the cause, though, I would expect to have seen similar failures from other test suites. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: ResourceOwner refactoring
On 05/06/2024 16:58, Heikki Linnakangas wrote: On 04/06/2024 01:49, Heikki Linnakangas wrote: A straightforward fix is to modify RelationFlushRelation() so that if !IsTransactionState(), it just marks the entry as invalid instead of calling RelationClearRelation(). That's what RelationClearRelation() would do anyway, if it didn't hit the assertion first. Here's a patch with that straightforward fix. Your test case hit the "rd_createSubid != InvalidSubTransactionId" case, I expanded it to also cover the "rd_firstRelfilelocatorSubid != InvalidSubTransactionId" case. For the record, I got the above backwards: your test case covered the rd_firstRelfilelocatorSubid case and I expanded it to also cover the rd_createSubid case. Barring objections, I'll commit this later today or tomorrow. Thanks for the report! Committed. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote:
>
> On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote:
> >
> > Can you share the use case of "earliest_timestamp_wins" resolution
> > method? It seems after the initial update on the local node, it will
> > never allow remote update to succeed which sounds a bit odd. Jan has
> > shared this and similar concerns about this resolution method, so I
> > have added him to the email as well.
> >
> I can not think of a use case exactly in this context but it's very
> common to have such a use case while designing a distributed
> application with multiple clients. For example, when we are doing git
> push concurrently from multiple clients it is expected that the
> earliest commit wins.
>

Here are more use cases of the "earliest_timestamp_wins" resolution method:

1) Applications where the record of the first occurrence of an event is important. For example, in sensor-based applications like earthquake detection systems, capturing the time of the first seismic wave is crucial.

2) Scheduling systems, like appointment booking, that prioritize the earliest request when handling concurrent ones.

3) Contexts where maintaining chronological order is important:
 a) Social media platforms display comments ensuring that the earliest ones are visible first.
 b) Financial transaction processing systems rely on timestamps to prioritize processing, ensuring that the earliest transaction is handled first.

--
Thanks,
Nisha
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 4:43 PM Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? I think it depends on the trade-off between the I/O savings from reducing the data size and the performance cost of compressing and decompressing the data. This balance is highly dependent on the hardware. For example, if you have a very slow disk and a powerful processor, compression could be advantageous. Conversely, if the disk is very fast, the I/O savings might be minimal, and the compression overhead could outweigh the benefits. Additionally, the effectiveness of compression also depends on the compression ratio, which varies with the type of data being compressed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Postgresql OOM
Hello all, I have a query that forces an out of memory error, where the OS will kill the postgresql process. The query plan (run immediately after a vacuum analyze) is at https://explain.depesz.com/s/ITQI#html . PostgreSQL version 16.3, running on RHEL 8.9, 16 vCPU, 64 GB RAM, 32 GB swap shared_buffers=8G effective_cache_size=24G maintenance_work_mem=2G work_mem=104857kB default_statistics_target = 100 max_worker_processes = 16 max_parallel_workers_per_gather = 4 max_parallel_workers = 16 max_parallel_maintenance_workers = 4 jit=off It looks like the excessive memory allocation is reported in HashSpillContext. I've attached the dump of the memory context for the 5 processes (query + 4 parallel workers) some time after query start. I also see a huge number of temporary files being created. For the time being I've set enable_parallel_hash = 'off' and the problem went away. I've seen a potentially similar problem reported in https://www.postgresql.org/message-id/flat/20230516200052.sbg6z4ghcmsas3wv%40liskov#f6059259c7c9251fb8c17f5793a2d427 . Any idea on how to identify the problem? I can reproduce it on demand. Should I report it pgsql-bugs? Best regards, Radu 2024-06-05 15:58:53.806 +08 [338795] LOG: logging memory contexts of PID 338795 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 0; TopMemoryContext: 2269696 total in 8 blocks; 33432 free (11 chunks); 2236264 used 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 1; HandleParallelMessages: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 1; TableSpace cache: 8192 total in 1 blocks; 2128 free (0 chunks); 6064 used 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 1; TopTransactionContext: 8192 total in 1 blocks; 5736 free (0 chunks); 2456 used 2024-06-05 15:58:53.808 +08 [338795] LOG: level: 1; Prepared Queries: 16384 total in 2 blocks; 6696 free (3 chunks); 9688 used 2024-06-05 15:58:53.808 +08 [338795] LOG: level: 1; Type information cache: 24368 total in 2 blocks; 2648 free (0 chunks); 21720 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; Operator lookup cache: 24576 total in 2 blocks; 10792 free (3 chunks); 13784 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; MessageContext: 8192 total in 1 blocks; 2256 free (0 chunks); 5936 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; Operator class cache: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used 2024-06-05 15:58:53.810 +08 [338795] LOG: level: 1; smgr relation table: 32768 total in 3 blocks; 16880 free (8 chunks); 15888 used 2024-06-05 15:58:53.810 +08 [338795] LOG: level: 1; PgStat Shared Ref Hash: 7216 total in 2 blocks; 688 free (0 chunks); 6528 used 2024-06-05 15:58:53.810 +08 [338795] LOG: level: 1; PgStat Shared Ref: 8192 total in 4 blocks; 3704 free (4 chunks); 4488 used 2024-06-05 15:58:53.811 +08 [338795] LOG: level: 1; PgStat Pending: 16384 total in 5 blocks; 3288 free (10 chunks); 13096 used 2024-06-05 15:58:53.811 +08 [338795] LOG: level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used 2024-06-05 15:58:53.811 +08 [338795] LOG: level: 1; Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used 2024-06-05 15:58:53.812 +08 [338795] LOG: level: 1; TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used 2024-06-05 15:58:53.812 +08 [338795] LOG: level: 2; PortalHoldContext: 24624 total in 2 blocks; 7400 
free (0 chunks); 17224 used 2024-06-05 15:58:53.823 +08 [338795] LOG: level: 2; PortalContext: 527776 total in 83 blocks; 3760 free (6 chunks); 524016 used: 2024-06-05 15:58:53.824 +08 [338795] LOG: level: 3; ExecutorState: 58826880 total in 19 blocks; 3932992 free (11 chunks); 54893888 used 2024-06-05 15:58:53.824 +08 [338795] LOG: level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used 2024-06-05 15:58:53.825 +08 [338795] LOG: level: 5; HashSpillContext: 8192 total in 1 blocks; 4008 free (9 chunks); 4184 used 2024-06-05 15:58:53.825 +08 [338795] LOG: level: 5; HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used 2024-06-05 15:58:53.825 +08 [338795] LOG: level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used 2024-06-05 15:58:53.840 +08 [338795] LOG: level: 5; HashSpillContext: 339738672 total in 47 blocks; 3740880 free (163341 chunks); 335997792 used 2024-06-05 15:58:53.840 +08 [338795] LOG: level: 5; HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used 2024-06-05 15:58:53.841 +08 [338795] LOG: level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used 2024-06-05 15:58:53.841 +08 [338795] LOG: level: 5; HashSpillContext: 8192 total in 1 blocks; 4720 free (0 chunks); 3472 used 2024-06-05 15:58:53.841 +08 [338795] LOG: level: 5; HashBatch
How about using dirty snapshots to locate dependent objects?
Hello everyone,

At present, we use MVCC snapshots to identify dependent objects. This implies that if a new dependent object is inserted within a transaction that is still ongoing, our search for dependent objects won't include this recently added one. Consequently, if someone attempts to drop the referenced object, it will be dropped, and when the ongoing transaction completes, we will end up having an entry for a referenced object that has already been dropped. This situation can lead to an inconsistent state. Below is an example illustrating this scenario:

Session 1:
- create table t1(a int);
- insert into t1 select i from generate_series(1, 1000) i;
- create extension btree_gist;
- create index i1 on t1 using gist( a );

Session 2: (While the index creation in session 1 is in progress, drop the btree_gist extension)
- drop extension btree_gist;

Above command succeeds and so does the create index command running in session 1, post this, if we try running anything on table t1, i1, it fails with an error: "cache lookup failed for opclass ..."

Attached is the patch that I have tried, which seems to be working for me. It's not polished and thoroughly tested, but just sharing here to clarify what I am trying to suggest. Please have a look and let me know your thoughts.

-- With Regards, Ashutosh Sharma.

use-dirty-snapshots-to-find-dependent-objects.patch Description: Binary data
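For anyone reproducing the example, the resulting inconsistency is visible from SQL even before touching t1 or i1. A quick check — a scenario-specific sketch, not a general orphaned-dependency detector — that lists indexes whose recorded operator classes no longer have a pg_opclass row:

-- index key opclasses that no longer exist in pg_opclass
SELECT i.indexrelid::regclass AS index_name, opc AS missing_opclass_oid
FROM pg_index AS i,
     unnest(string_to_array(i.indclass::text, ' ')::oid[]) AS opc
WHERE NOT EXISTS (SELECT 1 FROM pg_opclass o WHERE o.oid = opc);

After the sequence above it should report i1, which is what later surfaces as the "cache lookup failed for opclass" error.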
Re: Proposal: Job Scheduler
> On Thu, Jun 06, 2024 at 12:53:38PM GMT, Alvaro Herrera wrote: > On 2024-Jun-06, Dave Page wrote: > > > It's this kind of choice that means it's unlikely we'd include any one > > option in PostgreSQL, much like various other tools such as failover > > managers or poolers. > > TBH I see that more as a bug than as a feature, and I see the fact that > there are so many schedulers as a process failure. If we could have > _one_ scheduler in core that encompassed all the important features of > all the independent ones we have, with hooks or whatever to allow the > user to add any fringe features they need, that would probably lead to > less duplicative code and divergent UIs, and would be better for users > overall. > > That's, of course, just my personal opinion. +1. The PostgreSQL ecosystem is surprisingly fragmented, when it comes to quite essential components that happen to be outside of the core. But of course it doesn't mean that there should be _one_ component of every kind in core, more like it makes sense to have _one_ component available out of the box (where the box is whatever form of PostgreSQL that gets delivered to users, e.g. a distro package, container, etc.).
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 9:34 AM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 3:17 PM Bharath Rupireddy > wrote: > > > > On Tue, Jun 4, 2024 at 5:40 PM Amit Kapila wrote: > > > > > > Even if we decode it periodically (say each time we decode the > > > checkpoint record) then also we need to send the entire set of > > > sequences at shutdown. This is because the sequences may have changed > > > from the last time we sent them. > > > > Agree. How about decoding and sending only the sequences that are > > changed from the last time when they were sent? I know it requires a > > bit of tracking and more work, but all I'm looking for is to reduce > > the amount of work that walsenders need to do during the shutdown. > > > > I see your point but going towards tracking the changed sequences > sounds like moving towards what we do for incremental backups unless > we can invent some other smart way. Yes, we would need an entirely new infrastructure to track the sequence change since the last sync. We can only determine this from WAL, and relying on it would somehow bring us back to the approach we were trying to achieve with logical decoding of sequences patch. > > Having said that, I like the idea of letting the user sync the > > sequences via ALTER SUBSCRIPTION command and not weave the logic into > > the shutdown checkpoint path. As Peter Eisentraut said here > > https://www.postgresql.org/message-id/42e5cb35-4aeb-4f58-8091-90619c7c3ecc%40eisentraut.org, > > this can be a good starting point to get going. > > > > Agreed. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
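To make that concrete, the user-triggered sync being discussed could end up looking something like the following. This is purely hypothetical syntax to illustrate the idea; nothing like it exists yet, and "mysub" is just a placeholder subscription name:

-- hypothetical: resynchronize sequence values for a subscription on demand
ALTER SUBSCRIPTION mysub REFRESH PUBLICATION SEQUENCES;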
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: > > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This implies > that if a new dependent object is inserted within a transaction that is still > ongoing, our search for dependent objects won't include this recently added > one. Consequently, if someone attempts to drop the referenced object, it will > be dropped, and when the ongoing transaction completes, we will end up having > an entry for a referenced object that has already been dropped. This > situation can lead to an inconsistent state. Below is an example illustrating > this scenario: I don't think it's correct to allow the index to be dropped while a transaction is creating it. Instead, the right solution should be for the create index operation to protect the object it is using from being dropped. Specifically, the create index operation should acquire a shared lock on the Access Method (AM) to ensure it doesn't get dropped concurrently while the transaction is still in progress. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
Le jeu. 6 juin 2024 à 04:13, Amit Kapila a écrit : > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? That's right, setting subscription's option 'streaming' to 'on' moves the problem away from the publisher to the subscribers. This patch tries to improve the default situation when 'streaming' is set to 'off'. > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? Quick benchmarking executed on my laptop shows 1% overhead. Table DDL: CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT); Data generated with: INSERT INTO t SELECT i, 'Text number n°'||i::TEXT FROM generate_series(1, 1000) as i; Restoration duration measured using timestamps of log messages: "DEBUG: restored / changes from disk" HEAD: 25.54s, 25.94s, 25.516s, 26.267s, 26.11s / avg=25.874s Patch: 26.872s, 26.311s, 25.753s, 26.003, 25.843s / avg=26.156s Regards, JT
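For reference, the subscriber-side knob mentioned above is a per-subscription option; a minimal example of flipping it (the subscription name is a placeholder):

ALTER SUBSCRIPTION mysub SET (streaming = 'on');        -- stream large in-progress transactions instead of spilling on the publisher
ALTER SUBSCRIPTION mysub SET (streaming = 'parallel');  -- PostgreSQL 16+: apply streamed transactions with parallel workers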
Re: Postgresql OOM
On Thu, Jun 6, 2024 at 1:25 PM Radu Radutiu wrote:
> Hello all,
>
> I have a query that forces an out of memory error, where the OS will kill
> the postgresql process.
> The query plan (run immediately after a vacuum analyze) is at
> https://explain.depesz.com/s/ITQI#html .
>
> ...
>
> Any idea on how to identify the problem? I can reproduce it on demand.
> Should I report it pgsql-bugs?
>
> Best regards,
> Radu

I am not qualified to answer on the OOM issue, but why are you joining the same table (outputrequest) 4 times (using near-identical join conditions)? This essentially does a cross join: if an input_sequence value has, say, 1000 related rows in outputrequest, you will be getting 1000^4 rows in the result set.

FROM inputrequest t
  LEFT JOIN outputrequest rec_tro ON rec_tro.input_sequence = t.input_sequence
  LEFT JOIN inputrequest r ON r.originalRequest_id = t.input_sequence
  LEFT JOIN outputrequest rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
  LEFT JOIN outputrequest rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
  LEFT JOIN outputrequest snd_tro ON snd_tro.reply_input_sequence = t.input_sequence
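A quick way to gauge that fan-out, using the table and column names from the quoted query (run the equivalent for reply_input_sequence as well):

SELECT input_sequence, count(*) AS rows_per_input
FROM outputrequest
GROUP BY input_sequence
ORDER BY count(*) DESC
LIMIT 10;

If the top counts are large, the multi-way join above multiplies them together, which can blow up the result set long before any single hash join looks unreasonable.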
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote: > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This > implies that if a new dependent object is inserted within a transaction > that is still ongoing, our search for dependent objects won't include this > recently added one. Consequently, if someone attempts to drop the > referenced object, it will be dropped, and when the ongoing transaction > completes, we will end up having an entry for a referenced object that has > already been dropped. This situation can lead to an inconsistent state. > Below is an example illustrating this scenario: > > Session 1: > - create table t1(a int); > - insert into t1 select i from generate_series(1, 1000) i; > - create extension btree_gist; > - create index i1 on t1 using gist( a ); > > Session 2: (While the index creation in session 1 is in progress, drop the > btree_gist extension) > - drop extension btree_gist; > > Above command succeeds and so does the create index command running in > session 1, post this, if we try running anything on table t1, i1, it fails > with an error: "cache lookup failed for opclass ..." > > Attached is the patch that I have tried, which seems to be working for me. > It's not polished and thoroughly tested, but just sharing here to clarify > what I am trying to suggest. Please have a look and let me know your > thoughts. Thanks for the patch proposal! The patch does not fix the other way around: - session 1: BEGIN; DROP extension btree_gist; - session 2: create index i1 on t1 using gist( a ); - session 1: commits while session 2 is creating the index and does not address all the possible orphaned dependencies cases. There is an ongoing thread (see [1]) to fix the orphaned dependencies issue. v9 attached in [1] fixes the case you describe here. [1]: https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires wrote: > > Le jeu. 6 juin 2024 à 04:13, Amit Kapila a écrit : > > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > > > > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > That's right, setting subscription's option 'streaming' to 'on' moves > the problem away from the publisher to the subscribers. This patch > tries to improve the default situation when 'streaming' is set to > 'off'. > Can we think of changing the default to 'parallel'? BTW, it would be better to use 'parallel' for the 'streaming' option, if the workload has large transactions. Is there a reason to use a default value in this case? > > > 2. Do we want a GUC to switch compression on/off? > > > > > > > It depends on the overhead of decoding. Did you try to measure the > > decoding overhead of decompression when reading compressed files? > > Quick benchmarking executed on my laptop shows 1% overhead. > Thanks. We probably need different types of data (say random data in bytea column, etc.) for this. -- With Regards, Amit Kapila.
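One way to generate random-ish bytea rows with only built-in functions (the table name and sizes here are arbitrary):

CREATE TABLE t_bytea (i INTEGER PRIMARY KEY, b BYTEA);
INSERT INTO t_bytea
SELECT i,
       decode(md5(random()::text) || md5(random()::text) ||
              md5(random()::text) || md5(random()::text), 'hex')
FROM generate_series(1, 100000) AS i;

Rows like these should compress much worse than the repetitive text used in the benchmark above, which is the point of testing with a different data shape.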
Re: race condition in pg_class
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch wrote: > Starting 2024-06-10, I plan to push the first seven of the ten patches: > > inplace005-UNEXPECTEDPASS-tap-meson-v1.patch > inplace010-tests-v1.patch > inplace040-waitfuncs-v1.patch > inplace050-tests-inj-v1.patch > inplace060-nodeModifyTable-comments-v1.patch > Those five just deal in tests, test infrastructure, and comments. > inplace070-rel-locks-missing-v1.patch > Main risk is new DDL deadlocks. > inplace080-catcache-detoast-inplace-stale-v1.patch > If it fails to fix the bug it targets, I expect it's a no-op rather than > breaking things. > > I'll leave the last three of the ten needing review. Those three are beyond > my skill to self-certify. It's not this patch set's fault, but I'm not very pleased to see that the injection point wait events have been shoehorned into the "Extension" category - which they are not - instead of being a new wait_event_type. That would have avoided the ugly wait-event naming pattern, inconsistent with everything else, introduced by inplace050-tests-inj-v1.patch. I think that the comments and commit messages in this patch set could, in some places, use improvement. For instance, inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of comments, which makes it hard to see what actually changed, and the commit message doesn't tell you, either. A good bit of it seems to be changing "a view" to "a view INSTEAD OF trigger" or "a view having an INSTEAD OF trigger," but the reasoning behind that change is not spelled out anywhere. The reader is left to guess what the other case is and why the same principles don't apply to it. I don't doubt that the new comments are more correct than the old ones, but I expect future patch authors to have difficulty maintaining that state of affairs. Similarly, inplace070-rel-locks-missing-v1.patch adds no comments. IMHO, the commit message also isn't very informative. It disclaims knowledge of what bug it's fixing, while at the same time leaving the reader to figure out for themselves how the behavior has changed. Consequently, I expect writing the release notes for a release including this patch to be difficult: "We added some locks that block ... something ... in some circumstances ... to prevent ... something." It's not really the job of the release note author to fill in those blanks, but rather of the patch author or committer. I don't want to overburden the act of fixing bugs, but I just feel like more explanation is needed here. When I see for example that we're adding a lock acquisition to the end of heap_create(), I can't help but wonder if it's really true that we don't take a lock on a just-created relation today. I'm certainly under the impression that we lock newly-created, uncommitted relations, and a quick test seems to confirm that. I don't quite know whether that happens, but evidently this call is guarding against something more subtle than a categorical failure to lock a relation on creation so I think there should be a comment explaining what that thing is. It's also quite surprising that SetRelationHasSubclass() says "take X lock before calling" and 2 of 4 callers just don't. I guess that's how it is. But shouldn't we then have an assertion inside that function to guard against future mistakes? If the reason why we failed to add this initially is discernible from the commit messages that introduced the bug, it would be nice to mention what it seems to have been; if not, it would at least be nice to mention the offending commit(s). 
I'm also a bit worried that this is going to cause deadlocks, but I suppose if it does, that's still better than the status quo. IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation instead of IsInplaceUpdateOid. inplace080-catcache-detoast-inplace-stale-v1.patch seems like another place where spelling out the rationale in more detail would be helpful to future readers; for instance, the commit message says that PgDatabaseToastTable is the only one affected, but it doesn't say why the others are not, or why this one is. The lengthy comment in CatalogCacheCreateEntry is also difficult to correlate with the code which follows. I can't guess whether the two cases called out in the comment always needed to be handled and were handled save only for in-place updates, and thus the comment changes were simply taking the opportunity to elaborate on the existing comments; or whether one of those cases is preexisting and the other arises from the desire to handle inplace updates. It could be helpful to mention relevant identifiers from the code in the comment text e.g. "systable_recheck_tuple detects ordinary updates by noting changes to the tuple's visibility information, while the equalTuple() case detects inplace updates." IMHO, this patch set underscores the desirability of removing in-place update altogether. That sounds difficult and not back-patchable, but I can't classify what this patch set does as anything better
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote:
> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote:
> >
> > Hello everyone,
> >
> > At present, we use MVCC snapshots to identify dependent objects. This
> > implies that if a new dependent object is inserted within a transaction
> > that is still ongoing, our search for dependent objects won't include this
> > recently added one. Consequently, if someone attempts to drop the
> > referenced object, it will be dropped, and when the ongoing transaction
> > completes, we will end up having an entry for a referenced object that has
> > already been dropped. This situation can lead to an inconsistent state.
> > Below is an example illustrating this scenario:
>
> I don't think it's correct to allow the index to be dropped while a
> transaction is creating it. Instead, the right solution should be for
> the create index operation to protect the object it is using from
> being dropped. Specifically, the create index operation should acquire
> a shared lock on the Access Method (AM) to ensure it doesn't get
> dropped concurrently while the transaction is still in progress.

If I'm following you correctly, that's exactly what the patch is trying to do. While the index creation is in progress, if someone tries to drop the object referenced by the index under creation, the session dropping the referenced object is able to see the dependent object (in this case the index being created) through the dirty snapshot. Because of that, it is unable to acquire the lock on the dependent object, and as a result it is unable to drop the referenced object.

-- With Regards, Ashutosh Sharma.
Re: Postgresql OOM
> I am not qualified to answer on the OOM issue, but why are you joining the
> same table (outputrequest) 4 times (using near-identical join conditions)?
> This essentially does a cross join: if an input_sequence value has, say,
> 1000 related rows in outputrequest, you will be getting 1000^4 rows in the
> result set.

The query itself runs fine in a reasonable time with enable_parallel_hash = 'off'. I see two problems: one is the wrong execution plan (right after running analyze); the second, and the most important, is the huge memory usage (far exceeding work_mem and shared_buffers) leading to OOM. See https://explain.depesz.com/s/yAqS for the explain plan with enable_parallel_hash = 'off'.
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 6:57 PM Bertrand Drouvot < bertranddrouvot...@gmail.com> wrote: > On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote: > > Hello everyone, > > > > At present, we use MVCC snapshots to identify dependent objects. This > > implies that if a new dependent object is inserted within a transaction > > that is still ongoing, our search for dependent objects won't include > this > > recently added one. Consequently, if someone attempts to drop the > > referenced object, it will be dropped, and when the ongoing transaction > > completes, we will end up having an entry for a referenced object that > has > > already been dropped. This situation can lead to an inconsistent state. > > Below is an example illustrating this scenario: > > > > Session 1: > > - create table t1(a int); > > - insert into t1 select i from generate_series(1, 1000) i; > > - create extension btree_gist; > > - create index i1 on t1 using gist( a ); > > > > Session 2: (While the index creation in session 1 is in progress, drop > the > > btree_gist extension) > > - drop extension btree_gist; > > > > Above command succeeds and so does the create index command running in > > session 1, post this, if we try running anything on table t1, i1, it > fails > > with an error: "cache lookup failed for opclass ..." > > > > Attached is the patch that I have tried, which seems to be working for > me. > > It's not polished and thoroughly tested, but just sharing here to clarify > > what I am trying to suggest. Please have a look and let me know your > > thoughts. > > Thanks for the patch proposal! > > The patch does not fix the other way around: > > - session 1: BEGIN; DROP extension btree_gist; > - session 2: create index i1 on t1 using gist( a ); > - session 1: commits while session 2 is creating the index > > and does not address all the possible orphaned dependencies cases. > > There is an ongoing thread (see [1]) to fix the orphaned dependencies > issue. > > v9 attached in [1] fixes the case you describe here. > > [1]: > https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal I see. Thanks for sharing this. I can take a look at this and help in whatever way I can. With Regards, Ashutosh Sharma.
Re: [multithreading] extension compatibility
On Thu, Jun 6, 2024 at 5:00 AM Heikki Linnakangas wrote: > If there is some material harm from compiling with multithreading > support even if you're not using it, we should try to fix that. I'm not > dead set against having a compile-time option, but I don't see the need > for it at the moment. Well, OK, so it sounds like I'm outvoted, at least at the moment. Maybe that will change as more people vote, but for now, that's where we are. Given that, I suppose we want something more like Tristan's patch, but with a more extensible syntax. Does that sound right? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
On 2024-Jun-06, Amit Kapila wrote: > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. I like the general idea of compressing the output of logical decoding. It's not so clear to me that we only want to do so for spilling to disk; for instance, if the two nodes communicate over a slow network, it may even be beneficial to compress when streaming, so to this question: > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? I would say that streaming doesn't necessarily have to mean we don't want compression, because for some users it might be beneficial. I think a GUC would be a good idea. Also, what if for whatever reason you want a different compression algorithm or different compression parameters? Looking at the existing compression UI we offer in pg_basebackup, perhaps you could add something like this: compress_logical_decoding = none compress_logical_decoding = lz4:42 compress_logical_decoding = spill-zstd:99 "none" says to never use compression (perhaps should be the default), "lz4:42" says to use lz4 with parameters 42 on both spilling and streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but only for spilling to disk. (I don't mean to say that you should implement Zstd compression with this patch, only that you should choose the implementation so that adding Zstd support (or whatever) later is just a matter of adding some branches here and there. With the current #ifdef you propose, it's hard to do that. Maybe separate the parts that depend on the specific algorithm to algorithm-agnostic functions.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Things I don't like about \du's "Attributes" column
On Thu, Jun 6, 2024 at 5:08 AM Pavel Luzanov wrote:
> But now there are no changes in pg_roles. Just a special interpretation
> of the two values of the "Connection limit" column:
> 0 - Not allowed (changed from 'No connections')
> -1 - empty string

I think the first of these special interpretations is unnecessary and should be removed. It seems pretty clear what 0 means.

-- Robert Haas EDB: http://www.enterprisedb.com
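For readers following along, the value under discussion is the role's connection limit, e.g. (role name is a placeholder):

ALTER ROLE alice CONNECTION LIMIT 0;    -- alice may not open any new connections
ALTER ROLE alice CONNECTION LIMIT -1;   -- no limit (the default)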
Re: [multithreading] extension compatibility
On 06/06/2024 17:23, Robert Haas wrote: On Thu, Jun 6, 2024 at 5:00 AM Heikki Linnakangas wrote: If there is some material harm from compiling with multithreading support even if you're not using it, we should try to fix that. I'm not dead set against having a compile-time option, but I don't see the need for it at the moment. Well, OK, so it sounds like I'm outvoted, at least at the moment. Maybe that will change as more people vote, but for now, that's where we are. Given that, I suppose we want something more like Tristan's patch, but with a more extensible syntax. Does that sound right? +1 -- Heikki Linnakangas Neon (https://neon.tech)
report on not thread-safe functions
In the context of the multithreaded-server project, I looked into potentially not thread-safe functions. (See proposed next steps at the end of this message.) Here is a list of functions in POSIX that are possibly not thread-safe: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_01 I checked those against the PostgreSQL server source code (backend + common + timezone), and found that the following from those are in use: - dlerror() - getenv() - getgrnam() - getopt() - getpwuid() - localeconv() - localtime() - nl_langinfo() - readdir() - setenv() - setlocale() - strerror() - strsignal() - strtok() - system() - unsetenv() Additionally, there are non-standard functions that are not thread-safe, such as getopt_long(). Also, there are replacement functions such as pg_gmtime() and pg_localtime() that mirror standard thread-unsafe functions and that appear to be equally unsafe. (Note to those looking into annotating global variables: You also need to check static local variables.) Conversely, some of the above might actually be thread-safe in some/many/all implementations. For example, strerror() and system() are thread-safe in glibc. So you might actually get a multithreaded server running in that environment with fewer source code changes but have it fail in others. Just something to keep in mind. I also tried the concurrency-mt-unsafe check from clang-tidy (https://clang.llvm.org/extra/clang-tidy/checks/concurrency/mt-unsafe.html). Run it for example like this: clang-tidy -p build --quiet --checks='-*,concurrency-mt-unsafe' src/backend/main/*.c (needs a compilation database in the build directory) (You can't just run it like src/backend/**/*.c because some .c files don't compile standalone, and then the whole thing aborts with too many errors. Maybe with a judicious exclusion list, this can be achieved. However, it's also not good dealing with compilation options like FRONTEND. So it can't easily serve as an automated checker, but it's okay as a manual exploration tool.) In addition to the POSIX list above, this also flagged: - exit() - sigprocmask() Allegedly, you can toggle it between posix and glibc modes, but I haven't succeeded with that. So for example, it does not actually flag strerror() out of the box, presumably because that is not in its glibc list. Now some more detailed comments on these functions: - dlerror() dlerror() gets the error from the last dlopen() call, which is obviously not thread-safe. This might require some deeper investigation of the whole dfmgr.c mechanism. (Which might be appropriate in any case, since in multithreaded environments, you don't need to load a library into each session separately.) - exit() Most of the exit() calls happen where there are not multiple threads active. But some emergency exit calls like in elog.c might more correctly use _exit()? - getenv() - setenv() - unsetenv() getenv() is unsafe if there are concurrent setenv() or unsetenv() calls. We should try to move all those to early in the program startup. This seems doable. Some calls are related to locale stuff, which is a separate subproject to clean up. There are some calls to setenv("KRB5*"), which would need to be fixed. The source code comments nearby already contain ideas how to. - getgrnam() - getpwuid() - localtime() These have _r replacements. - getopt() This needs a custom replacement. (There is no getopt_r() because programs usually don't call getopt() after startup.) 
(Note: This is also called during session startup, not only during initial postmaster start. So we definitely need something here, if we want to, like, start more than one session concurrently.) - localeconv() - nl_langinfo() - setlocale() The locale business needs to be reworked to use locale_t and _l functions. This is already being discussed for other reasons. - readdir() This is listed as possibly thread-unsafe, but I think it is thread-safe in practice. You just can't work on the same DIR handle from multiple threads. There is a readdir_r(), but that's already deprecated. I think we can ignore this one. - sigprocmask() It looks like this is safe in practice. Also, there is pthread_sigmask() if needed. - strerror() Use strerror_r(). There are very few calls of this, actually, since most potential users use printf %m. - strsignal() Use strsignal_r(). These calls are already wrapped in pg_strsignal() for Windows portability, so it's easy to change. But this also led me to think that it is potentially dangerous to have different standards of thread-safety across the tree. pg_strsignal() is used by wait_result_to_str() which is used by pclose_check() ... and at that point you have completely lost track of what you are dealing with underneath. So if someone were to put, say, pclose_check() into pgbench, it could be broken. - strtok() Use strtok_r() or maybe even strsep() (but there are small semantic differences with the latte
Re: ResourceOwner refactoring
On Thu, Jun 6, 2024 at 7:32 AM Heikki Linnakangas wrote: > > Barring objections, I'll commit this later today or tomorrow. Thanks for > > the report! > > Committed. I think you may have forgotten to push. -- Robert Haas EDB: http://www.enterprisedb.com
Re: question regarding policy for patches to out-of-support branches
On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing wrote: > Not absolutely sure, but would at least adding a page to PostgreSQL > Wiki about this make sense ? I feel like we need to do something. Tom says this is a policy, and he's made that comment before about other things, but the fact that they're not memorialized anywhere is a huge problem, IMHO. People don't read or remember every mailing list discussion forever, and even if they did, how would we enumerate all the policies for the benefit of a newcomer? Maybe this belongs in the documentation, maybe in the wiki, maybe someplace else, but the real issue for me is that policies have to be discoverable by the people who need to adhere to them, and old mailing list discussions aren't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
Hi,

On Thu, Jun 6, 2024 at 2:36 AM Jeff Davis wrote:
> On Wed, 2024-06-05 at 14:36 +0530, Ashutosh Sharma wrote:
> > Thank you, Ashutosh, for the quick response. I've drafted a patch
> > aimed at addressing this issue. The patch attempts to solve this
> > issue by configuring the search_path for all security definer
> > functions created by the extension.
>
> I like the general direction you propose, but I think it needs more
> discussion about the details.

I agree.

> * What exactly is the right search_path for a function defined in an
> extension?

Determining the precise value can be challenging. However, since it's a function installed by an extension, typically the search_path should include the extension's search_path and the schema where the function resides. If the function relies on a schema other than the one we set in its search_path, which would usually be the one created by the extension, this approach will force extension developers to set the extension-specific search_path in the CREATE FUNCTION statement, if it's not set. The primary goal here is to ensure that the security definer functions created by an extension do not refer to any untrusted schema(s).

> * Do we need a new magic search_path value of "$extension_schema" that
> resolves to the extension's schema, so that it can handle ALTER
> EXTENSION ... SET SCHEMA?

Possibly yes, we can think about it. I think it would be something like the $user we have now.

> * What do we do for functions that want the current behavior and how do
> we handle migration issues?

That can be controlled via some GUC if needed, I guess.

> * What about SECURITY INVOKER functions? Those can still be vulnerable
> to manipulation by the caller by setting search_path, which can cause
> an incorrect value to be returned. That can matter in some contexts
> like a CHECK constraint.

I didn't get you completely here. w.r.t extensions how will this have an impact if we set the search_path for definer functions.

-- With Regards, Ashutosh Sharma.
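As a concrete illustration of what "configuring the search_path" for an extension's SECURITY DEFINER function can look like today (all names here are hypothetical), the function is pinned to its own schema, with pg_temp pushed to the end:

CREATE SCHEMA my_ext;
CREATE TABLE my_ext.widgets (id int);

CREATE FUNCTION my_ext.widget_count() RETURNS bigint
    LANGUAGE sql STABLE SECURITY DEFINER
    SET search_path = my_ext, pg_temp
AS $$ SELECT count(*) FROM widgets $$;

With the SET clause attached, the unqualified reference to widgets always resolves in my_ext, regardless of the caller's search_path.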
Re: ResourceOwner refactoring
On 06/06/2024 18:27, Robert Haas wrote: On Thu, Jun 6, 2024 at 7:32 AM Heikki Linnakangas wrote: Barring objections, I'll commit this later today or tomorrow. Thanks for the report! Committed. I think you may have forgotten to push. Huh, you're right. I could swear I did... Pushed now, thanks! -- Heikki Linnakangas Neon (https://neon.tech)
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, Jun 6, 2024 at 5:12 AM Jelte Fennema-Nio wrote: > Looking at ssl_max_protocol_version closer though, to stay really > consistent I'd have to change "latest" to be renamed to empty string > (i.e. there is no max_protocol_version). I think I might prefer > staying consistent over introducing an imho slightly clearer name. > Another way to stay consistent would of course be also adding "latest" > as an option to ssl_max_protocol_version? What do you think? As I see it, the issue here is whether the default value would ever be different from the latest value. If not, then using blank to mean the latest seems fine, but if so, then I'd expect blank to mean the default version and latest to mean the latest version. > I'll look into adding min_protocol_version to the patchset soonish. > Some review of the existing code in the first few patches would > definitely be appreciated. Yeah, I understand, and I do want to do that, but keep in mind I've already spent considerable time on this patch set, way more than most others, and if I want to get it committed I'm nowhere close to being done. It's probably multiple weeks of additional work for me, and I think I've probably already spent close to a week on this, and I only work ~48 weeks a year, and there are ~300 patches in the CommitFest. Plus, right now there is no possibility of actually committing anything until after we branch. And, respectfully, I feel like there has to be some give and take here. I've been trying to give this patch set higher priority because it's in an area that I know something about and have opinions about and also because I can tell that you're kind of frustrated and I don't want you to leave the development community. But, at the same time, I don't think you've done much to help me get my patches committed, and while you have done some review of other people's patches, it doesn't seem to often be the kind of detailed, line-by-line review that is needed to get most patches committed. So I'm curious how you expect this system to scale. -- Robert Haas EDB: http://www.enterprisedb.com
Re: relfilenode statistics
On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot wrote: > I think we should keep the stats in the relation during relfilenode changes. > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can > see in the example provided up-thread the new heap_blks_written statistic has > been preserved during the TRUNCATE. Yeah, I think there's something weird about this design. Somehow we're ending up with both per-relation and per-relfilenode counters: + pg_stat_get_blocks_written(C.oid) + pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, C.relfilenode) AS heap_blks_written, I'll defer to Andres if he thinks that's awesome, but to me it does not seem right to track some blocks written in a per-relation counter and others in a per-relfilenode counter. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
On Wed, Jun 5, 2024 at 3:48 AM Ashutosh Bapat wrote:
> Here's planning time measurements.
>
>  num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time
> -----------+-------------+--------------+------------------------------+-------------------------
>          2 |      187.86 |       177.27 |                        10.59 |                   5.64%
>          3 |      721.81 |       758.80 |                       -36.99 |                  -5.12%
>          4 |     2239.87 |      2236.19 |                         3.68 |                   0.16%
>          5 |     6830.86 |      7027.76 |                      -196.90 |                  -2.88%

I find these results concerning. Why should the planning time sometimes go up? And why should it go up for odd numbers of joinrels and down for even numbers of joinrels? I wonder if these results are really correct, and if they are, I wonder what could account for such odd behavior.

-- Robert Haas EDB: http://www.enterprisedb.com
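For anyone trying to reproduce numbers like these, planning time alone can be observed without executing the query; a sketch, assuming a set of identically partitioned tables t1..t5 joined on their partition key (all names are placeholders):

SET enable_partitionwise_join = on;
EXPLAIN (SUMMARY ON)
SELECT *
FROM t1
JOIN t2 USING (id)
JOIN t3 USING (id);

Plain EXPLAIN with SUMMARY ON reports "Planning Time" without running the query, which makes it easy to script many repetitions per join count.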
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin wrote: > Am I missing something or the the page buffer indeed lacks locking there? I don't know, but if the locks are really missing now, I feel like the first question is "which commit got rid of them?". It's a little hard to believe that they've never been there and somehow nobody has noticed. Then again, maybe we have; see Noah's thread about in-place updates breaking stuff and some of the surprising discoveries there. But it seems worth investigating. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Thu, 2024-06-06 at 21:17 +0530, Ashutosh Sharma wrote: > That can be controlled via some GUC if needed, I guess. That's a possibility, but it's easy to create a mess that way. I don't necessarily oppose it, but we'd need some pretty strong agreement that we are somehow moving users in a better direction and not just creating two behaviors that last forever. I also think there should be a way to explicitly request the old behavior -- obtaining search_path from the session -- regardless of how the GUC is set. > I didn't get you completely here. w.r.t extensions how will this have > an impact if we set the search_path for definer functions. If we only set the search path for SECURITY DEFINER functions, I don't think that solves the whole problem. Regards, Jeff Davis
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Thu, 6 Jun 2024 at 12:53, Jeff Davis wrote:
> > I didn't get you completely here. w.r.t extensions how will this have
> > an impact if we set the search_path for definer functions.
>
> If we only set the search path for SECURITY DEFINER functions, I don't
> think that solves the whole problem.

Indeed. While the ability for a caller to set the search_path for a security definer function introduces security problems that are different from those for security invoker functions, it's still weird for the behaviour of a function to depend on the caller's search_path. It's even weirder for the default search path behaviour to be different depending on whether or not the function is security definer.
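A minimal sketch of that weirdness with a security invoker function (all object names hypothetical): the same call returns different results depending only on the caller's search_path.

CREATE SCHEMA good;
CREATE SCHEMA evil;
CREATE TABLE good.limits (max_amount numeric);
CREATE TABLE evil.limits (max_amount numeric);
INSERT INTO good.limits VALUES (100);
INSERT INTO evil.limits VALUES (1000000);

SET search_path = good, public;     -- so the body also parses at creation time

CREATE FUNCTION public.check_amount(a numeric) RETURNS boolean
    LANGUAGE sql                    -- SECURITY INVOKER is the default
AS $$ SELECT a <= (SELECT max_amount FROM limits) $$;

SELECT check_amount(500);           -- false: limits resolves to good.limits
SET search_path = evil, public;
SELECT check_amount(500);           -- true: same call, now evil.limits

This is the sort of caller-side manipulation that matters in contexts like CHECK constraints.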
Re: question regarding policy for patches to out-of-support branches
Robert Haas writes: > On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing wrote: >> Not absolutely sure, but would at least adding a page to PostgreSQL >> Wiki about this make sense ? > I feel like we need to do something. Tom says this is a policy, and > he's made that comment before about other things, but the fact that > they're not memorialized anywhere is a huge problem, IMHO. I didn't say it wasn't ;-) ISTM we have two basic choices: wiki page, or new SGML docs section. In the short term I'd lean to a wiki page. It'd be reasonable for https://wiki.postgresql.org/wiki/Committing_checklist to link to it (and maybe the existing section there about release freezes would be more apropos on a "Project policies" page? Not sure.) To get a sense of how much of a problem we have, I grepped the git history for comments mentioning project policies. Ignoring ones that are really talking about very localized issues, what I found is attached. It seems like it's little enough that a single wiki page with subsections ought to be enough. I'm not super handy with editing the wiki, plus I'm only firing on one cylinder today (seem to have acquired a head cold at pgconf), so maybe somebody else would like to draft something? regards, tom lane This was submitted as a security issue, but the security team has been unable to identify any consequence worse than a null pointer dereference (from trying to access rd_tableam methods that the relation no longer has). Therefore, in accordance with our usual policy, it's not security material and should just be fixed as a routine bug. (this is probably material for security-team-private documentation) All backend-side variables should be marked with PGDLLIMPORT, as per policy introduced in 8ec569479f. Project policy is to not leave global objects behind after a regress test run. This was found as a result of the development of a patch to make pg_regress detect such leftovers automatically, which in the end was withdrawn due to issues with parallel runs. Per project policy, transient roles created by regression test cases should be named "regress_something", to reduce the risks of running such cases against installed servers. And no such role should ever be left behind after running a test. Per project policy that we want to keep recently-out-of-support branches buildable on modern systems, back-patch all the way to 9.2. This back-patches commit 9ff47ea41 into out-of-support branches, pursuant to newly-established project policy. The point is to suppress scary-looking warnings so that people building these branches needn't expend brain cells verifying that it's safe to ignore the warnings. Tweak detail and hint messages to be consistent with project policy (this should reference message style guide in SGML docs) Doc: update testing recipe in src/test/perl/README. The previous text didn't provide any clear explanation of our policy around TAP test portability. (should just reference that README as a guide for writing TAP tests) "typename" is a C++ keyword, so pg_upgrade.h fails to compile in C++. Fortunately, there seems no likely reason for somebody to need to do that. Nonetheless, it's project policy that all .h files should pass cpluspluscheck, so rename the argument to fix that. Commit a6417078 established a new project policy around OID assignment: new patches are encouraged to choose a random OID in the 8000.. range when a manually-assigned OID is required (if multiple OIDs are required, a consecutive block of OIDs starting from the random point should be used). 
Catalog entries added by committed patches that use OIDs from this "unstable" range are renumbered after feature freeze. (this should reference bki.sgml) libpq failed to ignore Windows-style newlines in connection service files. This normally wasn't a problem on Windows itself, because fgets() would convert \r\n to just \n. But if libpq were running inside a program that changes the default fopen mode to binary, it would see the \r's and think they were data. In any case, it's project policy to ignore \r in text files unconditionally, because people sometimes try to use files with DOS-style newlines on Unix machines, where the C library won't hide that from us. However, project policy since parallel query came in is that all plan node types should have outfuncs/readfuncs support, so this is clearly an oversight that should be repaired. (Probably moot now, given auto generation of these functions.) We have a project policy that every .c file should start by including postgres.h, postgres_fe.h, or c.h as appropriate; and then there is no need for any .h file to explicitly include any of these. (The core reason for this policy is to make it ea
Re: tiny step toward threading: reduce dependence on setlocale()
On Wed, 2024-06-05 at 17:23 -0700, Jeff Davis wrote: > A brief test shows that there may be a performance regression for > libc > default collations. But if so, I'm not sure that's avoidable if the > goal is to take away setlocale. I'll see if removing the extra > branches > mitigates it. I redid the test and didn't see a difference, and then I ran a standalone microbenchmark to compare strcoll() vs strcoll_l(), and didn't see a difference there, either. Another implementation may show a difference, but it doesn't seem to be a problem for glibc. I think this patch series is a nice cleanup, as well, making libc more like the other providers and not dependent on global state. Regards, Jeff Davis
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello Robert,

06.06.2024 19:36, Robert Haas wrote:
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin wrote: Am I missing something or the the page buffer indeed lacks locking there? I don't know, but if the locks are really missing now, I feel like the first question is "which commit got rid of them?". It's a little hard to believe that they've never been there and somehow nobody has noticed. Then again, maybe we have; see Noah's thread about in-place updates breaking stuff and some of the surprising discoveries there. But it seems worth investigating.

Yes, my last experiment with memcmp for the whole buffer was wrong, given the comment above heapgettup_pagemode(). I think the correct check would be:

     ItemId      lpp;
     OffsetNumber lineoff;
+    ItemIdData  iid;

     lineoff = scan->rs_vistuples[lineindex];
     lpp = PageGetItemId(page, lineoff);
+    iid = *((ItemIdData *) lpp);
+    for (int i = 0; i < 1000; i++)
+        Assert(memcmp(&iid, lpp, sizeof(iid)) == 0);

It makes the test failure significantly easier to reproduce for me. Will try to bisect this anomaly tomorrow.

Best regards,
Alexander
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On 2024-06-06 Th 04:15, Kyotaro Horiguchi wrote: At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in I have been thinking about this since then. At first, I thought it referred to FindFirstChangeNotification() and friends, and inotify on Linux. However, I haven't found a way to simplify the specified code area using those APIs. By the way, the need to shift by 2 seconds to tolerate clock skew suggests that the current launcher-postmaster association mechanism is somewhat unreliable. Couldn't we add a command line option to postmaster to explicitly pass a unique identifier (say, pid itself) of the launcher? If it is not specified, the number should be the PID of the immediate parent process. This change avoids the need for the special treatment for Windows. Looks good generally. I assume iterating over the process table to find the right pid will be pretty quick. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Thu, Jun 06, 2024 at 12:36:32PM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin wrote: > > Am I missing something or the the page buffer indeed lacks locking there? > > I don't know, but if the locks are really missing now, I feel like the > first question is "which commit got rid of them?". It's a little hard > to believe that they've never been there and somehow nobody has > noticed. > > Then again, maybe we have; see Noah's thread about in-place updates > breaking stuff and some of the surprising discoveries there. But it > seems worth investigating. $SUBJECT looks more like a duplicate of postgr.es/m/flat/20240512171658.7e.nmi...@google.com (Hot standby queries see transient all-zeros pages).
Re: problems with "Shared Memory and Semaphores" section of docs
Here is a rebased version of the patch for v18 that adds a runtime-computed GUC. As I noted earlier, there still isn't a consensus on this approach. -- nathan >From 74f638f7df9c51f5ab47b282bb7107c4ba6cb5b6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 21 May 2024 14:02:22 -0500 Subject: [PATCH v2 1/1] add semaphores_required GUC --- doc/src/sgml/config.sgml| 14 +++ doc/src/sgml/runtime.sgml | 39 - src/backend/storage/ipc/ipci.c | 6 - src/backend/utils/misc/guc_tables.c | 12 + 4 files changed, 53 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 698169afdb..f6afc941df 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11215,6 +11215,20 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + semaphores_required (integer) + + semaphores_required configuration parameter + + + + +Reports the number of semaphores that are needed for the server based +on the number of allowed connections, worker processes, etc. + + + + ssl_library (string) diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 2f7c618886..c387f302d7 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -781,13 +781,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such SEMMNI Maximum number of semaphore identifiers (i.e., sets) -at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) plus room for other applications +at least ceil(semaphores_required / 16) plus room for other applications SEMMNS Maximum number of semaphores system-wide -ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) * 17 plus room for other applications +ceil(semaphores_required / 16) * 17 plus room for other applications @@ -836,30 +836,38 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using System V semaphores, -PostgreSQL uses one semaphore per allowed connection -(), allowed autovacuum worker process -(), allowed WAL sender process -(), and allowed background -process (), in sets of 16. +PostgreSQL uses one semaphore per allowed connection, +worker process, etc., in sets of 16. Each such set will also contain a 17th semaphore which contains a magic number, to detect collision with semaphore sets used by other applications. The maximum number of semaphores in the system is set by SEMMNS, which consequently must be at least -as high as max_connections plus -autovacuum_max_workers plus max_wal_senders, -plus max_worker_processes, plus one extra for each 16 -allowed connections plus workers (see the formula in plus one extra for +each set of 16 required semaphores (see the formula in ). The parameter SEMMNI determines the limit on the number of semaphore sets that can exist on the system at one time. Hence this parameter must be at -least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16). +least ceil(semaphores_required / 16). Lowering the number of allowed connections is a temporary workaround for failures, which are usually confusingly worded No space left on device, from the function semget. 
+ +The number of semaphores required by PostgreSQL +is provided by the runtime-computed parameter +semaphores_required, which can be determined before +starting the server with a postgres command like: + +$ postgres -D $PGDATA -C semaphores_required + +The value of semaphores_required should be input into +the aforementioned formulas to determine appropriate values for +SEMMNI and SEMMNS. + + In some cases it might also be necessary to increase SEMMAP to be at least on the order of @@ -882,11 +890,8 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using POSIX semaphores, the number of semaphores needed is the -same as for System V, that is one semaphore per allowed connection -(), allowed autovacuum worker process -(), allowed WAL sender process -(), and allowed background -process (). +same as for System V, that is one semaphore per allowed connection, +worker process, etc. On the platforms where this option is preferred, there is no specific kernel limit on the number of POSIX semaphores. diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 521ed5418c..3e030accac 100644 --- a/src/backend/s
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, 6 Jun 2024 at 18:01, Robert Haas wrote: > As I see it, the issue here is whether the default value would ever be > different from the latest value. If not, then using blank to mean the > latest seems fine, but if so, then I'd expect blank to mean the > default version and latest to mean the latest version. Alright, that's fair. And we already seem to follow that pattern: There's currently no connection option that has a default that's not the empty string, but still accepts the empty string as an argument. > > I'll look into adding min_protocol_version to the patchset soonish. > > Some review of the existing code in the first few patches would > > definitely be appreciated. > > Yeah, I understand, and I do want to do that, but keep in mind I've > already spent considerable time on this patch set, way more than most > others, and if I want to get it committed I'm nowhere close to being > done. It's probably multiple weeks of additional work for me, and I > think I've probably already spent close to a week on this, and I only > work ~48 weeks a year, and there are ~300 patches in the CommitFest. I very much appreciate the time you spent on this patchset so far. I mainly thought that instead of only discussing the more complex parts of the patchset, it would be nice to also actually move forward a little bit too. And the first 3 patches in this patchset are very small and imho straightforward improvements. To be clear, I'm not saying that should be all on you. I think those first three patches can be reviewed by pretty much anyone. > Plus, right now there is no possibility of actually committing > anything until after we branch. Totally fair, but even a LGTM on one of the patches would be quite nice. > And, respectfully, I feel like there > has to be some give and take here. I've been trying to give this patch > set higher priority because it's in an area that I know something > about and have opinions about and also because I can tell that you're > kind of frustrated and I don't want you to leave the development > community. Thank you for giving it a higher priority, it's definitely appreciated and noticed. > But, at the same time, I don't think you've done much to > help me get my patches committed, and while you have done some review > of other people's patches, it doesn't seem to often be the kind of > detailed, line-by-line review that is needed to get most patches > committed. So I'm curious how you expect this system to scale. Of course there's always the possibility to review more. But I don't really agree with this summary of my review activity. I did see your patches related to the incremental backup stuff. They looked interesting, but at the time from an outside perspective it didn't seem like those threads needed my reviews to progress (a bunch of people more knowledgable on the topic were already responding). So I spent my time mainly on threads where I felt I could add something useful, and often that was more on the design front than the exact code. Honestly that's what triggered this whole patchset in the first place: Adding infrastructure for protocol changes so that the several other threads that try to introduce protocol changes can actually move forward, instead of being in limbo forever. Regarding line-by-line reviews, imho I definitely do that for the smaller patches I tend to review (even if they are part of a bigger patchset). 
But for the bigger ones I don't think line-by-line reviews are super helpful at the start, so I generally comment more on the design in those cases.
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 6:40 AM, Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires wrote: > > > > On Thu, Jun 6, 2024 at 4:13 AM, Amit Kapila wrote: > > > > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > > > When the content of a large transaction (size exceeding > > > > logical_decoding_work_mem) and its sub-transactions has to be > > > > reordered during logical decoding, then, all the changes are written > > > > on disk in temporary files located in pg_replslot/. > > > > Decoding very large transactions by multiple replication slots can > > > > lead to disk space saturation and high I/O utilization. > > > > > > > > > > Why can't one use 'streaming' option to send changes to the client > > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > That's right, setting subscription's option 'streaming' to 'on' moves > > the problem away from the publisher to the subscribers. This patch > > tries to improve the default situation when 'streaming' is set to > > 'off'. > > > > Can we think of changing the default to 'parallel'? BTW, it would be > better to use 'parallel' for the 'streaming' option, if the workload > has large transactions. Is there a reason to use a default value in > this case? You're certainly right, if using the streaming API helps to avoid bad situations and there is no downside, it could be used by default. > > > > 2. Do we want a GUC to switch compression on/off? > > > > > > > > > > It depends on the overhead of decoding. Did you try to measure the > > > decoding overhead of decompression when reading compressed files? > > > > Quick benchmarking executed on my laptop shows 1% overhead. > > Thanks. We probably need different types of data (say random data in > bytea column, etc.) for this. Yes, good idea, will run new tests along those lines. Thank you! Regards, JT
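For reference, a minimal sketch of how a subscriber opts into the streaming behaviour discussed above; the subscription, connection, and publication names are placeholders:

-- switch an existing subscription from spilling on the publisher
-- to streaming in-progress transactions to the subscriber:
ALTER SUBSCRIPTION mysub SET (streaming = parallel);

-- or at creation time:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = parallel);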
Re: problems with "Shared Memory and Semaphores" section of docs
On Thu, Jun 6, 2024 at 3:21 PM Nathan Bossart wrote: > Here is a rebased version of the patch for v18 that adds a runtime-computed > GUC. As I noted earlier, there still isn't a consensus on this approach. I don't really like making this a GUC, but what's the other option? It's reasonable for people to want to ask the server how many resources it will need to start, and -C is the only tool we have for that right now. So I feel like this is a fair thing to do. I do think the name could use some more thought, though. semaphores_required would end up being the same kind of thing as shared_memory_size_in_huge_pages, but the names seem randomly different. If semaphores_required is right here, why isn't shared_memory_required used there? Seems more like we ought to call this semaphores or os_semaphores or num_semaphores or num_os_semaphores or something. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
Le jeu. 6 juin 2024 à 07:24, Alvaro Herrera a écrit : > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. Interesting idea, will try to evaluate how to compress/decompress data transiting via streaming and how good the compression ratio would be. > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. I agree, if the server was compiled with support of multiple compression libraries, users should be able to choose which one they want to use. > (I don't mean to say that you should implement Zstd compression with > this patch, only that you should choose the implementation so that > adding Zstd support (or whatever) later is just a matter of adding some > branches here and there. With the current #ifdef you propose, it's hard > to do that. Maybe separate the parts that depend on the specific > algorithm to algorithm-agnostic functions.) Makes sense, will rework this patch in that way. Thank you! Regards, JT
Re: problems with "Shared Memory and Semaphores" section of docs
On Thu, Jun 06, 2024 at 03:31:53PM -0400, Robert Haas wrote: > I don't really like making this a GUC, but what's the other option? > It's reasonable for people to want to ask the server how many > resources it will need to start, and -C is the only tool we have for > that right now. So I feel like this is a fair thing to do. Yeah, this is how I feel, too. > I do think the name could use some more thought, though. > semaphores_required would end up being the same kind of thing as > shared_memory_size_in_huge_pages, but the names seem randomly > different. If semaphores_required is right here, why isn't > shared_memory_required used there? Seems more like we ought to call > this semaphores or os_semaphores or num_semaphores or > num_os_semaphores or something. I'm fine with any of your suggestions. If I _had_ to pick one, I'd probably choose num_os_semaphores because it's the most descriptive. -- nathan
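For context, a runtime-computed parameter like this can be inspected with -C before startup or with SHOW on a running server; a small sketch using the name suggested above (num_os_semaphores is only a proposal at this point, the final name may differ):

-- existing runtime-computed parameter, for comparison:
SHOW shared_memory_size_in_huge_pages;
-- proposed parameter (name still under discussion):
SHOW num_os_semaphores;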
Re: question regarding policy for patches to out-of-support branches
On 6/6/24 14:12, Tom Lane wrote: Robert Haas writes: On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing wrote: Not absolutely sure, but would at least adding a page to PostgreSQL Wiki about this make sense ? I feel like we need to do something. Tom says this is a policy, and he's made that comment before about other things, but the fact that they're not memorialized anywhere is a huge problem, IMHO. I didn't say it wasn't ;-) ISTM we have two basic choices: wiki page, or new SGML docs section. In the short term I'd lean to a wiki page. It'd be reasonable for https://wiki.postgresql.org/wiki/Committing_checklist to link to it (and maybe the existing section there about release freezes would be more apropos on a "Project policies" page? Not sure.) To get a sense of how much of a problem we have, I grepped the git history for comments mentioning project policies. Ignoring ones that are really talking about very localized issues, what I found is attached. It seems like it's little enough that a single wiki page with subsections ought to be enough. I'm not super handy with editing the wiki, plus I'm only firing on one cylinder today (seem to have acquired a head cold at pgconf), so maybe somebody else would like to draft something? I added them here with minimal copy editing and no attempt to organize or sort into groups: https://wiki.postgresql.org/wiki/Committing_checklist#Policies If someone has thoughts on how to improve it, I am happy to make more changes. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot wrote: > v9 is more invasive (as it changes code in much more places) than v8 but it is > easier to follow (as it is now clear where the new lock is acquired). Hmm, this definitely isn't what I had in mind. Possibly that's a sign that what I had in mind was dumb, but for sure it's not what I imagined. What I thought you were going to do was add calls like LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock) in various places, or perhaps LockRelationOid(reloid, AccessShareLock), or whatever the case may be. Here you've got stuff like this: - record_object_address_dependencies(&conobject, addrs_auto, -DEPENDENCY_AUTO); + lock_record_object_address_dependencies(&conobject, addrs_auto, + DEPENDENCY_AUTO); ...which to me looks like the locking is still pushed down inside the dependency code. And you also have stuff like this: ObjectAddressSet(referenced, RelationRelationId, childTableId); + depLockAndCheckObject(&referenced); recordDependencyOn(&depender, &referenced, DEPENDENCY_PARTITION_SEC); But in depLockAndCheckObject you have: + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; That doesn't seem right, because then it seems like the call isn't doing anything, but there isn't really any reason for it to not be doing anything. If we're dropping a dependency on a table, then it seems like we need to have a lock on that table. Presumably the reason why we don't end up with dangling dependencies in such cases now is because we're careful about doing LockRelation() in the right places, but we're not similarly careful about other operations e.g. ConstraintSetParentConstraint is called by DefineIndex which calls table_open(childRelId, ...) first, but there's no logic in DefineIndex to lock the constraint. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
Re: small fix for llvm build
On 28.05.24 17:17, Peter Eisentraut wrote: I'm getting build failures when building with meson and llvm enabled, like this: [1/112] Generating src/backend/jit/llvm/llvmjit_types.bc with a custom command FAILED: src/backend/jit/llvm/llvmjit_types.bc /usr/local/bin/ccache /usr/local/Cellar/llvm/18.1.6/bin/clang -c -o src/backend/jit/llvm/llvmjit_types.bc ../src/backend/jit/llvm/llvmjit_types.c -flto=thin -emit-llvm -MD -MQ src/backend/jit/llvm/llvmjit_types.bc -MF src/backend/jit/llvm/llvmjit_types.c.bc.d -O2 -Wno-ignored-attributes -Wno-empty-body -fno-strict-aliasing -fwrapv -I./src/include -I./src/backend/utils/misc -I../src/include In file included from ../src/backend/jit/llvm/llvmjit_types.c:27: In file included from ../src/include/postgres.h:45: ../src/include/c.h:75:10: fatal error: 'libintl.h' file not found 75 | #include <libintl.h> | ^~~ 1 error generated. The reason is that libintl.h is at /usr/local/include/libintl.h, but that is not in the include path for this command. I have -I/usr/local/include in CPPFLAGS in the environment, which is why the normal compilation commands pick it up, but this is not used by this custom command. With this small patch I can make it work: diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build index 41c759f73c5..4a4232661ba 100644 --- a/src/backend/jit/llvm/meson.build +++ b/src/backend/jit/llvm/meson.build @@ -63,6 +63,7 @@ bitcode_cflags = ['-fno-strict-aliasing', '-fwrapv'] if llvm.version().version_compare('>=15.0') bitcode_cflags += ['-Xclang', '-no-opaque-pointers'] endif +bitcode_cflags += get_option('c_args') bitcode_cflags += cppflags # XXX: Worth improving on the logic to find directories here I have committed this change.
Re: Postgresql OOM
Radu Radutiu writes: > The query itself runs fine in a reasonable time with enable_parallel_hash = > 'off'. I see two problems - one is the wrong execution plan (right after > running analyze), the second and the most important is the huge memory > usage (far exceeding work_mem and shared buffers) leading to OOM. > See https://explain.depesz.com/s/yAqS for the explain plan > with enable_parallel_hash = 'off'. What it looks like to me is that the join key column has very skewed statistics, such that a large majority of the tuples end up in the same hash bucket (probably they even all have identical keys). I think the memory growth is coming from the executor repeatedly splitting the buckets in a vain attempt to separate those tuples into multiple buckets. The planner should recognize this situation and avoid use of hash join in such cases, but maybe the statistics aren't reflecting the problem, or maybe there's something wrong with the logic specific to parallel hash join. You've not really provided enough information to diagnose why the planner made this poor choice of plan. regards, tom lane
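A sketch of how one might check for the skew described above and work around the plan choice; the table and column names below are placeholders, not taken from the reported query:

-- look for a dominant value on the join key:
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'some_table' AND attname = 'join_key_column';

-- the workaround the reporter already uses:
SET enable_parallel_hash = off;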
Re: Things I don't like about \du's "Attributes" column
On 06.06.2024 17:29, Robert Haas wrote: I think the first of these special interpretations is unnecessary and should be removed. It seems pretty clear what 0 means. Agree. There is an additional technical argument for removing this replacement. I don't like explicit cast to text of the "Connection limit" column. Without 'Not allowed' it is no longerrequired. Value -1 can be replaced by NULL with an implicit cast to integer. Next version with this change attached. Example output: \du+ regress_du* List of roles Role name | Login | Attributes | Valid until | Connection limit | Description --+---+-+--+--+-- regress_du_admin | yes | Superuser +| | | some description | | Create DB +| | | | | Create role+| | | | | Inherit+| | | | | Replication+| | | | | Bypass RLS | | | regress_du_role0 | yes | Inherit | Tue Jun 04 00:00:00 2024 PDT | 0 | regress_du_role1 | no| Create role+| infinity | | | | Inherit | | | regress_du_role2 | yes | Inherit+| | 42 | | | Replication+| | | | | Bypass RLS | | | (4 rows) Current version for comparison: List of roles Role name | Attributes | Description --++-- regress_du_admin | Superuser, Create role, Create DB, Replication, Bypass RLS | some description regress_du_role0 | No connections+| | Password valid until 2024-06-04 00:00:00+03| regress_du_role1 | Create role, Cannot login +| | Password valid until infinity | regress_du_role2 | Replication, Bypass RLS +| | 42 connections | Data: CREATE ROLE regress_du_role0 LOGIN PASSWORD '123' VALID UNTIL '2024-06-04' CONNECTION LIMIT 0; CREATE ROLE regress_du_role1 CREATEROLE CONNECTION LIMIT -1 VALID UNTIL 'infinity'; CREATE ROLE regress_du_role2 LOGIN REPLICATION BYPASSRLS CONNECTION LIMIT 42; CREATE ROLE regress_du_admin LOGIN SUPERUSER CREATEROLE CREATEDB BYPASSRLS REPLICATION INHERIT; COMMENT ON ROLE regress_du_admin IS 'some description'; -- Pavel Luzanov Postgres Professional:https://postgrespro.com From fd3fb8a4bea89f870789fe63a270f872c945980c Mon Sep 17 00:00:00 2001 From: Pavel Luzanov Date: Thu, 6 Jun 2024 23:48:32 +0300 Subject: [PATCH v8] psql: Rethinking of \du command Cnanges in the \du command - "Login", "Connection limit" and "Valid until" attributes are placed in separate columns. to understand that there is no limit. - The "Attributes" column includes only the enabled logical attributes. - The attribute names correspond to the keywords of the CREATE ROLE command. - The attributes are listed in the same order as in the documentation. - Value -1 for "Connection limit" replaced by NULL to make it easier - General refactoring of describeRoles function in describe.c. --- src/bin/psql/describe.c| 146 - src/test/regress/expected/psql.out | 40 +--- src/test/regress/sql/psql.sql | 12 ++- 3 files changed, 72 insertions(+), 126 deletions(-) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index f67bf0b892..31cc40b38f 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -36,7 +36,6 @@ static bool describeOneTableDetails(const char *schemaname, bool verbose); static void add_tablespace_footer(printTableContent *const cont, char relkind, Oid tablespace, const bool newline); -static void add_role_attribute(PQExpBuffer buf, const char *const str); static bool listTSParsersVerbose(const char *pattern); static bool describeOneTSParser(const char *oid, const char *nspname, const char *prsname); @@ -3615,34 +3614,47 @@ describeRoles(const char *pattern, bool verbose, bool showSystem) { PQExpBufferData buf; PGresult *res; - printTableContent cont; - printTableO
Re: ssl tests fail due to TCP port conflict
On Wed, 5 Jun 2024 at 23:37, Tom Lane wrote: > > Andrew Dunstan writes: > > On 2024-06-05 We 16:00, Alexander Lakhin wrote: > >> That is, psql from the test instance 001_ssltests_34 opened a > >> connection to > >> the test server with the client port 50072 and it made using the port by > >> the server from the test instance 001_ssltests_30 impossible. > > > Oh. (kicks self) > > D'oh. > > > Should we really be allocating ephemeral server ports in the range > > 41952..65535? Maybe we should be looking for an unallocated port > > somewhere below 41952, and above, say, 32767, so we couldn't have a > > client socket collision. > > Hmm, are there really any standards about how these port numbers > are used? > > I wonder if we don't need to just be prepared to retry the whole > thing a few times. Even if it's true that "clients" shouldn't > choose ports below 41952, we still have a small chance of failure > against a non-Postgres server starting up at the wrong time. My suggestion would be to not touch the ephemeral port range at all for these ports. In practice the ephemeral port range is used for cases where the operating system assigns the port, and the application doesn't care what it is. Not for when you want to get a free port, but want to know in advance which one it is. For the PgBouncer test suite we do something similar to what PG's perl tests do, but there we allocate a port between 10200 and 32768: https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 Sure theoretically it's possible to hit a rare case where another server starts up at the wrong time, but that chance seems way lower than a client starting up at the wrong time. Especially since there aren't many servers that use a port with 5 digits. Attached is a patch that updates the port numbers. v1-0001-Don-t-use-ephemeral-port-range.patch Description: Binary data
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Thu, 6 Jun 2024 at 20:10, Isaac Morland wrote: > > On Thu, 6 Jun 2024 at 12:53, Jeff Davis wrote: > >> >> > I didn't get you completely here. w.r.t extensions how will this have >> > an impact if we set the search_path for definer functions. >> >> If we only set the search path for SECURITY DEFINER functions, I don't >> think that solves the whole problem. > > > Indeed. While the ability for a caller to set the search_path for a security > definer functions introduces security problems that are different than for > security invoker functions, it's still weird for the behaviour of a function > to depend on the caller's search_path. It’s even weirder for the default > search path behaviour to be different depending on whether or not the > function is security definer. +1 And +1 to the general idea and direction this thread is going in. I definitely think we should be making extensions more secure by default, and this is an important piece of it. Even by default making the search_path "pg_catalog, pg_temp" for functions created by extensions would be very useful.
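For what it's worth, an extension author can already pin this down per function today; a minimal sketch (the function name and body are placeholders):

-- attach a fixed search_path to the function itself, so the caller's
-- search_path no longer affects name resolution inside it:
CREATE FUNCTION my_ext_func(int) RETURNS int
LANGUAGE sql
SECURITY DEFINER
SET search_path = pg_catalog, pg_temp
AS 'SELECT $1 + 1';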
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Fri, 2024-06-07 at 00:19 +0200, Jelte Fennema-Nio wrote: > Even by default making the search_path "pg_catalog, pg_temp" for > functions created by extensions would be very useful. Right now there's no syntax to override that. We'd need something to say "get the search_path from the session". Regards, Jeff Davis
Re: race condition in pg_class
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > It's not this patch set's fault, but I'm not very pleased to see that > the injection point wait events have been shoehorned into the > "Extension" category - which they are not - instead of being a new > wait_event_type. That would have avoided the ugly wait-event naming > pattern, inconsistent with everything else, introduced by > inplace050-tests-inj-v1.patch. Not sure to agree with that. The set of core backend APIs supporting injection points have nothing to do with wait events. The library attached to one or more injection points *may* decide to use a wait event like what the wait/wakeup calls in modules/injection_points do, but that's entirely optional. These rely on custom wait events, plugged into the Extension category as the code run is itself in an extension. I am not arguing against the point that it may be interesting to plug in custom wait event categories, but the current design of wait events makes that much harder than what core is currently able to handle, and I am not sure that this brings much at the end as long as the wait event strings can be customized. I've voiced upthread concerns over the naming enforced by the patch and the way it plugs the namings into the isolation functions, by the way. -- Michael signature.asc Description: PGP signature
Re: Injection points: preloading and runtime arguments
On Thu, Jun 06, 2024 at 03:47:47PM +0500, Andrey M. Borodin wrote: > Is it OK to detach() before wakeup()? Or, perhaps, can a detach() do a > wakeup() automatically? It is OK to do a detach before a wakeup. Noah has been relying on this behavior in an isolation test for a patch he's worked on. See inplace110-successors-v1.patch here: https://www.postgresql.org/message-id/20240512232923.aa.nmi...@google.com That's also something we've discussed for 33181b48fd0e, where Noah wanted to emulate in an automated fashion what one can do with a debugger and one or more breakpoints. Not sure that wakeup() involving an automated detach() is the behavior to hide long-term, actually, as there is also an argument for waking up a point and *not* detaching it to force multiple waits. -- Michael signature.asc Description: PGP signature
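For illustration, the detach-before-wakeup sequence discussed above, using the SQL functions exposed by the injection_points test module; the point name is a placeholder and this is only a sketch of how the module is typically driven:

-- session 1: make the next hit of the point wait
SELECT injection_points_attach('my-test-point', 'wait');
-- session 2, later: detaching first is allowed ...
SELECT injection_points_detach('my-test-point');
-- ... and the wakeup releases any backend already waiting on the point
SELECT injection_points_wakeup('my-test-point');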
Re: report on not thread-safe functions
On Thu, 2024-06-06 at 16:34 +0200, Peter Eisentraut wrote: > - setlocale() > > The locale business needs to be reworked to use locale_t and _l > functions. This is already being discussed for other reasons. I posted a few patches to do this for collation: https://commitfest.postgresql.org/48/5023/ Regards, Jeff Davis
Re: Revive num_dead_tuples column of pg_stat_progress_vacuum
On Wed, Jun 5, 2024 at 7:19 PM Andrey M. Borodin wrote: > > > > > On 4 Jun 2024, at 00:26, Masahiko Sawada wrote: > > Thank you! Vacuum enhancement is a really good step forward, and this small > change would help a lot of observability tools. > > > > On 4 Jun 2024, at 00:49, Peter Geoghegan wrote: > > > > Can we rename this to num_dead_item_ids (or something similar) in > > passing? > > I do not insist, but many tools will have to adapt to this change [0,1]. > However, most of tools will have to deal with removed max_dead_tuples anyway > [2], so this is not that big problem. True, this incompatibility would not be a big problem. num_dead_item_ids seems good to me. I've updated the patch that incorporated the comment from Álvaro[1]. Regards, [1] https://www.postgresql.org/message-id/202406041535.pmyty3ci4pfd%40alvherre.pgsql -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com From 24838307afc52dfc00317579ad88a59ba5c00192 Mon Sep 17 00:00:00 2001 From: Masahiko Sawada Date: Tue, 4 Jun 2024 06:17:25 +0900 Subject: [PATCH v2] Reintroduce dead tuple counter in pg_stat_progress_vacuum. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 667e65aac3 changed both num_dead_tuples and max_dead_tuples columns to dead_tuple_bytes and max_dead_tuple_bytes columns, respectively. But as per discussion, the number of dead tuples collected still can provide meaningful insights for users. This change reintroduce the column for the count of dead tuples, renamed as num_dead_item_ids to avoid confusion with the number of dead tuples removed by VACUUM, which includes dead heap-only tuples, but excludes any pre-existing LP_DEAD items left behind by opportunistic pruning. XXX: bump catalog version. Reviewed-by: Peter Geoghegan, Álvaro Herrera, Andrey Borodin Discussion: https://postgr.es/m/CAD21AoBL5sJE9TRWPyv%2Bw7k5Ee5QAJqDJEDJBUdAaCzGWAdvZw%40mail.gmail.com --- doc/src/sgml/monitoring.sgml | 9 + src/backend/access/heap/vacuumlazy.c | 12 +--- src/backend/catalog/system_views.sql | 3 ++- src/include/commands/progress.h | 5 +++-- src/test/regress/expected/rules.out | 5 +++-- 5 files changed, 26 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 053da8d6e4..b2ad9b446f 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6268,6 +6268,15 @@ FROM pg_stat_get_backend_idset() AS backendid; + + + num_dead_item_ids bigint + + + Number of dead item identifiers collected since the last index vacuum cycle. 
+ + + indexes_total bigint diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 8145ea8fc3..ef1df35afa 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -2883,13 +2883,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets, int num_offsets) { TidStore *dead_items = vacrel->dead_items; + const int prog_index[2] = { + PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS, + PROGRESS_VACUUM_DEAD_TUPLE_BYTES + }; + int64 prog_val[2]; TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets); vacrel->dead_items_info->num_items += num_offsets; - /* update the memory usage report */ - pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES, - TidStoreMemoryUsage(dead_items)); + /* update the progress information */ + prog_val[0] = vacrel->dead_items_info->num_items; + prog_val[1] = TidStoreMemoryUsage(dead_items); + pgstat_progress_update_multi_param(2, prog_index, prog_val); } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 53047cab5f..efb29adeb3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned, S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count, S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes, -S.param8 AS indexes_total, S.param9 AS indexes_processed +S.param8 AS num_dead_item_ids, S.param9 AS indexes_total, +S.param10 AS indexes_processed FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h index 82a8fe6bd1..5616d64523 100644 --- a/src/include/commands/progress.h +++ b/src/include/commands/progress.h @@ -25,8 +25,9 @@ #define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4 #define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5 #define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6 -#define PROGRESS_VACUUM_INDEXES_TOTAL 7 -#define PROGRESS_VACUUM_INDEXES_PROCESSED 8 +#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7 +#define PROGRESS_VAC
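Once the column is in place, monitoring tools could pick it up with something like this; num_dead_item_ids is the name used in the v2 patch and may still change:

SELECT pid, datname, relid::regclass AS relation,
       num_dead_item_ids, dead_tuple_bytes, max_dead_tuple_bytes,
       indexes_processed, indexes_total
FROM pg_stat_progress_vacuum;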
Re: question regarding policy for patches to out-of-support branches
Joe Conway writes: > On 6/6/24 14:12, Tom Lane wrote: >> To get a sense of how much of a problem we have, I grepped the git >> history for comments mentioning project policies. Ignoring ones >> that are really talking about very localized issues, what I found >> is attached. It seems like it's little enough that a single wiki >> page with subsections ought to be enough. I'm not super handy with >> editing the wiki, plus I'm only firing on one cylinder today (seem >> to have acquired a head cold at pgconf), so maybe somebody else >> would like to draft something? > I added them here with minimal copy editing an no attempt to organize or > sort into groups: > https://wiki.postgresql.org/wiki/Committing_checklist#Policies > If someone has thoughts on how to improve I am happy to make more changes. Thanks! I summoned the energy to make a few more improvements, particularly updating stuff that seemed out-of-date. I'm sure there's more that could be added here. regards, tom lane
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada wrote: > > > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila wrote: > > > > > > > > To achieve this, we can allow sequences to be copied during > > > the initial CREATE SUBSCRIPTION command similar to what we do for > > > tables. And then later by new/existing command, we re-copy the already > > > existing sequences on the subscriber. > > > > > > The options for the new command could be: > > > Alter Subscription ... Refresh Sequences > > > Alter Subscription ... Replicate Sequences > > > > > > In the second option, we need to introduce a new keyword Replicate. > > > Can you think of any better option? > > > > Another idea is doing that using options. For example, > > > > For initial sequences synchronization: > > > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true); > > > > How will it interact with the existing copy_data option? So copy_data > will become equivalent to copy_table_data, right? Right. > > > For re-copy (or update) sequences: > > > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true); > > > > Similar to the previous point it can be slightly confusing w.r.t > copy_data. And would copy_sequence here mean that it would copy > sequence values of both pre-existing and newly added sequences, if so, > that would make it behave differently than copy_data? The other > possibility in this direction would be to introduce an option like > replicate_all_sequences/copy_all_sequences which indicates a copy of > both pre-existing and new sequences, if any. Copying sequence data works differently than replicating table data (initial data copy and logical replication). So I thought the copy_sequence option (or whatever better name) always does both updating pre-existing sequences and adding new sequences. REFRESH PUBLICATION updates the tables to be subscribed, so we also update or add sequences associated to these tables. > > If we want to go in the direction of having an option such as > copy_(all)_sequences then do you think specifying that copy_data is > just for tables in the docs would be sufficient? I am afraid that it > would be confusing for users. I see your point. But I guess it would not be very problematic as it doesn't break the current behavior and copy_(all)_sequences is primarily for upgrade use cases. > > > > > > > In addition to the above, the command Alter Subscription .. Refresh > > > Publication will fetch any missing sequences similar to what it does > > > for tables. > > > > On the subscriber side, do we need to track which sequences are > > created via CREATE/ALTER SUBSCRIPTION? > > > > I think so unless we find some other way to know at refresh > publication time which all new sequences need to be part of the > subscription. What should be the behavior w.r.t sequences when the > user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? I was > thinking similar to tables, it should fetch any missing sequence > information from the publisher. It seems to make sense to me. But I have one question: do we want to support replicating sequences that are not associated with any tables? if yes, what if we refresh two different subscriptions that subscribe to different tables on the same database? On the other hand, if no (i.e. replicating only sequences owned by tables), can we know which sequences to replicate by checking the subscribed tables? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Synchronizing slots from primary to standby
On Thursday, June 6, 2024 12:21 PM Peter Smith > > Hi, here are some review comments for the docs patch v5-0001. Thanks for the comments! Here is the V6 patch that addressed the these. Best Regards, Hou zj v6-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch Description: v6-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello Noah, 06.06.2024 22:07, Noah Misch wrote: I don't know, but if the locks are really missing now, I feel like the first question is "which commit got rid of them?". It's a little hard to believe that they've never been there and somehow nobody has noticed. Then again, maybe we have; see Noah's thread about in-place updates breaking stuff and some of the surprising discoveries there. But it seems worth investigating. $SUBJECT looks more like a duplicate of postgr.es/m/flat/20240512171658.7e.nmi...@google.com (Hot standby queries see transient all-zeros pages). Thank you for the reference! Yes, it looks very similar. Though I can't say the sleep you proposed helps the failure reproduction (I've tried 026_overwrite_contrecord.pl and saw no more frequent failures or so). My bisect run ended with: 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit Author: Thomas Munro Date: Wed Apr 3 00:03:08 2024 +1300 Provide vectored variant of ReadBuffer(). Other buildfarm failures with this Assert I could find kind of confirm this: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 (presumably a first failure of this sort) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 Best regards, Alexander
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote: > My bisect run ended with: > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit > > Author: Thomas Munro > Date: Wed Apr 3 00:03:08 2024 +1300 > > Provide vectored variant of ReadBuffer(). > > Other buildfarm failures with this Assert I could find kind of confirm this: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 > (presumably a first failure of this sort) > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 Looking...
Re: relfilenode statistics
Hi, On 2024-06-06 12:27:49 -0400, Robert Haas wrote: > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot > wrote: > > I think we should keep the stats in the relation during relfilenode changes. > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you > > can > > see in the example provided up-thread the new heap_blks_written statistic > > has > > been preserved during the TRUNCATE. > > Yeah, I think there's something weird about this design. Somehow we're > ending up with both per-relation and per-relfilenode counters: > > + pg_stat_get_blocks_written(C.oid) + > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, > C.relfilenode) AS heap_blks_written, > > I'll defer to Andres if he thinks that's awesome, but to me it does > not seem right to track some blocks written in a per-relation counter > and others in a per-relfilenode counter. It doesn't immediately sound awesome. Nor really necessary? If we just want to keep prior stats upon a relation rewrite, we can just copy the stats from the old relfilenode. Or we can decide that those stats don't really make sense anymore, and start from scratch. I *guess* I could see an occasional benefit in having both counters for "prior relfilenodes" and "current relfilenode" - except that stats get reset manually and upon crash anyway, making this less useful than if it were really "lifetime" stats. Greetings, Andres Freund
Re: cannot drop intarray extension
On Mon, Jun 3, 2024 at 12:14 PM jian he wrote: > > hi. > > setup > drop table if exist test__int cascade; > create extension intarray; > > CREATE TABLE test__int( a int[] ); > CREATE INDEX text_idx on test__int using gist (a gist__intbig_ops(siglen = > 1)); > drop extension intarray cascade; > NOTICE: drop cascades to index text_idx > 2024-06-03 11:53:32.629 CST [41165] ERROR: cache lookup failed for > function 17758 > 2024-06-03 11:53:32.629 CST [41165] STATEMENT: drop extension intarray > cascade; > ERROR: cache lookup failed for function 17758 > > > backtrace info: > index_getprocinfo > #0 index_opclass_options (indrel=0x7faeca727b58, attnum=1, > attoptions=94372901674408, validate=false) > at > ../../Desktop/pg_src/src4/postgres/src/backend/access/index/indexam.c:1034 > #1 0x55d4e63a79cb in RelationGetIndexAttOptions > (relation=0x7faeca727b58, copy=false) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:5872 > #2 0x55d4e639d72d in RelationInitIndexAccessInfo > (relation=0x7faeca727b58) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:1569 > #3 0x55d4e639c5ac in RelationBuildDesc (targetRelId=24582, insertIt=true) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:1207 > #4 0x55d4e639e9ce in RelationIdGetRelation (relationId=24582) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:2115 > #5 0x55d4e5a412fd in relation_open (relationId=24582, lockmode=8) > at > ../../Desktop/pg_src/src4/postgres/src/backend/access/common/relation.c:58 > #6 0x55d4e5ae6a06 in index_open (relationId=24582, lockmode=8) > at > ../../Desktop/pg_src/src4/postgres/src/backend/access/index/indexam.c:137 > #7 0x55d4e5be61b8 in index_drop (indexId=24582, concurrent=false, > concurrent_lock_mode=false) > at ../../Desktop/pg_src/src4/postgres/src/backend/catalog/index.c:2156 > > i guess it's because we first dropped the function g_intbig_options in this context, the index "text_idx" has a normal dependency with pg_opclass. but `drop extension intarray cascade;`, CASCADE means that we drop the pg_opclass and pg_opclass's inner dependency first, then drop the index. while drop index (sub functions RelationGetIndexAttOptions,index_opclass_options, index_getprocinfo) requires that pg_opclass and its inner dependencies (namely g_intbig_options, g_int_options) are not dropped first. in deleteObjectsInList, under certain conditions trying to sort the to be deleted object list by just using sort_object_addresses seems to work, but it looks like a hack. maybe the proper fix would be in findDependentObjects. From 8deefb638df270cf26e5649b1a99f218474821fa Mon Sep 17 00:00:00 2001 From: jian he Date: Fri, 7 Jun 2024 11:25:03 +0800 Subject: [PATCH v1 1/1] trying to resolve drop extension deletion order sometimes, drop extension cascade cannot resolve the internal deletion order correctly. e.g. drop table if exist test__int cascade; create extension intarray; CREATE TABLE test__int( a int[] ); CREATE INDEX text_idx on test__int using gist (a gist__intbig_ops(siglen = 1)); drop extension intarray cascade; the index "text_idx" only have a normal dependency with pg_opclass. even though the index can be dropped separately without affecting the "pg_opclass". but CASCADE means that we drop the pg_opclass and pg_opclass's inner dependency first, then drop the index. 
while drop index (sub functions RelationGetIndexAttOptions,index_opclass_options, index_getprocinfo) requires that pg_opclass and its inner dependencies are not dropped first. Resorting the deleted objects in deleteObjectsInList using sort_object_addresses seems like a hack. but it works for now. discussion: https://www.postgresql.org/message-id/CACJufxEspPKC7oxVLci7oUddUmcAGNKJnWWSD7-B03bGtT9gDg%40mail.gmail.com --- src/backend/catalog/dependency.c | 33 +++ src/backend/catalog/pg_shdepend.c | 2 +- src/backend/commands/dropcmds.c | 2 +- src/backend/commands/indexcmds.c | 2 +- src/backend/commands/tablecmds.c | 13 +++- src/include/catalog/dependency.h | 2 +- 6 files changed, 41 insertions(+), 13 deletions(-) diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ad..d0c2454b 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -183,7 +183,7 @@ static void DeleteInitPrivs(const ObjectAddress *object); */ static void deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel, - int flags) + int flags, bool sort_objects) { int i; @@ -213,6 +213,8 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel, } } + if (sort_objects) + sort_object_addresses(targetObjects); /* * Delete all the objects in the proper order, except that if told
Re: relfilenode statistics
Hi, On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote: > The main argument is that we currently don’t have writes counters for > relations. > The reason is that we don’t have the relation OID when writing buffers out. > Tracking writes per relfilenode would allow us to track/consolidate writes per > relation (example in the v1 patch and in the message up-thread). > > I think that adding instrumentation in this area (writes counters) could be > beneficial (like it is for the ones we currently have for reads). > > Second argument is that this is also beneficial for the "Split index and > table statistics into different types of stats" thread (mentioned in the > previous > message). It would allow us to avoid additional branches in some situations > (like > the one mentioned by Andres in the link I provided up-thread). I think there's another *very* significant benefit: Right now physical replication doesn't populate statistics fields like n_dead_tup, which can be a huge issue after failovers, because there's little information about what autovacuum needs to do. Auto-analyze can *partially* fix it at times, if it's lucky enough to see enough dead tuples - but that's not a given and even if it works, is often wildly inaccurate. Once we put things like n_dead_tup into per-relfilenode stats, we can populate them during WAL replay. Thus after a promotion autovacuum has much better data. This also is important when we crash: We've been talking about storing a snapshot of the stats alongside each REDO pointer. Combined with updating stats during crash recovery, we'll have accurate dead-tuple stats once recovery has finished. Greetings, Andres Freund
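As a concrete illustration of the gap described above, on a freshly promoted standby today the accumulated counters start out empty, for example:

-- run right after promotion: n_dead_tup and the autovacuum timestamps
-- carry no history from the old primary
SELECT relname, n_dead_tup, n_mod_since_analyze, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY relname
LIMIT 5;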
Re: speed up a logical replica setup
On Wed, May 22, 2024, at 12:16 PM, Euler Taveira wrote: > I'll summarize all issues as soon as I finish the review of sync slot > support. I > think we should avoid new development if we judge that the item can be > documented as a limitation for this version. Nevertheless, I will share > patches > so you can give your opinion on whether it is an open item or new development. Here it is a patch series to fix the issues reported in recent discussions. The patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 removes synchronized failover slots on subscriber since it has no use. I also included an optional patch 0004 that improves the usability by checking both servers if it already failed in any subscriber check. As I said in this previous email I decided to remove the logic that reacts for an issue on primary. We can reintroduce another code later if/when we have a better way to check the recovery progress. It will rely on the recovery_timeout and it adds recovery_timeout equals to PG_TEST_TIMEOUT_DEFAULT to let the animals control how long it should wait for the recovery. Since some animals reported some issues in the check_publisher routine that checks if the primary_slot_name is in use on primary, this logic was removed too (patch 0003). We could introduce a way to keep trying this test but the conclusion is that it is not worth it and if the primary_slot_name does not exist (due to a setup error), pg_createsubscriber will log an error message and continue. The 0002 removes any failover slots that remains on subscriber. Talking about terminology, I noticed that slotsync.c uses "logical failover slots" and "failover logical slots", I think the latter sounds better but I'm not a native speaker. I also don't know if we use a short terminology like "failover slots" "failover replication slots" or "failover logical replication slots". IMO we can omit "logical" because "failover" infers it is a logical replication slot. I'm also not sure about omitting "replication". It is not descriptive enough. I prefer "failover replication slots". Before sending this email I realized that I did nothing about physical replication slots on the standby. I think we should also remove them too unconditionally. -- Euler Taveira EDB https://www.enterprisedb.com/ From 5d8b4781e6e9bcb00564f45c25e575a8abab6ae8 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Tue, 21 May 2024 23:04:57 -0300 Subject: [PATCH 1/4] Only the recovery_timeout controls the end of recovery process It used to check if the target server is connected to the primary server (send required WAL) to rapidly react when the process won't succeed. This code is not enough to guarantee that the recovery process will complete. There is a window between the walreceiver shutdown and the pg_is_in_recovery() returns false that can reach NUM_CONN_ATTEMPTS attempts and fails. Instead, rely only on recovery_timeout option to give up the process after the specified number of seconds. --- doc/src/sgml/ref/pg_createsubscriber.sgml | 7 - src/bin/pg_basebackup/pg_createsubscriber.c | 29 ++- .../t/040_pg_createsubscriber.pl | 2 ++ 3 files changed, 5 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml b/doc/src/sgml/ref/pg_createsubscriber.sgml index 142b02..a700697f88 100644 --- a/doc/src/sgml/ref/pg_createsubscriber.sgml +++ b/doc/src/sgml/ref/pg_createsubscriber.sgml @@ -325,13 +325,6 @@ PostgreSQL documentation connections to the target server should fail. 
- -During the recovery process, if the target server disconnects from the -source server, pg_createsubscriber will check a -few times if the connection has been reestablished to stream the required -WAL. After a few attempts, it terminates with an error. - - Since DDL commands are not replicated by logical replication, avoid executing DDL commands that change the database schema while running diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c index 90cc580811..f62f34b1a7 100644 --- a/src/bin/pg_basebackup/pg_createsubscriber.c +++ b/src/bin/pg_basebackup/pg_createsubscriber.c @@ -1360,6 +1360,9 @@ stop_standby_server(const char *datadir) * * If recovery_timeout option is set, terminate abnormally without finishing * the recovery process. By default, it waits forever. + * + * XXX Is the recovery process still in progress? When recovery process has a + * better progress reporting mechanism, it should be added here. */ static void wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions *opt) @@ -1367,9 +1370,6 @@ wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions PGconn *conn; int status = POSTMASTER_STILL_STARTING; int timer = 0; - int count = 0; /* number of consecutive connection attempts */ - -#define NUM_CONN_ATTEMPTS 10 pg_log_i
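A sketch of the kind of cleanup that patch 0002 automates; on the converted subscriber the leftover slots could be listed and dropped manually like this, assuming the failover and synced columns of pg_replication_slots:

-- list replication slots that were synchronized for failover
SELECT slot_name, slot_type, failover, synced
FROM pg_replication_slots
WHERE failover;

-- drop them, since they have no use on the new subscriber
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE failover;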
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma wrote: > > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote: >> >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: >> > >> > Hello everyone, >> > >> > At present, we use MVCC snapshots to identify dependent objects. This >> > implies that if a new dependent object is inserted within a transaction >> > that is still ongoing, our search for dependent objects won't include this >> > recently added one. Consequently, if someone attempts to drop the >> > referenced object, it will be dropped, and when the ongoing transaction >> > completes, we will end up having an entry for a referenced object that has >> > already been dropped. This situation can lead to an inconsistent state. >> > Below is an example illustrating this scenario: >> >> I don't think it's correct to allow the index to be dropped while a >> transaction is creating it. Instead, the right solution should be for >> the create index operation to protect the object it is using from >> being dropped. Specifically, the create index operation should acquire >> a shared lock on the Access Method (AM) to ensure it doesn't get >> dropped concurrently while the transaction is still in progress. > > > If I'm following you correctly, that's exactly what the patch is trying to > do; while the index creation is in progress, if someone tries to drop the > object referenced by the index under creation, the referenced object being > dropped is able to know about the dependent object (in this case the index > being created) using dirty snapshot and hence, it is unable to acquire the > lock on the dependent object, and as a result of that, it is unable to drop > it. You are aiming for the same outcome, but not in the conventional way. In my opinion, the correct approach is not to find objects being created using a dirty snapshot. Instead, when creating an object, you should acquire a proper lock on any dependent objects to prevent them from being dropped during the creation process. For instance, when creating an index that depends on the btree_gist access method, the create index operation should protect btree_gist from being dropped by acquiring the appropriate lock. It is not the responsibility of the drop extension to identify in-progress index creations. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
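The race being discussed looks roughly like this (a sketch only; the table, column, and index names are illustrative):

-- session 1: keep the index-creating transaction open
BEGIN;
CREATE INDEX idx ON t USING gist (a gist__int_ops);

-- session 2, concurrently: the MVCC snapshot used for the dependency
-- search does not yet see session 1's new pg_depend rows, so the drop
-- can complete and leave a dangling dependency behind
DROP EXTENSION intarray;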
PgStat_KindInfo.named_on_disk not required in shared stats
Hi all, (Relevant folks in CC.) While hacking on the area of pgstat_*.c, I have noticed the existence of named_on_disk in PgStat_KindInfo, that is here to track the fact that replication slots are a particular case in the PgStat_HashKey for the dshash table of the stats because this kind of stats requires a mapping between the replication slot name and the hash key. As far as I can see, this field is not required and is used nowhere, because the code relies on the existence of the to_serialized_name and from_serialized_name callbacks to do the mapping. Wouldn't it make sense to remove it? This field is defined since 5891c7a8ed8f that introduced the shmem stats, and has never been used since. This frees an extra bit in PgStat_KindInfo, which is going to help me a bit with what I'm doing with this area of the code while keeping the structure size the same. Thoughts? -- Michael From 68c6e8401baea7ba1f0c616bbcd74c19daab770e Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 7 Jun 2024 14:04:06 +0900 Subject: [PATCH] Remove PgStat_KindInfo.named_on_disk This field is used to track a special case for replication slots that need a mapping between the dshash key and the slot names, but it is used nowhere as callbacks take care of sanity checks. --- src/include/utils/pgstat_internal.h | 8 +--- src/backend/utils/activity/pgstat.c | 1 - 2 files changed, 1 insertion(+), 8 deletions(-) diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h index dbbca31602..f6031995a9 100644 --- a/src/include/utils/pgstat_internal.h +++ b/src/include/utils/pgstat_internal.h @@ -193,12 +193,6 @@ typedef struct PgStat_KindInfo */ bool accessed_across_databases:1; - /* - * For variable-numbered stats: Identified on-disk using a name, rather - * than PgStat_HashKey. Probably only needed for replication slot stats. - */ - bool named_on_disk:1; - /* * The size of an entry in the shared stats hash table (pointed to by * PgStatShared_HashEntry->body). @@ -239,7 +233,7 @@ typedef struct PgStat_KindInfo void (*reset_timestamp_cb) (PgStatShared_Common *header, TimestampTz ts); /* - * For variable-numbered stats with named_on_disk. Optional. + * For variable-numbered stats. Optional. */ void (*to_serialized_name) (const PgStat_HashKey *key, const PgStatShared_Common *header, NameData *name); diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index dcc2ad8d95..44f0d3ede7 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -307,7 +307,6 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { .fixed_amount = false, .accessed_across_databases = true, - .named_on_disk = true, .shared_size = sizeof(PgStatShared_ReplSlot), .shared_data_off = offsetof(PgStatShared_ReplSlot, stats), -- 2.43.0 signature.asc Description: PGP signature
Re: cannot drop intarray extension
On Fri, Jun 07, 2024 at 11:32:14AM +0800, jian he wrote: > in deleteObjectsInList, under certain conditions trying to sort the to > be deleted object list > by just using sort_object_addresses seems to work, > but it looks like a hack. > maybe the proper fix would be in findDependentObjects. @@ -1459,6 +1459,7 @@ RemoveRelations(DropStmt *drop) [...] - performMultipleDeletions(objects, drop->behavior, flags); + if (list_length(drop->objects) > 1) + sortable = false; I have not studied the patch in details, but this looks overcomplicated to me. All the callers of performMultipleDeletions pass down sortable as true, while deleteObjectsInList() uses this argument to avoid the sorting on nested calls. It seems to me that this could be simpler. -- Michael signature.asc Description: PGP signature
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent wrote: > > On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > > > > Em ter., 4 de jun. de 2024 às 16:39, Nathan Bossart > > escreveu: > >> > >> I noticed that the "Restoring database schemas in the new cluster" part of > >> pg_upgrade can take a while if you have many databases, so I experimented > >> with a couple different settings to see if there are any easy ways to speed > >> it up. The FILE_COPY strategy for CREATE DATABASE helped quite > >> significantly on my laptop. For ~3k empty databases, this step went from > >> ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 > >> made a similar change for initdb, so there might even be an argument for > >> back-patching this to v15 (where STRATEGY was introduced). One thing I > >> still need to verify is that this doesn't harm anything when there are lots > >> of objects in the databases, i.e., more WAL generated during many > >> concurrent CREATE-DATABASE-induced checkpoints. > >> > >> Thoughts? > > > > Why not use it too, if not binary_upgrade? > > Because in the normal case (not during binary_upgrade) you don't want > to have to generate 2 checkpoints for every created database, > especially not when your shared buffers are large. Checkpoints' costs > scale approximately linearly with the size of shared buffers, so being > able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > of performance in the systems where this performance impact matters > most. I agree with you that we introduced the WAL_LOG strategy to avoid these force checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
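For reference, the two strategies can be spelled out directly in SQL; a minimal sketch of what the restore step would effectively run with the proposed change (the database name is a placeholder):

-- proposed for binary upgrade: copy at the file level, at the cost of
-- a checkpoint before and after
CREATE DATABASE mydb TEMPLATE = template0 STRATEGY = FILE_COPY;

-- default since v15: WAL-log the copy and skip the extra checkpoints
CREATE DATABASE mydb TEMPLATE = template0 STRATEGY = WAL_LOG;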
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro wrote: > On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote: > > My bisect run ended with: > > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit > > > > Author: Thomas Munro > > Date: Wed Apr 3 00:03:08 2024 +1300 > > > > Provide vectored variant of ReadBuffer(). > > > > Other buildfarm failures with this Assert I could find kind of confirm this: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 > > (presumably a first failure of this sort) > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 > > Looking... What Noah described[1] is what should be happening already, I think, but 210622c6 unconditionally zeroed the page. Oops. The attached seems to cure his repro for me. Does it also cure your test? I couldn't see that variant myself for some reason, but it seems to make sense as the explanation. I would probably adjust the function name or perhaps consider refactoring slightly, but first let's confirm that this is the same issue and fix. [1] https://www.postgresql.org/message-id/flat/20240512171658.7e.nmi...@google.com From f3bb1d69a57bea820895efaf366371463e62235d Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 7 Jun 2024 17:49:19 +1200 Subject: [PATCH] Fix RBM_ZERO_AND_LOCK. Commit 210622c6 accidentally zeroed out pages even if they were found in the buffer pool. It should always lock the page, but it should only zero pages that were not already found as an optimization to avoid I/O. Otherwise, concurrent readers that hold only a pin might see corrupted page contents changing under their feet. Reported-by: Noah Misch Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com --- src/backend/storage/buffer/bufmgr.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 49637284f91..238fc0e3547 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1016,7 +1016,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * return. */ static void -ZeroBuffer(Buffer buffer, ReadBufferMode mode) +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero) { BufferDesc *bufHdr; uint32 buf_state; @@ -1034,7 +1034,8 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode) LockBufferForCleanup(buffer); } - memset(BufferGetPage(buffer), 0, BLCKSZ); + if (zero) + memset(BufferGetPage(buffer), 0, BLCKSZ); if (BufferIsLocal(buffer)) { @@ -1185,7 +1186,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence, buffer = PinBufferForBlock(rel, smgr, smgr_persistence, forkNum, blockNum, strategy, &found); - ZeroBuffer(buffer, mode); + ZeroBuffer(buffer, mode, !found); return buffer; } -- 2.45.1
Add support to TLS 1.3 cipher suites and curves lists
Hi All,

I'm a Postgres user, and I'm looking into restricting the set of allowed ciphers and configuring a concrete set of curves on our Postgres instances.

The current Postgres documentation mentions that only cipher lists for TLS 1.2 and below can be configured; there is no setting that controls the cipher choices used by TLS 1.3. As for ECDH keys, Postgres currently supports setting only a single elliptic curve group rather than a list, as described in the doc link below:

https://www.postgresql.org/docs/devel/runtime-config-connection.html

I have a patch that adds settings for TLS 1.3 cipher suites and expands the configuration option for EC settings. With my patch:

1. A new configuration option, ssl_ciphers_suites, controls the cipher choices used by TLS 1.3.
2. The existing configuration option ssl_ecdh_curve is extended to accept a list of curve names separated by colons.

Could you please review this to see if you are interested in having this change in the upcoming Postgres major release (it should be PG17)? Thanks in advance.

patch_support_tls1.3_curvelist.diff
Description: Binary data
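As a rough sketch of how this would be used once the patch is applied: ssl_ciphers_suites is the name proposed above, the list syntax for ssl_ecdh_curve assumes the patch, and the suite and curve names are only examples.

    -- Restrict the TLS 1.3 cipher suites the server will negotiate
    -- (ssl_ciphers_suites is the new GUC proposed by this patch).
    ALTER SYSTEM SET ssl_ciphers_suites = 'TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256';

    -- With the patch, ssl_ecdh_curve accepts a colon-separated list of curves
    -- instead of a single group.
    ALTER SYSTEM SET ssl_ecdh_curve = 'X25519:prime256v1:secp384r1';

    -- Both settings take effect on configuration reload.
    SELECT pg_reload_conf();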
Re: How about using dirty snapshots to locate dependent objects?
On Fri, Jun 7, 2024 at 10:06 AM Dilip Kumar wrote:
>
> On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma wrote:
> >
> > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote:
> >>
> >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote:
> >> >
> >> > Hello everyone,
> >> >
> >> > At present, we use MVCC snapshots to identify dependent objects. This
> >> > implies that if a new dependent object is inserted within a transaction
> >> > that is still ongoing, our search for dependent objects won't include
> >> > this recently added one. Consequently, if someone attempts to drop the
> >> > referenced object, it will be dropped, and when the ongoing transaction
> >> > completes, we will end up having an entry for a referenced object that
> >> > has already been dropped. This situation can lead to an inconsistent
> >> > state. Below is an example illustrating this scenario:
> >>
> >> I don't think it's correct to allow the index to be dropped while a
> >> transaction is creating it. Instead, the right solution should be for
> >> the create index operation to protect the object it is using from
> >> being dropped. Specifically, the create index operation should acquire
> >> a shared lock on the Access Method (AM) to ensure it doesn't get
> >> dropped concurrently while the transaction is still in progress.
> >
> > If I'm following you correctly, that's exactly what the patch is trying to
> > do; while the index creation is in progress, if someone tries to drop the
> > object referenced by the index under creation, the referenced object being
> > dropped is able to know about the dependent object (in this case the index
> > being created) using a dirty snapshot, and hence it is unable to acquire
> > the lock on the dependent object, and as a result of that, it is unable to
> > drop it.
>
> You are aiming for the same outcome, but not in the conventional way.
> In my opinion, the correct approach is not to find objects being
> created using a dirty snapshot. Instead, when creating an object, you
> should acquire a proper lock on any dependent objects to prevent them
> from being dropped during the creation process. For instance, when
> creating an index that depends on the btree_gist access method, the
> create index operation should protect btree_gist from being dropped by
> acquiring the appropriate lock. It is not the responsibility of the
> drop extension to identify in-progress index creations.

Thanks for sharing your thoughts. I appreciate your input and completely
understand your perspective, but I wonder whether that is feasible. For
example, if an object (an index in this case) depends on, let's say, 'n'
objects, and those 'n' objects belong to 'n' different catalog tables,
should we acquire locks on each of them until the create index command
succeeds, or should we just check for the presence of the dependent
objects and record their dependencies in the pg_depend table? In this
particular case, we are trying to create a gist index that depends on the
gist_int4 opclass, which is one of the tuples inside the pg_opclass
catalog table; should we acquire a lock on this tuple/table until the
create index command succeeds, and is that to be done for all the
dependent objects?

--
With Regards,
Ashutosh Sharma.
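For concreteness, the race being discussed can be sketched as follows. This is an illustrative reconstruction based on the description above, not the thread's original example; the table, column, and index names are made up.

    -- Session 1: create a dependent object inside a transaction that stays open.
    CREATE EXTENSION btree_gist;
    CREATE TABLE t (val int4);
    BEGIN;
    CREATE INDEX t_val_gist ON t USING gist (val gist_int4_ops);
    -- not yet committed: the new pg_depend entry is invisible to other sessions

    -- Session 2, concurrently: per the description above, the MVCC-snapshot
    -- dependency scan does not see the in-progress index, so the referenced
    -- objects can be dropped out from under session 1.
    DROP EXTENSION btree_gist;

    -- Session 1 then commits, leaving an index whose opclass no longer exists.
    COMMIT;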
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote:
>
> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent wrote:
>>
>> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote:
>>>
>>> Why not use it too, if not binary_upgrade?
>>
>> Because in the normal case (not during binary_upgrade) you don't want
>> to have to generate 2 checkpoints for every created database,
>> especially not when your shared buffers are large. Checkpoints' costs
>> scale approximately linearly with the size of shared buffers, so being
>> able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
>> of performance in the systems where this performance impact matters
>> most.
>
> I agree with you that we introduced the WAL_LOG strategy to avoid
> these forced checkpoints. However, in binary upgrade cases where no
> operations are happening in the system, the FILE_COPY strategy should
> be faster.

While you would be correct if there were no operations happening in the
system, during binary upgrade we're still actively modifying catalogs, and
this is done with potentially many concurrent jobs. I think it's not
unlikely that this would impact performance.

Now that I think about it, arguably, we shouldn't need to run checkpoints
during binary upgrade for the FILE_COPY strategy after we've restored the
template1 database and created a checkpoint after that: all other
databases use template1 as their template database, and the checkpoint is
there mostly to guarantee the FS knows about all changes in the template
database before we task it with copying the template database over to our
new database, so the protections we get from more checkpoints are
practically useless.

If such a change were implemented (i.e. no checkpoints for FILE_COPY in
binary upgrade, with a single manual checkpoint after restoring template1
in create_new_objects) I think most of my concerns with this patch would
be alleviated.

Kind regards,

Matthias van de Meent
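To make that proposal concrete, the restore phase would conceptually behave like the sketch below. The database names are hypothetical, and the skipping of FILE_COPY's implicit checkpoints would happen inside the server while in binary-upgrade mode, not in the SQL that pg_upgrade emits.

    -- 1. template1 itself is restored first.
    -- 2. One manual checkpoint ensures template1's contents are on disk.
    CHECKPOINT;
    -- 3. Every other database is then created from template1 with FILE_COPY;
    --    under the proposal, the per-database forced checkpoints are skipped
    --    while the server is in binary-upgrade mode.
    CREATE DATABASE db1 STRATEGY = FILE_COPY;
    CREATE DATABASE db2 STRATEGY = FILE_COPY;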