Re: relfilenode statistics
Hi,

On Mon, Jun 03, 2024 at 11:11:46AM +, Bertrand Drouvot wrote:
> Yeah, I'll update the commit message in V2 with better explanations once I get
> feedback on V1 (should we decide to move on with the relfilenode stats idea).

Please find attached v2, a mandatory rebase due to cd312adc56. In passing it provides a more detailed commit message (also making clear that the goal of this patch is to start the discussion and agree on the design before moving forward).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 81d25e077c9f4eafa5304c257d1b39ee8a811628 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v2] Provide relfilenode statistics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently don't have write counters for relations. The reason is that we don't have the relation OID when writing buffers out. Tracking writes per relfilenode would allow us to track/consolidate writes per relation.

Relfilenode stats are also beneficial for the "Split index and table statistics into different types of stats" work in progress: they would allow us to avoid additional branches in some situations.

=== Remarks ===

This is a POC patch. There is still work to do: there are more places where we should add relfilenode counters, more APIs to create for retrieving the relfilenode stats, and while the patch takes care of the rewrite generated by TRUNCATE, there is more to handle, such as CLUSTER and VACUUM FULL. The new logic to retrieve stats in pg_statio_all_tables has been implemented only for the new blocks_written stat (we'd need to do the same for the existing buffer read / buffer hit counters if we agree on the approach implemented here).

The goal of this patch is to start the discussion and agree on the design before moving forward.
--- src/backend/access/rmgrdesc/xactdesc.c| 5 +- src/backend/catalog/storage.c | 8 ++ src/backend/catalog/system_functions.sql | 2 +- src/backend/catalog/system_views.sql | 5 +- src/backend/postmaster/checkpointer.c | 5 + src/backend/storage/buffer/bufmgr.c | 6 +- src/backend/storage/smgr/md.c | 7 ++ src/backend/utils/activity/pgstat.c | 39 -- src/backend/utils/activity/pgstat_database.c | 12 +- src/backend/utils/activity/pgstat_function.c | 13 +- src/backend/utils/activity/pgstat_relation.c | 112 -- src/backend/utils/activity/pgstat_replslot.c | 13 +- src/backend/utils/activity/pgstat_shmem.c | 19 ++- .../utils/activity/pgstat_subscription.c | 12 +- src/backend/utils/activity/pgstat_xact.c | 60 +++--- src/backend/utils/adt/pgstatfuncs.c | 34 +- src/include/access/tableam.h | 19 +++ src/include/access/xact.h | 1 + src/include/catalog/pg_proc.dat | 14 ++- src/include/pgstat.h | 19 ++- src/include/utils/pgstat_internal.h | 34 -- src/test/recovery/t/029_stats_restart.pl | 40 +++ .../recovery/t/030_stats_cleanup_replica.pl | 6 +- src/test/regress/expected/rules.out | 12 +- src/test/regress/expected/stats.out | 30 ++--- src/test/regress/sql/stats.sql| 30 ++--- src/test/subscription/t/026_stats.pl | 4 +- src/tools/pgindent/typedefs.list | 1 + 28 files changed, 415 insertions(+), 147 deletions(-) 4.6% src/backend/catalog/ 47.8% src/backend/utils/activity/ 6.5% src/backend/utils/adt/ 3.7% src/backend/ 3.3% src/include/access/ 3.3% src/include/catalog/ 6.2% src/include/utils/ 3.3% src/include/ 12.1% src/test/recovery/t/ 5.5% src/test/regress/expected/ 3.0% src/test/ diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c index dccca201e0..c02b079645 100644 --- a/src/backend/access/rmgrdesc/xactdesc.c +++ b/src/backend/access/rmgrdesc/xactdesc.c @@ -319,10 +319,11 @@ xact_desc_stats(StringInfo buf, const char *label, appendStringInfo(buf, "; %sdropped stats:", label); for (i = 0; i < ndropped; i++) { - appendStringInfo(buf, " %d/%u/%u", + appendStringInfo(buf, " %d/%u/%u/%u", dropped_stats[i].kind, dropped_stats[i].dboid, - dropped_stats[i].objoid); + dropped_stats[i].objoid, + dropped_stats[i].relfile); } } } diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index f56b3cc0f2..db6107cd90 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -33,6 +33,7 @@ #include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" +#include "utils/pgstat_internal.h" #include "utils/rel.h" /* GUC variables */ @@ -152,6 +153,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence, i
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Tue, 4 Jun 2024 08:30:19 +0900, Michael Paquier wrote in > On Mon, Jan 15, 2024 at 01:34:46PM -0500, Robert Haas wrote: > > This kind of change looks massively helpful to me. I don't know if it > > is exactly right or not, but it would have been a big help to me when > > writing my previous review, so +1 for some change of this general > > type. > > During a live review of this patch last week, as part of the Advanced > Patch Workshop of pgconf.dev, it has been mentioned by Tom that we may > be able to simplify the check on pmstart if the detection of the > postmaster PID started by pg_ctl is more stable using the WIN32 > internals that this patch relies on. I am not sure that this > suggestion is right, though, because we should still care about the > clock skew case as written in the surrounding comments? Even if > that's OK, I would assume that this should be an independent patch, > written on top of the proposed v6-0001. > > Tom, could you comment about that? Perhaps my notes did not catch > what you meant. Thank you for the follow-up. I have been thinking about this since then. At first, I thought it referred to FindFirstChangeNotification() and friends, and inotify on Linux. However, I haven't found a way to simplify the specified code area using those APIs. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in > I have been thinking about this since then. At first, I thought it > referred to FindFirstChangeNotification() and friends, and inotify on > Linux. However, I haven't found a way to simplify the specified code > area using those APIs. By the way, the need to shift by 2 seconds to tolerate clock skew suggests that the current launcher-postmaster association mechanism is somewhat unreliable. Couldn't we add a command line option to postmaster to explicitly pass a unique identifier (say, pid itself) of the launcher? If it is not specified, the number should be the PID of the immediate parent process. This change avoids the need for the special treatment for Windows. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 06 Jun 2024 17:15:15 +0900 (JST), Kyotaro Horiguchi wrote in
> At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in
> > I have been thinking about this since then. At first, I thought it
> > referred to FindFirstChangeNotification() and friends, and inotify on
> > Linux. However, I haven't found a way to simplify the specified code
> > area using those APIs.
>
> By the way, the need to shift by 2 seconds to tolerate clock skew
> suggests that the current launcher-postmaster association mechanism is
> somewhat unreliable. Couldn't we add a command line option to
> postmaster to explicitly pass a unique identifier (say, pid itself) of
> the launcher? If it is not specified, the number should be the PID of
> the immediate parent process.

No, it should be the combination of pg_ctl's PID and a timestamp, to avoid false matching across a reboot.

> This change avoids the need for the special treatment for Windows.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: question regarding policy for patches to out-of-support branches
On Wed, Jun 5, 2024 at 8:29 PM Tom Lane wrote:
>
> Joe Conway writes:
> > I was having a discussion regarding out-of-support branches and effort
> > to keep them building, but could not for the life of me find any actual
> > documented policy (although I distinctly remember that we do something...).
> > Is the policy written down somewhere, or is it only project lore? In
> > either case, what is the actual policy?
>
> I believe our policy was set in this thread:
>
> https://www.postgresql.org/message-id/flat/2923349.1634942313%40sss.pgh.pa.us
>
> and you're right that it hasn't really been memorialized anywhere
> else. I'm not sure where would be appropriate.

Not absolutely sure, but would at least adding a page about this to the PostgreSQL Wiki make sense?

---
Hannu
Proposal: Job Scheduler
Hackers,

We are the PostgreSQL team in Tencent. We have recently developed a job scheduler that runs inside the database to schedule and manage jobs, similar to the Oracle DBMS_JOB package, and we would like to contribute this feature to the community.

Similar to autovacuum, the job scheduler consists of two parts: the job launcher and the job worker. The job launcher periodically scans a metadata table and signals the postmaster to start new workers if needed.

As far as we know, there are currently two open-sourced job scheduling extensions for PostgreSQL: pg_cron (https://github.com/citusdata/pg_cron/) and pg_dbms_job (https://github.com/MigOpsRepos/pg_dbms_job/tree/main). However, the cron-based syntax is not easy to use and suffers from limitations such as one-off commands. The pg_dbms_job extension is difficult to manage and operate because it runs as a standalone process. That's why we have developed a job scheduler that runs as a process inside the database, just like autovacuum.

We can start to send the patch if this idea makes sense to you.

Thanks for your time.

Regards,
Cheng
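[For readers unfamiliar with the DBMS_JOB-style model described above, here is a minimal sketch of what a launcher-scanned metadata table and job registration could look like. All names below are hypothetical illustrations only; the actual patch has not been posted, so nothing here is taken from it.]

```sql
-- Hypothetical metadata table that a job launcher could scan periodically.
-- Table and column names are illustrative; the proposed patch may differ.
CREATE TABLE job_schedule (
    job_id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_command   text        NOT NULL,          -- SQL to run, e.g. a procedure call
    next_run_at   timestamptz NOT NULL,          -- when the launcher should start a worker
    repeat_every  interval,                      -- NULL means a one-off job
    enabled       boolean     NOT NULL DEFAULT true
);

-- Registering a nightly job would then be a plain INSERT:
INSERT INTO job_schedule (job_command, next_run_at, repeat_every)
VALUES ('CALL purge_old_rows()',
        date_trunc('day', now()) + interval '1 day',
        interval '1 day');
```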
Re: Proposal: Job Scheduler
On Thu, 2024-06-06 at 16:27 +0800, Wang Cheng wrote: > We are the PostgreSQL team in Tencent. We have recently developed a job > scheduler > that runs inside the database to schedules and manages jobs similar to Oracle > DBMS_JOB package, and we would like to contribute this feature to the > community. > > As far as we know, there are currently two open-sourced job scheduling > extensions > for PostgreSQL: pg_cron (https://github.com/citusdata/pg_cron/) and > pg_dbms_job > (https://github.com/MigOpsRepos/pg_dbms_job/tree/main). However, the > cron-based > syntax is not easy to use and suffers some limitations like one-off commands. > The pg_dbms_job extension is difficult to manage and operate because it runs > as > a standalone process . There is also pg_timetable: https://github.com/cybertec-postgresql/pg_timetable > That's why we have developed the job scheduler that runs as a process inside > the > database just like autovacuum. > > We can start to send the patch if this idea makes sense to the you. Perhaps your job scheduler is much better than all the existing ones. But what would be a compelling reason to keep it in the PostgreSQL source tree? With PostgreSQL's extensibility features, it should be possible to write your job scheduler as an extension and maintain it outside the PostgreSQL source. I am sure that the PostgreSQL community will be happy to use the extension if it is any good. Yours, Laurenz Albe
Re: Proposal: Job Scheduler
On Thu, 6 Jun 2024 at 09:47, Laurenz Albe wrote: > On Thu, 2024-06-06 at 16:27 +0800, Wang Cheng wrote: > > We are the PostgreSQL team in Tencent. We have recently developed a job > scheduler > > that runs inside the database to schedules and manages jobs similar to > Oracle > > DBMS_JOB package, and we would like to contribute this feature to the > community. > > > > As far as we know, there are currently two open-sourced job scheduling > extensions > > for PostgreSQL: pg_cron (https://github.com/citusdata/pg_cron/) and > pg_dbms_job > > (https://github.com/MigOpsRepos/pg_dbms_job/tree/main). However, the > cron-based > > syntax is not easy to use and suffers some limitations like one-off > commands. > > The pg_dbms_job extension is difficult to manage and operate because it > runs as > > a standalone process . > > There is also pg_timetable: > https://github.com/cybertec-postgresql/pg_timetable And probably the oldest of them all, pgAgent: https://www.pgadmin.org/docs/pgadmin4/8.7/pgagent.html > > > > That's why we have developed the job scheduler that runs as a process > inside the > > database just like autovacuum. > > > > We can start to send the patch if this idea makes sense to the you. > > Perhaps your job scheduler is much better than all the existing ones. > But what would be a compelling reason to keep it in the PostgreSQL source > tree? > With PostgreSQL's extensibility features, it should be possible to write > your > job scheduler as an extension and maintain it outside the PostgreSQL > source. > > I am sure that the PostgreSQL community will be happy to use the extension > if it is any good. > I agree. This is an area in which there are lots of options at the moment, with compelling reasons to choose from various of them depending on your needs. It's this kind of choice that means it's unlikely we'd include any one option in PostgreSQL, much like various other tools such as failover managers or poolers. -- Dave Page pgAdmin: https://www.pgadmin.org PostgreSQL: https://www.postgresql.org EDB: https://www.enterprisedb.com
Re: [multithreading] extension compatibility
On 06/06/2024 05:47, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 10:09 PM Andres Freund wrote:
>> Maybe. I think shipping a mode where users can fairly simply toggle
>> between threaded and process mode will allow us to get this stable a
>> *lot* quicker than if we distribute two builds. Most users don't build
>> from source, distros will have to pick the mode. If they don't choose
>> threaded mode, we'll not find problems. If they do choose threaded mode,
>> we can't ask users to switch to a process based mode to check if the
>> problem is related.
>
> I don't believe that being coercive here is the right approach. I think
> distros see the value in compiling with as many things turned on as
> possible; when they ship with something turned off, it's because it's
> unstable or introduces weird dependencies or has some other disadvantage.

I presume there's no harm in building with multithreading support. If you don't want to use it, put "multithreading=off" in your config file (which will presumably be the default for a while). If we're worried about the performance impact of thread-local variables in particular, we can try to measure that. I don't think it's material though.

If there is some material harm from compiling with multithreading support even if you're not using it, we should try to fix that.

I'm not dead set against having a compile-time option, but I don't see the need for it at the moment.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Proposal: Job Scheduler
Noted. Thanks for suggestions. We will open-source it as an extension.

Regards,
Cheng
Re: Things I don't like about \du's "Attributes" column
On 16.04.2024 09:15, Pavel Luzanov wrote:
> On 16.04.2024 01:06, David G. Johnston wrote:
>> At this point I'm on board with retaining the \dr charter of simply being an easy way to access the detail exposed in pg_roles with some display formatting but without any attempt to convey how the system uses said information. Without changing pg_roles. Our level of effort here, and degree of dependence on superuser, doesn't seem to be bothering people enough to push more radical changes here through and we have good improvements that are being held up in the hope of possible perfection.
>
> I have similar thoughts. I decided to wait for the end of feature freeze and propose a simpler version of the patch for v18, without changes in pg_roles.

Since there were no votes for the changes in pg_roles, please look at the simplified version. We can return to this topic later. But now there are no changes in pg_roles, just a special interpretation of the two values of the "Connection limit" column:
   0 - Not allowed (changed from 'No connections')
  -1 - empty string

The full list of changes is in the commit message.

Example output:

\du+ regress_du*
                                            List of roles
    Role name     | Login |  Attributes  |         Valid until          | Connection limit |   Description
------------------+-------+--------------+------------------------------+------------------+------------------
 regress_du_admin | yes   | Superuser   +|                              |                  | some description
                  |       | Create DB   +|                              |                  |
                  |       | Create role +|                              |                  |
                  |       | Inherit     +|                              |                  |
                  |       | Replication +|                              |                  |
                  |       | Bypass RLS   |                              |                  |
 regress_du_role0 | yes   | Inherit      | Tue Jun 04 00:00:00 2024 PDT | Not allowed      |
 regress_du_role1 | no    | Create role +| infinity                     |                  |
                  |       | Inherit      |                              |                  |
 regress_du_role2 | yes   | Inherit     +|                              | 42               |
                  |       | Replication +|                              |                  |
                  |       | Bypass RLS   |                              |                  |
(4 rows)

Data:
CREATE ROLE regress_du_role0 LOGIN PASSWORD '123' VALID UNTIL '2024-06-04' CONNECTION LIMIT 0;
CREATE ROLE regress_du_role1 CREATEROLE CONNECTION LIMIT -1 VALID UNTIL 'infinity';
CREATE ROLE regress_du_role2 LOGIN REPLICATION BYPASSRLS CONNECTION LIMIT 42;
CREATE ROLE regress_du_admin LOGIN SUPERUSER CREATEROLE CREATEDB BYPASSRLS REPLICATION INHERIT;
COMMENT ON ROLE regress_du_admin IS 'some description';

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com

From 8cd9a815de36a40d450029d327e013cf69827499 Mon Sep 17 00:00:00 2001
From: Pavel Luzanov
Date: Thu, 6 Jun 2024 11:30:23 +0300
Subject: [PATCH v7] psql: Rethinking of \du command

Changes in the \du command:
- "Login", "Connection limit" and "Valid until" attributes are placed in separate columns.
- The "Attributes" column includes only the enabled logical attributes.
- The attribute names correspond to the keywords of the CREATE ROLE command.
- The attributes are listed in the same order as in the documentation.
- A special interpretation of the two values of the "Connection limit" column:
     0 - Not allowed
    -1 - empty string
- General refactoring of describeRoles function in describe.c.
--- src/bin/psql/describe.c| 149 - src/test/regress/expected/psql.out | 40 +--- src/test/regress/sql/psql.sql | 12 ++- 3 files changed, 75 insertions(+), 126 deletions(-) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index f67bf0b892..8967102261 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -36,7 +36,6 @@ static bool describeOneTableDetails(const char *schemaname, bool verbose); static void add_tablespace_footer(printTableContent *const cont, char relkind, Oid tablespace, const bool newline); -static void add_role_attribute(PQExpBuffer buf, const char *const str); static bool listTSParsersVerbose(const char *pattern); static bool describeOneTSParser(const char *oid, const char *nspname, const char *prsname); @@ -3615,34 +3614,50 @@ describeRoles(const char *pattern, bool verbose, bool showSystem) { PQExpBufferData buf; PGresult *res; - printTableContent cont; - printTableOpt myopt = pset.popt.topt; - int ncols = 2; - int nrows = 0; - int i; - int conns; - const char align = 'l'; - char **attr; - - myopt.default_footer = false; + printQueryOpt myopt = pset.popt; initPQExpBuffer(&buf); - printfPQEx
Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was removed from the vacuum. However, do_analyze_rel() was still using these counters, necessitating the tracking of global counters alongside BufferUsage counters. The attached patch addresses the issue by eliminating the need to track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making the global counters obsolete. This simplifies the code and improves consistency. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v1-0001-Remove-duplicate-tracking-of-the-page-stats-durin.patch Description: Binary data
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, 6 Jun 2024 at 03:03, Robert Haas wrote:
> This makes some sense to me. I don't think that I believe
> max_protocol_version=latest is future-proof: just because I want to
> opt into this round of new features doesn't mean I feel the same way
> about the next round. But it may still be a good idea.

I think for most people the only reason not to opt in to improvements (even if they are small) is if those improvements break something else. Once the NegotiateProtocolVersion message is implemented everywhere in the ecosystem, nothing should break when going from e.g. 3.2 to 3.4. So for the majority of people I think max_protocol_version=latest is what they'll want to use once the ecosystem has caught up. Of course there will be people that want tight control, but they can set max_protocol_version=3.2 instead.

> I suppose the semantics are that we try to connect with the version
> specified by max_protocol_version and, if we get downgraded by the
> server, we abort if we end up below min_protocol_version.

Correct

> I like those
> semantics, and I think I also like having both parameters, but I'm not
> 100% sure of the naming. It's a funny use of "max" and "min", because
> the max is really what we're trying to do and the min is what we end
> up with, and those terms don't necessarily bring those ideas to mind.
> I don't have a better idea off-hand, though.

I borrowed this terminology from the ssl_min_protocol_version and ssl_max_protocol_version connection options that we already have. Those basically have the same semantics as what I'm proposing here, but for the TLS protocol version instead of the Postgres protocol version. I'm also not a huge fan of the min_protocol_version and max_protocol_version names, but staying consistent with existing options seems quite nice.

Looking at ssl_max_protocol_version more closely though, to stay fully consistent I'd have to rename "latest" to the empty string (i.e. there is no max_protocol_version). I think I might prefer staying consistent over introducing an imho slightly clearer name. Another way to stay consistent would of course be to also add "latest" as an option to ssl_max_protocol_version. What do you think?

I'll look into adding min_protocol_version to the patchset soonish. Some review of the existing code in the first few patches would definitely be appreciated.
Incorrect matching of sql/json PASSING variable names
Hi,

Alvaro reported off-list that the following should really fail, because the jsonpath expression refers to a PASSING variable that doesn't exist:

select json_query('"1"', jsonpath '$xy' passing 2 AS xyz);
 json_query
------------
 2
(1 row)

This works because of a bug in GetJsonPathVar() whereby it allows a jsonpath expression to reference any prefix of the PASSING variable names.

Attached is a patch to fix that. Thanks Alvaro for the report.

--
Thanks, Amit Langote

v1-0001-in-transformJsonBehavior-better-handle-default-ex.patch
Description: Binary data
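[To make the intended behavior concrete, here is a hedged sketch of the corrected matching, based only on the description above; the exact error text is an assumption, not quoted from the patch.]

```sql
-- Matching the PASSING name exactly should keep working:
select json_query('"1"', jsonpath '$xyz' passing 2 AS xyz);
-- returns 2

-- With the fix, referencing a mere prefix of the PASSING name ($xy vs. xyz)
-- should presumably raise an error instead of silently matching; the exact
-- message shown here is an assumption.
select json_query('"1"', jsonpath '$xy' passing 2 AS xyz);
-- ERROR:  could not find jsonpath variable "xy"
```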
Re: Proposal: Job Scheduler
On 6/6/2024 16:04, Wang Cheng wrote:
> Noted. Thanks for suggestions. We will open-source it as an extension.

It would be nice! For me it doesn't matter whether it is contributed to PostgreSQL core or as an extension, as long as it is published under the BSD license.

--
regards, Andrei Lepikhov
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada wrote: > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila wrote: > > > > > To achieve this, we can allow sequences to be copied during > > the initial CREATE SUBSCRIPTION command similar to what we do for > > tables. And then later by new/existing command, we re-copy the already > > existing sequences on the subscriber. > > > > The options for the new command could be: > > Alter Subscription ... Refresh Sequences > > Alter Subscription ... Replicate Sequences > > > > In the second option, we need to introduce a new keyword Replicate. > > Can you think of any better option? > > Another idea is doing that using options. For example, > > For initial sequences synchronization: > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true); > How will it interact with the existing copy_data option? So copy_data will become equivalent to copy_table_data, right? > For re-copy (or update) sequences: > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true); > Similar to the previous point it can be slightly confusing w.r.t copy_data. And would copy_sequence here mean that it would copy sequence values of both pre-existing and newly added sequences, if so, that would make it behave differently than copy_data? The other possibility in this direction would be to introduce an option like replicate_all_sequences/copy_all_sequences which indicates a copy of both pre-existing and new sequences, if any. If we want to go in the direction of having an option such as copy_(all)_sequences then do you think specifying that copy_data is just for tables in the docs would be sufficient? I am afraid that it would be confusing for users. > > > > In addition to the above, the command Alter Subscription .. Refresh > > Publication will fetch any missing sequences similar to what it does > > for tables. > > On the subscriber side, do we need to track which sequences are > created via CREATE/ALTER SUBSCRIPTION? > I think so unless we find some other way to know at refresh publication time which all new sequences need to be part of the subscription. What should be the behavior w.r.t sequences when the user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? I was thinking similar to tables, it should fetch any missing sequence information from the publisher. -- With Regards, Amit Kapila.
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
Hi, I sent a similar patch for this in https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com Regards, Anthonin On Thu, Jun 6, 2024 at 11:10 AM Dilip Kumar wrote: > As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the > dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was > removed from the vacuum. However, do_analyze_rel() was still using > these counters, necessitating the tracking of global counters > alongside BufferUsage counters. > > The attached patch addresses the issue by eliminating the need to > track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making > the global counters obsolete. This simplifies the code and improves > consistency. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello hackers, I tried to investigate a recent buildfarm test failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-06-04%2003%3A27%3A47 29/295 postgresql:recovery / recovery/026_overwrite_contrecord ERROR 39.55s exit status 32 log/026_overwrite_contrecord_standby.log TRAP: failed Assert("ItemIdIsNormal(lpp)"), File: "../pgsql/src/backend/access/heap/heapam.c", Line: 1002, PID: 3740958 postgres: standby: bf postgres [local] startup(ExceptionalCondition+0x81)[0x56c60bf9] postgres: standby: bf postgres [local] startup(+0xf776e)[0x5667276e] postgres: standby: bf postgres [local] startup(heap_getnextslot+0x40)[0x56672ee1] postgres: standby: bf postgres [local] startup(+0x11c218)[0x56697218] postgres: standby: bf postgres [local] startup(systable_getnext+0xfa)[0x56697c1a] postgres: standby: bf postgres [local] startup(+0x6d29c7)[0x56c4d9c7] postgres: standby: bf postgres [local] startup(+0x6d372c)[0x56c4e72c] postgres: standby: bf postgres [local] startup(+0x6d8288)[0x56c53288] postgres: standby: bf postgres [local] startup(RelationCacheInitializePhase3+0x149)[0x56c52d71] (It's not the only failure of that ilk in the buildfarm.) and managed to reproduce the failure by running many 026_overwrite_contrecord tests in parallel (with fsync=on). Analyzing the core dump added some info: ... #3 0x00bb43cc in ExceptionalCondition (conditionName=0xc45c77 "ItemIdIsNormal(lpp)", fileName=0xc45aa8 "heapam.c", lineNumber=1002) at assert.c:66 #4 0x004f7f13 in heapgettup_pagemode (scan=0x19f5660, dir=ForwardScanDirection, nkeys=2, key=0x19f61d0) at heapam.c:1002 #5 0x004f86d1 in heap_getnextslot (sscan=0x19f5660, direction=ForwardScanDirection, slot=0x19f5da0) at heapam.c:1307 #6 0x0051d028 in table_scan_getnextslot (sscan=0x19f5660, direction=ForwardScanDirection, slot=0x19f5da0) at ../../../../src/include/access/tableam.h:1081 #7 0x0051da80 in systable_getnext (sysscan=0x19f5470) at genam.c:530 #8 0x00ba0937 in RelationBuildTupleDesc (relation=0x7fa004feea88) at relcache.c:572 #9 0x00ba17b9 in RelationBuildDesc (targetRelId=2679, insertIt=true) at relcache.c:1184 #10 0x00ba6520 in load_critical_index (indexoid=2679, heapoid=2610) at relcache.c:4353 #11 0x00ba607d in RelationCacheInitializePhase3 () at relcache.c:4132 #12 0x00bcb704 in InitPostgres (in_dbname=0x196ca30 "postgres", dboid=5, username=0x19a91b8 "law", useroid=0, flags=1, out_dbname=0x0) at postinit.c:1193 ... (gdb) frame 4 (gdb) p lpp->lp_flags $2 = 1 (gdb) p ItemIdIsNormal(lpp) $12 = 1 So it looks like the Assert had failed when lpp->lp_flags had some other contents... I added the following debugging code: --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -995,10 +995,14 @@ continue_page: for (; linesleft > 0; linesleft--, lineindex += dir) { ItemId lpp; + ItemIdData iid; OffsetNumber lineoff; lineoff = scan->rs_vistuples[lineindex]; lpp = PageGetItemId(page, lineoff); + iid = *((ItemIdData *)lpp); + + Assert(ItemIdIsNormal(&iid)); Assert(ItemIdIsNormal(lpp)); and got: ... #2 0x55b68dc6998c in ExceptionalCondition (conditionName=0x55b68dcfe5f7 "ItemIdIsNormal(&iid)", fileName=0x55b68dcfe428 "heapam.c", lineNumber=1010) at assert.c:66 #3 0x55b68d588a78 in heapgettup_pagemode (scan=0x55b68f0905e0, dir=ForwardScanDirection, nkeys=2, key=0x55b68f091150) at heapam.c:1010 #4 0x55b68d58930e in heap_getnextslot (sscan=0x55b68f0905e0, direction=ForwardScanDirection, slot=0x55b68f090d20) at heapam.c:1322 ... (gdb) frame 3 #3 0x55b68d588a78 in heapgettup_pagemode (...) 
at heapam.c:1010 1010 Assert(ItemIdIsNormal(&iid)); (gdb) info locals lpp = 0x7f615c34b0ec iid = {lp_off = 0, lp_flags = 0, lp_len = 0} lineoff = 54 tuple = 0x55b68f090638 page = 0x7f615c34b000 "" (gdb) p *lpp $1 = {lp_off = 3160, lp_flags = 1, lp_len = 136} It seemingly confirms that the underlying memory was changed while being processed in heapgettup_pagemode(). I've tried to add checks for the page buffer content as below: --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -953,11 +953,15 @@ heapgettup_pagemode(HeapScanDesc scan, Page page; int lineindex; int linesleft; +char page_copy[BLCKSZ]; if (likely(scan->rs_inited)) { /* continue from previously returned page/tuple */ page = BufferGetPage(scan->rs_cbuf); +memcpy(page_copy, page, BLCKSZ); +for (int i = 0; i < 100; i++) +Assert(memcmp(page_copy, page, BLCKSZ) == 0); lineindex = scan->rs_cindex + dir; if (ScanDirectionIsForward(dir)) @@ -986,6
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > method? It seems after the initial update on the local node, it will > > never allow remote update to succeed which sounds a bit odd. Jan has > > shared this and similar concerns about this resolution method, so I > > have added him to the email as well. > > > I can not think of a use case exactly in this context but it's very > common to have such a use case while designing a distributed > application with multiple clients. For example, when we are doing git > push concurrently from multiple clients it is expected that the > earliest commit wins. > Okay, I think it mostly boils down to something like what Shveta mentioned where Inserts for a primary key can use "earliest_timestamp_wins" resolution method [1]. So, it seems useful to support this method as well. [1] - https://www.postgresql.org/message-id/CAJpy0uC4riK8e6hQt8jcU%2BnXYmRRjnbFEapYNbmxVYjENxTw2g%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 9:22 AM Amit Kapila wrote: > On Wed, Jun 5, 2024 at 6:01 PM Ashutosh Bapat > wrote: > > > > On Wed, Jun 5, 2024 at 8:45 AM Amit Kapila > wrote: > >> > >> How about periodically sending this information? > >> > > >> > >> Now, if we want to support some sort of failover then probably this > >> will help. Do you have that use case in mind? > > > > > > Regular failover was a goal for supporting logical replication of > sequences. That might be more common than major upgrade scenario. > > > > We can't support regular failovers to subscribers unless we can > replicate/copy slots because the existing nodes connected to the > current publisher/primary would expect that. It should be primarily > useful for major version upgrades at this stage. > We don't want to design it in a way that requires major rework when we are able to copy slots and then support regular failovers. That's when the consistency between a sequence and the table using it would be a must. So it's better that we take that into consideration now. -- Best Wishes, Ashutosh Bapat
Re: Conflict Detection and Resolution
On Thu, Jun 6, 2024 at 3:43 PM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > > method? It seems after the initial update on the local node, it will > > > never allow remote update to succeed which sounds a bit odd. Jan has > > > shared this and similar concerns about this resolution method, so I > > > have added him to the email as well. > > > > > I can not think of a use case exactly in this context but it's very > > common to have such a use case while designing a distributed > > application with multiple clients. For example, when we are doing git > > push concurrently from multiple clients it is expected that the > > earliest commit wins. > > > > Okay, I think it mostly boils down to something like what Shveta > mentioned where Inserts for a primary key can use > "earliest_timestamp_wins" resolution method [1]. So, it seems useful > to support this method as well. Correct, but we still need to think about how to make it work correctly in the presence of a clock skew as I mentioned in one of my previous emails. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
On Thu, Jun 6, 2024 at 3:23 PM Anthonin Bonnefoy wrote: > > Hi, > > I sent a similar patch for this in > https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com Okay, I see, In that case, we can just discard mine, thanks for notifying me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Injection points: preloading and runtime arguments
> On 5 Jun 2024, at 03:52, Michael Paquier wrote:
>
> Another thing you could do is to define a INJECTION_POINT_LOAD() in the code path you're stressing outside the critical section where the point is run. This should save from a call to the SQL function. This choice is up to the one implementing the test, both can be useful depending on what one is trying to achieve.

Thanks! Interestingly, previously having INJECTION_POINT_PRELOAD() was not enough. But now either INJECTION_POINT_LOAD() or injection_points_load() does the trick, so for me any of them is enough.

My test works with the current version, but I have one slight problem: I need to call

$node->safe_psql('postgres', q(select injection_points_detach('GetMultiXactIdMembers-CV-sleep')));

before

$node->safe_psql('postgres', q(select injection_points_wakeup('GetMultiXactIdMembers-CV-sleep')));

Is it OK to detach() before wakeup()? Or, perhaps, can a detach() do a wakeup() automatically?

Best regards, Andrey Borodin.
Re: Proposal: Job Scheduler
On 2024-Jun-06, Dave Page wrote: > It's this kind of choice that means it's unlikely we'd include any one > option in PostgreSQL, much like various other tools such as failover > managers or poolers. TBH I see that more as a bug than as a feature, and I see the fact that there are so many schedulers as a process failure. If we could have _one_ scheduler in core that encompassed all the important features of all the independent ones we have, with hooks or whatever to allow the user to add any fringe features they need, that would probably lead to less duplicative code and divergent UIs, and would be better for users overall. That's, of course, just my personal opinion. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Compress ReorderBuffer spill files using LZ4
Hi,

When the content of a large transaction (size exceeding logical_decoding_work_mem) and its sub-transactions has to be reordered during logical decoding, all the changes are written to disk in temporary files located in pg_replslot/. Decoding very large transactions by multiple replication slots can lead to disk space saturation and high I/O utilization.

When compiled with LZ4 support (--with-lz4), this patch enables data compression/decompression of these temporary files. Each transaction change that must be written on disk (ReorderBufferDiskChange) is now compressed and encapsulated in a new structure.

Three different compression strategies are implemented:
1. LZ4 streaming compression is the preferred one and works efficiently for small individual changes.
2. LZ4 regular compression when the changes are too large for using the streaming API.
3. No compression when compression fails; the change is then stored uncompressed.

When not using compression, the following case generates 1590MB of spill files:

CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT);
INSERT INTO t SELECT i, 'Hello number n°'||i::TEXT FROM generate_series(1, 1000) as i;

With LZ4 compression, it creates 653MB of spill files: 58.9% less disk space usage.

Open items:
1. The spill_bytes column from pg_stat_get_replication_slot() still returns the plain data size, not the compressed data size. Should we expose the compressed data size when compression occurs?
2. Do we want a GUC to switch compression on/off?

Regards,

JT

v1-0001-Compress-ReorderBuffer-spill-files-using-LZ4.patch
Description: Binary data
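[For anyone wanting to reproduce the measurement, the cumulative spill activity of a slot can be watched through the statistics view that exposes the spill_bytes counter mentioned above; a minimal sketch, assuming a logical slot named 'test_slot' already exists.]

```sql
-- Observe how much reordered change data has been spilled to disk per slot.
-- Per open item 1 above, spill_bytes currently reports the plain (uncompressed) size.
SELECT slot_name, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication_slots
WHERE slot_name = 'test_slot';
```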
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 3:44 PM Ashutosh Bapat wrote: > > On Thu, Jun 6, 2024 at 9:22 AM Amit Kapila wrote: >> >> On Wed, Jun 5, 2024 at 6:01 PM Ashutosh Bapat >> wrote: >> > >> > On Wed, Jun 5, 2024 at 8:45 AM Amit Kapila wrote: >> >> >> >> How about periodically sending this information? >> >> > >> >> >> >> Now, if we want to support some sort of failover then probably this >> >> will help. Do you have that use case in mind? >> > >> > >> > Regular failover was a goal for supporting logical replication of >> > sequences. That might be more common than major upgrade scenario. >> > >> >> We can't support regular failovers to subscribers unless we can >> replicate/copy slots because the existing nodes connected to the >> current publisher/primary would expect that. It should be primarily >> useful for major version upgrades at this stage. > > > We don't want to design it in a way that requires major rework when we are > able to copy slots and then support regular failover. > I don't think we can just copy slots like we do for standbys. The slots would require WAL locations to continue, so not sure if we can make it work for failover for subscribers. > That's when the consistency between a sequence and the table using it would be a must. So it's better that we take that into consideration now. > With the ideas being discussed here, I could only see the use case of a major version upgrade or planned switchover to work. If we come up with any other agreeable way that is better than this then we can consider the same. -- With Regards, Amit Kapila.
Re: Function and Procedure with same signature?
Hi Peter and Tom Following up on our conversation art pgcon.dev If I understood correctly Peter has some old patch for splitting the namespaces which could be resurrected to try to move forward on this ? Can you share what you did there ? Also, while at it we should extend the function lookup to support full "method call syntax" in general, up from one-argument case so that SELECT function(a_thing, arg2, arg 2, ...) could also be called as SELECT a_thing.function(arg2, arg 2, ...) -- Hannu On Mon, Mar 11, 2024 at 12:55 PM Hannu Krosing wrote: > > On Thu, Mar 7, 2024 at 5:46 PM Tom Lane wrote: > > > > Hannu Krosing writes: > > > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote: > > >> Worth noting perhaps that this is actually required by the SQL > > >> standard: per spec, functions and procedures are both "routines" > > >> and share the same namespace, > > > > > Can you point me to a place in the standard where it requires all > > > kinds of ROUTINES to be using the same namespace ? > > > > [ digs around a bit... ] Well, the spec is vaguer about this than > > I thought. It is clear on one thing: 11.60 > > conformance rules include > ... > > Thanks for thorough analysis of the standard. > > I went and looked at more what other relevant database do in this > space based on their documentation > > Tl;DR > > * MS SQL Server >- no overloading allowed anywhere > * MySQL >- no overloading > * Oracle >- no overloading at top level >- overloading and independent namespaces for functions and procedures > * Teradata >- function overloading allowed >- not clear from documentation if this also applies procedures >- function overloading docs does not mention possible clashes with > procedure names anywhere > * DB2 >- function overloading fully supported >- procedure overloading supported, but only for distinct NUMBER OF > ARGUMENTS > > I'll try to get access to a Teradata instance to verify the above > > So at high level most other Serious Databases > - do support function overloading > - keep functions and procedures in separated namespaces > > I still think that PostgreSQL having functions and procedures share > the same namespace is an unneeded and unjustified restriction > > > I plan to do some hands-on testing on Teradata and DB2 to understand it > > But my current thinking is that we should not be more restrictive than > others unless there is a strong technical reason for it. And currently > I do not see any. > > > It could be argued that this doesn't prohibit having both a function > > and a procedure with the same data type list, only that you can't > > write ROUTINE when trying to drop or alter one. But I think that's > > just motivated reasoning. The paragraphs for being > > FUNCTION or PROCEDURE are exactly like the above except they say > > "exactly one function" or "exactly one procedure". If you argue that > > this text means we must allow functions and procedures with the same > > parameter lists, then you are also arguing that we must allow multiple > > functions with the same parameter lists, and it's just the user's > > tough luck if they need to drop one of them. > > The main issue is not dropping them, but inability to determine which > one to call. 
> > We already have this in case of two overloaded functions with same > initial argument types and the rest having defaults - when > > --- > hannuk=# create function get(i int, out o int) begin atomic select i; end; > CREATE FUNCTION > hannuk=# create function get(i int, j int default 0, out o int) begin > atomic select i+j; end; > CREATE FUNCTION > hannuk=# select get(1); > ERROR: function get(integer) is not unique > LINE 1: select get(1); >^ > HINT: Could not choose a best candidate function. You might need to > add explicit type casts. > --- > > > A related point is that our tolerance for overloading routine > > names isn't unlimited: we don't allow duplicate names with the > > same list of input parameters, even if the output parameters are > > different. > > This again has a good reason, as there would be many cases where you > could not decide which one to call > > Not allowing overloading based on only return types is common across > all OO languages. > > My point is that this does not apply to FUNCTION vs. PROCEDURE as it > is very clear from the CALL syntax which one is meant. > > > This is not something that the SQL spec cares to > > address, but there are good ambiguity-avoidance reasons for it. > > I think limiting overloading so that a ROUTINE specification is > > unambiguous is also a good thing. > > I think ROUTINE being unambiguous is not e very good goal. > > What if next version of standard introduces DROP DATABASE OBJECT ? > > > I remain unexcited about changing our definition of this. > > "Oracle allows it" is not something that has any weight in > > my view: they have made a bunch of bad design decisions > > as well as good ones, and I think this is a
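[For readers following along, the shared-namespace restriction under discussion is easy to demonstrate; a minimal sketch of the current behavior, with illustrative object names and approximate error wording.]

```sql
CREATE FUNCTION get_thing(id int) RETURNS text
    LANGUAGE sql AS $$ SELECT 'thing ' || id $$;

-- Today this fails, because functions and procedures share pg_proc's
-- name/argument-type namespace; the exact error wording may vary by version.
CREATE PROCEDURE get_thing(id int)
    LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'id = %', id; END $$;
-- ERROR:  function "get_thing" already exists with same argument types
```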
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > When the content of a large transaction (size exceeding > logical_decoding_work_mem) and its sub-transactions has to be > reordered during logical decoding, then, all the changes are written > on disk in temporary files located in pg_replslot/. > Decoding very large transactions by multiple replication slots can > lead to disk space saturation and high I/O utilization. > Why can't one use 'streaming' option to send changes to the client once it reaches the configured limit of 'logical_decoding_work_mem'? > > 2. Do we want a GUC to switch compression on/off? > It depends on the overhead of decoding. Did you try to measure the decoding overhead of decompression when reading compressed files? -- With Regards, Amit Kapila.
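[For context on the first question, the existing way to avoid spilling large in-progress transactions to disk is the subscription-side streaming option; a minimal sketch, where the publication name and connection string are placeholders.]

```sql
-- Stream in-progress transactions to the subscriber once they exceed
-- logical_decoding_work_mem, instead of spilling them under pg_replslot/.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = on);

-- An existing subscription can be switched over as well:
ALTER SUBSCRIPTION mysub SET (streaming = on);
```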
Re: ssl tests fail due to TCP port conflict
On 2024-06-05 We 16:00, Alexander Lakhin wrote: Hello Andrew, 05.06.2024 21:10, Andrew Dunstan wrote: I think I see what's going on here. It looks like it's because we start the server in unix socket mode, and then switch to using TCP as well. Can you try your test with this patch applied and see if the problem persists? If we start in TCP mode the framework should test for a port clash. It seems that the failure rate decreased (I guess the patch rules out the case with two servers choosing the same port), but I still got: 16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 subtests passed 17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit status 255 or signal 127 SIGinvalid) 2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit 2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 address "127.0.0.1": Address already in use 2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster already running on port 50072? If not, wait a few seconds and retry. `grep '\b50072\b' -r testrun/` yields: testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 19:40:37.392 UTC [414111] [unknown] LOG: connection received: host=localhost port=50072 (a psql case) That is, psql from the test instance 001_ssltests_34 opened a connection to the test server with the client port 50072 and it made using the port by the server from the test instance 001_ssltests_30 impossible. After sleeping on it, I still think the patch would be a good thing. Your torture test might still show some failures, but the buildfarm isn't running those, and it might be enough to eliminate or at least substantially reduce buildfarm failures by reducing to almost zero the time in which a competing script might grab the port. The biggest problem with the current script is apparently that we delay using the TCP port by starting the server in Unix socket mode, and only switch to using TCP when we restart. If changing that doesn't fix the problem we'll have to rethink. If this isn't the cause, though, I would expect to have seen similar failures from other test suites. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: ResourceOwner refactoring
On 05/06/2024 16:58, Heikki Linnakangas wrote: On 04/06/2024 01:49, Heikki Linnakangas wrote: A straightforward fix is to modify RelationFlushRelation() so that if !IsTransactionState(), it just marks the entry as invalid instead of calling RelationClearRelation(). That's what RelationClearRelation() would do anyway, if it didn't hit the assertion first. Here's a patch with that straightforward fix. Your test case hit the "rd_createSubid != InvalidSubTransactionId" case, I expanded it to also cover the "rd_firstRelfilelocatorSubid != InvalidSubTransactionId" case. For the record, I got the above backwards: your test case covered the rd_firstRelfilelocatorSubid case and I expanded it to also cover the rd_createSubid case. Barring objections, I'll commit this later today or tomorrow. Thanks for the report! Committed. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote:
>
> On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote:
> >
> > Can you share the use case of "earliest_timestamp_wins" resolution
> > method? It seems after the initial update on the local node, it will
> > never allow remote update to succeed which sounds a bit odd. Jan has
> > shared this and similar concerns about this resolution method, so I
> > have added him to the email as well.
> >
> I can not think of a use case exactly in this context but it's very
> common to have such a use case while designing a distributed
> application with multiple clients. For example, when we are doing git
> push concurrently from multiple clients it is expected that the
> earliest commit wins.
>

Here are more use cases of the "earliest_timestamp_wins" resolution method:

1) Applications where the record of the first occurrence of an event is important. For example, in sensor-based applications like earthquake detection systems, capturing the time of the first seismic wave is crucial.

2) Scheduling systems, like appointment booking, that prioritize the earliest request when handling concurrent ones.

3) Contexts where maintaining chronological order is important:
 a) Social media platforms display comments ensuring that the earliest ones are visible first.
 b) Financial transaction processing systems rely on timestamps to prioritize processing, ensuring that the earliest transaction is handled first.

--
Thanks,
Nisha
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 4:43 PM Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? I think it depends on the trade-off between the I/O savings from reducing the data size and the performance cost of compressing and decompressing the data. This balance is highly dependent on the hardware. For example, if you have a very slow disk and a powerful processor, compression could be advantageous. Conversely, if the disk is very fast, the I/O savings might be minimal, and the compression overhead could outweigh the benefits. Additionally, the effectiveness of compression also depends on the compression ratio, which varies with the type of data being compressed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Postgresql OOM
Hello all, I have a query that forces an out of memory error, where the OS will kill the postgresql process. The query plan (run immediately after a vacuum analyze) is at https://explain.depesz.com/s/ITQI#html . PostgreSQL version 16.3, running on RHEL 8.9, 16 vCPU, 64 GB RAM, 32 GB swap shared_buffers=8G effective_cache_size=24G maintenance_work_mem=2G work_mem=104857kB default_statistics_target = 100 max_worker_processes = 16 max_parallel_workers_per_gather = 4 max_parallel_workers = 16 max_parallel_maintenance_workers = 4 jit=off It looks like the excessive memory allocation is reported in HashSpillContext. I've attached the dump of the memory context for the 5 processes (query + 4 parallel workers) some time after query start. I also see a huge number of temporary files being created. For the time being I've set enable_parallel_hash = 'off' and the problem went away. I've seen a potentially similar problem reported in https://www.postgresql.org/message-id/flat/20230516200052.sbg6z4ghcmsas3wv%40liskov#f6059259c7c9251fb8c17f5793a2d427 . Any idea on how to identify the problem? I can reproduce it on demand. Should I report it pgsql-bugs? Best regards, Radu 2024-06-05 15:58:53.806 +08 [338795] LOG: logging memory contexts of PID 338795 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 0; TopMemoryContext: 2269696 total in 8 blocks; 33432 free (11 chunks); 2236264 used 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 1; HandleParallelMessages: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 1; TableSpace cache: 8192 total in 1 blocks; 2128 free (0 chunks); 6064 used 2024-06-05 15:58:53.807 +08 [338795] LOG: level: 1; TopTransactionContext: 8192 total in 1 blocks; 5736 free (0 chunks); 2456 used 2024-06-05 15:58:53.808 +08 [338795] LOG: level: 1; Prepared Queries: 16384 total in 2 blocks; 6696 free (3 chunks); 9688 used 2024-06-05 15:58:53.808 +08 [338795] LOG: level: 1; Type information cache: 24368 total in 2 blocks; 2648 free (0 chunks); 21720 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; Operator lookup cache: 24576 total in 2 blocks; 10792 free (3 chunks); 13784 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; MessageContext: 8192 total in 1 blocks; 2256 free (0 chunks); 5936 used 2024-06-05 15:58:53.809 +08 [338795] LOG: level: 1; Operator class cache: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used 2024-06-05 15:58:53.810 +08 [338795] LOG: level: 1; smgr relation table: 32768 total in 3 blocks; 16880 free (8 chunks); 15888 used 2024-06-05 15:58:53.810 +08 [338795] LOG: level: 1; PgStat Shared Ref Hash: 7216 total in 2 blocks; 688 free (0 chunks); 6528 used 2024-06-05 15:58:53.810 +08 [338795] LOG: level: 1; PgStat Shared Ref: 8192 total in 4 blocks; 3704 free (4 chunks); 4488 used 2024-06-05 15:58:53.811 +08 [338795] LOG: level: 1; PgStat Pending: 16384 total in 5 blocks; 3288 free (10 chunks); 13096 used 2024-06-05 15:58:53.811 +08 [338795] LOG: level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used 2024-06-05 15:58:53.811 +08 [338795] LOG: level: 1; Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used 2024-06-05 15:58:53.812 +08 [338795] LOG: level: 1; TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used 2024-06-05 15:58:53.812 +08 [338795] LOG: level: 2; PortalHoldContext: 24624 total in 2 blocks; 7400 
free (0 chunks); 17224 used 2024-06-05 15:58:53.823 +08 [338795] LOG: level: 2; PortalContext: 527776 total in 83 blocks; 3760 free (6 chunks); 524016 used: 2024-06-05 15:58:53.824 +08 [338795] LOG: level: 3; ExecutorState: 58826880 total in 19 blocks; 3932992 free (11 chunks); 54893888 used 2024-06-05 15:58:53.824 +08 [338795] LOG: level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used 2024-06-05 15:58:53.825 +08 [338795] LOG: level: 5; HashSpillContext: 8192 total in 1 blocks; 4008 free (9 chunks); 4184 used 2024-06-05 15:58:53.825 +08 [338795] LOG: level: 5; HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used 2024-06-05 15:58:53.825 +08 [338795] LOG: level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used 2024-06-05 15:58:53.840 +08 [338795] LOG: level: 5; HashSpillContext: 339738672 total in 47 blocks; 3740880 free (163341 chunks); 335997792 used 2024-06-05 15:58:53.840 +08 [338795] LOG: level: 5; HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used 2024-06-05 15:58:53.841 +08 [338795] LOG: level: 4; HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used 2024-06-05 15:58:53.841 +08 [338795] LOG: level: 5; HashSpillContext: 8192 total in 1 blocks; 4720 free (0 chunks); 3472 used 2024-06-05 15:58:53.841 +08 [338795] LOG: level: 5; HashBatch
How about using dirty snapshots to locate dependent objects?
Hello everyone,

At present, we use MVCC snapshots to identify dependent objects. This implies that if a new dependent object is inserted within a transaction that is still ongoing, our search for dependent objects won't include this recently added one. Consequently, if someone attempts to drop the referenced object, it will be dropped, and when the ongoing transaction completes, we will end up having an entry for a referenced object that has already been dropped. This situation can lead to an inconsistent state. Below is an example illustrating this scenario:

Session 1:
- create table t1(a int);
- insert into t1 select i from generate_series(1, 1000) i;
- create extension btree_gist;
- create index i1 on t1 using gist( a );

Session 2: (While the index creation in session 1 is in progress, drop the btree_gist extension)
- drop extension btree_gist;

Above command succeeds and so does the create index command running in session 1, post this, if we try running anything on table t1, i1, it fails with an error: "cache lookup failed for opclass ..."

Attached is the patch that I have tried, which seems to be working for me. It's not polished and thoroughly tested, but just sharing here to clarify what I am trying to suggest. Please have a look and let me know your thoughts.

-- With Regards, Ashutosh Sharma.

use-dirty-snapshots-to-find-dependent-objects.patch Description: Binary data
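For anyone reproducing the example, the resulting inconsistency is visible from SQL even before touching t1 or i1. A quick check — a scenario-specific sketch, not a general orphaned-dependency detector — that lists indexes whose recorded operator classes no longer have a pg_opclass row:

-- index key opclasses that no longer exist in pg_opclass
SELECT i.indexrelid::regclass AS index_name, opc AS missing_opclass_oid
FROM pg_index AS i,
     unnest(string_to_array(i.indclass::text, ' ')::oid[]) AS opc
WHERE NOT EXISTS (SELECT 1 FROM pg_opclass o WHERE o.oid = opc);

After the sequence above it should report i1, which is what later surfaces as the "cache lookup failed for opclass" error.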
Re: Proposal: Job Scheduler
> On Thu, Jun 06, 2024 at 12:53:38PM GMT, Alvaro Herrera wrote: > On 2024-Jun-06, Dave Page wrote: > > > It's this kind of choice that means it's unlikely we'd include any one > > option in PostgreSQL, much like various other tools such as failover > > managers or poolers. > > TBH I see that more as a bug than as a feature, and I see the fact that > there are so many schedulers as a process failure. If we could have > _one_ scheduler in core that encompassed all the important features of > all the independent ones we have, with hooks or whatever to allow the > user to add any fringe features they need, that would probably lead to > less duplicative code and divergent UIs, and would be better for users > overall. > > That's, of course, just my personal opinion. +1. The PostgreSQL ecosystem is surprisingly fragmented, when it comes to quite essential components that happen to be outside of the core. But of course it doesn't mean that there should be _one_ component of every kind in core, more like it makes sense to have _one_ component available out of the box (where the box is whatever form of PostgreSQL that gets delivered to users, e.g. a distro package, container, etc.).
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 9:34 AM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 3:17 PM Bharath Rupireddy > wrote: > > > > On Tue, Jun 4, 2024 at 5:40 PM Amit Kapila wrote: > > > > > > Even if we decode it periodically (say each time we decode the > > > checkpoint record) then also we need to send the entire set of > > > sequences at shutdown. This is because the sequences may have changed > > > from the last time we sent them. > > > > Agree. How about decoding and sending only the sequences that are > > changed from the last time when they were sent? I know it requires a > > bit of tracking and more work, but all I'm looking for is to reduce > > the amount of work that walsenders need to do during the shutdown. > > > > I see your point but going towards tracking the changed sequences > sounds like moving towards what we do for incremental backups unless > we can invent some other smart way. Yes, we would need an entirely new infrastructure to track the sequence change since the last sync. We can only determine this from WAL, and relying on it would somehow bring us back to the approach we were trying to achieve with logical decoding of sequences patch. > > Having said that, I like the idea of letting the user sync the > > sequences via ALTER SUBSCRIPTION command and not weave the logic into > > the shutdown checkpoint path. As Peter Eisentraut said here > > https://www.postgresql.org/message-id/42e5cb35-4aeb-4f58-8091-90619c7c3ecc%40eisentraut.org, > > this can be a good starting point to get going. > > > > Agreed. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
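To make that concrete, the user-triggered sync being discussed could end up looking something like the following. This is purely hypothetical syntax to illustrate the idea; nothing like it exists yet, and "mysub" is just a placeholder subscription name:

-- hypothetical: resynchronize sequence values for a subscription on demand
ALTER SUBSCRIPTION mysub REFRESH PUBLICATION SEQUENCES;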
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: > > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This implies > that if a new dependent object is inserted within a transaction that is still > ongoing, our search for dependent objects won't include this recently added > one. Consequently, if someone attempts to drop the referenced object, it will > be dropped, and when the ongoing transaction completes, we will end up having > an entry for a referenced object that has already been dropped. This > situation can lead to an inconsistent state. Below is an example illustrating > this scenario: I don't think it's correct to allow the index to be dropped while a transaction is creating it. Instead, the right solution should be for the create index operation to protect the object it is using from being dropped. Specifically, the create index operation should acquire a shared lock on the Access Method (AM) to ensure it doesn't get dropped concurrently while the transaction is still in progress. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
Le jeu. 6 juin 2024 à 04:13, Amit Kapila a écrit : > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? That's right, setting subscription's option 'streaming' to 'on' moves the problem away from the publisher to the subscribers. This patch tries to improve the default situation when 'streaming' is set to 'off'. > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? Quick benchmarking executed on my laptop shows 1% overhead. Table DDL: CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT); Data generated with: INSERT INTO t SELECT i, 'Text number n°'||i::TEXT FROM generate_series(1, 1000) as i; Restoration duration measured using timestamps of log messages: "DEBUG: restored / changes from disk" HEAD: 25.54s, 25.94s, 25.516s, 26.267s, 26.11s / avg=25.874s Patch: 26.872s, 26.311s, 25.753s, 26.003, 25.843s / avg=26.156s Regards, JT
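For reference, the subscriber-side knob mentioned above is a per-subscription option; a minimal example of flipping it (the subscription name is a placeholder):

ALTER SUBSCRIPTION mysub SET (streaming = 'on');        -- stream large in-progress transactions instead of spilling on the publisher
ALTER SUBSCRIPTION mysub SET (streaming = 'parallel');  -- PostgreSQL 16+: apply streamed transactions with parallel workers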
Re: Postgresql OOM
On Thu, Jun 6, 2024 at 1:25 PM Radu Radutiu wrote:
> Hello all,
>
> I have a query that forces an out of memory error, where the OS will kill
> the postgresql process.
> The query plan (run immediately after a vacuum analyze) is at
> https://explain.depesz.com/s/ITQI#html .
>
> ...
>
> Any idea on how to identify the problem? I can reproduce it on demand.
> Should I report it pgsql-bugs?
>
> Best regards,
> Radu

I am not qualified to answer on the OOM issue, but why are you joining the same table (outputrequest) 4 times (using near-identical join conditions)? This essentially does a cross join: if an input_sequence value has, say, 1000 related rows in outputrequest, you will be getting 1000^4 rows in the result set.

FROM inputrequest t
  LEFT JOIN outputrequest rec_tro ON rec_tro.input_sequence = t.input_sequence
  LEFT JOIN inputrequest r ON r.originalRequest_id = t.input_sequence
  LEFT JOIN outputrequest rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
  LEFT JOIN outputrequest rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
  LEFT JOIN outputrequest snd_tro ON snd_tro.reply_input_sequence = t.input_sequence
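A quick way to gauge that fan-out, using the table and column names from the quoted query (run the equivalent for reply_input_sequence as well):

SELECT input_sequence, count(*) AS rows_per_input
FROM outputrequest
GROUP BY input_sequence
ORDER BY count(*) DESC
LIMIT 10;

If the top counts are large, the multi-way join above multiplies them together, which can blow up the result set long before any single hash join looks unreasonable.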
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote: > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This > implies that if a new dependent object is inserted within a transaction > that is still ongoing, our search for dependent objects won't include this > recently added one. Consequently, if someone attempts to drop the > referenced object, it will be dropped, and when the ongoing transaction > completes, we will end up having an entry for a referenced object that has > already been dropped. This situation can lead to an inconsistent state. > Below is an example illustrating this scenario: > > Session 1: > - create table t1(a int); > - insert into t1 select i from generate_series(1, 1000) i; > - create extension btree_gist; > - create index i1 on t1 using gist( a ); > > Session 2: (While the index creation in session 1 is in progress, drop the > btree_gist extension) > - drop extension btree_gist; > > Above command succeeds and so does the create index command running in > session 1, post this, if we try running anything on table t1, i1, it fails > with an error: "cache lookup failed for opclass ..." > > Attached is the patch that I have tried, which seems to be working for me. > It's not polished and thoroughly tested, but just sharing here to clarify > what I am trying to suggest. Please have a look and let me know your > thoughts. Thanks for the patch proposal! The patch does not fix the other way around: - session 1: BEGIN; DROP extension btree_gist; - session 2: create index i1 on t1 using gist( a ); - session 1: commits while session 2 is creating the index and does not address all the possible orphaned dependencies cases. There is an ongoing thread (see [1]) to fix the orphaned dependencies issue. v9 attached in [1] fixes the case you describe here. [1]: https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires wrote: > > Le jeu. 6 juin 2024 à 04:13, Amit Kapila a écrit : > > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > > > > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > That's right, setting subscription's option 'streaming' to 'on' moves > the problem away from the publisher to the subscribers. This patch > tries to improve the default situation when 'streaming' is set to > 'off'. > Can we think of changing the default to 'parallel'? BTW, it would be better to use 'parallel' for the 'streaming' option, if the workload has large transactions. Is there a reason to use a default value in this case? > > > 2. Do we want a GUC to switch compression on/off? > > > > > > > It depends on the overhead of decoding. Did you try to measure the > > decoding overhead of decompression when reading compressed files? > > Quick benchmarking executed on my laptop shows 1% overhead. > Thanks. We probably need different types of data (say random data in bytea column, etc.) for this. -- With Regards, Amit Kapila.
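One way to generate random-ish bytea rows with only built-in functions (the table name and sizes here are arbitrary):

CREATE TABLE t_bytea (i INTEGER PRIMARY KEY, b BYTEA);
INSERT INTO t_bytea
SELECT i,
       decode(md5(random()::text) || md5(random()::text) ||
              md5(random()::text) || md5(random()::text), 'hex')
FROM generate_series(1, 100000) AS i;

Rows like these should compress much worse than the repetitive text used in the benchmark above, which is the point of testing with a different data shape.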
Re: race condition in pg_class
On Wed, Jun 5, 2024 at 2:17 PM Noah Misch wrote: > Starting 2024-06-10, I plan to push the first seven of the ten patches: > > inplace005-UNEXPECTEDPASS-tap-meson-v1.patch > inplace010-tests-v1.patch > inplace040-waitfuncs-v1.patch > inplace050-tests-inj-v1.patch > inplace060-nodeModifyTable-comments-v1.patch > Those five just deal in tests, test infrastructure, and comments. > inplace070-rel-locks-missing-v1.patch > Main risk is new DDL deadlocks. > inplace080-catcache-detoast-inplace-stale-v1.patch > If it fails to fix the bug it targets, I expect it's a no-op rather than > breaking things. > > I'll leave the last three of the ten needing review. Those three are beyond > my skill to self-certify. It's not this patch set's fault, but I'm not very pleased to see that the injection point wait events have been shoehorned into the "Extension" category - which they are not - instead of being a new wait_event_type. That would have avoided the ugly wait-event naming pattern, inconsistent with everything else, introduced by inplace050-tests-inj-v1.patch. I think that the comments and commit messages in this patch set could, in some places, use improvement. For instance, inplace060-nodeModifyTable-comments-v1.patch reflows a bunch of comments, which makes it hard to see what actually changed, and the commit message doesn't tell you, either. A good bit of it seems to be changing "a view" to "a view INSTEAD OF trigger" or "a view having an INSTEAD OF trigger," but the reasoning behind that change is not spelled out anywhere. The reader is left to guess what the other case is and why the same principles don't apply to it. I don't doubt that the new comments are more correct than the old ones, but I expect future patch authors to have difficulty maintaining that state of affairs. Similarly, inplace070-rel-locks-missing-v1.patch adds no comments. IMHO, the commit message also isn't very informative. It disclaims knowledge of what bug it's fixing, while at the same time leaving the reader to figure out for themselves how the behavior has changed. Consequently, I expect writing the release notes for a release including this patch to be difficult: "We added some locks that block ... something ... in some circumstances ... to prevent ... something." It's not really the job of the release note author to fill in those blanks, but rather of the patch author or committer. I don't want to overburden the act of fixing bugs, but I just feel like more explanation is needed here. When I see for example that we're adding a lock acquisition to the end of heap_create(), I can't help but wonder if it's really true that we don't take a lock on a just-created relation today. I'm certainly under the impression that we lock newly-created, uncommitted relations, and a quick test seems to confirm that. I don't quite know whether that happens, but evidently this call is guarding against something more subtle than a categorical failure to lock a relation on creation so I think there should be a comment explaining what that thing is. It's also quite surprising that SetRelationHasSubclass() says "take X lock before calling" and 2 of 4 callers just don't. I guess that's how it is. But shouldn't we then have an assertion inside that function to guard against future mistakes? If the reason why we failed to add this initially is discernible from the commit messages that introduced the bug, it would be nice to mention what it seems to have been; if not, it would at least be nice to mention the offending commit(s). 
I'm also a bit worried that this is going to cause deadlocks, but I suppose if it does, that's still better than the status quo. IsInplaceUpdateOid's header comment says IsInplaceUpdateRelation instead of IsInplaceUpdateOid. inplace080-catcache-detoast-inplace-stale-v1.patch seems like another place where spelling out the rationale in more detail would be helpful to future readers; for instance, the commit message says that PgDatabaseToastTable is the only one affected, but it doesn't say why the others are not, or why this one is. The lengthy comment in CatalogCacheCreateEntry is also difficult to correlate with the code which follows. I can't guess whether the two cases called out in the comment always needed to be handled and were handled save only for in-place updates, and thus the comment changes were simply taking the opportunity to elaborate on the existing comments; or whether one of those cases is preexisting and the other arises from the desire to handle inplace updates. It could be helpful to mention relevant identifiers from the code in the comment text e.g. "systable_recheck_tuple detects ordinary updates by noting changes to the tuple's visibility information, while the equalTuple() case detects inplace updates." IMHO, this patch set underscores the desirability of removing in-place update altogether. That sounds difficult and not back-patchable, but I can't classify what this patch set does as anything better
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote:
> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote:
> >
> > Hello everyone,
> >
> > At present, we use MVCC snapshots to identify dependent objects. This
> > implies that if a new dependent object is inserted within a transaction
> > that is still ongoing, our search for dependent objects won't include this
> > recently added one. Consequently, if someone attempts to drop the
> > referenced object, it will be dropped, and when the ongoing transaction
> > completes, we will end up having an entry for a referenced object that has
> > already been dropped. This situation can lead to an inconsistent state.
> > Below is an example illustrating this scenario:
>
> I don't think it's correct to allow the index to be dropped while a
> transaction is creating it. Instead, the right solution should be for
> the create index operation to protect the object it is using from
> being dropped. Specifically, the create index operation should acquire
> a shared lock on the Access Method (AM) to ensure it doesn't get
> dropped concurrently while the transaction is still in progress.

If I'm following you correctly, that's exactly what the patch is trying to do. While the index creation is in progress, if someone tries to drop the object referenced by the index under creation, the session dropping the referenced object is able to see the dependent object (in this case the index being created) through the dirty snapshot. Because of that, it is unable to acquire the lock on the dependent object, and as a result it is unable to drop the referenced object.

-- With Regards, Ashutosh Sharma.
Re: Postgresql OOM
> I am not qualified to answer on the OOM issue, but why are you joining the
> same table (outputrequest) 4 times (using near-identical join conditions)?
> This essentially does a cross join: if an input_sequence value has, say,
> 1000 related rows in outputrequest, you will be getting 1000^4 rows in the
> result set.

The query itself runs fine in a reasonable time with enable_parallel_hash = 'off'. I see two problems: one is the wrong execution plan (right after running analyze); the second, and the most important, is the huge memory usage (far exceeding work_mem and shared_buffers) leading to OOM. See https://explain.depesz.com/s/yAqS for the explain plan with enable_parallel_hash = 'off'.
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 6:57 PM Bertrand Drouvot < bertranddrouvot...@gmail.com> wrote: > On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote: > > Hello everyone, > > > > At present, we use MVCC snapshots to identify dependent objects. This > > implies that if a new dependent object is inserted within a transaction > > that is still ongoing, our search for dependent objects won't include > this > > recently added one. Consequently, if someone attempts to drop the > > referenced object, it will be dropped, and when the ongoing transaction > > completes, we will end up having an entry for a referenced object that > has > > already been dropped. This situation can lead to an inconsistent state. > > Below is an example illustrating this scenario: > > > > Session 1: > > - create table t1(a int); > > - insert into t1 select i from generate_series(1, 1000) i; > > - create extension btree_gist; > > - create index i1 on t1 using gist( a ); > > > > Session 2: (While the index creation in session 1 is in progress, drop > the > > btree_gist extension) > > - drop extension btree_gist; > > > > Above command succeeds and so does the create index command running in > > session 1, post this, if we try running anything on table t1, i1, it > fails > > with an error: "cache lookup failed for opclass ..." > > > > Attached is the patch that I have tried, which seems to be working for > me. > > It's not polished and thoroughly tested, but just sharing here to clarify > > what I am trying to suggest. Please have a look and let me know your > > thoughts. > > Thanks for the patch proposal! > > The patch does not fix the other way around: > > - session 1: BEGIN; DROP extension btree_gist; > - session 2: create index i1 on t1 using gist( a ); > - session 1: commits while session 2 is creating the index > > and does not address all the possible orphaned dependencies cases. > > There is an ongoing thread (see [1]) to fix the orphaned dependencies > issue. > > v9 attached in [1] fixes the case you describe here. > > [1]: > https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal I see. Thanks for sharing this. I can take a look at this and help in whatever way I can. With Regards, Ashutosh Sharma.
Re: [multithreading] extension compatibility
On Thu, Jun 6, 2024 at 5:00 AM Heikki Linnakangas wrote: > If there is some material harm from compiling with multithreading > support even if you're not using it, we should try to fix that. I'm not > dead set against having a compile-time option, but I don't see the need > for it at the moment. Well, OK, so it sounds like I'm outvoted, at least at the moment. Maybe that will change as more people vote, but for now, that's where we are. Given that, I suppose we want something more like Tristan's patch, but with a more extensible syntax. Does that sound right? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
On 2024-Jun-06, Amit Kapila wrote: > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. I like the general idea of compressing the output of logical decoding. It's not so clear to me that we only want to do so for spilling to disk; for instance, if the two nodes communicate over a slow network, it may even be beneficial to compress when streaming, so to this question: > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? I would say that streaming doesn't necessarily have to mean we don't want compression, because for some users it might be beneficial. I think a GUC would be a good idea. Also, what if for whatever reason you want a different compression algorithm or different compression parameters? Looking at the existing compression UI we offer in pg_basebackup, perhaps you could add something like this: compress_logical_decoding = none compress_logical_decoding = lz4:42 compress_logical_decoding = spill-zstd:99 "none" says to never use compression (perhaps should be the default), "lz4:42" says to use lz4 with parameters 42 on both spilling and streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but only for spilling to disk. (I don't mean to say that you should implement Zstd compression with this patch, only that you should choose the implementation so that adding Zstd support (or whatever) later is just a matter of adding some branches here and there. With the current #ifdef you propose, it's hard to do that. Maybe separate the parts that depend on the specific algorithm to algorithm-agnostic functions.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Things I don't like about \du's "Attributes" column
On Thu, Jun 6, 2024 at 5:08 AM Pavel Luzanov wrote:
> But now there are no changes in pg_roles. Just a special interpretation
> of the two values of the "Connection limit" column:
> 0 - Not allowed (changed from 'No connections')
> -1 - empty string

I think the first of these special interpretations is unnecessary and should be removed. It seems pretty clear what 0 means.

-- Robert Haas EDB: http://www.enterprisedb.com
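For readers following along, the value under discussion is the role's connection limit, e.g. (role name is a placeholder):

ALTER ROLE alice CONNECTION LIMIT 0;    -- alice may not open any new connections
ALTER ROLE alice CONNECTION LIMIT -1;   -- no limit (the default)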
Re: [multithreading] extension compatibility
On 06/06/2024 17:23, Robert Haas wrote: On Thu, Jun 6, 2024 at 5:00 AM Heikki Linnakangas wrote: If there is some material harm from compiling with multithreading support even if you're not using it, we should try to fix that. I'm not dead set against having a compile-time option, but I don't see the need for it at the moment. Well, OK, so it sounds like I'm outvoted, at least at the moment. Maybe that will change as more people vote, but for now, that's where we are. Given that, I suppose we want something more like Tristan's patch, but with a more extensible syntax. Does that sound right? +1 -- Heikki Linnakangas Neon (https://neon.tech)
report on not thread-safe functions
In the context of the multithreaded-server project, I looked into potentially not thread-safe functions. (See proposed next steps at the end of this message.) Here is a list of functions in POSIX that are possibly not thread-safe: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_01 I checked those against the PostgreSQL server source code (backend + common + timezone), and found that the following from those are in use: - dlerror() - getenv() - getgrnam() - getopt() - getpwuid() - localeconv() - localtime() - nl_langinfo() - readdir() - setenv() - setlocale() - strerror() - strsignal() - strtok() - system() - unsetenv() Additionally, there are non-standard functions that are not thread-safe, such as getopt_long(). Also, there are replacement functions such as pg_gmtime() and pg_localtime() that mirror standard thread-unsafe functions and that appear to be equally unsafe. (Note to those looking into annotating global variables: You also need to check static local variables.) Conversely, some of the above might actually be thread-safe in some/many/all implementations. For example, strerror() and system() are thread-safe in glibc. So you might actually get a multithreaded server running in that environment with fewer source code changes but have it fail in others. Just something to keep in mind. I also tried the concurrency-mt-unsafe check from clang-tidy (https://clang.llvm.org/extra/clang-tidy/checks/concurrency/mt-unsafe.html). Run it for example like this: clang-tidy -p build --quiet --checks='-*,concurrency-mt-unsafe' src/backend/main/*.c (needs a compilation database in the build directory) (You can't just run it like src/backend/**/*.c because some .c files don't compile standalone, and then the whole thing aborts with too many errors. Maybe with a judicious exclusion list, this can be achieved. However, it's also not good dealing with compilation options like FRONTEND. So it can't easily serve as an automated checker, but it's okay as a manual exploration tool.) In addition to the POSIX list above, this also flagged: - exit() - sigprocmask() Allegedly, you can toggle it between posix and glibc modes, but I haven't succeeded with that. So for example, it does not actually flag strerror() out of the box, presumably because that is not in its glibc list. Now some more detailed comments on these functions: - dlerror() dlerror() gets the error from the last dlopen() call, which is obviously not thread-safe. This might require some deeper investigation of the whole dfmgr.c mechanism. (Which might be appropriate in any case, since in multithreaded environments, you don't need to load a library into each session separately.) - exit() Most of the exit() calls happen where there are not multiple threads active. But some emergency exit calls like in elog.c might more correctly use _exit()? - getenv() - setenv() - unsetenv() getenv() is unsafe if there are concurrent setenv() or unsetenv() calls. We should try to move all those to early in the program startup. This seems doable. Some calls are related to locale stuff, which is a separate subproject to clean up. There are some calls to setenv("KRB5*"), which would need to be fixed. The source code comments nearby already contain ideas how to. - getgrnam() - getpwuid() - localtime() These have _r replacements. - getopt() This needs a custom replacement. (There is no getopt_r() because programs usually don't call getopt() after startup.) 
(Note: This is also called during session startup, not only during initial postmaster start. So we definitely need something here, if we want to, like, start more than one session concurrently.) - localeconv() - nl_langinfo() - setlocale() The locale business needs to be reworked to use locale_t and _l functions. This is already being discussed for other reasons. - readdir() This is listed as possibly thread-unsafe, but I think it is thread-safe in practice. You just can't work on the same DIR handle from multiple threads. There is a readdir_r(), but that's already deprecated. I think we can ignore this one. - sigprocmask() It looks like this is safe in practice. Also, there is pthread_sigmask() if needed. - strerror() Use strerror_r(). There are very few calls of this, actually, since most potential users use printf %m. - strsignal() Use strsignal_r(). These calls are already wrapped in pg_strsignal() for Windows portability, so it's easy to change. But this also led me to think that it is potentially dangerous to have different standards of thread-safety across the tree. pg_strsignal() is used by wait_result_to_str() which is used by pclose_check() ... and at that point you have completely lost track of what you are dealing with underneath. So if someone were to put, say, pclose_check() into pgbench, it could be broken. - strtok() Use strtok_r() or maybe even strsep() (but there are small semantic differences with the latte
Re: ResourceOwner refactoring
On Thu, Jun 6, 2024 at 7:32 AM Heikki Linnakangas wrote: > > Barring objections, I'll commit this later today or tomorrow. Thanks for > > the report! > > Committed. I think you may have forgotten to push. -- Robert Haas EDB: http://www.enterprisedb.com
Re: question regarding policy for patches to out-of-support branches
On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing wrote: > Not absolutely sure, but would at least adding a page to PostgreSQL > Wiki about this make sense ? I feel like we need to do something. Tom says this is a policy, and he's made that comment before about other things, but the fact that they're not memorialized anywhere is a huge problem, IMHO. People don't read or remember every mailing list discussion forever, and even if they did, how would we enumerate all the policies for the benefit of a newcomer? Maybe this belongs in the documentation, maybe in the wiki, maybe someplace else, but the real issue for me is that policies have to be discoverable by the people who need to adhere to them, and old mailing list discussions aren't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
Hi,

On Thu, Jun 6, 2024 at 2:36 AM Jeff Davis wrote:
> On Wed, 2024-06-05 at 14:36 +0530, Ashutosh Sharma wrote:
> > Thank you, Ashutosh, for the quick response. I've drafted a patch
> > aimed at addressing this issue. The patch attempts to solve this
> > issue by configuring the search_path for all security definer
> > functions created by the extension.
>
> I like the general direction you propose, but I think it needs more
> discussion about the details.

I agree.

> * What exactly is the right search_path for a function defined in an
> extension?

Determining the precise value can be challenging. However, since it's a function installed by an extension, typically the search_path should include the extension's search_path and the schema where the function resides. If the function relies on a schema other than the one we set in its search_path, which would usually be the one created by the extension, this approach will force extension developers to set the extension-specific search_path in the CREATE FUNCTION statement, if it's not set. The primary goal here is to ensure that the security definer functions created by an extension do not refer to any untrusted schema(s).

> * Do we need a new magic search_path value of "$extension_schema" that
> resolves to the extension's schema, so that it can handle ALTER
> EXTENSION ... SET SCHEMA?

Possibly yes, we can think about it. I think it would be something like the $user we have now.

> * What do we do for functions that want the current behavior and how do
> we handle migration issues?

That can be controlled via some GUC if needed, I guess.

> * What about SECURITY INVOKER functions? Those can still be vulnerable
> to manipulation by the caller by setting search_path, which can cause
> an incorrect value to be returned. That can matter in some contexts
> like a CHECK constraint.

I didn't get you completely here. w.r.t extensions how will this have an impact if we set the search_path for definer functions.

-- With Regards, Ashutosh Sharma.
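As a concrete illustration of what "configuring the search_path" for an extension's SECURITY DEFINER function can look like today (all names here are hypothetical), the function is pinned to its own schema, with pg_temp pushed to the end:

CREATE SCHEMA my_ext;
CREATE TABLE my_ext.widgets (id int);

CREATE FUNCTION my_ext.widget_count() RETURNS bigint
    LANGUAGE sql STABLE SECURITY DEFINER
    SET search_path = my_ext, pg_temp
AS $$ SELECT count(*) FROM widgets $$;

With the SET clause attached, the unqualified reference to widgets always resolves in my_ext, regardless of the caller's search_path.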
Re: ResourceOwner refactoring
On 06/06/2024 18:27, Robert Haas wrote: On Thu, Jun 6, 2024 at 7:32 AM Heikki Linnakangas wrote: Barring objections, I'll commit this later today or tomorrow. Thanks for the report! Committed. I think you may have forgotten to push. Huh, you're right. I could swear I did... Pushed now, thanks! -- Heikki Linnakangas Neon (https://neon.tech)
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, Jun 6, 2024 at 5:12 AM Jelte Fennema-Nio wrote: > Looking at ssl_max_protocol_version closer though, to stay really > consistent I'd have to change "latest" to be renamed to empty string > (i.e. there is no max_protocol_version). I think I might prefer > staying consistent over introducing an imho slightly clearer name. > Another way to stay consistent would of course be also adding "latest" > as an option to ssl_max_protocol_version? What do you think? As I see it, the issue here is whether the default value would ever be different from the latest value. If not, then using blank to mean the latest seems fine, but if so, then I'd expect blank to mean the default version and latest to mean the latest version. > I'll look into adding min_protocol_version to the patchset soonish. > Some review of the existing code in the first few patches would > definitely be appreciated. Yeah, I understand, and I do want to do that, but keep in mind I've already spent considerable time on this patch set, way more than most others, and if I want to get it committed I'm nowhere close to being done. It's probably multiple weeks of additional work for me, and I think I've probably already spent close to a week on this, and I only work ~48 weeks a year, and there are ~300 patches in the CommitFest. Plus, right now there is no possibility of actually committing anything until after we branch. And, respectfully, I feel like there has to be some give and take here. I've been trying to give this patch set higher priority because it's in an area that I know something about and have opinions about and also because I can tell that you're kind of frustrated and I don't want you to leave the development community. But, at the same time, I don't think you've done much to help me get my patches committed, and while you have done some review of other people's patches, it doesn't seem to often be the kind of detailed, line-by-line review that is needed to get most patches committed. So I'm curious how you expect this system to scale. -- Robert Haas EDB: http://www.enterprisedb.com
Re: relfilenode statistics
On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot wrote: > I think we should keep the stats in the relation during relfilenode changes. > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can > see in the example provided up-thread the new heap_blks_written statistic has > been preserved during the TRUNCATE. Yeah, I think there's something weird about this design. Somehow we're ending up with both per-relation and per-relfilenode counters: + pg_stat_get_blocks_written(C.oid) + pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, C.relfilenode) AS heap_blks_written, I'll defer to Andres if he thinks that's awesome, but to me it does not seem right to track some blocks written in a per-relation counter and others in a per-relfilenode counter. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
On Wed, Jun 5, 2024 at 3:48 AM Ashutosh Bapat wrote:
> Here's planning time measurements.
>
>  num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time
> -----------+-------------+--------------+------------------------------+-------------------------
>          2 |      187.86 |       177.27 |                        10.59 |                   5.64%
>          3 |      721.81 |       758.80 |                       -36.99 |                  -5.12%
>          4 |     2239.87 |      2236.19 |                         3.68 |                   0.16%
>          5 |     6830.86 |      7027.76 |                      -196.90 |                  -2.88%

I find these results concerning. Why should the planning time sometimes go up? And why should it go up for odd numbers of joinrels and down for even numbers of joinrels? I wonder if these results are really correct, and if they are, I wonder what could account for such odd behavior.

-- Robert Haas EDB: http://www.enterprisedb.com
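For anyone trying to reproduce numbers like these, planning time alone can be observed without executing the query; a sketch, assuming a set of identically partitioned tables t1..t5 joined on their partition key (all names are placeholders):

SET enable_partitionwise_join = on;
EXPLAIN (SUMMARY ON)
SELECT *
FROM t1
JOIN t2 USING (id)
JOIN t3 USING (id);

Plain EXPLAIN with SUMMARY ON reports "Planning Time" without running the query, which makes it easy to script many repetitions per join count.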
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin wrote: > Am I missing something or the the page buffer indeed lacks locking there? I don't know, but if the locks are really missing now, I feel like the first question is "which commit got rid of them?". It's a little hard to believe that they've never been there and somehow nobody has noticed. Then again, maybe we have; see Noah's thread about in-place updates breaking stuff and some of the surprising discoveries there. But it seems worth investigating. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Thu, 2024-06-06 at 21:17 +0530, Ashutosh Sharma wrote: > That can be controlled via some GUC if needed, I guess. That's a possibility, but it's easy to create a mess that way. I don't necessarily oppose it, but we'd need some pretty strong agreement that we are somehow moving users in a better direction and not just creating two behaviors that last forever. I also think there should be a way to explicitly request the old behavior -- obtaining search_path from the session -- regardless of how the GUC is set. > I didn't get you completely here. w.r.t extensions how will this have > an impact if we set the search_path for definer functions. If we only set the search path for SECURITY DEFINER functions, I don't think that solves the whole problem. Regards, Jeff Davis
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Thu, 6 Jun 2024 at 12:53, Jeff Davis wrote:
> > I didn't get you completely here. w.r.t extensions how will this have
> > an impact if we set the search_path for definer functions.
>
> If we only set the search path for SECURITY DEFINER functions, I don't
> think that solves the whole problem.

Indeed. While the ability for a caller to set the search_path for a security definer function introduces security problems that are different from those for security invoker functions, it's still weird for the behaviour of a function to depend on the caller's search_path. It's even weirder for the default search path behaviour to be different depending on whether or not the function is security definer.
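A minimal sketch of that weirdness with a security invoker function (all object names hypothetical): the same call returns different results depending only on the caller's search_path.

CREATE SCHEMA good;
CREATE SCHEMA evil;
CREATE TABLE good.limits (max_amount numeric);
CREATE TABLE evil.limits (max_amount numeric);
INSERT INTO good.limits VALUES (100);
INSERT INTO evil.limits VALUES (1000000);

SET search_path = good, public;     -- so the body also parses at creation time

CREATE FUNCTION public.check_amount(a numeric) RETURNS boolean
    LANGUAGE sql                    -- SECURITY INVOKER is the default
AS $$ SELECT a <= (SELECT max_amount FROM limits) $$;

SELECT check_amount(500);           -- false: limits resolves to good.limits
SET search_path = evil, public;
SELECT check_amount(500);           -- true: same call, now evil.limits

This is the sort of caller-side manipulation that matters in contexts like CHECK constraints.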
Re: question regarding policy for patches to out-of-support branches
Robert Haas writes: > On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing wrote: >> Not absolutely sure, but would at least adding a page to PostgreSQL >> Wiki about this make sense ? > I feel like we need to do something. Tom says this is a policy, and > he's made that comment before about other things, but the fact that > they're not memorialized anywhere is a huge problem, IMHO. I didn't say it wasn't ;-) ISTM we have two basic choices: wiki page, or new SGML docs section. In the short term I'd lean to a wiki page. It'd be reasonable for https://wiki.postgresql.org/wiki/Committing_checklist to link to it (and maybe the existing section there about release freezes would be more apropos on a "Project policies" page? Not sure.) To get a sense of how much of a problem we have, I grepped the git history for comments mentioning project policies. Ignoring ones that are really talking about very localized issues, what I found is attached. It seems like it's little enough that a single wiki page with subsections ought to be enough. I'm not super handy with editing the wiki, plus I'm only firing on one cylinder today (seem to have acquired a head cold at pgconf), so maybe somebody else would like to draft something? regards, tom lane This was submitted as a security issue, but the security team has been unable to identify any consequence worse than a null pointer dereference (from trying to access rd_tableam methods that the relation no longer has). Therefore, in accordance with our usual policy, it's not security material and should just be fixed as a routine bug. (this is probably material for security-team-private documentation) All backend-side variables should be marked with PGDLLIMPORT, as per policy introduced in 8ec569479f. Project policy is to not leave global objects behind after a regress test run. This was found as a result of the development of a patch to make pg_regress detect such leftovers automatically, which in the end was withdrawn due to issues with parallel runs. Per project policy, transient roles created by regression test cases should be named "regress_something", to reduce the risks of running such cases against installed servers. And no such role should ever be left behind after running a test. Per project policy that we want to keep recently-out-of-support branches buildable on modern systems, back-patch all the way to 9.2. This back-patches commit 9ff47ea41 into out-of-support branches, pursuant to newly-established project policy. The point is to suppress scary-looking warnings so that people building these branches needn't expend brain cells verifying that it's safe to ignore the warnings. Tweak detail and hint messages to be consistent with project policy (this should reference message style guide in SGML docs) Doc: update testing recipe in src/test/perl/README. The previous text didn't provide any clear explanation of our policy around TAP test portability. (should just reference that README as a guide for writing TAP tests) "typename" is a C++ keyword, so pg_upgrade.h fails to compile in C++. Fortunately, there seems no likely reason for somebody to need to do that. Nonetheless, it's project policy that all .h files should pass cpluspluscheck, so rename the argument to fix that. Commit a6417078 established a new project policy around OID assignment: new patches are encouraged to choose a random OID in the 8000.. range when a manually-assigned OID is required (if multiple OIDs are required, a consecutive block of OIDs starting from the random point should be used). 
Catalog entries added by committed patches that use OIDs from this "unstable" range are renumbered after feature freeze. (this should reference bki.sgml) libpq failed to ignore Windows-style newlines in connection service files. This normally wasn't a problem on Windows itself, because fgets() would convert \r\n to just \n. But if libpq were running inside a program that changes the default fopen mode to binary, it would see the \r's and think they were data. In any case, it's project policy to ignore \r in text files unconditionally, because people sometimes try to use files with DOS-style newlines on Unix machines, where the C library won't hide that from us. However, project policy since parallel query came in is that all plan node types should have outfuncs/readfuncs support, so this is clearly an oversight that should be repaired. (Probably moot now, given auto generation of these functions.) We have a project policy that every .c file should start by including postgres.h, postgres_fe.h, or c.h as appropriate; and then there is no need for any .h file to explicitly include any of these. (The core reason for this policy is to make it ea
Re: tiny step toward threading: reduce dependence on setlocale()
On Wed, 2024-06-05 at 17:23 -0700, Jeff Davis wrote: > A brief test shows that there may be a performance regression for > libc > default collations. But if so, I'm not sure that's avoidable if the > goal is to take away setlocale. I'll see if removing the extra > branches > mitigates it. I redid the test and didn't see a difference, and then I ran a standalone microbenchmark to compare strcoll() vs strcoll_l(), and didn't see a difference there, either. Another implementation may show a difference, but it doesn't seem to be a problem for glibc. I think this patch series is a nice cleanup, as well, making libc more like the other providers and not dependent on global state. Regards, Jeff Davis
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello Robert,

06.06.2024 19:36, Robert Haas wrote:
On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin wrote: Am I missing something or the the page buffer indeed lacks locking there? I don't know, but if the locks are really missing now, I feel like the first question is "which commit got rid of them?". It's a little hard to believe that they've never been there and somehow nobody has noticed. Then again, maybe we have; see Noah's thread about in-place updates breaking stuff and some of the surprising discoveries there. But it seems worth investigating.

Yes, my last experiment with memcmp for the whole buffer was wrong, given the comment above heapgettup_pagemode(). I think the correct check would be:

     ItemId      lpp;
     OffsetNumber lineoff;
+    ItemIdData  iid;

     lineoff = scan->rs_vistuples[lineindex];
     lpp = PageGetItemId(page, lineoff);
+    iid = *((ItemIdData *) lpp);
+    for (int i = 0; i < 1000; i++)
+        Assert(memcmp(&iid, lpp, sizeof(iid)) == 0);

It makes the test failure significantly easier to reproduce for me. Will try to bisect this anomaly tomorrow.

Best regards,
Alexander
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On 2024-06-06 Th 04:15, Kyotaro Horiguchi wrote: At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in I have been thinking about this since then. At first, I thought it referred to FindFirstChangeNotification() and friends, and inotify on Linux. However, I haven't found a way to simplify the specified code area using those APIs. By the way, the need to shift by 2 seconds to tolerate clock skew suggests that the current launcher-postmaster association mechanism is somewhat unreliable. Couldn't we add a command line option to postmaster to explicitly pass a unique identifier (say, pid itself) of the launcher? If it is not specified, the number should be the PID of the immediate parent process. This change avoids the need for the special treatment for Windows. Looks good generally. I assume iterating over the process table to find the right pid will be pretty quick. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Thu, Jun 06, 2024 at 12:36:32PM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin wrote: > > Am I missing something or the the page buffer indeed lacks locking there? > > I don't know, but if the locks are really missing now, I feel like the > first question is "which commit got rid of them?". It's a little hard > to believe that they've never been there and somehow nobody has > noticed. > > Then again, maybe we have; see Noah's thread about in-place updates > breaking stuff and some of the surprising discoveries there. But it > seems worth investigating. $SUBJECT looks more like a duplicate of postgr.es/m/flat/20240512171658.7e.nmi...@google.com (Hot standby queries see transient all-zeros pages).
Re: problems with "Shared Memory and Semaphores" section of docs
Here is a rebased version of the patch for v18 that adds a runtime-computed GUC. As I noted earlier, there still isn't a consensus on this approach. -- nathan >From 74f638f7df9c51f5ab47b282bb7107c4ba6cb5b6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 21 May 2024 14:02:22 -0500 Subject: [PATCH v2 1/1] add semaphores_required GUC --- doc/src/sgml/config.sgml| 14 +++ doc/src/sgml/runtime.sgml | 39 - src/backend/storage/ipc/ipci.c | 6 - src/backend/utils/misc/guc_tables.c | 12 + 4 files changed, 53 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 698169afdb..f6afc941df 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11215,6 +11215,20 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + semaphores_required (integer) + + semaphores_required configuration parameter + + + + +Reports the number of semaphores that are needed for the server based +on the number of allowed connections, worker processes, etc. + + + + ssl_library (string) diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 2f7c618886..c387f302d7 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -781,13 +781,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such SEMMNI Maximum number of semaphore identifiers (i.e., sets) -at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) plus room for other applications +at least ceil(semaphores_required / 16) plus room for other applications SEMMNS Maximum number of semaphores system-wide -ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) * 17 plus room for other applications +ceil(semaphores_required / 16) * 17 plus room for other applications @@ -836,30 +836,38 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using System V semaphores, -PostgreSQL uses one semaphore per allowed connection -(), allowed autovacuum worker process -(), allowed WAL sender process -(), and allowed background -process (), in sets of 16. +PostgreSQL uses one semaphore per allowed connection, +worker process, etc., in sets of 16. Each such set will also contain a 17th semaphore which contains a magic number, to detect collision with semaphore sets used by other applications. The maximum number of semaphores in the system is set by SEMMNS, which consequently must be at least -as high as max_connections plus -autovacuum_max_workers plus max_wal_senders, -plus max_worker_processes, plus one extra for each 16 -allowed connections plus workers (see the formula in plus one extra for +each set of 16 required semaphores (see the formula in ). The parameter SEMMNI determines the limit on the number of semaphore sets that can exist on the system at one time. Hence this parameter must be at -least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16). +least ceil(semaphores_required / 16). Lowering the number of allowed connections is a temporary workaround for failures, which are usually confusingly worded No space left on device, from the function semget. 
+ +The number of semaphores required by PostgreSQL +is provided by the runtime-computed parameter +semaphores_required, which can be determined before +starting the server with a postgres command like: + +$ postgres -D $PGDATA -C semaphores_required + +The value of semaphores_required should be input into +the aforementioned formulas to determine appropriate values for +SEMMNI and SEMMNS. + + In some cases it might also be necessary to increase SEMMAP to be at least on the order of @@ -882,11 +890,8 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using POSIX semaphores, the number of semaphores needed is the -same as for System V, that is one semaphore per allowed connection -(), allowed autovacuum worker process -(), allowed WAL sender process -(), and allowed background -process (). +same as for System V, that is one semaphore per allowed connection, +worker process, etc. On the platforms where this option is preferred, there is no specific kernel limit on the number of POSIX semaphores. diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 521ed5418c..3e030accac 100644 --- a/src/backend/s
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, 6 Jun 2024 at 18:01, Robert Haas wrote: > As I see it, the issue here is whether the default value would ever be > different from the latest value. If not, then using blank to mean the > latest seems fine, but if so, then I'd expect blank to mean the > default version and latest to mean the latest version. Alright, that's fair. And we already seem to follow that pattern: There's currently no connection option that has a default that's not the empty string, but still accepts the empty string as an argument. > > I'll look into adding min_protocol_version to the patchset soonish. > > Some review of the existing code in the first few patches would > > definitely be appreciated. > > Yeah, I understand, and I do want to do that, but keep in mind I've > already spent considerable time on this patch set, way more than most > others, and if I want to get it committed I'm nowhere close to being > done. It's probably multiple weeks of additional work for me, and I > think I've probably already spent close to a week on this, and I only > work ~48 weeks a year, and there are ~300 patches in the CommitFest. I very much appreciate the time you spent on this patchset so far. I mainly thought that instead of only discussing the more complex parts of the patchset, it would be nice to also actually move forward a little bit too. And the first 3 patches in this patchset are very small and imho straightforward improvements. To be clear, I'm not saying that should be all on you. I think those first three patches can be reviewed by pretty much anyone. > Plus, right now there is no possibility of actually committing > anything until after we branch. Totally fair, but even a LGTM on one of the patches would be quite nice. > And, respectfully, I feel like there > has to be some give and take here. I've been trying to give this patch > set higher priority because it's in an area that I know something > about and have opinions about and also because I can tell that you're > kind of frustrated and I don't want you to leave the development > community. Thank you for giving it a higher priority, it's definitely appreciated and noticed. > But, at the same time, I don't think you've done much to > help me get my patches committed, and while you have done some review > of other people's patches, it doesn't seem to often be the kind of > detailed, line-by-line review that is needed to get most patches > committed. So I'm curious how you expect this system to scale. Of course there's always the possibility to review more. But I don't really agree with this summary of my review activity. I did see your patches related to the incremental backup stuff. They looked interesting, but at the time from an outside perspective it didn't seem like those threads needed my reviews to progress (a bunch of people more knowledgable on the topic were already responding). So I spent my time mainly on threads where I felt I could add something useful, and often that was more on the design front than the exact code. Honestly that's what triggered this whole patchset in the first place: Adding infrastructure for protocol changes so that the several other threads that try to introduce protocol changes can actually move forward, instead of being in limbo forever. Regarding line-by-line reviews, imho I definitely do that for the smaller patches I tend to review (even if they are part of a bigger patchset). 
But for the bigger ones I don't think line-by-line reviews are super helpful at the start, so I generally comment more on the design in those cases.
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 6:40 AM, Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires wrote: > > > > On Thu, Jun 6, 2024 at 4:13 AM, Amit Kapila wrote: > > > > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > > > When the content of a large transaction (size exceeding > > > > logical_decoding_work_mem) and its sub-transactions has to be > > > > reordered during logical decoding, then, all the changes are written > > > > on disk in temporary files located in pg_replslot/. > > > > Decoding very large transactions by multiple replication slots can > > > > lead to disk space saturation and high I/O utilization. > > > > > > > > > > Why can't one use 'streaming' option to send changes to the client > > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > That's right, setting subscription's option 'streaming' to 'on' moves > > the problem away from the publisher to the subscribers. This patch > > tries to improve the default situation when 'streaming' is set to > > 'off'. > > > > Can we think of changing the default to 'parallel'? BTW, it would be > better to use 'parallel' for the 'streaming' option, if the workload > has large transactions. Is there a reason to use a default value in > this case? You're certainly right, if using the streaming API helps to avoid bad situations and there is no downside, it could be used by default. > > > > 2. Do we want a GUC to switch compression on/off? > > > > > > > > > > It depends on the overhead of decoding. Did you try to measure the > > > decoding overhead of decompression when reading compressed files? > > > > Quick benchmarking executed on my laptop shows 1% overhead. > > Thanks. We probably need different types of data (say random data in > bytea column, etc.) for this. Yes, good idea, will run new tests along those lines. Thank you! Regards, JT
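For reference, a minimal sketch of how a subscriber opts into the streaming behaviour discussed above; the subscription, connection, and publication names are placeholders:

-- switch an existing subscription from spilling on the publisher
-- to streaming in-progress transactions to the subscriber:
ALTER SUBSCRIPTION mysub SET (streaming = parallel);

-- or at creation time:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = parallel);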
Re: problems with "Shared Memory and Semaphores" section of docs
On Thu, Jun 6, 2024 at 3:21 PM Nathan Bossart wrote: > Here is a rebased version of the patch for v18 that adds a runtime-computed > GUC. As I noted earlier, there still isn't a consensus on this approach. I don't really like making this a GUC, but what's the other option? It's reasonable for people to want to ask the server how many resources it will need to start, and -C is the only tool we have for that right now. So I feel like this is a fair thing to do. I do think the name could use some more thought, though. semaphores_required would end up being the same kind of thing as shared_memory_size_in_huge_pages, but the names seem randomly different. If semaphores_required is right here, why isn't shared_memory_required used there? Seems more like we ought to call this semaphores or os_semaphores or num_semaphores or num_os_semaphores or something. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
Le jeu. 6 juin 2024 à 07:24, Alvaro Herrera a écrit : > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. Interesting idea, will try to evaluate how to compress/decompress data transiting via streaming and how good the compression ratio would be. > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. I agree, if the server was compiled with support of multiple compression libraries, users should be able to choose which one they want to use. > (I don't mean to say that you should implement Zstd compression with > this patch, only that you should choose the implementation so that > adding Zstd support (or whatever) later is just a matter of adding some > branches here and there. With the current #ifdef you propose, it's hard > to do that. Maybe separate the parts that depend on the specific > algorithm to algorithm-agnostic functions.) Makes sense, will rework this patch in that way. Thank you! Regards, JT
Re: problems with "Shared Memory and Semaphores" section of docs
On Thu, Jun 06, 2024 at 03:31:53PM -0400, Robert Haas wrote: > I don't really like making this a GUC, but what's the other option? > It's reasonable for people to want to ask the server how many > resources it will need to start, and -C is the only tool we have for > that right now. So I feel like this is a fair thing to do. Yeah, this is how I feel, too. > I do think the name could use some more thought, though. > semaphores_required would end up being the same kind of thing as > shared_memory_size_in_huge_pages, but the names seem randomly > different. If semaphores_required is right here, why isn't > shared_memory_required used there? Seems more like we ought to call > this semaphores or os_semaphores or num_semaphores or > num_os_semaphores or something. I'm fine with any of your suggestions. If I _had_ to pick one, I'd probably choose num_os_semaphores because it's the most descriptive. -- nathan
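For context, a runtime-computed parameter like this can be inspected with -C before startup or with SHOW on a running server; a small sketch using the name suggested above (num_os_semaphores is only a proposal at this point, the final name may differ):

-- existing runtime-computed parameter, for comparison:
SHOW shared_memory_size_in_huge_pages;
-- proposed parameter (name still under discussion):
SHOW num_os_semaphores;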
Re: question regarding policy for patches to out-of-support branches
On 6/6/24 14:12, Tom Lane wrote: Robert Haas writes: On Thu, Jun 6, 2024 at 4:25 AM Hannu Krosing wrote: Not absolutely sure, but would at least adding a page to PostgreSQL Wiki about this make sense ? I feel like we need to do something. Tom says this is a policy, and he's made that comment before about other things, but the fact that they're not memorialized anywhere is a huge problem, IMHO. I didn't say it wasn't ;-) ISTM we have two basic choices: wiki page, or new SGML docs section. In the short term I'd lean to a wiki page. It'd be reasonable for https://wiki.postgresql.org/wiki/Committing_checklist to link to it (and maybe the existing section there about release freezes would be more apropos on a "Project policies" page? Not sure.) To get a sense of how much of a problem we have, I grepped the git history for comments mentioning project policies. Ignoring ones that are really talking about very localized issues, what I found is attached. It seems like it's little enough that a single wiki page with subsections ought to be enough. I'm not super handy with editing the wiki, plus I'm only firing on one cylinder today (seem to have acquired a head cold at pgconf), so maybe somebody else would like to draft something? I added them here with minimal copy editing and no attempt to organize or sort into groups: https://wiki.postgresql.org/wiki/Committing_checklist#Policies If someone has thoughts on how to improve it, I am happy to make more changes. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid orphaned objects dependencies, take 3
On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot wrote: > v9 is more invasive (as it changes code in much more places) than v8 but it is > easier to follow (as it is now clear where the new lock is acquired). Hmm, this definitely isn't what I had in mind. Possibly that's a sign that what I had in mind was dumb, but for sure it's not what I imagined. What I thought you were going to do was add calls like LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock) in various places, or perhaps LockRelationOid(reloid, AccessShareLock), or whatever the case may be. Here you've got stuff like this: - record_object_address_dependencies(&conobject, addrs_auto, -DEPENDENCY_AUTO); + lock_record_object_address_dependencies(&conobject, addrs_auto, + DEPENDENCY_AUTO); ...which to me looks like the locking is still pushed down inside the dependency code. And you also have stuff like this: ObjectAddressSet(referenced, RelationRelationId, childTableId); + depLockAndCheckObject(&referenced); recordDependencyOn(&depender, &referenced, DEPENDENCY_PARTITION_SEC); But in depLockAndCheckObject you have: + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; That doesn't seem right, because then it seems like the call isn't doing anything, but there isn't really any reason for it to not be doing anything. If we're dropping a dependency on a table, then it seems like we need to have a lock on that table. Presumably the reason why we don't end up with dangling dependencies in such cases now is because we're careful about doing LockRelation() in the right places, but we're not similarly careful about other operations e.g. ConstraintSetParentConstraint is called by DefineIndex which calls table_open(childRelId, ...) first, but there's no logic in DefineIndex to lock the constraint. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
Re: small fix for llvm build
On 28.05.24 17:17, Peter Eisentraut wrote: I'm getting build failures when building with meson and llvm enabled, like this: [1/112] Generating src/backend/jit/llvm/llvmjit_types.bc with a custom command FAILED: src/backend/jit/llvm/llvmjit_types.bc /usr/local/bin/ccache /usr/local/Cellar/llvm/18.1.6/bin/clang -c -o src/backend/jit/llvm/llvmjit_types.bc ../src/backend/jit/llvm/llvmjit_types.c -flto=thin -emit-llvm -MD -MQ src/backend/jit/llvm/llvmjit_types.bc -MF src/backend/jit/llvm/llvmjit_types.c.bc.d -O2 -Wno-ignored-attributes -Wno-empty-body -fno-strict-aliasing -fwrapv -I./src/include -I./src/backend/utils/misc -I../src/include In file included from ../src/backend/jit/llvm/llvmjit_types.c:27: In file included from ../src/include/postgres.h:45: ../src/include/c.h:75:10: fatal error: 'libintl.h' file not found 75 | #include <libintl.h> | ^~~ 1 error generated. The reason is that libintl.h is at /usr/local/include/libintl.h, but that is not in the include path for this command. I have -I/usr/local/include in CPPFLAGS in the environment, which is why the normal compilation commands pick it up, but this is not used by this custom command. With this small patch I can make it work: diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build index 41c759f73c5..4a4232661ba 100644 --- a/src/backend/jit/llvm/meson.build +++ b/src/backend/jit/llvm/meson.build @@ -63,6 +63,7 @@ bitcode_cflags = ['-fno-strict-aliasing', '-fwrapv'] if llvm.version().version_compare('>=15.0') bitcode_cflags += ['-Xclang', '-no-opaque-pointers'] endif +bitcode_cflags += get_option('c_args') bitcode_cflags += cppflags # XXX: Worth improving on the logic to find directories here I have committed this change.
Re: Postgresql OOM
Radu Radutiu writes: > The query itself runs fine in a reasonable time with enable_parallel_hash = > 'off'. I see two problems - one is the wrong execution plan (right after > running analyze), the second and the most important is the huge memory > usage (far exceeding work_mem and shared buffers) leading to OOM. > See https://explain.depesz.com/s/yAqS for the explain plan > with enable_parallel_hash = 'off'. What it looks like to me is that the join key column has very skewed statistics, such that a large majority of the tuples end up in the same hash bucket (probably they even all have identical keys). I think the memory growth is coming from the executor repeatedly splitting the buckets in a vain attempt to separate those tuples into multiple buckets. The planner should recognize this situation and avoid use of hash join in such cases, but maybe the statistics aren't reflecting the problem, or maybe there's something wrong with the logic specific to parallel hash join. You've not really provided enough information to diagnose why the planner made this poor choice of plan. regards, tom lane
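A sketch of how one might check for the skew described above and work around the plan choice; the table and column names below are placeholders, not taken from the reported query:

-- look for a dominant value on the join key:
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'some_table' AND attname = 'join_key_column';

-- the workaround the reporter already uses:
SET enable_parallel_hash = off;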
Re: Things I don't like about \du's "Attributes" column
On 06.06.2024 17:29, Robert Haas wrote: I think the first of these special interpretations is unnecessary and should be removed. It seems pretty clear what 0 means. Agree. There is an additional technical argument for removing this replacement. I don't like explicit cast to text of the "Connection limit" column. Without 'Not allowed' it is no longerrequired. Value -1 can be replaced by NULL with an implicit cast to integer. Next version with this change attached. Example output: \du+ regress_du* List of roles Role name | Login | Attributes | Valid until | Connection limit | Description --+---+-+--+--+-- regress_du_admin | yes | Superuser +| | | some description | | Create DB +| | | | | Create role+| | | | | Inherit+| | | | | Replication+| | | | | Bypass RLS | | | regress_du_role0 | yes | Inherit | Tue Jun 04 00:00:00 2024 PDT | 0 | regress_du_role1 | no| Create role+| infinity | | | | Inherit | | | regress_du_role2 | yes | Inherit+| | 42 | | | Replication+| | | | | Bypass RLS | | | (4 rows) Current version for comparison: List of roles Role name | Attributes | Description --++-- regress_du_admin | Superuser, Create role, Create DB, Replication, Bypass RLS | some description regress_du_role0 | No connections+| | Password valid until 2024-06-04 00:00:00+03| regress_du_role1 | Create role, Cannot login +| | Password valid until infinity | regress_du_role2 | Replication, Bypass RLS +| | 42 connections | Data: CREATE ROLE regress_du_role0 LOGIN PASSWORD '123' VALID UNTIL '2024-06-04' CONNECTION LIMIT 0; CREATE ROLE regress_du_role1 CREATEROLE CONNECTION LIMIT -1 VALID UNTIL 'infinity'; CREATE ROLE regress_du_role2 LOGIN REPLICATION BYPASSRLS CONNECTION LIMIT 42; CREATE ROLE regress_du_admin LOGIN SUPERUSER CREATEROLE CREATEDB BYPASSRLS REPLICATION INHERIT; COMMENT ON ROLE regress_du_admin IS 'some description'; -- Pavel Luzanov Postgres Professional:https://postgrespro.com From fd3fb8a4bea89f870789fe63a270f872c945980c Mon Sep 17 00:00:00 2001 From: Pavel Luzanov Date: Thu, 6 Jun 2024 23:48:32 +0300 Subject: [PATCH v8] psql: Rethinking of \du command Cnanges in the \du command - "Login", "Connection limit" and "Valid until" attributes are placed in separate columns. to understand that there is no limit. - The "Attributes" column includes only the enabled logical attributes. - The attribute names correspond to the keywords of the CREATE ROLE command. - The attributes are listed in the same order as in the documentation. - Value -1 for "Connection limit" replaced by NULL to make it easier - General refactoring of describeRoles function in describe.c. --- src/bin/psql/describe.c| 146 - src/test/regress/expected/psql.out | 40 +--- src/test/regress/sql/psql.sql | 12 ++- 3 files changed, 72 insertions(+), 126 deletions(-) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index f67bf0b892..31cc40b38f 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -36,7 +36,6 @@ static bool describeOneTableDetails(const char *schemaname, bool verbose); static void add_tablespace_footer(printTableContent *const cont, char relkind, Oid tablespace, const bool newline); -static void add_role_attribute(PQExpBuffer buf, const char *const str); static bool listTSParsersVerbose(const char *pattern); static bool describeOneTSParser(const char *oid, const char *nspname, const char *prsname); @@ -3615,34 +3614,47 @@ describeRoles(const char *pattern, bool verbose, bool showSystem) { PQExpBufferData buf; PGresult *res; - printTableContent cont; - printTableO
Re: ssl tests fail due to TCP port conflict
On Wed, 5 Jun 2024 at 23:37, Tom Lane wrote: > > Andrew Dunstan writes: > > On 2024-06-05 We 16:00, Alexander Lakhin wrote: > >> That is, psql from the test instance 001_ssltests_34 opened a > >> connection to > >> the test server with the client port 50072 and it made using the port by > >> the server from the test instance 001_ssltests_30 impossible. > > > Oh. (kicks self) > > D'oh. > > > Should we really be allocating ephemeral server ports in the range > > 41952..65535? Maybe we should be looking for an unallocated port > > somewhere below 41952, and above, say, 32767, so we couldn't have a > > client socket collision. > > Hmm, are there really any standards about how these port numbers > are used? > > I wonder if we don't need to just be prepared to retry the whole > thing a few times. Even if it's true that "clients" shouldn't > choose ports below 41952, we still have a small chance of failure > against a non-Postgres server starting up at the wrong time. My suggestion would be to not touch the ephemeral port range at all for these ports. In practice the ephemeral port range is used for cases where the operating system assigns the port, and the application doesn't care what it is. Not for when you want to get a free port, but want to know in advance which one it is. For the PgBouncer test suite we do something similar to what PG's perl tests do, but there we allocate a port between 10200 and 32768: https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 Sure theoretically it's possible to hit a rare case where another server starts up at the wrong time, but that chance seems way lower than a client starting up at the wrong time. Especially since there aren't many servers that use a port with 5 digits. Attached is a patch that updates the port numbers. v1-0001-Don-t-use-ephemeral-port-range.patch Description: Binary data
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Thu, 6 Jun 2024 at 20:10, Isaac Morland wrote: > > On Thu, 6 Jun 2024 at 12:53, Jeff Davis wrote: > >> >> > I didn't get you completely here. w.r.t extensions how will this have >> > an impact if we set the search_path for definer functions. >> >> If we only set the search path for SECURITY DEFINER functions, I don't >> think that solves the whole problem. > > > Indeed. While the ability for a caller to set the search_path for a security > definer functions introduces security problems that are different than for > security invoker functions, it's still weird for the behaviour of a function > to depend on the caller's search_path. It’s even weirder for the default > search path behaviour to be different depending on whether or not the > function is security definer. +1 And +1 to the general idea and direction this thread is going in. I definitely think we should be making extensions more secure by default, and this is an important piece of it. Even by default making the search_path "pg_catalog, pg_temp" for functions created by extensions would be very useful.
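For what it's worth, an extension author can already pin this down per function today; a minimal sketch (the function name and body are placeholders):

-- attach a fixed search_path to the function itself, so the caller's
-- search_path no longer affects name resolution inside it:
CREATE FUNCTION my_ext_func(int) RETURNS int
LANGUAGE sql
SECURITY DEFINER
SET search_path = pg_catalog, pg_temp
AS 'SELECT $1 + 1';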
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Fri, 2024-06-07 at 00:19 +0200, Jelte Fennema-Nio wrote: > Even by default making the search_path "pg_catalog, pg_temp" for > functions created by extensions would be very useful. Right now there's no syntax to override that. We'd need something to say "get the search_path from the session". Regards, Jeff Davis
Re: race condition in pg_class
On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > It's not this patch set's fault, but I'm not very pleased to see that > the injection point wait events have been shoehorned into the > "Extension" category - which they are not - instead of being a new > wait_event_type. That would have avoided the ugly wait-event naming > pattern, inconsistent with everything else, introduced by > inplace050-tests-inj-v1.patch. Not sure to agree with that. The set of core backend APIs supporting injection points have nothing to do with wait events. The library attached to one or more injection points *may* decide to use a wait event like what the wait/wakeup calls in modules/injection_points do, but that's entirely optional. These rely on custom wait events, plugged into the Extension category as the code run is itself in an extension. I am not arguing against the point that it may be interesting to plug in custom wait event categories, but the current design of wait events makes that much harder than what core is currently able to handle, and I am not sure that this brings much at the end as long as the wait event strings can be customized. I've voiced upthread concerns over the naming enforced by the patch and the way it plugs the namings into the isolation functions, by the way. -- Michael signature.asc Description: PGP signature
Re: Injection points: preloading and runtime arguments
On Thu, Jun 06, 2024 at 03:47:47PM +0500, Andrey M. Borodin wrote: > Is it OK to detach() before wakeup()? Or, perhaps, can a detach() do a > wakeup() automatically? It is OK to do a detach before a wakeup. Noah has been relying on this behavior in an isolation test for a patch he's worked on. See inplace110-successors-v1.patch here: https://www.postgresql.org/message-id/20240512232923.aa.nmi...@google.com That's also something we've discussed for 33181b48fd0e, where Noah wanted to emulate in an automated fashion what one can do with a debugger and one or more breakpoints. Not sure that wakeup() involving an automated detach() is the behavior to hide long-term, actually, as there is also an argument for waking up a point and *not* detaching it to force multiple waits. -- Michael signature.asc Description: PGP signature
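For illustration, the detach-before-wakeup sequence discussed above, using the SQL functions exposed by the injection_points test module; the point name is a placeholder and this is only a sketch of how the module is typically driven:

-- session 1: make the next hit of the point wait
SELECT injection_points_attach('my-test-point', 'wait');
-- session 2, later: detaching first is allowed ...
SELECT injection_points_detach('my-test-point');
-- ... and the wakeup releases any backend already waiting on the point
SELECT injection_points_wakeup('my-test-point');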
Re: report on not thread-safe functions
On Thu, 2024-06-06 at 16:34 +0200, Peter Eisentraut wrote: > - setlocale() > > The locale business needs to be reworked to use locale_t and _l > functions. This is already being discussed for other reasons. I posted a few patches to do this for collation: https://commitfest.postgresql.org/48/5023/ Regards, Jeff Davis
Re: Revive num_dead_tuples column of pg_stat_progress_vacuum
On Wed, Jun 5, 2024 at 7:19 PM Andrey M. Borodin wrote: > > > > > On 4 Jun 2024, at 00:26, Masahiko Sawada wrote: > > Thank you! Vacuum enhancement is a really good step forward, and this small > change would help a lot of observability tools. > > > > On 4 Jun 2024, at 00:49, Peter Geoghegan wrote: > > > > Can we rename this to num_dead_item_ids (or something similar) in > > passing? > > I do not insist, but many tools will have to adapt to this change [0,1]. > However, most of tools will have to deal with removed max_dead_tuples anyway > [2], so this is not that big problem. True, this incompatibility would not be a big problem. num_dead_item_ids seems good to me. I've updated the patch that incorporated the comment from Álvaro[1]. Regards, [1] https://www.postgresql.org/message-id/202406041535.pmyty3ci4pfd%40alvherre.pgsql -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com From 24838307afc52dfc00317579ad88a59ba5c00192 Mon Sep 17 00:00:00 2001 From: Masahiko Sawada Date: Tue, 4 Jun 2024 06:17:25 +0900 Subject: [PATCH v2] Reintroduce dead tuple counter in pg_stat_progress_vacuum. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 667e65aac3 changed both num_dead_tuples and max_dead_tuples columns to dead_tuple_bytes and max_dead_tuple_bytes columns, respectively. But as per discussion, the number of dead tuples collected still can provide meaningful insights for users. This change reintroduce the column for the count of dead tuples, renamed as num_dead_item_ids to avoid confusion with the number of dead tuples removed by VACUUM, which includes dead heap-only tuples, but excludes any pre-existing LP_DEAD items left behind by opportunistic pruning. XXX: bump catalog version. Reviewed-by: Peter Geoghegan, Álvaro Herrera, Andrey Borodin Discussion: https://postgr.es/m/CAD21AoBL5sJE9TRWPyv%2Bw7k5Ee5QAJqDJEDJBUdAaCzGWAdvZw%40mail.gmail.com --- doc/src/sgml/monitoring.sgml | 9 + src/backend/access/heap/vacuumlazy.c | 12 +--- src/backend/catalog/system_views.sql | 3 ++- src/include/commands/progress.h | 5 +++-- src/test/regress/expected/rules.out | 5 +++-- 5 files changed, 26 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 053da8d6e4..b2ad9b446f 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6268,6 +6268,15 @@ FROM pg_stat_get_backend_idset() AS backendid; + + + num_dead_item_ids bigint + + + Number of dead item identifiers collected since the last index vacuum cycle. 
+ + + indexes_total bigint diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 8145ea8fc3..ef1df35afa 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -2883,13 +2883,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets, int num_offsets) { TidStore *dead_items = vacrel->dead_items; + const int prog_index[2] = { + PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS, + PROGRESS_VACUUM_DEAD_TUPLE_BYTES + }; + int64 prog_val[2]; TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets); vacrel->dead_items_info->num_items += num_offsets; - /* update the memory usage report */ - pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES, - TidStoreMemoryUsage(dead_items)); + /* update the progress information */ + prog_val[0] = vacrel->dead_items_info->num_items; + prog_val[1] = TidStoreMemoryUsage(dead_items); + pgstat_progress_update_multi_param(2, prog_index, prog_val); } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 53047cab5f..efb29adeb3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned, S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count, S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes, -S.param8 AS indexes_total, S.param9 AS indexes_processed +S.param8 AS num_dead_item_ids, S.param9 AS indexes_total, +S.param10 AS indexes_processed FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h index 82a8fe6bd1..5616d64523 100644 --- a/src/include/commands/progress.h +++ b/src/include/commands/progress.h @@ -25,8 +25,9 @@ #define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4 #define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5 #define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6 -#define PROGRESS_VACUUM_INDEXES_TOTAL 7 -#define PROGRESS_VACUUM_INDEXES_PROCESSED 8 +#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7 +#define PROGRESS_VAC
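Once the column is in place, monitoring tools could pick it up with something like this; num_dead_item_ids is the name used in the v2 patch and may still change:

SELECT pid, datname, relid::regclass AS relation,
       num_dead_item_ids, dead_tuple_bytes, max_dead_tuple_bytes,
       indexes_processed, indexes_total
FROM pg_stat_progress_vacuum;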
Re: question regarding policy for patches to out-of-support branches
Joe Conway writes: > On 6/6/24 14:12, Tom Lane wrote: >> To get a sense of how much of a problem we have, I grepped the git >> history for comments mentioning project policies. Ignoring ones >> that are really talking about very localized issues, what I found >> is attached. It seems like it's little enough that a single wiki >> page with subsections ought to be enough. I'm not super handy with >> editing the wiki, plus I'm only firing on one cylinder today (seem >> to have acquired a head cold at pgconf), so maybe somebody else >> would like to draft something? > I added them here with minimal copy editing an no attempt to organize or > sort into groups: > https://wiki.postgresql.org/wiki/Committing_checklist#Policies > If someone has thoughts on how to improve I am happy to make more changes. Thanks! I summoned the energy to make a few more improvements, particularly updating stuff that seemed out-of-date. I'm sure there's more that could be added here. regards, tom lane
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada wrote: > > > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila wrote: > > > > > > > > To achieve this, we can allow sequences to be copied during > > > the initial CREATE SUBSCRIPTION command similar to what we do for > > > tables. And then later by new/existing command, we re-copy the already > > > existing sequences on the subscriber. > > > > > > The options for the new command could be: > > > Alter Subscription ... Refresh Sequences > > > Alter Subscription ... Replicate Sequences > > > > > > In the second option, we need to introduce a new keyword Replicate. > > > Can you think of any better option? > > > > Another idea is doing that using options. For example, > > > > For initial sequences synchronization: > > > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true); > > > > How will it interact with the existing copy_data option? So copy_data > will become equivalent to copy_table_data, right? Right. > > > For re-copy (or update) sequences: > > > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true); > > > > Similar to the previous point it can be slightly confusing w.r.t > copy_data. And would copy_sequence here mean that it would copy > sequence values of both pre-existing and newly added sequences, if so, > that would make it behave differently than copy_data? The other > possibility in this direction would be to introduce an option like > replicate_all_sequences/copy_all_sequences which indicates a copy of > both pre-existing and new sequences, if any. Copying sequence data works differently than replicating table data (initial data copy and logical replication). So I thought the copy_sequence option (or whatever better name) always does both updating pre-existing sequences and adding new sequences. REFRESH PUBLICATION updates the tables to be subscribed, so we also update or add sequences associated to these tables. > > If we want to go in the direction of having an option such as > copy_(all)_sequences then do you think specifying that copy_data is > just for tables in the docs would be sufficient? I am afraid that it > would be confusing for users. I see your point. But I guess it would not be very problematic as it doesn't break the current behavior and copy_(all)_sequences is primarily for upgrade use cases. > > > > > > > In addition to the above, the command Alter Subscription .. Refresh > > > Publication will fetch any missing sequences similar to what it does > > > for tables. > > > > On the subscriber side, do we need to track which sequences are > > created via CREATE/ALTER SUBSCRIPTION? > > > > I think so unless we find some other way to know at refresh > publication time which all new sequences need to be part of the > subscription. What should be the behavior w.r.t sequences when the > user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? I was > thinking similar to tables, it should fetch any missing sequence > information from the publisher. It seems to make sense to me. But I have one question: do we want to support replicating sequences that are not associated with any tables? if yes, what if we refresh two different subscriptions that subscribe to different tables on the same database? On the other hand, if no (i.e. replicating only sequences owned by tables), can we know which sequences to replicate by checking the subscribed tables? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Synchronizing slots from primary to standby
On Thursday, June 6, 2024 12:21 PM Peter Smith > > Hi, here are some review comments for the docs patch v5-0001. Thanks for the comments! Here is the V6 patch that addressed the these. Best Regards, Hou zj v6-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch Description: v6-0001-Document-the-steps-to-check-if-the-standby-is-rea.patch
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello Noah, 06.06.2024 22:07, Noah Misch wrote: I don't know, but if the locks are really missing now, I feel like the first question is "which commit got rid of them?". It's a little hard to believe that they've never been there and somehow nobody has noticed. Then again, maybe we have; see Noah's thread about in-place updates breaking stuff and some of the surprising discoveries there. But it seems worth investigating. $SUBJECT looks more like a duplicate of postgr.es/m/flat/20240512171658.7e.nmi...@google.com (Hot standby queries see transient all-zeros pages). Thank you for the reference! Yes, it looks very similar. Though I can't say the sleep you proposed helps the failure reproduction (I've tried 026_overwrite_contrecord.pl and saw no more frequent failures or so). My bisect run ended with: 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit Author: Thomas Munro Date: Wed Apr 3 00:03:08 2024 +1300 Provide vectored variant of ReadBuffer(). Other buildfarm failures with this Assert I could find kind of confirm this: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 (presumably a first failure of this sort) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 Best regards, Alexander
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote: > My bisect run ended with: > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit > > Author: Thomas Munro > Date: Wed Apr 3 00:03:08 2024 +1300 > > Provide vectored variant of ReadBuffer(). > > Other buildfarm failures with this Assert I could find kind of confirm this: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 > (presumably a first failure of this sort) > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 Looking...
Re: relfilenode statistics
Hi, On 2024-06-06 12:27:49 -0400, Robert Haas wrote: > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot > wrote: > > I think we should keep the stats in the relation during relfilenode changes. > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you > > can > > see in the example provided up-thread the new heap_blks_written statistic > > has > > been preserved during the TRUNCATE. > > Yeah, I think there's something weird about this design. Somehow we're > ending up with both per-relation and per-relfilenode counters: > > + pg_stat_get_blocks_written(C.oid) + > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, > C.relfilenode) AS heap_blks_written, > > I'll defer to Andres if he thinks that's awesome, but to me it does > not seem right to track some blocks written in a per-relation counter > and others in a per-relfilenode counter. It doesn't immediately sound awesome. Nor really necessary? If we just want to keep prior stats upon a relation rewrite, we can just copy the stats from the old relfilenode. Or we can decide that those stats don't really make sense anymore, and start from scratch. I *guess* I could see an occasional benefit in having both counters for "prior relfilenodes" and "current relfilenode" - except that stats get reset manually and upon crash anyway, making this less useful than if it were really "lifetime" stats. Greetings, Andres Freund
Re: cannot drop intarray extension
On Mon, Jun 3, 2024 at 12:14 PM jian he wrote: > > hi. > > setup > drop table if exist test__int cascade; > create extension intarray; > > CREATE TABLE test__int( a int[] ); > CREATE INDEX text_idx on test__int using gist (a gist__intbig_ops(siglen = > 1)); > drop extension intarray cascade; > NOTICE: drop cascades to index text_idx > 2024-06-03 11:53:32.629 CST [41165] ERROR: cache lookup failed for > function 17758 > 2024-06-03 11:53:32.629 CST [41165] STATEMENT: drop extension intarray > cascade; > ERROR: cache lookup failed for function 17758 > > > backtrace info: > index_getprocinfo > #0 index_opclass_options (indrel=0x7faeca727b58, attnum=1, > attoptions=94372901674408, validate=false) > at > ../../Desktop/pg_src/src4/postgres/src/backend/access/index/indexam.c:1034 > #1 0x55d4e63a79cb in RelationGetIndexAttOptions > (relation=0x7faeca727b58, copy=false) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:5872 > #2 0x55d4e639d72d in RelationInitIndexAccessInfo > (relation=0x7faeca727b58) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:1569 > #3 0x55d4e639c5ac in RelationBuildDesc (targetRelId=24582, insertIt=true) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:1207 > #4 0x55d4e639e9ce in RelationIdGetRelation (relationId=24582) > at > ../../Desktop/pg_src/src4/postgres/src/backend/utils/cache/relcache.c:2115 > #5 0x55d4e5a412fd in relation_open (relationId=24582, lockmode=8) > at > ../../Desktop/pg_src/src4/postgres/src/backend/access/common/relation.c:58 > #6 0x55d4e5ae6a06 in index_open (relationId=24582, lockmode=8) > at > ../../Desktop/pg_src/src4/postgres/src/backend/access/index/indexam.c:137 > #7 0x55d4e5be61b8 in index_drop (indexId=24582, concurrent=false, > concurrent_lock_mode=false) > at ../../Desktop/pg_src/src4/postgres/src/backend/catalog/index.c:2156 > > i guess it's because we first dropped the function g_intbig_options in this context, the index "text_idx" has a normal dependency with pg_opclass. but `drop extension intarray cascade;`, CASCADE means that we drop the pg_opclass and pg_opclass's inner dependency first, then drop the index. while drop index (sub functions RelationGetIndexAttOptions,index_opclass_options, index_getprocinfo) requires that pg_opclass and its inner dependencies (namely g_intbig_options, g_int_options) are not dropped first. in deleteObjectsInList, under certain conditions trying to sort the to be deleted object list by just using sort_object_addresses seems to work, but it looks like a hack. maybe the proper fix would be in findDependentObjects. From 8deefb638df270cf26e5649b1a99f218474821fa Mon Sep 17 00:00:00 2001 From: jian he Date: Fri, 7 Jun 2024 11:25:03 +0800 Subject: [PATCH v1 1/1] trying to resolve drop extension deletion order sometimes, drop extension cascade cannot resolve the internal deletion order correctly. e.g. drop table if exist test__int cascade; create extension intarray; CREATE TABLE test__int( a int[] ); CREATE INDEX text_idx on test__int using gist (a gist__intbig_ops(siglen = 1)); drop extension intarray cascade; the index "text_idx" only have a normal dependency with pg_opclass. even though the index can be dropped separately without affecting the "pg_opclass". but CASCADE means that we drop the pg_opclass and pg_opclass's inner dependency first, then drop the index. 
while drop index (sub functions RelationGetIndexAttOptions,index_opclass_options, index_getprocinfo) requires that pg_opclass and its inner dependencies are not dropped first. Resorting the deleted objects in deleteObjectsInList using sort_object_addresses seems like a hack. but it works for now. discussion: https://www.postgresql.org/message-id/CACJufxEspPKC7oxVLci7oUddUmcAGNKJnWWSD7-B03bGtT9gDg%40mail.gmail.com --- src/backend/catalog/dependency.c | 33 +++ src/backend/catalog/pg_shdepend.c | 2 +- src/backend/commands/dropcmds.c | 2 +- src/backend/commands/indexcmds.c | 2 +- src/backend/commands/tablecmds.c | 13 +++- src/include/catalog/dependency.h | 2 +- 6 files changed, 41 insertions(+), 13 deletions(-) diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index d4b5b2ad..d0c2454b 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -183,7 +183,7 @@ static void DeleteInitPrivs(const ObjectAddress *object); */ static void deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel, - int flags) + int flags, bool sort_objects) { int i; @@ -213,6 +213,8 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel, } } + if (sort_objects) + sort_object_addresses(targetObjects); /* * Delete all the objects in the proper order, except that if told
Re: relfilenode statistics
Hi, On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote: > The main argument is that we currently don’t have writes counters for > relations. > The reason is that we don’t have the relation OID when writing buffers out. > Tracking writes per relfilenode would allow us to track/consolidate writes per > relation (example in the v1 patch and in the message up-thread). > > I think that adding instrumentation in this area (writes counters) could be > beneficial (like it is for the ones we currently have for reads). > > Second argument is that this is also beneficial for the "Split index and > table statistics into different types of stats" thread (mentioned in the > previous > message). It would allow us to avoid additional branches in some situations > (like > the one mentioned by Andres in the link I provided up-thread). I think there's another *very* significant benefit: Right now physical replication doesn't populate statistics fields like n_dead_tup, which can be a huge issue after failovers, because there's little information about what autovacuum needs to do. Auto-analyze can *partially* fix it at times, if it's lucky enough to see enough dead tuples - but that's not a given and even if it works, is often wildly inaccurate. Once we put things like n_dead_tup into per-relfilenode stats, we can populate them during WAL replay. Thus after a promotion autovacuum has much better data. This also is important when we crash: We've been talking about storing a snapshot of the stats alongside each REDO pointer. Combined with updating stats during crash recovery, we'll have accurate dead-tuple stats once recovery has finished. Greetings, Andres Freund
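As a concrete illustration of the gap described above, on a freshly promoted standby today the accumulated counters start out empty, for example:

-- run right after promotion: n_dead_tup and the autovacuum timestamps
-- carry no history from the old primary
SELECT relname, n_dead_tup, n_mod_since_analyze, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY relname
LIMIT 5;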
Re: speed up a logical replica setup
On Wed, May 22, 2024, at 12:16 PM, Euler Taveira wrote: > I'll summarize all issues as soon as I finish the review of sync slot > support. I > think we should avoid new development if we judge that the item can be > documented as a limitation for this version. Nevertheless, I will share > patches > so you can give your opinion on whether it is an open item or new development. Here it is a patch series to fix the issues reported in recent discussions. The patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 removes synchronized failover slots on subscriber since it has no use. I also included an optional patch 0004 that improves the usability by checking both servers if it already failed in any subscriber check. As I said in this previous email I decided to remove the logic that reacts for an issue on primary. We can reintroduce another code later if/when we have a better way to check the recovery progress. It will rely on the recovery_timeout and it adds recovery_timeout equals to PG_TEST_TIMEOUT_DEFAULT to let the animals control how long it should wait for the recovery. Since some animals reported some issues in the check_publisher routine that checks if the primary_slot_name is in use on primary, this logic was removed too (patch 0003). We could introduce a way to keep trying this test but the conclusion is that it is not worth it and if the primary_slot_name does not exist (due to a setup error), pg_createsubscriber will log an error message and continue. The 0002 removes any failover slots that remains on subscriber. Talking about terminology, I noticed that slotsync.c uses "logical failover slots" and "failover logical slots", I think the latter sounds better but I'm not a native speaker. I also don't know if we use a short terminology like "failover slots" "failover replication slots" or "failover logical replication slots". IMO we can omit "logical" because "failover" infers it is a logical replication slot. I'm also not sure about omitting "replication". It is not descriptive enough. I prefer "failover replication slots". Before sending this email I realized that I did nothing about physical replication slots on the standby. I think we should also remove them too unconditionally. -- Euler Taveira EDB https://www.enterprisedb.com/ From 5d8b4781e6e9bcb00564f45c25e575a8abab6ae8 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Tue, 21 May 2024 23:04:57 -0300 Subject: [PATCH 1/4] Only the recovery_timeout controls the end of recovery process It used to check if the target server is connected to the primary server (send required WAL) to rapidly react when the process won't succeed. This code is not enough to guarantee that the recovery process will complete. There is a window between the walreceiver shutdown and the pg_is_in_recovery() returns false that can reach NUM_CONN_ATTEMPTS attempts and fails. Instead, rely only on recovery_timeout option to give up the process after the specified number of seconds. --- doc/src/sgml/ref/pg_createsubscriber.sgml | 7 - src/bin/pg_basebackup/pg_createsubscriber.c | 29 ++- .../t/040_pg_createsubscriber.pl | 2 ++ 3 files changed, 5 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml b/doc/src/sgml/ref/pg_createsubscriber.sgml index 142b02..a700697f88 100644 --- a/doc/src/sgml/ref/pg_createsubscriber.sgml +++ b/doc/src/sgml/ref/pg_createsubscriber.sgml @@ -325,13 +325,6 @@ PostgreSQL documentation connections to the target server should fail. 
- -During the recovery process, if the target server disconnects from the -source server, pg_createsubscriber will check a -few times if the connection has been reestablished to stream the required -WAL. After a few attempts, it terminates with an error. - - Since DDL commands are not replicated by logical replication, avoid executing DDL commands that change the database schema while running diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c index 90cc580811..f62f34b1a7 100644 --- a/src/bin/pg_basebackup/pg_createsubscriber.c +++ b/src/bin/pg_basebackup/pg_createsubscriber.c @@ -1360,6 +1360,9 @@ stop_standby_server(const char *datadir) * * If recovery_timeout option is set, terminate abnormally without finishing * the recovery process. By default, it waits forever. + * + * XXX Is the recovery process still in progress? When recovery process has a + * better progress reporting mechanism, it should be added here. */ static void wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions *opt) @@ -1367,9 +1370,6 @@ wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions PGconn *conn; int status = POSTMASTER_STILL_STARTING; int timer = 0; - int count = 0; /* number of consecutive connection attempts */ - -#define NUM_CONN_ATTEMPTS 10 pg_log_i
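A sketch of the kind of cleanup that patch 0002 automates; on the converted subscriber the leftover slots could be listed and dropped manually like this, assuming the failover and synced columns of pg_replication_slots:

-- list replication slots that were synchronized for failover
SELECT slot_name, slot_type, failover, synced
FROM pg_replication_slots
WHERE failover;

-- drop them, since they have no use on the new subscriber
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE failover;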
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma wrote: > > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote: >> >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: >> > >> > Hello everyone, >> > >> > At present, we use MVCC snapshots to identify dependent objects. This >> > implies that if a new dependent object is inserted within a transaction >> > that is still ongoing, our search for dependent objects won't include this >> > recently added one. Consequently, if someone attempts to drop the >> > referenced object, it will be dropped, and when the ongoing transaction >> > completes, we will end up having an entry for a referenced object that has >> > already been dropped. This situation can lead to an inconsistent state. >> > Below is an example illustrating this scenario: >> >> I don't think it's correct to allow the index to be dropped while a >> transaction is creating it. Instead, the right solution should be for >> the create index operation to protect the object it is using from >> being dropped. Specifically, the create index operation should acquire >> a shared lock on the Access Method (AM) to ensure it doesn't get >> dropped concurrently while the transaction is still in progress. > > > If I'm following you correctly, that's exactly what the patch is trying to > do; while the index creation is in progress, if someone tries to drop the > object referenced by the index under creation, the referenced object being > dropped is able to know about the dependent object (in this case the index > being created) using dirty snapshot and hence, it is unable to acquire the > lock on the dependent object, and as a result of that, it is unable to drop > it. You are aiming for the same outcome, but not in the conventional way. In my opinion, the correct approach is not to find objects being created using a dirty snapshot. Instead, when creating an object, you should acquire a proper lock on any dependent objects to prevent them from being dropped during the creation process. For instance, when creating an index that depends on the btree_gist access method, the create index operation should protect btree_gist from being dropped by acquiring the appropriate lock. It is not the responsibility of the drop extension to identify in-progress index creations. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
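The race being discussed looks roughly like this (a sketch only; the table, column, and index names are illustrative):

-- session 1: keep the index-creating transaction open
BEGIN;
CREATE INDEX idx ON t USING gist (a gist__int_ops);

-- session 2, concurrently: the MVCC snapshot used for the dependency
-- search does not yet see session 1's new pg_depend rows, so the drop
-- can complete and leave a dangling dependency behind
DROP EXTENSION intarray;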
PgStat_KindInfo.named_on_disk not required in shared stats
Hi all, (Relevant folks in CC.) While hacking on the area of pgstat_*.c, I have noticed the existence of named_on_disk in PgStat_KindInfo, that is here to track the fact that replication slots are a particular case in the PgStat_HashKey for the dshash table of the stats because this kind of stats requires a mapping between the replication slot name and the hash key. As far as I can see, this field is not required and is used nowhere, because the code relies on the existence of the to_serialized_name and from_serialized_name callbacks to do the mapping. Wouldn't it make sense to remove it? This field is defined since 5891c7a8ed8f that introduced the shmem stats, and has never been used since. This frees an extra bit in PgStat_KindInfo, which is going to help me a bit with what I'm doing with this area of the code while keeping the structure size the same. Thoughts? -- Michael From 68c6e8401baea7ba1f0c616bbcd74c19daab770e Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 7 Jun 2024 14:04:06 +0900 Subject: [PATCH] Remove PgStat_KindInfo.named_on_disk This field is used to track a special case for replication slots that need a mapping between the dshash key and the slot names, but it is used nowhere as callbacks take care of sanity checks. --- src/include/utils/pgstat_internal.h | 8 +--- src/backend/utils/activity/pgstat.c | 1 - 2 files changed, 1 insertion(+), 8 deletions(-) diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h index dbbca31602..f6031995a9 100644 --- a/src/include/utils/pgstat_internal.h +++ b/src/include/utils/pgstat_internal.h @@ -193,12 +193,6 @@ typedef struct PgStat_KindInfo */ bool accessed_across_databases:1; - /* - * For variable-numbered stats: Identified on-disk using a name, rather - * than PgStat_HashKey. Probably only needed for replication slot stats. - */ - bool named_on_disk:1; - /* * The size of an entry in the shared stats hash table (pointed to by * PgStatShared_HashEntry->body). @@ -239,7 +233,7 @@ typedef struct PgStat_KindInfo void (*reset_timestamp_cb) (PgStatShared_Common *header, TimestampTz ts); /* - * For variable-numbered stats with named_on_disk. Optional. + * For variable-numbered stats. Optional. */ void (*to_serialized_name) (const PgStat_HashKey *key, const PgStatShared_Common *header, NameData *name); diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index dcc2ad8d95..44f0d3ede7 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -307,7 +307,6 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { .fixed_amount = false, .accessed_across_databases = true, - .named_on_disk = true, .shared_size = sizeof(PgStatShared_ReplSlot), .shared_data_off = offsetof(PgStatShared_ReplSlot, stats), -- 2.43.0 signature.asc Description: PGP signature
Re: cannot drop intarray extension
On Fri, Jun 07, 2024 at 11:32:14AM +0800, jian he wrote: > in deleteObjectsInList, under certain conditions trying to sort the to > be deleted object list > by just using sort_object_addresses seems to work, > but it looks like a hack. > maybe the proper fix would be in findDependentObjects. @@ -1459,6 +1459,7 @@ RemoveRelations(DropStmt *drop) [...] - performMultipleDeletions(objects, drop->behavior, flags); + if (list_length(drop->objects) > 1) + sortable = false; I have not studied the patch in details, but this looks overcomplicated to me. All the callers of performMultipleDeletions pass down sortable as true, while deleteObjectsInList() uses this argument to avoid the sorting on nested calls. It seems to me that this could be simpler. -- Michael signature.asc Description: PGP signature
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent wrote: > > On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > > > > Em ter., 4 de jun. de 2024 às 16:39, Nathan Bossart > > escreveu: > >> > >> I noticed that the "Restoring database schemas in the new cluster" part of > >> pg_upgrade can take a while if you have many databases, so I experimented > >> with a couple different settings to see if there are any easy ways to speed > >> it up. The FILE_COPY strategy for CREATE DATABASE helped quite > >> significantly on my laptop. For ~3k empty databases, this step went from > >> ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 > >> made a similar change for initdb, so there might even be an argument for > >> back-patching this to v15 (where STRATEGY was introduced). One thing I > >> still need to verify is that this doesn't harm anything when there are lots > >> of objects in the databases, i.e., more WAL generated during many > >> concurrent CREATE-DATABASE-induced checkpoints. > >> > >> Thoughts? > > > > Why not use it too, if not binary_upgrade? > > Because in the normal case (not during binary_upgrade) you don't want > to have to generate 2 checkpoints for every created database, > especially not when your shared buffers are large. Checkpoints' costs > scale approximately linearly with the size of shared buffers, so being > able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > of performance in the systems where this performance impact matters > most. I agree with you that we introduced the WAL_LOG strategy to avoid these force checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
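For reference, the two strategies can be spelled out directly in SQL; a minimal sketch of what the restore step would effectively run with the proposed change (the database name is a placeholder):

-- proposed for binary upgrade: copy at the file level, at the cost of
-- a checkpoint before and after
CREATE DATABASE mydb TEMPLATE = template0 STRATEGY = FILE_COPY;

-- default since v15: WAL-log the copy and skip the extra checkpoints
CREATE DATABASE mydb TEMPLATE = template0 STRATEGY = WAL_LOG;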
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro wrote: > On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote: > > My bisect run ended with: > > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit > > > > Author: Thomas Munro > > Date: Wed Apr 3 00:03:08 2024 +1300 > > > > Provide vectored variant of ReadBuffer(). > > > > Other buildfarm failures with this Assert I could find kind of confirm this: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 > > (presumably a first failure of this sort) > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 > > Looking... What Noah described[1] is what should be happening already, I think, but 210622c6 unconditionally zeroed the page. Oops. The attached seems to cure his repro for me. Does it also cure your test? I couldn't see that variant myself for some reason, but it seems to make sense as the explanation. I would probably adjust the function name or perhaps consider refactoring slightly, but first let's confirm that this is the same issue and fix. [1] https://www.postgresql.org/message-id/flat/20240512171658.7e.nmi...@google.com From f3bb1d69a57bea820895efaf366371463e62235d Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 7 Jun 2024 17:49:19 +1200 Subject: [PATCH] Fix RBM_ZERO_AND_LOCK. Commit 210622c6 accidentally zeroed out pages even if they were found in the buffer pool. It should always lock the page, but it should only zero pages that were not already found as an optimization to avoid I/O. Otherwise, concurrent readers that hold only a pin might see corrupted page contents changing under their feet. Reported-by: Noah Misch Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com --- src/backend/storage/buffer/bufmgr.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 49637284f91..238fc0e3547 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1016,7 +1016,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * return. */ static void -ZeroBuffer(Buffer buffer, ReadBufferMode mode) +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero) { BufferDesc *bufHdr; uint32 buf_state; @@ -1034,7 +1034,8 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode) LockBufferForCleanup(buffer); } - memset(BufferGetPage(buffer), 0, BLCKSZ); + if (zero) + memset(BufferGetPage(buffer), 0, BLCKSZ); if (BufferIsLocal(buffer)) { @@ -1185,7 +1186,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence, buffer = PinBufferForBlock(rel, smgr, smgr_persistence, forkNum, blockNum, strategy, &found); - ZeroBuffer(buffer, mode); + ZeroBuffer(buffer, mode, !found); return buffer; } -- 2.45.1
Add support to TLS 1.3 cipher suites and curves lists
Hi All,

I'm a Postgres user, and I'm looking into restricting the set of allowed ciphers and configuring a concrete set of curves on our Postgres instances.

The current Postgres documentation mentions that only cipher lists for TLS 1.2 and below can be configured; there is no setting that controls the cipher choices used by TLS 1.3. As for ECDH keys, Postgres currently supports setting only a single elliptic curve group rather than a list, as described in the doc link below:

https://www.postgresql.org/docs/devel/runtime-config-connection.html

I have a patch that adds settings for TLS 1.3 cipher suites and expands the configuration option for EC settings. With my patch:

1. A new configuration option, ssl_ciphers_suites, controls the cipher choices used by TLS 1.3.
2. The existing configuration option ssl_ecdh_curve is extended to accept a list of curve names separated by colons.

Could you please review this to see if you are interested in having this change in the upcoming Postgres major release (it should be PG17)? Thanks in advance.

patch_support_tls1.3_curvelist.diff
Description: Binary data
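As a rough sketch of how this would be used once the patch is applied: ssl_ciphers_suites is the name proposed above, the list syntax for ssl_ecdh_curve assumes the patch, and the suite and curve names are only examples.

    -- Restrict the TLS 1.3 cipher suites the server will negotiate
    -- (ssl_ciphers_suites is the new GUC proposed by this patch).
    ALTER SYSTEM SET ssl_ciphers_suites = 'TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256';

    -- With the patch, ssl_ecdh_curve accepts a colon-separated list of curves
    -- instead of a single group.
    ALTER SYSTEM SET ssl_ecdh_curve = 'X25519:prime256v1:secp384r1';

    -- Both settings take effect on configuration reload.
    SELECT pg_reload_conf();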
Re: How about using dirty snapshots to locate dependent objects?
On Fri, Jun 7, 2024 at 10:06 AM Dilip Kumar wrote:
>
> On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma wrote:
> >
> > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote:
> >>
> >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote:
> >> >
> >> > Hello everyone,
> >> >
> >> > At present, we use MVCC snapshots to identify dependent objects. This
> >> > implies that if a new dependent object is inserted within a transaction
> >> > that is still ongoing, our search for dependent objects won't include
> >> > this recently added one. Consequently, if someone attempts to drop the
> >> > referenced object, it will be dropped, and when the ongoing transaction
> >> > completes, we will end up having an entry for a referenced object that
> >> > has already been dropped. This situation can lead to an inconsistent
> >> > state. Below is an example illustrating this scenario:
> >>
> >> I don't think it's correct to allow the index to be dropped while a
> >> transaction is creating it. Instead, the right solution should be for
> >> the create index operation to protect the object it is using from
> >> being dropped. Specifically, the create index operation should acquire
> >> a shared lock on the Access Method (AM) to ensure it doesn't get
> >> dropped concurrently while the transaction is still in progress.
> >
> > If I'm following you correctly, that's exactly what the patch is trying to
> > do; while the index creation is in progress, if someone tries to drop the
> > object referenced by the index under creation, the referenced object being
> > dropped is able to know about the dependent object (in this case the index
> > being created) using a dirty snapshot, and hence it is unable to acquire
> > the lock on the dependent object, and as a result of that, it is unable to
> > drop it.
>
> You are aiming for the same outcome, but not in the conventional way.
> In my opinion, the correct approach is not to find objects being
> created using a dirty snapshot. Instead, when creating an object, you
> should acquire a proper lock on any dependent objects to prevent them
> from being dropped during the creation process. For instance, when
> creating an index that depends on the btree_gist access method, the
> create index operation should protect btree_gist from being dropped by
> acquiring the appropriate lock. It is not the responsibility of the
> drop extension to identify in-progress index creations.

Thanks for sharing your thoughts. I appreciate your input and completely
understand your perspective, but I wonder whether that is feasible. For
example, if an object (an index in this case) depends on, let's say, 'n'
objects, and those 'n' objects belong to 'n' different catalog tables,
should we acquire locks on each of them until the create index command
succeeds, or should we just check for the presence of the dependent
objects and record their dependencies in the pg_depend table? In this
particular case, we are trying to create a gist index that depends on the
gist_int4 opclass, which is one of the tuples inside the pg_opclass
catalog table; should we acquire a lock on this tuple/table until the
create index command succeeds, and is that to be done for all the
dependent objects?

--
With Regards,
Ashutosh Sharma.
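For concreteness, the race being discussed can be sketched as follows. This is an illustrative reconstruction based on the description above, not the thread's original example; the table, column, and index names are made up.

    -- Session 1: create a dependent object inside a transaction that stays open.
    CREATE EXTENSION btree_gist;
    CREATE TABLE t (val int4);
    BEGIN;
    CREATE INDEX t_val_gist ON t USING gist (val gist_int4_ops);
    -- not yet committed: the new pg_depend entry is invisible to other sessions

    -- Session 2, concurrently: per the description above, the MVCC-snapshot
    -- dependency scan does not see the in-progress index, so the referenced
    -- objects can be dropped out from under session 1.
    DROP EXTENSION btree_gist;

    -- Session 1 then commits, leaving an index whose opclass no longer exists.
    COMMIT;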
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote:
>
> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent wrote:
>>
>> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote:
>>>
>>> Why not use it too, if not binary_upgrade?
>>
>> Because in the normal case (not during binary_upgrade) you don't want
>> to have to generate 2 checkpoints for every created database,
>> especially not when your shared buffers are large. Checkpoints' costs
>> scale approximately linearly with the size of shared buffers, so being
>> able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
>> of performance in the systems where this performance impact matters
>> most.
>
> I agree with you that we introduced the WAL_LOG strategy to avoid
> these forced checkpoints. However, in binary upgrade cases where no
> operations are happening in the system, the FILE_COPY strategy should
> be faster.

While you would be correct if there were no operations happening in the
system, during binary upgrade we're still actively modifying catalogs, and
this is done with potentially many concurrent jobs. I think it's not
unlikely that this would impact performance.

Now that I think about it, arguably, we shouldn't need to run checkpoints
during binary upgrade for the FILE_COPY strategy after we've restored the
template1 database and created a checkpoint after that: all other
databases use template1 as their template database, and the checkpoint is
there mostly to guarantee the FS knows about all changes in the template
database before we task it with copying the template database over to our
new database, so the protections we get from more checkpoints are
practically useless.

If such a change were implemented (i.e. no checkpoints for FILE_COPY in
binary upgrade, with a single manual checkpoint after restoring template1
in create_new_objects) I think most of my concerns with this patch would
be alleviated.

Kind regards,

Matthias van de Meent
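To make that proposal concrete, the restore phase would conceptually behave like the sketch below. The database names are hypothetical, and the skipping of FILE_COPY's implicit checkpoints would happen inside the server while in binary-upgrade mode, not in the SQL that pg_upgrade emits.

    -- 1. template1 itself is restored first.
    -- 2. One manual checkpoint ensures template1's contents are on disk.
    CHECKPOINT;
    -- 3. Every other database is then created from template1 with FILE_COPY;
    --    under the proposal, the per-database forced checkpoints are skipped
    --    while the server is in binary-upgrade mode.
    CREATE DATABASE db1 STRATEGY = FILE_COPY;
    CREATE DATABASE db2 STRATEGY = FILE_COPY;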