Re: Disallow redundant indexes
On Thu, 24 Apr 2025 at 21:27, Japin Li wrote:
> I propose that PostgreSQL prevent redundant index creation by:
>
> - Checking for identical existing indexes during CREATE INDEX.
> - Failing with an error (like Oracle's ORA-01408) if a duplicate is found.
> - Providing a GUC parameter (allow_redundant_indexes) to control this.
>
> I’d love to hear your feedback or suggestions for improvement.

Sounds like pointless nannying to me.

Also, we don't want GUCs that change the way things work. It's too annoying for application developers. I don't have a complete picture of the history of it, but I believe we still have quite a mess as a result of some GUC-dependent behaviour. Check the mess around backslash_quote, standard_conforming_strings and escape_string_warning.

In any case, who are we to define what a duplicate index is? Would creating a hash index on a column that's already part of a btree index be disallowed? How about creating an index on (col1) when there's already an index on (col1,col2)? One person might think that's a waste of space, and someone else might think it's useful to have the (col1) index to support faster index-only scans when only that column is needed.

I get that we have REINDEX CONCURRENTLY now and there's less of a need to recreate another duplicate index CONCURRENTLY before dropping the old one, but aren't there still reasons to create a duplicate index concurrently, e.g. to move an index to another tablespace without blocking queries?

If you want to do this, then maybe just write some query that looks at pg_index to find duplicates and stick it on the wiki. Anyone who cares enough about this can run that to check. Oh wait, someone did that already, see [1].

David

[1] https://wiki.postgresql.org/wiki/Index_Maintenance#Duplicate_indexes
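For readers who just want the gist of that wiki page, a simplified sketch of such a pg_index query is below. It groups index entries on everything except the index name; the wiki version is more thorough, so treat this as a starting point rather than a drop-in replacement:

    SELECT indrelid::regclass AS table_name,
           array_agg(indexrelid::regclass) AS duplicate_indexes
    FROM pg_index
    GROUP BY indrelid, indkey::text, indclass::text, indcollation::text,
             indoption::text, indisunique,
             coalesce(indexprs::text, ''), coalesce(indpred::text, '')
    HAVING count(*) > 1;

Any indexes that land in the same group differ only in name, which is exactly the kind of redundancy being debated in this thread.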
Re: pg_upgrade-breaking release
On Thu, Apr 24, 2025 at 8:12 AM Bruce Momjian wrote:
> Do we think most people are _not_ going to use pg_upgrade now that we
> are defaulting to checksums being enabled by default in PG 18?

I cannot imagine this would stop anyone from upgrading. It's one additional flag, which was already a requirement for those going the other direction in the past (checksums -> no checksums). And I also assume that "most people" are already running with checksums enabled.

> And if so, do we think we are ever going to have a
> storage-format-changing release where pg_upgrade cannot be used?

Seems very unlikely, that would kind of go against the whole purpose of pg_upgrade.

--
Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
Re: RFC: Logging plan of the running query
Hi, Attached a rebased version of the patch. -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.From 112467238f585bb3398d86ac6e1a71caa6549fb4 Mon Sep 17 00:00:00 2001 From: Atsushi Torikoshi Date: Thu, 24 Apr 2025 20:50:41 +0900 Subject: [PATCH v44] Add function to log the plan of the currently running query Currently, we have to wait for the query execution to finish to know its plan either using EXPLAIN ANALYZE or auto_explain. This is not so convenient, for example when investigating long-running queries on production environments. To improve this situation, this patch adds pg_log_query_plan() function that requests to log the plan of the currently executing query. On receipt of the request, codes for logging plan is wrapped to every ExecProcNode and when the executor runs one of ExecProcNode, the plan is actually logged. These wrappers are unwrapped when once the plan is logged. In this way, we can avoid adding the overhead which we'll face when adding CHECK_FOR_INTERRUPTS() like mechanisms in somewhere in executor codes safely. By default, only superusers are allowed to request to log the plans because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. --- contrib/auto_explain/auto_explain.c | 24 +- doc/src/sgml/func.sgml | 55 +++ src/backend/access/transam/xact.c| 7 + src/backend/catalog/system_functions.sql | 2 + src/backend/commands/Makefile| 1 + src/backend/commands/dynamic_explain.c | 368 +++ src/backend/commands/explain.c | 38 +- src/backend/commands/meson.build | 1 + src/backend/executor/execMain.c | 10 + src/backend/storage/ipc/procsignal.c | 4 + src/backend/tcop/postgres.c | 4 + src/backend/utils/init/globals.c | 2 + src/include/catalog/pg_proc.dat | 6 + src/include/commands/dynamic_explain.h | 26 ++ src/include/commands/explain.h | 5 + src/include/commands/explain_state.h | 1 + src/include/miscadmin.h | 1 + src/include/nodes/execnodes.h| 3 + src/include/storage/procsignal.h | 2 + src/include/tcop/pquery.h| 1 - src/test/regress/expected/misc_functions.out | 54 ++- src/test/regress/sql/misc_functions.sql | 41 ++- 22 files changed, 613 insertions(+), 43 deletions(-) create mode 100644 src/backend/commands/dynamic_explain.c create mode 100644 src/include/commands/dynamic_explain.h diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c index cd6625020a..e720ddf39f 100644 --- a/contrib/auto_explain/auto_explain.c +++ b/contrib/auto_explain/auto_explain.c @@ -15,6 +15,7 @@ #include #include "access/parallel.h" +#include "commands/dynamic_explain.h" #include "commands/explain.h" #include "commands/explain_format.h" #include "commands/explain_state.h" @@ -412,26 +413,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc) es->format = auto_explain_log_format; es->settings = auto_explain_log_settings; - ExplainBeginOutput(es); - ExplainQueryText(es, queryDesc); - ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length); - ExplainPrintPlan(es, queryDesc); - if (es->analyze && auto_explain_log_triggers) -ExplainPrintTriggers(es, queryDesc); - if (es->costs) -ExplainPrintJITSummary(es, queryDesc); - ExplainEndOutput(es); - - /* Remove last line break */ - if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n') -es->str->data[--es->str->len] = '\0'; - - /* Fix JSON to output an object */ - if (auto_explain_log_format == EXPLAIN_FORMAT_JSON) - { -es->str->data[0] = '{'; -es->str->data[es->str->len - 1] = '}'; 
- } + ExplainStringAssemble(es, queryDesc, auto_explain_log_format, + auto_explain_log_triggers, + auto_explain_log_parameter_max_length); /* * Note: we rely on the existing logging of context or diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 574a544d9f..13dc0a7745 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -28822,6 +28822,29 @@ acl | {postgres=arwdDxtm/postgres,foo=r/postgres} + + + + pg_log_query_plan + +pg_log_query_plan ( pid integer ) +boolean + + +Requests to log the plan of the query currently running on the +backend with specified process ID. +It will be logged at LOG message level and +will appear in the server log based on the log +configuration set (See +for more information), but will not be sent to the client +regardless of . + + +This function is restricted to sup
Re: pg_upgrade-breaking release
On Thu, Apr 24, 2025 at 8:37 AM Bruce Momjian wrote:
> When I wrote pg_upgrade, I assumed at some point the value of changing the
> storage format would outweigh the value of allowing in-place upgrades. I
> guess that hasn't happened yet.

It reminds me of TDE, which is a good example of that, in a way. It requires pg_dump or logical replication to go from unencrypted -> encrypted. If core were to have something like that, I guess pg_upgrade could technically support it, but it would be equivalent to pg_dump. We've all been spoiled by that awesome --link option! :)

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
Re: pg_upgrade-breaking release
On Thu, Apr 24, 2025 at 08:51:41AM -0400, Greg Sabino Mullane wrote:
> On Thu, Apr 24, 2025 at 8:37 AM Bruce Momjian wrote:
>> When I wrote pg_upgrade, I assumed at some point the value of changing the
>> storage format would outweigh the value of allowing in-place upgrades. I
>> guess that hasn't happened yet.
>
> It reminds me of TDE, which is a good example of that, in a way. It requires
> pg_dump or logical replication to go from unencrypted -> encrypted. If core
> were to have something like that, I guess pg_upgrade could technically support
> it, but it would be equivalent to pg_dump. We've all been spoiled by that
> awesome --link option! :)

True on spoiled, but I think TDE is being held back by code complexity, not upgrade complexity.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
Re: Improve verification of recovery_target_timeline GUC.
On 2/14/25 02:42, Michael Paquier wrote: On Fri, Jan 24, 2025 at 01:36:45PM +, David Steele wrote: + timeline = strtoull(*newval, &endp, 0); + + if (*endp != '\0' || errno == EINVAL || errno == ERANGE) { GUC_check_errdetail("\"recovery_target_timeline\" is not a valid number."); return false; } Why not using strtou64() here? That's more portable depending on SIZEOF_LONG (aka the 4 bytes on WIN32 vs 8 bytes for the rest of the world). Done. + + if (timeline < 2 || timeline > UINT_MAX) + { A recovery_target_timeline of 1 is a valid value, isn't it? + GUC_check_errdetail( + "\"recovery_target_timeline\" must be between 2 and 4,294,967,295."); + return false; + } I would suggest to rewrite that as \"%s\" must be between %u and %u, which would be more generic for translations in case there is an overlap with something else. Done. This means there are no commas in the upper bound but I don't think it's a big deal and it more closely matches other GUC messages. +$logfile = slurp_file($node_standby->logfile()); +ok($logfile =~ qr/must be between 2 and 4,294,967,295/, + 'target timelime out of max range'); These sequences of tests could be improved: - Let's use start locations for the logs slurped, reducing the scope of the logs scanned. Done. - Perhaps it would be better to rely on wait_for_log() to make sure that the expected log contents are generated without any kind of race condition? Done. I'm also now using a single cluster for all three tests rather than creating a new one for each test. This is definitely more efficient. Having said that, I think these tests are awfully expensive for a single GUC. Unit tests would definitely be preferable but that's not an option for GUCs, so far as I know. Thanks, -DavidFrom 2ccf6f207a14dafe1a426208618ae7690c4513a2 Mon Sep 17 00:00:00 2001 From: David Steele Date: Thu, 24 Apr 2025 15:11:51 + Subject: Improve verification of recovery_target_timeline GUC. Currently check_recovery_target_timeline() converts any value that is not current, latest, or a valid integer to 0. So for example: recovery_target_timeline = 'currrent' results in the following error: FATAL: 22023: recovery target timeline 0 does not exist Since there is no range checking for uint32 (but there is a cast from unsigned long) this setting: recovery_target_timeline = '99' results in the following error: FATAL: 22023: recovery target timeline 1410065407 does not exist Improve by adding endptr checking to catch conversion errors and add range checking to exclude values < 2 and greater than UINT_MAX. 
--- src/backend/access/transam/xlogrecovery.c | 17 +-- src/test/recovery/t/003_recovery_targets.pl | 50 + 2 files changed, 64 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 6ce979f2d8b..e82aa739d79 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -4994,13 +4994,24 @@ check_recovery_target_timeline(char **newval, void **extra, GucSource source) rttg = RECOVERY_TARGET_TIMELINE_LATEST; else { + char *endp; + uint64 timeline; rttg = RECOVERY_TARGET_TIMELINE_NUMERIC; errno = 0; - strtoul(*newval, NULL, 0); - if (errno == EINVAL || errno == ERANGE) + timeline = strtou64(*newval, &endp, 0); + + if (*endp != '\0' || errno == EINVAL || errno == ERANGE) + { + GUC_check_errdetail("\"%s\" is not a valid number.", + "recovery_target_timeline"); + return false; + } + + if (timeline < 1 || timeline > UINT_MAX) { - GUC_check_errdetail("\"recovery_target_timeline\" is not a valid number."); + GUC_check_errdetail("\"%s\" must be between %u and %u.", + "recovery_target_timeline", 1, UINT_MAX); return false; } } diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl index 0ae2e982727..bd701d04f99 100644 --- a/src/test/recovery/t/003_recovery_targets.pl +++ b/src/test/recovery/t/003_recovery_targets.pl @@ -187,4 +187,54 @@ ok( $logfile =~ qr/FATAL: .* recovery ended before configured recovery target was reached/, 'recovery end before target reached is a fatal error'); +# Invalid timeline target +$node_standby = Postg
Re: sslmode=secure by default (Re: Making sslrootcert=system work on Windows psql)
Re: George MacKerron
>> Before we can make this change, I think we would have to improve the
>> UX. psql does not even have any --switch for it. PostgreSQL serving
>> non-SSL and SSL on the same port doesn't make the UX better... :-/
>
> How do you think the UX could be improved? Maybe by using a psql switch
> and/or an env var to opt out of (or initially even to opt into) the new
> sslmode treatment?

The env var is already there (PGSSLMODE).

Now you can say `psql -h db.example.com -p 5433 dbfoo`, but for specifying the sslmode, you have to rewrite at least the last argument to use connection string syntax, `psql "dbname=dbfoo sslmode=verify-full"`. This needs to be less cumbersome.

(And the names of the options make me want to stay away from them, require/verify-ca/verify-full/verify-confusing. Your sslmode=secure idea is really good.)

It should be as simple as

  psql --ssl        (= sslmode=secure)
  psql --insecure   (the old sslmode=require)
  psql --no-ssl     (= sslmode=disable)

psql -s and -S are unfortunately already taken :-/

For connection strings, perhaps the best action is to tell people that always including "sslmode=something" is best practice. For libpq-style key=value connection strings, that wouldn't even be ugly. For postgresql://-style strings, we would ideally have something like http:// vs https://, but I am not sure how to squeeze that into the syntax. (Appending ?sslmode= works, but meh.)

Christoph
pg_upgrade-breaking release
Do we think most people are _not_ going to use pg_upgrade now that we are defaulting to checksums being enabled in PG 18? And if so, do we think we are ever going to have a storage-format-changing release where pg_upgrade cannot be used?

Maybe it is too late to ask this, but I am asking anyway.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
Re: MergeAppend could consider sorting cheapest child path
On 3/28/25 09:19, Alexander Pyhalov wrote:
> Andy Fan wrote on 2024-10-17 03:33:
>> I've updated patch.
>
> One more interesting case which we found - when a fractional path is
> selected, it can still be more expensive than the sorted cheapest total
> path (as we look only at indexes with the necessary pathkeys, not at
> indexes which allow fetching the data efficiently). So far I couldn't
> find an artificial example, but we've seen inadequate index selection
> due to this issue - instead of using an index suited for the filters in
> WHERE, an index suitable for sorting was selected as the one having the
> cheapest fractional cost.

I think it is necessary to generalise the approach a little.

Each MergeAppend subpath candidate that fits the pathkeys should be compared to the overall-optimal path + explicit Sort node. Let's label this two-node composition as base_path. There are three cases: startup-optimal, total-optimal and fractional-optimal. In the startup case, base_path should use cheapest_startup_path, in the total-optimal case cheapest_total_path, and for the fractional case we may employ the get_cheapest_fractional_path routine to detect the proper base_path.

It may provide a more effective plan in full, fractional and partial scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or async Append.
2. When a minor set of subpaths doesn't have a proper index, and it is profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, I see that the cost model prefers a series of small sortings instead of a single massive one. Maybe it will show some profit for execution time. I am not afraid of any palpable planning overhead here because we just do cheap cost estimation and comparison operations that don't need additional memory allocations. The caller is responsible for building a proper Sort node if this method is chosen as optimal.

In the attachment, see the patch written according to the idea. There I introduced two new routines:
get_cheapest_path_for_pathkeys_ext
get_cheapest_fractional_path_for_pathkeys_ext

I have designed the code that way to reduce duplicated code in the generate_orderedappend_paths routine. But the main point is that I envision these new routines may be reused elsewhere.

This feature looks close to the one we discussed before [1]. So, let me CC the people from the previous conversation to the discussion.

[1] https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com

--
regards, Andrei Lepikhov

From 98306f0e14c12b6dee92ef5977d85fc1dd324898 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov"
Date: Thu, 24 Apr 2025 14:03:02 +0200
Subject: [PATCH v0] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser search scope a little: fetching optimal subpath matching pathkeys of the planning MergeAppend, consider the extra case of overall-optimal path plus explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or async Append.
2. The case is when a minor set of subpaths doesn't have a proper index, and it is profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, I see that the cost model prefers a series of small sortings instead of a single massive one.

Overhead:
It seems multiple subpaths may be encountered, as well as many pathkeys.
So, to be as careful as possible here, only cost estimation is performed. --- .../postgres_fdw/expected/postgres_fdw.out| 6 +- src/backend/optimizer/path/allpaths.c | 35 ++--- src/backend/optimizer/path/pathkeys.c | 112 ++ src/include/optimizer/paths.h | 10 ++ .../regress/expected/collate.icu.utf8.out | 9 +- src/test/regress/expected/inherit.out | 19 ++- src/test/regress/expected/partition_join.out | 139 +++--- src/test/regress/expected/partition_prune.out | 16 +- src/test/regress/expected/union.out | 6 +- src/test/regress/sql/inherit.sql | 4 +- 10 files changed, 255 insertions(+), 101 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index d1acee5a5fa..11b42f18cb6 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -10277,13 +10277,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a -> Nested Loop Join Filter: (t1.a = t2.b) -> Append - -> Foreign Scan on ftprt1_p1 t1_1 + -> Sort + Sort Key: t1_1.a + -> Foreign Scan on ftprt1_p1 t1_1 -> Foreign Scan on ftprt1_p2 t1_2 -> Materialize -> Append
Re: extension_control_path and "directory"
On Thu, Apr 24, 2025 at 7:21 AM Christoph Berg wrote: > > Re: Matheus Alcantara > > I've tested with the semver extension and it seems to work fine with > > this patch. Can you please check on your side to see if it's also > > working? > > Hi Matheus, > > thanks for the patch, it does indeed work. > Thanks for testing and for reviewing. > diff --git a/src/backend/commands/extension.c > b/src/backend/commands/extension.c > index 180f4af9be3..d68efd59118 100644 > --- a/src/backend/commands/extension.c > +++ b/src/backend/commands/extension.c > @@ -376,6 +376,14 @@ get_extension_control_directories(void) > > /* Substitute the path macro if needed */ > mangled = substitute_path_macro(piece, "$system", > system_dir); > + > + /* > +* Append "extension" suffix in case is a custom > extension control > +* path. > +*/ > + if (strcmp(piece, "$system") != 0) > + mangled = psprintf("%s/extension", mangled); > > This would look prettier if it was something like > > mangled = substitute_path_macro(piece, "$system", > system_dir "/extension"); > > ... but I'm wondering if it wouldn't be saner if the control path > should be stored without "extension" in that struct. Then opening the > control file would be path + "extension/" + filename and the extra > directory would be path + directory, without any on-the-fly stripping > of trailing components. > > The extension_control_path GUC could also be adjusted to refer to the > directory one level above the extension/foo.control location. > Storing the control path directly without any code to remove the /extension at the end would be more trick I think, because we would need to change the find_in_path() function to return the path without the suffix. In this new version I've introduced a new "basedir" field on ExtensionControlFile so that we can save the base directory to search for .control files and scripts. With this new field, on get_extension_script_directory() we just need to join control->basedir with control->directory. Note that we still need to handle the removal of the /extension at the end of control path but I think that on this new version the code looks a bit better (IMHO) since we just need to handle on find_extension_control_filename(). WYT? > > + /* > +* Assert that the control->control_dir end with /extension suffix so > that > +* we can replace with the value from control->directory. > +*/ > + Assert(ctrldir_len >= suffix_len && > + strcmp(control->control_dir + ctrldir_len - suffix_len, > "extension") == 0); > > If control_dir is coming from extension_control_path, it might have a > different suffix. Replace the Assert by elog(ERROR). (Or see above.) > In v2 I've moved the logic to remove the /extension to parse_extension_control_file(), do you think that this Assert on this function would still be wrong? IIUC we should always have /extension at the end of "control_dir" at this place, because the extension_control_path GUC will omit the /extension at the end and we will force it to have the suffix on the path at find_extension_control_filename() and get_extension_control_directories() functions. I'm missing something here? I've also included some more TAP tests on this new version. -- Matheus Alcantara v2-0001-Make-directory-work-with-extension-control-path.patch Description: Binary data
Re: Enable data checksums by default
On 23.04.25 00:24, Tomas Vondra wrote:
> The patch that flips the default has been committed. I also started a
> PG18 open items page and made a note that we follow up on the upgrade
> experience, as was discussed in this thread.
>
> https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
>
> Regarding the open item, can someone explain what exactly are we
> planning to evaluate mid-beta?

If you have a PG <=17 installation without checksums (the default), and you do the usual upgrade procedure to PG 18 involving initdb + pg_upgrade, then pg_upgrade will reject the upgrade, because the checksum settings don't match. The workaround is to run initdb with --no-data-checksums and try again.

That's probably not all that bad, but if this is all below a bunch of layers of scripts, users will have to do some work on their end to get this working smoothly.

The point of the open item was (a) to make sure this is adequately documented, for instance in the release notes, (b) to think about technological solutions to simplify this, such as [0], and (c) to just check the general feedback.

Nothing from [0] ended up being committed, so that part is obsolete. The action for beta1 is (a). And then for (c) perhaps monitor the feedback between beta1 and beta2.

[0]: https://www.postgresql.org/message-id/flat/57957aca-3eae-4106-afb2-3008122b9950%40eisentraut.org
Re: Disallow redundant indexes
On Thu, Apr 24, 2025 at 7:31 AM David Rowley wrote:
> On Thu, 24 Apr 2025 at 21:27, Japin Li wrote:
>> I propose that PostgreSQL prevent redundant index creation by:
>
> In any case, who are we to define what a duplicate index is?

I think this part is easier than you make it sound: everything (except the name) is exactly the same as an existing index. That's the 99% case we are trying to catch here.

I've had this idea before, and even wrote a quick POC at one point, but I had it simply throw a warning rather than an error. That avoids the need for any GUC, which I agree is not a good idea. And it still allows people to create a duplicate index if they really want to.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
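To make the "99% case" concrete, here is a sketch with invented table and index names. The two CREATE INDEX statements differ only in the index name; today the second one quietly builds another identical index, and the idea above is that it would instead emit a WARNING (or, per the original proposal, an error):

    CREATE TABLE orders (id bigint PRIMARY KEY, customer_id bigint);

    CREATE INDEX orders_customer_idx ON orders (customer_id);
    -- identical definition, different name: the redundant case a warning would catch
    CREATE INDEX orders_customer_idx_dup ON orders (customer_id);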
Re: What's our minimum supported Python version?
On Thu, 24 Apr 2025 at 10:54, Peter Eisentraut wrote:
> The cut-off in practice for these things is usually RHEL. PG18
> currently still supports RHEL 7, which appears to come with Python 3.6.
> Seeing that the small problem with the test script was easily fixed, I
> think we should stick with that for now. There might be value in
> dropping support for RHEL 7 sometime, but that should be done across the
> board (meson version, openssl version, perl version, etc.), and requires
> some buildfarm changes, so I wouldn't do this right now.

RHEL7 is not supported by Devrim for PG16 and PG17 already[1]. Did you mean something different with "PG18 currently still supports RHEL 7"?

[1]: https://yum.postgresql.org/news/rhel7-postgresql-rpms-end-of-life/
Re: sslmode=secure by default (Re: Making sslrootcert=system work on Windows psql)
On 24.04.25 12:53, Christoph Berg wrote:
> Now you can say `psql -h db.example.com -p 5433 dbfoo`, but for
> specifying the sslmode, you have to rewrite at least the last argument
> to use connection string syntax, `psql "dbname=dbfoo sslmode=verify-full"`.
> This needs to be less cumbersome.
>
> (And the names of the options make me want to stay away from them,
> require/verify-ca/verify-full/verify-confusing. Your sslmode=secure
> idea is really good.)

I'm generally in favor of making sslmode=verify-full the effective default somehow.

Another detail to think about is how this affects psql -h localhost. In principle, this should require full SSL, but you're probably not going to have certificates that allow "localhost". And connections to localhost are the default on Windows.

We could also switch the Windows default to Unix-domain sockets. But there are probably still other reasons why connections to TCP/IP localhost are made.

Some things to think about.
Re: Disallow redundant indexes
Hi, Greg and David

Thank you for your feedback.

On Thu, 24 Apr 2025 at 08:26, Greg Sabino Mullane wrote:
> On Thu, Apr 24, 2025 at 7:31 AM David Rowley wrote:
>> On Thu, 24 Apr 2025 at 21:27, Japin Li wrote:
>>> I propose that PostgreSQL prevent redundant index creation by:
>>
>> In any case, who are we to define what a duplicate index is?

You're absolutely right. Defining a duplicate index is indeed simpler than I initially described.

> I think this part is easier than you make it sound: everything (except the
> name) is exactly the same as an existing index. That's the 99% case we are
> trying to catch here.

As Greg pointed out, if everything except the name is identical to an existing index, it should be considered a duplicate in most cases (the 99% case). This is precisely the scenario I'm aiming to prevent.

> I've had this idea before, and even wrote a quick POC at one point, but I
> had it simply throw a warning rather than an error. That avoids the need
> for any GUC, which I agree is not a good idea. And it still allows people
> to create a duplicate index if they really want to.

I also appreciate your suggestion regarding the GUC parameter. You've convinced me that a warning might be a more appropriate approach. A warning would still alert users to the potential issue of creating a redundant index, while allowing them to proceed if they have a specific reason to do so.

--
Regards,
Japin Li
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Tender Wang wrote on Mon, Apr 14, 2025 at 14:17:
> Hi,
>
> While debugging the hash join code, in estimate_multivariate_bucketsize(),
> I found that the list_copy(hashclauses) below is unnecessary if we have a
> single join clause.
>
>     List *clauses = list_copy(hashclauses);
>     ...
>
> I adjusted the place of the list_copy() call as in the attached patch.
> This can save some overhead of function calls and memory copies.
>
> Any thoughts?

Hi Alexander,

In the last thread, I found a minor optimization for the code in estimate_multivariate_bucketsize(). By adjusting the place of the list_copy() call at the start of estimate_multivariate_bucketsize(), we can avoid unnecessarily creating a new list and copying memory when there is only a single hash clause.

Do you think it's worth doing this?

--
Thanks,
Tender Wang
Re: Showing applied extended statistics in explain Part 2
On 2/10/25 16:56, Tomas Vondra wrote:
> On 2/10/25 10:09, Andrei Lepikhov wrote:
>> On 8/2/2025 20:50, Tomas Vondra wrote:
>>
>> The main profit here - you see all the stats involved in estimations
>> (and their state), even if the final plan doesn't contain the estimated
>> stuff at all.
>
> OK, that seems very underwhelming. I still think we should show which
> clauses were estimated using which statistics object.

To understand how it may work, I employed the EXPLAIN extensibility introduced in PG 18 to show the use of plain statistics [1]. It looks like the following:

EXPLAIN (COSTS OFF, STAT ON) SELECT * FROM sc_a WHERE x=1 AND y LIKE 'a';
 Seq Scan on sc_a
   Filter: ((y ~~ 'a'::text) AND (x = 1))
 Statistics:
   "sc_a.y: 1 times, stats: { MCV: 10 values, Correlation, ndistinct: 10., nullfrac: 0., width: 5 }
   "sc_a.x: 1 times, stats: { Histogram: 0 values, Correlation, ndistinct: -1., nullfrac: 0., width: 4 }

As you can see, stat usage is summarised at the end of the EXPLAIN. It contains information about the column, how many times it was used, and the parameters of the statistic slots. Of course, being an extension it is constrained a lot, but even so there is profit:
1. You can see which types of statistics exist on the column.
2. The precision of the histogram or MCV (statistic_target) on a specific table - some users forget to increase it on large (or partitioned) tables.
3. You have basic stats like nullfrac and ndistinct without the necessity to teach personnel how to gather them safely on a production instance.

Also, using it in real cases, I realised that it would be highly profitable to specify which statistic type was used to estimate each specific clause. Of course, extended statistics have their own specifics, which may require another output format. Just consider this example too.

[1] https://github.com/danolivo/pg_index_stats

--
regards, Andrei Lepikhov
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi Alexander, Thank you for the review. I apologize for a late reply. I missed your email. > 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need > to advance the position of WAL needed by replication slots, the usage > pattern probably could be changed. Thus, we probably need to call > ReplicationSlotsComputeRequiredLSN() somewhere after change of > restart_lsn_flushed while restart_lsn is not changed. And probably > can skip ReplicationSlotsComputeRequiredLSN() in some cases when only > restart_lsn is changed. Yes, it is a good idea for investigation, thank you! I guess, It may work for persistent slots but I'm not sure about other types of slots (ephemeral and temporary). I have no clear understanding of consequences at the moment. I propose to postpone it for future, because the proposed changes will me more invasive. > 2) I think it's essential to include into the patch test caches which > fail without patch. You could start from integrating [1] test into > your patch, and then add more similar tests for different situations. The problem with TAP tests - it is hard to reproduce without injection points. The Tomas Vondra's tests create two new injection points. I have to add more injection points for new tests as well. Injection points help to test the code but make the code unreadable. I'm not sure, what is the policy of creating injection points? Personally, I would not like to add new injection points only to check this particular rare case. I'm trying to create a test without injection points that should fail occasionally, but I haven't succeeded at the moment. I have a question - is there any interest to backport the solution into existing major releases? I can prepare a patch where restart_lsn_flushed stored outside of ReplicationSlot structure and doesn't affect the existing API. With best regards, Vitaly On Friday, April 04, 2025 06:22 MSK, Alexander Korotkov wrote: > Hi, Vitaly! > > On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov > wrote: > > The slot data is flushed to the disk at the beginning of checkpoint. If > > an existing slot is advanced in the middle of checkpoint execution, its > > advanced restart LSN is taken to calculate the oldest LSN for WAL > > segments removal at the end of checkpoint. If the node is restarted just > > after the checkpoint, the slots data will be read from the disk at > > recovery with the oldest restart LSN which can refer to removed WAL > > segments. > > > > The patch introduces a new in-memory state for slots - > > flushed_restart_lsn which is used to calculate the oldest LSN for WAL > > segments removal. This state is updated every time with the current > > restart_lsn at the moment, when the slot is saving to disk. > > Thank you for your work on this subject. I think generally your > approach is correct. When we're truncating the WAL log, we need to > reply on the position that would be used in the case of server crush. > That is the position flushed to the disk. > > While your patch is generality looks good, I'd like make following notes: > > 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need > to advance the position of WAL needed by replication slots, the usage > pattern probably could be changed. Thus, we probably need to call > ReplicationSlotsComputeRequiredLSN() somewhere after change of > restart_lsn_flushed while restart_lsn is not changed. And probably > can skip ReplicationSlotsComputeRequiredLSN() in some cases when only > restart_lsn is changed. 
> 2) I think it's essential to include into the patch test caches which > fail without patch. You could start from integrating [1] test into > your patch, and then add more similar tests for different situations. > > Links. > 1. > https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me > > -- > Regards, > Alexander Korotkov > Supabase
Re: What's our minimum supported Python version?
On 24.04.25 13:16, Jelte Fennema-Nio wrote:
> On Thu, 24 Apr 2025 at 10:54, Peter Eisentraut wrote:
>> The cut-off in practice for these things is usually RHEL. PG18
>> currently still supports RHEL 7, which appears to come with Python 3.6.
>> Seeing that the small problem with the test script was easily fixed, I
>> think we should stick with that for now. There might be value in
>> dropping support for RHEL 7 sometime, but that should be done across the
>> board (meson version, openssl version, perl version, etc.), and requires
>> some buildfarm changes, so I wouldn't do this right now.
>
> RHEL7 is not supported by Devrim for PG16 and PG17 already[1]. Did you
> mean something different with "PG18 currently still supports RHEL 7"?
>
> [1]: https://yum.postgresql.org/news/rhel7-postgresql-rpms-end-of-life/

There are approximately 6 buildfarm members with RHEL 7 or CentOS 7, so in that sense, "support" means that everything is expected to work there. And some amount of work has been done recently to keep that support alive.

(But I'm confused now because I see Python 3.6 on both the RHEL 7 and the RHEL 8 animals. So it's not clear how representative these animals are especially with respect to the Python versions.)
Postmaster fails to shut down right after crash restart
Hello, While developing a patch and running regression tests I noticed that the postmaster could fail to shut down right after crash restart. It could get stuck in the PM_WAIT_BACKENDS state forever. As far as I understand, the problem occurs when a shutdown signal is received before getting PMSIGNAL_RECOVERY_STARTED from the startup process. In that case the FatalError flag is not cleared, and the postmaster is stuck in PM_WAIT_BACKENDS waiting for the checkpointer, which ignores SIGTERM. To easily reproduce the problem I added pg_usleep in xlogrecovery.c just before SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED). See the patch attached. Then I run a script that simulates a crash and does pg_ctl stop: $ ./init.sh [...] $ ./stop-after-crash.sh waiting for server to start done server started waiting for server to shut down... failed pg_ctl: server does not shut down Some processes are still alive: $ ps uf -C postgres USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND sergey279874 0.0 0.0 222816 28560 ?Ss 14:25 0:00 /home/sergey/pgwork/devel/install/bin/postgres -D data sergey279887 0.0 0.0 222772 5664 ?Ss 14:25 0:00 \_ postgres: io worker 0 sergey279888 0.0 0.0 222772 5664 ?Ss 14:25 0:00 \_ postgres: io worker 1 sergey279889 0.0 0.0 222772 5664 ?Ss 14:25 0:00 \_ postgres: io worker 2 sergey279891 0.0 0.0 222884 8480 ?Ss 14:25 0:00 \_ postgres: checkpointer Here is an excerpt from the debug log: postmaster[279874] LOG: all server processes terminated; reinitializing startup[279890] LOG: database system was interrupted; last known up at 2025-04-24 14:25:58 MSK startup[279890] LOG: database system was not properly shut down; automatic recovery in progress postmaster[279874] DEBUG: postmaster received shutdown request signal postmaster[279874] LOG: received fast shutdown request postmaster[279874] DEBUG: updating PMState from PM_STARTUP to PM_STOP_BACKENDS postmaster[279874] DEBUG: sending signal 15/SIGTERM to background writer process with pid 279892 postmaster[279874] DEBUG: sending signal 15/SIGTERM to checkpointer process with pid 279891 postmaster[279874] DEBUG: sending signal 15/SIGTERM to startup process with pid 279890 postmaster[279874] DEBUG: sending signal 15/SIGTERM to io worker process with pid 279889 postmaster[279874] DEBUG: sending signal 15/SIGTERM to io worker process with pid 279888 postmaster[279874] DEBUG: sending signal 15/SIGTERM to io worker process with pid 279887 postmaster[279874] DEBUG: updating PMState from PM_STOP_BACKENDS to PM_WAIT_BACKENDS startup[279890] LOG: invalid record length at 0/175A4D8: expected at least 24, got 0 postmaster[279874] DEBUG: postmaster received pmsignal signal startup[279890] LOG: redo is not required checkpointer[279891] LOG: checkpoint starting: end-of-recovery immediate wait checkpointer[279891] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.007 s, sync=0.002 s, total=0.026 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/175A4D8, redo lsn=0/175A4D8 startup[279890] DEBUG: exit(0) postmaster[279874] DEBUG: updating PMState from PM_WAIT_BACKENDS to PM_WAIT_BACKENDS checkpointer[279891] DEBUG: checkpoint skipped because system is idle checkpointer[279891] DEBUG: checkpoint skipped because system is idle I don't know how to fix this, but thought it's worth reporting. 
Best regards,

--
Sergey Shinderuk
https://postgrespro.com/

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..19ee8b09685 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1696,7 +1696,10 @@ PerformWalRecovery(void)
 	 * archiver if necessary.
 	 */
 	if (IsUnderPostmaster)
+	{
+		pg_usleep(300);
 		SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+	}
 
 	/*
 	 * Allow read-only connections immediately if we're consistent already.

init.sh
Description: application/shellscript

stop-after-crash.sh
Description: application/shellscript

logfile.gz
Description: application/gzip
Re: pg_upgrade-breaking release
On Thu, Apr 24, 2025 at 08:35:10AM -0400, Greg Sabino Mullane wrote:
> On Thu, Apr 24, 2025 at 8:12 AM Bruce Momjian wrote:
>> Do we think most people are _not_ going to use pg_upgrade now that we
>> are defaulting to checksums being enabled by default in PG 18?
>
> I cannot imagine this would stop anyone from upgrading. It's one additional
> flag, which was already a requirement for those going the other direction in
> the past (checksums -> no checksums). And I also assume that "most people" are
> already running with checksums enabled.
>
>> And if so, do we think we are ever going to have a
>> storage-format-changing release where pg_upgrade cannot be used?
>
> Seems very unlikely, that would kind of go against the whole purpose of
> pg_upgrade.

When I wrote pg_upgrade, I assumed at some point the value of changing the storage format would outweigh the value of allowing in-place upgrades. I guess that hasn't happened yet.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
On Wed, Apr 23, 2025 at 9:35 PM Masahiko Sawada wrote: > > On Wed, Apr 23, 2025 at 5:46 AM Amit Kapila wrote: > > > > BTW, did we consider the idea to automatically transition to 'logical' > > when the first logical slot is created and transition back to > > 'replica' when last logical slot gets dropped? I see some ideas around > > this last time we discussed this topic. > > Yes. Bertrand pointed out that a drawback is that the primary server > needs to create a logical slot in order to execute logical decoding on > the standbys[1]. > True, but if we want to avoid that, we can still keep 'logical' as wal_level for the ease of users. We can also have another API like the one you originally proposed (pg_activate_logical_decoding) for the ease of users. But the minimum requirement would be that one creates a logical slot to enable logical decoding/replication. Additionally, shall we do some benchmarking, if not done already, to show the cases where the performance and WAL volume can hurt users if we make wal_level as 'logical'? -- With Regards, Amit Kapila.
Re: vacuumdb --missing-stats-only and pg_upgrade from PG13
Re: Nathan Bossart > Here is what I have staged for commit. I'll aim to commit these patches > sometime next week to give time for additional feedback. I confirm my PG13 test table gets analyzed now with the patch. Christoph
Re: pg_upgrade-breaking release
On 2025-Apr-24, Bruce Momjian wrote: > Do we think most people are _not_ going to use pg_upgrade now that we > are defaulting to checksums being enabled by default in PG 18? And if > so, do we think we are ever going to have a storage-format-changing > release where pg_upgrade cannot be used? Peter E posted a patch that allowed pg_upgrade to migrate (rewrite) files from non-checksummed to checksummed, but he appears to have given up on it for this release given an apparent lack of interest. https://postgr.es/m/57957aca-3eae-4106-afb2-3008122b9...@eisentraut.org -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: pg_upgrade-breaking release
On Thu, Apr 24, 2025 at 04:51:08PM +0200, Álvaro Herrera wrote: > On 2025-Apr-24, Bruce Momjian wrote: > > > Do we think most people are _not_ going to use pg_upgrade now that we > > are defaulting to checksums being enabled by default in PG 18? And if > > so, do we think we are ever going to have a storage-format-changing > > release where pg_upgrade cannot be used? > > Peter E posted a patch that allowed pg_upgrade to migrate (rewrite) > files from non-checksummed to checksummed, but he appears to have given > up on it for this release given an apparent lack of interest. > https://postgr.es/m/57957aca-3eae-4106-afb2-3008122b9...@eisentraut.org Yeah, I saw that, and I think we could do something similar for TDE if we ever had it. I think we are suggesting people just do offline addition of checksums rather than trying to do something fancy with pg_upgrade. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
Re: [PATCH] dynahash: add memory allocation failure check
> On 24 Apr 2025, at 00:42, Tom Lane wrote:
>
> -	hashp = (HTAB *) DynaHashAlloc(sizeof(HTAB) + strlen(tabname) + 1);
> +	hashp = (HTAB *) MemoryContextAlloc(CurrentDynaHashCxt,
> +										sizeof(HTAB) + strlen(tabname) + 1);

This seems correct to me.

While fixing this, maybe use MemoryContextAllocZero() instead of the subsequent MemSet()? But this might start a chain of unnecessary beautifications, like DynaHashAlloc() calling Assert(MemoryContextIsValid(CurrentDynaHashCxt)) just before MemoryContextAllocExtended() repeats the same exercise.

Best regards, Andrey Borodin.
Re: pg_upgrade-breaking release
On Thu, Apr 24, 2025 at 08:37:56AM -0400, Bruce Momjian wrote: > On Thu, Apr 24, 2025 at 08:35:10AM -0400, Greg Sabino Mullane wrote: > > On Thu, Apr 24, 2025 at 8:12 AM Bruce Momjian wrote: > > > > Do we think most people are _not_ going to use pg_upgrade now that we > > are defaulting to checksums being enabled by default in PG 18? > > > > > > I cannot imagine this would stop anyone from upgrading. It's one additional > > flag Yeah. We've had this before, with integer datetimes and others. People will notice the friction, but v18 won't be a shock in this respect. > > And if so, do we think we are ever going to have a > > storage-format-changing release where pg_upgrade cannot be used? > > > > > > Seems very unlikely, that would kind of go against the whole purpose of > > pg_upgrade. > > When I wrote pg_upgrade, I assumed at some point the value of changing > the storage format would outweigh the value of allowing in-place > upgrades. I guess that hasn't happened yet. Having pg_upgrade has made PostgreSQL popular for applications with high availability and high data volumes, applications that would have ruled out PostgreSQL before pg_upgrade. In other words, adding pg_upgrade changed the expectations of PostgreSQL. A storage format break is less plausible now than it was in the early years of having pg_upgrade. I think a prerequisite for a storage format break would be a foolproof upgrade recipe based on logical replication techniques. The step having downtime would need to be no slower than pg_upgrade. Even with that, the double storage requirement isn't negligible. Hence, it wouldn't be a given that we'd regard the storage format break as sufficiently-valuable.
Re: [PATCH] dynahash: add memory allocation failure check
Andrey Borodin writes: > While fixing this maybe use MemoryContextAllocZero() instead of subsequent > MemSet()? I thought about that but intentionally left it as-is, because that would force zeroing of the space reserved for the hashtable name too. That's unnecessary, and since it'd often be odd-sized it might result in a less efficient fill loop. regards, tom lane
Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
On Thu, Apr 24, 2025 at 5:30 AM Amit Kapila wrote:
> On Wed, Apr 23, 2025 at 9:35 PM Masahiko Sawada wrote:
>> On Wed, Apr 23, 2025 at 5:46 AM Amit Kapila wrote:
>>>
>>> BTW, did we consider the idea to automatically transition to 'logical'
>>> when the first logical slot is created and transition back to
>>> 'replica' when the last logical slot gets dropped? I see some ideas around
>>> this last time we discussed this topic.
>>
>> Yes. Bertrand pointed out that a drawback is that the primary server
>> needs to create a logical slot in order to execute logical decoding on
>> the standbys[1].
>
> True, but if we want to avoid that, we can still keep 'logical' as
> wal_level for the ease of users.

I think we'd like to cover the use case where users start with 'replica' on the primary and execute logical decoding on the standby without creating a logical slot on the primary or restarting the primary.

> We can also have another API like the
> one you originally proposed (pg_activate_logical_decoding) for the
> ease of users. But the minimum requirement would be that one creates a
> logical slot to enable logical decoding/replication.

I think we want to avoid the runtime WAL level automatically decreasing to 'replica' once all logical slots are removed, if users still want to execute logical decoding only on the standby. One idea is that if users enable logical decoding using pg_activate_logical_decoding(), the runtime WAL level doesn't decrease to 'replica' even if all logical slots are removed. But it would require us to remember, in a permanent way, how logical decoding was enabled.

Also, I'm concerned that having three ways to enable logical decoding could confuse users: the wal_level GUC parameter, creating at least one logical slot, and pg_activate_logical_decoding().

> Additionally, shall we do some benchmarking, if not done already, to
> show the cases where the performance and WAL volume can hurt users if
> we make wal_level as 'logical'?

I believe it would be significant especially for REPLICA IDENTITY FULL tables. I agree it's worth benchmarking, but I guess the result would not convince us to make 'logical' the default.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
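For readers following the thread, the "create a slot to switch levels" idea would look like this from SQL. The functions are the existing ones (today the first call simply errors out unless wal_level is already 'logical'); the WAL-level side effects described in the comments are the proposed behaviour, not what any released version does:

    -- Creating the first logical slot would, under the proposal, raise the
    -- effective WAL level from 'replica' to 'logical' without a restart.
    SELECT pg_create_logical_replication_slot('my_slot', 'test_decoding');

    -- Dropping the last logical slot would lower it back to 'replica' --
    -- the automatic downgrade discussed above, which is undesirable when the
    -- only consumers are logical decoding sessions on a standby.
    SELECT pg_drop_replication_slot('my_slot');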
Re: Making sslrootcert=system work on Windows psql
On Wed, Apr 23, 2025 at 8:47 AM George MacKerron wrote: > I’d suggest two new special sslrootcert values: > > (1) sslrootcert=openssl > > This does exactly what sslrootcert=system does now, but is less confusingly > named for Windows users. sslrootcert=system becomes a deprecated synonym for > this option. Stealing the word "system" from the existing sslrootcert domain had at least two hazards: a) existing users might have a file named "system" that would now be ignored, and b) users might accidentally use sslrootcert=system on older versions of libpq, picking up an unexpected file named "system" and doing the Wrong Thing. Problem (a) can be worked around by saying "./system" instead, so honestly I wasn't too concerned about that, and I considered (b) to be more of a theoretical problem that was outweighed by the benefit of getting OpenSSL to just Do The Thing people wanted it to do. A couple years on, I think (b) is less theoretical than I had originally hoped. As evidence I point to Stack Overflow questions like [1], where both the asker and the answerer are a bit confused about how connection string versioning works. If we steal more words, I think that problem is going to get worse. So I'm leaning toward's Daniel's earlier position that sslrootcert has kind of run its course, and if you want to select OpenSSL stores, we need a more fully featured syntax and probably a completely new option to be able to pass that through safely. > (2) sslrootcert=os > > This does what I was proposing in my patch: it uses winstore on Windows and > behaves the same as sslrootcert=openssl elsewhere, where openssl *is* the > operating system SSL provider. Falling back to standard OpenSSL introduces the same hazard we're running into today, though -- what if someone creates a macstore [2] for OpenSSL, so that its behavior matches Safari's or whatever, and then everyone wonders why sslrootcert=os doesn't use that? If the abstraction must leak the details anyway, I think we should expose them directly instead. (As a small soapbox, I think "application-level" fallback for a trust chain is frequently going to lead to regret. You should ideally tell us what you want, and either get it or fail.) Thanks, --Jacob [1] https://stackoverflow.com/questions/77989772/psql-root-certificate-file-system-does-not-exist-why-sslrootcert-system-do [2] https://github.com/openssl/openssl/issues/23460
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond wrote: > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila wrote: > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada > > wrote: > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila > > > wrote: > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Approach 2 > > > > > > > -- > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover > > > > > > > together, a more > > > > > > > flexible strategy could be only restrict failover for slots with > > > > > > > two-phase > > > > > > > enabled when there's a possibility of existing prepared > > > > > > > transactions before > > > > > > the > > > > > > > two_phase_at that are not yet replicated. During slot creation > > > > > > > with > > > > > > two-phase > > > > > > > and failover, we could check for any decoded prepared > > > > > > > transactions when > > > > > > > determining the decoding start point > > > > > > > (DecodingContextFindStartpoint). For > > > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > > > two_phase_at is > > > > > > > less than restart_lsn, indicating that all prepared transactions > > > > > > > have been > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during > > > > > > > slot creation > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for > > > > > > > transactions to > > > > > > prepare > > > > > > > during consistent snapshot creation, the restriction becomes > > > > > > > almost > > > > > > > unnoticeable. > > > > > > > > > > > > I think this approach can work for the transactions that are > > > > > > prepared > > > > > > while the slot is created. But if I understand the problem > > > > > > correctly, > > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > > still false, so we need to deal with the transactions that are > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we > > > > > haven't > > > > > started decoding when setting two_phase=true during > > > > > CreateDecodingContext() > > > > > after tablesync, we could check prepared transactions afterwards > > > > > during > > > > > decoding. This could involve reporting an ERROR when skipping a > > > > > prepared > > > > > transaction during decoding if its prepare LSN is less than > > > > > two_phase_at. > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > later point of time. > > > > > > > > > Alternatively, a simpler method would be to prevent this situation > > > > > entirely > > > > > during the CREATE SUBSCRIPTION command. For example, we could > > > > > restrict slots > > > > > created with failover set to true and twophase is later modified to > > > > > true after > > > > > tablesync. Although the simpler check is more user-visible, it may > > > > > offer less > > > > > flexibility. > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > smart checks in the back-branch. 
If we follow what you say here, then > > > > users have the following ways in PG17 to enable both failover and > > > > two_phase. (a) During Create Subscription, users can set both > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > to set 'failover'. > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > add check when enabling failover for the two_phase-enabled-slots. For > > > example, users who want to enable both need to do two steps: > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > the two_phase states is ready (and possibly check if the slot's > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > the publisher enables the failover for the two_phase-enabled-slot up > > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > > passed slot's two_phase_at, otherwise raise an error with the message > > > like "the slot need to consume change upto %X/%X to enable failover". > > > > > > > This should furthe
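Spelled out in existing syntax, the two-step sequence under discussion would look roughly like this (subscription, connection, and publication names are invented; the extra two_phase_at/restart_lsn checks described above would fire at the ALTER step):

    -- Step 1: create the subscription with two_phase on but failover off.
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub
        WITH (two_phase = true, failover = false);

    -- Step 2: once the two_phase state is ready, enable failover.
    -- (The subscription typically has to be disabled while failover is toggled.)
    ALTER SUBSCRIPTION mysub DISABLE;
    ALTER SUBSCRIPTION mysub SET (failover = true);
    ALTER SUBSCRIPTION mysub ENABLE;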
Cancel problems of query to pg_stat_statements
Hi! Recently we faced a problem in our production PostgreSQL installation: we had to cancel all requests to the db, including performance monitoring requests that use pg_stat_statements. But we could not cancel the request in the usual way, and had to kill -9 its backend process. We've noticed that the query execution was stuck on PGSS_TEXT_FILE file reading in the function qtext_load_file, which doesn't have CHECK_FOR_INTERRUPTS in its read cycle. A large PGSS_TEXT_FILE (and perhaps problems with virtual disk I/O), combined with that missing check, can explain uncancellable pg_stat_statements queries. Also, the reading block size can be reduced from 1GB to 32MB in order to increase the frequency of CHECK_FOR_INTERRUPTS calls without qtext_load_file performance degradation. To check that, I did some quick testing with fio like: fio --name=readtest --filename=./random-bytes-file --rw=read --bs=32m --size=10G --ioengine=libaio --direct=1 --numjobs=1 So, I made a simple patch that adds a CHECK_FOR_INTERRUPTS call in the read cycle of qtext_load_file and changes the maximum value of toread from 1GB to 32MB. I would appreciate your feedback on these changes.
From eab041fcd1f4973b03f6bebd37bef3395fe64b4b Mon Sep 17 00:00:00 2001 From: rkhapov Date: Thu, 24 Apr 2025 17:01:54 + Subject: [PATCH] pg_stat_statements.c: cancelable qtext_load_file In the case of a large PGSS_TEXT_FILE, the work time of the qtext_load_file function will be quite long, and the query to the pg_stat_statements table will not be cancellable, as there is no CHECK_FOR_INTERRUPTS in the function. Also, the amount of bytes read can reach 1 GB, which leads to a slow read system call that does not allow cancellation of the query. Testing the speed of sequential read using fio with different block sizes shows that there is no significant difference between 32 MB blocks and 1 GB blocks. Therefore, this patch changes the maximum read value from 1 GB to 32 MB and adds a CHECK_FOR_INTERRUPTS call in the read loop of qtext_load_file to make it cancellable. Signed-off-by: rkhapov --- contrib/pg_stat_statements/pg_stat_statements.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 9778407cba3..cd34f1ce248 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -2365,7 +2365,9 @@ qtext_load_file(Size *buffer_size) nread = 0; while (nread < stat.st_size) { - int toread = Min(1024 * 1024 * 1024, stat.st_size - nread); + int toread = Min(32 * 1024 * 1024, stat.st_size - nread); + + CHECK_FOR_INTERRUPTS(); /* * If we get a short read and errno doesn't get set, the reason is -- 2.43.0
Re: sslmode=secure by default (Re: Making sslrootcert=system work on Windows psql)
On Thu, Apr 24, 2025 at 5:00 AM Peter Eisentraut wrote: > I'm generally in favor of making sslmode=verify-full the effective > default somehow. +many On Thu, Apr 24, 2025 at 3:53 AM Christoph Berg wrote: > For > postgresql://-style strings, we would ideally have something like http:// > vs https://, but I am not sure how to squeeze that into the syntax. Not to derail things too much, but I'd also like a postgress:// scheme, and I've put a little bit of idle thought into it. I think we'd want it to imply sslnegotiation=direct and sslrootcert=system (modulo the Windows discussion already in progress), and potentially make a bunch of stricter decisions about TLS settings to better match modern practice. The intent would be to have a "browser-strength" scheme for people who care more about security than about raw compatibility with older systems, because they're connecting to someone else's servers on the open Web. The hardest part, in my opinion, is that we'd have to start following the RFC concept of "authority". A URL of "postgress://example.com/db?host=evil.com&hostaddr=..." is outright dangerous, as is "postgress://example.com/db?sslmode=disable". So if there's interest in that scheme, I think it should remain a separate feature from "verify-full by default", because there's a lot more to figure out. --Jacob
Re: Available disk space per tablespace
Hi, I also tested the patch on Linux Mint 22.1 with btrfs and ext4 partitions. I generated some data and the outcome looks good:

postgres=# \db+
                                              List of tablespaces
       Name       |  Owner   |          Location          | Access privileges | Options |  Size   |  Free   | Description
------------------+----------+----------------------------+-------------------+---------+---------+---------+-------------
 pg_default       | postgres |                            |                   |         | 1972 MB | 29 GB   |
 pg_global        | postgres |                            |                   |         | 556 kB  | 29 GB   |
 tablespace_test2 | postgres | /media/said/queryme/pgsql  |                   |         | 3147 MB | 1736 GB |

The numbers are the same as if I were executing the command df -h. tablespace_test2 was the ext4 partition on a USB stick. The numbers are correct.

Said

On 2025-03-13 14 h 10, Christoph Berg wrote: Hi, I'm picking up a 5-year-old patch again: https://www.postgresql.org/message-id/flat/20191108132419.GG8017%40msg.df7cb.de Users will be interested in knowing how much extra data they can load into a database, but PG currently does not expose that number. This patch introduces a new function pg_tablespace_avail() that takes a tablespace name or oid, and returns the number of bytes "available" there. This is the number excluding any reserved blocks (Unix, f_avail), or the space available to the current user (Windows). (This is not meant to replace a full-fledged OS monitoring system that has many more numbers about disks and everything; it is filling a UX gap.) Compared to the last patch, this just returns a single number so it's easier to use - total space isn't all that interesting, we just return the number the user wants. The free space is included in \db+ output:

postgres=# \db+
                                    List of tablespaces
    Name    │ Owner │ Location │ Access privileges │ Options │  Size   │  Free  │ Description
────────────┼───────┼──────────┼───────────────────┼─────────┼─────────┼────────┼─────────────
 pg_default │ myon  │          │ ∅                 │ ∅       │ 23 MB   │ 538 GB │ ∅
 pg_global  │ myon  │          │ ∅                 │ ∅       │ 556 kB  │ 538 GB │ ∅
 spc        │ myon  │ /tmp/spc │ ∅                 │ ∅       │ 0 bytes │ 31 GB  │ ∅
(3 rows)

The patch has also been tested on Windows. TODO: Figure out which systems need statfs() vs statvfs()

Christoph
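To make the mechanism concrete, here is a minimal standalone sketch of how the "available" number can be obtained with POSIX statvfs(); this is not the patch's code, and the TODO above is exactly about the platforms where statfs() has to be used instead.

/* Minimal sketch (not the patch): bytes available to an unprivileged user
 * on the filesystem containing "path". */
#include <stdint.h>
#include <stdio.h>
#include <sys/statvfs.h>

static int64_t
avail_bytes(const char *path)
{
	struct statvfs fs;

	if (statvfs(path, &fs) != 0)
		return -1;				/* caller decides how to report errors */

	/* f_bavail excludes blocks reserved for the superuser */
	return (int64_t) fs.f_bavail * (int64_t) fs.f_frsize;
}

int
main(void)
{
	printf("%lld bytes available\n", (long long) avail_bytes("/tmp"));
	return 0;
}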
Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX
Hi, two more minor issues. src/bin/pg_dump/pg_dump.c if (fout->remoteVersion >= 18) needs to change to if (fout->remoteVersion >= 19) +-- Test index visibility with partitioned tables +CREATE TABLE part_tbl(id int, data text) PARTITION BY RANGE(id); +CREATE TABLE part1 PARTITION OF part_tbl FOR VALUES FROM (1) TO (100); +CREATE TABLE part2 PARTITION OF part_tbl FOR VALUES FROM (100) TO (200); +INSERT INTO part_tbl +SELECT g, 'data ' || g +FROM generate_series(1, 199) g; +CREATE INDEX idx_part_tbl ON part_tbl(data); +SELECT c.relname, i.indisvisible +FROM pg_index i +JOIN pg_class c ON i.indexrelid = c.oid +WHERE c.relname LIKE 'idx_part_tbl%' +ORDER BY c.relname; + relname| indisvisible +--+-- + idx_part_tbl | t +(1 row) + This test doesn't seem good enough: "idx_part_tbl" is the partitioned index, so shouldn't we also show the pg_index.indisvisible value for each partition's index? We can change it to CREATE TABLE part_tbl(id int, data text) PARTITION BY RANGE(id); CREATE TABLE part_tbl_1 PARTITION OF part_tbl FOR VALUES FROM (1) TO (100); CREATE TABLE part_tbl_2 PARTITION OF part_tbl FOR VALUES FROM (100) TO (200); CREATE INDEX ON part_tbl(data); SELECT c.relname, i.indisvisible FROM pg_index i JOIN pg_class c ON i.indexrelid = c.oid WHERE c.relname LIKE 'part_tbl%' ORDER BY c.relname; -
Re: [BUG] temporary file usage report with extended protocol and unnamed portals
On 4/23/25 18:13, Sami Imseih wrote: Also, another strange behavior of the way portal cleanup occurs is that in extended-query-protocol and within a transaction, ExecutorEnd for the last query is not actually called until the next command. This just seems odd to me, especially for extensions that rely on ExecutorEnd. So, can we do something like this? This drops the portal as soon as execution completes (the portal is fetched to completion). This will ensure that there is no delay in ExecutorEnd getting called, and in the case of log_temp_files, the message will be logged while debug_query_string is still pointing to the correct query. diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index dc4c600922d..efe0151ca8f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -2327,6 +2327,9 @@ exec_execute_message(const char *portal_name, long max_rows) /* Send appropriate CommandComplete to client */ EndCommand(&qc, dest, false); + + if (!portal->portalPinned) + PortalDrop(portal, false); } else { I don't know if it is the correct solution, but it seems good to me (FWIW), and I've tested it and it works well in all the following cases: * Java program: extended protocol used for the two queries (the one that uses the temp file and the SELECT 1). * Python program: the SELECT 1 uses the simple protocol. * SQL PREPARE / EXECUTE * Another version of the Java program using the setFetchSize() method (and named portals).
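For anyone who wants to reproduce this without a Java or Python client, here is a rough sketch using psql's \bind (available since PostgreSQL 16), which sends the statement via the extended query protocol; the work_mem value and row count are only illustrative, and whether the temp-file message is actually delayed can depend on the portal handling of the client in use.

SET log_temp_files = 0;
SET work_mem = '64kB';
BEGIN;
-- \bind makes psql use the extended query protocol for this statement;
-- the sort below should spill to a temporary file.
SELECT g FROM generate_series(1, 200000) g ORDER BY g DESC \bind \g /dev/null
-- Before a fix like the one above, the "temporary file" log line may only
-- appear now, attributed to this query:
SELECT 1;
COMMIT;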
Re: What's our minimum supported Python version?
On 22.04.25 18:04, Tom Lane wrote: Jacob Champion writes: As for picking a version... 3.6 will have been EOL for almost three years by the time 18 releases. It seems like we would drop it happily, were it not for RHEL8. Agreed, but RHEL8 is out there and I don't think we can just drop support for it. I'm also not excited by the idea that an incidental test script gets to dictate what the cutoff is. I do think we should stop claiming that Python 3.2 will work. (Maybe it does, but we don't know that.) I see that the configure script only checks for Python >= 3, and it doesn't look like the meson scripts check explicitly at all, although there's a comment saying that our meson version cutoff is intended to allow working with Python 3.5. Maybe it's sufficient to make a documentation change here, and say we support Python >= 3.5? I'd be okay with saying 3.6.8 too, on the grounds that if anything older fails to work we'd almost certainly just say "too bad". But RHEL8 is widespread enough that I think we need to keep making the effort for 3.6.8. There are a lot of different things that are meant by Python support nowadays. It used to be the case that the minimum Python version was (1) the oldest version that was required to get plpython to compile, and (2) the range of versions for which we would maintain "expected" files for the plpython tests (see history of src/pl/plpython/expected/README). #2 isn't really a problem anymore, it seems. It used to require yearly attention, but not anymore. (Maybe Python itself has gotten more stable, or we have changed our tests to be less sensitive to this, probably a combination of both.) #1 is also less of a hot-spot, as indicated that we still claim to support back to Python 3.2. Also, starting in PG18, we are using the Python "Limited API", so this is being enforced on the Python side. That means, in principle, if plpython compiles successfully right now on Python 3.13, then it should also compile successfully on Python 3.2. But now the new meanings are also, (3) what version of Python is required by the oldest supported Meson version, and (4) what version of Python can be assumed by various build and test scripts. There is actually (4a) scripts that are only run from a meson build, which should surely align with #3, and (4b) scripts that are also run by a make build, which is what oauth_server.py is, for which we can pick anything. The answer to #3 is currently Python 3.5 (see top of top-level meson.build). And then there is meaning (5) what version has anyone actually tested. Note that #2, #3, and #4 are build-time and #1 also has a run-time component. It would be plausible to say that you need a certain newer Python to build, but plpython can still run on older versions. If we were to make any changes, we need to make sure that the documentation is precise about this and matches the code (see #define Py_LIMITED_API). The cut-off in practice for these things is usually RHEL. PG18 currently still supports RHEL 7, which appears to come with Python 3.6. Seeing that the small problem with the test script was easily fixed, I think we should stick with that for now. There might be value in dropping support for RHEL 7 sometime, but that should be done across the board (meson version, openssl version, perl version, etc.), and requires some buildfarm changes, so I wouldn't do this right now.
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila wrote: > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada > wrote: > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila wrote: > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > wrote: > > > > > > > > > > > > > > > > > > -- > > > > > > Approach 2 > > > > > > -- > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, > > > > > > a more > > > > > > flexible strategy could be only restrict failover for slots with > > > > > > two-phase > > > > > > enabled when there's a possibility of existing prepared > > > > > > transactions before > > > > > the > > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > > two-phase > > > > > > and failover, we could check for any decoded prepared transactions > > > > > > when > > > > > > determining the decoding start point > > > > > > (DecodingContextFindStartpoint). For > > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > > two_phase_at is > > > > > > less than restart_lsn, indicating that all prepared transactions > > > > > > have been > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > pros: > > > > > > > > > > > > This method minimizes restrictions for users. Especially during > > > > > > slot creation > > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions > > > > > > to > > > > > prepare > > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > > unnoticeable. > > > > > > > > > > I think this approach can work for the transactions that are prepared > > > > > while the slot is created. But if I understand the problem correctly, > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > still false, so we need to deal with the transactions that are > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we > > > > haven't > > > > started decoding when setting two_phase=true during > > > > CreateDecodingContext() > > > > after tablesync, we could check prepared transactions afterwards during > > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > > transaction during decoding if its prepare LSN is less than > > > > two_phase_at. > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > later point of time. > > > > > > > Alternatively, a simpler method would be to prevent this situation > > > > entirely > > > > during the CREATE SUBSCRIPTION command. For example, we could restrict > > > > slots > > > > created with failover set to true and twophase is later modified to > > > > true after > > > > tablesync. Although the simpler check is more user-visible, it may > > > > offer less > > > > flexibility. > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > smart checks in the back-branch. If we follow what you say here, then > > > users have the following ways in PG17 to enable both failover and > > > two_phase. 
(a) During Create Subscription, users can set both > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > 'copy_data' is true, during Create Subscription, then users can enable > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > to set 'failover'. > > > > Yet another idea would be to disallow enabling both two_phase and > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > add check when enabling failover for the two_phase-enabled-slots. For > > example, users who want to enable both need to do two steps: > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > 2. ALTER SUBSCRIPTION with failover = true. > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > the two_phase states is ready (and possibly check if the slot's > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > the publisher enables the failover for the two_phase-enabled-slot up > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > passed slot's two_phase_at, otherwise raise an error with the message > > like "the slot need to consume change upto %X/%X to enable failover". > > > > This should further simplify the checks with an additional restriction > during the CREATE SUBSCRIPTION time. I am in favor of it because I > want the code in this area to be as much same in HEAD and backbranches > as possible. > Please find the updated patch for Approach 3, which implements the idea suggested by Sawada-san above. -- Thanks, Nisha v7-0001-PG17-Appro
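For readers following the thread, the two-step workflow that this approach enforces looks roughly like this (subscription, publication and connection strings are illustrative):

-- Step 1: create the subscription with two_phase on and failover off.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (two_phase = true, failover = false);

-- Step 2: once the subscription's two_phase state is ready (and the remote
-- slot has consumed changes past its two_phase_at), enable failover.
ALTER SUBSCRIPTION mysub SET (failover = true);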
Disallow redundant indexes
Hi, hackers

Currently, PostgreSQL permits creating multiple indexes on the same columns in the same order for a table, potentially leading to redundant indexes. For example:

CREATE INDEX ON t(id);
CREATE INDEX ON t(id);

While permitted, this leads to:
- Increased storage consumption
- Performance degradation (for data modification)
- Maintenance overhead
- Potential query optimizer confusion

Oracle prevents this with an error like ORA-01408: such column list already indexed [1]. I propose that PostgreSQL prevent redundant index creation by:
- Checking for identical existing indexes during CREATE INDEX.
- Failing with an error (like Oracle's ORA-01408) if a duplicate is found.
- Providing a GUC parameter (allow_redundant_indexes) to control this.

This change would:
- Prevent accidental redundancy
- Optimize storage
- Improve performance
- Simplify maintenance
- Enhance efficiency and user flexibility

I’d love to hear your feedback or suggestions for improvement.

[1] https://docs.oracle.com/en/error-help/db/ora-01408/?r=19c

--
Regards,
Japin Li
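For comparison, an exact-duplicate check of the kind proposed here can already be approximated with a catalog query; a rough sketch follows (it only groups indexes with identical key columns, operator classes, collations, expressions and predicates, and deliberately ignores properties such as uniqueness, so treat matches as candidates rather than certainties):

SELECT indrelid::regclass AS table_name,
       array_agg(indexrelid::regclass) AS candidate_duplicates
FROM pg_index
GROUP BY indrelid, indkey::text, indclass::text, indcollation::text,
         COALESCE(indexprs::text, ''), COALESCE(indpred::text, '')
HAVING count(*) > 1;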
Re: Conflicting updates of command progress
Sami Imseih wrote: > >> pgstat_progress_start_command should only be called once by the entry > >> point for the > >> command. In theory, we could end up in a situation where start_command > >> is called multiple times during the same top-level command; > > > Not only in theory - it actually happens when CLUSTER is rebuilding indexes. > > In the case of CLUSTER, pgstat_progress_start_command is only called once, > but pgstat_progress_update_param is called in the context of both CLUSTER > and CREATE INDEX. pgstat_progress_start_command() is called twice: First with cmdtype=PROGRESS_COMMAND_CLUSTER, second with PROGRESS_COMMAND_CREATE_INDEX. The first happens in cluster_rel() the second in cluster_rel() -> rebuild_relation() -> finish_heap_swap() -> reindex_relation() -> reindex_index(). It does not matter though, the current design only expects one command. > > That's a possible approach. However, if the index build takes long time in > > the > > CLUSTER case, the user will probably be interested in details about the > > index > > build. > > I agree, > > >> Is there a repro that you can share that shows the weird values? It sounds > >> like > >> the repro is on top of [1]. Is that right? > > >> You can reproduce the similar problem by creating a trigger function that > >> runs a progress-reporting command like COPY, and then COPY data into > >> a table that uses that trigger. > > >> [2] https://commitfest.postgresql.org/patch/5282/ > > In this case, pgstat_progress_start_command is actually called > twice in the life of a single COPY command; the upper-level COPY > command calls pgstat_progress_start_command and then the nested COPY > command also does calls pgstat_progress_start_command. > > > I think that can be implemented by moving the progress related fields from > > PgBackendStatus into a new structure and by teaching the backend to insert a > > new instance of that structure into a shared hash table (dshash.c) > > I think this is a good idea in general to move the backend progress to > shared memory. > and with a new API that will deal with scenarios as described above. > 1/ an (explicit) nested > command was started by a top-level command, such as the COPY case above. > 2/ a top-level command triggered some other progress code implicitly, such as > CLUSTER triggering CREATE INDEX code. Yes, I mean a new API. I imagine pgstat_progress_start_command() to initialize the shared memory to track the "nested" command and to put the existing value of MyBEEntry onto a stack. pgstat_progress_end_command() would then restore the original value. However, special care is needed for [2] because that's not necessarily nesting: consider merge-joining two foreign tables, both using file_fdw. In this case the pointers to the progress tracking shared memory would probably have to be passed as function arguments. (Note that the current progress tracking system also uses shared memory, however in a less flexible way.) > I also like the shared memory approach because we can then not have to use > a message like the one introduced in f1889729dd3ab0 to support parallel index > vacuum progress 46ebdfe164c61. I didn't know about these patches. I'm not sure though if this needs to be removed. Even if each worker updated the progress information separately (would users appreciate that?), it should still send the progress information to the leader before it (i.e. the worker) exits. -- Antonin Houska Web: https://www.cybertec-postgresql.com
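To make the stack idea a bit more concrete, a very rough sketch follows; the struct, the helpers and the way the entry is published are all hypothetical (the real work of moving the fields out of PgBackendStatus into a dshash-backed structure is elided), and only ProgressCommandType and PGSTAT_NUM_PROGRESS_PARAM are existing symbols (from utils/backend_progress.h). The usual backend headers are assumed.

/* Hypothetical sketch only; not existing backend code. */
typedef struct ProgressEntry
{
	ProgressCommandType cmdtype;
	Oid			relid;
	int64		param[PGSTAT_NUM_PROGRESS_PARAM];
	struct ProgressEntry *parent;	/* entry of the enclosing command, if any */
} ProgressEntry;

static ProgressEntry *current_progress = NULL;

void
pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
{
	/* progress_shared_alloc() is an assumed helper backed by dshash */
	ProgressEntry *e = progress_shared_alloc();

	e->cmdtype = cmdtype;
	e->relid = relid;
	memset(e->param, 0, sizeof(e->param));
	e->parent = current_progress;	/* push: remember the enclosing command */
	current_progress = e;
}

void
pgstat_progress_end_command(void)
{
	ProgressEntry *e = current_progress;

	current_progress = e->parent;	/* pop: the enclosing command is current again */
	progress_shared_free(e);		/* assumed helper */
}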
Re: long-standing data loss bug in initial sync of logical replication
On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada wrote: > > > > > Fair enough. OTOH, we can leave the 13 branch considering following: > > (a) it is near EOL, (b) bug happens in rare cases (when the DDLs like > > ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take > > a strong lock on table happens concurrently to DMLs on the tables > > involved in the DDL.), and (c) the complete fix is invasive, even > > partial fix is not simple. I have a slight fear that if we make any > > mistake in fixing it partially (of course, we can't see any today), we > > may not even get a chance to fix it. > > > > Now, if the above convinces you or someone else not to push the > > partial fix in 13, then fine; otherwise, I'll push the 0001 to 13 day > > after tomorrow. > > I've considered the above points. I guess (b), particularly executing > ALTER PUBLICATION .. ADD TABLE while the target table is being > updated, might not be rare depending on systems. Given that this bug > causes a silent data-loss on the subscriber that is hard for users to > realize, it could ultimately depend on to what extent we can mitigate > the problem with only 0001 and there is a workaround when the problem > happens. > > Kuroda-san already shared[1] the analysis of what happens with and > without 0002 patch, but let me try with the example close to the > original data-loss problem[2]: > > Consider the following scenario: > > S1: CREATE TABLE d(data text not null); > S1: INSERT INTO d VALUES('d1'); > S2: BEGIN; > S2: INSERT INTO d VALUES('d2'); > S1: ALTER PUBLICATION pb ADD TABLE d; > S2: INSERT INTO d VALUES('d3'); > S2: COMMIT > S2: INSERT INTO d VALUES('d4'); > S1: INSERT INTO d VALUES('d5'); > > Without 0001 and 0002 (i.e. as of today), the walsender fails to send > all changes to table 'd' until it invalidates its caches for some > reasons. > > With only 0001, the walsender sends 'd4' insertion or later. > > WIth both 0001 and 0002, the wansender sends 'd3' insertion or later. > > ISTM the difference between without both 0001 and 0002 and with 0001 > is significant. So I think it's worth applying 0001 for v13. > Pushed to v13 as well, thanks for sharing the feedback. -- With Regards, Amit Kapila.
Re: extension_control_path and "directory"
Re: Matheus Alcantara > I've tested with the semver extension and it seems to work fine with > this patch. Can you please check on your side to see if it's also > working? Hi Matheus, thanks for the patch, it does indeed work. diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 180f4af9be3..d68efd59118 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -376,6 +376,14 @@ get_extension_control_directories(void) /* Substitute the path macro if needed */ mangled = substitute_path_macro(piece, "$system", system_dir); + + /* +* Append "extension" suffix in case is a custom extension control +* path. +*/ + if (strcmp(piece, "$system") != 0) + mangled = psprintf("%s/extension", mangled); This would look prettier if it was something like mangled = substitute_path_macro(piece, "$system", system_dir "/extension"); ... but I'm wondering if it wouldn't be saner if the control path should be stored without "extension" in that struct. Then opening the control file would be path + "extension/" + filename and the extra directory would be path + directory, without any on-the-fly stripping of trailing components. The extension_control_path GUC could also be adjusted to refer to the directory one level above the extension/foo.control location. + /* +* Assert that the control->control_dir end with /extension suffix so that +* we can replace with the value from control->directory. +*/ + Assert(ctrldir_len >= suffix_len && + strcmp(control->control_dir + ctrldir_len - suffix_len, "extension") == 0); If control_dir is coming from extension_control_path, it might have a different suffix. Replace the Assert by elog(ERROR). (Or see above.) Christoph
Re: sslmode=secure by default (Re: Making sslrootcert=system work on Windows psql)
> On Linux/*ix, there would be 3 things that are all the same. > > If the Windows Openssl store is that bad, wouldn't the smarter thing > to do for PG19 to use winstore by default? The Openssl one would still > be available when requested explicitly. This would avoid the > proliferation of default values. I agree ... but I think that looks rather like my most recent (rejected) patch? However, perhaps we could extend that patch for greater backwards-compatibility, checking not only that the SSL_CERT_DIR and SSL_CERT_FILE environment variables are not set, but *also* that there is no cert.pem file and no certs/ directory inside OPENSSLDIR. I think that should make the behaviour backwards-compatible for all scenarios *except* those that would otherwise be guaranteed to fail certificate verification because we are on Windows and there are no OpenSSL certificates configured on the system. It seems fairly safe to assume that people who are using sslrootcert=system on Windows and without any configured OpenSSL certs are not doing so with the deliberate intention that all connections should fail! I attach a patch that would do this (side-by-side view at https://github.com/postgres/postgres/compare/master...jawj:postgres:jawj-sslrootcert-system-windows). An advantage of this approach would be that people building Postgres who want this behaviour sooner than next year could also patch it into versions 16 – 18 without much trouble. >> BIGGER IDEA >> In summary, you end up with these as sslmode values: >> >> * disabled >> * insecure (formerly known as require) >> * verify-ca >> * verify-full >> * secure (the new default, meaning sslmode=verify-full + sslrootcert=os) >> >> Obviously this would need to be well-trailed ahead of time, as some people >> would need to make changes to how they use psql/libpq. But it would peg the >> default security of a Postgres connection at the same level as the security >> of any random blog page (which I think is a bare minimum one might aspire >> to). > > I agree that this would be a good change for SSL users, and also one > that people would likely be willing to buy. > > The big problem here is that a lot of installations are not using SSL > at all (default on RPM), and another big chunk is using SSL, but > relying on the default snakeoil certificates to just work (default on > Debian), so this would not be "some people" but more like "everyone > except the few who have already configured certificates properly". > > These people would have to change every single connection string to > include "sslmode=disabled" or the like. This will likely not be > received well. > > Before we can make this change, I think we would have to improve the > UX. psql does not even have any --switch for it. PostgreSQL serving > non-SSL and SSL on the same port doesn't make the UX better... :-/ How do you think the UX could be improved? Maybe by using a psql switch and/or an env var to opt out of (or initially even to opt into) the new sslmode treatment? sslrootcert-system-windows.diff Description: Binary data
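To illustrate the backwards-compatibility check described above, here is a rough, POSIX-flavoured sketch; the function name is illustrative and this is not the patch's code (on Windows the real patch would use the platform's equivalent of access()):

#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <openssl/x509.h>

/* Sketch only: does OpenSSL's default verify area look configured? */
static bool
openssl_store_looks_configured(void)
{
	if (getenv("SSL_CERT_DIR") != NULL || getenv("SSL_CERT_FILE") != NULL)
		return true;

	/* X509_get_default_cert_file() is OPENSSLDIR/cert.pem,
	 * X509_get_default_cert_dir() is OPENSSLDIR/certs */
	if (access(X509_get_default_cert_file(), F_OK) == 0)
		return true;
	if (access(X509_get_default_cert_dir(), F_OK) == 0)
		return true;

	return false;
}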
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 2:54 PM Nisha Moond wrote: > > Please find the updated patch for Approach 3, which implements the > idea suggested by Sawada-san above. > Thank you for the patch. 1) CreateDecodingContext: if (ctx->twophase && !slot->data.two_phase) { + /* + * Do not allow two-phase decoding for failover enabled slots. + * + * See comments atop the similar check in ReplicationSlotCreate() for + * a detailed reason. + */ + if (slot->data.failover) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", + NameStr(slot->data.name; + Can you please add a testcase to cover this scenario? 2) We should update the create-sub documents as well for these mutually exclusive options. Review other pages (alter-sub, create-slot) as well for any required change. 3) +## +# Test that the failover option can be enabled for a two_phase enabled +# subscription. +## Suggestion: 'Test that the failover option can be enabled for a two_phase enabled subscription only through Alter Subscription (failover=true)' thanks Shveta
Re: [PATCH] contrib/xml2: xslt_process() should report XSLT-related error details
On Thu Apr 24, 2025 at 01:59:10 GMT +03, Tom Lane wrote: > Hm. I wonder if we couldn't get rid of the > HAVE_XMLSTRUCTUREDERRORCONTEXT conditionalization. If the > libxml2 versions that lacked that were "old" in 2011, surely > they are extinct in the wild by now. I'm thinking that we > really don't want to be using the generic handler at all, given > the commit log's comment that it can't distinguish between > error, warning, and notice cases. libxml 2.7.4 was released 15 years ago, so yes these #ifdefs can probably be removed already. What's the oldest distribution/OS you want to support PG on? Even Ubuntu 14.04 from 2014 already had libxml 2.9.1. If you like, I will prepare another patch to remove HAVE_XMLSTRUCTUREDERRORCONTEXT in a separate thread. PS: I added the libxslt error handling patch to the next commitfest: https://commitfest.postgresql.org/patch/5718/ -- Robin Haberkorn Senior Software Engineer B1 Systems GmbH Osterfeldstraße 7 / 85088 Vohburg / https://www.b1-systems.de GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt, HRB 3537
Re: Does RENAME TABLE rename associated identity sequence?
Hi Jason, On Wed, Apr 23, 2025 at 6:06 PM Jason Song wrote: > Hi hackers, > > I was wondering if there's any built-in functionality in PostgreSQL where > renaming a table with an identity column would also rename the > auto-generated sequence associated with that identity column. > > In my case, I renamed a table that used `GENERATED BY DEFAULT AS > IDENTITY`, and later when I ran `pg_dump`, I noticed that the sequence name > was unchanged (e.g., still `old_table_id_seq`). As a result, any `setval()` > or sequence-related operations referenced the old sequence name, even > though the table name had changed. > Is it causing a problem in your application or environment? As long as the system uses the given name consistently, and does not try to derive the sequence name from the new table name and end up with a non-existent-sequence error, it should be fine. Internally we use OIDs, not names. Identity sequences are not expected to be manipulated by users anyway, so the name shouldn't matter. I agree that finding an identity sequence associated with a table is not straightforward - that association is stored in pg_depend, which is not intuitive. Take a look at getIdentitySequence(). > > I realize this can be worked around — for example, by using > `--exclude-table-data` to skip the `setval()` or manually renaming the > sequence after the table rename. But I'm curious if there are any plans (or > technical reasons against) supporting something like `ALTER TABLE ... > RENAME ... WITH SEQUENCE`, or having the sequence name automatically follow > the table rename when it was originally auto-generated by an identity > column. > > If there's any problem, IMO, ALTER TABLE ... RENAME ... should rename the sequence too, since identity sequences are created implicitly when the table is created, so they should be renamed implicitly. We should not require a WITH SEQUENCE clause. -- Best Wishes, Ashutosh Bapat
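For anyone who wants to tidy things up manually in the meantime, the owned sequence can be located and renamed by hand; a rough sketch (table, column and sequence names are illustrative):

-- pg_get_serial_sequence() also works for identity columns:
SELECT pg_get_serial_sequence('new_table', 'id');

-- Or follow the internal dependency that getIdentitySequence() uses:
SELECT d.objid::regclass AS identity_sequence
FROM pg_depend d
WHERE d.classid = 'pg_class'::regclass
  AND d.refclassid = 'pg_class'::regclass
  AND d.refobjid = 'new_table'::regclass
  AND d.refobjsubid > 0
  AND d.deptype = 'i';

-- Then rename it to match the new table name:
ALTER SEQUENCE old_table_id_seq RENAME TO new_table_id_seq;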
Possible incorrect comment in ginget.c
Hi, I was a bit confused by this comment in keyGetItem() (ginget.c) /* * Normally, none of the items in additionalEntries can have a curItem * larger than minItem. But if minItem is a lossy page, then there * might be exact items on the same page among additionalEntries. */ AFAIS the code before the comment is opposite to the comment's first sentence. We set advancePast right before the minItem, and if an additional entry's curItem is smaller than or equals to advancePast, we call entryGetItem(), so afterwards it should be larger than advancePast and larger than or equal to minItem. So it seems that normally all items in additionalEntries have a curItem larger than or equal to minItem. It seems the correct first sentence would be: "Normally, none of the items in additionalEntries can have a curItem LESS than minItem" Best regards, Arseniy Mukhin
Re: Fix premature xmin advancement during fast forward decoding
On Tue, Apr 22, 2025 at 12:36 PM Zhijie Hou (Fujitsu) wrote: > > When analyzing some issues in another thread[1], I found a bug that fast > forward > decoding could lead to premature advancement of catalog_xmin, resulting in > required catalog data being removed by vacuum. > > The issue arises because we do not build a base snapshot when decoding changes > during fast forward mode, preventing reference to the minimum transaction ID > that remains visible in the snapshot when determining the candidate for > catalog_xmin. As a result, catalog_xmin was directly advanced to the oldest > running transaction ID found in the latest running_xacts record. > > In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot could > not > be reached during fast forward decoding, resulting in > rb->txns_by_base_snapshot_lsn being NULL. When advancing catalog_xmin, the > system attempts to refer to rb->txns_by_base_snapshot_lsn via > ReorderBufferGetOldestXmin(). However, since rb->txns_by_base_snapshot_lsn is > NULL, it defaults to directly using running->oldestRunningXid as the candidate > for catalog_xmin. For reference, see the details in > SnapBuildProcessRunningXacts. > > See the attachment for a test(0002) to prove that the catalog data that are > still required would be removed after premature catalog_xmin advancement > during > fast forward decoding. > > To fix this, I think we can allow the base snapshot to be built during fast > forward decoding, as implemented in the patch 0001 (We already built base > snapshot in fast-forward mode for logical message in logicalmsg_decode()). > The same bug was fixed for non-fast_forward mode in commit f49a80c4. See the following code in that commit: - LogicalIncreaseXminForSlot(lsn, running->oldestRunningXid); + xmin = ReorderBufferGetOldestXmin(builder->reorder); + if (xmin == InvalidTransactionId) + xmin = running->oldestRunningXid; + LogicalIncreaseXminForSlot(lsn, xmin); Is my understanding correct? If so, then I think it is a miss in the commit f49a80c4 to consider fast_forward mode. Please refer commit in the commit message. The code changes look correct to me. I have some suggestions related to comments in the code. See attached. Please prepare patches for back branches as well. -- With Regards, Amit Kapila. diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index 8dc467a8bb3..cc03f0706e9 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -413,10 +413,10 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) /* * If we don't have snapshot or we are just fast-forwarding, there is no * point in decoding data changes. However, it's crucial to build the base -* snapshot during fast-forward mode (as done in SnapBuildProcessChange()) -* because the snapshot's minimum visible transaction ID (xmin) must be -* referenced when determining the candidate catalog_xmin for the -* replication slot. +* snapshot during fast-forward mode (as is done in +* SnapBuildProcessChange()) because we require the snapshot's xmin when +* determining the candidate catalog_xmin for the replication slot. See +* SnapBuildProcessRunningXacts(). */ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) return; @@ -477,10 +477,10 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) /* * If we don't have snapshot or we are just fast-forwarding, there is no * point in decoding data changes. 
However, it's crucial to build the base -* snapshot during fast-forward mode (as done in SnapBuildProcessChange()) -* because the snapshot's minimum visible transaction ID (xmin) must be -* referenced when determining the candidate catalog_xmin for the -* replication slot. +* snapshot during fast-forward mode (as is done in +* SnapBuildProcessChange()) because we require the snapshot's xmin when +* determining the candidate catalog_xmin for the replication slot. See +* SnapBuildProcessRunningXacts(). */ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) return;
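For context, fast-forward decoding is the mode used by pg_replication_slot_advance() for logical slots, so the code path in question is exercised roughly like this (slot and plugin names are illustrative; actually hitting the premature catalog_xmin advancement additionally needs concurrently prepared transactions, as in the test attached upthread):

SELECT pg_create_logical_replication_slot('myslot', 'test_decoding');
-- ... concurrent activity, e.g. prepared transactions touching catalogs ...
SELECT pg_replication_slot_advance('myslot', pg_current_wal_lsn());
SELECT slot_name, catalog_xmin
FROM pg_replication_slots
WHERE slot_name = 'myslot';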
Re: Does RENAME TABLE rename associated identity sequence?
Isaac Morland writes: > On Thu, 24 Apr 2025 at 05:53, Ashutosh Bapat > wrote: >> If there's any problem, IMO, ALTER TABLE ... RENAME ... should rename the >> sequence too since the identity sequences are created implicitly when the >> table is created, so they should be renamed implicitly. We should not >> require WITH SEQUENCE clause. > My concern would be what happens if the new sequence name is not available. > I suppose the simplest behaviour might be to skip renaming the sequence in > that case, perhaps raising a warning. We do not rename any other subsidiary objects such as indexes. Why would we rename a sequence (which has a lot more reason to be considered an independent object than an index does)? regression=# create table foo (i int primary key); CREATE TABLE regression=# \d+ foo Table "public.foo" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description +-+---+--+-+-+-+--+- i | integer | | not null | | plain | | | Indexes: "foo_pkey" PRIMARY KEY, btree (i) Not-null constraints: "foo_i_not_null" NOT NULL "i" Access method: heap regression=# alter table foo rename to bar; ALTER TABLE regression=# \d+ bar Table "public.bar" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description +-+---+--+-+-+-+--+- i | integer | | not null | | plain | | | Indexes: "foo_pkey" PRIMARY KEY, btree (i) Not-null constraints: "foo_i_not_null" NOT NULL "i" Access method: heap I think it's up to the user to rename subsidiary objects if they wish to do so. regards, tom lane
Re: What's our minimum supported Python version?
Peter Eisentraut writes: > There are approximately 6 buildfarm members with RHEL 7 or CentOS 7, so > in that sense, "support" means that everything is expected to work > there. And some amount of work has been done recently to keep that > support alive. Yeah. Devrim's choice of what to package is not project policy, it's just his own personal allocation of resources. > (But I'm confused now because I see Python 3.6 on both the RHEL 7 and > the RHEL 8 animals. So it's not clear how representative these animals > are especially with respect to the Python versions.) Per wikipedia, RHEL7 was released mid-2014, so the latest Python version 7.0 could possibly have shipped with is 3.4 (released 2014-03) and much more likely they shipped 3.3 (2012-09). However, you can if you choose install later Python versions, and you have control over which version /usr/bin/python3 points at. (At least this is true on the RHEL8 box I am looking at, and I'm fairly sure it was so on RHEL7 too.) Similarly, Python 3.6 seems to be the baseline default on RHEL8 --- and the timing more or less matches up, as 3.7 was released not long before they would have been freezing the RHEL8 feature set. But you can install 3.8, 3.9, 3.11, and/or 3.12 without going outside the universe of Red-Hat-supported packages. So what's that mean for us? "We still want to support RHEL7" turns out to be squishier than one might think. But I don't think we really want to promise to work with Python 3.3, especially given the lack of any buildfarm representation of that. In the other direction, yeah we could insist that RHEL users install some later Python version, but I think we'd get push-back. The point of using an LTS distro is, well, stability. The users want to decide which packages they are comfortable with updating. I'm still content with the idea of deciding that 3.6 is now our cutoff. If someone comes along and complains because oauth_server.py doesn't work on an older version, it's up to them to provide a patch. If/when we want to include some Python code that can't reasonably be made to work on 3.6, we can reconsider. regards, tom lane
RE: Fix premature xmin advancement during fast forward decoding
On Fri, Apr 25, 2025 at 5:44 AM Masahiko Sawada wrote: (Resending this email after compressing the attachment) > On Tue, Apr 22, 2025 at 12:06 AM Zhijie Hou (Fujitsu) > wrote: > > > > Hi, > > > > When analyzing some issues in another thread[1], I found a bug that > > fast forward decoding could lead to premature advancement of > > catalog_xmin, resulting in required catalog data being removed by vacuum. > > > > The issue arises because we do not build a base snapshot when decoding > > changes during fast forward mode, preventing reference to the minimum > > transaction ID that remains visible in the snapshot when determining > > the candidate for catalog_xmin. As a result, catalog_xmin was directly > > advanced to the oldest running transaction ID found in the latest > running_xacts record. > > > > In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot > > could not be reached during fast forward decoding, resulting in > > rb->txns_by_base_snapshot_lsn being NULL. When advancing > catalog_xmin, > > rb->the > > system attempts to refer to rb->txns_by_base_snapshot_lsn via > > ReorderBufferGetOldestXmin(). However, since > > rb->txns_by_base_snapshot_lsn is NULL, it defaults to directly using > > running->oldestRunningXid as the candidate for catalog_xmin. For > > reference, see the details in SnapBuildProcessRunningXacts. > > I agree with your analysis. > > > To fix this, I think we can allow the base snapshot to be built during > > fast forward decoding, as implemented in the patch 0001 (We already > > built base snapshot in fast-forward mode for logical message in > logicalmsg_decode()). > > > > I also explored the possibility of further optimizing the fix to > > reduce the overhead associated with building a snapshot in > > fast-forward mode. E.g., to maintain a single base_snapshot_xmin > > rather than generating a full snapshot for each transaction, and use > > this base_snapshot_xmin when determining the candidate catalog_xmin. > > However, I am not feeling confident about maintaining some > > fast-forward dedicated fields in common structures and perhaps employing > different handling for catalog_xmin. > > > > Moreover, I conducted a basic test[2] to test the patch's impact, > > noting that advancing the slot incurs roughly a 4% increase in > > processing time after applying the patch, which appears to be > > acceptable. Additionally, the cost associated with building the > > snapshot via SnapBuildBuildSnapshot() did not show up in the profile. > > Where did the regression come from? I think that even with your patch > SnapBuildBuildSnapshot() would be called only a few times in the workload. AFAICS, the primary cost arises from the additional function invocations/executions. Functions such as SnapBuildProcessChange, ReorderBufferSetBaseSnapshot, and ReorderBufferXidHasBaseSnapshot are invoked more frequently after the patch. Although these functions aren't inherently expensive, the scenario I tested involved numerous transactions starting with just a single insert each. This resulted in a high frequency of calls to these functions. Attaching the profile for both HEAD and PATCHED for reference. > With the patch, I think that we need to create ReorderBufferTXN entries for > each transaction even during fast_forward mode, which could lead to > overheads. I think ReorderBufferTXN was created in fast_forward even without the patch. See heap_decode-> ReorderBufferProcessXid. > I think that 4% degradation is something that we want to avoid. 
As mentioned above, these costs arise from essential function executions required to maintain txns_by_base_snapshot_lsn. So I think the cost is reasonable to ensure correctness. Thoughts? Best Regards, Hou zj
Re: Making sslrootcert=system work on Windows psql
On Wed, 23 Apr 2025 at 17:47, George MacKerron wrote: > I’d suggest two new special sslrootcert values: > > (1) sslrootcert=openssl > > This does exactly what sslrootcert=system does now, but is less confusingly > named for Windows users. sslrootcert=system becomes a deprecated synonym for > this option. > > (2) sslrootcert=os Okay I have some time to respond to this thread now. My main thought is this: I think we should try as hard as possible for sslrootcert=system to do "the right thing" on all platforms. "the right thing" being: allowing users to connect to a Postgres server only if that server's certificate can be verified against the platform's trusted CAs. It's a huge shame that MANY users connect to their production Postgres databases over unauthenticated TLS. I believe the only way to fix that is by having a *standard* & *secure* connection string that people can copy paste from their database service provider's portal. Adding new options that users need to choose between makes it impossible for database providers to provide such a *standard* connection string, because the exact string will depend on the platform. Which means they'll continue to only provide sslmode=require in their provided connstrings. And even if we can somehow avoid that, it will reset the clock on when most clients will actually support that *standard* connection string. Thus increasing the time (by at least two years) before database providers dare to put these options in their default connection strings, in fear of their customers not being able to connect and opening support requests or closing their accounts. But yeah, the discussed situation is problematic for this: Windows machines have multiple cert stores that could be reasonably considered the system store. So I'd like to propose a different way around that problem: instead of adding more connection options, how about we add a *compile time* option that allows the person who compiles libpq to choose which cert store it should use if sslrootcert=system is provided? Something like --system-cert-store=openssl and --system-cert-store=winstore flags for ./configure. This way users don't have to choose between the various system stores to get behaviour that is sensible. Which one should be the default requires discussion, and maybe we'd want to let that depend on the OpenSSL version or change it in the future. We could even make it required for people compiling libpq on Windows (with an OpenSSL version that supports winstore) to choose between these two options, by making it a required flag. Note: This doesn't mean we couldn't allow people to override the compile-time system store at runtime with e.g. sslsystemstore=winstore, but the default would be the one that was chosen at compile time. > BIGGER IDEA I would really like to get to a point where libpq by default fails to connect if you're not connecting to Postgres in a secure way: i.e. one where you're not both encrypting traffic and authenticating the host (unless you're connecting over unix sockets and maybe local loopback). I think there's no point in actually working on/proposing that until we have a secure connection string that works on all systems (i.e. what sslrootcert=system is supposed to do).
Re: Fix premature xmin advancement during fast forward decoding
Hi, On Tue, Apr 22, 2025 at 12:06 AM Zhijie Hou (Fujitsu) wrote: > > Hi, > > When analyzing some issues in another thread[1], I found a bug that fast > forward > decoding could lead to premature advancement of catalog_xmin, resulting in > required catalog data being removed by vacuum. > > The issue arises because we do not build a base snapshot when decoding changes > during fast forward mode, preventing reference to the minimum transaction ID > that remains visible in the snapshot when determining the candidate for > catalog_xmin. As a result, catalog_xmin was directly advanced to the oldest > running transaction ID found in the latest running_xacts record. > > In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot could > not > be reached during fast forward decoding, resulting in > rb->txns_by_base_snapshot_lsn being NULL. When advancing catalog_xmin, the > system attempts to refer to rb->txns_by_base_snapshot_lsn via > ReorderBufferGetOldestXmin(). However, since rb->txns_by_base_snapshot_lsn is > NULL, it defaults to directly using running->oldestRunningXid as the candidate > for catalog_xmin. For reference, see the details in > SnapBuildProcessRunningXacts. I agree with your analysis. > To fix this, I think we can allow the base snapshot to be built during fast > forward decoding, as implemented in the patch 0001 (We already built base > snapshot in fast-forward mode for logical message in logicalmsg_decode()). > > I also explored the possibility of further optimizing the fix to reduce the > overhead associated with building a snapshot in fast-forward mode. E.g., to > maintain a single base_snapshot_xmin rather than generating a full snapshot > for > each transaction, and use this base_snapshot_xmin when determining the > candidate catalog_xmin. However, I am not feeling confident about maintaining > some fast-forward dedicated fields in common structures and > perhaps employing different handling for catalog_xmin. > > Moreover, I conducted a basic test[2] to test the patch's impact, noting that > advancing the slot incurs roughly a 4% increase in processing time after > applying the patch, which appears to be acceptable. Additionally, the cost > associated with building the snapshot via SnapBuildBuildSnapshot() did not > show > up in the profile. Where did the regression come from? I think that even with your patch SnapBuildBuildSnapshot() would be called only a few times in the workload. With the patch, I think that we need to create ReorderBufferTXN entries for each transaction even during fast_forward mode, which could lead to overheads. I think that 4% degradation is something that we want to avoid. Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: sslmode=secure by default (Re: Making sslrootcert=system work on Windows psql)
On Thu, 24 Apr 2025 at 18:46, Jacob Champion wrote: > > On Thu, Apr 24, 2025 at 5:00 AM Peter Eisentraut wrote: > > I'm generally in favor of making sslmode=verify-full the effective > > default somehow. > > +many Yes, +many > Not to derail things too much, but I'd also like a postgress:// > scheme Sounds great. Let me derail some more: while we're at it, I think it would be good to add tls-prefixed aliases for all our ssl options, like tlscert/tlskey. Since such a postgress:// scheme would be totally new, maybe we can even disallow the ssl-prefixed ones there. > The hardest part, in my opinion, is that we'd have to start following > the RFC concept of "authority". A URL of > "postgress://example.com/db?host=evil.com&hostaddr=..." is outright > dangerous Why is this dangerous? As long as we validate that the cert provided by the server is for example.com, I don't see any security problem in having DNS resolution happen for evil.com, nor in having the IP addresses hardcoded using hostaddr. > as is "postgress://example.com/db?sslmode=disable" Yeah, that should be addressed, but it seems like we mainly need to disallow specifying sslmode completely there (or error if it's not verify-full). And maybe there are some other options that we'd want to disallow.
Re: Making sslrootcert=system work on Windows psql
On Thu, 24 Apr 2025 at 23:52, Jelte Fennema-Nio wrote: > How about we add a *compile time* > option that allows the person that compiles libpq to choose which cert > store it should use if sslrootcert=system is provided. Something like > --system-cert-store=openssl and --system-cert-store=winstore flags for > ./configure. @George So basically my suggestion is to make the behaviour that your patch introduces configurable at compile time. FWIW my vote would probably be to default to --system-cert-store=winstore if it's available. And then --system-cert-store=openssl would be a way out for people that took the effort to configure openssl correctly on Windows.
Re: extension_control_path and "directory"
On Apr 24, 2025, at 11:18, Matheus Alcantara wrote: > In v2 I've moved the logic to remove the /extension to > parse_extension_control_file(), do you think that this Assert on this > function would still be wrong? IIUC we should always have /extension at > the end of "control_dir" at this place, because the > extension_control_path GUC will omit the /extension at the end and we > will force it to have the suffix on the path at > find_extension_control_filename() and > get_extension_control_directories() functions. I'm missing something > here?

I took this patch for a spin and managed to make it core dump. How? Well, I installed semver with this command:

```sh
make prefix=/Users/david/Downloads install
```

Then set the search paths and restarted:

```ini
extension_control_path = '/Users/david/Downloads/share/extension:$system'
dynamic_library_path = '/Users/david/Downloads/lib:$libdir'
```

Then I connected and ran `CREATE EXTENSION semver` and it segfaulted. I poked around for a few minutes and realized that my prefix is not what I expected. Because it doesn’t contain the string “postgres”, PGXS helpfully adds it. The actual paths are:

```ini
extension_control_path = '/Users/david/Downloads/share/postgresql/extension:$system'
dynamic_library_path = '/Users/david/Downloads/lib/postgresql:$libdir'
```

With that fix it no longer segfaulted. So I presume something crashes when a directory or file doesn’t exist. But I am not at all sure we want this prefix behavior for installing extensions. I get that this has been the behavior for setting the main sharedir and libdir for Postgres, but I don’t know that it makes sense for extension prefixes.

Best,

David

signature.asc Description: Message signed with OpenPGP
Re: long-standing data loss bug in initial sync of logical replication
On Thu, 24 Apr 2025 at 14:39, Amit Kapila wrote: > > On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada > wrote: > > > > > > > > Fair enough. OTOH, we can leave the 13 branch considering following: > > > (a) it is near EOL, (b) bug happens in rare cases (when the DDLs like > > > ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take > > > a strong lock on table happens concurrently to DMLs on the tables > > > involved in the DDL.), and (c) the complete fix is invasive, even > > > partial fix is not simple. I have a slight fear that if we make any > > > mistake in fixing it partially (of course, we can't see any today), we > > > may not even get a chance to fix it. > > > > > > Now, if the above convinces you or someone else not to push the > > > partial fix in 13, then fine; otherwise, I'll push the 0001 to 13 day > > > after tomorrow. > > > > I've considered the above points. I guess (b), particularly executing > > ALTER PUBLICATION .. ADD TABLE while the target table is being > > updated, might not be rare depending on systems. Given that this bug > > causes a silent data-loss on the subscriber that is hard for users to > > realize, it could ultimately depend on to what extent we can mitigate > > the problem with only 0001 and there is a workaround when the problem > > happens. > > > > Kuroda-san already shared[1] the analysis of what happens with and > > without 0002 patch, but let me try with the example close to the > > original data-loss problem[2]: > > > > Consider the following scenario: > > > > S1: CREATE TABLE d(data text not null); > > S1: INSERT INTO d VALUES('d1'); > > S2: BEGIN; > > S2: INSERT INTO d VALUES('d2'); > > S1: ALTER PUBLICATION pb ADD TABLE d; > > S2: INSERT INTO d VALUES('d3'); > > S2: COMMIT > > S2: INSERT INTO d VALUES('d4'); > > S1: INSERT INTO d VALUES('d5'); > > > > Without 0001 and 0002 (i.e. as of today), the walsender fails to send > > all changes to table 'd' until it invalidates its caches for some > > reasons. > > > > With only 0001, the walsender sends 'd4' insertion or later. > > > > WIth both 0001 and 0002, the wansender sends 'd3' insertion or later. > > > > ISTM the difference between without both 0001 and 0002 and with 0001 > > is significant. So I think it's worth applying 0001 for v13. > > > > Pushed to v13 as well, thanks for sharing the feedback. > In the commits, I saw that the filenames are misspelled for files invalidation_distrubution.out and invalidation_distrubution.spec. This is present in branches from REL_13 to HEAD. I have attached patches to fix the same. Thanks and Regards, Shlok Kyal v1_HEAD-0001-Fix-spelling-for-file-names.patch Description: Binary data v1_REL_15-0001-Fix-spelling-for-file-names.patch Description: Binary data v1_REL_13-0001-Fix-spelling-for-file-names.patch Description: Binary data v1_REL_16-0001-Fix-spelling-for-file-names.patch Description: Binary data v1_REL_14-0001-Fix-spelling-for-file-names.patch Description: Binary data v1_REL_17-0001-Fix-spelling-for-file-names.patch Description: Binary data
Re: Fix premature xmin advancement during fast forward decoding
On Thu, Apr 24, 2025 at 9:56 PM Amit Kapila wrote: > > On Fri, Apr 25, 2025 at 8:14 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Fri, Apr 25, 2025 at 5:44 AM Masahiko Sawada wrote: > > > On Tue, Apr 22, 2025 at 12:06 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > Hi, > > > > > > > > When analyzing some issues in another thread[1], I found a bug that > > > > fast forward decoding could lead to premature advancement of > > > > catalog_xmin, resulting in required catalog data being removed by > > > > vacuum. > > > > > > > > The issue arises because we do not build a base snapshot when decoding > > > > changes during fast forward mode, preventing reference to the minimum > > > > transaction ID that remains visible in the snapshot when determining > > > > the candidate for catalog_xmin. As a result, catalog_xmin was directly > > > > advanced to the oldest running transaction ID found in the latest > > > running_xacts record. > > > > > > > > In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot > > > > could not be reached during fast forward decoding, resulting in > > > > rb->txns_by_base_snapshot_lsn being NULL. When advancing > > > catalog_xmin, > > > > rb->the > > > > system attempts to refer to rb->txns_by_base_snapshot_lsn via > > > > ReorderBufferGetOldestXmin(). However, since > > > > rb->txns_by_base_snapshot_lsn is NULL, it defaults to directly using > > > > running->oldestRunningXid as the candidate for catalog_xmin. For > > > > reference, see the details in SnapBuildProcessRunningXacts. > > > > > > I agree with your analysis. > > > > > > > To fix this, I think we can allow the base snapshot to be built during > > > > fast forward decoding, as implemented in the patch 0001 (We already > > > > built base snapshot in fast-forward mode for logical message in > > > logicalmsg_decode()). > > > > > > > > I also explored the possibility of further optimizing the fix to > > > > reduce the overhead associated with building a snapshot in > > > > fast-forward mode. E.g., to maintain a single base_snapshot_xmin > > > > rather than generating a full snapshot for each transaction, and use > > > > this base_snapshot_xmin when determining the candidate catalog_xmin. > > > > However, I am not feeling confident about maintaining some > > > > fast-forward dedicated fields in common structures and perhaps employing > > > different handling for catalog_xmin. > > > > > > > > Moreover, I conducted a basic test[2] to test the patch's impact, > > > > noting that advancing the slot incurs roughly a 4% increase in > > > > processing time after applying the patch, which appears to be > > > > acceptable. Additionally, the cost associated with building the > > > > snapshot via SnapBuildBuildSnapshot() did not show up in the profile. > > > > > > Where did the regression come from? I think that even with your patch > > > SnapBuildBuildSnapshot() would be called only a few times in the workload. > > > > AFAICS, the primary cost arises from the additional function > > invocations/executions. Functions such as SnapBuildProcessChange, > > ReorderBufferSetBaseSnapshot, and ReorderBufferXidHasBaseSnapshot are > > invoked > > more frequently after the patch. Although these functions aren't inherently > > expensive, the scenario I tested involved numerous transactions starting > > with > > just a single insert each. This resulted in a high frequency of calls to > > these > > functions. Ah, you're right. I looked at the wrong profile. > > > > Yeah, I see below from the patched profile. 
> |--4.65%--SnapBuildProcessChange
> |          |
> |          |--2.08%--ReorderBufferSetBaseSnapshot
> |          |          |
> |          |           --1.32%--__mcount_internal
> |          |
> |          |--1.29%--__mcount_internal
> |          |
> |           --0.72%--ReorderBufferXidHasBaseSnapshot

It looks like you've run the benchmark test with assertions enabled, is that right? I can see AssertTXNLsnOrder() in the profiles.

> As the number of operations per transaction increases, this cost
> should reduce further as we will set base_snapshot only once per
> transaction. Now, we can try to micro-optimize this by passing
> ReorderBufferTXN, as that is computed before we invoke
> SnapBuildProcessChange, but not sure how much it will change. However,
> the bigger question is, is it really worth optimizing this code path?
>
> If we really want to optimize this code path for the test Hou-San did,
> I see that SnapBuildCommitTxn() has a bigger overhead, so we should
> investigate whether we really need it for the fast-forward mode, where
> we won't process changes.
Re: Conflict detection for update_deleted in logical replication
On Thu, Apr 24, 2025 at 6:11 PM Zhijie Hou (Fujitsu) wrote: > > Few comments for patch004: > > Config.sgml: > > 1) > > + > > +Maximum duration (in milliseconds) for which conflict > > +information can be retained for conflict detection by the apply > > worker. > > +The default value is 0, indicating that conflict > > +information is retained until it is no longer needed for detection > > +purposes. > > + > > > > IIUC, the above is not entirely accurate. Suppose the subscriber manages to > > catch up and sets oldest_nonremovable_xid to 100, which is then updated in > > slot. After this, the apply worker takes a nap and begins a new xid update > > cycle. > > Now, let’s say the next candidate_xid is 200, but this time the subscriber > > fails > > to keep up and exceeds max_conflict_retention_duration. As a result, it sets > > oldest_nonremovable_xid to InvalidTransactionId, and the launcher skips > > updating the slot’s xmin. > > If the time exceeds the max_conflict_retention_duration, the launcher would > Invalidate the slot, instead of skipping updating it. So the conflict > info(e.g., > dead tuples) would not be retained anymore. > launcher will not invalidate the slot until all subscriptions have stopped conflict_info retention. So info of dead tuples for a particular oldest_xmin of a particular apply worker could be retained for much longer than this configured duration. If other apply workers are actively working (catching up with primary), then they should keep on advancing xmin of shared slot but if xmin of shared slot remains same for say 15min+15min+15min for 3 apply-workers (assuming they are marking themselves with stop_conflict_retention one after other and xmin of slot has not been advanced), then the first apply worker having marked itself with stop_conflict_retention still has access to the oldest_xmin's data for 45 mins instead of 15 mins. (where max_conflict_retention_duration=15 mins). Please let me know if my understanding is wrong. > > However, the previous xmin value (100) is still there > > in the slot, causing its data to be retained beyond the > > max_conflict_retention_duration. The xid 200 which actually honors > > max_conflict_retention_duration was never marked for retention. If my > > understanding is correct, then the documentation doesn’t fully capture this > > scenario. > > As mentioned above, the strategy here is to invalidate the slot. Please consider the case with multiple subscribers. Sorry if I missed to mention in my previous email that it was a multi-sub case. thanks Shveta
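To make the retention concrete: on the subscriber, the held-back horizon is visible in pg_replication_slots (an illustrative query only; the shared conflict-detection slot itself comes from the patch under discussion, not from any released version):

```sql
-- Watch how far xmin/catalog_xmin are being held back while apply workers
-- catch up or stop retaining conflict information.
SELECT slot_name, xmin, catalog_xmin, active
FROM pg_replication_slots;
```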
Re: Fix premature xmin advancement during fast forward decoding
On Fri, Apr 25, 2025 at 8:14 AM Zhijie Hou (Fujitsu) wrote: > > On Fri, Apr 25, 2025 at 5:44 AM Masahiko Sawada wrote: > > On Tue, Apr 22, 2025 at 12:06 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > Hi, > > > > > > When analyzing some issues in another thread[1], I found a bug that > > > fast forward decoding could lead to premature advancement of > > > catalog_xmin, resulting in required catalog data being removed by vacuum. > > > > > > The issue arises because we do not build a base snapshot when decoding > > > changes during fast forward mode, preventing reference to the minimum > > > transaction ID that remains visible in the snapshot when determining > > > the candidate for catalog_xmin. As a result, catalog_xmin was directly > > > advanced to the oldest running transaction ID found in the latest > > running_xacts record. > > > > > > In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot > > > could not be reached during fast forward decoding, resulting in > > > rb->txns_by_base_snapshot_lsn being NULL. When advancing > > catalog_xmin, > > > rb->the > > > system attempts to refer to rb->txns_by_base_snapshot_lsn via > > > ReorderBufferGetOldestXmin(). However, since > > > rb->txns_by_base_snapshot_lsn is NULL, it defaults to directly using > > > running->oldestRunningXid as the candidate for catalog_xmin. For > > > reference, see the details in SnapBuildProcessRunningXacts. > > > > I agree with your analysis. > > > > > To fix this, I think we can allow the base snapshot to be built during > > > fast forward decoding, as implemented in the patch 0001 (We already > > > built base snapshot in fast-forward mode for logical message in > > logicalmsg_decode()). > > > > > > I also explored the possibility of further optimizing the fix to > > > reduce the overhead associated with building a snapshot in > > > fast-forward mode. E.g., to maintain a single base_snapshot_xmin > > > rather than generating a full snapshot for each transaction, and use > > > this base_snapshot_xmin when determining the candidate catalog_xmin. > > > However, I am not feeling confident about maintaining some > > > fast-forward dedicated fields in common structures and perhaps employing > > different handling for catalog_xmin. > > > > > > Moreover, I conducted a basic test[2] to test the patch's impact, > > > noting that advancing the slot incurs roughly a 4% increase in > > > processing time after applying the patch, which appears to be > > > acceptable. Additionally, the cost associated with building the > > > snapshot via SnapBuildBuildSnapshot() did not show up in the profile. > > > > Where did the regression come from? I think that even with your patch > > SnapBuildBuildSnapshot() would be called only a few times in the workload. > > AFAICS, the primary cost arises from the additional function > invocations/executions. Functions such as SnapBuildProcessChange, > ReorderBufferSetBaseSnapshot, and ReorderBufferXidHasBaseSnapshot are invoked > more frequently after the patch. Although these functions aren't inherently > expensive, the scenario I tested involved numerous transactions starting with > just a single insert each. This resulted in a high frequency of calls to these > functions. > Yeah, I see below from the patched profile. 
|--4.65%--SnapBuildProcessChange
|          |
|          |--2.08%--ReorderBufferSetBaseSnapshot
|          |          |
|          |           --1.32%--__mcount_internal
|          |
|          |--1.29%--__mcount_internal
|          |
|           --0.72%--ReorderBufferXidHasBaseSnapshot

As the number of operations per transaction increases, this cost should reduce further as we will set base_snapshot only once per transaction. Now, we can try to micro-optimize this by passing ReorderBufferTXN, as that is computed before we invoke SnapBuildProcessChange, but not sure how much it will change. However, the bigger question is, is it really worth optimizing this code path?

If we really want to optimize this code path for the test Hou-San did, I see that SnapBuildCommitTxn() has a bigger overhead, so we should investigate whether we really need it for the fast-forward mode, where we won't process changes. I am of the opinion that we need to pay this additional cost of setting base_snapshot for the correctness of the fast-forward mode.

--
With Regards,
Amit Kapila.
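For readers less familiar with this code path: fast-forward decoding is the mode used by pg_replication_slot_advance(), so the kind of workload being profiled can be sketched like this (illustrative only, not the exact benchmark; assumes wal_level = logical and the contrib test_decoding plugin):

```sql
SELECT pg_create_logical_replication_slot('s1', 'test_decoding');

CREATE TABLE t (a int);

-- Many tiny transactions, one insert each, as in the scenario described above.
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (2);
-- ... repeated many times ...

-- Advancing the slot decodes all of these changes in fast-forward mode.
SELECT pg_replication_slot_advance('s1', pg_current_wal_lsn());
```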
Re: Conflicting updates of command progress
> pgstat_progress_start_command() is called twice: First with
> cmdtype=PROGRESS_COMMAND_CLUSTER, second with
> PROGRESS_COMMAND_CREATE_INDEX. The first happens in cluster_rel() the second
> in cluster_rel() -> rebuild_relation() -> finish_heap_swap() ->
> reindex_relation() -> reindex_index().
>
> It does not matter though, the current design only expects one command.

When I set a breakpoint on pgstat_progress_start_command and ran a CLUSTER on a table with a few indexes, I only saw start_command being called once for PROGRESS_COMMAND_CLUSTER. Then I went back in the code and realized that reindex_index when called via the CLUSTER command has progress set to false. So as it stands now, only PROGRESS_COMMAND_CLUSTER is effectively started when CLUSTER is called.

```
if (progress)
{
    const int   progress_cols[] = {
        PROGRESS_CREATEIDX_COMMAND,
        PROGRESS_CREATEIDX_INDEX_OID
    };
    const int64 progress_vals[] = {
        PROGRESS_CREATEIDX_COMMAND_REINDEX,
        indexId
    };

    pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
                                  heapId);
    pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
}
```

-- Sami
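For what it's worth, this is easy to observe from SQL while a CLUSTER runs in another session (a quick illustrative check, not part of any patch):

```sql
-- Only the CLUSTER progress view reports a row; the CREATE INDEX view stays
-- empty because reindex_index() is invoked with progress = false in this path.
SELECT pid, command, phase, index_rebuild_count
FROM pg_stat_progress_cluster;

SELECT pid, command, phase
FROM pg_stat_progress_create_index;
```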
Re: RFC: Additional Directory for Extensions
On Mar 19, 2025, at 13:29, David E. Wheeler wrote: > I think this should at least be documented, but generally feels unexpected to > me. I’ve attached a patch that fleshes out the docs, along with an example of > setting `extension_control_path` and `dynamic_library_path` to use the > locations. It might not have the information right about the need for > “postgresql” or “pgsql” in the path. Back in 2003[1] it was just “postgres”, > but I couldn’t find the logic for it just now. Here’s a rebase. Best, David v2-0001-Flesh-out-docs-for-the-prefix-make-variable.patch Description: Binary data signature.asc Description: Message signed with OpenPGP
Re: What's our minimum supported Python version?
On Thu, Apr 24, 2025 at 7:59 AM Tom Lane wrote: > I'm still content with the idea of deciding that 3.6 is now our > cutoff. Seems like no one is pushing hard for an earlier version, yet, so here's a patch with your suggested wording from upthread. I'm not sure if this meets Peter's request for precision. (Though I'm not really excited about documenting more precision than we are testing for...) > If someone comes along and complains because > oauth_server.py doesn't work on an older version, it's up to them > to provide a patch. As an aside, EL7's standard libcurl is too old to pass our current configure checks, so I think someone would really have to go out of their way to get to that point. --Jacob 0001-Bump-the-minimum-supported-Python-version-to-3.6.8.patch Description: Binary data
Re: Does RENAME TABLE rename associated identity sequence?
On Thu, 24 Apr 2025 at 05:53, Ashutosh Bapat wrote: > If there's any problem, IMO, ALTER TABLE ... RENAME ... should rename the > sequence too since the identity sequences are created implicitly when the > table is created, so they should be renamed implicitly. We should not > require WITH SEQUENCE clause. > My concern would be what happens if the new sequence name is not available. I suppose the simplest behaviour might be to skip renaming the sequence in that case, perhaps raising a warning.
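For illustration, a minimal example of the current behaviour, where the implicit identity sequence keeps the name derived from the old table name:

```sql
CREATE TABLE items (id int GENERATED ALWAYS AS IDENTITY, note text);
ALTER TABLE items RENAME TO stock_items;

-- The identity sequence is still named after the original table:
SELECT pg_get_serial_sequence('stock_items', 'id');
-- returns public.items_id_seq
```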
Re: Improve verification of recovery_target_timeline GUC.
On Thu, Apr 24, 2025 at 03:34:29PM +, David Steele wrote:
> Done. This means there are no commas in the upper bound but I don't think
> it's a big deal and it more closely matches other GUC messages.

One thing that I have double-checked is whether we have similar messages for GUCs that are defined as strings while requiring some range checks. While we have similar messages elsewhere (tablesample code is one, opclass a second), we don't have a similar thing for GUCs. I'd bet that this would be a win in the long run anyway.

> I'm also now using a single cluster for all three tests rather than creating
> a new one for each test. This is definitely more efficient.
>
> Having said that, I think these tests are awfully expensive for a single
> GUC. Unit tests would definitely be preferable but that's not an option for
> GUCs, so far as I know.

On my laptop, 003_recovery_targets.pl increases from 8.2s to 8.7s, which seems OK here for the coverage we have. Undoing the changes in xlogrecovery.c causes the tests to fail, so we are good.

"invalid recovery startup fails" is used three times in a row for the tests with bogus and out-of-bound values. I'd suggest making these more descriptive, with "bogus value", "lower bound check" and "upper bound check" added to tell them apart.

A side thing noted while reading the area: check_recovery_target_xid() does not have any GUC_check_errdetail() giving details for the EINVAL and ERANGE cases. Your addition for recovery_target_timeline is a good idea, so perhaps we could do the same there? No need to do that; that's just something I noticed in passing. And you are following the fat-comma convention for the command lines, thanks for doing that.

Note that I'm not planning to do anything here for v18 now that we are in feature freeze; these would be for v19 once the branch opens.
--
Michael

signature.asc Description: PGP signature
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 10:48 AM Masahiko Sawada wrote: > > On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond wrote: > > > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila > > wrote: > > > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada > > > wrote: > > > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila > > > > wrote: > > > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > > wrote: > > > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Approach 2 > > > > > > > > -- > > > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover > > > > > > > > together, a more > > > > > > > > flexible strategy could be only restrict failover for slots > > > > > > > > with two-phase > > > > > > > > enabled when there's a possibility of existing prepared > > > > > > > > transactions before > > > > > > > the > > > > > > > > two_phase_at that are not yet replicated. During slot creation > > > > > > > > with > > > > > > > two-phase > > > > > > > > and failover, we could check for any decoded prepared > > > > > > > > transactions when > > > > > > > > determining the decoding start point > > > > > > > > (DecodingContextFindStartpoint). For > > > > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > > > > two_phase_at is > > > > > > > > less than restart_lsn, indicating that all prepared > > > > > > > > transactions have been > > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during > > > > > > > > slot creation > > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for > > > > > > > > transactions to > > > > > > > prepare > > > > > > > > during consistent snapshot creation, the restriction becomes > > > > > > > > almost > > > > > > > > unnoticeable. > > > > > > > > > > > > > > I think this approach can work for the transactions that are > > > > > > > prepared > > > > > > > while the slot is created. But if I understand the problem > > > > > > > correctly, > > > > > > > while the initial table sync is performing, the slot's two_phase > > > > > > > is > > > > > > > still false, so we need to deal with the transactions that are > > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we > > > > > > haven't > > > > > > started decoding when setting two_phase=true during > > > > > > CreateDecodingContext() > > > > > > after tablesync, we could check prepared transactions afterwards > > > > > > during > > > > > > decoding. This could involve reporting an ERROR when skipping a > > > > > > prepared > > > > > > transaction during decoding if its prepare LSN is less than > > > > > > two_phase_at. > > > > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > > later point of time. > > > > > > > > > > > Alternatively, a simpler method would be to prevent this situation > > > > > > entirely > > > > > > during the CREATE SUBSCRIPTION command. For example, we could > > > > > > restrict slots > > > > > > created with failover set to true and twophase is later modified to > > > > > > true after > > > > > > tablesync. 
Although the simpler check is more user-visible, it may > > > > > > offer less > > > > > > flexibility. > > > > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > > smart checks in the back-branch. If we follow what you say here, then > > > > > users have the following ways in PG17 to enable both failover and > > > > > two_phase. (a) During Create Subscription, users can set both > > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > > to set 'failover'. > > > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > > add check when enabling failover for the two_phase-enabled-slots. For > > > > example, users who want to enable both need to do two steps: > > > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > > the two_phase states is ready (and possibly check if the slot's > > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > > the publisher enables the fai
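The two-step flow quoted above would look roughly like this (a sketch; subscription, publication, and connection details are placeholders):

```sql
-- Step 1: create the subscription with two_phase on and failover off.
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1
    WITH (two_phase = true, failover = false);

-- Step 2: once the subscription's two_phase state is ready, enable failover.
-- (Altering failover may require the subscription to be disabled first.)
ALTER SUBSCRIPTION sub1 DISABLE;
ALTER SUBSCRIPTION sub1 SET (failover = true);
ALTER SUBSCRIPTION sub1 ENABLE;
```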
Re: ZStandard (with dictionaries) compression support for TOAST compression
On Wed, Apr 23, 2025 at 11:59:26AM -0400, Robert Haas wrote:
> That's nice to know, but I think the key question is not so much what
> the feature costs when it is used but what it costs when it isn't
> used. If we implement a system where we don't let
> dictionary-compressed zstd datums leak out of tables, that's bound to
> slow down a CTAS from a table where this feature is used, but that's
> kind of OK: the feature has pros and cons, and if you don't like those
> tradeoffs, you don't have to use it. However, it sounds like this
> could also slow down inserts and updates in some cases even for users
> who are not making use of the feature, and that's going to be a major
> problem unless it can be shown that there is no case where the impact
> is at all significant. Users hate paying for features that they aren't
> using.

The cost of digesting a dictionary when decompressing sets of values is also something I think we should worry about, FWIW (see [1]), as digesting is documented as costly, so I think that there is also an argument for making the feature efficient when it is used. That would hurt if a sequential scan needs to detoast multiple blobs with the same dict. If we attach that on a per-value basis, wouldn't it imply that we need to digest the dictionary every time a blob is decompressed? This information could be cached, but it seems a bit weird to me to invent a new level of relation caching for what could be attached as a relation attribute option in the relcache. If a dictionary gets trained with a new sample of values, we could rely on the invalidation to pass the new information.

Based on what I'm reading (and I know very little about the topic, so I may be wrong), does it even make sense to allow multiple dictionaries to be used in a single attribute? Of course that may depend on the JSON blob patterns a single attribute is dealing with, but I'm not sure that this is worth the extra complexity this creates.

> I wonder if there's a possible design where we only allow
> dictionary-compressed datums to exist as top-level attributes in
> designated tables to which those dictionaries are attached; and any
> time you try to bury that Datum inside a container object (row, range,
> array, whatever) detoasting is forced. If there's a clean and
> inexpensive way to implement that, then you could avoid having
> heap_toast_insert_or_update care about HeapTupleHasExternal(), which
> seems like it might be a key point.

Interesting, not sure. FWIW, I'd still try to focus on making varatt more extensible with plain zstd support first, before diving into all these details. We are going to need it anyway.

[1]: https://facebook.github.io/zstd/zstd_manual.html#Chapter10
--
Michael

signature.asc Description: PGP signature
Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?
On Fri, 21 Mar 2025 at 17:14, Matthias van de Meent wrote: > Attached is v10, which polishes the previous patches, and adds a patch > for nbtree to use the new visibility checking strategy so that it too > can release its index pages much earlier, and adds a similar > visibility check test to nbtree. And here's v12. v11 (skipped) would've been a rebase, but after finishing the rebase I noticed a severe regression in btree's IOS with the new code, so v12 here applies some optimizations which reduce the overhead of the new code. Given its TableAM api changes it'd be nice to have a review on 0001, though the additions could be rewritten to not (yet) add TableAMRoutine. I think patches 1, 2 and 3 are relevant to PG18 (as long as we don't have a beta, and this is only a bit more than a bugfix). Patch 4 is for PG19 to get btree to implement the new API, too, and patch 5 contains tests similar to the bitmap scan tests, validating that IOS doesn't block VACUUM but still returns correct results. I'll try to figure out a patch that's backpatchable, as alternative to patches 2 and 3, or at least for back-patching into PG17-. That will arrive separately, though. Kind regards, Matthias van de Meent Neon (https://neon.tech) v12-0002-GIST-Fix-visibility-issues-in-IOS.patch Description: Binary data v12-0005-Test-for-IOS-Vacuum-race-conditions-in-index-AMs.patch Description: Binary data v12-0001-IOS-TableAM-Support-AM-specific-fast-visibility-.patch Description: Binary data v12-0004-NBTree-Reduce-Index-Only-Scan-pin-duration.patch Description: Binary data v12-0003-SP-GIST-Fix-visibility-issues-in-IOS.patch Description: Binary data
Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
On 2025-Apr-09, jian he wrote: > hi. > > attached patch is for address pg_dump inconsistency > when parent is "not null not valid" while child is "not null". Here's my take on this patch. We don't really need the notnull_parent_invalid flag; in flagInhAttrs we can just set "islocal" to convince getTableAttrs to print the constraint. This also fixes the bug that in getTableAttrs() you handled the case where shouldPrintColumn() is true and not the other one. I also added test cases to pg_dump/t/002_pg_dump.pl to verify that the output was correct in those cases. In constraints.sql I added a couple of tables to ensure that the pg_upgrade handling (the pg_dump --binary-upgrade mode) is tested as well. Looking at the surrounding code in flagInhAttrs I noticed that we're mishandling this case: create table parent1 (a int); create table parent2 (a int); create table child () inherits (parent1, parent2); alter table parent1 add not null a; alter table parent2 add not null a not valid; We print the constraint for table child for no apparent reason. Patch 0002 is a part of your proposed patch that I don't think we need, but I'm open to hearing arguments about why we do, preferrably with some test cases. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay que recordar que la existencia en el cosmos, y particularmente la elaboración de civilizaciones dentro de él no son, por desgracia, nada idílicas" (Ijon Tichy) >From 00c46ba90a24a99857cdc849943cf3fd7b96bd8e Mon Sep 17 00:00:00 2001 From: jian he Date: Wed, 9 Apr 2025 13:07:58 +0800 Subject: [PATCH 1/2] Fix pg_dump for inherited validated not-null constraints When a child constraint is validated and its parent constraint isn't, pg_dump requires special handling. --- src/bin/pg_dump/common.c | 12 +++ src/bin/pg_dump/pg_dump.c | 12 ++- src/bin/pg_dump/pg_dump.h | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 41 +-- src/test/regress/expected/constraints.out | 24 + src/test/regress/sql/constraints.sql | 14 6 files changed, 101 insertions(+), 3 deletions(-) diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c index 56b6c368acf..b0973d44f32 100644 --- a/src/bin/pg_dump/common.c +++ b/src/bin/pg_dump/common.c @@ -546,6 +546,18 @@ flagInhAttrs(Archive *fout, DumpOptions *dopt, TableInfo *tblinfo, int numTables parent->notnull_constrs[inhAttrInd] != NULL) foundNotNull = true; + /* + * For validated not-null constraints in child tables which + * derive from a parent constraint marked NOT VALID, we + * artificially mark the child constraint as local so that + * it is printed independently. Failing to do this would + * result in the child constraint being restored as NOT + * VALID. 
+ */ + if (fout->remoteVersion >= 18 && + parent->notnull_invalid[inhAttrInd]) + tbinfo->notnull_islocal[j] = true; + foundDefault |= (parentDef != NULL && strcmp(parentDef->adef_expr, "NULL") != 0 && !parent->attgenerated[inhAttrInd]); diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 105e917aa7b..0e2485479f8 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -9255,6 +9255,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables) tbinfo->attfdwoptions = (char **) pg_malloc(numatts * sizeof(char *)); tbinfo->attmissingval = (char **) pg_malloc(numatts * sizeof(char *)); tbinfo->notnull_constrs = (char **) pg_malloc(numatts * sizeof(char *)); + tbinfo->notnull_invalid = (bool *) pg_malloc(numatts * sizeof(bool)); tbinfo->notnull_noinh = (bool *) pg_malloc(numatts * sizeof(bool)); tbinfo->notnull_islocal = (bool *) pg_malloc(numatts * sizeof(bool)); tbinfo->attrdefs = (AttrDefInfo **) pg_malloc(numatts * sizeof(AttrDefInfo *)); @@ -9280,6 +9281,8 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables) tbinfo->attlen[j] = atoi(PQgetvalue(res, r, i_attlen)); tbinfo->attalign[j] = *(PQgetvalue(res, r, i_attalign)); tbinfo->attislocal[j] = (PQgetvalue(res, r, i_attislocal)[0] == 't'); + tbinfo->notnull_invalid[j] = false; /* it only change in + * determineNotNullFlags */ /* Handle not-null constraint name and flags */ determineNotNullFlags(fout, res, r, @@ -9756,6 +9759,12 @@ determineNotNullFlags(Archive *fout, PGresult *res, int r, else appendPQExpBuffer(*invalidnotnulloids, ",%s", constroid); + /* + * Track when a parent constraint is invalid for the cases where a + * child constraint has been validated independenly. + */ + tbinfo->notnull_invalid[j] = true; + /* nothing else to do */ tbinfo->notnull_constrs[j] = NULL; return; @@ -9763,10 +9772,11 @@ determineNotNullFlags(Archive *fout, PGresult *res, int r, /* * notnull_noinh is straight from the query result. notnull_islocal al
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Apr 23, 2025 at 1:13 PM Christoph Berg wrote: > Uhm, so far the plan was to have one "libpq-oauth" package, not several. I think the system is overconstrained at that point. If you want to support clients that delay-load the ABI they're compiled against, _and_ have them continue to work seamlessly after the system has upgraded the ABI underneath them, without restarting the client... is there any option other than side-by-side installation? > Since shipping a single libpq5.deb package for all PG majors has worked well > for the past decades, I wouldn't want to complicate that now. I'm not sure if it's possible to ship a client-side module system without something getting more complicated, though... I'm trying hard not to overcomplicate it for you, but I also don't think the complexity is going to remain the same. --Jacob
Re: [PATCH] dynahash: add memory allocation failure check
> On 24 Apr 2025, at 19:10, Tom Lane wrote: > > I thought about that but intentionally left it as-is, because that > would force zeroing of the space reserved for the hashtable name too. > That's unnecessary, and since it'd often be odd-sized it might result > in a less efficient fill loop. Well, that's just a few hundred bytes at most. But I agree that makes sense. Best regards, Andrey Borodin.
Re: Typos in the comment for the estimate_multivariate_ndistinct()
Daniel Gustafsson 于2025年4月16日周三 22:20写道: > > On 14 Apr 2025, at 15:34, Tender Wang wrote: > > > > Hi, > > > > While reading the estimate_multivariate_ndistinct(), > > I think "If a match it found, *varinfos is > > * updated to remove the list of matched varinfos" > > should be "If a match is found, *varinfos is > > * updated to remove the list of matched varinfos" > > I've attached a patch for that. > > Seems like a correct change. > Thanks for checking. -- Thanks, Tender Wang
Re: AIX support
On Mon, Apr 7, 2025 at 10:04 PM Heikki Linnakangas wrote: > I'm surprised how big the difference is, because I actually expected the > compiler to detect the memory-zeroing loop and replace it with some > fancy vector instructions (does powerpc have any?). It certainly does, and we've played with explicit AltiVec vectorisation before with the idea that our abstraction over x86 and ARM instructions could be extended to POWER, but that was just trying stuff and learning while wondering if those abstractions are general enough and how well all the instructions match up. No one with modern hardware and a vested interest has seriously investigated it. That's independent of automatic vectorisation done by the compiler or libc, and I guess it would not really be AIX-specific since it would also apply to Linux on POWER. One of those patches is linked from this small list of "ideas for a future PostgreSQL/AIX maintainer": https://wiki.postgresql.org/wiki/AIX