Re: Converting contrib SQL functions to new style
I wrote:
> I reviewed and pushed these.  Let's not forget that the pageinspect
> one is still pending, though.  We were waiting on Tomas' fix, which
> is now pushed at 957ba9ff1, so I suppose it needs a rebase.

Actually ... that one's quite trivial, so I went ahead and pushed it.
The submitted patch had the functions as PARALLEL SAFE, but I think
they have to remain PARALLEL RESTRICTED because they use
tuple_data_split(), cf pageinspect--1.10--1.11.sql.

			regards, tom lane
Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
On Mon, 30 Dec 2024 at 09:34, Shubham Khanna wrote: > > Hi, > > Currently, there is a risk that pg_createsubscriber may fail to > complete successfully when the max_slot_wal_keep_size value is set too > low. This can occur if the WAL is removed before the standby using the > replication slot is able to complete replication, as the required WAL > files are no longer available. > > I was able to reproduce this issue using the following steps: > Set up a streaming replication environment. > Run pg_createsubscriber in a debugger. > Pause pg_createsubscriber at the setup_recovery stage. > Perform several operations on the primary node to generate a large > volume of WAL, causing older WAL segments to be removed due to the low > max_slot_wal_keep_size setting. > Once the necessary WAL segments are deleted, continue the execution of > pg_createsubscriber. > At this point, pg_createsubscriber fails with the following error: > 2024-12-29 01:21:37.590 IST [427353] FATAL: could not receive data > from WAL stream: ERROR: requested WAL segment > 00010003 has already been removed > 2024-12-29 01:21:37.592 IST [427345] LOG: waiting for WAL to become > available at 0/3000110 > 2024-12-29 01:21:42.593 IST [427358] LOG: started streaming WAL from > primary at 0/300 on timeline 1 > 2024-12-29 01:21:42.593 IST [427358] FATAL: could not receive data > from WAL stream: ERROR: requested WAL segment > 00010003 has already been removed > > This issue was previously reported in [1], with a suggestion to raise > a warning in [2]. I’ve implemented a patch that logs a warning in > dry-run mode. This will give users the opportunity to adjust the > max_slot_wal_keep_size value before running the command. > > Thoughts? +1 for throwing a warning in dry-run mode Few comments: 1) We can have this check only in dry-run mode, it is not required in non dry-run mode as there is nothing much user can do once the tool is running, we can change this: + if (max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires 'max_slot_wal_keep_size = -1', but only %d remain", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the 'max_slot_wal_keep_size' configuration on the publisher to -1."); + } to: + if (dry_run && max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires 'max_slot_wal_keep_size = -1', but only %d remain", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the 'max_slot_wal_keep_size' configuration on the publisher to -1."); + } 2) This error message is not quite right, can we change it to "publisher requires max_slot_wal_keep_size to be -1, but is set to %d" + if (max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires 'max_slot_wal_keep_size = -1', but only %d remain", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the 'max_slot_wal_keep_size' configuration on the publisher to -1."); + } 3) Also the configuration could be specified in format specifier like it is specified in the earlier case Regards, Vignesh
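Putting the three comments above together, the dry-run check could end up
looking roughly like this (an illustrative sketch only, not compiled or
tested; the exact message wording is of course open for discussion):

	if (dry_run && max_slot_wal_keep_size != -1)
	{
		pg_log_warning("publisher requires %s to be -1, but it is set to %d",
					   "max_slot_wal_keep_size", max_slot_wal_keep_size);
		pg_log_warning_detail("Change the %s configuration on the publisher to -1.",
							  "max_slot_wal_keep_size");
	}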
Re: Memory leak in pg_logical_slot_{get,peek}_changes
On Fri, Dec 27, 2024 at 08:13:53AM, Zhijie Hou (Fujitsu) wrote:
> Thanks for updating patches ! They look good to me.

Fine by me as well.  I had a bit of time today, so I've taken care of
all of this down to 15 for now, after checking each branch.

+	cachectx_mcallback = palloc0(sizeof(MemoryContextCallback));
+	cachectx_mcallback->func = pgoutput_cachectx_reset_callback;
+	MemoryContextRegisterResetCallback(ctx->context, cachectx_mcallback);

In the version of the patch for 14 and 13, why don't you just use a
single reset callback to handle both cachectx and pubctx at the same
time?  That would be simpler.

+/*
+ * Private memory context for relation attribute map, created in
+ * PGOutputData->context when starting pgoutput, and set to NULL when its
+ * parent context is reset via a dedicated MemoryContextCallback.
+ */
+static MemoryContext cachectx = NULL;

This comment block is a copy-paste of the previous one, let's just
stick both declarations together.

> Just to confirm, would the other stuff (streamed_txns) that allocated under
> CacheMemoryContext also leaks memory ? I think it's OK to change them
> separately if it does but just would like to confirm if there is a risk.

Yeah, let's tackle this stuff separately and remove more of the
dependency on CacheMemoryContext.
--
Michael
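To illustrate the simplification suggested above for the 13/14 branches,
a single callback clearing both contexts could look roughly like this
(untested sketch, reusing the variable names from the patch):

static void
pgoutput_ctx_reset_callback(void *arg)
{
	/* The parent context is being reset, so both children are gone. */
	cachectx = NULL;
	pubctx = NULL;
}

	/* ... registered once against the parent context at startup ... */
	mcallback = palloc0(sizeof(MemoryContextCallback));
	mcallback->func = pgoutput_ctx_reset_callback;
	MemoryContextRegisterResetCallback(ctx->context, mcallback);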
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi Nisha,

Here are some review comments for the patch v57-0001.

==
src/backend/replication/slot.c

1.
+/*
+ * Invalidate replication slots that have remained idle longer than this
+ * duration; '0' disables it.
+ */
+int idle_replication_slot_timeout_min = HOURS_PER_DAY * MINS_PER_HOUR;

IMO it would be better to have the suffix "_mins" instead of "_min"
here to avoid any confusion with "minimum".

~~~

2.
+/*
+ * Can invalidate an idle replication slot?
+ */

Not an English sentence.

==
src/backend/utils/adt/timestamp.c

3.
+	/* Return if the difference meets or exceeds the threshold */
+	return (secs >= threshold_sec);

That comment may not be necessary; it is saying just the same as the code.

==
src/backend/utils/misc/guc_tables.c

4.
+	{
+		{"idle_replication_slot_timeout", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the duration a replication slot can remain idle before "
+						 "it is invalidated."),
+			NULL,
+			GUC_UNIT_MIN
+		},
+		&idle_replication_slot_timeout_min,
+		HOURS_PER_DAY * MINS_PER_HOUR, 0, INT_MAX / SECS_PER_MINUTE,
+		check_idle_replication_slot_timeout, NULL, NULL
+	},
+

Maybe it's better to include a comment that says "24 hours".
(e.g. like wal_summary_keep_time does)

==
src/backend/utils/misc/postgresql.conf.sample

5.
 #track_commit_timestamp = off		# collect timestamp of transaction commit
 					# (change requires restart)
+#idle_replication_slot_timeout = 1d	# in minutes; 0 disables

I felt it might be better to say 24h here instead of 1d.  And, that
would also be consistent with the docs, which said the default was 24
hours.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
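As a rough sketch of what #1 and #4 might look like once applied
(illustrative only, mirroring the style of the wal_summary_keep_time
entry):

	{
		{"idle_replication_slot_timeout", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Sets the duration a replication slot can remain idle before "
						 "it is invalidated."),
			NULL,
			GUC_UNIT_MIN
		},
		&idle_replication_slot_timeout_mins,
		HOURS_PER_DAY * MINS_PER_HOUR,	/* 24 hours */
		0, INT_MAX / SECS_PER_MINUTE,
		check_idle_replication_slot_timeout, NULL, NULL
	},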
Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)
On 12/29/24 16:39, Robert Treat wrote: > On Fri, Dec 27, 2024 at 8:25 PM Tomas Vondra wrote: >> On 12/27/24 05:00, Michael Paquier wrote: >>> On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote: If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not make a fundamental difference ... Anyway, the 128MB value is rather arbitrary. I don't mind increasing the limit, or possibly removing it entirely (and accepting anything the system can handle). >>> >>> +DefineCustomIntVariable("stats_history.size", >>> +"Sets the amount of memory available for past >>> events.", >>> >>> How about some time-based retention? Data size can be hard to think >>> about for the end user, while there would be a set of users that would >>> want to retain data for the past week, month, etc? If both size and >>> time upper-bound are define, then entries that match one condition or >>> the other are removed. >>> >> >> Right. In my response [1] I suggested replacing the simple memory limit >> with a time-based limit, but I haven't done anything about that yet. >> >> And the more I think about it the more I'm convinced we don't need to >> keep the data about past runs in memory, a file should be enough (except >> maybe for a small buffer). That would mean we don't need to worry about >> dynamic shared memory, etc. I initially rejected this because it seemed >> like a regression to how pgstat worked initially (sharing data through >> files), but I don't think that's actually true - this data is different >> (almost append-only), etc. >> >> The one case when we may need co read the data is in response to DROP of >> a table, when we need to discard entries for that object. Or we could >> handle that by recording OIDs of dropped objects ... not sure how >> complex this would need to be. >> >> We'd still want to enforce some limits, of course - a time-based limit >> of the minimum amount of time to cover, and maximum amount of disk space >> to use (more as a safety). >> >> FWIW there's one "issue" with enforcing the time-based limit - we can >> only do that for the "end time", because that's when we see the entry. >> If you configure the limit to keep "1 day" history, and then a vacuum >> completes after running for 2 days, we want to keep it, so that anyone >> can actually see it. >> > > I can't say I recall all the reasoning involved in making > pg_stat_statements just be based on a fixed number of entries, but the > ability to come up with corner cases was certainly a factor. For > example, imagine the scenario where you set a max at 30 days, but you > have some tables only being vacuumed every few months. Ideally you > probably want the last entry no matter what, and honestly probably the > last 2 (in case you are troubleshooting something, having the last run > and something to compare against is ideal). In any case, it can get > complicated pretty quickly. > I really don't want to get into such complicated retention policies, because there's no "right" solution, and it adds complexity both to implementation (e.g. we'd need to keep per-relation counters for all the various events), and processing (How would you know if there were no other vacuum runs on a relation or if those other events were removed?) I think the best solution to use a trivial retention policy (e.g. keep everything for X days), and if you want to keep a longer history, you need to read and archive that regularly (and the retention period ensures you don't miss any events). 
After all, that's what we assume for most other runtime stats anyway - it's not very useful to know the current values, if you can't calculate deltas. So you have to sample the stats. There's also the trouble that this is not crash safe, so I'd hesitate to rely on this very much if it can disappear. > ... > > At the risk of increasing scope, since you already are working on > checkpoints along with vacuums, I'm curious if there was a reason not > to do analyze stats retention as well? It seems pretty correlated in > the same area/problems as vacuum history. > I agree tracking ANALYZE would be useful. I ignored it mostly to keep the scope as limited as possible, so two separate events were enough. Adding another hook to do_analyze_rel() would be fairly trivial, but I'll leave that for later. regards -- Tomas Vondra
Re: An improvement of ProcessTwoPhaseBuffer logic
On Fri, Dec 27, 2024 at 06:16:24PM +0300, Vitaly Davydov wrote:
> In general, I like your solution to use FullTransactionId. I haven't
> found any evident problems. I just would like to propose to create a
> couple of new functions like RemoveTwoPhaseFileInCurrentEpoch,
> TwoPhaseFilePathInCurrentEpoch which accept TransactionId instead
> FullTransactionId. It may make the code clearer and result into less
> boilerplate code. I tried to do some hand testing. It seems, the
> problem is gone with the patch.

Thanks for the review.  Agreed that it would be a good thing to limit
the number of paths calling ReadNextFullTransactionId(), but I did not
much like the suggestion of TwoPhaseFilePathInCurrentEpoch(), feeling
that it was more important to keep a single code path in charge of
building the file names.  Instead, I have gone with a new
FullTransactionIdFromCurrentEpoch() that replaces
AdjustToFullTransactionId().  It cleans up most of the calls to
ReadNextFullTransactionId() compared to the previous patch.  It is true
that these couple with calls to ProcessTwoPhaseBuffer(), but the result
felt OK this way in the scope of this fix.

> As an idea, I would like to propose to store FullTransactionId in
> global transaction state instead of TransactionId. I'm not sure, it
> will consume significant additional memory, but it make the code
> more clear and probably result into less number of locks.

FWIW, I was wondering about doing the same thing.  However, I have
concluded that this is refactoring work out of the scope of fixing the
primary issue we have here, as we are hit by the way the file names are
built when we attempt to remove them.

Note that the memory footprint of storing a FullTransactionId in
twophase.c's GlobalTransactionData does not worry me much.  It is more
important to not increase the size of the two-phase state data,
impacting files on disk and their WAL records.  This size argument has
been mentioned on the thread that added epochs to the 2PC file names,
as far as I recall.

At the end, I have applied two patches, the first one down to 13 that
took care of the "future" issue, with tests added in the v14~v16 range.
v13 was lacking a perl routine, and it's not a big deal to not have
coverage for this code path anyway.  The second patch has been applied
down to v17, to fix the epoch issue, with the more advanced tests.

If you have a suggestion of patch to plug a FullTransactionId into
GlobalTransactionData rather than a TransactionId, feel free to propose
something!  Help is always welcome, and this would be HEAD-only work,
making it less urgent to deal with.
--
Michael
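For reference, the new helper is shaped roughly like the following
(simplified sketch; see the committed patch for the exact code):

static FullTransactionId
FullTransactionIdFromCurrentEpoch(TransactionId xid)
{
	FullTransactionId nextFullXid = ReadNextFullTransactionId();
	uint32		epoch = EpochFromFullTransactionId(nextFullXid);

	/*
	 * An XID numerically ahead of the next XID cannot belong to the
	 * current epoch, so it must come from the previous one.
	 */
	if (unlikely(xid > XidFromFullTransactionId(nextFullXid)))
		epoch--;

	return FullTransactionIdFromEpochAndXid(epoch, xid);
}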
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi Nisha. Here are some review comments for patch v57-0001. == src/backend/replication/slot.c 1. + +/* + * Raise an error based on the invalidation cause of the slot. + */ +static void +RaiseSlotInvalidationError(ReplicationSlot *slot) +{ + StringInfoData err_detail; + + initStringInfo(&err_detail); + + switch (slot->data.invalidated) 1a. /invalidation cause of the slot./slot's invalidation cause./ ~ 1b. This function does not expect to be called with slot->data.invalidated == RS_INVAL_NONE, so I think it will be better to assert that up-front. ~ 1c. This code could be simplified if you declare/initialize the variable together, like: StringInfo err_detail = makeStringInfo(); ~~~ 2. + case RS_INVAL_WAL_REMOVED: + appendStringInfo(&err_detail, _("This slot has been invalidated because the required WAL has been removed.")); + break; + + case RS_INVAL_HORIZON: + appendStringInfo(&err_detail, _("This slot has been invalidated because the required rows have been removed.")); + break; Since there are no format strings here, appendStringInfoString can be used directly in some places. == FYI. I've attached a diffs patch that implements some of the above-suggested changes. == Kind Regards, Peter Smith. Fujitsu Australia diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 2a99c1f..71c6ae2 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -2816,23 +2816,23 @@ WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) static void RaiseSlotInvalidationError(ReplicationSlot *slot) { - StringInfoData err_detail; + StringInfo err_detail = makeStringInfo(); - initStringInfo(&err_detail); + Assert(slot->data.invalidated != RS_INVAL_NONE); switch (slot->data.invalidated) { case RS_INVAL_WAL_REMOVED: - appendStringInfo(&err_detail, _("This slot has been invalidated because the required WAL has been removed.")); + appendStringInfoString(err_detail, _("This slot has been invalidated because the required WAL has been removed.")); break; case RS_INVAL_HORIZON: - appendStringInfo(&err_detail, _("This slot has been invalidated because the required rows have been removed.")); + appendStringInfoString(err_detail, _("This slot has been invalidated because the required rows have been removed.")); break; case RS_INVAL_WAL_LEVEL: /* translator: %s is a GUC variable name */ - appendStringInfo(&err_detail, _("This slot has been invalidated because \"%s\" is insufficient for slot."), + appendStringInfo(err_detail, _("This slot has been invalidated because \"%s\" is insufficient for slot."), "wal_level"); break; @@ -2844,5 +2844,5 @@ RaiseSlotInvalidationError(ReplicationSlot *slot) errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("can no longer get changes from replication slot \"%s\"", NameStr(slot->data.name)), - errdetail_internal("%s", err_detail.data)); + errdetail_internal("%s", err_detail->data)); }
Proposal: Progressive explain
Hello community, CONTEXT: Back in October I presented the talk "Debugging active queries with mid-flight instrumented explain plans" at PGConf EU 2024 (recording: https://www.youtube.com/watch?v=6ahTb-7C05c) presenting an experimental feature that enables visualization of in progress EXPLAIN ANALYZE executions. Given the positive feedback and requests, I am sending this patch with the feature, which I am calling Progressive Explain. PROPOSAL: This proposal introduces a feature to print execution plans of active queries in an in-memory shared hash object so that other sessions can visualize them with a new view: pg_stat_progress_explain. Plans are only printed if the new GUC parameter progressive_explain is enabled. For regular queries or queries started with EXPLAIN (without ANALYZE) the plan is printed only once at the start. For instrumented runs (started via EXPLAIN ANALYZE or when auto_explain flag log_analyze is enabled), the plan is printed on a fixed interval controlled by the new GUC parameter progressive_explain_interval. This plan includes all instrumentation stats computed so far (per node rows and execution time). New view: - pg_stat_progress_explain - pid: PID of the process running the query - last_explain: timestamp when plan was last printed - explain_count: amount of times plan was printed - total_explain_time: accumulated time spend printing plans (in ms) - explain: the actual plan (limited read privileges) New GUCs: - progressive_explain: if progressive plans are printed for local session. - type: bool - default: off - context: user - progressive_explain_interval: interval between each explain print. - type: int - default: 1s - min: 10ms - context: user - progressive_explain_sample_rate: fraction of rows processed by the query until progressive_explain_interval is evaluated to print a progressive plan - type: floating point - default: 0.1 - range: (0.0 - 1.0) - context: user - progressive_explain_output_size: max output size of the plan printed in the in-memory hash table. - type: int - default: 4096 - min: 100 - context: postmaster - progressive_explain_format: format used to print the plans. - type: enum - default: text - context: user - progressive_explain_settings: controls whether information about modified configuration is added to the printed plan. - type: bool - default: off - context: user - progressive_explain_verbose: controls whether verbose details are added to the printed plan. - type: bool - default: off - context: user DEMONSTRATION: postgres=# SET progressive_explain = ON; SET postgres=# EXPLAIN ANALYZE SELECT * FROM test t1 UNION ALL SELECT * FROM test t1; postgres=# select * from pg_stat_progress_explain; -[ RECORD 1 ]--+--- pid| 299663 last_explain | 2024-12-29 22:40:33.016833+00 explain_count | 5 total_explain_time | 0.205 explain| Append (cost=0.00..466670.40 rows=2160 width=37) (actual time=0.052..3372.318 rows=14013813 loops=1) | Buffers: shared hit=4288 read=112501 | -> Seq Scan on test t1 (cost=0.00..183334.80 rows=1080 width=37) (actual time=0.052..1177.428 rows=1000 loops=1) | Buffers: shared hit=4288 read=79046 | -> Seq Scan on test t1_1 (cost=0.00..183334.80 rows=1080 width=37) (actual time=0.072..608.481 rows=4013813 loops=1) (current) | Buffers: shared read=33455 | IMPLEMENTATION HIGHLIGHTS: - The initial plan is printed at the end of standard_ExecutorStart if progressive_explain is enabled, for both regular queries and instrumented ones (EXPLAIN ANALYZE): /* * Start progressive explain if enabled. 
*/ if (progressive_explain) ProgressiveExplainBegin(queryDesc); - The incremental plan print for instrumented runs uses a dedicated ExecProcNode if progressive_explain is enabled: if (node->instrument) if (progressive_explain) node->ExecProcNode = ExecProcNodeInstrExplain; else node->ExecProcNode = ExecProcNodeInstr; else node->ExecProcNode = node->ExecProcNodeReal; - ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an additional part to print plans based on a sampling logic: /* * Update progressive explain based on sampling. */ if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate) ProgressiveExplainUpdate(node); That logic was added because ExecProcNodeInstrExplain is called once per row processed (a lot of times) and performing the timestamp interval check with progressive_explain_interval to decide whether to print the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks (shared at the end of this email) show that sampling + timestamp check gives much better results than performing the
Re: Remove support for OpenSSL *eay32 libs on Windows
On Fri, Dec 27, 2024 at 03:09:48PM +0100, Daniel Gustafsson wrote:
> The thread in [0] which adds Meson support for *eay32 OpenSSL library names on
> Windows reminded me that we should remove these from master since we no longer
> support OpenSSL 1.0.2 (which was the final version with those names).  Attached
> is a small (as of yet untested on Windows/autoconf) diff removing these names
> from the Makefiles.
>
> [0] cahrt6656w9onfomqthbgydcm5ckz7hcgzft8l+n0ezbzfcn...@mail.gmail.com

Good idea, looks rather sane to me.  I don't have a clear idea about
the WIN32/autoconf combination (sorry!), but I can't really believe
that packagers of OpenSSL on Windows would want the same library name
compatibility, patching upstream to be able to achieve the old naming.

fairywren, hamerkop and drongo use meson.  Lorikeet relies on
./configure and builds with OpenSSL 3.0.12, so even there I would hope
that we are in the clear.

Not something I'm planning to care about, just noting that from the top
of OpenSSL master branch (0958f5a5bc47 as of today):
$ git grep eay32
.gitignore:/ms/libeay32.def
.gitignore:/ms/ssleay32.def
They should have no need for these references.
--
Michael
Re: IANA timezone abbreviations versus timezone_abbreviations
"Jelte Fennema-Nio" writes:
> I think it would be good to add some additional clarity here. It was
> fairly confusing to me. Especially the last sentence, due to the use of
> "active zone", even though it's really talking about the currently
> active abbreviations list. Probably my confusion mostly came from the
> fact that I wasn't aware that timezone abbreviations were localized, but
> I doubt I'm alone in not knowing this. Maybe something like this (feel
> free to improve further):

Hmm, I don't like your phrasing using "IANA time zone database",
because that makes it sound like we'll take any abbreviation that's
found anywhere in that whole data set.  What the proposal actually does
is to recognize any abbreviation that is used, or has been used in the
past, in the IANA time zone selected by our current timezone setting.
(And, perhaps more to the point, to give it the meaning it has or had
in that zone.)  Not sure about concise wording for that.

			regards, tom lane
Re: IANA timezone abbreviations versus timezone_abbreviations
On Mon Dec 16, 2024 at 8:57 PM CET, Tom Lane wrote:
> Andreas Karlsson writes:
>> On 12/13/24 12:33 AM, Tom Lane wrote:
>>
>> I am not convinced this is an improvement. While this patch removes the
>> round-trip hazard it also makes it confusing to use the
>> timezone_abbreviations GUC since it can be overridden by IANA data based
>> on your current timezone. So you need to know all the, sometimes weird,
>> names for your current timezone. Seems unnecessarily hard to reason
>> about and wouldn't most people who use timezone_abbreviations rely on
>> the current behavior?
>
> Presumably they're not that weird to the locals?  I am not sure what
> you mean by "people who use timezone_abbreviations".  I think that's
> about everyone --- it's not like the default setting doesn't contain
> any abbreviations.  (If it didn't then we'd not have such a problem...)

Maybe changing the default value of timezone_abbreviations is a better
solution to the problem, or in addition to the proposed patch.
Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
On Mon, Dec 30, 2024 at 10:10 AM vignesh C wrote: > > On Mon, 30 Dec 2024 at 09:34, Shubham Khanna > wrote: > > > > Hi, > > > > Currently, there is a risk that pg_createsubscriber may fail to > > complete successfully when the max_slot_wal_keep_size value is set too > > low. This can occur if the WAL is removed before the standby using the > > replication slot is able to complete replication, as the required WAL > > files are no longer available. > > > > I was able to reproduce this issue using the following steps: > > Set up a streaming replication environment. > > Run pg_createsubscriber in a debugger. > > Pause pg_createsubscriber at the setup_recovery stage. > > Perform several operations on the primary node to generate a large > > volume of WAL, causing older WAL segments to be removed due to the low > > max_slot_wal_keep_size setting. > > Once the necessary WAL segments are deleted, continue the execution of > > pg_createsubscriber. > > At this point, pg_createsubscriber fails with the following error: > > 2024-12-29 01:21:37.590 IST [427353] FATAL: could not receive data > > from WAL stream: ERROR: requested WAL segment > > 00010003 has already been removed > > 2024-12-29 01:21:37.592 IST [427345] LOG: waiting for WAL to become > > available at 0/3000110 > > 2024-12-29 01:21:42.593 IST [427358] LOG: started streaming WAL from > > primary at 0/300 on timeline 1 > > 2024-12-29 01:21:42.593 IST [427358] FATAL: could not receive data > > from WAL stream: ERROR: requested WAL segment > > 00010003 has already been removed > > > > This issue was previously reported in [1], with a suggestion to raise > > a warning in [2]. I’ve implemented a patch that logs a warning in > > dry-run mode. This will give users the opportunity to adjust the > > max_slot_wal_keep_size value before running the command. > > > > Thoughts? > > +1 for throwing a warning in dry-run mode > > Few comments: > 1) We can have this check only in dry-run mode, it is not required in > non dry-run mode as there is nothing much user can do once the tool is > running, we can change this: > + if (max_slot_wal_keep_size != -1) > + { > + pg_log_warning("publisher requires > 'max_slot_wal_keep_size = -1', but only %d remain", > + max_slot_wal_keep_size); > + pg_log_warning_detail("Change the > 'max_slot_wal_keep_size' configuration on the publisher to -1."); > + } > > to: > + if (dry_run && max_slot_wal_keep_size != -1) > + { > + pg_log_warning("publisher requires > 'max_slot_wal_keep_size = -1', but only %d remain", > + max_slot_wal_keep_size); > + pg_log_warning_detail("Change the > 'max_slot_wal_keep_size' configuration on the publisher to -1."); > + } > > 2) This error message is not quite right, can we change it to > "publisher requires max_slot_wal_keep_size to be -1, but is set to %d" > + if (max_slot_wal_keep_size != -1) > + { > + pg_log_warning("publisher requires > 'max_slot_wal_keep_size = -1', but only %d remain", > + max_slot_wal_keep_size); > + pg_log_warning_detail("Change the > 'max_slot_wal_keep_size' configuration on the publisher to -1."); > + } > > 3) Also the configuration could be specified in format specifier like > it is specified in the earlier case > I have fixed the given comments. The attached patch contains the suggested changes. Thanks and regards, Shubham Khanna. v2-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch Description: Binary data
table_tuple_lock's snapshot argument
/* * Lock a tuple in the specified mode. * * Input parameters: * relation: relation containing tuple (caller must hold suitable lock) * tid: TID of tuple to lock * snapshot: snapshot to use for visibility determinations * cid: current command ID (used for visibility test, and stored into * tuple's cmax if lock is successful) * mode: lock mode desired * wait_policy: what to do if tuple lock is not available * flags: * If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update chain to * also lock descendant tuples if lock modes don't conflict. * If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock * latest version. * * Output parameters: * *slot: contains the target tuple * *tmfd: filled in failure cases (see below) * * Function result may be: * TM_Ok: lock was successfully acquired * TM_Invisible: lock failed because tuple was never visible to us * TM_SelfModified: lock failed because tuple updated by self * TM_Updated: lock failed because tuple updated by other xact * TM_Deleted: lock failed because tuple deleted by other xact * TM_WouldBlock: lock couldn't be acquired and wait_policy is skip * * In the failure cases other than TM_Invisible and TM_Deleted, the routine * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax. See * comments for struct TM_FailureData for additional info. */ static inline TM_Result table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot, TupleTableSlot *slot, CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy, uint8 flags, TM_FailureData *tmfd) What are the semantics of the 'snapshot' argument? In the heapam implementation, it's not used for anything. What visibility checks the function might do in a different implementation? I vaguely remember that the idea was that the TID might not be sufficient to uniquely identify the row version in something like zheap, which updates the row in place. In that case, all the different row versions are represented by the same TID, and the snapshot identifies the version. There are a few callers of table_tuple_lock: 1. trigger.c: GetTupleForTrigger 2. nodeModifyTable.c 3. nodeLockRows.c 4. execReplication.c The first three callers pass the EState's snapshot, the same that was used in a table or index scan that returned the TID. That makes sense. But the calls in execReplication.c look fishy: PushActiveSnapshot(GetLatestSnapshot()); res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(), outslot, GetCurrentCommandId(false), lockmode, LockWaitBlock, 0 /* don't follow updates */ , &tmfd); PopActiveSnapshot(); if (should_refetch_tuple(res, &tmfd)) goto retry; Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong. I think the idea was to push the latest snapshot and use the same snapshot in the call to table_tuple_lock(). But because each call to GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as the active snapshot and passes a *different* snapshot to table_tuple_lock(). This went wrong in commit 5db6df0c01, which introduced the update/delete/insert/lock table AM interface. The argument to table_tuple_lock() was supposed to be GetActiveSnapshot(). However, I think GetLatestSnapshot() is wrong here anyway. None of this matters for heapam which just ignores the 'snapshot' argument, but let's assume a different AM that would use the snapshot to identify the tuple version. The TID was fetched from an index using an index scan with SnapshotDirty. 
There's no guarantee that the LatestSnapshot would match the same tuple version that the index scan found. If an MVCC snapshot is needed, surely it should be acquired before the index scan, and used for the index scan as well. I see three options: 1. Remove the 'snapshot' argument, since it's not used by heapam. If we get a table AM where a single TID represents multiple row versions, this will need to be revisited. 2. Rewrite the recheck code in execReplication.c so that it uses the snapshot in a more consistent fashion. Call GetLatestSnapshot() first, and use the same snapshot in the index scan and table_tuple_lock(). Acquiring a snapshot isn't free though, so it would be nice to avoid doing that when the heapam is just going to ignore it anywa
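For option 2, just to make it concrete, the recheck in execReplication.c
would become something like this (hand-written sketch, not tested),
taking one snapshot and using it both for the scan and for the lock:

	Snapshot	snap = GetLatestSnapshot();

	PushActiveSnapshot(snap);

	/* ... perform the index scan with 'snap' rather than SnapshotDirty ... */

	res = table_tuple_lock(rel, &(outslot->tts_tid), snap,
						   outslot,
						   GetCurrentCommandId(false),
						   lockmode,
						   LockWaitBlock,
						   0 /* don't follow updates */ ,
						   &tmfd);

	PopActiveSnapshot();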
meson: Fix missing name arguments of cc.compiles() calls
I noticed a few cc.compiles() checks in meson.build don't show up in the "meson setup" output, because they don't have a "name" argument. Also, the "typeof" test doesn't show the name of the symbol that is currently being tested. All this makes remote debugging a bit harder. This patch fixes it. While analyzing the fixed output, I also noticed that the test for decltype as an alternative to typeof never actually worked and was just forgotten to be removed. This is also fixed here. From c43a418850e6d5c23eb6fb1ed83935e688c1d261 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sun, 29 Dec 2024 13:43:38 +0100 Subject: [PATCH 1/2] meson: Fix missing name arguments of cc.compiles() calls Without it, the check won't show up in the meson setup/configure output. --- meson.build | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/meson.build b/meson.build index e5ce437a5c7..cb642b1f2fb 100644 --- a/meson.build +++ b/meson.build @@ -1727,6 +1727,7 @@ if cc.compiles(''' my_label: return 1; }''', +name: 'computed goto', args: test_c_args) cdata.set('HAVE_COMPUTED_GOTO', 1) endif @@ -1743,6 +1744,7 @@ if cc.compiles(''' ({ _Static_assert(1, "foo"); }); } ''', +name: '_Static_assert', args: test_c_args) cdata.set('HAVE__STATIC_ASSERT', 1) endif @@ -2359,6 +2361,7 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64' } int test_adds(int x) { return addi(3, x) + addi(x, 5); } ''', + name: '@0@: "i"(x) when __builtin_constant_p(x)'.format(host_cpu), args: test_c_args) cdata.set('HAVE_I_CONSTRAINT__BUILTIN_CONSTANT_P', 1) endif @@ -2547,7 +2550,7 @@ int main(void) return y; } '''.format(kw), -name: 'typeof()', +name: kw, args: test_c_args, include_directories: postgres_inc) cdata.set('HAVE_TYPEOF', 1) base-commit: d85ce012f99f63249bb45a78fcecb7a2383920b1 -- 2.47.1 From a1eb3feba6ebb687cf19a14156f5debf536fbccb Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sun, 29 Dec 2024 14:40:04 +0100 Subject: [PATCH 2/2] Remove useless configure check The test for "decltype" as variant of "typeof" apparently never worked (see commit 3582b223d49), so remove it. --- config/c-compiler.m4 | 2 +- configure| 2 +- meson.build | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index e112fd45d48..8534cc54c13 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -142,7 +142,7 @@ fi])# PGAC_C_STATIC_ASSERT AC_DEFUN([PGAC_C_TYPEOF], [AC_CACHE_CHECK(for typeof, pgac_cv_c_typeof, [pgac_cv_c_typeof=no -for pgac_kw in typeof __typeof__ decltype; do +for pgac_kw in typeof __typeof__; do AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [int x = 0; $pgac_kw(x) y; diff --git a/configure b/configure index 518c33b73a9..cfe62e80ba8 100755 --- a/configure +++ b/configure @@ -14344,7 +14344,7 @@ if ${pgac_cv_c_typeof+:} false; then : $as_echo_n "(cached) " >&6 else pgac_cv_c_typeof=no -for pgac_kw in typeof __typeof__ decltype; do +for pgac_kw in typeof __typeof__; do cat confdefs.h - <<_ACEOF >conftest.$ac_ext /* end confdefs.h. */ diff --git a/meson.build b/meson.build index cb642b1f2fb..045ce800667 100644 --- a/meson.build +++ b/meson.build @@ -2540,7 +2540,7 @@ endif # Check if the C compiler understands typeof or a variant. Define # HAVE_TYPEOF if so, and define 'typeof' to the actual key word. -foreach kw : ['typeof', '__typeof__', 'decltype'] +foreach kw : ['typeof', '__typeof__'] if cc.compiles(''' int main(void) { -- 2.47.1
Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)
On Fri, Dec 27, 2024 at 8:25 PM Tomas Vondra wrote: > On 12/27/24 05:00, Michael Paquier wrote: > > On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote: > >> If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not > >> make a fundamental difference ... > >> > >> Anyway, the 128MB value is rather arbitrary. I don't mind increasing the > >> limit, or possibly removing it entirely (and accepting anything the > >> system can handle). > > > > +DefineCustomIntVariable("stats_history.size", > > +"Sets the amount of memory available for past > > events.", > > > > How about some time-based retention? Data size can be hard to think > > about for the end user, while there would be a set of users that would > > want to retain data for the past week, month, etc? If both size and > > time upper-bound are define, then entries that match one condition or > > the other are removed. > > > > Right. In my response [1] I suggested replacing the simple memory limit > with a time-based limit, but I haven't done anything about that yet. > > And the more I think about it the more I'm convinced we don't need to > keep the data about past runs in memory, a file should be enough (except > maybe for a small buffer). That would mean we don't need to worry about > dynamic shared memory, etc. I initially rejected this because it seemed > like a regression to how pgstat worked initially (sharing data through > files), but I don't think that's actually true - this data is different > (almost append-only), etc. > > The one case when we may need co read the data is in response to DROP of > a table, when we need to discard entries for that object. Or we could > handle that by recording OIDs of dropped objects ... not sure how > complex this would need to be. > > We'd still want to enforce some limits, of course - a time-based limit > of the minimum amount of time to cover, and maximum amount of disk space > to use (more as a safety). > > FWIW there's one "issue" with enforcing the time-based limit - we can > only do that for the "end time", because that's when we see the entry. > If you configure the limit to keep "1 day" history, and then a vacuum > completes after running for 2 days, we want to keep it, so that anyone > can actually see it. > I can't say I recall all the reasoning involved in making pg_stat_statements just be based on a fixed number of entries, but the ability to come up with corner cases was certainly a factor. For example, imagine the scenario where you set a max at 30 days, but you have some tables only being vacuumed every few months. Ideally you probably want the last entry no matter what, and honestly probably the last 2 (in case you are troubleshooting something, having the last run and something to compare against is ideal). In any case, it can get complicated pretty quickly. 
> [1] > https://www.postgresql.org/message-id/8df7cee1-31aa-4db3-bbb7-83157ca139da%40vondra.me > > > +checkpoint_log_hook( > > + CheckpointStats.ckpt_start_t,/* start_time */ > > + CheckpointStats.ckpt_end_t,/* end_time */ > > + (flags & CHECKPOINT_IS_SHUTDOWN),/* is_shutdown */ > > + (flags & CHECKPOINT_END_OF_RECOVERY),/* is_end_of_recovery */ > > + (flags & CHECKPOINT_IMMEDIATE),/* is_immediate */ > > + (flags & CHECKPOINT_FORCE),/* is_force */ > > + (flags & CHECKPOINT_WAIT),/* is_wait */ > > + (flags & CHECKPOINT_CAUSE_XLOG),/* is_wal */ > > + (flags & CHECKPOINT_CAUSE_TIME),/* is_time */ > > + (flags & CHECKPOINT_FLUSH_ALL),/* is_flush_all */ > > + CheckpointStats.ckpt_bufs_written,/* buffers_written */ > > + CheckpointStats.ckpt_slru_written,/* slru_written */ > > + CheckpointStats.ckpt_segs_added,/* segs_added */ > > + CheckpointStats.ckpt_segs_removed,/* segs_removed */ > > + CheckpointStats.ckpt_segs_recycled,/* segs_recycled */ > > > > That's a lot of arguments. CheckpointStatsData and the various > > CHECKPOINT_* flags are exposed, why not just send these values to the > > hook? > > > > For v1-0001 as well, I'd suggest some grouping with existing > > structures, or expose these structures so as they can be reused for > > out-of-core code via the proposed hook. More arguments lead to more > > mistakes that could be easily avoided. > > Yes, I admit the number of parameters seemed a bit annoying to me too, > and maybe we could reduce that somewhat. Certainly for checkpoints, > where we already have a reasonable CheckpointStats struct and flags, > wrapping most of the fields. > > With vacuum it's a bit more complicated, for two reasons: (a) the > counters are simply in LVRelState, mixed with all kinds of other info, > and it seems "not great" to pass it to a "log" hook, and (b) ther
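To make the quoted suggestion concrete, the checkpoint-side hook could
simply take the existing struct plus the flags, something along these
lines (hypothetical sketch):

typedef void (*checkpoint_log_hook_type) (const CheckpointStatsData *stats,
										  int flags);

extern PGDLLIMPORT checkpoint_log_hook_type checkpoint_log_hook;

which avoids the long list of scalar arguments, at least for the
checkpoint case.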
Re: Query regarding pg_prewarm extension
On Sun, 29 Dec 2024 at 14:00, Bruce Momjian wrote:
> It feels like we should document what the block range is used for, so
> attached is a doc patch to do that.

- means prewarm through the last block in the relation). The return value
- is the number of blocks prewarmed.
+ means prewarm through the last block in the relation). The block
+ range allows a single relation to be loaded in parallel using multiple
+ concurent function calls. The return value is the number of blocks
+ prewarmed.

hmm, do we really need to highlight one specific usage for the range
like this?  I think mentioning this could just confuse readers as it
makes it sound like using a range is going to magically run something
in parallel.  I was confused as to what you were talking about until I
read what Jeremy had written in his email.

Another equally legitimate use case would be if the user only wanted to
prewarm a subset of the relation... Actually, I'd imagine that's
probably more common than someone trying to speed this up by kicking
off multiple queries each with their own range.  I imagine there's less
need to use the range to speed this up now that we have read streams
and likely there will be even less need when AIO is in.

I think the current wording is ok as it is. But if I'm outvoted,
"concurent" needs another 'r'.

David
Re: Converting contrib SQL functions to new style
I wrote:
> Here's the remaining two patches in the current set.  This is just
> to pacify the cfbot: I've not done anything to them, just verified
> that they still apply and pass regression.

I reviewed and pushed these.  Let's not forget that the pageinspect
one is still pending, though.  We were waiting on Tomas' fix, which
is now pushed at 957ba9ff1, so I suppose it needs a rebase.

			regards, tom lane
Re: Should fix a comment referring to stats collector?
On Sat, Dec 28, 2024 at 03:25:46PM +0800, Junwang Zhao wrote:
> I believe some redundant wording has been committed.
>
> K may be chosen may be chosen at ANALYZE time.
>
> Attached patch fixes it and with some line adjustments.

Indeed.  Fixed, thanks!
--
Michael
Re: IANA timezone abbreviations versus timezone_abbreviations
On Sun Dec 29, 2024 at 12:47 AM CET, Bruce Momjian wrote:
> On Mon, Dec 16, 2024 at 02:57:59PM -0500, Tom Lane wrote:
>
> Yes, your patch seems like a big improvement.  +1

+Before consulting the timezone_abbreviations file,
+PostgreSQL checks to see whether an
+abbreviation used in datetime input is defined in the IANA time zone
+database entry currently selected by the
+ run-time parameter. If so the time
+zone's meaning is used, for consistency with datetime output. The
+timezone_abbreviations file is mainly useful for
+allowing datetime input to recognize abbreviations for time zones
+other than the active zone.

I think it would be good to add some additional clarity here. It was
fairly confusing to me. Especially the last sentence, due to the use of
"active zone", even though it's really talking about the currently
active abbreviations list. Probably my confusion mostly came from the
fact that I wasn't aware that timezone abbreviations were localized,
but I doubt I'm alone in not knowing this. Maybe something like this
(feel free to improve further):

    Before consulting the timezone_abbreviations file, PostgreSQL
    checks to see whether an abbreviation used in datetime input is
    defined in the currently active IANA time zone database. The
    abbreviations for these entries are localized based on the timezone
    run-time parameter, so depending on the configured timezone the
    abbreviations will differ. If it is found in the IANA time zone
    database, then that meaning is used for consistency with datetime
    output. The timezone_abbreviations file is only useful for allowing
    datetime input to recognize abbreviations for time zones that are
    not defined in the currently active IANA time zone database.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Tue, Dec 24, 2024 at 10:37 PM Nisha Moond wrote: > > On Fri, Dec 20, 2024 at 3:12 PM Amit Kapila wrote: >> >> On Mon, Dec 16, 2024 at 4:10 PM Nisha Moond wrote: >> > >> > Here is the v56 patch set with the above comments incorporated. >> > >> >> Review comments: >> === >> 1. >> + { >> + {"idle_replication_slot_timeout", PGC_SIGHUP, REPLICATION_SENDING, >> + gettext_noop("Sets the duration a replication slot can remain idle before " >> + "it is invalidated."), >> + NULL, >> + GUC_UNIT_MS >> + }, >> + &idle_replication_slot_timeout_ms, >> >> I think users are going to keep idele_slot timeout at least in hours. >> So, millisecond seems the wrong choice to me. I suggest to keep the >> units in minutes. I understand that writing a test would be >> challenging as spending a minute or more on one test is not advisable. >> But I don't see any test testing the other GUCs that are in minutes >> (wal_summary_keep_time and log_rotation_age). The default value should >> be one day. > > > +1 > - Changed the GUC unit to "minute". > > Regarding the tests, we have two potential options: > 1) Introduce an additional "debug_xx" GUC parameter with units of seconds or > milliseconds, only for testing purposes. > 2) Skip writing tests for this, similar to other GUCs with units in minutes. > > IMO, adding an additional GUC just for testing may not be worthwhile. It's > reasonable to proceed without the test. > > Thoughts? > > The attached v57 patch-set addresses all the comments. I have kept the test > case in the patch for now, it takes 2-3 minutes to complete. > Hi Nisha. I think we are often too quick to throw out perfectly good tests. Citing that some similar GUCs don't do testing as a reason to skip them just seems to me like an example of "two wrongs don't make a right". There is a third option. Keep the tests. Because they take excessive time to run, that simply means you should run them *conditionally* based on the PG_TEST_EXTRA environment variable so they don't impact the normal BF execution. The documentation [1] says this env var is for "resource intensive" tests -- AFAIK this is exactly the scenario we find ourselves in, so is exactly what this env var was meant for. Search other *.pl tests for PG_TEST_EXTRA to see some examples. == [1] https://www.postgresql.org/docs/17/regress-run.html Kind Regards, Peter Smith. Fujitsu Australia
Log a warning in pg_createsubscriber for max_slot_wal_keep_size
Hi, Currently, there is a risk that pg_createsubscriber may fail to complete successfully when the max_slot_wal_keep_size value is set too low. This can occur if the WAL is removed before the standby using the replication slot is able to complete replication, as the required WAL files are no longer available. I was able to reproduce this issue using the following steps: Set up a streaming replication environment. Run pg_createsubscriber in a debugger. Pause pg_createsubscriber at the setup_recovery stage. Perform several operations on the primary node to generate a large volume of WAL, causing older WAL segments to be removed due to the low max_slot_wal_keep_size setting. Once the necessary WAL segments are deleted, continue the execution of pg_createsubscriber. At this point, pg_createsubscriber fails with the following error: 2024-12-29 01:21:37.590 IST [427353] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00010003 has already been removed 2024-12-29 01:21:37.592 IST [427345] LOG: waiting for WAL to become available at 0/3000110 2024-12-29 01:21:42.593 IST [427358] LOG: started streaming WAL from primary at 0/300 on timeline 1 2024-12-29 01:21:42.593 IST [427358] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00010003 has already been removed This issue was previously reported in [1], with a suggestion to raise a warning in [2]. I’ve implemented a patch that logs a warning in dry-run mode. This will give users the opportunity to adjust the max_slot_wal_keep_size value before running the command. Thoughts? [1] - https://www.postgresql.org/message-id/TY3PR01MB9889FEDBBF74A9F79CA7CC87F57A2%40TY3PR01MB9889.jpnprd01.prod.outlook.com [2] - https://www.postgresql.org/message-id/be92c57b-82e1-4920-ac31-a8a04206db7b%40app.fastmail.com Thanks and regards, Shubham Khanna. v1-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch Description: Binary data