Re: brininsert optimization opportunity
On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra wrote:

> The attached 0001 patch fixes this by adding the call to validate_index,
> which seems like the proper place as it's where the indexInfo is
> allocated and destroyed.

Yes, and by reading validate_index's header comment, there is a clear
expectation that we will be adding tuples to the index in the table AM call
table_index_validate_scan (albeit "validating" doesn't necessarily convey
this side effect). So, it makes perfect sense to call it here.

> But this makes me think - are there any other places that might call
> index_insert without then also doing the cleanup? I did grep for the
> index_insert() calls, and it seems OK except for this one.
>
> There's a call in toast_internals.c, but that seems OK because that only
> deals with btree indexes (and those don't need any cleanup). The same
> logic applies to unique_key_recheck(). The rest goes through
> execIndexing.c, which should do the cleanup in ExecCloseIndices().
>
> Note: We should probably call the cleanup even in the btree cases, even
> if only to make it clear it needs to be called after index_insert().

Agreed. It doesn't feel great, but yes, all of this btree-specific code does
call through the index_* functions instead of calling the btree functions
directly. So, ideally we should follow through with that pattern and call
the cleanup explicitly. But we would be introducing per-tuple overhead that
is totally wasted. Maybe we can add a comment instead, like:

void
toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock)
{
	int			i;

	/*
	 * Save a few cycles by choosing not to call index_insert_cleanup().
	 * Toast indexes are btree, which don't have an aminsertcleanup() anyway.
	 */

	/* Close relations and clean up things */
	...
}

And add something similar for unique_key_recheck()? That said, I am also not
opposed to just accepting these wasted cycles, if the commenting seems wonky.

> I propose we do a much simpler thing instead - allow the cache may be
> initialized / cleaned up repeatedly, and make sure it gets reset at
> convenient place (typically after index_insert calls that don't go
> through execIndexing). That'd mean the cleanup does not need to happen
> very far from the index_insert(), which makes the reasoning much easier.
> 0002 does this.

That kind of goes against the ethos of ii_AmCache, which is meant to capture
state to be used across multiple index inserts. Also, quoting the current
docs:

"If the index AM wishes to cache data across successive index insertions
within an SQL statement, it can allocate space in indexInfo->ii_Context and
store a pointer to the data in indexInfo->ii_AmCache (which will be NULL
initially). After the index insertions complete, the memory will be freed
automatically. If additional cleanup is required (e.g. if the cache contains
buffers and tuple descriptors), the AM may define an optional function
indexinsertcleanup, called before the memory is released."

The memory will be freed automatically (as soon as ii_Context goes away).
So why would we explicitly free it, as in the attached 0002 patch? The whole
point of the cleanup function is to do something other than freeing memory,
as the docs highlight so well.

Also, the 0002 patch does introduce per-tuple function call overhead in
heapam_index_validate_scan(). And now we have split brain: in certain
situations we want index_insert_cleanup() called per index insert, and in
certain cases we would like it called once for all inserts. Not very easy
to understand, IMO.
Finally, I don't think that any index AM would have anything to clean up
after every insert. I had tried (abused) an approach with
MemoryContextCallbacks upthread and that seems a no-go. And yes, I agree,
having a dual to makeIndexInfo() (like destroyIndexInfo()) means that we
lose the benefits of ii_Context. That could be disruptive to existing AMs,
in-core and outside.

All said and done, v1 has my vote.

Regards,
Soumyadeep (VMware)
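For context, the caching pattern the quoted documentation describes looks
roughly like the sketch below. It assumes a hypothetical AM; the cleanup
callback's exact signature is illustrative, not the committed API.

typedef struct MyAmInsertState
{
	Buffer		cached_buffer;	/* e.g. a pinned buffer reused across inserts */
} MyAmInsertState;

static bool
my_aminsert(Relation index, Datum *values, bool *isnull,
			ItemPointer heap_tid, Relation heapRel,
			IndexUniqueCheck checkUnique, bool indexUnchanged,
			IndexInfo *indexInfo)
{
	MyAmInsertState *state = (MyAmInsertState *) indexInfo->ii_AmCache;

	if (state == NULL)
	{
		/* First insert of this statement: build the cache in ii_Context. */
		state = MemoryContextAllocZero(indexInfo->ii_Context,
									   sizeof(MyAmInsertState));
		state->cached_buffer = InvalidBuffer;
		indexInfo->ii_AmCache = state;
	}

	/* ... do the insertion, possibly keeping state->cached_buffer pinned ... */
	return false;
}

/*
 * Cleanup hook (signature illustrative): release resources that freeing
 * ii_Context alone cannot, such as buffer pins.
 */
static void
my_aminsertcleanup(IndexInfo *indexInfo)
{
	MyAmInsertState *state = (MyAmInsertState *) indexInfo->ii_AmCache;

	if (state && BufferIsValid(state->cached_buffer))
		ReleaseBuffer(state->cached_buffer);
}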
Re: remaining sql/json patches
Hi. Some small issues I found...

typo:
+-- Test mutabilily od query functions

+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("only datetime, bool, numeric, and text types can be casted to jsonpath types")));

In the transformJsonPassingArgs function, transformJsonValueExpr will make
the above code unreachable. Also, based on the `switch (typid)` cases, I
guess the best message would be:

errmsg("only datetime, bool, numeric, text, json, jsonb types can be casted to jsonpath types")));

+		case JSON_QUERY_OP:
+			jsexpr->wrapper = func->wrapper;
+			jsexpr->omit_quotes = (func->quotes == JS_QUOTES_OMIT);
+
+			if (!OidIsValid(jsexpr->returning->typid))
+			{
+				JsonReturning *ret = jsexpr->returning;
+
+				ret->typid = JsonFuncExprDefaultReturnType(jsexpr);
+				ret->typmod = -1;
+			}
+			jsexpr->result_coercion = coerceJsonFuncExprOutput(pstate, jsexpr);

I noticed that if (!OidIsValid(jsexpr->returning->typid)) is true, the
function JsonFuncExprDefaultReturnType may be called twice; not sure if
that's good or not...
Re: Synchronizing slots from primary to standby
Hi Shveta, here are some review comments for v45-0002.

==
doc/src/sgml/bgworker.sgml

1.
+ BgWorkerStart_PostmasterStart
+   Start as soon as postgres itself has finished its own initialization;
+   processes requesting this are not eligible for database connections.
+
+ BgWorkerStart_ConsistentState
+   Start as soon as a consistent state has been reached in a hot-standby,
+   allowing processes to connect to databases and run read-only queries.
+
+ BgWorkerStart_RecoveryFinished
+   Start as soon as the system has entered normal read-write state. Note
+   that the BgWorkerStart_ConsistentState and
+   BgWorkerStart_RecoveryFinished are equivalent in a server that's not a
+   hot standby.
+
+ BgWorkerStart_ConsistentState_HotStandby
+   Same meaning as BgWorkerStart_ConsistentState but it is more strict in
+   terms of the server, i.e. start the worker only if it is a hot standby.

Maybe reorder these slightly, because I felt it is better if
BgWorkerStart_ConsistentState_HotStandby comes next after
BgWorkerStart_ConsistentState, which it refers to. For example:

1st. BgWorkerStart_PostmasterStart
2nd. BgWorkerStart_ConsistentState
3rd. BgWorkerStart_ConsistentState_HotStandby
4th. BgWorkerStart_RecoveryFinished

==
doc/src/sgml/config.sgml

2.
enable_syncslot = true

Not sure, but I thought the "= true" part should be formatted too.

SUGGESTION
enable_syncslot = true

==
doc/src/sgml/logicaldecoding.sgml

3.
+ A logical replication slot on the primary can be synchronized to the hot
+ standby by enabling the failover option during slot creation and setting
+ enable_syncslot on the standby. For the synchronization
+ to work, it is mandatory to have a physical replication slot between the
+ primary and the standby. It's highly recommended that the said physical
+ replication slot is listed in standby_slot_names on
+ the primary to prevent the subscriber from consuming changes faster than
+ the hot standby. Additionally, hot_standby_feedback
+ must be enabled on the standby for the slots synchronization to work.

I felt those parts that describe the mandatory GUCs should be kept together.

SUGGESTION
For the synchronization to work, it is mandatory to have a physical
replication slot between the primary and the standby, and
hot_standby_feedback must be enabled on the standby. It's also highly
recommended that the said physical replication slot is named in the
standby_slot_names list on the primary, to prevent the subscriber from
consuming changes faster than the hot standby.

~~~

4. (Chapter 49)

By enabling synchronization of slots, logical replication can be resumed
after failover depending upon the pg_replication_slots.sync_state for the
synchronized slots on the standby at the time of failover. Only slots that
were in ready sync_state ('r') on the standby before failover can be used
for logical replication after failover. However, the slots which were in
initiated sync_state ('i') and not sync-ready ('r') at the time of failover
will be dropped and logical replication for such slots can not be resumed
after failover. This applies to the case where a logical subscription is
disabled before failover and is enabled after failover. If the synchronized
slot due to disabled subscription could not be made sync-ready ('r') on
standby, then the subscription can not be resumed after failover even when
enabled.
If the primary is idle, then the synchronized slots on the standby may take
a noticeable time to reach the ready ('r') sync_state. This can be sped up
by calling the pg_log_standby_snapshot function on the primary.

~

Somehow, I still felt all that was too wordy/repetitive. Below is my attempt
to make it more concise. Thoughts?

SUGGESTION
The ability to resume logical replication after failover depends upon the
pg_replication_slots.sync_state value for the synchronized slots on the
standby at the time of failover. Only slots that have attained a "ready"
sync_state ('r') on the standby before failover can be used for logical
replication after failover. Slots that have not yet reached 'r' state (they
are still 'i') will be dropped, therefore logical replication for those
slots cannot be resumed. For example, if the synchronized slot could not
become sync-ready on standby due to a disabled subscription, then the
subscription cannot be resumed after failover even when it is enabled.

If the primary is idle, the synchronized slots on the standby may take a
noticeable time to reach the ready ('r') sync_state. This can be sped up by
calling the pg_log_standby_snapshot function on the primary.
Re: trying again to get incremental backup
Hi Robert,

On Mon, Dec 11, 2023 at 6:08 PM Robert Haas wrote:
>
> On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak wrote:
> > While we are at it, maybe around the below in PrepareForIncrementalBackup()
> >
> > if (tlep[i] == NULL)
> >     ereport(ERROR,
> >             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >              errmsg("timeline %u found in manifest, but not in this server's history",
> >                     range->tli)));
> >
> > we could add
> >
> > errhint("You might need to start a new full backup instead of incremental one")
> >
> > ?
>
> I can't exactly say that such a hint would be inaccurate, but I think
> the impulse to add it here is misguided. One of my design goals for
> this system is to make it so that you never have to take a new
> incremental backup "just because,"

Did you mean take a new full backup here?

> not even in case of an intervening
> timeline switch. So, all of the errors in this function are warning
> you that you've done something that you really should not have done.
> In this particular case, you've either (1) manually removed the
> timeline history file, and not just any timeline history file but the
> one for a timeline for a backup that you still intend to use as the
> basis for taking an incremental backup or (2) tried to use a full
> backup taken from one server as the basis for an incremental backup on
> a completely different server that happens to share the same system
> identifier, e.g. because you promoted two standbys derived from the
> same original primary and then tried to use a full backup taken on one
> as the basis for an incremental backup taken on the other.

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and failed
with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb directory)
the DB where I had a cronjob that just failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which might
trigger users to complain on forums or in support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled the
long/large pgbench -s 14000 (~200GB?) multiple-day incremental test. I'll
let you know how it went.

-J.
Re: Report planning memory in EXPLAIN ANALYZE
On Wed, Dec 13, 2023 at 1:41 AM Alvaro Herrera wrote:
>
> I would replace the text in explain.sgml with this:
>
> +      Include information on memory consumption by the query planning phase.
> +      This includes the precise amount of storage used by planner in-memory
> +      structures, as well as overall total consumption of planner memory,
> +      including allocation overhead.
> +      This parameter defaults to FALSE.

To me, "consumption" in the "... total consumption ..." part is the same as
allocation, and that ultimately includes allocation overhead as well as any
memory that was allocated but remained unused (see discussion [1] if you
haven't already). Mentioning "total consumption" and "allocation overhead"
seems more confusing. How about

+      Include information on memory consumption by the query planning phase.
+      Report the precise amount of storage used by planner in-memory
+      structures, and total memory considering allocation overhead.
+      The parameter defaults to FALSE.

Takes care of your complaint about multiple include/ing as well.

> and remove the new example you're adding; it's not really necessary.

Done.

> In struct ExplainState, I'd put the new member just below summary.

Done.

> If we're creating a new memory context for planner, we don't need the
> "mem_counts_start" thing, and we don't need the
> MemoryContextSizeDifference thing.  Just add the
> MemoryContextMemConsumed() function (which ISTM should be just below
> MemoryContextMemAllocated() instead of in the middle of the
> MemoryContextStats implementation), and just report the values we get
> there.  I think the SizeDifference stuff increases surface damage for
> no good reason.

240 bytes are used right after creating a memory context, i.e.
mem_counts_start.totalspace = 8192 and mem_counts_start.freespace = 7952.
To account for that I used two counters and calculated the difference. If
no planning is involved in EXECUTE, it will show 240 bytes used instead of
0. Barring that, for all practical purposes 240 bytes is negligible. E.g.
240 bytes is 4% of the amount of memory used for planning a simple query
" select 'x' ". But your suggestion simplifies the code a lot. An error of
240 bytes seems worth the code simplification. So I did it that way.

> ExplainOnePlan() is given a MemoryUsage object (or, if we do as above
> and no longer have a MemoryUsage struct at all in the first place, a
> MemoryContextCounters object) even when the MEMORY option is false.
> This means we waste computing memory usage when not needed.  Let's avoid
> that useless overhead.

Done. Also avoided creating a memory context and switching to it when
es->memory = false.

> I'd also do away with the comment you added in explain_ExecutorEnd() and
> do just this, below setting of es->summary:
>
> +     /* No support for MEMORY option */
> +     /* es->memory = false; */

Done.

I ended up rewriting most of the code, so squashed everything into a single
patch as attached.

--
Best Wishes,
Ashutosh Bapat

From 8282b3347a8535cfe09b5f2f4a0c5a6bdcba11fa Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH] EXPLAIN reports memory consumed for planning a query

The memory consumed is reported as "Planner Memory" property in EXPLAIN
output when option MEMORY is specified. A separate memory context is
allocated for measuring memory consumption to eliminate any effect of
previous activity on the calculation. The memory context statistics is
noted before and after calling pg_plan_query() from ExplainOneQuery(). The
difference in the two statistics is used to calculate the memory
consumption.

We report two values "used" and "allocated". The difference between memory
statistics counters totalspace and freespace gives the memory that remains
palloc'ed at the end of planning. It is reported as memory "used". This
does not account for chunks of memory that were palloc'ed but subsequently
pfree'ed. But such memory remains allocated in the memory context and is
given by the totalspace counter. The value of this counter is reported as
memory "allocated".

Ashutosh Bapat, reviewed by David Rowley and Alvaro Herrera
---
 contrib/auto_explain/auto_explain.c   |  2 +
 doc/src/sgml/ref/explain.sgml         | 13 +
 src/backend/commands/explain.c        | 67 +-
 src/backend/commands/prepare.c        | 26 +-
 src/backend/utils/mmgr/mcxt.c         | 13 +
 src/include/commands/explain.h        |  4 +-
 src/include/utils/memutils.h          |  2 +
 src/test/regress/expected/explain.out | 68 +++
 src/test/regress/sql/explain.sql      |  6 +++
 9 files changed, 197 insertions(+), 4 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c3ac27ae99..26e80e4e16 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -396,6 +396,8 @@ explain_ExecutorEnd(QueryDesc
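For reference, the measurement approach described in the commit message
amounts to roughly the following. This is a sketch only, not the patch code:
the surrounding ExplainOneQuery()-style declarations are omitted, and it
assumes MemoryContextMemConsumed() fills a MemoryContextCounters as the
discussion implies.

	MemoryContext planner_ctx;
	MemoryContext saved_ctx;
	MemoryContextCounters mem_counts;
	PlannedStmt *plan;

	/* Plan inside a dedicated context so other activity can't pollute the numbers. */
	planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
										"explain planner",
										ALLOCSET_DEFAULT_SIZES);
	saved_ctx = MemoryContextSwitchTo(planner_ctx);

	plan = pg_plan_query(query, queryString, cursorOptions, params);

	MemoryContextSwitchTo(saved_ctx);

	/* Assumed helper from the patch: collect the context's counters. */
	MemoryContextMemConsumed(planner_ctx, &mem_counts);

	/* "used": memory still palloc'ed at the end of planning. */
	used_bytes = mem_counts.totalspace - mem_counts.freespace;
	/* "allocated": everything the context acquired, including overhead. */
	allocated_bytes = mem_counts.totalspace;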
Re: Some useless includes in llvmjit_inline.cpp
On Mon, Dec 11, 2023 at 6:28 AM Thomas Munro wrote: > > Hi, > > This is not exhaustive, I just noticed in passing that we don't need these. I was able to compile the changes with "--with-llvm" successfully, and the changes look good to me. Thanks and Regards, Shubham Khanna.
Re: Make COPY format extendable: Extract COPY TO format implementations
On Tue, Dec 12, 2023 at 11:09 AM Junwang Zhao wrote: > > On Mon, Dec 11, 2023 at 10:32 PM Masahiko Sawada > wrote: > > > > On Mon, Dec 11, 2023 at 7:19 PM Michael Paquier wrote: > > > > > > On Mon, Dec 11, 2023 at 10:57:15AM +0900, Masahiko Sawada wrote: > > > > IIUC we cannot create two same name functions with the same arguments > > > > but a different return value type in the first place. It seems to me > > > > to be an overkill to change such a design. > > > > > > Agreed to not touch the logictics of LookupFuncName() for the sake of > > > this thread. I have not checked the SQL specification, but I recall > > > that there are a few assumptions from the spec embedded in the lookup > > > logic particularly when it comes to specify a procedure name without > > > arguments. > > > > > > > Another idea is to encapsulate copy_to/from_handler by a super class > > > > like copy_handler. The handler function is called with an argument, > > > > say copyto, and returns copy_handler encapsulating either > > > > copy_to/from_handler depending on the argument. > > > > > > Yep, that's possible as well and can work as a cross-check between the > > > argument and the NodeTag assigned to the handler structure returned by > > > the function. > > > > > > At the end, the final result of the patch should IMO include: > > > - Documentation about how one can register a custom copy_handler. > > > - Something in src/test/modules/, minimalistic still useful that can > > > be used as a template when one wants to implement their own handler. > > > The documentation should mention about this module. > > > - No need for SQL functions for all the in-core handlers: let's just > > > return pointers to them based on the options given. > > > > Agreed. > > > > > It would be probably cleaner to split the patch so as the code is > > > refactored and evaluated with the in-core handlers first, and then > > > extended with the pluggable facilities and the function lookups. > > > > Agreed. > > > > I've sketched the above idea including a test module in > > src/test/module/test_copy_format, based on v2 patch. It's not splitted > > and is dirty so just for discussion. > > > The test_copy_format extension doesn't use the fields of CopyToState and > CopyFromState in this patch, I think we should move CopyFromStateData > and CopyToStateData to commands/copy.h, what do you think? Yes, I basically agree with that, where we move CopyFromStateData to might depend on how we define COPY FROM APIs though. I think we can move CopyToStateData to copy.h at least. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 12 Dec 2023, at 18:28, Alvaro Herrera wrote:
>
> Andrey, do you have any stress tests or anything else that you used to
> gain confidence in this code?

We are using only the first two steps of the patchset; these steps do not
touch the locking stuff. We've got some confidence from Yura Sokolov's
benchmarks [0]. Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra > wrote: > > > > Yes, if something like this happens, that'd be a problem: > > > > 1) decoding starts, with > > > >SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT > > > > 2) transaction that creates a new refilenode gets decoded, but we skip > >it because we don't have the correct snapshot > > > > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT > > > > 4) we decode sequence change from nextval() for the sequence > > > > This would lead to us attempting to apply sequence change for a > > relfilenode that's not visible yet (and may even get aborted). > > > > But can this even happen? Can we start decoding in the middle of a > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > messages, where we also call the output plugin in non-transactional cases. > > It's not a problem for logical messages because whether the message is > transaction or non-transactional is decided while WAL logs the message > itself. But here our problem starts with deciding whether the change > is transactional vs non-transactional, because if we insert the > 'relfilenode' in hash then the subsequent sequence change in the same > transaction would be considered transactional otherwise > non-transactional. > It is correct that we can make a wrong decision about whether a change is transactional or non-transactional when sequence DDL happens before the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens after that state. However, one thing to note here is that we won't try to stream such a change because for non-transactional cases we don't proceed unless the snapshot is in a consistent state. Now, if the decision had been correct then we would probably have queued the sequence change and discarded at commit. One thing that we deviate here is that for non-sequence transactional cases (including logical messages), we immediately start queuing the changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided SnapBuildProcessChange() returns true which is quite possible) and take final decision at commit/prepare/abort time. However, that won't be the case for sequences because of the dependency of determining transactional cases on one of the prior records. Now, I am not completely sure at this stage if such a deviation can cause any problem and or whether we are okay to have such a deviation for sequences. -- With Regards, Amit Kapila.
Re: trying again to get incremental backup
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak wrote: > > I can't exactly say that such a hint would be inaccurate, but I think > > the impulse to add it here is misguided. One of my design goals for > > this system is to make it so that you never have to take a new > > incremental backup "just because," > > Did you mean take a new full backup here? Yes, apologies for the typo. > > not even in case of an intervening > > timeline switch. So, all of the errors in this function are warning > > you that you've done something that you really should not have done. > > In this particular case, you've either (1) manually removed the > > timeline history file, and not just any timeline history file but the > > one for a timeline for a backup that you still intend to use as the > > basis for taking an incremental backup or (2) tried to use a full > > backup taken from one server as the basis for an incremental backup on > > a completely different server that happens to share the same system > > identifier, e.g. because you promoted two standbys derived from the > > same original primary and then tried to use a full backup taken on one > > as the basis for an incremental backup taken on the other. > > > > Okay, but please consider two other possibilities: > > (3) I had a corrupted DB where I've fixed it by running pg_resetwal > and some cronjob just a day later attempted to take incremental and > failed with that error. > > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb > directory) the DB where I had cronjob that just failed with this error > > I bet that (4) is going to happen more often than (1), (2) , which > might trigger users to complain on forums, support tickets. Hmm. In case (4), I was thinking that you'd get a complaint about the database system identifier not matching. I'm not actually sure that's what would happen, though, now that you mention it. In case (3), I think you would get an error about missing WAL summary files. > > Huzzah, the cfbot likes the patch set now. Here's a new version with > > the promised fix for your non-reproducible issue. Let's see whether > > you and cfbot still like this version. > > LGTM, all quick tests work from my end too. BTW: I have also scheduled > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental > test. I'll let you know how it went. Awesome, thank you so much. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote: > My biggest concern is around maintenance. Every year Unicode is > assigning new characters to existing code points, and those existing > code points can of course already be stored in old databases before > libs > are updated. Is the concern only about unassigned code points? I already committed a function "unicode_assigned()" to test whether a string contains only assigned code points, which can be used in a CHECK() constraint. I also posted[5] an idea about a per-database option that could reject the storage of any unassigned code point, which would make it easier for users highly concerned about compatibility. > And we may end up with > something like the timezone database where we need to periodically > add a > more current ruleset - albeit alongside as a new version in this > case. There's a build target "update-unicode" which is run to pull in new Unicode data files and parse them into static C arrays (we already do this for the Unicode normalization tables). So I agree that the tables should be updated but I don't understand why that's a problem. > If I'm reading the Unicode 15 update correctly, PostgreSQL regex > expressions with [[:digit:]] will not correctly identify Kaktovik or > Nag > Mundari or Kawi digits without that update to character type specs. Yeah, if we are behind in the Unicode version, then results won't be the most up-to-date. But ICU or libc could also be behind in the Unicode version. > But lets remember that people like to build indexes on character > classification functions like upper/lower, for case insensitive > searching. UPPER()/LOWER() are based on case mapping, not character classification. I intend to introduce a SQL-level CASEFOLD() function that would obey Unicode casefolding rules, which have very strong compatibility guarantees[6] (essentially, if you are only using assigned code points, you are fine). > It's another case where the index will be corrupted if > someone happened to store Latin Glottal vowels in their database and > then we update libs to the latest character type rules. I don't agree with this characterization at all. (a) It's not "another case". Corruption of an index on LOWER() can happen today. My proposal makes the situation better, not worse. (b) These aren't libraries, I am proposing built-in Unicode tables that only get updated in a new major PG version. (c) It likely only affects a small number of indexes and it's easier for an administrator to guess which ones might be affected, making it easier to just rebuild those indexes. (d) It's not a problem if you stick to assigned code points. > So even with something as basic as character type, if we're going to > do > it right, we still need to either version it or definitively decide > that > we're not going to every support newly added Unicode characters like > Latin Glottals. If, by "version it", you mean "update the data tables in new Postgres versions", then I agree. If you mean that one PG version would need to support many versions of Unicode, I don't agree. Regards, Jeff Davis [5] https://postgr.es/m/c5e9dac884332824e0797937518da0b8766c1238.ca...@j-davis.com [6] https://www.unicode.org/policies/stability_policy.html#Case_Folding
Re: Remove MSVC scripts from the tree
On Tue, Dec 05, 2023 at 07:29:59AM +0900, Michael Paquier wrote:
> Okay. Thanks for the update.

While in Prague, Andres and Peter E. mentioned to me that we had perhaps
better move on with this patch sooner rather than later, without waiting
for the two buildfarm members to do the switch, because much more cleanup
is required for the documentation once the scripts are removed.

So, any objections to the patch as presented, which removes the scripts
while moving the existing doc blocks from install-windows.sgml that still
need more discussion?
--
Michael
Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
On Tue, Dec 12, 2023 at 04:54:32PM -0300, Euler Taveira wrote:
> Couldn't it give up before starting if you apply your patch? My main
> concern is due to a slow system, the walrcv_connect() took too long in
> WalReceiverMain() and the code above kills the walreceiver while in the
> process to start it. Since you cannot control the hardcoded
> WALRCV_STARTUP_TIMEOUT value, you might have issues during overload
> periods.

Sounds like a fair point to me; this area is trickier than it looks.
Another thing that I'm a bit surprised by is why it would be OK to switch
the status to STREAMING only when first_stream is set, discarding the
restart case.
--
Michael
Re: Subsequent request from pg client
Abdul Matin writes: > Can postgres client send subsequent requests without receiving response? If > so, how does the postgres client correlate between a request and its > response? Where can I get more hints about it? https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-PIPELINING See also the pipelining-related functions in recent libpq versions. regards, tom lane
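A minimal libpq sketch of that pipelining flow, for illustration only:
queries are sent before any results are read, and results are then matched
to queries by arrival order, delimited by a sync point.

#include <libpq-fe.h>

static void
pipeline_demo(PGconn *conn)
{
	PGresult   *res;

	PQenterPipelineMode(conn);

	/* Send two queries without waiting for any results. */
	PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
	PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
	PQpipelineSync(conn);		/* marks the end of this batch */

	/* Results arrive in the same order the queries were sent. */
	for (int i = 0; i < 2; i++)
	{
		res = PQgetResult(conn);	/* result of query i */
		PQclear(res);
		res = PQgetResult(conn);	/* NULL marks end of this query's results */
		PQclear(res);				/* PQclear(NULL) is a no-op */
	}

	/* Finally, consume the PGRES_PIPELINE_SYNC result for the sync point. */
	res = PQgetResult(conn);
	PQclear(res);

	PQexitPipelineMode(conn);
}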
Re: Fix bug with indexes on whole-row expressions
ywgrit writes:
> I forbid to create indexes on whole-row expression in the following patch.
> I'd like to hear your opinions.

As I said in the previous thread, I don't think this can possibly be
acceptable. Surely there are people depending on the capability.

I'm not worried so much about the exact case of an index column being a
whole-row Var --- I agree that that's pretty useless --- but an index
column that is a function on a whole-row Var seems quite useful. (Your
patch fails to detect that, BTW, which means it does not block the case
presented in bug #18244.)

I thought about extending the ALTER TABLE logic to disallow changes in
composite types that appear in index expressions. We already have
find_composite_type_dependencies(), and it turns out that this already
blocks ALTER for the case you want to forbid, but we concluded that we
didn't need to prevent it for the bug #18244 case:

 * If objsubid identifies a specific column, refer to that in error
 * messages.  Otherwise, search to see if there's a user column of the
 * type.  (We assume system columns are never of interesting types.)
 * The search is needed because an index containing an expression
 * column of the target type will just be recorded as a whole-relation
 * dependency.  If we do not find a column of the type, the dependency
 * must indicate that the type is transiently referenced in an index
 * expression but not stored on disk, which we assume is OK, just as
 * we do for references in views.  (It could also be that the target
 * type is embedded in some container type that is stored in an index
 * column, but the previous recursion should catch such cases.)

Perhaps a reasonable answer would be to issue a WARNING (not error) in the
case where an index has this kind of dependency. The index might need to be
reindexed --- but it might not, too, and in any case I doubt that flat-out
forbidding the ALTER is a helpful idea.

			regards, tom lane
Re: Building PostgreSQL with LLVM fails on Solaris 11.4
Hi Andres,

Thanks for your reply. The reason I was suspicious about the warnings of the
gcc build was that gmake check reported 138 out of 202 tests as failed. I
have attached the output of gmake check.

After you mentioned that gcc did not report any errors, just warnings, we
installed the build. At first it seemed to work, and SELECT
pg_jit_available(); showed "pg_jit_available" as "t", but the DB showed
strange behaviour. I.e. not always, but sometimes, running "show
parallel_tuple_cost" caused postmaster to restart a server process. We had
to go back to the previous installation.

It seems there is definitely something wrong with the result gcc created.

Best regards
Sacha

From: Andres Freund
Date: Thursday, 7 December 2023 at 17:50
To: Sacha Hottinger
Cc: pgsql-hack...@postgresql.org
Subject: Re: Building PostgreSQL with LLVM fails on Solaris 11.4

Hi,

On 2023-12-07 13:43:55 +, Sacha Hottinger wrote:
> Thanks a lot.
> It now got much further but failed here with Sun Studio:
> …
> gmake[2]: Leaving directory '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src/test/perl'
> gmake -C backend/jit/llvm all
> gmake[2]: Entering directory '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src/backend/jit/llvm'
> /opt/developerstudio12.6/bin/cc -m64 -xarch=native -Xa -v -O -KPIC -D__STDC_LIMIT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_CONSTANT_MACROS -I/usr/include -I../../../../src/include -c -o llvmjit.o llvmjit.c
> "llvmjit.c", line 493: warning: argument #1 is incompatible with prototype:
>     prototype: pointer to void : "../../../../src/include/jit/llvmjit_emit.h", line 27
>     argument : pointer to function(pointer to struct FunctionCallInfoBaseData {pointer to struct FmgrInfo {..} flinfo, pointer to struct Node {..} context, pointer to struct Node {..} resultinfo, unsigned int fncollation, _Bool isnull, short nargs, array[-1] of struct NullableDatum {..} args}) returning unsigned long
> g++ -O -std=c++14 -KPIC -D__STDC_LIMIT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_CONSTANT_MACROS -I/usr/include -I../../../../src/include -c -o llvmjit_error.o llvmjit_error.cpp
> g++: error: unrecognized command-line option ‘-KPIC’; did you mean ‘-fPIC’?
> gmake[2]: *** [: llvmjit_error.o] Error 1
> gmake[2]: Leaving directory '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src/backend/jit/llvm'
> gmake[1]: *** [Makefile:42: all-backend/jit/llvm-recurse] Error 2
> gmake[1]: Leaving directory '/opt/cnd/opt28-2_13.3_gmake_all_llvm_fixV2/src'
> gmake: *** [GNUmakefile:11: all-src-recurse] Error 2

I don't know where the -KPIC is coming from. And TBH, I don't see much point
trying to fix a scenario involving matching Sun Studio C with g++.

> With gcc it fails at the same step as before.
> I have attached the log files of the SunStudio and gcc runs to the email.

I don't see a failure with gcc. The warnings are emitted for every extension
and compilation succeeds.

Greetings,

Andres Freund

gcc_gmake_check.log Description: gcc_gmake_check.log
Re: Built-in CTYPE provider
Jeff Davis wrote:

> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that. According to the Unicode FAQ
https://unicode.org/faq/casemap_charprop.html#5

  Q: Does the default case mapping work for every language? What about the
  default case folding?
  [...]
  To make case mapping language sensitive, the Unicode Standard
  specifically allows implementations to tailor the mappings for each
  language, but does not provide the necessary data. The file
  SpecialCasing.txt is included in the Standard as a guide to a few of the
  more important individual character mappings needed for specific
  languages, notably the Greek script and the Turkic languages. However,
  for most language-specific mappings and tailoring, users should refer to
  CLDR and other resources.

In particular "el" (modern greek) has case mapping rules that ICU seems to
implement, but "el" is missing from the list ("lt", "tr", and "az") you
identified.

The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in *-Lower.xml and *-Upper.xml

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
"pgoutput" options missing on documentation
I noticed that Logical Streaming Replication Protocol documentation [1] is missing the options added to "pgoutput" since version 14. A patch is attached to fix it together with another small one to give a nice error when "proto_version" parameter is not provided. [1] https://www.postgresql.org/docs/devel/protocol-logical-replication.html v00-0001-pgoutput-Test-if-protocol_version-is-given.patch Description: Binary data v00-0002-pgoutput-Document-missing-options.patch Description: Binary data
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > But there are CLDR mappings on top of that. I see, thank you. Would it still be called "full" case mapping to only use the mappings in SpecialCasing.txt? And would that be useful? Regards, Jeff Davis
Re: Clean up find_typedefs and add support for Mac
On Tue Dec 12, 2023 at 5:02 PM CST, Tom Lane wrote:
> "Tristan Partin" writes:
> > The big patch here is adding support for Mac. objdump -W doesn't work on
> > Mac. So, I used dsymutil and dwarfdump to achieve the same result.
>
> We should probably nuke the current version of src/tools/find_typedef
> altogether in favor of copying the current buildfarm code for that.
> We know that the buildfarm's version works, whereas I'm not real sure
> that src/tools/find_typedef is being used in anger by anyone. Also,
> we're long past the point where developers can avoid having Perl
> installed.
>
> Ideally, the buildfarm client would then start to use find_typedef
> from the tree rather than have its own copy, both to reduce
> duplication and to ensure that the in-tree copy keeps working.

That makes sense to me. Where can I find the buildfarm code to propose a
different patch, at least pulling in the current state of the buildfarm
script? Or perhaps Andrew is the best person for this job.

--
Tristan Partin
Neon (https://neon.tech)
Re: Eager page freeze criteria clarification
Great results.

On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> Values can be "removed" from the accumulator by simply decrementing its
> cardinality and decreasing the sum and sum squared by a value that will
> not change the mean and standard deviation of the overall distribution.
> To adapt to a table's changing access patterns, we'll need to remove
> values from this accumulator over time, but this patch doesn't yet
> decide when to do this. A simple solution may be to cap the cardinality
> of the accumulator to the greater of 1% of the table size, or some fixed
> number of values (perhaps 200?). Even without such removal of values,
> the distribution recorded in the accumulator will eventually skew toward
> more recent data, albeit at a slower rate.

I think we're going to need something here. Otherwise, after 6 months of
use, changing a table's perceived access pattern will be quite difficult. I
think one challenge here is to find something that doesn't decay too often
and end up with cases where it basically removes all the data. As long as
you avoid that, I suspect that the algorithm might not be terribly sensitive
to other kinds of changes. If you decay after 200 values or 2000 or 20,000,
it will only affect how fast we can change our notion of the access pattern,
and my guess would be that any of those values would produce broadly
acceptable results, with some differences in the details. If you decay after
200,000,000 values or not at all, then I think there will be problems.

> This algorithm is able to predict when pages will be modified before
> target_freeze_threshold as long as their modification pattern fits a
> normal distribution -- this accommodates access patterns where, after
> some period of time, pages become less likely to be modified the older
> they are. As an example, however, access patterns where modification
> times are bimodal aren't handled well by this model (imagine that half
> your modifications are to very new data and the other half are to data
> that is much older but still younger than target_freeze_duration). If
> this is a common access pattern, the normal distribution could easily be
> swapped out for another distribution. The current accumulator is capable
> of tracking a distribution's first two moments of central tendency (the
> mean and standard deviation). We could track more if we wanted to use a
> fancier distribution.

I think it might be fine to ignore this problem. What we're doing right now
is way worse.

> We also must consider what to do when we have few unsets, e.g. with an
> insert-only workload. When there are very few unsets (I chose 30 which
> the internet says is the approximate minimum number required for the
> central limit theorem to hold), we can choose to always freeze freezable
> pages; above this limit, the calculated page threshold is used. However,
> we may not want to make predictions based on 31 values either. To avoid
> this "cliff", we could modify the algorithm a bit to skew the mean and
> standard deviation of the distribution using a confidence interval based
> on the number of values we've collected.

I think some kind of smooth transition might be a good idea, but I also
don't know how much it matters. I think the important thing is that we
freeze aggressively unless there's a clear signal telling us not to do that.
Absence of evidence should cause us to tend toward aggressive freezing.
Otherwise, we get insert-only tables wrong.

> The goal is to keep pages frozen for at least target_freeze_duration.
> target_freeze_duration is in seconds and pages only have a last
> modification LSN, so target_freeze_duration must be converted to LSNs.
> To accomplish this, I've added an LSNTimeline data structure, containing
> XLogRecPtr, TimestampTz pairs stored with decreasing precision as they
> age. When we need to translate the guc value to LSNs, we linearly
> interpolate it on this timeline. For the time being, the global
> LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There
> is no reason it can't be updated with some other cadence and/or by some
> other process (nothing about it is inherently tied to vacuum). The
> cached translated value of target_freeze_duration is stored in each
> table's stats. This is arbitrary as it is not a table-level stat.
> However, it needs to be located somewhere that is accessible on
> update/delete. We may want to recalculate it more often than once per
> table vacuum, especially in case of long-running vacuums.

This part sounds like it isn't quite baked yet. The idea of the data
structure seems fine, but updating it once per vacuum sounds fairly
unprincipled to me? Don't we want the updates to happen on a somewhat
regular wall clock cadence?

> To benchmark this new heuristic (I'm calling it algo 6 since it is the
> 6th I've proposed on this thread), I've used a modified subset of my
> original workloads:
>
> [ everything worked perfectly ]

Hard to beat that.

--
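For concreteness, the accumulator discussed above can be pictured as a
plain count/sum/sum-of-squares structure like the sketch below. This is
illustrative only, not the patch's code, and all names are hypothetical.

/*
 * Sketch of an accumulator tracking the first two moments of page
 * modification ages.
 */
typedef struct AgeAccumulator
{
	double		n;			/* cardinality */
	double		sum;		/* sum of observed values */
	double		sum_sq;		/* sum of squared observed values */
} AgeAccumulator;

static void
accum_add(AgeAccumulator *acc, double value)
{
	acc->n += 1;
	acc->sum += value;
	acc->sum_sq += value * value;
}

static double
accum_mean(const AgeAccumulator *acc)
{
	return acc->n > 0 ? acc->sum / acc->n : 0.0;
}

static double
accum_variance(const AgeAccumulator *acc)
{
	double		mean = accum_mean(acc);

	return acc->n > 0 ? acc->sum_sq / acc->n - mean * mean : 0.0;
}

/*
 * "Remove" one value without changing the mean or standard deviation:
 * drop one synthetic observation whose value is the current mean, and
 * take E[x^2] = variance + mean^2 out of the sum of squares.
 */
static void
accum_forget_one(AgeAccumulator *acc)
{
	double		mean = accum_mean(acc);
	double		var = accum_variance(acc);

	if (acc->n <= 1)
	{
		acc->n = acc->sum = acc->sum_sq = 0;
		return;
	}
	acc->n -= 1;
	acc->sum -= mean;
	acc->sum_sq -= var + mean * mean;
}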
Re: Clean up find_typedefs and add support for Mac
"Tristan Partin" writes: > That makes sense to me. Where can I find the buildfarm code to propose > a different patch, at least pulling in the current state of the > buildfarm script? Or perhaps Andrew is the best person for this job. I think this is the authoritative repo: https://github.com/PGBuildFarm/client-code.git regards, tom lane
Re: Clean up find_typedefs and add support for Mac
On Wed Dec 13, 2023 at 11:27 AM CST, Tom Lane wrote: "Tristan Partin" writes: > That makes sense to me. Where can I find the buildfarm code to propose > a different patch, at least pulling in the current state of the > buildfarm script? Or perhaps Andrew is the best person for this job. I think this is the authoritative repo: https://github.com/PGBuildFarm/client-code.git Cool. I'll reach out to Andrew off list to work with him. Maybe I'll gain a little bit more knowledge of how the buildfarm works :). -- Tristan Partin Neon (https://neon.tech)
Re: Teach predtest about IS [NOT] proofs
James Coleman writes: > Attached is a patch that solves that issue. It also teaches predtest about > quite a few more cases involving BooleanTest expressions (e.g., how they > relate to NullTest expressions). One thing I could imagine being an > objection is that not all of these warrant cycles in planning. If that > turns out to be the case there's not a particularly clear line in my mind > about where to draw that line. I don't have an objection in principle to adding more smarts to predtest.c. However, we should be wary of slowing down cases where no BooleanTests are present to be optimized. I wonder if it could help to use a switch on nodeTag rather than a series of if(IsA()) tests. (I'd be inclined to rewrite the inner if-then-else chains as switches too, really. You get some benefit from the compiler noticing whether you've covered all the enum values.) I note you've actively broken the function's ability to cope with NULL input pointers. Maybe we don't need it to, but I'm not going to accept a patch that just side-swipes that case without any justification. Another way in which the patch needs more effort is that you've not bothered to update the large comment block atop the function. Perhaps, rather than hoping people will notice comments that are potentially offscreen from what they're modifying, we should relocate those comment paras to be adjacent to the relevant parts of the function? I've not gone through the patch in detail to see whether I believe the proposed proof rules. It would help to have more comments justifying them. > As noted in a TODO in the patch itself, I think it may be worth refactoring > the test_predtest module to run the "x, y" case as well as the "y, x" case > with a single call so as to eliminate a lot of repetition in > clause/expression test cases. If reviewers agree that's desirable, then I > could do that as a precursor. I think that's actively undesirable. It is not typically the case that a proof rule for A => B also works in the other direction, so this would encourage wasting cycles in the tests. I fear it might also cause confusion about which direction a proof rule is supposed to work in. regards, tom lane
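The switch-on-nodeTag structure suggested above would look roughly like
this; a sketch only, not the actual predtest.c code:

	switch (nodeTag(clause))
	{
		case T_NullTest:
			/* apply NullTest-specific proof rules */
			break;
		case T_BooleanTest:
			/* apply BooleanTest-specific proof rules */
			break;
		default:
			/* other node types: fall back to the existing logic */
			break;
	}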
LargeObjectRelationId vs LargeObjectMetadataRelationId, redux
By chance I discovered that checks on large object ownership are broken in
v16+. For example,

regression=# create user alice;
CREATE ROLE
regression=# \c - alice
You are now connected to database "regression" as user "alice".
regression=> \lo_import test
lo_import 40378
regression=> comment on large object 40378 is 'test';
ERROR:  unrecognized class ID: 2613

This has been failing since commit afbfc0298, which refactored ownership
checks, replacing pg_largeobject_ownercheck and allied routines with
object_ownercheck. That function lacks the little dance that's been stuck
into assorted crannies:

    if (classId == LargeObjectRelationId)
        classId = LargeObjectMetadataRelationId;

which translates from the object-address representation with classId
LargeObjectRelationId to the catalog we actually need to look at.

The proximate cause of the failure is in get_object_property_data, so I
first thought of making that function do this transposition. That might be
a good thing to do, but it wouldn't be enough to fix the problem, because
we'd then reach this in object_ownercheck:

    rel = table_open(classid, AccessShareLock);

which is going to examine the wrong catalog. So AFAICS what we have to do
is put this substitution into object_ownercheck, adding to the four or five
places that know about it already.

This is an absolutely horrid mess, of course. The big problem is that at
this point I have exactly zero confidence that there are not other places
with the same bug; and it's not apparent how to find them.

There seems little choice but to make the hacky fix in v16, but I wonder
whether we shouldn't be more ambitious and try to fix this permanently in
HEAD, by getting rid of the discrepancy in which OID to use. ISTM the
correct fix is to change the ObjectAddress representation of large objects
to use classid LargeObjectMetadataRelationId. Somebody seems to have felt
that that would create more problems than it solves, but I have to
disagree. If we stick with the current way, we are going to be hitting
problems of this ilk forevermore.

Thoughts?

			regards, tom lane
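Concretely, the hacky back-branch fix described above amounts to adding the
same translation just before the catalog is opened in object_ownercheck(),
roughly (sketch assembled from the fragments quoted in the message):

	/* Translate the object-address classId into the catalog to open. */
	if (classid == LargeObjectRelationId)
		classid = LargeObjectMetadataRelationId;

	rel = table_open(classid, AccessShareLock);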
Re: Windows default locale vs initdb
Here is a thought that occurs to me, as I follow along with Jeff Davis's evolving proposals for built-in collations and ctypes: What would stop us from dropping support for the libc (sic) provider on Windows? That may sound radical and likely to cause extra work for people on upgrade, but how does that compare to the pain of keeping this barely maintained code in the tree? Suppose the idea in this thread goes ahead and we get people to transition to the modern locale names: there is non-zero transitional/upgrade pain there too. How delicious it would be to just nuke the whole thing from orbit, and keep only cross-platform code that is maintained with enthusiasm by active hackers. That's probably a little extreme, but it's the direction my thoughts start to go in when confronting the realisation that it's up to us [Unix hackers making drive-by changes], no one is coming to help us [from the Windows user community]. I've even heard others talk about dropping Windows completely, due to the maintenance imbalance. This would be somewhat more fine grained. (One could use a similar argument to drop non-NTFS filesystems and turn on POSIX-mode file links, to end that other locus of struggle.)
Re: Clean up find_typedefs and add support for Mac
On 2023-12-12 Tu 18:02, Tom Lane wrote:
> "Tristan Partin" writes:
> > The big patch here is adding support for Mac. objdump -W doesn't work on
> > Mac. So, I used dsymutil and dwarfdump to achieve the same result.
>
> We should probably nuke the current version of src/tools/find_typedef
> altogether in favor of copying the current buildfarm code for that.
> We know that the buildfarm's version works, whereas I'm not real sure
> that src/tools/find_typedef is being used in anger by anyone. Also,
> we're long past the point where developers can avoid having Perl
> installed.
>
> Ideally, the buildfarm client would then start to use find_typedef
> from the tree rather than have its own copy, both to reduce
> duplication and to ensure that the in-tree copy keeps working.

+1. I'd be more than happy to be rid of maintaining the code. We already
performed a rather more complicated operation along these lines with the
PostgreSQL::Test::AdjustUpgrade stuff.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
[BUG] autovacuum may skip tables when session_authorization/role is set on database
Hi,

A recent case in the field in which a database's session_authorization is
altered to a non-superuser, non-owner of tables via

    alter database .. set session_authorization ..

caused autovacuum to skip tables.

The issue was discovered on 13.10, and the logs show such messages:

    warning: skipping "table1" --- only table or database owner can vacuum it

In HEAD, I can repro, but the message is now a bit different due to [1]:

    WARNING: permission denied to vacuum "table1", skipping it

It seems to me we should force an autovacuum worker to set the session
userid to a superuser.

Attached is a repro and a patch which sets the session user to the
BOOTSTRAP superuser at the start of the autovac worker.

Thoughts?

Regards,

Sami
Amazon Web Services (AWS)

[1] https://postgr.es/m/20220726.104712.912995710251150228.horikyota@gmail.com

0001-v1-Force-autovacuum-to-use-bootstrap-superuser.patch
Description: 0001-v1-Force-autovacuum-to-use-bootstrap-superuser.patch

## make autovac trigger often for the test psql<
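The proposed change amounts to pinning the worker's session user alongside
the GUC overrides autovacuum already applies. A rough sketch follows; the
exact placement in the worker startup code is an assumption, not taken from
the attached patch.

	/*
	 * In the autovacuum worker's startup, alongside the existing overrides
	 * of user-settable GUCs...
	 */
	SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
	SetConfigOption("lock_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);

	/*
	 * ...pin the session user to the bootstrap superuser, so that a
	 * per-database session_authorization setting cannot demote it.
	 */
	SetSessionAuthorization(BOOTSTRAP_SUPERUSERID, true);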
Re: Add --check option to pgindent
On 2023-12-12 Tu 10:30, Alvaro Herrera wrote: On 2023-Dec-12, Tom Lane wrote: "Euler Taveira" writes: When you add exceptions, it starts to complicate the UI. Indeed. It seems like --silent-diff was poorly defined and poorly named, and we need to rethink that option along the way to adding this behavior. The idea that --show-diff and --silent-diff can be used together is just inherently confusing, because they sound like opposites Maybe it's enough to rename --silent-diff to --check. You can do "--show-diff --check" and get both the error and the diff printed; or just "--check" and it'll throw an error without further ado; or "--show-diff" and it will both apply the diff and print it. That seems reasonable. These features were fairly substantially debated when we put them in, but I'm fine with some tweaking. But note: --show-diff doesn't apply the diff, it's intentionally non-destructive. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Clean up find_typedefs and add support for Mac
On Wed Dec 13, 2023 at 2:35 PM CST, Andrew Dunstan wrote: On 2023-12-12 Tu 18:02, Tom Lane wrote: > "Tristan Partin" writes: >> The big patch here is adding support for Mac. objdump -W doesn't work on >> Mac. So, I used dsymutil and dwarfdump to achieve the same result. > We should probably nuke the current version of src/tools/find_typedef > altogether in favor of copying the current buildfarm code for that. > We know that the buildfarm's version works, whereas I'm not real sure > that src/tools/find_typedef is being used in anger by anyone. Also, > we're long past the point where developers can avoid having Perl > installed. > > Ideally, the buildfarm client would then start to use find_typedef > from the tree rather than have its own copy, both to reduce > duplication and to ensure that the in-tree copy keeps working. > > +1. I'd be more than happy to be rid of maintaining the code. We already performed a rather more complicated operation along these lines with the PostgreSQL::Test::AdjustUpgrade stuff. I'll work with you on GitHub to help make the transition. I've already made a draft PR in the client-code repo, but I am sure I am missing some stuff. -- Tristan Partin Neon (https://neon.tech)
Re: [BUG] autovacuum may skip tables when session_authorization/role is set on database
"Imseih (AWS), Sami" writes: > A recent case in the field in which a database session_authorization is > altered to a non-superuser, non-owner of tables via alter database .. set > session_authorization .. > caused autovacuum to skip tables. That seems like an extremely not-bright idea. What is the actual use case for such a setting? Doesn't it risk security problems? > Attached is a repro and a patch which sets the session user to the BOOTSTRAP > superuser > at the start of the autovac worker. I'm rather unimpressed by this proposal, first because there are probably ten other ways to break autovac with ill-considered settings, and second because if we do want to consider this a supported case, what about other background processes? They'd likely have issues as well. regards, tom lane
Re: Remove MSVC scripts from the tree
On 2023-12-13 We 09:23, Michael Paquier wrote: On Tue, Dec 05, 2023 at 07:29:59AM +0900, Michael Paquier wrote: Okay. Thanks for the update. While in Prague, Andres and Peter E. mentioned to me that we had perhaps better move on with this patch sooner rather than later, without waiting for the two buildfarm members to do the switch, because much more cleanup is required for the documentation once the scripts are removed. So, any objections to the patch as presented, which removes the scripts while moving the existing doc blocks from install-windows.sgml that still need more discussion? TBH I'd prefer to wait. But I have had a couple more urgent things on my plate. I hope to get back to it before New Year. In the meantime I have switched bowerbird to building only STABLE branches. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Obscure lwlock assertion failure if write fails in initdb
Hi, In all releases, if bootstrap mode's checkpoint gets an error (ENOSPC, EDQUOT, EIO, ...) or a short write in md.c, ERROR is promoted to FATAL and the shmem_exit resowner machinery reaches this: running bootstrap script ... 2023-12-14 10:38:02.320 NZDT [1409162] FATAL: could not write block 42 in file "base/1/1255": wrote only 4096 of 8192 bytes 2023-12-14 10:38:02.320 NZDT [1409162] HINT: Check free disk space. 2023-12-14 10:38:02.320 NZDT [1409162] CONTEXT: writing block 42 of relation base/1/1255 TRAP: failed Assert("!LWLockHeldByMe(BufferDescriptorGetContentLock(buf))"), File: "bufmgr.c", Line: 2409, PID: 1409162 It's really hard to hit because we'd normally expect smgrextend() to get the error first, and when it does it looks something like this: running bootstrap script ... 2023-12-14 10:22:41.940 NZDT [1378512] FATAL: could not extend file "base/1/1255": wrote only 4096 of 8192 bytes at block 42 2023-12-14 10:22:41.940 NZDT [1378512] HINT: Check free disk space. 2023-12-14 10:22:41.940 NZDT [1378512] PANIC: cannot abort transaction 1, it was already committed Aborted (core dumped) A COW system might succeed in smgrextend() and then fail in smgrwrite(), and any system might fail here with other errno. It's an extremely well hidden edge case and doesn't matter to users: initdb failed for lack of space or worse, the message is clear and the rest is meaningless detail of interest to developers with assertion builds. I only happened to notice because I've been testing short write and error scenarios via artificially rigged up means for my vectored I/O work. No patch, I just wanted to flag this obscure pre-existing problem spotted in passing.
Re: [BUG] autovacuum may skip tables when session_authorization/role is set on database
> What is the actual > use case for such a setting? I don't have exact details on the use case, but this is not a common use case. > Doesn't it risk security problems? I cannot see how setting it on the database is any more problematic than setting it at the session level. > I'm rather unimpressed by this proposal, first because there are > probably ten other ways to break autovac with ill-considered settings, There is code in autovac that safeguards against such settings. For example, statement_timeout and lock_timeout are disabled. There are a dozen or more other settings that are overridden for autovac. I see this as just another one, to ensure that autovacuum always runs as superuser. > and second because if we do want to consider this a supported case, > what about other background processes? They'd likely have issues > as well. I have not considered other background processes, but autovac is the only one I can think of that checks relation permissions. Regards, Sami
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 12, 2023 at 11:53 AM John Naylor wrote: > > On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada wrote: > > > I've attached the updated patch set. From the previous patch set, I've > > merged patches 0007 to 0010. The other changes such as adding RT_GET() > > still are unmerged for now, for discussion. Probably we can make them > > as follow-up patches as we discussed. 0011 to 0015 patches are new > > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and > > support variable-length values. > > This looks like the right direction, and I'm pleased it's not much > additional code on top of my last patch. > > v44-0014: > > +#ifdef RT_VARLEN_VALUE > + /* XXX: need to choose block sizes? */ > + tree->leaf_ctx = AllocSetContextCreate(ctx, > +"radix tree leaves", > +ALLOCSET_DEFAULT_SIZES); > +#else > + tree->leaf_ctx = SlabContextCreate(ctx, > +"radix tree leaves", > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > +sizeof(RT_VALUE_TYPE)); > +#endif /* RT_VARLEN_VALUE */ > > Choosing block size: Similar to what we've discussed previously around > DSA segments, we might model this on CreateWorkExprContext() in > src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m / > autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the > max block size to 1/16 of that, or less. > > Also, it occurred to me that compile-time embeddable values don't need > a leaf context. I'm not sure how many places assume that there is > always a leaf context. If not many, it may be worth not creating one > here, just to be tidy. > > + size_t copysize; > > - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE)); > + copysize = sizeof(RT_VALUE_TYPE); > +#endif > + > + memcpy(leaf.local, value_p, copysize); > > I'm not sure this indirection adds clarity. I guess the intent was to > keep from saying "memcpy" twice, but now the code has to say "copysize > = foo" twice. > > For varlen case, we need to watch out for slowness because of memcpy. > Let's put that off for later testing, though. We may someday want to > avoid a memcpy call for the varlen case, so let's keep it flexible > here. > > v44-0015: > > +#define SizeOfBlocktableEntry (offsetof( > > Unused. > > + char buf[MaxBlocktableEntrySize] = {0}; > > Zeroing this buffer is probably going to be expensive. Also see this > pre-existing comment: > /* WIP: slow, since it writes to memory for every bit */ > page->words[wordnum] |= ((bitmapword) 1 << bitnum); > > For this function (which will be vacuum-only, so we can assume > ordering), in the loop we can: > * declare the local bitmapword variable to be zero > * set the bits on it > * write it out to the right location when done. > > Let's fix both of these at once. > > + if (TidStoreIsShared(ts)) > + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len); > + else > + local_rt_set(ts->tree.local, blkno, (void *) page, page_len); > > Is there a reason for "void *"? The declared parameter is > "RT_VALUE_TYPE *value_p" in 0014. > Also, since this function is for vacuum (and other uses will need a > new function), let's assert the returned bool is false. > > Does iteration still work? If so, it's not too early to re-wire this > up with vacuum and see how it behaves. 
> > Lastly, my compiler has a warning that CI doesn't have: > > In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121: > ../src/include/lib/radixtree.h: In function ‘rt_find.isra’: > ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used > uninitialized [-Wmaybe-uninitialized] > 2142 | return (RT_VALUE_TYPE*) slot; > |^ > ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here > 2112 | RT_PTR_ALLOC *slot; > | ^~~~ Thank you for the comments! I agree with all of them and have incorporated them into the attached latest patch set, v45. In v45, 0001 - 0006 are from earlier versions but I've merged previous updates. So the radix tree now has RT_SET() and RT_FIND() but not RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous versions that incorporated the above comments. The 0009 patch integrates tidstore with lazy vacuum. Note that the DSA segment problem is not resolved yet in this patch. 0010 and 0011 make the DSA initial/max segment size configurable and make parallel vacuum specify both in proportion to maintenance_work_mem. 0012 is a development-purpose patch to make it easy to investigate bugs in tidstore. I'd like to keep it in the patch set at least during development. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v45-ART.tar.gz Description: GNU Zip compressed data
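For illustration, here is a minimal sketch of the bit-accumulation idea suggested above (set all bits of one bitmapword in a local variable, then store it once). Names such as BlocktableEntry, set_page_offsets, WORDNUM and BITNUM are placeholders and not necessarily what the tidstore patch uses; offsets[] is assumed to be sorted, which holds for vacuum:

#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)

static void
set_page_offsets(BlocktableEntry *page, const OffsetNumber *offsets,
				 int num_offsets)
{
	int			idx = 0;
	int			maxword = WORDNUM(offsets[num_offsets - 1]);

	for (int wordnum = 0; wordnum <= maxword; wordnum++)
	{
		bitmapword	word = 0;

		/* accumulate every bit belonging to this word in a local variable */
		while (idx < num_offsets && WORDNUM(offsets[idx]) == wordnum)
		{
			word |= ((bitmapword) 1) << BITNUM(offsets[idx]);
			idx++;
		}

		/* a single memory write per word; no zero-initialized buffer needed */
		page->words[wordnum] = word;
	}
}

Written this way, the zeroed local buffer and the per-bit writes to memory both disappear.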
Re: Teach predtest about IS [NOT] proofs
Thanks for taking a look! On Wed, Dec 13, 2023 at 1:36 PM Tom Lane wrote: > > James Coleman writes: > > Attached is a patch that solves that issue. It also teaches predtest about > > quite a few more cases involving BooleanTest expressions (e.g., how they > > relate to NullTest expressions). One thing I could imagine being an > > objection is that not all of these warrant cycles in planning. If that > > turns out to be the case there's not a particularly clear line in my mind > > about where to draw that line. > > I don't have an objection in principle to adding more smarts to > predtest.c. However, we should be wary of slowing down cases where > no BooleanTests are present to be optimized. I wonder if it could > help to use a switch on nodeTag rather than a series of if(IsA()) > tests. (I'd be inclined to rewrite the inner if-then-else chains > as switches too, really. You get some benefit from the compiler > noticing whether you've covered all the enum values.) I think I could take this on; would you prefer it as a patch in this series? Or as a new patch thread? > I note you've actively broken the function's ability to cope with > NULL input pointers. Maybe we don't need it to, but I'm not going > to accept a patch that just side-swipes that case without any > justification. I should have explained that. I don't think I've broken it: 1. predicate_implied_by_simple_clause() is only ever called by predicate_implied_by_recurse() 2. predicate_implied_by_recurse() starts with: pclass = predicate_classify(predicate, &pred_info); 3. predicate_classify(Node *clause, PredIterInfo info) starts off with: Assert(clause != NULL); I believe this means we are currently guaranteed by the caller to receive a non-NULL pointer, but I could be missing something. The same argument (just substituting the equivalent "refute" function names) applies to predicate_refuted_by_simple_clause(). > Another way in which the patch needs more effort is that you've > not bothered to update the large comment block atop the function. > Perhaps, rather than hoping people will notice comments that are > potentially offscreen from what they're modifying, we should relocate > those comment paras to be adjacent to the relevant parts of the > function? Splitting up that block comment makes sense to me. > I've not gone through the patch in detail to see whether I believe > the proposed proof rules. It would help to have more comments > justifying them. Most of them are sufficiently simple -- e.g., X IS TRUE implies X -- that I don't think there's a lot to say in justification. In some cases I've noted the cases that force only strong or weak implication. There are a few cases, though, (e.g., "X is unknown weakly implies X is not true") that, reading over this again, don't immediately strike me as obvious, so I'll expand on those. > > As noted in a TODO in the patch itself, I think it may be worth refactoring > > the test_predtest module to run the "x, y" case as well as the "y, x" case > > with a single call so as to eliminate a lot of repetition in > > clause/expression test cases. If reviewers agree that's desirable, then I > > could do that as a precursor. > > I think that's actively undesirable. It is not typically the case that > a proof rule for A => B also works in the other direction, so this would > encourage wasting cycles in the tests. I fear it might also cause > confusion about which direction a proof rule is supposed to work in. That makes sense in the general case. 
Boolean expressions seem like a special case in that regard: (subject to what it looks like) would you be OK with a wrapping function that does both directions (with output that shows which direction is being tested) used only for the cases where we do want to check both directions? Thanks, James Coleman
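As a rough illustration of the switch-based structure discussed above (a sketch only, not the actual predtest.c code): dispatching on nodeTag() avoids repeated IsA() tests, and an inner switch over BoolTestType with no default case lets the compiler warn when an enum value is left unhandled:

switch (nodeTag(predicate))
{
	case T_BooleanTest:
		{
			BooleanTest *btest = (BooleanTest *) predicate;

			switch (btest->booltesttype)
			{
				case IS_TRUE:
					/* e.g. X IS TRUE implies X */
					break;
				case IS_NOT_FALSE:
					break;
				case IS_FALSE:
				case IS_NOT_TRUE:
				case IS_UNKNOWN:
				case IS_NOT_UNKNOWN:
					break;
			}
			break;
		}
	case T_NullTest:
		/* NullTest handling ... */
		break;
	default:
		/* other node types handled as before */
		break;
}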
Simplify newNode()
The newNode() macro can be turned into a static inline function, which makes it a lot simpler. See attached. This was not possible when the macro was originally written, as we didn't require compiler to have static inline support, but nowadays we do. This was last discussed in 2008, see discussion at https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. In those tests, Tom observed that gcc refused to inline the static inline function. That was weird, the function is very small and doesn't do anything special. Whatever the problem was, I think we can dismiss it with modern compilers. It does get inlined on gcc 12 and clang 14 that I have installed. -- Heikki Linnakangas Neon (https://neon.tech)From 6ad4c4cf49ef5b3f7ed22acc258a868f1a13f6f4 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 14 Dec 2023 01:08:16 +0100 Subject: [PATCH 1/1] Convert newNode() into a static inline function Because it's simpler. We didn't require static inline support from compiler when this was originally written, but now we do. One complication is that the static inline function doesn't work in the frontend, because it calls palloc0fast() which isn't included frontend programs. That's OK, the old macro also didn't work in frontend programs if you actually tried to call it, but we now need to explicitly put it in an #ifndef FRONTEND block to keep the compiler happy. --- src/backend/nodes/Makefile| 1 - src/backend/nodes/meson.build | 1 - src/backend/nodes/nodes.c | 31 - src/include/nodes/nodes.h | 43 +++ 4 files changed, 13 insertions(+), 63 deletions(-) delete mode 100644 src/backend/nodes/nodes.c diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile index ebbe9052cb7..66bbad8e6e0 100644 --- a/src/backend/nodes/Makefile +++ b/src/backend/nodes/Makefile @@ -23,7 +23,6 @@ OBJS = \ makefuncs.o \ multibitmapset.o \ nodeFuncs.o \ - nodes.o \ outfuncs.o \ params.o \ print.o \ diff --git a/src/backend/nodes/meson.build b/src/backend/nodes/meson.build index 31467a12d3b..1efbf2c11ca 100644 --- a/src/backend/nodes/meson.build +++ b/src/backend/nodes/meson.build @@ -7,7 +7,6 @@ backend_sources += files( 'makefuncs.c', 'multibitmapset.c', 'nodeFuncs.c', - 'nodes.c', 'params.c', 'print.c', 'read.c', diff --git a/src/backend/nodes/nodes.c b/src/backend/nodes/nodes.c deleted file mode 100644 index 1913a4bdf7d..000 --- a/src/backend/nodes/nodes.c +++ /dev/null @@ -1,31 +0,0 @@ -/*- - * - * nodes.c - * support code for nodes (now that we have removed the home-brew - * inheritance system, our support code for nodes is much simpler) - * - * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group - * Portions Copyright (c) 1994, Regents of the University of California - * - * - * IDENTIFICATION - * src/backend/nodes/nodes.c - * - * HISTORY - * Andrew Yu Oct 20, 1994 file creation - * - *- - */ -#include "postgres.h" - -#include "nodes/nodes.h" - -/* - * Support for newNode() macro - * - * In a GCC build there is no need for the global variable newNodeMacroHolder. - * However, we create it anyway, to support the case of a non-GCC-built - * loadable module being loaded into a GCC-built backend. 
- */ - -Node *newNodeMacroHolder; diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 4c32682e4ce..dbb8eed122c 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -132,6 +132,8 @@ typedef struct Node #define nodeTag(nodeptr) (((const Node*)(nodeptr))->type) +#ifndef FRONTEND + /* * newNode - * create a new node of the specified size and tag the node with the @@ -139,43 +141,24 @@ typedef struct Node * * !WARNING!: Avoid using newNode directly. You should be using the * macro makeNode. eg. to create a Query node, use makeNode(Query) - * - * Note: the size argument should always be a compile-time constant, so the - * apparent risk of multiple evaluation doesn't matter in practice. - */ -#ifdef __GNUC__ - -/* With GCC, we can use a compound statement within an expression */ -#define newNode(size, tag) \ -({ Node *_result; \ - AssertMacro((size) >= sizeof(Node)); /* need the tag, at least */ \ - _result = (Node *) palloc0fast(size); \ - _result->type = (tag); \ - _result; \ -}) -#else - -/* - * There is no way to dereference the palloc'ed pointer to assign the - * tag, and also return the pointer itself, so we need a holder variable. - * Fortunately, this macro isn't recursive so we just define - * a global variable for this purpose. */ -extern PGDLLIMPORT Node *newNodeMacroHolder; +static inline Node * +newNode(size_t size, NodeTag tag) +{ + Node *result; -#define newNode(size, tag) \ -( \ - AssertMacro((size) >= sizeof(Node)), /* need the tag, at le
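For readers of the patch, the new inline function boils down to roughly the following (a sketch assembled from the patch fragments above; the #ifndef FRONTEND guard and surrounding comments are omitted):

static inline Node *
newNode(size_t size, NodeTag tag)
{
	Node	   *result;

	Assert(size >= sizeof(Node));	/* need the tag, at least */
	result = (Node *) palloc0fast(size);
	result->type = tag;

	return result;
}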
useless LIMIT_OPTION_DEFAULT
Hi, all Reading the code, I found that since commit 357889eb17bb9c9336c4f324ceb1651da616fe57 we process the limit option as LIMIT_OPTION_WITH_TIES when WITH TIES is used, and as LIMIT_OPTION_COUNT in all other cases, and limit_needed() checks the actual limit node. There is no need to maintain a useless default limit enum. I removed it and ran installcheck to verify. Are there any considerations behind keeping it? Shall we remove it for clarity, as it's not actually the default option? Zhang Mingli www.hashdata.xyz v1-0001-Remove-useless-LIMIT_OPTION_DEFAULT.patch Description: Binary data
Re: Synchronizing slots from primary to standby
Hi, here are a few more review comments for the patch v47-0002 (plus my review comments of v45-0002 [1] are yet to be addressed) == 1. General For consistency and readability, try to use variables of the same names whenever they have the same purpose, even when they are declared in different functions. A few like this were already mentioned in the previous review but there are more I keep noticing. For example, 'slotnameChanged' in one function, VERSUS 'primary_slot_changed' in the caller. == src/backend/replication/logical/slotsync.c 2. +/* + * + * Validates the primary server info. + * + * Using the specified primary server connection, it verifies whether the master + * is a standby itself and returns true in that case to convey the caller that + * we are on the cascading standby. + * But if master is the primary server, it goes ahead and validates + * primary_slot_name. It emits error if the physical slot in primary_slot_name + * does not exist on the primary server. + */ +static bool +validate_primary_info(WalReceiverConn *wrconn) 2a. Extra blank line at the top of that comment? ~ 2b. IMO it is too tricky to have a function called "validate_xxx" when you have actually given the return value some special, unintuitive meaning other than just validation. IMO it is always better for the returned value to properly match the function name so the expectations are very obvious. So, in this case, I think a better function signature would be like this: SUGGESTION static void validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) or static void validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) ~~~ 3. + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch recovery and primary_slot_name \"%s\" info from the " + "primary: %s", PrimarySlotName, res->err))); I'm not sure that including "recovery and" in the error message is meaningful to the user, is it? ~~~ 4. slotsync_reread_config +/* + * Re-read the config file. + * + * If any of the slot sync GUCs have changed, re-validate them. The + * worker will exit if the check fails. + * + * Returns TRUE if primary_slot_name is changed, let the caller re-verify it. + */ +static bool +slotsync_reread_config(WalReceiverConn *wrconn) Hm. This is another function where the return value has been butchered to have a special meaning unrelated to the function name. IMO it makes the code unnecessarily confusing. IMO a better function signature here would be: static void slotsync_reread_config(WalReceiverConn *wrconn, bool *primary_slot_name_changed) ~~~ 5. ProcessSlotSyncInterrupts +/* + * Interrupt handler for main loop of slot sync worker. + */ +static bool +ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool check_cascading_standby) +{ There is no function comment describing the meaning of the return value. But actually, IMO this is an example of how conflating the meanings of validation VERSUS are_we_cascading_standby in the lower-down function has propagated up to become a big muddle. The code + if (primary_slot_changed || check_cascading_standby) + return validate_primary_info(wrconn); seems unnecessarily hard to understand because false doesn't mean invalid and true doesn't mean valid. Please consider changing this signature also, so the functions return what you would intuitively expect them to return, without surprisingly different meanings. SUGGESTION static void ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool check_cascading_standby, bool *am_cascading_standby) ~~~ 6.
ReplSlotSyncWorkerMain + int rc; + long naptime = WORKER_DEFAULT_NAPTIME_MS; + TimestampTz now; + bool slot_updated; + + /* + * The transaction env is needed by walrcv_exec() in both the slot + * sync and primary info validation flow. + */ + StartTransactionCommand(); + + if (!am_cascading_standby) + { + slot_updated = synchronize_slots(wrconn); + + /* + * If any of the slots get updated in this sync-cycle, use default + * naptime and update 'last_update_time'. But if no activity is + * observed in this sync-cycle, then increase naptime provided + * inactivity time reaches threshold. + */ + now = GetCurrentTimestamp(); + if (slot_updated) + last_update_time = now; + else if (TimestampDifferenceExceeds(last_update_time, + now, WORKER_INACTIVITY_THRESHOLD_MS)) + naptime = WORKER_INACTIVITY_NAPTIME_MS; + } + else + naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */ + + rc = WaitLatch(MyLatch, +WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, +naptime, +WAIT_EVENT_REPL_SLOTSYNC_MAIN); + + if (rc & WL_LATCH_SET) + ResetLatch(MyLatch); + + am_cascading_standby = + ProcessSlotSyncInterrupts(wrconn, am_cascading_standby); + + CommitTransactionCommand(); IMO it is more natural to avoid negative conditions, so just reverse these. Also, some comment is needed to explain why the longer naptime is needed in this special case. SUGGESTION if (am_cascading_standby) { /* comment the reason */
Re: Simplify newNode()
Hi, LGTM. + Assert(size >= sizeof(Node)); /* need the tag, at least */ + result = (Node *) palloc0fast(size); + result->type = tag; + return result; +} How about moving the comment /* need the tag, at least */ after result->type = tag; by the way? Zhang Mingli www.hashdata.xyz
Re: Simplify newNode()
On Thu, Dec 14, 2023 at 9:34 AM Zhang Mingli wrote: > > Hi, > > LGTM. > > + Assert(size >= sizeof(Node)); /* need the tag, at least */ > + result = (Node *) palloc0fast(size); > + result->type = tag; > > + return result; > +} > > How about moving the comment /* need the tag, at least */ after result->type > = tag; by the way? I don't think so; the comment means that the requested size should be at least the size of Node, which contains just a NodeTag. typedef struct Node { NodeTag type; } Node; > > > > Zhang Mingli > www.hashdata.xyz -- Regards Junwang Zhao
Re: Remove MSVC scripts from the tree
On Mon, 4 Dec 2023 17:05:24 +0900 Michael Paquier wrote: > On Tue, Sep 26, 2023 at 12:17:04PM -0400, Andrew Dunstan wrote: > > On 2023-09-26 Tu 01:25, NINGWEI CHEN wrote: > >> hamerkop is not yet prepared for Meson builds, but we plan to work on this > >> support soon. > >> If we go with Meson builds exclusively right now, we will have to > >> temporarily remove the master/HEAD for a while. > > > > I don't think we should switch to that until you're ready. > > Agreed that it would just be breaking a build for the sake of breaking > it. Saying that, the last exchange that we had about hamerkop > switching to meson was two months ago. Are there any plans to do the > switch? > -- > Michael Sorry for the delayed response. We are currently working on transitioning hamerkop to the meson build and anticipate that this can be accomplished no later than January. If the old build scripts are removed before that, hamerkop will be temporarily taken off the master branch, and will rejoin once the adjustment is done. Best Regards. -- SRA OSS LLC Chen Ningwei
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar wrote: > On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar wrote: > > > > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar > wrote: > > Here is the updated patch based on some comments by tender wang (those > comments were sent only to me) A few nitpicks: + + /* +* Mask for slotno banks, considering 1GB SLRU buffer pool size and the +* SLRU_BANK_SIZE bits16 should be sufficient for the bank mask. +*/ + bits16 bank_mask; } SlruCtlData; ... ... + int bankno = pageno & ctl->bank_mask; I am a bit uncomfortable seeing it as a mask. Why can't it simply be a number of banks (num_banks), with the bank number obtained through a modulus op (pageno % num_banks) instead of the bitwise & operation (pageno & ctl->bank_mask)? The & form is a bit difficult to read compared to the modulus op, which is simple, straightforward, and much more common practice in hashing. Are there any advantages of using & over %? Also, a few places in the 0002 and 0003 patches need the bank number; it would be better to have a macro for that. --- extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, void *data); - +extern bool check_slru_buffers(const char *name, int *newval); #endif /* SLRU_H */ Add an empty line after the declaration, in the 0002 patch. --- -TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno) +TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, + int slotno) This is an unrelated change in the 0003 patch. --- Regards, Amul
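To make the comparison concrete, here is a small sketch of the two ways of deriving the bank number (num_banks and the helper names are hypothetical; the patch itself stores bank_mask):

/* modulus form suggested above; works for any bank count */
static inline int
SlruBankNumberMod(SlruCtl ctl, int64 pageno)
{
	return (int) (pageno % ctl->num_banks);
}

/*
 * mask form used in the patch; requires the bank count to be a power of
 * two, but avoids an integer division by a run-time value
 */
static inline int
SlruBankNumberMask(SlruCtl ctl, int64 pageno)
{
	return (int) (pageno & ctl->bank_mask);
}

Either way, wrapping the computation in one helper or macro would also address the point about needing the bank number in several places.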
Re: "pgoutput" options missing on documentation
On Wed, Dec 13, 2023 at 9:25 PM Emre Hasegeli wrote: > > I noticed that Logical Streaming Replication Protocol documentation > [1] is missing the options added to "pgoutput" since version 14. A > patch is attached to fix it together with another small one to give a > nice error when "proto_version" parameter is not provided. > I agree that we missed updating the parameters of the Logical Streaming Replication Protocol documentation. I haven't reviewed all the details yet but one minor thing that caught my attention while looking at your patch is that we can update the exact additional information we started to send for streaming mode parallel. We should update the following sentence: "It accepts an additional value "parallel" to enable sending extra information with the "Stream Abort" messages to be used for parallelisation." -- With Regards, Amit Kapila.
RE: logical decoding and replication of sequences, take 2
Dear hackers, > It is correct that we can make a wrong decision about whether a change > is transactional or non-transactional when sequence DDL happens before > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > after that state. I found a workload that the decoder classifies wrongly. # Prerequisite Apply the attached patch for inspecting the sequence status. It can be applied atop the v20231203 patch set. Also, a table and a sequence must be defined: ``` CREATE TABLE foo (var int); CREATE SEQUENCE s; ``` # Workload Then, you can execute concurrent transactions from three clients like below: Client-1 BEGIN; INSERT INTO foo VALUES (1); Client-2 SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); Client-3 BEGIN; ALTER SEQUENCE s MAXVALUE 5000; COMMIT; SAVEPOINT s1; SELECT setval('s', 2000); ROLLBACK; SELECT pg_logical_slot_get_changes('slot', 'test_decoding'); # Result and analysis At first, the lines below are output to the log. They mean that WAL records for ALTER SEQUENCE were decoded but skipped because the snapshot was still being built. ``` ... LOG: logical decoding found initial starting point at 0/154D238 DETAIL: Waiting for transactions (approximately 1) older than 741 to end. STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: skipped STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: skipped ... ``` Note that the `seq_decode...` line above was not output via `setval()`; it came from the ALTER SEQUENCE statement. Below is the call stack for inserting that WAL. ``` XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); fill_seq_fork_with_data fill_seq_with_data AlterSequence ``` Then, subsequent lines look like the following. This means that the snapshot has become FULL and `setval()` is wrongly regarded as non-transactional. ``` LOG: logical decoding found initial consistent point at 0/154D658 DETAIL: Waiting for transactions (approximately 1) older than 742 to end. STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: the sequence is non-transactional STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: not consistent: skipped ``` The change is then discarded by the code below, because the snapshot has not yet become CONSISTENT. If it had been treated as transactional, we would have queued this change, though the transaction would be skipped at commit. ``` else if (!transactional && (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || SnapBuildXactNeedsSkip(builder, buf->origptr))) return; ``` But anyway, we have found a case in which we can make a wrong decision. This example is lucky in that it does not produce wrong output, but I'm not sure all such cases are like that.
Best Regards, Hayato Kuroda FUJITSU LIMITED diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index d48d88081f..73e38cafd8 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -1397,12 +1397,17 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(r), buf->origptr); + elog(LOG, "XXX: seq_decode. snapshot is %s", SnapBuildIdentify(builder)); + /* * If we don't have snapshot or we are just fast-forwarding, there is no * point in decoding sequences. */ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) + { + elog(LOG, "XXX: skipped"); return; + } /* only interested in our database */ XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL); @@ -1437,14 +1442,22 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) target_locator, NULL); + elog(LOG, "XXX: the
Re: Synchronizing slots from primary to standby
A review comment for v47-0001 == src/backend/replication/slot.c 1. GetStandbySlotList +static void +WalSndRereadConfigAndReInitSlotList(List **standby_slots) +{ + char*pre_standby_slot_names; + + ProcessConfigFile(PGC_SIGHUP); + + /* + * If we are running on a standby, there is no need to reload + * standby_slot_names since we do not support syncing slots to cascading + * standbys. + */ + if (RecoveryInProgress()) + return; Should the RecoveryInProgress() check be first -- even before the ProcessConfigFile call? == Kind Regards, Peter Smith. Fujitsu Australia
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila wrote: > > > > But can this even happen? Can we start decoding in the middle of a > > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > > messages, where we also call the output plugin in non-transactional cases. > > > > It's not a problem for logical messages because whether the message is > > transaction or non-transactional is decided while WAL logs the message > > itself. But here our problem starts with deciding whether the change > > is transactional vs non-transactional, because if we insert the > > 'relfilenode' in hash then the subsequent sequence change in the same > > transaction would be considered transactional otherwise > > non-transactional. > > > > It is correct that we can make a wrong decision about whether a change > is transactional or non-transactional when sequence DDL happens before > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > after that state. However, one thing to note here is that we won't try > to stream such a change because for non-transactional cases we don't > proceed unless the snapshot is in a consistent state. Now, if the > decision had been correct then we would probably have queued the > sequence change and discarded at commit. > > One thing that we deviate here is that for non-sequence transactional > cases (including logical messages), we immediately start queuing the > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > SnapBuildProcessChange() returns true which is quite possible) and > take final decision at commit/prepare/abort time. However, that won't > be the case for sequences because of the dependency of determining > transactional cases on one of the prior records. Now, I am not > completely sure at this stage if such a deviation can cause any > problem and or whether we are okay to have such a deviation for > sequences. Okay, so this particular scenario that I raised is somehow saved. I mean, although we are considering a transactional sequence operation as non-transactional, we also know that if some of the changes for a transaction are skipped because the snapshot was not FULL, that transaction cannot be streamed, because it has to be committed before the snapshot becomes CONSISTENT (based on the snapshot state change machinery). Ideally, by the same logic that the snapshot is not consistent, the non-transactional sequence changes are also skipped. But the only thing that makes me a bit uncomfortable is that even though the result is not wrong, we have made some wrong intermediate decisions, i.e. considered a transactional change as non-transactional. One solution to this issue is that, even if the snapshot state does not reach FULL, we just add the sequence relids to the hash; that hash is only maintained for deciding whether the sequence is changed in that transaction or not. So not adding such relids to the hash seems like the root cause of the issue. Honestly, I haven't analyzed this idea in detail, i.e. how easy it would be to add only these changes to the hash and what the other dependencies are, but this seems like a worthwhile direction IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
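A minimal sketch of that direction (the helper name ReorderBufferAddRelFileLocator is a placeholder, not necessarily what the patch uses): in smgr_decode(), record the relfilelocator for transactional-detection purposes before the snapshot-state early return, so that a later seq_decode() in the same transaction can still classify the change as transactional:

	XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);

	/*
	 * Remember the relfilelocator even before SNAPBUILD_FULL_SNAPSHOT;
	 * the hash exists only to decide whether a sequence was changed in
	 * this transaction.
	 */
	ReorderBufferAddRelFileLocator(ctx->reorder, XLogRecGetXid(r),
								   target_locator);

	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
		return;		/* too early to decode, but the hash entry is made */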
Re: Synchronizing slots from primary to standby
On Wed, Dec 13, 2023 at 3:53 PM Peter Smith wrote: > > 12. > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > +{ > + ReplicationSlot *s; > + ReplicationSlot *slot; > + char sync_state = '\0'; > > In my previous review [1]#33a I thought it was strange to assign the > sync_state (which is essentially an enum) to some meaningless value, > so I suggested it should be set to SYNCSLOT_STATE_NONE in the > declaration. The reply [2] was "No, that will change the flow. It > should stay uninitialized if the slot is not found." > > But I am not convinced there is any flow problem. Also, > SYNCSLOT_STATE_NONE seems the naturally correct default for something > with no state. It cannot be found and be SYNCSLOT_STATE_NONE at the > same time (that is reported as an ERROR "skipping sync of slot") so I > see no problem. > > The CURRENT code is like this: > > /* Slot created by the slot sync worker exists, sync it */ > if (sync_state) > { > Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == > SYNCSLOT_STATE_INITIATED); > ... > } > /* Otherwise create the slot first. */ > else > { > ... > } > > AFAICT that could easily be changed to like below, with no change to > the logic, and it avoids setting strange values. > > SUGGESTION. > > if (sync_state == SYNCSLOT_STATE_NONE) > { > /* Slot not found. Create it. */ > .. > } > else > { > Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == > SYNCSLOT_STATE_INITIATED); > ... > } > I think instead of creating syncslot based on syncstate, it would be better to create it when we don't find it via SearchNamedReplicationSlot(). That will avoid the need to initialize the syncstate and I think it would make code in this area look better. > ~~~ > > 13. synchronize_one_slot > > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > > This *slot_updated parameter looks dubious. It is used in a loop from > the caller to mean that ANY slot was updated -- e.g. maybe it is true > or false on entry to this function. > > But, Instead of having some dependency between this function and the > caller, IMO it makes more sense if we would make this just a boolean > function in the first place (e.g. was updated? T/F) > > Then the caller can also be written more easily like: > > some_slot_updated |= synchronize_one_slot(wrconn, remote_slot); > +1. > > 23. slotsync_reread_config > > + old_dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); > + Assert(old_dbname); > > (This is same comment as old review [1]#61) > > Hmm. I still don't see why this extraction of the dbname cannot be > deferred until later when you know the PrimaryConnInfo has changed, > otherwise, it might be redundant to do this. Shveta replied [2] that > "Once PrimaryConnInfo is changed, we can not get old-dbname.", but I'm > not so sure. Isn't this walrcv_get_dbname_from_conninfo just doing a > string search -- Why can't you defer this until you know > conninfoChanged is true, and then to get the old_dbname, you can just > pass the old_primary_conninfo. E.g. call like > walrcv_get_dbname_from_conninfo(old_primary_conninfo); Maybe I am > mistaken. > I think we should just restart if any one of the information is changed with a message like: "slotsync worker will restart because of a parameter change". This would be similar to what we do apply worker in maybe_reread_subscription(). > > 28. > + /* > + * Is this a slot created by a sync-slot worker? 
> + * > + * Only relevant for logical slots on the physical standby. > + */ > + char sync_state; > + > > (probably I am repeating a previous thought here) > > The comment says the field is only relevant for standby, and that's > how I've been visualizing it, and why I had previously suggested even > renaming it to 'standby_sync_state'. However, replies are saying that > after failover these sync_states also have "some meaning for the > primary server". > > That's the part I have trouble understanding. IIUC the server states > are just either all 'n' (means nothing) or 'r' because they are just > leftover from the old standby state. So, does it *truly* have meaning > for the server? Or should those states somehow be removed/ignored on > the new primary? Anyway, the point is that if this field does have > meaning also on the primary (I doubt) then those details should be in > this comment. Otherwise "Only relevant ... on the standby" is too > misleading. > I think this deserves more comments. -- With Regards, Amit Kapila.
RE: Synchronizing slots from primary to standby
On Thursday, December 14, 2023 12:45 PM Peter Smith wrote: > A review comment for v47-0001 Thanks for the comment. > > == > src/backend/replication/slot.c > > 1. GetStandbySlotList > > +static void > +WalSndRereadConfigAndReInitSlotList(List **standby_slots) { > + char*pre_standby_slot_names; > + > + ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * If we are running on a standby, there is no need to reload > + * standby_slot_names since we do not support syncing slots to > + cascading > + * standbys. > + */ > + if (RecoveryInProgress()) > + return; > > Should the RecoveryInProgress() check be first -- even before the > ProcessConfigFile call? ProcessConfigFile is necessary here; it is used not only for standby_slot_names but also for all the other GUCs that could be used by the caller. Best Regards, Hou zj
Re: Reducing output size of nodeToString
On Thu, 7 Dec 2023 at 13:09, David Rowley wrote: > > On Thu, 7 Dec 2023 at 10:09, Matthias van de Meent > wrote: > > PFA a patch that reduces the output size of nodeToString by 50%+ in > > most cases (measured on pg_rewrite), which on my system reduces the > > total size of pg_rewrite by 33% to 472KiB. This does keep the textual > > pg_node_tree format alive, but reduces its size significantly. > > It would be very cool to have the technology proposed by Andres back > in 2019 [1]. With that, we could easily write various output > functions. One could be compact and easily machine-readable and > another designed to be better for humans for debugging purposes. > > We could also easily serialize plans to binary format for copying to > parallel workers rather than converting them to a text-based > serialized format. It would also allow us to do things like serialize > PREPAREd plans into a nicely compact single allocation that we could > just pfree in a single pfree call on DEALLOCATE. I'm not sure what benefit you're referring to. If you mean "it's more compact than the current format" then sure; but the other points can already be covered by either the current nodeToString format, or by nodeCopy-ing the prepared plan into its own MemoryContext, which would allow us to do essentially the same thing. > Likely we could just use the existing Perl scripts to form the > metadata arrays rather than the clang parsing stuff Andres used in his > patch. > > Anyway, just wanted to ensure you knew about this idea. I knew about that thread, but didn't notice the metadata arrays part of it, which indeed looks interesting for this patch. Thanks for pointing it out. I'll see if I can incorporate parts of that into this patchset. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote: > > > > > > > It is correct that we can make a wrong decision about whether a change > > is transactional or non-transactional when sequence DDL happens before > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > after that state. However, one thing to note here is that we won't try > > to stream such a change because for non-transactional cases we don't > > proceed unless the snapshot is in a consistent state. Now, if the > > decision had been correct then we would probably have queued the > > sequence change and discarded at commit. > > > > One thing that we deviate here is that for non-sequence transactional > > cases (including logical messages), we immediately start queuing the > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > SnapBuildProcessChange() returns true which is quite possible) and > > take final decision at commit/prepare/abort time. However, that won't > > be the case for sequences because of the dependency of determining > > transactional cases on one of the prior records. Now, I am not > > completely sure at this stage if such a deviation can cause any > > problem and or whether we are okay to have such a deviation for > > sequences. > > Okay, so this particular scenario that I raised is somehow saved, I > mean although we are considering transactional sequence operation as > non-transactional we also know that if some of the changes for a > transaction are skipped because the snapshot was not FULL that means > that transaction can not be streamed because that transaction has to > be committed before snapshot become CONSISTENT (based on the snapshot > state change machinery). Ideally based on the same logic that the > snapshot is not consistent the non-transactional sequence changes are > also skipped. But the only thing that makes me a bit uncomfortable is > that even though the result is not wrong we have made some wrong > intermediate decisions i.e. considered transactional change as > non-transactions. > > One solution to this issue is that, even if the snapshot state does > not reach FULL just add the sequence relids to the hash, I mean that > hash is only maintained for deciding whether the sequence is changed > in that transaction or not. So no adding such relids to hash seems > like a root cause of the issue. Honestly, I haven't analyzed this > idea in detail about how easy it would be to add only these changes to > the hash and what are the other dependencies, but this seems like a > worthwhile direction IMHO. I also thought about the same solution. I tried this solution as the attached patch on top of Hayato's diagnostic changes. Following log messages are seen in server error log. Those indicate that the sequence change was correctly deemed as a transactional change (line 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional). 2023-12-14 12:12:50.550 IST [321229] ERROR: relation "pg_replication_slot" does not exist at character 15 2023-12-14 12:12:50.550 IST [321229] STATEMENT: select * from pg_replication_slot; 2023-12-14 12:12:57.289 IST [321229] LOG: logical decoding found initial starting point at 0/1598D50 2023-12-14 12:12:57.289 IST [321229] DETAIL: Waiting for transactions (approximately 1) older than 759 to end. 2023-12-14 12:12:57.289 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: smgr_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.551 IST [321229] LOG: XXX: skipped 2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:13:49.552 IST [321229] LOG: logical decoding found initial consistent point at 0/1599170 2023-12-14 12:13:49.552 IST [321229] DETAIL: Waiting for transactions (approximately 1) older than 760 to end. 2023-12-14 12:13:49.552 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 12:14:55.591 IST [321230] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional 2023-12-14 12:14:55.591 IST [321229] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 12:14:55.813 IST [321229] LOG: logical decoding found consistent point at 0/15992E8 2023-12-14 12:14:55.813 IST [321229] DETAIL: There are no running transactions. 2023-12-14 12:14:55.813 IST [32
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote: > > > > > > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > > I also thought about the same solution. I tried this solution as the > attached patch on top of Hayato's diagnostic changes. I think you forgot to attach the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: "pgoutput" options missing on documentation
Hi, here are some initial review comments. // Patch v00-0001 1. + + /* Check required parameters */ + if (!protocol_version_given) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("proto_version parameter missing"))); + if (!publication_names_given) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("publication_names parameter missing"))); To reduce translation efforts, perhaps it is better to arrange for these to share a common message. For example, ereport(ERROR, errcode(ERRCODE_INVALID_PARAMETER_VALUE), /* translator: %s is a pgoutput option */ errmsg("missing pgoutput option: %s", "proto_version")); ~ ereport(ERROR, errcode(ERRCODE_INVALID_PARAMETER_VALUE), /* translator: %s is a pgoutput option */ errmsg("missing pgoutput option: %s", "publication_names")); Also, I am unsure whether to call these "parameters" or "options" -- I wanted to call them parameters like in the documentation, but every other message in this function refers to "options", so in my example, I mimicked the nearby code. YMMV. // Patch v00-0002 2. - The logical replication START_REPLICATION command - accepts following parameters: + The standard logical decoding plugin (pgoutput) accepts + following parameters with START_REPLICATION command: Since the previous line already said pgoutput is the standard decoding plugin, maybe it's not necessary to repeat that. SUGGESTION Using the START_REPLICATION command, pgoutput accepts the following parameters: ~~~ 3. I noticed in the protocol message formats docs [1] that those messages are grouped according to the protocol version that supports them. Perhaps this page should be arranged similarly for these parameters? For example, document the parameters in the order they were introduced. SUGGESTION -proto_version ... -publication_names ... -binary ... -messages ... Since protocol version 2: -streaming (boolean) ... Since protocol version 3: -two_phase ... Since protocol version 4: -streaming (boolean/parallel) ... -origin ... ~~~ 4. + Boolean parameter to use the binary transfer mode. This is faster + than the text mode, but slightly less robust SUGGESTION Boolean parameter to use binary transfer mode. Binary mode is faster than the text mode but slightly less robust == [1] https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html Kind Regards, Peter Smith. Fujitsu Australia