Re: A proposal for shared memory based backup infrastructure
On Sat, Jul 30, 2022 at 12:23 PM mahendrakar s wrote: > > On Mon, 25 Jul 2022 at 12:00, Bharath Rupireddy > wrote: >> >> On Mon, Jul 25, 2022 at 10:03 AM mahendrakar s >> wrote: >> > >> > Hi Bharath, >> >> Thanks Mahendrakar for taking a look at the design. >> >> > "Typically, step (3) takes a good amount of time in production >> > environments with terabytes or petabytes scale of data and keeping the >> > session alive from step (1) to (4) has overhead and it wastes the >> > resources. And the session can get closed for various reasons - idle >> > in session timeout, tcp/ip keepalive timeout, network problems etc. >> > All of these can render the backup useless." >> > >> > >> this could be a common scenario and needs to be addressed. >> >> Hm. Additionally, the problem of keeping the session that starts the >> backup open until the entire data directory is backed-up becomes more >> worrisome if we were to run backups for a huge number of servers at >> scale - the entity (control plane or whatever), that is responsible >> for taking backups across huge fleet of postgres production servers, >> will have tremendous amount of resources wasted and it's a problem for >> that entity to keep the backup sessions active until the actual backup >> is finished. >> >> > "What if the backup started by a session can also be closed by another >> > session? This seems to be achievable, if we can place the >> > backup_label, tablespace_map and other required session/backend level >> > contents in shared memory with the key as backup_label name. It's a >> > long way to go." >> > >> > >> I think storing metadata about backup of a session in shared memory >> > >> may not work as it gets purged when the database goes for restart. We >> > >> might require a separate catalogue table to handle the backup session. >> >> Right now, the non-exclusive (and we don't have exclusive backups now >> from postgres 15) backup will anyway become useless if the postgres >> restarts, because there's no running backup state (backup_label, >> tablespace_map contents) that's persisted. >> >> Following are few more thoughts with the shared memory based backups >> as proposed in this thread: >> >> 1) How many max backups do we want to allow? Right now, there's no >> limit, I believe, max_connections number of concurrent backups can be >> taken - we have XLogCtlInsert->runningBackups but no limit. If we were >> to use shared memory to track the backup state, we might or might not >> have to decide on max backup limit to not preallocate and consume >> shared memory unnecessarily, otherwise, we could use something like >> dynamic shared memory hash table for storing backup state. >> >> 2) How to deal with the backups that are started but no one is coming >> to stop them? Basically, when to declare that the backup is dead or >> expired? Perhaps, we can have a max time limit after which if no stop >> backup is issued for a backup, which is then marked as dead or >> expired. >> >> We may or may not want to think on the above points for now until the >> idea in general has some benefits over the current backup >> infrastructure. > > Hi Bharath, > > There might be security concerns if the backup started by one user can be > stopped by another user. > This is because the user who stops the backup will get the backup_label or > table space map file contents of other user. > Isn't this a concern for non-exclusive backup? 
> > I think there should be role based control for backup related activity which > can prevent other unprivileged users from stopping the backup. > > Thoughts? The pg_backup_start() and pg_backup_stop() functions are role based - restricted to superusers by default, but other users can be granted EXECUTE to run the functions - I think the existing behaviour would suffice. However, the responsibility of not letting users stop backups started by other users (yes, just with the label name) would lie with those who use these functions with the new shared memory based backups: they have to ensure that whoever starts a backup is also the one who stops it. Perhaps we can call that out in the documentation explicitly. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
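To make the shared-memory idea discussed above a little more concrete, here is a minimal sketch of what a per-backup entry in a dynamic shared memory hash table could look like. Every field, size and constant below is a hypothetical placeholder for illustration, not something taken from a posted patch:

typedef struct BackupStateEntry
{
    char        label[MAXPGPATH];       /* dshash key: backup label name */
    XLogRecPtr  startpoint;             /* LSN returned by pg_backup_start() */
    TimestampTz started_at;             /* used to expire abandoned backups */
    Oid         started_by;             /* role that started the backup */
    char        backup_label[1024];     /* would-be backup_label contents */
    char        tablespace_map[1024];   /* would-be tablespace_map contents */
} BackupStateEntry;

/* hypothetical knobs corresponding to points (1) and (2) above */
#define MAX_TRACKED_BACKUPS     64      /* cap on concurrently tracked backups */
#define BACKUP_EXPIRY_SECONDS   86400   /* stop-less backups marked expired after this */

With something along these lines, pg_backup_stop() could look the entry up by label from any session, and a periodic check could mark entries whose started_at is older than the expiry threshold as dead.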
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 4, 2022 at 3:25 AM Nathan Bossart wrote: > Here is a new patch set without the annotation. Were you considering adding the new function to simd.h now that that's committed? It's a bit up in the air what should go in there, but this new function is low-level and generic enough to be a candidate... I wonder if the "pg_" prefix is appropriate here, as that is most often used for things that hide specific details *and* where the base name would clash, like OS calls or C library functions. I'm not quite sure where the line is drawn, but I mean that "linearsearch" is a generic algorithm and not a specific API we are implementing, if that makes sense. The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(), pg_popcount32, etc). We'll likely want to search bytes sometime, so something to keep in mind as far as naming ("_int" vs "_byte"?). I'm not a fan of "its" as a variable name, and I'm curious what it's intended to convey. All the __m128i vars could technically be declared const, although I think it doesn't matter -- it's just a hint to the reader. Out of curiosity do we know how much we get by loading four registers rather than two? -- John Naylor EDB: http://www.enterprisedb.com
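For readers following the thread, a purely scalar sketch of the helper under discussion might look like the following; the name and exact signature are only illustrative here (the actual patch uses SSE2 intrinsics and compares several vector registers' worth of elements per loop iteration):

/*
 * Return true if "key" is present in the first "nelem" elements of "base".
 * Scalar reference version of the search being discussed; the patch
 * vectorizes this loop.
 */
static inline bool
pg_linearsearch_uint32(uint32 key, const uint32 *base, uint32 nelem)
{
    for (uint32 i = 0; i < nelem; i++)
    {
        if (base[i] == key)
            return true;
    }
    return false;
}

The naming questions above (the "pg_" prefix, "_uint32" vs "32") apply to exactly this kind of small, generic building block.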
Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size
Hi, On Wed, Aug 3, 2022 at 12:00 AM Andres Freund wrote: > On 2022-08-02 12:05:55 +0900, Amit Langote wrote: > > On Tue, Aug 2, 2022 at 9:39 AM Andres Freund wrote: > > > On 2022-07-27 17:01:13 +0900, Amit Langote wrote: > > > > Here's an updated version of the patch, with mostly cosmetic changes. > > > > In particular, I added comments describing the new llvm_compile_expr() > > > > blobs. > > > > > > - I've asked a couple times before: Why do we need space for every > > > possible > > > datatype at once in JsonItemCoercions? Can there be multiple > > > "concurrent" > > > coercions in process? > > > > This topic has been a head-scratcher for me too from the beginning, > > but I've since come to understand (convince myself) that we do need > > the coercions for all possible types, because we don't know the type > > of the JSON item that's going to pop out of the main JSON path > > expression until we've run it through the JSON path executor that > > ExecEvalJson() invokes. So, it's not possible to statically assign > > the coercion. > > Sure. But that doesn't mean we have to have memory for every possible type *at > the same time*. > > > I am not really sure if different coercions may be used > > in the same query over multiple evaluations of the same JSON path > > expression, but maybe that's also possible. > > Even if the type can change, I don't think that means we need to have space > for multiple types at the same time - there can't be multiple coercions > happening at the same time, otherwise there could be two coercions of the same > type as well. So we don't need memory for every coercion type. Do you find it unnecessary to statically allocate memory for JsonItemCoercionState for each possible coercion, as in the following struct definition: typedef struct JsonItemCoercionsState { JsonItemCoercionState null; JsonItemCoercionState string; JsonItemCoercionState numeric; JsonItemCoercionState boolean; JsonItemCoercionState date; JsonItemCoercionState time; JsonItemCoercionState timetz; JsonItemCoercionState timestamp; JsonItemCoercionState timestamptz; JsonItemCoercionState composite; } JsonItemCoercionsState; A given JsonItemCoercionState (note singular Coercion) contains: typedef struct JsonItemCoercionState { /* Expression used to evaluate the coercion */ JsonCoercion *coercion; /* ExprEvalStep to compute this coercion's expression */ int jump_eval_expr; } JsonItemCoercionState; jump_eval_expr above is the address in JsonItemCoercions' ExprState.steps of the 1st ExprEvalStep corresponding to coercion->expr. IIUC, all ExprEvalSteps needed to evaluate an expression and its children must be allocated statically in ExecInitExprRec(), and none on-the-fly as needed. So, this considers all coercions and allocates states of all statically. > > > The whole coercion stuff just seems incredibly clunky (in a slightly > > > different shape before this patch). ExecEvalJsonExprItemCoercion() calls > > > ExecPrepareJsonItemCoercion(), which gets a pointer to one of the > > > per-type > > > elements in JsonItemCoercionsState, dispatching on the type of the json > > > object. Then we later call ExecGetJsonItemCoercion() (via a convoluted > > > path), which again will dispatch on the type (extracting the json object > > > again afaics!), to then somehow eventually get the coerced value. > > > > I think it might be possible to make this a bit simpler, by not > > leaving anything coercion-related in ExecEvalJsonExpr(). > > Honestly, this code seems like it should just be rewritten from scratch. 
Based on what I wrote above, please let me know if I've misunderstood your concerns about over-allocation of coercion state. I can try to rewrite one more time if I know what this should look like instead. > > I left some pieces there, because I thought the error of not finding an > > appropriate coercion must be thrown right away as the code in > > ExecEvalJsonExpr() does after calling ExecGetJsonItemCoercion(). > > > > ExecPrepareJsonItemCoercion() is called later when it's time to > > actually evaluate the coercion. If we move the error path to > > ExecPrepareJsonItemCoercion(), both ExecGetJsonItemCoercion() and the > > error path code in ExecEvalJsonExpr() will be unnecessary. I will > > give that a try. > > Why do we need the separation of prepare and then evaluation? They're executed > straight after each other? ExecPrepareJsonItemCoercion() is a helper routine to choose the coercion and extract the Datum out of the JsonbValue produced by the EEOP_JSONEXPR_PATH step to feed to the coercion expression's ExprEvalStep. The coercion evaluation will be done by jumping to said step in ExecInterpExpr(). > > > - Looks like there's still some recursive expression states, namely > > > JsonExprState->{result_coercion, coercions}? > > > > So, the problem with inlining coercion evaluation into the main p
Re: Correct comment in RemoveNonParentXlogFiles()
On Thu, Aug 4, 2022 at 11:30 AM Kyotaro Horiguchi wrote: > At Wed, 3 Aug 2022 18:16:33 +0530, Ashutosh Sharma > wrote in > > Following comment in RemoveNonParentXlogFiles() says that we are trying > to > > remove any WAL file whose segment number is >= the segment number of the > > first WAL file on the new timeline. However, looking at the code, I can > say > > that we are trying to remove the WAL files from the previous timeline > whose > > segment number is just greater than (not equal to) the segment number of > > the first WAL file in the new timeline. I think we should improve this > > comment, thoughts? > > > > /* > > * Remove files that are on a timeline older than the new one > we're > > * switching to, but with a segment number >= the first segment > on > > the > > * new timeline. > > */ > > if (strncmp(xlde->d_name, switchseg, 8) < 0 && > > strcmp(xlde->d_name + 8, switchseg + 8) > 0) > > I'm not sure I'm fully getting your point. The current comment is > correctly saying that it removes the segments "on a timeline older > than the new one". I agree about segment comparison. > Yeah, my complaint is about the comment on segment comparison for removal. > > So, if I changed that comment, I would finish with the following change. > > - * switching to, but with a segment number >= the first segment > on > + * switching to, but with a segment number greater than the > first segment on > which looks correct to me and will bring it in line with the code. > > That disagreement started at the time the code was introduced by > b2a5545bd6. Leaving the last segment in the old timeline is correct > since it is renamed to .partial later. If timeline switch happened > just at segment boundary, that segment would not be created. > Yeah, that's why we keep the last segment (partially written) from the old timeline, which means we're not deleting it here. So the comment should not be saying that we are also removing the last wal segment from the old timeline which is equal to the first segment from the new timeline. -- With Regards, Ashutosh Sharma.
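To spell out the comparison with made-up file names: a WAL file name is 8 hex digits of timeline ID followed by 16 hex digits identifying the segment, and switchseg is the first segment on the new timeline. If switchseg were 00000003000000000000000A, the quoted condition would behave like this:

  000000020000000000000009   kept    (older timeline, but smaller segment number)
  00000002000000000000000A   kept    (equal segment number; this is the one later renamed to .partial)
  00000002000000000000000B   removed (older timeline and strictly greater segment number)
  00000003000000000000000B   kept    (not on an older timeline)

which matches the "greater than" wording proposed above rather than ">=".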
Re: Race between KeepFileRestoredFromArchive() and restartpoint
At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch wrote in > > > I think in this case a HINT might be sufficient to at least keep people > > > from > > > wasting time tracking down a problem that has already been fixed. > > Here's a patch to add that HINT. I figure it's better to do this before next > week's minor releases. In the absence of objections, I'll push this around > 2022-08-05 14:00 UTC. +1 > > > However, there is another issue [1] that might argue for a back patch if > > > this patch (as I believe) would fix the issue. > > > > > [1] > > > https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com > > > > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N > > chance of failing. Recovery adds >N files between restartpoints. Hence, > > the > > WAL directory grows without bound. Is that roughly the theory in mind? > > On further reflection, I don't expect it to happen that way. Each failure > message is LOG-level, so the remaining recycles still happen. (The race > condition can yield an ERROR under PreallocXlogFiles(), but recycling is > already done at that point.) I agree to this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: fix typos
On Wed, Aug 3, 2022 at 11:41 PM Robert Haas wrote: > > I think that it's talking about this (documented) syntax: > > ALTER ROUTINE name [ ( [ [ argmode ] [ argname ] argtype [, ...] ] ) ] > [ NO ] DEPENDS ON EXTENSION extension_name > > So the change from "depends" to "depend" here is incorrect. Maybe we > can say something like: > > the DEPENDS ON EXTENSION > extension_name action > > (I haven't tested whether this markup works.) Makes sense, I'll go make it happen. -- John Naylor EDB: http://www.enterprisedb.com
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin wrote: > > > 25 июля 2022 г., в 14:29, Bharath Rupireddy > > написал(а): > > > > Hm, after thinking for a while, I tend to agree with the above > > approach - meaning, query cancel interrupt processing can completely > > be disabled in SyncRepWaitForLSN() and process proc die interrupt > > immediately, this approach requires no GUC as opposed to the proposed > > v1 patch upthread. > GUC was proposed here[0] to maintain compatibility with previous behaviour. > But I think that having no GUC here is fine too. If we do not allow > cancelation of unreplicated backends, of course. > > >> > >> And yes, we need additional complexity - but in some other place. > >> Transaction can also be locally committed in presence of a server crash. > >> But this is another difficult problem. Crashed server must not allow data > >> queries until LSN of timeline end is successfully replicated to > >> synchronous_standby_names. > > > > Hm, that needs to be done anyways. How about doing as proposed > > initially upthread [1]? Also, quoting the idea here [2]. > > > > Thoughts? > > > > [1] > > https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jf2cfayv1mr7fzd8tr4ym5+oweyg1s...@mail.gmail.com > > [2] 2) Wait for sync standbys to catch up upon restart after the crash or > > in the next txn after the old locally committed txn was canceled. One > > way to achieve this is to let the backend, that's making the first > > connection, wait for sync standbys to catch up in ClientAuthentication > > right after successful authentication. However, I'm not sure this is > > the best way to do it at this point. > > > I think ideally the startup process should not allow read only connections in > CheckRecoveryConsistency() until WAL is replicated to quorum, at least up > until the new timeline LSN. We can't do it in CheckRecoveryConsistency() unless I'm missing something. Because, the walsenders (required for sending the remaining WAL to sync standbys to achieve quorum) can only be started after the server reaches a consistent state, after all, walsenders are specialized backends. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Re: collate not support Unicode Variation Selector
At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 wrote in > Thank you for your reply. > > About 60,000 characters are registered in the IPAmj Mincho font designated by > the national specifications. > It should be able to handle all characters. Yeah, it is one of those fonts. But I didn't know that MS-Word can *display* ideographic variations. But it is disappointing that input requires copy-pasting from the Web.. Maybe those characters can be input smoothly by using ATOK or the like.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy wrote: > > On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin wrote: > > > > > 25 июля 2022 г., в 14:29, Bharath Rupireddy > > > написал(а): > > > > > > Hm, after thinking for a while, I tend to agree with the above > > > approach - meaning, query cancel interrupt processing can completely > > > be disabled in SyncRepWaitForLSN() and process proc die interrupt > > > immediately, this approach requires no GUC as opposed to the proposed > > > v1 patch upthread. > > GUC was proposed here[0] to maintain compatibility with previous behaviour. > > But I think that having no GUC here is fine too. If we do not allow > > cancelation of unreplicated backends, of course. > > > > >> > > >> And yes, we need additional complexity - but in some other place. > > >> Transaction can also be locally committed in presence of a server crash. > > >> But this is another difficult problem. Crashed server must not allow data > > >> queries until LSN of timeline end is successfully replicated to > > >> synchronous_standby_names. > > > > > > Hm, that needs to be done anyways. How about doing as proposed > > > initially upthread [1]? Also, quoting the idea here [2]. > > > > > > Thoughts? > > > > > > [1] > > > https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jf2cfayv1mr7fzd8tr4ym5+oweyg1s...@mail.gmail.com > > > [2] 2) Wait for sync standbys to catch up upon restart after the crash or > > > in the next txn after the old locally committed txn was canceled. One > > > way to achieve this is to let the backend, that's making the first > > > connection, wait for sync standbys to catch up in ClientAuthentication > > > right after successful authentication. However, I'm not sure this is > > > the best way to do it at this point. > > > > > > I think ideally the startup process should not allow read only connections in > > CheckRecoveryConsistency() until WAL is replicated to quorum, at least > > up until the new timeline LSN. > > We can't do it in CheckRecoveryConsistency() unless I'm missing > something. Because, the walsenders (required for sending the remaining > WAL to sync standbys to achieve quorum) can only be started after the > server reaches a consistent state, after all, walsenders are > specialized backends. Continuing on the above thought (I inadvertently clicked the send button previously): A simple approach would be to check for quorum in PostgresMain() before entering the query loop for (;;) for non-walsender cases. A disadvantage of this would be that all the backends will be waiting here in the worst case if it takes time for achieving the sync quorum after restart - roughly we can do the following in PostgresMain(), of course we need a locking mechanism so that all the backends that reach here will wait for the same LSN: if (sync_replication_defined == true && shmem->wait_for_sync_repl_upon_restart == true) { SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false); shmem->wait_for_sync_repl_upon_restart = false; } Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
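To make the locking aspect slightly more concrete, here is one possible shape for that check; everything below is a hypothetical sketch rather than a patch - the shared flag and its location are invented, while SyncRepRequested(), GetFlushRecPtr() and SyncRepWaitForLSN() are existing functions:

/* Hypothetical one-shot wait before entering the main query loop */
if (SyncRepRequested() &&
    pg_atomic_read_u32(&SyncRepCtl->wait_for_quorum_on_restart))    /* invented flag */
{
    /* every backend waits on the same target: the locally flushed end of WAL */
    XLogRecPtr  flush_lsn = GetFlushRecPtr(NULL);

    SyncRepWaitForLSN(flush_lsn, false);
    pg_atomic_write_u32(&SyncRepCtl->wait_for_quorum_on_restart, 0);
}

Whether the flag should be cleared by the first backend that gets an acknowledgement, or only by the checkpointer/startup process, is exactly the kind of detail the locking question above is about.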
Re: Fix obsoleted comments for function prototypes
On Tue, Aug 02, 2022 at 07:25:49PM +0900, Michael Paquier wrote: > These declarations are linked to comments with their file paths, so > making that automated looks rather complicated to me. I have looked > at the surroundings without noticing anything obvious, so what you > have caught here sounds fine to me, good catches :) Done as of 245e14e. -- Michael
Re: Mingw task for Cirrus CI
On Thu, Aug 4, 2022 at 2:04 PM Thomas Munro wrote: > I noticed that this says: > > [01:01:45.657] sqlda.pgc: In function 'dump_sqlda': > [01:01:45.657] sqlda.pgc:45:33: warning: format '%d' expects argument > of type 'int', but argument 3 has type 'long long int' [-Wformat=] > [01:01:45.657] 45 | "name sqlda descriptor: '%s' value %I64d\n", > [01:01:45.657] | ^~~ > [01:01:45.657] .. > [01:01:45.657] 49 | sqlda->sqlvar[i].sqlname.data, *(long long int > *)sqlda->sqlvar[i].sqldata); > [01:01:45.657] | ~~ > [01:01:45.657] | | > [01:01:45.657] | long long int > > ... but fairywren doesn't. Why would they disagree on recognising %I64d? Oops, I was looking in the wrong place. fairywren does also show the warning: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2022-08-02%2022%3A05%3A30&stg=ecpg-check Something to fix, but not directly relevant to this patch. Sorry for the noise.
Re: How is this possible "publication does not exist"
On Wed, Aug 11, 2021 at 1:37 PM Amit Kapila wrote: > I think it is and the context is generated via > output_plugin_error_callback. Is this reproducible for you and if so, > can you share a test case or some steps to reproduce this? Does this > work and suddenly start giving errors or it happens the very first > time you tried to set up publication/subscription? I think some more > details are required about your setup and steps to analyze this > problem. You might want to check publication-side logs but not sure if > get any better clue there. This seems to regularly reproduce the issue on PostgreSQL 14.4: drop subscription if exists local_sub; drop publication if exists local_pub; drop table if exists local; select pg_create_logical_replication_slot('test','pgoutput'); create table local (x int, y int); insert into local values (1,2); create publication local_pub for table local; create subscription local_sub connection 'host=localhost port=5432' publication local_pub with (create_slot = false, slot_name = 'test', copy_data = false); The log on the publisher then repeatedly shows: 2022-08-04 10:46:56.140 CEST [12785] ERROR: publication "local_pub" does not exist 2022-08-04 10:46:56.140 CEST [12785] CONTEXT: slot "test", output plugin "pgoutput", in the change callback, associated LSN 8/6C01A270 2022-08-04 10:46:56.140 CEST [12785] STATEMENT: START_REPLICATION SLOT "test" LOGICAL 0/0 (proto_version '2', publication_names '"local_pub"') (fails in the same way when setting up the subscription on a different node) The local_pub does appear in pg_publication, but it seems a bit like the change_cb is using an old snapshot when reading from the catalog in GetPublicationByName. cheers, Marco
Re: Reducing planning time on tables with many indexes
Hi Tom, On Wed, Jul 27, 2022 at 6:39 PM Tom Lane wrote: > David Geier writes: > > We tracked down the root cause of this slowdown to lock contention in > > 'get_relation_info()'. The index lock of every single index of every > single > > table used in that query is acquired. We attempted a fix by pre-filtering > > out all indexes that anyways cannot be used with a certain query, without > > taking the index locks (credits to Luc Vlaming for idea and > > implementation). The patch does so by caching the columns present in > every > > index, inside 'struct Relation', similarly to 'rd_indexlist'. > > I wonder how much thought you gave to the costs imposed by that extra > cache space. We have a lot of users who moan about relcache bloat > already. The current representation could be compacted (e.g. by storing the column indexes as 16-bit integers, instead of using a Bitmapset). That should make the additional space needed negligible compared to the current size of RelationData. On top of that we could maybe reorder the members of RelationData to save padding bytes. Currently, there is lots of interleaving of members of different sizes. Even when taking cache locality into consideration it looks like a fair amount could be saved, probably accounting for the additional space needed to store the index column data. But more to the point, I do not buy the assumption that > an index's set of columns is a good filter for which indexes are of > interest. A trivial counterexample from the regression database is > > regression=# explain select count(*) from tenk1; > QUERY PLAN > > > > > > Aggregate (cost=219.28..219.29 rows=1 width=8) >-> Index Only Scan using tenk1_hundred on tenk1 (cost=0.29..194.28 > rows=100 > 00 width=0) > (2 rows) > > Only for queries without index conditions, the current version of the patch simply discards all indexes but the one with the least columns. This is case (3) in s64_IsUnnecessaryIndex(). This is an over-simplification with the goal of picking the index which yields least I/O. For single column indexes that works, but it can fall short for multi-column indexes (e.g. [INT, TEXT] and [INT, INT] indexes both have 2 columns, but the latter would be better suited when there's no other index and we want to read the first integer column). What we should do here instead is to discard indexes based on storage size. > It looks to me like the patch also makes unwarranted assumptions about > being able to discard all but the smallest index having a given set > of columns. This would, for example, possibly lead to dropping the > index that has the most useful sort order, or that has the operator > class needed to support a specific WHERE clause. > Why would that be? If we keep all indexes that contain columns that are used by the query, we also keep the indexes which provide a certain sort order or operator class. > > In short, I'm not sure I buy this concept at all. I think it might > be more useful to attack the locking overhead more directly. I kind > of wonder why we need per-index locks at all during planning --- > I think that we already interpret AccessShareLock on the parent table > as being sufficient to block schema changes on existing indexes. > > As I said in the reply to your other mail, there's huge savings also in the serial case where lock contention is not an issue. It seems like pre-filtering saves work down the road. I'll do some perf runs to track down where exactly the savings come from.
One source I can think of is only having to consider a subset of all indexes during path creation. > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? > > That's another interesting approach, but I would love to save the planner cycles for the sequential case. -- David Geier (ServiceNow)
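As a rough illustration of where such savings could come from, the pre-filtering can be thought of as something like the following inside get_relation_info(); the cached-attribute helper and the query_attrs set are invented names here, and the actual patch differs in its details:

/* query_attrs: attribute numbers referenced by the query's quals,
 * join clauses and pathkeys (computed earlier, not shown) */
foreach(lc, RelationGetIndexList(relation))
{
    Oid         indexoid = lfirst_oid(lc);
    /* hypothetical: per-index column set cached in the relcache, no index lock taken */
    Bitmapset  *index_attrs = get_cached_index_attnos(relation, indexoid);

    /* skip indexes that share no column with the query */
    if (!bms_overlap(index_attrs, query_attrs))
        continue;               /* never index_open() or lock this index */

    /* ... existing per-index work: index_open(), build IndexOptInfo, ... */
}

The perf question is then how much of the win comes from skipping the lock itself versus from building fewer IndexOptInfos for the rest of planning to chew on.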
RE: Introduce wait_for_subscription_sync for TAP tests
On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada wrote: > > On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila > wrote: > > > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila > wrote: > > > > > > Pushed this one and now I'll look at your other patch. > > > > > > > I have pushed the second patch as well after making minor changes in > > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and > > they sound reasonable to me. Will you be able to produce back branch > > patches? > > Yes. I've attached patches for backbranches. The updates are > straightforward on v11 - v15. However, on v10, we don't use > wait_for_catchup() in some logical replication test cases. The commit > bbd3363e128dae refactored the tests to use wait_for_catchup but it's > not backpatched. So in the patch for v10, I didn't change the code > that was changed by the commit. Also, since wait_for_catchup requires > to specify $target_lsn, unlike the one in v11 or later, I changed > wait_for_subscription_sync() accordingly. > Thanks for your patches. In the patches for pg11 ~ pg14, it looks like we need to add a "=pod" before the current change in PostgresNode.pm. Right? Regards, Shi yu
Re: Add last failed connection error message to pg_stat_wal_receiver
On Mon, Jul 25, 2022 at 2:40 PM Michael Paquier wrote: > > On Mon, Jul 25, 2022 at 12:19:40PM +0530, Bharath Rupireddy wrote: > > Thanks a lot Cary for reviewing. It will be great if you can add > > yourself as a reviewer and set the status accordingly in the CF entry > > here - https://commitfest.postgresql.org/38/3666/. > > Hmm. This stands for the connection error, but there are other things > that could cause a failure down the road, like an incorrect system > ID or a TLI-related report, so that seems a bit limited to me? Good point. The walreceiver can exit for any reason. We can either 1) store all the error messages or 2) think of using sigsetjmp, but that only catches ERRORs, leaving out FATAL and PANIC messages. The option (1) is simple, but there are problems - we may miss storing future error messages (good commenting and reviewing may help here), and all the error messages now need to be stored as strings, which is complex. The option (2) seems reasonable but we will miss FATAL and PANIC messages (we have many ERRORs, 2 FATALs, 3 PANICs). Maybe a combination of option (1) for FATALs and PANICs, and option (2) for ERRORs helps. Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
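For option (2), a very rough sketch of what the capture could look like around the walreceiver's work (the lastFailedErrMsg field is invented here; PG_TRY/PG_CATCH, CopyErrorData() and WalRcv->mutex are existing infrastructure, and as noted this still would not catch FATAL or PANIC):

MemoryContext oldcontext = CurrentMemoryContext;

PG_TRY();
{
    /* ... existing walreceiver work that may throw ERROR ... */
}
PG_CATCH();
{
    ErrorData  *edata;

    /* copy the error out of ErrorContext before re-throwing */
    MemoryContextSwitchTo(oldcontext);
    edata = CopyErrorData();

    SpinLockAcquire(&WalRcv->mutex);
    strlcpy(WalRcv->lastFailedErrMsg, edata->message,       /* invented field */
            sizeof(WalRcv->lastFailedErrMsg));
    SpinLockRelease(&WalRcv->mutex);

    FreeErrorData(edata);
    PG_RE_THROW();
}
PG_END_TRY();

Option (1) would instead assign the shared-memory string at each existing ereport() site, which is where the "may miss future error messages" concern comes from.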
RE: collate not support Unicode Variation Selector
Thank you for your reply. SQLServer supports Unicode Variation Selectors, so I would like PostgreSQL to support them as well. Regards. -- Motonari Arai -Original Message- From: Kyotaro Horiguchi Sent: Thursday, August 4, 2022 5:24 PM To: n2...@ndensan.co.jp Cc: thomas.mu...@gmail.com; t...@sss.pgh.pa.us; pgsql-hackers@lists.postgresql.org Subject: Re: collate not support Unicode Variation Selector At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 wrote in > Thank you for your reply. > > About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications. > It should be able to handle all characters. Yeah, it is one of those fonts. But I didn't know that MS-Word can *display* ideographic variations. But it is disappointing that input requires copy-pasting from the Web.. Maybe those characters can be input smoothly by using ATOK or the like.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: fix typos
On Tue, Aug 2, 2022 at 1:11 AM Justin Pryzby wrote: > > On Mon, Aug 01, 2022 at 08:04:54PM +0200, Erik Rijkers wrote: > > Recent typos... > > LGTM, thanks. > > Here are some others I've been sitting on, mostly in .c files. I pushed Robert's suggestion, then pushed the rest of Erik's changes and two of Justin's. For Justin's 0004: --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -364,7 +364,7 @@ restart: if (nowait) ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), - errmsg("could not drop replication origin with OID %d, in use by PID %d", + errmsg("could not drop replication origin with OID %u, in use by PID %d", RepOriginId is a typedef for uint16, so this can't print the wrong answer, but it is inconsistent with other uses. So it seems we don't need to backpatch this one? For patch 0002, the whitespace issue in the top comment in inval.c, I'm inclined to just change all the out-of-place tabs in a single commit, so we can add that to the list of whitespace commits. -- John Naylor EDB: http://www.enterprisedb.com
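For anyone wondering why it "can't print the wrong answer": a uint16 argument undergoes the default argument promotions in a varargs call and arrives as an int, so %d and %u print the same value for the whole RepOriginId range; %u is simply the conventional format for unsigned quantities. A trivial standalone illustration (not PostgreSQL code):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint16_t origin_id = 65535;     /* largest possible 16-bit id */

    /* promoted to int in the varargs call, so both formats agree here */
    printf("%d %u\n", origin_id, origin_id);
    return 0;
}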
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On 8/4/22 04:06, Kyotaro Horiguchi wrote: At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch wrote in I think in this case a HINT might be sufficient to at least keep people from wasting time tracking down a problem that has already been fixed. Here's a patch to add that HINT. I figure it's better to do this before next week's minor releases. In the absence of objections, I'll push this around 2022-08-05 14:00 UTC. +1 Looks good to me as well. However, there is another issue [1] that might argue for a back patch if this patch (as I believe) would fix the issue. [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com That makes sense. Each iteration of the restartpoint recycle loop has a 1/N chance of failing. Recovery adds >N files between restartpoints. Hence, the WAL directory grows without bound. Is that roughly the theory in mind? On further reflection, I don't expect it to happen that way. Each failure message is LOG-level, so the remaining recycles still happen. (The race condition can yield an ERROR under PreallocXlogFiles(), but recycling is already done at that point.) I agree to this. Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing without bound. Even if we are not sure what is causing it, how confident are we that the patches applied to v15 would fix it? Regards, -David
Re: Fix obsoleted comments for function prototypes
On Thu, Aug 4, 2022 at 4:38 PM Michael Paquier wrote: > On Tue, Aug 02, 2022 at 07:25:49PM +0900, Michael Paquier wrote: > > These declarations are linked to comments with their file paths, so > > making that automated looks rather complicated to me. I have looked > > at the surroundings without noticing anything obvious, so what you > > have caught here sounds fine to me, good catches :) > > Done as of 245e14e. Thank you Michael! Thanks Richard
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Aug 4, 2022 at 9:41 AM Dilip Kumar wrote: > > On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby wrote: > > > > On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote: > > > Hm. This looks more like an issue of DROP DATABASE not being > > > interruptible. I > > > suspect this isn't actually related to STRATEGY wal_log and could likely > > > be > > > reproduced in older versions too. > > > > I couldn't reproduce it with file_copy, but my recipe isn't exactly > > reliable. > > That may just mean that it's easier to hit now. > > I think this looks like a problem with drop db but IMHO you are seeing > this behavior only when a database is created using WAL LOG because in > this strategy we are using buffers to write the destination database > pages and some of the dirty buffers and sync requests might still be > pending. And now when we try to drop the database it drops all the > dirty buffers and all pending sync requests and then before it > actually removes the directory it gets interrupted and now you see the > database directory on disk which is partially corrupted. See below > sequence of drop database > > > dropdb() > { > ... > DropDatabaseBuffers(db_id); > ... > ForgetDatabaseSyncRequests(db_id); > ... > RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); > > WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); > -- Inside this it can process the cancel query and get interrupted > remove_dbtablespaces(db_id); > .. > } > > I reproduced the same error by inducing error just before > WaitForProcSignalBarrier. > > postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a; > CREATE DATABASE > ERROR: XX000: test error > LOCATION: dropdb, dbcommands.c:1684 > postgres[14968]=# \c a > connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC: > could not open critical system index 2662 > Previous connection kept > postgres[14968]=# So basically, from this we can say it is completely a problem with drop databases, I mean I can produce any behavior by interrupting drop database 1. If we created some tables/inserted data and the drop database got cancelled, it might have a database directory and those objects are lost. 2. Or you can even drop the database directory and then get cancelled before deleting the pg_database entry then also you will end up with a corrupted database, doesn't matter whether you created it with WAL LOG or FILE COPY. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Fix inconsistencies GUC categories
Hi, It appears that config.sgml and pg_settings have not been updated, even though a new subcategory was added in 249d649. a55a984 may have been missed in the cleaning. -- Category is 'CONNECTIONS AND AUTHENTICATION' and subcategory is 'Connection Settings' at config.sgml. Category is 'CONNECTIONS AND AUTHENTICATION' and subcategory is 'Connection Settings' at pg_settings. Category is 'CONNECTIONS AND AUTHENTICATION' and subcategory is 'TCP settings' at postgresql.conf.sample. -- I would like to unify the following with config.sgml as in a55a984. -- Category is 'REPORTING AND LOGGING' and subcategory is 'PROCESS TITLE' at config.sgml. Category is 'REPORTING AND LOGGING' and subcategory is 'PROCESS TITLE' at pg_settings. Category is 'PROCESS TITLE' and subcategory is none at postgresql.conf.sample. -- Trivial changes were made to the following short_desc. -- recovery_prefetch enable_group_by_reordering stats_fetch_consistency -- I've attached a patch. Thoghts? -- Regards, -- Shinya Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATIONdiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e2d728e0c4..2522f4c8c5 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -896,6 +896,13 @@ include_dir 'conf.d' + + + + + TCP Settings + + tcp_keepalives_idle (integer) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index af4a1c3068..5db5df6285 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -778,6 +778,8 @@ const char *const config_group_names[] = gettext_noop("File Locations"), /* CONN_AUTH_SETTINGS */ gettext_noop("Connections and Authentication / Connection Settings"), + /* CONN_AUTH_TCP */ + gettext_noop("Connections and Authentication / TCP Settings"), /* CONN_AUTH_AUTH */ gettext_noop("Connections and Authentication / Authentication"), /* CONN_AUTH_SSL */ @@ -1216,7 +1218,7 @@ static struct config_bool ConfigureNamesBool[] = }, { {"enable_group_by_reordering", PGC_USERSET, QUERY_TUNING_METHOD, - gettext_noop("enable reordering of GROUP BY key"), + gettext_noop("Enable reordering of GROUP BY key."), NULL, GUC_EXPLAIN }, @@ -3460,7 +3462,7 @@ static struct config_int ConfigureNamesInt[] = }, { - {"tcp_keepalives_idle", PGC_USERSET, CONN_AUTH_SETTINGS, + {"tcp_keepalives_idle", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Time between issuing TCP keepalives."), gettext_noop("A value of 0 uses the system default."), GUC_UNIT_S @@ -3471,7 +3473,7 @@ static struct config_int ConfigureNamesInt[] = }, { - {"tcp_keepalives_interval", PGC_USERSET, CONN_AUTH_SETTINGS, + {"tcp_keepalives_interval", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Time between TCP keepalive retransmits."), gettext_noop("A value of 0 uses the system default."), GUC_UNIT_S @@ -3493,7 +3495,7 @@ static struct config_int ConfigureNamesInt[] = }, { - {"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_SETTINGS, + {"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Maximum number of TCP keepalive retransmits."), gettext_noop("This controls the number of consecutive keepalive retransmits that can be " "lost before a connection is considered dead. 
A value of 0 uses the " @@ -3595,7 +3597,7 @@ static struct config_int ConfigureNamesInt[] = }, { - {"tcp_user_timeout", PGC_USERSET, CONN_AUTH_SETTINGS, + {"tcp_user_timeout", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("TCP user timeout."), gettext_noop("A value of 0 uses the system default."), GUC_UNIT_MS @@ -3640,7 +3642,7 @@ static struct config_int ConfigureNamesInt[] = }, { - {"client_connection_check_interval", PGC_USERSET, CONN_AUTH_SETTINGS, + {"client_connection_check_interval", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Sets the time interval between checks for disconnection while running queries."), NULL, GUC_UNIT_MS @@ -4953,7 +4955,7 @@ static struct config_enum ConfigureNamesEnum[] = { {"stats_fetch_consistency", PGC_USERSET, STATS_CUMULATIVE, - gettext_noop("Sets the consistency of accesses to statistics data"), + gettext_noop("Sets the consistency of accesses to statistics data."), NULL }, &pgstat_fetch_consistency, @@ -5044,7 +5046,7 @@ static struct config_enum ConfigureNamesEnum[] = { {"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY, - gettext_noop("Prefetch referenced blocks during recovery"), + gettext_noop("Prefetch referenced blocks during recovery."), gettext_noop("Look ahead in the WAL to find references to uncached data.") }, &recovery_prefetch, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index b4bc06e5f5..90bec0502c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@
RE: Data is copied twice when specifying both child and parent table in publication
On Thursday, July 28, 2022 5:17 PM Peter Smith wrote: > Here are some review comments for the HEAD_v7-0001 patch: > > == > > 1. > > I have a fundamental question about this patch. > > IIUC the purpose of this patch is to ensure that (when > publish_via_root = true) the copy of the partition data will happen > only once (e.g. from one parent table on one of the publishers). But I > think there is no guarantee that those 2 publishers even had the same > data, right? Therefore it seems to me you could get different results > if the data were copied from pub1 or from pub2. (I have not tried it - > this is just my suspicion). > > Am I correct or mistaken? Since the subscribed publishers are combined with OR and are from the same database, and we are trying to copy the data from the topmost parent table, I think the results should be as expected. Best regards, Hou zj
Re: Question about user/database-level parameters
Japin Li wrote: > However, if the database is in production, we cannot go into single-user > mode, should we provide an option to change this behavior on the fly? It already exists, through PGOPTIONS, which appears to work for local_preload_libraries, in a quick test. That is, you can log in by invoking psql with: PGOPTIONS="-c local_preload_libraries=" and issue the ALTER USER to reset things back to normal without stopping the instance. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
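Spelled out with made-up role and database names, that recovery path would be something like:

$ PGOPTIONS="-c local_preload_libraries=" psql -d app_db -U postgres

app_db=# ALTER ROLE app_user RESET local_preload_libraries;
app_db=# -- or, if the setting was attached to the database instead:
app_db=# ALTER DATABASE app_db RESET local_preload_libraries;

The per-session override gets you past the failing library load, and the ALTER then removes the bad per-role or per-database setting for future sessions.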
Re: making relfilenodes 56 bits
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas wrote: > One solution to all this is to do as Dilip proposes here: for system > relations, keep assigning the OID as the initial relfilenumber. > Actually, we really only need to do this for pg_largeobject; all the > other relfilenumber values could be assigned from a counter, as long > as they're assigned from a range distinct from what we use for user > relations. > > But I don't really like that, because I feel like the whole thing > where we start out with relfilenumber=oid is a recipe for hidden bugs. > I believe we'd be better off if we decouple those concepts more > thoroughly. So here's another idea: what if we set the > next-relfilenumber counter for the new cluster to the value from the > old cluster, and then rewrote all the (thus-far-empty) system tables? You mean in a new cluster start the next-relfilenumber counter from the highest relfilenode/Oid value in the old cluster right?. Yeah, if we start next-relfilenumber after the range of the old cluster then we can also avoid the logic of SetNextRelFileNumber() during upgrade. My very initial idea around this was to start the next-relfilenumber directly from the 4 billion in the new cluster so there can not be any conflict and we don't even need to identify the highest value of used relfilenode in the old cluster. In fact we don't need to rewrite the system table before upgrading I think. So what do we lose with this? just 4 billion relfilenode? does that really matter provided the range we get with the 56 bits relfilenumber. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
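For a sense of scale (simple arithmetic, not something stated in the thread): reserving the entire old OID range would give up at most 2^32 ≈ 4.3 billion values out of a 56-bit space of 2^56 ≈ 7.2 * 10^16, i.e. 2^32 / 2^56 = 2^-24, roughly 0.000006% of the available relfilenumbers.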
Re: Smoothing the subtrans performance catastrophe
On Wed, 3 Aug 2022 at 20:18, Andres Freund wrote: > On 2022-08-01 17:42:49 +0100, Simon Riggs wrote: > > The reason for the slowdown is clear: when we overflow we check every > > xid against subtrans, producing a large stream of lookups. Some > > previous hackers have tried to speed up subtrans - this patch takes a > > different approach: remove as many subtrans lookups as possible. (So > > is not competing with those other solutions). > > > > Attached patch improves on the situation, as also shown in the attached > > diagram. > > I think we should consider redesigning subtrans more substantially - even with > the changes you propose here, there's still plenty ways to hit really bad > performance. And there's only so much we can do about that without more > fundamental design changes. I completely agree - you will be glad to hear that I've been working on a redesign of the subtrans module. But we should be clear that redesigning subtrans has nothing to do with this patch; they are separate ideas and this patch relates to XidInMVCCSnapshot(), an important caller of subtrans. I will post my patch, when complete, in a different thread. > One way to fix a lot of the issues around pg_subtrans would be remove the > pg_subtrans SLRU and replace it with a purely in-memory hashtable. IMO there's > really no good reason to use an SLRU for it (anymore). > > In contrast to e.g. clog or multixact we don't need to access a lot of old > entries, we don't need persistency etc. Nor is it a good use of memory and IO > to have loads of pg_subtrans pages that don't point anywhere, because the xid > is just a "normal" xid. > > While we can't put a useful hard cap on the number of potential subtrans > entries (we can only throw subxid->parent mappings away once no existing > snapshot might need them), saying that there can't be more subxids "considered > running" at a time than can fit in memory doesn't seem like a particularly > problematic restriction. I do agree that sometimes it is easier to impose restrictions than to try to provide unbounded resources. Having said that, I can't see an easy way of making that work well in practice for this case. Making write transactions just suddenly stop working at some point doesn't sound like it would be good for availability, especially when it happens sporadically and unpredictably as that would, whenever long running transactions appear alongside users of subtransactions. > So, why don't we use a dshash table with some amount of statically allocated > memory for the mapping? In common cases that will *reduce* memory usage > (because we don't need to reserve space for [as many] subxids in snapshots / > procarray anymore) and IO (no mostly-zeroes pg_subtrans). I considered this and have ruled it out, but as I said above, we can discuss that on a different thread. -- Simon Riggshttp://www.EnterpriseDB.com/
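For reference, the quoted memory estimate corresponds to an entry shaped roughly like this (a sketch only; the actual redesign Simon mentions belongs to the separate thread he intends to start):

/* hypothetical dshash entry: subtransaction xid -> immediate parent xid */
typedef struct SubxidParentEntry
{
    TransactionId subxid;       /* hash key */
    TransactionId parent;       /* parent (sub)transaction id */
} SubxidParentEntry;

Two TransactionIds are 8 bytes, so "(2 * 4 bytes) * subxids * (2 - fillfactor)" at a 0.75 fill factor works out to about 8 B * 1,000,000 * 1.25 ≈ 10 MB for a million subxids that might still be considered running by some snapshot.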
Re: enable/disable broken for statement triggers on partitioned tables
Another point for backpatch: EnableDisableTrigger() changes API, which is potentially not good. In backbranches I'll keep the function unchanged and add another function with the added argument, EnableDisableTriggerNew(). So extensions that want to be compatible with both old and current versions (assuming any users of that function exist out of core; I didn't find any) could do something like #if PG_VERSION_NUM <= 16 EnableDisableTriggerNew( all args ) #else EnableDisableTrigger( all args ) #endif and otherwise they're compatible as compiled today. Since there are no known users of this interface, it doesn't seem to warrant any more convenient treatment. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Those who use electric razors are infidels destined to burn in hell while we drink from rivers of beer, download free vids and mingle with naked well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)
Re: Handle infinite recursion in logical replication setup
On Tue, Aug 2, 2022 at 8:59 PM Amit Kapila wrote: > > On Tue, Jul 26, 2022 at 9:07 AM Amit Kapila wrote: > > > > On Tue, Jul 26, 2022 at 7:13 AM Jonathan S. Katz > > wrote: > > > > > > Thanks for the example. I agree that it is fairly simple to reproduce. > > > > > > I understand that "copy_data = force" is meant to protect a user from > > > hurting themself. I'm not convinced that this is the best way to do so. > > > > > > For example today I can subscribe to multiple publications that write to > > > the same table. If I have a primary key on that table, and two of the > > > subscriptions try to write an identical ID, we conflict. We don't have > > > any special flags or modes to guard against that from happening, though > > > we do have documentation on conflicts and managing them. > > > > > > AFAICT the same issue with "copy_data" also exists in the above scenario > > > too, even without the "origin" attribute. > > > > > > > That's true but there is no parameter like origin = NONE which > > indicates that constraint violations or duplicate data problems won't > > occur due to replication. In the current case, I think the situation > > is different because a user has specifically asked not to replicate > > any remote data by specifying origin = NONE, which should be dealt > > differently. Note that current users or their setup won't see any > > difference/change unless they specify the new parameter origin as > > NONE. > > > > Let me try to summarize the discussion so that it is easier for others > to follow. The work in this thread is to avoid loops, and duplicate > data in logical replication when the operations happened on the same > table in multiple nodes. It has been explained in email [1] with an > example of how a logical replication setup can lead to duplicate or > inconsistent data. > > The idea to solve this problem is that we don't replicate data that is > not generated locally which we can normally identify based on origin > of data in WAL. The commit 366283961a achieves that for replication > but still the problem can happen during initial sync which is > performed internally via copy. We can't differentiate the data in heap > based on origin. So, we decided to prohibit the subscription > operations that can perform initial sync (ex. Create Subscription, > Alter Subscription ... Refresh) by detecting that the publisher has > subscribed to the same table from some other publisher. > > To prohibit the subscription operations, the currently proposed patch > throws an error. Then, it also provides a new copy_data option > 'force' under which the user will still be able to perform the > operation. This could be useful when the user intentionally wants to > replicate the initial data even if it contains data from multiple > nodes (for example, when in a multi-node setup, one decides to get the > initial data from just one node and then allow replication of data to > proceed from each of respective nodes). > > The other alternative discussed was to just give a warning for > subscription operations and probably document the steps for users to > avoid it. But the problem with that is once the user sees this > warning, it won't be able to do anything except recreate the setup, so > why not give an error in the first place? > > Thoughts? Thank you for the summary! I understand that this feature could help some cases, but I'm really not sure adding new value 'force' for them is worthwhile, and concerned it could reduce the usability. 
IIUC this feature would work only when origin = 'none' and copy_data = 'on', and the purpose is to prevent the data from being duplicated/conflicted by the initial table sync. But there are cases where duplication/conflict doesn't happen even if origin = 'none' and copy_data = 'on'. For instance, the table on the publisher might be empty. Also, even with origin = 'any', copy_data = 'on', there is a possibility of data duplication/conflict. Why do we need to address only the case where origin = 'none'? I think that using origin = 'none' doesn't necessarily mean using bi-directional (or N-way) replication. Even when using uni-directional logical replication with two nodes, they may use origin = 'none'. Therefore, it seems to me that this feature works only for a narrow situation and has false positives. Since it has been the user's responsibility not to try to make the data inconsistent by the initial table sync, I think that it might be sufficient if we note the risk in the documentation. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
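For concreteness, the setup being debated looks roughly like this (node and object names are made up; the origin parameter exists as of commit 366283961a, while copy_data = force is only what this thread proposes):

-- On node2: node1 already has its own subscription feeding the same table,
-- so a plain initial sync from node1 could also copy remotely-originated rows.
CREATE SUBSCRIPTION sub_from_node1
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub_node1
    WITH (origin = none, copy_data = force);

The question raised above is whether that is worth a new copy_data value, given the cases where no duplication would occur anyway (for example, an empty table on the publisher) and the origin = any setups that the check would not cover.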
Re: Smoothing the subtrans performance catastrophe
On Wed, Aug 3, 2022 at 4:14 PM Andres Freund wrote: > I don't think this scenario would fundamentally change - we already keep the > set of subxids in backend local memory (e.g. either a dedicated > TransactionStateData or an element in ->childXids) and in the locking table > (via XactLockTableInsert()). Sure > The problematic part isn't keeping "actually" running subxids in memory, but > keeping subxids that might be "considered running" in memory (i.e. subxids > that are considered running by an old snapshot in another backend). > > A hashtable just containing child->parent mapping for subxids doesn't actually > need that much memory. It'd be approximately (2 * 4 bytes) * subxids * > (2-fillfactor) or such? So maybe ~10MB for 1 milllion subxids? Allocating > that on-demand doesn't strike me as prohibitive. I mean the worst case is ~2 bn, no? -- Robert Haas EDB: http://www.enterprisedb.com
ERREUR: cache lookup failed for function 0 with PostgreSQL 15 beta 2, no error with PostgreSQL 14.4
Hi, A DSS developer from my company, Julien Roze, reported to me an error I cannot explain. Is it a new behavior or a bug? Original query is much more complicated but here is a simplified test case with postgresql 14 and 15 beta 2 on Debian 11, packages from pgdg : Ver Cluster Port Status Owner Data directory Log file 14 main 5432 online postgres /var/lib/postgresql/14/main /var/log/postgresql/postgresql-14-main.log 15 main 5433 online postgres /var/lib/postgresql/15/main /var/log/postgresql/postgresql-15-main.log psql -p 5432 select version(); version - PostgreSQL 14.4 (Debian 14.4-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit (1 ligne) with fakedata as ( select 'hello' word union all select 'world' word ) select * from ( select word, count(*) over (partition by word) nb from fakedata ) t where nb = 1; word | nb ---+ hello | 1 world | 1 (2 lignes) with fakedata as ( select 'hello' word union all select 'world' word ) select * from ( select word, count(*) nb from fakedata group by word ) t where nb = 1; word | nb ---+ hello | 1 world | 1 (2 lignes) psql -p 5433 select version(); version PostgreSQL 15beta2 (Debian 15~beta2-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit (1 ligne) with fakedata as ( select 'hello' word union all select 'world' word ) select * from ( select word, count(*) over (partition by word) nb from fakedata ) t where nb = 1; ERREUR: cache lookup failed for function 0 with fakedata as ( select 'hello' word union all select 'world' word ) select * from ( select word, count(*) nb from fakedata group by word ) t where nb = 1; word | nb ---+ hello | 1 world | 1 (2 lignes) Best regards, Phil
Re: enable/disable broken for statement triggers on partitioned tables
On Thu, Aug 4, 2022 at 9:56 PM Alvaro Herrera wrote: > Another point for backpatch: EnableDisableTrigger() changes API, which > is potentially not good. In backbranches I'll keep the function > unchanged and add another function with the added argument, > EnableDisableTriggerNew(). +1 > So extensions that want to be compatible with both old and current > versions (assuming any users of that function exist out of core; I > didn't find any) could do something like > > #if PG_VERSION_NUM <= 16 > EnableDisableTriggerNew( all args ) > #else > EnableDisableTrigger( all args ) > #endif > > and otherwise they're compatible as compiled today. > > Since there are no known users of this interface, it doesn't seem to > warrant any more convenient treatment. Makes sense. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: ERREUR: cache lookup failed for function 0 with PostgreSQL 15 beta 2, no error with PostgreSQL 14.4
On Thu, Aug 04, 2022 at 01:19:59PM +, Phil Florent wrote:
> A DSS developer from my company, Julien Roze, reported me an error I cannot
> explained. Is it a new behavior or a bug ?
>
> Original query is much more complicated but here is a simplified test case
> with postgresql 14 and 15 beta 2 on Debian 11, packages from pgdg :

Thanks for simplifying and reporting it. It looks like an issue with window run conditions (commit 9d9c02ccd).

+David

(gdb) b pg_re_throw
(gdb) bt
#0  pg_re_throw () at elog.c:1795
#1  0x557c85645e69 in errfinish (filename=, filename@entry=0x557c858db7da "fmgr.c", lineno=lineno@entry=183, funcname=funcname@entry=0x557c858dc410 <__func__.24841> "fmgr_info_cxt_security") at elog.c:588
#2  0x557c85650e21 in fmgr_info_cxt_security (functionId=functionId@entry=0, finfo=finfo@entry=0x557c86a05ad0, mcxt=, ignore_security=ignore_security@entry=false) at fmgr.c:183
#3  0x557c85651284 in fmgr_info (functionId=functionId@entry=0, finfo=finfo@entry=0x557c86a05ad0) at fmgr.c:128
#4  0x557c84b32c73 in ExecInitFunc (scratch=scratch@entry=0x7ffc369a9cf0, node=node@entry=0x557c869f59b8, args=0x557c869f5a68, funcid=funcid@entry=0, inputcollid=inputcollid@entry=0, state=state@entry=0x557c86a05620) at execExpr.c:2748
#5  0x557c84b27904 in ExecInitExprRec (node=node@entry=0x557c869f59b8, state=state@entry=0x557c86a05620, resv=resv@entry=0x557c86a05628, resnull=resnull@entry=0x557c86a05625) at execExpr.c:1147
#6  0x557c84b33a1d in ExecInitQual (qual=0x557c869f5b18, parent=parent@entry=0x557c86a05080) at execExpr.c:253
#7  0x557c84c8eadb in ExecInitWindowAgg (node=node@entry=0x557c869f4d20, estate=estate@entry=0x557c86a04e10, eflags=eflags@entry=16) at nodeWindowAgg.c:2420
#8  0x557c84b8edda in ExecInitNode (node=node@entry=0x557c869f4d20, estate=estate@entry=0x557c86a04e10, eflags=eflags@entry=16) at execProcnode.c:345
#9  0x557c84b70ea2 in InitPlan (queryDesc=queryDesc@entry=0x557c8695af50, eflags=eflags@entry=16) at execMain.c:938
#10 0x557c84b71658 in standard_ExecutorStart (queryDesc=queryDesc@entry=0x557c8695af50, eflags=16, eflags@entry=0) at execMain.c:265
#11 0x557c84b71ca4 in ExecutorStart (queryDesc=queryDesc@entry=0x557c8695af50, eflags=0) at execMain.c:144
#12 0x557c8525292b in PortalStart (portal=portal@entry=0x557c869a45e0, params=params@entry=0x0, eflags=eflags@entry=0, snapshot=snapshot@entry=0x0) at pquery.c:517
#13 0x557c8524b2a4 in exec_simple_query ( query_string=query_string@entry=0x557c86938af0 "with fakedata as (\n", ' ' , "select 'hello' word\n", ' ' , "union all\n", ' ' , "select 'world' word\n)\nselect *\nfrom (\n", ' ' , "select word, count(*) over (partition by word) nb fro"...) at postgres.c:1204
#14 0x557c8524e8bd in PostgresMain (dbname=, username=username@entry=0x557c86964298 "pryzbyj") at postgres.c:4505
#15 0x557c85042db6 in BackendRun (port=port@entry=0x557c8695a910) at postmaster.c:4490
#16 0x557c8504a79a in BackendStartup (port=port@entry=0x557c8695a910) at postmaster.c:4218
#17 0x557c8504ae12 in ServerLoop () at postmaster.c:1808
#18 0x557c8504c926 in PostmasterMain (argc=3, argv=) at postmaster.c:1480
#19 0x557c84ce4209 in main (argc=3, argv=0x557c86933000) at main.c:197

(gdb) fr 7
#7  0x557c84c8eadb in ExecInitWindowAgg (node=node@entry=0x557c869f4d20, estate=estate@entry=0x557c86a04e10, eflags=eflags@entry=16) at nodeWindowAgg.c:2420
2420            winstate->runcondition = ExecInitQual(node->runCondition,

-- Justin
Re: fix typos
John Naylor writes: > On Tue, Aug 2, 2022 at 1:11 AM Justin Pryzby wrote: > ereport(ERROR, > (errcode(ERRCODE_OBJECT_IN_USE), > - errmsg("could not drop replication origin with OID %d, in use by PID %d", > + errmsg("could not drop replication origin with OID %u, in use by PID %d", > RepOriginId is a typedef for uint16, so this can't print the wrong answer, > but it is inconsistent with other uses. So it seems we don't need to > backpatch this one? Um ... if it's int16, then it can't be an OID, so I'd say this message has far worse problems than %d vs %u. It should not use that terminology. regards, tom lane
Re: Introduce wait_for_subscription_sync for TAP tests
Masahiko Sawada writes: > Yes. I've attached patches for backbranches. FWIW, I'd recommend waiting till after next week's wrap before pushing these. While I'm definitely in favor of doing this, the odds of introducing a bug are nonzero, so right before a release deadline doesn't seem like a good time. regards, tom lane
Re: pg15b2: large objects lost on upgrade
On 8/3/22 4:19 PM, Tom Lane wrote: "Jonathan S. Katz" writes: I did rule out wanting to do the "xid + $X" check after reviewing some of the output. I think that both $X could end up varying, and it really feels like a bandaid. It is that. I wouldn't feel comfortable with $X less than 100 or so, which is probably sloppy enough to draw Robert's ire. Still, realizing that what we want right now is a band-aid for 15beta3, I don't think it's an unreasonable short-term option. Attached is the "band-aid / sloppy" version of the patch. Given from the test examples I kept seeing deltas over 100 for relfrozenxid, I chose 1000. The mxid delta was less, but I kept it at 1000 for consistency (and because I hope this test is short lived in this state), but can be talked into otherwise. Andres suggested upthread using "txid_current()" -- for the comparison, that's one thing I looked at. Would any of the XID info from "pg_control_checkpoint()" also serve for this test? I like the idea of txid_current(), but we have no comparable function for mxid do we? While you could get both numbers from pg_control_checkpoint(), I doubt that's sufficiently up-to-date. ...unless we force a checkpoint in the test?

Jonathan

diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4cbc75644c..ced889601f 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -226,6 +226,8 @@ ORDER BY c.oid::regclass::text
 EOM
 $horizon_query =~ s/\s+/ /g; # run it together on one line
 my $horizon1 = $oldnode->safe_psql('regression', $horizon_query);
+chomp($horizon1);
+my @horizon1_values = split(/\|/, $horizon1);
 
 # In a VPATH build, we'll be started in the source directory, but we want
 # to run pg_upgrade in the build directory so that any files generated finish
@@ -313,6 +315,8 @@ $newnode->command_ok(
 
 # And record the horizons from the upgraded cluster as well.
 my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
+chomp($horizon2);
+my @horizon2_values = split(/\|/, $horizon2);
 
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
@@ -331,8 +335,13 @@ if ($compare_res != 0)
 	print "=== EOF ===\n";
 }
 
-# Compare the horizons, there should be no differences.
-my $horizons_ok = $horizon1 eq $horizon2;
+# Compare the horizons. There should be no change in the OID, so we will test
+# for equality. However, because autovacuum may have run between the two
+# horizions, we will check if the updated horizon XIDs are greater than or
+# equal to the originals.
+my $horizons_ok = $horizon2_values[0] eq $horizon1_values[0] &&
+	int($horizon2_values[1]) <= int($horizon1_values[1]) + 1000 &&
+	int($horizon2_values[2]) <= int($horizon1_values[2]) + 1000;
 ok($horizons_ok, 'old and new horizons match after pg_upgrade');
 
 # Provide more context if the horizons do not match.
Re: pg15b2: large objects lost on upgrade
On Wed, Aug 3, 2022 at 7:19 PM Tom Lane wrote: > "Jonathan S. Katz" writes: > > I did rule out wanting to do the "xid + $X" check after reviewing some > > of the output. I think that both $X could end up varying, and it really > > feels like a bandaid. > > It is that. I wouldn't feel comfortable with $X less than 100 or so, > which is probably sloppy enough to draw Robert's ire. Still, realizing > that what we want right now is a band-aid for 15beta3, I don't think > it's an unreasonable short-term option. 100 << 2^32, so it's not terrible, but I'm honestly coming around to the view that we ought to just nuke this test case. >From my point of view, the assertion that disabling autovacuum during this test case would make the test case useless seems to be incorrect. The original purpose of the test was to make sure that the pre-upgrade schema matched the post-upgrade schema. If having autovacuum running or not running affects that, we have a serious problem, but this test case isn't especially likely to find it, because whether autovacuum runs or not during the brief window where the test is running is totally unpredictable. Furthermore, if we do have such a problem, it would probably indicate that vacuum is using the wrong horizons to prune or test the visibility of the tuples. To find that out, we might want to compare values upon which the behavior of vacuum might depend, like relfrozenxid. But to do that, we have to disable autovacuum, so that the value can't change under us. From my point of view, that's making test coverage better, not worse, because any bugs in this area that can be found without explicit testing of relevant horizons are dependent on low-probability race conditions materializing in the buildfarm. If we disable autovacuum and then compare relfrozenxid and whatever else we care about explicitly, we can find bugs in that category reliably. However, if people don't accept that argument, then this new test case is kind of silly. It's not the worst idea in the world to use a threshold of 100 XIDs or something, but without disabling autovacuum, we're basically comparing two things that can't be expected to be equal, so we test and see if they're approximately equal and then call that good enough. I don't know that I believe we'll ever find a bug that way, though. -- Robert Haas EDB: http://www.enterprisedb.com enable-autovacuum-later.patch Description: Binary data
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 10:02 AM Jonathan S. Katz wrote: > Attached is the "band-aid / sloppy" version of the patch. Given from the > test examples I kept seeing deltas over 100 for relfrozenxid, I chose > 1000. The mxid delta was less, but I kept it at 1000 for consistency > (and because I hope this test is short lived in this state), but can be > talked into otherwise. ISTM that you'd need to loop over the rows and do this for each row. Otherwise I think you're just comparing results for the first relation and ignoring all the rest. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
"Jonathan S. Katz" writes: > On 8/3/22 4:19 PM, Tom Lane wrote: >> I like the idea of txid_current(), but we have no comparable >> function for mxid do we? While you could get both numbers from >> pg_control_checkpoint(), I doubt that's sufficiently up-to-date. > ...unless we force a checkpoint in the test? Hmm ... maybe if you take a snapshot and hold that open while forcing the checkpoint and doing the subsequent checks. That seems messy though. Also, while that should serve to hold back global xmin, I'm not at all clear on whether that has a similar effect on minmxid. regards, tom lane
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote: > On 8/4/22 04:06, Kyotaro Horiguchi wrote: > >At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch wrote in > I think in this case a HINT might be sufficient to at least keep people > from > wasting time tracking down a problem that has already been fixed. > >> > >>Here's a patch to add that HINT. I figure it's better to do this before > >>next > >>week's minor releases. In the absence of objections, I'll push this around > >>2022-08-05 14:00 UTC. > > > >+1 > > Looks good to me as well. Thanks for reviewing. > However, there is another issue [1] that might argue for a back patch if > this patch (as I believe) would fix the issue. > >>> > [1] > https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com > >>> > >>>That makes sense. Each iteration of the restartpoint recycle loop has a > >>>1/N > >>>chance of failing. Recovery adds >N files between restartpoints. Hence, > >>>the > >>>WAL directory grows without bound. Is that roughly the theory in mind? > >> > >>On further reflection, I don't expect it to happen that way. Each failure > >>message is LOG-level, so the remaining recycles still happen. (The race > >>condition can yield an ERROR under PreallocXlogFiles(), but recycling is > >>already done at that point.) > > > >I agree to this. > > Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing > without bound. Even if we are not sure what is causing it, how confident are > we that the patches applied to v15 would fix it? I'll say 7%; certainly too low to stop investigating. The next things I'd want to see are: - select name, setting, source from pg_settings where setting <> boot_val; - log_checkpoints log entries, and other log entries having the same PID If the theory about checkpoint skipping holds, there should be a period where the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints begin and 0 restartpoints complete.
Re: pg15b2: large objects lost on upgrade
Robert Haas writes: > 100 << 2^32, so it's not terrible, but I'm honestly coming around to > the view that we ought to just nuke this test case. I'd hesitated to suggest that, but I think that's a fine solution. Especially since we can always put it back in later if we think of a more robust way. regards, tom lane
Re: [doc] fix a potential grammer mistake
> On 4 Aug 2022, at 00:44, Junwang Zhao wrote: > Attachment is a patch with the "just" removed. I think this is a change for better, so I've pushed it. Thanks for the contribution! -- Daniel Gustafsson https://vmware.com/
Re: Question about user/database-level parameters
On Thu, 04 Aug 2022 at 19:29, Daniel Verite wrote: > Japin Li wrote: > >> However, if the database is in production, we cannot go into single-user >> mode, should we provide an option to change this behavior on the fly? > > It already exists, through PGOPTIONS, which appears to work > for local_preload_libraries, in a quick test. > > That is, you can log in by invoking psql with: > PGOPTIONS="-c local_preload_libraries=" > and issue the ALTER USER to reset things back to normal > without stopping the instance. Oh, great! Thank you very much! -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Use fadvise in wal replay
> On 18 Jul 2022, at 22:55, Robert Haas wrote: > > On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak wrote: >> Cool. As for GUC I'm afraid there's going to be resistance of adding yet >> another GUC (to avoid many knobs). Ideally it would be nice if we had some >> advanced/deep/hidden parameters , but there isn't such thing. >> Maybe another option would be to use (N * maintenance_io_concurrency * >> XLOG_BLCKSZ), so N=1 that's 80kB and N=2 160kB (pretty close to default >> value, and still can be tweaked by enduser). Let's wait what others say? > > I don't think adding more parameters is a problem intrinsically. A > good question to ask, though, is how the user is supposed to know what > value they should configure. If we don't have any idea what value is > likely to be optimal, odds are users won't either. We know that 128KB is optimal on some representative configuration and that changing the value won't really affect performance much. 128KB is marginally better than 8KB and removes some theoretical concerns about extra syscalls. > It's not very clear to me that we have any kind of agreement on what > the basic approach should be here, though. Actually, the only question is the offset from the current read position: should it be 1 block or a full readahead chunk. Again, this does not change anything, just a matter of choice. Thanks! Best regards, Andrey Borodin.
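For readers unfamiliar with the mechanism under discussion: the proposal boils down to issuing posix_fadvise(POSIX_FADV_WILLNEED) for WAL data slightly ahead of the recovery read position, so the kernel starts readahead before the blocks are actually needed. The sketch below is only an illustration of that call, not the patch itself; the file name, the fixed 128KB window, and the prefetch_wal() helper are assumptions made for the example.

#define _XOPEN_SOURCE 600           /* for posix_fadvise() */

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define PREFETCH_BYTES (128 * 1024) /* the 128KB window discussed above */

/*
 * Hint the kernel that the next PREFETCH_BYTES after 'readpos' will be
 * needed soon.  Errors are ignored: the advice is purely an optimization.
 */
static void
prefetch_wal(int fd, off_t readpos)
{
	(void) posix_fadvise(fd, readpos, PREFETCH_BYTES, POSIX_FADV_WILLNEED);
}

int
main(void)
{
	/* Hypothetical path to a WAL segment; adjust for a real system. */
	int			fd = open("pg_wal/000000010000000000000001", O_RDONLY);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	prefetch_wal(fd, 0);			/* prefetch the first 128KB */
	close(fd);
	return 0;
}

Whether the advice should start one block or a full readahead chunk past the current read position is exactly the remaining choice mentioned above.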
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 10:26 AM Tom Lane wrote: > Robert Haas writes: > > 100 << 2^32, so it's not terrible, but I'm honestly coming around to > > the view that we ought to just nuke this test case. > > I'd hesitated to suggest that, but I think that's a fine solution. > Especially since we can always put it back in later if we think > of a more robust way. IMHO it's 100% clear how to make it robust. If you want to check that two values are the same, you can't let one of them be overwritten by an unrelated event in the middle of the check. There are many specific things we could do here, a few of which I proposed in my previous email, but they all boil down to "don't let autovacuum screw up the results". But if you don't want to do that, and you also don't want to have random failures, the only alternatives are weakening the check and removing the test. It's kind of hard to say which is better, but I'm inclined to think that if we just weaken the test we're going to think we've got coverage for this kind of problem when we really don't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
Hi, On 2022-08-04 12:43:49 -0400, Robert Haas wrote: > On Thu, Aug 4, 2022 at 10:26 AM Tom Lane wrote: > > Robert Haas writes: > > > 100 << 2^32, so it's not terrible, but I'm honestly coming around to > > > the view that we ought to just nuke this test case. > > > > I'd hesitated to suggest that, but I think that's a fine solution. > > Especially since we can always put it back in later if we think > > of a more robust way. > > IMHO it's 100% clear how to make it robust. If you want to check that > two values are the same, you can't let one of them be overwritten by > an unrelated event in the middle of the check. There are many specific > things we could do here, a few of which I proposed in my previous > email, but they all boil down to "don't let autovacuum screw up the > results". > > But if you don't want to do that, and you also don't want to have > random failures, the only alternatives are weakening the check and > removing the test. It's kind of hard to say which is better, but I'm > inclined to think that if we just weaken the test we're going to think > we've got coverage for this kind of problem when we really don't. Why do you think it's better to not have the test than to have a very limited amount of fuzziness (by using the next xid as an upper limit)? What's the bug that will reliably pass the nextxid fuzzy comparison, but not an exact comparison? Greetings, Andres Freund
Re: Patch to address creation of PgStat* contexts with null parent context
On Fri, 2022-07-29 at 11:53 +0900, Kyotaro Horiguchi wrote: > > That makes the memorycontext-tree structure unstable because > CacheMemoryContext can be created on-the-fly. > > Honestly I don't like to call CreateCacheMemoryContext in the two > functions on-the-fly. Since every process that calls > pgstat_initialize() necessarily calls pgstat_setup_memcxt() at latest > at process termination, I think we can create at least > CacheMemoryContext in pgstat_initialize(). Attached is a patch creating CacheMemoryContext() in pgstat_initialize() rather than the two previous patch locations. > Or couldn't we create the > all three contexts in the function, instead of calling > pgstat_setup_memcxt() on-the fly? You note that that pgstat_setup_memcxt() is called at latest at process termination -- was the intent to hold off on requesting memory for these two contexts until it was needed? > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index ffd096405e..b5b69ead2b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -532,6 +532,8 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
+	CreateCacheMemoryContext();
+
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
Re: pg15b2: large objects lost on upgrade
Robert Haas writes: > IMHO it's 100% clear how to make it robust. If you want to check that > two values are the same, you can't let one of them be overwritten by > an unrelated event in the middle of the check. There are many specific > things we could do here, a few of which I proposed in my previous > email, but they all boil down to "don't let autovacuum screw up the > results". It doesn't really matter how robust a test case is, if it isn't testing the thing you need to have tested. So I remain unwilling to disable autovac in a way that won't match real-world usage. Note that the patch you proposed at [1] will not fix anything. It turns off autovac in the new node, but the buildfarm failures we've seen appear to be due to autovac running on the old node. (I believe that autovac in the new node is *also* a hazard, but it seems to be a lot less of one, presumably because of timing considerations.) To make it work, we'd have to shut off autovac in the old node before starting pg_upgrade, and that would make it unacceptably (IMHO) different from what real users will do. Conceivably, we could move all of this processing into pg_upgrade itself --- autovac disable/re-enable and capturing of the horizon data --- and that would address my complaint. I don't really want to go there though, especially when in the final analysis IT IS NOT A BUG if a rel's horizons advance a bit during pg_upgrade. It's only a bug if they become inconsistent with the rel's data, which is not what this test is testing for. regards, tom lane [1] https://www.postgresql.org/message-id/CA%2BTgmoZkBcMi%2BNikxfc54dgkWj41Q%3DZ4nuyHpheTcxA-qfS5Qg%40mail.gmail.com
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 9:44 AM Robert Haas wrote: > But if you don't want to do that, and you also don't want to have > random failures, the only alternatives are weakening the check and > removing the test. It's kind of hard to say which is better, but I'm > inclined to think that if we just weaken the test we're going to think > we've got coverage for this kind of problem when we really don't. Perhaps amcheck's verify_heapam() function can be used here. What could be better than exhaustively verifying that the relfrozenxid (and relminmxid) invariants hold for every single tuple in the table? Those are the exact conditions that we care about, as far as relfrozenxid/relminmxid goes. My sense is that that has a much better chance of detecting a real bug at some point. This approach is arguably an example of property-based testing. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
Peter Geoghegan writes: > Perhaps amcheck's verify_heapam() function can be used here. What > could be better than exhaustively verifying that the relfrozenxid (and > relminmxid) invariants hold for every single tuple in the table? How much will that add to the test's runtime? I could get behind this idea if it's not exorbitantly expensive. regards, tom lane
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 11:07 AM Tom Lane wrote: > How much will that add to the test's runtime? I could get behind this > idea if it's not exorbitantly expensive. I'm not sure offhand, but I suspect it wouldn't be too bad. -- Peter Geoghegan
Re: [PATCH] CF app: add "Returned: Needs more interest"
Hi Andres, My intention had not quite been for this to be a referendum on the decision for every patch -- we can do that if it helps, but I don't think we necessarily have to have unanimity on the bucketing for every patch in order for the new state to be useful. On 8/3/22 12:46, Andres Freund wrote: >> - https://commitfest.postgresql.org/38/2482/ > > Hm - "Returned: Needs more interest" doesn't seem like it'd have been more > descriptive? It was split off a patchset that was committed at the tail end of > 15 (and which still has *severe* code quality issues). Imo having a CF entry > before the rest of the jsonpath stuff made it in doesn't seem like a good > idea There were no comments about code quality issues on the thread that I can see, and there were three people who independently said "I don't know why this isn't getting review." Seems like a shoe-in for "needs more interest". >> - https://commitfest.postgresql.org/38/3338/ > > Here it'd have fit. Okay. That's one. >> - https://commitfest.postgresql.org/38/3181/ > > FWIW, I mentioned at least once that I didn't think this was worth pursuing. (I don't see that comment on that thread? You mentioned it needed a rebase.) IMO, mentioning that something is not worth pursuing is not actionable feedback. It's a declaration of non-interest in the mildest case, and a Rejection in the strongest case. But let's please not say "meh" and then Return with Feedback; an author can't do anything with that. >> - https://commitfest.postgresql.org/38/2918/ > > Hm, certainly not a lot of review activity. That's two. >> - https://commitfest.postgresql.org/38/2710/ > > A good bit of this was committed in some form with a decent amount of review > activity for a while. But then the rest of it stalled. Something has to be done with the open entry. >> - https://commitfest.postgresql.org/38/2266/ (this one was particularly >> miscommunicated during the first RwF) > > I'd say misunderstanding than miscommunication... The CFM sending it said, "It seems there has been no activity since last version of the patch so I don't think RwF is correct" [1], and then the email sent said "you are encouraged to send a new patch [...] with the suggested changes." But there were no suggested changes left to make. This really highlights, for me, why the two states should not be combined into one. > It seems partially stalled due to the potential better approach based on > https://www.postgresql.org/message-id/flat/15848.1576515643%40sss.pgh.pa.us ? > In which case RwF doesn't seem to inappropriate. Those comments are, as far as I can tell, not in the thread. (And the new thread you linked is also stalled.) >> - https://commitfest.postgresql.org/38/2218/ > > Yep. That's three. >> - https://commitfest.postgresql.org/38/3256/ > > Yep. That's four. >> - https://commitfest.postgresql.org/38/3310/ > > I don't really understand why this has been RwF'd, doesn't seem that long > since the last review leading to changes. Eight months without feedback, when we expect authors to turn around a patch in two weeks or less to avoid being RwF'd, is a long time IMHO. I don't think a patch should sit motionless in CF for eight months; it's not at all fair to the author. >> - https://commitfest.postgresql.org/38/3050/ > > Given that a non-author did a revision of the patch, listed a number of TODO > items and said "I'll create regression tests firstly." - I don't think "lacks > interest" would have been appropriate, and RwF is? 
That was six months ago, and prior to that there was another six month silence. I'd say that lacks interest, and I don't feel like it's currently reviewable in CF. >> (Even if they'd all received skeptical feedback, if the author replies in >> good faith and is met with silence for months, we need to not keep stringing >> them along.) > > I agree very much with that - just am doubtful that "lacks interest" is a good > way of dealing with it, unless we just want to treat it as a nicer sounding > "rejected". Tom summed up my position well: there's a difference between those two that is both meaningful and actionable for contributors. Is there an alternative you'd prefer? Thanks for the discussion! --Jacob [1] https://www.postgresql.org/message-id/20211004071249.GA6304%40ahch-to
Re: Clarifying Commitfest policies
On Wed, Aug 3, 2022 at 2:05 PM Matthias van de Meent wrote: > On Wed, 3 Aug 2022 at 20:04, Jacob Champion wrote: > > Is that enough, or should we do more? > > "The CF Checklist" seems to refer to only the page that is (or seems > to be) intended for the CFM only. We should probably also update the > pages of "Commitfest", "Submitting a patch", "Reviewing a Patch", "So, > you want to be a developer?", and the "Developer FAQ" page, which > doesn't have to be more than removing outdated information and > refering to any (new) documentation on how to participate in the > PostgreSQL development and/or Commitfest workflow as a non-CFM. Agreed, a sweep of those materials would be helpful as well. I'm personally focused on CFM tasks, since it's fresh in my brain and documentation is almost non-existent for it, but if you have ideas for those areas, I definitely don't want to shut down that line of the conversation. > Additionally, we might want to add extra text to the "developers" > section of the main website [0] to refer to any new documentation. > This suggestion does depend on whether the new documentation has a > high value for potential community members. Right. So what kinds of info do we want to highlight in this documentation, to make it high-quality? Drawing from some of the questions I've seen recently, we could talk about - CF "power" structure (perhaps simply to highlight that the CFM has no additional authority to get patches in) - the back-and-forth process on the mailing list, maybe including expected response times - what to do when a patch is returned (or rejected) > As an example, the GitHub mirror of the main PostgreSQL repository > receives a decent amount of pull request traffic. When a project has a > CONTRIBUTING.md -file at the top level people writing the pull request > message will be pointed to those contributing guidelines. This could (I think some text got cut here.) The mirror bot will point you to the "So, you want to be a developer?" wiki when you open a PR, but I agree that a CONTRIBUTING doc would help prevent that small embarrassment. --Jacob
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 1:49 PM Tom Lane wrote: > Note that the patch you proposed at [1] will not fix anything. > It turns off autovac in the new node, but the buildfarm failures > we've seen appear to be due to autovac running on the old node. > (I believe that autovac in the new node is *also* a hazard, but > it seems to be a lot less of one, presumably because of timing > considerations.) To make it work, we'd have to shut off autovac > in the old node before starting pg_upgrade, Yeah, that's a fair point. > and that would make it > unacceptably (IMHO) different from what real users will do. I don't agree with that, but as you say, it is a matter of opinion. In any case, what exactly do you want to do now? Jonathon Katz has proposed a patch to do the fuzzy comparison which I believe to be incorrect because I think it compares, at most, the horizons for one table in the database. I could go work on a better version of that, or he could, or you could, but it seems like we're running out of time awfully quick here, given that you wanted to have this resolved today and it's almost the end of today. I think the most practical alternative is to put this file back to the way it was before I started tinkering with it, and revisit this issue after the release. If you want to do something else, that's fine, but I'm not going to be available to work on this issue over the weekend, so if you want to do something else, you or someone else is going to have to take responsibility for whatever further stabilization that other approach may require between now and the release. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
Robert Haas writes: > I think the most practical alternative is to put this file back to the > way it was before I started tinkering with it, and revisit this issue > after the release. Yeah, that seems like the right thing. We are running too low on time to have any confidence that a modified version of the test will be reliable. regards, tom lane
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 12:59 PM Andres Freund wrote: > Why you think it's better to not have the test than to have a very limited > amount of fuzziness (by using the next xid as an upper limit). What's the bug > that will reliably pass the nextxid fuzzy comparison, but not an exact > comparison? I don't know. I mean, I guess one possibility is that the nextXid value could be wrong too, because I doubt we have any separate test that would catch that. But more generally, I don't have a lot of confidence in fuzzy tests. It's too easy for things to look like they're working when they really aren't. Let's say that the value in the old cluster was 100 and the nextXid in the new cluster is 1000. Well, it's not like every value between 100 and 1000 is OK. The overwhelming majority of those values could never be produced, neither from the old cluster nor from any subsequent vacuum. Given that the old cluster is suffering no new write transactions, there's probably exactly two values that are legal: one being the value from the old cluster, which we know, and the other being whatever a vacuum of that table would produce, which we don't know, although we do know that it's somewhere in that range. Let's flip the question on its head: why should some hypothetical future bug that we have in this area produce a value outside that range? If it's a failure to set the value at all, or if it generates a value at random, we'd likely still catch it. And those are pretty likely, so maybe the value of such a test is not zero. On the other hand, subtle breakage might be more likely to survive developer testing than complete breakage. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 12:15 PM Robert Haas wrote: > Given that the old cluster is suffering no new write > transactions, there's probably exactly two values that are legal: one > being the value from the old cluster, which we know, and the other > being whatever a vacuum of that table would produce, which we don't > know, although we do know that it's somewhere in that range. What about autoanalyze? -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 3:23 PM Peter Geoghegan wrote: > On Thu, Aug 4, 2022 at 12:15 PM Robert Haas wrote: > > Given that the old cluster is suffering no new write > > transactions, there's probably exactly two values that are legal: one > > being the value from the old cluster, which we know, and the other > > being whatever a vacuum of that table would produce, which we don't > > know, although we do know that it's somewhere in that range. > > What about autoanalyze? What about it? -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 3:10 PM Tom Lane wrote: > Robert Haas writes: > > I think the most practical alternative is to put this file back to the > > way it was before I started tinkering with it, and revisit this issue > > after the release. > > Yeah, that seems like the right thing. We are running too low on time > to have any confidence that a modified version of the test will be > reliable. Done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 12:31 PM Robert Haas wrote: > > What about autoanalyze? > > What about it? It has a tendency to consume an XID, here or there, quite unpredictably. I've noticed that this often involves an analyze of pg_statistic. Have you accounted for that? You said upthread that you don't like "fuzzy" tests, because it's too easy for things to look like they're working when they really aren't. I suppose that there may be some truth to that, but ISTM that there is also a lot to be said for a test that can catch failures that weren't specifically anticipated. Users won't be running pg_upgrade with autovacuum disabled. And so ISTM that just testing that relfrozenxid has been carried forward is more precise about one particular detail (more precise than alternative approaches to testing), but less precise about the thing that we actually care about. -- Peter Geoghegan
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar wrote: > Another version of the patch which closes the smgr at the end using > smgrcloserellocator() and I have also added a commit message. I have reviewed this patch and I don't see a problem with it. However, it would be nice if Andres or someone else who understands this area well (Tom? Thomas?) would also review it, because I also reviewed what's in the tree now and that turns out to be buggy, which leads me to conclude that I don't understand this area as well as would be desirable. I'm inclined to hold off on committing this until next week, not only for that reason, but also because there's a wrap planned on Monday, and committing anything now seems like it might have too much of a risk of upsetting that plan. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pluggable toaster
Hi hackers! I've made a rebase according to Andres' and Aleksander's comments. The rebased branch resides here: https://github.com/postgrespro/postgres/tree/toasterapi_clean

I've decided to leave only the first 2 patches for review and send less significant changes after the main patch is straightened out. So, here is
v13-0001-toaster-interface.patch - main TOAST API patch, with reference TOAST mechanics left as-is.
v13-0002-toaster-default.patch - reference TOAST re-implemented via TOAST API.

>I'm a bit confused by the patch numbering - why isn't there a patch 0001 and >0005?

Sorry for the confusion, my fault. The first patch (CREATE TABLE .. SET STORAGE) is already committed into v15, and several of the late patches weren't included. I've rearranged patch numbers in this iteration.

>I think 0002 needs to be split further - the relevant part isn't that it >introduces the "dummy toaster" module, it's a large change doing lots of >things, the addition of the contrib module is irrelevant comparatively.

Done, contrib/dummy_toaster excluded from main patch and placed in branch as a separate commit.

>As is the patches unfortunately are close to unreviewable. Lots of code gets >moved around in one patch, then again in the next patch, then again in the >next.

So I've decided to put here only the first one while I'm working on the latter to clean this up - I agree, code in latter patches needs some refactoring. Working on it.

>Unfortunately, scanning through these patches, it seems this is a lot of >complexity, with a (for me) comensurate benefit. There's a lot of more general >improvements to toasting and the json type that we can do, that are a lot less >complex than this.

We have very significant improvements for storing large JSON and a couple of other TOAST improvements which make a lot of sense, but they are based on this API. But in the first patch reference TOAST is left as-is, and does not use TOAST API.

>> 2) New VARATT_CUSTOM data structure with fixed header and variable >> tail to store custom toasted data, with according macros set;
>That's adding overhead to every toast interaction, independent of any new >infrastructure being used.

We've performed some tests on this and haven't detected significant overhead.

>So we're increasing pg_attribute - often already the largest catalog table in >a database.

A little bit, with an OID column storing the Toaster OID. We do not see any other way to keep track of the Toaster used by the table's column, because it could be changed any time by ALTER TABLE ... SET TOASTER.

>Am I just missing something, or is atttoaster not actually used in this patch? >So most of the contrib module added is unreachable code?

It is necessary for Toasters implemented via TOAST API; the first patch does not use it directly because reference TOAST is left unchanged. The second one, which implements reference TOAST via TOAST API, uses it.

>That seems not great.

About Toaster deletion - we forbid dropping Toasters because if a Toaster is dropped the data TOASTed with it is lost, and as was mentioned before, we think that there won't be a lot of custom Toasters, likely less than a dozen.

>That move the whole list around! On a cache hit. Tthis would likely already be >slower than syscache.

Thank you for the remark, it is a questionable approach. I've changed this in the current iteration (patch in attach) to keep the Toaster list appended-only if a Toaster was not found, and leave the Toaster cache as a straight list - the first element in is the head of the list.
Also, documentation on TOAST API is provided in README.toastapi in the first patch - I'd be grateful for comments on it. Thanks for the feedback! On Tue, Aug 2, 2022 at 6:37 PM Andres Freund wrote: > Hi, > > On 2022-08-02 09:15:12 +0300, Nikita Malakhov wrote: > > Attach includes: > > v11-0002-toaster-interface.patch - contains TOAST API with default > Toaster > > left as-is (reference implementation) and Dummy toaster as an example > > (will be removed later as a part of refactoring?). > > > > v11-0003-toaster-default.patch - implements reference TOAST as Default > > Toaster > > via TOAST API, so Heap AM calls Toast only via API, and does not have > direct > > calls to Toast functionality. > > > > v11-0004-toaster-snapshot.patch - supports row versioning for TOASTed > values > > and some refactoring. > > I'm a bit confused by the patch numbering - why isn't there a patch 0001 > and > 0005? > > I think 0002 needs to be split further - the relevant part isn't that it > introduces the "dummy toaster" module, it's a large change doing lots of > things, the addition of the contrib module is irrelevant comparatively. > > As is the patches unfortunately are close to unreviewable. Lots of code > gets > moved around in one patch, then again in the next patch, then again in the > next. > > > Unfortunately, scanning through these patches, it seems this is a lot of > complexity, with a (for me) comensurate benefit. There's a lot of mo
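For what it's worth, here is a minimal, self-contained C sketch of the append-only cache behaviour described above (look up by toaster OID; on a miss, load once and append; the first entry ever inserted stays at the head). None of this is the actual patch code -- the ToasterCacheEntry struct, lookup_toaster(), and the demo loader are made-up names used purely to illustrate the design choice.

#include <stdlib.h>

typedef unsigned int Oid;			/* stand-in for PostgreSQL's Oid */

/* Hypothetical cache entry: maps a toaster OID to its routine table. */
typedef struct ToasterCacheEntry
{
	Oid			toasteroid;			/* key */
	void	   *routine;			/* toaster routine table (opaque here) */
	struct ToasterCacheEntry *next;
} ToasterCacheEntry;

static ToasterCacheEntry *toaster_cache = NULL;	/* head: first entry added */

/*
 * Return the cached routine for the given toaster OID, loading it on first
 * use.  Entries are only ever appended, never removed or reordered, so a
 * pointer handed out once stays valid for the life of the process.
 */
static void *
lookup_toaster(Oid toasteroid, void *(*load_routine) (Oid))
{
	ToasterCacheEntry *e;
	ToasterCacheEntry **tail = &toaster_cache;

	for (e = toaster_cache; e != NULL; e = e->next)
	{
		if (e->toasteroid == toasteroid)
			return e->routine;		/* cache hit: no list reshuffling */
		tail = &e->next;
	}

	/* Cache miss: load the routine once and append a new entry at the tail. */
	e = malloc(sizeof(ToasterCacheEntry));
	if (e == NULL)
		return NULL;
	e->toasteroid = toasteroid;
	e->routine = load_routine(toasteroid);
	e->next = NULL;
	*tail = e;

	return e->routine;
}

/* Demo loader: a real one would build the toaster's routine table. */
static void *
demo_load(Oid toasteroid)
{
	static int	dummy;

	(void) toasteroid;
	return &dummy;
}

int
main(void)
{
	void	   *r1 = lookup_toaster(42, demo_load);	/* miss: loads and appends */
	void	   *r2 = lookup_toaster(42, demo_load);	/* hit: same pointer back */

	return (r1 == r2) ? 0 : 1;
}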
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote: > I'm inclined to hold off on committing this until next week, not only +1 I don't see any reason to hurry to fix problems that occur when DROP DATABASE is interrupted. Sorry to beat up your patches so much and for such crappy test cases^C -- Justin
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Justin Pryzby writes: > On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote: >> I'm inclined to hold off on committing this until next week, not only > +1 +1 ... there are some other v15 open items that I don't think we'll see fixed for beta3, either. regards, tom lane
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote: > Another version of the patch which closes the smgr at the end using > smgrcloserellocator() and I have also added a commit message. What's the motivation behind the explicit close? > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char > *srcpath) > Pagepage; > List *rlocatorlist = NIL; > LockRelId relid; > - Relationrel; > Snapshotsnapshot; > + SMgrRelationsmgr; > BufferAccessStrategy bstrategy; > > /* Get pg_class relfilenumber. */ > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char > *srcpath) > rlocator.dbOid = dbid; > rlocator.relNumber = relfilenumber; > > - /* > - * We can't use a real relcache entry for a relation in some other > - * database, but since we're only going to access the fields related to > - * physical storage, a fake one is good enough. If we didn't do this and > - * used the smgr layer directly, we would have to worry about > - * invalidations. > - */ > - rel = CreateFakeRelcacheEntry(rlocator); > - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM); > - FreeFakeRelcacheEntry(rel); > + smgr = smgropen(rlocator, InvalidBackendId); > + nblocks = smgrnblocks(smgr, MAIN_FORKNUM); > + smgrclose(smgr); Why are you opening and then closing again? Part of the motivation for the question is that a local SMgrRelation variable may lead to it being used further, opening up interrupt processing issues. > + rlocator.locator = src_rlocator; > + smgrcloserellocator(rlocator); > + > + rlocator.locator = dst_rlocator; > + smgrcloserellocator(rlocator); As mentioned above, it's not clear to me why this is now done... Otherwise looks good to me. Greetings, Andres Freund
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On 2022-08-04 16:07:01 -0400, Robert Haas wrote: > On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar wrote: > > Another version of the patch which closes the smgr at the end using > > smgrcloserellocator() and I have also added a commit message. > > I have reviewed this patch and I don't see a problem with it. However, > it would be nice if Andres or someone else who understands this area > well (Tom? Thomas?) would also review it, because I also reviewed > what's in the tree now and that turns out to be buggy, which leads me > to conclude that I don't understand this area as well as would be > desirable. I don't think this issue is something I'd have caught "originally" either. It's probably more of a "fake relcache entry" issue (or undocumented limitation) than a bug in the new code. I did a quick review upthread - some minor quibbles aside, I think it looks good. > I'm inclined to hold off on committing this until next week, not only > for that reason, but also because there's a wrap planned on Monday, > and committing anything now seems like it might have too much of a > risk of upsetting that plan. Makes sense. Unlikely to be a blocker for anybody. Greetings, Andres Freund
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On 2022-08-04 16:38:35 +0530, Dilip Kumar wrote: > So basically, from this we can say it is completely a problem with > drop databases, I mean I can produce any behavior by interrupting drop > database > 1. If we created some tables/inserted data and the drop database got > cancelled, it might have a database directory and those objects are > lost. > 2. Or you can even drop the database directory and then get cancelled > before deleting the pg_database entry then also you will end up with a > corrupted database, doesn't matter whether you created it with WAL LOG > or FILE COPY. Yea. I think at the very least we need to start holding interrupts before the DropDatabaseBuffers() and do so until commit. That's probably best done by doing the transaction commit inside dropdb. Greetings, Andres Freund
Re: [PATCH] CF app: add "Returned: Needs more interest"
Hi, On 2022-08-04 11:19:28 -0700, Jacob Champion wrote: > My intention had not quite been for this to be a referendum on the > decision for every patch -- we can do that if it helps, but I don't > think we necessarily have to have unanimity on the bucketing for every > patch in order for the new state to be useful. Sorry, I should have been clearer. It wasn't mine either! I was just trying to understand what you see as the usecase / get a better feel for it. I'm now a bit more convinced it's useful than before. > >> - https://commitfest.postgresql.org/38/3310/ > > > > I don't really understand why this has been RwF'd, doesn't seem that long > > since the last review leading to changes. > > Eight months without feedback, when we expect authors to turn around a > patch in two weeks or less to avoid being RwF'd, is a long time IMHO. Why is it better to mark it as lacks interested than RwF if there actually *has* been feedback? > I don't think a patch should sit motionless in CF for eight months; it's not > at all fair to the author. It's not great, I agree, but wishes don't conjure up resources :( > >> - https://commitfest.postgresql.org/38/3050/ > > > > Given that a non-author did a revision of the patch, listed a number of TODO > > items and said "I'll create regression tests firstly." - I don't think > > "lacks > > interest" would have been appropriate, and RwF is? > > That was six months ago, and prior to that there was another six month > silence. I'd say that lacks interest, and I don't feel like it's > currently reviewable in CF. I don't think the entry needs more review - it needs changes: https://www.postgresql.org/message-id/CAOKkKFtc45uNFoWYOCo4St19ayxrh-_%2B4TnZtwxGZz6-3k_GSA%40mail.gmail.com That contains quite a few things that should be changed. A patch that has gotten feedback, but that feedback hasn't been processed pretty much is the definition of RwF, no? > >> (Even if they'd all received skeptical feedback, if the author replies in > >> good faith and is met with silence for months, we need to not keep > >> stringing > >> them along.) > > > > I agree very much with that - just am doubtful that "lacks interest" is a > > good > > way of dealing with it, unless we just want to treat it as a nicer sounding > > "rejected". > > Tom summed up my position well: there's a difference between those two > that is both meaningful and actionable for contributors. Is there an > alternative you'd prefer? I agree that "lacks interest" could be useful. But I'm wary of it becoming just a renaming if we end up marking patches that should be RwF or rejected as "lacks interest". Greetings, Andres Freund
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Robert Haas writes: > I have reviewed this patch and I don't see a problem with it. However, > it would be nice if Andres or someone else who understands this area > well (Tom? Thomas?) would also review it, because I also reviewed > what's in the tree now and that turns out to be buggy, which leads me > to conclude that I don't understand this area as well as would be > desirable. FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry here, because I do not think that mechanism is meant to be used outside of WAL replay. However, this patch fails to remove it from CreateAndCopyRelationData, which seems likely to be just as much at risk. The "invalidation" comment bothered me for awhile, but I think it's fine: we know that no other backend can connect to the source DB because we have it locked, and we know that no other backend can connect to the destination DB because it doesn't exist yet according to the catalogs, so nothing could possibly occur to invalidate our idea of where the physical files are. It would be nice to document these assumptions, though, rather than merely remove all the relevant commentary. While I'm at it, I would like to strenuously object to the current framing of CreateAndCopyRelationData as a general-purpose copying mechanism. Because of the above assumptions, I think it's utterly unsafe to use anywhere except in CREATE DATABASE. The header comment fails to warn about that at all, and placing it in bufmgr.c rather than static in dbcommands.c is just an invitation to future misuse. Perhaps I'm overly sensitive to that because I just finished cleaning up somebody's misuse of non-general-purpose code (1aa8dad41), but as this stands I think it's positively dangerous. regards, tom lane
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Andres Freund writes: > Yea. I think at the very least we need to start holding interrupts before the > DropDatabaseBuffers() and do so until commit. That's probably best done by > doing the transaction commit inside dropdb. We've talked before about ignoring interrupts across commit, but I find the idea a bit scary. In any case, DROP DATABASE is far from the only place with a problem. regards, tom lane
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote: > Were you considering adding the new function to simd.h now that that's > committed? It's a bit up in the air what should go in there, but this new > function is low-level and generic enough to be a candidate... I don't have a strong opinion. I went with a separate file because I envisioned a variety of possible linear search functions (e.g., char, uint16, uint32), and some might not use SIMD instructions. Furthermore, it seemed less obvious to look in simd.h for linear search functions. That being said, it might make sense to just add it here for now. > I wonder if the "pg_" prefix is appropriate here, as that is most often > used for things that hide specific details *and* where the base name would > clash, like OS calls or C library functions. I'm not quite sure where the > line is drawn, but I mean that "linearsearch" is a generic algorithm and > not a specific API we are implementing, if that makes sense. Yeah, I was concerned about clashing with lsearch() and lfind(). I will drop the prefix. > The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(), > pg_popcount32, etc). We'll likely want to search bytes sometime, so > something to keep in mind as far as naming ("_int" vs "_byte"?). How about something like lsearch32 or linearsearch32? > I'm not a fan of "its" as a variable name, and I'm curious what it's > intended to convey. It's short for "iterations." I'll spell it out completely to avoid this kind of confusion. > All the __m128i vars could technically be declared const, although I think > it doesn't matter -- it's just a hint to the reader. Will do. > Out of curiosity do we know how much we get by loading four registers > rather than two? The small program I've been using for testing takes about 40% more time with the two register approach. The majority of this test involves searching for elements that either don't exist in the array or that live near the end of the array, so this is probably close to the worst case. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
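To give readers a concrete picture of the kind of function being discussed, here is a rough, self-contained C sketch of a SIMD linear search over uint32 values: it broadcasts the key into one 128-bit SSE2 register and compares four array elements per iteration, with a scalar tail for the remainder. This is only an illustration, not the patch under review -- the actual proposal loads four registers per iteration (the variant measured as noticeably faster above), and the name lsearch32_sse2 is made up for the example.

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if key occurs anywhere in base[0 .. nelem - 1]. */
static bool
lsearch32_sse2(uint32_t key, const uint32_t *base, size_t nelem)
{
	size_t		i = 0;
	const __m128i keys = _mm_set1_epi32((int) key);	/* key in all 4 lanes */

	/* Vector loop: compare four elements per iteration. */
	for (; i + 4 <= nelem; i += 4)
	{
		__m128i		vals = _mm_loadu_si128((const __m128i *) (base + i));
		__m128i		cmp = _mm_cmpeq_epi32(keys, vals);

		if (_mm_movemask_epi8(cmp) != 0)
			return true;
	}

	/* Scalar tail for the remaining 0-3 elements. */
	for (; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}

	return false;
}

int
main(void)
{
	uint32_t	xids[] = {3, 9, 27, 81, 243, 729};

	return lsearch32_sse2(81, xids, 6) ? 0 : 1;
}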
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote: > The "invalidation" comment bothered me for awhile, but I think it's > fine: we know that no other backend can connect to the source DB > because we have it locked, About that - is it any problem that the currently-connected db can be used as a template? It's no issue for 2-phase commit, because "create database" cannot run in a txn. -- Justin
Re: ERREUR: cache lookup failed for function 0 with PostgreSQL 15 beta 2, no error with PostgreSQL 14.4
On Fri, 5 Aug 2022 at 01:20, Phil Florent wrote: > with fakedata as ( >select 'hello' word >union all >select 'world' word > ) > select * > from ( >select word, count(*) over (partition by word) nb from fakedata > ) t where nb = 1; > ERREUR: cache lookup failed for function 0 > A DSS developer from my company, Julien Roze, reported me an error I cannot > explained. Is it a new behavior or a bug ? Thank you for the report and the minimal self-contained test case. That's highly useful for us. I've now committed a fix for this ([1]). It will appear in the next beta release for PG15. David [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=270eb4b5d4986534f2d522ebb19f67396d13cf44
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On 2022-08-04 18:05:25 -0400, Tom Lane wrote: > Andres Freund writes: > > Yea. I think at the very least we need to start holding interrupts before > > the > > DropDatabaseBuffers() and do so until commit. That's probably best done by > > doing the transaction commit inside dropdb. > > We've talked before about ignoring interrupts across commit, but > I find the idea a bit scary. I'm not actually suggesting to do so across commit, just until the commit. Maintaining that seems easiest if dropdb() does the commit internally. > In any case, DROP DATABASE is far from the only place with a problem. What other place has a database corrupting potential of this magnitude just because interrupts are accepted? We throw valid s_b contents away and then accept interrupts before committing - with predictable results. We also accept interrupts as part of deleting the db data dir (due to catalog access). Greetings, Andres Freund
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
I wrote: > While I'm at it, I would like to strenuously object to the current > framing of CreateAndCopyRelationData as a general-purpose copying > mechanism. And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer not completely broken?

		/* Read block from source relation. */
		srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
										   RBM_NORMAL, bstrategy_src,
										   permanent);
		srcPage = BufferGetPage(srcBuf);
		if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
		{
			ReleaseBuffer(srcBuf);
			continue;
		}

		/* Use P_NEW to extend the destination relation. */
		dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
										   RBM_NORMAL, bstrategy_dst,
										   permanent);

You can't skip pages just because they are empty. Well, maybe you could if you were doing something to ensure that you zero-fill the corresponding blocks on the destination side. But this isn't doing that. It's using P_NEW for dstBuf, which will have the effect of silently collapsing out such pages. Maybe in isolation a heap could withstand that, but its indexes won't be happy (and I guess t_ctid chain links won't either). I think you should just lose the if() stanza. There's no optimization to be had here that's worth any extra complication. (This seems worth fixing before beta3, as it looks like a rather nasty data corruption hazard.) regards, tom lane
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
I wrote: > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? [pile^2] Also, what is the rationale for locking the target buffer but not the source buffer? That seems pretty hard to justify from here, even granting the assumption that we don't expect any other processes to be interested in these buffers (which I don't grant, because checkpointer). regards, tom lane
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On 2022-08-04 19:01:06 -0400, Tom Lane wrote: > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? > > /* Read block from source relation. */ > srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno, >RBM_NORMAL, bstrategy_src, >permanent); > srcPage = BufferGetPage(srcBuf); > if (PageIsNew(srcPage) || PageIsEmpty(srcPage)) > { > ReleaseBuffer(srcBuf); > continue; > } > > /* Use P_NEW to extend the destination relation. */ > dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW, >RBM_NORMAL, bstrategy_dst, >permanent); > > You can't skip pages just because they are empty. Well, maybe you could > if you were doing something to ensure that you zero-fill the corresponding > blocks on the destination side. But this isn't doing that. It's using > P_NEW for dstBuf, which will have the effect of silently collapsing out > such pages. Maybe in isolation a heap could withstand that, but its > indexes won't be happy (and I guess t_ctid chain links won't either). > > I think you should just lose the if() stanza. There's no optimization to > be had here that's worth any extra complication. > > (This seems worth fixing before beta3, as it looks like a rather > nasty data corruption hazard.) Ugh, yes. And even with this fixed I think this should grow at least an assertion that the block numbers match, probably even an elog. Greetings, Andres
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Andres Freund writes: > On 2022-08-04 18:05:25 -0400, Tom Lane wrote: >> In any case, DROP DATABASE is far from the only place with a problem. > What other place has a database corrupting potential of this magnitude just > because interrupts are accepted? We throw valid s_b contents away and then > accept interrupts before committing - with predictable results. We also accept > interrupts as part of deleting the db data dir (due to catalog access). Those things would be better handled by moving the data-discarding steps to post-commit. Maybe that argues for having an internal commit halfway through DROP DATABASE: remove pg_database row, commit, start new transaction, clean up. regards, tom lane
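To make the proposed restructuring concrete, here is a minimal sketch of the shape being discussed: hold interrupts across the non-undoable steps and commit internally inside dropdb() before the filesystem cleanup. This is illustrative only and not an actual patch; apart from the core PostgreSQL calls (HOLD_INTERRUPTS, DropDatabaseBuffers, CommitTransactionCommand, StartTransactionCommand), the function name and the elided steps are hypothetical.

    #include "postgres.h"
    #include "access/xact.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"

    /* Hypothetical sketch of the dropdb() restructuring discussed above. */
    static void
    dropdb_sketch(Oid db_id)
    {
        /* ... transactional catalog work: remove the pg_database row, etc. ... */

        HOLD_INTERRUPTS();          /* no cancel/die between here and the commit */
        DropDatabaseBuffers(db_id); /* discard shared-buffer contents for the DB */

        /* Internal commit, so a later interrupt can no longer undo the catalog change. */
        CommitTransactionCommand();
        RESUME_INTERRUPTS();

        /* New transaction for the remaining, non-transactional cleanup. */
        StartTransactionCommand();
        /* ... remove the database directory on disk ... */
    }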
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Andres Freund writes: > On 2022-08-04 19:01:06 -0400, Tom Lane wrote: >> (This seems worth fixing before beta3, as it looks like a rather >> nasty data corruption hazard.) > Ugh, yes. And even with this fixed I think this should grow at least an > assertion that the block numbers match, probably even an elog. Yeah, the assumption that P_NEW would automatically match the source block was making me itchy too. An explicit test-and-elog seems worth the cycles. regards, tom lane
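For readers following along, a sketch of what the fixed copy loop might look like, with the empty-page skip removed and an explicit block-number cross-check added. It mirrors the variable names from the quoted code and is illustrative only, not the committed fix; the elided locking/WAL-logging steps are summarized in a comment.

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      srcBuf;
        Buffer      dstBuf;

        CHECK_FOR_INTERRUPTS();

        /* Read the source block unconditionally; never skip "empty" pages. */
        srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
                                           RBM_NORMAL, bstrategy_src, permanent);

        /* Use P_NEW to extend the destination relation. */
        dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
                                           RBM_NORMAL, bstrategy_dst, permanent);

        /* Verify that the destination stays in step with the source. */
        if (BufferGetBlockNumber(dstBuf) != blkno)
            elog(ERROR, "unexpected block number %u in destination relation, expected %u",
                 BufferGetBlockNumber(dstBuf), blkno);

        /* ... lock dstBuf, copy the page contents, WAL-log, mark dirty,
         * then UnlockReleaseBuffer(dstBuf) and ReleaseBuffer(srcBuf) ... */
    }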
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On August 4, 2022 4:11:13 PM PDT, Tom Lane wrote: >I wrote: >> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer >> not completely broken? > >[pile^2] Also, what is the rationale for locking the target buffer >but not the source buffer? That seems pretty hard to justify from >here, even granting the assumption that we don't expect any other >processes to be interested in these buffers (which I don't grant, >because checkpointer). I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have access, given that they will never modify buffers. They just take a lock to prevent concurrent modifications, which RelationCopyStorageUsingBuffer hopefully doesn't do. Andres
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Hi, On August 4, 2022 4:20:16 PM PDT, Tom Lane wrote: >Yeah, the assumption that P_NEW would automatically match the source block >was making me itchy too. An explicit test-and-elog seems worth the >cycles. Is there a good reason to rely on P_NEW at all? Both from an efficiency and robustness POV it seems like it'd be better to use smgrextend to bulk extend just after smgrcreate() and then fill s_b up using RBM_ZERO (or whatever it is called). That bulk smgrextend would later be a good point to use fallocate so the FS can immediately size the file correctly, without a lot of pointless writes of zeroes. Andres
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Andres Freund writes: > On August 4, 2022 4:11:13 PM PDT, Tom Lane wrote: >> [pile^2] Also, what is the rationale for locking the target buffer >> but not the source buffer? That seems pretty hard to justify from >> here, even granting the assumption that we don't expect any other >> processes to be interested in these buffers (which I don't grant, >> because checkpointer). > I'm not arguing it's good or should stay that way, but it's probably okayish > that checkpointer / bgwriter have access, given that they will never modify > buffers. They just take a lock to prevent concurrent modifications, which > RelationCopyStorageUsingBuffer hopefully doesn't do. I'm not arguing that it's actively broken today --- but AFAIR, every other access to a shared buffer takes a buffer lock. It does not seem to me to be very future-proof for this code to decide it's exempt from that rule, without so much as a comment justifying it. Furthermore, what's the gain? We aren't expecting contention here, I think. If we were, then it probably *would* be actively broken. regards, tom lane
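As a concrete illustration of the rule being invoked (a sketch only, not a proposed patch, reusing the variable names from the quoted code): the source buffer would be share-locked for the duration of the copy, just like any other reader of a shared buffer, and released afterwards.

    srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
                                       RBM_NORMAL, bstrategy_src, permanent);
    LockBuffer(srcBuf, BUFFER_LOCK_SHARE);  /* follow the usual buffer-access rule */
    srcPage = BufferGetPage(srcBuf);

    /* ... copy srcPage into the exclusively locked destination buffer ... */

    UnlockReleaseBuffer(srcBuf);            /* drop the share lock and the pin */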
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Andres Freund writes: > Is there a good reason to rely on P_NEW at all? Both from an efficiency > and robustness POV it seems like it'd be better to use smgrextend to > bulk extend just after smgrcreate() and then fill s_b u using RBM_ZERO > (or whatever it is called). That bulk smgrextend would later be a good > point to use fallocate so the FS can immediately size the file > correctly, without a lot of pointless writes of zeroes. Hmm ... makes sense. We'd be using smgrextend to write just the last page of the file, relying on its API spec "Note that we assume writing a block beyond current EOF causes intervening file space to become filled with zeroes". I don't know that we're using that assumption aggressively today, but as long as it doesn't confuse the kernel it'd probably be fine. regards, tom lane
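Putting the two messages above together, a hedged sketch of the alternative being discussed, assuming nblocks (the source relation's size) is already known and greater than zero; dst_smgr and the other names follow the quoted code or are otherwise hypothetical.

    PGAlignedBlock zerobuf;

    memset(zerobuf.data, 0, BLCKSZ);

    /* Create the fork, then write only the last block: per smgrextend()'s API
     * spec the intervening space becomes zero-filled (or could be fallocated). */
    smgrcreate(dst_smgr, forkNum, false);
    smgrextend(dst_smgr, forkNum, nblocks - 1, zerobuf.data, false);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* The destination block already exists on disk, so just zero and lock
         * it in shared buffers, then copy the source page into it. */
        dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, blkno,
                                           RBM_ZERO_AND_LOCK, bstrategy_dst,
                                           permanent);
        /* ... copy, WAL-log, mark dirty, UnlockReleaseBuffer(dstBuf) ... */
    }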
Re: annoyance with .git-blame-ignore-revs
On Mon, Jul 11, 2022 at 12:30 PM Alvaro Herrera wrote: > $ git blame configure > fatal: could not open object name list: .git-blame-ignore-revs > > My first workaround was to add empty .git-blame-ignore-revs in all > checkouts. This was moderately ok (shrug), until after a recent `tig` > upgrade the empty file started to show up in the history as an untracked > file. Ping? Would be nice to get this done soon. I don't think that it requires a great deal of care. If I was doing this myself, I would probably make sure that the backbranch copies of the file won't reference commits from later releases. But even that probably doesn't matter; just backpatching the file from HEAD as-is wouldn't break anybody's workflow. Again, to reiterate: I see no reason to do anything on the backbranches here more than once. I mentioned already that somebody proposed a patch that fixes the problem at the git level, which seems to have stalled. Here is the discussion: https://public-inbox.org/git/xmqq5ywehb69.fsf@gitster.g/T/ ISTM that we're working around what is actually a usability problem with git (imagine that!). I think that that's fine. Just thought that it was worth acknowledging it as such. We're certainly not the first people to run into this exact annoyance. -- Peter Geoghegan
Re: Allow file inclusion in pg_hba and pg_ident files
On Tue, Aug 02, 2022 at 07:32:54PM +0900, Michael Paquier wrote: > As a quick update from my side, I intend to look and apply 0001~0003 > (not double-checked yet) shortly. And a couple of days later, these look fine so done as of 47ab1ac and 718fe0a. 0002 and 0003 have been merged together. -- Michael
RE: Introduce wait_for_subscription_sync for TAP tests
On Thu, Aug 4, 2022 5:49 PM shiy.f...@fujitsu.com wrote: > > On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada > wrote: > > > > On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila > > wrote: > > > > > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila > > > wrote: > > > > > > > > Pushed this one and now I'll look at your other patch. > > > > > > > > > > I have pushed the second patch as well after making minor changes in > > > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and > > > they sound reasonable to me. Will you be able to produce back branch > > > patches? > > > > Yes. I've attached patches for backbranches. The updates are > > straightforward on v11 - v15. However, on v10, we don't use > > wait_for_catchup() in some logical replication test cases. The commit > > bbd3363e128dae refactored the tests to use wait_for_catchup but it's > > not backpatched. So in the patch for v10, I didn't change the code > > that was changed by the commit. Also, since wait_for_catchup requires > > to specify $target_lsn, unlike the one in v11 or later, I changed > > wait_for_subscription_sync() accordingly. > > > > Thanks for your patches. > > In the patches for pg11 ~ pg14, it looks we need to add a "=pod" before the > current change in PostgresNode.pm. Right? > By the way, I notice that in 002_types.pl (on master branch), it seems the "as well" in the following comment should be removed. Is it worth being fixed? $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)" ); # Wait for initial sync to finish as well $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub'); Regards, Shi yu
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Aug 4, 2022 at 6:02 PM Tom Lane wrote: > Robert Haas writes: > > I have reviewed this patch and I don't see a problem with it. However, > > it would be nice if Andres or someone else who understands this area > > well (Tom? Thomas?) would also review it, because I also reviewed > > what's in the tree now and that turns out to be buggy, which leads me > > to conclude that I don't understand this area as well as would be > > desirable. > > FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry > here, because I do not think that mechanism is meant to be used > outside of WAL replay. However, this patch fails to remove it from > CreateAndCopyRelationData, which seems likely to be just as much > at risk. It looks to me like it does? > The "invalidation" comment bothered me for awhile, but I think it's > fine: we know that no other backend can connect to the source DB > because we have it locked, and we know that no other backend can > connect to the destination DB because it doesn't exist yet according > to the catalogs, so nothing could possibly occur to invalidate our > idea of where the physical files are. It would be nice to document > these assumptions, though, rather than merely remove all the relevant > commentary. I don't think that's the point. We could always suffer a sinval reset or a PROCSIGNAL_BARRIER_SMGRRELEASE. But since the code avoids ever reusing the smgr, it should be OK. I think. > While I'm at it, I would like to strenuously object to the current > framing of CreateAndCopyRelationData as a general-purpose copying > mechanism. Because of the above assumptions, I think it's utterly > unsafe to use anywhere except in CREATE DATABASE. The header comment > fails to warn about that at all, and placing it in bufmgr.c rather > than static in dbcommands.c is just an invitation to future misuse. > Perhaps I'm overly sensitive to that because I just finished cleaning > up somebody's misuse of non-general-purpose code (1aa8dad41), but > as this stands I think it's positively dangerous. OK. No objection to you revising the comments however you feel is appropriate. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Aug 4, 2022 at 7:01 PM Tom Lane wrote: > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? Ouch. That's pretty bad. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Thu, Aug 4, 2022 at 7:11 PM Tom Lane wrote: > [pile^2] Also, what is the rationale for locking the target buffer > but not the source buffer? That seems pretty hard to justify from > here, even granting the assumption that we don't expect any other > processes to be interested in these buffers (which I don't grant, > because checkpointer). Ooph. I agree. -- Robert Haas EDB: http://www.enterprisedb.com
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe wrote in > While this may mitigate the problem, I don't think it will deal with > all the cases which could cause a transaction to end up committed locally, > but not on the synchronous standby. I think that only using the full > power of two-phase commit can make this bulletproof. > > Is it worth adding additional complexity that is not a complete solution? I would agree with this. As with 2PC, whatever we do to make this perfect, we're left with unresolvable problems, at least for now. Doesn't it meet your requirements if we have a means to know whether the last transaction on the current session was locally committed or aborted? We are already internally managing the last committed LSN. I think we can do the same thing for transaction aborts and the last inserted LSN, and we can expose them in some way. This is far simpler than the (maybe impossible) attempt to completely close that gap. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Introduce wait_for_subscription_sync for TAP tests
On Thu, Aug 4, 2022 at 7:13 PM Tom Lane wrote: > > Masahiko Sawada writes: > > Yes. I've attached patches for backbranches. > > FWIW, I'd recommend waiting till after next week's wrap before > pushing these. While I'm definitely in favor of doing this, > the odds of introducing a bug are nonzero, so right before a > release deadline doesn't seem like a good time. > Agreed. I was planning to do it only after next week's wrap. Thanks for your suggestion. -- With Regards, Amit Kapila.
Re: BTMaxItemSize seems to be subtly incorrect
On Thu, Jun 9, 2022 at 11:20 AM Robert Haas wrote: > I think I'd feel more comfortable if you wrote the patch, if that's possible. Okay, pushed a fix just now. Thanks -- Peter Geoghegan
Re: optimize lookups in snapshot [sub]xip arrays
On Fri, Aug 5, 2022 at 5:15 AM Nathan Bossart wrote: > > On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote: > > Were you considering adding the new function to simd.h now that that's > > committed? It's a bit up in the air what should go in there, but this new > > function is low-level and generic enough to be a candidate... > > I don't have a strong opinion. I went with a separate file because I > envisioned a variety of possible linear search functions (e.g., char, > uint16, uint32), and some might not use SIMD instructions. Furthermore, it > seemed less obvious to look in simd.h for linear search functions. That is a good point. Maybe potential helpers in simd.h should only deal specifically with vector registers, with its users providing C fallbacks. I don't have any good ideas of where to put the new function, though. > > I wonder if the "pg_" prefix is appropriate here, as that is most often > > used for things that hide specific details *and* where the base name would > > clash, like OS calls or C library functions. I'm not quite sure where the > > line is drawn, but I mean that "linearsearch" is a generic algorithm and > > not a specific API we are implementing, if that makes sense. > > Yeah, I was concerned about clashing with lsearch() and lfind(). I will > drop the prefix. Hmm, I didn't know about those. lfind() is similar enough that it would make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at least for the v4 version that returns the pointer. We already declare bsearch_arg() in src/include/port.h and that's another kind of array search. Returning bool is different enough to have a different name. pg_lfind32_ispresent()? *_noptr()? Meh. Having said all that, the man page under BUGS [1] says "The naming is unfortunate." > > Out of curiosity do we know how much we get by loading four registers > > rather than two? > > The small program I've been using for testing takes about 40% more time > with the two register approach. The majority of this test involves > searching for elements that either don't exist in the array or that live > near the end of the array, so this is probably close to the worst case. Ok, sounds good. [1] https://man7.org/linux/man-pages/man3/lsearch.3.html#BUGS -- John Naylor EDB: http://www.enterprisedb.com
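For context, a minimal scalar version of the kind of helper being named here. This is illustrative only: the patch under discussion layers a SIMD fast path over an equivalent loop, and the final name and signature were still being debated in the thread.

    #include <stdbool.h>
    #include <stdint.h>

    /* Return true if "key" appears in the first "nelem" elements of "base". */
    static inline bool
    pg_lfind32(uint32_t key, const uint32_t *base, uint32_t nelem)
    {
        for (uint32_t i = 0; i < nelem; i++)
        {
            if (base[i] == key)
                return true;
        }
        return false;
    }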
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Aug 5, 2022 at 4:31 AM Tom Lane wrote: > > I wrote: > > While I'm at it, I would like to strenuously object to the current > > framing of CreateAndCopyRelationData as a general-purpose copying > > mechanism. > > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? > > /* Read block from source relation. */ > srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno, >RBM_NORMAL, bstrategy_src, >permanent); > srcPage = BufferGetPage(srcBuf); > if (PageIsNew(srcPage) || PageIsEmpty(srcPage)) > { > ReleaseBuffer(srcBuf); > continue; > } > > /* Use P_NEW to extend the destination relation. */ > dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW, >RBM_NORMAL, bstrategy_dst, >permanent); > > You can't skip pages just because they are empty. Well, maybe you could > if you were doing something to ensure that you zero-fill the corresponding > blocks on the destination side. But this isn't doing that. It's using > P_NEW for dstBuf, which will have the effect of silently collapsing out > such pages. Maybe in isolation a heap could withstand that, but its > indexes won't be happy (and I guess t_ctid chain links won't either). > > I think you should just lose the if() stanza. There's no optimization to > be had here that's worth any extra complication. > > (This seems worth fixing before beta3, as it looks like a rather > nasty data corruption hazard.) Yeah this is broken. -- Dilip -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Handle infinite recursion in logical replication setup
On Thu, Aug 4, 2022 at 6:31 PM Masahiko Sawada wrote: > > On Tue, Aug 2, 2022 at 8:59 PM Amit Kapila wrote: > > > > On Tue, Jul 26, 2022 at 9:07 AM Amit Kapila wrote: > > > > Let me try to summarize the discussion so that it is easier for others > > to follow. The work in this thread is to avoid loops, and duplicate > > data in logical replication when the operations happened on the same > > table in multiple nodes. It has been explained in email [1] with an > > example of how a logical replication setup can lead to duplicate or > > inconsistent data. > > > > The idea to solve this problem is that we don't replicate data that is > > not generated locally which we can normally identify based on origin > > of data in WAL. The commit 366283961a achieves that for replication > > but still the problem can happen during initial sync which is > > performed internally via copy. We can't differentiate the data in heap > > based on origin. So, we decided to prohibit the subscription > > operations that can perform initial sync (ex. Create Subscription, > > Alter Subscription ... Refresh) by detecting that the publisher has > > subscribed to the same table from some other publisher. > > > > To prohibit the subscription operations, the currently proposed patch > > throws an error. Then, it also provides a new copy_data option > > 'force' under which the user will still be able to perform the > > operation. This could be useful when the user intentionally wants to > > replicate the initial data even if it contains data from multiple > > nodes (for example, when in a multi-node setup, one decides to get the > > initial data from just one node and then allow replication of data to > > proceed from each of respective nodes). > > > > The other alternative discussed was to just give a warning for > > subscription operations and probably document the steps for users to > > avoid it. But the problem with that is once the user sees this > > warning, it won't be able to do anything except recreate the setup, so > > why not give an error in the first place? > > > > Thoughts? > > Thank you for the summary! > > I understand that this feature could help some cases, but I'm really > not sure adding new value 'force' for them is worthwhile, and > concerned it could reduce the usability. > > IIUC this feature would work only when origin = 'none' and copy_data = > 'on', and the purpose is to prevent the data from being > duplicated/conflicted by the initial table sync. But there are cases > where duplication/conflict doesn't happen even if origin = 'none' and > copy_data = 'on'. For instance, the table on the publisher might be > empty. > Right, but if we want we can check the tables on publishers to ensure that. Now, another case could be where the corresponding subscription was disabled on publisher during create subscription but got enabled just before copy, we can even catch that situation as we are doing for column lists in fetch_remote_table_info(). Are there other cases of false positives you can think of? I see your point that we can document that users should be careful with certain configurations to avoid data inconsistency but not able completely convince myself about the same. I think the main thing to decide here is how much we want to ask users to be careful by referring them to docs. Now, if you are not convinced with giving an ERROR here then we can probably show a WARNING (that we might copy data for multiple origins during initial sync in spite of the user having specified origin as NONE)? 
> Also, even with origin = 'any', copy_data = 'on', there is a > possibility of data duplication/conflict. Why do we need to address > only the case where origin = 'none'? > Because the user has specifically asked not to replicate any remote data by specifying origin = NONE, which should be dealt with differently whereas 'any' doesn't have such a requirement. Now, tomorrow, if we want to support replication based on specific origin names say origin = 'node-1' then also we won't be able to identify the data during initial sync but I think 'none' is a special case where giving some intimation to user won't be a bad idea especially because we can identify the same. > I think that using origin = > 'none' doesn't necessarily mean using bi-directional (or N-way) > replication. Even when using uni-directional logical replication with > two nodes, they may use origin = 'none'. > It is possible but still, I think it is a must for bi-directional (or N-way) replication, otherwise, there is a risk of loops. -- With Regards, Amit Kapila.
Re: How is this possible "publication does not exist"
At Thu, 4 Aug 2022 10:56:45 +0200, Marco Slot wrote in > On Wed, Aug 11, 2021 at 1:37 PM Amit Kapila wrote: > > I think it is and the context is generated via > > output_plugin_error_callback. Is this reproducible for you and if so, > > can you share a test case or some steps to reproduce this? Does this > > work and suddenly start giving errors or it happens the very first > > time you tried to set up publication/subscription? I think some more > > details are required about your setup and steps to analyze this > > problem. You might want to check publication-side logs but not sure if > > get any better clue there. > > This seems to regularly reproduce the issue on PostgreSQL 14.4: > > drop subscription if exists local_sub; > drop publication if exists local_pub; > drop table if exists local; > > select pg_create_logical_replication_slot('test','pgoutput'); > create table local (x int, y int); > insert into local values (1,2); > create publication local_pub for table local; > create subscription local_sub connection 'host=localhost port=5432' > publication local_pub with (create_slot = false, slot_name = 'test', > copy_data = false); > The local_pub does appear in pg_publication, but it seems a bit like > the change_cb is using an old snapshot when reading from the catalog > in GetPublicationByName. Yes. So the slot should be created *after* the publication is created. A discussion [1] was held on allowing missing publications in such a case, but it has not reached a conclusion. [1] https://www.postgresql.org/message-id/CAA4eK1LwQAEPJMTwVe3UYODeNMkK2QHf-WZF5aXp5ZcjDRcrUA%40mail.gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: fix typos
On Thu, Aug 4, 2022 at 8:41 PM Tom Lane wrote: > > John Naylor writes: > > RepOriginId is a typedef for uint16, so this can't print the wrong answer, > > but it is inconsistent with other uses. So it seems we don't need to > > backpatch this one? > > Um ... if it's int16, then it can't be an OID, so I'd say this message has > far worse problems than %d vs %u. It should not use that terminology. The catalog has the following. Since it's not a real oid, maybe this column should be rethought? CATALOG(pg_replication_origin,6000,ReplicationOriginRelationId) BKI_SHARED_RELATION { /* * Locally known id that get included into WAL. * * This should never leave the system. * * Needs to fit into an uint16, so we don't waste too much space in WAL * records. For this reason we don't use a normal Oid column here, since * we need to handle allocation of new values manually. */ Oid roident; [...] -- John Naylor EDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Aug 5, 2022 at 5:36 AM Andres Freund wrote: > > Hi, > > On August 4, 2022 4:20:16 PM PDT, Tom Lane wrote: > >Yeah, the assumption that P_NEW would automatically match the source block > >was making me itchy too. An explicit test-and-elog seems worth the > >cycles. > > Is there a good reason to rely on P_NEW at all? I think there were 2 arguments for which we used bufmgr instead of smgrextend for the destination database 1) (Comment from Andres) The big benefit would be that the *target* database does not have to be written out / shared buffer is immediately populated. [1] 2) (Comment from Robert) We wanted to avoid writing new code which bypasses the shared buffers. [1]https://www.postgresql.org/message-id/20210905202800.ji4fnfs3xzhvo7l6%40alap3.anarazel.de Both from an efficiency and robustness POV it seems like it'd be better to use smgrextend to bulk extend just after smgrcreate() and then fill s_b u using RBM_ZERO (or whatever it is called). That bulk smgrextend would later be a good point to use fallocate so the FS can immediately size the file correctly, without a lot of pointless writes of zeroes. Yeah okay, so you mean since we already know the nblocks in the source file so we can directly do smgrextend in bulk before the copy loop and then we can just copy block by block using bufmgr with proper blkno instead of P_NEW. Yeah I think this looks optimized to me and this will take care of the above 2 points I mentioned that we will still have the target database pages in the shared buffers and we are not bypassing the shared buffers also. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Aug 5, 2022 at 2:59 AM Andres Freund wrote: > > Hi, > > On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote: > > Another version of the patch which closes the smgr at the end using > > smgrcloserellocator() and I have also added a commit message. > > What's the motivation behind the explicit close? > > > > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char > > *srcpath) > > Page page; > > List *rlocatorlist = NIL; > > LockRelId relid; > > - Relation rel; > > Snapshot snapshot; > > + SMgrRelation smgr; > > BufferAccessStrategy bstrategy; > > > > /* Get pg_class relfilenumber. */ > > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char > > *srcpath) > > rlocator.dbOid = dbid; > > rlocator.relNumber = relfilenumber; > > > > - /* > > - * We can't use a real relcache entry for a relation in some other > > - * database, but since we're only going to access the fields related > > to > > - * physical storage, a fake one is good enough. If we didn't do this > > and > > - * used the smgr layer directly, we would have to worry about > > - * invalidations. > > - */ > > - rel = CreateFakeRelcacheEntry(rlocator); > > - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM); > > - FreeFakeRelcacheEntry(rel); > > + smgr = smgropen(rlocator, InvalidBackendId); > > + nblocks = smgrnblocks(smgr, MAIN_FORKNUM); > > + smgrclose(smgr); > > Why are you opening and then closing again? Part of the motivation for the > question is that a local SMgrRelation variable may lead to it being used > further, opening up interrupt processing issues. Yeah okay, I think there is no reason to close it; in the previous version I had it like below, and I think that's a better idea. nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM) > > > + rlocator.locator = src_rlocator; > > + smgrcloserellocator(rlocator); > > + > > + rlocator.locator = dst_rlocator; > > + smgrcloserellocator(rlocator); > > As mentioned above, it's not clear to me why this is now done... > > Otherwise looks good to me. Yeah, maybe it is not necessary to close them, as these unowned smgrs will automatically get closed at transaction end. Actually the previous version of the patch had both these comments fixed. The reason for explicitly closing them is that I noticed that in most of the places where we do smgropen we explicitly close the smgr, e.g. index_copy_data(), heapam_relation_copy_data(); OTOH some places don't close it, e.g. IssuePendingWritebacks(). So now I think that in our case it is better not to close them, because I do not like having this specific code at the end just to close the smgr. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BTMaxItemSize seems to be subtly incorrect
On Fri, Aug 5, 2022 at 3:56 PM Peter Geoghegan wrote: > On Thu, Jun 9, 2022 at 11:20 AM Robert Haas wrote: > > I think I'd feel more comfortable if you wrote the patch, if that's > > possible. > > Okay, pushed a fix just now. FYI florican and lapwing showed: 2022-08-05 01:04:29.903 EDT [34485:5] FATAL: deduplication failed to add heap tid to pending posting list 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT: WAL redo at 0/49708D8 for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1
Re: BTMaxItemSize seems to be subtly incorrect
On Thu, Aug 4, 2022 at 10:25 PM Thomas Munro wrote: > FYI florican and lapwing showed: > > 2022-08-05 01:04:29.903 EDT [34485:5] FATAL: deduplication failed to > add heap tid to pending posting list > 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT: WAL redo at 0/49708D8 > for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1 This very likely has something to do with the way nbtdedup.c uses BTMaxItemSize(), which apparently won't work on these 32-bit systems now. I'll fix this tomorrow morning. -- Peter Geoghegan
Re: Support logical replication of DDLs
Hi Hou-san, here are my review comments for the patch v15-0001: == 1. Commit Message CREATE/ALTER/DROP TABLE (*) At first, I thought "(*)" looks like a SQL syntax element. SUGGESTION: CREATE/ALTER/DROP TABLE - - Note #1, Note #2 ... Note #1 – blah blah Note #2 – yada yada == 2. src/backend/commands/ddl_deparse.c - General 2.1 Lots of the deparse_XXX function are in random places scattered around in this module. Since there are so many, I think it's better to have functions arrange alphabetically to make them easier to find. (And if there are several functions that logically "belong" together then those should be re-named so they will be group together alphabetically... Same applies to other functions – not just the deparse_XXX ones 2.2 There are lots of 'tmp' (or 'tmp2') variables in this file. Sometimes 'tmp' is appropriate (or maybe 'tmpobj' would be better) but in other cases it seems like there should be a better name than 'tmp'. Please search all these and replace where you can use a more meaningful name than tmp. 2.3 Pointer NULL comparisons are not done consistently all through the file. E.g. Sometimes you do tree->fmtinfo == NULL, but in others you do like if (!tree->fmtinfo). It's no big deal whichever way you want to use, but at least for the comparisons involving the same variables IMO should use the same style consistently. ~~~ 3. src/backend/commands/ddl_deparse.c - format_type_detailed 3.1 + * - typename is set to the type name, without quotes But the param is called 'typname', not 'typename' 3.2 + * - typmod is set to the typemod, if any, as a string with parens I think you mean: "typmod is set" -> "typemodstr is set" 3.3 + if (IsTrueArrayType(typeform) && + typeform->typstorage != TYPSTORAGE_PLAIN) + { + /* Switch our attention to the array element type */ + ReleaseSysCache(tuple); + tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(array_base_type)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for type %u", type_oid); + + typeform = (Form_pg_type) GETSTRUCT(tuple); + type_oid = array_base_type; + *typarray = true; + } + else + *typarray = false; Maybe this if/else can be simplified *typarray = IsTrueArrayType(typeform) && typeform->typstorage != TYPSTORAGE_PLAIN; if (*typarray) { ... } 3.4 + /* otherwise, WITH TZ is added by typmod. */ Uppercase comment ~~~ 4. src/backend/commands/ddl_deparse.c - append_object_to_format_string + for (cp = sub_fmt; cp < end_ptr; cp++) + { + if (*cp == '{') + { + start_copy = true; + continue; + } What's this logic going to do if it encounters "{{" - it looks like it will just keep going but wouldn't that be a name error to have a "{" in it? ~~~ 5. src/backend/commands/ddl_deparse.c - append_bool_object +append_bool_object(ObjTree *tree, char *sub_fmt, bool value) +{ + ObjElem *param; + char *object_name = sub_fmt; + bool is_present_flag = false; + + Assert(sub_fmt); + + if (strcmp(sub_fmt, "present") == 0) + { + is_present_flag = true; + tree->present = value; + } + + if (!verbose && !tree->present) + return; + + if (!is_present_flag) + object_name = append_object_to_format_string(tree, sub_fmt); + + param = new_object(ObjTypeBool, object_name); + param->value.boolean = value; + append_premade_object(tree, param); +} It feels like there is some subtle trickery going on here with the conditions. Is there a simpler way to write this, or maybe it is just lacking some explanatory comments? ~~~ 6. 
src/backend/commands/ddl_deparse.c - append_array_object + if (!verbose) + { + ListCell *lc; + + /* Extract the ObjElems whose present flag is true */ + foreach(lc, array) + { + ObjElem *elem = (ObjElem *) lfirst(lc); + + Assert(elem->objtype == ObjTypeObject); + + if (!elem->value.object->present) + array = foreach_delete_current(array, lc); + } + + if (list_length(array) == 0) + return; + } Maybe it is OK as-is. I'm just wondering if this list_length(array) check should be outside of the !verbose check? ~~~ 7. src/backend/commands/ddl_deparse.c - objtree_to_jsonb_element + case ObjTypeObject: + /* recursively add the object into the existing parse state */ Uppercase comment ~~~ 8. src/backend/commands/ddl_deparse.c - new_objtree_for_qualname_id 8.1 + * + * Elements "schemaname" and "objname" are set. If the object is a temporary + * object, the schema name is set to "pg_temp". I'm not sure if this is the right place to say this, since it is not really this function that sets that "pg_temp". 8.2 + if (isnull) + elog(ERROR, "unexpected NULL namespace"); + objname = heap_getattr(catobj, Anum_name, RelationGetDescr(catalog), +&isnull); Missing blank line after the elog? ~~~ 9. src/backend/commands/ddl_deparse.c - deparse_ColumnIdentity + /* Definition elemets */ typo "elemets" ~~~ 10. src/backend/commands/ddl_deparse.c - deparse_CreateTrigStmt 10.1 + else + elog(ERROR, "unrecognized trigger timing value %d", node->timing); should that say "unrecognized trigger timing type" (e.g.