Re: Vacuum statistics
> While backwards compatibility is important, there's definitely precedent for changing what shows up in the catalog. IMHO it's better to bite the bullet and move those fields instead of having vacuum stats spread across two different views.

Correct. The most recent one that I could think of is pg_stat_checkpointer, which pulled the checkpoint-related columns from pg_stat_bgwriter. In that case, though, these are distinct background processes and the distinction is clear. In this case I am not so sure, particularly because we would then have the autoanalyze and autovacuum fields in different views, which could be more confusing to users than saying that pg_stat_all_tables has high-level metrics about vacuum and analyze and, for more details on vacuum, refer to pg_stat_vacuum_tables (or whatever name we settle on).

Regards,

Sami
Re: Fwd: Re: A new look at old NFS readdir() problems?
On 01/02/2025 3:42 pm, Thomas Munro wrote:
> On Fri, Jan 3, 2025 at 10:16 AM Larry Rosenman wrote:
>> What about doing what Rick suggests?
>>
>> do
>> {
>>     dir = opendir("X");
>>     dp = readdir(dir);
>>     if (dp != NULL)
>>         unlink(dp->d_name);
>>     closedir(dir);
>> } while (dp != NULL);
>>
>> ?
>
> That only works for unlink loops where we expect no concurrent modifications. DROP DATABASE qualifies, but we use readdir() on directories that might be changing in various other places, including backups. They'd silently fail to do their job, and can't use that technique as they aren't unlinking as they go and so they would never make progress. Which leads to the lipstick/pig conclusion: we'd have other, worse, buggy behaviour still. It might be possible to make our own readdir snapshot thing that checks a shmem counter of all links/unlinks before and after and retries. But ugh...

Would it make sense for ONLY drop database to have the above loop?

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
Re: WAL-logging facility for pgstats kinds
Hi,

On 2024-12-27 12:32:25 +0900, Michael Paquier wrote:
> While brainstorming on the contents of the thread I have posted a couple of days ago, I have been looking at what could be done so as pgstats and WAL-logging could work together. This was point 2) from this message:
> https://www.postgresql.org/message-id/z2tblemfuofzy...@paquier.xyz
>
> While considering the point that not all stats data is worth replicating, I have fallen down to the following properties that are nice to have across the board:
> - A pgstats kind should be able to WAL-log some data that it should be able to decide. Including versioning of the data.

I don't really think that's right. For cumulative stats to make sense on both a primary and a replica, or on a primary after a crash, they need to cover things that *already* are WAL-logged.

E.g. for pg_stat_all_tables.n_{tup_{ins,upd,del,hot_upd},live_tup,dead_tup, ...} - which are important for autovacuum scheduling - we should use the WAL records covering those changes to re-assemble the stats during WAL replay. The only reason that that's not trivial is that we don't have access to the relation oid during crash recovery. Which in turn is why I think we should change relation-level stats to be keyed by relfilenode, rather than relation oid.

I can't think of a real case where we would want to WAL-log the stats themselves, rather than re-emitting stats during replay based on the WAL record of the "underlying object". Do you have counter-examples?

Greetings,

Andres Freund
Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
Hi Shubham,

Here are some review comments for patch v5-0001.

======
doc/src/sgml/ref/pg_createsubscriber.sgml

1.
+
+    The source server must have <xref linkend="guc-max-slot-wal-keep-size"/> to
+    be set to -1 to prevent the automatic removal of WAL
+    replication slots. Setting this parameter to files needed by a specific size
+    may lead to replication failures if required WAL files are
+    prematurely deleted.
+

IIUC, this problem is all about the removal of WAL *files*, not "WAL replication slots".

Also, the "Setting this parameter to files needed by a specific size" part sounds strange.

I think this can be simplified anyhow like below.

SUGGESTION:
Replication failures can occur if required WAL files are prematurely deleted. To prevent this, the source server must set <xref linkend="guc-max-slot-wal-keep-size"/> to -1, ensuring WAL files are not automatically removed.

======
src/bin/pg_basebackup/pg_createsubscriber.c

check_publisher:

2.
+ if (dry_run && max_slot_wal_keep_size != -1)
+ {
+ pg_log_warning("publisher requires WAL size must not be restricted");
+ pg_log_warning_detail("The max_slot_wal_keep_size parameter is currently set to a non-default value, which may lead to replication failures. "
+ "This parameter must be set to -1 to ensure that required WAL files are not prematurely removed.");
+ }

2a.
Whenever you mention a GUC name in a PostgreSQL message then (by convention) it must be double-quoted.

~

2b.
The detailed message seems verbose and repetitive to me.

~

2c.
The other nearby messages use the terminology "configuration parameter", so this should be the same.

~

2d.
The other nearby messages use \"%s\" substitution for the GUC name, so this should be the same.

~

2e.
IMO advising the user to change a configuration parameter should be done using the pg_log_warning_hint function (e.g. _hint, not _detail).

~~

Result:

CURRENT (pg_log_warning_detail)
The max_slot_wal_keep_size parameter is currently set to a non-default value, which may lead to replication failures. This parameter must be set to -1 to ensure that required WAL files are not prematurely removed.

SUGGESTION (pg_log_warning_hint)
Set the configuration parameter \"%s\" to -1 to ensure that required WAL files are not prematurely removed.

======
Kind Regards,
Peter Smith.
Fujitsu Australia
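[Editorial note: to make comment 2 concrete, here is a minimal sketch of how the quoted check_publisher() hunk could look after applying suggestions 2a-2e. This is only an illustration of the review suggestions, not the actual patch; the surrounding variables (dry_run, max_slot_wal_keep_size) are taken from the quoted hunk, and pg_log_warning()/pg_log_warning_hint() are the frontend logging helpers from common/logging.h.]

    /* Hypothetical rework of the quoted hunk, following comments 2a-2e. */
    if (dry_run && max_slot_wal_keep_size != -1)
    {
        pg_log_warning("publisher requires \"%s\" to be set to -1",
                       "max_slot_wal_keep_size");
        pg_log_warning_hint("Set the configuration parameter \"%s\" to -1 to ensure that required WAL files are not prematurely removed.",
                            "max_slot_wal_keep_size");
    }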
Re: Strange issue with NFS mounted PGDATA on ugreen NAS
I wrote: > I forgot to report back, but yesterday I spent time unsuccessfully > trying to reproduce the problem with macOS client and NFS server > using btrfs (a Synology NAS running some no-name version of Linux). Also, I *can* reproduce it using the same NFS server and a FreeBSD 14.2 client. At least with this pair of machines, the behavior seems deterministic: "createdb foo" followed by "dropdb foo" leaves the same set of not-dropped files behind each time. I added a bit of debug logging to rmtree.c and verified that it's not seeing anything odd happening, except that readdir never returns anything about the missed files. I will file a bug report, unless you already did? regards, tom lane
Re: apply_scanjoin_target_to_paths and partitionwise join
On Thu, Jan 2, 2025 at 4:41 AM Ashutosh Bapat wrote:
> A join between partitions is pushed down if only partitionwise join is chosen, and a join between partitions won't be pushed down if partitionwise join is not chosen. Hence this bug affects pushdown as well.
>
> The CF entry shows as waiting for author. But that isn't the right status. Will change it to needs review. I think we need a consensus as to whether we want to fix this bug or not. Since this bug doesn't affect me anymore, I will just withdraw this CF entry if there is no interest.

I think it's unhelpful that you keep calling this a "bug" when the behavior is clearly deliberate. Whether it's the *right* behavior is debatable, but it's not *accidental* behavior.

I don't actually have a clear understanding of why we need this. In https://www.postgresql.org/message-id/CAKZiRmyaFFvxyEYGG_hu0F-EVEcqcnveH23MULhW6UY_jwykGw%40mail.gmail.com Jakub says that an EDB customer experienced a case where the partitionwise plan took 530+s and the non-partitionwise plan took 23s, but unfortunately there's no public test case, and in the examples shared publicly, either the partitionwise plan is actually slower but is mistakenly estimated to be faster, or the two are extremely close to the same speed so it doesn't really matter. So the customer scenario (which is not public) is justification for a code change, but the publicly posted examples, as far as I can see, are not.

And what confuses me is -- what could that test case possibly look like? I mean, how can it be less efficient to perform N small joins than 1 big join? For instance, suppose we're doing a merge join between A and B (partitions A_1..A_N and B_1..B_N) and we have to sort all the data. With a partitionwise join, we have to do 2N sorts of partitions of some size, let's say K. The cost should be O(2N * K lg K). If we instead do two really big sorts, the cost is now O(2 * (NK) lg (NK)), which is more.

If we do a hash join, the cost should be about the same either way, because probing a hash table is roughly constant time. However, if we do N small hash joins, the hash table is a lot more likely to fit in memory -- and if the big hash table does not fit in memory and the small hash tables do, we should win big.

Finally, let's say we're doing a nested loop. If the inner side of the nested loop is unparameterized, then the cost of the non-partitionwise nested loop is O(N^2 * K^2), while the cost of the partitionwise nested loop is O(N * K^2), which is a huge win. If the inner side is parameterized, then the partitionwise plan involves scanning one partition for matching values for each outer row, whereas the non-partitionwise plan involves scanning every partition for matching values for each outer row, which is again clearly worse.

I'm obviously missing something here, because I'm sure Jakub is quite right when he says that this actually happened and actually hosed an EDB customer. But I don't understand HOW it happened, and I think if we're going to change the code we really ought to understand that and write some code comments about it. In general, I think that it's very reasonable to expect that a bunch of small joins will beat one big join, which is why the code does what it currently does.

--
Robert Haas
EDB: http://www.enterprisedb.com
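[Editorial note: the merge-join arithmetic in the message above can be written out explicitly, using the same N partitions of K rows per side as in the example. This is only a restatement of the reasoning, not new analysis.]

    C_{\text{partitionwise}} = 2N \cdot K \log_2 K
    C_{\text{single}} = 2NK \log_2 (NK) = 2NK\left(\log_2 N + \log_2 K\right)
    C_{\text{single}} - C_{\text{partitionwise}} = 2NK \log_2 N > 0 \quad (N > 1)

So on sort cost alone the single big join can only be more expensive, which is why the reported regression is puzzling and suggests the slowdown came from somewhere else (estimation error, planning time, or memory behaviour), as discussed later in the thread.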
Re: Add the ability to limit the amount of memory that can be allocated to backends.
> On Dec 31, 2024, at 5:41 PM, Tomas Vondra wrote:
>
> On 12/31/24 21:46, Jim Nasby wrote:
>> On Dec 30, 2024, at 7:05 PM, James Hunter wrote:
>>>
>>> On Sat, Dec 28, 2024 at 11:24 PM Jim Nasby wrote:
>>>> IMHO none of this will be very sane until we actually have cluster-level limits. One sudden burst in active connections and you still OOM the instance.
>>>
>>> Fwiw, PG does support "max_connections" GUC, so a backend/connection-level limit, times "max_connections", yields a cluster-level limit.
>>
>> max_connections is useless here, for two reasons:
>>
>> 1. Changing it requires a restart. That's at *best* a real PITA in production. [1]
>> 2. It still doesn't solve the actual problem. Unless your workload *and* your data are extremely homogeneous you can't simply limit the number of connections and call it a day. A slight change in incoming queries, OR in the data that the queries are looking at, and you go from running fine to meltdown. You don't even need a plan flip for this to happen, just the same plan run at the same rate but now accessing more data than before.
>>
>
> I really don't follow your argument ...
>
> Yes, changing max_connections requires a restart - so what? AFAIK the point James was making is that if you multiply max_connections by the per-backend limit, you get a cluster-wide limit. And presumably the per-backend limit would be a GUC not requiring a restart.
>
> Yes, high values of max_connections are problematic. I don't see how a global limit would fundamentally change that. In fact, it could introduce yet more weird failures because some unrelated backend did something weird.

That's basically my argument for having workload management. If a system becomes loaded enough for the global limit to start kicking in, it's likely that query response time is increasing, which means you will soon have more and more active backends trying to run queries. That's just going to make the situation even worse. You'd either have to start trying to "take memory away" from already running backends, or backends that are just starting would have such a low limit as to cause them to spill very quickly, creating further load on the system.

> FWIW I'm not opposed to having some global memory limit, but as I explained earlier, I don't see a way to do that sensibly without having a per-backend limit first. Because if you have a global limit, a single backend consuming memory could cause all kinds of weird failures in random other backends.

I agree, but I'm also not sure how much a per-backend limit would actually help on its own, especially in OLTP environments.

>> Most of what I've seen on this thread is discussing ways to *optimize* how much memory the set of running backends can consume. Adjusting how you slice the memory pie across backends, or even within a single backend, is optimization. While that's a great goal that I do support, it will never fully fix the problem. At some point you need to either throw your hands in the air and start tossing memory errors, because you don't have control over how much work is being thrown at the engine. The only way that the engine can exert control over that would be to hold new transactions from starting when the system is under duress (ie, workload management).
>>
>> While workload managers can be quite sophisticated (aka, complex), the nice thing about limiting this scope to work_mem, and only as a means to prevent complete overload, is that the problem becomes a lot simpler since you're only looking at one metric and not trying to support any kind of priority system. The only fanciness I think an MVP would need is a GUC to control how long a transaction can sit waiting before it throws an error. Frankly, that sounds a lot less complex and much easier for DBAs to adjust than trying to teach the planner how to apportion out per-node work_mem limits.
>>
>> As I said, I'm not opposed to optimizations, I just think they're very much cart-before-the-horse.
>>
>
> What optimization? I didn't notice anything like that. I don't see how "adjusting how you slice the memory pie across backends" counts as an optimization. I mean, that's exactly what a memory limit is meant to do.
>
> Similarly, there was a proposal to do planning with work_mem, and then go back and adjust the per-node limits to impose a global limit. That does not seem like an optimization either ... (more an opposite of it).

It's optimization in that you're trying to increase how many active backends you can have before getting memory errors. It's an alternative to throwing more memory at the problem or limiting the rate of incoming workload.

>> 1: While it'd be a lot of work to make max_connections dynamic, one thing we could do fairly easily would be to introduce another GUC (max_backends?) that actually controls the total numbe
Re: Logical Replication of sequences
Hi Vignesh.

Here are some review comments for patch v20241230-0002

======

1. SYNTAX

The proposed syntax is currently:

CREATE PUBLICATION name
    [ FOR ALL object_type [, ...]
      | FOR publication_object [, ... ] ]
    [ WITH ( publication_parameter [= value] [, ... ] ) ]

where object_type is one of:

    TABLES
    SEQUENCES

where publication_object is one of:

    TABLE [ ONLY ] table_name [ * ] [ ( column_name [, ... ] ) ] [ WHERE ( expression ) ] [, ... ]
    TABLES IN SCHEMA { schema_name | CURRENT_SCHEMA } [, ... ]

~

But lately, I've been thinking it could be clearer if you removed the object_type and instead fully spelled out FOR ALL TABLES and/or FOR ALL SEQUENCES.

compare
CREATE PUBLICATION FOR ALL TABLES, SEQUENCES;
versus
CREATE PUBLICATION FOR ALL TABLES, ALL SEQUENCES;

~

Also AFAICT, the current syntax says it is impossible to mix FOR ALL SEQUENCES with FOR TABLE etc, but really that *should* be allowed, right?

And it looks like you may come to similar grief in future if you try things like:
"FOR ALL TABLES" mixed with "FOR SEQUENCE seq_name"
"FOR ALL TABLES" mixed with "FOR SEQUENCES IN SCHEMA schema_name"

~

So, maybe a revised syntax like below would end up being easier and also more flexible:

CREATE PUBLICATION name
    [ FOR publication_object [, ... ] ]
    [ WITH ( publication_parameter [= value] [, ... ] ) ]

where publication_object is one of:

    ALL TABLES
    ALL SEQUENCES
    TABLE [ ONLY ] table_name [ * ] [ ( column_name [, ... ] ) ] [ WHERE ( expression ) ] [, ... ]
    TABLES IN SCHEMA { schema_name | CURRENT_SCHEMA } [, ... ]

======
src/backend/commands/publicationcmds.c

CreatePublication:

2.
- /* FOR ALL TABLES requires superuser */
- if (stmt->for_all_tables && !superuser())
+ /* FOR ALL TABLES or FOR ALL SEQUENCES requires superuser */
+ if ((stmt->for_all_tables || stmt->for_all_sequences) && !superuser())
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to create FOR ALL TABLES publication")));
+ errmsg("must be superuser to create ALL TABLES and/or ALL SEQUENCES publication")));

2a.
Typo.
/create ALL TABLES and/or ALL SEQUENCES publication/create a FOR ALL TABLES and/or a FOR ALL SEQUENCES publication/

~

2b.
This message might be OK now, but I suspect it will become very messy in future after you introduce another syntax like "FOR SEQUENCE seq_name" etc (which would also be able to be used in combination with a FOR ALL TABLES).

So, I think that for future-proofing against all the possible (future) combinations, and for keeping the code cleaner, it will be far simpler to just keep the errors for tables and sequences separated:

SUGGESTION:
if (!superuser())
{
    if (stmt->for_all_tables)
        ereport(ERROR, ... FOR ALL TABLES ...);
    if (stmt->for_all_sequences)
        ereport(ERROR, ... FOR ALL SEQUENCES ...);
}

~~~

AlterPublicationOwner_internal:

3.
- if (form->puballtables && !superuser_arg(newOwnerId))
+ /* FOR ALL TABLES or FOR ALL SEQUENCES requires superuser */
+ if ((form->puballtables || form->puballsequences) &&
+ !superuser_arg(newOwnerId))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied to change owner of publication \"%s\"",
  NameStr(form->pubname)),
- errhint("The owner of a FOR ALL TABLES publication must be a superuser.")));
+ errhint("The owner of ALL TABLES and/or ALL SEQUENCES publication must be a superuser.")));

Ditto the above comment #2.

======
src/bin/psql/describe.c

4.
+ puboid_col = cols++;
+ pubname_col = cols++;
+ pubowner_col = cols++;
+ puballtables_col = cols++;
+
+ if (has_pubsequence)
+ {
+ appendPQExpBufferStr(&buf,
+ ", puballsequences");
+ puballsequences_col = cols++;
+ }
+
+ appendPQExpBufferStr(&buf,
+ ", pubinsert, pubupdate, pubdelete");
+ pubins_col = cols++;
+ pubupd_col = cols++;
+ pubdel_col = cols++;
+
 if (has_pubtruncate)
+ {
 appendPQExpBufferStr(&buf, ", pubtruncate");
+ pubtrunc_col = cols++;
+ }
+
 if (has_pubgencols)
+ {
 appendPQExpBufferStr(&buf, ", pubgencols");
+ pubgen_col = cols++;
+ }
+
 if (has_pubviaroot)
+ {
 appendPQExpBufferStr(&buf, ", pubviaroot");
+ pubviaroot_col = cols++;
+ }

There is some overlap/duplication of the new variable 'cols' and the existing variable 'ncols'. AFAICT you can just move/replace the declaration of 'ncols' to where 'cols' is declared, and then you can remove the duplicated code below (because the above code is already doing the same thing).

if (has_pubtruncate)
    ncols++;
if (has_pubgencols)
    ncols++;
if (has_pubviaroot)
    ncols++;
if (has_pubsequence)
    ncols++;

======
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Conflict detection for update_deleted in logical replication
On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu) wrote:
>
> Attach the new version patch set which addressed all other comments.
>

Some more miscellaneous comments:
=============================
1.
@@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
 * modifying it. This makes checkpoint's determination of which xacts
 * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
 */
- Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
 START_CRIT_SECTION();
- MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+ MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;

 /*
 * Insert the commit XLOG record.
@@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
 */
 if (markXidCommitted)
 {
- MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
 END_CRIT_SECTION();

The comments related to this change should be updated in EndPrepare() and RecordTransactionCommitPrepared(). They still refer to the DELAY_CHKPT_START flag. We should update the comments explaining why a similar change is not required for prepare or commit_prepared, if there is one.

2.
 static bool
 tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
- TypeCacheEntry **eq)
+ TypeCacheEntry **eq, Bitmapset *columns)
 {
 int attrnum;
@@ -337,6 +340,14 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
 if (att->attisdropped || att->attgenerated)
 continue;

+ /*
+ * Ignore columns that are not listed for checking.
+ */
+ if (columns &&
+ !bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber,
+ columns))
+ continue;

Update the comment atop tuples_equal to reflect this change.

3.
+FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
+ TransactionId *delete_xid,
+ RepOriginId *delete_origin,
+ TimestampTz *delete_time)
...
...
+ /* Try to find the tuple */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
+ {
+ bool dead = false;
+ TransactionId xmax;
+ TimestampTz localts;
+ RepOriginId localorigin;
+
+ if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
+ continue;
+
+ tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
+ buf = hslot->buffer;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) == HEAPTUPLE_RECENTLY_DEAD)
+ dead = true;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!dead)
+ continue;

Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find the live tuple; now, whether the tuple is DEAD or RECENTLY_DEAD, why should it matter for detecting the update_delete conflict?

4.
In FindMostRecentlyDeletedTupleInfo(), add comments to state why we need to use SnapshotAny.

5.
+
+
+detect_update_deleted (boolean)
+
+
+ Specifies whether the detection of
+ is enabled. The default is false. If set to
+ true, the dead tuples on the subscriber that are still useful for
+ detecting
+ are retained,

One of the purposes of retaining dead tuples is to detect the update_delete conflict. But, I also see the following in 0001's commit message:

"Since the mechanism relies on a single replication slot, it not only assists in retaining dead tuples but also preserves commit timestamps and origin data. These information will be displayed in the additional logs generated for logical replication conflicts. Furthermore, the preserved commit timestamps and origin data are essential for consistently detecting update_origin_differs conflicts."

which indicates there are other cases where retaining dead tuples can help. So, I was thinking about whether to name this new option as retain_dead_tuples or something along those lines?

BTW, it is not clear how retaining dead tuples will help the detection of update_origin_differs. Will it happen when the tuple is inserted or updated on the subscriber and then, when we try to update the same tuple due to a remote update, the commit_ts information of the xact is not available because the same is already removed by vacuum? This should happen for the update case for the new row generated by the update operation, as that will be used in comparison. Can you please show it with a test case, even if it is manual? Can't it happen for delete_origin_differs as well for the same reason?

6.
I feel we should keep 0004 as a later patch. We can ideally consider committing 0001, 0002, 0003, 0005, and 0006 (or part of 0006 to get some tests that are relevant) as one unit and then the patch to detect and report the update_delete conflict. What do you think?

--
With Regards,
Amit Kapila.
Re: Fwd: Re: A new look at old NFS readdir() problems?
Thomas Munro writes: > For what little it's worth, I'm not quite convinced yet that FreeBSD's > client isn't more broken than it needs to be. I'm suspicious of that too. The wireshark trace you described is hard to read any other way than that FreeBSD went out of its way to deliver incorrect information. I'm prepared to believe that we can't work correctly on NFS servers that don't do the stable-cookie thing, but why isn't it succeeding on ones that do? regards, tom lane
Re: Fwd: Re: A new look at old NFS readdir() problems?
I wrote: > Thomas Munro writes: >> For what little it's worth, I'm not quite convinced yet that FreeBSD's >> client isn't more broken than it needs to be. > I'm suspicious of that too. I poked at this a little further. I made the attached stand-alone test case (you don't need any more than "cc -o rmtree rmtree.c" to build it, then point the script at some NFS-mounted directory). This fails with my NAS at least as far back as FreeBSD 11.0. I also tried it on NetBSD 9.2 which seems fine. regards, tom lane #! /bin/sh set -e TESTDIR="$1" mkdir "$TESTDIR" i=0 while [ $i -lt 1000 ] do touch "$TESTDIR/$i" i=`expr $i + 1` done ./rmtree "$TESTDIR" /*- * * rmtree.c * * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION * src/common/rmtree.c * *- */ #include #include #include #include #include #include #include #include #include typedef enum PGFileType { PGFILETYPE_ERROR, PGFILETYPE_UNKNOWN, PGFILETYPE_REG, PGFILETYPE_DIR, PGFILETYPE_LNK, } PGFileType; static void * palloc(size_t size) { void *tmp; /* Avoid unportable behavior of malloc(0) */ if (size == 0) size = 1; tmp = malloc(size); if (tmp == NULL) { fprintf(stderr, "out of memory\n"); exit(1); } return tmp; } static void * repalloc(void *ptr, size_t size) { void *tmp; /* Avoid unportable behavior of realloc(NULL, 0) */ if (ptr == NULL && size == 0) size = 1; tmp = realloc(ptr, size); if (!tmp) { fprintf(stderr, "out of memory\n"); exit(1); } return tmp; } static char * pstrdup(const char *in) { char *tmp; if (!in) { fprintf(stderr, "cannot duplicate null pointer (internal error)\n"); exit(1); } tmp = strdup(in); if (!tmp) { fprintf(stderr, "out of memory\n"); exit(1); } return tmp; } /* * Return the type of a directory entry. */ static PGFileType get_dirent_type(const char *path, const struct dirent *de, bool look_through_symlinks) { PGFileType result; /* * Some systems tell us the type directly in the dirent struct, but that's * a BSD and Linux extension not required by POSIX. Even when the * interface is present, sometimes the type is unknown, depending on the * filesystem. */ #if defined(DT_REG) && defined(DT_DIR) && defined(DT_LNK) if (de->d_type == DT_REG) result = PGFILETYPE_REG; else if (de->d_type == DT_DIR) result = PGFILETYPE_DIR; else if (de->d_type == DT_LNK && !look_through_symlinks) result = PGFILETYPE_LNK; else result = PGFILETYPE_UNKNOWN; #else result = PGFILETYPE_UNKNOWN; #endif if (result == PGFILETYPE_UNKNOWN) { struct stat fst; int sret; if (look_through_symlinks) sret = stat(path, &fst); else sret = lstat(path, &fst); if (sret < 0) { result = PGFILETYPE_ERROR; fprintf(stderr, "could not stat file \"%s\": %m\n", path); } else if (S_ISREG(fst.st_mode)) result = PGFILETYPE_REG; else if (S_ISDIR(fst.st_mode)) result = PGFILETYPE_DIR; else if (S_ISLNK(fst.st_mode)) result = PGFILETYPE_LNK; } return result; } /* * rmtree * * Delete a directory tree recursively. * Assumes path points to a valid directory. * Deletes everything under path. * If rmtopdir is true deletes the directory too. * Returns true if successful, false if there was any problem. * (The details of the problem are reported already, so caller * doesn't really have to say anything more, but most do.) 
*/ static bool rmtree(const char *path, bool rmtopdir) { char pathbuf[8192]; DIR *dir; struct dirent *de; bool result = true; size_t dirnames_size = 0; size_t dirnames_capacity = 8; char **dirnames; dir = opendir(path); if (dir == NULL) { fprintf(stderr, "could not open directory \"%s\": %m\n", path); return false; } dirnames = (char **) palloc(sizeof(char *) * dirnames_capacity); while (errno = 0, (de = readdir(dir))) { if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) continue; snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name); switch (get_dirent_type(pathbuf, de, false)) { case PGFILETYPE_ERROR: /* already logged, press on */ break; case PGFILETYPE_DIR: /* * Defer recursion until after we've closed this directory, to * avoid using more than one file descriptor at a time. */ if (dirnames_size == dirnames_capacity) { dirnames = repalloc(dirnames, sizeof(char *) * dirnames_capacity * 2); dirnames_capacity *= 2; } dirnames[dirnames_size++] = pstrdup(pathbuf); break; default: if (unlink(pathbuf) != 0 && errno != ENOENT) { fprintf(stderr, "could not remove file \"%s\": %m\n", pathbuf); result = false; } break; } } if (errno != 0) {
Re: Conflict detection for update_deleted in logical replication
On Tue, Dec 24, 2024 at 6:43 PM Zhijie Hou (Fujitsu) wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02. ErrorOnReservedSlotName()
> >
> > Currently the function is called from three points - create_physical_replication_slot(), create_logical_replication_slot() and CreateReplicationSlot(). Can we move them to the ReplicationSlotCreate(), or combine into ReplicationSlotValidateName()?
>
> I am not sure about moving the check into these functions because that would prevent the launcher from creating the slot as well unless we add a new parameter for these functions, but I am not sure if it's worth it at this stage.
>
> > 03. advance_conflict_slot_xmin()
> >
> > Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> >
> > Assuming the case that the launcher crashed just after ReplicationSlotCreate(CONFLICT_DETECTION_SLOT). After the restart, the slot can be acquired since SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT) is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as EPHEMERAL during creation and persist it after initializing, so changed like that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be better if the apply worker can avoid reporting an ERROR if the publisher's clock lags behind that of the subscriber, so I implemented a new 0007 patch to allow the apply worker to wait for the clock skew to pass and then send a new request to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all the committing transactions have finished, it means any subsequent transactions on the publisher should be assigned a commit timestamp later than reply_time. And the (candidate_xid_time) when it determines the oldest active xid. Any old transactions on the publisher that have finished should have a commit timestamp earlier than the candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If candidate_xid_time is less than the reply_time, then it's OK to advance the xid immediately. If candidate_xid_time is greater than reply_time, it means the clock of the publisher is behind that of the subscriber, so the apply worker can wait for the skew to pass before advancing the xid.
>
> Since this is considered as an improvement, we can focus on this after pushing the main patches.
>

Thank you for updating the patches! I have one comment on the 0001 patch:

+	/*
+	 * The changes made by this and later transactions are still non-removable
+	 * to allow for the detection of update_deleted conflicts when applying
+	 * changes in this logical replication worker.
+	 *
+	 * Note that this info cannot directly protect dead tuples from being
+	 * prematurely frozen or removed. The logical replication launcher
+	 * asynchronously collects this info to determine whether to advance the
+	 * xmin value of the replication slot.
+	 *
+	 * Therefore, FullTransactionId that includes both the transaction ID and
+	 * its epoch is used here instead of a single Transaction ID. This is
+	 * critical because without considering the epoch, the transaction ID
+	 * alone may appear as if it is in the future due to transaction ID
+	 * wraparound.
+	 */
+	FullTransactionId oldest_nonremovable_xid;

The last paragraph of the comment mentions that we need to use FullTransactionId to properly compare XIDs even after the XID wraparound happens. But once we set the oldest-nonremovable-xid, it prevents XIDs from being wrapped around, no? I mean that workers' oldest-nonremovable-xid values and the slot's non-removal xid (i.e., its xmin) are never apart by more than 2^31 XIDs.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Fwd: Re: A new look at old NFS readdir() problems?
Larry Rosenman writes: > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on > my issue. Thanks for reaching out to him. So if I'm reading this correctly, there's little point in filing a FreeBSD bug because it'll be dismissed as unfixable. This leaves us in rather a nasty position. Sure, we could rewrite rmtree() as Thomas suggested upthread, but I'm still of the opinion that that's smearing lipstick on a pig. rmtree() is the least of our worries: it doesn't need to expect that anybody else will be modifying the target directory, plus it can easily restart its scan without complicated bookkeeping. I doubt we can make such an assumption for all our uses of readdir(), or that it's okay to miss or double-process files in every one of them. I'm still of the opinion that the best thing to do is disclaim safety of storing a database on NFS. regards, tom lane
Re: apply_scanjoin_target_to_paths and partitionwise join
Robert Haas writes: > I'm obviously missing something here, because I'm sure Jakub is quite > right when he says that this actually happened and actually hosed an > EDB customer. But I don't understand HOW it happened, and I think if > we're going to change the code we really ought to understand that and > write some code comments about it. In general, I think that it's very > reasonable to expect that a bunch of small joins will beat one big > join, which is why the code does what it currently does. I am wondering if the problem is not that the plan is slower, it's that for some reason the planner took a lot longer to create it. It's very plausible that partitionwise planning takes longer, and maybe we have some corner cases where the time is O(N^2) or worse. However, this is pure speculation without a test case, and any proposed fix would be even more speculative. I concur with your bottom line: we should insist on a public test case before deciding what to do about it. regards, tom lane
Re: Fwd: Re: A new look at old NFS readdir() problems?
On Thu, Jan 2, 2025 at 3:49 PM Tom Lane wrote: > I'm still of the opinion that the best thing to do is disclaim > safety of storing a database on NFS. If we're going to disclaim support for NFS, it would certainly be better to do that clearly and with reasons than otherwise. However, I suspect a substantial number of users will still use NFS and then blame us when it doesn't work; or alternatively say that our software sucks because it doesn't work on NFS. Neither of which are very nice outcomes. -- Robert Haas EDB: http://www.enterprisedb.com
Re: apply_scanjoin_target_to_paths and partitionwise join
On Thu, Jan 2, 2025 at 3:58 PM Tom Lane wrote: > I am wondering if the problem is not that the plan is slower, it's > that for some reason the planner took a lot longer to create it. > It's very plausible that partitionwise planning takes longer, and > maybe we have some corner cases where the time is O(N^2) or worse. That doesn't seem like a totally unreasonable speculation, but it seems a little surprising that retaining the non-partitionwise paths would fix it. True, that might let us discard a bunch of partitionwise paths more quickly than would otherwise be possible, but I wouldn't expect that to have an impact as dramatic as what Jakub alleged. The thing I thought about was whether there might be some weird effects with lots of empty partitions; or maybe with some other property of the path like say sort keys or parallelism. For example if we couldn't generate a partitionwise path with sort keys as good as the non-partitionwise path had, or if we couldn't generate a parallel partitionwise path but we could generate a parallel non-partitionwise path. As far as I knew neither of those things are real problems, but if they were then I believe they could pretty easily explain a large regression. > However, this is pure speculation without a test case, and any > proposed fix would be even more speculative. I concur with your > bottom line: we should insist on a public test case before deciding > what to do about it. Yeah. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fwd: Re: A new look at old NFS readdir() problems?
Robert Haas writes: > On Thu, Jan 2, 2025 at 3:49 PM Tom Lane wrote: >> I'm still of the opinion that the best thing to do is disclaim >> safety of storing a database on NFS. > If we're going to disclaim support for NFS, it would certainly be > better to do that clearly and with reasons than otherwise. However, I > suspect a substantial number of users will still use NFS and then > blame us when it doesn't work; or alternatively say that our software > sucks because it doesn't work on NFS. Neither of which are very nice > outcomes. Are you prepared to buy into "we will make every bit of code that uses readdir() proof against arbitrary lies from readdir()"? I'm not: I cannot see how to do that in any but the simplest cases -- rmtree() being about the simplest. Even if we had a bulletproof design in mind, it's pretty hard to believe that future patches would apply it every time. I think it's inevitable that we'd have bugs like "sometimes not every file gets fsync'd", which would be impossible to find until someone's data gets eaten, and maybe still impossible to find afterwards. (To be clear: if this is how FreeBSD acts, then I'm afraid we already do have such bugs. The rmtree case is just easier to observe than a missed fsync.) regards, tom lane
Re: Fwd: Re: A new look at old NFS readdir() problems?
On 01/02/2025 4:56 pm, Tom Lane wrote:
> Larry Rosenman writes:
>> Would it make sense for ONLY drop database to have the above loop?
>
> Not really. We'd just be papering over the most-easily-observable consequence of readdir's malfeasance. There'd still be problems like basebackups omitting files, missed fsyncs potentially leading to data loss, etc.
>
> I am sort of wondering though why we've not heard reports of this many times before. Did FreeBSD recently change something in this area? Also, if as they argue it's a fundamental problem in the NFS protocol, why are we failing to reproduce it with other clients?
>
> regards, tom lane

Pre version 16, we worked just fine. There was a change in 16.

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
Re: Fwd: Re: A new look at old NFS readdir() problems?
Larry Rosenman writes: > Would it make sense for ONLY drop database to have the above loop? Not really. We'd just be papering over the most-easily-observable consequence of readdir's malfeasance. There'd still be problems like basebackups omitting files, missed fsyncs potentially leading to data loss, etc. I am sort of wondering though why we've not heard reports of this many times before. Did FreeBSD recently change something in this area? Also, if as they argue it's a fundamental problem in the NFS protocol, why are we failing to reproduce it with other clients? regards, tom lane
Re: Fwd: Re: A new look at old NFS readdir() problems?
Larry Rosenman writes:
> On 01/02/2025 4:56 pm, Tom Lane wrote:
>> I am sort of wondering though why we've not heard reports of this many times before. Did FreeBSD recently change something in this area? Also, if as they argue it's a fundamental problem in the NFS protocol, why are we failing to reproduce it with other clients?
>
> Pre version 16, we worked just fine. There was a change in 16.

Well, as Thomas already mentioned, rmtree() used to work by reading the directory and making an in-memory copy of the whole file list before it started to delete anything. But reverting that wouldn't fix all of the other problems that we now know exist. (In other words, "we worked just fine" only for small values of "fine".)

regards, tom lane
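[Editorial note: for context, a minimal sketch of the pre-16-style approach Tom describes -- read the whole directory listing into memory first, then unlink -- might look like the following. This is an illustration only, not the old PostgreSQL code; the buffer handling is simplified and most error handling is omitted.]

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Snapshot all entry names before modifying the directory. */
    static void
    unlink_all_snapshot(const char *path)
    {
        DIR        *dir = opendir(path);
        struct dirent *de;
        char      **names = NULL;
        size_t      n = 0;

        if (dir == NULL)
            return;
        while ((de = readdir(dir)) != NULL)
        {
            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            names = realloc(names, (n + 1) * sizeof(char *));
            names[n++] = strdup(de->d_name);
        }
        closedir(dir);          /* readdir is finished before any unlink */

        for (size_t i = 0; i < n; i++)
        {
            char        buf[4096];

            snprintf(buf, sizeof(buf), "%s/%s", path, names[i]);
            unlink(buf);
            free(names[i]);
        }
        free(names);
    }

As the thread notes, this only sidesteps the problem for callers that can tolerate a stale snapshot; it does nothing for readers that must see a directory while it is being modified by others.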
Re: Fwd: Re: A new look at old NFS readdir() problems?
On Fri, Jan 3, 2025 at 10:53 AM Tom Lane wrote: > (To be clear: if this is how FreeBSD acts, then I'm afraid we already > do have such bugs. The rmtree case is just easier to observe than a > missed fsync.) For what little it's worth, I'm not quite convinced yet that FreeBSD's client isn't more broken than it needs to be. Lots of systems I looked at have stable cookies in practice (as NFS 4 recommends), including the one used in this report, so it seems like a more basic problem. At the risk of being wrong on the internet, I don't see any fundamental reason why a readdir() scan can't have no-skip, no-duplicate, no-fail semantics for stable-cookie, no-verification servers. And this case works perfectly with a couple of other NFS clients implementations that you and I tried. As for systems that don't have stable cookies, well then they should implement the cookie verification scheme and AFAICS the readdir() scan should then fail if it can't recover, or it would expose isolation anomalies offensive to database hacker sensibilities. I think it should be theoretically possible to recover in some happy cases (maybe: when enough state is still around in cache to deduplicate). But that shouldn't be necessary on filers using eg ZFS or BTRFS whose cookies are intended to be stable. A server could also do MVCC magic, and from a quick google, I guess NetApp WAFL might do that as it has the concept of "READDIR expired", which smells a bit like ORA-01555: snapshot too old.
Re: Proposal: Progressive explain
hi.

All the newly added GUCs:
progressive_explain
progressive_explain_verbose
progressive_explain_settings
progressive_explain_interval
progressive_explain_output_size
progressive_explain_format
progressive_explain_sample_rate
also need to be added to postgresql.conf.sample?

In doc/src/sgml/monitoring.sgml, we also need to add the view pg_stat_progress_explain to the section Dynamic Statistics Views (Table 27.1. Dynamic Statistics Views).

pg_stat_progress_explain.explain will be truncated after 4096 bytes (the default value of progressive_explain_output_size), so if progressive_explain_format is json and the plan is bigger (imagine two partitioned tables joined together, each having many partitions), the column "explain" text may not be valid json. Should we be concerned about this?

I don't really understand the actual usage of pg_stat_progress_explain.explain_count. The usage of the other columns makes sense to me. Can you share your idea of why we need this column?

select name, category from pg_settings where category = 'Query Tuning / Planner Method Configuration';

You will see that for config_group QUERY_TUNING_METHOD all the GUC names generally begin with "enable". Setting config_group to QUERY_TUNING_METHOD for all the GUC names beginning with "progressive" may not be appropriate? It is also not related to query tuning.

#include "utils/backend_status.h"
#include "storage/procarray.h"
#include "executor/spi.h"
#include "utils/guc.h"

In src/backend/commands/explain.c the headers should generally be sorted in alphabetical ascending order. Apply the same ordering to ipci.c, execMain.c, execProcnode.c.

else
/* Node in progress */
if (es->progressive &&
    planstate == planstate->state->progressive_explain_current_node)
    appendStringInfo(es->str, " (actual rows=%.0f loops=%.0f) (current)",
                     rows, nloops);
else
    appendStringInfo(es->str, " (actual rows=%.0f loops=%.0f)",
                     rows, nloops);

The indentation of the above part in src/backend/commands/explain.c ExplainNode looks wrong to me.
Re: Avoid orphaned objects dependencies, take 3
Hi,

On Mon, Oct 28, 2024 at 09:30:19AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Mon, Aug 19, 2024 at 03:35:14PM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Wed, Jul 10, 2024 at 07:31:06AM +0000, Bertrand Drouvot wrote:
> > > So, to sum up:
> > >
> > > A. Locking is now done exclusively with LockNotPinnedObject(Oid classid, Oid objid) so that it's now always clear what object we want to acquire a lock for. It means we are not manipulating directly an object address or a list of objects address as it was the case when the locking was done "directly" within the dependency code.
> > >
> > > B. A special case is done for objects that belong to the RelationRelationId class. For those, we should be in one of the two following cases that would already prevent the relation to be dropped:
> > >
> > > 1. The relation is already locked (could be an existing relation or a relation that we are creating).
> > >
> > > 2. The relation is protected indirectly (i.e an index protected by a lock on its table, a table protected by a lock on a function that depends the table...)
> > >
> > > To avoid any risks for the RelationRelationId class case, we acquire a lock if there is none. That may add unnecessary lock for 2. but that seems worth it.
> >
> > Please find attached v16, mandatory rebase due to 80ffcb8427.
>
> rebased (v17 attached).

rebased (v18 attached).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From b2b6810fabd4743fa18bf19d714805e6bb219da5 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Fri, 29 Mar 2024 15:43:26 +0000
Subject: [PATCH v18] Avoid orphaned objects dependencies

It's currently possible to create orphaned object dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non-existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place before the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists: with this in place, session 2 in scenario 1 would be locked and would report an error once session 1 commits (that would not be the case should session 1 abort the transaction).

If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with fewer details).

The patch takes into account any type of object except the ones that are pinned (they are not droppable because the system requires them).

A special case is done for objects that belong to the RelationRelationId class. For those, we should be in one of the two following cases that would already prevent the relation to be dropped:

1. The relation is already locked (could be an existing relation or a relation that we are creating).

2. The relation is protected indirectly (i.e an index protected by a lock on its table, a table protected by a lock on a function that depends the table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if there is none. That may add unnecessary lock for 2. but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce orphaned objects):
- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c          |   1 +
 src/backend/catalog/dependency.c      | 212 ++
 src/backend/catalog/heap.c            |   7 +
 src/backend/catalog/index.c           |  26 +++
 src/backend/catalog/objectaddress.c   |  57 +
 src/backend/catalog/pg_aggregate.c    |   9 +
 src/backend/catalog/pg_attrdef.c      |   1 +
 src/backend/catalog/pg_cast.c         |   5 +
 src/backend/catalog/pg_collation.c    |   1 +
 src/backend/catalog/pg_constraint.c   |  26 +++
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c       |  39 +++-
 src/backend/catalog/pg_operator.c     |  19 ++
 src/backend/catalog/pg_proc.c         |  17 +-
 src/backend/catalog/pg_publication.c  |  11 +
 src/backend/catalog/pg_range.c        |   6 +
 src/backend/catalog/pg_type.c
Re: Strange issue with NFS mounted PGDATA on ugreen NAS
On Wed, Jan 1, 2025 at 6:39 PM Tom Lane wrote:
> ISTM we used to disclaim responsibility for data integrity if you try to put PGDATA on NFS. I looked at the current wording about NFS in runtime.sgml and was frankly shocked at how optimistic it is. Shouldn't we be saying something closer to "if it breaks you get to keep both pieces"?

I now suspect this specific readdir() problem is in FreeBSD's NFS client. See below. There have also been reports of missed files from (IIRC) Linux clients without much analysis, but that doesn't seem too actionable from here unless someone can come up with a repro or at least some solid details to investigate; those involved unspecified (possibly appliance/cloud) NFS and CIFS file servers.

The other issue I know of personally is NFS ENOSPC, which has some exciting disappearing-committed-data failure modes caused by lazy allocation in Linux's implementation (and possibly others), that I've written about before. But actually that one is not strictly an NFS-only issue, it's just really easy to hit that way, and I have a patch to fix it on our side, which I hope to re-post soon. Independently of this, really, as it's tangled up with quite a few other things...

> > Anyway, I'll write a patch to change rmtree() to buffer the names in memory. In theory there could be hundreds of gigabytes of pathnames, so perhaps I should do it in batches; I'll look into that.
>
> This feels a lot like putting lipstick on a pig.

Hehe. Yeah. Abandoned.

I see this issue here with a FreeBSD client talking to a Debian server exporting BTRFS or XFS, even with dirreadsize set high so that multi-request paging is not expected. Looking at Wireshark and the NFS spec (disclaimer: I have never studied NFS at this level before, addito salis grano), what I see is a READDIR request with cookie=0 (good), which receives a response containing the whole directory listing and a final entry marker eof=1 (good), but then FreeBSD unexpectedly (to me) sends *another* READDIR request with cookie=662, which is a real cookie that was received somewhere in the middle of the first response on the entry for "13816_fsm", and that entry was followed by an entry for "13816_vm". The second request gets a response that begins at "13816_vm" (correct on the server's part). Then the client sends REMOVE (unlink) requests for some but not all of the files, including "13816_fsm" but not "13816_vm". Then it sends yet another READDIR request with cookie=0 (meaning go from the top), and gets a non-empty directory listing, but immediately sends RMDIR, which unsurprisingly fails with NFS3ERR_NOTEMPTY.

So my best guess so far is that FreeBSD's NFS client must be corrupting its directory cache when files are unlinked, and it's not the server's fault. I don't see any obvious problem with the way the cookies work. Seems like material for a minimised bug report elsewhere, and not our issue.
Re: ERROR: corrupt MVNDistinct entry
On Tue, Dec 31, 2024 at 5:40 PM Richard Guo wrote: > Regarding the back-patch, this patch is a bug fix, which suggests it > should be back-patched. However, it also changes some plans by fixing > the cost estimation. Does anyone know what our usual approach is in > this situation? Although this patch could result in plan changes, it fixes an actual bug, so I've back-patched it to v16. Thanks Richard
Re: FileFallocate misbehaving on XFS
On Tue, Dec 31, 2024 at 4:31 PM Andres Freund wrote:
> 2024-12-19 04:47:04 CET [2646363]: ERROR: could not extend file "pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from 29850 to 29861, using FileFallocate(): No space left on device

Dunno if it helps, but today I read this reference in the latest patchset on the XFS mailing list:
https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangsh...@kylinos.cn/

Could it be related and explain the effect? For the record, I found it here:
https://lore.kernel.org/linux-xfs/canubcdxwhottw4pjje1qjajheg48ls7mfc065gcqwoh7s0y...@mail.gmail.com/

Ciao,
Gelma
RFC: Lock-free XLog Reservation from WAL
Hi all,

I am reaching out to solicit your insights and comments on a recent proposal regarding "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer that involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog record) and PrevBytePos (the prev-link to the previous XLog record). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions. To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add).

2. Updating the prev-link of the next XLog record: Based on the fact that the prev-link of the next XLog record always points to the head of the current XLog record, we will slightly exceed the reserved memory range of the current record to update the prev-link of the next one, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL.

3. Breaking the sequential write convention: Each backend will update the prev-link of its next XLog record first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with sequential writes).

4. Revised GetXLogBuffer: To support #3, we need to update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field.

5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism.

The attached patch passes the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPUs), this optimization helps the TPCC benchmark scale better with the core count; as a result, performance with all cores enabled improved by 2.04x. Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.

Attachment: 0001-Lock-free-XLog-Reservation-from-WAL.patch
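[Editorial note: to make point 1 concrete, here is a minimal sketch (an illustration, not the attached patch) of what an atomic reservation of CurrBytePos could look like. C11 atomics are used here as a stand-in for PostgreSQL's pg_atomic_fetch_add_u64; the struct and function names are hypothetical.]

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical stand-in for XLogCtlInsert with only the reservation counter. */
    typedef struct
    {
        _Atomic uint64_t CurrBytePos;   /* next usable byte position */
    } InsertPosSketch;

    /*
     * Reserve 'size' bytes of WAL space without a spinlock: the old counter
     * value is both the start of our reservation and the prev-link that the
     * *next* record needs, which is why the proposal also writes it into the
     * next record's header area.
     */
    static inline uint64_t
    reserve_xlog_bytes(InsertPosSketch *insert, uint64_t size)
    {
        return atomic_fetch_add(&insert->CurrBytePos, size);
    }

The design trade-off the proposal describes is that dropping PrevBytePos is what makes a single fetch_add sufficient; the prev-link then has to be communicated to the next inserter through the buffer itself rather than through shared state.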
Re: Modern SHA2- based password hashes for pgcrypto
On Thursday, 2025-01-02 at 15:57 +0100, Daniel Gustafsson wrote: > > I adapted the code from the publicly available reference > > implementation > > at [1]. It's based on our existing OpenSSL infrastructure in > > pgcrypto > > and produces compatible password hashes with crypt() and "openssl > > passwd" with "-5" and "-6" switches. > > Potentially daft question, but since we require OpenSSL to build > pgcrypto, why > do we need to include sha2 code instead of using the sha2 > implementation in > libcrypto? How complicated would it be to use the OpenSSL API > instead? Not sure I got you, but I use OpenSSL and the SHA2 implementation there. See the pgcrypto px_* API (px.h and openssl.c respectively) I am using to create the digests. Thanks, Bernd
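For readers unfamiliar with that layer, here is a minimal sketch of computing a SHA-512 digest through pgcrypto's px_* API. The px_find_digest/px_md_* calls are the existing pgcrypto interface; the wrapper function itself is only an illustration and is not code from the patch:

#include "postgres.h"
#include "px.h"

/*
 * Compute a SHA-512 digest of "data" into "out" (which must have room for
 * at least 64 bytes).  Returns 0 on success, a negative px error otherwise.
 */
static int
sha512_digest_sketch(const uint8 *data, unsigned dlen, uint8 *out)
{
	PX_MD	   *md;
	int			err;

	err = px_find_digest("sha512", &md);
	if (err < 0)
		return err;

	px_md_update(md, data, dlen);
	px_md_finish(md, out);
	px_md_free(md);

	return 0;
}

The sha256crypt/sha512crypt schemes then repeat digests like this over salted inputs for a configurable number of rounds, which is why reusing the OpenSSL-backed px_* digests avoids carrying a separate SHA-2 implementation.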
Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
On Fri, Jan 3, 2025 at 8:06 AM Peter Smith wrote: > > Hi Shubham, > > Here are some review comments for patch v5-0001. > > == > doc/src/sgml/ref/pg_createsubscriber.sgml > > 1. > + > +The source server must have > to > +be set to -1 to prevent the automatic removal of WAL > +replication slots. Setting this parameter to files needed by a > specific size > +may lead to replication failures if required WAL files are > +prematurely deleted. > + > > IIUC, this problem is all about the removal of WAL *files*, not "WAL > replication slots". > > Also, the "Setting this parameter to files needed by a specific size" > part sounds strange. > > I think this can be simplified anyhow like below. > > SUGGESTION: > Replication failures can occur if required WAL files are prematurely > deleted. To prevent this, the source server must set linkend="guc-max-slot-wal-keep-size"/> to -1, > ensuring WAL files are not automatically removed. > > == > src/bin/pg_basebackup/pg_createsubscriber.c > > check_publisher: > > 2. > + if (dry_run && max_slot_wal_keep_size != -1) > + { > + pg_log_warning("publisher requires WAL size must not be restricted"); > + pg_log_warning_detail("The max_slot_wal_keep_size parameter is > currently set to a non-default value, which may lead to replication > failures. " > + "This parameter must be set to -1 to ensure that required WAL > files are not prematurely removed."); > + } > > 2a. > Whenever you mention a GUC name in a PostgreSQL message then (by > convention) it must be double-quoted. > > ~ > > 2b. > The detailed message seems verbose and repetitive to me. > > ~ > > 2c. > The other nearby messages use the terminology "configuration > parameter", so this should be the same. > > ~ > > 2d. > The other nearby messages use \"%s\" substitution for the GUC name, so > this should be the same. > > ~ > > 2e. > IMO advising the user to change a configuration parameter should be > done using the pg_log_warning_hint function (e.g. _hint, not > _details). > > ~~ > > Result: > > CURRENT (pg_log_warning_details) > The max_slot_wal_keep_size parameter is currently set to a non-default > value, which may lead to replication failures. This parameter must be > set to -1 to ensure that required WAL files are not prematurely > removed. > > SUGGESTION (pg_log_warning_hint) > Set the configuration parameter \"%s\" to -1 to ensure that required > WAL files are not prematurely removed. > I have addressed the given comments using your suggestions. The attached patch contains the suggested changes. Thanks and Regards, Shubham Khanna. v6-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch Description: Binary data
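Putting those suggestions together, the revised dry-run check would look roughly like the fragment below. This is only a sketch of the shape suggested in the review comments (it assumes check_publisher()'s existing dry_run and max_slot_wal_keep_size locals, and the final wording in v6 may differ):

	if (dry_run && max_slot_wal_keep_size != -1)
	{
		/* quote the GUC via "%s" substitution and put the advice in a hint */
		pg_log_warning("publisher requires WAL size must not be restricted");
		pg_log_warning_hint("Set the configuration parameter \"%s\" to -1 to ensure that required WAL files are not prematurely removed.",
							"max_slot_wal_keep_size");
	}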
Re: Re: proposal: schema variables
Hi. In the function svariableStartupReceiver, all these "ereport(ERROR" calls cannot happen, since transformLetStmt already did all the heavy work. Based on https://www.postgresql.org/docs/current/error-message-reporting.html, all these "ereport(ERROR," calls in svariableStartupReceiver can be simplified to "elog(ERROR," or an Assert. After standard_ExecutorStart->InitPlan, queryDesc->tupDesc will not include the attr->attisdropped == true scenario. In standard_ExecutorStart, I added the following code and then ran the regression tests again to prove my point.

standard_ExecutorStart:

    /*
     * Initialize the plan state tree
     */
    InitPlan(queryDesc, eflags);

    for (int i = 0; i < queryDesc->tupDesc->natts; i++)
    {
        Form_pg_attribute attr = TupleDescAttr(queryDesc->tupDesc, i);

        if (attr->attisdropped)
            elog(INFO, "some attribute is dropped, queryDesc->operation is %d",
                 queryDesc->operation);
    }

    MemoryContextSwitchTo(oldcontext);

The svariableStartupReceiver parameter typeinfo comes from queryDesc->tupDesc, so I think typeinfo->natts will always equal one; therefore SVariableState.slot_offset is not necessary. Overall, I think svariableStartupReceiver can be simplified to the following:

static void
svariableStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
{
    SVariableState *myState = (SVariableState *) self;
    int         natts = typeinfo->natts;
    Form_pg_attribute attr;
    LOCKTAG     locktag PG_USED_FOR_ASSERTS_ONLY;

    Assert(myState->pub.mydest == DestVariable);
    Assert(OidIsValid(myState->varid));
    Assert(SearchSysCacheExists1(VARIABLEOID, myState->varid));

#ifdef USE_ASSERT_CHECKING
    SET_LOCKTAG_OBJECT(locktag, MyDatabaseId, VariableRelationId, myState->varid, 0);
    Assert(LockHeldByMe(&locktag, AccessShareLock, false));
#endif

    Assert(natts == 1);

    attr = TupleDescAttr(typeinfo, 0);
    myState->need_detoast = attr->attlen == -1;
    myState->rows = 0;
}

I've attached the file containing the changes I mentioned earlier.

---

Overall, the docs for 0001 and 0002 look good to me now. The following are only some minor issues I came up with.

In Table 5.1 (ACL Privilege Abbreviations), should VARIABLE (which occurs 3 times) be SESSION VARIABLE?

doc/src/sgml/glossary.sgml: I want to make a minor tweak, from:

    A persistent database object that holds a value in session memory. This value is private to each session and is released when the session ends. Read or write access to session variables is controlled by privileges, similar to other database objects.

to:

    A persistent database object that holds a value in session memory. This value is private to each session and is reset to the default value (null) when the session ends. Read or write access to session variables is controlled by access privileges, similar to other database objects.

In let.sgml, the example:

    CREATE VARIABLE myvar AS integer;
    LET myvar = 10;
    LET myvar = (SELECT sum(val) FROM tab);

should be under an "Examples" heading, followed by your example code.

v1-0001-refactoring-svariableStartupReceiver.no-cfbot Description: Binary data
Re: magical eref alias names
Robert Haas writes: > Oh, I agree, but I don't see why anyone would care whether rel names > are unique across different queries. When I mentioned global > uniqueness, I meant unique within a query, like what > set_rtable_names() does after the fact. Okay, but then we still have the problem of how to ensure that in a query that has inline'd some views. I think solving the sort of I-want-to-reference-this problem you describe would require that we unique-ify the aliases in the rewriter, just after it finishes incorporating any views. We could do that, but it seems like a lot of cycles to expend on something that would be pointless in the typical case where nobody ever looks at the aliases later. > Now, I'm firmly convinced that this is a real problem and worth > solving, but let me be clear that I don't think the solution is > anywhere on this thread, nor do I think that it is simple. Agreed. > My original > proposal of getting rid of system-generated fake names isn't > necessary, because you very helpfully pointed out that I can look at > whether RTE->alias->aliasname exists to figure that out. Actually, I noticed that we are failing to honor that in the places where we inject "*SELECT*" and "*SELECT* %d" names, because that code puts those names into RTE->alias not only RTE->eref. I experimented with the attached patch to not do that anymore, which is sort of a subset of what you did but just focused on not lying about what's generated versus user-written. We could alternatively keep the current generated names by extending addRangeTableEntryForSubquery's API so that alias and generated eref are passed separately. (I didn't look to see if anyplace else is messing up this distinction similarly.) regards, tom lane diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index bf322198a2..3d8a433c19 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -4961,13 +4961,13 @@ SELECT ft1.c1 FROM ft1 JOIN ft2 on ft1.c1 = ft2.c1 WHERE -- === EXPLAIN (verbose, costs off) INSERT INTO ft2 (c1,c2,c3) SELECT c1+1000,c2+100, c3 || c3 FROM ft2 LIMIT 20; -QUERY PLAN --- + QUERY PLAN + Insert on public.ft2 Remote SQL: INSERT INTO "S 1"."T 1"("C 1", c2, c3, c4, c5, c6, c7, c8) VALUES ($1, $2, $3, $4, $5, $6, $7, $8) Batch Size: 1 - -> Subquery Scan on "*SELECT*" - Output: "*SELECT*"."?column?", "*SELECT*"."?column?_1", NULL::integer, "*SELECT*"."?column?_2", NULL::timestamp with time zone, NULL::timestamp without time zone, NULL::character varying, 'ft2 '::character(10), NULL::user_enum + -> Subquery Scan on unnamed_subquery + Output: unnamed_subquery."?column?", unnamed_subquery."?column?_1", NULL::integer, unnamed_subquery."?column?_2", NULL::timestamp with time zone, NULL::timestamp without time zone, NULL::character varying, 'ft2 '::character(10), NULL::user_enum -> Foreign Scan on public.ft2 ft2_1 Output: (ft2_1.c1 + 1000), (ft2_1.c2 + 100), (ft2_1.c3 || ft2_1.c3) Remote SQL: SELECT "C 1", c2, c3 FROM "S 1"."T 1" LIMIT 20::bigint diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c index 3a2d51c5ad..c58bbd5b78 100644 --- a/src/backend/executor/functions.c +++ b/src/backend/executor/functions.c @@ -1962,7 +1962,7 @@ tlist_coercion_finished: rte = makeNode(RangeTblEntry); rte->rtekind = RTE_SUBQUERY; rte->subquery = parse; - rte->eref = rte->alias = makeAlias("*SELECT*", colnames); + rte->eref = makeAlias("unnamed_subquery", colnames); rte->lateral = false; 
rte->inh = false; rte->inFromCl = true; diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index 3864a675d2..359d3c390e 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -823,7 +823,7 @@ transformInsertStmt(Pa
Re: Vacuum statistics
Hi, Thanks for the work you have done here. Exposing cumulative metrics at this level of detail for vacuum is surely useful to find vacuum bottlenecks and to determine the effectiveness of vacuum tuning. I am yet to look very closely, but I think some additional columns that would be useful are the number of failsafe autovacuums that occurred, and counters for the number of times index_cleanup was skipped, the truncate phase was skipped, and TOAST vacuuming was skipped (the latter is only relevant for the main relation). I also wonder if it makes sense to break down timing by phase. I surely would like to know how much of my vacuum time was spent in index cleanup vs heap scan, etc. A nit: I noticed in v14, the column is "schema". It should be "schemaname" for consistency. Also, instead of pg_stat_vacuum_tables, what about pg_stat_vacuum? Now, I became aware of this discussion after starting a new thread to track total time spent in vacuum/analyze in pg_stat_all_tables [1]. But this raises the question of what should be done with the current counters in pg_stat_all_tables? I see it mentioned above that (auto)vacuum_count should be added to this new view, but it's also already in pg_stat_all_tables. I don't think we should be duplicating the same columns across views. I think total_time should be removed from your current patch and added as is being suggested in [1]. This way high-level metrics such as counts and total time spent remain in pg_stat_all_tables, while the new view you are proposing will contain more details. I don't think we will have consistency issues between the views because a reset using pg_stat_reset() will act on all the stats and pg_stat_reset_single_table_counters() will act on all the stats related to that table. There should be no way to reset the vacuum stats independently, AFAICT. Alternatively, we can remove the vacuum related stats from pg_stat_all_tables, but that will break monitoring tools and will leave us with the (auto)analyze metrics alone in pg_stat_all_tables. This sounds very ugly. What do you think? Regards, Sami Imseih Amazon Web Services (AWS) [1] https://commitfest.postgresql.org/52/5485/
Re: Fwd: Re: A new look at old NFS readdir() problems?
On Thu, Jan 2, 2025 at 03:48:53PM -0500, Tom Lane wrote: > Larry Rosenman writes: > > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on > > my issue. > > Thanks for reaching out to him. So if I'm reading this correctly, > there's little point in filing a FreeBSD bug because it'll be > dismissed as unfixable. > > This leaves us in rather a nasty position. Sure, we could rewrite > rmtree() as Thomas suggested upthread, but I'm still of the opinion > that that's smearing lipstick on a pig. rmtree() is the least of > our worries: it doesn't need to expect that anybody else will be > modifying the target directory, plus it can easily restart its scan > without complicated bookkeeping. I doubt we can make such an > assumption for all our uses of readdir(), or that it's okay to > miss or double-process files in every one of them. > > I'm still of the opinion that the best thing to do is disclaim > safety of storing a database on NFS. It would be nice to have some details on the docs about why NFS can cause problems so we have something to point to when people ask. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
Re: Alias of VALUES RTE in explain plan
Robert Haas writes: > On Sat, Nov 2, 2024 at 1:09 PM Tom Lane wrote: >> It seems sufficient to avoid alias pushdown when there's an ORDER BY >> inside the VALUES subquery. We disallow a locking clause, and >> while there can be LIMIT/OFFSET, those aren't allowed to reference the >> VALUES output anyway. I added some test cases to show that this is >> enough to make view-dumping behave sanely. > ... The correctness of that rejiggering depends crucially on what > will happen at plan time and then at EXPLAIN/ruleutils time, but the > rules for what will happen at those later times are pretty darn > complicated, so I feel like this is creating an awful lot of action at > a distance. I'm not seeing where there's a correctness issue here? The parser is charged with assigning aliases to RTEs that lack one, and with this patch that's still true. It's just assigning a different alias than it did before. There is no question of whether the alias can "leak" out of the implicit sub-select; it cannot, by SQL's scoping rules. We have to avoid changing anything if there are clauses inside the implicit sub-select that could reference the old choice of alias, but the patch does that. (Or we could decide to simplify things at the cost of breaking such SQL code, since there probably is none in the field. It's still not clear to me which choice is better.) Yes, changing the parser to get an effect in ruleutils.c is indeed action-at-a-distance-y, but the same can be said of an awful lot of the behavior in this area. If we were to do this differently, we'd be breaking existing conventions like "the parser is what initially assigns aliases", and we'd be much more likely to create new bugs that way (cf. my criticisms upthread of the v1 patch). So I'm not really seeing another way. We could just reject this patch series on the grounds of "it's not a bug and we're not going to change the behavior". But I do think the proposed new output looks nicer. > If were able (and I suspect we're not, but hypothetically) to in > effect pull up the subquery at parse time, so that to the planner and > executor it doesn't even exist, then I think that would be perfectly > fine, because then we would have strong reasons for believing that no > later decision can turn our parse-time decision into a problem. But to > leave that subquery there and guess that it's going to disappear > before we get to EXPLAIN doesn't seem nearly as safe. It seems pretty > easy to either miss things (like the ORDER BY case) or even to have > future planner changes break stuff. I completely fail to understand what you think might break. The planner is not especially interested in aliases, and ruleutils already has sufficient defenses against cases like duplicate aliases --- it must, because such cases can arise anyway. regards, tom lane
Re: apply_scanjoin_target_to_paths and partitionwise join
On Tue, Oct 1, 2024 at 3:22 AM Andrei Lepikhov wrote: > > On 24/7/2024 15:22, Ashutosh Bapat wrote: > > On Wed, Jul 24, 2024 at 9:42 AM Richard Guo wrote: > >> Is there a specific query that demonstrates benefits from this change? > >> I'm curious about scenarios where a partitionwise join runs slower > >> than a non-partitionwise join. > > > > [1] provides a testcase where a nonpartitionwise join is better than > > partitionwise join. This testcase is derived from a bug reported by an > > EDB customer. [2] is another bug report on psql-bugs. > I haven't passed through the patch yet, but can this issue affect the > decision on what to push down to foreign servers: a whole join or just a > scan of two partitions? > If the patch is related to the pushdown decision, I'd say it is quite an > annoying problem for me. From time to time, I see cases where JOIN > produces more tuples than both partitions have in total - in this case, > it would be better to transfer tables' tuples to the main instance > before joining them. Sorry for replying late. I somehow didn't notice this. A join between partitions is pushed down if only partitionwise join is chosen and a join between partitions won't be pushed down if partitionwise join is not chosen. Hence this bug affects pushdown as well. The CF entry shows as waiting for author. But that isn't the right status. Will change it to needs review. I think we need a consensus as to whether we want to fix this bug or not. Since this bug doesn't affect me anymore, I will just withdraw this CF entry if there is no interest. -- Best Wishes, Ashutosh Bapat
Re: [PoC] Federated Authn/z with OAUTHBEARER
On 20.12.24 02:00, Jacob Champion wrote: v40 also contains: - explicit testing for connect_timeout compatibility - support for require_auth=oauth, including compatibility with require_auth=!scram-sha-256 - the ability for a validator to set authn_id even if the token is not authorized, for auditability in the logs - the use of pq_block_sigpipe() for additional safety in the face of CURLOPT_NOSIGNAL I have split out the require_auth changes temporarily (0002) for ease of review, and I plan to ping the last thread where SASL support in require_auth was discussed [1]. Some review of v40: General: There is a mix of using "URL" and "URI" throughout the patch. I tried to look up in the source material (RFCs) what the correct use would be, but even they are mixing it in nonobvious ways. Maybe this is just hopelessly confused, or maybe there is a system that I don't recognize. * .cirrus.tasks.yml Since libcurl is an "auto" meson option, it doesn't need to be enabled explicitly. At least that's how most of the other feature options are handled. So probably better to stick to that pattern. * config/programs.m4 Useless whitespace change. * configure.ac +AC_MSG_CHECKING([whether to build with libcurl support for OAuth client flows]) etc. Let's just write something like 'whether to build with libcurl support' here. So we don't have to keep updating it if the scope of the option changes. * meson_options.txt +option('libcurl', type : 'feature', value: 'auto', + description: 'libcurl support for OAuth client flows') Similarly, let's just write something like 'libcurl support' here. * src/backend/libpq/auth-oauth.c +typedef enum +{ + OAUTH_STATE_INIT = 0, + OAUTH_STATE_ERROR, + OAUTH_STATE_FINISHED, +} oauth_state; + +/* Mechanism callback state. */ +struct oauth_ctx +{ + oauth_state state; This is the only use of that type definition. I think you can skip the typedef and use the enum tag directly. * src/interfaces/libpq/libpq-fe.h +#ifdef WIN32 +#include /* for SOCKET */ +#endif Including a system header like that seems a bit unpleasant. AFAICT, you only need it for this: + PostgresPollingStatusType (*async) (PGconn *conn, + struct _PGoauthBearerRequest *request, + SOCKTYPE * altsock); But that function could already get the altsock handle via conn->altsock. So maybe that is a way to avoid the whole socket type dance in this header. * src/test/authentication/t/001_password.pl I suppose this file could be a separate commit? It just separates the SASL/SCRAM terminology for existing functionality. * src/test/modules/oauth_validator/fail_validator.c +{ + elog(FATAL, "fail_validator: sentinel error"); + pg_unreachable(); +} This pg_unreachable() is probably not necessary after elog(FATAL). * .../modules/oauth_validator/oauth_hook_client.c +#include +#include These are generally not necessary, as they come in via c.h. +#ifdef WIN32 +#include +#else +#include +#endif I don't think this special Windows handling is necessary, since there is src/include/port/win32/sys/socket.h. +static void +usage(char *argv[]) +{ + fprintf(stderr, "usage: %s [flags] CONNINFO\n\n", argv[0]); Help output should go to stdout. With the above changes, I think this patch set is structurally okay now. Now it just needs to do the right things. ;-)
read stream on amcheck
Hi, While reviewing some other patches implementing the stream API for core subsystems, I noticed that the amcheck extension could also benefit from it. Note the refactoring of the "skip" parameter handling: the logic was moved to the heapam_read_stream_next_block callback so that verify_heapam doesn't need to touch any private field of the heapam_read_stream_next_block_private struct. One other thing to mention is that the test cases for the "skip" parameter that I've seen only test when the first page is corrupted, so I think a careful review of the callback logic would be good to ensure that we don't accidentally skip a page when doing p->current_blocknum++; This patch doesn't show any performance improvement (or regression), but I think it would be good to replace the ReadBufferExtended usage with the read stream API so that it can benefit from the AIO project in the future. -- Matheus Alcantara v1-0001-Use-read-stream-on-amcheck.patch Description: Binary data
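To illustrate the kind of callback being discussed, here is a rough sketch of a block-number callback in the read stream style. The private struct and the skip test below are hypothetical stand-ins for the patch's heapam_read_stream_next_block_private logic, not the actual patch code; the point is only that the skip decision lives inside the callback, so verify_heapam never needs to look at this state:

#include "postgres.h"
#include "storage/read_stream.h"

typedef struct
{
	BlockNumber current_blocknum;	/* next block to consider */
	BlockNumber last_exclusive;		/* end of the range to verify */
	bool		(*should_skip) (BlockNumber blkno, void *arg);	/* hypothetical */
	void	   *skip_arg;
} sketch_next_block_private;

static BlockNumber
sketch_next_block_cb(ReadStream *stream, void *callback_private_data,
					 void *per_buffer_data)
{
	sketch_next_block_private *p = (sketch_next_block_private *) callback_private_data;

	while (p->current_blocknum < p->last_exclusive)
	{
		BlockNumber blkno = p->current_blocknum++;

		/* e.g. all-frozen/all-visible pages when skipping is requested */
		if (p->should_skip && p->should_skip(blkno, p->skip_arg))
			continue;

		return blkno;
	}

	return InvalidBlockNumber;	/* no more blocks: end the stream */
}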
Re: Modern SHA2- based password hashes for pgcrypto
> On 2 Jan 2025, at 16:17, Bernd Helmle wrote: > > On Thursday, 2025-01-02 at 15:57 +0100, Daniel Gustafsson wrote: >>> I adapted the code from the publicly available reference >>> implementation >>> at [1]. It's based on our existing OpenSSL infrastructure in >>> pgcrypto >>> and produces compatible password hashes with crypt() and "openssl >>> passwd" with "-5" and "-6" switches. >> >> Potentially daft question, but since we require OpenSSL to build >> pgcrypto, why >> do we need to include sha2 code instead of using the sha2 >> implementation in >> libcrypto? How complicated would it be to use the OpenSSL API >> instead? > > Not sure I got you, but I use OpenSSL and the SHA2 implementation > there. See the pgcrypto px_* API (px.h and openssl.c respectively) I am > using to create the digests. Sorry, skimming the patch I misread it, nevermind. -- Daniel Gustafsson
Re: Strange issue with NFS mounted PGDATA on ugreen NAS
Thomas Munro writes: > I now suspect this specific readdir() problem is in FreeBSD's NFS > client. See below. There have also been reports of missed files from > (IIRC) Linux clients without much analysis, but that doesn't seem too > actionable from here unless someone can come up with a repro or at > least some solid details to investigate; those involved unspecified > (possibly appliance/cloud) NFS and CIFS file servers. I forgot to report back, but yesterday I spent time unsuccessfully trying to reproduce the problem with macOS client and NFS server using btrfs (a Synology NAS running some no-name version of Linux). So that lends additional weight to your conclusion that it isn't specifically a btrfs bug. > I see this issue here with a FreeBSD client talking to a Debian server > exporting BTRFS or XFS, even with dirreadsize set high so that > multi-request paging is not expected. Looking at Wireshark and the > NFS spec (disclaimer: I have never studied NFS at this level before, > addito salis grano), what I see is a READDIR request with cookie=0 > (good), and which receives a response containing the whole directory > listing and a final entry marker eof=1 (good), but then FreeBSD > unexpectedly (to me) sends *another* READDIR request with cookie=662, > which is a real cookie that was received somewhere in the middle of > the first response on the entry for "13816_fsm", and that entry was > followed by an entry for "13816_vm". The second request gets a > response that begins at "13816_vm" (correct on the server's part). > Then the client sends REMOVE (unlink) requests for some but not all of > the files, including "13816_fsm" but not "13816_vm". Then it sends > yet another READDIR request with cookie=0 (meaning go from the top), > and gets a non-empty directory listing, but immediately sends RMDIR, > which unsurprisingly fails NFS3ERR_NOTEMPTY. So my best guess so far > is that FreeBSD's NFS client must be corrupting its directory cache > when files are unlinked, and it's not the server's fault. I don't see > any obvious problem with the way the cookies work. Seems like > material for a minimised bug report elsewhere, and not our issue. Yeah, that seems pretty damning. Thanks for looking into it. regards, tom lane
Re: magical eref alias names
On Thu, Jan 2, 2025 at 3:27 PM Tom Lane wrote: > Global uniqueness across the database (not single queries) would be > needed to prevent cases where different views use the same generated > names. The only way I can see to do that without nasty performance > costs is to use something like an OID counter. Which would mean > that rather than nice names like "union_1", "union_2", etc, you'd > soon be looking at "union_5846926". I don't think anyone would > find that to be an improvement on what we're doing now. Oh, I agree, but I don't see why anyone would care whether rel names are unique across different queries. When I mentioned global uniqueness, I meant unique within a query, like what set_rtable_names() does after the fact. > > If we had global uniqueness, or even per-subquery-level uniqueness, > > this would sort itself out somewhat nicely, I think. We would just end > > up with select_1 and select_2 or union_1 and union_2 or something like > > that, and I think that would be strictly better than the status quo > > where we do sometimes generate *SELECT* %d, but also sometimes just > > *SELECT* and other times unnamed_subquery, and also only ever *VALUES* > > and not *VALUES* %d. > > I'll concede that it'd be nicer. But I'm not convinced it'd be enough > nicer to justify the costs of changing. We've been doing it this way > for a long time, and AFAIR you're the first to complain about it. I'm not sure it's enough nicer to justify the cost of changing, either. I do think that "you're the first to complain about it" is not a convincing argument, here or in general. Every time somebody reports a new problem, they are the first to complain about it, but that does not make the problem any less real. The reason I got interested in this problem was the thread about allowing extensions to control planner behavior. I wrote some words about it over there, but it's been a while so those thoughts might not be entirely up to date. What I've found is that it's a lot easier to insert a hook or three to allow an extension to control the planner behavior than it is to get those hooks to do anything useful, and that's precisely because it is hard to find a useful way to identify a particular part of the query plan. If the query planner were a person sitting next to you at your computer, you could point at the screen with your finger and say "hey, do you see this part of the EXPLAIN plan right here? could you try making this a sequential scan rather than an index scan?" and the planner could say "sure, let me re-plan it that way and I'll show you how it turns out". Since the planner is incorporeal and cannot see your finger, you want to identify "this part of the EXPLAIN plan right here" in some other way, like with a name, but the names known at parsing time and planning time are not unique and need not match what shows up in the EXPLAIN output.
Concretely, if the user creates a partitioned table called *VALUES* with several partitions and then creates another such table, also called *VALUES*, in a different schema, and then joins the two of them together; and then does UNION ALL with a subquery that actually uses VALUES (), and then somewhere includes a subquery that also queries one of the tables actually called *VALUES*, you cannot meaningfully use the label *VALUES* to identify one particular scan; and you can't use anything that appears in the EXPLAIN output for that purpose either because the next planning cycle won't have those labels available *during planning* when it needs to honor whatever plan-shape requests were made. Now, I'm firmly convinced that this is a real problem and worth solving, but let me be clear that I don't think the solution is anywhere on this thread, nor do I think that it is simple. My original proposal of getting rid of system-generated fake names isn't necessary, because you very helpfully pointed out that I can look at whether RTE->alias->aliasname exists to figure that out. Also, enforcing global uniqueness across the query at parse time wouldn't actually help, because when we apply rewriter rules we can introduce new relations into the query, and then when we expand inheritance hierarchies we can introduce even more new relations into the query; and what the user will care about if they want to modify "that portion of the plan right there" is what ultimately ends up in the plan, not what was in the query at parse time. So as far as I am concerned, this thread has served the useful purpose of making me smarter and I can't really see a way for it to do anything more than that; but if it's given you a clever idea for something that we could change in the code, I'm certainly happy to hear about that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: magical eref alias names
Robert Haas writes: > On Sun, Dec 22, 2024 at 12:45 PM Tom Lane wrote: >> In particular, in a situation where we're trying to show a plan for >> a query with inlined views, EXPLAIN would probably have to have code >> to unique-ify the names anyway --- there's no way we're going to make >> these nonce names globally unique, so the view(s) might contain names >> that conflict with each other or the outer query. > When you say "there's no way we're going to make these nonce names > globally unique," is that because you think it would be too costly > from a performance perspective (which was my concern) or that you > think it's flat-out impossible for some reason (which doesn't seem to > me to be true)? Global uniqueness across the database (not single queries) would be needed to prevent cases where different views use the same generated names. The only way I can see to do that without nasty performance costs is to use something like an OID counter. Which would mean that rather than nice names like "union_1", "union_2", etc, you'd soon be looking at "union_5846926". I don't think anyone would find that to be an improvement on what we're doing now. > If we had global uniqueness, or even per-subquery-level uniqueness, > this would sort itself out somewhat nicely, I think. We would just end > up with select_1 and select_2 or union_1 and union_2 or something like > that, and I think that would be strictly better than the status quo > where we do sometimes generate *SELECT* %d, but also sometimes just > *SELECT* and other times unnamed_subquery, and also only ever *VALUES* > and not *VALUES* %d. I'll concede that it'd be nicer. But I'm not convinced it'd be enough nicer to justify the costs of changing. We've been doing it this way for a long time, and AFAIR you're the first to complain about it. regards, tom lane
Fwd: Re: A new look at old NFS readdir() problems?
@Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on my issue. Original Message Subject: Re: A new look at old NFS readdir() problems? Date: 01/02/2025 10:08 am From: Rick Macklem To: Thomas Munro Cc: Rick Macklem , Larry Rosenman On Thu, Jan 2, 2025 at 2:50 AM Thomas Munro wrote: CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ith...@uoguelph.ca. Hi Rick CC: ler I hope you don't mind me reaching out directly, I just didn't really want to spam existing bug reports without sufficient understanding to actually help yet... but I figured I should get in touch and see if you have any clues or words of warning, since you've worked on so much of the NFS code. I'm a minor FBSD contributor and interested in file systems, but not knowledgeable about NFS; I run into/debug/report a lot of file system bugs on a lot of systems in my day job on databases. I'm interested to see if I can help with this problem. Existing ancient report and interesting email: https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020155.html https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696 What we ran into is not the "bad cookie" state, which doesn't really seem to be recoverable in general, from what I understand (though the FreeBSD code apparently would try, huh). It's a simple case where the NFS client requests a whole directory with a large READDIR request, and then tries to unlink all the files in a traditional while-readdir()-unlink() loop that works on other systems. In general, NFS is not a POSIX compliant file system, due to its protocol design. The above is one example. The only "safe" way is to opendir() or rewinddir() after every removal. The above usually works (and always worked for UFS long ago) because the directory offset cookies for subsequent entries in the directory after the entry unlinked happened to "still be valid". That is no longer true for FreeBSD's UFS nor for many other file systems that can be exported. If the client reads the entire directory in one READDIR, then it is fine, since it has no need to the directory offset cookies. However, there is a limit to how much a single READDIR can do (these days for NFSv4.1/4.2, it could be raised to just over 1Mbyte, however FreeBSD limits it to 8K at the moment). Another way to work around the problem is to read the entire directory into the client via READDIRs before starting to do the unlinks. The opendir()/readdir() code in libc could be hacked to do that, but I have never tried to push such a patch into FreeBSD. (It would be limited by how much memory can be malloc()'d, that is pretty generous compared to even large directorys with 10s of thousand entries.) The above is true for all versions of NFS up to NFSv4.2, which is the current one and unless some future version of NFS does READDIR differently (I won't live long enough to see this;-), it will always be the case. If my comment above was not clear, the following encoding is the "safe" way to remove all entries in a directory. do { dir = opendir("X"); dp = readdir(dir); if (dp != NULL) unlink(dp->d_name); close(dir); } while (dp != NULL); In theory, the directory_offset_cookie was supposed to handle this, but it has never worked correctly, for a couple of reasons. 1 - RFC1813 (the NFSv3 one) did not describe the cookie verifier correctly. It should only change when cookies for extant entries change. 
The description suggested it should change whenever an entry is deleted, since that cookie is no longer valid. 2 - #1 only works if directory offset cookies for other entries in the directory do not change when an entry is deleted. This used to be the case for UFS, but was broken in FreeBSD when a commit many years ago optimized ufs_readdir() to compress out invalid entries. Doing this changes the directory offset cookies every time an entry is deleted at the beginning of a directory block. rick On FreeBSD it seems to clobber its own directory cache, make extra unnecessary READDIR requests, and skip some of the files. Or maybe I have no idea what's going on and this is a hopelessly naive question and mission :-) Here's what we learned so far starting from Larry's report: https://www.postgresql.org/message-id/flat/04f95c3c13d4a9db87b3ac082a9f4877%40lerctr.org Note that this issue has nothing to do with "bad cookie" errors (I doubt the server I'm talking to even implements that -- instead it tries to have cookies that are persistent/stable). Also, while looking into this and initially suspecting cookie stability bugs (incorrectly), I checked a bunch of local file systems to understand how their cookies work, and I think I found a related problem when FreeBSD exports UFS, too.
Re: Alias of VALUES RTE in explain plan
On Sat, Nov 2, 2024 at 1:09 PM Tom Lane wrote: > regression=# create view vv as SELECT * FROM (VALUES (4),(2),(3),(1) ORDER BY > t1.x LIMIT 2) AS t1(x); > CREATE VIEW > regression=# \d+ vv > View "public.vv" > Column | Type | Collation | Nullable | Default | Storage | Description > +-+---+--+-+-+- > x | integer | | | | plain | > View definition: > SELECT x >FROM ( VALUES (4), (2), (3), (1) > ORDER BY t1_1.x > LIMIT 2) t1(x); > > ruleutils has decided that it needs to make the two "t1" table > aliases distinct. But of course that will fail on reload: > > regression=# SELECT x > regression-#FROM ( VALUES (4), (2), (3), (1) > regression(# ORDER BY t1_1.x > regression(# LIMIT 2) t1(x); > ERROR: missing FROM-clause entry for table "t1_1" > LINE 3: ORDER BY t1_1.x >^ > It seems sufficient to avoid alias pushdown when there's an ORDER BY > inside the VALUES subquery. We disallow a locking clause, and > while there can be LIMIT/OFFSET, those aren't allowed to reference the > VALUES output anyway. I added some test cases to show that this is > enough to make view-dumping behave sanely. I'm concerned about taking things in this direction. There's two scans here, really: a Values Scan for the VALUES construct, and then a Subquery Scan sitting on top of it that will normally be optimized away. It seems to me that you're basically guessing whether the subquery scan will be optimized away to a sufficient degree that its alias will not leak out anywhere. But that seems a bit fragile and error-prone. Whether to elide the subquery scan is a planner decision; what aliases to assign to the planner output is a ruleutils.c decision; but here you're talking about rejiggering things at parse time. The correctness of that rejiggering depends crucially on what will happen at plan time and then at EXPLAIN/ruleutils time, but the rules for what will happen at those later times are pretty darn complicated, so I feel like this is creating an awful lot of action at a distance. If were able (and I suspect we're not, but hypothetically) to in effect pull up the subquery at parse time, so that to the planner and executor it doesn't even exist, then I think that would be perfectly fine, because then we would have strong reasons for believing that no later decision can turn our parse-time decision into a problem. But to leave that subquery there and guess that it's going to disappear before we get to EXPLAIN doesn't seem nearly as safe. It seems pretty easy to either miss things (like the ORDER BY case) or even to have future planner changes break stuff. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Vacuum statistics
> On Jan 2, 2025, at 2:12 PM, Sami Imseih wrote: > > Alternatively, we can remove the vacuum related stats from pg_stat_all_tables, > but that will break monitoring tools and will leave us with the (auto)analyze > metrics alone in pg_stat_all_tables. This sounds very ugly. While backwards compatibility is important, there’s definitely precedent for changing what shows up in the catalog. IMHO it’s better to bite the bullet and move those fields instead of having vacuum stats spread across two different views.
Re: Fwd: Re: A new look at old NFS readdir() problems?
On 01/02/2025 2:50 pm, Bruce Momjian wrote: On Thu, Jan 2, 2025 at 03:48:53PM -0500, Tom Lane wrote: Larry Rosenman writes: > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on > my issue. Thanks for reaching out to him. So if I'm reading this correctly, there's little point in filing a FreeBSD bug because it'll be dismissed as unfixable. This leaves us in rather a nasty position. Sure, we could rewrite rmtree() as Thomas suggested upthread, but I'm still of the opinion that that's smearing lipstick on a pig. rmtree() is the least of our worries: it doesn't need to expect that anybody else will be modifying the target directory, plus it can easily restart its scan without complicated bookkeeping. I doubt we can make such an assumption for all our uses of readdir(), or that it's okay to miss or double-process files in every one of them. I'm still of the opinion that the best thing to do is disclaim safety of storing a database on NFS. It would be nice to have some details on the docs about why NFS can cause problems so we have something to point to when people ask. What about doing what Rick suggests? do { dir = opendir("X"); dp = readdir(dir); if (dp != NULL) unlink(dp->d_name); close(dir); } while (dp != NULL); ? -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: l...@lerctr.org US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
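For the restricted DROP DATABASE case, a slightly more defensive version of that loop might look like the sketch below (illustrative only, not patch code): it uses closedir() rather than close(), skips "." and "..", and builds the full path so it also works when the target is not the current directory. As noted above, this restart-after-each-unlink pattern is only safe where no concurrent modification of the directory is expected and the entries are plain files:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
remove_all_entries(const char *path)
{
	struct dirent *dp;

	for (;;)
	{
		DIR		   *dir = opendir(path);

		if (dir == NULL)
			return;

		/* find the first real entry, if any */
		while ((dp = readdir(dir)) != NULL &&
			   (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0))
			;

		if (dp != NULL)
		{
			char		fullpath[4096];

			snprintf(fullpath, sizeof(fullpath), "%s/%s", path, dp->d_name);
			unlink(fullpath);
		}

		closedir(dir);

		if (dp == NULL)
			break;				/* nothing left but "." and ".." */
	}
}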
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 1/2/25 22:09, Jim Nasby wrote: > >> On Dec 31, 2024, at 5:41 PM, Tomas Vondra wrote: >> >> On 12/31/24 21:46, Jim Nasby wrote: >>> On Dec 30, 2024, at 7:05 PM, James Hunter >>> wrote: On Sat, Dec 28, 2024 at 11:24 PM Jim Nasby wrote: > > IMHO none of this will be very sane until we actually have cluster- > level limits. One sudden burst in active connections and you still > OOM the instance. Fwiw, PG does support "max_connections" GUC, so a backend/connection - level limit, times "max_connections", yields a cluster-level limit. >>> >>> max_connections is useless here, for two reasons: >>> >>> 1. Changing it requires a restart. That’s at *best* a real PITA in >>> production. [1] >>> 2. It still doesn’t solve the actual problem. Unless your workload >>> *and* your data are extremely homogeneous you can’t simply limit the >>> number of connections and call it a day. A slight change in incoming >>> queries, OR in the data that the queries are looking at and you go >>> from running fine to meltdown. You don’t even need a plan flip for >>> this to happen, just the same plan run at the same rate but now >>> accessing more data than before. >>> >> >> I really don't follow your argument ... >> >> Yes, changing max_connections requires a restart - so what? AFAIK the >> point James was making is that if you multiply max_connections by the >> per-backend limit, you get a cluster-wide limit. And presumably the >> per-backend limit would be a GUC not requiring a restart. >> >> Yes, high values of max_connections are problematic. I don't see how a >> global limit would fundamentally change that. In fact, it could >> introduce yet more weird failures because some unrelated backend did >> something weird. > > That’s basically my argument for having workload management. If a system > becomes loaded enough for the global limit to start kicking in it’s > likely that query response time is increasing, which means you will soon > have more and more active backends trying to run queries. That’s just > going to make the situation even worse. You’d either have to start > trying to “take memory away” from already running backends or backends > that are just starting would have such a low limit as to cause them to > spill very quickly, creating further load on the system. > I'm not opposed to having a some sort of "workload management" (similar to what's available in some databases), but my guess is that's (at least) an order of magnitude more complex than introducing the memory limit discussed here. I can only guess, because no one really explained what would it include, how would it be implemented. Which makes it easy to dream about a solution that would fix the problem ... What I'm afraid will happen everyone mostly agrees a comprehensive workload management system would be better than a memory limit (be it per-backend or a global one). But without a workable proposal how to implement it no one ends up working on it. And no one gets to work on a memory limit because the imagined workload management would be better. So we get nothing ... FWIW I have a hard time imagining a workload management system without some sort of a memory limit. >> FWIW I'm not opposed to having some global memory limit, but as I >> explained earlier, I don't see a way to do that sensibly without having >> a per-backend limit first. Because if you have a global limit, a single >> backend consuming memory could cause all kinds of weird failures in >> random other backends. 
> > I agree, but I’m also not sure how much a per-backend limit would > actually help on its own, especially in OLTP environments. > So what if it doesn't help in every possible case? It'd be valuable even for OLAP/mixed workloads with sensible max_connection values, because it'd allow setting the work_mem more aggressively. >>> Most of what I’ve seen on this thread is discussing ways to >>> *optimize* how much memory the set of running backends can consume. >>> Adjusting how you slice the memory pie across backends, or even >>> within a single backend, is optimization. While that’s a great goal >>> that I do support, it will never fully fix the problem. At some point >>> you need to either throw your hands in the air and start tossing >>> memory errors, because you don’t have control over how much work is >>> being thrown at the engine. The only way that the engine can exert >>> control over that would be to hold new transactions from starting >>> when the system is under duress (ie, workload management). While >>> workload managers can be quite sophisticated (aka, complex), the nice >>> thing about limiting this scope to work_mem, and only as a means to >>> prevent complete overload, is that the problem becomes a lot simpler >>> since you’re only looking at one metric and not trying to support any >>> kind of priority system. The only fanciness I think an MVP would need >>> is a GUC to control how long a transaction can sit waitin
Re: Logical Replication of sequences
Hi Vignesh, Some minor review comments for the patch v20241230-0003. == src/backend/replication/logical/syncutils.c 1. + * syncutils.c + * PostgreSQL logical replication: common synchronization code + * + * Copyright (c) 2024, PostgreSQL Global Development Group Happy New Year. s/2024/2025/ ~~~ 2. +/* + * Enum representing the overall state of subscription relations state. + * + * SYNC_RELATIONS_STATE_NEEDS_REBUILD indicates that the subscription relations + * state is no longer valid and the subscription relations should be rebuilt. + * + * SYNC_RELATIONS_STATE_REBUILD_STARTED indicates that the subscription + * relations state is being rebuilt. + * + * SYNC_RELATIONS_STATE_VALID indicates that subscription relation state is + * up-to-date and valid. + */ 2a. That first sentence saying "overall state of [...] state" is a bit strange. Maybe it can be reworded something like: Enum for phases of the subscription relations state. ~ 2b. /is no longer valid and/is no longer valid, and/ ` 2c. /that subscription relation state is up-to-date/that the subscription relation state is up-to-date/ == Kind Regards, Peter Smith. Fujitsu Australia
Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
On Thu, Jan 2, 2025 at 4:20 AM Peter Smith wrote: > > Hi Shubham. > > Here are some review comments for the patch v4-0001. > > == > Commit message. > > 1. > The 'pg_createsubscriber' utility is updated to fetch and validate the > 'max_slot_wal_keep_size' setting from the publisher. A warning is raised > during > the '--dry-run' mode if the configuration is set to a non-default value. > > > This says a "warning is raised", but patch v4 changed the warning to > an error, so this description is incompatible with the code. > Since, this patch now raises a warning instead of an error so I think this should be the same as before. > == > doc/src/sgml/ref/pg_createsubscriber.sgml > > 2. > + > +The 'max_slot_wal_keep_size' must be set to -1 to prevent the automatic > +removal of WAL files needed by replication slots. Setting this parameter > to > +a specific size may lead to replication failures if required WAL files > are > +prematurely deleted. > + > > The 'max_slot_wal_keep_size' should not be single-quoted like that. It > should use the same markup as the other nearby GUC names and also have > a link like those others do. > Added the link for 'max_slot_wal_keep_size' and updated the 'pg_createsubscriber' documentation. > == > src/bin/pg_basebackup/pg_createsubscriber.c > > 3. > +/* > + * Validate max_slot_wal_keep_size > + * Logical replication requires max_slot_wal_keep_size to be set to -1 on the > + * publisher to prevent the deletion of WAL files that are still needed by > + * replication slots. If this parameter is set to a non-default value, it may > + * cause replication failures due to required WAL files being prematurely > + * removed. > + */ > > 3a. > Not fixed? For patch v3 I had already posted [1-#4a] that this entire > comment is wrongly indented. > Fixed. > > 3b. > That first sentence looks like it is missing a period and a following > blank line. OTOH, maybe the first sentence is not even necessary. > Fixed. > > 4. > + if (dry_run && max_slot_wal_keep_size != -1) > + { > + pg_log_error("publisher requires max_slot_wal_keep_size to be -1, > but only %d remain", > + max_slot_wal_keep_size); > + pg_log_error_hint("Change the configuration parameter \"%s\" on the > publisher to %d.", > + "max_slot_wal_keep_size", -1); > + failed = true; > + } > + > > 4a. > Not fixed? For patch v3 I had already posted [1-#4b] that it seemed > strange you did not use the \"%s\" substitution for the GUC name of > the first message (unlike the 2nd message). > > I also think it is strange that the 1st message uses a hardwired -1, > but the 2nd message uses a parameter for the -1. > Updated the warning message and warning detail. > > 4b. > I don't think "but only %d remain" is the appropriate wording for this > GUC. It looks like a cut/paste mistake from a previous message. > Anyway, maybe this message should be something quite different so it > is not just duplicating the same information as the error_hint. e.g. > see below for one idea. This could also resolve the previous review > comment. > > SUGGESTION > "publisher requires WAL size must not be restricted" > I have used your suggestion in the latest patch. > > 4c. > I saw that this only emits the message in dry-mode, which is > apparently based on the suggestion from Vignesh [2-#1]. 
Meanwhile, all > the comments/docs say "it may cause replication failures", so I felt > it might be better to always stop everything up-front if there is a > wrong configuration, instead of waiting for potential failures to > happen -- that's why I had suggested using error instead of a warning, > but maybe I am in the minority. > > IIUC there are two ways to address this configuration problem: > > A. Give warnings, but only in dry-run mode. If the warnings are > ignored (or if the dry-run was not even done), there may be > replication failures later, but we HOPE it will not happen. > > B. Give errors in all modes. The early config error prevents any > chance of later replication errors. > > So, I think the implementation needs to choose either A or B. > Currently, it seems a mixture. > > == > [1] my v3 review - > https://www.postgresql.org/message-id/CAHut%2BPsgEphCa-P-3Q7cORA%3D1TRv15UUsaxN9ZkX56M1-J_QRg%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/CALDaNm09cRzke52UN5zx33PT390whU92oXY4gfOSZEo17CLPjw%40mail.gmail.com If 'max_slot_wal_keep_size' is not set to -1, there is no surety that 'pg_createsubscriber' will function as expected and the removal of WAL files will start to occur. Therefore, a warning is raised instead of an error to alert users about the potential configuration issue. If 'max_slot_wal_keep_size' is -1 (the default), replication slots may retain an unlimited amount of WAL files. The attached Patch contains the suggested changes. Thanks and regards, Shubham Khanna. v5-0001-Validate-max_slot_wal_keep_size-in-pg_createsubsc.patch Description: Binary data
Re: Conflict detection for update_deleted in logical replication
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu) wrote: > > On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 > wrote: > > > > Dear Hou, > > > > Thanks for updating the patch. Few comments: > > Thanks for the comments! > > > 02. ErrorOnReservedSlotName() > > > > Currently the function is callsed from three points - > > create_physical_replication_slot(), > > create_logical_replication_slot() and CreateReplicationSlot(). > > Can we move them to the ReplicationSlotCreate(), or combine into > > ReplicationSlotValidateName()? > > I am not sure because moving the check into these functions because that would > prevent the launcher from creating the slot as well unless we add a new > parameter for these functions, but I am not sure if it's worth it at this > stage. > > > > > 03. advance_conflict_slot_xmin() > > > > ``` > > Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin)); > > ``` > > > > Assuming the case that the launcher crashed just after > > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT). > > After the restart, the slot can be acquired since > > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT) > > is true, but the process would fail the assert because data.xmin is still > > invalid. > > > > I think we should re-create the slot when the xmin is invalid. Thought? > > After thinking more, the standard approach to me would be to mark the slot as > EPHEMERAL during creation and persist it after initializing, so changed like > that. > > > 05. check_remote_recovery() > > > > Can we add a test case related with this? > > I think the code path is already tested, and I am a bit unsure if we want to > setup > a standby to test the ERROR case, so didn't add this. > > --- > > Attach the new version patch set which addressed all other comments. Few suggestions: 1) If we have a subscription with detect_update_deleted option and we try to upgrade it with default settings(in case dba forgot to set track_commit_timestamp), the upgrade will fail after doing a lot of steps like that mentioned in ok below: Setting locale and encoding for new cluster ok Analyzing all rows in the new cluster ok Freezing all rows in the new cluster ok Deleting files from new pg_xact ok Copying old pg_xact to new server ok Setting oldest XID for new clusterok Setting next transaction ID and epoch for new cluster ok Deleting files from new pg_multixact/offsets ok Copying old pg_multixact/offsets to new serverok Deleting files from new pg_multixact/members ok Copying old pg_multixact/members to new serverok Setting next multixact ID and offset for new cluster ok Resetting WAL archivesok Setting frozenxid and minmxid counters in new cluster ok Restoring global objects in the new cluster ok Restoring database schemas in the new cluster postgres *failure* We should detect this at an earlier point somewhere like in check_new_cluster_subscription_configuration and throw an error from there. 2) Also should we include an additional slot for the pg_conflict_detection slot while checking max_replication_slots. Though this error will occur after the upgrade is completed, it may be better to include the slot during upgrade itself so that the DBA need not handle this error separately after the upgrade is completed. 3) We have reserved the pg_conflict_detection name in this version, so if there was a replication slot with the name pg_conflict_detection in the older version, the upgrade will fail at a very later stage like an earlier upgrade shown. 
I feel we should check if the old cluster has any slot with the name pg_conflict_detection and throw an error earlier itself: +void +ErrorOnReservedSlotName(const char *name) +{ + if (strcmp(name, CONFLICT_DETECTION_SLOT) == 0) + ereport(ERROR, + errcode(ERRCODE_RESERVED_NAME), + errmsg("replication slot name \"%s\" is reserved", + name)); +} 4) We should also mention something like below in the documentation so the user can be aware of it: The slot name cannot be created with pg_conflict_detection, as this is reserved for logical replication conflict detection. Regards, Vignesh
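A rough sketch of the early check suggested in 3), run against the old cluster during pg_upgrade's check phase (the function name, message wording, and exact call site are assumptions; executeQueryOrDie() and pg_fatal() are pg_upgrade's existing helpers):

static void
check_old_cluster_for_reserved_slot_name(PGconn *conn)
{
    PGresult   *res;

    res = executeQueryOrDie(conn,
                            "SELECT slot_name "
                            "FROM pg_catalog.pg_replication_slots "
                            "WHERE slot_name = 'pg_conflict_detection'");

    if (PQntuples(res) > 0)
        pg_fatal("replication slot \"pg_conflict_detection\" exists on the old cluster; "
                 "this name is reserved, drop the slot before upgrading");

    PQclear(res);
}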
Re: Modern SHA2- based password hashes for pgcrypto
> On 31 Dec 2024, at 17:06, Bernd Helmle wrote: > I adapted the code from the publicly available reference implementation > at [1]. It's based on our existing OpenSSL infrastructure in pgcrypto > and produces compatible password hashes with crypt() and "openssl > passwd" with "-5" and "-6" switches. Potentially daft question, but since we require OpenSSL to build pgcrypto, why do we need to include sha2 code instead of using the sha2 implementation in libcrypto? How complicated would it be to use the OpenSSL API instead? -- Daniel Gustafsson
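For context, the libcrypto route being suggested would funnel each digest computation through OpenSSL's EVP interface, roughly as below. This is only a sketch: the helper name is made up, error handling is minimal, and how it would hook into pgcrypto's existing px digest layer is left open. Both the "-5" (SHA-256) and "-6" (SHA-512) variants could call such a helper repeatedly for their rounds, e.g. digest_once(EVP_sha512(), buf, buflen, out, &outlen).

#include <string.h>
#include <openssl/evp.h>

static int
digest_once(const EVP_MD *md, const void *data, size_t len,
            unsigned char *out, unsigned int *outlen)
{
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();

    if (ctx == NULL)
        return 0;
    if (EVP_DigestInit_ex(ctx, md, NULL) != 1 ||
        EVP_DigestUpdate(ctx, data, len) != 1 ||
        EVP_DigestFinal_ex(ctx, out, outlen) != 1)
    {
        EVP_MD_CTX_free(ctx);
        return 0;
    }
    EVP_MD_CTX_free(ctx);
    return 1;
}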
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Jan 2, 2025 at 5:44 AM Peter Smith wrote: > > Hi Nisha. > > My review comments for patch v58-0001. > > == > src/backend/replication/slot.c > > InvalidatePossiblyObsoleteSlot: > > 1. > /* > - * If the slot can be acquired, do so and mark it invalidated > - * immediately. Otherwise we'll signal the owning process, below, and > - * retry. > + * If the slot can be acquired, do so and mark it as invalidated. If > + * the slot is already ours, mark it as invalidated. Otherwise, we'll > + * signal the owning process below and retry. > */ > - if (active_pid == 0) > + if (active_pid == 0 || > + (MyReplicationSlot == s && active_pid == MyProcPid)) > { > > As you previously explained [1] "This change applies to all types of > invalidation, not just inactive_timeout case [...] It's a general > optimization for the case when the current process is the active PID > for the slot." > > In that case, should this be in a separate patch that can be pushed to > master by itself, i.e. independent of anything else in this thread > that is being done for the purpose of implementing the timeout > feature? The patch-001 has additional general optimizations similar to the one you mentioned, which are not strictly required for this feature. Let’s wait for input from others on splitting the patches or addressing it in a separate thread. -- Thanks, Nisha
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Jan 2, 2025 at 8:16 AM Peter Smith wrote: > > Hi Nisha, > > Here are some minor review comments for patch v58-0002. > Thank you for your feedback! Please find the v59 patch set addressing all the comments. Note: There are no new changes in patch-0001. -- Thanks, Nisha From 8154e2baee6fcf348524899c1f8b643a1e3564fc Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Mon, 18 Nov 2024 16:13:26 +0530 Subject: [PATCH v59 1/2] Enhance replication slot error handling, slot invalidation, and inactive_since setting logic In ReplicationSlotAcquire(), raise an error for invalid slots if the caller specifies error_if_invalid=true. Add check if the slot is already acquired, then mark it invalidated directly. Ensure same inactive_since time for all slots in update_synced_slots_inactive_since() and RestoreSlotFromDisk(). --- .../replication/logical/logicalfuncs.c| 2 +- src/backend/replication/logical/slotsync.c| 13 ++-- src/backend/replication/slot.c| 69 --- src/backend/replication/slotfuncs.c | 2 +- src/backend/replication/walsender.c | 4 +- src/backend/utils/adt/pg_upgrade_support.c| 2 +- src/include/replication/slot.h| 3 +- src/test/recovery/t/019_replslot_limit.pl | 2 +- 8 files changed, 75 insertions(+), 22 deletions(-) diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c index 0148ec3678..ca53caac2f 100644 --- a/src/backend/replication/logical/logicalfuncs.c +++ b/src/backend/replication/logical/logicalfuncs.c @@ -197,7 +197,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin else end_of_wal = GetXLogReplayRecPtr(NULL); - ReplicationSlotAcquire(NameStr(*name), true); + ReplicationSlotAcquire(NameStr(*name), true, true); PG_TRY(); { diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index f6945af1d4..692527b984 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -446,7 +446,7 @@ drop_local_obsolete_slots(List *remote_slot_list) if (synced_slot) { -ReplicationSlotAcquire(NameStr(local_slot->data.name), true); +ReplicationSlotAcquire(NameStr(local_slot->data.name), true, false); ReplicationSlotDropAcquired(); } @@ -665,7 +665,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid) * pre-check to ensure that at least one of the slot properties is * changed before acquiring the slot. */ - ReplicationSlotAcquire(remote_slot->name, true); + ReplicationSlotAcquire(remote_slot->name, true, false); Assert(slot == MyReplicationSlot); @@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len) static void update_synced_slots_inactive_since(void) { - TimestampTz now = 0; + TimestampTz now; /* * We need to update inactive_since only when we are promoting standby to @@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void) /* The slot sync worker or SQL function mustn't be running by now */ Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing); + /* Use same inactive_since time for all slots */ + now = GetCurrentTimestamp(); + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); for (int i = 0; i < max_replication_slots; i++) @@ -1537,10 +1540,6 @@ update_synced_slots_inactive_since(void) /* The slot must not be acquired by any process */ Assert(s->active_pid == 0); - /* Use the same inactive_since time for all the slots. 
*/ - if (now == 0) -now = GetCurrentTimestamp(); - SpinLockAcquire(&s->mutex); s->inactive_since = now; SpinLockRelease(&s->mutex); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index b30e0473e1..47e474a548 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -163,6 +163,7 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot); static void RestoreSlotFromDisk(const char *name); static void CreateSlotOnDisk(ReplicationSlot *slot); static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel); +static void RaiseSlotInvalidationError(ReplicationSlot *slot); /* * Report shared-memory space needed by ReplicationSlotsShmemInit. @@ -535,9 +536,12 @@ ReplicationSlotName(int index, Name name) * * An error is raised if nowait is true and the slot is currently in use. If * nowait is false, we sleep until the slot is released by the owning process. + * + * An error is raised if error_if_invalid is true and the slot is found to + * be invalid. */ void -ReplicationSlotAcquire(const char *name, bool nowait) +ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid) { ReplicationSlot *s; int active_pid; @@ -615,6 +619,13 @@ retry: /* We made this slot active, so it's ours now. */ MyReplicationSlot = s; + /* + * An err
Re: Conflict detection for update_deleted in logical replication
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu) wrote: > > On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 > wrote: > > > > Dear Hou, > > > > Thanks for updating the patch. Few comments: > > Thanks for the comments! > > > 02. ErrorOnReservedSlotName() > > > > Currently the function is callsed from three points - > > create_physical_replication_slot(), > > create_logical_replication_slot() and CreateReplicationSlot(). > > Can we move them to the ReplicationSlotCreate(), or combine into > > ReplicationSlotValidateName()? > > I am not sure because moving the check into these functions because that would > prevent the launcher from creating the slot as well unless we add a new > parameter for these functions, but I am not sure if it's worth it at this > stage. > > > > > 03. advance_conflict_slot_xmin() > > > > ``` > > Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin)); > > ``` > > > > Assuming the case that the launcher crashed just after > > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT). > > After the restart, the slot can be acquired since > > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT) > > is true, but the process would fail the assert because data.xmin is still > > invalid. > > > > I think we should re-create the slot when the xmin is invalid. Thought? > > After thinking more, the standard approach to me would be to mark the slot as > EPHEMERAL during creation and persist it after initializing, so changed like > that. > > > 05. check_remote_recovery() > > > > Can we add a test case related with this? > > I think the code path is already tested, and I am a bit unsure if we want to > setup > a standby to test the ERROR case, so didn't add this. > > --- > > Attach the new version patch set which addressed all other comments. > > Based on some off-list discussions with Sawada-san and Amit, it would be > better > if the apply worker can avoid reporting an ERROR if the publisher's clock's > lags behind that of the subscriber, so I implemented a new 0007 patch to allow > the apply worker to wait for the clock skew to pass and then send a new > request > to the publisher for the latest status. The implementation is as follows: > > Since we have the time (reply_time) on the walsender when it confirms that all > the committing transactions have finished, it means any subsequent > transactions > on the publisher should be assigned a commit timestamp later then reply_time. > And the (candidate_xid_time) when it determines the oldest active xid. Any old > transactions on the publisher that have finished should have a commit > timestamp > earlier than the candidate_xid_time. > > The apply worker can compare the candidate_xid_time with reply_time. If > candidate_xid_time is less than the reply_time, then it's OK to advance the > xid > immdidately. If candidate_xid_time is greater than reply_time, it means the > clock of publisher is behind that of the subscriber, so the apply worker can > wait for the skew to pass before advancing the xid. > > Since this is considered as an improvement, we can focus on this after > pushing the main patches. Conflict detection of truncated updates is detected as update_missing and deleted update is detected as update_deleted. I was not sure if truncated updates should also be detected as update_deleted, as the document says truncate operation is "It has the same effect as an unqualified DELETE on each table" at [1]. 
I tried with the following three-node (N1, N2 & N3) setup, with subscribers on N3 subscribing to publisher pub1 on N1 and publisher pub2 on N2:

N1 - pub1
N2 - pub2
N3 - sub1 -> pub1(N1) and sub2 -> pub2(N2)

-- Insert a record in N1
insert into t1 values(1);
-- Insert a record in N2
insert into t1 values(1);

-- Now N3 has the above inserts from N1 and N2
N3=# select * from t1;
 c1
----
  1
  1
(2 rows)

-- Truncate t1 from N2
N2=# truncate t1;
TRUNCATE TABLE

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 2;
UPDATE 1
N1=# select * from t1;
 c1
----
  2
(1 row)

--- N3 logs the conflict as update_missing
2025-01-02 12:21:37.388 IST [24803] LOG:  conflict detected on relation "public.t1": conflict=update_missing
2025-01-02 12:21:37.388 IST [24803] DETAIL:  Could not find the row to be updated. Remote tuple (2); replica identity full (1).
2025-01-02 12:21:37.388 IST [24803] CONTEXT:  processing remote data for replication origin "pg_16387" during message type "UPDATE" for replication target relation "public.t1" in transaction 757, finished at 0/17478D0

-- Insert a record with value 2 in N2
N2=# insert into t1 values(2);
INSERT 0 1

-- Now N3 has the above inserted records:
N3=# select * from t1;
 c1
----
  2
(1 row)

-- Delete this record from N2:
N2=# delete from t1;
DELETE 1

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 3;
UPDATE 1
--
Re: FileFallocate misbehaving on XFS
Hi, On 2025-01-02 11:41:56 +0100, Andrea Gelmini wrote: > On Tue, 31 Dec 2024 at 16:31, Andres Freund > wrote: > > 2024-12-19 04:47:04 CET [2646363]: ERROR: could not extend file > > "pg_tblspc/107724/PG_16_202307071/465960/3232056651.2" by 11 blocks, from > > 29850 to 29861, using FileFallocate(): No space left on device > > > Dunno if it helps, but today I read this reference in the latest patchset on > the XFS mailing list: > https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangsh...@kylinos.cn/ > > > It could be related and explain the effect? I doubt it - there was a lot more free space in the various AGs and they, with the exception of 1-2 AGs, weren't that fragmented. Greetings, Andres Freund
Re: FileFallocate misbehaving on XFS
Hi, On 2024-12-20 11:39:42 -0500, Andres Freund wrote: > On 2024-12-19 17:47:13 +1100, Michael Harris wrote: > > This is a different system to those I previously provided logs from. > > It is also running RHEL8 with a similar configuration to the other > > system. > > Given it's a RHEL system, have you raised this as an issue with RH? They > probably have somebody with actual XFS hacking experience on staff. > > RH's kernels are *heavily* patched, so it's possible the issue is actually RH > specific. FWIW, I raised this on the #xfs irc channel. One ask they had was: │15:56:40 dchinner | andres: can you get a metadump of a filesystem that is displaying these symptoms for us to analyse? │15:57:54 dchinner | metadumps don't contain data, and metadata is obfuscated so no filenames or attributes are exposed, either. Greetings, Andres
POC: track vacuum/analyze cumulative time per relation
Hi, After a recent question regarding tracking vacuum start_time in pg_stat_all_tables [1], it dawned on me that this view is missing an important cumulative metric: how much time is spent performing vacuum per table. Currently, the only way a user can get this information is if they enable autovacuum logging or track timing for manual vacuums. Even then, if a user wants to trend the time spent vacuuming over time, they must store the timing data somewhere and perform the calculations. Also, unless autovacuum logging is enabled for all a/v operations, they could have gaps in their analysis. Having the total (auto)vacuum elapsed time alongside the existing (auto)vacuum_count allows a user to track the average time of an operation over time and to find vacuum tuning opportunities. The same can also be said for (auto)analyze. Attached is a patch (without doc changes) that adds 4 new columns:

total_autovacuum_time
total_vacuum_time
total_autoanalyze_time
total_analyze_time

Below is an example of the output and how it can be used to derive the average vacuum operation time.

postgres=# select relname, autovacuum_count, total_autovacuum_time,
  total_autovacuum_time/NULLIF(autovacuum_count,0) average_autovac_time,
  vacuum_count, total_vacuum_time,
  total_vacuum_time/NULLIF(vacuum_count,0) average_vac_time
  from pg_catalog.pg_stat_all_tables where relname = 'pgbench_history';
-[ RECORD 1 ]---------+----------------
relname               | pgbench_history
autovacuum_count      | 3
total_autovacuum_time | 1689
average_autovac_time  | 563
vacuum_count          | 1
total_vacuum_time     | 1
average_vac_time      | 1

It should be noted that the timing is only tracked when the vacuum or analyze operation completes, as is the case with the other metrics. Also, there is another discussion in-flight [2] regarding tracking vacuum run history in a view, but that serves a different purpose, as it will provide all the metrics that are otherwise exposed in vacuum logging via SQL. That history will also need to drop entries using some criteria to keep the cache from growing indefinitely. Feedback for the attached patch is appreciated! Regards, Sami Imseih Amazon Web Services (AWS) [1] https://www.postgresql.org/message-id/flat/CAGjGUAKQ4UBNdkjunH2qLsdUVG-3F9gCuG0Kb0hToo%2BuMmSteQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/b68ab452-c41f-4d04-893f-eaab84f1855b%40vondra.me POC-track-vacuum-analyze-cumulative-time-per-table.patch Description: Binary data
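For readers following along, the accumulation pattern behind such counters is roughly the one below. This is a sketch only: the accumulator in the comment is hypothetical, and the real patch presumably extends the pgstat table entry and the existing pgstat_report_vacuum()/pgstat_report_analyze() paths instead.

#include "postgres.h"
#include "portability/instr_time.h"

static void
track_operation_time(void (*do_operation) (void))
{
    instr_time  starttime;
    instr_time  duration;

    INSTR_TIME_SET_CURRENT(starttime);
    do_operation();             /* the vacuum or analyze work */
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, starttime);

    /*
     * Hypothetical accumulation: only add to the cumulative counter once the
     * operation has completed, matching how (auto)vacuum_count behaves.
     *
     * tabentry->total_vacuum_time += INSTR_TIME_GET_MILLISEC(duration);
     */
    elog(DEBUG1, "operation took %.3f ms", INSTR_TIME_GET_MILLISEC(duration));
}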
Re: IWYU annotations
On 09.12.24 17:37, Tom Lane wrote: Peter Eisentraut writes: I have done a pass over much of the source code with include-what-you-use (IWYU) to remove superfluous includes (commits dbbca2cf299, 9be4e5d293b, ecb5af77987). Along the way I have collected some pragma annotations to deal with exceptions and special cases and peculiarities of the PostgreSQL source code header structures (see [0] for description). Here I'm proposing a set of patches to add such annotations in commonly useful cases that should deal with most of the noise. This seems to be going in the direction that there will be Yet Another tool that committers have to know everything about in order to not commit bad code. I'm feeling resistant to that, mainly because I'm far from convinced that IWYU brings us enough value to justify everybody having to learn about it. It's not realistic at the moment for it to be a tool that everyone needs to use and everyone needs to keep 100% clean. We're certainly very far from that being feasible at the moment. I see it useful in two areas: First, when you add new files or move lots of code around, you can have it provide an informed opinion about what includes to keep and add. Because in practice that's mostly a crapshoot nowadays. An example from a current patch under review: $ iwyu_tool.py -p build src/backend/libpq/auth-oauth.c ... ../src/backend/libpq/auth-oauth.c should remove these lines: - #include // lines 19-19 - #include // lines 18-18 - #include "storage/fd.h" // lines 28-28 Second, clangd (the language server) has support for this also, and so depending on local configuration and preferences, it can highlight missing or redundant includes or even add some automatically as you edit. (The latter obviously needs some manual supervision, but it is arguably kind of neat that you don't need to jump to the top and manually add includes as you type in new code that needs an additional header.) But in order for either of this to work, it needs to have some information about basic PostgreSQL code conventions. Otherwise, it will also do this: ../src/backend/libpq/auth-oauth.c should add these lines: ... #include // for false, bool, true #include// for NULL, size_t #include "c.h"// for Assert, explicit_bzero, pg_strncasecmp ... because it doesn't know that the convention is that you are supposed to include "postgres.h" (or one of the other always-first headers) and then everything that it brings it should usually not be included again directly. (Or conversely in some cases it will suggest to remove the include of "postgres.h" because it doesn't provide anything of use.) So right now you get a bunch of noise and misleading information for each file and the whole clangd support is of limited use. That's what my patch set is mainly trying to address. In particular, this patchset introduces what seem like very error-prone setups, such as in rmgrdesc.c where there's now one group of #include's with "pragma: begin_keep/pragma: end_keep" around it and another group without. Most of us are likely to just blindly stick a new #include into alphabetical order somewhere in there and not notice that there's now an additional concern. The fact that that you've added precisely zero documentation about what these pragmas are doesn't help. It's a fair point that some documentation could be provided. I suppose we don't want to verbosely explain each pragma individually. Should there be some central explanation, maybe in src/tools/pginclude/README? 
Note that if you google like "IWYU pragma: export" it will take you to the upstream documentation as the first hit, so the full documentation is pretty easy to find.
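For readers unfamiliar with the annotations under discussion: they are ordinary comments that IWYU (and clangd's IWYU support) recognize. The header names below are made up for illustration, but the pragma spellings are the upstream ones.

#include "postgres.h"

#include <unistd.h>                 /* IWYU pragma: keep */

/* IWYU pragma: begin_keep */
#include "access/bar_rmgr_desc.h"
#include "access/foo_rmgr_desc.h"
/* IWYU pragma: end_keep */

and in a header that deliberately re-exposes another one:

#include "utils/inner_header.h"     /* IWYU pragma: export */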
RE: [RFC] Lock-free XLog Reservation from WAL
This message is a duplicate of ph7pr11mb5796659f654f9be983f3ad97ef...@ph7pr11mb5796.namprd11.prod.outlook.com. Please consider dropping this thread and review the original one instead. Sorry for your inconvenience. -Original Message- From: Zhou, Zhiguo Sent: Thursday, January 2, 2025 3:20 PM To: pgsql-hackers@lists.postgresql.org Subject: [RFC] Lock-free XLog Reservation from WAL Hi all, I am reaching out to solicit your insights and comments on a recent proposal regarding the "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer which involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog) and PrevBytePos (the prev-link to the previous XLog). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions. To address this issue, we propose the following changes: 1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add). 2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the next XLog always points to the head of the current Xlog,we will slightly exceed the reserved memory range of the current XLog to update the prev-link of the next XLog, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL. 3. Breaking Sequential Write Convention: Each backend will update the prev-link of its next XLog first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with the sequential writes). 4. Revised GetXLogBuffer: To support #3, we need update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field. 5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism. The attached patch could pass the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPU) shows that this optimization could help the TPCC benchmark better scale with the core count and as a result the performance with full cores enabled could be improved by 2.04x. Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.
RE: RFC: Lock-free XLog Reservation from WAL
This message is a duplicate of ph7pr11mb5796659f654f9be983f3ad97ef...@ph7pr11mb5796.namprd11.prod.outlook.com. Please consider dropping this thread and review the original one instead. Sorry for your inconvenience. -Original Message- From: Zhou, Zhiguo Sent: Thursday, January 2, 2025 5:15 PM To: pgsql-hackers@lists.postgresql.org Subject: RFC: Lock-free XLog Reservation from WAL Hi all, I am reaching out to solicit your insights and comments on a recent proposal regarding the "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer which involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog) and PrevBytePos (the prev-link to the previous XLog). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions. To address this issue, we propose the following changes: 1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add). 2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the next XLog always points to the head of the current Xlog,we will slightly exceed the reserved memory range of the current XLog to update the prev-link of the next XLog, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL. 3. Breaking Sequential Write Convention: Each backend will update the prev-link of its next XLog first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with the sequential writes). 4. Revised GetXLogBuffer: To support #3, we need update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field. 5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism. The attached patch could pass the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPU) shows that this optimization could help the TPCC benchmark better scale with the core count and as a result the performance with full cores enabled could be improved by 2.04x. Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.
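To make point #1 above concrete, here is a toy sketch (not the attached patch) of what the reservation fast path becomes once PrevBytePos is removed: a single atomic fetch-add on the usable-byte position, with no spinlock. The prev-link hand-off described in points #2-#4 is omitted, and the struct and function names are illustrative only.

#include "postgres.h"
#include "port/atomics.h"

typedef struct XLogReserveState
{
    pg_atomic_uint64 CurrBytePos;   /* usable-byte position, as today */
} XLogReserveState;

static inline uint64
ReserveXLogBytes(XLogReserveState *state, uint32 size)
{
    /* returns the start position; the reserved range ends at start + size */
    return pg_atomic_fetch_add_u64(&state->CurrBytePos, size);
}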
Re: read stream on amcheck
Thanks for reviewing! On Thu, 2 Jan 2025 at 13:16, Kirill Reshke wrote: > > However, this: > > >- if (skip_option == SKIP_PAGES_ALL_FROZEN) > >- { > >- if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) > >- continue; > >- } > >- > >- if (skip_option == SKIP_PAGES_ALL_VISIBLE) > >- { > >- if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) > >- continue; > >- } > > changes to this > (in heapam_read_stream_next_block) > >+ > >+ if ((p->skip_option == SKIP_PAGES_ALL_FROZEN && (mapbits & >VISIBILITYMAP_ALL_FROZEN) != 0) || > >+ (p->skip_option == SKIP_PAGES_ALL_VISIBLE && (mapbits & >VISIBILITYMAP_ALL_VISIBLE) != 0)) > + continue; > > I don't understand this change. The patch aims to be purely applying > streaming API, not refactoring. And if we refactor this code, this is > less readable than it was. Yep, I agree. Attached is a fixed v2. -- Matheus Alcantara v2-0001-Use-read-stream-on-amcheck.patch Description: Binary data
Re: Switching XLog source from archive to streaming when primary available
> On 23 Mar 2024, at 14:22, Bharath Rupireddy > wrote: > > IMHO, it makes sense to have something like replay_source_order if > there's any use case that arises in future requiring the standby to > intentionally switch to pg_wal or archive. But not as part of this > feature. IMO, it's a vital part of the feature. In my observation, restore from archive is many orders of magnitude faster than streaming replication. Advanced archive tools employ compression (6x speedup) and download parallelism (4x), are not constrained by the primary's network limits (3x) or disk limits, do not depend on the complicated FEBE protocol, etc. When I have to cope with a lagging replica, I almost always kill the walreceiver and tweak server readahead. But there might be cases where you still have to attach a replica ASAP; I can think of releasing a replication slot, or a transiently failed archive network or storage. Finally, one might want to have many primary connections: a cascading replica might want to stream from any available host in the group of HA hosts. Best regards, Andrey Borodin.
Re: IWYU annotations
Peter Eisentraut writes: > On 09.12.24 17:37, Tom Lane wrote: >> In particular, this patchset introduces what seem like very >> error-prone setups, such as in rmgrdesc.c where there's now one >> group of #include's with "pragma: begin_keep/pragma: end_keep" >> around it and another group without. Most of us are likely >> to just blindly stick a new #include into alphabetical order >> somewhere in there and not notice that there's now an additional >> concern. The fact that that you've added precisely zero >> documentation about what these pragmas are doesn't help. > It's a fair point that some documentation could be provided. I suppose > we don't want to verbosely explain each pragma individually. Should > there be some central explanation, maybe in src/tools/pginclude/README? That might do, but perhaps instead in the "PostgreSQL Coding Conventions" chapter of the SGML docs? Actually, I think we could do with a centralized explanation of our inclusion conventions --- I'm not sure that the whole business of "postgres.h or a sibling must be first" is explained in any easy-to-find place. This topic would likely fit well with such an explanation. But really, the point I was trying to make above is that I don't want this to break our very long-standing convention that headers should be #include'd alphabetically and there is never a need to impose some other order (at least not without lots of commentary about it at the scene of the crime). The way you've done it here is just asking for trouble, IMO. If that means redundant pragma commands, so be it. regards, tom lane
Re: read stream on amcheck
On Thu, 2 Jan 2025 at 20:29, Matheus Alcantara wrote: > > Hi, > > While reviewing some other patches implementing the stream API for core > subsystems, > I noticed that the amcheck extension could also benefit from that. > > Notice the refactor when handling the "skip" parameter; the logic was moved to > the heapam_read_stream_next_block callback so that verify_heapam doesn't need to > touch any private field of the heapam_read_stream_next_block_private struct. > > One other thing to mention is that the test cases for the "skip" parameter > that I've seen just test when the first page is corrupted, so I think > that a careful review of the callback logic would be good to ensure that > we don't accidentally skip a page when doing p->current_blocknum++; > > This patch doesn't show any performance improvements (or regressions), > but I think that it would be good to replace the ReadBufferExtended > usage with the read stream API, so in the future it could benefit > from the AIO project. > > -- > Matheus Alcantara Hi! +1 on the idea. However, this: >- if (skip_option == SKIP_PAGES_ALL_FROZEN) >- { >- if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) >- continue; >- } >- >- if (skip_option == SKIP_PAGES_ALL_VISIBLE) >- { >- if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) >- continue; >- } changes to this (in heapam_read_stream_next_block) >+ >+ if ((p->skip_option == SKIP_PAGES_ALL_FROZEN && (mapbits & >VISIBILITYMAP_ALL_FROZEN) != 0) || >+ (p->skip_option == SKIP_PAGES_ALL_VISIBLE && (mapbits & >VISIBILITYMAP_ALL_VISIBLE) != 0)) > + continue; I don't understand this change. The patch aims to purely apply the streaming API, not refactor. And if we do refactor this code, this version is less readable than it was. Other than that, LGTM. -- Best regards, Kirill Reshke
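For what it's worth, a hedged sketch of the callback shape under discussion, keeping the two skip checks separate so the original readability is preserved while the skip decision still moves into the block-number callback handed to read_stream_begin_relation(). The struct and its fields are stand-ins for the patch's private data, and SKIP_PAGES_* are amcheck's existing values.

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/read_stream.h"

typedef struct HeapCheckReadStreamData
{
    Relation    rel;
    Buffer     *vmbuffer;
    int         skip_option;        /* SKIP_PAGES_ALL_FROZEN, etc. */
    BlockNumber current_blocknum;
    BlockNumber last_exclusive;
} HeapCheckReadStreamData;

static BlockNumber
heapcheck_read_stream_next_block(ReadStream *stream,
                                 void *callback_private_data,
                                 void *per_buffer_data)
{
    HeapCheckReadStreamData *p = callback_private_data;

    while (p->current_blocknum < p->last_exclusive)
    {
        BlockNumber blkno = p->current_blocknum++;
        uint8       mapbits = visibilitymap_get_status(p->rel, blkno,
                                                       p->vmbuffer);

        if (p->skip_option == SKIP_PAGES_ALL_FROZEN &&
            (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
            continue;

        if (p->skip_option == SKIP_PAGES_ALL_VISIBLE &&
            (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
            continue;

        return blkno;
    }

    return InvalidBlockNumber;
}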
Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN
On Tue, Dec 3, 2024 at 3:13 AM Masahiro Ikeda wrote: > > Hi, > > The documentation seems to overlook the rewrite condition > when executing ALTER TABLE ADD COLUMN. > > The current document states that a volatile DEFAULT will > trigger a rewrite of the table and its indexes. However, the > table and its indexes will also be rewritten when an IDENTITY > column is added, or when a column with a domain data type that > has constraints is added. > > What do you think? > We still see a number of people asking (or confused) about table rewrites when adding columns, so I think the initial tip should remain, though I think it can be cleaned up a little. In the second section (alter_table.sgml) I liked the idea of adding these additional examples, though I tweaked the wording a bit to (hopefully) make it a little easier to read. Modified patch attached. Robert Treat https://xzilla.net v2-0001-Doc-fix-the-rewrite-condition-when-executing-ALTE.patch Description: Binary data
Further _bt_first simplifications for parallel index scans
Attached patch goes a bit further with simplifying _bt_first's handling of seizing the parallel scan. This continues recent work from commits 4e6e375b and b5ee4e52. Aside from requiring less code, the new structure relieves _bt_first from having separate calls to _bt_start_array_keys for the serial and parallel cases. It also places emphasis on the idea that it's expected that calls to _bt_first will go through _bt_readfirstpage (not _bt_readnextpage). While it is of course possible for a parallel scan's _bt_first to call _bt_readnextpage instead, that is the exception, not the rule. -- Peter Geoghegan v1-0001-Simplify-_bt_first.patch Description: Binary data
Allow NOT VALID foreign key constraints on partitioned tables.
Hi, While working on NOT ENFORCED constraints[1], which are by default marked as NOT VALID, I encountered an error when adding a NOT ENFORCED foreign key (FK) constraint to a partitioned table [2]. Alvaro also confirmed off-list that NOT VALID FK constraints have not yet been implemented. This patch addresses that gap. When adding a new FK constraint or attaching a partitioned table, where matching FK constraints are merged, we allow the parent constraint to be NOT VALID while the child constraint remains VALID, which is harmless. However, the reverse scenario -- where the parent constraint is VALID and the child is NOT VALID -- is incorrect. To address this, when merging a NOT VALID FK constraint from the child with a VALID parent constraint, it implicitly validates the child constraint against its existing data and marks it as VALID. This behavior aligns with adding a new FK constraint directly to the child table, which would also validate the existing data. The 0001 patch focuses on code refactoring and does not modify or introduce new behaviors. It splits ATExecValidateConstraint() into two separate functions for handling FK and CHECK constraints. For this feature, I wanted to reuse the FK validation logic and make it recursive for partitioned tables, necessitating its separation. Although CHECK constraint validation isn't required for this work, separating it simplifies ATExecValidateConstraint() and prepares the codebase for adding support for other constraint types (e.g., NOT NULL) in the future. Additional changes in the refactoring include renaming the variable tuple to contuple within these functions, duplicating a few lines of code that update pg_constraint.convalidated, and running pgindent, which rearranged the code and comments. I hope the duplication is not a significant concern. Please review the attached patches. Any comments or suggestions would be greatly appreciated. Thank you! 1] https://postgr.es/m/caaj_b962c5acyw9kut_r_er5qs3fugbe4az-sp-vuwps-w-...@mail.gmail.com 2] https://postgr.es/m/caaj_b94+0-yfj4lopvqz_+c7ckkuya77g_5rgtjvnuyepuh...@mail.gmail.com -- Regards, Amul Sul EDB: http://www.enterprisedb.com From 7b03ff68ae24ded5547a9e268be8692b196cf509 Mon Sep 17 00:00:00 2001 From: Amul Sul Date: Thu, 2 Jan 2025 17:15:30 +0530 Subject: [PATCH v1 1/2] Refactor: Split ATExecValidateConstraint() Splitting ATExecValidateConstraint() into two separate functions to handle FOREIGN KEY and CHECK constraints respectively. The next patch requires the FOREIGN KEY validation code in a separate function so it can be called directly. Additionally, the function needs to be recursive to handle NOT VALID constraints on declarative partitions. Although moving the CHECK constraint-related code is not strictly necessary for this task, maintaining separate code for FOREIGN KEY constraints makes it logical to do the same for CHECK constraints. This separation will also simplify adding support for other constraints (e.g., NOT NULL) in ATExecValidateConstraint() in the future. 
--- src/backend/commands/tablecmds.c | 282 +++ 1 file changed, 171 insertions(+), 111 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 33ea619224b..afe71fb7e8e 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -394,6 +394,11 @@ static ObjectAddress ATExecAlterConstraint(Relation rel, AlterTableCmd *cmd, static bool ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel, Relation rel, HeapTuple contuple, List **otherrelids, LOCKMODE lockmode); +static void QueueFKConstraintValidation(List **wqueue, Relation conrel, Relation rel, + HeapTuple contuple, LOCKMODE lockmode); +static void QueueCheckConstraintValidation(List **wqueue, Relation conrel, Relation rel, + char *constrName, HeapTuple contuple, + bool recurse, bool recursing, LOCKMODE lockmode); static ObjectAddress ATExecValidateConstraint(List **wqueue, Relation rel, char *constrName, bool recurse, bool recursing, LOCKMODE lockmode); @@ -11951,6 +11956,169 @@ ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel, return changed; } +/* + * QueueFKConstraintValidation + * + * Add an entry to the wqueue to validate the given foreign key constraint in + * Phase 3 and update the convalidated field in the pg_constraint catalog + * for the specified relation. + */ +static void +QueueFKConstraintValidation(List **wqueue, Relation conrel, Relation rel, + HeapTuple contuple, LOCKMODE lockmode) +{ + Form_pg_constraint con; + AlteredTableInfo *tab; + HeapTuple copyTuple; + Form_pg_constraint copy_con; + + con = (Form_pg_constraint) GETSTRUCT(contuple); + Assert(con->contype == CONSTRAINT_FOREIGN); + + if (rel->rd_rel->relkind == RELKIND_RELATION) + { + NewConstraint *newcon; + Constraint *fkconstraint; + + /* Queue validatio
Re: Allow NOT VALID foreign key constraints on partitioned tables.
On Thu, Jan 2, 2025, at 5:49 PM, Amul Sul wrote: > When adding a new FK constraint or attaching a partitioned table, where > matching FK constraints are merged, we allow the parent constraint to be NOT > VALID while the child constraint remains VALID, which is harmless. However, > the > reverse scenario -- where the parent constraint is VALID and the child is NOT > VALID -- is incorrect. To address this, when merging a NOT VALID FK constraint > from the child with a VALID parent constraint, it implicitly validates the > child constraint against its existing data and marks it as VALID. This > behavior > aligns with adding a new FK constraint directly to the child table, which > would > also validate the existing data. Hmm, I'm not sure about this, which may cause surprising delays. Maybe it would be better that the operation fails with an error, so that the user can do VALIDATE CONSTRAINT explicitly and retry the ATTACH once all the partitions have been so processed.
Re: magical eref alias names
On Sun, Dec 22, 2024 at 12:45 PM Tom Lane wrote: > > Hmm, OK, that's useful. But I guess I'm still puzzled about the theory > > here. A name like *VALUES* doesn't seem like it was created with the > > idea of referring to it from some other part of your query. I do take > > your point that it works and somebody's probably relying on it, but I > > don't think you'd pick a name that requires quoting if you were trying > > to make it easy to use that name in the query. > > As you say, some of this is lost in the mists of time; but I think the > idea was specifically that these names should *not* be easy to type, > because we don't want them to conflict with any relation alias that > the user is likely to pick. I'm fairly sure that the SQL spec > actually has verbiage to the effect of "the implementation shall > select a name that does not conflict with any other name in the query". OK, but picking names that the user probably won't use is neither necessary nor sufficient to guarantee uniqueness. > > You might possibly also > > try to generate names that are easy for users to guess, and distinct. > > Yeah, per spec they should be distinct; but somebody didn't bother > with that early on, and now we've ended up in a situation where > ruleutils.c does it instead. I'm not sure that that's terribly evil. > In particular, in a situation where we're trying to show a plan for > a query with inlined views, EXPLAIN would probably have to have code > to unique-ify the names anyway --- there's no way we're going to make > these nonce names globally unique, so the view(s) might contain names > that conflict with each other or the outer query. When you say "there's no way we're going to make these nonce names globally unique," is that because you think it would be too costly from a performance perspective (which was my concern) or that you think it's flat-out impossible for some reason (which doesn't seem to me to be true)? > > Yeah, I don't really want to be the one to break somebody's query that > > explicitly references "*VALUES*" or whatever. At least not without a > > better reason than I currently have. If this were just a display > > artifact I think finding some way to clean it up would be pretty > > worthwhile, but I would need a better reason to break working SQL. > > So it seems like we're coming to the conclusion that we don't want > to change things here? > > The reason I want to push for a conclusion is that the discussion > about "*VALUES*" over at [1] is on hold pending some decision here. > The v3 patch in that thread is set up in such a way that it improves > EXPLAIN output without breaking any existing SQL, and that's what > I'd use if we refrain from changing things here. But it's a tad > inconsistent, so if we did decide it was okay to break some edge > cases, I'd reconsider. I think that if we can solve multiple problems all at once by breaking some edge cases, that's worth considering. There probably aren't that many people who have queries that reference the "*VALUES*" alias, so if we wanted to change that to "values" I don't see that as completely out of the question. Somebody will probably be inconvenienced but if it solves other problems maybe it's worth it. But I don't think that change by itself really helps anything here. > Perhaps a similar idea could be applied to the other cases where > we invent aliases, but it's less obvious how. For instance in > > SELECT ... UNION SELECT ... > > there's no obvious place where we could get names for the > two sub-SELECTs. 
If we had global uniqueness, or even per-subquery-level uniqueness, this would sort itself out somewhat nicely, I think. We would just end up with select_1 and select_2 or union_1 and union_2 or something like that, and I think that would be strictly better than the status quo where we do sometimes generate *SELECT* %d, but also sometimes just *SELECT* and other times unnamed_subquery, and also only ever *VALUES* and not *VALUES* %d. -- Robert Haas EDB: http://www.enterprisedb.com
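To make the naming idea concrete, here is a toy uniquifier -- not PostgreSQL code, just an illustration of per-base-name counters producing aliases like "values_1" and "select_2" without needing names that are hard to type:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct alias_counter
{
    char        base[32];
    int         count;
};

/*
 * Return a malloc'd alias of the form "<base>_<n>", where n counts uses of
 * that base name within one query.  The caller supplies an array with enough
 * capacity; *ncounters tracks how many base names have been seen so far.
 */
static char *
unique_alias(struct alias_counter *counters, int *ncounters, const char *base)
{
    struct alias_counter *c = NULL;
    char       *result = malloc(64);

    for (int i = 0; i < *ncounters; i++)
        if (strcmp(counters[i].base, base) == 0)
            c = &counters[i];

    if (c == NULL)
    {
        c = &counters[(*ncounters)++];
        snprintf(c->base, sizeof(c->base), "%s", base);
        c->count = 0;
    }

    snprintf(result, 64, "%s_%d", base, ++c->count);
    return result;
}

int
main(void)
{
    struct alias_counter counters[8];
    int         n = 0;

    /* e.g. two VALUES lists and two sub-SELECTs in one query */
    printf("%s\n", unique_alias(counters, &n, "values"));
    printf("%s\n", unique_alias(counters, &n, "values"));
    printf("%s\n", unique_alias(counters, &n, "select"));
    printf("%s\n", unique_alias(counters, &n, "select"));
    /* prints values_1, values_2, select_1, select_2 */
    return 0;
}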