Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Feb 14, 2025 at 12:11 PM Melanie Plageman wrote:
> I've done some clean-up including incorporating a few off-list pieces
> of minor feedback from Andres.

I've been poking, reading, and trying out these patches. They look good to me.

Tiny nit, maybe this comment could say something less obvious, cf the similar comment near the other stream:

+	/* Set up the read stream */
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,

I don't really love the cumbersome casting required around per_buffer_data, but that's not your patches' fault (hmm, wonder what we can do to improve that).
Re: Parallel heap vacuum
On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada wrote:
> During eager vacuum scan, we reset the eager_scan_remaining_fails
> counter when we start to scan the new region. So if we want to make
> parallel heap vacuum behaves exactly the same way as the
> single-progress vacuum in terms of the eager vacuum scan, we would
> need to have the eager_scan_remaining_fails counters for each region
> so that the workers can decrement it corresponding to the region of
> the block that the worker is scanning. But I'm concerned that it makes
> the logic very complex. I'd like to avoid making newly introduced
> codes more complex by adding yet another new code on top of that.

Would it be simpler to make only phase III parallel? In other words, how much of the infrastructure and complexity needed for parallel phase I is also needed for phase III?

--
John Naylor
Amazon Web Services
Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding
Hi Ajin,

Some review comments for the patch v13-0002.

==
src/backend/replication/logical/reorderbuffer.c

1. GENERAL

I felt that a summary/overview of how all this filter/cache logic works should be given in the large file header comment at the top of this file. There may be some overlap with comments that are already in the "RelFileLocator filtering" section.

~~~

ReorderBufferRelFileLocatorEnt:

2.
+/* entry for hash table we use to track if the relation can be filtered */
+typedef struct ReorderBufferRelFileLocatorEnt

/* Hash table entry used to determine if the relation can be filtered. */

~~~

ReorderBufferQueueChange:

3.
+	/*
+	 * If we're not filtering and we've crossed the change threshold,
+	 * attempt to filter again
+	 */

SUGGESTION
If filtering was suspended, and we've crossed the change threshold then reenable filtering.

~~~

ReorderBufferGetRelation:

4.
+static Relation
+ReorderBufferGetRelation(ReorderBuffer *rb, RelFileLocator *rlocator,
+						 bool has_tuple)

Would a better name be ReorderBufferGetRelationForDecoding(). Yeah, it's a bit long but perhaps it explains the context/purpose better.

~~~

5.
+	if (filterable)
+	{
+		RelationClose(relation);
+		return NULL;
+	}

I wonder if some small descriptive comment would be helpful here just to say we are returning NULL to indicate that this relation is not needed and yada yada...

~~~

RelFileLocatorCacheInvalidateCallback:

6.
+	/* slightly inefficient algorithm but not performance critical path */
+	while ((entry = (ReorderBufferRelFileLocatorEnt *) hash_seq_search(&status)) != NULL)
+	{
+		/*
+		 * If relid is InvalidOid, signaling a complete reset, we must remove
+		 * all entries, otherwise just remove the specific relation's entry.
+		 * Always remove negative cache entries.
+		 */
+		if (relid == InvalidOid ||			/* complete reset */
+			entry->relid == InvalidOid ||	/* negative cache entry */
+			entry->relid == relid)			/* individual flushed relation */

6a. Maybe uppercase that 1st comment.

~

6b. It seems a bit unusual to be referring to "negative cache entries". I thought it should be described in terms of InvalidOid since that is what it is using in the condition.

~

6c. If the relid parameter can take special values like "If relid is InvalidOid, signaling a complete reset" that sounds like the kind of thing that should be documented in the function comment.

~~~

ReorderBufferFilterByRelFileLocator

7. Despite the extra indenting required, I wondered if the logic may be easier to read (e.g. it shows the association of the rb->can_filter_change and entry->filterable more clearly) if this is refactored slightly by sharing a single common return like below:

BEFORE
...
+	key.reltablespace = rlocator->spcOid;
+	key.relfilenumber = rlocator->relNumber;
+	entry = hash_search(RelFileLocatorFilterCache, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		rb->can_filter_change = entry->filterable;
+		return entry->filterable;
+	}
...
+	rb->can_filter_change = entry->filterable;
...
+	return entry->filterable;
+}

AFTER
...
+	key.reltablespace = rlocator->spcOid;
+	key.relfilenumber = rlocator->relNumber;
+	entry = hash_search(RelFileLocatorFilterCache, &key, HASH_ENTER, &found);
+
+	if (!found)
+	{
...
+	}
+
+	rb->can_filter_change = entry->filterable;
+	return entry->filterable;
+}

==
src/include/replication/reorderbuffer.h

8.
+	/* should we try to filter the change? */
+	bool		can_filter_change;
+

I think most of my difficulty reading this patch was due to this field name 'can_filter_change'. 'can_filter_change' somehow sounded to me like it is past tense, e.g. like as if we already found some change and yes, we CAN filter it. But AFAICT the real meaning is simply that (when the flag is true) we are ALLOWED to check to see if there is anything filterable. In fact, the change may or may not be filterable. Can this be renamed to something more "correct"? e.g.
- 'allow_change_filtering'
- 'enable_change_filtering'
- etc.

~~

9.
+	/* number of changes after a failed attempt at filtering */
+	int8		processed_changes;

Maybe 'unfiltered_changes_count' is a better name for this field?

~~~

10.
+extern bool ReorderBufferCanFilterChanges(ReorderBuffer *rb);

Should match the 'can_filter_change' field name, so if you change that (see comment #8), then change this too.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)
On Tue, Feb 11, 2025 at 02:05:18PM +, Bertrand Drouvot wrote:
>> 2. I have a small suggestion for pg_stat_statements: would it make sense to
>> move wal_buffers_full next to wal_records, wal_fpi and wal_bytes? This way,
>> all WAL-related information would be grouped together.
>
> I think I prefer to add it in "append" order. That way, that does not break
> queries that rely on ordinal numbers. Not sure.

FWIW, it makes sense to me to group them by "family" in this case, as they would belong to the same instrument structure.
--
Michael
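For illustration of the ordering question: wal_records, wal_fpi and wal_bytes already exist in pg_stat_statements, while wal_buffers_full is the column this patch would add, so a hypothetical monitoring query under the "grouped by family" layout might read:

    SELECT queryid,
           wal_records,
           wal_fpi,
           wal_bytes,
           wal_buffers_full     -- proposed new column, not in released versions
    FROM pg_stat_statements
    ORDER BY wal_bytes DESC
    LIMIT 10;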
Re: DOCS - inactive_since field readability
On Fri, Feb 14, 2025 at 4:04 AM Peter Smith wrote:
>
> On Tue, Feb 11, 2025 at 10:10 PM Amit Kapila wrote:
> > ...
> > The change in 0001 looks odd after seeing it in HTML format. We should
> > either add one empty line between two paragraphs otherwise it doesn't
> > appear good. Did you see multi-paragraphs in any other column
> > definitions?
> >
>
> Patch 0001
>
> The pg_replication_slots system view is unusual in that there can be
> entirely different descriptions of the same field depending on the
> context, such as:
> a- for logical slots
> b- for physical slots
> c- for primary servers versus standby servers
>
> IIUC your 0001 feedback says that a blank line might be ok, but just
> doing it for 'active_since' and nothing else makes it look odd.
>

No, I meant to say that the description didn't look any better to me even after your 0001 patch. The second paragraph started immediately on the next line, which doesn't make it look any better. If we really want to make it look better then one more additional line is required. However, I don't want to go in that direction unless we have some history of writing the docs similarly. I suggest we go with your 0002 patch as that makes the description concise and clear.

--
With Regards,
Amit Kapila.
Re: Change GUC hashtable to use simplehash?
Hi Anton, could you please test if the attached passes for you? This seems the simplest way.

--
John Naylor
Amazon Web Services

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index e07c0226c1..bb09f87abe 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -290,13 +290,7 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 		str += FH_SIZEOF_ACCUM;
 	}
 
-	/*
-	 * The byte corresponding to the NUL will be 0x80, so the rightmost bit
-	 * position will be in the range 7, 15, ..., 63. Turn this into byte
-	 * position by dividing by 8.
-	 */
-	remainder = pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
+	remainder = fasthash_accum_cstring_unaligned(hs, str);
 	str += remainder;
 
 	return str - start;
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Thu, Feb 13, 2025 at 5:36 PM Shlok Kyal wrote: > > On Thu, 13 Feb 2025 at 15:20, Shubham Khanna > wrote: > > > > On Tue, Feb 11, 2025 at 9:56 PM Shlok Kyal wrote: > > > > > > On Tue, 11 Feb 2025 at 09:51, Shubham Khanna > > > wrote: > > > > > > > > On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu) > > > > wrote: > > > > > > > > > > Dear Shubham, > > > > > > > > > > Thanks for updating the patch. > > > > > > > > > > Previously you told that you had a plan to extend the patch to drop > > > > > other replication > > > > > objects [1], but I think it is not needed. pg_createsubscriber has > > > > > already been > > > > > able to drop the existing subscrisubscriptions in > > > > > check_and_drop_existing_subscriptions(). > > > > > As for the replication slot, I have told in [2], it would be created > > > > > intentionally > > > > > thus I feel it should not be dropped. > > > > > Thus I regard the patch does not have concrete extending plan. > > > > > > > > > > Below part contains my review comment. > > > > > > > > > > 01. Option name > > > > > > > > > > Based on the above discussion, "--cleanup-publisher-objects" is not > > > > > suitable because > > > > > it won't drop replication slots. How about "--cleanup-publications"? > > > > > > > > > > > > > I have changed the name of the option to > > > > "--cleanup-existing-publications" > > > > > > > > > 02. usage() > > > > > ``` > > > > > + printf(_(" -C --cleanup-publisher-objects drop all > > > > > publications on the logical replica\n")); > > > > > ``` > > > > > > > > Fixed. > > > > > > > > > s/logical replica/subscriber > > > > > > > > > > 03. drop_all_publications > > > > > ``` > > > > > +/* Drops all existing logical replication publications from all > > > > > subscriber > > > > > + * databases > > > > > + */ > > > > > +static void > > > > > ``` > > > > > > > > > > Initial line of the comment must be blank [3]. > > > > > > > > > > > > > Removed this function. > > > > > > > > > 04. main > > > > > ``` > > > > > + {"cleanup-publisher-objects", no_argument, NULL, 'C'}, > > > > > ``` > > > > > > > > > > Is there a reason why upper case is used? I feel lower one is enough. > > > > > > > > > > > > > Fixed. > > > > > > > > > 05. main > > > > > ``` > > > > > + /* Drop publications from the subscriber if requested */ > > > > > + if (opt.cleanup_publisher_objects) > > > > > + drop_all_publications(dbinfo); > > > > > ``` > > > > > > > > > > After considering more, I noticed that we have already called > > > > > drop_publication() > > > > > in the setup_subscriber(). Can we call drop_all_publications() there > > > > > instead when > > > > > -C is specified? > > > > > > > > > > > > > I agree with you on this. I have removed the drop_all_publication() > > > > and added the "--cleanup-existing-publications" option to the > > > > drop_publication() > > > > > > > > > 06. 
040_pg_createsubscriber.pl > > > > > > > > > > ``` > > > > > +$node_s->start; > > > > > +# Create publications to test it's removal > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > > TABLES;"); > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > > TABLES;"); > > > > > + > > > > > +# Verify the existing publications > > > > > +my $pub_count_before = > > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > > +is($pub_count_before, '2', > > > > > + 'two publications created before --cleanup-publisher-objects > > > > > is run'); > > > > > + > > > > > +$node_s->stop; > > > > > ``` > > > > > > > > > > I feel it requires unnecessary startup and shutdown. IIUC, creating > > > > > publications and check > > > > > counts can be before stopping the node_s, around line 331. > > > > > > > > > > > > > Fixed. > > > > > > > > > 07. 040_pg_createsubscriber.pl > > > > > > > > > > ``` > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > > TABLES;"); > > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > > TABLES;"); > > > > > + > > > > > +# Verify the existing publications > > > > > +my $pub_count_before = > > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > > +is($pub_count_before, '2', > > > > > + 'two publications created before --cleanup-publisher-objects > > > > > is run'); > > > > > + > > > > > ``` > > > > > > > > > > Also, there is a possibility that CREATE PUBLICATION on node_p is not > > > > > replicated yet > > > > > when SELECT COUNT(*) is executed. Please wait for the replay. > > > > > > > > > > [1]: > > > > > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com > > > > > [2]: > > > > > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com > > > > > [3]: https://www.postgresql.org/docs/devel/source-format.html > > > > > > > > > >
Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)
On Tue, Feb 11, 2025 at 09:37:37AM +, Bertrand Drouvot wrote:
> While at it, adding 0004 to report wal_buffers_full in VACUUM/ANALYZE
> (VERBOSE).

Thanks for summarizing the history behind WalUsage and wal_buffers_full.

FWIW, 0001 that moves wal_buffers_full from PgStat_PendingWalStats to WalUsage is going to make our lives much easier for your pending patch that adds backend statistics for WAL. WAL write/sync numbers/times will be covered in the backend stats by pg_stat_io, allowing us to remove entirely the dependency on PgStat_PendingWalStats. I have been wondering for a bit if the comment at the top of WalUsage should be updated, but the current description fits as well with the follow-up patch series. Anyway, if there are no objections, I am planning to apply this one to lift this barrier and help with the follow-up work for the backend stats. That's the mandatory piece to make the backend WAL stats integration easier.

0002, 0003 and 0004 are straight-forward follow-ups. It's IMO one of these things where extra data is cheap to have access to, and can be useful to be aware of when distributed across multiple contexts like queries, plans or even EXPLAIN. So no real objections here. If others have comments, feel free.

show_wal_usage() should have its check on (usage->wal_buffers_full) in explain.c, as Ilia has mentioned. It's true that you would not get a (wal_buffers_full > 0) if at least the condition on wal_bytes is not satisfied, but the addition makes sense on consistency grounds, at least.

Agreed about the attribute ordering in pgss with everything associated to WalUsage grouped together, btw.
--
Michael
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Thu, Feb 13, 2025 at 09:53:52AM -0800, Jacob Champion wrote: > I guess I'm going to zero in on your definition of "may know nothing > about". If that is true, something is very wrong IMO. > > My understanding of the backend code was that port->peer is only set > after OpenSSL has verified that the certificate has been issued by an > explicitly trusted CA. (Our verify_cb() doesn't override the internal > checks to allow failed certificates through.) That may not be enough > to authorize entry into the server, but it also shouldn't be untrusted > data. If a CA is issuing Subject data that is somehow dangerous to the > operation of the server, I think that's a security problem in and of > itself: there are clientcert HBA modes that don't validate the > Subject, but they're still going to push that data into the catalogs, > aren't they? Is that the case before we finish authentication now? I was not sure how much of this data is set before and after finishing authentication, tracking that this was part of the init phase of the connection, something we do earlier than the actual authentication. It feels like this should be documented more clearly in the patch if pgstat_bestart_security() is allowed to be called multiple times (I think we should not allow that). That's quite unclear now in the patch; on HEAD everything is set after authentication completes. > So if we're concerned that Subject data is dangerous at this point in > the code, I agree that my patch makes it even more dangerous and I'll > modify it -- but then I'm going to split off another thread to try to > fix that underlying issue. A user should not have to be authorized to > access the server in order for signed authentication data from the > transport layer to be considered trustworthy. Being able to monitor > those separately is important for auditability. Making this information visible in the catalogs for already logged in users increases the potential of problems, and this is a sensitive area of the code, so.. >> As a whole we still have a gap between what could be OK, what could >> not be OK, and the fact that pgstat_bestart_security() is called twice >> makes that confusing. > > My end goal is that all of this _should_ always be OK, so calling it > once or twice or thirty times should be safe. (But again, if that's > not actually true today, I'll remove it for now.) The concept of making pgstat_bestart_security() callable multiple times relates also to the issue pointed upthread by Andres, somewhat: allowing this pattern may lead to errors in the future regarding this that should or should not be set this early in the authentication process. Sounds just saner to me to do pgstat_bestart_security() once for now once we're OK with authentication, and it does not prevent to address your initial point of being able to know if backends with a given PID are stuck at a certain point. At least that's a step towards more debuggability as the backend entries are available earlier than they are now. Getting ready for tomatoes now, please aim freely. > I agree with Robert's goal in general. I just think that following > that rule to *this* extent is counterproductive. But I won't die on > that hill; in the end, I just want to be able to see when LDAP calls > hang. Understood. -- Michael signature.asc Description: PGP signature
Re: DROP CONSTRAINT, printing a notice and/or skipping when no action is taken
Oof, the subject line was meant to be DROP DEFAULT, not constraint

On Thu, Feb 13, 2025 at 11:13 AM Andrew Atkinson wrote:
> Hello. I noticed a small opportunity for a possible enhancement to DROP
> DEFAULT, and wanted to share the idea. Apologies if this idea was suggested
> before, I tried a basic search of pgsql-hackers for similar things but didn't
> find a hit.
>
> I noticed when running an ALTER TABLE with DROP DEFAULT, whether the
> column default exists or not, ALTER TABLE is always printed as the result.
> This is arguably slightly confusing, because it's unclear if anything was
> done. In the scenario where there is no column default, there isn't a
> message saying "skipped" or something equivalent, indicating that there was
> no default that was dropped. Some of the commands in Postgres do have this
> kind of feedback, so it seems like an opportunity for greater consistency.
>
> For example: if I create a column default, or repeatedly run the following
> ALTER TABLE statements for the "id_new" column, I always get ALTER TABLE
> back.
>
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT;
> ALTER TABLE
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT;
> ALTER TABLE
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT;
> ALTER TABLE
>
> An opportunity would be to add a NOTICE type of message when ALTER TABLE
> ALTER COLUMN DROP DEFAULT is issued, at least when no column default
> exists, and no action was taken. In that scenario, the operation could
> possibly be skipped altogether, which might have some additional benefits.
>
> As a refresher on a "Notice" type of message, here's an example when
> adding an index and using the "if not exists" clause (an equivalent "if not
> exists" clause does not exist for DROP DEFAULT to my knowledge):
>
> -- an index called "foo" already exists
> psql> create index if not exists foo on organizations (id);
> NOTICE: relation "foo" already exists, skipping
> CREATE INDEX
>
> The message being "NOTICE: relation "foo" already exists, skipping"
>
> A similar message for DROP DEFAULT might look like:
>
> "NOTICE: default does not exist, skipping"
>
> Or an alternative that includes the column name might look like:
>
> "NOTICE: default does not exist for column id_new, skipping"
>
> Or another alternative might be a new (non-standard?) "if exists" clause
> for DROP DEFAULT. Example:
>
> ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT IF EXISTS;
>
> -- Or an alternative placement of the "if exists" clause, because I don't
> really know where it would go:
> ALTER TABLE my_table ALTER COLUMN id_new DROP IF EXISTS DEFAULT;
>
> Thanks!
> - Andrew
Re: Restrict publishing of partitioned table with a foreign table as partition
On Thu, 13 Feb 2025 at 20:12, vignesh C wrote: > > On Thu, 13 Feb 2025 at 15:50, Shlok Kyal wrote: > > > > > > I have fixed the issue. Attached the updated v6 patch. > > There is another concurrency issue: > In case of create publication for all tables with > publish_via_partition_root we will call check_foreign_tables: > @@ -876,6 +876,10 @@ CreatePublication(ParseState *pstate, > CreatePublicationStmt *stmt) > /* Associate objects with the publication. */ > if (stmt->for_all_tables) > { > + /* Check if any foreign table is a part of partitioned table > */ > + if (publish_via_partition_root) > + check_foreign_tables(stmt->pubname); > > At the time of check in check_foreign_tables, there are no foreign > tables so this check will be successful: > +check_foreign_tables_in_schema(Oid schemaid, char *pubname) > +{ > + RelationclassRel; > + ScanKeyData key[2]; > + TableScanDesc scan; > + HeapTuple tuple; > + > + classRel = table_open(RelationRelationId, AccessShareLock); > + > + ScanKeyInit(&key[0], > + Anum_pg_class_relnamespace, > + BTEqualStrategyNumber, F_OIDEQ, > + schemaid); > + ScanKeyInit(&key[1], > + Anum_pg_class_relkind, > + BTEqualStrategyNumber, F_CHAREQ, > + CharGetDatum(RELKIND_PARTITIONED_TABLE)); > + > + scan = table_beginscan_catalog(classRel, 2, key); > + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) > > Now immediately after execution of this, create a foreign table: > postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES > FROM (10) TO (15) SERVER fdw; > CREATE FOREIGN TABLE > > And then continue execution of create publication, it will also be successful: > postgres=# create publication pub1 for all tables with ( > publish_via_partition_root =true); > CREATE PUBLICATION > > One probable way to fix this is to do the search similar to > check_foreign_tables_in_schema where we can skip including schemaid > key for all tables. > > How about something like the attached patch. > Hi Vignesh, I have used the changes suggested by you. Also I have updated the comments and the function name. Thanks and Regards, Shlok Kyal v7-0001-Restrict-publishing-of-partitioned-table-with-for.patch Description: Binary data
Re: AIO v2.3
On Tue, Feb 11, 2025 at 12:10 AM Andres Freund wrote: >> TLDR; in terms of SELECTs the master vs aioworkers looks very solid! > Phew! Weee! Yay. Another good news: I've completed a full 24h pgbench run on the same machine and it did not fail or report anything suspicious. FYI, patchset didn't not apply anymore (seems patches 1..6 are already applied on master due to checkpoint shutdown sequence), but there was a failed hunk in patch #12 yesterday too: [..] patching file src/backend/postmaster/postmaster.c Hunk #10 succeeded at 2960 (offset 14 lines). Hunk #11 FAILED at 3047. [..] 1 out of 15 hunks FAILED -- saving rejects to file src/backend/postmaster/postmaster.c.rej anyway, so on master @ a5579a90af05814eb5dc2fd5f68ce803899d2504 (~ Jan 24) to have clean apply I've used the below asserted build: meson setup build --reconfigure --prefix=/usr/pgsql18.aio --debug -Dsegsize_blocks=13 -Dcassert=true /usr/pgsql18.aio/bin/pgbench -i -s 500 --partitions=100 # ~8GB /usr/pgsql18.aio/bin/pgbench -R 1500 -c 100 -j 4 -P 1 -T 86400 with some add-on functionalities: effective_io_concurrency = '4' shared_buffers = '2GB' max_connections = '1000' archive_command = 'cp %p /dev/null' archive_mode = 'on' summarize_wal = 'on' wal_summary_keep_time = '1h' wal_compression = 'on' wal_log_hints = 'on' max_wal_size = '1GB' shared_preload_libraries = 'pg_stat_statements' huge_pages = 'off' wal_receiver_status_interval = '1s' so the above got perfect run: [..] duration: 86400 s number of transactions actually processed: 129615534 number of failed transactions: 0 (0.000%) latency average = 5.332 ms latency stddev = 24.107 ms rate limit schedule lag: avg 0.748 (max 1992.517) ms initial connection time = 124.472 ms tps = 1500.179231 (without initial connection time) > > I was kind of afraid that additional IPC to separate processes would put > > workers at a disadvantage a little bit , but that's amazingly not true. > > It's a measurable disadvantage, it's just more than counteracted by being able > to do IO asynchronously :). > > It's possible to make it more visible, by setting io_combine_limit = 1. If you > have a small shared buffers with everything in the kernel cache, the dispatch > overhead starts to be noticeable above several GB/s. But that's ok, I think. Sure it is. > > 2. my very limited in terms of time data analysis thoughts > > - most of the time perf with aioworkers is identical (+/- 3%) as of > > the master, in most cases it is much BETTER > > I assume s/most/some/ for the second most? Right, pardon for my excited moment ;) > > - on parallel seqscans "sata" with datasets bigger than VFS-cache > > ("big") and high e_io_c with high client counts(sigh!), it looks like > > it would user noticeable big regression but to me it's not regression > > itself, probably we are issuing way too many posix_fadvise() > > readaheads with diminishing returns. Just letting you know. Not sure > > it is worth introducing some global (shared aioworkers e_io_c > > limiter), I think not. I think it could also be some maintenance noise > > on that I/O device, but I have no isolated SATA RAID10 with like 8x > > HDDs in home to launch such a test to be absolutely sure. > > I think this is basically a configuration issue - configuring a high e_io_c > for a device that can't handle that and then load it up with a lot of clients, > well, that'll not work out great. Sure, btw i'm going to also an idea about autotuning that e_io_c in that related thread where everybody is complaining about it > > 3. 
with aioworkers in documentation it would worth pointing out that > > `iotop` won't be good enough to show which PID is doing I/O anymore . > > I've often get question like this: who is taking the most of I/O right > > now because storage is fully saturated on multi-use system. Not sure > > it would require new view or not (pg_aios output seems to be not more > > like in-memory debug view that would be have to be sampled > > aggressively, and pg_statio_all_tables shows well table, but not PID > > -- same for pg_stat_io). IMHO if docs would be simple like > > "In order to understand which processes (PIDs) are issuing lots of > > IOs, please check pg_stat_activty for *IO/AioCompletion* waits events" > > it should be good enough for a start. > > pg_stat_get_backend_io() should allow to answer that, albeit with the usual > weakness of our stats system, namely that the user has to diff two snapshots > themselves. It probably also has the weakness of not showing results for > queries before they've finished, although I think that's something we should > be able to improve without too much trouble (not in this release though, I > suspect). > > I guess we could easily reference pg_stat_get_backend_io(), but a more > complete recipe isn't entirely trivial... I was trying to come out with something that could be added to docs, but the below thing is too ugly and as you stated the primary weakness is that query needs to finish before it is reflected:
Re: Improve verification of recovery_target_timeline GUC.
On Fri, Jan 24, 2025 at 01:36:45PM +, David Steele wrote:
> I attached the wrong patch. Oops!

Thanks for the new patch.

> +			timeline = strtoull(*newval, &endp, 0);
> +
> +			if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
> 			{
> 				GUC_check_errdetail("\"recovery_target_timeline\" is not a valid number.");
> 				return false;
> 			}

Why not using strtou64() here? That's more portable depending on SIZEOF_LONG (aka the 4 bytes on WIN32 vs 8 bytes for the rest of the world).

> +
> +			if (timeline < 2 || timeline > UINT_MAX)
> +			{

A recovery_target_timeline of 1 is a valid value, isn't it?

> +				GUC_check_errdetail(
> +					"\"recovery_target_timeline\" must be between 2 and 4,294,967,295.");
> +				return false;
> +			}

I would suggest to rewrite that as \"%s\" must be between %u and %u, which would be more generic for translations in case there is an overlap with something else.

> +$logfile = slurp_file($node_standby->logfile());
> +ok($logfile =~ qr/must be between 2 and 4,294,967,295/,
> +	'target timelime out of max range');

These sequences of tests could be improved:
- Let's use start locations for the logs slurped, reducing the scope of the logs scanned.
- Perhaps it would be better to rely on wait_for_log() to make sure that the expected log contents are generated without any kind of race condition?
--
Michael
Re: Confine vacuum skip logic to lazy_scan_skip
On Thu, Feb 13, 2025 at 8:30 PM Masahiko Sawada wrote: > > On Thu, Feb 13, 2025 at 4:55 PM Melanie Plageman > wrote: > > > > We don't want to hang onto that pin for a > > long time. But I can't move them to the bottom of the loop after we > > release the buffer because some of the code paths don't make it that > > far. I don't see a good way other than how I did it or special-casing > > block 0. What do you think? > > How about adding 'vacrel->scanned_pages > 0' to the if statement? > Which seems not odd to me. Cool. I've done this in attached v19. > Looking at the 0002 patch, it seems you reverted the change to the > following comment: > > /* >* Vacuum the Free Space Map to make newly-freed space visible on > - * upper-level FSM pages. Note we have not yet processed blkno. > + * upper-level FSM pages. Note we have not yet processed blkno+1. >*/ > > I feel that the previous change I saw in v17 is clearer: I've reverted to the old comment. Thanks > The rest looks good to me. Cool! I'll plan to push this tomorrow barring any objections. - Melanie From 25140d2966d2f87a4a410fc92147f2d88e587920 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 13 Feb 2025 16:49:10 -0500 Subject: [PATCH v19 1/3] Convert heap_vac_scan_next_block() boolean parameters to flags The read stream API only allows one piece of extra per block state to be passed back to the API user. lazy_scan_heap() needs to know whether or not a given block was all-visible in the visibility map and whether or not it was eagerly scanned. Convert these two pieces of information to flags so that they can be passed as a single argument to heap_vac_scan_next_block() (which will become the read stream API callback for heap phase I vacuuming). Reviewed-by: Masahiko Sawada Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CAAKRu_bmx33jTqATP5GKNFYwAg02a9dDtk4U_ciEjgBHZSVkOQ%40mail.gmail.com --- src/backend/access/heap/vacuumlazy.c | 47 1 file changed, 27 insertions(+), 20 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 3df5b92afb8..c4d0f77ee2f 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -248,6 +248,13 @@ typedef enum */ #define EAGER_SCAN_REGION_SIZE 4096 +/* + * heap_vac_scan_next_block() sets these flags to communicate information + * about the block it read to the caller. 
+ */ +#define VAC_BLK_WAS_EAGER_SCANNED (1 << 0) +#define VAC_BLK_ALL_VISIBLE_ACCORDING_TO_VM (1 << 1) + typedef struct LVRelState { /* Target heap relation and its indexes */ @@ -417,8 +424,7 @@ static void lazy_scan_heap(LVRelState *vacrel); static void heap_vacuum_eager_scan_setup(LVRelState *vacrel, VacuumParams *params); static bool heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno, - bool *all_visible_according_to_vm, - bool *was_eager_scanned); + uint8 *blk_info); static void find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis); static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, Page page, @@ -1171,8 +1177,7 @@ lazy_scan_heap(LVRelState *vacrel) BlockNumber rel_pages = vacrel->rel_pages, blkno, next_fsm_block_to_vacuum = 0; - bool all_visible_according_to_vm, -was_eager_scanned = false; + uint8 blk_info = 0; BlockNumber orig_eager_scan_success_limit = vacrel->eager_scan_remaining_successes; /* for logging */ Buffer vmbuffer = InvalidBuffer; @@ -1196,8 +1201,7 @@ lazy_scan_heap(LVRelState *vacrel) vacrel->next_unskippable_eager_scanned = false; vacrel->next_unskippable_vmbuffer = InvalidBuffer; - while (heap_vac_scan_next_block(vacrel, &blkno, &all_visible_according_to_vm, - &was_eager_scanned)) + while (heap_vac_scan_next_block(vacrel, &blkno, &blk_info)) { Buffer buf; Page page; @@ -1206,7 +1210,7 @@ lazy_scan_heap(LVRelState *vacrel) bool got_cleanup_lock = false; vacrel->scanned_pages++; - if (was_eager_scanned) + if (blk_info & VAC_BLK_WAS_EAGER_SCANNED) vacrel->eager_scanned_pages++; /* Report as block scanned, update error traceback information */ @@ -1331,7 +1335,8 @@ lazy_scan_heap(LVRelState *vacrel) */ if (got_cleanup_lock) lazy_scan_prune(vacrel, buf, blkno, page, - vmbuffer, all_visible_according_to_vm, + vmbuffer, + blk_info & VAC_BLK_ALL_VISIBLE_ACCORDING_TO_VM, &has_lpdead_items, &vm_page_frozen); /* @@ -1348,7 +1353,8 @@ lazy_scan_heap(LVRelState *vacrel) * exclude pages skipped due to cleanup lock contention from eager * freeze algorithm caps. */ - if (got_cleanup_lock && was_eager_scanned) + if (got_cleanup_lock && + (blk_info & VAC_BLK_WAS_EAGER_SCANNED)) { /* Aggressive vacuums do not eager scan. */ Assert(!vacrel->aggressive); @@ -1479,11 +1485,11 @@ lazy_scan_heap(LVRelState *vacrel) * and various thresholds to skip
Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
Thank you for your answer TOM. I'd like to have this option only for initial loading of huge amounts of data, where atomicity consistency is needed at all. Maybe an option on startup command just for initial import mode, would be nice. I had huge problems on server 3 weeks after a 6 TB migration from other DB. I think it's sad to rewrite all data twice. Le jeu. 13 févr. 2025 à 16:08, Tom Lane a écrit : > =?UTF-8?Q?S=C3=A9bastien?= writes: > > Implementation details: > > >- A new INSERT FROZEN option could be introduced, similar to COPY > FREEZE, > >allowing direct insertion of tuples in a frozen state. > >- This would likely require changes in heap storage logic to ensure > >tuples are written with a frozen XID at insert time. > >- Consideration should be given to transaction semantics and WAL > logging > >to ensure consistency and crash recovery integrity. > > That last is exactly why this won't happen. A frozen tuple would be > considered committed and visible the instant it appears in the table, > thus completely breaking both atomicity and integrity of the > transaction. > > There has been work going on recently to reduce the impact of freezing > massive amounts of data by spreading the work more effectively [1]. > I don't say that that particular commit has completely solved the > problem, but I think that continued effort in that direction is more > likely to yield usable results than what you're suggesting. > > BTW, this might or might not be usable in your particular workflow, > but: there have long been some optimizations for data load into a > table created in the same transaction. The idea there is that if the > transaction rolls back, the table will never have been visible to any > other transaction at all, so that maintaining atomicity/integrity of > its contents is moot. > > regards, tom lane > > [1] > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=052026c9b > -- Sébastien Caunes +33 6 7 229 229 7
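For reference, the existing way to get frozen rows at load time is COPY ... FREEZE, which is only accepted when the table was created or truncated in the same transaction, i.e. exactly the case Tom describes where atomicity is not at risk. A minimal sketch of that workflow (table and file names are hypothetical):

    BEGIN;
    -- The table must be created (or truncated) in this same transaction,
    -- otherwise COPY ... FREEZE is rejected.
    CREATE TABLE bulk_import (id bigint, payload text);
    COPY bulk_import FROM '/path/to/data.csv' WITH (FORMAT csv, FREEZE true);
    COMMIT;
    -- If the transaction aborts instead, the table was never visible to
    -- any other transaction, so writing the rows pre-frozen cannot break
    -- atomicity.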
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Hello Aleksander! Hello Álvaro! Thank you for your attention to the patch. Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to me that it merely wastes space in SlruCtlData. On top of that I'm not 100% sure if all the supported platforms have C99 compilers with designated initializers support. Wouldn't it be simpler to pass the callback straight to BootStrapSlruPage()? I agree with your proposals. Moreover, similar solution is also implemented and exploited at Tantor's versions of the xid64 patch (https://www.postgresql.org/message-id/5ca56aed-dc69-4c3e-968f-a822ae0937b5%40gmail.com). The corrected patch version implementing these proposals is attached. Best regards, Evgeny Voropaev, Tantor Labs LLC. From a5a29cdbb37834bfcd23cd22db90a78a7db43d18 Mon Sep 17 00:00:00 2001 From: Evgeny Voropaev Date: Fri, 14 Feb 2025 14:03:01 +0800 Subject: [PATCH v2] Elimination of the repetitive code at the SLRU bootstrap functions. The functions bootstrapping SLRU pages used to have a lot of repetitive code. The new realized function BootStrapSlruPage has moved duplicating code into the single place and eliminated code repetitions. Author: Evgeny Voropaev --- src/backend/access/transam/clog.c | 27 + src/backend/access/transam/commit_ts.c | 24 +--- src/backend/access/transam/multixact.c | 54 ++ src/backend/access/transam/slru.c | 35 + src/backend/access/transam/subtrans.c | 28 ++--- src/include/access/slru.h | 1 + 6 files changed, 55 insertions(+), 114 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 0d556c00b8c..5ae684da6af 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -832,19 +832,7 @@ check_transaction_buffers(int *newval, void **extra, GucSource source) void BootStrapCLOG(void) { - int slotno; - LWLock *lock = SimpleLruGetBankLock(XactCtl, 0); - - LWLockAcquire(lock, LW_EXCLUSIVE); - - /* Create and zero the first page of the commit log */ - slotno = ZeroCLOGPage(0, false); - - /* Make sure it's written out */ - SimpleLruWritePage(XactCtl, slotno); - Assert(!XactCtl->shared->page_dirty[slotno]); - - LWLockRelease(lock); + BootStrapSlruPage(XactCtl, 0, ZeroCLOGPage); } /* @@ -1114,19 +1102,8 @@ clog_redo(XLogReaderState *record) if (info == CLOG_ZEROPAGE) { int64 pageno; - int slotno; - LWLock *lock; - memcpy(&pageno, XLogRecGetData(record), sizeof(pageno)); - - lock = SimpleLruGetBankLock(XactCtl, pageno); - LWLockAcquire(lock, LW_EXCLUSIVE); - - slotno = ZeroCLOGPage(pageno, false); - SimpleLruWritePage(XactCtl, slotno); - Assert(!XactCtl->shared->page_dirty[slotno]); - - LWLockRelease(lock); + BootStrapSlruPage(XactCtl, pageno, ZeroCLOGPage); } else if (info == CLOG_TRUNCATE) { diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 95049acd0b5..54600f1b69c 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -747,16 +747,7 @@ ActivateCommitTs(void) /* Create the current segment file, if necessary */ if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno)) - { - LWLock *lock = SimpleLruGetBankLock(CommitTsCtl, pageno); - int slotno; - - LWLockAcquire(lock, LW_EXCLUSIVE); - slotno = ZeroCommitTsPage(pageno, false); - SimpleLruWritePage(CommitTsCtl, slotno); - Assert(!CommitTsCtl->shared->page_dirty[slotno]); - LWLockRelease(lock); - } + BootStrapSlruPage(CommitTsCtl, pageno, ZeroCommitTsPage); /* Change the activation status in shared memory. 
*/ LWLockAcquire(CommitTsLock, LW_EXCLUSIVE); @@ -1023,19 +1014,8 @@ commit_ts_redo(XLogReaderState *record) if (info == COMMIT_TS_ZEROPAGE) { int64 pageno; - int slotno; - LWLock *lock; - memcpy(&pageno, XLogRecGetData(record), sizeof(pageno)); - - lock = SimpleLruGetBankLock(CommitTsCtl, pageno); - LWLockAcquire(lock, LW_EXCLUSIVE); - - slotno = ZeroCommitTsPage(pageno, false); - SimpleLruWritePage(CommitTsCtl, slotno); - Assert(!CommitTsCtl->shared->page_dirty[slotno]); - - LWLockRelease(lock); + BootStrapSlruPage(CommitTsCtl, pageno, ZeroCommitTsPage); } else if (info == COMMIT_TS_TRUNCATE) { diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 27ccdf9500f..30cc47021f7 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2033,32 +2033,8 @@ check_multixact_member_buffers(int *newval, void **extra, GucSource source) void BootStrapMultiXact(void) { - int slotno; - LWLock *lock; - - lock = SimpleLruGetBankLock(MultiXactOffsetCtl, 0); - LWLockAcquire(lock, LW_EXCLUSIVE); - - /* Create and zero the first page of the offsets log */ - slotno = ZeroMultiXactOffsetPage(0, false); - - /* Make sure it's written out */ - SimpleLruWritePage(MultiXactOffsetCtl, slotno); - As
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Thu, 13 Feb 2025 at 15:20, Shubham Khanna wrote: > > On Tue, Feb 11, 2025 at 9:56 PM Shlok Kyal wrote: > > > > On Tue, 11 Feb 2025 at 09:51, Shubham Khanna > > wrote: > > > > > > On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu) > > > wrote: > > > > > > > > Dear Shubham, > > > > > > > > Thanks for updating the patch. > > > > > > > > Previously you told that you had a plan to extend the patch to drop > > > > other replication > > > > objects [1], but I think it is not needed. pg_createsubscriber has > > > > already been > > > > able to drop the existing subscrisubscriptions in > > > > check_and_drop_existing_subscriptions(). > > > > As for the replication slot, I have told in [2], it would be created > > > > intentionally > > > > thus I feel it should not be dropped. > > > > Thus I regard the patch does not have concrete extending plan. > > > > > > > > Below part contains my review comment. > > > > > > > > 01. Option name > > > > > > > > Based on the above discussion, "--cleanup-publisher-objects" is not > > > > suitable because > > > > it won't drop replication slots. How about "--cleanup-publications"? > > > > > > > > > > I have changed the name of the option to > > > "--cleanup-existing-publications" > > > > > > > 02. usage() > > > > ``` > > > > + printf(_(" -C --cleanup-publisher-objects drop all > > > > publications on the logical replica\n")); > > > > ``` > > > > > > Fixed. > > > > > > > s/logical replica/subscriber > > > > > > > > 03. drop_all_publications > > > > ``` > > > > +/* Drops all existing logical replication publications from all > > > > subscriber > > > > + * databases > > > > + */ > > > > +static void > > > > ``` > > > > > > > > Initial line of the comment must be blank [3]. > > > > > > > > > > Removed this function. > > > > > > > 04. main > > > > ``` > > > > + {"cleanup-publisher-objects", no_argument, NULL, 'C'}, > > > > ``` > > > > > > > > Is there a reason why upper case is used? I feel lower one is enough. > > > > > > > > > > Fixed. > > > > > > > 05. main > > > > ``` > > > > + /* Drop publications from the subscriber if requested */ > > > > + if (opt.cleanup_publisher_objects) > > > > + drop_all_publications(dbinfo); > > > > ``` > > > > > > > > After considering more, I noticed that we have already called > > > > drop_publication() > > > > in the setup_subscriber(). Can we call drop_all_publications() there > > > > instead when > > > > -C is specified? > > > > > > > > > > I agree with you on this. I have removed the drop_all_publication() > > > and added the "--cleanup-existing-publications" option to the > > > drop_publication() > > > > > > > 06. 040_pg_createsubscriber.pl > > > > > > > > ``` > > > > +$node_s->start; > > > > +# Create publications to test it's removal > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > TABLES;"); > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > TABLES;"); > > > > + > > > > +# Verify the existing publications > > > > +my $pub_count_before = > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > +is($pub_count_before, '2', > > > > + 'two publications created before --cleanup-publisher-objects is > > > > run'); > > > > + > > > > +$node_s->stop; > > > > ``` > > > > > > > > I feel it requires unnecessary startup and shutdown. IIUC, creating > > > > publications and check > > > > counts can be before stopping the node_s, around line 331. > > > > > > > > > > Fixed. > > > > > > > 07. 
040_pg_createsubscriber.pl > > > > > > > > ``` > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL > > > > TABLES;"); > > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL > > > > TABLES;"); > > > > + > > > > +# Verify the existing publications > > > > +my $pub_count_before = > > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > > +is($pub_count_before, '2', > > > > + 'two publications created before --cleanup-publisher-objects is > > > > run'); > > > > + > > > > ``` > > > > > > > > Also, there is a possibility that CREATE PUBLICATION on node_p is not > > > > replicated yet > > > > when SELECT COUNT(*) is executed. Please wait for the replay. > > > > > > > > [1]: > > > > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com > > > > [2]: > > > > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com > > > > [3]: https://www.postgresql.org/docs/devel/source-format.html > > > > > > > > > > Fixed. > > > > > > The attached Patch contains the suggested changes. > > > > > > > Hi Shubham, > > > > I have some comments for v4 patch: > > 1. I think we should update the comment for the function > > 'drop_publication'. As its usage is changed with this patch > > Currently it states: > > /* > > * Remove publication if it couldn't f
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Hi Evgeny,

> The functions, bootstrapping SLRU pages, such as BootStrapMultiXact,
> BootStrapCLOG, ActivateCommitTs, multixact_redo and others, have a lot
> of repetitive code.
>
> A new proposed function BootStrapSlruPage moves a duplicating code into
> the single place. Additionally, a new member ZeroFunc is implemented in
> the SlruCtlData structure. The ZeroFunc keeps a pointer to a function
> proper for nullifying SLRU pages of a type defined by a corresponding
> SlruCtlData object.
>
> Applying proposed modifications can simplify maintenance and future
> development of Postgres code in the realm of bootstrapping SLRU.

Thanks for the patch.

```
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -102,14 +102,6 @@ TransactionIdToPage(TransactionId xid)
  */
 #define THRESHOLD_SUBTRANS_CLOG_OPT	5
 
-/*
- * Link to shared-memory data structures for CLOG control
- */
-static SlruCtlData XactCtlData;
-
-#define XactCtl (&XactCtlData)
-
-
 static int	ZeroCLOGPage(int64 pageno, bool writeXlog);
 static bool CLOGPagePrecedes(int64 page1, int64 page2);
 static void WriteZeroPageXlogRec(int64 pageno);
@@ -130,6 +122,14 @@ static void TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
 											   XLogRecPtr lsn, int64 pageno);
 
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData XactCtlData = { .ZeroPage = ZeroCLOGPage };
+
+#define XactCtl (&XactCtlData)
+
+
```

Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to me that it merely wastes space in SlruCtlData. On top of that I'm not 100% sure if all the supported platforms have C99 compilers with designated initializers support. Wouldn't it be simpler to pass the callback straight to BootStrapSlruPage()? After changing the patch accordingly it shouldn't move XactCtl and others. This is just an irrelevant change.

--
Best regards,
Aleksander Alekseev
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/13/25 01:40, Melanie Plageman wrote: > On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra wrote: >> >> For the nvme RAID (device: raid-nvme), it's looks almost exactly the >> same, except that with parallel query (page 27) there's a clear area of >> regression with eic=1 (look for "column" of red cells). That's a bit >> unfortunate, because eic=1 is the default value. > > So, I feel pretty confident after even more analysis today with Thomas > Munro that all of the parallel cases with effective_io_concurrency == > 1 are because master cheats effective_io_concurrency. Therefore, I'm > not too worried about those for now. We probably will have to increase > effective_io_concurrency before merging this patch, though. > > Thomas mentioned this to me off-list, and I think he's right. We > likely need to rethink the way parallel bitmap heap scan workers get > block assignments for reading and prefetching to make it more similar > to parallel sequential scan. The workers should probably get > assignments of a range of blocks. On master, each worker does end up > issuing reads/fadvises for a bunch of blocks in a row -- even though > that isn't the intent of the parallel bitmap table scan > implementation. We are losing some of that with the patch -- but only > because it is behaving as you would expect given the implementation > and design. I don't consider that a blocker, though. > Agreed. If this is due to master doing something wrong (and either prefetching more, or failing to prefetch), that's not the fault of this patch. People might perceive this as a regression, but increasing the default eic value would fix that. BTW I recall we ran into issues with prefetching and parallel workers about a year ago [1]. Is this the same issue, or something new? [1] https://www.postgresql.org/message-id/20240315211449.en2jcmdqxv5o6tlz%40alap3.anarazel.de >> For the SATA SSD RAID (device: raid-sata), it's similar - a couple >> regressions for eic=0, just like for NVMe. But then there's also a >> couple regressions for higher eic values, usually for queries with >> selective conditions. > > I started trying to reproduce the regressions you saw on your SATA > devices for higher eic values. I was focusing just on serial bitmap > heap scan -- since that doesn't have the problems mentioned above. I > don't have a SATA drive, but I used dmsetup to add a 10ms delay to my > nvme drive. Unfortunately, that did the opposite of reproduce the > issue. With dm delay 10ms, the patch is 5 times faster than master. > I'm not quite sure at which point does dm-setup add the delay. Does it affect the fadvise call too, or just the I/O if the page is not already in page cache? regards -- Tomas Vondra
Re: NOT ENFORCED constraint feature
On 2025-Feb-13, Ashutosh Bapat wrote:

> > So considering that, I think a three-state system makes more sense.
> > Something like:
> >
> > 1) NOT ENFORCED -- no data is checked
> > 2) NOT VALID -- existing data is unchecked, new data is checked
> > 3) ENFORCED -- all data is checked
> >
> > Transitions:
> >
> > (1) - [ ALTER TABLE ... ALTER CONSTRAINT ... NOT VALID ] -> (2)
>
> Per your notation, this means the the constraint is not enforced but
> new data is checked - that seems a contradiction, how would we check
> the data when the constraint is not being enforced. Or do you suggest
> that we convert a NOT ENFORCED constraint to ENFORCED as a result of
> converting it to NOT VALID?

I agree this one is a little weird. For this I would have the command be

  ALTER TABLE ... ALTER CONSTRAINT ... ENFORCED NOT VALID

this way it's explicit that what we want is to flip the ENFORCED bit while leaving NOT VALID as-is.

> > (2) - [ ALTER TABLE ... VALIDATE CONSTRAINT ... ] -> (3)
>
> As a result of this a not enforced constraint would turn into an
> enforced constraint. The user might have intended to just validate the
> data but not enforce it to avoid paying price for the checks on new
> data.

I'm not sure there's a use case for validating existing data without starting to enforce the constraint. The data can become invalid immediately after you've run the command, so why bother?

> I think, what you intend to say is clearer with 4 state system {NE, E}
> * {NV, V} = {(NE, NV), (NE, V), (E, NV), (E, V)} where (NE, V) is
> unreachable. Let's name them S1, S2, S3, S4 respectively.
[...]
> Notice that there are no edges to and from S2.

So why list it as a possible state?

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Las mujeres son como hondas: mientras más resistencia tienen,
más lejos puedes llegar con ellas" (Jonas Nightingale, Leap of Faith)
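For readers following along, the transitions being debated map onto commands like the following. This is illustrative only: the table, column and constraint names are made up, and the ENFORCED / NOT ENFORCED spelling for ALTER CONSTRAINT is still a proposal on this thread, not released behavior.

    -- Declare a constraint that checks nothing at all (state 1).
    ALTER TABLE orders ADD CONSTRAINT amount_positive
        CHECK (amount > 0) NOT ENFORCED;

    -- Álvaro's suggested spelling for (1) -> (2): start checking new rows,
    -- leave the existing data unvalidated.
    ALTER TABLE orders ALTER CONSTRAINT amount_positive ENFORCED NOT VALID;

    -- (2) -> (3): scan the existing rows and mark the constraint valid.
    ALTER TABLE orders VALIDATE CONSTRAINT amount_positive;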
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/13/25 05:15, Thomas Munro wrote: > On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman > wrote: >> On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra wrote: >>> For the nvme RAID (device: raid-nvme), it's looks almost exactly the >>> same, except that with parallel query (page 27) there's a clear area of >>> regression with eic=1 (look for "column" of red cells). That's a bit >>> unfortunate, because eic=1 is the default value. >> >> So, I feel pretty confident after even more analysis today with Thomas >> Munro that all of the parallel cases with effective_io_concurrency == >> 1 are because master cheats effective_io_concurrency. Therefore, I'm >> not too worried about those for now. We probably will have to increase >> effective_io_concurrency before merging this patch, though. > > Yeah. Not only did we see it issuing up to 20 or so unauthorised > overlapping POSIX_FADV_WILLNEED calls, but in the case we studied they > didn't really do anything at all because it was *postfetching* (!), ie > issuing advice for blocks it had already read. The I/O was sequential > (reading all or most blocks in order), so the misdirected advice at > least kept out of the kernel's way, and it did some really effective > readahead. Meanwhile the stream version played by the rules and > respected eic=1, staying only one block ahead. We know that advice for > sequential blocks blocks readahead, so that seems to explain the red > for that parameter/test permutation. (Hypothesis based on some > quick reading: I guess kernel pages faulted in that way don't get > marked with the internal "readahead was here" flag, which would > normally cause the following pread() to trigger more asynchronous > readahead with ever larger physical size). But that code > in master is also *trying* to do exactly what we're doing, it's just > failing because of some combination of bugs in the parallel scan code > (?). It would be absurd to call it the winner: we're setting out > with the same goal, but for that particular set of parameters and test > setup, apparently two (?) bugs make a right. Melanie mentioned that > some other losing cases also involved unauthorised amounts of > overlapping advice, except there it was random and beneficial and the > iterators didn't get out of sync with each other, so again it beat us > in a few red patches, seemingly with a different accidental behaviour. > Meanwhile we obediently trundled along with just 1 concurrent I/O or > whatever, and lost. > Makes sense, likely explains the differences. And I agree we should not hold this against this patch series, it's master that's doing something wrong. And in most of these cases we'd not even do bitmap scan anyway, I think, it only happens because we forced the plan. BTW how are you investigating those I/O requests? strace? iosnoop? perf? Something else? > On top of that, read_stream.c is also *trying* to avoid issuing advice > for access patterns that look sequential, and *trying* to do I/O > combining, which all works OK in serial BHS, but it breaks down in > parallel because: > >> Thomas mentioned this to me off-list, and I think he's right. We >> likely need to rethink the way parallel bitmap heap scan workers get >> block assignments for reading and prefetching to make it more similar >> to parallel sequential scan. The workers should probably get >> assignments of a range of blocks. 
On master, each worker does end up >> issuing reads/fadvises for a bunch of blocks in a row -- even though >> that isn't the intent of the parallel bitmap table scan >> implementation. We are losing some of that with the patch -- but only >> because it is behaving as you would expect given the implementation >> and design. I don't consider that a blocker, though. > > +/* > + * The maximum number of tuples per page is not large (typically 256 with > + * 8K pages, or 1024 with 32K pages). So there's not much point in > making > + * the per-page bitmaps variable size. We just legislate that the size > is > + * this: > + */ > +OffsetNumber offsets[MaxHeapTuplesPerPage]; > } TBMIterateResult; > > Seems to be 291? So sizeof(TBMIterateResult) must be somewhere > around 588 bytes? There's one of those for every buffer, and we'll > support queues of up to effective_io_concurrency (1-1000) * > io_combine_limit (1-128?), which would require ~75MB of > per-buffer-data with maximum settings That's a lot of memory to have > to touch as you whizz around the circular queue even when you don't > really need it. With more typical numbers like 10 * 32 it's ~90kB, > which I guess is OK. I wonder if it would be worth trying to put a > small object in there instead that could be expanded to the results > later, cf f6bef362 for TIDStore, but I'm not sure if it's a blocker, > just an observation. Not sure I follow. Doesn't f6bef362 do quite the opposite, i.e. removing the repalloc() calls and replacing that with a local on-stack buffer sized for MaxOffsetNum
Re: pg_stat_statements and "IN" conditions
On 2025-Feb-13, Dmitry Dolgov wrote: > Here is how it looks like (posting only the first patch, since we > concentrate on it). This version handles just a little more to cover > simpe cases like the implicit convertion above. The GUC is also moved > out from pgss and renamed to query_id_merge_values. On top I've added > more tests showing the impact, as well as sometimes awkward looking > normalized query I was talking about. I'm going to experiment how to > iron out the latter. Thanks! It's looking better. Some small comments -- please add the new GUC to postgresql.conf.sample. Also, how wed are you to "query_id_merge_values" as a name? It's not in any way obvious that this is about values in arrays. How about query_id_squash_arrays? Or are you thinking in having values in other types of structures squashed as well, and that this first patch does it for arrays only but you want the GUC to also control some future feature? (I think I prefer "squash" here as a verb to "merge"). > +static bool > +IsMergeableConst(Node *element) > +{ > + if (IsA(element, RelabelType)) > + element = (Node *) ((RelabelType *) element)->arg; > + > + if (IsA(element, CoerceViaIO)) > + element = (Node *) ((CoerceViaIO *) element)->arg; > + > + if(IsA(element, FuncExpr)) > + { > + FuncExpr *func = (FuncExpr *) element; > + char provolatile = func_volatile(func->funcid); I think calling func_volatile potentially once per array element is not good; this might cause dozens/thousands of identical syscache lookups. Maybe we can pass an initially NIL list from IsMergeableConstList (as List **), which IsMergeableConst fills with OIDs of functions that have been checked and found acceptable. Then the second time around we search the list first and only do func_volatile() after not finding a match. Another thing I didn't quite understand is why you did this rather baroque-looking list scan: > +static bool > +IsMergeableConstList(List *elements, Node **firstExpr, Node **lastExpr) > +{ > + ListCell *temp; > + Node *firstElem = NULL; > + > + if (elements == NIL) > + return false; > + > + if (!query_id_merge_values) > + { > + /* Merging is disabled, process everything one by one */ > + return false; > + } > + > + firstElem = linitial(elements); > + > + /* > + * If the first expression is a constant, verify if the following > elements > + * are constants as well. If yes, the list is eligible for merging. > + */ > + if (IsMergeableConst(firstElem)) > + { > + foreach(temp, elements) > + { > + Node *element = lfirst(temp); > + > + if (!IsMergeableConst(element)) > + return false; > + } > + > + *firstExpr = firstElem; > + *lastExpr = llast(elements); > + return true; > + } Why not just scan the list in the straightforward way, that is foreach(temp, elements) { if (!IsMergeableConst(lfirst(temp))) return false; } *firstExpr = linitial(elements); *lastExpr = llast(elements); return true; Is there something being optimized here specifically for the first element? I don't see it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
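A rough SQL illustration of the intended effect of the patch under review, using the GUC name as currently posted (the table is made up, and the exact normalized query text is still being ironed out per the thread):

SET query_id_merge_values = on;
-- These two differ only in the number of constants in the IN list; with the
-- patch they are meant to collapse into a single pg_stat_statements entry
-- rather than one entry per list length.
SELECT * FROM orders WHERE id IN (1, 2, 3);
SELECT * FROM orders WHERE id IN (4, 5, 6, 7, 8, 9, 10);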
Re: Using Expanded Objects other than Arrays from plpgsql
On Tue, 11 Feb 2025 at 21:50, Tom Lane wrote: > > Andrey Borodin writes: > > On 7 Feb 2025, at 02:05, Tom Lane wrote: > >> Do you have any further comments on this patch? > > > No, all steps of the patch set look good to me. > > Pushed then. Thanks for reviewing! Thanks to Michel and everyone working on the patch! Regards, Pavel Borisov
Re: Adding NetBSD and OpenBSD to Postgres CI
Hi, On Wed, 12 Feb 2025 at 17:49, Andres Freund wrote: > I finally pushed this. The meson fix backpatched to 16. > > I did some very minor polishing, reordering the OS lists to stay alphabetical, > instead of adding netbsd/openbsd somewhere to the front of lists. Thanks! > Obviously not your fault, but I do think it's pretty crazy that with the same > available resources, netbsd and openbsd take considerably longer than linux > and freebsd, which both do a lot more (linux tests 32 and 64 bit with ubsan > with autoconf and asan with meson, freebsd tests a bunch of debugging > options). I wonder what is going wrong. I suspect we might be waiting for > the filesystem a lot, according to cirrus-ci's CPU usage graph we're not CPU > bound during the test phase. Or maybe they just scale badly. Yes, I could not find the reason why either. > Any chance you're interested in rebasing and expanding > https://postgr.es/m/20240413021221.hg53rvqlvldqh57i%40awork3.anarazel.de > > It'd be nice if we could enable these tasks on cfbot, where we bring our own > compute, while leaving them on manual for everyone else. Sure, I will work on this. Thank you for mentioning it. -- Regards, Nazir Bilal Yavuz Microsoft
Re: Virtual generated columns
On Tue, Feb 11, 2025 at 10:34 AM Richard Guo wrote: > > On Mon, Feb 10, 2025 at 1:16 PM Zhang Mingli wrote: > > I believe virtual columns should behave like stored columns, except they > > don't actually use storage. > > Virtual columns are computed when the table is read, and they should adhere > > to the same rules of join semantics. > > I agree with Richard, the result seems incorrect. The right outcome should > > be: > > gpadmin=# SELECT t2.a, t2.b FROM t t1 LEFT JOIN t t2 ON FALSE; > > a | b > > --+-- > > NULL | NULL > > NULL | NULL > > (2 rows) > > Yeah, I also feel that the virtual generated columns should adhere to > outer join semantics, rather than being unconditionally replaced by > the generation expressions. But maybe I'm wrong. > > If that's the case, this incorrect-result issue isn't limited to > constant expressions; it could also occur with non-strict ones. > > CREATE TABLE t (a int, b int GENERATED ALWAYS AS (COALESCE(a, 100))); > INSERT INTO t VALUES (1); > INSERT INTO t VALUES (2); > > # SELECT t2.a, t2.b FROM t t1 LEFT JOIN t t2 ON FALSE; > a | b > ---+- >| 100 >| 100 > (2 rows) > Now I agree with you. I think the following two should return the same result. SELECT t2.a, t2.b FROM t t1 LEFT JOIN t t2 ON FALSE; SELECT t2.a, t2.b FROM t t1 LEFT JOIN (select * from t) t2 ON FALSE; atatch refined patch solves the failure to copy the nullingrel bits for the virtual generated columns. in ReplaceVarsFromTargetList_callback. I tried to just use add_nulling_relids, but failed, so I did the similar thing as SetVarReturningType. I didn't solve the out join semantic issue. i am wondering, can we do the virtual generated column expansion in the rewrite stage as is, and wrap the expressions in PHVs if the virtual generated columns come from the nullable side of an outer join. I am looking at pullup_replace_vars_callback, but it seems not very helpful to us. From 2f9bd9ec737227b1b2dc52a4800d006bfa228003 Mon Sep 17 00:00:00 2001 From: jian he Date: Thu, 13 Feb 2025 11:28:57 +0800 Subject: [PATCH v2 1/1] fix expand virtual generated column Var node varnullingrels field To expand the virtual generated column Var node, we need to copy the fields of the original virtual generated column Var node, including varnullingrels, varlevelsup, varreturningtype, and varattno, to the newly expanded Var node. ReplaceVarsFromTargetList_callback didn't taken care of varnullingrels, this patch fix this issue. discussion: https://postgr.es/m/75eb1a6f-d59f-42e6-8a78-124ee808c...@gmail.com --- src/backend/rewrite/rewriteManip.c| 62 +++ src/include/rewrite/rewriteManip.h| 3 + .../regress/expected/generated_virtual.out| 45 ++ src/test/regress/sql/generated_virtual.sql| 11 src/tools/pgindent/typedefs.list | 1 + 5 files changed, 122 insertions(+) diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c index a115b217c91..69701ce6ecd 100644 --- a/src/backend/rewrite/rewriteManip.c +++ b/src/backend/rewrite/rewriteManip.c @@ -884,6 +884,64 @@ IncrementVarSublevelsUp_rtable(List *rtable, int delta_sublevels_up, QTW_EXAMINE_RTES_BEFORE); } +/* + * SetVarNullingrels - adjust Var nodes for a specified varnullingrels. + * + * Find all Var nodes referring to the specified varno in the given + * expression and set their varnullingrels to the specified value. 
+ */ +typedef struct +{ + int sublevels_up; + int varno; + Bitmapset *varnullingrels; +} SetVarNullingrels_context; + +static bool +SetVarNullingrels_walker(Node *node, + SetVarNullingrels_context *context) +{ + if (node == NULL) + return false; + + if (IsA(node, Var)) + { + Var *var = (Var *) node; + + if (var->varlevelsup == context->sublevels_up && + var->varno == context->varno) + var->varnullingrels = bms_union(context->varnullingrels, + var->varnullingrels); + + return false; + } + + if (IsA(node, Query)) + { + /* Recurse into subselects */ + bool result; + + context->sublevels_up++; + result = query_tree_walker((Query *) node, SetVarNullingrels_walker, + context, 0); + context->sublevels_up--; + return result; + } + return expression_tree_walker(node, SetVarNullingrels_walker, context); +} +void +SetVarNullingrels(Node *node, int sublevels_up, int varno, Bitmapset *varnullingrels) +{ + SetVarNullingrels_context context; + + context.sublevels_up = sublevels_up; + context.varno = varno; + context.varnullingrels = varnullingrels; + + /* Expect to start with an expression */ + SetVarNullingrels_walker(node, &context); +} + /* * SetVarReturningType - adjust Var nodes for a specified varreturningtype. * @@ -1832,6 +1890,10 @@ ReplaceVarsFromTargetList_callback(Var *var, if (var->varlevelsup > 0) IncrementVarSublevelsUp((Node *) newnode, var->varlevelsup, 0); + if (var->varnullingrels != NULL) + SetVarNullingrels((Node *) newno
Re: Get rid of WALBufMappingLock
13.02.2025 12:34, Alexander Korotkov пишет: > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov wrote: >> 08.02.2025 13:07, Alexander Korotkov пишет: >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov >>> wrote: Good, thank you. I think 0001 patch is generally good, but needs some further polishing, e.g. more comments explaining how does it work. >> >> I tried to add more comments. I'm not good at, so recommendations are >> welcome. >> >>> Two things comes to my mind worth rechecking about 0001. >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be >>> sensitive to that. If so, I would propose to explicitly comment that >>> and add corresponding asserts. >> >> They're certainly page aligned, since they are page borders. >> I added assert on alignment of InitializeReserved for the sanity. >> >>> 2) Check if there are concurrency issues between >>> AdvanceXLInsertBuffer() and switching to the new WAL file. >> >> There are no issues: >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses >> GetXLogBuffer to zero-out WAL page. >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, >> so switching wal is not concurrent. (Although, there is no need in this >> exclusiveness, imho.) > > Good, thank you. I've also revised commit message and comments. > > But I see another issue with this patch. In the worst case, we do > XLogWrite() by ourselves, and it could potentially could error out. > Without patch, that would cause WALBufMappingLock be released and > XLogCtl->InitializedUpTo not advanced. With the patch, that would > cause other processes infinitely waiting till we finish the > initialization. > > Possible solution would be to save position of the page to be > initialized, and set it back to XLogCtl->InitializeReserved on error > (everywhere we do LWLockReleaseAll()). We also must check that on > error we only set XLogCtl->InitializeReserved to the past, because > there could be multiple concurrent failures. Also we need to > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. The single place where AdvanceXLInsertBuffer is called outside of critical section is in XLogBackgroundFlush. All other call stacks will issue server restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. XLogBackgroundFlush explicitely avoids writing buffers by passing opportunistic=true parameter. Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer since server will shutdown/restart. Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call to XLogWrite here? --- regards Yura Sokolov aka funny-falcon
Re: Restrict publishing of partitioned table with a foreign table as partition
On Thu, 13 Feb 2025 at 11:25, vignesh C wrote: > > On Tue, 11 Feb 2025 at 16:55, Shlok Kyal wrote: > > I have handled the above cases and added tests for the same. > > There is a concurrency issue with the patch: > +check_partrel_has_foreign_table(Form_pg_class relform) > +{ > + boolhas_foreign_tbl = false; > + > + if (relform->relkind == RELKIND_PARTITIONED_TABLE) > + { > + List *relids = NIL; > + > + relids = find_all_inheritors(relform->oid, NoLock, NULL); > + > + foreach_oid(relid, relids) > + { > + Relationrel = table_open(relid, > AccessShareLock); > + > + if (RelationGetForm(rel)->relkind == > RELKIND_FOREIGN_TABLE) > + has_foreign_tbl = true; > + > + table_close(rel, AccessShareLock); > + > + if (has_foreign_tbl) > + break; > + } > + } > + > + return has_foreign_tbl; > +} > > In an ideal scenario, the creation of a foreign table should fail if > there is an associated publication, as demonstrated below: > CREATE TABLE t(id int) PARTITION BY RANGE(id); > CREATE TABLE part1 PARTITION OF t FOR VALUES FROM (0) TO (5); > CREATE TABLE part2 PARTITION OF t FOR VALUES FROM (5) TO (15) > PARTITION BY RANGE(id); > CREATE PUBLICATION pub1 FOR TABLE t with (publish_via_partition_root = true); > > postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES > FROM (10) TO (15) SERVER fdw; > ERROR: cannot create table foreign partition "part22" > DETAIL: partition table "part2" is published with option > publish_via_partition_root > > Consider a scenario where the publication is being created and after > the check_partrel_has_foreign_table execution is done, concurrently > creation of foreign table is executed, then the creation will be > successful. > postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES > FROM (10) TO (15) SERVER fdw; > CREATE FOREIGN TABLE > > I felt the problem here is that you have released the lock: > + if (RelationGetForm(rel)->relkind == > RELKIND_FOREIGN_TABLE) > + has_foreign_tbl = true; > + > + table_close(rel, AccessShareLock); > > We should retain the lock to fix this issue: > + if (RelationGetForm(rel)->relkind == > RELKIND_FOREIGN_TABLE) > + has_foreign_tbl = true; > + > + table_close(rel, NoLock); > Hi Vignesh, I have fixed the issue. Attached the updated v6 patch. Thanks and Regards, Shlok Kyal v6-0001-Restrict-publishing-of-partitioned-table-with-for.patch Description: Binary data
Re: Restrict copying of invalidated replication slots
On Tue, 4 Feb 2025 at 15:27, Shlok Kyal wrote: > > Hi, > > Currently, we can copy an invalidated slot using the function > 'pg_copy_logical_replication_slot'. As per the suggestion in the > thread [1], we should prohibit copying of such slots. > > I have created a patch to address the issue. This patch does not fix all the copy_replication_slot scenarios completely, there is a very corner concurrency case where an invalidated slot still gets copied: + /* We should not copy invalidated replication slots */ + if (src_isinvalidated) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), +errmsg("cannot copy an invalidated replication slot"))); Consider the following scenario: step 1) Set up streaming replication between the primary and standby nodes. step 2) Create a logical replication slot (test1) on the standby node. step 3) Have a breakpoint in InvalidatePossiblyObsoleteSlot if cause is RS_INVAL_WAL_LEVEL, no need to hold other invalidation causes or add a sleep in InvalidatePossiblyObsoleteSlot function like below: if (cause == RS_INVAL_WAL_LEVEL) { while (bsleep) sleep(1); } step 4) Reduce wal_level on the primary to replica and restart the primary node. step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2'); -- It will wait till the lock held by InvalidatePossiblyObsoleteSlot is released while trying to create a slot. step 6) Increase wal_level back to logical on the primary node and restart the primary. step 7) Now allow the invalidation to happen (continue the breakpoint held at step 3), the replication control lock will be released and the invalidated slot will be copied After this: postgres=# SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2'); ?column? -- copy (1 row) -- The invalidated slot (test1) is copied successfully: postgres=# select * from pg_replication_slots ; slot_name |plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phas e | inactive_since | conflicting | invalidation_reason | failover | synced ---+---+---++--+---+++--+--+-+-++---+- --+--+-++--+ test1 | test_decoding | logical | 5 | postgres | f | f || | 745 | 0/4029060 | 0/4029098 | lost | | f | 2025-02-13 15:26:54.666725+05:30 | t | wal_level_insufficient | f| f test2 | test_decoding | logical | 5 | postgres | f | f || | 745 | 0/4029060 | 0/4029098 | reserved | | f | 2025-02-13 15:30:30.477836+05:30 | f | | f| f (2 rows) -- A subsequent attempt to decode changes from the invalidated slot (test2) fails: postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL); WARNING: detected write past chunk end in TXN 0x5e77e6c6f300 ERROR: logical decoding on standby requires "wal_level" >= "logical" on the primary -- Alternatively, the following error may occur: postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL); WARNING: detected write past chunk end in TXN 0x582d1b2d6ef0 data BEGIN 744 COMMIT 744 (2 rows) This is an edge case that can occur under specific conditions involving replication slot invalidation when there is a huge lag between primary and standby. There might be a similar concurrency case for wal_removed too. Regards, Vignesh
Re: Small memory fixes for pg_createsubcriber
On Wed, Feb 12, 2025 at 18:17, Tom Lane wrote: > Ranier Vilela writes: > > Coverity has some reports about pg_createsubcriber. > > > CID 1591322: (#1 of 1): Resource leak (RESOURCE_LEAK) > > 10. leaked_storage: Variable dbname going out of scope leaks the storage > it > > points to. > > FTR, the security team's Coverity instance also complained about that. > I was planning to fix it after the release freeze lifted, but you > beat me to it, which is fine. Our report turned up a couple other > things that I just pushed fixes for. > Yeah, I see the commits, thanks for that. I still have some reports that I could post that Coverity thinks are bugs. They are not, but I think it is worth the effort to fix them because the code is confusing. I think it would improve readability and future maintainability. > > (It seems like Coverity must've updated their rules recently, > because we also got a bunch of false-positive reports that were > not there before, mostly in pre-existing code.) > I believe they are trying to innovate at some point. Many of these false positives come from a risky coding style; I am much more cautious in my analyses. Best regards, Ranier Vilela
Re: [PATCH] Optionally record Plan IDs to track plan changes for a query
On 2025-Feb-12, Sami Imseih wrote: > Greg S. Mullane wrote: > > > I agree fingerprint is the right final word. But "jumble" conveys > > the *process* better than "fingerprinting". I view it as jumbling > > produces an object that can be fingerprinted. > > hmm, "jumble" describes something that is scrambled > or not in order, such as the 64-bit hash produced. It > sounds like the final product. I don't understand why we would change any naming here at all. I think you should be looking at a much broader consensus and plus-ones that a renaming is needed. -1 from me. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "No renuncies a nada. No te aferres a nada."
Re: Non-text mode for pg_dumpall
Thanks Jian. On Wed, 12 Feb 2025 at 12:45, jian he wrote: > > On Wed, Feb 12, 2025 at 1:17 AM Mahendra Singh Thalor > wrote: > > > > > > > > There are some tests per https://commitfest.postgresql.org/52/5495, I > > > will check it later. > > hi. > the cfbot failure is related to function _tocEntryRequired > > if (strcmp(te->desc, "DATABASE") == 0 || > strcmp(te->desc, "DATABASE PROPERTIES") == 0) > { > - if (ropt->createDB) > + if (ropt->createDB || AH->format != archNull) > return REQ_SCHEMA; > else > return 0; > > for restoring multiple databases: > in v16 implementation: pg_restore even if you do not specify --create, > it actually did what pg_restore --create option does. > > if there are multiple databases in the archive: > to make the pg_restore --file output is usable, the output file need > have \connect and CREATE DATABASE > command. that is exactly what --create option would do. > pg_restore --file behavior need align with pg_restore --dbname. > therefore pg_restore restoring multiple databases will use --create option. > > > we can either error out (pg_fatal) saying > restoring multiple databases requires the pg_restore --create option. > Or we can add a pg_log_info saying > pg_restore --create option will be set to true while restoring > multiple databases. In my earlier version, I was giving an error if --create option was not specified. I think it will be good and more preferable if we give an error without the --create option if dump was taken from pg_dumpall. Even though there is a single database in the dump of pg_dumpall, it is possible that a particular database hasn't been created. Ex: -d postgres and we have db1 dump in file. In this case, we have only one database dump but this database has not been created. If the user wants to restore a single database, then the user should use a single database dump file. Forcefully adding --create option is not a good idea, instead we will give an error to the user and let him correct the inputs. Apart from the above handling, I fixed all the pending review comments in this patch and made some more changes. Here, I am attaching an updated patch for review and testing. -- Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com v17_pg_dumpall-with-non-text_format-13th_feb.patch Description: Binary data
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
On 2025-Feb-13, Aleksander Alekseev wrote: > Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to > me that it merely wastes space in SlruCtlData. On top of that I'm not > 100% sure if all the supported platforms have C99 compilers with > designated initializers support. They do -- we use them quite extensively nowadays. > Wouldn't it be simpler to pass the callback straight to > BootStrapSlruPage()? Yeah, maybe this would be easier. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Just treat us the way you want to be treated + some extra allowance for ignorance."(Michael Brusser)
Re: SQL:2011 application time
On 23.01.25 16:40, Peter Eisentraut wrote: I think my interpretation of what RESTRICT should do is different. The clause "Execution of referential actions" in the SQL standard only talks about referenced and referencing columns, not periods. So this would mean you can change the period columns all you want (as long as they maintain referential integrity). So it would be like the NO ACTION case. But you can't change any of the non-period columns on the primary key if they are referenced by any referencing columns, even if the respective periods are disjoint. Maybe this makes sense, or maybe this is a mistake (neglected to update this part when periods were introduced?). But in any case, I can't get from this to what the patch does. When I apply the tests in the patch without the code changes, what I would intuitively like are more errors than the starting state, but your patch results in fewer errors. After staring at this a bit more, I think my interpretation above was not correct. This seems better: The clause "Execution of referential actions" in the SQL standard only talks about referenced and referencing columns, not periods. The RESTRICT error is raised when a "matching row" exists in the referencing table. The "matching row" is determined purely by looking at the "normal" columns of the key, not the period columns. So in our implementation in ri_restrict(), ISTM, we just need to ignore the period/range columns when doing the RESTRICT check. Attached is a quick patch that demonstrates how this could work. I think the semantics of this are right and make sense. From 7d2ae226da67838825120f759a912362cf91e065 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 13 Feb 2025 13:15:23 +0100 Subject: [PATCH] WIP: Fix RESTRICT behavior for temporal foreign keys --- src/backend/utils/adt/ri_triggers.c | 12 1 file changed, 12 insertions(+) diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index 8473448849c..bbd1d5712f2 100644 --- a/src/backend/utils/adt/ri_triggers.c +++ b/src/backend/utils/adt/ri_triggers.c @@ -787,6 +787,12 @@ ri_restrict(TriggerData *trigdata, bool is_no_action) Oid pk_type = RIAttType(pk_rel, riinfo->pk_attnums[i]); Oid fk_type = RIAttType(fk_rel, riinfo->fk_attnums[i]); + if (riinfo->hasperiod && !is_no_action) + { + if (i == riinfo->nkeys - 1) + continue; + } + quoteOneName(attname, RIAttName(fk_rel, riinfo->fk_attnums[i])); sprintf(paramname, "$%d", i + 1); @@ -2734,6 +2740,12 @@ ri_ReportViolation(const RI_ConstraintInfo *riinfo, Datum datum; boolisnull; + if (riinfo->hasperiod && is_restrict) + { + if (idx == riinfo->nkeys - 1) + continue; + } + name = NameStr(att->attname); datum = slot_getattr(violatorslot, fnum, &isnull); -- 2.48.1
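A sketch of the semantics under that interpretation, using the temporal key syntax from this patch series (table names and data are made up; whether RESTRICT really ignores the period columns is precisely what the WIP patch above implements):

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- for the mixed integer/range temporal key

CREATE TABLE pk (
  id int,
  valid_at daterange,
  PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
);

CREATE TABLE fk (
  id int,
  valid_at daterange,
  FOREIGN KEY (id, PERIOD valid_at) REFERENCES pk (id, PERIOD valid_at)
    ON DELETE RESTRICT
);

INSERT INTO pk VALUES (1, daterange('2020-01-01', '2021-01-01'));
INSERT INTO pk VALUES (1, daterange('2021-01-01', '2022-01-01'));
INSERT INTO fk VALUES (1, daterange('2020-03-01', '2020-06-01'));

-- With RESTRICT as described above this errors, because a referencing row with
-- id = 1 exists, even though its period is disjoint from the deleted row's.
-- NO ACTION would allow it, since integrity is still satisfied by the other pk row.
DELETE FROM pk WHERE id = 1 AND valid_at = daterange('2021-01-01', '2022-01-01');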
Re: Elimination of the repetitive code at the SLRU bootstrap functions.
Hi Alvaro, > > Since BootStrapSlruPage() is the only caller of ZeroPage() it seems to > > me that it merely wastes space in SlruCtlData. On top of that I'm not > > 100% sure if all the supported platforms have C99 compilers with > > designated initializers support. > > They do -- we use them quite extensively nowadays. Got it, thanks. I was a bit worried about Visual Studio in particular. -- Best regards, Aleksander Alekseev
Fwd: [Feature Request] Per-Database Transaction Logs for Enhanced Isolation and New Capabilities
Introduce per-database transaction logs (WAL) and transaction ID spaces to improve database isolation, enable hot-mounting/unmounting, selective replication, and open new possibilities in PostgreSQL. Business Use-case: With modern SSDs offering high throughput and low latency, maintaining a single *global* transaction log across all databases in a PostgreSQL instance is becoming an unnecessary constraint. By allowing *each database to have its own transaction log and transaction ID space*, PostgreSQL could achieve significant improvements in performance, isolation, and flexibility. *Key Benefits*: - *Better isolation between databases*: - A long-running transaction in one database would no longer prevent vacuuming of tables in another. - No risk of transaction wraparound issues in one database affecting others. - *Hot-mounting/unmounting databases*: - Ability to attach/detach databases dynamically at the filesystem level without impacting the rest of the cluster. - Faster database restores and migrations by simply copying database files and starting the instance. - *Selective replication*: - Currently, logical replication can be done at the table level, but physical replication applies to the entire cluster. - With per-database WAL, it would be possible to *replicate only specific databases* without requiring complex logical replication setups. - *More flexible backup & restore*: - Ability to back up and restore *individual databases* with transaction consistency, instead of full-cluster backups. - Faster recovery and better disaster recovery options. - *Better integration with cloud and containerized environments*: Would enable dynamically adding and removing databases in cloud environments without cluster-wide restarts. User impact with the change: - Users with large multi-database clusters would see *better transaction isolation*, fewer maintenance conflicts, and *more flexible database management*. - Organizations running *multi-tenant* environments or *per-database replication* setups would gain *easier and more efficient ways to manage databases*. - PostgreSQL would become much more *modular and cloud-friendly*, aligning it with modern high-availability and container-based deployments. Implementation details: - Requires modifying PostgreSQL's WAL and transaction system to support per-database transaction logs. - WAL archiving, replication, and recovery logic would need adjustments to support per-database operations. - Needs careful handling of catalog metadata (such as pg_database) to ensure atomicity when attaching/detaching databases. Estimated Development Time: I do not know PostgreSQL's internal architecture well enough to assess the full impact of such a change. However, taking a step back, it seems that rather than deeply modifying the core engine, an alternative approach could be to spawn a separate PostgreSQL engine per database. In this case, the main entry point would act more like a connection bouncer, routing requests to individual database engines. Opportunity Window Period: As SSD and cloud-based infrastructures become the norm, this change would provide *major competitive advantages* for PostgreSQL in multi-tenant, high-performance, and cloud-native use cases. Budget Money: ... Contact Information: Sebastien Caunes sebast...@pixseed.fr
[Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
One-line Summary: Introduce an INSERT FROZEN feature to bypass vacuum processing for large-scale cold data imports, reducing the impact on system performance post-import. For large imports, migrations and major version upgrades. Business Use-case: When importing massive datasets (e.g., 6-10TB) into PostgreSQL on a heavily loaded server, we observed that the system struggled significantly weeks later when the autovacuum process had to freeze all the imported data pages. This led to severe performance degradation, requiring manual intervention to prioritize vacuum jobs to complete them as quickly as possible. This issue is particularly critical during database *migrations* or *version upgrades*, where a full data reload is often necessary. Each time a major PostgreSQL upgrade occurs, users must reimport large datasets, leading to the same problem of vacuum storms post-import. An INSERT FROZEN feature would allow importing data that is known to be immutable, preventing unnecessary vacuum overhead and reducing system strain. User impact with the change: - Users importing large, cold datasets (initial loads, migrations, version upgrades) can mark them as "frozen" during insertion, so pages are directly marked as frozen. - Reduced risk of autovacuum storms weeks after large imports. - More predictable system performance post-import and post-upgrade. - Avoid unnecessary rewriting of all pages after. - Significant improvement for users who perform regular major version upgrades and need to reload data. Implementation details: - A new INSERT FROZEN option could be introduced, similar to COPY FREEZE, allowing direct insertion of tuples in a frozen state. - This would likely require changes in heap storage logic to ensure tuples are written with a frozen XID at insert time. - Consideration should be given to transaction semantics and WAL logging to ensure consistency and crash recovery integrity. Estimated Development Time: Unknown (would require input from core developers to assess complexity). But I think it's not that much and pretty straightforward to implement experimentally. Then. Opportunity Window Period: ... Budget Money: ... Contact Information: If you have further question regarding the issues I experienced that this would solve, feel free to contact me Sébastien Caunes bokan...@gmail.com Thank you for your attention. -- Sébastien Caunes +33 6 7 229 229 7
Re: Restrict publishing of partitioned table with a foreign table as partition
On Thu, 13 Feb 2025 at 15:50, Shlok Kyal wrote: > > > I have fixed the issue. Attached the updated v6 patch. There is another concurrency issue: In case of create publication for all tables with publish_via_partition_root we will call check_foreign_tables: @@ -876,6 +876,10 @@ CreatePublication(ParseState *pstate, CreatePublicationStmt *stmt) /* Associate objects with the publication. */ if (stmt->for_all_tables) { + /* Check if any foreign table is a part of partitioned table */ + if (publish_via_partition_root) + check_foreign_tables(stmt->pubname); At the time of check in check_foreign_tables, there are no foreign tables so this check will be successful: +check_foreign_tables_in_schema(Oid schemaid, char *pubname) +{ + RelationclassRel; + ScanKeyData key[2]; + TableScanDesc scan; + HeapTuple tuple; + + classRel = table_open(RelationRelationId, AccessShareLock); + + ScanKeyInit(&key[0], + Anum_pg_class_relnamespace, + BTEqualStrategyNumber, F_OIDEQ, + schemaid); + ScanKeyInit(&key[1], + Anum_pg_class_relkind, + BTEqualStrategyNumber, F_CHAREQ, + CharGetDatum(RELKIND_PARTITIONED_TABLE)); + + scan = table_beginscan_catalog(classRel, 2, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) Now immediately after execution of this, create a foreign table: postgres=# CREATE FOREIGN TABLE part22 PARTITION OF part2 FOR VALUES FROM (10) TO (15) SERVER fdw; CREATE FOREIGN TABLE And then continue execution of create publication, it will also be successful: postgres=# create publication pub1 for all tables with ( publish_via_partition_root =true); CREATE PUBLICATION One probable way to fix this is to do the search similar to check_foreign_tables_in_schema where we can skip including schemaid key for all tables. How about something like the attached patch. 
Regards, Vignesh diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c index 520fa2f382..3214104dec 100644 --- a/src/backend/catalog/pg_publication.c +++ b/src/backend/catalog/pg_publication.c @@ -1388,21 +1388,24 @@ check_foreign_tables_in_schema(Oid schemaid, char *pubname) { Relation classRel; ScanKeyData key[2]; + int keycount = 0; TableScanDesc scan; HeapTuple tuple; classRel = table_open(RelationRelationId, AccessShareLock); - ScanKeyInit(&key[0], -Anum_pg_class_relnamespace, -BTEqualStrategyNumber, F_OIDEQ, -schemaid); - ScanKeyInit(&key[1], + ScanKeyInit(&key[keycount++], Anum_pg_class_relkind, BTEqualStrategyNumber, F_CHAREQ, CharGetDatum(RELKIND_PARTITIONED_TABLE)); - - scan = table_beginscan_catalog(classRel, 2, key); + + if (OidIsValid(schemaid)) + ScanKeyInit(&key[keycount++], + Anum_pg_class_relnamespace, + BTEqualStrategyNumber, F_OIDEQ, + schemaid); + + scan = table_beginscan_catalog(classRel, keycount, key); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); @@ -1436,46 +1439,3 @@ check_foreign_tables_in_schema(Oid schemaid, char *pubname) table_endscan(scan); table_close(classRel, AccessShareLock); } - -/* Check if any foreign table is a partition table */ -void -check_foreign_tables(char *pubname) -{ - Relation classRel; - ScanKeyData key[1]; - TableScanDesc scan; - HeapTuple tuple; - - classRel = table_open(RelationRelationId, AccessShareLock); - - ScanKeyInit(&key[0], -Anum_pg_class_relkind, -BTEqualStrategyNumber, F_CHAREQ, -CharGetDatum(RELKIND_FOREIGN_TABLE)); - - scan = table_beginscan_catalog(classRel, 1, key); - while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) - { - Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); - - if (relForm->relispartition) - { - Oid parent_oid; - char *parent_name; - List *ancestors = get_partition_ancestors(relForm->oid); - - parent_oid = llast_oid(ancestors); - parent_name = get_rel_name(parent_oid); - - ereport(ERROR, - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("cannot set parameter \"%s\" to true for publication \"%s\"", - "publish_via_partition_root", pubname), - errdetail("partition table \"%s\" in publication contains a foreign partition", - parent_name))); - } - } - - table_endscan(scan); - table_close(classRel, AccessShareLock); -} diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c index e71d5408c7..b8da4ed130 100644 --- a/src/backend/commands/publicationcmds.c +++ b/src/backend/commands/publicationcmds.c @@ -878,7 +878,7 @@ CreatePublication(ParseState *pstate, CreatePublicationStmt *stmt) { /* Check if any foreign table is a part of
Re: [PATCH] Add regression tests of ecpg command notice (error / warning)
On 2025/02/06 8:57, Ryo Kanbayashi wrote: On Wed, Feb 5, 2025 at 9:31 PM Ryo Kanbayashi wrote: Hi hackers, When I wrote patch of ecpg command notice bug, I recognized needs of regression tests for ecpg command notices and I say that I write the tests. Thanks for working on this! I explain about implementation of this patch. What is this patch - add regression tests which test ecpg command notices such as warning and errors - test notices implemented in ecpg.addons file Basic policy on implementation - do in a way that matches the method using the existing pg_regress command as much as possible - avoid methods that increase the scope of influence Next, I list answers to points that are likely to be pointed out in advance below :) - shell scripts and bat files is used due to ... avoid non zero exit code of ecpg command makes tests failure avoid increasing C code for executing binary which cares cross platform - python code is used because I couldn't write meson.build appropriately describe dependency about materials which is used on tests without it. please help me... - as you said, kick this kind of tests by pg_regress accompanied with needless PG server process execution. but pg_regress doesn't execute test without it and making pg_regress require modification which has not small scope of influence Wouldn't it be simpler to use the existing TAP test mechanism, as shown in the attached patch? Please note that this patch is very WIP, so there would be many places that need further implementation and refinement. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index 84199a9a5d..d0e3852a87 100644 --- a/src/interfaces/ecpg/preproc/Makefile +++ b/src/interfaces/ecpg/preproc/Makefile @@ -81,6 +81,9 @@ ecpg_keywords.o: ecpg_kwlist_d.h c_keywords.o: c_kwlist_d.h keywords.o: $(top_srcdir)/src/include/parser/kwlist.h +check: + $(prove_check) + install: all installdirs $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)' diff --git a/src/interfaces/ecpg/preproc/t/001_ecpg.pl b/src/interfaces/ecpg/preproc/t/001_ecpg.pl new file mode 100644 index 00..bc52d19a7f --- /dev/null +++ b/src/interfaces/ecpg/preproc/t/001_ecpg.pl @@ -0,0 +1,37 @@ + +# Copyright (c) 2021-2025, PostgreSQL Global Development Group + +use strict; +use warnings FATAL => 'all'; +use PostgreSQL::Test::Utils; +use Test::More; + +program_help_ok('ecpg'); +program_version_ok('ecpg'); +program_options_handling_ok('ecpg'); +command_fails(['ecpg'], 'ecpg without arguments fails'); + +command_checks_all( + [ 'ecpg', 't/notice.pgc' ], + 3, + [qr//], + [ + qr/ERROR: AT option not allowed in CONNECT statement/, + qr/ERROR: AT option not allowed in DISCONNECT statement/, + qr/ERROR: AT option not allowed in SET CONNECTION statement/, + qr/ERROR: AT option not allowed in TYPE statement/, + qr/ERROR: AT option not allowed in WHENEVER statement/, + qr/ERROR: AT option not allowed in VAR statement/, + qr/WARNING: COPY FROM STDIN is not implemented/, + qr/ERROR: using variable "cursor_var" in different declare statements is not supported/, + qr/ERROR: cursor "duplicate_cursor" is already defined/, + qr/ERROR: SHOW ALL is not implemented/, + qr/WARNING: no longer supported LIMIT/, + qr/WARNING: cursor "duplicate_cursor" has been declared but not opened/, + qr/WARNING: cursor "duplicate_cursor" has been declared but not opened/, + qr/WARNING: cursor ":cursor_var" has been declared but not opened/, 
+ qr/WARNING: cursor ":cursor_var" has been declared but not opened/ + ], + 'ecpg with warnings'); + +done_testing(); diff --git a/src/interfaces/ecpg/preproc/t/notice.pgc b/src/interfaces/ecpg/preproc/t/notice.pgc new file mode 100644 index 00..7e54627fcb --- /dev/null +++ b/src/interfaces/ecpg/preproc/t/notice.pgc @@ -0,0 +1,42 @@ +/* Test ECPG notice/warning/error messages */ + +#include + +int +main(void) +{ + EXEC SQL BEGIN DECLARE SECTION; + char *cursor_var = "mycursor"; + short a; + EXEC SQL END DECLARE SECTION; + + /* For consistency with other tests */ + EXEC SQL CONNECT TO testdb AS con1; + + /* Test AT option errors */ + EXEC SQL AT con1 CONNECT TO testdb2; + EXEC SQL AT con1 DISCONNECT; + EXEC SQL AT con1 SET CONNECTION TO testdb2; + EXEC SQL AT con1 TYPE string IS char[11]; + EXEC SQL AT con1 WHENEVER NOT FOUND CONTINUE; + EXEC SQL AT con1 VAR a IS int; + + /* Test COPY FROM STDIN warning */ + EXEC SQL COPY test FROM stdin; + + /* Test same variable in multi declare statement */ + EXEC SQL DE
Re: Address the bug in 041_checkpoint_at_promote.pl
> > Anyway, how did you find that? Did you see a pattern when running the > > test on a very slow machine or just from a read? That was a good > > catch. > +1. I was wondering the same. I was writing a TAP test to reproduce a crash recovery issue and used parts of 041_checkpoint_at_promote.pl. Unfortunately, my test wasn't waiting for the desired message to appear in the log. I then realized there was a mistake in log_contains(), which I had copied from the existing test. After testing 041_checkpoint_at_promote.pl multiple times to see if it worked as expected, I noticed differences in some iterations. Best Regards, Nitin Jadhav Azure Database for PostgreSQL Microsoft On Thu, Feb 13, 2025 at 11:18 AM Ashutosh Bapat wrote: > > On Thu, Feb 13, 2025 at 5:08 AM Michael Paquier wrote: > > > > Anyway, how did you find that? Did you see a pattern when running the > > test on a very slow machine or just from a read? That was a good > > catch. > +1. I was wondering the same. > > > -- > Best Wishes, > Ashutosh Bapat
Simplify the logic a bit (src/bin/scripts/reindexdb.c)
Hi. Coverity complained about a possible null pointer dereference in the *reindex_one_database* function. That's not really true, but the logic is unnecessarily complicated. Let's simplify it for both humans and machines. Patch attached. Best regards, Ranier Vilela simplifies-reindex-one-database-reindexdb.patch Description: Binary data
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 13, 2025 at 7:08 AM Tomas Vondra wrote: > > > > On 2/13/25 01:40, Melanie Plageman wrote: > > On Sun, Feb 9, 2025 at 9:27 AM Tomas Vondra wrote: > >> > >> For the nvme RAID (device: raid-nvme), it's looks almost exactly the > >> same, except that with parallel query (page 27) there's a clear area of > >> regression with eic=1 (look for "column" of red cells). That's a bit > >> unfortunate, because eic=1 is the default value. > > > > So, I feel pretty confident after even more analysis today with Thomas > > Munro that all of the parallel cases with effective_io_concurrency == > > 1 are because master cheats effective_io_concurrency. Therefore, I'm > > not too worried about those for now. We probably will have to increase > > effective_io_concurrency before merging this patch, though. > > > > Thomas mentioned this to me off-list, and I think he's right. We > > likely need to rethink the way parallel bitmap heap scan workers get > > block assignments for reading and prefetching to make it more similar > > to parallel sequential scan. The workers should probably get > > assignments of a range of blocks. On master, each worker does end up > > issuing reads/fadvises for a bunch of blocks in a row -- even though > > that isn't the intent of the parallel bitmap table scan > > implementation. We are losing some of that with the patch -- but only > > because it is behaving as you would expect given the implementation > > and design. I don't consider that a blocker, though. > > > > Agreed. If this is due to master doing something wrong (and either > prefetching more, or failing to prefetch), that's not the fault of this > patch. People might perceive this as a regression, but increasing the > default eic value would fix that. > > BTW I recall we ran into issues with prefetching and parallel workers > about a year ago [1]. Is this the same issue, or something new? > > [1] > https://www.postgresql.org/message-id/20240315211449.en2jcmdqxv5o6tlz%40alap3.anarazel.de Th postfetching is the same issue as you linked in [1]. > >> For the SATA SSD RAID (device: raid-sata), it's similar - a couple > >> regressions for eic=0, just like for NVMe. But then there's also a > >> couple regressions for higher eic values, usually for queries with > >> selective conditions. > > > > I started trying to reproduce the regressions you saw on your SATA > > devices for higher eic values. I was focusing just on serial bitmap > > heap scan -- since that doesn't have the problems mentioned above. I > > don't have a SATA drive, but I used dmsetup to add a 10ms delay to my > > nvme drive. Unfortunately, that did the opposite of reproduce the > > issue. With dm delay 10ms, the patch is 5 times faster than master. > > > > I'm not quite sure at which point does dm-setup add the delay. Does it > affect the fadvise call too, or just the I/O if the page is not already > in page cache? So dmsetup adds the delay only when the page is not found in kernel buffer cache -- actually below the filesystem before issuing it to the block device. reads, writes, and flushes can be delayed (I delayed all three by 10ms). fadvises can't since they are above that level. - Melanie
Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options
On Mon, Feb 3, 2025 at 11:46 AM Tatsuo Ishii wrote: > > > I've looked at it again and I think the code is correct, but I > > miswrote that the array needs to be sorted. The above query returns: > > x | y | nth_value > > ---+---+--- > > 1 | 1 | 2 > > 2 | 2 | 1 > > 3 | | 2 > > 4 | 4 | > > 5 | | 4 > > 6 | 6 | 7 > > 7 | 7 | 6 > > (7 rows) > > > > This is correct, for values of x: > > > > 1: The first non-null value of y is at position 0, however we have > > EXCLUDE CURRENT ROW so it picks the next non-null value at position 1 > > and stores it in the array, returning 2. > > 2: We can now take the first non-null value of y at position 0 and > > store it in the array, returning 1. > > 3. We take 1 preceding, using the position stored in the array, returning 2. > > 4. 1 preceding and 1 following are both null, and we exclude the > > current row, so returning null. > > 5. 1 preceding is at position 3, store it in the array, returning 4. > > 6. 1 preceding is null and we exclude the current row, so store > > position 6 in the array, returning 7. > > 7. 1 preceding is at position 5, store it in the array and return 6. > > > > It will be unordered when the EXCLUDE clause is used but the code > > should handle this correctly. > > I ran this query (not using IGNORE NULLS) and get a result. > > SELECT > x, > nth_value(x,2) OVER w > FROM generate_series(1,5) g(x) > WINDOW w AS (ORDER BY x ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE > CURRENT ROW); > x | nth_value > ---+--- > 1 | 3 > 2 | 3 > 3 | 2 > 4 | 3 > 5 | 4 > (5 rows) > > Since there's no NULL in x column, I expected the same result using > IGNORE NULLS, but it was not: > > SELECT > x, > nth_value(x,2) IGNORE NULLS OVER w > FROM generate_series(1,5) g(x) > WINDOW w AS (ORDER BY x ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE > CURRENT ROW); > x | nth_value > ---+--- > 1 | 3 > 2 | 4 > 3 | 4 > 4 | 3 > 5 | 4 > (5 rows) > > I suspect the difference is in the code path of > ignorenulls_getfuncarginframe and the code path in > WinGetFuncArgInFrame, which takes care of EXCLUDE like this. > > case FRAMEOPTION_EXCLUDE_CURRENT_ROW: > if (abs_pos >= winstate->currentpos && > winstate->currentpos >= > winstate->frameheadpos) > abs_pos++; Attached version doesn't use the nonnulls array if an Exclude is specified, as I think it's not going to work with exclusions (as it's only an optimization, this is ok and can be taken out entirely if you prefer). I've also added your tests above to the tests. 0007-ignore-nulls.patch Description: Binary data
Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
=?UTF-8?Q?S=C3=A9bastien?= writes: > Implementation details: >- A new INSERT FROZEN option could be introduced, similar to COPY FREEZE, >allowing direct insertion of tuples in a frozen state. >- This would likely require changes in heap storage logic to ensure >tuples are written with a frozen XID at insert time. >- Consideration should be given to transaction semantics and WAL logging >to ensure consistency and crash recovery integrity. That last is exactly why this won't happen. A frozen tuple would be considered committed and visible the instant it appears in the table, thus completely breaking both atomicity and integrity of the transaction. There has been work going on recently to reduce the impact of freezing massive amounts of data by spreading the work more effectively [1]. I don't say that that particular commit has completely solved the problem, but I think that continued effort in that direction is more likely to yield usable results than what you're suggesting. BTW, this might or might not be usable in your particular workflow, but: there have long been some optimizations for data load into a table created in the same transaction. The idea there is that if the transaction rolls back, the table will never have been visible to any other transaction at all, so that maintaining atomicity/integrity of its contents is moot. regards, tom lane [1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=052026c9b
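For context, the table-created-in-the-same-transaction optimization mentioned above is what COPY FREEZE already builds on; a minimal sketch, with a hypothetical file path:

BEGIN;
CREATE TABLE big_import (id bigint, payload text);
-- FREEZE is only accepted because big_import was created in this same
-- transaction, so nothing becomes visible if the transaction aborts.
COPY big_import FROM '/tmp/big_import.csv' WITH (FORMAT csv, FREEZE);
COMMIT;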
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Tue, Feb 11, 2025 at 9:56 PM Shlok Kyal wrote: > > On Tue, 11 Feb 2025 at 09:51, Shubham Khanna > wrote: > > > > On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu) > > wrote: > > > > > > Dear Shubham, > > > > > > Thanks for updating the patch. > > > > > > Previously you told that you had a plan to extend the patch to drop other > > > replication > > > objects [1], but I think it is not needed. pg_createsubscriber has > > > already been > > > able to drop the existing subscrisubscriptions in > > > check_and_drop_existing_subscriptions(). > > > As for the replication slot, I have told in [2], it would be created > > > intentionally > > > thus I feel it should not be dropped. > > > Thus I regard the patch does not have concrete extending plan. > > > > > > Below part contains my review comment. > > > > > > 01. Option name > > > > > > Based on the above discussion, "--cleanup-publisher-objects" is not > > > suitable because > > > it won't drop replication slots. How about "--cleanup-publications"? > > > > > > > I have changed the name of the option to "--cleanup-existing-publications" > > > > > 02. usage() > > > ``` > > > + printf(_(" -C --cleanup-publisher-objects drop all publications > > > on the logical replica\n")); > > > ``` > > > > Fixed. > > > > > s/logical replica/subscriber > > > > > > 03. drop_all_publications > > > ``` > > > +/* Drops all existing logical replication publications from all > > > subscriber > > > + * databases > > > + */ > > > +static void > > > ``` > > > > > > Initial line of the comment must be blank [3]. > > > > > > > Removed this function. > > > > > 04. main > > > ``` > > > + {"cleanup-publisher-objects", no_argument, NULL, 'C'}, > > > ``` > > > > > > Is there a reason why upper case is used? I feel lower one is enough. > > > > > > > Fixed. > > > > > 05. main > > > ``` > > > + /* Drop publications from the subscriber if requested */ > > > + if (opt.cleanup_publisher_objects) > > > + drop_all_publications(dbinfo); > > > ``` > > > > > > After considering more, I noticed that we have already called > > > drop_publication() > > > in the setup_subscriber(). Can we call drop_all_publications() there > > > instead when > > > -C is specified? > > > > > > > I agree with you on this. I have removed the drop_all_publication() > > and added the "--cleanup-existing-publications" option to the > > drop_publication() > > > > > 06. 040_pg_createsubscriber.pl > > > > > > ``` > > > +$node_s->start; > > > +# Create publications to test it's removal > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL TABLES;"); > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL TABLES;"); > > > + > > > +# Verify the existing publications > > > +my $pub_count_before = > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > +is($pub_count_before, '2', > > > + 'two publications created before --cleanup-publisher-objects is > > > run'); > > > + > > > +$node_s->stop; > > > ``` > > > > > > I feel it requires unnecessary startup and shutdown. IIUC, creating > > > publications and check > > > counts can be before stopping the node_s, around line 331. > > > > > > > Fixed. > > > > > 07. 
040_pg_createsubscriber.pl > > > > > > ``` > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL TABLES;"); > > > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL TABLES;"); > > > + > > > +# Verify the existing publications > > > +my $pub_count_before = > > > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > > > +is($pub_count_before, '2', > > > + 'two publications created before --cleanup-publisher-objects is > > > run'); > > > + > > > ``` > > > > > > Also, there is a possibility that CREATE PUBLICATION on node_p is not > > > replicated yet > > > when SELECT COUNT(*) is executed. Please wait for the replay. > > > > > > [1]: > > > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com > > > [2]: > > > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com > > > [3]: https://www.postgresql.org/docs/devel/source-format.html > > > > > > > Fixed. > > > > The attached Patch contains the suggested changes. > > > > Hi Shubham, > > I have some comments for v4 patch: > 1. I think we should update the comment for the function > 'drop_publication'. As its usage is changed with this patch > Currently it states: > /* > * Remove publication if it couldn't finish all steps. > */ > Fixed. > 2. In case when --cleanup_existing_publications is not specified the > info message has two double quotes. > > pg_createsubscriber: dropping publication > ""pg_createsubscriber_5_aa3c31f2"" in database "postgres" > > The code: > + appendPQExpBufferStr(targets, > +PQescapeIdentifier(conn, dbinfo->pubname, > +
Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
On Wed, Feb 12, 2025 at 12:57 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Shubham, > > Thanks for updating the patch! Few comments. > > 01. pg_createsubscriber.sgml > ``` > + > + Remove all existing publications on the subscriber node before > creating > + a subscription. > + > + > ``` > Fixed. > I think this is not accurate. Publications on databases which are not target > will > retain. Also, I'm not sure it is important to clarify "before creating a > subscription.". > > How about: Remove all existing publications from specified databases. > > 02. main() > ``` > + opt.cleanup_existing_publications = false; > ``` > > ISTM initialization is done with the same ordering with > CreateSubscriberOptions. > Thus this should be at after recovery_timeout. > Fixed. > 03. 040_pg_createsubscriber.pl > ``` > +$node_p->wait_for_replay_catchup($node_s); > + > +# Verify the existing publications > +my $pub_count_before = > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > +is($pub_count_before, '2', > + 'two publications created before --cleanup-existing-publications is > run'); > ``` > > I think no need to declare $pub_count_before. How about: > > ``` > ok( $node_s->poll_query_until( > $db1, "SELECT COUNT(*) = 2 FROM pg_publication"), > 'two publications created before --cleanup-existing-publications is > run'); > ``` > Fixed. > 04. 040_pg_createsubscriber.pl > ``` +my $pub_count_after = > + $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"); > +is($pub_count_after, '0', > + 'all publications dropped after --cleanup-existing-publications is > run'); > + > ``` > > I think no need to declare $pub_count_after. How about: > > ``` > is($node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;"), '0', > 'all publications dropped after --cleanup-existing-publications is > run'); > ``` > Fixed. > 05. > > According to the document [1], we must do double-quote for each identifiers. > But current > patch seems not to do. Below example shows the case when three publications > exist. > > ``` > pg_createsubscriber: dropping publications "aaa, bbb, ccc" in database > "postgres" > ``` > > I think the output must be `"aaa", "bbb", "ccc"`. This means > drop_publication() should be refactored. > > [1]: > https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-QUOTES > Fixed. The attached patch at [1] contains the suggested changes. [1] - https://www.postgresql.org/message-id/CAHv8RjKGywu%2B3cAv9MPgxi1_TUXBT8yzUsW%2Bf%3Dg5UsgeJ8Uk6g%40mail.gmail.com Thanks and regards, Shubham Khanna.
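For reference, the per-identifier quoting requested in comment 05 falls out naturally if each publication name is passed through PQescapeIdentifier() on its own and the already-quoted results are joined with commas; PQescapeIdentifier() supplies the surrounding double quotes itself, which is also why wrapping its result in extra quotes produces the ""..."" output seen earlier in the thread. A minimal sketch (the helper name and signature are illustrative, not the patch's actual refactoring of drop_publication()):

```
#include <string.h>

#include "libpq-fe.h"
#include "pqexpbuffer.h"

/*
 * Append publication names to "buf", quoting each identifier separately,
 * producing e.g.: "aaa", "bbb", "ccc"
 */
static void
append_quoted_publications(PGconn *conn, PQExpBuffer buf,
                           const char **pubnames, int npubs)
{
    for (int i = 0; i < npubs; i++)
    {
        char   *quoted = PQescapeIdentifier(conn, pubnames[i],
                                            strlen(pubnames[i]));

        if (quoted == NULL)
            return;             /* out of memory or encoding error */
        if (i > 0)
            appendPQExpBufferStr(buf, ", ");
        appendPQExpBufferStr(buf, quoted);
        PQfreemem(quoted);
    }
}
```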
Re: Get rid of WALBufMappingLock
On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov wrote: > 13.02.2025 12:34, Alexander Korotkov пишет: > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov > > wrote: > >> 08.02.2025 13:07, Alexander Korotkov пишет: > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > >>> wrote: > Good, thank you. I think 0001 patch is generally good, but needs some > further polishing, e.g. more comments explaining how does it work. > >> > >> I tried to add more comments. I'm not good at, so recommendations are > >> welcome. > >> > >>> Two things comes to my mind worth rechecking about 0001. > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > >>> sensitive to that. If so, I would propose to explicitly comment that > >>> and add corresponding asserts. > >> > >> They're certainly page aligned, since they are page borders. > >> I added assert on alignment of InitializeReserved for the sanity. > >> > >>> 2) Check if there are concurrency issues between > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > >> > >> There are no issues: > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > >> GetXLogBuffer to zero-out WAL page. > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > >> so switching wal is not concurrent. (Although, there is no need in this > >> exclusiveness, imho.) > > > > Good, thank you. I've also revised commit message and comments. > > > > But I see another issue with this patch. In the worst case, we do > > XLogWrite() by ourselves, and it could potentially could error out. > > Without patch, that would cause WALBufMappingLock be released and > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > cause other processes infinitely waiting till we finish the > > initialization. > > > > Possible solution would be to save position of the page to be > > initialized, and set it back to XLogCtl->InitializeReserved on error > > (everywhere we do LWLockReleaseAll()). We also must check that on > > error we only set XLogCtl->InitializeReserved to the past, because > > there could be multiple concurrent failures. Also we need to > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > The single place where AdvanceXLInsertBuffer is called outside of critical > section is in XLogBackgroundFlush. All other call stacks will issue server > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > XLogBackgroundFlush explicitely avoids writing buffers by passing > opportunistic=true parameter. > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > since server will shutdown/restart. > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > to XLogWrite here? You're correct. I just reflected this in the next revision of the patch. -- Regards, Alexander Korotkov Supabase v6-0001-Get-rid-of-WALBufMappingLock.patch Description: Binary data v6-0002-several-attempts-to-lock-WALInsertLocks.patch Description: Binary data
Re: Windows meson build
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, failed Spec compliant: tested, failed Documentation: tested, failed Hi! This patch clarifies the options used by meson to build PostgreSQL and how to include libraries and include files from dependencies: 1) the environment variable PKG_CONFIG_PATH 2) the options -Dextra_include_dirs= and -Dextra_lib_dirs This information is useful for Windows builds and any other system where the meson build is used. +1 for commit The new status of this patch is: Ready for Committer
Re: explain analyze rows=%.0f
On 12.02.2025 22:56, Robert Haas wrote: On Wed, Feb 12, 2025 at 2:55 PM Andrei Lepikhov wrote: On 13/2/2025 01:40, Tom Lane wrote: I was idly speculating yesterday about letting the Ryu code print the division result, so that we get a variable number of digits. Realistically, that'd probably result in many cases in more digits than anybody wants, so it's not a serious proposal. I'm cool with the fixed-two-digits approach to start with. Okay, since no one else voted for the meaningful-numbers approach, I would say that fixed size is better than nothing. It may cover some of my practical cases, but unfortunately, not the most problematic ones. I don't love it either, but I do think it is significantly better than nothing. I'm in favor of having some improvement rather than nothing at all—otherwise, we might never reach a consensus. 1. Documentation (v9-0001-Clarify-display-of-rows-as-decimal-fractions-DOC.patch) One thing that bothers me is that the documentation explains how to compute total time, but it does not clarify how to compute total rows. Maybe this was obvious to others before, but now that we are displaying |rows| as a fraction, we should explicitly document how to interpret it alongside total time. I believe it would be helpful to show a non-integer rows value in an example query. However, to achieve this, we need the index scan results to vary across iterations. One way to accomplish this is by using the condition t1.unique2 > t2.unique2. Additionally, I suggest making loops a round number (e.g., 100) for better readability, which can be achieved using t1.thousand < 10. The final query would look like this: EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; I believe this is an ideal example for the documentation because it not only demonstrates fractional rows, but also keeps the execution plan nearly unchanged. While the estimated and actual average row counts become slightly rounded, I don't see another way to ensure different results for each index scan. I'm open to any feedback or suggestions for a better example to use in the documentation or additional explaining fractional rows in the text. 2. Code and tests (v9-0002-Clarify-display-of-rows-as-decimal-fractions.patch) I left the code and tests unchanged since we agreed on a fixed format of two decimal places. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 789d98383988ef6d3b0bc2c6b9b5c73a83ffd6d4 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 13 Feb 2025 11:19:03 +0300 Subject: [PATCH v9] Clarify display of rows as decimal fractions When loops > 1, the average rows value is now displayed as a decimal fraction with two digits after the decimal point in EXPLAIN ANALYZE. Previously, the average rows value was always rounded to an integer, which could lead to misleading results when estimating total rows by multiplying rows by loop. This change ensures that users get a more accurate representation of the data flow through execution nodes. 
--- doc/src/sgml/perform.sgml | 41 --- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..dd61ae4507 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -717,26 +717,26 @@ FROM tenk1 t1 WHERE t1.ten = (SELECT (random() * 10)::integer); EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 -WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; - - QUERY PLAN &zwsp;-- - Nested Loop (cost=4.65..118.50 rows=10 width=488) (actual time=0.017..0.051 rows=10 loops=1) - Buffers: shared hit=36 read=6 - -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.38 rows=10 width=244) (actual time=0.009..0.017 rows=10 loops=1) - Recheck Cond: (unique1 < 10) - Heap Blocks: exact=10 - Buffers: shared hit=3 read=5 written=4 - -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1) - Index Cond: (unique1 < 10) +WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; + + QUERY PLAN +---&zwsp;- + Nested Loop (cost=5.40..11571.44 rows=356667 width=488) (actual time=0.042..117.205 rows=513832 loops=1) + Buffers: shared hit=19377 read=29 + -> Bitmap Heap Scan on tenk1 t1 (cost=5.11..233.60 rows=107 width=244) (actual time=0.021..0.103 rows=100 loops=1) + Recheck Cond: (thousand < 10) + Heap Blocks: exact=90 + Buffers: shared hit=92 + -> Bitmap Index Scan on te
Re: Get rid of WALBufMappingLock
On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov wrote: > 08.02.2025 13:07, Alexander Korotkov пишет: > > On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > > wrote: > >> Good, thank you. I think 0001 patch is generally good, but needs some > >> further polishing, e.g. more comments explaining how does it work. > > I tried to add more comments. I'm not good at, so recommendations are welcome. > > > Two things comes to my mind worth rechecking about 0001. > > 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > sensitive to that. If so, I would propose to explicitly comment that > > and add corresponding asserts. > > They're certainly page aligned, since they are page borders. > I added assert on alignment of InitializeReserved for the sanity. > > > 2) Check if there are concurrency issues between > > AdvanceXLInsertBuffer() and switching to the new WAL file. > > There are no issues: > 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > GetXLogBuffer to zero-out WAL page. > 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > so switching wal is not concurrent. (Although, there is no need in this > exclusiveness, imho.) Good, thank you. I've also revised commit message and comments. But I see another issue with this patch. In the worst case, we do XLogWrite() by ourselves, and it could potentially could error out. Without patch, that would cause WALBufMappingLock be released and XLogCtl->InitializedUpTo not advanced. With the patch, that would cause other processes infinitely waiting till we finish the initialization. Possible solution would be to save position of the page to be initialized, and set it back to XLogCtl->InitializeReserved on error (everywhere we do LWLockReleaseAll()). We also must check that on error we only set XLogCtl->InitializeReserved to the past, because there could be multiple concurrent failures. Also we need to broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > Regarding 0002 patch, it looks generally reasonable. But are 2 > > attempts always optimal? Are there cases of regression, or cases when > > more attempts are even better? Could we have there some > > self-adjusting mechanism like what we have for spinlocks? > > Well, I chose to perform 3 probes (2 conditional attempts + 1 > unconditional) based on intuition. I have some experience in building hash > tables, and cuckoo-hashing theory tells 2 probes is usually enough to reach > 50% fill-rate, and 3 probes enough for ~75% fill rate. Since each probe is > cache miss, it is hardly sensible to do more probes. > > 3 probes did better than 2 in other benchmark [1], although there were > NUM_XLOGINSERT_LOCK increased. > > Excuse me for not bencmarking different choices here. I'll try to do > measurements in next days. > > [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru Ok, let's wait for your measurements. -- Regards, Alexander Korotkov Supabase v5-0002-several-attempts-to-lock-WALInsertLocks.patch Description: Binary data v5-0001-Get-rid-of-WALBufMappingLock.patch Description: Binary data
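To make the failure scenario above easier to follow, here is a deliberately simplified sketch of the reservation protocol the patch introduces (illustrative only; the real logic lives in AdvanceXLInsertBuffer() and also has to handle out-of-order completion): a backend claims the next page with a compare-and-swap on InitializeReserved, initializes it, then publishes it via InitializedUpTo and wakes waiters on the condition variable. An ERROR between the claim and the publish step is exactly what would leave other backends waiting, hence the discussion about critical sections and rolling InitializeReserved back.

```
/*
 * Simplified sketch of the lock-free page-initialization protocol
 * discussed above.  Field names follow the patch; error handling and
 * out-of-order completion are omitted.
 */
static void
reserve_and_init_one_page(void)
{
    XLogRecPtr  reserved;

    /* Claim the next uninitialized page with an atomic compare-and-swap. */
    reserved = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
    while (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
                                           &reserved,
                                           reserved + XLOG_BLCKSZ))
    {
        /* on failure, "reserved" now holds the current value; retry */
    }

    /* ... zero the page here, possibly calling XLogWrite() to evict ... */

    /*
     * Publish the page.  In the real code InitializedUpTo only advances
     * once all earlier pages are initialized too.  If an ERROR were thrown
     * before this point outside a critical section, InitializeReserved
     * would have to be rolled back, otherwise waiters would hang.
     */
    pg_atomic_write_u64(&XLogCtl->InitializedUpTo, reserved + XLOG_BLCKSZ);
    ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
}
```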
Re: pg_stat_statements and "IN" conditions
> On Wed, Feb 12, 2025 at 08:48:03PM GMT, Dmitry Dolgov wrote: > > On Wed, Feb 12, 2025 at 07:39:39PM GMT, Álvaro Herrera wrote: > > The nastiness level of this seems quite low, compared to what happens to > > this other example if we didn't handle these easy cases: > > > > create table t (a float); > > select i from t where i in (1, 2); > > select i from t where i in (1, '2'); > > select i from t where i in ('1', 2); > > select i from t where i in ('1', '2'); > > select i from t where i in (1.0, 1.0); > > Yep, the current version I've got so far produces the same > pg_stat_statements entry for all of those queries. I'm going to move out > the renamed GUC and post the new patch tomorrow. Here is how it looks like (posting only the first patch, since we concentrate on it). This version handles just a little more to cover simpe cases like the implicit convertion above. The GUC is also moved out from pgss and renamed to query_id_merge_values. On top I've added more tests showing the impact, as well as sometimes awkward looking normalized query I was talking about. I'm going to experiment how to iron out the latter. >From ef4115248cd7213494e2bdf8175eaa930aa41640 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Tue, 3 Dec 2024 14:55:45 +0100 Subject: [PATCH v23] Prevent jumbling of every element in ArrayExpr pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...) depending on the number of parameters, because every element of ArrayExpr is jumbled. In certain situations it's undesirable, especially if the list becomes too large. Make an array of Const expressions contribute only the first/last elements to the jumble hash. Allow to enable this behavior via the new pg_stat_statements parameter query_id_merge_values with the default value off. 
Reviewed-by: Zhihong Yu, Sergey Dudoladov, Robert Haas, Tom Lane, Michael Paquier, Sergei Kornilov, Alvaro Herrera, David Geier, Sutou Kouhei, Sami Imseih Tested-by: Chengxi Sun, Yasuo Honda --- contrib/pg_stat_statements/Makefile | 2 +- .../pg_stat_statements/expected/merging.out | 432 ++ contrib/pg_stat_statements/meson.build| 1 + .../pg_stat_statements/pg_stat_statements.c | 47 +- contrib/pg_stat_statements/sql/merging.sql| 169 +++ doc/src/sgml/config.sgml | 27 ++ doc/src/sgml/pgstatstatements.sgml| 28 +- src/backend/nodes/gen_node_support.pl | 21 +- src/backend/nodes/queryjumblefuncs.c | 167 ++- src/backend/postmaster/launch_backend.c | 3 + src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 - src/include/nodes/nodes.h | 3 + src/include/nodes/primnodes.h | 2 +- src/include/nodes/queryjumble.h | 8 +- 15 files changed, 895 insertions(+), 26 deletions(-) create mode 100644 contrib/pg_stat_statements/expected/merging.out create mode 100644 contrib/pg_stat_statements/sql/merging.sql diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index 241c02587b..eef8d69cc4 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS)) REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf REGRESS = select dml cursors utility level_tracking planning \ user_activity wal entry_timestamp privileges extended \ - parallel cleanup oldextversions + parallel cleanup oldextversions merging # Disabled because these tests require "shared_preload_libraries=pg_stat_statements", # which typical installcheck users do not have (e.g. buildfarm clients). NO_INSTALLCHECK = 1 diff --git a/contrib/pg_stat_statements/expected/merging.out b/contrib/pg_stat_statements/expected/merging.out new file mode 100644 index 00..e7815a200c --- /dev/null +++ b/contrib/pg_stat_statements/expected/merging.out @@ -0,0 +1,432 @@ +-- +-- Const merging functionality +-- +CREATE EXTENSION pg_stat_statements; +CREATE TABLE test_merge (id int, data int); +-- IN queries +-- No merging is performed, as a baseline result +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9); + id | data ++-- +(0 rows) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); + id | data ++-- +(0 rows) + +SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11); + id | data ++-- +(0 rows) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; +query | calls +-+--- + SELECT * FROM test_merge WHE
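For readers following the patch, the essence of the change is a check on the ArrayExpr element list: if every element is a plain constant, only the first and last elements feed the jumble, guarded by the new query_id_merge_values GUC. A rough sketch of that decision (the helper and the do_jumble() stand-in are invented names; the real change is in queryjumblefuncs.c):

```
/* do_jumble() stands in for the real node-jumbling routine. */
static void do_jumble(JumbleState *jstate, Node *node);

/* Can this ArrayExpr element list be "merged" for query-ID purposes? */
static bool
array_elements_are_mergeable(List *elements)
{
    ListCell   *lc;

    foreach(lc, elements)
    {
        Node   *element = (Node *) lfirst(lc);

        if (!IsA(element, Const))
            return false;       /* e.g. a Var or expression: jumble normally */
    }
    return true;
}

static void
jumble_array_elements(JumbleState *jstate, ArrayExpr *expr, bool merge_values)
{
    if (merge_values && array_elements_are_mergeable(expr->elements))
    {
        /* only the first and last constants contribute to the hash */
        do_jumble(jstate, (Node *) linitial(expr->elements));
        do_jumble(jstate, (Node *) llast(expr->elements));
    }
    else
    {
        /* default behaviour: every element is jumbled */
        do_jumble(jstate, (Node *) expr->elements);
    }
}
```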
Re: Windows meson build
Vladlen Popolitov wrote on 2025-02-13 17:58: The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, failed Spec compliant: tested, failed Documentation: tested, failed Hi! The previous email contains the text "failed"; it was generated by the "send review" function, and I do not know how to change it (I tried other browsers). Nothing failed in this patch. -- Best regards, Vladlen Popolitov.
Re: Improve CRC32C performance on SSE4.2
On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart wrote: > > I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13 For short lengths, I tried unrolling the loop into a switch statement, as in the attached v5-0006 (the other new patches are fixes for CI). That usually looks faster for me, but not on the length used under the WAL insert lock. Usual caveat: Using small fixed-sized lengths in benchmarks can be misleading, because branches are more easily predicted. It seems like for always using runtime checks we'd need to use branching, rather than function pointers, as has been proposed elsewhere. master: 20 latency average = 3.622 ms latency average = 3.573 ms latency average = 3.599 ms 64 latency average = 7.791 ms latency average = 7.920 ms latency average = 7.888 ms 80 latency average = 8.076 ms latency average = 8.140 ms latency average = 8.150 ms 96 latency average = 8.853 ms latency average = 8.897 ms latency average = 8.914 ms 112 latency average = 9.867 ms latency average = 9.825 ms latency average = 9.869 ms v5: 20 latency average = 4.550 ms latency average = 4.327 ms latency average = 4.320 ms 64 latency average = 5.064 ms latency average = 4.934 ms latency average = 5.020 ms 80 latency average = 4.904 ms latency average = 4.786 ms latency average = 4.942 ms 96 latency average = 5.392 ms latency average = 5.376 ms latency average = 5.367 ms 112 latency average = 5.730 ms latency average = 5.859 ms latency average = 5.734 ms -- John Naylor Amazon Web Services From acb63cddd8c8220db97ae0b012bf4f2fb5174e8a Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 12 Feb 2025 17:07:49 +0700 Subject: [PATCH v5 5/8] Improve CRC32C performance on x86_64 The current SSE4.2 implementation of CRC32C relies on the native CRC32 instruction, which operates on 8 bytes at a time. We can get a substantial speedup on longer inputs by using carryless multiplication on SIMD registers, processing 64 bytes per loop iteration. The PCLMULQDQ instruction has been widely available since 2011 (almost as old as SSE 4.2), so this commit now requires that, as well as SSE 4.2, to build pg_crc32c_sse42.c. The MIT-licensed implementation was generated with the "generate" program from https://github.com/corsix/fast-crc32/ Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" V. Gopal, E. 
Ozturk, et al., 2009 Author: Raghuveer Devulapalli Author: John Naylor Discussion: PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com">https://postgr.es/m/PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com --- config/c-compiler.m4 | 7 ++- configure | 7 ++- meson.build | 7 +-- src/port/pg_crc32c_sse42.c| 4 src/port/pg_crc32c_sse42_choose.c | 9 ++--- 5 files changed, 27 insertions(+), 7 deletions(-) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 8534cc54c1..8b255b5cc8 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS], [define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar], [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +#include #if defined(__has_attribute) && __has_attribute (target) -__attribute__((target("sse4.2"))) +__attribute__((target("sse4.2,pclmul"))) #endif static int crc32_sse42_test(void) + { + __m128i x1 = _mm_set1_epi32(1); unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + x1 = _mm_clmulepi64_si128(x1, x1, 0x00); // pclmul + crc = crc + _mm_extract_epi32(x1, 1); /* return computed value, to prevent the above being optimized away */ return crc == 0; }], diff --git a/configure b/configure index 0ffcaeb436..3f2a2a515e 100755 --- a/configure +++ b/configure @@ -17059,14 +17059,19 @@ else cat confdefs.h - <<_ACEOF >conftest.$ac_ext /* end confdefs.h. */ #include +#include #if defined(__has_attribute) && __has_attribute (target) -__attribute__((target("sse4.2"))) +__attribute__((target("sse4.2,pclmul"))) #endif static int crc32_sse42_test(void) + { + __m128i x1 = _mm_set1_epi32(1); unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + x1 = _mm_clmulepi64_si128(x1, x1, 0x00); + crc = crc + _mm_extract_epi32(x1, 1); /* return computed value, to prevent the above being optimized away */ return crc == 0; } diff --git a/meson.build b/meson.build index 1ceadb9a83..456c3fafc3 100644 --- a/meson.build +++ b/meson.build @@ -2227,15
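On the function-pointer-versus-branch question: a branch-based dispatcher would look roughly like the sketch below, where the CPUID result is cached once and every call branches on it instead of jumping through the pg_comp_crc32c pointer. Whether that wins depends on how well the branch predicts for the short, WAL-insert-lock-sized inputs; this is only an illustration of the idea, not proposed code (have_sse42 is a made-up flag).

```
/*
 * Illustration of branch-based runtime dispatch for CRC32C.
 * pg_comp_crc32c_sse42() and pg_comp_crc32c_sb8() are the existing
 * implementations; have_sse42 would be resolved once at startup.
 */
static bool have_sse42 = false;     /* set once via CPUID at startup */

static inline pg_crc32c
comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
    if (have_sse42)
        return pg_comp_crc32c_sse42(crc, data, len);

    return pg_comp_crc32c_sb8(crc, data, len);
}
```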
Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
On Thu, Feb 13, 2025 at 9:45 AM Sébastien wrote: > This issue is particularly critical during database *migrations* or *version > upgrades*, where a full data reload is often necessary. Each time a major > PostgreSQL upgrade occurs, users must reimport large datasets, leading to > the same problem of vacuum storms post import. > What major upgrade process are you doing that uses INSERT? Cheers, Greg -- Crunchy Data - https://www.crunchydata.com Enterprise Postgres Software Products & Tech Support
Re: pg17.3 PQescapeIdentifier() ignores len
Em qui., 13 de fev. de 2025 às 16:05, Tom Lane escreveu: > Ranier Vilela writes: > > Interesting, Coverity has some new reports regarding PQescapeIdentifier. > > > CID 1591290: (#1 of 1): Out-of-bounds access (OVERRUN) > > 2. alloc_strlen: Allocating insufficient memory for the terminating null > of > > the string. [Note: The source code implementation of the function has > been > > overridden by a builtin model.] > > That's not new, we've been seeing those for awhile. I've been > ignoring them on the grounds that (a) if the code actually had such a > problem, valgrind testing would have found it, and (b) the message is > saying in so many words that they're ignoring our code in favor of > somebody's apparently-inaccurate model of said code. > Thanks Tom, extra care is needed when analyzing these reports. best regards, Ranier Vilela
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 13, 2025 at 11:28 AM Tomas Vondra wrote: > > On 2/13/25 17:01, Melanie Plageman wrote: > > On Thu, Feb 13, 2025 at 10:46 AM Tomas Vondra wrote: > >> > >> I reviewed v29 today, and I think it's pretty much ready to go. > >> > >> The one part where I don't quite get is 0001, which replaces a > >> FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, > >> but I don't quite see the benefits / clarity. And I think Thomas might > >> be right we may want to make this dynamic, to save memory. > >> > >> Not a blocker, but I'd probably skip 0001 (unless it's required by the > >> later parts, I haven't checked/tried). > > > > So, on master, it already pallocs an array of size MAX_TUPLES_PER_PAGE > > (which is hard-coded in the tidbitmap API to MaxHeapTuplesPerPage) -- > > see tbm_begin_private_iterate(). > > > > So we always palloc the same amount of memory. The reason I changed it > > from a flexible sized array to a fixed size is that we weren't using > > the flexibility and having a flexible sized array in the > > TBMIterateResult meant it couldn't be nested in another struct. Since > > I have to separate the TBMIterateResult and TBMIterator to implement > > the read stream API for BHS, once I separate them, I nest the > > TBMIterateResult in the GinScanEntry. If the array of offsets is > > flexible sized, then I would have to manage that memory separately in > > GIN code for the TBMIterateResult.. > > > > So, 0001 isn't a change in the amount of memory allocated. > > > > With the read stream API, these TBMIterateResults are palloc'd just > > like we palloc'd one in master. However, we have to have more than one > > at a time. > > > > I know it's not changing how much memory we allocate (compared to > master). I haven't thought about the GinScanEntry - yes, flexible array > member would make this a bit more complex. Oh, I see. I didn't understand Thomas' proposal. I don't know how hard it would be to make tidbitmap allocate the offsets on-demand. I'd need to investigate more. But probably not worth it for this patch. - Melanie
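To make the 0001 trade-off concrete: with a flexible array member the offsets live at the end of a separately palloc'd TBMIterateResult, so the struct cannot be embedded by value in something like a GinScanEntry; with a fixed-size array it can, and the total allocation stays the same since the array was always sized for MaxHeapTuplesPerPage anyway. A simplified before/after sketch (field lists trimmed and names adjusted, so not the exact definitions):

```
/* Before (sketch): flexible array member, must always be palloc'd separately */
typedef struct TBMIterateResultFlexible
{
    BlockNumber blockno;
    int         ntuples;
    bool        recheck;
    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];    /* cannot be nested */
} TBMIterateResultFlexible;

/* After (sketch): fixed-size array, same total size as the old palloc */
typedef struct TBMIterateResultFixed
{
    BlockNumber blockno;
    int         ntuples;
    bool        recheck;
    OffsetNumber offsets[MaxHeapTuplesPerPage];
} TBMIterateResultFixed;

/* ...which is what allows embedding the result by value, e.g.: */
typedef struct GinScanEntrySketch
{
    /* ... other scan-entry fields ... */
    TBMIterateResultFixed matchResult;  /* no separate palloc or pointer */
} GinScanEntrySketch;
```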
Is pgAdmin the only front-end to PostgreSQL debugger ? And is "a working pl/pgsql debugger" something core should care to maintain ?
Hello PostgreSQL Hackers, We recently discovered an error where pgAdmin fails when stepping into nested function calls ( https://github.com/pgadmin-org/pgadmin4/issues/8443 ). So while waiting for this to be fixed, I would like to know whether there are other debugger front-ends that could be used to do basic debugging of pl/pgsql code. And would these need a separate debugging extension, or is "debugging" meant to be a generic PostgreSQL server feature that has a well-defined API? I know there used to be another debugger as part of OmniDB which had its own server-side extension as well, but I wanted to know what the general thinking on this is. Is debugging pl/pgsql a generic feature of PostgreSQL core, considering that pl/pgsql itself kind of is? Should there be something similar for debugging plain SQL and possibly other PL languages? Should there perhaps be debugging support in psql? -- Hannu
Re: pg17.3 PQescapeIdentifier() ignores len
Em qui., 13 de fev. de 2025 às 13:51, Justin Pryzby escreveu: > I found errors in our sql log after upgrading to 17.3. > > error_severity | ERROR > message| schema > "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" does not exist > query | copy > "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214"."44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" > from stdin > > The copy command is from pygresql's inserttable(), which does: > > do { > t = strchr(s, '.'); > if (!t) > t = s + strlen(s); > table = PQescapeIdentifier(self->cnx, s, (size_t)(t - s)); > fprintf(stderr, "table %s len %ld => %s\n", s, t-s, table); > if (bufpt < bufmax) > bufpt += snprintf(bufpt, (size_t)(bufmax - bufpt), "%s", > table); > PQfreemem(table); > s = t; > if (*s && bufpt < bufmax) > *bufpt++ = *s++; > } while (*s); > > The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its > len. > Interesting, Coverity has some new reports regarding PQescapeIdentifier. CID 1591290: (#1 of 1): Out-of-bounds access (OVERRUN) 2. alloc_strlen: Allocating insufficient memory for the terminating null of the string. [Note: The source code implementation of the function has been overridden by a builtin model.] Until now, I was in disbelief. best regards, Ranier Vilela
Patch: Log parameter types in detailed query logging
Dear PostgreSQL Hackers, This patch adds the ability to log the types of parameters used in queries when detailed query logging is enabled. Currently, detailed logging only shows the parameter values, which often requires further investigation or asking the client to determine the data types. This enhancement will significantly aid in debugging problematic queries, especially when data type mismatches are suspected. The patch modifies the detailed logging output to include the data type of each parameter, making it easier to understand the context of the query and diagnose potential issues without additional communication overhead. Here's an example of the new logging format: ``` 2025-02-10 21:05:42.622 +07 [3702286] LOG: duration: 0.008 ms execute P_1: SELECT u.username, u.email, r.role_name, o.order_date, p.product_name, oi.quantity, ur.role_id, (p.price * oi.quantity) AS total_price FROM users u JOIN user_roles ur ON u.user_id = ur.user_id JOIN roles r ON ur.role_id = r.role_id JOIN orders o ON u.user_id = o.user_id JOIN order_items oi ON o.order_id = oi.order_id JOIN products p ON oi.product_id = p.product_id where ur.role_id = $1 and u.user_id = $2 and oi.order_id = $3::bigint ORDER BY o.order_date; 2025-02-10 21:05:42.622 +07 [3702286] DETAIL: Parameters: $1 = (integer)'11', $2 = (integer)'86', $3 = (bigint)'14' ``` As you can see, the DETAIL log message now includes the data type in parentheses before the parameter value. I believe this addition will greatly improve the usefulness of detailed query logging. I welcome your feedback and suggestions. Thank you for your time and consideration. Best regards, Stepan Neretin
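A rough idea of how a DETAIL string like the one above could be assembled, assuming the parameter type OIDs are taken from the ParamListInfo (a sketch of the approach only, not the posted patch; quoting and escaping of the values is omitted):

```
/*
 * Sketch: build a "Parameters: ..." detail string that also shows each
 * parameter's type, e.g.  $1 = (integer)'11'.
 */
static char *
build_params_detail(ParamListInfo params)
{
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfoString(&buf, "Parameters: ");

    for (int i = 0; i < params->numParams; i++)
    {
        ParamExternData *prm = &params->params[i];

        if (i > 0)
            appendStringInfoString(&buf, ", ");

        if (prm->isnull)
            appendStringInfo(&buf, "$%d = NULL", i + 1);
        else
        {
            Oid     typoutput;
            bool    typisvarlena;
            char   *val;

            getTypeOutputInfo(prm->ptype, &typoutput, &typisvarlena);
            val = OidOutputFunctionCall(typoutput, prm->value);

            appendStringInfo(&buf, "$%d = (%s)'%s'",
                             i + 1,
                             format_type_be(prm->ptype),   /* the added type */
                             val);
        }
    }

    return buf.data;
}
```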
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/13/25 17:01, Melanie Plageman wrote: > On Thu, Feb 13, 2025 at 10:46 AM Tomas Vondra wrote: >> >> I reviewed v29 today, and I think it's pretty much ready to go. >> >> The one part where I don't quite get is 0001, which replaces a >> FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, >> but I don't quite see the benefits / clarity. And I think Thomas might >> be right we may want to make this dynamic, to save memory. >> >> Not a blocker, but I'd probably skip 0001 (unless it's required by the >> later parts, I haven't checked/tried). > > So, on master, it already pallocs an array of size MAX_TUPLES_PER_PAGE > (which is hard-coded in the tidbitmap API to MaxHeapTuplesPerPage) -- > see tbm_begin_private_iterate(). > > So we always palloc the same amount of memory. The reason I changed it > from a flexible sized array to a fixed size is that we weren't using > the flexibility and having a flexible sized array in the > TBMIterateResult meant it couldn't be nested in another struct. Since > I have to separate the TBMIterateResult and TBMIterator to implement > the read stream API for BHS, once I separate them, I nest the > TBMIterateResult in the GinScanEntry. If the array of offsets is > flexible sized, then I would have to manage that memory separately in > GIN code for the TBMIterateResult.. > > So, 0001 isn't a change in the amount of memory allocated. > > With the read stream API, these TBMIterateResults are palloc'd just > like we palloc'd one in master. However, we have to have more than one > at a time. > I know it's not changing how much memory we allocate (compared to master). I haven't thought about the GinScanEntry - yes, flexible array member would make this a bit more complex. I don't want to bikeshed this - I'll leave the decision about 0001 to you. I'm OK with both options. regards -- Tomas Vondra
Re: Get rid of WALBufMappingLock
Hi, Yura and Alexander! On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov wrote: > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov > wrote: > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov > > > wrote: > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > > >>> wrote: > > Good, thank you. I think 0001 patch is generally good, but needs some > > further polishing, e.g. more comments explaining how does it work. > > >> > > >> I tried to add more comments. I'm not good at, so recommendations are > > >> welcome. > > >> > > >>> Two things comes to my mind worth rechecking about 0001. > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > >>> sensitive to that. If so, I would propose to explicitly comment that > > >>> and add corresponding asserts. > > >> > > >> They're certainly page aligned, since they are page borders. > > >> I added assert on alignment of InitializeReserved for the sanity. > > >> > > >>> 2) Check if there are concurrency issues between > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > >> > > >> There are no issues: > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > >> GetXLogBuffer to zero-out WAL page. > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks, > > >> so switching wal is not concurrent. (Although, there is no need in this > > >> exclusiveness, imho.) > > > > > > Good, thank you. I've also revised commit message and comments. > > > > > > But I see another issue with this patch. In the worst case, we do > > > XLogWrite() by ourselves, and it could potentially could error out. > > > Without patch, that would cause WALBufMappingLock be released and > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > cause other processes infinitely waiting till we finish the > > > initialization. > > > > > > Possible solution would be to save position of the page to be > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > error we only set XLogCtl->InitializeReserved to the past, because > > > there could be multiple concurrent failures. Also we need to > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > The single place where AdvanceXLInsertBuffer is called outside of critical > > section is in XLogBackgroundFlush. All other call stacks will issue server > > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > > > XLogBackgroundFlush explicitely avoids writing buffers by passing > > opportunistic=true parameter. > > > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > > since server will shutdown/restart. > > > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before call > > to XLogWrite here? > > You're correct. I just reflected this in the next revision of the patch. I've looked at the patchset v6. The overall goal to avoid locking looks good. Both patches look right to me. The only thing I'm slightly concerned about if patchset could demonstrate performance differences in any real workload. I appreciate it as a beautiful optimization anyway. I have the following proposals to change the code and comments: For patch 0001: - Struct XLBlocks contains just one pg_atomic_uint64 member. 
Is it still needed as a struct? These changes make a significant volume of changes to the patch, being noop. Maybe it was inherited from v1 and not needed anymore. - Furthermore when xlblocks became a struct comments like: > and xlblocks values certainly do. xlblocks values are changed need to be changed to xlblocks.bound. This could be avoided by changing back xlblocks from type XLBlocks * to pg_atomic_uint64 * - It's worth more detailed commenting InitializedUpTo/InitializedUpToCondVar than: +* It is updated to successfully initialized buffer's identities, perhaps +* waiting on conditional variable bound to buffer. "perhaps waiting" could also be in style "maybe/even while AAA waits BBB" "lock-free with cooperation with" -> "lock-free accompanied by changes to..." - Comment inside typedef struct XLogCtlData: /* 1st byte ptr-s + XLOG_BLCKSZ (and condvar * for) */ need to be returned back /* 1st byte ptr-s + XLOG_BLCKSZ */ - Commented out code for cleanup in the final patch: //ConditionVariableBroadcast(&XLogCtl->xlblocks[nextidx].condvar); - in AdvanceXLInsertBuffer() npages initialised to 0 but it is not increased anymore Block under > if (XLOG_DEBUG && npages > 0) became unreachable (InvalidXLogRecPtr + 1) is essentially 0+1 and IMO this semantically calls for adding #define FirstValidXLogRecPtr 1 Typo in a commit message: %s/usign/using/g For patch 0002: I think Yura
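To illustrate the first two suggestions above: dropping the single-member XLBlocks struct in favour of a plain array of atomics keeps call sites unchanged, and the magic (InvalidXLogRecPtr + 1) gets a name. A sketch of the proposed shape (fragments only, not compilable on their own):

```
/* Keep xlblocks as a plain atomic array instead of a one-member struct: */
pg_atomic_uint64 *xlblocks;     /* 1st byte ptr-s + XLOG_BLCKSZ */

/* ...so existing call sites keep reading it directly: */
expectedEndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

/* And give the first valid position a name instead of InvalidXLogRecPtr + 1: */
#define FirstValidXLogRecPtr    ((XLogRecPtr) 1)
```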
DROP CONSTRAINT, printing a notice and/or skipping when no action is taken
Hello. I noticed a small opportunity for a possible enhancement to DROP DEFAULT, and wanted to share the idea. Apologies if this idea was suggested before, I tried a basic search for pgsql-hackers similar things but didn’t find a hit. I noticed when running an ALTER TABLE with DROP DEFAULT, whether the column default exists or not, ALTER TABLE is always printed as the result. This is arguably slightly confusing, because it’s unclear if anything was done. In the scenario where there is no column default, there isn’t a message saying “skipped” or something equivalent, indicating that there was no default that was dropped. Some of the commands in Postgres do have this kind of feedback, so it seems like an opportunity for greater consistency. For example: if I create a column default, or repeatedly run the following ALTER TABLE statements for the "id_new" column, I always get ALTER TABLE back. ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT; ALTER TABLE ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT; ALTER TABLE ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT; ALTER TABLE An opportunity would be to add a NOTICE type of message when ALTER TABLE ALTER COLUMN DROP DEFAULT is issued, at least when no column default exists, and no action was taken. In that scenario, the operation could possibly be skipped altogether, which might have some additional benefits. As a refreshed on a “Notice” type of message example, here’s one when adding an index and using the "if not exists" clause (an equivalent "if not exists" clause does not exist for DROP DEFAULT to my knowledge): -- an index called “foo” already exists psql> create index if not exists foo on organizations (id); NOTICE: relation "foo" already exists, skipping CREATE INDEX The message being “NOTICE: relation "foo" already exists, skipping” A similar message for DROP DEFAULT might look like: “NOTICE: default does not exist, skipping” Or an alternative that includes the column name might look like: “NOTICE: default does not exist for column id_new, skipping” Or another alternative might be a new (non-standard?) "if exists" clause for DROP DEFAULT. Example: ALTER TABLE my_table ALTER COLUMN id_new DROP DEFAULT IF EXISTS; -- Or an alternative placement of the "if exists" clause, because I don’t really know where it would go: ALTER TABLE my_table ALTER COLUMN id_new DROP IF EXISTS DEFAULT; Thanks! - Andrew
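On the server side, the skip-with-notice behaviour would amount to an early exit in the ALTER TABLE path that handles DROP DEFAULT, along the lines of the fragment below (a sketch only; attribute lookup and the existing default-removal code are elided, and the message wording is just an example):

```
/* Sketch: DROP DEFAULT on a column that has no default becomes a no-op. */
if (newDefault == NULL && !attTup->atthasdef)
{
    ereport(NOTICE,
            (errmsg("column \"%s\" of relation \"%s\" has no default, skipping",
                    colName, RelationGetRelationName(rel))));
    return;
}
```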
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Feb 12, 2025 at 6:55 AM Peter Eisentraut wrote: > This just depends on how people have built their libcurl, right? > > Do we have any information whether the async-dns-free build is a common > configuration? I don't think the annual Curl survey covers that, unfortunately. On Wed, Feb 12, 2025 at 6:59 AM Peter Eisentraut wrote: > I think you can just remove that comment. It's pretty established that > internal errors don't need translation, so it would be understood from > looking at the code. Okay, will do. Thanks, --Jacob
Remove an unnecessary argument from execute_extension_script()
Hi, The attached patch removes an unnecessary argument, "schemaOid", from the static function execute_extension_script(). It might have been intended for some use initially, but it is actually unused, since schemaName is sufficient. Regards, Yugo Nagata -- Yugo Nagata diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index ba540e3de5b..ec1d38b2172 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -982,7 +982,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, const char *from_version, const char *version, List *requiredSchemas, - const char *schemaName, Oid schemaOid) + const char *schemaName) { bool switch_to_superuser = false; char *filename; @@ -1788,7 +1788,7 @@ CreateExtensionInternal(char *extensionName, execute_extension_script(extensionOid, control, NULL, versionName, requiredSchemas, - schemaName, schemaOid); + schemaName); /* * If additional update scripts have to be executed, apply the updates as @@ -3380,7 +3380,7 @@ ApplyExtensionUpdates(Oid extensionOid, execute_extension_script(extensionOid, control, oldVersionName, versionName, requiredSchemas, - schemaName, schemaOid); + schemaName); /* * Update prior-version name and loop around. Since
pg17.3 PQescapeIdentifier() ignores len
I found errors in our sql log after upgrading to 17.3. error_severity | ERROR message| schema "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" does not exist query | copy "rptcache.44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214"."44e3955c33bb79f55750897da0c5ab1fa2004af1_20250214" from stdin The copy command is from pygresql's inserttable(), which does: do { t = strchr(s, '.'); if (!t) t = s + strlen(s); table = PQescapeIdentifier(self->cnx, s, (size_t)(t - s)); fprintf(stderr, "table %s len %ld => %s\n", s, t-s, table); if (bufpt < bufmax) bufpt += snprintf(bufpt, (size_t)(bufmax - bufpt), "%s", table); PQfreemem(table); s = t; if (*s && bufpt < bufmax) *bufpt++ = *s++; } while (*s); The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its len. python3 -c "import pg; db=pg.DB('postgres'); db.inserttable('child.a', [1])") table child.a len 5 => "child.a" table a len 13 => "a" -- Justin
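For comparison, here is a minimal standalone program showing the behaviour the pygresql code relies on: PQescapeIdentifier() is supposed to escape only the first len bytes, so splitting "child.a" at the dot should produce "child" and "a" rather than quoting the whole string (the conninfo string is a placeholder, and error checks are trimmed):

```
#include <stdio.h>
#include <string.h>
#include "libpq-fe.h"

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder */
    const char *qualified = "child.a";
    const char *dot = strchr(qualified, '.');
    char       *schema;
    char       *table;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Escape only the first (dot - qualified) bytes: expected "child" */
    schema = PQescapeIdentifier(conn, qualified, (size_t) (dot - qualified));
    /* Escape the remainder: expected "a" */
    table = PQescapeIdentifier(conn, dot + 1, strlen(dot + 1));

    printf("%s.%s\n", schema, table);   /* expected: "child"."a" */

    PQfreemem(schema);
    PQfreemem(table);
    PQfinish(conn);
    return 0;
}
```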
Re: Is pgAdmin the only front-end to PostgreSQL debugger ? And is "a working pl/pgsql debugger" something core should care to maintain ?
Hi On Thu, Feb 13, 2025 at 18:00, Hannu Krosing wrote: > Hello PostgreSQL Hackers, > > > We recently discovered an error where pgAdmin fails when stepping into > nested function calls ( > https://github.com/pgadmin-org/pgadmin4/issues/8443 ). > > So while waiting for this to be fixed I would like to know if there > are other debugger front-ends that could be used to do basic debugging > of pl/pgsql code? If I remember correctly, there was an independent application that was a plpgsql debugger, but that was in the time before pgAdmin 4. Maybe DBeaver uses it too. I think some commercial GUIs have support for debugging plpgsql, but I am not sure. > > And would these need a separate debugging extension, or is "debugging" > meant to be a generic PostgreSQL server feature that has a > well-defined API? > Unfortunately, the built-in API has no external interface. So in every case you need an extension that uses the internal plpgsql debug API or provides access to some plpgsql data like the stack or variables. These extensions can differ - plpgsql_check uses this API, as do plprofiler and pldebugger. If I remember well, the development of pldebugger and plprofiler was a fully independent project outside the Postgres core, and there was no attempt to merge it upstream. The internal built-in part is really minimalistic - just a few hooks. > I know there used to be another debugger as part of OmniDB which had > its own server-side extension as well, but wanted to know what is the > general thinking on this. > > Is debugging pl/pgsql a generic feature of PostgreSQL core considering > that pl/pgsql itself kind of is? > Debugging plpgsql is independent of the PostgreSQL core. > Should there be something similar for debugging plain SQL and possibly > other PL languages? > I can imagine this - there could be some common API supported by the language handlers. > > Should there perhaps be debugging support in psql? > psql does not support an asynchronous API, so an implementation in psql would not be simple. Moreover, I doubt that users want to use a debugger with a UI similar to gdb, and it isn't possible to design a better UI directly in psql. I can imagine supporting external debuggers executed inside or outside the same terminal that communicate with psql over some bidirectional pipe (psql would be used like a proxy). Then an application similar to the Borland debugger could be designed. This could be a very nice feature with probably not too much code, but handling bidirectional pipes isn't trivial. Tuning the reading from a pipe in pspg for some less usual patterns was not pleasant work (and there the communication is significantly simpler than it would be for a debugger). Regards Pavel > > -- > Hannu > > >
Re: pg17.3 PQescapeIdentifier() ignores len
Ranier Vilela writes: > Interesting, Coverity has some new reports regarding PQescapeIdentifier. > CID 1591290: (#1 of 1): Out-of-bounds access (OVERRUN) > 2. alloc_strlen: Allocating insufficient memory for the terminating null of > the string. [Note: The source code implementation of the function has been > overridden by a builtin model.] That's not new, we've been seeing those for awhile. I've been ignoring them on the grounds that (a) if the code actually had such a problem, valgrind testing would have found it, and (b) the message is saying in so many words that they're ignoring our code in favor of somebody's apparently-inaccurate model of said code. regards, tom lane
Re: pg_stat_statements and "IN" conditions
Hi, Thanks for the updated patch! I spent some time looking at v24 today, and I have some findings/comments. 1/ Constants passed as parameters to a prepared statement will not be handled as expected. I did not not test explicit PREPARE/EXECUTE statement, but I assume it will have the same issue. postgres=# show query_id_squash_values; query_id_squash_values on (1 row) postgres=# select from foo where col_bigint in ($1, $2, $3, $4) \bind 1 2 3 4 postgres-# ; -- (0 rows) postgres=# select from foo where col_bigint in ($1, $2, $3, $4, $5) \bind 1 2 3 4 5 postgres-# ; -- (0 rows) postgres=# select query, queryid, calls from pg_stat_statements where query like 'select%from foo where%' order by stats_since asc; query | queryid| calls --+--+--- select from foo where col_bigint in ($1, $2, $3, $4) | -1169585827903667511 | 1 select from foo where col_bigint in ($1, $2, $3, $4, $5) | -5591703027615838766 | 1 (2 rows) I think the answer is here is to also check for "Param" when deciding if an element should be merged. i.e. if (!IsA(element, Const) && !IsA(element, Param)) 2/ This case with an array passed to aa function seems to cause a regression in pg_stat_statements query text. As you can see the text is incomplete. CREATE OR REPLACE FUNCTION arrtest(i int[]) RETURNS void AS $$ BEGIN NULL; END; $$ LANGUAGE plpgsql; postgres=# select arrtest(array[1, 2]) from foo where col_bigint in (1, 2, 3); arrtest - (0 rows) postgres=# select query from pg_stat_statements; query --- select arrtest(array[...) (1 row) it should otherwise look like this: postgres=# select query from pg_stat_statements; query - select arrtest(array[$1, $2]) from foo where col_bigint in ($3, $4, $5) (1 row) 3/ A typo in the docs. c/lenght/length +occurrence with an array of different lenght. 4/ + +Specifies how an array of constants (e.g. for an IN +clause) contributes to the query identifier computation. Is this parameter specific to only useful to merge the values of an IN list. Should the documentation be more specific and say that only IN lists will benefit from this parameter? Also, if there is only 1 value in the list, it will have a different queryId than that of the same query in which more than 1 value is passed to the IN list. Should the documentation be clear about that? 5/ pg_node_attr of query_jumble_merge is doing something very specific to the elements list of an ArrayExpr. The merge code likely cannot be used for other node types. /* the array elements or sub-arrays */ - List *elements; + List *elements pg_node_attr(query_jumble_merge); Why are we creating a new node attribute rather than following the existing pattern of using the "custom_query_jumble" attribute on ArrayExpr and creating a custom jumble function like we do with _jumbleVariableSetStmt? Regards, Sami
Re: Confine vacuum skip logic to lazy_scan_skip
On Tue, Feb 11, 2025 at 5:10 PM Melanie Plageman wrote: > > On Thu, Feb 6, 2025 at 1:06 PM Melanie Plageman > wrote: > > > > On Wed, Feb 5, 2025 at 5:26 PM Melanie Plageman > > wrote: > > > > > > Yes, looking at these results, I also feel good about it. I've updated > > > the commit metadata in attached v14, but I could use a round of review > > > before pushing it. > > > > I've done a bit of self-review and updated these patches. > > This needed a rebase in light of 052026c9b90. > v16 attached has an additional commit which converts the block > information parameters to heap_vac_scan_next_block() into flags > because we can only get one piece of information per block from the > read stream API. This seemed nicer than a cumbersome struct. > Sorry for the late chiming in. I've reviewed the v16 patch set, and the patches mostly look good. Here are some comments mostly about cosmetic things: 0001: - boolall_visible_according_to_vm, - was_eager_scanned = false; + uint8 blk_flags = 0; Can we probably declare blk_flags inside the main loop? 0002: In lazy_scan_heap(), we have a failsafe check at the beginning of the main loop, which is performed before reading the first block. Isn't it better to move this check after scanning a block (especially after incrementing scanned_pages)? Otherwise, we would end up calling lazy_check_wraparound_failsafe() at the very first loop, which previously didn't happen without the patch. Since we already called lazy_check_wraparound_failsafe() just before calling lazy_scan_heap(), the extra check would not make much sense. --- + /* Set up the read stream for vacuum's first pass through the heap */ + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, + vacrel->bstrategy, + vacrel->rel, + MAIN_FORKNUM, + heap_vac_scan_next_block, + vacrel, + sizeof(bool)); Is there any reason to use sizeof(bool) instead of sizeof(uint8) here? --- /* * Vacuum the Free Space Map to make newly-freed space visible on -* upper-level FSM pages. Note we have not yet processed blkno. +* upper-level FSM pages. Note that blkno is the previously +* processed block. */ FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno); Given the blkno is already processed, should we pass 'blkno + 1' instead of blkno? 0003: - while ((iter_result = TidStoreIterateNext(iter)) != NULL) I think we can declare iter_result in the main loop of lazy_vacuum_heap_rel(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Confine vacuum skip logic to lazy_scan_skip
On Tue, Feb 11, 2025 at 8:10 PM Melanie Plageman wrote: > > On Thu, Feb 6, 2025 at 1:06 PM Melanie Plageman > wrote: > > > > On Wed, Feb 5, 2025 at 5:26 PM Melanie Plageman > > wrote: > > > > > > Yes, looking at these results, I also feel good about it. I've updated > > > the commit metadata in attached v14, but I could use a round of review > > > before pushing it. > > > > I've done a bit of self-review and updated these patches. > > This needed a rebase in light of 052026c9b90. > v16 attached has an additional commit which converts the block > information parameters to heap_vac_scan_next_block() into flags > because we can only get one piece of information per block from the > read stream API. This seemed nicer than a cumbersome struct. I've done some clean-up including incorporating a few off-list pieces of minor feedback from Andres. - Melanie From da251546b0e7a748884efc9fb6d27fab0b7a452d Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 13 Feb 2025 17:34:18 -0500 Subject: [PATCH v17 3/3] Use streaming read I/O in VACUUM's third phase Make vacuum's third phase (its second pass over the heap), which reaps dead items collected in the first phase and marks them as reusable, use the read stream API. This commit adds a new read stream callback, vacuum_reap_lp_read_stream_next(), that looks ahead in the TidStore and returns the next block number to read for vacuum. Author: Melanie Plageman Co-authored-by: Thomas Munro Discussion: https://postgr.es/m/CA%2BhUKGKN3oy0bN_3yv8hd78a4%2BM1tJC9z7mD8%2Bf%2ByA%2BGeoFUwQ%40mail.gmail.com --- src/backend/access/heap/vacuumlazy.c | 53 +--- 1 file changed, 48 insertions(+), 5 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 32083a92c31..f85530a28f3 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -2639,6 +2639,32 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) return allindexes; } +/* + * Read stream callback for vacuum's third phase (second pass over the heap). + * Gets the next block from the TID store and returns it or InvalidBlockNumber + * if there are no further blocks to vacuum. + */ +static BlockNumber +vacuum_reap_lp_read_stream_next(ReadStream *stream, +void *callback_private_data, +void *per_buffer_data) +{ + TidStoreIter *iter = callback_private_data; + TidStoreIterResult *iter_result; + + iter_result = TidStoreIterateNext(iter); + if (iter_result == NULL) + return InvalidBlockNumber; + + /* + * Save the TidStoreIterResult for later, so we can extract the offsets. + * It is safe to copy the result, according to TidStoreIterateNext(). 
+ */ + memcpy(per_buffer_data, iter_result, sizeof(*iter_result)); + + return iter_result->blkno; +} + /* * lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy * @@ -2659,6 +2685,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) static void lazy_vacuum_heap_rel(LVRelState *vacrel) { + ReadStream *stream; BlockNumber vacuumed_pages = 0; Buffer vmbuffer = InvalidBuffer; LVSavedErrInfo saved_err_info; @@ -2679,7 +2706,17 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) InvalidBlockNumber, InvalidOffsetNumber); iter = TidStoreBeginIterate(vacrel->dead_items); - while ((iter_result = TidStoreIterateNext(iter)) != NULL) + + /* Set up the read stream */ + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, + vacrel->bstrategy, + vacrel->rel, + MAIN_FORKNUM, + vacuum_reap_lp_read_stream_next, + iter, + sizeof(TidStoreIterResult)); + + while (true) { BlockNumber blkno; Buffer buf; @@ -2690,9 +2727,15 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) vacuum_delay_point(false); - blkno = iter_result->blkno; - vacrel->blkno = blkno; + buf = read_stream_next_buffer(stream, (void **) &iter_result); + + /* The relation is exhausted */ + if (!BufferIsValid(buf)) + break; + vacrel->blkno = blkno = BufferGetBlockNumber(buf); + + Assert(iter_result); num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets)); Assert(num_offsets <= lengthof(offsets)); @@ -2704,8 +2747,6 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) visibilitymap_pin(vacrel->rel, blkno, &vmbuffer); /* We need a non-cleanup exclusive lock to mark dead_items unused */ - buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL, - vacrel->bstrategy); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); lazy_vacuum_heap_page(vacrel, blkno, buf, offsets, num_offsets, vmbuffer); @@ -2718,6 +2759,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) RecordPageWithFreeSpace(vacrel->rel, blkno, freespace); vacuumed_pages++; } + + read_stream_end(stream); TidStoreEndIterate(iter); vacrel->blkno = InvalidBlockNumber; -- 2.34.1 From 6abfb26af7cd2ff55c9b13cadcef2d56b6460a28 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 1
Re: Assignment before assert
> On 13 Feb 2025, at 18:08, Dmitry Koval wrote: > > Hi! > Function EvalPlanQualFetchRowMark contains an assignment > > ExecRowMark *erm = earm->rowmark; > > before assert > > Assert(earm != NULL); > > Maybe these lines need to be swapped? That does admittedly look a bit odd; that assertion can't be reached if earm is null since we've already dereferenced it by then. I'll have another look after coffee but something along the lines of your patch looks right (or just remove the Assertion perhaps). -- Daniel Gustafsson
Re: dblink: Add SCRAM pass-through authentication
On Wed, Feb 12, 2025 at 11:54 AM Matheus Alcantara wrote: > Currently dblink_connstr_check and dblink_security_check only check if the > password is present on connection options, in case of not superuser. They also check for GSS delegation. I think SCRAM passthrough should ideally be considered a second form of credentials delegation, from the perspective of those functions. > I added > this logic because the password is not required for SCRAM but I agree with you > that it sounds strange. Maybe these functions could check if the SCRAM is > being used and then skip the password validation if needed internally? As long as the end result is to enforce that the credentials must come from the end user, I think that would be fine in theory. > I also agree that we should enforce the use of the SCRAM on the remote for > safety. To do this I think that we could set require_auth=scram-sha-256 on > connection options if SCRAM pass-through is being used, with this we will get > a > connection error. WYT? We would need to verify that the user mapping can't overwrite that with its own (less trusted) `require_auth` setting. (I think that should be true already, but I'm not 100% sure.) Hardcoding to scram-sha-256 would also prohibit the use of GSS or standard password auth, now that I think about it. The docs currently have a note about being able to choose... Should we add the other permitted authentication types, i.e. `password,md5,scram-sha-256,gss`? Or should we prohibit the use of other auth types if you've set use_scram_passthrough? Or maybe there's an easier way to enforce the use of the SCRAM keys, that better matches the current logic in dblink_security_check? > I can create a new patch to fix this on postgres_fdw too once we define the > approach to this here on dblink. Sounds good to me. Thanks, --Jacob
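To make that idea concrete, a hypothetical sketch (not the actual patch) of where such an enforcement could sit, assuming dblink keeps assembling the outgoing connection string in a StringInfo as it does today; use_scram_passthrough is the option discussed above, the rest of the names are stand-ins:

    /*
     * Hypothetical: when SCRAM pass-through is enabled for this connection,
     * require the remote server to authenticate us with SCRAM.  The user
     * mapping must not be allowed to override this with a weaker
     * require_auth setting of its own.
     */
    if (use_scram_passthrough)
        appendStringInfoString(&buf, " require_auth=scram-sha-256");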
Re: describe special values in GUC descriptions more consistently
On Wed, Feb 12, 2025 at 04:10:53PM -0700, David G. Johnston wrote: > I presume it doesn't affect the actual output which just concatenates the > fragments together but the source placement probably should be made > consistent; the line containing the initial default value specification > begins its own quoted fragment. The following violate that convention. Eh, most of the other descriptions with multiple sentences don't do that, so IMHO there's no need for the special values to go in their own fragment. You are correct that there should be no difference in the actual output. > Also, maybe put the rules in the commit message into a comment in the file, > or a README, instead. I added a comment to guc_tables.h in v6. -- nathan >From 747384f1e0fe9f56cd6cf8363c74920b99028a4f Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 12 Feb 2025 14:26:00 -0600 Subject: [PATCH v6 1/1] Describe special values in GUC descriptions more consistently. Many GUCs accept special values like -1 or an empty string to disable the feature, use a system default, etc. While the documentation consistently lists these special values, the GUC descriptions do not. Many such descriptions fail to mention the special values, and those that do vary in phrasing and placement. This commit aims to bring some consistency to this area by applying the following rules: * Special values should be listed at the end of the long description. * Descriptions should use numerals (e.g., "0") instead of words (e.g., "zero"). * Special value mentions should be concise and direct (e.g., "0 disables the timeout.", "An empty string means use the operating system setting."). * Multiple special values should be listed in ascending order. Of course, there are exceptions, such as max_pred_locks_per_relation and search_path, whose special values are too complex to include. And there are cases like listen_addresses, where the meaning of an empty string is arguably too obvious to include. In those cases, I've refrained from adding special value information to the GUC description. Reviewed-by: Peter Smith Reviewed-by: "David G. 
Johnston" Reviewed-by: Daniel Gustafsson Discussion: https://postgr.es/m/Z6aIy4aywxUZHAo6%40nathan --- src/backend/utils/misc/guc_tables.c | 138 ++-- src/include/utils/guc_tables.h | 15 +++ 2 files changed, 84 insertions(+), 69 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 226af43fe23..408bd5969b8 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2119,7 +2119,7 @@ struct config_int ConfigureNamesInt[] = {"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING, gettext_noop("Sets the amount of time to wait before forcing a " "switch to the next WAL file."), - NULL, + gettext_noop("0 disables the timeout."), GUC_UNIT_S }, &XLogArchiveTimeout, @@ -2196,7 +2196,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_pool_size", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of individuals in the population."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_pool_size, @@ -2206,7 +2206,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_generations", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of iterations of the algorithm."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_generations, @@ -2229,7 +2229,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_archive_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }, &max_standby_archive_delay, @@ -2240,7 +2240,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_streaming_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing streamed WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }
Re: Remove an unnecessary argument from execute_extension_script()
On Thu, Feb 13, 2025 at 03:18:51PM -0300, Fabrízio de Royes Mello wrote: > On Thu, Feb 13, 2025 at 1:02 PM Yugo Nagata wrote: >> The attached patch is to remove an unnecessary argument "schemaOid" >> from the static function execute_extension_script(). It might have >> been intended to be used in some way initially, but actually this is >> not used since schemaName is sufficient. > > LGTM. Interesting. This parameter seems to have appeared between v30 [0] and v31 [1] of the original extension patch, and even then it wasn't used. And from a quick skim, I don't see any discussion about it. I'll plan on committing this shortly. [0] https://postgr.es/m/m24o8nhd69.fsf%402ndQuadrant.fr [1] https://postgr.es/m/4171.1297135840%40sss.pgh.pa.us -- nathan
Re: DOCS - inactive_since field readability
On Tue, Feb 11, 2025 at 10:10 PM Amit Kapila wrote: > ... > The change in 0001 looks odd after seeing it in HTML format. We should > either add one empty line between two paragraphs otherwise it doesn't > appear good. Did you see multi-paragraphs in any other column > definitions? > > For the 0002 patch, I find the following changes as improvements, > * > > -Note that for slots on the standby > to > +For standby slots > > * > On standby, this is useful for slots > -that are being synced from a primary server (whose > -synced field is true) > -so they know when the slot stopped being synchronized. > to > This helps standby slots > +track when synchronization was interrupted. > > Other than that, I find the current wording okay. > Hi Amit, thanks for the feedback! // Patch 0001 The pg_replication_slots system view is unusual in that there can be entirely different descriptions of the same field depending on the context, such as: a- for logical slots b- for physical slots c- for primary servers versus standby servers IIUC your 0001 feedback says that a blank line might be ok, but just doing it for 'active_since' and nothing else makes it look odd. So, I have modified the 0001 patch to add blank lines separating those different contexts (e.g. a,b,c) for all fields. I think it improves the readability. In passing. - changed null to NULL - changed true to true - changed false to false // Patch 0002 Modified the text as suggested. == Kind Regards, Peter Smith. Fujitsu Australia v2-0001-DOCS-pg_replication_slots.-Add-blank-lines.patch Description: Binary data v2-0002-DOCS-pg_replication_slots.-Improve-the-active_sin.patch Description: Binary data
Re: pg17.3 PQescapeIdentifier() ignores len
On Thu, Feb 13, 2025 at 02:00:09PM -0500, Tom Lane wrote: > Justin Pryzby writes: >> The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its len. > > Ugh, yes. Need something like the attached. Your patch looks right to me. -- nathan
Re: describe special values in GUC descriptions more consistently
On Thu, Feb 13, 2025 at 3:18 PM Nathan Bossart wrote: > On Wed, Feb 12, 2025 at 04:10:53PM -0700, David G. Johnston wrote: > > I presume it doesn't affect the actual output which just concatenates the > > fragments together but the source placement probably should be made > > consistent; the line containing the initial default value specification > > begins its own quoted fragment. The following violate that convention. > > Eh, most of the other descriptions with multiple sentences don't do that, > Apples and oranges. so IMHO there's no need for the special values to go in their own fragment. > The examples shown look sloppy, IMHO; standing out since the other 50, mostly due to the defaults being the only sentence in the gettext_noop, do line up nicely with either a number or "The empty string" as the lead. the worst offender is: - "lost before a connection is considered dead. A value of 0 uses the " - "system default."), + "lost before a connection is considered dead. 0 " + "means use the system default."), Even if the diff has logic to it - only remove stuff from the first line, only add stuff to the second, it isn't a big enough gain to offset leaving the source ugly, IMHO. David J.
Re: Remove an unnecessary argument from execute_extension_script()
Committed. -- nathan
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Feb 14, 2025 at 5:52 AM Melanie Plageman wrote: > On Thu, Feb 13, 2025 at 11:28 AM Tomas Vondra wrote: > > On 2/13/25 17:01, Melanie Plageman wrote: > > I know it's not changing how much memory we allocate (compared to > > master). I haven't thought about the GinScanEntry - yes, flexible array > > member would make this a bit more complex. > > Oh, I see. I didn't understand Thomas' proposal. I don't know how hard > it would be to make tidbitmap allocate the offsets on-demand. I'd need > to investigate more. But probably not worth it for this patch. No, what I meant was: It would be nice to try to hold only one uncompressed result set in memory at a time, like we achieved in the vacuum patches. The consumer expands them from a tiny object when the associated buffer pops out the other end. That should be possible here too, right, because the bitmap is immutable and long lived, so you should be able to stream (essentially) pointers into its internal guts. The current patch streams the uncompressed data itself, and thus has to reserve space for the maximum possible amount of it, and also forces you to think about fixed sizes. I think if you want "consumer does the expanding" and also "dynamic size" and also "consumer provides memory to avoid palloc churn", then you might need two new functions: "how much memory would I need to expand this thing?" and a "please expand it right here, it has the amount of space you told me!". Then I guess the consumer could keep recycling the same piece of memory, and repalloc() if it's not big enough. Or something like that. Yeah I guess you could in theory also stream pointers to individual uncompressed result objects allocated with palloc(), that is point a point in the per-buffer-data and make the consumer free it, but that has other problems (less locality, allocator churn, need cleanup/destructor mechanism for when the streams is reset or destroyed early, still has lots of uncompressed copies of data in memory *sometimes*) and is not what I was imagining.
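A rough shape of the two calls described above, with hypothetical names (tbm_expansion_size/tbm_expand_into do not exist in tidbitmap.c); this only sketches the consumer-side recycling loop:

    /* Ask how much space the compressed per-buffer result needs. */
    size_t      needed = tbm_expansion_size(compressed);   /* hypothetical */

    if (needed > result_space)
    {
        /* Grow the single reusable buffer only when necessary. */
        result = repalloc(result, needed);
        result_space = needed;
    }

    /* Expand the compressed result into the caller-provided memory. */
    tbm_expand_into(compressed, result, needed);            /* hypothetical */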
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Feb 14, 2025 at 11:50 AM Thomas Munro wrote: > Yeah I guess you could in theory also stream pointers to individual > uncompressed result objects allocated with palloc(), that is point a > point in the per-buffer-data and make the consumer free it, but that > has other problems (less locality, allocator churn, need Sorry, typing too fast, I meant to write: "that is, put a pointer in the per-buffer-data and make the consumer free it".
Re: describe special values in GUC descriptions more consistently
On Thu, Feb 13, 2025 at 03:50:08PM -0700, David G. Johnston wrote: > Even if the diff has logic to it - only remove stuff from the first line, > only add stuff to the second, it isn't a big enough gain to offset leaving > the source ugly, IMHO. Okay, I took your suggestions in v7. -- nathan >From 7526dcc25c992be8e2bc6aed2a030dcdd1d536e2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 12 Feb 2025 14:26:00 -0600 Subject: [PATCH v7 1/1] Describe special values in GUC descriptions more consistently. Many GUCs accept special values like -1 or an empty string to disable the feature, use a system default, etc. While the documentation consistently lists these special values, the GUC descriptions do not. Many such descriptions fail to mention the special values, and those that do vary in phrasing and placement. This commit aims to bring some consistency to this area by applying the following rules: * Special values should be listed at the end of the long description. * Descriptions should use numerals (e.g., "0") instead of words (e.g., "zero"). * Special value mentions should be concise and direct (e.g., "0 disables the timeout.", "An empty string means use the operating system setting."). * Multiple special values should be listed in ascending order. Of course, there are exceptions, such as max_pred_locks_per_relation and search_path, whose special values are too complex to include. And there are cases like listen_addresses, where the meaning of an empty string is arguably too obvious to include. In those cases, I've refrained from adding special value information to the GUC description. Reviewed-by: Peter Smith Reviewed-by: "David G. Johnston" Reviewed-by: Daniel Gustafsson Discussion: https://postgr.es/m/Z6aIy4aywxUZHAo6%40nathan --- src/backend/utils/misc/guc_tables.c | 139 ++-- src/include/utils/guc_tables.h | 15 +++ 2 files changed, 84 insertions(+), 70 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 226af43fe23..42728189322 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2119,7 +2119,7 @@ struct config_int ConfigureNamesInt[] = {"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING, gettext_noop("Sets the amount of time to wait before forcing a " "switch to the next WAL file."), - NULL, + gettext_noop("0 disables the timeout."), GUC_UNIT_S }, &XLogArchiveTimeout, @@ -2196,7 +2196,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_pool_size", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of individuals in the population."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_pool_size, @@ -2206,7 +2206,7 @@ struct config_int ConfigureNamesInt[] = { {"geqo_generations", PGC_USERSET, QUERY_TUNING_GEQO, gettext_noop("GEQO: number of iterations of the algorithm."), - gettext_noop("Zero selects a suitable default value."), + gettext_noop("0 means use a suitable default value."), GUC_EXPLAIN }, &Geqo_generations, @@ -2229,7 +2229,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_archive_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }, &max_standby_archive_delay, @@ -2240,7 +2240,7 @@ struct config_int ConfigureNamesInt[] = { {"max_standby_streaming_delay", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets 
the maximum delay before canceling queries when a hot standby server is processing streamed WAL data."), - NULL, + gettext_noop("-1 means wait forever."), GUC_UNIT_MS }, &max_standby_streaming_delay, @@ -2273,7 +2273,7 @@ struct config_int ConfigureNamesInt[] = { {"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY, gettext_noop("Sets the maximum wait time to receive data from the sending server."), - NULL, + gettext_noop("0 disables the timeout."), GUC_UNIT_MS
Re: [PoC] Federated Authn/z with OAUTHBEARER
> On 13 Feb 2025, at 22:23, Jacob Champion > wrote: > > On Wed, Feb 12, 2025 at 6:55 AM Peter Eisentraut wrote: >> This just depends on how people have built their libcurl, right? >> >> Do we have any information whether the async-dns-free build is a common >> configuration? > > I don't think the annual Curl survey covers that, unfortunately. We should be able to get a decent idea by inspecting the packaging scripts for the major distributions I think. -- Daniel Gustafsson
Re: DOCS - Question about pg_sequences.last_value notes
On Thu, Feb 13, 2025 at 03:59:45PM +1100, Peter Smith wrote: > I noticed the pg_sequences system-view DOCS page [1] has a note about > the 'last_value' field. But the note is not within the row for that > field. Indeed, it is not even within the table. > > Is it deliberate? Apparently, this changed in commit 3cb2f13 [2], so > CC-ing author Nathan. Yes, this was intentional. I felt this was too much information to squeeze into the table, and so I followed the example of pg_user_mappings [0] and placed it below. [0] https://www.postgresql.org/docs/devel/view-pg-user-mappings.html -- nathan
Re: Simplify the logic a bit (src/bin/scripts/reindexdb.c)
On 2025-Feb-13, Ranier Vilela wrote: > Hi. > > Coverity complained about possible dereference null pointer > in *reindex_one_database* function. > That's not really true. > But the logic is unnecessarily complicated. Hmm, this code looks quite suspect, but I wonder if instead of (what looks more or less like) a straight revert of cc0e7ebd304a as you propose, a better fix wouldn't be to split get_parallel_object_list in two: get_parallel_table_list for the DATABASE and SCHEMA cases, and get_parallel_tabidx_list (or whatever) for the INDEX case. In the first case we just return a list of values, but in the latter case we also meddle with the input list which becomes an output list ... -- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/ ¡Ay, ay, ay! Con lo mucho que yo lo quería (bis) se fue de mi vera ... se fue para siempre, pa toíta ... pa toíta la vida ¡Ay Camarón! ¡Ay Camarón! (Paco de Lucía)
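Roughly, the split could end up with prototypes along these lines (the parameter lists are illustrative only, not the actual reindexdb.c signatures):

    /* DATABASE and SCHEMA cases: simply return the list of tables to process. */
    static SimpleStringList *get_parallel_table_list(PGconn *conn,
                                                     ReindexType type,
                                                     SimpleStringList *user_list,
                                                     bool echo);

    /*
     * INDEX case: build the table list for parallel processing and reorder
     * the caller's index list to match it (the input list doubles as an
     * output list, as described above).
     */
    static SimpleStringList *get_parallel_tabidx_list(PGconn *conn,
                                                      SimpleStringList *index_list,
                                                      bool echo);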
Assignment before assert
Hi! Function EvalPlanQualFetchRowMark contains an assignment ExecRowMark *erm = earm->rowmark; before assert Assert(earm != NULL); Maybe these lines need to be swapped? -- With best regards, Dmitry Koval Postgres Professional: http://postgrespro.com diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 39d80ccfbad..963aa390620 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -2662,13 +2662,15 @@ bool EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot) { ExecAuxRowMark *earm = epqstate->relsubs_rowmark[rti - 1]; - ExecRowMark *erm = earm->rowmark; + ExecRowMark *erm; Datum datum; boolisNull; Assert(earm != NULL); Assert(epqstate->origslot != NULL); + erm = earm->rowmark; + if (RowMarkRequiresRowShareLock(erm->markType)) elog(ERROR, "EvalPlanQual doesn't support locking rowmarks");
Re: Proposal - Allow extensions to set a Plan Identifier
Thanks for the feedback! > I think that makes sense and then the idea would be later on to move to > something > like 5fd9dfa5f50, but for the "planId": is my understanding correct? correct. This is adding infrastructure to eventually have an in-core planId; but in the meanwhile give extensions that already compute a planId the ability to broadcast the planId in pg_stat_activity. > > Proposal Overview: > > > > 1/ Add a planId field to PgBackendStatus. > > 2/ Add a planId field to PlannedStmt. > > 3/ Create APIs for extensions to set the current planId. > > Note sure 3 is needed (MyBEEntry is available publicly and PlannedStmt also > through some hooks). But did not look close enough and would probably depends > of the implementation, so let's see. I don't think direct setting of values is a good idea. We will need an API similar to pgstat_report_query_id which ensures we are only reporting top level planIds -and- in the case of multiple extensions with the capability to set a planId, only the first one in the stack wins. pgstat_report_query_id does allow for forcing a queryId ( force flag which is false by default ), and I think this new API should allow the same. > > Storing planId in PlannedStmt ensures it is available with > > cached plans. > > > > One consideration is whether reserving a 64-bit integer in PgBackendStatus > > and PlannedStmt is acceptable, given that planId may not always be used. > > From what I can see PgBackendStatus is currently 432 bytes and PlannedStmt > is 152 bytes. It looks like that adding an uint64 at the end of those structs > would not add extra padding (that's what "ptype /o struct " tells me) > so would increase to 440 bytes and 160 bytes respectively. I don't think > that's > an issue. Correct, thanks for adding this detail. -- Sami
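A sketch of what such a reporting function might look like, modeled on pgstat_report_query_id(); the function name and the st_plan_id field are assumptions here, not committed code:

    void
    pgstat_report_plan_id(uint64 plan_id, bool force)
    {
        volatile PgBackendStatus *beentry = MyBEEntry;

        if (!beentry || !pgstat_track_activities)
            return;

        /*
         * Only report a top-level plan ID: if one is already set, later
         * callers (e.g. a second extension further down the stack) lose,
         * unless they explicitly pass force = true.
         */
        if (beentry->st_plan_id != 0 && !force)
            return;

        PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
        beentry->st_plan_id = plan_id;
        PGSTAT_END_WRITE_ACTIVITY(beentry);
    }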
Re: [PATCH] Optionally record Plan IDs to track plan changes for a query
> I don't understand why we would change any naming here at all. I think > you should be looking at a much broader consensus and plus-ones that a > renaming is needed. -1 from me. The reason for the change is that "query jumble" will no longer make sense if the jumble code can now be used for other types of trees, such as Plan. I do agree that this needs a single-threaded discussion to achieve a consensus. -- Sami
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi, Thomas, there's a bit relevant to you at the bottom. Melanie chatted with me about the performance regressions in Tomas' benchmarks of the patch. I experimented some and I think I found a few interesting pieces. I looked solely at cyclic, wm=4096, matches=8, eic=16 as I wanted to narrow down what I was looking at as much as possible, and as that seemed a significant regression. I was able to reproduce the regression with the patch, although the noise was very high. My first attempt at reducing the noise was using MAP_POPULATE, as I was seeing a lot of time spent in page fault related code, and that help stabilize the output some. After I couldn't really make sense of the different perf characteristics looking at the explain analyze or iostat, so I switched to profiling. As part of my profiling steps, I: - bind postgres to a single core - set the cpu governor to performance - disable CPU idle just for that core (disabling it for all cores sometimes leads to slowness due to less clock boost headroom) - try both with turbo boost on/off With all of those applied, the performance difference almost vanished. Weird, huh! Normally CPUs only clock down when idle. That includes waiting for IO if that wait period is long enough. That made me look more at strace. Which shows this interesting difference between master and the patch: The chosen excerpts are randomly picked, but the pattern is similar throughout execution of the query: master: pread64(52, "\0\0\0\0\260\7|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339271680) = 8192 <0.10> fadvise64(52, 339402752, 8192, POSIX_FADV_WILLNEED) = 0 <0.08> pread64(52, "\0\0\0\0h\10|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339279872) = 8192 <0.11> fadvise64(52, 339410944, 8192, POSIX_FADV_WILLNEED) = 0 <0.08> pread64(52, "\0\0\0\0 \t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339288064) = 8192 <0.11> fadvise64(52, 339419136, 8192, POSIX_FADV_WILLNEED) = 0 <0.11> pread64(52, "\0\0\0\0\330\t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339296256) = 8192 <0.11> fadvise64(52, 339427328, 8192, POSIX_FADV_WILLNEED) = 0 <0.08> With the patch: fadvise64(52, 332365824, 131072, POSIX_FADV_WILLNEED) = 0 <0.21> pread64(52, "\0\0\0\0(\223x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329220096) = 131072 <0.000161> pread64(52, "\0\0\0\0\250\236x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329351168) = 131072 <0.000401> pread64(52, "\0\0\0\0@\252x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 329482240) = 65536 <0.44> pread64(52, "\0\0\0\0\0\250y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332365824) = 131072 <0.79> pread64(52, "\0\0\0\0\200\263y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332496896) = 131072 <0.000336> fadvise64(52, 335781888, 131072, POSIX_FADV_WILLNEED) = 0 <0.21> pread64(52, "\0\0\0\0\0\277y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332627968) = 131072 <0.91> pread64(52, "\0\0\0\0\230\312y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332759040) = 131072 <0.000399> pread64(52, "\0\0\0\0\30\326y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 332890112) = 65536 <0.46> pread64(52, "\0\0\0\0\220\324z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335781888) = 131072 <0.81> pread64(52, 
"\0\0\0\0(\340z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335912960) = 131072 <0.000335> fadvise64(52, 339197952, 131072, POSIX_FADV_WILLNEED) = 0 <0.21> There are a few interesting observations here: - The patch does allow us to make larger reads, nice! - With the patch we do *not* fadvise some of the blocks, the read stream sequentialness logic prevents it. - With the patch there are occasional IOs that are *much* slower. E.g. the last pread64 is a lot slower than the two preceding ones. Which I think explains what's happening. Because we don't fadvise all the time, we have to synchronously wait for some IOs. Because of those slower IOs, the CPU has time to go into an idle state. Slowing the whole query down. If I disable that: diff --git i/src/backend/storage/aio/read_stream.c w/src/backend/storage/aio/read_stream.c index e4414b2e915..e58585c4e02 100644 --- i/src/backend/storage/aio/read_stream.c +++ w/src/backend/storage/aio/read_stream.c @@ -242,8 +242,9 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice) * isn't a strictly sequential pattern, then we'll issue advice. */ if (!suppress_advice && -stream->advice_enabled && -stream->pending_read_blocknum != stream->seq_blocknum) +stream->advice_enabled +// && stream->
Re: pg_attribute_noreturn(), MSVC, C11
On 22.01.25 19:16, Peter Eisentraut wrote: On 06.01.25 15:52, Peter Eisentraut wrote: On 03.01.25 21:51, Dagfinn Ilmari Mannsåker wrote: Peter Eisentraut writes: I suggest we define pg_noreturn as 1. If C11 is supported, then _Noreturn, else 2. If GCC-compatible, then __attribute__((noreturn)), else Would it be worth also checking __has_attribute(noreturn)? Or do all compilers that have __attribute__((noreturn)) claim to be GCC? I don't think that would expand the set of supported compilers in a significant way. We can always add it if we find one, of course. In fact, as another thought, we could even drop #2. Among the GCC- compatible compilers, both GCC and Clang have supported #1 for ages, and the only other candidate I could find on the build farm is the Solaris compiler, which also supports C11 by default, per its documentation. 3. If MSVC, then __declspec((noreturn)) Here is an updated patch set that contains the above small change and fixes some conflicts that have arisen in the meantime. From 8464526cede39d032f52239c5dcdfdbada189691 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 13 Feb 2025 16:01:06 +0100 Subject: [PATCH v2 1/4] pg_noreturn to replace pg_attribute_noreturn() We want to support a "noreturn" decoration on more compilers besides just GCC-compatible ones, but for that we need to move the decoration in front of the function declaration instead of either behind it or wherever, which is the current style afforded by GCC-style attributes. Also rename the macro to "pg_noreturn" to be similar to the C11 standard "noreturn". pg_noreturn is now supported on all compilers that support C11 (using _Noreturn), as well as MSVC (using __declspec). (When PostgreSQL requires C11, the latter variant can be dropped.) (We don't need the previously used variant for GCC-compatible compilers using __attribute__, because all reasonable candidates already support C11, so that variant would be dead code in practice.) Now, all supported compilers effectively support pg_noreturn, so the extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped. This also fixes a possible problem if third-party code includes stdnoreturn.h, because then the current definition of #define pg_attribute_noreturn() __attribute__((noreturn)) would cause an error. Note that the C standard does not support a noreturn attribute on function pointer types. So we have to drop these here. There are only two instances at this time, so it's not a big loss. 
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw --- contrib/dblink/dblink.c | 6 ++-- contrib/pgcrypto/px.h | 2 +- src/backend/access/transam/xlogrecovery.c | 3 +- src/backend/backup/basebackup_incremental.c | 6 ++-- src/backend/postmaster/autovacuum.c | 2 +- src/backend/postmaster/launch_backend.c | 2 +- src/backend/postmaster/postmaster.c | 2 +- src/backend/replication/logical/tablesync.c | 3 +- src/backend/replication/walsender.c | 2 +- src/backend/utils/adt/ri_triggers.c | 8 +++--- src/backend/utils/fmgr/dfmgr.c| 4 +-- src/backend/utils/hash/dynahash.c | 2 +- src/backend/utils/mmgr/slab.c | 2 +- src/bin/pg_combinebackup/load_manifest.c | 6 ++-- src/bin/pg_dump/pg_backup_utils.h | 2 +- src/bin/pg_upgrade/pg_upgrade.h | 2 +- src/bin/pg_verifybackup/pg_verifybackup.c | 6 ++-- src/bin/pg_verifybackup/pg_verifybackup.h | 4 +-- src/bin/pgbench/pgbench.h | 12 src/include/bootstrap/bootstrap.h | 4 +-- src/include/c.h | 28 --- src/include/commands/defrem.h | 2 +- src/include/common/parse_manifest.h | 3 +- src/include/mb/pg_wchar.h | 6 ++-- src/include/parser/parse_relation.h | 6 ++-- src/include/parser/scanner.h | 2 +- src/include/postmaster/autovacuum.h | 4 +-- src/include/postmaster/bgworker_internals.h | 2 +- src/include/postmaster/bgwriter.h | 4 +-- src/include/postmaster/pgarch.h | 2 +- src/include/postmaster/postmaster.h | 4 +-- src/include/postmaster/startup.h | 2 +- src/include/postmaster/syslogger.h| 2 +- src/include/postmaster/walsummarizer.h| 2 +- src/include/postmaster/walwriter.h| 2 +- src/include/replication/slotsync.h| 2 +- src/include/replication/walreceiver.h | 2 +- src/include/replication/walsender_private.h | 2 +- src/include/storage/ipc.h | 2 +- src/include/storage/lock.h| 2 +- src/include/tcop/backend_startup.h| 2 +- src/include/tcop/tcopprot.h | 12 --
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2/10/25 23:31, Melanie Plageman wrote: > On Mon, Feb 10, 2025 at 4:22 PM Melanie Plageman > wrote: >> >> I don't really know what to do about this. The behavior of master >> parallel bitmap heap scan can be emulated with the patch by increasing >> effective_io_concurrency. But, IIRC we didn't want to do that for some >> reason? >> Not only does effective_io_concurrency == 1 negatively affect read >> ahead, but it also prevents read combining regardless of the >> io_combine_limit. > > Would a sensible default be something like 4? > Could be. It would perform much better for most systems I think, but it's still a fairly conservative value so unlikely to cause regressions. If we wanted to validate the value, what tests do you think would be appropriate? What about just running the same tests with different eic values, and picking a reasonable value based on that? One that helps, but before it starts causing regressions. FWIW I don't think we need to do this before pushing the patches, as long as we do both. AFAIK there'll be an "apparent regression" no matter the order of changes. > I did some self-review on the patches and cleaned up the commit > messages etc. Attached is v29. > I reviewed v29 today, and I think it's pretty much ready to go. The one part I don't quite get is 0001, which replaces a FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, but I don't quite see the benefits / clarity. And I think Thomas might be right that we may want to make this dynamic, to save memory. Not a blocker, but I'd probably skip 0001 (unless it's required by the later parts, I haven't checked/tried). Other than that, I think this is ready. regards -- Tomas Vondra
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 13, 2025 at 10:46 AM Tomas Vondra wrote: > > I reviewed v29 today, and I think it's pretty much ready to go. > > The one part where I don't quite get is 0001, which replaces a > FLEXIBLE_ARRAY_MEMBER array with a fixed-length array. It's not wrong, > but I don't quite see the benefits / clarity. And I think Thomas might > be right we may want to make this dynamic, to save memory. > > Not a blocker, but I'd probably skip 0001 (unless it's required by the > later parts, I haven't checked/tried). So, on master, it already pallocs an array of size MAX_TUPLES_PER_PAGE (which is hard-coded in the tidbitmap API to MaxHeapTuplesPerPage) -- see tbm_begin_private_iterate(). So we always palloc the same amount of memory. The reason I changed it from a flexible sized array to a fixed size is that we weren't using the flexibility and having a flexible sized array in the TBMIterateResult meant it couldn't be nested in another struct. Since I have to separate the TBMIterateResult and TBMIterator to implement the read stream API for BHS, once I separate them, I nest the TBMIterateResult in the GinScanEntry. If the array of offsets is flexible sized, then I would have to manage that memory separately in GIN code for the TBMIterateResult.. So, 0001 isn't a change in the amount of memory allocated. With the read stream API, these TBMIterateResults are palloc'd just like we palloc'd one in master. However, we have to have more than one at a time. - Melanie
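The C constraint being worked around here is that a struct whose last member is a flexible array member cannot itself be embedded in another struct (C11 6.7.2.1), which is exactly what nesting the result in GinScanEntry requires. A standalone illustration with simplified names (not the actual tidbitmap.h definitions):

    typedef unsigned int BlockNumber;
    typedef unsigned short OffsetNumber;

    #define MAX_OFFSETS 291             /* stand-in for MaxHeapTuplesPerPage */

    /*
     * Not embeddable: a struct ending in a flexible array member may not be
     * a member of another struct, so the compiler would reject nesting it:
     *
     *     typedef struct { BlockNumber blkno; int ntuples; OffsetNumber offsets[]; } FlexResult;
     *     typedef struct { FlexResult result; } ScanEntry;
     */

    /* Embeddable: with a fixed-size array the nesting is legal. */
    typedef struct FixedResult
    {
        BlockNumber  blkno;
        int          ntuples;
        OffsetNumber offsets[MAX_OFFSETS];
    } FixedResult;

    typedef struct ScanEntry
    {
        FixedResult result;             /* e.g. nested in GinScanEntry */
    } ScanEntry;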
Re: Draft for basic NUMA observability
Hi, On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote: > As I have promised to Andres on the Discord hacking server some time > ago, I'm attaching the very brief (and potentially way too rushed) > draft of the first step into NUMA observability on PostgreSQL that was > based on his presentation [0]. It might be rough, but it is to get us > started. The patches were not really even basically tested, they are > more like input for discussion - rather than solid code - to shake out > what should be the proper form of this. > > Right now it gives: > > postgres=# select numa_zone_id, count(*) from pg_buffercache group by > numa_zone_id; > NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.00 > numa_zone_id | count > --+--- > | 16127 > 6 | 256 > 1 | 1 Thanks for the patch! Not doing a code review but sharing some experimentation. First, I had to: @@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) Sizeos_page_size; void**os_page_ptrs; int *os_pages_status; - int os_page_count; + uint64 os_page_count; and - os_page_count = (NBuffers * BLCKSZ) / os_page_size; + os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size; to make it work with non tiny shared_buffers. Observations: when using 2 sessions: Session 1 first loads buffers (e.g., by querying a relation) and then runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;' Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;' I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear as unmapped when viewed from a process that hasn't accessed them, even though those same pages appear as allocated on a NUMA node in session 1. To double check, I created a function pg_buffercache_pages_from_pid() that is exactly the same as pg_buffercache_pages() (with your patch) except that it takes a pid as input and uses it in move_pages(, …). Let me show the results: In session 1 (that "accessed/loaded" the ~65K buffers): postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177310 0 | 65192 -2 | 378 (3 rows) postgres=# select pg_backend_pid(); pg_backend_pid 1662580 In session 2: postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177301 0 | 85 -2 | 65494 (3 rows) ^ postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177301 0 | 90 -2 | 65489 (3 rows) But when session's 1 pid is used: postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.00 numa_zone_id | count --+- | 5177301 0 | 65195 -2 | 384 (3 rows) Results show: Correct NUMA distribution in session 1 Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid() with the pid of session 1 as a parameter (the session that actually accessed the buffers) Which makes me wondering if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though. 
The initial idea that you shared on discord was to use get_mempolicy() but as Andres stated: " One annoying thing about get_mempolicy() is this: If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread had performed a read (load) access to that address, and return the ID of the node where that page was allocated. Forcing the allocation to happen inside a monitoring function is decidedly not great. " The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p ") while using get_mempolicy(). But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
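A sketch of the gating suggested above, assuming <numaif.h> plus the usual buffer-header definitions from buf_internals.h; the surrounding loop, the values/nulls arrays, and os_page_ptr are hypothetical stand-ins rather than the posted patch:

    uint32      buf_state = pg_atomic_read_u32(&bufHdr->state);

    if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
    {
        int         node = -1;

        /* MPOL_F_NODE | MPOL_F_ADDR asks which node the page resides on. */
        if (get_mempolicy(&node, NULL, 0, os_page_ptr,
                          MPOL_F_NODE | MPOL_F_ADDR) == 0)
            values[i] = Int32GetDatum(node);
        else
            nulls[i] = true;
    }
    else
        nulls[i] = true;                /* don't fault in pages of unused buffers */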
Re: Remove an unnecessary argument from execute_extension_script()
On Thu, Feb 13, 2025 at 1:02 PM Yugo Nagata wrote: > > Hi, > > The attached patch is to remove an unnecessary argument "schemaOid" > from the static function execute_extension_script(). It might have > been intended to be used in some way initially, but actually this is > not used since schemaName is sufficient. > LGTM. -- Fabrízio de Royes Mello
Re: Patch: Log parameter types in detailed query logging
Степан writes: > This patch adds the ability to log the types of parameters used in queries > when detailed query logging is enabled. If there's a patch in what you sent, I'm not seeing it. Looks like a webpage dump or something. However, I suppose that it must add some catalog lookup operations to the logging operations, and I'm feeling resistant to that, because of the added cycles and risk-of-failure. I think you need to make a much stronger case for the value of this information than you've done here. regards, tom lane
Re: explain analyze rows=%.0f
On Thu, Feb 13, 2025 at 4:05 AM Ilia Evdokimov wrote: > 1. Documentation > (v9-0001-Clarify-display-of-rows-as-decimal-fractions-DOC.patch) > > One thing that bothers me is that the documentation explains how to compute > total time, but it does not clarify how to compute total rows. Maybe this was > obvious to others before, but now that we are displaying rows as a fraction, > we should explicitly document how to interpret it alongside total time. > > I believe it would be helpful to show a non-integer rows value in an example > query. However, to achieve this, we need the index scan results to vary > across iterations. One way to accomplish this is by using the condition > t1.unique2 > t2.unique2. Additionally, I suggest making loops a round number > (e.g., 100) for better readability, which can be achieved using t1.thousand < > 10. The final query would look like this: > > EXPLAIN ANALYZE SELECT * > FROM tenk1 t1, tenk2 t2 > WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; > > I believe this is an ideal example for the documentation because it not only > demonstrates fractional rows, but also keeps the execution plan nearly > unchanged. While the estimated and actual average row counts become slightly > rounded, I don't see another way to ensure different results for each index > scan. Anyone else have an opinion on this? I see Ilia's point, but a non-equality join is probably an atypical case. > 2. Code and tests (v9-0002-Clarify-display-of-rows-as-decimal-fractions.patch) > > I left the code and tests unchanged since we agreed on a fixed format of two > decimal places. This still has the HAS_DECIMAL() thing to which I objected. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Get rid of WALBufMappingLock
Hi, Pavel! On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov wrote: > On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov wrote: > > > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov > > wrote: > > > 13.02.2025 12:34, Alexander Korotkov пишет: > > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov > > > > wrote: > > > >> 08.02.2025 13:07, Alexander Korotkov пишет: > > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov > > > >>> wrote: > > > Good, thank you. I think 0001 patch is generally good, but needs > > > some > > > further polishing, e.g. more comments explaining how does it work. > > > >> > > > >> I tried to add more comments. I'm not good at, so recommendations are > > > >> welcome. > > > >> > > > >>> Two things comes to my mind worth rechecking about 0001. > > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and > > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be > > > >>> sensitive to that. If so, I would propose to explicitly comment that > > > >>> and add corresponding asserts. > > > >> > > > >> They're certainly page aligned, since they are page borders. > > > >> I added assert on alignment of InitializeReserved for the sanity. > > > >> > > > >>> 2) Check if there are concurrency issues between > > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file. > > > >> > > > >> There are no issues: > > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses > > > >> GetXLogBuffer to zero-out WAL page. > > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion > > > >> locks, > > > >> so switching wal is not concurrent. (Although, there is no need in this > > > >> exclusiveness, imho.) > > > > > > > > Good, thank you. I've also revised commit message and comments. > > > > > > > > But I see another issue with this patch. In the worst case, we do > > > > XLogWrite() by ourselves, and it could potentially could error out. > > > > Without patch, that would cause WALBufMappingLock be released and > > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would > > > > cause other processes infinitely waiting till we finish the > > > > initialization. > > > > > > > > Possible solution would be to save position of the page to be > > > > initialized, and set it back to XLogCtl->InitializeReserved on error > > > > (everywhere we do LWLockReleaseAll()). We also must check that on > > > > error we only set XLogCtl->InitializeReserved to the past, because > > > > there could be multiple concurrent failures. Also we need to > > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters. > > > > > > The single place where AdvanceXLInsertBuffer is called outside of critical > > > section is in XLogBackgroundFlush. All other call stacks will issue server > > > restart if XLogWrite will raise error inside of AdvanceXLInsertBuffer. > > > > > > XLogBackgroundFlush explicitely avoids writing buffers by passing > > > opportunistic=true parameter. > > > > > > Therefore, error in XLogWrite will not cause hang in AdvanceXLInsertBuffer > > > since server will shutdown/restart. > > > > > > Perhaps, we just need to insert `Assert(CritSectionCount > 0);` before > > > call > > > to XLogWrite here? > > > > You're correct. I just reflected this in the next revision of the patch. > > I've looked at the patchset v6. Oh, sorry, I really did wrong. I've done git format-patch for wrong local branch for v5 and v6. Patches I've sent for v5 and v6 are actually the same as my v1. This is really pity. 
Please find the correct version of the patchset attached. -- Regards, Alexander Korotkov Supabase v7-0001-Get-rid-of-WALBufMappingLock.patch Description: Binary data v7-0002-several-attempts-to-lock-WALInsertLocks.patch Description: Binary data
BackgroundPsql swallowing errors on windows
Hi, One of the remaining tasks for AIO was to convert the tests to be proper tap tests. I did that and was thanked by the tests fairly randomly failing on windows. Which test fails changes from run to run. The symptom is that BackgroundPsql->query() sometimes would simply swallow errors that were clearly generated by the backend. Which then would cause tests to fail, because testing for errors was the whole point of $test. At first I thought the issue was that psql didn't actually flush stderr after displaying errors. And while that may be an issue, it doesn't seem to be the one causing problems for me. Lots of hair-pulling later, I have a somewhat confirmed theory for what's happening: BackgroundPsql::query() tries to detect if the passed in query has executed by adding a "banner" after the query and using pump_until() to wait until that banner has been "reached". That seems to work reasonably well on !windows. On windows however, it looks like there's no guarantee that if stdout has been received by IPC::Run, stderr also has been received, even if the stderr content has been generated first. I tried to add an extra ->pump_nb() call to query(), thinking that maybe IPC::Run just didn't get input that had actually arrived, due to waiting on just one pipe. But no success. My understanding is that IPC::Run uses a proxy process on windows to execute subprocesses and then communicates with that over TCP (or something along those lines). I suspect what's happening is that the communication with the external process allows for reordering between stdout/stderr. And indeed, changing BackgroundPsql::query() to emit the banner on both stdout and stderr and waiting on both seems to fix the issue. One complication is that I found that just waiting for the banner, without also its newline, sometimes lead to unexpected newlines causing later queries to fail. I think that happens if the trailing newline is read separately from the rest of the string. However, matching the newlines caused tests to fail on some machines. After a lot of cursing I figured out that for interactive psql we output \r\n, causing my regex match to fail. I.e. tests failed whenever IO::PTY was availble... It's also not correct, as we did before, to just look for the banner, it has to be anchored to either the start of the output or a newline, otherwise the \echo (or \warn) command itself will be matched by pump_until() (but then the replacing the command would fail). Not sure that could cause active problems without the addition of \warn (which is also echoed on stdout), but it certainly could after. The banner being the same between queries made it hard to understand if a banner that appeared in the output was from the current query or a past query. Therefore I added a counter to it. For debugging I added a "note" that shows stdout/stderr after executing the query, I think it may be worth keeping that, but I'm not sure. This was a rather painful exercise. 
Greetings, Andres Freund >From 1aeb60e0687339b9f7524c3b347915bb53ed1284 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 13 Feb 2025 10:09:49 -0500 Subject: [PATCH v1] BackgroundPsql: Fix potential for lost errors on windows Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- .../perl/PostgreSQL/Test/BackgroundPsql.pm| 55 +++ 1 file changed, 43 insertions(+), 12 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm index dbfd82e4fac..7e76845307d 100644 --- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm +++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm @@ -89,7 +89,8 @@ sub new 'stdin' => '', 'stdout' => '', 'stderr' => '', - 'query_timer_restart' => undef + 'query_timer_restart' => undef, + 'query_cnt' => 1, }; my $run; @@ -219,27 +220,57 @@ sub query my ($self, $query) = @_; my $ret; my $output; + my $query_cnt = $self->{query_cnt}++; + local $Test::Builder::Level = $Test::Builder::Level + 1; - note "issuing query via background psql: $query"; + note "issuing query $query_cnt via background psql: $query"; $self->{timeout}->start() if (defined($self->{query_timer_restart})); # Feed the query to psql's stdin, followed by \n (so psql processes the # line), by a ; (so that psql issues the query, if it doesn't include a ; - # itself), and a separator echoed with \echo, that we can wait on. - my $banner = "background_psql: QUERY_SEPARATOR"; - $self->{stdin} .= "$query\n;\n\\echo $banner\n"; - - pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, qr/$banner/); + # itself), and a separator echoed both with \echo and \warn, that we can + # wait on. + # + # To avoid somehow confusing the separator from separately issued queries, + # and to make it easier to debug, we include a per-psql query counter in + # the separator. + # + # We need both \echo (printing to stdout) and \warn (printing to stderr), + # because
Re: pg17.3 PQescapeIdentifier() ignores len
Justin Pryzby writes: > The fprintf suggests that since 5dc1e42b4 PQescapeIdentifier ignores its len. Ugh, yes. Need something like the attached. FTR, 5dc1e42b4 et al were quite subtle patches done under extreme time pressure. I wonder if they have any other issues. More eyes on those patches would be welcome, now that they are public. regards, tom lane diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index e97ad02542..120d4d032e 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -4224,7 +4224,7 @@ PQescapeInternal(PGconn *conn, const char *str, size_t len, bool as_ident) char *rp; int num_quotes = 0; /* single or double, depending on as_ident */ int num_backslashes = 0; - size_t input_len = strlen(str); + size_t input_len = strnlen(str, len); size_t result_size; char quote_char = as_ident ? '"' : '\''; bool validated_mb = false; @@ -4274,7 +4274,7 @@ PQescapeInternal(PGconn *conn, const char *str, size_t len, bool as_ident) if (!validated_mb) { if (pg_encoding_verifymbstr(conn->client_encoding, s, remaining) - != strlen(s)) + != remaining) { libpq_append_conn_error(conn, "invalid multibyte character"); return NULL;
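As a reminder of what respecting len means for callers, a small usage sketch (assumes an established PGconn *conn); only the first three bytes of the input should be escaped:

    char   *quoted = PQescapeIdentifier(conn, "abcdef", 3);

    if (quoted == NULL)
        fprintf(stderr, "escaping failed: %s", PQerrorMessage(conn));
    else
    {
        printf("%s\n", quoted);         /* expect "abc" in double quotes, not "abcdef" */
        PQfreemem(quoted);
    }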
Re: Patch: Log parameter types in detailed query logging
On Fri, Feb 14, 2025 at 1:03 AM Tom Lane wrote: > =?UTF-8?B?0KHRgtC10L/QsNC9?= writes: > > This patch adds the ability to log the types of parameters used in > queries > > when detailed query logging is enabled. > > If there's a patch in what you sent, I'm not seeing it. Looks like a > webpage dump or something. > > However, I suppose that it must add some catalog lookup operations to > the logging operations, and I'm feeling resistant to that, because of > the added cycles and risk-of-failure. I think you need to make a much > stronger case for the value of this information than you've done here. > > regards, tom lane > My apologies, it seems I am still experiencing some difficulty in accurately transmitting the code change details. I am re-submitting the diff.patch prepared by Stepan Neretin now, and will ensure the formatting is correct. Regarding your concerns, I understand your hesitation about adding catalog lookups due to potential performance impacts and failure risks. However, I believe that the benefits of including parameter types in detailed query logging will significantly outweigh the costs. Often, when analyzing problematic queries, we lack crucial information about the parameter types used. This lack of information forces us to request details from the client, which adds unnecessary delays and complexity to the debugging process. This patch addresses that directly. Best regards, Stepan Neretin diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c index 3acc7508e71..49df56f0135 100644 --- a/src/backend/nodes/params.c +++ b/src/backend/nodes/params.c @@ -21,11 +21,11 @@ #include "nodes/params.h" #include "parser/parse_node.h" #include "storage/shmem.h" +#include "utils/builtins.h" #include "utils/datum.h" #include "utils/lsyscache.h" #include "utils/memutils.h" - static void paramlist_parser_setup(ParseState *pstate, void *arg); static Node *paramlist_param_ref(ParseState *pstate, ParamRef *pref); @@ -367,9 +367,10 @@ BuildParamLogString(ParamListInfo params, char **knownTextValues, int maxlen) ParamExternData *param = ¶ms->params[paramno]; appendStringInfo(&buf, - "%s$%d = ", + "%s$%d = (%s)", paramno > 0 ? ", " : "", - paramno + 1); + paramno + 1, + format_type_extended(param->ptype, -1, FORMAT_TYPE_ALLOW_INVALID)); if (param->isnull || !OidIsValid(param->ptype)) appendStringInfoString(&buf, "NULL");
Re: pg_stat_statements and "IN" conditions
> On Thu, Feb 13, 2025 at 01:47:01PM GMT, Álvaro Herrera wrote: > Also, how wed are you to > "query_id_merge_values" as a name? It's not in any way obvious that > this is about values in arrays. How about query_id_squash_arrays? Or > are you thinking in having values in other types of structures squashed > as well, and that this first patch does it for arrays only but you want > the GUC to also control some future feature? > > (I think I prefer "squash" here as a verb to "merge").

Yeah, when choosing the name I was trying to keep it a bit generic. The high level goal is to reduce repeated non-essential parts, and arrays of constants are one clear scenario, but there could be more to it. Having said that I don't have any particular plans for extending this logic so far. I've ended up with query_id_squash_values, how does this sound?

> I think calling func_volatile potentially once per array element is not > good; this might cause dozens/thousands of identical syscache lookups. > Maybe we can pass an initially NIL list from IsMergeableConstList (as > List **), which IsMergeableConst fills with OIDs of functions that have > been checked and found acceptable. Then the second time around we > search the list first and only do func_volatile() after not finding a > match.

Good point, added.

> Another thing I didn't quite understand is why you did this rather > baroque-looking list scan:

I'm pretty sure there was some reason behind it, but when you pointed it out that reason has promptly vanished in a puff of confusion. Fixed.

From 212f5534fb0f99e5daa74ea4464231faec157a58 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 3 Dec 2024 14:55:45 +0100
Subject: [PATCH v24] Prevent jumbling of every element in ArrayExpr

pg_stat_statements produces multiple entries for queries like

    SELECT something FROM table WHERE col IN (1, 2, 3, ...)

depending on the number of parameters, because every element of ArrayExpr is jumbled. In certain situations it's undesirable, especially if the list becomes too large.

Make an array of Const expressions contribute only the first/last elements to the jumble hash. Allow to enable this behavior via the new pg_stat_statements parameter query_id_squash_values with the default value off.
Reviewed-by: Zhihong Yu, Sergey Dudoladov, Robert Haas, Tom Lane, Michael Paquier, Sergei Kornilov, Alvaro Herrera, David Geier, Sutou Kouhei, Sami Imseih
Tested-by: Chengxi Sun, Yasuo Honda
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/merging.out   | 432 ++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  47 +-
 contrib/pg_stat_statements/sql/merging.sql    | 169 +++
 doc/src/sgml/config.sgml                      |  27 ++
 doc/src/sgml/pgstatstatements.sgml            |  28 +-
 src/backend/nodes/gen_node_support.pl         |  21 +-
 src/backend/nodes/queryjumblefuncs.c          | 165 ++-
 src/backend/postmaster/launch_backend.c       |   3 +
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   2 +-
 src/include/nodes/nodes.h                     |   3 +
 src/include/nodes/primnodes.h                 |   2 +-
 src/include/nodes/queryjumble.h               |   8 +-
 15 files changed, 894 insertions(+), 26 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/merging.out
 create mode 100644 contrib/pg_stat_statements/sql/merging.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..eef8d69cc4 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel cleanup oldextversions merging
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/merging.out b/contrib/pg_stat_statements/expected/merging.out
new file mode 100644
index 00..881174d0ca
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/merging.out
@@ -0,0 +1,432 @@
+--
+-- Const merging functionality
+--
+CREATE EXTENSION pg_stat_statements;
+CREATE TABLE test_merge (id int, data int);
+-- IN queries
+-- No merging is performed, as a baseline result
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9);
+ id | data
+----+------
+(0 rows)
+
+SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
+ id | data
+----+-
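As an aside on the func_volatile() point discussed above, here is a minimal sketch of the OID-memoization pattern Álvaro suggested, with invented names rather than the patch's actual code: the caller passes down a list of already-vetted function OIDs, so the syscache lookup happens at most once per distinct function while walking a long constant list.

/* Sketch only; backend-side code, not taken from the patch. */
#include "postgres.h"

#include "catalog/pg_proc.h"
#include "nodes/pg_list.h"
#include "utils/lsyscache.h"

/*
 * Return true if funcid is immutable, remembering vetted OIDs in
 * *checked_funcs so func_volatile() (a syscache lookup) is not repeated
 * for every element of the expression list being examined.
 */
static bool
func_is_immutable_cached(Oid funcid, List **checked_funcs)
{
	if (list_member_oid(*checked_funcs, funcid))
		return true;			/* vetted earlier while walking this list */

	if (func_volatile(funcid) != PROVOLATILE_IMMUTABLE)
		return false;

	*checked_funcs = lappend_oid(*checked_funcs, funcid);
	return true;
}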
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Tue, Feb 11, 2025 at 11:23 PM Michael Paquier wrote:

> +be_tls_get_peer_subject_name(MyProcPort, lsslstatus.ssl_client_dn, NAMEDATALEN);
> +be_tls_get_peer_serial(MyProcPort, lsslstatus.ssl_client_serial, NAMEDATALEN);
> +be_tls_get_peer_issuer_name(MyProcPort, lsslstatus.ssl_issuer_dn, NAMEDATALEN);

> But are these three actually OK to have showing up in the catalogs > this early? This data comes from a peer certificate, that we may know > nothing about, become set at a very early stage with > secure_initialize().

I guess I'm going to zero in on your definition of "may know nothing about". If that is true, something is very wrong IMO. My understanding of the backend code was that port->peer is only set after OpenSSL has verified that the certificate has been issued by an explicitly trusted CA. (Our verify_cb() doesn't override the internal checks to allow failed certificates through.) That may not be enough to authorize entry into the server, but it also shouldn't be untrusted data.

If a CA is issuing Subject data that is somehow dangerous to the operation of the server, I think that's a security problem in and of itself: there are clientcert HBA modes that don't validate the Subject, but they're still going to push that data into the catalogs, aren't they?

So if we're concerned that Subject data is dangerous at this point in the code, I agree that my patch makes it even more dangerous and I'll modify it -- but then I'm going to split off another thread to try to fix that underlying issue. A user should not have to be authorized to access the server in order for signed authentication data from the transport layer to be considered trustworthy. Being able to monitor those separately is important for auditability.

> As a whole we still have a gap between what could be OK, what could > not be OK, and the fact that pgstat_bestart_security() is called twice > makes that confusing.

My end goal is that all of this _should_ always be OK, so calling it once or twice or thirty times should be safe. (But again, if that's not actually true today, I'll remove it for now.)

> > v8-0003 shows this approach. For the record, I think it's materially > worse than v7-0003. IMO it increases the cognitive load for very > little benefit and makes it more work for a newcomer to refactor the > cleanup code for those routines. I think it's enough that you can see > a separate LOG message for each failure case, if you want to know why > we're unbinding. > > That's more verbose, as well. As Robert said, that makes the output > easier to debug with a 1:1 mapping between the event and a code path.

I agree with Robert's goal in general. I just think that following that rule to *this* extent is counterproductive. But I won't die on that hill; in the end, I just want to be able to see when LDAP calls hang.

Thanks!
--Jacob
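For context on the verify_cb() behavior referenced above: a callback that simply returns OpenSSL's own verdict cannot let a failed chain through, so a peer certificate visible after the handshake has already validated against the configured CA. A minimal sketch of that pattern, for illustration only and not the actual be-secure-openssl.c code:

#include <openssl/ssl.h>

/*
 * Non-overriding verify callback: returning "ok" unchanged means a
 * chain-verification failure still aborts the handshake; we merely
 * observe the result, never accept a certificate OpenSSL rejected.
 */
static int
verify_cb(int ok, X509_STORE_CTX *ctx)
{
	return ok;
}

static void
configure_verification(SSL_CTX *ctx)
{
	/* Request a peer certificate and run the callback at each chain step. */
	SSL_CTX_set_verify(ctx, SSL_VERIFY_PEER, verify_cb);
}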